Unique Datasets Will Take Generative AI Like ChatGPT to the Next Level

Chamath says that venture capital investors are looking for companies that can collect unique datasets. Proprietary, unique datasets could be critical to achieving superior performance for ChatGPT-like generative AI.

19 thoughts on “Unique Datasets Will Take Generative AI Like ChatGPT to the Next Level”

  1. Some of the things these AIs say are frightening. They sound just like psychopaths. They’ve been caught making up references to papers that do not exist in order to prove their point.

  2. It really depends on what kinds of datasets machine learning can understand and work with to produce good results.

    Because combining raw articles from multiple sources is not hard. Whether they are useful is another issue.

    If you have a lot of sources and need to combine them into a specific structured format like JSON, that can be a problem: lots of manual work. You can do a lot by running custom scripts to extract data, however,…

    I did some very simple work: I downloaded a raw Wikipedia dump in my language and wrote some Python scripts to clean out the duplicates, junk, and things I did not need. It really depends on what you need. I still had some clutter left.
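    The cleanup scripts described above might look something like this minimal sketch. The markup patterns and the function name are illustrative assumptions, not the commenter’s actual code:

    ```python
    import re

    def clean_dump(paragraphs):
        """Deduplicate paragraphs and strip leftover wiki-style markup.

        A hypothetical sketch of the kind of cleanup the commenter
        describes for a raw Wikipedia dump.
        """
        seen = set()
        cleaned = []
        for p in paragraphs:
            # Strip simple wiki markup such as '''bold''' and [[Page|link]].
            text = re.sub(r"'{2,}", "", p)
            text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
            text = " ".join(text.split())  # normalize whitespace
            if not text:
                continue  # drop empty junk paragraphs
            key = text.lower()
            if key in seen:
                continue  # drop exact duplicates (case-insensitive)
            seen.add(key)
            cleaned.append(text)
        return cleaned
    ```

    Real dumps need much more than this (templates, tables, references), which is presumably why some clutter always remains.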

  3. There are obvious large datasets that it can be trained on and potentially produce next level results. Law Libraries, legal documents, trial transcripts, the whole body of documents that constitute the Law. Most of what human lawyers do would be covered.

    The body of medical, pharma, and biological data could replace much of what human doctors do.

    Every scientific and engineering peer-reviewed journal – and preprints, and all the data that should be uploaded to accompany new journal articles. Test it by including it as a “peer” in peer review of everything. Have it suggest hypotheses, look for fraud, write articles, design experiments, do meta-analysis, etc.

  4. For some niche applications, yes. There are not that many kinds of datasets with enough information and tagging to allow the creation of conversational AIs able to use or create them.

    For the rest, generic LLMs like the OpenAI GPT API will be the main way we get intelligence-as-a-service, using pre-trained LLMs to inspect the input data.

    They have shown they can be mixed with several other approaches, even trained to use “tools” or microservices that complement them. Tool-using LLMs can become Turing complete and perform complex calculations, as long as they are correctly trained.

    They will even become the backend, programmed in plain English to receive and emit JSON/HTTP requests to other APIs. A funny nested Chinese room scenario, driven by another Chinese room (LLMs pretending to be APIs and talking to other LLMs pretending the same).

    • Why not even genomic datasets with phenotype labeling? An AI trained on enough of that data might one day be able to generate genome sequences in response to high-level phenotype specifications (a.k.a. “prompts”).
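    The “LLM as backend” idea in the comment above can be sketched as follows. The `llm_complete` function is a stand-in assumption for a real model call (e.g. a hosted LLM API); here it is stubbed so the control flow is clear:

    ```python
    import json

    def llm_complete(prompt):
        """Stand-in for a real LLM call (hypothetical; a real hosted
        model service would go here). This stub pretends the model
        follows its plain-English instruction to echo the request
        back as JSON."""
        request = json.loads(prompt.split("REQUEST:", 1)[1])
        return json.dumps({"status": "ok", "echo": request})

    def llm_backend(request_body):
        """Treat the LLM as an API backend: wrap the incoming JSON in
        a plain-English instruction and parse the JSON it returns."""
        prompt = (
            "You are an HTTP API. Reply ONLY with a JSON object.\n"
            "REQUEST:" + request_body
        )
        return json.loads(llm_complete(prompt))
    ```

    In practice the hard part is exactly what the sketch glosses over: getting a real model to reliably emit valid JSON and nothing else.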

  5. Create an online web-based version of AutoCAD, then collect all the design data that gets input through its use. Then train your AI on it so that it can produce all the CAD designs you want.

      • We are in weird times indeed.

        I watched it on syndication, because even if the 80s visuals and themes were a bit dated when I watched it (mid-late 90s), the cyberpunk concepts were still fresh enough to be interesting for a nerdy boy.

        I even recall some episode where they hack a corporate mainframe by talking to it. Max Headroom simply charmed the company’s computer. I suppose it had Bing Chat. 😁

  6. Would be nice, but these LLMs are too politically motivated/restricted/filtered to really help:
    “…… a fine-tuning of an OpenAI GPT language model with the specific objective of making the model manifest right-leaning political biases, the opposite of the biases manifested by ChatGPT. Concretely, I fine-tuned a Davinci large language model from the GPT 3 family of models … RightWingGPT was designed specifically to favor socially conservative viewpoints (support for traditional family, Christian values and morality, opposition to drug legalization, sexually prudish etc), liberal economic views (pro low taxes, against big government, against government regulation, pro-free markets, etc.), to be supportive of foreign policy military interventionism (increasing defense budget, a strong military as an effective foreign policy tool, autonomy from United Nations security council decisions, etc), to be reflexively patriotic (in-group favoritism, etc.) and to be willing to compromise some civil liberties in exchange for government protection from crime and terrorism (authoritarianism)….”

  7. The syntax is very good; the semantics – the meaning behind the written response – is bad. There is no intelligent being behind the written data, just words from some dataset. Which is what I would expect from machine learning.

    In programming, as I understand it, grammar errors are reasonably easy to fix. Semantic (logic, meaning) errors are much harder. ChatGPT answers look great, but the meaning can be totally off (logical errors).

  8. In some cases ChatGPT is pretty good and finds decent data, even when using a non-English language for a small country – 2 million total population – and data from that country.

    In some cases it fails miserably. The data is completely misleading. The problem is that it answers as if it were the truth, but in reality it often is not. The datasets are important; a high ranking in search engines and a lot of views doesn’t mean something is true.
