Chamath says that venture capital investors are looking for companies that can collect unique datasets. Proprietary, unique datasets could be critical to superior performance for ChatGPT-like generative AI.
Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technologies and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting edge technologies, he is currently a Co-Founder of a startup and fundraiser for high potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker and guest at numerous interviews for radio and podcasts. He is open to public speaking and advising engagements.
Some of the things these AIs say are frightening. They sound just like psychopaths. They’ve been caught making up references to papers that do not exist to prove their point.
It really depends on what kinds of datasets machine learning can understand and work with to produce good results. Combining raw articles from multiple sources is not hard; whether the results are useful is another issue.
If you have a lot of sources and need to combine them into a specific, structured format like JSON, that can be a problem and a lot of manual work. You can do a lot by running custom scripts to extract the data, however,…
I did some very simple work: I downloaded raw Wikipedia in my language and wrote some Python scripts to clean out the duplicates, junk, and things I did not need. It really depends on what you need. Even so, I still had some clutter.
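A minimal sketch of that kind of cleanup pass (the sample data, junk patterns, and dedup strategy here are illustrative assumptions, not the commenter’s actual scripts):

```python
import re

def clean_dump(lines):
    """Drop exact duplicate lines and obvious wiki-markup remnants."""
    seen = set()
    cleaned = []
    # A few junk patterns one might strip from a raw Wikipedia text dump
    # (assumed for illustration; real dumps need more patterns than this).
    junk = re.compile(r"^\s*(\{\{|\}\}|\[\[Category:|<ref|</ref>)")
    for line in lines:
        line = line.strip()
        if not line or junk.search(line):
            continue            # skip empty lines and markup junk
        if line in seen:
            continue            # skip exact duplicates
        seen.add(line)
        cleaned.append(line)
    return cleaned

sample = ["Hello world.", "{{Infobox ...", "Hello world.", "Useful fact."]
print(clean_dump(sample))  # → ['Hello world.', 'Useful fact.']
```

As the comment notes, a pass like this still leaves clutter behind; what counts as “junk” depends entirely on what you need the data for.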
There are obvious large datasets that it can be trained on and potentially produce next level results. Law Libraries, legal documents, trial transcripts, the whole body of documents that constitute the Law. Most of what human lawyers do would be covered.
The body of medical, pharma, and biological data could replace much of what human doctors do.
Every scientific and engineering peer-reviewed journal – plus preprints and all the data they should be uploading to accompany new journal articles. Test it by including it as a “peer” in peer review of everything. Have it suggest hypotheses, look for fraud, write articles, design experiments, do meta-analysis, etc.
For some niche applications, yes. There are not that many kinds of datasets with enough information and tagging to allow the creation of conversational AIs able to use or create them.
For the rest, generic LLMs like OpenAI GPT API will be the main way we will get intelligence-as-a-service, using pre-trained LLMs to inspect the input data.
They have shown they can be mixed with several other approaches, even trained to use “tools” or microservices that complement them. Tool-using LLMs can become Turing complete and perform complex calculations, as long as they are correctly trained.
They will even become the backend, programmed in plain English to receive and spout JSON/HTTP requests to other APIs. A funny nested Chinese-room scenario, driven by another Chinese room (LLMs pretending to be APIs and talking to other AIs pretending the same).
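A toy sketch of that “LLM as backend” idea: a model is instructed in plain English to answer only with JSON, and a thin wrapper treats its reply like an API response. The `llm()` function here is a stand-in stub, not a real model client, and the prompt and schema are assumptions for illustration:

```python
import json

# Plain-English instruction asking the model to behave like a JSON API.
PROMPT_TEMPLATE = (
    "You are an inventory service. Reply ONLY with JSON of the form "
    '{"item": <string>, "in_stock": <bool>}. Request: '
)

def llm(prompt):
    # Stub standing in for a real model call; a real system would send
    # `prompt` to a hosted LLM and return its text completion.
    return '{"item": "widget", "in_stock": true}'

def handle_request(request):
    """Route a natural-language request through the 'LLM backend'."""
    reply = llm(PROMPT_TEMPLATE + request)
    try:
        return json.loads(reply)  # parse the model's JSON reply
    except json.JSONDecodeError:
        return {"error": "model did not return valid JSON"}

print(handle_request("Do we have widgets?"))
# → {'item': 'widget', 'in_stock': True}
```

The fallback branch matters in practice: nothing forces a language model to emit well-formed JSON, so the wrapper has to validate every reply.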
Why not even genomic datasets with phenotype labeling? An AI trained on enough of that data might one day be able to generate genome sequences in response to high-level phenotype specifications (aka “prompts”).
If given a sequence, it could also provide you with a picture of the beast.
Or, crudely draw the beast and submit your picture, and it will spit out a genome sequence to produce it.
Create an online, web-based version of AutoCAD, then collect all the design data that gets input through its use. Then train your AI on it so that it can produce all the CAD designs you want.
I’m certain someone at AutoDesk is working on this plan right now. And I for one am looking forward to an AI home designer.
Some thinking as of August ’22
https://develop3d.com/cad/the-future-of-cad/#:~:text=In%20the%20very%20short%20term,from%20a%203D%20CAD%20model.
The right data set can train an AI to be a psychotic, vindictive bitch.
Max Headroom was a head of his time 😉
We are in weird times indeed.
I watched it on syndication, because even if the 80s visuals and themes were a bit dated when I watched it (mid-late 90s), the cyberpunk concepts were still fresh enough to be interesting for a nerdy boy.
I even recall an episode where they hack a corporate mainframe by talking to it. Max Headroom simply charmed the company’s computer. I suppose it had Bing Chat. 😁
Would be nice, but these LLMs are too politically motivated/restricted/filtered to really help:
“…… a fine-tuning of an OpenAI GPT language model with the specific objective of making the model manifest right-leaning political biases, the opposite of the biases manifested by ChatGPT. Concretely, I fine-tuned a Davinci large language model from the GPT 3 family of models … RightWingGPT was designed specifically to favor socially conservative viewpoints (support for traditional family, Christian values and morality, opposition to drug legalization, sexually prudish etc), liberal economic views (pro low taxes, against big government, against government regulation, pro-free markets, etc.), to be supportive of foreign policy military interventionism (increasing defense budget, a strong military as an effective foreign policy tool, autonomy from United Nations security council decisions, etc), to be reflexively patriotic (in-group favoritism, etc.) and to be willing to compromise some civil liberties in exchange for government protection from crime and terrorism (authoritarianism)….”
Link below the line:
…specific combination of viewpoints was selected for RightWingGPT to be roughly a mirror image of ChatGPT’s previously documented biases:
https://davidrozado.substack.com/p/political-bias-chatgpt
Well, there is at least one LLM which is politically motivated, then.
At the end of the day, the proliferation of such models (pace OpenAI’s attempts to regulate the creation of these kinds of AIs) is what will allow us to avoid bias. If we are restricted to MSFT and GOOG offerings, perhaps with a sprinkling of AMZN here or there, then there is no way to avoid a SoCal bias from infecting our society even more than it already is.
Here is an interesting interview done by Vox with the heads of OpenAI at the time they pivoted to for-profit in 2019:
https://www.vox.com/future-perfect/2019/4/17/18301070/openai-greg-brockman-ilya-sutskever
I’m getting some large techno-priesthood vibes from that one, really.
The syntax is very good; the semantics – the meaning behind the written response – is bad. There is no intelligent being behind the written data, just words from some data set. Which is what I would expect from machine learning.
In programming, as I understand it, the grammar errors are reasonably easy to fix. The semantic (logic, meaning) errors are way harder. ChatGPT answers look great, but the meaning can be totally off (logical errors).
In some cases ChatGPT is pretty good and finds decent data, even when using a non-English language from a small country – 2 million total population – and using data from that country.
In some cases it fails miserably, and the data is completely misleading. The problem is that it answers as if it were the truth, but in reality it often is not. The data sets are important; high rankings in a search engine and lots of views don’t mean something is true.