How Much Data to Make a Human General Intelligence NPC?

Geordie Rose and Suzanne Gildert of Sanctuary AI, one of the world's leading companies working on humanoid robotics and AGI (artificial general intelligence), have an interesting thought experiment about how much data it would take to generate a human-like general intelligence. However, this would be an NPC (non-player character) style general intelligence: it would have no explicit internal world model. It would be like a perfect LLM (large language model) for text, one that could perfectly predict the next word, so it would appear to have intelligence and the results would be intelligent.

Jeff Hawkins, an AI giant, had a theory that our brains work by constantly predicting the future. He laid this out in his book "On Intelligence".

If this is true, then when a robot moves through the world it has to be able to predict what will happen next, even if today that prediction is handled in a way where the people programming the robots are actually doing all the hard work.

We are talking about the future because movement is about the future. It’s about where you will be in space.

A lot of sensory input and response is pattern prediction.

Putting these models into practice requires a motor layer.

A data-driven approach to humanoid robot motion control would be to take a robot's experiential data, all of its senses, all of its joint positions, and all of the information about where it is in the world, and build models on that data.

AI researchers tried this in the early days and it did not work very well. Flash forward to modern times, however, and large language models have become good future predictors, using text data alone for text prediction.

The technology underlying these large language models predicts the future of text based solely on text. It does not rely on hand-built priors. Older approaches to text AI parsed sentences into verbs and nouns and analyzed them; LLMs just dump a huge bucket of text into training.

What if we captured video frames of your entire life? As an approximation: roughly 32 million seconds per year × 30 frames per second × 70 years. That is about a billion frames per year for 70 years, or roughly 70 billion frames.
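
As a quick back-of-the-envelope check (an editorial sketch using the rounded figures above, not a calculation from Sanctuary AI), the frame count works out like this:

```python
# Rough check of the lifetime frame count, using the rounded figures from the text.
seconds_per_year = 32_000_000   # ~31.5 million seconds in a year, rounded up
frames_per_second = 30
years = 70

frames_per_year = seconds_per_year * frames_per_second   # ~0.96 billion, call it a billion
lifetime_frames = frames_per_year * years                # ~67 billion, call it 70 billion

print(f"frames per year: {frames_per_year:,}")
print(f"lifetime frames: {lifetime_frames:,}")
```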

The frame is not just a video snapshot of your visual perception. Geordie defines a life frame as a snapshot of all the data coming in: sight, touch, hearing, everything. It includes where your limbs are, which is commonly called proprioception. It includes everything you can feel in your body, even what you feel in your digestive system. It is a snapshot of your experiential immersion in the world, taken 30 times a second.
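
As a sketch of what one of these multi-sensory frames might look like as a data record (the field names and sizes here are illustrative assumptions, not a published Sanctuary AI format):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LifeFrame:
    """One 1/30-second snapshot of experiential data (illustrative fields and sizes)."""
    vision: np.ndarray          # camera pixels
    audio: np.ndarray           # a short window of sound
    touch: np.ndarray           # tactile sensor readings
    proprioception: np.ndarray  # joint angles / limb positions
    interoception: np.ndarray   # internal-body signals, e.g. from the gut

# A dummy frame just to show the shape of the record:
frame = LifeFrame(
    vision=np.zeros((64, 64), dtype=np.float32),
    audio=np.zeros(1470, dtype=np.float32),        # ~1/30 s of 44.1 kHz audio
    touch=np.zeros(32, dtype=np.float32),
    proprioception=np.zeros(48, dtype=np.float32),
    interoception=np.zeros(8, dtype=np.float32),
)
```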

How much data is in a frame?

It is a tough question. It is hard because the more tokens you use, the more fidelity you retain from the underlying raw data. You can tokenize at different levels of detail, so as the designer of this data-stream input you have to decide how to tokenize. It is a bit like deciding what compression level to use when saving a JPEG image.
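
To make that trade-off concrete, here is a toy tokenizer for a single sensory channel (the patch sizes and quantization levels are arbitrary choices for illustration): coarser patches and fewer quantization levels mean fewer tokens per frame and less fidelity.

```python
import numpy as np

rng = np.random.default_rng(0)
channel = rng.random((64, 64)).astype(np.float32)   # one toy sensory channel

def tokenize(x: np.ndarray, patch: int, levels: int) -> np.ndarray:
    """Average each patch x patch block, then quantize the result into `levels` bins."""
    h, w = x.shape
    blocks = x.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return np.floor(blocks * levels).astype(np.int32).ravel()

for patch, levels in [(4, 256), (8, 64), (16, 16)]:
    tokens = tokenize(channel, patch, levels)
    print(f"patch={patch:2d}, levels={levels:3d} -> {tokens.size:3d} tokens for this channel")
```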

Let's say there are a thousand tokens per frame when we tokenize each of the frames. Then your entire life is on the order of a hundred trillion tokens (70 billion frames × 1,000 tokens is about 70 trillion, rounded up here).

Let's say I wanted to take that data set, assuming I had it, and train a transformer-like prediction model on all of that data, which is a sequence in time. If I give the model any set of frames as a prompt, it just predicts the next few frames. If I were a body running this model, predicting a frame means I hallucinate what is about to happen and then I do it: I simply act based on my prediction.
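
A minimal sketch of that predict-then-act loop, with a stand-in for the trained model (nothing here is Sanctuary AI's actual architecture; the predictor is a dummy that echoes the previous frame with a little noise):

```python
import numpy as np

TOKENS_PER_FRAME = 1000   # the 1,000-tokens-per-frame assumption from the text
CONTEXT_FRAMES = 8        # hypothetical context window, measured in frames

rng = np.random.default_rng(1)

def predict_next_frame(context: np.ndarray) -> np.ndarray:
    """Stand-in for the trained sequence model: echo the last frame plus noise.
    A real system would use a learned transformer-style predictor here."""
    last = context[-1]
    return np.clip(last + rng.normal(scale=0.01, size=last.shape), 0.0, 1.0)

def act(frame: np.ndarray) -> None:
    """Stand-in for the motor layer: send the predicted proprioceptive slice of
    the frame to the actuators (here we just print a summary)."""
    print(f"acting on predicted frame, mean value {frame.mean():.3f}")

# The body "hallucinates" the next frame and then performs it, over and over.
context = rng.random((CONTEXT_FRAMES, TOKENS_PER_FRAME)).astype(np.float32)
for _ in range(3):
    next_frame = predict_next_frame(context)
    act(next_frame)
    context = np.concatenate([context[1:], next_frame[None]], axis=0)  # slide the window
```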

There is a general rule of thumb in machine learning that you want roughly 10 times more training data than parameters. If I have a hundred trillion tokens in my life data set, that supports about a 10-trillion-parameter model.

The human brain has roughly 100 trillion synapses. It is striking that the numbers are even close. We get the same general estimate: one lifetime gives a 10-trillion-parameter model, and 10 lifetimes would match the size of the human brain. Capturing 10 lifetimes, rounding a lifetime up to a century, would mean roughly a thousand years of experiential data.
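
Putting the rounded numbers together (again an editorial sketch; the exact product of 70 billion frames × 1,000 tokens is about 70 trillion, rounded up here to the 100 trillion used in the argument):

```python
# Tokens -> parameters -> lifetimes, using the rounded figures from the text.
lifetime_tokens = 100e12        # ~100 trillion tokens for one recorded life
tokens_per_parameter = 10       # rule of thumb: ~10 training tokens per parameter
params_one_life = lifetime_tokens / tokens_per_parameter   # 10 trillion parameters

brain_synapses = 100e12         # ~100 trillion synapses in a human brain
lifetimes_for_brain = brain_synapses / params_one_life     # ~10 lifetimes

print(f"one-lifetime model: {params_one_life:.0e} parameters")
print(f"lifetimes of data to reach brain scale: {lifetimes_for_brain:.0f}")
```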

This would be enough data to train a model roughly the size of the human brain, a model with no priors at all, nothing except the data, trained purely on the experiential data of a person.

A robot running this type of model would behave just like a person, but all it would be doing is predicting the next frame. It would have no explicit inner world model and no explicit understanding of anything.

Nextbigfuture notes that this robot would be the perfect NPC.

6 thoughts on "How Much Data to Make a Human General Intelligence NPC?"

  1. Reductio ad absurdum:

    As Gregory Chaitin says “(lossless) compression is comprehension.”

    If you genuinely comprehend the contents of Wikipedia — meaning you can integrate it into the most coherent model of the universe generating its contents — you'll have sufficient data* to perform any planning and reach any objective that a human-level intelligence NPC could. The problem is that generating that model means losslessly compressing Wikipedia better than a human-level intelligence could.

    This, by the way, is how you know that the big bucks in government and/or philanthropy don't understand AI. Otherwise they'd be backing the Large Text Compression Benchmark with prize awards as well as the Hutter Prize. The LTCB has no restrictions on compute resources, so it is about finding the best model of knowledge. The Hutter Prize's restriction on compute resources is there to overcome "The Hardware Lottery" barrier to machine learning research. The former is about getting the model of knowledge; the latter is about advancing the state of the art in getting the model of knowledge. Two easily confused but different objectives.

  2. You’re familiar with Shannon information, right? The information content of something is the portion that you CAN’T predict based on what you already know.

    From a Shannon information standpoint, those frames of yours are practically information-free. Sure, you can say there's a thousand tokens per frame, but for the most part they're unchanged from the prior frame!

    • Like Venter's minimum viable cell, a minimum viable agent model, an empty, aware vessel, trained on the most common data sets of existence, could be "minted" in any number.

      A blank, but cognizant, slate.

      Then, trained to a task as required.

      Probably in a simulation to speed training.

      Or, connected through Neuralink to One's conscious thoughts, residing on a marble implanted behind their breast bone, it could be another "room" for One's thoughts. An inner voice of wisdom.

      A trusted confidante.

      A mind/self back-up for when the meat starts falling off One’s bones.

      Insert immortality sci-fi plots here.

    • Sure, but he did sort of take that into account when positing 1000 tokens per frame. Assuming an image size around 4kx4kx2bytes, and assuming 2 byte tokens, that’s a 16000:1 compression.

      I do figure that a human "NPC" could really be done with a model at least 2 orders of magnitude smaller: one order because you're already pretty human at 7 years old (vs 70 in his estimates), another because human perception (what we actually focus on and notice from our senses) and memory (what we retain even short term) are fuzzier than his analysis implies. Possibly even four orders of magnitude fewer parameters – 1 billion parameters.
