OpenAI o1 is a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
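For readers who want to try the preview model, here is a minimal sketch of calling o1-preview through the OpenAI Python client; the prompt is illustrative, and access depends on your API tier.

```python
# Minimal sketch: querying the o1-preview model via the OpenAI Python SDK.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": "A 3 kg block slides down a frictionless 30-degree incline. "
                       "What is its acceleration?",
        }
    ],
)

# The model's long internal chain of thought is not returned;
# only the final answer is visible to the caller.
print(response.choices[0].message.content)
```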



OpenAI o1's 78.1% score on the GPQA benchmark is ahead of the 67.2% reported for Claude 3.5 Sonnet.


Claude 3.5 Sonnet by Anthropic achieved 59.4% zero-shot chain-of-thought accuracy on GPQA, leading the leaderboard as of June 26, 2024. Anthropic also reported scores as high as 67.2% with other prompting methods, exceeding the average score of human experts with PhDs in the corresponding domains.
GPT-4o (0513) by OpenAI scored 53.6% in zero-shot chain-of-thought accuracy.
Claude 3 Opus by Anthropic scored 50.4% in the same evaluation.
Grok-2 by xAI does not have a GPQA score cited in the information provided here; its performance on related benchmarks suggests it would be competitive, but the exact figure is not available.
Google's Gemini models, such as Gemini Ultra, are likewise not scored on GPQA in the excerpts; Google has highlighted state-of-the-art results on various benchmarks, which suggests a strong showing, but no specific GPQA score is given.

Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technology and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting-edge technologies, he is currently a Co-Founder of a startup and fundraiser for high-potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker and guest at numerous interviews for radio and podcasts. He is open to public speaking and advising engagements.
If it really is being trained on science, it’s going to get harder to force it to stick to ideological “scientific” findings.
The blue biases should diminish as models increase accuracy. Red perspectives were chased out of science years ago, so no need to worry about them.
Are you talking about politics, or science? “Red perspectives” were chased out of “science” years ago? Really? The nature of science is to ask questions, listen to the answers, and ask more questions. This is how science works. The questions in science never end. In politics, the questions often never even begin. This is dangerous and, frankly, quite stupid. The consequences of being “stupid by choice” can be disastrous. Hope your bunker is well stocked. Mine is. And I hope I never have to live in mine. Though it is quite comfortable. But if I have to retreat to mine, the s*** really has hit the fan. This is the definition of a nightmare. One you may not wake up from. Very, very scary.
So, the vertical scale is accuracy, and the horizontal scale is compute time/power. Only the compute time/power scale is logarithmic, and the graph seems to be a straight line trend.
This implies that increasing accuracy requires exponentially increased compute time/power.
Accuracy during training required roughly a hundredfold increase in compute to go from 30% to 70%. Accuracy at test time went from 20% to about 80% over that same hundredfold increase in compute. But there’s a hint of a sigmoid in that curve…
So they’re probably going to need another hundredfold increase in compute power to max out the model.
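A rough way to see what that log-linear trend implies: fit accuracy against log10(compute) using the approximate test-time endpoints quoted above (rough readings off the published chart, not exact values) and extrapolate another hundredfold.

```python
# Back-of-envelope: if accuracy rises linearly with log10(compute),
# each fixed gain in accuracy costs a constant *multiple* of compute.
# Endpoint values are rough readings from the published o1 scaling chart.
import numpy as np

log_compute = np.array([0.0, 2.0])   # log10 of relative test-time compute
accuracy = np.array([0.20, 0.80])    # accuracy at those two points

slope, intercept = np.polyfit(log_compute, accuracy, 1)  # ~0.30 per decade

for decades in range(5):
    predicted = intercept + slope * decades
    print(f"{10**decades:>6}x compute -> ~{min(predicted, 1.0):.0%} accuracy")
# 1x -> 20%, 10x -> 50%, 100x -> 80%, then the straight line overshoots 100%
# within another decade or two, which is why the curve has to bend into a
# sigmoid near the top.
```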
A quick search says GPT-3 took about 1.3 GWh to train. So let’s say training for full performance requires 130 GWh. That’s an average of about 15 MW for a year. Hm, that’s equivalent to the electricity use of a city of about 12,000 people.
Man, these things are power hungry. The first task to set them on is inventing more energy-efficient computers, I guess.
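The arithmetic behind that estimate, with the assumed inputs spelled out (the 1.3 GWh figure for GPT-3 and the per-person consumption number are rough public estimates, not measured values):

```python
# Rough energy arithmetic for the "hundredfold more compute" scenario.
# All inputs are assumptions from the comment above, not measured values.
gpt3_training_energy_gwh = 1.3        # widely cited rough estimate for GPT-3
scale_up = 100                        # another hundredfold increase in compute
full_run_gwh = gpt3_training_energy_gwh * scale_up        # 130 GWh

hours_per_year = 8760
average_power_mw = full_run_gwh * 1000 / hours_per_year
print(f"Average draw over a year: {average_power_mw:.1f} MW")   # ~14.8 MW

# ~12 MWh per person per year is a rough figure for total US electricity
# use divided by population; it puts the equivalent "city" near 10-12k people.
per_person_mwh = 12.0
people = full_run_gwh * 1000 / per_person_mwh
print(f"Equivalent population: ~{people:,.0f} people")          # ~10,800
```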
Sorry, this is out of place. Not sure how to communicate it otherwise.
This seems worth an NBF post: https://www.wsj.com/science/greenland-tsunami-global-seismic-vibrations-afadd21f?mod=hp_trending_now_article_pos4
One of the more interesting linguistic consequences we’ve seen with earlier computer-generated language models is how to deal with “illogical but appropriate answers to illogical questions”. Most of us (humans) deal with this with deflection. Our “answer” changes the nature/intent of the question we’re asked. Since the question may not make sense to the person hearing it, our answer is “shaped” to what we understand. This may not sound logical, but honestly, each of us does this all the time. It’s how we deal with “intangibles”.
We fill in the blanks when we don’t perceive enough to see a pattern. We all hate blank space. I’d love to see how AI deals with such subtle nuances of language. It gets more interesting when you translate the meaning from one language to another. As a great language teacher taught me: “Don’t translate the words of your language into another. THINK in that other, now ‘your’, language.” Once you do that, you don’t need to translate anything at all. It’s you. I want to see how AI “feels” that.
I don’t think that will be difficult at all, because you have a great number of examples of such patterns in the internet data. We know that LLMs are good at repeating language patterns, even when the model does not have an internal model of the physical world.
Thanks Robert.
I am hoping, deep down, that the industry is not trying to create an ‘artificial human’ so much as artificial intelligence. We are flawed and should delight in those incongruities, but for true AI to contribute to civilization we may need to treat its unhuman attributes as critical: indifference to its own self-preservation, pursuing all solutions rather than only those specified by the human task-master (provided efficiencies are not lost), creating opportunities for labor augmentation rather than straight-up replacement, etc.
I, for one, believe that where and how AI is utilized will be very political, with many factors akin to political disagreements about unions, environment, immigration, etc. – a question of how we want our society to be, rather than just efficiency-substitution tools. My 2c.
I’d say that we should hew to the side of humanness, if only because it makes sure that our moral intuitions are not discounted. Our genetics contain iterated game theory for as long as our ancestors have been a social species. An intelligence greater than our own, not bound by that inheritance, could be very scary indeed.