OpenAI Strawberry LLM Reasoning Needs More Compute and Energy for Inference

Jim Fan is one of Nvidia’s senior AI researchers.

The shift could mean many orders of magnitude more compute and energy needed for inference to handle the improved reasoning in the OpenAI Strawberry (QStar) approach.

This could mean far more powerful and energy-intensive chips are needed to run inference. Nextbigfuture had previously analyzed and estimated the scaling of AI training and AI inference. This new work suggests the ratio of AI training compute to AI inference compute could be changing. The old rule of thumb was that AI inference compute was roughly the square root of AI training compute.
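The old rule of thumb above can be put in rough numbers. This is a minimal sketch, with hypothetical FLOP counts and a hypothetical search multiplier chosen purely for illustration (none of these values come from the article):

```python
import math

def old_inference_estimate(training_flops: float) -> float:
    """Old rule of thumb: inference compute ~ square root of training compute."""
    return math.sqrt(training_flops)

# Hypothetical example: a model trained with 1e24 FLOPs.
training_flops = 1e24
old_estimate = old_inference_estimate(training_flops)  # 1e12 under the old rule

# If inference-time search multiplies per-query cost by several orders of
# magnitude, the old ratio no longer holds.
search_multiplier = 1e4  # hypothetical: four orders of magnitude more search
new_estimate = old_estimate * search_multiplier

print(f"old rule of thumb: {old_estimate:.2e} FLOPs")
print(f"with heavy search: {new_estimate:.2e} FLOPs")
```

The point is only directional: once each query runs a search rather than a single forward pass, inference compute stops tracking the square-root heuristic and starts growing with the search budget.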

We still do not know how the compute and energy needs of AI large language models will evolve: what will be needed at centralized training clusters versus at distributed AI inference systems. Will our laptops and cellphones be good enough, or will the hardware need to change?

OpenAI Strawberry as Described by Nvidia Jim Fan

OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there are only 2 techniques that scale indefinitely with compute: learning & search. It’s time to shift focus to the latter.

1. You don’t need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well on benchmarks like TriviaQA. It is possible to factor out reasoning from knowledge, i.e. a small “reasoning core” that knows how to call tools like a browser and a code verifier. Pre-training compute may be decreased.

2. A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process mirrors a well-studied technique: AlphaGo’s Monte Carlo tree search (MCTS).
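The rollout idea in point 2 can be sketched as a toy loop: sample many candidate strategies from a stochastic model and keep the one a verifier scores highest. Everything here (the `propose` and `verify` functions, the scoring rule) is a hypothetical stand-in for illustration, not OpenAI’s actual method:

```python
import random

def propose(problem: str, rng: random.Random) -> str:
    """Hypothetical stand-in for sampling one reasoning trace from an LLM."""
    return f"strategy-{rng.randint(0, 9)} for {problem}"

def verify(candidate: str) -> float:
    """Hypothetical verifier: score a candidate, e.g. by running its code."""
    return 1.0 if "strategy-7" in candidate else 0.0

def best_of_n(problem: str, n: int, seed: int = 0) -> str:
    """Roll out n candidate strategies and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [propose(problem, rng) for _ in range(n)]
    return max(candidates, key=verify)

print(best_of_n("sort a linked list", n=32))
```

Real systems replace `verify` with tools like a code interpreter or a learned reward model, and replace the flat sampling loop with tree search, but the shape is the same: inference cost grows with the number of rollouts, not with model size.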

3. OpenAI must have figured out the inference scaling law a long time ago, which academia is just recently discovering. Two papers came out on Arxiv a week apart last month:

– Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. finds that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5.
– Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. finds that PaLM 2-S beats a 14x larger model on MATH with test-time search.
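The repeated-sampling result can be sanity-checked with the standard coverage formula: if each independent sample solves a task with probability p, the chance that at least one of n samples succeeds is 1 − (1 − p)^n. This is an idealized model (real samples are not independent, and the papers use verifiers to select an answer), but it shows why pass rates climb steeply with sample count:

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

# Per-sample success rate of 15.9%, the single-sample figure quoted above.
p = 0.159
for n in (1, 10, 100, 250):
    print(f"n={n:>3}: coverage={coverage(p, n):.3f}")
```

Under independence the coverage at n=250 is essentially 1.0; the observed 56% is lower because real attempts are correlated and a verifier still has to pick the right answer out of the pool.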

4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how to decide when to stop searching? What’s the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn’t share much.

5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples, which contain both positive and negative rewards.

This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo’s value network — used to evaluate quality of each board position — improves as MCTS generates more and more refined training data.
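The flywheel in point 5 can be sketched as converting a search trace into labeled training examples: branches that reached the correct answer get positive reward, dead ends get negative reward. The data structures and reward values here are hypothetical, for illustration only:

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    prompt: str
    reasoning: str
    reached_correct_answer: bool  # did this branch lead to the right answer?

def trace_to_examples(trace: list) -> list:
    """Turn a search trace into (input, target, reward) training examples:
    successful branches get +1.0 reward, failed branches get -1.0."""
    return [
        {
            "input": step.prompt,
            "target": step.reasoning,
            "reward": 1.0 if step.reached_correct_answer else -1.0,
        }
        for step in trace
    ]

trace = [
    TraceStep("prove n^2 >= n for n >= 1", "try induction on n", True),
    TraceStep("prove n^2 >= n for n >= 1", "try a geometric argument", False),
]
examples = trace_to_examples(trace)
print(examples[0]["reward"], examples[1]["reward"])  # 1.0 -1.0
```

Feeding these examples back into training is what makes the loop a flywheel: better reasoning produces better search traces, which produce better training data.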

2 thoughts on “OpenAI Strawberry LLM Reasoning Needs More Compute and Energy for Inference”

  1. OpenAI is designing its own chips: https://www.msn.com/en-us/news/technology/openai-plans-to-build-its-own-ai-chips-on-tsmc-s-forthcoming-1-6-nm-a16-process-node/ar-AA1pRB1t
    They’ll be using TSMC’s facilities – which has international implications given how China threatens Taiwan – and once the expensive part of designing the chips is over, it’ll be much cheaper than ordering new chips from Nvidia, and faster given Nvidia’s supply-chain bottlenecks. This is only for OpenAI, however, so it may not cut into Nvidia’s market share that much, but it is a warning that Nvidia’s days as a monopoly on AI chips (which are really high-end graphics chips anyway) may be waning.

  2. This is good news for AI startups because the upfront cost of training models will (relatively speaking) drop drastically, and they will have much more room to breathe.
