Cogito v2 – Inference-Time Search and a New Approach to AI Self-Improvement

The largest Cogito v2 model, a 671B MoE, is among the strongest open models in the world. It matches or exceeds the performance of both the latest DeepSeek v3 and DeepSeek R1 models, and approaches closed frontier models such as o3 and Claude 4 Opus.

Deep Cogito extends its work on building superintelligence with Iterated Distillation and Amplification (IDA) by scaling the model's intelligence prior. The model internalizes the reasoning process through iterative policy improvement, rather than simply searching longer at inference time.

This is a novel scaling paradigm in which models develop more "intuition", and it serves as a strong proof of concept for self-improvement (AI systems improving themselves).

Because the Cogito models develop a better intuition for which trajectory to take while searching, their reasoning chains are 60% shorter than those of DeepSeek R1.

Contrary to the accepted belief that such technical innovations require capital-intensive infrastructure, this approach is also significantly more efficient: the Cogito models were trained for a combined cost of under $3.5M. Deep Cogito plans to extend the gains of iterative self-improvement to build superintelligence, and all models it creates will be open sourced.

The Cogito v2 series comprises four open-source models: a 70B dense model, a 109B mixture-of-experts (MoE) model, a 405B dense model, and a 671B MoE model. The focus was on enhancing intuition in both reasoning and non-reasoning modes. Specific training compute figures, in floating-point operations (FLOPs) or GPU hours, have not been publicly disclosed. The combined training cost for eight models across the Cogito lineup (spanning 3B to 671B parameters, including v2 and prior variants) was less than $3.5 million, encompassing synthetic and human data generation as well as over 1,000 training experiments. In other words, this was a low-cost improvement over the base models.

How Deep Cogito Did It
Building superintelligence is fundamentally a tractable machine learning problem.

Their approach to building superintelligence has two steps:
Step 1 – Develop a scalable training recipe for unbounded iterative intelligence improvements
Step 2 – Use more compute to scale their efforts and iteratively improve intelligence to go beyond human performance

They are working on techniques that can reliably supervise AI systems much smarter than humans.

Earlier this year, they released the Cogito v1 models and described Iterated Distillation and Amplification (IDA) as a promising research direction for general superintelligence. Their main focus was to provide a training signal that is not upper-bounded by the overseer's intelligence.

They focused on self-improvement via distillation.

Continuous Improvement Towards Superintelligence

Superhuman performance has been achieved in multiple narrow domains (chess, Go, and poker) via the same two-step loop:

Inference-time reasoning – which spends compute to search for a solution
Iterative policy improvement – which distills the discoveries of that search back into the model’s parameters. As a result, the next search starts closer to the goal

AlphaGo exemplifies the pattern: Monte-Carlo Tree Search (MCTS) generates an improved policy from each game, and the policy-and-value network is retrained on those visit counts.
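The two-step loop above can be sketched on a toy problem. This is a hypothetical illustration, not Deep Cogito's (or AlphaGo's) actual training code: a softmax policy over three moves is "amplified" by a reward-weighted sampling search, and then "distilled" by nudging the policy's logits toward the search's visit distribution, so each new search starts from a stronger prior.

```python
import math
import random

random.seed(0)

REWARDS = [0.1, 0.9, 0.3]  # hidden per-move payoffs (toy stand-in for game outcomes)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def amplify(policy, n_sims=200):
    """Inference-time search: sample moves from the current prior and keep
    reward-weighted visit counts. Returns an improved target distribution."""
    counts = [1e-9] * len(policy)
    for _ in range(n_sims):
        move = random.choices(range(len(policy)), weights=policy)[0]
        counts[move] += REWARDS[move]
    total = sum(counts)
    return [c / total for c in counts]

def distill(logits, target, lr=1.0):
    """Policy improvement: move the prior toward the search target.
    (The cross-entropy gradient w.r.t. softmax logits is p - target.)"""
    p = softmax(logits)
    return [l - lr * (pi - ti) for l, pi, ti in zip(logits, p, target)]

logits = [0.0, 0.0, 0.0]  # uniform prior
for step in range(20):    # iterate: search, then distill the search back in
    target = amplify(softmax(logits))
    logits = distill(logits, target)

policy = softmax(logits)
print([round(p, 3) for p in policy])  # the prior now concentrates on the best move
```

After a few iterations the policy itself picks the best move with high probability, which is the point of the loop: the search's discoveries end up in the parameters, not just in the search.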

LLMs can be thought of as a similar, though less structured, system.

They believe they can close this loop for LLMs, and drive iterative gains in intelligence, with the second step: iterative policy improvement.

They distill the reasoning process back to the model parameters so that the model has a stronger prior.

They use inference-time reasoning in a way that makes the model itself better.

The model should be able to anticipate the outcome of its own reasoning process, directly guessing the result without actually running the reasoning.
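This "guess the result without running the reasoning" idea can be illustrated with a toy distillation, again a hypothetical sketch rather than the actual method (which trains an LLM on its own reasoning traces): a teacher answers by explicit multi-step computation, and a student is fit on the teacher's (input, answer) pairs so it can produce the answer in one shot.

```python
def reason(n):
    """Slow 'reasoning chain': add the integers 1..n one step at a time."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

# Distillation data: inputs paired with the teacher's reasoned answers.
data = [(n, reason(n)) for n in range(1, 21)]

# Fit the student y ~ a*n^2 + b*n by solving the 2x2 normal equations
# (plain least squares standing in for a gradient-based distillation step).
s22 = sum(n**4 for n, _ in data)
s21 = sum(n**3 for n, _ in data)
s11 = sum(n**2 for n, _ in data)
t2 = sum(y * n**2 for n, y in data)
t1 = sum(y * n for n, y in data)
det = s22 * s11 - s21 * s21
a = (t2 * s11 - s21 * t1) / det
b = (s22 * t1 - s21 * t2) / det

def guess(n):
    """Distilled 'intuition': predicts the reasoning outcome with no loop."""
    return a * n**2 + b * n

print(round(guess(100)), reason(100))  # the student matches the teacher
```

The student never executes the step-by-step loop, yet it reproduces the loop's answer, which is the property the distillation step aims to instill in the model's prior.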

Although recent LLMs have made progress on reasoning, most improvements have been driven by scaling reasoning length without upgrading the model’s intelligence prior.

Instead of spending more compute to search longer, they want the model to have a more accurate intuition for how to look for the right answer.

Cogito v2 may be a step toward technical breakthroughs in iterative policy improvement.

They believe that hillclimbing on iterative policy improvement will pave the way for significantly improved model capabilities beyond what added search (via reasoning tokens) alone can unlock.