Is Deepseek Training Lying About Chips Used for Training its AI ?

Altimeter Capital analyst and partner puts what Deepseek claims and results into numbers.

$6M Training Costs = Plausible IMO

Quick math: Training costs ∝ (active params * tokens). DeepSeek v3 (37B params; 14.8T tokens) vs. Llama3.1 (405B params; 15T tokens) = v3 theoretically should be 9% of Llama3.1’s cost. And the disclosed actual figures aligned with this back-of-the-envelope math, meaning, the number are directionally believable.

Deepseek clearly says in a footnote that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.

Comparison of training costs between models trained at different times is inherently flawed: Training costs have been improving non-stop. Saying DeepSeek v3 (Jan 2025) is 1/10th the training cost of Llama3.1 (July 2024) is very misleading. Training costs have been dropping exponentially due to advancements in compute and algorithms.

Pre-training a model with hundreds of billions of parameters in the U.S. today costs less than $20M (go ask the engineers who actually build LLMs). DeepSeek may be ~50% more cost-effective than its U.S. peers – which seems entirely plausible to me! It’s like how a smaller-engine Japanese car can perform comparably to a larger-engine American car, thanks to engineering breakthroughs like turbocharging and lightweight design.

Training vs. R&D Costs

It’s tricky for all the labs to define training costs since a lot of experiments (incl. data costs) blend into training runs.
DeepSeek possibly required ~$500M capex (rumored 10K A100s + 2-3K H800s), still far less than top U.S. labs but a lot higher than $6M.

First Movers vs. Followers

Do people really have no idea about the massive R&D cost difference between “first-in-line drugs” vs. “me-too drugs”??
First movers inherently face “wasteful” R&D due to the trial-and-error nature of innovation. But when has humanity ever stopped pushing forward because of that? The effort is always worth it.

Huge Inference Efficiency Gains

Inference costs have always been coming down, and DeepSeek just unlocked a step-function drop in inference costs—faster, cheaper, and decent quality.
This is the moment many startup founders and developers have been waiting for! Suddenly, countless applications have achieved product-market fit from a cost perspective!
This should lead to a lot more inference spending, eventually.

Two Things Freda Believes and Nextbigfuture Agrees

1️⃣ Better and more efficient AI models = huge tailwind for the AI supercycle.
2️⃣ DeepSeek is a win for open-source AI & brings efficiency to the whole ecosystem.

Closed-source LLMs below this level of performance are irrelevant now. The same shakeup happened after Llama3 was launched, and DeepSeek is now cleaning the house. Open-source ecosystems, including $Meta’s, will thrive on this momentum.

DeepSeek at this point is more than the company itself. It’s a proof of concept: a hyper-efficient, small model running on cost-effective infrastructure.

I agree with Freda. Improved AI efficiency is good for AI and will make AI profitable. Profitable AI with vastly lower costs will mean more nvidia chips will be needed.

AI progress will get faster.

5 thoughts on “Is Deepseek Training Lying About Chips Used for Training its AI ?”

  1. All of this “cheap training” is about to run into a wall of expensive copyright infringement. The courts are slow, but powerful, and there are real rights for original content creators being trampled here, including traditional media, science, etc.
    The cheaper and more profitable AI capex gets, the more they will have to pay out to original content creators. There won’t be much public pushback against this either, because people in the real world are losing their jobs to AI, and they will want their cut of the proceeds.

  2. We know that evolution has developed a human-level intelligence (our human-level intelligence) that is extremely energy efficient (although the human brain is still the most energy-demanding organ in the whole body). I’m not particularly surprised that new strategies/architectures and tweaks could reduce the energy requirements by orders of magnitude.
    But out of curiosity, I calculated how much energy it takes to develop our intelligence.
    Making the following (quite arbitrary) assumptions:

    -If we assume that the brain training starts at birth and the learning phase ends around 25 years (I choose this threshold since in the AI field, the discussion is about a PhD level intelligence, so 25 years seems reasonable)
    -If we assume a standard daily metabolic intake of 2500 kCal (for simplicity, I assume the same intake even for newborns and during childhood, but this is an overestimation)
    -If we assume that the brain consumes, on average, 1/3 of the energy of the body
    -If we assume that the brain learns at a constant rate 24/7 (even when we sleep)

    Then we get:
    2500 kCal /3=834 kCal for the brain daily
    since 1kCal is 4184 joules 4184*834=3.49 megajoules daily
    3.49×10^6 x365= 1.27*10^9 joules per year, so approximately a gigajoule and a quarter per year
    1.27*10^9 x 25 years = 31.8 gigajoules in total
    since a kWh is 3600000 (or 3.6 million) joules
    31.8 gigajoules is equivalent to 8833 kWh, so roughly 9MWh

    If, however, we consider the cumulative development of intelligence until now as a development produced by the whole species, the energy cost increases significantly

    -If we assume that in the last 200000 years, cumulatively 100 billion humans ever lived and we assume that all reached 25 years of age (this is a huge overestimation since child mortality has been very high for the 99% of the life of our species), and all contributed to the collective development of the intelligence of our species

    then the energy cost of the development of human intelligence is 9Mwh*100 billion = 9e^17Wh or 900 petawatt hour PWh

    For comparison, our modern human civilization consumes approximately 170 PWh of energy each year, so to sustain the development of our biological intelligence from scratch (through 100 billion samples) it would cost 5 years of our global energy output. Not so cheap in the end.

  3. Musk said, tried to discredit them that they have nvidia 50,000 Nvidia GPUs.
    That does not mean that they lied. Perhaps Musk plays dirty, lies, perhaps others. US companies have financial motives to try to discredit them. Chinese have their own motives. Objective benchmarks and performances, facts would be the best.

Comments are closed.