Deep Deepseek History and Impact on the Future of AI

Some believe DeepSeek is so efficient that we don’t need more compute, and that the model changes have created massive overcapacity everywhere. Jevons Paradox is closer to reality: demand has already increased H100 and H200 pricing.

DeepSeek and High-Flyer have a mix of roughly 50,000 GPUs: H20s, H800s, A100s, and H100s. DeepSeek has about $1.3 billion in AI servers.

DeepSeek has hired in China based on capability and curiosity, recruiting from top universities like PKU and Zhejiang. Hires are given flexibility and access to tens of thousands of GPUs with no usage limitations. They offer salaries of over $1.3 million USD for promising candidates, more than competing big Chinese tech companies and AI labs like Moonshot. They have roughly 150 employees and are growing fast.

The $6M pre-training number is nowhere near the actual amount spent on the model.

Multi-Head Latent Attention, a key DeepSeek breakthrough, took several months to develop and consumed a whole team’s worth of man-hours and GPU hours.

DeepSeek V3 beats the performance of OpenAI’s GPT-4o, which was released in May 2024.

DeepSeek R1 matches OpenAI o1. DeepSeek used the OpenAI o1 API to quickly train on its toughest questions and correct answers. It was easier to catch up on the newer AI reasoning models. However, the richer companies with more resources will scale reasoning even further, and there are many more gains to be had from improvements in reasoning and test-time memory.
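The data-collection step described above is essentially distillation: query a stronger "teacher" model on the hardest questions and keep the question/answer pairs as training data for the student. A minimal sketch of that loop is below; the `teacher` function is a stub standing in for a real model API call, and all names and thresholds here are hypothetical, not DeepSeek's actual pipeline.

```python
# Toy sketch of distillation-style data collection: ask a stronger "teacher"
# model the hardest questions and keep its answers as training examples.
# The teacher here is a stub standing in for a real model API call.

def teacher(question: str) -> str:
    """Stand-in for querying a stronger model (e.g. over an API)."""
    return f"worked solution for: {question}"

def collect_distillation_data(questions, difficulty, threshold=0.8):
    """Keep only the toughest questions, paired with the teacher's answers."""
    dataset = []
    for q, d in zip(questions, difficulty):
        if d >= threshold:               # focus on the hardest problems
            dataset.append((q, teacher(q)))
    return dataset

qs = ["easy sum", "olympiad geometry", "hard proof"]
diff = [0.2, 0.9, 0.95]
pairs = collect_distillation_data(qs, diff)
# pairs now holds (question, teacher_answer) tuples for fine-tuning a student.
```

The resulting pairs would then be used as supervised fine-tuning data, which is why this shortcut makes catching up on reasoning models comparatively cheap.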

Dario Amodei, CEO of Anthropic, says that algorithmic advancements are even faster and can yield a 10x improvement. As far as inference pricing goes, costs for GPT-3-quality output have fallen 1200x. The DeepSeek difference was that a Chinese company made the AI cost improvement.

AI costs will likely fall another 5x by the end of 2025. The cost leader could be any one of several competitors.

Google’s Gemini Flash 2.0 Thinking is considerably cheaper than DeepSeek R1.

DeepSeek’s improvements will be copied by Western labs almost immediately.

DeepSeek V3 uses Multi-Token Prediction (MTP) at a scale not seen before: extra prediction modules predict the next few tokens instead of a single token. This improves model performance during training, and the modules can be discarded during inference. This algorithmic innovation delivered better performance with lower compute.
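The core idea can be sketched in a few lines of NumPy (toy dimensions, a single position, and plain linear heads; this is an illustration of the train-time/inference-time asymmetry, not DeepSeek's actual architecture): a shared trunk feeds both a next-token head and an extra head for the token after that, and at inference the extra head is simply dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 8, 16  # hidden size and vocab size (toy values)

# Two prediction heads on top of a shared trunk:
# W1 predicts token t+1, W2 predicts token t+2 (the "extra" MTP head).
W1 = rng.normal(size=(D, V))
W2 = rng.normal(size=(D, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_step_logits(h):
    """During training, both heads predict and both contribute a loss."""
    return softmax(h @ W1), softmax(h @ W2)

def inference_logits(h):
    """At inference the MTP head is discarded; only next-token logits remain."""
    return softmax(h @ W1)

h = rng.normal(size=D)          # trunk hidden state for one position
p1, p2 = train_step_logits(h)   # training: distributions for t+1 and t+2
p_inf = inference_logits(h)     # inference: only t+1
```

Because the extra head only shapes the trunk's representations through its training loss, removing it changes nothing about next-token inference, which is why the added training signal comes with no inference cost.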

DeepSeek V3 is a mixture-of-experts (MoE) model: one large model made up of many smaller models that specialize in different things.

DeepSeek uses a gating network that routes tokens to the right expert in a balanced way without hurting model performance.
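A generic top-2 router illustrates the mechanics (toy sizes, softmax gating over the selected experts, and a simple per-expert load statistic; DeepSeek's actual balancing mechanism is more sophisticated than this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
E, D, K = 4, 8, 2             # experts, hidden size, experts per token (toy)
Wg = rng.normal(size=(D, E))  # gating network weights

def route(tokens):
    """Send each token to its top-K experts; return choices and gate weights."""
    logits = tokens @ Wg                       # (N, E) gating scores
    topk = np.argsort(-logits, axis=1)[:, :K]  # indices of the K best experts
    # Softmax only over the selected experts' scores:
    sel = np.take_along_axis(logits, topk, axis=1)
    sel = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates = sel / sel.sum(axis=1, keepdims=True)
    return topk, gates

tokens = rng.normal(size=(6, D))
experts, gates = route(tokens)
# Load balance: fraction of routed slots handled by each expert. A balanced
# router keeps these fractions roughly equal so no expert sits idle.
load = np.bincount(experts.ravel(), minlength=E) / experts.size
```

Each token activates only K of the E experts, which is how an MoE model gets the capacity of a very large network at a fraction of the per-token compute; the balancing part is about keeping the `load` vector close to uniform.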

DeepSeek was not allowed to use the OpenAI o1 API to train another model. KYC (Know Your Customer) checks and other means will be used to stop distillation training.

Analysis: Which Big Tech Companies Win?

DeepSeek has more innovations to release and is a significant competitor, especially with strong backing from the Chinese government and Chinese banks.

Estimating shares of the future AI market: winners will need complete application offerings and must be able to win the trust and loyalty of a large set of customers.

This can be estimated by asking what you will have open or around, and which AI will be used from that device or system. Is it in your bot, your glasses, your phone? Or your car, your home system, or your social media?
What is open, and whom do you have the relationship with? If you are on X all the time, it would be Grok. Where is your Jiminy Cricket? Who is the confidant and advisor you are always talking to? Maybe it will be your Neuralink? Which is the company that you will trust?

Winners could include Amazon, by making mostly equal and very good models available on AWS.
It could be Meta if they can leverage their 3.5 billion users.
It could be Apple if they can keep people using AI from iPhones.

Ark thinks distribution and completeness of applications will be key.

13 thoughts on “Deep Deepseek History and Impact on the Future of AI”

  1. “Deepseek used the OpenAI o1 API to quickly train on its toughest questions and correct answers.”
    Doesn’t this mean that Deepseek basically leveraged (stole) from OpenAI o1?

    • Yes, they took parts of it. But this is common practice in Silicon Valley. Also, OpenAI stole everyone’s data on the internet: NYTimes, Twitter, YouTube, etc.

  2. As someone semi-illiterate in regards to technology, I have used ChatGPT-4 and DeepSeek to answer the same questions. One question was who would be the best Democratic choice for president. DeepSeek went down the present list of high-ranking Washington candidates, including Nancy Pelosi and Sanders, while ChatGPT covered a wider, more likely set of candidates. Ultimately, for me, technological wizardry is secondary to the answers I get to my questions.

  3. “The cost leader could any one of several competitors.” What does this sentence mean? Is there a “be” or “buy” or “destroy” missing?

    • Hello, I literally used the same original source that David Sacks posted about: the SemiAnalysis article written by Dylan Patel. I wrote an article citing the $1.3 billion spent on AI infrastructure.

      Please read my article when you comment.

  4. The point that seems to be missing is the whole element of roll this all forwards 10 years and what do we expect to be in place and what is driving the development at that point. We always think with a today and yesterday mindset as we tend to be unable to see into a future where there is insufficient appropriate background knowledge.
    One huge key difference with the current dynamics is the attention window and speed in processing that attention window, which already places the biological equivalent many magnitudes slower in processing (i.e. the 1m token full book summary in a few seconds).
    It’s rather interesting how blind we appear to be. There is a change approaching that we don’t quite seem to realise yet.

    • [ Who, within a humanoid species mindset would claim humor, music or prospective visionary inspiration was a useless investment of time(?)
      And likewise, it’s different for a person’s experience levels between experiencing situations, reading a book about them, or reading a summary of that book.
      A challenge on this is, enabling experiences for all parts of a society (including knowledge from that) and keeping resources available and stable(?)
      For many tasks (on average) a precise and comprehensive summary might be sufficient for successful problem solving or improved insights to surroundings, tasks or knowledge fields and improved LLMs might get an interesting compromise for enabling more advanced educational supply(?)
      Me not being convinced that 0|1-decisions on GHz speeds are a measure for quality of development or evolution(?)
      How about an absurdity of ‘robots’ voting or ‘spare time activities’ for LLMs(?) :), while different compared to an ’emergency medical holographic program’ on Federation Starfleet spaceships, available from around the 2360s, trying to improve its personality and cultural experience (while having access to a ‘whole world database’ of knowledge). ]

      • It is that balance to which stability and society rests. AI will weigh the scales to one side of that balance for a period of time. We are also in the early process of continuing evolution in a different form, adaptive generative complexity or digital DNA.
        Having worked on multi-billion-parameter models, the one perspective that stands out is the width of the attentional dynamic, to which biology is already magnitudes behind. It is this attentional width that is a critical step that precedes chain-of-thought-type working memory development. Working memory is a separate pool of attention feedback loops. Once coupled, the development cycle is closed off into a generative feedback loop, which then requires an external dynamic to continue the development phase.
        How about the absurdity of an AI taking the place of government to remove corruption, inefficiency, waste, bad decisions, etc.

        • “to which biology is already magnitudes behind”
          only within an environment that is supportive for ‘our’ kind of technology (but, yes, on a reality, for to be accepted on a year 2025 progress level and possibilities), but imagine our situation without, e.g. easy availability of accumulated/stored fossil fuels what’s of advantage for technological/artificial conscience, but not necessary for biological (on the long run of million years?); not saying stored fuels is mandatory or without alternatives, but maybe our view is only a ‘split time’ measurement, depending on

          “early process of continuing evolution in a different form”
          who decides about the priorities for the direction/intensity of an evolution (who is an accepted source of ‘wisdom’ therefore within our societies, excerpts from collected knowledge, experts by salary, fortune or popularity, money/index scores, energy supply, compute power, international/supranational law (?)

          “AI taking the place of government to remove corruption, inefficiency, waste, bad decisions”
          what are definitions for e.g. ‘bad decisions’ on an AI’s configuration(?) (thx)

          (and why didn’t ‘they’ optimize all humans genes towards a level like Khan’s crew (Star Trek, “Into darkness”) was capable of) ]

          • Look up to a power source, where more of them exist than grains of sand on our small tiny sphere of arrogance and ignorance.
            Evolution is an undecided path we walk to a destination we do not choose. Evolution is just a self assembling hierarchical complexity, a path that can’t be stopped.
            A bad decision would be investing billions into a project based on yesterday’s technology, which will be superseded by the time the project is delivered. In the UK, HS2, i.e. overhead lines rather than battery. Political decisions to reverse EV adoption to protect companies struggling to change. Typical fiscal geopolitics of divide and conquer.
            Evolution needs chaos, the juggling of DNA, to which Khan is a distortion. Chaos brings the unexpected that is beneficial when the future cannot be seen. All arrows cannot be the same.

Comments are closed.