Until late 2024, no one had been able to massively increase the amount of compute dedicated to a single model beyond the level of OpenAI's GPT-4. This information is from SemiAnalysis and the EIA.
Google's Gemini Ultra, Nvidia Nemotron 340B, and Meta LLAMA 3 405B had similar or slightly more compute than GPT-4, but used an inferior architecture. Those models did not unlock new capabilities.
A 100,000 GPU cluster:
– needs 150 MW of datacenter capacity
– uses 1.59 terawatt-hours of electricity in a single year
– incurs $123.9 million per year in energy costs at a standard rate of $0.078/kWh
– costs about $4 billion for the 100,000 H100 GPU servers
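The energy figures above can be sanity-checked with quick arithmetic. This sketch takes the article's stated numbers (1.59 TWh/year and $0.078/kWh) as inputs; the implied average power draw it derives is my own back-of-the-envelope inference, not a figure from the article.

```python
# Back-of-the-envelope check of the cluster energy figures above.
annual_twh = 1.59                       # terawatt-hours consumed per year (stated)
rate_per_kwh = 0.078                    # dollars per kilowatt-hour (stated)

annual_kwh = annual_twh * 1e9           # 1 TWh = 1e9 kWh
energy_cost = annual_kwh * rate_per_kwh

avg_power_mw = annual_kwh / 8760 / 1e3  # average draw implied by annual use

print(f"energy cost: ${energy_cost/1e6:.1f}M/year")    # ≈ $124M
print(f"implied average draw: {avg_power_mw:.0f} MW")  # ≈ 182 MW
```

The implied ~182 MW average draw sits above the 150 MW of IT capacity, which is roughly consistent with cooling and other facility overhead (PUE) on top of the GPUs themselves.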
OpenAI began training GPT-5 around May 2024.


OpenAI's GPT-4 training run used about 21.5 million ExaFLOPs of BF16 compute on ~20,000 A100s over 90 to 100 days. A 100k H100 cluster will have 15 to 31 times that compute.
A 100k H100 cluster training for 100 days can reach about 600 million ExaFLOPs of effective compute. Hardware reliability problems and other stalls hold effective compute to roughly 35% of the theoretical peak.
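The 600 million ExaFLOP figure can be roughly reproduced from public specs. This sketch assumes the H100's ~1,979 TFLOPS sparse BF16 rating (a labeled assumption; the dense rating is about half that) and the ~35% effective utilization mentioned above.

```python
# Rough reproduction of the ~600 million ExaFLOP figure.
gpus = 100_000
peak_flops_per_gpu = 1979e12        # H100 sparse BF16 TFLOPS (assumed rating)
utilization = 0.35                  # effective utilization after failures/stalls
seconds = 100 * 24 * 3600           # 100-day training run

total_flop = gpus * peak_flops_per_gpu * utilization * seconds
print(f"{total_flop / 1e18 / 1e6:.0f} million ExaFLOPs")  # ≈ 598
```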
To understand the network design, topology, reliability concerns, and checkpointing strategies, we need to understand how LLM training handles data and minimizes data movement.
There are three different types of parallelism used in trillion-parameter training – Data Parallelism, Tensor Parallelism, and Pipeline Parallelism.
Data Parallelism is the simplest form of parallelism: each GPU holds a full copy of the model weights, and each GPU (rank) receives a different subset of the data. This type of parallelism has the lowest level of communication, since only the gradients need to be summed (all-reduce) across GPUs. It only works if each GPU has enough memory to store the entire model weights, activations, and optimizer state. For GPT-4, the model weights and optimizer state can take as much as 10.8 terabytes of memory during training.
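A toy simulation makes the communication pattern concrete. Each "rank" below holds identical weights, computes a gradient on its own data shard (here just a stand-in expression, not a real backward pass), and an all-reduce averages the gradients so every rank applies the same update. The function names are illustrative, not a real framework API.

```python
import numpy as np

def all_reduce_mean(per_rank_grads):
    """Average gradients across ranks (what NCCL's all-reduce does, in spirit)."""
    return np.mean(per_rank_grads, axis=0)

ranks = 4
weights = np.zeros(3)                                  # identical copy on every rank
shards = [np.full(3, float(r)) for r in range(ranks)]  # different data per rank

grads = [shard - weights for shard in shards]  # toy per-rank "gradient"
g = all_reduce_mean(grads)                     # the only communication step
weights -= 0.1 * g                             # every rank takes the same step

print(weights)  # [-0.15 -0.15 -0.15]
```

Note that only the gradient vector crosses the network; the data shards themselves never move between ranks.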
Tensor parallelism reduces the total memory used per GPU by the number of tensor parallelism ranks. For example, it is common today to use 8 tensor parallelism ranks across NVLink, which reduces the memory used per GPU by a factor of 8.
With Pipeline Parallelism, each GPU holds only a subset of the layers, does the computation only for those layers, and passes the output to the next GPU.
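The layer-splitting idea can be sketched in a few lines. Each stage below stands in for one GPU's slice of the model; a real pipeline also overlaps micro-batches so stages stay busy, and the hand-off is a send/receive between GPUs rather than a function call — both simplified away here.

```python
# Toy pipeline parallelism: partition the model's layers into stages, one
# per "GPU"; each stage runs its layers and hands activations to the next.
layers = [lambda x, k=k: x * 2 + k for k in range(8)]  # 8 toy "layers"

def pipeline(stages, x):
    for stage in stages:      # in practice: a send/recv between GPUs
        for layer in stage:
            x = layer(x)
    return x

stages = [layers[0:4], layers[4:8]]  # 2 GPUs, 4 layers each
full = pipeline([layers], 1.0)       # same layers on one device
split = pipeline(stages, 1.0)
assert full == split                 # splitting doesn't change the math
print(split)  # 503.0
```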


Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technologies and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting edge technologies, he is currently a Co-Founder of a startup and fundraiser for high potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker and guest at numerous interviews for radio and podcasts. He is open to public speaking and advising engagements.
In terms of circuit architecture, for example: having a clocked parallel pipeline that can perform scatter-gather type methods that are the equivalent of single (or few) clock operations, depending on pipeline width or depth. There was a paper on a pipelined FPGA voice recognition architecture which could handle the equivalent of over 1 million conversation threads per second with a single chip. This is the type of architectural difference and scaling I'm thinking of.
I'm guessing that by around 2026 Nvidia's attempt at having AI create the circuit architecture will evolve a strategy that changes the whole compute architecture, and we will see another jump in compute capability as a result. Extrapolating the current AI network approaches will, I believe, also prove wrong, because the current digital approach is orders of magnitude less compute-efficient than sparse methods. Sparse compute does not lend itself to CUDA and the current massively parallel compute architectures, which is where Nvidia (or someone else) may end up with a unique approach, using AI to create a sparse compute architecture that we currently think is just too complex to design.