Future Wafer Scale Chips Could Have 100 Trillion Transistors

The AI startup Cerebras has shown that a single wafer-scale Cerebras CS-1 can outperform one of the fastest supercomputers in the US by more than 200 times on a physics simulation, and the system is also aimed at many AI applications. The single-wafer system delivered 0.86 petaFLOPS on that workload. The wafer chip was built on a 16-nanometer FF process.

The WSE is the largest chip ever built. It measures 46,225 square millimeters and contains 1.2 trillion transistors and 400,000 AI-optimized compute cores. The memory architecture keeps each of these cores operating at maximum efficiency: 18 gigabytes of fast on-chip memory is distributed among the cores in a single-level memory hierarchy, one clock cycle away from each core. The AI-optimized, locally fed cores are linked by Swarm, a fine-grained, all-hardware, high-bandwidth, low-latency mesh-connected fabric.

Wafer-scale chips were a goal of computing pioneer Gene Amdahl decades ago. The yield and defect problems that prevented wafer-scale integration then have now been overcome.

In an interview with Ark Invest, the Cerebras CEO talked about how they will beat Nvidia to become the processor maker for AI. He said Nvidia GPU clusters take about four months to set up before work can start, while a Cerebras system can be put to use in about ten minutes, and that each GPU needs two regular Intel host chips to be usable.

If we assume that wafer-scale chips are the future of AI and of many supercomputing applications, what happens if Cerebras has financial success and moves its wafer-scale chips to leading-edge TSMC processes?

TSMC says a 7-nanometer chip can have about 202-250 million transistors per square millimeter. At the high end, 46,225 square millimeters would hold about 11.6 trillion transistors. Scaling the WSE design proportionally, this could provide about 175 GB of on-wafer memory.
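A quick back-of-the-envelope check of that arithmetic (my own sketch, not Cerebras figures): multiply the reported density by the WSE die area, and scale the 18 GB of on-wafer memory by the same ratio.

```python
# Rough scaling check: transistors = density * area; memory scaled by the same ratio.
WAFER_AREA_MM2 = 46_225        # Cerebras WSE die area
WSE1_TRANSISTORS = 1.2e12      # 16nm WSE-1
WSE1_MEMORY_GB = 18            # WSE-1 on-wafer memory

DENSITY_7NM = 250e6            # transistors per mm^2 (high end of TSMC's 202-250M figure)

transistors_7nm = DENSITY_7NM * WAFER_AREA_MM2
memory_7nm_gb = WSE1_MEMORY_GB * transistors_7nm / WSE1_TRANSISTORS

print(f"7nm wafer: ~{transistors_7nm / 1e12:.1f} trillion transistors")  # ~11.6 trillion
print(f"scaled on-wafer memory: ~{memory_7nm_gb:.0f} GB")                # ~173 GB, roughly the 175 GB figure
```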

Cerebras has since announced the first version of its 7-nm wafer chip, which has 2.6 trillion transistors and 850,000 cores.

They did not simply squeeze in the maximum possible number of transistors.

The world’s largest GPU, Nvidia’s A100, measures 826 mm² and has 54.2 billion transistors, about 65 million transistors per square millimeter. At that density, 46,225 square millimeters would hold over 3 trillion transistors.

TSMC says that its 5-nanometer process is 1.84x denser than its 7-nanometer node. That would give a 4-to-5-trillion-transistor wafer chip based on scaling the Gen 2 Cerebras wafer chip. In 2017, IBM demonstrated a 5-nm test chip with 600 million transistors per mm². The IBM density would enable a 27.7-trillion-transistor wafer chip with about 380 GB of on-wafer memory, just by scaling the Cerebras WSE.

Commercial integrated circuit manufacturing on the 3-nm process is set to begin around 2022-2023, though there would be an additional delay to get the process working for a full wafer chip. A 3-nm process would roughly double transistor density again, to about 50 trillion transistors per wafer at maximum density, or about 10 trillion transistors by scaling the Gen 2 Cerebras.

A 2-nm-process wafer chip would double the transistor count again, to a maximum of about 100 trillion transistors per wafer, or about 20 trillion transistors by scaling the Gen 2 Cerebras. The on-wafer memory could scale to about 200-800 GB. If memory were emphasized a bit more, the wafer chip could have a terabyte or more of on-wafer memory.
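The node-to-node projections above can be summarized in a small sketch; the 5-nm multiplier is TSMC's quoted figure, while the 3-nm and 2-nm doublings are the assumptions used in this article.

```python
# Scale the 2.6-trillion-transistor Gen 2 (7nm) wafer chip through assumed density gains.
GEN2_TRANSISTORS = 2.6e12       # 7nm Gen 2 Cerebras wafer chip

DENSITY_VS_7NM = {
    "5nm": 1.84,                # TSMC's quoted 7nm -> 5nm density gain
    "3nm": 1.84 * 2.0,          # assume roughly another doubling at 3nm
    "2nm": 1.84 * 2.0 * 2.0,    # assume another doubling at 2nm
}

for node, factor in DENSITY_VS_7NM.items():
    scaled = GEN2_TRANSISTORS * factor
    print(f"{node}: ~{scaled / 1e12:.0f} trillion transistors for a scaled Gen 2 wafer chip")
# Output: ~5 trillion (5nm), ~10 trillion (3nm), ~19 trillion (2nm),
# matching the 4-5, ~10 and ~20 trillion figures above.
```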

Each step down the lithography nodes usually brings roughly 30-40% improvements in energy efficiency and speed, and the larger transistor budget allows bigger problems and more processing power to be handled. Moving from the 2019 sixteen-nanometer wafer to these future nodes would likely yield roughly a tenfold gain overall. This suggests that an exaflop-class wafer chip should be possible on near-term future lithography once the design is more fully optimized.

What Wafer Scale Computing Will Mean

Rob Schreiber of Cerebras gave a talk titled “Wafer Scale Computing: What it Means for AI and What it May Mean for HPC.”

He described the use, in collaboration with NETL, of the CS-1 for the model problem of solving a large sparse system of linear equations posed on a regular 3D mesh using BiCGStab, a typical Krylov subspace solver. On traditional systems, both memory bandwidth and communication latency limit performance and prevent strong scaling for such computations, which do not cache well and which require frequent collective communication. The team achieved performance two orders of magnitude better than the best possible on a CPU cluster, because these limiting factors are not limits at all on the wafer-scale system. With 18 GB of on-wafer memory, there is a limit to the size of problem that can be solved this way. Schreiber discussed the future growth of the technology and the implications of strong scaling and extreme performance for problems with a modest memory footprint.
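For readers unfamiliar with the model problem, here is a minimal sketch (not Cerebras code) of that kind of computation on a conventional machine: a 7-point Laplacian on a regular 3D mesh solved with BiCGSTAB via SciPy. The mesh size and right-hand side are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

n = 32                                               # mesh points per dimension (illustrative)
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))  # 1D Laplacian stencil

# 3D Laplacian via Kronecker sums: a sparse (n^3 x n^3) matrix with a 7-point stencil
A = (sp.kron(sp.kron(T, I), I)
     + sp.kron(sp.kron(I, T), I)
     + sp.kron(sp.kron(I, I), T)).tocsr()

b = np.ones(n**3)                                    # illustrative right-hand side
x, info = bicgstab(A, b, maxiter=2000)               # info == 0 means converged

print("converged" if info == 0 else f"info={info}",
      "| residual norm:", np.linalg.norm(b - A @ x))
```

On a CPU cluster this kind of solve is limited by memory bandwidth and communication latency; the point of the wafer-scale result is that keeping the whole problem in on-wafer memory removes those bottlenecks.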

SOURCES- Cerebras, Wikichip, Techspot
Written by Brian Wang, Nextbigfuture.com

8 thoughts on “Future Wafer Scale Chips Could Have 100 Trillion Transistors”

  1. I do agree 18GB is a bit low for something this large, I wonder why. That being said, GPT-3 is really inefficient; that LMU Munich team built a model that can equal it on several benchmarks like SuperGLUE (which isn't everything, I know) with 1000x fewer parameters. https://arxiv.org/pdf/2009.07118.pdf

  2. OK, I like the memory bandwidth, but I think that 18 GB of memory is way too small. GPT-3 has 175 billion weights, so it's about 10 times too large to be trained on this system. Also note that this is not the "end" of the GPT "train"; the performance has still not flattened out with respect to the number of parameters. How about a system with 1.8 trillion weights? Wonder what that system could do…

    The GPT-3 model was trained on V100 (Nvidia) systems, so the statement (made by the Cerebras people) that Nvidia cannot combine the memory of different clusters to train larger models does not seem to be accurate. How else to explain that Nvidia's GPU systems did train GPT-3?

    We also have the issue of IO to the system. Even though Cerebras seems to downplay the importance of IO, the massive amount of data required to train many networks may be a bottleneck for the Cerebras system. GPT-3, for instance, was trained with 45 TB of data. The Cerebras system has 1.2 Tb per second of IO [1], so it would take about 5 minutes to load the data into the system.

    If you need about 10 000 epochs of training, the data loading alone would take about 34 days (see the quick check after these comments). This would indicate that the Cerebras system would not be suited for very large models with very large datasets, irrespective of the calculation performance of the system.

    (1)
    https://www.cerebras.net/product/

  3. When discussing AI tech and design, it's useful to separate inference from training.
    Training takes orders of magnitude more resources, energy and memory. A large part of the "AI" compute clusters used for training is storage and the high-bandwidth interconnects to this storage, be it RAM or non-volatile media.
    This chip seems to address the first cache level or three of this storage hierarchy but not more. This will work fine for inference but not equally well for training.
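A quick numeric check of the data-loading arithmetic in comment 2 above (illustrative only, using the commenter's figures of 45 TB of data, 1.2 Tb/s of IO and 10,000 epochs):

```python
# Data-loading time if every epoch re-streams the full dataset over the system's IO.
DATA_TB = 45                 # GPT-3 training corpus, terabytes
IO_TB_PER_S = 1.2 / 8        # 1.2 terabits/s expressed in terabytes/s
EPOCHS = 10_000

seconds_per_pass = DATA_TB / IO_TB_PER_S             # ~300 s, i.e. about 5 minutes
total_days = seconds_per_pass * EPOCHS / 86_400      # 86,400 seconds in a day
print(f"~{seconds_per_pass / 60:.0f} min per pass, ~{total_days:.0f} days for {EPOCHS} epochs")
# -> ~5 min per pass, ~35 days for 10000 epochs (the commenter rounds to ~34 days)
```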
