Future Wafer Scale Chips Could Have 100 Trillion Transistors

The AI startup Cerebras has shown that a single wafer-scale Cerebras CS-1 can outperform one of the fastest supercomputers in the US by more than 200 times for physics simulations and many AI applications. The single-wafer system delivered 0.86 petaflops of performance. The wafer-scale chip was built on a 16-nanometer FinFET (16FF) process.

The WSE is the largest chip ever built. It measures 46,225 square millimeters and contains 1.2 trillion transistors and 400,000 AI-optimized compute cores. The memory architecture ensures each of these cores operates at maximum efficiency. It provides 18 gigabytes of fast, on-chip memory distributed among the cores in a single-level memory hierarchy one clock cycle away from each core. The AI-optimized cores, each fed by local memory, are linked by Swarm, a fine-grained, all-hardware, high-bandwidth, low-latency mesh-connected fabric.

Wafer-scale chips were a goal of computing pioneer Gene Amdahl decades ago. The issues that prevented wafer-scale integration have now been overcome.

In an interview with Ark Invest, the Cerebras CEO discusses how the company plans to beat Nvidia to become the processor maker for AI. Nvidia GPU clusters take four months to set up before work can start, while a Cerebras system can be put to use in ten minutes. Each GPU also needs two regular Intel chips to be usable.

If we assume that wafer-scale chips are the future of AI and many supercomputing applications, what will happen if Cerebras has financial success and moves its wafer-scale chips to leading-edge TSMC fabs?

TSMC says a 7-nanometer lithography chip has about 202-250 million transistors per square millimeter. Across 46,225 square millimeters, that would be up to 11.6 trillion transistors, and could support about 175 GB of on-wafer memory.
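The density-times-area arithmetic is easy to check. The sketch below uses the article's own figures; the helper function is just an illustration, not anything from Cerebras or TSMC:

```python
# Rough wafer-scale transistor projection from the quoted 7-nm density range.
WAFER_AREA_MM2 = 46_225  # area of the Cerebras WSE in square millimeters

def wafer_transistors(density_per_mm2: float) -> float:
    """Total transistors at a given density over the full wafer-scale die."""
    return density_per_mm2 * WAFER_AREA_MM2

low = wafer_transistors(202e6)   # low end of the quoted density range
high = wafer_transistors(250e6)  # high end of the quoted density range
print(f"7 nm wafer chip: {low / 1e12:.1f} to {high / 1e12:.1f} trillion transistors")
```

At the 250 million/mm² high end this reproduces the 11.6 trillion figure above; the low end gives about 9.3 trillion.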

Cerebras has since announced the first version of its 7-nm wafer chip: it has 2.6 trillion transistors and 850,000 cores.

Simply squeezing in the maximum number of transistors was not the goal.

The world’s largest GPU, Nvidia’s A100, measures 826 mm2 and has 54.2 billion transistors, or about 65 million transistors per square millimeter. At the same density, 46,225 square millimeters would hold over 3 trillion transistors.
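The A100 comparison follows the same arithmetic: derive the density from the A100's published size and transistor count, then project it across the wafer. A quick sketch of the calculation:

```python
# Density implied by Nvidia's A100, projected to wafer scale.
A100_TRANSISTORS = 54.2e9  # published transistor count
A100_AREA_MM2 = 826        # published die area
WAFER_AREA_MM2 = 46_225    # Cerebras WSE area

density = A100_TRANSISTORS / A100_AREA_MM2  # transistors per mm^2
wafer_total = density * WAFER_AREA_MM2      # projected wafer-scale count
print(f"A100 density: {density / 1e6:.1f} M/mm^2 -> wafer: {wafer_total / 1e12:.2f} trillion")
```

This gives roughly 65.6 million transistors/mm² and just over 3 trillion transistors at wafer scale, matching the figures above.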

TSMC says that its 5-nanometer process is 1.84x denser than its 7-nanometer node. Scaling the Gen 2 Cerebras wafer chip by that factor gives a 4-to-5-trillion-transistor wafer chip. In 2017, IBM demonstrated a 5-nm process chip with 600 million transistors per mm2. IBM's density would enable a 27.7-trillion-transistor wafer chip with about 380 GB of on-wafer memory, just by scaling the Cerebras WSE.

Commercial integrated circuit manufacturing using the 3-nm process is set to begin around 2022-2023, with some delay before the process works for a full wafer chip. A 3-nm process would roughly double transistor density again, to about 50 trillion transistors per wafer at maximum density, versus about 10 trillion transistors from scaling the Gen 2 Cerebras.

A 2-nm-process wafer chip would double the transistor count again: a maximum of 100 trillion transistors per wafer, or about 20 trillion transistors by scaling the Gen 2 Cerebras. On-wafer memory could scale to about 200-800 GB. If memory were emphasized a bit more, the wafer chip could have a terabyte or more of on-wafer memory.
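The node-by-node projections above can be collected into one calculation. The scaling factors are the article's rough figures (TSMC's 1.84x for 5 nm, then a doubling per node), not foundry specifications:

```python
# Scaling the 2.6-trillion-transistor Gen 2 (7 nm) Cerebras wafer chip
# through the density steps described above.
GEN2_7NM = 2.6e12  # transistors in the 7-nm Gen 2 wafer chip

nodes = {
    "5 nm": 1.84,           # TSMC: 1.84x denser than 7 nm
    "3 nm": 1.84 * 2,       # roughly doubled again at 3 nm
    "2 nm": 1.84 * 2 * 2,   # doubled once more at 2 nm
}
for node, factor in nodes.items():
    print(f"{node}: ~{GEN2_7NM * factor / 1e12:.0f} trillion transistors")
```

The output tracks the text: roughly 5 trillion at 5 nm, about 10 trillion at 3 nm, and about 20 trillion at 2 nm.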

Each step down the lithography nodes usually brings about 30-40% improvements in energy efficiency and speed. The increased number of transistors could enable larger problems and scaled-up processing power. Dropping from the 16-nanometer node of 2019 to smaller lithography would likely yield about a ten-times improvement. This means an exaflop wafer chip should be possible when made on near-term future lithography and more fully optimized.

What Wafer Scale Computing Will Mean

Wafer Scale Computing: What it Means for AI and What it May Mean for HPC is a talk by Rob Schreiber of Cerebras.

Schreiber described a study, done in collaboration with NETL, of the CS-1 on the model problem of solving a large sparse system of linear equations posed on a regular 3D mesh using the BiCGStab method, a typical Krylov subspace solver. On traditional systems, both memory bandwidth and communication latency limit performance and prevent strong scaling for such computations, which do not cache well and which require frequent collective communication. The CS-1 achieved performance two orders of magnitude better than the best possible on a CPU cluster, because these limiting factors are no limits at all on the wafer-scale system. With 18 GB of on-wafer memory, there is a limit to the size of problem that can be solved this way. Schreiber also discussed the future growth of the technology and the implications of strong scaling and extreme performance for problems of modest memory footprint.
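For readers unfamiliar with the method, below is a minimal pure-NumPy sketch of BiCGStab applied to the kind of problem named in the talk: a sparse-structured linear system from a regular 3D mesh (here a small 7-point Laplacian). This is an illustration of the algorithm only, not Cerebras's or NETL's implementation, and the mesh size is deliberately tiny:

```python
# Illustrative BiCGStab (van der Vorst's formulation) on a 3D-mesh system.
import numpy as np

def bicgstab(A, b, tol=1e-8, maxiter=1000):
    """Bare-bones BiCGStab iteration for Ax = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    r_hat = r.copy()                     # fixed shadow residual
    rho = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for _ in range(maxiter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v
        t = A @ s
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s    # combined update
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

# 7-point Laplacian on an n x n x n mesh, built with Kronecker sums.
n = 6
I = np.eye(n)
D = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D second difference
A = (np.kron(np.kron(D, I), I)
     + np.kron(np.kron(I, D), I)
     + np.kron(np.kron(I, I), D))
b = np.ones(n ** 3)
x = bicgstab(A, b)
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

On a real cluster, the matrix-vector products and dot products in this loop are exactly the memory-bandwidth-bound and latency-bound collective operations the talk identifies as the bottleneck.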

SOURCES- Cerebras, Wikichip, Techspot
Written by Brian Wang, Nextbigfuture.com