Cerebras Wafer Scale AI Chip Enables 4 Exaflop Supercomputers for $100 Million

Anastasi interviewed Andrew Feldman, the CEO of Cerebras, about the new Cerebras supercomputer.

The wafer scale AI chip is a great match for large language models like ChatGPT, which need both fast compute and large amounts of memory.

Cerebras has made the systems easy to program with PyTorch.

The Wafer Scale Engine 1 (WSE-1) was built at 16 nanometers. The Wafer Scale Engine 2 (WSE-2, the current system) is produced at 7 nanometers, and the next chip (WSE-3) will be made by TSMC at 5 nanometers.

Timestamps:
00:00 – Introduction
02:15 – Why such a HUGE Chip?
02:37 – New AI Supercomputer Explained
04:06 – Main Architectural Advantage
05:47 – Software Stack NVIDIA CUDA vs Cerebras
06:55 – Costs
07:51 – Key Applications & Customers
09:48 – Next Generation – WSE3
10:27 – NVIDIA vs Cerebras Comparison

A research paper shows that the Wafer Scale Engine 2 is 200 times faster than Nvidia A100 GPUs and is still 50 to 100 times faster than the new Nvidia H100 GPUs.

Massively scalable stencil algorithm

Stencil computations lie at the heart of many scientific and industrial applications. Unfortunately, stencil algorithms perform poorly on machines with cache based memory hierarchy, due to low re-use of memory accesses. This work shows that for stencil computation a novel algorithm that leverages a localized communication strategy effectively exploits the Cerebras WSE-2, which has no cache hierarchy. This study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a kernel frequently used in earth modeling as numerical simulation. In essence, the algorithm trades memory accesses for data communication and takes advantage of the fast communication fabric provided by the architecture. The algorithm — historically memory bound — becomes compute bound. This allows the implementation to achieve near perfect weak scaling, reaching up to 503 TFLOPs on WSE-2, a figure that only full clusters can eventually yield.
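To make the "25-point stencil" concrete, here is a minimal serial sketch in plain Python. It is not the paper's WSE-2 algorithm, which spreads the grid across the chip and exchanges neighbor values over the fabric. It only shows what the kernel computes at each grid point. The function names, the periodic boundaries, the uniform grid spacing, the constant wave speed and the leapfrog time step are all assumptions made for illustration. The weights are the standard 8th-order central-difference coefficients for a second derivative, which reach 4 neighbors in each direction along 3 axes (1 + 3 x 8 = 25 points).

```python
# 8th-order central-difference weights for a second derivative (radius 4).
# One center point plus 4 neighbors on each side of 3 axes = 25 points.
COEFFS = [-205 / 72, 8 / 5, -1 / 5, 8 / 315, -1 / 560]
RADIUS = len(COEFFS) - 1


def laplacian_25pt(u, n, h):
    """Return the 25-point Laplacian of a periodic n*n*n grid.

    `u` is a flat list in (i, j, k) order and `h` is the grid spacing.
    Periodic wrap-around is an assumption made to keep the sketch short.
    """
    def idx(i, j, k):
        return ((i % n) * n + (j % n)) * n + (k % n)

    out = [0.0] * (n * n * n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # The center point contributes once per axis.
                acc = 3 * COEFFS[0] * u[idx(i, j, k)]
                for r in range(1, RADIUS + 1):
                    acc += COEFFS[r] * (
                        u[idx(i - r, j, k)] + u[idx(i + r, j, k)]
                        + u[idx(i, j - r, k)] + u[idx(i, j + r, k)]
                        + u[idx(i, j, k - r)] + u[idx(i, j, k + r)]
                    )
                out[idx(i, j, k)] = acc / (h * h)
    return out


def wave_step(u_prev, u_curr, n, h, dt, c):
    """Advance u_tt = c^2 * laplacian(u) by one leapfrog time step."""
    lap = laplacian_25pt(u_curr, n, h)
    return [
        2 * uc - up + (c * dt) ** 2 * l
        for up, uc, l in zip(u_prev, u_curr, lap)
    ]
```

Each output value reads 25 inputs and does only a few dozen floating point operations on them, which is why the kernel tends to be limited by memory traffic rather than arithmetic on cache-based machines.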

The Cerebras chips are faster, and programs written for them need 20 to 60 times less code.