Cerebras Trillion Transistor AI Wafer-Chip Crushes GPU Supercomputer by 200 Times

Researchers at the National Energy Technology Laboratory (NETL) and Cerebras showed that a single wafer-scale Cerebras CS-1 can outperform one of the fastest supercomputers in the US by more than 200 times. The single-wafer system delivered 0.86 petaFLOPS of performance.

The problem was to solve a large, sparse, structured system of linear equations of the sort that arises in modeling physical phenomena—like fluid dynamics—using a finite-volume method on a regular three-dimensional mesh. Solving these equations is fundamental to such efforts as forecasting the weather; finding the best shape for an airplane’s wing; predicting the temperatures and radiation levels in a nuclear power plant; modeling combustion in a coal-burning power plant; and making images of the layers of sedimentary rock in places likely to contain oil and gas.
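To make the structure of such systems concrete, here is a minimal NumPy sketch of the matrix-vector product for a 7-point stencil (a cell and its six axis neighbors) on a regular 3D mesh, the kind of operator the paper's solver applies. The function name and coefficient layout are illustrative, not taken from the paper, and boundary cells are simply held at zero.

```python
import numpy as np

def apply_7point_stencil(u, coeffs):
    """Apply a 7-point stencil (center + 6 axis neighbors) to a 3D field.

    This sparse matrix-vector product is the core operation inside
    iterative solvers for such systems. Boundary cells stay zero
    for simplicity.
    """
    c, cx_lo, cx_hi, cy_lo, cy_hi, cz_lo, cz_hi = coeffs
    v = np.zeros_like(u)
    v[1:-1, 1:-1, 1:-1] = (
        c * u[1:-1, 1:-1, 1:-1]
        + cx_lo * u[:-2, 1:-1, 1:-1] + cx_hi * u[2:, 1:-1, 1:-1]
        + cy_lo * u[1:-1, :-2, 1:-1] + cy_hi * u[1:-1, 2:, 1:-1]
        + cz_lo * u[1:-1, 1:-1, :-2] + cz_hi * u[1:-1, 1:-1, 2:]
    )
    return v

# Example: the standard 3D Laplacian stencil (-6 at the center, +1 at each neighbor)
u = np.random.rand(16, 16, 16)
v = apply_7point_stencil(u, (-6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))
```

Note that each output cell depends only on its immediate neighbors, which is why memory bandwidth and neighbor-to-neighbor communication latency, rather than raw arithmetic, dominate performance on conventional clusters.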

The massive speedup was enabled by:

1. The memory performance on the CS-1.
2. The high bandwidth and low latency of the CS-1’s on-wafer communication fabric.
3. A processor architecture optimized for high bandwidth computing.

The Cerebras CS-1 contains the world’s largest chip. At 72 square inches (462 cm²), it is the largest square that can be cut from a 300 mm wafer, roughly 60 times the size of a large conventional chip such as a CPU or GPU. It provides a much-needed breakthrough in computer performance for deep learning.

Cerebras has delivered CS-1 systems to customers around the world, where they are providing an otherwise impossible speed boost to leading-edge AI applications in fields ranging from drug design and astronomy to particle physics and supply chain optimization.

NETL took a key component of their software for modeling fluid bed combustion in power plants and implemented it on the Cerebras CS-1.

The performance was more than 200 times faster than that of NETL’s Joule 2.0 supercomputer, an 84,000-CPU-core cluster. Joule is the 24th fastest supercomputer in the U.S. and the 82nd fastest in the world.

The Cerebras software platform comprises four primary elements:
1. The optimized Cerebras Graph Compiler (CGC)
2. A flexible library of high-performance kernels and a kernel-development API
3. Development tools for debug, introspection, and profiling
4. Clustering software

The Wafer Scale Engine, the CS-1 system, and the Cerebras software platform together form a complete solution for high-performance deep learning compute. Deploying the solution requires no changes to existing workflows or to datacenter operations. The CS-1 has been deployed in some of the largest compute environments in the world, including the US Department of Energy’s supercomputing sites. CS-1s are currently being used to address some of the most difficult challenges of our time — from accelerating AI in cancer research, to better understanding and treating traumatic brain injury, to furthering discovery in fundamental science around the characteristics of black holes.

The Cerebras CS-1 eliminates the primary impediment to the advancement of artificial intelligence. It reduces the time it takes to train models from months to minutes and from weeks to seconds, allowing researchers to be vastly more productive.

Arxiv – Fast Stencil-Code Computation on a Wafer-Scale Processor

Abstract—The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a 600 × 595 × 1536 mesh, achieving about one third of the machine’s peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.
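The BiCGStab method named in the abstract is a standard Krylov-subspace solver for nonsymmetric linear systems. The following is a minimal, unpreconditioned sketch of the algorithm (after van der Vorst, 1992), not the paper's wafer-scale implementation: it uses a small dense matrix for brevity, whereas the paper applies the operator matrix-free through the 7-point stencil.

```python
import numpy as np

def bicgstab(A, b, tol=1e-8, max_iter=1000):
    """Unpreconditioned BiCGStab for A @ x = b (illustrative sketch)."""
    x = np.zeros_like(b)
    r = b - A @ x           # initial residual
    r_hat = r.copy()        # shadow residual, kept fixed
    rho = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for _ in range(max_iter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho / (r_hat @ v)
        s = r - alpha * v
        t = A @ s
        omega = (t @ s) / (t @ t)   # stabilizing step length
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

# Example: a small diagonally dominant tridiagonal system
n = 50
A = (np.diag(np.full(n, 4.0))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))
b = np.ones(n)
x = bicgstab(A, b)
```

Each iteration needs two applications of the operator and several global dot products; on a cluster, those dot products force all-node communication every iteration, which is the data-movement bottleneck the wafer-scale fabric is designed to remove.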

SOURCES- Arxiv, Cerebras
Written by Brian Wang, Nextbigfuture.com