Researchers at the National Energy Technology Laboratory (NETL) and Cerebras showed that a single wafer-scale Cerebras CS-1 can outperform one of the fastest supercomputers in the US by more than 200X. The single-wafer system achieved 0.86 petaflops of performance.
The problem was to solve a large, sparse, structured system of linear equations of the sort that arises in modeling physical phenomena—like fluid dynamics—using a finite-volume method on a regular three-dimensional mesh. Solving these equations is fundamental to such efforts as forecasting the weather; finding the best shape for an airplane’s wing; predicting the temperatures and the radiation levels in a nuclear power plant; modeling combustion in a coal-burning power plant; and imaging the layers of sedimentary rock in places likely to contain oil and gas.
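To make this concrete, here is a minimal, matrix-free sketch in NumPy of the kind of 7-point finite-volume stencil operator such solvers apply at every iteration. This is an illustration, not code from the paper: the function name, mesh size, and zero (Dirichlet) boundary assumption are my own choices.

```python
import numpy as np

def stencil_matvec(u, h=1.0):
    """Apply a 7-point Laplacian stencil to a 3D field u, matrix-free.

    The sparse system matrix is never formed explicitly; each point is
    coupled only to its six axis neighbors. Zero (Dirichlet) boundary
    values are assumed for simplicity.
    """
    v = 6.0 * u.copy()
    v[1:, :, :]  -= u[:-1, :, :]   # neighbor in -x
    v[:-1, :, :] -= u[1:, :, :]    # neighbor in +x
    v[:, 1:, :]  -= u[:, :-1, :]   # neighbor in -y
    v[:, :-1, :] -= u[:, 1:, :]    # neighbor in +y
    v[:, :, 1:]  -= u[:, :, :-1]   # neighbor in -z
    v[:, :, :-1] -= u[:, :, 1:]    # neighbor in +z
    return v / h**2

# Toy mesh; the paper's problem uses a 600 x 595 x 1536 mesh.
u = np.random.rand(16, 16, 16)
v = stencil_matvec(u)
```

The memory-bound character of this kernel is easy to see: each output point needs seven input values but only a handful of additions, which is why memory bandwidth and neighbor-to-neighbor communication dominate performance.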
The massive speedup was enabled by:
1. The memory performance on the CS-1.
2. The high bandwidth and low latency of the CS-1’s on-wafer communication fabric.
3. A processor architecture optimized for high bandwidth computing.
The Cerebras CS-1 has the world’s largest chip. It is 72 square inches (462 cm2), the largest square that can be cut from a 300 mm wafer. The chip is about 60 times the size of a large conventional chip like a CPU or GPU. It provides a much-needed breakthrough in computer performance for deep learning.
Cerebras has delivered CS-1 systems to customers around the world, where they are providing an otherwise impossible speed boost to leading-edge AI applications in fields ranging from drug design to astronomy, particle physics to supply chain optimization.
NETL took a key component of their software for modeling fluid bed combustion in power plants and implemented it on the Cerebras CS-1.
The performance was more than 200 times faster than that of NETL’s Joule 2.0 supercomputer, an 84,000-CPU-core cluster. Joule is the 24th fastest supercomputer in the U.S. and the 82nd fastest in the world.
The Cerebras software platform comprises four primary elements:
1. The optimized Cerebras Graph Compiler (CGC)
2. A flexible library of high-performance kernels and a kernel-development API
3. Development tools for debug, introspection, and profiling
4. Clustering software
The Wafer Scale Engine, the CS-1 system, and the Cerebras software platform together form a complete solution for high-performance deep learning compute. Deploying the solution requires no changes to existing workflows or to datacenter operations. The CS-1 has been deployed in some of the largest compute environments in the world, including the US Department of Energy’s supercomputing sites. CS-1s are currently being used to address some of the most difficult challenges of our time — from accelerating AI in cancer research, to better understanding and treating traumatic brain injury, to furthering discovery in fundamental science around the characteristics of black holes.
The Cerebras CS-1 eliminates the primary impediment to the advancement of artificial intelligence. It reduces the time it takes to train models from months to minutes and from weeks to seconds, allowing researchers to be vastly more productive.
Arxiv – Fast Stencil-Code Computation on a Wafer-Scale Processor
Abstract—The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a 600 × 595 × 1536 mesh, achieving about one third of the machine’s peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.
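For readers unfamiliar with the solver named in the abstract, here is a minimal, unpreconditioned BiCGStab sketch in NumPy, applied to a tiny 1D Laplacian as a stand-in for the paper’s 600 × 595 × 1536 system. This is the textbook form of the algorithm (van der Vorst, 1992), not Cerebras’s implementation; the problem size and tolerances are illustrative.

```python
import numpy as np

def bicgstab(A, b, tol=1e-10, maxiter=500):
    """Minimal unpreconditioned BiCGStab iteration (van der Vorst, 1992)."""
    x = np.zeros_like(b)
    r = b - A @ x
    rhat = r.copy()                    # fixed shadow residual
    rho = alpha = omega = 1.0
    p = np.zeros_like(b)
    v = np.zeros_like(b)
    for _ in range(maxiter):
        rho_new = rhat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p                      # first of two matvecs per iteration
        alpha = rho_new / (rhat @ v)
        s = r - alpha * v
        t = A @ s                      # second matvec
        omega = (t @ s) / (t @ t)      # stabilizing line search
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

# Tiny 1D Laplacian as a stand-in for the paper's 3D stencil system.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = bicgstab(A, b)
```

Note that each iteration is dominated by two stencil matrix-vector products plus a few dot products and vector updates — very little arithmetic per byte moved, which is exactly why the CS-1’s on-wafer memory bandwidth pays off here.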
SOURCES- Arxiv, Cerebras
Written by Brian Wang, Nextbigfuture.com
Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technology and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting edge technologies, he is currently a Co-Founder of a startup and fundraiser for high potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker and guest at numerous interviews for radio and podcasts. He is open to public speaking and advising engagements.
14 thoughts on “Cerebras Trillion Transistor AI Wafer-Chip Crushes GPU Supercomputer by 200 Times”
Hmmm. The largest GPUs have more memory than the Cerebras chips. In fact Cerebras marketing material is off by about 4,000 times.
But that's marketing for you I guess. I wonder what the ratio of good to defective chips is- there is a reason why you don't make monster chips.
How long before the Chinese steal the technology and try to use it against us??
Hem, 0.86 petaflops, that is 860 teraflops. And with 84 dies, that makes about ~10 teraflops per die. Not that impressive.
Pressing problems? How about using it to make nano-manufacturing happen?
That hardware would be useful for a learning Mealy machine. I don't know, yields on neuromorphic hardware should not be very low, since a missing part doesn't prevent the whole circuit from functioning. Time will tell…
In the linked article, they mention specifically that they have working CS-1 systems at Lawrence Livermore National Laboratory, and National Energy Technology Laboratory. The article also claims they have working systems at numerous other "customers around the world". If those claims are true, it sure seems that they have a fair bit more than just a story. Seems a little odd that there has not been more prominent news about this, so there must be some details that are not being covered. Maybe the yields are so low that the systems are unreasonably expensive, or maybe they break down very quickly, or have some other practical problems. I'd sure like to learn more.
If they deliver on their promise, these chips will sell like hotcakes.