1.2 Trillion Transistors on a Wafer-Scale AI Chip

The Cerebras Wafer-Scale Engine (WSE) is the largest chip ever built, with 1.2 trillion transistors. It is the heart of a deep learning system.

It is 56x larger than any other chip and delivers more compute, more memory, and more communication bandwidth, enabling AI research at previously impossible speed and scale.

The Cerebras Wafer Scale Engine measures 46,225 square millimeters and has 1.2 trillion transistors and 400,000 AI-optimized cores.

By comparison, the largest graphics processing unit is 815 square millimeters and has 21.1 billion transistors.
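As a sanity check on the "56x" claim, here is a minimal Python sketch using only the die areas and transistor counts quoted in this article:

```python
# Figures as quoted in the article above (treat them as the article's claims).
wse_area_mm2 = 46_225        # Cerebras Wafer Scale Engine die area
gpu_area_mm2 = 815           # largest GPU die cited for comparison
wse_transistors = 1.2e12
gpu_transistors = 21.1e9

print(f"Die area ratio:   {wse_area_mm2 / gpu_area_mm2:.1f}x")        # ~56.7x
print(f"Transistor ratio: {wse_transistors / gpu_transistors:.1f}x")  # ~56.9x
```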

Andrew Feldman and the Cerebras team built the wafer-scale integrated chip, successfully solving issues of yield, power delivery, cross-reticle connectivity, packaging, and more. It has a 1,000x performance improvement over what is currently available, 3,000 times more high-speed on-chip memory, and 10,000 times more memory bandwidth.

It has a complex water-cooling system that uses an irrigation network to counteract the extreme heat generated by a chip running at 15 kilowatts of power.

The WSE has 18 GB of on-chip memory, all accessible within a single clock cycle, and provides 9 PB/s of memory bandwidth. This is 3,000x more capacity and 10,000x greater bandwidth than the leading competitor. More cores and more local memory enable fast, flexible computation at lower latency and with less energy.
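Dividing those aggregate figures across the 400,000 cores gives a feel for the per-core numbers; the even split below is an assumption for illustration, not a Cerebras specification:

```python
cores = 400_000
on_chip_memory_bytes = 18e9       # 18 GB of on-chip memory
memory_bandwidth_Bps = 9e15       # 9 PB/s aggregate memory bandwidth

# Assuming memory and bandwidth are spread evenly across cores:
per_core_memory_kB = on_chip_memory_bytes / cores / 1e3
per_core_bandwidth_GBps = memory_bandwidth_Bps / cores / 1e9

print(f"Memory per core:    {per_core_memory_kB:.0f} KB")         # ~45 KB
print(f"Bandwidth per core: {per_core_bandwidth_GBps:.1f} GB/s")  # ~22.5 GB/s
```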

The AI-optimized cores are connected entirely on silicon by the Swarm fabric in a 2D mesh with 100 petabits per second of bandwidth. Swarm delivers breakthrough bandwidth and low latency at a fraction of the power draw of traditional techniques used to cluster graphics processing units. Software configures all the cores on the WSE to support the precise communication required for training user-specified models.
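Cerebras has not published how Swarm routes traffic; purely as an illustration of how communication stays local on a 2D mesh, here is a generic dimension-ordered (XY) routing sketch in Python (the coordinates are a made-up example, not the WSE's actual layout):

```python
def xy_route(src, dst):
    """Generic dimension-ordered (XY) routing on a 2D mesh:
    step along the X axis first, then along Y. Returns the hop list."""
    (sx, sy), (dx, dy) = src, dst
    x, y, path = sx, sy, []
    while x != dx:                      # walk the X dimension
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# On a mesh, the hop count is just the Manhattan distance between cores.
hops = xy_route((0, 0), (12, 7))
print(f"{len(hops)} hops")              # -> 19 hops
```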

There is a nine-page white paper from Cerebras on the chip.

SOURCES- Cerebras
Written By Alvin Wang, Nextbigfuture.com

29 thoughts on “1.2 Trillion Transistors on a Wafer-Scale AI Chip”

  1. Yes, right, neural networks can have flaws but still be trained and used.
    If you've got thousands of neurons, a single neuron failure won't matter (much like our brains).

  2. The question I have is: at what sort of price point? I mean, in the end that's the important part.

  3. The trend in CPU design is in the other direction, with chiplets and high-bandwidth interconnects. Intel is now in trouble due to AMD's advancements in that field (or would be if AMD could deliver in volume). The reason is simply that smaller circuits result in better yields and thus become cheaper in mass production. A better price-per-performance ratio always wins in the long run.

    But those massive things would for sure be fun to have.

  4. There is no reason you can't have server racks of these things (big server racks) and scale that indefinitely. The interesting thing is that it completely takes away the memory bottleneck, until you get into the big server racks that is, but locally it would be super fast. I'm interested to see the numbers once they get this thing up and running.

  5. I would like to see a single or few “tile” version of this wafer-scale chip (it would seem that their system consists of 7×12 “tiles”). They claim to reach less than 1 pJ per bit of data (access?), and this energy consumption would be very useful even in a smaller system.

    An inference system with, say, 1 GB of data and “only” 100 TB/s of bandwidth would consume (assuming negligible static SRAM power consumption) about 150 W (600 GB/s/W; see below), which would be just perfect for an autonomous car… or a humanoid worker robot.

  6. Let’s see… 9 PB/s = 9*10^15 bytes/s. 9*10^15 bytes/s / (1.5*10^4 W) = 6*10^11 bytes/s/W = 600*10^9 bytes/s/W = 600 GB/s/W.

    Yup, you were right, I was off by a factor of 100. So this also means that SRAM is still on the table (and I see from their home page that they in fact use SRAM). Of course, the 15 kW also includes the power to perform the calculation, which means that the memory power is *less* than 1 W for every 600 GB/s of memory bandwidth.
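    The same check in Python, using the article's 9 PB/s and 15 kW figures (15 kW covers the compute too, so the per-bit number is only an upper bound on memory energy):

    ```python
    mem_bandwidth_Bps = 9e15      # 9 PB/s of memory bandwidth (article figure)
    total_power_W = 15e3          # 15 kW total system power (article figure)

    gb_per_s_per_watt = mem_bandwidth_Bps / total_power_W / 1e9
    pj_per_bit = total_power_W / (mem_bandwidth_Bps * 8) * 1e12

    print(f"{gb_per_s_per_watt:.0f} GB/s per W")         # -> 600 GB/s per W
    print(f"{pj_per_bit:.2f} pJ per bit (upper bound)")  # -> ~0.21 pJ/bit
    ```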

  7. Just leave it to the AI to create the time machine, only to realize that only organic things can be sent back in time. While it’s working on growing skin for its remote murder modules, you can sneak in, jump back in time, and somehow change the future. Just don’t expect it to stay changed…

  8. Well, Jan Tångring, why don’t you inform us of the correct facts instead of issuing a “warning”? It seems like you went for slander instead of useful information.

  9. That picture looks more like a PCB with metal traces than a silicon wafer. Silicon wafers are round and shiny and generally don’t have holes drilled through them.

  10. I wonder if the drill holes are “vias” for stacking wafers on top of each other to create a 3D “multi-wafer module”.

  11. And I also imagine that it takes time to upload and then download the data.

    All fine if you are running a fluid flow simulation (provided it isn’t something you need to keep secure from “other people’s computers”).
    But if you’re doing, say, control of a self-driving vehicle or something, dropping a connection as you go through a tunnel or behind a mountain isn’t really going to be suitable.

    A lot of “cloud is the future” stuff seems to come from people who live and work in Silicon valley and don’t understand that most of the world doesn’t have 100% reliable, 100% high speed wireless internet. “It works all the time in the lab!”

  12. 9 petabytes per second of memory bandwidth; RIP human brains for all tasks except creating better AI boards, which will soon also be done by AI, lol.

    Learn 2 code meme going to turn into Learn 2 create a time machine and go back before AI took virtually all jobs meme.

    No seriously, making your own time machine will probably be easier than getting a livable wage and benefits in the 2030s job market.

  13. WARNING: These commenters have not read up on the subject and are mainly making half-informed guesses:
    * Jennifer
    * Jan Jansson
    * James Bowery

  14. Still a dead end with monolithic designs. They just can’t scale indefinitely like interconnected architectures. AI lends itself well to parallel computing.
    Maybe real-time inferencing will be a temporary use case for this, but any processing on time scales above one second can be handled by distributed solutions.

    Wafer size very seldom increases. Process node size will perhaps drop to 2 nm. After that, there is 3D stacking. Maybe they can ultimately hit the same compute / volume ratio as a biological brain but they will never match the corresponding combined capacity of 5 billion interconnected brains.

  15. They don’t mention recurrent networks, which are the ultimate in deep networks. They’d be more interesting to me, at least, if they had some way of routing the signals in a circle. A 2.5D wafer-scale integration, perhaps: two layers, with one forward layer and through-chip vias to the return layer.

  16. The challenge in large chips is “yields”, i.e. a dust particle or other random defect causes a microscopic fault. In a chip with a large number of processors you can simply “route around” or deactivate failed processors. I suspect this is easier for neural network/AI wafer-scale systems.
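    As a rough Poisson-yield illustration of why that matters (the defect density is an assumed value, not a Cerebras number):

    ```python
    import math

    defects_per_cm2 = 0.1                 # assumed random defect density
    wafer_area_cm2 = 46_225 / 100         # WSE area from the article, in cm^2
    cores = 400_000

    expected_defects = defects_per_cm2 * wafer_area_cm2   # ~46 defects per wafer
    monolithic_yield = math.exp(-expected_defects)        # chance of a defect-free wafer
    cores_lost_fraction = expected_defects / cores        # if each defect kills one core

    print(f"Expected defects:          {expected_defects:.0f}")
    print(f"Defect-free wafer odds:    {monolithic_yield:.1e}")      # essentially zero
    print(f"Cores lost (route-around): {cores_lost_fraction:.4%}")   # ~0.01%
    ```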

  17. That’s a big economic advantage. Often chip fabs have to throw away more than half of the production, and the rest is often suboptimal (see Celeron, Duron).

  18. DRAMs would suggest using arrays of CPUs, but DRAMs have a nasty amount of latency; then you need SRAM cache as well to hide the latency. Better to use distributed SRAMs.

  19. Sounds like an IBM supercomputer project: using the entire silicon die without slicing the wafer into separate chips. I would imagine AI structures are almost as boring and regular to lay out as a DRAM memory, which lends itself to high repetition of a repeated structure and easy identification and disabling of faulty network connections. You can easily get around the usual yield fault problems when using the entire wafer, because the structure of the AI network is regular and repeated.

  20. Customary reminder: The “cloud” is just other people’s computers. Sometimes you don’t want your code running on somebody else’s computers.

  21. Congrats on the ability to carry around your grandfather clock with an industrial-scale refrigeration mechanism. Do you have a special pocket for it in your pants, or do you drag it behind you, tied to your wrist?

  22. Far easier just to use a cloud service. Cloud providers themselves will probably want less monolithic solutions. 15 kW on one piece of metal does seem extreme.

  23. Interesting. They achieve about 8.5 GB/s per Watt on their wafer, whereas the corresponding number for the Nvidia Titan V is about 4.5 GB/s per W. So I guess they are not using SRAM but rather DRAM, just as NVIDIA does.

    Tesla reaches about 40 GB/s per W on their inference chip (for cars; 1 TB/s per chip; 2 chips and a total of 50 W) by using SRAM instead.

    Bottom line: pretty good but not exceptional.
