1000X Regular Chips from New Cerebras AI Wafer Chip

Cerebras has launched its second-generation wafer-scale chip, the WSE-2. It is built on a 7-nanometer process instead of the 16-nanometer process used two years ago for the WSE-1.

The WSE-2 covers 46,225 square millimeters — 56 times larger than the largest graphics processing unit. With 850,000 cores, 40 gigabytes of on-chip SRAM, 20 petabytes/sec of memory bandwidth, and 220 petabits/sec of interconnect bandwidth, the WSE-2 contains 123 times more compute cores, 1,000 times more high-speed on-chip memory, 12,862 times more memory bandwidth, and 45,833 times more fabric bandwidth than its graphics-processing competitor. In effect, it provides the compute capacity of an entire cluster in a single chip, without the cost, complexity, and bandwidth bottlenecks involved in lashing together hundreds of smaller devices.
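As a rough sanity check on those multipliers, here is a minimal Python sketch. It assumes the comparison GPU is an NVIDIA A100 (826 mm² die, 6,912 CUDA cores, 40 MB of on-chip L2, ~1.555 TB/s of HBM2 bandwidth, ~600 GB/s of NVLink); those A100 figures are an assumption for illustration, not numbers Cerebras states.

```python
# Reproduce the WSE-2 vs. GPU ratios quoted above.
# The A100 comparison figures below are assumptions, not Cerebras data.
wse2 = {"die area (mm^2)": 46_225, "cores": 850_000,
        "on-chip SRAM (GB)": 40, "memory BW (PB/s)": 20,
        "fabric BW (Pbit/s)": 220}
a100 = {"die area (mm^2)": 826, "cores": 6_912,
        "on-chip SRAM (GB)": 0.04,        # 40 MB of L2
        "memory BW (PB/s)": 1.555e-3,     # ~1.555 TB/s HBM2
        "fabric BW (Pbit/s)": 4.8e-3}     # ~600 GB/s NVLink = 4.8 Tbit/s

for key in wse2:
    print(f"{key}: {wse2[key] / a100[key]:,.0f}x")
# -> ~56x, ~123x, 1,000x, ~12,862x, ~45,833x
```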

The Cerebras CS-2 eliminates the primary impediment to the advancement of artificial intelligence, reducing the time it takes to train models from months to minutes and from weeks to seconds, allowing researchers to be vastly more productive. In so doing, the CS-2 reduces the cost of curiosity, accelerating the arrival of the new ideas and techniques that will usher in tomorrow’s AI.

Cerebras has ~300 staff across Toronto, San Diego, Tokyo, and San Francisco. Cerebras is already profitable, with dozens of customers. Beyond AI, Cerebras is drawing revenue from commercial high-performance computing markets, such as oil and gas and genomics. CS-2 deployments will begin in Q3 of this year, and the price has risen from ~$2-3 million to ‘several’ million.

SOURCES: Cerebras, AnandTech
Written By Brian Wang, Nextbigfuture.com

21 thoughts on “1000X Regular Chips from New Cerebras AI Wafer Chip”

  1. 2X increase in 2 years IS exponential progress. That's the classic Moore's law speed that is the standard example of exponential progress.

  2. Seems like a direction Tesla might go for designing custom silicon for its next-generation Dojo supercomputer. In 3 nm it would improve by about the same proportions as Cerebras 2 over 1. Tesla might well have billions to throw at it in a couple of years. It seems like this next-generation Dojo will be tasked with training robot NNs of all sorts for the Tesla Alien Dreadnought factories and the general mobile-robot revolution that will follow Tesla’s robust solution to computer vision with its Tesla Network Robotaxi-level FSD.

  3. I think that's more of a feature than a bug. It may speed loading for people not interested in the comments.

  4. I kind of remember 'talking' about this on some other occasion, perhaps years back. Dunno… memories of what I write get so indistinct that I am sometimes actually surprised to read a whole comment from 'someone', only to find it was I who wrote it 3+ years back.  Arrrggghhhh…

    Anyway, yes. DRAM + computing in a single chip.  

    There is 'another kind of AI' which can be modeled as a really, really large in-memory dataset, and really, really fast (and ideally hugely parallel) REGEX pattern matching engines to 'go find' complex things in the same DRAM store. 

    What better than to incorporate purpose-built REGEX++ grain-of-salt cores into DRAM?  Data comes 'out' of DRAM arrays in huge swaths … 4096 bits to 65536 (or perhaps higher!) 'rows per column'.  It can be gated well into the GHz regime, so potentially 10s of TByte/sec … per chip … bandwidth.  

    Or even higher, given parallelism, and the fact that DRAM chips are subdivided themselves into many (4, 16, 64…) sub-chips, each of which offers the same bandwidth.

    Clipping these picayune monsters into one's PC, just like DRAM sticks today, would bring thousands-to-millions of REGEX++ cores to bear. Any given search might take many microseconds. Hundreds. But in the same 100 µs, simultaneously another 100,000 completely different searches would also be executed.  All of memory, up to petabytes… in 100 µs. 
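    A quick back-of-the-envelope in Python on those numbers; the row width, clock, and chip count here are illustrative assumptions, not any real part's datasheet:

    ```python
    # Streaming bandwidth of a hypothetical in-DRAM REGEX engine.
    row_bits   = 65_536          # widest row readout floated above
    row_clk_hz = 2e9             # "well into the GHz regime"
    per_chip_tb_s = row_bits * row_clk_hz / 8 / 1e12
    print(f"~{per_chip_tb_s:.0f} TB/s per chip")   # ~16 TB/s

    # Memory swept by a modest set of such chips in one search window.
    chips, window_s = 16, 100e-6
    swept_gb = chips * per_chip_tb_s * 1e12 * window_s / 1e9
    print(f"{chips} chips sweep ~{swept_gb:.0f} GB per 100 µs window")  # ~26 GB
    ```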

    That's some AI horsepower.
    Even if AI needs to be reinvented.

    ⋅-⋅-⋅ Just saying, ⋅-⋅-⋅
    ⋅-=≡ GoatGuy ✓ ≡=-⋅

  5. Vendors have been successfully dealing with flawed processing cores since they started pursuing many-core strategies over increased clock speeds.
    Produce an 8-core chip with 1 to 4 flawed cores, sell it as a 6- or 4-core variant.
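    A toy sketch of that binning logic in Python (hypothetical tiers, not any vendor's actual SKU table):

    ```python
    # Map the number of good cores on a die to a sellable SKU (hypothetical).
    def bin_sku(good_cores):
        if good_cores >= 8:
            return "8-core"      # fully working die
        if good_cores >= 6:
            return "6-core"      # 1-2 flawed cores fused off
        if good_cores >= 4:
            return "4-core"      # up to 4 flawed cores fused off
        return None              # too many defects; scrap the die

    print(bin_sku(7))  # -> "6-core"
    ```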

    Moving data around through assorted interconnects will soon become the main barrier. I like the idea of processing-in-memory as a solution to the von Neumann bottleneck.

  6. I think we have to treat this 40 GB of on-chip memory as cache memory. The system can have much larger amounts of "conventional" RAM, but the system has a huge cache.

  7. I checked out some more facts. The previous generation, CS-1, could reach about 0.85 petaflops, which is not particularly impressive. But the "fabric" bandwidth did increase a lot between CS-1 and CS-2, by a factor of about 400, so perhaps the effective computing power has increased as well?

    It would seem that the only way the CS-2 can work with more data than the 40 GB of SRAM is by using the 12×100 GbE channels, which would then limit the system to roughly 150 GB/s (about 1.2 terabits/s) of throughput for larger data sets. Comparing that with the DGX A100 system from NVIDIA, which has 320 GB of RAM for a "mere" $200k, it's not obvious that the CS-2 actually brings more "bang for the buck". And yes, the memory bandwidth of the 40 GB of SRAM is fantastic, but I wonder if this really results in faster effective training compared to a cluster of A100 systems? Say, 25 A100 systems? That cluster would have about 8 TB of RAM.
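    A rough cut at those numbers in Python; the DGX A100 capacity and the link math are assumptions for illustration:

    ```python
    # CS-2 external ingest vs. on-chip bandwidth, plus an assumed A100 cluster.
    ingest_gb_s = 12 * 100 / 8        # 12 x 100 GbE -> ~150 GB/s
    onchip_gb_s = 20e6                # 20 PB/s of SRAM bandwidth, in GB/s
    print(f"CS-2 ingest: ~{ingest_gb_s:.0f} GB/s; "
          f"on-chip SRAM is ~{onchip_gb_s / ingest_gb_s:,.0f}x faster")

    # Assumed cluster: 25 DGX A100 systems at 320 GB of HBM2 each.
    systems, gb_each = 25, 320
    print(f"A100 cluster capacity: ~{systems * gb_each / 1000:.0f} TB")
    ```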

  8. Paying, say, 5 million for a system with 40 GB of memory is very expensive. Yes, this memory is exceedingly fast, but it's still kind of small.

    If you have larger systems or more training data, the training will be limited by the bus to external memory. And I guess this will almost always be the case.

    I would rather have an optical RAM bus so you could scale the system to thousands of GB without reaching excessive power consumption. Or possibly stacking SRAM chips on top of conventional neural processors. Let's say 10 GB of SRAM per processor?

  9. I volunteer to test drive / review one. You know, if they'd give me one. Well, that's how it works for PC components. 🙂

  10. Hmm… maybe, maybe not.  

    The chief advantage of using the full wafer scale is that the myriad (and I mean gazillion!) interconnects between 'cores' and 'execution units' are reduced in scale to the components of the wafer's devices, and those in turn can be 'gazillions', as may make sense to allow stream-oriented and also multiple-in-parallel processing to happen.  

    However, I envision that the same could also be done without having all-one-wafer in tow. We already commercially have chip substrates of silicon (which 'solves' the thermal-expansion differential problem) for multiple-chiplet implementations. AMD is stuffing 8 chiplets … and a data router … on a single socket substrate. It is nowhere near the limit of this reality. Moreover, between the central interface-router-cache and the chiplets, the logic levels and speeds are almost those of 'native'.  So not much is really lost.  

    I could easily see much larger substrates, say 100 × 100 or 150 × 150 mm, having hundreds-to-thousands of chiplets, accomplishing a very nearly competing design.  
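    Rough area math on that guess in Python; the substrate size, chiplet area, and packing overhead are assumptions for illustration:

    ```python
    # Chiplets that fit on a large silicon substrate, assuming ~20% of the
    # area goes to routing, spacing, and a central router die.
    substrate_mm = 150
    chiplet_mm2  = 80                  # roughly a Zen-class compute chiplet
    usable_mm2   = substrate_mm ** 2 * 0.80
    print(f"~{usable_mm2 / chiplet_mm2:.0f} chiplets")   # ~225 on 150 × 150 mm
    ```

    Smaller chiplets (say 20 mm²) would push that toward ~900, consistent with the hundreds-to-thousands range above.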

    Of course, with ever higher integration, there would be 'dead chiplet' issues. But following Cerebras design ideals, such mightn't have any significantly negative impact on overall mean processing.  

    ⋅-⋅-⋅ Just saying, ⋅-⋅-⋅
    ⋅-=≡ GoatGuy ✓ ≡=-⋅

  11. Hey Brian,
    Just to let you know: some people may have this problem, others not, but for me
    the comments are still not visible; you must first click the Load Comments button, and only then do they appear. This additional step is quite annoying and very inconvenient,
    because you never see the comments below the article and always have to click before seeing them.

  12. If this works out, others will follow fast.
    There is not nearly as much room to scale up as there was to scale down.
    It doesn't take a lot of doublings before a chip reaches ungainly sizes.
    I could imagine the processing power of a 2-3 nm 3D chip the size of a microwave oven. It could probably play the obligatory Crysis on max settings.
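    A quick Python check of how few doublings that takes; the starting point is the WSE-2 die, and the "ungainly" threshold of 0.25 m² is an assumption:

    ```python
    # Area doublings from a WSE-2-sized die (46,225 mm^2) to an assumed
    # "ungainly" 0.25 m^2 (250,000 mm^2).
    area_mm2, doublings = 46_225, 0
    while area_mm2 < 250_000:
        area_mm2 *= 2
        doublings += 1
    print(doublings)   # -> 3
    ```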

  13. Only a 2x increase in 2 years; kinda slow progress by 2021 tech-speed standards.
    Hope the next jump will be bigger

  14. That system is so expensive that even renting it by the hour for a few hours would probably be prohibitive. You would think that some wafers would have flaws and could be sold for less, which should still be good money. They should not go for selling just a few to fat corporations.
    They could sell a lot of these at the $100G level, or even the $250G level, I think. Still corporate customers, but even small corporations could at least hope to be able to rent one as needed.

  15. Putting a whole computer system or super chip on a silicon wafer was an idea developed in 1985 by Sir Clive Sinclair of ZX Spectrum home-computer fame. His spin-off company developing wafer-scale integration was called Anamartic. I wonder if any of his patents are still valid 35 years later. CMOS image sensors have been stitching fields across silicon wafers for decades for use by astronomers.
