Cray ‘Cascade’ XC30 can scale to over 100 petaflops in 3 to 4 years

Register UK – The Aries chip has 217 million gates, and Cray has chosen Taiwan Semiconductor Manufacturing Co as the foundry to build it, using TSMC's 40 nanometer process for etching. Cray has used IBM as a foundry in the past for interconnect chips, but with the two being rivals in the supercomputer racket, Cray is no doubt more comfortable with TSMC these days.

The Aries chip has 184 SerDes lanes: 30 optical links, 90 electrical links, and 64 PCI-Express lanes. Cray has reserved the right to tweak the Aries chip if needed as part of its deal with Intel, and enhanced versions could come out between now and 2016. But Cray is making no promises.
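As a quick sanity check, those lane counts do sum to 184; here is a minimal tally in Python (the role annotations are our reading of how Dragonfly networks are typically wired, not Cray's official breakdown):

```python
# Tally of the Aries SerDes lane budget quoted in the article.
# The parenthetical roles are our assumption based on typical
# Dragonfly wiring, not a Cray-published breakdown.
ARIES_SERDES_LANES = {
    "optical (global links between groups)": 30,
    "electrical (local links within a group)": 90,
    "PCI-Express (to the four local nodes)": 64,
}

total = sum(ARIES_SERDES_LANES.values())
assert total == 184, "should match the quoted lane count"

for role, lanes in ARIES_SERDES_LANES.items():
    print(f"{role:40s} {lanes:3d} lanes ({lanes / total:5.1%})")
```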

Cray is still putting together the performance specs of Aries, but Cray's Barry Bolding dropped a few hints. The injection rate out of any PCI-Express 3.0 port into the Aries interconnect, on real workloads, is on the order of 8GB/sec to 10GB/sec. The more you load up the XC30 machine, the better it performs compared to the XE6 machine.

At the chassis level, the loaded injection bandwidth is on the order of 2X that of the XE machine, and on a big box with ten cabinets or more, where the whole system is in use, it's more on the order of 3X. The Dragonfly topology and Aries interconnect also have 20X the global bandwidth of the 3D torus-Gemini combo.
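To put the per-port number in context, here is a rough chassis-level aggregate (a sketch under our assumptions: the 8GB/sec to 10GB/sec figure above, and the 64-node chassis packaging described later in the article):

```python
# Back-of-the-envelope aggregate injection bandwidth for one chassis.
# Assumptions (ours): the quoted 8-10GB/sec per-node injection rate,
# and 4 nodes per blade x 16 blades per chassis from later in the article.
NODES_PER_CHASSIS = 4 * 16        # 64 nodes in one chassis
PER_NODE_RANGE_GBS = (8, 10)      # quoted injection rate on real workloads

low = NODES_PER_CHASSIS * PER_NODE_RANGE_GBS[0]
high = NODES_PER_CHASSIS * PER_NODE_RANGE_GBS[1]
print(f"Aggregate chassis injection: roughly {low}-{high} GB/sec "
      f"across {NODES_PER_CHASSIS} nodes")
```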

The other important thing about the Dragonfly topology is that it is easier to build out. On the Blue Waters machine that Cray is building, you can't just add one cabinet to the box, because the 3D torus has to be extended in a way that lets everything still route to everything else.

In fact, you have to add a row of a dozen cabinets at a time to keep the shape of the torus. With Aries, you can add a single group at a time, which is only two cabinets.
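A toy model of that upgrade granularity (our illustration, not Cray's planning tooling):

```python
# Toy comparison of upgrade granularity: a 3D torus like Blue Waters
# grows by a full row of cabinets to preserve its shape, while the
# Dragonfly grows by one two-cabinet Aries group at a time.
from math import ceil

TORUS_STEP = 12      # cabinets per torus row, per the article
DRAGONFLY_STEP = 2   # cabinets per Aries group

def smallest_upgrade(extra_cabinets: int, step: int) -> int:
    """Smallest topology-preserving upgrade covering `extra_cabinets`."""
    return ceil(extra_cabinets / step) * step

print(smallest_upgrade(1, TORUS_STEP))      # 12: a whole row on the torus
print(smallest_upgrade(1, DRAGONFLY_STEP))  # 2: one group on Dragonfly
```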

Assuming a certain level of CPU performance enhancement and the addition of GPU or x86 coprocessors, Bolding says that the Aries machine will be able to scale well above 100 petaflops over the next three to four years.
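The rough arithmetic behind that claim, as we read it (an extrapolation from the figures in this article, not Cray's roadmap):

```python
# Rough extrapolation (ours, not Cray's roadmap): with ~66 teraflops per
# cabinet today on Xeon E5 CPUs alone, a ~200-cabinet top-end system peaks
# near 13 petaflops, so clearing 100 petaflops needs roughly a 7.6x jump
# in per-cabinet peak from faster CPUs plus GPU/x86 coprocessors.
CABINET_TF_TODAY = 66      # initial per-cabinet peak from the spec table
TOP_END_CABINETS = 200     # "close to 200 cabinets" top-end configuration
TARGET_PF = 100

today_pf = CABINET_TF_TODAY * TOP_END_CABINETS / 1000
needed_tf = TARGET_PF * 1000 / TOP_END_CABINETS
print(f"CPU-only top end today: about {today_pf:.1f} petaflops")
print(f"Needed per cabinet for {TARGET_PF} PF: {needed_tf:.0f} TF, "
      f"about {needed_tf / CABINET_TF_TODAY:.1f}x today's peak")
```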

To add support for Intel’s Xeon Phi x86 coprocessor or Nvidia’s Tesla K20 coprocessor, you just take out half of one of the processor daughter cards with CPUs and add a new one in with accelerators. The CPU cards link to the accelerators over PCI-Express, and on the other side the CPUs link to Aries, also through PCI-Express.

Cray has announced the XC30. Early shipments are starting now, and systems are expected to be widely available in the first quarter of 2013.

Cascade uses custom cabinets that are slightly bigger than the industry standard, which is something you can get away with when you are selling supercomputers that cost tens to hundreds of millions of dollars and have data centers built around them.

There are four nodes on each Cascade blade, and sixteen of these blades go into a chassis. The cableless Rank 1 network links the blades in the chassis together. Three chassis go into a rack, and two racks side by side make up a Cascade group.

The group uses a passive electrical network to link the nodes together through the Aries interconnect controllers. Using current Xeon E5 processors and no x86 or GPU accelerators, these two cabinets yield around 120 teraflops of peak computing performance.
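Working the packaging math through (a sketch; the two-sockets-per-node count is our assumption, chosen because it reproduces the 384-processors-per-cabinet figure in the spec table below):

```python
# Packaging arithmetic from the article. The dual-socket node count is
# our assumption; it reproduces the 384-processors-per-cabinet figure
# in the Cray XC30 spec table.
NODES_PER_BLADE = 4
BLADES_PER_CHASSIS = 16
CHASSIS_PER_CABINET = 3
CABINETS_PER_GROUP = 2
SOCKETS_PER_NODE = 2       # assumption: dual-socket Xeon E5 nodes

nodes_per_cabinet = NODES_PER_BLADE * BLADES_PER_CHASSIS * CHASSIS_PER_CABINET
sockets_per_cabinet = nodes_per_cabinet * SOCKETS_PER_NODE
tf_per_cabinet = 120 / CABINETS_PER_GROUP   # 120 TF quoted per two-cabinet group

print(f"{nodes_per_cabinet} nodes, {sockets_per_cabinet} sockets per cabinet")
print(f"~{tf_per_cabinet:.0f} teraflops peak per cabinet (CPU-only)")
```

The 60 teraflops per cabinet this yields is in the same ballpark as the 66 teraflops initial per-cabinet peak in the spec table.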

An XC30 system can scale to hundreds of cabinets and uses an active optical network, which also hangs off those Aries chips, to link nodes to each other. A top-end system would have close to 200 cabinets, the same size as the “Jaguar” and now “Titan” supercomputers at Oak Ridge, but would offer twice the number of x86 sockets per cabinet thanks to denser packaging.

Cray XC30 Specifications

Processor: 64-bit Intel® Xeon® E5-2600 Series processors; up to 384 per cabinet
Memory: 32-128GB per node; memory bandwidth up to 117GB/s per node
Compute cabinet: initially up to 3,072 processor cores per cabinet, upgradeable; peak performance initially up to 66 teraflops per cabinet
Interconnect: one Aries routing and communications ASIC per four compute nodes; 48 switch ports per Aries chip (500GB/s switching capacity per chip)
Topology: Dragonfly interconnect with low latency and high bandwidth
System administration: Cray System Management Workstation (SMW)
