Los Alamos Roadrunner supercomputer will run at over one petaflop/s sustained in 2009

Roadrunner is a cluster of approximately 3,250 compute nodes interconnected by an off-the-shelf parallel-computing network. Each compute node contains two dual-core AMD Opteron microprocessors, and each of the four Opteron cores is internally attached to its own enhanced Cell microprocessor. This enhanced Cell does double-precision arithmetic faster and can access more memory than the original Cell in a PlayStation 3. The entire machine will have almost 13,000 Cells and half as many dual-core Opterons.
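The arithmetic behind these counts can be checked directly from the rounded figures above:

```python
# Rough composition of Roadrunner, using the article's rounded figures.
nodes = 3250                 # approximate compute-node count
opterons_per_node = 2        # dual-core AMD Opterons per node
cores_per_opteron = 2
cells_per_node = opterons_per_node * cores_per_opteron  # one Cell per Opteron core

total_cells = nodes * cells_per_node
total_opterons = nodes * opterons_per_node

print(total_cells)     # 13000 -- "almost 13,000 Cells"
print(total_opterons)  # 6500  -- half as many dual-core Opterons
```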

Scientists at the Los Alamos government weapons lab will have built the world’s fastest computer, running at a sustained 1,000 trillion operations per second. Roadrunner will also be the first computer to run LINPACK—the universally recognized code used to test supercomputer performance—at over 1 petaflop/s. The Roadrunner supercomputer is scheduled for installation at Los Alamos starting in summer 2008, with full operation targeted for early 2009.

The $133 million Roadrunner was just assembled and tested by IBM, where it ran at 1.026 petaflop/s; it has since been disassembled for installation at Los Alamos.

The Cell microprocessor contains a PowerPC compute core that oversees all the system operations and a set of eight synergistic processing elements, known as SPEs, that are optimized both for image processing and for the arithmetic operations at the heart of numerical simulations. Each SPE is specialized to work on multiple data items at a time (a process called vector processing, or SIMD), which is very efficient for repetitive mathematical operations on well-defined groups of data.
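As a toy illustration of the SIMD idea (plain Python, not Cell-specific code), one operation below is applied across a whole group of data items at once, the way an SPE's vector unit applies one instruction across the lanes of a register:

```python
# Toy SIMD (single instruction, multiple data) sketch: one "instruction"
# -- here a lane-wise multiply-add -- acts on every lane of the vectors
# in a single step, instead of item by item with separate instructions.

def simd_fma(a, b, c):
    """Apply a multiply-add across all lanes of equal-length vectors."""
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

# Four-wide lanes, like a 128-bit register holding four 32-bit floats.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 10.0, 10.0, 10.0]
c = [0.5, 0.5, 0.5, 0.5]
print(simd_fma(a, b, c))  # [10.5, 20.5, 30.5, 40.5]
```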

At its base, Roadrunner is a standard cluster of microprocessors (in this case dual-core AMD Opterons). Nothing new here, except that each chip has two compute cores instead of one. The hybrid element enters the picture when each Opteron core is internally attached to another type of chip, the enhanced Cell (the PowerXCell 8i), designed specially for Roadrunner. The enhanced Cell can act like a turbocharger, potentially boosting performance up to 25 times over that of an Opteron compute core alone.

The rub is that achieving a good speedup (from 4 to 10 times) is not automatic. It comes about only if the programmers can get all the Cell and Opteron microprocessors and their memories working together efficiently.
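An Amdahl's-law-style sketch (an assumption for illustration, not Roadrunner's actual performance model) shows why the realized speedup falls well short of the 25× peak: only the fraction of the work that programmers manage to move onto the Cells gets the boost.

```python
# Amdahl's-law sketch: a fraction f of the work runs on the Cell with
# boost s; the remaining (1 - f) stays on the Opteron at original speed.
# The f values below are illustrative assumptions, not measured numbers.

def hybrid_speedup(f, s=25.0):
    """Overall speedup when fraction f of the work is accelerated s-fold."""
    return 1.0 / ((1.0 - f) + f / s)

for f in (0.80, 0.90, 0.95):
    print(f, round(hybrid_speedup(f), 1))  # roughly 4.3, 7.4, 11.4
```

Offloading 80–95% of the work yields only about a 4× to 11× overall gain, consistent with the 4-to-10-times range cited above.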

“We replace our high-performance supercomputers every 4 or 5 years,” says Andy White, longtime leader of supercomputer development at Los Alamos. “They become outdated in terms of speed, and the maintenance costs and failure rates get too high.”

The Cell was designed with enough computing power to enhance interactivity, allowing video games to be even less scripted. Its eight synergistic processing elements (SPEs) get around the speed barrier by working together. They can generate dynamic image sequences in record time, sequences that reflect the game player’s intention and even have the correct physics.

The Cell gets around the memory barrier as well. It does so by giving each SPE a small, fast local (on-chip) memory plus a memory engine, and by using an ultra-high-speed bus to move data within the Cell. The local memories store exactly the data and instructions needed to perform the next computations, while all eight memory engines act like runners, simultaneously retrieving from off-chip memory the data that will be needed for computations further down the line.
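The runner pattern described above is essentially double buffering: fetch the next block of data while computing on the current one. A minimal sequential sketch of the ordering (on real hardware the memory engine does the fetch concurrently with the compute; Python here only models the loop structure):

```python
# Double-buffering sketch of the SPE pattern: request the next block of
# data before computing on the current block, so that on hardware the
# fetch and the compute can overlap and hide off-chip memory latency.

def stream_compute(blocks, compute):
    results = []
    prefetched = blocks[0] if blocks else None  # memory engine fetches block 0
    for i in range(len(blocks)):
        current = prefetched
        # Issue the next fetch first; the hardware works on it in parallel.
        prefetched = blocks[i + 1] if i + 1 < len(blocks) else None
        results.append(compute(current))
    return results

print(stream_compute([[1, 2], [3, 4]], sum))  # [3, 7]
```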

Optimized for maximum computation per watt of electricity, the Cell looked like a good bet for accelerating supercomputing performance. Los Alamos knew, however, that the Cell would need some modifications for petaflop/s scientific computing. IBM was willing to work on the enhancements.

Japan’s NEC working towards 10 petaflop supercomputer

Tensilica’s configurable processors could make exaflop supercomputers practical and petaflop computers cheaper

Cell processors and FPGAs and GPGPUs compared

Substantial rearchitecting of supercomputers will likely be needed to make practical zettaflop computers. An extreme-computing conference in 2007 examined the issues, and it seems technologies like on-chip photonics are necessary to bring cost and power down to reasonable levels

Going beyond zettaflops to yottaflops and xeraflops would probably require all-optical computers or some other completely new architecture.