HPU4Science achieves 12.5 to 20 TeraFLOPS and costs $30,000

Ars Technica describes the creation of the HPU4Science cluster. The project began with a $40,000 budget and the goal of building a viable scientific computation system with performance roughly equivalent to that of a $400,000 “turnkey” system made by NVIDIA, Dell, or IBM. The team spent $30,000 and built a system whose estimated computational power is 20 TFLOPS in theory and 12.5 TFLOPS in practice.

The HPU4Science system costs $1.80 per GFLOP for the hardware and $3 per TFLOP per day in electricity to operate.

In November 2010, the entry level for the Top500 supercomputer list moved up to 31.1 TFLOP/s on the Linpack benchmark, compared to 24.7 TFLOP/s six months earlier.

The cluster is built in a master-worker configuration in which the master dispatches jobs to the workers, compiles and processes the results, and handles data storage. The master is equipped with dual Intel Xeon processors, a four-SSD RAID array for short-term storage, and an array of five 2TB hard drives for archival storage. The networking is simple Gigabit Ethernet.
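The master-worker pattern described above can be illustrated with a minimal single-machine sketch. It assumes a thread pool as a stand-in for the networked workers; the names `run_job` and `master` and the toy workload are hypothetical and are not taken from the HPU4Science code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job):
    """Hypothetical worker task: sum the squares of the integers in a range."""
    start, stop = job
    return sum(n * n for n in range(start, stop))


def master(jobs, num_workers=3):
    """Dispatch jobs to a pool of workers and combine their partial results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() hands each job to a free worker and yields results in job order.
        partial_results = list(pool.map(run_job, jobs))
    # The master "compiles" the results; here that is a simple sum.
    return sum(partial_results)


if __name__ == "__main__":
    # Split the range 0..999 into four chunks, one job per chunk.
    jobs = [(0, 250), (250, 500), (500, 750), (750, 1000)]
    print(master(jobs))
```

On a real cluster the workers would be separate machines reached over the network, and the master would also handle storage, but the dispatch-then-collect shape is the same.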

Currently, there are three workers in the cluster running Intel i7 or Core 2 Quad processors and using GPUs for highly parallelized computation. In the last paper, the third and newest worker had four GTX 580s that give four TFLOPS of measured, peak computational performance (this equates to six TFLOPS of theoretical performance, which is the measure used for the Top500 supercomputers list). The hardware for a fourth worker with the same configuration as the third has just arrived, so the cluster will soon comprise a total of four workers with eight GTX 580s, three GTX 480s, three GTX 285s, a C1060 Tesla GPU, and a GTX 295 dual GPU. Some brand new GTX 590s are currently being ordered for a fifth worker, so the total computational power is still increasing.

The GTX 590 (40 nanometer process) should be 57% faster on graphics and 71% faster on memory than a GTX 580, and it uses 50% more electricity. A version of the cluster built entirely with GTX 590s should have 18 to 30 TFLOPS of performance.

The expectation is that GPUs based on the 28 nanometer Kepler architecture will be released by the fourth quarter of 2011. There would probably be a GTX 670, which should be three times faster and more energy efficient than the GTX 590.

The HPU4Science cluster could then upgrade its GPUs to achieve 54-90 TFLOPS before the end of 2011 and edge onto the Top500.
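The 54-90 TFLOPS projection follows from multiplying the 18 to 30 TFLOPS estimate for a GTX 590 based cluster by the expected threefold Kepler speedup. A small sketch of that arithmetic is shown here; the constant and function names are illustrative, and the comparison uses the November 2010 Top500 entry level, which was itself rising by several TFLOP/s every six months.

```python
# Figures quoted in the text above.
GTX590_CLUSTER_TFLOPS = (18.0, 30.0)  # estimated range for an all-GTX 590 cluster
KEPLER_SPEEDUP = 3.0                  # expected speedup over the GTX 590
TOP500_ENTRY_TFLOPS = 31.1            # Top500 entry level, November 2010


def project(tflops_range, speedup):
    """Scale a (low, high) TFLOPS range by a speedup factor."""
    low, high = tflops_range
    return (low * speedup, high * speedup)


def clears_entry_level(tflops_range, entry_level):
    """Return True if even the low end of the range exceeds the entry level."""
    return tflops_range[0] > entry_level


if __name__ == "__main__":
    projected = project(GTX590_CLUSTER_TFLOPS, KEPLER_SPEEDUP)
    print(projected)
    print(clears_entry_level(projected, TOP500_ENTRY_TFLOPS))
```

Note that the Top500 ranks systems by measured Linpack results, so a projection like this is only a rough indication.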

A cluster of this scale requires careful software selection to maximize the performance of the hardware. The Ars Technica article details the software choices for the HPU4Science cluster and discusses the areas where software and performance collide.

The master and all workers run Ubuntu server edition, installed with the minimal OpenSSH server profile. The current kernel on the system is 2.6.35. All of the system software, including Python, Sage, CUDA (a GPU programming library from NVIDIA), and standard development software like gcc, is installed using Python/Bash scripts and apt-get.
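A scripted install of this kind might look like the sketch below. It is not the team's actual script: the package list is an assumption made for illustration, and the helper names are hypothetical. The default dry-run mode only prints the command, so nothing is installed unless `dry_run=False` is passed.

```python
import subprocess

# Assumed package names for illustration only; the article does not list them.
PACKAGES = ["openssh-server", "gcc", "python"]


def build_install_command(packages):
    """Build a non-interactive apt-get install command as an argument list."""
    return ["apt-get", "install", "-y"] + list(packages)


def install(packages, dry_run=True):
    """Print the command in dry-run mode, otherwise run it (requires root)."""
    command = build_install_command(packages)
    if dry_run:
        print(" ".join(command))
        return command
    # check=True raises CalledProcessError if apt-get exits with an error.
    subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    install(PACKAGES)
```

Keeping the command construction separate from execution makes the same script easy to review and to reuse across the master and every worker.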

File system: BTRFS

To benchmark file system performance, fio was used to read and write files to SSDs and conventional hard drives. With regard to reading and writing, ext4 and BTRFS are on par: single SSD random read speed is 23.5MB/s for ext4 and 21.5MB/s for BTRFS. Random write speed was 10.4MB/s on ext4 and 10.2MB/s using BTRFS. For comparison, the random read and write speeds on a conventional HDD running BTRFS were both less than 1MB/s.
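To make "on par" concrete, the relative gap between the two file systems can be computed directly from the SSD figures quoted above. This is only arithmetic on the published numbers; the function name is illustrative.

```python
def percent_slower(baseline_mb_s, other_mb_s):
    """Return how much slower `other` is than `baseline`, in percent."""
    return (baseline_mb_s - other_mb_s) / baseline_mb_s * 100.0


if __name__ == "__main__":
    # Single-SSD random read: ext4 23.5MB/s vs BTRFS 21.5MB/s.
    print(round(percent_slower(23.5, 21.5), 1))
    # Single-SSD random write: ext4 10.4MB/s vs BTRFS 10.2MB/s.
    print(round(percent_slower(10.4, 10.2), 1))
```

BTRFS comes out roughly 8.5% behind ext4 on random reads and about 1.9% behind on random writes, small gaps next to the more than tenfold difference between the SSD and the conventional HDD.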

Primary programming language: Python

The major problem with Python on the CPU is execution speed, but the impact on overall system speed can be minimized by using Numeric Python (NumPy) functions that run optimized C code. The cluster also makes use of Cython, which lets you speed up existing Python code essentially by adding variable type declarations. The HPU4Science team regularly sees a 2000- or 4000-fold increase in speed simply by declaring variable types when moving from Python to Cython, and the resulting execution speed is similar to that of native C.
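As a minimal illustration of the NumPy half of this approach, the sketch below computes the same sum of squares twice: once with an interpreted Python loop and once with a single NumPy call that runs in compiled code. The function names are hypothetical, and no timing is claimed here since speedups depend heavily on the workload. Cython is not shown because it needs a separate compile step.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure Python: every multiply and add goes through the interpreter."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    """NumPy: the whole computation runs inside one compiled routine."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


if __name__ == "__main__":
    data = [float(i) for i in range(1000)]
    print(sum_of_squares_loop(data))
    print(sum_of_squares_numpy(data))
```

Both functions return the same value; the difference is only in where the loop executes.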


The research performed on the HPU4Science cluster is expected to require extensive mathematical exploration. The researchers are not trying to create new theorems, but they do use high-level math (say, level 3 on a log scale that ranges from 1 for Sudoku to 5 for Weinberg’s Quantum Field Theory). Therefore, the system must explore both numeric and symbolic math. Since the programming for the cluster is largely written in Python, it would be nice if the mathematical software interacted well with Python.

Sage is a computer algebra system (CAS) whose development was initiated by a number theorist. It combines the power of commercial CASs with Python: it is both written in Python and interprets Python. Sage is a combination of many open-source mathematical and scientific packages including Maxima, Octave, Numeric Python, Scilab, SymPy, Matplotlib, LaTeX, etc., bound together into a single framework that lets users work in a single language but access a wide universe of software.

