Nvidia GPU Lowers Supercomputer Costs Ten Times and Reduces Electricity Needed by 20 Times

New Nvidia Fermi-based Tesla products deliver the performance of a CPU-based cluster at one-tenth the cost and one-twentieth the power.

The Tesla 20-series GPUs combine parallel computing features that have never been offered on a single device before. These include:

* Support for the next-generation IEEE 754-2008 double-precision floating-point standard
* ECC (error-correcting code) memory for uncompromised reliability and accuracy
* Multi-level cache hierarchy with L1 and L2 caches
* Support for the C++ programming language
* Up to 1 terabyte of memory
* Concurrent kernel execution
* Fast context switching
* 10x faster atomic instructions
* 64-bit virtual address space
* System calls and recursive functions
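To make the double-precision claim concrete, here is a minimal sketch (not from the announcement) of the kind of kernel these features target: a double-precision DAXPY using the fused multiply-add that IEEE 754-2008 specifies. The kernel name, launch parameters, and problem size are all illustrative assumptions.

```cuda
// Illustrative sketch only: a double-precision kernel of the kind
// the Tesla 20-series is built to accelerate. Error checking is
// trimmed for brevity.
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes y[i] = a*x[i] + y[i] in full double precision,
// exercising the IEEE 754-2008 fused multiply-add path.
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fma(a, x[i], y[i]);
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    // ... fill x and y on the device, then launch 256-thread blocks ...
    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    return 0;
}
```

On pre-Fermi Tesla parts, double-precision throughput was a small fraction of single-precision throughput; the 20-series narrows that gap, which is what the 520-630 GFLOPS figures below refer to.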

The family of Tesla 20-series GPUs includes:

Tesla C2050 & C2070 GPU Computing Processors

* Single-GPU PCI-Express Gen-2 cards for workstation configurations
* Up to 3 GB and 6 GB (respectively) of on-board GDDR5 memory(i)
* Double-precision performance in the range of 520-630 GFLOPS

Tesla S2050 & S2070 GPU Computing Systems

* Four Tesla GPUs in a 1U system for cluster and datacenter deployments
* Up to 12 GB and 24 GB (respectively) of total on-board GDDR5 memory(ii)
* Double-precision performance in the range of 2.1-2.5 TFLOPS

The Tesla C2050 and C2070 products will retail for $2,499 and $3,999, respectively, and the Tesla S2050 and S2070 will retail for $12,995 and $18,995. Products will be available in Q2 2010.

A 22-page PDF on the Nvidia Fermi compute architecture outlines the design goals:

• Improve Double Precision Performance—while single-precision floating-point performance was on the order of ten times that of desktop CPUs, some GPU computing applications desired more double-precision performance as well.
• ECC support—ECC allows GPU computing users to safely deploy large numbers of GPUs in datacenter installations, and also ensures that data-sensitive applications like medical imaging and financial options pricing are protected from memory errors.
• True Cache Hierarchy—some parallel algorithms were unable to use the GPU’s shared memory, and users requested a true cache architecture to aid them.
• More Shared Memory—many CUDA programmers requested more than 16 KB of SM shared memory to speed up their applications.
• Faster Context Switching—users requested faster context switches between application programs and faster graphics and compute interoperation.
• Faster Atomic Operations—users requested faster read-modify-write atomic operations for their parallel algorithms.
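The shared-memory and atomics requests above can be sketched together in one common pattern: a per-block reduction staged in shared memory, finished with a single atomic per block. This is an illustrative example, not code from the whitepaper; the kernel name and 256-thread block size are assumptions.

```cuda
// Illustrative sketch: the reduce-then-atomic pattern that larger
// shared memory and faster read-modify-write atomics speed up.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float partial[256];          // one slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One read-modify-write atomic per block (atomicAdd on float
    // requires compute capability 2.0, i.e. Fermi or newer).
    if (tid == 0) atomicAdd(out, partial[0]);
}
```

With only one atomic issued per 256 inputs, most of the contention the "Faster Atomic Operations" bullet refers to is absorbed in shared memory first.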