Nvidia’s Next Generation Fermi GPU Unveiled


HPCWire reports: Nvidia CEO Jen-Hsun Huang unveiled a seriously revamped graphics processor architecture representing the biggest step forward for general-purpose GPU computing since the introduction of CUDA in 2006.

The new architecture, codenamed “Fermi,” incorporates a number of new features aimed at technical computing, including support for Error Correcting Code (ECC) memory and greatly enhanced double precision (DP) floating point performance. Those additions remove the two major limitations of current GPU architectures for the high performance computing realm, and position the new GPU as a true general-purpose floating point accelerator. Sumit Gupta, senior product manager for NVIDIA’s Tesla GPU Computing Group, characterized the new architecture as “a dramatic step function for GPU computing.” According to him, Fermi will be the basis of all NVIDIA’s GPU offerings (Tesla, GeForce, Quadro, etc.) going forward, although the first products will not hit the streets until sometime next year.

Oak Ridge National Laboratory Plans Petaflops and Exaflops Using Nvidia
Oak Ridge National Laboratory (ORNL) announced plans today for a new supercomputer that will use NVIDIA’s next-generation CUDA GPU architecture, codenamed “Fermi”. Used to pursue research in areas such as energy and climate change, ORNL’s supercomputer is expected to be ten times more powerful than today’s fastest supercomputer, which would put it at roughly 20 petaflops.

From CNET:

“With the help of Nvidia technology, Oak Ridge proposes to create a computing platform that will deliver exascale computing within ten years,” said Jeff Nichols, Oak Ridge’s associate lab director for Computing and Computational Sciences.

Oak Ridge also announced it will be creating the Hybrid Multicore Consortium, focused on computing across different types of processor architectures. The goal of the consortium is to work with developers to get applications running on the next generation of supercomputers built from both CPUs and GPUs.

Nexus Programming Environment
NVIDIA also introduced Nexus, the industry’s first integrated GPU/CPU development environment for developers working with Microsoft Visual Studio.

NVIDIA Nexus radically improves productivity by enabling developers of GPU computing applications to use the popular Microsoft Visual Studio-based tools and workflow in a transparent manner, without having to create a separate version of the application that incorporates diagnostic software calls. NVIDIA Nexus also includes the ability to run the code remotely on a different computer. Nexus includes advanced tools for simultaneously analyzing efficiency, performance, and speed of both the graphics processing unit (GPU) and central processing unit (CPU) to give developers immediate insight into how co-processing affects their applications.

Nexus is composed of three components:

* The Nexus Debugger is a source-level debugger for GPU code such as CUDA C, HLSL and DirectCompute. It supports source breakpoints, data breakpoints and direct GPU memory inspection, and all debugging is performed directly on the hardware (a minimal kernel of the kind it steps through is sketched after this list).
* The Nexus Analyzer is a system-wide performance tool for viewing GPU events (kernels, API calls, memory transfers) and CPU events (core allocation, threads and process events and waits)—all on a single, correlated timeline.
* The Nexus Graphics Inspector lets developers debug and profile frames rendered using APIs such as Direct3D. Developers can use the Graphics Inspector to scrub through draw calls and examine the textures, vertex buffers, and API state of the entire frame.
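As a concrete illustration of that debugging workflow, here is a minimal CUDA C kernel of the sort a source-level GPU debugger like the Nexus Debugger can step through on hardware. The kernel, names, and launch parameters are illustrative, not taken from NVIDIA’s materials:

```cuda
// saxpy.cu -- a minimal kernel to step through in a source-level GPU debugger.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // set a source breakpoint here and
                                  // inspect y[i] directly in GPU memory
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    printf("done\n");
    return 0;
}
```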

Besides ECC and a big boost in floating point performance, Fermi also more than doubles the number of cores (from 240 to 512), adds L1 and L2 caches, supports the faster GDDR5 memory, and increases memory reach to one terabyte. NVIDIA has also tweaked the hardware to enable greater concurrency and utilization of chip resources. In a nutshell, NVIDIA is making its GPUs a lot more like CPUs, while expanding the floating point capabilities.
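Several of these properties are visible to host code through the CUDA runtime. A minimal sketch (the field names come from the runtime’s cudaDeviceProp struct; which fields are populated depends on the CUDA version):

```cuda
// query_gpu.cu -- inspect device features from CUDA host code.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device:           %s\n", prop.name);
    printf("Compute cap.:     %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
    printf("Global memory:    %zu MB\n", prop.totalGlobalMem >> 20);
    printf("ECC enabled:      %s\n", prop.ECCEnabled ? "yes" : "no");
    printf("L2 cache:         %d KB\n", prop.l2CacheSize >> 10);
    return 0;
}
```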

Fermi supports 64-bit addressing, so memory reach is now a terabyte. Although it’s not yet practical to place that much DRAM on a GPU card, memory capacities will surely exceed the 4 GB per GPU limit in the current Tesla S1070 and C1060 products. For data-constrained applications, the larger memory capacities will lessen the need for repeated data exchanges between the CPU and the GPU, since more of the data can be kept local to the GPU. This should help boost overall performance for many applications, especially seismic processing, medical imaging, 3D electromagnetic simulation and image searching.
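A minimal sketch of why that locality matters, assuming an illustrative iterative workload: the working set is copied to the GPU once, many kernels then operate on it in place, and the result comes back once, instead of crossing the PCIe bus every iteration. Kernel and sizes are made up for illustration:

```cuda
// resident.cu -- keep the working set on the GPU across many kernel launches.
#include <cuda_runtime.h>

__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;  // stand-in for real work
}

int main()
{
    const int n = 1 << 24;                        // ~64 MB working set
    float *host = new float[n]();                 // zero-initialized
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));

    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);  // once
    for (int iter = 0; iter < 1000; ++iter)       // many kernels, no transfers
        step<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // once

    cudaFree(dev);
    delete[] host;
    return 0;
}
```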

The GT200 architecture has a 1:8 performance ratio of double precision to single precision, which is why the current Tesla products don’t even manage to top 100 DP peak gigaflops per GPU. The new architecture changes this ratio to 1:2, which represents a more natural arrangement (inasmuch as double precision uses twice the number of bits as single precision). That ratio change alone is a 4-fold gain, and because NVIDIA has also doubled the total core count, DP performance will enjoy an 8-fold increase. By the time the next Tesla products appear, we should be seeing peak DP floating point performance somewhere between 500 gigaflops and 1 teraflop per GPU.
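The arithmetic behind those estimates, as a back-of-envelope sketch (the Fermi clock rate here is an assumption for illustration; shipping clocks had not been announced):

```cuda
// peak_dp.cu -- back-of-envelope peak double precision estimates.
#include <cstdio>

int main()
{
    // GT200: 240 cores, 2 flops/core/clock (FMA), ~1.296 GHz, 1:8 DP:SP ratio
    double gt200_dp = 240 * 2 * 1.296 / 8;   // ~77.8 DP gigaflops
    // Fermi: 512 cores, 1:2 DP:SP ratio; 1.5 GHz is an ASSUMED clock
    double fermi_dp = 512 * 2 * 1.5 / 2;     // ~768 DP gigaflops
    printf("GT200 peak DP: %.0f GFLOPS\n", gt200_dp);
    printf("Fermi peak DP (assumed 1.5 GHz): %.0f GFLOPS\n", fermi_dp);
    return 0;
}
```

At that assumed clock, the estimate lands squarely in the 500 gigaflops to 1 teraflop range cited above.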

Nvidia GPU Technology Conference webcast

PC Perspective has more on the Fermi architecture:

The first implementation of this architecture, which we are tentatively calling GT300, will have some impressive raw specifications. The GPU is made up of 3.0 billion transistors and features 512 CUDA processing cores organized into 16 streaming multiprocessors of 32 cores each. The memory architecture is built around a new GDDR5 implementation with six 64-bit channels, for a total memory bus width of 384 bits. The memory system can technically support up to 6 GB of memory as well – something that is key for HPC applications.
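For a sense of what that bus width implies, a back-of-envelope bandwidth sketch; the effective GDDR5 data rate used here is purely an assumption, since memory clocks had not been announced:

```cuda
// bandwidth.cu -- rough peak bandwidth of a 384-bit GDDR5 bus.
#include <cstdio>

int main()
{
    const double bus_bytes = 384.0 / 8.0;    // 48 bytes per transfer
    const double gtps      = 4.0;            // ASSUMED GT/s per pin (illustrative)
    printf("Peak bandwidth: ~%.0f GB/s\n", bus_bytes * gtps);  // ~192 GB/s
    return 0;
}
```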

NVIDIA claims the GT300 will be 4.25x faster than the GT200. That puts the GT300 at about 330 GFLOPS of double precision performance (4.25 times the GT200’s 78 GFLOPS). While definitely an impressive improvement, AMD’s new Evergreen family reaches a theoretical peak of 544 GFLOPS of double precision performance.
