NVIDIA built its Selene supercomputing cluster in less than a month, in the weeks following the announcement of its new Ampere architecture and A100 artificial intelligence (AI) accelerators. Selene provides up to 1 exaFLOPS of AI performance and over 27 petaFLOPS on the High-Performance Linpack (HPL) benchmark. It ranks No. 2 on the Green500 list of energy-efficient supercomputers, exceeding 20 gigaFLOPS per watt. Selene uses 280 DGX A100 systems, for a total of 2,240 A100 GPUs and 35,840 processor cores.
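The totals quoted above can be checked with a little arithmetic. The sketch below assumes each DGX A100 holds eight A100 GPUs and two 64-core CPUs (figures from NVIDIA's DGX A100 specifications, not stated in this article). It also treats the rounded 27 petaFLOPS and 20 gigaFLOPS-per-watt numbers as exact, so the implied power draw is only a rough estimate.

```python
# Back-of-the-envelope check of the Selene figures quoted in the article.
DGX_SYSTEMS = 280        # from the article
GPUS_PER_DGX = 8         # assumption: DGX A100 spec, not stated in the article
CPUS_PER_DGX = 2         # assumption: DGX A100 spec, not stated in the article
CORES_PER_CPU = 64       # assumption: DGX A100 spec, not stated in the article

total_gpus = DGX_SYSTEMS * GPUS_PER_DGX
total_cores = DGX_SYSTEMS * CPUS_PER_DGX * CORES_PER_CPU

# Rounded figures from the article, treated as exact for this estimate.
HPL_FLOPS = 27e15            # "over 27 petaFLOPS"
FLOPS_PER_WATT = 20e9        # "exceeds 20 gigaFLOPS per watt"

# Power implied by performance divided by efficiency, in megawatts.
implied_power_mw = HPL_FLOPS / FLOPS_PER_WATT / 1e6

print(f"GPUs: {total_gpus}")
print(f"CPU cores: {total_cores}")
print(f"Implied power draw: about {implied_power_mw:.2f} MW")
```

Under these assumptions the GPU and core counts match the article, and the implied power draw during the benchmark run comes out at a little over one megawatt.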
More than 50 A100-powered servers from leading vendors around the world — including ASUS, Atos, Cisco, Dell Technologies, Fujitsu, GIGABYTE, Hewlett Packard Enterprise, Inspur, Lenovo, One Stop Systems, Quanta/QCT and Supermicro — are expected following last month’s launch of the NVIDIA Ampere architecture and the NVIDIA A100 GPU.
Third-Generation Tensor Cores
First introduced in the NVIDIA Volta™ architecture, NVIDIA Tensor Core technology has brought dramatic speedups to AI, bringing down training times from weeks to hours and providing massive acceleration to inference. The NVIDIA Ampere architecture builds upon these innovations by bringing two new precisions, Tensor Float 32 (TF32) and Floating Point 64 (FP64), to accelerate and simplify AI adoption and extend the power of Tensor Cores to HPC.
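From the programmer's point of view, TF32 works just like FP32: it takes FP32 inputs and keeps FP32's 8-bit exponent range while using a shorter 10-bit mantissa. NVIDIA says it delivers speedups of up to 20X for AI without requiring any code changes.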
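To get a feel for what TF32 trades away, here is a minimal sketch that emulates the format in plain Python. TF32 keeps FP32's sign bit and 8 exponent bits but only 10 of the 23 mantissa bits. The helper `to_tf32` is a hypothetical name, and it simply truncates the low 13 mantissa bits; the rounding used by real hardware may differ, so treat this as an approximation.

```python
import struct


def to_tf32(x: float) -> float:
    """Approximate TF32 by clearing the low 13 mantissa bits of an FP32 value.

    Assumption: simple truncation; actual hardware rounding may differ.
    """
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep sign (1 bit), exponent (8 bits) and the top 10 mantissa bits.
    bits &= 0xFFFFE000
    # Reinterpret the masked bits as an FP32 value again.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A step of 2**-10 above 1.0 survives, a step of 2**-11 is lost.
print(to_tf32(1.0 + 2**-10))
print(to_tf32(1.0 + 2**-11))
# Large magnitudes stay finite because the exponent range is unchanged.
print(to_tf32(1e38))
```

The point of the sketch is that dynamic range matches FP32 while precision is closer to FP16, which is why many networks train unchanged.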
Modern AI networks are big and getting bigger, with millions and in some cases billions of parameters. Not all of these parameters are needed for accurate predictions and inference, and some can be converted to zeros to make the models “sparse” without compromising accuracy. Tensor Cores in A100 can provide up to 2X higher performance for sparse models. While the sparsity feature more readily benefits AI inference, it can also be used to improve the performance of model training.
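The article does not name the sparsity pattern, but NVIDIA documents the A100 feature as 2:4 structured sparsity: in every group of four consecutive weights, at most two are non-zero. The sketch below shows one simple way to produce that pattern by keeping the two largest-magnitude weights in each group. The function name `prune_2_of_4` and the magnitude-based selection are illustrative assumptions; real workflows typically fine-tune the model after pruning to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights.

    Illustrative only: magnitude-based selection is an assumption, and
    practical pipelines retrain after pruning.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
print(prune_2_of_4(row))
```

Because exactly half of each group is zero, hardware can skip those multiplications, which is where the "up to 2X" figure comes from.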
Scaling applications across multiple GPUs requires extremely fast movement of data. The third generation of NVIDIA® NVLink® in A100 doubles the GPU-to-GPU direct bandwidth to 600 gigabytes per second (GB/s), almost 10X higher than PCIe Gen4. When paired with the latest generation of NVIDIA NVSwitch™, all GPUs in the server can talk to each other at full NVLink speed for incredibly fast data transfers.
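The "almost 10X" comparison can be reproduced with rough numbers. The sketch below assumes a PCIe Gen4 x16 link at roughly 64 GB/s of combined bidirectional bandwidth (a commonly quoted approximation, not a figure from this article) and uses a hypothetical 40 GB payload to show the difference in ideal transfer time. Real transfers will be slower because of protocol overhead.

```python
NVLINK3_GB_PER_S = 600.0    # from the article (total GPU-to-GPU bandwidth)
PCIE4_X16_GB_PER_S = 64.0   # assumption: approximate bidirectional PCIe Gen4 x16


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return size_gb / bandwidth_gb_per_s


ratio = NVLINK3_GB_PER_S / PCIE4_X16_GB_PER_S
payload_gb = 40.0  # hypothetical payload size, chosen only for illustration

print(f"NVLink / PCIe Gen4 bandwidth ratio: {ratio:.2f}x")
print(f"NVLink:    {transfer_seconds(payload_gb, NVLINK3_GB_PER_S) * 1000:.1f} ms")
print(f"PCIe Gen4: {transfer_seconds(payload_gb, PCIE4_X16_GB_PER_S) * 1000:.1f} ms")
```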
Smarter and Faster Memory
A100 brings massive amounts of compute to data centers. To keep those compute engines fully utilized, it has a class-leading 1.6 terabytes per second (TB/s) of memory bandwidth, a 67 percent increase over the previous generation. In addition, A100 has significantly more on-chip memory, including a 40 megabyte (MB) level 2 cache, 7X larger than the previous generation, to maximize compute performance.
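Working backwards from these percentages gives a rough picture of the previous generation. The sketch below uses only the rounded figures quoted above, so the results are approximate and should not be read as official specifications.

```python
A100_BANDWIDTH_TB_PER_S = 1.6   # from the article
BANDWIDTH_INCREASE = 0.67       # "67 percent increase", from the article
A100_L2_MB = 40.0               # from the article
L2_GROWTH_FACTOR = 7.0          # "7X larger", from the article

# new = old * (1 + increase), so old = new / (1 + increase)
implied_prev_bandwidth = A100_BANDWIDTH_TB_PER_S / (1.0 + BANDWIDTH_INCREASE)

# new = old * factor, so old = new / factor
implied_prev_l2_mb = A100_L2_MB / L2_GROWTH_FACTOR

print(f"Implied previous-gen bandwidth: about {implied_prev_bandwidth:.2f} TB/s")
print(f"Implied previous-gen L2 cache:  about {implied_prev_l2_mb:.1f} MB")
```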
Written By Brian Wang, Nextbigfuture.com