The Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) has a supercomputer housed in a shipping container that delivers 6 petaFLOPS of performance. It will be used for both machine learning training and inference workloads, and it has 1.3 petabytes of solid-state storage.
The shipping container supercomputer system has:
* 22 nodes for machine learning training workloads, each with two IBM Power9 processors, 512GB of system memory, 8 Nvidia V100 GPUs (each with 32GB of high-bandwidth memory), and 15TB of local solid-state storage
* 128 nodes for inference workloads, each with two IBM Power9 processors, 256GB of system memory, 4 Nvidia T4 GPUs (each with 16GB of high-bandwidth memory), and 4TB of local solid-state storage
* Three solid-state parallel file systems, totaling 1.3 PB
* A 100 gigabit per second InfiniBand network, as well as dual 10 gigabit Ethernet networks
* Platform LSF HPC job scheduling integrated with a Kubernetes container orchestration solution
* Integrated support for TensorFlow, PyTorch, and Caffe, in addition to traditional HPC libraries and toolsets, including FFTW and Dakota
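To make the scale of the two node pools easier to compare, the per-node figures in the list above can be multiplied out. This is only a back-of-the-envelope sketch: the dictionary layout and the names `NODE_GROUPS` and `totals` are illustrative, and it treats the quoted GPU memory figure as per GPU.

```python
# Node counts and per-node figures are copied from the list above.
# The structure and names are illustrative, not an official inventory.
NODE_GROUPS = {
    "training": {"nodes": 22, "gpus_per_node": 8, "gpu_mem_gb": 32,
                 "system_mem_gb": 512, "local_ssd_tb": 15},
    "inference": {"nodes": 128, "gpus_per_node": 4, "gpu_mem_gb": 16,
                  "system_mem_gb": 256, "local_ssd_tb": 4},
}


def totals(group):
    """Multiply per-node figures by the node count for one pool."""
    nodes = group["nodes"]
    gpus = nodes * group["gpus_per_node"]
    return {
        "gpus": gpus,
        # Assumption: the quoted GPU memory is per GPU, not per node.
        "gpu_mem_gb": gpus * group["gpu_mem_gb"],
        "system_mem_gb": nodes * group["system_mem_gb"],
        "local_ssd_tb": nodes * group["local_ssd_tb"],
    }


summary = {name: totals(group) for name, group in NODE_GROUPS.items()}
for name, pool in summary.items():
    print(name, pool)
```

Under these assumptions the training pool comes to 176 V100 GPUs and the inference pool to 512 T4 GPUs. The local solid-state storage summed here is per-node storage, which the list gives separately from the 1.3 PB of parallel file systems.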
The DoD HPCMP aims to have:
* a 100 petaFLOPS system by 2025
* a cognitive production system in 2026
* an exaFLOPS system in 2031
* a 10 exaFLOPS system and a quantum pilot in 2036
* a quantum production system in 2040
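The roadmap's performance milestones imply a growth rate that can be worked out directly. The sketch below is a rough illustration only: it reads "exaflops" as 1,000 petaFLOPS, assumes each milestone refers to the peak performance of a single system, and uses the hypothetical names `MILESTONES` and `implied_growth`.

```python
import math

# (year, petaFLOPS) pairs taken from the roadmap above.
# Assumption: 1 exaFLOPS = 1,000 petaFLOPS, single-system peak figures.
MILESTONES = [(2025, 100.0), (2031, 1_000.0), (2036, 10_000.0)]


def implied_growth(start, end):
    """Return (annual growth factor, doubling time in years) between two milestones."""
    (year0, pflops0), (year1, pflops1) = start, end
    years = year1 - year0
    annual_factor = (pflops1 / pflops0) ** (1.0 / years)
    doubling_years = math.log(2) / math.log(annual_factor)
    return annual_factor, doubling_years


growth = [implied_growth(a, b) for a, b in zip(MILESTONES, MILESTONES[1:])]
for (start, end), (factor, doubling) in zip(zip(MILESTONES, MILESTONES[1:]), growth):
    print(f"{start[0]}-{end[0]}: {100 * (factor - 1):.1f}% per year, "
          f"doubling every {doubling:.2f} years")
```

Both steps are tenfold increases, so the shorter 2031 to 2036 interval implies the faster annual growth of the two.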
In 2019, the DoD HPCMP procured a 12.8 petaFLOPS system. This will increase the HPCMP’s aggregate supercomputing capability to 53 petaFLOPS.
The system, the HPCMP’s first with more than 10 petaFLOPS of peak computational performance, will be installed at the Navy DoD Supercomputing Resource Center (DSRC) facility at Stennis Space Center, Mississippi, and will serve users from all of the services and agencies of the Department.
The architecture of the system is as follows:
It is a Cray Shasta system with 290,304 AMD EPYC “Rome” compute cores and 112 NVIDIA Volta V100 General-Purpose Graphics Processing Units (GPGPUs), interconnected by a 200 gigabit per second Cray Slingshot network and supported by 1 PB of NVMe-based solid-state storage, 590 terabytes of memory, and 14 petabytes of usable storage.
The system is expected to enter production service early in fiscal year 2021.
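A few derived figures follow from the numbers quoted for this procurement. The sketch below is rough arithmetic under stated assumptions: it uses decimal units (1 TB = 1,000 GB) and treats all 590 TB of memory as shared evenly across the CPU cores, which the source does not confirm. All variable names are illustrative.

```python
# Figures quoted in the text above.
new_system_pflops = 12.8
aggregate_after_pflops = 53.0
compute_cores = 290_304
memory_tb = 590

# Aggregate capability implied before the new system was added.
aggregate_before_pflops = aggregate_after_pflops - new_system_pflops

# Fraction of the post-procurement aggregate contributed by the new system.
new_system_share = new_system_pflops / aggregate_after_pflops

# Assumption: decimal units, and all memory divided evenly across CPU cores.
memory_per_core_gb = memory_tb * 1_000 / compute_cores

print(f"Aggregate before: {aggregate_before_pflops:.1f} petaFLOPS")
print(f"New system share: {100 * new_system_share:.1f}%")
print(f"Memory per core:  {memory_per_core_gb:.2f} GB")
```

On these assumptions the new system accounts for roughly a quarter of the aggregate, and the memory works out to about 2 GB per core.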