In 2021, Argonne Leadership Computing Facility (ALCF) will deploy Aurora, a new Intel-Cray system. Aurora will be capable of over 1 exaflops. It is expected to have over 50,000 nodes and over 5 petabytes of total memory, including high bandwidth memory.
Aurora will be a dramatically bigger and faster machine than Theta or Mira, and it will be the United States’ first exascale system. The three months of pre-production Early Science time will therefore be a large and valuable allocation of core-hours, with the potential for truly unprecedented computational science. ALCF will fully fund 10 postdoctoral appointees for the Aurora Early Science Program (ESP), one for each selected project.
Timeline and request for proposals for Aurora Exaflop Supercomputer
Below is a rough timeline for the Aurora ESP. The rows labeled “A21 ESP projects” (A21 is shorthand for Aurora) denote the central effort of the projects: developing, porting, and tuning code for the target system:
The speed and scale of A21 will be vastly greater than those of today’s systems or of systems on the near-term horizon.
Some Guidance About Aurora for Proposal Authors
- Nodes will have both high single-thread core performance and the ability to deliver exceptional performance when the code exposes concurrency of modest scale.
- The architecture is optimized to support codes with sections of fine-grained concurrency (for example, ~100 lines of code in a FOR loop) separated by serial sections of code. The degree of fine-grained concurrency (for example, the number of loop iterations) needed to fully exploit the performance opportunities is moderate, in the ~1000 range for most applications.
- Independence of the iterations in these loops is ideal but not required for correctness, although dependencies that restrict how much work can be done in parallel will likely reduce performance.
- There is no limit on the number of such loops, and the overhead of starting and ending a loop is very low.
- Serial code (within an MPI rank) will execute very efficiently. The performance ratio between the serial and parallel capabilities is moderate, around 10X, so code that has not been entirely reworked can still perform well.
- OpenMP 5 will likely contain the constructs necessary to guide the compiler to get optimal performance.
- The compute performance of the nodes will rise at a rate similar to the memory bandwidth, so the ratio of memory bandwidth to compute performance will not differ significantly from that of systems a few years ago. In fact, it will be a bit better than it has been recently.
- The memory capacity will not grow as fast as the compute performance, so getting more performance through concurrency from the same capacity will be a key strategy for exploiting future architectures. Although capacity is not growing fast compared to current machines, all of the memory will be high performance, which alleviates some of the concerns of explicitly managing multiple levels of memory and data movement.
- The memory in a node will be coherent, and all compute elements will be first-class citizens with equal access to all resources, including memory and fabric.
- The fabric bandwidth will increase at a rate similar to compute performance for local communication patterns, although global communication bandwidth will likely not increase as fast as compute performance.
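The code shape described in the guidance above (loops of roughly 1000 iterations with fine-grained concurrency, separated by serial code) can be illustrated with a minimal C sketch. It is not Aurora-specific: the function names `axpy`, `dot`, and `step` are hypothetical, and the pragmas are ordinary OpenMP worksharing directives that already exist in earlier OpenMP versions, used here only as an assumption about how such loops might be annotated. Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```c
/* Hypothetical kernels illustrating "parallel loop, serial section, parallel loop".
 * Build with OpenMP (for example, a compiler's -fopenmp style flag) or without it. */

/* Loop with fully independent iterations: the ideal case in the guidance. */
void axpy(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Loop with a dependency on a shared accumulator; the reduction clause
 * tells the compiler how to combine per-thread partial sums. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* One "step": two fine-grained parallel loops separated by serial code. */
double step(int n, double a, const double *x, double *y)
{
    axpy(n, a, x, y);                 /* parallel section */

    /* Serial section: a scalar decision made by a single thread. */
    double scale = (a > 1.0) ? 1.0 / a : 1.0;

    return scale * dot(n, y, y);      /* parallel section with reduction */
}
```

Calling `step(1000, 3.0, x, y)` on arrays of length 1000 matches the ~1000-iteration range the guidance mentions. The reduction in `dot` is an example of a loop whose iterations are not fully independent but can still be parallelized.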