ARM and Fujitsu working on chip targeted at supercomputer applications

ARM and Fujitsu announced the Scalable Vector Extension (SVE) to the ARMv8-A architecture, intended to enhance ARM’s capabilities for HPC workloads. Fujitsu is the lead silicon partner in the effort (so far) and will use ARM with SVE technology in its Post-K computer, Japan’s next flagship supercomputer, planned for the 2020 timeframe. This is an important incremental step for ARM, which seeks to push more aggressively into the mainstream and HPC server markets.

Fujitsu first announced plans to adopt ARM for the Post-K machine (a switch from the SPARC processor technology used in the K computer) at ISC 2016, and said at the time that it would reveal more about the required ARM development effort at Hot Chips. Bull Atos is also developing an ARM-based supercomputer.

SVE is focused on addressing “next-generation high performance computing challenges, and by that we mean workloads typically found in scientific computing environments where they are very parallelizable,” said Ian Smythe, director of marketing programs, ARM Compute Products Group, in a pre-briefing. SVE is scalable from 128 bits to 2048 bits in 128-bit increments and, among other things, should enhance ARM’s ability to exploit fine-grained parallelism.

Key points about ARMv8-A SVE are:

  • ARM is significantly extending the vector processing capabilities associated with AArch64 (64-bit) execution in the ARM architecture, now and into the future, enabling implementation choices for vector lengths that scale from 128 to 2048 bits.
  • High Performance Scientific Compute provides an excellent focus for the introduction of this technology and its associated ecosystem development.
  • SVE features will enable advanced vectorizing compilers to extract more fine-grained parallelism from existing code and so reduce the software deployment effort.

Here is some historical context. ARMv7 Advanced SIMD (a.k.a. the ARM NEON instructions) is roughly 12 years old, a technology originally intended to accelerate media processing tasks on the main processor. It operated on well-conditioned data in memory with fixed-point and single-precision floating-point elements in sixteen 128-bit vector registers. With the move to AArch64, NEON gained full IEEE double-precision floating-point support and 64-bit integer operations, and the register file grew to thirty-two 128-bit vector registers. These evolutionary changes made NEON a better compiler target for general-purpose compute. SVE is a complementary extension that does not replace NEON; it was developed specifically for the vectorization of HPC scientific workloads.
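To make NEON’s fixed vector width concrete, here is a minimal sketch of a daxpy-style kernel (y[i] += a * x[i]) written with the AArch64 NEON intrinsics from arm_neon.h. The function name and the scalar tail handling are illustrative, not taken from the source; the point is that the 128-bit width (two doubles per iteration) is baked into the code.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Fixed-width NEON daxpy sketch: each iteration handles exactly two
     * doubles, because the 128-bit register width is hard-coded. */
    void daxpy_neon(double *x, double *y, double a, int64_t n) {
        float64x2_t va = vdupq_n_f64(a);           /* broadcast the scalar a   */
        int64_t i = 0;
        for (; i + 2 <= n; i += 2) {
            float64x2_t vx = vld1q_f64(&x[i]);     /* load two doubles from x  */
            float64x2_t vy = vld1q_f64(&y[i]);     /* load two doubles from y  */
            vy = vfmaq_f64(vy, vx, va);            /* vy += vx * va (FMA)      */
            vst1q_f64(&y[i], vy);                  /* store the two results    */
        }
        for (; i < n; ++i)                         /* scalar tail for odd n    */
            y[i] += a * x[i];
    }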

Immense amounts of data are being collected today in areas such as meteorology, geology, astronomy, quantum physics, fluid dynamics, and pharmaceutical research. Exascale computing (the execution of a billion billion, or 10^18, floating point operations per second, i.e. one exaFLOPS) is the target to which many HPC systems aspire over the next 5-10 years. In addition, advances in data analytics and areas such as computer vision and machine learning are already increasing the demands for greater parallelization of program execution, today and into the future.

Rather than specifying a specific vector length, SVE allows CPU designers to choose the most appropriate vector length for their application and market, from 128 bits up to 2048 bits per vector register. SVE also supports a vector-length agnostic (VLA) programming model that can adapt to the available vector length. Adoption of the VLA paradigm allows you to compile or hand-code your program for SVE once, and then run it at different implementation performance points, while avoiding the need to recompile or rewrite it when longer vectors appear in the future. This reduces deployment costs over the lifetime of the architecture; a program just works and executes wider and faster.
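For contrast with the fixed-width NEON sketch above, here is the same daxpy kernel in vector-length agnostic form, assuming the ACLE SVE intrinsics declared in arm_sve.h (again an illustrative sketch, not code from the source). Nothing in it names a vector width: svcntd() reports how many 64-bit lanes the hardware actually provides, and the while-less-than predicate masks off the loop tail, so the same binary runs on any implementation from 128 to 2048 bits.

    #include <arm_sve.h>
    #include <stdint.h>

    /* Vector-length agnostic daxpy sketch using ACLE SVE intrinsics. */
    void daxpy_sve(double *x, double *y, double a, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntd()) {   /* advance by the lane count      */
            svbool_t pg = svwhilelt_b64(i, n);        /* predicate: lanes with index < n */
            svfloat64_t vx = svld1_f64(pg, &x[i]);    /* predicated load from x          */
            svfloat64_t vy = svld1_f64(pg, &y[i]);    /* predicated load from y          */
            vy = svmla_n_f64_x(pg, vy, vx, a);        /* vy += vx * a                    */
            svst1_f64(pg, &y[i], vy);                 /* predicated store back to y      */
        }
    }

Compiled once, this loop would execute two 64-bit lanes per iteration on a 128-bit implementation and thirty-two lanes on a 2048-bit one, which is exactly the “compile once, run wider and faster” property described above.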

Scientific workloads, mentioned earlier, have traditionally been written to exploit as much data-level parallelism as possible, with careful use of OpenMP pragmas and other source code annotations. It is therefore relatively straightforward for a compiler to vectorize such code and make good use of a wider vector unit. Supercomputers are also built with the wide, high-bandwidth memory systems necessary to feed a longer vector unit.
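As a simple illustration (my own example, not one from the source), such an annotated loop might look like the sketch below. The pragma is standard OpenMP 4.0 omp simd, which tells the compiler the iterations are independent and safe to execute as SIMD lanes:

    /* An HPC-style loop annotated for vectorization. The restrict qualifiers
     * and the pragma both tell the compiler there are no loop-carried
     * dependences, so it can emit NEON or SVE vector code. */
    void scale(double *restrict out, const double *restrict in, double s, long n) {
        #pragma omp simd
        for (long i = 0; i < n; ++i)
            out[i] = s * in[i];
    }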

However, while HPC is a natural fit for SVE’s longer vectors, the extension also offers an opportunity to improve vectorizing compilers in ways that will be of general benefit over the longer term, as other systems scale to support increased data-level parallelism.

It is worth noting at this point that Amdahl’s law tells us the theoretical limit of a task’s speedup is governed by the amount of unparallelizable code. If you succeed in vectorizing 10% of your execution and make that code run four times faster (e.g. a 256-bit vector allows four 64-bit operations in parallel), then a task that previously took 1000 cycles now takes 925 cycles, a limited speedup for the power and area cost of the extra gates. Even if you could vectorize 50% of your execution and speed it up infinitely (unlikely!), you’ve still only doubled the overall performance. You need to be able to vectorize much more of your program to realize the potential gains from longer vectors.
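For reference, those figures follow directly from Amdahl’s law; writing f for the fraction of execution that is vectorized and s for the speedup on that fraction (my notation, not the source’s), the overall speedup is

    S = \frac{1}{(1 - f) + f/s}

Plugging in f = 0.1 and s = 4 gives 1/0.925, so 1000 cycles shrink to 925; f = 0.5 with unlimited speedup gives at most a factor of 2, matching the numbers above.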

So SVE also introduces novel features that begin to tackle some of the barriers to compiler vectorization. The general philosophy of SVE is to make it easier for a compiler to opportunistically vectorize code where it would not normally be possible or cost effective to do so.
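As a concrete illustration of such a barrier (my example, not the source’s): a loop whose body runs conditionally per element is awkward for traditional fixed-width SIMD, because only some lanes should be updated in any given iteration. SVE’s per-lane predication is aimed at exactly this kind of pattern, letting the compiler map the if onto a predicate instead of abandoning vectorization.

    /* A data-dependent condition inside the loop body: without per-lane
     * predication a compiler must peel, branch, or give up, since only
     * the lanes where the test holds may be written back. */
    void clamp_negatives(double *a, long n) {
        for (long i = 0; i < n; ++i) {
            if (a[i] < 0.0)
                a[i] = 0.0;
        }
    }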

In summary, SVE opens a new chapter for the ARM architecture in terms of the scale and opportunity for increasing levels of vector processing on ARM processor cores. It is early days for SVE tools and software, and it will take time for SVE compilers and the rest of the SVE software ecosystem to mature. HPC is the current focus and catalyst for this compiler work, and creates development momentum in areas such as Linux distributions and optimized libraries for SVE, as well as in ARM and third party tools and software.

SOURCE – ARM blog post by Nigel Stephens, Lead ISA Architect and ARM Fellow