Developments in several technology areas could be highly disruptive for exascale supercomputing and could lead to viable exascale systems in the 2015-2020 timeframe. Four such technologies are:
* Quantum computing [D-Wave Systems]
* Flash storage: Sun Microsystems has introduced high-performance flash storage, from terabytes up to half a petabyte
* Cheap, low-power optical communications: Keren Bergman talks about nanophotonics for on-chip and inter-chip communication
* IBM 3D chip stacking
IBM announced its chip-stacking technology for a manufacturing environment one year ago. Stacking drastically shortens the distance that information needs to travel on a chip, to just 1/1000th of that on 2-D chips, and allows the addition of up to 100 times more channels, or pathways, for that information to flow.
IBM researchers are exploring concepts for stacking memory on top of processors and, ultimately, for stacking many layers of processor cores.
IBM scientists were able to demonstrate a cooling performance of up to 180 W/cm² per layer for a stack with a typical footprint of 4 cm².
Some of the best technical papers
Links to all technical paper abstracts are here.
1. High-Radix Crossbar Switches Enabled by Proximity Communication
Parallel applications are usually able to achieve high computational performance but suffer from large latency in I/O accesses. I/O prefetching is an effective solution for masking the latency. Most existing I/O prefetching techniques, however, are conservative, and their effectiveness is limited by low accuracy and coverage. As the processor-I/O performance gap has been increasing rapidly, data-access delay has become a dominant performance bottleneck. We argue that it is time to revisit the “I/O wall” problem and trade excess computing power for data-access speed. We propose a novel pre-execution approach for masking I/O latency. We describe the pre-execution I/O prefetching framework, the pre-execution thread construction methodology, the underlying library support, and the prototype implementation in the ROMIO MPI-IO implementation in MPICH2. Preliminary experiments show that the pre-execution approach is promising in reducing I/O access latency and has real potential.
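To make the idea of masking I/O latency concrete, here is a minimal Python sketch of plain read-ahead: a helper thread fetches upcoming blocks while the main thread computes on the current one. This is not the paper's pre-execution method (which constructs a prefetching thread from the application itself) and it does not use ROMIO or MPI-IO. The names `prefetching_reader`, `read_block` and `depth` are hypothetical, chosen only for illustration.

```python
import queue
import threading


def prefetching_reader(read_block, block_ids, depth=2):
    """Yield blocks in order while a helper thread reads ahead.

    read_block: hypothetical callable that performs the slow I/O for one block.
    depth: maximum number of blocks fetched ahead of the consumer.
    """
    q = queue.Queue(maxsize=depth)  # bounded, so read-ahead cannot run away

    def worker():
        try:
            for bid in block_ids:
                q.put(("data", read_block(bid)))  # blocks when the queue is full
        except Exception as exc:
            q.put(("error", exc))  # forward I/O failures to the consumer
        else:
            q.put(("done", None))

    # Daemon thread: if the consumer stops early, the helper does not keep
    # the process alive.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, payload = q.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload


if __name__ == "__main__":
    def slow_read(i):
        # Stand-in for a real read; an actual application would hit disk here.
        return [i] * 3

    for block in prefetching_reader(slow_read, range(4)):
        print(sum(block))  # compute overlaps with the next read
```

The bounded queue is the main design choice: it caps how far the helper runs ahead, which limits memory use when the consumer is slower than the I/O.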
2. Benchmarking GPUs to Tune Dense Linear Algebra
We present performance results for dense linear algebra using the 8-series NVIDIA GPUs. Our GEMM routine runs 60% faster than the vendor implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~300 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit register blocking to optimize GEMM and the heterogeneity of the system (computing on both GPU and CPU). This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.
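The register-blocking idea mentioned in the abstract can be sketched structurally: each small tile of C is accumulated in a local buffer and written back once, so every loaded element of A and B is reused several times. The pure-Python version below only illustrates that loop structure. It is not the authors' GPU kernel, Python gives no control over registers, and the function name `blocked_gemm` and the `block` parameter are assumptions for illustration.

```python
def blocked_gemm(A, B, block=2):
    """Compute C = A * B for list-of-lists matrices, one output tile at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        i1 = min(i0 + block, n)
        for j0 in range(0, m, block):
            j1 = min(j0 + block, m)
            # Local accumulator tile; on a GPU this would live in registers.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for p in range(k):
                b_row = B[p]
                for i in range(i0, i1):
                    a = A[i][p]  # loaded once, reused across the tile row
                    acc_row = acc[i - i0]
                    for j in range(j0, j1):
                        acc_row[j - j0] += a * b_row[j]
            # Write the finished tile back to C exactly once.
            for i in range(i0, i1):
                C[i][j0:j1] = acc[i - i0]
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    B = [[1, 0, 2], [0, 1, 3]]
    print(blocked_gemm(A, B))
```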
3. A Scalable Parallel Framework for Analyzing Terascale Molecular Dynamics Trajectories
As parallel algorithms and architectures drive the longest molecular dynamics (MD) simulations towards the millisecond scale, traditional sequential post-simulation data analysis methods are becoming increasingly untenable. Inspired by the programming interface of Google's MapReduce, we have built a new parallel analysis framework called HiMach, which allows users to write trajectory analysis programs sequentially, and carries out the parallel execution of the programs automatically. We introduce (1) a new MD trajectory data analysis model that is amenable to parallel processing, (2) a new interface for defining trajectories to be analyzed, (3) a novel method to make use of an existing sequential analysis tool called VMD, and (4) an extension to the original MapReduce model to support multiple rounds of analysis. Performance evaluations on up to 512 processor cores demonstrate the efficiency and scalability of the HiMach framework on a Linux cluster.
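The MapReduce-style division of labor described above can be sketched in a few lines: the user writes a per-frame map function and a per-key reduce function as ordinary sequential code, and a small driver runs the map step concurrently. This is not HiMach's actual interface. `run_map_reduce`, `pair_distance` and `mean_value` are hypothetical names, a frame is assumed to be a list of (x, y, z) atom coordinates, and a thread pool is used only to show the structure (CPython threads will not speed up CPU-bound analysis).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_map_reduce(frames, map_fn, reduce_fn, workers=4):
    """Apply map_fn to every frame concurrently, then reduce values per key."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, frames))  # preserves frame order

    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    return {key: reduce_fn(key, values) for key, values in groups.items()}


def pair_distance(frame):
    """Map step: emit the distance between atoms 0 and 1 of one frame."""
    (x0, y0, z0), (x1, y1, z1) = frame[0], frame[1]
    d = ((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2) ** 0.5
    return [("d01", d)]


def mean_value(key, values):
    """Reduce step: average all values emitted under one key."""
    return sum(values) / len(values)


if __name__ == "__main__":
    trajectory = [
        [(0, 0, 0), (3, 4, 0)],
        [(0, 0, 0), (0, 0, 1)],
    ]
    print(run_map_reduce(trajectory, pair_distance, mean_value))
```

Supporting multiple rounds of analysis, as the abstract describes, would amount to feeding the output of one such call into another map and reduce pair.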
The Conference schedule is here.