Deep Learning at 15 Petaflops from Intel and partners

An arXiv paper presents the first 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. Intel researchers and partners developed supervised convolutional architectures for discriminating signals in high-energy physics data, as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Their IntelCaffe-based implementation obtains ∼2 TFLOP/s on a single Cori Phase-II Xeon Phi node. They use a hybrid strategy employing synchronous node groups with asynchronous communication across groups, and use this strategy to scale training of a single model to ∼9600 Xeon Phi nodes, obtaining peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s. At scale, their HEP architecture produces state-of-the-art classification accuracy on a dataset of 10 million images, exceeding that achieved by selections on high-level, physics-motivated features. Their semi-supervised architecture successfully extracts weather patterns from a 15 TB climate dataset. These results demonstrate that Deep Learning can be optimized and scaled effectively on many-core HPC systems.

• They developed Deep Learning models that not only solve the problem at hand to the desired precision but also scale to a large number of nodes. This includes, for example, avoiding layers with large dense weights, such as batch normalization or fully connected units.
• They developed highly optimized Deep Learning software that can process complex scientific datasets on the Intel Xeon Phi architecture
• They built a system based on a hybrid asynchronous approach to scale Deep Learning to the full scale of the Cori supercomputer (∼9600 Xeon Phi nodes); a sketch of this hybrid scheme follows the list
• They demonstrated supervised classification on a 7.4 TB High-Energy Physics dataset
• They developed a novel, semi-supervised architecture, and apply it to detect and learn new patterns on a 15 TB climate dataset
• They obtained a peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s for their two problems
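The hybrid scaling scheme in the bullets above can be illustrated with a toy sketch. The plain NumPy simulation below is not the paper's IntelCaffe/MPI implementation; the quadratic loss, group sizes, and learning rate are all illustrative assumptions. It shows the two tiers of the design: workers within a group average their gradients synchronously (an allreduce in practice), while each group applies its update to the shared weights without waiting for the other groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: minimize ||w - w_true||^2; per-worker gradients are noisy,
# standing in for minibatch gradients on sharded data.
w_true = np.ones(4)

def worker_gradient(w):
    """Noisy gradient of the quadratic loss for one worker."""
    return 2.0 * (w - w_true) + 0.1 * rng.standard_normal(w.shape)

# Shared weights, playing the role of the asynchronous tier's parameter store.
server_w = np.zeros(4)
n_groups, workers_per_group, lr = 3, 4, 0.05

for step in range(200):
    # In the real system the groups run asynchronously; here we simply visit
    # them in turn with no cross-group barrier.
    for g in range(n_groups):
        w = server_w.copy()                       # pull current weights
        # Synchronous tier: every worker in the group computes a gradient on
        # the same weights, and the results are averaged (an allreduce).
        grads = [worker_gradient(w) for _ in range(workers_per_group)]
        group_grad = np.mean(grads, axis=0)
        server_w -= lr * group_grad               # asynchronous push of the update

print("distance to optimum:", np.linalg.norm(server_w - w_true))
```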

They formulate the high-energy physics (HEP) problem as a binary image classification task. They use a Convolutional Neural Net comprising 5 convolution+pooling units with rectified linear unit (ReLU) activation functions.
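A minimal sketch of such a network, written here in PyTorch rather than the paper's IntelCaffe (the input size, channel counts, and kernel sizes are assumptions for illustration). In line with the design guideline in the bullet list above, it avoids batch normalization and large fully connected layers, ending in global average pooling and a single small linear output:

```python
import torch
import torch.nn as nn

class HEPClassifier(nn.Module):
    """Binary image classifier: 5 convolution+pooling units with ReLU,
    loosely following the paper's description (channel counts are assumptions)."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        blocks, channels = [], [in_channels, 16, 32, 64, 128, 256]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # halve spatial resolution after each unit
            ]
        self.features = nn.Sequential(*blocks)
        # Global average pooling plus one small linear layer keeps dense
        # weights tiny, per the scalability note in the bullet list above.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(channels[-1], 1)  # one logit: signal vs. background

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.pool(self.features(x)).flatten(1)
        return self.head(z)

# Example: a batch of four 64x64 "detector images" (resolution is an assumption).
logits = HEPClassifier()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 1])
```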

This is the first 15-PetaFLOP Deep Learning software running on HPC platforms. They utilized IntelCaffe to obtain ∼2 TFLOP/s on single Xeon Phi nodes, and a hybrid strategy employing synchronous groups with asynchronous communication among them to scale the training of a single model to ∼9600 Cori Phase II nodes. They apply this framework to solve real-world supervised and semi-supervised pattern classification problems in HEP and climate science. Their work demonstrates that many-core HPC platforms can be successfully used to accelerate Deep Learning, opening the gateway for broader adoption by the domain science community. Their results are not limited to the specific applications mentioned in the paper, but extend to other kinds of models, such as ResNets (residual networks) and LSTMs (long short-term memory networks, a recurrent neural network architecture that remembers values over arbitrary intervals), although the optimal balance between synchronous and asynchronous communication is expected to be model dependent. This highlights the importance of a flexible, hybrid architecture in achieving the best performance for a diverse set of problems.

Background

Demystifying ResNets – provides a theoretical explanation for the strong performance of ResNets via the study of deep linear networks and some nonlinear variants.
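For reference, the residual connection being analyzed is simply y = x + F(x). A minimal PyTorch sketch (the layer width and the two-linear-layer residual branch are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the identity shortcut gives gradients a direct path,
    which is the behavior the deep-linear-network analysis explains."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # identity shortcut plus learned residual

x = torch.randn(2, 64)
print(ResidualBlock()(x).shape)  # torch.Size([2, 64])
```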

A Beginner’s Guide to Recurrent Networks and LSTMs

In the mid-90s, a variation of the recurrent network with so-called Long Short-Term Memory units, or LSTMs, was proposed by the German researchers Sepp Hochreiter and Juergen Schmidhuber as a solution to the vanishing gradient problem.

LSTMs help preserve the error that can be backpropagated through time and layers. By maintaining a more constant error, they allow recurrent nets to continue to learn over many time steps (over 1000), thereby opening a channel to link causes and effects remotely.
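To make the mechanism concrete, here is a single LSTM step in NumPy (the weight shapes and random initialization are illustrative; in practice these parameters are learned). The key point is the additive update of the cell state c: the forget gate can keep it close to the identity, which is what lets error signals survive backpropagation through many time steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM time step. W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)            # input, forget, output gates; candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # additive cell-state update
    h_new = sigmoid(o) * np.tanh(c_new)               # exposed hidden state
    return h_new, c_new

n_in, n_hid = 3, 5
W = 0.1 * rng.standard_normal((4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(1000):                       # cell state persists across 1000 steps
    x_t = rng.standard_normal(n_in)
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (5,) (5,)
```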