June 21, 2016

Baidu improves GPU efficiency by 30 times and strong scaling by 16 times for Deep Learning on 128 GPUs

Baidu made GPUs roughly 30x more efficient on the small units of work used in deep learning, which enables better strong scaling. Baidu achieved a 16x increase in strong scaling, going from 8 GPUs without the technique to 128 GPUs with it. Their implementation sustains 28 percent of peak floating-point throughput at 128 GPUs over the entire training run, compared to 31 percent on a single GPU.
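As a rough check on those utilization numbers (our arithmetic, not a figure Baidu reports):

per-GPU utilization retained at 128 GPUs ≈ 0.28 / 0.31 ≈ 0.90

In other words, spreading training across 128 GPUs costs only about 10 percent of the floating-point utilization each GPU achieves on its own, which is why growing the usable GPU count from 8 to 128 translates almost directly into a 16x larger effective machine.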

According to Baidu's Gregory Diamos, although deep learning algorithms are typically compute bound, no one has yet figured out how to train them at the theoretical performance limits of large clusters, so a big opportunity remains. The gap between the sustained performance of the fastest RNN training system Baidu knows of and the theoretical peak performance of the fastest computer in the world is approximately 2500x.

In the five-year timeframe, Gregory Diamos of Baidu is watching two things from deep learning chip and software makers: peak floating-point throughput and software support for deep learning. So far GPUs lead in both categories, but there is certainly room for competition. If other processors want to compete in this space, their makers need to be serious about software, in particular releasing deep learning primitive libraries with simple C interfaces that achieve close to peak performance. Looking farther ahead to the limits of technology scaling, Diamos hopes that a processor is developed in the next two decades that enables deep learning model training at 10 PFLOP/s in 300 watts, and at 150 EFLOP/s in 25 megawatts.
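For scale, those targets work out to (our arithmetic on the stated numbers, not additional figures from Diamos):

10 PFLOP/s / 300 W ≈ 33 TFLOP/s per watt
150 EFLOP/s / 25 MW = 6 TFLOP/s per watt

A 2016-era TitanX delivers very roughly 6 TFLOP/s single precision in a 250 W card, or about 0.024 TFLOP/s per watt, so even the less aggressive of the two targets asks for a two to three orders of magnitude improvement in energy efficiency. The lower per-watt figure for the 25 MW system presumably reflects the overhead of interconnect, memory, and cooling at datacenter scale.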

Baidu is using machine learning for image recognition, speech recognition, the development of autonomous vehicles and more.

Baidu's research allows it to train models faster, which so far has translated into better application-level performance, such as speech recognition accuracy.

Persistent RNNs: Stashing Recurrent Weights On-Chip


This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU’s inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
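To make the "persistent kernel" idea concrete, here is a deliberately tiny CUDA sketch (our illustration, not Baidu's code): a single thread block stages the recurrent weight matrix into on-chip shared memory once and then reuses it across every timestep, instead of re-reading the weights from DRAM at each step the way a matrix-multiply-per-timestep implementation would. The real persistent kernels in the paper keep weights in the register file across many SMs and synchronize with a custom global barrier; the hidden size, sequence length, and ReLU recurrence below are toy assumptions chosen so that one block and __syncthreads() suffice.

#include <cstdio>
#include <cuda_runtime.h>

#define HIDDEN 64   // toy hidden layer size (assumption for illustration)
#define BATCH   4   // mini-batch of 4, matching the paper's example
#define STEPS  32   // timesteps in the sequence (assumption)

// One thread block; thread j owns hidden unit j.
__global__ void persistent_rnn_toy(const float* __restrict__ W,  // HIDDEN*HIDDEN recurrent weights
                                   const float* __restrict__ x,  // STEPS*BATCH*HIDDEN inputs (already projected)
                                   float* __restrict__ h)        // BATCH*HIDDEN hidden state, in/out
{
    __shared__ float Ws[HIDDEN][HIDDEN];  // weights stay on-chip for the whole sequence
    __shared__ float hs[BATCH][HIDDEN];   // current hidden state

    // Load W and h(0) from DRAM exactly once.
    for (int i = threadIdx.x; i < HIDDEN * HIDDEN; i += blockDim.x)
        Ws[i / HIDDEN][i % HIDDEN] = W[i];
    for (int i = threadIdx.x; i < BATCH * HIDDEN; i += blockDim.x)
        hs[i / HIDDEN][i % HIDDEN] = h[i];
    __syncthreads();

    const int j = threadIdx.x;
    for (int t = 0; t < STEPS; ++t) {
        // h(t) = relu(W * h(t-1) + x(t)), computed from on-chip weights only.
        float acc[BATCH];
        for (int b = 0; b < BATCH; ++b) {
            float s = x[(t * BATCH + b) * HIDDEN + j];
            for (int k = 0; k < HIDDEN; ++k)
                s += Ws[j][k] * hs[b][k];
            acc[b] = s > 0.f ? s : 0.f;
        }
        __syncthreads();                   // all threads finished reading h(t-1)
        for (int b = 0; b < BATCH; ++b)
            hs[b][j] = acc[b];
        __syncthreads();                   // h(t) is fully written
    }

    // Write the final hidden state back to DRAM once.
    for (int i = threadIdx.x; i < BATCH * HIDDEN; i += blockDim.x)
        h[i] = hs[i / HIDDEN][i % HIDDEN];
}

int main() {
    float *W, *x, *h;
    cudaMallocManaged(&W, HIDDEN * HIDDEN * sizeof(float));
    cudaMallocManaged(&x, STEPS * BATCH * HIDDEN * sizeof(float));
    cudaMallocManaged(&h, BATCH * HIDDEN * sizeof(float));
    for (int i = 0; i < HIDDEN * HIDDEN; ++i)        W[i] = 0.01f;  // toy weights
    for (int i = 0; i < STEPS * BATCH * HIDDEN; ++i) x[i] = 0.10f;  // toy inputs
    for (int i = 0; i < BATCH * HIDDEN; ++i)         h[i] = 0.00f;  // zero initial state

    persistent_rnn_toy<<<1, HIDDEN>>>(W, x, h);      // one block of exactly HIDDEN threads
    cudaDeviceSynchronize();
    printf("h[0][0] after %d timesteps: %f\n", STEPS, h[0]);

    cudaFree(W); cudaFree(x); cudaFree(h);
    return 0;
}

Because the weights stay on-chip for the whole sequence, arithmetic intensity no longer collapses when the mini-batch is tiny, which is the effect the paper exploits to sustain 2.8 TFLOP/s at a mini-batch size of 4.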

SOURCES - HPCWire, ICML paper, Baidu, YouTube
