Although deep learning algorithms are typically compute-bound, we have not yet figured out how to train them at the theoretical performance limits of large clusters, so a large opportunity remains. The gap between the sustained performance of the fastest RNN training system we know of at Baidu and the theoretical peak performance of the fastest computer in the world is approximately 2500x.
Over the next five years, Gregory Diamos at Baidu is watching two things from deep learning chip and software makers: peak floating point throughput and software support for deep learning. So far GPUs lead in both categories, but there is certainly room for competition. If other processors want to compete in this space, they need to be serious about software, in particular releasing deep learning primitive libraries with simple C interfaces that achieve close to peak performance. Looking farther ahead to the limits of technology scaling, Diamos hopes that a processor is developed in the next two decades that enables deep learning model training at 10 PFLOP/s in 300 watts, and 150 EFLOP/s in 25 megawatts.
Baidu is using machine learning for image recognition, speech recognition, the development of autonomous vehicles and more.
Baidu's research allows them to train their models faster, which so far has translated into better application-level performance, e.g., speech recognition accuracy.
Persistent RNNs: Stashing Recurrent Weights On-Chip
This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNNs) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
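To see why reusing weights on-chip helps at small mini-batch sizes, consider a back-of-the-envelope arithmetic-intensity calculation. The sketch below is illustrative only, not the authors' CUDA implementation: the layer size, timestep count, and the TitanX-class peak/bandwidth figures are assumptions chosen to make the point. One timestep of h_t = f(W·h_{t-1}) with an N×N weight matrix and mini-batch b performs about 2·N²·b FLOPs but must stream about 4·N² bytes of fp32 weights from DRAM if W is reloaded every timestep; a persistent kernel that keeps W resident on-chip amortizes that load over many timesteps.

```python
def arithmetic_intensity(n, batch, steps_weights_resident=1):
    """FLOPs per byte of weight traffic for an N x N recurrent matmul.

    steps_weights_resident = 1 models a naive kernel that reloads the
    weights from DRAM every timestep; a larger value models a persistent
    kernel that keeps the weights on-chip for that many timesteps.
    """
    flops = 2 * n * n * batch * steps_weights_resident
    weight_bytes = 4 * n * n  # fp32 weights loaded from DRAM once
    return flops / weight_bytes

# Assumed TitanX-class figures: ~6.1 TFLOP/s fp32 peak, ~336 GB/s DRAM
# bandwidth, so kernels below ~18 FLOPs/byte are bandwidth-bound.
machine_balance = 6.1e12 / 336e9

naive = arithmetic_intensity(n=1152, batch=4)  # W reloaded each step
persistent = arithmetic_intensity(n=1152, batch=4,
                                  steps_weights_resident=256)

print(naive)       # 2.0  -> far below machine balance: memory-bound
print(persistent)  # 512.0 -> weight traffic amortized over 256 steps
```

At mini-batch 4 the naive kernel's intensity is only 2 FLOPs per byte, an order of magnitude below the machine balance, so the matmul is limited by weight bandwidth rather than arithmetic; keeping the weights resident (in the paper's case, in the GPU register file) raises the effective intensity in proportion to the number of timesteps served per load.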
SOURCES - HPCWire, ICML paper, Baidu, YouTube