Scaling Deep Learning until systems reach human-level performance or better

Baidu's results indicate that in many real-world contexts, simply scaling the training data set and model is likely to improve accuracy in a predictable way. This predictability may help practitioners and researchers debug models and set accuracy targets as they scale.

arXiv: Deep Learning Scaling is Predictable, Empirically

Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents—the “steepness” of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
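
For concreteness, the power-law relationships the abstract describes can be written roughly as follows (a paraphrase of the paper's form, with m the number of training samples, ε(m) the generalization error, s(m) the model size needed to fit m samples, γ the irreducible error, and α, β_g, β_p fitted constants):

\varepsilon(m) \approx \alpha \, m^{\beta_g} + \gamma, \qquad \beta_g < 0

s(m) \propto m^{\beta_p}, \qquad 0 < \beta_p < 1

The measured β_g exponents vary by domain and, as the abstract notes, theoretical work has yet to explain why they take the values they do.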

Operational Implications

Learning and model size curves can also guide decisions about data collection and scaling computation. As we project forward on learning curves (a fitting-and-extrapolation sketch follows this list), we can run into three types of scaling limits:
training data is too small,
computation is too slow, or
error has reached its irreducible level.
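
As a concrete illustration of projecting forward on a learning curve, here is a minimal sketch (my own, not code from the paper): it fits the power-law form above to a few measured (training-set size, dev-error) points and extrapolates. All numbers are invented, and scipy's curve_fit is just one convenient fitting choice.

import numpy as np
from scipy.optimize import curve_fit

def power_law(m, a, b, c):
    # error(m) = a * m**b + c; c plays the role of the irreducible error
    return a * np.power(m, b) + c

# Measured points: (training samples, dev-set error). Values are invented.
sizes  = np.array([1e5, 2e5, 4e5, 8e5, 1.6e6])
errors = np.array([0.310, 0.272, 0.238, 0.210, 0.186])

(a, b, c), _ = curve_fit(power_law, sizes, errors, p0=(1.0, -0.3, 0.05), maxfev=20000)
print(f"fitted exponent beta_g ~ {b:.2f}, fitted irreducible error gamma ~ {c:.3f}")

# Project forward to data sizes we have not trained on yet.
for m in (3.2e6, 1.28e7, 5.12e7):
    print(f"{m:.1e} samples -> predicted error {power_law(m, a, b, c):.3f}")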

Model Exploration using Small Data: It may seem counterintuitive, but an implication of predictable scaling is that model architecture exploration should be feasible with small training data sets. Consider starting with a training set that is known to be large enough that current models show accuracy in the power-law region of the learning curve. Since we expect model accuracy to improve proportionally for different models, growing the training set and models is likely to result in the same relative gains across the models.

The possibility of doing small data testing has significant implications on manual and automatic architecture search. Researchers or DL systems may be able to iterate on small data sets to find models that can accurately model the structure of the data distribution. Then, these models can be scaled to larger data sets to ensure proportional accuracy gains.
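
A minimal sketch of what small-data architecture search could look like, assuming a hypothetical train_and_eval(arch, n_samples) helper that trains an architecture on an n-sample subset and returns its dev-set error (the architecture names and subset sizes below are also hypothetical):

# Small-data architecture search, assuming train_and_eval() exists (hypothetical).
candidate_archs = ["rnn_2x512", "rnn_4x1024", "attention_base"]   # hypothetical names
subset_sizes = [200_000, 400_000, 800_000]                        # hypothetical, power-law region

errors = {arch: [train_and_eval(arch, n) for n in subset_sizes] for arch in candidate_archs}

# If the candidates share a power-law exponent, their ordering at the largest
# small subset should persist as data grows, so scale up the front-runner(s).
ranking = sorted(candidate_archs, key=lambda arch: errors[arch][-1])
print("scale-up priority:", ranking)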

Although small data set testing may be possible, it can be difficult to ensure that the training data is large enough to reach the power-law region of the learning curve. We have found that models with poorly chosen optimizer hyperparameters or model priors/initialization can show accuracy cliffs, where accuracy is only as good as best guessing until the model trains on enough data to reach the power-law region. Researchers must take great care when defining a “large enough” training set for small data testing. We leave the methodology for defining such a training set to future work.
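
One rough check (my own, not from the paper) for whether a small training set has cleared the accuracy cliff is to compare the model's dev error against a best-guess baseline, e.g. always predicting the most frequent class; the labels and error value below are hypothetical:

import numpy as np

def best_guess_error(dev_labels):
    # Error of always predicting the most frequent class; a stand-in for
    # "best guessing" on a classification task.
    _, counts = np.unique(dev_labels, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

dev_labels  = np.array([0, 0, 1, 2, 1, 0, 2, 2, 1, 0])  # hypothetical dev labels
model_error = 0.58                                      # hypothetical measured dev error

baseline = best_guess_error(dev_labels)
if model_error > 0.95 * baseline:
    print("accuracy cliff: no better than guessing; grow the data or re-tune "
          "optimizer hyperparameters before trusting small-data comparisons")
else:
    print("model beats best guessing; plausibly in or near the power-law region")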

Computational Limits: If we have identified a desirable model to scale to larger training sets, the next potential limitation is the speed of computation. In some cases, training large models on very large data sets would take months or years of critical-path compute time, making these training runs impractical for any real-world problem on existing systems. However, predictable learning and model size curves may offer a way to project the compute requirements to reach a particular accuracy level. The compute requirements could inform decisions about how to scale computational capacity to unlock these compute-limited applications.
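
A back-of-the-envelope sketch of such a projection (my own, not the paper's method): invert the fitted learning curve to get the data size needed for a target error, grow the model size sublinearly with data, and convert to wall-clock time at an assumed sustained throughput. Every constant below is an assumption, including the common ~6 x params x samples training-cost rule of thumb, which is not from this paper.

a, b, c = 8.0, -0.30, 0.05          # hypothetical fitted learning-curve parameters
target_error = 0.08

samples_needed = ((target_error - c) / a) ** (1.0 / b)   # invert a*m**b + c = target_error

k, p = 200.0, 0.73                  # hypothetical sublinear model-size fit: params ~ k * m**p
params_needed = k * samples_needed ** p

train_flops = 6.0 * params_needed * samples_needed       # rule-of-thumb training cost
sustained_flops_per_sec = 20e12                          # assumed per-accelerator throughput
hours = train_flops / sustained_flops_per_sec / 3600.0
print(f"~{samples_needed:.2e} samples, ~{params_needed:.2e} parameters, "
      f"~{hours:.1f} accelerator-hours at the assumed throughput")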

After reviewing the tests performed for this work, we find that we have run into compute limitations for the largest data sets of each application domain. Most frequently, we run out of GPU memory when trying to train the largest models on the largest data sets. In many cases, we could alleviate these issues with techniques like data or model parallelism, though they would require significant software changes to reduce per-compute-unit memory requirements. Alternatively, we could migrate training to systems with more memory. Further, our longest-running training sessions have taken as long as six weeks to converge. Parallelism and hardware improvements to reduce this time are highly desirable.
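
For reference, a minimal data-parallel training sketch in PyTorch, one of the parallelism options mentioned above (the paper does not prescribe any particular implementation; MyModel, my_dataset, and num_epochs are hypothetical placeholders). Launch with: torchrun --nproc_per_node=<num_gpus> train.py

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])   # MyModel: hypothetical
sampler = DistributedSampler(my_dataset)                           # my_dataset: hypothetical
loader = DataLoader(my_dataset, batch_size=32, sampler=sampler)    # smaller per-GPU batch

for epoch in range(num_epochs):                                    # num_epochs: hypothetical
    sampler.set_epoch(epoch)           # reshuffle the per-worker shards each epoch
    for batch in loader:
        ...                            # usual forward/backward; DDP all-reduces gradients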

Running into Irreducible Error: If we approach the irreducible error region in real applications, improving accuracy will require techniques outside the straightforward recipe. As an example, reaching Bayes error for a problem would be an indicator that no further information can be extracted from the existing data set—the application might be considered “solved”. If further model architecture search, training set growth, or computational scale cannot improve accuracy, it is likely that models are achieving the irreducible error. To improve error beyond this irreducible level may require techniques that could increase the information content of the data to distinguish between the samples that contribute to the Bayes error.
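
In the notation of the learning-curve sketch above, the fitted offset γ is one estimate of the irreducible error, and each doubling of the data shrinks only the reducible part of the error by a constant factor:

\varepsilon(m) - \gamma = \alpha \, m^{\beta_g},
\qquad
\frac{\varepsilon(2m) - \gamma}{\varepsilon(m) - \gamma} = 2^{\beta_g}

So once ε(m) sits close to γ, even large multiplicative increases in data buy only small absolute error reductions; that flattening is the practical signal that further gains need information beyond what the existing data contains.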

It may be difficult to assess whether we have reached irreducible error or if models just have inherent bias that makes them unable to resolve more information from the data. One approach might be to estimate the human error rate for the task. As long as humans are constrained to the same data for the problem, their best-case accuracy may be a reasonable upper bound on the irreducible error. If humans can perform better than current models, it is likely that models could be improved.

Conclusion

The deep learning (DL) community has created impactful advances across diverse application domains by following a straightforward recipe: search for improved model architectures, create large training data sets, and scale computation. While model architecture search can be unpredictable, the model accuracy improvements from growing data set size and scaling computation are empirically predictable. We empirically validate that DL model accuracy improves as a power law as we grow training sets for state-of-the-art (SOTA) model architectures in four machine learning domains: machine translation, language modeling, image processing, and speech recognition. These power-law learning curves exist across all tested domains, model architectures, optimizers, and loss functions. Further, within each domain, model architecture and optimizer changes only shift the learning curves but do not affect the power-law exponent (the “steepness” of the learning curve). We also show that model size scales sublinearly with data size. These scaling relationships have significant implications for deep learning research, practice, and systems.