100 Trillion Parameter AI Training Models

Recommender AI systems are an important component of Internet services today: billion-dollar businesses like Amazon and Netflix are driven directly by their recommendation services.

AI recommenders get better as they get bigger. Models with billions of parameters have been released for several years, and very recently models have reached the trillion-parameter scale. Every jump in model capacity has brought a significant improvement in quality. The era of 100 trillion parameters is just around the corner.

The complicated, dense part of the neural network is increasingly computation-intensive, requiring more than 100 TFLOPs (trillions of floating-point operations) in each training iteration. A sophisticated mechanism is therefore needed to manage a cluster of heterogeneous resources for such training tasks.

Recently, Kwai Seattle AI Lab and the DS3 Lab at ETH Zurich collaborated to propose a novel system named “Persia” to tackle this problem through careful co-design of both the training algorithm and the training system. At the algorithm level, Persia adopts a hybrid training algorithm that handles the embedding layer and the dense neural network modules differently. The embedding layer is trained asynchronously to improve the throughput of training samples, while the rest of the neural network is trained synchronously to preserve statistical efficiency. At the system level, a wide range of system optimizations for memory management and communication reduction has been implemented to unleash the full potential of the hybrid algorithm.
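To make the hybrid scheme concrete, below is a minimal, illustrative sketch; it is not Persia’s actual API, and all names, shapes and learning rates are assumptions for illustration. Embedding rows touched by a minibatch are updated immediately and independently on each worker (asynchronous), while dense-layer gradients are collected and averaged across workers before being applied (synchronous, mimicking an all-reduce):

```python
# Toy sketch of hybrid async/sync training -- NOT Persia's real API.
import numpy as np

rng = np.random.default_rng(0)

# A huge embedding table (tiny here) plus a small dense layer.
embedding = rng.normal(size=(1000, 16))   # sharded, updated asynchronously
dense_w = rng.normal(size=(16, 1))        # replicated, updated synchronously

def worker_step(ids, labels):
    """One worker's step: async embedding update, dense gradient returned."""
    x = embedding[ids]              # lookup; may observe slightly stale rows
    err = x @ dense_w - labels      # linear model for brevity
    # Asynchronous part: apply the embedding update right away, without
    # waiting for other workers. (Duplicate ids would need per-row
    # aggregation in a real system; ignored in this sketch.)
    embedding[ids] -= 0.01 * (err @ dense_w.T)
    # Synchronous part: do NOT apply the dense gradient locally; return it
    # so it can be averaged across workers, like an all-reduce.
    return x.T @ err / len(ids)

# One synchronous iteration simulated across 4 data-parallel workers.
grads = [worker_step(rng.integers(0, 1000, size=32),
                     rng.normal(size=(32, 1)))
         for _ in range(4)]
dense_w -= 0.01 * np.mean(grads, axis=0)  # averaged (synchronous) update
```

The intuition behind the split, as described above, is that the embedding layer can tolerate the staleness introduced by asynchronous updates, which buys sample throughput, while keeping the dense part synchronous preserves statistical efficiency.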

Cloud Resources for 100 Trillion Parameter AI Models

The Persia 100-trillion-parameter AI workload runs on the following heterogeneous resources:

3,000 cores of compute-intensive Virtual Machines
8 A2 Virtual Machines with a total of 64 Nvidia A100 GPUs
30 High Memory Virtual Machines, each with 12 TB of RAM, totalling 360 TB (see the back-of-the-envelope check below)
Orchestration with Kubernetes
All resources had to be launched concurrently in the same zone to minimize network latency. Google Cloud was able to provide the required capacity with very little notice.
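As a sanity check on the 360 TB figure, here is a quick back-of-the-envelope calculation; the 2-bytes-per-parameter storage format is an assumption for illustration, not a detail stated in the post:

```python
# Rough memory estimate for the embedding parameters (assumed fp16 storage).
params = 100e12                   # 100 trillion parameters
bytes_per_param = 2               # assumption: fp16; fp32 would double this
tb = params * bytes_per_param / 1e12
print(f"~{tb:.0f} TB of raw weights")   # ~200 TB, before optimizer state
```

Even under this optimistic assumption, the raw weights alone consume a large fraction of the 360 TB pool, leaving the remainder for optimizer state and working memory.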

AI Training needs resources in bursts.

Google Kubernetes Engine (GKE) was used to orchestrate the deployment of the 138 VMs and software containers. Having the workload containerized also allows for portability and repeatability of the training.

Results and Conclusions
With the support of the Google Cloud infrastructure, the team demonstrated Persia’s scalability up to 100 trillion parameters. The hybrid distributed training algorithm introduced elaborate system relaxations for efficient utilization of heterogeneous clusters while converging as fast as vanilla SGD. Google Cloud was essential for overcoming the limitations of on-premise hardware, and proved to be an optimal computing environment for distributed machine learning training at massive scale.

Persia has been released as an open-source project on GitHub with setup instructions for Google Cloud, so that anyone from academia or industry will find it easy to train deep learning recommender models at the 100-trillion-parameter scale.

6 thoughts on “100 Trillion Parameter AI Training Models”

  1. All of the Large Model(tm) efforts lack the most critical stage:

    Parameter Distillation

    Without parameter distillation, not only do you not discover the true parameter count (which is at least an order of magnitude lower), but the models are not interpretable: you do not discover what the models are doing in a manner that humans can understand even to the first order. This is important for a variety of reasons, not the least of which is that interpretable models can contribute to the scientific process in a way that a Large Model(tm) cannot.

  2. Big tech can’t help themselves. They will screw up any kind of search-engine or recommendation-engine application. They will find the neutral AI racist, sexist and insufficiently censorious, and will find ways to add “diversity” and censorship on behalf of various governments and corporations who pay them on the side or threaten legal consequences. Thus they will subvert the very value the AI has created.

    • Though, this does leave a $100 billion note lying on the ground.

      If anyone DOES implement a neutral, effective, payola-free, apolitical search engine, they could start to eat Google’s lunch.

      And that is a very tasty lunch indeed, even if forgoing all the opportunities for corrupt monetization would mean you only get 5% of the cash Google is currently getting.

      There have been a number of attempts, but none of them seem to give me better results when I go looking for obscure stuff that I know exists, but that Google now hides on page 13.

      • On several occasions I have come across pages that I had downloaded (MHTML), or had bookmarks backed up from years ago, that I could no longer find on the internet using Google. Following the page address in the links inside the MHTML, or using the bookmark, the page still exists on the web; it has just been made unsearchable on Google.

        It’s often not even controversial things, just old.

        I have noticed a steady decline in the ability to make Google search for my specific query exactly as phrased. Quotation marks are just treated as a suggestion and you get pages and pages of irrelevant cruft before you find what you are looking for.

        They seem to have a very strong recency bias in their ranking of pages and would rather return results that don’t really match the query because they are recent rather than return an exact match of the query because it is old. Quotation marks be damned.

  3. Just when a new killer app comes along, Moore’s law is running out of steam. It must be time for a new hardware paradigm.

    • Moore’s law only gives more transistors at lower cost per transistor. That’s not interesting if they don’t scale in performance or power efficiency. The latter type of scaling is called Dennard scaling, and it ran into a wall in 2003–2006 at the latest. Silicon CMOS has been running on fumes ever since. GPUs less so, because they solve a particular class of “embarrassingly parallel” problems and don’t deal with gnarly general-purpose code well, or at all.

      It used to be that you could just shrink a transistor and it would magically be faster and lower power. Every generation now is a massive effort requiring new and interesting materials, geometries and techniques just to maintain basic transistor density scaling, while still failing on the transistors-per-dollar metric that is Moore’s law.

Comments are closed.