Red Pajama Is a 1.2 Trillion Token Large Language Model

RedPajama is a project to create a set of leading, fully open-source models. Today, the RedPajama team announced the completion of the first step of this project: the reproduction of the LLaMA training dataset of over 1.2 trillion tokens.

AI is having its Linux moment. Stable Diffusion showed that open-source can not only rival the quality of commercial offerings like DALL-E but can also lead to incredible creativity from broad participation by communities around the world. A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.

We are launching RedPajama, an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. RedPajama has three key components:

* Pre-training data, which needs to be both high quality and have broad coverage

* Base models, which are trained at scale on this data

* Instruction tuning data and models, which improve the base model to make it usable and safe

The starting point is LLaMA, which is the leading suite of open base models for two reasons: First, LLaMA was trained on a very large (1.2 trillion token) dataset that was carefully filtered for quality. Second, the 7 billion parameter LLaMA model was trained for much longer, well beyond the Chinchilla-optimal point, to ensure the best quality at that model size. A 7 billion parameter model is particularly valuable for the open community as it can run on a wide variety of GPUs, including many consumer-grade GPUs.
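For a rough sense of what "well beyond the Chinchilla-optimal point" means, the back-of-the-envelope arithmetic below applies the roughly 20-tokens-per-parameter Chinchilla heuristic to a 7 billion parameter model; the 20x ratio is an approximation of the scaling-law result, and the ~1 trillion token figure is the training budget reported for LLaMA-7B.

```python
# Back-of-the-envelope check of the Chinchilla heuristic (~20 training tokens
# per parameter) against LLaMA-7B's actual training budget. The 20x ratio is
# an approximation, not an exact figure.
params = 7e9                      # 7 billion parameters
chinchilla_tokens = 20 * params   # ~140 billion tokens would be "compute-optimal"
llama_tokens = 1.0e12             # LLaMA-7B was trained on roughly 1 trillion tokens

print(f"Chinchilla-optimal budget: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Actual LLaMA-7B budget:    ~{llama_tokens / 1e9:.0f}B tokens "
      f"({llama_tokens / chinchilla_tokens:.0f}x longer)")
```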

The RedPajama base dataset
The full RedPajama 1.2 trillion token dataset and a smaller, more consumable random sample can be downloaded through Hugging Face. The full dataset is ~5TB unzipped on disk and ~3TB to download compressed.
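For readers who want to poke at the data without downloading terabytes, here is a minimal sketch using the Hugging Face `datasets` library. The repository id and the `text`/`meta` field names are assumptions based on the published sample and may need adjusting (for example, newer `datasets` versions may require `trust_remote_code=True` for script-based datasets).

```python
# Minimal sketch: stream a few records from the smaller RedPajama sample.
# Repository id and field names are assumptions; adjust if the hosting
# location or schema differs.
from datasets import load_dataset

# Streaming avoids materializing the whole sample on disk before iterating.
sample = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                      split="train", streaming=True)

for record in sample.take(3):
    # Each record is assumed to carry raw text plus provenance metadata.
    print(record["meta"], record["text"][:200])
```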

RedPajama-Data-1T consists of seven data slices:

* CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline and filtered by several quality filters, including a linear classifier that selects for Wikipedia-like pages (a sketch of such a filter follows this list)

* C4: The standard C4 dataset

* GitHub: GitHub data, filtered by license and quality

* arXiv: Scientific articles, with boilerplate removed

* Books: A corpus of open books, deduplicated by content similarity

* Wikipedia: A subset of Wikipedia pages, with boilerplate removed

* StackExchange: A subset of popular sites under StackExchange, with boilerplate removed
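The "Wikipedia-like" filter in the CommonCrawl slice is, per the LLaMA recipe, a linear classifier trained to separate pages cited as Wikipedia references from randomly sampled crawled pages. The sketch below illustrates that idea with scikit-learn; it is not the actual CCNet/RedPajama code, and the training-file names and the 0.5 threshold are hypothetical choices.

```python
# Illustrative sketch of a "Wikipedia-like page" quality filter: a linear
# classifier separating pages cited by Wikipedia (positives) from random
# CommonCrawl pages (negatives). NOT the actual RedPajama pipeline code;
# file names and the threshold are hypothetical.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

positives = load_lines("wikipedia_reference_pages.txt")   # hypothetical file
negatives = load_lines("random_commoncrawl_pages.txt")    # hypothetical file

clf = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
clf.fit(positives + negatives, [1] * len(positives) + [0] * len(negatives))

def keep_page(text, threshold=0.5):
    """Keep a crawled page if the classifier scores it as Wikipedia-like."""
    return clf.predict_proba([text])[0, 1] >= threshold
```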

Next: Models, instructions & OpenChatKit
Having reproduced the pre-training data, the next step is to train a strong base model. As part of the INCITE program, with support from Oak Ridge Leadership Computing Facility (OLCF), we are training a full suite of models, with the first becoming available in the coming weeks.

With a strong base model in hand, we are excited to instruction tune the models. Alpaca illustrated the power of instruction tuning: with roughly 50K high-quality, diverse instructions, it unlocked dramatically improved capabilities. Via OpenChatKit, we received hundreds of thousands of high-quality natural user instructions, which will be used to release instruction-tuned versions of the RedPajama models.
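As a purely illustrative example of what instruction-tuning data looks like, the sketch below renders instruction/response pairs into plain-text training examples. The template and field names are Alpaca-style placeholders, not the exact format used by OpenChatKit or the RedPajama models.

```python
# Hedged sketch: turn instruction/response pairs (e.g. crowd-sourced via
# OpenChatKit) into plain-text examples for supervised fine-tuning.
# The prompt template and field names are illustrative only.
examples = [
    {"instruction": "Summarize the RedPajama announcement in one sentence.",
     "response": "RedPajama released a 1.2 trillion token open reproduction "
                 "of the LLaMA training dataset."},
]

TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def to_training_text(example):
    """Render one instruction/response pair as a single fine-tuning string."""
    return TEMPLATE.format(**example)

for ex in examples:
    print(to_training_text(ex))
```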

4 thoughts on “Red Pajama Is a 1.2 Trillion Token Large Language Model”

  1. LLaMA (Large Language Model Meta AI) isn't excessively cute, just memorable. Vicuna and Alpaca branch off from that as kinds of llamas. There aren't a lot of kinds of llamas, so the kinda-rhyming Pajama is next. Once you get a name that fits a class and is memorable, maybe you get a logo or mascot out of it.

    This is a lot less irritating than the alternative tendency to come up with non-cutesy but completely meaningless and hard-to-remember letter-jumble acronyms (as at NASA and in other cultures) when there is a need to give things a name, better known as NTGTAN.

    • It is what it is. I’ve been playing with ChatGPT 4 and DALL-E lately. Interesting stuff, and there are some embarrassing artifacts. I did one with Bing Image Creator based on Pink Floyd’s Animals album, “Pigs on the Wing”. Problem is, some of the pigs don’t have wings… It’s an “almost there” technology, but not quite.

  2. I have to admit, I’m starting to get to the point where excessively cutesy corporate emblems/mascots make me want to punch the screen. Seriously, they do.