The most advanced foundation models for AI are only partially open source and available only through commercial APIs. This restricts their use and limits research and customization. However, a project called RedPajama now aims to create a set of leading, fully open-source models. The first step of this project, reproducing the LLaMA training dataset, has been completed. Open-source models have made significant progress recently, and AI is experiencing a moment similar to the Linux movement. Stable Diffusion demonstrated that open-source models can compete with commercial offerings and encourage creativity through community involvement. A similar movement has now sprung up around large language models, with the release of semi-open models such as LLaMA, Alpaca, Vicuna, and Koala, as well as fully open models like Pythia, OpenChatKit, Open Assistant, and Dolly.
RedPajama is a collaborative effort between several institutions, including Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, MILA Québec AI Institute, and Together. The project aims to develop fully open, reproducible leading language models with three key components: pretraining data, base models, and instruction-tuning data and models. The project recently released the first component, pretraining data: a fully open dataset of 1.2 trillion tokens based on the LLaMA paper. RedPajama's starting point is LLaMA, the leading suite of open base models. LLaMA was trained on a large dataset carefully filtered for quality, and its 7-billion-parameter model was trained for longer to ensure the best quality at that model size. However, LLaMA and its derivatives are available for non-commercial research purposes only. RedPajama aims to reproduce LLaMA fully open source, making it available for commercial applications and providing a more transparent pipeline for research.
The RedPajama dataset is available for download on Hugging Face and consists of both the full 1.2-trillion-token dataset and a smaller random sample. The dataset comprises seven data slices: CommonCrawl, C4, GitHub, arXiv, Books, Wikipedia, and StackExchange. Each slice has undergone meticulous pre-processing and filtering to ensure quality, with the quality filters tuned to approximate the token counts reported by Meta AI in the LLaMA paper. The CommonCrawl slices were processed with the CCNet pipeline and filtered using a linear classifier to select pages resembling Wikipedia. The GitHub data was filtered by license and code quality, while the arXiv data consists of scientific articles with boilerplate removed. The Books data was deduplicated by content similarity, the Wikipedia subset had boilerplate removed, and the StackExchange subset is a selection of popular sites, also with boilerplate removed. The complete dataset is approximately 5TB uncompressed on disk and roughly 3TB to download compressed.
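For readers who want to explore the corpus directly, the release can be loaded with the Hugging Face `datasets` library. Below is a minimal sketch; it assumes the dataset is published under the `togethercomputer` namespace on the Hub and that the slice configuration names match the seven slices above.

```python
# Minimal sketch of pulling RedPajama data from the Hugging Face Hub.
# Assumes the "togethercomputer/RedPajama-Data-1T" identifiers; the small
# random sample is a practical starting point, since the full corpus is
# roughly 3TB compressed.
from datasets import load_dataset

# Load the small random sample rather than the full 1.2T-token corpus.
sample = load_dataset("togethercomputer/RedPajama-Data-1T-Sample")

# Each record carries the raw text plus metadata about its source slice.
print(sample["train"][0]["text"][:500])

# For the full dataset, streaming a single slice avoids a multi-terabyte
# download (slice config names such as "arxiv" are assumptions here).
arxiv_stream = load_dataset(
    "togethercomputer/RedPajama-Data-1T", "arxiv", streaming=True
)
```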
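The Wikipedia-resemblance filter mentioned above is, per the LLaMA paper, a linear text classifier. The following sketch illustrates the general technique using the fastText library, not the actual RedPajama pipeline; the training file `quality_train.txt` and its labels are hypothetical stand-ins.

```python
# Illustrative sketch of quality filtering with a linear classifier, in the
# spirit of the LLaMA/CCNet approach (not the actual RedPajama code).
# "quality_train.txt" is a hypothetical file of fastText-formatted lines:
#   __label__wiki    <text of a page referenced by Wikipedia>
#   __label__other   <text of a random CommonCrawl page>
import fasttext

# Train a supervised linear classifier on the labeled pages.
model = fasttext.train_supervised(input="quality_train.txt")

def looks_like_wikipedia(page_text: str, threshold: float = 0.5) -> bool:
    """Keep a page if the classifier scores it as Wikipedia-like."""
    # fastText predicts on single-line strings, so strip newlines first.
    labels, probs = model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__wiki" and probs[0] >= threshold
```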
The RedPajama project is collaborating with the Meerkat project to release a Meerkat dashboard and embeddings for interactive analysis of the GitHub subset of the corpus; installation and usage instructions can be found on GitHub. Having reproduced the pre-training data, the project's next step is to train a strong base model. The effort is supported by the Oak Ridge Leadership Computing Facility through the INCITE program, and a full suite of models will be available soon. The team is also excited to instruction-tune the models, inspired by Alpaca's success with just 50,000 diverse, high-quality instructions, and has received hundreds of thousands of natural user instructions via OpenChatKit, which will be used to release instruction-tuned versions of the RedPajama models.
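To make the instruction-tuning step concrete, here is a sketch of the kind of record such datasets typically contain and how it can be flattened into a training prompt. The field names and template follow the Alpaca convention and are purely illustrative; the article does not specify the schema RedPajama will use.

```python
# Illustrative Alpaca-style instruction record and prompt template.
# The exact schema of the forthcoming RedPajama instruction data is not
# specified here; this shows the general shape of instruction-tuning data.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "RedPajama released a 1.2-trillion-token open dataset...",
    "output": "RedPajama published an open reproduction of LLaMA's training data.",
}

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input. "
    "Write a response that completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

# Flatten the record into a single training example for the base model.
print(PROMPT_TEMPLATE.format(**record))
```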
Check out the RedPajama dataset and the RedPajama GitHub. Don't forget to join our 19k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
Niharika is a technical consulting intern at Marktechpost. She is a third-year student, currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and artificial intelligence, and an avid reader of the latest developments in these fields.