Large language models (LLMs) have become central to natural language processing (NLP) tasks such as question answering, text summarization, and few-shot learning. However, the most capable language models are often released while important aspects of their development are kept secret; in particular, the composition of the pre-training data frequently remains undisclosed even when the model itself is made publicly available.
This opacity makes it difficult to understand how the composition of the pre-training corpus shapes a model's capabilities and limitations, and it hinders both scientific progress and the general public who use these models. In a recent study, a team of researchers addresses this lack of transparency and openness. To promote openness and facilitate research on language model pre-training, the team has introduced Dolma, a large English corpus of three trillion tokens.
Dolma has been compiled from a wide range of sources, including encyclopedias, scientific publications, code repositories, public-domain literature, and web text. To encourage further experimentation and replication of their findings, the team has also made their data curation toolkit publicly available.
The team's main goal is to make language model research and development more accessible. They have highlighted multiple reasons to promote transparency and openness of data, which are as follows.
- Transparent pre-training data helps developers and users of language-model applications make better-informed decisions. The presence of task-related documents in the pre-training data has been associated with better performance on those tasks, and transparency likewise makes it possible to account for social biases present in the data.
- Research that examines how data composition affects model behavior requires access to open pre-training data. This allows the modeling community to examine and improve next-generation data curation techniques and to address issues such as training data attribution, adversarial attacks, deduplication, memorization, and benchmark contamination (a minimal deduplication sketch follows this list).
- Effective creation of open language models depends on access to data. The availability of a wide range of large-scale pre-training data is a crucial enabler for the potential functionality that newer models can offer, such as the ability to attribute generations to pre-training data.
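The deduplication issue mentioned above is concrete enough to illustrate. Below is a minimal sketch, assuming documents are dictionaries with a `text` field, of exact deduplication by hashing normalized text. It is an illustration of the kind of curation step that open pre-training data lets researchers study, not the procedure used to build Dolma.

```python
# Illustrative sketch (not the Dolma pipeline itself): exact document
# deduplication by hashing normalized text.
import hashlib
import json
from typing import Iterable, Iterator


def dedupe_documents(docs: Iterable[dict]) -> Iterator[dict]:
    """Yield each document once, keyed on a hash of its normalized text."""
    seen: set[str] = set()
    for doc in docs:
        # Collapse whitespace and lowercase so trivially different copies collide.
        normalized = " ".join(doc["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


if __name__ == "__main__":
    corpus = [
        {"id": "a", "text": "Hello   world"},
        {"id": "b", "text": "hello world"},   # duplicate after normalization
        {"id": "c", "text": "Something else"},
    ]
    for doc in dedupe_documents(corpus):
        print(json.dumps(doc))
```

Exact hashing like this only removes verbatim copies; near-duplicate detection (for example, MinHash-based approaches) is a separate, more involved step.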
The team has documented Dolma comprehensively, including a description of its contents, details of its construction, and the design principles behind it. The paper also includes analyses and experimental results from training language models on intermediate states of Dolma. These experiments clarify important data curation practices, such as the effects of content and quality filters, deduplication strategies, and the benefits of mixing data from multiple sources.
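To make the idea of a quality filter concrete, here is a minimal rule-based sketch. The specific heuristics and thresholds are assumptions for demonstration only and are not the filters used to curate Dolma.

```python
# Minimal sketch of rule-based quality filtering. Thresholds are
# illustrative assumptions, not the values used for Dolma.
def passes_quality_filter(text: str,
                          min_words: int = 50,
                          max_mean_word_len: float = 10.0,
                          min_alpha_ratio: float = 0.8) -> bool:
    words = text.split()
    if len(words) < min_words:
        return False                      # too short to be useful
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len > max_mean_word_len:
        return False                      # likely markup dumps or gibberish
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    if alpha_words / len(words) < min_alpha_ratio:
        return False                      # mostly numbers or symbols
    return True
```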
OLMo, a state-of-the-art open language model and framework, has been trained with Dolma. OLMo has been developed to advance the field of language modeling by demonstrating the usefulness and importance of the Dolma corpus. The team has summarized its main contributions as follows.
- The Dolma corpus has been released publicly: a diverse collection of three trillion tokens drawn from seven sources commonly used for large-scale language model pre-training.
- The Dolma Toolkit, a high-performance, portable tool for efficiently curating large datasets for language model pre-training, has been open-sourced. With this toolkit, practitioners can build their own data curation pipelines and reproduce Dolma's curation effort.
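To show what such a pipeline might look like end to end, the sketch below chains the quality-filter and deduplication helpers from the earlier examples over gzipped JSONL shards. The file layout, paths, and structure are hypothetical assumptions for illustration; this is not the Dolma Toolkit's actual API.

```python
# Hypothetical pipeline sketch: assumes passes_quality_filter and
# dedupe_documents from the earlier examples are defined in this module.
# Paths and layout are illustrative, not the Dolma Toolkit's API.
import gzip
import json
from pathlib import Path


def curate_shard(in_path: Path, out_path: Path) -> None:
    """Filter low-quality documents, drop exact duplicates, and write the survivors."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(in_path, "rt", encoding="utf-8") as f_in, \
         gzip.open(out_path, "wt", encoding="utf-8") as f_out:
        raw_docs = (json.loads(line) for line in f_in)
        filtered = (d for d in raw_docs if passes_quality_filter(d["text"]))
        for doc in dedupe_documents(filtered):
            f_out.write(json.dumps(doc) + "\n")


if __name__ == "__main__":
    curate_shard(Path("raw/shard_000.jsonl.gz"),
                 Path("curated/shard_000.jsonl.gz"))
```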
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.