Data sets used to train language models (LMs) are often drawn from multiple domains. For example, The Pile, a sizable publicly accessible dataset, comprises 24% web data, 9% Wikipedia, 4% GitHub, etc. The composition of the pre-training data significantly affects the performance of an LM, yet it is unclear how much of each domain should be included to produce a model that excels at a variety of downstream tasks. Existing studies establish domain weights (the sampling probabilities for each domain) using intuition or a series of downstream tasks. For example, The Pile uses heuristically chosen domain weights, which may not be the best option.
In this study, researchers at Google and Stanford University attempt to identify domain weights that yield models performing well across all domains by minimizing the worst-case loss over domains, rather than optimizing domain weights based on a collection of downstream tasks. Since each domain has a different optimal achievable loss (also known as its entropy), a naive worst-case strategy would assign more weight to the domains with the noisiest data. Moreover, existing LMs such as PaLM and GLaM adjust domain weights based on a set of downstream tasks, which involves training potentially thousands of LMs on different domain weights and risks overfitting to one particular set of downstream tasks.
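The distinction between raw worst-case loss and excess loss can be made concrete with a small numeric sketch. The domain names and loss values below are hypothetical, chosen only to illustrate the point: a high-entropy domain can have the largest raw loss while offering little room for improvement.

```python
import numpy as np

# Hypothetical per-domain losses (nats/token) for three illustrative domains.
# "noisy_web" has high irreducible entropy, so its raw loss stays large
# even when a model fits it about as well as possible.
domains = ["noisy_web", "wikipedia", "github"]
proxy_loss = np.array([4.0, 2.1, 1.8])      # loss of the model being trained
reference_loss = np.array([3.9, 1.5, 1.6])  # loss of a trained reference model

# Naive worst case: pick the domain with the highest raw loss.
naive_worst = domains[int(np.argmax(proxy_loss))]

# Excess-loss criterion: loss relative to the reference model,
# which discounts each domain's irreducible entropy.
excess = proxy_loss - reference_loss
doremi_worst = domains[int(np.argmax(excess))]

print(naive_worst)   # the naive strategy targets the noisiest domain
print(doremi_worst)  # excess loss targets the domain with most headroom
```

Here the naive criterion picks `noisy_web` (raw loss 4.0), while the excess-loss criterion picks `wikipedia` (excess 0.6), the domain where the proxy lags the reference most.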
This motivates their technique, Domain Reweighting with Minimax Optimization (DoReMi), which uses distributionally robust optimization (DRO) to tune domain weights without knowledge of the tasks to be performed later (Figure 1). DoReMi starts by conventionally training a small reference model with 280 million parameters. It then trains a small distributionally robust language model (DRO-LM) to minimize the worst-case excess loss (relative to the reference model's loss) across domains. Notably, instead of keeping the robust LM itself, the approach uses the DRO-LM framework only to optimize domain weights: DoReMi takes the domain weights produced over the course of DRO training. A large LM (8B parameters) is then trained on a new dataset defined by these domain weights.
Rather than sub-selecting examples from a minibatch, their approach uses Group DRO's online learning-based optimizer, which dynamically updates domain weights according to the loss on each domain to rescale the training objective. DoReMi then takes the domain weights averaged over the DRO training steps. They run DoReMi on 280M proxy and reference models to optimize the domain weights on The Pile and the GLaM dataset. An 8B-parameter LM, more than 30 times larger, is then trained using the DoReMi domain weights. On The Pile, DoReMi reduces perplexity across all domains relative to the baseline domain weights, even on domains it down-weights.
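The online domain-weight update can be sketched as follows. This is an illustrative re-implementation of a Group DRO-style multiplicative-weights update, not the paper's exact procedure: the step size, smoothing value, and the fixed per-domain excess losses are all assumptions made for the toy run.

```python
import numpy as np

def update_domain_weights(alpha, excess_loss, step_size=1.0, smoothing=1e-3):
    """One multiplicative-weights update of the domain distribution.

    alpha:       current domain weights (sums to 1).
    excess_loss: per-domain proxy loss minus reference loss (clipped at 0).
    """
    # Up-weight domains where the proxy model lags the reference most.
    alpha = alpha * np.exp(step_size * np.clip(excess_loss, 0.0, None))
    alpha = alpha / alpha.sum()  # renormalize to a distribution
    # Mix with the uniform distribution for stability (smoothing).
    uniform = np.ones_like(alpha) / len(alpha)
    return (1 - smoothing) * alpha + smoothing * uniform

# Toy run: three domains; weights drift toward the high-excess-loss domain.
alpha = np.ones(3) / 3
history = [alpha]
for _ in range(100):
    excess = np.array([0.1, 0.6, 0.2])  # hypothetical, fixed for illustration
    alpha = update_domain_weights(alpha, excess)
    history.append(alpha)

# The final domain weights are the average over training steps.
final_weights = np.mean(history, axis=0)
print(final_weights)
```

In this sketch, averaging over steps (rather than taking the last iterate) keeps the output weights from collapsing entirely onto the single worst domain, mirroring the averaging described above.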
On generative few-shot tasks, DoReMi improves average downstream accuracy by 6.5% over a baseline model trained with The Pile's default domain weights and reaches the baseline's downstream accuracy 2.6 times faster. They release the tuned domain weights to improve future LMs trained on The Pile. They also find that DoReMi consistently improves LM training when the sizes of the main model (trained with the optimized domain weights) and the proxy model are varied. On the GLaM dataset, where domain weights tuned on downstream tasks are available, DoReMi even improves over such tuning in downstream task performance.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.