In large language models (LLMs), the pre-training data landscape is a rich mixture of diverse sources. It ranges from common English to low-resource languages, from informal conversations to academic texts, and even extends to modalities such as images and speech. Within this mixture, data sources interact in complex ways: sometimes aligning well, sometimes diverging, and occasionally conflicting. The challenge lies in tuning the proportions of this mixture so that the strengths of each domain are leveraged while potential conflicts are minimized, allowing the resulting models to gain enhanced capabilities.
Although an ideal mixture of training data remains elusive, most existing practices adjust the mixture heuristically, upsampling high-quality or underrepresented data without revealing concrete criteria in detail. It is therefore difficult to predict whether these data strategies are effective before a training run completes. Advances in scaling laws have shown that model losses on a given evaluation set are quantitatively predictable across a wide range of variables, which suggests an interesting perspective: if this predictability also extends to mixture proportions, practitioners could estimate the performance of the resulting model before training even starts.
Researchers from Fudan University and Shanghai AI Laboratory introduced data mixing laws and a prediction pipeline, which address the problem of accurately predicting the validation loss for a mixture of training domains under a fixed model size and a fixed amount of training data. As a pilot study on predicting model losses from data mixtures, the researchers examined domain losses in two-domain mixtures: they trained 70M and 160M language models on mixtures of the Github and Pile-CC subsets of the Pile dataset at five different Github proportions. All models are trained with a batch size of 1 million tokens for 30 thousand steps, i.e., 30 billion tokens.
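The pilot study above can be sketched in code: fit a per-domain mixing law to a handful of (mix ratio, validation loss) observations, then predict the loss at unseen ratios. The exponential form `L(r) = c + k * exp(t * r)` and every number below are illustrative assumptions for the sketch, not values reported in the paper.

```python
import numpy as np

def mixing_law(r, c, k, t):
    # Hypothetical mixing law: domain validation loss as a function of
    # the Github proportion r in a two-domain mixture.
    return c + k * np.exp(t * r)

# Five synthetic (proportion, loss) pairs standing in for the five
# Github mix ratios used in the pilot study.
ratios = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
losses = mixing_law(ratios, 1.2, 1.5, -2.0)  # synthetic observations

# Fit: for each candidate t, (c, k) follow from linear least squares on
# losses = c + k * exp(t * r); keep the t with the smallest residual.
best = None
for t in np.linspace(-5.0, 5.0, 1001):
    A = np.stack([np.ones_like(ratios), np.exp(t * ratios)], axis=1)
    coef, *_ = np.linalg.lstsq(A, losses, rcond=None)
    err = float(np.sum((A @ coef - losses) ** 2))
    if best is None or err < best[0]:
        best = (err, coef[0], coef[1], t)

_, c_fit, k_fit, t_fit = best

# The fitted law predicts the loss at an unseen mix ratio with no training.
predicted = mixing_law(0.4, c_fit, k_fit, t_fit)
```

The grid search over `t` is a deliberately simple stand-in for a proper nonlinear least-squares fit; the point is that a few cheap observations pin down a smooth curve over all mixture proportions.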
This paper addresses several challenges in optimizing data mixtures: (a) discovering the quantitative predictability of model performance with respect to data mixing and summarizing it into a functional relationship, i.e., the data mixing laws; (b) proposing a pipeline that predicts the performance of large-scale training runs at different mixing ratios using only experiments on small models with limited training data, by nesting scaling laws of training steps, model sizes, and data mixing; and (c) experimentally verifying the reliability of the data mixing laws and the prediction pipeline, showing their effectiveness in optimizing model performance, balancing model capabilities, and guiding the design of data schedules.
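The nesting idea in (b) can be illustrated with the innermost law: fit a scaling law over training steps on a cheap short run, then extrapolate to a larger budget before any large run exists. The power-law form `L(S) = c + k * S**(-alpha)` and all constants here are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np

def step_law(S, c, k, alpha):
    # Hypothetical step-scaling law: loss as a function of training steps S.
    return c + k * S ** (-alpha)

def fit_step_law(steps, losses):
    # Grid-search alpha; for each candidate, (c, k) come from linear
    # least squares on losses = c + k * steps**(-alpha).
    best = None
    for alpha in np.linspace(0.05, 1.0, 96):
        A = np.stack([np.ones_like(steps), steps ** (-alpha)], axis=1)
        coef, *_ = np.linalg.lstsq(A, losses, rcond=None)
        err = float(np.sum((A @ coef - losses) ** 2))
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], alpha)
    return best[1], best[2], best[3]

# Synthetic loss curve from a short small-scale run (up to 30k steps).
steps = np.array([1e3, 3e3, 1e4, 3e4])
losses = step_law(steps, 1.8, 20.0, 0.3)

c, k, alpha = fit_step_law(steps, losses)
# Extrapolate the same run to a 10x larger step budget.
loss_at_100k = step_law(1e5, c, k, alpha)
```

In the nested pipeline, extrapolations like this one (over steps, and analogously over model sizes) supply the per-mixture losses that the outer data mixing law is then fitted on.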
To develop the loss-prediction pipeline, the researchers trained models on the RedPajama mixture and validated them against the Pile validation set. A series of 70M, 160M, 305M, and 410M models were trained for 30B tokens to fit the scaling laws of training steps and model sizes. Notably, the model trained on the optimized mixture reaches the performance of the default mixture using only 73% of its training steps, and with continued training it outperforms the default mixture by a margin equivalent to training with 48% more steps, underscoring the pipeline's effectiveness in mixture optimization.
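Once the laws are fitted, choosing a mixture becomes a cheap search over candidate proportions rather than a series of training runs. A minimal sketch, assuming a two-domain setting and illustrative placeholder coefficients (not the paper's fitted values):

```python
import numpy as np

def predicted_val_loss(r):
    # Overall validation loss modeled as a weighted sum of two per-domain
    # losses, each falling as its own domain's share (r or 1 - r) grows.
    # All coefficients are illustrative placeholders.
    loss_github = 0.8 + 1.0 * np.exp(-3.0 * r)
    loss_pile_cc = 1.1 + 1.2 * np.exp(-3.0 * (1.0 - r))
    return 0.5 * loss_github + 0.5 * loss_pile_cc

# Scan candidate Github proportions and pick the predicted optimum.
candidates = np.linspace(0.0, 1.0, 1001)
best_r = float(candidates[np.argmin(predicted_val_loss(candidates))])
best_loss = predicted_val_loss(best_r)
```

Because the two per-domain losses pull in opposite directions, the predicted optimum lands at an interior proportion rather than at either extreme, which is exactly the kind of trade-off the fitted laws are meant to resolve before training.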
In conclusion, this article introduces data mixing laws and a prediction pipeline that accurately predict the validation loss for a mixture of training domains under a fixed model size and amount of training data. Nesting the scaling laws of training steps, model sizes, and data mixing makes these predictions possible with only small-scale experiments, allowing the reuse of existing runs and reducing computational costs. This study should further facilitate quantitative studies and theoretical analysis as the focus on data engineering grows.
Review the Paper and GitHub. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final year student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.