This paper has been accepted to the Data Issues for Foundation Models workshop at ICLR 2024.
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly written. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large computational cost and duration of pre-training and because of the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model, prompted to paraphrase web documents in specific styles such as "Wikipedia-like" or "question-answer format", to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the naturally noisy C4 dataset speeds up pre-training by roughly 3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question-answering accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the rephrasing style on model performance, offering insights into how the composition of the training data can affect the performance of LLMs in out-of-distribution (OOD) settings. Our gains are attributed to the fact that the rephrased synthetic data (i) incorporates style diversity that closely mirrors downstream evaluation styles and (ii) has higher "quality" than data scraped from the web.
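As a rough illustration of the rephrase-then-mix recipe summarized above, the following Python sketch prompts an instruction-tuned model to paraphrase each web document in one of the named styles and interleaves the rephrases with the real data. The prompt wording, style keys, the `generate` callable, and the mixing fraction are illustrative assumptions for this sketch, not the paper's exact setup.

```python
# Minimal sketch of the WRAP idea: rephrase web documents with an
# instruction-tuned model and mix real and synthetic text for pre-training.
# Prompts, style names, and the 50/50 mixing ratio are assumptions.
from typing import Callable, Iterable, Iterator
import random

# Hypothetical rephrasing prompts for two of the styles mentioned in the abstract.
STYLE_PROMPTS = {
    "wikipedia": "Rewrite the following passage in a clear, Wikipedia-like style:\n\n{doc}",
    "qa": "Convert the following passage into a question-answer format:\n\n{doc}",
}


def rephrase(doc: str, style: str, generate: Callable[[str], str]) -> str:
    """Ask an off-the-shelf instruction-tuned model (wrapped by `generate`)
    to paraphrase a web document in the requested style."""
    return generate(STYLE_PROMPTS[style].format(doc=doc))


def wrap_stream(
    web_docs: Iterable[str],
    generate: Callable[[str], str],
    synthetic_fraction: float = 0.5,
) -> Iterator[str]:
    """Yield a pre-training stream that keeps every real document and,
    for a fraction of them, also emits a synthetic rephrase in a random style."""
    for doc in web_docs:
        yield doc  # real web data is kept as-is
        if random.random() < synthetic_fraction:
            style = random.choice(list(STYLE_PROMPTS))
            yield rephrase(doc, style, generate)  # synthetic rephrase
```

In practice, `generate` would wrap whatever instruction-tuned model is available; the key design choice in the abstract is that synthetic rephrases augment rather than replace the real web data.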