Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly written. Current scaling laws show that learning from such data demands large amounts of both compute and data, and these requirements grow with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pretraining, and because of the impending scarcity of high-quality data on the web. In this work, we propose Web Reformulation Augmented Pretraining (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles, such as “Wikipedia-like” or “question-answer format”, to jointly pretrain large language models on real data and its synthetic reformulations. We first show that using WRAP on the naturally noisy C4 dataset speeds up pretraining by about 3x. At the same pretraining compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question-answering accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the reformulation style on model performance, providing insights into how the composition of the training data can affect the performance of LLMs in out-of-distribution (OOD) settings. Our gains are attributed to the fact that reformulated synthetic data has higher utility than real data because it (i) incorporates style diversity that closely mirrors downstream evaluation styles, and (ii) has higher “quality” than data scraped from the web.
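To make the recipe concrete, the sketch below illustrates the reformulation step under stated assumptions: an instruction-tuned model is prompted to rewrite each web document in a chosen style, and the synthetic text is interleaved with the original for pretraining. The specific model name, prompt wording, and helper functions (`rephrase`, `build_training_mix`) are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of WRAP-style reformulation, assuming a Hugging Face
# instruction-tuned model. Model choice and prompts are illustrative.
from transformers import pipeline

# Hypothetical off-the-shelf instruction-tuned paraphraser.
paraphraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

STYLE_PROMPTS = {
    "wikipedia": "Rewrite the following passage in a high-quality, Wikipedia-like style:\n\n",
    "qa": "Convert the following passage into a question-answer format:\n\n",
}

def rephrase(document: str, style: str = "wikipedia", max_new_tokens: int = 512) -> str:
    """Paraphrase one web document in the requested style."""
    prompt = STYLE_PROMPTS[style] + document
    out = paraphraser(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline echoes the prompt by default; keep only the new text.
    return out[0]["generated_text"][len(prompt):].strip()

def build_training_mix(real_docs, style: str = "wikipedia"):
    """Yield a joint real + synthetic corpus: each web document contributes
    both its original text and a stylistic reformulation."""
    for doc in real_docs:
        yield doc                    # original (noisy) web text
        yield rephrase(doc, style)   # synthetic reformulation
```

The key design choice reflected here is that synthetic reformulations augment rather than replace the real data, so the pretraining mix retains the coverage of the web while gaining stylistic diversity and cleaner text.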