Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length (concat-and-chunk). Recent attention implementations mask cross-document attention, reducing the effective length of a chunk of tokens. Furthermore, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable-sequence-length training technique, to address these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same length extracted from a single document. During training, we use variable sequence length and batch size, sampling from all buckets simultaneously with a schedule. Unlike the concat-and-chunk baseline, which incurs a fixed attention cost at every training step, our proposed method incurs a computational cost proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k-context-length 1B model at the same cost as a 2k-context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly improves performance on standard language evaluations and long-context benchmarks, reaching target accuracy with up to 6x faster training compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Finally, we shed light on a critical yet less studied aspect of training large language models: the distribution and scheduling of sequence lengths, which results in a non-negligible difference in performance.
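To make the bucketing idea concrete, the following is a minimal Python sketch of dataset decomposition as described above: each tokenized document is split into chunks, the chunks are grouped into buckets by length, and each optimization step draws a batch from a single bucket with a batch size chosen to keep the number of tokens per step roughly constant. The power-of-two chunk lengths, the constant token budget, the uniform bucket choice, and the names `decompose_document`, `build_buckets`, and `sample_batch` are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

def decompose_document(tokens, max_len=8192):
    """Split one tokenized document into power-of-two-length chunks
    (an assumed binary decomposition of the document length), so that
    every chunk comes from a single document."""
    chunks = []
    pos = 0
    remaining = len(tokens)
    while remaining > 0:
        # Largest power of two not exceeding the remaining length, capped at max_len.
        size = min(1 << (remaining.bit_length() - 1), max_len)
        chunks.append(tokens[pos:pos + size])
        pos += size
        remaining -= size
    return chunks

def build_buckets(documents, max_len=8192):
    """Group chunks into buckets keyed by their (power-of-two) length."""
    buckets = defaultdict(list)
    for doc in documents:
        for chunk in decompose_document(doc, max_len):
            buckets[len(chunk)].append(chunk)
    return buckets

def sample_batch(buckets, tokens_per_step=65536):
    """Pick one bucket (uniformly here; a length-based schedule could bias
    this choice) and return a batch whose size keeps the total number of
    tokens per optimization step approximately constant."""
    seq_len = random.choice(list(buckets.keys()))
    batch_size = max(1, tokens_per_step // seq_len)
    batch = random.sample(buckets[seq_len], min(batch_size, len(buckets[seq_len])))
    return seq_len, batch
```

Because every sequence in a batch shares the same length and comes from a single document, no cross-document attention masking is needed, and the attention cost of each step scales with the sampled sequence length rather than a fixed target length.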