Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length (concat-and-chunk). Recent attention implementations mask cross-document attention, reducing the effective length of a chunk of tokens. Furthermore, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable-sequence-length training technique, to address these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same length extracted from a single document. During training, we use variable sequence length and batch size, sampling from all buckets simultaneously with a schedule. Unlike the concat-and-chunk baseline, which incurs a fixed attention cost at every training step, our proposed method incurs a computational cost proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k-context-length 1B model at the same cost as a 2k-context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly improves performance on standard language evaluations and long-context benchmarks, reaching target accuracy with up to 6x faster training compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Finally, we shed light on a critical yet less studied aspect of training large language models: the distribution and scheduling of sequence lengths, which results in a non-negligible difference in performance.
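To make the bucketing idea concrete, the following is a minimal Python sketch of dataset decomposition as described above: each tokenized document is split into chunks, the chunks are grouped into buckets by length, and each optimization step draws a batch from a single bucket with a batch size chosen to keep the number of tokens per step roughly constant. The power-of-two chunk lengths, the constant token budget, the uniform bucket choice, and the names `decompose_document`, `build_buckets`, and `sample_batch` are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

def decompose_document(tokens, max_len=8192):
    """Split one tokenized document into power-of-two-length chunks
    (an assumed binary decomposition of the document length), so that
    every chunk comes from a single document."""
    chunks = []
    pos = 0
    remaining = len(tokens)
    while remaining > 0:
        # Largest power of two not exceeding the remaining length, capped at max_len.
        size = min(1 << (remaining.bit_length() - 1), max_len)
        chunks.append(tokens[pos:pos + size])
        pos += size
        remaining -= size
    return chunks

def build_buckets(documents, max_len=8192):
    """Group chunks into buckets keyed by their (power-of-two) length."""
    buckets = defaultdict(list)
    for doc in documents:
        for chunk in decompose_document(doc, max_len):
            buckets[len(chunk)].append(chunk)
    return buckets

def sample_batch(buckets, tokens_per_step=65536):
    """Pick one bucket (uniformly here; a length-based schedule could bias
    this choice) and return a batch whose size keeps the total number of
    tokens per optimization step approximately constant."""
    seq_len = random.choice(list(buckets.keys()))
    batch_size = max(1, tokens_per_step // seq_len)
    batch = random.sample(buckets[seq_len], min(batch_size, len(buckets[seq_len])))
    return seq_len, batch
```

Because every sequence in a batch shares the same length and comes from a single document, no cross-document attention masking is needed, and the attention cost of each step scales with the sampled sequence length rather than a fixed target length.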