A 128,000-token context window lets large language models take on tasks beyond current paradigms, such as understanding repository-level code, modeling long-history dialogue, and powering autonomous agents. The recent Needle-in-a-Haystack test has become a popular way to check whether models can actually use long context: a short "needle" sentence is placed at an arbitrary position inside a roughly 128K-token document, and the model is asked to precisely repeat the information in that sentence.
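To make the setup concrete, here is a minimal sketch of how such a probe can be constructed. The needle text, filler text, and token-to-word ratio are illustrative assumptions, not the exact materials used in the paper.

```python
# Minimal sketch of a Needle-in-a-Haystack probe (illustrative, not the paper's exact setup).
NEEDLE = "The special magic number mentioned in this document is 42."   # hypothetical needle
FILLER = "The sky was clear and the city went about its ordinary business. "  # hypothetical filler

def build_haystack(context_tokens: int = 128_000, depth: float = 0.5,
                   tokens_per_word: float = 1.3) -> str:
    """Build a long document with the needle buried at a relative depth in [0, 1]."""
    n_words = int(context_tokens / tokens_per_word)            # rough word budget for ~128K tokens
    base_words = FILLER.split()
    filler_words = (base_words * (n_words // len(base_words) + 1))[:n_words]
    insert_at = int(len(filler_words) * depth)                 # where the needle goes
    doc_words = filler_words[:insert_at] + NEEDLE.split() + filler_words[insert_at:]
    return " ".join(doc_words)

def make_prompt(depth: float) -> str:
    doc = build_haystack(depth=depth)
    return f"{doc}\n\nWhat is the special magic number mentioned in the document?"

# Sweep needle depths; a model "passes" only if it retrieves the needle at every depth.
prompts = [make_prompt(d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
```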
A recent study by researchers at the University of Edinburgh, MIT-IBM Watson AI Lab, University of Washington, MIT, University of Melbourne, Ohio State University, and UIUC examines data engineering techniques for extending the context length of language models. They continually pre-train a base model on carefully chosen data mixtures so that it passes the Needle-in-a-Haystack test at 128K. At first glance, continual pre-training with full attention on much longer sequences (the authors train on context lengths of 64K to 80K) may seem prohibitively expensive, given that most existing models are trained on contexts shorter than 4K and that attention has quadratic complexity.
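A quick back-of-the-envelope comparison shows where that worry comes from; the numbers below only illustrate the quadratic scaling of attention with sequence length, not the paper's actual training budget.

```python
# Back-of-the-envelope: per-token attention work grows with sequence length,
# so attention work per sequence grows quadratically (constants omitted).
def attention_cost(seq_len: int) -> int:
    return seq_len * seq_len

for L in (4_000, 64_000, 80_000):
    ratio = attention_cost(L) / attention_cost(4_000)
    print(f"context {L:>6}: ~{ratio:,.0f}x the attention compute of a 4K context")
# context  4,000: ~1x; context 64,000: ~256x; context 80,000: ~400x
```

The catch, as the rest of the article explains, is that only a small amount of long-sequence continual pre-training turns out to be needed.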
The team's base models are LLaMA-2 7B and 13B. Apart from adjusting the base of RoPE (the rotary position embedding), they do not alter the model architecture in any major way.
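Enlarging the RoPE base slows the rotation of the low-frequency dimensions so that positions tens of thousands of tokens apart still map to distinguishable angles. The sketch below shows that adjustment in isolation; the enlarged base value is an assumed illustration, not necessarily the paper's exact setting.

```python
# Minimal sketch of RoPE inverse-frequency computation with an enlarged base (illustrative values).
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Inverse frequencies used by rotary position embeddings."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

head_dim = 128                                        # per-head dimension in LLaMA-2 7B
orig_freq = rope_inv_freq(head_dim, base=10_000.0)    # original LLaMA-2 base
long_freq = rope_inv_freq(head_dim, base=1_000_000.0) # enlarged base (assumed value for illustration)

# The lowest frequency becomes much smaller, stretching the usable positional range.
print(orig_freq[-1].item(), long_freq[-1].item())
```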
Most of their focus is on the data recipe needed to train a model that succeeds on the Needle-in-a-Haystack test at a 128K context length. The researchers postulate that, even for models pre-trained on much shorter 4K contexts, the ability to use information at arbitrary positions within an extended context is already (largely) learned during pre-training. In contrast to this hypothesis, existing work relies on continual pre-training over massive datasets (on the order of 400 billion tokens) to instill long-context modeling capability, an approach that can be nearly as expensive as pre-training from scratch.
In this study, the team demonstrates that by continually pre-training on a small amount of long-context data, in this case 1-5 billion tokens, a 7B model can be "unlocked" to perform accurate retrieval over context lengths far beyond its original pre-training. Furthermore, they show that previous studies neglected the need to upsample long sequences while maintaining the domain mixture of the pre-training corpora, even though this is critical for context scaling. Most prior work, exemplified by LongChat 32K and YaRN Mistral 128K, simply upsamples domains that naturally contain long sequences, such as books, since those domains supply the long-range dependencies. But, as the authors argue, this obvious recipe is not the best one: it degrades perplexity on other domains. To achieve more consistent improvement, it is better to keep the same domain mixing ratio as the pre-training mix and upsample long sequences within each domain (see the sketch below).
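A minimal sketch of this sampling idea follows. The domain names, mixture weights, length threshold, and boost factor are all hypothetical; only the two-step logic, keeping per-domain ratios fixed while upsampling long documents inside each domain, reflects the recipe described above.

```python
# Sketch: keep the original domain mixing ratio, but upsample long documents within each domain.
import random

# Hypothetical pre-training mixture weights (must sum to 1).
DOMAIN_WEIGHTS = {"web": 0.67, "code": 0.15, "books": 0.08, "wiki": 0.05, "papers": 0.05}
LONG_THRESHOLD = 32_000   # tokens; what counts as a "long" document (assumed)
LONG_BOOST = 5.0          # how much more likely a long document is to be drawn (assumed)

def sample_document(corpus: dict[str, list[dict]]) -> dict:
    """corpus maps domain -> list of docs, each doc a dict with a 'num_tokens' field."""
    # Step 1: pick the domain with the *original* pre-training ratio, so the domain
    # mixture is unchanged and short-context performance is preserved.
    domain = random.choices(list(DOMAIN_WEIGHTS), weights=list(DOMAIN_WEIGHTS.values()))[0]
    docs = corpus[domain]
    # Step 2: within the chosen domain, upsample long documents to expose long-range dependencies.
    weights = [LONG_BOOST if d["num_tokens"] >= LONG_THRESHOLD else 1.0 for d in docs]
    return random.choices(docs, weights=weights)[0]
```

The naive alternative would instead raise the weight of book-like domains directly, which is the shortcut the authors identify as the source of degradation elsewhere.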
Compared against strong baselines such as YaRN-Mistral 128K and LongLoRA 100K, the findings show that this data strategy is the key reason the team's solution improves performance on long-context tasks while preserving short-context performance.
On the retrieval task, the team believes their approach closes the gap to state-of-the-art models such as GPT-4 128K and lays the groundwork for future research on long-context instruction tuning.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.