Context length is the maximum number of tokens a model can take into account while generating text. A longer context window lets the model capture long-range dependencies: it can connect ideas that sit far apart in the text and produce more globally coherent output.
During training, the model processes text in fixed-length chunks, or windows. To take advantage of a large context, it must be trained on long texts: the training sequences need to draw on documents, books, articles, and similar sources that run to thousands of tokens.
The length of these training sequences therefore sets an upper bound on the context the model can actually use.
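As a rough sketch of that chunking step, the snippet below splits a tokenized corpus into fixed-length training windows; `make_training_windows`, `token_ids`, and `context_length` are hypothetical names used only for illustration.

```python
def make_training_windows(token_ids: list[int], context_length: int) -> list[list[int]]:
    """Chop a long token stream into non-overlapping windows of `context_length` tokens."""
    windows = []
    for start in range(0, len(token_ids) - context_length + 1, context_length):
        windows.append(token_ids[start:start + context_length])
    return windows

# A 10,000-token document yields 19 full windows of 512 tokens (the remainder is dropped).
print(len(make_training_windows(list(range(10_000)), 512)))  # -> 19
```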
So why don’t we train models on longer sequences?
Not so fast.
Increasing the context length increases the number of possible token combinations that the model must learn to accurately predict.
This allows for more robust long-range modeling, but also requires more memory and processing power, resulting in higher training costs.
Without any optimization, attention computation scales quadratically with the context length, meaning that a 4096-token model requires 64 times more of it than a 512-token model.
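A quick back-of-the-envelope check of that ratio: full attention compares every token with every other token, so its cost grows with the square of the sequence length.

```python
# Cost of full attention grows as n**2 pairwise token comparisons.
for n in (512, 4096):
    print(n, n ** 2)           # 512 -> 262144, 4096 -> 16777216
print(4096 ** 2 / 512 ** 2)    # -> 64.0, i.e. 64x more attention computation
```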
You can use sparse or approximate attention methods to reduce the computational cost, but they can also hurt the model's accuracy.
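One common sparse pattern is local (sliding-window) attention, where each token only attends to nearby neighbors. The mask below is a minimal illustrative sketch, not any particular library's implementation, and the window size is arbitrary.

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry (i, j) is True if position i may attend to position j."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=8, window=3)
print(int(mask.sum()), "allowed pairs out of", mask.size)  # 44 out of 64 in this tiny example
```

With a fixed window, the number of allowed pairs grows linearly with sequence length rather than quadratically, which is where the savings come from, at the price of discarding some long-range interactions.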
Training and using long-context language models presents three main challenges:
- Fit long contexts into the model.
- Speed up training and inference so they remain practical.
- Ensure high-quality inference that actually makes use of the full context.
The attention mechanism is the central component of transformer models. It relates different positions in a sequence to compute the sequence's representation, letting the model focus on the relevant parts of the text and understand it better. Scaling transformers to longer sequences is hard precisely because of the quadratic complexity of full attention.
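To make that quadratic term concrete, here is a minimal NumPy sketch of scaled dot-product attention (illustrative only, not an optimized or library implementation); the intermediate score matrix has shape (seq_len, seq_len), and it is this matrix that grows quadratically with context length.

```python
import numpy as np

def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # (seq_len, seq_len): the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted sum of value vectors

seq_len, d_model = 16, 64
q = k = v = np.random.randn(seq_len, d_model)
print(scaled_dot_product_attention(q, k, v).shape)    # (16, 64); the score matrix was (16, 16)
```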