Context length is the maximum number of tokens a model can take into account while generating text. A longer context window lets the model capture long-range dependencies: it can connect ideas that sit far apart in the text and produce more globally coherent output.
During training, the model processes text in fixed-length chunks, or windows. To take advantage of a large context, it must be trained on long texts: the training sequences need to draw on documents, books, articles, and similar sources that run to thousands of tokens.
The length of these training sequences therefore sets an upper bound on the context the model can actually use.
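As a rough sketch of that chunking step, the snippet below splits a tokenized corpus into fixed-length training windows; `make_training_windows`, `token_ids`, and `context_length` are hypothetical names used only for illustration.

```python
def make_training_windows(token_ids: list[int], context_length: int) -> list[list[int]]:
    """Chop a long token stream into non-overlapping windows of `context_length` tokens."""
    windows = []
    for start in range(0, len(token_ids) - context_length + 1, context_length):
        windows.append(token_ids[start:start + context_length])
    return windows

# A 10,000-token document yields 19 full windows of 512 tokens (the remainder is dropped).
print(len(make_training_windows(list(range(10_000)), 512)))  # -> 19
```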
So why don’t we train models on longer sequences?
Not so fast.
Increasing the context length increases the number of possible token combinations that the model must learn to accurately predict.
This allows for more robust long-range modeling, but also requires more memory and processing power, resulting in higher training costs.
Without any optimization, attention computation scales quadratically with the context length, meaning that a 4096-token model requires 64 times more of it than a 512-token model.
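A quick back-of-the-envelope check of that ratio: full attention compares every token with every other token, so its cost grows with the square of the sequence length.

```python
# Cost of full attention grows as n**2 pairwise token comparisons.
for n in (512, 4096):
    print(n, n ** 2)           # 512 -> 262144, 4096 -> 16777216
print(4096 ** 2 / 512 ** 2)    # -> 64.0, i.e. 64x more attention computation
```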
You can use sparse or approximate attention methods to reduce the computational cost, but they can also hurt the model's accuracy.
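One common sparse pattern is local (sliding-window) attention, where each token only attends to nearby neighbors. The mask below is a minimal illustrative sketch, not any particular library's implementation, and the window size is arbitrary.

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry (i, j) is True if position i may attend to position j."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=8, window=3)
print(int(mask.sum()), "allowed pairs out of", mask.size)  # 44 out of 64 in this tiny example
```

With a fixed window, the number of allowed pairs grows linearly with sequence length rather than quadratically, which is where the savings come from, at the price of discarding some long-range interactions.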
Training and using long-context language models presents three main challenges:
- Fit long contexts into the model.
- Speed up training and inference so they remain practical.
- Ensure high-quality inference that actually makes use of the full context.
The attention mechanism is the central component of transformer models. It relates different positions in a sequence to compute the sequence's representation, letting the model focus on the relevant parts of the text and understand it better. Scaling transformers to longer sequences is hard precisely because of the quadratic complexity of full attention.
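To make that quadratic term concrete, here is a minimal NumPy sketch of scaled dot-product attention (illustrative only, not an optimized or library implementation); the intermediate score matrix has shape (seq_len, seq_len), and it is this matrix that grows quadratically with context length.

```python
import numpy as np

def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # (seq_len, seq_len): the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted sum of value vectors

seq_len, d_model = 16, 64
q = k = v = np.random.randn(seq_len, d_model)
print(scaled_dot_product_attention(q, k, v).shape)    # (16, 64); the score matrix was (16, 16)
```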