Starting at a high level, Transformers require two inputs: token embeddings and positional encodings. Token embeddings come from a tokenizer such as tiktoken, which uses a fixed vocabulary size to assign a unique ID to each token. Through training, the model learns an embedding vector for each of these IDs, and those vectors carry the information it needs to generate the next token.
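To make this concrete, here is a minimal sketch of the tokenize-then-embed step, assuming the tiktoken package and a NumPy stand-in for the learned embedding table (the encoding name, dimension, and random weights are purely illustrative):

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")               # fixed-vocabulary tokenizer
token_ids = enc.encode("Transformers need positions")    # text -> integer token IDs

vocab_size, d_model = enc.n_vocab, 64
rng = np.random.default_rng(0)
# In a real model this matrix is learned during training; here it is random.
embedding_table = rng.standard_normal((vocab_size, d_model), dtype=np.float32)

token_embeddings = embedding_table[token_ids]            # shape: (num_tokens, d_model)
print(token_ids, token_embeddings.shape)
```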
In addition to embeddings, we also need positional information to tell the LLM where a token sits in a sentence. The equations above show the most abstract view of how positional information can be conveyed: we have 3 functions, 1 for each element of the token, and 2 word-embedding vectors (x_m and x_n, where m and n denote the positions the two tokens occupy in the sequence).
One approach is to simply create a brand-new vector for each token you see, so that every position is perfectly unique. Naturally, the downside is that completely unrelated vectors make it difficult for the model to see similarities in the training data, which degrades performance.
A second approach is to create, for each token, a vector that has some built-in similarity to the other position vectors. This way we still capture information about how similar one position is to another. However, since these vectors can collide, this methodology can introduce confusion.
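As a toy illustration of the trade-off between the two approaches (the dimensions, random vectors, and sinusoidal formula here are my own assumptions, not anything from the paper), here is a quick comparison of how similar neighboring positions look under each scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, max_len = 64, 16

# Approach 1 (illustrative): an unrelated random vector per position.
# Every position is perfectly distinguishable, but nothing ties position 3 to position 4.
unique_positions = rng.standard_normal((max_len, d_model))

# Approach 2 (illustrative): sinusoidal vectors, where nearby positions get
# similar vectors -- at the cost of distant positions partially "colliding".
i = np.arange(d_model // 2)
freqs = 1.0 / (10000 ** (2 * i / d_model))
pos = np.arange(max_len)[:, None]
sinusoidal_positions = np.concatenate([np.sin(pos * freqs), np.cos(pos * freqs)], axis=-1)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Neighbors look unrelated in the first scheme, but clearly similar in the second.
print(cos_sim(unique_positions[3], unique_positions[4]))
print(cos_sim(sinusoidal_positions[3], sinusoidal_positions[4]))
```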
How do we find the best combination of these approaches?
The industry has largely settled on RoPE as a way to get the best of both worlds. Without getting too deep into the math, RoPE uses sinusoidal functions to assign positional values to tokens. Since sinusoidal functions are periodic by design, some positional values will be very similar to others. Consequently, positions that are close will have a quantitative value indicating how similar they are.
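Here is a minimal sketch of that rotation, assuming NumPy and the standard pairing of dimensions; it is a simplified illustration rather than a production implementation:

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Apply a RoPE-style rotation to one token vector at a given position.

    x: 1-D vector with an even number of dimensions.
    This sketches only the rotation itself, not a full attention layer.
    """
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)        # per-pair rotation frequencies
    angles = position * theta             # angle grows with the token's position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]             # pair up the dimensions
    rotated = np.empty_like(x)
    rotated[0::2] = x1 * cos - x2 * sin   # 2-D rotation of each pair
    rotated[1::2] = x1 * sin + x2 * cos
    return rotated

q = np.random.default_rng(0).standard_normal(64)
print(rope_rotate(q, position=10)[:4])
```

Because each pair of dimensions is rotated by an angle proportional to the position, the dot product between two rotated vectors ends up depending on their relative distance, which is exactly the similarity signal we wanted.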
As you can see from the equation above, we have a sparse matrix filled with sinusoidal functions of the value θ, which is passed in as a way to keep all the positional encodings related to one another.
The exact way these θ are related is shown below:
The most critical part of this equation for context size is the value 10,000. As we try to create larger contexts without an infinite number range, that base of 10,000 becomes a limiting factor; after all, there are only so many distinct positional vectors you can create with that number as a base.
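To see the limitation numerically, here is a small sketch (the context lengths and dimension count are illustrative) of the angle the slowest-rotating dimension reaches at the original context length versus an extended one:

```python
import numpy as np

d, base = 128, 10000.0
i = np.arange(d // 2)
theta = base ** (-2.0 * i / d)          # rotation frequency per dimension pair

train_len, target_len = 4096, 32768     # illustrative context lengths
# Angle reached by the slowest-rotating dimension at each length.
print(train_len * theta[-1], target_len * theta[-1])
# At the target length, the slow dimensions reach angles the model never saw
# during training, which is why naively using longer positions degrades quality.
```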
While you could train a new model from scratch with a larger base value for your positional encodings, a few things generally prevent people from doing this. First, training from scratch carries a huge cost, and only a handful of organizations in the world currently have the resources to bear it. Second, it is incredibly difficult to find a large volume of high-quality long-form text: since training requires billions of tokens, finding that much quality long-form data is a major challenge.
Consequently, researchers have proposed different methodologies to stretch RoPE to larger context lengths.
The first method is linear Position Interpolation (PI), where you expand the number of possible positions by scaling theta down by some factor λ. The following equation uses β to represent the θ^(2/d) expression that connected all the thetas before.
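A rough sketch of the idea, with illustrative lengths and dimensions, might look like this: every position is squeezed by a single shared factor so that the extended range maps back onto the angles seen in training.

```python
import numpy as np

def pi_angles(position, d=128, base=10000.0, train_len=4096, target_len=32768):
    """Position Interpolation sketch: squeeze positions back into the trained range."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    scale = train_len / target_len      # every dimension shares one linear factor
    return (position * scale) * theta   # equivalent to scaling each theta term down

# Position 32768 now produces the same angles that position 4096 did during training.
print(np.allclose(pi_angles(32768), pi_angles(4096, target_len=4096)))
```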
While this works, the authors of the paper point out that there is a crowding effect: after the reduction, positional information is packed into a narrower range and some of it ends up being lost.
The second method is YaRN (Yet another RoPE extensioN method), where we divide the RoPE dimensions into 3 groups and assign a different linear factor to each of them. The basic idea is that dimensions rotating at high frequency should not be modified (their λ := 1), while lower-frequency dimensions should be interpolated. In the graph below, we can see that this works well for expanding up to a 128k context length. The catch is determining the groupings: the group boundaries are chosen by hand, so suboptimal decisions can be made that reduce performance.
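A rough sketch of the grouping idea is below; the cut-off points here are arbitrary assumptions for illustration, whereas the real method derives its boundaries from the rotation wavelengths relative to the training length.

```python
import numpy as np

def yarn_like_scales(d=128, base=10000.0, scale=8.0, fast_cut=32, slow_cut=96):
    """Sketch of the YaRN idea: per-dimension interpolation factors in three groups.

    fast_cut / slow_cut are hand-chosen boundaries (the weakness noted above);
    they are assumptions for this sketch, not the paper's values.
    """
    i = np.arange(d // 2)
    lam = np.ones(d // 2)                     # group 1: high-frequency dims, untouched
    lam[slow_cut // 2:] = scale               # group 3: low-frequency dims, fully interpolated
    ramp = np.linspace(1.0, scale, slow_cut // 2 - fast_cut // 2)
    lam[fast_cut // 2: slow_cut // 2] = ramp  # group 2: a linear blend between the two
    theta = base ** (-2.0 * i / d)
    return theta / lam                        # effective per-dimension frequencies

print(yarn_like_scales()[:4], yarn_like_scales()[-4:])
```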
Therefore, while both YaRN and linear Position Interpolation (PI) work, they have limitations that hold them back. LongRoPE takes the best of each idea and finds a clever way to combine them.
The LongRoPE researchers realized that to improve on previous methods, they needed to introduce two key ideas: (1) the distribution of good λ is irregular, so searching for λ is better than assuming a correct answer, and (2) there is a subset of tokens whose positions should simply not be changed.
Both ideas appear in the following formula. To find the optimal λ, they created a loss function that they could minimize. The formula below is a reformatted version of RoPE, with the indicator 𝕀 and the term n/β^i together determining the scaling applied to our positional vector. When they find the smallest loss, they choose the corresponding λ.
The indicator function 𝕀 is how we mark the subset of tokens that should not be modified: when it evaluates to 1, the positional encodings remain the same. To keep the search tractable, they only considered n̂ values of {0, 1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 64, 128, 256}. The higher the value of n̂, the more tokens keep their original positional encodings.
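Putting the two ideas together, a sketch of the rescaled encoding might look like the following; the per-dimension factors here are made up for illustration, whereas the paper finds them by searching for the combination with the lowest loss.

```python
import numpy as np

def longrope_like_angles(position, lam, n_hat, d=128, base=10000.0):
    """Sketch of LongRoPE-style rescaling: per-dimension factors plus untouched early tokens.

    lam:   one rescale factor per dimension pair (assumed given here; searched in the paper).
    n_hat: the first n_hat positions keep their original encodings (indicator = 1).
    """
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    if position < n_hat:                 # indicator: leave early tokens untouched
        return position * theta
    return (position / lam) * theta      # otherwise apply the per-dimension factors

lam = np.linspace(1.0, 8.0, 64)          # illustrative factors only
print(longrope_like_angles(2, lam, n_hat=4)[:3])
print(longrope_like_angles(5000, lam, n_hat=4)[:3])
```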
Now that we've covered the theory, let's look at the results!
LongRoPE works both with and without fine-tuning. The graph above shows the performance of LongRoPE when applied to LLaMA2-7B, whose original context was 4k. By finding the optimal λ, they were able to expand the context window to 32,000 tokens without a noticeable change in perplexity. The incredible thing is that the computation required for this search is almost negligible compared to fine-tuning costs. An 8x expansion without a huge compute expense is remarkable.
To obtain a truly large expansion, a combination of fine-tuning and searching for the optimal λ is required. The paper's authors achieved a 512x expansion with this methodology. They first extended the model to context sizes of 128k and 256k, fine-tuning for 400 steps with the 128k factors and then switching to the 256k factors for an additional 600 steps. Since this worked better than simply fine-tuning at 256k directly, it seems that learning a more general distribution, rather than just one of the scales, gives better performance. They then searched again for the best λ and arrived at a context window of 2048k, a 512x increase over the original 4k context window!
One of the difficulties with a larger context window is a loss of performance on tasks with short contexts. This behavior has been seen before, and the theory is that positional information gets condensed into a smaller range, causing attention to lose some precision.
They solved this in the 2048k-context model by also finding the ideal λ for shorter lengths (4k and 8k in the paper). During inference, if the context is determined to be short, the LLM dynamically switches to the smaller λ for its positional encodings.
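A sketch of what that dynamic switch could look like at inference time follows; the λ sets and length thresholds below are placeholders, not the paper's actual values.

```python
import numpy as np

# Illustrative λ sets; the paper searches for one per target length.
LAMBDA_BY_LENGTH = {
    4_096: np.ones(64),                    # short contexts: (near) original RoPE
    8_192: np.linspace(1.0, 2.0, 64),
    2_048_000: np.linspace(1.0, 512.0, 64),
}

def pick_lambda(seq_len):
    """Pick the smallest searched λ set whose target length covers the input."""
    for target in sorted(LAMBDA_BY_LENGTH):
        if seq_len <= target:
            return LAMBDA_BY_LENGTH[target]
    return LAMBDA_BY_LENGTH[max(LAMBDA_BY_LENGTH)]

print(pick_lambda(1_000).max(), pick_lambda(100_000).max())
```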
LLMs are excellent at reasoning and continue to surprise us with their real-world applications. With a larger context window, especially one that can be obtained at limited cost while still performing well, we will only see their applications grow.
An interesting question is whether dynamically computed positional encodings are the way of the future. If you can fine-tune a model on multiple positional encodings and get quality performance with 2 values of λ, then we may end up with one model that seamlessly switches between multiple λs at inference time.
One of the things I find most interesting about the LLM space is the potential to sift through data. While the Internet has done an incredible job of democratizing access to information, it has unfortunately also flooded our lives with noise. Much of what we are shown online has almost no consequence for us. With a tool that can separate the important information from the mundane and even the harmful, we can use the Internet to its full potential.
With larger context windows, the LLM's ability to summarize and condense information can be put to even greater effect. There may even come a time when breakthroughs are made by giving LLMs two seemingly disparate sets of information and having them reason their way to something new from the premises of each.
It's an exciting time to build.