Ordered sequences such as text, audio, and code rely on positional information to convey meaning. The attention mechanism at the core of large language models (LLMs) built on the Transformer architecture has no inherent notion of order and treats a sequence as a set. Position encoding (PE) addresses this by assigning an embedding vector to each position, making it a crucial component of LLMs. PE methods, both absolute and relative, are integral to LLMs and work with a variety of tokenization schemes. However, because what a token represents varies, counting tokens alone cannot reliably target positions of interest, such as a particular word or sentence, in a sequence.
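To make the standard setup concrete, here is a minimal PyTorch sketch of a learned absolute PE layer (the class and argument names are illustrative, not taken from any particular model): each integer position owns an embedding vector that is simply added to the token embedding.

```python
import torch
import torch.nn as nn

class LearnedAbsolutePE(nn.Module):
    """Learned absolute position encoding: one embedding per integer position,
    added to the token embeddings before attention."""
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, dim)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, dim)
        positions = torch.arange(token_emb.size(1), device=token_emb.device)
        return token_emb + self.pos_emb(positions)  # broadcasts over the batch dimension
```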
Initially, attention mechanisms did not require PE because they were paired with RNNs, which are inherently sequential. Memory Networks introduced PE alongside attention, using learnable embedding vectors for relative positions. PE became standard with the Transformer architecture, where both absolute and relative variants were explored, and several modifications followed, such as simplified bias terms. Relative PE is favored in many recent LLMs, with RoPE as a widely adopted variant that rotates query and key vectors by position-dependent angles and therefore leaves the attention operation itself unmodified. CoPE also contextualizes the measurement of position; unlike an RNN-based counter, it preserves the parallel training of Transformers, which keeps training efficient.
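For reference, the common "rotate-half" formulation of RoPE can be sketched as follows (the function name and base value are illustrative assumptions); applied to both queries and keys before the dot product, it makes their interaction depend only on the relative offset between positions.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate vectors in x (seq_len, dim) by position-dependent angles ("rotate-half" form)."""
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per dimension pair, decaying geometrically with the pair index.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)       # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs    # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
```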
Meta researchers present Contextual Position Encoding (CoPE), which determines token positions based on their context vectors. For each query, CoPE computes gate values over previous tokens from their key vectors; summing these gates yields fractional position values, which are turned into position embeddings by interpolating between the learned embeddings of integer positions (see the sketch below). These positional terms are then incorporated into the attention operation. CoPE excels at toy tasks such as counting and selective copying, outperforming token-based PE methods, particularly in out-of-domain settings. On language modeling with Wikipedia text and code, CoPE consistently delivers better perplexity, highlighting its real-world applicability.
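A minimal PyTorch sketch of this gate-and-sum step, following the paper's description (the tensor names and single-head setup are illustrative assumptions):

```python
import torch

def cope_fractional_positions(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Context-dependent positions for one attention head.
    q, k: (seq_len, dim). Returns p of shape (seq_len, seq_len), where p[i, j]
    is the fractional position of key j relative to query i (sum of gates over tokens j..i)."""
    seq_len = q.size(0)
    causal = torch.tril(torch.ones(seq_len, seq_len))   # attend only to past and current tokens
    gates = torch.sigmoid(q @ k.T) * causal             # gate in (0, 1) per query-key pair
    # p[i, j] = sum over t = j..i of gates[i, t]: a reversed cumulative sum along the key axis.
    return gates.flip(-1).cumsum(-1).flip(-1)
```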
In CoPE, position measurement is context-dependent: a gate value is computed for each query-key pair, and because the gates are differentiable, the positions can be learned through backpropagation. The position of a token is obtained by summing the gate values between it and the current token. This generalizes relative PE, since the gates can learn to count words, sentences, or other concepts rather than just tokens. Unlike token positions, CoPE position values can be fractional, so the corresponding position embedding is obtained by interpolating between the embeddings of the neighboring integer positions. CoPE's effectiveness is demonstrated on toy tasks and real-world language modeling, where it outperforms token-based PE methods. Even state-of-the-art LLMs with standard position encodings fail at tasks requiring precise counting, indicating the need for more flexible position addressing techniques such as CoPE.
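Since the positions are fractional, the positional contribution to the attention logits can be obtained by interpolating between neighboring integer positions. The sketch below assumes the values z[i, n] = q_i · e_n have already been precomputed for each integer position n of a learned embedding table e (names and shapes are again illustrative):

```python
import torch

def cope_position_logits(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Interpolate precomputed position logits at fractional positions.
    p: (seq_len, seq_len) fractional positions from the gate cumsum.
    z: (seq_len, max_pos + 1), where z[i, n] = q_i . e_n for integer positions n.
    Returns the positional term to be added to the attention logits."""
    p = p.clamp(max=z.size(1) - 1)               # cap at the size of the position embedding table
    lo, hi = p.floor().long(), p.ceil().long()   # neighboring integer positions
    w = p - p.floor()                            # fractional part = interpolation weight
    z_lo = torch.gather(z, 1, lo)                # logits at the lower integer position
    z_hi = torch.gather(z, 1, hi)                # logits at the upper integer position
    return (1 - w) * z_lo + w * z_hi             # linear interpolation between the two
```

The interpolated term is added to the query-key score before the softmax, so the gates, and hence the positions themselves, receive gradients during training.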
Among the compared methods, absolute PE performs worst on language modeling. CoPE outperforms relative PE and improves further when combined with it, underscoring its effectiveness on general language modeling tasks. On code data, CoPE improves perplexity over absolute PE and RoPE by 17% and 5%, respectively. Combining RoPE with CoPE improves over RoPE alone but does not surpass CoPE on its own. This underscores CoPE's ability to use context to improve modeling, particularly in structured domains such as code.
The article presents CoPE, a position encoding method that measures position contextually rather than by counting tokens. This approach offers greater flexibility in positional addressing and yields performance improvements on a variety of tasks in the text and code domains. CoPE's potential extends to domains such as video and speech, where token-based positions may be even less appropriate. Future work could train larger models with CoPE and evaluate them on downstream tasks to further assess its effectiveness and applicability.
Review the Paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.