The Transformer architecture has been widely adopted across research and industry. Its most significant limitation is the quadratic complexity of the attention operation, which makes large models increasingly difficult to apply to longer inputs. This study demonstrates that a single Nvidia GTX 1080Ti GPU can process streams of more than 1 million tokens by combining a simple token-based memory scheme with pretrained Transformer models such as BERT.
The study of synthetic tasks is a first step toward generalizing the Recurrent Memory Transformer (RMT) to problems with unknown properties, such as language modeling. Since the Transformer gained popularity, a large number of studies have addressed long inputs. This work shows that large amounts of memory are not always needed when using Transformers to parse long texts: a recurrent approach combined with memory can reduce quadratic complexity to linear. Furthermore, models trained on sufficiently long inputs can generalize to texts orders of magnitude longer. The authors plan to adapt the recurrent memory approach in future work to increase the effective context size of the most commonly used Transformers.
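To make the linear-versus-quadratic scaling concrete, the snippet below is a rough, illustrative cost comparison rather than figures from the paper: a single full-attention pass over n tokens costs on the order of n squared, while processing the same input segment by segment with a small fixed memory costs roughly (n / segment) x (segment + memory) squared, which grows linearly in n. The segment and memory sizes are assumptions for the sketch.

# Illustrative attention-cost comparison (assumption-based sketch, not figures from the paper).
def full_attention_cost(n_tokens: int) -> int:
    # Pairwise attention over the whole input at once: grows as n^2.
    return n_tokens ** 2

def recurrent_memory_cost(n_tokens: int, segment_len: int = 512, n_memory: int = 10) -> int:
    # Process the input segment by segment; each pass attends only over
    # (segment + memory) tokens, so the total cost grows linearly with n.
    n_segments = -(-n_tokens // segment_len)  # ceiling division
    return n_segments * (segment_len + n_memory) ** 2

for n in (4_096, 65_536, 1_048_576):
    print(f"{n:>9} tokens: full={full_attention_cost(n):.2e}  recurrent={recurrent_memory_cost(n):.2e}")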
Researchers at DeepPavlov, the Artificial Intelligence Research Institute (AIRI), and the London Institute for Mathematical Sciences make the following contributions:
1. They enhance BERT with token-based memory storage and segment-level recurrence, following the Recurrent Memory Transformer (RMT) approach (see the sketch after this list).
2. They show that the memory-augmented BERT can be trained to handle tasks on sequences up to seven times longer than its 512-token input limit.
3. They find that the trained RMT extrapolates effectively to tasks of various lengths, including those exceeding 1 million tokens, with computation that scales linearly.
4. Using attention pattern analysis, they identify the memory operations RMT uses to successfully handle extraordinarily long sequences.
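Below is a minimal, hedged sketch of what segment-level recurrence with memory tokens might look like in PyTorch. It uses a generic torch.nn.TransformerEncoder as a stand-in for BERT, and all names and sizes (RecurrentMemorySketch, n_mem, segment_len, and so on) are illustrative assumptions rather than the authors' actual implementation.

import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    # Sketch of RMT-style segment-level recurrence: a small set of learnable
    # memory tokens is prepended to every segment, and the encoder's outputs
    # at those positions become the memory passed to the next segment.
    def __init__(self, d_model: int = 256, n_mem: int = 10, segment_len: int = 128):
        super().__init__()
        self.segment_len = segment_len
        self.memory = nn.Parameter(torch.randn(n_mem, d_model))  # initial memory state
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for BERT

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, d_model), already token-embedded
        batch = embeddings.size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for start in range(0, embeddings.size(1), self.segment_len):
            segment = embeddings[:, start:start + self.segment_len]
            # Prepend the current memory to the segment and run one encoder pass.
            hidden = self.encoder(torch.cat([mem, segment], dim=1))
            # Updated memory = hidden states at the memory positions.
            mem = hidden[:, : mem.size(1)]
            outputs.append(hidden[:, mem.size(1):])
        return torch.cat(outputs, dim=1)

# Usage: process a long (already embedded) sequence segment by segment.
model = RecurrentMemorySketch()
x = torch.randn(1, 1024, 256)
print(model(x).shape)  # torch.Size([1, 1024, 256])

The key design choice this sketch illustrates is that only the hidden states at the memory positions are carried between segments, so each forward pass is bounded by the segment size regardless of the total input length.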
In conclusion, the authors apply recurrent memory to BERT, one of the most successful Transformer-based models in natural language processing. Using the Recurrent Memory Transformer architecture, they extend the model's effective context length to an unprecedented two million tokens while retaining high memory retrieval accuracy. Their approach allows information to flow across segments of the input stream through recurrence and enables the storage and processing of both local and global information. Their experiments demonstrate the effectiveness of the method, which has great potential to improve long-term dependency handling in natural language understanding and generation tasks, and to enable large-scale context processing for memory-intensive applications.
Check out the Paper.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.