Large Language Models (LLMs) have greatly improved the state of the art in various comprehension and generation tasks, revolutionizing natural language processing. Most LLMs benefit from self-supervised training on large corpora, gathering knowledge from a fixed-size local context and displaying emergent abilities such as zero-shot prompting, in-context learning, and chain-of-thought (CoT) reasoning. However, the input length limit of current LLMs prevents them from generalizing to real-world applications, such as long-horizon planning, where the ability to handle long-form material beyond a fixed-size context window is crucial.
The most straightforward answer to the length-limit problem is simply to increase the input context length. GPT-3, for example, increases the input length from GPT-2's 1k to 2k tokens to capture longer-range dependencies. However, dense attention over the full context is severely constrained by the quadratic computational complexity of Transformer self-attention, and this approach typically requires expensive training from scratch. Another emerging line of research, most of which still requires training from scratch, focuses on sparse in-context attention to avoid the quadratic cost of self-attention.
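To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch (our own illustration, not from the paper); the head count and fp32 assumption are arbitrary:

```python
# Self-attention scores have one entry per pair of tokens, so memory for the
# score matrix grows quadratically with sequence length n.
def attention_score_entries(seq_len: int, num_heads: int = 16) -> int:
    """Entries in a single layer's attention score tensor for one sequence."""
    return num_heads * seq_len * seq_len

for n in (1_024, 2_048, 65_536):
    gb = attention_score_entries(n) * 4 / 1e9  # 4 bytes per fp32 entry
    print(f"n={n:>6}: {gb:8.2f} GB of attention scores per layer")
```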
The Memorizing Transformer (MemTRM) is a notable example: it approximates sparse in-context attention by attending densely to both in-context tokens and memorized tokens retrieved from a non-differentiable memory. By scaling the resulting language model to handle up to 65k tokens, MemTRM achieves significant perplexity gains when modeling long books and documents. However, MemTRM's coupled memory design, which uses a single model both to encode and to fuse memory for language modeling, suffers from memory staleness during training. In other words, as model parameters are updated, older representations cached in memory drift away from the distribution of the current model, which diminishes the benefit of memory augmentation.
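As a rough illustration of what retrieval from a non-differentiable memory bank involves, the sketch below performs a simple dot-product top-k lookup over cached keys in NumPy; the chunk granularity and scoring details used by MemTRM and LONGMEM are simplified away, and the function name is our own.

```python
import numpy as np

def retrieve_top_k(query: np.ndarray, memory_keys: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k cached keys most similar to the query.

    query:       (d,)   attention query for the current token/chunk
    memory_keys: (N, d) keys cached from previously seen context
    """
    scores = memory_keys @ query      # dot-product similarity
    return np.argsort(-scores)[:k]    # indices of the top-k matches

# Toy usage: 1,000 cached keys of dimension 64, one query.
rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 64))
q = rng.standard_normal(64)
print(retrieve_top_k(q, keys))
```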
In this work, researchers from UCSB and Microsoft Research propose the LONGMEM framework, which enables language models to cache long-form context or knowledge in a non-differentiable memory bank and leverage it through a decoupled memory module, addressing the memory staleness issue. They design a novel residual side network (SideNet) to achieve this decoupled memory. A frozen backbone LLM extracts the attention keys and corresponding values from the previous context and stores them in the memory bank. In SideNet's memory-augmented layer, the attention query derived from the current input retrieves the cached keys and values of previous contexts, and the retrieved memory is then fused into the hidden states of the current context through a joint-attention mechanism.
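A minimal PyTorch sketch of this joint-attention step is shown below: the current queries attend over the concatenation of retrieved memory keys/values and local in-context keys/values. The shapes, names, and plain softmax fusion are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_attention(q, local_k, local_v, mem_k, mem_v):
    """Attend jointly over the local context and retrieved memory.

    q:                (batch, heads, q_len, d)    queries from the current input
    local_k, local_v: (batch, heads, ctx_len, d)  current-context keys/values
    mem_k, mem_v:     (batch, heads, mem_len, d)  keys/values retrieved from the
                                                  frozen backbone's memory bank
    """
    d = q.size(-1)
    k = torch.cat([mem_k, local_k], dim=2)   # joint key set
    v = torch.cat([mem_v, local_v], dim=2)   # joint value set
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v
```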
Newly added residual cross-network connections between the SideNet and the frozen backbone LLM enable better knowledge transfer from the pre-trained backbone. By continually training only the residual SideNet to retrieve and fuse long-context memory, the pre-trained LLM can be adapted to exploit that memory. Their decoupled memory design brings two main advantages. First, separating the frozen backbone LLM from the trainable SideNet isolates memory retrieval and fusion from the encoding of previous inputs into memory.
This effectively resolves the memory staleness problem, since the backbone LLM serves only as a long-context knowledge encoder while the residual SideNet acts as the memory retriever and reader. Second, directly adapting the LLM with memory augmentations is computationally inefficient and prone to catastrophic forgetting. Because the backbone LLM remains frozen during the memory-augmented adaptation stage, LONGMEM retains access to its pre-trained knowledge while avoiding catastrophic forgetting. Depending on the downstream task, LONGMEM can load different kinds of long-form text and knowledge into the memory bank.
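The sketch below captures this decoupled setup in a few lines of PyTorch: the backbone is frozen so its cached representations never go stale, and each SideNet layer fuses the corresponding frozen backbone layer's hidden state through a residual connection. Class and parameter names are our own placeholders, not the released LONGMEM code.

```python
import torch
import torch.nn as nn

class SideNetLayer(nn.Module):
    """One residual SideNet layer fusing a frozen backbone layer's output."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, side_hidden: torch.Tensor, backbone_hidden: torch.Tensor) -> torch.Tensor:
        # Residual cross-network connection: add the (projected) frozen
        # backbone hidden state to the trainable SideNet stream.
        return side_hidden + self.proj(backbone_hidden)

def freeze_backbone(backbone: nn.Module) -> None:
    # The backbone only encodes past context into key/value memory,
    # so it receives no gradient updates (no staleness, no forgetting).
    for p in backbone.parameters():
        p.requires_grad_(False)
```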
They focus on two illustrative cases: memory-augmented in-context learning with thousands of task-relevant demonstration examples, and language modeling with whole-book contexts. They evaluate how well the proposed LONGMEM performs on various long-text language modeling tasks and on memory-augmented in-context learning for language understanding. Experimental results show that their model consistently outperforms strong baselines in both long-text modeling and in-context learning. Their approach substantially improves the LLM's ability to model long-context language, reducing perplexity by 1.38 to 1.62 points across different length splits of the Gutenberg-2022 corpus.
Remarkably, their model far surpasses previous strong baselines, achieving state-of-the-art performance of 40.5% identification accuracy on ChapterBreak, a challenging long-context modeling benchmark. Finally, compared with MemTRM and baselines without memory augmentation, LONGMEM shows strong in-context learning gains on common NLU tasks.
Check out the Paper and GitHub link. Don't forget to join our 24k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.