Large language models (LLMs) have gained significant capabilities, with leading models reaching GPT-4 level performance. However, deploying these models for applications that require extensive context, such as repository-level coding and understanding hour-long videos, poses substantial challenges. These tasks demand input contexts ranging from 100K to 10M tokens, a significant jump from the standard 4K token limit. Researchers are grappling with an ambitious goal: how can 1M-context production-level transformers be served as cost-effectively as their 4K counterparts? The main hurdle in serving long-context transformers is the size of the KV cache. For example, a 30+B parameter model with a 100K context requires a staggering 22.8GB of KV cache, compared to just 0.91GB for a 4K context, highlighting how memory requirements grow roughly linearly with context length and quickly come to dominate GPU memory.
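Because the cache stores a key and a value vector per layer, per KV head, per token, its size scales linearly with context length. The short sketch below reproduces figures of this order for a hypothetical 34B-class configuration (60 layers, 8 grouped-query KV heads of dimension 128, fp16); these constants are illustrative assumptions rather than the paper's exact setup.

```python
# Back-of-envelope KV cache sizing. The model constants below are assumptions
# chosen to represent a 34B-class model with grouped-query attention; they are
# illustrative, not the paper's exact configuration.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Total key + value cache size in bytes (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# 60 layers, 8 KV heads of dimension 128, fp16 storage.
for context_len in (4_096, 100_000):
    gib = kv_cache_bytes(60, 8, 128, context_len) / 2**30
    print(f"{context_len:>7}-token context -> ~{gib:.2f} GiB of KV cache")
```

With these assumed constants the script prints roughly 0.94 GiB for a 4K context and about 23 GiB at 100K tokens, in line with the figures quoted above.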
To address the challenges of serving long-context transformers, a researcher from the University of Edinburgh has developed a concurrent programming framework for quantitatively analyzing the efficiency issues that arise when serving multiple long-context requests under limited high-bandwidth GPU memory (HBM). The framework uses a 34B GPT-3.5 level model with a 50K context on an A100 NVLink GPU as a representative example. The analysis reveals four key deployment challenges arising from the large KV cache: extended prefill time and memory usage for long inputs, restricted concurrent user capacity due to HBM occupancy, increased decoding latency from frequent KV cache access, and significant context-switch latency when swapping the KV cache between HBM and DDR memory. This framework enables researchers to evaluate existing solutions and explore how they might be combined into end-to-end systems that efficiently serve long-context language models.
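To make those four bottlenecks concrete, here is a minimal, deliberately simplified cost model in the spirit of the paper's quantitative framework. Every constant below (A100 memory capacity and bandwidth, PCIe bandwidth, model and per-request KV cache sizes) is a rough assumption, and the formulas keep only the dominant memory-bound terms.

```python
# A simplified, memory-bound cost model for serving long-context requests.
# Hardware constants are rough assumptions for an A100-80GB class GPU.
HBM_CAPACITY_GB   = 80      # total GPU memory
HBM_BW_GBPS       = 2000    # ~2 TB/s HBM bandwidth
PCIE_BW_GBPS      = 25      # effective GPU <-> CPU transfer bandwidth
MODEL_WEIGHTS_GB  = 68      # ~34B parameters in fp16
KV_PER_REQUEST_GB = 11.4    # e.g. a 50K-token context for the model above

# 1) Concurrency: how many requests' KV caches fit alongside the weights.
concurrency = int((HBM_CAPACITY_GB - MODEL_WEIGHTS_GB) / KV_PER_REQUEST_GB)

# 2) Decoding: each generated token must stream the weights plus this
#    request's KV cache from HBM (weight reads amortize across a batch,
#    which this single-request estimate ignores).
decode_latency_ms = (MODEL_WEIGHTS_GB + KV_PER_REQUEST_GB) / HBM_BW_GBPS * 1000

# 3) Context switching: swapping one KV cache between HBM and CPU DDR over PCIe.
context_switch_s = KV_PER_REQUEST_GB / PCIE_BW_GBPS

# 4) Prefilling is compute-bound (attention plus linear layers over the full
#    input) and is not modeled by these memory-only terms.

print(f"concurrent users: {concurrency}")
print(f"per-token decode latency: ~{decode_latency_ms:.1f} ms (memory-bound estimate)")
print(f"KV cache swap time: ~{context_switch_s:.2f} s per direction")
```

Even this crude model shows why the KV cache is the central object: it simultaneously caps concurrency, inflates per-token decode latency, and makes context switching expensive.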
The study focuses on KV cache compression along four dimensions: layer, head, token, and hidden. For the layer dimension, the researchers hypothesize that some tasks may not require full-depth computation, allowing layers to be skipped during prefilling. This approach could potentially reduce the KV cache to just one layer, achieving a compression ratio of 1/60. In the head dimension, studies suggest that certain heads specialize in retrieval and long-context capabilities. By retaining only these crucial heads and pruning the others, significant compression can be achieved. For example, some research indicates that as few as 20 out of 1024 heads could be sufficient for retrieval tasks.
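As a concrete illustration of head-dimension compression, the sketch below slices a per-layer KV cache down to a handful of heads. The tensor shapes and head indices are hypothetical; identifying which heads actually carry retrieval behaviour is the hard part that the studies mentioned above address.

```python
import torch

# Hypothetical cached keys/values for one layer:
# shape = (batch, num_kv_heads, seq_len, head_dim)
batch, num_heads, seq_len, head_dim = 1, 32, 4096, 128
k_cache = torch.randn(batch, num_heads, seq_len, head_dim)
v_cache = torch.randn(batch, num_heads, seq_len, head_dim)

# Indices of heads assumed (hypothetically) to specialize in retrieval;
# in practice these would come from analyzing attention patterns.
retrieval_heads = torch.tensor([3, 7, 19, 28])

# Keep only those heads' entries; the rest of the cache is dropped.
k_small = k_cache[:, retrieval_heads]
v_small = v_cache[:, retrieval_heads]

compression = k_small.numel() / k_cache.numel()
print(f"kept {len(retrieval_heads)}/{num_heads} heads -> {compression:.1%} of the original cache")
```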
Token dimension compression is based on the hypothesis that if a token’s information can be inferred from its context, it can be compressed by dropping or merging it with neighboring tokens. However, this dimension appears less compressible than layers or heads, with most works showing a compression ratio below 50%. The hidden dimension, already small at 128, has seen limited exploration beyond quantization techniques. The researchers suggest that applying dimension reduction techniques such as LoRA to the KV cache could yield further improvements. The framework also considers the relative cost between prefilling and decoding, noting that as models get larger and context lengths increase, the cost shifts from decoding to prefilling, emphasizing the need to optimize both aspects for efficient deployment over long contexts.
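To ground the token-dimension idea from the previous paragraph, the following sketch keeps a recent window of tokens plus the positions with the largest accumulated attention mass, in the spirit of heavy-hitter eviction methods; the shapes, scoring signal, and retention budget here are all illustrative assumptions rather than a specific published recipe.

```python
import torch

def evict_tokens(k_cache, v_cache, attn_scores, keep_recent=1024, keep_heavy=1024):
    """Illustrative token-dimension eviction: keep a recent window plus the
    older tokens with the largest accumulated attention mass. Assumed shapes:
    (num_heads, seq_len, head_dim) for the caches, (seq_len,) for the scores;
    real systems track such statistics per head and per layer."""
    seq_len = k_cache.shape[1]
    recent = torch.arange(seq_len - keep_recent, seq_len)
    older_scores = attn_scores[: seq_len - keep_recent]
    heavy = torch.topk(older_scores, k=min(keep_heavy, older_scores.numel())).indices
    keep = torch.cat([heavy.sort().values, recent])
    return k_cache[:, keep], v_cache[:, keep]

# Toy usage with random data: an 8K-token cache reduced to 2K kept positions.
heads, seq_len, dim = 8, 8192, 128
k = torch.randn(heads, seq_len, dim)
v = torch.randn(heads, seq_len, dim)
scores = torch.rand(seq_len)  # stand-in for accumulated attention per token
k2, v2 = evict_tokens(k, v, scores)
print(f"kept {k2.shape[1]}/{seq_len} tokens ({k2.shape[1] / seq_len:.1%})")
```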
The research presents a comprehensive analysis of the challenges of deploying long-context transformers, with the goal of making 1M-token context serving as cost-effective as 4K serving. Achieving this would democratize advanced AI applications such as video understanding and generative agents. The study introduces a concurrent programming framework that breaks user interaction performance down into four key metrics: concurrency, prefilling, decoding, and context switching. By examining how various factors affect these metrics and reviewing existing optimization efforts, the research highlights significant opportunities to integrate current approaches into robust end-to-end long-context serving systems. This work lays the foundation for full-stack optimization of long-context inference.
Review the Paper. All credit for this research goes to the researchers of this project.
Asjad is a consulting intern at Marktechpost. He is pursuing a Bachelor's degree in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine Learning and Deep Learning enthusiast who is always researching applications of Machine Learning in the healthcare domain.