Large language model (LLM) inference has two phases: the prompt (or prefill) phase to generate the first token and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since the KV-cache is designed to leverage the causal attention map, we minimize computation and communication automatically. Second, since the KV-cache already exists for the extension phase, KV-Runahead is straightforward to implement. We further propose context-level load balancing to handle the uneven KV-cache generation caused by causal attention and to optimize TTFT. Compared with existing parallelization schemes such as tensor or sequential parallelization, where keys and values are generated locally and exchanged via collectives, our experimental results demonstrate that KV-Runahead delivers speedups of more than 1.4× and 1.6× for Llama 7B and Falcon 7B, respectively.
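To make the load-balancing motivation concrete: under causal attention, a later chunk of the prompt must attend to all earlier tokens, so an even split of tokens across processes leaves the last process with the most work. The sketch below is a toy Python illustration of this idea only, not the paper's implementation; the function names and the quadratic-cost heuristic are assumptions made for the example.

```python
# Toy sketch of context-level load balancing for parallel prefill.
# Under causal attention, prefilling tokens [s, e) of an L-token prompt
# costs roughly sum_{t=s}^{e-1} (t + 1) attention work, so equal-size
# chunks leave later ranks with more work. Choosing boundaries so that the
# cumulative cost is equal per rank (b_i ~ L * sqrt(i / p)) evens this out.
# Illustrative heuristic only; not KV-Runahead's actual partitioning.

import math


def balanced_context_partition(num_tokens: int, num_procs: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges, one per process, with roughly
    equal causal-attention cost rather than equal token counts."""
    boundaries = [round(num_tokens * math.sqrt(i / num_procs))
                  for i in range(num_procs + 1)]
    boundaries[-1] = num_tokens  # guard against rounding drift
    return [(boundaries[i], boundaries[i + 1]) for i in range(num_procs)]


def chunk_cost(start: int, end: int) -> int:
    """Approximate causal-attention cost of prefilling tokens [start, end)."""
    return sum(t + 1 for t in range(start, end))


if __name__ == "__main__":
    ranges = balanced_context_partition(num_tokens=4096, num_procs=4)
    for rank, (s, e) in enumerate(ranges):
        print(f"rank {rank}: tokens [{s}, {e}), size {e - s}, cost ~{chunk_cost(s, e)}")
```

Running the sketch shows that the first rank receives far more tokens than the last, yet each rank's approximate attention cost is nearly identical, which is the effect context-level load balancing aims for when minimizing TTFT.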