This paper was accepted at the Workshop on Efficient Systems for Foundation Models at ICML 2024
Inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of the prompt and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage can become a bottleneck in the generation process. Whether all prompt tokens are essential for generating the first token remains an open question. To answer it, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important to the next-token prediction in both the prefilling and decoding stages. Unlike static pruning approaches that prune the prompt all at once, LazyLLM allows language models to dynamically select different subsets of context tokens at different generation steps, even if they were pruned in previous steps. Extensive experiments on standard datasets across a variety of tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly speed up generation without any fine-tuning. For example, on the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34× while maintaining accuracy.
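To make the idea of dynamic token selection concrete, the following is a minimal sketch (not the authors' implementation) of attention-guided pruning: at a given layer, prompt tokens are ranked by the attention the final position pays to them, only the top fraction is kept for subsequent KV computation, and the pruned positions are remembered so later steps could reselect them. The tensor shapes, the `keep_ratio` parameter, and the function name are illustrative assumptions.

```python
# Sketch of attention-guided dynamic token pruning (illustrative, not the paper's code).
import torch


def select_tokens(attn_weights: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Return indices of prompt tokens to keep, ranked by the attention
    that the final (next-token-predicting) position pays to each token.

    attn_weights: [num_heads, seq_len, seq_len] attention probabilities.
    """
    # Importance of token j = attention from the last query position, averaged over heads.
    importance = attn_weights[:, -1, :].mean(dim=0)              # [seq_len]
    num_keep = max(1, int(keep_ratio * importance.numel()))
    keep_idx = importance.topk(num_keep).indices.sort().values   # preserve original order
    return keep_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    num_heads, seq_len = 8, 16
    # Stand-in attention probabilities for one layer (each row sums to 1).
    attn = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)

    kept = select_tokens(attn, keep_ratio=0.5)
    pruned = [i for i in range(seq_len) if i not in set(kept.tolist())]
    print("kept tokens:  ", kept.tolist())
    print("pruned tokens:", pruned)
    # Deeper layers (or later decoding steps) would recompute importance and may
    # re-select tokens pruned here, e.g. by caching their hidden states.
```

Because the selection is recomputed at each step, a token dropped during prefilling is not lost: if a later decoding step assigns it high importance, its KV can be computed on demand, which is what distinguishes this dynamic scheme from one-shot static prompt pruning.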