This paper was accepted at the Workshop on Efficient Systems for Foundation Models at ICML 2024
Inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of the prompt and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage can become a bottleneck in the generation process. Whether all prompt tokens are essential for generating the first token remains an open question. To answer it, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important to the next-token prediction in both the prefilling and decoding stages. Unlike static pruning approaches that prune the prompt all at once, LazyLLM allows language models to dynamically select different subsets of context tokens at different generation steps, even if they were pruned in previous steps. Extensive experiments on standard datasets across a variety of tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly speed up generation without any fine-tuning. For example, on the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34× while maintaining accuracy.
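To make the idea of dynamic token selection concrete, the following is a minimal sketch (not the authors' implementation) of attention-guided pruning: at a given layer, prompt tokens are ranked by the attention the final position pays to them, only the top fraction is kept for subsequent KV computation, and the pruned positions are remembered so later steps could reselect them. The tensor shapes, the `keep_ratio` parameter, and the function name are illustrative assumptions.

```python
# Sketch of attention-guided dynamic token pruning (illustrative, not the paper's code).
import torch


def select_tokens(attn_weights: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Return indices of prompt tokens to keep, ranked by the attention
    that the final (next-token-predicting) position pays to each token.

    attn_weights: [num_heads, seq_len, seq_len] attention probabilities.
    """
    # Importance of token j = attention from the last query position, averaged over heads.
    importance = attn_weights[:, -1, :].mean(dim=0)              # [seq_len]
    num_keep = max(1, int(keep_ratio * importance.numel()))
    keep_idx = importance.topk(num_keep).indices.sort().values   # preserve original order
    return keep_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    num_heads, seq_len = 8, 16
    # Stand-in attention probabilities for one layer (each row sums to 1).
    attn = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)

    kept = select_tokens(attn, keep_ratio=0.5)
    pruned = [i for i in range(seq_len) if i not in set(kept.tolist())]
    print("kept tokens:  ", kept.tolist())
    print("pruned tokens:", pruned)
    # Deeper layers (or later decoding steps) would recompute importance and may
    # re-select tokens pruned here, e.g. by caching their hidden states.
```

Because the selection is recomputed at each step, a token dropped during prefilling is not lost: if a later decoding step assigns it high importance, its KV can be computed on demand, which is what distinguishes this dynamic scheme from one-shot static prompt pruning.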