Accelerating LLM inference is an important ML research problem, since autoregressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN draft model and combines beam search with dynamic tree attention to speed up LLM token generation, generating up to 3.5 tokens per generation step for open source models and outperforming previous speculative decoding techniques.
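To make the idea concrete, here is a minimal, self-contained sketch of ReDrafter-style speculative decoding: a cheap draft model proposes candidate continuations via beam search, and the target model verifies them, accepting the longest matching prefix. Everything below is an illustrative stand-in, not Apple's or NVIDIA's implementation; in particular, real ReDrafter batches the verification passes using dynamic tree attention rather than looping as shown here.

```python
# Toy sketch of speculative decoding with a draft model and beam search.
# All model functions and parameters here are illustrative assumptions.
import numpy as np

VOCAB = 32  # toy vocabulary size

def target_logits(context: list[int]) -> np.ndarray:
    """Stand-in for one expensive target-model forward pass."""
    rng = np.random.default_rng(hash(tuple(context)) % (2**32))
    return rng.normal(size=VOCAB)

def draft_logits(context: list[int]) -> np.ndarray:
    """Stand-in for the cheap draft head: a noisy copy of the target."""
    rng = np.random.default_rng(hash(tuple(context)) % (2**32))
    return rng.normal(size=VOCAB) + 0.5 * rng.normal(size=VOCAB)

def beam_search_draft(context, beam_width=4, draft_len=4):
    """Draft several candidate continuations with a simple beam search."""
    beams = [(0.0, [])]
    for _ in range(draft_len):
        expanded = []
        for score, toks in beams:
            logits = draft_logits(context + toks)
            logp = logits - np.logaddexp.reduce(logits)
            for t in np.argsort(logp)[-beam_width:]:
                expanded.append((score + logp[t], toks + [int(t)]))
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return [toks for _, toks in beams]

def verify_and_accept(context, candidates):
    """Greedy verification: accept the longest candidate prefix matching the
    target model's own greedy choices. Shown as one target pass per position;
    ReDrafter batches these passes via dynamic tree attention."""
    best = []
    for cand in candidates:
        accepted = []
        for tok in cand:
            greedy = int(np.argmax(target_logits(context + accepted)))
            if greedy != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # Always emit at least one token from the target model itself.
    if not best:
        best = [int(np.argmax(target_logits(context)))]
    return best

context = [1, 2, 3]
for _ in range(5):
    context += verify_and_accept(context, beam_search_draft(context))
print(context)
```

Because the draft model is far cheaper than the target model, every additional token accepted per step translates almost directly into higher tokens-per-second throughput.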
Productionizing ReDrafter with NVIDIA TensorRT-LLM
This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM supports numerous open source LLMs and the Medusa speculative decoding method, ReDrafter's beam search and tree attention algorithms rely on operators that had never been used in previous applications. To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, considerably improving TensorRT-LLM's ability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter's accelerated token generation in their production LLM applications with TensorRT-LLM.
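Tree attention is what such operators make efficient. As a rough illustration (not TensorRT-LLM's actual operators or data structures), the sketch below merges beam candidates that share prefixes into a token tree and builds the ancestor-only attention mask that lets all candidates be verified in a single batched target-model pass:

```python
# Illustrative sketch of the tree-attention idea behind ReDrafter verification.
import numpy as np

candidates = [[5, 9, 2], [5, 9, 7], [5, 1, 3]]  # beams sharing the prefix [5]

# Flatten the candidates into unique tree nodes keyed by (parent, token),
# so shared prefixes are represented (and later computed) exactly once.
nodes, parent = [], []
index = {}  # (parent_id, token) -> node id
for cand in candidates:
    p = -1  # -1 means "attach to the current context"
    for tok in cand:
        key = (p, tok)
        if key not in index:
            index[key] = len(nodes)
            nodes.append(tok)
            parent.append(p)
        p = index[key]

# Tree attention mask: node i may attend to node j iff j is an ancestor
# of i (or i itself), mirroring causal attention along each tree branch.
n = len(nodes)
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    j = i
    while j != -1:
        mask[i, j] = True
        j = parent[j]

print(nodes)               # deduplicated tokens: [5, 9, 2, 7, 1, 3]
print(mask.astype(int))    # per-node ancestor visibility
```

With this mask, one forward pass scores every candidate branch at once instead of running a separate pass per beam, which is where the speedup over naive verification comes from.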
Benchmarking a tens-of-billions parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These benchmark results indicate that this technology could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
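As a back-of-the-envelope check on how per-step acceptance translates into throughput (the numbers below are assumptions chosen to illustrate the relationship, not our measured benchmark internals):

```python
# Illustrative arithmetic only: acceptance and overhead values are assumed.
tokens_per_step = 2.8  # assumed average tokens accepted per target-model pass
step_overhead = 1.05   # assumed relative cost of a draft+verify step vs. a plain step
speedup = tokens_per_step / step_overhead
print(f"~{speedup:.1f}x tokens per second")  # ~2.7x, consistent with Figure 1
```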
For more details, check out this post on the NVIDIA Developer Blog.
Conclusion
LLMs are increasingly used to power production applications, and improving inference efficiency can impact computational costs and reduce latency for users. With ReDrafter's novel approach to speculative decoding built into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
Acknowledgments
Many people contributed to this project, including: Aonan Zhang, Xuanyu Zhang, Yunfei Cheng, Chong Wang, Yi Wang, Abhishek Udupa, Dhaval Doshi, and our collaborators at NVIDIA.