Accelerating LLM inference is an important ML research problem, since autoregressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN draft model and combines beam search with dynamic tree attention to speed up LLM token generation, generating up to 3.5 tokens per generation step for open source models and outperforming previous speculative decoding techniques.
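To make the idea concrete, here is a minimal, self-contained sketch of ReDrafter-style speculative decoding: a cheap draft model proposes candidate continuations via beam search, and the target model verifies them, accepting the longest matching prefix. Everything below is an illustrative stand-in, not Apple's or NVIDIA's implementation; in particular, real ReDrafter batches the verification passes using dynamic tree attention rather than looping as shown here.

```python
# Toy sketch of speculative decoding with a draft model and beam search.
# All model functions and parameters here are illustrative assumptions.
import numpy as np

VOCAB = 32  # toy vocabulary size

def target_logits(context: list[int]) -> np.ndarray:
    """Stand-in for one expensive target-model forward pass."""
    rng = np.random.default_rng(hash(tuple(context)) % (2**32))
    return rng.normal(size=VOCAB)

def draft_logits(context: list[int]) -> np.ndarray:
    """Stand-in for the cheap draft head: a noisy copy of the target."""
    rng = np.random.default_rng(hash(tuple(context)) % (2**32))
    return rng.normal(size=VOCAB) + 0.5 * rng.normal(size=VOCAB)

def beam_search_draft(context, beam_width=4, draft_len=4):
    """Draft several candidate continuations with a simple beam search."""
    beams = [(0.0, [])]
    for _ in range(draft_len):
        expanded = []
        for score, toks in beams:
            logits = draft_logits(context + toks)
            logp = logits - np.logaddexp.reduce(logits)
            for t in np.argsort(logp)[-beam_width:]:
                expanded.append((score + logp[t], toks + [int(t)]))
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return [toks for _, toks in beams]

def verify_and_accept(context, candidates):
    """Greedy verification: accept the longest candidate prefix matching the
    target model's own greedy choices. Shown as one target pass per position;
    ReDrafter batches these passes via dynamic tree attention."""
    best = []
    for cand in candidates:
        accepted = []
        for tok in cand:
            greedy = int(np.argmax(target_logits(context + accepted)))
            if greedy != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # Always emit at least one token from the target model itself.
    if not best:
        best = [int(np.argmax(target_logits(context)))]
    return best

context = [1, 2, 3]
for _ in range(5):
    context += verify_and_accept(context, beam_search_draft(context))
print(context)
```

Because the draft model is far cheaper than the target model, every additional token accepted per step translates almost directly into higher tokens-per-second throughput.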
Productionizing ReDrafter with NVIDIA TensorRT-LLM
This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM supports numerous open source LLMs and the Medusa speculative decoding method, ReDrafter's beam search and tree attention algorithms rely on operators that had never been used in previous applications. To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, considerably improving TensorRT-LLM's ability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter's accelerated token generation in their production LLM applications with TensorRT-LLM.
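Tree attention is what such operators make efficient. As a rough illustration (not TensorRT-LLM's actual operators or data structures), the sketch below merges beam candidates that share prefixes into a token tree and builds the ancestor-only attention mask that lets all candidates be verified in a single batched target-model pass:

```python
# Illustrative sketch of the tree-attention idea behind ReDrafter verification.
import numpy as np

candidates = [[5, 9, 2], [5, 9, 7], [5, 1, 3]]  # beams sharing the prefix [5]

# Flatten the candidates into unique tree nodes keyed by (parent, token),
# so shared prefixes are represented (and later computed) exactly once.
nodes, parent = [], []
index = {}  # (parent_id, token) -> node id
for cand in candidates:
    p = -1  # -1 means "attach to the current context"
    for tok in cand:
        key = (p, tok)
        if key not in index:
            index[key] = len(nodes)
            nodes.append(tok)
            parent.append(p)
        p = index[key]

# Tree attention mask: node i may attend to node j iff j is an ancestor
# of i (or i itself), mirroring causal attention along each tree branch.
n = len(nodes)
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    j = i
    while j != -1:
        mask[i, j] = True
        j = parent[j]

print(nodes)               # deduplicated tokens: [5, 9, 2, 7, 1, 3]
print(mask.astype(int))    # per-node ancestor visibility
```

With this mask, one forward pass scores every candidate branch at once instead of running a separate pass per beam, which is where the speedup over naive verification comes from.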
Benchmarking a tens-of-billions parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These benchmark results indicate that this technology could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
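As a back-of-the-envelope check on how per-step acceptance translates into throughput (the numbers below are assumptions chosen to illustrate the relationship, not our measured benchmark internals):

```python
# Illustrative arithmetic only: acceptance and overhead values are assumed.
tokens_per_step = 2.8  # assumed average tokens accepted per target-model pass
step_overhead = 1.05   # assumed relative cost of a draft+verify step vs. a plain step
speedup = tokens_per_step / step_overhead
print(f"~{speedup:.1f}x tokens per second")  # ~2.7x, consistent with Figure 1
```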
For more details, check out this post on the NVIDIA Developer Blog.
Conclusion
LLMs are increasingly used to power production applications, and improving inference efficiency can impact computational costs and reduce latency for users. With ReDrafter's novel approach to speculative decoding built into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
Acknowledgments
Many people contributed to this project, including: Aonan Zhang, Xuanyu Zhang, Yunfei Cheng, Chong Wang, Yi Wang, Abhishek Udupa, Dhaval Doshi, and our collaborators at NVIDIA.