Large Language Models (LLMs) have revolutionized a wide range of AI-based applications, from chat assistants to autonomous driving. This evolution has spurred the need for systems that can deploy and serve these models efficiently, especially under the growing demand of long-running, interactive workloads. The main hurdle in this area has been balancing high throughput with low latency in serving systems, a challenge that existing frameworks have struggled to overcome.
Traditional LLM frameworks, while adept at training models effectively, fall short during inference, especially in tasks like open-ended text generation. This inefficiency stems from the interactive nature of these applications and the low arithmetic intensity of such tasks, which hinder inference performance in existing systems. vLLM, powered by PagedAttention, and research systems like Orca have improved LLM inference performance, but they still struggle to maintain a consistent quality of service, particularly for workloads that combine long prompts with latency-sensitive generation.
Earlier advances in LLM inference, such as blocked KV caching and dynamic batching, were intended to improve memory efficiency and GPU utilization. Blocked KV caching, as implemented in vLLM's PagedAttention, addressed the memory fragmentation caused by large KV caches and increased overall system throughput. Dynamic batching improved GPU utilization but often required padding inputs or stalling the system to assemble larger batches. These methods, while innovative, have not fully solved the challenges of serving LLMs efficiently, particularly under workloads dominated by long prompts.
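To make the idea behind blocked (paged) KV caching concrete, the short Python sketch below shows how a block table can map a sequence's logical token positions onto fixed-size physical cache blocks that are allocated on demand. This is an illustrative sketch only; the class, method names, and block size are hypothetical and are not taken from vLLM's or DeepSpeed's actual code.

```python
# Illustrative sketch of blocked KV caching: logical token positions map to
# fixed-size physical blocks allocated on demand, so memory is never reserved
# for a sequence's maximum possible length up front.
BLOCK_SIZE = 16  # tokens per physical cache block (hypothetical value)

class BlockedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> list of block ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's KV entries are stored."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:               # block boundary: grab a fresh block
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free_sequence(self, seq_id: int):
        """Return a finished sequence's blocks to the pool, avoiding fragmentation."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```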
In response to these challenges, Microsoft DeepSpeed researchers introduced DeepSpeed-FastGen, a system built on the Dynamic SplitFuse technique. It delivers up to 2.3x higher effective throughput, 2x lower average latency, and up to 3.7x lower tail latency compared to state-of-the-art systems like vLLM. DeepSpeed-FastGen combines DeepSpeed-MII and DeepSpeed-Inference to create an efficient and easy-to-use serving system for LLMs. It supports a variety of models and offers both persistent and non-persistent deployment options to suit different user scenarios.
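For context, DeepSpeed-MII exposes both deployment modes through a small Python API. The snippet below is a minimal sketch of the non-persistent pipeline and the persistent server; the model name is just an example, and exact argument names and signatures may differ across DeepSpeed-MII versions.

```python
import mii

# Non-persistent deployment: the model lives only for the duration of this script.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(responses)

# Persistent deployment: a long-lived inference server that clients connect to.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed is"], max_new_tokens=128))
client.terminate_server()  # shut the server down when finished
```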
The cornerstone of DeepSpeed-FastGen's efficiency is the Dynamic SplitFuse strategy, which improves continuous batching and overall system throughput. This novel token composition strategy for prompt processing and generation decomposes long prompts into smaller chunks processed across multiple forward passes. As a result, long prompts no longer require extremely long forward passes, which improves system responsiveness and efficiency. The approach also keeps forward pass sizes consistent, a primary determinant of performance, yielding more consistent latency than competing systems and significant reductions in generation latency, as the performance evaluations show.
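The scheduling idea can be illustrated with a simplified Python sketch. This is not DeepSpeed's actual implementation: the token budget, function, and variable names are hypothetical. The point is that each forward pass is filled up to a fixed token budget, long prompts are split into chunks across consecutive passes, and running generations each contribute one decode token, so every pass has a roughly constant size.

```python
from collections import deque

TOKEN_BUDGET = 512  # fixed number of tokens processed per forward pass (hypothetical)

def build_forward_pass(prompt_queue: deque, decoding: list) -> list:
    """Compose one forward pass in a Dynamic SplitFuse-style manner (simplified).

    Each running generation contributes exactly one decode token; the remaining
    budget is filled with chunks of pending prompts, so a long prompt is split
    across several passes instead of running one huge prefill.
    """
    batch = [("decode", seq_id, 1) for seq_id in decoding]
    budget = TOKEN_BUDGET - len(decoding)

    while budget > 0 and prompt_queue:
        seq_id, remaining = prompt_queue[0]
        chunk = min(remaining, budget)          # take only what fits in this pass
        batch.append(("prefill", seq_id, chunk))
        budget -= chunk
        if chunk == remaining:                  # prompt fully consumed
            prompt_queue.popleft()
        else:                                   # the rest is processed next pass
            prompt_queue[0] = (seq_id, remaining - chunk)
    return batch

# Example: a 1200-token prompt is chunked while two sequences keep decoding.
queue = deque([("A", 1200), ("B", 40)])
print(build_forward_pass(queue, decoding=["C", "D"]))  # ~510-token prefill chunk for A
```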
The performance of DeepSpeed-FastGen was rigorously evaluated, with the system benchmarked against vLLM across various models and hardware configurations. The evaluations showed that DeepSpeed-FastGen achieves up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower tail latency than vLLM. These improvements are particularly significant for LLM serving, where both throughput and latency are crucial metrics.
To summarize the key takeaways from DeepSpeed-FastGen:
- Revolutionary strategy: It implements Dynamic SplitFuse, a novel token composition strategy.
- Significant performance improvements: It achieves up to 2.3x higher effective throughput and 2x lower latency on average.
- Tail latency reduction: It offers up to 3.7 times lower tail latency than vLLM.
- Scalability and versatility: It demonstrates near-perfect scalability and supports multiple hardware platforms.
- Community involvement: It encourages contribution and collaboration within the broader DeepSpeed ecosystem.
DeepSpeed-FastGen represents a major advance in efficiently deploying and scaling large language models. By addressing critical throughput and latency challenges in LLM serving, it makes a notable contribution to the field, paving the way for more efficient and scalable AI applications.
Hello, my name is Adnan Hassan. I'm a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.