Although large language models (LLMs) have demonstrated impressive language-processing capabilities, they are computationally expensive and require sophisticated hardware infrastructure. The rise in popularity of these models has driven GPU deployment at an unprecedented pace, posing significant challenges for cloud providers. Since the power available to fuel this GPU demand is limited, user queries are sometimes rejected, so researchers are working to make the existing infrastructure more efficient.
There are two phases in the LLM inference process: prompt computation, in which the user's input prompt is processed, and token generation, in which the LLM produces the output. During the first phase, the input tokens are processed in parallel, which is compute-intensive. In the second phase, the output tokens are generated sequentially, one at a time, which is memory-intensive. Running both phases on the same hardware leads to low overall utilization and, ultimately, much higher costs for the user.
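To make the contrast concrete, here is a minimal sketch of the two phases in Python; the model interface (model.prefill and model.decode_step) is hypothetical and only meant to show where the compute-bound and memory-bound work happens.

```python
# Minimal sketch of the two inference phases. The model interface used here
# (model.prefill / model.decode_step) is hypothetical.

def generate(model, prompt_tokens, max_new_tokens):
    # Phase 1: prompt computation (prefill). All prompt tokens are processed
    # in one parallel pass, which is compute-bound; the attention keys and
    # values are stored in the KV cache for reuse.
    kv_cache, next_token = model.prefill(prompt_tokens)

    # Phase 2: token generation (decode). Tokens are produced one at a time,
    # and every step reads the whole KV cache, making this phase
    # memory-bandwidth-bound rather than compute-bound.
    output = [next_token]
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = model.decode_step(next_token, kv_cache)
        output.append(next_token)
    return output
```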
To address this problem, Microsoft researchers have introduced Splitwise, a technique that separates the prompt computation and token generation phases onto separate machines, leading to better utilization of the available hardware. Along with the two groups of machines for the two phases of inference, Splitwise also maintains a third pool that is dynamically sized, expanding and contracting based on the workload. The state of the request, i.e., the KV cache, is transferred from the prompt machines to the token machines over InfiniBand without any noticeable delay.
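The sketch below illustrates the phase-splitting idea under assumed names (MachinePool, Machine, and route_request are illustrative, not the paper's code); in the actual system the KV cache is streamed between machines over InfiniBand rather than handed over in-process.

```python
# Illustrative sketch of Splitwise-style machine pools; all names here are
# hypothetical and only show how requests could span two pools.

from dataclasses import dataclass, field

@dataclass
class Machine:
    machine_id: str
    active_requests: int = 0

@dataclass
class MachinePool:
    name: str
    machines: list = field(default_factory=list)

    def least_loaded(self):
        # Pick the machine currently serving the fewest requests.
        return min(self.machines, key=lambda m: m.active_requests)

# Three pools: prompt machines, token machines, and a dynamically resized
# third pool that absorbs shifts in the workload between the two phases.
prompt_pool = MachinePool("prompt", [Machine("p0"), Machine("p1")])
token_pool = MachinePool("token", [Machine("t0"), Machine("t1")])
flex_pool = MachinePool("flex", [Machine("m0")])

def route_request(prompt_tokens):
    prompt_machine = prompt_pool.least_loaded()
    token_machine = token_pool.least_loaded()
    prompt_machine.active_requests += 1
    # The prefill would run on prompt_machine, after which the KV cache is
    # transferred to token_machine (over InfiniBand in the paper) and token
    # generation continues there.
    return prompt_machine, token_machine
```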
Splitwise also leverages two-level hierarchical scheduling to route incoming requests, maintain the pending queues, and manage the batching of requests on each machine. The design targets lower latency at low request rates and a smaller performance drop at high request rates.
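A minimal sketch of what two-level scheduling can look like is shown below; the class names and the least-loaded routing policy are assumptions for illustration, not the scheduler described in the paper.

```python
# Hypothetical two-level scheduler: a cluster-level scheduler routes each
# request to a machine, and a machine-level scheduler owns that machine's
# pending queue and forms batches.

from collections import deque

class MachineScheduler:
    """Machine-level: pending queue plus simple batching."""
    def __init__(self, max_batch_size=8):
        self.pending = deque()
        self.max_batch_size = max_batch_size

    def enqueue(self, request):
        self.pending.append(request)

    def next_batch(self):
        # Pull up to max_batch_size requests from the pending queue.
        batch = []
        while self.pending and len(batch) < self.max_batch_size:
            batch.append(self.pending.popleft())
        return batch

class ClusterScheduler:
    """Cluster-level: routes each request to the least-loaded machine."""
    def __init__(self, machine_schedulers):
        self.machines = machine_schedulers

    def route(self, request):
        target = min(self.machines, key=lambda m: len(m.pending))
        target.enqueue(request)
        return target

# Usage: route a few requests, then pull a batch on one machine.
cluster = ClusterScheduler([MachineScheduler() for _ in range(2)])
for i in range(5):
    cluster.route({"request_id": i, "prompt": f"query {i}"})
batch = cluster.machines[0].next_batch()
```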
For evaluation, the researchers used Splitwise to design clusters with different GPU specifications, optimizing them for power, cost, and per-query performance. They considered two Splitwise use cases, coding and conversation, using the BLOOM-176B and LLaMa-2-70B models. The results show that Splitwise successfully maximizes performance, minimizes costs, and reduces power consumption. Additionally, the cluster design was able to maximize performance at the same cost as a baseline A100 cluster.
Furthermore, compared to the baseline cluster, Splitwise delivered much higher performance while operating within the same power constraints. The results also show that Splitwise can be tuned to workload requirements through its scheduler, and that it is resilient to changes in the model, the load, and the token distributions.
In conclusion, Splitwise is an effective technique for improving hardware utilization and accelerating LLM inference by running the two phases of inference on separate machines. It marks a significant step toward efficient, high-performance LLM serving and provides a solid foundation for other researchers working to make LLM inference more efficient and sustainable.
See the Paper and Blog for details. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understandable to a wide audience. The platform has more than 2 million monthly visits, illustrating its popularity among readers.