Although large language models (LLMs) have demonstrated impressive language-processing capabilities, they are computationally expensive and require sophisticated hardware infrastructure. The rise in popularity of these models has driven GPU deployment at an unprecedented pace, posing significant challenges for cloud providers. Since the power available to fuel this GPU demand is limited, user queries are sometimes rejected, so researchers are working to make the existing infrastructure more efficient.
There are two phases in the LLM inference process: prompt computation, in which the user's input prompt is processed, and token generation, in which the LLM produces the output. During the first phase, the input tokens are processed in parallel, which is compute-intensive. In the second phase, the output tokens are generated sequentially, one at a time, which is memory-intensive. Running both phases on the same hardware leads to low overall utilization and, ultimately, much higher costs for the user.
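To make the contrast concrete, here is a minimal sketch of the two phases in Python; the model interface (model.prefill and model.decode_step) is hypothetical and only meant to show where the compute-bound and memory-bound work happens.

```python
# Minimal sketch of the two inference phases. The model interface used here
# (model.prefill / model.decode_step) is hypothetical.

def generate(model, prompt_tokens, max_new_tokens):
    # Phase 1: prompt computation (prefill). All prompt tokens are processed
    # in one parallel pass, which is compute-bound; the attention keys and
    # values are stored in the KV cache for reuse.
    kv_cache, next_token = model.prefill(prompt_tokens)

    # Phase 2: token generation (decode). Tokens are produced one at a time,
    # and every step reads the whole KV cache, making this phase
    # memory-bandwidth-bound rather than compute-bound.
    output = [next_token]
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = model.decode_step(next_token, kv_cache)
        output.append(next_token)
    return output
```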
To address this problem, Microsoft researchers have introduced Splitwise, a technique that separates the prompt computation and token generation phases onto separate machines, leading to better utilization of the available hardware. Along with the two groups of machines for the two phases of inference, Splitwise also maintains a third pool that is dynamically sized, expanding and contracting based on the workload. The state of the request, i.e., the KV cache, is transferred from the prompt machines to the token machines over InfiniBand without any noticeable delay.
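The sketch below illustrates the phase-splitting idea under assumed names (MachinePool, Machine, and route_request are illustrative, not the paper's code); in the actual system the KV cache is streamed between machines over InfiniBand rather than handed over in-process.

```python
# Illustrative sketch of Splitwise-style machine pools; all names here are
# hypothetical and only show how requests could span two pools.

from dataclasses import dataclass, field

@dataclass
class Machine:
    machine_id: str
    active_requests: int = 0

@dataclass
class MachinePool:
    name: str
    machines: list = field(default_factory=list)

    def least_loaded(self):
        # Pick the machine currently serving the fewest requests.
        return min(self.machines, key=lambda m: m.active_requests)

# Three pools: prompt machines, token machines, and a dynamically resized
# third pool that absorbs shifts in the workload between the two phases.
prompt_pool = MachinePool("prompt", [Machine("p0"), Machine("p1")])
token_pool = MachinePool("token", [Machine("t0"), Machine("t1")])
flex_pool = MachinePool("flex", [Machine("m0")])

def route_request(prompt_tokens):
    prompt_machine = prompt_pool.least_loaded()
    token_machine = token_pool.least_loaded()
    prompt_machine.active_requests += 1
    # The prefill would run on prompt_machine, after which the KV cache is
    # transferred to token_machine (over InfiniBand in the paper) and token
    # generation continues there.
    return prompt_machine, token_machine
```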
Splitwise also leverages two-level hierarchical scheduling to route incoming requests, maintain the pending queues, and manage the batching of requests on each machine. The design targets lower latency at low request rates and a smaller performance drop at high request rates.
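A minimal sketch of what two-level scheduling can look like is shown below; the class names and the least-loaded routing policy are assumptions for illustration, not the scheduler described in the paper.

```python
# Hypothetical two-level scheduler: a cluster-level scheduler routes each
# request to a machine, and a machine-level scheduler owns that machine's
# pending queue and forms batches.

from collections import deque

class MachineScheduler:
    """Machine-level: pending queue plus simple batching."""
    def __init__(self, max_batch_size=8):
        self.pending = deque()
        self.max_batch_size = max_batch_size

    def enqueue(self, request):
        self.pending.append(request)

    def next_batch(self):
        # Pull up to max_batch_size requests from the pending queue.
        batch = []
        while self.pending and len(batch) < self.max_batch_size:
            batch.append(self.pending.popleft())
        return batch

class ClusterScheduler:
    """Cluster-level: routes each request to the least-loaded machine."""
    def __init__(self, machine_schedulers):
        self.machines = machine_schedulers

    def route(self, request):
        target = min(self.machines, key=lambda m: len(m.pending))
        target.enqueue(request)
        return target

# Usage: route a few requests, then pull a batch on one machine.
cluster = ClusterScheduler([MachineScheduler() for _ in range(2)])
for i in range(5):
    cluster.route({"request_id": i, "prompt": f"query {i}"})
batch = cluster.machines[0].next_batch()
```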
For evaluation, the researchers used Splitwise to design clusters with different GPU specifications, optimizing them for power, cost, and per-query performance. They considered two Splitwise use cases, coding and conversation, using the BLOOM-176B and LLaMa-2-70B models. The results show that Splitwise successfully maximizes performance, minimizes costs, and reduces power consumption. Additionally, the cluster design was able to maximize performance at the same cost as a baseline A100 cluster.
Furthermore, compared to the baseline cluster, Splitwise delivered much higher performance while operating within the same power constraints. The results also show that Splitwise can be tuned to workload requirements through its scheduler, and that it is resilient to changes in the model, the load, and the token distributions.
In conclusion, Splitwise is an effective technique for improving hardware utilization and accelerating LLM inference by running the two phases of inference on separate machines. It marks a significant step toward efficient, high-performance LLM serving and provides a solid foundation for other researchers working to make LLM inference more efficient and sustainable.
See the Paper and Blog for details. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understandable to a wide audience. The platform has more than 2 million monthly visits, illustrating its popularity among readers.