The most recent advancement in artificial intelligence (AI), large language models (LLMs), has brought major improvements in language generation. With model sizes reaching billions of parameters, these models are making their way into domains ranging from healthcare and finance to education.
Although these models have demonstrated impressive capabilities, the growth in model size has led to increased inference latency, which poses a problem for real-world applications. The main bottleneck in LLM inference is memory-bound: during autoregressive decoding, every generated token requires transferring all model parameters from high-bandwidth memory (HBM) to the accelerator's cache, which is highly inefficient.
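To make the memory-bound nature of decoding concrete, here is a back-of-the-envelope sketch in Python; the 7B parameter count, fp16 weights, and batch size of one are illustrative assumptions rather than figures from the paper.

```python
# A minimal, illustrative estimate (not a measurement) of why one autoregressive
# decoding step is memory-bound: every step streams all weights from HBM but
# performs only about two FLOPs per weight.

params = 7e9              # assumed 7B-parameter model
bytes_per_param = 2       # fp16 weights
batch_size = 1            # single-sequence decoding

bytes_moved = params * bytes_per_param      # weights read from HBM per step
flops = 2 * params * batch_size             # one multiply-add per weight per token

arithmetic_intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
print(f"Arithmetic intensity: {arithmetic_intensity:.1f} FLOP/byte")
# ~1 FLOP/byte, far below what modern accelerators need to be compute-bound,
# so the chip mostly waits on memory. Producing k tokens per step raises the
# intensity roughly k-fold, which is the lever MEDUSA pulls.
```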
Researchers have sought ways around these limitations, one of which is to reduce the number of decoding steps and increase the arithmetic intensity of the decoding process. Speculative decoding has been proposed for this purpose: a smaller draft model produces a sequence of candidate tokens that the larger original model then verifies. However, incorporating a separate draft model into a distributed system introduces its own difficulties.
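For readers unfamiliar with the draft-and-verify idea, the following sketch shows one simplified (greedy, batch-size-one) speculative decoding step; `draft_model`, `target_model`, and the exact-match acceptance rule are hypothetical stand-ins for the rejection-sampling scheme used in practice.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    """One simplified draft-then-verify step (greedy variant, illustration only).

    Both models are assumed to be Hugging Face-style causal LMs returning
    logits of shape (batch, seq_len, vocab); batch size is assumed to be 1.
    """
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) The large target model scores all k proposals in ONE forward pass.
    target_logits = target_model(draft_ids).logits

    # 3) Accept the longest prefix of proposals the target model agrees with.
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        target_next = target_logits[:, n_prompt + i - 1, :].argmax(dim=-1, keepdim=True)
        proposal = draft_ids[:, n_prompt + i : n_prompt + i + 1]
        if torch.equal(target_next, proposal):
            accepted = torch.cat([accepted, proposal], dim=-1)
        else:
            # Replace the first rejected token with the target's choice and stop.
            accepted = torch.cat([accepted, target_next], dim=-1)
            break
    return accepted
```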
To overcome these challenges, a team of researchers has presented MEDUSA in a recent study: an efficient approach that improves LLM inference by adding extra decoding heads to the backbone model so that multiple subsequent tokens can be predicted in parallel. By predicting several tokens simultaneously, these heads sidestep the difficulties of speculative decoding while still accelerating inference.
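A rough sketch of what such extra heads might look like is shown below; the residual-free MLP structure and the `num_heads` default are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    """A minimal sketch (not the authors' code) of extra decoding heads.

    Each head is a small MLP over the backbone's last hidden state and predicts
    the token k+1 positions ahead, so one forward pass of the backbone yields the
    usual next-token logits plus `num_heads` look-ahead distributions.
    """
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU(),
                          nn.Linear(hidden_size, vocab_size))
            for _ in range(num_heads)
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, hidden_size) hidden state at the current position.
        return [head(last_hidden) for head in self.heads]  # one logits tensor per head
```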
Because MEDUSA does not require a separate draft model the way speculative decoding does, it can be integrated easily into existing LLM systems, including distributed deployments. The team reports that MEDUSA constructs several candidate continuations in each decoding step and verifies them simultaneously using a tree-based attention mechanism. Through this parallel processing, MEDUSA reduces the number of decoding steps required while introducing very little overhead in single-step latency.
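The tree-based verification can be illustrated with a small attention-mask builder; the `parent`-array representation of the candidate tree is an assumption made for this sketch, not the paper's data structure.

```python
import torch

def build_tree_attention_mask(parent: list[int]) -> torch.Tensor:
    """A small sketch of a tree attention mask for verifying candidates in parallel.

    `parent[i]` is the index of node i's parent in the candidate tree (-1 = the
    already-accepted context). Each candidate token may attend only to its
    ancestors, so all branches are verified in a single forward pass without one
    continuation leaking into another.
    """
    n = len(parent)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:           # walk up to the root, collecting ancestors
            mask[i, j] = True
            j = parent[j]
    return mask  # True = attention allowed

# Example: two candidate continuations sharing a first token.
#   node 0: "The" (child of the context); nodes 1 and 2: "cat" / "dog" (children of 0)
print(build_tree_attention_mask([-1, 0, 0]))
```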
MEDUSA adds two key insights. First, multiple candidate continuations are generated with the MEDUSA heads and verified simultaneously. Second, an acceptance procedure is used to select suitable candidates: the team shows that the rejection sampling strategy used in speculative decoding can effectively be replaced with a temperature-based threshold that handles distribution drift.
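The kind of temperature-based acceptance rule described above can be sketched as follows; the `epsilon` and `delta` hyperparameters and the exact thresholding form are illustrative assumptions rather than the paper's reported settings.

```python
import torch

def typical_acceptance(candidate_ids, target_logits, temperature=0.7,
                       epsilon=0.09, delta=0.3):
    """A sketch of a temperature-based acceptance rule (illustration only).

    `candidate_ids` is a 1-D sequence of proposed token ids and `target_logits`
    has shape (len(candidate_ids), vocab): the original model's logits at the
    position preceding each candidate. A token is accepted when its probability
    under the original model exceeds a threshold that shrinks as the model's own
    distribution gets flatter, so high-temperature generations are not rejected
    too aggressively. Returns the longest accepted prefix.
    """
    probs = torch.softmax(target_logits / temperature, dim=-1)       # (len, vocab)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1)    # (len,)
    threshold = torch.minimum(torch.full_like(entropy, epsilon),
                              delta * torch.exp(-entropy))
    accepted = 0
    for t, tok in enumerate(candidate_ids):
        if probs[t, tok] > threshold[t]:
            accepted += 1
        else:
            break
    return candidate_ids[:accepted]
```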
The study proposes two procedures for fine-tuning the MEDUSA prediction heads of an LLM, which are as follows.
- MEDUSA-1: This variant enables lossless inference acceleration by fitting MEDUSA heads directly on top of a frozen backbone LLM. MEDUSA-1 is recommended when adding MEDUSA to an existing model or when computational resources are limited. It uses less memory and can be made even more efficient by applying quantization techniques.
- MEDUSA-2: This method fine-tunes the MEDUSA heads and the backbone LLM together. While it offers greater speedups and better prediction accuracy for the MEDUSA heads, it requires a special training recipe to preserve the backbone model's capabilities. MEDUSA-2 is appropriate when resources are abundant; it allows joint training of the MEDUSA heads and the backbone without sacrificing output quality or next-token prediction ability. A minimal sketch contrasting the two regimes follows this list.
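As a rough illustration of how the two regimes differ in practice, the sketch below freezes the backbone for MEDUSA-1 and uses separate parameter groups for joint training in MEDUSA-2; the optimizer choice and learning rates are assumptions, not the paper's recipe.

```python
import torch

def configure_training(backbone, medusa_heads, mode="medusa-1",
                       head_lr=1e-3, backbone_lr=1e-5):
    """A rough sketch of the two fine-tuning regimes (learning rates are
    illustrative assumptions).

    MEDUSA-1: the backbone is frozen, so only the new heads receive gradients
    and the original model's outputs are left untouched (lossless).
    MEDUSA-2: backbone and heads are trained jointly, typically with a much
    smaller learning rate on the backbone to preserve its next-token quality.
    """
    if mode == "medusa-1":
        for p in backbone.parameters():
            p.requires_grad = False                      # frozen backbone
        return torch.optim.AdamW(medusa_heads.parameters(), lr=head_lr)

    # MEDUSA-2: joint training with per-group learning rates.
    return torch.optim.AdamW([
        {"params": backbone.parameters(), "lr": backbone_lr},
        {"params": medusa_heads.parameters(), "lr": head_lr},
    ])
```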
The study also proposes several extensions that improve or expand the use of MEDUSA. These include a typical acceptance scheme that increases the acceptance rate without sacrificing generation quality, and a self-distillation method for cases where no training data is available. The team reports that MEDUSA was evaluated on models of different sizes and with different training protocols. Results show that MEDUSA-1 delivers a speedup of more than 2.2× without sacrificing generation quality, and that MEDUSA-2 improves the speedup to 2.3-3.6×.
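A self-distillation pipeline of the kind mentioned above could look roughly like this; the Hugging Face-style `generate`/`tokenizer` interface, the sampling settings, and `seed_prompts` are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def self_distill_dataset(backbone, tokenizer, seed_prompts, max_new_tokens=256):
    """A sketch of self-distillation when no fine-tuning corpus is available.

    The backbone model answers a set of seed prompts, and those model-generated
    continuations become the targets for training the MEDUSA heads, so the heads
    learn to mimic the very distribution they will later speculate on.
    """
    dataset = []
    for prompt in seed_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = backbone.generate(**inputs, max_new_tokens=max_new_tokens,
                                       do_sample=True, temperature=0.7)
        dataset.append({
            "prompt": prompt,
            "completion": tokenizer.decode(output_ids[0], skip_special_tokens=True),
        })
    return dataset
```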
Review the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.