In recent years, large language models (LLMs) based on the transformer architecture have emerged. Models such as ChatGPT and LLaMA-2 illustrate how rapidly LLM parameter counts have grown, from several billion into the trillions. Although LLMs are capable generators, they suffer from high inference latency because of the heavy computation their many parameters require. Consequently, there has been a strong push to accelerate LLM inference, especially for resource-constrained settings such as edge devices and for real-time applications like chatbots.
Most decoder-only LLMs generate text token by token. Because generation is autoregressive (AR), each new token requires its own forward pass through the model, resulting in a large number of transformer calls. These calls typically run under memory bandwidth constraints, which lowers computational efficiency and lengthens wall-clock latency.
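To make the bottleneck concrete, here is a minimal sketch of plain autoregressive decoding: every new token costs one full forward pass. The model and prompt are illustrative; any Hugging Face causal LM behaves the same way.

```python
# Minimal AR decoding loop: 8 new tokens -> 8 separate transformer calls.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(8):
        logits = model(input_ids).logits        # one full forward pass per token
        next_id = logits[0, -1].argmax()        # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```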
By producing multiple tokens in a single model inference step, semi-autoregressive (SAR) decoding greatly reduces the number of inference runs required. The problem is that most LLMs are trained only for AR generation, not SAR. Because the SAR objective diverges from AR pre-training, retraining a model for SAR generation is daunting.
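The following toy (not BiTA itself) illustrates the SAR idea: append k placeholder "mask" positions and read off k draft tokens from one forward pass. The `dummy` predictor and the mask-token id are hypothetical stand-ins; a real SAR model must be trained so those positions actually predict future tokens.

```python
# Toy semi-autoregressive drafting: K tokens from a single model call.
import torch

K = 4           # tokens drafted per step (illustrative)
MASK_ID = 0     # hypothetical mask-token id

def sar_step(model, input_ids):
    masks = torch.full((1, K), MASK_ID, dtype=torch.long)
    logits = model(torch.cat([input_ids, masks], dim=-1))
    draft = logits[0, -K:].argmax(dim=-1)       # K draft tokens, one call
    return torch.cat([input_ids, draft.view(1, -1)], dim=-1)

# Dummy stand-in model: random logits over a 100-token vocabulary.
dummy = lambda ids: torch.randn(1, ids.size(1), 100)
print(sar_step(dummy, torch.tensor([[5, 17, 42]])))
```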
Researchers at Intellifusion Inc. and Harbin Institute of Technology aim to achieve lossless SAR decoding for AR language models with their new acceleration approach, Bidirectional Tuning for Lossless Acceleration (BiTA), by learning a small number of additional trainable parameters, as little as 0.01% of the model.
The two main components of BiTA are the proposed bidirectional tuning and the streamlined verification of SAR draft candidates. Bidirectional tuning augments an AR model with prompt and mask tokens so that it can predict tokens beyond the next one. Concretely, this takes the form of learnable prefix and suffix embeddings attached to the token sequence. In the adapted AR model, generation and verification happen together in a single forward pass, enabled by an elaborate tree-based attention mechanism. Because of this unified design, no extra validation steps or third-party verification models are required. Built on prompt tuning, the approach works as a plug-and-play module that can accelerate any publicly available transformer-based LLM, especially well-trained chatbots, without weakening their exceptional generation abilities.
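A hedged sketch of what such bidirectional tuning could look like in soft-prompt form: a few learnable prefix embeddings condition the frozen model, and learnable suffix ("mask") embeddings occupy the future-token slots to be predicted. Dimensions, counts, and the class name are illustrative assumptions; BiTA's exact parameterization is in the paper.

```python
# Learnable prefix/suffix embeddings around a frozen LLM's token embeddings.
import torch
import torch.nn as nn

class BidirectionalPrompts(nn.Module):
    def __init__(self, hidden=768, n_prefix=16, n_suffix=4):
        super().__init__()
        # These small tables are the only trainable parameters;
        # the base LLM itself stays frozen.
        self.prefix = nn.Parameter(torch.randn(n_prefix, hidden) * 0.02)
        self.suffix = nn.Parameter(torch.randn(n_suffix, hidden) * 0.02)

    def forward(self, token_embeds):            # (batch, seq, hidden)
        b = token_embeds.size(0)
        pre = self.prefix.unsqueeze(0).expand(b, -1, -1)
        suf = self.suffix.unsqueeze(0).expand(b, -1, -1)
        # Prefix in front, then the tokens, then suffix slots whose outputs
        # are trained to predict the next few future tokens.
        return torch.cat([pre, token_embeds, suf], dim=1)

prompts = BidirectionalPrompts()
out = prompts(torch.randn(2, 10, 768))          # -> (2, 16 + 10 + 4, 768)
print(out.shape)
```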
Using a tree-based decoding technique, the model performs generation and verification efficiently in parallel. Together, the two components let BiTA accelerate LLMs while leaving their original outputs intact. Across numerous generation tasks with LLMs of various sizes, extensive experiments show impressive speedups ranging from 2.1× to 3.3×. Moreover, BiTA's adaptive prompting design makes it a plug-and-play method for accelerating any publicly available LLM when resources are constrained or real-time responsiveness is required.
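What "lossless" means here can be seen in the greedy acceptance rule common to draft-and-verify schemes: draft tokens are kept only while they match what the base AR model would itself have produced, so the final output is identical to pure AR decoding. This is a minimal sketch of that rule in isolation; BiTA folds the check into the same forward pass via tree attention.

```python
# Greedy accept/reject: lossless because the kept tokens are always the
# base model's own choices.
import torch

def verify(ar_preds, draft):
    # ar_preds[i] is the base model's greedy choice at the position that
    # predicts draft[i] (from the same forward pass that scored the draft).
    accepted = []
    for p, d in zip(ar_preds.tolist(), draft.tolist()):
        accepted.append(p)      # the model's own token is always valid
        if p != d:
            break               # first mismatch ends acceptance
    return accepted

# Draft [7, 9, 3] vs. model choices [7, 9, 4]: two matches are kept, and the
# mismatch is replaced by the model's own token, exactly as AR decoding would.
print(verify(torch.tensor([7, 9, 4]), torch.tensor([7, 9, 3])))  # [7, 9, 4]
```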