High deployment costs are a growing concern as large foundation models (e.g., GPT-3.5/GPT-4) (OpenAI, 2023) are put to use in many practical contexts. Although quantization, pruning, compression, and distillation are useful general techniques for reducing the serving cost of LLMs, the inference efficiency bottleneck of transformer-based generative models (e.g., GPT) lies mainly in autoregressive decoding: at inference time, output tokens must be decoded sequentially, one at a time. This poses serious challenges for deploying LLMs at scale.
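To make the bottleneck concrete, here is a minimal sketch of greedy autoregressive decoding. The `next_token_logits` function is a hypothetical placeholder for a real LLM forward pass, and the vocabulary size and end-of-sequence id are made up for illustration; the point is that each new token requires its own full forward pass, so latency grows linearly with output length.

```python
import random

VOCAB_SIZE = 32000   # made-up vocabulary size, for illustration only
EOS_ID = 2           # made-up end-of-sequence id

def next_token_logits(token_ids):
    # Placeholder: a real LLM would run a transformer forward pass here and
    # return one logit per vocabulary entry for the next position.
    random.seed(len(token_ids))
    return [random.random() for _ in range(VOCAB_SIZE)]

def greedy_decode(prompt_ids, max_new_tokens=32):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)                        # one full forward pass...
        next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)
        tokens.append(next_id)                                    # ...yields just one new token
        if next_id == EOS_ID:
            break
    return tokens

print(greedy_decode([101, 2023, 2003], max_new_tokens=5))
```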
Studies show that, in real-world applications, an LLM's output tokens often come from its context. That context typically consists of documents relevant to a query, retrieved from an external corpus for reference, and the LLM's output frequently contains multiple spans of text that also appear in those reference documents.
Motivated by this observation, a team of Microsoft researchers proposes LLMA, an inference-with-reference decoding technique that accelerates LLM inference by exploiting the overlap between an LLM's output and a reference available in many real-world settings. The goal of this work is to speed up LLM inference by improving on plain autoregressive decoding.
LLMA works by selecting a span of text from the reference, copying its tokens into the LLM decoder, and then checking them efficiently in parallel against the model's output token probabilities. This guarantees that the generated output is identical to that of vanilla greedy decoding, while speeding up decoding through greater parallelism on accelerators such as GPUs.
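The snippet below is a hedged sketch of how such copy-then-verify decoding could look; it is not the authors' implementation. It reuses the `next_token_logits` placeholder from the earlier snippet, and the helper names (`find_copy_span`, `batched_greedy_ids`, `copy_and_verify_decode`) and parameters (`match_len`, `copy_len`) are illustrative assumptions.

```python
def batched_greedy_ids(token_ids):
    # Placeholder: a real LLM would return the greedy next-token id for every
    # position of `token_ids` in a single parallel forward pass; this stub
    # reuses `next_token_logits` and `VOCAB_SIZE` from the previous snippet.
    return [max(range(VOCAB_SIZE), key=next_token_logits(token_ids[: i + 1]).__getitem__)
            for i in range(len(token_ids))]

def find_copy_span(output_ids, reference_ids, match_len=2, copy_len=8):
    # Look for the last `match_len` generated tokens inside the reference and
    # propose the `copy_len` reference tokens that follow the match.
    if len(output_ids) < match_len:
        return None
    suffix = output_ids[-match_len:]
    for i in range(len(reference_ids) - match_len):
        if reference_ids[i:i + match_len] == suffix:
            return reference_ids[i + match_len:i + match_len + copy_len]
    return None

def copy_and_verify_decode(prompt_ids, reference_ids, max_new_tokens=32):
    tokens = list(prompt_ids)
    generated = []
    while len(generated) < max_new_tokens:
        span = find_copy_span(generated, reference_ids)
        if not span:
            # No usable reference span: fall back to one ordinary greedy step.
            logits = next_token_logits(tokens)
            next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)
            tokens.append(next_id)
            generated.append(next_id)
            continue
        # Feed the copied span in one (parallel) pass and keep only the prefix
        # that greedy decoding would have produced anyway.
        candidate = tokens + span
        preds = batched_greedy_ids(candidate)
        accepted = []
        for j, tok in enumerate(span):
            greedy_choice = preds[len(tokens) - 1 + j]  # greedy token for this slot
            if greedy_choice == tok:
                accepted.append(tok)                    # copied token verified
            else:
                accepted.append(greedy_choice)          # keep the model's own token
                break
        tokens.extend(accepted)
        generated.extend(accepted)
    return list(prompt_ids) + generated[:max_new_tokens]
```

Because only copied tokens that agree with the model's own greedy predictions are accepted (plus one correction token at the first disagreement), the output matches greedy decoding exactly; the saving comes from verifying many copied tokens in a single parallel forward pass instead of generating them one by one.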
Unlike earlier efficient decoding algorithms such as speculative decoding and speculative sampling, LLMA does not require an additional draft model to generate candidate tokens for verification.
Experiments across model sizes and practical application scenarios, including retrieval-augmented generation and cache-assisted generation, show that the proposed LLMA approach achieves more than a 2x speedup over greedy decoding.
Check out the Paper and GitHub for more details.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields, and is passionate about exploring new advances in technology and their real-life applications.