Together AI has introduced TEAL (Training-Free Activation Sparsity in LLMs), a technique with the potential to significantly advance efficient machine learning model inference. The company, a leader in open-source AI models, has been exploring ways to optimize model performance, especially in environments with limited memory resources. TEAL is a notable advance in this quest: it provides a novel method for sparsifying activations across LLMs, promising improved performance with minimal model degradation.
The challenge of large-scale language models
Large language models are known for their impressive capabilities but are equally notorious for their enormous memory requirements. Inference on these models is limited less by raw compute than by the speed at which weights can be transferred between memory and the processing units. This memory-bound nature has led to techniques such as quantization and weight sparsity that shrink models without compromising performance.
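To see why decoding is memory-bound, consider a rough back-of-the-envelope estimate. The model size and bandwidth figures below are illustrative assumptions, not numbers from the TEAL work:

```python
# Rough lower bound on per-token decode latency for a memory-bound model.
# Assumed figures: an 8B-parameter model in fp16 on an A100-80GB-class GPU.
params = 8e9                      # parameter count (assumption)
bytes_per_param = 2               # fp16
hbm_bandwidth = 2.0e12            # ~2 TB/s peak HBM bandwidth (assumption)

weight_bytes = params * bytes_per_param      # ~16 GB read per generated token
latency_s = weight_bytes / hbm_bandwidth     # best case, ignoring compute
print(f"~{latency_s * 1e3:.1f} ms per generated token, bandwidth-bound")
```

Every generated token requires streaming essentially all of the weights through the memory system, so shrinking or skipping those transfers translates directly into speed.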
One of the most recent advances is activation sparsity, which takes advantage of redundant hidden states in LLMs: when an activation entry is zero, the corresponding weight channel can be skipped entirely. However, models such as LLaMA have moved from ReLU-based MLPs (which naturally exhibit high sparsity) to SwiGLU-based MLPs, which are far less conducive to activation sparsity. This shift has made it difficult to apply activation-sparsity techniques to newer models.
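The difference is easy to verify empirically: ReLU zeroes out roughly half of a zero-centered input exactly, while SiLU, the activation inside SwiGLU, almost never produces exact zeros. A quick PyTorch check (illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1_000_000)
print("ReLU exact zeros:", (F.relu(x) == 0).float().mean().item())  # ~0.5
print("SiLU exact zeros:", (F.silu(x) == 0).float().mean().item())  # ~0.0
```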
The concept behind TEAL
TEAL emerges as a solution to the challenges posed by activation sparsity in modern LLMs. It is a simple, training-free approach that sparsifies activations by applying magnitude-based pruning to hidden states throughout the model, achieving an impressive 40-50% model-wide activation sparsity with minimal impact on performance.
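In essence, TEAL zeroes out the lowest-magnitude entries of each hidden state. A minimal PyTorch sketch of the idea (a simplification: the actual TEAL implementation calibrates thresholds offline and relies on custom kernels):

```python
import torch

def sparsify(x: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(x.numel() * sparsity)
    if k == 0:
        return x
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = x.abs().flatten().kthvalue(k).values
    return torch.where(x.abs() > threshold, x, torch.zeros_like(x))
```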
TEAL's main advantage lies in optimizing sparsity across all tensors in the model. Unlike previous methods such as CATS, which sparsified only specific parts of the model, TEAL targets every tensor, achieving higher overall sparsity without any additional tuning or pre-training. Because weight channels whose corresponding activations are zero never need to be read from memory, TEAL significantly reduces the memory bandwidth required for LLM inference, leading to faster decoding.
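The bandwidth saving comes from the weight side: when an activation entry is zero, the matching column of the weight matrix contributes nothing and need not be read. A toy illustration of the arithmetic (a real implementation needs a fused sparse-GEMV kernel; plain PyTorch indexing as below still copies the gathered columns, so it demonstrates correctness rather than speed):

```python
import torch

def sparse_matvec(W: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """y = W @ x, touching only the weight columns where x is nonzero."""
    nz = x.nonzero(as_tuple=True)[0]   # indices of surviving channels
    return W[:, nz] @ x[nz]            # only ~(1 - sparsity) of W is used

W = torch.randn(4096, 4096)
x = torch.randn(4096)
x[x.abs() < 0.7] = 0                   # crude sparsification for the demo
assert torch.allclose(sparse_matvec(W, x), W @ x, atol=1e-3)
```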
The technical implementation of TEAL
TEAL’s implementation optimizes sparsity at the transformer-block level, ensuring that every tensor in the model benefits. At 25% sparsity the model experiences near-zero performance degradation, and at 40–50% sparsity the degradation remains minimal. This contrasts with methods such as CATS, which suffer more significant performance drops at higher sparsity levels. A key factor behind TEAL’s success is where it applies sparsification: TEAL sparsifies the hidden states feeding every weight matrix, rather than only the gated outputs as in other methods. This design choice results in lower error and better overall performance, even at higher sparsity levels. As a result, TEAL achieves speedups of 1.53x to 1.8x in single-batch decoding, a significant improvement for real-world applications where inference speed is critical.
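Because TEAL is training-free, the only offline step is choosing a magnitude cutoff per tensor from a small set of sample activations. A hypothetical calibration helper (the real method distributes different sparsity levels across the tensors within a block; this sketch uses a single quantile per tensor):

```python
import torch

def calibrate_threshold(sample_acts: torch.Tensor, target_sparsity: float) -> float:
    """Cutoff such that `target_sparsity` of sampled entries fall below it."""
    flat = sample_acts.abs().flatten().float()
    return torch.quantile(flat, target_sparsity).item()

# At inference time, every hidden state entering a weight matrix is thresholded:
def apply_threshold(x: torch.Tensor, t: float) -> torch.Tensor:
    return torch.where(x.abs() > t, x, torch.zeros_like(x))
```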
Hardware compatibility and quantization
In addition to activation sparsity, TEAL supports quantization, another key technique for shrinking LLMs and improving their efficiency. Quantization reduces the precision of model parameters, cutting the memory and compute required for inference. TEAL's sparsity approach complements quantization methods, allowing models to achieve even greater speedups while maintaining performance. Together AI's integration of TEAL with GPT-Fast, along with support for CUDA Graphs and torch.compile, further improves its hardware efficiency. TEAL runs well on GPUs such as the A100, where its sparse kernels can outperform traditional dense kernels in certain scenarios. This makes it an attractive option for environments with limited hardware resources, particularly for low-batch inference workloads.
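The two techniques compose naturally: quantization shrinks each weight that must be moved, while sparsity skips moving some activations and their weight channels entirely. A toy layer combining the two ideas (hypothetical names and a simplified per-row int8 scheme; this is not Together AI's GPT-Fast integration):

```python
import torch
import torch.nn.functional as F

class SparseInt8Linear(torch.nn.Module):
    """Linear layer with per-row int8 weights plus TEAL-style input sparsity."""

    def __init__(self, weight: torch.Tensor, threshold: float):
        super().__init__()
        # Per-output-row symmetric int8 quantization of the weight matrix.
        scale = (weight.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
        self.register_buffer("w_int8", torch.round(weight / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.threshold = threshold  # calibrated magnitude cutoff (assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sparsify the incoming hidden state, then apply the dequantized weight.
        x = torch.where(x.abs() > self.threshold, x, torch.zeros_like(x))
        return F.linear(x, self.w_int8.float() * self.scale)
```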
Applications and future potential
The most immediate application of TEAL is accelerating inference in resource-constrained environments, such as edge devices with limited memory and processing power. TEAL's ability to cut memory traffic and reduce latency makes it an ideal fit for these scenarios, and it excels in low-batch configurations, where it delivers the largest speedups. TEAL also holds promise for inference providers managing large fleets of GPUs and models. Hosting over 100 leading open-source models, Together AI is well positioned to benefit from TEAL's performance improvements: the technique lets these models be served more efficiently by reducing memory usage and improving processing speed, even when active batch sizes are relatively small.
Conclusion
The release of TEAL by Together AI marks a significant advancement in LLM optimization. By introducing a training-free approach to activation sparsity, TEAL offers a simple and effective remedy for the memory bottlenecks that have long plagued LLM inference. Its ability to achieve model-wide sparsity with minimal degradation, together with its support for quantization, makes it a powerful tool for improving inference efficiency in both resource-constrained environments and large-scale serving settings.
Details are available in Together AI's blog post, "TEAL: Training-Free Activation Sparsity in Large Language Models." All credit for this research goes to the researchers of this project.
Nikhil is a Consultant Intern at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI and machine learning enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.