Training large-scale AI models, such as transformers and language models, has become an indispensable yet highly demanding process in AI. With billions of parameters, these models offer breakthrough capabilities but come at a steep cost in computational power, memory, and energy consumption. For example, OpenAI's GPT-3, with 175 billion parameters, requires weeks of GPU training. Such enormous requirements limit these technologies to organizations with substantial computational resources, exacerbating concerns about energy efficiency and environmental impact. Addressing these challenges is critical to making AI advances more accessible and sustainable.
Inefficiencies in training large models stem primarily from their reliance on dense matrices, which demand significant memory and compute. Limited support in modern GPUs for optimized low-precision or low-rank operations further exacerbates these requirements. Although methods such as matrix factorization and heuristic rank reduction have been proposed to alleviate these problems, their real-world applicability is limited. For example, GaLore enables training in single-batch settings but suffers from impractical runtime overhead. Similarly, LTE, which adopts low-rank adapters, struggles to converge on large-scale tasks. The lack of a method that simultaneously reduces memory usage, computational cost, and training time without compromising performance has created an urgent need for innovative solutions.
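To make the underlying idea concrete, the sketch below shows the basic low-rank substitution that this family of methods builds on: replacing a dense weight matrix with two narrow factors. It is a generic PyTorch illustration with assumed shapes and rank, not GaLore's or LTE's actual mechanism:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Replace a dense d_out x d_in weight with two rank-r factors.

    Parameter count drops from d_in * d_out to r * (d_in + d_out),
    which is the basic saving that rank-reduction methods exploit.
    """
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (x @ B^T) @ A^T equals x @ (A @ B)^T but never materializes A @ B
        return x @ self.B.t() @ self.A.t()

dense_params = 4096 * 4096             # ~16.8M parameters for a dense layer
low_rank_params = 64 * (4096 + 4096)   # ~0.52M at rank 64, a ~32x reduction
```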
Researchers from the University at Albany SUNY, the University of California at Santa Barbara, Amazon Alexa AI, and Meta presented CoMERA (Computing- and Memory-Efficient training via Rank-Adaptive tensor optimization), a novel framework that combines memory efficiency with computational speed through rank-adaptive tensor compression. Unlike traditional methods that focus solely on compression, CoMERA adopts a multi-objective optimization approach to balance compression ratio and model accuracy. It uses tensorized embeddings and advanced tensor-network contractions to optimize GPU utilization, reducing runtime overhead while maintaining robust performance. The framework also employs CUDA Graphs to minimize kernel-launch delays during GPU operations, a major bottleneck in traditional tensor compression approaches.
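CUDA Graphs capture an entire GPU work sequence once and then replay it, so the many per-kernel launch calls of a training step collapse into a single launch. Below is a minimal PyTorch sketch of that capture-and-replay pattern, following the standard `torch.cuda.graph` recipe from the PyTorch documentation; the toy model, shapes, and optimizer are placeholders rather than CoMERA's actual training step:

```python
import torch
import torch.nn as nn

# Toy stand-ins: any static-shape model and optimizer follow the same recipe.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
static_x = torch.randn(32, 512, device="cuda")
static_y = torch.randint(0, 10, (32,), device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        loss_fn(model(static_x), static_y).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step (forward, backward, optimizer update).
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = loss_fn(model(static_x), static_y)
    static_loss.backward()
    opt.step()

# Replay: copy each new batch into the static tensors, then relaunch the
# entire captured step with a single call instead of many kernel launches.
static_x.copy_(torch.randn(32, 512, device="cuda"))
static_y.copy_(torch.randint(0, 10, (32,), device="cuda"))
g.replay()
```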
CoMERA is built on adaptive tensor representations, which allow model layers to adjust their ranks dynamically based on resource constraints. By modifying tensor ranks, the framework achieves compression without compromising the integrity of neural network operations. This dynamic optimization is achieved through a two-stage training process (sketched in code after the list below):
- An early stage focused on stable convergence
- A late stage that fine-tunes tensor ranks to meet specific compression targets
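How the late stage might shrink ranks is easiest to see on a single factorized matrix. The sketch below uses a simple singular-value energy threshold as a stand-in for CoMERA's multi-objective rank selection; it is illustrative only, not the paper's actual rule:

```python
import torch

def truncate_rank(A: torch.Tensor, B: torch.Tensor, energy: float = 0.99):
    """Shrink the rank of a factorized weight W ~= A @ B.

    Keeps the smallest rank whose leading singular values retain `energy`
    of the spectrum's total energy -- a simple stand-in for CoMERA's
    learned, multi-objective rank selection.
    """
    U, S, Vh = torch.linalg.svd(A @ B, full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
    r = int(torch.searchsorted(cum, energy).item()) + 1
    return U[:, :r] * S[:r], Vh[:r]  # new factors with reduced rank r

# Stage 1 (early): train at a generous fixed rank until loss stabilizes.
# Stage 2 (late): periodically truncate ranks toward the compression target.
A, B = torch.randn(512, 64), torch.randn(64, 512)
A2, B2 = truncate_rank(A, B, energy=0.95)
print(f"rank {A.shape[1]} -> {A2.shape[1]}")
```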
On a six-encoder transformer model, CoMERA achieved compression ratios ranging from 43x in its early stage to an impressive 361x in its late-stage optimizations. It also reduced memory consumption 9x compared to GaLore, with 2-3x faster training per epoch.
When applied to transformer models trained on the MNLI dataset, CoMERA shrank models from 256 MB to just 3.2 MB while preserving accuracy. On large-scale recommender systems such as DLRM, it compressed models by 99x and achieved a 7x reduction in peak memory usage. The framework also excelled in pre-training CodeBERT, a domain-specific large language model, achieving an overall compression ratio of 4.23x and a 2x speedup during certain training phases. These results underline its ability to handle diverse tasks and architectures, expanding its applicability across domains.
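As a back-of-the-envelope check on where such ratios come from: the reported MNLI shrinkage of 256 MB to 3.2 MB is an 80x overall reduction, and per-layer factors like 99x or 361x arise naturally once large matrices are stored as small tensor-train cores. The shapes and ranks below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Reported MNLI transformer sizes: 256 MB dense vs 3.2 MB compressed.
print(f"overall compression: {256 / 3.2:.0f}x")  # 80x

# Hypothetical tensor-train (TT) embedding: a 50,000 x 1,024 dense table
# (51.2M parameters) with the vocabulary factored as 50*40*25 and the
# embedding dimension as 8*8*16, using TT-rank 16 throughout.
rank = 16
cores = [(1, 50, 8, rank), (rank, 40, 8, rank), (rank, 25, 16, 1)]
tt_params = sum(r1 * v * d * r2 for (r1, v, d, r2) in cores)
dense_params = 50_000 * 1_024
print(tt_params, f"{dense_params / tt_params:.0f}x")  # ~95k params, ~540x
```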

The key conclusions of this research are the following:
- CoMERA achieved compression ratios of up to 361x for specific layers and 99x for full models, dramatically reducing storage and memory requirements.
- The framework provided 2-3x faster training times per epoch for transformers and recommender systems, saving time and computational resources.
- Using tensor representations and CUDA Graph, CoMERA reduced peak memory consumption by 7x, enabling training on smaller GPUs.
- CoMERA's approach supports various architectures, including transformers and large language models, while maintaining or improving accuracy.
- By reducing the energy and resource demands of training, CoMERA contributes to more sustainable ai practices and makes cutting-edge models accessible to a broader audience.

In conclusion, CoMERA addresses some of the biggest barriers to ai scalability and accessibility by enabling faster, more memory-efficient training. Its adaptive optimization capabilities and compatibility with modern hardware make it an attractive option for organizations looking to train large models without incurring prohibitive costs. The results of this study pave the way for further exploration of tensor-based optimizations in domains such as distributed computing and resource-constrained edge devices.
Check out the <a target="_blank" href="https://www.amazon.science/publications/comera-computing-and-memory-efficient-training-via-rank-adaptive-tensor-optimization" rel="noreferrer noopener">Paper</a>. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform receives more than 2 million monthly visits, illustrating its popularity among readers.