Research on optimizing the training of large language models (LLMs) underpins much of modern natural language processing and artificial intelligence. Training these models demands substantial computational and memory resources, which makes more efficient training algorithms a high-priority topic for researchers.
The main problem this paper addresses is the high memory demand of the optimization algorithms used to train large language models. In particular, the Adam optimizer, the de facto standard in the field because of its strong performance, must store optimizer states, namely the first- and second-order moment estimates, which together take up roughly twice as much memory as the model itself. As a result, training large models becomes expensive and less accessible to resource-constrained researchers. Alternative methods such as Adafactor reduce memory usage but often compromise performance, highlighting the need for more efficient solutions.
The Adam optimizer is widely used to train LLMs because it handles models of varying sizes and tasks effectively. However, its need to store per-parameter first- and second-order moments poses a considerable challenge. For example, training a 7-billion-parameter model with Adam requires approximately 56 GB per card for these states alone, and about 86 GB once gradients are included, which is prohibitively expensive even on advanced cards such as the A100-80GB. Workarounds such as CPU offloading are commonly used to manage this memory requirement, but they increase latency and slow down training.
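As a rough sanity check on those figures, the sketch below gives an illustrative back-of-the-envelope estimate, assuming 32-bit storage for the moments and gradients; the paper's exact totals depend on what is counted in the measurement.

```python
# Back-of-the-envelope estimate of Adam's optimizer-state memory,
# assuming 32-bit (4-byte) values; illustrative only.

def adam_state_memory_gb(num_params: float, bytes_per_value: int = 4) -> float:
    """Memory in GB for Adam's first- and second-moment buffers (m and v)."""
    return 2 * num_params * bytes_per_value / 1e9

num_params = 7e9                              # a 7-billion-parameter model
states_gb = adam_state_memory_gb(num_params)  # ~56 GB for m and v
grads_gb = num_params * 4 / 1e9               # roughly another 28 GB for fp32 gradients
print(f"optimizer states: {states_gb:.0f} GB, states + gradients: {states_gb + grads_gb:.0f} GB")
```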
Researchers from the Chinese University of Hong Kong (Shenzhen), the Shenzhen Big Data Research Institute, Duke University, and Stanford University presented Adam-mini, an optimizer designed to match or exceed Adam's performance while reducing memory usage by 45-50%. Adam-mini achieves this by partitioning the model parameters into blocks according to the Hessian structure of transformers and assigning each block a single high-quality learning rate, cutting the number of learning rates from billions down to a manageable number of block-wise values. This allows Adam-mini to maintain or even improve performance with a fraction of the memory Adam requires.
Adam-mini works by leveraging the near-block diagonal structure of Transformer Hessians, splitting parameters into blocks such as query, key, value, and MLP layers. For each block, a single effective learning rate is computed using the average of Adam’s second-order moment values in that block. This method reduces the memory footprint and simplifies the learning rate assignment process. For example, during pre-training of Llama2-7B on two A800-80GB GPUs, Adam-mini achieved a throughput of 5572.19 tokens per second, compared to 3725.59 tokens per second with AdamW, representing a 49.6% increase. This efficiency results in a 33% reduction in wall-clock time for processing the same number of tokens.
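To make the mechanism concrete, here is a minimal PyTorch-style sketch of an Adam-mini-like update rule; it is not the authors' released implementation, and the block partition, hyperparameters, and the `params_by_block` structure are illustrative assumptions. The key difference from Adam is that the element-wise second moment v is replaced by a single averaged scalar per block, so each block shares one effective learning rate while the first moment m stays per-parameter:

```python
import torch

def adam_mini_step(params_by_block, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One step of an Adam-mini-style update (illustrative sketch).

    params_by_block maps a block name (e.g. "layer0.attn.q") to a parameter
    tensor whose .grad has been populated by backward().
    """
    beta1, beta2 = betas
    for name, p in params_by_block.items():
        g = p.grad
        if name not in state:
            state[name] = {
                "step": 0,
                "m": torch.zeros_like(p),  # per-parameter first moment, as in Adam
                "v": torch.zeros(()),      # a single scalar second moment per block
            }
        st = state[name]
        st["step"] += 1
        # First moment: identical to Adam.
        st["m"].mul_(beta1).add_(g, alpha=1 - beta1)
        # Second moment: the mean of the squared gradients over the whole block
        # replaces Adam's element-wise v, which is where the memory saving comes from.
        st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean()
        # Bias correction, as in Adam.
        m_hat = st["m"] / (1 - beta1 ** st["step"])
        v_hat = st["v"] / (1 - beta2 ** st["step"])
        # One effective learning rate lr / (sqrt(v_hat) + eps) for the entire block.
        p.data.add_(m_hat, alpha=-(lr / (v_hat.sqrt() + eps)).item())
```

In this sketch each block would correspond to one of the units suggested by the transformer's near-block-diagonal Hessian (a query, key, value, or MLP matrix); the released optimizer also handles certain layers differently, which the sketch omits.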

The researchers validated Adam-mini on language models ranging from 125 million to 7 billion parameters, across pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). The optimizer matched or exceeded AdamW, with notable improvements in memory efficiency and training speed. On the SFT and RLHF tasks, for example, Adam-mini consistently outperformed AdamW, achieving higher evaluation scores and faster convergence.

In conclusion, the Adam-mini optimizer addresses the memory inefficiency of traditional methods like Adam by introducing a new partitioning strategy based on the Hessian structure of the model. This approach yields substantial memory savings and better training efficiency, making it a valuable tool for researchers working with large-scale language models. By cutting memory usage by up to 50% and increasing throughput by nearly 50%, Adam-mini not only makes training large models more feasible but also opens the door to broader participation by researchers with limited GPU resources.
Review the Paper. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among readers.