Machine learning has made significant advances, particularly through deep learning techniques. These advances rely heavily on optimization algorithms to train large-scale models for tasks such as language processing and image classification. At the core of this process is the challenge of minimizing complex, non-convex loss functions. Optimization algorithms such as Stochastic Gradient Descent (SGD) and its adaptive variants have become central to this effort. These methods iteratively tune model parameters to reduce training error, helping models generalize well to unseen data. However, while these optimization techniques have proven useful, there is still much room for improvement in how they handle long-term gradient information.
A fundamental challenge in training large neural networks is the effective use of gradients, which provide the updates needed to optimize model parameters. Traditional optimizers such as Adam and AdamW rely heavily on an exponential moving average (EMA) of recent gradients, emphasizing the most current gradient information and discarding older gradients. This approach works well for models where recent changes matter more. However, this can be problematic for larger models and long training cycles, as older gradients often still contain valuable information. As a result, the optimization process can be less efficient, requiring longer training periods or failing to reach the best possible solutions.
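To make this concrete, here is a minimal sketch (in plain NumPy, with illustrative names and values) of how Adam-style optimizers fold gradients into a single exponential moving average. With a typical decay of beta1 = 0.9, a gradient's weight in the average shrinks by roughly a factor of ten after about 22 steps, which is why information from much earlier in training is effectively forgotten.

```python
import numpy as np

def ema_update(m, grad, beta1=0.9):
    """Exponential moving average of gradients (Adam's first moment)."""
    return beta1 * m + (1.0 - beta1) * grad

# Toy loop: recent gradients dominate m, older ones decay geometrically.
m = np.zeros(4)                     # first-moment accumulator
for step in range(1000):
    grad = np.random.randn(4)       # stand-in for a real mini-batch gradient
    m = ema_update(m, grad)
```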
In current optimization methods, particularly Adam and AdamW, the use of a single EMA of past gradients limits the optimizer's ability to capture the full spectrum of gradient history. These methods adapt quickly to recent changes but discard the richer information that older gradients can still provide. Researchers have explored several approaches to address this limitation, yet many optimizers still struggle to balance responsiveness to recent gradients against retention of older ones. This shortcoming can result in suboptimal convergence rates and poorer model performance, especially in large-scale training scenarios such as language models or vision transformers.
Researchers at Apple and EPFL introduced a new approach to this problem with the AdEMAMix optimizer. Their method extends the traditional Adam optimizer by incorporating a mixture of two EMAs: one that changes rapidly and one that changes slowly. This allows the optimizer to respond to recent updates while preserving valuable older gradients that existing optimizers often discard. This dual-EMA system, unique to AdEMAMix, enables more efficient training of large-scale models, reducing the total number of tokens required for training while achieving comparable or better results.
The AdEMAMix optimizer introduces a second EMA to capture older gradients without losing the reactivity provided by the original EMA. Specifically, AdEMAMix maintains a fast-moving EMA that prioritizes recent gradients while also tracking a slow-moving EMA that retains information from much earlier in the training process. For example, when training a 1.3 billion parameter language model on the RedPajama dataset, the researchers found that AdEMAMix could match the performance of an AdamW model trained on 197 billion tokens using only 101 billion tokens; put differently, the AdamW baseline needed roughly 95% more tokens to reach the same loss. This efficiency gain translates to faster convergence and often better minima, allowing models to achieve superior performance with fewer computational resources.
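The update can be sketched roughly as follows. This is a simplified reading of the paper's description rather than the authors' implementation: the hyperparameter values (beta3, alpha, learning rate) and function names are illustrative, and bias correction follows standard Adam conventions.

```python
import numpy as np

def adamemix_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix-style update: a fast EMA (beta1) plus a slow EMA (beta3)."""
    state["t"] += 1
    t = state["t"]
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad      # fast EMA of gradients
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad      # slow EMA (long memory)
    state["v"]  = beta2 * state["v"]  + (1 - beta2) * grad**2   # second moment
    m1_hat = state["m1"] / (1 - beta1**t)                       # bias correction
    v_hat  = state["v"]  / (1 - beta2**t)
    update = (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)
    return theta - lr * (update + weight_decay * theta)

# Usage: initialize state once, then call adamemix_step every iteration.
theta = np.zeros(4)
state = {"t": 0, "m1": np.zeros_like(theta),
         "m2": np.zeros_like(theta), "v": np.zeros_like(theta)}
theta = adamemix_step(theta, np.random.randn(4), state)
```

In this sketch, alpha controls how strongly the long-memory EMA contributes to each step, which is how old gradients keep influencing the update without drowning out the fast-reacting term.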
Performance evaluations of AdEMAMix have demonstrated substantial improvements in speed and accuracy compared to existing optimizers. In a key experiment, a 110 million parameter model trained with AdEMAMix achieved loss values similar to those of an AdamW model that required almost twice as many training iterations. Specifically, the AdEMAMix model trained for 256,000 iterations achieved the same results as an AdamW model trained for 500,000 iterations. For even larger models, such as the 1.3 billion parameter language model, AdEMAMix yielded results comparable to those of an AdamW model trained for 1.5 million iterations, but with 51% fewer tokens. The optimizer also demonstrated a slower forgetting rate, which is a key advantage for maintaining model accuracy over long training cycles.
The researchers also addressed common challenges faced by optimizers, such as instabilities early in training. To overcome these, they introduced warm-up schedules for the slower of the two EMAs, progressively increasing its decay rate and its contribution to the update over the course of training. This gradual ramp-up stabilizes the model during the initial training phase, preventing the optimizer from prematurely relying on stale gradients. By carefully scheduling the adjustments for the two EMAs, AdEMAMix keeps the optimization process stable and efficient, even for models with tens of billions of parameters.
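For illustration, a simplified warm-up might look like the sketch below, assuming a plain linear ramp of the slow EMA's weight (alpha) and decay rate (beta3) over a fixed number of steps. The authors' exact schedules differ in form; the names, ramp shape, and values here are assumptions.

```python
def alpha_schedule(step, alpha_final=5.0, warmup_steps=100_000):
    """Linearly grow the slow EMA's weight in the update from 0 to its final value."""
    return alpha_final * min(1.0, step / warmup_steps)

def beta3_schedule(step, beta_start=0.9, beta_final=0.9999, warmup_steps=100_000):
    """Gradually push the slow EMA's decay rate toward its final (very long memory) value.
    A plain linear ramp is used here for simplicity."""
    frac = min(1.0, step / warmup_steps)
    return beta_start + frac * (beta_final - beta_start)
```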
In conclusion, the AdEMAMix optimizer presents a notable advancement in machine learning optimization. By incorporating two EMAs to leverage both recent and old gradients, it addresses a key limitation of traditional optimizers such as Adam and AdamW. This dual-EMA approach allows models to achieve faster convergence with fewer tokens, reducing the computational burden of training large models; in the reported experiments, AdEMAMix consistently outperformed AdamW, demonstrating its potential to improve performance on language modeling and image classification tasks. The method's ability to reduce model forgetting during training further underscores its value for large-scale and long-term ML projects, making it a powerful tool for researchers and industry.
Take a look at the Paper and GitHub. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among readers.