Adam is widely used in deep learning as an adaptive optimization algorithm, but it can fail to converge unless the hyperparameter β2 is tuned to the specific problem. Attempts to fix this, such as AMSGrad, rely on the impractical assumption of uniformly bounded gradient noise, which does not hold when the noise is Gaussian, as in variational autoencoders and diffusion models. Other methods, such as AdaShift, address convergence only in limited scenarios and do not extend to general problems. Recent analyses show that Adam can converge if β2 is adjusted per task, but this tuning is complex and problem-specific, leaving the search for a universal solution open.
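For context, here is a minimal sketch of the standard Adam update (plain NumPy, illustrative variable names and defaults, not the paper's code); β2 controls the exponential moving average of squared gradients, which is the quantity the convergence issues revolve around:

import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update (illustrative sketch).

    theta: parameters, g: stochastic gradient at step t,
    m, v: first/second moment estimates, t: 1-based step counter.
    """
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2       # second moment includes the *current* gradient
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v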
Researchers from the University of Tokyo present ADOPT, a new adaptive gradient method that achieves the optimal convergence rate of O(1/√T) without problem-specific choices of β2 and without the bounded-noise assumption. ADOPT fixes Adam's non-convergence by excluding the current gradient from the second-moment estimate and by changing the order of the momentum update and the normalization. Experiments on a wide range of tasks, including image classification, generative modeling, natural language processing, and reinforcement learning, show that ADOPT outperforms Adam and its variants. The method also converges reliably in challenging cases where Adam and AMSGrad struggle.
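Translated into code, the two modifications look roughly as follows (a hedged sketch based on the description above, not the authors' reference implementation; defaults are illustrative): the gradient is normalized by the previous second-moment estimate before the momentum update, and the second moment is refreshed only afterwards, so the current gradient never scales itself.

import numpy as np

def adopt_step(theta, g, m, v, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """One ADOPT-style update (illustrative sketch of the idea described above).

    v should be initialized from an initial gradient estimate (e.g., g_0**2)
    rather than zeros, so the first normalization is well scaled.
    """
    m = beta1 * m + (1 - beta1) * g / np.maximum(np.sqrt(v), eps)  # normalize first, then momentum
    theta = theta - lr * m                                          # parameter update
    v = beta2 * v + (1 - beta2) * g**2                              # second moment updated last
    return theta, m, v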
This study considers minimizing an objective function over a vector of parameters using first-order stochastic optimization. Instead of the exact gradient, the algorithm only has access to an estimate known as a stochastic gradient. Because the objective can be non-convex, the goal is to find a stationary point, where the gradient vanishes. Standard convergence analyses in this setting rest on several key assumptions: the objective is bounded below, the stochastic gradient is an unbiased estimate of the true gradient, the objective is smooth, and the variance of the stochastic gradient is uniformly bounded. For adaptive methods like Adam, analyses often add the stronger assumption that the gradient noise itself is uniformly bounded, which simplifies the proofs. The researchers work under the standard set of assumptions and investigate how adaptive gradient methods converge without that stricter bounded-noise requirement.
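In standard notation, the assumptions listed above are usually written as follows (θ denotes the parameters, f the objective, g_t the stochastic gradient at step t; the symbols and constants are illustrative, not taken verbatim from the paper):

% Standard assumptions for non-convex stochastic optimization (illustrative notation)
\begin{align*}
&\text{(A1) Bounded below:} && f(\theta) \ge f_{\inf} > -\infty \quad \forall \theta, \\
&\text{(A2) Unbiased gradient:} && \mathbb{E}\left[g_t \mid \theta_t\right] = \nabla f(\theta_t), \\
&\text{(A3) Smoothness:} && \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \\
&\text{(A4) Bounded variance:} && \mathbb{E}\left[\|g_t - \nabla f(\theta_t)\|^2\right] \le \sigma^2 .
\end{align*}

The stricter assumption that the paper avoids is a uniform bound on the noise itself, i.e. \(\|g_t - \nabla f(\theta_t)\| \le G\) almost surely, rather than only a bound on its variance.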
Previous research shows that while basic stochastic gradient descent converges under standard assumptions even in non-convex settings, adaptive gradient methods like Adam are preferred in deep learning for their flexibility. However, Adam can fail to converge, even in simple convex cases. To address this, a modified version called AMSGrad was developed; it keeps the learning-rate scaling non-increasing by updating the second-moment estimate with a maximum operation. Still, AMSGrad's convergence guarantee relies on the stronger assumption of uniformly bounded gradient noise, which does not hold in all scenarios, such as in certain generative models. The researchers therefore propose a new adaptive gradient update that ensures reliable convergence without strict assumptions about gradient noise, fixing Adam's convergence issue while avoiding problem-dependent hyperparameter choices.
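AMSGrad's fix, as described above, amounts to a small change to Adam's second-moment handling; a sketch (illustrative, bias correction and defaults may differ from specific implementations):

import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update (illustrative sketch): identical to Adam except
    that a running maximum of the second-moment estimate is used, so the
    per-coordinate step size is non-increasing."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_max = np.maximum(v_max, v)                     # the key change vs. Adam
    m_hat = m / (1 - beta1**t)
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max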
The ADOPT algorithm is evaluated on a range of tasks to verify its performance and robustness against Adam and AMSGrad. On a toy problem, ADOPT converges where Adam does not, especially under heavy gradient noise. Experiments with an MLP on MNIST and a ResNet on CIFAR-10 show that ADOPT achieves faster and more stable convergence. ADOPT also outperforms Adam on Swin Transformer-based ImageNet classification, NVAE generative modeling, and GPT-2 pre-training under noisy gradient conditions, and it yields higher scores when finetuning the LLaMA-7B language model, as measured on the MMLU benchmark.
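In practice, these experiments amount to swapping the optimizer in an otherwise unchanged training loop. A minimal PyTorch sketch is shown below; the import path and constructor arguments for ADOPT are assumptions based on the project's GitHub repository and may differ from the released API, and the model and batch are stand-ins:

import torch
from adopt import ADOPT  # assumed import path from the authors' repository

model = torch.nn.Linear(784, 10)                              # stand-in model (e.g., one MNIST MLP layer)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # baseline
optimizer = ADOPT(model.parameters(), lr=1e-3)                # drop-in replacement

criterion = torch.nn.CrossEntropyLoss()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))      # dummy batch

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()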
The study addresses a theoretical limitation of adaptive gradient methods like Adam, which require problem-specific hyperparameter settings to converge. To solve this, the authors present ADOPT, an optimizer that achieves the optimal convergence rate on a variety of tasks without problem-specific tuning. ADOPT overcomes Adam's limitations by reordering the momentum update and the normalization and by excluding the current gradient from the second-moment estimate, ensuring stable behavior on tasks such as image classification, NLP, and generative modeling. The work bridges theory and practice in adaptive optimization, although future research could explore more relaxed assumptions to further generalize ADOPT's effectiveness.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and artificial intelligence to real-world challenges. With a strong interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.