Recently, there has been growing interest in improving the generalization of deep networks by regulating the sharpness of the loss landscape. Sharpness-Aware Minimization (SAM) has gained popularity for its superior performance on several benchmarks, and in particular for its handling of random label noise, where it outperforms SGD by significant margins and improves substantially on existing techniques. SAM's effectiveness persists even in underparameterized regimes, with gains that potentially grow on larger datasets. Understanding SAM's behavior, especially in the early phases of training, is crucial to optimizing its performance.
While the underlying mechanisms of SAM remain elusive, several studies have attempted to shed light on the role of its regularization, for example in per-example SAM (1-SAM). Some researchers showed that in sparse regression, 1-SAM is biased toward sparser weights than naive SAM. Previous studies also differentiate the two variants by how each regularizes “flatness,” and recent research links naive SAM to generalization, underscoring the importance of understanding SAM's behavior beyond convergence.
Researchers at Carnegie Mellon University provide a study investigating, at a mechanistic level, why 1-SAM is more robust to label noise than SGD. By decomposing each example's gradient into the network's logit-scale and Jacobian terms, the analysis identifies the key mechanisms that improve early-stopping test accuracy. In linear models, SAM's explicit upweighting of low-loss points is beneficial, especially in the presence of mislabeled examples. In deep networks, empirical findings suggest that SAM's label noise robustness originates primarily from its Jacobian term, indicating a fundamentally different mechanism from the logit-scale term. Furthermore, analysis of Jacobian-only SAM reveals that its update decomposes into SGD with ℓ2 regularization, providing insight into its performance improvement. These findings underscore that SAM's label noise robustness stems from the optimization trajectory rather than from sharpness properties at convergence.
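To make this decomposition concrete, here is a minimal sketch (an illustration, not the paper's code) verifying that for cross-entropy on a linear model, the per-example gradient factors into a logit-scale term, softmax(f(x)) − onehot(y), times the network Jacobian, which for logits Wx is simply the input x:

```python
import torch
import torch.nn.functional as F

# Minimal sketch (an illustration, not the paper's code): for cross-entropy
# on a linear model f_W(x) = Wx, the per-example gradient factors as
#   grad_W loss = (softmax(f_W(x)) - onehot(y)) outer x,
# i.e. a "logit scale" term times the network Jacobian (which is just x here).

torch.manual_seed(0)
d, k = 5, 3                                  # input dimension, number of classes
W = torch.randn(k, d, requires_grad=True)
x = torch.randn(d)
y = torch.tensor(1)

logits = W @ x
loss = F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
loss.backward()

# Logit-scale term: small for well-fit (low-loss) points.
logit_scale = torch.softmax(logits, dim=-1) - F.one_hot(y, k).float()
manual_grad = torch.outer(logit_scale, x).detach()

print(torch.allclose(W.grad, manual_grad))   # True: the two factors recover the gradient
```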
Through experiments on Gaussian toy data with label noise, SAM demonstrates significantly higher early-stopping test accuracy than SGD. Analyzing SAM's update reveals that its adversarial weight perturbation upweights the gradient signal from low-loss points, thus keeping the contribution of clean examples high in the early training epochs. This preference for clean data yields higher test accuracy before the model overfits to noise. The study also examines SAM's logit scaling, showing how it effectively amplifies gradients from low-loss points and consequently improves overall performance. This preference for low-loss points is established through mathematical proofs and empirical observations, highlighting the distinct behavior of 1-SAM relative to naive SAM updates.
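The following toy sketch (under assumed values, not the authors' code) makes this reweighting explicit for logistic regression: the 1-SAM perturbation shifts each example's margin m = y·(w·x) down by ρ‖x‖, so the scalar gradient weight σ(−m) becomes σ(−m + ρ‖x‖), a multiplicative boost that is largest exactly where the loss is lowest:

```python
import numpy as np

# Toy sketch under assumed values (not the authors' code): for logistic loss
# l(m) = log(1 + exp(-m)) with margin m = y * (w . x), the per-example SAM
# perturbation eps = rho * g / ||g|| shifts the margin to m - rho*||x||, so
# the scalar gradient weight sigmoid(-m) becomes sigmoid(-m + rho*||x||).

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
rho_x = 1.0  # assumed value of rho * ||x|| for illustration

for m in [-2.0, 0.0, 2.0, 4.0]:  # margins from high loss to low loss
    boost = sigmoid(-m + rho_x) / sigmoid(-m)
    print(f"margin {m:+.1f}: 1-SAM/SGD gradient weight ratio = {boost:.2f}")
# The ratio grows with the margin, so low-loss (typically clean) points
# receive the largest relative upweighting.
```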
The researchers then simplify SAM's regularization to ℓ2 regularization on the last-layer weights and on the intermediate activations of the last hidden layer, applied during standard SGD training. This regularized objective is evaluated on CIFAR10 with a ResNet18 architecture; due to instability issues with batch normalization under 1-SAM, they replace it with layer normalization. Comparing SGD, 1-SAM, L-SAM, J-SAM, and regularized SGD, they find that while regularized SGD does not match SAM's test accuracy, the gap shrinks significantly, from 17% to 9%, under label noise. In noise-free settings, however, regularized SGD improves only marginally, while SAM maintains an 8% advantage over SGD. This suggests that, while it does not fully explain SAM's generalization benefits, similar regularization of the final layers is crucial to SAM's performance, especially in noisy environments.
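A hypothetical PyTorch sketch of such a regularized-SGD surrogate is shown below; the architecture, variable names, and coefficient lam are assumptions for illustration, and the paper's exact objective and hyperparameters may differ:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the regularized-SGD surrogate described above:
# cross-entropy plus l2 penalties on the final-layer weights and on the last
# hidden layer's activations. The architecture, names, and coefficient `lam`
# are assumptions; the paper's exact objective and coefficients may differ.

class Net(nn.Module):
    def __init__(self, d=32, h=64, k=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d, h), nn.ReLU())
        self.head = nn.Linear(h, k)          # last-layer weights to penalize

    def forward(self, x):
        acts = self.backbone(x)              # last hidden-layer activations
        return self.head(acts), acts

model = Net()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
lam = 1e-3                                   # assumed regularization strength

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
logits, acts = model(x)
loss = (nn.functional.cross_entropy(logits, y)
        + lam * model.head.weight.norm() ** 2   # l2 on last-layer weights
        + lam * acts.norm() ** 2)               # l2 on intermediate activations
opt.zero_grad()
loss.backward()
opt.step()
```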
In conclusion, this work provides a solid perspective on SAM's effectiveness by demonstrating its ability to prioritize learning clean examples before fitting noisy ones, particularly in the presence of label noise. In linear models, SAM explicitly boosts gradients from low-loss points, similar to existing label noise robustness methods. In nonlinear settings, regularizing the weights and intermediate activations of the final layers improves robustness to label noise, similar to methods that regularize the logit norm. Despite these similarities, SAM remains underexplored in the label noise literature. Because simulating aspects of SAM's network Jacobian regularization can preserve its performance, there is potential for developing label noise robustness methods inspired by SAM's principles, without the time cost of 1-SAM's additional passes.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our Newsletter.
Don't forget to join our 41k+ ML SubReddit.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.