The Slingshot Effect: A Late-Stage Optimization Anomaly in the Adam Family of Optimization Methods
Adaptive gradient methods, particularly Adam, have become indispensable for optimizing neural networks, especially in conjunction with Transformers. In this paper, ...