Adaptive gradient methods, particularly Adam, have become indispensable for optimizing neural networks, especially in conjunction with Transformers. In this paper, we present a novel optimization anomaly, which we call the Slingshot effect, that manifests during extremely late stages of training. A distinctive feature of this phenomenon is cyclic phase transitions between stable and unstable training regimes, evident in the cyclic behavior of the norm of the last layer's weights. Although the Slingshot effect is easily reproduced across a range of settings, it does not align with any known optimization theory, underscoring the need for in-depth examination.
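For illustration only, the following is a minimal sketch of how one might monitor the last-layer weight norm during long Adam training to look for Slingshot-style cycles; the toy model, random data, and hyperparameters are placeholders and do not reflect the experimental setup used in this paper.

```python
# Hypothetical sketch: track the last-layer weight norm while training with Adam,
# to look for the cyclic norm growth/collapse associated with the Slingshot effect.
# Model, data, and hyperparameters are illustrative stand-ins, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: a small MLP on random data (a stand-in for the paper's Transformer tasks).
X = torch.randn(512, 32)
y = torch.randint(0, 10, (512,))

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
last_layer = model[-1]  # the layer whose weight norm we monitor

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

norm_history, loss_history = [], []
for step in range(10_000):  # very long training, since the effect appears late
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

    # Record the quantities whose cyclic behavior would signal a Slingshot.
    norm_history.append(last_layer.weight.norm().item())
    loss_history.append(loss.item())

    if step % 1000 == 0:
        print(f"step {step:6d}  loss {loss.item():.4f}  ||W_last|| {norm_history[-1]:.3f}")
```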
Furthermore, we observe that Grokking predominantly occurs at the onset of Slingshot effects and is absent without them, even in the absence of explicit regularization. This finding suggests a surprising inductive bias of adaptive gradient optimizers in the late stages of training and calls for a revised theoretical analysis of its origin.
Our study sheds light on an intriguing optimization behavior that has significant implications for understanding the inner workings of adaptive gradient methods.