The phenomenon of grokking in neural networks puts the traditional theory of how networks learn and generalize to the test. When a neural network is trained, the expectation is that test performance improves as the training loss decreases and converges to a low value, after which the network’s behavior stabilizes. Grokking breaks this pattern: the network at first appears to memorize the training data, reaching a low and steady training loss while generalizing poorly, and then, surprisingly, evolves to perfect generalization with further training.
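To make the phenomenon concrete, below is a minimal sketch of the kind of setting in which grokking is commonly reported: a small network trained with weight decay on modular addition, with training and test accuracy logged over many steps. The task, architecture, and hyperparameters here are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of a setup in which grokking is often observed:
# a small network trained on modular addition with weight decay.
# Modulus, architecture, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

P = 97                      # modulus for the (a + b) mod P task
TRAIN_FRAC = 0.3            # fraction of all pairs used for training
torch.manual_seed(0)

# Build the full dataset of input pairs and their labels.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
x = torch.cat([nn.functional.one_hot(pairs[:, 0], P),
               nn.functional.one_hot(pairs[:, 1], P)], dim=1).float()

perm = torch.randperm(len(x))
n_train = int(TRAIN_FRAC * len(x))
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(x[idx]).argmax(dim=1) == labels[idx]).float().mean().item()

# Full-batch training: train accuracy typically saturates early, while
# test accuracy may only jump much later -- the grokking signature.
for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(x[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.3f}  "
              f"test acc {accuracy(test_idx):.3f}")
```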
This raises a question: why would the network’s test performance improve dramatically with further training, even after it has already reached virtually perfect training performance? A network that first achieves perfect training accuracy with poor generalization and then, with more training, converts to perfect generalization is, in essence, exhibiting grokking. In a recent research paper, a team of researchers proposed an explanation for grokking based on the coexistence of two types of solutions to the task the network is trying to learn:
- Generalizing solution: A solution that generalizes well to new data. For the same parameter norm, i.e., the magnitude of the network’s parameters, it produces larger logits (output values); it is learned more slowly but is more efficient.
- Memorizing solution: A solution in which the network memorizes the training data, which yields perfect training accuracy but poor generalization. Memorizing circuits are learned quickly, but they are less efficient because they need a larger parameter norm to produce the same logit values (a simple way to compare the two is sketched after this list).
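As a rough illustration of the efficiency notion used above, the sketch below computes the average correct-class logit a model produces per unit of parameter norm. The helper names and the exact formula are simplifications for illustration; the paper’s formal definition may differ in detail.

```python
# Illustrative sketch of "efficiency": how large a correct-class logit a
# circuit produces per unit of parameter norm. A simplified proxy, not the
# paper's formal definition.
import torch
import torch.nn as nn

def param_norm(model: nn.Module) -> float:
    """L2 norm of all parameters, treated as one long vector."""
    return torch.sqrt(sum((p ** 2).sum() for p in model.parameters())).item()

def mean_correct_logit(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    """Average logit assigned to the correct class."""
    with torch.no_grad():
        logits = model(x)
        return logits.gather(1, y.unsqueeze(1)).mean().item()

def efficiency(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    """Correct-class logit produced per unit of parameter norm."""
    return mean_correct_logit(model, x, y) / param_norm(model)

# Usage (with hypothetical trained models and held-out data):
# eff_gen = efficiency(generalizing_model, x_test, y_test)
# eff_mem = efficiency(memorizing_model, x_test, y_test)
# Under this proxy, the generalizing circuit is the one with higher efficiency.
```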
The team shows that memorizing circuits become less efficient as the size of the training dataset grows, while generalizing circuits are largely unaffected. This implies that there is a critical dataset size, i.e., a size at which the generalizing and memorizing circuits are equally efficient. The team validated the following four predictions, providing strong evidence for their explanation.
- Grokking: The authors predicted and demonstrated that grokking happens when a network shifts from memorizing the training data at first to progressively favoring the generalizing solution. Test accuracy increases as a result of this shift.
- Critical dataset size: They introduced the idea of a critical dataset size, at which memorizing and generalizing circuits are equally efficient. This critical size marks a pivotal stage in the learning process.
- Ungrokking: One of the most unexpected findings is the occurrence of “ungrokking.” If a network that has successfully grokked is further trained on a dataset significantly smaller than the critical dataset size, it regresses from perfect to low test accuracy.
- Semi-grokking: The research also introduces semi-grokking, in which a network trained on a dataset size that balances the efficiency of memorizing and generalizing circuits goes through a phase transition but achieves only partial, rather than perfect, test accuracy. This behavior demonstrates the subtle interplay between different learning mechanisms in neural networks (a sketch of such dataset-size experiments follows this list).
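The sketch below outlines, under the same illustrative modular-addition setup as earlier, how one might probe these predictions empirically: sweeping the training-set size to look for a critical size (and for partial accuracy near it), and continuing to train a grokked model on a much smaller set to look for ungrokking. All sizes, step counts, and hyperparameters are assumptions chosen for illustration, not the paper’s experimental settings.

```python
# Hedged sketch of dataset-size experiments suggested by the predictions:
# (1) sweep the training-set size to look for a critical size, and
# (2) continue training a grokked model on a much smaller set (ungrokking).
# Task, architecture, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

P = 97
torch.manual_seed(0)
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
x = torch.cat([nn.functional.one_hot(pairs[:, 0], P),
               nn.functional.one_hot(pairs[:, 1], P)], dim=1).float()
perm = torch.randperm(len(x))

def make_model():
    return nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))

def train(model, idx, steps):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x[idx]), labels[idx]).backward()
        opt.step()
    return model

def test_acc(model, idx):
    with torch.no_grad():
        return (model(x[idx]).argmax(1) == labels[idx]).float().mean().item()

# 1) Dataset-size sweep: the smallest size whose final test accuracy is near
#    1.0 approximates the critical dataset size; sizes close to it may
#    plateau at partial accuracy (semi-grokking).
for n_train in (1000, 2000, 4000, 8000):
    idx_train, idx_test = perm[:n_train], perm[n_train:]
    model = train(make_model(), idx_train, steps=30_000)
    print(f"n_train={n_train:5d}  test acc={test_acc(model, idx_test):.3f}")

# 2) Ungrokking: keep training a grokked model on a much smaller subset and
#    watch whether test accuracy falls back down.
grokked = train(make_model(), perm[:8000], steps=30_000)
small_idx = perm[:500]                      # far below the critical size
ungrokked = train(grokked, small_idx, steps=30_000)
print(f"after further training on {len(small_idx)} examples: "
      f"test acc={test_acc(ungrokked, perm[8000:]):.3f}")
```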
In conclusion, this research offers a thorough and original explanation of the grokking phenomenon. It shows that a key factor shaping the network’s behavior during training is the coexistence of memorizing and generalizing solutions, together with their relative efficiency. With the predictions and empirical evidence it provides, the dynamics of neural network generalization can be better understood.
Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.