While it may seem attractive to train ML optimizers, doing so is expensive because each training example for such a system is itself an optimization problem. Generalization in this context refers to the ability to apply what was learned to “similar” optimization tasks that were not encountered during training.
The idea that revolutionized machine learning, replacing hand-designed features with learned ones, extends naturally into optimizer space through learning-to-learn (L2L) systems. A rigorous mathematical investigation of the properties of L2L systems is difficult, however, because it requires defining distributions over optimization problems.
The new study Mnemosyne: Learning to Train Transformers with Transformers by a team from Google and Columbia University proposes the Mnemosyne Optimizer, an L2L system intended to train entire neural network topologies without any task-specific optimizer tuning.
Mnemosyne is built on the scalable low-rank implicit-attention memory cells used in Performer architectures, together with techniques that approximate attention via a low-rank decomposition of the attention matrix. This design avoids the quadratic complexity of conventional attention while the optimizer trains a complete neural network architecture.
Standard transformers can be viewed as differentiable dictionaries that implement powerful associative-memory mechanisms with exponential memory, whereas low-rank linear attention mechanisms are more space-efficient and better suited to large-scale memory systems.
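To illustrate the contrast between quadratic softmax attention and low-rank linear attention, here is a minimal sketch. It is not the paper's implementation: it uses a simple positive ReLU-based feature map (`phi`, a hypothetical stand-in for Performer's random-feature kernels) purely to show how reordering the matrix products avoids ever materializing the L × L attention matrix.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Conventional attention: materializes an L x L matrix (quadratic in sequence length L).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Low-rank / kernelized attention: map queries and keys through a positive
    # feature map phi, then reorder the products so the L x L attention matrix
    # is never formed. Cost becomes linear in L instead of quadratic.
    Qp, Kp = phi(Q), phi(K)             # (L, m) feature representations
    KV = Kp.T @ V                       # (m, d) summary of keys and values
    normalizer = Qp @ Kp.sum(axis=0)    # (L,) row-wise normalization
    return (Qp @ KV) / normalizer[:, None]

# Toy comparison on random data.
L, d = 128, 16
Q, K, V = (np.random.randn(L, d) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The outputs are approximations of each other only insofar as the feature map approximates the softmax kernel; the point of the sketch is the memory footprint, which grows linearly rather than quadratically with sequence length.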
The main advantages of Mnemosyne, identified by the researchers, are the following:
- It generalizes better than the latest generation of LSTM optimizers.
- Meta-trained only on conventional multilayer perceptrons (MLPs), it can successfully train Vision Transformers (ViTs).
- It can initialize optimizers in robotics applications, resulting in faster convergence.
In this empirical work, Mnemosyne was meta-trained and tested on a range of NN training tasks spanning a wide variety of architectures and data sets. The results demonstrate that Mnemosyne can optimize MLPs with a wide variety of designs and activation functions, and that it does so faster than competing optimizers.
The team theoretically examines Mnemosyne’s compact associative memory (CAM) and shows that it can store and retrieve patterns in much the same way as its usual non-compact counterparts, while standing out for its ability to do so implicitly.
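To make the associative-memory analogy concrete, below is a minimal, hypothetical sketch (not the paper's CAM construction) of the non-compact baseline: attention-style similarity scores retrieve a stored value from a noisy query, with every key and value kept explicitly. A compact memory would instead summarize the stored patterns implicitly.

```python
import numpy as np

def associative_recall(stored_keys, stored_values, query, beta=4.0):
    # Attention-style associative memory: given a (possibly noisy) query,
    # softly retrieve the stored value whose key it most resembles.
    scores = beta * stored_keys @ query      # similarity to every stored key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ stored_values           # soft retrieval of the value

# Store a few random patterns and recall one from a corrupted query.
rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 32))
values = rng.standard_normal((8, 32))
noisy_query = keys[3] + 0.1 * rng.standard_normal(32)
retrieved = associative_recall(keys, values, noisy_query)
print(np.allclose(retrieved, values[3], atol=0.5))  # True: pattern 3 is recovered
```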
According to the researchers, theirs is the first study to establish such substantial capacity results for Mnemosyne’s algorithmic core. They hope it will serve as a springboard for further research into learnable attention-based optimizers for the extremely challenging task of training Transformers.
Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 13k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast and has a strong interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advances in technology and their real-life applications.