The transformer architecture has become the preferred choice for modeling data across many domains. In practice, the transformer's inductive biases also make it a strong candidate for scaling, so larger versions of existing smaller models are regularly trained and released. Although these models are often just enlarged versions of their smaller counterparts, new instances are typically trained from scratch. Since even the smaller models require significant computational resources to train, the parameters of pretrained smaller models ought to be reused to speed up the training of larger ones.
Viewed from the perspective of model growth, one strategy is to use the pretrained parameters of a smaller model to initialize some of the parameters of the larger model. Recent research has shown that training can be sped up by copying a subset of the pretrained parameters to initialize the new parameters and then tuning the entire network. This contrasts with earlier work, which generally froze the parameters initialized from the pretrained model and trained only the new (randomly initialized) parameters.
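As a rough illustration of this copy-and-tune idea, the sketch below (a minimal NumPy toy with hypothetical function names, not the authors' code) initializes a wider layer by copying a pretrained weight matrix into one block and randomly initializing the rest; in the copy-and-tune setting, the whole matrix would subsequently be trained rather than frozen.

```python
import numpy as np

def grow_layer_by_copying(w_small, d_out_new, d_in_new, rng=None):
    """Initialize a larger weight matrix from a smaller pretrained one.

    The pretrained weights fill the top-left block; the remaining entries
    are randomly initialized. In the copy-and-tune setting described above,
    the entire matrix is then fine-tuned (nothing is frozen).
    """
    rng = rng or np.random.default_rng(0)
    d_out_old, d_in_old = w_small.shape
    assert d_out_new >= d_out_old and d_in_new >= d_in_old

    w_large = rng.normal(scale=0.02, size=(d_out_new, d_in_new))
    w_large[:d_out_old, :d_in_old] = w_small  # reuse pretrained parameters
    return w_large

# Example: grow a 256x256 pretrained projection to 512x512.
w_small = np.random.default_rng(1).normal(size=(256, 256))
w_large = grow_layer_by_copying(w_small, 512, 512)
print(w_large.shape)  # (512, 512)
```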
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) propose using pretrained smaller language models to make these training approaches more effective while reducing their cost and time requirements. Their approach uses machine learning to "grow" a more complex model from a simpler one, encoding the smaller model's prior knowledge so that the larger model trains more quickly. Rather than discarding old models, the team takes their best parts and uses them to build something new.
Compared to training a new model from scratch, their approach reduces the time and computational effort required to train a large model by around 50%. Moreover, the MIT method produced models that matched or exceeded the performance of models produced by other methods that use smaller models to speed up the training of larger ones.
Saving time in training large models could have a positive impact on research efficiency, cost, and environmental sustainability by reducing carbon emissions produced during the training process. This could also allow smaller research groups to access and collaborate on these huge models, which could pave the way for numerous new developments.
The proposed technique, called the Learned Linear Growth Operator (LiGO), expands the width and depth of a network based on the characteristics of a smaller network, learned from data. The researchers use machine learning to learn a linear mapping of the smaller model's parameters: as a mathematical operation, this linear map takes the parameters of the smaller model as input and produces the parameters of the larger model as output.
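Conceptually, such an operator can be pictured as a single linear map M applied to the flattened parameters of the small model to produce the parameters of the large model. The sketch below is an illustrative NumPy toy of this naive, unfactorized form (the matrix M would itself be learned); it is not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: a "small" model with 1,000 parameters grown to 4,000.
n_small, n_large = 1_000, 4_000

theta_small = rng.normal(size=n_small)                 # flattened pretrained parameters
M = rng.normal(scale=0.01, size=(n_large, n_small))    # linear growth operator (learned in practice)

theta_large = M @ theta_small                          # initialization for the larger model
print(theta_large.shape)                               # (4000,)
```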
Researchers may want to grow a model to a billion parameters, yet even the smaller model can be quite large (perhaps a hundred million parameters), so a single linear map between the two would be enormous. To make the map manageable for a machine learning system, the LiGO method breaks it into smaller pieces.
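A minimal sketch of one such decomposition is below: hypothetical per-layer width-expansion matrices act on each weight matrix separately, so the full map over all parameters is never materialized. This is only meant to convey the idea of factoring the operator into small pieces, not the exact structure used in LiGO.

```python
import numpy as np

def expand_width(w_small, a_out, a_in):
    """Grow one layer's weight matrix with small expansion matrices.

    w_small: (d_old, d_old) pretrained weights
    a_out:   (d_new, d_old) output-dimension expansion (learned in practice)
    a_in:    (d_new, d_old) input-dimension expansion (learned in practice)
    Returns a (d_new, d_new) matrix; only the two small factors are stored.
    """
    return a_out @ w_small @ a_in.T

rng = np.random.default_rng(0)
d_old, d_new = 256, 512

w_small = rng.normal(size=(d_old, d_old))
a_out = rng.normal(scale=0.05, size=(d_new, d_old))
a_in = rng.normal(scale=0.05, size=(d_new, d_old))

w_large = expand_width(w_small, a_out, a_in)
print(w_large.shape)  # (512, 512); stored as two 512x256 factors, not one giant map
```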
LiGO improves on alternative strategies because it grows the network in width and depth simultaneously. The researchers also note that, by providing the smaller model along with the desired specifications, users can set the width and depth of the larger model to whatever they need.
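To sketch how width and depth can be grown together, the toy example below (hypothetical helper names, not the paper's API or exact construction) first widens each pretrained layer and then forms the deeper model's layers as mixtures of the widened ones; the target width and depth are simply arguments the user chooses.

```python
import numpy as np

def grow_width_and_depth(layers_small, d_new, depth_new, rng=None):
    """Toy illustration of simultaneous width and depth growth.

    layers_small: list of (d_old, d_old) pretrained weight matrices
    d_new, depth_new: target width and depth chosen by the user
    Each new layer is a coefficient-weighted mixture of width-expanded old layers.
    """
    rng = rng or np.random.default_rng(0)
    d_old = layers_small[0].shape[0]
    depth_old = len(layers_small)

    # Width-expansion factors, one pair per old layer (learned in practice).
    a_out = [rng.normal(scale=0.05, size=(d_new, d_old)) for _ in range(depth_old)]
    a_in = [rng.normal(scale=0.05, size=(d_new, d_old)) for _ in range(depth_old)]
    widened = [a_out[i] @ layers_small[i] @ a_in[i].T for i in range(depth_old)]

    # Depth expansion: each new layer mixes the widened old layers
    # with per-layer coefficients (also learned in practice).
    mix = rng.normal(scale=1.0 / depth_old, size=(depth_new, depth_old))
    return [sum(mix[j, i] * widened[i] for i in range(depth_old)) for j in range(depth_new)]

small = [np.random.default_rng(i).normal(size=(256, 256)) for i in range(6)]
large = grow_width_and_depth(small, d_new=512, depth_new=12)
print(len(large), large[0].shape)  # 12 layers of shape (512, 512)
```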
Their solution outperformed all baselines, including training a new model from scratch as well as existing model-growth approaches. The method cuts the computational cost of training vision and language models by around 50%, and in many cases performance also improves. The team further found that LiGO could speed up transformer training even when no smaller pretrained model was available. They hope to apply LiGO to even larger and more complex models in the future.
Check out the Paper, Project, and Reference for more details. All credit for this research goes to the researchers of this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields, and is passionate about exploring new advances in technology and their real-life applications.