This paper was accepted at the Efficient Natural Speech and Language Processing (ENLSP) Workshop at NeurIPS 2024.
The pre-training phase of language models typically starts from randomly initialized parameters. Given current trends in model scaling, training their large number of parameters is extremely slow and costly. In contrast, small language models are less expensive to train but often cannot match the accuracy of large models. In this article, we explore an intriguing idea that connects these two regimes: can we initialize large language models from pre-trained smaller models, and does such initialization yield benefits in training time and final accuracy? We present HyperCloning, a method that expands the parameters of a pre-trained language model into those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model; as a result, the larger model inherits the predictive power and accuracy of the smaller model before training begins. We show that training a model initialized in this way yields significant savings in the GPU hours required for pre-training large language models.
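To illustrate the kind of function-preserving expansion the abstract describes, the sketch below tiles a linear layer's weight matrix so that, when the input is replicated across the wider hidden dimension, the output exactly replicates the smaller model's output. This is a minimal, hypothetical construction consistent with the description above; the exact HyperCloning procedure for full transformer blocks may differ.

```python
import torch

def expand_linear(weight: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Tile a (d_out, d_in) weight into a (factor*d_out, factor*d_in) weight.
    Dividing by `factor` keeps each output coordinate numerically identical
    when the input vector is replicated `factor` times."""
    return weight.repeat(factor, factor) / factor

# Function-preservation check on random data (illustrative only).
d_in, d_out, factor = 8, 4, 2
w = torch.randn(d_out, d_in)
x = torch.randn(d_in)

y_small = w @ x                        # output of the small layer
x_big = x.repeat(factor)               # replicated input for the wider layer
y_big = expand_linear(w, factor) @ x_big

assert torch.allclose(y_big, y_small.repeat(factor), atol=1e-6)
```

Because the expanded layer reproduces the small layer's outputs exactly, stacking such expansions preserves the network's overall function, which is what lets the larger model start training from the smaller model's accuracy rather than from random initialization.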