Transformer-based neural networks have shown a remarkable ability to handle multiple tasks such as text generation, editing, and question answering. In many cases, models with more parameters perform better, as measured by perplexity and by accuracy on downstream tasks, and this is the main driver behind the development of ever-larger models in industry. However, bigger is not always better: the 2B-parameter MiniCPM model exhibits capabilities comparable to those of larger language models such as Llama2-7B, Mistral-7B, Gemma-7B, and Llama-13B. Moreover, the amount of available high-quality data may not keep pace as the computational resources for training larger models grow.
Existing lines of work relevant to these issues include scaling laws, energy-based models, and Hopfield networks. Scaling laws describe how model performance improves as model size and training-data volume increase. Energy-based models have become established over the past few decades as a fundamental modeling tool across many areas of machine learning; the core idea is to represent the distribution a neural network learns as a parameterized probability density function defined in terms of a learnable energy function. Finally, classical Hopfield networks were developed as a canonical model of associative memory, illustrated below.
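The associative-memory view is easiest to see in a classical Hopfield network. The sketch below is a minimal, textbook illustration (Hebbian storage, a quadratic energy, asynchronous updates), not the paper's transformer construction; all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def store(patterns):
    """Hebbian learning: W = (1/n) * sum_i x_i x_i^T, with zero diagonal."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, s):
    """Classical Hopfield energy E(s) = -1/2 s^T W s (bias terms omitted)."""
    return -0.5 * s @ W @ s

def retrieve(W, s, steps=200):
    """Asynchronous +/-1 updates; each accepted flip never increases E."""
    s = s.copy()
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Store two random +/-1 patterns, then recall one from a corrupted probe.
patterns = rng.choice([-1, 1], size=(2, 16))
W = store(patterns)
probe = patterns[0].copy()
probe[:4] *= -1                                  # flip a few bits
recovered = retrieve(W, probe)
print(energy(W, probe), energy(W, recovered))    # energy should decrease
print(np.array_equal(recovered, patterns[0]))    # stored pattern recalled
```

Retrieval here is energy descent toward a stored pattern, which is the same lens the paper applies to transformer layers.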
Researchers from the Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd. introduced a theoretical framework focusing on the memorization process and performance dynamics of transformer-based language models (LMs). The researchers conducted a series of experiments using GPT-2 across different data sizes to probe signs of saturation, and in parallel trained vanilla Transformer models on a dataset of 2 million tokens. The results of these experiments validated the theoretical analysis and yielded insights into the optimal cross-entropy loss that can guide and improve decision-making in model training.
A 12-layer transformer LM is trained using the GPT-2 small tokenizer and architecture on the OpenWebText dataset, an open reproduction of the WebText dataset used to train the original GPT-2 that contains about 9B tokens from 8,013,769 documents. Three models are trained on different amounts of data: the full corpus and subsets containing the first 1% (90M tokens) and 0.1% (9M tokens) of OpenWebText. In addition, vanilla Transformer models are trained on a small amount of high-quality data consisting of pairs of English declarative sentences and their corresponding questions, generated context-free with a vocabulary of 68 words, where the task is to convert declarative sentences into questions.
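A hedged sketch of the GPT-2-small setup described above, using Hugging Face transformers/datasets; the exact training code and hyperparameters are not given in the article, and the way the 1%/0.1% subsets are sliced below is our illustrative assumption.

```python
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT-2 small: 12 layers, 12 heads, 768-dim embeddings.
config = GPT2Config(n_layer=12, n_head=12, n_embd=768,
                    vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

# OpenWebText (~9B tokens, 8,013,769 documents). Recent versions of
# `datasets` may additionally require trust_remote_code=True.
docs = load_dataset("openwebtext", split="train")

# Our assumption: take the first 1% / 0.1% of documents as the subsets.
subset_1pct = docs.select(range(len(docs) // 100))      # ~90M tokens
subset_01pct = docs.select(range(len(docs) // 1000))    # ~9M tokens

print(model.num_parameters(), len(subset_1pct), len(subset_01pct))
```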
Training on 0.1% (9M tokens) of the OpenWebText data shows overfitting, and the training loss vanishes over iterations. This happens because the training samples are well separated, so the model energy can decrease toward a sum of delta functions centered on those samples. When the model size is approximately of order O(D²) and the model is trained with 90M tokens, it achieves training and validation losses similar to those of the configuration trained on 9B tokens. Two basic 6- and 10-layer transformers are trained with a batch size of 8, and their training losses stabilize at a value of around 1, as predicted by the paper's proposition on the optimal cross-entropy loss.
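A back-of-the-envelope way to see why the training loss vanishes under memorization, written in generic energy-based-model notation rather than the paper's exact symbols (our sketch):

```latex
% Memorization as distribution collapse: with N well-separated training
% sequences x_1, ..., x_N, an over-parameterized model can drive its
% energy down on each sample, so the learned distribution approaches
% point masses on the training set,
\[
  p_\theta(x) \;=\; \frac{e^{-E_\theta(x)}}{Z(\theta)}
  \;\longrightarrow\; \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i}(x),
\]
% and on each training sequence every conditional next-token probability
% approaches 1, so the autoregressive cross-entropy loss vanishes:
\[
  \mathcal{L}(\theta)
  \;=\; -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
  \;\longrightarrow\; 0 .
\]
```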
In conclusion, the researchers presented a theoretical framework focused on the memorization process and performance dynamics of transformer-based language models. In this work, transformer-based networks are modeled using associative memory, and the relationship between cross-entropy loss and model and data sizes is characterized. Furthermore, experiments are carried out (a) using GPT-2 across different data sizes and (b) training vanilla Transformer models on a dataset of 2 million tokens. Finally, a global energy function is constructed for the layered structure of transformer models using the majorization-minimization technique.
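Majorization-minimization (MM) itself is a standard optimization scheme: at each step one minimizes a surrogate that upper-bounds the objective and touches it at the current point. The toy example below is our generic illustration of that template, not the paper's transformer energy; it uses the classic quadratic majorizer for a smooth function, whose surrogate minimizer reduces to a gradient step.

```python
def mm_minimize(grad_f, L, x0, steps=50):
    """MM with a quadratic majorizer for an L-smooth objective f:
    g(x | x_t) = f(x_t) + f'(x_t) * (x - x_t) + (L/2) * (x - x_t)**2 >= f(x),
    with equality at x_t. Minimizing g gives x_{t+1} = x_t - f'(x_t) / L,
    so f never increases from one iterate to the next."""
    x = x0
    for _ in range(steps):
        x = x - grad_f(x) / L   # exact minimizer of the surrogate g
    return x

# Toy objective f(x) = (x - 3)^2 + 1, which is L-smooth with L = 2.
grad_f = lambda x: 2.0 * (x - 3.0)
print(mm_minimize(grad_f, L=2.0, x0=10.0))   # converges to the minimum at 3.0
```

Chaining one such surrogate per layer is, at a high level, how a layered architecture can be given a single global energy that every step decreases.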
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.