Large Language Models (LLMs) have attracted enormous attention worldwide and have become immensely popular in the field of Natural Language Processing. They have enabled intelligent systems with a better, more articulate understanding of language than ever before. LLMs such as GPT-3, T5, and PaLM have delivered significantly higher performance than earlier models, and they are here to stay, doing everything from mimicking human language use to generating text and summarizing long passages. In-depth studies have shown that an LLM's capability grows with its scale: by training these models on large amounts of data, they can learn the syntax, semantics, and pragmatics of human language.
The popular ChatGPT large language model, developed by OpenAI, owes much of its growth to advanced techniques such as Reinforcement Learning from Human Feedback (RLHF). With RLHF, machine learning algorithms incorporate human input to improve model performance, and pre-trained LLMs are fine-tuned for tasks such as powering chatbots and virtual assistants. In recent years, the pre-trained base models that LLMs like ChatGPT build on have also improved significantly, mainly due to three changes.
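At a high level, RLHF first trains a reward model on human preference comparisons between model responses and then fine-tunes the LLM to maximize that learned reward, typically with a reinforcement learning algorithm such as PPO plus a penalty for drifting too far from the pre-trained model. The snippet below is a minimal, illustrative sketch of the reward-modeling step only, assuming PyTorch and a toy scoring network over fixed-size embeddings; it is not OpenAI's actual implementation.

```python
import torch
import torch.nn as nn

# Toy reward model: in practice this would be a transformer scoring a full
# prompt + response; here a small MLP over fixed-size embeddings stands in.
class RewardModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one scalar reward per example

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Hypothetical batch of human preference pairs: embeddings of the response a
# labeler preferred ("chosen") and the one they rejected.
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)

# Pairwise ranking loss: push the reward of the chosen response above the
# rejected one, i.e. -log(sigmoid(r_chosen - r_rejected)).
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
```

The subsequent fine-tuning step then optimizes the LLM to produce responses that score highly under this learned reward, which is what gives ChatGPT-style models their instruction-following behavior.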
- Model scaling has proven helpful in improving model performance. Taking the Pathways Language Model (PaLM) as an example, scaling had a large impact on its few-shot performance. Few-shot learning reduces the number of task-specific training examples required to adapt the model to a specific application. By training a 540-billion-parameter model on 6,144 TPU v4 chips using the Pathways system, PaLM demonstrated repeatable benefits from scaling, surpassing many previous models and showing substantial progress. Scaling both depth and width has therefore been a major factor in the better performance of foundation models.
- Another change has been increasing the number of tokens used during pre-training. Models such as Chinchilla have shown that large language models perform better when trained on more pre-training data. Chinchilla, a compute-optimal model with 70B parameters, was trained on four times as much data as the larger Gopher model for the same compute budget and consistently outperformed Gopher. It even performed better than LLMs such as GPT-3, Jurassic-1, and Megatron-Turing NLG. The work clearly showed that for compute-optimal training, the number of training tokens should be scaled in step with model size: if the model size doubles, the number of training tokens should double as well (a back-of-the-envelope version of this rule is sketched after this list).
- The third change is the use of clean and diverse pre-training data. This has been demonstrated by the performance of Galactica, a large language model that stores, combines, and reasons about scientific knowledge. Trained on text from a large corpus of scientific papers, Galactica outperformed models such as GPT-3 and Chinchilla on a range of scientific tasks. Another large language model, BioMedLM, a domain-specific LLM for biomedical text, showed a marked improvement in performance when trained on domain-specific data. Together, these results indicate that pre-training on domain-specific data can beat training on general-purpose data.
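To make the Chinchilla scaling rule above concrete, here is a small back-of-the-envelope calculator. It assumes the common approximations associated with the Chinchilla analysis: training compute of roughly 6·N·D FLOPs for N parameters and D tokens, and on the order of 20 training tokens per parameter at the compute-optimal point. The numbers are illustrative rather than exact.

```python
# Back-of-the-envelope Chinchilla-style scaling: assumes C ~= 6 * N * D FLOPs
# and roughly 20 training tokens per parameter at the compute-optimal point.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6*N*D approximation for dense transformer training compute."""
    return 6 * n_params * n_tokens

for n_params in (70e9, 140e9, 280e9):
    d = compute_optimal_tokens(n_params)
    print(f"{n_params/1e9:.0f}B params -> ~{d/1e12:.1f}T tokens, "
          f"~{training_flops(n_params, d):.2e} FLOPs")
```

Doubling the parameter count doubles the token budget and roughly quadruples the compute, which is why the 70B-parameter Chinchilla trained on about 1.4T tokens lands near the same compute budget as the 280B-parameter Gopher trained on far fewer tokens.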
Without a doubt, the success of LLMs is due to a combination of factors, including the use of RLHF and the developments in pre-trained base models. All three changes have greatly affected the performance of LLMs. In addition, GLaM (Generalist Language Model) has shown a massive improvement in performance by using a sparsely activated mixture-of-experts architecture to scale the model's capacity at lower training cost. These changes have paved the way for even more advanced language models that will continue to make our lives easier.
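GLaM's efficiency comes from replacing some dense feed-forward layers with mixture-of-experts (MoE) layers, where a small gating network routes each token to only its top-2 experts, so only a fraction of the total parameters are active for any given token. The following is a minimal, illustrative top-2 MoE layer in PyTorch, not GLaM's actual implementation (which also uses load-balancing losses and distributed expert placement).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Toy sparsely activated feed-forward layer with top-2 expert routing."""

    def __init__(self, d_model=256, d_hidden=1024, n_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)           # routing probabilities
        top_w, top_idx = scores.topk(2, dim=-1)            # keep 2 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalize their weights
        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 256)            # a batch of token representations
print(Top2MoELayer()(tokens).shape)      # torch.Size([16, 256])
```

Because each token passes through only two of the experts, the layer's total parameter count can grow with the number of experts while the per-token compute stays roughly constant, which is the property GLaM exploits to increase capacity at lower training cost.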
All credit for this research goes to the researchers behind these projects.
Some references and resources:
- MT-NLG: http://arxiv.org/abs/2201.11990
- Chinchilla: http://arxiv.org/abs/2203.15556
- PaLM: http://arxiv.org/abs/2204.02311
- GLaM: http://arxiv.org/abs/2112.06905
- BioMedLM: http://bit.ly/3KuE7GY
- Galactica: http://arxiv.org/abs/2211.09085
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.