Large language models (LLMs) are advancing rapidly on the back of the recent success of generative artificial intelligence. These models are driving remarkable economic and social transformations, best exemplified by OpenAI's well-known ChatGPT, which has attracted millions of users since its launch, a number that continues to grow rapidly. Built on Natural Language Processing (NLP) and Natural Language Understanding (NLU), this chatbot lets users generate meaningful, human-like text: it answers questions, summarizes long passages, completes code and emails, and more. Other LLMs, such as PaLM, Chinchilla, and BERT, have also shown strong performance in the AI domain.
Fine-tuning pretrained language models has been a popular approach for many language-related tasks. Fine-tuning allows these models to adapt to specialized domains, incorporate human feedback, and cater to individual preferences. In essence, it adjusts the parameters of an already trained LLM using a smaller, domain-specific dataset. As language models scale to more parameters, however, fine-tuning becomes computationally demanding and memory-intensive because of the gradient computation performed during backpropagation. Memory usage is significantly higher than what inference requires, since training must also cache activations, gradients, and optimizer state such as gradient history.
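As a rough, illustrative calculation (not taken from the paper), the Python sketch below estimates how much more memory a standard backpropagation-plus-Adam setup needs compared with inference for a hypothetical 13B-parameter model stored in fp16; activations are ignored for simplicity and the figures are only back-of-envelope.

```python
def rough_memory_gb(num_params, bytes_per_param=2):
    """Back-of-envelope memory estimate (activations ignored for simplicity).

    Inference: weights only.
    Backprop:  weights + gradients + two Adam moment buffers
               (the moments are commonly kept in fp32, i.e. 4 bytes each).
    """
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    adam_moments = 2 * num_params * 4          # fp32 first/second moments
    to_gb = 1024 ** 3
    return weights / to_gb, (weights + grads + adam_moments) / to_gb

# Hypothetical 13B-parameter model in fp16
infer_gb, train_gb = rough_memory_gb(13e9)
print(f"inference ~{infer_gb:.0f} GB, backprop fine-tuning ~{train_gb:.0f} GB")
```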
Recently, a team of researchers from Princeton University has proposed a solution to this memory problem. Called MeZO, a memory-efficient zeroth-order optimizer, it is an adaptation of the classical ZO-SGD method that estimates gradients using only differences in loss values and operates in place, allowing language models to be fine-tuned with the same memory footprint as inference. The team focused on zeroth-order approaches because ZO methods can estimate gradients using just two forward passes, making them memory-efficient.
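The core idea can be sketched in a few lines of PyTorch. The snippet below is a simplified illustration of an SPSA-style zeroth-order step in the spirit of MeZO: perturb all parameters in place along a random direction that is regenerated from a saved seed (so the direction never has to be stored), measure the loss at the positively and negatively perturbed parameters with two forward passes, and apply the update in place. It is a hedged sketch, not the authors' reference implementation; `loss_fn` and `batch` are assumed placeholders.

```python
import torch

def mezo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6):
    """One simplified MeZO-style step: the gradient is estimated from two
    forward passes, and the random direction z is regenerated from a seed
    so no extra copy of the parameters or of z is ever stored."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Regenerate the same random direction z from the seed and apply
        # theta <- theta + scale * eps * z, entirely in place.
        gen = torch.Generator(device="cpu").manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen).to(p.device)
            p.data.add_(scale * eps * z)

    perturb(+1)                      # theta + eps * z
    loss_plus = loss_fn(model, batch)   # forward pass 1 (loss_fn is a placeholder)
    perturb(-2)                      # theta - eps * z
    loss_minus = loss_fn(model, batch)  # forward pass 2
    perturb(+1)                      # restore the original theta

    grad_scale = (loss_plus - loss_minus) / (2 * eps)

    # Update step, reusing the same seed so z is regenerated, not stored.
    gen = torch.Generator(device="cpu").manual_seed(seed)
    for p in model.parameters():
        z = torch.randn(p.shape, generator=gen).to(p.device)
        p.data.add_(-lr * grad_scale * z)
    return loss_plus
```

Because only loss values and a seed are kept around, the memory footprint stays essentially at inference level, which is the property the Princeton team exploits.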
The MeZO algorithm has been specially designed to optimize large language models with billions of parameters. Some of the main contributions mentioned by the team are:
- MeZO was developed by adapting the ZO-SGD method and its variants to run in place on models of arbitrary size with almost no memory overhead.
- MeZO has been shown to be compatible with both full-parameter tuning and parameter-efficient fine-tuning (PEFT) techniques such as LoRA and prefix tuning.
- MeZO can optimize non-differentiable objectives such as accuracy or F1 score while still using the same amount of memory as inference (see the sketch after this list).
- Adequate pretraining ensures that MeZO's per-step optimization rate and global convergence rate depend on a condition number specific to the loss landscape, i.e., the effective local rank rather than the huge number of parameters. This contrasts with previous ZO lower bounds, which imply that the convergence rate can slow down with the number of parameters.
- Experiments spanned several model types, such as masked LMs and autoregressive LMs, model sizes ranging from 350M to 66B parameters, and downstream tasks including classification, multiple-choice, and generation.
- In experiments, MeZO outperforms zero-shot learning, in-context learning (ICL), and linear probing, and it performs better than or comparably to fine-tuning on 7 out of 11 tasks with OPT-13B, while consuming about 12× less memory than normal fine-tuning.
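To make the non-differentiable-objective point concrete, here is a hedged sketch showing how the same two-forward-pass estimator from the earlier snippet can be driven by an accuracy-style metric instead of a differentiable loss; the `evaluate_accuracy` helper and the batch format are assumptions for illustration, not part of MeZO's released code.

```python
import torch

def evaluate_accuracy(model, batch):
    # Hypothetical placeholder: run the model and compare argmax predictions
    # to labels. Accuracy has no useful gradient, but zeroth-order methods
    # only need its value, not its derivative.
    with torch.no_grad():
        logits = model(batch["inputs"])
        preds = logits.argmax(dim=-1)
        return (preds == batch["labels"]).float().mean().item()

def negative_accuracy(model, batch):
    # Minimizing negative accuracy is the same as maximizing accuracy.
    return -evaluate_accuracy(model, batch)

# The mezo_step sketch from above can now push the model toward higher
# accuracy, even though accuracy itself is not differentiable:
# loss = mezo_step(model, negative_accuracy, batch)
```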
Upon evaluation, MeZO was able to train a 30-billion-parameter model on a single 80 GB Nvidia A100 GPU, whereas backpropagation can only train a 2.7-billion-parameter LM within the same memory budget. In conclusion, MeZO is a memory-efficient zeroth-order optimizer that can effectively fine-tune large language models.
Check out the Paper and GitHub. Don't forget to join our 23k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.