Large language models (LLMs) have transformed natural language processing by exhibiting remarkable capabilities, such as emergent abilities, while model sizes continue to grow. Training models with billions of parameters, such as those in the 30B to 175B range, raises the bar for NLP research. Tuning LLMs typically requires expensive GPU resources, such as machines with 8×80GB GPUs, which makes it difficult for small labs and companies to participate in this line of research. Recently, parameter-efficient fine-tuning techniques, such as LoRA and prefix tuning, have made resource-constrained LLM tuning possible.
Although full-parameter fine-tuning is widely regarded as more powerful than parameter-efficient fine-tuning, both techniques should offer a viable option. The authors therefore investigate how to carry out full-parameter fine-tuning under limited resources. They analyze the four components of memory usage in LLM training, namely activations, optimizer states, gradient tensors, and parameters, and optimize the training process in three ways: 1) They re-examine the algorithmic role of the optimizer and find that SGD is a suitable substitute for fine-tuning the full parameters of an LLM; because SGD keeps no intermediate states, the optimizer-state portion of memory can be eliminated entirely. 2) Their proposed optimizer, LOMO, shown in Figure 1, reduces the memory usage of gradient tensors to O(1), i.e., to the footprint of the single largest gradient tensor. 3) To stabilize mixed-precision training with LOMO, they incorporate loss scaling and gradient normalization and switch certain computations to full precision during training. As a result, their method consumes memory equivalent to the parameters plus the activations plus the largest gradient tensor.
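To make the fused gradient-computation-and-update idea concrete, here is a minimal sketch in PyTorch, assuming version 2.1 or later for `register_post_accumulate_grad_hook`. It illustrates the general technique rather than the authors' released implementation, and the helper name `attach_fused_sgd_update` is purely illustrative.

```python
# Minimal sketch of fusing the SGD update into the backward pass, in the
# spirit of LOMO. Not the authors' released code; requires PyTorch >= 2.1
# for register_post_accumulate_grad_hook.
import torch
import torch.nn as nn


def attach_fused_sgd_update(model: nn.Module, lr: float) -> None:
    """Update each parameter as soon as its gradient is accumulated during
    backward, then drop the gradient, so at most one full gradient tensor
    is resident at any time and no optimizer states are kept."""
    def hook(param: torch.Tensor) -> None:
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # plain SGD step: p <- p - lr * g
        param.grad = None                      # free the gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)


# Usage: a single backward pass both computes gradients and applies updates,
# so there is no optimizer object and no optimizer.step() call.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))
attach_fused_sgd_update(model, lr=1e-2)
x, y = torch.randn(8, 512), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # parameters are already updated when this returns
```

Because each parameter is updated and its gradient discarded inside the hook, no whole-model gradient buffer and no optimizer states ever exist, which is where the O(1) gradient-memory claim comes from.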
This sharply reduces the memory consumption of full-parameter fine-tuning, bringing it down to the level of inference, on the grounds that the backward process should not require more memory than the forward process alone. Crucially, because the parameter update rule is the same as SGD's, using LOMO to save memory does not compromise the fine-tuning procedure. Researchers from Fudan University empirically evaluate LOMO's performance and memory profile and show that it makes it possible to train a 65B model with just 8 RTX 3090 GPUs. Furthermore, they use LOMO to tune the full parameters of LLMs on the SuperGLUE benchmark to validate the downstream performance of the proposed approach. The empirical results demonstrate how well LOMO works for optimizing LLMs with billions of parameters.
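As a rough back-of-the-envelope check of why a 65B model can fit on eight 24GB cards once optimizer states and whole-model gradients are removed, the estimate below uses a common mixed-precision accounting; the numbers are illustrative and are not figures reported in the paper.

```python
# Rough, illustrative memory accounting (not figures from the paper) for a
# 65B-parameter model on 8 x RTX 3090 (24 GB each), assuming fp16 weights.
params = 65e9

adamw_mixed_precision = params * (
    2      # fp16 weights
    + 2    # fp16 gradients for the whole model
    + 4    # fp32 master copy of the weights
    + 4    # fp32 first moment (momentum)
    + 4    # fp32 second moment (variance)
) / 1e9                                  # ~1040 GB before activations

lomo_style = params * 2 / 1e9            # fp16 weights only, ~130 GB; gradients
                                         # are consumed one tensor at a time and
                                         # SGD keeps no optimizer state

total_gpu_memory = 8 * 24                # 192 GB across eight RTX 3090s
print(adamw_mixed_precision, lomo_style, total_gpu_memory)
```

Activations and the single largest gradient tensor come on top of the weight memory, and 130 GB of fp16 weights still has to be spread across the eight cards, but the optimizer-state and whole-model gradient terms that dominate the AdamW budget are eliminated.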
Their main contributions are as follows:
• They offer a theoretical analysis suggesting that SGD can successfully tune the full parameters of LLMs. The obstacles that once prevented SGD from being widely used may be far less severe when optimizing LLMs.
• They propose LOMO, or LOw-Memory Optimization, which drastically reduces GPU memory usage while leaving the fine-tuning process intact; a sketch of how such an update can be kept stable in mixed precision follows this list.
• They empirically demonstrate the effectiveness of LOMO for optimizing LLMs in resource-constrained settings through a careful analysis of memory usage and performance, which is further supported by evaluations on downstream tasks.
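Below is a hedged sketch, not the authors' exact recipe, of how loss scaling and gradient handling could be folded into the fused update from the earlier sketch to keep mixed-precision training stable. The constants `LOSS_SCALE` and `MAX_GRAD_NORM`, the per-tensor clipping, and the helper name `attach_scaled_sgd_update` are all illustrative simplifications.

```python
# Hedged sketch (not the authors' exact recipe) of stabilizing the fused
# update in mixed precision: the loss is multiplied by a scale before
# backward so small fp16 gradients do not underflow, and each gradient is
# unscaled, checked for overflow, and clipped in fp32 inside the hook.
import torch
import torch.nn as nn

LOSS_SCALE = 1024.0     # illustrative static loss scale
MAX_GRAD_NORM = 1.0     # illustrative clipping threshold


def attach_scaled_sgd_update(model: nn.Module, lr: float) -> None:
    def hook(param: torch.Tensor) -> None:
        grad = param.grad.float() / LOSS_SCALE            # unscale in full precision
        if torch.isfinite(grad).all():                    # skip the step on overflow
            norm = grad.norm()
            if norm > MAX_GRAD_NORM:                      # per-tensor clipping
                grad.mul_(MAX_GRAD_NORM / (norm + 1e-6))  # (a simplification)
            with torch.no_grad():
                param.add_(grad.to(param.dtype), alpha=-lr)
        param.grad = None                                 # free the gradient either way

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)


# Usage: scale the loss before calling backward; the hook undoes the scaling.
# (In real mixed-precision training the model would hold fp16 weights.)
model = nn.Linear(512, 2)
attach_scaled_sgd_update(model, lr=1e-2)
loss = model(torch.randn(8, 512)).pow(2).mean()
(loss * LOSS_SCALE).backward()
```

Skipping the step when the unscaled gradient contains inf or NaN mirrors standard dynamic loss-scaling practice; a production setup would also grow or shrink the scale over time.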
The code implementation is available on GitHub.
Check out the Paper and the GitHub repository for more details.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.