Given the high upfront cost of training a language model, any non-trivial improvement to the optimization process would drastically reduce the time and money required to complete training. Adam and its variants have long been state of the art, while second-order optimizers (those based on the Hessian) have rarely been used because of their higher per-step overhead.
The researchers propose Sophia (Second-order Clipped Stochastic Optimization), a novel optimizer that uses a lightweight estimate of the diagonal Hessian as a preconditioner and can train LLMs roughly twice as fast as Adam. The update is obtained by dividing a moving average of the gradients by a moving average of the estimated Hessian, followed by element-wise clipping. Clipping bounds the size of the worst-case update and mitigates the effects of non-convexity and rapid Hessian changes along the trajectory. Adding a few lines of code could bring a $2 million training budget down to the $1 million range (assuming scaling laws hold).
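To make the update rule concrete, here is a minimal, illustrative sketch of a Sophia-style step in PyTorch. It is not the official implementation: the function name, hyperparameter values, and the exact clipping constants are placeholders chosen for readability; it only shows the structure described above (moving-average gradient, division by a diagonal Hessian estimate, element-wise clipping).

```python
import torch

def sophia_style_step(param, grad, exp_avg, hess_ema, lr=1e-4,
                      beta1=0.96, rho=0.04, eps=1e-12):
    """Illustrative Sophia-style update (simplified; not the official code).

    exp_avg  : moving average of gradients (same shape as param)
    hess_ema : moving average of the diagonal Hessian estimate (same shape as param)
    """
    with torch.no_grad():
        # Moving average of the gradient
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        # Precondition by the diagonal Hessian estimate, then clip element-wise
        # so that no single coordinate takes an excessively large step
        update = exp_avg / torch.clamp(rho * hess_ema, min=eps)
        update.clamp_(-1.0, 1.0)
        param.add_(update, alpha=-lr)
```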
Because Sophia estimates the diagonal Hessian only every few iterations, its average per-step time and memory overhead are low. On language modeling with GPT-2 models ranging from 125 million to 770 million parameters, Sophia is twice as fast as Adam in terms of number of steps, total compute, and wall-clock time. Theoretically, the researchers show that Sophia adapts to large variations in curvature across parameter dimensions in language modeling tasks, and that its runtime bound does not depend on the local condition number of the loss.
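The periodic Hessian refresh is what keeps the overhead low. The sketch below (our own simplification, with made-up names and constants) shows one way to maintain such an estimate with a Hutchinson-style probe; the paper also discusses a cheaper Gauss-Newton-Bartlett estimator for language models.

```python
import torch

def hutchinson_diag_hessian(loss, param, hess_ema, beta2=0.99):
    """Diagonal Hessian estimate kept as a moving average (illustrative only)."""
    # Gradient with the graph kept so we can differentiate through it again
    grad, = torch.autograd.grad(loss, param, create_graph=True)
    u = torch.randn_like(param)                          # random probe vector
    hvp, = torch.autograd.grad((grad * u).sum(), param)  # Hessian-vector product H u
    # In expectation, u * (H u) equals diag(H); keep a moving average of it
    hess_ema.mul_(beta2).add_(u * hvp, alpha=1 - beta2)

# In the training loop, the estimate is refreshed only every k steps, e.g.:
#   if step % k == 0:
#       hutchinson_diag_hessian(loss, param, hess_ema)
```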
Key features
- Sophia is easy to implement in PyTorch: it only requires a lightweight estimate of the diagonal Hessian as a preconditioner on the gradient (see the pseudocode in the first image), followed by element-wise clipping.
- Sophia also helps with pre-training stability. Gradient clipping is triggered far less often than with Adam and Lion, and the reparameterization trick in which the attention temperature varies with the layer index is not needed.
- Sophia ensures consistent loss reduction across all parameter dimensions by penalizing updates more in sharp dimensions (large Hessian) than in flat dimensions (small Hessian); in a two-dimensional toy example, Adam converges more slowly (see the sketch after this list).
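To make the sharp-versus-flat intuition concrete, here is a small toy sketch of our own (not from the paper) on a 2D quadratic whose curvature differs by a factor of 100 between the two coordinates. With Hessian preconditioning and clipping, both coordinates shrink at the same rate:

```python
import torch

# Toy quadratic L(x, y) = 50 * x**2 + 0.5 * y**2, so diag(H) = [100, 1]:
# one sharp dimension and one flat dimension.
theta = torch.tensor([1.0, 1.0])
hess_diag = torch.tensor([100.0, 1.0])
lr, rho = 0.1, 1.0

for _ in range(20):
    grad = hess_diag * theta                            # exact gradient of the quadratic
    update = torch.clamp(grad / hess_diag, -rho, rho)   # precondition, then clip
    theta = theta - lr * update

print(theta)  # both coordinates have decayed at the same rate toward (0, 0)
```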
Important aspects of this work
- It demonstrates that even with limited resources, academics can study LLM pre-training and develop novel, effective algorithms.
- In addition to revisiting material from classical optimization courses, the researchers made extensive use of theoretical reasoning throughout the research process.
In the code scheduled for release tomorrow, the researchers use a slightly modified version of the learning-rate definition given in the paper. The paper's definition is neater to write down, but the modified version may be better suited to code.
Check out the Paper.
Dhanshree Shenwai is a Computer Engineer with solid experience at FinTech companies spanning the Finance, Cards & Payments, and Banking domains, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's changing world.