Pretrained language models are commonly aligned with human intent and downstream tasks via fine-tuning. The fine-tuning process involves supervised fine-tuning (SFT), using labeled samples, and/or reinforcement learning-based fine-tuning (RFT) via policy gradient methods, using a (possibly learned) reward function. This work highlights an overlooked optimization hurdle in RFT: we show that the expected gradient for an input sample (i.e., a prompt) vanishes when its reward standard deviation under the model is small, regardless of whether its reward mean is near optimal. We then demonstrate the prevalence and detrimental effects of vanishing gradients due to small reward standard deviation on an RFT benchmark for language models. In particular, in datasets where samples with small reward standard deviation under the pretrained model are more frequent, RFT achieves a lower reward relative to SFT. Controlled experiments and a theoretical analysis further establish that, even in simplified settings, vanishing gradients in RFT can lead to extremely slow convergence. Finally, we explore ways to overcome vanishing gradients in RFT of language models. We find the common practice of an initial SFT phase to be the most promising candidate, shedding light on its importance in an RFT pipeline. Moreover, our experiments reveal that a relatively small number of SFT optimization steps on a small number of labeled samples suffices, implying that the initial SFT phase need not be costly in terms of compute and labeled data.
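To make the vanishing-gradient claim concrete, the following is a minimal sketch of the underlying argument for a single input, assuming the standard policy gradient (REINFORCE) form of the RFT objective and a bounded score function; the notation ($V_\theta$, $p_\theta$, $r$, $b$, $C$) and the boundedness assumption are introduced here for illustration and need not match the formal analysis exactly. For an input $x$, RFT maximizes the expected reward $V_\theta(x) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[r(x, y)]$, whose policy gradient (for any constant baseline $b$) is
\[
\nabla_\theta V_\theta(x) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[(r(x, y) - b)\, \nabla_\theta \ln p_\theta(y \mid x)\big].
\]
Taking $b = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[r(x, y)]$ and assuming $\|\nabla_\theta \ln p_\theta(y \mid x)\| \le C$ for all $y$, the Cauchy–Schwarz inequality gives
\[
\|\nabla_\theta V_\theta(x)\| \;\le\; C \cdot \sqrt{\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[(r(x, y) - b)^2\big]} \;=\; C \cdot \mathrm{std}_{y \sim p_\theta(\cdot \mid x)}\big(r(x, y)\big).
\]
Hence, when the reward standard deviation under the model is small for $x$, the expected gradient contributed by $x$ is small as well, irrespective of how far the reward mean is from its optimal value.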