Pretrained language models are commonly aligned with human intent and downstream tasks via fine-tuning. The fine-tuning process involves supervised fine-tuning (SFT), using labeled samples, and/or reinforcement learning-based fine-tuning (RFT) via policy gradient methods, using a (possibly learned) reward function. This work highlights an overlooked optimization hurdle in RFT: we show that the expected gradient for an input sample (i.e., a prompt) vanishes when its reward standard deviation under the model is small, regardless of whether its reward mean is near optimal. We then demonstrate the prevalence and detrimental effects of vanishing gradients due to small reward standard deviation on an RFT benchmark for language models. In particular, in datasets where samples with small reward standard deviation under the pretrained model are more frequent, RFT achieves a lower reward relative to SFT. Controlled experiments and a theoretical analysis further establish that, even in simplified settings, vanishing gradients in RFT can lead to extremely slow convergence. Finally, we explore ways to overcome vanishing gradients in RFT of language models. We find the common practice of an initial SFT phase to be the most promising candidate, shedding light on its importance in an RFT pipeline. Moreover, our experiments reveal that a relatively small number of SFT optimization steps on a small number of labeled samples suffices, implying that the initial SFT phase need not be costly in terms of compute and labeled data.
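To make the vanishing-gradient claim concrete, the following is a minimal sketch of the underlying argument for a single input, assuming the standard policy gradient (REINFORCE) form of the RFT objective and a bounded score function; the notation ($V_\theta$, $p_\theta$, $r$, $b$, $C$) and the boundedness assumption are introduced here for illustration and need not match the formal analysis exactly. For an input $x$, RFT maximizes the expected reward $V_\theta(x) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[r(x, y)]$, whose policy gradient (for any constant baseline $b$) is
\[
\nabla_\theta V_\theta(x) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[(r(x, y) - b)\, \nabla_\theta \ln p_\theta(y \mid x)\big].
\]
Taking $b = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[r(x, y)]$ and assuming $\|\nabla_\theta \ln p_\theta(y \mid x)\| \le C$ for all $y$, the Cauchy–Schwarz inequality gives
\[
\|\nabla_\theta V_\theta(x)\| \;\le\; C \cdot \sqrt{\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[(r(x, y) - b)^2\big]} \;=\; C \cdot \mathrm{std}_{y \sim p_\theta(\cdot \mid x)}\big(r(x, y)\big).
\]
Hence, when the reward standard deviation under the model is small for $x$, the expected gradient contributed by $x$ is small as well, irrespective of how far the reward mean is from its optimal value.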