Language models (LMs) are trained to reflect a broad collection of voices, which leads to outputs that match no single perspective particularly well. Generic responses can be avoided by adapting an LLM with supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), but these methods require large datasets, making them impractical for new and specific tasks. Moreover, there is often a mismatch between the general-purpose style an LLM acquires through instruction tuning and the specific preferences a given application demands. This mismatch leaves LLM outputs sounding generic and lacking a distinctive voice.
Various methods have been developed to address these challenges. One approach combines prompting with preference fine-tuning: LLMs trained on huge datasets can perform well when given carefully designed prompts. However, prompt design is difficult and brittle to small variations, so in practice these models often still need to be fine-tuned on large preference datasets with RLHF. Another strategy is self-improvement, in which LLMs are refined through iterative sampling; methods like STaR, for example, supervise themselves by verifying the correctness of their own outputs. Finally, online imitation learning can improve a policy beyond the performance of the demonstrator, but existing approaches require learning an explicit reward function and are not directly applicable to LLMs.
Researchers at Stanford University have introduced Demonstration ITerated Task Optimization (DITTO), a method that aligns language model outputs directly with a user's demonstrated behaviors. Drawing on ideas from online imitation learning, DITTO generates comparison data online at low cost: it ranks the user's demonstrations above outputs from the LLM and its intermediate checkpoints, as in the sketch below. In win rates, DITTO outperforms few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19 percentage points, offering a novel way to customize LLMs effectively using direct feedback from demonstrations.
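At its core, this comparison data is built by treating each user demonstration as the preferred completion and the model's own samples as the rejected ones. The following is a minimal sketch of that pairing step; the `PreferencePair` format and the `generate` callable are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the user's demonstration
    rejected: str  # a completion sampled from the model

def build_pairs(demos, generate, samples_per_demo=4):
    """Rank each expert demonstration above samples from the model.

    `demos` is a list of (prompt, demonstration) tuples; `generate(prompt)`
    is a hypothetical stand-in for whatever sampling interface the LLM
    (or one of its checkpoints) exposes.
    """
    pairs = []
    for prompt, demonstration in demos:
        for _ in range(samples_per_demo):
            pairs.append(PreferencePair(
                prompt=prompt,
                chosen=demonstration,
                rejected=generate(prompt),
            ))
    return pairs
```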
DITTO can learn fine-grained style and task alignment in domains such as news articles, emails, and blog posts. It is an iterative process with three components: (a) supervised fine-tuning is run on the set of expert demonstrations for a limited number of gradient steps; (b) a new comparison dataset is constructed during training by sampling completions from each checkpoint and ranking the demonstrations above them; and (c) RLHF is used to update the policy, specifically on batches sampled through the process above. A schematic rendering of this loop follows.
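The sketch below puts the three components together under stated assumptions: `sft_step`, `sample_completion`, and `preference_update` are hypothetical callables standing in for standard supervised fine-tuning, sampling, and a DPO-style preference-optimization update. It is a schematic of the described procedure, not the authors' code.

```python
import copy

def ditto_loop(policy, demos, sft_step, sample_completion, preference_update,
               iterations=5, sft_steps=40):
    """Schematic DITTO-style loop: brief SFT, sample comparisons, update."""
    checkpoints = []
    for _ in range(iterations):
        # (a) a limited number of supervised gradient steps on the demos
        for _ in range(sft_steps):
            sft_step(policy, demos)
        checkpoints.append(copy.deepcopy(policy))

        # (b) build comparisons: each demonstration is ranked above
        # completions sampled from the current policy and its checkpoints
        comparisons = []
        for prompt, demo in demos:
            for ckpt in checkpoints:
                # (prompt, chosen, rejected) triple
                comparisons.append((prompt, demo, sample_completion(ckpt, prompt)))

        # (c) update the policy on batches drawn from these comparisons
        preference_update(policy, comparisons)
    return policy
```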
DITTO's outputs are evaluated with GPT-4 as the judge and averaged across all authors: it outperforms all baselines with an average win rate of 77.09% across CMCC (71.67%) and CCAT50 (82.50%). It provides an average win-rate improvement of 11.7% over SFT, which is itself a strong baseline (56.78% on CMCC, 73.89% on CCAT50). In the user study, DITTO again outperforms the reference methods: DITTO (72.1% win rate) > SFT (60.1%) > few-shot prompting (48.1%) > self-prompting (44.2%) > zero-shot (25.0%). Notably, self-prompting performs slightly worse than few-shot prompting and well below DITTO.
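These headline numbers are pairwise win rates: a judge model compares a method's output against a baseline's output on the same prompt, and the win rate is the fraction of comparisons the method wins. Below is a minimal tallying sketch; the `judge` callable stands in for a GPT-4-style comparison, and counting a tie as half a win is a common convention assumed here, not taken from the paper.

```python
def win_rate(pairs, judge):
    """Fraction of head-to-head comparisons that method A wins.

    `pairs` is a list of (prompt, output_a, output_b) triples; `judge`
    is a hypothetical pairwise comparator returning "a", "b", or "tie".
    """
    wins = 0.0
    for prompt, a, b in pairs:
        verdict = judge(prompt, a, b)
        if verdict == "a":
            wins += 1.0
        elif verdict == "tie":
            wins += 0.5  # assumed convention: a tie counts as half a win
    return wins / len(pairs)
```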
In conclusion, researchers at Stanford University have introduced Demonstration ITerated Task Optimization (DITTO), a method that aligns language model outputs directly with demonstrated user behaviors and generates online comparison data from those demonstrations. The researchers highlight the value of using demonstrations as feedback, showing that even a small number of demonstrated behaviors provides a strong signal of an individual's specific preferences. However, they did not test other model sizes due to computational cost, and further analysis is needed on the kinds of preference data required, leaving these questions to future work.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.