When an error or misunderstanding arises, modern LLMs can theoretically reflect and refine their responses because they are interactive systems capable of interacting with users in multiple turns.
Previous research has shown that LLMs can improve their responses by using additional conversational context, such as chain-of-thought reasoning. However, LLMs designed to maximize human preferences can exhibit sycophantic behavior, meaning they will give answers that match what the user believes is correct, even if that perspective is not correct.
New ai research from Salesforce features a multi-turn interaction between a simulated user and an LLM that focuses on a classification task like the FlipFlop experiment. The LLM performs a classification task in response to a user request in the initial turn of the discussion. The LLM then decides whether to affirm or reverse his response in the second turn in response to an utterance from the challenger (such as “Are you sure?”) that questions his response.
The team systematically evaluates the accuracy of initial and final predictions in classification tasks, providing a rigorous context for studying model behavior. LLM GPT-4, Claude V2 and PaLM-Bison are asked to answer a multiple choice question. Two of the models first generate the correct solution. To respond to the challenge, two models (GPT-4 and Claude V2) change their responses in the second turn, while PaLM-Bison maintains its original response. All three models show a decrease in performance, with reductions ranging from -8% (GPT-4) to 34% (Claude V2), when the results are aggregated over an evaluation set with 100 samples.
They measured the propensity of LLMs to reverse their initial predictions when confronted, often resulting in decreased accuracy, through conversational simulations focused on classification tasks. Based on extensive analysis of 10 LLMs and seven tasks, the models exhibited consistent fawning behavior, resulting in an average of 46% response change and a 17% decrease in accuracy. According to the results, the model, the job, and the precise language of the challenger's cue determine the degree of the FlipFlop effect. While some models perform better than others, the results show a lot of room to grow in creating models that can have honest, multi-turn conversations without losing task accuracy. Future research that aims to improve models' conversational skills and systematically evaluate fawning behavior quantitatively can use the FlipFlop experiment as a solid foundation.
The researchers are also investigating whether tuning a linear learning model (LLM) on synthetically generated FlipFlop conversations can improve the model's behavior. They find that an adjusted Mistral7b can reduce the observed fawning behavior by 50% compared to the base model, indicating that the adjustment can help reduce, but not eliminate, the FlipFlop effect. Since the FlipFlop experiment provides a solid foundation for studying and quantifying the sycophantic behavior of LLMs, the team intends to make their code and data freely available so that everyone can work toward the same goal of creating more trustworthy LLMs. .
The researchers emphasize that there is no exhaustive list of the tasks and challenger statements that were part of the experiment. Although the FlipFlop experiment mimics multi-turn discussions, the interactions are still artificial and do not differ much from each other. They do not expect their results and their relative importance to be immediately applicable in a more realistic setting. Its evaluation focuses on measures that evaluate response change and worsening performance. However, different use cases may highlight different parts of the model's responses. For example, it was beyond the scope of their experiment to measure the relative politeness, conciseness, or coherence of responses, even if these factors might be essential. They also focused on classification problems for experiments because they offer well-established metrics and simple formulations to measure the effectiveness of model responses. The evaluation of fawning behavior in open domain generation tasks, where LLMs are often employed, is an essential but unexplored area.
Review the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and Google news. Join our 38k+ ML SubReddit, 41k+ Facebook community, Discord Channeland LinkedIn Grabove.
If you like our work, you will love our Newsletter..
Don't forget to join our Telegram channel
You may also like our FREE ai Courses….
Dhanshree Shenwai is a Computer Science Engineer and has good experience in FinTech companies spanning Finance, Cards & Payments and Banking with a keen interest in ai applications. He is excited to explore new technologies and advancements in today's evolving world that makes life easier for everyone.
<!– ai CONTENT END 2 –>