Evaluating conversational AI assistants, such as GitHub Copilot Chat, is challenging because they rely on large language models and chat-based interfaces. Existing conversational-quality metrics do not transfer directly to domain-specific dialogues, making it difficult for software developers to assess how effective these tools are. Techniques such as SPUR use large language models to analyze user satisfaction, but they can miss domain-specific nuances. The study focuses on automatically generating high-quality, task-aware rubrics for evaluating task-oriented conversational AI assistants, emphasizing conversational context and task progression to improve assessment accuracy.
Microsoft researchers present RUBICON, a technique for evaluating domain-specific human-AI conversations using large language models. RUBICON generates candidate rubrics for assessing conversation quality and selects the best-performing ones. It builds on SPUR by incorporating domain-specific cues and Gricean maxims, producing a rubric set that is refined iteratively. RUBICON was tested on 100 conversations between developers and a chat-based assistant for C# debugging, using GPT-4 for both rubric generation and evaluation. It outperformed alternative rubric sets, achieving high accuracy in predicting conversation quality, and ablation studies demonstrated the contribution of each of its components.
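To make the rubric-generation step concrete, here is a minimal sketch of how an LLM such as GPT-4 might be prompted to propose candidate SAT/DSAT rubrics from labeled conversations. This is an illustration under assumptions, not the authors' prompts; `call_gpt4` is a hypothetical wrapper around a chat-completion API.

```python
# Illustrative sketch of supervised rubric generation (assumed, not the paper's code).
# An LLM reads labeled debugging conversations and proposes candidate rubrics as
# natural-language statements. `call_gpt4` is a hypothetical LLM wrapper.

def make_rubric_prompt(conversations_with_labels, polarity="SAT"):
    header = (
        f"You will read developer-assistant debugging conversations labeled as "
        f"satisfactory or unsatisfactory. Propose 5 concise {polarity} rubrics: "
        f"natural-language statements describing attributes of "
        f"{'good' if polarity == 'SAT' else 'bad'} conversations. "
        f"Ground each rubric in domain-specific cues (C# debugging, task progress) "
        f"and conversational principles (e.g., Gricean maxims)."
    )
    body = "\n\n".join(
        f"[{label}]\n{text}" for text, label in conversations_with_labels
    )
    return header + "\n\n" + body

def generate_candidate_rubrics(conversations_with_labels, call_gpt4, polarity="SAT"):
    """Return a list of candidate rubric strings parsed from the LLM's reply."""
    reply = call_gpt4(make_rubric_prompt(conversations_with_labels, polarity))
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]
```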
Natural language conversations are fundamental to modern AI applications, but traditional NLP metrics such as BLEU and perplexity are inadequate for evaluating long-form conversations, especially those driven by LLMs. User satisfaction has been a key metric, but manual analysis is resource-intensive and raises privacy concerns. Recent approaches use language models to assess conversation quality through natural language statements that capture themes of user engagement and experience. Techniques such as SPUR generate rubrics for open-domain conversations but lack domain-specific context. This study takes a holistic approach, integrating user expectations and interaction progress, and explores optimal prompt selection using bandit methods to improve assessment accuracy.
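As a rough illustration of the bandit idea mentioned above, the sketch below uses a generic UCB-style loop to pick the best-performing candidate among several prompt or rubric sets. This is not the authors' algorithm; `evaluate_on_batch` is a hypothetical reward function (for example, labeling accuracy of a candidate on a batch of conversations).

```python
# Generic UCB bandit sketch for selecting among candidate prompts / rubric sets.
# Assumptions: rewards lie in [0, 1] and `evaluate_on_batch(candidate)` returns one.

import math

def ucb_select(candidates, evaluate_on_batch, rounds=100, c=1.0):
    counts = [0] * len(candidates)
    totals = [0.0] * len(candidates)
    for t in range(1, rounds + 1):
        # Pick the arm with the highest upper confidence bound (untried arms first).
        ucb = [
            float("inf") if counts[i] == 0
            else totals[i] / counts[i] + c * math.sqrt(math.log(t) / counts[i])
            for i in range(len(candidates))
        ]
        arm = ucb.index(max(ucb))
        reward = evaluate_on_batch(candidates[arm])  # e.g., accuracy in [0, 1]
        counts[arm] += 1
        totals[arm] += reward
    best = max(range(len(candidates)),
               key=lambda i: totals[i] / counts[i] if counts[i] else 0.0)
    return candidates[best]
```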
RUBICON estimates conversation quality for domain-specific assistants by learning rubrics for Satisfaction (SAT) and Dissatisfaction (DSAT) from labeled conversations. It involves three steps: generating diverse candidate rubrics, selecting an optimized rubric subset, and scoring conversations. Rubrics are natural language statements that capture conversation attributes. Conversations are rated against each rubric on a 5-point Likert scale, normalized to the range (0, 10). Rubric generation relies on supervised extraction and summarization, while selection optimizes the rubric set for accuracy and coverage. Precision and sharpness losses guide the choice of the final subset, ensuring effective and accurate assessment of conversation quality.
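The scoring step can be pictured as follows. This is a minimal sketch based on the description above, not the authors' implementation: each rubric is rated on a 1-5 Likert scale by an LLM, mapped onto (0, 10), and the SAT and DSAT averages are combined into a single NetSAT-style score. The helper `llm_rate` and the exact aggregation are assumptions.

```python
# Minimal scoring sketch (assumed aggregation, not the paper's exact formula).
from typing import Callable, List

def normalize_likert(score: int) -> float:
    """Map a 1-5 Likert rating onto the (0, 10) range used in the paper."""
    return (score - 1) * 10.0 / 4.0

def net_sat(conversation: str,
            sat_rubrics: List[str],
            dsat_rubrics: List[str],
            llm_rate: Callable[[str, str], int]) -> float:
    """Score a conversation as mean(SAT) - mean(DSAT) over normalized rubric ratings.

    `llm_rate(conversation, rubric)` is a hypothetical helper that asks an LLM
    (e.g., GPT-4) how strongly the rubric applies, returning an int in 1..5.
    """
    sat = [normalize_likert(llm_rate(conversation, r)) for r in sat_rubrics]
    dsat = [normalize_likert(llm_rate(conversation, r)) for r in dsat_rubrics]
    sat_mean = sum(sat) / len(sat) if sat else 0.0
    dsat_mean = sum(dsat) / len(dsat) if dsat else 0.0
    return sat_mean - dsat_mean
```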
The evaluation of RUBICON addresses three key questions: its effectiveness compared to other methods, the impact of Domain Awareness (DS) and Conversation Design Principles (CDP), and the performance of its selection policy. The conversation data, obtained from a C# Debugger Copilot wizard, was filtered and annotated by experienced developers, resulting in a 50:50 training-test split. Metrics including accuracy, precision, recall, F1-score, ΔNetSAT, and throughput rate were measured. The results show that RUBICON outperforms baselines both in separating positive from negative conversations and in classifying conversations with high accuracy, highlighting the importance of the DS and CDP instructions.
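For readers who want to reproduce the kind of metrics listed above, the following sketch computes accuracy, precision, recall, and F1 against human labels, plus a ΔNetSAT separation between positive and negative conversation sets. It is an illustration of the standard definitions, not the paper's evaluation harness.

```python
# Standard classification metrics plus a NetSAT separation gap (illustrative only).
from statistics import mean

def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

def delta_net_sat(scores_positive, scores_negative):
    """Gap between mean NetSAT of human-labeled good vs. bad conversations."""
    return mean(scores_positive) - mean(scores_negative)
```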
Internal validity is threatened by the subjective nature of the manually assigned ground-truth labels, despite high inter-rater agreement. External validity is limited by the dataset's lack of diversity: it covers only C# debugging tasks at a single software company, which may limit generalizability to other domains. Construct validity concerns include the reliance on an automated scoring system and the assumptions made when mapping Likert-scale responses to the (0, 10) range. Future work will examine alternative ways of calculating the NetSAT score. Overall, RUBICON succeeds in improving rubric quality and differentiating conversation effectiveness, proving valuable in real-world use.
Review the Paper and Details. All credit for this research goes to the researchers of this project.
Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.