P&F's data science team faces a challenge: they must weigh each expert's opinion equally, but they can't satisfy everyone. Instead of focusing on subjective expert opinions, they decide to evaluate the chatbot against historical customer questions. Experts no longer need to invent questions to test the chatbot, which brings the evaluation closer to real-world conditions. After all, the initial reason for involving experts was that they understand real customer questions better than P&F's data science team does.
It turns out that the most frequently asked questions at P&F concern the technical instructions for the clips: customers want to know the detailed technical specifications of each clip. Because P&F offers thousands of different clip types, customer service takes a long time to answer these questions.
Applying the idea of test-driven development, the data science team creates a dataset from the conversation history that contains each customer question and the corresponding customer service response:
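A minimal sketch of what such a dataset could look like, using pandas; the column names and example rows are illustrative, not P&F's actual data:

```python
import pandas as pd

# Illustrative evaluation dataset built from conversation history;
# the questions and answers shown here are made-up examples.
eval_df = pd.DataFrame(
    {
        "Customer Question": [
            "What is the maximum load of clip model C-104?",
            "Which material is clip B-220 made of?",
        ],
        "Customer Service Response": [
            "Clip C-104 supports loads up to 12 kg.",
            "Clip B-220 is made of glass-fiber reinforced nylon.",
        ],
    }
)

eval_df.to_csv("chatbot_eval_dataset.csv", index=False)
```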
With this dataset of questions and answers, P&F can evaluate the chatbot's performance retrospectively. They add a new column, “Chatbot Response,” and store the chatbot's responses to the historical questions in it.
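One way to fill that column is to replay every historical question against the chatbot. In the sketch below, `ask_chatbot` is a simplified stand-in that makes a single GPT-4 call; P&F's real chatbot, which answers from the clip documentation, would sit behind this function instead:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
eval_df = pd.read_csv("chatbot_eval_dataset.csv")

def ask_chatbot(question: str) -> str:
    # Simplified stand-in for P&F's chatbot: a single GPT-4 call.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You answer customer questions about P&F clips."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Replay every historical question and store the chatbot's answer.
eval_df["Chatbot Response"] = eval_df["Customer Question"].apply(ask_chatbot)
eval_df.to_csv("chatbot_eval_dataset.csv", index=False)
```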
We can have experts and GPT-4 evaluate the quality of the chatbot's responses. The ultimate goal is to automate the chatbot accuracy assessment using GPT-4. This is possible if experts and GPT-4 evaluate responses similarly.
The experts create a new Excel sheet containing each expert's evaluation, and the data science team adds the GPT-4 evaluation to it.
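A sketch of how the automated GPT-4 evaluation could look, assuming the OpenAI Python SDK; the prompt wording and the 'correct'/'incorrect' labels are illustrative choices, not P&F's actual setup:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
eval_df = pd.read_csv("chatbot_eval_dataset.csv")

EVALUATION_PROMPT = (
    "You evaluate a support chatbot. Compare the chatbot's answer to the "
    "reference customer service answer and reply only with 'correct' or 'incorrect'."
)

def evaluate_with_gpt4(question: str, reference: str, chatbot_response: str) -> str:
    # Ask GPT-4 to judge one chatbot answer against the reference answer.
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": EVALUATION_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Reference answer: {reference}\n"
                    f"Chatbot answer: {chatbot_response}"
                ),
            },
        ],
        temperature=0,
    )
    return result.choices[0].message.content.strip().lower()

eval_df["GPT-4 Evaluation"] = eval_df.apply(
    lambda row: evaluate_with_gpt4(
        row["Customer Question"],
        row["Customer Service Response"],
        row["Chatbot Response"],
    ),
    axis=1,
)
eval_df.to_csv("chatbot_eval_dataset.csv", index=False)
```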
There are conflicts in how different experts evaluate the same chatbot responses. GPT-4's evaluations, however, are close to the experts' majority vote, indicating that we could perform automatic evaluations with GPT-4. Still, each expert's opinion is valuable, and it is important to resolve the conflicting evaluation preferences among the experts.
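To check this, the team could measure how often GPT-4 agrees with the experts' majority vote. The sketch below assumes the Excel sheet has one 'correct'/'incorrect' column per expert plus the GPT-4 column; the file and column names are hypothetical:

```python
import pandas as pd

# Assumed layout: one label column per expert plus the GPT-4 column.
sheet = pd.read_excel("expert_evaluations.xlsx")
expert_cols = ["Expert A", "Expert B", "Expert C"]

# Majority vote across the experts for each chatbot response.
sheet["Expert Majority"] = sheet[expert_cols].mode(axis=1)[0]

# Fraction of rows where GPT-4 agrees with the expert majority.
agreement = (sheet["Expert Majority"] == sheet["GPT-4 Evaluation"]).mean()
print(f"GPT-4 vs. expert majority agreement: {agreement:.0%}")
```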
P&F organizes a workshop with the experts to create gold standard answers for the historical questions dataset and assessment best-practice guidelines on which all experts agree.
Using insights from the workshop, the data science team can create a more detailed evaluation message for GPT-4 that covers edge cases (e.g., “the chatbot should not request to raise support tickets”). The experts can now use their time to improve the clip documentation and define best practices instead of performing laborious chatbot evaluations.
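A sketch of what such a more detailed evaluation message might look like; only the support-ticket rule comes from the text above, the other rules are assumed examples of guidelines the experts might agree on in the workshop:

```python
# Detailed evaluation message built from the workshop guidelines (illustrative).
DETAILED_EVALUATION_PROMPT = """\
You evaluate a support chatbot that answers technical questions about P&F clips.
Compare the chatbot's answer to the gold standard answer and reply only with
'correct' or 'incorrect'. Apply these rules:
- Technical specifications (load, material, dimensions) must match the gold standard answer.
- The chatbot should not request to raise support tickets.
- Minor wording differences are acceptable as long as the facts are identical.
"""
```

Passing this text as the system message in the earlier evaluation sketch leaves the rest of the workflow unchanged.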
By measuring the chatbot's percentage of correct responses, P&F can decide whether to deploy the chatbot to the support channel. They approve the accuracy and deploy the chatbot.
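The decision itself can be a simple threshold check. The sketch below continues from the earlier evaluation sketch; the 90% threshold is an assumed figure, not P&F's actual criterion:

```python
import pandas as pd

eval_df = pd.read_csv("chatbot_eval_dataset.csv")

# Share of responses GPT-4 marked as correct.
accuracy = (eval_df["GPT-4 Evaluation"] == "correct").mean()
print(f"Chatbot accuracy: {accuracy:.0%}")

DEPLOYMENT_THRESHOLD = 0.90  # assumed acceptance criterion
if accuracy >= DEPLOYMENT_THRESHOLD:
    print("Accuracy approved: deploy the chatbot to the support channel.")
else:
    print("Accuracy below threshold: keep improving the chatbot.")
```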
Finally, it's time to save all the chatbot responses and measure how well the chatbot resolves real customer queries. Because customers can respond directly to the chatbot, it is also important to record their responses to understand their sentiment.
The same evaluation workflow can be used to measure the chatbot's success objectively, even without reference answers. But now customers receive their initial response from the chatbot, and we don't know whether they are satisfied with it. We should therefore investigate how customers react to the chatbot's responses. We can automatically detect negative sentiment in customer replies and assign customer service specialists to handle angry customers.
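A sketch of such a sentiment-based routing step, again assuming GPT-4 as the classifier; the labels and the routing logic are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def classify_sentiment(customer_reply: str) -> str:
    # Ask GPT-4 to label the customer's reply (labels are illustrative).
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the sentiment of the customer's reply as "
                    "'positive', 'neutral' or 'negative'. Answer with one word."
                ),
            },
            {"role": "user", "content": customer_reply},
        ],
        temperature=0,
    )
    return result.choices[0].message.content.strip().lower()

def route_conversation(customer_reply: str) -> str:
    # Hand negative conversations to a human specialist; otherwise the chatbot continues.
    if classify_sentiment(customer_reply) == "negative":
        return "assign to customer service specialist"
    return "continue with chatbot"
```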