P&F's data science team faces a challenge: they must weigh each expert's opinion equally, but they can't satisfy everyone. Instead of focusing on subjective expert opinions, they decide to evaluate the chatbot against historical customer questions. Experts no longer need to invent questions to test the chatbot, which brings the evaluation closer to real-world conditions. After all, the initial reason for involving experts was that they understand real customer questions better than P&F's data science team does.
It turns out that the most frequently asked questions at P&F concern the technical instructions for the clips: customers want to know the detailed technical specifications of each clip. Because P&F offers thousands of different clip types, customer service takes a long time to answer these questions.
Applying the idea of test-driven development, the data science team creates a dataset from the conversation history that contains each customer question and the corresponding customer service response:
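A minimal sketch of what such a dataset could look like, using pandas; the column names and example rows are illustrative, not P&F's actual data:

```python
import pandas as pd

# Illustrative evaluation dataset built from conversation history;
# the questions and answers shown here are made-up examples.
eval_df = pd.DataFrame(
    {
        "Customer Question": [
            "What is the maximum load of clip model C-104?",
            "Which material is clip B-220 made of?",
        ],
        "Customer Service Response": [
            "Clip C-104 supports loads up to 12 kg.",
            "Clip B-220 is made of glass-fiber reinforced nylon.",
        ],
    }
)

eval_df.to_csv("chatbot_eval_dataset.csv", index=False)
```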
With this dataset of questions and answers, P&F can evaluate the chatbot's performance retrospectively. They add a new column, “Chatbot Response,” and store the chatbot's responses to the historical questions in it.
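One way to fill that column is to replay every historical question against the chatbot. In the sketch below, `ask_chatbot` is a simplified stand-in that makes a single GPT-4 call; P&F's real chatbot, which answers from the clip documentation, would sit behind this function instead:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
eval_df = pd.read_csv("chatbot_eval_dataset.csv")

def ask_chatbot(question: str) -> str:
    # Simplified stand-in for P&F's chatbot: a single GPT-4 call.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You answer customer questions about P&F clips."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Replay every historical question and store the chatbot's answer.
eval_df["Chatbot Response"] = eval_df["Customer Question"].apply(ask_chatbot)
eval_df.to_csv("chatbot_eval_dataset.csv", index=False)
```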
We can have experts and GPT-4 evaluate the quality of the chatbot's responses. The ultimate goal is to automate the chatbot accuracy assessment using GPT-4. This is possible if experts and GPT-4 evaluate responses similarly.
The experts create a new Excel sheet containing each expert's evaluation, and the data science team adds the GPT-4 evaluation to it.
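A sketch of how the automated GPT-4 evaluation could look, assuming the OpenAI Python SDK; the prompt wording and the 'correct'/'incorrect' labels are illustrative choices, not P&F's actual setup:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
eval_df = pd.read_csv("chatbot_eval_dataset.csv")

EVALUATION_PROMPT = (
    "You evaluate a support chatbot. Compare the chatbot's answer to the "
    "reference customer service answer and reply only with 'correct' or 'incorrect'."
)

def evaluate_with_gpt4(question: str, reference: str, chatbot_response: str) -> str:
    # Ask GPT-4 to judge one chatbot answer against the reference answer.
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": EVALUATION_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Reference answer: {reference}\n"
                    f"Chatbot answer: {chatbot_response}"
                ),
            },
        ],
        temperature=0,
    )
    return result.choices[0].message.content.strip().lower()

eval_df["GPT-4 Evaluation"] = eval_df.apply(
    lambda row: evaluate_with_gpt4(
        row["Customer Question"],
        row["Customer Service Response"],
        row["Chatbot Response"],
    ),
    axis=1,
)
eval_df.to_csv("chatbot_eval_dataset.csv", index=False)
```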
There are conflicts in how different experts evaluate the same chatbot responses. GPT-4's evaluations, however, are close to the experts' majority vote, indicating that we could perform automatic evaluations with GPT-4. Still, each expert's opinion is valuable, and it is important to resolve the conflicting evaluation preferences among the experts.
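To check this, the team could measure how often GPT-4 agrees with the experts' majority vote. The sketch below assumes the Excel sheet has one 'correct'/'incorrect' column per expert plus the GPT-4 column; the file and column names are hypothetical:

```python
import pandas as pd

# Assumed layout: one label column per expert plus the GPT-4 column.
sheet = pd.read_excel("expert_evaluations.xlsx")
expert_cols = ["Expert A", "Expert B", "Expert C"]

# Majority vote across the experts for each chatbot response.
sheet["Expert Majority"] = sheet[expert_cols].mode(axis=1)[0]

# Fraction of rows where GPT-4 agrees with the expert majority.
agreement = (sheet["Expert Majority"] == sheet["GPT-4 Evaluation"]).mean()
print(f"GPT-4 vs. expert majority agreement: {agreement:.0%}")
```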
P&F organizes a workshop with the experts to create gold standard answers for the historical questions dataset and assessment best-practice guidelines on which all experts agree.
Using insights from the workshop, the data science team can create a more detailed evaluation message for GPT-4 that covers edge cases (e.g., “the chatbot should not request to raise support tickets”). The experts can now use their time to improve the clip documentation and define best practices instead of performing laborious chatbot evaluations.
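A sketch of what such a more detailed evaluation message might look like; only the support-ticket rule comes from the text above, the other rules are assumed examples of guidelines the experts might agree on in the workshop:

```python
# Detailed evaluation message built from the workshop guidelines (illustrative).
DETAILED_EVALUATION_PROMPT = """\
You evaluate a support chatbot that answers technical questions about P&F clips.
Compare the chatbot's answer to the gold standard answer and reply only with
'correct' or 'incorrect'. Apply these rules:
- Technical specifications (load, material, dimensions) must match the gold standard answer.
- The chatbot should not request to raise support tickets.
- Minor wording differences are acceptable as long as the facts are identical.
"""
```

Passing this text as the system message in the earlier evaluation sketch leaves the rest of the workflow unchanged.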
By measuring the chatbot's percentage of correct responses, P&F can decide whether to deploy the chatbot to the support channel. They approve the accuracy and deploy the chatbot.
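The decision itself can be a simple threshold check. The sketch below continues from the earlier evaluation sketch; the 90% threshold is an assumed figure, not P&F's actual criterion:

```python
import pandas as pd

eval_df = pd.read_csv("chatbot_eval_dataset.csv")

# Share of responses GPT-4 marked as correct.
accuracy = (eval_df["GPT-4 Evaluation"] == "correct").mean()
print(f"Chatbot accuracy: {accuracy:.0%}")

DEPLOYMENT_THRESHOLD = 0.90  # assumed acceptance criterion
if accuracy >= DEPLOYMENT_THRESHOLD:
    print("Accuracy approved: deploy the chatbot to the support channel.")
else:
    print("Accuracy below threshold: keep improving the chatbot.")
```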
Finally, it's time to save all the chatbot responses and measure how well the chatbot resolves real customer queries. Because customers can respond directly to the chatbot, it is also important to record their responses to understand their sentiment.
The same evaluation workflow can be used to measure the chatbot's success objectively, even without reference answers. But now customers receive their initial response from the chatbot, and we don't know whether they are satisfied with it. We should therefore investigate how customers react to the chatbot's responses. We can automatically detect negative sentiment in customer replies and assign customer service specialists to handle angry customers.
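A sketch of such a sentiment-based routing step, again assuming GPT-4 as the classifier; the labels and the routing logic are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def classify_sentiment(customer_reply: str) -> str:
    # Ask GPT-4 to label the customer's reply (labels are illustrative).
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the sentiment of the customer's reply as "
                    "'positive', 'neutral' or 'negative'. Answer with one word."
                ),
            },
            {"role": "user", "content": customer_reply},
        ],
        temperature=0,
    )
    return result.choices[0].message.content.strip().lower()

def route_conversation(customer_reply: str) -> str:
    # Hand negative conversations to a human specialist; otherwise the chatbot continues.
    if classify_sentiment(customer_reply) == "negative":
        return "assign to customer service specialist"
    return "continue with chatbot"
```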