Large language models (LLMs) are all the rage, and many people are incorporating them into their applications. Examples include chatbots that answer questions about relational databases, assistants that help programmers write code more efficiently, and co-pilots that take actions on a user's behalf. The powerful capabilities of LLMs make it easy to launch a project and achieve rapid initial success. However, as you move from a prototype to a mature LLM application, a robust evaluation framework becomes essential. Such a framework helps your LLM application achieve optimal performance and ensures consistent and reliable results. In this blog post, we will cover:
- The difference between evaluating an LLM and an LLM-based application
- The importance of evaluating LLM applications
- The challenges of evaluating LLM applications
- Getting started
  a. Collecting data and creating a test suite
  b. Measuring performance
- The LLM application evaluation framework
Using the fictional example of FirstAidMatey, a first aid assistant for pirates, we will navigate the seas of evaluation techniques, challenges and strategies, and wrap up with the key takeaways. So, let’s set sail on this enlightening journey!
Evaluation of individual large language models (LLMs) such as OpenAI’s GPT-4, Google’s PaLM 2, and Anthropic’s Claude is typically done with benchmarks such as MMLU. In this blog post, however, we are interested in evaluating LLM-based applications. These are applications that use an LLM together with other components, such as an orchestration framework that manages a sequence of LLM calls. Often, retrieval augmented generation (RAG) is used to provide context to the LLM and prevent hallucinations. In short, RAG works by embedding the context documents into a vector store, from which the relevant chunks can be retrieved and passed to the LLM. Unlike an LLM, an LLM-based application (or LLM application for short) is designed to perform one or more specific tasks very well. Finding the right configuration often requires some experimentation and iterative improvement. RAG, for example, can be implemented in many different ways. An evaluation framework like the one discussed in this blog post can help you find the best configuration for your use case.
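To make the RAG flow a bit more concrete, here is a minimal sketch of the retrieval step. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the documents are made up for illustration; none of this refers to a specific library or to the exact implementation behind FirstAidMatey.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a simple bag-of-words vector.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The "vector store": document chunks stored next to their (toy) embeddings.
documents = [
    "Treat a swollen hand by resting it, applying a cold compress and keeping it elevated.",
    "For small cuts, rinse the wound with clean water and cover it with a sterile bandage.",
]
vector_store = [(doc, embed(doc)) for doc in documents]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    # Return the chunks most similar to the question, to be passed to the LLM as context.
    query = embed(question)
    ranked = sorted(vector_store, key=lambda item: cosine_similarity(query, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(retrieve("My hand got caught in the ropes and now it's swollen, what should I do?"))
```

In a real application the toy embedding would be replaced by a proper embedding model and a persistent vector store, but the retrieve-then-prompt pattern stays the same.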
FirstAidMatey is an LLM-based application that helps pirates with questions like “My hand got caught in the ropes and now it’s swollen, what should I do, buddy?”. In its simplest form, the orchestrator consists of a single prompt that feeds the user’s question to the LLM and asks it to provide a helpful answer. It can also instruct the LLM to reply in pirate lingo for optimal understanding. As an extension, a vector store with embedded first aid documentation could be added. Depending on the user’s question, the relevant documentation can then be retrieved and included in the prompt, so that the LLM can provide more accurate answers.
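A minimal sketch of what such an orchestrator could look like, assuming a hypothetical `call_llm` placeholder for whichever model API you use; the prompt wording and function names are illustrative only.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a call to your LLM provider of choice.
    raise NotImplementedError

def answer_question(question: str, context_chunks: list[str] | None = None) -> str:
    # Simplest form of the orchestrator: a single prompt that forwards the
    # user's question and asks for an answer in pirate lingo.
    prompt = (
        "You are FirstAidMatey, a first aid assistant for pirates.\n"
        "Answer the question below with practical first aid advice, in pirate lingo.\n"
    )
    if context_chunks:
        # Extension: include retrieved first aid documentation as context.
        prompt += "Base your answer on the following documentation:\n"
        prompt += "\n".join(context_chunks) + "\n"
    prompt += f"\nQuestion: {question}"
    return call_llm(prompt)
```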
Before we get into the how, let’s look at why you should set up a system to evaluate your LLM-based application. There are three main objectives:
- Consistency: Ensure that your LLM application produces stable and reliable results across all scenarios, and detect regressions when they occur. For example, when you improve the performance of your LLM application in one scenario, you want to be warned if that compromises performance in another. When you use proprietary models like OpenAI’s GPT-4, you are also subject to their update schedule: as new versions are released, the version you rely on may be deprecated over time. Research shows that switching to a newer GPT version is not always an improvement, so it is important to be able to evaluate how a new version affects the performance of your LLM application.
- Insights: Understand where the LLM application performs well and where it can be improved.
- Benchmarking: Establish performance baselines for your LLM application, measure the effect of experiments, and release new versions with confidence.
Pursuing these objectives leads to the following outcomes:
- Gain user trust and satisfaction because your LLM application will work consistently.
- Increase stakeholder confidence, because you can show how well the LLM application works and how new versions improve on previous ones.
- Boost your competitive advantage as you can quickly iterate, make improvements, and deploy new versions with confidence.
Having read the benefits above, it is clear why setting up an evaluation for your LLM-based application can be advantageous. But before we can do that, we need to tackle the following two main challenges:
- Lack of labeled data: Unlike traditional machine learning applications, LLM-based ones do not need labeled data to get started. LLMs can perform many tasks (such as text classification, summarization, generation, and more) out of the box, without being shown specific examples. This is great because we don’t have to wait for data and labels, but it also means we have no data to check how well the application performs.
- Multiple valid answers: In an LLM application, the same input can often have more than one correct answer. For example, a chatbot can provide multiple responses with the same meaning, or code can be generated with identical functionality but a different structure, as illustrated in the sketch below.
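As a small illustration of the second challenge, consider two hypothetical generated snippets that behave identically but differ in structure: a literal string comparison rejects one of them, while a functional check accepts both.

```python
# Two hypothetical generated answers with identical behaviour but different structure.
candidate_a = "def double(x):\n    return x * 2"
candidate_b = "def double(x):\n    return x + x"

# Exact string comparison is too strict and marks them as different.
print(candidate_a == candidate_b)  # False

# A functional check shows that both candidates are correct.
scope_a, scope_b = {}, {}
exec(candidate_a, scope_a)
exec(candidate_b, scope_b)
print(all(scope_a["double"](n) == scope_b["double"](n) for n in range(10)))  # True
```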
To address these challenges, we must define the appropriate data and metrics. We will do it in the next section.
Collecting data and creating a test suite
To evaluate an LLM-based application, we use a test suite consisting of test cases, each with specific inputs and reference targets. What these contain depends on the purpose of the application. For example, a code generation application expects natural language instructions as input and returns code as output. During evaluation, the inputs are fed to the LLM application and the generated output is compared with the reference target. Here are some test cases for FirstAidMatey:
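One possible way to represent them in code is shown below; this is a minimal sketch, and the dataclass format and reference answers are assumptions for illustration rather than the definitive test suite.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str          # input fed to the LLM application
    reference_answer: str  # target to compare the generated answer against

test_suite = [
    TestCase(
        question="My hand got caught in the ropes and now it's swollen, what should I do, buddy?",
        reference_answer="Arr, rest yer hand, press somethin' cold on it and keep it raised above yer heart.",
    ),
    TestCase(
        question="I cut me finger on a rusty hook. What now, matey?",
        reference_answer="Rinse the cut with clean water, cover it with a sterile bandage and watch fer signs of infection.",
    ),
]
```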