For a moment, imagine an airplane. What comes to mind? Now picture a Boeing 737 and a V-22 Osprey. Both are aircraft designed to move cargo and people, but they serve different purposes: one is general purpose (commercial flights and freight), the other highly specialized (infiltration, exfiltration, and resupply missions for special operations forces). They look very different because they are built for different activities.
With the rise of LLMs, we have seen our first truly general-purpose ML models. Their generality helps us in many ways:
- The same engineering team can now perform sentiment analysis and structured data mining.
- Professionals from many fields can share knowledge, making it possible for the entire industry to benefit from each other's experience.
- The same experience is useful across a wide range of industries and jobs.
But as we see with airplanes, being generally capable is a very different thing from excelling at a particular task, and at the end of the day, business value often comes from solving particular problems.
This is a good analogy for the difference between model evaluations and task evaluations. Model evaluations measure general capability across many tasks, while task evaluations measure performance on one particular task.
The term “LLM evals” is used quite broadly. OpenAI, for example, released tooling for LLM evals very early on. Most practitioners care more about LLM task evaluations, but that distinction is not always made clearly.
What is the difference?
Model evaluations look at the “overall fitness” of the model: how well does it perform across a variety of tasks?
Task evaluations, on the other hand, are designed specifically to measure how well the model fits your particular application.
Someone who works out regularly and is generally fit would probably fare poorly against a professional sumo wrestler in an actual match; likewise, model evaluations are no substitute for task evaluations when judging your particular needs.
Model evaluations are built for creating and fine-tuning generalized models. They consist of a set of questions posed to a model and a set of ground-truth answers used to grade its responses. Think of the SATs.
While each question in a model evaluation differs, there is usually a common area being tested: a topic or skill that each metric specifically targets. For example, performance on HellaSwag has become a popular way to measure LLM quality.
The HellaSwag dataset consists of a collection of contexts and multiple-choice questions, where each question has several candidate endings. Only one of the endings is sensible or logically coherent, while the others are plausible but incorrect. These completions are designed to be a challenge for AI models, requiring not only linguistic understanding but also common-sense reasoning to choose the correct option.
Here is an example:
A tray of potatoes is loaded into the oven and removed. A large cake tray is turned over and placed on the counter. a large tray of meat
A. is placed on a baked potato
B. ls, and the pickles are placed in the oven.
C. is prepared and then an assistant takes it out of the oven when it is ready.
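If you want to poke at these examples yourself, a minimal sketch along these lines pulls the dataset down with the Hugging Face `datasets` library. The dataset id and field names (`ctx`, `endings`, `label`) are assumptions based on how the dataset is currently hosted on the hub, so treat them as such.

```python
# Minimal sketch: browsing HellaSwag locally. Assumes the "hellaswag"
# dataset id and its ctx/endings/label fields as hosted on the HF hub.
from datasets import load_dataset

ds = load_dataset("hellaswag", split="validation")

example = ds[0]
print(example["ctx"])  # the context to be completed
for i, ending in enumerate(example["endings"]):
    print(f"{chr(65 + i)}. {ending}")  # candidate endings A, B, C, ...
print("gold label index:", example["label"])  # index of the coherent ending
```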
Another example is MMLU. MMLU features tasks that span multiple subjects, including science, literature, history, social sciences, mathematics, and professional domains such as law and medicine. This diversity of topics is intended to mimic the breadth of knowledge and understanding that human learners require, making it a good test of a model's ability to handle multifaceted language comprehension challenges.
Below are some examples – can you solve them?
For which of the following thermodynamic processes does the increase in internal energy of an ideal gas equal the heat added to the gas?
A. Constant temperature
B. Constant volume
C. Constant pressure
D. Adiabatic
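(For the curious: at constant volume the gas does no work, so all the heat added goes into internal energy, making B the answer.) However a model arrives at its choice, scoring an MMLU-style benchmark is straightforward. The sketch below is a hedged illustration with invented answers; real harnesses extract the model's chosen letter per question and report plain accuracy.

```python
# Sketch: scoring a multiple-choice benchmark. Both lists are invented
# for illustration; accuracy against the gold letters is the usual metric.
gold = ["B", "A", "D", "C"]           # reference answers
model_answers = ["B", "A", "C", "C"]  # hypothetical model outputs

accuracy = sum(g == m for g, m in zip(gold, model_answers)) / len(gold)
print(f"accuracy: {accuracy:.0%}")  # 75%
```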
The Hugging Face Open LLM Leaderboard is perhaps the best-known place to find these kinds of model evaluations. It tracks open-source large language models across many model evaluation metrics, and it is often a great starting point for understanding how open-source LLMs differ in performance across a variety of tasks.
Multimodal models require even more evaluations. The Gemini paper demonstrates that multimodality introduces a number of additional benchmarks, such as VQAv2, which tests the ability to understand and integrate visual information. This goes beyond simple object recognition to interpreting actions and the relationships between objects.
Similarly, there are metrics for audio and video information and how to integrate them across modalities.
The goal of these tests is to differentiate between two models or between two snapshots of the same model. Choosing a model for your application matters, but it is something you do once or, at most, very infrequently.
The more common, day-to-day problem is solved with task evaluations. The goal of a task-based evaluation is to analyze the performance of the model on your particular task, often using an LLM as a judge:
- Did your retrieval system fetch the right data?
- Are there hallucinations in your answers?
- Did the system answer important questions with relevant answers?
Some may feel a little uneasy about an LLM evaluating other LLMs, but we have humans evaluating other humans all the time.
The real distinction between model and task evaluations is that for a model evaluation we ask many different questions, but for a task evaluation the question stays the same and it is the data that changes. For example, say you operate a chatbot. You could run your task evaluation over hundreds of customer interactions and ask: “Is there a hallucination here?” The question remains the same across all conversations.
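A minimal sketch of that pattern, with `judge` as a hypothetical stand-in for the call to your evaluation LLM:

```python
# The evaluation question is fixed; only the data changes.
EVAL_QUESTION = "Is there a hallucination here?"

def judge(question: str, conversation: str) -> str:
    """Hypothetical LLM-as-a-judge call; returns a label per conversation."""
    ...  # format a prompt from the question + conversation, call the eval LLM
    return "factual"

conversations: list[str] = []  # load hundreds of logged customer interactions
labels = [judge(EVAL_QUESTION, conv) for conv in conversations]
```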
There are several libraries designed to help practitioners build these evaluations: Ragas, Phoenix (full disclosure: the author leads the team that developed Phoenix), OpenAI evals, and LlamaIndex.
How do they work?
A task evaluation grades the performance of every output of the application as a whole. Let's look at what it takes to put one together.
Establishing a benchmark
The foundation is a solid benchmark. That starts with creating a golden dataset that accurately reflects the scenarios the LLM will encounter. This dataset should include ground-truth labels (often derived from meticulous human review) that serve as the standard of comparison. Don't worry, though: you can usually get away with dozens to hundreds of examples here. Selecting the right evaluation LLM is also essential. It may differ from your application's core LLM, but it should align with your cost and accuracy goals.
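To make that concrete, here is a sketch of what a golden dataset can look like as a table; the column names are illustrative, not a required schema.

```python
# Illustrative golden dataset: inputs, reference text, model outputs,
# and human-reviewed ground-truth labels. Dozens to hundreds of rows
# like this are usually enough.
import pandas as pd

golden_df = pd.DataFrame({
    "input": ["What is the capital of France?"],
    "reference": ["Paris is the capital and largest city of France."],
    "output": ["The capital of France is Paris."],
    "ground_truth": ["correct"],  # from careful human review
})
```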
Preparing the evaluation template
The heart of the task evaluation process is the evaluation template. It should clearly define the input (for example, user queries and documents), the evaluation question (for example, the relevance of the document to the query), and the expected output format (binary or multiclass relevance). You may need to adjust the template to capture nuances specific to your application, so that it accurately measures the LLM's performance against the benchmark dataset.
Below is an example of a template for evaluating a question-and-answer task.
You are given a question, an answer and reference text. You must determine whether the given answer correctly answers the question based on the reference text. Here is the data:
(BEGIN DATA)
************
(QUESTION): {input}
************
(REFERENCE): {reference}
************
(ANSWER): {output}
(END DATA)
Your response should be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the answer.
Metrics and iteration
Running the evaluation against your golden dataset lets you generate key metrics such as accuracy, precision, recall, and F1 score. These provide insight into how effective the evaluation template is and highlight areas for improvement. Iteration is crucial; refining the template based on these metrics keeps the evaluation process aligned with the application's goals without overfitting to the golden dataset.
In task evaluations, relying solely on overall accuracy is insufficient, because significant class imbalance is the norm. Precision and recall offer a more robust view of LLM performance, emphasizing the importance of correctly identifying both relevant and irrelevant results. A balanced set of metrics ensures that evaluations meaningfully improve the LLM application.
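As a sketch of that computation, assuming the judge's labels and the human ground truth are simple string lists, scikit-learn covers all four metrics; here the rare "incorrect" class is treated as the positive label.

```python
# Sketch: judge labels vs. human ground truth. The lists are invented;
# with class imbalance, precision/recall on the rare class tell you more
# than overall accuracy does.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

truth = ["correct", "correct", "incorrect", "correct", "incorrect"]
judged = ["correct", "incorrect", "incorrect", "correct", "correct"]

print("accuracy :", accuracy_score(truth, judged))
print("precision:", precision_score(truth, judged, pos_label="incorrect"))
print("recall   :", recall_score(truth, judged, pos_label="incorrect"))
print("f1       :", f1_score(truth, judged, pos_label="incorrect"))
```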
Applying LLM evaluations
Once an assessment framework is in place, the next step is to apply these assessments directly to your LLM application. This involves integrating the evaluation process into the application workflow, allowing real-time evaluation of the LLM's responses to user input. This continuous feedback loop is invaluable in maintaining and improving the relevance and accuracy of the application over time.
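What that integration looks like varies by stack; the sketch below shows one hypothetical shape, with every helper stubbed out rather than tied to a real library.

```python
# Hypothetical sketch: every response the application produces is judged
# before it is returned, feeding the continuous feedback loop.

def generate_answer(query: str, reference: str) -> str:
    """Your application LLM call (stub)."""
    ...

def evaluate_qa(row: dict) -> str:
    """Your LLM-as-a-judge call (stub; see the earlier sketch)."""
    ...

def log_for_review(query: str, answer: str) -> None:
    """Route flagged responses to your observability or review queue (stub)."""
    ...

def handle_query(query: str, reference: str) -> str:
    answer = generate_answer(query, reference)
    verdict = evaluate_qa({"input": query, "reference": reference, "output": answer})
    if verdict == "incorrect":
        log_for_review(query, answer)
    return answer
```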
Evaluation throughout the system life cycle
Effective task evaluations are not confined to a single stage; they are an integral part of the entire life cycle of the LLM system. From benchmarking and pre-production testing to ongoing performance monitoring in production, LLM evaluation ensures that the system keeps meeting user needs.
Example: is the model hallucinating?
Let's look at an example of hallucination in more detail.
Since hallucinations are a common problem for most practitioners, there are some benchmark datasets available. These are a great first step, but you will often need a custom dataset specific to your company.
The next important step is to develop the prompt template. Again, a good library can help you get started. We saw an example prompt template earlier; here is another one, specifically for hallucinations. You may need to tweak it for your purposes.
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' in this context refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.
(BEGIN DATA)
************
(Query): {input}
************
(Reference text): {reference}
************
(Answer): {output}
************
(END DATA)
Is the answer above factual or hallucinated based on the query and reference text?
Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters.
"hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text.
"factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information.
Please read the query and reference text carefully before determining your response.
Now you are ready to feed the queries from your golden dataset to your evaluation LLM and have it label hallucinations. When you look at the results, remember that you should expect class imbalance: track precision and recall, not overall accuracy.
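A hedged sketch of that labeling pass, reusing the golden-dataset shape from earlier; `ask_judge` is a hypothetical wrapper around your evaluation LLM, and the template string abbreviates the full prompt above.

```python
# Sketch: label every row of the golden dataset with the hallucination
# template. One invented row is shown; a real run uses your full dataset.
import pandas as pd

HALLUCINATION_TEMPLATE = "In this task, you will be presented with a query, ..."  # full text above

def ask_judge(prompt: str) -> str:
    """Hypothetical evaluation-LLM call; returns "factual" or "hallucinated"."""
    ...  # e.g. an OpenAI chat completion call, as sketched earlier
    return "factual"

golden_df = pd.DataFrame({
    "input": ["When was the Eiffel Tower built?"],
    "reference": ["The Eiffel Tower was constructed from 1887 to 1889."],
    "output": ["It was built in 1850."],
    "ground_truth": ["hallucinated"],  # from human review
})

golden_df["eval_label"] = [
    ask_judge(HALLUCINATION_TEMPLATE.format(
        input=row["input"], reference=row["reference"], output=row["output"]))
    for _, row in golden_df.iterrows()
]
```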
It is very useful to build a confusion matrix and plot it visually; a sketch follows below. With such a plot in hand, you can feel confident about your LLM's performance, and if that performance is not to your liking, you can always iterate on the prompt template.
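For the plotting itself, a minimal sketch with scikit-learn and matplotlib, using invented labels:

```python
# Sketch: confusion matrix of judge labels vs. human ground truth.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

truth = ["factual", "hallucinated", "factual", "hallucinated", "factual"]
judged = ["factual", "hallucinated", "hallucinated", "hallucinated", "factual"]

ConfusionMatrixDisplay.from_predictions(truth, judged,
                                        labels=["factual", "hallucinated"])
plt.show()
```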
With the evaluation built, you now have a powerful tool that can label all of your data with known precision and recall. You can use it to track hallucinations in your system during both development and production.
Let's summarize the differences between task and model evaluations.
Ultimately, both model evaluations and task evaluations matter in building a working LLM system, and it is important to understand when and how to apply each. Most practitioners spend the majority of their time on task evaluations, which measure the system's performance on a specific task.