You can ask ChatGPT to act in a million different ways: as your nutritionist, language tutor, doctor, and so on. So it’s no surprise that we see so many demos and products launched on top of the OpenAI API. But while it’s easy to get LLMs to act a certain way, ensuring they perform well and accurately complete their assigned task is a completely different story.
The problem is that many of the criteria we care about are extremely subjective. Are the answers accurate? Are the answers coherent? Did the model hallucinate something? It is difficult to construct quantifiable metrics for evaluation. In general, human judgment is needed, but having humans verify a large number of LLM outputs is costly.
Additionally, LLMs have many knobs you can adjust: the prompt, temperature, context, and so on. You can also fine-tune the models on a specific dataset to fit your use case. With prompt engineering, even asking a model to take a deep breath (1) or making your request more emotional (2) can improve performance. There’s a lot of room to tweak and experiment, but after you change something, you need to be able to tell whether the overall system got better or worse.
Since human work is slow and expensive, there is a strong incentive to find automatic metrics for these more subjective criteria. An interesting approach that is gaining popularity is to use LLMs to evaluate the output of LLMs. After all, if ChatGPT can generate a good, coherent answer to a question, can’t it also tell whether a given text is coherent? This opens up a whole box of potential biases, techniques, and opportunities, so let’s dive in.
If you have a negative knee-jerk reaction to creating metrics and evaluators that use LLMs, your concerns are well founded: this could simply be a terrible way to propagate existing biases.
For example, in the G-Eval paper, which we will discuss in more detail later, the researchers showed that their LLM-based evaluation gives higher scores to GPT-3.5 summaries than to human-written summaries, even when human judges prefer the human-written summaries.
Another study, “Large Language Models are not Fair Evaluators” (3), found that when an LLM is asked to choose which of two presented options is better, there is a significant bias toward the order in which the options are presented. GPT-4, for example, often preferred the first option, while ChatGPT tended to prefer the second. You can simply ask the same question with the order reversed and see how consistent the LLMs are in their answers. The authors later developed techniques to mitigate this bias by running the LLM multiple times with different option orders.
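As a rough illustration, here is a minimal sketch of that consistency check. The `ask_llm` helper and the prompt wording are assumptions standing in for whatever chat API and judging prompt you actually use.

```python
# Sketch: detect position bias by asking the same comparison twice
# with the option order swapped. `ask_llm` is an assumed helper that
# sends a prompt to your chosen chat model and returns its text reply.
from typing import Callable

PROMPT = (
    "Question: {question}\n"
    "Answer A: {first}\n"
    "Answer B: {second}\n"
    "Which answer is better? Reply with exactly 'A' or 'B'."
)

def is_order_consistent(
    ask_llm: Callable[[str], str], question: str, answer_1: str, answer_2: str
) -> bool:
    """Return True if the judge picks the same underlying answer in both orderings."""
    forward = ask_llm(PROMPT.format(question=question, first=answer_1, second=answer_2)).strip()
    swapped = ask_llm(PROMPT.format(question=question, first=answer_2, second=answer_1)).strip()
    # Consistent means one run answers "A" and the other "B",
    # i.e. the same underlying answer wins regardless of position.
    return {forward, swapped} == {"A", "B"}
```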
At the end of the day, we want to know if LLMs can perform as well or similarly to human raters. We can still approach this as a scientific problem:
- Establish evaluation criteria.
- Ask humans and LLMs to evaluate according to the criteria.
- Calculate the correlation between the human and LLM evaluations.
In this way, we can get an idea of how similar LLMs are to human evaluators.
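As a minimal sketch of the last step, assuming we already collected ratings from humans and from an LLM on the same examples, a rank correlation such as Spearman’s gives a quick read on agreement. The scores below are made up for illustration.

```python
# Sketch: quantify agreement between human and LLM ratings on the same
# examples. Spearman's rank correlation is a common choice because we
# mostly care about relative ordering, not the absolute scale.
from scipy.stats import spearmanr

human_scores = [4, 5, 2, 3, 1, 4]   # hypothetical human ratings (1-5)
llm_scores   = [5, 5, 1, 3, 2, 3]   # hypothetical LLM ratings for the same items

correlation, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman correlation: {correlation:.2f} (p={p_value:.3f})")
```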
In fact, there are already several studies like this showing that, for certain tasks, LLM-based evaluators correlate with human judgment far better than more traditional evaluation metrics do. And it’s worth noting that we don’t need a perfect correlation. If we evaluate many examples, even an imperfect evaluator can tell us whether a new system is working better or worse. We could also use LLM evaluators to flag concerning edge cases for human evaluators.
Let’s take a look at some of the recently proposed metrics and evaluators that are based on LLMs at their core.
G-Eval (4) works by first outlining the evaluation criteria and then simply asking the model to give a rating. It can be used for summarization and dialogue generation tasks, for example.
G-Eval has the following components:
- Prompt. Defines the evaluation task and its criteria.
- Evaluation steps. The intermediate steps to follow when carrying out the evaluation. Interestingly, the researchers ask the LLM itself to generate these steps.
- Scoring function. Instead of taking the LLM’s rating at face value, we look at the token probabilities to compute the final score. So if we ask for a score between 1 and 5, instead of just taking whatever number the LLM outputs (say “3”), we look at the probability of each rating token and calculate a probability-weighted score (a small sketch follows below). This is because the researchers found that one number typically dominates the ratings (e.g., the model mostly outputs 3), and even when the LLM is asked to provide a decimal value, it still tends to return whole numbers.
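Here is a minimal sketch of that probability-weighted scoring, assuming you can read the log-probabilities of the rating token from your API of choice; the numbers in the example are made up.

```python
# Sketch of a G-Eval-style scoring function: instead of taking the single
# rating the model prints, weight each possible rating by the probability
# the model assigned to its token.
import math

def weighted_score(token_logprobs: dict[str, float]) -> float:
    """token_logprobs maps rating tokens ('1'..'5') to their log-probabilities."""
    probs = {int(tok): math.exp(lp) for tok, lp in token_logprobs.items() if tok.isdigit()}
    total = sum(probs.values())
    return sum(rating * p for rating, p in probs.items()) / total

# Example: the model puts most mass on "3" but some on "4" and "2".
print(weighted_score({"3": math.log(0.6), "4": math.log(0.3), "2": math.log(0.1)}))  # ~3.2
```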
G-Eval was found to significantly outperform traditional reference-based metrics such as BLEU and ROUGE, which have relatively low correlation with human judgments. At first glance it seems quite simple: we just ask the LLM to carry out the assessment. We could also try breaking the task down into smaller components.
FActScore (Factual precision in Atomicity Score) (5) is a metric for factual precision. The two key ideas are to treat individual atomic facts as the unit of evaluation and to ground truthfulness in a particular knowledge source.
For evaluation, you split the generation into small “atomic” facts (e.g., “He was born in New York”) and then check whether each fact is supported by the given ground-truth knowledge source. The final score is the number of supported facts divided by the total number of facts.
In the paper, the researchers asked LLMs to generate biographies of people and then used the Wikipedia articles about those people as the knowledge source. When an LLM carried out the same evaluation procedure as the human annotators, its error rate was less than 2%.
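A minimal sketch of the score itself might look like this, assuming hypothetical LLM-backed helpers `extract_atomic_facts` and `is_supported` that split a generation into facts and check one fact against the knowledge source.

```python
# Sketch of a FActScore-style computation: supported atomic facts
# divided by all atomic facts. The two helpers are assumed to be
# implemented with LLM calls (fact extraction and fact verification).
from typing import Callable

def factscore(
    generation: str,
    knowledge_source: str,
    extract_atomic_facts: Callable[[str], list[str]],
    is_supported: Callable[[str, str], bool],
) -> float:
    facts = extract_atomic_facts(generation)
    if not facts:
        return 0.0
    supported = sum(is_supported(fact, knowledge_source) for fact in facts)
    return supported / len(facts)
```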
Now let’s take a look at some metrics for Retrieval-Augmented Generation (RAG). With RAG, you first retrieve the relevant context from an external knowledge base and then ask the LLM to answer the question based on those facts.
RAGAS (Retrieval-Augmented Generation Assessment) (6) is a new framework for evaluating RAG pipelines. It is not a single metric but a collection of them. The three proposed in the paper are faithfulness, answer relevance, and context relevance. These metrics nicely illustrate how evaluation can be broken down into simpler tasks for LLMs.
Faithfulness measures how well-grounded the answers are in the given context. It is very similar to FActScore: you first split the generation into a set of statements and then ask the LLM whether each statement is supported by the given context. The score is the number of supported statements divided by the total number of statements. For faithfulness, the researchers found a very high correlation with human annotators.
Answer relevance tries to capture the idea that the answer actually addresses the question asked. You start by asking the LLM to generate questions based on the answer. For each generated question, you calculate the similarity (by creating an embedding and using cosine similarity) between the generated question and the original question. Doing this n times and averaging the similarity scores gives the final answer-relevance value.
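A rough sketch of this computation, assuming hypothetical `generate_question` and `embed` helpers that wrap an LLM and an embedding model of your choice:

```python
# Sketch of answer relevance: generate n questions from the answer,
# embed them, and average their cosine similarity to the original question.
from typing import Callable
import numpy as np

def answer_relevance(
    original_question: str,
    answer: str,
    generate_question: Callable[[str], str],
    embed: Callable[[str], np.ndarray],
    n: int = 3,
) -> float:
    q_vec = embed(original_question)
    sims = []
    for _ in range(n):
        gen_vec = embed(generate_question(answer))
        cosine = float(np.dot(q_vec, gen_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(gen_vec)))
        sims.append(cosine)
    return sum(sims) / n
```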
Context relevance refers to how relevant the retrieved context is, that is, whether the provided context contains only the information necessary to answer the question. Ideally, we give the LLM exactly the right information to answer the question and nothing more. Context relevance is calculated by asking the LLM to extract the sentences in the given context that are relevant to answering the question, then dividing the number of relevant sentences by the total number of sentences.
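And a similar sketch for context relevance, assuming a hypothetical `select_relevant_sentences` helper that asks the LLM which sentences are actually needed; the naive period-based sentence split is only for illustration.

```python
# Sketch of context relevance: relevant sentences (as judged by the LLM)
# divided by all sentences in the retrieved context.
from typing import Callable

def context_relevance(
    question: str,
    context: str,
    select_relevant_sentences: Callable[[str, str], list[str]],
) -> float:
    all_sentences = [s for s in context.split(".") if s.strip()]
    if not all_sentences:
        return 0.0
    relevant = select_relevant_sentences(question, context)
    return len(relevant) / len(all_sentences)
```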
You can find more metrics and explanations (along with the open source GitHub repository) here.
The key point is that we can transform the evaluation into smaller subproblems. Instead of asking whether the entire text is supported by the context, we ask only whether a small, specific fact is supported by the context. Instead of directly assigning a number for how relevant an answer is, we ask the LLM to think of a question for the given answer.
The evaluation of LLMs is an extremely interesting research topic that will receive increasing attention as more systems begin to reach production and are also applied in more safety-critical environments.
We could also use these metrics to monitor the performance of LLMs in production and notice if the quality of the results starts to degrade. Especially for applications with high error costs, such as healthcare, it will be crucial to develop guardrails and systems to detect and reduce errors.
While there are definitely biases and problems with using LLMs as evaluators, we still need to keep an open mind and approach it as a research problem. Of course, humans will still be involved in the evaluation process, but automated metrics could help partially evaluate performance in some environments.
These metrics don’t have to be perfect; they just need to work well enough to guide product development in the right direction.
Special thanks to Daniel Raff and Yevhen Petyak for their comments and suggestions.
Originally published in Medplexity Substack.