I've most often created custom metrics for my own use cases, but I kept running into these built-in metrics for AI tools in LangChain before I started using RAGAS and/or DeepEval for RAG evaluation, so I finally got curious about how these metrics are created and ran a quick analysis (with all of its inherent biases, of course).
TLDR comes from the following correlation matrix:
- Helpfulness and Coherence (0.46 correlation): This relatively strong correlation suggests that the LLM (and, by proxy, users) may find coherent answers more helpful, underscoring the importance of logical structure in answers. It's only a correlation, but the relationship opens the door to that conclusion.
- Controversiality and Criminality (0.44 correlation): This indicates that even controversial content could be deemed criminal, and vice versa, perhaps reflecting a preference for engaging and thought-provoking material.
- Coherence vs. Depth: Although coherence correlates with helpfulness, depth does not. This might suggest that users (again, assuming user preferences are embedded in the LLM output, which is itself an assumption and bias worth being aware of) may prefer clear, concise answers over detailed ones, particularly in contexts where quick answers are valued over comprehensive ones.
The built-in metrics are found here (removing one that relates to ground truth and is better handled elsewhere):
# Listing Criteria / LangChain's built-in metrics
from langchain.evaluation import Criteria

# Drop 'correctness' (index 2), which relies on ground truth
new_criteria_list = [item for i, item in enumerate(Criteria) if i != 2]
new_criteria_list
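As a quick sanity check, printing the names confirms which criterion gets dropped (the index-based removal assumes CORRECTNESS still sits at position 2 of the enum in your LangChain version):

# Sanity check: confirm 'correctness' is the criterion that was removed
print([c.name for c in new_criteria_list])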
The metrics:
- Conciseness
- Detail
- Relevance
- Coherence
- Harmfulness
- Insensitivity
- Helpfulness
- Controversiality
- Criminality
- Depth
- Creativity
First, what do they mean and why were they created?
The hypothesis:
- These were created in an attempt to define metrics that could explain output relative to the theoretical goals of the use cases; any correlation between them would be largely incidental and was generally avoided where possible.
I arrived at this hypothesis after looking at the source code here.
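My reading of that source is that each built-in metric boils down to a named prompt description fed into the same criteria chain. The documented custom-criteria path makes that mechanism easy to see; this is just a sketch, and the 'actionability' criterion below is a hypothetical one of my own, not a LangChain built-in.

# Sketch: a custom criterion is just a {name: description} prompt, which is
# essentially what each built-in Criteria member resolves to under the hood
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

custom_criterion = {"actionability": "Does the submission give the reader a concrete next step?"}
custom_evaluator = load_evaluator(
    "criteria",
    criteria=custom_criterion,
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
)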
Second, some of them seem similar and/or vague, so how are they different?
I used a standard SQuAD dataset as a basis for evaluating the differences (if any) between the output of OpenAI's GPT-3.5 Turbo model and the ground truth in this dataset, and for comparing the metrics against each other.
# Import a standard SQuAD dataset from HuggingFace (ran in Colab)
from google.colab import userdata
from datasets import load_dataset

HF_TOKEN = userdata.get('HF_TOKEN')
dataset = load_dataset("rajpurkar/squad")
print(type(dataset))
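Before sampling, it's worth peeking at one record, since the loop further down relies on the 'question' and 'answers' fields of the standard SQuAD schema:

# Peek at one validation record to confirm the fields used later
example = dataset["validation"][0]
print(example.keys())      # expected: id, title, context, question, answers
print(example["question"])
print(example["answers"])  # dict with 'text' (list of strings) and 'answer_start'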
I took a random sample of rows for evaluation (I couldn't afford to run the whole thing, in either time or cost), so this could be an entry point for more noise and/or bias.
# Slice dataset to a randomized selection of 100 rows
validation_data = dataset['validation']
validation_df = validation_data.to_pandas()
sample_df = validation_df.sample(n=100, replace=False)
I defined my model with GPT-3.5 Turbo (to save on cost here; it's also fast).
import os
from langchain.chat_models import ChatOpenAI

# Import OAI API key
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

# Define llm
llm = ChatOpenAI(model_name='gpt-3.5-turbo', openai_api_key=OPENAI_API_KEY)
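Before running the full loop, a single evaluation shows the shape of the output; in my runs the 'criteria' evaluator returns a dict with 'reasoning', 'value' (Y/N), and a numeric 'score', which is what gets collected below.

# One-off evaluation to inspect the output structure before the full loop
from langchain.evaluation import load_evaluator

sample_evaluator = load_evaluator("criteria", llm=llm, criteria="conciseness")
sample_result = sample_evaluator.evaluate_strings(
    prediction="Paris is the capital of France.",
    input="What is the capital of France?",
)
print(sample_result)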
Then I looped through the sampled rows for a comparison. There are unknown thresholds that LangChain uses for "score" in the criteria evaluation, but they are assumed to be defined the same way for all metrics.
# Loop through each question in the random sample
from langchain.evaluation import load_evaluator

results = {}
for index, row in sample_df.iterrows():
    try:
        prediction = " ".join(row['answers']['text'])
        input_text = row['question']
        # Loop through each criterion
        for m in new_criteria_list:
            evaluator = load_evaluator("criteria", llm=llm, criteria=m)
            eval_result = evaluator.evaluate_strings(
                prediction=prediction,
                input=input_text,
                reference=None,
                other_kwarg="value"  # adding more in the future for comparison
            )
            score = eval_result['score']
            if m not in results:
                results[m] = []
            results[m].append(score)
    except KeyError as e:
        print(f"KeyError: {e} in row {index}")
    except TypeError as e:
        print(f"TypeError: {e} in row {index}")
I then calculated the means and 95% confidence intervals.
# Calculate means and confidence intervals at 95%
import numpy as np
from scipy.stats import sem, t

mean_scores = {}
confidence_intervals = {}

for m, scores in results.items():
    mean_score = np.mean(scores)
    mean_scores[m] = mean_score
    # Standard error of the mean * t-value for 95% confidence
    ci = sem(scores) * t.ppf((1 + 0.95) / 2., len(scores) - 1)
    confidence_intervals[m] = (mean_score - ci, mean_score + ci)
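As a cross-check on the manual calculation, scipy's t.interval should reproduce the same bounds (a sketch, assuming a two-sided 95% interval):

# Cross-check the manual CI math against scipy's t.interval
import numpy as np
from scipy.stats import sem, t

for m, scores in results.items():
    lower, upper = t.interval(0.95, len(scores) - 1, loc=np.mean(scores), scale=sem(scores))
    print(f"{m}: ({lower:.3f}, {upper:.3f}) vs {confidence_intervals[m]}")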
And plotted the results.
# Plotting results by metric
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
m_labels = list(mean_scores.keys())
means = list(mean_scores.values())
cis = [confidence_intervals[m] for m in m_labels]
error = [(mean - ci[0], ci[1] - mean) for mean, ci in zip(means, cis)]

ax.bar(m_labels, means, yerr=np.array(error).T, capsize=5, color='lightblue', label='Mean Scores with 95% CI')
ax.set_xlabel('Criteria')
ax.set_ylabel('Average Score')
ax.set_title('Evaluation Scores by Criteria')
plt.xticks(rotation=90)
plt.legend()
plt.show()
It's probably intuitive that 'Relevance' is much higher than the others, but it's interesting that the scores are generally so low (perhaps thanks to GPT-3.5!) and that 'Helpfulness' is the next-highest metric (possibly reflecting RL and other technical optimizations).
To answer my question about correlation, I calculated a simple correlation matrix with the raw comparison data frame.
# Convert results to a dataframe
import pandas as pd

min_length = min(len(v) for v in results.values())
dfdata = {k.name: v[:min_length] for k, v in results.items()}
df = pd.DataFrame(dfdata)

# Filter out columns with null values
filtered_df = df.drop(columns=[col for col in df.columns if 'MALICIOUSNESS' in col or 'MISOGYNY' in col])

# Create corr matrix
correlation_matrix = filtered_df.corr()
I then graphed the results (the p-values are computed further down in my code and were all below .05; a sketch of that calculation follows the plot below).
# Plot corr matrix
import seaborn as sns

mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
cbar_kws={"shrink": .8})
plt.title('Correlation Matrix - Built-in Metrics from LangChain')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
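The p-values referenced above aren't shown in the snippet; here's a minimal sketch of how they can be computed pairwise, assuming Pearson correlation (which matches pandas' df.corr() default):

# Pairwise p-values for the correlation matrix (Pearson, matching df.corr()'s default)
from itertools import combinations
from scipy.stats import pearsonr

for col_a, col_b in combinations(filtered_df.columns, 2):
    r, p = pearsonr(filtered_df[col_a], filtered_df[col_b])
    print(f"{col_a} vs {col_b}: r={r:.2f}, p={p:.4f}")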
It was surprising that most of the metrics did not correlate, given the nature of the descriptions in the LangChain codebase; this speaks to something a bit more thought-out than I expected, and I'm glad they're built in and ready to use.
Notable relationships emerge from the correlation matrix:
- Helpfulness and Coherence (0.46 correlation): This relatively strong correlation suggests that the LLM (acting as a surrogate for users) may find coherent answers more helpful, emphasizing the importance of logically structured answers. Although it is only a correlation, the relationship points in that direction.
- Controversiality and Criminality (0.44 correlation): This indicates that even controversial content could be deemed criminal, and vice versa, perhaps reflecting a preference for engaging and thought-provoking material. Again, this is only a correlation.
Conclusions:
- Coherence vs. Depth in Helpfulness: Although coherence correlates with helpfulness, depth does not. This could suggest that users may prefer clear, concise answers over detailed ones, particularly in contexts where quick answers are valued over comprehensive ones.
- Leveraging controversiality: The positive correlation between Controversiality and Criminality raises an interesting question: can controversial topics be discussed in a non-criminal way? This could potentially increase user engagement without compromising content quality.
- Impact of bias and model choice: The use of GPT-3.5 Turbo and the inherent biases in the metric design could influence these correlations. Recognizing these biases is essential for accurate interpretation and application of these metrics.
Unless otherwise noted, all images in this article were created by the author.