Evaluating large language models (LLMs) is essential for understanding how well they perform and whether they meet your standards. The Hugging Face Evaluate library offers a useful set of tools for this task. This guide shows how to use the Evaluate library to evaluate LLMs, with practical code examples.
Understanding the Hugging Face Evaluate library
The Hugging Face Evaluate library provides tools for different evaluation needs. These tools fall into three main categories:
- Metrics: These measure a model's performance by comparing its predictions with ground-truth labels. Examples include accuracy, F1, BLEU, and ROUGE.
- Comparisons: These help compare two models, often by examining how their predictions align with each other or with reference labels.
- Measurements: These tools investigate properties of the datasets themselves, such as text complexity or label distributions.
You can access all of these evaluation modules with a single function: evaluate.load().
Getting started
Installation
First, install the library. Open your terminal or command prompt and run:
pip install evaluate
pip install rouge_score # Needed for text generation metrics
pip install evaluate[visualization] # For plotting capabilities
These commands install the core evaluate library, the rouge_score package (required for the ROUGE metric often used in summarization), and the optional visualization dependencies for plots such as radar charts.
Loading an evaluation module
To use a specific evaluation tool, load it by name. For example, to load the accuracy metric:
import evaluate
accuracy_metric = evaluate.load("accuracy")
print("Accuracy metric loaded.")
Output:
Accuracy metric loaded.
This code imports the evaluate library and loads the accuracy metric object. We will use this object to compute accuracy scores.
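The same evaluate.load() call also returns the other module types when you pass module_type. As a quick, hedged sketch, here is how a comparison module could be used to compare two models' prediction sets (the mcnemar module and its predictions1/predictions2 arguments are assumptions based on the library's comparison modules; adjust if your version differs):
import evaluate
# Load a comparison module to statistically compare two models' predictions
try:
    mcnemar = evaluate.load("mcnemar", module_type="comparison")
    references   = [0, 1, 0, 1, 1, 0]   # ground-truth labels
    predictions1 = [0, 1, 0, 1, 0, 0]   # model A's predictions
    predictions2 = [1, 1, 0, 1, 1, 0]   # model B's predictions
    results = mcnemar.compute(references=references,
                              predictions1=predictions1,
                              predictions2=predictions2)
    print(f"McNemar comparison result: {results}")
except Exception as e:
    print(f"Could not run the mcnemar comparison: {e}")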
Basic evaluation examples
Let's walk through some common evaluation scenarios.
Computing accuracy directly
You can compute a metric by providing all references (ground truth) and predictions at once.
import evaluate
# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")
# Sample ground truth and predictions
references = [0, 1, 0, 1]
predictions = [1, 0, 0, 1]
# Compute accuracy
result = accuracy_metric.compute(references=references, predictions=predictions)
print(f"Direct computation result: {result}")
# Example with exact_match metric
exact_match_metric = evaluate.load('exact_match')
match_result = exact_match_metric.compute(references=['hello world'], predictions=['hello world'])
no_match_result = exact_match_metric.compute(references=['hello'], predictions=['hell'])
print(f"Exact match result (match): {match_result}")
print(f"Exact match result (no match): {no_match_result}")
Output:

Explanation:
- We define two lists: references contains the correct labels, and predictions contains the model's outputs.
- The compute method takes these lists, calculates accuracy, and returns the result as a dictionary.
- We also show the exact_match metric, which checks whether the prediction matches the reference exactly.
Incremental evaluation (using add_batch)
For large datasets, processing predictions in batches can be more memory efficient. You can add batches incrementally and compute the final score at the end.
import evaluate
# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")
# Sample batches of references and predictions
references_batch1 = [0, 1]
predictions_batch1 = [1, 0]
references_batch2 = [0, 1]
predictions_batch2 = [0, 1]
# Add batches incrementally
accuracy_metric.add_batch(references=references_batch1, predictions=predictions_batch1)
accuracy_metric.add_batch(references=references_batch2, predictions=predictions_batch2)
# Compute final accuracy
final_result = accuracy_metric.compute()
print(f"Incremental computation result: {final_result}")
Output:

Explanation:
- We simulate processing data in two batches.
- add_batch updates the metric's internal state with each batch.
- Calling compute() without arguments calculates the metric over all added batches.
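In a real evaluation loop you would typically pull batches from a data loader or an inference loop. Here is a minimal sketch using the same add_batch API (the batches list below is a hypothetical stand-in for your own batch source):
import evaluate
accuracy_metric = evaluate.load("accuracy")
# Hypothetical batches, e.g. yielded by a DataLoader or an inference loop
batches = [
    ([0, 1], [1, 0]),
    ([0, 1], [0, 1]),
    ([1, 1], [1, 0]),
]
for refs, preds in batches:
    # Each call only updates the metric's internal state; nothing is scored yet
    accuracy_metric.add_batch(references=refs, predictions=preds)
# One compute() call aggregates every batch added so far
print(f"Accuracy over all batches: {accuracy_metric.compute()}")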
Combining multiple metrics
Often you want to compute several metrics at once (for example, accuracy, F1, precision, and recall for classification). The evaluate.combine function simplifies this.
import evaluate
# Combine multiple classification metrics
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
# Sample data
predictions = [0, 1, 0]
references = [0, 1, 1] # Note: The last prediction is incorrect
# Compute all metrics at once
results = clf_metrics.compute(predictions=predictions, references=references)
print(f"Combined metrics result: {results}")
Output:

Explanation:
- evaluate.combine takes a list of metric names and returns a combined evaluation object.
- Calling compute on this object calculates all of the specified metrics with the same input data.
Using measurements
Measurements can be used to analyze datasets. Here is how to use the word_length measurement:
import evaluate
# Load the word_length measurement
# Note: May require NLTK data download on first run
try:
    word_length = evaluate.load("word_length", module_type="measurement")
    data = ["hello world", "this is another sentence"]
    results = word_length.compute(data=data)
    print(f"Word length measurement result: {results}")
except Exception as e:
    print(f"Could not run word_length measurement, possibly NLTK data missing: {e}")
    print("Attempting NLTK download...")
    import nltk
    nltk.download('punkt')  # Download the required tokenizer data, then re-run
Output:

Explanation:
- We load word_length and specify module_type="measurement".
- The compute method takes the dataset (here, a list of strings) as input.
- It returns statistics about the word lengths in the provided data. (Note: this requires NLTK and its 'punkt' tokenizer data.)
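Another measurement that can help with classification datasets is label_distribution, which summarizes how labels are spread. This is a hedged sketch (the module name is an assumption; keep the try/except in case it is not available in your version):
import evaluate
# Hypothetical class labels from a dataset
labels = [0, 1, 1, 0, 1, 1, 1]
try:
    label_dist = evaluate.load("label_distribution", module_type="measurement")
    results = label_dist.compute(data=labels)
    print(f"Label distribution result: {results}")
except Exception as e:
    print(f"Could not run the label_distribution measurement: {e}")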
Evaluating specific NLP tasks
Different NLP tasks require specific metrics. Hugging Face Evaluate includes many of the standard ones.
Machine translation (BLEU)
BLEU (Bilingual Evaluation Understudy) is a common metric for translation quality. It measures the n-gram overlap between the model's translation (the hypothesis) and the reference translations.
import evaluate
def evaluate_machine_translation(hypotheses, references):
    """Calculates BLEU score for machine translation."""
    bleu_metric = evaluate.load("bleu")
    results = bleu_metric.compute(predictions=hypotheses, references=references)
    # Extract the main BLEU score
    bleu_score = results["bleu"]
    return bleu_score
# Example hypotheses (model translations)
hypotheses = ["the cat sat on mat.", "the dog played in garden."]
# Example references (correct translations, can have multiple per hypothesis)
references = [["the cat sat on the mat."], ["the dog played in the garden."]]
bleu_score = evaluate_machine_translation(hypotheses, references)
print(f"BLEU Score: {bleu_score:.4f}") # Format for readability
Output:

Explanation:
- The function loads the BLEU metric.
- It computes the score by comparing the predicted translations (hypotheses) against one or more correct references.
- A higher BLEU score (closer to 1.0) usually indicates better translation quality, suggesting more overlap with the reference translations. A score around 0.51 suggests moderate overlap.
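Because each hypothesis may have several acceptable translations, BLEU also accepts more than one reference per prediction. Here is a small sketch reusing the same metric (the sentences are illustrative):
import evaluate
bleu_metric = evaluate.load("bleu")
hypotheses = ["the cat sat on the mat."]
# Two acceptable reference translations for the single hypothesis
references = [[
    "the cat sat on the mat.",
    "there is a cat on the mat.",
]]
results = bleu_metric.compute(predictions=hypotheses, references=references)
print(f"BLEU with multiple references: {results['bleu']:.4f}")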
Named entity recognition (NER, using seqeval)
For sequence labeling tasks such as NER, per-entity-type precision, recall, and F1 scores are useful. The seqeval metric handles this format (for example, B-PER, I-PER, and O labels).
Running the following code requires the seqeval library, which you can install with:
pip install seqeval
Code:
import evaluate
# Load the seqeval metric
try:
    seqeval_metric = evaluate.load("seqeval")
    # Example labels (using IOB format)
    true_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']]
    predicted_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']]  # Example: Perfect prediction here
    results = seqeval_metric.compute(predictions=predicted_labels, references=true_labels)
    print("Seqeval Results (per entity type):")
    # Print results nicely
    for key, value in results.items():
        if isinstance(value, dict):
            print(f"  {key}: Precision={value['precision']:.2f}, Recall={value['recall']:.2f}, F1={value['f1']:.2f}, Number={value['number']}")
        else:
            print(f"  {key}: {value:.4f}")
except ModuleNotFoundError:
    print("Seqeval metric not installed. Run: pip install seqeval")
Output:

Explanation:
- We load the seqeval metric.
- It takes lists of lists, where each inner list contains the labels for one sentence.
- The compute method returns detailed precision, recall, and F1 scores for each identified entity type (such as PER for person or LOC for location), plus overall scores.
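To see how the scores react to mistakes, here is a hedged variant of the same seqeval call in which the location entity is deliberately mislabeled (the label sequences are illustrative):
import evaluate
seqeval_metric = evaluate.load("seqeval")  # requires: pip install seqeval
true_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']]
# The location entity in the second sentence is mislabeled as an organization
predicted_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'I-ORG', 'O']]
results = seqeval_metric.compute(predictions=predicted_labels, references=true_labels)
# PER is still perfect, LOC is missed entirely, so the overall scores drop
print(f"Overall F1: {results['overall_f1']:.2f}")
print(f"Overall precision: {results['overall_precision']:.2f}")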
Text summarization (ROUGE)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares a generated summary with reference summaries, focusing on overlapping n-grams and the longest common subsequence.
import evaluate
def simple_summarizer(text):
    """A very basic summarizer - just takes the first sentence."""
    try:
        sentences = text.split(".")
        return sentences[0].strip() + "." if sentences[0].strip() else ""
    except Exception:
        return ""  # Handle empty or malformed text
# Load ROUGE metric
rouge_metric = evaluate.load("rouge")
# Example text and reference summary
text = "Today is a beautiful day. The sun is shining and the birds are singing. I am going for a walk in the park."
reference = "The weather is pleasant today."
# Generate summary using the simple function
prediction = simple_summarizer(text)
print(f"Generated Summary: {prediction}")
print(f"Reference Summary: {reference}")
# Compute ROUGE scores
rouge_results = rouge_metric.compute(predictions=[prediction], references=[reference])
print(f"ROUGE Scores: {rouge_results}")
Output:
Generated Summary: Today is a beautiful day.
Reference Summary: The weather is pleasant today.
ROUGE Scores: {'rouge1': np.float64(0.4000000000000001), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.20000000000000004), 'rougeLsum': np.float64(0.20000000000000004)}
Explanation:
- We load the ROUGE metric.
- We define a simplistic summarizer for demonstration purposes.
- compute calculates the different ROUGE scores: rouge1 (unigram overlap), rouge2 (bigram overlap), rougeL (longest common subsequence), and rougeLsum.
- Scores closer to 1.0 indicate greater similarity to the reference summary. The low scores here reflect the basic nature of our simple_summarizer.
Question answering (SQuAD)
The squad metric is used for extractive question answering. It calculates Exact Match (EM) and F1 scores.
import evaluate
# Load the SQuAD metric
squad_metric = evaluate.load("squad")
# Example predictions and references format for SQuAD
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(f"SQuAD Results: {results}")
Output:

Explanation:
- We load the squad metric.
- It takes predictions and references in a specific dictionary format, including the predicted text and the ground-truth answers with their start positions.
- exact_match: the percentage of predictions that exactly match one of the ground-truth answers.
- f1: the average F1 score over all questions, accounting for partial matches at the token level.
Advanced evaluation with the Evaluator class
The evaluator class streamlines the process by integrating model loading, inference, and metric calculation. It is particularly useful for standard tasks such as text classification.
# Note: Requires transformers and datasets libraries
# pip install transformers datasets torch # or tensorflow/jax
import evaluate
from evaluate import evaluator
from transformers import pipeline
from datasets import load_dataset
# Load a pre-trained text classification pipeline
# Using a smaller model for potentially faster execution
try:
    pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", device=-1)  # Use CPU
except Exception as e:
    print(f"Could not load pipeline: {e}")
    pipe = None

if pipe:
    # Load a small subset of the IMDB dataset
    try:
        data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))  # Smaller subset for speed
    except Exception as e:
        print(f"Could not load dataset: {e}")
        data = None

    if data:
        # Load the accuracy metric
        accuracy_metric = evaluate.load("accuracy")
        # Create an evaluator for the task
        task_evaluator = evaluator("text-classification")
        # Correct label_mapping for IMDB dataset
        label_mapping = {
            'NEGATIVE': 0,  # Map NEGATIVE to 0
            'POSITIVE': 1   # Map POSITIVE to 1
        }
        # Compute results
        eval_results = task_evaluator.compute(
            model_or_pipeline=pipe,
            data=data,
            metric=accuracy_metric,
            input_column="text",    # Specify the text column
            label_column="label",   # Specify the label column
            label_mapping=label_mapping  # Pass the corrected label mapping
        )
        print("\nEvaluator Results:")
        print(eval_results)

        # Compute with bootstrapping for confidence intervals
        bootstrap_results = task_evaluator.compute(
            model_or_pipeline=pipe,
            data=data,
            metric=accuracy_metric,
            input_column="text",
            label_column="label",
            label_mapping=label_mapping,  # Pass the corrected label mapping
            strategy="bootstrap",
            n_resamples=10  # Use fewer resamples for faster demo
        )
        print("\nEvaluator Results with Bootstrapping:")
        print(bootstrap_results)
Output:
Device set to use cpu
Evaluator Results:
{'accuracy': 0.9, 'total_time_in_seconds': 24.277618517999997, 'samples_per_second': 4.119020155368932, 'latency_in_seconds': 0.24277618517999996}
Evaluator Results with Bootstrapping:
{'accuracy': {'confidence_interval': (np.float64(0.8703044820750653), np.float64(0.9335706530476571)), 'standard_error': np.float64(0.02412928142780514), 'score': 0.9}, 'total_time_in_seconds': 23.871316319000016, 'samples_per_second': 4.189128017226537, 'latency_in_seconds': 0.23871316319000013}
Explanation:
- We load a transformers pipeline for text classification and a sample of the IMDB dataset.
- We create an evaluator specifically for the "text-classification" task.
- The compute method handles feeding the data (text column) through the pipeline, obtaining predictions, comparing them with the true labels (label column) using the specified metric, and applying the label_mapping.
- It returns the metric score together with performance statistics such as total time and samples per second.
- Using strategy="bootstrap" performs resampling to estimate confidence intervals and the standard error for the metric, giving an idea of the score's stability.
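The evaluator can also work directly from Hub identifiers instead of pre-built objects. A hedged sketch (the model name, dataset name, and the split argument reflect common evaluator usage, but verify them against your installed version; downloads are required on first use):
from evaluate import evaluator

task_evaluator = evaluator("text-classification")
# Model, dataset, and metric can all be given as names; the evaluator loads them itself
results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data="imdb",
    split="test[:100]",   # evaluate only a small slice for speed
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)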
Using evaluation suites
Evaluation suites group multiple evaluations, often targeting specific benchmarks such as GLUE. This lets you run a model against a standard set of tasks.
# Note: Running a full suite can be computationally intensive and time-consuming.
# This example demonstrates the concept but might take a long time or require significant resources.
# It also installs multiple datasets and may require specific model configurations.
import evaluate
try:
    print("\nLoading GLUE evaluation suite (this might download datasets)...")
    # Load the GLUE task directly
    # Using "mrpc" as an example task, but you can choose any other GLUE task
    task = evaluate.load("glue", "mrpc")  # Specify the task like "mrpc", "sst2", etc.
    print("Task loaded.")
    # You can now run the task on a model (for example: "distilbert-base-uncased")
    # WARNING: This might take time for inference or fine-tuning.
    # results = task.compute(model_or_pipeline="distilbert-base-uncased")
    # print("\nEvaluation Results (MRPC Task):")
    # print(results)
    print("Skipping model inference for brevity in this example.")
    print("Refer to Hugging Face documentation for full EvaluationSuite usage.")
except Exception as e:
    print(f"Could not load or run evaluation suite: {e}")
Output:
Loading GLUE evaluation suite (this might download datasets)...
Task loaded.
Skipping model inference for brevity in this example.
Refer to Hugging Face documentation for full EvaluationSuite usage.
Explanation:
- EvaluationSuite.load loads a predefined set of evaluation tasks (here, for demonstration, the code loads only the MRPC task of the GLUE benchmark via evaluate.load).
- The suite is then run against a model or pipeline, for example with suite.run(model_or_pipeline).
- The output is typically a list of dictionaries, each containing the results for one task in the suite. (Note: running this often requires specific environment setups and substantial compute time.)
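For an actual multi-task suite, the library exposes an EvaluationSuite class. A hedged sketch based on the small demo suite referenced in the Hugging Face documentation (the suite name and model are illustrative and both require downloads):
from evaluate import EvaluationSuite

# Load a predefined suite from the Hub (a small demo suite used in the docs)
suite = EvaluationSuite.load("evaluate/evaluation-suite-ci")
# Run every task in the suite against a model or pipeline
results = suite.run("distilbert-base-uncased-finetuned-sst-2-english")
# Usually a list of dictionaries, one per task in the suite
for task_result in results:
    print(task_result)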
Visualizing evaluation results
Visualizations help compare multiple models across different metrics. Radar charts are effective for this.
import evaluate
import matplotlib.pyplot as plt # Ensure matplotlib is installed
from evaluate.visualization import radar_plot
# Sample data for multiple models across several metrics
# Lower latency is better, so we might invert it or consider it separately.
data = [
    {"accuracy": 0.99, "precision": 0.80, "f1": 0.95, "latency_inv": 1/33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_inv": 1/11.2},
    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_inv": 1/87.6},
    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_inv": 1/101.6}
]
model_names = ["Model A", "Model B", "Model C", "Model D"]
# Generate the radar plot
# Higher values are generally better on a radar plot
try:
    # Generate radar plot (ensure you pass a correct format and that data is valid)
    plot = radar_plot(data=data, model_names=model_names)
    # Display the plot
    plt.show()  # Explicitly show the plot, might be necessary in some environments
    # To save the plot to a file (uncomment to use)
    # plot.savefig("model_comparison_radar.png")
    plt.close()  # Close the plot window after showing/saving
except ImportError:
    print("Visualization requires matplotlib. Run: pip install matplotlib")
except Exception as e:
    print(f"Could not generate plot: {e}")
Output:

Explanation:
- We prepare sample results for four models across accuracy, precision, F1, and inverted latency (so that higher is better).
- radar_plot creates a chart in which each axis represents a metric, visually showing how the models compare.
Saving evaluation results
You can save your evaluation results to a file, often in JSON format, for record keeping or later analysis.
import evaluate
from pathlib import Path
# Perform an evaluation
accuracy_metric = evaluate.load("accuracy")
result = accuracy_metric.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])
print(f"Result to save: {result}")
# Define hyperparameters or other metadata
hyperparams = {"model_name": "my_custom_model", "learning_rate": 0.001}
run_details = {"experiment_id": "run_42"}
# Combine results and metadata
save_data = {**result, **hyperparams, **run_details}
# Define save directory and filename
save_dir = Path("./evaluation_results")
save_dir.mkdir(exist_ok=True) # Create directory if it doesn't exist
# Use evaluate.save to store the results
try:
    saved_path = evaluate.save(save_directory=save_dir, **save_data)
    print(f"Results saved to: {saved_path}")
    # You can also manually save as JSON
    import json
    manual_save_path = save_dir / "manual_results.json"
    with open(manual_save_path, 'w') as f:
        json.dump(save_data, f, indent=4)
    print(f"Results manually saved to: {manual_save_path}")
except Exception as e:
    # Catch potential git-related errors if run outside a repo
    print(f"evaluate.save encountered an issue (possibly git related): {e}")
    print("Attempting manual JSON save instead.")
    import json
    manual_save_path = save_dir / "manual_results_fallback.json"
    with open(manual_save_path, 'w') as f:
        json.dump(save_data, f, indent=4)
    print(f"Results manually saved to: {manual_save_path}")
Output:
Result to save: {'accuracy': 0.5}
evaluate.save encountered an issue (possibly git related): save() missing 1 required positional argument: 'path_or_file'
Attempting manual JSON save instead.
Results manually saved to: evaluation_results/manual_results_fallback.json
Explanation:
- We combine the computed results dictionary with other metadata such as hyperparams.
- evaluate.save writes this data to a JSON file. It can also try to record git commit information when run inside a repository; in the run above it failed because the destination must be given as the positional path_or_file argument rather than a save_directory keyword.
- We include a fallback that manually saves the dictionary as a JSON file, which is often sufficient.
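Judging from the error message above, evaluate.save expects the destination as a positional path_or_file argument. A minimal corrected sketch under that assumption:
import evaluate

result = {"accuracy": 0.5}
hyperparams = {"model_name": "my_custom_model", "learning_rate": 0.001}
# The destination goes first as a positional argument;
# the keyword arguments become fields in the saved JSON record
saved_path = evaluate.save("./evaluation_results/", **result, **hyperparams)
print(f"Results saved to: {saved_path}")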
Choosing the right metric
Selecting the appropriate metric is crucial. Consider these points:
- Task type: Is it classification, translation, summarization, NER, or QA? Use the standard metrics for that task (accuracy/F1 for classification, BLEU/ROUGE for generation, seqeval for NER, squad for QA).
- Dataset: Some benchmarks (such as GLUE or SQuAD) have specific associated metrics. Leaderboards (for example, on Papers with Code) often show the metrics commonly used for particular datasets.
- Goal: Which aspect of performance matters most?
- Accuracy: Overall correctness (good for balanced classes).
- Precision/Recall/F1: Important for imbalanced classes or when false positives and false negatives have different costs.
- BLEU/ROUGE: Fluency and content overlap in text generation.
- Perplexity: How well a language model predicts a sample (lower is better; often used for generative models). A quick sketch follows this list.
- Metric cards: Read the Hugging Face metric cards (documentation) for detailed explanations, limitations, and appropriate use cases (for example, the BLEU and SQuAD cards).
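As the sketch promised above, here is a hedged illustration of the perplexity metric (it loads the model given in model_id, here gpt2, which is downloaded on first use; verify the arguments against the metric card):
import evaluate

# Perplexity scores text with a causal language model (here gpt2)
perplexity_metric = evaluate.load("perplexity", module_type="metric")
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Evaluating language models is essential.",
]
results = perplexity_metric.compute(model_id="gpt2", predictions=texts)
# Returns per-text perplexities and their mean; lower is better
print(f"Mean perplexity: {results['mean_perplexity']:.2f}")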
Conclusion
The Hugging Face Evaluate library offers a versatile and easy-to-use way to evaluate large language models and datasets. It provides standard metrics, dataset measurements, and tools such as evaluator and EvaluationSuite to streamline the process. By using these tools and choosing metrics appropriate to your task, you can gain clear insight into your model's strengths and weaknesses.
For more details and advanced usage, refer to the official Hugging Face Evaluate documentation.