Evaluating large language models (LLMs) is essential for understanding how well they perform and whether they meet your standards. The Hugging Face Evaluate library offers a useful set of tools for this task. This guide shows how to use the Evaluate library to evaluate LLMs, with practical code examples.
Understanding the Hugging Face Evaluate library
The Hugging Face Evaluate library provides tools for different evaluation needs. These tools fall into three main categories:
- Metrics: These measure a model's performance by comparing its predictions with ground-truth labels. Examples include accuracy, F1, BLEU, and ROUGE.
- Comparisons: These help compare two models, often by examining how their predictions align with each other or with reference labels.
- Measurements: These tools investigate properties of the datasets themselves, such as text complexity or label distributions.
You can access all of these evaluation modules with a single function: evaluate.load().
Getting started
Installation
First, install the library. Open your terminal or command prompt and run:
pip install evaluate
pip install rouge_score # Needed for text generation metrics
pip install evaluate[visualization] # For plotting capabilities
These commands install the core evaluate library, the rouge_score package (required for the ROUGE metric often used in summarization), and the optional visualization dependencies for plots such as radar charts.
Loading an evaluation module
To use a specific evaluation tool, load it by name. For example, to load the accuracy metric:
import evaluate
accuracy_metric = evaluate.load("accuracy")
print("Accuracy metric loaded.")
Output:
Accuracy metric loaded.
This code imports the evaluate library and loads the accuracy metric object. We will use this object to compute accuracy scores.
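The same evaluate.load() call also returns the other module types when you pass module_type. As a quick, hedged sketch, here is how a comparison module could be used to compare two models' prediction sets (the mcnemar module and its predictions1/predictions2 arguments are assumptions based on the library's comparison modules; adjust if your version differs):
import evaluate
# Load a comparison module to statistically compare two models' predictions
try:
    mcnemar = evaluate.load("mcnemar", module_type="comparison")
    references   = [0, 1, 0, 1, 1, 0]   # ground-truth labels
    predictions1 = [0, 1, 0, 1, 0, 0]   # model A's predictions
    predictions2 = [1, 1, 0, 1, 1, 0]   # model B's predictions
    results = mcnemar.compute(references=references,
                              predictions1=predictions1,
                              predictions2=predictions2)
    print(f"McNemar comparison result: {results}")
except Exception as e:
    print(f"Could not run the mcnemar comparison: {e}")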
Basic evaluation examples
Let's walk through some common evaluation scenarios.
Computing accuracy directly
You can compute a metric by providing all references (ground truth) and predictions at once.
import evaluate
# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")
# Sample ground truth and predictions
references = [0, 1, 0, 1]
predictions = [1, 0, 0, 1]
# Compute accuracy
result = accuracy_metric.compute(references=references, predictions=predictions)
print(f"Direct computation result: {result}")
# Example with exact_match metric
exact_match_metric = evaluate.load('exact_match')
match_result = exact_match_metric.compute(references=['hello world'], predictions=['hello world'])
no_match_result = exact_match_metric.compute(references=['hello'], predictions=['hell'])
print(f"Exact match result (match): {match_result}")
print(f"Exact match result (no match): {no_match_result}")
Output:

Explanation:
- We define two lists: references contains the correct labels, and predictions contains the model's outputs.
- The compute method takes these lists, calculates accuracy, and returns the result as a dictionary.
- We also show the exact_match metric, which checks whether the prediction matches the reference exactly.
Incremental evaluation (using add_batch)
For large datasets, processing predictions in batches can be more memory efficient. You can add batches incrementally and compute the final score at the end.
import evaluate
# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")
# Sample batches of references and predictions
references_batch1 = [0, 1]
predictions_batch1 = [1, 0]
references_batch2 = [0, 1]
predictions_batch2 = [0, 1]
# Add batches incrementally
accuracy_metric.add_batch(references=references_batch1, predictions=predictions_batch1)
accuracy_metric.add_batch(references=references_batch2, predictions=predictions_batch2)
# Compute final accuracy
final_result = accuracy_metric.compute()
print(f"Incremental computation result: {final_result}")
Output:

Explanation:
- We simulate processing data in two batches.
- add_batch updates the metric's internal state with each batch.
- Calling compute() without arguments calculates the metric over all added batches.
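In a real evaluation loop you would typically pull batches from a data loader or an inference loop. Here is a minimal sketch using the same add_batch API (the batches list below is a hypothetical stand-in for your own batch source):
import evaluate
accuracy_metric = evaluate.load("accuracy")
# Hypothetical batches, e.g. yielded by a DataLoader or an inference loop
batches = [
    ([0, 1], [1, 0]),
    ([0, 1], [0, 1]),
    ([1, 1], [1, 0]),
]
for refs, preds in batches:
    # Each call only updates the metric's internal state; nothing is scored yet
    accuracy_metric.add_batch(references=refs, predictions=preds)
# One compute() call aggregates every batch added so far
print(f"Accuracy over all batches: {accuracy_metric.compute()}")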
Combining multiple metrics
Often you want to compute several metrics at once (for example, accuracy, F1, precision, and recall for classification). The evaluate.combine function simplifies this.
import evaluate
# Combine multiple classification metrics
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
# Sample data
predictions = [0, 1, 0]
references = [0, 1, 1] # Note: The last prediction is incorrect
# Compute all metrics at once
results = clf_metrics.compute(predictions=predictions, references=references)
print(f"Combined metrics result: {results}")
Output:

Explanation:
- evaluate.combine takes a list of metric names and returns a combined evaluation object.
- Calling compute on this object calculates all of the specified metrics with the same input data.
Using measurements
Measurements can be used to analyze datasets. Here is how to use the word_length measurement:
import evaluate
# Load the word_length measurement
# Note: May require NLTK data download on first run
try:
    word_length = evaluate.load("word_length", module_type="measurement")
    data = ["hello world", "this is another sentence"]
    results = word_length.compute(data=data)
    print(f"Word length measurement result: {results}")
except Exception as e:
    print(f"Could not run word_length measurement, possibly NLTK data missing: {e}")
    print("Attempting NLTK download...")
    import nltk
    nltk.download('punkt')  # Download the required tokenizer data, then re-run
Output:

Explanation:
- We load word_length and specify module_type="measurement".
- The compute method takes the dataset (here, a list of strings) as input.
- It returns statistics about the word lengths in the provided data. (Note: this requires NLTK and its 'punkt' tokenizer data.)
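Another measurement that can help with classification datasets is label_distribution, which summarizes how labels are spread. This is a hedged sketch (the module name is an assumption; keep the try/except in case it is not available in your version):
import evaluate
# Hypothetical class labels from a dataset
labels = [0, 1, 1, 0, 1, 1, 1]
try:
    label_dist = evaluate.load("label_distribution", module_type="measurement")
    results = label_dist.compute(data=labels)
    print(f"Label distribution result: {results}")
except Exception as e:
    print(f"Could not run the label_distribution measurement: {e}")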
Evaluating specific NLP tasks
Different NLP tasks require specific metrics. Hugging Face Evaluate includes many of the standard ones.
Machine translation (BLEU)
BLEU (Bilingual Evaluation Understudy) is a common metric for translation quality. It measures the n-gram overlap between the model's translation (the hypothesis) and the reference translations.
import evaluate
def evaluate_machine_translation(hypotheses, references):
    """Calculates BLEU score for machine translation."""
    bleu_metric = evaluate.load("bleu")
    results = bleu_metric.compute(predictions=hypotheses, references=references)
    # Extract the main BLEU score
    bleu_score = results["bleu"]
    return bleu_score
# Example hypotheses (model translations)
hypotheses = ["the cat sat on mat.", "the dog played in garden."]
# Example references (correct translations, can have multiple per hypothesis)
references = [["the cat sat on the mat."], ["the dog played in the garden."]]
bleu_score = evaluate_machine_translation(hypotheses, references)
print(f"BLEU Score: {bleu_score:.4f}") # Format for readability
Output:

Explanation:
- The function loads the BLEU metric.
- It computes the score by comparing the predicted translations (hypotheses) against one or more correct references.
- A higher BLEU score (closer to 1.0) usually indicates better translation quality, suggesting more overlap with the reference translations. A score around 0.51 suggests moderate overlap.
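Because each hypothesis may have several acceptable translations, BLEU also accepts more than one reference per prediction. Here is a small sketch reusing the same metric (the sentences are illustrative):
import evaluate
bleu_metric = evaluate.load("bleu")
hypotheses = ["the cat sat on the mat."]
# Two acceptable reference translations for the single hypothesis
references = [[
    "the cat sat on the mat.",
    "there is a cat on the mat.",
]]
results = bleu_metric.compute(predictions=hypotheses, references=references)
print(f"BLEU with multiple references: {results['bleu']:.4f}")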
Named entity recognition (NER, using seqeval)
For sequence labeling tasks such as NER, per-entity-type precision, recall, and F1 scores are useful. The seqeval metric handles this format (for example, B-PER, I-PER, and O labels).
Running the following code requires the seqeval library, which you can install with:
pip install seqeval
Code:
import evaluate
# Load the seqeval metric
try:
    seqeval_metric = evaluate.load("seqeval")
    # Example labels (using IOB format)
    true_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']]
    predicted_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']]  # Example: Perfect prediction here
    results = seqeval_metric.compute(predictions=predicted_labels, references=true_labels)
    print("Seqeval Results (per entity type):")
    # Print results nicely
    for key, value in results.items():
        if isinstance(value, dict):
            print(f"  {key}: Precision={value['precision']:.2f}, Recall={value['recall']:.2f}, F1={value['f1']:.2f}, Number={value['number']}")
        else:
            print(f"  {key}: {value:.4f}")
except ModuleNotFoundError:
    print("Seqeval metric not installed. Run: pip install seqeval")
Output:

Explanation:
- We load the seqeval metric.
- It takes lists of lists, where each inner list contains the labels for one sentence.
- The compute method returns detailed precision, recall, and F1 scores for each identified entity type (such as PER for person or LOC for location), plus overall scores.
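To see how the scores react to mistakes, here is a hedged variant of the same seqeval call in which the location entity is deliberately mislabeled (the label sequences are illustrative):
import evaluate
seqeval_metric = evaluate.load("seqeval")  # requires: pip install seqeval
true_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']]
# The location entity in the second sentence is mislabeled as an organization
predicted_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'I-ORG', 'O']]
results = seqeval_metric.compute(predictions=predicted_labels, references=true_labels)
# PER is still perfect, LOC is missed entirely, so the overall scores drop
print(f"Overall F1: {results['overall_f1']:.2f}")
print(f"Overall precision: {results['overall_precision']:.2f}")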
Text summarization (ROUGE)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares a generated summary with reference summaries, focusing on overlapping n-grams and the longest common subsequence.
import evaluate
def simple_summarizer(text):
    """A very basic summarizer - just takes the first sentence."""
    try:
        sentences = text.split(".")
        return sentences[0].strip() + "." if sentences[0].strip() else ""
    except Exception:
        return ""  # Handle empty or malformed text
# Load ROUGE metric
rouge_metric = evaluate.load("rouge")
# Example text and reference summary
text = "Today is a beautiful day. The sun is shining and the birds are singing. I am going for a walk in the park."
reference = "The weather is pleasant today."
# Generate summary using the simple function
prediction = simple_summarizer(text)
print(f"Generated Summary: {prediction}")
print(f"Reference Summary: {reference}")
# Compute ROUGE scores
rouge_results = rouge_metric.compute(predictions=[prediction], references=[reference])
print(f"ROUGE Scores: {rouge_results}")
Output:
Generated Summary: Today is a beautiful day.
Reference Summary: The weather is pleasant today.
ROUGE Scores: {'rouge1': np.float64(0.4000000000000001), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.20000000000000004), 'rougeLsum': np.float64(0.20000000000000004)}
Explanation:
- We load the ROUGE metric.
- We define a simplistic summarizer for demonstration purposes.
- compute calculates the different ROUGE scores: rouge1 (unigram overlap), rouge2 (bigram overlap), rougeL (longest common subsequence), and rougeLsum.
- Scores closer to 1.0 indicate greater similarity to the reference summary. The low scores here reflect the basic nature of our simple_summarizer.
Question answering (SQuAD)
The squad metric is used for extractive question answering. It calculates Exact Match (EM) and F1 scores.
import evaluate
# Load the SQuAD metric
squad_metric = evaluate.load("squad")
# Example predictions and references format for SQuAD
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(f"SQuAD Results: {results}")
Output:

Explanation:
- We load the squad metric.
- It takes predictions and references in a specific dictionary format, including the predicted text and the ground-truth answers with their start positions.
- exact_match: the percentage of predictions that exactly match one of the ground-truth answers.
- f1: the average F1 score over all questions, accounting for partial matches at the token level.
Advanced evaluation with the Evaluator class
The evaluator class streamlines the process by integrating model loading, inference, and metric calculation. It is particularly useful for standard tasks such as text classification.
# Note: Requires transformers and datasets libraries
# pip install transformers datasets torch # or tensorflow/jax
import evaluate
from evaluate import evaluator
from transformers import pipeline
from datasets import load_dataset
# Load a pre-trained text classification pipeline
# Using a smaller model for potentially faster execution
try:
    pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", device=-1)  # Use CPU
except Exception as e:
    print(f"Could not load pipeline: {e}")
    pipe = None

if pipe:
    # Load a small subset of the IMDB dataset
    try:
        data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))  # Smaller subset for speed
    except Exception as e:
        print(f"Could not load dataset: {e}")
        data = None

    if data:
        # Load the accuracy metric
        accuracy_metric = evaluate.load("accuracy")
        # Create an evaluator for the task
        task_evaluator = evaluator("text-classification")
        # Correct label_mapping for IMDB dataset
        label_mapping = {
            'NEGATIVE': 0,  # Map NEGATIVE to 0
            'POSITIVE': 1   # Map POSITIVE to 1
        }
        # Compute results
        eval_results = task_evaluator.compute(
            model_or_pipeline=pipe,
            data=data,
            metric=accuracy_metric,
            input_column="text",    # Specify the text column
            label_column="label",   # Specify the label column
            label_mapping=label_mapping  # Pass the corrected label mapping
        )
        print("\nEvaluator Results:")
        print(eval_results)

        # Compute with bootstrapping for confidence intervals
        bootstrap_results = task_evaluator.compute(
            model_or_pipeline=pipe,
            data=data,
            metric=accuracy_metric,
            input_column="text",
            label_column="label",
            label_mapping=label_mapping,  # Pass the corrected label mapping
            strategy="bootstrap",
            n_resamples=10  # Use fewer resamples for faster demo
        )
        print("\nEvaluator Results with Bootstrapping:")
        print(bootstrap_results)
Output:
Device set to use cpu
Evaluator Results:
{'accuracy': 0.9, 'total_time_in_seconds': 24.277618517999997, 'samples_per_second': 4.119020155368932, 'latency_in_seconds': 0.24277618517999996}
Evaluator Results with Bootstrapping:
{'accuracy': {'confidence_interval': (np.float64(0.8703044820750653), np.float64(0.9335706530476571)), 'standard_error': np.float64(0.02412928142780514), 'score': 0.9}, 'total_time_in_seconds': 23.871316319000016, 'samples_per_second': 4.189128017226537, 'latency_in_seconds': 0.23871316319000013}
Explanation:
- We load a transformers pipeline for text classification and a sample of the IMDB dataset.
- We create an evaluator specifically for the "text-classification" task.
- The compute method handles feeding the data (text column) through the pipeline, obtaining predictions, comparing them with the true labels (label column) using the specified metric, and applying the label_mapping.
- It returns the metric score together with performance statistics such as total time and samples per second.
- Using strategy="bootstrap" performs resampling to estimate confidence intervals and the standard error for the metric, giving an idea of the score's stability.
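The evaluator can also work directly from Hub identifiers instead of pre-built objects. A hedged sketch (the model name, dataset name, and the split argument reflect common evaluator usage, but verify them against your installed version; downloads are required on first use):
from evaluate import evaluator

task_evaluator = evaluator("text-classification")
# Model, dataset, and metric can all be given as names; the evaluator loads them itself
results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data="imdb",
    split="test[:100]",   # evaluate only a small slice for speed
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)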
Using evaluation suites
Evaluation suites group multiple evaluations, often targeting specific benchmarks such as GLUE. This lets you run a model against a standard set of tasks.
# Note: Running a full suite can be computationally intensive and time-consuming.
# This example demonstrates the concept but might take a long time or require significant resources.
# It also installs multiple datasets and may require specific model configurations.
import evaluate
try:
    print("\nLoading GLUE evaluation suite (this might download datasets)...")
    # Load the GLUE task directly
    # Using "mrpc" as an example task, but you can choose any other GLUE task
    task = evaluate.load("glue", "mrpc")  # Specify the task like "mrpc", "sst2", etc.
    print("Task loaded.")
    # You can now run the task on a model (for example: "distilbert-base-uncased")
    # WARNING: This might take time for inference or fine-tuning.
    # results = task.compute(model_or_pipeline="distilbert-base-uncased")
    # print("\nEvaluation Results (MRPC Task):")
    # print(results)
    print("Skipping model inference for brevity in this example.")
    print("Refer to Hugging Face documentation for full EvaluationSuite usage.")
except Exception as e:
    print(f"Could not load or run evaluation suite: {e}")
Output:
Loading GLUE evaluation suite (this might download datasets)...
Task loaded.
Skipping model inference for brevity in this example.
Refer to Hugging Face documentation for full EvaluationSuite usage.
Explanation:
- EvaluationSuite.load loads a predefined set of evaluation tasks (here, for demonstration, the code loads only the MRPC task of the GLUE benchmark via evaluate.load).
- The suite is then run against a model or pipeline, for example with suite.run(model_or_pipeline).
- The output is typically a list of dictionaries, each containing the results for one task in the suite. (Note: running this often requires specific environment setups and substantial compute time.)
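For an actual multi-task suite, the library exposes an EvaluationSuite class. A hedged sketch based on the small demo suite referenced in the Hugging Face documentation (the suite name and model are illustrative and both require downloads):
from evaluate import EvaluationSuite

# Load a predefined suite from the Hub (a small demo suite used in the docs)
suite = EvaluationSuite.load("evaluate/evaluation-suite-ci")
# Run every task in the suite against a model or pipeline
results = suite.run("distilbert-base-uncased-finetuned-sst-2-english")
# Usually a list of dictionaries, one per task in the suite
for task_result in results:
    print(task_result)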
Visualizing evaluation results
Visualizations help compare multiple models across different metrics. Radar charts are effective for this.
import evaluate
import matplotlib.pyplot as plt # Ensure matplotlib is installed
from evaluate.visualization import radar_plot
# Sample data for multiple models across several metrics
# Lower latency is better, so we might invert it or consider it separately.
data = [
    {"accuracy": 0.99, "precision": 0.80, "f1": 0.95, "latency_inv": 1/33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_inv": 1/11.2},
    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_inv": 1/87.6},
    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_inv": 1/101.6}
]
model_names = ["Model A", "Model B", "Model C", "Model D"]
# Generate the radar plot
# Higher values are generally better on a radar plot
try:
    # Generate radar plot (ensure you pass a correct format and that data is valid)
    plot = radar_plot(data=data, model_names=model_names)
    # Display the plot
    plt.show()  # Explicitly show the plot, might be necessary in some environments
    # To save the plot to a file (uncomment to use)
    # plot.savefig("model_comparison_radar.png")
    plt.close()  # Close the plot window after showing/saving
except ImportError:
    print("Visualization requires matplotlib. Run: pip install matplotlib")
except Exception as e:
    print(f"Could not generate plot: {e}")
Output:

Explanation:
- We prepare sample results for four models across accuracy, precision, F1, and inverted latency (so that higher is better).
- radar_plot creates a chart in which each axis represents a metric, visually showing how the models compare.
Saving evaluation results
You can save your evaluation results to a file, often in JSON format, for record keeping or later analysis.
import evaluate
from pathlib import Path
# Perform an evaluation
accuracy_metric = evaluate.load("accuracy")
result = accuracy_metric.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])
print(f"Result to save: {result}")
# Define hyperparameters or other metadata
hyperparams = {"model_name": "my_custom_model", "learning_rate": 0.001}
run_details = {"experiment_id": "run_42"}
# Combine results and metadata
save_data = {**result, **hyperparams, **run_details}
# Define save directory and filename
save_dir = Path("./evaluation_results")
save_dir.mkdir(exist_ok=True) # Create directory if it doesn't exist
# Use evaluate.save to store the results
try:
    saved_path = evaluate.save(save_directory=save_dir, **save_data)
    print(f"Results saved to: {saved_path}")
    # You can also manually save as JSON
    import json
    manual_save_path = save_dir / "manual_results.json"
    with open(manual_save_path, 'w') as f:
        json.dump(save_data, f, indent=4)
    print(f"Results manually saved to: {manual_save_path}")
except Exception as e:
    # Catch potential git-related errors if run outside a repo
    print(f"evaluate.save encountered an issue (possibly git related): {e}")
    print("Attempting manual JSON save instead.")
    import json
    manual_save_path = save_dir / "manual_results_fallback.json"
    with open(manual_save_path, 'w') as f:
        json.dump(save_data, f, indent=4)
    print(f"Results manually saved to: {manual_save_path}")
Output:
Result to save: {'accuracy': 0.5}
evaluate.save encountered an issue (possibly git related): save() missing 1 required positional argument: 'path_or_file'
Attempting manual JSON save instead.
Results manually saved to: evaluation_results/manual_results_fallback.json
Explanation:
- We combine the computed results dictionary with other metadata such as hyperparams.
- evaluate.save writes this data to a JSON file. It can also try to record git commit information when run inside a repository; in the run above it failed because the destination must be given as the positional path_or_file argument rather than a save_directory keyword.
- We include a fallback that manually saves the dictionary as a JSON file, which is often sufficient.
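Judging from the error message above, evaluate.save expects the destination as a positional path_or_file argument. A minimal corrected sketch under that assumption:
import evaluate

result = {"accuracy": 0.5}
hyperparams = {"model_name": "my_custom_model", "learning_rate": 0.001}
# The destination goes first as a positional argument;
# the keyword arguments become fields in the saved JSON record
saved_path = evaluate.save("./evaluation_results/", **result, **hyperparams)
print(f"Results saved to: {saved_path}")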
Choosing the right metric
Selecting the appropriate metric is crucial. Consider these points:
- Task type: Is it classification, translation, summarization, NER, or QA? Use the standard metrics for that task (accuracy/F1 for classification, BLEU/ROUGE for generation, seqeval for NER, squad for QA).
- Dataset: Some benchmarks (such as GLUE or SQuAD) have specific associated metrics. Leaderboards (for example, on Papers with Code) often show the metrics commonly used for particular datasets.
- Goal: Which aspect of performance matters most?
- Accuracy: Overall correctness (good for balanced classes).
- Precision/Recall/F1: Important for imbalanced classes or when false positives and false negatives have different costs.
- BLEU/ROUGE: Fluency and content overlap in text generation.
- Perplexity: How well a language model predicts a sample (lower is better; often used for generative models). A quick sketch follows this list.
- Metric cards: Read the Hugging Face metric cards (documentation) for detailed explanations, limitations, and appropriate use cases (for example, the BLEU and SQuAD cards).
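As the sketch promised above, here is a hedged illustration of the perplexity metric (it loads the model given in model_id, here gpt2, which is downloaded on first use; verify the arguments against the metric card):
import evaluate

# Perplexity scores text with a causal language model (here gpt2)
perplexity_metric = evaluate.load("perplexity", module_type="metric")
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Evaluating language models is essential.",
]
results = perplexity_metric.compute(model_id="gpt2", predictions=texts)
# Returns per-text perplexities and their mean; lower is better
print(f"Mean perplexity: {results['mean_perplexity']:.2f}")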
Conclusion
The Hugging Face Evaluate library offers a versatile and easy-to-use way to evaluate large language models and datasets. It provides standard metrics, dataset measurements, and tools such as evaluator and EvaluationSuite to streamline the process. By using these tools and choosing metrics appropriate to your task, you can gain clear insight into your model's strengths and weaknesses.
For more details and advanced usage, refer to the official Hugging Face Evaluate documentation.