In the dynamic realm of artificial intelligence, natural language processing (NLP), and information retrieval, advanced architectures such as retrieval-augmented generation (RAG) have gained significant attention. However, many data science researchers advise against launching into sophisticated RAG models until the evaluation process is reliable and robust.
Carefully evaluating RAG pipelines is vital but often overlooked in the rush to incorporate cutting-edge features. Researchers and practitioners should strengthen their evaluation setup as a top priority before tackling complex model improvements.
Understanding the nuances of RAG pipeline evaluation is critical because these systems depend on both retrieval quality and generation capability. The evaluation dimensions fall into two categories, as follows.
1. Retrieval dimensions
a. Context Precision: Measures whether every ground-truth item in the retrieved contexts is ranked higher than the irrelevant items.
b. Context Recall: Measures the extent to which the retrieved context aligns with the ground-truth answer. It is computed from the retrieved context and the ground truth.
c. Context Relevance: Evaluates how relevant the retrieved contexts are to the given question.
d. Context Entity Recall: Measures the recall of the retrieved context by comparing the number of entities present in both the ground truth and the contexts against the number of entities present in the ground truth alone. A simplified sketch of context recall and context entity recall follows this list.
e. Noise Robustness: Evaluates the model's ability to handle noise documents that are related to the question but carry little useful information.
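To make the retrieval dimensions above concrete, here is a minimal, simplified sketch of context recall and context entity recall based on plain string and set overlap. Frameworks such as Ragas compute these with LLM judgments and NER models rather than exact matching, so treat this only as an illustration of the two ratios; the function names and toy data are illustrative, not a real framework API.

```python
# Simplified, illustrative retrieval metrics. Production frameworks use
# LLM judges or NER models instead of the naive matching shown here.

def context_recall(ground_truth: str, contexts: list[str]) -> float:
    """Fraction of ground-truth sentences that appear (after lowercasing)
    somewhere in the retrieved contexts."""
    sentences = [s.strip() for s in ground_truth.split(".") if s.strip()]
    if not sentences:
        return 0.0
    joined = " ".join(contexts).lower()
    supported = sum(1 for s in sentences if s.lower() in joined)
    return supported / len(sentences)

def context_entity_recall(gt_entities: set[str], context_entities: set[str]) -> float:
    """Entities present in both the ground truth and the contexts, divided
    by the entities present in the ground truth alone."""
    if not gt_entities:
        return 0.0
    return len(gt_entities & context_entities) / len(gt_entities)

# Toy example:
contexts = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
print(context_recall("The Eiffel Tower is in Paris", contexts))        # 1.0
print(context_entity_recall({"Eiffel Tower", "Paris"}, {"Paris"}))     # 0.5
```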
2. Generation dimensions
a. Faithfulness: Measures the factual consistency of the generated answer against the given context (see the sketch after this list).
b. Answer Relevance: Measures how well the generated answer addresses the given question. Answers that are incomplete or contain redundant information receive lower scores, and vice versa.
c. Negative Rejection: Evaluates the model's ability to withhold an answer when the retrieved documents do not contain enough information to answer the query.
d. Information Integration: Evaluates how well the model can integrate information from multiple documents to answer complex questions.
e. Counterfactual Robustness: Evaluates the model's ability to recognize and disregard known factual errors in retrieved documents, even when it is aware of possible misinformation.
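Generation dimensions are usually scored by an LLM judge. As a rough illustration of the faithfulness ratio only, the sketch below treats an answer sentence as grounded when enough of its words appear in the retrieved context; real implementations decompose the answer into claims and verify each claim with a model, and the `threshold` value here is an arbitrary assumption.

```python
# Rough word-overlap approximation of faithfulness. Real implementations
# extract claims from the answer and verify each against the context with
# an LLM judge; this sketch only checks per-sentence word overlap.

def faithfulness_score(answer: str, contexts: list[str], threshold: float = 0.5) -> float:
    context_words = set(" ".join(contexts).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        overlap = len(words & context_words) / len(words)
        if overlap >= threshold:  # treat the sentence as grounded in the context
            supported += 1
    return supported / len(sentences)

contexts = ["The Great Wall of China is over 21,000 km long."]
print(faithfulness_score("The Great Wall is over 21,000 km long", contexts))  # 1.0
print(faithfulness_score("It was built by the Romans", contexts))             # 0.0
```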
Below are some frameworks that implement these dimensions, which can be accessed through the following links; a minimal Ragas usage sketch follows the list.
1. Ragas – https://docs.ragas.io/en/stable/
2. TruLens – https://www.trulens.org/
3. ARES – https://ares-ai.vercel.app/
4. DeepEval – https://docs.confident-ai.com/docs/getting-started
5. Tonic Validate – https://docs.tonic.ai/validate
6. LangFuse – https://langfuse.com/
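As a concrete starting point, the sketch below shows how an evaluation run over a single sample might look with Ragas, the first framework listed above. It assumes `pip install ragas datasets`, an OpenAI API key in the environment for the judge model, and the metric imports from the pre-1.0 Ragas docs, which may differ in newer releases; consult the linked documentation for the current API.

```python
# Evaluating a toy RAG sample with Ragas. Assumes `pip install ragas datasets`
# and an OPENAI_API_KEY in the environment; the imports below follow the
# pre-1.0 API shown in the Ragas docs and may change in newer releases.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

samples = {
    "question": ["When was the Eiffel Tower completed?"],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "contexts": [["The Eiffel Tower, completed in 1889, is in Paris."]],
    "ground_truth": ["The Eiffel Tower was completed in 1889."],
}

dataset = Dataset.from_dict(samples)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```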
This article is inspired by this LinkedIn post.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.