The results presented in Table 1 seem very attractive, at least to me. The simple evolution works very well. In the case of the reasoning evolution, the first part of the question is answered perfectly, but the second part is left unanswered. Examining the Wikipedia page [3] shows that there is no answer to the second part of the question in the actual document, so this can also be read as restraining hallucinations, a good thing in itself. The multi-context question-answer pair looks very good. The conditional type of evolution is acceptable if we look at the question-answer pair. One way to look at these results is that there is always room for better prompt engineering behind the evolutions. Another is to use better LLMs, especially for the critic role, as is the default in the ragas library.
Metrics
The ragas library not only generates synthetic evaluation sets, but also provides built-in metrics for component-wise as well as end-to-end evaluation of RAGs.
At the time of writing, RAGAs provides eight out-of-the-box metrics for RAG evaluation (see Image 2), and new ones are likely to be added in the future. In general, you should choose the metrics most suitable for your use case. However, I recommend selecting the single most important metric, that is:
Answer Correctness — the end-to-end metric with scores between 0 and 1, the higher the better, measuring the accuracy of the generated answer compared to the ground truth.
Focusing on one end-to-end metric helps you start optimizing your RAG system as quickly as possible. Once you achieve some quality improvements, you can look at component metrics, focusing on the most important one for each RAG component (a minimal usage sketch for all three metrics follows after the list):
Faithfulness — the generation metric with scores between 0 and 1, the higher the better, measuring the factual consistency of the generated answer with respect to the provided context. It is about grounding the generated answer as much as possible in the provided context and, in doing so, preventing hallucinations.
Context Relevancy — the retrieval metric with scores between 0 and 1, the higher the better, measuring the relevancy of the retrieved context with respect to the question.
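As a quick orientation, here is a minimal sketch of how these three metrics can be computed with the ragas evaluate function on a Hugging Face Dataset with the columns question, answer, contexts and ground_truth (the same evaluation flow used later in this article); the toy dataset contents are placeholders:

# Sketch: evaluating the three recommended metrics with ragas
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, faithfulness, context_relevancy

# A toy single-row evaluation set in the format RAGAs expects
eval_dataset = Dataset.from_dict({
    "question": ["What happened in Minneapolis to the bridge?"],
    "answer": ["The I-35W bridge collapsed into the Mississippi River."],
    "contexts": [["The I-35W Mississippi River bridge collapsed on August 1, 2007."]],
    "ground_truth": ["The I-35W bridge collapsed during the evening rush hour."],
})

result = evaluate(eval_dataset,
                  metrics=[answer_correctness, faithfulness, context_relevancy])
print(result)  # dict-like scores, e.g. answer_correctness, faithfulness, context_relevancy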
RAG factory
Okay, so we have a RAG ready for optimization… not so fast, this is not enough. To optimize the RAG we need a factory function that generates RAG chains for a given set of RAG hyperparameters. We define this factory function in 2 steps:
Step 1: A function to store documents in the vector database.
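Before the function itself, note that the code in both steps assumes imports along the following lines (a sketch; the exact module paths depend on your langchain and chromadb versions):

# Assumed imports for the factory functions below (module paths may differ between library versions)
from operator import itemgetter
import chromadb
from chromadb.api.models.Collection import Collection as ChromaCollection
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import (RunnablePassthrough, RunnableParallel,
                                      RunnableSequence)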
# Defining a function to get a document collection from the vector db with given hyperparameters
# The function embeds the documents only if the collection is missing
# This is a development version; for production one would rather implement a document-level check
def get_vectordb_collection(chroma_client,
                            documents,
                            embedding_model="text-embedding-ada-002",
                            chunk_size=None, overlap_size=0) -> ChromaCollection:

    if chunk_size is None:
        collection_name = "full_text"
        docs_pp = documents
    else:
        collection_name = f"{embedding_model}_chunk{chunk_size}_overlap{overlap_size}"
        text_splitter = CharacterTextSplitter(
            separator=".",
            chunk_size=chunk_size,
            chunk_overlap=overlap_size,
            length_function=len,
            is_separator_regex=False,
        )
        docs_pp = text_splitter.transform_documents(documents)

    embedding = OpenAIEmbeddings(model=embedding_model)

    langchain_chroma = Chroma(client=chroma_client,
                              collection_name=collection_name,
                              embedding_function=embedding,
                              )

    existing_collections = [collection.name for collection in chroma_client.list_collections()]

    if chroma_client.get_collection(collection_name).count() == 0:
        langchain_chroma.from_documents(collection_name=collection_name,
                                        documents=docs_pp,
                                        embedding=embedding)
    return langchain_chroma
Step 2: A function to generate the RAG in LangChain from the document collection, i.e. the proper RAG factory function.
# Defining a function to get a simple RAG as a LangChain chain with given hyperparameters
# The RAG also returns the retrieved context documents for evaluation purposes in RAGAs
def get_chain(chroma_client,
              documents,
              embedding_model="text-embedding-ada-002",
              llm_model="gpt-3.5-turbo",
              chunk_size=None,
              overlap_size=0,
              top_k=4,
              lambda_mult=0.25) -> RunnableSequence:

    vectordb_collection = get_vectordb_collection(chroma_client=chroma_client,
                                                  documents=documents,
                                                  embedding_model=embedding_model,
                                                  chunk_size=chunk_size,
                                                  overlap_size=overlap_size)

    retriever = vectordb_collection.as_retriever(top_k=top_k, lambda_mult=lambda_mult)

    template = """Answer the question based only on the following context.
    If the context doesn't contain entities present in the question say you don't know.

    {context}

    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)
    llm = ChatOpenAI(model=llm_model)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    chain_from_docs = (
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | prompt
        | llm
        | StrOutputParser()
    )

    chain_with_context_and_ground_truth = RunnableParallel(
        context=itemgetter("question") | retriever,
        question=itemgetter("question"),
        ground_truth=itemgetter("ground_truth"),
    ).assign(answer=chain_from_docs)

    return chain_with_context_and_ground_truth
The first function, get_vectordb_collection, is complemented by the second function, get_chain, which generates our RAG chain for a given set of parameters, i.e.: embedding_model, llm_model, chunk_size, overlap_size, top_k, lambda_mult. With our factory function we are only scratching the surface of the possibilities of which hyperparameters of our RAG system we could optimize. Also note that the RAG chain requires 2 arguments: question and ground_truth, where the latter is simply passed through the chain, as it is needed for evaluation with RAGAs.
# Setting up a ChromaDB client
chroma_client = chromadb.EphemeralClient()

# Testing the full-text RAG
with warnings.catch_warnings():
    rag_prototype = get_chain(chroma_client=chroma_client,
                              documents=news,
                              chunk_size=1000,
                              overlap_size=200)

rag_prototype.invoke({"question": 'What happened in Minneapolis to the bridge?',
                      "ground_truth": "x"})["answer"]
RAG evaluation
To evaluate our RAG we will use a diverse dataset of CNN and Daily Mail news articles, which is available on Hugging Face [4]. Most articles in this dataset are shorter than 1000 words. Additionally, we will use only a tiny extract of the dataset with just 100 news articles. This is all done to limit the costs and time needed to run the demo.
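For reference, here is a minimal sketch of how such a 100-article extract could be pulled from the Hugging Face cnn_dailymail dataset and wrapped as LangChain documents (the news variable used earlier); the exact subset and preprocessing used in this article may differ:

# Sketch: pulling 100 CNN / Daily Mail articles and wrapping them as LangChain documents
from datasets import load_dataset
from langchain_core.documents import Document

cnn_dailymail = load_dataset("cnn_dailymail", "3.0.0", split="train[:100]")
news = [Document(page_content=article["article"], metadata={"id": article["id"]})
        for article in cnn_dailymail]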
# Getting the synthetic evaluation set based on the tiny extract of the CNN / Daily Mail dataset
synthetic_evaluation_set_url = "https://gist.github.com/gox6/0858a1ae2d6e3642aa132674650f9c76/raw/synthetic-evaluation-set-cnn-daily-mail.csv"
synthetic_evaluation_set_pl = pl.read_csv(synthetic_evaluation_set_url, separator=",").drop("index")

# Train/test split
# We need at least 2 sets: train and test for RAG optimization
shuffled = synthetic_evaluation_set_pl.sample(fraction=1,
                                              shuffle=True,
                                              seed=6)
test_fraction = 0.5

test_n = round(len(synthetic_evaluation_set_pl) * test_fraction)
train, test = (shuffled.head(-test_n),
               shuffled.tail(test_n))
Since we will be considering many different RAG prototypes beyond the one defined above, we need a function to collect the answers generated by the RAG for our synthetic evaluation set (a short usage example follows the function):
# We create a helper function to generate the RAG answers together with the ground truth based on the synthetic evaluation set
# The dataset for RAGAs evaluation should contain the columns: question, answer, ground_truth, contexts
# RAGAs expects the data in Hugging Face Dataset format
def generate_rag_answers_for_synthetic_questions(chain,
                                                 synthetic_evaluation_set) -> pl.DataFrame:
    df = pl.DataFrame()

    for row in synthetic_evaluation_set.iter_rows(named=True):
        rag_output = chain.invoke({"question": row["question"],
                                   "ground_truth": row["ground_truth"]})
        rag_output["contexts"] = [doc.page_content for doc
                                  in rag_output["context"]]
        del rag_output["context"]
        rag_output_pp = {k: [v] for k, v in rag_output.items()}
        df = pl.concat([df, pl.DataFrame(rag_output_pp)], how="vertical")

    return df
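For instance, the helper can be applied to the prototype chain and the test split defined above (a usage sketch, not part of the optimization loop):

# Sketch: collecting the prototype's answers on the test split and converting them for RAGAs
prototype_answers_pl = generate_rag_answers_for_synthetic_questions(rag_prototype, test)
prototype_answers_hf = Dataset.from_pandas(prototype_answers_pl.to_pandas())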
RAG optimization with RAGAs and Optuna
First of all, it is worth emphasizing that proper optimization of the RAG system should involve global optimization, where all parameters are optimized at once, in contrast to the sequential or greedy approach, where parameters are optimized one by one. The sequential approach ignores the fact that there may be interactions between the parameters, which may result in a suboptimal solution.
Now we are finally ready to optimize our RAG system. We will use Optuna, a hyperparameter optimization framework. To this end, we define the objective function for the Optuna study, specifying the allowed hyperparameter space and computing the evaluation metric; see the code below:
def objective(trial):

    embedding_model = trial.suggest_categorical(name="embedding_model",
                                                choices=["text-embedding-ada-002", "text-embedding-3-small"])
    chunk_size = trial.suggest_int(name="chunk_size",
                                   low=500,
                                   high=1000,
                                   step=100)
    overlap_size = trial.suggest_int(name="overlap_size",
                                     low=100,
                                     high=400,
                                     step=50)
    top_k = trial.suggest_int(name="top_k",
                              low=1,
                              high=10,
                              step=1)

    challenger_chain = get_chain(chroma_client,
                                 news,
                                 embedding_model=embedding_model,
                                 llm_model="gpt-3.5-turbo",
                                 chunk_size=chunk_size,
                                 overlap_size=overlap_size,
                                 top_k=top_k,
                                 lambda_mult=0.25)

    challenger_answers_pl = generate_rag_answers_for_synthetic_questions(challenger_chain, train)
    challenger_answers_hf = Dataset.from_pandas(challenger_answers_pl.to_pandas())

    challenger_result = evaluate(challenger_answers_hf,
                                 metrics=[answer_correctness],
                                 )

    return challenger_result["answer_correctness"]
Finally, having the objective function, we define and execute the study to optimize our RAG system in Optuna. It is worth noting that we can add our educated guesses about hyperparameters to the study with the enqueue_trial method, as well as limit the study by time or number of trials; see the Optuna docs for more tips.
sampler = optuna.samplers.TPESampler(seed=6)
study = optuna.create_study(study_name="RAG Optimisation",
                            direction="maximize",
                            sampler=sampler)
study.set_metric_names(["answer_correctness"])

educated_guess = {"embedding_model": "text-embedding-3-small",
                  "chunk_size": 1000,
                  "overlap_size": 200,
                  "top_k": 3}

study.enqueue_trial(educated_guess)

print(f"Sampler is {study.sampler.__class__.__name__}")
study.optimize(objective, timeout=180)
The educated guess was not confirmed in our study, but I am sure that with a rigorous approach like the one proposed above things will improve.
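The results below can be read off the finished study, for example along these lines (a minimal sketch using Optuna's standard study attributes):

# Sketch: inspecting the finished study
print(f"Best trial with answer_correctness: {study.best_value}")
print(f"Hyper-parameters for the best trial: {study.best_params}")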
Best trial with answer_correctness: 0.700130617593832
Hyper-parameters for the best trial: {'embedding_model': 'text-embedding-ada-002', 'chunk_size': 700, 'overlap_size': 400, 'top_k': 9}
Limitations of RAGAs
After experimenting with the ragas library to synthesize evaluation sets and evaluate RAG, I have a few caveats:
- The question may contain the answer.
- The ground truth is just a literal extract from the document.
- Problems with RateLimitError and network overloads in Colab.
- Built-in evolutions are few, and there's no easy way to add new ones.
- There is room for improvement in the documentation.
The first two caveats are related to quality. Their main cause may lie in the LLM used, and, obviously, GPT-4 gives better results than GPT-3.5-turbo. At the same time, it seems that this could be improved by some prompt engineering of the evolutions used to generate synthetic evaluation sets.
Regarding problems with rate limiting and network overloads, it is advisable to use: 1) checkpointing during the generation of synthetic evaluation sets to avoid losing the data created so far, and 2) exponential backoff to make sure all the tasks complete.
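As an illustration of the second point, here is a minimal sketch of exponential backoff using the tenacity library around a single chain call (the retry parameters are arbitrary and the wrapper function is hypothetical):

# Sketch: retrying a flaky call with exponential backoff (parameters are arbitrary)
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, min=4, max=60),
       stop=stop_after_attempt(5))
def invoke_with_backoff(chain, payload):
    # Retried with exponentially growing waits on rate-limit or network errors
    return chain.invoke(payload)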
Lastly, and most importantly, more built-in evolutions would be welcome in the ragas package, not to mention the possibility of creating custom evolutions more easily.
Other useful features of RAGAs
- Custom prompts. The ragas package gives you the option to change the prompts used in the provided abstractions. An example of custom prompts for metrics in the evaluation task is described in the docs. I use custom prompts to modify the evolutions and mitigate quality issues.
- Automatic language adaptation. RAGAs has you covered for languages other than English: it has a great feature called automatic language adaptation supporting RAG evaluation in non-English languages; see the docs for more information.
Conclusions
Despite RAGAs' limitations, do NOT miss the most important thing:
RAGAs is already a very useful tool despite its young age. It enables the generation of synthetic evaluation sets for rigorous RAG evaluation, a critical aspect of successful RAG development.