How to create a RAG evaluation data set from documents | by Dr. León Eversberg | November 2024

Automatically create domain-specific datasets in any language using LLMs

Our auto-generated RAG evaluation dataset on the Hugging Face Hub, shown as a dataset card (input PDF file from the European Union, licensed under CC BY 4.0). Image by the author.

In this article, I will show you how to create your own RAG dataset consisting of contexts, questions and answers from documents in any language.

Retrieval-Augmented Generation (RAG) (1) is a technique that allows LLMs to access an external knowledge base.

By storing the contents of PDF files in a vector database, we can retrieve this knowledge with a vector similarity search and then insert the retrieved text into the LLM prompt as additional context.

This provides the LLM with new knowledge and reduces the possibility of the LLM making up facts (hallucinations).
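The retrieval step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the list `chunks` stands in for a vector database, and the names `retrieve` and `build_prompt` are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Insert the retrieved text into the prompt as additional context."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


# Assumption: in a real pipeline these chunks would come from the PDF files.
chunks = [
    "The vector database stores one embedding per text chunk.",
    "Bananas are rich in potassium.",
    "Hallucinations are facts made up by the LLM.",
]
question = "What does the vector database store?"
contexts = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, contexts)
print(prompt)
```

In a real pipeline, `embed` would call an embedding model and the sorted list would be replaced by a vector database query, but the flow stays the same: embed the question, rank the chunks by similarity, and paste the best ones into the prompt.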

The basic RAG pipeline. Image by the author, from the article “How to build an open source local LLM chatbot with RAG”.

However, there are many parameters we need to set in a RAG pipeline, and researchers are always suggesting new improvements. How do we know which parameters to choose and which methods will actually improve performance for our particular use case?

That's why we need a validation/development/test dataset to evaluate our RAG pipeline. The dataset must be from the domain we are interested in…
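To make the idea concrete, here is a minimal sketch of how such a dataset of contexts, questions and answers could be used to compare one pipeline parameter. Everything in it is an assumption for illustration: the `EvalRecord` fields, the word-overlap retriever, the sample records, and the choice of retrieval hit rate as the metric are not taken from the article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalRecord:
    """One evaluation example: a source context, a question, and its answer."""
    context: str
    question: str
    answer: str


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,?!") for word in text.lower().split()}


def make_overlap_retriever(chunks: list[str]) -> Callable[[str, int], list[str]]:
    """Toy retriever (assumption): ranks chunks by shared words with the question."""
    def retriever(question: str, top_k: int) -> list[str]:
        query = tokenize(question)
        ranked = sorted(chunks, key=lambda c: len(query & tokenize(c)), reverse=True)
        return ranked[:top_k]
    return retriever


def hit_rate(records: list[EvalRecord],
             retriever: Callable[[str, int], list[str]],
             top_k: int) -> float:
    """Fraction of questions whose source context is among the top_k results."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r.context in retriever(r.question, top_k))
    return hits / len(records)


# Hypothetical sample data standing in for a generated evaluation dataset.
chunks = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Rhine flows through Basel and Cologne.",
]
records = [
    EvalRecord(chunks[0], "What is the capital of France?", "Paris"),
    EvalRecord(chunks[1], "What is the capital of Germany?", "Berlin"),
    EvalRecord(chunks[2], "What is the river of Cologne?", "The Rhine"),
]

retriever = make_overlap_retriever(chunks)
for top_k in (1, 2, 3):
    print(f"top_k={top_k}: hit rate = {hit_rate(records, retriever, top_k):.2f}")
```

The same loop works for any parameter of the pipeline: swap in a different retriever, chunk size, or embedding model and compare the scores on the same records.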
