In this article, I will show you how to create your own RAG dataset consisting of contexts, questions and answers from documents in any language.
Retrieval-Augmented Generation (RAG) (1) is a technique that allows LLMs to access an external knowledge base.
By uploading documents such as PDF files, embedding their text, and storing the embeddings in a vector database, we can retrieve the relevant passages with a vector similarity search and then insert the retrieved text into the LLM prompt as additional context.
This gives the LLM knowledge it was not trained on and reduces the risk that it makes up facts (hallucinations).
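To make the retrieval step concrete, here is a minimal sketch of the idea in plain Python. It is not a real pipeline: the `embed` function is a toy word-count stand-in for an embedding model, the in-memory list stands in for a vector database, and the names `retrieve` and `build_prompt` as well as the sample chunks are my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts (assumption)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Insert the retrieved text into the prompt as additional context."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


# Hypothetical document chunks standing in for text extracted from PDF files.
chunks = [
    "The warranty period for the device is two years.",
    "The device ships with a USB-C charging cable.",
    "Support is available by email on weekdays.",
]
question = "How long is the warranty period?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real setup, a learned embedding model and a vector database would replace `embed` and the sorted list, but the flow stays the same: embed the question, rank the stored chunks by similarity, and paste the best ones into the prompt.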
However, there are many parameters we need to set in a RAG pipeline, and researchers are always suggesting new improvements. How do we know which parameters to choose and which methods will actually improve performance for our particular use case?
That's why we need a validation/development/test dataset to evaluate our RAG pipeline. The dataset must be from the domain we are interested in…