Embeddings are vector representations that capture the semantic meaning of words or sentences. Besides having quality data, choosing a good embedding model is the most important and underrated step in optimizing your RAG application. Multilingual use cases are especially challenging, as most models are pre-trained on English data. The right embedding model makes a big difference – don't just go with the first model you see!
The semantic space determines the relationships between words and concepts. An accurate semantic space improves retrieval performance; inaccurate embeddings lead to irrelevant chunks being retrieved or to missing information. A better model therefore directly improves the capabilities of your RAG system.
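As a quick illustration, here is a minimal sketch of how a good semantic space shows up as cosine similarity: a question should score higher against a relevant chunk than against an irrelevant one. It assumes the sentence-transformers library, and the model name and sentences are only examples:

```python
# Minimal sketch: a good embedding model should score the relevant chunk
# higher than the irrelevant one. The model name is only an example.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

question = "Quelle est la capitale de la France ?"
relevant = "Paris est la capitale de la France."
irrelevant = "La facture doit être réglée sous trente jours."

emb = model.encode([question, relevant, irrelevant])

print(cos_sim(emb[0], emb[1]))  # expected: high similarity
print(cos_sim(emb[0], emb[2]))  # expected: low similarity
```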
In this article, we will create a question-and-answer dataset from PDF documents to find the best model for our task and language. During retrieval, if the chunk containing the expected answer is recovered, it means that the embedding model placed the question and the answer close enough together in the semantic space.
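In code, the evaluation idea looks roughly like the sketch below. The data structures are assumptions about how the dataset might be stored, not a fixed API: qa_pairs pairs each generated question with the index of the chunk it came from, and chunks holds the text extracted from the PDFs.

```python
# Hedged sketch of the evaluation loop: embed all chunks and questions with a
# candidate model, retrieve the top-k chunks per question, and count a hit
# whenever the chunk the question was generated from is recovered.
import numpy as np
from sentence_transformers import SentenceTransformer

def hit_rate(model_name: str,
             qa_pairs: list[tuple[str, int]],  # (question, index of source chunk)
             chunks: list[str],
             k: int = 5) -> float:
    model = SentenceTransformer(model_name)
    chunk_emb = model.encode(chunks, normalize_embeddings=True)
    q_emb = model.encode([q for q, _ in qa_pairs], normalize_embeddings=True)
    scores = q_emb @ chunk_emb.T                # cosine similarity (normalized vectors)
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of the k closest chunks
    return sum(src in top_k[i] for i, (_, src) in enumerate(qa_pairs)) / len(qa_pairs)
```

Running this for each candidate model gives a single hit-rate number per model, which makes the comparison straightforward.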
While we focus on French and Italian, the process can be adapted to any language, because the best embedding model may differ from one language to another.
Embedding models
There are two main types of embedding models: static and dynamic. Static embeddings like word2vec generate one vector per word. The word vectors are then combined, often by averaging, to create a final sentence embedding. These embeddings are no longer frequently used in production because they do not consider how the meaning of a word can change based on the words around it.
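For illustration, here is a small sketch of the static approach, using a pretrained GloVe model loaded through gensim (the model name and sentences are just examples):

```python
# Static embeddings: one fixed vector per word, averaged into a sentence vector.
import numpy as np
import gensim.downloader

word_vectors = gensim.downloader.load("glove-wiki-gigaword-50")

def sentence_embedding(sentence: str) -> np.ndarray:
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

# "bank" gets the exact same vector in both sentences, even though its
# meaning differs: the core weakness of static embeddings.
v1 = sentence_embedding("she sat on the river bank")
v2 = sentence_embedding("he opened an account at the bank")
```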
Dynamic embeddings are based on Transformers like BERT, which incorporate context awareness through self-attention layers, allowing them to represent words based on their surrounding context.
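A rough sketch of this context-dependence, using the standard bert-base-uncased checkpoint from Hugging Face transformers (the sentences are illustrative):

```python
# Contextual embeddings: the same word gets a different vector depending on
# its neighbors, because self-attention mixes in the surrounding tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = inputs["input_ids"][0].tolist().index(word_id)
    return hidden[position]

v1 = word_vector("she sat on the river bank", "bank")
v2 = word_vector("he opened an account at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```

Unlike the static sketch above, the two "bank" vectors here differ, because each one reflects the sentence it appears in.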
Most current embedding models are trained with contrastive learning: the model learns semantic similarity by seeing positive and negative text pairs during training.
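As a sketch of what such training looks like, here is a minimal contrastive fine-tuning loop with sentence-transformers. The training pairs are made up for illustration, and in-batch negatives stand in for explicit negative pairs:

```python
# Contrastive learning sketch: MultipleNegativesRankingLoss pulls each
# (question, answer) pair together and pushes the question away from the
# other answers in the batch, which act as negatives.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

train_examples = [
    InputExample(texts=["Comment réinitialiser mon mot de passe ?",
                        "Allez dans les paramètres et cliquez sur 'Réinitialiser'."]),
    InputExample(texts=["Qual è la politica di rimborso?",
                        "I rimborsi sono accettati entro 30 giorni dall'acquisto."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```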