In this article, my goal is to explain how and why it is beneficial to use a large language model (LLM) for fragment-based information retrieval.
I use OpenAI's GPT-4 model as an example, but this approach can be applied with any other LLM, such as those from Hugging Face, Claude, and others.
Everyone can access this article for free.
Standard Information Retrieval Considerations
The main concept involves having a list of documents (pieces of text) stored in a database, which can be retrieved based on filters and conditions.
Typically, a tool that enables hybrid search (such as Azure AI Search, LlamaIndex, etc.) is used, which allows you to:
- perform a text-based search using term-frequency algorithms such as TF-IDF or BM25;
- perform a vector-based search, which identifies similar concepts even when different terms are used, by calculating vector distances (typically cosine similarity);
- combine the results of the two previous searches, weighting them to bring the most relevant ones to the top (a toy example of this fusion is sketched below).
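As a minimal illustration of the third step, a weighted fusion of the two score lists might look like the sketch below. The min-max normalization and the alpha weight are simplifications of my own; real engines such as Azure AI Search apply their own fusion formulas (Reciprocal Rank Fusion, for instance).

def normalize(scores: dict[str, float]) -> dict[str, float]:
    # Rescale raw scores to [0, 1] so the two searches become comparable
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (score - lo) / span for doc_id, score in scores.items()}

def hybrid_rank(bm25_scores: dict[str, float],
                vector_scores: dict[str, float],
                alpha: float = 0.5) -> list[tuple[str, float]]:
    # Blend the normalized scores and sort from most to least relevant
    bm25_n, vec_n = normalize(bm25_scores), normalize(vector_scores)
    fused = {doc_id: alpha * bm25_n.get(doc_id, 0.0) + (1 - alpha) * vec_n.get(doc_id, 0.0)
             for doc_id in set(bm25_n) | set(vec_n)}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores for three chunks, one set from each search
print(hybrid_rank({"chunk1": 2.4, "chunk2": 0.7, "chunk3": 1.1},
                  {"chunk1": 0.81, "chunk2": 0.88, "chunk3": 0.79}))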
Figure 1 shows the classic retrieval process:
- the user asks the system a question: “I would like to talk about Paris”;
- the system receives the question, converts it into an embedding vector (using the same model applied in the ingestion phase), and finds the fragments with the smallest distances;
- the system also performs a frequency-based textual search;
- the fragments returned by both processes undergo further evaluation and are reordered according to a ranking formula.
This solution achieves good results but has some limitations:
- not all relevant fragments are always retrieved;
- sometimes some fragments contain anomalies that affect the final response.
An example of a typical retrieval problem
Let's consider the “documents” array, which represents an example of a knowledge base that could lead to incorrect fragment selection.
documents = [
    "Chunk 1: This document contains information about topic A.",
    "Chunk 2: Insights related to topic B can be found here.",
    "Chunk 3: This chunk discusses topic C in detail.",
    "Chunk 4: Further insights on topic D are covered here.",
    "Chunk 5: Another chunk with more data on topic E.",
    "Chunk 6: Extensive research on topic F is presented.",
    "Chunk 7: Information on topic G is explained here.",
    "Chunk 8: This document expands on topic H. It also talks about topic B",
    "Chunk 9: Nothing about topic B is given.",
    "Chunk 10: Finally, a discussion of topic J. This document doesn't contain information about topic B"
]
Suppose we have a RAG system consisting of a vector database with hybrid search capabilities and an LLM, to which the user poses the following question: “I need to know something about topic B.”
As shown in Figure 2, the search also returns an incorrect fragment that, while semantically relevant, is not suitable for answering the question and, in some cases, could even confuse the LLM tasked with providing an answer.
In this example, the user requests information about “topic B”, and the search returns fragments including “This document expands on topic H. It also talks about topic B” and “Insights related to topic B can be found here.”, as well as the fragment that says “Nothing about topic B is given.”
While this is the expected behavior of hybrid search (all of these fragments mention “topic B”), it is not the desired result, since the third fragment is returned without recognizing that it is not useful for answering the question.
The search did not produce the desired result not only because the BM25 search found the term “topic B” in the third fragment, but also because the vector search returned a high cosine similarity for it.
To understand this, see Figure 3, which shows the cosine similarity values of the fragments relative to the question, using OpenAI's text-embedding-ada-002 model for embedding.
It is evident that the cosine similarity value of Chunk 9 is among the highest, and that between it and Chunk 10, which also refers to “topic B”, sits Chunk 1, which does not mention “topic B” at all.
This situation remains unchanged even when the distance is measured using a different method, as seen in the case of the Minkowski distance.
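The values in Figure 3 can be reproduced with a few lines of code. The sketch below uses the plain OpenAI client rather than the Azure setup used later in the article and assumes an OPENAI_API_KEY environment variable; it reuses the documents array defined above.

import numpy as np
from openai import OpenAI  # assumes the OPENAI_API_KEY environment variable is set

client = OpenAI()
question = "I need to know something about topic B"

# Embed the question and all the chunks with the same embedding model
response = client.embeddings.create(model="text-embedding-ada-002",
                                    input=[question] + list(documents))
vectors = [np.array(item.embedding) for item in response.data]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q_vec, doc_vecs = vectors[0], vectors[1:]
for doc, vec in sorted(zip(documents, doc_vecs),
                       key=lambda pair: cosine_similarity(q_vec, pair[1]),
                       reverse=True):
    print(f"{cosine_similarity(q_vec, vec):.4f}  {doc}")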
Using an LLM for information retrieval: an example
The solution I will describe is inspired by the code published in my GitHub repository: https://github.com/peronc/LLMRetriever/.
The idea is that the LLM analyzes which fragments are useful in answering the user's question, not by ranking the returned fragments (as in the case of RankGPT) but by directly evaluating all available fragments.
In summary, as shown in Figure 4, the system receives a list of documents to analyze, which can come from any data source, such as file storage, relational databases, or vector databases.
The chunks are divided into groups and processed in parallel by a number of threads proportional to the total number of chunks.
Each thread runs a loop that iterates over its input fragments and, for each one, calls the OpenAI model with a prompt that checks its relevance to the user's question.
The prompt returns the fragment along with a boolean value: TRUE if it is relevant and FALSE if it is not.
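The exact prompt lives in the repository; a minimal sketch of the idea, with wording and a helper name that are assumptions of mine, could look like this (llm is any already-initialized LangChain chat model, such as the AzureChatOpenAI instance created in the next section).

def is_chunk_relevant(llm, question: str, chunk: str) -> bool:
    # Hypothetical relevance-check prompt; the repository defines its own wording
    prompt = (
        "Decide whether the following chunk is useful to answer the question.\n"
        f"Question: {question}\n"
        f"Chunk: {chunk}\n"
        "Reply with a single word: TRUE if it is relevant, FALSE otherwise."
    )
    reply = llm.invoke(prompt).content.strip().upper()
    return reply.startswith("TRUE")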
Let's code
To explain the code, I will simplify it using the snippets present in the documents array (in the conclusions I will refer to a real case).
First, I import the necessary libraries: os, langchain, and dotenv.
import os
from langchain_openai.chat_models.azure import AzureChatOpenAI
from dotenv import load_dotenv
Next, I import my class llm_retriever from LLMRetrieverLib/retriever.py, which provides several essential static methods to perform the analysis.
from LLMRetrieverLib.retriever import llm_retriever
After that, I load the environment variables needed to use the Azure OpenAI GPT-4o model.
load_dotenv()
azure_deployment = os.getenv("AZURE_DEPLOYMENT")
temperature = float(os.getenv("TEMPERATURE"))
api_key = os.getenv("AZURE_OPENAI_API_KEY")
endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("API_VERSION")
Next, I proceed with the initialization of the LLM.
# Initialize the LLM
llm = AzureChatOpenAI(
    api_key=api_key,
    azure_endpoint=endpoint,
    azure_deployment=azure_deployment,
    api_version=api_version,
    temperature=temperature,
)
We're ready to go: the user asks a question to gather additional information about Topic B.
question = "I need to know something about topic B"
At this point, the search for relevant fragments begins. To do this, I use the function llm_retriever.process_chunks_in_parallel from LLMRetrieverLib/retriever.py, which is also located in the same repository.
relevant_chunks = llm_retriever.process_chunks_in_parallel(llm, question, documents, 3)
To optimize performance, the function llm_retriever.process_chunks_in_parallel employs multithreading to distribute the fragment analysis across multiple threads.
The main idea is to assign each thread a subset of the fragments extracted from the database and have it analyze the relevance of those fragments to the user's question.
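The library handles this threading internally; purely as an illustration, the distribution of the per-chunk checks could be sketched with concurrent.futures, reusing the hypothetical is_chunk_relevant helper from the earlier sketch (the group size and the helper are assumptions, not the repository's actual code).

from concurrent.futures import ThreadPoolExecutor

def check_chunks_in_parallel(llm, question, chunks, chunks_per_thread=3):
    # Split the chunk list into groups; each group is assigned to one thread
    groups = [chunks[i:i + chunks_per_thread]
              for i in range(0, len(chunks), chunks_per_thread)]

    def worker(group):
        # Each thread loops over its own chunks and keeps only the relevant ones
        return [chunk for chunk in group if is_chunk_relevant(llm, question, chunk)]

    relevant = []
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        for partial_result in pool.map(worker, groups):
            relevant.extend(partial_result)
    return relevant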
At the end of processing, the fragments returned are exactly as expected:
['Chunk 2: Insights related to topic B can be found here.',
 'Chunk 8: This document expands on topic H. It also talks about topic B']
Finally, I ask the LLM to answer the user's question:
final_answer = llm_retriever.generate_final_answer_with_llm(llm, relevant_chunks, question)
print("Final answer:")
print(final_answer)
Below is the LLM's response, which is trivial since the content of the fragments, while relevant, is not exhaustive on topic B:
Topic B is covered in both Chunk 2 and Chunk 8.
Chunk 2 provides insights specifically related to topic B, offering detailed information and analysis.
Chunk 8 expands on topic H but also includes discussions on topic B, potentially providing additional context or perspectives.
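For completeness, the final-answer step can be pictured as a single prompt that places the surviving fragments in the context. The sketch below is only an illustration of the idea, not the actual prompt used by generate_final_answer_with_llm.

def answer_from_chunks(llm, question: str, chunks: list) -> str:
    # Hypothetical final-answer prompt built from the relevant chunks only
    context = "\n".join(chunks)
    prompt = (
        "Answer the user's question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.invoke(prompt).content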
Scoring scenario
Now let's try asking the same question but using a score-based approach.
I ask the LLM to assign a score from 1 to 10 to evaluate the relevance between each fragment and the question, considering only those with a relevance greater than 5.
To do this, I call the function llm_retriever.process_chunks_in_parallel, passing three additional parameters that indicate, respectively, that scoring will be applied, that a fragment's score must be greater than or equal to 5 to be considered valid, and that I want a printout of the fragments with their respective scores.
relevant_chunks = llm_retriever.process_chunks_in_parallel(llm, question, documents, 3, True, 5, True)
The scoring retrieval phase produces the following output:
score: 1 - Chunk 1: This document contains information about topic A.
score: 1 - Chunk 7: Information on topic G is explained here.
score: 1 - Chunk 4: Further insights on topic D are covered here.
score: 9 - Chunk 2: Insights related to topic B can be found here.
score: 7 - Chunk 8: This document expands on topic H. It also talks about topic B
score: 1 - Chunk 5: Another chunk with more data on topic E.
score: 1 - Chunk 9: Nothing about topic B is given.
score: 1 - Chunk 3: This chunk discusses topic C in detail.
score: 1 - Chunk 6: Extensive research on topic F is presented.
score: 1 - Chunk 10: Finally, a discussion of topic J. This document doesn't contain information about topic B
The result is the same as before, but with an interesting addition: the score assigned to each fragment.
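A score-based variant only changes the prompt and the filtering condition. Along the lines of the earlier sketches (again an assumption of mine, not the library's code), it could look like this:

def score_chunk(llm, question: str, chunk: str) -> int:
    # Hypothetical scoring prompt; the library defines its own wording
    prompt = (
        "Rate from 1 to 10 how useful the following chunk is for answering the question.\n"
        f"Question: {question}\n"
        f"Chunk: {chunk}\n"
        "Reply with the number only."
    )
    reply = llm.invoke(prompt).content.strip()
    try:
        return int(reply)
    except ValueError:
        return 1  # treat unparsable replies as "not relevant"

# Keep only the chunks whose score reaches the threshold (>= 5 in this article)
relevant_chunks = [chunk for chunk in documents
                   if score_chunk(llm, question, chunk) >= 5]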
Finally, I ask the LLM again to give an answer to the user's question, and the result is similar to the previous one:
Chunk 2 provides insights related to topic B, offering foundational information and key points.
Chunk 8 expands on topic B further, possibly providing additional context or details, as it also discusses topic H.
Together, these chunks should give you a well-rounded understanding of topic B. If you need more specific details, let me know!
Considerations
This retrieval approach emerged as a necessity after some previous experiences.
I have noticed that purely vector-based searches produce useful results but are often insufficient when the embeddings are generated in a language other than English.
Using OpenAI with Italian sentences makes it clear that the tokenization of terms is often off; for example, the Italian term “canzone” (which means “song”) is split into two different tokens: “can” and “zone”.
This leads to the construction of an embedding vector that is far from what was intended.
In cases like this, hybrid search, which also incorporates term frequency counting, leads to better results, but they are not always as expected.
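The tokenization issue is easy to inspect with the tiktoken package. The snippet below shows how to look at the sub-word pieces a text is split into; the exact split depends on the tokenizer and is not guaranteed to match the example above.

import tiktoken

# cl100k_base is the tokenizer behind text-embedding-ada-002 and the GPT-4 family
encoding = tiktoken.get_encoding("cl100k_base")

for word in ["song", "canzone"]:  # "canzone" is Italian for "song"
    token_ids = encoding.encode(word)
    pieces = [encoding.decode([token_id]) for token_id in token_ids]
    print(word, "->", pieces)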
So, this retrieval methodology can be used in the following ways:
- as the main search method: the database is queried for all fragments, or for a subset selected by a filter (for example, a metadata filter);
- as a refinement step after hybrid search (this is the same approach used by RankGPT): hybrid search can extract a large number of fragments, and the system then filters them so that only the relevant ones reach the LLM while still respecting the input token limit;
- as a fallback: in situations where hybrid search does not produce the desired results, all fragments can be analyzed.
Let's look at costs and performance.
Of course, all that glitters is not gold, as response times and costs must be taken into account.
In a real use case, I retrieved the fragments from a relational database consisting of 95 text segments, semantically split with my LLMChunkizerLib/chunkizer.py library from two Microsoft Word documents totaling 33 pages.
The analysis of the relevance of the 95 fragments to the question was performed by calling the OpenAI APIs from a local PC with non-guaranteed bandwidth averaging around 10 Mbit/s, resulting in response times that varied from 7 to 20 seconds.
Naturally, in a cloud environment or when running a local LLM on GPUs, these times can be significantly reduced.
I think response time considerations are very subjective: in some cases it is acceptable to take longer to give a correct response, while in others it is essential not to make users wait too long.
Similarly, cost considerations are also quite subjective, as a broader perspective must be taken to evaluate whether it is more important to provide answers that are as accurate as possible or whether some errors are acceptable.
In certain fields, the reputational damage caused by incorrect or missing answers can exceed the token spend.
Additionally, while the costs of OpenAI and other vendors have been steadily decreasing in recent years, those who already have GPU-based infrastructure, perhaps due to the need to handle sensitive or confidential data, will likely prefer to use an on-premise LLM.
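As a rough way to reason about this trade-off, a back-of-the-envelope estimate can help; every figure in the sketch below is a placeholder to replace with your own chunk count, prompt size, and model pricing.

# Back-of-the-envelope cost of one LLM-filtered retrieval pass.
# All figures are placeholders: plug in your own chunk count,
# average prompt size, and the pricing of the model you use.
n_chunks = 95               # chunks to evaluate, as in the real case above
tokens_per_call = 500       # assumed prompt + completion tokens per relevance check
price_per_1k_tokens = 0.01  # placeholder price, not a real quote

cost_per_question = n_chunks * tokens_per_call / 1000 * price_per_1k_tokens
print(f"Estimated cost per question: ${cost_per_question:.2f}")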
Conclusions
In conclusion, I hope I have offered a useful perspective on how retrieval can be approached.
At the very least, I hope to have been helpful and perhaps to have inspired others to explore new methods in their own work.
Remember, the world of information retrieval is vast, and with a little creativity and the right tools, we can discover insights in ways we never imagined.