The fast POC
The fastest proof of concept (POC) that lets a user explore data with the help of conversational AI will simply blow your mind. It feels like pure magic when you can suddenly talk to your documents, data, or code base.
These POCs work great on small data sets with a limited number of documents. However, as with almost everything, when you take it to production you quickly run into problems at scale. When you dig deeper and inspect the answers the AI gives you, you notice:
- Your agent does not respond with complete information; some important data was lost.
- Your agent does not reliably give the same answer.
- Your agent can't tell you how and where it got its information, which makes the answer much less useful.
It turns out that the real magic in RAG does not happen in the generative AI step, but in the retrieval and composition process. Once you dive in, it's pretty obvious why…
* RAG = Retrieval-Augmented Generation — Wikipedia definition of RAG
A quick summary of how a simple RAG process works:
- It all starts with a query. The user asks a question, or some system is trying to answer one. For example, “Does patient Walker have a broken leg?”
- A search is done for the query. Mostly you would embed the query and do a similarity search, but you can also do a classic Elasticsearch query, a combination of both, or a direct lookup of information.
- The search result is a set of documents (or document fragments, but for now let's just call them documents).
- The documents and the essence of the query are combined into an easily readable context so the AI can work with it.
- The AI interprets the question and the documents and generates an answer.
- Ideally, this answer is fact-checked to see whether the AI based it on the documents and/or whether it is appropriate for the audience.
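A minimal sketch of this flow in Python, assuming hypothetical `embed`, `vector_store`, and `llm` helpers standing in for your embedding model, search index, and LLM client, could look like this:
# Hypothetical sketch of the simple RAG flow described above
def answer_question(query, vector_store, llm):
    # Steps 2-3: search the query (here via embedding + similarity search) to get documents
    documents = vector_store.search(embed(query), top_k=5)
    # Step 4: compose the documents and the essence of the query into a readable context
    context = "\n\n".join(doc["content"] for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # Step 5: the AI interprets the question and documents and generates an answer
    answer = llm.generate(prompt)
    # Step 6: ideally, fact-check the answer against the retrieved documents before returning it
    return answer, documents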
The dirty little secret is that the essence of the RAG process is that you have to provide the answer to the AI (before it does anything), so it can give you the answer you're looking for.
In other words:
- The work that the AI does (step 5) is to apply judgment and properly articulate the answer.
- The work the engineer does (steps 3 and 4) is to find the answer and compose it in a way the AI can digest.
Which is more important? The answer, of course, is that it depends: if judgment is the critical element, then the AI model does all the magic. But for an endless number of business use cases, the most important part is finding and properly composing the pieces that make up the answer.
The first set of problems to solve when running a RAG process are the problems of ingesting, splitting, chunking, and interpreting the data documents. I have written about some of these in previous articles, but I'm ignoring them here. For now, let's assume you've got data ingestion figured out correctly and have a nice vector store or searchable index.
Typical challenges:
- Duplication — Even the simplest production systems often have duplicate documents. Even more so when your system is large, has many users or tenants, connects to multiple data sources or deals with version control, etc.
- Near-duplication — Documents that contain largely the same data, but with minor changes. There are two types of near-duplication:
— Meaningful: e.g. a small correction or minor addition, such as an updated date field
— Meaningless: e.g. minor differences in punctuation, syntax, or spacing, or simply differences introduced by timing or ingestion processing
- Volume — Some queries have a very large relevant response data set.
- Data freshness versus quality — Which chunks of the response data set have the highest-quality content for the AI to use, versus which chunks are most relevant from a time (freshness) perspective?
- Data variety — How do we ensure a variety of search results so that the AI is adequately informed?
- Query phrasing and ambiguity — The prompt that triggered the RAG flow might not be worded in a way that produces the optimal result, or it might even be ambiguous.
- Response customization — The query may require a different response depending on who asks it.
This list goes on, but you get the gist.
Won't a really large context window solve all of this? Short answer: no.
The cost and performance impact of using extremely large context windows (easily 10 or 100x the cost per query) should not be underestimated, and that's not counting any follow-up interactions the user/system has.
However, putting that aside, imagine the following situation.
We bring Anne into the room and hand her a piece of paper. The paper says: *patient Joe: complex foot fracture.* Now we ask Anne: does the patient have a foot fracture? Her answer is “yes, he does.”
Now we give Anne a hundred pages of Joe's medical history. Her answer becomes “well, depending on what point in time you mean, he had…”
Now we give Anne thousands of pages about all the patients at the clinic…
What you quickly notice is that the way we frame the question (or the prompt, in our case) starts to matter a lot. The larger the context window, the more nuance the query needs.
Moreover, as the context window grows, so does the universe of possible answers. This may be a good thing, but in practice it invites lazy engineering behavior and is likely to reduce the capabilities of your application if not handled intelligently.
As you scale a RAG system from POC to production, here's how to address typical data challenges with specific solutions. Each approach has been adjusted to fit production requirements and includes examples where useful.
Duplication
Duplication is inevitable in multi-source systems. By using fingerprints (hashed content), document IDs, or semantic hashing, you can identify exact duplicates at ingestion time and avoid redundant content. However, consolidating metadata across duplicates can also be valuable; this lets users know that certain content appears in multiple sources, which can add credibility or highlight repetition in the data set.
# Fingerprinting for deduplication
import hashlib

def fingerprint(doc_content):
    return hashlib.md5(doc_content.encode()).hexdigest()

# Store fingerprints and filter duplicates, while consolidating metadata
fingerprints = {}
unique_docs = []
for doc in docs:
    fp = fingerprint(doc["content"])
    if fp not in fingerprints:
        fingerprints[fp] = [doc]
        unique_docs.append(doc)
    else:
        fingerprints[fp].append(doc)  # Consolidate sources
Near-duplication
Near-duplicate documents (similar but not identical) often contain important updates or small additions. Since a minor change, such as a status update, can carry critical information, freshness becomes crucial when filtering out near-duplicates. A practical approach is to use cosine similarity for initial detection and then keep the most recent version within each group of near-duplicates, while flagging any significant updates.
from sklearn.cluster import DBSCAN

# Cluster embeddings with DBSCAN (cosine metric) to find near-duplicates
clustering = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit(doc_embeddings)

# Organize documents by cluster label; label -1 marks documents with no near-duplicates
clustered_docs = {}
filtered_docs = []
for idx, label in enumerate(clustering.labels_):
    if label == -1:
        filtered_docs.append(docs[idx])  # Keep documents that have no near-duplicates
        continue
    clustered_docs.setdefault(label, []).append(docs[idx])

# From each cluster of near-duplicates, retain only the freshest document
for cluster_docs in clustered_docs.values():
    # Choose the document with the most recent timestamp
    freshest_doc = max(cluster_docs, key=lambda d: d["timestamp"])
    filtered_docs.append(freshest_doc)
Volume
When a query returns a large volume of relevant documents, effective management is key. One approach is a **layered strategy**:
- Topic extraction: Preprocess documents to extract specific themes or summaries.
- Top-k filtering: After summarization, filter the summarized content based on relevance scores.
- Relevance score: Use similarity metrics (e.g. BM25 or cosine similarity) to prioritize top documents before retrieving them.
This approach reduces the workload by retrieving synthesized information that is more manageable for the AI. Other strategies might involve grouping documents by topic or pre-aggregating summaries to further speed up retrieval.
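As a rough sketch of the top-k filtering step, assuming each document already carries a hypothetical `summary` field produced by the topic-extraction pass, you could score the summaries against the query with TF-IDF cosine similarity and keep only the best candidates:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_by_relevance(query, docs, k=10):
    # Score each document summary against the query and keep the k most relevant
    summaries = [d["summary"] for d in docs]  # assumes a summarization pass already ran
    matrix = TfidfVectorizer().fit_transform([query] + summaries)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).flatten()
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]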
Data freshness vs. quality
Balancing quality with freshness is essential, especially in rapidly evolving data sets. Many scoring approaches are possible, but here is a general tactic:
- Composite score: Calculate a quality score using factors such as source trustworthiness, depth of content, and user engagement.
- Recency weighting: Adjust the score with a timestamp weight to emphasize freshness.
- Filter by threshold: Only documents that meet a combined quality and timeliness threshold are retrieved.
Other strategies might involve scoring only high-quality sources or applying decay factors to older documents.
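A minimal sketch of that tactic, assuming hypothetical `trust`, `depth`, and `engagement` fields (each 0 to 1) plus a Unix `timestamp` on every document, might combine a quality score with an exponential recency decay and a threshold filter:
import time

def composite_score(doc, half_life_days=30):
    # Quality: weighted mix of source trustworthiness, content depth, and user engagement
    quality = 0.5 * doc["trust"] + 0.3 * doc["depth"] + 0.2 * doc["engagement"]
    # Recency: exponential decay based on document age, halving every half_life_days
    age_days = (time.time() - doc["timestamp"]) / 86400
    recency = 0.5 ** (age_days / half_life_days)
    return 0.7 * quality + 0.3 * recency

def filter_by_threshold(docs, threshold=0.5):
    # Retrieve only documents that meet the combined quality/freshness threshold
    return [d for d in docs if composite_score(d) >= threshold]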
Data variety
Ensuring diverse data sources in retrieval helps create a balanced response. Grouping documents by source (for example, different databases, authors, or content types) and selecting the top snippets from each source is an effective method. Other approaches include scoring based on unique perspectives or applying diversity constraints to avoid over-reliance on a single document or viewpoint.
# Ensure variety by grouping and selecting top snippets per source
from itertools import groupby

k = 3  # Number of top snippets per source
docs = sorted(docs, key=lambda d: d["source"])
grouped_docs = {key: list(group)[:k] for key, group in groupby(docs, key=lambda d: d["source"])}
diverse_docs = [doc for source_docs in grouped_docs.values() for doc in source_docs]
Query phrasing and ambiguity
Ambiguous queries can lead to suboptimal retrieval results. Using the exact user message is usually not the best way to get the results they need. For example, relevant information may have been exchanged earlier in the chat, or the user may have pasted a large amount of text with a question about it.
To make sure you use a refined query, one approach is to give the model a RAG tool whose definition asks it to rephrase the question into a more detailed search query, similar to how you might carefully craft a search query for Google. This approach improves the alignment between user intent and the RAG retrieval process. The following wording is not optimal, but it gives you the essentials:
tools = [{
    "name": "search_our_database",
    "description": "Search our internal company database for relevant documents",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "A search query, like you would for a Google search, in sentence form. Take care to provide any important nuance to the question."
            }
        },
        "required": ["query"]
    }
}]
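When the model invokes this tool, the `query` argument it produces is the refined search query you feed into retrieval instead of the raw user message. A hypothetical handler might look like the sketch below; the `embed` and `vector_store` helpers are assumptions, and the exact shape of the tool-call object depends on your LLM provider.
import json

def handle_search_tool_call(tool_call, vector_store):
    # The model supplies a rephrased, self-contained search query as the tool argument
    args = json.loads(tool_call["arguments"])
    refined_query = args["query"]
    # Run retrieval with the refined query rather than the raw user message
    return vector_store.search(embed(refined_query), top_k=5)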
Response customization
To get personalized responses, integrate user-specific context directly into the RAG context composition. By adding a user-specific layer to the final context, you allow the AI to take into account individual preferences, permissions, or history without disrupting the core retrieval process.
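A minimal sketch, assuming a hypothetical `user_profile` dict with a role, a detail preference, and a list of allowed sources, is to append a user-specific layer to the composed context just before the generation step:
def personalize_context(base_context, user_profile):
    # Append a user-specific layer without touching the retrieved content itself
    user_layer = (
        f"Answer for a {user_profile['role']}. "
        f"Preferred level of detail: {user_profile['detail_level']}. "
        f"Only use sources the user may access: {', '.join(user_profile['allowed_sources'])}."
    )
    return f"{base_context}\n\n{user_layer}"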