RAG is a widely deployed and widely discussed approach to document summarization with GenAI technologies. However, like any new technology, it comes with serious challenges, especially in today's enterprise environment. Two major concerns are context bloat (and the per-request cost that comes with it) and the "lost in the middle" context problem mentioned above. Let's dig a little deeper to understand these challenges.
Note: I will perform the exercises in Python using the LangChain, scikit-learn, NumPy and Matplotlib libraries for quick iteration.
Today, with the automated workflows enabled by GenAI, analyzing large documents has become an industry expectation. People want to quickly find relevant information in medical reports or financial audits simply by asking an LLM. But there is a caveat: business documents are not like the documents or datasets we work with in academia. They are considerably larger, and the relevant information can appear virtually anywhere in them. As a result, methods such as data cleaning and filtering are often not viable, because domain knowledge about these documents is not always available.
In addition to this, even the latest Large Language Models (LLMs) such as OpenAI's GPT-4o, with a context window of 128K tokens, cannot consume these documents in one go; and even if they could, the quality of the response would not meet expectations, especially for the cost involved. To demonstrate this, let's take a real-world example and try to summarize the GitLab Employee Handbook, which can be downloaded from the GitLab website. The document is freely available under the MIT license on their GitHub repository.
1. We start by loading the document and initializing our LLM. To keep this exercise relevant, I will use GPT-4o.
from langchain_community.document_loaders import PyPDFLoader

# Load PDFs
pdf_paths = ["/content/gitlab_handbook.pdf"]
documents = []
for path in pdf_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())

from langchain_openai import ChatOpenAI

# Initialize the LLM (requires OPENAI_API_KEY in the environment)
llm = ChatOpenAI(model="gpt-4o")
2. We can then split the document into smaller chunks (this is for embedding; I will explain why in later steps).
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split documents into chunks
splits = text_splitter.split_documents(documents)
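As a quick sanity check (not part of the original walkthrough), we can inspect how many chunks the splitter produced and preview one of them; this uses the same variables defined above.

# Illustrative sanity check on the split
print(f"Number of chunks: {len(splits)}")
print(splits[0].page_content[:300])  # preview the first 300 characters of the first chunk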
3. Now, let's calculate how many tokens make up this document. To do this, we iterate through each chunk of the document and add up its token count.
total_tokens = 0
for chunk in splits:
    text = chunk.page_content  # `page_content` is where the text is stored
    num_tokens = llm.get_num_tokens(text)  # Get the token count for each chunk
    total_tokens += num_tokens

print(f"Total number of tokens in the book: {total_tokens}")
# Total number of tokens in the book: 254006
As we can see, the document contains 254,006 tokens, while the context window limit for GPT-4o is 128,000 tokens. This document cannot be submitted in a single call to the LLM API. On top of that, given that the pricing for this model is $0.005 per 1,000 input tokens, a single request covering this document would cost about $1.27 in input tokens alone. That may not sound terrible until you place it in an enterprise setting with many users making daily requests over many such large documents, especially in a startup scenario where many GenAI solutions are being built.
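The arithmetic behind that figure is straightforward; here is a small snippet that reproduces it (the price constant is taken from the text above and may change over time):

# Rough input-token cost estimate
PRICE_PER_1K_INPUT_TOKENS = 0.005  # USD, per the pricing quoted above; subject to change
estimated_cost = (total_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS
print(f"Estimated input cost for one pass: ${estimated_cost:.2f}")
# Estimated input cost for one pass: $1.27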
Another challenge LLMs face is the "lost in the middle" context problem, discussed in detail in this paper. Research, along with my own experience with RAG systems that handle multiple documents, shows that LLMs are not very robust at extracting information from long context inputs. Model performance degrades considerably when the relevant information sits somewhere in the middle of the context, and improves when it appears at the beginning or end of the provided context. Document re-ranking is one solution that has become a topic of growing debate and research for this specific problem; I will explore some of those methods in another post. For now, let's return to the solution we are exploring, which uses K-means clustering.
OK, I admit I just dropped a technical concept in that last sentence, so let me explain it (no worries if you haven't come across the method before).
First, the basics
To understand K-means clustering, we first need to know what clustering is. Imagine a messy desk with pens, pencils, and notes scattered all over it. To tidy up, you would group similar items together: all the pens in one group, the pencils in another, and the notes in a third, creating three separate groups. Clustering is the same process: given a collection of data (in our case, the text chunks of the document), similar items are grouped together. This creates a clear separation of concerns, making it easier for our RAG system to select and curate information effectively and efficiently instead of greedily scanning through all of it.
And the K, what does it mean?
K-means is a specific clustering method (there are others, but we won't go into those here). Let me explain how it works in 5 simple steps, with a small code sketch after the list:
- Choosing the number of groups (K): how many clusters do we want the data to be divided into?
- Choosing the cluster centers: initially, a center is picked at random for each of the K clusters.
- Assigning points to clusters: each data point is then assigned to the cluster whose center it is closest to. Example: items closest to center 1 are assigned to cluster 1, items closest to center 2 are assigned to cluster 2, and so on up to cluster K.
- Updating the centers: once all the data points have been assigned, we calculate the mean position of the points in each cluster, and these means become the new centers (since the initial ones were chosen at random).
- Rinse and repeat: with the new centers, the cluster assignments are updated again. This repeats until the distance (mathematically, the Euclidean distance) is minimal between points within a cluster and maximal to points in other clusters, i.e. optimal separation.
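To make this concrete, here is a minimal sketch of those steps using scikit-learn's KMeans on a few toy 2D points. This is purely illustrative and separate from the handbook workflow; in the actual pipeline the data points would be the embedding vectors of the document chunks rather than hand-made coordinates.

import numpy as np
from sklearn.cluster import KMeans

# Toy 2D data standing in for chunk embeddings (illustrative only)
points = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # roughly one group
    [5.0, 5.2], [5.3, 4.9], [4.8, 5.1],   # roughly another group
    [9.0, 1.0], [9.2, 1.3], [8.8, 0.9],   # and a third
])

# Steps 1-5 in one call: pick K, initialize centers, assign points,
# update centers, and repeat until the assignments stabilize
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the final cluster centers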
While this is a rather simplified explanation, a more detailed and technical one (for my fellow nerds) of the algorithm can be found here.