Introduction
AI is everywhere.
It is hard not to interact with a large language model (LLM) at least once a day. Chatbots are here to stay. They are in our applications, they help us write better, they draft emails, they read emails... well, they do a lot.
And I don't think that is a bad thing. In fact, my opinion is quite the opposite, at least so far. I defend and advocate for the use of AI in our daily lives because, let's face it, it makes everything much easier.
I don't have to spend double the reading time hunting for typos or punctuation issues. AI does that for me. I don't waste time writing that follow-up email every Monday. AI does that for me. I don't need to read a huge, boring contract when I have an AI to summarize the main conclusions and action points for me!
These are just some of AI's great uses. If you want to learn more about using LLMs to make life easier, I wrote a whole book about them.
Now, thinking as a data scientist and looking at the technical side, not everything is that bright and shiny.
LLMs are great for several general use cases that apply to anyone or any company. For example, coding, summarizing, or answering questions about general content created up to the training cutoff date. However, when it comes to specific business applications, built for a single purpose, or to new content that did not make the cutoff date, the models will not be as useful when used out of the box; that is, they will not know the answer. Therefore, they will need adjustments.
Training an LLM can take months and millions of dollars. What is even worse is that if we do not adjust and tune the model to our purpose, we will get unsatisfactory results or hallucinations (when the model's response makes no sense given our query).
So what is the solution, then? Spending a lot of money retraining the model to include our data?
Not quite. That's when Retrieval-Augmented Generation (RAG) becomes useful.
RAG is a framework that combines getting information from an external knowledge base with large language models (LLMs). It helps the models produce more accurate and relevant responses.
Let's learn more about RAG next.
What is RAG?
Let me tell you a story to illustrate the concept.
I love movies. There was a time when I knew which movies were competing for the Best Picture category at the Oscars, or who the best actors and actresses were. And I would certainly know which ones took home the statuette that year. But now I am rusty on that subject. If you asked me who was competing, I would not know. And even if I tried to answer you, I would give you a weak response.
So, to provide you with a quality response, I will do what everybody else does: search for the information online, get it, and then give it to you. What I just did is the same idea as RAG: I obtained data from an external database to give you an answer.
When we enhance the LLM with a content store from which it can retrieve data to augment its knowledge base, that is the RAG framework in action.
RAG is like creating a content store where the model can enhance its knowledge and respond more accurately.
In summary, RAG does the following (a minimal conceptual sketch follows this list):
- Use search algorithms to query external data sources, such as databases, knowledge bases, and web pages.
- Preprocess the retrieved information.
- Incorporate the preprocessed information into the LLM.
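To make those three steps a bit more concrete, here is a minimal, purely illustrative sketch in plain Python. The search and generate functions are dummy stand-ins I made up for the example, not real library calls:
# Purely illustrative sketch of the RAG flow (dummy functions, not a real library)
from dataclasses import dataclass

@dataclass
class Doc:
    text: str

def rag_answer(query, search, generate):
    # 1. Use a search function to query an external data source
    documents = search(query)
    # 2. Preprocess the retrieved information into a single context string
    context = "\n".join(doc.text for doc in documents)
    # 3. Incorporate the preprocessed information into the prompt sent to the LLM
    prompt = f"Use this context to answer the question.\n{context}\n\nQuestion: {query}"
    return generate(prompt)

# Dummy stand-ins just to run the flow end to end
docs = [Doc("RAG retrieves external data before generation.")]
print(rag_answer("What is RAG?", search=lambda q: docs, generate=lambda p: p[:80]))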
Why use RAG?
Now that we know what the RAG framework is, let's understand why we should use it.
These are some of the benefits:
- Improves factual accuracy by grounding responses in real data.
- RAG can help LLMs process and consolidate knowledge to create more relevant answers.
- RAG can help LLMs access additional knowledge bases, such as internal organizational data.
- RAG can help LLMs create more accurate domain-specific content.
- RAG can help reduce knowledge gaps and AI hallucinations.
As explained above, I like to say that with the RAG framework, we are giving the LLM an internal search engine for the content we want to add to its knowledge base.
Good. All of that is very interesting. But let's see RAG applied. We will learn how to create an AI-powered PDF reader assistant.
Project
This is an application that allows users to upload a PDF document and ask questions about its content using AI-powered natural language processing (NLP) tools.
The application uses Streamlit as the front end, and LangChain, the OpenAI GPT-4o-mini model, and FAISS (Facebook AI Similarity Search) for document retrieval and question answering in the backend.
Let's break down the steps for a better understanding:
- Load a PDF file and split it into text chunks.
- This makes the data optimized for retrieval.
- Feed the chunks to an embedding tool.
- Embeddings are numerical vector representations of data used to capture relationships, similarities, and meanings in a way that machines can understand. They are widely used in natural language processing (NLP), recommender systems, and search engines (see the short sketch after this list).
- Next, we store those text chunks and embeddings in the same database for retrieval.
- Finally, we make them available to the LLM.
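To make the idea of embeddings more tangible, here is a short, optional sketch that embeds three sentences and compares them with cosine similarity. It uses the same Hugging Face model mentioned later in this article, and the sentences are just examples:
# Illustrative only: embed sentences and compare them with cosine similarity
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
import numpy as np

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vec_a = np.array(embeddings.embed_query("The contract ends in December."))
vec_b = np.array(embeddings.embed_query("The agreement expires at the end of the year."))
vec_c = np.array(embeddings.embed_query("I love watching the Oscars."))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vec_a, vec_b))  # higher value: similar meaning
print(cosine(vec_a, vec_c))  # lower value: unrelated meaning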
Data preparation
Preparing a content store for the LLM takes a few steps, as we have just seen. So, let's start by creating a function that can load a file and split it into text chunks for efficient retrieval.
# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def load_document(pdf):
    """
    Load a PDF and split it into chunks for efficient retrieval.

    :param pdf: PDF file to load
    :return: List of chunks of text
    """
    loader = PyPDFLoader(pdf)
    docs = loader.load()

    # Instantiate the text splitter with a chunk size of 500 characters and
    # an overlap of 100 characters so that context is not lost between chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

    # Split into chunks for efficient retrieval
    chunks = text_splitter.split_documents(docs)

    # Return the list of chunks
    return chunks
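If you want to sanity-check the function before wiring up the app, a quick call like the one below shows what the chunks look like ("example.pdf" is just a placeholder file name):
# Quick sanity check of load_document (illustrative only)
chunks = load_document("example.pdf")
print(f"Number of chunks: {len(chunks)}")
print(chunks[0].page_content[:200])  # first 200 characters of the first chunk
print(chunks[0].metadata)            # source file and page number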
Next, we will start building our Streamlit application, and we will use that function in the following script.
Web application
We will begin by importing the necessary Python modules. Most of them come from LangChain packages. FAISS is used for document retrieval; OpenAIEmbeddings transforms the text chunks into numerical vectors so that similarity can be computed; ChatOpenAI is what allows us to interact with the OpenAI API; create_retrieval_chain is what actually does the RAG work, retrieving the data and augmenting the LLM with it; and create_stuff_documents_chain glues together the model and the ChatPromptTemplate.
Note: You will need to generate an OpenAI API key to be able to run this script. If it is the first time you are creating an account, you get some free credits. But if you have had the account for some time, you may have to add 5 dollars in credits to access the OpenAI API. An alternative is to use Hugging Face embeddings.
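The scripts below import the key with from scripts.secret import OPENAI_KEY. One simple way to set that up, assuming you prefer reading the key from an environment variable instead of hard-coding it, is a small module like this (a sketch; adjust it to your own setup):
# scripts/secret.py (sketch)
# Reads the OpenAI key from an environment variable so it never lives in the source code
import os

OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")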
# Imports
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from scripts.secret import OPENAI_KEY
from scripts.document_loader import load_document
import streamlit as st
This first code snippet creates the application title, creates a box for file upload, and prepares the uploaded file to be passed to the load_document() function.
# Create a Streamlit app
st.title("ai-Powered Document Q&A")
# Load document to streamlit
uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")
# If a file is uploaded, create the TextSplitter and vector database
if uploaded_file:
    # Code to work around the Streamlit document loader and make the file readable by LangChain
    temp_file = "./temp.pdf"
    with open(temp_file, "wb") as file:
        file.write(uploaded_file.getvalue())
        file_name = uploaded_file.name

    # Load the document and split it into chunks for efficient retrieval.
    chunks = load_document(temp_file)

    # Message the user that the document is being processed, with a watch emoji
    st.write("Processing document... :watch:")
Machines understand numbers better than text, so in the end we will have to provide the model with a database of numbers that it can compare and check for similarity when a query is made. That's where the embeddings come in, used to create the vector_db in this next snippet.
# Generate embeddings
# Embeddings are numerical vector representations of data, typically used to capture relationships, similarities,
# and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP),
# recommender systems, and search engines.
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY,
model="text-embedding-ada-002")
# Can also use HuggingFaceEmbeddings
# from langchain_huggingface.embeddings import HuggingFaceEmbeddings
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Create vector database containing chunks and embeddings
vector_db = FAISS.from_documents(chunks, embeddings)
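Before moving on, you can optionally confirm that retrieval works by running a similarity search directly against the vector store. The query string below is just an example:
# Optional: quick retrieval check against the vector database
results = vector_db.similarity_search("What is this document about?", k=3)
for doc in results:
    print(doc.page_content[:150])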
Next, we create a retriever object to navigate the vector_db, along with the LLM that will generate the answers.
# Create a document retriever
retriever = vector_db.as_retriever()
llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key=OPENAI_KEY)
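By default, the retriever brings back a handful of the most similar chunks per query. If you want to control that number, as_retriever accepts search keyword arguments; the value below is just illustrative:
# Optional: control how many chunks are retrieved per query
retriever = vector_db.as_retriever(search_kwargs={"k": 4})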
Then, we will create the system_prompt, which is a set of instructions to the LLM on how to answer, and we will create a prompt template, preparing it to be passed to the model once we get the user's input.
# Create a system prompt
# It sets the overall context for the model.
# It influences tone, style, and focus before user interaction starts.
# Unlike user inputs, a system prompt is not visible to the end user.
system_prompt = (
    "You are a helpful assistant. Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "{context}"
)

# Create a prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
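If you are curious about what the final prompt looks like once the context and the user input are filled in, you can render the template with placeholder values (the strings below are only illustrative):
# Optional: render the template with example values to inspect the final messages
example_messages = prompt.invoke(
    {"context": "Chunk 1 text... Chunk 2 text...", "input": "What is the contract end date?"}
).to_messages()
for message in example_messages:
    print(f"{message.type}: {message.content}")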
# Create a chain
# It creates a StuffDocumentsChain, which takes multiple documents (text data) and "stuffs" them together before passing them to the LLM for processing.
question_answer_chain = create_stuff_documents_chain(llm, prompt)
Continuing, we create the core of the RAG framework, gluing together the retriever object and the question_answer_chain. This object fetches relevant documents from a data source (for example, a vector database) and prepares them to be processed by the LLM to generate an answer.
# Creates the RAG
chain = create_retrieval_chain(retriever, question_answer_chain)
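A quick way to test the chain outside Streamlit is to invoke it directly. The question below is just an example; the result is a dictionary that contains, among other things, the retrieved context and the final answer:
# Optional: test the RAG chain directly (example question)
result = chain.invoke({"input": "Summarize the main points of this document."})
print(result["answer"])
print(len(result["context"]), "chunks were retrieved as context")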
Finally, we create the variable question for the user's input. If this question box is filled with a query, we pass it to the chain, which calls the LLM to process it and return the answer to be printed on the application screen.
# Streamlit input for question
question = st.text_input("Ask a question about the document:")
if question:
    # Answer the question using the RAG chain
    response = chain.invoke({"input": question})["answer"]
    st.write(response)
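If you also want to show the user which chunks grounded the answer, a small variation of the snippet above could look like this. This is an optional enhancement I am sketching here, not part of the original app:
# Optional variation: also display the retrieved chunks that support the answer
if question:
    result = chain.invoke({"input": question})
    st.write(result["answer"])
    with st.expander("Sources"):
        for doc in result["context"]:
            st.write(doc.page_content[:300])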
Here is a screenshot of the result.
And here is a GIF showing the PDF reader AI assistant in action!
Before you go
In this project, we learned what the RAG framework is and how it helps the LLM perform better, including with domain-specific knowledge.
The AI can be fed with knowledge from an instruction manual, a company's databases, financial files, or contracts, and then become able to answer domain-specific content queries accurately. The knowledge base is augmented with a content store.
To recapitulate, this is how the framework works:
1️⃣ User query → The input text is received.
2️⃣ Retrieve relevant documents → Search a knowledge base (e.g., a database or vector store).
3️⃣ Augment the context → The retrieved documents are added to the input.
4️⃣ Generate a response → An LLM processes the combined input and produces an answer.
Github repository
https://github.com/gurezende/basic-rag
About me
If you liked this content and want to get more information about my work, here is my website, where you can also find all my contacts.
References
https://cloud.google.com/use-cases/retrieval-augmented-generation
https://www.ibm.com/think/topics/retrieval-augmented-generation