Learn critical knowledge for building AI applications, in simple language
Retrieval-Augmented Generation, or RAG, is all the rage these days because it adds an important capability to large language models like OpenAI's GPT-4: the ability to use and leverage your own data.
This post will teach you the fundamental intuition behind RAG while also giving you a simple tutorial to help you get started.
There is a lot of noise in the AI space, and in particular about RAG. Marketers are trying to overcomplicate it. They are trying to inject their tools, their ecosystems, their vision.
They're making RAG much more complicated than it needs to be. This tutorial is designed to help beginners learn how to build RAG applications from scratch. No nonsense, no jargon (okay, minimal jargon), no libraries, just a simple step-by-step RAG application.
Jerry from LlamaIndex advocates building things from scratch to really understand the pieces. Once you do that, using a library like LlamaIndex makes more sense.
Build from scratch to learn and then build with libraries to scale.
Let us begin!
You may or may not have heard of Retrieval Augmented Generation or RAG.
Here is the definition from the blog post in which Facebook introduced the concept (https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/):
Building a model that researches and contextualizes is more challenging, but it's essential for future advancements. We recently made substantial progress in this area with our Retrieval Augmented Generation (RAG) architecture, an end-to-end differentiable model that combines an information retrieval component (Facebook AI's dense-passage retrieval system) with a seq2seq generator (our Bidirectional and Auto-Regressive Transformers [BART] model). RAG can be fine-tuned on knowledge-intensive downstream tasks to achieve state-of-the-art results compared with even the largest pretrained seq2seq language models. And unlike these pretrained models, RAG's internal knowledge can be easily altered or even supplemented on the fly, enabling researchers and engineers to control what RAG knows and doesn't know without wasting time or compute power retraining the entire model.
Wow, that's a mouthful.
Simplifying the technique for beginners, we can say that the essence of RAG involves adding your own data (via a retrieval tool) to the prompt that you pass to a large language model. The model then generates a response using that data. This gives you several benefits:
- You can include data in the prompt to help the LLM avoid hallucinations.
- You can (manually) check the sources of truth used when responding to a user's query, which helps you verify any potential problems.
- You can leverage data that the LLM may not have been trained on.
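To make that concrete: the "augmentation" step is really just pasting the retrieved text into the prompt before it goes to the model. Below is a minimal, hypothetical sketch; build_prompt and the variable names are illustrative placeholders, not from any particular library.
# Minimal, hypothetical sketch of the core RAG idea: whatever retrieval
# returns simply gets pasted into the prompt. build_prompt is a placeholder
# name, not from any library.
def build_prompt(user_question, retrieved_document):
    return (
        "Answer the question using the context below.\n\n"
        f"Context: {retrieved_document}\n\n"
        f"Question: {user_question}"
    )

# The resulting string is then sent to the LLM of your choice.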
In its simplest form, a RAG system needs just three things:
- a collection of documents (formally called a corpus)
- an input from the user
- a similarity measure between the collection of documents and the user input
Yes, it is that simple.
To start learning and understanding RAG-based systems, you don't need a vector store, or even an LLM (at least to learn and understand conceptually).
While it is often presented as complicated, it doesn't have to be.
We will perform the following steps in sequence.
- Receive user input
- Perform our similarity measure
- Post-process the user input and the retrieved documents.
The post-processing is done with an LLM.
The actual RAG paper is obviously the resource. The problem is that it assumes A LOT of context. It's also more complicated than we need.
For example, here is the overview of the RAG system proposed in the paper.
That's dense.
It's great for researchers, but for the rest of us it will be much easier to learn step by step by building the system ourselves.
Let's go back to building RAG from scratch, step by step. These are the simplified steps we will work through. While this is not technically "RAG", it is a good simplified model to learn with and will allow us to progress to more complicated variations.
Below you can see that we have a simple corpus of 'documents' (please be generous).
corpus_of_documents = [
"Take a leisurely walk in the park and enjoy the fresh air.",
"Visit a local museum and discover something new.",
"Attend a live music concert and feel the rhythm.",
"Go for a hike and admire the natural scenery.",
"Have a picnic with friends and share some laughs.",
"Explore a new cuisine by dining at an ethnic restaurant.",
"Take a yoga class and stretch your body and mind.",
"Join a local sports league and enjoy some friendly competition.",
"Attend a workshop or lecture on a topic you're interested in.",
"Visit an amusement park and ride the roller coasters."
]
Now we need a way to measure the similarity between the user input we are going to receive and the collection of documents we have organized. Arguably the simplest similarity measure is Jaccard similarity. I've written about it in the past (see this post), but the short answer is that Jaccard similarity is the size of the intersection divided by the size of the union of the "sets" of words.
This allows us to compare our users' contributions with the original documents.
Side note: preprocessing
One challenge is that if we have a plain string like "Take a leisurely walk in the park and enjoy the fresh air.", we will have to preprocess it into a set to be able to make these comparisons. We will do this in the simplest way possible: lowercase the text and split it on " ".
def jaccard_similarity(query, document):
query = query.lower().split(" ")
document = document.lower().split(" ")
intersection = set(query).intersection(set(document))
union = set(query).union(set(document))
return len(intersection)/len(union)
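As a quick sanity check, we can run the function on a user-style query and one of our documents; only the word "hike" is shared between the two word sets, so the score is low.
# Quick sanity check: only "hike" appears in both word sets,
# so the score is 1/12 ≈ 0.083.
print(jaccard_similarity("I like to hike",
                         "Go for a hike and admire the natural scenery."))
# 0.08333333333333333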
Now we need to define a function that takes the exact query and our corpus and selects the “best” document to return to the user.
def return_response(query, corpus):
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    return corpus_of_documents[similarities.index(max(similarities))]
Now that we can run it, we'll start with a simple example prompt.
user_prompt = "What is a leisure activity that you like?"
And a simple user input…
user_input = "I like to hike"
Now we can return our response.
return_response(user_input, corpus_of_documents)
'Go for a hike and admire the natural scenery.'
Congratulations, you have created a basic RAG application.
I have 99 problems and one of them is bad similarity.
We opted for a simple similarity measure for learning purposes. But this is going to be problematic because it is so simple. It has no notion of semantics. It simply looks at which words appear in both documents. That means that if we provide a negative example, we will still get the same "result" because that remains the closest document.
user_input = "I don't like to hike"
return_response(user_input, corpus_of_documents)
'Go for a hike and admire the natural scenery.'
This is a topic that will come up a lot with “RAG,” but for now, rest assured we will address this issue later.
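As a preview of one direction that fix can take, the word-overlap measure can be swapped for an embedding-based one that compares meaning rather than exact tokens. The sketch below is purely illustrative and assumes the optional sentence-transformers package and its all-MiniLM-L6-v2 model; it is not part of the minimal setup we are building here.
# A sketch of a semantic alternative to Jaccard similarity. Assumes the
# optional sentence-transformers package (pip install sentence-transformers);
# it is an illustration, not part of the minimal tutorial setup.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def return_response_semantic(query, corpus):
    # Embed the query and every document, then return the closest document.
    query_embedding = model.encode(query)
    document_embeddings = model.encode(corpus)
    scores = [cosine_similarity(query_embedding, d) for d in document_embeddings]
    return corpus[scores.index(max(scores))]
Note that even embeddings will not magically handle negation like "I don't like to hike"; that usually takes more than a better similarity measure.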
At this point, we have not done any post-processing of the "document" we are responding with. So far, we have implemented only the "retrieval" part of "Retrieval-Augmented Generation". The next step is to augment generation by incorporating a large language model (LLM).
To do this, we are going to use Ollama (https://ollama.ai/) to run an open-source LLM on our local machine. We could easily use OpenAI's GPT-4 or Anthropic's Claude, but for now we'll start with the open-source llama2 from Meta AI (https://ai.meta.com/llama/).
This post will assume some basic knowledge of large language models, so let's start querying this model.
import requests
import json
First let's define the inputs. To work with this model, we are going to:
- take the user input,
- find the most similar document (as measured by our similarity measure),
- pass that document in a prompt to the language model,
- and then return the result to the user.
This introduces a new term, the prompt. In short, a prompt is the set of instructions you provide to the LLM.
When you run this code, you will see the streaming result. Streaming is important for user experience.
user_input = "I like to hike"
relevant_document = return_response(user_input, corpus_of_documents)
full_response = []
prompt = """
You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input.
"""
Having defined that, let's now make the API call to Ollama (and llama2). An important step is to make sure that Ollama is already running on your local machine by executing ollama serve.
Note: this may be slow on your machine, it certainly is on mine. Be patient, young grasshopper.
url = 'http://localhost:11434/api/generate'
data = {
"model": "llama2",
"prompt": prompt.format(user_input=user_input, relevant_document=relevant_document)
}
headers = {'Content-Type': 'application/json'}
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
try:
    count = 0
    for line in response.iter_lines():
        # filter out keep-alive new lines
        # count += 1
        # if count % 5 == 0:
        #     print(decoded_line['response'])  # print every fifth token
        if line:
            decoded_line = json.loads(line.decode('utf-8'))
            full_response.append(decoded_line['response'])
finally:
    response.close()
print(''.join(full_response))
Great! Based on your interest in hiking, I recommend trying out the nearby trails for a challenging and rewarding experience with breathtaking views Great! Based on your interest in hiking, I recommend checking out the nearby trails for a fun and challenging adventure.
This gives us a complete RAG application, from scratch, with no providers and no services. You now know all the components of a retrieval-augmented generation application. Visually, this is what we have built.
The LLM will (if you're lucky) handle user input that goes against the recommended document. We can see that below.
user_input = "I don't like to hike"
relevant_document = return_response(user_input, corpus_of_documents)
# https://github.com/jmorganca/ollama/blob/main/docs/api.md
full_response = []
prompt = """
You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input.
"""
url = 'http://localhost:11434/api/generate'
data = {
"model": "llama2",
"prompt": prompt.format(user_input=user_input, relevant_document=relevant_document)
}
headers = {'Content-Type': 'application/json'}
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
try:
    for line in response.iter_lines():
        # filter out keep-alive new lines
        if line:
            decoded_line = json.loads(line.decode('utf-8'))
            # print(decoded_line['response'])  # uncomment to print results token by token
            full_response.append(decoded_line['response'])
finally:
    response.close()
print(''.join(full_response))
Sure, here is my response:Try kayaking instead! It's a great way to enjoy nature without having to hike.
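Before we look at where to improve, here is a compact recap that strings together everything built above into a single function: retrieve the most similar document, format the prompt, and generate a response with the local llama2 model via Ollama. This is just a consolidation sketch; it assumes corpus_of_documents, return_response, prompt, and the requests/json imports from earlier, plus a running ollama serve.
# Recap sketch: the full pipeline built above, in one function.
def rag(user_input, corpus):
    # 1. Retrieval: pick the most similar document with our Jaccard-based helper.
    relevant_document = return_response(user_input, corpus)
    # 2. Augmentation: paste that document and the user input into the prompt.
    data = {
        "model": "llama2",
        "prompt": prompt.format(user_input=user_input,
                                relevant_document=relevant_document)
    }
    # 3. Generation: stream the response from the local llama2 model via Ollama.
    response = requests.post('http://localhost:11434/api/generate',
                             data=json.dumps(data),
                             headers={'Content-Type': 'application/json'},
                             stream=True)
    full_response = []
    try:
        for line in response.iter_lines():
            if line:
                full_response.append(json.loads(line.decode('utf-8'))['response'])
    finally:
        response.close()
    return ''.join(full_response)

# print(rag("I like to hike", corpus_of_documents))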
If we go back to our RAG application diagram and think about what we just created, we will see several opportunities for improvement. These opportunities are where tools like vector stores, embeddings, and prompt "engineering" come in.
Here are some potential areas where we could improve the current setup:
- The number of documents: more documents can mean more recommendations.
- The depth/size of the documents: higher-quality content and longer documents with more information could be better.
- The number of documents we give to the LLM: right now we only give the LLM one document. We could feed in several as "context" and allow the model to provide a more personalized recommendation based on the user input.
- The parts of the documents that we give to the LLM: if we have larger or more complete documents, we may only want to add parts of those documents, parts of multiple documents, or some variation thereof. In the lexicon, this is called chunking (see the sketch after this list).
- Our document storage tool: we might store our documents in a different way or in a different database. In particular, if we have a lot of documents, we could explore storing them in a data lake or a vector store.
- The similarity measure: how we measure similarity matters; we may need to trade off performance against thoroughness (for example, examining every individual document).
- The preprocessing of documents and user input: we could do some additional preprocessing or augmentation of the user input before passing it to the similarity measure. For example, we could use an embedding to convert the input to a vector.
- The similarity measure itself: we can change the similarity measure to retrieve better or more relevant documents.
- The model: we can change the final model we use. We're using llama2 above, but we could just as easily use a model from OpenAI or Anthropic, such as GPT-4 or Claude.
- The prompt: we could use a different prompt for the LLM/model and tune it according to the output we want.
- Harmful or toxic output: if this is a concern, we could implement a sort of "circuit breaker" that checks the user input for toxic, harmful, or dangerous content. For example, in a healthcare context, you could check whether the information contains unsafe language and respond accordingly, outside of the typical flow.
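To make the chunking idea mentioned above a little more concrete, here is a minimal sketch that splits a long document into overlapping word-based chunks so retrieval can score chunks instead of whole documents. The word-based splitting and the chunk/overlap sizes are arbitrary choices for illustration, not a prescribed approach.
# A minimal, illustrative chunking helper: split a long document into
# overlapping word-based chunks so retrieval can score chunks rather than
# whole documents. The chunk and overlap sizes are arbitrary.
def chunk_document(document, chunk_size=50, overlap=10):
    words = document.split(" ")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk can then be scored with jaccard_similarity (or an embedding)
# exactly like the whole documents were above.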
The room for improvement is not limited to these points; the possibilities are wide, and we will delve into them in future tutorials. Until then, don't hesitate to reach out on Twitter if you have any questions. Happy RAGing :).
This post was originally published on learnbybuilding.ai (https://learnbybuilding.ai/tutorials/rag-from-scratch). In the coming months, I will be teaching a course on how to build generative AI products for product managers. Sign up here.