Imports
We start by installing and importing the necessary Python libraries.
!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes
# if not running on Colab ensure transformers is installed too
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
Set up the knowledge base
We configure our knowledge base by defining our embedding model, chunk size, and chunk overlap. Here we use BAAI's bge-small-en-v1.5 embedding model (~33M parameters), which is available on the Hugging Face Hub. Other embedding model options are available on this text embedding leaderboard.
# import any embedding model on HF hub
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

Settings.llm = None # we won't use LlamaIndex to set up LLM
Settings.chunk_size = 256
Settings.chunk_overlap = 25
Next, we load our source documents. Here I have a folder called “articles”, which contains PDF versions of 3 Medium articles I wrote about fat tails. If you run this in Colab, you need to download the articles folder from the GitHub repository and manually upload it to your Colab environment.
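If you'd rather not upload the folder by hand, you can also pull it straight into Colab. The snippet below is a minimal sketch; the repository URL is a placeholder, so point it at wherever the articles folder actually lives.
# optional: clone the repo in Colab instead of uploading manually
# NOTE: placeholder URL -- replace with the actual repository
!git clone https://github.com/your-username/your-rag-repo.git
!cp -r your-rag-repo/articles .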
For each file in this folder, the following function will read the text from the PDF, split it into chunks (based on the settings defined above), and store each chunk in a list called documents.
documents = SimpleDirectoryReader("articles").load_data()
Since the blogs were downloaded directly as PDF files from Medium, they look more like a web page than a well-formatted article. Therefore, some snippets may include text unrelated to the article, for example web page headers and Medium article recommendations.
In the following code block, we refine the document chunks, removing most of the chunks that appear before or after each article's main content.
print(len(documents)) # prints: 71

for doc in documents:
    if "Member-only story" in doc.text:
        documents.remove(doc)
        continue

    if "The Data Entrepreneurs" in doc.text:
        documents.remove(doc)

    if " min read" in doc.text:
        documents.remove(doc)

print(len(documents)) # prints: 61
Finally, we can store the refined chunks in a vector store.
index = VectorStoreIndex.from_documents(documents)
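As a side note, if you plan to reuse the knowledge base across sessions, you don't have to re-embed the articles every time. Here is a minimal sketch of persisting the index to disk and reloading it later (the "storage" directory name is arbitrary).
# persist the index so embeddings don't need to be recomputed
index.storage_context.persist(persist_dir="storage")

# reload it in a later session
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="storage")
index = load_index_from_storage(storage_context)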
Set up a retriever
With our knowledge base in place, we can create a retriever using LlamaIndex's VectorIndexRetriever(), which returns the 3 chunks most similar to a user query.
# set number of docs to retrieve
top_k = 3

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)
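Before wiring the retriever into a query engine, it can be helpful to sanity-check it on its own. Here's a quick sketch (the sample query is arbitrary) that calls the retriever directly and prints each chunk's similarity score along with the first few characters of its text.
# quick sanity check: retrieve chunks for a sample query
nodes = retriever.retrieve("What is fat-tailedness?")
for node in nodes:
    print(node.score, "|", node.node.text[:80].replace("\n", " "))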
Next, we define a query engine that uses the retriever and the query to return a set of relevant chunks.
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
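For reference, LlamaIndex can also build a roughly equivalent engine in one step via index.as_query_engine(); I use the explicit assembly above because it makes each moving part (retriever, postprocessor) visible. A sketch of the compact version:
# compact, roughly equivalent construction
query_engine_alt = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)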
Use query engine
Now, with our knowledge base and retrieval system configured, let's use them to return chunks relevant to a query. Here, we'll pass the same technical question we asked ShawGPT (the YouTube comment responder) in the previous article.
query = "What is fat-tailedness?"
response = query_engine.query(query)
The query engine returns a response object containing the text, metadata, and indices of the relevant chunks. The following code block returns a more readable version of this information.
# reformat response
context = "Context:\n"
for i in range(top_k):
    context = context + response.source_nodes[i].text + "\n\n"

print(context)
Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma (2).
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
rare events drive the aggregate statistics of a distribution. From this point of
view, fat-tailedness lives on a spectrum from not fat-tailed (i.e. a Gaussian) to
very fat-tailed (i.e. Pareto 80 – 20).
This maps directly to the idea of Mediocristan vs Extremistan discussed
earlier. The image below visualizes different distributions across this
conceptual landscape (2).

print("mean kappa_1n = " + str(np.mean(kappa_dict[filename])))
print("")
Mean κ (1,100) values from 1000 runs for each dataset. Image by author.
These more stable results indicate Medium followers are the most fat-tailed,
followed by LinkedIn Impressions and YouTube earnings.
Note: One can compare these values to Table III in ref (3) to better understand each
κ value. Namely, these values are comparable to a Pareto distribution with α
between 2 and 3.
Although each heuristic told a slightly different story, all signs point toward
Medium followers gained being the most fat-tailed of the 3 datasets.
Conclusion
While binary labeling data as fat-tailed (or not) may be tempting, fat-
tailedness lives on a spectrum. Here, we broke down 4 heuristics for
quantifying how fat-tailed data are.
Pareto, Power Laws, and Fat Tails
What they don’t teach you in statistics
towardsdatascience.com
Although Pareto (and more generally power law) distributions give us a
salient example of fat tails, this is a more general notion that lives on a
spectrum ranging from thin-tailed (i.e. a Gaussian) to very fat-tailed (i.e.
Pareto 80 – 20).
The spectrum of Fat-tailedness. Image by author.
This view of fat-tailedness provides us with a more flexible and precise way of
categorizing data than simply labeling it as a Power Law (or not). However,
this begs the question: how do we define fat-tailedness?
4 Ways to Quantify Fat Tails
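One caveat about the reformatting loop above: because the query engine applies a similarity cutoff, it can return fewer than top_k chunks, in which case indexing with range(top_k) would raise an error. A slightly more defensive version of the same loop iterates over whatever was actually returned.
# safer variant: loop over however many chunks passed the similarity cutoff
context = "Context:\n"
for node in response.source_nodes:
    context = context + node.text + "\n\n"

print(context)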
Add RAG to LLM
We start by downloading the fine-tuned model from the Hugging Face Hub.
# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
As a starting point, we can see how the model answers the technical question without any context from the articles. To do this, we create a prompt template using a lambda function, which takes a viewer comment and returns a prompt for the LLM. For more details on the origin of this prompt, see the previous article in this series.
# prompt (no context)
intstructions_string = f"""ShawGPT, functioning as a virtual data science \
consultant on YouTube, communicates in clear, accessible language, escalating \
to technical depth upon request. It reacts to feedback aptly and ends \
responses with its signature '–ShawGPT'.

ShawGPT will tailor the length of its responses to match the viewer's comment, \
providing concise acknowledgments to brief expressions of gratitude or \
feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

prompt_template = lambda comment: f'''[INST] {intstructions_string} \n{comment} \n[/INST]'''
comment = "What is fat-tailedness?"
prompt = prompt_template(comment)
print(prompt)
[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.

What is fat-tailedness?
[/INST]
We can then pass this prompt to the model using the following code.
model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
                         max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])
Here is the model's response (without context).
Great question!

Fat-tailedness is a statistical property of a distribution. In simple terms,
it refers to the presence of extreme outliers or heavy tails in the
distribution.
For instance, consider the distribution of heights in a population. A normal
distribution would have most people clustered around an average height with a
few people deviating slightly from the mean. However, in a fat-tailed
distribution, you would observe a larger number of people being significantly
taller or shorter than the average.
In finance, fat-tailedness is often used to describe the presence of extreme
market moves or outliers in financial returns. It's an important concept in
risk management and financial modeling.
I hope this explanation helps! Let me know if you have any questions.
–ShawGPT
Although the style and format of the answer are excellent, the model's explanation of fat-tailedness differs from how I defined it in my video and blog series.
Let's see what happens to the model's response when we include the appropriate context. To do this, we create another prompt template, which can also take in context from the RAG system.
# prompt (with context)
prompt_template_w_context = lambda context, comment: f"""[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

{context}
Please respond to the following comment. Use the context above if it is helpful.

{comment}
[/INST]
"""
We then pass the prompt with context from the query engine, along with the viewer comment, to the model.
prompt = prompt_template_w_context(context, comment)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])
Here is the new answer (with context).
Great question!

Fat-tailedness is a property of a distribution that describes how much the
presence of extreme values (rare events) affects the overall statistics of the
distribution.
In the context of the video, the author discusses how Mediocristan and
Extremistan distributions differ in terms of fat-tailedness. Mediocristan
distributions have a relatively small impact from extreme values, while
Extremistan distributions are heavily influenced by extreme values.
The author also mentions that fat-tailedness is a more general concept than
specific distributions like Pareto or Power Laws. Instead, it lives on a
spectrum ranging from thin-tailed (Gaussian) to very fat-tailed (Pareto 80-20).
I hope that helps clarify things a bit! Let me know if you have any questions.
–ShawGPT
This captures my explanation of fat tails much better than the answer without context and even highlights the specific concepts of Mediocristan and Extremistan.
Here, I provided an introduction to RAG for beginners and shared a concrete example of how to implement it using LlamaIndex. RAG allows us to enhance an LLM system with domain-specific, updatable knowledge.
While much of the recent hype around AI has focused on the creation of AI assistants, a powerful (if less popular) innovation comes from text embeddings (i.e., the things we used to do retrieval). In the next article of this series, I will explore text embeddings in more detail, including how they can be used for semantic search and classification tasks.
More about LLMs