Introduction
LlamaIndex is a popular framework for building LLM applications. To build a robust application, we need to know how to count embedding tokens before creating them, make sure there are no duplicates in the vector store, get the source data for a generated response, and many other things. This article walks through the steps to create a resilient application using LlamaIndex.
Learning objectives
- Understand the essential components and features of the LlamaIndex framework to build robust LLM applications.
- Learn how to create and run an efficient ingestion pipeline to transform, analyze, and store documents.
- Learn how to initialize, save, and load documents and vector stores to effectively manage persistent data storage.
- Master creating indexes and using custom messages to facilitate efficient queries and ongoing interactions with chat engines.
Prerequisites
Below are some prerequisites for creating an application using LlamaIndex.
Store the OpenAI API key in a .env file and load it from there:
import os
from dotenv import load_dotenv

load_dotenv('.env')  # provide the path of the .env file
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
We will use Paul Graham's essay as an example document. It can be downloaded from here https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
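If you prefer to fetch the file programmatically, here is a minimal sketch using the requests library; the raw-file URL is an assumption derived from the GitHub path above.
import os
import requests

# download the essay into a local ./data folder (raw URL assumed from the repo path above)
os.makedirs("data", exist_ok=True)
url = ("https://raw.githubusercontent.com/run-llama/llama_index/main/"
       "docs/docs/examples/data/paul_graham/paul_graham_essay.txt")
with open("data/paul_graham_essay.txt", "w", encoding="utf-8") as f:
    f.write(requests.get(url).text)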
How to create an application using LlamaIndex
Load the data
The first step to creating an application using LlamaIndex is to load the data.
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_files=["./data/paul_graham_essay.txt"],
                                  filename_as_id=True).load_data(show_progress=True)
# 'documents' is a list containing the files we have loaded
Let's look at the keys of the document object.
documents[0].to_dict().keys()
# output
"""
dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys',
'excluded_llm_metadata_keys', 'relationships', 'text', 'start_char_idx',
'end_char_idx', 'text_template', 'metadata_template', 'metadata_seperator',
'class_name'])
"""
We can modify the values of those keys as we do with a dictionary. Let's look at an example with metadata.
If we want to add more information about the document, we can add it to the document metadata as follows. These metadata tags can later be used to filter documents (a filtering sketch follows the output below).
documents[0].metadata.update({'author': 'paul_graham'})
documents[0].metadata
# output
"""
{'file_path': 'data/paul_graham_essay.txt',
'file_name': 'paul_graham_essay.txt',
'file_type': 'text/plain',
'file_size': 75042,
'creation_date': '2024-04-16',
'last_modified_date': '2024-04-15',
'author': 'paul_graham'}
"""
Ingestion pipeline
Using the ingestion pipeline, we can perform all the data transformations, such as parsing the document into nodes, extracting metadata for the nodes, creating embeddings, storing the data in the document store, and storing the embeddings and text of the nodes in the vector store. This allows us to keep everything needed to make the data available for indexing in one place.
More importantly, using the document store and vector store ensures that no duplicate embeddings are created when we save and load them and run the ingestion process on the same documents.
Token counting
The next step in creating an application using LlamaIndex is token counting.
Import the dependencies:
import nest_asyncio
nest_asyncio.apply()
import tiktoken
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core import MockEmbedding
from llama_index.core.llms import MockLLM
from llama_index.core.node_parser import SentenceSplitter, HierarchicalNodeParser
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import TitleExtractor, SummaryExtractor
Initialize the token counter
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
    verbose=True
)
Now, we can move on to creating an ingest pipeline using MockEmbedding and MockLLM.
mock_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(llm=MockLLM(callback_manager=CallbackManager([token_counter]))),
        MockEmbedding(embed_dim=1536, callback_manager=CallbackManager([token_counter])),
    ]
)
nodes = mock_pipeline.run(documents=documents, show_progress=True, num_workers=-1)
The above code applies a sentence splitter to the documents to create nodes, then uses the mock LLM and embedding models for metadata extraction and embedding creation.
Then we can check the token counts.
# this returns the count of embedding tokens
token_counter.total_embedding_token_count
# this returns the count of llm tokens
token_counter.total_llm_token_count
# token counter is cumulative. When we want to set the token counts to zero, we can use this
token_counter.reset_counts()
We can try different node parsers and metadata extractors to determine how many tokens will be needed.
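For example, here is a rough sketch of the same mock run using the HierarchicalNodeParser imported above; the chunk sizes are arbitrary choices for illustration.
# reset the counter, then estimate tokens for a hierarchical split
token_counter.reset_counts()
hierarchical_pipeline = IngestionPipeline(
    transformations=[
        HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128]),
        MockEmbedding(embed_dim=1536, callback_manager=CallbackManager([token_counter])),
    ]
)
hierarchical_nodes = hierarchical_pipeline.run(documents=documents, show_progress=True)
token_counter.total_embedding_token_count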
Create document and vector stores
The next step in creating an application using LlamaIndex is to create document and vector stores.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
Now we can initialize the document and vector stores.
doc_store = SimpleDocumentStore()
# specify the path where the vector store will be saved
chroma_client = chromadb.PersistentClient(path="./chroma_db")
# create the collection if it doesn't already exist
chroma_collection = chroma_client.get_or_create_collection("paul_essay")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=128),
        OpenAIEmbedding(model_name="text-embedding-3-small",
                        callback_manager=CallbackManager([token_counter])),
    ],
    docstore=doc_store,
    vector_store=vector_store
)
nodes = pipeline.run(documents=documents, show_progress=True, num_workers=-1)
Once we run the pipeline, the embeddings of the nodes are stored in the vector store. We also need to save the document store.
doc_store.persist('./document storage/doc_store.json')
# we can also check the embedding token count
token_counter.total_embedding_token_count
Now we can restart the kernel to load the saved stores.
Load document and vector stores
Now, let's import the necessary methods, as mentioned above.
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# load the document store
doc_store = SimpleDocumentStore.from_persist_path('./document storage/doc_store.json')
# load the vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("paul_essay")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
Now, initialize the same pipeline again and run it. It will not create any embeddings, because the documents have already been processed and stored. If we add a new document to the folder, load all the documents, and run the pipeline again, embeddings are created only for the new document, as sketched below.
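Here is a minimal sketch of that re-run, assuming any new files are dropped into the same ./data folder and reusing the transformations from before (after a kernel restart, the earlier imports are repeated here).
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.openai import OpenAIEmbedding

# reload all documents (old and new) and re-run the same pipeline;
# only documents whose hashes are not already in the doc store get embedded
documents = SimpleDirectoryReader(input_dir="./data",
                                  filename_as_id=True).load_data(show_progress=True)
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=128),
                     OpenAIEmbedding(model_name="text-embedding-3-small")],
    docstore=doc_store,
    vector_store=vector_store
)
nodes = pipeline.run(documents=documents, show_progress=True)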
We can verify this with the following:
# hash of the document
documents[0].hash
# you can get the doc name from the doc_store
for i in doc_store.docs.keys():
    print(i)
# hash of the doc in the doc store
doc_store.docs['data/paul_graham_essay.txt'].hash
# when both of these hashes match, duplicate embeddings are not created
Search the vector store
Let's see what is stored in the vector store.
chroma_collection.get().keys()
# output
# dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])
chroma_collection.get()['metadatas'][0].keys()
# output
# dict_keys(['_node_content', '_node_type', 'creation_date', 'doc_id',
#  'document_id', 'file_name', 'file_path', 'file_size',
#  'file_type', 'last_modified_date', 'ref_doc_id'])
# this will return ids, metadatas, and documents of the nodes in the collection
chroma_collection.get()
How do we know which node corresponds to which document? We can look at the _node_content metadata:
import json

ids = chroma_collection.get()['ids']
# this will print the doc name for each node
for i in ids:
    data = json.loads(chroma_collection.get(i)['metadatas'][0]['_node_content'])
    print(data['relationships']['1']['node_id'])
# this will include the embeddings of the node along with metadata and text
chroma_collection.get(ids=ids[0], include=['embeddings', 'metadatas', 'documents'])
# we can also filter the collection
chroma_collection.get(ids=ids, where={'file_size': {'$gt': 75040}},
                      where_document={'$contains': 'paul'},
                      include=['metadatas', 'documents'])
Querying
from llama_index.llms.openai import OpenAI
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import VectorStoreIndex, get_response_synthesizer
from llama_index.core.response_synthesizers.type import ResponseMode
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.chat_engine import (ContextChatEngine,
CondenseQuestionChatEngine, CondensePlusContextChatEngine)
from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core import PromptTemplate
from llama_index.core.chat_engine.types import ChatMode
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate
Now we can create an index from the vector store. An index is a data structure that facilitates quick retrieval of the context relevant to a user's query.
# define the index
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# define a retriever
retriever = VectorIndexRetriever(index=index, similarity_top_k=3)
In the above code, the retriever retrieves the top 3 nodes similar to the query we provided.
If we want the LLM to answer the query based solely on the context provided and nothing else, we can use custom prompts accordingly.
qa_prompt_str = (
"Context information is below.\n"
"---------------------\n"
"{context_str}\n"
"---------------------\n"
"Given the context information and not prior knowledge, "
"answer the question: {query_str}\n"
)
chat_text_qa_msgs = [
    ChatMessage(role=MessageRole.SYSTEM,
                content=("Only answer the question if it can be answered with the given context. "
                         "Otherwise, say that the question can't be answered using the context.")),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]
text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
Now we can define the response synthesizer, which passes the context and the query to the LLM to get the response. We can also add the token counter as a callback handler to keep track of the tokens used.
gpt_3_5 = OpenAI(model="gpt-3.5-turbo")
response_synthesizer = get_response_synthesizer(llm=gpt_3_5, response_mode=ResponseMode.COMPACT,
                                                text_qa_template=text_qa_template,
                                                callback_manager=CallbackManager([token_counter]))
Now we can combine the retriever and the response synthesizer as a query engine that accepts the query.
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer)
# ask a query
response = query_engine.query("who is paul graham?")
# response text
response.response
To find out what text is used to generate this response, we can use the following code
for i, node in enumerate(response.source_nodes):
    print(f"text of the node {i}")
    print(node.text)
    print("------------------------------------\n")
Similarly, we can try different query engines.
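For instance, here is a hedged sketch that swaps in a tree-summarize response mode over the same retriever; the example query is arbitrary.
# same retriever, different response synthesis strategy
tree_synthesizer = get_response_synthesizer(llm=gpt_3_5,
                                            response_mode=ResponseMode.TREE_SUMMARIZE,
                                            callback_manager=CallbackManager([token_counter]))
tree_query_engine = RetrieverQueryEngine(retriever=retriever,
                                         response_synthesizer=tree_synthesizer)
print(tree_query_engine.query("What did Paul Graham work on?").response)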
Chatting
If we want to converse with our data, we need to store previous queries and responses instead of making isolated queries.
chat_store = SimpleChatStore()
chat_memory = ChatMemoryBuffer.from_defaults(token_limit=5000, chat_store=chat_store, llm=gpt_3_5)
system_prompt = "Answer the question only based on the context provided"
chat_engine = CondensePlusContextChatEngine(retriever=retriever,
llm=gpt_3_5, system_prompt=system_prompt, memory=chat_memory)
In the code above, we initialize chat_store and create the chat_memory object with a token limit of 5000. We can also provide a system_prompt and other prompts.
Then, we create a chat engine that also takes the retriever and chat_memory.
We can get the answer as follows.
streaming_response = chat_engine.stream_chat("Who is Paul Graham?")
for token in streaming_response.response_gen:
    print(token, end="")
We can read the chat history with the following code:
for i in chat_memory.chat_store.store['chat_history']:
    print(i.role.name)
    print(i.content)
Now we can save and restore chat_store as needed.
chat_store.persist(persist_path="chat_store.json")
chat_store = SimpleChatStore.from_persist_path(
persist_path="chat_store.json"
)
This way, we can build robust RAG applications using the LlamaIndex framework and experiment with various advanced retrievers and rerankers, for example by attaching a node post-processor as sketched below.
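As one hedged example, a node post-processor can be attached to the query engine built earlier; SimilarityPostprocessor comes from llama_index.core.postprocessor, and the similarity cutoff used here is an arbitrary choice.
from llama_index.core.postprocessor import SimilarityPostprocessor

# drop retrieved nodes whose similarity score falls below the cutoff before synthesis
filtered_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)])
print(filtered_query_engine.query("who is paul graham?").response)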
Read also: Build a RAG pipeline with LlamaIndex
Conclusion
The LlamaIndex framework offers a comprehensive solution for building resilient LLM applications, ensuring efficient data handling, persistent storage, and enhanced query capabilities. It is a valuable tool for developers working with large language models. The key takeaways from this guide on LlamaIndex are:
- The LlamaIndex framework enables robust data ingestion pipelines, ensuring organized document analysis, metadata extraction, and embedding creation, while avoiding duplicates.
- By effectively managing document and vector stores, LlamaIndex ensures data consistency and facilitates the retrieval and storage of document embeddings and metadata.
- The framework supports the creation of custom query engines and indexes, enabling rapid retrieval of context for user queries and continued interactions via chat engines.
Frequently asked questions
Q. What is the LlamaIndex framework used for?
A. The LlamaIndex framework is designed to build robust LLM applications. It provides tools for efficient data ingestion, storage, and retrieval, ensuring organized and resilient handling of large language models.
Q. How does LlamaIndex prevent duplicate embeddings?
A. LlamaIndex prevents duplicate embeddings by using document and vector stores to check existing embeddings before creating new ones, ensuring that each document is processed only once.
Q. What types of documents can LlamaIndex handle?
A. LlamaIndex can handle various types of documents by parsing them into nodes, extracting metadata, and creating embeddings, making it versatile for different data sources.
Q. How does LlamaIndex support ongoing conversations with data?
A. LlamaIndex supports continuous interaction through chat engines, which store and use chat history, allowing for continuous and contextual conversations with data.