Introduction
LlamaIndex is a popular framework for building LLM applications. To build a robust application, we need to know how to count embedding tokens before creating them, make sure there are no duplicates in the vector store, get the source data for a generated response, and many other things. This article walks through the steps to create a resilient application using LlamaIndex.
Learning objectives
- Understand the essential components and features of the LlamaIndex framework to build robust LLM applications.
- Learn how to create and run an efficient ingestion pipeline to transform, analyze, and store documents.
- Learn how to initialize, save, and load documents and vector stores to effectively manage persistent data storage.
- Master creating indexes and using custom messages to facilitate efficient queries and ongoing interactions with chat engines.
Prerequisites
Below are some prerequisites for creating an application using LlamaIndex.
Store the OpenAI API key in a .env file and load it from there:
import os
from dotenv import load_dotenv

load_dotenv('.env')  # provide the path of the .env file
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
We will use Paul Graham's essay as an example document. It can be downloaded from here https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
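If you prefer to fetch the file programmatically, here is a minimal sketch using the requests library; the raw-file URL is an assumption derived from the GitHub path above.
import os
import requests

# download the essay into a local ./data folder (raw URL assumed from the repo path above)
os.makedirs("data", exist_ok=True)
url = ("https://raw.githubusercontent.com/run-llama/llama_index/main/"
       "docs/docs/examples/data/paul_graham/paul_graham_essay.txt")
with open("data/paul_graham_essay.txt", "w", encoding="utf-8") as f:
    f.write(requests.get(url).text)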
How to create an application using LlamaIndex
Load the data
The first step to creating an application using LlamaIndex is to load the data.
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_files=["./data/paul_graham_essay.txt"],
                                  filename_as_id=True).load_data(show_progress=True)
# 'documents' is a list containing the files we have loaded
Let's look at the keys of the document object.
documents[0].to_dict().keys()
# output
"""
dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys',
'excluded_llm_metadata_keys', 'relationships', 'text', 'start_char_idx',
'end_char_idx', 'text_template', 'metadata_template', 'metadata_seperator',
'class_name'])
"""
We can modify the values of those keys as we do with a dictionary. Let's look at an example with metadata.
If we want to add more information about the document, we can add it to the document metadata as follows. These metadata tags can later be used to filter documents (a filtering sketch follows the output below).
documents[0].metadata.update({'author': 'paul_graham'})
documents[0].metadata
# output
"""
{'file_path': 'data/paul_graham_essay.txt',
'file_name': 'paul_graham_essay.txt',
'file_type': 'text/plain',
'file_size': 75042,
'creation_date': '2024-04-16',
'last_modified_date': '2024-04-15',
'author': 'paul_graham'}
"""
Ingestion pipeline
Using the ingestion pipeline, we can perform all the data transformations, such as parsing the document into nodes, extracting metadata for the nodes, creating embeddings, storing the data in the document store, and storing the embeddings and text of the nodes in the vector store. This allows us to keep everything needed to make the data available for indexing in one place.
More importantly, using the document store and vector store ensures that no duplicate embeddings are created when we save and load them and run the ingestion process on the same documents.
Token counting
The next step in creating an application using LlamaIndex is token counting.
Import the dependencies:
import nest_asyncio
nest_asyncio.apply()
import tiktoken
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core import MockEmbedding
from llama_index.core.llms import MockLLM
from llama_index.core.node_parser import SentenceSplitter, HierarchicalNodeParser
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import TitleExtractor, SummaryExtractor
Initialize the token counter
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
    verbose=True
)
Now, we can move on to creating an ingest pipeline using MockEmbedding and MockLLM.
mock_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(llm=MockLLM(callback_manager=CallbackManager([token_counter]))),
        MockEmbedding(embed_dim=1536, callback_manager=CallbackManager([token_counter])),
    ]
)
nodes = mock_pipeline.run(documents=documents, show_progress=True, num_workers=-1)
The above code applies a sentence splitter to the documents to create nodes, then uses the mock LLM and embedding models for metadata extraction and embedding creation.
Then we can check the token counts.
# this returns the count of embedding tokens
token_counter.total_embedding_token_count
# this returns the count of llm tokens
token_counter.total_llm_token_count
# token counter is cumulative. When we want to set the token counts to zero, we can use this
token_counter.reset_counts()
We can try different node parsers and metadata extractors to determine how many tokens will be needed.
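For example, here is a rough sketch of the same mock run using the HierarchicalNodeParser imported above; the chunk sizes are arbitrary choices for illustration.
# reset the counter, then estimate tokens for a hierarchical split
token_counter.reset_counts()
hierarchical_pipeline = IngestionPipeline(
    transformations=[
        HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128]),
        MockEmbedding(embed_dim=1536, callback_manager=CallbackManager([token_counter])),
    ]
)
hierarchical_nodes = hierarchical_pipeline.run(documents=documents, show_progress=True)
token_counter.total_embedding_token_count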
Create document and vector stores
The next step in creating an application using LlamaIndex is to create document and vector stores.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
Now we can initialize the document and vector stores.
doc_store = SimpleDocumentStore()
# specify the path where the vector store will be saved
chroma_client = chromadb.PersistentClient(path="./chroma_db")
# create the collection if it doesn't already exist
chroma_collection = chroma_client.get_or_create_collection("paul_essay")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=128),
        OpenAIEmbedding(model_name="text-embedding-3-small",
                        callback_manager=CallbackManager([token_counter])),
    ],
    docstore=doc_store,
    vector_store=vector_store
)
nodes = pipeline.run(documents=documents, show_progress=True, num_workers=-1)
Once we run the pipeline, the embeddings of the nodes are stored in the vector store. We also need to save the document store.
doc_store.persist('./document storage/doc_store.json')
# we can also check the embedding token count
token_counter.total_embedding_token_count
Now we can restart the kernel to load the saved stores.
Load document and vector stores
Now, let's import the necessary methods, as mentioned above.
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# load the document store
doc_store = SimpleDocumentStore.from_persist_path('./document storage/doc_store.json')
# load the vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("paul_essay")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
Now, initialize the same pipeline again and run it. It will not create any embeddings, because the documents have already been processed and stored. If we add a new document to the folder, load all the documents, and run the pipeline again, embeddings are created only for the new document, as sketched below.
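Here is a minimal sketch of that re-run, assuming any new files are dropped into the same ./data folder and reusing the transformations from before (after a kernel restart, the earlier imports are repeated here).
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.openai import OpenAIEmbedding

# reload all documents (old and new) and re-run the same pipeline;
# only documents whose hashes are not already in the doc store get embedded
documents = SimpleDirectoryReader(input_dir="./data",
                                  filename_as_id=True).load_data(show_progress=True)
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=128),
                     OpenAIEmbedding(model_name="text-embedding-3-small")],
    docstore=doc_store,
    vector_store=vector_store
)
nodes = pipeline.run(documents=documents, show_progress=True)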
We can verify this with the following:
# hash of the document
documents[0].hash
# you can get the doc name from the doc_store
for i in doc_store.docs.keys():
    print(i)
# hash of the doc in the doc store
doc_store.docs['data/paul_graham_essay.txt'].hash
# when both of these hashes match, duplicate embeddings are not created
Search the vector store
Let's see what is stored in the vector store.
chroma_collection.get().keys()
# output
# dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])
chroma_collection.get()['metadatas'][0].keys()
# output
# dict_keys(['_node_content', '_node_type', 'creation_date', 'doc_id',
#  'document_id', 'file_name', 'file_path', 'file_size',
#  'file_type', 'last_modified_date', 'ref_doc_id'])
# this will return ids, metadatas, and documents of the nodes in the collection
chroma_collection.get()
How do we know which node corresponds to which document? We can look at the _node_content metadata:
import json

ids = chroma_collection.get()['ids']
# this will print the doc name for each node
for i in ids:
    data = json.loads(chroma_collection.get(i)['metadatas'][0]['_node_content'])
    print(data['relationships']['1']['node_id'])
# this will include the embeddings of the node along with metadata and text
chroma_collection.get(ids=ids[0], include=['embeddings', 'metadatas', 'documents'])
# we can also filter the collection
chroma_collection.get(ids=ids, where={'file_size': {'$gt': 75040}},
                      where_document={'$contains': 'paul'},
                      include=['metadatas', 'documents'])
Querying
from llama_index.llms.openai import OpenAI
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import VectorStoreIndex, get_response_synthesizer
from llama_index.core.response_synthesizers.type import ResponseMode
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.chat_engine import (ContextChatEngine,
CondenseQuestionChatEngine, CondensePlusContextChatEngine)
from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core import PromptTemplate
from llama_index.core.chat_engine.types import ChatMode
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate
Now we can create an index from the vector store. An index is a data structure that facilitates quick retrieval of the context relevant to a user's query.
# define the index
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# define a retriever
retriever = VectorIndexRetriever(index=index, similarity_top_k=3)
In the above code, the retriever retrieves the top 3 nodes similar to the query we provided.
If we want the LLM to answer the query based solely on the context provided and nothing else, we can use custom prompts accordingly.
qa_prompt_str = (
"Context information is below.\n"
"---------------------\n"
"{context_str}\n"
"---------------------\n"
"Given the context information and not prior knowledge, "
"answer the question: {query_str}\n"
)
chat_text_qa_msgs = [
    ChatMessage(role=MessageRole.SYSTEM,
                content=("Only answer the question if it can be answered with the given context. "
                         "Otherwise, say that the question can't be answered using the context.")),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]
text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
Now we can define the response synthesizer, which passes the context and the query to the LLM to get the response. We can also add the token counter as a callback handler to keep track of the tokens used.
gpt_3_5 = OpenAI(model="gpt-3.5-turbo")
response_synthesizer = get_response_synthesizer(llm=gpt_3_5, response_mode=ResponseMode.COMPACT,
                                                text_qa_template=text_qa_template,
                                                callback_manager=CallbackManager([token_counter]))
Now we can combine the retriever and the response synthesizer as a query engine that accepts the query.
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer)
# ask a query
response = query_engine.query("who is paul graham?")
# response text
response.response
To find out what text is used to generate this response, we can use the following code
for i, node in enumerate(response.source_nodes):
    print(f"text of the node {i}")
    print(node.text)
    print("------------------------------------\n")
Similarly, we can try different query engines.
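For instance, here is a hedged sketch that swaps in a tree-summarize response mode over the same retriever; the example query is arbitrary.
# same retriever, different response synthesis strategy
tree_synthesizer = get_response_synthesizer(llm=gpt_3_5,
                                            response_mode=ResponseMode.TREE_SUMMARIZE,
                                            callback_manager=CallbackManager([token_counter]))
tree_query_engine = RetrieverQueryEngine(retriever=retriever,
                                         response_synthesizer=tree_synthesizer)
print(tree_query_engine.query("What did Paul Graham work on?").response)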
Chatting
If we want to converse with our data, we need to store previous queries and responses instead of making isolated queries.
chat_store = SimpleChatStore()
chat_memory = ChatMemoryBuffer.from_defaults(token_limit=5000, chat_store=chat_store, llm=gpt_3_5)
system_prompt = "Answer the question only based on the context provided"
chat_engine = CondensePlusContextChatEngine(retriever=retriever,
llm=gpt_3_5, system_prompt=system_prompt, memory=chat_memory)
In the code above, we initialize chat_store and create the chat_memory object with a token limit of 5000. We can also provide a system_prompt and other prompts.
Then, we create a chat engine that also takes the retriever and chat_memory.
We can get the answer as follows.
streaming_response = chat_engine.stream_chat("Who is Paul Graham?")
for token in streaming_response.response_gen:
    print(token, end="")
We can read the chat history with the following code:
for i in chat_memory.chat_store.store['chat_history']:
    print(i.role.name)
    print(i.content)
Now we can save and restore chat_store as needed.
chat_store.persist(persist_path="chat_store.json")
chat_store = SimpleChatStore.from_persist_path(
persist_path="chat_store.json"
)
This way, we can build robust RAG applications using the LlamaIndex framework and experiment with various advanced retrievers and rerankers, for example by attaching a node post-processor as sketched below.
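As one hedged example, a node post-processor can be attached to the query engine built earlier; SimilarityPostprocessor comes from llama_index.core.postprocessor, and the similarity cutoff used here is an arbitrary choice.
from llama_index.core.postprocessor import SimilarityPostprocessor

# drop retrieved nodes whose similarity score falls below the cutoff before synthesis
filtered_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)])
print(filtered_query_engine.query("who is paul graham?").response)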
Read also: Build a RAG pipeline with LlamaIndex
Conclusion
The LlamaIndex framework offers a comprehensive solution for building resilient LLM applications, ensuring efficient data handling, persistent storage, and enhanced query capabilities. It is a valuable tool for developers working with large language models. The key takeaways from this guide on LlamaIndex are:
- The LlamaIndex framework enables robust data ingestion pipelines, ensuring organized document analysis, metadata extraction, and embedding creation, while avoiding duplicates.
- By effectively managing document and vector stores, LlamaIndex ensures data consistency and facilitates the retrieval and storage of document embeddings and metadata.
- The framework supports the creation of custom query engines and indexes, enabling rapid retrieval of context for user queries and continued interactions via chat engines.
Frequently asked questions
Q. What is the LlamaIndex framework used for?
A. The LlamaIndex framework is designed to build robust LLM applications. It provides tools for efficient data ingestion, storage, and retrieval, ensuring organized and resilient handling of large language models.
Q. How does LlamaIndex prevent duplicate embeddings?
A. LlamaIndex prevents duplicate embeddings by using document and vector stores to check existing embeddings before creating new ones, ensuring that each document is processed only once.
Q. What types of documents can LlamaIndex handle?
A. LlamaIndex can handle various types of documents by parsing them into nodes, extracting metadata, and creating embeddings, making it versatile for different data sources.
Q. How does LlamaIndex support ongoing conversations with data?
A. LlamaIndex supports continuous interaction through chat engines, which store and use chat history, allowing for continuous and contextual conversations with data.