Extracting embedded objects with LlamaParse

Introduction

LlamaParse is a document parsing library developed by Llama Index to efficiently and effectively parse documents like PDF, PPT, etc.

Building RAG applications on top of PDF documents presents a significant challenge that many of us face, specifically the complex task of parsing embedded objects such as tables and figures. The nature of these objects often means that conventional parsing techniques have difficulty interpreting them and accurately extracting the information encoded in them.

The software development community has introduced several libraries and frameworks in response to this widespread problem. Examples of these solutions include LLMSherpa and unstructured.io. These tools provide robust and flexible solutions to some of the most persistent problems when analyzing complex PDF files.

The latest addition to this list of tools is LlamaParse. LlamaParse was developed by Llama Index, one of the most highly regarded LLM frameworks currently available. Because of this, LlamaParse can be integrated directly with Llama Index. This integration is a significant advantage, as it simplifies implementation and ensures a higher level of compatibility between the two tools. In short, LlamaParse is a promising new tool that makes parsing complex PDF files less daunting and more efficient.

Learning objectives

Recognize the challenges of document analysis: Understand the difficulties when parsing complex PDF files with embedded objects.
Introduction to LlamaParse: Find out what LlamaParse is and its seamless integration with Llama Index.
Configuration and initialization: Create a LlamaCloud account, obtain an API key and install the necessary libraries.
Implementing LlamaParse: Follow the steps to initialize the LLM, then upload and parse documents.
Create a vector index and query data: Learn how to create a vector store index, configure a query engine, and extract specific information from parsed documents.

This article was published as part of the Data Science Blogathon.

Steps to create a RAG application over PDF using LlamaParse

Step 1: Get the API Key

LlamaParse is part of the LlamaCloud platform, so you need to have a LlamaCloud account to get an API key.

First, you must create an account on LlamaCloud and sign in to create an API key.

Step 2: Install the necessary libraries

Now open your Jupyter Notebook/Colab and install the necessary libraries. Here, we only need to install two libraries: llama-index and llama-parse. We will use OpenAI models for querying and embedding.

!pip install llama-index
!pip install llama-parse
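If you want to confirm that the installation worked before continuing, a quick check like the one below can help. This is only a minimal sketch: the helper name `missing_modules` is my own and is not part of either library, and I am assuming the two packages are importable as `llama_index` and `llama_parse`, which matches the import statements used later in this article.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages are importable.
print(missing_modules(["llama_index", "llama_parse"]))
```

If the list is not empty, re-run the pip commands above in the same kernel.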

Step 3: Set environment variables

import os

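# os.environ is a mapping, so keys are set with square brackets, not a call.
# Replace the placeholders with your own keys.
os.environ['OPENAI_API_KEY'] = 'sk-proj-****'

os.environ['LLAMA_CLOUD_API_KEY'] = 'llx-****'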
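Before moving on, it can be useful to verify that both keys are actually visible to the running process. The following is a minimal sketch of such a check; `missing_keys` and `REQUIRED_KEYS` are hypothetical names that I introduce here for illustration, not something provided by LlamaParse or Llama Index.

```python
import os

# The two variable names used in this article.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Passing a plain dict as `env` makes the helper easy to try without touching your real environment.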

Step 4: Initialize the LLM and Embedding Model

Here, I am using gpt-3.5-turbo-0125 as the LLM and OpenAI's text-embedding-3-small as the embedding model. We will use the Settings module to replace the default LLM and embedding model.

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125")

Settings.llm = llm
Settings.embed_model = embed_model

Step 5: Parse the document

Now, we will upload our document and convert it to Markdown. It is then parsed using MarkdownElementNodeParser.

The table I used is taken from ncrb.gov.in and can be found here: https://ncrb.gov.in/accidental-deaths-suicidios-in-india-adsi. It has data embedded at different levels.

Below is the snapshot of the table I am trying to analyze.

from llama_parse import LlamaParse
from llama_index.core.node_parser import MarkdownElementNodeParser


documents = LlamaParse(result_type="markdown").load_data("./Table_2021.pdf")

node_parser = MarkdownElementNodeParser(
    llm=llm, num_workers=8
)

nodes = node_parser.get_nodes_from_documents(documents)

base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
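To build some intuition for why Markdown is a convenient intermediate format here: tables in Markdown are delimited by pipe characters, so they are easy to tell apart from the surrounding text. The toy sketch below separates the two. It is only an illustration of the idea and is not how MarkdownElementNodeParser is implemented; the function name `split_markdown` and the sample content are made up for this example.

```python
def split_markdown(md_text):
    """Split Markdown into ("text", block) and ("table", block) chunks.

    A line counts as a table row when it starts with a pipe character.
    """
    chunks = []
    current_kind, current_lines = None, []
    for line in md_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between blocks
        kind = "table" if line.lstrip().startswith("|") else "text"
        # Close the running block whenever the kind of line changes.
        if kind != current_kind and current_lines:
            chunks.append((current_kind, "\n".join(current_lines)))
            current_lines = []
        current_kind = kind
        current_lines.append(line)
    if current_lines:
        chunks.append((current_kind, "\n".join(current_lines)))
    return chunks


# Invented sample, not data from the NCRB document.
sample = """A short summary paragraph

| Year | Cases |
|------|-------|
| 2021 | 10 |

A closing note"""

for kind, block in split_markdown(sample):
    print(kind, "->", block.splitlines()[0])
```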

Step 6: Create the vector index and query engine

Now, we will create a vector store index using Llama Index's built-in implementation and build a query engine on top of it. We can also use vector stores like chromadb or pinecone for this.

from llama_index.core import VectorStoreIndex

recursive_index = VectorStoreIndex(nodes=base_nodes + objects)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5
)
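The similarity_top_k=5 argument asks the query engine to retrieve the five nodes whose embeddings are most similar to the query. The sketch below shows the general idea of top-k retrieval using cosine similarity over toy three-dimensional vectors. It is a conceptual illustration under my own assumptions, not Llama Index's actual code; the names `cosine`, `top_k`, and `toy_nodes` are hypothetical, and real embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k=5):
    """Return (node_id, score) pairs for the k nodes most similar to the query."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for node embeddings.
toy_nodes = {
    "table_summary": [0.9, 0.1, 0.0],
    "intro_text": [0.1, 0.9, 0.0],
    "footnote": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_nodes, k=2))
```

A larger k gives the LLM more context to work with, at the cost of a longer prompt.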

Step 7: Query the index

query = 'Extract the table as a dict and exclude any information about 2020. Also include % var'
response = recursive_query_engine.query(query)
print(response)

The above user query will query the underlying vector index and return the content embedded in the PDF document in JSON format, as shown in the image below.

As you can see in the screenshot, the table was extracted in a clean JSON format.
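Since the response is text generated by the LLM, it is worth converting it into a real Python object before using it downstream. Here is a minimal sketch, assuming the text is either valid JSON or a Python dict literal. The helper name `parse_table_text` and the sample strings are hypothetical, and the LLM is not guaranteed to return output that parses cleanly.

```python
import ast
import json


def parse_table_text(text):
    """Try JSON first, then a Python literal; return None if neither parses."""
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    try:
        # literal_eval only evaluates literals, so it does not execute code.
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


# Made-up strings standing in for the LLM's answer.
print(parse_table_text('{"row_a": {"2021": 10}}'))
print(parse_table_text("{'row_a': {'2021': 10}}"))
print(parse_table_text("Sorry, I could not find the table."))
```

In the notebook, you would pass str(response) to the helper and handle the None case, for example by retrying the query.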

Step 8: Putting it all together

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
from llama_parse import LlamaParse
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.core import VectorStoreIndex

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125")

Settings.llm = llm
Settings.embed_model = embed_model

documents = LlamaParse(result_type="markdown").load_data("./Table_2021.pdf")

node_parser = MarkdownElementNodeParser(
    llm=llm, num_workers=8
)

nodes = node_parser.get_nodes_from_documents(documents)

base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

recursive_index = VectorStoreIndex(nodes=base_nodes + objects)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5
)

query = 'Extract the table as a dict and exclude any information about 2020. Also include % var'
response = recursive_query_engine.query(query)
print(response)

Conclusion

LlamaParse is an effective tool for extracting complex objects from various types of documents, such as PDF files, with a few lines of code. However, it is important to note that fully utilizing this tool requires a certain level of experience working with LLM frameworks such as Llama Index.

LlamaParse is valuable in handling tasks of varying complexity. However, like any other tool in the technological field, it is not entirely immune to errors. Therefore, it is highly recommended to conduct a thorough evaluation of the application independently or take advantage of available evaluation tools. Evaluation libraries such as Ragas, Truera, etc. provide metrics to evaluate the accuracy and reliability of their results. This step ensures that potential issues are identified and resolved before the application is shipped to a production environment.
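As a starting point for your own evaluation, you can compare the extracted values against a small set of values that you have checked by hand. The sketch below computes a simple cell-level accuracy. It is a hand-rolled illustration rather than a Ragas or Truera metric, the name `cell_accuracy` is hypothetical, and the numbers are placeholders, not figures from the NCRB table.

```python
def cell_accuracy(extracted, expected):
    """Fraction of expected cells that the extracted dict reproduces exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return hits / len(expected)


# Placeholder values for illustration only.
expected = {"row_a": 100, "row_b": 250, "row_c": 75}
extracted = {"row_a": 100, "row_b": 205, "row_c": 75}

print(f"cell accuracy: {cell_accuracy(extracted, expected):.2f}")
```

A check like this catches transposed digits and missing rows, which are easy to overlook when eyeballing the output.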

Key takeaways

LlamaParse is a tool created by the Llama Index team. It extracts complex embedded objects from documents such as PDF files with just a few lines of code.

LlamaParse offers free and paid plans. The free plan allows you to parse up to 1000 pages per day.
LlamaParse currently supports more than 10 file types (.pdf, .pptx, .docx, .html, .xml, and more).
LlamaParse is part of the LlamaCloud platform, so you need a LlamaCloud account to get an API key.
With LlamaParse, you can provide natural language instructions to format the output. It even supports image extraction.

The media shown in this article is not the property of Analytics Vidhya and is used at the author's discretion.

Frequently asked questions (FAQ)

Q1. What is Llama Index?

A. Llama Index is one of the leading LLM frameworks, alongside LangChain, for building LLM applications. It helps connect custom data sources to large language models (LLMs) and is a widely used tool for building RAG applications.

Q2. What is LlamaParse?

A. LlamaParse is an offering from Llama Index that can extract complex tables and figures from documents like PDF, PPT, etc. Because of this, LlamaParse can integrate directly with Llama Index, allowing us to use it in conjunction with a wide variety of agents and tools that Llama Index offers.

Q3. How is LlamaParse different from Llama Index?

A. Llama Index is an LLM framework for creating custom LLM applications and provides various tools and agents. LlamaParse is especially focused on extracting complex embedded objects from documents like PDF, PPT, etc.

Q4. What is the importance of LlamaParse?

A. The importance of LlamaParse lies in its ability to convert complex unstructured data in tables, images, etc., to a structured format, which is crucial in the modern world where the most valuable information is available in unstructured form. This transformation is essential for analytical purposes. For example, studying a company's financials from its SEC filings, which can span 100 to 200 pages, would be a challenge without such a tool. LlamaParse provides an efficient way to handle and structure this large amount of unstructured data, making it more accessible and useful for analysis.

Q5. Does LlamaParse have any alternatives?

A. Yes, LLMSherpa and unstructured.io are the available alternatives for LlamaParse.

Technical Terrence Team
