Introduction
Before the era of large language models, extracting data from invoices was a tedious task: you had to collect data, build a document-understanding machine learning model, tune it, and so on. The arrival of generative AI simplified much of this. Large language models have eliminated the model-building step from the workflow; in most scenarios, you just need to be good at prompt engineering. In this article, we build an invoice extraction bot with the help of a large language model (LLM) and LangChain. A detailed treatment of LangChain and LLMs is out of scope for this article, but a brief overview of LangChain and its components follows.
Learning objectives
- Learn how to extract information from a document.
- Learn how to structure your backend code using LangChain and an LLM.
- Learn how to give proper prompts and instructions to an LLM.
- Gain a working knowledge of the Streamlit framework for the front end.
This article was published as part of the Data Science Blogathon.
What is a large language model?
Large language models (LLMs) are a type of artificial intelligence (AI) algorithm that uses deep learning techniques to process and understand natural language. LLMs are trained on huge volumes of text data to learn linguistic patterns and relationships between entities. This lets them recognize, translate, forecast, or generate text and other content. LLMs can be trained on possibly petabytes of data and can be tens of gigabytes in size; for perspective, one gigabyte of text can hold about 178 million words.
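The "178 million words per gigabyte" figure above is a rough estimate, and a quick back-of-envelope check shows why it is plausible: it implies about 5.6 bytes per word, which is close to the average English word length (roughly 4.7 letters) plus a space. The sketch below just runs that arithmetic; the decimal definition of a gigabyte (10^9 bytes) is an assumption.

```python
# Back-of-envelope check of the "1 GB of text ~ 178 million words" figure,
# assuming the decimal definition of a gigabyte (10^9 bytes).
bytes_in_gb = 1_000_000_000
words_per_gb = 178_000_000

bytes_per_word = bytes_in_gb / words_per_gb
print(round(bytes_per_word, 1))  # about 5.6 bytes per word
```

At ~4.7 letters plus one space per average English word, ~5.6 bytes per word (in ASCII/UTF-8 English text) is consistent with the article's estimate.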
For companies that want to offer customer service through a chatbot or virtual assistant, LLMs are useful because they can provide individualized responses without a human in the loop.
What is LangChain?
LangChain is an open source framework used to create and build applications using a large language model (LLM). It provides a standard interface for chains, many integrations with other tools, and end-to-end chains for common applications. This allows you to develop interactive, data-responsive applications that use the latest advances in natural language processing.
LangChain main components
A variety of LangChain components can be "chained together" to create complex LLM-based applications. These components include:
- Prompt templates
- LLM
- Agents
- Memory
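To illustrate how the first two components fit together, here is a toy, dependency-free sketch of the "chain" idea: a template is filled with input variables, and the resulting prompt is handed to a model. The function names are invented for illustration and the model call is mocked (a real call needs an API key); this mirrors the roles of LangChain's PromptTemplate and LLM classes, but it is not the LangChain API itself.

```python
# Toy illustration of LangChain-style chaining (NOT the real LangChain API):
# a prompt template is formatted with inputs, then handed to an LLM.

def format_prompt(template: str, **variables) -> str:
    """Mimics what a prompt template does: substitute variables into a template."""
    return template.format(**variables)

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, which would require an API key."""
    return f"[model response to {len(prompt)} chars of prompt]"

template = "Extract the invoice number from this text: {pages}"
prompt = format_prompt(template, pages="Invoice no. 1001329 ...")
response = fake_llm(prompt)
print(response)
```

In real LangChain code, `PromptTemplate` plays the role of `format_prompt` and an `OpenAI` (or other) model object plays the role of `fake_llm`, as the sections below show.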
Creating an invoice extraction robot with LangChain and LLM
Before the era of generative AI, extracting data from a document was a time-consuming process: you had to build a machine learning model or use a cloud API from Google, Microsoft, or AWS. An LLM makes it easy to extract any information from a given document in three simple steps:
- Call the LLM API
- Give it a proper prompt
- Extract the required information from the document
For this demo, we have taken three invoice PDF files. Below is the screenshot of an invoice file.
Step 1: Create an OpenAI API key
First, you need to create an OpenAI API key (this requires a paid subscription). Instructions for creating an OpenAI API key are easy to find online. With the API key created, the next step is to install all the necessary packages, such as LangChain, OpenAI, and pypdf.
#installing packages
pip install langchain
pip install openai
pip install streamlit
pip install pypdf
pip install pandas
Step 2: Import libraries
Once all the packages are installed, it’s time to import them one by one. We will create two Python files: one containing all the backend logic (called “utils.py”) and a second one for building the front end with the help of the Streamlit package.
First, we will start with “utils.py” where we will create some functions.
#import libraries
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from pypdf import PdfReader
import pandas as pd
import re
Step 3: Create a function to extract text from a PDF file
Let’s create a function that extracts all the text from a PDF file. For this we will use the PdfReader class:
#Extract Information from PDF file
def get_pdf_text(pdf_doc):
    text = ""
    pdf_reader = PdfReader(pdf_doc)
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text
Step 4: Create a function to extract the required data
Next, we will create a function that extracts all the required information from an invoice PDF file. In this case we are extracting the invoice number, description, quantity, date, unit price, amount, total, email, phone number, and address, and calling the OpenAI LLM API through LangChain:
def extract_data(pages_data):
    template = '''Extract all the following values: invoice no., Description,
    Quantity, date, Unit price, Amount, Total,
    email, phone number and address from this data: {pages}

    Expected output: remove any dollar symbols {{'Invoice no.':'1001329',
    'Description':'Office Chair', 'Quantity':'2', 'Date':'05/01/2022',
    'Unit price':'1100.00', 'Amount':'2200.00', 'Total':'2200.00',
    'email':'(email protected)', 'phone number':'9999999999',
    'Address':'Mumbai, India'}}
    '''
    prompt_template = PromptTemplate(input_variables=['pages'], template=template)
    llm = OpenAI(temperature=0.4)
    full_response = llm(prompt_template.format(pages=pages_data))
    return full_response
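One detail worth noting in the template above: PromptTemplate uses f-string-style formatting, so the literal braces around the expected-output example must be doubled (`{{ ... }}`), while `{pages}` stays single so it becomes an input variable. Plain `str.format` behaves the same way, which lets us sanity-check the idea without calling LangChain or the API (the short template below is made up for illustration):

```python
# PromptTemplate uses f-string-style substitution: '{{' renders as a literal
# '{', while '{pages}' is a real placeholder.  Plain str.format works the
# same way, so we can preview the final prompt without any API call.
template = "Return a dict like {{'Invoice no.': '1001329'}} from: {pages}"

final_prompt = template.format(pages="some invoice text")
print(final_prompt)
# Return a dict like {'Invoice no.': '1001329'} from: some invoice text
```

Previewing the formatted prompt this way is a cheap check that the braces are escaped correctly before spending tokens on a real LLM call.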
Step 5: Create a function that iterates through all the PDF files
Let's write one last function for the utils.py file. This function loops through all the uploaded PDF files, which means you can upload multiple invoice files at once.
# Iterate over the PDF files that the user uploaded, one by one
def create_docs(user_pdf_list):
    df = pd.DataFrame({'Invoice no.': pd.Series(dtype="str"),
                       'Description': pd.Series(dtype="str"),
                       'Quantity': pd.Series(dtype="str"),
                       'Date': pd.Series(dtype="str"),
                       'Unit price': pd.Series(dtype="str"),
                       'Amount': pd.Series(dtype="str"),
                       'Total': pd.Series(dtype="str"),
                       'Email': pd.Series(dtype="str"),
                       'Phone number': pd.Series(dtype="str"),
                       'Address': pd.Series(dtype="str")
                       })

    for filename in user_pdf_list:
        print(filename)
        raw_data = get_pdf_text(filename)

        llm_extracted_data = extract_data(raw_data)

        # Pull the {...} dictionary out of the LLM response
        pattern = r'{(.+)}'
        match = re.search(pattern, llm_extracted_data, re.DOTALL)
        if match:
            extracted_text = match.group(1)
            # Converting the extracted text to a dictionary
            data_dict = eval('{' + extracted_text + '}')
            print(data_dict)
            # Adding the extracted row to our DataFrame
            df = pd.concat([df, pd.DataFrame([data_dict])], ignore_index=True)
        else:
            print("No match found.")

        print("********************DONE***************")

    print(df.head())
    return df
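A caveat about the parsing step: `eval` executes whatever string the model returns, which is risky with untrusted LLM output. The standard library's `ast.literal_eval` parses only Python literals (dicts, strings, numbers) and refuses anything executable, so it is a safer drop-in here. The sketch below shows the same regex-plus-parse step on a sample response; the `sample_response` string is made up for illustration, not real model output.

```python
import re
import ast

# A made-up example of the kind of text the LLM might return.
sample_response = """Here is the extracted data:
{'Invoice no.': '1001329', 'Description': 'Office Chair',
 'Quantity': '2', 'Total': '2200.00'}"""

# Same regex as in create_docs: grab everything between the outer braces.
match = re.search(r'{(.+)}', sample_response, re.DOTALL)
if match:
    # literal_eval only accepts Python literals, unlike eval, so a
    # malicious or malformed model response cannot execute code.
    data_dict = ast.literal_eval('{' + match.group(1) + '}')
    print(data_dict['Invoice no.'])  # 1001329
```

Note that `ast.literal_eval` raises `ValueError` on non-literal input, so in production you would wrap it in a try/except and skip (or log) responses that fail to parse.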
So far, our utils.py file is complete. Now it’s time to start on the app.py file, which contains the UI code built with the help of the Streamlit package.
The Streamlit framework
Streamlit is an open-source Python application framework that makes it easy to build web applications for data science and machine learning. It was created with machine learning engineers in mind, so you can build applications the same way you write Python code. Streamlit supports major Python libraries, including scikit-learn, Keras, PyTorch, SymPy (LaTeX), NumPy, pandas, and Matplotlib. A single pip command gets you started with Streamlit in less than a minute.
Install and import all packages
First, we will install and import all the necessary packages.
#importing packages
import streamlit as st
import os
from dotenv import load_dotenv
from utils import *
Create the main function
Then we will create a main function where we define all the titles, subtitles, and UI elements with the help of Streamlit. Believe me, with Streamlit it is very simple and easy.
def main():
    load_dotenv()
    st.set_page_config(page_title="Invoice Extraction Bot")
    st.title("Invoice Extraction Bot...💁 ")
    st.subheader("I can help you in extracting invoice data")

    # Upload the invoices (PDF files)
    pdf = st.file_uploader("Upload invoices here, only PDF files allowed",
                           type=["pdf"], accept_multiple_files=True)

    submit = st.button("Extract Data")

    if submit:
        with st.spinner('Wait for it...'):
            df = create_docs(pdf)
            st.write(df.head())

            data_as_csv = df.to_csv(index=False).encode("utf-8")
            st.download_button(
                "Download data as CSV",
                data_as_csv,
                "benchmark-tools.csv",
                "text/csv",
                key="download-tools-csv",
            )
        st.success("Hope I was able to save your time❤")


#Invoking main function
if __name__ == '__main__':
    main()
Run streamlit run app.py
Once done, save the files and run the command “streamlit run app.py” in the terminal. Remember that by default Streamlit uses port 8501. You can also download the extracted information as a CSV file; the download option is provided in the user interface.
Conclusion
Congratulations! You have created an amazing time-saving application using a large and optimized language model. In this article, we have learned what a large language model is and what it is for. Additionally, we have learned the basics of LangChain and its core components and some functionalities of the optimized framework. The most important part of this blog is the “extract_data” function (from the code session), which explains how to give proper prompts and instructions to the LLM model.
You have also learned the following:
- How to extract information from an invoice PDF file.
- How to use the Streamlit framework to build the UI.
- How to use the OpenAI LLM model.
This should give you some ideas on how to use an LLM with proper prompts and instructions to accomplish your own tasks.
Frequently asked questions
Q1. What is Streamlit?
A. Streamlit is a library that lets you create the user interface (UI) for your data science and machine learning tasks by writing all the code in Python. Beautiful user interfaces can be designed easily with its many built-in components.
Q2. How is Streamlit different from Flask?
A. Flask is a lightweight microframework that is easy to learn and use. Streamlit is a newer framework built exclusively for data-driven web applications.
Q3. Does the bot only extract the fields shown in this example?
A. No, it depends on the use case. In this example, we know what information needs to be extracted, but if you want to extract more or less information, you need to give the LLM proper instructions and an example, and it will extract all the mentioned information accordingly.
The media shown in this article is not the property of Analytics Vidhya and is used at the author’s discretion.