Introduction
Before the era of large language models, extracting data from invoices was a tedious task: you had to collect data, build a document-understanding machine learning model, tune it, and so on. The arrival of generative AI simplified much of this. Large language models have eliminated the model-building step from the workflow; in most scenarios, you just need to be good at prompt engineering. In this article, we build an invoice extraction bot with the help of a large language model (LLM) and LangChain. A detailed treatment of LangChain and LLMs is out of scope for this article, but a brief overview of LangChain and its components follows.
Learning objectives
- Learn how to extract information from a document.
- Learn how to structure your backend code using LangChain and an LLM.
- Learn how to give proper prompts and instructions to an LLM.
- Gain a working knowledge of the Streamlit framework for the front end.
This article was published as part of the Data Science Blogathon.
What is a large language model?
Large language models (LLMs) are a type of artificial intelligence (AI) algorithm that uses deep learning techniques to process and understand natural language. LLMs are trained on huge volumes of text data to learn linguistic patterns and relationships between entities. This lets them recognize, translate, forecast, or generate text and other content. LLMs can be trained on possibly petabytes of data and can be tens of gigabytes in size; for perspective, one gigabyte of text can hold about 178 million words.
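The "178 million words per gigabyte" figure above is a rough estimate, and a quick back-of-envelope check shows why it is plausible: it implies about 5.6 bytes per word, which is close to the average English word length (roughly 4.7 letters) plus a space. The sketch below just runs that arithmetic; the decimal definition of a gigabyte (10^9 bytes) is an assumption.

```python
# Back-of-envelope check of the "1 GB of text ~ 178 million words" figure,
# assuming the decimal definition of a gigabyte (10^9 bytes).
bytes_in_gb = 1_000_000_000
words_per_gb = 178_000_000

bytes_per_word = bytes_in_gb / words_per_gb
print(round(bytes_per_word, 1))  # about 5.6 bytes per word
```

At ~4.7 letters plus one space per average English word, ~5.6 bytes per word (in ASCII/UTF-8 English text) is consistent with the article's estimate.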
For companies that want to offer customer service through a chatbot or virtual assistant, LLMs are useful because they can provide individualized responses without a human in the loop.
What is LangChain?
LangChain is an open source framework used to create and build applications using a large language model (LLM). It provides a standard interface for chains, many integrations with other tools, and end-to-end chains for common applications. This allows you to develop interactive, data-responsive applications that use the latest advances in natural language processing.
LangChain main components
A variety of LangChain components can be "chained together" to create complex LLM-based applications. These components include:
- Prompt templates
- LLM
- Agents
- Memory
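To illustrate how the first two components fit together, here is a toy, dependency-free sketch of the "chain" idea: a template is filled with input variables, and the resulting prompt is handed to a model. The function names are invented for illustration and the model call is mocked (a real call needs an API key); this mirrors the roles of LangChain's PromptTemplate and LLM classes, but it is not the LangChain API itself.

```python
# Toy illustration of LangChain-style chaining (NOT the real LangChain API):
# a prompt template is formatted with inputs, then handed to an LLM.

def format_prompt(template: str, **variables) -> str:
    """Mimics what a prompt template does: substitute variables into a template."""
    return template.format(**variables)

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, which would require an API key."""
    return f"[model response to {len(prompt)} chars of prompt]"

template = "Extract the invoice number from this text: {pages}"
prompt = format_prompt(template, pages="Invoice no. 1001329 ...")
response = fake_llm(prompt)
print(response)
```

In real LangChain code, `PromptTemplate` plays the role of `format_prompt` and an `OpenAI` (or other) model object plays the role of `fake_llm`, as the sections below show.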
Creating an invoice extraction robot with LangChain and LLM
Before the era of generative AI, extracting data from a document was a time-consuming process: you had to build a machine learning model or use a cloud API from Google, Microsoft, or AWS. An LLM makes it easy to extract any information from a given document in three simple steps:
- Call the LLM API
- Give it a proper prompt
- Extract the required information from the document
For this demo, we have taken three invoice PDF files. Below is the screenshot of an invoice file.
Step 1: Create an OpenAI API key
First, you need to create an OpenAI API key (this requires a paid subscription). Instructions for creating an OpenAI API key are easy to find online. With the API key created, the next step is to install all the necessary packages, such as LangChain, OpenAI, and pypdf.
#installing packages
pip install langchain
pip install openai
pip install streamlit
pip install pypdf
pip install pandas
Step 2: Import libraries
Once all the packages are installed, it’s time to import them one by one. We will create two Python files: one containing all the backend logic (called “utils.py”) and a second one for building the front end with the help of the Streamlit package.
First, we will start with “utils.py” where we will create some functions.
#import libraries
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from pypdf import PdfReader
import pandas as pd
import re
Step 3: Create a function to extract text from a PDF file
Let’s create a function that extracts all the text from a PDF file. For this we will use the PdfReader class:
#Extract Information from PDF file
def get_pdf_text(pdf_doc):
    text = ""
    pdf_reader = PdfReader(pdf_doc)
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text
Step 4: Create a function to extract the required data
Next, we will create a function that extracts all the required information from an invoice PDF file. In this case we are extracting the invoice number, description, quantity, date, unit price, amount, total, email, phone number, and address, and calling the OpenAI LLM API through LangChain:
def extract_data(pages_data):
    template = '''Extract all the following values: invoice no., Description,
    Quantity, date, Unit price, Amount, Total,
    email, phone number and address from this data: {pages}

    Expected output: remove any dollar symbols {{'Invoice no.':'1001329',
    'Description':'Office Chair', 'Quantity':'2', 'Date':'05/01/2022',
    'Unit price':'1100.00', 'Amount':'2200.00', 'Total':'2200.00',
    'email':'(email protected)', 'phone number':'9999999999',
    'Address':'Mumbai, India'}}
    '''
    prompt_template = PromptTemplate(input_variables=['pages'], template=template)
    llm = OpenAI(temperature=0.4)
    full_response = llm(prompt_template.format(pages=pages_data))
    return full_response
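One detail worth noting in the template above: PromptTemplate uses f-string-style formatting, so the literal braces around the expected-output example must be doubled (`{{ ... }}`), while `{pages}` stays single so it becomes an input variable. Plain `str.format` behaves the same way, which lets us sanity-check the idea without calling LangChain or the API (the short template below is made up for illustration):

```python
# PromptTemplate uses f-string-style substitution: '{{' renders as a literal
# '{', while '{pages}' is a real placeholder.  Plain str.format works the
# same way, so we can preview the final prompt without any API call.
template = "Return a dict like {{'Invoice no.': '1001329'}} from: {pages}"

final_prompt = template.format(pages="some invoice text")
print(final_prompt)
# Return a dict like {'Invoice no.': '1001329'} from: some invoice text
```

Previewing the formatted prompt this way is a cheap check that the braces are escaped correctly before spending tokens on a real LLM call.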
Step 5: Create a function that iterates through all the PDF files
Let's write one last function for the utils.py file. This function loops through all the uploaded PDF files, which means you can upload multiple invoice files at once.
# Iterate over the PDF files that the user uploaded, one by one
def create_docs(user_pdf_list):
    df = pd.DataFrame({'Invoice no.': pd.Series(dtype="str"),
                       'Description': pd.Series(dtype="str"),
                       'Quantity': pd.Series(dtype="str"),
                       'Date': pd.Series(dtype="str"),
                       'Unit price': pd.Series(dtype="str"),
                       'Amount': pd.Series(dtype="str"),
                       'Total': pd.Series(dtype="str"),
                       'Email': pd.Series(dtype="str"),
                       'Phone number': pd.Series(dtype="str"),
                       'Address': pd.Series(dtype="str")
                       })

    for filename in user_pdf_list:
        print(filename)
        raw_data = get_pdf_text(filename)

        llm_extracted_data = extract_data(raw_data)

        # Pull the {...} dictionary out of the LLM response
        pattern = r'{(.+)}'
        match = re.search(pattern, llm_extracted_data, re.DOTALL)
        if match:
            extracted_text = match.group(1)
            # Converting the extracted text to a dictionary
            data_dict = eval('{' + extracted_text + '}')
            print(data_dict)
            # Adding the extracted row to our DataFrame
            df = pd.concat([df, pd.DataFrame([data_dict])], ignore_index=True)
        else:
            print("No match found.")

        print("********************DONE***************")

    print(df.head())
    return df
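A caveat about the parsing step: `eval` executes whatever string the model returns, which is risky with untrusted LLM output. The standard library's `ast.literal_eval` parses only Python literals (dicts, strings, numbers) and refuses anything executable, so it is a safer drop-in here. The sketch below shows the same regex-plus-parse step on a sample response; the `sample_response` string is made up for illustration, not real model output.

```python
import re
import ast

# A made-up example of the kind of text the LLM might return.
sample_response = """Here is the extracted data:
{'Invoice no.': '1001329', 'Description': 'Office Chair',
 'Quantity': '2', 'Total': '2200.00'}"""

# Same regex as in create_docs: grab everything between the outer braces.
match = re.search(r'{(.+)}', sample_response, re.DOTALL)
if match:
    # literal_eval only accepts Python literals, unlike eval, so a
    # malicious or malformed model response cannot execute code.
    data_dict = ast.literal_eval('{' + match.group(1) + '}')
    print(data_dict['Invoice no.'])  # 1001329
```

Note that `ast.literal_eval` raises `ValueError` on non-literal input, so in production you would wrap it in a try/except and skip (or log) responses that fail to parse.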
So far, our utils.py file is complete. Now it’s time to start on the app.py file, which contains the UI code built with the help of the Streamlit package.
The Streamlit framework
Streamlit is an open-source Python application framework that makes it easy to build web applications for data science and machine learning. It was created with machine learning engineers in mind, so you can build applications the same way you write Python code. Streamlit supports major Python libraries, including scikit-learn, Keras, PyTorch, SymPy (LaTeX), NumPy, pandas, and Matplotlib. A single pip command gets you started with Streamlit in less than a minute.
Install and import all packages
First, we will install and import all the necessary packages.
#importing packages
import streamlit as st
import os
from dotenv import load_dotenv
from utils import *
Create the main function
Then we will create a main function where we define all the titles, subtitles, and UI elements with the help of Streamlit. Believe me, with Streamlit it is very simple and easy.
def main():
    load_dotenv()
    st.set_page_config(page_title="Invoice Extraction Bot")
    st.title("Invoice Extraction Bot...💁 ")
    st.subheader("I can help you in extracting invoice data")

    # Upload the invoices (PDF files)
    pdf = st.file_uploader("Upload invoices here, only PDF files allowed",
                           type=["pdf"], accept_multiple_files=True)

    submit = st.button("Extract Data")

    if submit:
        with st.spinner('Wait for it...'):
            df = create_docs(pdf)
            st.write(df.head())

            data_as_csv = df.to_csv(index=False).encode("utf-8")
            st.download_button(
                "Download data as CSV",
                data_as_csv,
                "benchmark-tools.csv",
                "text/csv",
                key="download-tools-csv",
            )
        st.success("Hope I was able to save your time❤")


#Invoking main function
if __name__ == '__main__':
    main()
Run streamlit run app.py
Once done, save the files and run the command “streamlit run app.py” in the terminal. Remember that by default Streamlit uses port 8501. You can also download the extracted information as a CSV file; the download option is provided in the user interface.
Conclusion
Congratulations! You have created an amazing time-saving application using a large and optimized language model. In this article, we have learned what a large language model is and what it is for. Additionally, we have learned the basics of LangChain and its core components and some functionalities of the optimized framework. The most important part of this blog is the “extract_data” function (from the code session), which explains how to give proper prompts and instructions to the LLM model.
You have also learned the following:
- How to extract information from an invoice PDF file.
- How to use the Streamlit framework to build the UI.
- How to use the OpenAI LLM model.
This should give you some ideas on how to use an LLM with proper prompts and instructions to accomplish your own tasks.
Frequently asked questions
Q1. What is Streamlit?
A. Streamlit is a library that lets you create the user interface (UI) for your data science and machine learning tasks by writing all the code in Python. Beautiful user interfaces can be designed easily with its many built-in components.
Q2. How is Streamlit different from Flask?
A. Flask is a lightweight microframework that is easy to learn and use. Streamlit is a newer framework built exclusively for data-driven web applications.
Q3. Does the bot only extract the fields shown in this example?
A. No, it depends on the use case. In this example, we know what information needs to be extracted, but if you want to extract more or less information, you need to give the LLM proper instructions and an example, and it will extract all the mentioned information accordingly.
The media shown in this article is not the property of Analytics Vidhya and is used at the author’s discretion.