This blog serves as a starting point for anyone looking to extract tables from PDFs and images. We start with a Python code tutorial that takes you through the process of implementing OCR on PDFs and images to detect and extract tables in structured formats (list, JSON object, pandas dataframe). We then take a look at a no-code platform for automated table extraction, and finally explore some table extraction tools available for free online.
Introduction
It is estimated that the total number of PDF documents in the world exceeds 3 trillion. The widespread adoption of these documents can be attributed to their platform-agnostic nature, which gives them a consistent and reliable rendering experience across all environments.
Every day, many situations arise where it is necessary to read and extract text and tabular information from PDF files. People and organizations that traditionally did this manually have begun looking at AI-based alternatives that can replace the manual effort.
OCR stands for Optical Character Recognition and uses AI to convert an image of printed or handwritten text into machine-readable text. There are several open-source and closed-source OCR engines available today. Note that the job is often not complete once OCR has read the document and returned a text stream; layers of technology are built on top of it to take the now machine-readable text and extract relevant attributes in a structured format.
We will use the following invoice for table extraction. The goal is to read the quantity, description, unit price, and amount of each product in the invoice PDF in tabular format.
Let’s get started
Prerequisites
The OCR required to process the file and extract the table is handled by an API call to the Nanonets API.
To make the API call and get tables extracted from a PDF, we need the requests library. For the post-processing code that transforms the API response into a list of dataframes, we need the pandas and numpy libraries. You can install them in your Python environment using pip.
pip install requests pandas numpy
To get your first prediction, run the code snippet below. You need to add your API_KEY and MODEL_ID to authenticate.
You can get your free API_KEY and MODEL_ID by signing up at https://app.nanonets.com/#/signup?redirect=tools.
Once this is done, run the following code snippet.
import requests

# Send the file to the Nanonets table OCR endpoint (async=false returns the
# result synchronously). The API key is the basic-auth username; the password is blank.
url = 'https://app.nanonets.com/api/v2/OCR/Model/REPLACE_MODEL_ID/LabelFile/?async=false'
data = {'file': open('invoice.png', 'rb')}
response = requests.post(url, auth=requests.auth.HTTPBasicAuth('REPLACE_API_KEY', ''), files=data)
We get a JSON response containing the extraction results.
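You can print the response for a quick look (a minimal sanity check; a status code of 200 indicates success):

print(response.status_code)  # 200 on success
print(response.json())       # raw JSON with the detected tables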
The response contains a result array with one object per page. Each page object contains a prediction array that holds all detected tables as elements. Each detected table has an array called cells, which contains all the cells in the detected table; the detected row, column, and original text are present as the row, col, and ocr_text attributes of each cell object in cells.
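To see this structure concretely, here is a minimal sketch that walks down the response; it assumes at least one table was detected on the first page:

resp = response.json()
page = resp["result"][0]       # one object per page
table = page["prediction"][0]  # assuming the first prediction is a table
cell = table["cells"][0]       # one object per table cell
print(cell["row"], cell["col"], cell["ocr_text"])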
We will now do some post-processing to transform the JSON response into pandas dataframes. After getting the above API response, you can run the snippet below to get a list of dataframes containing the detected tables.
import pandas as pd
import numpy as np

alldfs = []
for item in response.json()["result"]:
    tables = []
    for pred in item['prediction']:
        if pred['type'] == 'table':
            # Recover a label for each column from the first labelled cell in it
            # (assumes at most 100 rows and columns per table)
            labels = ['none'] * 100
            maxcol = 0
            for cell in pred['cells']:
                if labels[cell['col'] - 1] == 'none':
                    labels[cell['col'] - 1] = cell['label']
                if cell['col'] > maxcol:
                    maxcol = cell['col']
            labels = labels[:maxcol]
            # Place each cell's text at its detected (row, col) position
            df = pd.DataFrame(index=np.arange(100), columns=np.arange(100))
            for cell in pred['cells']:
                df.at[cell['row'], cell['col']] = cell['text']
            # Drop empty rows/columns and attach the recovered column labels
            df = df.dropna(axis=0, how='all')
            df = df.dropna(axis=1, how='all')
            df.columns = labels
            tables.append(df)
    alldfs.append(tables)
After running this, the alldfs object is a list with one entry per page of the document, and each entry is itself a list of dataframes, one per table detected on that page.

The two tables present in the invoice PDF have been detected on the first page and stored as dataframes: alldfs[0] holds the first page's tables, and the two tables can be accessed as alldfs[0][0] and alldfs[0][1].
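For example, to inspect the extracted tables (assuming, as above, a single-page invoice with two detected tables):

first_page_tables = alldfs[0]   # tables detected on page 1
print(len(first_page_tables))   # 2 for this invoice
print(first_page_tables[0])     # line-item table as a pandas DataFrame
print(first_page_tables[1])     # second detected table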
With that, we have OCRed our first PDF file and extracted tables from it. We examined the JSON response and did some post-processing with pandas and numpy to get the data into the desired format. You can also apply your own post-processing to the JSON response based on your use case.
We also provide a no-code platform alongside the Nanonets API, with additional support for line items, automated imports and exports to popular ERPs/software/databases, a framework for configuring approval and validation rules, and much more.
One of our AI experts can take a 15-minute call to discuss your use case, give you a personalized demo, and find the best plan for you.
Do More: Extract Line Items and Flat Fields
You can extend the functionality of Nanonets OCR to detect flat fields and line items along with tables from PDF files and images. You can train your own custom model in 15 minutes to detect any flat field or line item in an image or a PDF file. Nanonets also offers pre-built templates with added line-item support for popular document types like invoices, receipts, driver’s licenses, ID cards, resumes, etc.
Thus, creating a custom model or using one of our pre-built models lets you detect and extract line items, flat fields, and tables in a single API call.
Let’s take the example of the invoice above. The goal now is to detect flat fields like seller name, seller address, phone number, email, and total amount, along with tables, from the invoice PDF file using Nanonets OCR.
You can go to https://app.nanonets.com and clone the pre-trained invoice document type model.
Once done, navigate to the Integrate section in the left navigation pane, which provides ready-to-use code snippets for extracting line items, flat fields, and tables using the Nanonets API.
Running that code snippet on our invoice file detects line items along with tables in a single API call.
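As a rough sketch of what the post-processing could look like, the snippet below collects everything that is not a table into a flat-field dictionary. It assumes non-table predictions carry label and ocr_text keys, mirroring the table cells we saw earlier; check your own model's response for the exact shape:

# Hedged sketch: separate flat fields from tables in the same response.
# Assumes non-table predictions expose 'label' and 'ocr_text' keys.
fields = {}
for item in response.json()["result"]:
    for pred in item["prediction"]:
        if pred["type"] != "table":
            fields[pred["label"]] = pred.get("ocr_text", "")

print(fields)  # e.g. {'seller_name': '...', 'total_amount': '...'}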
You can also use our online platform to set up an automated workflow and extract line items and tables from PDFs and images, set up external integrations with popular ERP/software/databases, and set up approval and validation rules.
We offer table extraction on our online platform as well as via the Nanonets API. Once your Nanonets account is up and running, you can choose to use the platform instead of the API to extract tables from your documents.
You can set up your workflow here. We offer out-of-the-box integrations with multiple popular ERPs/software/databases, and you can:
- automate imports and exports to and from your ERP/software/database
- configure automated validation and approval rules
- configure post-processing after extraction
There are plenty of free OCR tools online that can be used to perform OCR and extract tables. It’s simply a matter of uploading your input files, waiting for the tool to process them, and downloading the output in the required format.
Here is a list of free online OCR tools that we offer:
Do you have an Intelligent Document Processing/OCR business use case? Try Nanonets
We provide customized OCR and IDP solutions for various use cases: Accounts Payable Automation, Invoice Automation, Accounts Receivable Automation, Receipt/ID Card/DL/Passport OCR, Accounting Software Integrations, BPO Automation, table extraction, PDF extraction and many more. Explore our products and solutions using the dropdown menus at the top right of the page.
For example, suppose you have a large number of invoices that are generated every day. With Nanonets, you can upload these images and teach your own model what to look for: in invoices, you could create a template to extract product names and prices. Once you’ve annotated your data and created your model, integrating it is as easy as copying 2 lines of code.
Here are some reasons why you should consider using Nanonets:
- Nanonets makes it easy to extract text, structure the relevant data into the required fields, and discard irrelevant data extracted from the image.
- It works well with multiple languages.
- It works well on text in the wild.
- You can train it with your own data to make it work for your use case.
- The Nanonets OCR API allows you to easily retrain your models with new data, so you can automate your operations anywhere, faster.
- No internal team of developers is required.
Visit Nanonets for OCR and IDP business solutions.
Sign up to start a free trial.