How do you OCR payment receipts? This blog is a comprehensive overview of the different methods for extracting structured text from pay stubs with OCR to automate manual data entry.
Payslips, or pay stubs as they are more commonly known, are a common form of income verification used by lenders to verify your creditworthiness. If you are a working employee or have been in the past, you have no doubt come across one. Typically, these payslips contain details like an employee’s earnings during a particular pay period, along with other fields like tax deductions, insurance amounts, social security numbers, etc. These can be in print or digital format and are sometimes sent via email or post.
Currently, lenders obtain scanned or digital PDF files of these pay stubs and manually enter the details into their systems to issue a loan. This process is time-consuming, especially during peak seasons, leading to a long delay from loan application to the release of funds. What if you could automatically extract the data from these pay stubs and reduce this time to seconds, speeding up loan processing and delighting your customers?
In this blog, we will review the different ways you can automate the extraction of information from payslips (payslip OCR or payslip PDF extraction) and save it as structured data using optical character recognition (OCR). Additionally, we will discuss the frequent challenges encountered in building an accurate OCR pipeline integrated with machine learning and deep learning models.
In this section, we will discuss how we can use OCR-based algorithms to extract information from payslips. If you’re new to OCR, think of it as a computer algorithm that can read images of typed or handwritten text and convert them into plain text. There are different free and open-source tools on GitHub like Tesseract, OCRopus, and Kraken, but they have certain limitations. For example, Tesseract is very accurate at extracting organized text, but doesn’t do well on unstructured data. Similarly, the other OCR tools have various limitations based on fonts, language, alignment, templates, etc. Now, back to our problem of extracting information from payslips: an ideal OCR should be able to extract all the essential fields, regardless of the drawbacks discussed above. Before setting up an OCR, let’s look at the standard fields that we need to extract from a payslip document.
- Net income
- Gross salary
- Bank account
- Employer name
- Employee address
- Employee name
- Employee number
- Salary period
- Birth date
- Days worked
- Hours worked
- Date of entry/exit of service
- Hourly rate
- Tax rate
- Date of issue
Before setting up an OCR and analyzing the results, we must realize that OCR does not know what type of document it is given; it blindly identifies the text and returns it, regardless of the fields or identifiers mentioned above. Now, we will use Tesseract, a free and open-source OCR engine from Google. For more information on setting it up on your system and developing Python scripts for scanned images, check out our Tesseract guide here.
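Running Tesseract from Python is typically done through the pytesseract wrapper. Below is a minimal sketch, assuming pytesseract and Pillow are installed and the Tesseract binary is available on the system; the file name `payslip.png` is a placeholder.

```python
import re

# pytesseract and Pillow are assumed dependencies; guard the import so
# the text-cleaning helper below still works without them.
try:
    import pytesseract
    from PIL import Image
    HAVE_TESSERACT = True
except ImportError:
    HAVE_TESSERACT = False

def ocr_payslip(path):
    """Return the raw text Tesseract reads from a payslip image."""
    image = Image.open(path)
    return pytesseract.image_to_string(image)

def normalize(text):
    """Collapse runs of whitespace so downstream field matching is simpler."""
    return re.sub(r"\s+", " ", text).strip()
```

The output of `image_to_string` is flat, unstructured text, which is exactly why the post-processing discussed next is needed.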
As we can clearly see, Tesseract identified all the text in the given image, regardless of tables, positions, and alignment, and printed it accurately. But it takes a lot of post-processing to pick out the important fields and organize them in a structured way. For example, say you only need to extract the tax deducted from an employee; Tesseract alone can’t do that. This is where machine learning and deep learning models come in, to intelligently identify the locations of the fields and extract the necessary values. We call this key-value pair extraction; let’s discuss how we can achieve it in the following sections.
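To make the post-processing point concrete, here is a hedged sketch of pulling a single amount out of raw Tesseract text with a regular expression. The label `FED TAX` and the amount format are template-specific assumptions, which is precisely why this approach breaks down across payslip layouts and motivates the learned models discussed later.

```python
import re

def extract_field(ocr_text, label):
    """Find the first dollar-style amount following a label in raw OCR text.

    Assumes amounts look like 1,234.56 and appear right after the label,
    which is a template-specific assumption, not a general rule.
    """
    pattern = re.escape(label) + r"\s+([\d,]+\.\d{2})"
    match = re.search(pattern, ocr_text)
    return match.group(1) if match else None

sample = "FED TAX 117.68 235.36"
print(extract_field(sample, "FED TAX"))  # prints "117.68"
```

Every new template needs a new pattern, so regexes alone do not scale to arbitrary payslip formats.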
Drawbacks and challenges
When scanning payment receipts, we encounter different issues, such as captures at the wrong angle or in low-light conditions. Also, after capturing them, it is equally important to check whether they are genuine or fake. In this section, we will discuss these critical challenges and how they can be addressed.
Wrong scans
This is the most common problem when performing OCR. On high-quality, well-aligned scans, OCR is accurate enough to produce fully searchable, editable text. However, when a scan is distorted or the text is blurry, OCR tools can have a hard time reading it and sometimes produce inaccurate results. To overcome this, we need to be familiar with techniques like image transformation and deskewing, which help us align the image properly.
Fraud checks and blurry images
It is important for companies and employees to verify whether pay stubs are authentic. These are some of the checks that can help determine whether an image is fake:
- Look for bent or distorted sections of the document.
- Beware of low-quality images.
- Check for blurred or edited text.
A well-known algorithm for this task is the Laplacian variance. It helps us examine the distribution of low and high frequencies in a given image: a sharp, genuine scan has strong high-frequency content (edges), while a blurry or heavily edited region does not.
As discussed above, key-value extraction searches for user-defined keys (static text in forms) and then identifies the values associated with them. To achieve this, one must first be familiar with deep learning. We also need to make sure these deep learning algorithms generalize across templates, since the same algorithm should work for documents in other formats. After the algorithm finds the positions of the keys and the required values, we use OCR to extract the text.
Here is an example of how Tesseract extracts text:
```
Sample Company LLC EARNINGS STATEMENT
2305 Gruene Lake Drive, Suite C New Braunfels, Texas
Hidalgo P. Swift XXX-XX-1234 12345 76612 01/08/19-01/14/19 0115/19
GROSS WAGES 24.25 40.00 970.00 FICA MED TAX 14.06 28.12
FICA SS TAX 60.14 120.28
FED TAX 117.68 235.36
1,940.00 383.76 1,556.24 970.00 191.88 778.12
```
With key-value pair extraction, by contrast, we get a JSON output of the required keys and values of the given payment receipt. The output JSON can then be saved as structured data in Excel sheets, databases, and CRM systems using simple automation scripts. In the next section, we’ll look at some deep learning techniques for extracting key-value pairs from documents like payslips.
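To illustrate that last step, here is a small sketch of saving extracted key-value pairs as JSON and as a flat CSV that a database or CRM import can consume. The dictionary below is hypothetical output for the earnings statement above; the field names are illustrative, not a fixed schema.

```python
import csv
import json

# Hypothetical key-value pairs as an extraction model might return them;
# the field names and values are illustrative.
extracted = {
    "employer_name": "Sample Company LLC",
    "employee_name": "Hidalgo P. Swift",
    "pay_period": "01/08/19-01/14/19",
    "gross_wages": "970.00",
    "fed_tax": "117.68",
}

# Serialize to JSON for downstream APIs...
with open("payslip.json", "w") as f:
    json.dump(extracted, f, indent=2)

# ...and write a flat CSV row for spreadsheet or database imports.
with open("payslips.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(extracted))
    writer.writeheader()
    writer.writerow(extracted)
```

In a real pipeline the dictionary would come from the extraction model, and the CSV would be appended to rather than overwritten.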
Deep learning models for payment receipt information extraction
There are two ways to extract information using deep learning: building algorithms that learn from images, and building algorithms that learn from text.
Alright, now let’s dive into deep learning and understand how these algorithms identify key-value pairs from images or text. For payslips especially, it is essential to extract the data in tables, since most of the earnings and deductions on a payslip are presented in tabular format. Now, let’s review some popular deep learning architectures for scanned documents.
In CUTIE (Learning to Understand Documents with Convolutional Universal Text Information Extractor), Xiaohui Zhao et al. proposed extracting key information from documents such as receipts or invoices and preserving the text of interest as structured data. The heart of this research is convolutional neural networks applied to texts, where the texts are embedded as features with semantic connotations. The model is trained on 4,484 labeled receipts and achieved average precision of 90.8% on taxi receipts and 77.7% on entertainment receipts.
BERTgrid is a popular deep-learning-based language model for understanding generic documents and performing key-value pair extraction tasks. This model also uses convolutional neural networks based on semantic instance segmentation to perform inference. Overall, the average accuracy across the header and line-item fields on the evaluated documents was 65.48%.
In DeepDeSRT, Schreiber et al. presented an end-to-end system for table understanding in document images. The system contains two separate models: one for table detection and one for structure recognition on the detected tables. It outperformed state-of-the-art methods, achieving F1 measures of 96.77% for table detection and 91.44% for structure recognition. Models like these can be used specifically to extract values from payslip tables.
Further reading