Build end-to-end document processing pipelines with Amazon Textract IDP CDK Constructs

Intelligent Document Processing (IDP) with AWS helps automate the extraction of information from documents of different types and formats, quickly and with high precision, without the need for machine learning (ML) skills. Faster, highly accurate information extraction can help you make quality business decisions on time, while reducing overall costs. For more information, see Intelligent Document Processing with AWS AI Services: Part 1.

However, complexity arises when you implement real-world scenarios. Documents are often sent out of order, or arrive as a combined package containing several types of forms. You need orchestration pipelines that introduce business logic and apply different processing techniques depending on the type of form received. These challenges are magnified when teams deal with large volumes of documents.

In this post, we demonstrate how to solve these challenges using Amazon Textract IDP CDK Constructs, a set of pre-built IDP constructs, to speed up the development of real-world document processing pipelines. For our use case, we process an Acord insurance document to enable straight-through processing, but you can extend this solution to any use case, as we discuss later in the post.

Acord Document Processing at Scale

Straight Through Processing (STP) is a term used in the financial industry to describe the automation of a transaction from start to finish without the need for manual intervention. The insurance industry uses STP to streamline the underwriting and claims process. This involves the automatic extraction of data from insurance documents such as applications, policy documents and claim forms. STP implementation can be challenging due to the large amount of data and the variety of document formats involved. Insurance documents are inherently diverse. Traditionally, this process involves manually reviewing each document and entering the data into a system, which is time consuming and error prone. This manual approach is not only inefficient, but can also lead to errors that can have a significant impact on the underwriting and claims process. This is where IDP on AWS comes into play.

For a more efficient and accurate workflow, insurance companies can integrate IDP on AWS into the underwriting and claims process. With Amazon Textract and Amazon Comprehend, insurers can read handwriting and different form formats, making it easy to extract information from various types of insurance documents. By implementing IDP on AWS in the process, STP becomes easier to achieve, reducing the need for manual intervention and speeding up the overall process.

This pipeline allows insurers to easily and efficiently process their commercial insurance transactions, reducing the need for manual intervention and improving the overall customer experience. We demonstrate how to use Amazon Textract and Amazon Comprehend to automatically extract data from commercial insurance documents, such as Acord 140, Acord 125, Affidavit of Home Ownership, and Acord 126, and analyze the extracted data to facilitate the underwriting process. These services can help insurance companies improve the accuracy and speed of their STP processes and ultimately provide a better experience for their customers.

Solution Overview

The solution is built using the AWS Cloud Development Kit (AWS CDK) and consists of Amazon Comprehend for document classification, Amazon Textract for document extraction, Amazon DynamoDB for storage, AWS Lambda for the application logic, and AWS Step Functions for workflow orchestration.

The pipeline consists of the following phases:

Break down document packages and classify each type of form with Amazon Comprehend.
Run the processing pipelines for each type of form or form page with the appropriate Amazon Textract API (signature detection, table extraction, form extraction, or query).
Post-process the Amazon Textract output into a machine-readable format.

The following screenshot of the Step Functions workflow illustrates the pipeline.

Prerequisites

To get started with the solution, make sure you have the following:

AWS CDK version 2 installed
Docker installed and running on your machine
Proper access to Step Functions, DynamoDB, Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Textract, and Amazon Comprehend

Clone the GitHub repository

Start by cloning the GitHub repository:

git clone https://github.com/aws-samples/aws-textract-e2e-processing.git

Create an Amazon Comprehend classification endpoint

We must first provide an Amazon Comprehend classification endpoint.

For this post, the endpoint detects the following document classes (make sure the names are consistent):

acord125
acord126
acord140
property_affidavit

You can create one using the comprehend_acord_dataset.csv sample dataset in the GitHub repository. To train and create a custom classification endpoint using the provided sample data set, follow the instructions in Training custom classifiers. If you want to use your own PDF files, see the first workflow in the post Intelligently Split Multi-Form Document Packages with Amazon Textract and Amazon Comprehend.

After you train your classifier and create an endpoint, you should have an Amazon Comprehend custom classification endpoint ARN that resembles the following code:

arn:aws:comprehend:<REGION>:<ACCOUNT_ID>:document-classifier-endpoint/<CLASSIFIER_NAME>
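Before pasting the ARN into the workflow code, it can help to check that it has the shape shown above. The following is a minimal sketch, not part of the sample repository: the function name parse_endpoint_arn and the character classes in the pattern are assumptions made for illustration, and the pattern covers only the standard aws partition.

```python
import re

# Pattern for the ARN shape shown above. The character classes are assumptions
# chosen for illustration, not an official AWS grammar.
ENDPOINT_ARN = re.compile(
    r"^arn:aws:comprehend:"
    r"(?P<region>[a-z0-9-]+):"
    r"(?P<account_id>\d{12}):"
    r"document-classifier-endpoint/"
    r"(?P<name>[A-Za-z0-9-]+)$"
)


def parse_endpoint_arn(arn: str) -> dict:
    """Split an endpoint ARN into region, account ID, and endpoint name."""
    match = ENDPOINT_ARN.match(arn.strip())
    if match is None:
        raise ValueError(f"not a document classifier endpoint ARN: {arn!r}")
    return match.groupdict()


if __name__ == "__main__":
    # Placeholder account ID and endpoint name.
    example = (
        "arn:aws:comprehend:us-east-1:123456789012:"
        "document-classifier-endpoint/acord-demo"
    )
    print(parse_endpoint_arn(example))
```

A check like this catches a common copy-and-paste mistake, such as supplying the classifier ARN instead of the endpoint ARN.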

Navigate to docsplitter/document_split_workflow.py and modify lines 27 and 28, which contain comprehend_classifier_endpoint. Enter your endpoint ARN in line 28.

Install dependencies

Now install the project dependencies:

python -m pip install -r requirements.txt

Bootstrap the AWS CDK for your account and Region by running cdk bootstrap. This creates the Amazon Simple Storage Service (Amazon S3) buckets and roles that the AWS CDK tool uses to store artifacts and deploy the infrastructure.
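The AWS CDK bootstrap command accepts a target environment of the form aws://<ACCOUNT_ID>/<REGION>. As a rough sketch, the helper below assembles that command from an account ID and Region. The function name bootstrap_command is an assumption made for this example, and the values shown are placeholders.

```python
import re
from typing import List


def bootstrap_command(account_id: str, region: str) -> List[str]:
    """Return the argv list for bootstrapping one account/Region pair."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    if not region:
        raise ValueError("region must not be empty")
    return ["cdk", "bootstrap", f"aws://{account_id}/{region}"]


if __name__ == "__main__":
    cmd = bootstrap_command("123456789012", "us-east-1")  # placeholder values
    print(" ".join(cmd))
    # To run it from Python with your own values:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Bootstrapping is done once for each account and Region pair that you deploy to.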

Deploy the AWS CDK stack

When the Amazon Comprehend classifier and document configuration table are ready, deploy the stack with the following code:

cdk deploy DocumentSplitterWorkflow --outputs-file document_splitter_outputs.json --require-approval never
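The --outputs-file option writes the stack outputs as JSON, keyed first by stack name and then by output name. The following sketch reads that file back. The function name load_stack_outputs is an assumption, and the code makes no assumption about which output keys the stack defines.

```python
import json
from pathlib import Path
from typing import Dict


def load_stack_outputs(path: str, stack_name: str) -> Dict[str, str]:
    """Return the outputs that `cdk deploy --outputs-file` recorded for one stack."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if stack_name not in data:
        raise KeyError(f"{stack_name} not found; stacks present: {sorted(data)}")
    return data[stack_name]
```

For example, load_stack_outputs("document_splitter_outputs.json", "DocumentSplitterWorkflow") returns a dictionary that you can print to see every value the deployment produced.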

Upload the document

Verify that the stack is fully deployed.

Then, in the terminal window, run the aws s3 cp command to upload the document to the DocumentUploadLocation for the DocumentSplitterWorkflow:

aws s3 cp sample-doc.pdf $(aws cloudformation list-exports --query 'Exports[?Name==`DocumentSplitterWorkflow-DocumentUploadLocation`].Value' --output text)

We have created a 12-page sample document package containing the Acord 125, Acord 126, Acord 140, and Affidavit of Ownership forms. The following images show a one-page excerpt from each document.

All data on the forms is synthetic, and Acord’s standard forms are the property of Acord Corporation and are used here for demonstration purposes only.

Run the Step Functions workflow

Now open the Step Functions workflow. You can get the Step Functions workflow link from the document_splitter_outputs.json file, the Step Functions console, or by using the following command:

aws cloudformation list-exports --query 'Exports[?Name==`DocumentSplitterWorkflow-StepFunctionFlowLink`].Value' --output text
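The same lookup can be done from Python. The sketch below mirrors the --query filter in the command above: find_export selects an export by name from a list of export records, and list_all_exports gathers those records with the boto3 CloudFormation client. Both function names are assumptions made for this example, and list_all_exports needs boto3 and valid AWS credentials.

```python
from typing import Iterable, List, Optional


def find_export(exports: Iterable[dict], name: str) -> Optional[str]:
    """Return the Value of the export called `name`, or None if absent."""
    for export in exports:
        if export.get("Name") == name:
            return export.get("Value")
    return None


def list_all_exports() -> List[dict]:
    """Collect every export in the current account and Region."""
    import boto3  # imported lazily so find_export works without the SDK

    client = boto3.client("cloudformation")
    exports: List[dict] = []
    token = None
    while True:
        # list_exports is paginated; follow NextToken until it is absent.
        kwargs = {"NextToken": token} if token else {}
        page = client.list_exports(**kwargs)
        exports.extend(page.get("Exports", []))
        token = page.get("NextToken")
        if not token:
            return exports
```

Calling find_export(list_all_exports(), "DocumentSplitterWorkflow-StepFunctionFlowLink") returns the workflow link, and the same call with "DocumentSplitterWorkflow-DocumentUploadLocation" returns the upload location.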

Depending on the size of the document package, the workflow time will vary. The sample document should take 1-2 minutes to process. The following diagram illustrates the Step Functions workflow.

When the workflow run is complete, open its input and output. There you can see the machine-readable CSV files for each of the respective forms.

To download these files, open getfiles.py. Set files to the list generated by the state machine execution. You can then run the script with python3 getfiles.py. This generates the csvfiles_<TIMESTAMP> folder, as shown in the following screenshot.
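To give a sense of what such a download step involves, here is a minimal sketch. It is not the repository's getfiles.py: it assumes the list holds S3 URIs of the form s3://bucket/key, and the function names and the timestamp layout are hypothetical. The download_all function needs boto3 and valid AWS credentials.

```python
import os
from datetime import datetime
from typing import List, Tuple


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def output_folder(now: datetime) -> str:
    """Build a csvfiles_<TIMESTAMP> folder name; the timestamp layout is an assumption."""
    return "csvfiles_" + now.strftime("%Y%m%d_%H%M%S")


def download_all(uris: List[str], folder: str) -> List[str]:
    """Download each object into `folder` and return the local paths."""
    import boto3  # imported lazily so the helpers above work without the SDK

    s3 = boto3.client("s3")
    os.makedirs(folder, exist_ok=True)
    saved = []
    for uri in uris:
        bucket, key = parse_s3_uri(uri)
        # Objects that share a file name would overwrite each other here.
        target = os.path.join(folder, os.path.basename(key))
        s3.download_file(bucket, key, target)
        saved.append(target)
    return saved
```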

Congratulations, you have now implemented an end-to-end processing workflow for a commercial insurance application.

Extend the solution for any type of form

In this post, we demonstrate how we could use Amazon Textract IDP CDK Constructs for a commercial insurance use case. However, you can extend these constructs for any type of form. To do this, we first retrain our Amazon Comprehend classifier to account for the new form type and adjust the code as we did before.

For each of the form types you trained, you need to specify its queries and textract_features in the generate_csv.py file. This customizes the processing pipeline for each form type by using the appropriate Amazon Textract API.

queries is a list of queries, for example, “What is the primary email address?” on page 2 of the sample document. For more information, see Queries.

textract_features is a list of the Amazon Textract features that you want to extract from the document. They can be TABLES, FORMS, QUERIES, or SIGNATURES. For more information, see Feature Types.
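To make the relationship between these two settings concrete, the following sketch turns a feature list and a list of [alias, question] pairs into the FeatureTypes and QueriesConfig arguments of the Amazon Textract AnalyzeDocument API. The function name build_analysis_options is an assumption, and the sample repository may assemble its requests differently.

```python
from typing import Dict, List

# The four feature types named in this post.
ALLOWED_FEATURES = {"TABLES", "FORMS", "QUERIES", "SIGNATURES"}


def build_analysis_options(
    features: List[str], queries: List[List[str]]
) -> Dict[str, object]:
    """Translate a feature list and [alias, question] pairs into AnalyzeDocument arguments."""
    unknown = set(features) - ALLOWED_FEATURES
    if unknown:
        raise ValueError(f"unsupported feature types: {sorted(unknown)}")
    # Queries are only meaningful together with the QUERIES feature type.
    if ("QUERIES" in features) != bool(queries):
        raise ValueError("list QUERIES as a feature exactly when queries are supplied")
    options: Dict[str, object] = {"FeatureTypes": list(features)}
    if queries:
        options["QueriesConfig"] = {
            "Queries": [{"Alias": alias, "Text": text} for alias, text in queries]
        }
    return options
```

The returned dictionary could be passed along with a Document argument to the boto3 Textract client's analyze_document call. The alias is the key under which Amazon Textract reports the answer, which is why each query carries a short identifier in addition to the question text.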

Navigate to generate_csv.py. Each document type needs its classification, queries, and textract_features configured by creating CSVRow instances.

For our example, we have four types of documents: acord125, acord126, acord140, and property_affidavit. We want to use the FORMS and TABLES features on the Acord documents, and the QUERIES and SIGNATURES features for the affidavit of ownership.

from typing import List

# CSVRow is provided by the sample repository.


def get_csv_rows():
    # acord125: no queries, extract forms and tables
    acord125_queries: List[List[str]] = list()
    acord_125_features: List[str] = ["FORMS", "TABLES"]
    acord125_row = CSVRow("acord125",
                          acord125_queries,
                          acord_125_features)
    # acord126: no queries, extract forms and tables
    acord126_queries: List[List[str]] = list()
    acord126_features: List[str] = ["FORMS", "TABLES"]
    acord126_row = CSVRow("acord126",
                          acord126_queries,
                          acord126_features)
    # acord140: no queries, extract forms and tables
    acord140_queries: List[List[str]] = list()
    acord140_features: List[str] = ["FORMS", "TABLES"]
    acord140_row = CSVRow("acord140",
                          acord140_queries,
                          acord140_features)
    # property_affidavit: each query is an [alias, question] pair
    property_affidavit_queries: List[List[str]] = [
        ["PROP_AFF_OWNER", "What is your name?"],
        ["PROP_AFF_ADDR", "What is the property's address?"],
        ["PROP_AFF_DATE_EXEC_ON", "When was this executed on?"],
        ["PROP_AFF_DATE_SWORN", "When was this subscribed and sworn to?"],
        ["PROP_AFF_NOTARY", "Who is the notary public?"],
    ]
    property_affidavit_features: List[str] = ["SIGNATURES", "QUERIES"]
    property_affidavit_row = CSVRow("property_affidavit",
                                    property_affidavit_queries,
                                    property_affidavit_features)

See the GitHub repository to learn how this was done for the sample commercial insurance documents.
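Adding a form type therefore comes down to registering one more configuration under the class label that the retrained classifier emits. The sketch below illustrates that idea with a stand-in FormConfig class. It is not the repository's CSVRow, its field names are assumptions, and new_form_type is a hypothetical label.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FormConfig:
    """Hypothetical stand-in for the repository's CSVRow; field names are assumptions."""
    classification: str
    queries: List[List[str]] = field(default_factory=list)
    features: List[str] = field(default_factory=list)


def build_registry(configs: List[FormConfig]) -> Dict[str, FormConfig]:
    """Index configurations by class label, rejecting duplicate labels."""
    registry: Dict[str, FormConfig] = {}
    for config in configs:
        if config.classification in registry:
            raise ValueError(f"duplicate classification: {config.classification}")
        registry[config.classification] = config
    return registry


REGISTRY = build_registry([
    FormConfig("acord125", features=["FORMS", "TABLES"]),
    FormConfig(
        "property_affidavit",
        queries=[["PROP_AFF_OWNER", "What is your name?"]],
        features=["SIGNATURES", "QUERIES"],
    ),
    # A new form type is registered here after the classifier is retrained.
    FormConfig("new_form_type", features=["FORMS"]),  # hypothetical label
])


def features_for(label: str) -> List[str]:
    """Return the Textract features configured for a classifier label."""
    if label not in REGISTRY:
        raise KeyError(f"no configuration for class {label!r}")
    return REGISTRY[label].features
```

Keying the configuration on the classifier label is also why the class names must stay consistent between the Amazon Comprehend endpoint and the code.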

Clean up

To remove the solution, run the cdk destroy command. You will then be asked to confirm the deletion of the workflow. Deleting the workflow deletes all the generated resources.

Conclusion

In this post, we demonstrated how you can get started with Amazon Textract IDP CDK Constructs by implementing a straight-through processing scenario for a set of Acord business forms. We also demonstrated how you can extend the solution to any type of form with simple configuration changes. We encourage you to test the solution with your own documents. Send a pull request to the GitHub repository for any feature requests you may have. To learn more about IDP on AWS, see our documentation.

About the authors

Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in natural language processing (NLP), large language models (LLM), and machine learning infrastructure and operations (MLOps) projects.

Aditi Rajnish is a second-year software engineering student at the University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found rock climbing, playing the piano, or learning how to bake the perfect bun.

Enzo Staton is a solutions architect with a passion for working with enterprises to increase their knowledge of the cloud. He works closely as a trusted advisor and industry specialist with clients across the country.
