Intelligent Document Processing (IDP) with AWS helps automate the extraction of information from documents of different types and formats, quickly and with high precision, without the need for machine learning (ML) skills. Faster, highly accurate information extraction can help you make quality business decisions on time, while reducing overall costs. For more information, see Intelligent Document Processing with AWS AI Services: Part 1.
However, complexity arises when real world scenarios are implemented. Documents are often sent out of order or may be sent as a combined package with various types of forms. You need to create orchestration pipelines to introduce business logic and also take into account different processing techniques depending on the type of form entered. These challenges are only magnified when teams are dealing with large volumes of documents.
In this post, we demonstrate how to solve these challenges using Amazon Textract IDP CDK Constructs, a set of pre-built IDP constructs, to accelerate the development of real-world document processing pipelines. For our use case, we process Acord insurance documents to enable straight-through processing, but you can extend this solution to any use case, which we discuss later in the post.
Acord Document Processing at Scale
Straight Through Processing (STP) is a term used in the financial industry to describe the automation of a transaction from start to finish without the need for manual intervention. The insurance industry uses STP to streamline the underwriting and claims process. This involves the automatic extraction of data from insurance documents such as applications, policy documents and claim forms. STP implementation can be challenging due to the large amount of data and the variety of document formats involved. Insurance documents are inherently diverse. Traditionally, this process involves manually reviewing each document and entering the data into a system, which is time consuming and error prone. This manual approach is not only inefficient, but can also lead to errors that can have a significant impact on the underwriting and claims process. This is where IDP on AWS comes into play.
For a more efficient and accurate workflow, insurance companies can integrate IDP on AWS into the underwriting and claims process. With Amazon Textract and Amazon Comprehend, insurers can read handwriting and different form formats, making it easy to extract information from various types of insurance documents. By implementing IDP on AWS in the process, STP becomes easier to achieve, reducing the need for manual intervention and speeding up the overall process.
This pipeline allows insurers to easily and efficiently process their commercial insurance transactions, reducing the need for manual intervention and improving the overall customer experience. We demonstrate how to use Amazon Textract and Amazon Comprehend to automatically extract data from commercial insurance documents, such as Acord 140, Acord 125, Affidavit of Home Ownership, and Acord 126, and analyze the extracted data to facilitate the underwriting process. These services can help insurance companies improve the accuracy and speed of their STP processes and ultimately provide a better experience for their customers.
Solution Overview
The solution is built using the AWS Cloud Development Kit (AWS CDK) and consists of Amazon Comprehend for document classification, Amazon Textract for document extraction, Amazon DynamoDB for storage, AWS Lambda for application logic, and AWS Step Functions for workflow orchestration.
The pipeline consists of the following phases:
- Break down document packages and classify each type of form with Amazon Comprehend.
- Run the processing pipelines for each type of form or form page with the appropriate Amazon Textract API (signature detection, table extraction, form extraction, or queries).
- Post-process the Amazon Textract output into a machine-readable format.
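As a rough illustration of the post-processing phase, the following minimal sketch flattens the key-value pairs of an Amazon Textract response into CSV text. It is not the code used by the constructs. It assumes a response that contains KEY_VALUE_SET blocks, as returned when the FORMS feature is requested, and the function names key_value_pairs and to_csv are hypothetical.

```python
import csv
import io


def _text_of(block, blocks_by_id):
    """Join the text of a block's CHILD words (and selection marks)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(response):
    """Return (key, value) tuples from the KEY_VALUE_SET blocks of a response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # A KEY block points at its VALUE block(s) by ID.
                value_text = " ".join(
                    _text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs.append((_text_of(block, blocks_by_id), value_text))
    return pairs


def to_csv(pairs):
    """Serialize (key, value) tuples as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["key", "value"])
    writer.writerows(pairs)
    return buf.getvalue()
```

Table cells, query answers, and signatures use other block types and would need their own handling, which this sketch leaves out.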
The following screenshot of the Step Functions workflow illustrates the pipeline.
Prerequisites
To get started with the solution, make sure you have the following:
- AWS CDK version 2 installed
- Docker installed and running on your machine
- Proper access to Step Functions, DynamoDB, Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Textract, and Amazon Comprehend
Clone the GitHub repository
Start by cloning the GitHub repository:
Create an Amazon Comprehend classification endpoint
We must first provide an Amazon Comprehend classification endpoint.
For this post, the endpoint detects the following document classes (make sure the naming is consistent):
acord125
acord126
acord140
property_affidavit
You can create one using the comprehend_acord_dataset.csv sample dataset in the GitHub repository. To train and create a custom classification endpoint using the provided sample data set, follow the instructions in Training custom classifiers. If you want to use your own PDF files, see the first workflow in the post Intelligently Split Multi-Form Document Packages with Amazon Textract and Amazon Comprehend.
After you train your classifier and create an endpoint, you should have an Amazon Comprehend custom classification endpoint ARN that resembles the following code:
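Before pasting the ARN into the code, it can be worth checking its shape. The following is a minimal sketch, assuming the general layout arn:aws:comprehend:<REGION>:<ACCOUNT_ID>:document-classifier-endpoint/<ENDPOINT_NAME>. The regular expression is an approximation of that layout, and the helper name is_classifier_endpoint_arn and the sample values are hypothetical.

```python
import re

# Approximate pattern for an Amazon Comprehend custom classification
# endpoint ARN (assumption: partition, Region, 12-digit account ID, then
# "document-classifier-endpoint/<name>").
_ENDPOINT_ARN = re.compile(
    r"arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]+:[0-9]{12}:"
    r"document-classifier-endpoint/[a-zA-Z0-9](-*[a-zA-Z0-9])*"
)


def is_classifier_endpoint_arn(arn: str) -> bool:
    """Return True if the string looks like a classifier endpoint ARN."""
    return _ENDPOINT_ARN.fullmatch(arn) is not None


if __name__ == "__main__":
    # Placeholder account ID and endpoint name, for illustration only.
    example = (
        "arn:aws:comprehend:us-east-1:123456789012:"
        "document-classifier-endpoint/acord-classifier"
    )
    print(is_classifier_endpoint_arn(example))
```

A check like this only catches copy-and-paste mistakes, such as using the classifier ARN instead of the endpoint ARN. It does not confirm that the endpoint exists.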
Navigate to docsplitter/document_split_workflow.py and modify lines 27 and 28, which contain comprehend_classifier_endpoint. Enter your endpoint ARN in line 28.
Install dependencies
Now install the project dependencies:
Initialize the account and region for the AWS CDK. This will create the Amazon Simple Storage Service (Amazon S3) buckets and roles for the AWS CDK tool to store artifacts and to deploy the infrastructure. See the following code:
Deploy the AWS CDK stack
When the Amazon Comprehend classifier and document configuration table are ready, deploy the stack with the following code:
Upload the document
Verify that the stack is fully deployed.
Then, in a terminal window, run the aws s3 cp command to upload the document to the DocumentUploadLocation for the DocumentSplitterWorkflow.
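The upload destination can also be read programmatically. The following is a minimal sketch, assuming the stack outputs were written to document_splitter_outputs.json in the usual AWS CDK outputs layout (stack name, then output key) and that DocumentUploadLocation holds an s3:// URI. The helper name parse_upload_location is hypothetical.

```python
import json
from urllib.parse import urlparse


def parse_upload_location(
    outputs: dict,
    stack: str = "DocumentSplitterWorkflow",
    key: str = "DocumentUploadLocation",
):
    """Split the upload location from CDK outputs into (bucket, prefix).

    Assumes the output value is an S3 URI such as s3://bucket/prefix/.
    """
    uri = outputs[stack][key]
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected an s3:// URI, got {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # Assumes the outputs file from the deployment is in the current folder.
    with open("document_splitter_outputs.json") as f:
        bucket, prefix = parse_upload_location(json.load(f))
    print(bucket, prefix)
```

With the bucket and prefix in hand, the document can be uploaded with the aws s3 cp command or with the boto3 S3 client's upload_file method.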
We have created a 12-page sample document packet containing the Acord 125, Acord 126, Acord 140, and Affidavit of Ownership forms. The following images show a one-page excerpt from each document.
All data on the forms is synthetic, and Acord’s standard forms are the property of Acord Corporation and are used here for demonstration purposes only.
Run the Step Functions workflow
Now open the Step Functions workflow. You can get the Step Functions workflow link from the document_splitter_outputs.json file, the Step Functions console, or by using the following command:
Depending on the size of the document package, the workflow time will vary. The sample document should take 1-2 minutes to process. The following diagram illustrates the Step Functions workflow.
When the workflow run is complete, navigate to its input and output. From there, you can see the machine-readable CSV files for each of the respective forms.
To download these files, open getfiles.py. Set files to be the list generated by the state machine run. You can then run the script with python3 getfiles.py. This generates the csvfiles_<TIMESTAMP> folder, as shown in the following screenshot.
Congratulations, you have now implemented an end-to-end processing workflow for a commercial insurance application.
Extend the solution for any type of form
In this post, we demonstrated how to use Amazon Textract IDP CDK Constructs for a commercial insurance use case. However, you can extend these constructs for any type of form. To do this, first retrain your Amazon Comprehend classifier to account for the new form type and adjust the code as before.
For each of the form types you trained, you need to specify its queries and textract_features in the generate_csv.py file. This customizes the processing pipeline for each form type by using the appropriate Amazon Textract API.
queries is a list of queries. An example is “What is the primary email address?” for page 2 of the sample document. For more information, see Queries.
textract_features is a list of the Amazon Textract features that you want to extract from the document. They can be TABLES, FORMS, QUERIES, or SIGNATURES. For more information, see FeatureTypes.
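To make the relationship between these two settings concrete, the following minimal sketch builds the keyword arguments for the Amazon Textract AnalyzeDocument API from a feature list and a query list. It assumes the synchronous API for illustration, and the constructs may call Amazon Textract differently. The helper name build_analyze_request and the bucket and key values are hypothetical.

```python
ALLOWED_FEATURES = {"TABLES", "FORMS", "QUERIES", "SIGNATURES"}


def build_analyze_request(bucket, key, features, queries=()):
    """Build keyword arguments for Textract AnalyzeDocument.

    features: feature type names, for example ["FORMS", "TABLES"].
    queries: natural-language questions, required when QUERIES is requested.
    """
    unknown = set(features) - ALLOWED_FEATURES
    if unknown:
        raise ValueError(f"Unsupported feature types: {sorted(unknown)}")
    if "QUERIES" in features and not queries:
        raise ValueError("The QUERIES feature needs at least one query")

    request = {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(features),
    }
    if "QUERIES" in features:
        # Each query is passed as a dict with a Text field.
        request["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    return request


if __name__ == "__main__":
    # Placeholder bucket and object key, for illustration only.
    request = build_analyze_request(
        "my-bucket",
        "page-2.png",
        ["QUERIES", "SIGNATURES"],
        ["What is the primary email address?"],
    )
    print(request)
```

The resulting dictionary could be passed to the boto3 Textract client as analyze_document(**request). Keeping the request construction separate from the call makes the feature and query choices easy to inspect without AWS credentials.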
Navigate to generate_csv.py. Each document type needs its classification, queries, and textract_features configured by creating CSVRow instances.
For our example, we have four document types: acord125, acord126, acord140, and property_affidavit. We want to use the FORMS and TABLES features on the Acord documents, and the QUERIES and SIGNATURES features for the affidavit of ownership.
See the GitHub repository to learn how this was done for the sample commercial insurance documents.
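As a rough picture of that configuration, the following sketch uses a hypothetical stand-in for CSVRow. The actual class in generate_csv.py may have different fields or a different constructor. The query text for the affidavit is a placeholder, and the helper features_for is an assumption added for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CSVRow:
    """Hypothetical stand-in for the CSVRow class in generate_csv.py."""

    classification: str
    queries: List[str] = field(default_factory=list)
    textract_features: List[str] = field(default_factory=list)


# One row per document class returned by the Amazon Comprehend endpoint.
ROWS = [
    CSVRow("acord125", textract_features=["FORMS", "TABLES"]),
    CSVRow("acord126", textract_features=["FORMS", "TABLES"]),
    CSVRow("acord140", textract_features=["FORMS", "TABLES"]),
    CSVRow(
        "property_affidavit",
        # Placeholder query, not taken from the repository.
        queries=["Who signed the affidavit?"],
        textract_features=["QUERIES", "SIGNATURES"],
    ),
]


def features_for(classification: str) -> List[str]:
    """Look up the Textract features configured for a document class."""
    for row in ROWS:
        if row.classification == classification:
            return row.textract_features
    raise KeyError(f"No configuration for class {classification!r}")
```

Adding a new form type then amounts to retraining the classifier with the new class name and appending one more row with the features that suit that form.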
Clean up
To remove the solution, run the cdk destroy command. You are then prompted to confirm the deletion of the workflow. Deleting the workflow deletes all the generated resources.
Conclusion
In this post, we demonstrated how you can get started with Amazon Textract IDP CDK Constructs by implementing a straight-through processing scenario for a set of Acord business forms. We also demonstrated how you can extend the solution to any type of form with simple configuration changes. We encourage you to test the solution with your own documents. Submit a pull request to the GitHub repository for any feature requests you may have. To learn more about IDP on AWS, see our documentation.
About the authors
Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in natural language processing (NLP), large language models (LLMs), and machine learning infrastructure and operations (MLOps) projects.
Aditi Rajnish is a second-year software engineering student at the University of Waterloo. His interests include computer vision, natural language processing, and edge computing. He is also passionate about community-based STEM outreach and advocacy. In his spare time, he can be found rock climbing, playing the piano, or learning how to bake the perfect bun.
Enzo Staton is a solutions architect with a passion for working with enterprises to increase their knowledge of the cloud. He works closely as a trusted advisor and industry specialist with clients across the country.