Amazon Textract is a machine learning (ML) service that enables the automatic extraction of text, handwriting, and data from scanned documents, surpassing traditional optical character recognition (OCR). You can identify, understand, and extract data from tables and forms with remarkable accuracy. Currently, several companies rely on manual extraction methods or basic OCR software, which is tedious, time-consuming, and requires manual configuration that must be updated when the form changes. Amazon Textract helps solve these challenges by using ML to automatically process different types of documents and accurately extract information with minimal manual intervention. This allows you to automate document processing and use the extracted data for different purposes, such as automating loan processing or collecting invoice and receipt information.
As travel resumes after the pandemic, in many cases it may be necessary to verify the traveler's vaccination status. Hotels and travel agencies often need to review vaccination cards to collect important details, such as whether the traveler is fully vaccinated, the dates of the vaccination, and the traveler's name. Some agencies do this by manually verifying cards, which can be time-consuming for staff and leaves room for human error. Others have created custom solutions, but they can be expensive, difficult to scale, and time-consuming to implement. In the future, there may be opportunities to streamline the vaccination status verification process in a way that is efficient for businesses while also respecting the privacy and convenience of travelers.
Amazon Textract Queries helps address these challenges. It lets you specify and extract only the information you need from a document, and returns precise, accurate answers taken from the document.
In this post, we walk you through a step-by-step implementation guide to create a vaccination status verification solution using Amazon Textract queries. The solution shows how to process vaccination cards using an Amazon Textract query, verify vaccination status, and store the information for future use.
Solution Overview
The following diagram illustrates the architecture of the solution.
The workflow includes the following steps:
- The user takes a photograph of a vaccination record.
- The image is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
- When the image is saved to the S3 bucket, it invokes an AWS Step Functions workflow:
- The Queries-Decider AWS Lambda function examines the document passed in and adds information about the MIME type, number of pages, and number of queries to the Step Functions workflow (for our example, we have four queries).
- NumberQueriesAndPagesChoice is a Choice state that adds conditional logic to the workflow. If there are between 15 and 31 queries and the number of pages is between 2 and 3,001, then Amazon Textract asynchronous processing is the only option, because the synchronous APIs only support up to 15 queries and one-page documents. For all other cases, we route to a random selection of synchronous or asynchronous processing.
- The TextractSync Lambda function sends a request to Amazon Textract to analyze the document based on the following Amazon Textract queries:
- What is vaccination status?
- What's Name?
- What is the date of birth?
- What is the document number?
- Amazon Textract analyzes the image and sends the answers to these queries back to the Lambda function.
- The Lambda function checks the customer's vaccination status and stores the final result in CSV format in the same S3 bucket (demoqueries-textractxxx) in the csv-output folder.
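The routing rule applied by the NumberQueriesAndPagesChoice state can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the state machine's actual definition. The function name is hypothetical, and the reading of the ranges (more than 15 and fewer than 31 queries, more than 1 and fewer than 3,001 pages) is an assumption.

```python
import random

SYNC_MAX_QUERIES = 15  # synchronous limit stated in the text
SYNC_MAX_PAGES = 1     # synchronous APIs handle one-page documents


def choose_processing_mode(num_queries, num_pages, rng=random):
    """Return "sync" or "async" following the rule described above (hypothetical helper)."""
    # Assumed reading of the ranges in the text: more than 15 queries (below 31)
    # together with more than one page (below 3,001) forces the asynchronous path.
    needs_async = (SYNC_MAX_QUERIES < num_queries < 31) and (SYNC_MAX_PAGES < num_pages < 3001)
    if needs_async:
        return "async"
    # Every other case is routed to a random pick, as the workflow description states.
    return rng.choice(["sync", "async"])
```

For the four-query, one-page vaccination card in this post, the function falls through to the random selection; passing a seeded `random.Random` instance as `rng` makes that choice reproducible in tests.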
Prerequisites
To complete this solution, you must have an AWS account and the appropriate permissions to create the necessary resources as part of the solution.
Download the implementation code and vaccination card template from GitHub.
Use the Queries feature in the Amazon Textract console
Before creating your vaccination verification solution, let's explore how you can use Amazon Textract queries to extract vaccination status through the Amazon Textract console. You can use the sample vaccination card that you downloaded from the GitHub repository.
- In the Amazon Textract console, choose Analyze Document in the navigation pane.
- Under Upload document, choose Choose document to upload the vaccination card from your local drive.
- After you upload the document, select Queries in the Configure document section.
- You can then add queries in the form of natural language questions. Let's add the following:
- What is vaccination status?
- What's Name?
- What is the date of birth?
- What is the document number?
- After adding all your queries, choose Apply settings.
- Check the Queries tab to see the answers to the questions.
You can see that Amazon Textract extracts the answer to your query from the document.
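The same queries can also be sent programmatically. The following is a minimal sketch, assuming the boto3 Textract client and its `analyze_document` call with the `QUERIES` feature type. The aliases, function names, bucket, and key are hypothetical and are not taken from the solution's code. The request building and response parsing are kept as plain functions so they can be exercised without AWS credentials.

```python
# Hypothetical aliases mapped to the four questions used in this post.
QUESTIONS = {
    "VACCINATION_STATUS": "What is vaccination status?",
    "NAME": "What's Name?",
    "DATE_OF_BIRTH": "What is the date of birth?",
    "DOCUMENT_NUMBER": "What is the document number?",
}


def build_request(bucket, key, questions=QUESTIONS):
    """Build keyword arguments for a Textract AnalyzeDocument call with queries."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [{"Text": text, "Alias": alias} for alias, text in questions.items()]
        },
    }


def extract_answers(response):
    """Map each query alias to (answer text, confidence), or None if unanswered."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        alias = block["Query"].get("Alias") or block["Query"]["Text"]
        answers[alias] = None
        # QUERY blocks point at their QUERY_RESULT blocks through ANSWER relationships.
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                result = blocks.get(answer_id)
                if result is not None:
                    answers[alias] = (result.get("Text"), result.get("Confidence"))
    return answers


def analyze_card(bucket, key):
    """Call Textract synchronously; requires boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3 installed

    client = boto3.client("textract")
    return extract_answers(client.analyze_document(**build_request(bucket, key)))
```

Calling `analyze_card` with the bucket and object key of an uploaded card would return a dictionary keyed by alias, which makes later steps independent of the exact wording of each question.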
Implement vaccination verification solution
In this post, we use an AWS Cloud9 instance and install the necessary dependencies on the instance using the AWS Cloud Development Kit (AWS CDK) and Docker. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser.
- In the terminal, choose Upload Local Files on the File menu.
- Choose Select folder and choose the vaccination_verification_solution folder you downloaded from GitHub.
- In the terminal, prepare your serverless application for the subsequent steps of your development workflow in AWS Serverless Application Model (AWS SAM) using the following command:
- Deploy the application using the cdk deploy command:
Wait for AWS CDK to deploy the model and create the resources mentioned in the template.
- When the deployment is complete, you can verify the deployed resources on the Resources tab of the stack details page in the AWS CloudFormation console.
Test the solution
Now it's time to test the solution. To trigger the workflow, use aws s3 cp to upload the vac_card.jpg file to DemoQueries.DocumentUploadLocation inside the documents folder:
The vaccination certificate file is automatically uploaded to the S3 bucket demoqueries-textractxxx in the uploads folder.
The Step Functions workflow is triggered via a Lambda function as soon as the vaccination certificate file is uploaded to the S3 bucket.
The Queries-Decider Lambda function examines the document and adds information about the MIME type, number of pages, and number of queries to the Step Functions workflow (for this example, we use four queries: document number, customer name, date of birth and vaccination status).
The TextractSync function sends the input queries to Amazon Textract and synchronously returns the entire result as part of the response. It supports one-page documents (TIFF, PDF, JPG, PNG) and up to 15 queries. The GenerateCsvTask function takes the JSON output from Amazon Textract and converts it to a CSV file.
The final output is stored in the same S3 bucket in the csv-output folder as a CSV file.
You can download the file to your local machine using the following command:
The format of the result is timestamp, classification, filename, page number, key name, key_confidence, value, value_confidence, key_bb_top, key_bb_height, key_bb_width, key_bb_left, value_bb_top, value_bb_height, value_bb_width, value_bb_left.
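To make the output format concrete, the following sketch writes rows with these columns using Python's standard csv module. It is not the GenerateCsvTask implementation. The column names follow the list above (with key_bb_width assumed for the key bounding-box width), and the function name and sample values are hypothetical.

```python
import csv
import io

# Column names taken from the result format described above.
COLUMNS = [
    "timestamp", "classification", "filename", "page number", "key name",
    "key_confidence", "value", "value_confidence",
    "key_bb_top", "key_bb_height", "key_bb_width", "key_bb_left",
    "value_bb_top", "value_bb_height", "value_bb_width", "value_bb_left",
]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text; missing columns are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A row such as `{"filename": "vac_card.jpg", "key name": "NAME", "value": "Jane Doe"}` produces one line under the header, with empty fields for the columns that were not supplied.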
You can scale the solution to hundreds of vaccination certificate documents for multiple customers by uploading their vaccination certificates to DemoQueries.DocumentUploadLocation. This automatically triggers multiple executions of the Step Functions state machine, and the final result is stored in the same S3 bucket in the csv-output folder.
To change the initial set of queries that are sent to Amazon Textract, you can go to your AWS Cloud9 instance and open the start_execution.py file. In the file view in the left pane, navigate to lambda, start_queries, app, start_execution.py. This Lambda function is invoked when a file is uploaded to DemoQueries.DocumentUploadLocation. The queries sent to the workflow are defined in start_execution.py; you can change them by updating the code as shown in the following screenshot.
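As a rough illustration of what a Lambda function of this kind might do, the sketch below turns an S3 upload event into a Step Functions execution input that carries the queries. It is not the repository's start_execution.py. The payload field names (`s3Path`, `queries`), the function names, and the state machine ARN parameter are hypothetical; the S3 event layout and the boto3 `start_execution` call are standard.

```python
import json
from urllib.parse import unquote_plus


def build_execution_input(s3_event, queries):
    """Build the JSON input for a workflow execution from an S3 event (hypothetical payload shape)."""
    record = s3_event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    return json.dumps({"s3Path": f"s3://{bucket}/{key}", "queries": queries})


def start_workflow(s3_event, queries, state_machine_arn):
    """Start one execution per uploaded file; requires boto3 and AWS credentials."""
    import boto3  # imported here so build_execution_input works without boto3 installed

    client = boto3.client("stepfunctions")
    return client.start_execution(
        stateMachineArn=state_machine_arn,
        input=build_execution_input(s3_event, queries),
    )
```

Keeping the queries as a plain list passed into the input means that changing the questions only requires editing that list, which mirrors the idea of updating the queries in one place.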
Clean up
To avoid incurring ongoing charges, delete the resources created in this post using the following command:
Answer the question Are you sure you want to delete: DemoQueries (y/n)? with y.
Conclusion
In this post, we showed you how to use Amazon Textract Queries to create a vaccination verification solution for the travel industry. You can use Amazon Textract Queries to build solutions in other industries, such as finance and healthcare, and retrieve information from documents such as pay stubs, mortgage notes, and insurance cards based on natural language questions.
For more information, see Document Analysis or check out the Amazon Textract console and try this feature.
About the authors
Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.
Rishabh Yadav is an AWS Partner Solutions Architect with extensive experience in DevOps and security offerings on AWS. He works with ASEAN partners to provide guidance on enterprise cloud adoption and architecture reviews, and helps them build AWS practices by implementing the Well-Architected Framework. Outside of work, he likes to spend his time on sports and FPS games.