Healthcare data is complex and siloed, and exists in multiple formats. It is estimated that 80% of data within organizations is unstructured or “dark” data that is locked within text, emails, PDFs, and scanned documents. This data is difficult to interpret or analyze programmatically, which limits how organizations can gain insight from it and serve their customers more effectively. The rapid rate of data generation means that organizations that aren’t investing in document automation risk being locked into legacy processes that are manual, slow, error-prone, and difficult to scale.
In this post, we propose a solution that automates the ingestion and transformation of previously untapped PDF files, clinical notes, and handwritten data. We explain how to extract information from customer clinical data tables with Amazon Textract, then use the extracted raw text to identify discrete data elements with Amazon Comprehend Medical. We store the final result in a Fast Healthcare Interoperability Resources (FHIR) compliant format on Amazon HealthLake, making it available for further analysis.
Solution Overview
AWS offers a variety of services and solutions for healthcare providers to unlock the value of their data. For our solution, we process a small sample of documents through Amazon Textract and upload the extracted data as appropriate FHIR resources to Amazon HealthLake. We create a custom process for FHIR conversion and test it from start to finish.
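As a rough illustration of the extraction step, the following Python sketch groups the text that Amazon Textract returns by page. It assumes the documented Textract response shape (a Blocks list whose LINE entries carry Text and Page fields); the function name and the sample response are hypothetical and are not taken from the repository.

```python
from collections import defaultdict


def lines_by_page(textract_response):
    """Join the text of LINE blocks per page, preserving block order."""
    pages = defaultdict(list)
    for block in textract_response.get("Blocks", []):
        # PAGE and WORD blocks are skipped so each line is counted once.
        if block.get("BlockType") == "LINE":
            # Fall back to page 1 if a block carries no Page field.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


# Hypothetical, hand-written response fragment for illustration only.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Patient: Jane Doe"},
        {"BlockType": "WORD", "Page": 1, "Text": "Patient:"},
        {"BlockType": "LINE", "Page": 2, "Text": "Medications: aspirin 81 mg"},
    ]
}

page_text = lines_by_page(sample_response)
```

Grouping by page is only one possible design; it is shown here because a per-page split makes it straightforward to create one resource per page of a multi-page document.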
The data is first loaded into DocumentReference. Amazon HealthLake then processes this unstructured text in DocumentReference, creates system-generated resources, and loads them into Condition, MedicationStatement, and Observation resources. We identified some data fields within FHIR resources, such as patient ID, date of service, provider type, and medical facility name.
A MedicationStatement is a record of a medication that is being consumed by a patient. It can indicate that the patient is taking the drug now, has taken it in the past, or will take it in the future. This information is commonly captured while taking the patient’s history during a visit or stay. The source of medication information could be the patient’s memory, a prescription bottle, or a list of medications maintained by the patient, physician, or another party.
Observations are a central element in healthcare, used to support diagnosis, monitor progress, determine baselines and patterns, and even capture demographic characteristics. Most observations are simple name/value pair assertions with some metadata, but some observations logically group other observations, or could even be multi-component observations.
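To make the name/value idea concrete, here is a minimal Python sketch of a simple FHIR R4 Observation expressed as a dictionary. The helper name and the patient ID are hypothetical; the LOINC code 8867-4 (heart rate) is used purely as an example and does not come from the sample document. In this solution, Amazon HealthLake generates Observation resources itself, so the sketch only illustrates the shape of the resource.

```python
import json


def make_observation(patient_id, loinc_code, display, value, unit):
    """Build a minimal FHIR R4 Observation: a coded name plus a quantity value."""
    return {
        "resourceType": "Observation",
        "status": "final",
        # The "name" half of the pair: what was observed.
        "code": {
            "coding": [
                {"system": "http://loinc.org", "code": loinc_code, "display": display}
            ]
        },
        # Metadata: who the observation is about.
        "subject": {"reference": f"Patient/{patient_id}"},
        # The "value" half of the pair, with UCUM units.
        "valueQuantity": {
            "value": value,
            "unit": unit,
            "system": "http://unitsofmeasure.org",
            "code": unit,
        },
    }


# Hypothetical example values for illustration only.
observation = make_observation("example-patient-id", "8867-4", "Heart rate", 72, "/min")
observation_json = json.dumps(observation, indent=2)
```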
The Condition resource is used to record detailed information about a condition, problem, diagnosis, or other clinical event, situation, issue, or concept that has risen to a level of concern. The condition could be a point-in-time diagnosis in the context of an encounter, an item on the practitioner’s problem list, or a concern that is not on the practitioner’s problem list.
The following diagram shows the workflow for migrating unstructured data to FHIR for artificial intelligence and machine learning (ML) analysis in Amazon HealthLake.
The workflow steps are as follows:
- A document is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
- Uploading documents to Amazon S3 triggers an AWS Lambda function.
- The Lambda function sends the image to Amazon Textract.
- Amazon Textract extracts the text from the image and stores the output in a separate Amazon Textract output S3 bucket.
- The final result is stored as specific FHIR resources (extracted text is loaded into DocumentReference as base64-encoded text) in Amazon HealthLake to extract meaning from unstructured data with integrated Amazon Comprehend Medical for easy search and query.
- Users can create meaningful analyses and run interactive analytics with Amazon Athena.
- Users can create visualizations, perform ad hoc analysis, and quickly gain business insights with Amazon QuickSight.
- Users can make predictions with health data using machine learning models from Amazon SageMaker.
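The step that loads extracted text into DocumentReference as base64-encoded text could look roughly like the following Python sketch. It builds a minimal FHIR R4 DocumentReference with the text carried in content.attachment.data. The function name, patient ID, and title are hypothetical, and the actual mapping in the repository may differ.

```python
import base64
import json


def build_document_reference(text, patient_id, title):
    """Wrap extracted text in a minimal FHIR R4 DocumentReference."""
    # FHIR attachments carry inline content as base64-encoded bytes.
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [
            {
                "attachment": {
                    "contentType": "text/plain",
                    "data": encoded,
                    "title": title,
                }
            }
        ],
    }


# Hypothetical input values for illustration only.
doc_ref = build_document_reference(
    "Patient reports taking aspirin 81 mg daily.",
    "example-patient-id",
    "Sample Medical Record, page 1",
)
payload = json.dumps(doc_ref)
```

The resulting JSON payload is what a client would send when creating the resource in the data store; authentication and the HTTP call itself are left out of this sketch.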
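Prerequisites
This post assumes familiarity with the services used in this solution, including Amazon S3, AWS Lambda, Amazon Textract, Amazon Comprehend Medical, Amazon HealthLake, and the AWS CDK.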
By default, Amazon Comprehend Medical’s built-in natural language processing (NLP) capability within Amazon HealthLake is disabled in your AWS account. To enable it, submit a support case with your account ID, AWS Region, and Amazon HealthLake datastore ARN. For more information, see How do I turn on HealthLake’s built-in natural language processing feature?
Refer to the GitHub repository for more implementation details.
Deploy the solution architecture
To configure the solution, complete the following steps:
- Clone the GitHub repository, run cdk deploy PdfMapperToFhirWorkflow from your command prompt or terminal, and follow the README file. Deployment completes in approximately 30 minutes.
- In the Amazon S3 console, navigate to the bucket whose name starts with pdfmappertofhirworkflow-, which was created as part of cdk deploy.
- Inside the bucket, create a folder called uploads and upload the sample PDF (Sample Medical Record.pdf).
As soon as the document upload is successful, the pipeline will be activated and you can start viewing the data in Amazon HealthLake, which you can query with various AWS tools.
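For readers curious about how the upload starts the pipeline, the following Python sketch shows one way a Lambda function could read the bucket and object key from an S3 event notification. The event structure (Records, s3.bucket.name, s3.object.key) follows the standard S3 notification format; the function name and the sample bucket name are hypothetical and are not taken from the repository.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every object in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        key = s3_info.get("object", {}).get("key")
        if bucket and key:
            # S3 URL-encodes object keys in events; spaces arrive as "+".
            objects.append((bucket, unquote_plus(key)))
    return objects


# Hypothetical event for illustration only; the real bucket name has a generated suffix.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "pdfmappertofhirworkflow-example"},
                "object": {"key": "uploads/Sample+Medical+Record.pdf"},
            }
        }
    ]
}

uploaded = extract_s3_objects(sample_event)
```

Decoding the key matters here because the sample file name contains spaces, and passing the encoded key to downstream services would point at an object that does not exist.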
Query the data
To explore your data, complete the following steps:
- In the CloudWatch console, find the HealthlakeTextract log group.
- In the log group details, note the unique ID of the document that you processed.
- In the Amazon HealthLake console, choose Data stores in the navigation pane.
- Select your data store and choose Run query.
- For Query type, choose Search with GET.
- For Resource type, choose DocumentReference.
- For Search parameters, enter the parameter relatesto and the value as DocumentReference/Unique ID.
- Choose Run query.
- In the Response body section, minimize the resource sections to see only the six resources that were created for the six-page PDF document.
- The following screenshot shows the embedded analytics with Amazon Comprehend Medical and NLP enabled. The screenshot on the left is the source PDF; the screenshot on the right is the NLP output from Amazon HealthLake.
- You can also run a query with Query type set as Read and Resource type set as Condition using the appropriate resource ID. The following screenshot shows the results of the query.
- In the Athena console, run the following query:
Similarly, you can check MedicationStatement, Condition, and Observation resources.
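The console search from the steps above can also be thought of as a FHIR REST search. The Python sketch below only assembles the URL for such a search. It assumes the search parameter is the standard FHIR relatesto parameter on DocumentReference; the endpoint and document ID shown are hypothetical placeholders, and a real request to Amazon HealthLake would additionally need to be signed with AWS Signature Version 4.

```python
from urllib.parse import urlencode


def build_search_url(datastore_endpoint, document_id):
    """Build a FHIR search URL for DocumentReference resources related to a document."""
    base = datastore_endpoint.rstrip("/")
    # urlencode percent-encodes the "/" inside the reference value.
    query = urlencode({"relatesto": f"DocumentReference/{document_id}"})
    return f"{base}/DocumentReference?{query}"


# Hypothetical endpoint and ID for illustration only; use your data store's
# endpoint and the unique ID noted from the CloudWatch log group.
search_url = build_search_url(
    "https://healthlake.us-east-1.amazonaws.com/datastore/example-datastore-id/r4/",
    "example-document-id",
)
```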
Clean up
Once you are done using this solution, run cdk destroy PdfMapperToFhirWorkflow to ensure you do not incur additional charges. For more information, see AWS CDK Toolkit (cdk Command).
Conclusion
AI services from AWS and Amazon HealthLake can help you store, transform, query, and analyze unstructured healthcare data to generate insights. Although this post only covered a clinical chart in PDF, you could extend the solution to other types of PDFs, images, and handwritten healthcare notes. After the data is extracted in text form, parsed into discrete data elements with Amazon Comprehend Medical, and stored in Amazon HealthLake, downstream systems could further enrich it to generate meaningful and actionable healthcare insights, and ultimately improve patient health outcomes.
The proposed solution does not require you to deploy or maintain server infrastructure. All services are either managed by AWS or serverless. With the AWS pay-as-you-go billing model and its depth and breadth of services, the cost and effort of initial setup and experimentation are significantly lower than with traditional on-premises alternatives.
Additional resources
For more information about Amazon HealthLake, see the following:
About the authors
Shravan Vurputoor is a Senior Solutions Architect at AWS. As a trusted customer advocate, he helps organizations understand best practices around advanced cloud-based architectures, and provides strategic advice to help drive successful business outcomes across a broad set of enterprise customers through his passion for educating, training, designing, and building cloud solutions. In his spare time he likes to read, spend time with his family, and cook.
Rafael M. Koike is a Principal Solutions Architect at AWS supporting enterprise customers in the Southeast and is part of the storage and security technical field community. Rafael is passionate about building, and his experience in security, storage, networking, and application development has been instrumental in helping clients migrate to the cloud quickly and securely.
Randheer Gehlot is a Senior Manager of Customer Solutions at AWS. Randheer is passionate about AI/ML and its application within the HCLS industry. As an AWS builder, he works with large enterprises to rapidly design and implement strategic cloud migrations and build modern cloud-native solutions.