Today, personally identifiable information (PII) is everywhere: in emails, SMS messages, videos, PDF files, and more. PII refers to any data or information that can be used to identify a specific individual. It is sensitive in nature and includes personal data such as names, contact information, identification numbers, financial information, medical information, biometric data, and dates of birth.
Finding and redacting PII is essential to safeguard privacy, ensure data security, comply with laws and regulations, and maintain trust with customers and stakeholders. It is a critical component of modern data management and cybersecurity practices. But finding PII among the thicket of electronic data can present challenges for an organization. These challenges arise from the large volume and variety of data, data fragmentation, encryption, data sharing, dynamic content, false positives and negatives, the need for contextual understanding, legal complexities, resource limitations, evolving data, user-generated content, and adaptive threats. Failing to accurately detect and redact PII can have serious consequences for organizations, including legal sanctions, lawsuits, reputational damage, data breach costs, regulatory investigations, operational disruptions, and erosion of trust.
In the legal system, discovery is the process that governs the right to obtain, and the obligation to produce, non-privileged matter relevant to the claims or defenses of any party in litigation. Electronic discovery, also known as eDiscovery, is the process of identifying, collecting, and producing electronically stored information (ESI) in response to a production request in a lawsuit or investigation. Organizations that handle eDiscovery for litigation or respond to subpoenas are probably concerned about accidentally sharing PII. Many organizations, including government agencies, school districts, and legal professionals, face the challenge of accurately detecting and redacting PII at scale. For government bodies in particular, redacting PII under the Freedom of Information Act and the Digital Services Act is crucial to protecting individual privacy, ensuring compliance with data protection laws, preventing identity theft, and maintaining trust and transparency in government and digital services. It strikes a balance between transparency and privacy while mitigating legal and security risks.
Organizations can search for PII using methods such as keyword searches, pattern matching, data loss prevention tools, machine learning (ML), metadata analysis, data classification software, optical character recognition (OCR), document fingerprinting, and encryption.
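As a small illustration of the pattern-matching approach mentioned above, the following sketch scans text with two regular expressions. The patterns, the `PII_PATTERNS` table, and the `find_pii_candidates` function are hypothetical and deliberately simplistic; they are not part of Logikcull or Amazon Comprehend, and real PII formats vary far more than this.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
# Real-world PII has many more formats and needs context to avoid false hits.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii_candidates(text):
    """Return (label, start, end, matched_text) tuples for every pattern hit."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    # Sort by start position so hits read in document order.
    return sorted(hits, key=lambda hit: hit[1])
```

Pattern matching of this kind is cheap, but it cannot tell a Social Security number from any other number with the same shape, which is one source of the false positives and negatives noted earlier.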
Now part of Reveal’s AI-powered eDiscovery platform, Logikcull is a self-service solution that allows legal professionals to process, review, tag, and produce electronic documents as part of a lawsuit or investigation. This offering helps attorneys uncover valuable information related to the matter at hand while reducing costs, expediting resolutions, and mitigating risks.
In this post, Reveal experts show how they used Amazon Comprehend in their document processing pipeline to detect and redact individual pieces of PII. Amazon Comprehend is a fully managed and continuously trained natural language processing (NLP) service that can extract information about the content of a document or text. You can use Amazon Comprehend ML capabilities to detect and redact PII in customer emails, support tickets, product reviews, social media, and more.
Solution Overview
The engineering team’s overall goal is to detect and redact PII from millions of legal documents for their clients. Using Reveal’s Logikcull solution, the engineering team implemented two processes: first-pass PII detection, and second-pass PII detection and redaction. This two-pass solution was made possible by the ContainsPiiEntities and DetectPiiEntities APIs.
First-pass PII detection
The goal of first-pass PII detection is to find documents that may contain PII.
- Users upload the files on which they want to perform PII detection and redaction to a project folder through the Logikcull public website. These files can be Office documents, .pdf files, emails, or a .zip file containing any of the supported file types.
- Logikcull stores these project folders securely within an Amazon Simple Storage Service (Amazon S3) bucket. The files are then passed through Logikcull’s massively parallel processing pipeline hosted on Amazon Elastic Compute Cloud (Amazon EC2), which processes the files, extracts the metadata, and generates artifacts in text format for data review. Logikcull’s processing pipeline supports text extraction for a wide variety of forms and files, including audio and video files.
- Once the files are available in text format, Logikcull passes the input text, along with the language code (English), to Amazon Comprehend by calling the ContainsPiiEntities API. The processing pipeline servers hosted on Amazon EC2 make the ContainsPiiEntities API call, passing the text and language code as request parameters. The call analyzes the input text for the presence of PII and returns labels of the identified PII entity types, such as name, address, bank account number, or phone number. The API response also includes a confidence score that indicates how confident Amazon Comprehend is in the accuracy of the detection. The confidence score has a value between 0 and 1, where 1 means 100 percent confidence. Logikcull uses this confidence score to assign the detected-PII tag to documents, and only assigns the tag to documents that have a confidence score greater than 0.75.
- Documents tagged as containing PII are fed into Logikcull’s search index cluster so that users can quickly identify documents containing PII entities.
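The first-pass steps above can be sketched roughly as follows. This is a minimal illustration, not Logikcull’s actual code: the function names and the returned dictionary shape are hypothetical, and `client` is assumed to behave like a boto3 Amazon Comprehend client, whose `contains_pii_entities` call returns a `Labels` list of `Name` and `Score` pairs. The 0.75 threshold is the one stated in the post.

```python
PII_TAG_THRESHOLD = 0.75  # threshold described in the post


def labels_above_threshold(response, threshold=PII_TAG_THRESHOLD):
    """Return PII label names whose score is strictly greater than threshold."""
    return [
        label["Name"]
        for label in response.get("Labels", [])
        if label["Score"] > threshold
    ]


def first_pass_tag(client, text, language_code="en"):
    """Call ContainsPiiEntities and decide whether to tag the document.

    `client` is assumed to expose the same method as
    boto3.client("comprehend"); the return shape is hypothetical.
    """
    response = client.contains_pii_entities(Text=text, LanguageCode=language_code)
    labels = labels_above_threshold(response)
    return {"contains_pii": bool(labels), "labels": labels}
```

With AWS credentials configured, a real client could be created with `boto3.client("comprehend")` and passed as `client`. Keeping the threshold logic in its own function lets it be checked without calling the service.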
Second-pass PII detection and redaction
The first-pass PII detection process narrows the scope of the data set by identifying which documents contain PII. This speeds up the PII detection process and also reduces the overall cost. The goal of second-pass PII detection is to identify individual instances of PII and redact them from the documents tagged in the first pass.
- Users search through the Logikcull website for documents that contain PII using Logikcull’s advanced search filters feature.
- The request is handled by Logikcull application servers hosted on Amazon EC2 and the servers communicate with the search index cluster to find the documents.
- Logikcull application servers identify individual instances of PII by making the DetectPiiEntities API call, passing the text and language of the input documents. The DetectPiiEntities API action inspects the input text for entities containing PII. For each entity, the response provides the entity type, where the entity’s text begins and ends, and the level of confidence Amazon Comprehend has in its detection.
- Users then select the specific entities they want to redact using Logikcull’s web interface. The application server sends these requests to the Logikcull processing pipeline. The following screenshot of a PDF uploaded to the Logikcull application shows different PII entities, such as name, address, phone number, and email address, highlighted.
- PII redaction is securely applied within the Logikcull processing pipeline using custom business logic. As the screenshot shows, users can select specific PII entity types, or all PII entity types, that they want to redact and then, with the click of a button, redact all of the selected PII.
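The second pass can be sketched in the same spirit. The post does not describe Logikcull’s redaction logic beyond calling it custom business logic, so the masking below is purely an assumption for illustration, as are the function names. The sketch relies on the `Entities` list returned by the boto3 `detect_pii_entities` call, where each entity carries `Type`, `Score`, `BeginOffset`, and `EndOffset`, and it treats the offsets as character positions in the input string.

```python
def select_entities(response, wanted_types=None):
    """Keep only entities whose Type is in wanted_types (None keeps all)."""
    return [
        entity
        for entity in response.get("Entities", [])
        if wanted_types is None or entity["Type"] in wanted_types
    ]


def redact(text, entities, mask_char="*"):
    """Overwrite each entity's character span with mask_char.

    Masking in place keeps the text length unchanged, so the offsets of
    the remaining entities stay valid and overlapping spans are harmless.
    """
    chars = list(text)
    for entity in entities:
        for index in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[index] = mask_char
    return "".join(chars)


def second_pass_redact(client, text, wanted_types=None, language_code="en"):
    """Call DetectPiiEntities, then mask the selected entity types.

    `client` is assumed to behave like boto3.client("comprehend").
    """
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return redact(text, select_entities(response, wanted_types))
```

Passing `wanted_types={"NAME"}` mirrors a user choosing to redact only names, while leaving it as `None` corresponds to redacting every detected entity type.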
Results
Logikcull, a Reveal technology, currently processes more than 20 million documents each week. It was able to reduce the scope of detection using the ContainsPiiEntities API and display individual instances of PII entities to its clients using the DetectPiiEntities API.
“With Amazon Comprehend, Logikcull has been able to quickly deploy powerful NLP capabilities in a fraction of the time a custom solution would have required.”
– Steve Newhouse, VP of Product at Logikcull.
Conclusion
Amazon Comprehend enables Reveal’s Logikcull technology to perform large-scale PII detection at a relatively low cost. The ContainsPiiEntities API is used to perform an initial scan of millions of documents. The DetectPiiEntities API is used to run a detailed analysis of thousands of documents and identify individual pieces of PII in those documents.
Take a look at all the features of Amazon Comprehend. Try the features and send us feedback through the AWS forum for Amazon Comprehend or through your regular AWS support contacts.
About the authors
Aman Tiwari is a general solutions architect working with global commercial sales at AWS. He works with clients in the digital native business segment and helps them design innovative, resilient, and profitable solutions using AWS services. He has a master’s degree in Telecommunications Networks from Northeastern University. Outside of work, he enjoys playing lawn tennis and reading books.
Jeff Newburn is a Senior Software Engineering Manager leading the Data Engineering team at Logikcull – A Reveal technology. He oversees the company’s data initiatives, including data warehouses, visualizations, analytics, and machine learning. With experience spanning development and management in areas ranging from ridesharing to data systems, he enjoys leading teams of brilliant engineers and interesting products.
Søren Rubio Daugaard is a Staff Engineer on the Data Engineering team at Logikcull – A Reveal technology. He implements highly scalable AI and machine learning solutions into the Logikcull product, allowing customers to do their jobs more efficiently and with greater accuracy. His experience spans data pipelines, web-based systems, and machine learning systems.
Kevin Lufkin is a Senior Software Engineer on the Search Engineering team at Logikcull – A Reveal technology, where he focuses on the development of customer-facing and search-related features. His extensive UI/UX experience is complemented by experience in full-stack web development, with a strong focus on bringing product visions to life.