Piiranha-v1 released: an open source 280M small encoder model for PII detection with 98.27% token detection accuracy, supporting 6 languages and 17 types of PII, released under MIT license

The Internet Integrity Initiative team has taken an important step in data privacy by publishing Piiranha-v1, a model specifically designed to detect and protect personal information. It identifies personally identifiable information (PII) in a wide variety of textual data, a useful capability at a time when digital privacy concerns are paramount.

Piiranha-v1 is a lightweight 280M-parameter encoder model for PII detection, released under the MIT license. It supports six languages (English, Spanish, French, German, Italian, and Dutch) and reports a 98.27% PII token detection rate and 99.44% overall classification accuracy. It identifies 17 types of PII, with 100% accuracy for emails and near-perfect accuracy for passwords. Piiranha-v1 is built on the DeBERTa-v3 architecture, and its multilingual coverage makes it suitable for data protection efforts across regions.

The model’s performance on individual types of personally identifiable information is particularly notable. For example, it identifies email addresses and phone numbers with F1 scores of 1.0 and 0.99, respectively. Piiranha-v1 is also highly effective at recognizing passwords and usernames, with nearly 100% accuracy in these areas. These metrics indicate its utility in safeguarding sensitive information in digital communication and transaction environments.

One of the key advantages of Piiranha-v1 is its ability to flag personally identifiable information (PII) even when the specific data category may be misclassified. For example, the model may occasionally confuse first names with last names, but still correctly identify the information as PII. This flexibility makes Piiranha-v1 a robust tool for real-world applications where data inconsistencies often occur. Such misclassifications, while technically errors, do not compromise the model’s primary goal of identifying and protecting sensitive data.
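The difference between flagging a token as PII and assigning it the correct PII type can be illustrated with a small sketch. The label names and toy data below are hypothetical and are not taken from the model's evaluation set; `O` is assumed to mark non-PII tokens.

```python
# Toy illustration: PII detection accuracy vs. exact per-type accuracy.
# Labels and data are hypothetical; "O" is assumed to mean "not PII".

def detection_accuracy(gold, pred):
    """Share of tokens whose PII/non-PII status agrees, ignoring the PII type."""
    hits = sum((g != "O") == (p != "O") for g, p in zip(gold, pred))
    return hits / len(gold)

def classification_accuracy(gold, pred):
    """Share of tokens whose predicted label matches the gold label exactly."""
    hits = sum(g == p for g, p in zip(gold, pred))
    return hits / len(gold)

gold = ["O", "GIVENNAME", "SURNAME", "O", "EMAIL"]
pred = ["O", "SURNAME", "SURNAME", "O", "EMAIL"]  # first name mistaken for a surname

det_acc = detection_accuracy(gold, pred)
cls_acc = classification_accuracy(gold, pred)
print(det_acc, cls_acc)
```

In this toy case every PII token is still flagged, so a redaction step would hide it, even though one token carries the wrong type.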

In collaboration with partners such as Hugging Face and Akash Network, the Internet Integrity Initiative team trained Piiranha-v1 on a dataset comprising over 400,000 records of masked PII. This training has resulted in a model that boasts high accuracy and demonstrates resilience in diverse linguistic and contextual scenarios. Training was run on H100 GPUs, and the model's small size makes it suited to rapid identification of PII in real-time applications.

Despite its high accuracy, the Piiranha-v1 developers emphasize that the model should be used with caution. While it is highly reliable, the team does not take responsibility for any incorrect predictions it may produce. This notice serves as a reminder of the limitations inherent to any machine learning model, particularly one tasked with something as complex as detecting personally identifiable information across multiple languages and data formats.

The training process of Piiranha-v1 was meticulously planned to optimize its performance. The model was trained for five epochs using a batch size of 128. Mixed-precision training with Native AMP was leveraged to ensure speed and accuracy during the learning process. The result is a highly refined model capable of recognizing subtle variations in PII tokens, which is particularly important for identifying information that might be hidden or presented in non-standard formats.

The model’s evaluation results further highlight its impressive capabilities. Piiranha-v1 achieves an F1 score of 93.12% when tested on a dataset containing approximately 73,000 sentences. Its precision and recall metrics are also strong, at 93.16% and 93.08%, respectively. These numbers, while slightly lower than the overall accuracy due to the model’s multi-class classification task, still represent a high level of proficiency in PII detection.
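As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the reported figures can be verified against one another. The sketch below uses only the percentages quoted above.

```python
# Check that the reported F1 agrees with the reported precision and recall.
precision = 93.16  # percent, as reported
recall = 93.08     # percent, as reported

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```

The result rounds to 93.12, matching the reported F1 score.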

In practical terms, Piiranha-v1 can be used in a variety of applications. It is particularly suitable for organizations that handle large volumes of personal data, such as financial institutions, healthcare providers, and technology companies. By integrating Piiranha-v1 into their data processing pipelines, these companies and organizations can ensure that sensitive information is automatically flagged and redacted, reducing the risk of data breaches and ensuring compliance with privacy regulations such as GDPR and CCPA.
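A redaction step of this kind can be sketched in a few lines, assuming a detector has already produced character spans for each piece of PII. The `redact` helper, the span format, and the label names are hypothetical choices for illustration and are not part of the Piiranha-v1 release.

```python
def redact(text, spans):
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already redacted
        out.append(text[cursor:start])  # keep the text before the span
        out.append(f"[{label}]")        # substitute the placeholder
        cursor = end
    out.append(text[cursor:])           # keep the remaining tail
    return "".join(out)

text = "Contact Ana at ana@example.com today."
spans = [(8, 11, "GIVENNAME"), (15, 30, "EMAIL")]  # hypothetical detector output
redacted = redact(text, spans)
print(redacted)
```

Because the placeholder replaces the span regardless of which label it carries, a token flagged under the wrong PII type is still hidden.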

The Piiranha-v1 model is also available for deployment via the Hugging Face platform, where it can be easily integrated into existing workflows. The model is licensed under the Creative Commons BY-NC-ND 4.0 license, which permits sharing with attribution for non-commercial purposes and does not allow distribution of modified versions. This open-access approach further reinforces the Internet Integrity Initiative team’s commitment to improving data privacy on a global scale.
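Loading the model from Hugging Face might look like the following sketch, which uses the `transformers` token-classification pipeline. The model identifier is an assumption and should be checked against the model card; the `find_pii` helper and its `min_score` threshold are hypothetical additions for illustration.

```python
# Assumed model identifier; verify it on the Hugging Face model card.
MODEL_ID = "iiiorg/piiranha-v1-detect-personal-information"

def load_detector(model_id=MODEL_ID):
    """Build a token-classification pipeline (downloads weights on first call)."""
    from transformers import pipeline  # lazy import: find_pii works without it
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge sub-word tokens into entity spans
    )

def find_pii(text, detector, min_score=0.5):
    """Return (start, end, label) spans for entities scoring at least min_score."""
    return [
        (ent["start"], ent["end"], ent["entity_group"])
        for ent in detector(text)
        if ent["score"] >= min_score
    ]

# Usage (requires the transformers package and network access):
# detector = load_detector()
# print(find_pii("My email is ana@example.com", detector))
```

Any callable that returns dictionaries with `start`, `end`, `entity_group`, and `score` keys can stand in for the detector, which keeps the helper easy to exercise without downloading the model.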

In conclusion, Piiranha-v1 represents a significant advancement in the detection of personally identifiable information (PII). Its high accuracy, multi-language support, and flexible application possibilities make it a valuable tool for any organization looking to improve its data privacy efforts. The Internet Integrity Initiative team has created a model that meets the technical challenges of detecting personally identifiable information and reflects the growing importance of safeguarding personal information in today’s digital world. As data privacy concerns continue to rise, tools like Piiranha-v1 will undoubtedly play a crucial role in protecting individuals’ sensitive information from exposure and misuse.

Take a look at the Model card and Colab Notebook. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among the public.
