Customer data is typically stored as records in customer relationship management (CRM) systems. Data that is manually entered into such systems by one or more users over time leads to replicated, partially duplicated, or fuzzily duplicated records. This, in turn, means that there is no longer a single source of truth for customers, contacts, accounts, and so on. Without a one-to-one correspondence between a CRM record and the customer it represents, downstream business processes become increasingly complex and contrived. Current methods for detecting and deduplicating records use traditional natural language processing techniques known as entity matching. But it is possible to use the latest advances in large language models and generative artificial intelligence to greatly improve the identification and repair of duplicate records. On common benchmark data sets, I found an improvement in the accuracy of data deduplication from 30 percent using natural language processing techniques to nearly 60 percent using my proposed method.
I want to explain the technique here in the hope that others will find it useful for their own deduplication needs. It applies to any scenario where you want to identify duplicate records, not just customer data. I have also written and published a research paper on this that you can view on arXiv, if you want to learn about it in more depth:
The task of identifying duplicate records is often performed through pairwise record comparisons and is called “entity matching” (EM). Typical steps in this process would be:
- Data preparation
- Candidate generation
- Blocking
- Matching
- Grouping
Data preparation
Data preparation is the cleaning of the data and involves things like removing non-ASCII characters, normalizing capitalization, and tokenizing the text. This is an important and necessary step for the NLP matching algorithms used later in the process, which do not work well with mixed case or non-ASCII characters.
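To make this concrete, here is a minimal sketch of that kind of preparation using only the Python standard library; the exact cleaning rules are illustrative rather than part of the method itself:

```python
# A minimal sketch of data preparation: strip accents and non-ASCII characters,
# lowercase, remove punctuation, and tokenize on whitespace.
import re
import unicodedata

def prepare(text: str) -> list[str]:
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase and replace punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Tokenize on whitespace.
    return text.split()

print(prepare("Jöhn Härtley-Smith, 20 Main St."))
# ['john', 'hartley', 'smith', '20', 'main', 'st']
```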
Candidate generation
In the usual EM method, we would produce candidate records by combining all the records in the table with themselves to produce a Cartesian product. We would eliminate all combinations of a row with itself. For many NLP comparison algorithms, comparing row A to row B is equivalent to comparing row B to row A, so in those cases you can get away with keeping only one of the two pairs. But even after this, there are still many candidate records left. To reduce this number, a technique called “blocking” is often used.
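As an illustration, here is a small sketch of candidate generation; the pandas DataFrame and its contents are made up purely for the example:

```python
# Candidate generation sketch: pair every record with every other record,
# keeping only one of (A, B)/(B, A) and never pairing a row with itself.
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "name": ["John Smith", "Jon Smith", "Jane Doe"],
    "city": ["London", "London", "Paris"],
})

candidate_pairs = list(combinations(range(len(df)), 2))
print(candidate_pairs)  # [(0, 1), (0, 2), (1, 2)] -- grows as n*(n-1)/2
```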
Blocking
The idea of blocking is to remove those candidate pairs that we know cannot be duplicates of each other because they have different values for the blocking column. For example, if we were considering customer records, a possible blocking column could be something like “City”. This is because we know that even if all the other details of two records are similar enough, they cannot be the same customer if they are located in different cities. Once we have generated our candidate records, we use blocking to remove those pairs that have different values for the blocking column.
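Continuing the sketch above, blocking on the illustrative “city” column is simply a filter over the candidate pairs:

```python
# Blocking sketch: keep only candidate pairs whose records share the same city,
# since records in different cities cannot be the same customer.
blocked_pairs = [
    (i, j) for i, j in candidate_pairs
    if df.loc[i, "city"] == df.loc[j, "city"]
]
print(blocked_pairs)  # [(0, 1)] -- only the two London records remain paired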
Matching
After blocking, we examine all remaining candidate pairs and calculate similarity metrics between the corresponding fields of the two rows using traditional NLP measures. Using these metrics, we can determine whether a pair is a potential match or non-match.
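Here is a sketch of that scoring step, continuing the earlier example and using a simple character-based similarity from the standard library as a stand-in for whatever NLP measure is chosen; the 0.8 threshold is purely illustrative:

```python
# Matching sketch: score each blocked pair with a string-similarity measure
# (difflib as a stand-in for Levenshtein, Jaro-Winkler, etc.).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

matches = []
for i, j in blocked_pairs:
    score = similarity(df.loc[i, "name"], df.loc[j, "name"])
    if score >= 0.8:  # illustrative match threshold
        matches.append((i, j, score))
print(matches)  # [(0, 1, 0.947...)] -- "John Smith" vs "Jon Smith" is a match
```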
Grouping
Now that we have a list of candidate pairs that match, we can group them into clusters of duplicate records.
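One common way to form these clusters is to treat the matched pairs as edges of a graph and take the connected components; this sketch uses networkx, which is my choice for the illustration rather than something prescribed by the traditional pipeline:

```python
# Grouping sketch: matched pairs become edges, connected components become
# duplicate groups. Reuses df and matches from the sketches above.
import networkx as nx

graph = nx.Graph()
graph.add_nodes_from(range(len(df)))                 # every record is a node
graph.add_edges_from((i, j) for i, j, _ in matches)  # matched pairs are edges

clusters = list(nx.connected_components(graph))
print(clusters)  # [{0, 1}, {2}] -- rows 0 and 1 form one duplicate group
```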
There are several steps to the proposed method, but the most important thing to note is that we no longer need to perform the “Data Preparation” or “Candidate Generation” step of traditional methods. The new steps become:
- Create matching sentences
- Create embedding vectors from those matching sentences
- Grouping
Create matching sentences
First, a “Match Sentence” is created by concatenating the attributes we are interested in and separating them with spaces. As an example, let’s say we have a customer record that looks like this:

name1: John
name2: Hartley
name3: Smith
address: 20 Main Street
city: London

We would create a “Match Sentence” by concatenating the attributes name1, name2, name3, address, and city with spaces, which would give us the following:
“John Hartley Smith 20 Main Street London”
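In code, building the match sentence is just a concatenation of the chosen attributes; the record below mirrors the example above:

```python
# Build the "Match Sentence" by joining the chosen attributes with spaces.
record = {
    "name1": "John",
    "name2": "Hartley",
    "name3": "Smith",
    "address": "20 Main Street",
    "city": "London",
}
match_fields = ["name1", "name2", "name3", "address", "city"]

match_sentence = " ".join(record[field] for field in match_fields)
print(match_sentence)  # "John Hartley Smith 20 Main Street London"
```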
Create embedding vectors
Once our “Match Sentence” has been created, it is encoded into vector space using our chosen embedding model. This is achieved using “Sentence Transformers”. The output of this encoding is a floating-point vector of predefined dimensions, which depend on the embedding model used. I use the all-mpnet-base-v2 embedding model, which has a 768-dimensional vector space. This embedding vector is then added to the record. This is done for all records.
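Here is a sketch of that encoding step with the sentence-transformers package and the all-mpnet-base-v2 model; the second match sentence is invented just to give the encoder more than one input:

```python
# Encode match sentences with Sentence Transformers (all-mpnet-base-v2).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

match_sentences = [
    "John Hartley Smith 20 Main Street London",
    "J Hartley Smith 20 Main St London",
]
embeddings = model.encode(match_sentences)
print(embeddings.shape)  # (2, 768) -- one 768-dimensional vector per record
```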
Grouping
Once the embedding vectors for all records have been calculated, the next step is to create groups of similar records. For this I use the DBSCAN technique. DBSCAN works by first selecting a random record and searching for records close to it using a distance metric. There are two distance metrics that I have found to work:
- L2 norm (Euclidean) distance
- Cosine similarity
For each of these metrics, an epsilon value is chosen as a threshold. All records that are within the epsilon distance of the selected record and have the same value for the blocked column are added to its bucket. Once that bucket is filled, another random record is selected from the unvisited records and a bucket is created around it. This continues until all records have been visited.
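Here is a sketch of this clustering step using scikit-learn’s DBSCAN on the embedding vectors from the sketch above; the eps value is illustrative and needs tuning per data set, and the check on the blocked column is omitted for brevity:

```python
# Cluster the embedding vectors with DBSCAN using the cosine metric.
# min_samples=2 means a record with no close neighbour stays unclustered (-1).
from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=0.05, min_samples=2, metric="cosine").fit(embeddings)
print(clustering.labels_)  # records that share a label form one duplicate group
```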
I used this approach to identify duplicate records in customer data at work, and it produced some very good matches. To be more objective, I also ran some experiments using a benchmark data set called “Musicbrainz 200K” and produced measurable results that were an improvement over standard NLP techniques.
Clustering Visualization
I produced a nearest neighbor cluster map for the Musicbrainz 200K dataset, which I then rendered in 2D using the UMAP dimensionality reduction algorithm:
Resources
I have created several notebooks that will help you try the method yourself: