Customer data is typically stored as records in customer relationship management (CRM) systems. Data that is manually entered into such systems by one or more users over time leads to replicated, partially duplicated, or fuzzily duplicated records. This, in turn, means that there is no longer a single source of truth for customers, contacts, accounts, and so on. Without a one-to-one correspondence between a CRM record and the customer it represents, downstream business processes become increasingly complex and contrived. Current methods for detecting and deduplicating records use traditional natural language processing techniques known as entity matching. But it is possible to use the latest advances in large language models and generative artificial intelligence to greatly improve the identification and repair of duplicate records. On common benchmark data sets, I found an improvement in the accuracy of data deduplication from 30 percent using natural language processing techniques to nearly 60 percent using my proposed method.
I want to explain the technique here in the hope that others will find it useful for their own deduplication needs. It applies to any scenario where you want to identify duplicate records, not just customer data. I have also written and published a research paper on this that you can view on arXiv, if you want to learn about it in more depth:
The task of identifying duplicate records is often performed through pairwise record comparisons and is called “entity matching” (EM). Typical steps in this process would be:
- Data preparation
- Candidate generation
- Blocking
- Matching
- Grouping
Data preparation
Data preparation is the cleaning of the data and involves things like removing non-ASCII characters, normalizing capitalization, and tokenizing the text. This is an important and necessary step for the NLP matching algorithms used later in the process, which do not work well with mixed case or non-ASCII characters.
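To make this concrete, here is a minimal sketch of that kind of preparation using only the Python standard library; the exact cleaning rules are illustrative rather than part of the method itself:

```python
# A minimal sketch of data preparation: strip accents and non-ASCII characters,
# lowercase, remove punctuation, and tokenize on whitespace.
import re
import unicodedata

def prepare(text: str) -> list[str]:
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase and replace punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Tokenize on whitespace.
    return text.split()

print(prepare("Jöhn Härtley-Smith, 20 Main St."))
# ['john', 'hartley', 'smith', '20', 'main', 'st']
```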
Candidate generation
In the usual EM method, we would produce candidate records by combining all the records in the table with themselves to produce a Cartesian product. We would eliminate all combinations of a row with itself. For many NLP comparison algorithms, comparing row A to row B is equivalent to comparing row B to row A, so in those cases you can get away with keeping only one of the two pairs. But even after this, there are still many candidate records left. To reduce this number, a technique called “blocking” is often used.
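As an illustration, here is a small sketch of candidate generation; the pandas DataFrame and its contents are made up purely for the example:

```python
# Candidate generation sketch: pair every record with every other record,
# keeping only one of (A, B)/(B, A) and never pairing a row with itself.
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "name": ["John Smith", "Jon Smith", "Jane Doe"],
    "city": ["London", "London", "Paris"],
})

candidate_pairs = list(combinations(range(len(df)), 2))
print(candidate_pairs)  # [(0, 1), (0, 2), (1, 2)] -- grows as n*(n-1)/2
```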
Blocking
The idea of blocking is to remove those candidate pairs that we know cannot be duplicates of each other because they have different values for the blocking column. For example, if we were considering customer records, a possible blocking column could be something like “City”. This is because we know that even if all the other details of two records are similar enough, they cannot be the same customer if they are located in different cities. Once we have generated our candidate records, we use blocking to remove those pairs that have different values for the blocking column.
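Continuing the sketch above, blocking on the illustrative “city” column is simply a filter over the candidate pairs:

```python
# Blocking sketch: keep only candidate pairs whose records share the same city,
# since records in different cities cannot be the same customer.
blocked_pairs = [
    (i, j) for i, j in candidate_pairs
    if df.loc[i, "city"] == df.loc[j, "city"]
]
print(blocked_pairs)  # [(0, 1)] -- only the two London records remain paired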
Matching
After blocking, we examine all remaining candidate pairs and calculate similarity metrics between the corresponding fields of the two rows using traditional NLP measures. Using these metrics, we can determine whether a pair is a potential match or non-match.
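Here is a sketch of that scoring step, continuing the earlier example and using a simple character-based similarity from the standard library as a stand-in for whatever NLP measure is chosen; the 0.8 threshold is purely illustrative:

```python
# Matching sketch: score each blocked pair with a string-similarity measure
# (difflib as a stand-in for Levenshtein, Jaro-Winkler, etc.).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

matches = []
for i, j in blocked_pairs:
    score = similarity(df.loc[i, "name"], df.loc[j, "name"])
    if score >= 0.8:  # illustrative match threshold
        matches.append((i, j, score))
print(matches)  # [(0, 1, 0.947...)] -- "John Smith" vs "Jon Smith" is a match
```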
Grouping
Now that we have a list of candidate pairs that match, we can group them into clusters of duplicate records.
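One common way to form these clusters is to treat the matched pairs as edges of a graph and take the connected components; this sketch uses networkx, which is my choice for the illustration rather than something prescribed by the traditional pipeline:

```python
# Grouping sketch: matched pairs become edges, connected components become
# duplicate groups. Reuses df and matches from the sketches above.
import networkx as nx

graph = nx.Graph()
graph.add_nodes_from(range(len(df)))                 # every record is a node
graph.add_edges_from((i, j) for i, j, _ in matches)  # matched pairs are edges

clusters = list(nx.connected_components(graph))
print(clusters)  # [{0, 1}, {2}] -- rows 0 and 1 form one duplicate group
```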
There are several steps to the proposed method, but the most important thing to note is that we no longer need to perform the “Data Preparation” or “Candidate Generation” step of traditional methods. The new steps become:
- Create matching sentences
- Create embedding vectors from those matching sentences
- Grouping
Create matching sentences
First, a “Match Sentence” is created by concatenating the attributes we are interested in and separating them with spaces. As an example, let’s say we have a customer record that looks like this:

name1: John
name2: Hartley
name3: Smith
address: 20 Main Street
city: London

We would create a “Match Sentence” by concatenating the attributes name1, name2, name3, address, and city with spaces, which would give us the following:
“John Hartley Smith 20 Main Street London”
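In code, building the match sentence is just a concatenation of the chosen attributes; the record below mirrors the example above:

```python
# Build the "Match Sentence" by joining the chosen attributes with spaces.
record = {
    "name1": "John",
    "name2": "Hartley",
    "name3": "Smith",
    "address": "20 Main Street",
    "city": "London",
}
match_fields = ["name1", "name2", "name3", "address", "city"]

match_sentence = " ".join(record[field] for field in match_fields)
print(match_sentence)  # "John Hartley Smith 20 Main Street London"
```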
Create embedding vectors
Once our “Match Sentence” has been created, it is encoded into vector space using our chosen embedding model. This is achieved using “Sentence Transformers”. The output of this encoding is a floating-point vector of predefined dimensions, which depend on the embedding model used. I use the all-mpnet-base-v2 embedding model, which has a 768-dimensional vector space. This embedding vector is then added to the record. This is done for all records.
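Here is a sketch of that encoding step with the sentence-transformers package and the all-mpnet-base-v2 model; the second match sentence is invented just to give the encoder more than one input:

```python
# Encode match sentences with Sentence Transformers (all-mpnet-base-v2).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

match_sentences = [
    "John Hartley Smith 20 Main Street London",
    "J Hartley Smith 20 Main St London",
]
embeddings = model.encode(match_sentences)
print(embeddings.shape)  # (2, 768) -- one 768-dimensional vector per record
```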
Grouping
Once the embedding vectors for all records have been calculated, the next step is to create groups of similar records. For this I use the DBSCAN technique. DBSCAN works by first selecting a random record and searching for records close to it using a distance metric. There are two distance metrics that I have found to work:
- L2 norm (Euclidean) distance
- Cosine similarity
For each of these metrics, an epsilon value is chosen as a threshold. All records that are within the epsilon distance of the selected record and have the same value for the blocked column are added to its bucket. Once that bucket is filled, another random record is selected from the unvisited records and a bucket is created around it. This continues until all records have been visited.
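Here is a sketch of this clustering step using scikit-learn’s DBSCAN on the embedding vectors from the sketch above; the eps value is illustrative and needs tuning per data set, and the check on the blocked column is omitted for brevity:

```python
# Cluster the embedding vectors with DBSCAN using the cosine metric.
# min_samples=2 means a record with no close neighbour stays unclustered (-1).
from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=0.05, min_samples=2, metric="cosine").fit(embeddings)
print(clustering.labels_)  # records that share a label form one duplicate group
```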
I used this approach to identify duplicate records in customer data at work, and it produced some very good matches. To be more objective, I also ran some experiments using a benchmark data set called “Musicbrainz 200K” and produced measurable results that were an improvement over standard NLP techniques.
Clustering Visualization
I produced a nearest neighbor cluster map for the Musicbrainz 200K dataset, which I then rendered in 2D using the UMAP dimensionality reduction algorithm:
Resources
I have created several notebooks that will help you try the method yourself: