-1

I have a large database of customers. There's a need to identify customers who have two or more customer IDs.

This comes mostly from people at the front-desk creating new customers instead of searching for their existing IDs or doing a bad data capture, so we end up having this sort of thing:

ID         name        surname        birthday
------------------------------------------------
14         Juan        Vazquez        1955-01-20
78         Juan        Vasquez        1980-06-12
85         Juan        Vazquez        1980-06-12

IDs 78 and 85 almost surely belong to the same person (notice same birthday but a slight difference in surname)

ID 14 definitely is another person due to a big difference in birthdays related to the similar records

Therefore we can assign equivalencies:

MainID      EquivalentID
------------------------
78          85

What solution can you recommend to identify similar records based on rules?

azathoth
  • 107

2 Answers2

1

The approach I've seen used for this is Bayesian inference. There are plenty of technical details on the web to find for how to do this but the basic idea is that you start with some estimate of the likelihood of two random records being the same person. Then you look at each attribute you want to use and see if they match. Each one that matches increases the chance that this is the same person. You decide at what threshold you decide they are the same person. Often there are people involved to review/confirm these.

As Paparazzi mentions, the Levenshtein distance is a useful tool. Soundex and other similar algorithms are often also used but I would caution that the names in your system do not all come from the same language, such 'sounds like' algorithms are of limited use.

But any test you can think of that would make you more confident that these are the same person can be used in this approach.

JimmyJames supports Canada
  • 30,578
  • 3
  • 59
  • 108
0

For name you can use Levenshtein distance. It measures the difference between words.

paparazzo
  • 1,927