Deduplicate records without IDs?

Question

I have an array of bid records (sample below in JSON). I would like to store this data in a relational database (PostgreSQL). However, the supplier data does not come with IDs, and some entries will need to be deduplicated. For instance, in the example below the supplier "John Smith and Associates" is listed under several similar names.

Should I give each supplier a unique ID (even those with matching names) and then deduplicate after the database is populated, or should I deduplicate while adding entries to the database?

[
  {
    "Solicitation No": "B2342",
    "Issuing Organization": "VT Timber Sales",
    "Award Date": "2017/06/29",
    "Supplier_details": [
      {
        "Successful Supplier(s)": "John Smith & Associates",
        "Supplier City": "Georgetown",
        "Award Total": "$22034.13"
      }
    ]
  },
  {
    "Solicitation No": "B2344",
    "Issuing Organization": "VT Timber Sales",
    "Award Date": "2017/06/30",
    "Supplier_details": [
      {
        "Successful Supplier(s)": "John Smith & Assoc",
        "Supplier City": "Georgetown",
        "Award Total": "$5034.13"
      },
      {
        "Successful Supplier(s)": "Some Logging ltd.",
        "Supplier City": "Georgetown",
        "Award Total": "$1034.13"
      }
    ]
  },
  (...)

score 2 · Answer 1 · answered Jul 26 '17 at 16:14

You should avoid populating the database with duplicates whenever possible.

If you have a multi-user system, creating a duplicate with its own ID could cause another user to refer to this additional ID by accident before your cleanup job runs. In that case it will be much more difficult to clean up the mess!

A second argument is that you already know the source can contain duplicates. If you insert them anyway, you can no longer tell later which duplicates came from the unreliable source and which came from other sources (such as manual entry in the database).

score 0 · Answer 2 · answered Aug 26 '17 at 07:43

Most databases have something like MERGE or ON DUPLICATE KEY UPDATE (in PostgreSQL, INSERT ... ON CONFLICT), and that's what you could use. Add a unique key spanning all relevant columns and the DB won't ever let you insert an exact duplicate.
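
As a rough sketch of the unique-key idea, the snippet below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL. The table name `supplier`, its columns, and the helper `get_or_create_supplier` are assumptions made up for illustration; they are not from the question.

```python
import sqlite3


def make_db():
    # In-memory SQLite stands in for PostgreSQL here.
    # Table and column names are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE supplier (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            city TEXT NOT NULL,
            UNIQUE (name, city)  -- the key spanning the relevant columns
        )
        """
    )
    return conn


def get_or_create_supplier(conn, name, city):
    """Return the supplier id, inserting a row only if it does not exist yet."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO supplier (name, city) VALUES (?, ?)",
                (name, city),
            )
    except sqlite3.IntegrityError:
        # The unique key rejected an exact duplicate; reuse the existing row.
        pass
    row = conn.execute(
        "SELECT id FROM supplier WHERE name = ? AND city = ?",
        (name, city),
    ).fetchone()
    return row[0]


conn = make_db()
first = get_or_create_supplier(conn, "Some Logging ltd.", "Georgetown")
second = get_or_create_supplier(conn, "Some Logging ltd.", "Georgetown")
count = conn.execute("SELECT COUNT(*) FROM supplier").fetchone()[0]
```

Both calls return the same ID and the table ends up with one row. Note that a unique key only catches exact matches: "John Smith & Assoc" and "John Smith & Associates" would still be stored as two suppliers unless the names are normalized before insertion.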

If you're using a decent programming language, then deduplicating in code is pretty easy and (as long as all the data fits in memory) also faster than doing it in the database, since you save the communication costs. Still, you should create the unique key on the table so that the database checks for duplicates too and keeps your data consistent. This check helps in case of a programming error or concurrent access.
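
Deduplicating in code might look roughly like the following, assuming the records have the shape shown in the question. The normalization rules (lowercasing, treating "&" as "and", dropping punctuation, expanding a few abbreviations) and the names `ABBREVIATIONS`, `normalize_name` and `dedupe_suppliers` are illustrative assumptions; real data will likely need different or additional rules.

```python
import re

# Hypothetical abbreviation table; extend it for your own data.
ABBREVIATIONS = {"assoc": "associates", "ltd": "limited"}


def normalize_name(name):
    """Reduce a supplier name to a canonical form used only for matching."""
    text = name.lower().replace("&", " and ")
    words = re.findall(r"[a-z0-9]+", text)  # drops punctuation and extra spaces
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def dedupe_suppliers(bids):
    """Assign one id per distinct (normalized name, city) pair."""
    suppliers = {}  # (normalized name, city) -> supplier id
    awards = []     # (solicitation no, supplier id, award total)
    for bid in bids:
        for detail in bid["Supplier_details"]:
            key = (
                normalize_name(detail["Successful Supplier(s)"]),
                detail["Supplier City"].strip().lower(),
            )
            # Reuse the existing id, or hand out the next one.
            supplier_id = suppliers.setdefault(key, len(suppliers) + 1)
            awards.append(
                (bid["Solicitation No"], supplier_id, detail["Award Total"])
            )
    return suppliers, awards


bids = [
    {
        "Solicitation No": "B2342",
        "Supplier_details": [
            {"Successful Supplier(s)": "John Smith & Associates",
             "Supplier City": "Georgetown", "Award Total": "$22034.13"},
        ],
    },
    {
        "Solicitation No": "B2344",
        "Supplier_details": [
            {"Successful Supplier(s)": "John Smith & Assoc",
             "Supplier City": "Georgetown", "Award Total": "$5034.13"},
            {"Successful Supplier(s)": "Some Logging ltd.",
             "Supplier City": "Georgetown", "Award Total": "$1034.13"},
        ],
    },
]

suppliers, awards = dedupe_suppliers(bids)
```

With the sample data, both spellings of John Smith's firm map to the same supplier ID. Rule-based matching like this can also merge suppliers that are genuinely different, so it is worth reviewing the resulting groups before loading them.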
