14

I have an inventory of products stored in Postgres. I need to be able to take a CSV file and get a list of changes—the things in the CSV file that are different to what is in the database. The CSV file has about 1.6 million rows.

The naive approach is to simply take each row, retrieve that product from the database using the key field, make the comparison, emit the changes (including updating the database), then move on to the next row. However, making that many round trips causes the whole process to take a long time (upwards of two minutes). I've tried locally caching the inventory in an off-heap map (using MapDB), which improved performance a lot, since I only needed to hit the database to write changed data, but I couldn't figure out a way to make that scale. There will be many inventories for different customers. Perhaps some kind of sharding approach would be needed, but then I have to deal with nodes going on- and offline. Maybe Akka Cluster could help here too.

Are there some good approaches that I'm overlooking?

Isvara

5 Answers

39

Since your goal is to produce a list of changes, not to change the stored records, the way to go is simply to export the database to text and wrangle it once to get it into exactly the same CSV format. Provided you have a workable sorting criterion, the standard diff tool will have no problem processing millions of lines.
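
For example, a minimal sketch of the export side, assuming Postgres and placeholder column names (sku, qty, price):

    -- Export the current inventory in a deterministic order.
    -- COPY ... TO writes a file on the database server; from a client,
    -- psql's \copy variant does the same thing with a local file.
    COPY (SELECT sku, qty, price FROM inventory ORDER BY sku)
        TO '/tmp/inventory_db.csv' WITH (FORMAT csv);

Sort the incoming CSV on the same key (e.g. sort -t, -k1,1 incoming.csv) and a plain diff of the two files is your change list.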

Kilian Foth
22

Since the round trips seem to be the issue, you could:

  • either opt for a local solution, with the scaling issue you mentioned (you could still try to split the task across several local nodes, each responsible for a subrange of the key space);
  • or opt for the DB solution: bulk-upload your CSV into a temporary table and let the DB server work very efficiently on (indexed) tables. The benefit of this approach is that you'd reach the scalability of the DB itself, and you could fine-tune it for whatever distribution scheme is already in place if it's a distributed database. A sketch follows below.
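
A minimal sketch of that second option, assuming Postgres and illustrative column names (sku, qty, price):

    -- Stage the CSV next to the real table (use psql's \copy if the file is on the client).
    CREATE TEMP TABLE inventory_staging (LIKE inventory);
    COPY inventory_staging (sku, qty, price) FROM '/tmp/incoming.csv' WITH (FORMAT csv);

    -- New or changed rows: present in the CSV but absent or different in the inventory.
    -- (Deletions would be found by joining in the other direction.)
    SELECT s.*
    FROM inventory_staging s
    LEFT JOIN inventory i USING (sku)
    WHERE i.sku IS NULL
       OR (i.qty, i.price) IS DISTINCT FROM (s.qty, s.price);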

Some more thoughts:

  • if you have many columns/fields to compare, you may consider adding a hash code to each row, in the CSV as well as in the DB (updated on every row change). The hash would be calculated over the fields that are relevant for the comparison. Finding the diff is then reduced to finding the new rows and the existing rows whose hash differs (see the sketch after this list).
  • ultimately, it would be more efficient to handle the problem at the source, i.e. intercepting the events that cause the CSV to change, or using some kind of timestamp of the last change. But OK, this is not always possible.
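
A sketch of the hash idea, again with made-up column names; md5 over a delimited concatenation of the relevant fields is usually enough for change detection:

    -- Keep a per-row hash over the fields that matter for the comparison.
    -- A stored generated column keeps it current on every change (Postgres 12+);
    -- the expression must be immutable, hence the explicit casts and coalesce.
    ALTER TABLE inventory
        ADD COLUMN row_hash text GENERATED ALWAYS AS (
            md5(coalesce(qty::text, '') || '|' || coalesce(price::text, ''))
        ) STORED;

    -- Finding the diff then only touches the key and one short column.
    SELECT s.sku
    FROM inventory_staging s
    LEFT JOIN inventory i USING (sku)
    WHERE i.sku IS NULL
       OR i.row_hash <> md5(coalesce(s.qty::text, '') || '|' || coalesce(s.price::text, ''));
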
Christophe
9

There are already some good answers, but here's another possibility: if you sort the CSV file by key and add a matching ORDER BY to the SQL query, you can compare the rows from the query with the rows from the CSV file without fully loading either into memory. (If the keys match, compare the rows. If they don't, the ordering tells you whether the row is an insertion or a removal. Then advance the appropriate cursor.)

(Not sure what language or library you want to use, but plain JDBC allows you to fetch the next rows without loading the whole result into memory.)
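
On the Postgres side, a server-side cursor gives you that kind of incremental read over the sorted result; the Postgres JDBC driver does something similar under the hood when you set a fetch size with autocommit off. A sketch with illustrative column names:

    -- Stream the sorted inventory a chunk at a time instead of materialising it all.
    BEGIN;
    DECLARE inv_cur NO SCROLL CURSOR FOR
        SELECT sku, qty, price FROM inventory ORDER BY sku;
    FETCH FORWARD 1000 FROM inv_cur;  -- repeat until no rows come back,
                                      -- merging each chunk against the sorted CSV
    CLOSE inv_cur;
    COMMIT;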

3

For me, the simplest and most straightforward way would be to load the files into the database and write some SQL to compare whatever you want.
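
If a full import is more than you want, one way to "put the file in the database" without loading it is Postgres's file_fdw extension, which lets you query the CSV in place. A sketch with hypothetical column names (the file must be readable by the database server process):

    CREATE EXTENSION IF NOT EXISTS file_fdw;
    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE incoming_inventory (sku text, qty int, price numeric)
        SERVER csv_files
        OPTIONS (filename '/tmp/incoming.csv', format 'csv');
    -- Now compare it against the inventory table with an ordinary join,
    -- e.g. the LEFT JOIN ... IS DISTINCT FROM query from the earlier answer.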

Daljeet
2

To make this scale both with the size of the CSV and with the number of products in the database, you could:

  1. Read e.g. 10,000 rows from the CSV into memory and collect their keys.
  2. Fetch the 10,000 corresponding database rows. Depending on the database, it can make sense to create a temporary table with the keys and JOIN against it to fetch the rows. If the CSV file is sorted, you may be able to select keys between the batch's minimum and maximum values.
  3. Collect the required updates into an array in RAM and write them to the database in a single query, typically INSERT INTO table VALUES (row1),(row2),...,(rowN) ON DUPLICATE KEY UPDATE x=VALUES(x) (a Postgres-flavoured version is sketched below).

This will reduce the time overhead of waiting for the database to respond, while still limiting the length of individual queries and the RAM required.
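
The ON DUPLICATE KEY UPDATE form above is MySQL syntax; since the inventory lives in Postgres, the equivalent batched upsert is INSERT ... ON CONFLICT. A sketch with illustrative columns, assuming sku is the primary key:

    -- One round trip writes a whole batch of new/changed rows.
    INSERT INTO inventory (sku, qty, price) VALUES
        ('A-001', 3, 9.99),
        ('A-002', 7, 4.50)   -- ... the rest of the in-memory batch
    ON CONFLICT (sku) DO UPDATE
        SET qty = EXCLUDED.qty, price = EXCLUDED.price
        WHERE (inventory.qty, inventory.price)
              IS DISTINCT FROM (EXCLUDED.qty, EXCLUDED.price);

The WHERE clause on the DO UPDATE branch keeps unchanged rows from being rewritten.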

jpa