
In Postgres 9.3.5, I'm importing records from an external source where duplicates are VERY rare, but they do happen. Given a readings table with a unique compound key on (real_time_device_id, recorded_at), the following will fail once in a blue moon:

INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
  ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
  ... many more records ...
  ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);

(FWIW, the above fails 'properly' with a duplicate key violation.)

I know that handling exceptions is expensive, but as I said, duplicate entries are very rare. So to keep the code simple, I followed an example given in "Optimal way to ignore duplicate inserts?":

BEGIN
    INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
      ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
      ... many more records ...
      ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);
EXCEPTION WHEN unique_violation THEN
  -- silently ignore inserts
END;

The above produces two errors:

psql:sketches/t15.sql:11: ERROR:  syntax error at or near "INSERT"
LINE 2:         INSERT INTO readings (real_time_device_id, recorded_...
                ^
psql:sketches/t15.sql:14: ERROR:  syntax error at or near "EXCEPTION"
LINE 1: EXCEPTION WHEN unique_violation THEN
        ^

Can anyone set me straight on the correct syntax? Or is my error deeper than mere syntax? (For example, will all of the INSERTs be ignored if there is a single duplicate?)

Generally, what's a good way to do bulk inserts where very few (< 0.1%) are duplicates?


1 Answer

There are 3 possible kinds of duplicates:

  1. Duplicates within the rows of the bulk insert. That's your immediate cause for the exception.

  2. Duplicates between inserted rows and existing rows.

  3. Duplicates between inserted rows and concurrently inserted / updated rows from other transactions.

(Assuming no overlapping rows are ever deleted concurrently, which would introduce new challenges.)

Cases 1 and 2 can be fixed easily. But you need to define exactly how to resolve conflicts: which row is to be picked?

INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT DISTINCT ON (real_time_device_id, recorded_at)  -- solves 1.
       i.real_time_device_id, i.recorded_at, i.dur
FROM  (
   VALUES
     ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0)
   , ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0)
   ) i (real_time_device_id, recorded_at, dur)
LEFT   JOIN readings r1 USING (real_time_device_id, recorded_at)
WHERE  r1.real_time_device_id IS NULL;                 -- solves 2.

I am picking an arbitrary row from each set of dupes with DISTINCT ON. You may want to pick the row deterministically instead by adding ORDER BY, as in the sketch below.
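
For example, a minimal sketch that prefers the row with the largest duration per key; the tie-breaking column (dur DESC) is my assumption, so adjust it to whatever rule fits your data:

INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT DISTINCT ON (i.real_time_device_id, i.recorded_at)
       i.real_time_device_id, i.recorded_at, i.dur
FROM  (
   VALUES
     ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0)
   , ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0)
   ) i (real_time_device_id, recorded_at, dur)
LEFT   JOIN readings r1 USING (real_time_device_id, recorded_at)
WHERE  r1.real_time_device_id IS NULL
ORDER  BY i.real_time_device_id, i.recorded_at, i.dur DESC;  -- last ORDER BY column decides which dupe wins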

Case 3 is trickier, but hopefully not applicable in your case ...

If it applies after all, you need to defend against rows written by concurrent transactions; see the sketch below.
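
One blunt but simple option (a sketch, assuming you can tolerate briefly blocking other writers to readings): take a lock that conflicts with concurrent INSERT / UPDATE / DELETE before running the de-duplicating insert.

BEGIN;

-- SHARE ROW EXCLUSIVE conflicts with ROW EXCLUSIVE (taken by INSERT / UPDATE / DELETE)
-- and with itself, so no rival write can slip in between our LEFT JOIN check and
-- our INSERT. Plain SELECTs are not blocked.
LOCK TABLE readings IN SHARE ROW EXCLUSIVE MODE;

INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT DISTINCT ON (i.real_time_device_id, i.recorded_at)
       i.real_time_device_id, i.recorded_at, i.dur
FROM  (
   VALUES
     ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0)
   ) i (real_time_device_id, recorded_at, dur)
LEFT   JOIN readings r1 USING (real_time_device_id, recorded_at)
WHERE  r1.real_time_device_id IS NULL;

COMMIT;

The lock is released at COMMIT, so keep the transaction short.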

To insert ~250K rows, it would be much more efficient to COPY the lot to a temp table (or as much as you can easily store in RAM and process without spilling to disk) and proceed from there, as sketched below.
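
A sketch of that approach; the file path is a placeholder and the staging column types are guesses, so match them to your readings table:

BEGIN;

-- Session-local staging table: not WAL-logged, no indexes, dropped at COMMIT.
CREATE TEMP TABLE readings_stage (
   real_time_device_id text,
   recorded_at         timestamptz,
   duration            numeric
) ON COMMIT DROP;

-- Bulk-load the raw rows. Use psql's \copy instead if the file lives on the client.
COPY readings_stage FROM '/path/to/readings.csv' WITH (FORMAT csv);

-- De-duplicate within the batch (DISTINCT ON) and against existing rows (LEFT JOIN).
INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT DISTINCT ON (s.real_time_device_id, s.recorded_at)
       s.real_time_device_id, s.recorded_at, s.duration
FROM   readings_stage s
LEFT   JOIN readings r USING (real_time_device_id, recorded_at)
WHERE  r.real_time_device_id IS NULL;

COMMIT;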

If that's a large part of the table, it might pay to drop indexes you don't need during the load and recreate them afterwards, as in the sketch below ...
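
For example (readings_device_id_idx is a hypothetical secondary index; keep the unique index on (real_time_device_id, recorded_at), since it enforces the key and speeds up the LEFT JOIN check):

-- Before the bulk load: drop indexes that only slow the inserts down.
DROP INDEX IF EXISTS readings_device_id_idx;

-- ... run the bulk insert from the staging table here ...

-- After the load: rebuilding in one pass is cheaper than maintaining
-- the index row by row during the load.
CREATE INDEX readings_device_id_idx ON readings (real_time_device_id);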
