I am working on a script that replicates schema and data. After reading the structure from the source's system tables, I first create everything except the foreign keys, then copy the data, and only then create the FKs. That way, if a would-be orphan is inserted into a child table after I have copied its parent table (while the script is still running), I still end up with the basic schema and the data, even though the FK creation fails because of the orphan rows.
My next step is to wrap something like a try/catch around each FK creation statement and, if it fails, delete the orphan rows that caused the failure. That will be somewhat involved, because I would have to extract all the referencing/referenced columns and assemble my own DELETE statement (I don't do that today, because the PG system tables give me a ready-made REFERENCES clause that I can simply concatenate).
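For context, this is roughly the kind of catalog query I mean; a minimal sketch, with schema/table filtering omitted:

    -- Sketch: pull ready-made FK definitions from the source catalog,
    -- so they can be replayed on the target after the data copy.
    SELECT conrelid::regclass        AS child_table,
           conname                   AS constraint_name,
           pg_get_constraintdef(oid) AS fk_definition  -- e.g. FOREIGN KEY (...) REFERENCES parent(...)
    FROM   pg_constraint
    WHERE  contype = 'f';

Splitting that definition back into its column lists just to build a DELETE is the step I would rather avoid.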
But I am wondering if there is a way to do something like ALTER TABLE x ADD CONSTRAINT fk_y FOREIGN KEY ... DELETE ANY WOULD-BE ORPHANS, which would save me that step.
So if my parent table is:
parent_id | parent_name
----------+------------
        1 | Joe
        2 | Mary
and the child is:
child_id | child_name | parent_parent_id
---------+------------+-----------------
      11 | Jimmy      |                1
      21 | Joey       |                2
      31 | Jeff       |                3
'Jeff' would be deleted before referential integrity is enforced, because he and his parent (parent_id 3) were inserted into the source after the parent table had already been copied to the target. The FK can then be created against consistent data.
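In other words, what I am after is something equivalent to this two-step sequence, written here against the example tables above (a sketch; the constraint name fk_y is just the placeholder from my wish above):

    -- Step 1: remove would-be orphans, i.e. child rows whose parent
    -- does not exist on the target.
    DELETE FROM child c
    WHERE  NOT EXISTS (SELECT 1
                       FROM   parent p
                       WHERE  p.parent_id = c.parent_parent_id);

    -- Step 2: the FK can now be created against consistent data.
    ALTER TABLE child
      ADD CONSTRAINT fk_y
      FOREIGN KEY (parent_parent_id) REFERENCES parent (parent_id);

ideally without me having to assemble the DELETE myself for every FK.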
Clarifications (from comments / chat discussion):
Do you read the data in a single transaction or not?
I don't use a single transaction, as it would be too huge. I break up the retrieved data by table, and for some large tables by table segment (batches of 5,000 or 10,000 rows).
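Just to illustrate the batch sizes I mean, a segment read can be expressed as keyset-style pagination like the sketch below; this is purely illustrative (last_seen_child_id is a placeholder bound by the caller), not necessarily how my script actually slices a table:

    SELECT *
    FROM   child
    WHERE  child_id > :last_seen_child_id  -- value carried over from the previous batch
    ORDER  BY child_id
    LIMIT  10000;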
This is essentially an ETL process that makes an "evil twin" of a DB. Hope that makes sense.
On average, a DB copy in my environment takes between 20 minutes (for smaller DBs) and 1.5-2 hours (for larger DBs); some tables have millions of rows. I load the data chunk by chunk into the memory of my Python script.
What about pg_dump or pg_basebackup?
- My app does some other things that pg_dump doesn't do, such as better logging. Also, pg_dump expects the same users on both the source and the target, and there are a host of other issues. I tried it and didn't like it. I like my tool much better, and I want to work through this minor remaining issue, which only happens sporadically.