I am working on a script that replicates schema and data. After reading the structure from the source's system tables, I first create everything except the foreign keys, then copy the data, and only then create the FKs. That way, if a would-be orphan is inserted into a child table after I have copied its parent table (while the script is still running), I still end up with the basic schema and the data, even though the FK creation fails because of the orphan rows.
My next step is to wrap something like a try/catch around each FK creation statement and, if it fails, delete the orphan rows that caused the failure. That will be somewhat involved, because I would have to extract all the referencing/referenced columns and assemble my own DELETE statement (I don't do that today, because the PG system tables give me a ready-made REFERENCES clause that I can simply concatenate).
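For context, this is roughly the kind of catalog query I mean; a minimal sketch, with schema/table filtering omitted:

    -- Sketch: pull ready-made FK definitions from the source catalog,
    -- so they can be replayed on the target after the data copy.
    SELECT conrelid::regclass        AS child_table,
           conname                   AS constraint_name,
           pg_get_constraintdef(oid) AS fk_definition  -- e.g. FOREIGN KEY (...) REFERENCES parent(...)
    FROM   pg_constraint
    WHERE  contype = 'f';

Splitting that definition back into its column lists just to build a DELETE is the step I would rather avoid.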
But I am wondering if there is a way to do something like ALTER TABLE x ADD CONSTRAINT fk_y FOREIGN KEY ... DELETE ANY WOULD-BE ORPHANS, which would save me that step.
So if my parent table is:
parent_id | parent_name
----------+------------
        1 | Joe
        2 | Mary
and the child is:
child_id | child_name | parent_parent_id
---------+------------+-----------------
      11 | Jimmy      |                1
      21 | Joey       |                2
      31 | Jeff       |                3
'Jeff' would be deleted before referential integrity is enforced, because he and his parent (parent_id 3) were inserted into the source after the parent table had already been copied to the target. The FK can then be created against consistent data.
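In other words, what I am after is something equivalent to this two-step sequence, written here against the example tables above (a sketch; the constraint name fk_y is just the placeholder from my wish above):

    -- Step 1: remove would-be orphans, i.e. child rows whose parent
    -- does not exist on the target.
    DELETE FROM child c
    WHERE  NOT EXISTS (SELECT 1
                       FROM   parent p
                       WHERE  p.parent_id = c.parent_parent_id);

    -- Step 2: the FK can now be created against consistent data.
    ALTER TABLE child
      ADD CONSTRAINT fk_y
      FOREIGN KEY (parent_parent_id) REFERENCES parent (parent_id);

ideally without me having to assemble the DELETE myself for every FK.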
Clarifications (from comments / chat discussion):
Do you read the data in a single transaction or not?
I don't use a single transaction, as it would be too huge. I break up the retrieved data by table, and for some large tables by table segment (batches of 5,000 or 10,000 rows).
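Just to illustrate the batch sizes I mean, a segment read can be expressed as keyset-style pagination like the sketch below; this is purely illustrative (last_seen_child_id is a placeholder bound by the caller), not necessarily how my script actually slices a table:

    SELECT *
    FROM   child
    WHERE  child_id > :last_seen_child_id  -- value carried over from the previous batch
    ORDER  BY child_id
    LIMIT  10000;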
This is essentially an ETL process that makes an "evil twin" of a DB. Hope that makes sense.
On average, a DB copy in my environment takes between 20 minutes (for smaller DBs) and 1.5-2 hours (for larger DBs); some tables have millions of rows. I load the data chunk by chunk into the memory of my Python script.
What about pg_dump or pg_basebackup?
- My app does some other things that pg_dump doesn't do, such as better logging. Also, pg_dump expects the same users on both the source and the target, and there are a host of other issues. I tried it and didn't like it. I like my tool much better, and I want to work through this minor remaining issue, which only happens sporadically.