
Fiddle: https://dbfiddle.uk/-JLFuIrN

Tables

CREATE TABLE files (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name text
);

CREATE TABLE folders (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name text
);

CREATE TABLE file_folders (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  file_id bigint NOT NULL REFERENCES files,
  folder_id bigint NOT NULL REFERENCES folders,
  UNIQUE (file_id, folder_id)
);
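
To experiment outside the linked fiddle, hypothetical seed data (made-up names and row counts, not taken from the fiddle) could be generated like this:

-- hypothetical seed data for local experiments
INSERT INTO files (name)
SELECT 'file ' || g FROM generate_series(1, 20) AS g;

INSERT INTO folders (name)
SELECT 'folder ' || g FROM generate_series(1, 6) AS g;

-- link roughly 40 % of all file/folder combinations
INSERT INTO file_folders (file_id, folder_id)
SELECT f.id, d.id
FROM   files f
CROSS  JOIN folders d
WHERE  random() < 0.4;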

Query

/*
  Merges
*/

WITH targets AS (
  SELECT
    ARRAY (
      SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3
    ) AS folders
),
-- basically a setup to ensure unique target/folder pairs
-- and no targets in the merges
input_folders AS (
  SELECT
    folders.id AS folder_id,
    random_array_element(targets.folders) AS target_id
  FROM
    folders
    CROSS JOIN targets
  WHERE
    NOT (folders.id = ANY (targets.folders))
),
input_files AS (
  SELECT
    file_folders.id,
    file_folders.folder_id,
    file_folders.file_id,
    input_folders.target_id
  FROM
    input_folders
    INNER JOIN file_folders
      ON input_folders.folder_id = file_folders.folder_id
      OR input_folders.target_id = file_folders.folder_id
),
deleted_files AS (
  WITH deletions AS (
    SELECT inputs.id
    FROM input_files AS inputs
    INNER JOIN input_files AS targets
      ON NOT (inputs.folder_id = targets.target_id)
      AND inputs.file_id = targets.file_id
  )
  DELETE FROM file_folders
  WHERE id IN (SELECT id FROM deletions)
),
merged_files AS (
  WITH merges AS (
    SELECT inputs.id, inputs.folder_id, inputs.target_id
    FROM input_files AS inputs
    INNER JOIN input_files AS targets
      ON NOT (inputs.folder_id = targets.target_id)
      AND NOT (inputs.file_id = targets.file_id)
  )
  UPDATE file_folders
  SET folder_id = merges.target_id
  FROM merges
  WHERE merges.id = file_folders.id
),
deleted_folders AS (
  DELETE FROM folders
  WHERE id IN (SELECT DISTINCT folder_id FROM input_folders)
)
SELECT folders AS targets FROM targets;
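
The query relies on a random_array_element() helper that is defined in the linked fiddle; a minimal sketch of such a function (my guess at its shape, not the fiddle's exact definition) could be:

-- sketch of the random_array_element() helper used above
-- (the real definition lives in the linked fiddle)
CREATE FUNCTION random_array_element(arr anyarray)
  RETURNS anyelement
  LANGUAGE sql AS
$func$
  SELECT arr[floor(random() * cardinality(arr))::int + 1];
$func$;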

Inputs

The array-transforming setup is my attempt to replicate the application's JSON input in pure SQL. The input looks like this:

interface IQueryInput extends Array<IMergeInput> {};

interface IMergeInput {
  target: IEntityID;
  inputs: IEntityID[];
};

// postgresql bigints are treated as strings in the application
type IEntityID = string;

So the prepping query from above can be replaced with:

WITH inputs AS (
  SELECT
    input.*
  FROM
    -- the application interpolates JSON there
    json_to_recordset($inputs$$inputs$) AS input(
      target bigint,
      inputs bigint[]
    )
),
input_folders AS (
  SELECT
    inputs.target AS target_id,
    merge.folder_id
  FROM
    inputs
    CROSS JOIN
    UNNEST(inputs.inputs) AS merge(
      folder_id
    )
)
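
For illustration only (a made-up payload, not actual application data), the interpolated value could look like the literal below; the two CTEs then expand it to one (target_id, folder_id) row per merged folder:

-- hypothetical literal in place of the interpolated $inputs$...$inputs$ value
SELECT *
FROM json_to_recordset($j$[
  { "target": "1", "inputs": ["4", "5"] },
  { "target": "2", "inputs": ["6"] }
]$j$::json) AS input(
  target bigint,
  inputs bigint[]
);
-- after the UNNEST step this becomes the pairs (1, 4), (1, 5), (2, 6)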

The query must run as a batch operation, so the application provides these guarantees for the query input:

  • all target values are unique.
  • all inputs concatenated result in unique values.
  • target values do not intersect with the concatenated inputs. Therefore input_folders always ends up as unique target_id/folder_id pairs (see the check sketched below).
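
Such a check might look like this (a sketch against a stand-in VALUES list, since input_folders only exists inside the main statement); with the guarantees above it must return zero rows:

-- stand-in for the input_folders CTE; no (target_id, folder_id) group may repeat
WITH input_folders (target_id, folder_id) AS (
  VALUES (1::bigint, 4::bigint), (1, 5), (2, 6)
)
SELECT target_id, folder_id, count(*) AS dupes
FROM   input_folders
GROUP  BY 1, 2
HAVING count(*) > 1;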

The query runs as a background task, so speed and memory are of secondary importance. The main requirement is that of a typical transaction: it should either go through completely on success or roll back completely on any error.

The problem

I want to "merge" several folders into a single folder. So given a target_id and an array of folder_ids, replace all foreign references to folder_ids with target_id and remove non-target folder afterwards.
This however becomes an issue in relations table with unique constraints, since after updating the references there are duplicates.
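
To make the conflict concrete (hypothetical IDs, not actual data): if a file is already linked to both the target folder and one of the merged folders, a naive UPDATE collides with the existing row:

-- assume file_folders already contains (file_id, folder_id) = (1, 10) and (1, 20),
-- where 10 is the target folder and 20 is one of the folders being merged
UPDATE file_folders
SET    folder_id = 10   -- target_id
WHERE  folder_id = 20;  -- merged folder
-- fails: duplicate key value violates unique constraint on (file_id, folder_id)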
So I went this path:

  1. Select all relation rows related to the query, so all file_folders with target_ids and folder_ids in them.
  2. Separate them into two categories:
    • Deletes - the rows which will result in dupes when updated.
    • Merges - the rows which will not result in dupes when updated.
  3. Delete the delete candidates.
  4. Update the merge candidates.
  5. Repeat the previous 4 steps for all relation tables.
  6. Delete rows in folders with folder_ids.

However, I still stumble upon a unique key violation error.
"Merge" is in quotes because it doesn't look like what I am trying to do can be accomplished with the MERGE statement from the docs, and it requires a newer version of PostgreSQL anyway.

Biller Builder

1 Answer


Basically, it boils down to just this to create your new table file_folders:

SELECT DISTINCT ON (2, 3)
       i.id, i.file_id, COALESCE(random_array_element(t.folders), i.folder_id)
FROM   file_folders  i
LEFT   JOIN (SELECT ARRAY (SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3)) t(folders) ON i.folder_id <> ALL (t.folders)
ORDER  BY 2, 3, 1;

Using DISTINCT ON to remove duplicates. I chose to keep the row with the smallest id value.
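
As a minimal illustration of the principle (toy values, unrelated to the real tables): among rows that share the same (file_id, folder_id) after the rewrite, the first one per ORDER BY, i.e. the smallest id, survives:

-- DISTINCT ON keeps one row per (file_id, folder_id), here the one with the smallest id
SELECT DISTINCT ON (file_id, folder_id) *
FROM  (VALUES (1, 7, 3), (2, 7, 3), (5, 8, 4)) AS v(id, file_id, folder_id)
ORDER  BY file_id, folder_id, id;
-- returns the rows with id = 1 and id = 5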

If you can afford an exclusive lock, the cheapest way is to create a new table from this, like:

BEGIN;
LOCK TABLE file_folders, folders, files IN EXCLUSIVE MODE;

-- temp table for targets for multiple uses!
CREATE TEMP TABLE targets AS
SELECT ARRAY (SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3) AS folders;

CREATE TABLE file_folders_new AS
SELECT DISTINCT ON (2, 3)
       i.id, i.file_id, COALESCE(random_array_element(t.folders), i.folder_id) AS folder_id
FROM   file_folders i
LEFT   JOIN targets t ON i.folder_id <> ALL (t.folders)
ORDER  BY 2, 3, 1;

-- make sure the new table is good HERE !

DROP TABLE file_folders;
ALTER TABLE file_folders_new RENAME TO file_folders;
ALTER TABLE file_folders
  ADD CONSTRAINT file_folders_file_id_fkey FOREIGN KEY (file_id) REFERENCES files
, ADD CONSTRAINT file_folders_folder_id_fkey FOREIGN KEY (folder_id) REFERENCES folders
, ADD CONSTRAINT file_folders_uni UNIQUE (file_id, folder_id)
, ADD CONSTRAINT file_folders_pkey PRIMARY KEY (id)  -- do you need the surrogate PK?
, ALTER id ADD GENERATED ALWAYS AS IDENTITY;

-- sync sequence of new IDENTITY
SELECT setval(pg_get_serial_sequence('file_folders', 'id'), COALESCE(max(id), 1), max(id) IS NOT NULL)
FROM   file_folders;

-- recreate all other indexes etc. that your original table file_folders had HERE!

-- drop orphaned folders
DELETE FROM folders f
USING  targets t
WHERE  f.id <> ALL (t.folders);

COMMIT;

fiddle

All in a single transaction, so you don't lose data if anything goes wrong.

Alternatively (to avoid recreating constraints and indexes) you could create a temp table for file_folders, TRUNCATE the old one, and INSERT the rows from the temp table.
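
A rough sketch of that alternative, under the same exclusive lock (assuming the random_array_element() helper from the question; untested):

BEGIN;
LOCK TABLE file_folders, folders, files IN EXCLUSIVE MODE;

CREATE TEMP TABLE targets AS
SELECT ARRAY (SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3) AS folders;

-- stage the deduplicated rows
CREATE TEMP TABLE file_folders_tmp AS
SELECT DISTINCT ON (2, 3)
       i.id, i.file_id, COALESCE(random_array_element(t.folders), i.folder_id) AS folder_id
FROM   file_folders i
LEFT   JOIN targets t ON i.folder_id <> ALL (t.folders)
ORDER  BY 2, 3, 1;

-- empty the original table, keeping its constraints and indexes
TRUNCATE file_folders;

-- OVERRIDING SYSTEM VALUE re-inserts the original identity values
INSERT INTO file_folders (id, file_id, folder_id)
OVERRIDING SYSTEM VALUE
SELECT id, file_id, folder_id FROM file_folders_tmp;

-- sync the identity sequence, then drop orphaned folders as above
SELECT setval(pg_get_serial_sequence('file_folders', 'id'), COALESCE(max(id), 1), max(id) IS NOT NULL)
FROM   file_folders;

DELETE FROM folders f USING targets t WHERE f.id <> ALL (t.folders);

COMMIT;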

But when updating most rows, it's typically cheaper to recreate all indexes from scratch than to update them incrementally.


Erwin Brandstetter