
Fiddle: https://dbfiddle.uk/-JLFuIrN

Tables

CREATE TABLE files (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name text
);

CREATE TABLE folders (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name text
);

CREATE TABLE file_folders (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  file_id bigint NOT NULL REFERENCES files,
  folder_id bigint NOT NULL REFERENCES folders,
  UNIQUE (file_id, folder_id)
);
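
To experiment outside the linked fiddle, hypothetical seed data (made-up names and row counts, not taken from the fiddle) could be generated like this:

-- hypothetical seed data for local experiments
INSERT INTO files (name)
SELECT 'file ' || g FROM generate_series(1, 20) AS g;

INSERT INTO folders (name)
SELECT 'folder ' || g FROM generate_series(1, 6) AS g;

-- link roughly 40 % of all file/folder combinations
INSERT INTO file_folders (file_id, folder_id)
SELECT f.id, d.id
FROM   files f
CROSS  JOIN folders d
WHERE  random() < 0.4;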

Query

/*
  Merges
*/

WITH targets AS (
  SELECT
    ARRAY (
      SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3
    ) AS folders
),
-- basically a setup to ensure unique target/folder pairs
-- and no targets in the merges
input_folders AS (
  SELECT
    folders.id AS folder_id,
    random_array_element(targets.folders) AS target_id
  FROM
    folders
    CROSS JOIN targets
  WHERE
    NOT (folders.id = ANY (targets.folders))
),
input_files AS (
  SELECT
    file_folders.id,
    file_folders.folder_id,
    file_folders.file_id,
    input_folders.target_id
  FROM
    input_folders
    INNER JOIN file_folders
      ON input_folders.folder_id = file_folders.folder_id
      OR input_folders.target_id = file_folders.folder_id
),
deleted_files AS (
  WITH deletions AS (
    SELECT inputs.id
    FROM input_files AS inputs
    INNER JOIN input_files AS targets
      ON NOT (inputs.folder_id = targets.target_id)
      AND inputs.file_id = targets.file_id
  )
  DELETE FROM file_folders
  WHERE id IN (SELECT id FROM deletions)
),
merged_files AS (
  WITH merges AS (
    SELECT inputs.id, inputs.folder_id, inputs.target_id
    FROM input_files AS inputs
    INNER JOIN input_files AS targets
      ON NOT (inputs.folder_id = targets.target_id)
      AND NOT (inputs.file_id = targets.file_id)
  )
  UPDATE file_folders
  SET folder_id = merges.target_id
  FROM merges
  WHERE merges.id = file_folders.id
),
deleted_folders AS (
  DELETE FROM folders
  WHERE id IN (SELECT DISTINCT folder_id FROM input_folders)
)
SELECT folders AS targets FROM targets;
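
The query relies on a random_array_element() helper that is defined in the linked fiddle; a minimal sketch of such a function (my guess at its shape, not the fiddle's exact definition) could be:

-- sketch of the random_array_element() helper used above
-- (the real definition lives in the linked fiddle)
CREATE FUNCTION random_array_element(arr anyarray)
  RETURNS anyelement
  LANGUAGE sql AS
$func$
  SELECT arr[floor(random() * cardinality(arr))::int + 1];
$func$;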

Inputs

The array-transforming setup is my attempt to replicate the application's JSON input in pure SQL. The input looks like this:

interface IQueryInput extends Array<IMergeInput> {};

interface IMergeInput {
  target: IEntityID;
  inputs: IEntityID[];
};

// postgresql bigints are treated as strings in the application
type IEntityID = string;

So the prepping query from above can be replaced with:

WITH inputs AS (
  SELECT
    input.*
  FROM
    -- the application interpolates JSON there
    json_to_recordset($inputs$$inputs$) AS input(
      target bigint,
      inputs bigint[]
    )
),
input_folders AS (
  SELECT
    inputs.target AS target_id,
    merge.folder_id
  FROM
    inputs
    CROSS JOIN
    UNNEST(inputs.inputs) AS merge(
      folder_id
    )
)
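
For illustration only (a made-up payload, not actual application data), the interpolated value could look like the literal below; the two CTEs then expand it to one (target_id, folder_id) row per merged folder:

-- hypothetical literal in place of the interpolated $inputs$...$inputs$ value
SELECT *
FROM json_to_recordset($j$[
  { "target": "1", "inputs": ["4", "5"] },
  { "target": "2", "inputs": ["6"] }
]$j$::json) AS input(
  target bigint,
  inputs bigint[]
);
-- after the UNNEST step this becomes the pairs (1, 4), (1, 5), (2, 6)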

The query must run as a batch operation, so the application provides these guarantees for the query input:

  • all target values are unique.
  • all inputs concatenated result in unique values.
  • target values do not intersect with the concatenated inputs. Therefore input_folders always ends up as unique target_id/folder_id pairs (see the check sketched below).
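
Such a check might look like this (a sketch against a stand-in VALUES list, since input_folders only exists inside the main statement); with the guarantees above it must return zero rows:

-- stand-in for the input_folders CTE; no (target_id, folder_id) group may repeat
WITH input_folders (target_id, folder_id) AS (
  VALUES (1::bigint, 4::bigint), (1, 5), (2, 6)
)
SELECT target_id, folder_id, count(*) AS dupes
FROM   input_folders
GROUP  BY 1, 2
HAVING count(*) > 1;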

The query runs as a background task, so speed and memory are of secondary importance. The main requirement is that of a typical transaction: it should either go through completely on success or roll back completely on any error.

The problem

I want to "merge" several folders into a single folder. So given a target_id and an array of folder_ids, replace all foreign references to folder_ids with target_id and remove non-target folder afterwards.
This however becomes an issue in relations table with unique constraints, since after updating the references there are duplicates.
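
To make the conflict concrete (hypothetical IDs, not actual data): if a file is already linked to both the target folder and one of the merged folders, a naive UPDATE collides with the existing row:

-- assume file_folders already contains (file_id, folder_id) = (1, 10) and (1, 20),
-- where 10 is the target folder and 20 is one of the folders being merged
UPDATE file_folders
SET    folder_id = 10   -- target_id
WHERE  folder_id = 20;  -- merged folder
-- fails: duplicate key value violates unique constraint on (file_id, folder_id)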
So I went this path:

  1. Select all relation rows related to the query, so all file_folders with target_ids and folder_ids in them.
  2. Separate them into two categories:
    • Deletes - the rows which will result in dupes when updated.
    • Merges - the rows which will not result in dupes when updated.
  3. Delete the delete candidates.
  4. Update the merge candidates.
  5. Repeat the previous 4 steps for all relation tables.
  6. Delete rows in folders with folder_ids.

However, I still stumble upon a unique key violation error.
"Merge" is in quotes because it doesn't look like what I am trying to do can be accomplished with the MERGE statement from the docs, and it requires a newer version of PostgreSQL anyway.

Biller Builder

1 Answer


Basically, it boils down to just this to create your new table file_folders:

SELECT DISTINCT ON (2, 3)
       i.id, i.file_id, COALESCE(random_array_element(t.folders), i.folder_id)
FROM   file_folders  i
LEFT   JOIN (SELECT ARRAY (SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3)) t(folders) ON i.folder_id <> ALL (t.folders)
ORDER  BY 2, 3, 1;

Using DISTINCT ON to remove duplicates. I chose to keep the row with the smallest id value.
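
As a minimal illustration of the principle (toy values, unrelated to the real tables): among rows that share the same (file_id, folder_id) after the rewrite, the first one per ORDER BY, i.e. the smallest id, survives:

-- DISTINCT ON keeps one row per (file_id, folder_id), here the one with the smallest id
SELECT DISTINCT ON (file_id, folder_id) *
FROM  (VALUES (1, 7, 3), (2, 7, 3), (5, 8, 4)) AS v(id, file_id, folder_id)
ORDER  BY file_id, folder_id, id;
-- returns the rows with id = 1 and id = 5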

If you can afford an exclusive lock, the cheapest way is to create a new table from this, like:

BEGIN;
LOCK TABLE file_folders, folders, files IN EXCLUSIVE MODE;

-- temp table for targets for multiple uses!
CREATE TEMP TABLE targets AS
SELECT ARRAY (SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3) AS folders;

CREATE TABLE file_folders_new AS
SELECT DISTINCT ON (2, 3)
       i.id, i.file_id, COALESCE(random_array_element(t.folders), i.folder_id) AS folder_id
FROM   file_folders i
LEFT   JOIN targets t ON i.folder_id <> ALL (t.folders)
ORDER  BY 2, 3, 1;

-- make sure the new table is good HERE !

DROP TABLE file_folders;
ALTER TABLE file_folders_new RENAME TO file_folders;
ALTER TABLE file_folders
  ADD CONSTRAINT file_folders_file_id_fkey FOREIGN KEY (file_id) REFERENCES files
, ADD CONSTRAINT file_folders_folder_id_fkey FOREIGN KEY (folder_id) REFERENCES folders
, ADD CONSTRAINT file_folders_uni UNIQUE (file_id, folder_id)
, ADD CONSTRAINT file_folders_pkey PRIMARY KEY (id)  -- do you need the surrogate PK?
, ALTER id ADD GENERATED ALWAYS AS IDENTITY;

-- sync sequence of new IDENTITY
SELECT setval(pg_get_serial_sequence('file_folders', 'id'), COALESCE(max(id), 1), max(id) IS NOT NULL)
FROM   file_folders;

-- recreate all other indexes etc. that your original table file_folders had HERE!

-- drop orphaned folders
DELETE FROM folders f
USING  targets t
WHERE  f.id <> ALL (t.folders);

COMMIT;

fiddle

All in a single transaction, so you don't lose data if anything goes wrong.

Alternatively (to avoid recreating constraints and indexes) you could create a temp table for file_folders, TRUNCATE the old one, and INSERT the rows from the temp table.
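
A rough sketch of that alternative, under the same exclusive lock (assuming the random_array_element() helper from the question; untested):

BEGIN;
LOCK TABLE file_folders, folders, files IN EXCLUSIVE MODE;

CREATE TEMP TABLE targets AS
SELECT ARRAY (SELECT id FROM folders TABLESAMPLE BERNOULLI (50) LIMIT 3) AS folders;

-- stage the deduplicated rows
CREATE TEMP TABLE file_folders_tmp AS
SELECT DISTINCT ON (2, 3)
       i.id, i.file_id, COALESCE(random_array_element(t.folders), i.folder_id) AS folder_id
FROM   file_folders i
LEFT   JOIN targets t ON i.folder_id <> ALL (t.folders)
ORDER  BY 2, 3, 1;

-- empty the original table, keeping its constraints and indexes
TRUNCATE file_folders;

-- OVERRIDING SYSTEM VALUE re-inserts the original identity values
INSERT INTO file_folders (id, file_id, folder_id)
OVERRIDING SYSTEM VALUE
SELECT id, file_id, folder_id FROM file_folders_tmp;

-- sync the identity sequence, then drop orphaned folders as above
SELECT setval(pg_get_serial_sequence('file_folders', 'id'), COALESCE(max(id), 1), max(id) IS NOT NULL)
FROM   file_folders;

DELETE FROM folders f USING targets t WHERE f.id <> ALL (t.folders);

COMMIT;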

But when updating most rows, it's typically cheaper to recreate all indexes from scratch than to update them incrementally.


Erwin Brandstetter