Disclaimer: I am a total "newb" when it comes to PostgreSQL, but due to an unfortunate alignment of circumstances I am in charge of this project, which I have to rewrite/migrate away from SQL Server (alone), so there's that.
The process is fairly straightforward: I have multiple sources of unique data which I must check every day for updates, import the changes into the Items table, and aggregate the results into Aggregates.
The tables are as follows:
CREATE TABLE public."Items" (
"Md5Hash" bytea NOT NULL,
"SourceId" int4 NOT NULL,
"LastModifiedAt" timestamp NULL,
"PropA" int4 NOT NULL,
"PropB" timestamp NULL,
"PropC" bool NULL,
...
CONSTRAINT "PK_Items" PRIMARY KEY ("Md5Hash", "SourceId"),
CONSTRAINT "FK_Items_Sources_SourceId" FOREIGN KEY ("SourceId") REFERENCES "Sources"("Id") ON DELETE RESTRICT
);
CREATE INDEX "IX_Items_SourceId" ON public."Items" USING btree ("SourceId");
CREATE TABLE public."Aggregates" (
"Id" int4 NOT NULL GENERATED BY DEFAULT AS IDENTITY,
"UniqueIdentifier" varchar(150) NOT NULL,
"Md5Hash" bytea NOT NULL, <- I need this
"Aggregated" bool NOT NULL DEFAULT false,
"LastModifiedAt" timestamp NULL,
"FoundIn" int2 NOT NULL DEFAULT 0,
"PropA" int4 NOT NULL DEFAULT 0,
"PropB" timestamp NULL,
"PropC" int4 NOT NULL DEFAULT 0,
...
CONSTRAINT "PK_Aggregates" PRIMARY KEY ("Id")
);
CREATE INDEX "IX_Aggregates_Aggregated" ON public."Aggregates" USING btree ("Aggregated");
CREATE UNIQUE INDEX "IX_Aggregates_UniqueIdentifier" ON public."Aggregates" USING btree ("UniqueIdentifier");
Aggregation examples:
Aggregates.FoundIn = COUNT(Items.SourceId)
Aggregates.PropA = SUM(Items.PropA)
Aggregates.PropB = MAX(Items.PropB)
Aggregates.PropC = SUM(CASE WHEN .. Items.PropC)
...
Right now I:
- drop all indexes on Aggregates
- disable foreign key checks on Items (see the sketch after this list)
- check each source one by one; in parallel, I take subsets of the data (batches of 5,000 rows) and insert them into both Items and Aggregates at the same time, using arrays of custom types for batch inserts (hence the @entities parameter)
- run a query to aggregate the modified data
- recreate all indexes concurrently
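For the "disable foreign key checks" step, a minimal sketch of what I mean (an assumption: session_replication_role is one way to do it; it requires superuser rights and skips ALL triggers, not just the FK constraint triggers):

-- Skip trigger-based constraint enforcement (FKs included) for this session:
SET session_replication_role = replica;
-- Per-table alternative: ALTER TABLE "Items" DISABLE TRIGGER ALL;

-- Drop a secondary index before the load and rebuild it afterwards
-- without blocking concurrent reads/writes:
DROP INDEX "IX_Aggregates_Aggregated";
-- ... bulk load ...
CREATE INDEX CONCURRENTLY "IX_Aggregates_Aggregated" ON public."Aggregates" USING btree ("Aggregated");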
Queries I use:
INSERT INTO "Items" ({insertColumns})
SELECT * FROM unnest(@entities) e
ON CONFLICT("Md5Hash", "SourceId") DO UPDATE SET {updateColumns};
INSERT INTO "Aggregates" ({insertColumns})
SELECT * FROM unnest(@entities) e
ON CONFLICT("UniqueIdentifier") DO UPDATE SET "Aggregated" = FALSE;
UPDATE "Aggregates"
SET "Aggregated" = TRUE, {updateColumns}
FROM (
SELECT "Md5Hash", {selectColumns} FROM "Items"
GROUP BY "Md5Hash"
) AS agg WHERE agg."Md5Hash" = "Aggregates"."Md5Hash" AND AND "Aggregated" = FALSE;
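For concreteness, with the example aggregations listed earlier filled in, the consolidation query looks roughly like this (a sketch limited to the three example columns; note the casts, since COUNT and SUM(int4) return bigint):

UPDATE "Aggregates"
SET "Aggregated" = TRUE,
    "FoundIn" = agg."FoundIn",
    "PropA" = agg."PropA",
    "PropB" = agg."PropB"
FROM (
    SELECT "Md5Hash",
           COUNT("SourceId")::int2 AS "FoundIn",
           SUM("PropA")::int4 AS "PropA",
           MAX("PropB") AS "PropB"
    FROM "Items"
    GROUP BY "Md5Hash"
) AS agg
WHERE agg."Md5Hash" = "Aggregates"."Md5Hash" AND "Aggregates"."Aggregated" = FALSE;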
IMPORTANT NOTES:
- The Items table has close to 1 billion rows (will grow)
- The Aggregates table has several hundred million rows
- Tables have approximately 20 columns (will grow)
- Data in each source may or may not change, so I update only what has changed (approx. 20-30 million rows per day in total), hence the need for constant re-aggregation
- The server runs PostgreSQL 12.2 on Windows (12 processors, 128GB RAM) with the following settings:
max_connections = 300
shared_buffers = 24GB
effective_cache_size = 72GB
maintenance_work_mem = 2047MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 320MB
min_wal_size = 1GB
max_wal_size = 4GB
synchronous_commit = off
max_worker_processes = 72
max_parallel_workers_per_gather = 4
max_parallel_workers = 72
max_parallel_maintenance_workers = 4
I also went through a MySQL version where, in order to bypass unique index gap locks, I used an in-memory staging table for inserting the UniqueIdentifier into Aggregates, and LOAD DATA for the Items inserts.
For PostgreSQL, I've gone through several other approaches and this one seems to be the fastest in terms of initial creation; however, the daily updates are 40% slower compared to SQL Server/MySQL, and the consolidation query is also slower (60% slower). I tried using COPY, but since it has no support for IGNORE/REPLACE like its MySQL counterpart, it's not useful on its own. I also tried it in combination with a staging table (see the sketch below), but the subsequent transfer into the production tables was very, very slow. Overall, I suspect I don't leverage PostgreSQL as I should, due to my lack of knowledge.
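The staging-table attempt looked roughly like this (a sketch from memory; the UNLOGGED table and the single update column are illustrative):

-- Staging table with Items' layout; UNLOGGED to skip WAL during the load:
CREATE UNLOGGED TABLE "Items_Staging" (LIKE "Items");

-- Fast bulk load from the client:
COPY "Items_Staging" FROM STDIN (FORMAT binary);

-- The slow part: merging staging into the production table:
INSERT INTO "Items"
SELECT * FROM "Items_Staging"
ON CONFLICT("Md5Hash", "SourceId") DO UPDATE SET "LastModifiedAt" = EXCLUDED."LastModifiedAt";

TRUNCATE "Items_Staging";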
Is there anything I am doing wrong here? Should I drop the multithreading and insert the data for each source all at once? Any input is much appreciated.
Edit: As requested in the comments, I am attaching the query plan for a much smaller dataset (67M rows in Aggregates, of which ~300K have Aggregated = FALSE; 191M rows in Items).
Update on "Aggregates" (cost=2.27..18501202.86 rows=1 width=409) (actual time=656794.218..656794.218 rows=0 loops=1)
Buffers: shared hit=193661352 read=3444109 dirtied=919387 written=366887
-> Merge Join (cost=2.27..18501202.86 rows=1 width=409) (actual time=1323.481..625667.850 rows=262679 loops=1)
Merge Cond: ("Aggregates"."Md5Hash" = agg."Md5Hash")
Buffers: shared hit=189969175 read=2738925 dirtied=258646 written=237386
-> Sort (cost=1.70..1.70 rows=1 width=103) (actual time=1321.782..1444.692 rows=262679 loops=1)
Sort Key: "Aggregates"."Md5Hash"
Sort Method: quicksort Memory: 50901kB
Buffers: shared hit=121000 read=35
-> Index Scan using "IX_Aggregates_Aggregated" on "Aggregates" (cost=0.57..1.69 rows=1 width=103) (actual time=0.137..1080.727 rows=262679 loops=1)
Index Cond: ("Aggregated" = false)
Buffers: shared hit=121000 read=35
-> Subquery Scan on agg (cost=0.57..18432053.72 rows=27658963 width=362) (actual time=0.103..615067.274 rows=67537041 loops=1)
Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
-> GroupAggregate (cost=0.57..18155464.09 rows=27658963 width=169) (actual time=0.081..559315.620 rows=67537041 loops=1)
Group Key: "Items"."Md5Hash"
Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
-> Index Scan using "PK_Items" on "Items" (cost=0.57..7571149.04 rows=196337627 width=107) (actual time=0.052..331107.259 rows=191030895 loops=1)
Buffers: shared hit=189848175 read=2738890 dirtied=258646 written=237386
Planning Time: 0.370 ms
Execution Time: 656803.036 ms