While trying to find the fastest way to do a bulk update of a large table, I came up with the plan below. The setup: main_table (potentially over a billion rows) and an update table of up to a few million rows. Rows in update are keyed to main_table (they carry the corresponding id values).
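For context, the two tables look roughly like this; this is only a sketch reconstructed from the column lists further down, and the data types are my assumptions rather than the real definitions (update has the same columns and holds the new values for the changed rows):
CREATE TABLE main_table (
    id         serial PRIMARY KEY,  -- assumed type; the pk is regenerated after every swap
    listing_id integer,             -- assumed type
    date       date,
    available  boolean,             -- assumed type
    price      numeric,             -- assumed type
    timestamp  timestamp
);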
The plan:
- copy the rows from main_table into temp_table whose id is not in the update table (update holds only the rows whose data has actually changed)
- simply append the update table onto temp_table
- drop main_table
- rename temp_table to main_table
- renumber the main_table pk
- create the next update with python and repeat.
The above is to be done with the following sequence of queries:
CREATE TABLE temp_table (like main_table including all);
INSERT INTO temp_table (listing_id,date,available,price,timestamp)
(SELECT listing_id,date,available,price,timestamp
FROM main_table c
WHERE NOT EXISTS (SELECT
FROM update
WHERE id = c.id));
INSERT INTO temp_table (listing_id,date,available,price,timestamp)
(SELECT listing_id,date,available,price,timestamp
FROM update c);
DROP TABLE main_table cascade;
ALTER TABLE temp_table RENAME TO main_table;
ALTER TABLE main_table DROP COLUMN id;
ALTER TABLE main_table ADD COLUMN id SERIAL PRIMARY KEY;
DROP TABLE update cascade;
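Since DDL is transactional in PostgreSQL, the swap steps can also be wrapped in a single transaction so that they either all take effect or none do; this is just a sketch of the same statements, assuming nothing else depends on main_table beyond what CASCADE already removes:
BEGIN;
DROP TABLE main_table CASCADE;                            -- old data goes away
ALTER TABLE temp_table RENAME TO main_table;              -- new data takes its place
ALTER TABLE main_table DROP COLUMN id;                    -- renumber the pk
ALTER TABLE main_table ADD COLUMN id SERIAL PRIMARY KEY;
DROP TABLE update CASCADE;
COMMIT;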
While debugging, I found the slow query and analyzed it (the CREATE TABLE is run separately, since EXPLAIN can only be applied to the INSERT):
CREATE TABLE temp_table (like main_table including all);
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO temp_table (listing_id,date,available,price,timestamp)
(SELECT listing_id,date,available,price,timestamp
FROM main_table c
WHERE NOT EXISTS (SELECT
FROM update
WHERE id = c.id));
This is the result of EXPLAIN (ANALYZE, BUFFERS) as requested in comments:
QUERY PLAN
Merge Anti Join  (cost=513077.42..4789463.48 rows=109800833 width=40) (actual time=1216.018..48520.564 rows=112757269 loops=1)
  Merge Cond: (c.id = update.id)
  Buffers: shared hit=4 read=1359891 written=2838, temp read=8701 written=15077
  ->  Index Scan using cals_pkey on cals c  (cost=0.57..3958576.01 rows=113119496 width=44) (actual time=0.857..29573.798 rows=113119497 loops=1)
        Buffers: shared hit=1 read=1330192 written=2838
  ->  Sort  (cost=513076.85..521373.51 rows=3318663 width=8) (actual time=1215.147..1260.191 rows=362229 loops=1)
        Sort Key: update.id
        Sort Method: external merge  Disk: 58480kB
        Buffers: shared hit=3 read=29699, temp read=8701 written=15077
        ->  Seq Scan on update  (cost=0.00..62881.63 rows=3318663 width=8) (actual time=0.179..423.100 rows=3318663 loops=1)
              Buffers: shared read=29695
Planning Time: 0.259 ms
Execution Time: 52757.349 ms
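To separate the cost of the anti-join itself from the cost of writing the rows into temp_table, the read side can also be timed on its own; a sketch using count(*) so that nothing is written:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM main_table c
WHERE NOT EXISTS (SELECT
                  FROM update
                  WHERE id = c.id);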
How can I optimize these queries?