I am using a Postgres data warehouse hosted on Amazon RDS. When I try to update one column of a 25-million-row fact table from another table in the same database, the query takes several days to run. Why is this happening, and how can I improve the performance? I know PG is designed more for OLTP than OLAP, but SELECT performance on this table is usually pretty decent.
The query in question looks like this:
UPDATE a
SET value = b.value
FROM b
WHERE a.id = b.id;
b is a temp table in a different schema but the same database, with the same number of rows as a. Both tables have primary keys on id; there are no indexes or constraints on the value column. There are views that depend on table a, but no foreign keys.
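For concreteness, the relevant parts of the two tables look roughly like this (the column types are placeholders, and the real fact table is much wider):

CREATE TABLE a (
    id    bigint PRIMARY KEY,
    value double precision
    -- plus the other fact columns
);

CREATE TEMP TABLE b (
    id    bigint PRIMARY KEY,
    value double precision
);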
I am using PG 9.5 on RDS with General Purpose (SSD) storage and 256 GB allocated. Since gp2 storage provides a baseline of 3 IOPS per GB, after exhausting our initial burst IOPS I should get a little under 800 IOPS (256 GB × 3 = 768 IOPS).
Is IOPS throttling really the issue here? Watching the query run, I see ~400 IOPS of write throughput and similar read throughput. 25,000,000 rows / 400 IOPS ≈ 17 hours, yet this query ran for much longer than 24 hours (I cancelled it after ~30 hours to try some tweaks). There was some other periodic update traffic on the same table, but I halted that at around the 20-hour mark once I saw how long this query was taking.
I am wondering whether my general update approach is wrong, or whether there is general advice for operating a data warehouse (OLAP workload) on Postgres. Could I get better performance by ditching RDS and running PG on EC2?
UPDATE: Inspired by the responses and comments, I ran a test on 45k rows (by limiting the primary key to values below a certain threshold).
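The test was essentially the following, wrapped in EXPLAIN (the 45000 cutoff is a placeholder for the actual id bound I used):

EXPLAIN (ANALYZE, BUFFERS)
UPDATE a
SET value = b.value
FROM b
WHERE a.id = b.id
AND a.id < 45000;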
You can see the results of EXPLAIN ANALYZE here. The vast majority of the time is spent writing the updated rows back to the table. Right now I am still leaning towards write IOPS being the limiting factor, but I will dig into the possible replication issues that joanolo mentioned.
This image shows the RDS instance monitoring page. The most recent spike is the query in question.
