I am using a Postgres data warehouse hosted on Amazon RDS. When I try to update one column of a 25-million-row fact table from another table in the same database, the query takes several days to run. Why is this happening, and how can I improve the performance? I know PG is designed more for OLTP than OLAP, but SELECT performance on this table is usually pretty decent.
The query in question looks like this:
UPDATE a
SET value = b.value
FROM b
WHERE a.id = b.id;
b is a temp table in a different schema but the same database, with the same number of rows as a. Both tables have primary keys on id; there are no indexes or constraints on the value column. There are views that depend on table a, but no foreign keys.
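For concreteness, the relevant parts of the two tables look roughly like this (the column types are placeholders, and the real fact table is much wider):

CREATE TABLE a (
    id    bigint PRIMARY KEY,
    value double precision
    -- plus the other fact columns
);

CREATE TEMP TABLE b (
    id    bigint PRIMARY KEY,
    value double precision
);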
I am using PG 9.5 on RDS with General Purpose (SSD) storage and 256 GB allocated. Since gp2 storage provides a baseline of 3 IOPS per GB, after exhausting our initial burst IOPS I should get a little under 800 IOPS (256 GB × 3 = 768 IOPS).
Is IOPS throttling really the issue here? Watching the query run, I see ~400 IOPS of write throughput and similar read throughput. 25,000,000 rows / 400 IOPS ≈ 17 hours, yet this query ran for much longer than 24 hours (I cancelled it after ~30 hours to try some tweaks). There was some other periodic update traffic on the same table, but I halted that at around the 20-hour mark once I saw how long this query was taking.
I am wondering whether my general update approach is wrong, or whether there is general advice for operating a data warehouse (OLAP workload) on Postgres. Could I get better performance by ditching RDS and running PG on EC2?
UPDATE: Inspired by the responses and comments, I ran a test on 45k rows (by limiting the primary key to values below a certain threshold).
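The test was essentially the following, wrapped in EXPLAIN (the 45000 cutoff is a placeholder for the actual id bound I used):

EXPLAIN (ANALYZE, BUFFERS)
UPDATE a
SET value = b.value
FROM b
WHERE a.id = b.id
AND a.id < 45000;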
You can see the results of EXPLAIN ANALYZE here. The vast majority of the time is spent writing the updated rows back to the table. Right now I am still leaning towards write IOPS being the limiting factor, but I will dig into the possible replication issues that joanolo mentioned.
This image shows the RDS instance monitoring page. The most recent spike is the query in question.
