select
user_id,
count(id) as unread_count
from
notification
where
is_read = false
and user_id in (select unnest('{200 user IDs}'::bigint[]))
group by
user_id;
The problem is that this query runs for about a minute, sometimes slightly longer. The table is 32 GB, and there is already an index on the user_id column.
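For reference, the existing index is (as far as I can tell) just a plain b-tree on user_id, i.e. something along these lines (I don't have the exact DDL at hand, so treat this as an assumption):

create index ix_notification_user_id on notification (user_id);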
Here is the execution plan:
HashAggregate (cost=123354.81..123629.64 rows=27483 width=16) (actual time=90823.880..90823.972 rows=188 loops=1)
  Group Key: user_id
  -> Nested Loop (cost=2.32..123217.40 rows=27483 width=16) (actual time=0.184..90752.136 rows=48571 loops=1)
        -> HashAggregate (cost=1.76..2.76 rows=100 width=8) (actual time=0.146..0.577 rows=200 loops=1)
              Group Key: unnest(200 user IDs)
              -> Result (cost=0.00..0.51 rows=100 width=8) (actual time=0.021..0.073 rows=200 loops=1)
        -> Index Scan using ix_notification_user_id on notification (cost=0.56..1229.40 rows=275 width=16) (actual time=119.659..453.533 rows=243 loops=200)
              Index Cond: (200 user IDs)
              Filter: (NOT is_read)
              Rows Removed by Filter: 368
Planning time: 0.189 ms
Execution time: 90824.196 ms
I have also tried a temp-table approach: inserting the unnested IDs into a temp table and joining against it, as sketched below, but performance did not improve at all.
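Roughly what that attempt looked like (a sketch from memory; the temp table name and exact shape are illustrative, not the literal code I ran):

create temp table tmp_user_ids (user_id bigint primary key);

insert into tmp_user_ids (user_id)
select unnest('{200 user IDs}'::bigint[]);

analyze tmp_user_ids;

select n.user_id, count(n.id) as unread_count
from notification n
join tmp_user_ids t on t.user_id = n.user_id
where n.is_read = false
group by n.user_id;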
I have run this query (a well-known bloat-estimation query) to check table and index bloat:
select
  schemaname,
tablename,
reltuples::bigint,
relpages::bigint,
otta,
round(case when otta = 0 then 0.0 else sml.relpages / otta::numeric end, 1) as tbloat,
relpages::bigint - otta as wastedpages,
bs*(sml.relpages-otta)::bigint as wastedbytes,
pg_size_pretty((bs*(relpages-otta))::bigint) as wastedsize,
iname,
ituples::bigint,
ipages::bigint,
iotta,
round(case when iotta = 0 or ipages = 0 then 0.0 else ipages / iotta::numeric end, 1) as ibloat,
case
when ipages < iotta then 0
else ipages::bigint - iotta
end as wastedipages,
case
when ipages < iotta then 0
else bs*(ipages-iotta)
end as wastedibytes
--CASE WHEN ipages < iotta THEN pg_size_pretty(0) ELSE pg_size_pretty((bs*(ipages-iotta))::bigint) END AS wastedisize
from (
select
schemaname,
tablename,
cc.reltuples,
cc.relpages,
bs,
ceil((cc.reltuples*((datahdr + ma- (case when datahdr % ma = 0 then ma else datahdr % ma end))+ nullhdr2 + 4))/(bs-20::float)) as otta,
coalesce(c2.relname, '?') as iname,
coalesce(c2.reltuples, 0) as ituples,
coalesce(c2.relpages, 0) as ipages,
coalesce(ceil((c2.reltuples*(datahdr-12))/(bs-20::float)), 0) as iotta
-- very rough approximation, assumes all cols
from (
select
ma,
bs,
schemaname,
tablename,
(datawidth +(hdr + ma-
(
case
when hdr % ma = 0 then ma
else hdr % ma
end)))::numeric as datahdr,
(maxfracsum*(nullhdr + ma-
(
case
when nullhdr % ma = 0 then ma
else nullhdr % ma
end))) as nullhdr2
from
(
select
schemaname,
tablename,
hdr,
ma,
bs,
sum((1-null_frac)* avg_width) as datawidth,
max(null_frac) as maxfracsum,
hdr +(
select
1 + count(*)/ 8
from
pg_stats s2
where
null_frac <> 0
and s2.schemaname = s.schemaname
and s2.tablename = s.tablename ) as nullhdr
from
pg_stats s,
(
select
(
select
current_setting('block_size')::numeric) as bs,
case
when substring(v, 12, 3) in ('8.0',
'8.1',
'8.2') then 27
else 23
end as hdr,
case
when v ~ 'mingw32' then 8
else 4
end as ma
from
(
select
version() as v) as foo ) as constants
group by
1,
2,
3,
4,
5 ) as foo ) as rs
join pg_class cc on
cc.relname = rs.tablename
join pg_namespace nn on
cc.relnamespace = nn.oid
and nn.nspname = rs.schemaname
left join pg_index i on
indrelid = cc.oid
left join pg_class c2 on
c2.oid = i.indexrelid ) as sml
where
sml.relpages - otta > 0
or ipages - iotta > 10
order by
wastedbytes desc,
wastedibytes desc;
Both the primary key index and the user_id index show over 5 GB of wastedsize and more than 500k wastedpages.
My question is: what is the solution here? Is it purely an index-bloat problem that a REINDEX would fix, or is there something else I am missing?
I am not allowed to change the structure of the table; I simply have to optimize the query so that it goes from over a minute to under 1 second.
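If a rebuild is the fix, I assume it would look something like this (REINDEX ... CONCURRENTLY needs PostgreSQL 12+; on older versions the usual workaround is to build a replacement index concurrently and swap it in, so this is just a sketch):

-- PostgreSQL 12+
reindex index concurrently ix_notification_user_id;

-- older versions: build a replacement without blocking writes, then swap
create index concurrently ix_notification_user_id_new on notification (user_id);
drop index concurrently ix_notification_user_id;
alter index ix_notification_user_id_new rename to ix_notification_user_id;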
After adding a partial index on user_id where is_read = false, the query time dropped by roughly 10-15 seconds, but it is obviously still far too slow.
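The partial index was created roughly like this (I believe it is the ix_test that appears in the plan further down; the exact statement is from memory):

create index concurrently ix_test on notification (user_id) where is_read = false;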
EDIT: There are 32.5 million rows in this table in total. Running this query:
SELECT t.user_id, COALESCE(unread_count, 0) AS unread_count
FROM unnest('{200 user_ids}'::bigint[]) t(user_id)
LEFT JOIN LATERAL (
SELECT count(*) AS unread_count
FROM notification n
WHERE n.user_id = t.user_id
AND n.is_read = false
) sub ON true
;
results in the following execution plan (funnily enough, yesterday this ran for over a minute; today it takes about 30 seconds or less):
Nested Loop Left Join (cost=1209.05..120908.50 rows=100 width=16) (actual time=333.088..27260.557 rows=200 loops=1)
  Buffers: shared hit=1981 read=20396 dirtied=7
  I/O Timings: read=27023.896
  -> Function Scan on unnest t (cost=0.00..1.00 rows=100 width=8) (actual time=0.022..0.360 rows=200 loops=1)
  -> Aggregate (cost=1209.04..1209.05 rows=1 width=8) (actual time=136.292..136.293 rows=1 loops=200)
        Buffers: shared hit=1981 read=20396 dirtied=7
        I/O Timings: read=27023.896
        -> Index Only Scan using ix_test on notification n (cost=0.44..1208.29 rows=300 width=0) (actual time=2.153..136.170 rows=105 loops=200)
              Index Cond: (user_id = t.user_id)
              Heap Fetches: 21088
              Buffers: shared hit=1981 read=20396 dirtied=7
              I/O Timings: read=27023.896
Planning time: 0.135 ms
Execution time: 27260.745 ms