My query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    work_token_assignment.work_id
FROM
    work_token_assignment
    JOIN token ON token.id = work_token_assignment.token_id
WHERE
    work_token_assignment.work_id = 207064
ORDER BY
    token.added_at,
    token.id
LIMIT 1;
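For reference, the plans below mention three indexes; here is a minimal schema sketch consistent with them (the column types are my guesses, not the real DDL):

-- Sketch of the schema as inferred from the index names in the plans;
-- types and constraints are assumptions:
CREATE TABLE token (
    id       bigint PRIMARY KEY,          -- token_pkey
    added_at timestamptz NOT NULL
);
CREATE INDEX ix_token_added_at ON token (added_at);

CREATE TABLE work_token_assignment (
    work_id  bigint NOT NULL,
    token_id bigint NOT NULL REFERENCES token (id),
    PRIMARY KEY (work_id, token_id)       -- work_token_assignment_pkey
);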
For most work_id values, the plan is (this particular example used work_id = 18345):
Limit  (cost=901.37..1801.70 rows=1 width=16) (actual time=49.917..49.918 rows=1 loops=1)
  Buffers: shared hit=55473
  ->  Incremental Sort  (cost=901.37..8371230.78 rows=9297 width=16) (actual time=49.915..49.916 rows=1 loops=1)
        Sort Key: token.added_at, token.id
        Presorted Key: token.added_at
        Full-sort Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
        Pre-sorted Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=55473
        ->  Nested Loop  (cost=0.99..8370812.42 rows=9297 width=16) (actual time=48.799..49.852 rows=180 loops=1)
              Buffers: shared hit=55473
              ->  Index Scan using ix_token_added_at on token  (cost=0.56..1702792.66 rows=14940404 width=12) (actual time=0.015..10.580 rows=15884 loops=1)
                    Buffers: shared hit=7819
              ->  Index Only Scan using work_token_assignment_pkey on work_token_assignment  (cost=0.42..0.45 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=15884)
                    Index Cond: ((work_id = 18345) AND (token_id = token.id))
                    Heap Fetches: 0
                    Buffers: shared hit=47654
Planning:
  Buffers: shared hit=15
Planning Time: 0.405 ms
Execution Time: 49.956 ms
And that is pretty fast.
But for some ids, seemingly at random, the plan is:
Limit  (cost=817.04..1633.05 rows=1 width=16) (actual time=23821.261..23821.262 rows=1 loops=1)
  Buffers: shared hit=32475418 read=90739 dirtied=13663
  ->  Incremental Sort  (cost=817.04..8370622.85 rows=10257 width=16) (actual time=23821.260..23821.260 rows=1 loops=1)
        Sort Key: token.added_at, token.id
        Presorted Key: token.added_at
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=32475418 read=90739 dirtied=13663
        ->  Nested Loop  (cost=0.99..8370161.28 rows=10257 width=16) (actual time=23819.997..23821.234 rows=2 loops=1)
              Buffers: shared hit=32475415 read=90739 dirtied=13663
              ->  Index Scan using ix_token_added_at on token  (cost=0.56..1702682.64 rows=14939188 width=12) (actual time=0.017..7344.302 rows=8228792 loops=1)
                    Buffers: shared hit=7789036 read=90739 dirtied=13663
              ->  Index Only Scan using work_token_assignment_pkey on work_token_assignment  (cost=0.42..0.45 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=8228792)
                    Index Cond: ((work_id = 207064) AND (token_id = token.id))
                    Heap Fetches: 2
                    Buffers: shared hit=24686379
Planning:
  Buffers: shared hit=15
Planning Time: 0.265 ms
Execution Time: 23821.286 ms
Which takes forever!
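The difference is visible in the scan on ix_token_added_at: the fast case read rows=15884 before the LIMIT was satisfied, the slow one read rows=8228792. My working theory (unverified) is that for these work_ids, the earliest assigned token sits very deep in added_at order, so the nested loop has to probe the assignment pkey for millions of tokens before finding a match. A query along these lines should measure that depth:

-- How many tokens precede the first token assigned to this work_id,
-- in added_at order? A large count would explain the slow nested loop.
SELECT count(*)
FROM token
WHERE added_at <= (
    SELECT min(t.added_at)
    FROM token t
    JOIN work_token_assignment wta ON wta.token_id = t.id
    WHERE wta.work_id = 207064
);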
Setting enable_incremental_sort TO off changes the broken query plan to yet another one, using a Bitmap Heap Scan - and this one is fast, because it searches the assignment table first (a per-transaction way to apply the setting is sketched after the plan):
Limit  (cost=42633.16..42633.28 rows=1 width=16) (actual time=38.210..40.668 rows=1 loops=1)
  Buffers: shared hit=50202
  ->  Gather Merge  (cost=42633.16..43630.50 rows=8548 width=16) (actual time=38.208..40.666 rows=1 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=50202
        ->  Sort  (cost=41633.14..41643.82 rows=4274 width=16) (actual time=12.396..12.397 rows=0 loops=3)
              Sort Key: token.added_at, token.id
              Sort Method: top-N heapsort  Memory: 25kB
              Buffers: shared hit=50202
              Worker 0:  Sort Method: quicksort  Memory: 25kB
              Worker 1:  Sort Method: quicksort  Memory: 25kB
              ->  Nested Loop  (cost=256.48..41611.77 rows=4274 width=16) (actual time=0.233..11.650 rows=3333 loops=3)
                    Buffers: shared hit=50174
                    ->  Parallel Bitmap Heap Scan on work_token_assignment  (cost=255.92..5821.58 rows=4274 width=8) (actual time=0.226..0.864 rows=3333 loops=3)
                          Recheck Cond: (work_id = 207064)
                          Heap Blocks: exact=124
                          Buffers: shared hit=171
                          ->  Bitmap Index Scan on work_token_assignment_pkey  (cost=0.00..253.35 rows=10257 width=0) (actual time=0.608..0.609 rows=10000 loops=1)
                                Index Cond: (work_id = 207064)
                                Buffers: shared hit=47
                    ->  Index Scan using token_pkey on token  (cost=0.56..8.37 rows=1 width=12) (actual time=0.003..0.003 rows=1 loops=10000)
                          Index Cond: (id = work_token_assignment.token_id)
                          Buffers: shared hit=50003
Planning:
  Buffers: shared hit=15
Planning Time: 0.296 ms
Execution Time: 40.709 ms
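If turning incremental sort off turns out to be the right workaround, it can be scoped to a single transaction with SET LOCAL rather than changed globally; a minimal sketch:

BEGIN;
-- Affects only this transaction; the setting reverts at COMMIT/ROLLBACK.
SET LOCAL enable_incremental_sort = off;
SELECT work_token_assignment.work_id
FROM work_token_assignment
JOIN token ON token.id = work_token_assignment.token_id
WHERE work_token_assignment.work_id = 207064
ORDER BY token.added_at, token.id
LIMIT 1;
COMMIT;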
Postgres seems to estimate correctly that there are about 10K rows with the queried work_id in the work_token_assignment table (in all the cases described above, that number is comparable).
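That estimate can be checked against the planner's statistics directly, e.g.:

-- Per-value frequency estimates the planner keeps for work_id:
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'work_token_assignment'
  AND attname = 'work_id';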
Most rows in token do not have a corresponding entry in work_token_assignment; only a small minority do. And there should not be many duplicate added_at values (both claims can be verified with the query below).
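-- Sanity check (scans both tables, so it is not free):
-- how many tokens are assigned at all, and how many added_at duplicates exist.
SELECT
    (SELECT count(DISTINCT token_id) FROM work_token_assignment)  AS assigned_tokens,
    (SELECT count(*) FROM token)                                  AS total_tokens,
    (SELECT count(*) - count(DISTINCT added_at) FROM token)       AS duplicate_added_at;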