WHERE A=x DISTINCT ON (B), with a composite index on (A, B, C)

Question

I have huge table with a composite index on (A, B, C).

-- psql (13.16 (Debian 13.16-0+deb11u1), server 14.12)
\d index_a_b_c
         Index "public.index_a_b_c"
  Column  |         Type          | Key? | 
----------+-----------------------+------+
 A        | character varying(44) | yes  |
 B        | numeric(20,0)         | yes  |
 C        | numeric(20,0)         | yes  |
btree, for table "public.table_a_b_c"

I need all distinct `B`s.

This query runs with Index Only Scan, but, scans over the all A matches. Which is not scale for my case since for some As there as millions of rows. Millions of Index Only Scan row is slow.

EXPLAIN (ANALYZE true) 
SELECT DISTINCT ON ("B") "B"
  FROM "table_a_b_c"
 WHERE "A" = 'astring'
-- Execution time: 0.172993s
-- Unique  (cost=0.83..105067.18 rows=1123 width=5) (actual time=0.037..19.468 rows=67 loops=1)
--  ->  Index Only Scan using index_a_b_c on table_a_b_c  (cost=0.83..104684.36 rows=153129 width=5) (actual time=0.036..19.209 rows=1702 loops=1)
--        Index Cond: (A = 'astring'::text)
--        Heap Fetches: 351
-- Planning Time: 0.091 ms
-- Execution Time: 19.499 ms

As you see, runs over 1.7k rows and manually filter and returns 67 rows. 20ms getting tens of seconds when 1.7k to millions.

I also need all biggest `C`s for distinct `B`s.

Same thing as in 1). In theory, Postgres could know possible Bs, and not need to check the whole list matched to A.

EXPLAIN (ANALYZE true)
SELECT DISTINCT ON ("B") *
  FROM "table_a_b_c"
 WHERE "A" = 'astring'
 ORDER BY "B" DESC,
          "C" DESC
-- Execution time: 0.822705s 
-- Unique  (cost=0.83..621264.51 rows=1123 width=247) (actual time=0.957..665.927 rows=67 loops=1)
--   ->  Index Scan using index_a_b_c on table_a_b_c  (cost=0.83..620881.69 rows=153130 width=247) (actual time=0.955..664.408 rows=1702 loops=1)
--         Index Cond: (a = 'astring'::text)
-- Planning Time: 0.116 ms
-- Execution Time: 665.978 ms

But for instance, this is fast:

SELECT * WHERE A="x" AND B=1 ORDER BY C DESC
  UNION
SELECT * WHERE A="x" AND B=2 ORDER BY C DESC
  UNION
....

for all possible Bs. It is like loop with number of B time.

Questions

a) Shouldn't the index on (A, B, C) be a superset of (A, B) in theory? (A, B) will be super fast for distinct.

b) Why is it hard to find distinct Bs for Postgres?

c) How to handle this without new index?

Erwin Brandstetter · Accepted Answer · 2024-10-12T00:22:43.013

Exhibit A (from your description):

for some As there are millions of rows.

Exhibit B (from the query plan):

rows=153129 → rows=67

Meaning many rows per distinct value in B.

Emulate index-skip scan for many rows per group

Just distinct B:

WITH RECURSIVE rcte AS (
   (
   SELECT B
   FROM   table_abc
   WHERE  A = 'astring'
   ORDER  BY B
   LIMIT  1
   )
UNION ALL
   SELECT t.B
   FROM   rcte r
   CROSS  JOIN LATERAL (  -- implicit break condition
      SELECT t.B
      FROM   table_abc t
      WHERE  t.A = 'astring'
      AND    t.B > r.B
      ORDER  BY t.B
      LIMIT  1
      ) t
   )
TABLE rcte;

Distinct B with greatest C:

WITH RECURSIVE rcte AS (
   (
   SELECT B, C
   FROM   table_abc
   WHERE  A = 'astring'
   ORDER  BY B DESC, C DESC
   LIMIT  1
   )
UNION ALL
   SELECT t.B, t.C
   FROM   rcte r
   CROSS  JOIN LATERAL (  -- implicit break condition
      SELECT t.B, t.C
      FROM   table_abc t
      WHERE  t.A = 'astring'
      AND    t.B < r.B
      ORDER  BY t.B DESC, t.C DESC  -- keep in sync to match index!
      LIMIT  1
      ) t
   )
TABLE rcte;

fiddle

Now, the index-only scans only fetch needed values. (No sequential scan of the covering index.) Effectively, you get what you tried to force with the UNION query (which should be UNION ALL instead).

Should be very fast.

SELECT DISTINCT is slower than expected on my table in PostgreSQL

There are intricate details to this technique. See:

Optimize GROUP BY query to retrieve latest row per user

About alternating sort (ASC/DESC) order in a multicolumn index:

`DISTINCT ON` for only few rows per group

The queries you tried are the optimum in that case. The considerable overhead for the recursive CTE wouldn't pay.

fiddle

See:

Select first row in each GROUP BY group?

WHERE A=x DISTINCT ON (B), with a composite index on (A, B, C)

I need all distinct Bs.

I also need all biggest Cs for distinct Bs.

Questions

1 Answers1

Emulate index-skip scan for many rows per group

DISTINCT ON for only few rows per group

I need all distinct `B`s.

I also need all biggest `C`s for distinct `B`s.

`DISTINCT ON` for only few rows per group