2

I have huge table with a composite index on (A, B, C).

-- psql (13.16 (Debian 13.16-0+deb11u1), server 14.12)

\d index_a_b_c Index "public.index_a_b_c" Column | Type | Key? | ----------+-----------------------+------+ A | character varying(44) | yes | B | numeric(20,0) | yes | C | numeric(20,0) | yes | btree, for table "public.table_a_b_c"

I need all distinct Bs.

This query runs with Index Only Scan, but, scans over the all A matches. Which is not scale for my case since for some As there as millions of rows. Millions of Index Only Scan row is slow.

EXPLAIN (ANALYZE true) 
SELECT DISTINCT ON ("B") "B"
  FROM "table_a_b_c"
 WHERE "A" = 'astring'

-- Execution time: 0.172993s -- Unique (cost=0.83..105067.18 rows=1123 width=5) (actual time=0.037..19.468 rows=67 loops=1) -- -> Index Only Scan using index_a_b_c on table_a_b_c (cost=0.83..104684.36 rows=153129 width=5) (actual time=0.036..19.209 rows=1702 loops=1) -- Index Cond: (A = 'astring'::text) -- Heap Fetches: 351 -- Planning Time: 0.091 ms -- Execution Time: 19.499 ms

As you see, runs over 1.7k rows and manually filter and returns 67 rows. 20ms getting tens of seconds when 1.7k to millions.

I also need all biggest Cs for distinct Bs.

Same thing as in 1). In theory, Postgres could know possible Bs, and not need to check the whole list matched to A.

EXPLAIN (ANALYZE true)
SELECT DISTINCT ON ("B") *
  FROM "table_a_b_c"
 WHERE "A" = 'astring'
 ORDER BY "B" DESC,
          "C" DESC

-- Execution time: 0.822705s -- Unique (cost=0.83..621264.51 rows=1123 width=247) (actual time=0.957..665.927 rows=67 loops=1) -- -> Index Scan using index_a_b_c on table_a_b_c (cost=0.83..620881.69 rows=153130 width=247) (actual time=0.955..664.408 rows=1702 loops=1) -- Index Cond: (a = 'astring'::text) -- Planning Time: 0.116 ms -- Execution Time: 665.978 ms

But for instance, this is fast:

SELECT * WHERE A="x" AND B=1 ORDER BY C DESC
  UNION
SELECT * WHERE A="x" AND B=2 ORDER BY C DESC
  UNION
....

for all possible Bs. It is like loop with number of B time.

Questions

a) Shouldn't the index on (A, B, C) be a superset of (A, B) in theory? (A, B) will be super fast for distinct.

b) Why is it hard to find distinct Bs for Postgres?

c) How to handle this without new index?

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633

1 Answers1

1

Exhibit A (from your description):

for some As there are millions of rows.

Exhibit B (from the query plan):

rows=153129rows=67

Meaning many rows per distinct value in B.

Emulate index-skip scan for many rows per group

Just distinct B:

WITH RECURSIVE rcte AS (
   (
   SELECT B
   FROM   table_abc
   WHERE  A = 'astring'
   ORDER  BY B
   LIMIT  1
   )

UNION ALL SELECT t.B FROM rcte r CROSS JOIN LATERAL ( -- implicit break condition SELECT t.B FROM table_abc t WHERE t.A = 'astring' AND t.B > r.B ORDER BY t.B LIMIT 1 ) t ) TABLE rcte;

Distinct B with greatest C:

WITH RECURSIVE rcte AS (
   (
   SELECT B, C
   FROM   table_abc
   WHERE  A = 'astring'
   ORDER  BY B DESC, C DESC
   LIMIT  1
   )

UNION ALL SELECT t.B, t.C FROM rcte r CROSS JOIN LATERAL ( -- implicit break condition SELECT t.B, t.C FROM table_abc t WHERE t.A = 'astring' AND t.B < r.B ORDER BY t.B DESC, t.C DESC -- keep in sync to match index! LIMIT 1 ) t ) TABLE rcte;

fiddle

Now, the index-only scans only fetch needed values. (No sequential scan of the covering index.) Effectively, you get what you tried to force with the UNION query (which should be UNION ALL instead).

Should be very fast.

Related:

There are intricate details to this technique. See:

About alternating sort (ASC/DESC) order in a multicolumn index:

DISTINCT ON for only few rows per group

The queries you tried are the optimum in that case. The considerable overhead for the recursive CTE wouldn't pay.

fiddle

See:

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633