Index that is not used, yet influences query

Question

I've got a PostgreSQL 9.3 table with some numbers and some additional data:

CREATE TABLE mytable (
    myid BIGINT,
    somedata BYTEA
)

This table currently has about 10M records and takes 1GB of disk space. myid are not consecutive.

I want to compute how many rows are in every block of 100000 consecutive numbers:

SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;

This returns about 3500 rows.

I noticed that existence of a certain index significantly speeds up this query even though the query plan does not mention it at all. The query plan without the index:

db=> EXPLAIN (ANALYZE TRUE, VERBOSE TRUE) SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1636639.92..1709958.65 rows=496942 width=8) (actual time=6783.763..8888.841 rows=3460 loops=1)
   Output: ((myid / 100000)), count(*)
   ->  Sort  (cost=1636639.92..1659008.91 rows=8947594 width=8) (actual time=6783.752..8005.831 rows=8947557 loops=1)
         Output: ((myid / 100000))
         Sort Key: ((mytable.myid / 100000))
         Sort Method: external merge  Disk: 157440kB
         ->  Seq Scan on public.mytable  (cost=0.00..236506.92 rows=8947594 width=8) (actual time=0.020..1674.838 rows=8947557 loops=1)
               Output: (myid / 100000)
 Total runtime: 8914.780 ms
(9 rows)

The index:

db=> CREATE INDEX myindex ON mytable ((myid/100000));
db=> VACUUM ANALYZE;

The new query plan:

db=> EXPLAIN (ANALYZE TRUE, VERBOSE TRUE) SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=281242.99..281285.97 rows=3439 width=8) (actual time=3190.189..3190.800 rows=3460 loops=1)
   Output: ((myid / 100000)), count(*)
   ->  Seq Scan on public.mytable  (cost=0.00..236505.56 rows=8947485 width=8) (actual time=0.026..1659.571 rows=8947557 loops=1)
         Output: (myid / 100000)
 Total runtime: 3190.975 ms
(5 rows)

So, the query plans and the runtimes differ significantly (almost three times) but neither mention the index. This behavior is perfectly reproducible on my dev machine: I went through several cycles of dropping the index, testing the query several times, recreating the index, again testing the query several times. What's happening here?

score 6 · Answer 1 · answered Jul 29 '14 at 16:31

When you create an expression index, it causes PostgreSQL to gather statistics on the that expression. With those statistics on hand, it now has an accurate estimate for the number of aggregated rows that the query will return, which leads it to make a better plan choice.

Specifically in this case, without those extra statistics it thought the hash table would be too large to fit in work_mem, so it didn't choose that method.

Erwin Brandstetter · Accepted Answer · 2019-05-24T00:11:28.483

VACUUM ANALYZE makes the difference in your example. Plus, as @jjanes supplied, the additional statistics for the functional index. Per documentation:

pg_statistic also stores statistical data about the values of index expressions. These are described as if they were actual data columns; in particular, starelid references the index. No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column.

However, creating the index does not by itself cause Postgres to gather statistics. Try:

CREATE INDEX myindex ON mytable ((myid/100000));
SELECT * FROM pg_statistic WHERE starelid = 'myindex'::regclass;

Returns nothing until you run your first ANALYZE (or VACUUM ANALYZE, or the autovacuum daemon kicks in).

ANALYZE mytable;
SELECT * FROM pg_statistic WHERE starelid = 'myindex'::regclass;

Now you'll see added statistics.

Since the whole table has to be read anyway, Postgres is going to use a sequential scan unless it expects the computation of myid/100000 to be expensive enough to switch, which it isn't.

Your only other chance would be an index-only scan if the index is much smaller than the table - and preconditions for an index-only scan are met. Details in the Postgres Wiki and in the manual.

As long as that functional index is not used, the collateral benefit from added statistics is moderate. If the table was read-only the cost would be low - but then again, we'd probably see an index-only scan right away.

Maybe you can also achieve better query plans by setting a higher statistics target for mytable.myid. That would only incur a minor cost. More:

PostgreSQL + query planner + amount of entities

Index that is not used, yet influences query

2 Answers2

Linked