8

I have a table with 165M records like this:

Performance
   id        integer
   installs  integer
   hour      timestamp without time zone

I also have an index on hour:

CREATE INDEX hour_idx
  ON performance
  USING btree
  (hour DESC NULLS LAST);

However, select top 10 records ordered by hour takes 6 minutes!

EXPLAIN ANALYZE  select hour from performance order by hour desc limit 10

Returns

Limit  (cost=7952135.23..7952135.25 rows=10 width=8) (actual time=376313.958..376313.964 rows=10 loops=1)
  ->  Sort  (cost=7952135.23..8368461.00 rows=166530310 width=8) (actual time=376313.957..376313.960 rows=10 loops=1)
        Sort Key: hour
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on performance  (cost=0.00..4353475.10 rows=166530310 width=8) (actual time=0.006..327149.828 rows=192330557 loops=1)
Planning time: 0.070 ms
Execution time: 376330.573 ms

Why is it taking so long? If there is an index on date field desc - shouldn't it be super quick to retrieve data?

Michael Green
  • 25,255
  • 13
  • 54
  • 100
Dejell
  • 185
  • 8

2 Answers2

19

In your sample code above, the index is explicitly created as NULLS LAST and the query is implicitly running NULLS FIRST (which is the default for ORDER BY .. DESC) so PostgreSQL would need to re-sort the data if it used the index. As a result the index would actually make the query many times slower than even the (already-slow) table scan.

rds-9.6.5 root@db1=> create table performance (id integer, installs integer, hour timestamp without time zone);
CREATE TABLE
Time: 28.100 ms

rds-9.6.5 root@db1=> with generator as (select generate_series(1,166530) i)
[more] - > insert into performance (
[more] ( >   select
[more] ( >     i id,
[more] ( >     (random()*1000)::integer installs,
[more] ( >     (now() - make_interval(secs => i))::timestamp installs
[more] ( >   from generator
[more] ( > );
INSERT 0 166530
Time: 244.872 ms

rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour desc nulls last);
CREATE INDEX
Time: 67.089 ms

rds-9.6.5 root@db1=> vacuum analyze performance;
VACUUM
Time: 43.552 ms

We can add a WHERE clause on the hour column so that using the index becomes a good idea - but notice how we still need to re-sort the data from the index.

rds-9.6.5 root@db1=> explain select hour from performance where hour>now() order by hour desc limit 10;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=4.45..4.46 rows=1 width=8)
   ->  Sort  (cost=4.45..4.46 rows=1 width=8)
         Sort Key: hour DESC
         ->  Index Only Scan using hour_idx on performance  (cost=0.42..4.44 rows=1 width=8)
               Index Cond: (hour > now())
(5 rows)

Time: 0.789 ms

If we add an explicit NULLS LAST to your query then it will use the index as expected.

rds-9.6.5 root@db1=> explain select hour from performance order by hour desc NULLS LAST limit 10;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.68 rows=10 width=8)
   ->  Index Only Scan using hour_idx on performance  (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)

Time: 0.526 ms

Alternatively, if we drop the (non-default) NULLS LAST from your index then the query will use it as expected without modification.

rds-9.6.5 root@db1=> drop index hour_idx;
DROP INDEX
Time: 4.124 ms

rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour desc);
CREATE INDEX
Time: 69.220 ms

rds-9.6.5 root@db1=> explain select hour from performance order by hour desc limit 10;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.68 rows=10 width=8)
   ->  Index Only Scan using hour_idx on performance  (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)

Time: 0.725 ms 

Note that you can also drop the DESC from your index; PostgreSQL can scan indexes both forwards and backwards and on single-column indexes it's generally unnecessary to reverse them. You only need to be careful about having the right combination of order and nulls first/last.

rds-9.6.5 root@db1=> drop index hour_idx;
DROP INDEX
Time: 3.837 ms

rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour);
CREATE INDEX
Time: 94.815 ms

rds-9.6.5 root@db1=> explain select hour from performance order by hour desc limit 10;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.68 rows=10 width=8)
   ->  Index Only Scan Backward using hour_idx on performance  (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)

Time: 0.740 ms
Joe Obbish
  • 32,976
  • 4
  • 74
  • 153
2

If most of your queries intend to select non-NULL values from hour then you should consider building a partial index on those values, i.e. something like:

CREATE INDEX hour_not_null_idx ON performance (hour)
 WHERE hour IS NOT NULL;

which, as long as you either query for a particular value of hour, as Jeremy demonstrated in his answer, or add hour IS NOT NULL to your WHERE clause, will give you the same results, and possibly save you a little space as well:

# explain select hour from performance where hour > now() order by hour desc limit 10;
 Limit  (cost=0.42..5.30 rows=10 width=8)
   ->  Index Only Scan Backward using hour_not_null_idx on performance  (cost=0.42..8.72 rows=17 width=8)
         Index Cond: (hour > now())

If there are no NULL values in the column, you should declare it NOT NULL (I'm going to assume you know how to do this with ALTER TABLE ;o)), then create the index (without NULLS LAST, since it's no longer important anyway). Then you get the same benefit:

william=# create index hour_idx on performance using btree ( hour );
CREATE INDEX
william=# explain select hour from performance order by hour desc limit 10;
                                           QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.73 rows=10 width=8)
   ->  Index Only Scan Backward using hour_idx on performance  (cost=0.42..5238.37 rows=166530 width=8)
(2 rows)