20

I have a table with lots of inserts, setting one of the fields (uploaded_at) to NULL. Then a periodic task selects all the tuples WHERE uploaded_at IS NULL, processes them and updates, setting uploaded_at to current date.

How should I index the table?

I understand that I should use a partial index like:

CREATE INDEX foo ON table (uploaded_at) WHERE uploaded_at IS NULL

Or smth like that. I'm a bit confused though if it is correct to index on a field that is always NULL. Or if it is correct to use a b-tree index. Hash looks like a better idea, but it is obsolete and is not replicated via streaming hot-standby replication. Any advice would be greatly appreciated.

I've experimented a bit with the following indices:

"foo_part" btree (uploaded_at) WHERE uploaded_at IS NULL
"foo_part_id" btree (id) WHERE uploaded_at IS NULL

and the query planer seems to always choose the foo_part index. explain analyse also yields slightly better result for the foo_part index:

Index Scan using foo_part on t1  (cost=0.28..297.25 rows=4433 width=16) (actual time=0.025..3.649 rows=4351 loops=1)
   Index Cond: (uploaded_at IS NULL)
 Total runtime: 4.060 ms

vs

Bitmap Heap Scan on t1  (cost=79.15..6722.83 rows=4433 width=16) (actual time=1.032..4.717 rows=4351 loops=1)
   Recheck Cond: (uploaded_at IS NULL)
   ->  Bitmap Index Scan on foo_part_id  (cost=0.00..78.04 rows=4433 width=0) (actual time=0.649..0.649 rows=4351 loops=1)
 Total runtime: 5.131 ms
Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
Kirill Zaitsev
  • 303
  • 1
  • 2
  • 7

1 Answers1

14

In this special case the column actually indexed is irrelevant for the query at hand. You can pick any column. I would pick something else than uploaded_at, which is useless. Some column that may be useful for other queries and is not bigger than 8 bytes, ideally.

CREATE INDEX foo ON table bar (some_col) WHERE uploaded_at IS NULL;

If you have no use case for any other column, it's still best to stick with the useless uploaded_at, so not to introduce additional maintenance cost for the index and restrictions for H.O.T. updates. More:

Or use a constant as index expression if you have no use for any other index column. Like:

CREATE INDEX baz ON table bar ((TRUE)) WHERE uploaded_at IS NULL;

Parentheses required. This also keeps the index at minimum size. But while the index column is never bigger than 8 bytes (which is the case for timestamp) it's still at minimum size anyway.

Update: Storage characteristics change in Postgres 13 with index deduplication (while the rest still applies). See:

Related:

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633