1

I have run the following queries.

CTE

WITH temp AS (SELECT id, timestamp FROM table group by timestamp, id)
SELECT COUNT(*) FROM temp WHERE timestamp = '2023-01-01 08:28:45'

Subquery

SELECT COUNT(*) FROM (SELECT id, timestamp FROM table group by timestamp, id) 
as temp WHERE timestamp = '2023-01-01 08:28:45'

The CTE statement took Total runtime: 638.871 ms
The Subquery statement took Total runtime: 3,795.166 ms

Question

This is the same table, same data, and indexes. What can be the reason for the difference?

Both statement were run on a PostgreSQL 9.6.7 installation.

It seems that the subquery is using External merge while the CTE is not.

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
gbox
  • 141
  • 1
  • 7

2 Answers2

2

We would need to see the table definitions (particularly the keys and indexes) and query plans to be able to tell you with any confidence, so you should edit those details into your question. Also, the version of postgres you are using may be significant because in recent versions there have been significant changes in how it can optimise CTEs.

My first guess based on currently available information is that in the first case predicate push-down has allowed it to use an index on timestamp before doing anything about the grouping, whereas for the second query it has scanned the whole timestamp index, performed the grouping, and only then applied the filter on timestamp.

If you have no index on timestamp then both will result in a full table scan, but with the CTE example still filtering by timestamp as it goes and having fewer rows to group, but the second sorting the whole lot to do the grouping, perhaps spooling the result to disk if it is large enough, then filtering.

But again: these are guesses, if you provide the table definitions and query plans someone can tell you exactly what has happened and why.

David Spillett
  • 32,593
  • 3
  • 50
  • 92
1

Outdated Postgres version

PostgreSQL 9.6 has reached EOL in 2021 at the point release of 9.6.24 (!) - not 9.6.7.

Upgrade to a current version at the earliest opportunity. If that's not an option, at least upgrade to the latest point release. Quoting our versioning policy:

We always recommend that all users run the latest available minor release for whatever major version is in use.

(Bolded as in the original.)

You are missing out on a host of fixes and improvements. Not least this one in Postgres 12:

  • Allow common table expressions (CTEs) to be inlined into the outer query (Andreas Karlsson, Andrew Gierth, David Fetter, Tom Lane)

    Specifically, CTEs are automatically inlined if they have no side-effects, are not recursive, and are referenced only once in the query. Inlining can be prevented by specifying MATERIALIZED, or forced for multiply-referenced CTEs by specifying NOT MATERIALIZED. Previously, CTEs were never inlined and were always evaluated before the rest of the query.

Meaning, the difference between your two displayed queries goes away almost completely. In older Postgres versions, the result of a CTE is always materialized. This poses as optimization barrier. Hence, a subquery is typically faster, especially when the optimizer can simplify the query - like in your case.

The CTE processes the whole table and materializes aggregated rows - before filtering the one timestamp of concern. A huge waste!

An optimized query only considers the timestamp of concern to begin with, which saves work proportional to the number of distinct timestamps in the table. This is particularly severe in the presence of an applicable index.

(There are queries where said optimization barrier helps. But this is not one of those.)

Related:

Better query

That said, your query can be simplified to:

SELECT count(*)
FROM  (
   SELECT id  -- or even just an empty SELECT list
   FROM   tbl
   WHERE  timestamp = '2023-01-01 08:28:45'
   GROUP  BY id
   ) temp;

No need to add timestamp to GROUP BY or the SELECT list after filtering it to a single value.

And this is simpler and faster, yet - at least in modern Postgres:

SELECT count(DISTINCT id)
FROM   tbl
WHERE  timestamp = '2023-01-01 08:28:45';
Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633