
I'm moving up to the next level of my mystery query. It looks like there's a subselect inside an EXISTS on the same table, and I think this could probably be simplified with an INNER JOIN higher up.

Using PostgreSQL 9.4.2.
Table definitions (\d+): https://gist.github.com/neezer/879f5d3649ca1903c6f3
Cardinalities:

billing_pricequote: 1,462,625 rows
billing_pricequotestatus: 3,331,657 rows
billing_lineitem: 43,687,855 rows

Here's the original query, without modifications suggested for the subquery inside EXISTS:

SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
                                AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND EXISTS(
  SELECT s1.quote_id
  FROM billing_pricequotestatus s1
  INNER JOIN (
    SELECT DISTINCT ON (quote_id) quote_id, MAX(created_at) AS max_created_at
    FROM billing_pricequotestatus
    WHERE quote_id=i.quote_id
    GROUP BY quote_id, created_at
    ORDER BY quote_id, created_at DESC
  ) AS s2
  ON s1.quote_id = s2.quote_id
  AND s1.created_at = s2.max_created_at
  WHERE s1.name IN ('adjustment','payment','billable')
)
GROUP BY i.quote_id, i.acct_id
;

The part that looked weird to me is the SELECT on billing_pricequotestatus combined with another subselect on the same table inside the INNER JOIN.

I tried changing this with the modification from my other SO post:

SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
                                AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND EXISTS(
  SELECT quote_id, MAX(created_at) AS max_created_at
  FROM billing_pricequotestatus
  WHERE quote_id=i.quote_id
  AND name IN ('adjustment','payment','billable')
  GROUP BY quote_id
)
GROUP BY i.quote_id, i.acct_id
;

That cut my execution time in half (~40 seconds to ~20 seconds), but yielded slightly different results: the original query returns 28,895 rows, while my new query returns 28,917. It's not immediately clear to me why my modification didn't produce equivalent output (which it needs to).

EXPLAIN ANALYZE for both queries:
Explain analysis for the original query is on depesz.com.

Would really appreciate any help/guidance on this!


I tried updating @ypercubeᵀᴹ's answer with a LATERAL join, and the performance seems about the same (each version wins about as often, by less than a second):

SELECT i.quote_id, i.acct_id AS account_id, SUM(i.delta_amount) AS amt
FROM billing_lineitem i
INNER JOIN billing_pricequote pq ON i.quote_id = pq.id
LEFT JOIN LATERAL
( SELECT name
  FROM billing_pricequotestatus
  WHERE quote_id = i.quote_id
  ORDER BY created_at DESC
  LIMIT 1
) pqs ON true
WHERE pq.date_applied AT TIME ZONE 'PST' BETWEEN '2016-02-02T00:00:00'::timestamp
                                AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
AND pqs.name IN ('adjustment', 'payment', 'billable')
GROUP BY i.quote_id, i.acct_id
;

Explain analysis.

Any other suggestions to get this below ≈10 seconds?

neezer

2 Answers


The purpose of your subquery as I understand it:
Select rows where the latest related entry in billing_pricequotestatus has a qualifying name.

Incorrect 2nd query

It's not immediately clear to me why my modification didn't produce equivalent output

The 1st query picks the latest row from billing_pricequotestatus and checks whether the name qualifies (name IN ('adjustment','payment','billable')).

The 2nd query is backwards: it checks for any row with a qualifying name (not just the latest one). Also, it doesn't make sense to compute an aggregate in an EXISTS semi-join - EXISTS only tests whether any row is returned, so the MAX is simply discarded. You don't want that, and it's not equivalent.

Consequently, you get more rows from the 2nd query.
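
A concrete (hypothetical) illustration - suppose one quote had these two status rows (the data is made up, and 'cancelled' stands in for any non-qualifying name):

 quote_id | created_at | name
----------+------------+-----------
       42 | 2016-02-10 | payment
       42 | 2016-02-20 | cancelled

The 1st query only looks at the latest row ('cancelled'), which does not qualify, so quote 42 is excluded. The 2nd query finds some qualifying row ('payment'), so quote 42 is included.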

Incorrect time range

This predicate is a mess. Inefficient and possibly incorrect - or at least a ticking bomb:

WHERE pq.date_applied AT TIME ZONE 'PST'
       BETWEEN '2016-02-02T00:00:00'::timestamp
           AND '2016-03-03T22:27:41.734102-08:00'::timestamptz 

The column date_applied is of type timestamptz. The construct AT TIME ZONE 'PST' converts it to type timestamp and shifts by the time offset hard-coded into the time zone abbreviation 'PST' - which is a bad move to begin with. It makes the expression non-sargable. That's more expensive and, more importantly, rules out the use of any plain index on date_applied.
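
A quick way to see the type change (a trivial check, not from the original post):

SELECT pg_typeof(now() AT TIME ZONE 'PST');  -- timestamp without time zone
SELECT pg_typeof(now());                     -- timestamp with time zone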

Even worse, the time zone abbreviation 'PST' is not aware of DST or any historic shifts in time. If your time zone has (or had in the past) daylight saving time, and your set spans different DST periods, your current expression is most probably incorrect:

You would need to use the applicable time zone name instead of the abbreviation to get consistent local time - which is even more expensive.

And there's yet another problem: while the column value is shifted by a hard-coded time offset ('PST'), your upper bound '2016-03-03T22:27:41.734102-08:00'::timestamptz remains of type timestamptz. To compare the two, Postgres has to bridge the type mismatch implicitly, and that implicit conversion between timestamp and timestamptz depends on the time zone setting of the current session. So you can get different results depending on the current timezone setting of your session. I can't think of a use case where this would make sense.
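
A minimal demonstration of that session-dependence, comparing an arbitrary timestamp probe value against your upper bound (only the timezone setting changes between the two statements):

SET timezone = 'UTC';
SELECT timestamp '2016-03-04 02:00:00' <= timestamptz '2016-03-03T22:27:41.734102-08:00';  -- true
SET timezone = 'America/Los_Angeles';
SELECT timestamp '2016-03-04 02:00:00' <= timestamptz '2016-03-03T22:27:41.734102-08:00';  -- false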

Don't do any of this. Don't translate timestamptz column date_applied to local time at all, don't mix data types like you do and don't mix different ways to cast. Instead, use the column as is and provide timestamptz parameters.
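
As a sketch of what that could look like (assuming 'PST' is really meant as local time for America/Los_Angeles - adjust the zone name to whatever actually applies):

WHERE pq.date_applied >= (timestamp '2016-02-02 00:00:00' AT TIME ZONE 'America/Los_Angeles')
AND   pq.date_applied <= timestamptz '2016-03-03T22:27:41.734102-08:00'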

Query

SELECT i.quote_id, i.acct_id AS account_id, sum(i.delta_amount) AS amt
FROM   billing_pricequote pq
JOIN   LATERAL (
   SELECT name
   FROM   billing_pricequotestatus
   WHERE  quote_id = pq.id
   ORDER  BY created_at DESC
   LIMIT  1
   ) pqs ON pqs.name IN ('adjustment', 'payment', 'billable')
JOIN   billing_lineitem i ON i.quote_id = pq.id
WHERE  pq.date_applied BETWEEN (timestamp '2016-02-02T00:00:00' AT TIME ZONE 'PST')  -- !
                           AND timestamptz '2016-03-03T22:27:41.734102-08:00'
GROUP  BY 1,2;

Note the LATERAL join: unlike in your version it is not a LEFT JOIN but a plain [INNER] JOIN, which implements your predicate right away - rows whose latest status does not qualify are dropped by the join.

Or use the equivalent correlated subquery outlined by @ypercube. Not sure which is faster.

Also note that I base the LATERAL JOIN on billing_pricequote - before joining to the big table billing_lineitem. This way we can eliminate rows early, which should be cheaper.

Index

Currently, you get:

Seq Scan on billing_pricequote pq

Only 70k of your 1.5M rows in billing_pricequote are selected, which is around 5%. An index on date_applied might help a bit, but not much. However, this multicolumn index on billing_pricequotestatus should help the LATERAL subquery substantially if you can get index-only scans out of it:

CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name);

Even more efficient with name_id instead of name as suggested below.
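
Spelled out (index names are mine, and the name_id variant assumes the normalized column from the table sketch further down):

CREATE INDEX billing_pricequote_date_applied_idx
    ON billing_pricequote (date_applied);                              -- optional, minor gain

CREATE INDEX billing_pricequotestatus_quote_created_nameid_idx
    ON billing_pricequotestatus (quote_id, created_at DESC, name_id);  -- name_id variant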

Statistics

Postgres is over-estimating the selectivity of your time range - it expects only a small fraction of the rows that the predicate actually matches:

(cost=0.00..88,546.50 rows=7,313 width=4) (actual time=2.353..767.408 rows=70,623 loops=1)

It might help to increase the statistics target for just the column date_applied, so the planner gets a better estimate for this range.
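
A minimal sketch (1000 is just an example value; the default target is 100):

ALTER TABLE billing_pricequote ALTER COLUMN date_applied SET STATISTICS 1000;
ANALYZE billing_pricequote;  -- re-collect statistics with the new target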

Table definition

Taking billing_pricequotestatus as an example:

name seems to hold one of just a few possible values. It would help performance to normalize some more and use a 4-byte integer referencing a lookup table instead of a varchar(20) repeated over and over in 3.3M rows. Also, reordering columns (if possible) like I demonstrate below would help some more:

   Column   |           Type           |              Modifiers
------------+--------------------------+------------------------------------------
 id         | integer                  | not null default nextval('...
 quote_id   | integer                  | not null
 created_at | timestamp with time zone | not null
 updated_at | timestamp with time zone | not null
 name_id    | integer                  | not null REFERENCES name_table(name_id)
 notes      | text                     | not null

See link above about alignment and padding. To measure row size:
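
For instance, a rough probe with pg_column_size() (approximate, and it scans the whole table):

SELECT avg(pg_column_size(s.*)) AS avg_row_bytes
FROM   billing_pricequotestatus s;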

And "name" is not a good identifier. I would use something descriptive instead.

Erwin Brandstetter

I think that the EXISTS subquery:

AND EXISTS(
  SELECT s1.quote_id
  FROM billing_pricequotestatus s1
    INNER JOIN 
      ( SELECT DISTINCT ON (quote_id) quote_id, MAX(created_at) AS max_created_at
        FROM billing_pricequotestatus
        WHERE quote_id=i.quote_id
        GROUP BY quote_id, created_at
        ORDER BY quote_id, created_at DESC
      ) AS s2
    ON s1.quote_id = s2.quote_id
    AND s1.created_at = s2.max_created_at
  WHERE s1.name IN ('adjustment','payment','billable')
)

can be simplified to a correlated subquery:

AND   ( SELECT name
        FROM billing_pricequotestatus
        WHERE quote_id = i.quote_id
        ORDER BY created_at DESC
        LIMIT 1
      ) IN ('adjustment', 'payment', 'billable')

An index on (quote_id, created_at DESC, name) would help a lot.
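
For example (letting Postgres pick the index name):

CREATE INDEX ON billing_pricequotestatus (quote_id, created_at DESC, name);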

Your version of Postgres (9.4) supports LATERAL joins (available since 9.3), so this can also be written with a LATERAL join, which may be more efficient.


The first question you posted at SO did not include the WHERE quote_id = i.quote_id, so it was not possible for others to know that the subquery was correlated. The answer you got there is correct for that case.

ypercubeᵀᴹ