RANK used in 'where' returns invalid column, but exists in results set

Question

I basically have a table with date, timestamp, DID, coordinates.

I want a query that will return rows with the last coordinate from day X, and the first coord from day X+1, and the coordinates. So it will only return results where there are 2 consecutive dates.

This is what I came up with. Been trying to get this query to work, it's almost perfect, but I just need to add the commented out where conditions and it'll do exactly what I want. But when I uncomment, I get an error "column doesn't exist":

SELECT  a.timestamp_intersecting_date d1,b.timestamp_intersecting_date d2,   
        a."DID", 
        a.timestamp_intersecting_max t1, b.timestamp_intersecting_min t2,
        RANK () OVER ( 
          PARTITION BY a.timestamp_intersecting_date
          ORDER BY a.timestamp_intersecting_max DESC
       ) timestamp_d1_rank ,
        RANK () OVER ( 
          PARTITION BY b.timestamp_intersecting_date
          ORDER BY b.timestamp_intersecting_max ASC
       ) timestamp_d2_rank,
        a.coords_centroid, b.coords_centroid
FROM
    signals a
INNER JOIN signals b ON (a."DID" = b."DID")
WHERE (b.timestamp_intersecting_date = a.timestamp_intersecting_date + INTERVAL '1 DAY')
AND a."DID" = b."DID"
--AND timestamp_d1_rank = 1
--AND timestamp_d2_rank = 1
ORDER BY a."DID", t1 desc, t2 asc

How to solve this?

Vérace · Answer 1 · 2019-09-09T13:41:30.903

You cannot use aliases in WHERE clauses, nor can you use Window functions.

To demonstrate, consider this example (fiddle).

CREATE TABLE payment (amount INTEGER, pay_date DATE);

INSERT INTO payment VALUES (54,  '2019-09-01'), (56,  '2019-09-01'), (154, '2019-09-02'), 
(156, '2019-09-02'), (254, '2019-09-03'), (256, '2019-09-03');

Then run the two following queries:

SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
WHERE the_sums > 200; -- ERROR:  column "the_sums" does not exist

and

SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
WHERE SUM(amount) OVER (PARTITION BY pay_date) > 200;
-- ERROR:  window functions are not allowed in WHERE

You didn't post your error message or PostgreSQL version, but your problem is that timestamp_d1_rank and timestamp_d2_rank are aliases.

You have two options:

First Option: use the aliases/Window functions within a subquery (also in fiddle)

SELECT * FROM
(
SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
) AS tab
WHERE the_sums < 200;

Result:

amount    pay_date  the_sums
    54  2019-09-01       110
    56  2019-09-01       110

Second Option: you can use a CTE (aka the WITH clause as outlined here or as in @ypercubeᵀᴹ 's answer above or also see the fiddle.

For a good explanation of this whole area, see here.

score 4 · Accepted Answer · answered Sep 08 '19 at 11:28

You cannot use window/ranking functions or their aliases in the WHERE clause because they are evaluated in SELECT, after the WHERE has been evaluated.

You can use a subquery (derived table or CTE) to process a second WHERE clause, after the window functions have been evaluated:

SELECT d1, d2,
       "DID",
       t1, t2,
       coords_centroid_a,
       coords_centroid_b
FROM
  (
    SELECT  a.timestamp_intersecting_date d1, b.timestamp_intersecting_date d2,   
            a."DID", 
            a.timestamp_intersecting_max t1, b.timestamp_intersecting_min t2,
            RANK () OVER ( 
              PARTITION BY a.timestamp_intersecting_date
              ORDER BY a.timestamp_intersecting_max DESC
           ) timestamp_d1_rank ,
            RANK () OVER ( 
              PARTITION BY b.timestamp_intersecting_date
              ORDER BY b.timestamp_intersecting_max ASC
           ) timestamp_d2_rank,
            a.coords_centroid AS coords_centroid_a, 
            b.coords_centroid AS coords_centroid_b
    FROM
        signals a
    INNER JOIN signals b ON (a."DID" = b."DID")
    WHERE (b.timestamp_intersecting_date = a.timestamp_intersecting_date + INTERVAL '1 DAY')
    AND a."DID" = b."DID"
  ) AS t
WHERE t.timestamp_d1_rank = 1
  AND t.timestamp_d2_rank = 1
ORDER BY "DID", t1 DESC, t2 ASC ;

score 3 · Answer 3 · answered Sep 08 '19 at 23:47

The confusion with syntax rules concerning input and output column names has been addressed properly by existing answers. This is about a lurking performance problem.

There is an transient CROSS JOIN in your query that scales terribly with growing number of rows per ("DID", date_col). Example: If a "DID" has 100 coordinates per day, the query ends up processing 10.000 (!) combinations with the next day.

Based on this assumed table definition:

CREATE TABLE signals (
   "DID" serial PRIMARY KEY
 , date_col date NOT NULL  -- ?
 , time_col time NOT NULL  -- ?
 , coords_centroid geography  -- ?
);

And assuming there are only few rows per ("DID", date_col), this query should be much faster already:

SELECT a."DID", a.date_col, t1, coords1, t2, coords2
FROM  (  -- first row per ("DID", date_col)
   SELECT DISTINCT ON ("DID", date_col)
          "DID", date_col, time_col AS t2, coords_centroid AS coords2
   FROM   signals
   ORDER  BY "DID", date_col, time_col
   ) b
CROSS JOIN LATERAL (  -- corresponding last row from previous day
   SELECT "DID", date_col, time_col AS t1, coords_centroid AS coords1
   FROM   signals a
   WHERE  a."DID" = b."DID"
   AND    a.date_col = b.date_col - 1
   ORDER  BY time_col DESC
   ) a;

Needs an index on ("DID", date_col, time_col).

More might be done here, depending on your actual setup and actual data distribution. I'll elaborate if you are interested and provide missing information ...

RANK used in 'where' returns invalid column, but exists in results set

3 Answers3