6

I basically have a table with date, timestamp, DID, coordinates.

I want a query that will return rows with the last coordinate from day X, and the first coord from day X+1, and the coordinates. So it will only return results where there are 2 consecutive dates.

This is what I came up with. Been trying to get this query to work, it's almost perfect, but I just need to add the commented out where conditions and it'll do exactly what I want. But when I uncomment, I get an error "column doesn't exist":

SELECT  a.timestamp_intersecting_date d1,b.timestamp_intersecting_date d2,   
        a."DID", 
        a.timestamp_intersecting_max t1, b.timestamp_intersecting_min t2,
        RANK () OVER ( 
          PARTITION BY a.timestamp_intersecting_date
          ORDER BY a.timestamp_intersecting_max DESC
       ) timestamp_d1_rank ,
        RANK () OVER ( 
          PARTITION BY b.timestamp_intersecting_date
          ORDER BY b.timestamp_intersecting_max ASC
       ) timestamp_d2_rank,
        a.coords_centroid, b.coords_centroid
FROM
    signals a
INNER JOIN signals b ON (a."DID" = b."DID")
WHERE (b.timestamp_intersecting_date = a.timestamp_intersecting_date + INTERVAL '1 DAY')
AND a."DID" = b."DID"
--AND timestamp_d1_rank = 1
--AND timestamp_d2_rank = 1
ORDER BY a."DID", t1 desc, t2 asc

How to solve this?

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
roy naufal
  • 187
  • 1
  • 6

3 Answers3

5

You cannot use aliases in WHERE clauses, nor can you use Window functions.

To demonstrate, consider this example (fiddle).

CREATE TABLE payment (amount INTEGER, pay_date DATE);

INSERT INTO payment VALUES (54,  '2019-09-01'), (56,  '2019-09-01'), (154, '2019-09-02'), 
(156, '2019-09-02'), (254, '2019-09-03'), (256, '2019-09-03');

Then run the two following queries:

SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
WHERE the_sums > 200; -- ERROR:  column "the_sums" does not exist

and

SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
WHERE SUM(amount) OVER (PARTITION BY pay_date) > 200;
-- ERROR:  window functions are not allowed in WHERE

You didn't post your error message or PostgreSQL version, but your problem is that timestamp_d1_rank and timestamp_d2_rank are aliases.

You have two options:

  • First Option: use the aliases/Window functions within a subquery (also in fiddle)


SELECT * FROM
(
SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
) AS tab
WHERE the_sums < 200;

Result:

amount    pay_date  the_sums
    54  2019-09-01       110
    56  2019-09-01       110
  • Second Option: you can use a CTE (aka the WITH clause as outlined here or as in @ypercubeᵀᴹ 's answer above or also see the fiddle.

For a good explanation of this whole area, see here.

Vérace
  • 30,923
  • 9
  • 73
  • 85
4

You cannot use window/ranking functions or their aliases in the WHERE clause because they are evaluated in SELECT, after the WHERE has been evaluated.

You can use a subquery (derived table or CTE) to process a second WHERE clause, after the window functions have been evaluated:

SELECT d1, d2,
       "DID",
       t1, t2,
       coords_centroid_a,
       coords_centroid_b
FROM
  (
    SELECT  a.timestamp_intersecting_date d1, b.timestamp_intersecting_date d2,   
            a."DID", 
            a.timestamp_intersecting_max t1, b.timestamp_intersecting_min t2,
            RANK () OVER ( 
              PARTITION BY a.timestamp_intersecting_date
              ORDER BY a.timestamp_intersecting_max DESC
           ) timestamp_d1_rank ,
            RANK () OVER ( 
              PARTITION BY b.timestamp_intersecting_date
              ORDER BY b.timestamp_intersecting_max ASC
           ) timestamp_d2_rank,
            a.coords_centroid AS coords_centroid_a, 
            b.coords_centroid AS coords_centroid_b
    FROM
        signals a
    INNER JOIN signals b ON (a."DID" = b."DID")
    WHERE (b.timestamp_intersecting_date = a.timestamp_intersecting_date + INTERVAL '1 DAY')
    AND a."DID" = b."DID"
  ) AS t
WHERE t.timestamp_d1_rank = 1
  AND t.timestamp_d2_rank = 1
ORDER BY "DID", t1 DESC, t2 ASC ;
ypercubeᵀᴹ
  • 99,450
  • 13
  • 217
  • 306
3

The confusion with syntax rules concerning input and output column names has been addressed properly by existing answers. This is about a lurking performance problem.

There is an transient CROSS JOIN in your query that scales terribly with growing number of rows per ("DID", date_col). Example: If a "DID" has 100 coordinates per day, the query ends up processing 10.000 (!) combinations with the next day.

Based on this assumed table definition:

CREATE TABLE signals (
   "DID" serial PRIMARY KEY
 , date_col date NOT NULL  -- ?
 , time_col time NOT NULL  -- ?
 , coords_centroid geography  -- ?
);

And assuming there are only few rows per ("DID", date_col), this query should be much faster already:

SELECT a."DID", a.date_col, t1, coords1, t2, coords2
FROM  (  -- first row per ("DID", date_col)
   SELECT DISTINCT ON ("DID", date_col)
          "DID", date_col, time_col AS t2, coords_centroid AS coords2
   FROM   signals
   ORDER  BY "DID", date_col, time_col
   ) b
CROSS JOIN LATERAL (  -- corresponding last row from previous day
   SELECT "DID", date_col, time_col AS t1, coords_centroid AS coords1
   FROM   signals a
   WHERE  a."DID" = b."DID"
   AND    a.date_col = b.date_col - 1
   ORDER  BY time_col DESC
   ) a;

Needs an index on ("DID", date_col, time_col).

More might be done here, depending on your actual setup and actual data distribution. I'll elaborate if you are interested and provide missing information ...

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633