1

I am using PostgreSQL and I am trying to create a view of data:

forecast_datetime, file_date, thing_id, thing_name, facts...

The forecast_datetime column is a datetime for each 30 minute interval. The issue is that the forecast is for 1 month into the future which causes a lot of duplicate data. (so today, there will be a forecast for 2016-12-25 01:30:00 and then tomorrow there will be a forecast for the same interval, over and over again potentially until the period actually happens.).

The view I want to create is to just take a look at the last forecast which was appended to the table. This is indicated by the file_date column (as the name suggests, I receive a daily forecast file which is appended to the database. This column indicates the date from the filename).

I basically want to filter all rows for each thing_id and select the forecast_datetime columns where there file_date was the maximum date for that forecast. So each thing_id should only have one row per unique forecast_datetime interval which is based on the latest file_date.

CREATE OR REPLACE VIEW schema.view AS 
SELECT DISTINCT ON (thing_id, forecast_datetime, file_date)
       forecast_datetime,
       interval,
       time_zone,
       file_date,
       thing_id, 
       thing_name,
       other_cols...
FROM   schema.orig_table
ORDER  BY  file_date DESC 

Would that query do the trick? I can do this in pandas/python easily, but I am newer to SQL and wanted to try a view instead of writing a new physical table.

Here is some sample data after applying the query I wrote above this. Because I included the the file_date in the SELECT DISTINCT query, there is still a repeat row which duplicates each forecast_datetime. This causes issues when I join this data based on the thing_id and forecast_datetime back to the actual results table.

forecast_datetime   file_date   thing_id
2016-12-12 20:30    2016-12-11  99
2016-12-12 20:30    2016-12-10  99
2016-12-12 20:30    2016-12-09  99
2016-12-12 20:30    2016-12-08  99

I would like to only have the row with the max file_date. In the example above, that would be the first row (2016-12-11). The other rows would be dropped from the view.

trench
  • 235
  • 4
  • 8

1 Answers1

0
CREATE OR REPLACE VIEW schema.view AS 
SELECT DISTINCT ON (thing_id, forecast_datetime)
       forecast_datetime,
       interval,
       time_zone,
       file_date,
       thing_id, 
       thing_name,
       other_cols...
FROM   schema.orig_table
ORDER  BY  thing_id, forecast_datetime ASC, file_date DESC 

This appears to have worked perfectly.

trench
  • 235
  • 4
  • 8