I have the following table with values for different stations from 2014-01-01 to 2014-01-04. The data has date gaps that I want to fill: each station should get a row for every missing date, with the value left as NULL. I'm working with PostgreSQL 10.9.
This is my table:
CREATE TABLE stations (station_id text, value integer, date date);
INSERT INTO stations (station_id, value, date) VALUES
('001', 10, '2014-01-01'),
('001', 30, '2014-01-03'),
('002', 40, '2014-01-01'),
('002', 50, '2014-01-02'),
('003', 20, '2014-01-01'),
('003', 10, '2014-01-02'),
('003', 70, '2014-01-04');
I also have a table holding unique stations with identifiers.
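For illustration only, such a lookup table could look like the sketch below. The table name `station_list` and its single column are assumptions, not the actual schema.

```sql
-- Hypothetical lookup table: the name and column are assumed for illustration.
CREATE TABLE station_list (station_id text PRIMARY KEY);

-- One row per station that appears in the sample data.
INSERT INTO station_list (station_id) VALUES
('001'),
('002'),
('003');
```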
And I want something like this:
| station | value | date |
|---------|-------|------------|
| 001 | 10 | 2014-01-01 |
| 001 | NULL | 2014-01-02 |
| 001 | 30 | 2014-01-03 |
| 001 | NULL | 2014-01-04 |
| 002 | 40 | 2014-01-01 |
| 002 | 50 | 2014-01-02 |
| 002 | NULL | 2014-01-03 |
| 002 | NULL | 2014-01-04 |
| 003 | 20 | 2014-01-01 |
| 003 | 10 | 2014-01-02 |
| 003 | NULL | 2014-01-03 |
| 003 | 70 | 2014-01-04 |
Following some DBA Stack Exchange questions, I tried a combination of a LEFT JOIN with a LATERAL JOIN:
WITH complete_dates_station AS (
select station_id,
generate_series(DATE '2014-01-01', DATE '2014-12-31', INTERVAL '1 day')::DATE as dt
FROM stations
GROUP by station_id
), temp_join AS (
SELECT station_id,
dt,
s.value
FROM complete_dates_station
LEFT JOIN LATERAL (
SELECT s.value
FROM stations s
WHERE s.station_id = complete_dates_station.station_id
AND s.date = complete_dates_station.dt
ORDER by s.station_id, date desc
LIMIT 1) as s on TRUE
ORDER BY station_id, dt
) SELECT * from temp_join
This works like a charm, but the join is really slow on my complete table, which has more than 2M rows and a date range spanning over 18 years (I stopped it after 4 hours of running). I tried a simpler approach using a regular LEFT JOIN, but the output shows the station as NULL for every date that had no match:
WITH complete_dates_station AS (
SELECT station_id,
generate_series(date '2014-01-01', date '2014-12-31', interval '1 day')::date as dt
from stations
GROUP BY station_id)
SELECT s.station_id,
c.dt,
s.value
FROM complete_dates_station c
left outer join stations s
on c.station_id = s.station_id and
c.dt = s.date;
which yields the following:
| station | value | date |
|---------|-------|------------|
| 001 | 10 | 2014-01-01 |
| NULL | NULL | 2014-01-02 |
| 001 | 30 | 2014-01-03 |
| NULL | NULL | 2014-01-04 |
| 002 | 40 | 2014-01-01 |
| 002 | 50 | 2014-01-02 |
| NULL | NULL | 2014-01-03 |
| NULL | NULL | 2014-01-04 |
| 003 | 20 | 2014-01-01 |
| 003 | 10 | 2014-01-02 |
| NULL | NULL | 2014-01-03 |
| 003 | 70 | 2014-01-04 |
Is there any way to optimize the first query, or a simpler approach that fills the station gaps using the second query? I have already tried multicolumn indexes on my source table, but the query still takes a long time.
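A minimal sketch of one possible direction, not a tested solution for the full table: the NULL stations in the second query come from selecting `s.station_id`, which is on the nullable side of the LEFT JOIN. Taking the station from the generated station/date grid keeps it filled in. The date range and the use of `SELECT DISTINCT` to list the stations are assumptions made for the sample data; the separate table of unique stations could be used in its place.

```sql
-- Sketch: build the full station/date grid, then left join the measurements.
-- station_id is taken from the grid (c), not from stations (s), so it stays
-- non-NULL on dates that have no measurement.
WITH complete_dates_station AS (
    SELECT st.station_id,
           d.dt::date AS dt
    FROM (SELECT DISTINCT station_id FROM stations) AS st
    CROSS JOIN generate_series(DATE '2014-01-01',
                               DATE '2014-01-04',
                               INTERVAL '1 day') AS d(dt)
)
SELECT c.station_id,
       s.value,
       c.dt AS date
FROM complete_dates_station AS c
LEFT JOIN stations AS s
       ON s.station_id = c.station_id
      AND s.date = c.dt
ORDER BY c.station_id, c.dt;
```

Because this is a plain equality join on `(station_id, date)`, the planner is free to use a hash or merge join instead of running one lateral subquery per grid row. Whether that is fast enough on 2M rows over 18 years would need to be checked with `EXPLAIN ANALYZE`.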