Problem
When working with SCD type 2 data it's easy to see the state of a table "as of" a given point in time by using your date columns, eg: valid_from and valid_to. For example:
select * from table
where '2023-11-01' between valid_from and valid_to
You can then group by or use window functions to aggregate.
But what if I want to do that repeatedly, for each date in a range (eg: daily)? I do not need to aggregate across these dates, just within each one.
Example
Let's say I have a table that tracks the quantity of people for each reservation_id. Both the quantity of people and reservation_status can change over time, as tracked by the table. Each reservation is attached to an event_id. event_date is included to help limit scope if needed (see assumptions below).
| reservation_id | reservation_status | quantity | event_id | event_date | valid_from | valid_to |
|---|---|---|---|---|---|---|
| 1 | active | 4 | 100 | 2023-05-25 | 2023-01-01 | 2023-01-02 |
| 2 | active | 2 | 200 | 2024-01-07 | 2023-01-01 | 2023-01-03 |
| 3 | active | 7 | 100 | 2023-05-25 | 2023-01-02 | 9999-12-31 |
| 4 | active | 1 | 200 | 2024-01-07 | 2023-01-02 | 9999-12-31 |
| 1 | active | 5 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 5 | active | 8 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 2 | cancelled | 2 | 200 | 2024-01-07 | 2023-01-04 | 9999-12-31 |
| 6 | active | 3 | 100 | 2023-05-25 | 2023-01-06 | 9999-12-31 |
PostgreSQL since BigQuery is more difficult to test – db-fiddle / SQL:
CREATE TABLE Reservations (
"reservation_id" INTEGER,
"reservation_status" VARCHAR(9),
"quantity" INTEGER,
"event_id" INTEGER,
"event_date" DATE,
"valid_from" DATE,
"valid_to" DATE
);
INSERT INTO Reservations
("reservation_id", "reservation_status", "quantity", "event_id", "event_date", "valid_from", "valid_to")
VALUES
('1', 'active', '4', '100', '2023-05-25', '2023-01-01', '2023-01-02'),
('2', 'active', '2', '200', '2024-01-07', '2023-01-01', '2023-01-03'),
('3', 'active', '7', '100', '2023-05-25', '2023-01-02', '9999-12-31'),
('4', 'active', '1', '200', '2024-01-07', '2023-01-02', '9999-12-31'),
('1', 'active', '5', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
('5', 'active', '8', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
('2', 'cancelled', '2', '200', '2024-01-07', '2023-01-04', '9999-12-31'),
('6', 'active', '3', '100', '2023-05-25', '2023-01-06', '9999-12-31');
While this is ultimately for BigQuery, answer will be accepted in any dialect as long as it's somewhat generic.
Assumptions
- "As of" dates can be a list or range based off
valid_frommin/max valid_toof9999-12-31is the most recent data- All reservations for a given event will be between
event_date - INTERVAL '2 years'andevent_date. This doesn't change anything for this example, but maybe is useful for scaling(?)
Desired output
I'd like to know the sum of quantity grouped by event_id and reservation_status as of each interval (day).
| as_of_date | event_id | reservation_status | sum_quantity |
|---|---|---|---|
| 2023-01-01 | 100 | active | 4 |
| 2023-01-01 | 200 | active | 2 |
| 2023-01-02 | 100 | active | 11 |
| 2023-01-02 | 200 | active | 2 |
| 2023-01-03 | 100 | active | 20 |
| 2023-01-03 | 200 | active | 3 |
| 2023-01-04 | 100 | active | 20 |
| 2023-01-04 | 200 | active | 1 |
| 2023-01-04 | 200 | cancelled | 2 |
| 2023-01-06 | 100 | active | 23 |
Rough estimate of row values. Will be different if using complete range of dates.
I essentially want to do the following:
/* Invalid SQL, just for conceptual purposes */
-- Given a list of dates, for each "date":
select
event_id,
reservation_status,
sum(quantity)
from table
where {{date}} between valid_from and valid_to
group by
event_id,
reservation_status
I believe this can be done using procedural language, such as a for loop, but I feel like I am overthinking that, and just having trouble combining simpler concepts.