Aggregate SCD Type 2 data "as of" each day

Question

Problem

When working with SCD type 2 data it's easy to see the state of a table "as of" a given point in time by using your date columns, eg: valid_from and valid_to. For example:

select * from table
where '2023-11-01' between valid_from and valid_to

You can then group by or use window functions to aggregate.

But what if I want to do that repeatedly, for each date in a range (eg: daily)? I do not need to aggregate across these dates, just within each one.

Example

Let's say I have a table that tracks the quantity of people for each reservation_id. Both the quantity of people and reservation_status can change over time, as tracked by the table. Each reservation is attached to an event_id. event_date is included to help limit scope if needed (see assumptions below).

reservation_id	reservation_status	quantity	event_id	event_date	valid_from	valid_to
1	active	4	100	2023-05-25	2023-01-01	2023-01-02
2	active	2	200	2024-01-07	2023-01-01	2023-01-03
3	active	7	100	2023-05-25	2023-01-02	9999-12-31
4	active	1	200	2024-01-07	2023-01-02	9999-12-31
1	active	5	100	2023-05-25	2023-01-03	9999-12-31
5	active	8	100	2023-05-25	2023-01-03	9999-12-31
2	cancelled	2	200	2024-01-07	2023-01-04	9999-12-31
6	active	3	100	2023-05-25	2023-01-06	9999-12-31

PostgreSQL since BigQuery is more difficult to test – db-fiddle / SQL:

CREATE TABLE Reservations (
  "reservation_id" INTEGER,
  "reservation_status" VARCHAR(9),
  "quantity" INTEGER,
  "event_id" INTEGER,
  "event_date" DATE,
  "valid_from" DATE,
  "valid_to" DATE
);
INSERT INTO Reservations
  ("reservation_id", "reservation_status", "quantity", "event_id", "event_date", "valid_from", "valid_to")
VALUES
  ('1', 'active', '4', '100', '2023-05-25', '2023-01-01', '2023-01-02'),
  ('2', 'active', '2', '200', '2024-01-07', '2023-01-01', '2023-01-03'),
  ('3', 'active', '7', '100', '2023-05-25', '2023-01-02', '9999-12-31'),
  ('4', 'active', '1', '200', '2024-01-07', '2023-01-02', '9999-12-31'),
  ('1', 'active', '5', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
  ('5', 'active', '8', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
  ('2', 'cancelled', '2', '200', '2024-01-07', '2023-01-04', '9999-12-31'),
  ('6', 'active', '3', '100', '2023-05-25', '2023-01-06', '9999-12-31');

While this is ultimately for BigQuery, answer will be accepted in any dialect as long as it's somewhat generic.

Assumptions

"As of" dates can be a list or range based off valid_from min/max
valid_to of 9999-12-31 is the most recent data
All reservations for a given event will be between event_date - INTERVAL '2 years' and event_date. This doesn't change anything for this example, but maybe is useful for scaling(?)

Desired output

I'd like to know the sum of quantity grouped by event_id and reservation_status as of each interval (day).

as_of_date	event_id	reservation_status	sum_quantity
2023-01-01	100	active	4
2023-01-01	200	active	2
2023-01-02	100	active	11
2023-01-02	200	active	2
2023-01-03	100	active	20
2023-01-03	200	active	3
2023-01-04	100	active	20
2023-01-04	200	active	1
2023-01-04	200	cancelled	2
2023-01-06	100	active	23

^{Rough estimate of row values. Will be different if using complete range of dates.}

I essentially want to do the following:

/* Invalid SQL, just for conceptual purposes */
-- Given a list of dates, for each "date":
select
  event_id,
  reservation_status,
  sum(quantity)
from table
where {{date}} between valid_from and valid_to
group by
  event_id,
  reservation_status

I believe this can be done using procedural language, such as a for loop, but I feel like I am overthinking that, and just having trouble combining simpler concepts.

camtech · Accepted Answer · 2023-11-29T20:03:38.453

Turns out I was very much overthinking it. Thank you to this Stack Overflow post for reminding me of the basics, particularly that you can just join on and with comparisons, and SergeyA's comment: just make a list of dates and join to it.

Here is one complete solution (db-fiddle):

with
-- Generate range of days based on valid_from
-- Or use a calendar table/date dimension
all_dates as (
  select day::date
  from generate_series(
      (select min(valid_from) from Reservations),
      (select max(valid_from) from Reservations),
      '1 day'
  ) day
),
-- Quantities as of each day
quantity_as_of as (
  select
    day as as_of_date,
    Reservations.event_id,
    Reservations.reservation_status,
    sum(Reservations.quantity) as sum_quantity
  from all_dates as ad
  join Reservations
    on valid_from <= ad.day and ad.day < valid_to
  group by
    ad.day,
    Reservations.event_id,
    Reservations.reservation_status
  order by as_of_date, event_id
)
select * from quantity_as_of

as_of_date	event_id	reservation_status	quantity
2023-01-01	100	active	4
2023-01-01	200	active	2
2023-01-02	100	active	7
2023-01-02	200	active	3
2023-01-03	100	active	20
2023-01-03	200	active	1
2023-01-04	100	active	20
2023-01-04	200	cancelled	2
2023-01-04	200	active	1
2023-01-05	100	active	20
2023-01-05	200	active	1
2023-01-05	200	cancelled	2
2023-01-06	100	active	23
2023-01-06	200	cancelled	2
2023-01-06	200	active	1

Aggregate SCD Type 2 data "as of" each day

Problem

Example

Assumptions

Desired output

1 Answers1