Split data into X days' Cohort (SQL Server)

Question

Hi I want to split data by x days' cohort.

the table

SELECT 
 --earliest_time, 
 sum(case when by_day = 0 then 1 else 0 end) day_0,
    sum(case when by_day >  0 and by_day <= 10 then 1 else 0 end) day_10,
    sum(case when by_day > 10 and by_day <= 20 then 1 else 0 end) day_20,
    sum(case when by_day > 20 and by_day <= 30 then 1 else 0 end) day_30,
    sum(case when by_day > 30 and by_day <= 40 then 1 else 0 end) day_40,
    sum(case when by_day > 40 and by_day <= 50 then 1 else 0 end) day_50,
    sum(case when by_day > 50 and by_day <=60 then 1 else 0 end) day_60,
    sum(case when by_day > 60 then 1 else 0 end) day_60plus
from cohort_days co

This splits the data into cohorts of 10 days. If I want to change it to 20 days' cohort, then I need to edit them 1 by 1. I want to just define x days' cohort once, then the data can generate the cohort groups itself automatically.

I want to know is there easier way to do this than the above code? For example, I just need to define x days' cohort once at the beginning then I can use loops to iterate over it to generate the rest of the cohort groups.

A bit like this:

declare @d int
select @d = 20
select * from cohort_days
SELECT 
 sum(case when by_day = 0 then 1 else 0 end) day_0,
    sum(case when by_day > 0 and by_day <= @d then 1 else 0 end) CONCAT('ay_: ', @d),
    select @2d = @d*2
    sum(case when by_day > @d and by_day <= @2d then 1 else 0 end) CONCAT('ay_: ', @2d)
    select @3d = @d*3
    sum(case when by_day > @2d and by_day <= @3d then 1 else 0 end) CONCAT('ay_: ', @3d)
from cohort_days co

If I want to change cohort days to 50 days, I just need to change select @d = 20 to select @d = 50

score 1 · Answer 1 · answered Dec 20 '18 at 03:35

Your desired outcome is that you can change one value (the cohort width) to have all rows re-allocated to the new groupings.

I'd use a separate table to hold the upper and lower bounds for each cohort. These values can be generated a run-time according to the width parameter. There are two special cases - the day-zero cohort which only ever contains a single value and the last cohort which has an unlimited upper bound. There will have to be a second parameter to say how many "normal" cohorts there must be. To generate this I will use a "numbers" table.

create table #Cohorts
(
    Label       varchar(20) not NULL,
    LowerBound  int         not NULL,
    UpperBound  int         not NULL
);


declare @Width int = 10;
declare @Depth int = 5;

insert #Cohorts
(
    Label,
    LowerBound,
    UpperBound
)
select
    'day_' + CONVERT(varchar(20), ((n + 1) * @Width)),
    1 + (n * @Width),
    ((n + 1) * @Width)
from Numbers
where n <= @Depth

union all

select 'day_0', 0, 0        -- the special case for day_0

union all

select
    'day_' + CONVERT(varchar(20), ((@Depth + 1) * @Width)) + 'plus',  -- a special label for the last cohort
    1 + ((@Depth + 1) * @Width),    -- the upper limit of the dynamic cohorts
    2147483647                      -- the maximum value of an int; pseudo-infinity

For the values given in the question this returns

Label       LowerBound  UpperBound
----------  ----------  ----------
day_0       0           0
day_10      1           10
day_20      11          20
day_30      21          30
day_40      31          40
day_50      41          50
day_60      51          60
day_60plus  61          2147483647

In your original question you have used (open, close] bounds. For simplicity I have used [closed, closed]. It doesn't matter since the bounds are integers and the cohorts are contiguous. It just means your > and <= becomes my >= and <=. With tinkering your convention could be used equally well.

We assign each data row to a cohort by joining to this temporary table. The counting is done using GROUP BY.

select
    c.Label, COUNT(*)
from your_data_table as d
inner join #Cohorts as c
     on c.LowerBound <= d.by_day
     and d.by_day <= c.UpperBound
group by c.Label;

This will give one row per cohort. To get a single row with a column per cohort requires a PIVOT. There is a T-SQL statement to do this. There are other methods, too, of which your sum(case..) is one.

score 0 · Answer 2 · answered Dec 19 '18 at 23:25

Try this.

select 
    cast((floor((by_day / 10)) + 1) as varchar) + ' - ' + cast(ceiling((by_day / 10)) as varchar) as day_bracket,
    count(1) as no_in_bracket
from cohort_days
where by_day > 0
group by ceiling((by_day / 10))

union

select
    '0',
    count(1)
from cohort_days
where by_day = 0
;

Notes:

You need to update your bracket size (ie. '10') in 3 places (all in the first select), not just 1.

The results will list the bracket size in column 1, and a count of how many records fall into that bracket, in column 2. In your question, the results are recorded in separate columns. If you really need to transpose from a vertical layout to a horizontal one, I'd suggest copy the results from the SSMS query results window, into Excel, paste them, then copy them again and transpose using CTRL+E then S then E then Return.

The above does account for the zero case (where you've counted the zero record on its own, and not in a band of 10 days) by using the union. It does not account for your "60+" days case. You could do that, if you needed it, however, by introducing another select as per below. Note - the upper bound (60, in this case) is defined twice - once in the first select, and once in the last - update this value if you want to change your upper bound.

select 
    cast((floor((by_day / 10)) + 1) as varchar) + ' - ' + cast(ceiling((by_day / 10)) as varchar) as day_bracket,
    count(1) as no_in_bracket
from cohort_days
where by_day > 0 and by_day <= 60
group by ceiling((by_day / 10))

union

select
    '0',
    count(1)
from cohort_days
where by_day = 0

union

select
    'higher',
    count(1)
from cohort_days
where by_day > 60
;

Split data into X days' Cohort (SQL Server)

2 Answers2