1

Any ideas on how to code a running ratio of counts (unique and total) for ordered observations (rows) in t-SQL (SQL Server 2017 Community Edition or higher)?

Given a data set, which is necessarily ordered (Ex. a time series indexed/ordered by date):

CREATE TABLE #x (dt DATE, Name VARCHAR(MAX));
INSERT INTO #x VALUES
('2012-01-01', 'a'),('2012-01-02', 'b'),('2012-01-03', 'c'),
('2012-01-04', 'a'),('2012-01-05', 'b'),('2012-01-06', 'd'),
('2012-01-07', 'a'),('2012-01-08', 'b'),('2012-01-09', 'e'),('2012-01-10', 'e');

So, given columns Dt and Names, I need the output Ratio below. For clarity, I also included output columns UCnt that counts unique names to date (i.e. a restricted running count) and TCnt counts all names to date (i.e. a simple running count). The output Ratio field is a just a running proportion of the two.

Dt          Name    UCnt    TCnt    Ratio
2012-01-01  a       1       1       1.000
2012-01-02  b       2       2       1.000
2012-01-03  c       3       3       1.000
2012-01-04  a       3       4       0.750
2012-01-05  b       3       5       0.600
2012-01-06  d       4       6       0.666
2012-01-07  a       4       7       0.571
2012-01-08  b       4       8       0.500
2012-01-09  e       5       9       0.555
2012-01-10  e       5       10      0.500

2 Answers2

4

A combination of OUTER APPLY and ROW_NUMBER seems to do the trick:

WITH CTE AS
(
    SELECT  *,
            Tot_Cnt = ROW_NUMBER() OVER(ORDER BY dt)
    FROM #x
)
SELECT  A.dt,
        A.[Name],
        B.Unq_Cnt,
        A.Tot_Cnt,
        Ratio_of_Cnts = CONVERT(NUMERIC(10,4),B.Unq_Cnt)/A.Tot_Cnt 
FROM CTE A
OUTER APPLY (SELECT Unq_Cnt = COUNT(DISTINCT [Name]) 
             FROM CTE
             WHERE dt <= A.dt) B
;

Here is a demo of this.

ypercubeᵀᴹ
  • 99,450
  • 13
  • 217
  • 306
Lamak
  • 2,576
  • 1
  • 24
  • 30
0

Thanks Lamak. The OUTER APPLY ran for 2+ hours (~1MM rows); I had to kill it. Here is my solution that takes ~30 sec on the same-size table.

WITH y AS (-- count dupped Names and each observation
    SELECT dt, Name,
    UCnt_=COUNT(*) OVER (PARTITION BY Name ORDER BY Dt), 
    TCnt=ROW_NUMBER() OVER (ORDER BY Dt)
    FROM #x
),
z AS (-- count unique Names only
    SELECT *,
    UCnt=SUM(CASE WHEN UCnt_>1 THEN 0 ELSE 1 END) OVER (ORDER BY Dt) 
    FROM y
)
SELECT Dt, Name, UCnt_, UCnt, TCnt, Ratio=UCnt/1.0/TCnt
FROM z
ORDER BY Dt

The output:

Dt          Name    UCnt_   UCnt    TCnt    Ratio
2012-01-01  a       1       1       1       1.000
2012-01-02  b       1       2       2       1.000
2012-01-03  c       1       3       3       1.000
2012-01-04  a       2       3       4       0.750
2012-01-05  b       2       3       5       0.600
2012-01-06  d       1       4       6       0.666
2012-01-07  a       3       4       7       0.571
2012-01-08  b       3       4       8       0.500
2012-01-09  e       1       5       9       0.555
2012-01-10  e       2       5       10      0.500