I have two tables. Each holds some attributes for a business entity and the date range for which those attributes were valid. I want to combine these tables into one, matching rows on the common business key and splitting the time ranges.

The real-world example is two source temporal tables feeding a type-2 dimension table in the data warehouse.

The entity can be present in neither, one or both of the source systems at any point in time. Once an entity is recorded in a source system the intervals are well-behaved - no gaps, duplicates or other monkey business. Membership in the sources can end at different dates.

The business rules state we only want to return intervals where the entity is present in both sources simultaneously.

What query will give this result?

This illustrates the situation:

Month          J     F     M     A     M     J     J
Source A:  <--><----------><----------><---->
Source B:            <----><----><----------------><-->

Result: <----><----><----><---->

Sample Data

For simplicity I've used closed date intervals; likely any solution could be extended to half-open intervals with a little typing.

drop table if exists dbo.SourceA;
drop table if exists dbo.SourceB;
go

create table dbo.SourceA
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);

create table dbo.SourceB
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);
GO

insert dbo.SourceA(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990101', '19990113', 'black'),
    (1, '19990114', '19990313', 'red'),
    (1, '19990314', '19990513', 'blue'),
    (1, '19990514', '19990613', 'green'),
    (2, '20110714', '20110913', 'pink'),
    (2, '20110914', '20111113', 'white'),
    (2, '20111114', '20111213', 'gray');

insert dbo.SourceB(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990214', '19990313', 'left'),
    (1, '19990314', '19990413', 'right'),
    (1, '19990414', '19990713', 'centre'),
    (1, '19990714', '19990730', 'back'),
    (2, '20110814', '20110913', 'top'),
    (2, '20110914', '20111013', 'middle'),
    (2, '20111014', '20120113', 'bottom');

Desired output

BusinessKey StartDate   EndDate     a_Colour  b_Placement
----------- ----------  ----------  --------- -----------
1           1999-02-14  1999-03-13  red       left     
1           1999-03-14  1999-04-13  blue      right    
1           1999-04-14  1999-05-13  blue      centre   
1           1999-05-14  1999-06-13  green     centre   
2           2011-08-14  2011-09-13  pink      top      
2           2011-09-14  2011-10-13  white     middle   
2           2011-10-14  2011-11-13  white     bottom   
2           2011-11-14  2011-12-13  gray      bottom    
Michael Green

3 Answers

I may have misunderstood your question, but this query returns the desired output for your sample data:

select a.businesskey
     -- greatest(a.startdate, b.startdate)
     , case when a.startdate > b.startdate 
            then a.startdate 
            else b.startdate 
       end as startdate
     -- least(a.enddate, b.enddate)
     , case when a.enddate < b.enddate 
            then a.enddate 
            else b.enddate 
       end as enddate
     , a.attribute as a_color
     , b.attribute as b_placement
from dbo.SourceA a 
join dbo.SourceB b 
        on a.businesskey = b.businesskey
       and (a.startdate between b.startdate and b.enddate 
          or b.startdate between a.startdate and a.enddate)
order by 1,2

Since the intervals need to overlap, most of the work can be done by a join that uses the overlap test as its predicate. Then it's just a matter of taking the intersection of the two intervals.

LEAST and GREATEST seem to be missing as functions, so I used a case expression instead.
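
As a side note, the overlap test can also be written as two comparisons, and the GREATEST/LEAST emulation can be done with a small VALUES subquery. The following is only a sketch of that alternative, not the query above. It assumes every row has StartDate <= EndDate (closed intervals, as in the sample data), and the names Overlaps and v(d) are made up for illustration.

```sql
-- Sketch only: same join as above, with the overlap test written as
-- two comparisons. Assumes StartDate <= EndDate on every row.
with Overlaps as
(
    select a.BusinessKey
         -- later of the two start dates
         , (select max(v.d)
            from (values (a.StartDate), (b.StartDate)) as v(d)) as StartDate
         -- earlier of the two end dates
         , (select min(v.d)
            from (values (a.EndDate), (b.EndDate)) as v(d)) as EndDate
         , a.Attribute as a_Colour
         , b.Attribute as b_Placement
    from dbo.SourceA as a
    join dbo.SourceB as b
            on a.BusinessKey = b.BusinessKey
           and a.StartDate <= b.EndDate
           and b.StartDate <= a.EndDate
)
select BusinessKey, StartDate, EndDate, a_Colour, b_Placement
from Overlaps
order by BusinessKey, StartDate;
```

If the intervals were half-open (EndDate exclusive), the two `<=` comparisons would become `<`.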

Lennart - Slava Ukraini

This solution reduces the source intervals to just their start dates. Combining these two lists gives the set of output interval start dates. From these, the corresponding output end dates are calculated by a window function. As the final output interval must end when either of the two input sources ends, there is special processing to determine this value.

;with Dates as
(
    select BusinessKey, StartDate
    from dbo.SourceA

    union

    select BusinessKey, StartDate
    from dbo.SourceB

    union

    select x.BusinessKey, DATEADD(DAY, 1, MIN(x.EndDate))
    from
    (
        select BusinessKey, EndDate = MAX(EndDate)
        from dbo.SourceA
        group by BusinessKey

        union all

        select BusinessKey, EndDate = MAX(EndDate)
        from dbo.SourceB
        group by BusinessKey
    ) as x
    group by x.BusinessKey
),
Intervals as
(
    select
        dt.BusinessKey,
        dt.StartDate,
        EndDate = lead (DATEADD(DAY, -1, dt.StartDate), 1)
                    over (partition by dt.BusinessKey order by dt.StartDate)
    from Dates as dt
)
select
    i.BusinessKey,
    i.StartDate,
    i.EndDate,
    a_Colour = a.Attribute,
    b_Placement = b.Attribute
from Intervals as i
inner join dbo.SourceA as a
    on i.BusinessKey = a.BusinessKey
    and i.StartDate between a.StartDate and a.EndDate
inner join dbo.SourceB as b
    on i.BusinessKey = b.BusinessKey
    and i.StartDate between b.StartDate and b.EndDate
where i.EndDate is not NULL
order by i.BusinessKey, i.StartDate;

The "Dates" CTE uses UNION rather than UNION ALL to eliminate duplicates. If both sources change on the same date we want only one corresponding output row.

As we want to close the output when either source closes, the third query in "Dates" adds the earliest end date, i.e. the MIN of the MAX of the EndDates. As it is an EndDate masquerading as a StartDate, it must have another day added to it. Its purpose is to allow the window function to calculate the end of the preceding interval. It will be eliminated by the final predicate.

Using inner joins for the final query eliminates those source intervals for which there is no corresponding value in the other source.
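
To see what the window function has to work with, the two CTEs can be run without the final joins. This is a sketch for inspection only; the filter on BusinessKey = 1 is just there to keep the output short, and the expected rows in the comment are worked out by hand from the sample data in the question.

```sql
-- Inspection sketch: candidate intervals before the joins to the sources.
;with Dates as
(
    select BusinessKey, StartDate from dbo.SourceA
    union
    select BusinessKey, StartDate from dbo.SourceB
    union
    select x.BusinessKey, dateadd(day, 1, min(x.EndDate))
    from
    (
        select BusinessKey, EndDate = max(EndDate)
        from dbo.SourceA
        group by BusinessKey
        union all
        select BusinessKey, EndDate = max(EndDate)
        from dbo.SourceB
        group by BusinessKey
    ) as x
    group by x.BusinessKey
)
select
    dt.BusinessKey,
    dt.StartDate,
    EndDate = lead(dateadd(day, -1, dt.StartDate), 1)
                over (partition by dt.BusinessKey order by dt.StartDate)
from Dates as dt
where dt.BusinessKey = 1
order by dt.StartDate;

-- With the sample data this should list, for BusinessKey 1:
--   1999-01-01 .. 1999-01-13
--   1999-01-14 .. 1999-02-13
--   1999-02-14 .. 1999-03-13
--   1999-03-14 .. 1999-04-13
--   1999-04-14 .. 1999-05-13
--   1999-05-14 .. 1999-06-13
--   1999-06-14 .. 1999-07-13   (start is the "earliest end + 1" marker row)
--   1999-07-14 .. NULL
```

The first two rows have no matching row in SourceB and the last two have none in SourceA, so the inner joins drop them; the NULL end date is also caught by the final predicate. The four rows in between are the ones in the desired output.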

Michael Green

There are a lot of interesting solutions to this problem (stated in different terms) here and its preceding pages. There it is presented as matching supply and demand in an auction. The units supplied or demanded are directly analogous to the days in an interval from this question, so the solution translates. I've left it in the terms used on the linked site, though.

Sample data.

DROP TABLE IF EXISTS dbo.Auctions;

CREATE TABLE dbo.Auctions
(
    ID INT NOT NULL IDENTITY(1, 1)
        CONSTRAINT pk_Auctions PRIMARY KEY CLUSTERED,
    Code CHAR(1) NOT NULL
        CONSTRAINT ck_Auctions_Code CHECK (Code = 'D' OR Code = 'S'),
    Quantity DECIMAL(19, 6) NOT NULL
        CONSTRAINT ck_Auctions_Quantity CHECK (Quantity > 0)
);

SET NOCOUNT ON;

DELETE FROM dbo.Auctions;

SET IDENTITY_INSERT dbo.Auctions ON;

INSERT INTO dbo.Auctions(ID, Code, Quantity)
VALUES
    (1, 'D', 5.0),
    (2, 'D', 3.0),
    (3, 'D', 8.0),
    (5, 'D', 2.0),
    (6, 'D', 8.0),
    (7, 'D', 4.0),
    (8, 'D', 2.0),
    (1000, 'S', 8.0),
    (2000, 'S', 6.0),
    (3000, 'S', 2.0),
    (4000, 'S', 2.0),
    (5000, 'S', 4.0),
    (6000, 'S', 3.0),
    (7000, 'S', 2.0);

The solutions discussed there reduce the elapsed time on the article's 400k-row sample data from 11 seconds for a naive approach to 0.4 seconds. The fastest is by Paul White (of this parish), shown here.

DROP TABLE IF EXISTS #MyPairings;

CREATE TABLE #MyPairings
(
    DemandID integer NOT NULL,
    SupplyID integer NOT NULL,
    TradeQuantity decimal(19, 6) NOT NULL
);
GO

INSERT #MyPairings
    WITH (TABLOCK)
(
    DemandID,
    SupplyID,
    TradeQuantity
)
SELECT
    Q3.DemandID,
    Q3.SupplyID,
    Q3.TradeQuantity
FROM
(
    SELECT
        Q2.DemandID,
        Q2.SupplyID,
        TradeQuantity =
            -- Interval overlap
            CASE
                WHEN Q2.Code = 'S' THEN
                    CASE
                        WHEN Q2.CumDemand >= Q2.IntEnd THEN Q2.IntLength
                        WHEN Q2.CumDemand > Q2.IntStart THEN Q2.CumDemand - Q2.IntStart
                        ELSE 0.0
                    END
                WHEN Q2.Code = 'D' THEN
                    CASE
                        WHEN Q2.CumSupply >= Q2.IntEnd THEN Q2.IntLength
                        WHEN Q2.CumSupply > Q2.IntStart THEN Q2.CumSupply - Q2.IntStart
                        ELSE 0.0
                    END
            END
    FROM
    (
        SELECT
            Q1.Code,
            Q1.IntStart,
            Q1.IntEnd,
            Q1.IntLength,
            DemandID = MAX(IIF(Q1.Code = 'D', Q1.ID, 0)) OVER (
                ORDER BY Q1.IntStart, Q1.ID
                ROWS UNBOUNDED PRECEDING),
            SupplyID = MAX(IIF(Q1.Code = 'S', Q1.ID, 0)) OVER (
                ORDER BY Q1.IntStart, Q1.ID
                ROWS UNBOUNDED PRECEDING),
            CumSupply = SUM(IIF(Q1.Code = 'S', Q1.IntLength, 0)) OVER (
                ORDER BY Q1.IntStart, Q1.ID
                ROWS UNBOUNDED PRECEDING),
            CumDemand = SUM(IIF(Q1.Code = 'D', Q1.IntLength, 0)) OVER (
                ORDER BY Q1.IntStart, Q1.ID
                ROWS UNBOUNDED PRECEDING)
        FROM
        (
            -- Demand intervals
            SELECT
                A.ID,
                A.Code,
                IntStart = SUM(A.Quantity) OVER (
                    ORDER BY A.ID
                    ROWS UNBOUNDED PRECEDING) - A.Quantity,
                IntEnd = SUM(A.Quantity) OVER (
                    ORDER BY A.ID
                    ROWS UNBOUNDED PRECEDING),
                IntLength = A.Quantity
            FROM dbo.Auctions AS A
            WHERE
                A.Code = 'D'

            UNION ALL

            -- Supply intervals
            SELECT
                A.ID,
                A.Code,
                IntStart = SUM(A.Quantity) OVER (
                    ORDER BY A.ID
                    ROWS UNBOUNDED PRECEDING) - A.Quantity,
                IntEnd = SUM(A.Quantity) OVER (
                    ORDER BY A.ID
                    ROWS UNBOUNDED PRECEDING),
                IntLength = A.Quantity
            FROM dbo.Auctions AS A
            WHERE
                A.Code = 'S'
        ) AS Q1
    ) AS Q2
) AS Q3
WHERE Q3.TradeQuantity > 0;
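
The innermost subqueries above turn each row's Quantity into a start and end position on a running total, which is what makes the matching look like an interval-overlap problem. As an illustration only (this is my reading of that step, not part of the linked solution), it can be run on its own against the sample data. Using PARTITION BY A.Code here is my substitution for the two separate queries filtered on Code.

```sql
-- Sketch: running-total intervals per Code, for inspection only.
SELECT
    A.ID,
    A.Code,
    -- running total up to, but not including, this row
    IntStart = SUM(A.Quantity) OVER (
        PARTITION BY A.Code
        ORDER BY A.ID
        ROWS UNBOUNDED PRECEDING) - A.Quantity,
    -- running total including this row
    IntEnd = SUM(A.Quantity) OVER (
        PARTITION BY A.Code
        ORDER BY A.ID
        ROWS UNBOUNDED PRECEDING),
    IntLength = A.Quantity
FROM dbo.Auctions AS A
ORDER BY A.Code, A.ID;
```

With the sample rows, demand IDs 1, 2 and 3 (quantities 5, 3 and 8) would occupy 0 to 5, 5 to 8 and 8 to 16, while supply ID 1000 (quantity 8) would occupy 0 to 8.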

Michael Green