1

(Using PostgreSQL) I'm looking to select the latest set of continuous (by date) rows that have a code 'A'. The latest row is always missing the "To Date" as it is assumed to be current. And there are never any gaps between the "To Date" of a row and the "From Date" of the row following it.

Example Tables

Here I would like to select rows 2 and 3:

Row Number Code From Date To Date
1 B 2021-05-01
2 A 2020-02-01 2021-04-30
3 A 2019-02-01 2020-01-31

Here I would like to select rows 1 and 2:

Row Number Code From Date To Date
1 A 2021-05-01
2 A 2020-02-01 2021-04-30
3 B 2019-02-01 2020-01-31

Here I would like to select rows 1 and 2:

Row Number Code From Date To Date
1 A 2021-09-07
2 A 2021-04-01 2021-09-06
3 B 2021-03-13 2021-03-31
4 A 2021-01-13 2021-03-12
5 A 2021-01-01 2021-01-12

Any assistance is greatly appreciated!

ahbarnum
  • 13
  • 4

2 Answers2

1

This is a classic situation.

References

Query

Like in the referenced questions, I'm going to use the Tabibitosan method to separate our groups, then just select the first one.

WITH start AS (
SELECT rn, 
    code, 
    from_date, 
    to_date, 
    lag(from_date) OVER (ORDER BY rn) AS next_from 
FROM t3 
WHERE code='A'
ORDER BY rn
),
group_identify AS (
SELECT *,
    CASE WHEN rn=1 OR next_from - to_date = 1
    THEN 1
    ELSE 0
    END AS group_cnt
FROM    start
), 
group_enumerate AS (
SELECT *,
    rn-sum(group_cnt) OVER (ORDER BY rn) AS grpn
FROM group_identify
)
SELECT rn, code, from_date, to_date
FROM group_enumerate
WHERE grpn=(select min(grpn) from group_enumerate)
ORDER BY rn

Explanation

  • The first table, start, pulls in the From Date from the previous line using lag(). Change t3 to the name of your table and code to the code of interest here.
  • The next table, group_identify, computes whether the current row is a candidate for a sequence, setting the field group_cnt to 1 if so, and 0 if not.
  • The third table, group_enumerate, assigns a unique group number to each sequence by subtracting the count of groups from the row number rn. This is the Tabibitosan method.
  • Finally, the outer query selects the lowest-numbered group and outputs the requested columns.

Assumptions

  • Rows are already numbered in the table. If this was not true then a window function like row_number() OVER (ORDER BY from_date DESC) AS rn could be used to generate it.
  • Date ranges do not overlap.

Caution

I did not do extensive testing to ensure all of the ordering works properly if the table is not organized as in the examples.

0

To provide an answer to this problem, I did the following (all of the code below is on the fiddle here):

CREATE TABLE test
(
  row_number INTEGER NOT NULL,
  code       TEXT    NOT NULL,
  from_date  DATE    NOT NULL,
  to_date    DATE    NULL
);

Populated with:

INSERT INTO test VALUES
(1, 'A', '2021-09-07', NULL),   
(2, 'A', '2021-04-01', '2021-09-06'),
(3, 'B', '2021-03-13', '2021-03-31'),
(4, 'A', '2021-01-13', '2021-03-12'),
(5, 'A', '2021-01-01', '2021-01-12');

Note the use of out of sequence row_numbers - this can very easily happen in a frequently updated database - the resolution should only depend on those parts of the data that are independent of implementation details - i.e. the from_date in this case.

Step 1:

Mark when a code change takes place - in order of from_date DESC.

SELECT 
  ROW_NUMBER() OVER (ORDER BY from_date DESC) AS rn,
  code, from_date, to_date,

CASE WHEN LAG(code) OVER (ORDER BY from_date DESC) = code THEN 0 ELSE 1 END AS code_change

FROM test;

Result:

rn  code    from_date   to_date     code_change
1   A      2021-09-07   NULL             1
2   A      2021-04-01   2021-09-06       0
3   B      2021-03-13   2021-03-31       1
4   A      2021-01-13   2021-03-12       1
5   A      2021-01-01   2021-01-12       0

Second step:

Perform a cumulative sum over the code_change generated variable

SELECT
  *, SUM(code_change) OVER (ORDER BY from_date DESC)
FROM
(
  SELECT 
    ROW_NUMBER() OVER (ORDER BY from_date DESC) AS rn,
    code, from_date, to_date,
CASE 
  WHEN LAG(code) OVER (ORDER BY from_date DESC) = code THEN 0
  ELSE 1
END AS code_change

FROM test ) AS tab_01 ORDER BY from_date DESC;

Result:

rn  code    from_date   to_date     code_change     sum
 1     A    2021-09-07  NULL            1            1
 2     A    2021-04-01  2021-09-06      0            1
 3     B    2021-03-13  2021-03-31      1            2
 4     A    2021-01-13  2021-03-12      1            3
 5     A    2021-01-01  2021-01-12      0            3

So, we now have a means of grouping the codes by sequence of the same code!

Step 3:

We now obtain the MIN(sum_cc) to obtain the group with code 'A'. This gives us the group with the most recent date(s) - because of the ORDER BY from_date DESC!

SELECT 
  code, MIN(sum_cc) 
FROM
(
SELECT
  *, SUM(code_change) OVER (ORDER BY from_date DESC) AS sum_cc
FROM
(
  SELECT 
    ROW_NUMBER() OVER (ORDER BY from_date DESC) AS rn,
    code, from_date, to_date,
...
... SQL snipped for brevity
...
) AS tab_02
WHERE code = 'A'
GROUP BY code;

Result:

code    min
   A      1

So, we no know that 'A' with grouping = 1 that is the group of interest.

VERY IMPORTANT - this will work even if the row_number varialbe is NOT in sync with the date values - this can easily happen in a multi-user system - see this fiddle here.

Step 4:

We now JOIN the result of step 4 back to the result from step 3 and obtain our final desired result as follows:

SELECT tab_03.*, tab_02.*
FROM
(
  SELECT
    *, SUM(tab_01.code_change) OVER (ORDER BY tab_01.from_date DESC) AS xx 
  FROM
  (
    SELECT 
      ROW_NUMBER() OVER (ORDER BY from_date DESC) AS rn,
      code, from_date, to_date,
  CASE 
    WHEN LAG(code) OVER (ORDER BY from_date DESC) = code THEN 0
    ELSE 1
  END AS code_change

FROM test

) AS tab_01 ) AS tab_03 JOIN ( SELECT code, MIN(sum_cc) AS yy FROM ( SELECT *, SUM(code_change) OVER (ORDER BY from_date DESC) AS sum_cc FROM ( SELECT ROW_NUMBER() OVER (ORDER BY from_date DESC) AS rn, code, from_date, to_date,

    CASE 
      WHEN LAG(code) OVER (ORDER BY from_date DESC) = code THEN 0
      ELSE 1
    END AS code_change

  FROM test
) AS tab_01
ORDER BY from_date DESC

) AS tab_02 WHERE code = 'A' GROUP BY code ) AS tab_02 ON tab_03.xx = tab_02.yy;

Result:

row_number  code    from_date   to_date
         1     A    2021-09-07  NULL    
         2     A    2021-04-01  2021-09-06

which is correct for the data given above. This solution also works in situations where the row_number is not in sync with the dates (see here)!

Vérace
  • 30,923
  • 9
  • 73
  • 85