Sum by Implicit Group

Question

I'm sure I'm missing an obvious solution on this, but I'm trying to summarize values for groups that are defined by an explicit group number and an implicit ordering. I'm sure that doesn't make this much clearer, so let's say I have this example source heap table:

GroupID     Value
----------- -----------
1           5
1           5
1           3
2           4
2           1
1           4
2           3
2           5
2           2
1           1

I would like a query that provides me with the following results:

GroupID     Values
----------- -----------
1           13
2           5
1           4
2           10
1           1

The implied ordering is the challenge that I've just not found a way to get around... yet. Any help would be appreciated.

I was hoping I could create a deterministic row ordering with a query similar to the following:

SELECT    * 
        , ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum  
FROM Table WITH(TABLOCK)
OPTION (MAXDOP 1)

I am hoping this forces an allocation order scan which would then give me a deterministic row order.

Sadly, I'm stuck with the data as is. I've got no other indicators here such as date, etc. that would provide any set order. My hope was the trick outlined above would suffice, but I'm not entirely certain it will.

EDIT: Just to close this up as I'm know there were questions around why I was asking this, I had a series of heap tables, named by month/year, that contained line-item amounts that the business wanted to be summed up by day (which they correlated to the implied groups in my question). Because it doesn't look feasible to do this effectively, we've settled on aggregating at the month (e.g. table) level, so this post helped me justify the alteration to the business requirements. Thanks for everyone's input!

score 7 · Accepted Answer · edited Sep 30 '17 at 05:45

The "implicit" group you mention seems to be based on the row order. Unlike a spreadsheet or text file, a relational table is logically an unordered set of rows regardless of whether it is stored as a heap or has a clustered index. It will not be possible to write a query to provide the desired results unless you have another column to facilitate the grouping.

I was hoping I could create a deterministic row ordering with a query similar to the following:
SELECT      * 
        , ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum  
FROM Table WITH(TABLOCK)
OPTION (MAXDOP 1)

Sorry to be the bearer of bad news, but ordering by a literal will not guarantee deterministic ordering (and even if it did, you'd need an ORDER BY clause there). If it appears to do so, it is only by happenstance. Any incremental column will work, though.

I am hoping this forces an allocation order scan which would then give me a deterministic row order.

An allocation-ordered scan is no more deterministic than any other implementation; you're just (unsafely) relying on a different observed behaviour.

sepupic · Answer 2 · 2017-09-30T06:53:11.090

If you really want to use %%physloc%%, here is a solution:

declare @t table(GroupID int, Value int);
insert into @t values
(1,           5),
(1,           5),
(1,           3),
(2,           4),
(2,           1),
(1,           4),
(2,           3),
(2,           5),
(2,           2),
(1,           1);

create table #t(id int identity, GroupID int, Value int);

insert into #t (GroupID, Value)
select GroupID, Value
from @t
     cross apply (select sys.fn_PhysLocFormatter (%%physloc%%) as loc) a 
     cross apply (select charindex(':', loc) as first)a1
     cross apply (select charindex(':', loc, first + 1) as second)a2
     cross apply (select cast(SUBSTRING(loc, 2, first - 2) as int)as file_,
       cast(SUBSTRING(loc, first + 1, second - 4) as bigint)as page_,
       cast(SUBSTRING(loc, second + 1, len(loc) - second - 1)as int) as slot_)a4
order by a4.file_, a4.page_, a4.slot_;

with cte as
(
select id, GroupID, Value,
       row_number() over(order by id) - row_number() over(order by GroupID, id) as grp 
from #t
)

,cte1 as
(
select GroupID, grp, sum(Value) as Value, min(id) as id 
from cte
group by GroupID, grp
)

select GroupID, Value
from cte1
order by id;

%%physloc%% is a physical-record locator function, you can read about it here: SQL Server 2008: New (undocumented) physical row locator function

UPDATE:

As ypercubeᵀᴹ suggeasted, order by %%physloc%% is incorrect, we need to extract file, page an slot and order by them

and what happens when there are deletes in the table? We can't assume that the inserts after that will get higher physical RIDs as they may fill the gaps

This question should be addressed to OP, not to me. My solution is for static heap that was presented in Original Post.

If the author knows that this heap may change he should copy this heap not into temporary table with identity column but into permanent one.

Sum by Implicit Group

2 Answers2

Linked