SQL Server - Select most recent record from each group when performance is critical

Question

I run a SQL Server 2016 database where I have the following table with 100+ millions rows:

StationId | ParameterId |       DateTime       | Value
    1     |      2      | 2020-02-04 15:00:000 |  5.20
    1     |      2      | 2020-02-04 14:00:000 |  5.20
    1     |      2      | 2020-02-04 13:00:000 |  5.20
    1     |      3      | 2020-02-04 15:00:000 |  2.81
    1     |      3      | 2020-02-04 14:00:000 |  2.81
    1     |      4      | 2020-02-04 15:00:000 |  5.23
    2     |      2      | 2020-02-04 15:00:000 |  3.70
    2     |      4      | 2020-02-04 15:00:000 |  12.20
    3     |      2      | 2020-02-04 15:00:000 |  1.10

This table has a clustered index for StationId, ParameterId and DateTime, in this order, all ascending.

What I need is, for each unique pair StationId - ParameterId, return the most recent value from DateTime column:

StationId | ParameterId |       LastDate       
    1     |      2      | 2020-02-04 15:00:000 
    1     |      3      | 2020-02-04 15:00:000 
    1     |      4      | 2020-02-04 15:00:000 
    2     |      2      | 2020-02-04 15:00:000 
    2     |      4      | 2020-02-04 15:00:000 
    3     |      2      | 2020-02-04 15:00:000

What I'm doing now is the following query, which takes around 90 to 120 seconds to run:

    SELECT StationId, ParameterId, MAX(DateTime) AS LastDate
    FROM  MyTable WITH (NOLOCK)
    GROUP BY StationId, ParameterId

I've also seen many posts suggesting the following, which takes 10+ minutes to run:

    SELECT StationId, ParameterId, DateTime AS LastDate
    FROM
    (
       SELECT StationId, ParameterId, DateTime
       ,ROW_NUMBER() OVER (PARTITION BY StationId,ParameterIdORDER BY DateTime DESC) as row_num
       FROM  MyTable WITH (NOLOCK)
    )
    WHERE row_num = 1

Even in the best case (using GROUP BY clause and MAX aggregate funcition), the execution plan doesn't indicate an Index Seek:

I wonder if there's a better way to perform this query (or to build the index) in order to achieve better execution time.

David Browne - Microsoft · Accepted Answer · 2020-02-05T20:59:19.280

If you have a small-enough number of (StationID, ParameterID) pairs, then try a query like this:

select StationID, ParameterID, m.DateTime LastDate 
from StationParameter sp
cross apply 
  (
     select top 1 DateTime 
     from MyTable 
     where StationID = sp.StationID
      and ParameterID = sp.ParameterID
     order by DateTime desc
  ) m

To enable SQL Server to perform a lookup, seeking the latest DateTime for each (StationID,ParameterID) pair.

With only a Clustered Index on (StationID, ParameterID, DateTime), there's no way for SQL Server to discover the distinct (StationID, ParameterID) pairs without scanning the leaf level of the index, and it can find the largest DateTime while it's scanning.

Also at 100M+ rows, this table might be better as a Clustered Columnstore instead of a BTree Clustered Index.

score 0 · Answer 2 · answered Feb 05 '20 at 19:41

If the performance is really critical and you are about frequently ask your table for the most recent date... Why not create lookup table of station and parameter as key with most recent time stamp. You need to update this table each time you modify the big one, but this way you have your results when you need it I milliseconds.

score 0 · Answer 3 · answered Feb 06 '20 at 03:01

0

With the CTE row_number approach, try creating a nonclustered index on StationId, ParameterId, DateTime desc. I've found having an index with the proper sort order that my partition by order clause uses has improved performance for me.

answered Feb 06 '20 at 03:01

Connor

353
2
5

SQL Server - Select most recent record from each group when performance is critical

3 Answers3