Partitioning an existing table yearly in SQL Server with huge data

Question

I have a logs table for my application which is almost 3TB data. I want to partition this table yearly. Below is the scripts for Table, View, and indexes script:


GO
CREATE TABLE [dbo].[AuditLogs](
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [BankId] [nvarchar](2) NULL,
    [Message] [nvarchar](max) NULL,
    [Level] [nvarchar](100) NULL,
    [TimeStamp] [datetimeoffset](7) NOT NULL,
    [FunctionName] [nvarchar](400) NULL,
    [CorrelationId] [uniqueidentifier] NULL,
 CONSTRAINT [PK_AuditLogs] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
/****** Object:  View [dbo].[View_RequestTime]    Script Date: 22/10/2023 2:56:01 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE VIEW [dbo].[View_RequestTime]
AS
select [FunctionName],BankId,[TimeStamp],[ElapsedTime] from(
select
  ROW_NUMBER() over(partition by CorrelationId order by l.[TimeStamp])  AS Row#,
    l.[FunctionName],
    l.BankId,
    l.[TimeStamp],
    l.CorrelationId,
   DATEDIFF(millisecond,l.[TimeStamp],lead(l.[TimeStamp]) over(partition by CorrelationId order by l.[TimeStamp])) ElapsedTime
from AuditLogs l  where l.CorrelationId is not null
)lo  where Row#=1

GO
/****** Object:  View [dbo].[View_1]    Script Date: 22/10/2023 2:56:01 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
/****** Object:  Index [NonClusteredIndex-20220821-124100]    Script Date: 22/10/2023 2:56:01 PM ******/
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20220821-124100] ON [dbo].[AuditLogs]
(
[Level] ASC,
[TimeStamp] ASC,
[FunctionName] ASC,
[CorrelationId] ASC,
[BankId] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO
/****** Object:  Index [NonClusteredIndex-20231011-105720]    Script Date: 22/10/2023 2:56:01 PM ******/
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20231011-105720] ON [dbo].[AuditLogs]
(
    [CorrelationId] ASC
)
INCLUDE([BankId],[TimeStamp],[FunctionName]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

My concern is how to create a partition function that works on yearly data partitioning. Column for datetime is [TimeStamp] and I have Id as primary key(clustered index), so how I will create clustered index that works with partition? Any help and suggestion is highly appreciated, I have read this answer before posting this link.

Query Execution Plan: Here is the execution plan link

Below is the query that I ran on the view with very small data compared to actual data. It takes more than 2 minutes to execute.

SELECT [v].[BankId], [v].[FunctionName],
AVG(CAST([v].[ElapsedTime] AS float)) AS [TotalSucceedAvgTime], CAST(COUNT(CASE
    WHEN [v].[ElapsedTime] IS NOT NULL THEN 1
END) AS float) AS [TotalSucceedReq], 
CAST(COUNT(CASE WHEN [v].[ElapsedTime] IS NULL THEN 1
END) AS float) AS [TotalFailedReq], 
(CAST(COUNT(CASE
    WHEN [v].[ElapsedTime] IS NOT NULL THEN 1
END) AS float) / CAST(COUNT() AS float))  100.0E0 AS [SuccessRate], (CAST(COUNT(CASE
    WHEN [v].[ElapsedTime] IS NULL THEN 1
END) AS float) / CAST(COUNT() AS float))  100.0E0 AS [FailedRate]
FROM [View_RequestTime] AS [v]
WHERE (([v].[BankId] = 1)) AND ((CONVERT(date, [v].[TimeStamp]) >= '2022-05-05 00:00:00') AND (CONVERT(date, [v].[TimeStamp]) <= '2024-11-11 00:00:00'))
GROUP BY [v].[FunctionName], [v].[BankId]

Erik Reasonable Rates Darling · Answer 1 · 2024-12-09T21:37:16.750

the problem

The main issues in this plan are:

The index that gets used here is only suitable for seeking to where CorrelationId IS NOT NULL
It doesn't help with the Windowing Function partitioning and ordering requirements
The predicates you're applying to BankId and TimeStamp are not pushed past the Sequence Project operator

index issue

You can resolve the index issue by changing it to this one:

CREATE INDEX 
    [NonClusteredIndex-20231011-105720] 
ON dbo.AuditLogs
(
    CorrelationId,
    TimeStamp,
    BankId
)
INCLUDE
    (FunctionName) 
WITH 
(
    SORT_IN_TEMPDB = ON, 
    DROP_EXISTING = ON, 
    DATA_COMPRESSION = PAGE
);
GO

This will allow you to find not-NULL CorrelationIds, and fully support what the two Windowing Functions asking for

    (
        PARTITION BY
             CorrelationId
        ORDER BY
            l.TimeStamp
    )

That should help with the Sort Spill in the query plan:

query issues

It's possible that you've tried to narrow the searched records down to where lo.Row# = 1, and that it's not logically needed for result correctness. You can keep it in there to narrow down displayed results, but it's causing a late filter operator in the plan:

You go from 26,000,000 rows to 11,500,000 rows after the filters are applied. That's a lot of extra rows and work for all the other operators in the plan.

I'd replace the view with an inline table valued function that looks like this:

CREATE OR ALTER FUNCTION 
    dbo.View_RequestTime
(
    @__searchStats_To_Value_Date_3 datetime2(7),
    @__searchStats_From_Value_Date_2 datetime2(7),
    @__searchStats_BankId_1 nvarchar(4000)
)
RETURNS 
    table
WITH
    SCHEMABINDING
AS
RETURN
SELECT
    lo.FunctionName,
    lo.BankId,
    lo.TimeStamp,
    lo.ElapsedTime
FROM
(
    SELECT
        Row# = 
            ROW_NUMBER() OVER 
            (
                PARTITION BY
                     CorrelationId
                ORDER BY
                    l.TimeStamp
            ),
        l.FunctionName,
        l.BankId,
        l.TimeStamp,
        l.CorrelationId,
        ElapsedTime = 
            DATEDIFF
            (   
                MILLISECOND, 
                l.TimeStamp, 
                LEAD(l.TimeStamp) OVER 
                (
                    PARTITION BY
                         CorrelationId
                     ORDER BY
                         l.TimeStamp
                )
            )
    FROM AuditLogs AS l
    WHERE l.CorrelationId IS NOT NULL
    AND   l.BankId = @__searchStats_BankId_1
    AND   l.TimeStamp >= @__searchStats_From_Value_Date_2
    AND   l.TimeStamp <= @__searchStats_To_Value_Date_3
) AS lo
WHERE lo.Row# = 1;

Applying the predicate before the Windowing Functions will allow you to get down to a smaller search result to apply the Windowing Functions to. Of course, this may not be logically correct for your needs, but I don't know what the expectations are there.

If this arrangement is logically incorrect, move the parameter filtering to after the lo.Row# = 1, but this will likely add the late filter back into the plan and ruin everything.

That would turn your final query into:

SELECT
    v.BankId,
    v.FunctionName,
    TotalSucceedAvgTime = 
        AVG(CAST(v.ElapsedTime AS float)),
    TotalSucceedReq = 
        CAST
        (
            COUNT_BIG
            (
                CASE
                    WHEN v.ElapsedTime IS NOT NULL
                    THEN 1
                END
            ) AS float
        ),
    TotalFailedReq =
        CAST
        (
            COUNT_BIG
            (   
                CASE
                    WHEN v.ElapsedTime IS NULL
                    THEN 1
                END
            ) AS float
        ),
    SuccessRate =
        (
            CAST
            (
                COUNT_BIG
                (   
                    CASE
                        WHEN v.ElapsedTime IS NOT NULL
                        THEN 1
                    END
                ) AS float
            ) / CAST(COUNT_BIG(*) AS float)
        ) * 100.0E0,
    FailedRate =
        (
            CAST
            (
                COUNT_BIG
                (
                    CASE
                        WHEN v.ElapsedTime IS NULL
                        THEN 1
                    END
                ) AS float
            ) / CAST(COUNT_BIG(*) AS float)
        ) * 100.0E0
FROM dbo.View_RequestTime
(
    1, 
    '2022-05-05 00:00:00', 
    '2024-11-11 00:00:00'
)
GROUP BY
    v.FunctionName,
    v.BankId;

other considerations

There are other things worth trying here under the right circumstances.

Higher DOP: If you have more than 4 CPUs, see if OPTION(MAXDOP 8); at the end of your outer query improves things. Right now, you're running at DOP 4, and each thread is dealing with 6-7 million rows. The higher DOP would spread the row load out a bit more, which may improve things.

Batch Mode: J.D. was on the right track with his suggestion to use a nonclustered columnstore index:

Ahem for code formatting.

CREATE COLUMNSTORE INDEX 
    AuditLogs_CorrelationId_TimeStamp_BankId_FunctionName 
ON dbo.AuditLogs 
    (CorrelationId, [TimeStamp], BankId, FunctionName);

But this will only be useful if you're on an Enterprise Edition or equivalent SKU. You see, Microsoft hates you if you don't pay them enough money to be your friend, and they have terribly hobbled Batch Mode execution on lower (priced) SKUs.

Validate parameter data types

Your application has passed in these, which seem a bit suspect:

@__searchStats_To_Value_Date_3 datetime2(7),
@__searchStats_From_Value_Date_2 datetime2(7),
@__searchStats_BankId_1 nvarchar(4000)

Apart from strange behavior when comparing different temporal data types, the 4000 byte string to compare to BankId seems off. Performance problems may not be obvious without a seekable index for these, though.

score 0 · Answer 2 · answered Oct 24 '23 at 03:08

I haven't had a chance to dive in yet, but from a first look, just some quick thoughts:

Your index on ([Level] ASC, [TimeStamp] ASC, [FunctionName] ASC, [CorrelationId] ASC, [BankId] ASC) likely won't be used because Level isn't being used in your query (yet it's the leading column in your index), particularly not in any predicates.

Does the following columnstore index get used and improve the query at all?

CREATE NONCLUSTERED COLUMNSTORE INDEX IX_AuditLogs_CorrelationId_TimeStamp_BankId_FunctionName 
ON dbo.AuditLogs (CorrelationId, [TimeStamp], BankId, FunctionName)

score 0 · Answer 3 · answered Nov 06 '23 at 22:38

I work with similar data. I agree with the others - I'm not sure what your goal is, but just based on your specific ask, what you can do to partition that table is as follows. Given the research you've posted, I'll simply say first you need to have your partition function and partition scheme with FILEGROUPS and FILES ready. Something like these JUST AS EXAMPLES -

USE master;
GO
ALTER DATABASE <yourDB> ADD FILEGROUP [2018];
ALTER DATABASE <yourDB> ADD FILEGROUP [2019];
...
GO
ALTER DATABASE <yourDB>
ADD FILE
    (
        NAME = [2018]
      , FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\2018.ndf'
      , SIZE = 100GB
      , MAXSIZE = 1TB -- Or unlimited, I don't know how big your years are
      , FILEGROWTH = 50GB
    ) TO FILEGROUP [2018];
GO
ALTER DATABASE <yourDB>
ADD FILE
    (
        NAME = [2019]
      , FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\2019.ndf'
      , SIZE = 100GB
      , MAXSIZE = 1TB -- Or unlimited, I don't know how big your years are
      , FILEGROWTH = 50GB
    ) TO FILEGROUP [2019];
...
CREATE PARTITION FUNCTION pf_Year (DATETIMEOFFSET)
AS RANGE RIGHT FOR VALUES
(   N'2019-01-01T00:00:00.000'
  , N'2020-01-01T00:00:00.000'
  , N'2021-01-01T00:00:00.000'
  , N'2022-01-01T00:00:00.000'
  , N'2023-01-01T00:00:00.000'
  , N'2024-01-01T00:00:00.000'
);
GO
CREATE PARTITION SCHEME ps_Year
AS PARTITION pf_Year
TO
(
    [2018]
  , [2019]
  , [2020]
  , [2021]
  , [2022]
  , [2023]
  , [2024]
);
GO

Then to build your table, you need the PARTITION KEY to be apart of the PRIMARY KEY.

CREATE TABLE AuditLogs
(
    Id BIGINT IDENTITY(1, 1) NOT NULL
  , BankId NVARCHAR(2) NULL
  , Message NVARCHAR(MAX) NULL
  , Level NVARCHAR(100) NULL
  , TimeStamp DATETIME NOT NULL
  , FunctionName NVARCHAR(400) NULL
  , CorrelationId UNIQUEIDENTIFIER NULL
  , CONSTRAINT PK_AuditLogs
        PRIMARY KEY CLUSTERED (
                                  Id ASC
                                , TimeStamp ASC
                              )
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON
            , ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = ON
             ) ON ps_Year(TIMESTAMP)
) ON ps_year(TimeStamp);
GO

Or alter on the existing table

ALTER TABLE dbo.AuditLogs DROP CONSTRAINT PK_AuditLogs;
GO
ALTER TABLE dbo.AuditLogs
ADD CONSTRAINT PK_AuditLogs
    PRIMARY KEY CLUSTERED (
                              Id
                            , TimeStamp
                          )
    WITH (MAXDOP = 16, ONLINE = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = ON) ON ps_Year(timestamp);
GO

Then just do the same things with the indexes to align them to the partition scheme.

CREATE NONCLUSTERED INDEX [NonClusteredIndex-20220821-124100]
ON dbo.AuditLogs (
                     Level ASC
                   , TimeStamp ASC
                   , FunctionName ASC
                   , CorrelationId ASC
                   , BankId ASC
                 )
WITH (MAXDOP = 16, DROP_EXISTING = ON, ONLINE = ON)
ON ps_Year(TimeStamp);
GO
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20231011-105720]
ON dbo.AuditLogs (CorrelationId ASC)
INCLUDE (
            BankId
          , TimeStamp
          , FunctionName
        )
WITH (MAXDOP = 16, DROP_EXISTING = ON, ONLINE = ON)
ON ps_Year(TimeStamp);
GO

I hope that helps.

score 0 · Answer 4 · answered Nov 07 '23 at 18:19

Here are my thoughts about huge log type tables. I'd start with changing clustered to [TimeStamp] since vast majority of queries will use it (give me events happened yesterday or last month or between @dfrom and @dto) You can have [Id] as primary key but if this table is never referenced you can redefine your primary key as [TimeStamp],[Id]

How to actually convert your table? First of all confirm your table is 'insert only' (most of log tables do not allow any updates).

Create new table and populate it slowly from source table. On actual switch day you can add missing records and rename tables.

For older partition you can -

put it in read-only filegroup and skip backup/recovery.
you can apply compression
you can create it as columnstore but to do it you have to use partitioning views instead of actual partitions https://www.brentozar.com/archive/2016/09/partitioned-views-guide/

Partitioning an existing table yearly in SQL Server with huge data

4 Answers4

the problem

index issue

query issues

other considerations