Using sp_getapplock to implement a queue. Is it correct? Is there a better way?

Question

I've been reading a series of posts by Paul White about SQL Server Isolation Levels and came across a phrase:

To emphasise the point, pseudo-constraints written in T-SQL have to perform correctly no matter what concurrent modifications might be occurring. An application developer might protect a sensitive operation like that with a lock statement. The closest thing T-SQL programmers have to that facility for at-risk stored procedure and trigger code is the comparatively rarely-used sp_getapplock system stored procedure. That is not to say it is the only, or even preferred option, just that it exists and can be the right choice in some circumstances.

I am using sp_getapplock and this made me wonder if I'm using it correctly, or there is a better way to get the desired effect.

I have a C++ application that processes so called "Building servers" in a loop 24/7. There is a table with the list of these Building Servers (about 200 rows). New rows can be added at any time, but it doesn't happen often. Rows are never deleted, but they can be marked as inactive. Processing a server may take from few seconds to dozens of minutes, each server is different, some are "small", some are "large". Once a server is processed, application has to wait at least 20 minutes before processing it again (the servers should not be polled too often). Application starts 10 threads that perform the processing in parallel, but I must guarantee that no two threads attempt to process the same server at the same time. Two different servers can and should be processed simultaneously, but each server can be processed not more often than once in 20 minutes.

Here is the definition of a table:

CREATE TABLE [dbo].[PortalBuildingServers](
    [InternalIP] [varchar](64) NOT NULL,
    [LastCheckStarted] [datetime] NOT NULL,
    [LastCheckCompleted] [datetime] NOT NULL,
    [IsActiveAndNotDisabled] [bit] NOT NULL,
    [MaxBSMonitoringEventLogItemID] [bigint] NOT NULL,
CONSTRAINT [PK_PortalBuildingServers] PRIMARY KEY CLUSTERED 
(
    [InternalIP] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_LastCheckCompleted] ON [dbo].[PortalBuildingServers]
(
    [LastCheckCompleted] ASC
)
INCLUDE 
(
    [LastCheckStarted],
    [IsActiveAndNotDisabled],
    [MaxBSMonitoringEventLogItemID]
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

The main loop of a worker thread in an application looks like this:

for(;;)
{
    // Choose building server for checking
    std::vector<SBuildingServer> vecBS = GetNextBSToCheck();
    if (vecBS.size() == 1)
    {
        // do the check and don't go to sleep afterwards
        SBuildingServer & bs = vecBS[0];
        DoCheck(bs);
        SetCheckComplete(bs);
    }
    else
    {
        // Sleep for a while
        ...
    }
}

Two functions here GetNextBSToCheck and SetCheckComplete are calling corresponding stored procedures.

GetNextBSToCheck returns 0 or 1 row with details of the server that should be processed next. It is a server that hasn't been processed for a longest time. If this "oldest" server has been processed less than 20 minutes ago, no rows are returned and thread will wait for a minute.

SetCheckComplete sets the time when processing was completed, thus making it possible to choose this server for processing again after 20 minutes.

Finally, the code of stored procedures:

GetNextToCheck:

CREATE PROCEDURE [dbo].[GetNextToCheck]
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY
        DECLARE @VarInternalIP varchar(64) = NULL;
        DECLARE @VarMaxBSMonitoringEventLogItemID bigint = NULL;

        DECLARE @VarLockResult int;
        EXEC @VarLockResult = sp_getapplock
            @Resource = 'PortalBSChecking_app_lock',
            @LockMode = 'Exclusive',
            @LockOwner = 'Transaction',
            @LockTimeout = 60000,
            @DbPrincipal = 'public';

        IF @VarLockResult >= 0
        BEGIN
            -- Acquired the lock
            -- Find BS that wasn't checked for the longest period
            SELECT TOP 1
                @VarInternalIP = InternalIP
                ,@VarMaxBSMonitoringEventLogItemID = MaxBSMonitoringEventLogItemID
            FROM
                dbo.PortalBuildingServers
            WHERE
                LastCheckStarted <= LastCheckCompleted
                -- this BS is not being checked right now
                AND LastCheckCompleted < DATEADD(minute, -20, GETDATE())
                -- last check was done more than 20 minutes ago
                AND IsActiveAndNotDisabled = 1
            ORDER BY LastCheckCompleted
            ;

            -- Start checking the found BS
            UPDATE dbo.PortalBuildingServers
            SET LastCheckStarted = GETDATE()
            WHERE InternalIP = @VarInternalIP;
            -- There is no need to explicitly verify if we found anything.
            -- If @VarInternalIP is null, no rows will be updated
        END;

        -- Return found BS, 
        -- or no rows if nothing was found, or failed to acquire the lock
        SELECT
            @VarInternalIP AS InternalIP
            ,@VarMaxBSMonitoringEventLogItemID AS MaxBSMonitoringEventLogItemID
        WHERE
            @VarInternalIP IS NOT NULL
            AND @VarMaxBSMonitoringEventLogItemID IS NOT NULL
        ;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
    END CATCH;

END

SetCheckComplete:

CREATE PROCEDURE [dbo].[SetCheckComplete]
    @ParamInternalIP varchar(64)
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY

        DECLARE @VarLockResult int;
        EXEC @VarLockResult = sp_getapplock
            @Resource = 'PortalBSChecking_app_lock',
            @LockMode = 'Exclusive',
            @LockOwner = 'Transaction',
            @LockTimeout = 60000,
            @DbPrincipal = 'public';

        IF @VarLockResult >= 0
        BEGIN
            -- Acquired the lock
            -- Completed checking the given BS
            UPDATE dbo.PortalBuildingServers
            SET LastCheckCompleted = GETDATE()
            WHERE InternalIP = @ParamInternalIP;
        END;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
    END CATCH;

END

As you can see, I use sp_getapplock to guarantee that only one instance of both of these stored procedures are running at any given time. I think I need to use sp_getapplock in both procedures, because the query that chooses the "oldest" server uses the LastCheckCompleted time, which is updated by SetCheckComplete.

I think that this code does guarantee that no two threads attempt to process the same server at the same time, but I'd be grateful if you could point to any issues with this code and the overall approach. So, the first question: Is this approach correct?

Also, I'd like to know if the same effect could be achieved without using sp_getapplock. The second question: Is there a better way?

score 5 · Accepted Answer · edited Apr 13 '17 at 12:42

Is this approach correct?

Yes. It meets all the objectives stated in the question.

A comment in the procedures to explain the strategy and note the related procedure name might be helpful for future maintenance by others.

Is there a better way?

Not to my mind, no.

Taking a single lock is an extremely fast operation, and results in very clear logic. It is not clear to me that taking the lock in the second procedure is redundant, but even if it is, what do you really gain by omitting it? The simplicity and safety of your implementation appeal to me.

The alternatives are much more complex, and may leave you wondering if you've truly covered all cases, or if there might be a change in internal engine details in the future that would break (perhaps subtle and unstated) assumptions.

If you ever need a more traditional queueing implementation, the following reference is very useful:

Using Tables as Queues by Remus Rusanu

score 1 · Answer 2 · edited Apr 13 '17 at 12:42

This scenario seems very similar to the following Question:

Strategies for "checking out" records for processing

In my answer there, I advocated a model similar to what you have here, but with the notion of including sp_applock as a fail-safe only if the initial concept wasn't bullet-proof.

The main difference in the "check out" process was that I combined the SELECT and UPDATE queries using a CTE and the OUTPUT clause. This, with the appropriate query hints of (READPAST, ROWLOCK, UPDLOCK) on the SELECT, allows for updating the field used to determine if a row is eligible for processing while at the same time returning that value so that it can be returned to the calling process. With these two steps combined, it should be ok to do without the app lock. And getting rid of the app lock should, in turn, allow for greater throughput since any individual thread doing the "check out" process in the current model (as posted in the Question) causes the remaining 9 threads to wait, even though they could be grabbing the next one(s) in line at pretty much the same time.

Regarding the following statement towards the end of the Question:

I use sp_getapplock to guarantee that only one instance of both of these stored procedures are running at any given time. I think I need to use sp_getapplock in both procedures, because the query that chooses the "oldest" server uses the LastCheckCompleted time, which is updated by SetCheckComplete.

I would say that whether you keep the current approach or switch to the "combined SELECT + UPDATE via CTE + OUTPUT clause" (tm) approach, the use of sp_getapplock in the SetCheckComplete stored procedure is logically unnecessary. The reason it is not needed is:

only one thread can have a record checked out at any given time
using the app lock in SetCheckComplete implies that the value of LastCheckCompleted can have a determining effect on the GetNextToCheck, yet:
- until a record is checked-in, the LastCheckStarted field will be > the LastCheckCompleted, and this state causes the record to be filtered out of the "GetNext" query due to LastCheckStarted <= LastCheckCompleted condition
- upon being checked in, the LastCheckStarted <= LastCheckCompleted will no longer filter the record out, but the LastCheckCompleted < DATEADD(minute, -20, GETDATE()) condition will filter it out since by definition it was completed mere milliseconds prior to this query running.

So, the SetCheckComplete is really completely independent of the GetNextToCheck process. It is only the GetNextToCheck process that needs any amount of safe-guards added to it.

Removing the app lock from SetCheckComplete should not only be completely safe, but it would also increase throughput as it would be less contention on that arbitrary lock @Resource (again, whether or not you keep the current model or switch to what I suggested).

UPDATE

Question from comment on this answer:

The GetNext query has two conditions in WHERE. Is it possible that server checks one condition first: LastCheckCompleted < DATEADD(minute, -20, GETDATE()) - this condition is true for the item that is currently checked out. Then SetCheckComplete changes the value of LastCheckCompleted. Then server checks the second condition LastCheckStarted <= LastCheckCompleted and it also appears to be true. End result: we checked out a row that we just checked in without waiting for 20 minutes.

My understanding is that within a single object (heap or index) this would not be possible, but between multiple objects this is not impossible. And looking at the schema, you do have an index on LastCheckCompleted. So two things:

I think this scenario is highly unlikely given that it requires LastCheckCompleted to be updated after the first condition has been verified, but I would think that the table (Clustered Index) would get updated first, before the NonClustered Index, but yet for this scenario to come true it would have to get the value of LastCheckCompleted from the NonClustered Index, right?
Either way, an easy fix to disallow this potential (outside of using a higher level transaction isolation level), would be to make a single status field. Complicating matters, at least a little, is that you need an extra piece of information (i.e. eligibility is made up of "not checked out" and "checked-in over 19 minutes ago"). Maybe you can keep the current two time fields as informational, and add a new DATETIME field that is either NULL for "checked-out" or the checked-in time. Then just check for new_field <= 20 minutes ago, and the NULL (i.e. checked-out) rows won't match anyway (unless someone was silly enough to turn ANSI_NULLS OFF ;-).

Using sp_getapplock to implement a queue. Is it correct? Is there a better way?

2 Answers2

Linked

Related