Diagnosing a block in a process that EXECs several layers of stored procedures

Question

Disclaimer: I am a developer. Please be nice to me. I am not the developer that is responsible for what follows. I try to be one of the good ones.

I have inherited a support ticket that involves a specific process causing a block on the client's SQL Server 2008 R2 installation. I can trigger the block at any time on the client's server, but cannot reproduce it anywhere else. We even went so far as to create a virtual server with exactly the same hardware stats as the client's server, restored the exact same database to the exact same SQL Server setup, but no dice--can't replicate it. The process itself is ugly: a stored procedure is called, which then executes several other stored procedures, all using named transactions, some nested in cursors. The process follows this pseudocode:

sp_Outermost (named transaction Trans_Outermost)
    sp_Nested1 (Trans_Nested1)
    sp_Nested2 (Trans_Nested2)
    sp_Nested3 (Trans_Nested3)
        sp_Nested3_1 (Trans_Nested3_1)
            sp_Nested3_1_1 (Trans_Nested3_1_1)
                sp_Nested3_1_1_1 (Trans_Nested3_1_1_1)
                    sp_Nested3_1_1_1_1 (Trans_Nested3_1_1_1_1)

Sorry...not sure how else to describe it.

There is no TRY-CATCH logic in any of the stored procedures, though there is some "custom" error handling that involves GOTOs and setting an "error number" (more on that later).

When I look at the process in Activity Monitor, the Task State is RUNNING, Command is SELECT, and the Wait Type is ASYNC_NETWORK_IO.

If I run DBCC OPENTRAN or look at sys.dm_tran_session_transactions and sys.dm_tran_active_transactions, it lists the outermost transaction (Trans_Outermost) as being the open transaction. However, when I run a query against sys.dm_exec_connections and sys.dm_exec_sessions, I am informed that the query being executed is actually sp_Nested3_1_1_1. This is always the case. Further, running a query ganked from this answer, I see that the statement being waited on is always this:

SET @ErrorNum = 85656

This @ErrorNum variable is declared and used in every one of these stored procedures. It strikes me as very, very odd that a simple SET statement could cause so much trouble, but I don't believe in coincidence.

I commented out every usage of @ErrorNum in sp_Nested3_1_1_1 to see if that made a difference and, well, it kind of did. There's a stored proc that writes to an audit log table that is called from every other stored procedure. Now the statement being waited on comes from that procedure, but it still involves @ErrorNum.

SET @ErrorNum = 85026

So, my question is, how can I figure out what the root cause of this blocking is? If a local variable with the same name is declared and used in nested locations on a server with inadequate hardware, could that cause a problem? Where else can I look here?

I just figured out that these procedures use RAISERROR WITH SETERROR with a set of custom message IDs, all greater than 85000. Not sure if this matters, but it's where I'm googling now.

I commented out some code in sp_Nested3_1_1_1 and sp_Nested3_1_1_1_1, specifically surrounding this @ErrorNum business, and now it's telling me that what appears to be a perfectly legitimate line of code in sp_Nested3_1_1 is the problem.

SELECT CASE WHEN @Attached_ID = -1 THEN     
    SCOPE_IDENTITY()
ELSE     
    @Attached_ID    
END AS Attached_ID

This seems completely arbitrary to me and makes me wonder if it has to do with the poor hardware, and the fact that they're running two other enterprise DBs on it in addition to ours.

I'm using the following queries as well as Activity Monitor to identify when/where locks occur:

SELECT  t.text,
        QUOTENAME(OBJECT_SCHEMA_NAME(t.objectid, t.dbid)) + '.'
        + QUOTENAME(OBJECT_NAME(t.objectid, t.dbid)) proc_name,
        c.connect_time,
        s.last_request_start_time,
        s.last_request_end_time,
        s.status
FROM    sys.dm_exec_connections c
JOIN    sys.dm_exec_sessions s
        ON c.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) t
WHERE   c.session_id = 61;--your blocking spid

As well as the first query from this answer.

Cody Konior · Accepted Answer · 2016-06-01T02:57:42.023

Q. So, my question is, how can I figure out what the root cause of this blocking is?

You may find it's one of the old standards:

A concurrency issue with the code.
A maintenance job doing a rebuild.
A user connecting directly to the database for reporting (or Excel!) causing blocking.

But it's going to be difficult, if not impossible, to track this down in Activity Monitor alone. If you can make some basic modifications on the client server, I think you should focus on capturing what is blocking there.

Put Adam Machanic's sp_WhoIsActive on the system. If you don't have a database just for holding tools, best put this in master so you can run it anywhere.
It can identify blocking chains and locks and output to a table. But first you need to work out the table definition and create the table (in this case we'll put it in master too but a special tools database would be best practice). Here's a snippet adapted from Kendra Little.
    DECLARE @destination_table VARCHAR(4000);
    SET @destination_table = 'BLOCKED_PROCESS_REPORT';
    DECLARE @schema VARCHAR(4000);
    EXEC sp_WhoIsActive
        @get_transaction_info = 1,
        @get_plans = 1,
        @find_block_leaders = 1,
        @RETURN_SCHEMA = 1,
        @SCHEMA = @schema OUTPUT;
    SET @schema = REPLACE(@schema, '<table_name>', @destination_table);
    PRINT @schema;
    EXEC(@schema);
Run EXEC sp_configure 'blocked process threshold (s)' to see the current value (it is an advanced option, so 'show advanced options' has to be enabled first). If it's set to 0 then set it to something like 15 (seconds) and run RECONFIGURE. This is usually safe, but the standard buyer beware warnings apply. Don't set it much lower than this, and if you set it to 30s you might miss things because 30s is the ADO.NET default command timeout.
Create an Agent job which runs sp_WhoIsActive @get_transaction_info = 1, @get_plans = 1, @find_block_leaders = 1, @destination_table = 'BLOCKED_PROCESS_REPORT'
Create an Agent alert of type WMI event alert, with a Query of SELECT * FROM BLOCKED_PROCESS_REPORT (here BLOCKED_PROCESS_REPORT is the WMI event class, not the table created above), and a Response that executes the above job.
Test it in two sessions (BEGIN TRAN, INSERT into table, then DELETE from table in the other session, and wait to see that your BLOCKED_PROCESS_REPORT table begins getting populated with data about 30s later).
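
The threshold and test steps above might look roughly like this. It's only a sketch: dbo.BlockTest is a hypothetical scratch table I made up for the test, and it assumes you are allowed to change server configuration on that instance.

```sql
-- 'blocked process threshold (s)' is an advanced option.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold (s)';      -- inspect the current value
EXEC sp_configure 'blocked process threshold (s)', 15;  -- seconds
RECONFIGURE;

-- Session 1: take a lock and hold it.
CREATE TABLE dbo.BlockTest (id INT NOT NULL PRIMARY KEY);  -- hypothetical scratch table
BEGIN TRAN;
INSERT INTO dbo.BlockTest (id) VALUES (1);
-- leave the transaction open for now

-- Session 2 (a separate connection): this waits on session 1's lock.
DELETE FROM dbo.BlockTest WHERE id = 1;

-- Session 1 again, once rows have appeared in BLOCKED_PROCESS_REPORT:
ROLLBACK TRAN;
DROP TABLE dbo.BlockTest;
```

Run the two sessions in separate query windows; the DELETE in session 2 should sit there blocked until session 1 rolls back.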

Now you sit back and wait. Once the problem happens again you'll have a bunch of detailed information in BLOCKED_PROCESS_REPORT about what's blocking what, in what order, and what locks are taken, and can go from there.
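
For reading the captured data, something like the following is a reasonable starting point. The column names are the ones I'd expect sp_WhoIsActive to produce with the parameters used above, and the master.dbo location is an assumption about where you created the table, so check both against the PRINT @schema output from earlier.

```sql
-- Most recent captures first. Block leaders are the rows with a NULL
-- blocking_session_id and a non-zero blocked_session_count.
SELECT  collection_time,
        session_id,
        blocking_session_id,
        blocked_session_count,
        wait_info,
        tran_start_time,
        sql_text
FROM    master.dbo.BLOCKED_PROCESS_REPORT
ORDER BY collection_time DESC,
        blocked_session_count DESC;
```

The session at the head of the chain is usually the interesting one: look at its sql_text and wait_info rather than at the sessions queued up behind it.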

Remember to clean up these things once you're done.
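
A cleanup sketch, assuming the threshold was 0 before you started and the table lives in master. The alert and job names are hypothetical placeholders, so substitute whatever you actually called them.

```sql
-- Turn the blocked process report back off.
EXEC sp_configure 'blocked process threshold (s)', 0;
RECONFIGURE;

-- Remove the Agent alert first, then the job it responds with.
EXEC msdb.dbo.sp_delete_alert @name = N'Blocked process alert';  -- hypothetical alert name
EXEC msdb.dbo.sp_delete_job @job_name = N'Capture blocking';     -- hypothetical job name

-- Drop the capture table once you have saved anything you need from it.
DROP TABLE master.dbo.BLOCKED_PROCESS_REPORT;
```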

score 3 · Answer 2 · edited Mar 02 '21 at 17:56

As described in this article http://www.sqlshack.com/reducing-sql-server-async_network_io-wait-type/, the following should be checked when investigating excessive ASYNC_NETWORK_IO wait type values:

Check whether the application is requesting large data sets from the SQL Server instance and then filtering that data on the client side. Pay attention to third-party applications such as Microsoft Access or ORM (object-relational mapping) software, which may be requesting large data sets that they then filter on the client side. Using the "read immediately and process afterwards" programming method may often save users from excessive ASYNC_NETWORK_IO wait type values.

Make sure that appropriate views are created for the client application, as this can ensure that data filtering is done by the SQL Server instance, so that a significantly smaller amount of data is sent to the client application

Make sure that the application is committing its open transactions and that it is committing them in a timely manner

Check whether there is a way to reduce the requested data set by performing the data filtering directly on the SQL Server

In the case of individual or ad hoc queries, make sure that a WHERE clause is added wherever possible and that the query is properly optimized to restrict the requested data set to only the required data

Check whether it is possible to use “TOP n” in the query to reduce the number of rows returned

Scalar-Valued User Defined Functions (UDF) are often the cause of the high ASYNC_NETWORK_IO wait type due to RBAR, so look for any instances of these objects that may be affecting performance

Using a Computed Column Defined with a User Defined Function (UDF) with a large database is another frequent reason for the high ASYNC_NETWORK_IO wait type due to RBAR

On SQL Server 2016, it is possible to use natively compiled UDFs, which can significantly lower RBAR in most cases and improve the execution speed by up to 100%. This can be particularly useful in situations where refactoring the UDF into a Table-Valued Function is not an option
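
To see which sessions are sitting on this wait and which client they belong to, a query along these lines may help. It's a minimal sketch using only the standard DMVs, not something taken from the article.

```sql
-- Requests currently waiting for the client to consume results,
-- with the client host/program and any transactions they hold open.
SELECT  r.session_id,
        r.status,
        r.command,
        r.wait_type,
        r.wait_time AS wait_time_ms,
        r.open_transaction_count,
        s.host_name,
        s.program_name,
        s.login_name
FROM    sys.dm_exec_requests r
JOIN    sys.dm_exec_sessions s
        ON r.session_id = s.session_id
WHERE   r.wait_type = N'ASYNC_NETWORK_IO'
ORDER BY r.wait_time DESC;
```

A row with a long wait_time_ms and a non-zero open_transaction_count is the combination to look for, since that session can be holding locks while it waits for the client to read its results.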

Hope this info is helpful
