
I've been arguing with a DBA and a couple of hardware guys about performance issues on our SQL Server. Normally everything is fine, but over the past few weeks we have been having huge lag spikes in SQL Server. It's clear that SQL Server is waiting on disk I/O, but I keep getting told that this is because SQL Server is asking for abnormally high I/O, which isn't the case. I can see from what is running that there is nothing out of the ordinary, and all the DBA cares to look at is what is causing the blocking and so on, which is useless.

For instance, the major thing we see backing up is operations on the ASPState database, which we use to manage ASP session state on the web servers. These operations normally never show up in sp_who2 active results because they complete so quickly. The database is in simple recovery mode and logging is minimal. However, during these lag spikes we can see a lot of select and update operations on that database being blocked or waiting. I'm sure what is going on is that someone or some job is running something that causes heavy disk usage on the RAID arrays used for that database's log and data files. The problem is proving it, since no one wants to admit they are doing something that is killing our website.

My question is: what performance counters (or anything else) can I log that will help show that SQL Server is waiting on I/O, not because it is asking for more than it normally does, but because the disk is too busy to respond to SQL Server's requests as quickly as it normally would?

Edgey

3 Answers


Have a look at the following perfmon counters:

If SQL Server were driving a high number of I/O requests, that would be corroborated by a high number of scans, an increase in page lookups and page reads, and high page I/O latch waits. It is worth taking a look at sys.dm_exec_query_stats for entries with high physical read counts. They could quickly pinpoint the culprit.
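As a minimal sketch of that last suggestion, the query below lists the cached statements with the most physical reads. The TOP (10) cutoff and the column aliases are illustrative choices, not part of the original answer, and the DMV only covers plans that are still in the plan cache.

```sql
-- Cached statements ranked by total physical reads.
-- Only reflects plans currently in the plan cache.
SELECT TOP ( 10 )
        qs.total_physical_reads ,
        qs.execution_count ,
        qs.total_physical_reads / qs.execution_count AS avg_physical_reads ,
        qs.last_execution_time ,
        -- Offsets are in bytes of Unicode text, hence the division by 2.
        SUBSTRING(st.text, ( qs.statement_start_offset / 2 ) + 1,
                  ( ( CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                      END - qs.statement_start_offset ) / 2 ) + 1) AS statement_text
FROM    sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_physical_reads DESC;
```

If nothing here shows unusually high physical reads around the time of a spike, that would be some evidence that the extra I/O load is not coming from SQL Server's own queries.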

In general, approaching this as a performance troubleshooting problem and following a methodology like Waits and Queues is the right approach. Your DBA seems to be doing the right thing, so you should listen to him.

Remus Rusanu

To start, use Glenn Berry's diagnostic queries and Adam Machanic's sp_WhoIsActive to find out what's really happening.

First, see which database files have the worst I/O bottleneck by running this query (by Glenn Berry):

SELECT  DB_NAME(fs.database_id) AS [Database Name] ,
        mf.physical_name ,
        io_stall_read_ms ,
        num_of_reads ,
        CAST(io_stall_read_ms / ( 1.0 + num_of_reads ) AS NUMERIC(10, 1)) AS [avg_read_stall_ms] ,
        io_stall_write_ms ,
        num_of_writes ,
        CAST(io_stall_write_ms / ( 1.0 + num_of_writes ) AS NUMERIC(10, 1)) AS [avg_write_stall_ms] ,
        io_stall_read_ms + io_stall_write_ms AS [io_stalls] ,
        num_of_reads + num_of_writes AS [total_io] ,
        CAST(( io_stall_read_ms + io_stall_write_ms ) / ( 1.0 + num_of_reads
                                                          + num_of_writes ) AS NUMERIC(10,
                                                              1)) AS [avg_io_stall_ms]
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
        INNER JOIN sys.master_files AS mf WITH ( NOLOCK ) ON fs.database_id = mf.database_id
                                                             AND fs.[file_id] = mf.[file_id]
ORDER BY avg_io_stall_ms DESC
OPTION  ( RECOMPILE );
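The counters in sys.dm_io_virtual_file_stats are cumulative since the instance started, so a short spike can be averaged away in the query above. A possible sketch, assuming a one-minute sampling window and a hypothetical temp table named #io_before, is to snapshot the DMV twice and compare both the number of I/Os and the stall per I/O within that window. If the I/O counts during a spike look like a normal minute but the stall per I/O is much higher, that supports the claim that the disk is slow rather than SQL Server asking for more.

```sql
-- First snapshot of the cumulative per-file I/O counters.
-- #io_before is a hypothetical name chosen for this sketch.
SELECT  database_id , [file_id] ,
        num_of_reads , num_of_writes ,
        io_stall_read_ms , io_stall_write_ms
INTO    #io_before
FROM    sys.dm_io_virtual_file_stats(NULL, NULL);

-- Sampling window: one minute is an assumption, adjust as needed.
WAITFOR DELAY '00:01:00';

-- Second snapshot, reported as the difference from the first.
SELECT  DB_NAME(a.database_id) AS [Database Name] ,
        mf.physical_name ,
        a.num_of_reads - b.num_of_reads AS reads_in_interval ,
        a.num_of_writes - b.num_of_writes AS writes_in_interval ,
        -- NULLIF avoids division by zero for files with no I/O in the window.
        CAST(( a.io_stall_read_ms - b.io_stall_read_ms ) * 1.0
             / NULLIF(a.num_of_reads - b.num_of_reads, 0) AS NUMERIC(10, 1)) AS avg_read_stall_ms ,
        CAST(( a.io_stall_write_ms - b.io_stall_write_ms ) * 1.0
             / NULLIF(a.num_of_writes - b.num_of_writes, 0) AS NUMERIC(10, 1)) AS avg_write_stall_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS a
        INNER JOIN #io_before AS b ON a.database_id = b.database_id
                                      AND a.[file_id] = b.[file_id]
        INNER JOIN sys.master_files AS mf ON a.database_id = mf.database_id
                                             AND a.[file_id] = mf.[file_id]
ORDER BY avg_read_stall_ms DESC;

DROP TABLE #io_before;
```

Running this once during a quiet period and once during a lag spike gives two sets of numbers to compare side by side.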

Then run this query to see the top ten events your server is waiting on (query by Jonathan Kehayias). You will also find a similar query in Glenn Berry's diagnostic queries.

SELECT TOP 10
        wait_type ,
        wait_time_ms ,
        max_wait_time_ms ,
        signal_wait_time_ms ,
        wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms ,
        100.0 * wait_time_ms / SUM(wait_time_ms) OVER ( ) AS percent_total_waits ,
        100.0 * signal_wait_time_ms / SUM(signal_wait_time_ms) OVER ( ) AS percent_total_signal_waits ,
        100.0 * ( wait_time_ms - signal_wait_time_ms )
        / SUM(wait_time_ms) OVER ( ) AS percent_total_resource_waits
FROM    sys.dm_os_wait_stats
WHERE   wait_time_ms > 0 -- remove zero wait_time
        AND wait_type NOT IN -- filter out additional irrelevant waits
( 'SLEEP_TASK', 'BROKER_TASK_STOP', 'BROKER_TO_FLUSH', 'SQLTRACE_BUFFER_FLUSH',
  'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT', 'LAZYWRITER_SLEEP', 'SLEEP_SYSTEMTASK',
  'SLEEP_BPOOL_FLUSH', 'BROKER_EVENTHANDLER', 'XE_DISPATCHER_WAIT',
  'FT_IFTSHC_MUTEX', 'CHECKPOINT_QUEUE', 'FT_IFTS_SCHEDULER_IDLE_WAIT',
  'BROKER_TRANSMITTER', 'FT_IFTSHC_MUTEX', 'KSOURCE_WAKEUP',
  'LAZYWRITER_SLEEP', 'LOGMGR_QUEUE', 'ONDEMAND_TASK_QUEUE',
  'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BAD_PAGE_PROCESS',
  'DBMIRROR_EVENTS_QUEUE', 'BROKER_RECEIVE_WAITFOR',
  'PREEMPTIVE_OS_GETPROCADDRESS', 'PREEMPTIVE_OS_AUTHENTICATIONOPS', 'WAITFOR',
  'DISPATCHER_QUEUE_SEMAPHORE', 'XE_DISPATCHER_JOIN', 'RESOURCE_QUEUE' )
ORDER BY wait_time_ms DESC

Once you have this information at hand, it will be much easier to troubleshoot the problem.

Aaron Bertrand
DaniSQL

"The problem is proving it": rightly said. Take a look at SQL Server: Minimize Disk I/O.

It discusses the following DMVs:

sys.dm_io_virtual_file_stats
sys.dm_io_pending_io_requests
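As a rough sketch of how the two DMVs can be combined (the join and column choice here are illustrative, not taken from the linked article), the query below lists the I/O requests that are outstanding right now, per database file. Per the documentation, io_pending = 1 means the request is still pending in Windows, while 0 means Windows has completed it and SQL Server has not yet processed it.

```sql
-- I/O requests currently outstanding, mapped back to database files.
-- Requests that sit at io_pending = 1 for a long time are waiting on
-- the OS / storage rather than on SQL Server.
SELECT  DB_NAME(vfs.database_id) AS [Database Name] ,
        mf.physical_name ,
        pio.io_type ,
        pio.io_pending ,
        pio.io_pending_ms_ticks
FROM    sys.dm_io_pending_io_requests AS pio
        INNER JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
            ON pio.io_handle = vfs.file_handle
        INNER JOIN sys.master_files AS mf
            ON vfs.database_id = mf.database_id
               AND vfs.[file_id] = mf.[file_id]
ORDER BY pio.io_pending_ms_ticks DESC;
```

This is a point-in-time view, so it would need to be run (or logged on a schedule) while a lag spike is actually happening.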

References:

  1. How to examine IO subsystem latencies from within SQL Server
  2. Glenn Berry's SQL Server Performance - sys.dm_io_pending_io_requests
LCJ