We had a production emergency yesterday: out of the blue, in the middle of peak usage, our SQL Server simply stopped responding. Not only were new connections refused; even existing connections stopped functioning. Restarting the service brought everything back immediately, but in the interim we suffered about 30 minutes of downtime and got serious egg on our faces.
There were no errors or warnings in the event log. The only clue that something was going wrong was a series of "Information" messages, logged every 5 minutes from the time the outage began, saying:
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 1500 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 98%%.
So it seems we're running out of worker threads, of which we have 576, and I have no idea how that can happen. I know there are no particularly long-running queries, because I constantly run a Trace that picks up anything taking longer than 2 seconds, and we see no more than one or two of those every 5 minutes.
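For anyone wanting to reproduce my setup: this is roughly how I'd confirm the thread cap and watch the scheduler queues (a sketch, assuming standard DMV access; 576 happens to be the computed default for an 8-logical-CPU x64 box, i.e. 512 + (CPUs − 4) × 16):

```sql
-- Confirm the configured thread cap (0 = let SQL Server compute it
-- from the CPU count; the effective value shows in value_in_use).
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'max worker threads';

-- Per-scheduler snapshot: a growing work_queue_count while the CPU
-- sits idle is the classic sign of worker-thread exhaustion.
SELECT scheduler_id,
       current_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';
```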
Also, we have not made any significant changes to the system in the last several weeks, and this kind of thing has never happened before. Our usage has perhaps been growing, but if this were a resource overload, I would have expected the CPU to be maxed out, rather than at the beach around 0-2%.
What could have happened? How do we diagnose it, and how do we prevent it from happening again?
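In case it helps anyone suggest a diagnosis, here is what I'm planning to check next time (a sketch, not a definitive runbook): cumulative THREADPOOL waits, and whether a blocking chain is soaking up all the workers while the CPU idles.

```sql
-- Cumulative THREADPOOL waits since the last restart; a large
-- wait_time_ms here would confirm tasks were queuing for workers.
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'THREADPOOL';

-- During an incident, look for a blocking chain: many sessions
-- stuck behind one head blocker can each hold a worker thread,
-- exhausting the pool even though CPU usage stays near zero.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;
```

My understanding is that when the pool is fully exhausted, normal connections can't even log in, so these would have to be run over the Dedicated Admin Connection (e.g. `sqlcmd -A`), which reserves its own scheduler.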
Microsoft SQL Server 2014 (SP2) (KB3171021) - 12.0.5000.0 (X64)
Jun 17 2016 19:14:09
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)