We had a production emergency yesterday: out of the blue, in the middle of peak usage, our SQL Server simply stopped responding. Not only were new connections refused; even existing connections stopped functioning. Restarting the service brought everything back immediately, but in the interim we suffered about 30 minutes of downtime and got serious egg on our faces.
There were no errors or warnings in the event log. The only clue that something was going wrong was a series of "Information" messages, logged every 5 minutes from the time the outage began, saying:
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 1500 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 98%%.
So it seems we're running out of worker threads, of which we have 576, and I have no idea how that can happen. I know there are no particularly long-running queries, because I constantly run a Trace that picks up anything taking longer than 2 seconds, and we see no more than one or two of those every 5 minutes.
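For anyone wanting to reproduce my setup: this is roughly how I'd confirm the thread cap and watch the scheduler queues (a sketch, assuming standard DMV access; 576 happens to be the computed default for an 8-logical-CPU x64 box, i.e. 512 + (CPUs − 4) × 16):

```sql
-- Confirm the configured thread cap (0 = let SQL Server compute it
-- from the CPU count; the effective value shows in value_in_use).
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'max worker threads';

-- Per-scheduler snapshot: a growing work_queue_count while the CPU
-- sits idle is the classic sign of worker-thread exhaustion.
SELECT scheduler_id,
       current_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';
```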
Also, we have not made any significant changes to the system in the last several weeks, and this kind of thing has never happened before. Our usage has perhaps been growing, but if this were a resource overload, I would have expected the CPU to be maxed out, rather than at the beach around 0-2%.
What could have happened? How do we diagnose it, and how do we prevent it from happening again?
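In case it helps anyone suggest a diagnosis, here is what I'm planning to check next time (a sketch, not a definitive runbook): cumulative THREADPOOL waits, and whether a blocking chain is soaking up all the workers while the CPU idles.

```sql
-- Cumulative THREADPOOL waits since the last restart; a large
-- wait_time_ms here would confirm tasks were queuing for workers.
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'THREADPOOL';

-- During an incident, look for a blocking chain: many sessions
-- stuck behind one head blocker can each hold a worker thread,
-- exhausting the pool even though CPU usage stays near zero.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;
```

My understanding is that when the pool is fully exhausted, normal connections can't even log in, so these would have to be run over the Dedicated Admin Connection (e.g. `sqlcmd -A`), which reserves its own scheduler.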
Microsoft SQL Server 2014 (SP2) (KB3171021) - 12.0.5000.0 (X64)
Jun 17 2016 19:14:09
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)