I've been trying to track down the source of my tempdb log filling up every few weeks where the tempdb log space used will continue to grow until it reaches 99% and stays there for hours while the log file grows. After adding in some additional logging and email alerts, I think I identified the source of the problem, at least for this incident, not sure if the cause is the same in past incidents.
dbcc opentran in the tempdb reported this:
Oldest active transaction:
SPID (server process ID): 47s
UID (user ID) : -1
Name : sort_init
LSN : (41:3017:42)
Start time : Dec 19 2020 10:03:20:410AM
SID : 0x01
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
The spid is 47s which I discovered is a system spid. I ran sp_WhoIsActive with @show_system_spids = 1 which brought back data on the process. After running sp_WhoIsActive over the course of an hour, tempdb_allocations ended up at 4,221,088, tempdb_current was 4,128,656, open_tran_count was 1 (throughout the entire incident) and sql_text was always null. After running out of ideas, I ended up restarting SQL Server. Several times prior to restarting SQL server, I issued a checkpoint in tempdb but that didn't help, likely since the transaction from the system process was still active.
The Start time from the dbcc opentran output is 10:03am and it was 1:00pm when I restarted SQL Server. I ran dbcc opentran a few times over the course of an hour and the same output was returned each time.
In this particular case, the server was rebooted and it restarted at 10:02:54 -- 26 seconds before this long running system process started at 10:03:20. Past incidents don't appear to have happened immediately after a system reboot. I don't have many details on past incidents, but at least the last time this occurred 9 days ago, in the sp_WhoIsActive output I collected (without @show_system_spids set to 1) there were no long running processes captured which likely means it was a similar system process running / filling up the tempdb log. This 99% full tempdb log incident 9 days ago was ongoing for 7 hours.
Nothing stands out in the SQL Server log and Windows logs. I'm running Microsoft SQL Server 2017 (RTM-CU15-GDR) (KB4505225) - 14.0.3192.2 (X64).
Is there a way to determine what a system SPID is doing? Or any tips on how I can better troubleshoot this problem when it happens next time? The only clue from this incident is the Name value of sort_init in the dbcc opentran output which could be helpful but seems generic at the same time.