Database enters in recovery mode every time the Transaction Log is full

Question

I am facing a situation that it is being somewhat hard to address. I need help to understand what is happening.

TL;DR: Every time the Transaction Log gets full in SQL Server it needs to shutdown the database to enter in Recovery Mode and rollback the offending transactions? Is this always done by design or this only happens when something bad happens?

The scenario:

One of our heavy used production databases, which runs several ETL jobs, and long running table batches, entered in Recovery Mode and became inaccessible for some time. This happened three times this week (this server is on for ~2 years, and we didn't notice this issue in the past).

Looking into the errors logs what happened was clear: the Transaction Log was full, the database needed to rollback the transaction, the rollback failed, the database shutdown, and started in recovery mode.

The DBA defends this as normal behavior of SQL Server. That is, according to him, every time the transaction log gets full and a transaction needs to rollback the database will enter in Recovery Mode due to the lack of log space. After the rollback (that can only be done in Recovery Mode according to him), the database will become available again.

I found no reference for this info. So I strongly disagree. I would really appreciate if someone convince me that I am wrong.

My point:

As far of my knowledge, a DBMS is built to manage/run queries. If it lacks space, the query will fail. Simple as it is. And I am not talking about performance of anything else, but availability only.

It makes no sense for me to accept that a DBMS needs by design to shutdown itself to rollback any transaction. In my understanding, it does not matter if I am running tons of queries or if the queries are bad designed. The bad queries should fail and life continues. Doesn't it?

My guess is that something else is making it fail, and I need to track what is happening.

Is my understanding wrong or this is really how SQL Server is designed to work? Supposing I am not wrong, what else can I do to track the source of this issue?

Some additional info

select @@version: Microsoft SQL Server 2012 (SP1) - 11.0.3156.0 (X64) May 4 2015 18:48:09 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows NT 6.2 (Build 9200: )
This database is in Simple recovery model.
There are other databases in the same instance. They don't present the same issue, but they are also not heavy used.
Only the transaction log is full, not the disk. The disk has plenty of space, but the log size for the database is limited.
We monitor this server, CPU load is fine, memory usage is fine, disks use RAID-5 and the controller present no crashes or read failures. There are some peaks in resource usage, but not anything uncommon.
I know the queries may be improved to efficiently use the log. I also know I can increase the transaction log space. But this is really not my point here.
A DBA was hired recently to take care of this database. So several configs were changed recently, for tuning purposes. He made me aware of all changes (like disabling auto-shrink, increase auto-grow size, etc). I found nothing that may harm the database.

Log dump (in order of occurrence, duplicates removed)

[02:58:37am ~ 04:47:42pm, 12 times] Error: 845. Severity: 17. State: 1. Time-out occurred while waiting for buffer latch type 3 for page (1:8728760). database ID 7. FlushCache: cleaned up 10460 bufs with 6709 writes in 77540 ms (avoided 864 new dirty bufs) for db 7:0 average throughput: 1.05 MB/sec. I/O saturation: 107. context switches 391 last target outstanding: 4800. avgWriteLatency 0 FlushCache: cleaned up 95448 bufs with 37560 writes in 85820 ms (avoided 60465 new dirty bufs) for db 7:0 average throughput: 8.69 MB/sec. I/O saturation: 17026. context switches 20713 last target outstanding: 446. avgWriteLatency 3.

[02:58:37am ~ 04:47:42pm, 13 times] A time-out occurred while waiting for buffer latch -- type 3. bp 000000109B9E69C0. page 1:73430228. stat 0x10b. database id: 7. allocation unit Id: 72057594304790528. task 0x00000008BC0850C8 : 1. waittime 300 seconds. flags 0x100000001a. owning task 0x0000000827B38188. Not continuing to wait.

[02:58:37am ~ 04:47:42pm, 12 times] Error: 5901. Severity: 16. State: 1. One or more recovery units belonging to database 'XXXXXXXXXX' failed to generate a checkpoint. This is typically caused by lack of system resources such as disk or memory or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

[05:14:29pm ~ 05:14:53pm, 9 times] Error: 9002. Severity: 17. State: 4. The transaction log for database 'XXXXXXXXXX' is full due to 'ACTIVE_TRANSACTION'.

[05:14:53pm, once] Error: 3314. Severity: 21. State: 3. Database XXXXXXXXXX was shutdown due to error 3314 in routine 'XdesRMReadWrite::RollbackToLsn'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.

[05:14:53pm ~ 05:14:53pm, 16 times] Error: 3314. Severity: 21. State: 3. During undoing of a logged operation in database 'XXXXXXXXXX' an error occurred at log record ID (8064074:20971:110). Typically the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup or repair the database.

[05:14:53pm ~ 05:14:53pm, 9 times] Error: 9001. Severity: 21. State: 5.The log for database 'XXXXXXXXXX' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

[05:14:58, once] Starting up database 'XXXXXXXXXX'.

[05:15:02, once] Recovery of database 'XXXXXXXXXX' (7) is 0% complete (approximately 2931 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.

...

[05:51:01pm, once] 6 transactions rolled back in database 'XXXXXXXXXX' (7:0). This is an informational message only. No user action is required.

[05:51:01pm, once] Recovery is writing a checkpoint in database 'XXXXXXXXXX' (7). This is an informational message only. No user action is required.

[05:56:47pm, once] Recovery completed for database XXXXXXXXXX (database ID 7) in 2505 second(s) (analysis 1774 ms redo 406623 ms undo 1749182 ms.) This is an informational message only. No user action is required.

I found no other relevant log entries in the Error Log or Event Viewer. The closest error that happens in the Event Viewer is:

[04:56:45pm ~ 05:27:24pm, 13 times] The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID {FDC3723D-1588-4BA3-92D4-42C430735D7D} and APPID {83B33982-693D-4824-B42E-7196AE61BB05} to the user MY_DOMAIN\dba.personal.user SID (S-1-5-21-000000000-000000000-0000000000-00000) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.

This error happened about ~18 minutes before the database start the recovery process, and repeated sometimes during the beginning of the recovery. It is somewhat related with the DBA user, but I really don't know what it is (I had no time to ask for the DBA yet).

score 3 · Accepted Answer · answered Apr 20 '18 at 02:10

First of all few housekeeping rule.

You (or your DBA) should manage transaction log space depending on your recovery model.
Do not let transaction log to get full and impact your database/application availability.

Following two links can help you better manage transaction log file.

SQL Server Transaction Log Architecture and Management Guide
How do you clear the SQL Server transaction log? Answered by Aaron Bertrand which also listed some great resources around transaction log issues.

What you are experiencing is not normal behavior when transaction log file is full and cannot grow further.

When the transaction log becomes full, SQL Server Database Engine issues a 9002 error. The log can fill when the database is online, or in recovery. If the log fills while the database is online, the database remains online but can only be read, not updated. If the log fills during recovery, the Database Engine marks the database as RESOURCE PENDING. In either case, user action is required to make log space available.

The appropriate response to a full transaction log depends partly on what condition or conditions caused the log to fill. To discover what is preventing log truncation in a given case, use the log_reuse_wait and log_reuse_wait_desc columns of the sys.database catalog view.

What you are seeing is failure in transaction rollback. For more details read this post.

INF:What happens to a database when transaction rollback fails?

As per Paul Randal's blog post you hit a bug which was fixed in SQL 2012 SP4.

2012/2014 bug that can cause database or server to go offline

More details on error 3314:

Troubleshooting Error 3313, 3314, 3414, or 3456 (SQL Server)

Reference:

Troubleshoot a Full Transaction Log (SQL Server Error 9002)

score 2 · Answer 2 · answered Apr 20 '18 at 07:44

The first and the foremost thing that your database is already in Simple recovery mode so log will not grow much until it is held up by a single transaction. Once the checkpoint comes the log will be truncated itself. Now that you have limited the log size, SQL Server goes for a toss when the transaction is in the middle and there is no scope of log to grow. Simply kept, the transaction needs the log to grow. You cannot make the transactions complete without allowing the logs to grow when required. This is why the best setting is to keep the AUTOGROWTH enabled.

SQL Server treats it as a crash and then goes for a recovery to rollback the incomplete commands as seen in the error log.

Solution : Enable the AUTOGROWTH. Also if this is an important database then switch to Full recovery Mode and configure log backups.

score 2 · Answer 3 · answered Apr 20 '18 at 19:22

This is my take on your issue :

Microsoft SQL Server 2012 (SP1) - 11.0.3156.0

This is pretty old build and lot of fixes have been put into newer SPs. You should patch your server with SP4 atleast - 11.00.7001.

Error: 845. Severity: 17. State: 1.

This is your disk subsystem causing issues. Check your \MSSQL\LOG\sqldump folder. You will have stack dumps created. You can analyze it as per SQL Server Latch & Debugging latch time out or open a case with Microsoft. Again, check your disk subsystem.

Error: 5901. Severity: 16. State: 1

This was fixed in Cumulative Update 8 for SQL Server 2012 SP2

so takeaways:

Patch with SP4 and see if you encounter issues.
Check with your windows or storage admin for disk contention.
Make sure you have enough RAM and Max memory is set appropriately.
CPU power plan is set to high performance as opposed to balanced.
Enable autogrowth to fixed MB so that the db can grow appropriately.
Run sp_Blitz to identify potential issues.

Database enters in recovery mode every time the Transaction Log is full

3 Answers3