We have been seeing what appear to be reboots on our SQL Managed Instances (Standard, Failover Group). One node (the current primary, USEast) has apparently been rebooting for the last 24 hours, with anywhere from 30 minutes to 2 hours between cycles. This has been slowing our system down, although thanks to retry logic the users only occasionally see a connection error. Our various business processes are suffering more.
The only place I can see to detect these is in the SQL Server Logs (SSMS \ Management \ SQL Server Logs) and then looking at the various entries there. On one of the restarts, the logs appeared to clear and now I only have the two most recent entries listed.
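For anyone who would rather check this from a query window than the SSMS log viewer, here is a minimal sketch. It assumes your login is allowed to run `sp_readerrorlog` and read `sys.dm_os_sys_info` on the instance. The search string is only an example.

```sql
-- When did the current SQL engine process start?
-- A value much more recent than expected suggests a restart.
SELECT sqlserver_start_time
FROM sys.dm_os_sys_info;

-- Read the current error log (0 = current log, 1 = SQL Server error log)
-- and return only the entries that contain the search string.
EXEC sp_readerrorlog 0, 1, N'RECOVERY';
```

Recording `sqlserver_start_time` on a schedule would give a restart history that survives the log being cleared.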
I don't see anything in sys.dm_operation_status that points to a cause. The @@VERSION output has changed since I last checked in early September, and currently shows:
Microsoft SQL Azure (RTM) - 12.0.2000.8 Sep 18 2021 19:01:34 Copyright (C) 2019 Microsoft Corporation
The last time this happened to us (to this extent, we are used to one or two reboots periodically) was a few years ago where it was rebooting every 5-10 minutes. In that event, we initiated a failover to our other side and that took several hours before it finally completed. At that time I didn't have the query to see FOG health (hardened LSN, etc.) but I do now and it's showing green across the board.
We are currently planning a failover in a few hours (as soon as the business can tolerate it), and have our 3rd party interface between MSFT and us (yay...) opening an urgent ticket so we can get better insight.
But what I wanted to really ask is if there are more places to look for clues about what's happening? Any insight into what's going on and why it's rebooting so frequently?
UPDATE - We failed over to our secondary and it did so without issue. However, as soon as we failed over, the continuous "RECOVERY" messages started appearing on the new primary. The same "RECOVERY" messages had been appearing in the logs on the other side prior to failing over. DBCC CHECKALLOC/CHECKCATALOG on all databases showed no issues before failover.
MSFT has a ticket open on this issue and if I hear back, I will update here. Still welcome any insights from this community though. =)
UPDATE - The recovery messages still appear, although we have not seen any sign of the restarts that were hurting us earlier. The recovery messages are in this form:
RECOVERY (873c7092-f355-4b10-806b-7f2ec74855c7, 10): XactRM::PrepareLocalXact, Preparing particpant xact, XdesId = -839198875 3
The messages cycle through all of our databases.
UPDATE (6/8/2022 - Six months later)
I've forgotten to keep this updated, but not much has happened. A few months ago we finally got MSFT support to look at this issue. They requested a slew of BS, but we did it anyways so we could get them to move on. What they wanted was for us to rebuild every index and do a full update of statistics. This took us quite a while to get through as our maintenance window was small, but we managed it. Of course it had no impact. It's back in the product team's hands now (no additional word on this issue). It still follows us on every failover between regions; whichever side is the primary has these messages flooding its error log.
Another thing they had us try (which did not resolve the issue) was to move both of our nodes to a higher tier of hardware.
Most recently (last week), the primary node we were on suffered repeated reboots with what looked like stack dumps (confirmed by MSFT). We failed over to the other side and the reboots stopped, although the recovery messages still appear. The two may be related, but I don't think so; the most likely cause of the reboots/stack dumps is a hardware issue on whatever physical stack we were on.
There are a lot of trade-offs when you move to the cloud and allow someone else to fully manage your infrastructure... this appears to be one of them. I will try and remember to update when/if MSFT ever gets back to us on the issue, but after 6 months....
Over the summer (our business is tightly tied to the school year), we may move to a single node for cost savings, so it's a possibility that rebuilding the Failover Group will resolve the issue.
UPDATE - 6/14/2022
And now we have the same issue. The current primary side (USC) started rebooting every 5-20 minutes. We failed over to our other side (USE) and that resolved the issue for now. As before, the logs on the USC side seemed to indicate stack dumps. Poking MSFT again....
UPDATE - 6/14/2022 - A SOLUTION?
Finally got a premium MSFT engineer. They said that the reboots we were experiencing were related to a bug in the Optimization Engine and that the fix had already been released to MIs (in the last week or so), but it only takes effect if you are using the 160 Compatibility Mode. We are doing accelerated testing and are then going to push that change through (we were at 140 before).
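For reference, a minimal sketch of checking and changing the compatibility level. `YourDatabase` is a placeholder, not one of our real database names, and the change should be tested before it goes anywhere near production because it can alter query plans.

```sql
-- List the current compatibility level of every database on the instance.
SELECT name, compatibility_level
FROM sys.databases
ORDER BY name;

-- Move one database to level 160 (hypothetical database name).
ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 160;
```

The setting is per database, so the `ALTER DATABASE` statement has to be repeated for each database that should pick up the fix.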
As for the spamming of RECOVERY messages, they think that the messages themselves are benign, but there is a soon-to-be-released fix that will stop that sub-system from essentially spamming the log with informational messages. So, maybe we will get this resolved after all.