
I'm relatively new to high availability groups in SQL Server and am hoping for a bit of advice on best practices for configuration. I didn't set up this configuration; I inherited it, and I'm trying to get an understanding of it as I've moved into a more DBA-focused role of late. There's a specific issue we encountered recently which I'll try to explain.

We have a two-node SQL Server 2019 cluster with a file share witness. Load balancing is enabled with low aggressiveness. I noticed the file share witness is configured to fail over if restart attempts fail within the restart period (set to 10 minutes).

Anyways, the problem. Last night our WSUS server installed updates which took the file share witness offline and triggered a failover to our secondary node. However, since the problem wasn't with any of the database availability groups, they all remained on the primary node, rendering them inaccessible on the cluster.

EDIT: This is an availability group setup alone; the file share witness was on the primary node at the time. The file share witness is owned by the primary node. Logs seem to indicate that the fsw failed a periodic health check when it was rebooting/installing updates, timed out of the automatic restart attempts, was marked as failed, and triggered a failover from primary to secondary. None of the availability groups on the primary node failed over to the secondary.

I assume that either the file share witness shouldn't be triggering a failover, or there's something I should be doing to make sure everything fails over when triggered. I just wanted to reach out to see if anyone had any advice, or anything I should be aware of to better prevent problems like this.

Thanks for your time. If there's anything I can add to clarify things, please let me know.

fuandon

1 Answer


Last night our WSUS server installed updates which took the file share witness offline and triggered a failover to our secondary node. However, since the problem wasn't with any of the database availability groups, they all remained on the primary node, rendering them inaccessible on the cluster.

I don't understand this, and you didn't specify whether this is an AG, an FCI, or a combination of the two. The topology is also unclear. Is the FSW on a non-primary node in the same cluster, and that node was rebooted? Or is the FSW on some other server which became unavailable, and the cluster failed over but the databases didn't? I'm just not following. This would need to be clearer to help in any meaningful way.

I assume that either the file share witness shouldn't be triggering a failover [...]

That depends on the state of the cluster. Generally, in a cluster that has no issues and where everything is configured properly, losing access to the FSW will not cause anything, not even a blip.
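To illustrate why, here is a minimal Python sketch of the quorum vote arithmetic (purely illustrative, not a WSFC API): with two nodes plus a witness there are three votes, quorum needs a strict majority, and losing only the witness still leaves two of the three votes.

```python
# Minimal sketch of WSFC-style quorum vote arithmetic (illustrative only, not an API).
# Assumed topology: two nodes with one vote each, plus a file share witness vote.

def has_quorum(votes_present: int, total_votes: int) -> bool:
    """Quorum requires a strict majority of the configured votes."""
    return votes_present > total_votes // 2

total_votes = 3        # node1 + node2 + file share witness

# Healthy cluster that merely loses access to the witness share:
votes_present = 2      # both nodes are still up and can see each other
print(has_quorum(votes_present, total_votes))  # True -> quorum held, no failover, not even a blip
```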

Please update the question with more information including topology and sequence of events.


[...] the file share witness was on the primary node at the time [...]

If the FSW is hosted on one of the cluster nodes itself, that's an incorrect configuration and should be changed; the witness should live on a server outside the cluster.

Logs seem to indicate that the fsw failed a periodic health check when it was rebooting/installing updates [...]

If the FSW is on the primary node, then the primary node also rebooted. This means that, from the point of view of the only surviving node of the cluster (the secondary replica node), the cluster is partitioned. When the secondary calls for a vote, it only hears back from itself: the other node, which holds a vote, is down, and the FSW, which is hosted on that same downed node, can't weigh in either.

This means that partition lost quorum and should shut itself down (not a physical power-off, just the clustering services) so as not to split-brain.
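Continuing the same back-of-the-envelope vote arithmetic as above (the node and witness names below are hypothetical), the scenario described here removes two of the three votes at once, so the surviving partition can't reach a majority:

```python
# Same illustrative vote arithmetic, for the case where the witness share
# lives on the primary node and that node reboots to install updates.

voters = {"primary-node": 1, "secondary-node": 1, "file-share-witness": 1}  # hypothetical names
total_votes = sum(voters.values())                                          # 3

# The primary reboots, and the FSW share hosted on it goes down with it.
offline = {"primary-node", "file-share-witness"}
votes_present = sum(v for name, v in voters.items() if name not in offline)  # 1

# A strict majority of 3 votes is 2; the secondary only sees its own vote.
print(votes_present > total_votes // 2)  # False -> the surviving partition loses quorum
```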

[...] and triggered a failover from primary to secondary.

This wouldn't happen, since the cluster partition did not have quorum.

Everything that happened seems to be the expected behavior given the improper configuration of the cluster (I know you said you inherited this).

Sean Gallardy