
We have a Microsoft Failover Cluster with dynamic disks managed by Veritas Storage Foundation. Today the sysadmins added a new disk for SQL Server, but the allocation unit (cluster) size on the volume was wrong, so I ran a quick format to change it.

The disk volume failed, the SQL Server group failed with it, and the cluster became unresponsive. After a few minutes I managed to fail over to a passive node.

The SAN admins say it's my fault: I should have formatted the disk with Veritas Enterprise Administrator (VEA) rather than the Windows format applet.

Can a format operation take a whole cluster group offline this way?

Relevant error messages:

From the eventlog:

The cluster resource host subsystem (RHS) stopped unexpectedly.
An attempt will be made to restart it. This is usually due to a 
problem in a resource DLL. Please determine which resource DLL is 
causing the issue and report the problem to the resource vendor.

From cluster.log:

ERR   [RCM] rcm::RcmResControl::DoResourceControl: 
ERROR_RESOURCE_CALL_TIMED_OUT(5910)' because of 'Control(STORAGE_GET_DISK_INFO_EX) 
to resource 'NameOfTheDiskGroup' timed out.'
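
To check whether the same timeout shows up for other resources in a long cluster.log, a rough sketch like the following could help. It assumes entries shaped like the one quoted above; the helper name `find_timeouts` and the regular expression are my own illustration, not part of any cluster tooling.

```python
import re

# Hypothetical pattern, modelled only on the single log entry quoted above.
TIMEOUT_RE = re.compile(
    r"ERROR_RESOURCE_CALL_TIMED_OUT\((?P<code>\d+)\)'?\s+because of\s+"
    r"'Control\((?P<control>\w+)\)\s+to resource\s+'(?P<resource>[^']+)'\s+timed out"
)

def find_timeouts(log_text):
    """Return one dict per timed-out resource control call found in log_text."""
    # Collapse whitespace so entries wrapped over several lines still match.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in TIMEOUT_RE.finditer(flat)]

sample = """ERR   [RCM] rcm::RcmResControl::DoResourceControl:
ERROR_RESOURCE_CALL_TIMED_OUT(5910)' because of 'Control(STORAGE_GET_DISK_INFO_EX)
to resource 'NameOfTheDiskGroup' timed out.'"""

for hit in find_timeouts(sample):
    print(hit["resource"], hit["control"], hit["code"])
```

Real logs may format these entries differently, so the pattern would probably need adjusting.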

Excerpt from Symantec's (Veritas) documentation:

Note: Before manually creating the resource, you must format the cluster-shared volume with NTFS using the VEA GUI and mount it on the node where you are trying to create the resource.

Does this mean the disk cannot be formatted from Windows? I don't read it that way.

For the record, I formatted many disks using the Windows applet in the past and nothing bad happened.

2 Answers


Since it's a shared volume, the cluster nodes were probably already trying to use it, so the VEA GUI would have been the safer way to go. The documentation doesn't say so, but VEA most likely does something the Windows format applet doesn't, even if it's just taking a temporary write lock on the volume from the machine running VEA so the format can proceed, or telling the nodes to use a different disk in the meantime.

Also, I suspect the bigger problem was:

Note: You must ensure that the selected drive letter for the new cluster-shared volume is available and not in use on any of the cluster nodes.

It sounds like your disk was in use when you formatted it. Formatting the volume as NTFS from Windows is probably fine in itself; what caused the failure is that the disk was in use and you bypassed the VEA GUI, which arguably could have prevented some of these problems.

MDMoore313

Yes. If the disk was already configured as a dependency of the SQL Server resource (and a disk must be a dependency of that resource before SQL Server can use it), then by the way WSFC works, the format may have caused a 'failure' of the disk resource, taking it offline, and that would escalate to bringing the entire role offline. This may not be what happened, but that's the cluster's perspective. I've never formatted a disk after it was added to a role, so I haven't seen first-hand what that does.

It could also be that Symantec/Veritas volumes are not plain NTFS volumes as far as Windows is concerned, so formatting it the Windows way broke something and the disk resource went offline during the format. Again, if the disk was configured as a resource dependency of SQL Server, that would escalate.
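
The escalation described above can be sketched as a toy model. This is only an illustration of how a failure propagates along dependencies, not how WSFC is implemented; the `Resource` class, the `fail` function and the resource names are all made up, and a real cluster would also apply its restart and failover policies before giving up on a role.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # compare by identity, so two resources are never confused
class Resource:
    name: str
    depends_on: list = field(default_factory=list)
    online: bool = True

def fail(resource, role):
    """Mark a resource offline, then do the same for everything that depends on it."""
    resource.online = False
    for other in role:
        if other.online and resource in other.depends_on:
            fail(other, role)

# Hypothetical role: SQL Server depends on the disk, the agent depends on SQL Server.
disk = Resource("Disk")
sql = Resource("SQL Server", depends_on=[disk])
agent = Resource("SQL Server Agent", depends_on=[sql])
role = [disk, sql, agent]

fail(disk, role)  # e.g. the disk resource stops responding during a format
for r in role:
    print(r.name, "online" if r.online else "offline")
```

In this model, failing the disk takes every resource in the role offline, while failing only the agent would leave the disk and SQL Server untouched.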