Why does data file growth fail even with plenty of disk space?

Question

We have a SQL Server 2016 (v13 SP3) Enterprise Edition server hosted in a Windows failover cluster / SQL Availability Group with two nodes (primary & secondary). The two nodes are running inside AWS EC2 instances running Windows Server 2012 R2 64-bit (NT 6.3).

Earlier this week, the server started responding with this error:

Could not allocate space for object 'dbo.Batches'.'pk_Batches_BatchID' in database 'XXX' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

At first this seemed pretty straightforward: we figured we'd been careless and allowed the data and/or log files to get too big. The files were definitely full without any unallocated space left in them. We figured we just needed to grow the Windows (NTFS) drive (backed by AWS EBS underneath the hood). The 'XXX' database has one log file and two data files - the data files are set to unlimited growth (though only by 64MB at a time) and the database only has the default 'PRIMARY' filegroup, no other filegroups involved. The data files are on the 'D:' drive.

But the 'D:' drive has over 400 GB free, so why are the data files not growing?

We spent a lot of time looking at the Windows Clustering and the SQL Availability Groups, as we were also seeing plenty of errors about the AG status going into "Recovering" and the clustering role not applying/synchronizing properly. Some changes allowed the primary node to come back up for a few minutes, but then it would crash again. (Because of this, our ability to inspect the 'XXX' database itself was limited.) We looked to see if EBS was having some sort of issue or outage, but could see no errors.

We realize that the servers are old and out of date. We realize that some would say that using/relying on autogrowth is a bad practice. But this question isn't about best practices - it's how do we get this currently-down production server back on its feet?

score 1 · Accepted Answer · answered Sep 14 '23 at 23:52

The solution was to add another data file!

Because of the rapid rise and fall of the node, we had to turn it back on ("Start Role" in Windows clustering) and then rush to execute this command:

ALTER DATABASE [XXX] ADD FILE ( NAME = N'XXX3', FILENAME = N'D:\DATA\XXX3.mdf' , SIZE = 1110884352KB, FILEGROWTH = 65536KB ) TO FILEGROUP [PRIMARY]

(Doing it through the UI was too slow.)

Once we did this, the issue went away and the server / cluster stayed up!

(We did increase the size of the drive/volume beforehand.)

I cannot adequately explain why this worked. The person who figured this out said Microsoft once told him that this is caused by a bug in NTFS that won't let files grow under certain conditions. I can find nothing about this anywhere, so I dunno if that's true.

There does appear to be a common misconception (that I had as well) here. Apparently, the idea that the files in a filegroup would grow individually is false. In actually, an autogrowth event will extend every file in the filegroup at the same time. That makes sense, but it doesn't seem to explain anything here: it seems like it would just mean that the Database Engine was trying to grab 128 MB instead of 64 MB, which there more than sufficient space for.

Now the two data files are of differing sizes: the first file is 2,161,928MB (~2TB) and second file is 1,084,912MB (~1TB). Information about the how the proportional fill algorithm for growing filegroups works suggests that this could have caused the engine to be asking for 192 MB, or even 256 MB. But again, both are less than 400 GB.

There is a relationship between filegroup growth behavior and trace flags 1117 & 1118. For the record, neither of those trace flag were enabled on this server. (FYI you can check this with DBCC TRACESTATUS.) I've been unable to figure out whether the SQL Server Service was started with the -E (increased extents) option.

One way or another, it would appear that engine was trying to allocate much, much, much more filespace than the autogrowth settings suggested. Only be adding another data file and effectively circumventing autogrowth were we able to extend the database size and get it back on its feet.

I was cursing the 'inaccurate/misleading' error message the whole time, but it said it right there: Create disk space by ...adding additional files to the filegroup We should have done that from the start.

Why does data file growth fail even with plenty of disk space?

1 Answers1