How does SSD meta-data corruption on power-loss happen? And can I minimize it?

Question

Note: This is a follow-up question to "Is there a way to protect SSD from corruption due to power loss?". I got good info there, but it basically centered on three areas: "get a UPS", "get better drives", or how to deal with Postgres reliability.

But what I really want to know is whether there is anything I can do to protect the SSD against meta-data corruption, especially in old writes. To recap the problem: it's an ext4 filesystem on Kingston consumer-grade SSDs with write-cache enabled, and we're seeing these kinds of problems:

files with the wrong permissions
files that have become directories (for example, toggle.wav is now a directory with files in it)
directories that have become files (not sure of content..)
files with scrambled data

The problem is less with these things happening on data that's being written while the drive goes down, or shortly before. It's a problem but it's expected and I can handle that in other ways.

The bigger surprise and problem is that there is meta-data corruption happening on the disk in areas that were not recently written to (i.e., written a week or more before).

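I'm trying to understand how such a thing can happen at the disk/controller level. What's going on? Does the SSD periodically "rebalance" and move blocks around, so that old data gets rewritten even though I'm writing somewhere else? Like this:

[Diagram not preserved: existing data D being moved from block 1 to block 2.]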

And then there is a power loss when D is being rewritten. There may be pieces left on block 1 and some on block 2. But I don't know if it works this way. Or maybe there is something else happening..?

In summary: I'd like to understand how this can happen and whether there is anything I can do to mitigate the problem at the OS level.

Note: "get better SSDs" or "use a UPS" are not valid answers here - we are trying to move in that direction but I have to live with the reality on the ground and find the best outcome with what we have now. If there is no solution with these disks and without a UPS, then I guess that's the answer.

References:

Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition "expected behavior"? This is similar but it's not clear if he was experiencing the kinds of problems we are.

EDIT: I've also been reading about issues with ext4 that might cause problems on power loss. Our filesystems are journaled, but I don't know about anything else.

Prevent data corruption on ext4/Linux drive on power loss

http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/

score 2 · Answer 1 · answered Aug 02 '18 at 08:59

Your best bet is to disable write caching at both levels: tell the disk not to do write caching (look at the hdparm and smartctl options, and hope the disk honors them), and make the OS not buffer writes by using mount options like sync and dirsync.
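If mounting the whole filesystem with sync and dirsync is too slow, a per-file approximation is to fsync the file and then its parent directory after each important write. The following is only a minimal sketch of that idea, assuming Linux; the helper name durable_write and the file name in the demo are made up for illustration. Like the mount options, it still depends on the drive actually honoring flush requests.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, then flush the file and its directory entry.

    Hypothetical helper: roughly what the sync/dirsync mount options do
    for every write, but applied only to the files you care about.
    """
    directory = os.path.dirname(os.path.abspath(path))

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data and metadata to the device.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Flush the directory too, so the directory entry itself is on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    # Demo in a throwaway directory; the file name is arbitrary.
    target = os.path.join(tempfile.mkdtemp(), "settings.conf")
    durable_write(target, b"volume=7\n")
    with open(target, "rb") as f:
        print(f.read())
```

This only narrows the window for losing recent writes. It does nothing for old data that the drive itself damages while it is powered down mid-operation.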

Baruch Even

score 2 · Accepted Answer · answered Aug 09 '18 at 15:34

For how metadata corruption can happen after an unexpected power failure, take a look at my other answer here.

Disabling the cache can significantly reduce the likelihood of in-flight data loss; however, depending on your SSDs, data at rest remains at risk of being corrupted. Moreover, it incurs a massive performance loss (I have seen 500+ MB/s SSDs write at a mere 5 MB/s after disabling the private DRAM cache).

If you can't trust your SSDs, the only "solution" (or, rather, workaround) is to use an end-to-end checksumming filesystem such as ZFS or Btrfs together with a RAID1/mirror setup: that way, any single-device (meta)data corruption can be repaired from the other side of the mirror by running a check/scrub.
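To illustrate why the checksum and the mirror are both needed, here is a toy in-memory model of a scrub. It is not how ZFS or Btrfs are implemented; the function names and the list-of-blocks layout are assumptions made up for this sketch. The checksum tells you which copy is bad, and the mirror gives you a good copy to repair from.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum of one block (SHA-256 chosen arbitrarily for the sketch)."""
    return hashlib.sha256(block).hexdigest()


def scrub(mirror_a: list, mirror_b: list, checksums: list) -> list:
    """Repair both mirrors in place; return indices that cannot be repaired.

    A block is repairable when at least one of its two copies still
    matches the checksum recorded when the block was written.
    """
    unrecoverable = []
    for i, expected in enumerate(checksums):
        a_ok = checksum(mirror_a[i]) == expected
        b_ok = checksum(mirror_b[i]) == expected
        if a_ok and not b_ok:
            mirror_b[i] = mirror_a[i]   # heal B from A
        elif b_ok and not a_ok:
            mirror_a[i] = mirror_b[i]   # heal A from B
        elif not a_ok and not b_ok:
            unrecoverable.append(i)     # both copies are damaged
    return unrecoverable


if __name__ == "__main__":
    blocks = [b"inode table", b"directory entry", b"file data"]
    sums = [checksum(b) for b in blocks]
    disk_a, disk_b = list(blocks), list(blocks)

    # Simulate one SSD silently corrupting an old block after power loss.
    disk_a[1] = b"garbage"

    print(scrub(disk_a, disk_b, sums))
    print(disk_a == blocks)
```

Without the checksums, a plain RAID1 would see two differing copies and have no way to tell which one is correct. Without the mirror, the checksum would detect the damage but have nothing to repair it from.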

score 1 · Answer 3 · answered Dec 07 '23 at 05:32

The metadata corruption you describe is filesystem metadata, not (internal) SSD metadata. SSDs generally don't understand filesystem metadata and therefore it is quite likely that not only metadata, but also data is corrupted - it's just not as obvious.

Many mechanisms can cause this; here are some:

SSDs store a lot of data in RAM, which is (on practically all consumer drives) lost on a power outage. Some drives drop data pretty much at random; others try to provide a consistent state, losing writes but rolling them back in time order. Many drives ignore flush requests as well. Depending on how good the drive firmware is, this might not result in actual corruption, as filesystems can deal with certain patterns of loss of recently written data when a commit has not yet happened. This is likely the kind of corruption you saw.
MLC (TLC etc.) flash chips store more than one bit of information in a cell, and these extra bits are usually stored at different times. A power outage can corrupt the contents of a cell, which in turn might corrupt older data written a long time ago that happened to be already stored in that cell. This is data-at-rest corruption and no filesystem can handle this situation. There are some consumer drives that "guarantee" that this kind of corruption does not happen, but they are rare.
