11

We have 8 Cisco servers, each with 12 spinning disks for data and 2 SSDs for the OS. The 2 SSDs are in Linux software RAID 1. The SSDs all have their wear indicator in single digits, and some that reached a value of 1 have failed. I'm in the process of swapping them all out for spares (a long and tiresome process), but I've noticed the wear indicator is dropping 1 or 2% per week (I didn't take exact measurements).

There is a single application running on these servers. The vendor has given me some vague ideas, but I really need to find the directories it is writing to, so I can properly highlight the problem and push the vendor for a fix. I've searched a bit but haven't been able to find much; iotop, for example, shows full disk throughput including the 12 spinning disks. The OS is Red Hat 7.9.

In answer to some of the questions:

  • disks are "480GB 2.5 inch Enterprise Value 6Gb SATA SSD"
  • product ID is "UCS-SD480GBKS4-EB"
  • disks were supplied standard with the servers in 2018
  • The wearing out appears to have accelerated recently (I am now logging the wear so will have a better answer on that in a few days)
  • I have replaced most disks with identical disks purchased maybe a couple of years later.
  • iotop is showing a constant 8MB/s write.
  • the system is running Hadoop across the 8 servers. The Hadoop file system is on the spinning disks, so it shouldn't touch the SSDs
  • I have reduced the disk IO considerably at the vendor's suggestion, although it still seems high (8MB/s)
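To isolate how much of that write traffic actually lands on the SSDs (rather than the aggregate iotop reports), the per-device counters in /proc/diskstats can be sampled directly. A minimal sketch, assuming the SSD (or its md mirror) shows up as sda; substitute the real device name:

```shell
# Sample the sectors-written counter (field 10) from /proc/diskstats
# twice and report MB/s for one device. DEV is an assumption -- replace
# it with the SSD (or md) device that holds the OS.
DEV=${DEV:-sda}
INTERVAL=${INTERVAL:-2}

before=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)
sleep "$INTERVAL"
after=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)

# diskstats sectors are always 512 bytes, independent of the hardware sector size.
rate=$(awk -v b="${before:-0}" -v a="${after:-0}" -v t="$INTERVAL" \
    'BEGIN { printf "%.2f", (a - b) * 512 / 1048576 / t }')
echo "$DEV: $rate MB/s written"
```

Sampling the md device and each underlying SSD separately also shows whether the mirror is doubling the write load as expected.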
D.W.
MikeKulls

4 Answers

13

It’s hard to be certain without more details on the age of the systems, the exact model and age of the SSDs, and a handful of other factors.

Assuming good quality SSDs, 1-2% on the wear indicator in a week means you’re writing a couple of terabytes (minimum) of data to them in a week. That’s a huge amount of data for an OS volume. Top culprits I would look at are, in order:

  • Cheap SSDs. Put simply, it sounds a lot like you don’t have particularly good quality SSDs in this system, which would invalidate the assumption that 1-2% usable life expectancy translates to multiple TB of data. I suggest doing some research on the exact model of SSDs you’re using to confirm what their actual rated lifetime write endurance is and that there are no documented firmware issues. Good ones from the past five years or so should be rated for at least 100 times their listed capacity (so at least 100 TB on a 1 TB SSD), but ideally more than that (as a point of comparison, current high-end consumer 1 TB SSDs are typically rated for about 300 TB of writes these days).
  • Block device caching. If you have bcache, dm-cache, ZFS L2ARC, or some other block-device caching setup that is using space on the SSDs, that's probably the culprit; try turning it off and see what happens (well, other than a probable nasty hit to performance).
  • Logging. Most of your logs are probably on your OS volume. If you've got verbose logging turned on and your application is very busy, this could easily run into the terabyte range in a week. But it could also be something else, like logs from SELinux, process accounting, or the auditing daemon.
  • Non-block caching. Essentially, stuff under /var/cache or other locations where caches might be stored (such as ~/.cache in user home directories). This shouldn’t be hitting the required numbers unless it’s a very active terminal server, but it’s worth checking.
  • Swapping. Probably not a major contributor, because hitting the numbers required would translate to swapping frequently enough to cause other performance issues on the system.
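As a quick sanity check on the arithmetic above, using the 8 MB/s figure from the question (an assumption about sustained, SSD-directed writes, not a confirmed measurement):

```shell
# Rough endurance math, assuming the question's figures: 8 MB/s sustained
# writes and ~1% of wear per week on a 480 GB drive.
tb_per_week=$(awk 'BEGIN { printf "%.2f", 8 * 86400 * 7 / 1e6 }')
echo "written per week at 8 MB/s: ${tb_per_week} TB"

# If that volume really consumes 1% of life, the total rated endurance
# would have to be ~100x it -- compare against the drive's data sheet.
implied=$(awk -v w="$tb_per_week" 'BEGIN { printf "%.0f", w * 100 }')
echo "implied rated endurance at 1%/week: ${implied} TB"
```

If the drive's actual TBW rating is far below the implied figure, either the drives are under-specced for this workload or something other than the visible 8 MB/s (write amplification, caching) is in play.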
11

Check swapping - that is a typical indicator. Also check whether the software uses any temp files - that may be another one. Both need checking on your end, and since temp files are software-dependent, no generic help is possible. Build-server directories were where I observed this last time - technically a temp structure, as every run downloads (ok, updates) the repository, then initializes the source tree and builds - that is a LOT of writes. End-user SSDs are not made for this. It really depends on the software - no generic answer is possible.
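For the swap check specifically, a sketch using the cumulative counters in /proc/vmstat (two samples taken some hours apart give the rate; pswpout counts pages, 4 KiB each on x86_64):

```shell
# pswpout counts pages swapped out since boot; a value that grows between
# two samples means the box is actively swapping onto the OS SSDs.
pswpout=$(awk '/^pswpout / { print $2 }' /proc/vmstat)
echo "pages swapped out since boot: ${pswpout:-unknown}"
```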

Otherwise, consider whether using low-end SSDs is suitable to start with - this sounds like more wear than should be possible.

TomTom
5

You can use ProcMon for Linux to trace file system calls.

https://github.com/Sysinternals/ProcMon-for-Linux

Greg Askew
2

You can approach this problem top-down.

That means first setting up monitoring, such as netdata, that continuously writes all the relevant IO metrics into a database, for all servers.

Using that data you can check for swap activity, see what write volume your SSDs are actually receiving, and how it changes over time.

That way you can cross-check whether the change in the wear indicator is actually plausible - bugs in SSD firmware that influence SMART reporting aren't unheard of.
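If a full monitoring stack is overkill, a cron-driven sample of the kernel's write counter builds the same time series. The log path and device name below are placeholders; the SMART wear attribute could be appended to the same log via smartctl for direct correlation:

```shell
# Append one timestamped sample of cumulative bytes written to DEV.
# Run from cron (e.g. hourly); diffs between lines give the write rate.
DEV=${DEV:-sda}
LOG=${LOG:-/tmp/ssd-writes.log}
ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
bytes=$(awk -v d="$DEV" '$3 == d { print $10 * 512 }' /proc/diskstats)
echo "$ts ${bytes:-0}" >> "$LOG"
```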


For identifying directories and files that are written to at a high rate you can run filetop from the bcc-tools package, e.g.:

# /usr/share/bcc/tools/filetop
23:56:12 loadavg: 1.32 0.83 0.60 4/1273 563644

TID     COMM   READS  WRITES  R_Kb   W_Kb    T  FILE
563614  yes    0      36757   0      294056  R  foo.bar
[..]

maxschlepzig