We have a server running Ubuntu 20.04.6 LTS that acts as the secondary storage for our backups: 12x8TB HDDs in a RAIDZ3 pool, with an XFS filesystem on a zvol on top.
A couple of days ago, one drive failed. I thought, "OK, no problem, it is RAIDZ3," but even before I could replace and resilver the broken drive, I noticed that the filesystem was no longer mounted.
I tried mounting it manually, to no avail:
sudo mount -t xfs /dev/zd0 /mnt/veeam_repo_prod
Immediately, it returns a kernel error: "XFS (zd0): log recovery write I/O error at daddr 0x1b1b70 len 4096 error -5", followed by "mount: /mnt/veeam_repo_prod: can't read superblock on /dev/zd0."
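Error -5 is EIO, so my (possibly wrong) assumption is that the block device itself is refusing I/O during log recovery rather than XFS being corrupted. The first thing I would check is whether plain reads from the zvol already fail, and what the kernel logs say about it, something along these lines (the device path is just the one from the mount command above):
sudo dd if=/dev/zd0 of=/dev/null bs=1M count=1024 status=progress   # read test against the zvol
sudo dmesg | grep -iE 'zd0|zfs'                                     # look for I/O errors reported against the zvol itself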
I can't see any problems in zpool status -v:
  pool: zpool01
 state: ONLINE
  scan: scrub repaired 0B in 2 days 11:10:24 with 0 errors on Wed Feb 28 19:54:19 2024
config:

        NAME                       STATE     READ WRITE CKSUM
        zpool01                    ONLINE       0     0     0
          raidz3-0                 ONLINE       0     0     0
            sdb                    ONLINE       0     0     0
            sdc                    ONLINE       0     0     0
            sdd                    ONLINE       0     0     0
            sde                    ONLINE       0     0     0
            sdf                    ONLINE       0     0     0
            sdg                    ONLINE       0     0     0
            sdh                    ONLINE       0     0     0
            scsi-351402ec000fe5847 ONLINE       0     0     0
            scsi-351402ec000fe5848 ONLINE       0     0     0
            scsi-351402ec000fe5849 ONLINE       0     0     0
            scsi-351402ec000fe584a ONLINE       0     0     0
            scsi-351402ec000fe584b ONLINE       0     0     0

errors: No known data errors
Running a scrub returns 0B repaired.
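Since the pool itself reports healthy and the scrub finds nothing, I wonder whether the zvol simply cannot allocate space any more: as far as I understand, a sparse zvol on a pool that has (nearly) run out of free space will fail writes with EIO even though zpool status stays clean. I would check the space accounting with something like the following (the dataset name is whatever the zvol is actually called):
zfs list -o space zpool01
zfs list -t volume -o name,volsize,used,refreservation,available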
I tried running xfs_repair /dev/zd0, but it says the log contains valuable metadata changes that need to be replayed. Running xfs_repair -L /dev/zd0 again returns an I/O error: "xfs_repair: libxfs_device_zero write failed: Input/output error".
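Before trying anything more invasive than -L, I would want a copy of the filesystem metadata to experiment on, so the live zvol does not have to be touched again. A rough sketch (the /backup_scratch path is just a placeholder for any other filesystem with enough free space; xfs_metadump captures metadata only, not file data):
sudo xfs_metadump -g /dev/zd0 /backup_scratch/zd0.metadump                # dump XFS metadata to a file (-g shows progress)
sudo xfs_mdrestore /backup_scratch/zd0.metadump /backup_scratch/zd0.img   # rebuild a sparse image from the dump
sudo xfs_repair -n /backup_scratch/zd0.img                                # dry-run repair against the image instead of the zvol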
I am simply out of ideas. The only good thing is that this is only the second copy of the backup, so I could just start from scratch, but it would take weeks to recopy all the data. Also, if it happened once, it can happen again, and I don't want to be there the day we actually need this backup and it has happened again.