
We have a server running Ubuntu 20.04.6 LTS. It is the secondary storage for our backups, with 12x8TB HDDs in a RAIDZ3 pool and an XFS filesystem on a zvol on top.
A couple of days ago, one drive failed. I thought, "OK, no problem, it is a RAIDZ3," but even before I replaced and resilvered the broken drive, I noticed that the filesystem was no longer mounted.
I tried mounting it manually to no avail, running:
sudo mount -t xfs /dev/zd0 /mnt/veeam_repo_prod
Immediately, it returns a kernel error: "XFS (zd0): log recovery write I/O error at daddr 0x1b1b70 len 4096 error -5", followed by "mount: /mnt/veeam_repo_prod: can't read superblock on /dev/zd0."
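For reference, the surrounding kernel messages can be pulled with something like this (the grep pattern is only a filter suggestion, not part of the error):

    dmesg | grep -iE 'xfs|zd0'
    # or, with timestamps:
    journalctl -k | grep -iE 'xfs|zd0'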

I can't see any problems in zpool status -v.

  pool: zpool01
 state: ONLINE
  scan: scrub repaired 0B in 2 days 11:10:24 with 0 errors on Wed Feb 28 19:54:19 2024
config:
    NAME                        STATE   READ WRITE CKSUM
    zpool01                     ONLINE     0     0     0
      raidz3-0                  ONLINE     0     0     0
        sdb                     ONLINE     0     0     0
        sdc                     ONLINE     0     0     0
        sdd                     ONLINE     0     0     0
        sde                     ONLINE     0     0     0
        sdf                     ONLINE     0     0     0
        sdg                     ONLINE     0     0     0
        sdh                     ONLINE     0     0     0
        scsi-351402ec000fe5847  ONLINE     0     0     0
        scsi-351402ec000fe5848  ONLINE     0     0     0
        scsi-351402ec000fe5849  ONLINE     0     0     0
        scsi-351402ec000fe584a  ONLINE     0     0     0
        scsi-351402ec000fe584b  ONLINE     0     0     0

errors: No known data errors

Running a scrub returns 0B repaired.

I tried running xfs_repair /dev/zd0, but it says there are valuable metadata changes in the log that need to be replayed. Running xfs_repair -L /dev/zd0 again ends in an I/O error: "xfs_repair: libxfs_device_zero write failed: Input/output error".
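For reference, the commands in question (the -n dry run is the read-only variant, included here only as the safe first check, not something from my original attempt):

    sudo xfs_repair -n /dev/zd0   # read-only check, makes no changes
    sudo xfs_repair /dev/zd0      # refuses because of the dirty log
    sudo xfs_repair -L /dev/zd0   # zeroes the log; this is the one failing with the write I/O error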

I am simply out of ideas. The only good thing is that it is only the second copy of the backup, and I could just start from scratch, but it takes weeks to recopy all the data. Also, if it happened once, it can happen again, and I don't want to be there the day we need the backup and find it has happened again.

Sabsoun

2 Answers


I found the solution by accident in my Reddit feed today, one day after posting my question here; someone on Reddit had the same symptoms (Reddit Post).

Cause:

The problem seems to be that the pool is full, although I do not know how, because it should only be about half full; but that's a problem for another day. With no free space left in the pool, writes to the zvol fail, which is why the XFS log recovery on mount and xfs_repair both ended in write I/O errors.
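Roughly how to confirm this (these are the standard ZFS space listings, not commands from the Reddit post; adjust the pool/dataset names to your setup):

    zpool list zpool01                          # overall capacity and free space
    zfs list -o space zpool01                   # where the space went (snapshots, refreservation, children)
    zfs get quota,used,available zpool01/veeam  # the dataset I later raised the quota on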

Solution:

One way is obviously to add more drives if possible. As this was not possible in my situation, I had to take another approach; thankfully, the solution was in the Reddit post too. I increased the value of /sys/module/zfs/parameters/spa_slop_shift to 15, which reduces the space ZFS keeps in reserve. That allowed me to increase the quota on zpool01/veeam by another 1TB (sudo zfs set quota=61T zpool01/veeam). With the newly usable space, I was able to mount the XFS filesystem normally again, delete some files, and lower the retention for now.
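For reference, the whole sequence boiled down to this (values are from my setup; spa_slop_shift defaults to 5, and raising it shrinks the slop space ZFS keeps in reserve, so treat it as a temporary emergency measure):

    # temporarily shrink the ZFS slop reserve (default spa_slop_shift is 5)
    echo 15 | sudo tee /sys/module/zfs/parameters/spa_slop_shift

    # give the dataset another 1TB of quota
    sudo zfs set quota=61T zpool01/veeam

    # the XFS on the zvol mounts again now
    sudo mount -t xfs /dev/zd0 /mnt/veeam_repo_prod

    # after deleting data / lowering retention, the default reserve can be restored
    echo 5 | sudo tee /sys/module/zfs/parameters/spa_slop_shift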

Sabsoun

You're running an XFS filesystem on top of a ZFS zvol, i.e. stacking filesystems. It's possible for XFS to break while the underlying ZFS pool reports fine.

Can you provide specifics of the hardware and controllers, plus OS and ZFS version details?

Depending upon the nature of your pool's failed drive, there may be repair needed on the ZFS side (since it's unaware of the contents of your XFS filesystem).

  • A real zpool scrub is a good start.
  • There may be valuable output in dmesg. Can you post that? (See the command sketch below.)
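Something along these lines would cover it (standard OpenZFS and system tooling; substitute your pool name):

    # OS, kernel, and ZFS versions
    lsb_release -a
    uname -r
    zfs version                  # OpenZFS 0.8+; otherwise: modinfo zfs | grep '^version'

    # pool health and space
    zpool status -v zpool01
    zpool list zpool01
    zfs list -o space zpool01

    # kernel messages from the failed mount
    dmesg | grep -iE 'xfs|zd0'

    # full scrub, then check the result
    sudo zpool scrub zpool01
    zpool status zpool01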

If all else fails, a professional data recovery engagement or using UFS Explorer to recover the data is possible.

Also see: How to recover XFS file system with "superblock read failed"

ewwhite