
xfs_repair segfaults while repairing a backup volume

A brief introduction to the system: there are two RAID6 arrays, each presented to the OS as a single disk of about 160TB, for roughly 320TB in total. The two "hard disks" (RAID6 arrays) form a volume group, and a single logical volume was created on that volume group. This logical volume is used as a backup volume.

During an incident, however, the RAID controller stopped recognizing 3 HDDs in one array. That array failed, which in turn caused critical errors on the filesystem (XFS). To avoid further damage, the logical volume was unmounted first, and the unrecognized HDDs were then unplugged and plugged in again. The RAID controller subsequently ran a verify-and-fix pass on the failed array, which completed.

Description of the problem: the first attempt with xfs_repair did not work. Mounting was not possible either, let alone unmounting. Then xfs_repair -L was tried. The first run of "xfs_repair -L" was killed by the OOM killer. After that, a normal xfs_repair without -L was possible again. However, xfs_repair now always segfaults after repairing one specific inode, reporting (in German):

bad CRC for inode 289910359791, will rewrite
Eintrag enthält unerlaubten Wert im Attribut mit Namen SGI_ACL_FILE
oder SGI_ACL_DEFAULT
removing attribute entry 0 for inode 289910359791
Eintrag enthält unerlaubtes Zeichen in der Kurzform des Attributsnamens
removing attribute entry 0 for inode 289910359791

Translated with DeepL:

bad CRC for inode 289910359791, will rewrite
Entry contains unauthorized value in attribute named SGI_ACL_FILE
or SGI_ACL_DEFAULT
removing attribute entry 0 for inode 289910359791
Entry contains unauthorized character in the short form of the attribute name
removing attribute entry 0 for inode 289910359791

Then xfs_repair crashed. dmesg shows the following:

xfs_repair[16648]: segfault at 7fe5539aa000 ip 00007febb3f9a021 sp 00007febaa7fb718 error 4 in libc-2.31.so[7febb3e09000+1e8000]

The IP, the SP, and the position in libc are the same on every run.
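For readers who want to decode that dmesg line, here is a minimal Python sketch. It assumes the usual x86 format of the kernel's segfault message, in which the error code is printed in hex and its low three bits mean "page was present", "access was a write", and "access came from user mode". The function name decode_segfault and the dictionary keys are made up for illustration. The offset it computes is relative to the start of the mapping the kernel printed; turning it into a file offset inside libc would additionally need that mapping's file offset from /proc/<pid>/maps, which is not part of the dmesg line.

```python
import re

# Shape of the kernel's user-space segfault message on x86 (assumed format).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a dmesg segfault line into its fields (hypothetical helper)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    addr = int(m["addr"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_address": addr,
        # Distance of the faulting instruction from the start of the printed mapping.
        "offset_in_mapping": ip - base,
        # x86 page-fault error code bits.
        "page_was_present": bool(err & 1),
        "was_write": bool(err & 2),
        "from_user_mode": bool(err & 4),
        # True when the faulting address sits exactly on a 4 KiB page boundary.
        "fault_on_page_boundary": addr % 4096 == 0,
    }


line = ("xfs_repair[16648]: segfault at 7fe5539aa000 ip 00007febb3f9a021 "
        "sp 00007febaa7fb718 error 4 in libc-2.31.so[7febb3e09000+1e8000]")
info = decode_segfault(line)
print(hex(info["offset_in_mapping"]), info)
```

For the line above this reports a user-mode read of a page that was not mapped, with the faulting address exactly on a page boundary. That would be consistent with a libc routine reading past the end of a buffer, but this is only a guess until a backtrace is available.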

Hardware and software information:

Intel(R) Xeon(R) Bronze 3204 CPU, 1.90GHz
128GB RAM
OS: openSUSE 15.5, kernel version 5.14.21
xfs_repair (xfsprogs) version 6.8.0

Would a memtest be helpful? Or should a newer xfs_repair version be pulled and built for further repair attempts? Does anyone have suggestions on how to solve this problem? Sincere thanks in advance.

Georg

1 Answer


When a program crashes on possibly corrupt input, the bad data itself is a likely cause, especially if the crash is reproducible. In this case the bad data could come from the block device or from the file system on it.

Check for hardware faults anyway. Use rasdaemon on Linux, or otherwise check ECC errors and similar.
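As one way to "otherwise check ECC errors", the kernel's EDAC subsystem exposes per-memory-controller counters in sysfs. The following is a minimal Python sketch, assuming an EDAC driver is loaded for this platform so that /sys/devices/system/edac/mc/mc*/ce_count and ue_count exist. The function name edac_error_counts is made up for illustration.

```python
from pathlib import Path


def edac_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (corrected, uncorrected)
    return counts


for name, (corrected, uncorrected) in edac_error_counts().items():
    print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```

An empty result only means that no counters were found, for example because no EDAC driver is loaded. It does not prove that the memory is healthy.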

Now to debug software faults, rather than hardware ones. Get a human readable backtrace from this crash, to report to the XFS maintainers:

Enable the debuginfo-pool and debuginfo-update repositories.

Install the debug symbols for the application and probably also for libc: xfsprogs-debuginfo and glibc-debuginfo in this case.

The debuginfo must come from the identical build as the binary it describes. A zypper trick to make sure they match: zypper se --provides 'debuginfo(build-id)=$GNU_BUILD_ID', with $GNU_BUILD_ID replaced by the hex Build ID printed by eu-readelf -n /usr/sbin/xfs_repair

Reproduce the crash. Review it with coredumpctl info.

If necessary, run a debugger on it with coredumpctl gdb. I don't know a lot about this debugger, but entering bt at its prompt is enough to get a backtrace.
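The build-id matching step above can be scripted. This is a minimal Python sketch, assuming the notes output contains a line of the form "Build ID: <hex>", which is how eu-readelf -n normally prints it. The helper names and the sample Build ID are made up for illustration, and the script only prints the zypper command instead of running it.

```python
import re
import subprocess


def read_notes(binary="/usr/sbin/xfs_repair"):
    """Run eu-readelf -n on the binary and return its text output."""
    result = subprocess.run(["eu-readelf", "-n", binary],
                            capture_output=True, text=True, check=True)
    return result.stdout


def build_id_from_notes(notes_text):
    """Extract the hex Build ID from readelf notes output, or None if absent."""
    m = re.search(r"Build ID:\s*([0-9a-f]+)", notes_text)
    return m.group(1) if m else None


def zypper_debuginfo_query(build_id):
    """Build the zypper search that finds the debuginfo package for this build."""
    return ["zypper", "se", "--provides", f"debuginfo(build-id)={build_id}"]


# Made-up sample output; on the real machine use read_notes() instead.
sample = """
  Owner          Data size  Type
  GNU                   20  GNU_BUILD_ID
    Build ID: 0123456789abcdef0123456789abcdef01234567
"""
print(" ".join(zypper_debuginfo_query(build_id_from_notes(sample))))
```

Repeat the same lookup for the libc file named in the crash report, so that both sets of symbols match the binaries that actually crashed.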

A crash like this is likely worth reporting to the distribution maintainers and eventually upstream. Since this is an XFS file system, consider using the xfs_metadump program to extract the metadata for whoever supports you. Read its manual page first: it is meant to be run on file systems that are not being modified (unmounted or mounted read-only), and metadata held in the log may be left in plain text. Consider taking an LVM snapshot to use as the source for xfs_metadump and for other debugging and repair tests.
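The snapshot-then-metadump idea can be written down as a dry run. This is a minimal Python sketch that only assembles and prints the two commands. The volume group, logical volume, snapshot name, snapshot size, and output path are all hypothetical placeholders. It also assumes the volume group still has free extents for the snapshot, which may not be the case if the logical volume fills it.

```python
import shlex


def snapshot_and_metadump_plan(vg, lv, snap_name, snap_size, dump_path):
    """Return the commands (as argument lists) for a snapshot plus metadata dump."""
    snapshot_device = f"/dev/{vg}/{snap_name}"
    return [
        # Create a copy-on-write snapshot of the damaged logical volume.
        ["lvcreate", "--snapshot", "--name", snap_name, "--size", snap_size, f"{vg}/{lv}"],
        # Dump the metadata from the snapshot; -g shows progress.
        ["xfs_metadump", "-g", snapshot_device, dump_path],
    ]


# Hypothetical names; nothing is executed, the commands are only printed.
plan = snapshot_and_metadump_plan("backupvg", "backuplv", "repair_snap",
                                  "500G", "/root/backup.metadump")
for command in plan:
    print(shlex.join(command))
```

Printing instead of executing is deliberate: on a volume in this state, each command is better reviewed and run by hand.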

After all that, you still need a way to mount this volume again. Check which backups exist and whether restoring from them would be acceptable. Otherwise, expect the repair to be difficult, given that the simple approach of running xfs_repair has already failed.

John Mahowald