A recent scrub of my ZFS pool uncovered one error:

$ zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 04:48:04 with 1 errors on Thu Oct 24 14:07:21 2024
config:
    NAME                                            STATE     READ WRITE CKSUM
    rpool                                           ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        wwn-0x50014ee216641349-part2                ONLINE       0     0     2
        wwn-0x50014ee216620a52-part2                ONLINE       0     0     2
        wwn-0x50014ee2c1015982-part2                ONLINE       0     0     2
      raidz1-1                                      ONLINE       0     0     0
        ata-WDC_WD3000FYYZ-01UL1B1_WD-WCC1F0293326  ONLINE       0     0     0
        ata-WDC_WD30EZRZ-00Z5HB0_WD-WCC4N2ZYKAKR    ONLINE       0     0     0
        ata-WDC_WD30EZRZ-60Z5HB0_WD-WCC4N0SFH6JA    ONLINE       0     0     0
    logs
      ata-CT500MX500SSD1_2344E8835E4F-part1         ONLINE       0     0     0
    cache
      sdi2                                          ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    rpool/[DATASET]@autosnap_2024-09-30_23:30:07_weekly:[FILEPATH]

(Note: the actual filesystem and file paths are redacted for privacy, but the snapshot name is real.)

Obviously, the error is present in the "autosnap_2024-09-30_23:30:07_weekly" snapshot. But since ZFS is copy-on-write and this file presumably has not changed in some time, I expect the same error to exist in the copies of this file held by other snapshots as well. The zpool status command gives no indication of this, however. Is there something I can do to determine which snapshots contain the corrupted copy of the file and which do not?

2 Answers

Remove the file containing the errors listed in the output.

I don't think there's any other course of action needed.

ewwhite

One simple way I have found to identify every snapshot in which a specific file is corrupted (i.e., has a checksum error) is to run a command like md5sum against that file in each snapshot, e.g.:

md5sum rpool/[DATASET]/.zfs/snapshot/*/[FILEPATH]
# Also check the current copy
md5sum rpool/[DATASET]/[FILEPATH]

Whatever command you choose must read every byte of the file, which is why I chose a checksum program. For any snapshot holding a corrupted copy, md5sum will fail with an I/O error of some kind; for any snapshot whose copy is intact, it will simply print the checksum as normal. This is not a general solution, though: the file could also have been renamed or moved to another folder, in which case the shell wildcard above will not find it in every snapshot. A solution that directly checks which snapshots still reference the corrupted blocks would therefore be preferable.
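The wildcard above can also be wrapped in a small script that labels each snapshot's copy explicitly. This is only a sketch: it assumes the dataset is mounted under /rpool/[DATASET] (adjust to your actual mountpoint), and check_copy is a helper name I made up; substitute your real dataset and file paths for the bracketed placeholders.

```shell
#!/bin/sh
# Sketch: report whether each snapshot's copy of the file reads cleanly.
# Assumes the dataset is mounted at /rpool/[DATASET]; adjust as needed.

check_copy() {
    # md5sum reads every byte of its argument and exits non-zero on an
    # I/O error, which is how a corrupted ZFS copy manifests.
    if md5sum "$1" >/dev/null 2>&1; then
        printf 'OK       %s\n' "$1"
    else
        printf 'CORRUPT  %s\n' "$1"
    fi
}

# Walk every snapshot via the hidden .zfs/snapshot directory.
for snap in /rpool/[DATASET]/.zfs/snapshot/*/; do
    check_copy "${snap}[FILEPATH]"
done

# Also check the live copy.
check_copy "/rpool/[DATASET]/[FILEPATH]"
```

Reading through .zfs/snapshot has the advantage of needing no clones or rollbacks; each snapshot is exposed read-only at that path, so a failed read there pinpoints exactly which snapshots reference the damaged blocks.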