
I am backing up data stored in a zpool consisting of a single raidz vdev with 2 hard disks. During this operation, I got checksum errors, and now the status looks as follows:

  pool: tmp_zpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

    NAME                  STATE     READ WRITE CKSUM
    tmp_zpool             ONLINE       0     0     2
      raidz1-0            ONLINE       0     0     4
        tmp_cont_0        ONLINE       0     0     0
        tmp_cont_1        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /some/file

What I find confusing is that the checksum errors appear at the vdev level, but not at the disk level. Perhaps I should note that one of the hard disks is internal and the other is external (this is a temporary situation). Can this be an issue with the hard drive controllers?

Is there anything I could try in order to get back the affected file? Like clearing the error and importing the pool degraded, with only one of the disks? I didn't even try to read the file again to see what happens. (Not sure whether that would affect anything.)

Update: I gave up waiting for an explanation of what might go wrong if I clear the errors and retry, so I went ahead and tried that. I first did zpool clear, then zpool status showed no errors. Then, I tried to read the files with errors (2 of them in the end), but the respective blocks were still being reported as bad/unreadable. This time, zpool status no longer showed increasing checksum errors. Next, I tried to offline one of the disks in the raidz1 vdev and repeat the process, but the results did not change. In total, I lost 2 128K blocks out of 1.6T.
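For reference, the commands I ran were roughly the following (a sketch; the exact read commands and which disk was offlined are shown only as examples):

    # reset the error counters and recheck the pool status
    zpool clear tmp_zpool
    zpool status -v tmp_zpool

    # try to re-read an affected file; the bad blocks were still reported as unreadable
    dd if=/some/file of=/dev/null bs=128k

    # take one raidz1 member offline, retry the read, then bring it back
    zpool offline tmp_zpool tmp_cont_1
    dd if=/some/file of=/dev/null bs=128k
    zpool online tmp_zpool tmp_cont_1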

Answer Status: Currently, I find there is no comprehensive answer to this question. If somebody wants to write one up or edit an existing one, please address the following:

  1. What could have caused this situation.
  2. What could be done about it.
  3. How it could have been prevented.

For 1, the theories and their problems seem to be:

  • Choice of raidz1 over raidz2. Problem: one needs a minimum of 4 disks for raidz2. While the need for redundancy is clear, it is not useful to repeatedly suggest that the cure for failing redundancy is more redundancy. It would be much more useful to understand how to best use the redundancy you have.

  • Choice of raidz1 over mirror. Problem: At first sight, the difference between these seems to be efficiency, not redundancy. This might be wrong, though. Why: zfs saves a checksum with each block on each disk, but neither disk reported individual checksum errors. This seems to suggest that for every bad block, the 2 disks contained different block payloads, each with a matching checksum, and zfs was unable to tell which one was correct. This in turn suggests there were 2 different checksum calculations, and that the payload somehow changed between them. That could be explained by RAM corruption, and perhaps (this needs confirmation) with a mirror instead of raidz1, only one checksum calculation would have been needed.

  • RAM corruption during writing, not reading. As explained above, this seems plausible. Problem: why was this not detected as an error at write time? Can it be that zfs doesn't check what it writes? Or rather, is it that the block payloads written to the different disks were the same?

For 2:

  • Since the disks have no individual checksum errors, is there some low-level way in zfs to gain access to the 2 different copies of such bad blocks?
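    One speculative idea (untested; I don't know whether this can actually separate the per-disk copies under raidz): find where the damaged block lives, then use zdb to dump raw data from the pool, roughly like this:

      # on ZFS on Linux, the error events record which vdev and offset were involved
      zpool events -v tmp_zpool

      # dump the block pointers (DVAs) of the damaged file's object;
      # <dataset> and <object> are placeholders (the object number is usually the file's inode number, see ls -i)
      zdb -ddddd tmp_zpool/<dataset> <object>

      # dump raw, uninterpreted data at a given vdev:offset:size ('r' flag = raw)
      zdb -R tmp_zpool 0:<offset>:<size>:r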

For 3:

  • Is it clear that mirror over raidz1 would have prevented this situation?

  • I assume a scrub of this zpool would have detected the problem. In my case, I was moving some data around, and I destroyed the source data before I actually read this zpool, thinking that I had 2-disk redundancy. Would the moral here be to scrub a zpool before trusting its contents? Surely scrubbing is useful, but is it necessary? For instance, would a scrub be necessary with a mirror instead of raidz1?
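    The check itself would have been simple, something like the following (using the pool from above), run before destroying the source data:

      # read back and verify every block against its checksum
      zpool scrub tmp_zpool

      # repeat until the scrub finishes, then look at the error summary
      zpool status -v tmp_zpool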

Matei David

2 Answers

This is the problem with raidz1 (and also RAID5). If the data on the disk changes but no drive fault occurs to let ZFS or the RAID controller know which drive caused the error, then it can't know which drive is correct. With raidz2 (and higher) or RAID6, you get a quorum of drives that can decide which drive to ignore for reconstruction.

Your only solution here is to overwrite the file, either by restoring it from a backup copy or by copying /dev/null over it.
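Roughly, with the file from the question and a hypothetical backup location:

    # if a good copy exists somewhere, put it back in place
    cp /backup/some/file /some/file

    # otherwise, copy /dev/null over it (truncates the file to zero length)
    cp /dev/null /some/file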

longneck

I'm running into a similar issue. I'm not sure if it's helpful, but I found this relevant post about vdev-level checksum errors from a FreeBSD developer.

https://lists.freebsd.org/pipermail/freebsd-hackers/2014-October/046330.html

The checksum errors will appear on the raidz vdev instead of a leaf if vdev_raidz.c can't determine which leaf vdev was responsible. This could happen if two or more leaf vdevs return bad data for the same block, which would also lead to unrecoverable data errors. I see that you have some unrecoverable data errors, so maybe that's what happened to you.

Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable to determine which child was responsible for a checksum error. However, I've only seen that happen when a raidz vdev has a mirror child. That can only happen if the child is a spare or replacing vdev. Did you activate any spares, or did you manually replace a vdev?

I myself am considering deleting my zpool.cache file and re-importing my pool to regenerate it.
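If I go ahead with it, the steps would presumably look like this (assuming the default cache file location on Linux; <poolname> is a placeholder):

    # export the pool, remove the stale cache file, then import to rebuild it
    zpool export <poolname>
    rm /etc/zfs/zpool.cache
    zpool import <poolname>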

Michael Hampton