I have a raid5 array that has a check run on it once a month. It is configured so that the check runs for 6 hours from 01:00 and then stops. On the following nights it resumes the check for another 6 hours at a time until it has completed.
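
On openSUSE this schedule is normally driven by the mdcheck_start/mdcheck_continue systemd timers shipped with mdadm, which call the mdcheck script with a duration. Roughly (a sketch based on the upstream mdadm units; exact unit names and options may differ on your system):

    # Inspect the schedule (timers shipped with mdadm; names may vary)
    systemctl cat mdcheck_start.timer mdcheck_continue.timer

    # Approximately what the corresponding services run:
    /usr/share/mdadm/mdcheck --duration "6 hours"              # first night: start the check
    /usr/share/mdadm/mdcheck --continue --duration "6 hours"   # following nights: resume it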

The issue I have is that sometimes, when mdcheck attempts to stop the running check, it hangs. Once this happens the array can still be read, but any process that attempts to write to it hangs.

The array state is as follows:

 md0 : active raid5 sdb1[4] sdc1[2] sdd1[5] sde1[1]
      8790398976 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [========>............]  check = 44.2% (1296999956/2930132992) finish=216065.8min speed=125K/sec
      bitmap: 0/6 pages [0KB], 262144KB chunk

The check = 44.2% (1296999956/2930132992) never advances or stops.

From looking at the /usr/share/mdadm/mdcheck script, it appears that every 2 minutes, until the end time, it reads /sys/block/md0/md/sync_completed and saves the position in a file under /var/lib/mdcheck/. That file is present, is dated 2 minutes before the check was due to stop, and contains the value 2588437040. The current value of sync_completed is 2593999912, which indicates that everything was still working 2 minutes before the stop time.
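
In other words, the script's behaviour boils down to a loop along these lines (a simplified sketch of the logic, not the verbatim script; the checkpoint file name is a placeholder):

    # Simplified sketch of mdcheck's checkpointing loop (illustrative only)
    end=$(( $(date +%s) + 6*3600 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        sleep 120
        # record how far the check has got so it can resume the next night
        cut -d ' ' -f 1 /sys/block/md0/md/sync_completed > /var/lib/mdcheck/md0-checkpoint
    done
    # at the end of the window, ask the kernel to stop the check --
    # this is the write that appears to hang (see the lsof output below)
    echo idle > /sys/block/md0/md/sync_action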

Running lsof on the mdcheck process reveals the following:

 mdcheck 23887 root    1w   REG               0,21     4096     43388 /sys/devices/virtual/block/md0/md/sync_action

This appears to show that the mdcheck process is hanging when trying to stop the check after 6 hours. I confirmed this by running sudo echo idle >/sys/devices/virtual/block/md0/md/sync_action in a terminal, which also hung.
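
For anyone reproducing this, the hang should also be visible with standard process diagnostics (nothing md-specific; the PID is the one from the lsof output above):

    # The hung writer is normally in uninterruptible sleep ("D" state)
    ps -o pid,stat,wchan:32,cmd -p 23887

    # Its kernel stack usually shows where inside md it is blocked
    sudo cat /proc/23887/stack

    # The kernel may also log hung-task warnings for it
    dmesg | grep -i 'blocked for more than'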

The only way I have found to stop the check is to attempt a reboot, which also hangs, and then cycle the power.

How do I stop/unhang mdcheck (and hence the array) without a reboot, and how do I find out what is causing the issue (and resolve it)?

Additional information:

OS: OpenSUSE Leap 15.2

Kernel: 5.3.18-lp152.57-default

Running the consistency check without interruption succeeds.

Running extended self tests on the disks succeeds.

Replacing all the SATA cables has no effect.

Relevant dmesg entries:

[    5.565328] md/raid:md0: device sdb1 operational as raid disk 3
[    5.565330] md/raid:md0: device sdc1 operational as raid disk 2
[    5.565331] md/raid:md0: device sdd1 operational as raid disk 0
[    5.565332] md/raid:md0: device sde1 operational as raid disk 1
[    5.575520] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
[    5.640309] md0: detected capacity change from 0 to 9001368551424
[53004.024693] md: data-check of RAID array md0
[74605.665890] md: md0: data-check interrupted.
[139404.408605] md: data-check of RAID array md0
[146718.260616] md: md0: data-check done.
[1867115.595820] md: data-check of RAID array md0

Output of mdadm --detail /dev/md0:

           Version : 1.2
     Creation Time : Sat Nov  7 09:48:15 2020
        Raid Level : raid5
        Array Size : 8790398976 (8.19 TiB 9.00 TB)
     Used Dev Size : 2930132992 (2.73 TiB 3.00 TB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Feb  2 06:59:55 2021
             State : active, checking
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

      Check Status : 44% complete

              Name : neptune:0  (local to host neptune)
              UUID : 5dd490df:79bf70fa:b4b530bc:47b30419
            Events : 28109

    Number   Major   Minor   RaidDevice State
       5       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       33        2      active sync   /dev/sdc1
       4       8       17        3      active sync   /dev/sdb1

Output of mdadm --examine /dev/sdb1 (all disks are essentially the same):

/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 5dd490df:79bf70fa:b4b530bc:47b30419
           Name : neptune:0  (local to host neptune)
  Creation Time : Sat Nov  7 09:48:15 2020
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
     Array Size : 8790398976 KiB (8.19 TiB 9.00 TB)
  Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=911 sectors
          State : clean
    Device UUID : a40bb655:70a88240:06dfad1d:f7fcbdca

Internal Bitmap : 8 sectors from superblock
    Update Time : Tue Feb  2 06:59:55 2021
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 42b3d6 - correct
         Events : 28109

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 3
    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

1 Answer

It's probably this bug:

If that is indeed your issue, then you can try this workaround (replace md1 with your own device, e.g. md0, first):

echo active | sudo tee /sys/block/md1/md/array_state
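
If the pending write to sync_action then returns, the check should stop. A quick way to confirm (using md0 to match the array in the question, and only standard sysfs/procfs reads) is:

    # the check/resync progress line should be gone from mdstat
    cat /proc/mdstat

    # and sync_action should read "idle" again
    cat /sys/block/md0/md/sync_action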