
This is a Linux Mint 21.1 x64 system which has had disks added to its RAID arrays over the years, until we now have one array of ten 3 TB drives and one array of five 6 TB drives. Four HDs dropped out of the arrays, two from each, apparently as a result of one controller failing. We've replaced the controllers, but that has not restored the arrays to function.

mdadm --assemble reports that it is unable to start either array because of insufficient disks (with two failed in each, I'm not surprised); mdadm --run reports an I/O error (syslog seems to suggest this is because it can't start all the drives, but there is no indication that it even tried to start the two apparently unhappy ones). However, I can still run mdadm --examine against the failed disks and they look absolutely normal. Here's output from a functional drive:

mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : 6e072616:2f7079b0:b336c1a7:f222c711

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr 2 04:30:27 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : 2faf0b93 - correct
         Events : 21397

     Layout : left-symmetric
 Chunk Size : 512K

   Device Role : Active device 9
   Array State : AAAAAA..AA ('A' == active, '.' == missing, 'R' == replacing)

And here's output from a failed drive:

mdadm --examine /dev/sdk
/dev/sdk:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : d62b85bc:fb108c56:4710850c:477c0c06

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr 2 04:27:31 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : d53202fe - correct
         Events : 21392

     Layout : left-symmetric
 Chunk Size : 512K

   Device Role : Active device 6
   Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

Edit: Here's the --examine report from the second failed drive; as you can see, it failed at the same time the entire array fell off line.

# mdadm --examine /dev/sdl
/dev/sdl:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : 35ebf7d9:55148a4a:e190671d:6db1c2cf

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr 2 04:27:31 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : c13b7b79 - correct
         Events : 21392

     Layout : left-symmetric
 Chunk Size : 512K

   Device Role : Active device 7
   Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

The second array, 5 x 6 TB, fell off line two minutes later when two of its disks quit. The two failed disks on that array, and the two on this one, were all connected to a single 4-port SATA controller card, which of course has now been replaced.

The main thing I find interesting about this is that the failed drives seem to report themselves as alive, but mdadm doesn't agree with them. journalctl doesn't seem to go back as far as 2 April, so I may not be able to find out what happened. Does anyone have any ideas about what I can do to bring this beast back online?

tsc_chazz

1 Answer

  1. Always make image-level backups of all drives in the array before attempting any potentially destructive mdadm commands. With these backups at hand you can later retry the recovery in a VM, outside the box (see the imaging sketch after this list).
  2. Examine the Update Time field for the failed drives in the output of mdadm --examine /dev/sdX to determine the exact sequence in which the drives fell out of the array (a loop that pulls out the relevant fields follows below). Sometimes the first drive failure goes unnoticed, and bringing that stale drive back online results in a catastrophic failure when the filesystem is mounted.
  3. In your case both drives failed at once, so it should be safe to force the array online with mdadm --assemble --force /dev/mdX or mdadm --assemble --force --scan. If that were not the case, you should force online only the last drive that fell out of the array, by specifying the array member drives explicitly: mdadm --assemble --force /dev/mdX /dev/sda /dev/sdb missing /dev/sdd. Note that the order of the drives is important.
  4. Since you were able to get things going only with an explicit device list for --assemble, I believe your array is currently running degraded with that /dev/sdh marked offline. Check the output of cat /proc/mdstat to confirm this, make a backup, troubleshoot your hardware, and only then rebuild the array completely (see the assembly and verification sketch after this list).
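
For point 1, a minimal imaging sketch; the destination /mnt/backup and the member names /dev/sda and /dev/sdk are only placeholders, substitute your actual drives and a target with enough free space:

# GNU ddrescue copies whatever is readable and records problem areas in a map file;
# plain dd also works if the drives have no bad sectors
ddrescue -n /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map
ddrescue -n /dev/sdk /mnt/backup/sdk.img /mnt/backup/sdk.map
# repeat for every member of both arrays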
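
For point 2, something along these lines prints the relevant fields for every member in one pass; the /dev/sd[a-j] glob is just a placeholder for your actual ten members, whatever their current names:

for d in /dev/sd[a-j]; do
    echo "== $d"
    mdadm --examine "$d" | grep -E 'Update Time|Events|Device Role|Array State'
done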
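
And for points 3 and 4, the forced assembly plus the follow-up checks might look roughly like this, assuming the array is /dev/md0 and again using /dev/sd[a-j] as a stand-in for the real member list:

mdadm --stop /dev/md0                            # clear any half-assembled state first
mdadm --assemble --force /dev/md0 /dev/sd[a-j]
cat /proc/mdstat                                 # is the array up, and is it degraded?
mdadm --detail /dev/md0                          # per-device state before any rebuild
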
Peter Zhabin