23

We have a Dell PowerEdge T410 server running CentOS, with a RAID-5 array containing 5 Seagate Barracuda 3 TB SATA disks. Yesterday the system crashed (I don't know how exactly and I don't have any logs).

Upon booting into the RAID controller BIOS, I saw that out of the 5 disks, disk 1 was labeled as "missing" and disk 3 was labeled as "degraded." I forced disk 3 back online, and replaced disk 1 with a new hard drive (of the same size). The BIOS detected this and began rebuilding disk 1 - however, it got stuck at 1%. The spinning progress indicator did not budge all night; it was totally frozen.

What are my options here? Is there any way to attempt rebuilding, besides using some professional data recovery service? How could two hard drives fail simultaneously like that? Seems overly coincidental. Is it possible that disk 1 failed, and as a result disk 3 "went out of sync?" If so, is there any utility I can use to get it back "in sync?"

peterh
  • 5,017

8 Answers

39

You have a double disk failure. This means your data is gone, and you will have to restore from a backup. This is why we aren't supposed to use RAID 5 on large disks: you want to set up your RAID so that it can always withstand two disk failures, especially with large, slow disks.

Basil
  • 8,931
37

Your options are:

  1. Restoring from backups.
    • You do have backups, don't you? RAID is not a backup.

  2. Professional data recovery
    • It's possible, though very expensive and not guaranteed, that a professional recovery service will be able to recover your data.

  3. Accepting your data loss and learning from the experience.
    • As noted in the comments, large SATA disks are not recommended for a RAID 5 configuration because of the chance of a double failure during rebuild causing the array to fail.
      • If it must be parity RAID, RAID 6 is better, and next time use a hot spare as well (a software-RAID sketch of this follows the list).
      • SAS disks are better for a variety of reasons, including greater reliability and resilience and lower rates of unrecoverable read errors (UREs).
    • As noted above, RAID is not a backup. If the data matters, make sure it's backed up, and that your backups are restore-tested.
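
As a hedged illustration of the RAID 6 plus hot spare advice above, here is roughly what it looks like with Linux software RAID (mdadm); the device names are placeholders, and a hardware controller exposes the same choices through its own BIOS or CLI:

    # Hypothetical layout: 5 disks in RAID 6 plus 1 hot spare (device names are placeholders).
    mdadm --create /dev/md0 --level=6 --raid-devices=5 --spare-devices=1 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

    # Confirm the level, the member disks, and the spare.
    mdadm --detail /dev/md0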
HopelessN00b
  • 54,273
27

I am sorry to offer my heretical opinion after you have already accepted a bad answer, but this approach has saved such arrays multiple times already.

Your second failed disk probably has only a minor problem, perhaps a bad block. That is why the bad sync tool of your RAID 5 firmware crashed on it.

You could easily make a sector-level copy of it with a low-level disk cloning tool (gddrescue, for example, is very useful here), and use that clone as your new disk 3. In that case, your array would survive with only minor data corruption.
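
For example, a minimal sketch with GNU ddrescue (packaged as gddrescue on many distributions); /dev/sdX as the failing disk 3 and /dev/sdY as the blank replacement are placeholders, so double-check device names before running anything:

    # Clone the failing disk onto the replacement, retrying bad areas a few times.
    # The map file records progress so the copy can be stopped and resumed.
    ddrescue -d -r3 /dev/sdX /dev/sdY rescue.map

Any sectors ddrescue cannot read are simply left unfilled on the clone, which is the "minor data corruption" mentioned above.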

I am sorry; it is probably too late, because the essence of the orthodox answer in this case is: "multiple failures in a RAID 5, here is the apocalypse!"

If you want a really good, redundant RAID, use software RAID in Linux. Its RAID superblock data layout, for example, is public and documented... I am really sorry for this second heretical opinion.
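
As a hedged illustration of what that documented, public format buys you (device names are placeholders):

    # Inspect the md superblock on a member disk: the array UUID, RAID level,
    # this member's role, and its event counter are all readable and documented.
    mdadm --examine /dev/sdb

    # Try to reassemble the array from whatever members are still usable, then check its state.
    mdadm --assemble --scan
    cat /proc/mdstat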

peterh
  • 5,017
4

Simultaneous failure is possible, even probable, for the reasons others have given. The other possibility is that one of the disks had failed some time earlier, and you weren't actively checking it.

Make sure your monitoring would pick up a RAID volume running in degraded mode promptly. Maybe you didn't have that option here, but it's never good to have to learn about these things from the BIOS.
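
As a hedged example of what prompt detection can look like with Linux software RAID (a hardware controller such as the PERC in a T410 needs its vendor's management tool instead):

    # Quick manual check: a degraded md array shows a missing member, e.g. [UUUU_].
    cat /proc/mdstat

    # Run mdadm's built-in monitor as a daemon and mail an alert on failure or degraded events.
    mdadm --monitor --scan --daemonise --mail=root@localhost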

richardb
  • 1,296
2

To answer "How could two hard drives fail simultaneously like that?" precisely, I'd like to quote from this article:

The crux of the argument is this. As disk drives have become larger and larger (approximately doubling in two years), the URE (unrecoverable read error) has not improved at the same rate. URE measures the frequency of occurrence of an Unrecoverable Read Error and is typically measured in errors per bits read. For example an URE rate of 1E-14 (10 ^ -14) implies that statistically, an unrecoverable read error would occur once in every 1E14 bits read (1E14 bits = 1.25E13 bytes or approximately 12TB).

...

The argument is that as disk capacities grow, and URE rate does not improve at the same rate, the possibility of a RAID5 rebuild failure increases over time. Statistically he shows that in 2009, disk capacities would have grown enough to make it meaningless to use RAID5 for any meaningful array.

So, RAID 5 was already unsafe in 2009, and RAID 6 soon will be too. As for RAID 1, I have started making those arrays out of 3 disks. RAID 10 with 4 disks is also precarious.
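
To make that concrete for an array like the one in the question, here is a rough, hedged estimate (assuming the 4 surviving 3 TB disks must be read end to end during the rebuild and a URE rate of 1E-14; real drives and controllers will differ):

    # Back-of-the-envelope chance of hitting at least one URE during a RAID 5 rebuild.
    awk 'BEGIN {
        bits  = 4 * 3e12 * 8            # 4 surviving 3 TB disks read in full, in bits
        p_ure = 1e-14                   # assumed unrecoverable read error rate per bit
        p     = 1 - exp(-bits * p_ure)  # Poisson approximation of "at least one error"
        printf "Expected UREs: %.2f  P(rebuild hits one): ~%.0f%%\n", bits * p_ure, p * 100
    }'

With odds like that, a failed RAID 5 rebuild on disks this size is not a freak accident.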

Halfgaar
  • 8,534
2

This thread is old, but if you are reading it: when a drive fails in a RAID array, check the age of the drives. If you have several disks in a RAID array and they are over 4-5 years old, the chances are good that another drive will fail. **Make an image or a backup** before you proceed. If you think you have a backup, test it to make sure you can read it and restore from it.

The reason is that a rebuild places years' worth of normal wear and tear on the remaining drives as they spin at full speed for hours on end. The more six-year-old drives you have, the greater the chance that another will fail from the stress. If it's RAID 5 and you lose the array, great, you have a backup, but a 2 TB disk will take 8-36 hours to restore depending on the type of RAID controller and other hardware.

We routinely replace the entire RAID drive set on production servers if all the drives are old. Why waste time replacing one drive, then wait until the next one fails in a day, a week, or a month or two? As cheap as drives are, it's just not worth the downtime.

1

Typically, when purchasing drives in a lot from a reputable reseller, you can request that the drives come from different batches, which is important for the reasons stated above. This is also precisely why RAID 1+0 exists. If you had used 6 drives in RAID 1+0, you would have had 9 TB of usable space with immediate redundancy, where no rebuilding of a volume is necessary.
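
As a hedged sketch only (software-RAID equivalent, placeholder device names), a 6-disk RAID 1+0 of 3 TB drives looks like this and yields about 9 TB usable:

    # Hypothetical 6-disk RAID 10 (striped mirrors): 6 x 3 TB raw -> ~9 TB usable.
    mdadm --create /dev/md0 --level=10 --raid-devices=6 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg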

1

If your controller is recognized by dmraid (for instance here) on Linux, you may be able to use ddrescue to recover the failed disk to a new one, and then use dmraid to assemble the array instead of your hardware controller.
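
A hedged sketch of that route, with placeholder device names (whether it works at all depends on dmraid understanding your controller's metadata format):

    # 1. Copy the failed member onto a healthy disk first (see the ddrescue example above).
    ddrescue -d -r3 /dev/sdX /dev/sdY rescue.map

    # 2. Let dmraid look for BIOS/fakeraid metadata and activate the set it finds.
    dmraid -r     # list recognised RAID member disks
    dmraid -s     # show the discovered RAID sets
    dmraid -ay    # activate them; the assembled array appears under /dev/mapper/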