6

I would like to know your strategies for what to do when a disk fails in one of the Hadoop servers.

Let's say I have multiple (>15) Hadoop servers and one namenode, and one of the six disks on a slave stops working; the disks are connected via SAS. I don't care about retrieving the data from that disk, just about general strategies for keeping the cluster running.

What do you do?

wlk

2 Answers

3

We deployed Hadoop. You can specify a replication factor for files, i.e. how many times each file gets replicated. Hadoop has a single point of failure only at the namenode. If you are worried about disks going out, increase the replication factor to 3 or more.
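
The replication factor is normally set cluster-wide via dfs.replication in hdfs-site.xml, and it can also be raised per file. A minimal sketch of the per-file route through the Hadoop Java API (the path and factor here are placeholders, not from the original answer):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // New files will default to this factor; dfs.replication is
            // normally set cluster-wide in hdfs-site.xml.
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(conf);
            // Existing files keep their old factor unless changed explicitly.
            Path file = new Path("/data/example.txt"); // hypothetical path
            fs.setReplication(file, (short) 3);
            fs.close();
        }
    }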

Then if a disk goes bad, it's very simple: throw it out and reformat it. Hadoop adjusts automatically. In fact, as soon as a disk goes out, it starts re-replicating files to maintain the replication factor.
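
To confirm the cluster has healed after a disk failure, `hadoop fsck /` reports under-replicated blocks. Below is a rough sketch of the same check through the Java API, assuming a hypothetical /data directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical directory: compare each block's live replica
            // count against the file's target replication factor.
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                if (status.isDir()) {
                    continue;
                }
                BlockLocation[] blocks =
                        fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation block : blocks) {
                    if (block.getHosts().length < status.getReplication()) {
                        System.out.println("Under-replicated: " + status.getPath());
                        break;
                    }
                }
            }
            fs.close();
        }
    }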

I am not sure why you put such a large bounty on this. You said you don't care about retrieving the data, and Hadoop's only single point of failure is the namenode; all other nodes are expendable.

Amala
3

You mentioned this system was inherited (possibly not up to date) and that the load shoots up, indicating a possible infinite loop. Does this bug report describe your situation?

https://issues.apache.org/jira/browse/HDFS-466

If so, it's been reported as fixed in the latest HDFS 0.21.0 (just released last week):

http://hadoop.apache.org/hdfs/docs/current/releasenotes.html

Disclaimer: To my disappointment, I have yet to need to use Hadoop/HDFS :)

Rob Olmos