41

If one happens to have some server-grade hardware at one's disposal, is it ever advisable to run ZFS on top of a hardware-based RAID1 or some such? Should one turn off the hardware-based RAID and run ZFS on a mirror or a raidz zpool instead?

With the hardware RAID functionality turned off, are hardware-RAID-based SATA2 and SAS controllers more or less likely to hide read and write errors than non-hardware-RAID controllers would?

In terms of non-customisable servers, if one has a situation where a hardware RAID controller is effectively cost-neutral (or even lowers the cost of the pre-built server offering, since its presence improves the likelihood of the hosting company providing complementary IPMI access), should it be avoided at all? Or should it even be sought after?

cnst
  • 14,646

8 Answers

26

The idea with ZFS is to let it know as much as possible about how the disks are behaving. Then, from worst to best:

  • Hardware RAID (ZFS has absolutely no clue about the real hardware),
  • JBOD mode (the issue here being more about any potential expander: less bandwidth),
  • HBA mode, which is the ideal (ZFS knows everything about the disks).

As ZFS is quite paranoid about hardware, the less hiding there is, the better it can cope with any hardware issues. And as pointed out by Sammitch, RAID controller configurations can be very difficult to restore or recreate when the controller fails, which makes recovering the ZFS pool sitting on top of them that much harder.

Regarding standardized hardware that ships with a hardware-RAID controller, just be careful that the controller has a real pass-through or JBOD mode.
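A quick sanity check (a minimal sketch assuming a Linux host; device names are placeholders) is to see whether the operating system can read the drives' own identity and SMART data, which it normally cannot do through a RAID volume:

lsblk -o NAME,MODEL,SERIAL,SIZE   # pass-through disks show their real model/serial, not a logical volume
smartctl -i /dev/sda              # SMART identity should come straight from the physical drive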

Tmanok
  • 207
Ouki
  • 1,457
17

Q. If one happens to have some server-grade hardware at one's disposal, is it ever advisable to run ZFS on top of a hardware-based RAID1 or some such?

A. It is strongly preferable to run ZFS straight to disk and not make use of any form of RAID in between. Whether a system that effectively requires you to make use of the RAID card precludes the use of ZFS has more to do with the OTHER benefits of ZFS than it does with data resiliency. Flat out: if there's an underlying RAID card responsible for providing a single LUN to ZFS, ZFS is not going to improve data resiliency. If your only reason for going with ZFS in the first place was improved data resiliency, then you just lost your reason for using it. However, ZFS also provides ARC/L2ARC, compression, snapshots, clones, and various other improvements that you might also want, and in that case perhaps it is still your filesystem of choice.

Q. Should one turn off the hardware-based RAID, and run ZFS on a mirror or a raidz zpool instead?

A. Yes, if at all possible. Some RAID cards allow a pass-through mode; if yours has it, that is the preferable thing to do.
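For example (a sketch only, with hypothetical /dev/disk/by-id names; your paths will differ), with the controller in pass-through or HBA mode you would hand ZFS the raw disks directly:

zpool create -o ashift=12 tank mirror \
    /dev/disk/by-id/ata-DISKMODEL_SERIALA \
    /dev/disk/by-id/ata-DISKMODEL_SERIALB
zpool status tank   # each physical disk should show up as its own member of the mirror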

Q. With the hardware RAID functionality turned off, are hardware-RAID-based SATA2 and SAS controllers more or less likely to hide read and write errors than non-hardware-RAID controllers would?

A. This is entirely dependent on the RAID card in question. You'll have to pore over the manual or contact the manufacturer/vendor of the RAID card to find out. Some very much do, yes, especially if 'turning off' the RAID functionality doesn't actually completely turn it off.

Q. In terms of non-customisable servers, if one has a situation where a hardware RAID controller is effectively cost-neutral (or even lowers the cost of the pre-built server offering, since its presence improves the likelihood of the hosting company providing complementary IPMI access), should it be avoided at all? Or should it even be sought after?

A. This is much the same question as your first one. Again: if your only reason to use ZFS is an improvement in data resiliency, and your chosen hardware platform requires a RAID card to provide a single LUN to ZFS (or multiple LUNs that you have ZFS stripe across), then you're doing nothing to improve data resiliency, and your choice of ZFS may not be appropriate. If, however, you find any of the other ZFS features useful, it may still be.

I do want to add an additional concern: the above answers rely on the idea that the use of a hardware RAID card underneath ZFS does nothing to harm ZFS beyond removing its ability to improve data resiliency. The truth is that it's more of a gray area. There are various tuneables and assumptions within ZFS that don't necessarily operate as well when handed multi-disk LUNs instead of raw disks. Most of this can be negated with proper tuning, but out of the box, ZFS on top of large RAID LUNs will not be as efficient as it would have been on top of individual spindles.
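As one illustration (a sketch; the pool name and values here are assumptions, and the right numbers depend on your array's stripe geometry and workload), the kind of tuning in question includes checking the pool's ashift and matching dataset recordsize to the underlying stripe:

zdb -C tank | grep ashift           # the sector alignment ZFS chose at pool creation
zfs set recordsize=128K tank/data   # per-dataset record size, tuned to the RAID LUN's stripe/workload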

Further, there's some evidence to suggest that the very different manner in which ZFS talks to LUNs, compared with more traditional filesystems, often invokes code paths in the RAID controller and workloads that they're not used to, which can lead to oddities. Most notably, you'll probably be doing yourself a favor by disabling the ZIL functionality entirely on any pool you place on top of a single LUN if you're not also providing a separate log device, though of course I'd highly recommend you DO provide the pool a separate raw log device (one that isn't a LUN from the RAID card, if at all possible).
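Concretely (a sketch with hypothetical pool, dataset, and device names), providing a separate log device, or failing that relaxing synchronous semantics per dataset, which is how "disabling the ZIL" is usually achieved on current OpenZFS, looks roughly like this:

zpool add tank log /dev/disk/by-id/nvme-FAST_SSD_SERIAL   # dedicated SLOG on a raw device
zfs set sync=disabled tank/somedataset                    # trades sync-write safety for speed on that dataset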

Nex7
  • 2,055
13

I run ZFS on top of HP ProLiant Smart Array RAID configurations fairly often.

Why?

  • Because I like ZFS for data partitions, not boot partitions.
  • Because Linux and ZFS boot probably isn't foolproof enough for me right now.
  • Because HP RAID controllers don't allow RAW device passthrough. Configuring multiple RAID 0 volumes is not the same as RAW disks.
  • Because server backplanes aren't typically flexible enough to dedicate drive bays to a specific controller or split duties between two controllers. These days you see 8 and 16-bay setups most often. Not always enough to segment the way things should be.
  • But I still like the volume management capabilities of ZFS. The zpool allows me to carve things up dynamically and make the most use of the available disk space.
  • Compression, ARC and L2ARC are killer features!
  • A properly engineered ZFS setup atop hardware RAID still gives good warning and failure alerting, and outperforms the hardware-only solution.

An example:

RAID controller configuration.

[root@Hapco ~]# hpacucli ctrl all show config

Smart Array P410i in Slot 0 (Embedded)    (sn: 50014380233859A0)

   array B (Solid State SATA, Unused Space: 250016  MB)
      logicaldrive 3 (325.0 GB, RAID 1+0, OK)

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, Solid State SATA, 240.0 GB, OK)
      physicaldrive 2I:1:8 (port 2I:box 1:bay 8, Solid State SATA, 240.0 GB, OK)

block device listing

[root@Hapco ~]# fdisk  -l /dev/sdc

Disk /dev/sdc: 349.0 GB, 348967140864 bytes
256 heads, 63 sectors/track, 42260 cylinders
Units = cylinders of 16128 * 512 = 8257536 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       42261   340788223   ee  GPT

zpool configuration

[root@Hapco ~]# zpool  list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
vol1   324G  84.8G   239G    26%  1.00x  ONLINE  -

zpool detail

  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 0h4m with 0 errors on Sun May 19 08:47:46 2013
config:

        NAME                                      STATE     READ WRITE CKSUM
        vol1                                      ONLINE       0     0     0
          wwn-0x600508b1001cc25fb5d48e3e7c918950  ONLINE       0     0     0

zfs filesystem listing

[root@Hapco ~]# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
vol1            84.8G   234G    30K  /vol1
vol1/pprovol    84.5G   234G  84.5G  -
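
For reference, a pool like the one above could have been created roughly as follows (a sketch, not the exact commands used; the WWN is the Smart Array logical drive from the zpool output above, and the zvol size is hypothetical):

zpool create -o ashift=12 vol1 /dev/disk/by-id/wwn-0x600508b1001cc25fb5d48e3e7c918950
zfs set compression=lz4 vol1
zfs create -V 100G vol1/pprovol   # the zvol shown in the zfs list output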
ewwhite
  • 201,205
6

Typically you should never run ZFS on top of disks configured in a RAID array. Note that ZFS does not have to run in a RAID mode; you can just use individual disks. However, virtually 99% of people run ZFS for the RAID portion of it. You could just run your disks in striped mode, but that is a poor use of ZFS. Like other posters have said, ZFS wants to know a lot about the hardware. It should only be connected to a RAID card that can be set to JBOD mode, or preferably connected to an HBA. Jump onto the Freenode IRC channel #openindiana; any of the ZFS experts in the channel will tell you the same thing. Ask your hosting provider for JBOD mode if they will not give you an HBA.

chris
  • 61
4

Everybody says that ZFS on top of RAID is a bad idea, without even providing a link. Yet the developers of ZFS, Sun Microsystems, recommend running ZFS on top of HW RAID as well as on ZFS mirrored pools for Oracle databases.

The main argument against HW RAID is that it can't detect bit rot the way a ZFS mirror can. But that's not quite true: there is T10 PI (Protection Information) for that. You can use T10 PI-capable controllers (at least all of the LSI controllers I have used are), and the majority of enterprise disks are T10 PI capable. So, if it is appropriate for you, you can build a T10 PI-capable array, create a ZFS pool without redundancy on top of it, and just make sure you follow the guidelines for your use case in the article. Though it is written for Solaris, IMHO it applies to other operating systems as well.

The benefit for me is that replacing a disk behind a HW controller is much easier (especially in my case, because I don't use the whole disk for the zpool, for performance reasons). It requires no OS-level intervention at all and can be done by the client's staff.

The downside is that you have to make sure the disks you buy are actually formatted with T10 PI enabled, because some of them are T10 PI-capable but are sold formatted as regular disks. You can reformat them yourself, but it's not very straightforward and is potentially dangerous if you interrupt the process.
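To check whether a drive is actually formatted with protection information (a sketch using sg3_utils; the device name is a placeholder):

sg_readcap --long /dev/sdX   # the "Protection: prot_en=..., p_type=..." line shows whether PI is enabled and which type
# Re-formatting a PI-capable drive is done with sg_format; read its man page carefully first,
# since interrupting the format can leave the drive unusable.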

Alek_A
  • 367
  • 2
  • 9
2

In short: using RAID below ZFS simply kills the idea of using ZFS. Why? Because ZFS is designed to work on raw disks, not on RAID arrays.

poige
  • 9,730
  • 3
  • 28
  • 53
1

For all of you... ZFS on top of any RAID is a total PAIN and only mad people do it!... just like using ZFS with non-ECC memory.

Some examples will make this clearer:

  • ZFS over RAID1: a bit flips on one disk while the machine is powered off. The RAID controller does not notice the change and thinks both disks are fine, so ZFS may or may not see the damage depending on which disk the controller happens to read from. If the corruption lands in the vdev metadata, the whole zpool loses all its data forever.
  • ZFS over RAID0: a bit flips on one disk while the machine is powered off. The RAID controller does not notice the change and thinks the data is fine. ZFS will see the damage, but with no redundancy it cannot repair it, and if the corruption lands in the vdev metadata, the whole zpool loses all its data forever.

Where ZFS is good is in detecting bits that changed while the disks were powered off (RAID controllers cannot do that), and in detecting anything that changes without having been asked to.

It is the same problem as when a bit in a RAM module flips spontaneously: if the memory is ECC, it corrects itself; if not, the data has changed and will be sent to the disks in its modified form. Pray that the change does not land in the vdev metadata, because if it does, the whole zpool loses all its data forever.

That is a weakness of ZFS: a failed vdev means all the pool's data is lost forever.

Hardware RAID and software RAID cannot detect spontaneous bit changes; they have no checksums. It is worst at the RAID1 levels (mirrors): they do not read all copies and compare them, they assume all copies always hold the same data, ALWAYS (I say it loudly). RAID assumes the data has not been changed by any other means... but disks (like memory) are prone to spontaneous bit flips.

Never ever use ZFS with non-ECC RAM, and never ever use ZFS on RAIDed disks: let ZFS see all the disks, and do not add a layer that can ruin your vdevs and your pool.

How to simulate such a failure: power off the PC, take one disk out of that RAID1 and flip just one bit, then reconnect it and see how the RAID controller cannot know that anything changed... ZFS can, because every read is checked against the checksum, and if it does not match, the data is read from another copy. RAID never reads again unless the read fails at the hardware level ("hey, I cannot read from there, hardware fail"); if the read succeeds, RAID thinks the data is OK (but in such cases it is not). ZFS reads from another disk both when the checksum does not match and when the hardware reports a read failure.
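You can reproduce the ZFS side of this safely with file-backed vdevs (a throwaway sketch; paths and sizes are arbitrary and nothing here touches real disks):

truncate -s 512M /var/tmp/vdev1 /var/tmp/vdev2
zpool create testpool mirror /var/tmp/vdev1 /var/tmp/vdev2
dd if=/dev/urandom of=/testpool/junk bs=1M count=100                       # put some data in the pool
dd if=/dev/urandom of=/var/tmp/vdev1 bs=1M seek=32 count=16 conv=notrunc   # corrupt one side behind ZFS's back
zpool scrub testpool
zpool status -v testpool   # should show CKSUM errors on vdev1, repaired from the other copy
zpool destroy testpool
rm /var/tmp/vdev1 /var/tmp/vdev2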

I hope I have made it very clear... ZFS on top of any level of RAID is a total pain and a total risk to your data! As is ZFS on non-ECC memory.

But what nobody says (except me) is:

  • Do not use disks with internal cache (not only SSHDs, but also some drives with 8 MiB to 32 MiB of cache, etc.)... some of them use non-ECC memory for that cache.
  • Do not use SATA NCQ (a way of queuing writes), because it can ruin ZFS on power loss.

So what disks to use?

  • Any disk with an internal battery that guarantees the whole queue gets written to the platters on power failure, and that uses ECC memory for its cache (sorry, very few disks have all of that, and they are expensive).

But hey, most people do not know any of this and have never had a problem... I tell them: wow, how lucky you are, buy some lottery tickets before the luck goes away.

The risks are there... such coincidences of failures can occur... so the better answer is:

  • Put as few layers as you can afford between ZFS and where the data is really stored (RAM, RAID, NCQ, internal disk cache, etc.).

What do I personally do?

  • Actually add some more layers... I put each 2.5" SATA III 7200 rpm disk in a USB 3.1 Gen 2 Type-C enclosure, connect some of the enclosures to a USB 3.1 Gen 2 Type-A hub plugged into the PC, and others to another hub plugged into a different root port on the PC, and so on.
  • For the system I use the internal SATA connectors with ZFS in a striped (RAID0-like) layout, because I run an immutable Linux system (like a LiveCD) whose content on the internal disks is identical at every boot, and I keep a clone image of the system that I can restore (the system is under 1 GiB). I also use the trick of keeping the system inside a file and cloning it into a RAM-mapped drive at boot, so after boot the whole system runs from RAM; by putting that file on a DVD I can boot the same way, so if the internal disks fail I just boot from the DVD and the system is online again. It is a similar trick to SystemRescueCD, but a bit more complex, because the ISO file can live on the internal ZFS or be the real DVD, and I do not want two different versions.

I hope I could shed a little light on ZFS versus RAID; it is really a pain when things go wrong!

Claudio
  • 27
0

Lots of single-factor answers here, but another factor is caching. If you can get a non-volatile write cache, then you have 'write-back' caching, which is especially important if you are using slower storage devices. This NV cache might be provided by a hardware RAID controller with battery-backed cache, or by a fast SSD used as a landing zone (see ZIL/SLOG for ZFS), or maybe your drives are fast enough that you don't need it. Beware of RAM-only write-back caching: it's volatile, and a server crash will result in losing in-flight writes.
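For the ZFS flavour of that landing zone, attaching a dedicated SLOG might look like this (a sketch; the pool and device names are placeholders, and the log is mirrored because its contents matter until they reach the main pool):

zpool add tank log mirror \
    /dev/disk/by-id/nvme-SSD_SERIALA /dev/disk/by-id/nvme-SSD_SERIALB
zpool status tank   # the pool should now show a "logs" section with the mirrored SLOG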

Chalky
  • 151