I have two PCs running Linux, each with a 2 TB disk, and one small gigabit switch. To build a highly available system on the cheap, I resorted to this stack:

  1. a custom 5.6 kernel with ZFS and DRBD9 on both PCs.
  2. one zvol in a partition of each PC's local disk, with compression enabled and dedup disabled (I tried to enable dedup, but everything hung badly)
  3. dual-primary DRBD9 to mirror between them (a rough sketch of the resource definition follows this list)
  4. OCFS2 on top, to mount the resulting device on both PCs
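
For reference, the resource definition looks roughly like this (a sketch, not my exact config; host names, addresses, zvol paths and node IDs are placeholders):

    # /etc/drbd.d/r0.res - dual-primary sketch with placeholder names
    resource r0 {
        options {
            quorum majority;             # the diskless arbitrator breaks ties
        }
        net {
            protocol C;                  # synchronous replication
            allow-two-primaries yes;     # required to mount OCFS2 on both nodes
        }
        on pc1 {
            node-id   0;
            device    /dev/drbd0;
            disk      /dev/zvol/tank/drbd0;   # the zvol from step 2
            meta-disk internal;
            address   192.168.0.1:7789;
        }
        on pc2 {
            node-id   1;
            device    /dev/drbd0;
            disk      /dev/zvol/tank/drbd0;
            meta-disk internal;
            address   192.168.0.2:7789;
        }
        # a third "on" stanza with "disk none;" covers the arbitrator
    }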

A third, very old machine acts as DRBD arbitrator, with no actual disk space participating in the DRBD mirror.

A second switch and a second NIC are coming to improve availability.

I would like to understand if there is a simpler stack to achieve the same result. I have already discarded some options based on my current knowledge: Lustre (too complex for small environments), BeeGFS (not updated), GlusterFS (does not work with raw devices, only with mounted folders).

EDIT - I've been asked to focus on one question. As the first one has been answered, I kept the second.

Qippur

1 Answer


You are conflating cluster filesystems with distributed filesystems.

What you achieved with your ZVOL+DRBD+OCFS2 setup is a "shared-nothing" clustered filesystem, where DRBD emulates a true shared-block SAN and OCFS2 (or GFS2) provides multiple concurrent mounts by multiple head nodes. In this configuration, you cannot swap layers 2 and 3 (ie: DRBD+ZVOL+OCFS2) because ZFS is not a cluster filesystem: if mounted on two different hosts, it will very quickly corrupt itself (this is true even for ZVOLs, which are little more than hidden files in the root ZFS dataset).
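
To make the layering explicit, such a stack is assembled roughly like this (a sketch with hypothetical pool, volume and mount names; it assumes the DRBD resource and the o2cb cluster stack are already configured on both nodes):

    # on each node: carve a zvol out of the local pool
    zfs create -V 500G -o compression=lz4 tank/drbd0

    # bring up the DRBD resource backed by the zvol; with
    # allow-two-primaries, both nodes can be promoted
    drbdadm up r0
    drbdadm primary r0

    # on ONE node only: create the cluster filesystem
    mkfs.ocfs2 -L shared /dev/drbd0

    # on BOTH nodes: mount the same block device concurrently
    mount -t ocfs2 /dev/drbd0 /mnt/shared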

Lustre, Gluster, Ceph, etc. are distributed filesystems: they use individual filesystems/files/databases on each host, which are combined at the userspace level into a single, multiple-host-spanning (ie: distributed) filesystem.

How can you select between the two approaches? It depends on multiple factors:

  • if cold, async replication is sufficient, you can use zfs send/recv and call it a day (a minimal sketch follows this list)

  • if true realtime replication is required but no hard/immediate HA is needed and manual failover is an option, you can use DRBD in single-primary mode and completely skip the overhead of a cluster filesystem (ie: using plain XFS rather than OCFS2/GFS2); a failover sketch follows the list

  • if used as a big-file store (ie: vm images) and with only a handful of hosts, your current approach is probably the best one (at the cost of added complexity and reduced performance). If you have many nodes, GlusterFS (with the right options - sharding being the first one; see the sketch after this list) can be a reasonable choice, but be sure to follow the mailing list (it has many gotchas)

  • if you need a "large NAS" for storing many medium-sized files (1-128M), GlusterFS in replica mode can be the right choice (again, be sure to follow the mailing list)

  • if you have many nodes and large sysadmin resources (read: a dedicated team), you can consider Lustre or Ceph, which are the higher-end options among distributed filesystems.
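
For the first option, async replication boils down to periodic snapshots shipped with zfs send/recv; a minimal sketch, with placeholder dataset, snapshot and host names:

    # initial full copy to the standby host
    zfs snapshot tank/data@base
    zfs send tank/data@base | ssh standby zfs recv tank/data

    # later: ship only the delta since the previous snapshot
    zfs snapshot tank/data@s1
    zfs send -i @base tank/data@s1 | ssh standby zfs recv tank/data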
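
For the single-primary DRBD case, failover is a short manual procedure instead of a cluster filesystem (again a sketch, with placeholder resource and mount names):

    # on the old primary (if still reachable): release the device
    umount /mnt/data
    drbdadm secondary r0

    # on the surviving node: promote and mount the plain XFS
    drbdadm primary r0
    mount -t xfs /dev/drbd0 /mnt/data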
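
For the GlusterFS options above, a replicated volume with sharding enabled looks roughly like this (hypothetical host, brick and volume names; note that bricks are mounted directories, not raw devices):

    # 2-way replica plus a metadata-only arbiter brick as tiebreaker
    gluster volume create gv0 replica 3 arbiter 1 \
        pc1:/bricks/gv0 pc2:/bricks/gv0 arb:/bricks/gv0

    # split big files (eg: vm images) into shards for faster healing
    gluster volume set gv0 features.shard on
    gluster volume start gv0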

I strongly advise you to keep things as simple as possible, even in the face of reduced availability (unless you really need it): storage administration is a complex task which requires a profound understanding of all the moving parts to avoid burning yourself (and eating your data).

NOTE: you can read here for complementary information

shodanshok