5

I'm trying to figure out the lowest-hassle way to provision 24x locally attached SSDs as a large logical volume with low-value data. I'm using them as a hot-set cache for data whose master state (about a petabyte) resides in S3, so I care more about performance, maintenance complexity, and downtime than about lost data. Nothing will linger in the hot data set for more than a couple of days, and it's all easy to recreate from S3 anyway.

  • Medium-large instance: 32x vCPUs, 120GB RAM, Skylake
  • 24x locally attached SSDs @ 375GB each = 9TB total
  • Hosted on Google Cloud (GCP)
  • Debian 10 (Buster)
  • Access is ~4x heavier on read than write
  • A high number of concurrent users (human and machine) with fairly random access patterns, all very hungry for I/O.
  • 90% of files are larger than 10MB

I'm thinking RAID 5 is out of the question; there's no chance I'm going to wait for manual rebuilds. I'm inclined toward either RAID 0 or RAID 10, or... maybe this is actually a case for a simple LVM pool with no RAID at all? Do I really lose anything by going that relatively simpler route in this case?

My ideal solution would have each subdir of / (I have one self-contained dataset per subdir) completely contained on a single disk (I can fit maybe 10 subdirs on each drive). If a drive failed, I'd have a temporary outage of the subdirs/datasets on that drive, but an easy-to-reason-about set of "these data sets are redownloading and not available". Then I'd just rebuild the missing data sets from S3 on a new drive. I suspect LVM JBOD (not sure of exactly the right term for this?) might come closest to replicating this behavior.

Seth
  • 171

5 Answers

4

You appear to be contradicting your needs - "My ideal solution would have each subdir (I have one self-contained dataset per subdir) of / completely contained on a single disk" tells you that you don't want RAID, LVM or any abstraction technology - surely the solution here would be to simply mount each disk individually. The disadvantage is that you are likely to waste disk space, and if a data set grows you will need to spend more time juggling it. (I expect you know Unix can mount drives at arbitrary places in the filesystem tree, so with a bit of thought it should be easy enough to make the drives visible as a logical tree structure - see the sketch below.)
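A minimal sketch of that individual-disk layout, assuming XFS and illustrative device/mount names (GCP exposes local SSDs under /dev/disk/by-id/, but verify the exact naming on your instance):

```
# Sketch only: device paths and mount points are assumptions, not taken from the question.
for i in $(seq 0 23); do
    dev="/dev/disk/by-id/google-local-ssd-$i"          # check the actual names on your VM
    mnt="/srv/hotcache/disk$(printf '%02d' "$i")"
    mkfs.xfs -f "$dev"
    mkdir -p "$mnt"
    mount -o noatime "$dev" "$mnt"
done
```

Bind mounts (or plain symlinks) can then present the per-disk directories as whatever logical tree the applications expect to see.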

You talk about JBOD or RAID0. If you do decide on a combined-disk solution, RAID0 will give you better read performance in most cases, as data is striped across the disks. RAID10 would buy you redundancy you said you don't need. JBOD is only really useful if you have disks of different sizes, and you would be better off using LVM instead, as it can behave the same way but gives you the flexibility to move data around (sketched below).
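For illustration, a hedged sketch of what an LVM "JBOD" (linear concatenation) would look like, with hypothetical device and volume names:

```
# Illustrative only: /dev/sdb..sdy and the names vg_hot/lv_hot are assumptions.
pvcreate /dev/sd{b..y}
vgcreate vg_hot /dev/sd{b..y}
lvcreate -n lv_hot -l 100%FREE vg_hot     # linear (concatenated) by default, no striping
mkfs.xfs /dev/vg_hot/lv_hot

# The flexibility mentioned above: to retire a disk, add a replacement,
# migrate its extents off, and drop it from the group while staying mounted.
vgextend vg_hot /dev/sdz
pvmove /dev/sdq
vgreduce vg_hot /dev/sdq
```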

I can see edge cases where LVM would help over individual disks, but in general any such scenario is likely to add more complexity than it gives useful flexibility here - particularly bearing in mind the initial statement about data sets being bound to disks.

Where you might want to spend some effort is looking at the most appropriate file system and tuning parameters.

davidgo
  • 6,504
2

I care more about performance, maintenance complexity, and downtime than about lost data.

Maximizing performance indicates you need some form of RAID-0, RAID10, or LVM striping. Complexity of maintenance rules out something like segmenting the disks by subdirectory (as another answer mentions, that means volume juggling). Minimizing downtime means you need some form of redundancy, since without it the loss of one drive takes the whole array down, which you'd then have to rebuild - I read that as "downtime". Degraded mode on RAID-5 likely also rules it out for performance reasons.

So I'd say your options are RAID10, or RAID-1 + LVM. LVM offers some added ability to manage the size of the volume, but a lot of that would disappear if you're going to mirror it with RAID-1 anyway. According to this article, https://www.linuxtoday.com/blog/pick-your-pleasure-raid-0-mdadm-striping-or-lvm-striping.html, RAID-0 (mdadm striping) offers better performance than LVM striping. A RAID10 setup is sketched below.
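A hedged sketch of the RAID10 option with mdadm, assuming the 24 SSDs appear as /dev/sdb through /dev/sdy (device names and mount point are illustrative):

```
# Illustrative device and array names; adjust to your instance.
mdadm --create /dev/md0 --level=10 --raid-devices=24 /dev/sd{b..y}
mkfs.xfs /dev/md0            # mkfs.xfs picks up the stripe geometry from md automatically
mount -o noatime /dev/md0 /srv/hotcache
```

Note that RAID10 halves usable capacity to roughly 4.5TB of the 9TB raw.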

2

A simpler, more hassle-free setup would be to use software RAID (mdadm) + XFS. If, and only if, you do not care about data and availability, you could use a RAID0 array (sketched below). Otherwise, I strongly suggest using some other RAID layout. I generally suggest RAID10, but it commands a 50% capacity penalty. For a 24x 375GB array you might consider RAID6 or - gasp - even RAID5.
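For reference, a minimal sketch of the RAID0 + XFS case, with assumed device names, including persisting the array definition on Debian:

```
# Illustrative only: device paths and mount point are assumptions.
mdadm --create /dev/md0 --level=0 --raid-devices=24 /dev/sd{b..y}
mkfs.xfs /dev/md0            # stripe unit/width are detected from the md device
mount -o noatime /dev/md0 /srv/hotcache

# Make the array reassemble on boot (Debian paths).
mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf
update-initramfs -u
```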

The above solution comes with some strings attached. Most importantly, presenting a single block device and skipping any logical volume or group-based storage layer means you will be unable to take snapshots. On the other hand, the XFS allocator is very adept at balancing load across the disks in a RAID0 setup.

Other possible solutions:

  • use XFS over classic LVM over RAID0/5/6: a legacy LVM volume has basically no impact on performance, and enables you both to dynamically partition the block device and to take short-lived snapshots (albeit at a comparatively significant storage cost).

  • use XFS over thin LVM over RAID0/5/6: thin LVM enables modern snapshots with lower storage costs, plus other benefits. If used with a big enough chunk size, performance is good (a setup sketch follows this list).

  • consider using ZFS (in its ZoL incarnation), especially if your data is compressible, as it can provide significant space savings, and possibly even performance advantages. Moreover, as your workload seems quite read-intensive, the ZFS ARC can be more efficient than the traditional Linux page cache.
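A hedged sketch of the thin-LVM option, assuming an mdadm array already exists at /dev/md0 and using illustrative names (vg_hot, tpool, lv_hot):

```
# Assumes /dev/md0 (e.g. a RAID0/5/6 array) already exists; all names are illustrative.
pvcreate /dev/md0
vgcreate vg_hot /dev/md0
# A large chunk size keeps thin-pool metadata overhead low for mostly >10MB files.
lvcreate --type thin-pool -l 90%VG --chunksize 1m -n tpool vg_hot
lvcreate --type thin -V 8T --thinpool tpool -n lv_hot vg_hot
mkfs.xfs /dev/vg_hot/lv_hot
```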

If your data do not compress effectively but are amenable to deduplication, consider layering a VDO device between the RAID block device and the filesystem.

Lastly, be aware that with any sort of LVM, JBOD or ZFS pooling, losing a disk does not just take offline the directories located on that disk; rather, the entire virtual block device becomes unavailable. To get that kind of isolation you would need a separate filesystem on each block device, which means managing all of their various mount points and, more importantly, your storage would not be pooled (i.e. you might run out of space on one disk while others have capacity to spare).

shodanshok
  • 52,255
1

If you genuinely don't care about the data - only about its performance and the speed with which you can rebuild service WHEN it fails, rather than avoiding failure - then, against all my normal better judgement, RAID 0 will be fine.

Obviously it doesn't let you choose what data goes where, but it'll be about as fast as anything I can think of. Yes, it'll definitely fail eventually, but you can just have a script that removes the RAID 0 array, rebuilds it and mounts it - that shouldn't take more than a minute or so at most, and you could even run it automatically when you lose access to the drive (a sketch of such a script follows).
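A hedged sketch of such a rebuild script, assuming a 24-disk RAID 0 at /dev/md0 mounted on /srv/hotcache; every device name and path here is an illustrative assumption:

```
#!/bin/bash
# Tear down and recreate the RAID 0 array, then remount and re-warm from S3.
set -euo pipefail

umount -l /srv/hotcache 2>/dev/null || true
mdadm --stop /dev/md0 2>/dev/null || true
mdadm --zero-superblock /dev/sd{b..y} 2>/dev/null || true   # swap out a dead disk first

mdadm --create --run /dev/md0 --level=0 --raid-devices=24 /dev/sd{b..y}
mkfs.xfs -f /dev/md0
mount -o noatime /dev/md0 /srv/hotcache
# ...then trigger the re-download of the hot data sets from S3.
```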

One small question - you want a 32-vCPU VM using Skylake cores; they don't do a single socket that big, so your VM will be split across sockets, and this might not be as fast as you'd expect. Maybe test performance with 32/24/16 cores to see what the impact would be - it's worth a quick try at least.

Chopper3
  • 101,808
0

Regarding best performance and maintenance complexity, you can use the best practices listed here [1] [2] as a quick reference for what to keep in mind when building an application that uses Cloud Storage.

[1] https://cloud.google.com/storage/docs/best-practices

[2] https://cloud.google.com/compute/docs/disks/performance

Shafiq I
  • 166