
I am deploying a Ceph cluster. The cluster will have three controller servers and 27 OSD nodes. Each OSD node has 3x3.8TB NVMe + 1x1.9TB NVMe disks, for a total of 4 NVMe disks per node. The failure domain of the cluster will be the chassis. The nodes are grouped by chassis, so in the end I will have an OSD tree with 8 chassis, 5 of them containing 3 OSD nodes and 3 of them containing 4 nodes.

What is the best way to use the NVMe disks? So far I have gathered three ideas from the net and other forums:

  • Use all disks in the same cluster, since the difference in size is not that important.
  • Use all disks in the same cluster, but change the weight of the smaller disks from 1.0 to 0.5, since they are half the size of the big ones.
  • Separate the disks into two CRUSH rules to avoid filling up the smaller disks in case of a failure.

So far, the 3rd idea is the one I prefer, but I don't have much experience with Ceph and I'm wondering whether there are potential problems I'm not seeing ahead.
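Here is roughly what I have in mind for the third idea, splitting the disks by custom device class into two CRUSH rules (the class names, OSD IDs, PG counts and pool names below are just placeholders; I haven't tested this):

```
# On each node: tag the 3.8TB OSDs and the 1.9TB OSD with custom device classes
# (osd.0..osd.2 = 3.8TB, osd.3 = 1.9TB are placeholder IDs).
ceph osd crush rm-device-class osd.0 osd.1 osd.2 osd.3
ceph osd crush set-device-class nvme-big osd.0 osd.1 osd.2
ceph osd crush set-device-class nvme-small osd.3

# One replicated CRUSH rule per class, both with chassis as the failure domain.
ceph osd crush rule create-replicated rule-big   default chassis nvme-big
ceph osd crush rule create-replicated rule-small default chassis nvme-small

# Pools then pick the rule that matches the disks they should live on.
ceph osd pool create pool-big 256 256 replicated rule-big
ceph osd pool create pool-small 64 64 replicated rule-small
```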

If you can give me more advice, that would be great.

Regards.

Wodel

1 Answer


In the current state of the question, there are three possible use cases:

1. Big data store and a single OS disk ("3+1, the usual")

The use case for a 3+1 setup is to keep redundancy for the data, while losing the OS disk can be recovered from easily; the data has priority over availability.

1.1 Setup:

The three 3.8TB drives, combined or split, for data; the single 1.9TB drive on its own (e.g. for the OS).

1.2 Pros and Cons

  • If the setup uses redundancy across the 3x3.8TB drives, one of them can fail without data loss.
  • More capacity for data.
  • An outage will occur if the OS disk (1.9TB) breaks.
  • A single disk in a server is either a planned and/or an accepted risk.
  • If the three devices are used like RAID 0, you get a lot of capacity but no failover.
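A minimal sketch of this layout with cephadm, assuming the 1.9TB disk already carries the OS and the three 3.8TB devices show up as /dev/nvme1n1 to /dev/nvme3n1 (the hostname and device paths are placeholders):

```
# Create one OSD per 3.8TB NVMe; the 1.9TB disk stays out of Ceph and holds the OS.
ceph orch daemon add osd osd-node01:/dev/nvme1n1
ceph orch daemon add osd osd-node01:/dev/nvme2n1
ceph orch daemon add osd osd-node01:/dev/nvme3n1

# Verify that only the three big drives show up as OSDs on this host.
ceph osd tree
```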

2. Limiting everything to the smallest drive size for redundancy ("2+2, both with 1.9TB used")

The use case is to have everything redundant. Typical for sensitive and important data, and for servers that should keep operating while one disk has failed.

2.1 Setup:

The 3.8TB drives and the 1.9TB drive in the same pool, with the 3.8TB drives effectively used at 50%. This limits the usable size to roughly 1.9TB per drive, because the smallest drive sets the limit.

2.2 Pros and Cons

  • Depending on the scheme used, one or two drives may fail and the system can still operate.
  • One drive of each pool can fail, if mirroring is used, and the data is still available.
  • With mirroring: 50% of the space is lost and the overall capacity is limited by the smallest drive in a real-world scenario, but availability is better.
  • With splitting (like RAID 0): no redundancy; if one disk of the pool fails, the data in that pool is lost.
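In Ceph terms, the usual way to cap the 3.8TB drives at the small drives' size is to lower their CRUSH weight, which is essentially the asker's second idea. A rough sketch with placeholder OSD IDs and an approximate weight:

```
# CRUSH weight is roughly the capacity in TiB; a 1.9TB drive comes in at about 1.75
# (check `ceph osd tree` for the exact value). Pinning the big OSDs to the small
# drives' weight makes CRUSH place no more data on them than on the 1.9TB OSDs.
ceph osd crush reweight osd.0 1.75
ceph osd crush reweight osd.1 1.75
ceph osd crush reweight osd.2 1.75

# Check the resulting weights and data distribution.
ceph osd df tree
```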

3. Using unequal drives

The use case would be if you don't care about the size difference, IMHO. Ceph can handle this, but it will have some impacts.

3.1 Setup

All available disks in one cluster, regardless of their size.

  • Three 3.8TB drives and one 1.9TB drive per node

3.2 Pros and Cons

In Ceph erasure coding (EC) setups, disks don't necessarily need to be of the same size, but there are some constraints and considerations:

Fragment Size: The size of the fragments into which data is divided during erasure coding is determined by the configuration. This should be chosen to align with the disk sizes to enable efficient utilization.

Parity Distribution: In an EC configuration with double parity, data is divided into fragments and distributed across various OSDs, as are the parity bits. In such a setup, it's possible for OSDs to have different disk sizes as long as the fragments and parity information can be evenly distributed across the OSDs.

Efficiency: To maximize storage efficiency and minimize the impact of unequal disk sizes, it's often advisable to use disks of similar sizes. This facilitates the calculation of fragmentation and ensures even distribution of data and parity across OSDs.

Overall, it's possible to use disks of different sizes in a Ceph EC configuration, but disk sizes should be chosen to avoid compromising system efficiency and to enable even distribution of data.
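As a hedged sketch of what such an EC setup could look like for the cluster in the question (the k/m values, profile name and pool name are assumptions; k+m=6 chunks fit comfortably into the 8 chassis):

```
# EC profile: 4 data chunks + 2 coding chunks, one chunk per chassis,
# restricted to the nvme device class.
ceph osd erasure-code-profile set ec-4-2 \
    k=4 m=2 \
    crush-failure-domain=chassis \
    crush-device-class=nvme

# Pool using the profile; it survives the loss of any two chassis.
ceph osd pool create ec-data 128 128 erasure ec-4-2
```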

Overview

| RAID | Description | Minimum Number of HDDs | Ceph Configuration | Minimum Number of HDDs |
| --- | --- | --- | --- | --- |
| RAID 0 | Striping | 2 | Striping Policy | 2 |
| RAID 1 | Mirroring | 2 | Replication Mode | 2 |
| RAID 5 | Block-Level Striping with Distributed Parity | 3 | Erasure Coding (EC) | 4 |
| RAID 6 | Block-Level Striping with Dual Distributed Parity | 4 | Erasure Coding (EC) with Dual Parity | 6 |
  • I have added the RAID levels so that people who don't usually work with Ceph can understand the mapping ;)
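To make the "Replication Mode" row concrete (the EC rows correspond to the profile sketch above), a short example with a placeholder pool name and PG counts:

```
# Replication mode ("RAID 1"-like): 3 full copies of every object;
# I/O continues as long as at least min_size copies are available.
ceph osd pool create rep-pool 128 128 replicated
ceph osd pool set rep-pool size 3
ceph osd pool set rep-pool min_size 2
```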

And an additional IMHO: please note that the answer to this question can only be given based on opinion rather than fact. In a typical scenario, I would recommend using the 3x3.8TB drives with redundancy in the same cluster as the data storage, and a single NVMe disk for the operating system. The reason is that capping everything to the smaller size leads to lower capacity and a higher chance of data loss, while using more disks for data increases reliability. However, without knowing your specific needs, it is difficult to give a specific recommendation. In most cases, in a 3+1 setup, one of the data disks will fail before the single OS disk, simply because of the workload.

And as always: this may not directly answer your question, but I would be happy if you found it helpful in any case, and feel free to leave an upvote on this answer ;-)

djdomi