
We have a Cassandra cluster running on GKE with a 32-CPU node pool and SSD disks. The cluster comprises 200 nodes, each with a 10 TB SSD disk, for 2 PB of total allocated storage; each node uses about 5 TB on average, so the actual data size is nearly 1 PB.

Given this cluster size, the maintenance costs are substantial. How can we achieve low-cost disaster recovery for such a large cluster?

One option I am considering is creating a new data center in a different region with a replication factor of 1 (RF = 1). While this is not recommended, it would at least make the DR data center a third of the size it would be at RF = 3.
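For reference, the change I'm considering boils down to adding the new DC to the keyspace replication settings. A minimal sketch with the Python driver (the contact point, keyspace, and data center names below are placeholders, not our real ones):

```python
from cassandra.cluster import Cluster

# Placeholder contact point; keyspace and DC names are examples only.
cluster = Cluster(["cassandra.prod.internal"])
session = cluster.connect()

# NetworkTopologyStrategy maps each data center name to its RF, so the
# proposed DR DC would sit at RF = 1 alongside the existing DC at RF = 3.
session.execute("""
    ALTER KEYSPACE my_keyspace
    WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'dc_primary': 3,
      'dc_dr': 1
    }
""")

cluster.shutdown()
```

Existing data would then still need to be streamed into the new DC (e.g. with nodetool rebuild on each new node).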

Any suggestions would be greatly appreciated.

Sai

1 Answer


I understand that cost is a major factor, but it should not be the primary consideration. The real deciding factor is the business requirements.

In over 25 years, across more than a dozen companies in industries including banking and finance, telecommunications, media, and advertising, one thing has been certain: enterprises do not have a reliable disaster recovery plan in place. The recent CrowdStrike global outage is glaring proof of that.

The business requirements determine what DR solution should be in place. The main criteria include things like the time it takes to fail over, the required throughput and capacity of the DR site, the reliability of the DR data center, and so on.

Deciding to add a DC with a replication factor of 1 because you're worried about cost is misinformed. If the application cannot operate because the DR site is under-provisioned, you might as well not have a DR solution at all. Think of the scenario where a node fails with RF = 1: there is no other replica of that node's data, so it's game over.
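To make that concrete, here is a minimal sketch with the Python driver of what happens at RF = 1 when the one replica owning a partition is down (the contact point, keyspace, and table names are hypothetical):

```python
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster, NoHostAvailable
from cassandra.query import SimpleStatement

# Hypothetical DR contact point, keyspace, and table names.
cluster = Cluster(["dr-dc.example.internal"])
session = cluster.connect("my_keyspace")

# With RF = 1 every partition has exactly one replica, so even the
# weakest consistency level needs that one node to be alive.
stmt = SimpleStatement(
    "SELECT * FROM my_table WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)
try:
    rows = session.execute(stmt, (1234,))
except (Unavailable, NoHostAvailable):
    # Zero of the one required replicas are alive for this partition:
    # its data is unreachable until the failed node comes back.
    print("partition unavailable: no surviving replica at RF = 1")

cluster.shutdown()
```

Contrast that with RF = 2 or 3, where the surviving replicas can still serve the same request while the failed node is replaced.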

As a side note, your question is going to attract opinion-based answers, so it is likely to be flagged and voted down. For future reference, have a look at the kinds of questions you should avoid asking. Cheers!

Erick Ramirez