
Context: We are using a MariaDB Galera cluster with (only) 2 master nodes for a web application. Last night we had a power failure, and now we can't seem to recover the data; we found out the database was corrupt on both nodes. Our initial impression of this setup was that if one node goes down, the other would quickly act as the primary node.

My questions are,

  1. Is there a way to set up a cluster so there is always a backup node which will automatically take over if one of the nodes goes down? Especially in the case of a power failure.

  2. What would be the correct implementation of the Galera cluster?

2 Answers


The answer to the first question, as with most issues in computing, is: yes, if you have enough resources and time. If the cluster is in some sort of datacenter environment, one would hopefully have some sort of out-of-band management interface, like dedicated management NICs and/or a KVM system.

Modern datacenter management solutions like Intel's Datacenter Manager or Raritan's datacenter management systems let users set up policies to automatically reboot systems after a power failure, send notifications, and potentially even begin spinning up off-site or cloud-based failover nodes. However, this sort of safety net can require great cost and expertise to set up and configure; it takes a lot of equipment, and thorough testing and preparation are difficult without some downtime.

Another common node management tool is Nagios, which allows for remote power management and control.

In addition to the in-band and out-of-band management options, setting up a configuration management server with a CM tool like Salt or Chef helps ensure that nodes are configured properly and greatly simplifies the task of provisioning new nodes, even in strange or remote environments. The storage and database requirements, as well as the network environment, will help determine the appropriate cluster architecture, particularly with regard to storage, power and backups. In some instances it could be valuable to generate kickstart clones or a similar installation aid like AutoYaST on SUSE systems. That would allow you to quickly build clean nodes and import the necessary data from snapshots or backups in the event of a hardware failure.
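To make the CM idea concrete, here is a minimal sketch of a Salt state (the file name `galera.sls` and the package name `galera-4` are assumptions; package names vary by distribution) that keeps the Galera packages installed and the service running on every node:

```yaml
# galera.sls -- hypothetical minimal Salt state for a Galera node.
# Package names below are assumptions; adjust for your distribution.
galera_packages:
  pkg.installed:
    - pkgs:
      - mariadb-server
      - galera-4

mariadb_service:
  service.running:
    - name: mariadb
    - enable: True
    - require:
      - pkg: galera_packages
```

With a state like this, rebuilding a failed node is mostly a matter of installing the OS, pointing the Salt minion at the master, and applying the state.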

Saving custom system images built with the KIWI build system that import, mount or copy the necessary data is another option. KIWI lets you create images that can be deployed in a variety of scenarios: as VMs, over PXE, as bootable DVD/USB media, and more. Designing the perfect custom image for your needs with KIWI or another operating-system build tool could be quite beneficial for a variety of reasons.

Being more specific about the second question is difficult without knowing what lengths you would consider acceptable. The setup and resources required for a multi-site high-availability cluster with additional remote backups, automatic failover and recovery are drastically different from those required for a cluster where "high availability" means "it needs to work as long as the building the cluster resides in has power and internet." Hopefully some of this information is useful.

Matt

We are using a Galera cluster with 5 nodes behind a load balancer that continuously health-checks all of them. In our configuration, only one node serves as the read/write target for connections from the load balancer, and the other nodes are hot standby. But of course Galera also supports multi-master reads and writes, so you can tune that to your liking.
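A setup like the one above can be sketched in HAProxy; this is a hedged example, not our exact configuration, and the node IPs and check user are placeholders. The `backup` keyword is what makes the standby nodes receive traffic only when the primary fails its health check:

```
# Hypothetical HAProxy listener: one active writer, hot-standby backups.
listen galera
    bind *:3306
    mode tcp
    option tcpka
    option mysql-check user haproxy_check
    server node1 10.0.0.11:3306 check
    server node2 10.0.0.12:3306 check backup
    server node3 10.0.0.13:3306 check backup
```

The `mysql-check` option performs a basic MySQL-protocol handshake; the named user must exist on the database servers. Many deployments use a more thorough external check script instead, since a node can answer the handshake while not being a synced cluster member.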

The minimum cluster size is three, since it has to be an odd number to avoid a split-brain situation when the connection between the nodes goes down for any reason. (You can also use an arbitrator, but the easier setup is just to use at least 3 proper cluster nodes.) We use 5 nodes to allow for easier upgrades of the cluster and to increase resilience.
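As a rough sketch, the per-node Galera section of the MariaDB config for a three-node cluster might look like this (cluster name, node names, IP addresses and the provider library path are placeholders; the path differs by distribution):

```ini
# Sketch of /etc/my.cnf.d/galera.cnf for node1 of a 3-node cluster.
[galera]
wsrep_on                 = ON
wsrep_provider           = /usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_name       = "example_cluster"
wsrep_cluster_address    = "gcomm://10.0.0.11,10.0.0.12,10.0.0.13"
wsrep_node_name          = "node1"
wsrep_node_address       = "10.0.0.11"
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
```

The last three settings are requirements of Galera replication, not tuning choices: row-based binlogging, InnoDB as the storage engine, and interleaved auto-increment locking.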

Galera also supports clustering over a WAN, but that needs some extra tuning in the server settings to avoid wrecking performance. Usually a cluster of 3+ nodes with redundant network and power should be fine for most applications.
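The WAN tuning typically means relaxing the group-communication timeouts in `wsrep_provider_options` so that transient WAN latency is not mistaken for a failed node. The option names below are real Galera `evs` options, but the values are illustrative assumptions, not recommendations for any particular link:

```ini
# Hypothetical WAN additions to the [galera] section; durations are
# ISO 8601 (PT3S = 3 seconds). Tune to your actual link latency.
wsrep_provider_options = "evs.keepalive_period=PT3S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M"
```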

Something you didn't mention in your question is the type of storage engine you are using on your Galera cluster. Seeing that you got corruption, I would guess it is probably MyISAM? If that is the case, you need to migrate to InnoDB, since MyISAM is not actually supported by Galera. InnoDB also has some other benefits, like more resilient writes that avoid data corruption even in the unlikely case that the cluster actually breaks apart and you need to restore the database.
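To check for and carry out such a migration, something along these lines should work (the schema and table names in the `ALTER` are placeholders; `information_schema.tables` is standard):

```sql
-- List MyISAM tables outside the system schemas.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');

-- Convert each one; do large tables off-peak, as this rebuilds the table.
ALTER TABLE mydb.mytable ENGINE = InnoDB;
```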

dadriel