
I am working on a MariaDB primary/replica cluster managed via MaxScale, to which I successfully added a third node. When I left things a few days ago everything was fine; today I found that all nodes were down, supposedly because of a power outage. This is a lab environment, so nobody really cared to check before rebooting the hardware.

Because it is a lab environment, I could easily restart from the last known-good point, but since I'm here, I'd like to take the chance to learn something.

I managed to reboot all nodes, except for one, which is not coming back into the cluster.

These are my variables right now:

mariadb221 [(none)]> show global variables like '%gtid%';

Variable_name            Value
gtid_binlog_pos          1-3000-1359255
gtid_binlog_state        1-2000-358368,1-1000-1359225,1-3000-1359255
gtid_cleanup_batch_size  64
gtid_current_pos         1-3000-1359255
gtid_domain_id           1
gtid_ignore_duplicates   OFF
gtid_pos_auto_engines
gtid_slave_pos           1-3000-1359255
gtid_strict_mode         ON
wsrep_gtid_domain_id     0
wsrep_gtid_mode          OFF

mariadb222 [(none)]> show global variables like '%gtid%';

Variable_name            Value
gtid_binlog_pos          1-1000-1359231
gtid_binlog_state        1-2000-358368,1-3000-359229,1-1000-1359231
gtid_cleanup_batch_size  64
gtid_current_pos         1-1000-1359254
gtid_domain_id           1
gtid_ignore_duplicates   OFF
gtid_pos_auto_engines
gtid_slave_pos           1-1000-1359254
gtid_strict_mode         ON
wsrep_gtid_domain_id     0
wsrep_gtid_mode          OFF

mariadb223 [(none)]> show global variables like '%gtid%';
Variable_name            Value
gtid_binlog_pos          1-3000-1359255
gtid_binlog_state        1-1000-1359230,1-2000-358368,1-3000-1359255
gtid_cleanup_batch_size  64
gtid_current_pos         1-3000-1359255
gtid_domain_id           1
gtid_ignore_duplicates   OFF
gtid_pos_auto_engines                            
gtid_slave_pos           1-3000-1359255
gtid_strict_mode         ON
wsrep_gtid_domain_id     0
wsrep_gtid_mode          OFF

Aside from deleting the bad node and recreating it as if it were a new one (taking a backup of the remaining replica, restoring that backup onto the bad node, and adding it to the cluster again), is there anything I can do to recover from this position?
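
To make that rebuild option concrete, this is roughly what I mean, sketched with mariabackup; the host names, user, password, paths and GTID value below are placeholders, not my actual setup:

-- On the healthy replica (mariadb223), take and prepare a physical backup (shell, not SQL):
--   mariabackup --backup --target-dir=/backup/full --user=root --password=...
--   mariabackup --prepare --target-dir=/backup/full
-- Copy /backup/full to mariadb222, stop MariaDB there, replace the datadir,
-- fix ownership, start MariaDB, then read the GTID position recorded by the
-- backup (xtrabackup_binlog_info) and point replication at the primary:

SET GLOBAL gtid_slave_pos = '1-3000-1359255';   -- example value; use the GTID recorded by the backup
CHANGE MASTER TO
  MASTER_HOST     = 'current-primary-host',     -- whichever node MaxScale reports as the primary
  MASTER_USER     = 'repl',                     -- placeholder replication user
  MASTER_PASSWORD = '...',
  MASTER_USE_GTID = slave_pos;
START SLAVE;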

Any suggestions about anything else I should check would also be appreciated.
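
For reference, these are the kinds of status checks I already know about, so suggestions beyond them are welcome (the maxctrl commands assume MaxScale's default admin access, and the server name is whatever is defined in maxscale.cnf):

-- On each database node: replication thread state and last error
SHOW ALL SLAVES STATUS\G
-- Where the MariaDB error log lives on that node
SHOW GLOBAL VARIABLES LIKE 'log_error';
-- On whichever node is currently the primary:
SHOW MASTER STATUS;
-- On the MaxScale host (shell, not SQL):
--   maxctrl list servers
--   maxctrl show server mariadb222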

P.S.: the bad node is the second one (mariadb222).
