I've got a working MySQL Group Replication cluster consisting of 2 Ubuntu 20.04 servers, running on IPs 192.168.1.2 and 192.168.1.3 (local network connection). This works great; however, for proper availability (when a server goes down), I have to add at least one more node. A third node was prepared at the time but never added. The cluster has now been active for multiple months, and I want to add that 3rd node (192.168.1.1).
I've executed the setup steps on the new node (a sketch of the join statements is at the end of this post), but I can't get it to join the cluster. It shows up in RECOVERING state on the other two nodes while group replication starts, but it eventually fails with the following logs:
2021-06-21T14:37:14.833580Z 229 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2021-06-21T14:37:14.836221Z 229 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
2021-06-21T14:37:14.836325Z 229 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
2021-06-21T14:37:14.839968Z 232 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2021-06-21T14:37:16.627848Z 0 [Warning] [MY-013426] [Repl] Plugin group_replication reported: 'Member version is read compatible with the group.'
2021-06-21T14:37:17.628880Z 0 [ERROR] [MY-013467] [Repl] Plugin group_replication reported: 'No valid or ONLINE members exist to get the missing data from the group. For cloning check if donors of the same version and with clone plugin installed exist. For incremental recovery check if you have donors where the required data was not purged from the binary logs.'
2021-06-21T14:37:17.629002Z 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2021-06-21T14:37:18.629849Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.1.3:3306, 192.168.1.2:3306, 192.168.1.1:3306 on view 16074139281868351:77.'
2021-06-21T14:37:18.630074Z 0 [ERROR] [MY-011486] [Repl] Plugin group_replication reported: 'Message received while the plugin is not ready, message discarded.'
2021-06-21T14:37:22.189313Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
As you can see, it reports that it can't recover from either of the 2 currently active nodes:
SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 3006dde2-2290-11eb-b8c5-525400a817fd | 192.168.1.3 | 3306 | ONLINE | PRIMARY | 8.0.22 |
| group_replication_applier | 3497e1d0-2290-11eb-b932-52540020541f | 192.168.1.2 | 3306 | ONLINE | PRIMARY | 8.0.22 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
On the new node itself, you can also see the error state:
SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 8252825b-d292-11eb-acab-525400cc2440 | 192.168.1.1 | 3306 | ERROR | | 8.0.25 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
What I've tried so far: all kinds of imports from dumps taken from either of the active nodes, resetting master/slave, and so on. Everything ends in the same error state with the same error message. I suspect the binary logs have indeed been purged, given the long uptime of the 2 nodes in the cluster, but that shouldn't be a problem, right? What if I want to add another node to the cluster, say, 2 years from now?
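One typical attempt on the new node looked roughly like this (the dump was taken from one of the active nodes with mysqldump --all-databases --set-gtid-purged=ON; the exact options and order varied between attempts):

STOP GROUP_REPLICATION;
RESET SLAVE ALL;
RESET MASTER;
-- import the dump here with the mysql client, then try to join again
START GROUP_REPLICATION;

And I assume the way to confirm whether the required transactions really have been purged is to compare the GTID sets, e.g. on each active node:

SELECT @@GLOBAL.gtid_executed;
SELECT @@GLOBAL.gtid_purged;
SHOW BINARY LOGS;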
So, my problem is getting the 3rd (new) node active in the cluster on IP 192.168.1.1. The group replication setup is correct in my opinion:
Node 1 (new node):
# General replication settings
gtid_mode = ON
enforce_gtid_consistency = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
binlog_checksum = NONE
log_slave_updates = ON
log_bin = binlog
binlog_format = ROW
transaction_write_set_extraction = XXHASH64
loose-group_replication_bootstrap_group = OFF
loose-group_replication_start_on_boot = OFF
loose-group_replication_ssl_mode = REQUIRED
loose-group_replication_recovery_use_ssl = 1
# Shared replication group configuration
loose-group_replication_group_name = "bfdb88ba-b21c-4c07-84b5-8b9c1d145345"
loose-group_replication_ip_whitelist = "192.168.1.1,192.168.1.2,192.168.1.3"
loose-group_replication_group_seeds = "192.168.1.1:33061,192.168.1.2:33061,192.168.1.3:33061"
# Single or Multi-primary mode? Uncomment these two lines
# for multi-primary mode, where any host can accept writes
loose-group_replication_single_primary_mode = OFF
loose-group_replication_enforce_update_everywhere_checks = ON
# Host specific replication configuration
server_id=1
report_host="192.168.1.1"
loose-group_replication_local_address="192.168.1.1:33061"
Node 2 (currently active):
# General replication settings
gtid_mode = ON
enforce_gtid_consistency = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
binlog_checksum = NONE
log_slave_updates = ON
log_bin = binlog
binlog_format = ROW
transaction_write_set_extraction = XXHASH64
loose-group_replication_bootstrap_group = OFF
loose-group_replication_start_on_boot = OFF
loose-group_replication_ssl_mode = REQUIRED
loose-group_replication_recovery_use_ssl = 1
# Shared replication group configuration
loose-group_replication_group_name = "bfdb88ba-b21c-4c07-84b5-8b9c1d145345"
loose-group_replication_ip_whitelist = "192.168.1.1,192.168.1.2,192.168.1.3"
loose-group_replication_group_seeds = "192.168.1.1:33061,192.168.1.2:33061,192.168.1.3:33061"
# Single or Multi-primary mode? Uncomment these two lines
# for multi-primary mode, where any host can accept writes
loose-group_replication_single_primary_mode = OFF
loose-group_replication_enforce_update_everywhere_checks = ON
# Host specific replication configuration
server_id=2
report_host="192.168.1.2"
loose-group_replication_local_address="192.168.1.2:33061"
Node 3 (also currently active):
# General replication settings
gtid_mode = ON
enforce_gtid_consistency = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
binlog_checksum = NONE
log_slave_updates = ON
log_bin = binlog
binlog_format = ROW
transaction_write_set_extraction = XXHASH64
loose-group_replication_bootstrap_group = OFF
loose-group_replication_start_on_boot = OFF
loose-group_replication_ssl_mode = REQUIRED
loose-group_replication_recovery_use_ssl = 1
# Shared replication group configuration
loose-group_replication_group_name = "bfdb88ba-b21c-4c07-84b5-8b9c1d145345"
loose-group_replication_ip_whitelist = "192.168.1.1,192.168.1.2,192.168.1.3"
loose-group_replication_group_seeds = "192.168.1.1:33061,192.168.1.2:33061,192.168.1.3:33061"
# Single or Multi-primary mode? Uncomment these two lines
# for multi-primary mode, where any host can accept writes
loose-group_replication_single_primary_mode = OFF
loose-group_replication_enforce_update_everywhere_checks = ON
# Host specific replication configuration
server_id=3
report_host="192.168.1.3"
loose-group_replication_local_address="192.168.1.3:33061"