
I have a three-node MySQL/InnoDB Cluster (version 8.0.25) that was deployed by Juju/OpenStack a couple of months ago. The cluster recently crashed and I unfortunately can't manage to bring it back online.

When I run the dba.rebootClusterFromCompleteOutage(); command from the mysqlsh-js console, it hangs for several minutes and then fails with the following error:

Dba.rebootClusterFromCompleteOutage: Timeout waiting for super_read_only to be
unset after call to start Group Replication plugin. (RuntimeError)

Then, if I run the command again, the error becomes:

Dba.rebootClusterFromCompleteOutage: Dba.rebootClusterFromCompleteOutage: 
This function is not available through a session to a recovering instance (RuntimeError)
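
For reference, while the command hangs, the group state can be watched from a second session with queries like these (just a minimal diagnostic sketch, run through the regular mysql client on the node mysqlsh is connected to):

-- Is the node still holding super_read_only, and what does the group look like?
select @@hostname, @@super_read_only;
select MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  from performance_schema.replication_group_members;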

The only way to get rid of this error is to restart the MySQL service (I waited several hours before doing so to see if something was recovering, but no luck).

When I restart all three MySQL services, I can see that the members see each other for a short while:

mysql> select instance_name, mysql_server_uuid, addresses from  mysql_innodb_cluster_metadata.instances;
+-------------------+--------------------------------------+--------------------------------------------------------------------------------------------------------+
| instance_name     | mysql_server_uuid                    | addresses                                                                                              |
+-------------------+--------------------------------------+--------------------------------------------------------------------------------------------------------+
| 172.16.3.133:3306 | 7573a29f-8c1f-11eb-9019-00163e0e5bcb | {"mysqlX": "172.16.3.133:33060", "grLocal": "172.16.3.133:33061", "mysqlClassic": "172.16.3.133:3306"} |
| 172.16.3.94:3306  | 782db9b8-8c1f-11eb-b9b8-00163e84e158 | {"mysqlX": "172.16.3.94:33060", "grLocal": "172.16.3.94:33061", "mysqlClassic": "172.16.3.94:3306"}    |
| 172.16.3.61:3306  | 77ab99e1-8c1f-11eb-8e8a-00163e33e612 | {"mysqlX": "172.16.3.61:33060", "grLocal": "172.16.3.61:33061", "mysqlClassic": "172.16.3.61:3306"}    |
+-------------------+--------------------------------------+--------------------------------------------------------------------------------------------------------+

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 7573a29f-8c1f-11eb-9019-00163e0e5bcb | 172.16.3.133 |        3306 | RECOVERING   | SECONDARY   | 8.0.25         |
| group_replication_applier | 77ab99e1-8c1f-11eb-8e8a-00163e33e612 | 172.16.3.61  |        3306 | RECOVERING   | SECONDARY   | 8.0.25         |
| group_replication_applier | 782db9b8-8c1f-11eb-b9b8-00163e84e158 | 172.16.3.94  |        3306 | RECOVERING   | SECONDARY   | 8.0.25         |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+

but then two of them (172.16.3.133 and 172.16.3.94) change status to OFFLINE, while the last one stays in RECOVERING (this node seems to have been the master before the crash).
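
On the member that stays in RECOVERING, I assume the distributed recovery channel and the error log are the places to look; this is the kind of check I mean (a sketch using the standard performance_schema tables; the error_log table only exists since 8.0.22):

-- Status and last error of the distributed recovery channel
select CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
  from performance_schema.replication_connection_status
 where CHANNEL_NAME = 'group_replication_recovery';

-- Recent replication-related messages from the error log
select LOGGED, PRIO, ERROR_CODE, DATA
  from performance_schema.error_log
 where SUBSYSTEM = 'Repl'
 order by LOGGED desc
 limit 20;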

And on the old master (172.16.3.61):

mysql> show global variables like '%gtid_%';
+----------------------------------------------+--------------------------------------------------------------------------------------------------+
| Variable_name                                | Value                                                                                            |
+----------------------------------------------+--------------------------------------------------------------------------------------------------+
| binlog_gtid_simple_recovery                  | ON                                                                                               |
| enforce_gtid_consistency                     | ON                                                                                               |
| group_replication_gtid_assignment_block_size | 1000000                                                                                          |
| gtid_executed                                | 7573a29f-8c1f-11eb-9019-00163e0e5bcb:1-8,
9a11f648-8c1f-11eb-b355-00163e0e5bcb:1-81797986        |
| gtid_executed_compression_period             | 0                                                                                                |
| gtid_mode                                    | ON                                                                                               |
| gtid_owned                                   | 9a11f648-8c1f-11eb-b355-00163e0e5bcb:81797990#15981:81797989#15982:81797988#15980:81797987#15977 |
| gtid_purged                                  | 7573a29f-8c1f-11eb-9019-00163e0e5bcb:1-8,
9a11f648-8c1f-11eb-b355-00163e0e5bcb:1-72588219        |
| session_track_gtids                          | OFF                                                                                              |
+----------------------------------------------+--------------------------------------------------------------------------------------------------+
9 rows in set (0.00 sec)
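
In case it matters which node should be recovered from, my understanding is that the gtid_executed sets of the three nodes can be compared with the built-in GTID functions, roughly like this (the quoted string is a placeholder for the value copied from another node):

-- 1 in the first column means the other node's transactions are all contained in this
-- node's set; the second column lists what the other node has that this one lacks.
select GTID_SUBSET('<gtid_executed copied from another node>',
                   @@global.gtid_executed) as contains_other,
       GTID_SUBTRACT('<gtid_executed copied from another node>',
                     @@global.gtid_executed) as missing_here;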

mysql> select @@hostname, @@read_only, @@super_read_only;
+---------------------+-------------+-------------------+
| @@hostname          | @@read_only | @@super_read_only |
+---------------------+-------------+-------------------+
| juju-8c2989-2-lxd-1 |           1 |                 0 |
+---------------------+-------------+-------------------+
1 row in set (0.00 sec)

mysql> SHOW FULL PROCESSLIST;
+--------+-----------------+-------------------+------+---------+--------+---------------------------------------------------+----------------------------------+
| Id     | User            | Host              | db   | Command | Time   | State                                             | Info                             |
+--------+-----------------+-------------------+------+---------+--------+---------------------------------------------------+----------------------------------+
|      6 | event_scheduler | localhost         | NULL | Daemon  |   9525 | Waiting on empty queue                            | NULL                             |
|  15230 | clusteruser     | 172.16.3.94:37698 | NULL | Sleep   |   7827 | NULL                                              | PLUGIN                           |
|  15960 | system user     |                   | NULL | Connect |   9252 | waiting for handler commit                        | Group replication applier module |
|  15976 | system user     |                   | NULL | Query   |      0 | Waiting for slave workers to process their queues | NULL                             |
|  15977 | system user     |                   | NULL | Query   | 745201 | waiting for handler commit                        | NULL                             |
|  15980 | system user     |                   | NULL | Query   | 745201 | waiting for handler commit                        | NULL                             |
|  15981 | system user     |                   | NULL | Query   | 745201 | waiting for handler commit                        | NULL                             |
|  15982 | system user     |                   | NULL | Query   | 745201 | waiting for handler commit                        | NULL                             |
| 549957 | root            | localhost         | NULL | Query   |      0 | init                                              | SHOW FULL PROCESSLIST            |
+--------+-----------------+-------------------+------+---------+--------+---------------------------------------------------+----------------------------------+
9 rows in set (0.00 sec)
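
Those applier workers have been "waiting for handler commit" for 745201 seconds (more than eight days). A query along these lines should show whether there are open InnoDB transactions behind them (again just a diagnostic sketch, it doesn't explain the hang by itself):

select trx_mysql_thread_id, trx_id, trx_state, trx_started, trx_query
  from information_schema.innodb_trx
 order by trx_started;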

And I can't get cluster information in mysqlsh-js (connected to the old master, 172.16.3.61):

 MySQL  172.16.3.61:33060+ ssl  JS > cluster = dba.getCluster()
WARNING: Cluster error connecting to target: RuntimeError: Unable to find a primary member in the cluster
Dba.getCluster: Unable to find a primary member in the cluster (RuntimeError)

What would be the procedure to bring the cluster back online, or to rebuild the cluster from the old master?
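
The closest thing I've found in the AdminAPI so far is to reboot from the most up-to-date member and drop the other members from the metadata during the reboot, roughly like this (a sketch based on my reading of the 8.0.25 documentation, with placeholder addresses; I haven't verified it against my situation):

// connected with mysqlsh to the member with the most complete gtid_executed
var cluster = dba.rebootClusterFromCompleteOutage("jujuCluster", {
    removeInstances: ["<other-node-1>:3306", "<other-node-2>:3306"]
});
cluster.status();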


Edit 20/05/2021

I'm now able to get the cluster recognized, with a master node running. I was also able to remove the two old slaves from the cluster configuration, so here is my current cluster status:

 MySQL  172.16.3.94:33060+ ssl  JS > cluster.status()
{
    "clusterName": "jujuCluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "172.16.3.94:3306", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures.", 
        "topology": {
            "172.16.3.94:3306": {
                "address": "172.16.3.94:3306", 
                "instanceErrors": [
                    "WARNING: Recovery user account not found for server address: 172.16.3.94:3306 with UUID 782db9b8-8c1f-11eb-b9b8-00163e84e158"
                ], 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.25"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "172.16.3.94:3306"
}

(I'm not sure how to fix that warning, though.)
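
The only related AdminAPI calls I've found are cluster.rescan() and cluster.resetRecoveryAccountsPassword(); I haven't confirmed that either of them actually recreates a missing mysql_innodb_cluster_* recovery account:

// from mysqlsh, connected to the current primary (172.16.3.94)
var cluster = dba.getCluster();
cluster.rescan();                         // re-sync the metadata with the live group
cluster.resetRecoveryAccountsPassword();  // re-issue credentials for the recovery accounts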

But now, when I try to add the slaves back, I'm facing the following error:

cluster.addInstance({user: "clusteruser", host: "172.16.3.61", port: 3306, password: "***password***"}, {recoveryMethod: "clone"});

Clone based recovery selected through the recoveryMethod option

Validating instance configuration at 172.16.3.61:3306...

This instance reports its own address as 172.16.3.61:3306

Instance configuration is suitable.
ERROR: Cannot join instance '172.16.3.61:3306' to the cluster because it has asynchronous (source-replica) replication channel(s) configured. MySQL InnoDB Cluster does not support manually configured channels as they are not managed using the AdminAPI (e.g. when PRIMARY moves to another member) which may cause replication to break or even create split-brain scenarios (data loss).

And indeed, when I check the slave's configuration, it still thinks it's part of the cluster:

mysql> select * from mysql_innodb_cluster_metadata.instances;
+-------------+--------------------------------------+-------------------+--------------------------------------+-------------------+--------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| instance_id | cluster_id                           | address           | mysql_server_uuid                    | instance_name     | addresses                                                                                              | attributes                                                                                                                                 | description |
+-------------+--------------------------------------+-------------------+--------------------------------------+-------------------+--------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------+
|           1 | a2ea1850-8c1f-11eb-b355-00163e0e5bcb | 172.16.3.133:3306 | 7573a29f-8c1f-11eb-9019-00163e0e5bcb | 172.16.3.133:3306 | {"mysqlX": "172.16.3.133:33060", "grLocal": "172.16.3.133:33061", "mysqlClassic": "172.16.3.133:3306"} | {"server_id": 1000, "recoveryAccountHost": "%", "recoveryAccountUser": "mysql_innodb_cluster_1000"}                                        | NULL        |
|           2 | a2ea1850-8c1f-11eb-b355-00163e0e5bcb | 172.16.3.94:3306  | 782db9b8-8c1f-11eb-b9b8-00163e84e158 | 172.16.3.94:3306  | {"mysqlX": "172.16.3.94:33060", "grLocal": "172.16.3.94:33061", "mysqlClassic": "172.16.3.94:3306"}    | {"joinTime": "2021-03-23 21:35:54.777", "server_id": 1001, "recoveryAccountHost": "%", "recoveryAccountUser": "mysql_innodb_cluster_1001"} | NULL        |
|           3 | a2ea1850-8c1f-11eb-b355-00163e0e5bcb | 172.16.3.61:3306  | 77ab99e1-8c1f-11eb-8e8a-00163e33e612 | 172.16.3.61:3306  | {"mysqlX": "172.16.3.61:33060", "grLocal": "172.16.3.61:33061", "mysqlClassic": "172.16.3.61:3306"}    | {"joinTime": "2021-03-23 21:36:29.040", "server_id": 1002, "recoveryAccountHost": "%", "recoveryAccountUser": "mysql_innodb_cluster_1002"} | NULL        |
+-------------+--------------------------------------+-------------------+--------------------------------------+-------------------+--------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 77ab99e1-8c1f-11eb-8e8a-00163e33e612 | 172.16.3.61 |        3306 | OFFLINE      |             |                |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
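
Before retrying addInstance(), I suppose the stale asynchronous channel the error refers to has to be found and removed on 172.16.3.61. Something like this is what I have in mind (the channel name is a placeholder, to be taken from the output of the first two statements):

-- List the replication channels configured on this instance
SHOW REPLICA STATUS\G
select CHANNEL_NAME, HOST, PORT
  from performance_schema.replication_connection_configuration;

-- If a channel is only a leftover from the broken cluster, drop it
STOP REPLICA FOR CHANNEL '<stale_channel_name>';
RESET REPLICA ALL FOR CHANNEL '<stale_channel_name>';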
