I have an SOP for upgrading AWS RDS instances running MySQL with no downtime that I have been following since 2021. I'll outline my method below, but in short it involves manually hooking up a new master RDS instance to my existing instance via `mysql.rds_set_external_master` before doing a switch.
The method has worked well in the past, but right now I am encountering a weird issue: `Slave_SQL_Running_State` on the new instance is `Waiting for dependent transaction to commit`, even though there are no pending transactions and `Seconds_Behind_Master` is 0. Replication seems to be fine, but with this message constantly present I am hesitant to continue.
My Goal
We want to
- upgrade to the latest MySQL minor version 8.0.36
- and change the instance class to db.m6g.2xlarge
- without downtime
We are running
- MySQL 8.0.28
- on a db.m5.2xlarge
- with Multi-AZ on
- using RDS proxy
- and one read replica
The SOP to upgrade an RDS MySQL instance with no downtime
The method outlined below is modified from this answer by Dmitriy, as I was not comfortable with M-to-M (master-to-master) replication.
Prerequisites
- Ensure `binlog retention hours` is set to a long period, e.g. 48 or 72 hours
  - Check with `call mysql.rds_show_configuration;`
  - Modify with `call mysql.rds_set_configuration('binlog retention hours', 48);`
- Ensure a replication user exists on the instance (see the sketch after this list)
- Have a DNS CNAME entry for the RDS instance / RDS proxy which your application uses, rather than connecting directly to the RDS instance / RDS proxy.
  - Meaning rather than the application connecting to `myDB.proxy-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com`, it connects to `myDB.mydomain.com`
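For reference, a minimal sketch of setting up these prerequisites. `repl_user` and the password are placeholders; the grants shown are the standard ones an external-replication account needs, and your existing user may already have them:

```sql
-- Check and, if needed, extend binlog retention on the source
CALL mysql.rds_show_configuration;
CALL mysql.rds_set_configuration('binlog retention hours', 48);

-- Create a replication user if one does not already exist
-- ('repl_user' and '<strong-password>' are placeholders)
CREATE USER 'repl_user'@'%' IDENTIFIED BY '<strong-password>';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl_user'@'%';
```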
Steps
In this example, the original master `M1` will be `myDB`.

1. Create a read replica, `R1`, of the original master, `M1`, e.g. `myDBr1`
2. Wait for the replica lag on `R1` to hit zero.
3. Connect to `R1` and stop replication: `CALL mysql.rds_stop_replication;`
4. Obtain the current binlog file and position with `show slave status`
   - Note down the values for `Master_Log_File` and `Exec_Master_Log_Pos`, e.g.
     - `Master_Log_File: mysql-bin-changelog.123456`
     - `Exec_Master_Log_Pos: 1236549`
5. Create a snapshot, `SNAP1`, from `R1`
6. Create a new parameter group, `myDBpg2`, for the new master instance, `M2`, by copying the existing parameter group. Make no changes to it.
7. Create a new master instance, `M2`, by restoring `SNAP1`; assign it the new parameter group, `myDBpg2`, while creating the instance
8. Modify `M2`: enable backups (and any other config, like Performance Insights, that you could not set when restoring)
9. Get the private IP of `M1` using telnet or nslookup from an EC2 instance that has access to the database: `telnet myDB.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com 3306`
10. Configure `M2` as the slave of `M1`, using the private IP, master log file, and exec master log pos: `CALL mysql.rds_set_external_master ('10.1.xxx.xxx', 3306, 'repl_user', <password>, 'mysql-bin-changelog.123456', 1236549, 0);`
11. Start replication: `CALL mysql.rds_start_replication;`
12. Wait for the replica lag on `M2` to hit zero
13. You can now delete `R1` and `SNAP1`
14. Now you can upgrade `M2` or perform AWS scheduled maintenance on `M2`
15. Again wait for the replica lag to hit zero on `M2`
16. [optional] Create a new read replica, `R2`, of `M2` if the original `M1` had a read replica
    - Perform any necessary actions on `R2`
17. Connect to the new instance and do sanity tests
18. Create an RDS Proxy for the new instance
19. In Route 53, update the CNAME record(s) TTL to 60 seconds
20. Wait for the old TTL(s) to expire
21. Update the parameter groups of `M1` and `M2` so that (see the verification sketch after this list)
    - both have `auto_increment_increment` of `2`
    - one parameter group has `auto_increment_offset` of `1` and the other `2`
22. Update the CNAME record(s) to point towards the new instances
23. Test and monitor (it will take up to 60 seconds for the application to start hitting the new databases)
24. Keep an eye on `show full processlist;` for the old and new DB instances. The old instance may have a long-running script still running.
25. Stop replication on `M2`: `CALL mysql.rds_stop_replication;`
26. Update the parameter group of `M2` to have `auto_increment_offset` and `auto_increment_increment` both equal to `1`
27. Reset the replication external master: `CALL mysql.rds_reset_external_master;`
28. Update the CNAME record(s) TTL back to their higher value
29. Update CloudWatch alarms, metrics, and dashboards
30. Delete the old instance `M1`
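A minimal sketch of the pre-cutover checks (around steps 12, 15, and 21), assuming the parameter-group changes have already been applied to both instances; these are plain read-only statements:

```sql
-- On M1 and M2: confirm the staggered auto-increment settings are in effect
SELECT @@global.auto_increment_increment, @@global.auto_increment_offset;

-- On M2: confirm replication is healthy and caught up before flipping the CNAME.
-- Look for Slave_IO_Running = Yes, Slave_SQL_Running = Yes, Seconds_Behind_Master = 0
SHOW SLAVE STATUS\G
```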
The Problem
When I get to step 11 and start replication, instead of a success message I get `Slave has encountered an error. Run SHOW SLAVE STATUS\G; to see the error.` Then running `SHOW SLAVE STATUS` (after the replica lag has gone down) shows
Slave_IO_State: Waiting for master to send event
Connect_Retry: 60
Master_Log_File: mysql-bin-changelog.259152
Read_Master_Log_Pos: 11271584
Relay_Log_File: relaylog.000700
Relay_Log_Pos: 20945831
Relay_Master_Log_File: mysql-bin-changelog.259113
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table: innodb_memcache.cache_policies,innodb_memcache.config_options,mysql.plugin,mysql.rds_configuration,mysql.rds_history,mysql.rds_monitor,mysql.rds_replication_status,mysql.rds_sysinfo,mysql.rds_upgrade_prechecks
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 20952250
Relay_Log_Space: 664160422
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Seconds_Behind_Master: 0
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
I am confused, as the replica lag goes down to zero despite
- the initial error message when starting replication
- `Slave_SQL_Running_State` having the value `Waiting for dependent transaction to commit` instead of the usual `Slave has read all relay log; waiting for more updates`
I've repeated this process three times now, rebooting `M1` with failover before this latest, third attempt. (Though since the DB fails over, I do wonder whether the reboot actually did anything, or whether I should have rebooted twice with failover to get back to the original instance.)
Long running transactions?
I have checked to see if there are any long-running transactions on `M1`, but there are none.
SELECT * FROM information_schema.INNODB_TRX;
and
SELECT p.ID, p.USER, p.HOST, p.DB, p.COMMAND, p.TIME, p.STATE, p.INFO
FROM information_schema.INNODB_TRX t
JOIN information_schema.PROCESSLIST p ON t.trx_mysql_thread_id = p.ID;
Both show nothing of interest, just the latest queries running.
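Since `Waiting for dependent transaction to commit` is a worker state of the multi-threaded replication applier, the standard MySQL 8.0 performance_schema tables give a more detailed view of the same SQL-thread activity on `M2`. A quick, read-only diagnostic sketch (nothing RDS-specific):

```sql
-- Per-worker applier state on the replica, including any last error per worker
SELECT CHANNEL_NAME, WORKER_ID, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;

-- Coordinator-level view of the same channel
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_coordinator;
```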
Checking data consistency between M1 and M2
Since replication seems to be working when I eyeball things, I thought to use pt-table-checksum to verify it properly. Unfortunately, our `binlog_format` is set to `MIXED` rather than the `STATEMENT` format that pt-table-checksum requires.
I have read about some potential workarounds:
- Checking Data Consistency for RDS for MySQL
  - This solution recommends traffic be stopped, i.e. downtime, which is what I am trying to avoid in the first place.
- An update to the solution above, Data Consistency for RDS for MySQL: The 8.0 Version
  - This solution gets around the fact that we cannot grant SUPER privileges in RDS because, as of "8.0.1, MySQL introduced something called “Dynamic Privileges” which is a solution to grant more granulated privileges to the users"
  - I am a bit worried about having the `binlog_format` swapped to `STATEMENT` (see the quick check after this list)
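As a quick reality check before trying any of these workarounds, a couple of read-only queries show what we are currently running with (nothing is changed here; the note about session scope reflects my reading of the Percona approach, not something I have verified):

```sql
-- Server-wide binlog_format (ours is MIXED, controlled via the RDS parameter group)
SELECT @@global.binlog_format;

-- Session-level value; the Percona approach, as I read it, only changes
-- binlog_format for the checksum session, leaving the global value alone
SELECT @@session.binlog_format;
```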
Successful attempts in the past
As I mentioned above, the SOP has been successful in the past: the first time when we upgraded from MySQL 5.6.x to 8.x, and again for another migration whose details I don't recall. Both times, starting replication returned a success message.
Alternatives?
- Amazon RDS Blue/Green Deployments - would be great, but it does not support RDS Proxy, which we use.
- Performing the upgrades with no special treatment; Multi-AZ and RDS Proxy should reduce downtime.
  - This is true for instance class changes. We could change from our `db.m5.2xlarge` to `db.m6g.2xlarge` with minimal downtime: AWS will first change the instance class of the standby instance, fail over to the upgraded instance, and then change the instance class of the new standby instance.
  - This does not seem to be true for engine upgrades, i.e. MySQL 8.0.28 to MySQL 8.0.36, as per the AWS documentation Maintenance for Multi-AZ deployments (emphasis mine):
If you upgrade the database engine for your DB instance in a Multi-AZ deployment, Amazon RDS modifies both primary and secondary DB instances at the same time. In this case, both the primary and secondary DB instances in the Multi-AZ deployment are unavailable during the upgrade. This operation causes downtime until the upgrade is complete. The duration of the downtime varies based on the size of your DB instance.
Update (before even posting this question)
While typing this question out, I ended up doing more googling to see if I had missed something, which delayed posting by a day or so. The next day I checked `show slave status` and was surprised to find
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Curious, I then tried to stop and start replication again to see if the initial error message on starting replication would also be fixed. However, when I did this,
- I still got the error message `Slave has encountered an error. Run SHOW SLAVE STATUS\G; to see the error.`
- AND `show slave status` once again showed `Slave_SQL_Running_State: Waiting for dependent transaction to commit`
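For reference, the exact retry sequence I ran on `M2` (the same RDS stored procedures used in the SOP above):

```sql
CALL mysql.rds_stop_replication;
CALL mysql.rds_start_replication;  -- still reports "Slave has encountered an error..."
SHOW SLAVE STATUS\G                -- Slave_SQL_Running_State: Waiting for dependent transaction to commit
```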
I decided to wait another day, and today the `Slave_SQL_Running_State` is back to `Slave has read all relay log; waiting for more updates`.
Summary of my questions
- Why am I seeing the two errors when starting replication at step 11?
  - `Slave has encountered an error. Run SHOW SLAVE STATUS\G; to see the error.`
  - `Slave_SQL_Running_State: Waiting for dependent transaction to commit`
- As the replication lag eventually goes to zero despite `Slave_SQL_Running_State: Waiting for dependent transaction to commit`, and eventually even `Slave_SQL_Running_State` settles into the normal state of `Slave has read all relay log; waiting for more updates`, is it safe to assume everything is good and that I can proceed with using the new instance?
- Are there any methods besides the `pt-table-checksum` script to verify data consistency?
- Can I safely use `pt-table-checksum` as outlined in the article above, which will set the `binlog_format` to `STATEMENT`?