I have an SOP for upgrading AWS RDS instances running MySQL with no downtime that I have been following since 2021. I'll outline my method below, but in short it involves manually hooking up a new master RDS instance to my existing instance via `mysql.rds_set_external_master` before doing a switch.
The method has worked well in the past, but right now I am encountering a weird issue: `Slave_SQL_Running_State` on the new instance is `Waiting for dependent transaction to commit`, even though there are no pending transactions and `Seconds_Behind_Master` is 0. Replication seems to be fine, but with this message constantly present I am hesitant to continue.
My Goal
We want to
- upgrade to the latest MySQL minor version 8.0.36
- and change the instance class to db.m6g.2xlarge
- without downtime
We are running
- MySQL 8.0.28
- on a db.m5.2xlarge
- with Multi-AZ on
- using RDS proxy
- and one read replica
The SOP to upgrade an RDS MySQL instance with no downtime
The method outlined below is modified from this answer by Dmitriy, as I was not comfortable with M-to-M (master-to-master) replication.
Prerequisites
- Ensure `binlog retention hours` is set to a long period, e.g. 48 or 72 hours
  - Check with `call mysql.rds_show_configuration;`
  - Modify with `call mysql.rds_set_configuration('binlog retention hours', 48);`
- Ensure a replication user exists on the instance (see the sketch after this list)
- Have a DNS CNAME entry for the RDS instance / RDS proxy which your application uses, rather than connecting directly to the RDS instance / RDS proxy.
  - Meaning rather than the application connecting to `myDB.proxy-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com`, it connects to `myDB.mydomain.com`
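For reference, a minimal sketch of setting up these prerequisites. `repl_user` and the password are placeholders; the grants shown are the standard ones an external-replication account needs, and your existing user may already have them:

```sql
-- Check and, if needed, extend binlog retention on the source
CALL mysql.rds_show_configuration;
CALL mysql.rds_set_configuration('binlog retention hours', 48);

-- Create a replication user if one does not already exist
-- ('repl_user' and '<strong-password>' are placeholders)
CREATE USER 'repl_user'@'%' IDENTIFIED BY '<strong-password>';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl_user'@'%';
```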
Steps
In this example, the original master `M1` will be `myDB`.

1. Create a read replica, `R1`, of the original master, `M1`, e.g. `myDBr1`
2. Wait for the replica lag on `R1` to hit zero.
3. Connect to `R1` and stop replication: `CALL mysql.rds_stop_replication;`
4. Obtain the current binlog file and position with `show slave status`
   - Note down the values for `Master_Log_File` and `Exec_Master_Log_Pos`, e.g.
     - `Master_Log_File: mysql-bin-changelog.123456`
     - `Exec_Master_Log_Pos: 1236549`
5. Create a snapshot, `SNAP1`, from `R1`
6. Create a new parameter group, `myDBpg2`, for the new master instance, `M2`, by copying the existing parameter group. Make no changes to it.
7. Create a new master instance, `M2`, by restoring `SNAP1`; assign it the new parameter group, `myDBpg2`, while creating the instance
8. Modify `M2`: enable backups (and any other config, like Performance Insights, that you could not set when restoring)
9. Get the private IP of `M1` using telnet or nslookup from an EC2 instance that has access to the database: `telnet myDB.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com 3306`
10. Configure `M2` as the slave of `M1`, using the private IP, master log file, and exec master log pos: `CALL mysql.rds_set_external_master ('10.1.xxx.xxx', 3306, 'repl_user', <password>, 'mysql-bin-changelog.123456', 1236549, 0);`
11. Start replication: `CALL mysql.rds_start_replication;`
12. Wait for the replica lag on `M2` to hit zero
13. You can now delete `R1` and `SNAP1`
14. Now you can upgrade `M2` or perform AWS scheduled maintenance on `M2`
15. Again wait for the replica lag to hit zero on `M2`
16. [optional] Create a new read replica, `R2`, of `M2` if the original `M1` had a read replica
    - Perform any necessary actions on `R2`
17. Connect to the new instance and do sanity tests
18. Create an RDS Proxy for the new instance
19. In Route 53, update the CNAME record(s) TTL to 60 seconds
20. Wait for the old TTL(s) to expire
21. Update the parameter groups of `M1` and `M2` so that (see the verification sketch after this list)
    - both have `auto_increment_increment` of `2`
    - one parameter group has `auto_increment_offset` of `1` and the other `2`
22. Update the CNAME record(s) to point towards the new instances
23. Test and monitor (it will take up to 60 seconds for the application to start hitting the new databases)
24. Keep an eye on `show full processlist;` for the old and new DB instances. The old instance may have a long-running script still running.
25. Stop replication on `M2`: `CALL mysql.rds_stop_replication;`
26. Update the parameter group of `M2` to have `auto_increment_offset` and `auto_increment_increment` both equal to `1`
27. Reset the replication external master: `CALL mysql.rds_reset_external_master;`
28. Update the CNAME record(s) TTL back to their higher value
29. Update CloudWatch alarms, metrics, and dashboards
30. Delete the old instance `M1`
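A minimal sketch of the pre-cutover checks (around steps 12, 15, and 21), assuming the parameter-group changes have already been applied to both instances; these are plain read-only statements:

```sql
-- On M1 and M2: confirm the staggered auto-increment settings are in effect
SELECT @@global.auto_increment_increment, @@global.auto_increment_offset;

-- On M2: confirm replication is healthy and caught up before flipping the CNAME.
-- Look for Slave_IO_Running = Yes, Slave_SQL_Running = Yes, Seconds_Behind_Master = 0
SHOW SLAVE STATUS\G
```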
The Problem
When I get to step 11 and start replication, instead of a success message I get `Slave has encountered an error. Run SHOW SLAVE STATUS\G; to see the error.` Then running `SHOW SLAVE STATUS` (after the replica lag has gone down) shows
Slave_IO_State: Waiting for master to send event
Connect_Retry: 60
Master_Log_File: mysql-bin-changelog.259152
Read_Master_Log_Pos: 11271584
Relay_Log_File: relaylog.000700
Relay_Log_Pos: 20945831
Relay_Master_Log_File: mysql-bin-changelog.259113
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table: innodb_memcache.cache_policies,innodb_memcache.config_options,mysql.plugin,mysql.rds_configuration,mysql.rds_history,mysql.rds_monitor,mysql.rds_replication_status,mysql.rds_sysinfo,mysql.rds_upgrade_prechecks
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 20952250
Relay_Log_Space: 664160422
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Seconds_Behind_Master: 0
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
I am confused, as the replica lag goes down to zero despite
- the initial error message when starting replication
- `Slave_SQL_Running_State` having the value `Waiting for dependent transaction to commit` instead of the usual `Slave has read all relay log; waiting for more updates`
I've repeated this process three times now, rebooting `M1` with failover before this latest, third attempt. (Though since the DB fails over, I do wonder whether the reboot actually did anything, or whether I should have rebooted twice with failover to get back to the original instance.)
Long running transactions?
I have checked to see if there are any long-running transactions on `M1`, but there are none.
SELECT * FROM information_schema.INNODB_TRX;
and
SELECT p.ID, p.USER, p.HOST, p.DB, p.COMMAND, p.TIME, p.STATE, p.INFO
FROM information_schema.INNODB_TRX t
JOIN information_schema.PROCESSLIST p ON t.trx_mysql_thread_id = p.ID;
Both show nothing of interest, just the latest queries running.
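Since `Waiting for dependent transaction to commit` is a worker state of the multi-threaded replication applier, the standard MySQL 8.0 performance_schema tables give a more detailed view of the same SQL-thread activity on `M2`. A quick, read-only diagnostic sketch (nothing RDS-specific):

```sql
-- Per-worker applier state on the replica, including any last error per worker
SELECT CHANNEL_NAME, WORKER_ID, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;

-- Coordinator-level view of the same channel
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_coordinator;
```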
Checking data consistency between M1 and M2
Since replication seems to be working when I eyeball things, I thought to use pt-table-checksum to verify it properly. Unfortunately, our `binlog_format` is set to `MIXED` rather than the `STATEMENT` format that pt-table-checksum requires.
I have read about some potential workarounds:
- Checking Data Consistency for RDS for MySQL
  - This solution recommends traffic be stopped, i.e. downtime, which is what I am trying to avoid in the first place.
- An update to the solution above, Data Consistency for RDS for MySQL: The 8.0 Version
  - This solution gets around the fact that we cannot grant SUPER privileges in RDS because, as of "8.0.1, MySQL introduced something called “Dynamic Privileges” which is a solution to grant more granulated privileges to the users"
  - I am a bit worried about having the `binlog_format` swapped to `STATEMENT` (see the quick check after this list)
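As a quick reality check before trying any of these workarounds, a couple of read-only queries show what we are currently running with (nothing is changed here; the note about session scope reflects my reading of the Percona approach, not something I have verified):

```sql
-- Server-wide binlog_format (ours is MIXED, controlled via the RDS parameter group)
SELECT @@global.binlog_format;

-- Session-level value; the Percona approach, as I read it, only changes
-- binlog_format for the checksum session, leaving the global value alone
SELECT @@session.binlog_format;
```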
Successful attempts in the past
As I mentioned above, the SOP has been successful in the past: the first time when we upgraded from MySQL 5.6.x to 8.x, and again for another migration whose details I don't recall. Both times, starting replication returned a success message.
Alternatives?
- Amazon RDS Blue/Green Deployments - would be great, but it does not support RDS Proxy, which we use.
- Performing the upgrades with no special treatment; Multi-AZ and RDS Proxy should reduce downtime.
  - This is true for instance class changes. We could change from our `db.m5.2xlarge` to `db.m6g.2xlarge` with minimal downtime: AWS will first change the instance class of the standby instance, fail over to the upgraded instance, and then change the instance class of the new standby instance.
  - This does not seem to be true for engine upgrades, i.e. MySQL 8.0.28 to MySQL 8.0.36, as per the AWS documentation Maintenance for Multi-AZ deployments (emphasis mine):
If you upgrade the database engine for your DB instance in a Multi-AZ deployment, Amazon RDS modifies both primary and secondary DB instances at the same time. In this case, both the primary and secondary DB instances in the Multi-AZ deployment are unavailable during the upgrade. This operation causes downtime until the upgrade is complete. The duration of the downtime varies based on the size of your DB instance.
Update (before even posting this question)
While typing this question out, I ended up doing more googling to see if I had missed something, which delayed posting by a day or so. The next day I checked `show slave status` and was surprised to find
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Curious, I then tried to stop and start replication again to see if the initial error message on starting replication would also be fixed. However, when I did this,
- I still got the error message `Slave has encountered an error. Run SHOW SLAVE STATUS\G; to see the error.`
- AND `show slave status` once again showed `Slave_SQL_Running_State: Waiting for dependent transaction to commit`
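For reference, the exact retry sequence I ran on `M2` (the same RDS stored procedures used in the SOP above):

```sql
CALL mysql.rds_stop_replication;
CALL mysql.rds_start_replication;  -- still reports "Slave has encountered an error..."
SHOW SLAVE STATUS\G                -- Slave_SQL_Running_State: Waiting for dependent transaction to commit
```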
I decided to wait another day, and today the `Slave_SQL_Running_State` is back to `Slave has read all relay log; waiting for more updates`.
Summary of my questions
- Why am I seeing the two errors when starting replication at step 11?
  - `Slave has encountered an error. Run SHOW SLAVE STATUS\G; to see the error.`
  - `Slave_SQL_Running_State: Waiting for dependent transaction to commit`
- As the replication lag eventually goes to zero despite `Slave_SQL_Running_State: Waiting for dependent transaction to commit`, and eventually even `Slave_SQL_Running_State` settles into the normal state of `Slave has read all relay log; waiting for more updates`, is it safe to assume everything is good and that I can proceed with using the new instance?
- Are there any methods besides the `pt-table-checksum` script to verify data consistency?
- Can I safely use `pt-table-checksum` as outlined in the article above, which will set the `binlog_format` to `STATEMENT`?