One of my slaves is no longer replicating. Seconds_Behind_Master keeps increasing, Exec_Master_Log_Pos does not advance, and Relay_Log_Space keeps growing. Slave_IO_Running and Slave_SQL_Running are both Yes (unless I stop the slave, or it hits the 1205 error).
I've tried the solutions from a thread that sounded similar, Slave SQL thread got hanged, but haven't had any luck. I also tried a RESET SLAVE, which still produces the same behavior.
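For completeness, the reset was roughly the standard sequence, re-pointing at the coordinates shown in the status output below:
STOP SLAVE;
RESET SLAVE;
-- re-point at the master using the coordinates from the status output below
-- (MASTER_PASSWORD omitted here)
CHANGE MASTER TO
    MASTER_HOST = '10.0.40.203',
    MASTER_USER = 'replicant',
    MASTER_LOG_FILE = 'mysql-bin.000779',
    MASTER_LOG_POS = 771399898;
START SLAVE;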
Additionally, when I run:
stop slave;
on my instance, it takes over 30 seconds to execute:
Query OK, 0 rows affected (33.97 sec)
show slave status\G
returns:
Slave_IO_State: Waiting for master to send event
Master_Host: 10.0.40.203
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000779
Read_Master_Log_Pos: 881930813
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000779
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: test
Replicate_Do_Table: Users,corporations,dates,systemspecs,test_replication,domains,test,ips,deleteddate,percona_checksum,accesslevels,status,collectionsdata,orders,email_to_user,requests,userprops,percona_checksum,useremails,requests_site,sections,ordertosection,UserToGroup,validkeys
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: percona.%
Replicate_Wild_Ignore_Table: test.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 771399898
Relay_Log_Space: 110531372
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 4784
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2222
Master_UUID: example
Master_Info_File: /mnt/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: updating
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
I have four other slaves of the same master that are all functional, so I know the master's binary logs aren't corrupt.
If I leave replication running, I eventually end up with a 1205 error:
Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
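As the message suggests, one knob is slave_transaction_retries (default 10). Something like the following would raise it, though it presumably just masks whatever is holding the locks; 50 here is an arbitrary example value, not a recommendation:
-- check the current value (default is 10)
SHOW VARIABLES LIKE 'slave_transaction_retries';
-- the variable is dynamic; 50 is an arbitrary example value
SET GLOBAL slave_transaction_retries = 50;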
Update:
Running SHOW PROCESSLIST brought back:
348 | replicant | serverDNS | NULL | Binlog Dump | 1107340 | Master has sent all binlog to slave; waiting for binlog to be updated
After finding this, we raised innodb_lock_wait_timeout from its default of 50 to 14400. This allowed replication to proceed again. However, it is unclear why the 50-second timeout would occur on only one of the five slaves. All slaves are m5.2xlarge AWS instances, so they have the same resources.
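For reference, the change was applied dynamically, roughly like this (it also has to go into my.cnf to survive a restart, and the slave SQL thread only picks up the new global value once it reconnects):
SET GLOBAL innodb_lock_wait_timeout = 14400;
-- the global value only applies to sessions started afterwards,
-- so restart the slave threads to pick it up
STOP SLAVE;
START SLAVE;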
Additionally, should I stop at 14400, or should I just set this to the maximum of 1073741824?
Update 2:
If I restart the mysql service, replication proceeds as expected for about a day, then the issue reproduces.
Additionally, this slave is itself a master to another system, if that makes a difference. That downstream slave is running fine.
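To try to catch whatever is blocking the SQL thread, I have been polling the InnoDB transaction tables while it is stuck. This assumes the 5.6/5.7 information_schema layout; MySQL 8.0 moved lock waits to performance_schema.data_lock_waits:
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
  FROM information_schema.innodb_lock_waits w
  JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
  JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;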
The slave status lines that currently look relevant (in my eyes):
Master_Log_File: mysql-bin.000786
Read_Master_Log_Pos: 131895019
Relay_Log_File: mysqld-relay-bin.000025
Relay_Log_Pos: 52668949
Relay_Master_Log_File: mysql-bin.000786
Exec_Master_Log_Pos: 91692081
Relay_Log_Space: 131895472
Seconds_Behind_Master: 12163
Exec_Master_Log_Pos is currently stuck at 91692081.
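To see which event the SQL thread is actually stuck on, I can decode the relay log from the current relay position (the file name and position are the ones shown above):
SHOW RELAYLOG EVENTS IN 'mysqld-relay-bin.000025' FROM 52668949 LIMIT 10;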
Update 3:
Looking into it further, OS file reads, OS file writes, and OS fsyncs are consistently increasing. I have also found a warning being logged:
Warning: difficult to find free blocks in the buffer pool (324 search iterations)! 0 failed attempts to flush a page! Consider increasing the buffer pool size. It is also possible that in your Unix version fsync is very slow, or completely frozen inside the OS kernel. Then upgrading to a newer version of your operating system may help. Look at the number of fsyncs in diagnostic info below.
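Based on that warning, I am now checking the buffer pool sizing and how often page reads have to wait for a free block; on a healthy server Innodb_buffer_pool_wait_free should stay near zero:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';
-- resizing online requires MySQL 5.7.5+; on older versions this is a my.cnf
-- change plus a restart. 8G (8589934592) is an illustrative value, not a recommendation.
SET GLOBAL innodb_buffer_pool_size = 8589934592;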