MariaDB replication lagging up to a dozen times a day for no obvious reasons

Question

We first set up MariaDB replication using less powerful servers but considering our increased database load we had to rent more powerful servers and that's when the troubles began.

Right now the replication between the master and slave occasionally lags between 0 and 1000 seconds (it's slowly rising to 1000, approximately a second for two) for reasons which we've not been able to identify. This happens on average a dozen times a day. Considering the nature of my company business, we cannot afford any lags at all.

Master

The master has a six-core Intel CPU, 128GB of RAM, 1TB SSD in raid 1 configuration and load averages slightly above 2, running MariaDB 10.4.2x on Ubuntu 20.04 LTS.

For the past roughly 18 hours there have been just two replication related errors/warnings in the log:

Oct 21 12:00:35 master.com mysqld[808]: 2022-10-21 12:00:35 5677386 [Warning] Aborted connection 5677386 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)
Oct 21 12:03:46 master.com mysqld[808]: 2022-10-21 12:03:46 5780035 [Warning] Aborted connection 5780035 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)

Slave

The slave has exactly the same HW configuration and load averages close to 1.

MariaDB [(none)]> show slave status\G 
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: master.com
                   Master_User: replication
                   Master_Port: 3306
                 Connect_Retry: 5
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 133718212
               Relay_Log_Space: 133718872
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
         Seconds_Behind_Master: 0
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 105
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-105-9982390796
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 2226622

We have a 1Gbit connection between servers, it's almost perfect. ping -f reports fewer than 0.002% of lost packets:

root@master # ping -f slave.com
PING slave.com 56(84) bytes of data.
.^C
--- slave.com ping statistics ---
71492 packets transmitted, 71491 received, 0.00139876% packet loss, time 25747ms
rtt min/avg/max/mdev = 0.239/0.339/1.941/0.066 ms, ipg/ewma 0.360/0.335 ms

I've searched the Internet for the issue and haven't found any satisfactory answers in terms of trying to understand the root cause or how to debug the issue.

Right now it's working as intended but earlier when we were experiencing the issue I saw this on master:

$ SELECT * FROM information_schema.processlist WHERE User="replication"\G;
           USER: replication
           HOST: slave.com:47200
             DB: NULL
        COMMAND: Binlog Dump
           TIME: 2041
          STATE: Master has sent all binlog to slave; waiting for binlog to be updated

I'm not sure what to make of it. How come "it's sent all binlog to slave"? Why is the lag increasing at this time? What's the hold out? Why aren't master SQL queries sent to the slave?

As of now, 2022-11-01 UTC 10:48, the lag is slowly increasing:

MariaDB [(none)]> SHOW SLAVE STATUS\G;
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: master.com
                   Master_User: replication
                   Master_Port: 3306
                 Connect_Retry: 5
               Master_Log_File: mariadb-bin.002061
           Read_Master_Log_Pos: 395606907
                Relay_Log_File: mysqld-relay-bin.000253
                 Relay_Log_Pos: 340046809
         Relay_Master_Log_File: mariadb-bin.002061
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 340046508
               Relay_Log_Space: 395607567
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: 389
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 105
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-105-10062038390
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Closing tables
              Slave_DDL_Groups: 8
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 81798798
1 row in set (0.000 sec)
ERROR: No query specified
MariaDB [(none)]> SHOW FULL PROCESSLIST\G;
*************************** 1. row ***************************
      Id: 1
    User: system user
    Host: 
      db: NULL
 Command: Daemon
    Time: NULL
   State: InnoDB purge coordinator
    Info: NULL
Progress: 0.000
*************************** 2. row ***************************
      Id: 2
    User: system user
    Host: 
      db: NULL
 Command: Daemon
    Time: NULL
   State: InnoDB purge worker
    Info: NULL
Progress: 0.000
*************************** 3. row ***************************
      Id: 3
    User: system user
    Host: 
      db: NULL
 Command: Daemon
    Time: NULL
   State: InnoDB purge worker
    Info: NULL
Progress: 0.000
*************************** 4. row ***************************
      Id: 4
    User: system user
    Host: 
      db: NULL
 Command: Daemon
    Time: NULL
   State: InnoDB purge worker
    Info: NULL
Progress: 0.000
*************************** 5. row ***************************
      Id: 5
    User: system user
    Host: 
      db: NULL
 Command: Daemon
    Time: NULL
   State: InnoDB shutdown handler
    Info: NULL
Progress: 0.000
*************************** 6. row ***************************
      Id: 11
    User: system user
    Host: 
      db: NULL
 Command: Slave_IO
    Time: 945669
   State: Waiting for master to send event
    Info: NULL
Progress: 0.000
*************************** 7. row ***************************
      Id: 12
    User: system user
    Host: 
      db: NULL
 Command: Slave_SQL
    Time: 0
   State: Closing tables
    Info: NULL
Progress: 0.000
*************************** 8. row ***************************
      Id: 50
    User: replication
    Host: XX.XX.XX.XXX:52970
      db: NULL
 Command: Binlog Dump
    Time: 945656
   State: Master has sent all binlog to slave; waiting for binlog to be updated
    Info: NULL
Progress: 0.000
*************************** 9. row ***************************
      Id: 1244933
    User: replication
    Host: XX.XX.XX.XX:54514
      db: NULL
 Command: Binlog Dump
    Time: 83210
   State: Master has sent all binlog to slave; waiting for binlog to be updated
    Info: NULL
Progress: 0.000
*************************** 10. row ***************************
      Id: 1372686
    User: root
    Host: localhost
      db: NULL
 Command: Query
    Time: 0
   State: Init
    Info: SHOW FULL PROCESSLIST
Progress: 0.000
10 rows in set (0.000 sec)
ERROR: No query specified
MariaDB [(none)]>
MySQL on localhost (10.4.26)                                                                        load 1.08 1.31 1.29 1/388 427200 up 10+22:42:50 [10:47:22]
 Queries: 34.0     qps:    0 Slow:     0.0         Se/In/Up/De(%):    00/00/00/00 
 Sorts:      0 qps now:    1 Slow qps: 0.0  Threads:    3 (   1/   1) 00/00/00/00 
 Key Efficiency: 100.0%  Bps in/out:   0.0/  0.2   Now in/out:  21.2/ 3.8k
 Replication IO:Yes SQL:Yes Delay: 464 sec.
   Id      User         Host/IP         DB       Time    Cmd    State Query                                                                               
   --      ----         -------         --       ----    ---    ----- ----------                                                                          
   11 system us                                945768 Slave_  Waiting                                                                                     
   50 replicati          slave2                945756 Binlog Master h                                                                                     

1244933 replicati          slave3                 83310 Binlog Master h

        1 system us                                     0 Daemon InnoDB p

        2 system us                                     0 Daemon InnoDB p

        3 system us                                     0 Daemon InnoDB p

        4 system us                                     0 Daemon InnoDB p

        5 system us                                     0 Daemon InnoDB s

       12 system us                                     0 Slave_  Closing

  1372858      root       localhost                     0  Query     Init show full processlist

Please advise.

score 0 · Accepted Answer · edited Sep 13 '24 at 18:44

Turns out it's a known bug/feature of MariaDB.

https://mariadb.com/resources/blog/goodbye-replication-lag/

To fix it, on all DB instances that serve as slaves the slave_parallel_threads variable must be increased from its default value of 1 to something larger this way:

STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_threads = 4;
START SLAVE SQL_THREAD;
SELECT @@slave_parallel_threads;

slave_parallel_threads=# is a dynamic variable that can be changed without restarting mysqld. All replicas connections must however be stopped before changing the value. You can also set it permanently in the MariaDB configuration file.

score 0 · Answer 2 · answered May 10 '23 at 06:22

0

Put it in the config to make it permanent:

slave_parallel_threads = 4

answered May 10 '23 at 06:22

Nico

1

MariaDB replication lagging up to a dozen times a day for no obvious reasons

Master

Slave

2 Answers2