1

We first set up MariaDB replication using less powerful servers but considering our increased database load we had to rent more powerful servers and that's when the troubles began.

Right now the replication between the master and slave occasionally lags between 0 and 1000 seconds (it's slowly rising to 1000, approximately a second for two) for reasons which we've not been able to identify. This happens on average a dozen times a day. Considering the nature of my company business, we cannot afford any lags at all.

Master

The master has a six-core Intel CPU, 128GB of RAM, 1TB SSD in raid 1 configuration and load averages slightly above 2, running MariaDB 10.4.2x on Ubuntu 20.04 LTS.

For the past roughly 18 hours there have been just two replication related errors/warnings in the log:

Oct 21 12:00:35 master.com mysqld[808]: 2022-10-21 12:00:35 5677386 [Warning] Aborted connection 5677386 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)
Oct 21 12:03:46 master.com mysqld[808]: 2022-10-21 12:03:46 5780035 [Warning] Aborted connection 5780035 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)

Slave

The slave has exactly the same HW configuration and load averages close to 1.

MariaDB [(none)]> show slave status\G 
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: master.com
                   Master_User: replication
                   Master_Port: 3306
                 Connect_Retry: 5
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 133718212
               Relay_Log_Space: 133718872
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
         Seconds_Behind_Master: 0
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 105
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-105-9982390796
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 2226622

We have a 1Gbit connection between servers, it's almost perfect. ping -f reports fewer than 0.002% of lost packets:

root@master # ping -f slave.com
PING slave.com 56(84) bytes of data.
.^C
--- slave.com ping statistics ---
71492 packets transmitted, 71491 received, 0.00139876% packet loss, time 25747ms
rtt min/avg/max/mdev = 0.239/0.339/1.941/0.066 ms, ipg/ewma 0.360/0.335 ms

I've searched the Internet for the issue and haven't found any satisfactory answers in terms of trying to understand the root cause or how to debug the issue.

Right now it's working as intended but earlier when we were experiencing the issue I saw this on master:

$ SELECT * FROM information_schema.processlist WHERE User="replication"\G;
           USER: replication
           HOST: slave.com:47200
             DB: NULL
        COMMAND: Binlog Dump
           TIME: 2041
          STATE: Master has sent all binlog to slave; waiting for binlog to be updated

I'm not sure what to make of it. How come "it's sent all binlog to slave"? Why is the lag increasing at this time? What's the hold out? Why aren't master SQL queries sent to the slave?

As of now, 2022-11-01 UTC 10:48, the lag is slowly increasing:

MariaDB [(none)]> SHOW SLAVE STATUS\G;
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: master.com
                   Master_User: replication
                   Master_Port: 3306
                 Connect_Retry: 5
               Master_Log_File: mariadb-bin.002061
           Read_Master_Log_Pos: 395606907
                Relay_Log_File: mysqld-relay-bin.000253
                 Relay_Log_Pos: 340046809
         Relay_Master_Log_File: mariadb-bin.002061
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 340046508
               Relay_Log_Space: 395607567
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: 389
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 105
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-105-10062038390
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Closing tables
              Slave_DDL_Groups: 8
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 81798798
1 row in set (0.000 sec)

ERROR: No query specified

MariaDB [(none)]> SHOW FULL PROCESSLIST\G; *************************** 1. row *************************** Id: 1 User: system user Host: db: NULL Command: Daemon Time: NULL State: InnoDB purge coordinator Info: NULL Progress: 0.000 *************************** 2. row *************************** Id: 2 User: system user Host: db: NULL Command: Daemon Time: NULL State: InnoDB purge worker Info: NULL Progress: 0.000 *************************** 3. row *************************** Id: 3 User: system user Host: db: NULL Command: Daemon Time: NULL State: InnoDB purge worker Info: NULL Progress: 0.000 *************************** 4. row *************************** Id: 4 User: system user Host: db: NULL Command: Daemon Time: NULL State: InnoDB purge worker Info: NULL Progress: 0.000 *************************** 5. row *************************** Id: 5 User: system user Host: db: NULL Command: Daemon Time: NULL State: InnoDB shutdown handler Info: NULL Progress: 0.000 *************************** 6. row *************************** Id: 11 User: system user Host: db: NULL Command: Slave_IO Time: 945669 State: Waiting for master to send event Info: NULL Progress: 0.000 *************************** 7. row *************************** Id: 12 User: system user Host: db: NULL Command: Slave_SQL Time: 0 State: Closing tables Info: NULL Progress: 0.000 *************************** 8. row *************************** Id: 50 User: replication Host: XX.XX.XX.XXX:52970 db: NULL Command: Binlog Dump Time: 945656 State: Master has sent all binlog to slave; waiting for binlog to be updated Info: NULL Progress: 0.000 *************************** 9. row *************************** Id: 1244933 User: replication Host: XX.XX.XX.XX:54514 db: NULL Command: Binlog Dump Time: 83210 State: Master has sent all binlog to slave; waiting for binlog to be updated Info: NULL Progress: 0.000 *************************** 10. row *************************** Id: 1372686 User: root Host: localhost db: NULL Command: Query Time: 0 State: Init Info: SHOW FULL PROCESSLIST Progress: 0.000 10 rows in set (0.000 sec)

ERROR: No query specified

MariaDB [(none)]>

MySQL on localhost (10.4.26) load 1.08 1.31 1.29 1/388 427200 up 10+22:42:50 [10:47:22] Queries: 34.0 qps: 0 Slow: 0.0 Se/In/Up/De(%): 00/00/00/00 Sorts: 0 qps now: 1 Slow qps: 0.0 Threads: 3 ( 1/ 1) 00/00/00/00 Key Efficiency: 100.0% Bps in/out: 0.0/ 0.2 Now in/out: 21.2/ 3.8k Replication IO:Yes SQL:Yes Delay: 464 sec.

   Id      User         Host/IP         DB       Time    Cmd    State Query                                                                               
   --      ----         -------         --       ----    ---    ----- ----------                                                                          
   11 system us                                945768 Slave_  Waiting                                                                                     
   50 replicati          slave2                945756 Binlog Master h                                                                                     

1244933 replicati slave3 83310 Binlog Master h
1 system us 0 Daemon InnoDB p
2 system us 0 Daemon InnoDB p
3 system us 0 Daemon InnoDB p
4 system us 0 Daemon InnoDB p
5 system us 0 Daemon InnoDB s
12 system us 0 Slave_ Closing
1372858 root localhost 0 Query Init show full processlist

Please advise.

2 Answers2

0

Turns out it's a known bug/feature of MariaDB.

https://mariadb.com/resources/blog/goodbye-replication-lag/

To fix it, on all DB instances that serve as slaves the slave_parallel_threads variable must be increased from its default value of 1 to something larger this way:

STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_threads = 4;
START SLAVE SQL_THREAD;
SELECT @@slave_parallel_threads;

slave_parallel_threads=# is a dynamic variable that can be changed without restarting mysqld. All replicas connections must however be stopped before changing the value. You can also set it permanently in the MariaDB configuration file.

mustaccio
  • 28,207
  • 24
  • 60
  • 76
0

Put it in the config to make it permanent:

slave_parallel_threads = 4
Nico
  • 1