
I have a primary MySQL DB with a secondary replica which, apparently, works perfectly.

The replica is connected through a VPN, which reports no problems.

SHOW SLAVE STATUS reports zero seconds behind master (except on slow query updates and so on, when lag is expected).

After hours, the load on the primary drops, and the replica stops updating. That is, I have a single table that is periodically updated even after hours, with a DATETIME timestamp, so it's very easy to check the MAX() of the recorded datetime; and by doing this, I can see lags that slowly grow, up to one hour. Suspiciously exactly one hour at most, actually.
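For reference, this is the kind of check I run (table and column names here are made up for illustration; the real table simply carries an externally generated DATETIME):

    -- On the replica: newest replicated row vs. the replica's own clock.
    -- 'activity_log' and 'recorded_at' are placeholder names.
    SELECT MAX(recorded_at)                               AS last_replicated,
           TIMESTAMPDIFF(SECOND, MAX(recorded_at), NOW()) AS apparent_lag_seconds
    FROM activity_log;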

Experimentally, I have created a dummy table and started writing to it. And, as I suspected, after a few (well, many) updates, abruptly the replica catches up... and immediately starts lagging again.

So, I conclude that the master sends updates only when a certain threshold is met. A size threshold? Say, every megabyte or so? And there must also be some timeout that expires after one hour, when the resync is forced come Hell or high water.

STOP SLAVE followed by START SLAVE also syncs the replication, every time, in a few seconds.
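For the record, this is the plain stop/start cycle I mean, nothing fancier:

    -- On the replica. Tearing down and re-opening the replication
    -- connection is enough to make it catch up within seconds.
    STOP SLAVE;
    START SLAVE;
    SHOW SLAVE STATUS\G   -- Seconds_Behind_Master goes back to 0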

I would very much like to either lower the data threshold, if it exists, or (waaay better!) decrease the "forced update" timeout to one minute, so that the slave keeps in sync.

What I've checked:

  • innodb_flush_log_at_trx_commit = 0 (shouldn't matter. Tried 1 with no effect)
  • innodb_flush_log_at_timeout = 1
  • sync_binlog = 1 (again, should not matter)
  • binlog format is okay (I have other replicas that work; in fact this one, the only one behind a VPN, is an exact clone, except for the actual MySQL data, of another replica that replicates a different database flawlessly. Both are Ubuntu 20.04 LTS).
  • The primary reports all OK.
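(For completeness, the settings above were checked on both servers with something like the following; nothing exotic:)

    SHOW GLOBAL VARIABLES
    WHERE Variable_name IN ('innodb_flush_log_at_trx_commit',
                            'innodb_flush_log_at_timeout',
                            'sync_binlog',
                            'binlog_format');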

Replica process list is also good:

    | 840 | system user | | NULL | Connect | 38 | Waiting for master to send event                       | NULL |
    | 841 | system user | | NULL | Connect |  8 | Slave has read all relay log; waiting for more updates | NULL |

IDs (as in this other question) are correct.

This... well, this looks as if someone were playing an --abort-slave-event-count trick on me:

When this option is set to some positive integer value other than 0 (the default) it affects replication behavior as follows: After the replication SQL thread has started, value log events are permitted to be executed; after that, the replication SQL thread does not receive any more events, just as if the network connection from the source were cut. The replication SQL thread continues to run, and the output from SHOW SLAVE STATUS displays Yes in both the Slave_IO_Running and the Slave_SQL_Running columns, but no further events are read from the relay log.

Except, of course, that no one could have conceivably done it.

Other checks or clarifications thanks to Rick James's contribution:

  • Timezone problems: nope, both servers use the same configuration. And in that case I ought to see a constant lag, not a slowly growing one.
  • Transactions are committed eventually (they may run up to five minutes depending on what they're doing, but never more than that).
  • Nothing else is running either on the primary or the replica.
  • Timestamp/Datetime: the table receives a datetime that's externally generated, so neither the machine's timezone nor MySQL's enters into the matter.
  • I already did the trick of entering data into an expendable table (created ad hoc). I entered row after row, in autocommit mode, and ran a SELECT COUNT() on the replica. The count stays frozen (while rows keep going into the master, where the COUNT goes up as expected); then, after some time, bang!, the data arrives at the replica and the COUNT jumps to the expected value... and then stays frozen again for another not-so-short while (sketched below).
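The expendable-table experiment, roughly (names are made up; the real table is equally trivial):

    -- On the primary, autocommit ON:
    CREATE TABLE repl_probe (
      id         INT AUTO_INCREMENT PRIMARY KEY,
      created_at DATETIME NOT NULL
    );
    INSERT INTO repl_probe (created_at) VALUES (NOW());   -- repeated, one row at a time

    -- On the replica, meanwhile:
    SELECT COUNT(*) FROM repl_probe;   -- frozen, then suddenly jumps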

Checks I should have done and didn't, will do tomorrow:

  • log both the COUNT and an estimate of the data size, using rows of different sizes (e.g. a slowly growing TEXT field), together with the time intervals at which the sync happens, while essentially the only thing running is my fake "insert rows into expendable table" routine. This should tell whether the critical factor is a timeout, a data threshold, or a row threshold (a sketch of the idea follows).
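A rough sketch of what I have in mind, with a payload column whose size I can grow between batches (again, hypothetical names):

    CREATE TABLE repl_probe_sized (
      id         INT AUTO_INCREMENT PRIMARY KEY,
      created_at DATETIME NOT NULL,
      payload    TEXT
    );

    -- Vary the REPEAT length between batches and log, on the replica,
    -- when COUNT(*) jumps: time-driven, byte-driven, or row-driven?
    INSERT INTO repl_probe_sized (created_at, payload)
    VALUES (NOW(), REPEAT('x', 1024));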

An interesting and useful observation by @RickJames:

Transactions are sent to the Replicas immediately after COMMIT.

I thought as much, and now I am sure: transactions should be sent to the replica immediately after COMMIT. Except that here they aren't. And this is not a misconfiguration (I know I could set up delayed replication, but I didn't); so, since MySQL would not ordinarily hold back replication, something must be positively borked here, something that is interfering with MySQL.

I'm more and more suspecting the VPN, somehow. Possibly a logical fallacy of post hoc ergo propter hoc, but this is the only VPN connection I have and the only connection that gives me headaches.

Other checks, with this in mind:

  • Network connectivity. Perfect.
  • Packet loss rate. Negligible on the copper, zero on the VPN.
  • Routing. Copacetic. But of course that would have been noticeable, as in "nothing works".
  • MTU. 1500 at both ends, an intentionally naive netcat block transfer of 200 MB of /dev/zero went from primary to replica, and vice versa, with no issues.

Whatever (else) could I be missing?

LSerni

2 Answers


SHOW SLAVE STATUS reports zero seconds behind master (except on slow query updates and so on, when lag is expected).

The Seconds_behind_master value (aka Seconds_behind_source in 8.0) only shows the difference between the timestamp of the last binlog entry the replica has executed and the timestamp of the last binlog entry it has downloaded. It doesn't know whether more binlog events are still waiting to be downloaded from the source.
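One way to see this from the outside (a sketch; the field names are the ones shown by SHOW SLAVE STATUS / SHOW MASTER STATUS):

    -- On the replica: how far the IO thread has READ vs. how far the
    -- SQL thread has EXECUTED.
    SHOW SLAVE STATUS\G
    -- compare Master_Log_File / Read_Master_Log_Pos
    -- with    Relay_Master_Log_File / Exec_Master_Log_Pos

    -- On the source: how far the binlog has actually been WRITTEN.
    SHOW MASTER STATUS;
    -- if File / Position is ahead of the replica's Read_Master_Log_Pos,
    -- events exist that the replica has not even downloaded yet.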

STOP SLAVE followed by START SLAVE also syncs the replication, every time, in a few seconds.

I think this is the most relevant clue. At least I've seen this before, and I was told why stopping and starting replication fixes it.

If MySQL's replication connection experiences packet loss from an unreliable connection, it uses exponential backoff when retrying transmission. This is done to avoid network congestion from retry packets. But it means that networks that are especially flaky cause long-lived TCP connections to become progressively slower and slower, as more pauses are inserted in between retries.

Stopping and starting replication starts a new TCP connection, which naturally has no pauses. But if it's a persistently flaky network, soon that connection also becomes slower.

Your deduction that something is wrong with your network or your VPN is probably correct.

Transactions are sent to the Replicas immediately after COMMIT.

This is not exactly true. Binlogs are written on the source instance immediately after the COMMIT. There they sit, waiting for a replica to download them. If the replica runs STOP SLAVE (aka STOP REPLICA), or if the replica itself is stopped, or if the network connection is broken for any reason, then no binlogs are fetched.

If the network is very slow, then binlogs continue to be fetched, but at such a slow rate that the replica may become further and further behind, even though it has executed all events in the binlog it has downloaded so far.

I have a single table that is periodically updated even after hours, with a DATETIME timestamp, so it's very easy to check the MAX() of the recorded datetime

This is the principle of the "heartbeat" feature introduced in MySQL 8.0. It sends a timestamp through the replication channel at a regular interval, so you can compare the latest timestamp it has processed to the current system time on the replica, and that gives a more accurate measure of replication lag than Seconds_behind_source.

If you're using a version of MySQL earlier than 8.0, you can use the free tool pt-heartbeat to achieve the same thing.
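If you prefer not to install anything, a do-it-yourself heartbeat along the same lines takes only a few statements (a sketch; the table and event names are mine, it assumes event_scheduler=ON on the source, and it assumes both servers' clocks are in sync):

    -- On the source:
    CREATE TABLE heartbeat (id TINYINT PRIMARY KEY, ts DATETIME(6) NOT NULL);
    INSERT INTO heartbeat VALUES (1, NOW(6));

    CREATE EVENT heartbeat_tick
      ON SCHEDULE EVERY 1 SECOND
      DO UPDATE heartbeat SET ts = NOW(6) WHERE id = 1;

    -- On the replica, the real lag is simply:
    SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(6)) / 1000000 AS lag_seconds
    FROM heartbeat WHERE id = 1;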

Bill Karwin
  • Timezone problems? Use TIMESTAMP instead of DATETIME for monitoring. Also compare SHOW VARIABLES LIKE '%zone%' between the servers.

  • Transactions are sent to the Replicas immediately after COMMIT. Check that you either run with autocommit=ON or explicitly use COMMIT. (There is no megabyte threshold.)

  • Backups using mysqldump (or similar tool) can clog things up.

  • ALTERs can clog the replication stream.

  • What version of MySQL? Any "clustering" technique being used? Is GTID in use?

  • innodb_flush_log_at_trx_commit -- "2" is similar to "0", but probably better in all situations. Change that.

  • Does STOP/START always fix the issue?

  • Is any big query running at night (ETL / UPDATE / DELETE / etc.)?
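A compact way to gather the checks above on both servers (a sketch; run on primary and replica and diff the output):

    SHOW VARIABLES LIKE '%zone%';                    -- timezone comparison
    SELECT VERSION(), @@gtid_mode;                   -- version and GTID usage (gtid_mode exists in 5.6+)
    SHOW FULL PROCESSLIST;                           -- catch long-running nightly ETL / ALTER
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;   -- the suggested change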

Rick James