Brief problem statement
First things first: after posting this question originally and working more with our DBAs, I learned that our DB actually runs in a container rather than being installed natively. From what I've read, that's discouraged in prod, since databases do all sorts of micro-optimizations with memory, storage and CPU, and a container might obscure some of that, so the behavior won't be optimal. So if, after reading this question, you think the problem could actually be caused by the DB running in a container, definitely let me know.
Alright so now to the actual question:
There will be a detailed explanation below, but this initial problem statement is just to give you a feel of what we’re dealing with here.
We have a feature called “Tasks”, which is just a generic task runner. A task is represented by a single row in an InnoDB table (using MariaDB 10.6.14); anyone can create a task, and then one of the workers will pick it up, run it, potentially retrying a few times if there are errors, and once the task is finished (successfully or not), delete the row from the table.
The issue is that, when there are plenty of tasks to run (even just 200K runnable tasks that we need to run ASAP is enough), the performance alternates between these two modes:
- Performance is great (the query to pick the next task takes 20-30ms, and every worker is able to run about 200 tasks per second or more)
- Performance is terrible (the query quickly jumps to hundreds of milliseconds, and keeps slowly growing, and we’re running 1-3 tasks per second per worker).
The “great” phase usually lasts for 5-10 mins, followed by a few hours of the “terrible” phase; then it fixes itself and goes back to “great”, etc.
Here’s how it looks on the charts:
A few quick points to highlight:
- On this chart, no new tasks are being inserted into the table; we already have a few hundred thousand tasks in the table that we need to run ASAP, and as we run them (and DELETE the rows from the table), the performance alternates like that.
- It doesn’t seem to have anything to do with the amount of data in the table: as mentioned, even just 200K rows is enough, which is nothing for modern hardware (and yes, the hardware specs used here are very good). It’s rather the velocity of changes that seems to be causing this: those 200K rows represent the tasks that we need to run ASAP, and we do. When we delete/update tasks at a much lower rate, these performance issues don’t happen.
- It doesn’t seem to be caused by a wrong index: if there wasn’t a proper index, the performance would always be bad. But here it switches back and forth on its own, without us adding any new data to the table, so it seems to be some internal MariaDB issue.
This was just to give you a feel of what the issue is like. I have more to say on this, but before I do that, I feel I need to share more implementation details, so let’s get to it.
Background info and implementation details
As mentioned above, we have a generic task runner. A task is represented by a single row in an InnoDB table; anyone can create a task, and then one of the workers will pick it up, take care of it, and eventually delete the row from the table.
There are two additional features worth highlighting:
- A task can be scheduled either to run ASAP or after a specific time in the future; the only time-related guarantee here is that the task will not run before its scheduled time. We say that a task is runnable if we don’t need to wait any longer and can run it whenever we get to it;
- Every task has a priority: from 0 (highest) to 4 (lowest); when there are multiple runnable tasks with different priorities, the higher-priority tasks will always be picked first.
By the way, those features are a big part of why this was implemented on top of MariaDB rather than, say, Kafka or similar; it’s not a textbook Kafka use case.
The usage pattern can be very bursty: most of the time we might have something like 10-20 tasks per second being created and run ASAP, and that causes no problems; but then, as part of some batch job, we might create a few million low-priority tasks, and the workers will run them at a rate of about 500 tasks per second in total.
This is what the table looks like (there are a few more fields, but they are not relevant to the problem, so they are omitted):
CREATE TABLE `tasks` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
-- Identifies which handler to actually run for this task. Think of it
-- as the name of a function to call.
name varchar(127) NOT NULL,
-- Priority of the task; the highest priority is 0, and as the number
-- increases, priority lowers. Currently, it can be from 0 to 4.
priority tinyint(3) unsigned NOT NULL DEFAULT 2,
-- Status of the task, one of those:
-- - 0: PENDING: the task is ready to run asap or after next_attempt time.
-- It can be the initial status that the task is created with, but the
-- task could also reenter this status again later if it failed and the
-- scheduler decided to retry it;
-- - 1: PROCESSING: the task is currently being executed;
-- Note: there are no states for completed or failed tasks, because such
-- tasks are deleted from this table.
status tinyint(3) unsigned NOT NULL DEFAULT 0,
-- Specifies the earliest time when the task needs to run next time.
-- Used for tasks scheduled for the future, as well as for implementing
-- retries with backoff. If NULL, the task should run ASAP.
next_attempt timestamp(6) NULL,
PRIMARY KEY (id),
-- See details on this particular index ordering below.
INDEX tasks_next_attempt_id_idx (priority, next_attempt, id)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
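To make the scheduling semantics concrete, here is a sketch of how tasks might be created (the task names and timestamps are made up for illustration):

-- An ASAP task: next_attempt stays NULL, priority defaults to 2.
INSERT INTO tasks (name) VALUES ('send_email');

-- A low-priority task scheduled for a specific time in the future;
-- it is guaranteed not to run before that timestamp.
INSERT INTO tasks (name, priority, next_attempt)
VALUES ('cleanup_old_records', 4, '2024-04-12 03:00:00');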
There are a few workers (not too many: something like 3 or 6) polling this table (see the exact query below); when a worker gets no task to run, it backs off and polls only once per second, but when it successfully gets a task to run, it tries to get the next one right away, so that when there are many runnable tasks, we run them as fast as we can.
The query to pick the next task does the following, atomically:
- Find the id of the next task to run (taking into account the scheduled time and priorities as described above)
- Set its status to 1 (which means “processing”)
- Set the next_attempt to the current time plus 15 seconds (which is the interval after which the task will be rerun if it appears dead; it’s not really relevant to the problem, but for context: while the task is running, its next_attempt is updated every 10 seconds to 15 seconds into the future again, so as long as the worker is functioning and the task keeps running, next_attempt will always be in the future)
- Return the task id
Here’s the exact query (with 2024-04-11 08:00:00 used as the current time: we actually pass specific timestamps like that instead of using NOW() for better testability, but that’s not related to the issue).
UPDATE tasks
INNER JOIN (
SELECT id FROM tasks
WHERE
priority IN (0, 1, 2, 3, 4)
AND (next_attempt IS NULL OR next_attempt <= '2024-04-11 08:00:00')
ORDER BY priority, next_attempt, id
LIMIT 1
FOR UPDATE SKIP LOCKED
) tmp ON tasks.id = tmp.id
SET
tasks.status=1,
tasks.next_attempt='2024-04-11 08:00:15',
tasks.id=LAST_INSERT_ID(tasks.id)
As you can see, there are 3 parts to this query:
- The outer UPDATE
- The INNER JOIN
- The SELECT subquery
In case you’re wondering whether those subqueries with joins are causing the issues, let me mention upfront that when we’re in the “terrible” performance phase and I execute just the SELECT query, without any joins or updates, it is also very, very slow. So it doesn’t seem like the updates or joins are the culprit; the SELECT itself, albeit pretty simple, is the problem.
Let’s now look at every part in more detail.
The SELECT subquery
So here’s the query:
SELECT id FROM tasks
WHERE
priority IN (0, 1, 2, 3, 4)
AND (next_attempt IS NULL OR next_attempt <= '2024-04-11 08:00:00')
ORDER BY priority, next_attempt, id
LIMIT 1
FOR UPDATE SKIP LOCKED
This section is super verbose; if you feel that the query is descriptive enough and it makes sense, feel free to skip this whole section.
As briefly mentioned above, when we’re in the “terrible” performance phase, it’s this very part that is slow: I execute it manually, and even on an empty table it might take more than 3 seconds to complete.
It selects a single task of all the runnable tasks. One very important thing is to make sure that mysql doesn't have to check rows one by one, trying to find a row which finally matches the WHERE clause: our tests have confirmed that having even just 15K of non-runnable tasks in front of runnable ones slows down the query from milliseconds to hundreds of milliseconds, so this is crucial. To this end, we do a number of things:
- The ORDER BY clause tries to put runnable tasks before non-runnable ones, so that ideally the very first row that mysql checks already matches the WHERE clause;
- Obviously, there is an index supporting this ORDER BY;
- The WHERE clause corresponds to the ORDER BY in such a way that when mysql does encounter a non-matching row, it knows that a whole bunch of the upcoming rows will not match either, and quickly skips that whole bunch.
Now let's go over more details, elaborating on this.
Let's imagine for a moment that we don't have task priorities, so all tasks are equal. Then, all we need to do about the ordering is just this:
ORDER BY next_attempt
And that's it. If next_attempt is NULL, it just means "this task should run ASAP", while a non-NULL value means "this task should run after this timestamp". We set this timestamp for a number of different reasons: either the task was initially scheduled to run at a certain time, or the task is being processed and the framework set next_attempt to the time at which we'd consider the task stuck and pick it up again, or the task has failed and we're waiting out the backoff period before running it again. In all those cases we set next_attempt to some timestamp, so ordering by it puts all runnable tasks before non-runnable ones (to state it explicitly: MariaDB puts NULL values before non-NULL ones, which is what we want here).
The threshold between runnable and non-runnable tasks is just the current time, and so the WHERE clause can look like this:
WHERE next_attempt IS NULL OR next_attempt <= NOW()
Remember what we've said about the WHERE clause corresponding to the ORDER BY, so that mysql knows when to cut short. Imagine that we have 10 million rows in this table, and they all have next_attempt in the future. No tasks are runnable here, and when mysql sees the first row which doesn't match the WHERE, knowing that the rows are ordered by next_attempt, it concludes that no other rows can possibly match, and immediately bails out.
So far so good: our ordering always puts runnable tasks first, and mysql has enough info to bail out when we encounter the first non-runnable task.
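Putting this simplified no-priorities case together, the pick query would be just something like this (a sketch, not the query we actually run):

SELECT id FROM tasks
WHERE next_attempt IS NULL OR next_attempt <= NOW()
ORDER BY next_attempt
LIMIT 1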
Now, let's consider what happens when we introduce priorities. A higher-priority runnable task should always run before a lower-priority one, so we should update our ordering to this:
ORDER BY priority, next_attempt
The problem here, though, is that this ordering no longer guarantees that all runnable tasks come before non-runnable ones: there could be a high-priority task scheduled for the future and a lower-priority task that should run ASAP, and in this case we obviously need to run the lower-priority task first, but this ordering puts the high-priority non-runnable task first instead.
The bad news is that if we simply modify the ORDER BY clause as shown above and leave the WHERE clause intact, mysql is not smart enough to handle it efficiently: having even just 15K high-priority tasks scheduled for the future makes the picking very slow, since mysql checks all of those non-runnable high-priority tasks one by one.
The good news, though, is that we can help mysql out by adding one more condition to the WHERE: priority IN (0, 1, 2, 3, 4). And yes, for this to work, we need to explicitly enumerate all the possible values of the priority; a seemingly equivalent (priority >= 0 AND priority <= 4) does not help. We need the exact values (I don't know if the mysql devs have a good reason for this limitation). All in all, our WHERE clause becomes this:
WHERE
priority IN (0, 1, 2, 3, 4)
AND (next_attempt IS NULL OR next_attempt <= NOW())
With this, mysql is smart enough to figure out that when it encounters a row with priority 0 and next_attempt in the future, it can skip all the other rows with priority 0.
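Conceptually, the IN list lets the optimizer treat the index scan as one short range per priority value. A hypothetical hand-written equivalent (not something we actually run, just to illustrate the idea) would look like this:

SELECT id FROM (
  (SELECT id, priority, next_attempt FROM tasks
   WHERE priority = 0 AND (next_attempt IS NULL OR next_attempt <= NOW())
   ORDER BY next_attempt, id LIMIT 1)
  UNION ALL
  (SELECT id, priority, next_attempt FROM tasks
   WHERE priority = 1 AND (next_attempt IS NULL OR next_attempt <= NOW())
   ORDER BY next_attempt, id LIMIT 1)
  -- ... and so on for priorities 2, 3 and 4 ...
) best_per_priority
ORDER BY priority, next_attempt, id
LIMIT 1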
As another point, unrelated to the performance of the pick query: if we have more tasks than we can run in parallel and some of them keep failing, we want the tasks to be picked in a round-robin fashion, so that a bunch of tasks that keep failing forever can't saturate the system and starve the other tasks. Luckily, we don't have to do anything special here; it's also solved by ordering by next_attempt (because whenever a task fails, its next_attempt is set to a timestamp in the future due to backoff). I'm mentioning it just in case we have to rewrite the query somehow, because it's an important point to keep in mind.
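For illustration, the retry path boils down to an update along these lines (the backoff calculation lives in the app code; the 1-minute backoff and the task id below are made-up examples):

-- Task failed: put it back to PENDING and push next_attempt into the
-- future, so it goes to the back of the line within its priority.
UPDATE tasks
SET status = 0,
    next_attempt = '2024-04-11 08:01:00' -- current time + backoff
WHERE id = 42;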
And a couple of final things here:
- We also use FOR UPDATE SKIP LOCKED. This is very important for queue-like applications (like this one); it instructs MariaDB to simply ignore rows which are already locked by another transaction and move on to the next row, instead of waiting for the existing transaction to finish. With multiple workers querying the same data to run queued tasks, this is exactly what we want, and it helps performance very significantly.
- For testability, instead of actually using NOW(), we pass the current timestamp manually from the app code: our tests use mocked time and pass mocked timestamps as the current time.
- Also for testability, we additionally order by id, so the full order is priority, next_attempt, id.
You might have noticed that the status column is not used for querying at all. It is used in a few other places, but only as a sanity check, so technically we could drop this column. Either way, it’s not related to the performance issues.
The INNER JOIN
The subquery discussed above is wrapped in an INNER JOIN (...) tmp to work around a limitation of our MariaDB version (10.6), which doesn’t support a subquery with LIMIT directly inside an IN (...) clause. Wrapping the SELECT in a derived table materializes it into a one-row temporary table, which makes MariaDB happy.
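For reference, this is a sketch of the more direct form we’d have liked to write, which MariaDB rejects (as far as I can tell, because of the LIMIT inside the IN subquery, and also because the subquery selects from the very table being updated):

-- Rejected by MariaDB; shown only to illustrate why the JOIN wrapper exists.
UPDATE tasks
SET status = 1,
    next_attempt = '2024-04-11 08:00:15'
WHERE id IN (
  SELECT id FROM tasks
  WHERE
    priority IN (0, 1, 2, 3, 4)
    AND (next_attempt IS NULL OR next_attempt <= '2024-04-11 08:00:00')
  ORDER BY priority, next_attempt, id
  LIMIT 1
)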
The outer UPDATE
The outer query updates the status to 1 (which means “processing”) and the next_attempt column to the current time plus the heartbeat expiration interval (15s), and does id=LAST_INSERT_ID(id). That last assignment is a no-op as far as the stored value goes, but it instructs LAST_INSERT_ID to memorize the value of id so we can get that id back.
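So, right after the UPDATE, the worker fetches the picked id roughly like this (a sketch; the real code gets it through the DB driver on the same connection):

-- Returns the id memorized by LAST_INSERT_ID(tasks.id) in the UPDATE above.
SELECT LAST_INSERT_ID();

-- The worker can then load the task row itself.
SELECT * FROM tasks WHERE id = LAST_INSERT_ID();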
Back to the problem
OK, so having that laid out, we have enough info to dive deeper into the problem we’re having: when tasks are being deleted from the table too fast (e.g. 600 per second in total across all workers), the performance is great for a few minutes, then it abruptly becomes much worse, and then it slowly keeps getting worse for an hour or more; after that it recovers on its own, and it keeps cycling back and forth this way. To repost the same chart again:
A few more observations:
1. When in the degraded state, the table can be empty while queries can take seconds
It has already happened that the workers handled all the tasks, and the queries to get the next task were still taking more than 3 seconds on an empty table. Here’s the chart illustrating that:
That jump from 1s to 3s is when there were no more tasks to run. I tried to execute just the SELECT part manually, and it took 3.67 seconds, on an empty table:
mysql> select count(*) from tasks;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.3 sec)
mysql> SELECT id FROM tasks WHERE priority IN (0, 1, 2, 3, 4) AND (next_attempt IS NULL OR next_attempt <= NOW()) ORDER BY priority, next_attempt, id LIMIT 1 FOR UPDATE SKIP LOCKED;
Empty set (3.67 sec)
The EXPLAIN looks fine, exactly how it looks when everything works properly:
mysql> EXPLAIN SELECT id FROM tasks WHERE priority IN (0, 1, 2, 3, 4) AND (next_attempt IS NULL OR next_attempt <= NOW()) ORDER BY priority, next_attempt, id LIMIT 1 FOR UPDATE SKIP LOCKED;
+------+-------------+----------+-------+---------------------------------------+---------------------------------------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+---------------------------------------+---------------------------------------+---------+------+--------+--------------------------+
| 1 | SIMPLE | tasks | range | tasks_priority_next_attempt_id_idx | tasks_priority_next_attempt_id_idx | 9 | NULL | 748746 | Using where; Using index |
+------+-------------+----------+-------+---------------------------------------+---------------------------------------+---------+------+--------+--------------------------+
1 row in set (0.26 sec)
I also ran SHOW TABLE STATUS; the important bits from there are:
Data_length: 45 092 864
Index_length: 7 393 280
Data_free: 827 326 464
After a few hours of being in that degraded state, it recovered on its own (without having any data being inserted), and those stats became:
Data_length: 4 096
Index_length: 4 096
Data_free: 879 755 264
I’m not a DBA and I’m not sure what exactly these numbers mean, but I thought they were worth sharing here. My guess is that the index is full of garbage (7 MB of index on an empty table can’t contain anything but garbage), and that the index is not getting cleaned up or rebuilt when it should be.
2. If we don’t DELETE the tasks, but only UPDATE them to make them non-runnable, it doesn’t help with the performance
I suspected that fragmentation (supposedly caused by deleting a lot of data from the table) might have something to do with it, so I changed the logic to UPDATE the task after completion (setting it to the lowest priority, 5, which is never runnable) instead of DELETEing the row. It didn’t change anything about the performance; the patterns remained the same.
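For clarity, that “soft delete” looks roughly like this (priority 5 is simply never included in the pick query’s IN list, so such rows are never picked; the id is a placeholder):

-- Instead of DELETE FROM tasks WHERE id = 42, mark the finished task
-- with a priority that the pick query never matches.
UPDATE tasks SET priority = 5 WHERE id = 42;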
3. Doing a bunch of INSERTs helps
As another observation: when we not only run tasks (and therefore DELETE them from the table) but also INSERT new tasks at about the same rate, then even though MariaDB still switches to poor performance periodically, it recovers very quickly. Check it out:
- So while we’re inserting tasks, it doesn't go into the degraded state for too long;
- Instead, there is a distinctive pattern: seemingly every 4 mins, the performance drops, and then immediately recovers.
- When we stopped creating tasks, it worked for a few minutes, and again fell into a degraded state for an extended period of time.
I also confirmed this by waiting for it to degrade and then inserting a bunch of tasks, to check whether that would help it recover. It helped every time. Empirically, inserting 10K or 20K tasks is not enough, but 30K usually is; the performance recovers right after inserting enough tasks:
I guess this tells me that frequent INSERTs cause MariaDB to do something useful with the index, while frequent DELETEs and UPDATEs unfortunately do not.
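In case it’s useful: a quick way to generate that many rows for this kind of experiment is MariaDB’s Sequence engine, assuming it’s available (the task name here is made up):

-- Bulk-create 30K dummy tasks using the seq_1_to_30000 virtual table.
INSERT INTO tasks (name, priority)
SELECT CONCAT('dummy_task_', seq), 4
FROM seq_1_to_30000;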
4. Rebuilding an index helps
If we just build a new index and drop the old one:
ALTER TABLE tasks ADD INDEX tasks_next_attempt_id_idx2 (`priority`,`next_attempt`,`id`), ALGORITHM=INPLACE, LOCK=NONE;
ALTER TABLE tasks DROP INDEX tasks_next_attempt_id_idx;
It fixes the issue immediately, and it doesn’t lock the table while building the index. This is actually the most viable workaround we’re considering at the moment: if we can’t find a way to make it happen automatically, we can add some app logic like “if the next-task queries become slower than 100ms and stay that way for 5 seconds, recreate the index”, or even the lazy way: “rebuild the index every few minutes”. It sucks, and doing ALTER TABLE from app code is generally a weird design, but practically it’s still much better than sitting in the degraded state for hours.
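If we end up going the lazy route, one way to do it outside the app code might be a scheduled event along these lines (a sketch only: it assumes the event scheduler is enabled and relies on RENAME INDEX, which MariaDB supports since 10.5, so the index gets its original name back after every rebuild):

SET GLOBAL event_scheduler = ON;

DELIMITER //
CREATE EVENT rebuild_tasks_index
ON SCHEDULE EVERY 10 MINUTE
DO
BEGIN
  -- Build a fresh copy of the index, drop the bloated one, then rename
  -- the copy back so the next run finds the same index name again.
  ALTER TABLE tasks ADD INDEX tasks_next_attempt_id_idx_new (`priority`,`next_attempt`,`id`), ALGORITHM=INPLACE, LOCK=NONE;
  ALTER TABLE tasks DROP INDEX tasks_next_attempt_id_idx;
  ALTER TABLE tasks RENAME INDEX tasks_next_attempt_id_idx_new TO tasks_next_attempt_id_idx;
END //
DELIMITER ;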
Question
I’d appreciate any thoughts and feedback you have based on the explanation above, but the questions I'm actually asking are:
1. What is happening?
I’m really curious to learn the MariaDB implementation details that would explain this behavior, since it doesn’t make sense to me and I wasn’t able to figure it out on my own.
2. How to make it work fast without having to rebuild the index manually?
As mentioned before, doing a bunch of INSERTs (30K or more) usually helps it recover from the degraded state, so it looks like MariaDB does some maintenance on the index behind the scenes, and this maintenance doesn’t happen for DELETEs or UPDATEs. I wonder if there is some knob in MariaDB we can tune to make that index maintenance more aggressive, without having to rebuild the index manually.





