
I have 4 tables (I've specified main columns for each):

  1. monitors (id)
  2. monitor_node (monitor_id, node_id, average_response_time)
  3. monitor_checks (id, monitor_id, node_id, response_time)
  4. nodes (id)

Their relations:

1 monitor - N checks

N monitor - N nodes

1 check - 1 monitor & 1 node


So, I'm logging a lot of checks for each monitor. monitor_checks will have millions of rows (up to approximately 500 million).

When I insert another bunch of checks (~1k checks), I need to recalculate the average response time per monitor-node pair (table monitor_node, column average_response_time).

I'm pretty sure I'm doing it the wrong way and I need a faster solution. What I do now: after inserting 1k rows into monitor_checks, I calculate the average response time for each monitor (grouping by monitor_id). Then, based on this info, I build an array and use insert ... on duplicate key update on the monitor_node table in order to do a bulk update. Apart from the average response time I calculate some other attributes, which go along with the response time in this bulk update.

Making the array with info and executing insert ... on duplicate key update is fast enough.
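
Roughly, that bulk update looks like this (a simplified sketch; the other attributes are left out and the values come from the averages computed in application code):

insert into monitor_node (monitor_id, node_id, average_response_time)
values (1, 2, 0.25), (2, 2, 0.65)
on duplicate key update
    average_response_time = values(average_response_time)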

The slow query is:

select monitor_id, avg(response_time) as avg_response_time 
from `monitor_checks` 
where `node_id` = 2 
group by `monitor_id`

which takes roughly 10-20 seconds for ~4 million rows.

I also realised that it's not necessary to base the average on all checks; the last 50-100 rows per monitor would do. But I couldn't figure out how to do this efficiently. I found a fancy solution for grouping checks for each monitor and limiting each group to some number of rows (How to get latest 2 records of each group), but it took way too long as well; see the sketch below.
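
Applied to my tables, that approach looks roughly like this (MySQL 8.0 window functions, keeping the last 100 checks per monitor for one node):

select monitor_id, avg(response_time) as avg_response_time
from (
    select monitor_id, response_time,
           row_number() over (partition by monitor_id order by id desc) as rn
    from monitor_checks
    where node_id = 2
) t
where rn <= 100
group by monitor_id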

So the question is: how to quickly recalculate average response times in monitor_node when inserting ~1k rows in monitor_checks table?

DB Fiddle: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=bd95afc030361bf1d87f8bc5c3935c2f

Final desired result:

monitor_id  node_id  average_response_time
1           1        0.30
1           2        0.25
2           1        0.55
2           2        0.65
Victor

2 Answers


I would add an array column to monitor_node containing the most recent 100 response times. When adding new response times, drop those exceeding 100. That way no extra data has to be read when a new average is calculated.
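
A minimal sketch of that column, assuming it is a JSON array named rec_resp_times (existing rows get an empty array so there is something to insert into):

alter table monitor_node add column rec_resp_times json;
update monitor_node set rec_resp_times = json_array();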

As an image may say more than a thousand words, I've updated the DB Fiddle. A trigger does the job, but that logic could be incorporated in the job doing the inserts.

create trigger monitor_checks_air
  after insert on monitor_checks for each row
  begin
  update monitor_node mn
    -- prepend the new response time, then keep only the first 100 entries
    set mn.rec_resp_times =
          json_extract(
             json_array_insert(mn.rec_resp_times,'$[0]',new.response_time),
             '$[0 to 99]'),
        -- recompute the average over the updated list (SET assignments are applied left to right)
        mn.average_response_time = (
            select sum(jt.rt)
              from json_table(mn.rec_resp_times,
                             '$[*]' columns( rt double path '$[0]')) as jt
          )
          / json_length(mn.rec_resp_times)
    where mn.monitor_id = new.monitor_id
      and mn.node_id    = new.node_id;
  end;
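
For example, inserting a single check should refresh that pair's average right away (assuming the id column is auto-generated):

insert into monitor_checks (monitor_id, node_id, response_time)
values (1, 2, 0.31);

select average_response_time, rec_resp_times
  from monitor_node
 where monitor_id = 1 and node_id = 2;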

The fastest query is the one you don't execute.

Gerard H. Pille

Build and maintain a summary table. It would probably be something like the following. I am assuming you care about the average response for each hour for each monitor-node pair?

CREATE TABLE response_summary (
    hr DATETIME NOT NULL,  -- the hourly bucket (datetime truncated to the hour)
    monitor_id ...,
    node_id ...,
    ct INT UNSIGNED NOT NULL,
    sum_resp FLOAT NOT NULL,
    PRIMARY KEY(node_id, monitor_id, hr),
    INDEX(monitor_id),
    INDEX(hr)
) ENGINE=InnoDB;

When you have a batch of 1K readings, there are multiple ways you can pour it into that table. I'll code it assuming you do most of the work in MySQL. (Some of this could be done in the client instead.)

CREATE TEMPORARY TABLE batch (dt, monitor_id, node_id, response);  -- dt = raw check datetime; same types as the base tables
INSERT INTO batch VALUES (,,,), (,,,), (,,,), ..., (,,,);  -- the 1K readings
INSERT INTO response_summary
        (hr, monitor_id, node_id,
         ct, sum_resp)
    SELECT CONCAT(LEFT(dt, 13), ':00:00') AS hr,
           monitor_id, node_id,
           COUNT(*), SUM(response)
        FROM batch
        GROUP BY node_id, monitor_id, hr
    ON DUPLICATE KEY UPDATE
        ct = ct + VALUES(ct),
        sum_resp = sum_resp + VALUES(sum_resp);

For more discussion:

The order of the columns in the suggested PK is optimal for your where node_id = 2 group by monitor_id.

To get the average for a day:

    SUM(sum_resp) / SUM(ct)

This works to find the average over however many samples you have. If you need, for example, the day's average to be the average of 24 averages, then something else should be done.
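
As an illustration (a hypothetical query against the table above), a day's averages for one node could be pulled like this:

SELECT monitor_id, DATE(hr) AS day,
       SUM(sum_resp) / SUM(ct) AS avg_response_time
    FROM response_summary
    WHERE node_id = 2
    GROUP BY monitor_id, DATE(hr);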

Rick James