
I have an InnoDB MySQL table of this structure:

column     type
service    varchar(16)
url        varchar(255)
datetime   datetime

It contains logs of requests made to one of several services. The table currently holds ~3 million rows and gains ~3 million more each month (but I will delete old data if need be).

I am trying to generate a report, and get the number of requests made to each service within a date range.

Here is my current query:

  SELECT service, COUNT(*) as lastMonthCount 
    FROM request_logs 
   WHERE datetime > '2021-02-16 10:51:05' 
GROUP BY service

This works, but is painfully slow (~28 seconds).

It outputs this:

service   lastMonthCount
API A     3056752
API B     38451

I have indexes on datetime and service. I can see they are of type BTREE.

How can I radically speed up this query, or restructure my table/indexes so I can achieve the same use case another way?

3 Answers


For a relatively simple query like that there isn't a lot you can do to optimise the query text itself: there is no simpler way of asking for that data.

You can probably significantly reduce the number of pages touched as the query runs by having a composite index on (datetime, service). That way the data the query needs to group by is already available in the pages it has read to perform the date filter. This will increase the amount of data on disk, as the index will be larger, and slow down writes a touch for the same reason, so you would probably want to replace the existing index on datetime with the new composite index rather than just adding it.
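A minimal sketch of that change, assuming the existing index is named idx_datetime (check SHOW INDEX FROM request_logs for the actual name):

  -- Replace the single-column datetime index with a covering composite
  -- index; "idx_datetime" is an assumed name, adjust to your schema.
  ALTER TABLE request_logs
      DROP INDEX idx_datetime,
      ADD INDEX idx_datetime_service (datetime, service);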

A little more detail on why this will be faster:

With the index on datetime alone, the engine still has to read the base table's data pages to get the service column for each matching row. As well as the extra page reads that come from referencing the other structure, this can mean many more page reads, because those pages hold fewer rows each due to the larger data per row (they include the url column and any other attributes you add, or might already have, for the services).

If that still isn't fast enough...

You may need to look at some form of caching for the counts. There are several options here:

  • a materialised view may work, if they are intelligent enough in MySQL
  • a little denormalising: a counts table updated by trigger whenever the main table is written to (essentially a more "manual" version of a materialised view; see the sketch after this list)
  • something in the application layer, if you can live with the result not necessarily being 100% up-to-date every time (by far the fastest option if done right, but obviously with that key disadvantage)
  • if the data is big enough that the query is reading from storage rather than RAM each time and you don't want to use one or more of the three options above: throw hardware at the problem and buy oodles of RAM! (this is usually not a good solution, though sometimes it can be)
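A minimal sketch of the counts-table option, assuming daily buckets are precise enough for the report (table, trigger, and column names here are illustrative, not from this answer):

  -- Summary table keyed by service and day.
  CREATE TABLE request_counts_daily (
      service  varchar(16)  NOT NULL,
      day      date         NOT NULL,
      cnt      int unsigned NOT NULL DEFAULT 0,
      PRIMARY KEY (service, day)
  );

  -- Keep the summary in step with every insert into the log table.
  DELIMITER $$
  CREATE TRIGGER request_logs_after_insert
  AFTER INSERT ON request_logs
  FOR EACH ROW
  BEGIN
      INSERT INTO request_counts_daily (service, day, cnt)
      VALUES (NEW.service, DATE(NEW.datetime), 1)
      ON DUPLICATE KEY UPDATE cnt = cnt + 1;
  END$$
  DELIMITER ;

  -- The report then scans a handful of summary rows instead of millions:
  SELECT service, SUM(cnt) AS lastMonthCount
    FROM request_counts_daily
   WHERE day >= '2021-02-16'
   GROUP BY service;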
David Spillett

Unfortunately, MySQL is sometimes a little more limited in its options compared to other modern RDBMSs. One common way to solve the problem you're facing in other systems is to use something called a Materialized View. While not officially a feature of MySQL, you can replicate the behavior with a bit of coding, as demonstrated in Speeding Up MySQL Using Materialized Views and MATERIALIZED VIEWS WITH MYSQL.

You may also find some useful information from this DBA.StackExchange answer which gives some alternatives such as creating summary tables. Of course that means you'll need to maintain the data in two places, but you can automate this with Triggers.

Finally, as I mentioned in the comments, if your table currently has two separate indexes, one on the datetime field and one on the service field, only one of those indexes can be used at a time to serve your query. The optimal index for this query would likely be a single index on (datetime, service), so that after the datetime field filters down the results, the remaining rows already have the service field included in the index, ready to go for the grouping.
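One way to verify (a sketch, not from the original answer): with the composite index in place, EXPLAIN should report the new index and ideally show "Using index" in the Extra column, meaning the query is answered from the index alone without touching the base table:

  EXPLAIN
  SELECT service, COUNT(*) AS lastMonthCount
    FROM request_logs
   WHERE datetime > '2021-02-16 10:51:05'
   GROUP BY service;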

Also, to answer your question in the comments: VARCHAR(16) isn't a terrible data type to index, it's just roughly 4x as big as an INT, for example. I doubt you'll see game-changing performance by changing the data type, but you can experiment with switching to an INT and having a reference table with the actual service names stored in it (with a foreign key relationship from your main table). You could also try the ENUM data type, but I'm not personally familiar with it and have heard general recommendations against it.
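A sketch of that reference-table idea (all names here are hypothetical):

  -- Lookup table holding one row per service.
  CREATE TABLE services (
      id   tinyint unsigned NOT NULL AUTO_INCREMENT,
      name varchar(16)      NOT NULL,
      PRIMARY KEY (id),
      UNIQUE KEY uq_services_name (name)
  );

  -- The log table then stores the small integer key instead of the name.
  -- On a populated table you would add the column, backfill it from the
  -- existing service names, and only then add the foreign key.
  ALTER TABLE request_logs
      ADD COLUMN service_id tinyint unsigned NOT NULL,
      ADD CONSTRAINT fk_request_logs_service
          FOREIGN KEY (service_id) REFERENCES services (id);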

J.D.

Building on the great suggestions from @J.D. and @David Spillett, here's what I did. I managed to improve the query from ~28 seconds to ~2 seconds, which was enough for now. Later I will look at either deleting data older than I will query for, or running this query on a cron job that stores the results in another table for realtime retrieval instead.

  • I removed the index on service and the index on datetime, adding a composite index on (datetime, service) instead. The ordering of the columns in that index matters. This took the query from ~28s to ~14s.
  • I replaced the service column (varchar(16)) with an unsigned tinyint and mapped it to a string in my app. This was possible for me since the number of services was known to me and would change very rarely. This took the query from ~14s to ~2s (see the sketch after this list).
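A sketch of both steps combined, under the assumption that the old datetime index is named idx_datetime and that the service values match the question's sample output ('API A' and 'API B'):

  -- 1. Add the tinyint column and backfill it; the number-to-name
  --    mapping lives in the application.
  ALTER TABLE request_logs
      ADD COLUMN service_id tinyint unsigned NOT NULL DEFAULT 0;

  UPDATE request_logs
     SET service_id = CASE service WHEN 'API A' THEN 1
                                   WHEN 'API B' THEN 2 END;

  -- 2. Drop the varchar column (this also removes its index) and
  --    replace the old datetime index with the composite one.
  ALTER TABLE request_logs DROP COLUMN service;
  ALTER TABLE request_logs
      DROP INDEX idx_datetime,
      ADD INDEX idx_datetime_service (datetime, service_id);

  -- The report query then groups by the integer column:
  SELECT service_id, COUNT(*) AS lastMonthCount
    FROM request_logs
   WHERE datetime > '2021-02-16 10:51:05'
   GROUP BY service_id;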
S..