
I just made my first MySQL database and table, but was surprised by the slow performance of a simple select statement. My table has 400 million rows, and my select statement returns about 100,000 rows, but it took 14 minutes! I'm not sure if my setup is wrong, or if my expectations for MySQL were too high. What would be an expected time for a well-designed table to return 100,000 rows from a 400 million row table? This is my setup:

CREATE TABLE CALLS (
quote_date DATE,
quote_time TIME,
expiration DATE,
delta decimal(4,3),
mid decimal(8,4)
);

CREATE INDEX idx_quote_date ON CALLS (quote_date);
CREATE INDEX idx_quote_time ON CALLS (quote_time);
CREATE INDEX idx_expiration ON CALLS (expiration);
CREATE INDEX idx_delta ON CALLS (delta);
CREATE INDEX covering_index ON CALLS (quote_date, quote_time, expiration, delta);

My table is really just for reading data with select statements, so I create the indexes after loading all the data.

My select is:

select * from CALLS where DELTA BETWEEN 0.4 and 0.6;

Trevor D

4 Answers


Q: What does the query plan show?

A: Use EXPLAIN ANALYZE to see.
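
For example, a minimal check (EXPLAIN ANALYZE is available in MySQL 8.0.18 and later):

EXPLAIN ANALYZE
SELECT * FROM CALLS WHERE delta BETWEEN 0.4 AND 0.6;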

My guess is it's scanning a large portion of the table, if not the entire table, to locate your data. This is because of the following issues:

  • Your query is poorly written because it uses SELECT * which is an anti-pattern.
  • Your indexes aren't properly defined for the query you're testing.
  • 400 million rows is not trivial, neither is scanning for 100,000 of them.

Ways to improve your test case:

  • Don't use SELECT * and instead list out the columns you want to select, explicitly.
  • Define a composite index that includes all of those columns. For example, if your query is SELECT delta, expiration, mid FROM CALLS WHERE delta BETWEEN 0.4 AND 0.6;, then CREATE INDEX idx_delta_expiration_mid ON CALLS (delta, expiration, mid); makes the index applicable, ideally with a seek that efficiently locates the data you're interested in, as opposed to scanning the table.
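
Put together, a minimal sketch of that recommendation (the column list is an assumption about which fields you actually need):

-- Composite index with delta leading, covering the selected columns:
CREATE INDEX idx_delta_expiration_mid ON CALLS (delta, expiration, mid);

-- Query that lists only the needed columns, so the index above covers it:
SELECT delta, expiration, mid
FROM CALLS
WHERE delta BETWEEN 0.4 AND 0.6;
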
J.D.

That query will probably use INDEX(delta), which you have. It will be performed as follows (pseudo-code):

1. Reach into the B+Tree of INDEX(delta) to find 0.4 (fast)
2. Scan forward until 0.6.
2a. For each index entry, reach over into the main BTree
       using the PRIMARY KEY to find `*` (as in `SELECT *`)
2b. Send all 5 columns to the user.

The back and forth in "2a" is rather costly, especially if the table is bigger than innodb_buffer_pool_size.
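
A quick way to compare the two, using standard MySQL system variables and information_schema (the numbers are machine-specific):

-- Buffer pool size, in GB:
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;

-- Approximate size of the table plus its indexes, in GB:
SELECT (data_length + index_length) / 1024 / 1024 / 1024 AS calls_gb
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'CALLS';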

It is naughty not to specify a PRIMARY KEY; one was provided for you. (This detail does not impact performance.)

Your "covering_index" is not covering (for this query) because mid is missing.

The order of columns in a composite index matters. (But not in this one example.)

If you don't need all columns, then having a composite index starting with delta (so it would be used) and containing all the needed columns (covering) would be faster. (As J.D. points out.)
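
For reference, a covering index for the original SELECT * query would have to start with delta and include all five columns - a sketch (note that an index containing every column ends up roughly as large as the table itself):

CREATE INDEX idx_delta_covering
  ON CALLS (delta, quote_date, quote_time, expiration, mid);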

More on indexing: Index Cookbook

Rick James

General remarks:

Where is your PRIMARY KEY?

It looks to me as if this data is immutable? By that I mean a given call is a matter of historical record rather than being subject to multiple updates?

From here, we get (caveat - I don't know for sure that this is what we're dealing with):

A call option is a contract between a buyer and a seller to purchase a certain stock at a certain price up until a defined expiration date. The buyer of a call has the right, not the obligation, to exercise the call and purchase the stocks. On the other hand, the seller of the call has the obligation and not the right to deliver the stock if assigned by the buyer.

You have no stock identifier - this puzzles me - surely you should have a field like identifier CHAR(4) - or whatever the length of your system's identifiers is - if it's variable, use VARCHAR(n) - storage = n + 1 byte.

You can store DATE and TIME together as a TIMESTAMP, which can hold values between '1970-01-01 00:00:01' (UTC) and '2038-01-19 03:14:07' (UTC). If you don't require more than 1s precision, this will save space - 4 bytes vs. 6.
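
If you migrate the existing rows, the two columns can be combined on the way in; a minimal sketch against the existing CALLS table (MySQL's TIMESTAMP(date, time) adds the time to the date):

SELECT TIMESTAMP(quote_date, quote_time) AS quote_ts
FROM CALLS
LIMIT 10;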

Also, re your PK, from here, you don't require a surrogate PRIMARY KEY (normally INTEGER) if your data isn't changing much, which as I surmised above, should be the case for what is, essentially, a logging table.

1st possibility:

So, I would design it like this (all of the code below is available on the fiddle here):

CREATE TABLE calls 
(
  stock_id    CHAR(4)      NOT NULL,
  quote_ts    TIMESTAMP    NOT NULL,
  expiry_date DATE         NOT NULL, 
  delta       DECIMAL(4,3) NOT NULL,
  mid         DECIMAL(8,4) NOT NULL,

  PRIMARY KEY (stock_id, quote_ts, expiry_date, delta, mid)
);

You can have an index on delta as follows:

CREATE INDEX delta_ix ON calls (delta);

Notice that if you do this, then you get an INDEX SCAN when running your query (using EXPLAIN ANALYZE):

EXPLAIN
-> Filter: ((calls.delta >= 4.000) and (calls.delta <= 6.000))  (cost=0.55 rows=3) (actual time=0.0802..0.0847 rows=3 loops=1)
    -> Covering index scan on calls using delta_ix  (cost=0.55 rows=3) (actual time=0.0758..0.0794 rows=3 loops=1)

2nd possibility:

Now, if expiration is determined by the number of days from the quote_ts, you could store it as SMALLINT (or even TINYINT UNSIGNED if it's never > 255 days) and use the DATE_ADD() function to define expiry_date as a generated column (VIRTUAL) - no space used, the calculation is done on the fly. See the fiddle - test with your own system.

CREATE TABLE calls_bis
(
  stock_id    CHAR(4)      NOT NULL,
  quote_ts    TIMESTAMP    NOT NULL,
  expiry_days SMALLINT     NOT NULL,
  delta       DECIMAL(4,3) NOT NULL,
  mid         DECIMAL(8,4) NOT NULL,

  expiry_date DATE AS (DATE_ADD(DATE(quote_ts), INTERVAL expiry_days DAY)) VIRTUAL,

  PRIMARY KEY (stock_id, quote_ts, expiry_days, delta, mid)
);

Then create the same index on delta and rerun the query - INDEX SCAN again. Try a few different queries and check the sizes of the tables (+/- indexes).

There's a certain amount of overhead attached to each record - records are stored in pages (again, overhead) as are INDEXes - more overhead. So, to get a handle on your real data usage/record, you'd have to load 1M records and perform the queries at the bottom of the fiddle.
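
A sketch of that size check (information_schema figures are approximate, but fine for comparing the two layouts):

SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name IN ('calls', 'calls_bis');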

A few points to note:

  • I'm not sure why you're retrieving 100k records from a 400M row table - are you aggregating over something?

  • I don't know how the delta and mid fields are produced/calculated/derived. If they can be GENERATED, there is potential to reduce the size of your table - and hence the time taken to scan it?

  • It's best to have NOT NULLs on as many fields as possible - the more information you provide to the optimizer, the better the chances that it will do a good job. Maybe this isn't possible for the mid field which can't be entered until the call is half-way through its term? (don't know much about trading).

  • Your two indexes (CREATE INDEX covering_index ON CALLS (quote_date, quote_time, expiration, delta); and CREATE INDEX idx_quote_date ON CALLS (quote_date);) can be replaced by the PRIMARY KEY - the first field of a PK doesn't need an extra INDEX as it's the leading field anyway.

  • Instead of the BETWEEN operator, it's better to use the comparison operators (<, <=, >=, >) - these are unambiguous - is BETWEEN inclusive or exclusive? Most servers will just transform this operator into these other operators anyway - see here for the confusion! (A rewrite is sketched after this list.)

  • With the index delta_ix in place, it is used by a query that requests records based on values of delta - see the EXPLAIN ANALYZE in the fiddle.

  • The order of the fields in the PK should depend on your most frequent query or queries. Test for this.
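
As mentioned above, the original filter rewritten with explicit comparison operators (BETWEEN is inclusive of both endpoints, so the two forms are equivalent):

-- BETWEEN form (inclusive of 0.4 and 0.6):
SELECT * FROM CALLS WHERE delta BETWEEN 0.4 AND 0.6;

-- Explicit form - the bounds are unambiguous:
SELECT * FROM CALLS WHERE delta >= 0.4 AND delta <= 0.6;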

If you'd like to elaborate on how the fields are produced, then we might have more opportunities to reduce record size and speed up your queries? p.s. welcome to dba.se!

Vérace

Assuming column "delta" is randomly distributed, your query is selecting 100k random rows out of the 400M row table.

There is an index on delta, but it will still have to read 100k randomly distributed rows from the table.

14min/100k = 8.4 milliseconds per row, which is really close to the random access time of a 7200rpm hard drive... hmm...

Therefore I guess you are running this on a 7200rpm hard drive. It's going to take a while, there's no way around it. The drive head has to move to reach the data.

The only solutions are

  • Either use a fast NVME SSD with high random IOPS, or put enough RAM in the box to hold the whole table in cache.

  • Or use a covering index on (delta, the other columns); that will turn the random access into sequential access, which is much faster. But it will use a ton of space and take a while to build.

Another solution is to use a database specialized in this kind of stuff:

select sum(value) from mqtt_float where value between 1.095 and 1.1;

┌─────────sum(value)─┐
│ 1830854.5649999972 │
└────────────────────┘

1 row in set. Elapsed: 2.669 sec. Processed 2.13 billion rows, 17.04 GB (798.31 million rows/s., 6.39 GB/s.) Peak memory usage: 2.80 MiB.

There's no index on the "value" column, because it's MQTT data, so the index is on (mqtt_topic, timestamp), which is also the table order and partitioning key. Thus it reads the whole table. ClickHouse compresses this 2.1 billion row table (38 GB) by a factor of about 11, so it uses only 3.4 GB, which the NVMe SSD reads in about 1 second, and the query completes in 2.7 s on a cheap desktop PC. This does not include the time to transport a result set to the client; if you returned the matching rows instead of an aggregate, that would be about 1.6M rows, roughly 50 MB, or half a second on gigabit Ethernet.
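
For context, a rough sketch of the kind of ClickHouse table being described (the column names and the partitioning expression are assumptions, not the actual schema):

CREATE TABLE mqtt_float
(
    mqtt_topic String,
    timestamp  DateTime,
    value      Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)   -- partitioning expression is an assumption
ORDER BY (mqtt_topic, timestamp);  -- sorting key doubles as the primary index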

bobflux