Why does MySQL choose this execution plan?

Question

I have two queries,

select some_other_column 
from `table` 
order by primary_index_column asc 
limit 4000000, 10;

and

select some_other_column 
from `table` 
order by secondary_index_column asc 
limit 4000000, 10;

Both return 10 rows; the first takes 2.74 seconds, and the second takes 7.07 seconds. some_other_column is not a part of any index. primary_index_column is the primary key column; secondary_index_column has a b-tree index and a cardinality of 200 (according to MySQL).

Here are the explain results:

mysql> explain select some_other_column from `table` order by primary_index_column limit 4000000, 10;
+----+-------------+---------+-------+---------------+---------+---------+------+---------+-------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows    | Extra |
+----+-------------+---------+-------+---------------+---------+---------+------+---------+-------+
|  1 | SIMPLE      | table   | index | NULL          | PRIMARY | 4       | NULL | 4000010 |       |
+----+-------------+---------+-------+---------------+---------+---------+------+---------+-------+

mysql> explain select some_other_column from `table` order by secondary_index_column limit 4000000, 10;
+----+-------------+---------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
+----+-------------+---------+------+---------------+------+---------+------+---------+----------------+
|  1 | SIMPLE      | table   | ALL  | NULL          | NULL | NULL    | NULL | 4642945 | Using filesort |
+----+-------------+---------+------+---------------+------+---------+------+---------+----------------+

Why does MySQL choose that specific execution plan for the second query? I don't understand why it can use the index for the first query but not for the second query.

score 7 · Answer 1 · edited Jun 15 '20 at 09:05

An indexed column in InnoDB always has an additional key to the gen_clust_index (aka Clustered Index) attached to it. This would be traversed by the first query to get to row 4000000 in the order of the index. Since it is the only column being requested, access to the table is unnecessary.

The second query has to collect the non-indexed column from the table along with the indexed column into a temp table. The temp table is then sorted before presenting the non-indexed column as SELECT output.

Notice another contrast

The table count is 4636881
The EXPLAIN plan for the first query traversed 4000010 indexed_column keys. No need to read the last 636871 keys.
The EXPLAIN plan for the second query traversed 4636881 rows ordered by indexed_column. For every row picking up the non-indexed column out of the table, the indexed column (already sorted by the index) is looked up and comes along for the ride. The tmp table gets ordered by the indexed column, and mysqld then dismisses the first 4000000 rows, leaving 10 rows. All that interaction between table and index just for 10 rows is the bottleneck.

COMMON THINGS

In both instances, the query specfies the number of rows to traverse. Since the number of rows in the table is 4636881, we should readily expect a full scan. The contrast becomes apparent when the MySQL Query Optimizer decides where to perform the full scan.

The first query is referencing an indexed column only in the SELECT list and the WHERE clause. The MySQL Query Optimizer chooses to perform a full index scan with no need to contact the table since everything needed is in the index.
The second query is referencing an indexed column in the WHERE clause. However, it must reach out the table to retrieve the corresponding non-indexed column. The MySQL Query Optmizer was tipped off by the query that it must not use the index becasue of the number of rows it was expected to read. As a rule of thumb for any RDBMS, if more than 5% of a table has to be read to fulfill a query, the MySQL Query Optimizer will just throw the index 'under the bus' and do a full table scan.

Doing the math, here is what MySQL Query optimizer computes:

5% of 4636881 is 231844
The second query is commanded to read 4000000 rows, which is way higher than 231844
MySQL Query Optimizer realizes there would be too much interaction between table (because of the non-indexed column) and index (because of the indexed column) to get the needed data. It decides to read the table only (because both the indexed and non-indexed columns are residing together in the table) rather than bounce back and forth between them.

In my honest opinion, with the row count of the table, the present indexes the table has, and the number of rows prescribed by the query, the MySQL Query Optimizer made the correct decision.

RECOMMENDATION

Create this index

ALTER TABLE `table` ADD INDEX mynewndx (indexed_column,some_other_column);

and your second query will never touch the table again going forward. The MySQL Query Optimizer will behave quite differently when it sees this new index.

score 0 · Answer 2 · answered Mar 26 '12 at 15:53

According to the MySQL documentation on optimizing order by queries,

In some cases, MySQL cannot use indexes to resolve the ORDER BY [...] these cases include the following:

[...]

The type of table index used does not store rows in order. For example, this is true for a HASH index in a MEMORY table.

My understanding of InnoDB is that the rows are stored in order according to the primary key. Thus, they're out of order for any secondary indexes.

Why does MySQL choose this execution plan?

2 Answers2

Notice another contrast

COMMON THINGS

RECOMMENDATION

Linked