
MySQL docs state that the time required for inserts has these approximate proportions:

  • Connecting: (3)

  • Sending query to server: (2)

  • Parsing query: (2)

  • Inserting row: (1 × size of row)

  • Inserting indexes: (1 × number of indexes)

  • Closing: (1)

In other words, if we use stored procedures to do inserts into a table without indexes, the deciding factor for the time required would simply be:

  • Inserting row: (1 × size of row)

So I tested with a stored procedure that makes 2 million inserts:

create table t (i int) engine=innodb;

delimiter $
create procedure f()
begin
    while @a <= 2000000 do
        insert t select null;
        set @a = @a + 1;
    end while;
end$
delimiter ;

set @a = 1;
begin; call f; commit; drop procedure f;
select count(*) from t; drop table t;

The query call f; takes 29.44 seconds.

Rolando suggests (below) using a prepared statement, so I updated the stored procedure to use PREPARE:

create table t (i int) engine=innodb;

delimiter $
create procedure f()
begin
    while @a <= 2000000 do
        execute e;                              # changed insert statement
        set @a = @a + 1;
    end while;
end$
delimiter ;

set @a = 1;
prepare e from 'insert t select null';          # added prepare statement
begin; call f; commit; drop procedure f;
deallocate prepare e;                           # added optional deallocate prepare statement
select count(*) from t; drop table t;

However, the query call f; still takes 30.27 seconds.

It seems like there's no way to beat raw "begin + loop insert + commit" (unless we resort to file loading using load data infile).
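
For completeness, a load data infile version would look roughly like the sketch below; the file path /tmp/t.csv and the one-value-per-line format are assumptions for illustration, not something I benchmarked here.

create table t (i int) engine=innodb;
# Hypothetical bulk load: /tmp/t.csv would contain one integer per line.
# (Server-side LOAD DATA is also subject to the secure_file_priv setting.)
load data infile '/tmp/t.csv'
into table t
lines terminated by '\n'
(i);
select count(*) from t; drop table t;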

However, if we insert using select joins, it only takes 6.28 seconds:

create table t (i int) engine=innodb;

insert t select null
from (
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`10`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`100`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`1000`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`10k`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`100k`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`1m`
join(
    select 1 union select 2
)`2m`;
select count(*) from t; drop table t;

I'm using MySQL (though I suspect this isn't a MySQL-specific issue).

Why is "select join insert" so much faster than "loop insert"?

What's the explanation for this oddity?

Pacerier

2 Answers


QUERY #1

Each time you do an INSERT, you are doing this under the hood:

SET @sql = 'insert t select null';
PREPARE s FROM @sql;
EXECUTE s;
DEALLOCATE PREPARE s;

Within the stored procedure, you fully parse, compile, execute and deallocate structures for the prepared SQL statement 2 million times.

QUERY #2

Running the single insert t select null from ( ... ) statement, you fully parse, compile, execute and deallocate structures for the prepared SQL statement ONCE!!!

The execution of the Cartesian joins is really a self-contained, atomic operation. Though it does perform iterations (O(n²) running time) to form the result table, it is not bogged down by creating and destroying query-parsing structures on each iteration, as you are in the stored procedure (despite its O(n) running time). The execution plan of the single INSERT may be uglier to behold, but it will surely run faster.

The inline subqueries are most likely materialized in memory. The resulting join could be a temporary MEMORY table or a temporary MyISAM table. Either way, the formation of the result set is a single operation.
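
If you want to see this for yourself, an EXPLAIN of a scaled-down version of the query (a sketch using tiny two-row derived tables rather than the full ten-row ones) will typically show the inline subqueries materialized as DERIVED tables:

explain
select null
from ( select 1 union select 2 ) `a`
join ( select 1 union select 2 ) `b`;
# The unions cannot be merged into the outer query, so they usually appear
# as <derivedN> materialized tables in the plan.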

CONCLUSION

Inserting 2 million rows with a single INSERT is always faster than doing an INSERT of a single row 2 million times. In principle, this would be even more evident if you connected, inserted and disconnected for each row: connecting once beats connecting 2 million times (see my post How costly is opening and closing of a DB connection?).

RolandoMySQLDBA

I have a simpler way to put it.

"Batched INSERTs and LOAD DATA run 10 times as fast as single-row INSERTs."

By "batching", I mean INSERT INTO t (a,b) VALUES (1,2), (2,3), .... The optimal number is between 100 and 1000 rows per INSERT. Beyond that, you get into diminishing returns.

Here are some issues that impact performance, especially for Batched Inserts:

  • innodb_flush_log_at_trx_commit = 1 is secure, but costs roughly 10 ms (on typical spinning media) per INSERT, unless the inserts are part of a transaction (see the sketch after this list).
  • When replicating a large INSERT, you clog up replication, so you might want to do only 100 rows at a time.
  • Huge columns (BLOB/TEXT) can lead to hitting some limit (such as max_allowed_packet) when batching.
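
Regarding the first point, a rough sketch of the usual mitigation (which the question itself already uses via begin ... commit): wrap the single-row INSERTs in one explicit transaction so the log flush happens once at COMMIT instead of once per autocommitted statement. The table t(i) here is just the one from the question.

# Hedged sketch: with innodb_flush_log_at_trx_commit = 1, grouping statements
# in an explicit transaction means the redo log is flushed once at COMMIT,
# not after every individual autocommitted INSERT.
start transaction;
insert into t (i) values (1);
insert into t (i) values (2);
# ... many more single-row inserts ...
commit;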

Edit: As for the title --

  • The "select join insert" is a single query
  • The "loop insert" is lots of 1-row inserts

The overhead of talking to the server, parsing the query, etc., is significant; hence the latter is roughly 5 times as slow. This is in line with the "10" I mentioned above.

Rick James