
What kind of overhead should I expect when replacing a single complex query with multiple simple queries?

My goal is to improve the readability and portability of all SQL code, so I will be favoring simple constructs and replacing database-specific extensions with ANSI SQL whenever possible.

For example:

  • Imagine the client is invoking Dynamic SQL (as opposed to stored procedures)
  • Scenario 1: Client invokes: INSERT INTO employee SELECT name FROM user
  • Scenario 2: Client invokes:
Statement getNames = connection.createStatement();
try (ResultSet rs = getNames.executeQuery("SELECT name FROM user"))
{
  while (rs.next())
  {
    String name = rs.getString(1);
    // one new statement and one INSERT per row returned by the SELECT
    try (PreparedStatement prepared = connection.prepareStatement("INSERT INTO employee (name) VALUES (?)"))
    {
      prepared.setString(1, name);
      prepared.executeUpdate();
    }
  }
}

Scenario 1 is not a complex query, but for the sake of argument let's pretend it is. Scenario 2 achieves the same result using multiple (simpler) queries. What kind of overhead can I expect for Scenario 2 versus Scenario 1? Is it something I should worry about or is it negligible?

UPDATE: https://stackoverflow.com/a/14408631/14731 makes a good point. Decomposing queries by hand hard-codes an execution plan instead of allowing the database's optimizer to choose. That being said, it's still not clear whether the overhead is meaningful.

Gili

2 Answers


As a general rule of thumb, you should give the database as much information as possible about the task that you're implementing. How does this apply to your scenarios?

Scenario 1 (INSERT .. SELECT)

The database knows that you're about to bulk-move a whole set of data from one table (or derived table) to another. It can optimise the execution given:

  • The cardinality of the bulk operation
  • The various constraints / triggers that will be invoked
  • The amount of storage that needs to be reserved in the new table
  • etc etc.
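
For contrast with the Scenario 2 code in the question, this is roughly all the client has to send in Scenario 1: a single statement, so the database sees the whole bulk operation in one round trip. A minimal JDBC sketch (the SQL text and the `connection` object are the ones from the question):

// Scenario 1 through JDBC: one statement, one round trip; the database can plan
// the whole bulk move at once. `connection` is assumed to be an open Connection,
// as in the question's example.
try (Statement stmt = connection.createStatement())
{
  int rowsInserted = stmt.executeUpdate("INSERT INTO employee SELECT name FROM user");
  System.out.println(rowsInserted + " rows inserted");
}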

Scenario 2 (SELECT, and the N x INSERT)

The database has no clue about the various prepared statements that you are going to send after the SELECT. Even if it were smart enough to collect long-term statistics and heuristics about what's going to come after your SELECT, it would be unwise to assume anything about the subsequent load. So, effectively, this scenario is most often much, much worse than the other.

A few remarks, though:

  • Even if you have a good application-design reason to partition INSERTs like that, you should probably batch-send them to the database (see the sketch after this list)
  • One good reason to partition INSERTs is that you gain more fine-grained control over transaction length. It is generally not a good idea to insert millions of records in a single long-running transaction with logging turned on, so either turn off logging or commit after every N inserts
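
A minimal sketch of what that could look like in JDBC, continuing the question's example (the batch size of 1000 and the column list are illustrative choices, not requirements; `connection` is assumed to be an open java.sql.Connection):

// Rework of the question's Scenario 2: one PreparedStatement reused for every row,
// rows queued with addBatch() and flushed in chunks, committing after every N inserts
// to keep transactions short. The ResultSet is opened with HOLD_CURSORS_OVER_COMMIT
// so that the intermediate commits do not close it.
connection.setAutoCommit(false);
try (Statement getNames = connection.createStatement(
       ResultSet.TYPE_FORWARD_ONLY,
       ResultSet.CONCUR_READ_ONLY,
       ResultSet.HOLD_CURSORS_OVER_COMMIT);
     ResultSet rs = getNames.executeQuery("SELECT name FROM user");
     PreparedStatement prepared =
       connection.prepareStatement("INSERT INTO employee (name) VALUES (?)"))
{
  int count = 0;
  while (rs.next())
  {
    prepared.setString(1, rs.getString(1));
    prepared.addBatch();
    if (++count % 1000 == 0)
    {
      prepared.executeBatch(); // one round trip for the whole chunk
      connection.commit();     // keep the transaction short
    }
  }
  prepared.executeBatch();     // flush any remaining rows
  connection.commit();
}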

Conclusion

The above are remarks regarding your two specific scenarios. Real-world scenarios aren't that simple, but you've also stated that in the question. The point I'm trying to make is that in many cases, you should let the database perform bulk data operations, because that's what it's very good at.

Lukas Eder

For your specific example, Scenario 1 keeps all the data on the server. Scenario 2 requires the data to be packaged, sent over the wire to the client, where it must be buffered (and, perhaps, spilled to disk), then un-buffered, re-packaged and sent back to the server, where it is finally processed. This network time adds up. Do this often enough with large enough rowsets and you will see latency mount. Do it one row at a time (as your example shows) and you'll be ruing your decision for the rest of your natural life. Submitting a batch is usually faster than submitting statements individually.

Scenario 1 is a single statement with an implicit transaction to keep the data consistent. Scenario 2 will require an explicit transaction to achieve the same consistency. The locks that transaction takes will have to be held for the duration of the round trips to the client, which will cause blocking.
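
To make that concrete, a decomposed version needs something along these lines to get the atomicity that Scenario 1 gets for free (a sketch only, using the same `connection` as the question):

// Sketch: an explicit transaction around the decomposed SELECT + INSERTs so the result
// is as consistent as the single INSERT ... SELECT. Locks acquired inside the
// transaction are held until commit() or rollback(), i.e. across every client round trip.
connection.setAutoCommit(false);
try
{
  // ... run the SELECT and the individual INSERTs from the question here ...
  connection.commit();
}
catch (SQLException e)
{
  connection.rollback();
  throw e;
}
finally
{
  connection.setAutoCommit(true);
}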

If you split a complex query into simpler statements which are all submitted as a batch there are gains and losses. For example, splitting

select
    <lots of columns>
from <several tables>
where <really complex predicate>

into

select
    <some columns>
into #T1
from dbo.Table1 as a
where <whatever>;

select
    <other stuff>
from #T1
inner join dbo.Table2 as b
    on <join predicate>
where <etc.>;

will have the obvious cost of the write to and read from #T1. There may be a benefit, however, if you can index #T1 in a way which helps the second query, particularly if complex manipulation was performed on Table1's values.
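
As an illustration only (SQL Server assumed, since #T1 is a local temporary table; the table and column names are made up), the index can be created between the two statements on the same connection:

// Sketch: because #T1 lives for the duration of the session, it can be indexed between
// the two statements, which a single monolithic query cannot express.
// dbo.Table1/dbo.Table2 and the id/value/name columns are hypothetical.
try (Statement stmt = connection.createStatement())
{
  stmt.execute("SELECT a.id, a.value INTO #T1 FROM dbo.Table1 AS a WHERE a.value > 0;");
  stmt.execute("CREATE INDEX IX_T1_id ON #T1 (id);"); // helps the join in the next query
  try (ResultSet rs = stmt.executeQuery(
      "SELECT b.name FROM #T1 AS t INNER JOIN dbo.Table2 AS b ON b.id = t.id;"))
  {
    while (rs.next())
    {
      System.out.println(rs.getString(1));
    }
  }
}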

Stored procedures can suffer from parameter sniffing. This can sometimes be rectified by recompiling the procedure at each execution, but recompiles can be expensive. By splitting a complex query into several simpler statements you can put the recompile hint on just the statement(s) that benefit from it.

An optimizer will not search indefinitely for the best execution plan. It will eventually time out and run with the best plan found at that point in time. The work the optimizer must do grows exponentially with the complexity of the query. Having simpler queries may allow the optimizer to find the best plan for each individual query, giving a lower-cost execution overall.

Michael Green