It's a general question rather than a db engine specific one. In popular db engines, we see the query optimization process taking statistics into consideration. I'm trying to understand what types of statistics the optimization function is using and on what basis these statistics are selected and used.
Why is it important to carefully determine and select statistics for the query optimization process?
1 Answer
The main type of statistic is the cardinality, i.e. a count of the number of rows that have a certain value for a particular attribute.
To take a simple example, say I have a table with one column that is one character long:
create table T(c1 char(1));
This table contains the values A, A, A, B, C, F, K, X, X, Z. The cardinality will then be a count of the occurrences of each value:
A 3; B 1; C 1; F 1; K 1; X 2; Z 1
Now when the optimizer sees the query select * from T where c1 = 'A' it knows to expect three rows returned, because the statistics tell it that's how many exist for that value.
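In essence the statistic is just a per-value frequency table. A minimal sketch of that idea (the table name, values, and helper function here are illustrative, not any engine's actual implementation):

```python
from collections import Counter

# The ten values stored in column c1 of table T from the example above.
values = ["A", "A", "A", "B", "C", "F", "K", "X", "X", "Z"]

# The "statistics": a count of rows per value.
stats = Counter(values)

def estimate_rows(value):
    """Estimated row count for the predicate c1 = value."""
    return stats.get(value, 0)

print(estimate_rows("A"))  # 3 rows expected for c1 = 'A'
```

With these statistics in hand, the optimizer never has to touch the table to know roughly how many rows a predicate will match.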
In general tables can be large with many, many different key values. Storing complete statistics as I have shown would be impractical. What actually happens is that a statistically believable summary of the actual table is created. (There are many ways to do this; here's one approach; this answer works through the nitty-gritty.) This summary is compact enough to be traversed quickly during optimization to give good-enough results.
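One common form such a summary takes is a histogram: the sorted values are grouped into a handful of buckets, and within a bucket the optimizer assumes values are uniformly distributed. This is a simplified sketch of an equi-height histogram, not any particular engine's format:

```python
import math

def build_histogram(sorted_values, num_buckets):
    """Summarize a sorted column into buckets of roughly equal row counts."""
    n = len(sorted_values)
    per_bucket = math.ceil(n / num_buckets)
    buckets = []
    for i in range(0, n, per_bucket):
        chunk = sorted_values[i:i + per_bucket]
        buckets.append({
            "low": chunk[0],           # lowest value in the bucket
            "high": chunk[-1],         # highest value in the bucket
            "rows": len(chunk),        # rows covered by the bucket
            "distinct": len(set(chunk)),  # distinct values in the bucket
        })
    return buckets

def estimate_equal(buckets, value):
    """Estimate rows for `col = value`, assuming uniformity within a bucket."""
    for b in buckets:
        if b["low"] <= value <= b["high"]:
            return b["rows"] / b["distinct"]
    return 0

hist = build_histogram(sorted(["A","A","A","B","C","F","K","X","X","Z"]), 3)
print(estimate_equal(hist, "A"))  # estimate, not the exact count of 3
```

Note the estimate for 'A' is no longer exact: the summary trades precision for compactness, which is exactly the bargain real optimizers make.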
The statistics are used by the optimizer to choose between different physical implementations of the query. My single-table query above does not illustrate this well so let's say we have the query
select *
from TableA
inner join TableB
on TableB.CustId = TableA.CustId
where TableA.SomeColumn = 42;
There are many possible implementations. It could apply the WHERE then do the join, or it could do the join then apply the WHERE. The physical join implementation could be a nested loop, merge or hash join. To read the data from disk it could do table scans, index look-ups or some combination. How is the optimizer to choose between these alternatives?
Most modern DBMSs use a cost-based approach, many based on the Cascades framework. Each alternative is given a cost (in arbitrary units that only make sense within the optimizer itself) and the lowest-cost option is chosen for execution. These costs are calculated based (amongst other things) on the number of rows expected, which, of course, comes from the statistics.
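To make this concrete, here is a toy cost comparison for two of the alternatives above. All the numbers (row counts, selectivity, per-look-up cost) are invented for illustration, and the cost formulas are drastically simplified compared to a real optimizer:

```python
# Assumed cardinality estimates, as an optimizer would read from statistics.
rows_a = 1_000_000        # estimated rows in TableA
rows_b = 50_000           # estimated rows in TableB
selectivity = 0.0001      # estimated fraction of TableA with SomeColumn = 42
filtered_a = rows_a * selectivity  # ~100 rows survive the WHERE

# Alternative 1: scan TableA applying the WHERE, then for each surviving row
# do an index look-up into TableB (assume ~3 page reads per look-up).
cost_filter_then_nested_loop = rows_a + filtered_a * 3

# Alternative 2: hash-join the full tables (build + probe), then filter.
cost_join_then_filter = rows_b + rows_a + (rows_a + rows_b)

# Pick the lowest-cost plan, exactly as a cost-based optimizer would.
plans = {
    "filter then nested loop": cost_filter_then_nested_loop,
    "hash join then filter": cost_join_then_filter,
}
best = min(plans, key=plans.get)
print(best)  # the cheap plan: filter first, join the survivors
```

The point is not the particular formulas but the mechanism: the row estimates from the statistics feed directly into the cost of each alternative, so bad statistics mean a bad plan choice.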