29

I've quite frequently seen benchmarks where the tester discarded the highest and the lowest time out of N runs.

Discarding the highest time I understand; it's probably high because some other process was suddenly demanding more CPU.

But doesn't the lowest time indicate the best possible performance when the benchmark was running full tilt without interruptions?

pepoluan

5 Answers

31

The lowest timing might indeed represent the "true" timing without outside interference, or it might be a measurement error. E.g. the boosting behaviour of a CPU might speed up the first run in a larger benchmarking suite, or a less congested network might speed up a network-dependent benchmark at certain times of day. Similarly, the highest timing might represent the true worst case, or non-representative interference. E.g. a background service might have started during the benchmark, or an SMR hard drive's cache might be flushed during an IO-heavy benchmark.

Such interference indicates a flawed experimental design that fails to control for these influences, but it's not always possible or economical to design the perfect experiment. So we have to deal with the messy real-world data that we have.

Statistics like the mean (average) of a set of values are very sensitive to outliers. It is thus common to use a trimmed mean, where we remove outliers in the hope of getting closer to the "true" mean. Various methods for determining outliers exist, with the simplest approach being to remove the top/bottom p%, for some value p. Another option is to use techniques like bootstrapping that let us estimate how reliable the estimate is: instead of removing the top/bottom observations, we repeatedly resample the data at random and redo the calculation, to see how much the result varies.
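
As a rough sketch of both ideas in Python (the timings, the 10% trim level, and the number of bootstrap resamples are all arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up benchmark timings in seconds: mostly around 1.0 s, plus two outliers.
timings = np.array([1.02, 0.98, 1.01, 0.99, 1.03, 1.00, 0.97, 1.64, 0.45, 1.01])

plain_mean = timings.mean()
# Trimmed mean: drop the lowest and highest 10% of observations before averaging.
trimmed = stats.trim_mean(timings, proportiontocut=0.10)

# Bootstrap: resample with replacement many times to see how stable the mean estimate is.
boot_means = [rng.choice(timings, size=len(timings), replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean={plain_mean:.3f}  trimmed mean={trimmed:.3f}")
print(f"95% bootstrap interval for the mean: [{low:.3f}, {high:.3f}]")
```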

However, it is not generally necessary to calculate the mean run time when benchmarking. For comparing typical behaviour, we can use measures like the median or other quantiles. Especially when measuring latencies, quantiles like the 95th percentile are often used (meaning: 95% of measurements were this fast or faster).
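
For instance, with NumPy (the latency values below are made up):

```python
import numpy as np

# Made-up latency measurements in milliseconds.
latencies = np.array([12.1, 11.8, 12.3, 11.9, 35.0, 12.0, 12.2, 11.7, 12.4, 90.5])

print("median:", np.median(latencies))        # typical behaviour
print("p95:", np.percentile(latencies, 95))   # 95% of measurements are at or below this value
```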

It is also unnecessary to calculate the mean when trying to determine whether one program is significantly faster than another. Instead of parametric statistical tests, which require such parameters to be estimated from the sample, it is possible to use non-parametric tests. E.g. the Mann–Whitney rank-sum test only considers the order of values from the two samples, not their actual values. While this is more flexible and more rigorous, it does give up some statistical power (you might fail to detect a significant difference even though one exists).
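
A minimal sketch using SciPy's implementation of that test, with invented run times for two hypothetical program versions:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical run times (seconds) for two versions of a program.
version_a = np.array([1.02, 0.99, 1.01, 1.03, 0.98, 1.00, 1.04, 0.97])
version_b = np.array([0.95, 0.93, 0.96, 0.94, 0.97, 0.92, 0.95, 0.96])

# The test only uses the ranks of the pooled values, not the values themselves.
stat, p_value = mannwhitneyu(version_a, version_b, alternative="two-sided")
print(f"U={stat}, p={p_value:.4f}")
if p_value < 0.05:
    print("The difference in run times is statistically significant.")
```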

amon
28

Outliers indicate unusual situations. Outliers are interesting in science, because they give you something to investigate, but they are useless in benchmarks.

If you have 10000 benchmark runs, and 9999 of them took one hour but 1 of them took 1 second, then it is useless to tell your customers that it will take 1 second when there is a 99.99% chance that it will take one hour.

The problem is unexplained outliers, and it is usually simpler to discard them than to investigate and find an explanation, especially since the explanation will often be some freak occurrence that does not apply to real-world usage of the program, which makes it unlikely that your users will ever experience that performance.

Jörg W Mittag
8

I'd add that before doing anything else I would eyeball the data, that is, plot its distribution. I'd do that with most datasets, for that matter. My experience as a retired statto is that skipping this step in favour of some mechanistic strategy can lead to missing the obvious.

There might be two peaks, with a whole clutch of unusually slow running times. If that is the case, I would then think it important to understand the cause, because the code might end up being used in an environment where that cause is the norm.
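
One way to do that eyeballing in Python, here with synthetic timings that deliberately contain a second, slower cluster:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic timings: most runs around 1.0 s, plus a smaller cluster of slow runs
# near 1.5 s (e.g. runs that missed a cache or hit a slower code path).
timings = np.concatenate([rng.normal(1.0, 0.05, 400), rng.normal(1.5, 0.05, 60)])

plt.hist(timings, bins=50)
plt.xlabel("run time (s)")
plt.ylabel("number of runs")
plt.title("Distribution of benchmark timings")
plt.show()
```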

1

Another point: benchmarks are generally "averaged" using the geometric mean. Compared to the arithmetic mean, the geometric mean intrinsically gives more weight to the lowest values in the list.

On top of the domain-specific reasons for distrusting the lowest value, the use of the geometric mean intrinsically gives it extra weight.
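
A small made-up illustration of that effect: a single low value drags the geometric mean down much further than the arithmetic mean.

```python
import numpy as np
from scipy.stats import gmean

# Hypothetical speedup ratios from a benchmark suite; one suspiciously low value.
ratios = np.array([1.10, 1.05, 1.08, 1.12, 0.40])

print("arithmetic mean:", ratios.mean())   # ~0.95
print("geometric mean:", gmean(ratios))    # ~0.89: the single low value weighs more heavily
```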

wchlm
0

Good benchmark design depends on what one is measuring, and who is doing the measuring. Throwing out low numbers is most likely to remove cases which benefit disproportionately from caching.