Computer architects can’t agree on a way to find the average.
For years, academic practitioners in this field have been arguing about the appropriate way to summarize the average performance of their designs.1 That is: given \(n\) workloads, if system \(A\) outperforms system \(B\) by \(S_1, S_2, \ldots, S_n\) on each, how much faster should you say system \(A\) is, on average? I think this argument is kind of pointless, though.
For the most part, people tend to use the arithmetic mean \(\left(\frac{1}{n} \sum_{i=1}^n S_i\right)\) or the geometric mean \(\left(\sqrt[n]{\prod_{i=1}^n S_i}\right)\). Hennessy and Patterson’s famous Computer Architecture: A Quantitative Approach advocates for the latter:
Using the geometric mean ensures two important properties:
- The geometric mean of the ratios is the same as the ratio of the geometric means.
- The ratio of the geometric means is equal to the geometric mean of the performance ratios, which implies that the choice of the reference computer is irrelevant.
Therefore the motivations to use the geometric mean are substantial, especially when we use performance ratios to make comparisons.
Other people disagree with H&P’s reasoning, but I think it’s just about as good as it gets.
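To make that reference-machine property concrete, here’s a minimal sketch in Python (the runtimes are made-up numbers for three hypothetical machines) showing that the speedup of \(A\) over \(B\) comes out the same whether you compare them directly or measure both against some third reference machine \(C\):

```python
from math import prod  # Python 3.8+

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical per-workload runtimes (seconds) on three machines.
times_A = [10.0, 40.0, 25.0]
times_B = [20.0, 30.0, 50.0]
times_C = [15.0, 60.0, 35.0]

# Speedup of A over B, computed two ways:
# 1) directly, as the geomean of the per-workload speedups of A over B;
# 2) indirectly, as the ratio of the geomean speedups of A and B over C.
direct = geomean([b / a for a, b in zip(times_A, times_B)])
indirect = (geomean([c / a for a, c in zip(times_A, times_C)])
            / geomean([c / b for b, c in zip(times_B, times_C)]))

print(direct, indirect)  # both ~1.442: the choice of reference machine drops out
```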
All Means are Bad
Recently (well, over a year ago now), a paper appeared in IEEE Computer Architecture Letters with the title R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead. Its author, Eeckhout, argues that geomean is bad, and people should instead be using what he calls the Equal-Work Harmonic Speedup or the Equal-Time Harmonic Speedup. Eeckhout also presented this work at HPCA 2025 as a part of the Best of Computer Architecture Letters session.
The main thing that Eeckhout seems to dislike about the geometric mean is that it “lacks physical meaning.” He claims his alternatives are better precisely because they do have physical meaning. One of the alternatives he proposes is the Equal-Time Harmonic Speedup (\(ETS\)), which is just the harmonic mean of the speedups observed on each workload.
\[ETS = \frac{n}{\sum_{i=1}^n \frac{1}{S_i}}\]

Why use the Harmonic Mean instead of the Geometric Mean? Well, if every workload takes the same amount of time to run on the baseline system, the ETS is equal to the total speedup observed when running each of those workloads sequentially.2 Eeckhout says that this physical meaning provides us with a compelling reason to use something like this over the geometric mean.
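As a quick sanity check on that interpretation, here’s a minimal sketch in Python (with made-up speedups and an arbitrary 100-second baseline): when every workload takes the same time on the baseline, the ETS matches the end-to-end speedup of running the whole suite back to back.

```python
def ets(speedups):
    """Equal-Time Harmonic Speedup: the harmonic mean of per-workload speedups."""
    return len(speedups) / sum(1.0 / s for s in speedups)

baseline_time = 100.0        # every workload takes 100 s on the baseline machine
speedups = [2.0, 4.0, 4.0]   # hypothetical per-workload speedups on the new machine

total_baseline = baseline_time * len(speedups)        # 300 s
total_new = sum(baseline_time / s for s in speedups)  # 50 + 25 + 25 = 100 s

print(ets(speedups))                 # 3.0
print(total_baseline / total_new)    # 3.0 -- the same number
```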
But this physical meaning doesn’t matter! When I report a score for SPEC, I don’t really care about how long it takes to run every single workload in that benchmark in a sequential fashion! It’s not like I expect to run a sudoku solver (exchange2), then immediately compile gcc, and then perform video compression (x264). I mean, I might run all of these at some point, but certainly not for the exact same amount of time.3 Although the harmonic mean has a clear physical meaning, it’s not one that really matters for many benchmark suites.
Admittedly, I don’t really care about the geometric mean of these workloads either. I agree with Eeckhout when he says the geomean doesn’t have a clear physical meaning. But it comes down to a choice between an average that doesn’t have a clear physical meaning and one whose physical meaning isn’t relevant in most situations.
So is there actually a good number to report?
Unless you actually know the precise mix of workloads being run in a real system, any number you report is going to fail to accurately predict the effect of your design on that system. Benchmarks like SPEC are useful insofar as they show general performance patterns, but no matter how you cut it, a single number is always going to fail to provide a perfect comparison between machines when using a general-purpose benchmark suite.
If you do know the particular applications that you care about, and you know their relative importance, then by all means, take their weighted average and you’ll be set.
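For example, if (and this is an assumption on my part) your weights are the fraction of total baseline time each workload accounts for in your real system, then the weighted harmonic mean of the per-workload speedups is exactly the end-to-end speedup for that mix. A minimal sketch, with hypothetical numbers:

```python
def weighted_speedup(speedups, weights):
    # Weighted harmonic mean of speedups, with weights = fraction of total
    # baseline time spent in each workload (assumed to sum to 1).
    assert abs(sum(weights) - 1.0) < 1e-9
    return 1.0 / sum(w / s for w, s in zip(weights, speedups))

speedups = [2.0, 0.9, 1.5]   # hypothetical speedups on the new machine
weights  = [0.1, 0.2, 0.7]   # the third workload dominates this machine's time

print(weighted_speedup(speedups, weights))  # ~1.35
```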
Otherwise, I suggest just using the geomean. It’s easy to compare, and everyone else is familiar with it. Use another mean at your own risk: they’ll all just be wrong in different ways.
Why are people still talking about this?
I really don’t know. Seems like this argument should be over by now.
One of my former mentors once told me that he never looks at an academic paper’s evaluation section. If the idea presented in the rest of the paper sounds reasonable, maybe he’ll try to apply its innovations to the production design. If the idea sounds ridiculous, or addresses a problem he’s already solved in another way, then it’s of no use, regardless of how much speedup the authors might claim.4
There are other problems that contribute to industry’s perception of academic evaluations. But I share this anecdote just to say: academic computer architects should spend more time coming up with new, inherently interesting ideas, and less time talking about which method of averaging is the least meaningless.
Footnotes
1. This argument goes back at least as far as 1986, with the paper How not to lie with statistics: The correct way to summarize benchmark results. Eeckhout provides a good account of this history in his paper. ↩

2. Interestingly, even though we can assign a physical meaning to ETS, it can still produce non-intuitive results. For example, if machine \(A\) runs workload 1 twice as fast as machine \(B\) (\(S_1=2\)), but workload 2 only half as fast (\(S_2 = 0.5\)), then the ETS of \(A\) over \(B\) is \(\frac{2}{\frac{1}{2} + \frac{1}{0.5}} = \frac{2}{2.5} = 0.8\) (a slowdown overall). But by symmetry, the ETS of \(B\) over \(A\) is also 0.8. How can both machines be “slower” than the other? Because unlike the geometric mean, the reference machine does matter for ETS! We’re assigning different weights to the workloads depending on our starting point! ↩

3. Personally, my machines have probably spent much more time running x264 than compiling gcc or solving sudoku. Thanks YouTube. ↩

4. There are other problems that contribute to this perception of academic evaluations, beyond the relatively unimportant issue of averaging workload results. In particular, academic microarchitectural simulators are often inaccurate, and baseline systems are often poor comparison points. ↩