This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Remove statistically unsound benchmarks


On 19/04/13 22:46, Ondřej Bílka wrote:
> Hi,
> I found that the testcase measures time. However, its measurement is
> flawed in several ways, described below.
> 
> Problem number one: minimum performance does not correspond to average
> performance.
> You can have implementations A and B where A has minimum 3 and average
> 100 while B has minimum 10 and average 20 (even in ideal conditions).
> Picking the variant with the smaller minimum is obviously flawed
> because of this bias.
> 
> Setting that problem aside, the second problem is lack of robustness.
> See http://en.wikipedia.org/wiki/Sample_maximum_and_minimum
> In our conditions the minimum will pick up more random noise than signal.

Do these arguments actually apply to what is being tested here?

I find it difficult (although I do not think entirely impossible) to
generate a case where method A gives a lower minimum but higher mean
than method B.  I suppose if method A had substantially higher variance
in (e.g.) cache misses...  But that does not invalidate the minimum; it
rather says it could be supplemented with a mean (or better, a median, to
avoid outlier effects).
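
To make that concrete, here is a contrived sketch (made-up numbers, not
measurements of any real implementation) where A ends up with the
smaller minimum but the larger mean, roughly the cache-miss scenario
above:

/* Synthetic timings only: A occasionally hits a fast path (3 "cycles")
   but usually misses the cache (100); B is consistently 10..30.  */
#include <stdio.h>
#include <stdlib.h>

#define SAMPLES 1000

static double timing_a (void) { return (rand () % 20 == 0) ? 3.0 : 100.0; }
static double timing_b (void) { return 10.0 + rand () % 21; }

static void
summarize (const char *name, double (*f) (void))
{
  double min = 1e9, sum = 0.0;
  for (int i = 0; i < SAMPLES; i++)
    {
      double t = f ();
      if (t < min)
        min = t;
      sum += t;
    }
  printf ("%s: min = %.1f  mean = %.1f\n", name, min, sum / SAMPLES);
}

int
main (void)
{
  srand (1);
  summarize ("A", timing_a);  /* smaller minimum, larger mean */
  summarize ("B", timing_b);
  return 0;
}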

The properties of the minimum of a sample from a distribution you
mention all assume a continuous variate - i.e. the distribution has
tails.  In the cases here, there is a hard limit for the minimum.  So
the variance of the minimum is not as unstable.  In fact, going to the
wikipedia page you linked:

"However, with clean data or in theoretical settings, they can sometimes
prove very good estimators, particularly for platykurtic distributions,
where for small data sets the mid-range is the most efficient estimator."

Yes, I know that quote is about the mid-range rather than the minimum
exactly, but it still applies somewhat...
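
As a rough illustration of why the hard lower limit matters, take a
synthetic model (assumed, not measured) of a 50-cycle floor plus small
jitter and the occasional huge outlier such as a context switch: across
repeated runs the sample minimum sits on the floor, while the mean jumps
around with the outliers.

/* Synthetic model only: hard floor of 50, small jitter, rare big
   outliers.  The minimum is stable across runs; the mean is not.  */
#include <stdio.h>
#include <stdlib.h>

#define RUNS    5
#define SAMPLES 200

static double
one_timing (void)
{
  double t = 50.0 + rand () % 5;
  if (rand () % 50 == 0)
    t += 10000.0;   /* e.g. a context switch */
  return t;
}

int
main (void)
{
  srand (2);
  for (int r = 0; r < RUNS; r++)
    {
      double min = 1e9, sum = 0.0;
      for (int i = 0; i < SAMPLES; i++)
        {
          double t = one_timing ();
          if (t < min)
            min = t;
          sum += t;
        }
      printf ("run %d: min = %.1f  mean = %.1f\n", r, min, sum / SAMPLES);
    }
  return 0;
}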

> A mean, http://en.wikipedia.org/wiki/Sample_mean_and_sample_covariance,
> should be used instead.
> However, you may need to first filter outliers (context switches, minor
> page faults...).
> 
> Another problem is that measurements are done in a tight loop. Even if
> the measurements were done correctly, repeatedly calling a function
> with the same arguments is also problematic.
> 
> It estimates performance when a user repeatedly calls the function with
> the same arguments, and little else.
> 
> In experiments you must replicate real-world conditions as much as
> possible. By neglecting factors you introduce more and more errors into
> the estimates.
> 
> Here, randomized tests capture more factors and will almost always be
> closer to real-world performance. If you need to choose whether to
> trust a randomized or an idealized benchmark, the randomized one is the
> better bet.
> Why should we keep, or read, this data when we know it is useless?

I do not agree that the generated data is useless.  It shows the speed
for a specific case.  If enough cases are tested, it will cover the
range of calls in realistic usage.  So the key is to pick a good set of
parameters that reflect all paths through the function calls.  This also
requires knowledge of which paths are more important to optimise due to
usage, but you are going to need that to generate "randomised" tests anyway.
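
As a sketch of what that could look like (the case table, the iteration
count and the use of clock_gettime are illustrative assumptions here,
not the actual benchtests harness), a parameterised benchmark would walk
a small table of sizes and alignments chosen to exercise the different
code paths and report both the minimum and the mean for each case:

/* Illustrative only: time memcpy over a small table of (size, align)
   cases and report min and mean per case.  Not the glibc benchtests
   harness; timer overhead dominates for the smallest sizes.  */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERS 10000

struct test_case { size_t size; size_t align; };

static const struct test_case cases[] =
  { { 16, 0 }, { 16, 1 }, { 256, 0 }, { 256, 3 }, { 4096, 0 } };

static double
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int
main (void)
{
  static char src[8192], dst[8192];

  for (size_t c = 0; c < sizeof cases / sizeof cases[0]; c++)
    {
      size_t size = cases[c].size, align = cases[c].align;
      double min = 1e18, sum = 0.0;

      for (int i = 0; i < ITERS; i++)
        {
          double start = now_ns ();
          memcpy (dst + align, src + align, size);
          /* Keep the copy from being optimised away.  */
          __asm__ volatile ("" : : "r" (dst) : "memory");
          double t = now_ns () - start;
          if (t < min)
            min = t;
          sum += t;
        }
      printf ("size %5zu align %zu: min %.0f ns  mean %.1f ns\n",
              size, align, min, sum / ITERS);
    }
  return 0;
}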

> Another argument against tight loops is the following:
> Assume you picked a set (say 5) of benchmarks. When you pick benchmarks
> that cover cases encountered in the real world and are as close to them
> as possible, maximizing performance on them will likely improve
> performance in the real world.
> However, if you pick benchmarks that are distant and do not really
> cover the entire use-case space, then maximizing results on those
> benchmarks will result in random change for the real world.

As above, picking good parameters to test is the key.

> Given these flaws, any results that use them are unscientific and we
> should remove them.

I disagree.  Yes, better benchmarks could probably be made, but the
current ones are not wrong in and of themselves and are still better
than nothing.  And I have seen substantial real-world improvements being
made on the basis of their output so far.

Allan


