[Shootout-list] ackermann, values for ''n''

John Skaller skaller@users.sourceforge.net
Wed, 08 Jun 2005 17:30:33 +1000


On Wed, 2005-06-08 at 07:56 +0200, Bengt Kleberg wrote:
> On 2005-06-08 06:54, John Skaller wrote:
> ...deleted
> > The Python code I recently posted avoids
> > this problem in two distinct ways:
> > 
> > (a) it automatically seeks an appropriate time,
> > allowing you to test run the benchmarking in 
> > a smaller amount of time
> 
> would this mean that if a language is slow, then the automation would 
> not use large ''n'' even for a fast language? or have i misunderstood?

The process seeks the largest 'n' value for each language
independently, and also the smallest. However, data
is recorded for ALL values of n up to the per-language
maximum, not just for the maximum itself.
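
A rough sketch of the search, in the spirit of the Python code I
posted (the timeout value, the helper names, and the subprocess
details here are my own illustration, not that code):

    import subprocess, time

    TIMEOUT = 120.0        # assumed per-run limit in seconds (illustrative)

    def run_once(cmd, n):
        """Run one benchmark program with argument n; return the elapsed
        wall-clock time, or None if it timed out or failed."""
        start = time.time()
        try:
            subprocess.run(cmd + [str(n)], timeout=TIMEOUT, check=True,
                           stdout=subprocess.DEVNULL)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            return None
        return time.time() - start

    def find_max_n(cmd, candidates, results):
        """Walk increasing values of n for one translator, recording a
        timing for EVERY n that completes, and stop at the first failure."""
        max_n = None
        for n in candidates:
            elapsed = run_once(cmd, n)
            if elapsed is None:
                break                    # this n is too big for this translator
            results.setdefault(n, []).append(elapsed)   # data kept for all n
            max_n = n
        return max_n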

This means you could mechanically calculate the largest
'n' for which no language timed out, or, perhaps more
sensibly, for which 80% of languages didn't time out.
Actually, you could just require that 30% of languages
didn't time out, and include only those in the tables/graphs
-- excluding the slowest 70% of languages.
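
As a sketch of that calculation -- assuming, purely as an
illustration, a dict mapping each language to the set of n values
it completed without timing out:

    def largest_common_n(completed, fraction=0.8):
        """Largest n that at least `fraction` of the languages completed
        without timing out (None if no n qualifies)."""
        all_n = sorted({n for ns in completed.values() for n in ns},
                       reverse=True)
        needed = fraction * len(completed)
        for n in all_n:
            if sum(1 for ns in completed.values() if n in ns) >= needed:
                return n
        return None

    # With fraction=0.3 you would instead keep only the languages that
    # reached the returned n, dropping the slowest 70% from the tables.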

Tests for n between the minimum and maximum *for a particular
translator* are run with equal frequency; tests above and
below these points are run only once (and tests right on
the border will probably be run about half as often
as the frequent ones).

[You could actually RUN the code I published!!]

[failing tests ..]

> another way of solving this problem would be to have a deterministic 
> test process, but only re-run the missing 10%.

yes.

> i am assuming that you mean that ''a test process'' is all the languages 
> for a test, or all the tests for a language, etc. if you only mean a 
> single language and a single test, then my statement is not applicable.

I mean neither. The current code I have runs tests randomly:
by 'test' I mean a single execution of a program with 
a single value of n.

The test, language translator, and value of n are all chosen
randomly.
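
So each iteration draws one (benchmark, translator, n) triple at
random, roughly like this (the names and data layout are
illustrative, not the actual code):

    import random

    def pick_test(benchmarks, translators, n_values):
        """Choose one benchmark, one language translator, and one value
        of n, each uniformly at random."""
        bench = random.choice(benchmarks)
        lang  = random.choice(translators)
        n     = random.choice(n_values[bench][lang])  # per-translator range
        return bench, lang, n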

Therefore, by accident, on a short run some cases won't
be run at all and some will be run twice. However, the
results accumulate forever, so this isn't a problem:
after a number of weeks there will be plenty of data
to get a reliable average for each translator/benchmark/n-value
triplet. It will also be interesting to calculate the
standard deviation and other statistics... the deviation
can even be displayed on a graph with gnuplot using
'error bars' instead of points.
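
A sketch of reducing the accumulated timings for one
translator/benchmark pair to 'n mean stddev' lines that gnuplot can
draw with errorbars (the file layout and names are my assumption):

    import math

    def summarise(times):
        """Mean and sample standard deviation of the recorded times."""
        mean = sum(times) / len(times)
        if len(times) < 2:
            return mean, 0.0
        var = sum((t - mean) ** 2 for t in times) / (len(times) - 1)
        return mean, math.sqrt(var)

    def write_datfile(path, samples):
        """samples: {n: [elapsed, ...]} for one translator/benchmark pair.
        Each output line is 'n mean stddev'; gnuplot can then use
            plot 'somefile.dat' using 1:2:3 with errorbars"""
        with open(path, 'w') as f:
            for n in sorted(samples):
                mean, sd = summarise(samples[n])
                f.write("%d %f %f\n" % (n, mean, sd))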

If you want to see only 'recent language implementations'
then you can filter/weight the tests based on test date,
at some risk of losing reliability due to the randomness.
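
For example, one possible weighting -- my own choice of scheme, not
anything in the posted code -- is an exponential decay on the age of
each sample:

    import time

    def recency_weight(sample_time, half_life_days=30.0, now=None):
        """Weight a sample so one half_life_days old counts half as much
        as a fresh one (exponential decay; the half-life is arbitrary)."""
        if now is None:
            now = time.time()
        age_days = (now - sample_time) / 86400.0
        return 0.5 ** (age_days / half_life_days)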

There are statistical measures of significance which will
actually tell you how statistically reliable the results
are -- no opinion is required, the values can be calculated.
It is an open issue whether there are *systematic* errors
in the experimental setup though, which is connected
with the deeper question of 'what are we actually trying
to measure'.
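
One such measure is a confidence interval for the mean of each
triplet, computed straight from the recorded samples (sketch only;
the 1.96 factor assumes a roughly normal spread):

    import math

    def confidence_interval(times, z=1.96):
        """Approximate 95% confidence interval for the mean run time of
        one translator/benchmark/n triplet -- a reliability figure that
        needs no opinion, only the recorded samples."""
        n = len(times)
        mean = sum(times) / n
        if n > 1:
            sd = math.sqrt(sum((t - mean) ** 2 for t in times) / (n - 1))
        else:
            sd = 0.0
        half = z * sd / math.sqrt(n)        # z times the standard error
        return mean - half, mean + half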

I argue that random testing, measuring real wall-clock
times on a machine which is randomly loaded with other
processes -- typing emails, compiling programs, running
cron jobs like updatedb -- is actually a realistic
and useful measure, probably more useful than running
the tests in a clean room (a machine unloaded by any
other work). The actual results in both cases are just
as reliable; however, the conditions for doing clean-room
measurements are much harder to obtain, very expensive,
and arguably not applicable to the real world.

Clean-room results can be obtained faster, meaning one can
tolerate a much lower number of tests and expect most results
to cluster tightly around the average, whereas the 'dirty'
testing procedure will have a much larger spread of results
and has to run much longer to get meaningful comparisons.
However, the dirty procedure can aggregate results sensibly,
and it eliminates one source of systematic error (cache
preloading and cache times being counted while other system
overhead is arbitrarily excluded). Instead of excluding things
like disk-to-memory load times, the dirty process
averages them away.
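
A back-of-the-envelope way to see the trade-off (numbers purely
illustrative): the uncertainty of an average shrinks like the spread
divided by the square root of the number of runs, so ten times the
spread needs roughly a hundred times the runs for the same precision.

    import math

    def runs_needed(spread, target_error):
        """Runs required before the standard error (spread / sqrt(runs))
        drops below target_error -- illustrative arithmetic only."""
        return math.ceil((spread / target_error) ** 2)

    print(runs_needed(0.05, 0.01))   # quiet machine: small spread, 25 runs
    print(runs_needed(0.50, 0.01))   # loaded machine: 10x spread, 2500 runs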

-- 
John Skaller, skaller at users.sf.net
PO Box 401 Glebe, NSW 2037, Australia Ph:61-2-96600850 
Download Felix here: http://felix.sf.net