[Shootout-list] Directions of various benchmarks

Bengt Kleberg bengt.kleberg@ericsson.com
Thu, 02 Jun 2005 13:47:09 +0200


On 2005-06-01 21:29, John Skaller wrote:
...deleted
> I have some Python script that does benchmarking now.
> It isn't complete of course. Here is the output:
> 
> 
> Rosella 2005/06/02 04:04 ocamlb takfp 2 0.0063
...deleted
> Rosella 2005/06/02 04:04 gccopt_4_0 ack 11 0.6415
> 
> Some notes now: the first field is the hostname,
> the second the test date, the third the test time.
> The fourth is the translator key, the fifth the test key,
> the sixth the value of n, and the last field is the elapsed time.

this seems to cover most of what is needed for metrics. the only 
missing items i can think of are ram usage and loc.
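
for loc and ram something fairly simple might already do. a sketch of my 
own (not part of the script; loc here is just non-blank lines, and the 
units of ru_maxrss depend on the platform):

    import resource

    def count_loc(filename):
        # crude loc: count the non-blank lines in a test source file
        with open(filename) as f:
            return sum(1 for line in f if line.strip())

    def peak_child_rss():
        # peak resident set size of terminated child processes
        # (kilobytes on linux, other units on some other systems)
        return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss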


> Here is what the test process does, roughly:
> 
> We start with a minimum and maximum allowed time per test,
> and a minimum and maximum initial n value (one pair for
> each test).
> 
> The procedure randomly picks a translator, test, and n,
> and measures the time.
> 
> If the time is too low, the minimum n is increased by 1.
> If the time is too high, the maximum time is decreased by 1.

suppose we have the count-words test. on my machine i need about 2500000 
words to get a run time of more than a second. if the minimum n is low it 
will take a long time to reach 2500000.

then, if we have a very slow language, would not a too-high minimum n 
ensure that this language never gets a run time lower than the maximum?

i think i have misunderstood your algorithm.
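
to make my confusion concrete, here is how i read the adaptive step, as a 
python sketch (the names, the bounds table and the run_test stub are my 
own guesses, not your script; the 'too high' branch is exactly where i am 
unsure):

    import random, time

    def run_test(xlator, test, n):
        # placeholder: the real script would launch the compiled benchmark
        time.sleep(0.01 * n)

    def one_round(xlators, tests, bounds, min_time, max_time):
        # bounds[test] is a mutable pair [n_min, n_max]
        xlator = random.choice(xlators)
        test = random.choice(tests)
        n_min, n_max = bounds[test]
        n = random.randint(n_min, n_max)
        start = time.time()
        run_test(xlator, test, n)
        elapsed = time.time() - start
        if elapsed < min_time:
            bounds[test][0] += 1        # too fast: raise the minimum n
        elif elapsed > max_time:
            bounds[test][1] -= 1        # too slow: lower the maximum n (?)
        return elapsed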


> The test process runs until a total elapsed time has 
> expired, a test crashes, or you press Ctrl-C.

would a single crash stop the whole test run?


> The results of each test are appended to a single file,
> which accumulates results for ever. If someone else 
> does some tests they can mail you the file as an attachment
> and you can just append it to other data you have.

the pro of a single file is that all the data is in one place. the con 
might be that this one file can get very big, and for some people not 
all of the contents will be of interest.


> The test procedure is such that a modification to
> allow for concurrency may be possible.
> --------------------------------------------------
> 
> The key properties of this procedure are:
> 
> (a) it is automatically adaptive: it downgrades
> results of tests that run too quickly, and it 
> finds the maximum n value automatically.

would you be so kind as to expand on the bit about ''downgrades results 
of tests that run too quickly''? i do not understand it.


> (b) it kills tests that exceed a time limit

good.


> (c) it runs for a fixed amount of time and stops

very good.


> (d) it can also be stopped with a Ctrl-C

good.


> (e) It runs the tests randomly to avoid any
> biases such as pre-loaded cache memory
> from a previous test

to be sure to avoid pre-loading, would it not be a good idea to use a 
round-robin algorithm instead of a random one? (there is a small sketch 
below.)
note that if somebody (i.e., me) wants to run only one test, it is 
difficult to avoid pre-loading the test data.
or if somebody wants to run all tests for a single translator, it is 
difficult to avoid pre-loading the translator/executable.

perhaps it is fairer to always have pre-loading?
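
what i mean by round-robin, as a small python sketch (the names are mine):

    import itertools

    def round_robin(xlators, tests):
        # cycle through every (translator, test) pair in a fixed order,
        # so the same test never runs twice in a row
        return itertools.cycle([(x, t) for x in xlators for t in tests])

    # usage: pairs = round_robin(xlators, tests); xlator, test = next(pairs)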


> (f) the results are infinitely cumulative
> 
> (g) it can merge results from different
> architectures.
> 
> (h) the set of tests, source files,
> and translators can be varied at any time

would that be done by editing a file, or at startup/on the command line?


> (i) the procedure measures *real* time

is using wall-clock (real) time better than system + user time?


> 
> (j) The procedure does not measure memory use

unfortunate


> (k) The procedure does not check the results

check the metrics or the results? (metrics => run time, memory usage, 
loc. results => what the test produces)


> (l) The procedure does not analyse the results

ditto


> (m) The data can be parsed by 'readline' and then 'split'
> on a single space.

good.
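
for the record, reading the accumulated file back is then something like 
this sketch (assuming exactly the seven fields quoted at the top):

    def read_results(filename):
        # each record: host date time xlator test n elapsed
        results = []
        with open(filename) as f:
            for line in f:
                host, date, when, xlator, test, n, elapsed = line.strip().split(' ')
                results.append((host, date, when, xlator, test, int(n), float(elapsed)))
        return results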


> ----------------------------------------------------
> Here is how the measurement is done: grab the current time.
> 
> Launch two processes without waiting: 
> (1) the test, and,
> (2) a 'sleep' command line
> 
> which is set a few seconds above the maximum allowable time.
> 
> Then wait for one of the child processes to return,
> record the termination time.
> 
> Kill the other process and wait for it.

somewhere here you check that the first process to return was the test, 
right?
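
if i follow you, the race between the test and the watchdog looks roughly 
like this python sketch (my own names; unix only, and how the loser gets 
killed will differ per platform):

    import os, signal, time

    def timed_run(command, max_seconds):
        start = time.time()
        # launch the test and a watchdog sleep, without waiting for either
        test_pid = os.spawnvp(os.P_NOWAIT, '/bin/sh', ['sh', '-c', command])
        sleep_pid = os.spawnvp(os.P_NOWAIT, 'sleep', ['sleep', str(max_seconds)])
        done_pid, status = os.wait()        # whichever child exits first
        elapsed = time.time() - start
        # kill and reap the loser
        loser = sleep_pid if done_pid == test_pid else test_pid
        os.kill(loser, signal.SIGKILL)
        os.waitpid(loser, 0)
        # a valid measurement only if the test finished before the watchdog
        if done_pid == test_pid:
            return elapsed
        return None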


> --------------------------------------------------
> The script is written in Python, and requires
> a configuration file that looks like this:
> ---------------------------------------------------
> 
> # this config file defines default translators on your platform
> # to be used by the performance test module
> #
> # The file 'config/speed_xlators.py' can be edited,
> # it will not be clobbered once created
> #
> # define the translators
> def mk_gcc_3_4(k,p):
>   return "gcc-3.4 -o speed/exes/%s/%s speed/src/c/%s.c" % (k,p,p)
> 
> def mk_gcc_3_4_opt(k,p):
>   x = "gcc-3.4 -O3 -fomit-frame-pointer "
>   x = x + "-o speed/exes/%s/%s speed/src/c/%s.c" % (k,p,p)
>   return x
> 
> def mk_gcc_4_0(k,p):
>   return "gcc-4.0 -o speed/exes/%s/%s speed/src/c/%s.c" % (k,p,p)
> 
> def mk_gcc_4_0_opt(k,p):
>   x = "gcc-4.0 -O3 -fomit-frame-pointer "
>   x = x + "-o speed/exes/%s/%s speed/src/c/%s.c" % (k,p,p)
>   return x
> 
> def mk_ocamlopt(k,p):
>   return "ocamlopt.opt -o speed/exes/%s/%s speed/src/ocaml/%s.ml" % (k,p,p)
> 
> def mk_ocamlb(k,p):
>   return "ocamlc.opt -o speed/exes/%s/%s speed/src/ocaml/%s.ml" % (k,p,p)
> 
> def mk_felix(k,p):
>   x = "bin/flx --test --force --static --optimise -c -DFLX_PTF_STATIC_POINTER "
>   x = x + "speed/src/felix/%s && " % p
>   x = x + "mv speed/src/felix/%s speed/exes/%s/%s" % (p,k,p)
>   return x
> 
> xlators = [
>   ('felix',mk_felix,'felix'),
>   ('gcc_3_4',mk_gcc_3_4,'c'),
>   ('gccopt_3_4',mk_gcc_3_4_opt,'c'),
>   ('gcc_4_0',mk_gcc_4_0,'c'),
>   ('gccopt_4_0',mk_gcc_4_0_opt,'c'),
>   ('ocamlopt',mk_ocamlopt,'ocaml'),
>   ('ocamlb',mk_ocamlb,'ocaml'),
> ]

would it not be a potential problem if all translators had to be put 
into this single config file? ie, if somebody makes a mistake with one 
translator, would not all other translators be inconvenienced?
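
one way around it might be for the script to skip a broken entry rather 
than die, something like this sketch (my own code, not the actual script):

    def build_commands(xlators, key, program):
        # evaluate each translator definition independently, so one bad
        # entry in the config file does not take the others down with it
        usable, skipped = [], []
        for name, mk, lang in xlators:
            try:
                usable.append((name, mk(key, program), lang))
            except Exception:
                skipped.append(name)
        return usable, skipped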


> ------------------------------------------------
> The file layout is:
> 
> speed/src/<language>/<tests>
> speed/exes/<xlator>/<executables>

exes? what does that stand for?


> And here is the actual script:

...deleted

nice work. much better than a theoretical model.


bengt