[Shootout-list] Directions of various benchmarks

John Skaller skaller@users.sourceforge.net
Fri, 03 Jun 2005 01:01:56 +1000


On Thu, 2005-06-02 at 13:47 +0200, Bengt Kleberg wrote:
> On 2005-06-01 21:29, John Skaller wrote:
> ...deleted
> > I have some Python script that does benchmarking now.
> > It isn't complete of course. Here is the output:

> this seems to be most of the things to use for metrics. the only missing 
> items i can think of are ram usage and loc.

LOC can still be counted, of course; however, it is a metric
of the language, not the translator.

There's no memory measurement.

I think memory use is a secondary metric. These days we have
virtual memory etc., which can run some pretty big programs.
In the hierarchy of memory (registers, L1/L2 cache, RAM,
disk buffers, disk .. the Internet .. <g>), the price for memory
use is mainly paid in terms of performance.

It also isn't clear what 'memory' actually is. Do you count
address space allocated to the stack, but for which there
is no backing store? 

Recall I measure *real* time. So page swaps, cache spills from
task switching, etc., all cost: the tests could be run on
a loaded machine; for example, one could run 20 copies
of the test process at once (needing a mod to make sure
the data is appended to the results file without corruption).
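
A minimal sketch of that mod, assuming a plain text results
file and POSIX advisory locking (fcntl, so Unix only):

    import fcntl

    def append_result(path, line):
        # append mode keeps each write at the end of the file
        f = open(path, 'a')
        try:
            # advisory lock so concurrent test processes don't interleave records
            fcntl.flock(f, fcntl.LOCK_EX)
            f.write(line + '\n')
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
            f.close()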

> > If the time is too low, the minimum n is increased by 1.
> > If the time is too high, the maximum n is decreased by 1.
> 
> suppose we have the count-words test. on my machine i need about 2500000 
> words to get a run time of more than a second. if minimum n is low it 
> will take a long time to reach 2500000.

Redesign the test so that, for example,

	words = 5^n
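
so that n stays small while the actual work grows geometrically.
A quick check (the base 5 is only illustrative):

    def words_for(n):
        # control variable n is small; the problem size grows as 5^n
        return 5 ** n

    # n = 9 gives about 1.95 million words, n = 10 about 9.8 million
    for n in range(8, 11):
        print(n, words_for(n))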

> i think i have misunderstood your algorithm.

Nope, I think you understand it: just view the 
'linearity' of the control variable as a constraint
on test design.

At present the code has a single n range for each test.
This is basically because we need that for a sensible
graph *for a given host*. So really there should
be a range per host. I'm sure there are lots of refinements.
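
Something like a table keyed by (test, host) would do it;
the names here are hypothetical:

    # per-host n ranges instead of a single range per test
    n_range = {
        ('ackermann',   'myhost'): (3, 9),
        ('count_words', 'myhost'): (5, 10),
    }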

> 
> > The test process runs until a total elapsed time has 
> > expired, a test crashes, or you press Ctrl-C.
> 
> would any one crash stop the whole test run?

At present yes, the programs are not supposed to crash.
However, stack overflow may terminate a program, yet
the program is still correct. The current code can't
handle that but probably should.
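
One way, assuming the driver sees a waitpid-style status word,
is to separate a signal death (e.g. SIGSEGV from stack overflow)
from an ordinary failure -- a sketch, not the current code:

    import os, signal

    def classify(status):
        # status as returned by os.waitpid()
        if os.WIFSIGNALED(status):
            if os.WTERMSIG(status) == signal.SIGSEGV:
                return 'overflow?'   # possibly a correct program that blew the stack
            return 'crashed'
        if os.WEXITSTATUS(status) != 0:
            return 'failed'
        return 'ok'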

> the pros of a single file is that all the data is in one place. the cons 
> might be that this one file can get very big, and for some people not 
> all of the contents are of interest.

That isn't a problem. You can always move the file away,
delete it, or, more generally, extract a subset of the
data you're interested in.

Actually, it is really a database .. you can think of
doing SQL selects to extract the data you want.

A database is too heavy for my Mickey Mouse testing;
however, it may make sense for the Shootout.
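
For example, if the records were loaded into SQLite
(not what the script does, just the idea; the schema is made up):

    import sqlite3

    con = sqlite3.connect('results.db')
    # hypothetical schema: results(host, test, translator, n, elapsed)
    rows = con.execute(
        "SELECT n, elapsed FROM results"
        " WHERE test = ? AND translator = ? ORDER BY n",
        ('ackermann', 'gcc_3_4'))
    for n, elapsed in rows:
        print(n, elapsed)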

> > The key properties of this procedure are:
> > 
> > (a) it is automatically adaptive: it downgrades
> > results of tests that run too quickly, and it 
> > finds the maximum n value automatically.
> 
> would you be so kind and expand the bit about ''downgrades results of 
> tests that run too quickly''. i do not understand.

Tests that run too fast aren't very accurate. 
The result is still stored, but the minimum n is increased
so that the test won't run again at that n. Tests that take
too long but don't time out could also have their results
stored, but they're downgraded because they're too slow.

This is basically a feedback system to adjust the test
range dynamically, so most of the tests are done in
a range of n yielding the desired range of elapsed time.
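
In rough pseudo-Python (the thresholds are invented, and this
isn't the literal driver code):

    TOO_FAST = 0.5    # seconds: below this the timing is too noisy
    TOO_SLOW = 20.0   # seconds: above this the test is wastefully slow

    def adjust(results, test, n, elapsed, n_min, n_max):
        results.append((test, n, elapsed))   # the result is kept either way
        if elapsed < TOO_FAST:
            n_min = n_min + 1                # too fast: raise the minimum n
        elif elapsed > TOO_SLOW:
            n_max = n_max - 1                # too slow: lower the maximum n
        return n_min, n_max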


> > (d) it can also be stopped with a Ctrl-C
> 
> good.

That was the hardest part to get right!

Initially Ctrl-C killed the individual test,
but not the driver code, which just ran another
test immediately. I think I should have used
a detached process or something -- the test's
output also prints, which I don't normally want
(but during debugging I do ..)
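
A sketch of the sort of fix, assuming the test is launched with
the subprocess module: put the child in its own session so Ctrl-C
reaches the driver, which can then clean up and stop:

    import os, signal, subprocess

    def run_test(cmd):
        # the child gets its own process group, so the terminal's SIGINT
        # is delivered to the driver rather than (only) to the test
        p = subprocess.Popen(cmd, preexec_fn=os.setsid,
                             stdout=subprocess.PIPE)  # capture output; print only when debugging
        try:
            out, _ = p.communicate()
            return p.returncode, out
        except KeyboardInterrupt:
            os.killpg(p.pid, signal.SIGTERM)          # stop the test as well
            raise                                     # let the driver shut down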

> 
> > (e) It runs the tests randomly to avoid any
> > biases such as pre-loaded cache memory
> > from a previous test
> 
> to be sure to avoid pre-loading, would it not be a good idea to have a 
> round-robin algorithm? instead of random, i mean.

Depends what you want, and that is hard to define.
I even think of running a low-frequency cron job,
and gathering data for 24 hours or a week .. whilst
using the computer to do other things. Sometimes the
tests will run on an unloaded machine, and sometimes
they'll not get much CPU.

I probably need yet another key in the record layout,
suggesting the load conditions (tests were the only
job running .. the machine was heavily loaded .. etc).
Or perhaps I should just use a different 'host' key.
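
For instance the record could carry one more field
(the field names here are made up):

    # hypothetical record layout with a load-conditions key
    result = {
        'host':       'myhost',
        'load':       'idle',      # or 'loaded', 'cron', ...
        'translator': 'gcc_3_4',
        'test':       'ackermann',
        'n':          7,
        'elapsed':    3.2,         # seconds, real time
    }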

> note that if somebody (ie, i) wants to run only one test it is difficult 
> to avoid pre-loading the test data.
> or if somebody wants to run all tests for a single translator, it is 
> difficult to avoid pre-loading the translator/executable.

Yes. And it would be interesting to set it up to do that.

> perhaps it is fairer to always have pre-loading?

Perhaps. I don't know. Does it make a difference?
The answer is as above -- it seems worth using 
the tool, perhaps with mods, to actually do some
experiments.
 
> > (h) the set of tests, source files,
> > and translators can be varied at any time
> 
> would that be done by editing a file, or at start/on the command line?

At present, the set of translators is loaded from a
configuration file, the set of tests is hard coded
into the driver program.

Of course this should be made more flexible.

> 
> > (i) the procedure measures *real* time
> 
> is using the current time better than system + user time?

Better? Perhaps. I think real time is what should be measured
because that's what users care about.
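
In Python the difference looks roughly like this (a sketch;
run_the_test is a placeholder for launching the test program):

    import os, time

    t0 = time.time()
    u0, s0, cu0, cs0, e0 = os.times()
    run_the_test()
    t1 = time.time()
    u1, s1, cu1, cs1, e1 = os.times()

    real = t1 - t0                     # wall clock: what the user actually waits
    cpu  = (cu1 - cu0) + (cs1 - cs0)   # user + system time of child processes
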
> 
> > (j) The procedure does not measure memory use
> 
> unfortunate

I don't know how to do it :)

Most of the code is plain Python, and most of the
Unix-like calls will actually run on Windows
[Python takes care of it].

However memory use is very system dependent,
and it isn't clear what it means, other than as
a factor influencing performance.

Yes, unfortunate .. but I never look at the Shootout
memory use figures ..
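
For what it's worth, on Unix the peak resident set size of the
child processes can be read via getrusage; it isn't portable to
Windows and it still doesn't answer the address-space question:

    import resource

    # after waiting for the test process to finish:
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    peak_rss_kb = usage.ru_maxrss   # peak resident set size (kilobytes on Linux)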

> > (k) The procedure does not check the results
> 
> check the metrics or the results?

It doesn't check the programs give the right answer.

> > (l) The procedure does not analyse the results
> 
> ditto

I mean, it just records the times, it doesn't, for
example, calculate the average time for ackermann 
using gcc.
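
That sort of analysis is easy to bolt on afterwards, e.g. a few
lines over the results file (assuming, purely for illustration,
one whitespace-separated record per line):

    # hypothetical line format: host translator test n elapsed
    times = []
    for line in open('results.txt'):
        host, xlator, test, n, elapsed = line.split()
        if test == 'ackermann' and xlator == 'gcc_3_4':
            times.append(float(elapsed))
    if times:
        print('average', sum(times) / len(times))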

> somewhere here you check that the first process to return was the test, 
> right?

Yes, the code compares the pids (process IDs).
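
The pattern is roughly: fork the test, fork a watchdog that just
sleeps, then see which child os.wait() reports first. A sketch
only -- cmd and TIMEOUT are placeholders, not the actual driver:

    import os, signal, sys, time

    test_pid = os.fork()
    if test_pid == 0:
        os.execvp(cmd[0], cmd)       # child: run the test program
    timer_pid = os.fork()
    if timer_pid == 0:
        time.sleep(TIMEOUT)          # child: the watchdog timer
        sys.exit(0)

    pid, status = os.wait()          # whichever child finishes first
    timed_out = (pid == timer_pid)   # watchdog fired before the test returned
    other = test_pid if timed_out else timer_pid
    try:
        os.kill(other, signal.SIGTERM)   # stop the one still running
    except OSError:
        pass                             # it may have exited already
    os.wait()                            # reap the remaining child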


> > # this config file defines default translators on your platform
> > 
> > xlators = [
> >   ('felix',mk_felix,'felix'),
> >   ('gcc_3_4',mk_gcc_3_4,'c'),
> >   ('gccopt_3_4',mk_gcc_3_4_opt,'c'),
> >   ('gcc_4_0',mk_gcc_4_0,'c'),
> >   ('gccopt_4_0',mk_gcc_4_0_opt,'c'),
> >   ('ocamlopt',mk_ocamlopt,'ocaml'),
> >   ('ocamlb',mk_ocamlb,'ocaml'),
> > ]
> 
> would it not be a potential problem if all translators had to be put 
> into this single config file? 

Yes. It is better than hard coding them in the driver,
but clearly there are more flexible schemes.
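
One more flexible scheme (again only a sketch) would be one small
config file per translator, gathered from a directory, so a broken
entry only knocks out that one translator; load_translator here is
hypothetical:

    import glob

    xlators = []
    for path in glob.glob('translators/*.conf'):
        try:
            # parses one file into a (name, make_function, language) entry
            xlators.append(load_translator(path))
        except Exception as e:
            print('skipping', path, '-', e)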

> ie, if somebody makes a mistake with one 
> translator, would not all other translators be inconvenienced?

Yes. The code should be regarded as a working prototype
and not a finished product.

> nice work. much better than a theoretical model.

Not better, different :)

Americans have the saying, 

'it works in practice'

And the French have a counter-saying

'ah, but does it work in theory?'


-- 
John Skaller, skaller at users.sf.net
PO Box 401 Glebe, NSW 2037, Australia Ph:61-2-96600850 
Download Felix here: http://felix.sf.net