[Shootout-list] measurement vs. ratings

Thu, 30 Sep 2004 13:02:21 -0700

Einar Karttunen wrote:
> Brandon J. Van Every wrote:
> >
> > Startup and shutdown overhead are not well-understood on a per-test
> > basis, at present.  Given a general lack of precision, I don't see
> > reasonable fear in 'new overheads' compared to 'old
> > overheads'.  I do see a reasonable fear of implementation work.
>
> The shootout has many types of tests and having a per test startup
> time would make things quite unfair. Think about a language that
> does all initializations before main and one which does them in
> a lazy manner.

Measuring phenomena does not make the methodology 'unfair'.  It's how
you rate the results of the measurements that makes it fair or unfair.
We haven't agreed on how to do that.  That's what 'LCD or Simple
Benchmark' is about.

Everyone is always going to have a noninvasive START() and STOP()
outside the test.  That's an assumption of mine; I just haven't been
spelling it out.  There's no way to determine startup and shutdown costs
if you don't do this.  Without such bracketing, you'd only be able to
measure test loop bodies.
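
To make the bracketing concrete, here is a minimal sketch, not anything
the Shootout actually runs.  It assumes a hypothetical test program
(CHILD below) that times its own loop body and prints the seconds; the
harness puts START() and STOP() around the whole process and treats the
difference as startup plus shutdown cost.

```python
import subprocess
import sys
import time

# Hypothetical test program (an assumption for illustration): it times
# its own loop body and prints the elapsed seconds on stdout.
CHILD = (
    "import time\n"
    "t0 = time.perf_counter()\n"
    "sum(range(200000))\n"
    "print(time.perf_counter() - t0)\n"
)

def bracket(argv):
    """Run argv with START()/STOP() outside the process.

    Returns (total_seconds, body_seconds), where body_seconds is the
    loop time the child reported about itself.
    """
    start = time.perf_counter()                      # START(), outside the test
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    stop = time.perf_counter()                       # STOP(), outside the test
    return stop - start, float(result.stdout.strip())

total, body = bracket([sys.executable, "-c", CHILD])
overhead = total - body                              # startup + shutdown, roughly
print(f"total {total:.4f}s  body {body:.4f}s  startup+shutdown {overhead:.4f}s")
```

The subtraction only gives a rough figure: it also includes process
creation and pipe handling on the harness side.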

If everyone accepts an invasive START() and a STOP() timer to more
accurately determine startup/shutdown times, everyone is in the same
boat.  If everyone has START() STOP(), START() STOP() firing repeatedly
while the body of the test is executing, everyone is in the same boat.
The only fairness issue here is making sure the START() STOP() gaps are
sufficiently large so that the timer overhead is small compared to
what's executed in between.  That is why the 'guessing N' problem never
goes away; it just moves from a macro to a micro level.
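
A rough sketch of what 'guessing N' looks like at the micro level.  The
work() function and the factor of 1000 are assumptions picked for
illustration, not proposed Shootout values.  The idea is to estimate
what one START()/STOP() pair costs, then grow N until the timed gap
dwarfs it.

```python
import time

def timer_overhead(pairs=100_000):
    """Estimate the cost of one START()/STOP() pair with nothing between.

    The figure includes the loop's own bookkeeping, so it slightly
    overstates the pure timer cost.
    """
    begin = time.perf_counter()
    for _ in range(pairs):
        start = time.perf_counter()   # START()
        stop = time.perf_counter()    # STOP()
    return (time.perf_counter() - begin) / pairs

def work(n):
    """Hypothetical test body: n iterations of trivial arithmetic."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_gap(n):
    """Time one chunk of the test body between START() and STOP()."""
    start = time.perf_counter()
    work(n)
    return time.perf_counter() - start

overhead = timer_overhead()
n = 1000
gap = timed_gap(n)
# Double N until the gap is at least 1000x the timer overhead
# (an arbitrary threshold, i.e. timer cost around 0.1% of the gap).
while gap < 1000 * overhead:
    n *= 2
    gap = timed_gap(n)

print(f"timer pair ~{overhead * 1e9:.0f} ns, chosen N = {n}, gap = {gap * 1e3:.3f} ms")
```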

The reason to move from macro to micro is accuracy.  The more samples
you take, the more accurate the measurement, provided that the overhead
of taking samples is very small compared to the test work performed.
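
One way to put a number on 'more accurate', offered as a sketch under
assumptions: treat each timed run as a sample and report the standard
error of the mean, which shrinks roughly with the square root of the
sample count when the noise is random.  The work() body and the sample
counts are made up for illustration.  Note that this only addresses
random scatter; a systematic bias such as timer overhead does not
average away.

```python
import math
import statistics
import time

def work():
    """Hypothetical test body."""
    return sum(i * i for i in range(20000))

def sample(k):
    """Take k START()/STOP() samples of the test body."""
    times = []
    for _ in range(k):
        start = time.perf_counter()
        work()
        times.append(time.perf_counter() - start)
    return times

def summarize(times):
    """Return (mean, standard error of the mean) for a list of timings."""
    mean = statistics.mean(times)
    stderr = statistics.stdev(times) / math.sqrt(len(times))
    return mean, stderr

for k in (5, 50):
    mean, stderr = summarize(sample(k))
    print(f"{k:3d} samples: mean {mean * 1e3:.3f} ms, std. error {stderr * 1e3:.3f} ms")
```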

> Clearly this would penalize the language which is optimized
> so that real code can start executing quickly.

This is a throughput vs. latency biasing issue.  Do you want your
language to run faster once you're done with your initialization costs,
or to have a fast turnaround time for small problems?  It's not an error
to implement lazy evaluation; it's a different strategy.

> And if the benchmark is a utility program then the additional startup
> and shutdown costs *are* interesting and should be a part of the
> benchmark.

I agree that the information should be recorded.  In fact, I believe it
should be recorded more accurately than just guessing that 'Hello World'
somehow represents what actually occurs in all the tests.  I think the
extra accuracy is worth reimplementing all the tests with invasive
timers.  I'm even prepared to do a good chunk of the work, if I can get
the Shootout working on Windows without pulling out hair.

How to rate the results, that's a different matter.

Cheers,                     www.indiegamedesign.com
Brandon Van Every           Seattle, WA

"The pioneer is the one with the arrows in his back."
                          - anonymous entrepreneur