[Shootout-list] demoting marginal languages

Brandon J. Van Every vanevery@indiegamedesign.com
Tue, 28 Sep 2004 02:44:31 -0700


Bengt Kleberg wrote:
>
> however, what is it that stops us from giving a language that
> does not
> have an implementation of a test a very bad score (maximum time
> exceeded) on that test in the basic benchmark. the problem of
> information overload scales up and solves itself. we aren't
> really going
> to end up with issues of ''which language should be in the basic
> benchmark score card''. the best implementations will rise up the
> ladder. :-)

It depends on how big the penalties are.

If the penalties are really really big, such that a language with an
'incompleteness penalty' always loses to a language with complete tests,
then you might as well just call it 0 anyways.  The only difference
would be alphabetical ranking for the losers (in the case of 0) vs. some
fudged, meaningless, half-baked pseudo-performance ranking (for
exceedingly low but nonzero scores).  I suppose in the latter case, you'd be
measuring the completeness of the testing suite rather than the
performance proper.  It strikes me as a big discontinuity in the
testing.
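Here's a minimal sketch of what I mean, in Python.  The penalty value and
the test times are made up; the point is just that once the penalty dwarfs
any real running time, incomplete languages sort by how few tests they
miss, not by how fast they run:

```python
# Hypothetical sketch of the 'maximum time exceeded' penalty scheme.
# Scores are total times, lower is better; PENALTY is an assumed huge cap.

PENALTY = 10_000.0  # stand-in for "maximum time exceeded", in seconds

def total_with_penalty(times, num_tests):
    """Each missing test counts as PENALTY, so any language with a gap
    effectively always loses to one with a complete suite."""
    missing = num_tests - len(times)
    return sum(times) + missing * PENALTY

# Made-up times for a 3-test suite:
full     = total_with_penalty([1.2, 0.8, 2.5], 3)  # all tests implemented
one_gap  = total_with_penalty([1.2, 0.8], 3)       # one test missing
two_gaps = total_with_penalty([1.2], 3)            # two tests missing

# The incomplete languages rank purely by number of gaps -- the
# pseudo-performance ranking, not performance proper.
assert full < one_gap < two_gaps
```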

Why not just keep such languages in a separate 'Provisional' category,
decline to award scores, and note the number of tests they fail?  A
'Provisional' category might motivate some people to work on moving the
language out of 'Provisional', by completing the tests.  Or maybe call
it 'Purgatory', for kicks.

To put it another way, let's say the penalties are more modest.  Honestly,
I don't like the idea of languages being able to place in the rankings
if they don't complete all the tests.  That would send the message that
(say) 1 of these 15 tests doesn't really matter all that much.  I think
all the tests should matter.  Otherwise, I don't think we should have
them at all.

I admit I have a RISC-like design mentality for these sorts of things.
I'd want to see orthogonal coverage, not like 5 different string
handling tests.  Unless such tests are regarded as 1 aggregate unit, not
'separate tests' as far as scoring and weighting are concerned.  I think
orthogonal areas of testing should be given equal weight.
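A quick sketch of that aggregation idea, with made-up numbers: the 5
string sub-tests get averaged into one unit, so the strings area carries
no more weight in the final score than any other orthogonal area:

```python
# Hypothetical sketch: sub-tests aggregate into one unit per area,
# and each orthogonal area gets equal weight in the final score.

def area_score(times):
    # Collapse a group of sub-test times into a single number (mean).
    return sum(times) / len(times)

areas = {
    "strings": [0.9, 1.1, 1.0, 0.8, 1.2],  # 5 sub-parts, counted as 1 unit
    "math":    [2.0],
    "io":      [1.5],
}

# Equal weight per area, regardless of how many sub-tests each contains.
total = sum(area_score(t) for t in areas.values()) / len(areas)
```

With these numbers the 5 string sub-tests contribute exactly as much as
the single math test does, which is the point.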

I haven't decided whether this orthogonal design sensibility is
achievable in the real world.

Actually, if the 'String Test' has 5 sub-parts to it, that's tedious to
implement.  To save everyone real world grief, I think 1 decent string
test should be chosen.  Is it really fair to make everyone implement
40..50 snippets of code, just because people kept thinking up new tests
all day long?  This is a serious scaling / planning issue.  We have to
decide what the cutoff point for writing and maintaining code is.

I would also note that Viewperf scores aren't orthogonal or equally
rated.  Each dataset is supposed to represent typical data for a given
application.  'Typical data' is decided by the application vendor, and
it's an arbitrary decision by them.  They might decide that antialiased
lines count 40%, Gouraud-shaded triangles 30%, and trilinear mipmapped
textures 3%.  It just depends on what they think their customers
actually use, or what they want you to think their customers actually
use.  Anyways it's probably in their best interest to spec it accurately,
as that's what all the 3D HW vendors are going to design towards.
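The Viewperf-style arithmetic is just a weighted sum.  Using the made-up
weights from my example above (the primitive names and frame rates here
are equally made up):

```python
# Sketch of a Viewperf-style weighted composite.  Weights and frame
# rates are hypothetical; each vendor picks weights to match (or to
# suggest) typical customer usage.

weights     = {"aa_lines": 0.40, "gouraud_tris": 0.30, "trilinear": 0.03}
frame_rates = {"aa_lines": 50.0, "gouraud_tris": 120.0, "trilinear": 80.0}

# Weighted sum over the rated primitives; higher is better here,
# unlike the timing totals above.
composite = sum(weights[k] * frame_rates[k] for k in weights)
```

Notice the weights needn't sum to 1 and the areas needn't be orthogonal;
the vendor's notion of 'typical data' is doing all the work.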

I'm too tired to think of a proper analogy to language benchmarking
right now.  There may not be one.


Cheers,                     www.indiegamedesign.com
Brandon Van Every           Seattle, WA

"The pioneer is the one with the arrows in his back."
                          - anonymous entrepreneur