[Shootout-list] An alternative to line counts
Brian Hurt
bhurt@spnz.org
Sat, 19 Jun 2004 15:16:53 -0500 (CDT)
I'd like to propose an alternative to lines of code as a measure of
program complexity in the shootout. I think we should pick a
compression utility (gzip and bzip2 being the two obvious ones), and
measure the size of the compressed source code instead. I think this
gives a much more accurate reading of the real complexity of the source
code, and is less influenced by irrelevant factors, including coding
styles and the possibility of cheating.
Before getting into this, I'd like to state for the record that I'm an
unrepentant Ocaml advocate. But I believe the best thing for Ocaml is to
make the shoot-out as fair and honest as it can be.
The problem with lines of code as a measure of program complexity is that
it's too easy to fake. In most languages, newlines are optional. Which
means no matter how long the program is, I can turn it into a one-line
program by the simple expedient of removing all the newlines. Now,
obviously, this sort of cheat would not be accepted by the maintainer (at
least, I hope it wouldn't be accepted). But now the game becomes one of
how close to the line can I walk? The closer to the line I walk- the more
newlines I remove- the better my language looks.
Note that I don't think such cheating has gone on (at least not
consciously)- but I think it's a definite possibility (for every language
except, perhaps, Python- an argument in favor of Python :-).
Language styles also have a disproportionate effect. Some languages (for
example, Ocaml) encourage a more, um, succinct variable and function
naming style, while other languages (for example, Java) have a more
verbose style. Shorter variable names allow me to "pack" more code on a
line before it starts looking cluttered, while verbose variable names
limit that ability.
I would postulate that the absolute complexity of a piece of code is its
information content in the Claude Shannon Information Theory sense. This
is a very attractive idea intuitively. The question now becomes how to
measure the information in the code. I think by far the easiest way to
measure the information content is the size of the compressed file.
With compression, the size of identifiers, the presence or absence of
newlines, and the amount of whitespace don't make that big of a
difference, since a compressor encodes repeated strings cheaply.
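
As a rough illustration of the measurement described above, here is a
minimal Python sketch. It is only an assumption about how such a metric
could be computed, not tooling the shootout actually uses. The sample
program and the names SPREAD, PACKED, line_count and sizes are made up
for the example. It compares a naive line count against the gzip and
bzip2 sizes for the same code in two layouts.

```python
import bz2
import gzip

# Hypothetical sample: the same C-like function in two layouts.
SPREAD = """\
int sum(int *values, int count)
{
    int total = 0;
    int i;
    for (i = 0; i < count; i++)
    {
        total += values[i];
    }
    return total;
}
"""

# The "cheating" layout: every newline and indent collapsed to one space.
PACKED = " ".join(SPREAD.split()) + "\n"


def line_count(source: str) -> int:
    """Count non-blank lines, as a naive lines-of-code metric would."""
    return sum(1 for line in source.splitlines() if line.strip())


def sizes(source: str) -> dict:
    """Return the raw, gzip, and bzip2 sizes of the source in bytes."""
    raw = source.encode("utf-8")
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw)),
        "bzip2": len(bz2.compress(raw)),
    }


# Report both metrics for each layout.
for name, source in (("spread", SPREAD), ("packed", PACKED)):
    print(name, line_count(source), sizes(source))
```

On an input this small the fixed header overhead of gzip and bzip2
dominates the result, so the comparison only becomes meaningful on
whole programs.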
An alternative would be a token count- but this implies being able to
tokenize all the different languages. Not to mention what qualifies as a
token. For example, Python uses whitespace for program structure- should
increasing or decreasing the indentation of code be considered a token?
Using compression means we don't have to worry about parsing the languages
(making adding a new language easier), nor do we have to debate what is or
isn't a token.
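
To make the "what is a token?" question concrete, here is a small
sketch using the tokenize module from Python's standard library. The
SAMPLE snippet and the token_kinds helper are hypothetical names chosen
for the example. Python's own tokenizer happens to report indentation
changes as INDENT and DEDENT tokens, which is one possible answer, but
a cross-language token counter would have to make a decision like this
for every language.

```python
import io
import tokenize
from collections import Counter


def token_kinds(source: str) -> Counter:
    """Count the tokens in Python source, grouped by token kind."""
    readline = io.StringIO(source).readline
    return Counter(
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(readline)
    )


# Hypothetical sample: a one-line function body, indented once.
SAMPLE = "def double(x):\n    return x * 2\n"

kinds = token_kinds(SAMPLE)
# Indentation shows up as explicit INDENT and DEDENT tokens.
print(kinds["INDENT"], kinds["DEDENT"], sum(kinds.values()))
```

Whether those INDENT, DEDENT, NEWLINE and end-of-file markers should
count toward a program's length is exactly the sort of debate that
measuring compressed size avoids.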
But I wanted to kick off the debate. Thoughts? Comments?
--
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
- Gene Spafford
Brian