[Shootout-list] An alternative to line counts

Brian Hurt bhurt@spnz.org
Sat, 19 Jun 2004 15:16:53 -0500 (CDT)


I'd like to propose an alternative to lines of code as a measure of 
program complexity in the shootout.  Instead, I think we should pick a 
compression utility (gzip and bzip2 being the two obvious ones) and 
measure the size of the compressed source code.  I think this gives a 
much more accurate reading of the real complexity of the source code, 
and is less influenced by irrelevant factors such as coding style and 
the possibility of cheating.
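To make the proposal concrete, here is a minimal sketch of what the 
measurement could look like.  It is only an illustration, not shootout 
infrastructure: the script name and the choice of maximum compression 
level are my own assumptions, and running gzip or bzip2 from the command 
line would do just as well.

    # compressed_size.py -- hypothetical helper, not part of the shootout harness.
    # Report the raw size of each source file alongside its gzip- and
    # bzip2-compressed sizes, which is the metric proposed above.
    import bz2
    import gzip
    import sys

    def compressed_sizes(path):
        with open(path, 'rb') as f:
            data = f.read()
        return (len(data),
                len(gzip.compress(data, compresslevel=9)),
                len(bz2.compress(data, compresslevel=9)))

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            raw, gz, bz = compressed_sizes(path)
            print('%s: raw=%d gzip=%d bzip2=%d' % (path, raw, gz, bz))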

Before getting into this, I'd like to state for the record that I'm an 
unrepentant Ocaml advocate.  But I believe the best thing for Ocaml is to 
make the shootout as fair and honest as it can be.

The problem with lines of code as a measure of program complexity is that 
it's too easy to fake.  In most languages, newlines are optional, which 
means that no matter how long the program is, I can turn it into a one-line 
program by the simple expedient of removing all the newlines.  Now, 
obviously, this sort of cheat would not be accepted by the maintainer (at 
least, I hope it wouldn't be accepted).  But then the game becomes one of 
how close to the line I can walk: the closer to the line I walk (the more 
newlines I remove), the better my language looks.

Note that I don't think such cheating has gone on (at least not 
consciously), but I think it's a definite possibility (for every language 
except, perhaps, Python - an argument in favor of Python :-).

Language styles also have a disproportionate effect.  Some languages (for 
example, Ocaml) encourage a more, um, succinct style of variable and 
function naming, while other languages (for example, Java) have a more 
verbose style.  Shorter variable names allow me to "pack" more code onto a 
line before it starts looking cluttered, while verbose variable names limit 
that ability.

I would postulate that the absolute complexity of a piece of code is its 
information content in the Claude Shannon information-theory sense.  This 
is a very attractive idea intuitively.  The question then becomes how to 
measure the information in the code, and I think by far the easiest way to 
measure the information content is the size of the compressed file.

With compression, the size of identifiers, the presence or absence of 
newlines, and the amount of whitespace don't make that big of a difference.
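As a quick, hedged illustration of that claim, the following sketch (my 
own, hypothetical) strips the newlines and indentation out of a source 
file and compares how much the raw byte count shrinks versus how much the 
gzip-compressed size shrinks.  I'd expect the raw count to drop noticeably 
while the compressed size barely moves, but that's something to verify 
rather than take on faith.

    # whitespace_cheat.py -- hypothetical illustration only.
    # Compare raw vs. gzip-compressed size before and after a crude
    # whitespace "cheat" (joining all lines and dropping indentation).
    import gzip
    import sys

    def gz_size(text):
        return len(gzip.compress(text.encode('utf-8'), compresslevel=9))

    def compare(path):
        with open(path, 'r') as f:
            original = f.read()
        squeezed = ' '.join(line.strip() for line in original.splitlines())
        print('raw bytes:       %d -> %d'
              % (len(original.encode('utf-8')), len(squeezed.encode('utf-8'))))
        print('gzip compressed: %d -> %d'
              % (gz_size(original), gz_size(squeezed)))

    if __name__ == '__main__':
        compare(sys.argv[1])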

An alternative would be a token count, but this implies being able to 
tokenize all the different languages, not to mention deciding what 
qualifies as a token.  For example, Python uses whitespace for program 
structure - should increasing or decreasing the indentation of code be 
considered a token?  Using compression means we don't have to worry about 
parsing the languages (making adding a new language easier), nor do we 
have to debate what is or isn't a token.
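To show what the token-count route would involve, here is a minimal sketch 
that only works for Python, because it leans on Python's own tokenize 
module; every other language in the shootout would need its own 
equivalent, which is exactly the effort I'd rather avoid.  The 
count_indentation flag is a hypothetical knob reflecting the INDENT/DEDENT 
question above.

    # token_count.py -- hypothetical sketch; Python-only, by construction.
    # Counts tokens in a Python source file using the standard tokenize module.
    # INDENT and DEDENT really do show up as tokens, so whether indentation
    # "counts" is a decision you have to make (count_indentation below).
    import sys
    import tokenize

    def count_tokens(path, count_indentation=True):
        skip = {tokenize.NL, tokenize.NEWLINE, tokenize.COMMENT,
                tokenize.ENCODING, tokenize.ENDMARKER}
        if not count_indentation:
            skip |= {tokenize.INDENT, tokenize.DEDENT}
        with open(path, 'rb') as f:
            return sum(1 for tok in tokenize.tokenize(f.readline)
                       if tok.type not in skip)

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            print('%s: %d tokens' % (path, count_tokens(path)))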

Anyway, I wanted to kick off the debate.  Thoughts?  Comments?

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian