[Shootout-list] Missing programs and score calculation (Was: Re:TinyCC and Io)

Tue Sep 20 19:37:31 UTC 2005

On Tue, 2005-09-20 at 13:51 -0400, grfgguvf at gmail.com wrote:

> Why is comparing userspace threading with kernel threading pointless?
> Because: the whole point of threading is to harness the power of
> hardware capable of parallel execution (that'd be SMP these days).

That isn't entirely the story. There are many reasons to use

(a) Processes
(b) Threads
(c) Cooperative threading
(d) Asynchronous I/O
(e) Concurrency
(f) Parallelism

.. all of which are quite different.

There is an argument that synchronous coroutines
are a primitive control structure all languages should
provide. The fact that C does not is a strong condemnation
of that language: one can say it is intrinsically flawed
because it lacks one of the four basic control
structures: sequential execution, control transfer,
subroutine calling, and control exchange -- the latter
is missing from C.

Some advanced functional languages provide call/cc,
and/or continuations, and many languages provide
exceptions, which are related to control exchange,
and none of which are possible in C.

However POSIX can certainly do asynchronous control
exchange, because threads provide precisely that,
and with locking, you can synchronise the exchange.

There are also some C and Posix features that are related
such as setjmp/longjmp (ISO C), and some context switching
library functions. Neither of these techniques work properly
with C++, on the other hand C++ provides exceptions.

Only the context switching library calls are actually strong
enough to provide general control exchange -- setjump/longjmp
cannot work in general (the jumpbuf has to be upstack of
the longjmp). However they suffer from address space restrictions.

User space implementations of control exchange provided by
*with compiler support* can circumvent all these problems:
the context switch can be extremely fast, and is vastly superior
to using pthreads when all you want is synchronous control
exchange -- and that is almost all the time. Pre-emptive
multi-tasking and concurrent execution are useful primarily
when your OS lacks high performance asynchronous I/O.

So, whilst there are real uses for fully blown concurrency,
they ability to actually utilise such techniques is strictly
limited to handful of research labs that actually have 
super computing clusters available.

In fact the average user can utilise high level concurrency,
two examples being BitTorrent and Skype  -- but this isn't
quite the same order of concurrency.

The problem for the Shootout, of course, is that to test
anything which measures *real* concurrency requires
appropriate hardware, and in addition suitable measurement
software.

So we're basically left with a situation only synchronous
control inversion can be benchmarked, and languages like
C that fail to provid it should fail dismally, by being
forced to use posix threads as an implementation technique.

In fact, to be fair a range of benchmarks are required,
since setjmp/longjmp can be used in some contexts and not
others -- so there should be a benchmark whose specification
allows such an implementation, and one which does not.

Similarly, exceptions can provide some control inversion,
so again, there should be benchmarks that permit such
a solution, and ones that do not.

Ultimately there should be a benchmark that only languages
like MLton, Felix, Scheme, Lua -- which provide builtin
user space threading, continuations, coroutines,
or other equivalent technique supporting control exchanges,
can possibly pass.

This isn't so easy: for example to kill off the idea
posix stack switching library call is as good as
real microthreading, we need a benchmark which will
break the processors address space using that technique,
but run fine with compiler supported control exchanges --
someone reported being able to run 20K threads with the
context switch function call .. so the test is going
to need to demand 100K threads or more to break it.

Felix can support 1Meg thread easily on my amd64,
and the benchmark I use only takes 2 seconds to run,
so it can be done.

I would in fact love to compare MLton with Felix,
since i expect it is the only system faster than Felix
in this regard, and technical details suggest the
MLton implementation *should* be faster. In fact,
with a very high load, fragmentation should kill
the Felix implementation but not the MLton one..
so there had better be a test that proves that too.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net