[Shootout-list] Numerical medium-sized benchmarks

Sebastien Loisel <sloisel@gmail.com>
Fri, 25 Mar 2005 08:08:57 -0800


> Hmmm, is there a MathWorld explanation for the mathematically

Consider the matrix A =
2 0
0 3

It maps vectors (x,y) to (2x,3y). If you map the unit circle through
that matrix A, you will obtain an axis-aligned ellipse. The ellipse
will be taller than it is wide, with ratio 3/2. The height of the
ellipse will be 6 (and the extremal points are (0,3) and (0,-3).) As
such, this matrix tends to "grow" unit things (like the unit circle)
by a factor of 2 in the x direction and 3 in the y direction. The
larger of the two values, 3, is the norm of the matrix A. It measures
how "large" A is.

If you pick a more complicated matrix like B =
1 2
3 5

the unit circle gets sent to a skewed ellipse, not axis-aligned.
However, the idea is still the same. You find the point p on the
ellipse furthest away from the origin. The distance from the origin to
p is again the norm of B. In this case, the matrix norm is about
6.2429...
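
If you want to check these numbers, a few lines of python will do it.
This isn't part of the proposed benchmark; it just uses the closed
formula for the norm of a 2x2 matrix (via the fact, explained below,
that the norm is the square root of the largest eigenvalue of the
matrix times its transpose):

    import math

    def norm_2x2(m):
        # Largest singular value of a 2x2 matrix m, i.e. the distance
        # from the origin to the farthest point of the image ellipse.
        (a, b), (c, d) = m
        p = a * a + c * c          # entries of the symmetric matrix
        q = a * b + c * d          # (m transposed) times m
        r = b * b + d * d
        lam = 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q)
        return math.sqrt(lam)      # norm = sqrt of its largest eigenvalue

    print(norm_2x2([[2.0, 0.0], [0.0, 3.0]]))   # 3.0
    print(norm_2x2([[1.0, 2.0], [3.0, 5.0]]))   # about 6.2429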

For a 3x3 matrix, you would replace the "unit circle" with the "unit
sphere" and you would get an oblong ellipsoid instead of an ellipse.
For nxn matrices, you use the "unit n-sphere" and obtain an oblong
n-ellipsoid. It turns out that you can do this for matrices of
infinite size as well. The original SIAM problem required computing
the matrix norm of a matrix whose entries are given by a certain
formula. It's hard to manipulate infinite matrices on the computer,
so I truncated the matrix to 20000x20000. This matrix was too large to
fit in memory. So the goal is to estimate the norm of the matrix
without actually computing all its entries.

There are two theorems that help.

First, if A is a symmetric matrix, its matrix norm is precisely its
largest eigenvalue (in absolute value). If a is a scalar and v is a
nonzero vector such that Av = av, then we say that a is an eigenvalue
and v an eigenvector of A. If A is not symmetric, then B = A*A is
symmetric (where A* is the matrix transpose of A) and the matrix norm
of A is precisely the square root of the matrix norm of B.

Second, if B is symmetric and if the largest eigenvalue is a simple
eigenvalue (usually the case) then for any random vector v[0], the
sequence v[k+1] = Bv[k] will "converge" to the eigenvector of B with the
largest eigenvalue. In fact, v[k] may not converge, but
v[k]/norm(v[k]) will converge, and that will be the eigenvector of B
with the largest eigenvalue. This is called the "power method."

So the algorithm I gave computes the largest eigenvalue of B = A*A using
the power method. The matrix A is infinite, but I approximate it by
truncating it.

To compute v[k+1] = Bv[k] without creating B in memory, you write a
function multiply_B_by_a_vector(u) which computes Bu directly without
creating a huge 20000x20000 array for B. This is the method I used.
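
In python, that matrix-free approach looks roughly like the sketch
below. The entry formula a(i,j) is only a placeholder here, not the
formula from the SIAM problem, and n is kept small so the pure-python
loops finish quickly:

    import math

    n = 200                        # stand-in size; the real run used 20000

    def a(i, j):
        # placeholder entry formula; the SIAM problem has its own
        return 1.0 / (i + j - 1.0)

    def multiply_A(u):
        # w = A u, computing the entries of A on the fly
        return [sum(a(i, j) * u[j - 1] for j in range(1, n + 1))
                for i in range(1, n + 1)]

    def multiply_At(u):
        # w = A* u (A transposed times u), same idea, indices swapped
        return [sum(a(j, i) * u[j - 1] for j in range(1, n + 1))
                for i in range(1, n + 1)]

    def multiply_B_by_a_vector(u):
        # B u = A*(A u); the n-by-n matrix B is never stored
        return multiply_At(multiply_A(u))

    def norm(u):
        return math.sqrt(sum(x * x for x in u))

    # Power method on B; the norm of A is the square root of B's
    # largest eigenvalue.
    v = [1.0] * n
    for _ in range(30):
        w = multiply_B_by_a_vector(v)
        v = [x / norm(w) for x in w]
    print(math.sqrt(norm(multiply_B_by_a_vector(v))))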


> -snip-
> > Please also consider the numerical integration code at
> > http://www.math.mcgill.ca/loisel/numerical-integration.html
>
> Where can we see a "math for dummies" explanation of the problem?

This algorithm is slightly more complex as it is about calculus. The
problem is to compute the "area under the curve". For instance, the
area under the curve f(x) = x, with x from 0 to 1, is the area of the
right-angle triangle with base 1 and height 1, so it's 1/2. The area
under the curve f(x) = x^2 is harder to calculate, because we have
replaced the straight hypotenuse of the triangle with a curved line.
Fortunately, we still have a way of computing it, using integration,
and the area under that curve from x = 0 to x = 1 is 1/3.

The program on that web page finds the "area under the curve" of
sin(x log x) from x = 1 to x = 10000. I can outline the method used.

If you try to integrate f(x) = x^2 from x = 0 to x = 1, it makes sense
to approximate this "curved triangle" with a "straight triangle".
Presuming we didn't know the integral was 1/3, we would approximate it
by taking a single triangle (0,0)-(1,0)-(1,1), of area 1/2. This isn't
a terribly good approximation, but it's a start.

We can instead use many triangles. We divide the x axis into [0,0.5]
and [0.5,1]. At 0.5, f(0.5) = 0.25. On the interval [0,0.5] we use a
right angle triangle like before, and on the interval [0.5,1] we use a
trapezoid. The bases are both 0.5. The triangle has h = 0.25. The
trapezoid has two heights in its formula, with h = 0.25 and H = 1. The
sum of the two areas is 0.375, with the true area being 0.333... So
we're doing better.
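
As a quick check (my addition, not the code from the web page), the
composite trapezoid rule in python reproduces those numbers:

    def trapezoid(f, a, b, n):
        # composite trapezoid rule with n equal pieces
        h = (b - a) / float(n)
        total = 0.5 * (f(a) + f(b))
        for k in range(1, n):
            total += f(a + k * h)
        return h * total

    f = lambda x: x * x
    print(trapezoid(f, 0.0, 1.0, 1))     # 0.5, the one-triangle estimate
    print(trapezoid(f, 0.0, 1.0, 2))     # 0.375, the two-piece estimate
    print(trapezoid(f, 0.0, 1.0, 1000))  # about 0.33333; the true area is 1/3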

Instead of using trapezoids, we could use different shapes and obtain
different estimates of the area. I used two different methods (not
trapezoids) and I got two different estimates of the area under the
curve. Then I took the difference between those two estimates, and I
declared that this was an estimate of the error. Then I subdivided the
interval of integration until this error estimate was small enough. As
I subdivide intervals, it may happen that the integration method is
doing better on one part of the interval and no further subdivision is
needed there. The subdivision is therefore adaptive. The algorithm
keeps a bag of intervals; from this bag it picks the interval that has
the most error. It removes it from the bag and subdivides it into two
pieces, reinserting them into the bag. It does this until the sum of
all the errors in the bag is less than 0.00001, at which point it
outputs the answer.
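
The skeleton of that adaptive scheme, in python, looks roughly like
this. It is only a sketch: the trapezoid and midpoint rules stand in
for the two methods actually used on the web page:

    import heapq
    import math

    def adaptive_integrate(f, a, b, tol=1e-5):
        # Two different estimates per interval; their difference is
        # the error estimate for that interval.
        def estimate(lo, hi):
            trap = 0.5 * (hi - lo) * (f(lo) + f(hi))
            midp = (hi - lo) * f(0.5 * (lo + hi))
            return midp, abs(trap - midp)

        val, err = estimate(a, b)
        bag = [(-err, a, b, val, err)]   # max-heap on the error estimate
        total, total_err = val, err
        while total_err > tol:
            _, lo, hi, val, err = heapq.heappop(bag)   # worst interval
            total -= val
            total_err -= err
            m = 0.5 * (lo + hi)
            for left, right in ((lo, m), (m, hi)):     # split it in two
                v, e = estimate(left, right)
                heapq.heappush(bag, (-e, left, right, v, e))
                total += v
                total_err += e
        return total

    print(adaptive_integrate(lambda x: x * x, 0.0, 1.0))   # about 1/3
    # The benchmark integrand, on [1,10] here because the full
    # [1,10000] takes a long time with these crude interval rules:
    print(adaptive_integrate(lambda x: math.sin(x * math.log(x)), 1.0, 10.0))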

> We're trying to keep to a smallish number of benchmarks - so for every
> problem we add to the Shootout, we look for a problem to remove. Which
> problems do you think could be replaced by the benchmarks you've
> suggested?
>
> Alternatively, what gap in Shootout benchmarks do these new problems
> fill?

Well, before we include it, let me float a few benchmarks, so we don't
end up with too many benchmarks to keep track of. This benchmark is
interesting because it's floating point and does more computing than
memory I/O. I haven't looked too closely at all the other benchmarks,
but one of the floating point benchmarks that looks a little fishy to
me is the statistical moments benchmark. Statistical moments are
useful, but this calculation is probably memory bandwidth limited if
not disk I/O limited for all compiled or even moderately efficient
interpreted languages, which is probably why python is only 10x slower
than C. At some point in time the Shootout had plenty of I/O
benchmarks; this one looks to me like another one of them.

A similar benchmark that I like a lot is the n-body problem. The
n-body problem fits in the cache so it measures pure floating point
performance without the real-world problem of having to go back and
forth to memory.

The matrix benchmark I proposed, as written now, probably fits in the
cache, but one could tweak n so that it doesn't. At n = 20000 it won't
fit in cache on any architecture. However, you'll never get python to
run it at n = 20000.

The numerical integrator has a blend of memory I/O and floating point.
According to some benchmarks, in some languages the problem is the
heap data structure, and in other cases the floating point
calculations are a bottleneck.

Another similar benchmark is the matrix-matrix multiplication.
However, as I understand, you're multiplying integer matrices, which
is almost irrelevant. Also, you run a very high risk of seeing
numerical python and matlab running 2-10x faster than gcc, because
they will be built on top of LAPACK and your gcc code won't. There are
LAPACK benchmarks already available; we don't need more of them -- it
isn't a very interesting test of the performance of a language when
the one line of python code just calls LAPACK. The matrix norm test I
provided would force you to do at least some work in the language.
You'd still probably want to use LAPACK in python because python
performance sucks donkeys, but you wouldn't run faster than gcc (and
you'd probably be 10x slower or worse.)

The latest sample I sent is much more complicated. Since I haven't put
in any vectors yet, it probably runs entirely in the cache. However, I
wouldn't be surprised if it defeated many "unboxing" heuristics when
it is implemented in garbage collected languages. This kind of code
has been important to me historically when I was comparing C, Fortran
and C++. I would write some algorithm in a fancy template way in C++,
with the understanding that this really ought to run as fast as C. But
sometimes I would find a performance deficit of 3x. This happened to
me when I did something similar to POOMA http://acts.nersc.gov/pooma/

Cheers,

Sébastien Loisel