[Shootout-list] New matrix-norm not matrixy?

Sebastien Loisel Sebastien Loisel <sloisel@gmail.com>
Sun, 27 Mar 2005 12:57:08 -0800


>     I've taken a quick look at the matrix-norm test and it seems like in
> the process of converting from the mathematical description (item #3 at
> http://mathworld.wolfram.com/Hundred-DollarHundred-DigitChallengeProblems.html
> ) to an imperative algorithm, the test has lost some of its matrixyness.
> Note that the original 2-D matrix has been replaced by the "eval_A"

Calculating the matrix norm of the infinite dimensional operator by
instantiating the actual matrix is not practical when you want to get
10 digits of precision, which is how this algorithm came to be. You
can, in fact, calculate the norm of the infinite dimensional matrix
from the norm of smaller restrictions of it by using a sort of Aitken
acceleration, but that is an entirely different algorithm. That is how
our team confirmed that we had ten digits (by using two completely
different methods of calculation.)

As you point out, there is no two-dimensional array in the program.

> function.  Also, the indexing on the the 1-D vectors (u & v, which are
> use to accumulate the intermediate values) is very simple.  It marches
> right on through, from one element to the very next.  We might want to

and everything fits in cache, so even if you did access randomly it
wouldn't change much.

> consider a matrix test which has higher dimensions (2D, 3D, ...) and a

which is a different problem -- you can't use the "power method" to
compute the "operator norm" of a 3rd order tensor. These words aren't
even well defined for third order tensors.

> more complex access pattern.  For instance, maybe having the next element

The first step in optimizing numerical code is to take a code with a
"complex access pattern" and making it less complex, which is the
BLAS/LAPACK success story.

> of matrix depend on its nearest neighbors in both columns and rows (e.g.
> B[i,j] = A[i-1,j] + A[i,j-1] - A[i+1,j+1] ...).  As an example, see the
> fluid dynamics simulation shown on...

One of my next benchmarks will be either a very high dimensional ODE
solver (which will involve at least a dense LU or gaussian
elimination) or a PDE solver (which will probably have a sparse matrix
solver.)

If you write a test that uses A[i-1][j] and A[i][j+1] and so on to
thrash the cache, you'll be testing memory latency. Since this is an
extremely well studied problem, we know what the solutions are. You
have to rewrite the matrix A in block forms (such that A[i][j] is not
a double, but rather a small matrix) so that all operations are now
performed on blocks, and the three blocks you will need at any one
time will fit in the cache of the CPU. A compiler stands absolutely
zero chance of figuring this out since this isn't actually
semantically the same as the straightforward algorithm once you take
into account the details of floating point calculations and since the
layout in memory is completely different.

Furthermore, if you do write such a test-case in C, the guy who writes
the numerical python one will beat you. Does that mean that numerical
python is faster than C? No, numerical python is built on top of
LAPACK, a fortran package. If you also write your C program to use
LAPACK, you'll perform the same as numerical python. But we don't care
about all this -- there are many LAPACK benchmarks and we already know
that LAPACK is fast.

One of the advantages of the proposed matrix norm calculation is that
it is resistant to LAPACK acceleration. So you'll have to get your
hands dirty if you want to get some performance.

I am currently working on several more interesting benchmarks. I've
circulated them on the mailing list
(http://lists.alioth.debian.org/pipermail/shootout-list/2005-March/001272.html,
http://www.math.mcgill.ca/loisel/numerical-integration.html, and
http://lists.alioth.debian.org/pipermail/shootout-list/2005-March/001275.html.)
The third one is very complex and uses several nontrivial algorithms
and looks scary to translate in many languages, however it will still
run entirely in the cache. Once I have written that one in several
languages, I will propose it and will begin work on another benchmark
with actual matrices in it.

Cheers,

Sebastien Loisel