[Shootout-list] Directions of various benchmarks

skaller skaller@users.sourceforge.net
19 May 2005 00:16:48 +1000


On Wed, 2005-05-18 at 23:24, José Bollo wrote:

> So the RegExp bench has to stay.

I would say that we need several graded tests of
string processing capabilities -- one test isn't enough.

The particular tests we have include the phone number finder;
the word count tests also need to lex text to find and
count words.

> It is good for a language to offer it.
> But only a small subset of regexps is used in the bench.
> I like the use of (?<...), (?!...), (?:...), \2, ...

I do not even know what these are. I have only needed to use
'regexps' half a dozen times in my whole life, excluding
extremely heavy use of lexer generators. Most of the things
you're probably talking about are NOT part of any sane
concept of a regular expression: they're hacks and kludges
added on by people without any proper mathematics, and they
have prevented the adoption of better solutions.
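
For concreteness, here is a sketch of what those quoted
constructs actually do, using Python's re module purely as a
stand-in for any PCRE-style engine (my choice of language,
nothing to do with the bench):

import re

# \2 is a back reference: it re-matches the exact text that the
# second capture group matched -- not a regular operation
m = re.search(r'(\w+)-(\w+) \2', 'foo-bar bar')
print(m.group())                                  # 'foo-bar bar'

# (?!...) is a negative lookahead: 'foo' not followed by 'bar'
print(re.findall(r'foo(?!bar)', 'foobar foobaz')) # ['foo']

# (?:...) is a non-capturing group; (?<=...) is a lookbehind
print(re.findall(r'(?:x|y)(?<=y)\d', 'x1 y2'))    # ['y2']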

Generally the correct technique for handling these things is to
use a lexer to get a token stream, then parse the token stream
and emit the result -- this is fully understood mathematically,
but in most languages lexing and parsing require laborious
interfacing with tools like lex and yacc.
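
A minimal sketch of that pipeline -- in Python rather than
Felix, purely for illustration -- turning characters into a
token stream that a parser can then consume:

import re

# one named pattern per token class, tried in order
TOKENS = re.compile(r'(?P<NUMBER>\d+)|(?P<IDENT>[A-Za-z]\w*)|(?P<OP>[^\s\w])')

def lex(s):
    # yield (kind, text) pairs; whitespace is skipped by finditer
    for m in TOKENS.finditer(s):
        yield m.lastgroup, m.group()

print(list(lex('x1 = 42 + y')))
# [('IDENT', 'x1'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]

The parser then works on the tokens, never on raw characters.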

I would argue they still represent the proper solution if
only they were properly integrated .. and in Felix they are
integrated, although there are still rough edges.

There is, however, plenty of argument for various additional
tests such as high-speed searching of HUGE documents for
keywords, as well as complicated transductions of strings
using weird stuff like the above, and probably going all
the way up to building a parser.

Just so you can see, Felix has this expression form:

regexp letter = ['A'-'Z'];
regexp digit = ['0'-'9'];
regexp alnum = letter | digit;

regmatch s with
| digit+ => "number"
| letter alnum * => "identifier"
endmatch

which is used for matching; a variant of it matches the head
of a string, and that is used for lexing.
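
An assumed rough equivalent in Python, for comparison:
re.fullmatch plays the role of regmatch, and re.match, which
only anchors at the head of the string, plays the role of the
lexing variant:

import re

def classify(s):
    if re.fullmatch(r'[0-9]+', s):
        return "number"
    if re.fullmatch(r'[A-Z][A-Z0-9]*', s):
        return "identifier"
    raise ValueError("no match")

print(classify("123"))   # number
print(classify("A42"))   # identifier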

This is the end .. no captures, no back references,
no substitution, nothing else. What is here is the
complete repertoire of regular expressions -- if you
go any further you're out of the domain of regular
languages into the domain of context-free languages,
and then to use a general tool you have to have
a GLR or Earley parser engine (Felix has GLR),
an RTN, or some other kind of context-free language
parsing technique.

There ARE well understood extensions to regular sets:
for example a finite state machine 'plus one register'
can do a bit more, including counting brackets of exactly
one kind in a string to see whether they match.
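
A sketch of such a machine -- states plus a single counter
register -- checking one kind of bracket, something no pure
regular expression can do:

def parens_balanced(s):
    depth = 0                   # the single register
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:       # closed more than were opened
                return False
    return depth == 0           # accept only if all are closed

print(parens_balanced('(a(b)c)'))  # True
print(parens_balanced('(()'))      # False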

But almost all these extensions are little hacks and
don't have enough expressive power to bother with,
and the ones that do turn out not to have deterministic
behaviour: captures are an example of this.
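
The nondeterminism is easy to demonstrate (again in Python's
re, which follows the PCRE convention): both groups below could
legally split the input many ways, and the engine picks one
by fiat:

import re

m = re.match(r'(a*)(a*)', 'aaaa')
print(m.group(1), repr(m.group(2)))   # aaaa ''  -- one choice of many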

I know of only one important exception: there are now
powerful tree automata that can parse tree structures without
the full power of a general context-free grammar -- see
the CDuce/XDuce projects of Alain Frisch: these are
important because XML fits this model.

The bottom line is, I can always solve a weird regexp
like capture/back reference kind of problem in Felix
with a combination of regmatch/reglex and ordinary
Felix code to do any context checks and captures..
but this definitely isn't "the same way" as a PCRE
solution.
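
As an assumed example of what I mean: the classic back-reference
problem r'\b(\w+) \1\b' (find a doubled word) falls apart into a
lexing pass plus ordinary code -- sketched here in Python, but
the shape is the same with reglex plus ordinary Felix code:

import re

def doubled_word(text):
    words = re.findall(r'\w+', text)     # the lexing pass
    for a, b in zip(words, words[1:]):   # ordinary code does the check
        if a == b:
            return a
    return None

print(doubled_word('it was the the best'))   # the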

So .. please make sure there is a natural language
specification of the problem I can understand -- because
a complex PCRE expression does NOT necessarily
have well-defined semantics.

--
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850,
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net