[Po4a-devel] Sgml bug in the tracker

Martin Quinson martin.quinson@loria.fr
Tue, 10 May 2005 20:31:25 +0200



Sorry for the delay, I was playing with shadow ;)

On Mon, May 09, 2005 at 01:59:09AM +0200, Nicolas François wrote:
> Hello,
>
> There is a bug reported on the Alioth tracker against the Sgml module.
>
> I did not notice it before.
> Was there a notification on po4a-devel@lists.alioth?

There should have been. The bug tracker is configured to send everything to
po4a-devel@lists.alioth.debian.org
See
https://alioth.debian.org/tracker/admin/index.php?group_id=30267&atid=410622&update_type=1

> Otherwise, is there a way to get some notifications from the tracker?
>
>
> Then regarding the bug report:
>  * I've already uploaded a simple fix for a typo reported in the bug
>    report.
>  * the SGML book uses contrib and epigraph tags. Are those tags
>    standard? Can I add them to the translate category?

I don't know; please do so. If it helps for this document, that's good.
There's almost no chance that it breaks anything.

>  * for the main part of the bug report, I propose to escape '<', '>' and
>    '&' to {PO4A-lt}, {PO4A-gt} and {PO4A-amp} before feeding nsgmls. And
>    changing them back to the original in the cdata type.

Great, that's what we have to do.
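
For illustration, a minimal sketch of that round trip. It assumes the
escaping is applied only to the text found inside CDATA sections;
escape_cdata and unescape_cdata are hypothetical names, not functions
from sgml.pm.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not taken from sgml.pm.
my %escape   = ('<' => '{PO4A-lt}', '>' => '{PO4A-gt}', '&' => '{PO4A-amp}');
my %unescape = reverse %escape;

# Replace the three markup characters by their placeholders,
# so that nsgmls never sees them as markup.
sub escape_cdata {
    my ($text) = @_;
    $text =~ s/([<>&])/$escape{$1}/g;
    return $text;
}

# Turn the placeholders back into the original characters.
sub unescape_cdata {
    my ($text) = @_;
    $text =~ s/(\{PO4A-(?:lt|gt|amp)\})/$unescape{$1}/g;
    return $text;
}

my $cdata   = '<a href="x.php?d_op=viewdownload&cid=79">Web Installer</a>';
my $escaped = escape_cdata($cdata);
print "$escaped\n";
print unescape_cdata($escaped) eq $cdata
    ? "round trip ok\n"
    : "round trip FAILED\n";
```

Note that the round trip is only faithful as long as the original text
does not already contain one of the placeholder strings.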

> I also had some other issues with this PHP book:
>  * around line 795, PO4A-beg/end are changed back to their SGML
>    counterparts only if they appear at the beginning of a line.
>    Why only at the beginning?

I can't remember. It's been a *long* time since I last dug into sgml.pm,
and I have bad memories of it. The code is a bit obscure, and there is a
bunch of stuff we should move to TransTractor (file inclusion) or do
another way (I dream of killing nsgmls).

>    This causes some PO4A-beg/end to be kept in the output document.

If so, this is a bug ;)

>  * also, the content of the cdata is pushed, but the buffer is not
>    flushed, so it can be pushed too early.
>    In my patch, I appended the content of the cdata to $buffer.
>    Should the content of cdata be verbatim? Shouldn't it be translated?

I think it should be verbatim. I'm not sure anymore about translation.

>  * also, I don't really understand what is done with the leading spaces
>    and the added trailing '\n', but this is probably not an issue.

What I absolutely want to avoid here is getting the whole document on a
single line, since that kills any hope of using addenda. So I try to get
one structuring tag per line, and to add some spaces around it to make
this look better. But this code may well be buggy too...

>  * around line 535, & is changed to {PO4A-amp} if it is not the beginning
>    of an entity.
>    This uses:
>      while ($origfile =~ /^(.*?)&([^;\s]*);(.*)$/s) {
>        ...
>      }
>    This regex is too permissive. It causes the following line:
>      ]]><![CDATA[&d_op=viewdownload&cid=79\">Web Installer...
>    to be changed into:
>      ]]><![CDATA[_op=viewdownload=79\">Web Installer...
>
>    I found the following grammar (for XML):
>      http://www.w3.org/TR/REC-xml/#NT-Name
>    It's probably too complicated (the Letter or Digit rules use a lot of
>    Unicode chars). So I propose to only allow ASCII chars (with a non
>    greedy match):
>      while ($origfile =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s) {
>        ...
>      }

Oops. :-/

By the way, you can make it greedy: ";" is not in the character class, so
it won't make any difference, will it?
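
A quick way to check both points, as a sketch: the sample strings and
the first_entity helper are made up for illustration, and the three
patterns are the ones discussed above (the original one, the ASCII-only
one, and its greedy variant).

```perl
use strict;
use warnings;

# The original pattern, the proposed ASCII-only one, and its greedy variant.
my $permissive = qr/^(.*?)&([^;\s]*);(.*)$/s;
my $lazy       = qr/^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s;
my $greedy     = qr/^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*);(.*)$/s;

# Hypothetical helper: name of the first "entity" the pattern finds,
# or undef when nothing matches.
sub first_entity {
    my ($re, $text) = @_;
    return $text =~ $re ? $2 : undef;
}

my $text = 'a&b=1;c &amp; d';
print "permissive: ", first_entity($permissive, $text), "\n";   # b=1
print "lazy:       ", first_entity($lazy,       $text), "\n";   # amp
print "greedy:     ", first_entity($greedy,     $text), "\n";   # amp

# The lazy and greedy variants should agree on every sample, since the
# name must be followed directly by ";" and ";" is not in the class.
for my $sample ('&lt;', 'x &foo.bar; y &baz;', 'no entity here',
                '&d_op=viewdownload&cid=79') {
    my $l = first_entity($lazy,   $sample) // '(none)';
    my $g = first_entity($greedy, $sample) // '(none)';
    print "$sample => lazy=$l greedy=$g\n";
}
```

The permissive pattern accepts "b=1" as an entity name, while the two
ASCII-only variants skip it and find "amp".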

>  * my last point: can anybody have a look at the sgmldiff between
>    EN-Book.sgml and po4a-normalize.output?
>
> I'm highly incompetent regarding SGML and I based my analysis on po4a and
> sgmldiff outputs. So please stop me if any of the above statements is
> wrong.

I'm rather short on time, but I'll try to do so. The statements look good.

> Attached is the patch I plan to commit this week.

No need to wait that long ;)


Thanks again for your time,
Mt.
