[Po4a-devel]Sgml bug in the tracker
Nicolas François
nicolas.francois@centraliens.net
Mon, 9 May 2005 01:59:09 +0200
Hello,
There is a bug reported on the Alioth tracker against the Sgml module.
I did not notice it before.
Was there a notification on po4a-devel@lists.alioth?
Otherwise, is there a way to get some notifications from the tracker?
Then regarding the bug report:
* I've already uploaded a simple fix for a typo reported in the bug
report.
* the SGML book uses the contrib and epigraph tags. Are those tags
standard? Can I add them to the translate category?
* for the main part of the bug report, I propose escaping '<', '>' and
'&' as {PO4A-lt}, {PO4A-gt} and {PO4A-amp} before feeding the file to
nsgmls, and changing them back to the originals in the cdata type.
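To make the escaping proposal concrete, here is a minimal sketch of the round trip. The helper names escape_markup and unescape_markup are hypothetical and are not part of Sgml.pm; the sketch also assumes the input never already contains a literal {PO4A-lt}, {PO4A-gt} or {PO4A-amp}.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the proposal; these names are not po4a's.
my %PLACEHOLDER = ('<' => '{PO4A-lt}', '>' => '{PO4A-gt}', '&' => '{PO4A-amp}');
my %ORIGINAL    = reverse %PLACEHOLDER;

# Replace the three markup characters with their placeholders.
sub escape_markup {
    my ($text) = @_;
    $text =~ s/([<>&])/$PLACEHOLDER{$1}/g;
    return $text;
}

# Restore the original characters from the placeholders.
sub unescape_markup {
    my ($text) = @_;
    $text =~ s/(\{PO4A-(?:lt|gt|amp)\})/$ORIGINAL{$1}/g;
    return $text;
}

my $sample  = 'if (a < b && b > c)';
my $escaped = escape_markup($sample);
print "$escaped\n";
print unescape_markup($escaped), "\n";
```

A single substitution per direction keeps the two tables in sync; the attached patch does the same thing with three separate s///g calls each way.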
I also had some other issues with this PHP book:
* around line 795, PO4A-beg/end are changed back to their SGML
counterparts only if they appear at the beginning of a line.
Why only at the beginning?
This causes some PO4A-beg/end to be kept in the output document.
* also, the content of the cdata is pushed, but the buffer is not
flushed, so it can be pushed too early.
In my patch, I appended the content of the cdata to $buffer.
Should the content of cdata be verbatim? Shouldn't it be translated?
* also, I don't really understand what is done with the leading spaces
and the added trailing '\n', but this is probably not an issue.
* around line 535, & is changed to {PO4A-amp} if it is not the beginning
of an entity.
This uses:
while ($origfile =~ /^(.*?)&([^;\s]*);(.*)$/s) {
...
}
This regex is too permissive. It causes the following line:
]]><![CDATA[&d_op=viewdownload&cid=79\">Web Installer...
to be changed into:
]]><![CDATA[_op=viewdownload=79\">Web Installer...
I found the following grammar (for XML):
http://www.w3.org/TR/REC-xml/#NT-Name
It's probably too complicated (the Letter and Digit rules cover a lot of
Unicode chars). So I propose to allow only ASCII chars (with a
non-greedy match):
while ($origfile =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s) {
...
}
* my last point: can anybody have a look at the sgmldiff between
EN-Book.sgml and po4a-normalize.output?
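For the entity regex point above, the following sketch compares the two patterns on two made-up inputs (they are not taken from the PHP book). The subs permissive_name and strict_name are hypothetical wrappers around the current and the proposed pattern; each returns the captured entity name, or undef when nothing matches.

```perl
use strict;
use warnings;

# Current pattern: anything up to the next ';' that contains no whitespace.
sub permissive_name {
    my ($text) = @_;
    return $text =~ /^(.*?)&([^;\s]*);(.*)$/s ? $2 : undef;
}

# Proposed pattern: an ASCII-only approximation of an XML Name.
sub strict_name {
    my ($text) = @_;
    return $text =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s ? $2 : undef;
}

for my $input ('see &chapter1; here', 'x.php?a=1&b=2;c=3') {
    my $loose  = permissive_name($input);
    my $strict = strict_name($input);
    printf "%-22s permissive=%s strict=%s\n", $input,
        defined $loose  ? "'$loose'"  : 'no match',
        defined $strict ? "'$strict'" : 'no match';
}
```

Both patterns accept a real entity reference such as &chapter1;, but only the permissive one treats the query-string fragment b=2 as an entity name.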
I'm highly incompetent regarding SGML and I based my analysis on po4a and
sgmldiff outputs. So please stop me if any of the above statements is
wrong.
Attached is the patch I plan to commit this week.
TIA,
--
Nekral
Content-Disposition: attachment; filename="Sgml.pm.diff"
Index: lib/Locale/Po4a/Sgml.pm
===================================================================
RCS file: /cvsroot/po4a/po4a/lib/Locale/Po4a/Sgml.pm,v
retrieving revision 1.47
diff -u -r1.47 Sgml.pm
--- lib/Locale/Po4a/Sgml.pm 8 May 2005 22:12:47 -0000 1.47
+++ lib/Locale/Po4a/Sgml.pm 8 May 2005 23:55:44 -0000
@@ -449,7 +449,19 @@
# protect the conditional inclusions in the file
$origfile =~ s/<!\[(\s*[^\[]+)\[/{PO4A-beg-$1}/g; # cond. incl. starts
$origfile =~ s/\]\]>/{PO4A-end}/g; # cond. incl. end
-
+
+ my $tmp1 = $origfile;
+ $origfile = "";
+ while ($tmp1 =~ m/^(.*?{PO4A-beg-[^}]*})(.+?)({PO4A-end}.*)$/s) {
+ my ($begin, $tmp) = ($1, $2);
+ $tmp1 = $3;
+ $tmp =~ s/</{PO4A-lt}/gs;
+ $tmp =~ s/>/{PO4A-gt}/gs;
+ $tmp =~ s/&/{PO4A-amp}/gs;
+ $origfile .= $begin.$tmp;
+ }
+ $origfile .= $tmp1;
+
# Deal with the %entities; in the prolog. God damn it, this code is gross!
# Try hard not to change the number of lines to not fuck up the references
my %prologentincl;
@@ -532,7 +544,7 @@
}
# Change the entities including files in the document
- while ($origfile =~ /^(.*?)&([^;\s]*);(.*)$/s) {
+ while ($origfile =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s) {
if (defined $entincl{$2}) {
my ($begin,$key,$end)=($1,$2,$3);
$end =~ s/^\s*\n//s;
@@ -792,20 +804,22 @@
elsif ($event->type eq 'cdata') {
my $cdata = $event->data;
- if ($cdata =~ /^(({PO4A-(beg|end)[^\}]*})|\s)+$/ &&
+ $cdata =~ s/{PO4A-lt}/</g;
+ $cdata =~ s/{PO4A-gt}/>/g;
+ $cdata =~ s/{PO4A-amp}/&/g;
+ if ($cdata =~ /(({PO4A-(beg|end)[^\}]*})|\s)+$/ &&
$cdata =~ /\S/) {
$cdata =~ s/\s*{PO4A-end}/\]\]>\n/g;
$cdata =~ s/\s*{PO4A-beg-([^\}]+)}/<!\[$1\[\n/g;
- $self->pushline($cdata);
} else {
if (!$verb) {
$cdata =~ s/\\t/ /g;
$cdata =~ s/\s+/ /g;
$cdata =~ s/^\s//s if $lastchar eq ' ';
}
- $lastchar = substr($cdata, -1, 1);
- $buffer .= $cdata;
}
+ $lastchar = substr($cdata, -1, 1);
+ $buffer .= $cdata;
} # end of type eq 'cdata'
elsif ($event->type eq 'sdata') {