[Po4a-devel]Sgml bug in the tracker

Nicolas François nicolas.francois@centraliens.net
Mon, 9 May 2005 01:59:09 +0200


--8P1HSweYDcXXzwPJ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hello,

There is a bug reported on the Alioth tracker against the Sgml module.

I did not notice it before.
Was there a notification on po4a-devel@lists.alioth?
Otherwise, is there a way to get some notifications from the tracker?


Then regarding the bug report:
 * I've already uploaded a simple fix for a typo reported in the bug
   report.
 * the SGML book uses a contrib and an epigraph tag. Are those tags
   standard? Can I add them to the translate category?
 * for the main part of the bug report, I propose to escape '<', '>' and
   '&' to {PO4A-lt}, {PO4A-gt} and {PO4A-amp} before feeding nsgmls, and
   to change them back to the originals in the cdata type.
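
A minimal sketch of that escape/restore round trip, only to illustrate the
idea. The helper names protect_markup and restore_markup are hypothetical
and do not exist in Sgml.pm; the attached patch inlines the substitutions
instead and only applies them inside conditional inclusions.

```perl
use strict;
use warnings;

# Hypothetical helper: hide the characters nsgmls would interpret as markup.
sub protect_markup {
    my ($text) = @_;
    $text =~ s/&/{PO4A-amp}/g;
    $text =~ s/</{PO4A-lt}/g;
    $text =~ s/>/{PO4A-gt}/g;
    return $text;
}

# Hypothetical helper: put the original characters back (cdata handling).
sub restore_markup {
    my ($text) = @_;
    $text =~ s/\{PO4A-lt\}/</g;
    $text =~ s/\{PO4A-gt\}/>/g;
    $text =~ s/\{PO4A-amp\}/&/g;
    return $text;
}

my $sample    = 'if (a < b && c > d)';
my $protected = protect_markup($sample);
print "$protected\n";
print restore_markup($protected), "\n";
```

The placeholders contain none of '<', '>' or '&', so the order of the
substitutions does not matter and the round trip gives back the input.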

I also had some other issues with this PHP book:
 * around line 795, PO4A-beg/end are changed back to their SGML
   counterparts only if they appear at the beginning of a line.
   Why only at the beginning?
   This causes some PO4A-beg/end to be kept in the output document.
 * also, the content of the cdata is pushed, but the buffer is not
   flushed, so it can be pushed too early.
   In my patch, I appended the content of the cdata to $buffer.
   Should the content of cdata be verbatim? Shouldn't it be translated?
 * also, I don't really understand what is done with the leading spaces
   and the added trailing '\n', but this is probably not an issue.

 * around line 535, & is changed to {PO4A-amp} if it is not the beginning
   of an entity.
   This uses:
     while ($origfile =~ /^(.*?)&([^;\s]*);(.*)$/s) {
       ...
     }
   This regex is too permissive. It causes the following line:
     ]]><![CDATA[&d_op=viewdownload&cid=79\">Web Installer...
   to be changed into:
     ]]><![CDATA[_op=viewdownload=79\">Web Installer...

   I found the following grammar (for XML):
     http://www.w3.org/TR/REC-xml/#NT-Name
   It's probably too complicated (the Letter or Digit rules use a lot of
   Unicode chars). So I propose to only allow ASCII chars (with a non
   greedy match):
     while ($origfile =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s) {
       ...
     }

 * my last point: can anybody have a look at the sgmldiff between
   EN-Book.sgml and po4a-normalize.output?
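
To illustrate the entity-name point above, here is a small sketch that
runs both patterns on the same input. The helper first_entity_name and the
two test strings are hypothetical; the first string is only modeled on the
URL line quoted above, with an &nbsp; added so that a ';' follows.

```perl
use strict;
use warnings;

# The current pattern and the proposed ASCII-only one.
my $permissive = qr/^(.*?)&([^;\s]*);(.*)$/s;
my $strict     = qr/^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s;

# Hypothetical helper: return the first "entity name" the pattern finds,
# or undef when nothing matches.
sub first_entity_name {
    my ($text, $re) = @_;
    return $text =~ $re ? $2 : undef;
}

# Hypothetical inputs: URL parameters followed by a real entity,
# and a plain entity reference.
my $url_line = q{&d_op=viewdownload&cid=79">Web&nbsp;Installer};
my $entity   = q{See &chapter1; for details};

for my $text ($url_line, $entity) {
    for my $re ($permissive, $strict) {
        my $name = first_entity_name($text, $re);
        print defined $name ? "name: $name\n" : "no match\n";
    }
}
```

With the permissive pattern, everything between the first '&' and the next
';' is taken as a name, URL parameters included. The ASCII-only pattern
skips the URL parameters and only picks up names that look like entities.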

I'm highly incompetent regarding SGML and I based my analysis on po4a and
sgmldiff outputs. So please stop me if any of the above statements is
wrong.

Attached is the patch I plan to commit this week.

TIA,
-- 
Nekral

--8P1HSweYDcXXzwPJ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="Sgml.pm.diff"

Index: lib/Locale/Po4a/Sgml.pm
===================================================================
RCS file: /cvsroot/po4a/po4a/lib/Locale/Po4a/Sgml.pm,v
retrieving revision 1.47
diff -u -r1.47 Sgml.pm
--- lib/Locale/Po4a/Sgml.pm	8 May 2005 22:12:47 -0000	1.47
+++ lib/Locale/Po4a/Sgml.pm	8 May 2005 23:55:44 -0000
@@ -449,7 +449,19 @@
     # protect the conditional inclusions in the file
     $origfile =~ s/<!\[(\s*[^\[]+)\[/{PO4A-beg-$1}/g; # cond. incl. starts
     $origfile =~ s/\]\]>/{PO4A-end}/g;                # cond. incl. end
-    
+
+    my $tmp1 = $origfile;
+    $origfile = "";
+    while ($tmp1 =~ m/^(.*?{PO4A-beg-[^}]*})(.+?)({PO4A-end}.*)$/s) {
+        my ($begin, $tmp) = ($1, $2);
+        $tmp1 = $3;
+        $tmp =~ s/</{PO4A-lt}/gs;
+        $tmp =~ s/>/{PO4A-gt}/gs;
+        $tmp =~ s/&/{PO4A-amp}/gs;
+        $origfile .= $begin.$tmp;
+    }
+    $origfile .= $tmp1;
+
     # Deal with the %entities; in the prolog. God damn it, this code is gross!
     # Try hard not to change the number of lines to not fuck up the references
     my %prologentincl;
@@ -532,7 +544,7 @@
     }
 
     #   Change the entities including files in the document
-    while ($origfile =~ /^(.*?)&([^;\s]*);(.*)$/s) {
+    while ($origfile =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s) {
 	if (defined $entincl{$2}) {
 	    my ($begin,$key,$end)=($1,$2,$3);
 	    $end =~ s/^\s*\n//s;
@@ -792,20 +804,22 @@
 	
 	elsif ($event->type eq 'cdata') {
 	    my $cdata = $event->data;
-	    if ($cdata =~ /^(({PO4A-(beg|end)[^\}]*})|\s)+$/ &&
+	    $cdata =~ s/{PO4A-lt}/</g;
+	    $cdata =~ s/{PO4A-gt}/>/g;
+	    $cdata =~ s/{PO4A-amp}/&/g;
+	    if ($cdata =~ /(({PO4A-(beg|end)[^\}]*})|\s)+$/ &&
 		$cdata =~ /\S/) {
 		$cdata =~ s/\s*{PO4A-end}/\]\]>\n/g;
 		$cdata =~ s/\s*{PO4A-beg-([^\}]+)}/<!\[$1\[\n/g;
-		$self->pushline($cdata);
 	    } else {
 		if (!$verb) {
 		    $cdata =~ s/\\t/ /g;
 		    $cdata =~ s/\s+/ /g;
 		    $cdata =~ s/^\s//s if $lastchar eq ' ';
 		}
-		$lastchar = substr($cdata, -1, 1);
-		$buffer .= $cdata;
 	    }
+	    $lastchar = substr($cdata, -1, 1);
+	    $buffer .= $cdata;
 	} # end of type eq 'cdata'
 
 	elsif ($event->type eq 'sdata') {

--8P1HSweYDcXXzwPJ--