[Po4a-devel] CR-LF characters in docbook (and xml in general)

Bob Jolliffe bobjolliffe at gmail.com
Wed Dec 14 22:18:52 UTC 2011


Hello

I have a problem with the way message strings are extracted from
DocBook para elements.  Many authors break the text in these elements
across multiple lines, e.g.

<para>This is a long paragraph which spans
          multiple lines.  I would like it to be presented
          as a po message string to be translated</para>

This is handled perfectly well by po4a-gettextize in most cases.
Unfortunately some authors use editors which write the newline as a
\r\n sequence instead of just \n.

The proper behaviour of an XML parser according to the spec
(http://www.w3.org/TR/REC-xml/#sec-line-ends) is to normalize line
ends before parsing: each \r\n sequence, and any \r not followed by
\n, is translated to a single \n, so the redundant \r characters never
reach the application.  Unfortunately the po4a tools pass them through
to the PO files, which is confusing for the translator when using
tools like Pootle.  The translator doesn't know whether the \r
characters are required, whether they are significant, etc.
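
To make the rule in that section of the spec concrete, here is a
minimal Python sketch.  It is not po4a code, and the function name
normalize_line_ends is made up purely for illustration:

```python
def normalize_line_ends(text: str) -> str:
    """Apply XML 1.0 section 2.11 end-of-line handling to a string."""
    # Order matters: collapse each CR LF pair into one LF first,
    # then turn any remaining lone CR into LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


para = "This is a long paragraph which spans\r\n  multiple lines."
# After normalization no carriage return is left in the text.
print(repr(normalize_line_ends(para)))
```

Doing the two replacements in this order means a \r\n pair yields one
line feed rather than two.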

I am not sure exactly how the XML module works, but would it be a
complicated fix to have the module swallow the \r characters while
reading the XML, as per the spec?

It is not critical, as it is simple enough to pre-process the input
files with tr or something similar, but it would be nice (and make the
workflow easier) if the XML parser did this by default.
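
For completeness, the pre-processing workaround might look roughly
like the sketch below.  The helper name strip_cr_in_place is
hypothetical, and the code assumes the files use an ASCII-compatible
encoding such as UTF-8 (it would not be safe for UTF-16).  It follows
the spec's rule rather than simply deleting every \r:

```python
from pathlib import Path


def strip_cr_in_place(path: str) -> bool:
    """Rewrite the file with LF line ends; return True if it changed."""
    p = Path(path)
    original = p.read_bytes()
    # Work on raw bytes so the declared encoding is left untouched
    # (assumption: an ASCII-compatible encoding such as UTF-8).
    cleaned = original.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if cleaned != original:
        p.write_bytes(cleaned)
        return True
    return False
```

This would be called once for each DocBook source file before running
po4a-gettextize; files that are already clean are left unwritten.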

Regards
Bob


