[Po4a-devel]Encoding options
Denis Barbier
barbier@linuxfr.org
Wed, 4 Aug 2004 08:28:22 +0200
On Tue, Aug 03, 2004 at 03:40:42PM -0700, Martin Quinson wrote:
[...]
> I'm ok with being pedantic here, too. This approach would fit me:
> For the master:
> - if no encoding specified, supposed to be UTF8
If you run "xgettext --from-code=UTF-8", no other charset can be used
for PO files, and translators may dislike being forced to use this
charset without any good reason.
I much prefer assuming ASCII by default, and then UTF-8 if a fallback is
needed.
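A minimal sketch of that fallback order, in Python for illustration only
(po4a itself is written in Perl, and the function name
guess_master_charset is hypothetical): honour a declared charset if there
is one, otherwise try ASCII, then UTF-8, and refuse anything else.

```python
from typing import Optional


def guess_master_charset(data: bytes, declared: Optional[str] = None) -> str:
    """Return the charset to assume for a master document.

    Hypothetical helper: a declared charset wins, otherwise ASCII is tried
    first and UTF-8 is used only as a fallback.
    """
    if declared is not None:
        # Raises UnicodeDecodeError (or LookupError for an unknown name)
        # if the declaration does not match the actual bytes.
        data.decode(declared)
        return declared
    for candidate in ("ascii", "utf-8"):
        try:
            data.decode(candidate)
        except UnicodeDecodeError:
            continue
        return candidate
    # Neither ASCII nor valid UTF-8: ask the user instead of guessing.
    raise ValueError("unknown encoding, please specify the master charset")
```

A pure-ASCII master therefore leaves translators free to pick any charset
for their PO files; UTF-8 is only imposed when the master really needs it.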
> - if it's not valid UTF8, refuse to process until being given what it is
> For translations:
> - if not specified, assume it's the same as the one in the translated part
> of the po file
There is a problem I did not think about before: a few English man pages
contain non-ASCII characters, like euro-test in Debian. PO files then
have to be UTF-8 encoded, and the generated man pages will also be UTF-8
encoded, which is not the expected result, at least in Debian.
The easy solution is to use escape sequences (see groff_char(7))
instead of ISO-8859-1 characters, and hope that a similar solution
is always available. The documentation should then clearly state which
encodings can be used for original documents, depending on their format.
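As a rough illustration of the escape-sequence approach for man pages,
here is a Python sketch. It assumes the generic \[uXXXX] Unicode escape
form that groff accepts; the function name to_groff_escapes is
hypothetical, and groff_char(7) also lists named escapes that may be
preferable for common characters.

```python
def to_groff_escapes(text: str) -> str:
    """Replace every non-ASCII character with a groff \\[uXXXX] escape.

    Sketch only: assumes the target roff implementation understands the
    \\[uXXXX] form (uppercase hex, at least four digits).
    """
    return "".join(
        ch if ord(ch) < 128 else "\\[u%04X]" % ord(ch)
        for ch in text
    )


# The master stays pure ASCII, so no charset is forced on the PO files.
print(to_groff_escapes("Price: 10 \u20ac"))
```

Other document formats would need their own equivalent (for example
character entities), which is why this cannot be assumed to work
everywhere.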
> - could be cool if we could check that the encoding is not broken, but I'm
> not sure whether it's even possible.
Double conversion from ISO-8859-1 to UTF-8 is a common error and seems
pretty hard to diagnose.
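To show why this is only a heuristic, here is a Python sketch (the
function name looks_double_encoded is hypothetical). It reverses the
suspected mistake and checks whether the result is still valid UTF-8;
legitimate text that happens to contain such character pairs is flagged
too, so this can warn but not decide.

```python
def looks_double_encoded(text: str) -> bool:
    """Guess whether already-decoded text went through UTF-8 encoding twice.

    Heuristic only: undo one ISO-8859-1 -> UTF-8 step and see whether the
    bytes still form valid UTF-8 that differs from the input.
    """
    try:
        repaired = text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside ISO-8859-1, or its
        # ISO-8859-1 bytes are not valid UTF-8: no sign of double encoding.
        return False
    return repaired != text


# Simulate the error: UTF-8 bytes misread as ISO-8859-1.
good = "caf\u00e9"
broken = good.encode("utf-8").decode("iso-8859-1")
print(looks_double_encoded(good), looks_double_encoded(broken))
```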
> - during gettextization, assume it's UTF8 if no encoding is provided, whine
> for a proper setting if it's not the case
> For po files:
> - msgid must be in UTF8. No matter what happens.
> - msgstr has to be in the encoding specified in the po file headers.
No, msgids and msgstrs must share the same encoding, which is why UTF-8
is the only sane encoding if msgids contain non-ASCII characters.
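Since a PO file declares a single charset in its header, a simple sanity
check is to decode the whole file, msgids and msgstrs alike, with that
charset. The Python sketch below is an assumption-laden illustration (the
names po_charset and check_po are hypothetical, and the parsing is a plain
regex rather than a real PO parser).

```python
import re


def po_charset(po_bytes: bytes) -> str:
    """Extract the charset declared in the PO header's Content-Type line."""
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", po_bytes)
    if match is None:
        raise ValueError("no charset declared in PO header")
    return match.group(1).decode("ascii")


def check_po(po_bytes: bytes) -> str:
    """Decode the entire file with the declared charset.

    Raises UnicodeDecodeError if any msgid or msgstr does not fit it, and
    LookupError if the name is unknown (e.g. the CHARSET placeholder).
    """
    charset = po_charset(po_bytes)
    po_bytes.decode(charset)
    return charset


sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    "\n"
    'msgid "price in \u20ac"\n'
    'msgstr "prix en \u20ac"\n'
).encode("utf-8")
print(check_po(sample))
```

Note that a permissive single-byte charset such as ISO-8859-1 accepts any
byte sequence, so this check only catches mismatches for stricter
charsets like ASCII or UTF-8.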
Denis