[Po4a-devel] Encoding options

Denis Barbier barbier@linuxfr.org
Wed, 4 Aug 2004 08:28:22 +0200


On Tue, Aug 03, 2004 at 03:40:42PM -0700, Martin Quinson wrote:
[...]
> I'm ok with being pedantic here, too. This approach would fit me:
> For the master:
>  - if no encoding specified, supposed to be UTF8

If you run "xgettext --from-code=UTF-8", no other charset can be used
for PO files, and translators may dislike being forced to use this
charset without any good reason.
I much prefer assuming ASCII by default (then UTF-8 if a fallback is
needed).
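
To make the order I have in mind concrete, here is a minimal sketch in
Python. The function name guess_master_charset is made up for this mail
and is not part of po4a; it only illustrates "ASCII first, UTF-8 as a
fallback, otherwise refuse".

```python
def guess_master_charset(data: bytes) -> str:
    """Hypothetical helper: return 'ascii' or 'utf-8', or refuse."""
    # Try the strictest charset first, then the fallback.
    for charset in ("ascii", "utf-8"):
        try:
            data.decode(charset)
            return charset
        except UnicodeDecodeError:
            continue
    # Neither fits: do not guess, ask the user for the real charset.
    raise ValueError("not valid ASCII or UTF-8; please specify the charset")
```

A pure ASCII master is reported as 'ascii', so translators stay free to
pick their own charset for the PO file.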

>  - if it's not valid UTF8, refuse to process until being given what it is
> For translations:
>  - if not specified, suppose it's the same as the one in translated part
>    of the po file

There is a problem I did not think about before: a few English man pages
contain non-ASCII characters, such as euro-test in Debian.  PO files
then have to be UTF-8 encoded, and the generated man pages will also be
UTF-8 encoded, which is not the expected result, at least in Debian.
The easy solution is to use escape sequences (see groff_char(7))
instead of ISO-8859-1 characters, and hope that a similar solution
is always available.  The documentation should then clearly state which
encodings can be used for original documents, depending on their format.
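
As an illustration of the escape-sequence approach, a rough Python
sketch follows. The table and the function to_groff are assumptions made
for this example; the table lists only three glyph names from
groff_char(7) and would have to be completed for real use.

```python
# Hypothetical, deliberately tiny table of groff glyph escapes.
GROFF_ESCAPES = {
    "\u20ac": r"\[Eu]",  # euro sign
    "\u00e9": r"\['e]",  # e with acute accent
    "\u00fc": r"\[:u]",  # u with diaeresis
}


def to_groff(text: str) -> str:
    """Replace non-ASCII characters by groff escapes, or refuse."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in GROFF_ESCAPES:
            out.append(GROFF_ESCAPES[ch])
        else:
            # No known escape: better to stop than to emit raw bytes.
            raise ValueError(f"no groff escape known for {ch!r}")
    return "".join(out)
```

The result is pure ASCII, so the master document and its msgids no longer
force a particular charset on the PO file.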

>  - could be cool if we could check that the encoding is not broken, but I'm
>    not sure whether it's even possible.

Double conversion from ISO-8859-1 to UTF-8 is a common error and seems
pretty hard to diagnose.
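
To show what I mean, here is a small Python sketch. Both function names
are invented for this mail. The check is only a heuristic: it reports
text that can be "undone" one level, and legitimate text can match too,
which is exactly why the diagnosis is unreliable.

```python
def double_encode(text: str) -> bytes:
    """Reproduce the error: UTF-8 bytes misread as ISO-8859-1, re-encoded."""
    return text.encode("utf-8").decode("iso-8859-1").encode("utf-8")


def looks_double_encoded(data: bytes) -> bool:
    """Heuristic: can one ISO-8859-1 to UTF-8 conversion be undone?"""
    try:
        text = data.decode("utf-8")
        repaired = text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    # Pure ASCII survives the round trip unchanged and is not flagged.
    return repaired != text
```

A string that really is meant to contain the two characters U+00C3 and
U+00A9 is flagged as well, so the result can only be a warning.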

>  - during gettextization, assume it's UTF8 if no encoding is provided, whine
>    for a proper setting if it's not the case
> For po files:
>  - msgid must be in UTF8. No matter what happens.
>  - msgstr have to be in the encoding specified in the po file headers.

No, msgids and msgstrs must share the same encoding, which is why UTF-8
is the only sane encoding if msgids contain non-ASCII characters.
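
In other words, a PO file declares one charset in its header and the
whole file is decoded with it. A minimal Python sketch, assuming the
usual "Content-Type: text/plain; charset=..." header line; po_charset
and decode_po are names invented for this example, not an existing API.

```python
import re


def po_charset(raw: bytes) -> str:
    """Return the charset declared in the PO header (hypothetical helper)."""
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", raw)
    if match is None:
        raise ValueError("no charset declared in the PO header")
    return match.group(1).decode("ascii")


def decode_po(raw: bytes) -> str:
    """Decode msgids and msgstrs alike with the single declared charset."""
    return raw.decode(po_charset(raw))
```

Since there is only one declaration, a msgid containing a euro sign
cannot sit next to an ISO-8859-1 msgstr in the same file.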

Denis