[Po4a-devel] Encoding options
Denis Barbier
barbier@linuxfr.org
Wed, 4 Aug 2004 21:44:47 +0200
On Wed, Aug 04, 2004 at 02:46:41PM +0200, Jordi Vilalta wrote:
> Ok, let me summarize what we have said until now (thanks to everyone for
> helping me better understand the limitations of the po files and the
> objectives of the encodings).
>
>
> Here are the conditions we have to fulfil:
>
> - msgids and msgstrs must share the same encoding
> - msgids should only be ascii or utf-8
> - ascii is preferred over utf-8 by translators
Fully right.
> And here's a proposal of the processes:
> * Handling the master document (in gettextize, translate and update):
>
> - If a charset is specified in the command-line, convert from that to
> utf-8 (and set the po charset to utf-8)
> - Else, if the format module can detect the encoding from the document,
> convert from this to utf-8 (and set the po charset to utf-8)
No, it must be ASCII by default because 'ascii is preferred over utf-8
by translators'.
> - If nothing can determine the file encoding, assume it's in ascii and
> don't convert anything (and set the po charset to something invalid, so
> that the translator can set it)
If the master file contains non-ASCII characters, one can check whether it
is UTF-8 encoded. In that case, lib/Locale/Po4a/Po.pm has to write
"Content-Type: text/plain; charset=UTF-8\n"
instead of
"Content-Type: text/plain; charset=CHARSET\n"
in the POT file. If translated PO files already exist, they have to
be converted to UTF-8 so that they can be merged with the POT file.
If the master file is not UTF-8 encoded, po4a-gettextize must abort because
this has to be fixed by maintainers, not translators.
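
To make the rule above concrete, here is a minimal sketch in Python (po4a itself is written in Perl, so this is an illustration, not the actual Po.pm code). The function names pot_charset and content_type_header are hypothetical. It assumes the master document is available as raw bytes and keeps the CHARSET placeholder when the document is pure ASCII.

```python
def pot_charset(master: bytes) -> str:
    """Pick the charset for the POT header (hypothetical helper)."""
    try:
        master.decode("ascii")
        return "CHARSET"  # pure ASCII: keep the placeholder for translators
    except UnicodeDecodeError:
        pass
    try:
        master.decode("utf-8")
        return "UTF-8"  # non-ASCII but valid UTF-8
    except UnicodeDecodeError:
        # Neither ASCII nor UTF-8: abort, the maintainer has to fix the master.
        raise ValueError("master document is neither ASCII nor UTF-8") from None


def content_type_header(master: bytes) -> str:
    """Build the Content-Type line as it would appear in the POT file."""
    return f'"Content-Type: text/plain; charset={pot_charset(master)}\\n"'
```

For example, content_type_header(b"plain text") keeps charset=CHARSET, while a master containing UTF-8 encoded accented characters gets charset=UTF-8, and a Latin-1 encoded master raises an error.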
> * Handling the input translated document (in gettextize):
>
> - If the master document's charset is ascii (not specified in the po), we
> should let the translated document remain in the specified charset (in
> the command line or the format module's detected one (if nothing
> detected, stop the process)), and set the po charset to it.
> - If the master document's charset is utf-8, we should convert from the
> specified charset (in the command line or the format module's detected
> one) to utf-8.
Fine by me, but this seems to contradict your previous paragraph,
because you said that if no charset is specified, the PO file is UTF-8
encoded ;)
In the first case, the PO charset can be left unspecified until the
translator fixes it. The second case is troublesome: msgstrs really have
to be recoded into UTF-8, otherwise the PO file is pretty useless. This
conversion cannot be performed afterwards. Maybe po4a-gettextize should
abort too.
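
The two cases can be sketched as follows, again in Python for illustration only. The helper prepare_msgstr and its parameters are hypothetical, and aborting when the translation's charset is unknown in the UTF-8 case is one reading of "should abort too", not a settled decision.

```python
from typing import Optional, Tuple


def prepare_msgstr(raw: bytes, master_is_utf8: bool,
                   charset: Optional[str]) -> Tuple[bytes, str]:
    """Return (msgstr bytes, PO charset) for a translated string.

    charset is the translation's encoding from the command line or the
    format module, or None if it could not be determined.
    """
    if not master_is_utf8:
        # ASCII master: leave the translation untouched; the PO charset
        # may stay as the CHARSET placeholder until the translator sets it.
        return raw, charset or "CHARSET"
    if charset is None:
        # UTF-8 master: msgstrs must be recoded now, so an unknown
        # source charset is treated as fatal (assumption).
        raise ValueError("cannot recode translation: charset unknown")
    return raw.decode(charset).encode("utf-8"), "UTF-8"
```

With a UTF-8 master, a Latin-1 "é" (byte 0xE9) comes out as the two bytes 0xC3 0xA9 and the PO charset is UTF-8; with an ASCII master the same byte is kept as-is.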
> * Handling the output translated document (in translate):
>
> - Use the charset specified in the command line, or the po file's charset
> if nothing specified.
Ok.
> * Handling the addendum (in translate):
>
> - It should be converted from the specified charset in the command line
> (mandatory) to the output document charset determined in the point
> above.
Ok.
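
A short sketch of the two output rules agreed on above, in Python for illustration. The names output_charset and recode_addendum are hypothetical; the sketch assumes the addendum charset is always given, as the proposal makes it mandatory.

```python
from typing import Optional


def output_charset(cli_charset: Optional[str], po_charset: str) -> str:
    """Command-line charset wins; otherwise fall back to the PO charset."""
    return cli_charset or po_charset


def recode_addendum(raw: bytes, addendum_charset: str, out_charset: str) -> bytes:
    """Convert the addendum from its declared charset to the output charset."""
    return raw.decode(addendum_charset).encode(out_charset)
```

For instance, a UTF-8 addendum appended to a document written in ISO-8859-1 would go through recode_addendum(data, "utf-8", "iso-8859-1") before being added.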
> Did I miss something? Am I wrong in some points?
Sounds good.
> Oh, and one last question for now: should we recode everything or just the
> translated strings (assuming that's the only place where there can be
> encoding issues...)?
The safest solution is to allow only ASCII-encoded non-translatable material,
and see if there are complaints.
Denis