[Po4a-devel]Encoding options

Denis Barbier barbier@linuxfr.org
Wed, 4 Aug 2004 21:44:47 +0200


On Wed, Aug 04, 2004 at 02:46:41PM +0200, Jordi Vilalta wrote:
> Ok, let me summarize what we have said so far (thanks to everyone for
> helping me better understand the limitations of the po files and the
> objectives of the encodings).
> 
> 
> Here are the conditions we have to fulfil:
> 
> - msgids and msgstrs must share the same encoding
> - msgids should only be ascii or utf-8
> - ascii is preferred over utf-8 by translators

Exactly right.

> And here's a proposal of the processes:
> * Handling the master document (in gettextize, translate and update):
> 
> - If a charset is specified in the command-line, convert from that to
>   utf-8 (and set the po charset to utf-8)
> - Else, if the format module can detect the encoding from the document,
>   convert from this to utf-8 (and set the po charset to utf-8)

No, the default must be ASCII, because 'ascii is preferred over utf-8
by translators'.

> - If nothing can determine the file encoding, assume it's in ascii and
>   don't convert anything (and set the po charset to something invalid, so
>   that the translator can set it)

If the master file contains non-ASCII characters, one can check whether
it is UTF-8 encoded.  If it is, lib/Locale/Po4a/Po.pm has to write
   "Content-Type: text/plain; charset=UTF-8\n"
instead of
   "Content-Type: text/plain; charset=CHARSET\n"
in the POT file.  If translated PO files already exist, they have to
be converted to UTF-8 so that they can be merged with the POT file.

If the master file contains non-ASCII characters and is not UTF-8
encoded, po4a-gettextize must abort, because this has to be fixed by
maintainers, not translators.
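
The check described above can be sketched as follows.  This is only an
illustration in Python, not po4a's Perl code; the function names
pot_charset and content_type_header are made up for the example, and
raising ValueError stands in for "po4a-gettextize must abort".

```python
def pot_charset(master: bytes) -> str:
    """Pick the charset value for the POT header (hypothetical helper)."""
    try:
        master.decode("ascii")
        # Pure ASCII: keep the placeholder so the translator sets it later.
        return "CHARSET"
    except UnicodeDecodeError:
        pass
    try:
        master.decode("utf-8")
    except UnicodeDecodeError:
        # Neither ASCII nor UTF-8: the maintainer has to fix the master file.
        raise ValueError("master file is neither ASCII nor UTF-8 encoded")
    return "UTF-8"


def content_type_header(master: bytes) -> str:
    """Build the header line as it appears in the POT file (literal \\n)."""
    return "Content-Type: text/plain; charset=" + pot_charset(master) + "\\n"
```

A Latin-1 encoded master file would raise here, which the caller can
turn into an error message and a non-zero exit status.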

> * Handling the input translated document (in gettextize):
> 
> - If the master document's charset is ascii (not specified in the po), we
>   should let the translated document remain in the specified charset (in
>   the command line or the format module's detected one (if nothing
>   detected, stop the process)), and set the po charset to it.
> - If the master document's charset is utf-8, we should convert from the
>   specified charset (in the command line or the format module's detected
>   one) to utf-8.

Fine by me, but this seems to contradict your previous paragraph,
because you said that if no charset is specified, the PO file is UTF-8
encoded ;)
In the first case, the PO charset can remain unspecified until the
translator fixes it.  The second case is troublesome: msgstrs really
have to be recoded into UTF-8, otherwise the PO file is pretty useless,
and this conversion cannot be performed afterwards.  Maybe
po4a-gettextize should abort here too.
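
One possible reading of the two cases, as a Python sketch.  The
function name recode_translation and its parameters are assumptions
made for the example and do not exist in po4a.  It leaves the charset
as the CHARSET placeholder in the ASCII case and raises (standing in
for an abort) when a UTF-8 master meets a translation of unknown
charset.

```python
from typing import Optional, Tuple


def recode_translation(text: bytes, master_charset: str,
                       translation_charset: Optional[str]) -> Tuple[bytes, str]:
    """Return (msgstr bytes, PO charset) for a translated document."""
    if master_charset.lower() in ("utf-8", "utf8"):
        if translation_charset is None:
            # msgstrs cannot be recoded later, so refuse to continue.
            raise ValueError("translation charset unknown, cannot recode to UTF-8")
        # Master is UTF-8: msgstrs must share that encoding.
        return text.decode(translation_charset).encode("utf-8"), "UTF-8"
    if translation_charset is None:
        # ASCII master: leave the placeholder for the translator to fix.
        return text, "CHARSET"
    # ASCII master: keep the translation in its own charset.
    return text, translation_charset
```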

> * Handling the output translated document (in translate):
> 
> - Use the charset specified in the command line, or the po file's charset
>   if nothing specified.

Ok.

> * Handling the addendum (in translate):
> 
> - It should be converted from the specified charset in the command line
>   (mandatory) to the output document charset determined in the point
>   above.

Ok.
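
For illustration, the last two points (output charset and addendum)
could look like this in Python.  Both function names are hypothetical,
and the sketch assumes the addendum charset is always given, as the
proposal makes it mandatory.

```python
from typing import Optional


def output_charset(cli_charset: Optional[str], po_charset: str) -> str:
    """Command-line charset wins; otherwise fall back to the PO charset."""
    return cli_charset or po_charset


def recode_addendum(addendum: bytes, addendum_charset: str,
                    out_charset: str) -> bytes:
    """Convert the addendum from its declared charset to the output charset."""
    return addendum.decode(addendum_charset).encode(out_charset)
```

Note that encode() raises UnicodeEncodeError if the addendum contains a
character the output charset cannot represent; a real tool would have
to decide how to report that.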

> Did I miss something? Am I wrong in some points?

Sounds good.

> Oh, and one last question for now: should we recode everything or just the 
> translated strings (assuming that's the only place where there can be 
> encoding issues...)?

The safest solution is to allow only ASCII-encoded non-translatable
material, and see whether anyone complains.
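
Such a restriction is cheap to enforce.  A sketch in Python, where
require_ascii is a made-up name and the "where" label is only used for
the error message:

```python
def require_ascii(material: bytes, where: str = "<input>") -> bytes:
    """Return the material unchanged, or raise on the first non-ASCII line."""
    for lineno, line in enumerate(material.splitlines(), start=1):
        if not line.isascii():  # bytes.isascii() needs Python 3.7 or later
            raise ValueError(
                f"{where}:{lineno}: non-ASCII byte in non-translatable material")
    return material
```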

Denis