[Po4a-devel]Encoding options
Jordi Vilalta
jvprat@wanadoo.es
Wed, 4 Aug 2004 22:45:36 +0200 (CEST)
On Wed, 4 Aug 2004, Denis Barbier wrote:
> On Wed, Aug 04, 2004 at 02:46:41PM +0200, Jordi Vilalta wrote:
>> Ok, let me summarize what we have said so far (thanks to everyone for
>> helping me better understand the limitations of po files and the goals of
>> the encoding handling).
>>
>>
>> Here are the conditions we have to fulfil:
>>
>> - msgids and msgstrs must share the same encoding
>> - msgids should only be ascii or utf-8
>> - ascii is preferred over utf-8 by translators
>
> Fully right.
>
>> And here's a proposal of the processes:
>> * Handling the master document (in gettextize, translate and update):
>>
>> - If a charset is specified in the command-line, convert from that to
>> utf-8 (and set the po charset to utf-8)
>> - Else, if the format module can detect the encoding from the document,
>> convert from this to utf-8 (and set the po charset to utf-8)
>
> No, it must be ASCII by default because 'ascii is preferred over utf-8
> by translators'.
>
Well, this "detect" means that the document specifies the charset
inside itself (like the xml header: <?xml encoding='iso-8859-1'?>),
the format module reads it, and the content is then converted to utf-8.
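
To make the "detect" step concrete, here is a rough sketch in Python (po4a itself is Perl, so this is only an illustration). The function names detect_xml_charset and master_to_utf8 are hypothetical, and the regular expression only covers the simple XML declaration case mentioned above:

```python
import re

# Hypothetical helpers for illustration only; not part of po4a.
# Matches an encoding pseudo-attribute inside a leading XML declaration.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def detect_xml_charset(raw):
    """Return the charset named in the XML declaration, or None."""
    match = XML_DECL.match(raw.lstrip())
    return match.group(1).decode("ascii") if match else None


def master_to_utf8(raw, cmdline_charset=None):
    """Convert the master document to UTF-8 when its charset is known.

    The command-line charset wins over the detected one. Returns a pair
    (bytes, po_charset); po_charset is None when nothing could be determined
    and the bytes are returned untouched.
    """
    charset = cmdline_charset or detect_xml_charset(raw)
    if charset is None:
        return raw, None
    return raw.decode(charset).encode("utf-8"), "UTF-8"
```

This follows the order of the proposal (command line first, then the format module's detection) and does not take into account the objection that ascii should stay the default.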
>> - If nothing can determine the file encoding, assume it's in ascii and
>> don't convert anything (and set the po charset to something invalid, so
>> that the translator can set it)
>
> If the master file contains non-ASCII characters, one can check whether it
> is UTF-8 encoded. In such a case, lib/Locale/Po4a/Po.pm has to write
> "Content-Type: text/plain; charset=UTF-8\n"
> instead of
> "Content-Type: text/plain; charset=CHARSET\n"
> in the POT file. If translated PO files already exist, they have to
> be converted to UTF-8 so that they can be merged with the POT file.
Do you mean that an update of the master document can change the encoding
from ascii to utf-8, and that we should then convert the existing po files
to utf-8 when updating?
>
> If the master file is not UTF-8 encoded, po4a-gettextize must abort because
> this has to be fixed by maintainers, not translators.
>
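
The check described in the quoted paragraphs above could look roughly like this. It is a Python sketch rather than the Perl of lib/Locale/Po4a/Po.pm, and the function names po_charset_for_master and content_type_header are made up for the example:

```python
# Hypothetical helpers for illustration only; not part of po4a.

def po_charset_for_master(raw):
    """Pick the charset to write in the POT header for a master file.

    Pure ASCII keeps the CHARSET placeholder, valid UTF-8 gives UTF-8,
    and anything else is rejected so that the maintainer fixes the file.
    """
    try:
        raw.decode("ascii")
        return "CHARSET"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError("master file is neither ASCII nor UTF-8, aborting")
    return "UTF-8"


def content_type_header(raw):
    """Build the Content-Type line as it would appear in the POT file."""
    return '"Content-Type: text/plain; charset=%s\\n"' % po_charset_for_master(raw)
```

Since every ASCII file is also valid UTF-8, the ASCII test has to come first, otherwise the placeholder would never be kept.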
>> * Handling the input translated document (in gettextize):
>>
>> - If the master document's charset is ascii (not specified in the po), we
>> should leave the translated document in its own charset (the one given on
>> the command line or detected by the format module; if neither is
>> available, stop the process), and set the po charset to it.
>> - If the master document's charset is utf-8, we should convert from the
>> specified charset (in the command line or the format module's detected
>> one) to utf-8.
>
> Fine by me, but this seems in contradiction with your previous paragraph,
> because you said that if no charset is specified, PO file is UTF-8
> encoded ;)
> In the first case, PO charset can be unspecified until translator fixes
> it. The second case is more troublesome: msgstrs really have to be
> recoded into UTF-8, otherwise the PO file is pretty useless, and this
> conversion cannot be performed afterwards. Maybe po4a-gettextize should
> abort too.
>
Yes, that's what I meant. When the master (msgids) is utf-8, we should
also convert the translated strings to utf-8 (before merging the two po
files).
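
A small sketch of the two cases for the translated document, again in Python and only as an illustration. The name recode_translation and the convention of passing None for an ascii master (charset not specified in the po) are assumptions made for the example:

```python
# Hypothetical helper for illustration only; not part of po4a.

def recode_translation(raw, master_charset, translation_charset):
    """Return (bytes, po_charset) for the translated document.

    master_charset is "UTF-8" or None (None meaning an ascii master whose
    po does not specify a charset). translation_charset comes from the
    command line or from the format module; without it we stop.
    """
    if translation_charset is None:
        raise ValueError("no charset known for the translated document")
    # Decoding also validates that the input really is in that charset.
    text = raw.decode(translation_charset)
    if master_charset == "UTF-8":
        # msgids are utf-8, so msgstrs must be recoded to match them.
        return text.encode("utf-8"), "UTF-8"
    # ascii master: keep the translator's charset and record it in the po.
    return raw, translation_charset
```

Either way msgids and msgstrs end up sharing one encoding, which is the first condition listed at the top.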
>> * Handling the output translated document (in translate):
>>
>> - Use the charset specified in the command line, or the po file's charset
>> if nothing specified.
>
> Ok.
>
>> * Handling the addendum (in translate):
>>
>> - It should be converted from the specified charset in the command line
>> (mandatory) to the output document charset determined in the point
>> above.
>
> Ok.
>
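
The last two points (output document and addendum) are short enough to sketch together. This is Python for illustration only, with hypothetical names output_charset and recode_addendum:

```python
# Hypothetical helpers for illustration only; not part of po4a.

def output_charset(cmdline_charset, po_charset):
    """Command-line charset if given, otherwise the po file's charset."""
    return cmdline_charset or po_charset


def recode_addendum(raw, addendum_charset, out_charset):
    """Convert the addendum to the charset of the output document.

    The addendum charset is mandatory, so a missing value is an error.
    """
    if not addendum_charset:
        raise ValueError("the addendum charset must be given on the command line")
    return raw.decode(addendum_charset).encode(out_charset)
```

For example, an iso-8859-1 addendum used with a utf-8 po file and no output charset on the command line would be recoded to utf-8.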
>> Did I miss something? Am I wrong in some points?
>
> Sounds good.
>
>> Oh, and one last question for now: should we recode everything or just the
>> translated strings (assuming that's the only place where there can be
>> encoding issues...)?
>
> The safest solution is to allow only ASCII-encoded non-translatable material,
> and see if there are complaints.
I also vote for this.
Regards,
Jordi Vilalta