[Po4a-devel] Encoding options

Wed, 4 Aug 2004 22:45:36 +0200 (CEST)

On Wed, 4 Aug 2004, Denis Barbier wrote:
> On Wed, Aug 04, 2004 at 02:46:41PM +0200, Jordi Vilalta wrote:
>> Ok, let me summarize what we have said until now (thanks to everyone for
>> helping me better understand the limitations of the po files and the
>> objectives of the encodings).
>>
>>
>> Here are the conditions we have to fulfil:
>>
>> - msgids and msgstrs must share the same encoding
>> - msgids should only be ascii or utf-8
>> - ascii is preferred over utf-8 by translators
>
> Fully right.
>
>> And here's a proposal of the processes:
>> * Handling the master document (in gettextize, translate and update):
>>
>> - If a charset is specified in the command-line, convert from that to
>>   utf-8 (and set the po charset to utf-8)
>> - Else, if the format module can detect the encoding from the document,
>>   convert from this to utf-8 (and set the po charset to utf-8)
>
> No, it must be ASCII by default because 'ascii is preferred over utf-8
> by translators'.
>

Well, here "detect" means that the document declares its charset itself
(as in an xml header: <?xml encoding='iso-8859-1'?>), the format module
reads that declaration, and the content is then converted to utf-8.
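
To make that order concrete, here is a minimal sketch in Python (po4a itself is written in Perl, so this is only an illustration). The helper names `declared_xml_encoding` and `master_to_utf8` are hypothetical and not part of po4a. The sketch assumes the declaration sits at the very start of the file, and it does not rewrite the declaration after converting.

```python
import re

# Hypothetical helper, not part of po4a: find the encoding named in an
# XML declaration at the start of the document, if there is one.
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def declared_xml_encoding(raw: bytes):
    """Return the declared encoding name, or None if none is declared."""
    match = _XML_DECL.match(raw)
    return match.group(1).decode("ascii") if match else None


def master_to_utf8(raw: bytes, cli_charset=None):
    """Return (bytes, po_charset) for the master document.

    The command-line charset wins; otherwise the charset declared inside
    the document is used. If neither is known, the bytes are returned
    untouched and the po charset is left undecided (None).
    """
    charset = cli_charset or declared_xml_encoding(raw)
    if charset is None:
        return raw, None
    return raw.decode(charset).encode("utf-8"), "UTF-8"
```

For example, a Latin-1 document whose header says `encoding='iso-8859-1'` would come back as UTF-8 bytes together with the po charset "UTF-8".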

>> - If nothing can determine the file encoding, assume it's in ascii and
>>   don't convert anything (and set the po charset to something invalid, so
>>   that the translator can set it)
>
> If master file contains non-ASCII characters, one can check whether it
> is UTF-8 encoded.  In such a case, lib/Locale/Po4a/Po.pm has to write
>   "Content-Type: text/plain; charset=UTF-8\n"
> instead of
>   "Content-Type: text/plain; charset=CHARSET\n"
> in the POT file.  If translated PO files already exist, they have to
> be converted to UTF-8 so that they can be merged with the POT file.

Do you mean that an update of the master document can change it from
ascii to utf-8, and that we should then convert the po files to utf-8
when updating?
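
The check described in the quoted paragraph above could look roughly like this. It is a Python sketch for illustration only, not the code in lib/Locale/Po4a/Po.pm; `classify_master` and `pot_content_type` are hypothetical names, and raising an error for input that is neither ASCII nor UTF-8 is my assumption about how to report that case.

```python
def classify_master(raw: bytes) -> str:
    """Return 'ascii' or 'utf-8'; raise ValueError for anything else."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not ASCII and not valid UTF-8: report it rather than guess.
        raise ValueError("master file is neither ASCII nor UTF-8") from None


def pot_content_type(raw: bytes) -> str:
    """Build the Content-Type header line as it would appear in the POT."""
    charset = "UTF-8" if classify_master(raw) == "utf-8" else "CHARSET"
    # The trailing \n is a literal backslash-n inside the quoted po string.
    return f'"Content-Type: text/plain; charset={charset}\\n"'
```

With this, a pure ASCII master keeps the CHARSET placeholder, and a master that gains its first non-ASCII (UTF-8) character after an update switches the header to UTF-8, which is the moment the existing po files would need recoding.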

>
> If master file is not UTF-8 encoded, po4a-gettextize must abort because
> this has to be fixed by maintainers, not translators.
>
>> * Handling the input translated document (in gettextize):
>>
>> - If the master document's charset is ascii (not specified in the po), we
>>   should let the translated document remain in the specified charset (in
>>   the command line or the format module's detected one (if nothing
>>   detected, stop the process)), and set the po charset to it.
>> - If the master document's charset is utf-8, we should convert from the
>>   specified charset (in the command line or the format module's detected
>>   one) to utf-8.
>
> Fine by me, but this seems in contradiction with your previous paragraph,
> because you said that if no charset is specified, PO file is UTF-8
> encoded ;)
> In the first case, PO charset can be unspecified until translator fixes
> it.  In the second case, it is troublesome, msgstrs really have to be
> recoded into UTF-8, otherwise the PO file is pretty useless, this
> conversion cannot be performed afterwards.  Maybe po4a-gettextize should
> abort too.
>

Yes, that's what I meant. When the master (msgids) is utf-8, we should
also convert the translated strings to utf-8 (before merging the two po
files).
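
As a rough sketch of that rule in Python (illustration only; `recode_msgstr` is a hypothetical name, and the charset arguments are assumed to come from the command line or the format module):

```python
def recode_msgstr(msgstr_raw: bytes, master_charset: str, translation_charset):
    """Return (po_charset, msgstr_bytes) for one translated string.

    master_charset is 'ascii' or 'utf-8'. translation_charset is the
    charset of the translated document, or None if it is unknown.
    """
    if translation_charset is None:
        # Nothing specified and nothing detected: stop the process.
        raise ValueError("charset of the translated document is unknown")
    # Decoding also verifies the bytes really are in the claimed charset.
    text = msgstr_raw.decode(translation_charset)
    if master_charset == "utf-8":
        # msgids are UTF-8, so msgstrs must be recoded to match.
        return "UTF-8", text.encode("utf-8")
    # ASCII msgids are valid in any ASCII-compatible charset, so the
    # translation can stay as it is and the po file takes its charset.
    return translation_charset, msgstr_raw
```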

>> * Handling the output translated document (in translate):
>>
>> - Use the charset specified in the command line, or the po file's charset
>>   if nothing specified.
>
> Ok.
>
>> * Handling the addendum (in translate):
>>
>> - It should be converted from the specified charset in the command line
>>   (mandatory) to the output document charset determined in the point
>>   above.
>
> Ok.
>
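
The two points just agreed on (output charset and addendum) amount to something like the following Python sketch. The function names are hypothetical and only illustrate the rule; they are not po4a code.

```python
def output_charset(cli_charset, po_charset):
    """Command-line charset if given, otherwise the po file's charset."""
    return cli_charset or po_charset


def recode_addendum(addendum_raw: bytes, addendum_charset: str,
                    out_charset: str) -> bytes:
    """Convert the addendum from its (mandatory) declared charset to the
    charset chosen for the output document."""
    return addendum_raw.decode(addendum_charset).encode(out_charset)
```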
>> Did I miss something? Am I wrong in some points?
>
> Sounds good.
>
>> Oh, and one last question for now: should we recode everything or just the
>> translated strings (assuming that's the only place where there can be
>> encoding issues...)?
>
> The safest solution is to allow only ASCII encoded non-translatable materials,
> and see if there are complaints.

I also vote for this.
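
Enforcing that could be as simple as the check below. It is a Python sketch under my own assumptions: `non_ascii_lines` is a hypothetical helper, and the non-translatable material is assumed to be available as already-decoded text.

```python
def non_ascii_lines(text: str):
    """Return (line_number, line) pairs for lines with non-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()  # str.isascii() needs Python 3.7 or later
    ]
```

An empty result means the material is pure ASCII; otherwise the line numbers can be reported to the maintainer.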

Regards,

Jordi Vilalta