[Debburn-devel] Character sets in UDF and other filesystems
Peter Samuelson
peter at p12n.org
Sat Jan 27 04:38:29 CET 2007
[Florent Rougon]
> provided that everyone is satisfied by UTF-8, which I believe is not
> the case, unfortunately (seems Unicode isn't very popular in
> East-Asia).
As I understand it, the problems with Unicode in east Asia are the same
as the problems with Unicode elsewhere: legacy apps and legacy data.
The challenges of supporting CJK also make it non-trivial for a legacy
app (or, indeed, a new app) to be converted to use Unicode.
The biggest problems I've heard of with regard to Unicode are not in
east Asia but in the Indian subcontinent. I understand that until
fairly recently, the Tamil script was not fully supported by Unicode;
the same is probably true for other Indian language scripts, of which
there are well over a dozen in wide use. Even when all the glyphs are
present, rendering engines must be quite complex for some Indic
scripts, because (a) many letter combinations form ligatures, 2 or 3 or
even more characters combining into one glyph (the Devanagari script
has _hundreds_ of ligatures); and (b) printed order is not the same as
logical/spoken order - the languages are written left-to-right, but
certain vowels, when they follow a consonant, appear to the _left_ of
the consonant.
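Points (a) and (b) can be seen in a minimal Python sketch, assuming
Python 3 and its standard unicodedata module. The two sample strings are
my own choice of illustration: the Devanagari syllable "ki" (where the
vowel sign is drawn to the left of the consonant) and the conjunct
"kssa" (three code points that a rendering engine normally draws as one
ligature glyph).

```python
import unicodedata

# Devanagari syllable "ki": the consonant KA followed in speech by the
# vowel I. The string holds the consonant first (logical order), even
# though the vowel sign is drawn to the LEFT of the consonant.
syllable = "\u0915\u093f"

# Conjunct "kssa": KA + VIRAMA + SSA, three code points that are usually
# rendered as a single ligature glyph.
conjunct = "\u0915\u094d\u0937"

syllable_names = [unicodedata.name(ch) for ch in syllable]
conjunct_names = [unicodedata.name(ch) for ch in conjunct]

for label, text in (("syllable", syllable), ("conjunct", conjunct)):
    print(label)
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The stored order says nothing about where each piece ends up on screen;
that is left entirely to the rendering engine.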
Character conversion between encodings is also a bit complex in some of
these languages, for reason (b) above. Different encodings do not
agree on whether to represent text in spoken/logical order or in
written order. (Unicode chooses spoken/logical order.)
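As a rough illustration of why such a conversion is more than a
byte-for-byte table lookup, here is a toy Python sketch. It assumes a
hypothetical written-order source that has already been mapped to
Unicode code points but still has Tamil left-side vowel signs placed
before their consonant. The function name visual_to_logical and the
input convention are mine, not those of any real converter, and the
sketch ignores two-part vowels and conjuncts.

```python
# Tamil vowel signs that are drawn to the left of their consonant:
# VOWEL SIGN E, VOWEL SIGN EE, VOWEL SIGN AI.
PREBASE_VOWEL_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}


def visual_to_logical(text: str) -> str:
    """Move each left-side vowel sign to just after the next character.

    Hypothetical helper: assumes every left-side vowel sign in `text`
    is immediately followed by the consonant it belongs to.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PREBASE_VOWEL_SIGNS and i + 1 < len(text):
            out.append(text[i + 1])  # consonant first (logical order)
            out.append(ch)           # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


visual = "\u0bc6\u0b95"              # sign E, then letter KA (as written)
logical = visual_to_logical(visual)  # letter KA, then sign E (as spoken)
```

A real converter has to do this reordering in addition to the usual
code-point mapping, which is why the two kinds of encoding cannot be
translated one character at a time.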
> > Win32 solves the problem by having syscalls that use UTF-16LE
> > unconditionally.
> Hmmm... so, CJK people probably have problems there?
Win32 actually includes two sets of system calls - one for Unicode, one
for your non-Unicode character set (known as a "code page"), from the
MS-DOS days. It may be that some apps still use code-page syscalls
instead of Unicode syscalls.
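In the Win32 API the two sets show up as paired entry points, for
example CreateFileW (UTF-16) and CreateFileA (code page). The sketch
below does not call Win32 at all; it only uses Python's built-in codecs
to show what the same file name looks like as bytes in each form, and
what happens when the active code page cannot represent it. The helper
encode_or_none and the choice of cp932 and cp1252 as example code pages
are assumptions for illustration.

```python
def encode_or_none(name: str, codec: str):
    """Return `name` encoded with `codec`, or None if it cannot be encoded."""
    try:
        return name.encode(codec)
    except UnicodeEncodeError:
        return None


# A Japanese file name, written with escapes to keep the source ASCII.
name = "\u65e5\u672c\u8a9e"

as_utf16 = encode_or_none(name, "utf-16-le")   # what a Unicode call receives
as_cp932 = encode_or_none(name, "cp932")       # Japanese code page
as_cp1252 = encode_or_none(name, "cp1252")     # Western European code page

print("utf-16-le:", as_utf16)
print("cp932:    ", as_cp932)
print("cp1252:   ", as_cp1252)  # None: the name is not representable
```

An app that sticks to the code-page calls works only as long as every
name it meets fits in the current code page; the UTF-16 form has no
such restriction.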
Anyway. You seem to understand all the issues. One last thing I
should mention: run 'man genisoimage' and look for the CHARACTER SETS
section. That describes character set handling in a bit more detail,
especially regarding Rock Ridge. I should note that the manpage is
slightly out of date - it mentions the default of "iso8859-1" but fails
to mention that the character set from LC_CTYPE is used instead, if set.
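For readers unfamiliar with how a program gets a character set out of
the locale, here is a small Python sketch. It is not genisoimage's
actual code; it only illustrates the usual POSIX convention, where
LC_ALL overrides LC_CTYPE, which overrides LANG, and the codeset is the
part of the locale name after the dot. The function name
charset_from_env and the fallback behaviour are assumptions.

```python
def charset_from_env(env, default="iso8859-1"):
    """Guess an input charset from locale variables in the mapping `env`.

    Hypothetical helper: checks LC_ALL, LC_CTYPE and LANG in POSIX
    precedence order and returns the codeset of the first one that is
    set, or `default` if none is set or it names no codeset.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # Locale names look like language_TERRITORY.codeset@modifier
            if "." in value:
                return value.split(".", 1)[1].split("@", 1)[0]
            return default
    return default


print(charset_from_env({"LC_CTYPE": "ja_JP.eucJP"}))
print(charset_from_env({}))
```

Passing os.environ as the argument applies the same lookup to the
current environment.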
Peter