[Debburn-devel] Character sets in UDF and other filesystems

Peter Samuelson peter at p12n.org
Sat Jan 27 04:38:29 CET 2007


[Florent Rougon]
> provided that everyone is satisfied by UTF-8, which I believe is not
> the case, unfortunately (seems Unicode isn't very popular in
> East-Asia).

As I understand it, the problems with Unicode in east Asia are the same
as the problems with Unicode elsewhere: legacy apps and legacy data.
The challenges of supporting CJK also make it non-trivial for a legacy
app (or, indeed, a new app) to be converted to use Unicode.

The biggest problems I've heard of with regard to Unicode are not in
east Asia but in the Indian subcontinent.  I understand that until
fairly recently, the Tamil script was not fully supported by Unicode;
the same is probably true for other Indian language scripts, of which
there are well over a dozen in wide use.  Even when all the glyphs are
present, rendering engines must be quite complex for some Indic
scripts, because (a) many letter combinations form ligatures, 2 or 3 or
even more characters combining into one glyph (the Devanagari script
has _hundreds_ of ligatures); and (b) printed order is not the same as
logical/spoken order - the languages are written left-to-right, but
certain vowels, when they follow a consonant, appear to the _left_ of
the consonant.
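
To make (a) and (b) concrete, here is a minimal Python sketch using only
the standard unicodedata module.  The two sample strings are my own
illustration: the Devanagari syllable "ki", and the conjunct "ksha",
which a rendering engine normally draws as a single ligature.

```python
import unicodedata

# (b) Devanagari "ki": stored as KA followed by VOWEL SIGN I (logical
# order), although the vowel sign is drawn to the left of the consonant.
syllable = "\u0915\u093f"

# (a) Devanagari "ksha": three code points (KA, VIRAMA, SSA) that are
# usually rendered as one conjunct glyph.
conjunct = "\u0915\u094d\u0937"

syllable_names = [unicodedata.name(ch) for ch in syllable]
conjunct_names = [unicodedata.name(ch) for ch in conjunct]

for ch in syllable + conjunct:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Nothing here does any rendering; it only shows what is stored, which is
why the shaping work falls entirely on the rendering engine.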

Character conversion between encodings is also a bit complex in some of
these languages, for reason (b) above.  Different encodings do not
agree on whether to represent text in spoken/logical order or in
written order.  (Unicode chooses spoken/logical order.)
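
A rough sketch of the reordering step such a conversion needs, in
Python.  It assumes a hypothetical written-order source whose bytes
have already been mapped one-to-one onto Unicode code points, and it
handles only the three Tamil vowel signs that sit entirely to the left
of their consonant.  The two-part vowel signs, which wrap around the
consonant, are deliberately left out; visual_to_logical is a name made
up for this example, not part of any real converter.

```python
# Tamil vowel signs drawn to the left of their consonant:
# VOWEL SIGN E, VOWEL SIGN EE, VOWEL SIGN AI.
PREBASE_VOWEL_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}


def visual_to_logical(text: str) -> str:
    """Move each pre-base vowel sign after the character that follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PREBASE_VOWEL_SIGNS and i + 1 < len(text):
            # Written order: vowel sign, consonant.
            # Logical order: consonant, vowel sign.
            out.append(text[i + 1])
            out.append(ch)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# "ke" in written order (VOWEL SIGN E, then KA) becomes KA + VOWEL SIGN E.
logical = visual_to_logical("\u0bc6\u0b95")
```

The reverse direction is the same swap applied the other way round, so
a converter has to know which convention each side uses before it can
map characters at all.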

> > Win32 solves the problem by having syscalls that use UTF-16LE
> > unconditionally.

> Hmmm... so, CJK people probably have problems there?

Win32 actually includes two sets of system calls - one for Unicode, one
for your non-Unicode character set (known as a "code page"), from the
MS-DOS days.  It may be that some apps still use code-page syscalls
instead of Unicode syscalls.
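
The two families are usually distinguished by a W or A suffix on the
function name (CreateFileW versus CreateFileA, for example).  The
Python sketch below does not call Win32 at all; it only shows, as an
illustration with a file name I made up, the bytes each family would
be handed for the same name, and what happens when the active code
page cannot represent it.

```python
name = "日本語.txt"

# What a Unicode ("wide") call would receive: UTF-16LE code units.
wide = name.encode("utf-16-le")

# The same name in a Japanese code page (932, Shift-JIS based).
narrow = name.encode("cp932")

# On a system whose code page is Western European (1252), the name
# simply cannot be expressed through the code-page interface.
try:
    name.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

print(len(wide), len(narrow), representable)
```

So CJK users are fine with the Unicode calls; trouble starts only when
an app goes through the code-page calls and the code page does not
match the data.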


Anyway.  You seem to understand all the issues.  One last thing I
should mention: run 'man genisoimage' and look for the CHARACTER SETS
section.  That describes character set handling in a bit more detail,
especially regarding Rock Ridge.  I should note that the manpage is
slightly out of date - it mentions the default of "iso8859-1" but fails
to mention that LC_CTYPE is used instead, if present.
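
A sketch of that kind of lookup, in Python.  This is not genisoimage's
actual code: default_charset is a hypothetical helper, and the
LC_ALL / LC_CTYPE / LANG precedence is the usual POSIX convention,
assumed here rather than taken from the genisoimage source.

```python
import re

FALLBACK_CHARSET = "iso8859-1"  # the default the manpage documents


def default_charset(env):
    """Pick a charset from locale variables, else use the fallback."""
    # Assumed POSIX precedence: the first variable that is set and
    # non-empty decides the locale.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # Locale names look like language_TERRITORY.CHARSET@modifier.
            match = re.match(r"^[^.@]*\.([^@]+)", value)
            return match.group(1) if match else FALLBACK_CHARSET
    return FALLBACK_CHARSET


print(default_charset({"LC_CTYPE": "ja_JP.eucJP"}))
```

In a real program you would pass os.environ, or let the C library do
the work via setlocale() and nl_langinfo(CODESET).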

Peter