[Debburn-devel] Character sets in UDF and other filesystems
Florent Rougon
f.rougon at free.fr
Fri Jan 26 23:54:48 CET 2007
Hi,
Peter Samuelson <peter at p12n.org> wrote:
> Yes, it is a limitation. I think this is one of the reasons Linux
> distributions have been trying to push users towards UTF-8 in the past
> few years. For that matter, some apps (Gtk+) seem to believe all
> filenames are UTF-8 regardless of their LC_CTYPE - I suppose that's
> another solution.
Yes, provided that everyone is satisfied with UTF-8, which I believe is
not the case, unfortunately (it seems Unicode isn't very popular in
East Asia).
> Win32 solves the problem by having syscalls that use UTF-16LE
> unconditionally.
Hmmm... so, CJK people probably have problems there?
> The Linux kernel developers would never go for that. The kernel
> doesn't and shouldn't know anything about parsing LANG and LC_*
> variables; that's purely a userspace concern.
I can understand that.
> What would be possible instead, noting that every system call is
> actually a thin wrapper function in libc6, would be for libc6 itself to
> do these translations inside the file access syscalls. However, that
> too is a hard sell, for several reasons:
[...]
> - What should readdir() do if the LC_CTYPE charset cannot represent a
> particular filename? If it returns anything, the app will expect to
> be able to stat() whatever filename it returns.
That is indeed a real problem. :-|
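To make the readdir() case concrete, here is a minimal sketch in Python. The
translate_name() helper is hypothetical (nothing like it exists in libc6); it
only assumes that names are stored on disk as UTF-8 and that the user's
LC_CTYPE charset is ISO-8859-1.

```python
# Hypothetical helper, not a real libc or kernel API: what could a
# translating readdir() return for a name stored on disk in disk_charset?
def translate_name(raw: bytes, disk_charset: str, locale_charset: str):
    """Return the name re-encoded for the locale, or None if impossible."""
    text = raw.decode(disk_charset)
    try:
        return text.encode(locale_charset)
    except UnicodeEncodeError:
        # No byte string in the locale charset stands for this name, so
        # anything returned here could not be passed back to stat().
        return None


latin_name = "café.txt".encode("utf-8")
cjk_name = "日本語.txt".encode("utf-8")

print(translate_name(latin_name, "utf-8", "iso-8859-1"))
print(translate_name(cjk_name, "utf-8", "iso-8859-1"))
```

The first name converts cleanly; the second has no ISO-8859-1 form at all,
which is exactly the case where readdir() has nothing sensible to return.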
> - In the general case, Unix filesystems do not have locale information
> in them. The only exceptions supported by Linux are vfat, ntfs,
> iso9660 (with Joliet), udf, and jfs. (Note all of these except udf
> came from Windows or OS/2.) Most Unix filesystems abide by the
> philosophy that a filename is just a string of bytes that can
> include any byte except "/" or NUL. Thus a filename's character set
> is simply whatever the app that created the file used.
I don't like that, but as you just pointed out, carrying
(filenames + charset/encoding metadata) from one environment to another
(with a different charset/encoding) can be problematic when some characters
aren't representable in both charsets. Tough problem.
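A small Python illustration of the "just a string of bytes" point, under the
assumption that some application created a file named "é.txt" while running
in an ISO-8859-1 locale. The byte string below is made up for the example.

```python
# The bytes an ISO-8859-1 application would store for "é.txt".
raw = b"\xe9.txt"

# The filesystem keeps no charset metadata, so each reader applies its own.
as_latin1 = raw.decode("iso-8859-1")
as_koi8r = raw.decode("koi8-r")

# Under a UTF-8 locale the same bytes are not even a valid string.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(as_latin1), repr(as_koi8r), valid_utf8)
```

Same bytes, three different outcomes depending on who is looking.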
> Right - but maybe better than nothing. In the common case, I suspect
> all users on a given system _are_ using the same charset. And
> particularly with removable media, it's often mounted by a non-root
> user, and only that user is really interested in it.
Agreed.
> The portable filename character set is, I think, a subset of ASCII that
> excludes not only "/" and NUL but several other bytes that can be
> problematic on some OSes, like ":". But when Rock Ridge (and POSIX)
> talk about a "character set", they really mean a "set of bytes"; there
> is no specific implied mapping between bytes and characters.
Okay.
> Right, "genisoimage -r -J" is fully supported and I usually use it.
> Windows OSes use only the Joliet information; most Unix OSes use only
> Rock Ridge. Linux can use either one, but if both are present, it uses
> only Rock Ridge. (And ignores the iocharset= option in that case,
> since the on-disc character set is not known.)
Okay, makes sense.
My main conclusions for this thread are:

  - creating UDF images with genisoimage apparently embeds proper
    character set and encoding information (and actually uses UTF-8)
    without having to do anything special (it's enough that LC_CTYPE
    correctly describes how the files are encoded in the source
    filesystem);

  - the iocharset option when mounting filesystems describes the charset
    the kernel will convert *to*, not from, when performing system calls
    such as readdir(); therefore, it should match LC_CTYPE if one wants
    to see the file names correctly when using ls(1), for instance.

    I suppose that, for every filesystem that supports this 'iocharset'
    option (whose new name is 'nls' for ntfs, according to mount(8)),
    the "from" charset is well-defined (either fixed to something such
    as UTF-8, or embedded as metadata).

  - this way of configuring the output charset/encoding is not very
    satisfactory in a multi-user environment, since it is a per-mount
    setting, not per-user. Unfortunately, the obvious alternatives for
    having a multi-user setup working where users can specify different
    charsets/encodings through their LC_CTYPE environment variables
    raise difficult problems, due to backward-compatibility and to the
    fact that perfect conversion between charsets is not always possible
    (since not all characters can be represented in all charsets).
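The second conclusion can be sketched as a toy model in Python. This is not
kernel code: kernel_readdir() and user_view() are hypothetical names, and the
on-disc encoding is assumed to be big-endian UTF-16 (as with Joliet) purely
for illustration.

```python
def kernel_readdir(on_disc: bytes, disc_encoding: str, iocharset: str) -> bytes:
    """Toy model: convert from the on-disc encoding *to* iocharset."""
    return on_disc.decode(disc_encoding).encode(iocharset)


def user_view(name: bytes, lc_ctype_charset: str) -> str:
    """Toy model: how ls(1) and the terminal interpret the returned bytes."""
    return name.decode(lc_ctype_charset, errors="replace")


on_disc = "été".encode("utf-16-be")

# iocharset matches the user's LC_CTYPE charset: the name looks right.
good = user_view(kernel_readdir(on_disc, "utf-16-be", "iso-8859-1"), "iso-8859-1")

# iocharset=utf8 while the user runs an ISO-8859-1 locale: mojibake.
bad = user_view(kernel_readdir(on_disc, "utf-16-be", "utf-8"), "iso-8859-1")

print(good, bad)
```

Since iocharset is fixed at mount time, a second user with a different
LC_CTYPE always lands in the "bad" branch, which is the multi-user problem
from the third point.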
Thanks for your clear explanations.
Regards,
--
Florent