[Debburn-devel] Character sets in UDF and other filesystems

Fri Jan 26 23:54:48 CET 2007

Hi,

Peter Samuelson <peter at p12n.org> wrote:

> Yes, it is a limitation.  I think this is one of the reasons Linux
> distributions have been trying to push users towards UTF-8 in the past
> few years.  For that matter, some apps (Gtk+) seem to believe all
> filenames are UTF-8 regardless of their LC_CTYPE - I suppose that's
> another solution.

Yes, provided that everyone is satisfied with UTF-8, which I believe is
not the case, unfortunately (it seems Unicode isn't very popular in
East Asia).

> Win32 solves the problem by having syscalls that use UTF-16LE
> unconditionally.

Hmmm... so, CJK people probably have problems there?

> The Linux kernel developers would never go for that.  The kernel
> doesn't and shouldn't know anything about parsing LANG and LC_*
> variables; that's purely a userspace concern.

I can understand that.

> What would be possible instead, noting that every system call is
> actually a thin wrapper function in libc6, would be for libc6 itself to
> do these translations inside the file access syscalls.  However, that
> too is a hard sell, for several reasons:

[...]

>  - What should readdir() do if the LC_CTYPE charset cannot represent a
>    particular filename?  If it returns anything, the app will expect to
>    be able to stat() whatever filename it returns.

That is indeed a real problem. :-|
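
To make the failure mode concrete, here is a minimal Python sketch. It
only illustrates the representability question; it is not what libc6 or
any real readdir() wrapper does. The helper name representable() and the
sample file name are my own assumptions.

```python
def representable(name: str, charset: str) -> bool:
    """Return True if every character of `name` can be encoded in `charset`."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical file name, stored on disk as UTF-8 bytes.
on_disk = "\u65e5\u672c\u8a9e.txt".encode("utf-8")  # Japanese for "Japanese language"
name = on_disk.decode("utf-8")

# A translating readdir() would have to answer this question for each
# entry, once per user locale.
for charset in ("utf-8", "euc-jp", "iso-8859-1"):
    print(charset, representable(name, charset))
```

When the answer is False, the wrapper has nothing sensible to return:
any substituted name could not be passed back to stat().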

>  - In the general case, Unix filesystems do not have locale information
>    in them.  The only exceptions supported by Linux are vfat, ntfs,
>    iso9660 (with Joliet), udf, and jfs.  (Note all of these except udf
>    came from Windows or OS/2.)  Most Unix filesystems abide by the
>    philosophy that a filename is just a string of bytes that can
>    include any byte except "/" or NUL.  Thus a filename's character set
>    is simply whatever the app that created the file used.

I don't like that, but as you just pointed out, carrying filenames
together with charset/encoding metadata from one environment to another
that uses a different charset/encoding can be problematic when some
characters aren't representable in both charsets. Tough problem.
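
A small Python sketch of the "just a string of bytes" point: the same
stored bytes read differently depending on which charset the reader
assumes. The function name interpret() and the sample name are
hypothetical, chosen only for illustration.

```python
def interpret(raw: bytes, charset: str) -> str:
    """Decode raw file-name bytes as a reader using `charset` would see them."""
    # errors="replace" stands in for whatever placeholder a tool would print.
    return raw.decode(charset, errors="replace")


# A name created by an application running in an ISO-8859-1 locale.
raw = "caf\u00e9".encode("iso-8859-1")  # the bytes b'caf\xe9'

# The filesystem stores no charset, so each reader guesses from its own locale.
for charset in ("iso-8859-1", "utf-8", "koi8-r"):
    print(charset, repr(interpret(raw, charset)))
```

Only the reader that happens to use the creator's charset sees the
intended name.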

> Right - but maybe better than nothing.  In the common case, I suspect
> all users on a given system _are_ using the same charset.  And
> particularly with removable media, it's often mounted by a non-root
> user, and only that user is really interested in it.

Agreed.

> The portable filename character set is, I think, a subset of ASCII that
> excludes not only "/" and NUL but several other bytes that can be
> problematic on some OSes, like ":".  But when Rock Ridge (and POSIX)
> talk about a "character set", they really mean a "set of bytes"; there
> is no specific implied mapping between bytes and characters.

Okay.
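
For reference, my understanding is that POSIX defines the portable
filename character set as the ASCII letters, the digits, and the three
characters ".", "_" and "-", and also advises against a leading hyphen.
A minimal Python check under that assumption (is_portable() is a
hypothetical helper name, not an existing API):

```python
import string

# Portable filename character set as I understand POSIX to define it:
# A-Z, a-z, 0-9, period, underscore, hyphen-minus.
PORTABLE = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable(name: str) -> bool:
    """Return True if `name` uses only portable characters.

    Also rejects empty names and names starting with a hyphen, which
    POSIX advises against for portable filenames.
    """
    if not name or name.startswith("-"):
        return False
    return all(ch in PORTABLE for ch in name)


for candidate in ("README.txt", "my:file", "r\u00e9sum\u00e9.txt", "-rf"):
    print(repr(candidate), is_portable(candidate))
```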

> Right, "genisoimage -r -J" is fully supported and I usually use it.
> Windows OSes use only the Joliet information; most Unix OSes use only
> Rock Ridge.  Linux can use either one, but if both are present, it uses
> only Rock Ridge.  (And ignores the iocharset= option in that case,
> since the on-disc character set is not known.)

Okay, makes sense.

My main conclusions for this thread are:
  - creating UDF images with genisoimage apparently embeds proper
    character set and encoding information (and actually uses UTF-8)
    without having to do anything special (it's enough that LC_CTYPE
    correctly describes how the files are encoded in the source
    filesystem);

  - the iocharset option when mounting filesystems describes the charset
    the kernel will convert *to*, not from, when performing system calls
    such as readdir(); therefore, it should match LC_CTYPE if one wants
    to see the file names correctly when using ls(1), for instance.

    I suppose that, for every filesystem that supports this 'iocharset'
    option (whose new name is 'nls' for ntfs, according to mount(8)),
    the "from" charset is well-defined (either fixed to something such
    as UTF-8, or embedded as metadata).

  - this way of configuring the output charset/encoding is not very
    satisfactory in a multi-user environment, since it is a per-mount
    setting, not per-user. Unfortunately, the obvious alternatives for
    having a multi-user setup working where users can specify different
    charsets/encodings through their LC_CTYPE environment variables
    raise difficult problems, due to backward compatibility and to the
    fact that perfect conversion between charsets is not always possible
    (since not all characters can be represented in all charsets).
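
The second and third points can be sketched in Python. This is only a
model of the behaviour as I understand it, not kernel code: I assume a
Joliet-style filesystem whose on-disc names are big-endian UCS-2, and
the function names kernel_readdir() and user_sees() are made up for the
illustration.

```python
def kernel_readdir(on_disc: bytes, iocharset: str) -> bytes:
    """Model the mount-time conversion: fixed on-disc encoding -> iocharset."""
    # The "from" side is fixed by the filesystem format (assumed UTF-16BE
    # here); iocharset only selects the "to" side.
    return on_disc.decode("utf-16-be").encode(iocharset, errors="replace")


def user_sees(name_bytes: bytes, lc_ctype_charset: str) -> str:
    """Model a tool such as ls(1) decoding names according to LC_CTYPE."""
    return name_bytes.decode(lc_ctype_charset, errors="replace")


original = "\u00e9t\u00e9.txt"  # French for "summer", plus an extension
on_disc = original.encode("utf-16-be")

# One mount has one iocharset, but users may have different LC_CTYPE values.
for iocharset in ("utf-8", "iso-8859-1"):
    for locale_charset in ("utf-8", "iso-8859-1"):
        shown = user_sees(kernel_readdir(on_disc, iocharset), locale_charset)
        print(iocharset, locale_charset, repr(shown), shown == original)
```

The name only comes out right when iocharset and the user's LC_CTYPE
charset agree, which is why a per-mount setting cannot suit two users
with different locales at the same time.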

Thanks for your clear explanations.

Regards,

-- 
Florent