[Debburn-devel] Character sets in UDF

Florent Rougon f.rougon at free.fr
Wed Jan 24 15:47:20 CET 2007


Hi,

Many thanks for your helpful answer.

Peter Samuelson <peter at p12n.org> wrote:

> The kernel knows how to interpret filenames on the UDF filesystem; what
> it doesn't know is how you would like them presented to your process
> via syscalls like readdir() and open().  Your terminal settings and
> LC_CTYPE are per-process, not system-wide.  That is why the iocharset
> mount option exists: so you can tell the kernel what character set you
> are using.  Not so you can tell it what character set the DVD is using:
> it already knows that.

I have the impression that this design is broken, considering Unix is
multi-user. If the iocharset option indicates the charset the kernel
will use in readdir(), open() and friends, then this cannot work if
several users on the system use different charsets in LC_CTYPE.

What I would expect:

  root mounts /dev/foobar somewhere accessible to the users

  User 1: has an LC_CTYPE specifying ISO 8859-1

    readdir() and friends should return strings in ISO 8859-1, no?

  User 2: has an LC_CTYPE specifying UTF-8

    readdir() and friends should return strings in UTF-8, no?

For this to work, the kernel would have to read the LC_CTYPE value for
the calling process when doing a readdir() or an open(). But maybe the
kernel devs don't want to do that (a syscall whose behavior depends on
an environment variable), or simply don't want to mess with locales, I
don't know.
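
To make the scenario concrete, here is a rough Python sketch of the
per-process conversion I have in mind, done in user space rather than in
the kernel. It assumes the filesystem was mounted with one fixed
iocharset and that each process re-encodes the names it gets into its
own charset. The helper name reencode_name and both charset arguments
are hypothetical, just for illustration:

```python
def reencode_name(raw: bytes, mount_charset: str, user_charset: str) -> bytes:
    """Convert a filename from the mount-time charset to the user's charset.

    Characters that do not exist in the user's charset are replaced
    by "?", so the result is not always reversible.
    """
    text = raw.decode(mount_charset)
    return text.encode(user_charset, errors="replace")


# The same name "é" as returned with a UTF-8 iocharset...
name_utf8 = "é".encode("utf-8")            # two bytes: 0xC3 0xA9

# ...is what User 2 (UTF-8) expects, but User 1 (ISO 8859-1) expects
# the single byte 0xE9 instead.
name_latin1 = reencode_name(name_utf8, "utf-8", "iso8859-1")
```

Of course this only moves the problem to every application, which is
why I would have expected it to be handled at a lower level.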

> Now this is interesting - that should not have worked.  The actual
> parameter you want is "iocharset=iso8859-15".
>
> The reason it worked is that you made a typo: you said "utf-8" instead
> of "utf8".  So it failed to load it, and instead loaded the default NLS
> map, which is a kernel config option (CONFIG_NLS_DEFAULT) and in your
> case is probably set to either "iso8859-1" or "iso8859-15".

Exactly. Well spotted! The dmesg output confirms that "utf-8" wasn't
recognized, and my kernel was compiled with
CONFIG_NLS_DEFAULT="iso8859-15", as you guessed.

I then tried mounting with iocharset=iso8859-15, and it does work.
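
For the record, the spelling trap is that the kernel NLS names drop a
hyphen ("utf8", "iso8859-15") while locale codesets are usually written
"UTF-8" or "ISO-8859-15". A minimal Python sketch of the mapping,
assuming only the names mentioned in this thread matter; nls_name is a
made-up helper, not something mount or the kernel provides:

```python
import re


def nls_name(codeset: str) -> str:
    """Map a locale-style codeset name to the kernel-style iocharset name.

    Only handles UTF-8 and the ISO 8859 family; anything else is
    returned lowercased and otherwise unchanged (an assumption).
    """
    name = codeset.strip().lower()
    if name in ("utf-8", "utf8"):
        return "utf8"
    match = re.fullmatch(r"iso[-_]?8859[-_]?(\d+)", name)
    if match:
        return "iso8859-" + match.group(1)
    return name


# Build the mount option from a locale-style name.
option = "iocharset=" + nls_name("ISO-8859-15")
```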

> If you think 'mount' should automatically parse LC_CTYPE and pass the
> appropriate iocharset= parameter to the kernel, you should take that up
> with the util-linux people.

I don't think it's the right thing to do, again because a mount can very
well be done by root for *several* users who use different charsets...

For the multi-user scenario to work, the charset and encoding to use for
filenames should not be determined at mount time, but whenever a process
accesses the filesystem.

> Same issue with ISO-9660, in fact it's even worse with Rock Ridge: UDF
> and Joliet have a well-defined character set (and the iocharset=
> parameter), but I _believe_ Rock Ridge does not - it just stores
> filenames with no reference to character set.

Well, for Rock Ridge, the question is addressed on page 6 of the 1.12
draft, which can be downloaded at:

  ftp://ftp.ymi.com/pub/rockridge/rrip112.ps

... but I can't give you a clear answer, because "it depends", and they
refer to "the portable filename character set as defined in
POSIX:2.2.2.60" (grmpf).

Anyway, I'm pretty sure it's possible with genisoimage to combine Joliet
and RR: the former for filenames, the latter for Unix permissions,
symlinks, etc. (but I don't know what happens in this case if both
extensions specify different names for the same file...).
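
If I read the man page correctly, combining the two is just a matter of
passing both -J (Joliet) and -R (Rock Ridge) to genisoimage. A small
Python sketch that builds such a command line; the output and directory
names are placeholders, and genisoimage_argv is a hypothetical helper:

```python
def genisoimage_argv(output_iso: str, source_dir: str) -> list[str]:
    """Build a genisoimage command line with both Joliet and Rock Ridge."""
    return [
        "genisoimage",
        "-J",              # generate Joliet directory records
        "-R",              # generate Rock Ridge (SUSP/RRIP) records
        "-o", output_iso,  # write the image to this file
        source_dir,
    ]


argv = genisoimage_argv("image.iso", "tree")
# To actually run it: subprocess.run(argv, check=True)
```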

Regards,

-- 
Florent


