[Debburn-devel] Character sets in UDF

Florent Rougon f.rougon at free.fr
Tue Jan 23 21:43:31 CET 2007


Hi Peter,

Thanks for your answer.

Peter Samuelson <peter at p12n.org> wrote:

> There _is_ metadata for each filename's encoding, a byte that is either
> 0x08 or 0x16.  The spec is not explicit about what the bytes mean, but
> from context - particularly the mention of the Unicode byte-order mark
> 0xfeff / 0xfffe - I gather that 8 means ASCII and 16 means UCS-2BE.  I
> just checked the Linux kernel source, and it agrees with me, for both
> reading and writing.

Looking at the spec, I still find it very unclear (presumably because
I'm missing some background; e.g., what is a d-character? It doesn't
seem to be defined in the spec). Why they had to invent an "OSTA CS0
character set", an "OSTA Compressed Unicode format" etc. while we have
Unicode and its standardized encoding schemes is a mystery to me.
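
For what it's worth, Peter's reading of the encoding byte can be written
down as a short Python sketch. It is an illustration under assumptions,
not a reference decoder: the function name decode_osta_cs0 is made up
here, the ID values are taken to be decimal 8 and 16 (i.e. 0x08 and
0x10), and 8-bit characters are assumed to be Unicode code points 0-255
(effectively Latin-1, of which ASCII is a subset). Python has no UCS-2
codec, so UTF-16BE is used for the 16-bit case; the two agree for code
points up to 0xffff.

```python
def decode_osta_cs0(raw: bytes) -> str:
    """Decode a name stored as <compression ID byte><character data>.

    Hypothetical helper: the layout and the meaning of the ID byte are
    assumptions based on the discussion, not a verified reading of the spec.
    """
    if not raw:
        return ""
    comp_id, payload = raw[0], raw[1:]
    if comp_id == 8:
        # Assumed: one byte per character, each byte a code point 0..255.
        return payload.decode("latin-1")
    if comp_id == 16:
        # Assumed: two bytes per character, big-endian (UCS-2BE).
        return payload.decode("utf-16-be")
    raise ValueError(f"unexpected compression ID: {comp_id}")


print(decode_osta_cs0(b"\x08abc"))
print(decode_osta_cs0(b"\x10\x00v\x00o\x00i\x00l\x00\xe0"))
```

If this reading is right, a reader of the filesystem never has to guess
the charset of a name: the ID byte alone says how to turn the bytes into
Unicode.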

> Note also that for _writing_ filenames, having a target encoding
> (UCS-2BE) is not enough, you also need to specify a normalization form.
> The UDF errata document DCN-2157 recommends Normalization Form C (NFC).
> Almost everyone on Linux uses NFC already, so that should not be a
> problem.

'kay...
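
To make the normalization point concrete, here is a minimal Python
sketch using the standard unicodedata module. The sample strings are
invented for illustration, and UTF-16BE stands in for UCS-2BE (they are
identical for these characters).

```python
import unicodedata

# The same visible name, typed in decomposed form:
# 'a' followed by U+0300 COMBINING GRAVE ACCENT.
decomposed = "voila\u0300"

# NFC merges the base letter and the combining accent into U+00E0.
composed = unicodedata.normalize("NFC", decomposed)

# The two forms differ as code point sequences, hence as stored bytes.
bytes_decomposed = decomposed.encode("utf-16-be")
bytes_composed = composed.encode("utf-16-be")

print(len(decomposed), len(composed))
print(len(bytes_decomposed), len(bytes_composed))
```

Without an agreed normalization form, two names that look identical
could be stored as different byte sequences, which is presumably why
the errata document picks one.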

> It's possible that some OSes now use UTF-16BE instead of UCS-2BE, but
> that only matters if your filenames include Unicode characters with
> codepoints higher than 0xffff, which is rare.  The Linux UDF driver
> (the only driver whose source code I have handy) uses UCS-2BE.

Okay. To me, this doesn't matter as I only need ISO-8859-15.
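
A quick way to check that claim, sketched in Python: every character of
ISO-8859-15 has a code point at or below 0xffff, so UCS-2BE and UTF-16BE
encode it identically. The helper name fits_in_ucs2 is made up for this
example, and the musical symbol is only there to show what a code point
above 0xffff looks like.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every character fits in a single 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# All 256 characters that ISO-8859-15 can represent.
latin9 = bytes(range(256)).decode("iso-8859-15")
print(fits_in_ucs2(latin9))

# A character above 0xffff (U+1D11E MUSICAL SYMBOL G CLEF) needs a
# surrogate pair in UTF-16, i.e. four bytes instead of two.
g_clef = "\U0001D11E"
print(fits_in_ucs2(g_clef), len(g_clef.encode("utf-16-be")))
```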

*But*... if the charset and encoding for filenames are well specified in
the UDF specs, as you seem to think, then I should be able to mount a
UDF filesystem without specifying the charset. Which is not the case:

# growisofs -Z /dev/dvdrecorder -udf test-dvd

[...]

I: -input-charset not specified, using iso-8859-15 (detected in locale settings)

  -> this is correct

[...]

 99.09% done, estimate finish Tue Jan 23 19:01:05 2007
Total translation table size: 0
Total rockridge attributes bytes: 0
Total directory bytes: 0
Path table size(bytes): 10
Max brk space used 21000
499537 extents written (975 MB)
builtin_dd: 499552*2KB out @ average 3.9x1352KBps
/dev/dvdrecorder: flushing cache
/dev/dvdrecorder: stopping de-icing
/dev/dvdrecorder: writing lead-out
:-[ CLOSE SESSION (but try to continue) failed with SK=2h/ASC=04h/ACQ=07h]: Resource temporarily unavailable
:-[ CLOSE SESSION (but try to continue) failed with SK=2h/ASC=04h/ACQ=07h]: Resource temporarily unavailable
:-[ CLOSE SESSION (but try to continue) failed with SK=2h/ASC=04h/ACQ=07h]: Resource temporarily unavailable
:-[ CLOSE SESSION (but try to continue) failed with SK=2h/ASC=04h/ACQ=07h]: Resource temporarily unavailable
:-[ CLOSE SESSION (but try to continue) failed with SK=2h/ASC=04h/ACQ=07h]: Resource temporarily unavailable

[...]

:-[ CLOSE SESSION (but try to continue) failed with SK=2h/ASC=04h/ACQ=07h]: Resource temporarily unavailable

-> If you can comment on these messages (whether they are harmless, or
   how to get rid of them), please do so!

# mount /dev/dvd
# ls -l /media/dvd0
total 998238
-r--r--r--  1 4294967295 4294967295 1022195172 2007-01-22 22:03 Test avec un nom un peu long comportant des mots accentuÃ©s, voilÃ .foobar

  -> This is UTF-8 interpreted as ISO-8859-15 (my LC_CTYPE)
     (checked with iconv)
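
The same check can be reproduced without iconv. This is a minimal
Python sketch, using only the accented part of the file name: it encodes
the correct text as UTF-8, decodes those bytes as ISO-8859-15 to get
what a Latin-9 terminal displays, then reverses the process.

```python
name = "accentués, voilà"

# What an ISO-8859-15 terminal shows when it is handed UTF-8 bytes.
garbled = name.encode("utf-8").decode("iso-8859-15")

# Reversing the mistake recovers the original name.
restored = garbled.encode("iso-8859-15").decode("utf-8")

print(repr(garbled))
print(restored == name)
```

Each accented letter becomes two characters in the garbled form, since
UTF-8 uses two bytes for it and ISO-8859-15 maps every byte to one
character.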

# mount -o iocharset=utf-8 /dev/dvd
# ls -l /media/dvd0
total 998238
-r--r--r--  1 4294967295 4294967295 1022195172 2007-01-22 22:03 Test avec un nom un peu long comportant des mots accentués, voilà.foobar

  -> This is correct.

I'm disappointed. If the charset and encoding for filenames were
correctly specified in the UDF filesystem, I shouldn't need to pass any
iocharset option to 'mount'.

I tested the same DVD+RW in Windows 2000. The file name was displayed
correctly, with the spaces and accented characters.

Since there is the 2 GB problem mentioned by Eduard, I'll try ISO-9660
with Joliet and Rock Ridge extensions and see if it behaves better wrt
charsets in file names.

But I find it very sad that we are still unable in 2007 to manage this
charset thing correctly with UDF, which I imagined to be a modern
filesystem (unless I'm doing something wrong). :-(

Regards,

-- 
Florent


