[Debburn-devel] Character sets in UDF
Peter Samuelson
peter at p12n.org
Sun Jan 21 23:01:04 CET 2007
[Florent Rougon]
> I'm not sure what you mean with "If you know the encoding". If you're
> talking about my files: yes, sure, I know the encoding. But if you're
> talking about a random UDF filesystem burnt to DVD, I don't know if
> it's possible, and that's a big part of my question.
There _is_ metadata for each filename's encoding, a byte whose value is
either 8 or 16 (0x08 or 0x10). The spec is not explicit about what the
values mean, but from context - particularly the mention of the Unicode
byte-order mark 0xfeff / 0xfffe - I gather that 8 means ASCII and 16
means UCS-2BE. I just checked the Linux kernel source, and it agrees
with me, for both reading and writing.
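
For illustration, here is a rough Python sketch of how a reader might
apply that byte. It assumes the encoding byte is the first byte of the
stored name and that its value is decimal 8 or 16. The function name
decode_udf_name is made up for this example. Decoding the 8-bit case as
Latin-1 is also an assumption; it gives the same result as ASCII for
bytes below 0x80.

```python
def decode_udf_name(raw: bytes) -> str:
    """Decode a stored name whose first byte says how wide the characters are.

    Assumption: raw[0] is the encoding byte (8 or 16) and the rest is
    the name itself. This is a sketch, not the kernel's implementation.
    """
    if not raw:
        return ""
    encoding_byte, payload = raw[0], raw[1:]
    if encoding_byte == 8:
        # One byte per character; identical to ASCII for bytes < 0x80.
        return payload.decode("latin-1")
    if encoding_byte == 16:
        # Two bytes per character, big-endian. For characters up to
        # 0xffff, UCS-2BE and UTF-16BE are the same byte sequence.
        return payload.decode("utf-16-be")
    raise ValueError(f"unexpected encoding byte: {encoding_byte}")
```

For example, decode_udf_name(b"\x08abc") gives "abc", and a payload of
b"\x00a\x00\xe9" behind a 16 gives "a" followed by e-acute.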
Note also that for _writing_ filenames, having a target encoding
(UCS-2BE) is not enough; you also need to specify a normalization form.
The UDF errata document DCN-2157 recommends Normalization Form C (NFC).
Almost everyone on Linux uses NFC already, so that should not be a
problem.
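
A minimal sketch of the writing side, assuming you normalize to NFC
first and then pick the narrowest encoding that fits. The name
encode_udf_name is hypothetical, and the choice to reject characters
above 0xffff (rather than emit surrogates) is mine, made to stay within
UCS-2.

```python
import unicodedata


def encode_udf_name(name: str) -> bytes:
    """Normalize to NFC, then encode with a leading encoding byte (8 or 16).

    Sketch only: field lengths and padding are not handled here.
    """
    nfc = unicodedata.normalize("NFC", name)
    if all(ord(ch) < 0x80 for ch in nfc):
        # Pure ASCII fits in one byte per character.
        return bytes([8]) + nfc.encode("ascii")
    if any(ord(ch) > 0xFFFF for ch in nfc):
        # Not representable in UCS-2 (assumption: refuse rather than guess).
        raise ValueError("character above 0xffff cannot be stored as UCS-2")
    # Below 0x10000, UTF-16BE output is byte-for-byte UCS-2BE.
    return bytes([16]) + nfc.encode("utf-16-be")
```

With this, "e" followed by a combining acute accent (U+0301) is stored
as the single precomposed character U+00E9, the same bytes you get from
typing the precomposed form directly.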
It's possible that some OSes now use UTF-16BE instead of UCS-2BE, but
that only matters if your filenames include Unicode characters with
codepoints higher than 0xffff, which is rare. The Linux UDF driver
(the only driver whose source code I have handy) uses UCS-2BE.
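
To make the UCS-2 / UTF-16 difference concrete, this small sketch lists
the 16-bit units Python's UTF-16BE encoder produces. The helper name
utf16be_units is made up for the example. Below 0x10000 there is one
unit per character, which is exactly what UCS-2BE would store; above it,
UTF-16 uses a surrogate pair that a strict UCS-2 reader would see as two
separate (invalid) characters.

```python
def utf16be_units(text: str) -> list[int]:
    """Return the big-endian 16-bit code units of text's UTF-16 encoding."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A character below 0x10000: one unit, same as UCS-2.
bmp_units = utf16be_units("\u00e9")
# U+1D11E (musical G clef) is above 0xffff: two surrogate units.
astral_units = utf16be_units("\U0001D11E")
```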