[Debburn-devel] What can I assume about libc?
Peter Samuelson
peter at p12n.org
Thu Oct 5 01:18:21 UTC 2006
[Lorenz Minder]
> But doing such a thing as mapping a character to two wchar_ts would
> subvert the very purpose of wchar_t, namely to map each "basic
> element" to a _single_ wchar_t.
You'd think so, wouldn't you?
> >as though it were simple UCS-2.
>
> Ok, I don't know what UCS-2 is. A quick google search tells me that
> "UCS-2 is a fixed-length (16 bits) subset of UTF-16, able to represent
> the basic multilingual plane only."
That's correct. UCS-2 can only represent the BMP (U+0000 - U+FFFF).
UTF-16 is identical to UCS-2 for the BMP, but also specifies a 4-byte
way to represent the rest of Unicode, U+10000 - U+10FFFF. Note also
that each half of a 4-byte UTF-16 character (a surrogate, in the range
U+D800 - U+DFFF) is not a valid character in UCS-2, so anything that
sufficiently validates its input won't be confused even if it assumes
UCS-2.
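To make the BMP / non-BMP split concrete, here is a minimal sketch of UTF-16 encoding in C. The function names utf16_encode() and is_surrogate() are made up for illustration and are not part of libc or of any existing Debburn code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if u is a UTF-16 surrogate code unit,
 * i.e. a value that strict UCS-2 validation would reject. */
int is_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}

/* Hypothetical helper: encode code point cp as UTF-16 into out[].
 * Returns the number of 16-bit units written (1 for the BMP, 2 for
 * U+10000..U+10FFFF), or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;               /* out of range, or a lone surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;  /* identical to UCS-2 within the BMP */
        return 1;
    }

    cp -= 0x10000;              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Both units of a two-unit result satisfy is_surrogate(), which is why a consumer that assumes UCS-2 but validates its input can at least detect such sequences rather than misread them.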
I think the story is this: NT4 uses UCS-2 (specifically little-endian).
Windows 2000 adds support for UTF-16 (UTF-16LE). The 16-bit wchar_t is
a relic from NT3/NT4 which can't be changed now without breaking the
Win32 ABI.
> The list above lacks u32_to_lc_ctype, though, which is also needed.
> I gather that would be equally trivial to do?
If we need LC_CTYPE, we can use wcstombs() and mbstowcs(), which are
standard C functions (in <stdlib.h> since C89, so also in C99) that use
wchar_t. However, I don't know whether anything
can be assumed about the structure of wchar_t, such that it can be used
non-opaquely by functions outside libc itself. It'd be nice if we
could assume wchar_t is really native-endian UCS-2 or UCS-4 (depending
on the size of the type), but I hesitate to do that. Also, I don't
know the availability of these functions on semi-modern platforms.
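As a rough sketch of the opaque approach, the following C code round-trips a string in the current LC_CTYPE encoding through wchar_t using only mbstowcs() and wcstombs(). The names roundtrip_lc_ctype() and wchar_is_iso10646() are hypothetical, and the fixed 256-element buffer is an arbitrary choice for the example. The caller is assumed to have called setlocale(LC_CTYPE, "") beforehand if anything other than the "C" locale is wanted. C99 does define the macro __STDC_ISO_10646__ for implementations whose wchar_t values are ISO 10646 code points, so checking it is one way to find out whether non-opaque use is safe on a given platform.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: 1 if the implementation promises (via the C99
 * macro __STDC_ISO_10646__) that wchar_t holds ISO 10646 code points,
 * 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: convert `in` (current LC_CTYPE encoding) to
 * wchar_t and back into out[0..outsz-1], treating wchar_t as opaque.
 * Returns 0 on success, -1 on an invalid sequence or a short buffer. */
int roundtrip_lc_ctype(const char *in, char *out, size_t outsz)
{
    wchar_t wide[256];  /* arbitrary limit for this sketch */
    const size_t wmax = sizeof wide / sizeof wide[0];
    size_t n;

    assert(in != NULL && out != NULL);

    n = mbstowcs(wide, in, wmax);
    if (n == (size_t)-1 || n >= wmax)
        return -1;      /* bad input, or no room for the terminator */

    n = wcstombs(out, wide, outsz);
    if (n == (size_t)-1 || n >= outsz)
        return -1;      /* unrepresentable, or output truncated */

    return 0;
}
```

A real u32_to_lc_ctype would still need a way to get from UCS-4 values to wchar_t, which is only straightforward when wchar_is_iso10646() returns 1 and wchar_t is wide enough for the code point.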