[Debburn-devel] What can I assume about libc?

Albert Cahalan acahalan at gmail.com
Thu Oct 5 07:29:45 UTC 2006


On 10/4/06, Peter Samuelson <peter at p12n.org> wrote:
> [Lorenz Minder]
> > I had a look.  On Win32, wchar_t is actually 16 bit, so there's
> > little risk of getting 4-byte chars.
>
> Since Windows uses UTF-16, there is certainly a theoretical possibility
> of having to handle 4-byte characters.  I have to admit I don't know
> how Windows deals with this given a 2-byte wchar_t, but I assume it
> just splits such a character into two wchar_ts, as though it were
> simple UCS-2.

Java, Windows (including Joliet), and MacOS all come from a time
when it was claimed that 16 bits would be enough. These legacy
systems get hacked up with UTF-16 like this:

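Characters from 0 to 0xffff get a 2-byte encoding, except that
0xd800 to 0xdfff is reserved for surrogates and is not a character.
Characters from 0x10000 to 0x10ffff get a 4-byte encoding, as a
pair of 2-byte surrogates.
No other characters can be represented.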
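
As a rough sketch of those three rules, assuming the standard UTF-16
surrogate layout (the function name utf16_encode is made up for this
example, it is not from libc or any Win32 API):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Writes 1 or 2 16-bit units to out and returns how many were written.
 * Returns 0 if the value cannot be represented. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogate values are not characters */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* 2-byte encoding */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                       /* beyond what UTF-16 can express */
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;                           /* 4-byte encoding */
}
```

For example, U+1D11E comes out as the two units 0xd834 0xdd1e.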

On these systems a wchar_t is 2 bytes. Some characters require 2 wchar_t.
This does icky things with text processing functions.
In general, functions claim to work on characters but
actually work on wchar_t units.
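
To illustrate the difference between units and characters, here is a
small sketch that counts characters in a UTF-16 buffer. It uses
uint16_t rather than wchar_t so that it behaves the same where
wchar_t is 4 bytes, and the name utf16_count_chars is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count characters in a buffer of UTF-16 units.
 * A high surrogate followed by a low surrogate counts as one character;
 * every other unit, including an unpaired surrogate, counts as one. */
size_t utf16_count_chars(const uint16_t *s, size_t units)
{
    size_t i = 0, chars = 0;

    while (i < units) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < units &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;                     /* surrogate pair: one character */
        else
            i += 1;                     /* ordinary unit */
        chars++;
    }
    return chars;
}
```

Given the three units 0x0041 0xd834 0xdd1e, this reports 2 characters,
while a function that just counts units reports 3.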

Still though... it's not hard to deal with UTF-16. If you see
a value in the high surrogate range (0xd800 to 0xdbff) followed
by one in the low surrogate range (0xdc00 to 0xdfff), you merge
them into a single 32-bit character.
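
A minimal sketch of that merge, assuming the input is a buffer of
16-bit units. The name utf16_decode is made up, and mapping an
unpaired surrogate to U+FFFD is just one possible policy, not
something the discussion above settled on:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one character from s[0..units).
 * Stores the code point in *cp and returns the number of units
 * consumed (1 or 2), or 0 if the buffer is empty.
 * Assumption: an unpaired surrogate is replaced with U+FFFD. */
size_t utf16_decode(const uint16_t *s, size_t units, uint32_t *cp)
{
    if (units == 0)
        return 0;

    uint16_t hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {
        *cp = hi;                       /* not a surrogate: use as is */
        return 1;
    }
    if (hi <= 0xDBFF && units >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        /* 10 bits from each half, then add back the 0x10000 offset */
        *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
                         (uint32_t)(s[1] - 0xDC00));
        return 2;
    }
    *cp = 0xFFFD;                       /* unpaired surrogate */
    return 1;
}
```

Calling this in a loop and advancing by the return value walks a
buffer one character at a time instead of one unit at a time.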


