[Debburn-devel] What can I assume about libc?
Albert Cahalan
acahalan at gmail.com
Thu Oct 5 07:29:45 UTC 2006
On 10/4/06, Peter Samuelson <peter at p12n.org> wrote:
> [Lorenz Minder]
> > I had a look. On Win32, wchar_t is actually 16 bit, so there's
> > little risk of getting 4-byte chars.
>
> Since Windows uses UTF-16, there is certainly a theoretical possibility
> of having to handle 4-byte characters. I have to admit I don't know
> how Windows deals with this given a 2-byte wchar_t, but I assume it
> just splits such a character into two wchar_ts, as though it were
> simple UCS-2.
Java, Windows (including Joliet), and MacOS all come from a time
when it was claimed that 16 bits would be enough. These legacy
systems get hacked up with UTF-16 like this:
Characters from 0 to 0xffff get a 2-byte encoding, except that
0xd800 to 0xdfff is reserved for surrogates and holds no characters.
Characters from 0x10000 to 0x10ffff get a 4-byte encoding.
No other characters can be represented.
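Those rules can be sketched in C roughly as follows. This is only an illustration: utf16_encode is a made-up name, not anything from libc or Win32, and it uses uint16_t rather than wchar_t so that it behaves the same on every platform. Rejecting the surrogate range 0xd800 to 0xdfff is an assumption taken from the UTF-16 definition rather than from the rules as listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one character as UTF-16.
 * Writes 1 or 2 16-bit units to out and returns how many were written,
 * or 0 if cp cannot be represented. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogate values are not characters */
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;          /* 2-byte encoding */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                  /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
        return 2;                       /* 4-byte encoding */
    }
    return 0;                           /* beyond 0x10ffff: not representable */
}
```

For example, 0x1d11e (a musical clef symbol) comes out as the two units 0xd834 0xdd1e.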
A wchar_t is 2 bytes. Some characters require 2 wchar_t.
This does icky things with text processing functions.
In general, functions claim to work on characters but
actually work on wchar_t units.
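To make the unit-versus-character difference concrete, here is a small sketch that counts characters in a buffer of 16-bit units. The name utf16_count_chars is hypothetical, and counting an unpaired surrogate as one character is just this sketch's choice. A length function that works on units would simply return n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count characters (not units) in n UTF-16 units. */
size_t utf16_count_chars(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (is_high && i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;        /* skip the low half: the pair is one character */
        count++;        /* an unpaired surrogate is counted as one (sketch's choice) */
    }
    return count;
}
```

Given the four units 0x0041 0xd834 0xdd1e 0x0042, this returns 3, while a unit-based length would report 4.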
Still though... it's not hard to deal with UTF-16. If you see
a value in the high surrogate range (0xd800 to 0xdbff) followed
by one in the low surrogate range (0xdc00 to 0xdfff), you merge
the two into a single character held in a 32-bit value.
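A minimal sketch of that merge step in C, with the caveat that utf16_decode is a hypothetical name and that turning an unpaired surrogate into U+FFFD is an assumption of this sketch. Other policies, such as reporting an error, are equally reasonable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one character starting at s[0].
 * n is the number of units available and must be at least 1.
 * Stores the character in *cp and returns the units consumed (1 or 2). */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL && n > 0);
    uint16_t hi = s[0];
    if (hi >= 0xD800 && hi <= 0xDBFF && n >= 2) {
        uint16_t lo = s[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            /* Each surrogate carries 10 bits; add back the 0x10000 offset. */
            *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                             | (uint32_t)(lo - 0xDC00));
            return 2;
        }
    }
    if (hi >= 0xD800 && hi <= 0xDFFF) {
        *cp = 0xFFFD;   /* unpaired surrogate: replacement character (sketch's choice) */
        return 1;
    }
    *cp = hi;           /* ordinary 2-byte character */
    return 1;
}
```

Calling this in a loop and advancing by the return value walks a buffer one character at a time instead of one unit at a time.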