[Debburn-devel] What can I assume about libc?
Lorenz Minder
lminder at gmx.net
Wed Oct 4 23:26:22 UTC 2006
Hi,
Peter Samuelson wrote:
> [Lorenz Minder]
> > I had a look. On Win32, wchar_t is actually 16 bit, so there's
> > little risk of getting 4-byte chars.
>
> Since Windows uses UTF-16, there is certainly a theoretical possibility
> of having to handle 4-byte characters. I have to admit I don't know
> how Windows deals with this given a 2-byte wchar_t, but I assume it
> just splits such a character into two wchar_ts
But mapping one character to two wchar_ts would subvert the very
purpose of wchar_t, namely to map each "basic element" to a _single_
wchar_t.
That, at least, is what I thought wchar_t was here for, and how I read
the docs.
> as though it were
> simple UCS-2.
OK, I don't know what UCS-2 is. A quick Google search tells me that
"UCS-2 is a fixed-length (16 bits) subset of UTF-16, able to represent
the basic multilingual plane only."
Is that wrong?
It's still possible that Windows just can't handle the glyphs of the
Phaistos Disc, for example, and that any character in Windows can
therefore be represented with 16 bits.
> For the purpose of UTF-8 conversion, treating two halves of a 32-bit
> character as two separate characters will spectacularly do the wrong
> thing.
Absolutely. _If_ it behaves that way.
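
To make the concern concrete, here is a minimal sketch in C of how a
character outside the basic multilingual plane ends up as two 16-bit
units in UTF-16, and how the two halves have to be recombined before any
further conversion. The function names follow the list quoted below, but
the signatures and return conventions are assumptions made for
illustration; this says nothing about what Windows itself does.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from UTF-16 code units.
 * Returns the number of units consumed (1 or 2), or 0 on a lone,
 * misordered, or truncated surrogate. Hypothetical signature. */
size_t utf16_to_u32(const uint16_t *in, size_t len, uint32_t *cp)
{
    if (len == 0)
        return 0;
    if (in[0] < 0xD800 || in[0] > 0xDFFF) {
        *cp = in[0];            /* BMP character: one unit is enough */
        return 1;
    }
    /* A high surrogate must be followed by a low surrogate. */
    if (in[0] > 0xDBFF || len < 2 || in[1] < 0xDC00 || in[1] > 0xDFFF)
        return 0;
    *cp = 0x10000 + (((uint32_t)(in[0] - 0xD800) << 10)
                     | (uint32_t)(in[1] - 0xDC00));
    return 2;
}

/* Encode one code point as UTF-16. Returns units written (1 or 2),
 * or 0 if cp is a surrogate value or lies above U+10FFFF. */
size_t u32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;              /* 20 bits remain, split 10 + 10 */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For example, U+1D11E becomes the two units 0xD834 0xDD1E. Converting
each unit to UTF-8 on its own would produce two 3-byte sequences rather
than the single 4-byte sequence the character should have.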
> > > Custom functions may be best:
> > >
> > > utf8_to_u32
> > > u32_to_utf8
> > > u32_to_utf16
> > > utf16_to_u32
> >
> > Or we can just use libiconv instead for this purpose, which
> > apparently also exists for Windows.
>
> Well, those 4 functions are utterly trivial to write,
I'm very happy to learn this. If that is the case, we can do that.
The list above lacks u32_to_lc_ctype, though, which is also needed. I
gather that would be equally trivial to do?
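
For reference, a rough sketch in C of what the two UTF-8 functions from
the list might look like when they do not validate their input. The
names come from the list above; the signatures, the return convention
(bytes written or consumed, 0 on failure) and the buffer sizes are
assumptions of mine, not an agreed interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns bytes written (1..4),
 * or 0 if cp is above U+10FFFF. Nothing else is checked. */
size_t u32_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode one code point from UTF-8 without validating it: the lead
 * byte picks the length and the following bytes are taken on trust.
 * Returns bytes consumed, or 0 if the lead byte cannot start a
 * sequence or the input is too short. */
size_t utf8_to_u32(const unsigned char *in, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (in[0] < 0x80)                { n = 1; c = in[0]; }
    else if ((in[0] & 0xE0) == 0xC0) { n = 2; c = in[0] & 0x1F; }
    else if ((in[0] & 0xF0) == 0xE0) { n = 3; c = in[0] & 0x0F; }
    else if ((in[0] & 0xF8) == 0xF0) { n = 4; c = in[0] & 0x07; }
    else
        return 0;
    if (len < n)
        return 0;
    for (i = 1; i < n; i++)
        c = (c << 6) | (in[i] & 0x3F);   /* low 6 bits of each byte */
    *cp = c;
    return n;
}
```

With these, U+20AC encodes to the three bytes E2 82 AC and decodes back
to the same value. The sketch does not cover u32_to_lc_ctype, which
depends on the locale's character set.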
> much easier than
> dealing with iconv (and its platform availability) - _if_ we don't have
> to fully validate our input. Completely validating a stream of Unicode
> (be it UTF-8 or UTF-16) is a whole other story. I'm not certain
> whether even iconv bothers to do _that_.
If that just means that garbage input results in garbage output, then I
guess this would be fine. Or is there more to it?
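
As a rough idea of what stricter checking of UTF-8 would add on top of
a trusting decoder, here is a sketch in C. It rejects bad continuation
bytes, overlong encodings, encoded surrogates and values above U+10FFFF.
The name utf8_to_u32_strict and its signature are hypothetical, and this
covers only the well-formedness of a single sequence, not everything
that "completely validating" a stream might mean.

```c
#include <stddef.h>
#include <stdint.h>

/* Strictly decode one UTF-8 sequence. Returns bytes consumed (1..4),
 * or 0 if the sequence is not well-formed. Hypothetical name. */
size_t utf8_to_u32_strict(const unsigned char *in, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (in[0] < 0x80) {
        *cp = in[0];
        return 1;
    }
    if ((in[0] & 0xE0) == 0xC0)      { n = 2; c = in[0] & 0x1F; }
    else if ((in[0] & 0xF0) == 0xE0) { n = 3; c = in[0] & 0x0F; }
    else if ((in[0] & 0xF8) == 0xF0) { n = 4; c = in[0] & 0x07; }
    else
        return 0;               /* stray continuation byte or 0xF8..0xFF */
    if (len < n)
        return 0;               /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;           /* not a 10xxxxxx continuation byte */
        c = (c << 6) | (in[i] & 0x3F);
    }
    if (c < min_cp[n])
        return 0;               /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;               /* UTF-16 surrogate, not a character */
    if (c > 0x10FFFF)
        return 0;               /* beyond the Unicode range */
    *cp = c;
    return n;
}
```

One case where the difference shows: a decoder that skips these checks
turns the two bytes C0 AF into '/', so garbage input can come out as an
ordinary-looking ASCII character rather than as visible garbage. The
strict version above rejects that sequence.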
--Lorenz