[Debburn-devel] What can I assume about libc?

Lorenz Minder lminder at gmx.net
Wed Oct 4 23:26:22 UTC 2006


Hi,

Peter Samuelson wrote:
> [Lorenz Minder]
> > I had a look.  On Win32, wchar_t is actually 16 bit, so there's
> > little risk of getting 4-byte chars.
> 
> Since Windows uses UTF-16, there is certainly a theoretical possibility
> of having to handle 4-byte characters.  I have to admit I don't know
> how Windows deals with this given a 2-byte wchar_t, but I assume it
> just splits such a character into two wchar_ts

But mapping a character to two wchar_ts would subvert the very purpose
of wchar_t, namely to map each "basic element" to a _single_ wchar_t.

That, at least, is what I thought wchar_t was there for, and how I read
the docs.

> as though it were
> simple UCS-2.

OK, I don't know what UCS-2 is.  A quick Google search tells me that
"UCS-2 is a fixed-length (16 bits) subset of UTF-16, able to represent
the basic multilingual plane only."

Is that wrong?

It's still possible that Windows just can't handle the glyphs of the
Phaistos disc, for example, and that any character in Windows can
therefore be represented with 16 bits.

> For the purpose of UTF-8 conversion, treating two halves of a 32-bit
> character as two separate characters will spectacularly do the wrong
> thing.

Absolutely. _If_ it behaves that way.
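
To make the concern concrete, here is a rough sketch in C of how a code
point outside the BMP splits into two 16-bit units and how the pair
recombines.  The function names follow the list quoted further down; the
signatures, the fixed-width types, and the choice to pass a lone
surrogate through unchanged are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.  Returns the number of 16-bit units
 * written to out (1 or 2).  Surrogate code points are not rejected. */
size_t u32_to_utf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}

/* Decode one code point from in[0..n).  Stores it in *cp and returns the
 * number of units consumed (1 or 2).  A lone surrogate is copied through
 * as-is: garbage in, garbage out. */
size_t utf16_to_u32(const uint16_t *in, size_t n, uint32_t *cp)
{
    assert(in != NULL && cp != NULL && n > 0);
    uint32_t hi = in[0];
    if (hi >= 0xD800 && hi <= 0xDBFF && n >= 2) {
        uint32_t lo = in[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            return 2;
        }
    }
    *cp = hi;
    return 1;
}
```

With these, U+1D11E becomes the pair 0xD834 0xDD1E.  Code that feeds
each half to a UTF-8 encoder separately produces two 3-byte sequences
instead of the single 4-byte sequence for U+1D11E.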

> > > Custom functions may be best:
> > > 
> > > utf8_to_u32
> > > u32_to_utf8
> > > u32_to_utf16
> > > utf16_to_u32
> > 
> > Or we can just use libiconv instead for this purpose, which
> > apparently also exists for Windows.
> 
> Well, those 4 functions are utterly trivial to write,

I'm very happy to learn this.  If that is the case, we can do that.
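
For reference, a non-validating UTF-8 pair could look roughly like the
sketch below.  The names come from the quoted list; the signatures and
the "return 0 on truncated input" convention are assumptions, and
nothing here checks that the input is well-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-8 into out (room for 4 bytes).  Returns the number of
 * bytes written.  Surrogates and values above 0x10FFFF are not rejected. */
size_t u32_to_utf8(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | ((cp >> 18) & 0x07));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode one code point from in[0..n).  Returns the number of bytes
 * consumed, or 0 if the sequence is cut short.  The length is taken from
 * the lead byte alone; continuation bytes are not checked. */
size_t utf8_to_u32(const unsigned char *in, size_t n, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    if (n == 0)
        return 0;
    unsigned char b = in[0];
    size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    if (n < len)
        return 0;
    /* Keep only the payload bits of the lead byte. */
    uint32_t v = len == 1 ? b : (uint32_t)(b & (0xFF >> (len + 1)));
    for (size_t i = 1; i < len; i++)
        v = (v << 6) | (in[i] & 0x3F);
    *cp = v;
    return len;
}
```

As a check, U+20AC encodes to E2 82 AC and U+1D11E to F0 9D 84 9E, and
both decode back to the same code points.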

The list above lacks u32_to_lc_ctype, though, which is also needed.  I
gather that would be equally trivial to do?
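
One way u32_to_lc_ctype might be done is to lean on the C library's
wcrtomb(), as in the sketch below.  This rests on a loud assumption:
that wchar_t values are Unicode code points on the platform (what
__STDC_ISO_10646__ advertises).  Where that does not hold, or where
wchar_t is only 16 bits wide, this approach falls short.  The signature
and return convention are made up for illustration.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: write code point cp in the encoding of the
 * current LC_CTYPE locale.  out must have room for MB_LEN_MAX bytes.
 * Returns the number of bytes written, or -1 if cp cannot be
 * represented.
 * ASSUMPTION: wchar_t holds Unicode code points on this platform. */
int u32_to_lc_ctype(uint32_t cp, char *out)
{
    assert(out != NULL);

    /* With a 16-bit wchar_t, anything outside the BMP does not fit. */
    if (cp > (uint32_t)WCHAR_MAX)
        return -1;

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    size_t n = wcrtomb(out, (wchar_t)cp, &st);
    if (n == (size_t)-1)
        return -1;                      /* not encodable in this locale */
    return (int)n;
}
```

The caller would be expected to have run setlocale(LC_CTYPE, "") first;
in the default "C" locale only ASCII is guaranteed to convert.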

> much easier than
> dealing with iconv (and its platform availability) - _if_ we don't have
> to fully validate our input. Completely validating a stream of Unicode
> (be it UTF-8 or UTF-16) is a whole other story.  I'm not certain
> whether even iconv bothers to do _that_.

If that just means that garbage input results in garbage output, then I
guess this would be fine.  Or is there more to it?
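
For comparison, the well-formedness part of validating UTF-8 covers at
least the cases in the sketch below: stray or missing continuation
bytes, overlong encodings, encoded surrogates, and values above
U+10FFFF.  The function name and interface are hypothetical, and this
only checks the byte-level form, not anything about the characters
themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if s[0..n) is well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        uint32_t cp, min;

        if (b < 0x80) {                 /* ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or bad lead byte */
        }

        if (n - i < len)
            return 0;                   /* sequence cut short */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogate encoded in UTF-8 */
        if (cp > 0x10FFFF)
            return 0;                   /* beyond the Unicode range */
        i += len;
    }
    return 1;
}
```

The overlong case is the one that tends to matter beyond "garbage out":
C0 AF decodes to '/' in a lenient decoder, so a check for '/' done on
the raw bytes would miss it.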

--Lorenz


