[Debburn-devel] Sample mini-iconv (Was: What can I assume about libc?)

Fri Oct 6 04:13:50 UTC 2006

On 10/5/06, Peter Samuelson <peter at p12n.org> wrote:
> [Albert Cahalan]

> > I just tested round-trip behavior. It appears to be unreliable.
>
> We can both take some blame - I had two bugs, your tests had a few
> bugs.

My bugs were mainly because I didn't realize you were using
the Microsoft-style definition of a character. You count a
4-byte UTF-16 character as two characters, not one.

Counting the whole thing is useful for many types of text
manipulation, but less useful for dealing with buffer sizes.
Counting the individual 16-bit chunks serves the same purpose
as counting bytes, but is a bit more awkward, because the
sizeof operator gives you bytes.
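
To make the distinction concrete, here is a rough sketch of the
"chunk" and byte counts for a zero-terminated UTF-16 string. The
names utf16_units and utf16_bytes are made up for illustration and
are not taken from the mini-iconv code under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit units before the terminating
 * zero unit, like strlen() but for UTF-16. A surrogate pair counts
 * as two units here. */
size_t utf16_units(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: bytes needed to store the string without its
 * terminator. This is the number that matters for buffer sizes. */
size_t utf16_bytes(const uint16_t *s)
{
    return utf16_units(s) * sizeof(uint16_t);
}
```

For "A" followed by one character outside the BMP, this gives 3 units
and 6 bytes, while sizeof on the array itself also counts the
terminator.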

> > I didn't test this. Probably a utf16 one is useful too.  The utf8 one
> > is trivial (call strlen), but perhaps good for documentation reasons.
>
> Well, all three are trivial, a matter of scanning for zeroes of the
> appropriate data type.

Oops. Sorry, I was rushing out the door and didn't think.

I was really (badly!) thinking about functions to count the
number of complete characters, meaning a pair of UTF-16
surrogates gets counted as 1 and so on. Perhaps this isn't
actually useful though. Fields on the disk are limited by
bytes, and there isn't much involving lining things up nicely
for display.
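
For reference, a function that counts complete characters might look
roughly like the sketch below. The name utf16_chars is hypothetical,
and counting an unpaired surrogate as one character is my assumption,
not something settled in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count complete characters in a zero-terminated
 * UTF-16 string. A high surrogate followed by a low surrogate counts
 * as 1. An unpaired surrogate is also counted as 1 (an assumption). */
size_t utf16_chars(const uint16_t *s)
{
    size_t n = 0;

    while (s[0] != 0) {
        /* s[1] is safe to read: s[0] is not the terminator, so at
         * least the terminator itself follows. */
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;   /* surrogate pair: one character, two units */
        else
            s += 1;   /* BMP character or unpaired surrogate */
        n++;
    }
    return n;
}
```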

I just tested Solaris. The wchar_t is 32-bit, so not UTF-16.
I can't test MacOS X right now. If MacOS X also uses a
32-bit wchar_t now, that leaves only little-endian stuff
(Joliet and UNICODE Windows) with UTF-16. Perhaps
the code dealing with UTF-16 should do the byte swapping.
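
If the UTF-16 code does take over the byte swapping, one way to do it
is to assemble each 16-bit unit from individual bytes, so the host
byte order never matters. This is only a sketch: load_u16 and
store_u16 are hypothetical names, and the caller is assumed to know
which byte order the on-disk format uses.

```c
#include <stdint.h>

/* Hypothetical helper: build one 16-bit unit from two bytes stored in
 * the stated order. Works the same on any host. */
uint16_t load_u16(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return (uint16_t)((p[0] << 8) | p[1]);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: write one 16-bit unit as two bytes in the
 * stated order. */
void store_u16(unsigned char *p, uint16_t u, int big_endian)
{
    unsigned char hi = (unsigned char)(u >> 8);
    unsigned char lo = (unsigned char)(u & 0xFF);

    p[0] = big_endian ? hi : lo;
    p[1] = big_endian ? lo : hi;
}
```

Working on bytes this way avoids any #ifdef on host endianness, at
the cost of handling one unit at a time.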