[Debburn-devel] Sample mini-iconv (Was: What can I assume about libc?)

Lorenz Minder lminder at gmx.net
Fri Oct 6 20:24:51 UTC 2006


Hi,

Peter Samuelson wrote:
> [Lorenz Minder]
> >   #if defined(__STDC_ISO_10646__)
> >           return iswprint((wchar_t)c);
> >   #else
> 
> Ooooh.  I didn't know about __STDC_ISO_10646__.  On the other hand, I
> imagine that relying on __STDC_ISO_10646__ is approximately equivalent
> to relying on the existence of iswprint(), portability-wise, so we
> don't really gain anything.

Unfortunately, __STDC_ISO_10646__ seems to be much less widely available
than iswprint(), because the BSDs and OS X don't define it (even though
their wchar_t is currently UCS-4 encoded). There's a lot of discussion
on the net about why this is the case, but that's largely off topic for
this thread.

Regarding iswprint(), AFAICS, the problem is not its portability; you
find it on AIX and HP-UX, and even on Windows. The real problem is that
it can't be used unless you manage to actually fill the wchar_t with
the value you need.

I currently can't think of any reliable and portable way to do this. If
you were to use mbstowcs() to scan a UTF-8 stream, you'd have to make
sure that a UTF-8-aware locale is being used, which is a big problem.
And on Windows, that can't even be done, because it does not support
UTF-8 for setlocale().
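
To illustrate the locale dependency, here is a minimal sketch. The helper
name decode_first_char() is made up for this example, and it uses
mbrtowc() rather than mbstowcs() so that it can report how many bytes
were consumed. Whether the UTF-8 bytes come out as one character depends
entirely on which locale setlocale() managed to select:

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only: decode the first
 * multibyte character of s under the named locale.
 * Returns the number of bytes consumed, or -1 if the locale is not
 * available or the bytes are not a complete, valid sequence in it. */
int decode_first_char(const char *locale_name, const char *s, wchar_t *out)
{
    mbstate_t st;
    size_t n;

    /* This is the fragile step: the requested locale may simply not
     * be installed, in which case setlocale() returns NULL. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    n = mbrtowc(out, s, strlen(s), &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;                      /* invalid or incomplete sequence */
    return (int)n;
}
```

With the "C" locale, the two bytes "\xC3\xA9" (UTF-8 for U+00E9) are not
decoded as a single two-byte character; only with a UTF-8 locale, whose
name varies between systems, would the call consume both bytes.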

> >           if(c < 256) {
> >                   return isprint((int)c);
> 
> To the best of my knowledge, wchar_t between 0 and 255 are not
> guaranteed to be equivalent to unsigned char.

Well, yes, the bound is wrong. I don't know why I put 256; it's actually
128. I probably misread your code when I wrote that.

The point is, the variable c in this snippet is not a wchar_t, but a
ucs4_t; and for that we know that the first 128 characters are identical
to ASCII, on all machines which work with an ASCII charset. And on the
other machines, the whole uniconv.c stuff breaks anyway. I think that's
not much of a restriction, is it?

The most important point about an isprint()-type function is that it
filters control characters, which the above does, but incompletely;
there's another range of control characters from 0x7F to 0x9F. I'll just
add a check for these as well.
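
A minimal sketch of what the corrected fallback could look like. It
assumes ucs4_t is a 32-bit unsigned type (the real typedef may differ)
and that the host charset is ASCII. Treating everything above 0x9F as
printable is a simplification made for this example, not something
settled in this thread:

```c
#include <ctype.h>
#include <stdint.h>

/* Assumption: ucs4_t is an unsigned 32-bit code point type. */
typedef uint32_t ucs4_t;

/* Poor man's isprint() for UCS-4 code points.
 * Returns 1 if c is considered printable, 0 otherwise. */
int ucs4_isprint(ucs4_t c)
{
    /* DEL (0x7F) and the C1 control range 0x80-0x9F. */
    if (c >= 0x7F && c <= 0x9F)
        return 0;

    /* The first 128 code points are identical to ASCII, so the plain
     * isprint() can be used on an ASCII host. */
    if (c < 128)
        return isprint((int)c) != 0;

    /* Crude assumption: anything else is printable. */
    return 1;
}
```

For example, ucs4_isprint(0x41) is nonzero, while ucs4_isprint(0x07) and
ucs4_isprint(0x85) are both zero.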

> The correct answer for all of this is probably libicu.  But that would
> be another external dependency, and currently a somewhat volatile one,
> so I'm not excited about that either.

Agreed. If we can get away with your conversion functions plus some
poor man's ctype-like functions like my ucs4_isprint(), then I think
this would be much better. (We could still add optional support for such
a library, if it proved worthwhile.)

Best,
--Lorenz


