This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: Bug in libiconv?

On Jan 29 08:10, Eric Blake wrote:
> On 01/29/2011 05:30 AM, Corinna Vinschen wrote:
> >> But when characters outside the basic plane, such as
> >> U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
> >> values, values of type wchar_t don't correspond to ISO/IEC 10646 characters.
> >> (Or maybe I'm underestimating what "coded representations" means...?)
> > 
> > I don't read that from your above quote.  The core is that the *type*
> > wchar_t is a *coded* *representation* of the characters defined in
> > 10646.  At no point does it say that a single wchar_t value must represent a
> > single character from 10646.  So I take it that UTF-16 is a valid, coded
> > representation of the characters from 10646.
>
> POSIX is clear that wchar_t must be wide enough so that 1 wchar_t is one
> character.  Which limits a 2-byte wchar_t to just the Unicode basic
> plane.  There's nothing cygwin can do about this other than break LOTS
> of ABI to support a 4-byte wchar_t to supply all of Unicode.
>
> "All wide-character codes in a given process consist of an equal number
> of bits. This is in contrast to characters, which can consist of a
> variable number of bytes. The byte or byte sequence that represents a
> character can also be represented as a wide-character code.
> Wide-character codes thus provide a uniform size for manipulating text
> data."
>
> So, using UTF-16 surrogate encodings for characters outside the basic
> plane violates POSIX, but it's the best we can do for those characters.

Right, and we discussed this already on this list.  Or the developer
list, I don't remember.  Maybe we should have stuck to the base plane
and only used UCS-2 to be more POSIX compatible.  I have to admit that
I was more interested in getting all (or as much as possible) of Unicode
working than in following POSIX to the last word in this regard.  And I
wanted to make sure that East Asian users would get all of the
characters they use, and there *are* CJK ideographs in the 0x2xxxx plane.
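
To make the trade-off concrete, here is a minimal sketch of how a
character outside the base plane ends up as two 16-bit units.  This is
not Cygwin's or newlib's actual code; the function names are made up for
illustration, and uint16_t merely stands in for a 2-byte wchar_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a code point above U+FFFF into a UTF-16
   surrogate pair.  out[0] receives the high surrogate, out[1] the low. */
void utf16_encode_supplementary(uint32_t cp, uint16_t out[2])
{
    /* Only code points outside the base plane need a pair. */
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low 10 bits */
}

/* Hypothetical helper: recombine a valid surrogate pair into a code point. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
}
```

With this, U+12345 comes out as the two units 0xD808 0xDF45, so a
single 16-bit value can never be "the character" on its own.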

However, the POSIX definition doesn't contradict what I said about the
definition of __STDC_ISO_10646__ as far as I'm concerned.

> Someday when gcc has better support for C+1x 16- and 32-bit characters
> (regardless of the sizing of wchar_t), then we can add all the new
> 32-bit character APIs that use Unicode unimpeded, without breaking
> existing ones that use wchar_t.

Yeah, that's what I'm waiting for as well.  But for the time being,
I'm confident that we have the best compromise possible at the time.
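
As a rough sketch of what those types would give us, assuming a compiler
and libc that already provide the <uchar.h> header with char16_t and
char32_t and encode u"" and U"" literals as UTF-16 and UTF-32 (the
function names below are made up for illustration):

```c
#include <stddef.h>
#include <uchar.h>

/* Number of 16-bit units needed for U+12345, not counting the
   terminating zero. */
size_t units_as_char16(void)
{
    static const char16_t s[] = u"\U00012345";
    return sizeof s / sizeof s[0] - 1;
}

/* Number of 32-bit units needed for the same character. */
size_t units_as_char32(void)
{
    static const char32_t s[] = U"\U00012345";
    return sizeof s / sizeof s[0] - 1;
}
```

The 16-bit form still needs a surrogate pair, while the 32-bit form
holds the character in one unit, independent of how wide wchar_t is.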


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
