This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: Bug in libiconv?

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin at cygwin dot com
Date: Sat, 29 Jan 2011 17:01:57 +0100
Subject: Re: Bug in libiconv?
References: <201101282312.50298.bruno@clisp.org> <20110129123014.GA8671@calimero.vinschen.de> <4D442DDA.4050807@redhat.com>
Reply-to: cygwin at cygwin dot com

On Jan 29 08:10, Eric Blake wrote:
> On 01/29/2011 05:30 AM, Corinna Vinschen wrote:
> >> But when characters outside the basic plane, such as
> >> U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
> >> values, values of type wchar_t don't correspond to ISO/IEC 10646 characters.
> >> (Or maybe I'm underestimating what "coded representations" means...?)
> > 
> > I don't read that from your above quote.  The core is that the *type*
> > wchar_t is a *coded* *representation* of the characters defined in
> > 10646.  At no point it says that a single wchar_t value must represent a
> > single character from 10646.  So I take it that UTF-16 is a valid, coded
> > representation of the characters from 10646.
> 
> POSIX is clear that wchar_t must be wide enough so that 1 wchar_t is one
> character.  Which limits a 2-byte wchar_t to just the Unicode basic
> plane.  There's nothing cygwin can do about this other than break LOTS
> of ABI to support a 4-byte wchar_t to supply all of Unicode.
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_03
> 
> "All wide-character codes in a given process consist of an equal number
> of bits. This is in contrast to characters, which can consist of a
> variable number of bytes. The byte or byte sequence that represents a
> character can also be represented as a wide-character code.
> Wide-character codes thus provide a uniform size for manipulating text
> data."
> 
> So, using UTF-16 surrogate encodings for characters outside the basic
> plane violates POSIX, but it's the best we can do for those characters.

Right, and we discussed this already on this list.  Or the developer
list, I don't remember.  Maybe we should have stuck to the base plane
and only used UCS-2 to be more POSIX compatible.  I have to admit that
I was more interested in getting all (or as much as possible) of Unicode
working than in following POSIX to the last word in this regard.  And I
wanted to make sure that East Asian users would get all of the
characters they use, and there *are* CJK ideographs in the 0x2xxxx plane.
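
For illustration only, here is a minimal sketch of the surrogate
arithmetic this is about, assuming 16-bit code units as with a 2-byte
wchar_t.  The helper names utf16_encode and utf16_decode_pair are
hypothetical and are not part of Cygwin, newlib, or libiconv.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Cygwin API): encode one Unicode code point
   as UTF-16.  Returns the number of 16-bit units written (1 or 2), or 0
   if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                   /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;      /* basic plane: a single unit */
        return 1;
    }
    cp -= 0x10000;                  /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Hypothetical helper: combine a high/low surrogate pair back into a
   code point.  The caller must pass a well-formed pair. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
}
```

With these definitions, U+12345 comes out as the two units 0xD808
0xDF45, and U+20000 (the first code point of the 0x2xxxx plane) as
0xD840 0xDC00, while anything in the basic plane still fits in one
unit.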

However, the POSIX definition doesn't contradict what I said about the
definition of __STDC_ISO_10646__ as far as I'm concerned.

> Someday when gcc has better support for C+1x 16- and 32-bit characters
> (regardless of the sizing of wchar_t), then we can add all the new
> 32-bit character APIs that use Unicode unimpeded, without breaking
> existing ones that use wchar_t.

Yeah, that's what I'm waiting for as well.  But for the time being,
I'm confident that we have the best compromise possible at the time.
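
As a rough sketch of what such types buy us, assuming a compiler and
libc that provide the C1x <uchar.h> types char16_t and char32_t (the
length helpers c16_len and c32_len below are made up for this example):

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: number of 16-bit units before the terminator. */
size_t c16_len(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: number of 32-bit units before the terminator.
   With char32_t, one unit is always exactly one code point. */
size_t c32_len(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

For the string literal u"\U00012345" the 16-bit count is 2 (a surrogate
pair), while for U"\U00012345" the 32-bit count is 1, independent of
how wide wchar_t happens to be on the platform.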


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
