This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: charset changes

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Sat, 27 Mar 2010 14:33:19 +0100
Subject: Re: charset changes
References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com> <4B6C474B.7090600@towo.net> <20100205215047.GX28659@calimero.vinschen.de> <416096c61002051514w5bb56b0bj5baeb0c65c7aece@mail.gmail.com> <4B6CB86E.5050904@towo.net> <416096c61002052220rafdb361kec907336ca5b3889@mail.gmail.com> <20100206104024.GY28659@calimero.vinschen.de> <416096c61002060556v2cb254bajb331cd1ebeaa4961@mail.gmail.com> <416096c61003262347s1f4f2bc4m44411de2edcbeef6@mail.gmail.com>
Reply-to: cygwin-developers at cygwin dot com

On Mar 27 06:47, Andy Koppe wrote:
> I think the conclusion from all this is that approach 2 is the least
> broken way to handle GB18030: when encountering a 4-byte sequence that
> maps to a non-BMP char (and hence a UTF-16 surrogate pair), write the
> high surrogate and report that one byte less than actually seen has
> been consumed. On the next mbtowc call, ignore the input, write the
> low surrogate, and report that 1 byte has been consumed.
> 
> As mentioned, this breaks the mbtowc spec when bytes are fed in
> one-by-one, because in that case zero needs to be returned after the
> high surrogate, yet zero is meant to signal string end. An application
> that's aware of that can work around it by checking whether the wide
> character that's written actually is null, but in others it may cause
> truncated strings. Fortunately, the mbstowcs implementation isn't
> affected by this, because that always passes as many bytes as possible
> to mbtowc, i.e. the incorrect zero return can't occur there.
> 
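[Editorial sketch, not part of the original message.] To make the return-value protocol of "approach 2" concrete, here is a minimal C sketch. It assumes all four bytes are available in one call, handles only the four-byte range 0x90308130..0xE3329A35 (which maps linearly onto U+10000..U+10FFFF), and uses hypothetical names (gb_state, gb18030_supp_mbtowc) that do not come from Cygwin or newlib.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: holds the low surrogate still to be written. */
typedef struct { uint16_t pending_low; } gb_state;

/* Sketch of approach 2 for supplementary-plane characters only.
   Returns the number of bytes reported as consumed, -1 for an invalid
   sequence, -2 for an incomplete one. */
int gb18030_supp_mbtowc(uint16_t *pwc, const unsigned char *s, size_t n,
                        gb_state *st)
{
    if (st->pending_low) {
        /* Second call: ignore the input, write the low surrogate and
           report the one byte that was held back on the first call. */
        *pwc = st->pending_low;
        st->pending_low = 0;
        return 1;
    }
    if (n < 4)
        return -2;  /* simplification: the partial prefix is not validated */
    if (s[0] < 0x90 || s[0] > 0xE3 || s[1] < 0x30 || s[1] > 0x39 ||
        s[2] < 0x81 || s[2] > 0xFE || s[3] < 0x30 || s[3] > 0x39)
        return -1;

    /* Linear offset of the sequence from 0x90308130, i.e. from U+10000. */
    uint32_t linear = (((uint32_t)(s[0] - 0x90) * 10 + (s[1] - 0x30)) * 126
                       + (s[2] - 0x81)) * 10 + (s[3] - 0x30);
    if (linear > 0xFFFFF)
        return -1;  /* beyond U+10FFFF */

    *pwc = (uint16_t)(0xD800 + (linear >> 10));               /* high surrogate */
    st->pending_low = (uint16_t)(0xDC00 + (linear & 0x3FF));  /* kept for next call */
    return 3;       /* one byte less than actually seen */
}
```

A caller that advances its input pointer by the return value ends up having consumed 3 + 1 = 4 bytes across the two calls. The problematic zero return described above only arises in a restartable variant that accepts the bytes one at a time, which this sketch does not model.
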
> The MultiByteToWideChar() function doesn't have a way to tell
> incomplete from invalid sequences, which is needed to decide whether
> to return -2 or -1 from mbtowc. "Interestingly", if you give it only
> two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
> that as a one-byte invalid sequence followed by the digit '3'.

Huh?  How did you test that?  AFAIK, MultiByteToWideChar doesn't tell
you how many or which bytes it treated as a valid substring.

> Therefore I think the best thing to do is to manually parse GB18030
> sequences, which is fairly straightforward, and only hand complete
> sequences over to MultiByteToWideChar for translation to UTF-16. Shall
> I have a go at that?

I would really be glad.  You'd just create two functions __gb18030_mbtowc
and __gb18030_wctomb in strfuncs.cc, and I could easily add them to newlib's
setlocale_r.  Oh, and then there's check_codepage in nlsfuncs.cc which
needs to test if codepage 54936 is installed.
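
[Editorial sketch, not part of the original message.] The manual parsing step proposed in the quoted text could look roughly like the following C helper, which classifies the start of a buffer as a complete, incomplete, or invalid GB18030 sequence using only the byte-range structure of the encoding. The name gb18030_seq_len and the two constants are hypothetical. Treating 0x80 and 0xFF as invalid lead bytes is an assumption here; code page 54936 may handle them differently.

```c
#include <stddef.h>

enum { GB_INVALID = -1, GB_INCOMPLETE = -2 };

static int is_lead(unsigned char b)   { return b >= 0x81 && b <= 0xFE; }
static int is_digit(unsigned char b)  { return b >= 0x30 && b <= 0x39; }
static int is_trail2(unsigned char b) { return (b >= 0x40 && b <= 0x7E) ||
                                               (b >= 0x80 && b <= 0xFE); }

/* Returns 1, 2 or 4 for a complete sequence at s, GB_INCOMPLETE if the
   n available bytes are a valid prefix of a longer sequence, or
   GB_INVALID otherwise.  Only structure is checked, not whether the
   sequence is actually assigned to a character. */
int gb18030_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return GB_INCOMPLETE;
    if (s[0] <= 0x7F)
        return 1;                       /* single-byte (ASCII) */
    if (!is_lead(s[0]))
        return GB_INVALID;
    if (n < 2)
        return GB_INCOMPLETE;
    if (is_trail2(s[1]))
        return 2;                       /* two-byte sequence */
    if (!is_digit(s[1]))
        return GB_INVALID;
    if (n < 3)
        return GB_INCOMPLETE;           /* e.g. \x95 \x33 ends up here */
    if (!is_lead(s[2]))
        return GB_INVALID;
    if (n < 4)
        return GB_INCOMPLETE;
    return is_digit(s[3]) ? 4 : GB_INVALID;
}
```

With a helper of this kind, the two-byte input \x95 \x33 from the quoted example is reported as incomplete (-2) rather than as an invalid byte followed by '3', and only buffers for which it returns 1, 2 or 4 would be handed to MultiByteToWideChar.
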

However, here's a problem.  Adding these functions is non-trivial code
and requires a copyright assignment... sigh.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
