This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
Re: charset changes
- From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
- To: cygwin-developers at cygwin dot com
- Date: Sat, 27 Mar 2010 14:33:19 +0100
- Subject: Re: charset changes
- References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com> <4B6C474B.7090600@towo.net> <20100205215047.GX28659@calimero.vinschen.de> <416096c61002051514w5bb56b0bj5baeb0c65c7aece@mail.gmail.com> <4B6CB86E.5050904@towo.net> <416096c61002052220rafdb361kec907336ca5b3889@mail.gmail.com> <20100206104024.GY28659@calimero.vinschen.de> <416096c61002060556v2cb254bajb331cd1ebeaa4961@mail.gmail.com> <416096c61003262347s1f4f2bc4m44411de2edcbeef6@mail.gmail.com>
- Reply-to: cygwin-developers at cygwin dot com
On Mar 27 06:47, Andy Koppe wrote:
> I think the conclusion from all this is that approach 2 is the least
> broken way to handle GB18030: when encountering a 4-byte sequence that
> maps to a non-BMP char (and hence a UTF-16 surrogate pair), write the
> high surrogate and report that one byte less than actually seen has
> been consumed. On the next mbtowc call, ignore the input, write the
> low surrogate, and report that 1 byte has been consumed.
>
> As mentioned, this breaks the mbtowc spec when bytes are fed in
> one-by-one, because in that case zero needs to be returned after the
> high surrogate, yet zero is meant to signal string end. An application
> that's aware of that can work around it by checking whether the wide
> character that's written actually is null, but in others it may cause
> truncated strings. Fortunately, the mbstowcs implementation isn't
> affected by this, because that always passes as many bytes as possible
> to mbtowc, i.e. the incorrect zero return can't occur there.
>
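The two-call scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that went into Cygwin: `gb_state` and `gb_supp_mbtowc` are hypothetical names, the sketch handles only the 4-byte sequences for characters beyond the BMP (lead bytes 0x90..0xE3), and it computes the code point arithmetically from GB18030's linear mapping of that range instead of calling MultiByteToWideChar. It also assumes the whole sequence is passed in one call, as mbstowcs does, so the problematic zero return for byte-by-byte input is not modelled.

```c
#include <stddef.h>

/* Hypothetical conversion state: a low surrogate waiting to be delivered.
   Low surrogates are never 0, so 0 means "nothing pending". */
typedef struct { unsigned short pending_low; } gb_state;

/* Sketch of "approach 2" for supplementary-plane GB18030 sequences only.
   Returns the number of bytes reported as consumed, -1 for an invalid
   sequence, or -2 for an incomplete one. */
static int gb_supp_mbtowc(unsigned short *pwc, const unsigned char *s,
                          size_t n, gb_state *st)
{
    if (st->pending_low) {
        /* Second call: ignore the input, deliver the low surrogate and
           report the one byte that the first call held back. */
        *pwc = st->pending_low;
        st->pending_low = 0;
        return 1;
    }

    /* Structural check, distinguishing incomplete from invalid input. */
    if (n < 1) return -2;
    if (s[0] < 0x90 || s[0] > 0xE3) return -1;  /* out of this sketch's scope */
    if (n < 2) return -2;
    if (s[1] < 0x30 || s[1] > 0x39) return -1;
    if (n < 3) return -2;
    if (s[2] < 0x81 || s[2] > 0xFE) return -1;
    if (n < 4) return -2;
    if (s[3] < 0x30 || s[3] > 0x39) return -1;

    /* Linear index of the sequence; index 0 corresponds to U+10000. */
    unsigned long idx = (((unsigned long)(s[0] - 0x90) * 10 + (s[1] - 0x30))
                         * 126 + (s[2] - 0x81)) * 10 + (s[3] - 0x30);
    if (idx > 0xFFFFFUL)
        return -1;                              /* beyond U+10FFFF */

    /* First call: write the high surrogate, remember the low one, and
       report one byte less than was actually examined. */
    *pwc = (unsigned short)(0xD800 | (idx >> 10));
    st->pending_low = (unsigned short)(0xDC00 | (idx & 0x3FF));
    return 3;
}
```

A caller that advances its input pointer by the return value ends up consuming all four bytes over the two calls, which is why the total stays consistent even though each individual return value is, strictly speaking, a lie.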
> The MultiByteToWideChar() function doesn't have a way to tell
> incomplete from invalid sequences, which is needed to decide whether
> to return -2 or -1 from mbtowc. "Interestingly", if you give it only
> two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
> that as a one-byte invalid sequence followed by the digit '3'.
Huh? How did you test that? AFAIK, MultiByteToWideChar doesn't tell
you how many and which bytes it treated as a valid substring.
> Therefore I think the best thing to do is to manually parse GB18030
> sequences, which is fairly straightforward, and only hand complete
> sequences over to MultiByteToWideChar for translation to UTF-16. Shall
> I have a go at that?
I would really be glad. You'd just create two functions, __gb18030_mbtowc
and __gb18030_wctomb, in strfuncs.cc, and I could easily add them to newlib's
setlocale_r. Oh, and then there's check_codepage in nlsfuncs.cc, which
needs to test whether codepage 54936 is installed.
However, here's a problem. These functions are non-trivial code, so
adding them requires a copyright assignment... sigh.
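
For illustration, the manual parsing step proposed in the quoted text could look something like the sketch below. It is an assumption-laden outline, not the eventual Cygwin code: `gb18030_seqlen` and the `GB_*` constants are hypothetical names, and the function checks only the byte-range structure of a sequence (1, 2 or 4 bytes), not whether the sequence is actually assigned a character. The single byte 0x80 is treated as invalid here, which may differ from what codepage 54936 does.

```c
#include <stddef.h>

/* Result codes follow the convention mentioned in the thread:
   -1 for an invalid sequence, -2 for an incomplete one. */
enum { GB_INVALID = -1, GB_INCOMPLETE = -2 };

/* Hypothetical helper: returns the length (1, 2 or 4) of the GB18030
   sequence starting at s, looking at no more than n bytes, or
   GB_INCOMPLETE / GB_INVALID. */
static int gb18030_seqlen(const unsigned char *s, size_t n)
{
    if (n == 0)
        return GB_INCOMPLETE;
    if (s[0] <= 0x7F)
        return 1;                       /* ASCII */
    if (s[0] == 0x80 || s[0] == 0xFF)
        return GB_INVALID;              /* not a lead byte (assumption for 0x80) */

    /* Lead byte 0x81..0xFE: the second byte decides between 2 and 4 bytes. */
    if (n < 2)
        return GB_INCOMPLETE;
    if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
        return 2;                       /* two-byte sequence */
    if (s[1] < 0x30 || s[1] > 0x39)
        return GB_INVALID;

    /* Second byte is a digit 0x30..0x39: four-byte sequence. */
    if (n < 3)
        return GB_INCOMPLETE;
    if (s[2] < 0x81 || s[2] > 0xFE)
        return GB_INVALID;
    if (n < 4)
        return GB_INCOMPLETE;
    if (s[3] < 0x30 || s[3] > 0x39)
        return GB_INVALID;
    return 4;
}
```

With this kind of check, the two bytes \x95 \x33 from the example above are reported as incomplete rather than as an invalid byte followed by '3', and only sequences for which a positive length comes back would be handed on for translation to UTF-16.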
Corinna
--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat