charset changes

Corinna Vinschen corinna-cygwin@cygwin.com
Sat Mar 27 13:33:00 GMT 2010


On Mar 27 06:47, Andy Koppe wrote:
> I think the conclusion from all this is that approach 2 is the least
> broken way to handle GB18030: when encountering a 4-byte sequence that
> maps to a non-BMP char (and hence a UTF-16 surrogate pair), write the
> high surrogate and report that one byte less than actually seen has
> been consumed. On the next mbtowc call, ignore the input, write the
> low surrogate, and report that 1 byte has been consumed.
> 
> As mentioned, this breaks the mbtowc spec when bytes are fed in
> one-by-one, because in that case zero needs to be returned after the
> high surrogate, yet zero is meant to signal string end. An application
> that's aware of that can work around it by checking whether the wide
> character that's written actually is null, but in others it may cause
> truncated strings. Fortunately, the mbstowcs implementation isn't
> affected by this, because that always passes as many bytes as possible
> to mbtowc, i.e. the incorrect zero return can't occur there.
> 
> The MultiByteToWideChar() function doesn't have a way to tell
> incomplete from invalid sequences, which is needed to decide whether
> to return -2 or -1 from mbtowc. "Interestingly", if you give it only
> two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
> that as a one-byte invalid sequence followed by the digit '3'.

Huh?  How did you test that?  AFAIK, MultiByteToWideChar doesn't tell
you how many and which bytes it treated as a valid substring.
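
For reference, the "approach 2" behaviour quoted above could be sketched
roughly as below.  This is only an illustration, not the Cygwin code:
the names gb_state, gb4_to_nonbmp and sketch_mbtowc are made up, the
state handling is an assumption, and the sketch only handles a complete
four-byte sequence for a non-BMP character passed in one call.  It relies
on the fact that GB18030 maps 0x90308130..0xE3329A35 linearly onto
U+10000..U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a low surrogate waiting to be written,
   or 0 if none is pending. */
typedef struct {
    uint16_t pending_low;
} gb_state;

/* Decode a four-byte GB18030 sequence in the supplementary range
   0x90308130..0xE3329A35, which maps linearly onto U+10000..U+10FFFF.
   Returns 0 if the bytes fall outside that range. */
static uint32_t gb4_to_nonbmp(const unsigned char *s)
{
    if (s[0] < 0x90 || s[0] > 0xE3 || s[1] < 0x30 || s[1] > 0x39 ||
        s[2] < 0x81 || s[2] > 0xFE || s[3] < 0x30 || s[3] > 0x39)
        return 0;
    uint32_t idx = (((uint32_t)(s[0] - 0x90) * 10 + (s[1] - 0x30)) * 126
                    + (s[2] - 0x81)) * 10 + (s[3] - 0x30);
    uint32_t cp = 0x10000 + idx;
    return cp <= 0x10FFFF ? cp : 0;
}

/* Sketch of approach 2 for UTF-16 output (hypothetical signature). */
int sketch_mbtowc(uint16_t *pwc, const unsigned char *s, size_t n,
                  gb_state *st)
{
    assert(pwc != NULL && s != NULL && st != NULL);

    if (st->pending_low != 0) {
        /* Second call: ignore the input, emit the low surrogate and
           claim the one byte withheld by the previous call. */
        *pwc = st->pending_low;
        st->pending_low = 0;
        return 1;
    }
    if (n < 4)
        return -2;  /* sketch only: anything short counts as incomplete */

    uint32_t cp = gb4_to_nonbmp(s);
    if (cp == 0)
        return -1;  /* not a non-BMP sequence; outside this sketch */

    /* First call: emit the high surrogate and report one byte fewer
       than the four actually seen. */
    cp -= 0x10000;
    *pwc = (uint16_t)(0xD800 + (cp >> 10));
    st->pending_low = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 3;
}
```

A caller that advances its input pointer by the return value ends up
having consumed all four bytes after the two calls.  The sketch does not
model byte-by-byte input, which is where the problematic zero return
described in the quoted text would show up.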

> Therefore I think the best thing to do is to manually parse GB18030
> sequences, which is fairly straightforward, and only hand complete
> sequences over to MultiByteToWideChar for translation to UTF-16. Shall
> I have a go at that?

I would really be glad.  You'd just create two functions, __gb18030_mbtowc
and __gb18030_wctomb, in strfuncs.cc, and I could easily add them to newlib's
setlocale_r.  Oh, and then there's check_codepage in nlsfuncs.cc, which
needs to test whether codepage 54936 is installed.
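
To illustrate what "manually parse GB18030 sequences" might look like,
here is a minimal sketch of the length check that would sit in front of
MultiByteToWideChar.  The function name gb18030_seq_len is hypothetical,
and treating the bytes 0x80 and 0xFF as invalid lead bytes is an
assumption of this sketch.  It only classifies the byte structure (one,
two or four bytes) and does not check whether a code is actually
assigned.

```c
#include <assert.h>
#include <stddef.h>

/* Classify the GB18030 sequence starting at s, given n available bytes.
   Returns 1, 2 or 4 for a structurally complete sequence,
   -2 if the bytes so far are a valid but incomplete prefix,
   -1 if they cannot start a valid sequence. */
int gb18030_seq_len(const unsigned char *s, size_t n)
{
    assert(s != NULL);

    if (n == 0)
        return -2;
    if (s[0] <= 0x7F)
        return 1;                       /* single byte, ASCII range */
    if (s[0] == 0x80 || s[0] == 0xFF)
        return -1;                      /* assumption: not a lead byte */

    /* Lead byte 0x81..0xFE: need the second byte to decide. */
    if (n < 2)
        return -2;
    if ((s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0x80 && s[1] <= 0xFE))
        return 2;                       /* two-byte sequence */
    if (s[1] < 0x30 || s[1] > 0x39)
        return -1;

    /* Second byte is a digit 0x30..0x39: four-byte sequence. */
    if (n < 3)
        return -2;
    if (s[2] < 0x81 || s[2] > 0xFE)
        return -1;
    if (n < 4)
        return -2;
    if (s[3] < 0x30 || s[3] > 0x39)
        return -1;
    return 4;
}
```

With this check, the two bytes \x95 \x33 from the example above come out
as incomplete (-2) rather than as an invalid byte followed by '3', and
only sequences reported as complete would be handed on for translation.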

However, here's a problem.  These functions are non-trivial code, so
adding them requires a copyright assignment... sigh.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


