GB18030 (was: Re: charset changes)

Andy Koppe
Sat Mar 27 19:15:00 GMT 2010

On 27 March 2010 18:01, Corinna Vinschen wrote:
>> On Vista and 7, if you pass those two bytes to MultiByteToWideChar,
>> you get back the codepage's UnicodeDefaultChar followed by the digit
>> '3'. XP did something else, but I can't remember exactly what.
> Heh, ok.  It never occurred to me to test the content of the target
> buffer if MultiByteToWideChar failed anyway.

It only fails if the MB_ERR_INVALID_CHARS flag is set.

>> How about implementing __gb18030_mbtowc/wctomb in newlib, which would
>> handle all the mbstate stuff, with the actual encoding and decoding
>> factored out into functions like this:
>>
>> size_t __gb18030_encode(char *dst, const wchar_t *src, size_t src_len):
>> Pass in one codepoint, consisting of one or two wchars (always one in
>> case of a 32-bit wchar_t). Return the length of the resulting
>> multibyte sequence.
>>
>> size_t __gb18030_decode(wchar_t *dst, const char *src, size_t src_len):
>> Pass in a valid multibyte sequence. Return the number of wchars needed
>> to represent it.
>>
>> On Cygwin, these would be straightforward wrappers around
>> WideCharToMultiByte and MultiByteToWideChar with codepage 54936,
>> implemented in winsup. For other newlib targets, we could take a
>> similar approach as with doublebyte charsets, where multibyte
>> sequences are mapped to a non-Unicode wchar_t representation by simply
>> packing the bytes into the wchar_t.
> Yet another function call for every single character:

That call would only be necessary for non-ASCII characters, and I
don't think it would be terribly significant compared to the magic
that WideCharToMultiByte and MultiByteToWideChar need to do.
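
To make the shape of this concrete, here is a rough sketch of the
non-Cygwin variant, where the bytes are simply packed into the wide
character. All sketch_* names are made up for illustration and are not
newlib code. It uses uint32_t in place of a 32-bit wchar_t, treats
incomplete and invalid input alike, and leaves out the mbstate
handling entirely. The byte ranges are the usual GB18030 ones: a lead
byte of 0x81-0xFE followed either by 0x40-0x7E/0x80-0xFE (two-byte
form) or by a digit, another 0x81-0xFE byte and another digit
(four-byte form).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the GB18030 sequence at s (1, 2 or 4), or 0 if the first
   n bytes do not form a complete, well-formed sequence. */
size_t sketch_gb18030_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                       /* ASCII */
    if (s[0] == 0x80 || s[0] == 0xFF || n < 2)
        return 0;                       /* bad lead byte or truncated */
    if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
        return 2;                       /* two-byte form */
    if (s[1] >= 0x30 && s[1] <= 0x39 && n >= 4 &&
        s[2] >= 0x81 && s[2] <= 0xFE &&
        s[3] >= 0x30 && s[3] <= 0x39)
        return 4;                       /* four-byte form */
    return 0;
}

/* Hypothetical stand-in for the proposed decode function: packs the
   bytes of one valid sequence, most significant first, into a single
   32-bit value. Returns the number of wide characters written. */
size_t sketch_gb18030_decode(uint32_t *dst, const unsigned char *src,
                             size_t src_len)
{
    uint32_t packed = 0;

    assert(src_len == 2 || src_len == 4);
    for (size_t i = 0; i < src_len; i++)
        packed = (packed << 8) | src[i];
    *dst = packed;
    return 1;
}

/* mbtowc-like wrapper: ASCII is handled inline, so only non-ASCII
   input pays for the extra decode call. Returns the number of bytes
   consumed, or (size_t)-1 for invalid or incomplete input. */
size_t sketch_gb18030_mbtowc(uint32_t *dst, const unsigned char *s,
                             size_t n)
{
    size_t len = sketch_gb18030_seq_len(s, n);

    if (len == 0)
        return (size_t)-1;
    if (len == 1) {
        *dst = s[0];                    /* fast path, no decode call */
        return 1;
    }
    sketch_gb18030_decode(dst, s, len);
    return len;
}
```

With this, the two-byte sequence 0xB0 0xA1 comes out as 0xB0A1 and the
four-byte sequence 0x81 0x30 0x81 0x30 as 0x81308130, while a lead
byte followed only by a digit is rejected. On Cygwin the decode step
would instead go through MultiByteToWideChar with codepage 54936, as
proposed above.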


PS: Speaking of performance issues, the 8-bit charsets are rather
inefficient because for every single non-ASCII character they parse
the charset name to obtain a charset table index. Storing that index
alongside the name might make quite a big difference.
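
Roughly what I have in mind, as a sketch only: the struct, the
function names and the list of charset names below are all invented
for illustration and do not match the actual newlib tables. The point
is just that the name is looked up once, when the charset is set, and
the conversion functions then read the cached index.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical list of 8-bit charset names; the position in this
   array plays the role of the charset table index. */
static const char *const sketch_charset_names[] = {
    "ISO-8859-1", "ISO-8859-2", "CP1252", "KOI8-R"
};
#define SKETCH_NUM_CHARSETS \
    (sizeof sketch_charset_names / sizeof sketch_charset_names[0])

/* Charset name together with its cached table index. */
struct sketch_charset {
    char name[32];
    int table_index;                    /* -1 if the name is unknown */
};

/* Slow path: derive the table index from the charset name. */
int sketch_lookup_index(const char *name)
{
    for (size_t i = 0; i < SKETCH_NUM_CHARSETS; i++)
        if (strcmp(name, sketch_charset_names[i]) == 0)
            return (int)i;
    return -1;
}

/* Called once when the charset changes. The per-character conversion
   functions would then use cs->table_index directly instead of
   parsing cs->name again. Returns the index, or -1 on failure. */
int sketch_set_charset(struct sketch_charset *cs, const char *name)
{
    if (strlen(name) >= sizeof cs->name)
        return -1;                      /* name too long, cs unchanged */
    strcpy(cs->name, name);
    cs->table_index = sketch_lookup_index(name);
    return cs->table_index;
}
```

The lookup cost then moves from every non-ASCII character to the
comparatively rare charset change.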
