charset changes

Andy Koppe andy.koppe@gmail.com
Fri Feb 5 23:14:00 GMT 2010


>> - .gb18030 (in zh_CN.gb18030)
>
> it's supported by Windows XP and later.  Maybe we
> should add it after 1.7.2?

I doubt whether it's possible to support it correctly. GB18030 is a
monster of an encoding, with 1-byte, 2-byte and 4-byte character codes
and a huge lookup table to map between GB18030 codepoints and Unicode.

The first issue is that MultiByteToWideChar doesn't distinguish
between incomplete and invalid sequences, a distinction that is needed
to implement mbrtowc correctly. Say we've seen three bytes, and
MultiByteToWideChar returns zero when looking at those. That could
mean that we're still missing the fourth byte, in which case mbrtowc
should return (size_t)-2, or it could mean the sequence is already
invalid, in which case it should return (size_t)-1 and set errno to
EILSEQ.

Hence I suspect Cygwin would need to do its own parsing of GB18030 and
only hand complete sequences over to MultiByteToWideChar to map them
to Unicode.
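
To illustrate what such parsing might look like, here is a minimal
sketch of a structural classifier. It assumes the byte ranges from the
GB18030 standard (one byte 0x00-0x7F; two bytes 0x81-0xFE followed by
0x40-0x7E or 0x80-0xFE; four bytes 0x81-0xFE, 0x30-0x39, 0x81-0xFE,
0x30-0x39). The names gb18030_classify, GB_INVALID and GB_INCOMPLETE are
made up for this example and are not Cygwin or newlib functions. It only
checks structure, not whether the sequence is actually assigned.

```c
#include <stddef.h>

/* Hypothetical result codes, chosen to mirror mbrtowc's -1 and -2. */
enum { GB_INVALID = -1, GB_INCOMPLETE = -2 };

/* Looks at the first n bytes of s.  Returns the length (1, 2 or 4) of a
   structurally complete GB18030 sequence at s, GB_INCOMPLETE if the
   bytes seen so far are a valid prefix that needs more input, or
   GB_INVALID if no continuation can make them valid. */
int gb18030_classify(const unsigned char *s, size_t n)
{
    if (n == 0)
        return GB_INCOMPLETE;
    if (s[0] <= 0x7F)
        return 1;                           /* single byte */
    if (s[0] == 0x80 || s[0] == 0xFF)
        return GB_INVALID;                  /* not a lead byte (assumption:
                                               strict standard ranges) */
    if (n < 2)
        return GB_INCOMPLETE;
    if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
        return 2;                           /* two-byte sequence */
    if (s[1] < 0x30 || s[1] > 0x39)
        return GB_INVALID;
    if (n < 3)
        return GB_INCOMPLETE;               /* four-byte, two bytes seen */
    if (s[2] < 0x81 || s[2] > 0xFE)
        return GB_INVALID;
    if (n < 4)
        return GB_INCOMPLETE;               /* four-byte, three bytes seen */
    if (s[3] < 0x30 || s[3] > 0x39)
        return GB_INVALID;
    return 4;                               /* four-byte sequence */
}
```

With something like this, the three-byte case above becomes decidable
without asking MultiByteToWideChar: a valid three-byte prefix yields
GB_INCOMPLETE, a malformed one yields GB_INVALID, and only sequences
reported as complete would be passed on for the table lookup.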

But having done that, there's another problem: four-byte GB18030
sequences may map to either BMP or non-BMP Unicode codepoints. With
Cygwin's wchar_t being 16-bit, this means that two wchar_t values (a
UTF-16 surrogate pair) may have to be returned for one GB18030
sequence. Yet mbrtowc can only return one wchar_t per call, and unlike
with UTF-8, there's no way to tell before the last byte whether two
are needed. I don't see a way to address that without bending the
mbrtowc spec.
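
To make the two-wchar_t case concrete, here is a small sketch of the
arithmetic for the supplementary planes. It assumes the linear mapping
that GB18030 defines for that range, where 0x90308130 corresponds to
U+10000 and 0xE3329A35 to U+10FFFF. The function name
gb18030_supp_to_utf16 is invented for this example, and the BMP
four-byte range (which needs the lookup table) is deliberately not
handled.

```c
#include <stdint.h>

/* Hypothetical helper.  Converts a four-byte GB18030 sequence from the
   supplementary-plane range into a UTF-16 surrogate pair.  Returns 2
   (the number of 16-bit units written to out), or 0 if the bytes are
   not a four-byte sequence in that range. */
int gb18030_supp_to_utf16(const unsigned char b[4], uint16_t out[2])
{
    /* Structural check, restricted to lead bytes 0x90..0xE3. */
    if (b[0] < 0x90 || b[0] > 0xE3 || b[1] < 0x30 || b[1] > 0x39 ||
        b[2] < 0x81 || b[2] > 0xFE || b[3] < 0x30 || b[3] > 0x39)
        return 0;

    /* Linear index of the sequence among all four-byte sequences. */
    uint32_t linear = (((uint32_t)(b[0] - 0x81) * 10 + (b[1] - 0x30)) * 126
                       + (b[2] - 0x81)) * 10 + (b[3] - 0x30);

    /* 189000 is the linear index of 0x90308130, i.e. of U+10000. */
    uint32_t offset = linear - 189000;
    if (offset > 0xFFFFF)
        return 0;                       /* beyond U+10FFFF */

    /* Standard UTF-16 surrogate split of (codepoint - 0x10000). */
    out[0] = (uint16_t)(0xD800 + (offset >> 10));
    out[1] = (uint16_t)(0xDC00 + (offset & 0x3FF));
    return 2;
}
```

So a single four-byte input such as 90 30 81 30 produces the pair
D800 DC00, which is exactly the situation where one mbrtowc call would
need to hand back two values.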

Andy


