This is the mail archive of the
mailing list for the Cygwin project.
Re: charset changes
On Feb 6 11:40, Corinna Vinschen wrote:
> On Feb 6 06:20, Andy Koppe wrote:
> > Other systems usually have a 32-bit wchar, though. I can see three
> > ways to tackle the issue, but none of them entirely satisfactory. When
> > encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
> > non-BMP char (and hence a UTF-16 surrogate pair):
> > 1. Just report an invalid sequence. BMP-only support would probably
> > still cover most practical needs.
> > 2. Write the high surrogate and report that one byte less than
> > actually seen has been consumed. On the next mbtowc call, ignore the
> > input, write the low surrogate, and report that 1 byte has been
> > consumed. Unfortunately this scheme falls down if the user feeds in
> > the bytes one-by-one, as Corinna previously found when handling UTF-8
> > like this.
> > 3. Write the high surrogate and report the actual number of bytes
> > consumed. On the next call, write the low surrogate, and return 0 to
> > indicate that no bytes have been consumed. Trouble is, a return value
> > of 0 from mbrtowc is supposed to indicate that a null character has
> > been found. While uses within Cygwin could be changed to recognise
> > string end by instead looking at the character actually written, this
> > would lead to truncated strings in applications.
> Can't we just carry over the surrogate pair handling from __utf8_mbtowc()
> in newlib/libc/stdlib/mbtowc_r.c? What's the stumbling block exactly?
> Do you have an example?
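
For reference, the surrogate pair that all three options have to hand
out can be sketched as below. This is only an illustration of the UTF-16
arithmetic, not the code in mbtowc_r.c; the name split_surrogates is
made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split a non-BMP code point into a UTF-16
   surrogate pair.  Returns 0 on success, -1 if cp is outside
   U+10000..U+10FFFF and therefore needs no pair (or is invalid). */
static int
split_surrogates (uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  assert (hi != NULL && lo != NULL);
  if (cp < 0x10000 || cp > 0x10FFFF)
    return -1;
  cp -= 0x10000;                              /* 20 bits remain */
  *hi = (uint16_t) (0xD800 | (cp >> 10));     /* upper 10 bits */
  *lo = (uint16_t) (0xDC00 | (cp & 0x3FF));   /* lower 10 bits */
  return 0;
}
```

With a 16-bit wchar_t, a single mbtowc call can only store one of the
two values, which is why the low surrogate has to be kept in the
conversion state until the next call.
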
I just read the GB18030 entry in the German Wikipedia again and, boy,
I dislike that codeset immediately every time. 2-byte sequences have
a trailing byte in the range 0x40-0xfe, 3-byte sequences don't exist,
4-byte sequences have a second and fourth byte in the range 0x30-0x39.
Why, oh why, do codeset implementors have to overload the ASCII range
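
The byte ranges above can be turned into a length check roughly like
the following. This is a sketch under assumptions, not code from newlib:
the function name gb18030_seq_len and its return conventions are
invented here, the lead byte range 0x81-0xfe and the exclusion of 0x7f
as a trailing byte come from the GB18030 layout rather than from the
text above, and the check only looks at the structure of a sequence,
not at whether it maps to an assigned character.

```c
#include <stddef.h>

/* Hypothetical helper: return the length (1, 2 or 4) of the GB18030
   sequence starting at s, -1 if the bytes cannot start a valid
   sequence, or -2 if more than the n available bytes are needed
   to decide. */
static int
gb18030_seq_len (const unsigned char *s, size_t n)
{
  if (n == 0)
    return -2;
  if (s[0] <= 0x7F)                     /* plain ASCII */
    return 1;
  if (s[0] < 0x81 || s[0] > 0xFE)       /* not a lead byte */
    return -1;
  if (n < 2)
    return -2;
  /* 2-byte form: trailing byte 0x40-0xfe, except 0x7f */
  if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
    return 2;
  /* 4-byte form: second and fourth byte are ASCII digits */
  if (s[1] < 0x30 || s[1] > 0x39)
    return -1;
  if (n < 3)
    return -2;
  if (s[2] < 0x81 || s[2] > 0xFE)
    return -1;
  if (n < 4)
    return -2;
  if (s[3] < 0x30 || s[3] > 0x39)
    return -1;
  return 4;
}
```

Note that the second byte alone decides between the 2-byte and 4-byte
forms, and that in both forms a trailing byte can fall into the ASCII
range.
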
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com