This is the mail archive of the
mailing list for the Cygwin project.
Re: charset changes
On Feb 6 06:20, Andy Koppe wrote:
> Other systems usually have a 32-bit wchar, though. I can see three
> ways to tackle the issue, but none of them entirely satisfactory. When
> encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
> non-BMP char (and hence a UTF-16 surrogate pair):
> 1. Just report an invalid sequence. BMP-only support would probably
> still cover most practical needs.
> 2. Write the high surrogate and report that one byte less than
> actually seen has been consumed. On the next mbtowc call, ignore the
> input, write the low surrogate, and report that 1 byte has been
> consumed. Unfortunately this scheme falls down if the user feeds in
> the bytes one-by-one, as Corinna previously found when handling UTF-8
> like this.
> 3. Write the high surrogate and report the actual number of bytes
> consumed. On the next call, write the low surrogate, and return 0 to
> indicate that no bytes have been consumed. Trouble is, a return value
> of 0 from mbrtowc is supposed to indicate that a null character has
> been found. While uses within Cygwin could be changed to recognise
> string end by instead looking at the character actually written, this
> would lead to truncated strings in applications.
Can't we just carry over the surrogate pair handling from __utf8_mbtowc()
in newlib/libc/stdlib/mbtowc_r.c? What's the stumbling block exactly?
Do you have an example?
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
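
For illustration, option 2 from the quoted message could be sketched roughly as follows. This is not newlib code. The names sketch_state, sketch_mbtowc, gb18030_4byte_to_cp and split_surrogates are hypothetical, uint16_t stands in for Cygwin's 16-bit wchar_t, and the sketch assumes the linear GB18030 mapping of four-byte sequences 0x90308130..0xE3329A35 onto U+10000..U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a low surrogate waiting to be delivered. */
typedef struct {
    uint16_t pending_low;   /* 0 when nothing is pending */
} sketch_state;

/* Map a four-byte GB18030 sequence to a supplementary-plane code point
   using the linear mapping. Returns 0 if the bytes are out of range. */
static uint32_t gb18030_4byte_to_cp(const unsigned char *s)
{
    if (s[0] < 0x90 || s[0] > 0xE3 || s[1] < 0x30 || s[1] > 0x39 ||
        s[2] < 0x81 || s[2] > 0xFE || s[3] < 0x30 || s[3] > 0x39)
        return 0;
    uint32_t linear = (uint32_t)(s[0] - 0x90) * 12600u
                    + (uint32_t)(s[1] - 0x30) * 1260u
                    + (uint32_t)(s[2] - 0x81) * 10u
                    + (uint32_t)(s[3] - 0x30);
    uint32_t cp = 0x10000u + linear;
    return cp <= 0x10FFFFu ? cp : 0;
}

/* Split a non-BMP code point into a UTF-16 surrogate pair. */
static void split_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    uint32_t v = cp - 0x10000u;
    *high = (uint16_t)(0xD800u | (v >> 10));
    *low  = (uint16_t)(0xDC00u | (v & 0x3FFu));
}

/* Option 2: the first call writes the high surrogate and reports one byte
   fewer than was actually examined; the second call ignores its input,
   writes the low surrogate and reports 1 byte.
   Returns the number of bytes reported as consumed, or -1 on invalid or
   incomplete input. Only the non-BMP four-byte case is handled here. */
static int sketch_mbtowc(uint16_t *pwc, const unsigned char *s, size_t n,
                         sketch_state *st)
{
    if (st->pending_low != 0) {
        if (n < 1)
            return -1;
        *pwc = st->pending_low;     /* input bytes are not inspected */
        st->pending_low = 0;
        return 1;
    }
    if (n < 4)
        return -1;                  /* all four bytes must be visible at once */
    uint32_t cp = gb18030_4byte_to_cp(s);
    if (cp == 0)
        return -1;
    uint16_t hi, lo;
    split_surrogates(cp, &hi, &lo);
    *pwc = hi;
    st->pending_low = lo;
    return 3;                       /* one byte less than actually seen */
}
```

The `n < 4` check is where the scheme breaks down for byte-by-byte input: in this sketch the high surrogate is only computed once all four bytes are available, so a caller that hands over one byte at a time never gets past the first call.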