Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Corinna Vinschen corinna-cygwin@cygwin.com
Sun Sep 27 11:00:00 GMT 2009

On Sep 27 11:22, Andy Koppe wrote:
> 2009/9/27 Corinna Vinschen:
> > It never occurred to me that wcrtomb could return 0 and the calling
> > functions like wcsnrtombs would simply proceed.  I'll have a look
> > to change __utf8_wctomb accordingly.
> Two further thoughts on allowing lone surrogates:
> - __mb_cur_max for UTF-8 would need to go up to 6 to allow for a lone
> high surrogate followed by a three-byte char.

In newlib (and thus Cygwin) __mb_cur_max is already 6 for UTF-8.

> - Due to the DCxx scheme, the three-byte UTF-8 encoding of DCxx would
> roundtrip to a single-byte xx. Changing the code to something else
> than DCxx wouldn't help.

I don't understand this one.  That's not what I observe after changing
the __utf8_wctomb and __utf8_mbtowc functions accordingly.  A single
byte 0x80 gets converted to U+DC80, and converting that back results
in \xed\xb2\x80.
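
For reference, the byte arithmetic behind those numbers can be sketched
as below.  This is only an illustration, not the newlib code: the
function names are hypothetical, and it assumes the DCxx scheme maps an
undecodable byte xx to U+DCxx and that U+DC80 is then written out with
the plain three-byte UTF-8 bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not newlib's __utf8_mbtowc): map an undecodable
   byte in the range 0x80..0xFF to U+DC80..U+DCFF, as in the DCxx scheme. */
static uint32_t
escape_invalid_byte (unsigned char b)
{
  return 0xDC00u | b;
}

/* Hypothetical helper (not newlib's __utf8_wctomb): write the plain
   three-byte UTF-8 form of a code point in U+0800..U+FFFF, with no
   special treatment of surrogates.  Returns the number of bytes
   written, or 0 if the code point is outside that range. */
static size_t
encode_3byte_utf8 (uint32_t cp, unsigned char out[3])
{
  if (cp < 0x0800 || cp > 0xFFFF)
    return 0;
  out[0] = (unsigned char) (0xE0 | (cp >> 12));          /* 1110xxxx */
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));         /* 10xxxxxx */
  return 3;
}

/* Checks the numbers quoted in the mail: 0x80 -> U+DC80 -> ED B2 80. */
void
check_dc80_example (void)
{
  unsigned char buf[3];
  uint32_t wc = escape_invalid_byte (0x80);

  assert (wc == 0xDC80);
  assert (encode_3byte_utf8 (wc, buf) == 3);
  assert (buf[0] == 0xED && buf[1] == 0xB2 && buf[2] == 0x80);
}
```

Calling check_dc80_example() from a test program should pass silently
if the arithmetic above is right.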


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
