Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Corinna Vinschen corinna-cygwin@cygwin.com
Sun Sep 27 09:13:00 GMT 2009

On Sep 27 07:32, Andy Koppe wrote:
> > The __utf8_wctomb function could just create the corresponding
> > UCS-2 values if no first half has been encountered before.  The
> > __utf8_mbtowc function could simply allow these UCS-2 values again.
> >
> > That works (I just tested it) and is a small change, but is it really
> > desirable to allow UCS-2 values in UTF-8 strings?
> [...]
> The pragmatic approach is tempting though, and we do have reasonable
> grounds for it given the 16-bit wchar_t. But I think it would need to
> work for both low and high surrogates.
> Regarding the latter, __utf8_wctomb() currently writes the first byte
> of a four-byte sequence when it sees a high surrogate, which of course
> it can't take back if the following codepoint isn't a low surrogate.
> This is a problem even if lone high surrogates aren't going to be
> supported, because that byte on its own is invalid UTF-8.
> Reading the POSIX spec, however, wctomb() is allowed to write nothing,
> return zero, and leave the entire high surrogate to be dealt with on
> the next call. It just says "wctomb() shall [...] return the number of
> bytes that constitute the character corresponding to the value of
> wchar", and unlike with mbtowc(), a return value of zero is not
> defined to have special meaning.
> There's also room to deal with a lone high surrogate at string end:
> "If wchar is 0, a null byte shall be stored, preceded by any shift
> sequence needed to restore the initial shift state, and wctomb() shall
> be left in the initial shift state."

It never occurred to me that wcrtomb could return 0 and that calling
functions like wcsnrtombs would simply proceed.  I'll look into
changing __utf8_wctomb accordingly.
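
For illustration, the deferred handling described above might look
roughly like the following.  This is only a sketch and not the actual
newlib code: the names sketch_state, put3 and sketch_wctomb are made up,
the state is passed explicitly, and writing a lone surrogate as a
three-byte sequence is the "pragmatic" assumption from this thread
rather than strict UTF-8.  Note that a single call can emit up to six
bytes here, which a real implementation would have to square with
MB_CUR_MAX.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a pending high surrogate, or 0. */
typedef struct {
    uint16_t pending_high;
} sketch_state;

/* Write c (0x800..0xFFFF) as a three-byte sequence. */
static size_t put3(unsigned char *out, uint32_t c)
{
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}

/* Convert one 16-bit wide character.  Returns the number of bytes
   written to out, which must have room for at least 6 bytes.  A high
   surrogate writes nothing and returns 0; it is emitted on the next
   call, either as part of a four-byte sequence or on its own. */
size_t sketch_wctomb(unsigned char *out, uint16_t wc, sketch_state *st)
{
    size_t n = 0;

    if (st->pending_high) {
        uint16_t hi = st->pending_high;
        st->pending_high = 0;
        if (wc >= 0xDC00 && wc <= 0xDFFF) {
            /* Proper pair: combine into one code point >= 0x10000. */
            uint32_t cp = 0x10000
                + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(wc - 0xDC00));
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        /* Lone high surrogate: flush it, then handle wc below. */
        n = put3(out, hi);
    }

    if (wc >= 0xD800 && wc <= 0xDBFF) {
        /* Defer: write nothing for this character yet. */
        st->pending_high = wc;
        return n;
    }

    if (wc < 0x80) {
        /* Includes wc == 0: the null byte follows any flushed bytes. */
        out[n++] = (unsigned char)wc;
    } else if (wc < 0x800) {
        out[n++] = (unsigned char)(0xC0 | (wc >> 6));
        out[n++] = (unsigned char)(0x80 | (wc & 0x3F));
    } else {
        /* BMP character, or a lone low surrogate passed through. */
        n += put3(out + n, wc);
    }
    return n;
}
```

With this sketch, feeding 0xD83D and then 0xDE00 yields 0 bytes
followed by the four bytes F0 9F 98 80, while 0xD800 followed by 'A'
yields 0 bytes and then ED A0 80 41.  A high surrogate followed by 0
is flushed in front of the null byte, leaving the state initial again.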


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
