Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Corinna Vinschen corinna-cygwin@cygwin.com
Mon Sep 28 12:05:00 GMT 2009

On Sep 28 12:48, Andy Koppe wrote:
> 2009/9/28 Corinna Vinschen:
> > Thanks for the patch, but that won't work.  The problem is that ptr can
> > validly be a NULL pointer if sys_cp_mbstowcs is called only to check
> > for the length of the result.  With the above, you'll get crashes.
> D'oh.
> > In a case like this, you have to check the input string, along these
> > lines:
> >
> > >  if ((bytes = f_mbtowc ()) < 0
> >      || (bytes == 3 && pmbs[0] == 0xef && (pmbs[1] & 0xf4) == 0x80))
> >    [...]
> Makes sense.
> Oh, and I thought of one more thing that won't roundtrip correctly
> from Unix to Windows and back: a high surrogate directly followed by a
> low surrogate, because they'll combine into a non-BMP codepoint
> represented by a 4-byte sequence. That's near-impossible to happen by
> chance though.

There is no way to get that right.  But I'm willing to stick to
this trade-off since, as you wrote, it's near-impossible that somebody
created that filename by chance.
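
To make the trade-off concrete, here is a minimal sketch of why such a
pair cannot roundtrip.  The helper names are hypothetical and are not
taken from Cygwin or newlib, and the encoder deliberately skips the
surrogate check that a strict UTF-8 encoder would perform.  It only
assumes the generic UTF-8 bit layout and the standard UTF-16 surrogate
arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not Cygwin's actual code. */

static int
is_high_surrogate (uint16_t u)
{
  return u >= 0xd800 && u <= 0xdbff;
}

static int
is_low_surrogate (uint16_t u)
{
  return u >= 0xdc00 && u <= 0xdfff;
}

/* Combine a UTF-16 surrogate pair into the code point it denotes. */
static uint32_t
combine_surrogates (uint16_t hi, uint16_t lo)
{
  assert (is_high_surrogate (hi) && is_low_surrogate (lo));
  return 0x10000 + (((uint32_t) (hi - 0xd800) << 10)
                    | (uint32_t) (lo - 0xdc00));
}

/* Encode one code point using the generic UTF-8 bit layout.  Unlike a
   strict encoder this does NOT reject surrogates, so a lone surrogate
   comes out as a 3-byte sequence.  Returns the number of bytes written
   (1 to 4), or 0 if cp is above 0x10ffff.  */
static size_t
encode_utf8_lenient (uint32_t cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}
```

For example, a Unix filename holding the two lone surrogates U+D83D and
U+DE00 takes 3 + 3 = 6 bytes.  On the Windows side the two UTF-16 units
sit next to each other and are indistinguishable from the valid pair for
U+1F600, so converting back yields a single 4-byte sequence instead of
the original six bytes.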

> I'll give the DLL with your patches a spin tonight.

Hang on a few minutes.  I'm just applying the Cygwin patches.  I found a
few minor glitches in my original patch.  Of course, you still have to
apply the newlib stuff which hasn't been approved yet.


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
