Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Corinna Vinschen corinna-cygwin@cygwin.com
Mon Sep 28 12:05:00 GMT 2009


On Sep 28 12:48, Andy Koppe wrote:
> 2009/9/28 Corinna Vinschen:
> > Thanks for the patch, but that won't work.  The problem is that ptr can
> > validly be a NULL pointer if sys_cp_mbstowcs is called only to check
> > for the length of the result.  With the above, you'll get crashes.
> 
> D'oh.
> 
> > In a case like this, you have to check the input string, along these
> > lines:
> >
> >  if ((bytes = f_mbtowc ()) < 0
> >      || (bytes == 3 && pmbs[0] == 0xef && (pmbs[1] & 0xf4) == 0x80))
> >    [...]
> 
> Makes sense.
> 
> Oh, and I thought of one more thing that won't roundtrip correctly
> from Unix to Windows and back: a high surrogate directly followed by a
> low surrogate, because they'll combine into a non-BMP codepoint
> represented by a 4-byte sequence. That's near-impossible to happen by
> chance though.

There is no way to get that right.  But I'm willing to accept this
trade-off since, as you wrote, it's near-impossible that somebody
would create such a filename by chance.
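
To illustrate the case, here is a minimal sketch in C.  It is not the
Cygwin or newlib code; put_utf8 and utf16_to_utf8 are hypothetical
helpers made up for this example, and it assumes a converter that passes
a lone surrogate through as a 3-byte sequence but combines a valid
high/low pair into one code point.  Two surrogates that arrived from the
Unix side as two separate 3-byte sequences (6 bytes) are then adjacent
in UTF-16 and come back as a single 4-byte sequence:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write one code point as UTF-8 and return the
   number of bytes written.  Surrogate values are not rejected; they
   come out as 3-byte sequences, as assumed for this sketch. */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    out[0] = (unsigned char)(0xf0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[3] = (unsigned char)(0x80 | (cp & 0x3f));
    return 4;
}

/* Hypothetical helper: convert n UTF-16 units to UTF-8.  A high
   surrogate directly followed by a low surrogate is combined into one
   non-BMP code point; any other surrogate is passed through alone.
   The caller must provide at least 3 * n bytes in out. */
static size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < n
            && in[i + 1] >= 0xdc00 && in[i + 1] <= 0xdfff) {
            /* Valid pair: the two units collapse into one code point. */
            cp = 0x10000 + ((cp - 0xd800) << 10) + (in[i + 1] - 0xdc00);
            i++;
        }
        len += put_utf8(cp, out + len);
    }
    return len;
}
```

With the units 0xd83d 0xde00 as input, this yields the 4 bytes
f0 9f 98 80 (U+1F600), while each unit converted on its own gives
ed a0 bd and ed b8 80.  So the 6 bytes the filename started with can't
be recovered from the combined form.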

> I'll give the DLL with your patches a spin tonight.

Hang on a few minutes.  I'm just applying the Cygwin patches.  I found a
few minor glitches in my original patch.  Of course, you still have to
apply the newlib stuff which hasn't been approved yet.


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat