Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Andy Koppe andy.koppe@gmail.com
Mon Sep 28 05:51:00 GMT 2009


2009/9/27 Corinna Vinschen:
>> > What about this:  The private use area U+f0xx is already used for ASCII
>> > chars invalid in Windows filenames.  The same range can be used for
>> > invalid chars > 0x80.  This could happen unconditionally.
>>
>> That's a great idea, allowing both lone surrogate support and Unix
>> filename transparency.
>>
>> [time passes]
>>
>> Nope, can't think of anything wrong with it. :)
>
> Did we get it?  Did we actually get it?

Not quite. :(

If the Unix filename already contains the UTF-8 encoding of a U+F0xx
character, that three-byte sequence will now come back as just the xx
byte instead of roundtripping. U+F000 is particularly problematic, as
it comes back as a null byte.

Solution: if f_mbtowc comes back with a U+F0xx, scratch that, and
instead turn each of the original bytes into a U+F0xx, i.e.:

\xEF\x80\x80 -> U+F0EF U+F080 U+F080

One for later?
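
The escaping rule above can be sketched roughly as follows. This is an
illustration only, not the code in the patch: decode_utf8(),
bytes_to_wide() and wide_to_bytes() are made-up names standing in for
f_mbtowc and its callers, code points are kept as uint32_t rather than
UTF-16, and only sequences of up to three bytes (the BMP) are decoded,
so longer sequences simply get escaped byte by byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for f_mbtowc: decode one UTF-8 sequence of up to
   three bytes.  Returns the number of bytes consumed, or 0 if the input
   does not start with a well-formed sequence.  Surrogates are let through
   on purpose, in the spirit of the CESU-8 allowance discussed here. */
static size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if (s[0] >= 0xC2 && s[0] <= 0xDF && n >= 2 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && n >= 3
        && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        uint32_t c = ((uint32_t)(s[0] & 0x0F) << 12)
                   | ((uint32_t)(s[1] & 0x3F) << 6)
                   | (uint32_t)(s[2] & 0x3F);
        if (c < 0x800)
            return 0;               /* overlong encoding */
        *cp = c;
        return 3;
    }
    return 0;
}

/* Bytes -> code points.  'out' must have room for n elements. */
size_t bytes_to_wide(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp;
        size_t len = decode_utf8(in + i, n - i, &cp);
        if (len == 0) {
            /* Invalid byte: escape it on its own as U+F0xx. */
            out[o++] = 0xF000u | in[i++];
        } else if (cp >= 0xF000 && cp <= 0xF0FF) {
            /* Decoded into the escape range: scratch that and escape
               each of the original bytes instead. */
            for (size_t k = 0; k < len; k++)
                out[o++] = 0xF000u | in[i + k];
            i += len;
        } else {
            out[o++] = cp;
            i += len;
        }
    }
    return o;
}

/* Code points -> bytes.  'out' must have room for 3 * n bytes. */
size_t wide_to_bytes(const uint32_t *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = in[i];
        assert(c <= 0xFFFF);        /* sketch handles the BMP only */
        if (c >= 0xF000 && c <= 0xF0FF) {
            out[o++] = (unsigned char)(c & 0xFF);   /* unescape */
        } else if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

With this, \xEF\x80\x80 becomes U+F0EF U+F080 U+F080 and converts back
to the same three bytes, so nothing in the escape range can collapse
to a single byte any more.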


> I have a local implementation for the entire thing,
>
> - Ctrl-X instead of Ctrl-N
> - invalid \xXX bytes -> U+ffXX
> - Allow CESU-8 sequences for lone surrogate halves
> - Change documentation accordingly.

Wow, that was quick!


> If you want to play with it, the entire patch is here (missing a ChangeLog
> for now):
>
>   http://cygwin.de/hopefully-last-big-cygwin-locale-patch.diff

Compile problem:

cc1plus: warnings being treated as errors
../../.././winsup/cygwin/syscalls.cc: In function ‘char* setlocale(int, const char*)’:
../../.././winsup/cygwin/syscalls.cc:4186: error: ‘w_cwd’ may be used uninitialized in this function
../../.././winsup/cygwin/syscalls.cc:4186: error: ‘w_path’ may be used uninitialized in this function

Looks like a false alarm though, and a pair of "=0"s made it compile.
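
For anyone wondering what the "=0" workaround looks like, here is a
made-up example of the general pattern. It is not the setlocale() code
from the patch: the function name, the types and the logic are all
assumptions, and only the variable names are borrowed from the error
message. The variables are set and used under the same condition, which
is the sort of flow the compiler cannot always see through.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in; the real code in syscalls.cc is not shown in
   this thread. */
size_t remembered_lengths(int convert, const wchar_t *cwd,
                          const wchar_t *path)
{
    /* The "= 0" initializers are the workaround: they give the compiler
       a definite value on every path. */
    const wchar_t *w_cwd = 0;
    const wchar_t *w_path = 0;

    assert(!convert || (cwd && path));

    if (convert) {
        w_cwd = cwd;
        w_path = path;
    }

    /* ... unrelated work would happen here ... */

    if (convert)    /* same condition, so both pointers are set here */
        return wcslen(w_cwd) + wcslen(w_path);
    return 0;
}
```

The initializers change nothing at runtime in a case like this, since
the pointers are only read on the path where they were assigned.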

Andy


