This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)


2009/9/28 Corinna Vinschen:
>> >> Oh, and I thought of one more thing that won't roundtrip correctly
>> >> from Unix to Windows and back: a high surrogate directly followed by a
>> >> low surrogate, because they'll combine into a non-BMP codepoint
>> >> represented by a 4-byte sequence. That's near-impossible to happen by
>> >> chance though.
>> >
>> > There is no chance to do that right. But I'm willing to stick to
>> > this trade-off since, as you wrote, it's near-impossible that somebody
>> > created that filename by chance.
>>
>> Hmm. But what if Java or Oracle or some other CESU-8 degenerate did
>> that on purpose?
>>
>> Just in case you're not yet completely sick of this, here's how I
>> think it could be done:
>
> Nooooo! I *am* completely sick of this. I'm willing to let this slip
> until the first complaint about this very issue comes along.

Sorry.

And I was wrong as well: outlawing lone surrogates in
__utf8_mbtowc/wctomb is not necessary to deal with this. It could be
done in the same way as I'd suggested for F0xx codepoints, i.e., treat
them as illegals in sys_cp_mbstowcs only:

\xED\xB0\x80 -> U+F0ED U+F0B0 U+F080
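
A minimal sketch of that mapping, in Python rather than the C used inside Cygwin, so it only illustrates the idea and is not the actual sys_cp_mbstowcs code. The names `_f0_escape` and `mbstowcs_sketch` are hypothetical. It relies on Python's strict UTF-8 decoder rejecting surrogate encodings, and sends every rejected byte to U+F000 plus the byte value:

```python
import codecs


def _f0_escape(err):
    """Error handler: map each undecodable byte to U+F000 + byte value."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Resume decoding right after the bytes that were just escaped.
    return "".join(chr(0xF000 + b) for b in bad), err.end


# "f0escape" is a made-up handler name used only in this sketch.
codecs.register_error("f0escape", _f0_escape)


def mbstowcs_sketch(data: bytes) -> str:
    """Decode UTF-8, escaping ill-formed bytes (incl. surrogates) as U+F0xx."""
    return data.decode("utf-8", errors="f0escape")
```

Calling `mbstowcs_sketch(b"\xed\xb0\x80")` yields the three code points U+F0ED U+F0B0 U+F080, while well-formed input passes through unchanged. The sketch does not treat genuine U+F0xx code points in the input as illegal, which the full scheme would also need in order to stay reversible.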

^X encoding would still be needed for Windows-side lone surrogates,
but due to the above, __utf8_wctomb could be used to encode them.
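
For the opposite direction, a rough Python sketch (again not the Cygwin C code). It assumes the ^X encoding means a 0x18 byte followed by the character's UTF-8-style bytes; that layout and the name `wcstombs_sketch` are assumptions made for illustration. The input is a list of UTF-16 code units, as found in a Windows filename:

```python
CAN = b"\x18"  # ^X; assumed here to prefix an escaped character


def _is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def _is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def wcstombs_sketch(units: list[int]) -> bytes:
    """Convert UTF-16 code units to bytes, escaping lone surrogates with ^X."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if _is_high(u) and i + 1 < len(units) and _is_low(units[i + 1]):
            # Proper pair: combine into one non-BMP code point (4 bytes).
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 2
            continue
        if _is_high(u) or _is_low(u):
            # Lone surrogate: ^X plus its 3-byte generalized UTF-8 form.
            out += CAN + chr(u).encode("utf-8", "surrogatepass")
        else:
            out += chr(u).encode("utf-8")
        i += 1
    return bytes(out)
```

Here `wcstombs_sketch([0xDC00])` gives `b"\x18\xed\xb0\x80"`, whereas the pair `[0xD83D, 0xDE00]` collapses into the 4-byte sequence `b"\xf0\x9f\x98\x80"`. The second case is the combining behaviour that breaks the Unix-to-Windows-and-back roundtrip discussed earlier in the thread.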

But yeah, let's leave that for now.

Andy

