Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Corinna Vinschen corinna-cygwin@cygwin.com
Sun Sep 27 20:30:00 GMT 2009


On Sep 27 18:01, Andy Koppe wrote:
> 2009/9/27 Corinna Vinschen:
> > I'm getting headaches.
> 
> Same here. Someone ought to be shot for UTF-16.

I agree.  Especially those who decided to store filenames as UTF-16.

> > What about this:  The private use area U+f0xx is already used for ASCII
> > chars invalid in Windows filenames.  The same range can be used for
> > invalid chars > 0x80.  This could happen unconditionally.
> 
> That's a great idea, allowing both lone surrogate support and Unix
> filename transparency.
> 
> [time passes]
> 
> Nope, can't think of anything wrong with it. :)

Did we get it?  Did we actually get it?  I can't believe it.

I have a local implementation for the entire thing,

- Ctrl-X instead of Ctrl-N
- invalid \xXX bytes -> U+f0XX
- Allow CESU-8 sequences for lone surrogate halves
- Change documentation accordingly.

except for the interface to a potential setcons tool, which is low
priority for now.
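
To make the mapping above concrete, here is a rough Python sketch of the idea,
not the code from the patch.  The function names decode_filename and
encode_filename are made up for illustration, and the sketch assumes that only
bytes >= 0x80 need the U+f0XX escape on the way in.  Lone surrogate halves in
their three-byte (CESU-8 style) form are let through using Python's
"surrogatepass" error handler.

```python
def decode_filename(raw: bytes) -> str:
    """Decode UTF-8; map each undecodable byte 0xXX to U+F0XX (hypothetical helper)."""
    out = []
    i = 0
    while i < len(raw):
        # Try the shortest valid sequence starting at i (UTF-8 uses at most 4 bytes).
        for n in (1, 2, 3, 4):
            try:
                # "surrogatepass" accepts three-byte encodings of lone surrogates.
                out.append(raw[i:i + n].decode("utf-8", "surrogatepass"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid sequence starts here: escape this single byte.
            out.append(chr(0xF000 | raw[i]))
            i += 1
    return "".join(out)


def encode_filename(name: str) -> bytes:
    """Reverse mapping: U+F080..U+F0FF become the raw byte again (hypothetical helper)."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)
```

Note that in this sketch a genuine U+f0XX character in the input is
indistinguishable from an escaped byte after decoding, so the round trip is
only exact for input that does not already contain that range.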

I'll apply the Cygwin parts and send the newlib patch upstream tomorrow.

If you want to play with it, the entire patch is here (missing a ChangeLog
for now):

   http://cygwin.de/hopefully-last-big-cygwin-locale-patch.diff

Testing highly appreciated.  Patches, too, especially against the
documentation.  I don't think I explained that adequately.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


