Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Corinna Vinschen corinna-cygwin@cygwin.com
Sun Sep 27 12:04:00 GMT 2009

On Sep 27 12:14, Andy Koppe wrote:
> 2009/9/27 Corinna Vinschen:
> > I don't understand this one.  That's not what I observe after I have
> > changed the __utf8_wctomb and __utf8_mbtowc functions accordingly.
> > A single byte 0x80 gets encoded to U+DC80.  The round trip results
> > in \xed\xb2\x80.
> Ah, I'd assumed that U+DCxx in filenames would continue to map to xx
> (and vice versa). Either way, this would mean that filenames aren't
> transparent: the name can change between open() and readdir().
> ... pondering ...
> Therefore I think that lone surrogates shouldn't be allowed after all,
> because Unix filename transparency is more important than being able
> to access Windows filenames with invalid UTF-16 (which can't have been
> created within Cygwin).

After being through this and looking into what happens, I disagree.

First of all, as you noted, Windows allows lone surrogate halves in
filenames anyway.  Disallowing lone surrogates means disallowing
access to some Windows filenames.  There's already a single borderline
case, namely the use of the private use area U+F0xx to allow ASCII
chars otherwise disallowed on Windows filesystems.
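
To make the private use area trick concrete, here is a rough sketch in
Python.  It is not the Cygwin code.  The names RESERVED,
to_windows_name and from_windows_name are hypothetical, and the exact
set of remapped characters is an assumption chosen for illustration.

```python
# Assumed, illustrative set of ASCII characters that Windows filesystems reject.
RESERVED = set('"*:<>?|')

def to_windows_name(name: str) -> str:
    """Shift reserved ASCII characters into the private use area U+F0xx."""
    return "".join(chr(0xF000 + ord(c)) if c in RESERVED else c for c in name)

def from_windows_name(name: str) -> str:
    """Map U+F0xx back to ASCII when xx is one of the reserved characters."""
    out = []
    for c in name:
        low = chr(ord(c) - 0xF000) if 0xF000 <= ord(c) <= 0xF0FF else None
        out.append(low if low in RESERVED else c)
    return "".join(out)

stored = to_windows_name("a:b")
print(stored == "a\uf03ab")               # True
print(from_windows_name(stored) == "a:b")  # True
```

The borderline part is that a Windows filename which really contains
U+F03A comes back as ':' under such a scheme.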

Second, the lone U+DCxx should always result in a valid UTF-8 sequence
for the sake of applications calling mbstowcs on these strings.
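
As a small illustration of what that sequence looks like, the sketch
below applies the generic three-byte UTF-8 bit pattern to U+DC80
without rejecting surrogates.  It is written in Python purely for
demonstration and is not the __utf8_wctomb implementation.  The
function name encode_3byte is hypothetical.

```python
def encode_3byte(cp: int) -> bytes:
    """Encode U+0800..U+FFFF with the three-byte UTF-8 pattern, surrogates included."""
    if not 0x800 <= cp <= 0xFFFF:
        raise ValueError("code point outside the three-byte range")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),         # 10xxxxxx: low 6 bits
    ])

encoded = encode_3byte(0xDC80)
print(encoded.hex())  # edb280

# Python's own "surrogatepass" error handler produces the same bytes
# and converts them back to the lone surrogate.
print(encoded == "\udc80".encode("utf-8", "surrogatepass"))   # True
print(encoded.decode("utf-8", "surrogatepass") == "\udc80")   # True
```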

Third, even when using the \016\377\x replacement, the filename gets
changed between open and readdir.  However, in both cases the
application can still access the file using the original byte sequence
because it's still translated into the identical UTF-16 value.
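
The round trip can be modelled roughly as follows.  This is a toy
model in Python under the assumptions discussed in this thread
(an invalid byte xx becomes U+DCxx, and a lone surrogate is written
back as a three-byte sequence), not the actual conversion code.  The
names to_wide and to_multibyte are hypothetical.

```python
def to_wide(data: bytes) -> str:
    """Toy multibyte -> wide conversion: invalid bytes become U+DC00 + byte."""
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate first; accept encoded lone surrogates.
        for length in (4, 3, 2, 1):
            chunk = data[i:i + length]
            try:
                ch = chunk.decode("utf-8", "surrogatepass")
            except UnicodeDecodeError:
                continue
            if len(ch) == 1:
                out.append(ch)
                i += len(chunk)
                break
        else:
            # No valid sequence starts here: map the byte to a lone low surrogate.
            out.append(chr(0xDC00 + data[i]))
            i += 1
    return "".join(out)

def to_multibyte(name: str) -> bytes:
    """Toy wide -> multibyte conversion: lone surrogates get three bytes."""
    return name.encode("utf-8", "surrogatepass")

opened = b"foo-\x80"                    # name passed to open()
listed = to_multibyte(to_wide(opened))  # name a later readdir() would report
print(listed == b"foo-\xed\xb2\x80")       # True: the name changed
print(to_wide(opened) == to_wide(listed))  # True: both reach the same file
```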

Last but not least, you cannot have both graceful handling of invalid
sequences *and* a bijective relation between UTF-16 and multibyte
strings.  There's always a tradeoff.  Either you disallow some multibyte
filenames, or you have to live with the fact that two different
multibyte sequences translate to the same UTF-16 filename or vice versa.
The latter is IMHO the lesser problem.  The probability that an
application tries to use two different files named foo-\x80 and
foo-\xed\xb2\x80 is almost nil.  Or, FWIW, foo-\xf0\x93\x90\x8d and


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
