Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Andy Koppe andy.koppe@gmail.com
Sun Sep 27 13:06:00 GMT 2009

2009/9/27 Corinna Vinschen:
> Last but not least, you cannot have both, graceful handling of invalid
> sequences *and* a bijective relation between UTF-16 and multibyte
> strings.  There's always a tradeoff.

Correct. However, you can have correct roundtripping from any Unix
filename to a Windows filename and back to the same Unix filename
(well, with UTF-8 and singlebyte charsets at least; the DBCSs seem
inherently dodgy anyway).

And I contend that that's more important than supporting invalid
UTF-16 in Windows filenames not created by Cygwin.
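
To make the roundtripping point concrete, here is a minimal sketch in
Python. It does not use Cygwin's conversion code; it assumes that
Python's "surrogateescape" error handler is a fair stand-in for the
"invalid byte 0xNN becomes U+DCNN" scheme under discussion. The helper
names to_wide() and to_bytes() are made up for illustration.

```python
# Sketch only: Python's "surrogateescape" handler stands in for the
# byte -> U+DCxx mapping discussed here (an assumption, not Cygwin code).

def to_wide(name: bytes) -> str:
    # Invalid UTF-8 bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    return name.decode("utf-8", errors="surrogateescape")

def to_bytes(wide: str) -> bytes:
    # Lone surrogates U+DC80..U+DCFF turn back into the original bytes.
    return wide.encode("utf-8", errors="surrogateescape")

samples = [
    b"plain",
    b"\xc4",               # "A with diaeresis" in ISO-8859-1, invalid as UTF-8
    b"\xc3\x84",           # the same character as valid UTF-8
    b"foo-\x80",
    b"foo-\xed\xb2\x80",   # UTF-8-style encoding of lone U+DC80
]

# Every byte string comes back unchanged after the round trip.
roundtrip_ok = all(to_bytes(to_wide(s)) == s for s in samples)

# The two "foo-" names stay distinct, because the encoded surrogate is
# itself treated as three invalid bytes rather than as U+DC80.
distinct = to_wide(b"foo-\x80") != to_wide(b"foo-\xed\xb2\x80")
```

The price is visible in the last sample: the round trip only works
because the three-byte encoding of a lone surrogate is rejected as
invalid, which is exactly the tradeoff against invalid UTF-16 in
filenames not created by Cygwin.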

> Second, the lone U+DCxx should always result in a valid UTF-8 sequence
> for the sake of applications calling mbstowcs on these strings.

Applications cannot make that assumption anyway, unless POSIX (or some
such authority) makes valid UTF-8 mandatory for filenames.

> Either you disallow some multibyte
> filenames, or you have to live with the fact that two different
> multibyte sequences translate to the same UTF-16 filename or vice versa.
> The latter is IMHO the lesser problem.  The probability that an
> application tries to use two different files named foo-\x80 and
> foo-\xed\xb2\x80 is almost nil.

Accepted, but I don't think that's the main issue here.

Here's an example: say you've got your locale set to UTF-8, and you
unpack a tarball created on an ISO-8859-1 system that contains a file
called "Ä". The single byte 0xC4 is not valid UTF-8, so it turns into
U+DCC4 on disk. So far so good.

Now you run 'convmv -f ISO-8859-1 -t UTF-8' on it to correct the
filename, but instead of a single ISO-8859-1 byte representing "Ä",
convmv will see the three bytes of the UTF-8-encoded low surrogate
(0xED 0xB3 0x84), and hence the filename will end up with three UTF-8
characters instead of one.
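
The byte arithmetic can be checked with a short Python sketch. It does
not run convmv; it only imitates the two conversions involved, and it
assumes that Python's "surrogateescape" and "surrogatepass" handlers
behave like the U+DCxx scheme described above. All variable names are
illustrative.

```python
# Sketch only: imitates the conversions with Python codecs (assumption:
# these handlers mirror the U+DCxx scheme; this is not convmv itself).

raw = b"\xc4"  # the filename byte as stored in the ISO-8859-1 tarball

# Unpacking under a UTF-8 locale: the invalid byte becomes U+DCC4.
on_disk = raw.decode("utf-8", errors="surrogateescape")

# If the lone surrogate is handed back as a valid UTF-8 sequence, the
# application sees three bytes instead of the original one.
seen = on_disk.encode("utf-8", errors="surrogatepass")

# What an ISO-8859-1 -> UTF-8 conversion makes of those three bytes.
converted = seen.decode("iso-8859-1").encode("utf-8")

# What the conversion would have produced from the original byte.
intended = raw.decode("iso-8859-1").encode("utf-8")

print(seen.hex(), converted.hex(), intended.hex())
```

Here `seen` is ED B3 84, so the conversion treats each of those bytes
as a separate ISO-8859-1 character and emits three characters, whereas
`intended` is the two-byte UTF-8 encoding of the single character.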

Also, there'll probably be testsuites that trip over this, e.g. Lapo
Lucchini's 'monotone' tests that triggered this whole discussion
(seemingly an eternity ago ...).


