Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
Sun Sep 27 13:06:00 GMT 2009
2009/9/27 Corinna Vinschen:
> Last but not least, you cannot have both, graceful handling of invalid
> sequences *and* a bijective relation between UTF-16 and multibyte
> strings. There's always a tradeoff.
Correct. However, you can have correct roundtripping from any Unix
filename to a Windows filename and back to the same Unix filename
(at least with UTF-8 and single-byte charsets; the DBCSs seem
inherently dodgy anyway).
And I contend that that's more important than supporting invalid
UTF-16 in Windows filenames not created by Cygwin.
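To make the roundtripping claim concrete, here is a minimal sketch in Python. It is not Cygwin's code: the helper names unix_to_windows and windows_to_unix are hypothetical, and Python's "surrogateescape" error handler merely stands in for a conversion that maps each undecodable byte 0xNN to the lone surrogate U+DCNN and back.

```python
def unix_to_windows(name: bytes) -> str:
    # Hypothetical model of the Unix -> Windows direction: valid UTF-8 is
    # decoded normally, each undecodable byte 0xNN becomes U+DCNN.
    return name.decode("utf-8", "surrogateescape")


def windows_to_unix(name: str) -> bytes:
    # Hypothetical model of the reverse direction: a lone U+DCNN is turned
    # back into the raw byte 0xNN, not into a three-byte UTF-8 sequence.
    return name.encode("utf-8", "surrogateescape")


names = [
    b"plain",
    "\u00c4".encode("utf-8"),   # valid UTF-8 for "Ä"
    b"\xc4",                    # ISO-8859-1 "Ä", invalid as UTF-8
    b"foo-\x80",
    b"foo-\xed\xb2\x80",        # UTF-8-style encoding of a lone U+DC80
]

# Every Unix name comes back byte-for-byte identical.
roundtrips = all(windows_to_unix(unix_to_windows(n)) == n for n in names)
```

The price of this choice is that the Windows-side string can contain lone surrogates, and that a lone U+DCxx never yields valid UTF-8 on the way back.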
> Second, the lone U+DCxx should always result in a valid UTF-8 sequence
> for the sake of applications calling mbstowcs on these strings.
Applications cannot make that assumption anyway, unless POSIX (or some
such authority) makes valid UTF-8 mandatory for filenames.
> Either you disallow some multibyte
> filenames, or you have to live with the fact that two different
> multibyte sequences translate to the same UTF-16 filename or vice versa.
> The latter is IMHO the lesser problem. The probability that an
> application tries to use two different files named foo-\x80 and
> foo-\xed\xb2\x80 is almost nil.
Accepted, but I don't think that's the main issue here.
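For reference, the collision from the quoted paragraph can be reproduced with a short Python sketch. This only models the behaviour under discussion; it assumes that an invalid byte 0x80 is mapped to U+DC80 (modelled with "surrogateescape") and that the three-byte sequence for a lone surrogate is also accepted (modelled with "surrogatepass").

```python
raw_name = b"foo-\x80"              # invalid UTF-8 byte
escaped_name = b"foo-\xed\xb2\x80"  # UTF-8-style encoding of U+DC80

# Invalid byte 0x80 is mapped to the lone low surrogate U+DC80.
from_raw = raw_name.decode("utf-8", "surrogateescape")

# The three-byte sequence is accepted as an encoded lone surrogate.
from_escaped = escaped_name.decode("utf-8", "surrogatepass")

# Both multibyte names end up as the same UTF-16 filename.
collide = from_raw == from_escaped
```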
Here's an example: say you've got your locale set to UTF-8, and you
unpack a tarball created on an ISO-8859-1 system that contains a file
called "Ä". This turns into U+DCC4 on disk. So far so good.
Now you run 'convmv -f ISO-8859-1 -t UTF-8' on it to correct the
filename, but instead of a single ISO-8859-1 byte representing "Ä",
convmv will see the three bytes of the UTF-8-encoded low surrogate
(\xed\xb3\x84), and hence the filename will end up with three UTF-8
characters instead of one.
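The effect can be sketched in Python. This is an illustration, not convmv itself: the convmv step is modelled as "decode the bytes as ISO-8859-1, re-encode as UTF-8", and the lone surrogate is assumed to be handed to the application as its three-byte UTF-8-style encoding.

```python
raw = b"\xc4"  # ISO-8859-1 "Ä" as stored in the tarball

# On disk the invalid byte becomes the lone low surrogate U+DCC4.
on_disk = chr(0xDC00 + raw[0])

# Assumption: the application is shown the three-byte encoding of that
# surrogate rather than the original byte.
seen_by_app = on_disk.encode("utf-8", "surrogatepass")

# Model of 'convmv -f ISO-8859-1 -t UTF-8' applied to what it sees.
converted = seen_by_app.decode("iso-8859-1").encode("utf-8")

# What the user actually wanted: one character, "Ä" in UTF-8.
intended = raw.decode("iso-8859-1").encode("utf-8")

converted_chars = len(converted.decode("utf-8"))
intended_chars = len(intended.decode("utf-8"))
```

Here seen_by_app is b"\xed\xb3\x84", so the converted name has three characters where the intended one has a single "Ä".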
Also, there'll probably be testsuites that trip over this, e.g. Lapo
Lucchini's 'monotone' tests that triggered this whole discussion
(seemingly an eternity ago ...).