This is the mail archive of the
cygwin-developers
mailing list for the Cygwin project.
Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
2009/9/27 Corinna Vinschen:
> Last but not least, you cannot have both, graceful handling of invalid
> sequences *and* a bijective relation between UTF-16 and multibyte
> strings. There's always a tradeoff.
Correct. However, you can have correct roundtripping from any Unix
filename to a Windows filename and back to the same Unix filename
(well, with UTF-8 and singlebyte charsets anyway. The DBCSs seem
inherently dodgy anyway).
And I contend that that's more important than supporting invalid
UTF-16 in Windows filenames not created by Cygwin.
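To make the roundtrip property concrete, here is a minimal sketch. It is not Cygwin's code: it uses Python's "surrogateescape" error handler as a stand-in, since that handler also maps each undecodable byte 0x80..0xFF to a lone low surrogate U+DC80..U+DCFF. The helper names unix_to_utf16 and utf16_to_unix are made up for illustration.

```python
def unix_to_utf16(name: bytes) -> str:
    # Valid UTF-8 decodes normally; every undecodable byte 0xNN becomes
    # the lone low surrogate U+DCNN (assumption: models the Cygwin mapping).
    return name.decode("utf-8", errors="surrogateescape")


def utf16_to_unix(name: str) -> bytes:
    # Inverse direction: U+DCNN turns back into the single byte 0xNN.
    return name.encode("utf-8", errors="surrogateescape")


samples = [
    b"plain.txt",
    "\u00c4".encode("utf-8"),   # valid UTF-8 for A-umlaut
    b"\xc4",                    # the same letter as a raw ISO-8859-1 byte
    b"foo-\x80",
    b"foo-\xed\xb2\x80",        # UTF-8-style encoding of U+DC80 (invalid UTF-8)
]

# Every byte string comes back unchanged after the trip through "UTF-16".
roundtrip_ok = all(utf16_to_unix(unix_to_utf16(s)) == s for s in samples)

# In this model the two foo-* names stay distinct, because the bytes
# ED B2 80 are themselves treated as three invalid bytes.
distinct = unix_to_utf16(b"foo-\x80") != unix_to_utf16(b"foo-\xed\xb2\x80")
```

The price of this model is the one Corinna names: a lone surrogate that already exists in a Windows filename has no valid UTF-8 representation here.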
> Second, the lone U+DCxx should always result in a valid UTF-8 sequence
> for the sake of applications calling mbstowcs on these strings.
Applications cannot make that assumption anyway, unless POSIX (or some
such authority) makes valid UTF-8 mandatory for filenames.
> Either you disallow some multibyte
> filenames, or you have to live with the fact that two different
> multibyte sequences translate to the same UTF-16 filename or vice versa.
> The latter is IMHO the lesser problem. The probability that an
> application tries to use two different files named foo-\x80 and
> foo-\xed\xb2\x80 is almost nil.
Accepted, but I don't think that's the main issue here.
Here's an example: say you've got your locale set to UTF-8, and you
unpack a tarball created on an ISO-8859-1 system that contains a file
called "Ä". This turns into U+DCC4 on disk. So far so good.
Now you run 'convmv -f ISO-8859-1 -t UTF-8' on it to correct the
filename, but instead of a single ISO-8859-1 byte representing "Ä",
convmv will see the three bytes of the low surrogate, and hence the
filename will end up with three UTF-8 characters instead of one.
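The byte-level arithmetic can be sketched as follows. This is only an illustration under assumptions: Python's "surrogateescape" and "surrogatepass" handlers stand in for the Cygwin conversions under discussion, and the ISO-8859-1 to UTF-8 step imitates what convmv would do with the bytes it is handed; convmv itself is not invoked.

```python
# The filename as stored in the tarball: "A-umlaut" in ISO-8859-1.
raw = b"\xc4"

# Unpacking in a UTF-8 locale: 0xC4 alone is invalid UTF-8, so it is
# mapped to the lone low surrogate U+DCC4 on disk.
on_disk = raw.decode("utf-8", errors="surrogateescape")

# If the lone surrogate is handed back to applications as a UTF-8-style
# sequence (assumption: modeled here with "surrogatepass"), the
# application sees three bytes instead of the original one.
seen = on_disk.encode("utf-8", errors="surrogatepass")

# An ISO-8859-1 -> UTF-8 conversion treats each of those bytes as one
# character and re-encodes it.
converted = seen.decode("iso-8859-1").encode("utf-8")

# What the conversion should have produced from the original byte.
expected = raw.decode("iso-8859-1").encode("utf-8")

print(seen.hex(), converted.hex(), expected.hex())
```

Here seen is ED B3 84, so the conversion yields three characters (six bytes of UTF-8) where a single "Ä" (C3 84) was wanted.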
Also, there'll probably be testsuites that trip over this, e.g. Lapo
Lucchini's 'monotone' tests that triggered this whole discussion
(seemingly an eternity ago ...).
Andy