This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
On Sep 28 07:23, Andy Koppe wrote:
> 2009/9/28 Andy Koppe:
> > If the Unix filename contains the UTF-8 representation of U+F0xx, that
> > will now roundtrip to just the xx byte. U+F000 is particularly
> > problematic, as that roundtrips to a null byte.
> >
> > Solution: if f_mbtowc comes back with a U+F0xx, scratch that, and
> > instead turn each of the original bytes into a U+F0xx, i.e.:
> >
> > \xEF\x80\x80 -> U+F0EF U+F080 U+F080
> >
> > One for later?
>
> Actually, I think there's a very simple way to implement this: just
> treat a U+F0xx result the same as an encoding error. For example:
>
> --- strfuncs.cc.bak 2009-09-28 06:05:53.866000000 +0100
> +++ strfuncs.cc 2009-09-28 07:08:36.909000000 +0100
> @@ -602,9 +602,10 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
> *ptr = 0x18;
> }
> }
> - else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
> - charset, &ps)) < 0
> - && *pmbs >= 0x80)
> + else if (((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
> + charset, &ps)) < 0
> + && *pmbs >= 0x80)
> + || (*ptr & 0xff00) == 0xf000)
> {
> /* The technique is based on a discussion here:
> http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00080.html
> @@ -615,7 +616,7 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
> to store them in a symmetric way. */
> bytes = 1;
> if (dst)
> - *ptr = L'\xf080' | *pmbs;
> + *ptr = L'\xf000' | *pmbs;
> memset (&ps, 0, sizeof ps);
> }
Thanks for the patch, but that won't work. The problem is that ptr can
validly be a NULL pointer if sys_cp_mbstowcs is called only to compute
the length of the result. In that case the new (*ptr & 0xff00) test
dereferences NULL and crashes.

In a case like this, you have to check the input string instead, along
these lines:
  if ((bytes = f_mbtowc ()) < 0
      || (bytes == 3 && pmbs[0] == 0xef && (pmbs[1] & 0xfc) == 0x80))
  [...]
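
To make the idea concrete, here is a minimal, self-contained sketch of a conversion loop that inspects only the input bytes, so it stays safe when the destination is NULL. It is not the Cygwin code: toy_utf8_decode and toy_mbstowcs are hypothetical names, the decoder is a simplified stand-in for f_mbtowc (BMP only, no overlong or surrogate checks), and uint16_t stands in for wchar_t. The mask 0xfc is used because U+F000..U+F0FF encode in UTF-8 as 0xEF followed by a second byte in the range 0x80..0x83.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for f_mbtowc: decodes one UTF-8 sequence of up to
   three bytes.  Returns the number of bytes consumed, or -1 on an invalid
   or truncated sequence.  Writes to *wc only if wc is non-NULL. */
static int
toy_utf8_decode (uint16_t *wc, const unsigned char *s, size_t n)
{
  uint16_t c;
  int len;

  if (n == 0)
    return -1;
  if (s[0] < 0x80)
    {
      c = s[0];
      len = 1;
    }
  else if ((s[0] & 0xe0) == 0xc0 && n >= 2 && (s[1] & 0xc0) == 0x80)
    {
      c = (uint16_t) (((s[0] & 0x1f) << 6) | (s[1] & 0x3f));
      len = 2;
    }
  else if ((s[0] & 0xf0) == 0xe0 && n >= 3
           && (s[1] & 0xc0) == 0x80 && (s[2] & 0xc0) == 0x80)
    {
      c = (uint16_t) (((s[0] & 0x0f) << 12) | ((s[1] & 0x3f) << 6)
                      | (s[2] & 0x3f));
      len = 3;
    }
  else
    return -1;
  if (wc)
    *wc = c;
  return len;
}

/* Converts n bytes at src and returns the number of 16-bit units produced.
   dst may be NULL for a length-only call. */
static size_t
toy_mbstowcs (uint16_t *dst, const unsigned char *src, size_t n)
{
  size_t count = 0;

  while (n > 0)
    {
      uint16_t *ptr = dst ? dst + count : NULL;
      int bytes = toy_utf8_decode (ptr, src, n);

      /* Look at the input bytes, never at *ptr, to spot an encoding
         error or a UTF-8 encoded U+F0xx. */
      if (bytes < 0
          || (bytes == 3 && src[0] == 0xef && (src[1] & 0xfc) == 0x80))
        {
          /* Store the single offending byte as U+F0xx. */
          bytes = 1;
          if (ptr)
            *ptr = (uint16_t) (0xf000 | src[0]);
        }
      src += bytes;
      n -= (size_t) bytes;
      count++;
    }
  return count;
}
```

With this sketch, the input \xEF\x80\x80 yields U+F0EF U+F080 U+F080 whether or not a destination buffer is supplied, which matches the mapping proposed in the quoted mail.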
> Btw, is the '*pmbs >= 0x80' check necessary there? ASCII bytes should
> pass unharmed through all encodings (well, at the start of a mbchar
> anyway), and if they didn't, we'd probably still want to encode them
> as U+F0xx.
You're right. That's a check we can safely omit.
Thanks,
Corinna
--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat