This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Mon, 28 Sep 2009 11:07:04 +0200
Subject: Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
References: <416096c60909270322x32d94673h47ff7c28231cb09e@mail.gmail.com> <20090927110025.GC30851@calimero.vinschen.de> <416096c60909270414y52d93f6fncfe72852bb3331fe@mail.gmail.com> <20090927120414.GD30851@calimero.vinschen.de> <416096c60909270606h4cc2dd4ctbc1da5c5b1a310bb@mail.gmail.com> <20090927161455.GG30851@calimero.vinschen.de> <416096c60909271001v355a84d4kef2fd6d796a1fec1@mail.gmail.com> <20090927203031.GI30851@calimero.vinschen.de> <416096c60909272250m511ec51dt7413863f80428537@mail.gmail.com> <416096c60909272323k7dfa0dc9u58b498af07bb252b@mail.gmail.com>
Reply-to: cygwin-developers at cygwin dot com

On Sep 28 07:23, Andy Koppe wrote:
> 2009/9/28 Andy Koppe:
> > If the Unix filename contains the UTF-8 representation of U+F0xx, that
> > will now roundtrip to just the xx byte. U+F000 is particularly
> > problematic, as that roundtrips to a null byte.
> >
> > Solution: if f_mbtowc comes back with a U+F0xx, scratch that, and
> > instead turn each of the original bytes into a U+F0xx, i.e.:
> >
> > \xEF\x80\x80 -> U+F0EF U+F080 U+F080
> >
> > One for later?
> 
> Actually, I think there's a very simple way to implement this: just
> treat a U+F0xx result the same as an encoding error. For example:
> 
> --- strfuncs.cc.bak     2009-09-28 06:05:53.866000000 +0100
> +++ strfuncs.cc 2009-09-28 07:08:36.909000000 +0100
> @@ -602,9 +602,10 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
>                 *ptr = 0x18;
>             }
>         }
> -      else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
> -                                 charset, &ps)) < 0
> -              && *pmbs >= 0x80)
> +      else if (((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
> +                                 charset, &ps)) < 0
> +               && *pmbs >= 0x80)
> +              || (*ptr & 0xff00) == 0xf000)
>         {
>           /* The technique is based on a discussion here:
>              http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00080.html
> @@ -615,7 +616,7 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
>              to store them in a symmetric way. */
>           bytes = 1;
>           if (dst)
> -           *ptr = L'\xf080' | *pmbs;
> +           *ptr = L'\xf000' | *pmbs;
>           memset (&ps, 0, sizeof ps);
>         }

Thanks for the patch, but that won't work.  The problem is that ptr can
validly be a NULL pointer if sys_cp_mbstowcs is called only to check
for the length of the result.  With the above, you'll get crashes.
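
For illustration only, the same length-only calling convention exists in ISO C: mbstowcs accepts a NULL destination and then just reports how many wide characters the conversion would need. The sketch below uses that standard function, not sys_cp_mbstowcs (whose signature differs), and the helper name wide_len is made up for this example.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: number of wchar_t's the conversion would produce,
   not counting the terminating null. */
static size_t
wide_len (const char *src)
{
  /* dst == NULL: nothing is stored, only the length is computed.
     Any code on this path that writes through the destination pointer
     would dereference NULL. */
  return mbstowcs (NULL, src, 0);
}
```

A caller would typically use the result to size a buffer and then call the conversion a second time with a real destination.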

In a case like this, you have to check the input string, along these
lines:

  if ((bytes = f_mbtowc ()) < 0
      || (bytes == 3 && pmbs[0] == 0xef && (pmbs[1] & 0xfc) == 0x80))
    [...]
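
A minimal, self-contained sketch of such an input-side check, assuming UTF-8 input: U+F000 through U+F0FF encode as EF 80 80 through EF 83 BF, so the second byte is always in 0x80..0x83, which the mask 0xfc tests for. The function name is_utf8_f0xx is hypothetical and not part of strfuncs.cc. The point is that only the source bytes are examined, so it makes no difference whether the destination pointer is NULL.

```c
#include <stdbool.h>

/* Hypothetical helper: does the multibyte sequence at p, which the
   converter reported as `bytes' long, encode a character in the range
   U+F000..U+F0FF?  Looks only at the input, never at the output. */
static bool
is_utf8_f0xx (const unsigned char *p, int bytes)
{
  return bytes == 3
         && p[0] == 0xef             /* lead byte: top four bits are 1111 */
         && (p[1] & 0xfc) == 0x80;   /* second byte in 0x80..0x83 */
}
```

With this, "\xEF\x80\x80" (U+F000) and "\xEF\x83\xBF" (U+F0FF) are flagged, while "\xEF\x84\x80" (U+F100) and "\xE2\x82\xAC" (U+20AC) are not.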

> Btw, is the '*pmbs >= 0x80' check necessary there? ASCII bytes should
> pass unharmed through all encodings (well, at the start of a mbchar
> anyway), and if they didn't, we'd probably still want to encode them
> as U+F0xx.

You're right.  That's a check we can safely omit.


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

Follow-Ups:
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Andy Koppe
