This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc

From: Corinna Vinschen <vinschen at redhat dot com>
To: newlib at sourceware dot org
Date: Wed, 29 Jul 2009 10:35:12 +0200
Subject: Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
References: <20090728165730.GV18621@calimero.vinschen.de> <4A6F49DF.4050907@redhat.com> <20090728195852.GZ18621@calimero.vinschen.de> <4A6F6E6D.4050003@redhat.com>
Reply-to: newlib at sourceware dot org

On Jul 28 17:32, Jeff Johnston wrote:
> Corinna Vinschen wrote:
>> The question is, shouldn't the code be changed to disallow values beyond
>> 0x10ffff on all systems, rather than just checking it in the UTF-16
>> case?
>>
> If the code allows those invalid sequences to generate and doesn't catch
> them at an earlier stage, then it should be fixed, so go ahead, assuming
> you have tested the patch.

Yes, I tested it.  The highest valid UTF-8 sequence is \xf4\x8f\xbf\xbf
which represents U+10ffff.  So the changed tests only allow leading
bytes <= \xf4 and no sequence >= \xf4\x90.  I also added a clarifying
comment for the test for invalid UTF-8 3-byte sequences representing
UTF-16 surrogate values.  These are valid in CESU-8, but not in UTF-8.
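
For illustration only, the range checks described above can be sketched
roughly as follows.  This is not the code from the patch;
utf8_valid_len is a hypothetical helper name, and the real logic in
__utf8_mbtowc is stateful and structured differently.

```c
#include <stddef.h>

/* Hypothetical helper, not the newlib implementation.  Returns the
   length in bytes of the valid UTF-8 sequence at the start of s (n bytes
   available), or 0 if the sequence is invalid or truncated.  */
static size_t
utf8_valid_len (const unsigned char *s, size_t n)
{
  size_t len, i;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;			/* plain ASCII */
  /* Reject stray continuation bytes, the overlong 2-byte lead bytes
     \xc0 and \xc1, and any lead byte > \xf4.  */
  if (s[0] < 0xc2 || s[0] > 0xf4)
    return 0;
  len = s[0] < 0xe0 ? 2 : s[0] < 0xf0 ? 3 : 4;
  if (n < len)
    return 0;
  for (i = 1; i < len; i++)
    if ((s[i] & 0xc0) != 0x80)
      return 0;			/* not a continuation byte */
  if (s[0] == 0xe0 && s[1] < 0xa0)
    return 0;			/* overlong 3-byte encoding */
  if (s[0] == 0xed && s[1] >= 0xa0)
    return 0;			/* UTF-16 surrogates: CESU-8, not UTF-8 */
  if (s[0] == 0xf0 && s[1] < 0x90)
    return 0;			/* overlong 4-byte encoding */
  if (s[0] == 0xf4 && s[1] >= 0x90)
    return 0;			/* beyond U+10ffff */
  return len;
}
```

With this sketch, \xf4\x8f\xbf\xbf is accepted as a 4-byte sequence,
while \xf4\x90\x80\x80 and \xed\xa0\x80 are both rejected.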

Patch applied.

Thanks,
Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
