This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.
Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
On Jul 28 17:32, Jeff Johnston wrote:
> Corinna Vinschen wrote:
>> The question is, shouldn't the code be changed to disallow values beyond
>> 0x10ffff on all systems, rather than just checking it in the UTF-16
>> case?
>>
>>
> If the code allows those invalid sequences to be generated and doesn't
> catch them at an earlier stage, then it should be fixed, so go ahead,
> assuming you have tested the patch.
Yes, I tested it. The highest valid UTF-8 sequence is \xf4\x8f\xbf\xbf,
which represents U+10ffff. So the changed tests only allow leading
bytes <= \xf4 and reject any sequence >= \xf4\x90. I also added a
clarifying comment to the test for invalid 3-byte UTF-8 sequences that
represent UTF-16 surrogate values. These are valid in CESU-8, but not
in UTF-8.
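
For illustration only, the checks described above can be sketched as a
standalone function. This is not the code from the patch and not how
__utf8_mbtowc is structured; the name utf8_seq_len and its interface are
hypothetical, chosen just to show the byte-range tests in one place.

```c
#include <stddef.h>

/* Hypothetical helper, NOT newlib's __utf8_mbtowc.  Returns the length
   (1..4) of the well-formed UTF-8 sequence starting at s, or 0 if the
   first n bytes do not begin with a well-formed sequence.  */
static size_t
utf8_seq_len (const unsigned char *s, size_t n)
{
  if (n == 0)
    return 0;

  unsigned char b0 = s[0];

  if (b0 < 0x80)
    return 1;                   /* plain ASCII */
  if (b0 < 0xc2)
    return 0;                   /* stray continuation byte or overlong lead */
  if (b0 > 0xf4)
    return 0;                   /* lead byte can only encode > U+10FFFF */

  size_t len = b0 < 0xe0 ? 2 : b0 < 0xf0 ? 3 : 4;
  if (n < len)
    return 0;                   /* truncated input */

  /* Every trailing byte must look like 10xxxxxx.  */
  for (size_t i = 1; i < len; i++)
    if ((s[i] & 0xc0) != 0x80)
      return 0;

  if (b0 == 0xe0 && s[1] < 0xa0)
    return 0;                   /* overlong 3-byte form */
  if (b0 == 0xed && s[1] >= 0xa0)
    return 0;                   /* U+D800..U+DFFF: surrogates, CESU-8 only */
  if (b0 == 0xf0 && s[1] < 0x90)
    return 0;                   /* overlong 4-byte form */
  if (b0 == 0xf4 && s[1] >= 0x90)
    return 0;                   /* \xf4\x90 and up: beyond U+10FFFF */

  return len;
}
```

With this sketch, \xf4\x8f\xbf\xbf is accepted as a 4-byte sequence,
while \xf4\x90\x80\x80, any sequence with a leading byte of \xf5 or
above, and the surrogate encoding \xed\xa0\x80 are all rejected.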
Patch applied.
Thanks,
Corinna
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat