This is the mail archive of the
newlib@sourceware.org
mailing list for the newlib project.
Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
- From: Andy Koppe <andy dot koppe at gmail dot com>
- To: newlib at sourceware dot org
- Date: Tue, 28 Jul 2009 19:34:52 +0100
- Subject: Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
- References: <20090728165730.GV18621@calimero.vinschen.de>
2009/7/28 Corinna Vinschen:
> here's a fix for the UTF-16 surrogate pair handling in __utf8_mbtowc,
> as mentioned in http://sourceware.org/ml/newlib/2009/msg00778.html.
> The original code only worked in the context of application calls to
> mbs[nr]towcs. The new code below should also work in most cases where
> the application calls mbrtowc by itself.
Thank you very much for implementing that so quickly.
> The downside of this implementation is that an application could be
> happy with the result after only having read the first three bytes
> of the four byte sequence from the input string and just stop. This
> results in an incomplete surrogate pair. However, as far as I can see
> it's rather unlikely, and it's still better than not handling Unicode
> values outside the base plane at all.
I think that's perfectly correct behaviour. There's nothing more that
can be done given the constraint of a 16-bit wchar_t type. That just
can't be hidden here, so applications have to be adapted where
necessary.
> + *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
Nitpicking: The '- 0x10000' isn't necessary here; '(tmp & 0x3ff)' should do.
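
The reason is that 0x10000 has all of its low ten bits clear, so subtracting it cannot change the bits that the 0x3ff mask keeps. A minimal sketch to check this, assuming tmp holds the decoded code point in the range 0x10000..0x10ffff; the function names are made up for illustration and are not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* High (leading) surrogate for a code point in 0x10000..0x10ffff. */
static uint16_t high_surrogate(uint32_t cp)
{
    return (uint16_t)(0xd800 | ((cp - 0x10000) >> 10));
}

/* Low surrogate, written like the patch line quoted above. */
static uint16_t low_surrogate_with_sub(uint32_t cp)
{
    return (uint16_t)(0xdc00 | ((cp - 0x10000) & 0x3ff));
}

/* Low surrogate with the subtraction dropped. */
static uint16_t low_surrogate_masked(uint32_t cp)
{
    return (uint16_t)(0xdc00 | (cp & 0x3ff));
}

/* Asserts that both forms agree for every supplementary code point. */
static void check_low_surrogate_forms(void)
{
    uint32_t cp;

    for (cp = 0x10000; cp <= 0x10ffff; cp++)
        assert(low_surrogate_with_sub(cp) == low_surrogate_masked(cp));
}
```

Calling check_low_surrogate_forms() from a test program walks the whole supplementary range. The subtraction is still needed for the high surrogate, since that one uses the bits above the low ten.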
Andy