This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
Hi,Please go ahead.
here's a fix for the UTF-16 surrogate pair handling in __utf8_mbtowc, as mentioned in http://sourceware.org/ml/newlib/2009/msg00778.html. The original code only worked in the context of application calls to mbs[nr]towcs. The new code below should also work in most cases where the application calls mbrtowc by itself.
What it does now is this. After three bytes of the 4-byte input sequence have been read, it checks if we're on a sizeof(wchar_t) == 2 system. If so, it extracts the first surrogate pair from these three bytes, since three bytes of UTF-8 input are sufficient to generate the first surrogate half. If the input is valid (I added a comment to explain what an invalid sequence is in this context), the return code is set to the actual number of input bytes, rather than the fixed value of 2, which is one reason this only worked when called from mbs[nr]towcs. To mark this as the first surrogate half, the state->__count field is set to 4.
This conditional expression isn't true if state->__count is 4, rather the code goes along as on sizeof(wchar_t) == 4 systems. Just at the end, if state->__count is 4, *pwc is computed as the second surrogate half, while on sizeof(wchar_t) == 4 systems the entire Unicode value is written to *pwc.
The downside of this implementation is that an application could be happy with the result after only having read the first three bytes of the four byte sequence from the input string and just stop. This results in an incomplete surrogate pair. However, as far as I can see it's rather unlikely, and it's still better that not handling Unicode values outside the base plane at all.
Tested on Cygwin.
Ok to apply?
Thanks, Corinna
* libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Rework UTF-16 surrogate pair handling to be more bullet-proof even with incomplete UTF-8 sequences.
Index: libc/stdlib/mbtowc_r.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v
retrieving revision 1.14
diff -u -p -r1.14 mbtowc_r.c
--- libc/stdlib/mbtowc_r.c 28 Jul 2009 16:49:19 -0000 1.14
+++ libc/stdlib/mbtowc_r.c 28 Jul 2009 16:52:48 -0000
@@ -205,18 +205,6 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
if (n == 0)
return -2;
- if (state->__count == 4)
- {
- /* Create the second half of the surrogate pair. For a description
- see the comment below. */
- wint_t tmp = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
- | (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
- | (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
- | (wchar_t)(state->__value.__wchb[3] & 0x3f);
- state->__count = 0;
- *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
- return 2;
- }
if (state->__count == 0)
ch = t[i++];
else
@@ -353,6 +341,36 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
state->__count = 3;
else if (n < (size_t)-1)
++n;
+ if (state->__count == 3 && sizeof(wchar_t) == 2)
+ {
+ /* On systems which have wchar_t being UTF-16 values, the value
+ doesn't fit into a single wchar_t in this case. So what we
+ do here is to store the state with a special value of __count
+ and return the first half of a surrogate pair. The first
+ three bytes of a UTF-8 sequence are enough to generate the
+ first half of a UTF-16 surrogate pair. As return value we
+ choose to return the number of bytes actually read up to
+ here.
+ The second half of the surrogate pair is returned in case we
+ recognize the special __count value of four, and the next
+ byte is actually a valid value. See below*/
+ tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18)
+ | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
+ | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6);
+ tmp = (tmp - 0x10000) >> 10;
+ /* Check if the sequence can fit into a surrogate pair at all.
+ If tmp is > 0x3ff at this point, the full Unicode value will
+ be > 0x10ffff. This is an invalid Unicode value and outside
+ of the defintion of UTF-16 surrogate pairs. */
+ if (tmp > 0x3ff)
+ {
+ r->_errno = EILSEQ;
+ return -1;
+ }
+ state->__count = 4;
+ *pwc = 0xd800 | tmp;
+ return i;
+ }
if (n < 4)
return -2;
ch = t[i++];
@@ -365,21 +383,14 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
| (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
| (wint_t)((state->__value.__wchb[2] & 0x3f) << 6)
| (wint_t)(ch & 0x3f);
- if (tmp > 0xffff && sizeof(wchar_t) == 2)
+ if (state->__count == 4 && sizeof(wchar_t) == 2)
{
- /* On systems which have wchar_t being UTF-16 values, the value
- doesn't fit into a single wchar_t in this case. So what we
- do here is to store the state with a special value of __count
- and return the first half of a surrogate pair. As return
- value we choose to return the half of the actual UTF-8 char.
- The second half is returned in case we recognize the special
- __count value above. */
- state->__value.__wchb[3] = ch;
- state->__count = 4;
- *pwc = 0xd800 | (((tmp - 0x10000) >> 10) & 0x3ff);
- return 2;
+ /* Create the second half of the surrogate pair for systems with
+ wchar_t == UTF-16 . */
+ *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
}
- *pwc = tmp;
+ else
+ *pwc = tmp;
state->__count = 0;
return i;
}
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |