This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc

From: Jeff Johnston <jjohnstn at redhat dot com>
To: newlib at sourceware dot org
Date: Tue, 28 Jul 2009 14:56:31 -0400
Subject: Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
References: <20090728165730.GV18621@calimero.vinschen.de>

Corinna Vinschen wrote:

Hi,


here's a fix for the UTF-16 surrogate pair handling in __utf8_mbtowc,
as mentioned in http://sourceware.org/ml/newlib/2009/msg00778.html.
The original code only worked in the context of application calls to
mbs[nr]towcs.  The new code below should also work in most cases where
the application calls mbrtowc by itself.

What it does now is this.  After three bytes of the 4-byte input
sequence have been read, it checks if we're on a sizeof(wchar_t) == 2
system.  If so, it extracts the first surrogate pair from these three
bytes, since three bytes of UTF-8 input are sufficient to generate the
first surrogate half.  If the input is valid (I added a comment to
explain what an invalid sequence is in this context), the return code is
set to the actual number of input bytes, rather than the fixed value of
2, which is one reason this only worked when called from mbs[nr]towcs.
To mark this as the first surrogate half, the state->__count field is
set to 4.

This conditional expression isn't true if state->__count is 4, rather
the code goes along as on sizeof(wchar_t) == 4 systems.  Just at the
end, if state->__count is 4, *pwc is computed as the second surrogate
half, while on sizeof(wchar_t) == 4 systems the entire Unicode value is
written to *pwc.

The downside of this implementation is that an application could be
happy with the result after only having read the first three bytes
of the four byte sequence from the input string and just stop.  This
results in an incomplete surrogate pair.  However, as far as I can see
it's rather unlikely, and it's still better that not handling Unicode
values outside the base plane at all.

Tested on Cygwin.

Ok to apply?

Please go ahead.

-- Jeff J.

Thanks,
Corinna
	* libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Rework UTF-16 surrogate
	pair handling to be more bullet-proof even with incomplete UTF-8
	sequences.
Index: libc/stdlib/mbtowc_r.c =================================================================== RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v retrieving revision 1.14 diff -u -p -r1.14 mbtowc_r.c --- libc/stdlib/mbtowc_r.c 28 Jul 2009 16:49:19 -0000 1.14 +++ libc/stdlib/mbtowc_r.c 28 Jul 2009 16:52:48 -0000 @@ -205,18 +205,6 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch if (n == 0) return -2; - if (state->__count == 4) - { - /* Create the second half of the surrogate pair. For a description - see the comment below. */ - wint_t tmp = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18) - | (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12) - | (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6) - | (wchar_t)(state->__value.__wchb[3] & 0x3f); - state->__count = 0; - *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff); - return 2; - } if (state->__count == 0) ch = t[i++]; else @@ -353,6 +341,36 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch state->__count = 3; else if (n < (size_t)-1) ++n; + if (state->__count == 3 && sizeof(wchar_t) == 2) + { + /* On systems which have wchar_t being UTF-16 values, the value + doesn't fit into a single wchar_t in this case. So what we + do here is to store the state with a special value of __count + and return the first half of a surrogate pair. The first + three bytes of a UTF-8 sequence are enough to generate the + first half of a UTF-16 surrogate pair. As return value we + choose to return the number of bytes actually read up to + here. + The second half of the surrogate pair is returned in case we + recognize the special __count value of four, and the next + byte is actually a valid value. See below*/ + tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18) + | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12) + | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6); + tmp = (tmp - 0x10000) >> 10; + /* Check if the sequence can fit into a surrogate pair at all. + If tmp is > 0x3ff at this point, the full Unicode value will + be > 0x10ffff. This is an invalid Unicode value and outside + of the defintion of UTF-16 surrogate pairs. */ + if (tmp > 0x3ff) + { + r->_errno = EILSEQ; + return -1; + } + state->__count = 4; + *pwc = 0xd800 | tmp; + return i; + } if (n < 4) return -2; ch = t[i++]; @@ -365,21 +383,14 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12) | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6) | (wint_t)(ch & 0x3f); - if (tmp > 0xffff && sizeof(wchar_t) == 2) + if (state->__count == 4 && sizeof(wchar_t) == 2) { - /* On systems which have wchar_t being UTF-16 values, the value - doesn't fit into a single wchar_t in this case. So what we - do here is to store the state with a special value of __count - and return the first half of a surrogate pair. As return - value we choose to return the half of the actual UTF-8 char. - The second half is returned in case we recognize the special - __count value above. */ - state->__value.__wchb[3] = ch; - state->__count = 4; - *pwc = 0xd800 | (((tmp - 0x10000) >> 10) & 0x3ff); - return 2; + /* Create the second half of the surrogate pair for systems with + wchar_t == UTF-16 . */ + *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff); } - *pwc = tmp; + else + *pwc = tmp; state->__count = 0; return i; }

Follow-Ups:
- Re: [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
  - From: Corinna Vinschen

References:
- [PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
  - From: Corinna Vinschen

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]