bug in mbrtowc?

Andy Koppe andy.koppe@gmail.com
Tue Jul 28 09:47:00 GMT 2009


2009/7/28 Pedro Izecksohn:
>> #include <stdio.h>
>> #include <locale.h>
>> #include <stdlib.h>
>> #include <wchar.h>
>>
>> int main(void) {
>>  wchar_t wc;
>>  size_t ret;
>>  mbstate_t s = { 0 };
>>  puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>>  /* mbrtowc() returns size_t; cast so that -1 and -2 print as such */
>>  printf("%i\n", (int)mbrtowc(&wc, "\xe2", 1, 0));
>>  printf("%i\n", (int)mbrtowc(&wc, "\x94", 1, 0));
>>  printf("%i\n", (int)mbrtowc(&wc, "\x84", 1, 0));
>>  printf("%x\n", (unsigned)wc);
>>  return 0;
>> }
>>
>> The sequence E2 94 84 should translate to U+2504. Instead, the second
>> and third calls to mbrtowc report encoding errors. It does work
>> correctly if the three bytes are passed to mbrtowc() in one go:
>  From the "Linux Programmer’s Manual" (release 3.15 of the Linux man-pages):
> "If the n bytes starting at s do not contain a complete multibyte
> character,  mbrtowc()  returns  (size_t) -2."

Correct, and the first call to mbrtowc() does just that. The problem
is that the second call returns (size_t)-1, which signals an encoding
error, even though E2 94 is a valid but incomplete sequence. The second
call should also return (size_t)-2 and remember what it has seen so far
in its internal state. The third call should then return 1 and write
0x2504 to wc.
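
To make the expected behaviour concrete, here is a minimal sketch of a
byte-at-a-time decoder. The helper names set_utf8_locale() and
decode_bytewise() are hypothetical, invented for this illustration.
Unlike the test case above, which passes a null pointer so that
mbrtowc() uses its internal state, the sketch assumes an explicit
zero-initialised mbstate_t. The locale names it tries are assumptions
too and may not all exist on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
   Returns 1 if one of them could be selected, 0 otherwise. */
int set_utf8_locale(void)
{
    static const char *const names[] = {
        "en_GB.UTF-8", "C.UTF-8", "en_US.UTF-8"
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: feed n bytes to mbrtowc() one at a time,
   storing each return value in results[i]. Stops early on an
   encoding error. Returns the number of calls made. */
size_t decode_bytewise(const char *bytes, size_t n,
                       size_t *results, wchar_t *out)
{
    mbstate_t state;
    size_t i;

    assert(bytes != NULL && results != NULL && out != NULL);
    memset(&state, 0, sizeof state);    /* initial conversion state */

    for (i = 0; i < n; i++) {
        results[i] = mbrtowc(out, bytes + i, 1, &state);
        if (results[i] == (size_t)-1)   /* encoding error */
            return i + 1;
    }
    return n;
}
```

Called as decode_bytewise("\xe2\x94\x84", 3, results, &wc) after
set_utf8_locale() has succeeded, an implementation that behaves as
described above would store (size_t)-2, (size_t)-2 and 1 in results
and leave 0x2504 in wc.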

Andy
