This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug libc/1383] New: iconv doesn't detect invalid UTF-8 input as early as possible
- From: "egmont at uhulinux dot hu" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: 27 Sep 2005 18:49:32 -0000
- Subject: [Bug libc/1383] New: iconv doesn't detect invalid UTF-8 input as early as possible
- Reply-to: sourceware-bugzilla at sourceware dot org
I attach a small program that takes a string of N bytes (N is the first
command line argument), where the first byte is 0xE1 and the others are
ASCII letters, and tries to convert it from UTF-8 to the charset given in the
second argument. Obviously the conversion fails, since the string is not valid
UTF-8; the question is what errno gets set to.
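The attachment itself is not reproduced in this message. As a rough sketch of
what such a test could look like (this is not the attached program; the name
probe_errno and the fixed buffer sizes are assumptions made for illustration),
the core of it might be:

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the attached program: builds an n-byte input
   whose first byte is 0xE1 and whose remaining bytes are 'a', feeds it to
   iconv() for a UTF-8 -> to_charset conversion, and returns the errno that
   iconv() left behind. Returns 0 if the conversion unexpectedly succeeded
   and -1 if the arguments or the conversion descriptor were unusable. */
int probe_errno(size_t n, const char *to_charset)
{
    char in[16];
    char out[64];   /* room for 16 characters at 4 bytes each */

    if (n < 1 || n > sizeof in)
        return -1;

    memset(in, 'a', n);
    in[0] = (char)0xE1;   /* lead byte of a 3-byte UTF-8 sequence */

    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = in;
    char *outp = out;
    size_t inleft = n;
    size_t outleft = sizeof out;

    errno = 0;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    int err = (rc == (size_t)-1) ? errno : 0;   /* save before iconv_close */

    iconv_close(cd);
    return err;
}
```

A main() that parses N and the charset from argv and prints the result with
strerror() would turn this into a command-line test like the one used below.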
If I convert it to UTF-8, errno is always set appropriately, as
described in libc's info page:
$ for i in `seq 1 4`; do ./iconvbug $i UTF-8; done
retval: -1 errno: 22 Invalid argument
retval: -1 errno: 84 Invalid or incomplete multibyte or wide character
retval: -1 errno: 84 Invalid or incomplete multibyte or wide character
retval: -1 errno: 84 Invalid or incomplete multibyte or wide character
This output is perfect. (If I change the first byte to be a UTF-8 continuation
byte, then I always get errno 84, which is okay, too.)
However, if the destination character set is UCS-4, the output is different:
$ for i in `seq 1 4`; do ./iconvbug $i UCS-4; done
retval: -1 errno: 22 Invalid argument
retval: -1 errno: 22 Invalid argument
retval: -1 errno: 22 Invalid argument
retval: -1 errno: 84 Invalid or incomplete multibyte or wide character
(The same output is printed if the first byte of the string is a continuation
byte, so that case is not detected correctly either.)
The expected output is the same as in the first test case, where the
output charset was UTF-8.
So when a byte >= 128 followed by one or two bytes < 128 is given to
iconv for conversion to UCS-4, it does not detect that the input is already
invalid. Instead it sets errno 22, which tells the application that the
input may still be valid and that more bytes are needed to decide.
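To make the distinction concrete, here is a minimal sketch of a check that
reports an invalid sequence as soon as the bytes seen so far rule out valid
UTF-8, and reports "incomplete" only when more bytes could still complete the
character. This is not glibc's code; the names utf8_status and
utf8_check_first are hypothetical, and "incomplete" and "invalid" correspond to
errno 22 and errno 84 in the output above.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum utf8_status { UTF8_COMPLETE, UTF8_INCOMPLETE, UTF8_INVALID };

/* Classifies the first character of the n-byte buffer s. */
enum utf8_status utf8_check_first(const unsigned char *s, size_t n)
{
    if (n == 0)
        return UTF8_INCOMPLETE;

    unsigned char b = s[0];
    size_t need;                         /* total length of the sequence */
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the next byte */

    if (b < 0x80) {
        return UTF8_COMPLETE;            /* plain ASCII */
    } else if (b >= 0xC2 && b <= 0xDF) {
        need = 2;
    } else if (b >= 0xE0 && b <= 0xEF) {
        need = 3;
        if (b == 0xE0) lo = 0xA0;        /* reject overlong forms */
        if (b == 0xED) hi = 0x9F;        /* reject surrogates */
    } else if (b >= 0xF0 && b <= 0xF4) {
        need = 4;
        if (b == 0xF0) lo = 0x90;        /* reject overlong forms */
        if (b == 0xF4) hi = 0x8F;        /* reject values above U+10FFFF */
    } else {
        /* Stray continuation byte, or a byte never used as a lead byte. */
        return UTF8_INVALID;
    }

    /* Check every trailing byte that has already arrived. A bad byte makes
       the input invalid right away, even if the sequence is still short. */
    for (size_t i = 1; i < need && i < n; i++) {
        if (s[i] < lo || s[i] > hi)
            return UTF8_INVALID;
        lo = 0x80;
        hi = 0xBF;
    }

    return n >= need ? UTF8_COMPLETE : UTF8_INCOMPLETE;
}
```

With this check, a lone 0xE1 is incomplete, while 0xE1 followed by an ASCII
letter is already invalid, which is the behavior seen in the first test case.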
(By the way, it seems quite strange to me that parsing the input depends on
what the output is. I see no reason for it in this case, since the two output
charsets are fully compatible.)
This can cause buggy behavior in applications that handle and display data
arriving in real time, such as terminal emulators, character-oriented talk
clients, etc. For me it is important that if the sending party sends a
Latin-1 accented letter followed by a non-accented one and then stops typing,
while my software expects UTF-8, I see a replacement character plus the
non-accented letter appear. Currently, if my application performs the conversion
to UCS-4 using iconv, it does not yet display these received bytes, although it
could.
Currently I'm trying to fix a closely related bug in gnome-terminal:
http://bugzilla.gnome.org/show_bug.cgi?id=317236
I'm not sure, but it seems to me that half of that bug is caused by this
glibc bug and the other half is a gnome-terminal (vte) bug.
But I cannot proceed until glibc is fixed, unless I drop the conversion
code that uses iconv and write my own UTF-8 -> UCS-4 implementation.
--
Summary: iconv doesn't detect invalid UTF-8 input as early as
possible
Product: glibc
Version: 2.3.5
Status: NEW
Severity: normal
Priority: P2
Component: libc
AssignedTo: gotom at debian dot or dot jp
ReportedBy: egmont at uhulinux dot hu
CC: glibc-bugs at sources dot redhat dot com
http://sourceware.org/bugzilla/show_bug.cgi?id=1383