This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug libc/1383] New: iconv doesn't detect invalid UTF-8 input as early as possible


I attach a small program that takes a string of N bytes (N is the first
command-line argument), in which the first byte is 0xE1 and the rest are
ASCII letters, and tries to convert it from UTF-8 to the charset given in the
second argument. Obviously the conversion fails, since the string is not valid
UTF-8; the question is what errno is set to.
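
A minimal sketch of such a test follows. It is not the attached program
itself: the helper name probe_errno, the buffer sizes and the choice of 'a'
as the filler letter are assumptions made for illustration. It only uses the
standard iconv_open(), iconv() and iconv_close() calls.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: build n bytes (0xE1 followed by ASCII 'a's) and try
   to convert them from UTF-8 to the charset `to`.
   Returns the errno value left by iconv(), 0 if the conversion succeeded,
   or -1 if iconv_open() itself failed. */
int probe_errno(size_t n, const char *to)
{
    char in[16];
    char out[64];

    assert(n >= 1 && n <= sizeof in);

    in[0] = (char)0xE1;           /* lead byte of a 3-byte UTF-8 sequence */
    memset(in + 1, 'a', n - 1);   /* ASCII letters, never continuation bytes */

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = in;
    char *outp = out;
    size_t inleft = n;
    size_t outleft = sizeof out;

    errno = 0;
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    int err = (r == (size_t)-1) ? errno : 0;   /* save before iconv_close() */

    iconv_close(cd);
    return err;
}
```

Calling probe_errno(i, "UTF-8") and probe_errno(i, "UCS-4") for i from 1 to 4
and printing the result with strerror() should give output of the same shape
as the listings in this report.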

If I try to convert it to UTF-8, errno is always set appropriately, as
described in libc's info page:

$ for i in `seq 1 4`; do ./iconvbug $i UTF-8; done
retval: -1  errno: 22 Invalid argument
retval: -1  errno: 84 Invalid or incomplete multibyte or wide character
retval: -1  errno: 84 Invalid or incomplete multibyte or wide character
retval: -1  errno: 84 Invalid or incomplete multibyte or wide character

This output is correct. (If I change the first byte to a UTF-8 continuation
byte, then I always get errno 84, which is okay, too.)

However, if the destination character set is UCS-4, the output is different:

$ for i in `seq 1 4`; do ./iconvbug $i UCS-4; done
retval: -1  errno: 22 Invalid argument
retval: -1  errno: 22 Invalid argument
retval: -1  errno: 22 Invalid argument
retval: -1  errno: 84 Invalid or incomplete multibyte or wide character

(The same output is printed if the first byte of the string is a continuation
byte, so it doesn't detect that case correctly either.)

The expected output would be the same as in the first test case, where the
output charset was UTF-8.

So when a byte >= 128 followed by one or two bytes < 128 is given to iconv
for conversion to UCS-4, it does not detect that the input is already
illegal. Instead it sets errno 22 (EINVAL), which tells the application that
the input may still be valid and that more bytes are needed to decide.
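
The distinction can be sketched as follows. This is an illustration of the
expected classification, not glibc's code: the names utf8_status and
utf8_classify are hypothetical, and for brevity the sketch does not check
the restricted second-byte ranges (overlong forms after 0xE0 and 0xF0,
surrogates after 0xED, values above U+10FFFF after 0xF4).

```c
#include <assert.h>
#include <stddef.h>

enum utf8_status { UTF8_COMPLETE, UTF8_INCOMPLETE, UTF8_INVALID };

/* Hypothetical helper: classify the first character in buf[0..len).
   UTF8_INCOMPLETE corresponds to EINVAL (more bytes might make it valid),
   UTF8_INVALID corresponds to EILSEQ (no further input can fix it). */
enum utf8_status utf8_classify(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len == 0)
        return UTF8_INCOMPLETE;

    unsigned char lead = buf[0];
    size_t need;                      /* total length of the sequence */

    if (lead < 0x80)
        return UTF8_COMPLETE;         /* plain ASCII */
    else if (lead >= 0xC2 && lead <= 0xDF)
        need = 2;
    else if (lead >= 0xE0 && lead <= 0xEF)
        need = 3;
    else if (lead >= 0xF0 && lead <= 0xF4)
        need = 4;
    else
        return UTF8_INVALID;          /* continuation byte or illegal lead */

    /* Inspect every byte already available: a non-continuation byte makes
       the sequence invalid at once, without waiting for more input. */
    for (size_t i = 1; i < need && i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            return UTF8_INVALID;

    return len >= need ? UTF8_COMPLETE : UTF8_INCOMPLETE;
}
```

With this classification, the byte 0xE1 alone is incomplete, while 0xE1
followed by one or more ASCII letters is invalid as soon as the first letter
arrives.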

(By the way, it seems quite strange to me that parsing the input depends on
what the output is. I see no reason for it in the current situation, since
the two output charsets are fully compatible.)

This might cause buggy behavior in applications that handle and display data
arriving in real time, such as terminal emulators, character-oriented talk
clients, etc. For me it is important that if the sending party sends a
Latin-1 accented letter followed by a non-accented one and then stops typing,
while my software expects UTF-8, I see a replacement character plus the
non-accented letter appear. Currently, if my application converts to UCS-4
using iconv, it does not yet display these received bytes, although it
could.

Currently I'm trying to fix a closely related bug in gnome-terminal:
http://bugzilla.gnome.org/show_bug.cgi?id=317236
I'm not sure, but it seems to me that half of that bug is caused by this
glibc bug and the other half is a gnome-terminal (vte) bug. But I cannot
proceed until glibc is fixed, unless I drop the conversion code that uses
iconv and write my own UTF-8 -> UCS-4 implementation.

-- 
           Summary: iconv doesn't detect invalid UTF-8 input as early as
                    possible
           Product: glibc
           Version: 2.3.5
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
        AssignedTo: gotom at debian dot or dot jp
        ReportedBy: egmont at uhulinux dot hu
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=1383


