This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



possible bug in iconv/glibc?


glibc developers,

I'm reposting a message I already sent to libc-help and got no replies
to.  Sorry if you've gotten two of these messages.  The only update
from before is that CP932 works slightly better than SJIS, although it
is by no means a working solution.  Since Firefox can render the
original sjis/cp932 input perfectly I'm assuming that some extensions
to sjis are simply missing from the charset supplied to iconv.  I'd
appreciate any suggestions.

Original post is here:
http://www.sourceware.org/ml/libc-help/2009-07/msg00016.html

Can someone provide me with a correct "from" charset that I can use to
convert this input (see attachment from original post)?

Thanks,

Mike

----------------------

I am working on a library whose intended purpose is to convert a
dictionary from a proprietary format into an open one.  The end result
should be that a dictionary from www.babylon.com is usable in e.g.
StarDict.  I'm using an English -> Japanese dictionary as a test case
at the moment and I've run into a rather odd problem.  Certain
multibyte characters in my input (which is ShiftJIS-encoded) cause
iconv to return EILSEQ, but these characters appear to be valid
characters in the ShiftJIS encoding.  Out of over 155,000 entries in
my input only 6 exhibit this behavior and they all display correctly
when viewed in Firefox.  I've attached a text file with the cases that
fail; if you open it in Firefox it shows the original ShiftJIS input
as well as the place where conversion failed.
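For readers who want to reproduce "the place where conversion failed", here is a minimal sketch (not the code from my library) of how the offset of the first rejected byte sequence can be found with iconv.  The function name first_bad_offset and its return-value convention are made up for illustration; only iconv_open, iconv and iconv_close are real API.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns the byte offset of the first input
   sequence that iconv rejects with EILSEQ, -1 if the whole buffer
   converts to UTF-8, or -2 on any other error (unknown charset,
   truncated trailing sequence, ...). */
long first_bad_offset(const char *from, const char *in, size_t len)
{
    assert(from != NULL && in != NULL);

    /* Destination charset comes first, source charset second. */
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return -2;

    char *src = (char *)in;   /* iconv wants char **, input is not modified */
    size_t left = len;
    long result = -1;

    while (left > 0) {
        char out[256];        /* scratch output, discarded after each pass */
        char *dst = out;
        size_t room = sizeof out;

        if (iconv(cd, &src, &left, &dst, &room) == (size_t)-1) {
            if (errno == E2BIG)
                continue;     /* output buffer full: retry with a fresh one */
            /* On EILSEQ, src points at the start of the rejected sequence. */
            result = (errno == EILSEQ) ? (long)(src - in) : -2;
            break;
        }
    }

    iconv_close(cd);
    return result;
}
```

Calling first_bad_offset("SJIS", entry, entry_len) on one of the failing entries would report where the rejected bytes start, so they can be dumped in hex and compared against the charset tables.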

To quote from the libiconv website here:
  http://www.gnu.org/software/libiconv/
..."To solve this mess, the Unicode encoding has been created. It is a
super-encoding of all others and is therefore the default encoding for
new text formats like XML."

My conversion descriptors all have destination charset "utf8", so the
iconv_open call looks like this:
  cd = iconv_open("utf8", "sjis");
which, when combined with the statement from the libiconv website,
makes me think that any character from ShiftJIS would be encodable in
UTF8 but not the other way around.  I found a post to a mailing list from 1999
which is interesting:

http://mail.nl.linux.org/linux-utf8/1999-11/msg00201.html

I'm using Gentoo Linux with sys-libs/glibc-2.9_p20081201-r2 installed.
Can anyone shed light on why the conversion to utf8 is failing?  My
gut feeling is that the input data is possibly nonstandard, or maybe
it is some subtle variant of ShiftJIS.  If this is the case, is it
possible to patch glibc to support the conversion?
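If the data really is a variant of ShiftJIS, one way to narrow it down is to try several source charset names and see which one converts an entry without error.  The sketch below assumes that approach; pick_charset and converts_cleanly are hypothetical names, and the candidate list is whatever names "iconv -l" reports on the system (for example the "SJIS" and "CP932" mentioned above).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if all of in[0..len) converts from
   charset `from` to UTF-8 without error, 0 otherwise. */
static int converts_cleanly(const char *from, const char *in, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return 0;             /* charset name unknown to this iconv */

    char *src = (char *)in;   /* iconv wants char **, input is not modified */
    size_t left = len;
    int ok = 1;

    while (left > 0) {
        char out[256];        /* scratch output, discarded after each pass */
        char *dst = out;
        size_t room = sizeof out;

        /* E2BIG only means the scratch buffer filled up; anything else
           (EILSEQ, EINVAL) counts as a failed conversion. */
        if (iconv(cd, &src, &left, &dst, &room) == (size_t)-1
            && errno != E2BIG) {
            ok = 0;
            break;
        }
    }

    iconv_close(cd);
    return ok;
}

/* Hypothetical helper: index of the first candidate charset that
   converts the whole buffer, or -1 if none does. */
int pick_charset(const char *const *candidates, size_t n,
                 const char *in, size_t len)
{
    assert(candidates != NULL && in != NULL);

    for (size_t i = 0; i < n; i++)
        if (converts_cleanly(candidates[i], in, len))
            return (int)i;
    return -1;
}
```

A clean conversion only shows that the bytes are valid in that charset, not that the resulting characters are the intended ones, so the output still needs to be checked by eye against what Firefox renders.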

Thanks,
Mike

