This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: New GB18030 gconv module contributed by ThizLinux Laboratory
- From: Anthony Fok <anthony at thizlinux dot com>
- To: Ulrich Drepper <drepper at redhat dot com>
- Cc: libc-alpha at sources dot redhat dot com, Kevin Lau <kevin at thizlinux dot com>, Fai <fai at thizlinux dot com>, Sunny Gu <sunnygu at thizgroup dot com>, James Su <suzhe at gnuchina dot org>, yshao at redhat dot com
- Date: Thu, 17 Jan 2002 18:02:03 +0800
- Subject: Re: New GB18030 gconv module contributed by ThizLinux Laboratory
- References: <20020116074546.GA17279@sunrise> <m36661h44z.fsf@myware.mynet>
On Wed, Jan 16, 2002 at 10:40:12PM -0800, Ulrich Drepper wrote:
> Anthony Fok <anthony@thizlinux.com> writes:
> > While testing GB18030 support with some sample GB18030 text files
> > supplied by the Chinese IT Standardization Technical Committee, we at
> > ThizLinux Laboratory discovered that the GB18030 gconv module was
> > unable to handle certain ranges, most notably in the User-Defined Area,
> > but it seems /usr/bin/iconv discovered problems in other ranges too.
>
> This is quite a coincidence. I've been working with Yu Shao on
> exactly this for some time now and he sent me the final versions on
> Monday. I checked them in today.
Wow, quite a coincidence indeed. :-) As you know, we decided to write our
own qgb18030codec.cpp for Qt independent of glibc code because KDE needed
it, which initially took 2 weeks, and there was one silly typo that kept it
from working until I finally got it right on November 26. :-) I have been
thinking of porting it to glibc some day, but since it was not urgent then,
and I thought glibc's GB18030 was complete, I was not in a hurry, until our
colleague told us AbiWord fails the User1.txt test. :-)
> Whether your code is better or not I cannot say since I haven't looked
> at it. We've spent quite some time on the code which is now in the
> CVS archive (in the trunk, not the 2.2.5 branch) so that I know that
> code is fine and apparently passes the official tests. Investigating
> your code would just mean doing it all over again for no gain. This
> does not mean I don't appreciate the effort, it's just that the two
> efforts overlapped and I didn't know about your work.
Hehe, I guess everyone is working towards the same goal of GB18030 support.
:-)
> You might want to test the code which is in the CVS archive and if
> there are any problems we'll fix them.
Not sure if it is a problem with /usr/bin/iconv or GB18030.so:
when I tried your module, both old and new, on the Chinese sample
documents:
$ iconv -f gb18030 -t ucs2 four.txt
iconv: illegal input sequence at position 32
$ iconv -f gb18030 -t ucs2 wei.txt
iconv: illegal input sequence at position 0
$ iconv -f gb18030 -t ucs2 zang.txt
iconv: illegal input sequence at position 0
$ iconv -f gb18030 -t ucs2 wei.txt
iconv: illegal input sequence at position 0
$ iconv -f gb18030 -t ucs2 yi.txt
iconv: illegal input sequence at position 0
If the first line is trimmed, the illegal sequence appears at 27420 for
four.txt, etc. It appears to me that your tables only cover the bare
minimum required by the Chinese Standards Committee, but this is not
quite right. GB18030 is supposed to be like UTF-8: it is an encoding
that covers the entire repertoire of ISO-10646-1 while remaining
compatible with GB2312 and GBK. It should be able to convert to and from
all Unicode codepoints, i.e. U+0000..U+D7FF, U+E000..U+FFFF,
U+10000..U+10FFFF.
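
To make the ranges above concrete, here is a minimal sketch in Python. It
assumes Python's built-in "gb18030" codec as a stand-in for a full-coverage
converter (it is not the glibc module), and the names SCALAR_RANGES,
scalar_count and gb18030_length are mine, purely for illustration.

```python
# Code point ranges a complete GB18030 converter should handle
# (everything except the surrogate area U+D800..U+DFFF).
SCALAR_RANGES = [
    (0x0000, 0xD7FF),
    (0xE000, 0xFFFF),
    (0x10000, 0x10FFFF),
]


def scalar_count() -> int:
    """Number of code points covered by SCALAR_RANGES."""
    return sum(hi - lo + 1 for lo, hi in SCALAR_RANGES)


def gb18030_length(cp: int) -> int:
    """Length in bytes of one code point encoded with Python's gb18030 codec."""
    return len(chr(cp).encode("gb18030"))


if __name__ == "__main__":
    print(scalar_count())           # 1112064 code points
    print(gb18030_length(0x41))     # 1 byte  (ASCII)
    print(gb18030_length(0x4E2D))   # 2 bytes (GB2312/GBK area)
    print(gb18030_length(0x10000))  # 4 bytes (supplementary plane)
```

At 4 bytes per code point in UCS4, 1112064 code points come to 4448256
bytes, which matches the size of the full-range test file quoted further
down in this message.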
Ah, I see why now. There are lots of 0x0000 and "\x00\x00" in your code
which you put in to keep the filesize of the resulting GB18030.so to a
minimum. If you intend to use AbiWord as the editor for the Chinese testing
agency to test GB18030 support, the problem with this approach will be
apparent. When the current AbiWord sees a character for which it cannot get
the UCS2 value from the system iconv, it simply throws it away without even
leaving an empty space as a placeholder. Thus, the on-screen display and printout
would be different from what appears on Windows 2000/XP+GB18030, which
the agency uses as a guideline. Granted, this is partially AbiWord's fault,
but nevertheless, I think glibc should be able to do better
than just the minimum compliance.
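
The difference between throwing a character away and leaving a placeholder
can be sketched as follows. This is only an analogy using the error
handlers of Python's built-in "gb18030" codec; it is not AbiWord's code and
not glibc's iconv, and the helper name convert is hypothetical. The byte
0xFF stands in for input the converter cannot map.

```python
def convert(data: bytes, errors: str) -> str:
    """Decode GB18030 bytes with the given Python error handler."""
    return data.decode("gb18030", errors=errors)


# 0xFF is never valid in GB18030, so it plays the role of an
# unconvertible character between two ordinary ASCII letters.
sample = b"A\xffB"

dropped = convert(sample, "ignore")   # the bad byte silently disappears
marked = convert(sample, "replace")   # the bad byte becomes U+FFFD

if __name__ == "__main__":
    print(repr(dropped))  # 'AB'
    print(repr(marked))   # 'A\ufffdB'
```

With the first behaviour the text reflows and no longer lines up with the
reference rendering; with the second, at least the position of the missing
character stays visible.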
Therefore, I sincerely ask you to give our gb18030.c a second look.
Features of this file include:
* Automatically generated (with a Perl script) from Markus Scherer's
excellent gb-18030-2000.xml, which already lists many contiguous areas
that can be calculated algorithmically and makes the resulting
codec module very compact. (Markus' GB18030 table work on the IBM ICU
site is _the_ authoritative source of the GB18030 standard.)
* Uses an intermediate index table (generated from Markus' XML file)
which tells which UCS2 ranges correspond to 2-byte or 4-byte
GB18030 codes, as well as the table_offset or GB 4-byte linear offset.
* Similar to your approach, 2-byte and 4-byte mapping data (GB to UCS2)
are placed into the same table.
* The UDA1, UDA2 and UDA3 areas (U+E000..U+E765) are calculated to
save more space.
* The resulting GB18030.so is only 130 KB on i386 even though
it supports the full U+0000..U+10FFFF (minus the U+D800..U+DFFF surrogate
area), vs. 210 KB (in CVS) and 180 KB (in glibc-2.2.4).
* Correct round-trip conversion (to-and-fro ucs2 or ucs4) using
/usr/bin/iconv on all Chinese sample text files as well as
on a full U+0000..U+D7FF, U+E000..U+10FFFF test file.
Also manually checked (with hexdump) to ensure correctness.
* On the U+0000..U+D7FF, U+E000..U+10FFFF UCS4 test file (4448256 bytes)
on an Athlon K7 800MHz machine:
$ time iconv -f ucs4 -t gb18030 all-ucs4.txt > /dev/null
real 0m0.258s
user 0m0.220s
sys 0m0.030s
With the output of the command saved to all-gb18030.txt (4399992 bytes)
$ time iconv -f gb18030 -t ucs4 all-gb18030.txt > /dev/null
real 0m0.237s
user 0m0.220s
sys 0m0.010s
* Apparently passes the official test.
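
To illustrate what "calculated" means in the list above, here is a minimal
Python sketch of the two kinds of arithmetic involved: the 4-byte linear
offset for the supplementary planes, and the three User-Defined Areas. It
is not taken from our gb18030.c; the function names and the constant
SUPP_BASE are hypothetical, and the byte ranges are my reading of the
published GB18030 layout.

```python
# Linear index of 0x90308130, the 4-byte code assigned to U+10000.
SUPP_BASE = 189000


def gb4_to_linear(b: bytes) -> int:
    """Linear index of a 4-byte GB18030 code (0x81308130 has index 0)."""
    b1, b2, b3, b4 = b
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a 4-byte GB18030 code")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_gb4(n: int) -> bytes:
    """Inverse of gb4_to_linear."""
    n, d4 = divmod(n, 10)
    n, d3 = divmod(n, 126)
    d1, d2 = divmod(n, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


def supplementary_to_gb4(cp: int) -> bytes:
    """U+10000..U+10FFFF map linearly onto 4-byte codes from 0x90308130."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    return linear_to_gb4(SUPP_BASE + cp - 0x10000)


def uda_to_gb2(cp: int) -> bytes:
    """U+E000..U+E765 map onto the three 2-byte User-Defined Areas."""
    if 0xE000 <= cp <= 0xE233:            # UDA1: rows AA..AF, trail A1..FE
        row, col = divmod(cp - 0xE000, 94)
        return bytes([0xAA + row, 0xA1 + col])
    if 0xE234 <= cp <= 0xE4C5:            # UDA2: rows F8..FE, trail A1..FE
        row, col = divmod(cp - 0xE234, 94)
        return bytes([0xF8 + row, 0xA1 + col])
    if 0xE4C6 <= cp <= 0xE765:            # UDA3: rows A1..A7, trail 40..A0
        row, col = divmod(cp - 0xE4C6, 96)
        trail = 0x40 + col
        if trail >= 0x7F:                 # 0x7F is skipped as a trail byte
            trail += 1
        return bytes([0xA1 + row, trail])
    raise ValueError("not in U+E000..U+E765")


if __name__ == "__main__":
    print(supplementary_to_gb4(0x10000).hex())   # 90308130
    print(supplementary_to_gb4(0x10FFFF).hex())  # e3329a35
    print(uda_to_gb2(0xE000).hex())              # aaa1
    print(uda_to_gb2(0xE765).hex())              # a7a0
```

None of these ranges needs a table entry, which is a large part of why the
resulting module can cover the full repertoire and still stay small.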
So, please at least give it a try when you have some free time. :-)
Best regards,
Anthony
P.S. By the way, if you have not ported GB18030 to Qt yet, please feel free
to grab ours at http://people.debian.org/~foka/gb18030/ as KDE and Qt
applications need it to properly support GB18030, and I guess you probably
don't want any GPL/LGPL glibc gb18030.c code anywhere near Qt. ;-)
(Our qgb18030codec.c and related patches are a combined effort of James and
me, are under LGPL/GPL/QPL, and are permitted to be included in the commercial
Qt.) Anyhow, our code was submitted to Qt but is not yet in Qt 2.3.2 because:
* Qt-2.3.2 was in final testing, and it was a bit too late to add new
code.
* Lars Knoll would prefer the new tables merged with the existing
GBK and Big5 ones to save space.
* It needs to be ported to Qt-3.0.x soon. We'll see who has time to
get that done first. :-)
--
Anthony Fok Tung-Ling
ThizLinux Laboratory <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org> http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp! http://www.olvc.ab.ca/