This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)


Hello Yu Shao,

On Thu, Jan 17, 2002 at 10:35:12PM +1000, Yu Shao wrote:
> >Not sure if it is a problem with /usr/bin/iconv or GB18030.so:
> >when I tried your module, both old and new, on the Chinese sample
> >documents:
> >
> >	$ iconv -f gb18030 -t ucs2 four.txt
> >	iconv: illegal input sequence at position 32
> >	$ iconv -f gb18030 -t ucs2 wei.txt
> >	iconv: illegal input sequence at position 0
> >	$ iconv -f gb18030 -t ucs2 zang.txt
> >	iconv: illegal input sequence at position 0
> >	$ iconv -f gb18030 -t ucs2 wei.txt
> >	iconv: illegal input sequence at position 0
> >	$ iconv -f gb18030 -t ucs2 yi.txt
> >	iconv: illegal input sequence at position 0
> >
> >If the first line is trimmed, the illegal sequence appears at 27420 for
> >four.txt, etc.  It appears to me that your tables only cover the bare
> >minimum required by the Chinese Standards Committee, but this is not
> >quite right.  GB18030 is supposed to be like UTF-8: it is an encoding
> >that covers the entire repertoire of ISO-10646-1 while remaining
> >compatible with GB2312 and GBK.  It should be able to convert to and from
> >all Unicode codepoints, i.e. U+0000..U+D7FF, U+E000..U+FFFF,
> >U+10000..U+10FFFF.
> >
> The character at position 32 of four.txt is 0x8139EE38, whose
> Unicode value is 0x33FF. If you have a look at the Unicode table, 0x33FF
> is an undefined, invalid value. Actually, the same goes for the other four
> test files. Converting a GB code like 0x8139EE38 to a nonexistent Unicode
> value really means nothing.

I beg to differ.  An "undefined" value is not "invalid".  An undefined
value today does not mean it won't be defined in the future.  By your
reasoning, converting a file containing U+33FF from UCS2 to UTF-8
should also fail, and yet it doesn't, and for very good reason.
Besides, if U+33FF were invalid, why on earth would there be a
GB+8139EE38 that maps to U+33FF?  Do you mean the Chinese GB18030 standard
committee intentionally put an invalid code in their _standard_ test
files?
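
To illustrate the UCS2-to-UTF-8 point, here is a minimal Python sketch, not taken from glibc or iconv. The helper name ucs2_to_utf8 is hypothetical. U+0378 serves as a stand-in for a code point with no assigned character; whether the interpreter reports it as unassigned depends on the Unicode version it bundles, so that line is only printed, not relied upon.

```python
import unicodedata


def ucs2_to_utf8(data: bytes) -> bytes:
    """Re-encode big-endian UCS-2 text as UTF-8.

    No table of assigned characters is consulted; the conversion is
    purely arithmetic on the code point values.
    """
    return data.decode("utf-16-be").encode("utf-8")


# U+33FF is the code point discussed above; U+0378 is an assumed
# example of a code point that has no character assigned to it.
for cp in (0x33FF, 0x0378):
    ucs2 = cp.to_bytes(2, "big")
    print(f"U+{cp:04X}",
          "category:", unicodedata.category(chr(cp)),
          "UTF-8:", ucs2_to_utf8(ucs2).hex())
```

The conversion succeeds for both values regardless of what the category lookup reports, which is the behaviour argued for above.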

The whole point behind GB 18030 is to provide an encoding that covers
the _entire_ ISO-10646-1 / Unicode character repertoire, giving a
migration path towards the ultimate adoption of GB 13000.1 = ISO 10646-1
/ Unicode.  According to GB 18030 expert Markus Scherer of IBM:

     GB 18030 specifies a mapping table that covers all Unicode code
     points.  It is functionally similar to a UTF (Unicode
     Transformation Format) while maintaining compatibility of
     GB-encoded text with GBK and GB 2312-1980.

	- http://www-106.ibm.com/developerworks/unicode/library/u-china.html

With respect to Unicode<->GB18030 conversion, I don't think it is
gb18030.c's job to determine which Unicode codepoints are defined or not.
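
As a rough sketch of that position (an illustration only, not the logic in gb18030.c), a converter's range check could accept every code point in the three ranges quoted earlier and reject only surrogates and out-of-range values. The function name is_convertible is hypothetical.

```python
def is_convertible(cp: int) -> bool:
    """Return True for U+0000..U+D7FF, U+E000..U+FFFF and
    U+10000..U+10FFFF, whether or not a character is assigned there.

    Only the surrogate range U+D800..U+DFFF and values beyond
    U+10FFFF are rejected.
    """
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


for cp in (0x33FF, 0xD800, 0x10FFFF, 0x110000):
    print(f"U+{cp:04X}", is_convertible(cp))
```

Under this check U+33FF passes because of where it sits, not because of what any particular Unicode version happens to define there.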

> The new GB18030-2000 standard only uses unicode till 0xFFFF, do you have
> the latest standard book? And I think doing all gb18030 stuff based on
> the new standard is better.

No, I do not have the actual book at hand.  When you mention the "new
standard", do you mean the one released in December 2000?  Or
has an even newer version been published?

Note that since the revised GB18030-2000 was published in Dec 2000,
a new Unicode 3.1 standard has been published in March 2001
(and then Unicode 3.1.1 in August).  The code range U+10000..U+10FFFF
was added in Unicode 3.1, and GB18030 was updated accordingly.

The U+10000..U+10FFFF range was added to glibc-2.2.4 gb18030.c by Bruno Haible.
Markus Scherer's gb-18030-2000.xml also contains mappings to that
range.  See Bruno's message from May 30, 2001, here:

     http://sources.redhat.com/ml/libc-alpha/2001-05/msg00332.html

     This patch updates the GB18030 charmap and converter for Unicode 3.1.
     According to published infos from Markus Scherer and dmeyer@adobe.com,
     U+10000..U+10FFFF is linearly encoded as 0x90308130...0xE3329A35.
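
The linear encoding quoted above can be sketched in Python as follows. This is an illustration derived from the two endpoints in the quote, not the glibc implementation. It assumes the usual four-byte layout, with first and third bytes in 0x81..0xFE (126 values) and second and fourth bytes in 0x30..0x39 (10 values); the function names and the BASE constant are hypothetical.

```python
# Linear index of the four-byte sequence 0x90308130:
# (0x90 - 0x81) * 10 * 126 * 10 = 189000
BASE = (0x90 - 0x81) * 10 * 126 * 10


def encode_supplementary(cp: int) -> bytes:
    """Map U+10000..U+10FFFF to its four-byte GB18030 sequence."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+10000..U+10FFFF")
    index = BASE + (cp - 0x10000)
    index, b4 = divmod(index, 10)    # fourth byte: 0x30..0x39
    index, b3 = divmod(index, 126)   # third byte:  0x81..0xFE
    b1, b2 = divmod(index, 10)       # second byte: 0x30..0x39
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def decode_supplementary(seq: bytes) -> int:
    """Inverse of encode_supplementary for a four-byte sequence."""
    b1, b2, b3, b4 = seq
    index = (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)
    cp = index - BASE + 0x10000
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("sequence outside 0x90308130..0xE3329A35")
    return cp


print(encode_supplementary(0x10000).hex())   # 90308130
print(encode_supplementary(0x10FFFF).hex())  # e3329a35
```

Because the mapping is pure arithmetic, no table entries are needed for this range, so covering it adds essentially nothing to the size of the module.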

> >Ah, I see why now.  There are lots of 0x0000 and "\x00\x00" in your code
> >which you put in to keep the filesize of the resulting GB18030.so to a
> >minimum.  If you intend to use AbiWord as the editor for the Chinese
> >testing agency to test GB18030 support, the problem with this approach
> >will be apparent.  When the current AbiWord sees a character for which it
> >cannot get the UCS2 from the system iconv, it simply throws it away
> >without even leaving an empty space as a placeholder.  Thus, the
> >on-screen display and printout would be different from what appears on
> >Windows 2000/XP+GB18030, which the agency uses as a guideline.  Granted,
> >this is partially AbiWord's fault,
> >but nevertheless, I think glibc should be able to do better
> >than just the minimum compliance.

> For your AbiWord problem, I would suggest you update the test files,
> such as by deleting all the invalid values, or simply make AbiWord more robust.

Remember that these sample GB 18030 text files are provided by the
_official_ Chinese GB18030 standard testing agency.
Are you implying that the Chinese standards committee intentionally put
invalid GB characters in the files they actually use for standard
compliance testing?

I agree that AbiWord should be able to put in a placeholder, but it is
also true that GB18030.so needs to be able to handle these codepoints.
The government's GB18030 test files show that both gb18030.c and
AbiWord need to be fixed.


Best regards,

Anthony

-- 
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/

