This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: iswxxxxx/towxxxer and Unicode
- To: Bruno Haible <haible at ilog dot fr>
- Subject: Re: iswxxxxx/towxxxer and Unicode
- From: Ulrich Drepper <drepper at redhat dot com>
- Date: 18 Jul 2000 01:37:04 -0700
- Cc: libc-alpha at sourceware dot cygnus dot com
- References: <14707.14010.371576.370585@honolulu.ilog.fr>
- Reply-To: drepper at cygnus dot com (Ulrich Drepper)
Bruno Haible <haible@ilog.fr> writes:
> For fixing bug report libc/1251, I've prepared a patch which changes
> the behaviour of the iswalpha etc. and towlower etc. functions. I
> created an FDCC-set called "unicode" (automatically generated from
> UnicodeData.txt) containing only an LC_CTYPE and LC_IDENTIFICATION
> category.
Why? The Unicode data is not the last word in wisdom. It standardizes
the existing practice of the participating companies, which is not
always what common sense suggests.
> 1) iswcntrl(0x0000) now returns 1. Why did you change iswcntrl(0x0000)
U0000 is nothing. Some testsuites dictate the current behavior. If
somebody collects a list of how other systems behave we can change it
back.
> 2) iswspace(0x00A0) now returns 0.
OK.
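For anyone who wants to collect the list of how other systems behave, a small probe along these lines might help. It is only a sketch: describe_wchar and report_disputed_cases are hypothetical names, not part of glibc or the patch, and the results depend on the LC_CTYPE locale in effect, so no expected output is given.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: write one line of classification results for WC
   in the current LC_CTYPE locale into BUF.  Returns what snprintf
   returns, i.e. the number of characters the full line needs.  */
int
describe_wchar (char *buf, size_t size, wint_t wc)
{
  assert (buf != NULL && size > 0);
  return snprintf (buf, size, "U+%04lX cntrl=%d space=%d alpha=%d",
                   (unsigned long) wc,
                   iswcntrl (wc) != 0, iswspace (wc) != 0,
                   iswalpha (wc) != 0);
}

/* Hypothetical driver: print the two code points disputed in this
   thread, using the locale named by the environment.  */
void
report_disputed_cases (void)
{
  static const wint_t cases[] = { 0x0000, 0x00A0 };
  char line[64];
  size_t i;

  setlocale (LC_CTYPE, "");
  for (i = 0; i < sizeof cases / sizeof cases[0]; i++)
    {
      describe_wchar (line, sizeof line, cases[i]);
      puts (line);
    }
}
```

Calling report_disputed_cases () from main on each system under a few locales (e.g. "C" and a UTF-8 locale) would give comparable lines to paste into the bug report.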
> 3) The compiled LC_CTYPE locale is now 1.2 MB large; before it was
> around 130 KB. With more than 60 supported locales, the
> /usr/lib/locales/ directory will grow to 70 MB. (And I haven't
> added the wcwidth information yet!)
There are several things involved:
- the character names (in collate) must be cleaned up. If a name is
in the Uxxxx form there is no need to store the name in the file. The
value can be determined at runtime.
- there is still the problem to be resolved whether the wide character
data in LC_CTYPE and LC_COLLATE should contain information only about
the wide characters in the locale's charset or whether the information
for all of them should be included. Currently the latter is done and
this unnecessarily blows up the data for all charsets != UTF-8.
- for UTF-8 the tables should be almost densely packed, i.e., no size
improvements are possible unless you compress the table data as well,
e.g., by collapsing ranges of characters with the same properties.
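To make the last point concrete, collapsing ranges could look roughly like the sketch below. This is not the localedef table format, only an illustration: struct prop_range, the PROP_* bits, collapse_ranges and lookup_props are all hypothetical names, and the flat input array is assumed to hold one property mask per code point starting at 0.

```c
#include <assert.h>
#include <stddef.h>

/* One run of consecutive code points sharing the same property bits.  */
struct prop_range
{
  unsigned long first;   /* first code point in the run */
  unsigned long last;    /* last code point in the run, inclusive */
  unsigned int props;    /* hypothetical property bit mask */
};

/* Hypothetical property bits, for illustration only.  */
enum { PROP_ALPHA = 1, PROP_SPACE = 2, PROP_CNTRL = 4 };

/* Collapse FLAT, a table with one property mask per code point
   0..N-1, into runs of equal masks.  OUT must have room for N
   entries (the worst case).  Returns the number of runs written.  */
size_t
collapse_ranges (const unsigned int *flat, size_t n, struct prop_range *out)
{
  size_t nruns = 0;
  size_t i;

  assert (flat != NULL && out != NULL);
  for (i = 0; i < n; i++)
    {
      if (nruns > 0 && out[nruns - 1].props == flat[i])
        out[nruns - 1].last = i;        /* extend the current run */
      else
        {
          out[nruns].first = i;         /* start a new run */
          out[nruns].last = i;
          out[nruns].props = flat[i];
          nruns++;
        }
    }
  return nruns;
}

/* Binary search over the sorted runs.  Returns the property mask of
   WC, or 0 if WC is not covered by any run.  */
unsigned int
lookup_props (const struct prop_range *runs, size_t nruns, unsigned long wc)
{
  size_t lo = 0, hi = nruns;

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (wc < runs[mid].first)
        hi = mid;
      else if (wc > runs[mid].last)
        lo = mid + 1;
      else
        return runs[mid].props;
    }
  return 0;
}
```

The trade-off is that a lookup becomes O(log n) in the number of runs instead of a direct index, so whether this pays off depends on how long the runs are in practice.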
> Do you want me to work on this?
If you can find the time very soon, sure. But keep the old
implementation around (at least with #ifdefs for now). It might be
best to decide about the method to use at localedef time.
--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------