This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.

Re: iswxxxxx/towxxxer and Unicode

To: Bruno Haible <haible at ilog dot fr>
Subject: Re: iswxxxxx/towxxxer and Unicode
From: Ulrich Drepper <drepper at redhat dot com>
Date: 18 Jul 2000 01:37:04 -0700
Cc: libc-alpha at sourceware dot cygnus dot com
References: <14707.14010.371576.370585@honolulu.ilog.fr>
Reply-To: drepper at cygnus dot com (Ulrich Drepper)

Bruno Haible <haible@ilog.fr> writes:

> For fixing bug report libc/1251, I've prepared a patch which changes
> the behaviour of the iswalpha etc. and towlower etc. functions. I
> created an FDCC-set called "unicode" (automatically generated from
> UnicodeData.txt) containing only an LC_CTYPE and LC_IDENTIFICATION
> category.

Why?  The Unicode data is not the last word of wisdom.  It standardizes
the practice of the participating companies, which is not always what
common sense suggests.

> 1) iswcntrl(0x0000) now returns 1. Why did you change iswcntrl(0x0000)

U0000 is nothing.  Some testsuites dictate the current behavior.  If
somebody collects a list of how other systems behave we can change it
back.

> 2) iswspace(0x00A0) now returns 0.

OK.
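
For anyone collecting the list of how other systems behave, a small
probe along these lines might help. It is only a sketch: the struct and
function names (ctype_probe, probe_ctype) are made up for illustration,
and the results depend on the platform and the selected locale, so no
expected values are given here.

```c
#include <locale.h>
#include <wctype.h>

/* Hypothetical result record; not part of any libc interface. */
struct ctype_probe {
    int cntrl_nul;   /* iswcntrl(0x0000), normalized to 0 or 1 */
    int space_nbsp;  /* iswspace(0x00A0), normalized to 0 or 1 */
};

/* Select LC_CTYPE for the given locale name and record how the two
   disputed code points are classified.  Returns 0 on success, or -1
   if the locale could not be selected. */
static int probe_ctype(const char *locale_name, struct ctype_probe *out)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* The isw* functions return nonzero for "true"; normalize so the
       numbers can be compared across systems. */
    out->cntrl_nul = iswcntrl((wint_t) 0x0000) != 0;
    out->space_nbsp = iswspace((wint_t) 0x00A0) != 0;
    return 0;
}
```

Calling probe_ctype("C", &p) and then again with a UTF-8 locale name
installed on the system, and printing both fields, gives one row of the
comparison table per platform.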

> 3) The compiled LC_CTYPE locale is now 1.2 MB large; before it was
> around 130 KB. With more than 60 supported locales, the
> /usr/lib/locales/ directory will grow to 70 MB. (And I haven't
> added the wcwidth information yet!)

There are several things involved:

- the character names (in collate) must be cleaned up.  If a name is
  in the Uxxxx form there is no need to store the name in the file.  The
  value can be determined at runtime.

- there is still the problem to be resolved whether the wide character
  data in LC_CTYPE and LC_COLLATE should contain information only about
  the wide characters in the locale's charset or whether the information
  for all of them should be included.  Currently the latter is done and
  this unnecessarily blows up the data for all charsets != UTF-8.

- for UTF-8 the tables should be almost densely packed, i.e., no size
  improvements are possible unless you compress the table data as well,
  e.g., by collapsing ranges of characters with the same properties.
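
To illustrate the last point, collapsing ranges could look roughly like
the sketch below. This is not the format localedef writes; the struct
ctype_range layout, the function names, and the idea of a single
property bitmask per character are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive code points sharing the same property bits.
   Hypothetical layout, for illustration only. */
struct ctype_range {
    uint32_t first;  /* first code point of the run */
    uint32_t last;   /* last code point of the run (inclusive) */
    uint32_t props;  /* property bitmask shared by the whole run */
};

/* Collapse a flat table (props[i] holds the bits for code point i)
   into runs.  out must have room for n entries in the worst case.
   Returns the number of runs written. */
static size_t collapse_ranges(const uint32_t *props, size_t n,
                              struct ctype_range *out)
{
    size_t count = 0;

    assert(n == 0 || (props != NULL && out != NULL));

    for (size_t i = 0; i < n; i++) {
        if (count > 0 && out[count - 1].props == props[i]) {
            /* Same properties as the previous code point: extend. */
            out[count - 1].last = (uint32_t) i;
        } else {
            /* Properties changed: start a new run. */
            out[count].first = (uint32_t) i;
            out[count].last = (uint32_t) i;
            out[count].props = props[i];
            count++;
        }
    }
    return count;
}

/* Binary search over the sorted runs.  Code points outside every run
   are reported as having no properties (0). */
static uint32_t lookup_props(const struct ctype_range *runs, size_t count,
                             uint32_t wc)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (wc < runs[mid].first)
            hi = mid;
        else if (wc > runs[mid].last)
            lo = mid + 1;
        else
            return runs[mid].props;
    }
    return 0;
}
```

The trade-off is the usual one: the flat table gives a constant-time
lookup, while the collapsed form costs a binary search per call but
shrinks wherever long stretches of characters share the same class.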

> Do you want me to work on this?

If you can find the time very soon, sure.  But keep the old
implementation around (at least with #ifdefs for now).  It might be
best to decide about the method to use at localedef time.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Follow-Ups:
- Re: iswxxxxx/towxxxer and Unicode
  - From: Bruno Haible

References:
- iswxxxxx/towxxxer and Unicode
  - From: Bruno Haible
