This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.
Character classifications and language-dependence
- From: ludovic dot courtes at laas dot fr (Ludovic Courtès)
- To: libc-locales at sources dot redhat dot com
- Date: Thu, 14 Sep 2006 18:36:31 +0200
- Subject: Character classifications and language-dependence
- Organization: LAAS-CNRS
Hi,
Currently, many locale definition files that come with glibc (actually
mostly those of western languages) include the "i18n" FDCC-set under
their `LC_CTYPE' category.
However, the "i18n" FDCC-set contains a very broad character
classification: it considers at least all Latin, Greek and Cyrillic
letters as part of the `alpha' character class (as seen in Section 4.3.2
of ISO 14652 [0] and glibc's version). Thus, all the languages whose
locale includes "i18n" end up having a lot of letters in their `alpha'
character class, more than actually exist in the language.
For instance, while `ê' (`e' circumflex) is a letter in French, it is
not a letter in Spanish (Castellano); likewise, `ñ' is a letter in
Spanish, but not in French. But since glibc's locale definitions for
`fr_FR' and `es_ES' both include "i18n", `isalpha(3)' returns true for
both letters in both locales.
Section 4 of ISO 14652 reads:

  This Technical Report also defines an FDCC-set named "i18n" with
  values for some of the above categories in order to simplify FDCC-set
  descriptions for a number of cultures. The contents of "i18n"
  categories should not necessarily be considered as the most commonly
  accepted values, while in many cases it could be the recommended
  values.

Thus, my understanding is that glibc's heavy use of "i18n" for character
classifications is acceptable, though not representative of "the most
commonly accepted values". Therefore, one could for instance refine the
`fr_FR' character classification so that only French letters (e.g., not
`ñ') are found under its `alpha' class.
Is this correct? If so, are there plans to actually refine (some of)
these character classifications?
Thanks,
Ludovic.
[0] http://www.open-std.org/jtc1/sc22/wg20/docs/projects#14652