This is the mail archive of the
mailing list for the Cygwin project.
Re: LC_COLLATE vs. egrep -- bug or (non-)feature?
- From: Eric Blake <eblake at redhat dot com>
- To: cygwin at cygwin dot com
- Date: Tue, 11 Oct 2011 13:50:55 -0600
- Subject: Re: LC_COLLATE vs. egrep -- bug or (non-)feature?
- References: <firstname.lastname@example.org>
On 10/11/2011 01:20 PM, Henry S. Thompson wrote:
> Is this a feature, or a bug associated with the current ongoing
> discussion about locales:
(mis)-feature, and not necessarily a cygwin bug. Historically, POSIX
1992 _required_ that regular expression ranges expand out to all
characters in Collation Element Order, between the two end points. The
intent there was to allow accented characters common in some languages
to automatically be picked up, so that [a-z] would also pick up accented
vowels. But it backfired with several unintended consequences: 1) in
locales that collate case-insensitively, you are collating via
aAbBcC... or AaBbCc..., so that [a-b] now means [aAb] or [aBb], which
adds unwanted capital letters into your range. And although you can
write a locale definition where collation element order is sane (all
lowercase, followed by all uppercase, followed by collation rules that
merge the two sets), it is not as easy to do (the naive locale
definition writes the collation rules first, intermixing upper and lower
case). 2) even if you write the locale definition in a sane collation
element order, do you put the accents first or last? That is, [a-e] is
liable to pick up all accented a's but no accented e's, even though
[a-z] picks up all accented lower case vowels.
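To make the two problems above concrete, here is a toy sketch in POSIX
shell. It is not how any regex library actually computes ranges:
expand_range is a hypothetical helper, and the order strings are made-up
stand-ins for a locale's collation element order. It assumes the low end
point comes before the high end point in the order string.

```shell
# expand_range ORDER LO HI
# Print every character of ORDER from LO through HI, inclusive.
# ORDER is an illustrative collation element order, not real locale data.
expand_range() {
  order=$1 lo=$2 hi=$3
  rest=${order#*"$lo"}     # everything after the first LO
  mid=${rest%%"$hi"*}      # everything before the first HI that follows
  printf '%s%s%s\n' "$lo" "$mid" "$hi"
}

expand_range 'aAbBcC' a b      # prints aAb
expand_range 'AaBbCc' a b      # prints aBb
expand_range 'abcABC' a b      # prints ab  (the "sane" order)
expand_range 'aáàbcdeéè' a e   # prints aáàbcde (accented a's, no accented e's)
```

The last call shows problem 2: with accents sorted after their base
letter, the range stops at plain e and never reaches the accented e's.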
POSIX 2001 and 2008 "fixed" things by saying that the use of range
expressions in regular expressions is undefined in all but the C locale,
but the cat is already out of the bag, and you are stuck with existing
behavior. glibc refuses to change their regex library, preferring to
stick to POSIX 1992 behavior, and claiming that the "bug" instead lies
with any locale definition that still uses naive ordering. Cygwin could
behave differently than glibc here and still comply with POSIX, but then
we'd get bug reports for "why does cygwin not emulate Linux".
Meanwhile, several GNU apps are sick of bug reports about the
unintuitive nature of ranges, and are introducing what is called native
ordering, where range expressions _always_ mean the C locale expansion,
even when not in the C locale; but given glibc behavior, this means
adding code on top of glibc, for all programs that understand regex
(awk, bash, sed, grep, m4, etc.). So don't expect that to save you any
time soon; likewise, that only helps you on GNU systems (Solaris will
still continue to suffer from the confusion).
> LC_ALL= egrep '^[a-b]l[dl]e.n$' /usr/share/dict/words
LC_COLLATE= egrep '^[a-b]l[dl]e.n$' /usr/share/dict/words
> If it's a feature, how do I set LC_COLLATE w/o changing the other
> aspects of my locale?
So, your only safe way to work around it is to request LC_COLLATE=C up
front, and don't set LC_ALL.
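As a minimal sketch of that workaround: LC_ALL, when set, overrides every
individual LC_* variable, so it has to be unset for LC_COLLATE=C to take
effect. The wrapper name collate_c is hypothetical, chosen only for this
example; it runs one command with C collation and leaves LANG, LC_CTYPE
and the rest of the locale alone.

```shell
# collate_c CMD [ARGS...]
# Run CMD in a subshell with LC_ALL unset and LC_COLLATE=C, so that only
# collation changes and the calling shell's environment is untouched.
collate_c() (
  unset LC_ALL
  LC_COLLATE=C "$@"
)

# With C collation, [a-b] should match only the two lowercase letters.
printf '%s\n' a A b B c | collate_c grep -E '^[a-b]$'
```

To apply the same setting to a whole session instead, put
"export LC_COLLATE=C" in your shell startup file and make sure nothing
there sets LC_ALL.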
Eric Blake email@example.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org