Bug in collation functions?

Corinna Vinschen corinna-cygwin@cygwin.com
Thu Oct 29 16:14:00 GMT 2015


On Oct 29 08:59, Ken Brown wrote:
> On 10/29/2015 4:30 AM, Corinna Vinschen wrote:
> >On Oct 29 08:50, Corinna Vinschen wrote:
> >>On Oct 28 21:58, Eric Blake wrote:
> >>>On 10/28/2015 04:14 PM, Ken Brown wrote:
> >>>>It's my understanding that collation is supposed to take whitespace and
> >>>>punctuation into account in the POSIX locale but not in other locales.
> >>>
> >>>Not quite right. It is up to the locale definition whether whitespace
> >>>affects collation.  But you are correct that in the POSIX locale,
> >>>whitespace must not be ignored in collation.
> >>>
> >>>>This doesn't seem to be the case on Cygwin.  Here's a test case using
> >>>>wcscoll, but the same problem occurs with strcoll.
> >>>
> >>>That's because the locale definitions are different in cygwin than they
> >>>are in glibc.  But it is not a bug in Cygwin; POSIX allows for different
> >>>systems to have different locale definitions while still using the same
> >>>locale name like en_US.UTF-8.
> >>
> >>Btw, strcoll and wcscoll in Cygwin are implemented using the Windows
> >>function CompareStringW with the LCID set to the locale matching the
> >>POSIX locale setting.  I'm rather glad I didn't have to implement this
> >>by myself... :}
> >
> >OTOH, CompareString has a couple of flags to control its behaviour, see
> >https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx
> >
> >Right now Cygwin calls CompareStringW with dwCmpFlags set to 0, but there
> >are flags like NORM_IGNORENONSPACE, NORM_IGNORESYMBOLS.  I'm open to a
> >discussion how to change the settings to more closely resemble the rules
> >on Linux.
> >
> >E.g.  wcscoll simply calls wcscmp rather than CompareStringW for the
> >C/POSIX locale anyway.  So, would it make sense to set the flags to
> >NORM_IGNORESYMBOLS in other locales?
> 
> I think so.  That's what the native Windows build of emacs does in this
> situation.

Is that all it's doing?  I'm asking because using NORM_IGNORESYMBOLS
does not exactly resemble the behaviour on Linux on my W10 box:

    "11" > "1.1" in POSIX locale
!!! "11" > "1.1" in en_US.UTF-8 locale
    "11" > "1 2" in POSIX locale
    "11" < "1 2" in en_US.UTF-8 locale


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat