This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: "C" character set (again)

From: Andy Koppe <andy dot koppe at gmail dot com>
To: cygwin-developers at cygwin dot com
Date: Tue, 29 Dec 2009 14:11:18 +0000
Subject: Re: "C" character set (again)
References: <416096c60912282254r7230cbaeiad6b3432f7c15257@mail.gmail.com> <4B3A0379.1090508@byu.net>

2009/12/29 Eric Blake:
>> Following the "printf treats differently a string constant and a
>> character array" issue at
>> http://cygwin.com/ml/cygwin/2009-12/msg01009.html, I'm wondering again
>> whether the "C" locale shouldn't go back to using ASCII rather than
>> UTF-8, to avoid surprises like that and also to fit with many people's
>> expectation that "C" means ASCII. I think that would save us a bunch
>> of trouble and pointless legal/religious discussions about the C
>> locale.
>
> Bytes with the 8th bit set are not portable in the C locale, regardless of
> whether that locale uses ASCII or UTF-8 encoding. Yes, we will have to
> field complaints from users with non-portable programs. But I don't think
> we have to change back to ASCII - we are doing those users a service by
> making them fix their portability bugs.

Trouble is, Cygwin currently is the only significant platform where
plain "C" implies UTF-8, as far as I know anyway. While I agree that
POSIX does allow it, this does make it more of a Cygwin problem than a
portability problem from the user's perspective, and they are
certainly not going to thank us for that in any case.

Following the introduction of the "C.UTF-8" default locale, we no
longer need "C" to imply UTF-8, so we're causing ourselves unnecessary
pain by sticking with that. There have been several user questions on
this already, problems with autoconf and gcc test cases that assumed
that C means ASCII, and complaints on legal/philosophical grounds from
Thomas Dickey and others. And if the Debian thread discussing the
introduction of C.UTF-8 is anything to go by, there's going to be a lot
more of the latter. (See
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776)

I'm running Cygwin with the patch posted above right now, and things
are working fine. Everything that cares about charsets uses UTF-8 as
before, as do the filesystem, the console, and also the conversion of
the initial environment. The difference is that worries about 8-bit
cleanness in programs that don't call setlocale or that explicitly set
the C locale go away.
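
(Illustrative sketch, not taken from the patch: the 8-bit cleanness
worry can be shown with Python's codecs standing in for a C library's
multibyte conversion. The helper name try_decode and the sample bytes
are made up for this example. A lone 0xE9 byte is rejected by a strict
UTF-8 decoder and by a strict ASCII decoder, while a byte-per-character
mapping such as Latin-1 accepts every byte.)

```python
# Analogy only: Python codecs stand in for a C library's multibyte
# conversion routines. This is not Cygwin or newlib code.

# "cafe" with an accented e in ISO-8859-1; 0xE9 alone is not valid UTF-8.
data = b"caf\xe9\n"


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a short marker if the bytes are rejected."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        return f"<rejected at byte offset {err.start}>"


# Strict UTF-8 and strict ASCII both refuse the high byte;
# Latin-1 maps every byte to a character and so never fails.
for enc in ("utf-8", "ascii", "latin-1"):
    print(enc, repr(try_decode(data, enc)))
```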

Again, I agree that POSIX doesn't require "C" to mean ASCII, but since
Cygwin aims for GNU/Linux compatibility in addition to POSIX, I think
this is a change worth making.


> On the other hand, I wonder if it may be possible to special case the
> C.UTF-8 locale to treat invalid byte sequences as pseudo-characters, such
> that we can achieve 8-bit transparency in character contexts such as
> printf rather than failing with EILSEQ. But such special-casing should be
> reserved for C.UTF-8; locales like en_US.UTF-8 should still fail with
> EILSEQ on invalid sequences.

That seems hacky and inconsistent.
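
(As a rough illustration of the pseudo-character idea, and only as an
analogy: Python's "surrogateescape" error handler maps each undecodable
byte to a lone surrogate code point, so the original bytes survive a
round trip, whereas strict decoding fails. Nothing here reflects how
Cygwin or newlib would implement such a special case; the function
names are hypothetical.)

```python
# Analogy only: Python's "surrogateescape" handler, not a Cygwin mechanism.

def decode_transparent(raw: bytes) -> str:
    """Decode UTF-8, turning each invalid byte into a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_transparent(text: str) -> bytes:
    """Encode back to bytes, restoring the escaped bytes unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


def decode_strict(raw: bytes) -> str:
    """Strict decoding: raises UnicodeDecodeError on invalid input."""
    return raw.decode("utf-8")


raw = b"ok \xff\xfe end"  # 0xFF and 0xFE never occur in valid UTF-8
text = decode_transparent(raw)
print(ascii(text))
print(encode_transparent(text) == raw)

try:
    decode_strict(raw)
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)
```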

Andy

Follow-Ups:
- Re: "C" character set (again)
  - From: Dave Korn

References:
- "C" character set (again)
  - From: Andy Koppe
- Re: "C" character set (again)
  - From: Eric Blake
