"C" UTF-8 trouble

Corinna Vinschen corinna-cygwin@cygwin.com
Wed Oct 7 09:03:00 GMT 2009

On Oct  7 07:07, Andy Koppe wrote:
> 2009/10/7 Eric Blake:
> > For the problematic apps, are they checking just the environment
> > variables, or are they using setlocale(,NULL) and/or setlocale(,"") to
> > determine the current/default settings?
> Looking into this question, I found that for vim there's actually a
> completely different culprit: nl_langinfo(CODESET) returns "US-ASCII"
> for the C locale. (It also returns incorrect values for other
> charset-less locales.)
> Hence I replaced the code in nl_langinfo's CODESET case with just 'ret
> = __locale_charset()', and vim's fine!

Urgh.  So we have to change nl_langinfo in newlib as well.  Do we have
to return "US-ASCII" if charset is "ASCII", or is it sufficient to
return __locale_charset() as you did, thus returning "ASCII" for "ASCII"?

And what about stuff like "eucJP" vs. "EUCJP"?  The charset in newlib
is always uppercase right now.
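
To make the naming question concrete, here is a minimal sketch of how an application might compare codeset names loosely.  The helper names codeset_equal() and environment_codeset() are made up for illustration; how vim or any other application actually compares the value of nl_langinfo(CODESET) is not established in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: compare two codeset names, ignoring case and any
   non-alphanumeric characters such as '-' or '_'.  Returns 1 on a match. */
int codeset_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        /* Skip separators on both sides. */
        while (*a && !isalnum((unsigned char)*a)) a++;
        while (*b && !isalnum((unsigned char)*b)) b++;
        if (!*a || !*b)
            return *a == *b;       /* match only if both strings ended */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
}

/* Hypothetical helper: adopt the locale selected by the environment and
   report its codeset, or NULL if the environment names an unknown locale. */
const char *environment_codeset(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

With such a comparison "eucJP" and "EUCJP" are treated as the same name, but "US-ASCII" and "ASCII" are still different, so an application comparing this way would still care which of the two nl_langinfo returns.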

> Unfortunately that's not the case for emacs.

<insert obligatory editor dispute here>

> > Anyone using _just_ the
> > environment variables is doomed to failure.  POSIX states:
> >
> > "If the LANG environment variable is not set or is set to the empty
> > string, the implementation-defined default locale shall be used."
> >
> > My preference would be that if the environment variables were not set when
> > cygwin1.dll started, then setlocale(,NULL) returns "C.UTF-8" rather than
> > "C".
> The way I understand it, setlocale(,NULL) only queries the current
> setting and has to return "C" (or "POSIX") in the initial state.
> But you're right regarding setlocale(,""); that could indeed return
> something else if none of the environment variables is set. From
> http://www.opengroup.org/onlinepubs/7990989775/xbd/locale.html:
> "All implementations define a locale as the default locale, to be
> invoked when no environment variables are set, or set to the empty
> string. This default locale can be the POSIX locale or any other,
> implementation-dependent locale."
> I think this is a good idea, so I replaced "C" with "C.UTF-8" at the end
> of __get_locale_env. Yet emacs still doesn't behave, and digging into
> its code I found that it does indeed read the env variables directly.
> :(
>       ;; Use the first of these three environment variables
>       ;; that has a nonempty value.
>       (let ((vars '("LC_ALL" "LC_CTYPE" "LANG")))
> 	(while (and vars
> 		    (= 0 (length locale))) ; nil or empty string
> 	  (setq locale (getenv (pop vars) frame)))))

I, too, think this is a good idea.  __get_locale_env() should be changed
to return "C.UTF-8".

As for Emacs, I'm wondering if it shouldn't be changed to set its locale
according to setlocale(LC_CTYPE,NULL) instead, given what POSIX says.
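
The difference between the two approaches can be sketched in plain C.  This is only an illustration: locale_from_environment() mimics the quoted Emacs Lisp, locale_from_library() asks the C library instead, and both function names are made up for this example.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Mimic the quoted Emacs logic: return the first of LC_ALL, LC_CTYPE and
   LANG that has a nonempty value, or NULL if none of them is set. */
const char *locale_from_environment(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && *value != '\0')
            return value;
    }
    return NULL;
}

/* Ask the C library instead: adopt the environment's (or the
   implementation's default) locale, then query the current setting. */
const char *locale_from_library(void)
{
    const char *name;
    (void)setlocale(LC_CTYPE, "");   /* on failure the locale is unchanged */
    name = setlocale(LC_CTYPE, NULL);
    assert(name != NULL);
    return name;
}
```

With none of the three variables set, the first function has nothing to report, while the second returns whatever the implementation chose as its default locale.  That is exactly the case where an implementation-defined default such as "C.UTF-8" would stay invisible to a program that only reads the environment.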

It would be nice to check /etc/defaults/locale in __get_locale_env() as
well, but I'm a bit reluctant to do that.  It means that every invocation
of a Cygwin process has to open that file if the environment isn't set.
Talking about performance...

Alternatively, only the first Cygwin process in a process tree could
try to read this file.
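
One way to picture the "read it only once per process tree" idea, outside of Cygwin internals, is to let the first process export what it read so that children inherit it.  The sketch below is purely hypothetical: the function names are invented, the file is assumed to hold a single line containing a locale name, only LANG is checked, and the hand-over uses the environment rather than whatever mechanism Cygwin would actually use.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: read one locale name from the first line of `path` into
   `buf`.  Returns 1 on success, 0 if the file is missing or empty. */
int read_default_locale(const char *path, char *buf, size_t size)
{
    FILE *fp;
    char *line;

    assert(path != NULL && buf != NULL && size > 0);
    fp = fopen(path, "r");
    if (fp == NULL)
        return 0;
    line = fgets(buf, (int)size, fp);
    fclose(fp);
    if (line == NULL)
        return 0;
    buf[strcspn(buf, "\r\n")] = '\0';    /* strip the line ending */
    return buf[0] != '\0';
}

/* Hypothetical: if LANG is unset or empty, read the file and export the
   result as LANG so that child processes inherit it and skip the read.
   Returns 1 if the file was read and LANG was set, 0 otherwise. */
int inherit_default_locale(const char *path)
{
    const char *lang = getenv("LANG");
    char name[64];

    if (lang != NULL && *lang != '\0')
        return 0;                        /* already set, maybe by a parent */
    if (!read_default_locale(path, name, sizeof name))
        return 0;
    return setenv("LANG", name, 1) == 0;
}
```

Only the first process without a locale setting pays for the fopen(); every descendant finds LANG already set and returns early.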

For a start, here's a first untested cut at newlib's locale.c, which
allows us to add any desired mechanism to switch the default locale.
The comment is already jumping ahead a bit...:

Index: libc/locale/locale.c
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.28
diff -u -p -r1.28 locale.c
--- libc/locale/locale.c	29 Sep 2009 19:12:28 -0000	1.28
+++ libc/locale/locale.c	7 Oct 2009 08:57:12 -0000
@@ -205,6 +205,21 @@ static char *categories[_LC_LAST] = {
+ * Default locale per POSIX.
+ */
+#ifdef __CYGWIN__
+ * This variable can be changed by any outside mechanism.  This allows,
+ * for instance, to load the default locale from a file.  On Cygwin,
+ * we're using /etc/defaults/locale for that.
+ */
+char __default_locale[ENCODING_LEN + 1] = DEFAULT_LOCALE;
  * Current locales for each category
 static char current_categories[_LC_LAST][ENCODING_LEN + 1] = {
@@ -733,7 +748,7 @@ __get_locale_env(struct _reent *p, int c
   /* 4. if none is set, fall to "C" */
   if (env == NULL || !*env)
-    env = "C";
+    env = __default_locale;
   return env;
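
For illustration only, the fallback behaviour this patch aims for can be sketched as a standalone program fragment, independent of newlib.  Everything here is an assumption made for the example: the names default_locale, set_default_locale() and get_locale_env(), the value 31 for ENCODING_LEN, the "C.UTF-8" default, and the lookup order LC_ALL, then the category's own variable, then LANG.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ENCODING_LEN   31          /* assumed buffer size for this sketch */
#define DEFAULT_LOCALE "C.UTF-8"   /* assumed default for this sketch */

/* Stand-in for the patch's __default_locale: a writable buffer that an
   outside mechanism can overwrite before the first lookup. */
char default_locale[ENCODING_LEN + 1] = DEFAULT_LOCALE;

/* Hypothetical setter: replace the default.  Returns 1 on success, 0 if
   the name does not fit into the buffer. */
int set_default_locale(const char *name)
{
    assert(name != NULL);
    if (strlen(name) > ENCODING_LEN)
        return 0;
    strcpy(default_locale, name);
    return 1;
}

/* Look up the locale name for one category variable, e.g. "LC_CTYPE". */
const char *get_locale_env(const char *category)
{
    const char *env;

    assert(category != NULL);
    env = getenv("LC_ALL");                /* 1. LC_ALL overrides all */
    if (env == NULL || !*env)
        env = getenv(category);            /* 2. the category variable */
    if (env == NULL || !*env)
        env = getenv("LANG");              /* 3. LANG */
    if (env == NULL || !*env)
        env = default_locale;              /* 4. the switchable default */
    return env;
}
```

The point of keeping the default in a writable buffer rather than a string literal is that step 4 needs no further code changes when the default is later loaded from a file or set by some other mechanism.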

If you agree to this, I'll propose it on the newlib list.


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
