UTF-8 as default charset?

Corinna Vinschen corinna-cygwin@cygwin.com
Wed Sep 23 19:23:00 GMT 2009

On Sep 23 19:57, Andy Koppe wrote:
> Perhaps this one's better discussed on -developers as well.
> 2009/9/23 Corinna Vinschen:
> > However, if we default to UTF-8 for a subset of languages anyway, it
> > gets even more interesting to ask, why not for all languages?
> Hmm, why indeed? As far as I know, there's no technical reason not to
> do it. POSIX only requires 7-bit ASCII anyway, and the DC?? scheme
> ensures that filenames are fully 8-bit clean even with invalid UTF-8.
> So for a completely new system, there'd be no question whatsoever:
> UTF-8 is the right way to go.

Uh, that's good to know.  Ever since you started this discussion I've
been struggling with the two choices.  Even though I advocate the
UTF-8 solution, I'm still unsure.
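
To make the "8-bit clean" point concrete, here is a minimal sketch in
Python.  It assumes that the DC?? scheme means mapping each undecodable
byte to a code point in the U+DCxx range.  Python's "surrogateescape"
error handler does this kind of mapping, so it serves as a stand-in
here; it is not Cygwin's implementation.  The helper names
decode_filename and encode_filename are made up for the example.

```python
def decode_filename(raw: bytes) -> str:
    # Bytes that are not valid UTF-8 become lone surrogates U+DC80..U+DCFF
    # instead of raising an error.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_filename(name: str) -> bytes:
    # The reverse step turns each escaped surrogate back into its
    # original byte, so the round trip is lossless.
    return name.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"            # Latin-1 encoded name, invalid as UTF-8
name = decode_filename(raw)     # 0xE9 is kept as U+DCE9
restored = encode_filename(name)
```

The point is only that a filename which is not valid UTF-8 survives the
conversion in both directions unchanged.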

> > I'm really wondering if we shouldn't simply default to UTF-8 as charset
> > throughout, in the application, the console, and for the filename
> > conversion.
> That would certainly make plenty of sense.
> I assume that default would apply both to "C" and the likes of "en" and "en_US"?

Erm... no.  At least I didn't intend that.  It would be the default for
the "C" locale only.  Using "en" or "en_US" would switch to the actual
default ANSI codepage.  I had the vague idea that this would be the
right thing for users who just don't grok the codepage numbers.

However, given what I wrote about chcp and using the OutputCP setting,
UTF-8 would no longer be the default for console output.  The console
would use the default OEM codepage (437 on US systems, for instance), or
whatever the user sets with chcp or the not-yet-existing setfont.
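
As a rough sketch of the selection rules described above, written in
Python for brevity.  The function name pick_charsets, its parameters,
and the "CP1252" ANSI default are assumptions for illustration only,
not existing Cygwin code.  Treating "POSIX" like "C" and honouring an
explicit ".charset" suffix are likewise assumptions that go beyond what
is stated above.

```python
def pick_charsets(locale_name, ansi_cp="CP1252", oem_cp="CP437", chcp=None):
    """Return (application_charset, console_charset) for a locale name."""
    # Drop a modifier such as "@euro", then split off an explicit charset.
    base = locale_name.split("@", 1)[0]
    lang, _, explicit = base.partition(".")

    if explicit:
        # Assumption: an explicit charset like "en_US.UTF-8" always wins.
        app = explicit
    elif lang in ("C", "POSIX"):
        # Proposed default: UTF-8 for the "C" locale only.
        app = "UTF-8"
    else:
        # "en", "en_US", ...: fall back to the system's ANSI codepage.
        app = ansi_cp

    # Console output follows chcp if the user set one, else the OEM codepage.
    console = chcp if chcp is not None else oem_cp
    return app, console
```

With these example defaults, pick_charsets("C") yields UTF-8 for the
application and CP437 for the console, while pick_charsets("en_US")
yields CP1252 and CP437.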


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
