Console codepage setting via chcp?

Andy Koppe andy.koppe@gmail.com
Fri Sep 25 05:40:00 GMT 2009


2009/9/24 Corinna Vinschen:
> Wouldn't this consequentially mean we should stick to UTF-8 for
> filenames entirely in the long run?

Doing that would remove the ability for users to select a different
charset and have Windows filenames showing up correctly in Cygwin and
vice versa. Therefore I think the best way to get to UTF-8 throughout
is to persuade users to set their environment accordingly. Making
UTF-8 the default in the C locale would do most of the persuading. ;)


> Because, *if* tar uses setlocale(),
> the files would have potentially surprisingly ugly names when unpacked
> on a Unix machine.

tar calling setlocale wouldn't matter, because Cygwin would translate
the filenames according to the environment setting anyway. If you mean
that tar might be doing its own multibyte conversion on filenames:
that would be wrong, and I'd be very surprised if it did that.

Apart from that, whether the filenames show up correctly on the Unix
machine depends on the same charset being used on that machine. But
that's the same when moving stuff between any Unix machines, because
the TAR format doesn't specify a charset.

Btw, Cygwin would have a nice advantage here: if you knew that a
tarball was created on a system with a different charset, you could
still get the correct filenames by invoking tar with the appropriate
LC_CTYPE setting, e.g.:

LC_CTYPE=C.KOI8-R tar xf bla.tgz

On Linux, this would require running 'convmv' after untarring. (That
approach would still work on Cygwin too.)
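
To make the byte-level picture concrete, here is a minimal Python
sketch of what a convmv-style fix-up does to a single filename. It is
only an illustration under stated assumptions: the helper name
reencode_name and the sample filename are hypothetical, and it is not
how tar, convmv, or Cygwin are actually implemented.

```python
def reencode_name(raw: bytes, src: str = "koi8_r", dst: str = "utf-8") -> bytes:
    """Reinterpret filename bytes stored in charset `src` and re-encode as `dst`.

    Hypothetical helper for illustration only.
    """
    return raw.decode(src).encode(dst)


# A Russian filename as a KOI8-R system would have stored it in a tarball.
koi8_name = "тест".encode("koi8_r")

# The same name as a UTF-8 system expects to see it on disk.
utf8_name = reencode_name(koi8_name)

print(koi8_name, utf8_name)
```

The tarball only ever carries the first byte sequence; whether it is
displayed correctly depends on the charset the unpacking side assumes.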


> Note that this affects all strings used in Cygwin internally, not only
> filenames.  User and group names, environment strings, ...

I hadn't thought of that, but yep, it's the logical conclusion. I
think the charset specified via setlocale(LC_CTYPE,...) should only
affect what's specified in POSIX, i.e. the ctype stuff itself,
multibyte conversion functions, wchar I/O, and anything I'm
forgetting.


> If an application switches to another locale, all the names internally
> stored are not switched as well.  So they are potentially wrong after
> a setlocale.

The important thing is that file names, user names, and env variables
are represented by the same byte sequences throughout the life of a
program. Determining their translation at program startup ensures
that.

Now, if an application calls setlocale with a charset other than
what's set in the environment and it interprets the byte sequences
according to that charset, then that's its own responsibility. This
would go awry on Linux too, e.g. if a system is set up with
LANG=en_US.ISO-8859-1, apps shouldn't try to display filenames as UTF-8.
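
A small Python sketch of that failure mode, assuming a filename that
was written under an ISO-8859-1 locale and an application that insists
on reading it as UTF-8 (the sample name is made up for illustration):

```python
name = "café"

# Bytes as stored by a system running with an ISO-8859-1 locale.
latin1_bytes = name.encode("iso-8859-1")

try:
    # An application wrongly assuming UTF-8 for these bytes.
    shown = latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    # 0xE9 is not a complete UTF-8 sequence, so the best the application
    # can do is show a replacement character instead of the "é".
    shown = latin1_bytes.decode("utf-8", errors="replace")

print(repr(shown))
```

The bytes themselves never change; only the interpretation applied to
them does, which is why fixing the translation at program startup keeps
things consistent.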


>> > - If you want to switch the console to another charset you can't do that
>> >  on the fly in Cygwin.
>>
>> You can't in xterm or rxvt either, at least not without the likes of luit.
>
> My xterms have a UTF-8 entry in the Ctrl-<right mouse key> menu which
> can be switched on and off...  Unless xterm has been already started in
> UTF-8 mode, in which case the entry is disabled.

Ah, I hadn't thought of that one, because it doesn't let you choose
just any charset.

The switch represents the UTF-8 mode that can be enabled and disabled
with the control sequences "\e%G" and "\e%@". (Btw, this could be
added to the list of possible Cygwin console enhancements.)
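
For reference, a minimal Python sketch of emitting those two sequences.
The function name set_utf8_mode is hypothetical, and whether a given
terminal honours the sequences is an assumption; as noted above, xterm
disables the switch when it was started in UTF-8 mode.

```python
import sys

ENTER_UTF8 = "\x1b%G"  # ESC % G: switch the terminal to UTF-8 mode
LEAVE_UTF8 = "\x1b%@"  # ESC % @: switch back to the default charset


def set_utf8_mode(enabled: bool, out=sys.stdout) -> None:
    """Write the UTF-8 mode control sequence to `out` (hypothetical helper)."""
    out.write(ENTER_UTF8 if enabled else LEAVE_UTF8)
    out.flush()


# Example: set_utf8_mode(True) before printing UTF-8 text,
# then set_utf8_mode(False) to restore the previous mode.
```

Note that the application running inside the terminal is not told about
the switch, so its own idea of the charset stays whatever the
environment said at startup.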


>> You can in mintty, and also in gnome-terminal and KDE Konsole, but to
>> be honest it's a rather questionable feature, because applications
>> don't get to know about such an on-the-fly character set change, hence
>> things won't work correctly.
>
> Yeah, but it's the user's choice.  I could want an ISO-8859-1 terminal,
> regardless of what the application prints.

Fair enough, but I don't think it's an important use case.

Andy



More information about the Cygwin-developers mailing list