Console codepage setting via chcp?

Andy Koppe andy.koppe@gmail.com
Sat Sep 26 16:43:00 GMT 2009


2009/9/26 Corinna Vinschen:
>> >> - System objects will always be translated using UTF-8. This includes
>> >> file names, user names, and initial environment variables (and
>> >> probably more I'm not aware of).
>> >[...]
>> The downside, of course, is that non-ASCII filenames created in a
>> non-UTF-8 locale won't show up correctly in Windows, and vice versa.
>> But that's the same on Linux if the global setting is UTF-8 while the
>> terminal is set to something else. And the stock answer to any
>> complaints will be: Use UTF-8!
>>
>> In any case, the DCxx scheme will ensure that things work correctly
>> within any particular locale.
>>
>> And I guess the ^N scheme can go (or be disabled)?
>
> Probably not.  I spent some more time thinking about the various
> scenarios (partly instead of sleeping) and it occurred to me that using
> UTF-8 exclusively is a nice dream.

So at least you enjoyed the few hours of sleep you did get then. ;)


> Still, what about your tar example given in
> http://cygwin.com/ml/cygwin-developers/2009-09/msg00043.html?

I suspect the interop between non-UTF8 Cygwin and native Windows is
more likely to draw complaints. In particular, rxvt users would be out
of luck in that respect, since UTF8 isn't going to be an option there.


> If we stick to UTF-8 exclusively, we *have* to create the convmv-like
> tool which allows "broken" filenames to be converted from the
> \016\377\x notation to the UTF-8 \c2\x or \c3\x notation.

What's the \016\377\x notation? \016 is ^N, but the \377 isn't UTF-8,
so is that an additional scheme?

The way I understand it, though, if filenames were always treated as
UTF-8 by the system calls, then ^N would never be needed: invalid
UTF-8 is encoded as U+DCxx when converting to UTF-16, while the
UTF-16-to-UTF-8 direction always produces valid UTF-8 (unless Windows
filenames contain invalid UTF16 in the first place ...).
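
To illustrate, here's a minimal sketch of that round trip in C. The
helper names are mine, not Cygwin's, and to keep it short it treats
every non-ASCII byte as invalid, whereas a real decoder would of
course accept valid multi-byte UTF-8 sequences first:

/* Sketch of the U+DCxx escape scheme: an invalid UTF-8 byte b is
 * mapped to the lone low surrogate 0xDC00 | b on the way to UTF-16
 * and mapped back byte-for-byte on the way out, so arbitrary byte
 * strings survive the round trip. */
#include <stdio.h>
#include <stdint.h>

static uint16_t escape_byte(uint8_t b)      /* invalid byte -> U+DCxx */
{
    return 0xDC00 | b;
}

static int unescape_unit(uint16_t u)        /* U+DCxx -> byte, else -1 */
{
    return (u & 0xFF00) == 0xDC00 ? (u & 0xFF) : -1;
}

int main(void)
{
    /* "bäh" in ISO-8859-1: the lone 0xE4 byte is not valid UTF-8. */
    const uint8_t name[3] = { 'b', 0xE4, 'h' };
    uint16_t wname[3];

    /* To UTF-16: ASCII passes through; this sketch escapes every
     * other byte (a real decoder would decode valid sequences). */
    for (int i = 0; i < 3; i++)
        wname[i] = name[i] < 0x80 ? name[i] : escape_byte(name[i]);
    for (int i = 0; i < 3; i++)
        printf("U+%04X ", wname[i]);        /* U+0062 U+DCE4 U+0068 */
    printf("\n");

    /* Back to bytes: the original name comes out unchanged. */
    for (int i = 0; i < 3; i++) {
        int b = unescape_unit(wname[i]);
        printf("%02X ", b >= 0 ? b : wname[i]);   /* 62 E4 68 */
    }
    printf("\n");
    return 0;
}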

Therefore, I think the standard 'convmv' should be able to do the job.
I've had a quick look at it: it's a Perl script, and it seems fairly
straightforward to use, for example:

./convmv -f ISO-8859-1 -t UTF-8 bäh
Starting a dry run without changes...
mv "./bäh"      "./bäh"

(The two names look identical here, but the first is the raw
ISO-8859-1 byte 0xE4 and the second the UTF-8 sequence 0xC3 0xA4.)

'LC_CTYPE=ISO-8859-1 tar ...' would still be nicer, though.
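
Either way, the per-name step underneath is just a charset
conversion; here's a rough iconv(3) equivalent of what convmv does to
each name before renaming (error handling omitted for brevity):

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    /* Convert one ISO-8859-1 filename to UTF-8; iconv() error
     * checking is left out to keep the sketch short. */
    char in[] = { 'b', (char)0xE4, 'h', '\0' };  /* "bäh", one byte per char */
    char out[16] = "";
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    printf("%s\n", out);    /* "bäh" as UTF-8: bytes C3 A4 in the middle */
    return 0;
}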


> What's the right thing to do?  I'm still unsure.  With your proposal,
> it's at least the user's choice, and if some interoperability issue
> occurs and the user complains, we can point to the FAQ: "Use UTF-8,
> dumbass!"

Yep.


> - System objects will always be *initially* translated using UTF-8. This
>  includes file names, user names, and initial environment variables.
> - By setting the locale environment variables you can switch the charset
>  used to translate filenames on a per-process basis.
>  This would only be a stop-gap measure, to allow old archives or
>  scripts to be re-used.  Those should be converted to UTF-8 ASAP.
>  Expect complaints.
> - The "C" locale's charset will be UTF-8.
> - There'll be language-neutral "C.<charset>" locales.
> - The user's ANSI codepage will remain the default charset for
>  "language_TERRITORY" locales.
> - The console charset will be set according to LC_ALL/LC_CTYPE/LANG
>  at the time the application starts.
> - setlocale() will (probably) have no effects beyond what's expected
>  on Linux.
>
> Please vote.
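
Just to check that I'm reading the charset selection right: the
lookup at startup would presumably be the usual LC_ALL > LC_CTYPE >
LANG precedence, something like this sketch (not actual Cygwin code;
real locale strings can also carry an "@modifier", which I'm
ignoring):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pick the charset the console would be set to at startup.  The
 * fallback strings only mirror the proposal quoted above. */
static const char *effective_charset(void)
{
    const char *vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (int i = 0; i < 3; i++) {
        const char *v = getenv(vars[i]);
        if (v && *v) {
            const char *dot = strchr(v, '.');
            if (dot)                          /* e.g. "de_DE.ISO-8859-15" */
                return dot + 1;
            if (!strcmp(v, "C") || !strcmp(v, "POSIX"))
                return "UTF-8";               /* "C" defaults to UTF-8 */
            return "(user's ANSI codepage)";  /* bare language_TERRITORY */
        }
    }
    return "UTF-8";                           /* unset: same as "C" */
}

int main(void)
{
    printf("console charset: %s\n", effective_charset());
    return 0;
}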

I vote for the proposal here, with added fence-sitting in the form of
a CYGWIN option called 'filename_charset' (or some such) taking
precedence over LC_ALL/LC_CTYPE/LANG.

With that, setting 'CYGWIN=fncset:UTF-8' would yield
http://cygwin.com/ml/cygwin-developers/2009-09/msg00050.html.
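
Parsing that out of $CYGWIN should be cheap, roughly like this; the
option name and everything else here is hypothetical, of course:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Look for a hypothetical 'fncset:<charset>' option in $CYGWIN;
 * options in that variable are space-separated.  Returns NULL if
 * the option is absent.  Purely illustrative, not Cygwin code. */
static const char *cygwin_fncset(void)
{
    static char buf[64];
    const char *env = getenv("CYGWIN");
    const char *opt = env ? strstr(env, "fncset:") : NULL;
    if (!opt)
        return NULL;
    opt += strlen("fncset:");
    size_t n = strcspn(opt, " ");       /* charset ends at next space */
    if (n >= sizeof buf)
        n = sizeof buf - 1;
    memcpy(buf, opt, n);
    buf[n] = '\0';
    return buf;
}

int main(void)
{
    const char *cs = cygwin_fncset();   /* e.g. CYGWIN="tty fncset:UTF-8" */
    printf("filename charset override: %s\n", cs ? cs : "(none)");
    return 0;
}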

Andy


