This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Encoding of German 'umlauts' - please explain


Ronald Fischer wrote:
> Maybe someone could enlighten me about the following:
> ...
> That means, the German letter ü has encoding 0xFC. If I do the same on CMD shell
> (the 'od' used here comes from the Gnu Utilities for Windows), I see:
> ...
> That is, ü is encoded as 0x81. Why is this different?

> I am aware that, for historic reason, different encodings exist (the old
> DOS encoding, Windows ANSI encoding etc.).
So you answered your question yourself :)
> I wouldn't have expected those
> differences, however, when comparing bash.exe vs. cmd.exe.

The encoding is applied by the terminal, not by the application. To bash, 
the letter ü is just a sequence of one or two bytes. It is the terminal that 
decides which bytes the application receives when you type ü, and what is 
displayed when your program outputs those bytes. (That is the traditional 
picture, at least; in the age of locales things can sometimes get more 
complicated :( )
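
To make the "one or two bytes" point concrete, here is a small Python sketch 
that encodes ü in the encodings mentioned in this thread, plus UTF-8 for the 
two-byte case. The helper name encode_in is made up for illustration; it only 
wraps Python's standard str.encode and does not query any real terminal.

```python
UMLAUT_U = "\u00fc"  # the letter ü, written as an escape to keep the source ASCII

# Encodings named in this thread, plus UTF-8 (added here for the two-byte case).
ENCODINGS = ["iso-8859-1", "iso-8859-15", "cp1252", "cp437", "cp850", "utf-8"]


def encode_in(char, encoding):
    """Return the bytes a terminal using `encoding` would send for `char`.

    Hypothetical helper: it stands in for the terminal's keyboard mapping.
    """
    return char.encode(encoding)


for name in ENCODINGS:
    # Prints the encoding name and the hex value of the resulting byte(s).
    print(f"{name:12} {encode_in(UMLAUT_U, name).hex()}")
```

The first three encodings give fc, the two DOS codepages give 81, and UTF-8 
gives c3bc, which matches the two 'od' results quoted above.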

Having said this, I also need to adjust the following response:

Matthias Andree wrote:
> Because the code pages differ. 0xFC is ISO-8859-1 ("Latin 1") or -15 ("Latin 9")
> or CP1252/Windows-1252 (Latin 1 Extended; the latter allocates 0x80...0x9f
> differently than ISO-8859-1) and CMD uses CP437 or CP850.

This is not really correct; like bash, CMD does not use a codepage itself.
If you start CMD from Windows, it implicitly runs inside a Windows console, 
which uses CP437 (American), CP850 (Western European) or whatever other 
default your system is configured with.

However, you could also run CMD from a cygwin bash. In that case, to maximise 
the confusion, there are two different situations:
* In mintty, start CMD from bash: CMD will see the same codepage as bash, 
  since it is the one configured for mintty. So echo ü would produce 0xFC 
  even in CMD (assuming mintty is set to one of the codepages which map 
  ü to 0xFC).
* In the cygwin console, start CMD from bash: the cygwin console is a hybrid, 
  because its encoding is emulated by the cygwin dll inside a Windows console. 
  Unlike with all other terminals, the effective "codepage" therefore depends 
  on the application: a cygwin application uses the encoding configured for 
  the cygwin session, while any non-cygwin application uses the native 
  Windows console codepage. So you can echo ü from bash, start CMD from 
  there, echo ü again, and get different codes for the same key!
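
The cygwin console case can be imitated in a few lines of Python. This is 
only a simulation under assumed settings: ISO-8859-1 for the cygwin session 
and CP850 for the native Windows console are example values, and the helper 
names key_to_bytes and bytes_to_screen are invented for illustration. Nothing 
here talks to a real console.

```python
UMLAUT_U = "\u00fc"  # the letter ü

# Assumed example settings, not values read from any real system.
CYGWIN_SESSION_ENCODING = "iso-8859-1"
WINDOWS_CONSOLE_CODEPAGE = "cp850"


def key_to_bytes(char, encoding):
    """Bytes an application receives when `char` is typed under `encoding`."""
    return char.encode(encoding)


def bytes_to_screen(data, encoding):
    """Text shown when `data` is written to a terminal using `encoding`."""
    return data.decode(encoding, errors="replace")


# The same key, seen by a cygwin application and by a non-cygwin one.
from_bash = key_to_bytes(UMLAUT_U, CYGWIN_SESSION_ENCODING)
from_cmd = key_to_bytes(UMLAUT_U, WINDOWS_CONSOLE_CODEPAGE)
print(from_bash.hex(), from_cmd.hex())  # prints: fc 81

# Bytes display as ü only under the encoding that produced them.
matched = bytes_to_screen(from_cmd, WINDOWS_CONSOLE_CODEPAGE) == UMLAUT_U
mismatched = bytes_to_screen(from_bash, WINDOWS_CONSOLE_CODEPAGE) == UMLAUT_U
print(matched, mismatched)  # prints: True False
```

So the byte produced on the bash side shows up as some other character if it 
is later interpreted with the native console codepage, and vice versa.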

Kind regards,
Thomas


