This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: cygwin + GetConsoleOutputCP


On 20 March 2011 19:13, Charles Wilson wrote:
> Question about porting the upstream "dos2unix" utilities. These
> implementations provide capabilities to convert text files from a
> certain limited set of INPUT encodings (most are DOS codepages):
>
> =====================================================
> CONVERSION MODES
>        Conversion modes ascii, 7bit, and iso are
>        similar to those of dos2unix/unix2dos under
>        SunOS/Solaris.
>
>        ascii
>            In mode "ascii" only line breaks are
>            converted. This is the default conversion
>            mode.
>
>            Although the name of this mode is ASCII,
>            which is a 7 bit standard, the actual mode
>            is 8 bit. Use always this mode when
>            converting Unicode UTF-8 files.
>
>        7bit
>            In this mode all 8 bit non-ASCII characters
>            (with values from 128 to 255) are converted
>            to a 7 bit space.
>
>        iso Characters are converted between a DOS
>            character set (code page) and ISO character
>            set ISO-8859-1 (Latin-1) on Unix. DOS
>            characters without ISO-8859-1 equivalent,
>            for which conversion is not possible, are
>            converted to a dot. The same counts for
>            ISO-8859-1 characters without DOS
>            counterpart.
>
>            When only option "-iso" is used dos2unix
>            will try to determine the active code page.
>            When this is not possible dos2unix will use
>            default code page CP437, which is mainly
>            used in the USA. To force a specific code
>            page use options "-437" (US), "-850"
>            (Western European), "-860" (Portuguese),
>            "-863" (French Canadian), or "-865"
>            (Nordic). Windows code page CP1252
>            (Western European) is also supported with
>            option "-1252". For other code pages use
>            dos2unix in combination with iconv(1).
>            Iconv can convert between a long list of
>            character encodings.
> =====================================================
>
> So basically if you specify -iso (or --conv iso) without any of the
> "input encoding specification" options like -437 etc, then dos2unix will
> attempt to detect the *console* encoding. If it succeeds,
> then it will "convert" character codes from that encoding to their
> equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
> with an ascii dot].
>
> Note that this autodetect, if it works, assumes that the console's CP is
> the input file's CP. Fair enough -- and it's an overridable default
> anyway. However, I wonder if, in cygwin-1.7, we actually can/should use
> the "console codepage" in ANY way. Here's the code:
>
> querycp.c:
> #elif defined (WIN32) || defined(__CYGWIN__)
>
> /* Erwin Waterlander */
>
> #include <windows.h>
> unsigned short query_con_codepage(void) {
>   return((unsigned short)GetConsoleOutputCP());
> }
> #else
>
> Or if instead, on cygwin, we should use some other mechanism (locale
> settings?) to determine the correct default "input" codepage.

I think defaulting to the console codepage makes sense for the DOS
side of the conversion. Having said that, Windows files that aren't
"Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
codepage, e.g. CP1252, so it would make more sense to default to that.
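
To make the ANSI-codepage default concrete, here is a minimal sketch of what the query could look like if it asked for the ANSI codepage via GetACP() rather than the console one. The function name query_ansi_codepage() and the non-Windows fallback value are my own assumptions for illustration; they are not part of dos2unix.

```c
#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>
#endif

/* Hypothetical counterpart to query_con_codepage(): report the system's
 * ANSI codepage (e.g. 1252) instead of the console output codepage. */
unsigned short query_ansi_codepage(void)
{
#if defined(_WIN32) || defined(__CYGWIN__)
    /* GetACP() returns the current Windows ANSI codepage identifier. */
    return (unsigned short)GetACP();
#else
    /* Assumption for illustration only: pretend CP1252 elsewhere. */
    return 1252;
#endif
}
```

The explicit -437, -850 etc. options would still override whatever this returns.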

However, the real problem with this feature is that the Unix side of
the conversion is fixed to ISO-8859-1, which makes it near-useless
when Cygwin defaults to UTF-8. And it's no use for non-Western
European languages in any case.
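
To illustrate the mismatch: a Latin-1 byte such as 0xE9 (e-acute) is not a valid UTF-8 sequence on its own, so "-iso" output would need a second pass like the one sketched below before it displays correctly in a UTF-8 locale. The helper latin1_to_utf8() is hypothetical, written only to show the byte-level difference.

```c
#include <stddef.h>

/* Re-encode ISO-8859-1 bytes as UTF-8.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                      char *out, size_t outsize)
{
    size_t o = 0;

    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];
        size_t need = (c < 0x80) ? 1 : 2;

        /* Always leave room for the terminating NUL. */
        if (o + need >= outsize)
            return (size_t)-1;

        if (c < 0x80) {
            out[o++] = (char)c;                    /* ASCII: unchanged */
        } else {
            out[o++] = (char)(0xC0 | (c >> 6));    /* lead byte */
            out[o++] = (char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    if (o >= outsize)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

With this, the single Latin-1 byte 0xE9 becomes the two UTF-8 bytes 0xC3 0xA9.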

A worthwhile conversion feature would use
MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to
the charset specified by the LC_CTYPE locale category on the Unix
side.
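
A rough sketch of that two-step conversion, assuming a Cygwin build where the wchar_t buffer filled by MultiByteToWideChar() can be handed straight to wcstombs(). The function names dos_to_wide(), unix_charset() and wide_to_unix() are made up for this example, and error handling is minimal.

```c
#include <langinfo.h>
#include <stdlib.h>
#include <wchar.h>

#ifdef __CYGWIN__
#include <windows.h>

/* DOS side: decode a NUL-terminated string in the system ANSI codepage
 * (CP_ACP) into wide characters. Returns the number of wide characters
 * written including the terminator, or 0 on failure. */
int dos_to_wide(const char *in, wchar_t *out, int outlen)
{
    return MultiByteToWideChar(CP_ACP, 0, in, -1, out, outlen);
}
#endif

/* Name of the charset selected by the current LC_CTYPE locale. */
const char *unix_charset(void)
{
    return nl_langinfo(CODESET);
}

/* Unix side: encode wide characters in the LC_CTYPE charset.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if a character cannot be represented or the buffer
 * has no room for the terminator. */
size_t wide_to_unix(const wchar_t *in, char *out, size_t outsize)
{
    size_t n = wcstombs(out, in, outsize);

    if (n == (size_t)-1)
        return n;            /* unrepresentable character */
    if (n == outsize)
        return (size_t)-1;   /* output was not NUL-terminated */
    return n;
}
```

The caller would need setlocale(LC_CTYPE, "") once at startup so that the environment's locale, rather than the default "C" locale, decides the output charset.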

Andy


