Trouble with output character sets from Win32 applications running under mksh

Wed Aug 5 02:10:45 GMT 2020

Am 04.08.2020 um 23:19 schrieb Michael Shay via Cygwin:
> Michael
The content of your mail responses is unrecognizable due to utterly
broken formatting.
[This is not top-posting as there's nothing to respond to]
>
>
>
> From:   "Brian Inglis" <Brian.Inglis@SystematicSw.ab.ca>
> To:     cygwin@cygwin.com
> Date:   08/04/2020 08:32 AM
> Subject:        Re: Trouble with output character sets from Win32
> applications running under mksh
> Sent by:        "Cygwin" <cygwin-bounces@cygwin.com>
>
>
>
> On 2020-08-03 16:05, Michael Shay via Cygwin wrote:
>> On 2020-08-03 11:42, Andrey Repin wrote:
>>>>> Doesn't help. I tried 65001 (UTF-8):
>>>> Because you're confusing things.
>>>> chcp has nothing to do with LANG or LC_*.
>>>> Et vice versa.
>>>>
>>> chcp sets console code page for native console applications.
>>>> Only for those supporting it. Many do not.
>>>> LANG sets output parameters for Cygwin applications (and other programs
>>>> that look for it, but these are few).
>>> You cut the significant statement at the top of the OP:
>>>>> I'm having a problem with Cygwin 3.1.4, changing the character set on
>>>>> the fly. It seems to work with Cygwin applications, but not with Win32
>>>>> applications.
>>> He has problems with invalid characters only running win32 console
>>> applications: I changed the subject to hopefully better reflect the issue.
>>> I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to
>>> use the Windows codepage conversion routines.
>>>
>>> You can only change input character sets on the fly; output character sets
>>> will depend on mintty support of xterm-compatible character set support
>>> and switching escape sequences; if you set up UTF16LE console output,
>>> Windows and mintty should handle it.
>>>
>>> Perhaps a better description of your environment, build tools, what you
>>> are trying to do, what you expect as output, and what you are getting as
>>> output, could help us better understand and help with the issue you see.
>
>> The script I sent changes the locale information i.e. LANG and LC_ALL are
>> set to en_US.CP1252. i.e.
>>
>> export LANG="en_US.CP1252"
>> export LC_ALL=en_US.CP1252
> FYI the normal sequence and order to check is LANG, LC_CTYPE, LC_ALL, where the
> last var set wins, or the reverse where the first var set wins; the default
> locale may be POSIX C.ASCII or the effective Windows locale, depending on your
> startup.
>>> Thanks, that's good to know.
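The lookup order discussed above can be sketched in Python. This is only an illustration of the usual POSIX precedence (a non-empty LC_ALL overrides LC_CTYPE, which overrides LANG), not Cygwin's implementation. The helper names effective_ctype_locale and charset_of are hypothetical, and the "C" fallback is an assumption, since the real default depends on how the process was started.

```python
def effective_ctype_locale(env):
    """Return the locale string that governs character handling.

    POSIX precedence: a non-empty LC_ALL wins, then LC_CTYPE, then LANG.
    Falls back to "C" when none is set (an assumption for this sketch).
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def charset_of(locale_name):
    """Extract the charset from 'lang_TERRITORY.charset@modifier', or None."""
    if "." not in locale_name:
        return None
    charset = locale_name.split(".", 1)[1]
    return charset.split("@", 1)[0]


# LC_ALL takes priority over LANG, so the effective charset is CP1252.
env = {"LANG": "en_US.UTF-8", "LC_ALL": "en_US.CP1252"}
print(charset_of(effective_ctype_locale(env)))
```

Passing os.environ instead of the literal dictionary would report the settings of the current shell.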
>> Then, it runs a simple Win32 program that takes a single input argument, ZÇ,
>> the second character being C-cedilla, an 8-bit character, hex value 0xc7.
>> The Win32 program transcodes the input Unicode argument using the Cygwin
>> character set to determine the codepage, 1252.
> Do you mean using the environment variables to determine the codepage?
>>> Yes. Our code does try to fetch the character set information from the
>>> environment.
>
> FYI the default character set if none is specified is the Unix equivalent of the
> default Windows "ANSI"/OEM code page, in English or many European locales that
> will be ISO-8859-1.
>
> You may have to use cygpath -C OEM chars... or cygpath -C ANSI chars... to
> convert a string to the required character set for console or GUI programs.
>>> Our production code uses the console to display error information in the
>>> appropriate character set, but our command-line utilities expect to be
>>> able to take input strings encoded in the character set in use, which
>>> may be an 8-bit SBCS like ISO-8859-1 or Windows 1252, or an MBCS like UTF-8
>>> or e.g. Windows 932. Using 'cygpath' isn't an option.
> Please specify what you mean by "Unicode" in each context; that term means a
> standard for representing scripts in many writing systems with a large character
> glyph repertoire and a number of encodings, representations, and handling rules:
> in each use case, do you mean a char/wchar representation, and/or an encoding
> UTF16LE or UTF-8?
> Similarly when MS uses "ANSI" they may mean an SBCS OEM code page.
>
>>> Unicode == UTF-16 in all cases. This is the wide-character set used by Microsoft
>>> as far as I can tell in the wide-char version of their Win32 API functions e.g.
>>> CreateProcessW() vs. CreateProcessA().
> To check what is available and what is in effect in Cygwin, try e.g.:
>
> $ for o in system user no-unicode input format; do echo `locale --$o` $o; done
> en_US system
> en_GB user
> en_CA no-unicode
> en_CA input
> en_CA format
> $ locale
>
> on both Cygwin versions.
>
>>> 1.7.28 output
>>> $ for o in system user no-unicode input format; do echo `locale --$o` $o; done
>>> en_US system
>>> en_US user
>>> en_US no-unicode
>>> locale: unknown option -- input
>>> Try `locale --help' for more information.
>>> input
>>> en_US format
>>> 3.1.4 output
>>> $ for o in system user no-unicode input format; do echo `locale --$o` $o; done
>>> en_US system
>>> en_US user
>>> en_US no-unicode
>>> en_US input
>>> en_US format
> FYI see:
>
>                   https://cygwin.com/cygwin-ug-net/setup-locale.html
>
>> It then prints the transcoded characters to stdout, and the result should be
>> ZÇ, identical to the input argument.
>> This works fine using Cygwin 1.7.28.
> Which Windows version are you running Cygwin 1.7.28 on?
> Please show output from cmd /c ver.
>>> $cmd /c ver
>>> Microsoft Windows [Version 10.0.18363.959]
> That Cygwin version 1.7.28 is from 2014-02 and has been unsupported for years.
> That version may not have completely supported international character sets and
> may just assume that everything is in ISO-8859-1/Latin-1, which is similar to
> CP1252, so that may work, or your system default OEM codepage e.g. 437 or 850,
> and pass it along.
>>> Our code supports dozens of character sets, for international sales, and that
>>> includes many SBCS, and MBCS, as well as UTF-8. I can use any of the codepages
>>> supported by Windows and Cygwin and 1.7.28 handles them just fine.
>> Cygwin 3.1.4 is launching the Win32 application, and is responsible for
>> transcoding the arguments passed to it by mksh, in this case CP1252
>> characters ZÇ, into Unicode.
> Do you mean you believe Cygwin should recode argument strings, and what do you
> mean by Unicode in this context?
>>> When I launch a Win32 application that is using a character set other than 7-bit ASCII
>>> in a Cygwin shell, the shell passes the command and arguments in the input character set.
>>> So, for example, using CP 1252 as the character set, and passing 8-bit single-byte characters
>>> like e.g. ZÇ, the shell doesn't change the characters, it passes them through to Cygwin
>>> to launch the process. In my test, using gdb ($ gdb --version: GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1),
>>> i.e. "gdb ksh.exe", then "(gdb) start -c 'cygtest.exe ZÇ'", I can step into spawnve() in spawn.cc.
>>> At this point, examining the input arguments confirms that the input argument 'ZÇ' is still
>>> in the correct encoding i.e. 0x5a 0xc7. The real work of launching the process is done in
>>> child_info_spawn::worker(). Eventually, the code invokes CreateProcessW(). The executable
>>> path is already in UTF-16 format, so the only transcoding left to be done is the
>>> argument string. This is done in the linebuf::wcs() function (winf.h). This small method
>>> invokes sys_mbstowcs(), in strfuncs.cc. So yes, I do believe Cygwin should transcode
>>> the argument strings from whatever their current character set is to UTF-16. This is
>>> what the ancient 1.7.28 did.
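The transcoding expected here can be written out for the test argument. This is a minimal sketch using Python's built-in "cp1252" and "utf-16-le" codecs as stand-ins for the conversion that should happen before CreateProcessW(). It shows the expected byte values only and makes no claim about how Cygwin performs the conversion.

```python
# The argument as the shell passes it: 'Z' followed by C-cedilla in CP1252.
raw = bytes([0x5A, 0xC7])

# Step 1: interpret the bytes in the character set named by the locale.
text = raw.decode("cp1252")

# Step 2: encode as UTF-16LE, the wide-character form used by the
# wide-char Win32 API functions.
wide = text.encode("utf-16-le")

# Each code unit is two bytes, low byte first.
print(wide.hex(" "))
```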
>> That means Cygwin has to use the mb-to-uc function for transcoding codepage
>> 1252 to Unicode.
> I am unsure if Cygwin does any recoding internally except for input typed on the
> terminal console interface.
> CP1252 is an SBCS not an MBCS so MB functions are not required.
> What do you expect when you use Unicode here?
>>> If Cygwin no longer does this internal transcoding, that's a significant change
>>> from previous versions. I only know 1.7.28 did the transcoding correctly, and it's
>>> certainly possible that at some point between that version and 3.1.4, the behavior
>>> changed. Yes, CP1252 is a SBCS, but it supports 8-bit characters, unlike 7-bit ASCII
>>> so requires a different mapping from UTF-16. Using either CP 1252 or 7-bit ASCII
>>> though would require a different transcoding routine than the UTF-8 -> UTF-16 that
>>> gets used.
>> It does not. It uses the UTF-8 to Unicode function (I've seen this using
>> gdb). That function flags the Ç as an invalid UTF-8 sequence, not
>> surprisingly since it's not a UTF-8 character.
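Why a UTF-8 decoder rejects this argument can be shown with the same two bytes. In UTF-8, 0xC7 is a lead byte that announces a two-byte sequence and must be followed by a continuation byte in the range 0x80-0xBF. In the test argument it is the last byte of the string. The sketch below uses Python's strict "utf-8" codec as a stand-in for the UTF-8 conversion function. The helper try_decode is hypothetical.

```python
raw = bytes([0x5A, 0xC7])  # 'Z', then C-cedilla as encoded in CP1252


def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are invalid for codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")     # None: 0xC7 has no continuation byte
as_cp1252 = try_decode(raw, "cp1252")  # 'Z' followed by U+00C7

# For comparison, the UTF-8 encoding of U+00C7 is the two bytes 0xC3 0x87.
utf8_form = "\u00c7".encode("utf-8")
print(as_utf8, utf8_form.hex(" "))
```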
> What Windows, Cygwin, gdb versions are you seeing this on and what is the name
> of the function you are seeing?
>>> Windows - Microsoft Windows [Version 10.0.18363.959]
>>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin
>>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1
>>> As described above, spawnve() calls child_info_spawn::worker() to do the real work of
>>> launching a process, a Win32 or a Cygwin process. The conversion of the process arguments
>>> into UTF-16 is done through linebuf::wcs(), into sys_mbstowcs(). In the latter function
>>> the only work done is to check if the pointer to the MBCS-to-WCS function is '__ascii_mbtowc' and
>>> if so, to instead set it to '__utf8_mbtowc'. It then invokes sys_cp_mbstowcs() to do the
>>> work.
>>> However, the problem, if there is one, must be occurring very early on. dll_crt0_1(),
>>> which according to the comments "Take over from libc's crt0.o and start the application.",
>>> fetches the locale from the environment:
>>>   /* Set internal locale to the environment settings. */
>>>       initial_setlocale ();
>>> I suspect that it's here where either there's a problem, or Cygwin behavior has changed from
>>> 1.7.28. I haven't tried to use gdb to step into that initialization code.
>
>> No matter what character set I use in 'export LANG...' and 'export
>> LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding function in
>> sys
> ... what should be there and what is the name of the function used?
>
>> 1.7.28 Uses the correct function.
> What is the name of that function?
>>> The function is sys_cp_mbstowcs(), which is invoked by sys_mbstowcs() as it is in 3.1.4.
>>> But the older version doesn't get the pointer to the mb-to-wc transcoding function passed
>>> to it; it fetches the pointer and the character set from cygheap->locale and passes those
>>> to sys_cp_mbstowcs().
>> I'm not using mintty, I'm using mksh, a requirement since our software uses
>> lots of shell scripts, and for legacy support, that means using a Korn shell.
>
> So that means that the mksh is running on the Windows console, and you are not
> running mintty.
>>> Correct.
>> I could understand it if 1.7.28 didn't do the proper transcoding, but it
>> does.
> You may just be seeing Cygwin 1.7.28 passing the character codes along verbatim.
>>> I don't think so. child_info_spawn::worker() has to translate the CP1252 characters
>>> into UTF-16. And it does, as I've seen using Windbg on the Windows side of this.
>
>> I used:
>>
>>          gdb mksh
>>
>> to load mksh into the debugger, then started it with
>>
>>          start -c 'cygtest.exe ZÇ'
> Windows, Cygwin, and gdb versions?
>>> Windows - Microsoft Windows [Version 10.0.18363.959]
>>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin
>>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1
>> That allowed me to step into child_info_spawn::worker() and stop at the
>> call to CreateProcess(), where the command line (cygtest.exe) and argument
>> (ZÇ) are translated into Unicode.
> In this case you mean into a UTF16LE string?
>>> Yes.
>> This is the code to which I'm referring, in strfuncs.cc, which is supposed
>> to translate the command line and arguments from CP 1252 into Unicode.
>>
>>    size_t __reg3
>>    sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
>>    {
>>      mbtowc_p f_mbtowc = __MBTOWC;
>>      if (f_mbtowc == __ascii_mbtowc)
>>        {
>>          f_mbtowc = __utf8_mbtowc;       <<<< THE CODE CHANGES THE '__ascii_mbtowc' TO '__utf8_mbtowc' EVERY TIME, REGARDLESS OF THE CODEPAGE.
>>        }
>>      return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
>>    }
>>
>> So 'f_mbtowc' is set to __ascii_mbtowc, the default. You said:
> UTF-8 contains ASCII as the first 128 code points, so that is valid, unless the
> "ASCII" used isn't really, and has character codes > 127!
>>> CP1252 supports 8-bit single-byte characters such as C-cedilla. The UTF-8
>>> representation is a 2-byte sequence (0xc3 0x87) that is not correct if the character
>>> set in use is CP1252.
>> You can only change input character sets on the fly;
>>
>> The input character set to Cygwin should have been changed to CP 1252, as
>> it was in 1.7.28. At least, that's what I would expect to happen. If it
>> does not, or if mintty is required, then that's a regression from 1.7.28.
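The check in the sys_mbstowcs() excerpt quoted above can be modelled outside Cygwin. The sketch below is a Python model of that one condition, not the Cygwin code. The codec names stand in for the converter function pointers, pick_decoder and to_utf16le are hypothetical names, and errors="replace" is an assumption used only to keep the example runnable. Read this way, the substitution only happens when the active converter is the ASCII one. That would be consistent with the process locale still being the default when the arguments are converted, but the sketch does not establish that.

```python
def pick_decoder(locale_codec):
    """Model of the quoted check: only an ASCII converter is replaced by
    UTF-8; any other converter is used unchanged."""
    return "utf-8" if locale_codec == "ascii" else locale_codec


def to_utf16le(raw, locale_codec):
    """Decode raw with the selected converter and re-encode as UTF-16LE."""
    codec = pick_decoder(locale_codec)
    # errors="replace" stands in for whatever the real code does with
    # invalid input; it substitutes U+FFFD for undecodable bytes.
    return raw.decode(codec, errors="replace").encode("utf-16-le")


arg = bytes([0x5A, 0xC7])  # 'Z' plus C-cedilla in CP1252

# Locale converter already CP1252: the argument survives intact.
print(to_utf16le(arg, "cp1252").hex(" "))

# Locale converter still ASCII: UTF-8 is substituted and 0xC7 is invalid.
print(to_utf16le(arg, "ascii").hex(" "))
```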
>
> As Cygwin packages are rolling releases, old releases are unsupported, and you
> must upgrade to the latest release, reproduce the problem with a simple test
> case, and other examples if you wish, and post that with a copy of the
> output from:
>
>                   $ cygcheck -hrsv > cygcheck.out
>
> as a plain text attachment to your post.
>>> I understand. We do not ship a stock Cygwin installation. I happen to have an
>>> unmodified 3.1.4 on a development machine and was able to reproduce the problem
>>> with it. But we cannot take frequent Cygwin updates, as it takes far too long
>>> to find and fix problems between Cygwin and our code. The version has to be
>>> stable for months before we can use it.
>>> Thanks for the helpful suggestions and information. I'll send updates, in case
>>> anyone else sees a similar problem.
>>> Michael Shay