Trouble with character sets

Michael Shay MShay@ABINITIO.COM
Mon Aug 3 15:36:14 GMT 2020


I'm having a problem with Cygwin 3.1.4, changing the character set on the 
fly. It seems to work with Cygwin applications, but not with Win32 
applications.

I have a Korn shell script:
#!/bin/ksh

OLD_LANG="$LANG"
OLD_LC_ALL="$LC_ALL"

echo "locale on entry"
locale
echo ""

export LANG="en_US.CP1252"
export LC_ALL=en_US.CP1252

echo "locale changed to"
locale
echo ""

# With no arguments, run the Win32 program. With one argument, run
# '/bin/echo'. With two arguments, run both.

case $# in
   0 )  echo "Running WIN32 pgm"
        ksh -c 'cygtest.exe ZÇ'
        ;;
   1 )  echo "Running Cygwin 'echo'"
        ksh -c '/bin/echo ZÇ'
        ;;
   2 )  echo "Running WIN32 pgm"
        ksh -c 'cygtest.exe ZÇ'
        echo ""
        echo "Running Cygwin 'echo'"
        ksh -c '/bin/echo ZÇ'
        ;;
   * ) ;;
esac

LC_ALL="$OLD_LC_ALL"
LANG="$OLD_LANG"

and a Win32 application (attached file cygtest.cpp)

I used gdb to see what was happening in child_info_spawn::worker(), when a 
Win32 program is started using:

          rc = CreateProcessW (runpath,   /* image name w/ full path */
                   cmd.wcs (wcmd),  /* what was passed to exec */
                   sa,    /* process security attrs */
                   sa,    /* thread security attrs */
                   TRUE,    /* inherit handles */
                   c_flags,
                   envblock,  /* environment */
                   NULL,
                   &si,
                   &pi);
Specifically, 'cmd.wcs(wcmd)' invokes:

  wchar_t *wcs (wchar_t *wbuf, size_t n)
  {
    if (n == 1)
      wbuf[0] = L'\0';
    else
      sys_mbstowcs (wbuf, n, buf);
    return wbuf;
  }

and sys_mbstowcs():

size_t __reg3
sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
{
  mbtowc_p f_mbtowc = __MBTOWC;
  if (f_mbtowc == __ascii_mbtowc)
    {
      f_mbtowc = __utf8_mbtowc;  /* <<<<< this is ALWAYS done, no matter
                                    what charset is in use. */
    }
  return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
}

Since CP1252 is a single-byte character set that defines characters >= 
0x80, the byte '0xc7' (Ç) is not a valid UTF-8 sequence on its own. It is 
always translated as '0xc7 0xf0', with the '0xf0' byte indicating an 
invalid character in the string.
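
To illustrate why that byte cannot survive a UTF-8 conversion, here is a 
minimal sketch. It assumes a POSIX shell with 'iconv' and 'od' available, 
and it does not use any Cygwin internals; the helper name 'hex' is made up 
for this example. It decodes the single byte 0xc7 once as CP1252 and once 
as UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes on stdin as one hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Read 0xC7 (octal 307) as CP1252 and re-encode it as UTF-8.
# In CP1252 this byte is U+00C7 (Ç).
as_cp1252=$(printf '\307' | iconv -f CP1252 -t UTF-8 | hex)
echo "0xC7 read as CP1252, written as UTF-8: $as_cp1252"

# Read the same byte as UTF-8. 0xC7 is a lead byte that needs a
# continuation byte, so iconv is expected to reject it.
if printf '\307' | iconv -f UTF-8 -t UTF-16LE >/dev/null 2>&1; then
    as_utf8=valid
else
    as_utf8=invalid
fi
echo "0xC7 read as UTF-8: $as_utf8"
```

If the first conversion yields the two bytes 'c3 87' and the second one 
fails, that matches what I see in gdb: the converter treats the CP1252 
byte as UTF-8 input and cannot decode it.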

This doesn't seem to happen when e.g. '/bin/echo' is run, although I 
haven't stepped into the code to see what's happening.

I do not think this is a Cygwin bug, but since the User's Guide says the 
locale and charset can be changed on the fly, I don't know what's going 
awry.

Any suggestions? If you need more information, I'm happy to provide it.

Mike Shay

Here's the source for the Win32 program. I built it with Visual Studio 
2015, to get something running quickly.



  
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cygtest.cpp
Type: application/octet-stream
Size: 4428 bytes
Desc: not available
URL: <https://cygwin.com/pipermail/cygwin/attachments/20200803/b38fc8ec/attachment.obj>

