This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: GB18030 (was: Re: charset changes)

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Sat, 27 Mar 2010 21:45:54 +0100
Subject: Re: GB18030 (was: Re: charset changes)
References: <416096c61003271002t330ee9ecned88f73ef3b4face@mail.gmail.com> <20100327180157.GC18364@calimero.vinschen.de> <416096c61003271215j1c5d131j48cc64face950cf0@mail.gmail.com>
Reply-to: cygwin-developers at cygwin dot com

On Mar 27 19:15, Andy Koppe wrote:
> On 27 March 2010 18:01, Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
> >> On Vista and 7, if you pass those two bytes to MultiByteToWideChar,
> >> you get back the codepage's UnicodeDefaultChar followed by the digit
> >> '3'. XP did something else, but I can't remember exactly what.
> >
> > Heh, ok.  It never occurred to me to test the content of the target
> > buffer if MultiByteToWideChar failed anyway.
> 
> It only fails if the MB_ERR_INVALID_CHARS flag is set.

That's what Cygwin is doing.  I don't see how any other setting would
make sense.

> >> On Cygwin, these would be straightforward wrappers around
> >> WideCharToMultiByte and MultiByteToWideChar with codepage 54936,
> >> implemented in winsup. For other newlib targets, we could take a
> >> similar approach as with doublebyte charsets, where multibyte
> >> sequences are mapped to a non-Unicode wchar_t representation by simply
> >> packing the bytes into the wchar_t.
> >
> > Yet another function call for every single character:
> > http://sourceware.org/ml/newlib/2009/msg01033.html
> 
> That call would only be necessary for non-ASCII characters, and I
> don't think it would be terribly significant compared to the magic
> that WideCharToMultiByte and MultiByteToWideChar need to do.
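
For illustration, the byte-packing idea quoted above (mapping a multibyte
sequence to a non-Unicode wide value by packing its bytes) could look
roughly like the sketch below.  The names pack_mb and unpack_mb are
hypothetical and are not newlib functions.  The sketch uses uint32_t on
the assumption that the target's wchar_t is 32 bits wide; a four-byte
GB18030 sequence would not fit into a 16-bit wchar_t.  It also relies on
the lead byte of every GB18030 multibyte sequence being non-zero, so the
sequence length can be recovered from the packed value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack a 1..4 byte multibyte sequence into one
   32-bit value, first byte in the most significant position used. */
static uint32_t pack_mb(const unsigned char *s, size_t n)
{
    uint32_t w = 0;

    assert(n >= 1 && n <= 4);
    for (size_t i = 0; i < n; i++)
        w = (w << 8) | s[i];
    return w;
}

/* Hypothetical inverse: write the packed bytes back to out (at least
   4 bytes of room) and return how many bytes were written.  The length
   is derived from the highest non-zero byte of w. */
static size_t unpack_mb(uint32_t w, unsigned char *out)
{
    size_t n = 1;

    if (w > 0xFFFFFFu)
        n = 4;
    else if (w > 0xFFFFu)
        n = 3;
    else if (w > 0xFFu)
        n = 2;
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(w >> (8 * (n - 1 - i)));
    return n;
}
```

With this scheme ASCII bytes map to themselves, so only non-ASCII
characters take the packing path.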

Looks like there's still no chance to persuade you to sign the copyright
assignment form.  Pity.

> ps: Btw, speaking of performance issues, the 8-bit charsets are rather
> inefficient because for every single non-ASCII character they parse
> the charset name to obtain a charset table index. Storing that index
> alongside the name might make quite a big difference.

That's right.  The problem is that it's necessary to be able to call the
function the same way as any other __FOO_wctomb or __FOO_mbtowc
function.  Right now all these functions get the charset name as a
parameter.  This is necessary because the functions could be called with
a charset name which is different from the globally stored charset,
for instance, if Cygwin is using a different charset for the console
window than the one the application requests in setlocale.

Anyway, feel free to send a patch to change the charset name parameter
to an array index parameter.
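
As a rough sketch of that idea, the name could be resolved to a table
index once and the result carried alongside the name, so the per-character
conversion path never has to parse the name again.  Everything below is
hypothetical: the table contents, charset_index, struct charset_ref and
charset_ref_init are made up for illustration and do not reflect newlib's
actual tables or function signatures.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table of 8-bit charset names (not newlib's real table). */
static const char *const charset_names[] = {
    "ISO-8859-1", "ISO-8859-2", "CP1252"
};
#define N_CHARSETS (sizeof charset_names / sizeof charset_names[0])

/* Slow path: resolve a charset name to its table index by string
   comparison.  Returns -1 if the name is unknown. */
static int charset_index(const char *name)
{
    for (size_t i = 0; i < N_CHARSETS; i++)
        if (strcmp(name, charset_names[i]) == 0)
            return (int)i;
    return -1;
}

/* Name and index kept together, so callers that still need to pass a
   specific charset can hand over the precomputed index. */
struct charset_ref {
    const char *name;
    int index;
};

/* Resolve the name once, e.g. when the locale or console charset is
   set, instead of once per non-ASCII character. */
static struct charset_ref charset_ref_init(const char *name)
{
    struct charset_ref ref = { name, charset_index(name) };
    return ref;
}
```

A conversion function would then take ref.index (or the whole struct)
instead of the name, which keeps the option of calling it with a charset
that differs from the globally stored one.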

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

Follow-Ups:
- Re: GB18030 (was: Re: charset changes)
  - From: Andy Koppe

References:
- GB18030 (was: Re: charset changes)
  - From: Andy Koppe
- Re: GB18030 (was: Re: charset changes)
  - From: Corinna Vinschen
- Re: GB18030 (was: Re: charset changes)
  - From: Andy Koppe
