This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: Console codepage setting via chcp?

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Sat, 26 Sep 2009 22:16:48 +0200
Subject: Re: Console codepage setting via chcp?
References: <416096c60909242240o7a62f677yde4638c8224abf40@mail.gmail.com> <20090925093910.GM30851@calimero.vinschen.de> <416096c60909250413o743e2f74g6ef66bdff8006ad@mail.gmail.com> <20090925141042.GN30851@calimero.vinschen.de> <416096c60909250941n135d0bd4gfd7f4d5d90ac308@mail.gmail.com> <20090925172744.GQ30851@calimero.vinschen.de> <416096c60909251142w56af59c4n1896cade0e21600c@mail.gmail.com> <20090926122018.GT30851@calimero.vinschen.de> <416096c60909260943i33fc18cdp6a2a8da066b06132@mail.gmail.com> <20090926193429.GV30851@calimero.vinschen.de>
Reply-to: cygwin-developers at cygwin dot com

On Sep 26 21:34, Corinna Vinschen wrote:
> Do you propose to change __utf8_mbtowc/__utf8_wctomb to allow UCS-2
> encoding as well?
> 
> This is no problem for __utf8_mbtowc, but in __utf8_wctomb it's not
> possible to convert surrogate pairs to correct UTF-8 *and* lone
> surrogate first halves to UCS-2, at least not without a lot of
> additional effort.  The reason is that the first byte returned when the
> first half is read is >= 0xf0.  When the function is called for the
> second half and it turns out there is no second half, then the already
> returned 0xf0 byte is suddenly wrong.  And the wctomb functions have no
> read-ahead functionality.
> 
> For that reason, I invented the aforementioned \016\377\x sequence
> to represent lone surrogate second halves.
> 
> The only other alternative would be to revert all the surrogate pair
> handling changes and to allow only UCS-2 again, thus giving up
> support for Unicode values >= U+10000.

No, there's a third alternative, of course.

The __utf8_wctomb function could just create the corresponding
UCS-2 values if no first half has been encountered before.  The
__utf8_mbtowc function could simply allow these UCS-2 values again.

That works (I just tested it) and is a small change, but is it really
desirable to allow UCS-2 values in UTF-8 strings?
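
To make the idea concrete, here is a minimal sketch of the encoding side. It is not the actual newlib code: the names sketch_state and sketch_wctomb are made up for illustration, and it assumes a 16-bit wchar_t modelled as uint16_t. A first half returns its lead byte immediately, and a second half that arrives with no first half pending is simply written as the three-byte form of its 16-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: code point built from a pending first half, or 0. */
typedef struct { uint32_t pending; } sketch_state;

/* Converts one UTF-16 code unit.  Writes up to 3 bytes to out and returns
   the number of bytes written, or -1 if a pending first half is followed
   by something that is not a second half. */
static int sketch_wctomb(sketch_state *st, uint16_t wc, unsigned char *out)
{
    assert(st != NULL && out != NULL);

    if (st->pending) {
        /* The lead byte for the first half has already been returned. */
        if (wc < 0xdc00 || wc > 0xdfff)
            return -1;              /* too late to take that byte back */
        uint32_t cp = st->pending | (wc & 0x3ff);
        st->pending = 0;
        out[0] = 0x80 | ((cp >> 12) & 0x3f);
        out[1] = 0x80 | ((cp >> 6) & 0x3f);
        out[2] = 0x80 | (cp & 0x3f);
        return 3;
    }
    if (wc >= 0xd800 && wc <= 0xdbff) {
        /* First half: emit the lead byte right away, there is no read-ahead. */
        st->pending = ((uint32_t)(wc & 0x3ff) << 10) + 0x10000;
        out[0] = 0xf0 | (st->pending >> 18);
        return 1;
    }
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        out[0] = 0xc0 | (wc >> 6);
        out[1] = 0x80 | (wc & 0x3f);
        return 2;
    }
    /* Remaining BMP values, including a second half with no first half
       before it: plain three-byte form of the 16-bit value. */
    out[0] = 0xe0 | (wc >> 12);
    out[1] = 0x80 | ((wc >> 6) & 0x3f);
    out[2] = 0x80 | (wc & 0x3f);
    return 3;
}
```

In this sketch the pair 0xd83d 0xde00 yields the bytes f0, then 9f 98 80, while a lone 0xdc00 yields ed b0 80. A first half followed by a non-surrogate still fails, since the lead byte is already out.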


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
