This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: Console codepage setting via chcp?


On Sep 26 21:34, Corinna Vinschen wrote:
> Do you propose to change __utf8_mbtowc/__utf8_wctomb to allow UCS-2
> encoding as well?
> 
> This is no problem for __utf8_mbtowc, but in __utf8_wctomb it's not
> possible to convert surrogate pairs to correct UTF-8 *and* lone
> surrogate first halves to UCS-2, at least not without a lot of
> additional effort.  The reason is that the first byte returned when the
> first half is read is >= 0xf0.  When the function is called for the
> second half and it turns out there is no second half, the already
> returned 0xf0 byte is suddenly wrong, and the wctomb functions have no
> read-ahead functionality.
> 
> For that reason, I invented the aforementioned \016\377\x sequence
> to represent lone surrogate second halves.
> 
> The only other alternative would be to revert all the surrogate pair
> handling changes and to allow only UCS-2 again, thus giving up
> support for Unicode values >= U+10000.

No, there's a third alternative, of course.

The __utf8_wctomb function could just create the corresponding
UCS-2 values if no first half has been encountered before.  The
__utf8_mbtowc function could simply allow these UCS-2 values again.
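
To make the idea concrete, here is a rough sketch of the wctomb side.
It is not the actual newlib code; sketch_state and sketch_wctomb are
made-up names, and the handling is reduced to the bare minimum.  A
first half emits the lead byte of the 4-byte form right away, a second
half following it emits the remaining three bytes, and a second half
with no first half before it simply gets the plain 3-byte encoding of
its UCS-2 value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: remembers a pending first (high) surrogate. */
typedef struct {
    uint32_t pending;  /* partial code point from the first half, 0 if none */
} sketch_state;

/* Convert one UTF-16 code unit to bytes.  Returns the number of bytes
   written to out (at most 3 per call).  Illustration only. */
static size_t sketch_wctomb(unsigned char *out, uint16_t wc, sketch_state *st)
{
    if (wc >= 0xD800 && wc <= 0xDBFF) {
        /* First half: the lead byte of the 4-byte form is already known. */
        st->pending = 0x10000u + ((uint32_t)(wc - 0xD800) << 10);
        out[0] = (unsigned char)(0xF0 | (st->pending >> 18));
        return 1;
    }
    if (wc >= 0xDC00 && wc <= 0xDFFF && st->pending != 0) {
        /* Second half completing a pair: emit the three trailing bytes. */
        uint32_t cp = st->pending | (uint32_t)(wc - 0xDC00);
        st->pending = 0;
        out[0] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* A first half followed by something else is the unsolved case:
       the lead byte is already out.  This sketch just drops the state. */
    st->pending = 0;
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        out[0] = (unsigned char)(0xC0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    /* Everything else, including a lone second half, is written as the
       plain 3-byte encoding of the UCS-2 value. */
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}
```

Feeding it 0xD83D followed by 0xDE00 gives the four bytes of U+1F600,
while a lone 0xDC00 comes out as a 3-byte sequence, which is exactly
the kind of value the mbtowc side would have to accept again.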

That works (I just tested it) and is a small change, but is it really
desirable to allow UCS-2 values in UTF-8 strings?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

