This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: charset changes

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Sat, 6 Feb 2010 11:40:24 +0100
Subject: Re: charset changes
References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com> <4B6C474B.7090600@towo.net> <20100205215047.GX28659@calimero.vinschen.de> <416096c61002051514w5bb56b0bj5baeb0c65c7aece@mail.gmail.com> <4B6CB86E.5050904@towo.net> <416096c61002052220rafdb361kec907336ca5b3889@mail.gmail.com>
Reply-to: cygwin-developers at cygwin dot com

On Feb  6 06:20, Andy Koppe wrote:
> Other systems usually have a 32-bit wchar, though. I can see three
> ways to tackle the issue, but none of them entirely satisfactory. When
> encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
> non-BMP char (and hence a UTF-16 surrogate pair):
> 1. Just report an invalid sequence. BMP-only support would probably
> still cover most practical needs.
> 2. Write the high surrogate and report that one byte less than
> actually seen has been consumed. On the next mbtowc call, ignore the
> input, write the low surrogate, and report that 1 byte has been
> consumed. Unfortunately this scheme falls down if the user feeds in
> the bytes one-by-one, as Corinna previously found when handling UTF-8
> like this.
> 3. Write the high surrogate and report the actual number of bytes
> consumed. On the next call, write the low surrogate, and return 0 to
> indicate that no bytes have been consumed. Trouble is, a return value
> of 0 from mbrtowc is supposed to indicate that a null character has
> been found. While uses within Cygwin could be changed to recognise
> string end by instead looking at the character actually written, this
> would lead to truncated strings in applications.
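
For reference, a minimal sketch of the arithmetic behind a surrogate pair, plus a toy model of scheme 3 above. This is not the newlib code: split_surrogates(), struct pending_state and emit_scheme3() are hypothetical names invented for illustration, and the toy takes an already decoded code point plus its byte length rather than parsing real GB18030 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a non-BMP code point (U+10000..U+10FFFF) into a UTF-16
   surrogate pair.  Hypothetical helper, not part of newlib. */
static void split_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* upper 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* lower 10 bits */
}

/* Hypothetical conversion state: a low surrogate still to be delivered.
   A real low surrogate is never 0, so 0 doubles as "nothing pending". */
struct pending_state {
    uint16_t low;
};

/* Toy model of scheme 3.  `cp` stands for the code point decoded from a
   multibyte sequence of `nbytes` bytes.  Returns the number of bytes
   reported as consumed, in the style of mbtowc. */
static int emit_scheme3(struct pending_state *st, uint32_t cp, int nbytes,
                        uint16_t *out)
{
    uint16_t hi;

    if (st->low != 0) {
        /* Second call: ignore the input and deliver the low surrogate.
           No bytes are consumed, so the return value is 0. */
        *out = st->low;
        st->low = 0;
        return 0;
    }
    if (cp < 0x10000) {
        /* BMP character: one UTF-16 unit.  A null character returns 0. */
        *out = (uint16_t)cp;
        return cp == 0 ? 0 : nbytes;
    }
    /* Non-BMP character: deliver the high surrogate now, keep the low one. */
    split_surrogates(cp, &hi, &st->low);
    *out = hi;
    return nbytes;
}
```

In this model the call that delivers a low surrogate and the call that converts a real null character both return 0. A caller can only tell them apart by looking at the character written, which is the drawback described for scheme 3.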

Can't we just carry over the surrogate pair handling from __utf8_mbtowc()
in newlib/libc/stdlib/mbtowc_r.c?  What's the stumbling block exactly?
Do you have an example?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

Follow-Ups:
- Re: charset changes
  - From: Corinna Vinschen
- Re: charset changes
  - From: Andy Koppe

References:
- Re: charset changes
  - From: Thomas Wolff
- Re: charset changes
  - From: Corinna Vinschen
- Re: charset changes
  - From: Andy Koppe
- Re: charset changes
  - From: Thomas Wolff
- Re: charset changes
  - From: Andy Koppe