This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: charset changes

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Sat, 6 Feb 2010 14:55:53 +0100
Subject: Re: charset changes
References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com> <4B6C474B.7090600@towo.net> <20100205215047.GX28659@calimero.vinschen.de> <416096c61002051514w5bb56b0bj5baeb0c65c7aece@mail.gmail.com> <4B6CB86E.5050904@towo.net> <416096c61002052220rafdb361kec907336ca5b3889@mail.gmail.com> <20100206104024.GY28659@calimero.vinschen.de>
Reply-to: cygwin-developers at cygwin dot com

On Feb  6 11:40, Corinna Vinschen wrote:
> On Feb  6 06:20, Andy Koppe wrote:
> > Other systems usually have a 32-bit wchar, though. I can see three
> > ways to tackle the issue, but none of them entirely satisfactory. When
> > encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
> > non-BMP char (and hence a UTF-16 surrogate pair):
> > 1. Just report an invalid sequence. BMP-only support would probably
> > still cover most practical needs.
> > 2. Write the high surrogate and report that one byte less than
> > actually seen has been consumed. On the next mbtowc call, ignore the
> > input, write the low surrogate, and report that 1 byte has been
> > consumed. Unfortunately this scheme falls down if the user feeds in
> > the bytes one-by-one, as Corinna previously found when handling UTF-8
> > like this.
> > 3. Write the high surrogate and report the actual number of bytes
> > consumed. On the next call, write the low surrogate, and return 0 to
> > indicate that no bytes have been consumed. Trouble is, a return value
> > of 0 from mbrtowc is supposed to indicate that a null character has
> > been found. While uses within Cygwin could be changed to recognise
> > string end by instead looking at the character actually written, this
> > would lead to truncated strings in applications.
> 
> Can't we just carry over the surrogate pair handling from __utf8_mbtowc()
> in newlib/libc/stdlib/mbtowc_r.c?  What's the stumbling block exactly?
> Do you have an example?

I just read the GB18030 entry in the German Wikipedia again and, boy,
I take an immediate dislike to that codeset every time.  2-byte
sequences have a trailing byte in the range 0x40-0xfe, 3-byte sequences
don't exist, 4-byte sequences have a second and fourth byte in the range
0x30-0x39.  Why, oh why, do codeset implementors have to overload the
ASCII range without need?
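
To make the byte ranges concrete, here is a minimal sketch of how the
length of a GB18030 sequence could be determined from its bytes.  It
assumes the commonly documented layout (lead and third bytes 0x81-0xfe,
2-byte trail bytes 0x40-0xfe except 0x7f) and only classifies the
sequence; it does not map anything to Unicode.  The function name
gb18030_seq_len is hypothetical and is not taken from newlib.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of newlib or Cygwin.
   Returns the length (1, 2 or 4) of the GB18030 sequence starting at s,
   0 if the n bytes available are a valid but incomplete prefix,
   and -1 if the bytes cannot start a valid sequence. */
static int
gb18030_seq_len (const unsigned char *s, size_t n)
{
  assert (s != NULL);
  if (n == 0)
    return 0;
  if (s[0] <= 0x7f)
    return 1;                           /* plain ASCII */
  if (s[0] == 0x80 || s[0] == 0xff)
    return -1;                          /* not a valid lead byte */
  if (n < 2)
    return 0;
  if (s[1] >= 0x40 && s[1] <= 0xfe && s[1] != 0x7f)
    return 2;                           /* trail byte may look like ASCII */
  if (s[1] < 0x30 || s[1] > 0x39)
    return -1;
  /* Second byte is an ASCII digit, so this must be a 4-byte sequence. */
  if (n >= 3 && (s[2] < 0x81 || s[2] > 0xfe))
    return -1;
  if (n < 4)
    return 0;
  if (s[3] < 0x30 || s[3] > 0x39)
    return -1;
  return 4;
}
```

Note that a 2-byte sequence such as 0x81 0x5c ends in what looks like an
ASCII backslash, and the second and fourth bytes of every 4-byte
sequence look like ASCII digits.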


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
