This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.

Re: [PATCH] CJK ambiguous width for non-Unicode charsets

From: Andy Koppe <andy dot koppe at gmail dot com>
To: newlib at sourceware dot org
Date: Wed, 17 Nov 2010 21:34:28 +0000
Subject: Re: [PATCH] CJK ambiguous width for non-Unicode charsets
References: <AANLkTik5ugUtdbrk351sA2aXaAk4gv+e66ydrjaRAVPG@mail.gmail.com> <20101116175820.GF32170@calimero.vinschen.de>

On 16 November 2010 17:58, Corinna Vinschen wrote:
> On Nov 9 22:06, Andy Koppe wrote:
>> The attached small patch affects character widths as reported by
>> wcwidth(). It addresses an obscure issue.
>>
>> The CJK ambiguous width category contains characters that are one
>> character cell wide in some contexts and two cells in others. That
>> category doesn't actually contain CJK characters as such, but things
>> like the Greek and Cyrillic alphabets, accented Latin characters, and
>> also line drawing characters. These are usually one cell wide, but in
>> CJK legacy encodings such as SJIS or GBK, they were encoded as two
>> bytes, and the usual practice was to have the display width correspond
>> to the number of bytes. Accordingly, CJK terminal fonts usually have
>> double-width glyphs for the affected characters. See also
>> http://unicode.org/reports/tr11/#Ambiguous.
>>
>> Newlib currently decides which width to use based on the selected
>> LC_CTYPE locale, i.e. it will use double width for "zh", "ja", and
>> "ko" locales, and single width for everything else, independent of the
>> selected character set. The attached patch changes this so that single
>> width will always be used for single-byte encodings such as the
>> ISO-8859 ones, and that double width will always be used for the CJK
>> legacy encodings. For UTF-8, the decision will still be made based on
>> the locale. The @cjknarrow modifier can still be used to force single
>> width, independent of locale and encoding.
>>
>> The point of this is to fit in with the historical use of those legacy
>> encodings, since the ambiguity only arose once the different charsets
>> were combined into Unicode. I doubt anyone is using nonsensical
>> locale/encoding combinations such as de_DE.GBK or ja_JP.ISO-8859-1, so
>> this is primarily about the likes of C.GBK and C.SJIS. Those are
>> currently ambiguous-narrow, but vim for example treats them as
>> ambiguous-wide, which makes for "interesting" effects when editing
>> files containing affected characters. The patch here fixes that.
>>
>> Tested in Cygwin. I assume this will need to wait for Corinna's return.
>>
>>     * libc/locale/locale.c: Fix ambiguous width to one for singlebyte
>>     charsets and two for non-Unicode multibyte charsets.
>
> This appears to make a lot of sense.  Would you mind enhancing your
> patch slightly to also fix the description in the locale.c
> documentation?  There's a related paragraph starting with "This
> implementation also supports a single modifier, <<"cjknarrow">>..."

Sorry, I hadn't seen that. Amended patch attached.

	* libc/locale/locale.c (loadlocale): Fix width of CJK ambiguous
	characters to 1 for singlebyte charsets and 2 for non-Unicode
	multibyte charsets. Change documentation accordingly.

Andy

Attachment: ambiwidth2.patch
Description: Binary data
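
The attachment is not reproduced in this archive view. As a rough
illustration only, the width rule described in the message could be
sketched as below. This is not the code from the patch: the function
name ambiguous_width and its parameters (language prefix, charset name,
maximum bytes per character, and a flag for the @cjknarrow modifier)
are hypothetical, and the sketch assumes that any multibyte charset
other than UTF-8 is a CJK legacy encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, NOT newlib's actual code: returns the number of
   character cells (1 or 2) to report for CJK ambiguous-width characters. */
static int ambiguous_width(const char *lang, const char *charset,
                           int mb_cur_max, bool cjknarrow)
{
    assert(lang != NULL && charset != NULL);

    /* The @cjknarrow modifier forces single width, independent of
       locale and encoding. */
    if (cjknarrow)
        return 1;

    /* Single-byte charsets such as the ISO-8859 ones: always narrow. */
    if (mb_cur_max == 1)
        return 1;

    /* UTF-8: the decision still depends on the locale's language. */
    if (strcmp(charset, "UTF-8") == 0) {
        bool cjk_lang = strncmp(lang, "ja", 2) == 0
                     || strncmp(lang, "ko", 2) == 0
                     || strncmp(lang, "zh", 2) == 0;
        return cjk_lang ? 2 : 1;
    }

    /* Assumption: every remaining multibyte charset is a CJK legacy
       encoding (SJIS, GBK, ...), so ambiguous characters are wide. */
    return 2;
}
```

Under these assumptions, a call such as ambiguous_width("C", "GBK", 2,
false) yields 2, which matches the C.GBK case the message is mainly
concerned with, while the same call with the last argument set to true
yields 1.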

Follow-Ups:
- Re: [PATCH] CJK ambiguous width for non-Unicode charsets
  - From: Corinna Vinschen

References:
- [PATCH] CJK ambiguous width for non-Unicode charsets
  - From: Andy Koppe
- Re: [PATCH] CJK ambiguous width for non-Unicode charsets
  - From: Corinna Vinschen
