This is the mail archive of the
cygwin-developers
mailing list for the Cygwin project.
Re: charset changes
- From: Thomas Wolff <towo at towo dot net>
- To: cygwin-developers at cygwin dot com
- Date: Fri, 05 Feb 2010 17:28:59 +0100
- Subject: Re: charset changes
- References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com>
On 23.01.2010 12:05, Andy Koppe wrote:
> I'm in awe at Corinna's latest locale changes. Getting closer and
> closer to the real thing.
Me too.
> A couple of points:
> ...
And here are a couple of points of mine, after some checking:
I found the following inconsistencies, and since the agreed strategy
seems to be to prefer Linux compatibility over Windows mapping,
I think especially the first group of a few incompatible mappings should
be fixed before the 1.7.2 release.
------------------------------------------------------------------------
These locales have inconsistent encodings:
Locale       Linux         Cygwin
et_EE        ISO-8859-1    ISO-8859-15
ja_JP.sjis   SHIFT_JIS     CP932
ka_GE        GEORGIAN-PS   UTF-8
kk_KZ        PT154         ISO-8859-5
sr_CS        ISO-8859-5    UTF-8
uz_UZ        ISO-8859-1    UTF-8
zh_CN        GB2312        GBK
zh_HK        BIG5-HKSCS    BIG5
zh_SG        GB2312        GBK
Notes:
- SHIFT_JIS -> CP932 has been discussed extensively and I think it's OK
- GB2312 -> GBK is basically a superset, should be OK too
- zh_HK is the dedicated Hong Kong locale, so it should use the Hong Kong
extension (BIG5-HKSCS)
- With respect to the other differences above, Linux has these two
distinct locales:
et_EE.iso885915   ISO-8859-15
uz_UZ@cyrillic    UTF-8
- getlocale -a lists the following twice, without indicating a difference:
sr_SP
sr_BA
az_AZ
se_FI
uz_UZ (see above)
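
For anyone who wants to repeat the check, here is a rough sketch of how the
comparison could be scripted. It is not the procedure I used, just an
illustration. It assumes a Python with locale.nl_langinfo available (Linux or
Cygwin), the helper names system_charset and mismatches are made up for this
example, and the expected values are simply the Linux column of the table
above. Locales that are not installed are silently skipped.

```python
import locale

# Linux column of the table above; nothing here is measured by this script.
LINUX_CHARSETS = {
    "et_EE": "ISO-8859-1",
    "ja_JP.sjis": "SHIFT_JIS",
    "ka_GE": "GEORGIAN-PS",
    "kk_KZ": "PT154",
    "sr_CS": "ISO-8859-5",
    "uz_UZ": "ISO-8859-1",
    "zh_CN": "GB2312",
    "zh_HK": "BIG5-HKSCS",
    "zh_SG": "GB2312",
}


def system_charset(name):
    """Return the codeset the C library reports for a locale, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None  # locale not known on this system
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore previous setting


def mismatches(expected, lookup=system_charset):
    """Yield (locale, expected, actual) where the reported charset differs."""
    for name, want in sorted(expected.items()):
        got = lookup(name)
        if got is not None and got.upper() != want.upper():
            yield name, want, got


if __name__ == "__main__":
    for name, want, got in mismatches(LINUX_CHARSETS):
        print(f"{name:12} {want:12} {got}")
```

The lookup parameter is only there so the comparison can be tried with a
canned table instead of the live C library.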
------------------------------------------------------------------------
Also, some generic encoding suffixes are not handled:
- .iso885915 and .iso8859-15 (cygwin only recognizes .iso-8859-15 and
its uppercase form)
- .koi8r (cygwin only recognizes .koi8-r and .KOI8-R)
- .koi8u (cygwin only recognizes .koi8-u and .KOI8-U)
- .tcvn (in vi_VN.tcvn)
- .gb18030 (in zh_CN.gb18030)
- .eucjp (in ja_JP.eucjp)
- .euctw (in zh_TW.euctw)
(Maybe the latter lack Windows support or depend on Windows
configuration...)
- .koi8t
- .armscii8
- .big5hkscs
- .gb2312
- .georgianps
- .pt154
- .ujis (-> EUC-JP)
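
One possible way to accept all these spellings is to fold the suffix before
looking it up: lowercase it and drop everything that is not a letter or digit.
The following is only a sketch of that idea, not a description of what cygwin
or any libc actually does. The names CANONICAL, fold and canonical_charset are
invented for the example, and the table covers only suffixes whose charset
names appear in this mail.

```python
import re

# Folded suffix -> charset name, limited to names mentioned in this mail.
CANONICAL = {
    "iso885915": "ISO-8859-15",
    "koi8r": "KOI8-R",
    "koi8u": "KOI8-U",
    "koi8t": "KOI8-T",
    "big5hkscs": "BIG5-HKSCS",
    "gb2312": "GB2312",
    "georgianps": "GEORGIAN-PS",
    "pt154": "PT154",
    "eucjp": "EUC-JP",
    "ujis": "EUC-JP",
}


def fold(suffix):
    """Lowercase the suffix and strip all non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", suffix.lower())


def canonical_charset(locale_name):
    """Map 'll_CC.suffix[@modifier]' to a charset name, or None if unknown."""
    _, dot, rest = locale_name.partition(".")
    if not dot:
        return None  # no charset suffix given
    suffix = rest.partition("@")[0]  # drop an optional @modifier
    return CANONICAL.get(fold(suffix))
```

With this, et_EE.iso885915, et_EE.iso8859-15 and et_EE.ISO-8859-15 all end up
at the same entry, and ja_JP.ujis resolves to EUC-JP.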
------------------------------------------------------------------------
These locales are not known or handled on cygwin at all:
aa_DJ ISO-8859-1
aa_ER UTF-8
aa_ET UTF-8
am_ET UTF-8
an_ES ISO-8859-15
ar_IN UTF-8
ar_SD ISO-8859-6
ast_ES ISO-8859-15
ber_DZ UTF-8
ber_MA UTF-8
bn_BD UTF-8
bo_CN UTF-8
bo_IN UTF-8
br_FR ISO-8859-1
byn_ER UTF-8
ca_AD ISO-8859-15
ca_FR ISO-8859-15
ca_IT ISO-8859-15
crh_UA UTF-8
csb_PL UTF-8
de_BE ISO-8859-1
dz_BT UTF-8
el_CY ISO-8859-7
en_AG UTF-8
en_BE ISO-8859-1
en_BW ISO-8859-1
en_DK ISO-8859-1
en_HK ISO-8859-1
en_IN UTF-8
en_NG UTF-8
en_SG ISO-8859-1
es_US ISO-8859-1
fur_IT UTF-8
fy_DE UTF-8
ga_IE ISO-8859-1
gd_GB ISO-8859-15
gez_ER UTF-8
gez_ET UTF-8
gv_GB ISO-8859-1
ha_NG UTF-8
hne_IN UTF-8
hsb_DE ISO-8859-2
ht_HT UTF-8
ig_NG UTF-8
ik_CA UTF-8
iu_CA UTF-8
iw_IL ISO-8859-8
kl_GL ISO-8859-1
km_KH UTF-8
ks_IN UTF-8
ku_TR ISO-8859-9
kw_GB ISO-8859-1
lg_UG ISO-8859-10
li_BE UTF-8
li_NL UTF-8
lo_LA UTF-8
mai_IN UTF-8
mg_MG ISO-8859-15
nds_DE UTF-8
nds_NL UTF-8
ne_NP UTF-8
nl_AW UTF-8
no_NO ISO-8859-1
nr_ZA UTF-8
nso_ZA UTF-8
oc_FR ISO-8859-1
om_ET UTF-8
om_KE ISO-8859-1
or_IN UTF-8
pap_AN UTF-8
pa_PK UTF-8
ru_UA KOI8-U
rw_RW UTF-8
sc_IT UTF-8
sd_IN UTF-8
shs_CA UTF-8
sh_YU ISO-8859-2
sid_ET UTF-8
si_LK UTF-8
so_DJ ISO-8859-1
so_ET UTF-8
so_KE ISO-8859-1
so_SO ISO-8859-1
ss_ZA UTF-8
st_ZA ISO-8859-1
tg_TJ KOI8-T
ti_ER UTF-8
ti_ET UTF-8
tig_ER UTF-8
tk_TM UTF-8
tl_PH ISO-8859-1
tr_CY ISO-8859-9
ts_ZA UTF-8
ug_CN UTF-8
ve_ZA UTF-8
wa_BE ISO-8859-1
wo_SN UTF-8
yi_US CP1255
yo_NG UTF-8
------------------------------------------------------------------------
And finally, some systems (e.g. Fedora) maintain a number of full-word
locales (locale aliases?) that are not known on cygwin either (maybe not
harmful):
(Note: non-ASCII letters in some of the locale names on those systems
are in 8-bit, Latin-1)
bokmal ISO-8859-1
bokmål ISO-8859-1
catalan ISO-8859-1
croatian ISO-8859-2
czech ISO-8859-2
danish ISO-8859-1
dansk ISO-8859-1
deutsch ISO-8859-1
dutch ISO-8859-1
eesti ISO-8859-1
estonian ISO-8859-1
finnish ISO-8859-1
français ISO-8859-1
french ISO-8859-1
galego ISO-8859-1
galician ISO-8859-1
german ISO-8859-1
greek ISO-8859-7
hebrew ISO-8859-8
hrvatski ISO-8859-2
hungarian ISO-8859-2
icelandic ISO-8859-1
italian ISO-8859-1
japanese EUC-JP
korean EUC-KR
lithuanian ISO-8859-13
norwegian ISO-8859-1
nynorsk ISO-8859-1
polish ISO-8859-2
portuguese ISO-8859-1
romanian ISO-8859-2
russian ISO-8859-5
slovak ISO-8859-2
slovene ISO-8859-2
slovenian ISO-8859-2
spanish ISO-8859-1
swedish ISO-8859-1
thai TIS-620
turkish ISO-8859-9
------------------------------------------------------------------------
Thomas