This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Please support CP932. (I have problem using subversion with SJIS)


On Jan 23 14:49, Nayuta Taga wrote:
> Hi all,
> 
> Please support CP932.  Because CP932 is not equal to SJIS, I have a
> problem using subversion when LANG=ja_JP.SJIS.  With the attached
> patch and LANG=ja_JP.CP932, I can use subversion as expected.
> 
> The problem is as follows:
> 
> I have the following line in my ~/.subversion/config:
> 	global-ignores = *~
> When LANG=ja_JP.UTF-8, subversion ignores a file 'foo~'.
> But when LANG=ja_JP.SJIS, it doesn't.
> 
> I looked into subversion and found a workaround.
> I added *[U+203E] to the line:
> 	global-ignores = *~ *[U+203E]
> ([U+203E] is one character) and saved it in UTF-8.  This works fine.
> 
> In short, '~' (U+007E TILDE) turns into U+203E (OVERLINE) when
> LANG=ja_JP.SJIS.
> 
> Then I looked into cygwin and subversion again.
> (1) cygwin1.dll converts L"foo~" (UCS-2) to "foo~" (CP932).
> (2) Because subversion internally uses UTF-8,
>     "foo~" (CP932) should be converted to "foo~" (UTF-8).
> (3) It uses iconv to convert from *SJIS* to UTF-8,
>     because nl_langinfo(CODESET) returns "SJIS" when LANG=ja_JP.SJIS.
> (4) The final string is "foo\xe2\x80\xbe".
>     (e2 80 be is UTF-8 representation of U+203E)
> 
> With my patch I can use LANG=ja_JP.CP932, nl_langinfo(CODESET) returns
> "CP932".  So the final string is "foo~".
> 
> supplement:
> 
> $ echo -n foo~ | iconv -f CP932 -t UTF-8 | od -t x1 -t a
> 0000000    66  6f  6f  7e
>            f   o   o   ~
> 0000004
> $ echo -n foo~ | iconv -f SJIS -t UTF-8 | od -t x1 -t a
> 0000000    66  6f  6f  e2  80  be
>            f   o   o   ?  80   ?
> 0000006

I don't understand the problem.

SJIS is the charset name for the Windows codepage 932.  The multibyte to
widechar conversion (and vice versa) for SJIS even uses the Windows
conversion functions under the hood.  And the character 0x7e in SJIS is
identical to the Unicode character U+007e.

So, why does iconv turn U+007e into U+203E?

This sounds like a bug in iconv, not in Cygwin.  Your patch just adds an
additional charset name CP932 for the exact same charset SJIS.  All this
does is keep iconv from recognizing the charset as SJIS.  That sounds
like a hack, rather than a solution.
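
One plausible explanation, offered as an assumption rather than something
checked against the iconv sources here, is that iconv's SJIS table treats
the single-byte range as JIS X 0201 Roman instead of ASCII.  JIS X 0201
Roman differs from ASCII at two code points: 0x5c is YEN SIGN (U+00A5) and
0x7e is OVERLINE (U+203E).  The Python sketch below models only that
difference.  The helper decode_single_byte and the table name are
hypothetical, made up for illustration; this is not iconv's or Cygwin's
actual code.

```python
from fnmatch import fnmatchcase

# Code points where JIS X 0201 Roman differs from ASCII.
JISX0201_ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_single_byte(data: bytes, jisx0201: bool) -> str:
    """Decode 7-bit bytes, optionally applying the JIS X 0201 Roman overrides.

    Hypothetical helper for illustration; only bytes below 0x80 are modelled.
    """
    out = []
    for b in data:
        if b >= 0x80:
            raise ValueError("only the 7-bit range is modelled here")
        if jisx0201 and b in JISX0201_ROMAN_OVERRIDES:
            out.append(JISX0201_ROMAN_OVERRIDES[b])
        else:
            out.append(chr(b))
    return "".join(out)


# ASCII-style single-byte range (what the CP932 conversion produced above).
cp932_style = decode_single_byte(b"foo~", jisx0201=False)
# JIS X 0201 Roman single-byte range (assumed behaviour of iconv's SJIS).
sjis_style = decode_single_byte(b"foo~", jisx0201=True)

print(cp932_style.encode("utf-8").hex(" "))  # 66 6f 6f 7e
print(sjis_style.encode("utf-8").hex(" "))   # 66 6f 6f e2 80 be

# The ignore pattern "*~" only matches when the tilde survives conversion.
print(fnmatchcase(cp932_style, "*~"))  # True
print(fnmatchcase(sjis_style, "*~"))   # False
```

Under that assumption the two byte dumps match the od output quoted above,
and the second fnmatchcase result would explain why "global-ignores = *~"
stops matching 'foo~' when LANG=ja_JP.SJIS.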


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

