This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Japanese/Chinese language question


On Jan 21 10:04, Mark J. Reed wrote:
> On Thu, Jan 21, 2010 at 8:40 AM, Corinna Vinschen wrote:
> > would somebody with Japanese and/or Chinese language background be so
> > When comparing strings linguistically (strcoll/wcscoll),
> > - are Hiragana and Katakana forms of the same character to be
> >   treated as equal or as different?
> 
> (Nit: they are not "the same character" in either the technical or
> traditional sense of "character"; they're the same syllable, but
> represented by different characters.)
> 
> From the Unicode point of view, they are distinct; there is no defined
> equivalence, either canonical or compatibility, between corresponding
> Katakana and Hiragana syllables.  The collation algorithm (which does
> take linguistic context into account) doesn't seem to say anything
> about such comparisons, though it's possible I missed something.
> 
> But as a precedent which might be helpful, I note that with
> linguistic sensitivity active, Oracle 10g does compare Hiragana and
> Katakana forms of the same syllable as equal.
> 
> > - are half-width and full-width forms of the same CJK character
> >   treated as equal or as different?
> 
> According to the Unicode normalization algorithm, half-width and
> full-width forms normalize to the same character, so they should be
> treated as equivalent.  From the point of view of Unicode, there is no
> semantic difference, and the width property is informative, not
> normative. It's primarily encoded in Unicode to preserve round-trip
> compatibility with other standards, though it's also helpful for hints
> to rendering algorithms.

Thanks for the info.  However...


  linux$ cat jp.c
  #include <stdio.h>
  #include <locale.h>
  #include <wchar.h>

  int
  main (int argc, char **argv)
  {
    setlocale (LC_ALL, "ja_JP.UTF-8");
    /* U+3042 = Hiragana letter A
       U+30a2 = Katakana letter A
       U+ff71 = Halfwidth Katakana letter A */
    printf ("%d\n", wcscoll (L"\x3042", L"\x30a2"));
    printf ("%d\n", wcscoll (L"\xff71", L"\x30a2"));
    return 0;
  }
  linux$ gcc jp.c -o jp
  linux$ ./jp
  -83
  -340

I expected at least one of the comparisons to return 0.
Am I doing something wrong?
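
For illustration only: if an application wants Hiragana and Katakana
forms of a syllable to compare as equal regardless of what the locale's
collation tables do, one possible approach is to fold Hiragana to
Katakana before calling wcscoll.  The sketch below assumes that the
Hiragana range U+3041..U+3096 lines up with the Katakana range
U+30A1..U+30F6 at a fixed offset of 0x60.  The helper names fold_kana
and kana_insensitive_coll are made up for this example and are not part
of any library.  Halfwidth forms such as U+FF71 are not handled, since
they would need a lookup table or a real normalization library.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Map one Hiragana code point (U+3041..U+3096) to the Katakana code
   point 0x60 above it; return every other character unchanged. */
static wchar_t
fold_kana (wchar_t c)
{
  if (c >= 0x3041 && c <= 0x3096)
    return (wchar_t) (c + 0x60);
  return c;
}

/* Hypothetical helper: compare two wide strings with wcscoll after
   folding Hiragana to Katakana.  Returns -1, 0 or 1, because only the
   sign of wcscoll's result is meaningful.  Returns -2 if memory
   allocation fails. */
static int
kana_insensitive_coll (const wchar_t *a, const wchar_t *b)
{
  size_t la = wcslen (a), lb = wcslen (b), i;
  wchar_t *fa = malloc ((la + 1) * sizeof *fa);
  wchar_t *fb = malloc ((lb + 1) * sizeof *fb);
  int r;

  if (fa == NULL || fb == NULL)
    {
      free (fa);
      free (fb);
      return -2;
    }
  for (i = 0; i <= la; i++)   /* <= also copies the terminating L'\0' */
    fa[i] = fold_kana (a[i]);
  for (i = 0; i <= lb; i++)
    fb[i] = fold_kana (b[i]);
  r = wcscoll (fa, fb);       /* uses the current LC_COLLATE locale */
  free (fa);
  free (fb);
  return (r > 0) - (r < 0);
}
```

With this, kana_insensitive_coll (L"\x3042", L"\x30a2") compares two
identical folded strings and so yields 0 in any locale, while the
ordering of everything else is still left to the locale selected with
setlocale.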


Corinna


-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


