Re: Japanese and Unicode


From: Per Bothner <bothner@cygnus.com>
Subject: Re: Japanese and Unicode 
Date: Tue, 21 Oct 1997 12:16:39 -0700

> > However, (currently) I'm not sure that Unicode is the solution for
> > multilingual text handling.
> 
> It is not.  That is a much more complex problem.  However, using
> Unicode does (I think) make that problem a bit easier than
> alternatives (such as Mule).

Unicode was never meant to provide everything you need for multilingual
text handling.  Instead you have to tag the text somehow to provide the
kind of information which cannot reasonably be expressed in Unicode (or
in character sets in general).  Take SGML (with DocBook) as an example.
Here you can write

	just tell him, <QUOTE><FOREIGNPHRASE LANG="fa_af">Besyar nawaqt 
	nawaqt nist</FOREIGNPHRASE>.</QUOTE>

(This is cut & pasted from the DocBook manual.)  The tags provide the
missing information, and Unicode was designed with this in mind right
from the beginning.  With markup like this a rendering engine has
enough information to choose the correct glyph for a Unicode code
point.
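
To make this concrete, here is a minimal sketch in C (my own
illustration, not code from any real renderer; the font names and the
Han range check are assumptions made only for the example) of how a
renderer could combine a code point with the language tag taken from
the markup:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical example: pick a glyph source for a code point using
       the language tag from the surrounding markup.  The unified Han
       range is where the tag matters most, since Japanese and Chinese
       typography prefer different glyph shapes for the same code point.  */
    static const char *choose_font (unsigned int codepoint, const char *lang)
    {
      if (codepoint >= 0x4E00 && codepoint <= 0x9FFF)
        {
          if (strncmp (lang, "ja", 2) == 0)
            return "mincho";          /* assumed Japanese-style font */
          if (strncmp (lang, "zh", 2) == 0)
            return "song";            /* assumed Chinese-style font */
        }
      return "default";
    }

    int main (void)
    {
      /* U+9AA8 "bone" is a single code point but gets different glyphs.  */
      printf ("%s\n", choose_font (0x9AA8, "ja"));   /* mincho */
      printf ("%s\n", choose_font (0x9AA8, "zh"));   /* song */
      return 0;
    }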

> I think the need for characters beyond 16 bits will be so rare
> that applications that need them can use the Unicode "surrogate
> mechanism".

I consider this an error based on the very same reasoning which led
to ASCII.  You must not regard the Unicode 2.0 standard as the end of
the line.  There are a lot of languages which are not yet
representable.  Among them are several Asian languages which might
again occupy a lot of room (I don't know this for sure).

For ASCII, people in the US decided what was necessary, with the
well-known problems.  Now the First World has decided that all of
Unicode is enough (it covers their languages and a few others they are
interested in).  Countries without much influence will once again be
left standing outside the door.

> The problem with that is that a single character is encoded using
> two 16-bit Unicode characters.  I think that is acceptable -
> high-quality text processing has to deal with the fact that a single
> logical (or display) character may be composed out of multiple
> Unicode characters anyway (because of accents, ligatures, combining
> characters, etc).

Well, applications certainly could handle this.  But you have to see
that if surrogates can appear in the "wide strings", these are no
longer fixed-width strings, and all the string handling functions must
be changed to deal with surrogates.  This is certainly slower than
using UCS-4 right from the beginning and never getting into this
trouble at all.

(Plus, handling 16-bit values is on many platforms slower than
reading 32-bit values.)
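
A small sketch in C of what that extra work looks like (my own
illustration, not code from Guile or Emacs): counting the characters
of a 16-bit string has to recognize surrogate pairs, while for UCS-4
the count is just the number of array elements.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Count logical characters in a 16-bit string: a high surrogate
       (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF)
       encodes a single character beyond the 16-bit range.  */
    static size_t utf16_length (const uint16_t *s, size_t nunits)
    {
      size_t len = 0;
      for (size_t i = 0; i < nunits; ++i)
        {
          if (s[i] >= 0xD800 && s[i] <= 0xDBFF
              && i + 1 < nunits
              && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;                      /* skip the second half of the pair */
          ++len;
        }
      return len;
    }

    /* With UCS-4 nothing has to be decoded: one unit per character.  */
    static size_t ucs4_length (const uint32_t *s, size_t nunits)
    {
      (void) s;
      return nunits;
    }

    int main (void)
    {
      /* U+10400 as the surrogate pair D801 DC00, followed by 'A'.  */
      uint16_t w[] = { 0xD801, 0xDC00, 0x0041 };
      uint32_t u[] = { 0x10400, 0x0041 };
      printf ("%zu %zu\n", utf16_length (w, 3), ucs4_length (u, 2));  /* 2 2 */
      return 0;
    }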

> Emacs buffers should use UTF-8 (variable-width), at least a long as
> Emacs keeps the existing buffer implementation.

This is fine.  Guile should certainly provide functionality for
handling multibyte strings.  There is also the possibility of using
Reuter's compression method, but I don't know enough about it at the
moment.
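
As a rough picture of what such multibyte string handling means, a
small sketch in C (again my own illustration, nothing Guile actually
provides): the character count of a UTF-8 string is obtained by
skipping the continuation bytes.

    #include <stdio.h>
    #include <string.h>

    /* Count characters in a UTF-8 string: every byte that is not a
       continuation byte (10xxxxxx) starts a new character.  */
    static size_t utf8_length (const char *s)
    {
      size_t len = 0;
      for (; *s != '\0'; ++s)
        if (((unsigned char) *s & 0xC0) != 0x80)
          ++len;
      return len;
    }

    int main (void)
    {
      /* "Tokyo" written with two kanji: six bytes, two characters.  */
      const char *s = "\xE6\x9D\xB1\xE4\xBA\xAC";
      printf ("%zu bytes, %zu characters\n", strlen (s), utf8_length (s));
      return 0;
    }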

-- Uli
---------------.      drepper at gnu.org  ,-.   Rubensstrasse 5
Ulrich Drepper  \    ,-------------------'   \  76149 Karlsruhe/Germany
Cygnus Solutions `--' drepper at cygnus.com   `------------------------