This is the mail archive of the guile@cygnus.com mailing list for the guile project.



Re: mbstrings



> From:    Jim Blandy <jimb@red-bean.com>
> Date:    Tue, 14 Oct 1997 16:37:18 -0400
> MULE, if I recall correctly,
> allows you to switch between several 16-bit encodings.

Mule uses a variable-length encoding internally, and converts it to and
from various external encodings when doing I/O.  Most character sets are
distinguished by a special byte called the 'leading character'.

Here's an excerpt from the Mule info documentation (the original is in Japanese):

Type 1-1: ASCII character set
   Character codes from 0x00 to 0x7f.  Saved literally.

Type 1-2: One-byte character sets other than ASCII
   Saved with the leading character 'LC1'.  (i.e. it takes 2 bytes per character)

Type 1-3: Private one-byte character set
   Saved with two leading characters, 'LCPRV1' and 'LC12'.

Type 2-3: Two-byte character set
   Saved with the leading character 'LC2'.  (i.e. it takes 3 bytes per character)

Type 2-4: Private two-byte character set
   Saved with two leading characters, 'LCPRV2' and 'LC22'.

Type 3-4: Three-byte character set
   Saved with the leading character 'LC3'.  (i.e. it takes 4 bytes per character)

Type N: Arbitrary-length character set (composite character set)
   Starts with 'LCCMP', and each byte is saved with the leading character
   'LCNn'.
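
To make the byte counts in the list above concrete, here is a rough
sketch in Python.  It only models how many bytes each type takes, as I
read the excerpt; the table and function names are my own, and it does
not use any real Mule leading-character values.  Type N is left out
because its length is not fixed.

```python
# Illustrative only: the type names come from the excerpt; the table layout
# and the function names are assumptions, not part of Mule.
# Each entry is (number of leading bytes, number of character-code bytes).
MULE_TYPES = {
    "1-1": (0, 1),  # ASCII, saved literally
    "1-2": (1, 1),  # LC1 + 1 byte
    "1-3": (2, 1),  # LCPRV1 LC12 + 1 byte
    "2-3": (1, 2),  # LC2 + 2 bytes
    "2-4": (2, 2),  # LCPRV2 LC22 + 2 bytes
    "3-4": (1, 3),  # LC3 + 3 bytes
}


def encoded_length(type_name):
    """Bytes taken by one character of the given type."""
    leading, code = MULE_TYPES[type_name]
    return leading + code


def buffer_length(type_names):
    """Total bytes for a sequence of characters, given each one's type."""
    return sum(encoded_length(t) for t in type_names)
```

For example, buffer_length(["1-1", "2-3", "1-1"]) adds up one ASCII
byte, three bytes for a two-byte-set character, and one more ASCII
byte.  The second number in each type name appears to be this total.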

I've heard that one drawback of Unicode is that it's not organized well
for converting to and from existing character sets.  (Unicode assigns
codes based on character shape, but the existing Chinese and Japanese
character encodings are completely different from each other even
though they share many characters of the same shape, which means you
need a big lookup table for conversion.)
I'm not an expert in this field, though.  Please correct me if this is wrong.
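
As a small illustration of the point about unrelated codes, the sketch
below encodes one ideograph that Chinese and Japanese share in three
existing national encodings.  It assumes a Python interpreter with its
standard codecs ("euc_jp", "gb2312", "big5"); the helper name is mine.
It only shows that the byte values differ, which is why conversion
ends up being table-driven.

```python
# One shared ideograph (U+4E2D, "middle") in three legacy encodings.
# The codec names are the ones shipped with Python's standard library;
# legacy_codes is a hypothetical helper for this illustration.
CHAR = "\u4e2d"

LEGACY_CODECS = ["euc_jp", "gb2312", "big5"]


def legacy_codes(ch):
    """Return {codec name: encoded bytes} for a single character."""
    return {name: ch.encode(name) for name in LEGACY_CODECS}


if __name__ == "__main__":
    for name, raw in legacy_codes(CHAR).items():
        print(name, raw.hex())
```

Each codec gives a different two-byte value for the same character, and
there is no arithmetic rule that maps one to another.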

> I need to consult one more guru, but I'm leaning towards using 16-bit
> characters everywhere internally, and providing convenient conversion
> functions.

Most applications use just one encoding of their own, so a 16-bit
representation is enough.  But if you want to deal with multiple
languages at once, as Mule does, and you want fixed-length characters,
you probably need either a 32-bit representation or Unicode with
big conversion tables.
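
To show what I mean by a 32-bit fixed-length representation, here is a
minimal sketch.  The layout (an 8-bit character set id in the top byte
and up to 24 bits of character code below it) is purely my assumption
for illustration; it is not how Mule or Guile stores characters.

```python
# Hypothetical fixed-width layout, assumed for illustration only:
# top 8 bits = character set id, low 24 bits = character code.
CODE_BITS = 24
CODE_MASK = (1 << CODE_BITS) - 1


def pack_char(charset_id, code):
    """Combine a character set id and a code into one 32-bit integer."""
    if not 0 <= charset_id < 256:
        raise ValueError("charset_id must fit in 8 bits")
    if not 0 <= code <= CODE_MASK:
        raise ValueError("code must fit in 24 bits")
    return (charset_id << CODE_BITS) | code


def unpack_char(packed):
    """Split a packed value back into (charset_id, code)."""
    return packed >> CODE_BITS, packed & CODE_MASK
```

A two-byte code already fills 16 bits by itself, so once a character
set tag is added the value no longer fits in a 16-bit character.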

--
Shiro KAWAI
  Square USA Inc.   Honolulu Studio, R&D division
#"The most important things are the hardest things to say" --- Stephen King