This is the mail archive of the guile@cygnus.com mailing list for the guile project.



Re: mbstrings


>> Instead I think Guile should just define what it wants to use
>> internally (some encoding of Unicode, IMHO), and provide convenient
>> ways to convert to/from other encodings.

Mikael> Yes, but I think we should choose a byte size version.  There
Mikael> is some "palette switching" byte version of Unicode, isn't
Mikael> there?  Of course, this has the disadvantage of complicating
Mikael> the handling of string length.


I'm not sure what you mean by "palette switching".

I think the most common encoding of Unicode is called UTF-8.  It has
these properties:

* Variable length encoding
* ASCII is represented in 1 byte, other characters in more bytes
* You can tell the length of a character (in bytes) from its first
  byte
* The byte values used for ASCII characters never appear inside the
  encoding of a non-ASCII character

Is this the one you meant?
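
To make the third property concrete, here is a rough sketch in C.  The
name utf8_seq_len is made up for illustration and is not anything in
Guile.  It assumes the current definition of UTF-8, where a character
takes at most 4 bytes, and it does not check for overlong forms.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Guile.  Return the length in bytes
   of the UTF-8 sequence whose first byte is C, or 0 if C is a
   continuation byte (10xxxxxx) or has the form 11111xxx.  Overlong
   lead bytes such as 0xC0 are not rejected here.  */
static size_t
utf8_seq_len (unsigned char c)
{
  if (c < 0x80)
    return 1;                   /* 0xxxxxxx: plain ASCII */
  if ((c & 0xE0) == 0xC0)
    return 2;                   /* 110xxxxx */
  if ((c & 0xF0) == 0xE0)
    return 3;                   /* 1110xxxx */
  if ((c & 0xF8) == 0xF0)
    return 4;                   /* 11110xxx */
  return 0;                     /* continuation byte or invalid lead */
}
```

Every byte of a multi-byte sequence is 0x80 or above, which is what
the last property amounts to: a plain byte search for an ASCII
character such as '/' can't match in the middle of another character.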


The other really common encoding uses wide characters, i.e. the same
fixed number of bytes (typically two) for every character.  (At least,
this is my understanding.)


I tend to think that UTF-8 would be a good choice.  However, I'm not
fully aware of all the issues surrounding the choice of encoding.
Still, isn't that what Java uses, at least for the strings in its
class files?


I wouldn't worry too much about string-length.  Maybe it would be
possible for strings to keep track of their length in characters
instead of in bytes (or both -- I don't know enough about the
internals to say which makes more sense).  Or maybe string-length is
called rarely enough that the length needn't be cached at all.
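
Both options are cheap to sketch in C, assuming UTF-8 as the internal
encoding.  The names utf8_char_count, struct u8string and u8string_make
are hypothetical; this is not how Guile's strings are laid out.
Counting characters only requires skipping continuation bytes, and a
string header could cache the result next to the byte length.

```c
#include <stddef.h>

/* Count the characters in the first NBYTES bytes of S by counting
   every byte that is not a continuation byte (10xxxxxx).  Assumes S
   holds valid UTF-8.  */
static size_t
utf8_char_count (const char *s, size_t nbytes)
{
  size_t i, n = 0;

  for (i = 0; i < nbytes; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      n++;
  return n;
}

/* Hypothetical string header that keeps both lengths.  */
struct u8string
{
  const char *bytes;            /* UTF-8 data, not owned */
  size_t nbytes;                /* length in bytes */
  size_t nchars;                /* length in characters, cached */
};

/* Build a header for BYTES, computing the character count once.  */
static struct u8string
u8string_make (const char *bytes, size_t nbytes)
{
  struct u8string s;

  s.bytes = bytes;
  s.nbytes = nbytes;
  s.nchars = utf8_char_count (bytes, nbytes);
  return s;
}
```

With the cache, string-length is a field read; without it, it is one
linear pass over the bytes.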


>> Nonstandard encodings shouldn't be "strings".
Mikael> Or simply a string that the programmer knows contains a
Mikael> nonstandard encoding.

The problem is that if string? is true for something, then you can
reasonably expect to do string-like operations on it.  But depending
on the encoding, this might lose.  E.g., what does string-append mean
when the arguments are in different encodings?  Or what does
string-length mean when Guile doesn't know the encoding (and thus
can't compute the length)?

Making nonstandard encodings be something other than strings
emphasizes their difference.


Another way to go would be to support as many encodings as possible,
have the encoding be a property of the string, and allow transparent
switching between encodings when it makes sense (e.g. string-append).  I
don't like this as much, though.
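
For what it's worth, here is roughly what that alternative might look
like in C, cut down to two encodings.  Everything in it (enum encoding,
struct tagged_string, copy_as_utf8, tagged_append) is invented for the
sake of the example, and the choice to always produce a UTF-8 result is
just an assumption.

```c
#include <stdlib.h>
#include <string.h>

enum encoding { ENC_UTF8, ENC_LATIN1 };

/* Hypothetical string that carries its encoding as a property.  */
struct tagged_string
{
  enum encoding enc;
  size_t nbytes;
  char *bytes;                  /* not NUL-terminated */
};

/* Copy SRC_LEN bytes of SRC, encoded as ENC, into DST as UTF-8 and
   return the number of bytes written.  DST needs room for up to
   2 * SRC_LEN bytes.  */
static size_t
copy_as_utf8 (char *dst, const char *src, size_t src_len,
              enum encoding enc)
{
  size_t i, out = 0;

  if (enc == ENC_UTF8)
    {
      memcpy (dst, src, src_len);
      return src_len;
    }
  for (i = 0; i < src_len; i++)
    {
      unsigned char c = (unsigned char) src[i];

      if (c < 0x80)
        dst[out++] = (char) c;  /* ASCII is unchanged */
      else
        {
          /* Latin-1 0x80..0xFF becomes a 2-byte sequence.  */
          dst[out++] = (char) (0xC0 | (c >> 6));
          dst[out++] = (char) (0x80 | (c & 0x3F));
        }
    }
  return out;
}

/* Append B to A, converting as needed.  The result is always UTF-8;
   its bytes field is NULL if allocation failed.  The caller frees
   the result's bytes.  */
static struct tagged_string
tagged_append (const struct tagged_string *a,
               const struct tagged_string *b)
{
  struct tagged_string r;

  r.enc = ENC_UTF8;
  r.nbytes = 0;
  r.bytes = malloc (2 * (a->nbytes + b->nbytes) + 1);
  if (r.bytes == NULL)
    return r;
  r.nbytes = copy_as_utf8 (r.bytes, a->bytes, a->nbytes, a->enc);
  r.nbytes += copy_as_utf8 (r.bytes + r.nbytes, b->bytes, b->nbytes,
                            b->enc);
  return r;
}
```

Even this toy version has to pick a result encoding and grows one
conversion case per supported encoding, which is part of why I prefer
a single internal encoding.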

Mikael> It would be really great if you could remove mbstrings.[ch].
Mikael> But of course Jim has to give his view on this first.

If he gives the OK, I'll remove it.

Tom
-- 
tromey@cygnus.com                 Member, League for Programming Freedom