This is the mail archive of the kawa@sourceware.org mailing list for the Kawa project.


character, strings, and unicode


I am thinking it might be a good idea to prioritize "full"
R7RS support, i.e. so we can claim that Kawa is an R7RS
implementation with some limitations (most obviously, full
continuations will still be missing).  The plan would be to
announce this as "Kawa 2.0".  I think it would be a good excuse
for some publicity etc.

One of the issues we need to tackle is that Kawa doesn't handle
full Unicode properly.  First some terminology:

A 'code point' is a character encoded as an integer in the
range 0 to #x10FFFF.
A 'code unit' is an implementation unit.  A code point is represented
by 1 or more code units.  In the Java world, a code unit is a 16-bit char.
The Basic Multilingual Plane (BMP) contains those characters whose code
point value is at most #xFFFF - i.e. the ones that fit in a single 16-bit char.
A 'non-BMP character' is any other character.  In the Java world (and also
in the Microsoft and JavaScript worlds) these are represented using two
16-bit 'surrogate' code units.  This encoding is called UTF-16.
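
To make the terminology concrete, here is a small plain-Java
illustration (nothing Kawa-specific; U+1F600 is just an arbitrary
non-BMP example):

    public class CodeUnitDemo {
        public static void main(String[] args) {
            String bmp = "A";                                       // U+0041, inside the BMP
            String nonBmp = new String(Character.toChars(0x1F600)); // U+1F600, outside the BMP
            System.out.println(bmp.length());     // 1 - one code unit
            System.out.println(nonBmp.length());  // 2 - a surrogate pair of code units
            System.out.println(nonBmp.codePointCount(0, nonBmp.length())); // 1 - one code point
        }
    }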

The problem: the various R7RS string and character functions are
defined in terms of 21-bit code points, not 16-bit code units.
This makes sense, since the former are the units of meaning (to
the extent that a character has meaning), while the latter are
an implementation detail.  For example (string-ref str i) is supposed
to return the i'th code point in str.  However, the current implementation
returns the i'th code unit, because that is easy and efficient.

It is easy to fix the implementation of string-ref, but it is hard to
do so efficiently, because UTF-16 is a variable-length encoding.
The obvious implementation is O(N), where N is the index.  If we
want to maintain compatibility with Java (e.g. use java.lang.String
and CharSequence) we may not be able to do better.
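
For example, an obvious code-point-based string-ref can be written
with the standard java.lang.Character helpers.  This is only a sketch
of the idea (the class and method names are mine, not Kawa's), and it
is O(N) because offsetByCodePoints itself scans from the start:

    class StringRefSketch {
        // Sketch: return the i'th code point of seq, scanning code units from the start.
        static int stringRef(CharSequence seq, int i) {
            int offset = Character.offsetByCodePoints(seq, 0, i); // steps over surrogate pairs
            return Character.codePointAt(seq, offset);            // combines a pair if present
        }
    }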

So this is the plan:

* Kawa currently uses gnu.text.Char as the representation of
a character.  This is the binding of the 'character' type.  I am
working on changing the 'character' type to a new primitive type
represented using 'int'.  To the JVM and to Java a 'character'
is just an int.  However, when a character is converted to an
object (i.e. "boxed") then a gnu.text.Char is created, rather than
a java.lang.Integer (or java.lang.Character).
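
The intended boxing behavior is roughly analogous to the following toy
sketch (CharBox is a made-up stand-in for gnu.text.Char, used only to
show that boxing yields a character-specific object rather than an
Integer):

    // Toy illustration only; the real boxed class is gnu.text.Char.
    final class CharBox {
        final int codePoint;                     // the unboxed form is a plain int
        private CharBox(int codePoint) { this.codePoint = codePoint; }
        static CharBox valueOf(int codePoint) {  // boxing: int -> CharBox
            return new CharBox(codePoint);
        }
        int intValue() { return codePoint; }     // unboxing: CharBox -> int
    }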

* Note this means we're adding a new primitive type, with its
own conversion rules.  The Kawa compiler has to represent this in
the class file.  For example a function that returns a character
is compiled to a method that returns an int, but with a new Annotation
to specify that the result is actually a character.  When Kawa later
imports the compiled module, and comes upon a call to this function,
it uses the Annotation so it knows the result is actually a character.
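
For illustration, the compiled form might look something like this
(the annotation name and shape are purely hypothetical; the real one
is whatever the Kawa compiler defines):

    // Hypothetical annotation recording the Scheme-level result type in the class file.
    @interface SchemeType {
        String value();
    }

    class CompiledModule {
        @SchemeType("character")                // the int returned here is really a character
        static int firstChar(CharSequence str) {
            return Character.codePointAt(str, 0);
        }
    }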

Adding this support is obviously non-trivial.  However, it enables
better type support in various ways: Kawa can distinguish between
a native Java class and a Kawa type that happens to be implemented
with that class - but until now Kawa hasn't been able to encode this
distinction in compiled code.

Adding support for non-builtin primitive types allows other possibilities.
For example we can have 'unsigned-int'.  The values in a bytevector
could be all 'unsigned-byte'.

* In addition to the 'character' type becoming a primitive type,
we also add 'character-or-eof'.  This is a union of characters
with the eof value.  The latter is represented as -1.  The return
type of read-char becomes character-or-eof, represented as an int.
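
This is the same convention java.io.Reader.read() already uses (an int
that is either a character or -1 at end of stream), except that the
result is a full code point.  A sketch, not the actual Kawa
implementation:

    class ReadCharSketch {
        // Sketch: returns a code point, or -1 for the eof value.
        static int readChar(java.io.Reader in) throws java.io.IOException {
            int c = in.read();
            if (c < 0 || !Character.isHighSurrogate((char) c))
                return c;                 // eof (-1) or a BMP character
            int low = in.read();          // combine a surrogate pair into one code point
            // (A robust version would also check isLowSurrogate here.)
            return low < 0 ? c : Character.toCodePoint((char) c, (char) low);
        }
    }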

* To catch our breath:  This change makes characters more efficient,
since we can often avoid object allocation.  The hope is this may
partly make up for less efficient strings.

* The general Kawa string type is and will continue to be the
java.lang.CharSequence interface.  Kawa-specific string implementations
will also implement gnu.lists.CharSeq, which extends CharSequence,
and also extends Sequence<gnu.text.Char>.

* The CharSeq interface gets a new method:
  int characterAt(int index);
The default implementation is to just linearly scan using charAt,
but custom implementations may be able to do better.  For example
if the implementing class maintains a flag saying the string only has
BMP characters, then we can just use a single charAt.
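
As a toy example of both cases (not the actual gnu.lists.CharSeq code;
the class here is invented for illustration):

    // Toy CharSequence with a characterAt method and a BMP-only fast path.
    class SimpleCharSeq implements CharSequence {
        final String data;
        final boolean bmpOnly;   // no surrogate pairs => code-point index == code-unit index
        SimpleCharSeq(String data) {
            this.data = data;
            this.bmpOnly = data.length() == data.codePointCount(0, data.length());
        }
        public int characterAt(int index) {
            if (bmpOnly)
                return data.charAt(index);                  // a single O(1) charAt
            int offset = data.offsetByCodePoints(0, index); // linear scan over code units
            return data.codePointAt(offset);
        }
        public char charAt(int i) { return data.charAt(i); }
        public int length() { return data.length(); }
        public CharSequence subSequence(int from, int to) { return data.subSequence(from, to); }
    }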

* string-ref takes a java.lang.CharSequence and an int, and returns
a character (represented as an int).  In the general case, string-ref
has to do a linear scan of the CharSequence, using charAt calls.
If the string is a CharSeq, it can use the characterAt method.

* string-set! is trickier, because it can change the size of the
underlying string when measured using String.length().  I.e. replacing
a BMP character with a non-BMP character or vice versa changes
the size.  Luckily, the FString implementation already supports
basic additions and resizing, so this is conceptually simple.
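
The effect can be seen with a plain StringBuilder, used here purely as
an illustration (FString manages its own buffer, and the method name
below is mine):

    class StringSetSketch {
        // Replace the code point starting at codeUnitOffset; the buffer may grow or shrink.
        static void setCodePointAt(StringBuilder buf, int codeUnitOffset, int newCodePoint) {
            int oldUnits = Character.charCount(buf.codePointAt(codeUnitOffset)); // 1 or 2
            buf.replace(codeUnitOffset, codeUnitOffset + oldUnits,
                        new String(Character.toChars(newCodePoint)));
        }
    }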

It is possible for FString to maintain a lookup table to allow
fast characterAt lookup.  For example we could maintain an array
to map code-point-index to code-unit-offset.   (To save space,
we'd only add an entry, say, for every 16 code points.  We also
don't create the array if there are only BMP characters.)
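
A sketch of such a table (the 16-code-point stride and all names below
are just illustrative choices, not a committed design):

    class CodePointIndex {
        // Sketch: offsets[k] is the code-unit offset of code point 16*k.
        static int[] buildIndex(CharSequence seq) {
            int[] offsets = new int[seq.length() / 16 + 1];
            int codePoint = 0, offset = 0, used = 0;
            while (offset < seq.length()) {
                if (codePoint % 16 == 0)
                    offsets[used++] = offset;
                offset += Character.charCount(Character.codePointAt(seq, offset));
                codePoint++;
            }
            return java.util.Arrays.copyOf(offsets, used);
        }
        // With the table, a lookup only scans within a 16-code-point window.
        static int characterAt(CharSequence seq, int[] offsets, int index) {
            int offset = Character.offsetByCodePoints(seq, offsets[index / 16], index % 16);
            return Character.codePointAt(seq, offset);
        }
    }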

* If the FString class supports re-sizing, we might as well add
some Scheme methods for this.  For example this seems very useful:
    (string-append! str char)
More general StringBuffer-style methods may also be worth considering.
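
For what it's worth, appending a character maps directly onto an
existing StringBuilder operation, which already handles the
surrogate-pair encoding:

    public class AppendSketch {
        public static void main(String[] args) {
            // Java-level analogue of (string-append! str char) on a mutable buffer.
            StringBuilder buf = new StringBuilder("abc");
            buf.appendCodePoint(0x1F600);      // appends two code units for a non-BMP character
            System.out.println(buf.length());  // 5 code units, though only 4 code points
        }
    }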

* For fast O(1) string reading we add some new routines, based on:
https://code.google.com/p/chibi-scheme/source/browse/doc/chibi.scrbl
(Search for 'subsection{Unicode}'.)

A 'cursor' is just an int, but it represents a code unit offset
rather than a code point index.

(string-cursor-start str)
   returns a start cursor for the string
   ;; 0 in the Kawa implementation
(string-cursor-end str)
   returns a cursor one past the last valid cursor
   ;; str.length() in the Kawa implementation
(string-cursor-ref str cursor)
   get the char at the given cursor
   ;; str.charAt(cursor)
   ;; if the char is a surrogate, also get following
   ;; char, and combine them.
(string-cursor-next str cursor)
   increment to the next cursor
   ;; (+ cursor 1) or (+ cursor 2) if pointing at surrogate
(string-cursor-prev str cursor)
   decrement to the previous cursor
   ;; (- cursor 1) or (- cursor 2)
(substring-cursor str cs1 [cs2])
   take a substring from the given cursors
   ;; str.subSequence(cs1, cs2)
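
In Java terms each of these reduces to a line or two; a sketch
following the comments above (plain Java, not the actual Kawa code):

    class CursorSketch {
        // Cursors are plain ints counting code units.
        static int cursorStart(CharSequence str) { return 0; }
        static int cursorEnd(CharSequence str) { return str.length(); }
        static int cursorRef(CharSequence str, int cursor) {
            return Character.codePointAt(str, cursor);  // combines a surrogate pair if present
        }
        static int cursorNext(CharSequence str, int cursor) {
            return cursor + Character.charCount(Character.codePointAt(str, cursor));
        }
        static int cursorPrev(CharSequence str, int cursor) {
            return cursor - Character.charCount(Character.codePointBefore(str, cursor));
        }
        static CharSequence substringCursor(CharSequence str, int cs1, int cs2) {
            return str.subSequence(cs1, cs2);
        }
    }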

We might want to add a few methods for string updating
(for example string-cursor-set!) and incrementing:
(string-cursor-increment str cursor count)

* The compiler could potentially optimize loops that
iterate over a string.  string-ref can be inlined to
a combination of string-cursor-increment and string-cursor-ref.
Then string-cursor-increment could be replaced by
string-cursor-next using a form of 'strength reduction'.
(I'm not planning on doing this optimization for Kawa 2.0.)

* For string-processing without the extra scan overhead,
note you can also use string ports (as returned by
open-input-string) and string-map/string-for-each (coming soon).
--
	--Per Bothner
per@bothner.com   http://per.bothner.com/

