This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.
Re: Multibyte encoding, scm_mb_supported_charset_p

To: forcer <forcer@mindless.com>
Subject: Re: Multibyte encoding, scm_mb_supported_charset_p
From: Jim Blandy <jimb@red-bean.com>
Date: 13 Sep 1999 12:45:00 -0500
Cc: guile@sourceware.cygnus.com
References: <199909131255.OAA01480@forcix.roof.lan>

Oh dear --- somebody's actually been watching.  I guess now's a good
time to announce it.

The guys in Japan who did Emacs's multilingual support invited me to
come out for a month and work on adding multilingual support to Guile.
So I did, and had a great time.

Here's the background:

The multilingual model I want Guile to use is very simple: you just
have a lot more different characters than you used to.  Strings can
suddenly contain Chinese characters, Greek characters, etc.  Similarly
for symbol names.  For Scheme code, nothing else should change.

How you actually exchange these characters with the outside world is
more complicated.  For any I/O, you need to know the appropriate
external encoding to use (ISO 8859 Latin-1; Shift-JIS; whatever) so
you can read and write these characters.  I've got a design in mind
(undocumented and untried) that separates who you're talking to (a
file; a socket) from the encoding you use.  There'll be variables you
can set to choose a default encoding to use in your files and
filenames; their default values will be chosen based on the system's
locale, language, etc.

This sounds like a mess, but all we're really doing is trying to
acknowledge the complexity and chaos that's already out there.  In
practice, for users in the Western world, things are going to pretty
much default to Latin-1 and you won't notice any change in behavior.
Guile just won't get certain things wrong any more, and some old
problems will actually have solutions now.

Okay, so strings will now be able to hold multilingual text.  How are
we going to implement this?

Guile will use a single representation for text throughout.  If you
are writing C code that deals with Guile strings, you only have to
make it work once.  You have one set of properties to learn.

Guile will use a variable-width encoding for characters.  See ``Why
Guile Does Not Use a Fixed-Width Encoding''.  If that doesn't persuade
you, note that Tcl (for sure) and Perl (I think) are doing the same.
I recognize that this is controversial, but this decision stands.

For now, Guile will use Emacs 20.4's encoding for multilingual text,
called Emacs-Mule.  One of Guile's primary purposes is to support GNU
Emacs.  Choosing a different representation from Emacs would make this
integration very clumsy, if not impossible.

However, the rest of the world is going with ISO 10646 (a superset of
Unicode) encoded with UTF-8.  In particular, GTk is going this
direction; working with GTk is very important.  Maybe GNOME is, too;
GNOME is very important.  In recent discussion, the Emacs folks
(including Stallman) have decided to switch to ISO 10646/UTF-8 too
(with some simple extensions).  So Guile will switch to using Unicode
and UTF-8 internally, when Emacs does.

It turns out that, although Emacs-Mule and UTF-8 are radically
different character sets, they have a lot of very helpful properties
in common that make them pretty easy to deal with.  Most code in Guile
will not actually need to change.

To make the Emacs-Mule -> UTF-8 transition easier to deal with, we are
going to isolate all knowledge of the actual character set in a few
modules dedicated to that purpose.  The rest of Guile can remain
ignorant of the details.  When we switch, we can change those modules,
and if we've done our job well, everything else will just work.

The document below, which I hope to merge into the Guile manual when
it's done, describes the interface between these multilingual support
modules and the rest of Guile.  Libguile users will use the same
interface as Guile itself, so we can be sure it's got everything we
need.

This interface will change in small ways as I actually start using it
to convert the rest of Guile, and discover ways it could be made
more accurate or easier to use.  I may add small helpful
functions.  But the essentials won't change.

The exception is the conversion interface, which I suspect is too
general.

This is only the C->C interface.  This doesn't say anything about the
Scheme-level interface, which is still unformed.  That will come in a
different chapter.

This is documented in the guile-doc module; see
http://www.gnu.org/software/guile/anon-cvs.html for instructions on
checking it out.  It's in guile-doc/ref/mbapi.texi.  The docs for the
Scheme-level interface will appear there, too.

I'm doing this work on a CVS branch called jimb_mb_branch_1; you'll
need to do a "cvs checkout -r jimb_mb_branch_1 guile-core" to see it.
I'll be re-synching this branch with the trunk from time to time, and
starting new branches; each will be named jimb_mb_branch_N.

The most interesting parts are the intro, ``Why Guile Does Not Use a
Fixed-Width Encoding'' and ``Promised Properties of the Guile
Multibyte Encoding''.



Working With Multibyte Strings in C
***********************************

   Guile allows strings to contain characters drawn from a wide variety
of languages, including many Asian, Eastern European, and Middle Eastern
languages, in a uniform and unrestricted way.  The string representation
normally used in C code -- an array of ASCII characters -- is not
sufficient for Guile strings, since they may contain characters not
present in ASCII.

   Instead, Guile uses a very large character set, and encodes each
character as a sequence of one or more bytes.  We call this
variable-width encoding a "multibyte" encoding.  Guile uses this single
encoding internally for all strings, symbol names, error messages,
etc., and performs appropriate conversions upon input and output.

   The use of this variable-width encoding is almost invisible to Scheme
code.  Strings are still indexed by character number, not by byte
offset; `string-length' still returns the length of a string in
characters, not in bytes.  `string-ref' and `string-set!' are no longer
guaranteed to be constant-time operations, but Guile uses various
strategies to reduce the impact of this change.

   However, the encoding is visible via Guile's C interface, which gives
the user direct access to a string's bytes.  This chapter explains how
to work with Guile multibyte text in C code.  Since variable-width
encodings are clumsier to work with than simple fixed-width encodings,
Guile provides a set of standard macros and functions for manipulating
multibyte text to make the job easier.  Furthermore, Guile makes some
promises about the encoding which you can use in writing your own text
processing code.

   While we discuss guaranteed properties of Guile's encoding, and
provide functions to operate on its character set, we do not actually
specify either the character set or encoding here.  This is because we
expect both of them to change in the future: currently, Guile uses the
same encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU
Emacs as well) to use Unicode and UTF-8, with some extensions.  This
will make it more comfortable to use Guile with other systems which use
UTF-8, like the GTk user interface toolkit.

Multibyte String Terminology
============================

   In the descriptions which follow, we make the following definitions:
"byte"
     A "byte" is a number between 0 and 255.  It has no inherent textual
     interpretation.  So 65 is a byte, not a character.

"character"
     A "character" is a unit of text.  It has no inherent numeric value.
     `A' and `.' are characters, not bytes.  (This is different from
     the C language's definition of "character"; in this chapter, we
     will always use a phrase like "the C language's `char' type" when
     that's what we mean.)

"character set"
     A "character set" is an invertible mapping between numbers and a
     given set of characters.  ASCII is a character set assigning
     characters to the numbers 0 through 127.  It maps `A' onto the
     number 65, and `.' onto 46.

     Note that a character set maps characters onto numbers, *not
     necessarily* onto bytes.  For example, the Unicode character set
     maps the Greek lower-case `alpha' character onto the number 945,
     which is not a byte.

     (This is what Internet standards would call a "coded character
     set".)

"encoding"
     An encoding maps numbers onto sequences of bytes.  For example, the
     UTF-8 encoding, defined in the Unicode Standard, would map the
     number 945 onto the sequence of bytes `206 177'.  When using the
     ASCII character set, every number assigned also happens to be a
     byte, so there is an obvious trivial encoding for ASCII in bytes.

     (This is what Internet standards would call a "character encoding
     scheme".)

   Thus, to turn a character into a sequence of bytes, you need a
character set to assign a number to that character, and then an
encoding to turn that number into a sequence of bytes.

   Likewise, to interpret a sequence of bytes as a sequence of
characters, you use an encoding to extract a sequence of numbers from
the bytes, and then a character set to turn the numbers into characters.

   Errors can occur while carrying out either of these processes.  For
example, under a particular encoding, a given string of bytes might not
correspond to any number.  For instance, the byte sequence `128 128' is
not a valid encoding of any number under UTF-8.
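
   The encoding step can be sketched in a few lines of C.  This is only
an illustration of UTF-8 as the outside world defines it, restricted to
numbers below 65536; the name `utf8_encode_bmp' is hypothetical and is
not part of libguile, and nothing here describes Guile's own internal
encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libguile: encode the number N as
   UTF-8 into OUT, which must have room for 3 bytes.  Only numbers
   from 0 to 0xFFFF are handled, and the surrogate range
   0xD800..0xDFFF is rejected.  Return the number of bytes written,
   or 0 if N is out of range.  */
size_t
utf8_encode_bmp (long n, unsigned char *out)
{
  assert (out != NULL);
  if (n < 0 || n > 0xFFFF || (n >= 0xD800 && n <= 0xDFFF))
    return 0;
  if (n < 0x80)
    {
      /* ASCII range: the number is its own encoding.  */
      out[0] = (unsigned char) n;
      return 1;
    }
  if (n < 0x800)
    {
      /* Two bytes: 110xxxxx 10xxxxxx.  */
      out[0] = (unsigned char) (0xC0 | (n >> 6));
      out[1] = (unsigned char) (0x80 | (n & 0x3F));
      return 2;
    }
  /* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.  */
  out[0] = (unsigned char) (0xE0 | (n >> 12));
  out[1] = (unsigned char) (0x80 | ((n >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (n & 0x3F));
  return 3;
}
```

   With this sketch, `utf8_encode_bmp (945, buf)' stores the bytes
`206 177' in BUF and returns 2, matching the example above.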

   Having carefully defined our terminology, we will now abuse it.

   We will sometimes use the word "character" to refer to the number
assigned to a character by a character set, in contexts where it's
obvious we mean a number.

   Sometimes there is a close association between a particular encoding
and a particular character set.  Thus, we may sometimes refer to the
character set and encoding together as an "encoding".

Promised Properties of the Guile Multibyte Encoding
===================================================

   Internally, Guile uses a single encoding for all text.  It is
correct to write code which assumes that a string or symbol name uses
this encoding; code which makes this assumption will be portable to all
future versions of Guile, as far as we know.

   Guile's encoding has the following properties, which should make it
easier to write code which operates on it.

   Every ASCII character is encoded as a single byte from 0 to 127, in
the obvious way.  This means that a standard C string containing only
ASCII characters is a valid Guile string (except for the terminator;
Guile strings store the length explicitly, so they can contain null
characters).

   The encodings of non-ASCII characters use only bytes between 128 and
255.  That is, when we turn a non-ASCII character into a series of
bytes, none of those bytes can ever be mistaken for the encoding of an
ASCII character.  This means that you can search a Guile string for an
ASCII character using the standard `memchr' library function.  By
extension, you can search for an ASCII substring in a Guile string
using a traditional substring search algorithm -- you needn't add
special checks to verify encoding boundaries, etc.

   No character encoding is a subsequence of any other character
encoding.  (This is just a stronger version of the previous promise.)
This means that you can search for occurrences of one Guile string
within another Guile string just as if they were raw byte strings.  You
can use the stock `memmem' function (provided on GNU systems, at least)
for such searches.  If you don't need the ability to represent null
characters in your text, you can still use null-termination for
strings, and use the traditional string-handling functions like
`strlen', `strstr', and `strcat'.
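
   As a rough sketch of what these two promises buy you, the helpers
below search multibyte text with nothing but `memchr' and `strstr'.
The names `find_ascii_char' and `find_substring' are hypothetical; the
code assumes only the properties promised above, so it never inspects
character boundaries itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the byte offset of the first occurrence of the ASCII
   character C in the LEN bytes at TEXT, or -1 if there is none.
   This relies on the promise that bytes 0..127 never occur inside
   the encoding of a non-ASCII character.  */
long
find_ascii_char (const unsigned char *text, size_t len, int c)
{
  const unsigned char *hit;

  assert (c >= 0 && c <= 127);
  hit = memchr (text, c, len);
  return hit ? (long) (hit - text) : -1;
}

/* Return the byte offset of the first occurrence of NEEDLE in
   HAYSTACK, or -1 if there is none.  Both arguments must be
   null-terminated, valid text in the same encoding, so neither may
   contain a null character.  */
long
find_substring (const char *haystack, const char *needle)
{
  const char *hit = strstr (haystack, needle);

  return hit ? (long) (hit - haystack) : -1;
}
```

   Using UTF-8 bytes as a stand-in for Guile's encoding, the text
`206 177 61 49' (a Greek alpha followed by `=1') yields byte offset 2
when searched for `='.  The offsets returned are byte offsets, not
character indices.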

   You can always determine the full length of a character's encoding
from its first byte.  Guile provides the macro `scm_mb_len' which
computes the encoding's length from its first byte.  Given the first
rule, you can see that `scm_mb_len (B)', for any `0 <= B <= 127',
returns 1.
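
   To make this property concrete, here is a sketch of a length lookup
and a character-skipping loop built on it.  It assumes UTF-8 as a
stand-in encoding, since this chapter deliberately does not specify
Guile's own; `utf8_len_from_first_byte' and `utf8_skip' are
hypothetical names, not libguile functions.

```c
#include <assert.h>

/* Stand-in for `scm_mb_len', assuming UTF-8: return the length in
   bytes of the encoding whose first byte is B, or 0 if B cannot be
   the first byte of any encoding.  */
int
utf8_len_from_first_byte (unsigned char b)
{
  if (b < 0x80)
    return 1;                   /* ASCII */
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return 0;                     /* continuation or never-valid byte */
}

/* Advance P past N complete characters and return the new position.
   The text at P must be valid and must contain at least N
   characters; P must point at a character boundary.  */
const unsigned char *
utf8_skip (const unsigned char *p, int n)
{
  while (n-- > 0)
    {
      int len = utf8_len_from_first_byte (*p);

      assert (len > 0);
      p += len;
    }
  return p;
}
```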

   Given an arbitrary byte position in a Guile string, you can always
find the beginning and end of the character containing that byte without
scanning too far in either direction.  This means that, if you are sure
a byte sequence is a valid encoding of a character sequence, you can
find character boundaries without keeping track of the beginning and
ending of the overall string.  This promise relies on the fact that, in
addition to storing the string's length explicitly, Guile always either
terminates the string's storage with a zero byte, or shares it with
another string which is terminated this way.

Functions for Operating on Multibyte Text
=========================================

   Guile provides a variety of functions, variables, and types for
working with multibyte text.

Basic Multibyte Character Processing
------------------------------------

   Here are the essential types and functions for working with Guile
text.  Guile uses the C type `unsigned char *' to refer to text encoded
with Guile's encoding.

   Note that any operation marked here as a "Libguile Macro" might
evaluate its argument multiple times.

 - Libguile Type: scm_char_t
     This is a signed integral type large enough to hold any character
     in Guile's character set.  All character numbers are positive.

 - Libguile Macro: scm_char_t scm_mb_get (const unsigned char *P)
     Return the character whose encoding starts at P, or -1 if P does
     not point to a valid character encoding.

 - Libguile Macro: int scm_mb_put (scm_char_t C, unsigned char *P)
     Place the encoded form of the Guile character C at P, and return
     its length in bytes.  If C is not the number of a Guile character,
     return 0.

 - Libguile Constant: int scm_mb_max_len
     The maximum length of any character's encoding, in bytes.  You may
     assume this is relatively small -- less than a dozen or so.

 - Libguile Macro: int scm_mb_len (unsigned char B)
     If B is the first byte of a character's encoding, return the full
     length of the character's encoding, in bytes.

 - Libguile Macro: int scm_mb_len_char (scm_char_t C)
     Return the length of the encoding of the character C, in bytes.

 - Libguile Function: scm_char_t scm_mb_get_func (const unsigned char
          *P)
 - Libguile Function: int scm_mb_put_func (scm_char_t C, unsigned char
          *P)
 - Libguile Function: int scm_mb_len_func (unsigned char B)
 - Libguile Function: int scm_mb_len_char_func (scm_char_t C)
     These are functions identical to the corresponding macros.  You
     can use them in situations where the overhead of a function call
     is acceptable, and the cleaner semantics of function application
     are desirable.
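
   The following sketch shows the general shape of a `scm_mb_get'-style
decoder, including the -1 result for invalid input.  It assumes UTF-8
as a stand-in encoding and a zero-terminated buffer; `my_char_t' and
`utf8_get' are hypothetical names, and the real macros may differ in
every detail.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for `scm_char_t': signed, wide enough for any character. */
typedef long my_char_t;

/* Return the character whose UTF-8 encoding starts at P, or -1 if P
   does not point to a valid encoding.  The buffer must be terminated
   by a zero byte, which stops the scan of a truncated encoding.  */
my_char_t
utf8_get (const unsigned char *p)
{
  my_char_t c, min;
  int extra, i;

  assert (p != NULL);
  if (p[0] < 0x80)
    return p[0];
  else if ((p[0] & 0xE0) == 0xC0)
    { extra = 1; c = p[0] & 0x1F; min = 0x80; }
  else if ((p[0] & 0xF0) == 0xE0)
    { extra = 2; c = p[0] & 0x0F; min = 0x800; }
  else if ((p[0] & 0xF8) == 0xF0)
    { extra = 3; c = p[0] & 0x07; min = 0x10000; }
  else
    return -1;                  /* not a valid first byte */

  for (i = 1; i <= extra; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return -1;              /* missing continuation byte */
      c = (c << 6) | (p[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and out-of-range numbers.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return -1;
  return c;
}
```

   Because valid character numbers are never negative, a caller can
test the result with a single comparison against zero.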

Finding Character Encoding Boundaries
-------------------------------------

   These are functions for finding the boundaries between characters in
multibyte text.

   Note that any operation marked here as a "Libguile Macro" might
evaluate its argument multiple times, unless the definition promises
otherwise.

 - Libguile Macro: int scm_mb_boundary_p (const unsigned char *P)
     Return non-zero iff P points to the start of a character in
     multibyte text.

     This macro will evaluate its argument only once.

 - Libguile Function: const unsigned char * scm_mb_floor (const
          unsigned char *P)
     "Round" P to the previous character boundary.  That is, if P
     points to the middle of the encoding of a Guile character, return
     a pointer to the first byte of the encoding.  If P points to the
     start of the encoding of a Guile character, return P unchanged.

 - Libguile Function: const unsigned char * scm_mb_ceiling (const
          unsigned char *P)
     "Round" P to the next character boundary.  That is, if P points to
     the middle of the encoding of a Guile character, return a pointer
     to the first byte of the encoding of the next character.  If P
     points to the start of the encoding of a Guile character, return P
     unchanged.

   Note that it is usually not friendly for functions to silently
correct byte offsets that point into the middle of a character's
encoding.  Such offsets almost always indicate a programming error, and
they should be reported as early as possible.  So, when you write code
which operates on multibyte text, you should not use functions like
these to "clean up" byte offsets which the originator believes to be
correct; instead, your code should signal a `text:not-char-boundary'
error as soon as it detects an invalid offset.  *Note Multibyte Text
Processing Errors::.
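
   The difference between repairing an offset and validating it can be
sketched as follows.  The code assumes UTF-8 as a stand-in encoding,
and all three names are hypothetical.  Unlike `scm_mb_floor',
`utf8_floor' takes an explicit start pointer to bound its backward
scan, and `check_offset' returns -1 where real Guile code would signal
`text:not-char-boundary'.

```c
#include <assert.h>
#include <stddef.h>

/* In UTF-8, continuation bytes look like 10xxxxxx; every other byte
   starts a character.  Return non-zero iff P is at such a byte.  */
int
utf8_boundary_p (const unsigned char *p)
{
  return (*p & 0xC0) != 0x80;
}

/* Round P down to a character boundary, never going before START. */
const unsigned char *
utf8_floor (const unsigned char *start, const unsigned char *p)
{
  assert (p >= start);
  while (p > start && !utf8_boundary_p (p))
    p--;
  return p;
}

/* Validate rather than repair: return 0 if OFFSET is a character
   boundary within the LEN bytes at TEXT (the end of the text counts
   as a boundary), and -1 otherwise.  */
int
check_offset (const unsigned char *text, size_t len, size_t offset)
{
  if (offset > len)
    return -1;
  if (offset == len)
    return 0;
  return utf8_boundary_p (text + offset) ? 0 : -1;
}
```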

Multibyte String Functions
--------------------------

   These functions allow you to operate on multibyte strings: sequences
of character encodings.

 - Libguile Function: int scm_mb_count (const unsigned char *P, int LEN)
     Return the number of Guile characters encoded by the LEN bytes at
     P.

     If the sequence contains any invalid character encodings, or ends
     with an incomplete character encoding, signal a `text:bad-encoding'
     error.

 - Libguile Macro: scm_char_t scm_mb_walk (unsigned char **PP)
     Return the character whose encoding starts at `*PP', and advance
     `*PP' to the start of the next character.  Return -1 if `*PP' does
     not point to a valid character encoding.

 - Libguile Function: const unsigned char * scm_mb_prev (const unsigned
          char *P)
     If P points to the middle of the encoding of a Guile character,
     return a pointer to the first byte of the encoding.  If P points
     to the start of the encoding of a Guile character, return the
     start of the previous character's encoding.

     This is like `scm_mb_floor', but the returned pointer will always
     be before P.  If you use this function to drive an iteration, it
     guarantees backward progress.

 - Libguile Function: const unsigned char * scm_mb_next (const unsigned
          char *P)
     If P points to the encoding of a Guile character, return a pointer
     to the first byte of the encoding of the next character.

     This is like `scm_mb_ceiling', but the returned pointer will always
     be after P.  If you use this function to drive an iteration, it
     guarantees forward progress.

 - Libguile Function: const unsigned char * scm_mb_index (const
          unsigned char *P, int LEN, int I)
     Assuming that the LEN bytes starting at P are a concatenation of
     valid character encodings, return a pointer to the start of the
     I'th character encoding in the sequence.

     This function scans the sequence from the beginning to find the
     I'th character, and will generally require time proportional to
     the distance from P to the returned address.

     If the sequence contains any invalid character encodings, or ends
     with an incomplete character encoding, signal a `text:bad-encoding'
     error.
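
   A counting function in the spirit of `scm_mb_count' can be written
as a single left-to-right pass.  This sketch assumes UTF-8 as a
stand-in encoding; `lead_len' and `count_chars' are hypothetical names,
the validity check is structural only (it does not reject every
ill-formed sequence), and a return value of -1 stands in for signalling
`text:bad-encoding'.

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a UTF-8 first byte, or 0 if B cannot start a
   character.  */
static int
lead_len (unsigned char b)
{
  if (b < 0x80) return 1;
  if (b >= 0xC2 && b <= 0xDF) return 2;
  if (b >= 0xE0 && b <= 0xEF) return 3;
  if (b >= 0xF0 && b <= 0xF4) return 4;
  return 0;
}

/* Return the number of characters encoded by the LEN bytes at P, or
   -1 if the bytes contain a malformed or incomplete encoding.  */
long
count_chars (const unsigned char *p, size_t len)
{
  size_t i = 0;
  long count = 0;

  assert (p != NULL || len == 0);
  while (i < len)
    {
      int n = lead_len (p[i]);
      int k;

      /* Bad first byte, or the encoding runs past the end.  */
      if (n == 0 || (size_t) n > len - i)
        return -1;
      /* Every remaining byte must be a continuation byte.  */
      for (k = 1; k < n; k++)
        if ((p[i + k] & 0xC0) != 0x80)
          return -1;
      i += (size_t) n;
      count++;
    }
  return count;
}
```

   Each byte is examined once, so the pass takes time proportional to
LEN.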

   It is common to process the characters in a string from left to
right.  However, if you fetch each character using `scm_mb_index', each
call will scan the text from the beginning, so your loop will require
time proportional to at least the square of the length of the text.  To
avoid this poor performance, you can use an `scm_mb_cache' structure
and the `scm_mb_index_cached' macro.

 - Libguile Type: struct scm_mb_cache
     This structure holds information that allows a string scanning
     operation to use the results from a previous scan of the string.
     It has the following members:
    `character'
          An index, in characters, into the string.

    `byte'
          The index, in bytes, of the start of that character.

     In other words, `byte' is the byte offset of the `character''th
     character of the string.  Note that if `byte' and `character' are
     equal, then all characters before that point must have encodings
     exactly one byte long, and the string can be indexed normally.

     All elements of a `struct scm_mb_cache' structure should be
     initialized to zero before its first use, and whenever the
     string's text changes.

 - Libguile Macro: const unsigned char *scm_mb_index_cached (const
          unsigned char *P, int LEN, int I, struct scm_mb_cache *CACHE)
 - Libguile Function: const unsigned char *scm_mb_index_cached_func
          (const unsigned char *P, int LEN, int I, struct scm_mb_cache
          *CACHE)
     This macro and this function are identical to `scm_mb_index',
     except that they may consult and update *CACHE in order to avoid
     scanning the string from the beginning.  `scm_mb_index_cached' is a
     macro, so it may have less overhead than
     `scm_mb_index_cached_func', but it may evaluate its arguments more
     than once.

     Using `scm_mb_index_cached' or `scm_mb_index_cached_func', you can
     scan a string from left to right, or from right to left, in time
     proportional to the length of the string.  As long as each
     character fetched is less than some constant distance before or
     after the previous character fetched with CACHE, each access will
     require constant time.
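
   One way such a cache might work is sketched below.  This is not
Guile's implementation: it assumes UTF-8 as a stand-in encoding and
valid input, `struct mb_cache' and `index_cached' are hypothetical
names mirroring the declarations above, and a NULL result stands in
for an error when I is past the end of the text.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the two members described for `struct scm_mb_cache'.  */
struct mb_cache
{
  int character;                /* an index, in characters */
  int byte;                     /* byte offset of that character */
};

/* Length of the UTF-8 encoding starting with B.  Returns 1 for
   unexpected bytes so that scanning always makes progress.  */
static int
char_len (unsigned char b)
{
  if (b >= 0xF0) return 4;
  if (b >= 0xE0) return 3;
  if (b >= 0xC0) return 2;
  return 1;
}

/* Return a pointer to the I'th character of the LEN bytes at P,
   starting the scan from the position recorded in CACHE and
   recording the final position there.  Return NULL if the text has
   fewer than I characters.  */
const unsigned char *
index_cached (const unsigned char *p, int len, int i,
              struct mb_cache *cache)
{
  int ch = cache->character;
  int by = cache->byte;

  assert (i >= 0);
  /* Walk forward from the cached position.  */
  while (ch < i && by < len)
    {
      by += char_len (p[by]);
      ch++;
    }
  /* Walk backward, skipping continuation bytes (10xxxxxx).  */
  while (ch > i)
    {
      do
        by--;
      while ((p[by] & 0xC0) == 0x80);
      ch--;
    }
  cache->character = ch;
  cache->byte = by;
  return ch == i ? p + by : NULL;
}
```

   The work done by each call is proportional to the distance between
the requested character and the cached one, which is what makes a
sequential scan linear overall.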

   Guile also provides functions to convert between an encoded sequence
of characters, and an array of `scm_char_t' objects.

 - Libguile Function: scm_char_t *scm_mb_multibyte_to_fixed (const
          unsigned char *P, int LEN, int *RESULT_LEN)
     Convert the variable-width text in the LEN bytes at P to an array
     of `scm_char_t' values.  Return a pointer to the array, and set
     `*RESULT_LEN' to the number of elements it contains.  The returned
     array is allocated with `malloc', and it is the caller's
     responsibility to free it.

     If the text is not a sequence of valid character encodings, this
     function will signal a `text:bad-encoding' error.

 - Libguile Function: unsigned char *scm_mb_fixed_to_multibyte (const
          scm_char_t *FIXED, int LEN, int *RESULT_LEN)
     Convert the array of `scm_char_t' values to a sequence of
     variable-width character encodings.  Return a pointer to the array
     of bytes, and set `*RESULT_LEN' to its length, in bytes.

     The returned byte sequence is terminated with a zero byte, which
     is not counted in the length returned in `*RESULT_LEN'.

     The returned byte sequence is allocated with `malloc'; it is the
     caller's responsibility to free it.

     If the text is not a sequence of valid character encodings, this
     function will signal a `text:bad-encoding' error.
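
   The multibyte-to-fixed direction might look roughly like this.  The
sketch assumes UTF-8 as a stand-in encoding and uses `long' in place of
`scm_char_t'; `decode_one' and `multibyte_to_fixed' are hypothetical
names, the decoder checks structure only (it does not reject overlong
forms), and a NULL result stands in for signalling
`text:bad-encoding'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Decode one UTF-8 character from at most AVAIL bytes at P.  Store
   it in *C and return its length in bytes, or 0 if the bytes are
   malformed or incomplete.  AVAIL must be at least 1.  */
static int
decode_one (const unsigned char *p, size_t avail, long *c)
{
  int n, i;

  if (p[0] < 0x80) { *c = p[0]; return 1; }
  if ((p[0] & 0xE0) == 0xC0) { n = 2; *c = p[0] & 0x1F; }
  else if ((p[0] & 0xF0) == 0xE0) { n = 3; *c = p[0] & 0x0F; }
  else if ((p[0] & 0xF8) == 0xF0) { n = 4; *c = p[0] & 0x07; }
  else return 0;
  if ((size_t) n > avail)
    return 0;
  for (i = 1; i < n; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return 0;
      *c = (*c << 6) | (p[i] & 0x3F);
    }
  return n;
}

/* Convert the LEN bytes at P to a malloc'd array of character
   numbers and store the element count in *RESULT_LEN.  The caller
   frees the array.  Return NULL on bad input or if malloc fails.  */
long *
multibyte_to_fixed (const unsigned char *p, size_t len,
                    size_t *result_len)
{
  /* Each character occupies at least one byte, so LEN elements
     always suffice; allocate one element even for empty input.  */
  long *out = malloc ((len ? len : 1) * sizeof *out);
  size_t i = 0, n = 0;

  assert (result_len != NULL);
  if (out == NULL)
    return NULL;
  while (i < len)
    {
      int k = decode_one (p + i, len - i, &out[n]);

      if (k == 0)
        {
          free (out);
          return NULL;
        }
      i += (size_t) k;
      n++;
    }
  *result_len = n;
  return out;
}
```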

Exchanging Guile Text With the Outside World in C
-------------------------------------------------

   Guile provides functions for converting between Guile's internal text
representation and encodings popular in the outside world.  These
functions are closely modeled after the `iconv' functions available on
some systems.

   To convert text between two encodings, you should first call
`scm_mb_iconv_open' to indicate the source and destination encodings;
this function returns a context object which records the conversion to
perform.

   Then, you should call `scm_mb_iconv' to actually convert the text.
This function expects input and output buffers, and a pointer to the
context you got from `scm_mb_iconv_open'.  You don't need to pass all
your input to `scm_mb_iconv' at once; you can invoke it on successive
blocks of input (as you read it from a file, say), and it will convert
as much as it can each time, indicating when you should grow your
output buffer.

   An encoding may be "stateless", or "stateful".  In most encodings, a
contiguous group of bytes from the sequence completely specifies a
particular character; these are stateless encodings.  However, some
encodings require you to look back an unbounded number of bytes in the
stream to assign a meaning to a particular byte sequence; such
encodings are stateful.

   For example, in the `ISO-2022-JP' encoding for Japanese text, the
byte sequence `27 36 66' indicates that subsequent bytes should be
taken in pairs and interpreted as characters from the JIS-0208 character
set.  An arbitrary number of byte pairs may follow this sequence.  The
byte sequence `27 40 66' indicates that subsequent bytes should be
interpreted as ASCII.  In this encoding, you cannot tell whether a
given byte is an ASCII character without looking back an arbitrary
distance for the most recent escape sequence, so it is a stateful
encoding.

   In Guile, if a conversion involves a stateful encoding, the context
object carries any necessary state.  Thus, you can have many independent
conversions to or from stateful encodings taking place simultaneously,
as long as each data stream uses its own context object for the
conversion.
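
   To illustrate why the state has to travel with the data stream,
here is a toy scanner for the two escape sequences mentioned above.
The names `enum iso2022_state' and `iso2022jp_scan' are hypothetical.
Real `ISO-2022-JP' has further escape sequences, and this sketch does
not handle an escape sequence or byte pair split across two chunks.

```c
#include <assert.h>
#include <stddef.h>

/* The shift state a conversion context would have to remember.  */
enum iso2022_state { STATE_ASCII, STATE_JIS0208 };

/* Scan the LEN bytes at P, starting in *STATE and updating *STATE
   at each escape sequence.  Return the number of characters seen:
   single bytes in the ASCII state, byte pairs in the JIS-0208
   state.  */
size_t
iso2022jp_scan (const unsigned char *p, size_t len,
                enum iso2022_state *state)
{
  size_t i = 0, chars = 0;

  assert (state != NULL);
  while (i < len)
    {
      if (i + 3 <= len && p[i] == 27 && p[i + 1] == 36 && p[i + 2] == 66)
        {
          *state = STATE_JIS0208;       /* the sequence `27 36 66' */
          i += 3;
        }
      else if (i + 3 <= len && p[i] == 27 && p[i + 1] == 40 && p[i + 2] == 66)
        {
          *state = STATE_ASCII;         /* the sequence `27 40 66' */
          i += 3;
        }
      else if (*state == STATE_JIS0208 && i + 2 <= len)
        {
          i += 2;                       /* one two-byte character */
          chars++;
        }
      else if (*state == STATE_ASCII)
        {
          i += 1;                       /* one ASCII character */
          chars++;
        }
      else
        break;                          /* lone trailing byte */
    }
  return chars;
}
```

   Scanning the bytes `48 33 27 40 66 98' gives two characters if the
state carried over from the previous chunk is JIS-0208, but three if
the scan wrongly starts afresh in the ASCII state.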

 - Libguile Type: struct scm_mb_iconv
     This is the type for context objects, which represent the
     encodings and current state of an ongoing text conversion.  A
     `struct scm_mb_iconv' records the source and destination
     encodings, and keeps track of any information needed to handle
     stateful encodings.

 - Libguile Function: struct scm_mb_iconv * scm_mb_iconv_open (const
          char *TOCODE, const char *FROMCODE)
     Return a pointer to a new `struct scm_mb_iconv' context object,
     ready to convert from the encoding named FROMCODE to the encoding
     named TOCODE.  For stateful encodings, the context object is in
     some appropriate initial state, ready for use with the
     `scm_mb_iconv' function.

     If either TOCODE or FROMCODE is not the name of a known encoding,
     this function will signal the `text:unknown-conversion' error,
     described below.

     Guile supports at least these encodings:
    `US-ASCII'
          US-ASCII, in the standard one-character-per-byte encoding.

    `ISO-8859-1'
          The usual character set for Western European languages, in
          its usual one-character-per-byte encoding.

    `Guile'
          Guile's current internal multibyte encoding.  The actual
          encoding this name refers to will change from one version of
          Guile to the next.  You should use this when converting data
          between external sources and the encoding used by Guile
          objects.

          You should *not* use this as the encoding for data presented
          to the outside world, for two reasons.  1) Its meaning will
          change over time, so data written using the `guile' encoding
          with one version of Guile might not be readable with the
          `guile' encoding in another version of Guile.  2) It
          currently corresponds to `Emacs-Mule', which was invented for
          Emacs's internal use, and was never intended to serve as an
          exchange medium.

    `Emacs-Mule'
          This is the variable-length encoding used for multilingual
          text by GNU Emacs, at least through version 20.4.  You probably
          should not use this encoding, as it is designed only for
          Emacs's internal use.  However, we provide it here because
          it's trivial to support, and some people probably do have
          `emacs-mule'-format files lying around.

     (At the moment, this list doesn't include any character sets
     suitable for external use that can actually handle multilingual
     data; this is unfortunate, as it encourages users to write data in
     Emacs-Mule format, which nobody but Emacs and Guile understands.
     We hope to add support for Unicode in UTF-8 soon, which should
     solve this problem.)

     Case is not significant in encoding names.

     You can define your own conversions; see *Note Implementing Your
     Own Text Conversions::.

 - Libguile Function: size_t scm_mb_iconv (struct scm_mb_iconv
          *CONTEXT, const char **INBUF, size_t *INBYTESLEFT, char
          **OUTBUF, size_t *OUTBYTESLEFT)
     Convert a sequence of characters from one encoding to another.  The
     argument CONTEXT specifies the encodings to use for the input and
     output, and carries state for stateful encodings; use
     `scm_mb_iconv_open' to create a CONTEXT object for a particular
     conversion.

     Upon entry to the function, `*INBUF' should point to the input
     buffer, and `*INBYTESLEFT' should hold the number of input bytes
     present in the buffer; `*OUTBUF' should point to the output
     buffer, and `*OUTBYTESLEFT' should hold the number of bytes
     available to hold the conversion results in that buffer.

     Upon exit from the function, `*INBUF' points to the first
     unconsumed byte of input, and `*INBYTESLEFT' holds the number of
     unconsumed input bytes; `*OUTBUF' points to the byte after the
     last output byte, and `*OUTBYTESLEFT' holds the number of bytes
     left unused in the output buffer.

     For stateful encodings, CONTEXT carries encoding state from one
     call to `scm_mb_iconv' to the next.  Thus, successive calls to
     `scm_mb_iconv' which use the same context object can convert a
     stream of data one chunk at a time.

     If either INBUF or `*INBUF' is zero, then `scm_mb_iconv' will
     reset CONTEXT to its initial state for both the input and output
     encodings.  If neither OUTBUF nor `*OUTBUF' is zero, then
     `scm_mb_iconv' will store a byte sequence to put the output string
     in its initial state.  If the output buffer is not large enough to
     hold this byte sequence, `scm_mb_iconv' will return
     `scm_mb_iconv_more_room'.  In this case, the shift states of
     CONTEXT's input and output encodings are unchanged.

     The `scm_mb_iconv' function always consumes only complete
     characters or shift sequences from the input buffer, and the output
     buffer always contains a sequence of complete characters or escape
     sequences.

     If the input sequence contains characters which are not
     expressible in the output encoding, `scm_mb_iconv' converts them in
     an implementation-defined way.  It may simply delete them.

     Some encodings use byte sequences which do not correspond to any
     textual character.  For example, the escape sequence of a stateful
     encoding has no textual meaning.  When converting from such an
     encoding, a call to `scm_mb_iconv' might consume input but produce
     no output, since the input sequence might contain only escape
     sequences.

     Normally, `scm_mb_iconv' returns the number of input characters it
     could not convert perfectly to the output encoding.  However, it may
     return one of the `scm_mb_iconv_' codes described below, to
     indicate an error.  All of these codes are negative values.

     If the input sequence contains an invalid character encoding,
     conversion stops before the invalid input character, and
     `scm_mb_iconv' returns the constant value
     `scm_mb_iconv_bad_encoding'.

     If the input sequence ends with an incomplete character encoding,
     `scm_mb_iconv' will leave it in the input buffer, unconsumed, and
     return the constant value `scm_mb_iconv_incomplete_encoding'.  This
     is not necessarily an error, if you expect to call `scm_mb_iconv'
     again with more data which might contain the rest of the encoding
     fragment.

     If the output buffer does not contain enough room to hold the
     converted form of the complete input text, `scm_mb_iconv' converts
     as much as it can, changes the input and output pointers to
     reflect the amount of text successfully converted, and then returns
     `scm_mb_iconv_more_room'.

   Here are the status codes that might be returned by `scm_mb_iconv'.
They are all negative integers.
`scm_mb_iconv_more_room'
     The conversion needs more room in the output buffer.  Some
     characters may have been consumed from the input buffer, and some
     characters may have been placed in the available space in the
     output buffer.

`scm_mb_iconv_bad_encoding'
     `scm_mb_iconv' encountered an invalid character encoding in the
     input buffer.  Conversion stopped before the invalid character, so
     there may be some characters consumed from the input buffer, and
     some converted text in the output buffer.

`scm_mb_iconv_incomplete_encoding'
     The input buffer ends with an incomplete character encoding.  The
     incomplete encoding is left in the input buffer, unconsumed.  This
     is not necessarily an error, if you expect to call `scm_mb_iconv'
     again with more data which might contain the rest of the incomplete
     encoding.
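
   As a rough illustration of how a caller might react to these status
codes, here is a minimal sketch.  It does not use the real
`scm_mb_iconv'; instead it defines a toy Latin-1 to UTF-8 converter,
`toy_iconv', which only mimics the pointer-updating contract described
above.  The names `toy_iconv', `convert_all' and the `TOY_' constants
are all hypothetical, as are the particular negative values chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the negative `scm_mb_iconv_' status codes.  */
enum { TOY_MORE_ROOM = -1, TOY_BAD_ENCODING = -2, TOY_INCOMPLETE = -3 };

/* Toy converter, Latin-1 in and UTF-8 out.  It advances the input and
   output pointers past whatever it converted.  Every Latin-1 byte is a
   complete, valid character, so this toy returns only 0 or
   TOY_MORE_ROOM.  */
static int
toy_iconv (const char **inbuf, size_t *inbytesleft,
           char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      unsigned char c = (unsigned char) **inbuf;
      size_t need = c < 0x80 ? 1 : 2;

      if (*outbytesleft < need)
        return TOY_MORE_ROOM;
      if (need == 1)
        *(*outbuf)++ = (char) c;
      else
        {
          *(*outbuf)++ = (char) (0xC0 | (c >> 6));
          *(*outbuf)++ = (char) (0x80 | (c & 0x3F));
        }
      *outbytesleft -= need;
      (*inbuf)++;
      (*inbytesleft)--;
    }
  return 0;
}

/* Convert all of IN, doubling the output buffer each time the converter
   asks for more room.  Returns a malloc'd buffer and stores its length
   in *OUTLEN; returns NULL on any other status or if allocation fails.  */
static char *
convert_all (const char *in, size_t inlen, size_t *outlen)
{
  size_t size = 4, used = 0;
  char *buf = malloc (size);

  assert (in != NULL && outlen != NULL);
  if (buf == NULL)
    return NULL;
  for (;;)
    {
      char *out = buf + used;
      size_t outleft = size - used;
      int status = toy_iconv (&in, &inlen, &out, &outleft);
      char *bigger;

      used = size - outleft;    /* keep what was converted so far */
      if (status == 0)
        break;
      if (status != TOY_MORE_ROOM)
        {
          free (buf);
          return NULL;
        }
      size *= 2;
      bigger = realloc (buf, size);
      if (bigger == NULL)
        {
          free (buf);
          return NULL;
        }
      buf = bigger;
    }
  *outlen = used;
  return buf;
}
```

   Because the converter leaves its pointers at the first unconverted
character, the caller can simply enlarge the buffer and call again;
nothing already converted is redone.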

   Finally, Guile provides a function for destroying conversion
contexts.

 - Libguile Function: void scm_mb_iconv_close (struct scm_mb_iconv
          *CONTEXT)
     Deallocate the conversion context object CONTEXT, and all other
     resources allocated by the call to `scm_mb_iconv_open' which
     returned CONTEXT.

Implementing Your Own Text Conversions
--------------------------------------

   This section describes the interface for adding your own encoding
conversions for use with `scm_mb_iconv'.  The interface here is
borrowed from the GNOME Project's `libunicode' library.

   Guile's `scm_mb_iconv' function works by converting the input text
to a stream of `scm_char_t' characters, and then converting those
characters to the desired output encoding.  This makes it easy for
Guile to choose the appropriate conversion back ends for an arbitrary
pair of input and output encodings, but it also means that the accuracy
and quality of the conversions depend on the fidelity of Guile's
internal character set to the source and destination encodings.  Since
`scm_mb_iconv' will be used almost exclusively for converting to and
from Guile's internal character set, this shouldn't be a problem.

   To add support for a particular encoding to Guile, you must provide
one function (called the "read" function) which converts from your
encoding to an array of `scm_char_t''s, and another function (called
the "write" function) to convert from an array of `scm_char_t''s back
into your encoding.  To convert from some encoding A to some other
encoding B, Guile pairs up A's read function with B's write function.
Each call to `scm_mb_iconv' passes text in encoding A through the read
function, to produce an array of `scm_char_t''s, and then passes that
array to the write function, to produce text in encoding B.
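
   The pairing of one encoding's read function with another's write
function can be sketched as follows.  This is only an illustration
under simplifying assumptions: the functions use cut-down signatures
rather than the ones documented below, the pivot is a single
`scm_char_t' rather than an array, and `scm_char_t' is assumed to be an
unsigned integer type.  The names `latin1_read_one', `ucs2be_write_one'
and `pivot_convert' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a stand-in for Guile's character type.  */
typedef unsigned long scm_char_t;

/* Simplified read function for Latin-1: each byte is one character.
   Returns the number of characters produced (0 or 1).  */
static size_t
latin1_read_one (const char **inbuf, size_t *inbytesleft, scm_char_t *ch)
{
  if (*inbytesleft == 0)
    return 0;
  *ch = (unsigned char) **inbuf;
  (*inbuf)++;
  (*inbytesleft)--;
  return 1;
}

/* Simplified write function for a big-endian two-byte encoding.
   Returns 1 if CH was written, 0 if the output buffer is too small.
   CH is assumed to fit in 16 bits, which Latin-1 input always does.  */
static int
ucs2be_write_one (scm_char_t ch, char **outbuf, size_t *outbytesleft)
{
  if (*outbytesleft < 2)
    return 0;
  *(*outbuf)++ = (char) ((ch >> 8) & 0xFF);
  *(*outbuf)++ = (char) (ch & 0xFF);
  *outbytesleft -= 2;
  return 1;
}

/* Pass text in encoding A through A's read function, then through B's
   write function.  Returns 0 when all input has been converted, or -1
   when the output buffer runs out of room.  */
static int
pivot_convert (const char **inbuf, size_t *inbytesleft,
               char **outbuf, size_t *outbytesleft)
{
  assert (inbuf && inbytesleft && outbuf && outbytesleft);
  for (;;)
    {
      const char *saved_in = *inbuf;
      size_t saved_left = *inbytesleft;
      scm_char_t ch;

      if (latin1_read_one (inbuf, inbytesleft, &ch) == 0)
        return 0;
      if (!ucs2be_write_one (ch, outbuf, outbytesleft))
        {
          /* Un-consume the character that could not be written, so
             the caller sees it as still pending in the input.  */
          *inbuf = saved_in;
          *inbytesleft = saved_left;
          return -1;
        }
    }
}
```

   Restoring the input pointer when the write side fails keeps the
promise that consumed input has always reached the output buffer.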

   For stateful encodings, a read or write function can hang its own
data structures off the conversion object, and provide its own
functions to allocate and destroy them; this allows read and write
functions to maintain whatever state they like.

   The Guile conversion back end represents each available encoding
with a `struct scm_mb_encoding' object.

 - Libguile Type: struct scm_mb_encoding
     This data structure describes an encoding.  It has the following
     members:

    `char **names'
          An array of strings, giving the various names for this
          encoding.  The array should be terminated by a zero pointer.
          Case is not significant in encoding names.

          The `scm_mb_iconv_open' function searches the list of
          registered encodings for an encoding whose `names' array
          matches its TOCODE or FROMCODE argument.

    `int (*init) (void **COOKIE)'
          An initialization function for the encoding's private data.
          `scm_mb_iconv_open' will call this function, passing it the
          address of the cookie for this encoding in this context.  (We
          explain cookies below.)  There is no way for the `init'
          function to tell whether the encoding will be used for
          reading or writing.

          Note that `init' receives a *pointer* to the cookie, not the
          cookie itself.  Because the type of COOKIE is `void **', the
          C compiler will not check it as carefully as it would other
          types.

          The `init' member may be zero, indicating that no
          initialization is necessary for this encoding.

    `int (*destroy) (void **COOKIE)'
          A deallocation function for the encoding's private data.
          `scm_mb_iconv_close' calls this function, passing it the
          address of the cookie for this encoding in this context.  The
          `destroy' function should free any data the `init' function
          allocated.

          Note that `destroy' receives a *pointer* to the cookie, not
          the cookie itself.  Because the type of COOKIE is `void **',
          the C compiler will not check it as carefully as it would
          other types.

          The `destroy' member may be zero, indicating that this
          encoding doesn't need to perform any special action to
          destroy its local data.

    `int (*reset) (void *COOKIE, char **OUTBUF, size_t *OUTBYTESLEFT)'
          Put the encoding into its initial shift state.  Guile calls
          this function whether the encoding is being used for input or
          output, so this should take appropriate steps for both
          directions.  If OUTBUF and OUTBYTESLEFT are valid, the reset
          function should emit an escape sequence to reset the output
          stream to its initial state; OUTBUF and OUTBYTESLEFT should
          be handled just as for `scm_mb_iconv'.

          This function can return an `scm_mb_iconv_' error code (*note
          Exchanging Guile Text With the Outside World in C::.).  If it
          returns `scm_mb_iconv_more_room', then the output buffer's
          shift state must be left unchanged.

          Note that `reset' receives the cookie's value itself, not a
          pointer to the cookie, as the `init' and `destroy' functions
          do.

          The `reset' member may be zero, indicating that this encoding
          doesn't use a shift state.

    `enum scm_mb_read_result (*read) (void *COOKIE, const char **INBUF, size_t *INBYTESLEFT, scm_char_t **OUTBUF, size_t *OUTCHARSLEFT)'
          Read some bytes and convert into an array of Guile
          characters.  This is the encoding's read function.

          On entry, there are *INBYTESLEFT bytes of text at *INBUF to
          be converted, and *OUTCHARSLEFT characters available at
          *OUTBUF to hold the results.

          On exit, *INBYTESLEFT and *INBUF indicate the input bytes
          still not consumed.  *OUTCHARSLEFT and *OUTBUF indicate the
          output buffer space still not filled.  (By exclusion, these
          indicate which input bytes were consumed, and which output
          characters were produced.)

          Return one of the `enum scm_mb_read_result' values, described
          below.

          Note that `read' receives the cookie's value itself, not a
          pointer to the cookie, as the `init' and `destroy' functions
          do.

    `enum scm_mb_write_result (*write) (void *COOKIE, scm_char_t **INBUF, size_t *INCHARSLEFT, char **OUTBUF, size_t *OUTBYTESLEFT)'
          Convert an array of Guile characters to output bytes.  This is
          the encoding's write function.

          On entry, there are *INCHARSLEFT Guile characters available at
          *INBUF, and *OUTBYTESLEFT bytes available to store output at
          *OUTBUF.

          On exit, *INCHARSLEFT and *INBUF indicate the number of Guile
          characters left unconverted (because there was insufficient
          room in the output buffer to hold their converted forms), and
          *OUTBYTESLEFT and *OUTBUF indicate the unused portion of the
          output buffer.

          Return one of the `enum scm_mb_write_result' values, described
          below.

          Note that `write' receives the cookie's value itself, not a
          pointer to the cookie, as the `init' and `destroy' functions
          do.

    `struct scm_mb_encoding *next'
          This is used by Guile to maintain a linked list of encodings.
          It is filled in when you call `scm_mb_register_encoding' to
          add your encoding to the list.


   Here is the enumerated type for the values an encoding's read
function can return:

 - Libguile Type: enum scm_mb_read_result
     This type represents the result of a call to an encoding's read
     function.  It has the following values:

    `scm_mb_read_ok'
          The read function consumed at least one byte of input.

    `scm_mb_read_incomplete'
          The data present in the input buffer does not contain a
          complete character encoding.  No input was consumed, and no
          characters were produced as output.  This is not necessarily
          an error status, if there is more data to pass through.

    `scm_mb_read_error'
          The input contains an invalid character encoding.


   Here is the enumerated type for the values an encoding's write
function can return:

 - Libguile Type: enum scm_mb_write_result
     This type represents the result of a call to an encoding's write
     function.  It has the following values:

    `scm_mb_write_ok'
          The write function was able to convert all the characters in
          INBUF successfully.

    `scm_mb_write_more_room'
          The write function filled the output buffer, but there are
          still characters in INBUF left unconsumed; INBUF and
          INCHARSLEFT indicate the unconsumed portion of the input
          buffer.


   Conversions to or from stateful encodings need to keep track of each
encoding's current state.  Each conversion context contains two `void
*' variables called "cookies", one for the input encoding, and one for
the output encoding.  These cookies are passed to the encodings'
functions, for them to use however they please.  A stateful encoding
can use its cookie to hold a pointer to some object which maintains the
context's current shift state.  Stateless encodings will probably not
use their cookies.

   The cookies' lifetime is the same as that of the context object.
When the user calls `scm_mb_iconv_close' to destroy a context object,
`scm_mb_iconv_close' calls the input and output encodings' `destroy'
functions, passing them their respective cookies, so each encoding can
free any data it allocated for that context.
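
   For a stateful encoding, the `init', `destroy' and `reset' members
might look something like the sketch below.  Several details are
assumptions rather than anything this document specifies: that a zero
return from `init' and `destroy' means success, that "valid" OUTBUF
and OUTBYTESLEFT means non-null pointers, and that the toy encoding
returns to its initial state by emitting the ASCII SI byte.  The names
`toy_shift_state', `toy_init', `toy_destroy', `toy_reset' and
`TOY_MORE_ROOM' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for `scm_mb_iconv_more_room'; the document
   says only that the status codes are negative.  */
#define TOY_MORE_ROOM (-1)

/* Private per-context data for a toy stateful encoding.  */
struct toy_shift_state
{
  int shifted;          /* nonzero while in the alternate shift state */
};

/* `init' receives the ADDRESS of the cookie and stores a freshly
   allocated state object there.  */
static int
toy_init (void **cookie)
{
  struct toy_shift_state *st = malloc (sizeof *st);

  if (st == NULL)
    return -1;
  st->shifted = 0;
  *cookie = st;
  return 0;
}

/* `destroy' also receives the cookie's address, and frees whatever
   `init' allocated.  */
static int
toy_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* `reset' receives the cookie's VALUE.  When used for output it emits
   the escape that returns the stream to its initial state; if there is
   no room for the escape, the shift state is left unchanged.  */
static int
toy_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct toy_shift_state *st = cookie;

  assert (st != NULL);
  if (st->shifted && outbuf != NULL && outbytesleft != NULL)
    {
      if (*outbytesleft < 1)
        return TOY_MORE_ROOM;
      *(*outbuf)++ = 0x0F;      /* ASCII SI (shift in) */
      (*outbytesleft)--;
    }
  st->shifted = 0;
  return 0;
}
```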

   Note that, if a read or write function returns a successful result
code like `scm_mb_read_ok' or `scm_mb_write_ok', then the remaining
input and the output must together represent the complete
input text; the encoding may not store any text temporarily in its
cookie.  This is because, if `scm_mb_iconv' returns a successful result
to the user, it is correct for the user to assume that all the consumed
input has been converted and placed in the output buffer.  There is no
"flush" operation to push any final results out of the encodings'
buffers.

   Here is the function you call to register a new encoding with the
conversion system:

 - Libguile Function: void scm_mb_register_encoding (struct
          scm_mb_encoding *ENCODING)
     Add the encoding described by `*ENCODING' to the set understood by
     `scm_mb_iconv_open'.  Once you have registered your encoding, you
     can use it by calling `scm_mb_iconv_open' with one of the names in
     `ENCODING->names'.
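
   One plausible way for registration and lookup to work, given the
`names' and `next' members described earlier, is a linked list searched
with a case-insensitive comparison.  The sketch below is only a guess
at that arrangement, not Guile's code; it uses a trimmed structure,
`struct toy_encoding', holding just those two members, and the names
`toy_register_encoding', `toy_find_encoding' and `names_equal' are
hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Trimmed stand-in for `struct scm_mb_encoding': only the members
   that registration and lookup need.  */
struct toy_encoding
{
  char **names;                 /* zero-terminated array of names */
  struct toy_encoding *next;    /* filled in by registration */
};

/* Head of the list of registered encodings.  */
static struct toy_encoding *toy_encodings = NULL;

/* Add ENCODING to the front of the list.  */
static void
toy_register_encoding (struct toy_encoding *encoding)
{
  assert (encoding != NULL && encoding->names != NULL);
  encoding->next = toy_encodings;
  toy_encodings = encoding;
}

/* Compare two encoding names, ignoring case.  */
static int
names_equal (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;              /* equal only if both strings ended */
}

/* Return the registered encoding with NAME among its names, or NULL
   if there is none.  */
static struct toy_encoding *
toy_find_encoding (const char *name)
{
  struct toy_encoding *e;
  char **n;

  for (e = toy_encodings; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; n++)
      if (names_equal (*n, name))
        return e;
  return NULL;
}
```

   Pushing new entries onto the front of the list means a later
registration shadows an earlier one with the same name; whether Guile
behaves that way is not stated here.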

Multibyte Text Processing Errors
================================

   This section describes error conditions which code can signal to
indicate problems encountered while processing multibyte text.  In each
case, the arguments MESSAGE and ARGS are an error format string and
arguments to be substituted into the string, as accepted by the
`display-error' function.

 - Condition: text:not-char-boundary FUNC MESSAGE ARGS OBJECT OFFSET
     By calling FUNC, the program attempted to access a character at
     byte offset OFFSET in the Guile object OBJECT, but OFFSET is not
     the start of a character's encoding in OBJECT.

     Typically, OBJECT is a string or symbol.  If the function
     signalling the error cannot find the Guile object that contains
     the text it is inspecting, it should use `#f' for OBJECT.

 - Condition: text:bad-encoding FUNC MESSAGE ARGS OBJECT
     By calling FUNC, the program attempted to interpret the text in
     OBJECT, but OBJECT contains a byte sequence which is not a valid
     encoding for any character.

 - Condition: text:not-guile-char FUNC MESSAGE ARGS NUMBER
     By calling FUNC, the program attempted to treat NUMBER as the
     number of a character in the Guile character set, but NUMBER does
     not correspond to any character in the Guile character set.

 - Condition: text:unknown-conversion FUNC MESSAGE ARGS FROM TO
     By calling FUNC, the program attempted to convert from an encoding
     named FROM to an encoding named TO, but Guile does not support
     such a conversion.

 - Libguile Variable: SCM scm_text_not_char_boundary
 - Libguile Variable: SCM scm_text_bad_encoding
 - Libguile Variable: SCM scm_text_not_guile_char
     These variables hold the Scheme symbol objects whose names are the
     condition symbols above.  You can use these when signalling these
     errors, instead of looking them up yourself.

Why Guile Does Not Use a Fixed-Width Encoding
=============================================

   Multibyte encodings are clumsier to work with than encodings which
use a fixed number of bytes for every character.  For example, using a
fixed-width encoding, we can extract the Ith character of a string in
constant time, and we can always substitute the Ith character of a
string with any other character without reallocating or copying the
string.

   However, there are no fixed-width encodings which include the
characters we wish to include, and also fit in a reasonable amount of
space.  Despite the Unicode standard's claims to the contrary, Unicode
is not really a fixed-width encoding.  Unicode uses surrogate pairs to
represent characters outside the 16-bit range; a surrogate pair must be
treated as a single character, but occupies two 16-bit spaces.  As of
this writing, there are already plans to assign characters to the
surrogate character codes.  Three-byte encodings are impractical on most
modern machines, because values will not usually be aligned for
efficient access.  Four-byte encodings are too wasteful for a majority
of Guile's users, who only need ASCII and a few accented characters.
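
   To make the point about surrogate pairs concrete, here is a small
sketch of counting the characters in 16-bit Unicode text.  It relies
only on the standard surrogate ranges (0xD800-0xDBFF for the first
unit of a pair, 0xDC00-0xDFFF for the second); the function names
`utf16_char_count' and `utf16_combine' are made up for the example.
Because a pair occupies two units, the count cannot be read off the
array length, and finding the Ith character requires a scan.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in LEN 16-bit units of text, treating each
   well-formed surrogate pair as a single character.  */
static size_t
utf16_char_count (const unsigned short *units, size_t len)
{
  size_t i = 0, count = 0;

  assert (units != NULL || len == 0);
  while (i < len)
    {
      if (units[i] >= 0xD800 && units[i] <= 0xDBFF
          && i + 1 < len
          && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
        i += 2;                 /* surrogate pair: two units */
      else
        i += 1;                 /* ordinary 16-bit character */
      count++;
    }
  return count;
}

/* Combine the two halves of a surrogate pair into the code of the
   character they represent.  */
static unsigned long
utf16_combine (unsigned short high, unsigned short low)
{
  return 0x10000UL
         + (((unsigned long) high - 0xD800UL) << 10)
         + ((unsigned long) low - 0xDC00UL);
}
```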

   Another alternative would be to have several different fixed-width
string representations, each with a different element size.  For each
string, Guile would use the smallest element size capable of
accommodating the string's text.  This would allow users of English and
the Western European languages to use the traditional memory-efficient
encodings.  However, if Guile has N string representations, then users
must write N versions of any code which manipulates text directly --
one for each element size.  And if a user wants to operate on two
strings simultaneously, and wants to avoid testing the string sizes
within the loop, she must make N*N copies of the loop.  Most users will
simply not bother.  Instead, they will write code which supports only
one string size, leaving us back where we started.  By using a single
internal representation, Guile makes it easier for users to write
multilingual code.

   Finally, Guile's multibyte encoding is not so bad.  Unlike a two- or
four-byte encoding, it is efficient in space for American and European
users.  Furthermore, the properties described above mean that many
functions can be coded just as they would for a single-byte encoding;
see *Note Promised Properties of the Guile Multibyte Encoding::.
Follow-Ups:
- Re: Multibyte encoding, scm_mb_supported_charset_p
  - From: Per Bothner
References:
- Multibyte encoding, scm_mb_supported_charset_p
  - From: forcer