This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.



Re: binary-io, opposable-thumb, pack/unpack (was Re: binary-io (was Re: rfc 2045 base64 encoding/decoding module))


> In any case, I think we are stuck with the standard Scheme ports being
> character sequences, not byte sequences.

I'm sure a lot of people use them as byte sequences; however, I agree
that having an explicitly visible encoding method attached to the port
is a good idea.

> We need to allow people to use the existing
> functions for binary I/O.  I think the right model is that in that
> case the character sequence would be using the "trivial" encoding
> (integer->char and char->integer).  If the internal character set is
> Unicode or a superset thereof, then that is equivalent to using the
> ISO-Latin-1 encoding (plus disabling CR/LF processing).
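
To make the equivalence concrete, here is a minimal sketch in Python
rather than Guile (an assumption made purely for illustration; chr and
ord stand in for integer->char and char->integer). It only checks that
the "trivial" byte-to-character mapping and ISO-Latin-1 decoding agree
on all 256 byte values; it says nothing about CR/LF handling.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# The "trivial" encoding: each byte becomes the character with the same number.
trivial = "".join(chr(b) for b in all_bytes)

# ISO-Latin-1 decoding produces the same string, and encoding round-trips.
latin1 = all_bytes.decode("latin-1")
round_trip = latin1.encode("latin-1")
```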

Then people are going to want to layer the codecs, so you can have
a gzipped file in UTF-8 that yields 16-bit characters when read from
the port.
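
As a rough sketch of what such layering could look like, here it is in
Python rather than Guile (an assumption for illustration only; the
helper name open_layered is hypothetical). A gzip decoder sits on the
raw byte stream and a UTF-8 text decoder sits on top of that, each
layer knowing nothing about the others:

```python
import gzip
import io

def open_layered(raw, encoding="utf-8"):
    """Stack a gzip decoder and a text decoder on top of a byte stream."""
    unzipped = gzip.GzipFile(fileobj=raw, mode="rb")  # bytes -> bytes
    # newline="" turns off CR/LF translation in the text layer.
    return io.TextIOWrapper(unzipped, encoding=encoding, newline="")

# A small gzipped UTF-8 payload held in memory stands in for a file.
text = "caf\u00e9 \u2603\r\n"
raw = io.BytesIO(gzip.compress(text.encode("utf-8")))

reader = open_layered(raw)
decoded = reader.read()
```

Swapping either layer (a different compressor, a different character
encoding) would not require touching the other one.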

One thing that finally convinced me that C++ had major design flaws
was the way the << operators knew how to format floating-point numbers
in a bunch of useful ways, but if you wanted to write your own ``stream''
style IO that also worked with the << operators (e.g. you want to send
to a function rather than to a file) then you couldn't just patch a new
back-end onto the existing formatting functions. You had to either
rewrite the formatting, or send everything to temporary strings so you
could copy it out again, or unravel the low-level library
implementation code and patch the back-end into the undocumented
internal hooks.

The upshot is a language with heaps of powerful library functions that
you can only use in exactly the way the authors thought of -- i.e.
no flexibility at all and minimal reuse of code. Give me sprintf any
day, at least I know where I stand and I don't have to wade through OO
tripe that gives me no detectable benefit.

Anyhow, I get the feeling that Scheme is trying to go the same way
with everything being a ``first-class'' object. What is so first class
about telling a user that this is black magic, no you can't have any
control over how this is done, yes you must use it exactly like so?

So here we are wanting more functionality in the ports... they are
already responsible for read-ahead buffering and write buffering, plus
formatting of objects, and now encoding and decoding of character
streams too. All of this is inevitably going to be patched into
special-case C code buried in the Guile core, making the core bigger
and the whole lot slower and more complicated to use. Then there
will be all this powerful code structured in a completely unmodular and
inflexible fashion providing the user with a monolith and having the
hide to tell them it is a ``first class'' solution!

> In any case, if you have a mix between binary and character data, or
> between different character encodings, you have to be careful about
> properly synchronizing the character stream with the underlying byte
> stream.  You have to do this whether you have a single combined
> port object, or you use character ports that forward requests to
> a byte port.

The problem with this switcheroo-type reading is that ports have buffering;
the buffering is invisible and its exact behaviour is undefined, except
that you can read items in sequence. Thus, the only safe way is to go
for the lowest common denominator (i.e. a byte stream) and do all conversions
explicitly ... what you really want to be able to do is build your own
decoder to attach to the port and have access to all the other decoder
components so you can paste bits together in layers. For this to work
all the buffering sections have to cooperate.
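
The read-ahead problem is easy to reproduce. A minimal sketch, in
Python rather than Guile (an assumption for illustration; the behaviour
shown is that of CPython's io.TextIOWrapper, not of Guile ports): a
character layer is asked for a single line, yet the underlying byte
stream ends up positioned past the header, so a direct byte read no
longer sees the binary payload.

```python
import io

header = b"HEADER\n"
payload = bytes([0x00, 0xFF, 0x10, 0x80])
raw = io.BytesIO(header + payload)

# Character layer on top of the byte stream.
text = io.TextIOWrapper(raw, encoding="latin-1", newline="")
line = text.readline()  # asks for one line only

# The character layer has read ahead, so the byte stream is not
# positioned just after the header and a direct read misses the payload.
position_after_line = raw.tell()
leftover = raw.read()
```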

When I do binary IO from Guile, I just forget about the ports because
it is all too hard and too slow... I extract the Unix-level file descriptor
and have an explicit buffer object in Guile. Yes, it means I have to
juggle both the port and the buffer but it also means I know exactly
where I am at, I can scan ahead in the buffer, do my parsing, fart
around and not get tangled up with synchronisation. Many functions can
work directly on the buffer, then fill it up a bit more from the port
when they need more bytes. They can return #f and leave the buffer
untouched if they can't scan what is available to them.
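
A minimal sketch of that arrangement, written in Python rather than
Guile (an assumption for illustration; the ByteBuffer class and its
method names are hypothetical, not my actual code, and None plays the
role of #f). The buffer is filled explicitly from a file descriptor,
and the scanner either consumes a complete item or leaves the buffer
untouched:

```python
import os
import struct

class ByteBuffer:
    """Explicit read buffer over a Unix file descriptor (hypothetical helper)."""

    def __init__(self, fd):
        self.fd = fd
        self.data = bytearray()

    def fill(self, want=4096):
        """Append up to `want` more bytes from the descriptor; return the count read."""
        chunk = os.read(self.fd, want)
        self.data += chunk
        return len(chunk)

    def take_u32_be(self):
        """Consume a big-endian 32-bit integer, or return None and leave the buffer as is."""
        if len(self.data) < 4:
            return None
        (value,) = struct.unpack_from(">I", self.data, 0)
        del self.data[:4]
        return value

# Demonstration on a pipe: the integer arrives in two pieces.
r, w = os.pipe()
buf = ByteBuffer(r)

os.write(w, b"\x00\x00")
buf.fill()
first_try = buf.take_u32_be()  # not enough bytes buffered yet

os.write(w, b"\x01\x02rest")
buf.fill()
second_try = buf.take_u32_be()
remaining = bytes(buf.data)

os.close(r)
os.close(w)
```

Because the position in the byte stream is always just the state of
the buffer, there is nothing hidden to get out of step with.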

> > Your system could make read and read-line simpler or more efficient, I
> > think, by allowing them to scan the buffer without needing to decode
> > the bytes.
> 
> But you can't do that!  The scanning is defined in terms of character
> delimiters, not bytes.  For certain encodings, and certain sets of
> delimiter characters, you can make some optimizations, but those are
> special-case hacks.  The system could do that *behind the scenes*,
> but the published semantics need to handle the general case in a
> clean and consistent manner.

I would argue that direct scanning of the buffer should always be in bytes
because it is one thing that is rock solid and well understood. There
is nothing preventing some sort of technique for indirect scanning where
one layer acts as the eyes and ears for another layer. This allows the
construction of encoder/decoder components in a methodical way.
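
A small sketch of the idea, in Python rather than Guile (an assumption
for illustration; split_line is a hypothetical name). The scan for the
delimiter is done on raw bytes, and the decoder is only handed the
complete line that the scan found. This is safe for UTF-8 because
bytes below 0x80 never occur inside a multi-byte sequence; it would
not be safe for an encoding such as UTF-16, where the byte 0x0A can be
part of another character.

```python
def split_line(buffer: bytes):
    """Find the first newline by scanning raw bytes; decode only the line found.

    Returns (line, rest), or (None, buffer) when no complete line is buffered.
    """
    end = buffer.find(b"\n")  # byte-level scan, no decoding involved
    if end < 0:
        return None, buffer
    line = buffer[:end].decode("utf-8")  # the decoder sees only a whole line
    return line, buffer[end + 1:]

data = "na\u00efve \u65e5\u672c\nsecond".encode("utf-8")
line, rest = split_line(data)
partial, untouched = split_line(rest)
```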

> Guile is meant to be a scripting language, not a systems-programming
> language.  It should make it easy to write correct and general code,
> and harder to write possibly-faster but incorrect code, not the
> other way around.

Guile is supposed to be everything to all people; that is its biggest
failing and what is holding it back at the moment. One quite sensible
option is that the Guile core simply doesn't handle binary IO at all
and there is a .so library of handy binary readers that do all the specific
jobs that people commonly use. That keeps 90% of users happy and the
rest can take the trouble to learn C and write their own shared objects.

> Passing a string containing characters to system routines that expect
> bytes is not something you can expect to work.  It may work for strings
> that use appropriate stateless multi-byte encodings, such as UTF-8,
> which I believe has been proposed for Guile.

UTF-8 has a high coolness factor from a programmer's point of view.

	- Tel
