This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.



Re: binary-io, opposable-thumb, pack/unpack (was Re: binary-io (was Re: rfc 2045 base64 encoding/decoding module))


Gary Houston <ghouston@arglist.com> writes:

> Do you think combined input/output ports are more trouble than they're
> worth?

I don't understand the question:  Do you mean ports that are combined
input/output or combined byte/character?  Assuming you mean a port
that is at the same time a byte-sequence and a character-sequence,
I think that encourages sloppy programming and typing discipline.

In any case, I think we are stuck with the standard Scheme ports being
character sequences, not byte sequences.  We also have the fact that
low-level files and network protocols need to work with bytes.  So it
seems inescapable that a file port is something that works on an
underlying byte-sequence, and decodes that to an appropriate
character-sequence.  We need to allow people to use the existing
functions for binary I/O.  I think the right model is that in that
case the character sequence would be using the "trivial" encoding
(integer->char and char->integer).  If the internal character set is
Unicode or a superset thereof, then that is equivalent to using the
ISO-Latin-1 encoding (plus disabling CR/LF processing).
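
To make the equivalence concrete, here is a minimal sketch in Python
(not Guile; Python's io module merely stands in for a port system).
It assumes that Latin-1 decoding plays the role of integer->char and
that newline="" is how CR/LF processing gets disabled:

```python
import io

# Every byte value 0..255, including CR (13) and LF (10).
raw = bytes(range(256))

# Latin-1 maps byte n to code point n: the "trivial" integer->char mapping.
text = raw.decode("latin-1")
assert [ord(c) for c in text] == list(range(256))
assert text.encode("latin-1") == raw

# newline="" turns off newline translation, so CR and LF pass through untouched.
port = io.TextIOWrapper(io.BytesIO(raw), encoding="latin-1", newline="")
roundtrip = port.read()
```

With any other encoding, or with newline translation left on, the
round trip would not be guaranteed to give back the original bytes.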

The real argument I think is whether programmers should be allowed
to change the encoding function of a port *on the fly*.  My point
is you don't really need it, and it has some troubling semantics.
However, I won't necessarily say it's the wrong thing to do.  It
might be the simplest extension to Scheme.  You do have to be very
careful about how you define this extension:  What happens in the
various cases, shift states, etc.
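
One of the cases that needs defining can be sketched in Python (not
Guile), using the standard codecs module as a stand-in for a port's
decoder.  The scenario is hypothetical: the encoding is switched from
UTF-8 to Latin-1 while the decoder is holding part of a multi-byte
character.

```python
import codecs

# The euro sign is three bytes in UTF-8; suppose only two have arrived so far.
euro = "\u20ac".encode("utf-8")            # b'\xe2\x82\xac'
decoder = codecs.getincrementaldecoder("utf-8")()

produced = decoder.decode(euro[:2])        # no complete character yet
pending, _flags = decoder.getstate()       # bytes consumed but not yet decoded

# Switching to Latin-1 here forces a decision about `pending`.
# Option A: drop the pending bytes and decode only what follows (loses data).
option_a = euro[2:].decode("latin-1")
# Option B: hand the pending bytes to the new decoder (reinterprets them).
option_b = (pending + euro[2:]).decode("latin-1")
```

Neither option yields the euro sign, so a specification for switching
encodings has to say which behaviour applies, or make the switch an
error while a character is incomplete.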

> It seems a bit restrictive to allow only meaningful and reliable
> formats.  Examples would be things like reading a binary database
> record with string fields or decoding network protocols (I'm not sure
> which ones off hand.

You're confusing files and ports.  A file contains bytes.  Reading
a file is essentially parsing it, which requires knowing the grammar
and encoding of the file.  Reading a file that contains a mix of
binary data and strings requires being able to delimit the strings.
This has to be well-defined in the file format.  One clean solution
when reading a string in a binary file is to read the bytes until the
termination condition (either a count or a delimiter) and then
convert the bytes read to a string, using the appropriate conversion
function.  You only use a byte-input-port, and not a char-input-port.
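
A minimal sketch of that approach, in Python rather than Guile.  The
record layout (a 4-byte id, a length-prefixed UTF-8 name, and a
NUL-terminated Latin-1 city) and the helper names are invented for
illustration; only a byte port (io.BytesIO) is ever used:

```python
import io
import struct

def read_counted_string(port, encoding):
    """Read a 2-byte big-endian length, then that many bytes, then decode."""
    (count,) = struct.unpack(">H", port.read(2))
    data = port.read(count)
    if len(data) != count:
        raise EOFError("record truncated inside a string field")
    return data.decode(encoding)

def read_delimited_string(port, encoding, delimiter=b"\x00"):
    """Read bytes up to (not including) a one-byte delimiter, then decode."""
    collected = bytearray()
    while True:
        b = port.read(1)
        if b == b"":
            raise EOFError("delimiter not found")
        if b == delimiter:
            return collected.decode(encoding)
        collected += b

# Hypothetical record: u32 id, counted UTF-8 name, NUL-terminated Latin-1 city.
record = (
    struct.pack(">I", 7)
    + struct.pack(">H", 5) + "Ren\u00e9".encode("utf-8")
    + "Z\u00fcrich".encode("latin-1") + b"\x00"
)

port = io.BytesIO(record)
(record_id,) = struct.unpack(">I", port.read(4))
name = read_counted_string(port, "utf-8")
city = read_delimited_string(port, "latin-1")
```

The count and the delimiter are both measured in bytes, which is why
the decoding step comes only after the field has been delimited.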

> Doesn't HTTP start with an ASCII header and
> switch to a character set specified in the header?)

The clean way to do that is open a byte-input-port, and read enough
to determine the encoding.  At that point, you create a char-input-port
that indirects to the byte-input-port to read the rest of the response
or file.  You can either rewind the byte-input-port, or (better)
have the char-input-port start reading bytes at a well-defined point
in the byte stream.
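
As a rough sketch in Python (not Guile), with io.BytesIO standing in
for the byte-input-port and io.TextIOWrapper for the char-input-port.
The response text, the header parsing, and the "ascii" fallback are
simplified assumptions, not a complete HTTP parser:

```python
import io

# Hypothetical response: ASCII header, blank line, body in the declared charset.
response = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"\r\n"
    + "caf\u00e9\n".encode("iso-8859-1")
)

byte_port = io.BytesIO(response)

# Phase 1: read the header as bytes; header lines are ASCII and end in CRLF.
charset = "ascii"  # assumed fallback for this sketch
while True:
    line = byte_port.readline()
    if line in (b"\r\n", b"\n", b""):
        break  # a blank line (or end of input) ends the header
    name, _, value = line.decode("ascii").partition(":")
    if name.strip().lower() == "content-type" and "charset=" in value:
        charset = value.split("charset=", 1)[1].strip()

# Phase 2: layer a character port on the byte port.  It starts reading at the
# byte position where the header ended, so no rewinding is needed.
char_port = io.TextIOWrapper(byte_port, encoding=charset, newline="")
body = char_port.read()
```

The character port is only created once the encoding is known, so no
port ever changes its encoding in mid-stream.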

In any case, if you have a mix between binary and character data, or
between different character encodings, you have to be careful about
properly synchronizing the character stream with the underlying byte
stream.  You have to do this whether you have a single combined
port object, or you use character ports that forward requests to
a byte port.

> Your system could make read and read-line simpler or more efficient, I
> think, by allowing them to scan the buffer without needing to decode
> the bytes.

But you can't do that!  The scanning is defined in terms of character
delimiters, not bytes.  For certain encodings, and certain sets of
delimiter characters, you can make some optimizations, but those are
special-case hacks.  The system could do that *behind the scenes*,
but the published semantics need to handle the general case in a
clean and consistent manner.
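
A small Python sketch (not Guile) of why scanning undecoded bytes is
only a special-case optimization.  The sample text is chosen
deliberately: U+010A encodes in UTF-16-LE as the bytes 0A 01, so its
first byte looks like a newline.

```python
# Two lines of text; the first line is the single character U+010A.
text = "\u010a\nsecond"

utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

# Naive scan: look for the newline byte 0x0A in the undecoded buffer.
naive_utf16 = utf16.find(b"\n")   # hits the low byte of U+010A, not the newline
naive_utf8 = utf8.find(b"\n")     # safe here: UTF-8 never reuses ASCII bytes

# Correct scan: decode first, then look for the newline character.
first_line_utf16 = utf16.decode("utf-16-le").split("\n", 1)[0]
first_line_utf8 = utf8[:naive_utf8].decode("utf-8")
```

The byte scan works for UTF-8 with ASCII delimiters and fails for
UTF-16, which is the kind of encoding-specific shortcut that belongs
behind the scenes rather than in the published semantics.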

> Maybe not in general, it would be up to the user not to mess it up.
> Banning it completely seems like overkill.

Guile is meant to be a scripting language, not a systems-programming
language.  It should make it easy to write correct and general code,
and harder to write possibly-faster but incorrect code, not the
other way around.

> I was thinking of where strings are passed to various system call and
> gh_ interfaces, so reading a string (of arbitrary bytes) with read-line
> and writing it to the interface would end up modifying the bytes.

Passing a string containing characters to system routines that expect
bytes is not something you can expect to work.  It may work for strings
that use appropriate stateless multi-byte encodings, such as UTF8,
which I believe has been proposed for Guile.
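
The byte modification mentioned in the quoted text can be sketched in
Python (not Guile).  The sample bytes are arbitrary, and the use of
replacement characters on decoding errors is an assumption about how
a character port might treat invalid input:

```python
# Arbitrary bytes, e.g. from a binary file; not valid UTF-8.
raw = bytes([0x41, 0xFF, 0xC3, 0x28, 0x0A])

# Through a Latin-1 string: every byte survives the round trip.
via_latin1 = raw.decode("latin-1").encode("latin-1")

# Through a UTF-8 string with replacement on errors: bytes are modified.
via_utf8 = raw.decode("utf-8", errors="replace").encode("utf-8")
```

Here via_latin1 equals the original bytes while via_utf8 does not,
so what reaches the system routine depends on the encoding the string
passed through.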
-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/~per/
