pserializer - portable serializer implementation

                                                 Shiro Kawai, shiro@acm.org
                 $Id: pserializer.txt,v 1.3 2000/01/26 11:13:39 shiro Exp $


Contents:

   OVERVIEW
   EXTERNAL INTERFACE
   SERIALIZED FORMAT
   EXTENDING THE SERIALIZER
   PORTING


Overview
========

This is a simple implementation of a serializer for standard Scheme
objects (boolean, pair, symbol, string, number, character and
vector).   Serialization converts a Scheme structure into a
bytestream which is independent of the running process; when the
bytestream is read back ("deserialized"), it yields a Scheme
structure topologically equal to the original one.  The serialized
form is useful for storing a Scheme structure in a file
(persistence) or for sending it over the network.

This implementation is intended to be portable and extensible
for various Scheme implementations.

To use this module, you need to define a few hash-table functions
and an error function.  Examples for SLIB and STk are shown in the
source code pserializer.scm.   See the "Porting" section below.

Although this implementation deals only with the standard Scheme
objects, you can extend the serializer to accept other types of
objects specific to your Scheme implementation (e.g. records
or classes).   See the "Extending the serializer" section below.

External Interface
==================

  MAKE-OUTPUT-SERIALIZER port &optional extension      [function]

    Creates a serializer from the given output port PORT and returns
    it.  The optional argument EXTENSION is used to extend the
    serializer to recognize implementation-dependent objects.

  WRITE-TO-OUTPUT-SERIALIZER object serializer         [function]

    Writes OBJECT to the output serializer SERIALIZER.

  CALL-WITH-OUTPUT-SERIALIZER port proc &optional extension   [function]

    PROC must be a procedure which takes one argument, a serializer.
    CALL-WITH-OUTPUT-SERIALIZER creates an output serializer from
    PORT and the EXTENSION argument, and passes it to PROC.

  MAKE-INPUT-SERIALIZER port &optional extension       [function]

    Creates a deserializer (input serializer) from the given input
    port PORT and returns it.  The optional argument EXTENSION is
    used to extend the deserializer to recognize
    implementation-dependent objects.

  READ-FROM-INPUT-SERIALIZER serializer                [function]

    Reads one object from the input serializer SERIALIZER.
    Returns an eof object when it reaches the end of the input
    stream.

  CALL-WITH-INPUT-SERIALIZER port proc &optional extension [function]

    PROC must be a procedure which takes one argument, a serializer.
    CALL-WITH-INPUT-SERIALIZER creates an input serializer from
    PORT and the EXTENSION argument, and passes it to PROC.

  REGISTER-OBJECT-TO-INPUT-SERIALIZER object serializer [function]
                                      &optional key

    Adds OBJECT to the reference lookup table of SERIALIZER.
    This procedure is needed when extending the input serializer.
    See the "Extending the serializer" section below.

  SERIALIZER->PORT serializer                           [function]

    Returns the port associated with SERIALIZER.
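
  The procedures above might be combined as in the following sketch.
  It assumes that pserializer.scm and its porting procedures are
  already loaded; the file name "data.ser" is hypothetical and chosen
  only for illustration.

```scheme
;; Assumption: pserializer.scm (with its porting procedures) is loaded.
;; "data.ser" is a hypothetical file name.
(define data (list 1 2 "3" (vector 'a 'b 'c) 'a))

;; Write one object through an output serializer.
(call-with-output-file "data.ser"
  (lambda (port)
    (call-with-output-serializer port
      (lambda (ser)
        (write-to-output-serializer data ser)))))

;; Read it back through an input serializer.
(define restored
  (call-with-input-file "data.ser"
    (lambda (port)
      (let ((ser (make-input-serializer port)))
        (read-from-input-serializer ser)))))
```

  If the round trip behaves as described in the overview,
  (equal? data restored) is expected to hold.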


Serialized format
=================

  Numbers, booleans, characters and the empty list are written out the
  same way as the Scheme external representation.

  Other types are preceded by a tag representing the type, then written
  out in a type-dependent way.   Those objects are also assigned a
  reference number in the order of appearance in the serializer
  (starting from 0, as the examples below show).  If the same object
  (in the eq? sense) appears more than once, the second and later
  appearances are represented by a REFERENCE, which is a reference tag
  followed by the number assigned to the object.

  The following tags are currently used:

   y  : symbol, followed by its name.
   p  : pair, followed by its car and cdr.
   v  : vector, followed by its length, then its elements.
   s  : string, followed by the string itself.
   r  : reference, followed by the reference number.

  Here are a couple of examples.

    Form: (1 2 "3" #(a b c) a)
    Serialized:
         p 1
         p 2
         p s "3"
         p v 3
         y a
         y b
         y c
         p r 6
         ()

    Form: #0=(a b c . #0#)    ;; circular list
    Serialized:
         p y a
         p y b
         p y c
         r 0
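
  The numbering in the two examples above can be reproduced with the
  following self-contained sketch.  It is not the code in
  pserializer.scm: SKETCH-TOKENS is a hypothetical helper that returns
  the tags and values as a list of tokens instead of writing a
  bytestream, and it uses an association list where the real module
  uses a hash table.

```scheme
;; Illustrative sketch only; not the actual pserializer writer.
;; Returns the sequence of tags and values as a list of tokens.
(define (sketch-tokens obj)
  (let ((seen '())     ; alist: object -> reference number (eq? lookup)
        (count 0)      ; next reference number to assign
        (out '()))     ; tokens, accumulated in reverse order
    (define (emit! x) (set! out (cons x out)))
    (define (register! x)
      (set! seen (cons (cons x count) seen))
      (set! count (+ count 1)))
    (define (walk x)
      (cond ((or (number? x) (boolean? x) (char? x) (null? x))
             (emit! x))                        ; written as is, no number
            ((assq x seen)                     ; already seen: reference
             => (lambda (entry) (emit! 'r) (emit! (cdr entry))))
            (else
             (register! x)                     ; number it before its parts
             (cond ((symbol? x) (emit! 'y) (emit! x))
                   ((string? x) (emit! 's) (emit! x))
                   ((pair? x) (emit! 'p) (walk (car x)) (walk (cdr x)))
                   ((vector? x)
                    (emit! 'v)
                    (emit! (vector-length x))
                    (let loop ((i 0))
                      (if (< i (vector-length x))
                          (begin (walk (vector-ref x i))
                                 (loop (+ i 1))))))
                   (else (emit! 'unsupported))))))
    (walk obj)
    (reverse out)))

(sketch-tokens (list 1 2 "3" (vector 'a 'b 'c) 'a))
;; => (p 1 p 2 p s "3" p v 3 y a y b y c p r 6 ())

(define circ (list 'a 'b 'c))
(set-cdr! (cddr circ) circ)          ; make the list circular
(sketch-tokens circ)
;; => (p y a p y b p y c r 0)
```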

  Design note 1: To serialize a variable-length data structure, you need
  a mechanism to specify the end of the structure.  There are two ways
  to do it: put a special terminator after the contents of the
  structure, or put the size of the structure before the contents.
  The Scheme external representation uses the former method to mark
  the end of lists and vectors.

  Pserializer uses the latter method, except for a list, which is
  written as a sequence of pairs.  This makes references simpler to
  handle (consider a recursive vector which has itself as one of its
  elements).

  Design note 2: There's no "magic number" or "header" in the serialized
  form to indicate, for example, the version of the format.  If the
  application is planned to use multiple, mutually incompatible versions
  of the serialized format, it's up to the application implementor
  to insert such information into the serialized output.
  

Extending the serializer
========================

  You can extend the serializer to deal with implementation-dependent
  objects by passing an extension specification as the optional
  parameter EXTENSION of MAKE-{INPUT|OUTPUT}-SERIALIZER.

  An extension specification is a list of reader/writer specifications.
  A reader/writer specification is a list of four elements:
  a tag symbol, a test procedure, a writer procedure and a reader
  procedure.

  You can choose an arbitrary symbol as the tag except the ones
  already used for Scheme primitives, shown in the previous section.

  When a serializer encounters an object of unknown type, it applies
  the test procedures to the object, in the order they appear in the
  extension specification, until one returns true.  Then the serializer
  writes the associated tag, and calls the associated writer procedure
  with the serializer and the object to be written.   References are
  handled by the serializer, so the writer procedure need not care
  about them.  If no test succeeds, the serializer reports an error.

  When a deserializer encounters an unknown tag, it looks for the matching
  tag in the extension specification.  If found, the associated
  reader procedure is called with the input serializer.  The reader
  procedure is then responsible for reading the information,
  reconstructing the object, registering it and returning it.

  Registering the object is done by calling
  REGISTER-OBJECT-TO-INPUT-SERIALIZER.   It assigns a reference number
  to the object so that the input serializer can refer to it later.
  Registering should be done _before_ any new Scheme object is read
  from the input serializer, to keep the reference counter in sync
  and to allow circular structures to be reconstructed.
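
  As an illustration, an extension specification for a mutable box
  type might look like the sketch below.  Several points are
  assumptions, not taken from the source: that the Scheme
  implementation provides SRFI 111 boxes (BOX, BOX?, UNBOX, SET-BOX!),
  that the writer procedure receives its arguments in the order
  (serializer object), and that the tag b is otherwise unused.

```scheme
;; Sketch under the assumptions stated above; check pserializer.scm
;; for the actual calling conventions before relying on it.
(define box-extension
  (list
   (list 'b                                  ; tag symbol (assumed unused)
         box?                                ; test procedure
         (lambda (ser obj)                   ; writer: write the contents
           (write-to-output-serializer (unbox obj) ser))
         (lambda (ser)                       ; reader
           (let ((b (box #f)))               ; construct an empty box first
             ;; Register before reading anything else from SER.
             (register-object-to-input-serializer b ser)
             (set-box! b (read-from-input-serializer ser))
             b)))))

;; The same specification is passed to both sides, e.g.
;;   (make-output-serializer out-port box-extension)
;;   (make-input-serializer in-port box-extension)
```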

  For example, a vector reader first reads the length of the vector,
  constructs a vector with undefined contents, registers it with
  the serializer, then proceeds to read its elements and fill in the
  vector contents.  You can't read the elements first and then
  construct the vector, since the reference counter would be wrong,
  and you couldn't handle a recursive vector either.
  Because of this, you cannot deserialize an object which requires
  its elements to be ready at construction time.
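
  The register-before-read order can be seen in the following
  self-contained sketch.  SKETCH-READ is a hypothetical helper, not
  part of pserializer: it reads from a list of tokens rather than a
  port, and it understands only numbers, the v tag and the r tag.

```scheme
;; Illustrative sketch only; not the actual pserializer reader.
(define (sketch-read tokens)
  (let ((table '())    ; alist: reference number -> object
        (count 0))     ; next reference number to assign
    (define (next!)
      (let ((t (car tokens)))
        (set! tokens (cdr tokens))
        t))
    (define (register! obj)
      (set! table (cons (cons count obj) table))
      (set! count (+ count 1)))
    (define (read-object)
      (let ((t (next!)))
        (case t
          ((v) (let* ((n (next!))
                      (vec (make-vector n #f)))  ; contents filled in later
                 (register! vec)                 ; before reading any element
                 (let loop ((i 0))
                   (if (< i n)
                       (begin (vector-set! vec i (read-object))
                              (loop (+ i 1)))))
                 vec))
          ((r) (cdr (assv (next!) table)))       ; look up an earlier object
          (else t))))                            ; a number, returned as is
    (read-object)))

;; A two-element vector whose second element is the vector itself.
(define rv (sketch-read '(v 2 1 r 0)))
(eq? rv (vector-ref rv 1))   ; => #t
```

  If the vector were registered only after its elements were read,
  the reference 0 would not yet be in the table when it is looked up.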

  An example extension can be found in pserializer-stk.scm in the
  distribution.


Porting
=======

  The following implementation-dependent procedures must be defined in
  the source.  Example implementations for SLIB and STk are provided
  in the original source.

  PSERIALIZER:MAKE-HASH-TABLE                          [function]

     Returns a new hash table.

  PSERIALIZER:HASH-TABLE-GET hashtable key             [function]

     Returns the value associated with KEY.  KEY can be any
     Scheme object, and comparison must be done by eq?.
     If no entry for KEY is defined, it must return #f.

  PSERIALIZER:HASH-TABLE-PUT! hashtable key value      [function]

     Adds VALUE to HASHTABLE, associated with KEY.

  PSERIALIZER:ERROR format &rest args                  [function]

     Reports an error.  FORMAT and ARGS are the same as for the format
     procedure found in Common Lisp and other Lisp dialects.
     pserializer uses only the ~a and ~s directives.
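
  For an implementation without native hash tables, the three table
  procedures could be sketched with an association list, as below.
  This is an assumption-laden fallback and not the SLIB or STk code
  shipped in pserializer.scm; lookup is linear in the number of
  entries, so it only suits small structures.  PSERIALIZER:ERROR is
  left out because standard Scheme has no portable way to signal an
  error.

```scheme
;; Fallback sketch: a "hash table" is a header pair whose cdr holds
;; an association list of (key . value) entries compared by eq?.
(define (pserializer:make-hash-table)
  (list 'pserializer-table))

(define (pserializer:hash-table-get table key)
  (let ((entry (assq key (cdr table))))
    (if entry (cdr entry) #f)))          ; #f when KEY has no entry

(define (pserializer:hash-table-put! table key value)
  (let ((entry (assq key (cdr table))))
    (if entry
        (set-cdr! entry value)           ; replace an existing entry
        (set-cdr! table (cons (cons key value) (cdr table))))))

;; Usage
(define t (pserializer:make-hash-table))
(define k (list 1 2))
(pserializer:hash-table-put! t k 0)
(pserializer:hash-table-get t k)            ; => 0
(pserializer:hash-table-get t (list 1 2))   ; => #f (equal?, but not eq?)
```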