For Development HEAD DRAFTSearch (procedure/syntax/module):

9.6 gauche.collection - Collection framework

Module: gauche.collection

This module provides a set of generic functions (GFs) that iterate over various collections. The Scheme standard has some iterative primitives such as map and for-each, and scheme.list (see scheme.list - R7RS lists adds a rich set of such functions, but they work only on lists.

Using the method dispatch of the object system, this module efficiently extends those functions for other collection classes such as vectors and hash tables. It also provides a simple way for user-defined class to adapt those operations. So far, the following operations are defined.

Mapping

fold, fold2, fold3, map, map-to, map-accum, for-each

Selection and searching

find, find-min, find-max, find-min&max, filter, filter-to, remove, remove-to, partition, partition-to group-collection

Conversion

coerce-to

Miscellaneous

size-of, lazy-size-of

Fundamental iterator creator

call-with-iterator, call-with-builder, with-iterator, with-builder, call-with-iterators.

Those operations work on collections and its subclass, sequences. A collection is a certain form of a set of objects that you can traverse all the object in it in a certain way. A sequence is a collection that all its elements are ordered, so that you can retrieve its element by index.

The following Gauche built-in objects are treated as collections and/or sequences.

<list>

A sequence.

<vector>

A sequence.

<string>

A sequence (of characters)

<hash-table>

A collection. Each element is a pair of a key and a value.

<s8vector>, <u8vector>, … <f64vector>

A sequence (methods defined in gauche.uvector module, see Uniform vectors).

See gauche.sequence - Sequence framework, for it adds more sequence specific methods.

The methods that needs to return a set of objects, i.e. map, filter, remove and partition. returns a list (or lists). The corresponding “-to” variant (map-to, filter-to, remove-to and partition-to. takes a collection class argument and returns the collection of the class.


9.6.1 Mapping over collection

These generic functions extends the standard mapping procedures. See also Mapping over sequences, if you care the index as well as elements.

Generic function: fold proc knil coll coll2 …

{gauche.collection} This is a natural extension of fold (see Other list procedures).

For each element Ei in the collection coll, proc is called as (proc Ei Ri-1), where Ri-1 is the result of (i-1)-th invocation of proc for i > 0, and R0 is knil. Returns the last invocation of proc.

(fold + 0 '#(1 2 3 4)) ⇒ 10
(fold cons '() "abc")  ⇒ (#\c #\b #\a)

If the coll is a sequence, it is guaranteed that the elements are traversed in order. Otherwise, the order of iteration is undefined.

Note: We don’t provide fold-right on collections, since the order of elements doesn’t matter, so only fold is sufficient for meaningful traversal. However, sequences do have fold-right; see Mapping over sequences.

You can fold more than one collection, although it doesn’t make much sense unless all of the collections are sequences. Suppose E(k, i) for i-th element of k-th collection. proc is called as

(proc E(0,i) E(1,i) ... E(K-1,i) Ri-1)

Different types of collections can be mixed together.

(fold acons '() "abc" '#(1 2 3))
  ⇒ ((#\c 3) (#\b 2) (#\a 1))

;; calculates dot product of two vectors
(fold (lambda (a b r) (+ (* a b) r)) 0
      '#(3 5 7) '#(2 4 6))
  ⇒ 68

When more than one collection is given, fold terminates as soon as at least one of the collections exhausted.

Generic function: fold2 proc knil1 knil2 coll coll2 …
Generic function: fold3 proc knil1 knil2 knil3 coll coll2 …

{gauche.collection} Like fold, but they can carry two and three state values instead of one, respectively. The state values are initialized by knilN. The procedure proc is called with each element of collN, and the state values. It must return two (fold2) or three (fold3) values, which will be used as the state values of next iteration. The values returned in the last iteration will be the return values of fold2 and fold3.

(fold2 (lambda (elt a b) (values (min elt a) (max elt b)))
       256 0 '#u8(33 12 142 1 74 98 12 5 99))
 ⇒ 1 and 142  ;; find minimum and maximum values

See also map-accum below.

Generic function: map proc coll coll2 …

{gauche.collection} This extends the built-in map (see Walking over lists). Apply proc for each element in the collection coll, and returns a list of the results.

If the coll is a sequence, it is guaranteed that the elements are traversed in order. Otherwise, the order of iteration is undefined.

If more than one collection is passed, proc is called with elements for each collection. In such case, map terminates as soon as at least one of the collection is exhausted. Note that passing more than one collection doesn’t make much sense unless all the collections are sequences.

(map (lambda (x) (* x 2)) '#(1 2 3))
  ⇒ #(2 4 6)

(map char-upcase "abc")
  ⇒ (#\A #\B #\C)

(map + '#(1 2 3) '#(4 5 6))
  ⇒ (5 7 9)

map always returns a list. If you want to get the result in a different type of collection, use map-to described below. If you wonder why (map char-upcase "abc") doesn’t return "ABC", read the discussion in the bottom of this subsection.

Generic function: map-to class proc coll coll2 …

{gauche.collection} This works the same as map, except the result is returned in a collection of class class. Class must be a collection class and have a builder interface (see Fundamental iterator creators).

(map-to <vector> + '#(1 2 3) '#(4 5 6))
  ⇒ #(5 7 9)

(map-to <string> char-upcase "def")
  ⇒ "DEF"

(map-to <vector> char=? "bed" "pet")
  ⇒ #(#f #t #f)
Generic function: map-accum proc seed coll1 coll2 …

{gauche.collection} Collects results of proc over collections, while passing a state value. proc is called like this:

(proc elt1 elt2 ... seed)

Where elt1 elt2 … are the elements of coll1 coll2 …. It must return two values; the first value is collected into a list (like map), while the second value is passed as seed to the next call of proc.

When one of the collections is exhausted, map-accum returns two values, the list of the first return values from proc, and the second return value of the last call of proc.

If the given collections are sequences, it is guaranteed that proc is applied in order of the sequence.

This is similar to Haskell’s mapAccumL, but note that the order of proc’s argument and return values are reversed.

Generic function: for-each proc coll coll2 …

{gauche.collection} Extension of built-in for-each (see Walking over lists). Applies proc for each elements in the collection(s). The result of proc is discarded. The return value of for-each is undefined.

If the coll is a sequence, it is guaranteed that the elements are traversed in order. Otherwise, the order of iteration is undefined.

If more than one collection is passed, proc is called with elements for each collection. In such case, for-each terminates as soon as one of the collection is exhausted. Note that passing more than one collection doesn’t make much sense unless all the collections are sequences.

Generic Function: fold$ proc
Generic Function: fold$ proc knil
Generic Function: map$ proc
Generic Function: for-each$ proc

{gauche.collection} Partial-application version of fold, map and for-each.

Discussion: It is debatable what type of collection map should return when it operates on the collections other than lists. It may seem more “natural” if (map * '#(1 2) '#(3 4)) returns a vector, and (map char-upcase "abc") returns a string.

Although such interface seems work for simple cases, it’ll become problematic for more general cases. What type of collection should be returned if a string and a vector are passed? Furthermore, some collection may only have iterator interface but no builder interface, so that the result can’t be coerced to the argument type (suppose you’re mapping over database records, for example). And Scheme programmers are used to think map returns a list, and the result of map are applied to the procedures that takes list everywhere.

So I decided to add another method, map-to, to specify the return type explicitly The idea of passing the return type is taken from CommonLisp’s map function, but taking a class metaobject, map-to is much flexible to extend using method dispatch. This protocol (“-to” variant takes a class metaobject for the result collection) is used throughout the collection framework.


9.6.2 Selection and searching in collection

Generic function: find pred coll

{gauche.collection} Applies pred for each element of a collection coll until pred returns a true value. Returns the element on which pred returned a true value, or #f if no element satisfies pred.

If coll is a sequence, it is guaranteed that pred is applied in order. Otherwise the order of application is undefined.

(find char-upper-case? "abcDe") ⇒ #\D
(find even? '#(1 3 4 6)) ⇒ 4
(find even? '(1 3 5 7))  ⇒ #f
Generic function: find-min coll :key key compare default
Generic function: find-max coll :key key compare default

{gauche.collection} Returns a minimum or maximum element in the collection coll.

A one-argument procedure key, whose default is identity, is applied for each element to obtain a comparison value. Then a comparison value is compared by a two-argument procedure compare, whose default is <. If the collection has zero or one element, the compare procedure is never called.

When the collection is empty, a value given to default is returned, whose default is #f.

(find-min '((a . 3) (b . 9) (c . -1) (d . 7)) :key cdr) ⇒ (c . -1)
Generic function: find-min&max coll :key key compare default default-min default-max

{gauche.collection} Does find-min and find-max simultaneously, and returns two values, the minimum element and the maximum element. The keyword arguments key, compare, and default are the same as find-min and find-max. Alternatively you can give default values for minimum and maximum separately, by default-min and default-max.

Generic function: filter pred coll

{gauche.collection} Returns a list of elements of collection coll that satisfies the predicate pred. If the collection is a sequence, the order is preserved in the result.

(filter char-upper-case? "Hello, World")
  ⇒ (#\H #\W)
(filter even? '#(1 2 3 4)) ⇒ (2 4)
Generic function: filter-to class pred coll

{gauche.collection} Same as filter, but the result is returned as a collection of class class.

(filter-to <vector> even? '#(1 2 3 4)) ⇒ #(2 4)
(filter-to <string> char-upper-case? "Hello, World")
  ⇒ "HW"
Generic function: remove pred coll

{gauche.collection} Returns a list of elements of collection coll that does not satisfy the predicate pred. If the collection is a sequence, the order is preserved in the result.

(remove char-upper-case? "Hello, World")
  ⇒ (#\e #\l #\l #\o #\, #\space #\o #\r #\l #\d)
(remove even? '#(1 2 3 4)) ⇒ (1 3)
Generic function: remove-to class pred coll

{gauche.collection} Same as remove, but the result is returned as a collection of class class.

(remove-to <vector> even? '#(1 2 3 4)) ⇒ #(1 3)
(remove-to <string> char-upper-case? "Hello, World")
  ⇒ "ello, orld"
Generic function: partition pred coll

{gauche.collection} Does filter and remove the same time. Returns two lists, the first consists of elements of the collection coll that satisfies the predicate pred, and the second consists of elements that doesn’t.

(partition char-upper-case? "PuPu")
  ⇒ (#\P #\P) and (#\u #\u)
(partition even? '#(1 2 3 4))
  ⇒ (2 4) and (1 3)
Generic function: partition-to class pred coll

{gauche.collection} Same as partition, except the results are returned in the collections of class class.

(partition-to <string> char-upper-case? "PuPu")
  ⇒ "PP" and "uu"
(partition-to <vector> even? '#(1 2 3 4))
  ⇒ #(2 4) and #(1 3)
Generic function: group-collection coll :key key test

{gauche.collection} Generalized partition. Groups elements in coll into those who has the same key value, and returns the groups as of lists. Key values are calculated by applying the procedure key to each element of coll. The default value of key is identity. For each element of coll, key is applied exactly once. The equal-ness of keys are compared by test procedure, whose default is eqv?.

If coll is a sequence, then the order of elements in each group of the result is the same order in coll.

(group-collection '(1 2 3 2 3 1 2 1 2 3 2 3))
  ⇒ ((1 1 1) (2 2 2 2 2) (3 3 3 3))

(group-collection '(1 2 3 2 3 1 2 1 2 3 2 3) :key odd?)
  ⇒ ((1 3 3 1 1 3 3) (2 2 2 2 2))

(group-collection '(("a" 2) ("b" 5) ("c" 1) ("b" 3) ("a" 6))
  :key car :test string=?)
  ⇒ ((("a" 2) ("a" 6)) (("b" 5) ("b" 3)) (("c" 1)))

See also group-sequence in gauche.sequence (see Other operations over sequences), which only groups adjacent elements.

Generic function: group-collection->alist coll :key key value test

{gauche.collection} Similar to group-collection, but returns the result as an assoc list keyed by extracted keys of comparison. It’s eaiser to see the example:

(group-collection->alist '((3 "a") (2 "b") (1 "c") (3 "d") (1 "e"))
                         :key car :value cadr)
  ⇒ ((3 "a" "d") (2 "b") (1 "c" "e"))

;; Compare with this
(group-collection '((3 "a") (2 "b") (1 "c") (3 "d") (1 "e"))
                  :key car)
  ⇒ (((3 "a") (3 "d")) ((2 "b")) ((1 "c") (1 "e")))

First, elements in the collection is grouped with the common keys. Then, a pair (cons key (map value-proc elements)) is computed, where value-proc is the value extraction procedure passed by the value keyword argument. List of those pairs is returned.

If omitted, key and value arguments are assumed to be identity, and test argument (used to compare keys) is assumed to be eqv?.

If the collection is ordered, the order of the alist is the same as the order the first element with each key appears in the collection, and within each alist entry, the values are the same order of appearance in the collection.


9.6.3 Miscellaneous operations on collection

Generic function: size-of coll

{gauche.collection} Returns the number of elements in the collection. Default method iterates over the collection to calculate the size, which is not very efficient and may diverge if the collection is infinite. Some collection classes overload the method for faster calculation.

Generic function: lazy-size-of coll

{gauche.collection} Returns either the size of the collection, or a promise to calculate it. The intent of this method is to avoid size calculation if it is expensive. In some cases, the caller wants to have size just for optimization, and it is not desirable to spend time to calculate the size. Such caller uses this method and just discards the information if it is a promise.

Generic function: coerce-to class coll

{gauche.collection} Convert a collection coll to another collection which is an instance of class. If coll is a sequence and class is a sequence class, the order is preserved.

(coerce-to <vector> '(1 2 3 4))
  ⇒ #(1 2 3 4)

(coerce-to <string> '#(#\a #\b #\c))
  ⇒ "abc"

9.6.4 Fundamental iterator creators

These are fundamental methods on which all the rest of iterative method are built. The method interface is not intended to be called from general code, but suitable for building other iterator construct. The reason why I chose this interface as fundamental methods are explained at the bottom of this subsection.

Generic function: call-with-iterator collection proc :key start

{gauche.collection} A fundamental iterator creator. This creates two procedures from collection, both take no argument, and then call proc with those two procedures. The first procedure is terminate predicate, which returns #t if the iteration is exhausted, or #f if there are still elements to be visited. The second procedure is an incrementer, which returns one element from the collection and sets the internal pointer to the next element. The behavior is undefined if you call the incrementer after the terminate predicate returns #t.

If the collection is actually a sequence, the incrementer is guaranteed to return elements in order, from 0-th element to the last element. If a keyword argument start is given, however, the iteration begins from start-th element and ends at the last element. If the collection is not a sequence, the iteration order is arbitrary, and start argument has no effect.

An implementation of call-with-iterator method may limit the extent of the iterator inside the dynamic scope of the method. For example, it allocates some resource (e.g. connect to a database) before calling proc, and deallocates it (e.g. disconnect from a database) after proc returns.

This method returns the value(s) proc returns.

(call-with-iterator '(1 2 3 4 5)
  (lambda (end? next)
    (do ((odd-nums 0))
        ((end?) odd-nums)
      (when (odd? (next)) (inc! odd-nums)))))
 ⇒ 3

See also with-iterator macro below, for it is easier to use.

Macro: with-iterator (collection end? next args …) body …

{gauche.collection} A convenience macro to call call-with-iterator.

(with-iterator (coll end? next args ...) body ...)
 ≡
(call-with-iterator coll
  (lambda (end? next) body ...)
   args ...)
Function: call-with-iterators collections proc

{gauche.collection} A helper function to write n-ary iterator method. This function applies call-with-iterator for each collections, and makes two lists, the first consists of terminate predicates and the second of incrementers. Then proc is called with those two lists. Returns whatever proc returns.

Generic function: call-with-builder collection-class proc :key size

{gauche.collection} A fundamental builder creator. Builder is a way to construct a collection incrementally. Not all collection classes provide this method.

Collection-class is a class of the collection to be built. This method creates two procedures, adder and getter, then calls proc with those procedures. Adder procedure takes one argument and adds it to the collection being built. Getter takes no argument and returns a built collection object. The effect is undefined if adder is called after getter is called.

A keyword argument size may be specified if the size of the result collection is known. Certain collections may be built much more efficiently if the size is known; other collections may just ignore it. The behavior is undefined if more than size elements are added, or the collection is retrieved before size elements are accumulated.

If the collection class is actually a sequence class, adder is guaranteed to add elements in order. Otherwise, the order of elements are insignificant.

Some collection class may take more keyword arguments to initialize the collection.

This method returns the value(s) proc returned.

(call-with-builder <list>
  (lambda (add! get)
    (add! 'a) (add! 'b) (add! 'c) (get)))
 ⇒ (a b c)

(call-with-builder <vector>
  (lambda (add! get)
    (add! 'a) (add! 'b) (add! 'c) (get)))
 ⇒ #(a b c)

See also with-builder macro below, for it is much easier to use.

Macro: with-builder (collection add! get args …) body …

{gauche.collection} A convenience macro to call call-with-builder.

(with-builder (coll add! get args ...) body ...)
 ≡
(call-with-builder coll
  (lambda (add! get) body ...)
  args ...)

Discussion: Other iterator methods are built on top of call-with-iterator and call-with-builder. By implementing those methods, you can easily adapt your own collection class to all of those iterative operations. Optionally you can overload some of higher-level methods for efficiency.

It is debatable that which set of operations should be primitives. I chose call-with-iterator style for efficiency of the applications I see most. The following is a discussion of other possible primitive iterators.

fold

It is possible to make fold a primitive method, and build other iterator method on top of it. Collection-specific iterating states can be kept in the stack of fold, thus it runs efficiently. The method to optimize a procedure that uses fold as a basic iterator construct. However, it is rather cumbersome to derive generator-style interface from it. It is also tricky to iterate irregularly over more than one collections.

CPS

Passes iteratee the continuation procedure that continues the iteration. The iteratee just returns when it want to terminate the iteration. It has resource management problem described in Oleg Kiselyov’s article (http://okmij.org/ftp/Scheme/enumerators-callcc.html).

Iterator object

Like C++ iterator or Common Lisp generator. Easy to write loop. The problem is that every call of checking termination or getting next element must be dispatched.

Series

Common Lisp’s series can be very efficient if the compiler can statically analyze the usage of series. Unfortunately it is not the case in Gauche. Even if it could, the extension mechanism doesn’t blend well with Gauche’s object system.

Macros

Iterator can be implemented as macros, and that will be very efficient; e.g. Scheme48’s iterator macro. It uses macros to extend, however, and that doesn’t blend well with Gauche’s object system.

The current implementation is close to the iterator object approach, but using closures instead of iterator objects so that avoiding dispatching in the inner loop. Also it allows the iterator implementor to take care of the resource problem.


9.6.5 Implementing collections

The minimum requirements of the collection class implementation is as follow:

This makes iterator methods such as map, for-each, find and filter to work.

Optionally, you can specialize other generic functions to optimize performance. Particularly, we strongly recommend to specialize size-of, since its default method iterates entire collection to find the size, which is O(N). Usually a collection has bette way to find out its size.

In order to make the constructive methods (e.g. map-to to create your collection), you have to implement call-with-builder method as well. Note that call-with-builder method must work a sort of class method, dispatched by class, rather than normal method dispatched by instance. In Gauche, you can implement it by using a metaclass. Then the minimal code will look like this:

(define-class <your-collection-meta> (<class>) ())

(define-class <your-collection> (<collection>)
 (...) ;; slots
 :metaclass <your-collection-meta>)

(define-method call-with-iterator
    ((coll <your-collection>) proc . options)
  ...
  )

(define-method call-with-builder
     ((coll <your-collection-meta>) proc . options)
  ...
  )


For Development HEAD DRAFTSearch (procedure/syntax/module):
DRAFT