For Development HEAD DRAFTSearch (procedure/syntax/module):

12.60 sxml.sxpath - SXML query language

Module: sxml.sxpath

SXPath is a query language for SXML, an instance of XML Information set (Infoset) in the form of s-expressions.

It is originally written by Oleg Kiselyov, and improved by Dmitry Lizorkin and Kirill Lisovsky. This module also incorporates various procedures written for SXPath by Dmitry Lizorkin and Kirill Lisovsky.

Current version is based on sxpathlib.scm,v 3.915, sxpath.scm,v 1.1, and sxpath-ext.scm,v 1.911.

This manual is mostly derived from the comments in the original source files.

The module consists of three layers.

  1. Basic converters and applicators, which provides the means to access and translate SXML tree.
  2. High-level query language compiler, which takes abbreviated SXPath and returns a Scheme function that selects a nodeset that satisfies the specified path from the given nodeset.
  3. Extension libraries, which implements SXML counterparts to W3C XPath Core Functions Library.

12.60.1 SXPath basic converters and applicators

A converter is a function

  type Converter = Node|Nodeset -> Nodeset

A converter can also play a role of a predicate: in that case, if a converter, applied to a node or a nodeset, yields a non-empty nodeset, the converter-predicate is deemed satisfied. Throughout this file a nil nodeset is equivalent to #f in denoting a failure.

Function: nodeset? x

{sxml.sxpath} Returns #t if given object is a nodeset.

Function: as-nodeset x

{sxml.sxpath} If x is a nodeset - returns it as is, otherwise wrap it in a list.

Function: sxml:element? obj

{sxml.sxpath} Predicate which returns #t if obj is SXML element, otherwise returns #f.

Function: ntype-names?? crit

{sxml.sxpath} The function ntype-names?? takes a list of acceptable node names as a criterion and returns a function, which, when applied to a node, will return #t if the node name is present in criterion list and #f otherwise.

 ntype-names?? :: ListOfNames -> Node -> Boolean
Function: ntype?? crit

{sxml.sxpath} The function ntype?? takes a type criterion and returns a function, which, when applied to a node, will tell if the node satisfies the test.

  ntype?? :: Crit -> Node -> Boolean

The criterion crit is one of the following symbols:

id

tests if the Node has the right name (id)

@

tests if the Node is an attributes-list.

*

tests if the Node is an Element.

*text*

tests if the Node is a text node.

*data*

tests if the Node is a data node (text, number, boolean, etc., but not pair).

*PI*

tests if the Node is a PI node.

*COMMENT*

tests if the Node is a COMMENT node.

*ENTITY*

tests if the Node is a ENTITY node.

*any*

#t for any type of Node.

Function: ntype-namespace-id?? ns-id

{sxml.sxpath} This function takes a namespace-id, and returns a predicate Node -> Boolean, which is #t for nodes with this very namespace-id. ns-id is a string. (ntype-namespace-id?? #f) will be #t for nodes with non-qualified names.

Function: sxml:invert pred

{sxml.sxpath} This function takes a predicate and returns it inverted . That is if the given predicate yields #f or ’() the inverted one yields the given node (#t) and vice versa.

Function: node-eq? other
Function: node-equal? other

{sxml.sxpath} Curried equivalence converter-predicates, i.e.

  ((node-eq? a) b)    ≡ (eq? a b)
  ((node-equal? a) b) ≡ (equal? a b)
Function: node-pos n

{sxml.sxpath}

 node-pos:: N -> Nodeset -> Nodeset, or
 node-pos:: N -> Converter

Select the N’th element of a Nodeset and return as a singular Nodeset; Return an empty nodeset if the Nth element does not exist. ((node-pos 1) Nodeset) selects the node at the head of the Nodeset, if exists; ((node-pos 2) Nodeset) selects the Node after that, if exists. N can also be a negative number: in that case the node is picked from the tail of the list. ((node-pos -1) Nodeset) selects the last node of a non-empty nodeset; ((node-pos -2) Nodeset) selects the last but one node, if exists.

Function: sxml:filter pred?

{sxml.sxpath}

 filter:: Converter -> Converter

A filter applicator, which introduces a filtering context. The argument converter is considered a predicate, with either #f or nil result meaning failure.

Function: take-until pred?

{sxml.sxpath}

 take-until:: Converter -> Converter, or
 take-until:: Pred -> Node|Nodeset -> Nodeset

Given a converter-predicate and a nodeset, apply the predicate to each element of the nodeset, until the predicate yields anything but #f or nil. Return the elements of the input nodeset that have been processed till that moment (that is, which fail the predicate). take-until is a variation of the filter above: take-until passes elements of an ordered input set till (but not including) the first element that satisfies the predicate. The nodeset returned by ((take-until (not pred)) nset) is a subset – to be more precise, a prefix – of the nodeset returned by ((filter pred) nset).

Function: take-after pred?

{sxml.sxpath}

take-after:: Converter -> Converter, or
take-after:: Pred -> Node|Nodeset -> Nodeset

Given a converter-predicate and a nodeset, apply the predicate to each element of the nodeset, until the predicate yields anything but #f or nil. Return the elements of the input nodeset that have not been processed: that is, return the elements of the input nodeset that follow the first element that satisfied the predicate. take-after along with take-until partition an input nodeset into three parts: the first element that satisfies a predicate, all preceding elements and all following elements.

Function: map-union proc lst

{sxml.sxpath} Apply proc to each element of lst and return the list of results. If proc returns a nodeset, splice it into the result.

From another point of view, map-union is a function Converter->Converter, which places an argument-converter in a joining context.

Function: node-reverse node-or-nodeset

{sxml.sxpath}

node-reverse :: Converter, or
node-reverse:: Node|Nodeset -> Nodeset

Reverses the order of nodes in the nodeset. This basic converter is needed to implement a reverse document order (see the XPath Recommendation).

Function: node-trace title

{sxml.sxpath}

 node-trace:: String -> Converter

(node-trace title) is an identity converter. In addition it prints out a node or nodeset it is applied to, prefixed with the ’title’. This converter is very useful for debugging.

What follow are Converter combinators, higher-order functions that transmogrify a converter or glue a sequence of converters into a single, non-trivial converter. The goal is to arrive at converters that correspond to XPath location paths.

From a different point of view, a combinator is a fixed, named pattern of applying converters. Given below is a complete set of such patterns that together implement XPath location path specification. As it turns out, all these combinators can be built from a small number of basic blocks: regular functional composition, map-union and filter applicators, and the nodeset union.

Function: select-kids test-pred?

{sxml.sxpath}

select-kids:: Pred -> Node -> Nodeset

Given a Node, return an (ordered) subset its children that satisfy the Pred (a converter, actually).

select-kids:: Pred -> Nodeset -> Nodeset

The same as above, but select among children of all the nodes in the Nodeset.

Function: node-self pred

{sxml.sxpath}

 node-self:: Pred -> Node -> Nodeset, or
 node-self:: Converter -> Converter

Similar to select-kids but apply to the Node itself rather than to its children. The resulting Nodeset will contain either one component, or will be empty (if the Node failed the Pred).

Function: node-join . selectors

{sxml.sxpath}

 node-join:: [LocPath] -> Node|Nodeset -> Nodeset, or
 node-join:: [Converter] -> Converter

join the sequence of location steps or paths as described in the title comments above.

Function: node-reduce . converters

{sxml.sxpath}

 node-reduce:: [LocPath] -> Node|Nodeset -> Nodeset, or
 node-reduce:: [Converter] -> Converter

A regular functional composition of converters. From a different point of view, ((apply node-reduce converters) nodeset) is equivalent to (foldl apply nodeset converters) i.e., folding, or reducing, a list of converters with the nodeset as a seed.

Function: node-or . converters

{sxml.sxpath}

 node-or:: [Converter] -> Converter

This combinator applies all converters to a given node and produces the union of their results. This combinator corresponds to a union, ’|’ operation for XPath location paths.

Function: node-closure test-pred?

{sxml.sxpath}

 node-closure:: Converter -> Converter

Select all descendants of a node that satisfy a converter-predicate. This combinator is similar to select-kids but applies to grand... children as well. This combinator implements the "descendant::" XPath axis. Conceptually, this combinator can be expressed as

 (define (node-closure f)
      (node-or
        (select-kids f)
         (node-reduce (select-kids (ntype?? '*)) (node-closure f))))

This definition, as written, looks somewhat like a fixpoint, and it will run forever. It is obvious however that sooner or later (select-kids (ntype?? '*)) will return an empty nodeset. At this point further iterations will no longer affect the result and can be stopped.


12.60.2 SXPath query language

Function: sxpath abbrpath . ns-binding

{sxml.sxpath} Evaluates an abbreviated SXPath. Returns a procedure that when applied on a node or nodeset will return a nodeset matching the given path.

 sxpath:: AbbrPath -> Converter, or
 sxpath:: AbbrPath -> Node|Nodeset -> Nodeset

AbbrPath is a list or a string. If it is a list, it is translated to the full SXPath according to the following rewriting rules. More informal explanation follows shortly. If it is a string, it is an XPath query.

Note that these are abstract rules to show how it works, and not the running code examples. The nonterminals sxpath1 and sxpathr don’t exist as APIs. The term txpath is an internal function that interprets XPath query given as a string.

 (sxpath '()) -> (node-join)
 (sxpath '(path-component ...)) ->
                (node-join (sxpath1 path-component) (sxpath '(...)))
 (sxpath1 '//) -> (node-or
                     (node-self (ntype?? '*any*))
                     (node-closure (ntype?? '*any*)))
 (sxpath1 '(equal? x)) -> (select-kids (node-equal? x))
 (sxpath1 '(eq? x))    -> (select-kids (node-eq? x))
 (sxpath1 '(or@ ...))  -> (select-kids (ntype-names??
                                          (cdr '(or@ ...))))
 (sxpath1 '(not@ ...)) -> (select-kids (sxml:invert
                                         (ntype-names??
                                          (cdr '(not@ ...)))))
 (sxpath1 '(ns-id:* x)) -> (select-kids
                                      (ntype-namespace-id?? x))
 (sxpath1 ?symbol)     -> (select-kids (ntype?? ?symbol))
 (sxpath1 ?string)     -> (txpath ?string)
 (sxpath1 procedure)   -> procedure
 (sxpath1 '(?symbol ...)) -> (sxpath1 '((?symbol) ...))
 (sxpath1 '(path reducer ...)) ->
                (node-reduce (sxpath path) (sxpathr reducer) ...)
 (sxpathr number)      -> (node-pos number)
 (sxpathr path-filter) -> (filter (sxpath path-filter))

SXPath in its simplest form is a list of path components. The result procedure will follow the same path and return the matching node list. For example (one two three) will find element one then two inside it and three inside element two. The equivalent XPath would be one/two/three.

There are a few special path components (see ntype?? for the complete list):

*

matches an element node.

//

matches any one or many consecutive path components.

*text*

matches a text node (text() in XPath).

*data*

matches any data node (e.g. text, number, boolean, etc., but not pair).

@

selects the attribute list node.

A path component could be a list in one of these forms:

(equal? x)

matches if the node under examination matches x using node-equal?

(eq? x)

matches if the node under examination matches x using node-eq?

(or@ ...)

matches if the element name is one of the specified symbols.

(not@ ...)

matches if the element name is not one of the specified symbols.

(ns-id:* x)

matches the node if it’s with namespace x

(<path> n)

matches the n-th node matching same path component. n starts from 1. Negative numbers start from the end of the node list backward. This is path[n] syntax in XPath.

(<path> (<predicate>...))

matches a path component path and (sxpath (<predicate>...)) on those nodes are not empty. This is path[predicate...] syntax in XPath.

If the path component is a string, it is interpreted as an XPath query string.

If the path component is a procedure, the procedure takes three arguments: the nodeset being examined, the root node and the variable bindings.

The root node is usually the entire sxml being applied. However if you apply the result sxpath procedure with two arguments, root-node will be the second argument.

When applied with three arguments, the variable bindings are the third one. This lets you pass arguments to the procedure.

;; select all <book> elements whose style attribute value is equal to
;; the <bookstore> element's specialty attribute value.
(sxpath "//book[/bookstore/@specialty=@style]")
;; a similar query but this time make sure specialty of _all_
;; bookstores is matched
(let ([match-specialty
       (lambda (node root var-binding)
         (let ([style (car ((sxpath '(@ style *text*)) node))]
               [all-specialty ((sxpath '(bookstore @ specialty *text*)) node)])
           (fold (lambda (specialty last-result)
                   (and last-result (string=? style specialty)))
                        #t
                        all-specialty)))])
  (sxpath `(// (book (,match-specialty)))))

;; select all <bookstore> elements that are inside top-level <book>
;; element
(sxpath '(book bookstore))
;; select all <bookstore> elements from anywhere
(sxpath '(// bookstore))
;; select attribute "name" in the top-level <book> element
(sxpath '(book @ name))
;; select all <bookstore> and <bookshop> elements that are inside
;; top-level <book> element
(sxpath '(book (or@ bookstore bookshop)))
;; select all elements except <movie> that are inside top-level <book>
;; element
(sxpath '(book (not@ movie @)))
;; select the attribute "name" of the second <bookstore> element
(sxpath '(book (bookstore 2) @ name))
;; select the attribute "name" of all <bookstore> elements that has
;; attribute "recommended"
(sxpath '(book (bookstore (@ recommended)) @ name))
;; select the attribute "name" of all <bookstore> elements whose
;; "rating" attribute is 3
(sxpath '(book (bookstore (@ rating (eq? 3))) @ name))
;; select the attribute "rating" whose value is greater than 3 from
;; all <bookstore> elements
(let ([greater (lambda (nodeset root-node var-binding)
                 (filter (lambda (node)
                           (> (string->number (sxml:string-value node))
                              3))
                         nodeset))])
  (sxpath `(book bookstore @ rating ,greater)))

Some wrapper functions around sxpath:

Function: if-sxpath path

{sxml.sxpath} sxpath always returns a list, which is #t in Scheme. if-sxpath returns #f instead of empty list.

Function: if-car-sxpath path

{sxml.sxpath} Returns first node found, if any. Otherwise returns #f.

Function: car-sxpath path

{sxml.sxpath} Returns first node found, if any. Otherwise returns empty list.

Function: sxml:id-alist node . lpaths

{sxml.sxpath} Built an index as a list of (ID_value . element) pairs for given node. lpaths are location paths for attributes of type ID.


12.60.3 SXPath extension

SXML counterparts to W3C XPath Core Functions Library.

Function: sxml:string object

{sxml.sxpath} The counterpart to XPath string function (section 4.2 XPath Rec.) Converts a given object to a string. NOTE:

  1. When converting a nodeset - a document order is not preserved
  2. number->string function returns the result in a form which is slightly different from XPath Rec. specification
Function: sxml:boolean object

{sxml.sxpath} The counterpart to XPath boolean function (section 4.3 XPath Rec.) Converts its argument to a boolean.

Function: sxml:number obj

{sxml.sxpath} The counterpart to XPath number function (section 4.4 XPath Rec.) Converts its argument to a number NOTE:

  1. The argument is not optional (yet?).
  2. string->number conversion is not IEEE 754 round-to-nearest.
  3. NaN is represented as 0.
Function: sxml:string-value node

{sxml.sxpath} Returns a string value for a given node in accordance to XPath Rec. 5.1 - 5.7

Function: sxml:node? node

{sxml.sxpath} According to XPath specification 2.3, this test is true for any XPath node. For SXML auxiliary lists and lists of attributes has to be excluded.

Function: sxml:attr-list obj

{sxml.sxpath} Returns the list of attributes for a given SXML node. Empty list is returned if the given node is not an element, or if it has no list of attributes

Function: sxml:id id-index

{sxml.sxpath} Select SXML element by its unique IDs. (XPath Rec. 4.1) Returns a converter that takes object, which is a nodeset or a datatype which can be converted to a string by means of a ’string’ function.

id-index is ( (id-value . element) (id-value . element) ... ).

This index is used for selection of an element by its unique ID.

Comparators for XPath objects:

Function: sxml:equality-cmp bool-op number-op string-op

{sxml.sxpath} A helper for XPath equality operations: = , != bool-op, number-op and ’string-op are comparison operations for a pair of booleans, numbers and strings respectively.

Function: sxml:equal? a b
Function: sxml:not-equal? a b

{sxml.sxpath} Counterparts of XPath equality operations: = , !=, using default equality tests.

Function: sxml:relational-cmp op

{sxml.sxpath} Creates a relational operation ( < , > , <= , >= ) for two XPath objects. op is comparison procedure: < , > , <= or >=.

XPath axises. An order in resulting nodeset is preserved.

Function: sxml:attribute test-pred?

{sxml.sxpath} Attribute axis.

Function: sxml:child test-pred?

{sxml.sxpath} Child axis. This function is similar to ’select-kids’, but it returns an empty child-list for PI, Comment and Entity nodes.

Function: sxml:parent test-pred?

{sxml.sxpath} Parent axis.

Given a predicate, it returns a function RootNode -> Converter which yields a node -> parent converter then applied to a rootnode.

Thus, such a converter may be constructed using ((sxml:parent test-pred) rootnode) and returns a parent of a node it is applied to. If applied to a nodeset, it returns the list of parents of nodes in the nodeset. The rootnode does not have to be the root node of the whole SXML tree – it may be a root node of a branch of interest. The parent:: axis can be used with any SXML node.

Function: sxml:ancestor test-pred?

{sxml.sxpath} Ancestor axis

Function: sxml:ancestor-or-self test-pred?

{sxml.sxpath} Ancestor-or-self axis

Function: sxml:descendant test-pred?

{sxml.sxpath} Descendant axis

Function: sxml:descendant-or-self test-pred?

{sxml.sxpath} Descendant-or-self axis

Function: sxml:following test-pred?

{sxml.sxpath} Following axis

Function: sxml:following-sibling test-pred?

{sxml.sxpath} Following-sibling axis

Function: sxml:namespace test-pred?

{sxml.sxpath} Namespace axis

Function: sxml:preceding test-pred?

{sxml.sxpath} Preceding axis

Function: sxml:preceding-sibling test-pred?

{sxml.sxpath} Preceding-sibling axis

Popular shortcuts:

Function: sxml:child-nodes nodeset

{sxml.sxpath}

((sxml:child sxml:node?) nodeset)
Function: sxml:child-elements nodeset

{sxml.sxpath}

((select-kids sxml:element?) nodeset)


For Development HEAD DRAFTSearch (procedure/syntax/module):
DRAFT