For Development HEAD DRAFTSearch (procedure/syntax/module):

12.40 parser.peg - PEG parser combinators

Module: parser.peg

This module implements a parser combinator library to build parsers based on Parsing Expression Grammar, or PEG.

PEG is a formal grammar to define a language, like regular expressions or context-free grammars. An interesting characteristic of PEG is that it can be directly mapped to a recursive decent parser, which is exactly what this library does—each production rule is a Scheme expression that takes parsers and returns a combined parser. One advantage of this approach is that you can freely mix ordinary Scheme code within the parser, that is, there’s no special “parser description language” distinct from the base Scheme language, nor you need to run separate tools like parser generators to obtain a runnable parser code.

Although PEG can directly parse the character string, the parser combinators are not tied to it. In fact, most of the combinators work transparently for any sequence of tokens, where the exact meanings of tokens depend on the application; you can have separate lexer that generates token sequence that PEG parser can parse, for example.

This library is specifically written to get a good performance on Gauche. The parser created by parser.peg is no slower than the parser written manually from scratch. However, you have to watch out some traps; see Performance, for the details.


12.40.1 Walkthrough

In this section we cover the basic concepts and tools of parser.peg. The code of examples is in examples/pegintro.scm if you have the source tree of Gauche.

In parser.peg, a parser is merely a Scheme procedure that takes a list of tokens as an argument and returns a result (well, in fact, it returns three values, but we’ll go into the details later.)

Typically you don’t need to write parser procedures directly. Instead, you can use procedures that generates parsers. A parser can be as simple as the following, which accepts a character #\a.

($. #\a)   ; ⇒ a parser

Here, accept means the parser checks if the head of input has a character #\a, and if it is, it succeeds, and if not, it fails.

A parser can be invoked by a parser driver. For example, you can use peg-parse-string to invoke the above parser on a string:

gosh> (peg-parse-string ($. #\a) "abc")
#\a

The parsing succeeds, and returns the matched value—#\a in this case. If the parser can’t accept the input, the driver throws an error <parse-error>.

gosh> (peg-parse-string ($char #\a) "xyz")
*** PARSE-ERROR: expecting #\a at 0, but got #\x

A parser can also be constructed by combining simpler parsers, using parser combinators. For example, $seq takes zero or more parsers and apply them sequentially, returning the last result.

gosh> (peg-parse-string ($seq ($. #\a) ($. #\b) ($. #\c)) "abc")
#\c

The combinator $many takes a parser and returns a new parser that accepts zero or more occurrence of the string the original parser accepts.

gosh> (peg-parse-string ($many ($. #\a)) "aaaaabc")
(#\a #\a #\a #\a #\a)
gosh> (peg-parse-string ($many ($. #\a)) "xxxxxyz")
()

A parser is just an ordinary Scheme procedure, so it can be bound to a variable, then can be used to construct more complex parsers.

(define digits    ($many1 ($. #[\d])))
(define ws        ($many_ ($. #[\s])))
(define separator ($seq ws ($. #\,) ws))

I leave explanation of $many1 and $many_ for the later section, but you may be able to guess what those parsers do; digits accepts a sequence of one or more digits, and ws accepts sequence of zero or more whitespaces. The separator parser accepts a comma, optionally surrounded by whitespaces.

The digits parser returns a list of accepted characters:

gosh> (peg-parse-string digits "12345")
(#\1 #\2 #\3 #\4 #\5)

Can we create a parser that returns an integer as a parsed result? Yes, we can use the $let macro.

(define integer
  ($let ([ds digits])
    ($return (x->integer (list->string ds)))))

The $let works somewhat like and-let*; it takes a form of ($let ([var parser] …) expr …), applying the parsers in order, binding the result of each parser to var. If any of the parser fails, the entire parser created by $let macro fails. When all the parser succeeds, each result is bound to var and expr … are evaluated. The last expr must yield a parser.

The $return procedure creates a parser that doesn’t consume input, always succeeds and returns the given value. The name is taken from Haskell’s monads. Note that it is just an ordinary procedure and not like a control-transfer syntax as traditional language’s return. You may think it just as type conversion procedure from a Scheme object to a parser.

gosh> (peg-parse-string integer "12345")
12345

Now you can combine those parsers to build more complex one, such as a comma-separated list of integers:

(define integers1 ($seq integer
                        ($many ($seq separator integer))))

gosh> (peg-parse-string integers1 "123, 456, 789")
(456 789)

Oops, where’s 123? Well, remember that $seq discards the results but the last one. We can use $let again to keep all the results.

(define integers2 ($let ([n  integer]
                         [ns ($many ($seq separator integer))])
                     ($return (cons n ns))))

gosh> (peg-parse-string integers2 "123, 456, 789")
(123 456 789)

(Unlike let where the order of its init expressions are not defined, $let guarantees the parsers are applied sequentially. The reason it is not called $let* is the scope; we also have $let*, which we’ll explain shortly.)

Another way to gather the results of parsers is a combinator $lift. It is used as ($lift proc parser …), where proc is an ordinary procedure which receives the result of parser … as arguments. The return value of proc becomes the result of the entire parser. Unlike $let, proc doesn’t need to return a parser.

(define integers3 ($lift cons
                         integer
                         ($many ($seq separator integer))))

The parsers so far doesn’t handle the case when the list contains no integers. Using $or combinator, which represents a choice, we can modify it to handle zero-element case.

(define integers4 ($or ($let ([n  integer]
                              [ns ($many ($seq separator integer))])
                         ($return (cons n ns)))
                       ($return '())))

By the way, “list of stuff separated by something” is a very common pattern, so we can extract the pattern to name it:

(define (sep-by stuff separator)
  ($or ($let ([n  stuff]
              [ns ($many ($seq separator stuff))])
         ($return (cons n ns)))
       ($return '())))

Then the list of integers can be written this simple:

(define integers5 (sep-by integer separator))

In fact, parser.peg provides $sep-by to do the above, but we’ve just shown the definition to demonstrate the power of the combinatorial approach; you can use ordinary procedural abstraction to factor out common patterns.

There’s one catch in the $or form. It tries the next alternative only when the parser fails without consuming the input. Once the input is consumed, $or commits to that choice. For example, the following fails even if the input seems to match the second alternative:

(define paren  ($. #\())
(define thesis ($. #\)))

(peg-parse-string ($or ($seq paren ($."ab") thesis)
                       ($seq paren ($."cd") thesis))
                  "(cd)")
 ⇒ *** PARSE-ERROR: expecting ab at 1, but got #\c

It’s because when $or tries the first branch, it reads the initial open paren from the input, so $or commits to the first branch. When the branch fails, $or doesn’t bother to try the second branch. (In other words, $or does not backtrack.)

You may factor out the common prefix:

($seq paren
      ($or ($."ab") ($."cd"))
      thesis)

But it may complicates the syntax, and it is not always trivial to factor out like above. The better way is to use the $try combinator: ($try p) runs a parser p, and if it fails, $try rolls back the input as if it didn’t consume input at all. Using with $or, you can do arbitrary lookahead and backtrack.

($or ($try ($seq paren ($."ab") thesis))
     ($seq paren ($."cd") thesis))

Now, let’s get back to the integer list example and make it more interesting. Suppose the list of integers are surrounded by parentheses, brackets or curly-braces. Opening one and closing one must correspond.

(define begin-list
  ($seq0 ($. #[\(\[\{]) ws))

(define (end-list opener)
  ($seq ws (case opener
            [(#\() ($. #\))]
            [(#\[) ($. #\])]
            [(#\{) ($. #\})])))

(define int-list
  ($let* ([opener begin-list]                ;*1
          [ints ($sep-by integer separator)] ;*2
          [ (end-list opener) ])             ;*3
    ($return ints)))

The opening bracket is parsed by the parser begin-list. The $seq0 combinator is similar to seq, but returns the result of the first parser instead of the last one (it’s begin0 to begin).

The closing bracket must match the opening one, so end-list is a procedure that takes the opening bracket and returns a suitable parser to accept the corresponding closing bracket.

The int-list first parses the opening bracket by begin-list and bind the result to opener (*1). Then goes to parse comma-separated integers (*2) and bind the result list to ints. Finally, it parses the matching closing bracket (*3). Note that the it omits the variable, since we don’t need the result of closing bracket parser.

Since we need to use the value of opener in the following bindings, we use $let* here, instead of $let. The difference is the scope of the parser expressions in the binding. Note that, however, $let is a lot more easier to optimize, so you want to use $let whenever possible.

Let’s see it works.

gosh> (peg-parse-string int-list "[123, 456, 789]")
(123 456 789)
gosh> (peg-parse-string int-list "{123, 456, 789}")
(123 456 789)
gosh> (peg-parse-string int-list "(123, 456, 789}")
*** PARSE-ERROR: expecting #\) at 14, but got #\}

The last example shows it rejects unmatched brackets.

What if we want a nested list? In BNF, we could write something like this:

list : begin-list (elem (separator elem)* )? end-list
elem : integer | list

The straight translation would be he following.

;; First try
(define nested-list
  ($let* ([opener begin-list]
          [ints ($sep-by elem separator)]
          [ (end-list opener) ])
    ($return ints)))
(define elem  ($or integer nested-list))

Let’s load it... Oops.

*** ERROR: unbound variable: elem
Stack Trace:
_______________________________________
  0  elem
        [unknown location]
  1  (eval expr env)
        at "../lib/gauche/interactive.scm":267

We need the parser elem to construct nested-list, but we need the parser nested-list to construct elem. In lazy languages like Haskell this doesn’t matter, but we Schemers are eager!

The solution is to delay the parser construction until it is actually used. The $lazy form does the job:

(define nested-list
  ($lazy
    ($let* ([opener begin-list]
            [ints ($sep-by elem separator)]
            [ (end-list opener) ])
      ($return ints))))
(define elem  ($or integer nested-list))

gosh> (peg-parse-string nested-list "(123, [456, {}, 789], 987)")
(123 (456 () 789) 987)

Ok, we’re almost done. Our code can parse nested list of integers, checking bracket matches. But if you give an erroneous input, the message is cryptic and not helpful:

gosh> (peg-parse-string nested-list "(123, [456, {}, 789), 987)")
*** PARSE-ERROR: expecting one of (#[0-9] #\]) at 19
Stack Trace:
_______________________________________
  0  (eval expr env)
        at "../lib/gauche/interactive.scm":267

We could check the reason of failure in the parser, and call $fail with more reasonable error message. Let’s replace end-list above with end-list2 below:

(define (end-list2 opener)
  (define expected
    (assv-ref '((#\( . #\)) (#\[ . #\]) (#\{ . #\})) opener))
  ($seq ws
        ($let ([closer ($. #[\)\]\}])])
          (if (eqv? closer expected)
            ($return closer)
            ($fail (format "Mismatched closing bracket. '~c' expected, \
                            but got '~c'"
                           expected closer))))))

You also have to change nested-list and elem to use end-list2:

(define nested-list2
  ($lazy
    ($let* ([opener begin-list]
            [ints ($sep-by elem2 separator)]
            [ (end-list2 opener) ])
      ($return ints))))

(define elem2  ($or integer nested-list2))

(You could redefine end-list, but if you do so, don’t forget to re-evaluate the definitions of nested-list and elem. It’s because the combinators are caculated taking the value of other combinators when it’s defined, unlike typical procedural approach where you redefine one procedure and other procedures will refer to the updated version after that.)

And now you see this error:

gosh> (peg-parse-string nested-list2 "(123, [456, {}, 789), 987)")
*** PARSE-ERROR: Mismatched closing bracket. ']' expected, but got ')' at 20
Stack Trace:
_______________________________________
  0  (eval expr env)
        at "../lib/gauche/interactive.scm":267

For further examples, you can take a look at some libraries in the Gauche source tree that use parser.peg:


12.40.2 Parser drivers

Function: peg-run-parser parser list

{parser.peg} Apply parser on the input list. If parser accepts it, returns two values–the result of parser, and the rest of the input. If parser fails, raise <parse-error>.

The result of parse is a value yielded by parser passed to rope-finalize, so any unfinalized rope is finalized before being returned. See Ropes, for the details.

Typically, you don’t want to have entire input as a list beforehand, so you pass a lazy sequence as list (see Lazy sequences). If the input is a string or a port, convenience procedures are defined.

Input doesn’t need to be a character list. You may, for example, have a separate lexer that generates a (lazy) list of tokens, and let parser parse them.

Function: peg-parse-string parser string :optional cont
Function: peg-parse-port parser iport :optional cont

{parser.peg} Convenience wrappers of peg-run-parser that takes input from string or iport.

Without cont argument or it is #f, it returns the parsed result and discards the rest of the input. (Hint: To make sure there’s no garbage following, use $eos parser.)

If you want to keep parsing after parser accepts its input, pass cont a procedure; it must take two arguments, the result of parser and a lazy sequence of the rest of the input. What cont returns will be the result of these procedures.

It is an error to pass other values to cont.

Function: peg-parser->generator parser list

{parser.peg} This is useful when you need to apply parser repeatedly over the input list. Returns a generator that generates the parsed result one match at a time.

The same input can be accepted by by (peg-run-parser ($many parser) list), but this one won’t return until all input is consumed. On the other hand,

Condition type: <parse-error>

{parser.peg} An condition type raised by the parser driver when the given parser failed ultimately. Inherits <error>. (Note that each parser won’t throw this; one parser’s failure doesn’t necessarily mean the entire parser fails. It’s the driver that recognizes the ultimate failure and raise this condition.)

The following slots are available:

Instance Variable of <parse-error>: message

This slot is inherited from <error>. Contains string error message.

Instance Variable of <parse-error>: position

The position the failure occured.

The exact meaning of this value depends on how you call the parser driver.

At minimum, when you call it on a bare list or a string/port (using peg-parse-string/peg-parse-port), the parser driver counts the number of elements took from the input and put it here, for it is the only available information. Note that the number may not be what you want, e.g. if you start parsing in the middle of the stream.

If you’re parsing from a sequence with positions, this slot contains an instance of <sequence-position>, which may have line & column numbers or a source file name. See Lazy sequence with positions, for the details.

The user code should expect both cases.

Instance Variable of <parse-error>: type

The type of failure. It’s either one of the follownig symbols:

fail-expect

The parser expects one of the objects in objects slot, but got a different one (stored in token).

fail-unexpect

The parser isn’t expecting any of the objects in objects slot, but got the one in token.

fail-compound

The parser failed multiple options. The objects slot contains a list of of (type . msg) where type is one of the <parse-error> types, and msg is the message associated with it.

fail-message

Other miscellaneous failure. The objects slot contains a mmessage.

fail-error

Non-recoverable failure. Once this failure is generated, parser conbinators don’t backtrack and let the entire parsing fail. This type of failure can be generated by $rause parser constructor and $cut combinator. The objects slot contains a cons whose car is a symbol (error tag) and whose cdr is a list of (type . msg). The cdr part is the same as fail-compound.

Instance Variable of <parse-error>: objects

Value of this slot depends on the value of type slot.

Instance Variable of <parse-error>: rest

The remaining input token stream, beginning from the point when the failure occur.

Instance Variable of <parse-error>: token

The token at the head of input when failure occur. If input already reached at the end, this slot is set to #<eof>.

This is the same as the car of rest slot when there’s more input.


12.40.3 What is a PEG parser, really?

A PEG parser is a Scheme procedure that takes a list of items as an input, and returns three values:

Typically you don’t need to write a parser as a procedure; instead, you can use one of the parser builders and combinators described in the following sections.

We do provide a few utilities to write a parser from scratch, in case you need to do so.

Function: return-result value rest

{parser.peg} Call this at the tail position of the parser when it succeeds, with value for the semantic value and rest for the rest of input. This is the same as (values #f value rest), but clearer to show the intention.

Function: return-failure/expect objs rest
Function: return-failure/unexpect objs rest
Function: return-failure/message msg rest
Function: return-failure/compound fails rest

{parser.peg} Call one of these at the tail position of the parser when it fails. The first argument will be in objects slot of <parse-error>. The second argument should be a list of input, with the first element being a token that can’t be accepted.

  • If you’re expecting some types of objects (objs) but got anything else, call return-failure/expect. It generates fail-expect type failure.
  • If you’re not expecting some types of objects (objs) but got one, call return-failure/unexpect. It generates fail-unexpect type failure
  • If you’re failing because of all of multiple choices fail, you can gather those failures into an assoc list ((fail-type . objs) …), and call return-failure/compound. It generates fail-compound type failure.
  • Finally, your failure can’t be categorized to one of the above, you can call return-failure/message, with the message describing the reason of the failure.
Function: return-failure type objs rest

{parser.peg} This should only be used to pass down the failure form other parser. See the example in parse-success? entry below.

Function: parse-success? r

{parser.peg} Check the first return value of a parser to see if it is a success. A typical usage is to check another parser’s result and take actions accordingly:

(receive (r v s) (parser input)
  (if (parse-success? r)
     (... do things after PARSER succeeds ...
       (return-result ...))
     (return-failure r v s)))

This is simply checking if r is #f, but using this procedure indicates your intention clearly.


12.40.4 Primitive parser builders

Procedures and macros that create parsers.

Function: $return val

{parser.peg} Returns a parser that always succeeds, without consuming input, and yields val as the result of parser.

Frequently used in $let and $let*’s body, but can be used anywhere where a parser is expected.

Function: $fail msg-string

{parser.peg} Returns a parser that always fails, without consuming input, and uses msg-string as the failure message.

Frequently used to produce user-friendly error messages. See also $expect below.

Function: $raise msg-string

{parser.peg} Returns a parser that raises a non-recoverable failure. with msg-string as the failure message.

The difference from $fail is that, if $or sees a failure created by $fail, it may try the remaining branches, while if it sees a failure created by $raise, no more branches are tried.

This can be used for better error reporting. If you detect the case that can’t be a valid input in deep in the parse tree, a normal failure would try other alternatives exhaustively and generate an error message itemizing all the failed possibilities. It is often difficult to see the real cause from such a message. With $raise, you can let the parser give up immediately.

See also $cut combinator below, to convert a normal failure to non-recoverable failure.

Function: $satisfy pred expect :optional result

{parser.peg} This corresponds to “semantic predicate” in PEG; a parser that can apply an arbitrary predicate on input.

Returns a parser that works as follows:

  • If the head of input stream satisfies pred, call (result head (pred head)) and yield its return value as the result of successful parsing. If result is omitted, it yields the head of input.
  • Otherwise, the parsing fails with expect as the expected input.

If you just need a lookahead parser, you can use $assert.

Function: $. obj

{parser.peg} Creates a parser that matches a Scheme object obj, which may be a character, a string, a char-set, or a symbol. If obj is a char-set, the parser matches any character in the set. Specifying a symbol can be used when you’re parsing a token stream instead of a character stream.

On success, the value of the parser is the matched input.

The resulting parser is atomic, that is, it doesn’t consume input when it fails.

Function: $char c
Function: $char-ci c

{parser.peg} Returns a parser that accepts a single character, c. $char-ci ignores case. On success, the parser yields the input character. The resulting parser is atomic, that is, it doesn’t consume input when it fails.

When c is a character, ($char c is the same as ($. c); you can use $char to emphasize you’re matching a character.

Function: $string str
Function: $string-ci str

{parser.peg} Returns a parser that accepts an input that matches a string str. $string-ci ignores case. On success, the parser yields the matched string.

The parsing of string is atomic: When the parser fails, it doesn’t consume the input. That is, ($string "ab") is not the same as ($let ([a ($char #\a)] [b ($char #\b)]) ($return (string a b))).

Function: $one-of cset
Function: $one-of obj-list

{parser.peg} The first form returns a parser that accepts any character in the character set cset. In the second form, obj-list must be a list of either a character, a string, a character set or a symbol, and each one is matched with the same way as $..

On success, the parser yields the accepted object. On failure, the parser doesn’t consume input.

Function: $none-of cset

{parser.peg} Returns a parser that accepts any character not in the character set cset. On success, the parser yields the accepted character.

Function: $any

{parser.peg} Returns a parser that matches any one item, and yields the matched input item on success. It fails only when the input already reached at the end.

Function: $eos

{parser.peg} Stands for “end of stream”. Returns a parser that matches the end of input. The value of parser is #<eof> in that case. It never consumes input.

Macro: $match1 pattern result
Macro: $match1 pattern
Macro: $match1 pattern (=> fail) result
Macro: $match1* pattern result
Macro: $match1* pattern
Macro: $match1* pattern (=> fail) result

{parser.peg} The pattern matcher macro match-let1 lifted to the parser. See util.match - Pattern matching, for the details of supported pattern.

The macro $match1 returns a parser that takes one item from the input stream, and see if it matches pattern. If it matches, evaluate result within an environment where pattern variables are bound to matched content, and the parser yields the value of result. If the input doesn’t match pattern, or the input is empty, the parser fails without consuming input.

In the third form => must be a literal identifier and fail must be an identifier. The identifier fail is bound to a procedure that takes one string argument in result. You can call fail at the tail position of result to make the match fail, with the passed argument as the message. If fail is called, no input will be consumed.

(NB: The match macro in util.match has a similar feature, but it binds fail to a continuation that abandons the current match clause and go to try the next pattern. So you can invoke fail anywhere in the dynamic extent of the clause. In $match1, fail is simply a procedure, so you have to call it at the tail position to make it work.)

The macro $match1* is similar to $match1, except the entire input is matched pattern. It is useful to look into several items in input, instead of just one. Note that if you give a pattern that consumes arbitrary length of input (e.g. ($match1* (a ...)), it will consume entire input.

These macros especially come handy when you have a token stream generated by a separate lexer—each token can have some structure (instead of just a character) and you can take advantage of match.


12.40.5 Ropes

Often you want to construct a string out of the results of other parsers. It can be costly to construct strings eagerly, for a string may be just an intermediate one to be a part of a larger string. We provide a lightweight lazily string construction mechanism, called ropes.

A rope represents text with a tree of characters, strings, and ropes. It allows O(1) concatenation. A rope becomes a string when finalized. The parser drivers such as peg-run-parser automatically finalizes ropes in the parser result.

Function: $->rope parser …

{parser.peg} The parsers must yield either a character, a string, a rope, or #f or (). This procedure returns a parser that matches parser …, then gather the result into a rope. #f and () in the results are ignored.

Function: $->string parser …

{parser.peg} This is a common idiom of ($lift rope->string ($->rope parser …)).

Function: $->symbol parser …

{parser.peg} Like $->string, but yields a symbol rather than a string.

Function: rope->string rope

{parser.peg} Converts a rope to a string.

Function: rope-finalize obj

{parser.peg} Converts any ropes in in obj into strings.


12.40.6 Choice, backtrack and assertion combinators

Function: $or p1 p2 …
Function: $or p1 p2 … :else plast

{parser.peg} Returns a choice parser. Tries the given parser in order on input. If one succeeds, immediately yields its result. If one fails, and does not consume input, then tries the next one. If one fails with consuming input, immediately fails.

If p1 p2 … don’t share the same prefix to match, you can let it fail as soon as one parser fails with consuming input. If more than one parsers do match the same prefix, you want to wrap them with $try except the last one.

If all of the parsers p1 p2 … fail without consuming input, $or returns a compound failure of all the failures. You may wish to produce better error message than that. Putting $fail parser at the last doesn’t cut it, for $fail doesn’t consume input so all the previous failures would compound. In such cases, you can use the second form—if the argument before the last parser is a keyword :else, then $or discards the previous failures.

(peg-parse-string ($or ($. "ab")
                       ($. "cd")
                       :else ($fail "we want 'ab' or 'cd'"))
                  "ef")
 ⇒ PARSE-ERROR: 'ab' or 'cd' required at 0
Function: $try p

{parser.peg} Returns a parser that accepts the same input the parser p accepts, but when p fails the returned parser doesn’t consume input. Used with $or, you can explicitly implement a backtrack behavior.

Function: $optional p :optional fallback

{parser.peg} Returns a parser that tries p on the input. If it succeeds, yielding its result. If it fails, it still succeeds, yielding fallback as the result.

This is atomic; if p fails, it doesn’t consume input.

Function: $assert p

{parser.peg} Returns a parser that accepts the same input as p and returns its result on success, but never consumes the input. It can be used as a lookahead assertion.

Function: $not p

{parser.peg} Returns a parser that succeeds when p fails, and that fails when p succeeds. When p succeeds, it yields an “unexpected” error. It never consumes input in either way. It can be used as a negative lookahead assertion.

Function: $expect p msg-string

{parser.peg} Returns a parser that calls a parser p, and if it succeeds yields its result. If p fails, fails with an error message that says expecting msg-string. Useful to produce user-friendly error messages.

(define digits4
  ($expect ($->string ($repeat ($. #[\d]) 4))
           "4 consecutive digits"))

(peg-parse-string digits4 "123a")
  ⇒ error with
  *** PARSE-ERROR: expecting 4 consecutive digits at 3, but got #\a
Function: $cut p

{parser.peg} If p fails, make the failure non-recoverable. It prevents the upstream $or and $try from backtracking and trying other choices, and makes the entire parsing fail immediately.

See also $raise above.


12.40.7 Sequencing combinators

Function: $seq p1 p2 …
Function: $seq0 p1 p2 …

{parser.peg} Returns a parser that atches p1, p2, … sequentially. When all the parser succeeds, $seq returns the last result, while $seq0 returns the first result. Fails immediately when one of the parsers fails.

Function: $between p1 p2 p3

{parser.peg} Returns a parser that matches p1, p2 and p3 sequentially, and returns the result of p2.

Function: $list p …
Function: $list* p …

{parser.peg} Returns a parser that matches p …, and returns the list of the results. $list* uses the last parser’s result as the last cdr.

They are the same as ($lift list p …) and ($lift list* p …), but we encounter this pattern frequent enough to have these.

Function: $bind p f

{parser.peg} The basic block of parser combinators; p argument is a parser, and f is a procedure that takes a Scheme value and returns a parser.

Returns a parser that first applies p on the input, and if it succeeds, calls f with the result of p, and applies the returned parser on the subsequent input.

This combinator, along with $return and $fail, composes a MonadFail interface as in Haskell. Theoretically any combinators can be built on top of these three. In practice, however, it is not always easy to build things directly on top of $bind, and more high-level forms such as $let, $let* and $lift are frequently used.

Macro: $let (binding …) body …
Macro: $let* (binding …) body …

{parser.peg} Monadic binding form. Each binding can be one of the following forms:

(var parser)

Run the parser, and if it succeeds, bind its result to a variable var. If it fails, the entire $let or $let* immdiately fails.

(parser)

The variable is omitted. The parser is run, and if it succeeds, its result is discarded and the next binding or body is evaluated. If it fails, the entire $let or $let* immdiately fails.

parser

Same as above. This form can only be used if parser is just a vairable reference.

Once all the parsers in binding … succeeds, body … are evaluated in the environment where var in bindings are bound to the parser results. The last expression of body must return a parser.

Unlike let, the parsers in binding … are always applied to the input sequentially. The difference of $let and $let* is the scope. With $let*, the variables bound in earlier binding can be used to construct the parser later.

This means $let can evaluate all the parsers beforehand, while $let* may need to construct parsers at the time of processing input. Creating a parser involves closure allocations, so you want to use $let whenever possible.

Note: $let* is similar to Haskell’s do construct. We chose the name $let and $let*, for it is easier to see it’s a binding form, and also Scheme already uses do for loop construct.

Function: $lift f p …
Function: $lift* f p …

{parser.peg} Lifts a procedure f onto the parsers’ world.

In a pseudo type declaration, lift’s type can be understood as follows:

lift :: (a b ... -> z) (Parser a) (Parser b) ... -> (Parser z)

That is, lift creates a parser such that it first applies parsers on the input, and if all of them succeeds, it calls f with the parsers’ results as arguments, and the return value of f becomes the whole parser’s result.

In other words, the following equivalence holds:

($lift f p0 p1 ...)
 ≡ ($let ([r0 p0] [r1 p1] ...) ($return (f r0 r1 ...)))

It is sometimes simpler to use $lift instead of $let. For example, the following code creates a parser that matches input with p0 p1 … sequentially, then yields the list of the parser results:

($lift list p0 p1 ...)

Note that after all the parsers succeed, the whole parser is destined to succeed—the procedure f can’t make the parser fail. If you need to fail after all the parsers succeeds, use $let or $let.

Macro: $binding parser-bind-form … [(=> fail)] expr
Macro: $lbinding parser-bind-form … [(=> fail)] expr

{parser.peg} Each parser-bind-form may be a parser-yielding expression, except that you can insert a form ($: var parser-expression) anywherer in it, where parser-expression is an expression that yields a parser. The $: form is equivalent to just the parser-expression, except that its semantic value is bound to a variable var.

Each parser created by parser-bind-form is applied to the input in sequence. One of parser-bind-form fails, the entire parser immediately fails. If all of parser-bind-form succeeds, expr is evaluated in the environment where all the vars are bound to the corresponding parser expression.

Since $binding walks entire parser-bind-form to look for $: forms, you can’t have nested $binding form inside parser-bind-form.

If the parser expression associated with var fails, or never executed, the var is bound to #<undef>. If the parser expression succeeds multiple times, var holds the last value. Also, var can appear more than one places; it holds the last bound value.

The value of expr form becomes the semantic value of the entire parser.

$lbinding is a shorthand of ($lazy ($binding ...)).

The optional (=> fail) form before expr is similar to the one with $match. If given, fail, which must be an identifier, is bound to a procedure that returns failure. You should call it in the tail position of expr to indicate failure. It can be

  • (fail message) : Returns fail-message type failure, with a string message as the message.
  • (fail tag message) : Returns fail-error type (non-recoverable) failure, where tag must be a symobl error (in future, differtent tags will be supported).
Function: $fold-parsers proc seed ps

{parser.peg} Ps is a list of parsers. Apply those parsers sequentially on the input, passing around the seed value. That is, if we let v0, v1vn be the result of each parsers in ps, it returns (proc vn (... (proc v2 (proc v1 seed))...)).

If any of the parser in ps fails, $fold-parsers fails at that point.

Conceptually, it can be written as follows:

(define ($fold-parsers proc seed ps)
  (if (null? ps)
    ($return seed)
    ($let ([v (car ps)])
      ($fold-parsers proc (proc v seed) (cdr ps)))))

But we use more efficient implementation.

Function: $fold-parsers-right proc seed ps

{parser.peg} Similar to $fold-parsers, but the folding by proc right to left. That is, if we let v0, v1vn be the result of each parsers in ps, it returns (proc v1 (proc v2 (... (proc vn seed)...))).

If any of the parser in ps fails, $fold-parsers-right fails at that point.


12.40.8 Repetition combinators

Function: $many p :optional min max
Function: $many_ p :optional min max

{parser.peg} Without optional arguments, returns a parser that accepts zero or more repetition of p. On success, $many yields a list of mached results, while $many_ doesn’t keep the results (and faster).

Optinoal min and max must be nonnegative integers and limit the number of occurrences of p. The numbers are inclusive. For example, ($many ($. #\a) 3) accepts three or more #\a’s, and ($many ($. #\a) 2 4) accepts aa, aaa and aaaa.

Note that $many may fail if the input partially matches p.

(peg-parse-string ($many ($seq ($. #\a) ($. #\b))) "ababcd")
  ⇒ (#\b #\b)

(peg-parse-string ($many ($seq ($. #\a) ($. #\b))) "ababac")
  ⇒ *** PARSE-ERROR: expecting #\b at 5, but got #\c

If you want to stop $many at the first two occurrences in the latter case, use $try:

(peg-parse-string ($many ($try ($seq ($. #\a) ($. #\b)))) "ababac")
  ⇒ (#\b #\b)
Function: $many1 p :optional max
Function: $many1_ p :optional max

{parser.peg} Returns a parser that accepts one or more occurences of p. On success, $many1 yields a list of results of p, while $many_ discards the results of p and faster. If max is given, it specifies the maximum number of matches.

Same as ($many p 1 max) and ($many_ p 1 max). Provided as a common pattern.

Function: $repeat p n
Function: $repeat_ p n

{parser.peg} Returns a parser that accepts exaclty n occurences of p. On success, $repeat yields a list of results of p, while $repeat_ discards the results of p and faster.

Same as ($many p n n) and ($many_ p n n). Provided as a common pattern.

Function: $many-till p pe :optional min max
Function: $many-till_ p pe :optional min max

{parser.peg} Returns a parser that accepts repetition of p, until it sees input that accepts pe. On success, $many-till yields a list of results of p, while $many-till_ discards the results of p and faster.

(define comment ($seq ($.";") ($many-till ($any) ($."\n"))))
Function: $sep-by p psep :optional min max
Function: $end-by p psep :optional min max
Function: $sep-end-by p psep :optional min max

{parser.peg} These combinators match repetition of p separated by psep, such as comma-separated values. Returns the list of results of p. Optional min and max are integers that limits the number of repetitions.

These three differ only on treatment of the last separator; $sep-by accepts strictly infix syntax, that is, the input must not end with the separator; while $end-by accepts strictly suffix syntax, that is, the input must end with the separator; $sep-end-by makes the last separator optional.

Function: $chain-left p op
Function: $chain-right p op

{parser.peg} Returns a parser that parsers left-assosiative and right-associative operators, respectively.

The term of expression is parsed by a parser p, and the operator is parsed by op.


12.40.9 Miscellaneous combinators

Macro: $parameterize ((param expr) …) parser …

{parser.peg} Returns a parser that runs parser …, while altering the parameter values of param … with the reuslt of expr …, like parameterize. The parser … are run as if in $seq, so only the value of them is returned on success.

You can’t use ordinary parameterize, since such parameterization takes effect on parser construction time, and not when the parser parsing the input.

Function: $debug name p

{parser.peg} Parses the same input as p, but reports when it is parsing the input, and the result, just like debug-print.

You can’t use debug-print directly, for it will take effect on the parser construction time, not when the input is parsed.

Macro: $lazy p

{parser.peg} Returns a parser that works the same as p, but delays evaluation of p until needed. It is useful when you define mutually recursive parsers.


12.40.10 Performance



For Development HEAD DRAFTSearch (procedure/syntax/module):
DRAFT