parser.peg
- PEG parser combinators ¶This module implements a parser combinator library to build parsers based on Parsing Expression Grammar, or PEG.
PEG is a formal grammar to define a language, like regular expressions or context-free grammars. An interesting characteristic of PEG is that it can be directly mapped to a recursive decent parser, which is exactly what this library does—each production rule is a Scheme expression that takes parsers and returns a combined parser. One advantage of this approach is that you can freely mix ordinary Scheme code within the parser, that is, there’s no special “parser description language” distinct from the base Scheme language, nor you need to run separate tools like parser generators to obtain a runnable parser code.
Although PEG can directly parse the character string, the parser combinators are not tied to it. In fact, most of the combinators work transparently for any sequence of tokens, where the exact meanings of tokens depend on the application; you can have separate lexer that generates token sequence that PEG parser can parse, for example.
This library is specifically written to get a good performance
on Gauche. The parser created by parser.peg
is
no slower than the parser written manually from scratch.
However, you have to watch out some traps; see
Performance, for the details.
In this section we cover the basic concepts and tools of parser.peg
.
The code of examples is in examples/pegintro.scm if you have the
source tree of Gauche.
In parser.peg
, a parser is merely a Scheme procedure
that takes a list of tokens as an argument and returns
a result (well, in fact, it returns three values, but we’ll go
into the details later.)
Typically you don’t need to write parser procedures directly.
Instead, you can use procedures that
generates parsers. A parser can be as simple as the
following, which accepts a character #\a
.
($. #\a) ; ⇒ a parser
Here, accept means the parser checks if the head of input has
a character #\a
, and if it is, it succeeds, and if not, it fails.
A parser can be invoked by a parser driver. For example,
you can use peg-parse-string
to invoke the above parser
on a string:
gosh> (peg-parse-string ($. #\a) "abc") #\a
The parsing succeeds, and returns the matched value—#\a
in this case.
If the parser can’t accept the input, the driver throws an error
<parse-error>
.
gosh> (peg-parse-string ($char #\a) "xyz") *** PARSE-ERROR: expecting #\a at 0, but got #\x
A parser can also be constructed by combining simpler parsers,
using parser combinators. For example, $seq
takes
zero or more parsers and apply them sequentially, returning
the last result.
gosh> (peg-parse-string ($seq ($. #\a) ($. #\b) ($. #\c)) "abc") #\c
The combinator $many
takes a parser and returns a new parser
that accepts zero or more occurrence of the string the original parser
accepts.
gosh> (peg-parse-string ($many ($. #\a)) "aaaaabc") (#\a #\a #\a #\a #\a) gosh> (peg-parse-string ($many ($. #\a)) "xxxxxyz") ()
A parser is just an ordinary Scheme procedure, so it can be bound to a variable, then can be used to construct more complex parsers.
(define digits ($many1 ($. #[\d]))) (define ws ($many_ ($. #[\s]))) (define separator ($seq ws ($. #\,) ws))
I leave explanation of $many1
and
$many_
for the later section,
but you may be able to guess what those parsers do;
digits
accepts a sequence of one or more digits, and
ws
accepts sequence of zero or more whitespaces. The separator
parser accepts a comma, optionally surrounded by whitespaces.
The digits
parser returns a list of accepted
characters:
gosh> (peg-parse-string digits "12345") (#\1 #\2 #\3 #\4 #\5)
Can we create a parser that returns an integer as
a parsed result? Yes, we can use the $let
macro.
(define integer ($let ([ds digits]) ($return (x->integer (list->string ds)))))
The $let
works somewhat like and-let*
; it takes
a form of ($let ([var parser] …) expr …)
,
applying the parsers in order, binding the result of each
parser to var. If any of the parser fails, the entire
parser created by $let
macro fails. When all the parser succeeds,
each result is bound to var and expr … are
evaluated. The last expr must yield a parser.
The $return
procedure creates a parser
that doesn’t consume input, always succeeds
and returns the given value. The name is taken from
Haskell’s monads. Note that it is just an ordinary procedure
and not like a control-transfer syntax as traditional language’s
return
. You may think it just as type conversion
procedure from a Scheme object to a parser.
gosh> (peg-parse-string integer "12345") 12345
Now you can combine those parsers to build more complex one, such as a comma-separated list of integers:
(define integers1 ($seq integer ($many ($seq separator integer)))) gosh> (peg-parse-string integers1 "123, 456, 789") (456 789)
Oops, where’s 123? Well, remember that
$seq
discards the results but the last one.
We can use $let
again to keep all the results.
(define integers2 ($let ([n integer] [ns ($many ($seq separator integer))]) ($return (cons n ns)))) gosh> (peg-parse-string integers2 "123, 456, 789") (123 456 789)
(Unlike let
where the order of its init expressions are
not defined, $let
guarantees the parsers are applied sequentially.
The reason it is not called $let*
is the scope;
we also have $let*
, which we’ll explain shortly.)
Another way to gather the results of parsers is a combinator $lift
.
It is used as ($lift proc parser …)
, where proc
is an ordinary procedure which receives the result of parser …
as arguments. The return value of proc becomes the result of
the entire parser. Unlike $let
, proc doesn’t need to
return a parser.
(define integers3 ($lift cons integer ($many ($seq separator integer))))
The parsers so far doesn’t handle the case when the list contains no
integers. Using $or
combinator, which represents a choice,
we can modify it to handle zero-element case.
(define integers4 ($or ($let ([n integer] [ns ($many ($seq separator integer))]) ($return (cons n ns))) ($return '())))
By the way, “list of stuff separated by something” is a very common pattern, so we can extract the pattern to name it:
(define (sep-by stuff separator) ($or ($let ([n stuff] [ns ($many ($seq separator stuff))]) ($return (cons n ns))) ($return '())))
Then the list of integers can be written this simple:
(define integers5 (sep-by integer separator))
In fact, parser.peg
provides $sep-by
to do the above,
but we’ve just shown the definition to demonstrate
the power of the combinatorial approach;
you can use ordinary procedural abstraction to factor out common patterns.
There’s one catch in the $or
form.
It tries the next alternative only when the
parser fails without consuming the input. Once the input is
consumed, $or
commits to that choice. For example,
the following fails even if the input seems to match the
second alternative:
(define paren ($. #\()) (define thesis ($. #\))) (peg-parse-string ($or ($seq paren ($."ab") thesis) ($seq paren ($."cd") thesis)) "(cd)") ⇒ *** PARSE-ERROR: expecting ab at 1, but got #\c
It’s because when $or
tries the first branch, it reads the initial
open paren from the input, so $or
commits to the first branch.
When the branch fails, $or
doesn’t bother to try the second branch.
(In other words, $or
does not backtrack.)
You may factor out the common prefix:
($seq paren ($or ($."ab") ($."cd")) thesis)
But it may complicates the syntax, and it is not always trivial to
factor out like above. The better way is to use the $try
combinator: ($try p)
runs a parser p, and if
it fails, $try
rolls back the input as if it didn’t
consume input at all. Using with $or
, you can do arbitrary
lookahead and backtrack.
($or ($try ($seq paren ($."ab") thesis)) ($seq paren ($."cd") thesis))
Now, let’s get back to the integer list example and make it more interesting. Suppose the list of integers are surrounded by parentheses, brackets or curly-braces. Opening one and closing one must correspond.
(define begin-list ($seq0 ($. #[\(\[\{]) ws)) (define (end-list opener) ($seq ws (case opener [(#\() ($. #\))] [(#\[) ($. #\])] [(#\{) ($. #\})]))) (define int-list ($let* ([opener begin-list] ;*1 [ints ($sep-by integer separator)] ;*2 [ (end-list opener) ]) ;*3 ($return ints)))
The opening bracket is parsed by the parser begin-list
.
The $seq0
combinator is similar to seq
, but returns
the result of the first parser instead of the last one
(it’s begin0
to begin
).
The closing bracket must match the opening one, so end-list
is a procedure that takes the opening bracket and returns a suitable
parser to accept the corresponding closing bracket.
The int-list
first parses the opening bracket by begin-list
and bind the result to opener (*1).
Then goes to parse comma-separated integers (*2) and bind the
result list to ints.
Finally, it parses the matching closing bracket (*3). Note that
the it omits the variable, since we don’t need the result of
closing bracket parser.
Since we need to use the value of opener in the following bindings,
we use $let*
here, instead of $let
. The difference is
the scope of the parser expressions in the binding. Note that, however,
$let
is a lot more easier to optimize, so you want to use
$let
whenever possible.
Let’s see it works.
gosh> (peg-parse-string int-list "[123, 456, 789]") (123 456 789) gosh> (peg-parse-string int-list "{123, 456, 789}") (123 456 789) gosh> (peg-parse-string int-list "(123, 456, 789}") *** PARSE-ERROR: expecting #\) at 14, but got #\}
The last example shows it rejects unmatched brackets.
What if we want a nested list? In BNF, we could write something like this:
list : begin-list (elem (separator elem)* )? end-list elem : integer | list
The straight translation would be he following.
;; First try (define nested-list ($let* ([opener begin-list] [ints ($sep-by elem separator)] [ (end-list opener) ]) ($return ints))) (define elem ($or integer nested-list))
Let’s load it... Oops.
*** ERROR: unbound variable: elem Stack Trace: _______________________________________ 0 elem [unknown location] 1 (eval expr env) at "../lib/gauche/interactive.scm":267
We need the parser elem
to construct nested-list
, but
we need the parser nested-list
to construct elem
.
In lazy languages like Haskell this doesn’t matter, but we Schemers
are eager!
The solution is to delay the parser construction until it is
actually used. The $lazy
form does the job:
(define nested-list ($lazy ($let* ([opener begin-list] [ints ($sep-by elem separator)] [ (end-list opener) ]) ($return ints)))) (define elem ($or integer nested-list)) gosh> (peg-parse-string nested-list "(123, [456, {}, 789], 987)") (123 (456 () 789) 987)
Ok, we’re almost done. Our code can parse nested list of integers, checking bracket matches. But if you give an erroneous input, the message is cryptic and not helpful:
gosh> (peg-parse-string nested-list "(123, [456, {}, 789), 987)") *** PARSE-ERROR: expecting one of (#[0-9] #\]) at 19 Stack Trace: _______________________________________ 0 (eval expr env) at "../lib/gauche/interactive.scm":267
We could check the reason of failure in the parser, and call
$fail
with more reasonable error message. Let’s replace
end-list
above with end-list2
below:
(define (end-list2 opener) (define expected (assv-ref '((#\( . #\)) (#\[ . #\]) (#\{ . #\})) opener)) ($seq ws ($let ([closer ($. #[\)\]\}])]) (if (eqv? closer expected) ($return closer) ($fail (format "Mismatched closing bracket. '~c' expected, \ but got '~c'" expected closer))))))
You also have to change nested-list
and elem
to
use end-list2
:
(define nested-list2 ($lazy ($let* ([opener begin-list] [ints ($sep-by elem2 separator)] [ (end-list2 opener) ]) ($return ints)))) (define elem2 ($or integer nested-list2))
(You could redefine end-list
, but if you do so, don’t forget to
re-evaluate the definitions of nested-list
and elem
.
It’s because the combinators
are caculated taking the value of other combinators when it’s defined,
unlike typical procedural approach where you redefine one procedure and
other procedures will refer to the updated version after that.)
And now you see this error:
gosh> (peg-parse-string nested-list2 "(123, [456, {}, 789), 987)") *** PARSE-ERROR: Mismatched closing bracket. ']' expected, but got ')' at 20 Stack Trace: _______________________________________ 0 (eval expr env) at "../lib/gauche/interactive.scm":267
For further examples, you can take a look at some libraries in the
Gauche source tree that use parser.peg
:
{parser.peg
}
Apply parser on the input list. If parser accepts it,
returns two values–the result of parser, and the rest of
the input. If parser fails, raise <parse-error>
.
The result of parse is a value yielded by parser
passed to rope-finalize
, so any unfinalized rope is
finalized before being returned. See Ropes, for the details.
Typically, you don’t want to have entire input as a list beforehand, so you pass a lazy sequence as list (see Lazy sequences). If the input is a string or a port, convenience procedures are defined.
Input doesn’t need to be a character list. You may, for example, have a separate lexer that generates a (lazy) list of tokens, and let parser parse them.
{parser.peg
}
Convenience wrappers of peg-run-parser
that takes input from
string or iport.
Without cont argument or it is #f
,
it returns the parsed result and discards
the rest of the input. (Hint: To make sure there’s no garbage
following, use $eos
parser.)
If you want to keep parsing after parser accepts its input, pass cont a procedure; it must take two arguments, the result of parser and a lazy sequence of the rest of the input. What cont returns will be the result of these procedures.
It is an error to pass other values to cont.
{parser.peg
}
This is useful when you need to apply parser repeatedly over
the input list. Returns a generator that generates
the parsed result one match at a time.
The same input can be accepted by
by (peg-run-parser ($many parser) list)
, but this one
won’t return until all input is consumed. On the other hand,
{parser.peg
}
An condition type raised by the parser driver when the given parser
failed ultimately. Inherits <error>
.
(Note that each parser won’t throw this; one
parser’s failure doesn’t necessarily mean the entire parser fails.
It’s the driver that recognizes the ultimate failure and raise this
condition.)
The following slots are available:
<parse-error>
: message ¶This slot is inherited from <error>
. Contains string error message.
<parse-error>
: position ¶The position the failure occured.
The exact meaning of this value depends on how you call the parser driver.
At minimum, when you call it on a bare list or a string/port (using
peg-parse-string
/peg-parse-port
), the parser driver
counts the number of elements took from the input and put it here,
for it is the only available information. Note that the number
may not be what you want, e.g. if you start parsing in the middle of
the stream.
If you’re parsing from a sequence with positions,
this slot contains an instance of <sequence-position>
, which
may have line & column numbers or a source file name.
See Lazy sequence with positions, for the details.
The user code should expect both cases.
<parse-error>
: type ¶The type of failure. It’s either one of the follownig symbols:
fail-expect
The parser expects one of the objects in objects slot, but got a different one (stored in token).
fail-unexpect
The parser isn’t expecting any of the objects in objects slot, but got the one in token.
fail-compound
The parser failed multiple options. The objects slot contains
a list of of (type . msg)
where type is one of the
<parse-error>
types, and
msg is the message associated with it.
fail-message
Other miscellaneous failure. The objects slot contains a mmessage.
fail-error
Non-recoverable failure. Once this failure is generated, parser
conbinators don’t backtrack and let the entire parsing fail.
This type of failure can be generated by $rause
parser constructor
and $cut
combinator. The objects slot contains a cons
whose car is a symbol (error tag) and whose cdr is
a list of (type . msg)
. The cdr part is the same as
fail-compound
.
<parse-error>
: objects ¶Value of this slot depends on the value of type slot.
<parse-error>
: rest ¶The remaining input token stream, beginning from the point when the failure occur.
<parse-error>
: token ¶The token at the head of input when failure occur. If input already
reached at the end, this slot is set to #<eof>
.
This is the same as the car of rest
slot when there’s more input.
A PEG parser is a Scheme procedure that takes a list of items as an input, and returns three values:
#f
. If the parser fails to accept the input,
this value is either one of the symbols
fail-expect
, fail-unexpect
, fail-compound
and fail-message
. It corresponds to the type
slot
of <parse-error>
, and determines the meaning of the second
return value.
objects
slot of <parse-error>
; see the documentation
of <parse-error>
for the details (see Parser drivers).
Typically you don’t need to write a parser as a procedure; instead, you can use one of the parser builders and combinators described in the following sections.
We do provide a few utilities to write a parser from scratch, in case you need to do so.
{parser.peg
}
Call this at the tail position of the parser when it succeeds, with
value for the semantic value and rest for the rest of
input. This is the same as (values #f value rest)
, but
clearer to show the intention.
{parser.peg
}
Call one of these at the tail position of the parser when it fails.
The first argument will be in objects
slot of <parse-error>
.
The second argument should be a list of input, with the first element
being a token that can’t be accepted.
return-failure/expect
. It generates fail-expect
type failure.
return-failure/unexpect
. It generates fail-unexpect
type failure
((fail-type . objs) …)
, and
call return-failure/compound
. It generates fail-compound
type failure.
return-failure/message
, with the message
describing the reason of the failure.
{parser.peg
}
This should only be used to pass down the failure form other parser.
See the example in parse-success?
entry below.
{parser.peg
}
Check the first return value of a parser to see if it is a success.
A typical usage is to check another parser’s result and take
actions accordingly:
(receive (r v s) (parser input) (if (parse-success? r) (... do things after PARSER succeeds ... (return-result ...)) (return-failure r v s)))
This is simply checking if r is #f
, but using this
procedure indicates your intention clearly.
Procedures and macros that create parsers.
{parser.peg
}
Returns a parser that always succeeds, without consuming
input, and yields val as the result of parser.
Frequently used in $let
and $let*
’s body, but can be
used anywhere where a parser is expected.
{parser.peg
}
Returns a parser that always fails, without consuming input,
and uses msg-string as the failure message.
Frequently used to produce user-friendly error messages.
See also $expect
below.
{parser.peg
}
Returns a parser that raises a non-recoverable failure.
with msg-string as the failure message.
The difference from $fail
is that, if $or
sees
a failure created by $fail
, it may try the remaining
branches, while if it sees a failure created by $raise
,
no more branches are tried.
This can be used for better error reporting.
If you detect the case that can’t be a valid input in deep in the
parse tree, a normal failure would try other alternatives exhaustively
and generate an error message itemizing all the failed possibilities.
It is often difficult to see the real cause from such a message.
With $raise
, you can let the parser give up immediately.
See also $cut
combinator below, to convert a normal failure to
non-recoverable failure.
{parser.peg
}
This corresponds to “semantic predicate” in PEG; a parser
that can apply an arbitrary predicate on input.
Returns a parser that works as follows:
(result head (pred head))
and
yield its return value as the result of successful parsing.
If result is omitted, it yields the head of input.
If you just need a lookahead parser, you can use $assert
.
{parser.peg
}
Creates a parser that matches a Scheme object obj,
which may be a character, a string, a char-set, or a symbol.
If obj is a char-set, the parser matches any character in the set.
Specifying a symbol can be used when you’re parsing a token stream
instead of a character stream.
On success, the value of the parser is the matched input.
The resulting parser is atomic, that is, it doesn’t consume input when it fails.
{parser.peg
}
Returns a parser that accepts a single character, c.
$char-ci
ignores case.
On success, the parser yields the input character.
The resulting parser is atomic, that is, it doesn’t consume
input when it fails.
When c is a character, ($char c
is the same as ($. c)
;
you can use $char
to emphasize you’re matching a character.
{parser.peg
}
Returns a parser that accepts an input that matches
a string str. $string-ci
ignores case.
On success, the parser yields the matched string.
The parsing of string is atomic: When the parser fails,
it doesn’t consume the input. That is,
($string "ab")
is not the same as
($let ([a ($char #\a)] [b ($char #\b)]) ($return (string a b)))
.
{parser.peg
}
The first form returns a parser that accepts any character in
the character set cset.
In the second form, obj-list must be a list of either
a character, a string, a character set or a symbol, and each one
is matched with the same way as $.
.
On success, the parser yields the accepted object. On failure, the parser doesn’t consume input.
{parser.peg
}
Returns a parser that accepts any character not in the character set cset.
On success, the parser yields the accepted character.
{parser.peg
}
Returns a parser that matches any one item, and yields the matched
input item on success.
It fails only when the input already reached at the end.
{parser.peg
}
Stands for “end of stream”. Returns a parser that matches
the end of input. The value of parser is #<eof>
in that case.
It never consumes input.
=>
fail) result ¶=>
fail) result ¶{parser.peg
}
The pattern matcher macro match-let1
lifted to the parser.
See util.match
- Pattern matching, for the details of supported pattern.
The macro $match1
returns a parser that takes one item
from the input stream, and see if it matches pattern.
If it matches, evaluate result within an environment
where pattern variables are bound to matched content, and the
parser yields the value of result. If the input doesn’t
match pattern, or the input is empty, the parser fails without
consuming input.
In the third form =>
must be a literal identifier and fail
must be an identifier. The identifier fail is bound to a procedure
that takes one string argument in result. You can call fail
at the tail position of result to make the match fail, with
the passed argument as the message. If fail is called,
no input will be consumed.
(NB: The match
macro in util.match
has a similar feature,
but it binds fail to a continuation that abandons the current match
clause and go to try the next pattern. So you can invoke fail
anywhere in the dynamic extent of the clause.
In $match1
, fail is simply a procedure, so
you have to call it at the tail position to make it work.)
The macro $match1*
is similar to $match1
, except
the entire input is matched pattern. It is useful to
look into several items in input, instead of just one.
Note that if you give a pattern that consumes arbitrary length
of input (e.g. ($match1* (a ...))
, it will consume entire
input.
These macros especially come handy when you have a token stream
generated by a separate lexer—each token can have some structure
(instead of just a character) and you can take advantage of match
.
Often you want to construct a string out of the results of other parsers. It can be costly to construct strings eagerly, for a string may be just an intermediate one to be a part of a larger string. We provide a lightweight lazily string construction mechanism, called ropes.
A rope represents text with a tree of characters, strings, and ropes.
It allows
O(1) concatenation. A rope becomes a string when finalized.
The parser drivers such as peg-run-parser
automatically finalizes
ropes in the parser result.
{parser.peg
}
The parsers must yield either a character, a string, a rope,
or #f
or ()
. This procedure
returns a parser that matches parser …, then gather the
result into a rope. #f
and ()
in the results are ignored.
{parser.peg
}
This is a common idiom of ($lift rope->string ($->rope parser …))
.
{parser.peg
}
Like $->string
, but yields a symbol rather than a string.
{parser.peg
}
Converts a rope to a string.
{parser.peg
}
Converts any ropes in in obj into strings.
{parser.peg
}
Returns a choice parser. Tries the given parser in order on input.
If one succeeds, immediately yields its result.
If one fails, and does not consume input, then tries the next one.
If one fails with consuming input, immediately fails.
If p1 p2 … don’t share the same prefix
to match, you can let it fail as soon as one parser fails with consuming
input. If more than one parsers do match the same prefix,
you want to wrap them with $try
except the last one.
If all of the parsers p1 p2 … fail without consuming input,
$or
returns a compound failure of all the failures. You may
wish to produce better error message than that. Putting $fail
parser at the last doesn’t cut it, for $fail
doesn’t consume
input so all the previous failures would compound. In such cases,
you can use the second form—if the argument before the last parser
is a keyword :else
, then $or
discards the previous failures.
(peg-parse-string ($or ($. "ab") ($. "cd") :else ($fail "we want 'ab' or 'cd'")) "ef") ⇒ PARSE-ERROR: 'ab' or 'cd' required at 0
{parser.peg
}
Returns a parser that accepts the same input
the parser p accepts, but when p fails
the returned parser doesn’t consume input. Used with $or
,
you can explicitly implement a backtrack behavior.
{parser.peg
}
Returns a parser that tries p on the input.
If it succeeds, yielding its result. If it fails,
it still succeeds, yielding fallback as the result.
This is atomic; if p fails, it doesn’t consume input.
{parser.peg
}
Returns a parser that accepts the same input as p
and returns its result on success, but never consumes the input.
It can be used as a lookahead assertion.
{parser.peg
}
Returns a parser that succeeds when p fails, and that fails
when p succeeds. When p succeeds, it yields an
“unexpected” error. It never consumes input in either way.
It can be used as a negative lookahead assertion.
{parser.peg
}
Returns a parser that calls a parser p, and if it succeeds
yields its result. If p fails, fails with an error message
that says expecting msg-string.
Useful to produce user-friendly error messages.
(define digits4
($expect ($->string ($repeat ($. #[\d]) 4))
"4 consecutive digits"))
(peg-parse-string digits4 "123a")
⇒ error with
*** PARSE-ERROR: expecting 4 consecutive digits at 3, but got #\a
{parser.peg
}
If p fails, make the failure non-recoverable. It prevents the
upstream $or
and $try
from backtracking and trying other choices,
and makes the entire parsing fail immediately.
See also $raise
above.
{parser.peg
}
Returns a parser that atches
p1, p2, … sequentially. When all the parser
succeeds, $seq
returns
the last result, while $seq0 returns the first result.
Fails immediately when one of the parsers fails.
{parser.peg
}
Returns a parser that matches
p1, p2 and p3 sequentially, and returns
the result of p2.
{parser.peg
}
Returns a parser that matches p …,
and returns the list of the results.
$list*
uses the last parser’s result as the last cdr.
They are the same as ($lift list p …)
and
($lift list* p …)
, but we encounter this pattern
frequent enough to have these.
{parser.peg
}
The basic block of parser combinators;
p argument is a parser, and f is a procedure
that takes a Scheme value and returns a parser.
Returns a parser that first applies p on the input, and if it succeeds, calls f with the result of p, and applies the returned parser on the subsequent input.
This combinator, along with $return
and $fail
,
composes a MonadFail interface as in Haskell. Theoretically any
combinators can be built on top of these three. In practice,
however, it is not always easy to build things directly on top
of $bind
, and more high-level forms such as $let
,
$let*
and $lift
are frequently used.
{parser.peg
}
Monadic binding form. Each binding can be one of the
following forms:
(var parser)
Run the parser, and if it succeeds, bind its result to a variable
var. If it fails, the entire $let
or $let*
immdiately
fails.
(parser)
The variable is omitted. The parser is run, and if it succeeds,
its result is discarded and the next binding or body is evaluated.
If it fails, the entire $let
or $let*
immdiately fails.
parser
Same as above. This form can only be used if parser is just a vairable reference.
Once all the parsers in binding … succeeds, body … are evaluated in the environment where var in bindings are bound to the parser results. The last expression of body must return a parser.
Unlike let
, the parsers in binding … are always applied
to the input sequentially. The difference of $let
and
$let*
is the scope. With $let*
, the variables bound
in earlier binding can be used to construct the parser later.
This means $let
can evaluate all the parsers beforehand, while
$let*
may need to construct parsers at the time of processing
input. Creating a parser involves closure allocations, so you want
to use $let
whenever possible.
Note: $let*
is similar to Haskell’s do
construct. We chose
the name $let
and $let*
, for it is easier to see it’s
a binding form, and also Scheme already uses do
for
loop construct.
{parser.peg
}
Lifts a procedure f onto the parsers’ world.
In a pseudo type declaration, lift
’s type can be understood
as follows:
lift :: (a b ... -> z) (Parser a) (Parser b) ... -> (Parser z)
That is, lift
creates a parser such that
it first applies parsers on the input, and if all of them succeeds,
it calls f with the parsers’ results as arguments,
and the return value of f becomes the whole parser’s result.
In other words, the following equivalence holds:
($lift f p0 p1 ...) ≡ ($let ([r0 p0] [r1 p1] ...) ($return (f r0 r1 ...)))
It is sometimes simpler to use $lift
instead of $let
.
For example, the following code creates a parser that matches
input with p0 p1 … sequentially, then yields
the list of the parser results:
($lift list p0 p1 ...)
Note that after all the parsers succeed, the whole parser is
destined to succeed—the procedure f can’t make the parser
fail. If you need to fail after all the parsers succeeds, use $let
or $let
.
{parser.peg
}
Each parser-bind-form may be a parser-yielding expression, except
that you can insert a form ($: var parser-expression)
anywherer in it,
where parser-expression is an expression that yields a parser.
The $:
form is equivalent to just the parser-expression, except
that its semantic value is bound to a variable var.
Each parser created by parser-bind-form is applied to the input in sequence. One of parser-bind-form fails, the entire parser immediately fails. If all of parser-bind-form succeeds, expr is evaluated in the environment where all the vars are bound to the corresponding parser expression.
Since $binding
walks entire parser-bind-form to look for
$:
forms, you can’t have nested $binding
form inside
parser-bind-form.
If the parser expression associated with var fails, or never
executed, the var is bound to #<undef>
.
If the parser expression succeeds multiple times, var holds
the last value. Also, var can appear more than one places;
it holds the last bound value.
The value of expr form becomes the semantic value of the entire parser.
$lbinding
is a shorthand of ($lazy ($binding ...))
.
The optional (=> fail)
form before expr is similar to the
one with $match
. If given, fail, which must be an identifier,
is bound to a procedure that returns failure. You should call it in the
tail position of expr to indicate failure. It can be
(fail message)
: Returns fail-message
type
failure, with a string message as the message.
(fail tag message)
: Returns fail-error
type
(non-recoverable) failure, where tag must be a symobl error
(in future, differtent tags will be supported).
{parser.peg
}
Ps is a list of parsers. Apply those parsers sequentially
on the input, passing around the seed value. That is, if we let
v0
, v1 … vn be the result of each parsers in ps,
it returns (proc vn (... (proc v2 (proc v1 seed))...))
.
If any of the parser in ps fails, $fold-parsers
fails at
that point.
Conceptually, it can be written as follows:
(define ($fold-parsers proc seed ps) (if (null? ps) ($return seed) ($let ([v (car ps)]) ($fold-parsers proc (proc v seed) (cdr ps)))))
But we use more efficient implementation.
{parser.peg
}
Similar to $fold-parsers
, but the folding by proc
right to left. That is, if we let
v0
, v1 … vn be the result of each parsers in ps,
it returns (proc v1 (proc v2 (... (proc vn seed)...)))
.
If any of the parser in ps fails, $fold-parsers-right
fails at
that point.
{parser.peg
}
Without optional arguments, returns
a parser that accepts zero or more repetition of p.
On success, $many
yields a list of mached results,
while $many_ doesn’t keep the results (and faster).
Optinoal min and max must be nonnegative integers
and limit the number of occurrences of p. The numbers are inclusive.
For example,
($many ($. #\a) 3)
accepts three or more #\a
’s,
and ($many ($. #\a) 2 4)
accepts aa
, aaa
and
aaaa
.
Note that $many
may fail if the input partially matches
p
.
(peg-parse-string ($many ($seq ($. #\a) ($. #\b))) "ababcd") ⇒ (#\b #\b) (peg-parse-string ($many ($seq ($. #\a) ($. #\b))) "ababac") ⇒ *** PARSE-ERROR: expecting #\b at 5, but got #\c
If you want to stop $many
at the first two occurrences
in the latter case, use $try
:
(peg-parse-string ($many ($try ($seq ($. #\a) ($. #\b)))) "ababac") ⇒ (#\b #\b)
{parser.peg
}
Returns a parser that accepts one or more occurences of p.
On success, $many1
yields a list of results of p,
while $many_ discards the results of p and faster.
If max is given, it specifies the maximum number of matches.
Same as ($many p 1 max)
and ($many_ p 1 max)
.
Provided as a common pattern.
{parser.peg
}
Returns a parser that accepts exaclty n occurences of p.
On success, $repeat
yields a list of results of p,
while $repeat_
discards the results of p and faster.
Same as ($many p n n)
and ($many_ p n n)
.
Provided as a common pattern.
{parser.peg
}
Returns a parser that accepts repetition of p, until it
sees input that accepts pe.
On success, $many-till
yields a list of results of p,
while $many-till_ discards the results of p and faster.
(define comment ($seq ($.";") ($many-till ($any) ($."\n"))))
{parser.peg
}
These combinators match repetition of p separated by psep,
such as comma-separated values.
Returns the list of results of p.
Optional min and max are integers that limits the
number of repetitions.
These three differ only on treatment of the last separator; $sep-by
accepts strictly infix syntax, that is, the input must not end
with the separator; while $end-by
accepts strictly suffix syntax,
that is, the input must end with the separator; $sep-end-by
makes the last separator optional.
{parser.peg
}
Returns a parser that parsers left-assosiative and right-associative operators,
respectively.
The term of expression is parsed by a parser p, and the operator is parsed by op.
{parser.peg
}
Returns a parser that runs parser …, while altering
the parameter values of param … with the reuslt of
expr …, like parameterize
.
The parser … are run as if in $seq
, so only the value
of them is returned on success.
You can’t use ordinary parameterize
, since such parameterization
takes effect on parser construction time, and not when the parser
parsing the input.
{parser.peg
}
Parses the same input as p, but reports when it is parsing
the input, and the result, just like debug-print
.
You can’t use debug-print
directly, for it will take effect
on the parser construction time, not when the input is parsed.
{parser.peg
}
Returns a parser that works the same as p, but delays evaluation
of p until needed. It is useful when you define mutually
recursive parsers.