PEGパーザコンビネータ (Gauche ユーザリファレンス)

12.40 `parser.peg` - PEGパーザコンビネータ

Module: parser.peg ¶

このモジュールは、Parsing Expression Grammar (PEG)に基づいたパーザを組み立てるコンビネータライブラリを実装しています。

PEGは、正規表現や文脈自由文法と同様の形式言語です。 PEGが興味深いのは、文法記述がそのまま再帰下降パーザに移し替えられるところで、このライブラリはまさにそれをやっています。各生成規則は、パーザを受け取って結合されたパーザを返すScheme式です。このアプローチの利点は、Schemeコードと文法記述を混ぜて書けることです。特別な「パーザー記述言語」を別に学ぶ必要も、パーザジェネレータのような別ツールを使ってソースコードを生成する必要もありません。

PEGは文字列を直接パーズすることもできますが、パーザコンビネータは文字列のパーズ限定ではありません。実際、多くのコンビネータは「トークン」の列 (具体的なトークンの意味はアプリケーションが決められます) に対して透過的に動作します。例えば字句解析器を別に作ってトークンの列を生成し、それをPEGでパーズする、ということもできます。

このライブラリは、Gaucheで性能が出るように書かれています。 parser.pegで作られたパーザは、手書きのパーザと遜色ない性能を持ちます。但し、いくつか気をつけるべき点はあります。それについては Performanceを参照してください。

12.40.1 PEGひとめぐり

本節では、parser.pegの基本的なコンセプトと道具を説明します。例に使ったコードは、Gaucheのソースツリーのexamples/pegintro.scmにあります。

parser.pegでは、パーザとは単なるSchemeの手続きです。それはトークンのリストを引数に取り、結果を返します (実際は3つの値を返します。それについては後で詳しく述べます)。

パーザ手続きを直接書く必要は滅多にありません。代わりに、パーザを生成する手続きがたくさん用意されているのでそれを使います。例えば、文字#\aを受理するパーザはこう書けます。

($. #\a)   ; ⇒ a parser

ここで受理するとは、パーザが入力の先頭をチェックして、それが文字#\aならば成功し、そうでなければ失敗するという意味です。

パーザはパーザドライバによって呼び出せます。例えば peg-parse-stringを使って、上のパーザで文字列をパーズしてみましょう。

gosh> (peg-parse-string ($. #\a) "abc")
#\a

パーズは成功し、マッチした値 (#\a) が返りました。パーザが期待する入力でなかった場合には、ドライバは<parse-error>を投げます。

gosh> (peg-parse-string ($char #\a) "xyz")
*** PARSE-ERROR: expecting #\a at 0, but got #\x

単純なパーザをパーザコンビネータで組み合わせて、より複雑なパーザを作ることができます。例えば、$seqはいくつかのパーザを受け取り、「それらを順に適用して最後の結果を返すパーザ」を作って返します。

gosh> (peg-parse-string ($seq ($. #\a) ($. #\b) ($. #\c)) "abc")
#\c

コンビネータ$manyはパーザを取ると、それのゼロ回以上の繰り返しを受理するパーザを作って返します。

gosh> (peg-parse-string ($many ($. #\a)) "aaaaabc")
(#\a #\a #\a #\a #\a)
gosh> (peg-parse-string ($many ($. #\a)) "xxxxxyz")
()

パーザは通常のSchemeの手続きなので、変数に束縛し、それをより複雑なパーザを作るのに使えます。

(define digits    ($many1 ($. #[\d])))
(define ws        ($many_ ($. #[\s])))
(define separator ($seq ws ($. #\,) ws))

$many1と$many_の説明は後回しにしますが、何となくこれらのパーザの動作は推測できるのではないでしょうか。 digitsは1個以上の数字の並びを、 wsは0個以上の空白文字の並びを受理するパーザです。 separatorパーザは0個以上の空白で囲まれたコンマを受容します。

digitsは受理した文字のリストを返します。

gosh> (peg-parse-string digits "12345")
(#\1 #\2 #\3 #\4 #\5)

パーズの結果として整数値を返すようにできるでしょうか。 $letマクロを使えばできます。

(define integer
  ($let ([ds digits])
    ($return (x->integer (list->string ds)))))

$letはand-let*のように動作します。 ($let ([var parser] …) expr)という形式を取り、 parserを順に入力に適用してゆきます。 parserのどれかひとつでも失敗したら全体が失敗します。全てのparserが成功したら、それぞれの結果がvarに束縛された環境下で expr …が評価されます。最後のexprはパーザを返さねばなりません。

$return手続きは、入力を消費せず、常に成功し、引数をパーズ結果とするパーザを作ります。この名前はHaskellのモナドからきました。他の多くの言語にあるreturnのように制御の流れを変えるのではなく、単なる手続きであることに注意してください。任意のSchemeオブジェクトをパーザに変える変換手続きと考えても良いでしょう。

gosh> (peg-parse-string integer "12345")
12345

これらのパーザを組み合わせて、さらに複雑なパーザを作ることができます。例えばコンマで区切られた整数のリストです：

(define integers1 ($seq integer
                        ($many ($seq separator integer))))

gosh> (peg-parse-string integers1 "123, 456, 789")
(456 789)

あれ、最初の123がどっかいっちゃいました。そうそう、 $seqは最後のパーザの値以外は捨てるのでした。 $letを使って結果を集めることができます。

(define integers2 ($let ([n  integer]
                         [ns ($many ($seq separator integer))])
                     ($return (cons n ns))))

gosh> (peg-parse-string integers2 "123, 456, 789")
(123 456 789)

(letではinit式の評価順序は決まっていませんが、 $letはパーザを書かれた順に適用することを保証しています。これを$let*と呼んでいないのはスコープのためです。すぐ後で$let*も紹介します。)

パーザの結果を集めるもうひとつの方法は$liftコンビネータです。 ($lift proc parser …) のように使います。ここで procは普通の手続きで、parser …の結果を引数として受け取り、 procの返した結果が全体のパーザの結果になります。 $letと違って、procはパーザを返す必要はありません。

(define integers3 ($lift cons
                         integer
                         ($many ($seq separator integer))))

これまでのパーザは入力に整数が全く含まれていない場合を考慮していませんでした。選択を表す$orコンビネータを使って、整数が無いケースをサポートできます。

(define integers4 ($or ($let ([n  integer]
                              [ns ($many ($seq separator integer))])
                         ($return (cons n ns)))
                       ($return '())))

ところで、「何かで区切られた何かのリスト」というのは頻繁に出てくるので、次のようにパターンを取り出しておくことができます：

(define (sep-by stuff separator)
  ($or ($let ([n  stuff]
              [ns ($many ($seq separator stuff))])
         ($return (cons n ns)))
       ($return '())))

すると整数のリストのパーザはこんなに簡単になります:

(define integers5 (sep-by integer separator))

実はparser.pegには上と同じことができる$sep-byというのが既に用意されているんですが、コンビネータを使う方法の強みを示すためにあえて実装してみました。共通のパターンに通常の手続きを使った抽象化を使えるのが利点です。

$orについてはひとつ注意があります。ある選択肢が失敗した時、 それが入力を消費していない場合に限り、次の選択肢が試されるということです。入力が少しでも消費されたら、$orはその選択肢にコミットし、その先で失敗しても他の選択肢を試しません。下の例では、 2番目の$seqが入力にマッチするにもかかわらず、パーズは失敗します。

(define paren  ($. #\())
(define thesis ($. #\)))

(peg-parse-string ($or ($seq paren ($."ab") thesis)
                       ($seq paren ($."cd") thesis))
                  "(cd)")
 ⇒ *** PARSE-ERROR: expecting ab at 1, but got #\c

$orの最初の選択肢で、先頭の開き括弧はマッチに成功して入力が消費されます。その後でマッチが失敗するわけですが、既に入力が消費されているので、 $orは他の選択肢を考慮することなく全体を失敗させます。 (言い換えれば、$orはバックトラックを行いません。)

一つの手は、共通部分を括り出すことです。

($seq paren
      ($or ($."ab") ($."cd"))
      thesis)

ただ、これは文法を複雑化させることがありますし、いつでも簡単にできるとは限りません。よりうまい方法は$tryコンビネータを使うことです。 ($try p)はパーザpを走らせ、それが失敗したら入力を元の時点まで巻き戻します。$orと一緒につかえば、いくらでも先読みとバックトラックができます。

($or ($try ($seq paren ($."ab") thesis))
     ($seq paren ($."cd") thesis))

さて整数のリストの例をもうちょっとおもしろくしてみましょう。リストは小括弧、中括弧、大括弧のどれかで囲まれていて、開き括弧と閉じ括弧は対応していなければならない、とします。

(define begin-list
  ($seq0 ($. #[\(\[\{]) ws))

(define (end-list opener)
  ($seq ws (case opener
            [(#\() ($. #\))]
            [(#\[) ($. #\])]
            [(#\{) ($. #\})])))

(define int-list
  ($let* ([opener begin-list]                ;*1
          [ints ($sep-by integer separator)] ;*2
          [ (end-list opener) ])             ;*3
    ($return ints)))

開き括弧はbegin-listパーザが受理します。$seq0は $seqとほぼ同じですが、最初のパーザの結果を返します。 (beginに対するbegin0と同じです。)

閉じ括弧は開き括弧と対応しないとならないので、end-listは開き括弧を取って適切な閉じ括弧のパーザを返す手続きになっています。

int-listはまず開き括弧をbegin-listでパーズし、その結果を openerに束縛します(*1)。次にコンマ区切りの整数をパーズし、結果のリストを intsに束縛します(*2)。最後に開き括弧に対応する閉じ括弧をパーズします(*3)。最後のパーザは値を使わないので、変数を省略しています。

openerの値を続く束縛フォームの中で使う必要があるので、ここでは$letではなく$let*を使いました。この差はletとlet*の差と同じで、スコープの違いです。ただ、$letの方が処理系にはうんと最適化しやすいので、可能な限り$letを使う方が良いでしょう。

走らせてみます。

gosh> (peg-parse-string int-list "[123, 456, 789]")
(123 456 789)
gosh> (peg-parse-string int-list "{123, 456, 789}")
(123 456 789)
gosh> (peg-parse-string int-list "(123, 456, 789}")
*** PARSE-ERROR: expecting #\) at 14, but got #\}

最後の例は括弧の対応が取れていないと失敗する例です。

ネストしたリストをパーズするにはどうすればいいでしょう? BNFならこう書けるでしょう。

list : begin-list (elem (separator elem)* )? end-list
elem : integer | list

これをそのまま移し替えると次のようになります。

;; First try
(define nested-list
  ($let* ([opener begin-list]
          [ints ($sep-by elem separator)]
          [ (end-list opener) ])
    ($return ints)))
(define elem  ($or integer nested-list))

ロードしてみましょう。おっと

*** ERROR: unbound variable: elem
Stack Trace:
_______________________________________
  0  elem
        [unknown location]
  1  (eval expr env)
        at "../lib/gauche/interactive.scm":267

nested-listの値を計算するのにelemの値を使いますが、 elemの値を計算するのにnested-listの値が必要です。 Haskellのような怠惰な言語ならこれでも構わないんですが、 Schemerは熱心なんです!

解決策は、パーザの計算を必要になるまで遅らせることです。 $lazyフォームが必要な遅延を行います。

(define nested-list
  ($lazy
    ($let* ([opener begin-list]
            [ints ($sep-by elem separator)]
            [ (end-list opener) ])
      ($return ints))))
(define elem  ($or integer nested-list))

gosh> (peg-parse-string nested-list "(123, [456, {}, 789], 987)")
(123 (456 () 789) 987)

さあ、ほとんど完成です。括弧の対応をチェックしながら、ネストした整数のリストをパーズできるようになりました。ただ、間違いのある入力を与えた時に出されるエラーメッセージがわかりにくいですね。

gosh> (peg-parse-string nested-list "(123, [456, {}, 789), 987)")
*** PARSE-ERROR: expecting one of (#[0-9] #\]) at 19
Stack Trace:
_______________________________________
  0  (eval expr env)
        at "../lib/gauche/interactive.scm":267

パーザの中で失敗の理由を調べて、意味のあるエラーメッセージで$failを呼び出すことができます。上のend-listを下のend-list2に置き換えてみて下さい。

(define (end-list2 opener)
  (define expected
    (assv-ref '((#\( . #\)) (#\[ . #\]) (#\{ . #\})) opener))
  ($seq ws
        ($let ([closer ($. #[\)\]\}])])
          (if (eqv? closer expected)
            ($return closer)
            ($fail (format "Mismatched closing bracket. '~c' expected, \
                            but got '~c'"
                           expected closer))))))

nested-listとelemについてもend-list2を使うように変更します:

(define nested-list2
  ($lazy
    ($let* ([opener begin-list]
            [ints ($sep-by elem2 separator)]
            [ (end-list2 opener) ])
      ($return ints))))

(define elem2  ($or integer nested-list2))

(end-listを再定義しても良いのですが、その場合は nested-listとelemの定義を再評価することを忘れないでください。コンビネータは、定義時に他のコンビネータの値を使って計算されます。通常の手続き定義のアプローチでは、一つの手続きの定義を差し替えたら他の手続きは新しい定義を使うようになりますが、それと異なるので要注意です。)

これで、エラーメッセージはこんなふうになります。

gosh> (peg-parse-string nested-list2 "(123, [456, {}, 789), 987)")
*** PARSE-ERROR: Mismatched closing bracket. ']' expected, but got ')' at 20
Stack Trace:
_______________________________________
  0  (eval expr env)
        at "../lib/gauche/interactive.scm":267

更なる例は、Gaucheソースツリーの中でparser.pegを使っているライブラリに見ることができます:

lib/rfc/json.scm
lib/text/edn.scm
lib/www/css.scm

12.40.2 パーザドライバ

Function: peg-run-parser parser list ¶

{parser.peg} パーザparserを入力listに適用します。入力が受理されたら、 parserの結果と、残りの入力の二つの値を返します。 parserが失敗したら、<parse-error>が投げられます。

parseの結果はrope-finalizeを通されるので、結果中のファイナライズされていないロープは全てファイナライズされて返されます。ロープについてはRopesを参照してください。

多くの場合、入力をあらかじめ全部リストで持っておきたくはないと思います。その場合、遅延シーケンスをlistに渡せば良いでしょう (遅延シーケンス)。なお、入力が文字列やポートの場合はユーティリティ手続きが定義されています。

入力が文字のリストである必要はありません。例えば、別の字句解析器を使ってトークンの(遅延)リストにしてから、 parserでそれをパーズすることもできます。

Function: peg-parse-string parser string :optional cont ¶

Function: peg-parse-port parser iport :optional cont ¶

{parser.peg} peg-run-parserで文字列stringや入力ポートiportからの入力をパーズするユーティリティ手続きです。

cont引数が省略されるか#fであれば、この手続きはパーザが受理したところ以降の入力は捨てて、パーザの結果だけ返します。 (ヒント: 余分なゴミが後に続いてないことを保証するには$eosパーザが使えます)。

parserが入力を受理した後、さらにパーズを続けたければ、contに手続きを渡します。その手続きはparserの結果と、残りの入力の遅延シーケンスの二つの引数を取ります。contが返すものが全体の返り値となります。

contにそれ以外の値を渡すのはエラーです。

Function: peg-parser->generator parser list ¶

{parser.peg} パーザparserを入力listに繰り返し適用したい場合に便利です。マッチした入力をひとつづつ返すジェネレータを作って返します。

(peg-run-parser ($many parser) list)としても同じ入力が受容されますが、こうした場合は全ての入力が消費されるまで戻ってきません。

Condition type: <parse-error> ¶

{parser.peg} パーザが最終的に失敗した場合にパーザドライバが投げるコンディション型です。 <error>を継承します。 (各パーザがこのコンディションを投げることはありません。一つのパーザが失敗したとしても、他の選択肢が成功するかもしれないからです。最終的な失敗を認識してこのコンディションを投げるのはドライバの仕事です。)

以下のスロットを持ちます。

Instance Variable of <parse-error>: message ¶: <error>から継承されたスロットで、エラーメッセージを保持しています。

Instance Variable of <parse-error>: position ¶

失敗が起きた入力の場所です。

この値の正確な意味は、パーザドライバがどう呼ばれたかに依存します。

パーザドライバに普通のリストや、peg-parse-string/peg-parse-portを使って文字列/ポートを渡した場合、パーザドライバは入力の先頭から失敗箇所までの入力要素の数を数えてそれをこの値とします。それしか情報が無いからです。入力の途中からパーズを始めた場合などでは、この数値はあなたが必要としているものとは異なるかもしれません。

位置情報つきシーケンスを入力として渡していれば、このスロットは<sequence-position> のインスタンスを保持し、そこから行番号、カラム番号、ソースファイル名などを取り出すことができます。詳しくはSee 位置情報つき遅延シーケンスを参照してください。

ユーザコードはどちらの場合にも対応できるようにしておかねばなりません。

Instance Variable of <parse-error>: type ¶

失敗のタイプ。次のいずれかのシンボルです。

fail-expect: パーザはobjectsスロットに格納されたオブジェクトのどれかを期待していたが、入力はそうでないオブジェクトであった (入力オブジェクトはtokenスロットに格納される)。
fail-unexpect: パーザはobjectスロットに格納されたオブジェクトのいずれも期待していなかったが、入力はそのいずれかであった (入力オブジェクトはtokenスロットに格納される)。
fail-compound: パーザは複数の選択肢を試しいずれも失敗した。objectsスロットは (type . msg)のリストを保持している。ここでtypeは一つの選択肢の失敗の<parse-error>タイプ、msgはその選択肢の失敗メッセージ。
fail-message: その他の失敗。messageスロットにメッセージがある。
fail-error: 回復不能な失敗。この失敗が生成されると、パーザコンビネータは巻き戻して別の選択肢を試すことをせず、全体のパージングを失敗させます。このタイプの失敗は$raiseパーザコンストラクタや$cutパーザコンビネータで生成されます。objectスロットはコンスセルで、carにはシンボル (エラータグ) が、 cdrには(type . msg)のリストが入っています。cdr部分は fail-compoundのobjectスロットと同じです。

Instance Variable of <parse-error>: objects ¶: このスロットの値の意味はtypeスロットの値に依存します。

Instance Variable of <parse-error>: rest ¶: 失敗が起きた箇所から先の入力トークンストリーム。

Instance Variable of <parse-error>: token ¶

失敗が起きた時の入力トークン。もし入力が終端に達していた場合は#<eof>。

入力の残りがある場合、これはrestスロットのcarと同じです。

12.40.3 PEGパーザの正体

PEGパーザは、入力にリストを取り、以下の3つの値を返す手続きです。

失敗のタイプ。パーザが入力を受理することに成功した場合、この値は#fです。失敗した場合は、この値は次のシンボルのうちのいずれかです: fail-expect、fail-unexpect、fail-compound、 fail-message。これは<parse-error>オブジェクトの typeスロットの値に対応します。二番目の戻り値の意味はこの値に依存します。
値。パーザが入力の受理に成功した場合は、これはパーザの「結果」、しばしばセマンティックバリューと呼ばれるものです。パーザが失敗した場合は、この値は失敗の原因のヒントとなるものです。 <parse-error>オブジェクトのobjectsスロットの値に対応します。詳しくは<parse-error>のドキュメントを見てください (パーザドライバ参照)。
入力の残り。

通常、パーザを手続きとして書くことはあまりありません。以降の節で説明するパーザビルダやコンビネータを使ってパーザを作ります。

しかし、パーザ手続きを一から書く必要が出てきた時のために、いくつかのユーティリティ手続きが用意してあります。

Function: return-result value rest ¶: {parser.peg} パーザが入力の受理に成功した時は、この手続きを末尾位置で呼んでください。 valueはパーザが返すセマンティックバリュー、 restは入力の残りです。これは(values #f value rest)と同じですが、意図をより明確に示すことができます。

Function: return-failure/expect objs rest ¶

Function: return-failure/unexpect objs rest ¶

Function: return-failure/message msg rest ¶

Function: return-failure/compound fails rest ¶

{parser.peg} パーザが入力を受理できなかった場合は、この手続きのいずれかを末尾位置で呼んでください。最初の引数は<parse-error>のobjectsスロットの値となります。 2番目の引数は、受理できなかったトークンを先頭とする入力リストです。

限られた種類のオブジェクト (objs) を期待していて、それ以外の入力に出会った場合は、return-failure/expectを呼んでください。これは fail-expectタイプの失敗となります。
限られた種類のオブジェクト (objs) 以外のものを期待していたのに、それに出会ってしまった場合は、return-failure/unexpectを呼んでください。これはfail-unexpectタイプの失敗となります。
複数の選択肢があってそのいずれにも失敗した場合は、全ての失敗を((fail-type . objs) …)の形の連想リストにして、それをfails引数としてreturn-failure/compoundを呼んでください。これはfail-compoundタイプの失敗となります。
失敗が上記いずれにもあてはまらない場合は、失敗の理由を文字列のメッセージにして return-failure/messageを呼んでください。

Function: return-failure type objs rest ¶: {parser.peg} これは他のパーザからの失敗をそのまま失敗として返す場合にのみ使ってください。下のparse-success?の説明を参照のこと。

Function: parse-success? r ¶

{parser.peg} パーザの最初の戻り値が成功であるかどうかを調べます。典型的な使い方は、他のパーザの結果によって分岐する場合です:

(receive (r v s) (parser input)
  (if (parse-success? r)
     (... do things after PARSER succeeds ...
       (return-result ...))
     (return-failure r v s)))

この手続きは単にrが#fかどうか見ているだけですが、これを使うことにより意図を明確に示すことができます。

12.40.4 基本的なパーザビルダー

パーザを作り出す手続きとマクロです。

Function: $return val ¶

{parser.peg} 入力を消費せず、常に成功してvalを結果として返すパーザを作って返します。

$letや$let*の本体で使われることが多いですが、パーザを期待するところではどこでも使えます。

Function: $fail msg-string ¶

{parser.peg} 入力を消費せず、常に失敗して、msg-stringを失敗メッセージとするパーザを作って返します。

ユーザフレンドリなエラーメッセージを出すためによく使われます。

Function: $raise msg-string ¶

{parser.peg} 回復不可能な失敗を生成するパーザを作って返します。 msg-stringは失敗のメッセージに使われます。

$failとの違いは、$orが$failで生成された失敗を見た場合は他の選択肢を試すことがあるのに対し、$raiseで生成された失敗を見たら他の選択肢はもう試されないことです。

これはより良いエラー報告をするのに役立ちます。パーズツリーの深いところで、有効になり得ない入力を検出したとします。通常の失敗とした場合、パーザはツリーの上の方で他の選択肢も全部試し、失敗の理由をすべて列挙したエラーメッセージを生成するかもしれません。しかし多くの場合、そのメッセージから本当の失敗の原因を知るのは難しいです。 $raiseを使えば、パーザに即座に諦めさせることができます。

通常の失敗を回復不可能な失敗に変換する、下の$cutコンビネータも参照してください。

Function: $satisfy pred expect :optional result ¶

{parser.peg} これはPEGでいう「先読み述語」、つまり入力トークンに任意の述語predを適用するパーザを作って返します。

返されるパーザは次のように動作します。

入力の先頭headが述語predを満たした場合、resultを (result head (pred head)) のように呼び出し、その戻り値をパーザ成功の結果とします。 resultが省略された場合は、入力の先頭が結果になります。
そうでなければ、expectを期待していた結果としてパーズは失敗します。

必要なのが先読みするパーザであれば、$assertも使えます。

Function: $. obj ¶

{parser.peg} Creates a parser that matches a Scheme object obj, which may be a character, a string, a char-set, or a symbol. If obj is a char-set, the parser matches any character in the set.

The resulting parser is atomic, that is, it doesn’t consume input when it fails.

Function: $char c ¶
Function: $char-ci c ¶: {parser.peg} Returns a parser that accepts a single character, c. $char-ci ignores case. On success, the parser yields the input character. The resulting parser is atomic, that is, it doesn’t consume input when it fails.

Function: $string str ¶

Function: $string-ci str ¶

{parser.peg} Returns a parser that accepts an input that matches a string str. $string-ci ignores case. On success, the parser yields the matched string.

The parsing of string is atomic: When the parser fails, it doesn’t consume the input. That is, ($string "ab") is not the same as ($let ([a ($char #\a)] [b ($char #\b)]) ($return (string a b))).

Function: $one-of cset ¶

Function: $one-of obj-list ¶

{parser.peg} The first form returns a parser that accepts any character in the character set cset. In the second form, obj-list must be a list of either a character, a string, a character set or a symbol, and each one is matched with the same way as $..

On success, the parser yields the accepted object.

Function: $none-of cset ¶: {parser.peg} Returns a parser that accepts any character not in the character set cset. On success, the parser yields the accepted character.

Function: $any ¶: {parser.peg} Returns a parser that matches any one item, and yields the matched input item on success. It fails only when the input already reached at the end.

Function: $eos ¶: {parser.peg} Stands for “end of stream”. Returns a parser that matches the end of input. It never consumes input.

Macro: $match1 pattern result ¶

Macro: $match1 pattern ¶

Macro: $match1 pattern (=> fail) result ¶

Macro: $match1* pattern result ¶

Macro: $match1* pattern ¶

Macro: $match1* pattern (=> fail) result ¶

{parser.peg} The pattern matcher macro match-let1 lifted to the parser. See util.match - パターンマッチング, for the details of supported pattern.

The macro $match1 returns a parser that takes one item from the input stream, and see if it matches pattern. If it matches, evaluate result within an environment where pattern variables are bound to matched content, and the parser yields the value of result. If the input doesn’t match pattern, or the input is empty, the parser fails without consuming input.

In the third form => must be a literal identifier and fail must be an identifier. The identifier fail is bound to a procedure that takes one string argument in result. You can call fail at the tail position of result to make the match fail, with the passed argument as the message. If fail is called, no input will be consumed.

(NB: The match macro in util.match has a similar feature, but it binds fail to a continuation that abandons the current match clause and go to try the next pattern. In $match1, fail is simply a procedure, so you have to call it at the tail position to make it work.)

The macro $match1* is similar to $match1, except the entire input is matched pattern. It is useful to look into several items in input, instead of just one. Note that if you give a pattern that consumes arbitrary length of input (e.g. ($match1* (a ...)), it will consume entire input.

These macros especially come handy when you have a token stream generated by a separate lexer—each token can have some structure (instead of just a character) and you can take advantage of match.

12.40.5 Ropes

Often you want to construct a string out of the results of other parsers. It can be costly to construct strings eagerly, for a string may be just an intermediate one to be a part of a larger string. We provide a lightweight lazily string construction mechanism, called ropes.

A rope is either a character, a string or a pair of ropes. It allows O(1) concatenation. A rope becomes a string when finalized. The parser drivers such as peg-run-parser automatically finalizes ropes in the parser result.

Function: $->rope parser … ¶: {parser.peg} The parsers must yield either a character, a string, a rope, or #f or (). This procedure returns a parser that matches parser …, then gather the result into a rope. #f and () in the results are ignored.

Function: $->string parser … ¶: {parser.peg} This is a common idiom of ($lift rope->string ($->rope parser …)).

Function: $->symbol parser … ¶: {parser.peg} Like $->string, but yields a symbol rather than a string.

Function: rope->string rope ¶: {parser.peg} Converts a rope to a string.

Function: rope-finalize obj ¶: {parser.peg} Converts any ropes in in obj into strings.

12.40.6 Choice, backtrack and assertion combinators

Function: $or p1 p2 … ¶

Function: $or p1 p2 … :else plast ¶

{parser.peg} Returns a choice parser. Tries the given parser in order on input. If one succeeds, immediately yields its result. If one fails, and does not consume input, then tries the next one. If one fails with consuming input, immediately fails.

If p1 p2 … don’t share the same prefix to match, you can let it fail as soon as one parser fails with consuming input. If more than one parsers do match the same prefix, you want to wrap them with $try except the last one.

If all of the parsers p1 p2 … fail without consuming input, $or returns a compound failure of all the failures. You may wish to produce better error message than that. Putting $fail parser at the last doesn’t cut it, for $fail doesn’t consume input so all the previous failures would be compound. In such cases, you can use the second form—if the argument before the last parser is a keyword :else, then $or discards the previous failures.

(peg-parse-string ($or ($. "ab")
                       ($. "cd")
                       :else ($fail "we want 'ab' or 'cd'"))
                  "ef")
 ⇒ PARSE-ERROR: 'ab' or 'cd' required at 0

Function: $try p ¶: {parser.peg} Returns a parser that accepts the same input the parser p accepts, but when p fails the returned parser doesn’t consume input. Used with $or, you can explicitly implement a backtrack behavior.

Function: $optional p :optional fallback ¶

{parser.peg} Returns a parser that tries p on the input. If it succeeds, yielding its result. If it fails, it still succeeds, yielding fallback as the result.

This is atomic; if p fails, it doesn’t consume input.

Function: $assert p ¶: {parser.peg} Returns a parser that accepts the same input as p and returns its result on success, but never consumes the input. It can be used as a lookahead assertion.

Function: $not p ¶: {parser.peg} Returns a parser that succeeds when p fails, and that fails when p succeeds. When p succeeds, it yields an “unexpected” error. It never consumes input in either way. It can be used as a negative lookahead assertion.

Function: $expect p msg-string ¶: {parser.peg} Returns a parser that calls a parser p, and if it succeeds yields its result. If p fails, fails with an error message that says expecting msg-string. Useful to produce user-friendly error messages.

Function: $cut p ¶

{parser.peg} If p fails, make the failure non-recoverable. It prevents the upstream $or and $try from backtracking and trying other choices, and makes the entire parsing fail immediately.

12.40.7 Sequencing combinators

Function: $seq p1 p2 … ¶
Function: $seq0 p1 p2 … ¶: {parser.peg} Returns a parser that atches p1, p2, … sequentially. When all the parser succeeds, $seq returns the last result, while $seq0 returns the first result. Fails immediately when one of the parsers fails.

Function: $between p1 p2 p3 ¶: {parser.peg} Returns a parser that matches p1, p2 and p3 sequentially, and returns the result of p2.

Function: $list p … ¶

Function: $list* p … ¶

{parser.peg} Returns a parser that matches p …, and returns the list of the results. $list* uses the last parser’s result as the last cdr.

They are the same as ($lift list p …) and ($lift list* p …), but we encounter this pattern frequent enough to have these.

Function: $bind p f ¶

{parser.peg} The basic block of parser combinators; p argument is a parser, and f is a procedure that takes a Scheme value and returns a parser.

Returns a parser that first applies p on the input, and if it succeeds, calls f with the result of p, and applies the returned parser on the subsequent input.

This combinator, along with $return and $fail, composes a MonadFail interface as in Haskell. Theoretically any combinators can be built on top of these three. In practice, however, it is not always easy to build things directly on top of $bind, and more high-level forms such as $let, $let* and $lift are frequently used.

Macro: $let (binding …) body … ¶

Macro: $let* (binding …) body … ¶

{parser.peg} Monadic binding form. Each binding can be one of the following forms:

(var parser): Run the parser, and if it succeeds, bind its result to a variable var. If it fails, the entire $let or $let* immdiately fails.
(parser): The variable is omitted. The parser is run, and if it succeeds, its result is discarded and the next binding or body is evaluated. If it fails, the entire $let or $let* immdiately fails.
parser: Same as above. This form can only be used if parser is just a vairable reference.

Once all the parsers in binding … succeeds, body … are evaluated in the environment where var in bindings are bound to the parser results. The last expression of body must return a parser.

Unlike let, the parsers in binding … are always applied to the input sequentially. The difference of $let and $let* is the scope. With $let*, the variables bound in earlier binding can be used to construct the parser later.

This means $let can evaluate all the parsers beforehand, while $let* may need to construct parsers at the time of processing input. Creating a parser involves closure allocations, so you want to use $let whenever possible.

Note: $let* is similar to Haskell’s do construct. We chose the name $let and $let*, for it is easier to see it’s a binding form, and also Scheme already uses do for loop construct.

Function: $lift f p … ¶

Function: $lift* f p … ¶

{parser.peg} Lifts a procedure f onto the parsers’ world.

In a pseudo type declaration, lift’s type can be understood as follows:

lift :: (a b ... -> z) (Parser a) (Parser b) ... -> (Parser z)

That is, lift creates a parser such that it first applies parsers on the input, and if all of them succeeds, it calls f with the parsers’ results as arguments, and the return value of f becomes the whole parser’s result.

In other words, the following equivalence holds:

($lift f p0 p1 …)
 ≡ ($let ([r0 p0] [r1 p1] …) ($return (f r0 r1 …)))

It is sometimes simpler to use $lift instead of $let. For example, the following code creates a parser that matches input with p0 p1 … sequentially, then yields the list of the parser results:

($lift list p0 p1 …)

Note that after all the parsers succeed, the whole parser is destined to succeed—the procedure f can’t make the parser fail. If you need to fail after all the parsers succeeds, use $let or $let.

Macro: $binding parser-bind-form … [(=> fail)] expr ¶

Macro: $lbinding parser-bind-form … [(=> fail)] expr ¶

{parser.peg} Each parser-bind-form may be a parser-yielding expression, except that you can insert a form ($: var parser-expression) anywherer in it, where parser-expression is an expression that yields a parser. The $: form is equivalent to just the parser-expression, except that its semantic value is bound to a variable var.

Each parser created by parser-bind-form is applied to the input in sequence. One of parser-bind-form fails, the entire parser immediately fails. If all of parser-bind-form succeeds, expr is evaluated in the environment where all the vars are bound to the corresponding parser expression.

Since $binding walks entire parser-bind-form to look for $: forms, you can’t have nested $binding form inside parser-bind-form.

If the parser expression associated with var fails, or never executed, the var is bound to #<undef>. If the parser expression succeeds multiple times, var holds the last value. Also, var can appear more than one places; it holds the last bound value.

The value of expr form becomes the semantic value of the entire parser.

$lbinding is a shorthand of ($lazy ($binding ...)).

The optional (=> fail) form before expr is similar to the one with $match. If given, fail, which must be an identifier, is bound to a procedure that returns failure. You should call it in the tail position of expr to indicate failure. It can be

(fail message) : Returns fail-message type failure, with a string message as the message.
(fail tag message) : Returns fail-error type (non-recoverable) failure, where tag must be a symobl error (in future, differtent tags will be supported).

Function: $fold-parsers proc seed ps ¶

{parser.peg} Ps is a list of parsers. Apply those parsers sequentially on the input, passing around the seed value. That is, if we let v0, v1 … vn be the result of each parsers in ps, it returns (proc vn (... (proc v2 (proc v1 seed))...)).

If any of the parser in ps fails, $fold-parsers fails at that point.

Conceptually, it can be written as follows:

(define ($fold-parsers proc seed ps)
  (if (null? ps)
    ($return seed)
    ($let ([v (car ps)])
      ($fold-parsers proc (proc v seed) (cdr ps)))))

But we use more efficient implementation.

Function: $fold-parsers-right proc seed ps ¶

{parser.peg} Similar to $fold-parsers, but the folding by proc right to left. That is, if we let v0, v1 … vn be the result of each parsers in ps, it returns (proc v1 (proc v2 (... (proc vn seed)...))).

If any of the parser in ps fails, $fold-parsers-right fails at that point.

12.40.8 Repetition combinators

Function: $many p :optional min max ¶

Function: $many_ p :optional min max ¶

{parser.peg} Without optional arguments, returns a parser that accepts zero or more repetition of p. On success, $many yields a list of mached results, while $many_ doesn’t keep the results (and faster).

Optinoal min and max must be nonnegative integers and limit the number of occurrences of p. The numbers are inclusive. For example, ($many ($. #\a) 3) accepts three or more #\a’s, and ($many ($. #\a) 2 4) accepts aa, aaa and aaaa.

Note that $many may fail if the input partially matches p.

(peg-parse-string ($many ($seq ($. #\a) ($. #\b))) "ababcd")
  ⇒ (#\b #\b)

(peg-parse-string ($many ($seq ($. #\a) ($. #\b))) "ababac")
  ⇒ *** PARSE-ERROR: expecting #\b at 5, but got #\c

If you want to stop $many at the first two occurrences in the latter case, use $try:

(peg-parse-string ($many ($try ($seq ($. #\a) ($. #\b)))) "ababac")
  ⇒ (#\b #\b)

Function: $many1 p :optional max ¶

Function: $many1_ p :optional max ¶

{parser.peg} Returns a parser that accepts one or more occurences of p. On success, $many1 yields a list of results of p, while $many_ discards the results of p and faster. If max is given, it specifies the maximum number of matches.

Same as ($many p 1 max) and ($many_ p 1 max). Provided as a common pattern.

Function: $repeat p n ¶

Function: $repeat_ p n ¶

{parser.peg} Returns a parser that accepts exaclty n occurences of p. On success, $repeat yields a list of results of p, while $repeat_ discards the results of p and faster.

Same as ($many p n n) and ($many_ p n n). Provided as a common pattern.

Function: $many-till p pe :optional min max ¶

Function: $many-till_ p pe :optional min max ¶

{parser.peg} Returns a parser that accepts repetition of p, until it sees input that accepts pe. On success, $many-till yields a list of results of p, while $many-till_ discards the results of p and faster.

(define comment ($seq ($.";") ($many-till ($any) ($."\n"))))

Function: $sep-by p psep :optional min max ¶

Function: $end-by p psep :optional min max ¶

Function: $sep-end-by p psep :optional min max ¶

{parser.peg} These combinators match repetition of p separated by psep, such as comma-separated values. Returns the list of results of p. Optional min and max are integers that limits the number of repetitions.

These three differ only on treatment of the last separator; $sep-by accepts strictly infix syntax, that is, the input must not end with the separator; while $end-by accepts strictly suffix syntax, that is, the input must end with the separator; $sep-end-by makes the last separator optional.

Function: $chain-left p op ¶

Function: $chain-right p op ¶

{parser.peg} Returns a parser that parsers left-assosiative and right-associative operators, respectively.

The term of expression is parsed by a parser p, and the operator is parsed by op.

12.40.9 Miscellaneous combinators

Macro: $parameterize ((param expr) …) parser … ¶

{parser.peg} Returns a parser that runs parser …, while altering the parameter values of param … with the reuslt of expr …, like parameterize. The parser … are run as if in $seq, so only the value of them is returned on success.

You can’t use ordinary parameterize, since such parameterization takes effect on parser construction time, and not when the parser parsing the input.

Function: $debug name p ¶

{parser.peg} Parses the same input as p, but reports when it is parsing the input, and the result, just like debug-print.

You can’t use debug-print directly, for it will take effect on the parser construction time, not when the input is parsed.

Macro: $lazy p ¶: {parser.peg} Returns a parser that works the same as p, but delays evaluation of p until needed. It is useful when you define mutually recursive parsers.

• PEGひとめぐり:
• PEGパーザドライバ:
• PEGパーザの正体:
• 基本的なパーザビルダー:
• PEG ropes:
• PEG choice:
• PEG sequencing combinators:
• PEG repetition combinators:
• PEG miscellaneous combinators:
• PEG performance tips:

12.40 parser.peg - PEGパーザコンビネータ