Gauche:Regexpの中間表現

Gauche では正規表現を一旦 S 式による中間表現に変換してから正規表現コンパイラに渡す。こうすることにより、正規表現の最適化部分をリスト処理で書いたり、SRE のような別の構文をサポートすることが容易になっている。

中間表現の構文は以下の通り（GHG:regexp.c）。

<ast> : (<element> ...)

<element> : <clause>   ; special clause
       | <item>        ; matches <item>

<item> : <char>       ; matches char
       | <char-set>   ; matches char set
       | (comp . <char-set>) ; matches complement of char set
       | any          ; matches any char
       | bol | eol    ; beginning/end of line assertion
       | wb | nwb     ; word-boundary/negative word boundary assertion

<clause> : (seq . <ast>)      ; sequence
         | (seq-uncase . <ast>) ; sequence (case insensitive match)
         | (seq-case . <ast>) ; sequence (case sensitive match)
         | (alt . <ast>)      ; alternative
         | (rep . <ast>)      ; 0 or more repetition of <ast> (greedy)
         | (rep-min . <ast>)  ; 0 or more repetition of <ast> (lazy)
         | (rep-bound <n> . <ast>) ; repetition up to <n> (greedy)
         | (rep-bound-min <n> . <ast>) ; repetition up to <n> (lazy)
         | (rep-while . <ast>) ; like rep, but no backtrack
         | (<integer> . <ast>) ; capturing group 
         | (assert . <ast>)   ; positive look-ahead assertion
         | (nassert . <ast>)  ; negative look-ahead assertion

手続き regexp-parse で文字列をパースすると中間表現が得られる。

gosh> (regexp-parse "abc*(def)+[xyz]*")
(0 #\a #\b (rep #\c) (seq #0=(1 #\d #\e #\f) (rep #0#)) (rep #[x-z]))

この中間表現は regexp-optimize で最適化し、 regexp-compile により regexp オブジェクトに変換することができる。

gosh> (regexp-optimize '(0 #\a #\b (rep #\c) (seq #0=(1 #\d #\e #\f) (rep #0#)) (rep #[x-z])))
(0 #\a #\b (rep-while #\c) #0=(1 #\d #\e #\f) (rep #0#) (rep-while #[x-z]))
gosh> (regexp-compile '(0 #\a #\b (rep-while #\c) #0=(1 #\d #\e #\f) (rep #0#) (rep-while #[x-z])))
#<regexp 0x98a9b90>

Gauche:Regexpの中間表現

関連