For Gauche 0.9.5


6.13 Regular expressions

Gauche has a built-in regular expression engine that is mostly a superset of POSIX extended regular expressions, plus some extensions taken from Perl 5 regexps.

A special syntax is provided for literal regular expressions. Regular expression objects are also applicable; that is, a regexp works like a procedure that matches itself against the given string. Combining these two features lets you write some string matching idioms compactly.

(find #/pattern/ list-of-strings)
  ⇒ the first string that matches, or #f
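As a minimal sketch of this idiom, an applicable regexp can be passed wherever a predicate is expected. The list contents and the variable names below are made up for illustration. Note that find returns the matching list element itself, while any returns the value the predicate produced, which here is the match object.

```scheme
(use srfi-1)  ; find, any and filter; harmless if they are already built in

;; Hypothetical sample data.
(define files '("main.scm" "README" "util.scm" "notes.txt"))

;; The regexp is applied to each string in turn, like (rxmatch rx str).
(define first-scm   (find   #/\.scm$/ files))       ; first matching string
(define all-scm     (filter #/\.scm$/ files))       ; every matching string
(define first-match (any    #/(\w+)\.scm$/ files))  ; match object of the first hit

(print first-scm)                          ; prints main.scm
(print all-scm)                            ; prints (main.scm util.scm)
(print (rxmatch-substring first-match 1))  ; prints main
```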


6.13.1 Regular expression syntax

Reader Syntax: #/regexp-spec/
Reader Syntax: #/regexp-spec/i

Denotes a literal regular expression object. When read, it becomes an instance of <regexp>.

If the letter ’i’ is given at the end, the created regexp becomes a case-folding regexp, i.e. it matches case-insensitively.

The advantage of using this syntax over string->regexp is that the regexp is compiled only once. You can use a literal regexp inside a loop without worrying about regexp compilation overhead. If you want to construct a regexp on the fly, however, use string->regexp.
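The two approaches might be contrasted as in the sketch below. The procedure names numeric-strings and starts-with? are hypothetical helpers, not part of Gauche; regexp-quote is used so that characters in the run-time prefix are matched literally.

```scheme
(use srfi-1)  ; filter; harmless if it is already built in

;; Literal regexp: compiled once, so it is cheap to reuse on every element.
(define (numeric-strings strs)
  (filter #/^\d+$/ strs))

;; Run-time construction: prefix is not known until the call, so the
;; regexp is built with string->regexp.  regexp-quote escapes characters
;; that would otherwise be special in a regexp.
(define (starts-with? prefix str)
  (let ((rx (string->regexp (string-append "^" (regexp-quote prefix)))))
    (if (rx str) #t #f)))

(print (numeric-strings '("12" "x1" "345")))  ; prints (12 345)
(print (starts-with? "a.b" "a.b.c"))          ; prints #t
(print (starts-with? "a.b" "axb.c"))          ; prints #f
```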

Gauche’s built-in regexp syntax follows POSIX extended regular expressions, with a few extensions taken from Perl.

Note that the syntax described here is just a surface syntax. Gauche’s regexp compiler works on the abstract syntax tree, and alternative syntaxes such as SRE will be supported in future versions.


re*

Matches zero or more repetitions of re.


re+

Matches one or more repetitions of re.


re?

Matches zero or one occurrence of re.


re{n}
re{n,m}

Bounded repetition. re{n} matches exactly n occurrences of re. re{n,m} matches at least n and at most m occurrences of re, where n <= m. In the latter form, either n or m can be omitted; an omitted n is taken as 0, and an omitted m as infinity.
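A short sketch of bounded repetition, using a made-up input string. The repetitions here are greedy, so each one takes as many occurrences as its upper bound allows.

```scheme
(define s "aaaa")  ; sample input

(print (rxmatch-substring (#/a{2}/ s)))    ; prints aa
(print (rxmatch-substring (#/a{2,3}/ s)))  ; prints aaa
(print (rxmatch-substring (#/a{2,}/ s)))   ; prints aaaa

;; Anchored at both ends, four a's exceed the upper bound of 3.
(print (#/^a{2,3}$/ s))                    ; prints #f
```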


re*?
re+?
re??
re{n,m}?

Same as the repetition constructs above, but these forms use a "non-greedy" or "lazy" match strategy. That is, they try to match the minimum number of occurrences of re first, and retry longer ones only if that fails. In the last form either n or m can be omitted. Compare the following examples:

(rxmatch-substring (#/<.*>/ "<tag1><tag2><tag3>") 0)
  ⇒ "<tag1><tag2><tag3>"

(rxmatch-substring (#/<.*?>/ "<tag1><tag2><tag3>") 0)
  ⇒ "<tag1>"

(re…)

Clustering with capturing. The regular expression enclosed in parentheses works as a single re. In addition, the string that matches re … is saved as a submatch.


(?:re…)

Clustering without capturing. re … works as a single re, but the matched string isn’t saved.


(?<name>re…)

Named capture and clustering. Like (re…), but attaches the name name to the matched substring. You can refer to the matched substring by either its index number or its name.

When the same name appears more than once in a regular expression, it is undefined which matched substring is returned as the submatch of the named capture.
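A brief sketch of named captures, with a made-up input string. It uses the applicable form of the match object, passing either a group name as a symbol or a group index.

```scheme
;; Two named groups; each also has an ordinary index (1 and 2).
(define m (#/(?<year>\d{4})-(?<month>\d\d)/ "released 2014-10"))

(print (m 'year))   ; prints 2014
(print (m 'month))  ; prints 10
(print (m 1))       ; prints 2014, the same group accessed by index
```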


(?i:re…)
(?-i:re…)

Lexical case sensitivity control. (?i:re…) makes re… match case-insensitively, while (?-i:re…) makes re… match case-sensitively.

Perl’s regexp allows several more flags to appear between ’?’ and ’:’. Gauche only supports the above two, for now.
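The following sketch shows both directions of the control. The names mixed, folded and matches? are hypothetical and exist only for this illustration.

```scheme
;; Only the "abc" part ignores case.
(define mixed  #/(?i:abc)def/)
;; The whole regexp ignores case, except the "def" part.
(define folded #/abc(?-i:def)/i)

;; Helper that turns a match result into a plain boolean.
(define (matches? rx str) (if (rx str) #t #f))

(print (matches? mixed  "ABCdef"))  ; prints #t
(print (matches? mixed  "ABCDEF"))  ; prints #f
(print (matches? folded "ABCdef"))  ; prints #t
(print (matches? folded "abcDEF"))  ; prints #f
```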


pattern1|pattern2|…

Alternation. Matches any one of the patterns, where each pattern is re ….


\n

Backreference. n is an integer. Matches the substring captured by the n-th capturing group (counting from 1). When capturing groups are nested, groups are counted by their beginnings. If the n-th capturing group is in a repetition and has matched more than once, the last matched substring is used.


\k<name>

Named backreference. Matches the substring captured by the capturing group with the name name. If the named capturing group is in a repetition and has matched more than once, the last matched substring is used. If there is more than one capturing group with name, matching succeeds if the input matches any one of the substrings captured by those groups.
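Both kinds of backreference can be sketched with a pattern that finds a doubled letter. The input strings and the group name c are arbitrary choices for this illustration.

```scheme
;; Numbered backreference: \1 must repeat whatever group 1 captured.
(define by-number (#/(\w)\1/ "hello"))
;; Named backreference: \k<c> must repeat whatever group c captured.
(define by-name   (#/(?<c>\w)\k<c>/ "book"))

(print (rxmatch-substring by-number))  ; prints ll
(print (rxmatch-substring by-name))    ; prints oo
(print (#/(\w)\1/ "abc"))              ; prints #f, no doubled letter
```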


.

Matches any character (including newline).


[char-set-spec]

Matches any character in the character set specified by char-set-spec. See Character set, for the details of char-set-spec.

\s, \d, \w

Matches a whitespace character (#[[:space:]]), a digit character (#[[:digit:]]), or a word-constituent character (#[[:alpha:][:digit:]_]), respectively.

Can be used both inside and outside of character set.

\S, \D, \W

Matches the complement character set of \s, \d and \w, respectively.

^, $

Beginning-of-string and end-of-string assertions, when they appear at the beginning or the end of the pattern, respectively.

These characters lose their special meaning and match themselves if they appear anywhere other than the beginning of the pattern (for ^) or the end (for $). For the purpose of recognizing those characters, lookahead/lookbehind assertions ((?=...), (?!...), (?<=...), (?<!...)) and atomic clustering ((?>...)) are each treated as if they were a whole pattern. That is, ^ at the beginning of such a group is a beginning-of-string assertion no matter where the group appears in the containing regexp. The same goes for $ at the end of such a group.

\b, \B

Word boundary and non word boundary assertion, respectively. That is, \b matches an empty string between word-constituent character and non-word-constituent character, and \B matches an empty string elsewhere.
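A small sketch of the two assertions on a made-up string. The word "cat" occurs twice, once inside "concat" and once as a separate word.

```scheme
(define s "concat cat")  ; "cat" starts at index 3 and at index 7

;; \bcat\b only accepts the standalone word.
(print (rxmatch-start (#/\bcat\b/ s)))  ; prints 7

;; \Bcat requires that no word boundary precedes "cat",
;; so it picks the occurrence inside "concat".
(print (rxmatch-start (#/\Bcat/ s)))    ; prints 3
```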


\;, \", \#

These are the same as ;, ", and #, respectively, and can be used to avoid confusing Emacs or other syntax-aware editors that are not familiar with Gauche’s extension.


(?=pattern)
(?!pattern)

Positive/negative lookahead assertion. The match succeeds if pattern matches (or does not match) the input string from the current position. This doesn’t move the current position itself, so the regular expression that follows is applied from the same position.

For example, the following expression matches strings that might be a phone number, except the numbers in Japan (i.e. ones that begin with "81").
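A pattern that fits this description might look like the following sketch. The pattern itself, the name phone-rx and the sample numbers are assumptions made for illustration.

```scheme
;; A leading "+", then at least nine digits, but the digits must not
;; start with 81.  The lookahead checks this without consuming input.
(define phone-rx #/\+(?!81)\d{9,}/)

(print (rxmatch-substring (phone-rx "+12025550123")))  ; prints +12025550123
(print (phone-rx "+81312345678"))                      ; prints #f
```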


(?<=pattern)
(?<!pattern)

Positive/negative lookbehind assertion. If the input string immediately before the current input position matches pattern, this pattern succeeds or fails, respectively. Like a lookahead assertion, it doesn’t change the input position.

Internally, this match is tried by reversing pattern and applying it backward over the input character sequence. So you can write any regexp in pattern, but if the submatches depend on the matching order, you may get different submatches from those you would get by matching pattern from left to right.
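The following sketch illustrates both forms with made-up inputs. The text before the current position is tested but is not included in the match.

```scheme
;; Positive lookbehind: a word that immediately follows "@".
(define after-at (#/(?<=@)\w+/ "mail@host"))
;; Negative lookbehind: a word that starts at a word boundary
;; not preceded by "@".
(define not-after-at (#/(?<!@)\b\w+/ "@foo bar"))

(print (rxmatch-substring after-at))      ; prints host
(print (rxmatch-substring not-after-at))  ; prints bar
```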


(?>pattern)

Atomic clustering. Once pattern matches, the match is fixed; even if the following pattern fails, the engine won’t backtrack to try an alternative match in pattern.
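The effect is easiest to see by comparing a pattern with and without the atomic group, as in this sketch with a made-up input.

```scheme
(define s "aaab")

;; Ordinary repetition: a* first takes "aaa", then gives one "a" back
;; so that the trailing "ab" can match.
(print (rxmatch-substring (#/a*ab/ s)))  ; prints aaab

;; Atomic group: once (?>a*) has taken every "a" it will not give any
;; back, so the literal "a" that follows can never match.
(print (#/(?>a*)ab/ s))                  ; prints #f
```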


re*+
re++
re?+

These are the same as (?>re*), (?>re+), (?>re?), respectively.

(?test-pattern then-pattern)
(?test-pattern then-pattern|else-pattern)

Conditional matching. If test-pattern counts as true, then-pattern is tried; otherwise else-pattern is tried, when provided.

test-pattern can be either one of the following:


(integer)

Backreference. If the integer-th capturing group has a match, this test counts as true.


(?=pattern)
(?!pattern)

Positive/negative lookahead assertion. It tries pattern from the current input position without consuming input, and if the match succeeds or fails, respectively, this test counts as true.


(?<=pattern)
(?<!pattern)

Positive/negative lookbehind assertion. It tries pattern backward from the left side of the current input position, and if the match succeeds or fails, respectively, this test counts as true.
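As a sketch of the backreference form of the test, the pattern below accepts a number that is either bare or fully parenthesized. The pattern and the names paren-rx and ok? are assumptions made for illustration.

```scheme
;; An optional "(" is captured as group 1.  The closing ")" is required
;; only when group 1 took part in the match.
(define paren-rx #/^(\()?\d+(?(1)\))$/)

;; Helper that turns a match result into a plain boolean.
(define (ok? str) (if (paren-rx str) #t #f))

(print (ok? "123"))    ; prints #t
(print (ok? "(123)"))  ; prints #t
(print (ok? "(123"))   ; prints #f
(print (ok? "123)"))   ; prints #f
```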


6.13.2 Using regular expressions

Regexp object and rxmatch object

Builtin Class: <regexp>

Regular expression object. You can construct a regexp object from a string with string->regexp at run time. Gauche also has a special syntax to denote regexp literals, which constructs a regexp object at load time.

Gauche’s regexp engine is fully aware of multibyte characters.

Builtin Class: <regmatch>

Regexp match object. The regexp matcher rxmatch returns this object when the match succeeds. This object contains all the information about the match, including submatches.

The advantage of using a match object, rather than substrings or a list of indices, is efficiency. The regmatch object keeps the internal state of the match, and computes indices and/or substrings only when requested. This is particularly effective for multibyte strings, since index access is slow on them.

Function: string->regexp string :key case-fold

Takes string as a regexp specification, and constructs an instance of <regexp> object.

If a true value is given to the keyword argument case-fold, the created regexp object becomes a case-folding regexp. (See the explanation of case-folding regexps above.)
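A minimal sketch of run-time construction with case folding. The name hello-rx and the sample strings are made up for illustration.

```scheme
;; Build a case-folding regexp from a string at run time.
(define hello-rx (string->regexp "hello" :case-fold #t))

(print (regexp? hello-rx))                          ; prints #t
(print (rxmatch-substring (hello-rx "Say HELLO")))  ; prints HELLO

;; Without case folding the same input does not match.
(print ((string->regexp "hello") "Say HELLO"))      ; prints #f
```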

Function: regexp? obj

Returns true iff obj is a regexp object.

Function: regexp->string regexp

Returns a source string describing the regexp regexp. The returned string is immutable.

Function: regexp-num-groups regexp
Function: regexp-named-groups regexp

Queries the number of capturing groups, and an alist of named capturing groups, in the given regexp, respectively.

The number of capturing groups corresponds to the number of matches returned by rxmatch-num-matches. Note that the entire regexp forms a group, so the number is always positive.

The alist returned from regexp-named-groups has the group name (symbol) in car, and its subgroup number in cdr. Note that the order of groups in the alist isn’t fixed.

(regexp-num-groups #/abc(?<foo>def)(ghi(?<bar>jkl)(mno))/)
  ⇒ 5
(regexp-named-groups #/abc(?<foo>def)(ghi(?<bar>jkl)(mno))/)
  ⇒ ((bar . 3) (foo . 1))

Trying a match

Function: rxmatch regexp string

regexp is a regular expression object. The string string is matched against regexp. If it matches, the function returns a <regmatch> object. Otherwise it returns #f.

This is called match, regexp-search or string-match in some other Scheme implementations.

To apply the match repeatedly on the input string, or to match from the input stream (such as the data from the port), you may want to check grxmatch in gauche.generator (see Generator operations).

Generic application: regexp string

A regular expression object can be applied directly to a string. This works the same as (rxmatch regexp string), but allows a shorter notation. See Applicable objects, for the generic mechanism used to implement this.

Accessing the match result

Function: rxmatch-start match :optional (i 0)
Function: rxmatch-end match :optional (i 0)
Function: rxmatch-substring match :optional (i 0)

match is a match object returned by rxmatch. If i equals zero, the functions return the start, the end, or the substring of the entire match, respectively. With a positive integer i, they return those of the i-th submatch. It is an error to pass other values to i.

It is allowed to pass #f to match for convenience. The functions return #f in such case.
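For instance, with an input chosen for illustration, the positions and substrings relate as follows. The end position is exclusive.

```scheme
;; In "pi=3.14..." the whole match "3.14" spans indices 3 to 7.
(define m (rxmatch #/(\d+)\.(\d+)/ "pi=3.14..."))

(print (rxmatch-start m))        ; prints 3
(print (rxmatch-end m))          ; prints 7
(print (rxmatch-start m 2))      ; prints 5
(print (rxmatch-substring m 2))  ; prints 14

;; Passing #f in place of a match object simply yields #f.
(print (rxmatch-substring #f))   ; prints #f
```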

These functions correspond to scsh’s match:start, match:end and match:substring.

Function: rxmatch-after match :optional (i 0)
Function: rxmatch-before match :optional (i 0)

Returns substring of the input string after or before match. If optional argument is given, the i-th submatch is used (0-th submatch is the entire match).

(define match (rxmatch #/(\d+)\.(\d+)/ "pi=3.14..."))

(rxmatch-after match) ⇒ "..."
(rxmatch-after match 1) ⇒ ".14..."

(rxmatch-before match) ⇒ "pi="
(rxmatch-before match 2) ⇒ "pi=3."

Function: rxmatch-substrings match :optional start end
Function: rxmatch-positions match :optional start end

Retrieves multiple submatches (again, the 0-th match is the entire match), as a list of substrings and as a list of conses of start and end positions, respectively.

(rxmatch-substrings (#/(\d+):(\d+):(\d+)/ "12:34:56"))
  ⇒ ("12:34:56" "12" "34" "56")

(rxmatch-positions (#/(\d+):(\d+):(\d+)/ "12:34:56"))
  ⇒ ((0 . 8) (0 . 2) (3 . 5) (6 . 8))

For convenience, you can pass #f to match; these procedures return () in that case.

The optional start and end arguments specify the range of submatch index. If omitted, start defaults to 0 and end defaults to (rxmatch-num-matches match). For example, if you don’t need the whole match, you can give 1 to start as follows:

(rxmatch-substrings (#/(\d+):(\d+):(\d+)/ "12:34:56") 1)
  ⇒ ("12" "34" "56")

Function: rxmatch->string regexp string :optional selector …

A convenience procedure to match a string to the given regexp, then returns the matched substring, or #f if it doesn’t match.

If no selector is given, it is the same as this:

(rxmatch-substring (rxmatch regexp string))

If an integer is given as a selector, it returns the substring of the numbered submatch.

If a symbol after or before is given, it returns the substring after or before the match. You can give these symbols and an integer to extract a substring before or after the numbered submatch.

gosh> (rxmatch->string #/\d+/ "foo314bar")
gosh> (rxmatch->string #/(\w+)@([\w.]+)/ "" 2)
gosh> (rxmatch->string #/(\w+)@([\w.]+)/ "" 'before 2)
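The selectors can be sketched as follows. The address in addr is a hypothetical sample input chosen for this illustration.

```scheme
(define addr "user@example.com")  ; hypothetical sample input
(define addr-rx #/(\w+)@([\w.]+)/)

(print (rxmatch->string #/\d+/ "foo314bar"))   ; prints 314
(print (rxmatch->string addr-rx addr 2))          ; prints example.com
(print (rxmatch->string addr-rx addr 'before 2))  ; prints user@
(print (rxmatch->string addr-rx addr 'after 1))   ; prints @example.com

;; No match at all yields #f.
(print (rxmatch->string #/\d+/ "no digits"))   ; prints #f
```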
Generic application: regmatch :optional index
Generic application: regmatch 'before :optional index
Generic application: regmatch 'after :optional index

A regmatch object can be applied directly to an integer index, or to a symbol before or after. These work the same as (rxmatch-substring regmatch index), (rxmatch-before regmatch), and (rxmatch-after regmatch), respectively. This allows a shorter notation. See Applicable objects, for the generic mechanism used to implement this.

(define match (#/(\d+)\.(\d+)/ "pi=3.14..."))

  (match)           ⇒ "3.14"
  (match 1)         ⇒ "3"
  (match 2)         ⇒ "14"

  (match 'after)    ⇒ "..."
  (match 'after 1)  ⇒ ".14..."

  (match 'before)   ⇒ "pi="
  (match 'before 2) ⇒ "pi=3."

(define match (#/(?<integer>\d+)\.(?<fraction>\d+)/ "pi=3.14..."))

  (match 1)         ⇒ "3"
  (match 2)         ⇒ "14"

  (match 'integer)  ⇒ "3"
  (match 'fraction) ⇒ "14"

  (match 'after 'integer)   ⇒ ".14..."
  (match 'before 'fraction) ⇒ "pi=3."

Function: rxmatch-num-matches match
Function: rxmatch-named-groups match

Returns the number of matches, and an alist of named groups and their indices, in match. These correspond to regexp-num-groups and regexp-named-groups applied to the regular expression that was used to generate match. These procedures are useful for inspecting a match object without having the original regexp object.

The number of matches includes the "whole match", so it is always a positive integer for a <regmatch> object. The number also includes the submatches that don’t have a value (see the examples below). The result of rxmatch-named-groups also includes all the named groups in the original regexp, not only the matched ones.

For convenience, rxmatch-num-matches returns 0 and rxmatch-named-groups returns () if match is #f.

(rxmatch-num-matches (rxmatch #/abc/ "abc")) ⇒ 1
(rxmatch-num-matches (rxmatch #/(a(.))|(b(.))/ "ba")) ⇒ 5
(rxmatch-num-matches #f) ⇒ 0

(rxmatch-named-groups
 (rxmatch #/(?<h>\d\d):(?<m>\d\d)(:(?<s>\d\d))?/ "12:34"))
 ⇒ ((s . 4) (m . 2) (h . 1))

Convenience utilities

Function: regexp-replace regexp string substitution
Function: regexp-replace-all regexp string substitution

Replaces the part of string that matches regexp with substitution. regexp-replace replaces only the first match of regexp, while regexp-replace-all repeats the replacement throughout the entire string.

substitution may be a string or a procedure. If it is a string, it can contain references to the submatches by digits preceded by a backslash (e.g. \2) or named submatch references (e.g. \k<name>). \0 refers to the entire match. Note that you need two backslashes to include a backslash character in a literal string; so if you want to include a backslash character itself in the substitution, you need four backslashes.

(regexp-replace #/def|DEF/ "abcdefghi" "...")
  ⇒ "abc...ghi"
(regexp-replace #/def|DEF/ "abcdefghi" "|\\0|")
  ⇒ "abc|def|ghi"
(regexp-replace #/def|DEF/ "abcdefghi" "|\\\\0|")
  ⇒ "abc|\\0|ghi"
(regexp-replace #/c(.*)g/ "abcdefghi" "|\\1|")
  ⇒ "ab|def|hi"
(regexp-replace #/c(?<match>.*)g/ "abcdefghi" "|\\k<match>|")
  ⇒ "ab|def|hi"

If substitution is a procedure, it is called for every match in string with one argument, a regexp match object. The value returned from the procedure is inserted into the output string using display.

(regexp-replace #/c(.*)g/ "abcdefghi"
                (lambda (m)
                  (list->string
                   (reverse
                    (string->list (rxmatch-substring m 1))))))
 ⇒ "abfedhi"

Note: regexp-replace-all applies itself recursively to the remainder of the string after each match. So a beginning-of-string assertion in regexp doesn’t mean only the beginning of the input string.

Note: If you want to operate on multiple matches in the string instead of replacing them, you can use lrxmatch in the gauche.lazy module or grxmatch in the gauche.generator module. Both match a regexp repeatedly and lazily against the given string; lrxmatch returns a lazy sequence of regmatches, while grxmatch returns a generator that yields regmatches.

(map rxmatch-substring (lrxmatch #/\w+/ "a quick brown fox!?"))
 ⇒ ("a" "quick" "brown" "fox")

Function: regexp-replace* string rx1 sub1 rx2 sub2 …
Function: regexp-replace-all* string rx1 sub1 rx2 sub2 …

First applies regexp-replace or regexp-replace-all to string with a regular expression rx1 substituting for sub1, then applies the function on the result string with a regular expression rx2 substituting for sub2, and so on. These functions are handy when you want to apply multiple substitutions sequentially on a string.
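The sequential behavior can be sketched with a small made-up input. Each substitution sees the result of the previous one, so text produced by an earlier rule can be rewritten by a later rule.

```scheme
;; Replace the first "a" with "b", then the first "b" with "c".
(define first-only (regexp-replace* "abc abc" #/a/ "b" #/b/ "c"))

;; Replace every "a" with "b", then every "b" with "c".
(define everywhere (regexp-replace-all* "abc abc" #/a/ "b" #/b/ "c"))

(print first-only)  ; prints cbc abc
(print everywhere)  ; prints ccc ccc
```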

Function: regexp-quote string

Returns a string with the characters that are special to regexp escaped.

(regexp-quote "[2002/10/12] touched foo.h and *.c")
 ⇒ "\\[2002/10/12\\] touched foo\\.h and \\*\\.c"

In the following macros, match-expr is an expression which produces a match object or #f. Typically it is a call of rxmatch, but it can be any expression.

Macro: rxmatch-let match-expr (var …) form …

Evaluates match-expr, and if it matched, binds var … to the matched strings, then evaluates forms. The first var receives the entire match, and subsequent variables receive submatches. If the number of submatches is smaller than the number of variables to receive them, the rest of the variables get #f.

It is possible to put #f in variable position, which says you don’t care that match.

(rxmatch-let (rxmatch #/(\d+):(\d+):(\d+)/
                      "Jan  1 23:59:58, 2001")
   (time hh mm ss)
  (list time hh mm ss))
 ⇒ ("23:59:58" "23" "59" "58")

(rxmatch-let (rxmatch #/(\d+):(\d+):(\d+)/
                      "Jan  1 23:59:58, 2001")
   (#f hh mm)
  (list hh mm))
 ⇒ ("23" "59")

This macro corresponds to scsh’s let-match.

Macro: rxmatch-if match-expr (var …) then-form else-form

Evaluates match-expr, and if it matched, binds var … to the matched strings and evaluates then-form. Otherwise evaluates else-form. The rule for binding vars is the same as in rxmatch-let.

(rxmatch-if (rxmatch #/(\d+:\d+)/ "Jan 1 11:22:33")
    (time)
  (format #f "time is ~a" time)
  "unknown time")
 ⇒ "time is 11:22"

(rxmatch-if (rxmatch #/(\d+:\d+)/ "Jan 1 11-22-33")
    (time)
  (format #f "time is ~a" time)
  "unknown time")
 ⇒ "unknown time"

This macro corresponds to scsh’s if-match.

Macro: rxmatch-cond clause …

Evaluates the condition of each clause one by one. If the condition of a clause is satisfied, the rest of the clause is evaluated and becomes the result of rxmatch-cond. A clause may be one of the following patterns.

(match-expr (var …) form …)

Evaluates match-expr, which may return a regexp match object or #f. If it returns a match object, the matches are bound to vars, as in rxmatch-let, and forms are evaluated.

(test expr form …)

Evaluates expr. If it yields true, evaluates forms.

(test expr => proc)

Evaluates expr and if it is true, calls proc with the result of expr as the only argument.

(else form …)

If this clause exists, it must be the last clause. If other clauses fail, forms are evaluated.

If no else clause exists, and all the other clauses fail, an undefined value is returned.

;; parses several possible date formats
(define (parse-date str)
  (rxmatch-cond
    ((rxmatch #/^(\d\d?)\/(\d\d?)\/(\d\d\d\d)$/ str)
        (#f mm dd yyyy)
      (map string->number (list yyyy mm dd)))
    ((rxmatch #/^(\d\d\d\d)\/(\d\d?)\/(\d\d?)$/ str)
        (#f yyyy mm dd)
      (map string->number (list yyyy mm dd)))
    ((rxmatch #/^\d+\/\d+\/\d+$/ str)
        (#f)
     (errorf "ambiguous: ~s" str))
    (else (errorf "bogus: ~s" str))))

(parse-date "2001/2/3") ⇒ (2001 2 3)
(parse-date "12/25/1999") ⇒ (1999 12 25)

This macro corresponds to scsh’s match-cond.

Macro: rxmatch-case string-expr clause …

string-expr is evaluated, and clauses are interpreted one by one. A clause may be one of the following patterns.

(re (var …) form …)

re must be a literal regexp object (see Regular expressions). If the result of string-expr matches re, the match result is bound to vars and forms are evaluated, and rxmatch-case returns the result of the last form.

If re doesn’t match the result of string-expr, or string-expr yields a non-string value, the interpretation proceeds to the next clause.

(test proc form …)

A procedure proc is applied on the result of string-expr. If it yields true value, forms are evaluated, and rxmatch-case returns the result of the last form.

If proc yields #f, the interpretation proceeds to the next clause.

(test proc => proc2)

A procedure proc is applied on the result of string-expr. If it yields true value, proc2 is applied on the result, and its result is returned as the result of rxmatch-case.

If proc yields #f, the interpretation proceeds to the next clause.

(else form …)

This form must appear at the end of clauses, if any. If other clauses fail, forms are evaluated, and the result of the last form becomes the result of rxmatch-case.

(else => proc)

This form must appear at the end of clauses, if any. If other clauses fail, proc is evaluated, which should yield a procedure taking one argument. The value of string-expr is passed to proc, and its return values become the return values of rxmatch-case.

If no else clause exists, and all the other clauses fail, an undefined value is returned.

The parse-date example above becomes simpler if you use rxmatch-case:

(define (parse-date2 str)
  (rxmatch-case str
    (test (lambda (s) (not (string? s))) #f)
    (#/^(\d\d?)\/(\d\d?)\/(\d\d\d\d)$/ (#f mm dd yyyy)
     (map string->number (list yyyy mm dd)))
    (#/^(\d\d\d\d)\/(\d\d?)\/(\d\d?)$/ (#f yyyy mm dd)
     (map string->number (list yyyy mm dd)))
    (#/^\d+\/\d+\/\d+$/                (#f)
     (errorf "ambiguous: ~s" str))
    (else (errorf "bogus: ~s" str))))


6.13.3 Inspecting and assembling regular expressions

When Gauche reads a string representation of a regexp, it first parses the string and constructs an abstract syntax tree (AST), performs some optimizations on it, then compiles it into an instruction sequence to be executed by the regexp engine.

The following procedures expose this process to user programs. It may be easier for programs to manipulate an AST than a string representation.

Function: regexp-parse string :key case-fold

Parses the string specification of a regexp in string and returns its AST, represented as an S-expression. See below for the spec of the AST.

When a true value is given to the keyword argument case-fold, returned AST will match case-insensitively. (Case insensitive regexp is handled in parser level, not by the engine).

Function: regexp-optimize ast

Performs some rudimentary optimization on the regexp AST, returning a regexp AST.

Currently it only optimizes some trivial cases. The plan is to make it cleverer in the future.

Function: regexp-compile ast

Takes a regexp AST and returns a regexp object. Currently the outermost form of ast must be the zero-th capturing group. (That is, ast should have the form (0 #f x …).) The outer grouping is always added by regexp-parse to capture the entire regexp.

Note: The function does some basic checks to see whether the given AST is valid, but it may not reject every invalid AST. In such a case, the returned regexp object doesn’t work properly. It is the caller’s responsibility to provide a properly constructed AST. (Even when it rejects an AST, the error messages are often incomprehensible. So, don’t use this procedure as an AST validity checker.)

Function: regexp-ast regexp

Returns the AST used for the regexp object regexp.

Function: regexp-unparse ast :key (on-error :error)

From the regexp’s ast, reconstructs the string representation of the regexp. The keyword argument on-error can be the keyword :error (default) or #f. If it’s the former, an error is signaled when ast isn’t a valid regexp AST. If it’s the latter, regexp-unparse just returns #f.
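The parse, optimize and compile steps might be chained by hand as in the following sketch. The pattern, the variable names and the sample input are made up for illustration; the result should behave like a regexp built from the same string with string->regexp.

```scheme
;; Parse a string spec into an AST (an S-expression).
;; The backslashes are doubled because this is an ordinary string literal.
(define ast (regexp-parse "(\\d+)-(\\d+)"))

;; Optimize the AST, then compile it into a regexp object.
(define compiled (regexp-compile (regexp-optimize ast)))

(print (regexp? compiled))                               ; prints #t
(print (rxmatch-substring (compiled "tel 555-1234") 2))  ; prints 1234
```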

This is the structure of the AST. Note that it was originally developed only for internal use, and is not very convenient to manipulate from code (e.g. if you insert or delete a subtree, you have to renumber the capturing groups to keep them consistent). There’s a plan to provide a better representation, such as SRE, and a tool to convert it to and from this AST. Contributions are welcome.

<ast> : <clause>   ; special clause
      | <item>     ; matches <item>

<item> : <char>       ; matches char
       | <char-set>   ; matches char set
       | (comp . <char-set>) ; matches complement of char set
       | any          ; matches any char
       | bol | eol    ; beginning/end of line assertion
       | wb | nwb     ; word-boundary/negative word boundary assertion

<clause> : (seq <ast> ...)       ; sequence
       | (seq-uncase <ast> ...)  ; sequence (case insensitive match)
       | (seq-case <ast> ...)    ; sequence (case sensitive match)
       | (alt <ast> ...)         ; alternative
       | (rep <m> <n> <ast> ...) ; repetition at least <m> up to <n> (greedy)
                               ; <n> may be `#f'
       | (rep-min <m> <n> <ast> ...)
                               ; repetition at least <m> up to <n> (lazy)
                               ; <n> may be `#f'
       | (rep-while <m> <n> <ast> ...)
                               ; like rep, but no backtrack
       | (<integer> <symbol> <ast> ...)
                               ; capturing group.  <symbol> may be #f.
       | (cpat <condition> <ast> <ast>)
                               ; conditional expression
       | (backref . <integer>) ; backreference
       | (once <ast> ...)      ; standalone pattern.  no backtrack
       | (assert . <asst>)     ; positive lookahead assertion
       | (nassert . <asst>)    ; negative lookahead assertion

<condition> : <integer>     ; (?(1)yes|no) style conditional expression
       | (assert . <asst>)  ; (?(?=condition)...) or (?(?<=condition)...)
       | (nassert . <asst>) ; (?(?!condition)...) or (?(?<!condition)...)

<asst> : <ast> ...
       | ((lookbehind <ast> ...))
