For Development HEAD DRAFT

6.10 Character Sets

Builtin Class: <char-set>

Character set class. A character set object represents a set of characters. Gauche provides built-in support for creating character sets, and a predicate that tests whether a character is in a set.

The class implements the collection protocol (see gauche.collection - Collection framework), so that the standard collection methods provided in the gauche.collection module can be used.

An instance of <char-set> is applicable to a character, and works as a membership predicate; see char-set-contains? below.

Further operations, such as set algebra, are defined in the SRFI-14 module (see scheme.charset - R7RS character sets).
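
As a small illustration of the collection protocol mentioned above, the following sketch passes a character set to two generic methods of gauche.collection. It assumes that size-of and coerce-to accept a <char-set> like any other collection; the order of the elements in the resulting list is not shown because it is not specified here.

```scheme
(use gauche.collection)

;; <char-set> is a collection, so generic collection methods accept it.
(size-of #[a-c])             ; number of characters in the set
(coerce-to <list> #[a-c])    ; the members of the set, as a list
```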


6.10.1 Character set literals

Reader Syntax: #[char-set-spec]

You can write a literal character set in this syntax. char-set-spec is a sequence of characters to be included in the set. You can include the following special sequences:

x-y

Characters between x and y, inclusive. x must have a smaller Unicode codepoint than y.

^

If char-set-spec begins with a caret, the actual character set is the complement of what the rest of char-set-spec indicates.

\xN;

A character whose Unicode codepoint is a hexadecimal number N.

\uXXXX
\UXXXXXXXX

This is a legacy Gauche syntax for a character whose Unicode codepoint is given by a 4-digit or an 8-digit hexadecimal number, respectively.

\s

Whitespace characters (space, newline, tab, form feed, vertical tab, carriage return). Members of char-set:ascii-whitespace.

\S

Complement of whitespace characters.

\d

Decimal digit characters. Members of char-set:ascii-digit.

\D

Complement of decimal digit characters.

\w

Word constituent characters (#[A-Za-z0-9_]). Members of char-set:ascii-word.

\W

Complement of word constituent characters.

\\

A backslash character.

\-

A minus character.

\^

A caret character.

[:alnum:] …

Character set a la POSIX. See the table below for the complete list of recognized character set names. The set name must be all lowercase. This notation only includes characters in the ASCII range.

[:^alnum:] …

Complement set of [:alnum:] etc.

[:ALNUM:] …

Gauche’s extension of the POSIX-style character set; the name must be all uppercase, and the set covers the full Unicode range. See the table below for the recognized names.

[:^ALNUM:] …

Complement set of [:ALNUM:] etc.

Here’s the list of POSIX-style character class names:

:alpha:   ASCII alphabets. char-set:ascii-letter, #[A-Za-z].
:alnum:   ASCII alphabets and digits. char-set:ascii-letter+digit, #[0-9A-Za-z].
:blank:   ASCII blanks. char-set:ascii-blank, tab and space.
:cntrl:   ASCII control characters. char-set:ascii-control, U+0000 to U+001f and U+007f.
:digit:   ASCII digits. char-set:ascii-digit, #[0-9].
:graph:   ASCII graphic characters. char-set:ascii-graphic.
:lower:   ASCII lower-case alphabets. char-set:ascii-lower-case, #[a-z].
:print:   ASCII printing characters. char-set:ascii-printing.
:punct:   ASCII punctuation characters. char-set:ascii-punctuation.
:space:   ASCII whitespaces. char-set:ascii-whitespace.
:upper:   ASCII upper-case characters. char-set:ascii-upper-case, #[A-Z].
:word:    ASCII word characters (not POSIX). char-set:ascii-word, #[0-9A-Za-z_].
:xdigit:  Hexadecimal digits. char-set:hex-digit, #[0-9a-fA-F].
:ascii:   ASCII characters (not POSIX). char-set:ascii.
:ALPHA:   Unicode letters (category L*). char-set:letter.
:ALNUM:   Unicode letters and digits. char-set:letter+digit.
:BLANK:   Unicode blanks (tab and category Zs). char-set:blank.
:CNTRL:   Unicode control characters (category Cc). char-set:iso-control.
:DIGIT:   Unicode digits (category Nd). char-set:digit.
:GRAPH:   Unicode graphic characters (letter, digits, punctuation, symbol, and category Nl and No). char-set:graphic.
:LOWER:   Unicode lower-case letters (category Ll). char-set:lower-case.
:PRINT:   Unicode printing characters (graphic and whitespace). char-set:printing.
:PUNCT:   Unicode punctuation characters (category P*). char-set:punctuation.
:SPACE:   Unicode whitespaces (tab, LF, vertical tab, FF, CR, and category Z*). char-set:whitespace.
:TITLE:   Unicode titlecase letters (category Lt). char-set:title-case.
:UPPER:   Unicode upper-case letters (category Lu). char-set:upper-case.
:WORD:    Unicode word characters. char-set:word.
:XDIGIT:  Hexadecimal digits (same as :xdigit:).

Here are some examples:

#[aeiou]       ; a character set consisting of vowels
#[a-zA-Z]      ; alphabet
#[[:alpha:]]   ; alphabet (using POSIX notation)
#[\\\-]        ; backslash and minus
#[]            ; empty charset
#[\x0d;\x0a;\x3000;] ; carriage return, newline, and ideographic space
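
The following sketch checks membership against a few literals to show how the escapes, the caret, and the two flavors of POSIX-style names behave. The variable name lam is only for illustration; it holds U+03BB (Greek small letter lambda), built with integer->char so the example does not depend on the source file encoding.

```scheme
;; \d and \s inside a literal; a leading ^ complements the rest.
(char-set-contains? #[\d] #\5)            ; ⇒ #t
(char-set-contains? #[^\s] #\space)       ; ⇒ #f
(char-set-contains? #[[:^digit:]] #\a)    ; ⇒ #t

;; lam (illustrative name) is a letter, but not an ASCII letter.
(define lam (integer->char #x3bb))
(char-set-contains? #[[:alpha:]] lam)     ; ⇒ #f  (ASCII only)
(char-set-contains? #[[:ALPHA:]] lam)     ; ⇒ #t  (full Unicode range)
```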

Literal character sets are immutable, like other literal data. An error is signalled when you attempt to modify an immutable character set.

Compatibility note: We used to recognize \xNN (a two-digit hexadecimal number without the semicolon terminator) as a character; for example, #[\x0d\x0a] denoted a return and a newline. For compatibility, we still accept this form when no terminating semicolon follows. Some cases are ambiguous: #[\x0a;] means only a newline in the current syntax, but a newline and a semicolon in the legacy syntax.

Setting the reader mode to legacy restores the old behavior. Setting the reader mode to warn-legacy makes it work like the default behavior, but prints a warning when it finds legacy syntax. See Reader lexical mode, for the details.

To write code that can work both in new and old syntax, use \u escape.
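
For instance, the return-and-newline set from the note above can be written with \u escapes, which involve no semicolon terminator. The name crlf is only for illustration.

```scheme
;; Return and newline, written with \u escapes (crlf is an illustrative name).
(define crlf #[\u000d\u000a])

(char-set-size crlf)                  ; ⇒ 2
(char-set-contains? crlf #\return)    ; ⇒ #t
(char-set-contains? crlf #\;)         ; ⇒ #f
```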


6.10.2 Predefined character sets

We provide a number of predefined character sets, including the ones defined in the R7RS charset library (see scheme.charset - R7RS character sets). These character sets are immutable.

Variable: char-set:letter

[R7RS charset] Letters (Unicode general category Lu, Ll, Lt, Lm and Lo).

Variable: char-set:lower-case
Variable: char-set:upper-case
Variable: char-set:title-case

[R7RS charset] Lower case, upper case and title case letters (Unicode general category Ll, Lu and Lt, respectively).

Variable: char-set:digit

[R7RS charset] Digit characters (Unicode general category Nd). Note that this contains many more characters than ASCII 0 to 9. If you need #[0-9], use char-set:ascii-digit.
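
A short sketch of the difference, using U+0661 (ARABIC-INDIC DIGIT ONE, general category Nd) as a sample non-ASCII digit. The name arabic-one is only for illustration.

```scheme
;; arabic-one (illustrative name): a digit outside the ASCII range.
(define arabic-one (integer->char #x661))

(char-set-contains? char-set:digit #\7)                ; ⇒ #t
(char-set-contains? char-set:digit arabic-one)         ; ⇒ #t
(char-set-contains? char-set:ascii-digit arabic-one)   ; ⇒ #f
```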

Variable: char-set:hex-digit

[R7RS charset] Digit characters used for hexadecimal, i.e. #[0-9A-Fa-f]. This does not contain other Unicode digit characters, for it isn’t practical to mix non-ascii digit characters with hexadecimal notation.

Variable: char-set:letter+digit

[R7RS charset] Union of char-set:letter and char-set:digit.

Variable: char-set:graphic

[R7RS charset] Characters that have a glyph. The union of letters, numbers, punctuation and symbols.

Variable: char-set:printing

[R7RS charset] Union of char-set:graphic and char-set:whitespace.

Variable: char-set:whitespace
Variable: char-set:blank

[R7RS charset] Whitespace and blank characters; char-set:whitespace includes #\tab, #\newline, #\u000B (vertical tab), #\page, #\return, and all characters in general category Zs, Zl, Zp, while char-set:blank includes #\tab and all characters in general category Zs. Note that char-set:whitespace is the same set of characters that the Scheme reader treats as whitespace characters.
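
The sketch below contrasts the two sets on a line terminator and on U+3000 (ideographic space, general category Zs). The name ideographic-space is only for illustration.

```scheme
;; Newline is whitespace but not a blank; tab is both.
(char-set-contains? char-set:whitespace #\newline)   ; ⇒ #t
(char-set-contains? char-set:blank #\newline)        ; ⇒ #f
(char-set-contains? char-set:blank #\tab)            ; ⇒ #t

;; A Zs character (illustrative name) belongs to both sets.
(define ideographic-space (integer->char #x3000))
(char-set-contains? char-set:whitespace ideographic-space)   ; ⇒ #t
(char-set-contains? char-set:blank ideographic-space)        ; ⇒ #t
```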

Variable: char-set:iso-control

[R7RS charset] Control characters (Unicode general category Cc).

Variable: char-set:punctuation

[R7RS charset] Punctuation characters (Unicode general category Pc, Pd, Ps, Pe, Pi, Pf and Po).

Variable: char-set:symbol

[R7RS charset] Symbol characters (Unicode general category Sm, Sc, Sk and So).

Variable: char-set:ascii

[R7RS charset] Contains all ASCII characters (U+0000 to U+007f).

Variable: char-set:empty

[R7RS charset] An empty character set.

Variable: char-set:full

[R7RS charset] A character set that includes all characters.

Variable: char-set:word

Word constituent characters. In the current version, it is equivalent to char-set:ascii-word (#[0-9A-Za-z_]), but in future versions we may extend this to other Unicode characters. If you mean ASCII-only words, use char-set:ascii-word.

Variable: char-set:ascii-letter
Variable: char-set:ascii-lower-case
Variable: char-set:ascii-upper-case
Variable: char-set:ascii-digit
Variable: char-set:ascii-letter+digit
Variable: char-set:ascii-graphic
Variable: char-set:ascii-printing
Variable: char-set:ascii-whitespace
Variable: char-set:ascii-blank
Variable: char-set:ascii-control
Variable: char-set:ascii-punctuation
Variable: char-set:ascii-symbol
Variable: char-set:ascii-word

Each of these is the intersection of char-set:ascii and the corresponding character set without the ascii- prefix. (char-set:ascii-control corresponds to char-set:iso-control.)

The \d, \s and \w notations in the char-set literal and regexp literal correspond to char-set:ascii-digit, char-set:ascii-whitespace, and char-set:ascii-word, respectively (not the Unicode sets).

The POSIX character class notation, such as [:alpha:] in char-set literal and regexp literal, refers to these ASCII-only charsets.

Note: We don’t have char-set:ascii-title-case and char-set:ascii-hex-digit. There’s no titlecase letter in ASCII range. And char-set:hex-digit is limited to ASCII by definition.
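
A brief sketch of how the ASCII-only sets relate to the literal notations and to their Unicode counterparts. The name e-acute is only for illustration; it holds U+00E9, a lower-case letter outside the ASCII range.

```scheme
;; \w and char-set:ascii-word are both #[0-9A-Za-z_]: 10 + 26 + 26 + 1.
(char-set-size #[\w])                  ; ⇒ 63
(char-set-size char-set:ascii-word)    ; ⇒ 63

;; e-acute (illustrative name) is a letter, but not an ASCII letter.
(define e-acute (integer->char #xe9))
(char-set-contains? char-set:letter e-acute)         ; ⇒ #t
(char-set-contains? char-set:ascii-letter e-acute)   ; ⇒ #f
```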

Variable: char-set:Lu
Variable: char-set:Ll
Variable: char-set:Lt
Variable: char-set:Lm
Variable: char-set:Lo
Variable: char-set:Mn
Variable: char-set:Mc
Variable: char-set:Me
Variable: char-set:Nd
Variable: char-set:Nl
Variable: char-set:No
Variable: char-set:Pc
Variable: char-set:Pd
Variable: char-set:Ps
Variable: char-set:Pe
Variable: char-set:Pi
Variable: char-set:Pf
Variable: char-set:Po
Variable: char-set:Sm
Variable: char-set:Sc
Variable: char-set:Sk
Variable: char-set:So
Variable: char-set:Zs
Variable: char-set:Zl
Variable: char-set:Zp
Variable: char-set:Cc
Variable: char-set:Cf
Variable: char-set:Cs
Variable: char-set:Co
Variable: char-set:Cn

Each character set contains the corresponding Unicode characters with the given general category; e.g. char-set:Lu contains all characters of the general category Lu.

Variable: char-set:L
Variable: char-set:LC
Variable: char-set:M
Variable: char-set:N
Variable: char-set:P
Variable: char-set:S
Variable: char-set:Z
Variable: char-set:C

Each character set contains the Unicode characters whose general category starts with the given letter; e.g. char-set:L is the union of char-set:Lu, char-set:Ll, char-set:Lt, char-set:Lm and char-set:Lo.

char-set:LC is for cased-letters, the union of char-set:Lt, char-set:Ll, char-set:Lu.
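
A few membership checks on ASCII characters illustrate how the per-category sets and the grouped sets relate:

```scheme
;; Two-letter sets hold exactly one general category.
(char-set-contains? char-set:Lu #\A)       ; ⇒ #t
(char-set-contains? char-set:Lu #\a)       ; ⇒ #f  (#\a is Ll)
(char-set-contains? char-set:Nd #\7)       ; ⇒ #t
(char-set-contains? char-set:Zs #\space)   ; ⇒ #t

;; One-letter sets and char-set:LC are unions of those.
(char-set-contains? char-set:L #\a)        ; ⇒ #t
(char-set-contains? char-set:LC #\a)       ; ⇒ #t
(char-set-contains? char-set:LC #\7)       ; ⇒ #f
```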


6.10.3 Character set operations

See also scheme.charset - R7RS character sets for the comprehensive character set operations.

Function: char-set? obj

[R7RS charset] Returns true if and only if obj is a character set object.

Function: char-set-immutable? char-set

Returns #t if char-set is an immutable char-set, #f if it’s a mutable char-set.

Function: char-set-contains? char-set char

[R7RS charset] Returns true if and only if a character set object char-set contains a character char.

(char-set-contains? #[a-z] #\y) ⇒ #t
(char-set-contains? #[a-z] #\3) ⇒ #f

(char-set-contains? #[^ABC] #\A) ⇒ #f
(char-set-contains? #[^ABC] #\D) ⇒ #t

Generic application: char-set char

A char-set object can be applied to a character, and it works just like (char-set-contains? char-set char).

(#[a-z] #\a) ⇒ #t
(#[a-z] #\A) ⇒ #f

(use gauche.collection)
(filter #[a-z] "CharSet") ⇒ (#\h #\a #\r #\e #\t)

Function: char-set char …

[R7RS charset] Creates a character set that contains char ….

(char-set #\a #\b #\c)   ⇒ #[a-c]

Function: char-set-size char-set

[R7RS charset] Returns the number of characters in the given charset.

gosh> (char-set-size #[])
0
gosh> (char-set-size #[[:alnum:]])
62

Function: char-set-copy char-set

[R7RS charset] Copies a character set char-set.
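
One likely use of char-set-copy is to obtain a set you are allowed to modify, starting from an immutable literal or predefined set. The sketch below assumes that the copy is a fresh, mutable set; that behavior is an assumption not stated in this section, so no result is shown for the last expression. The name cs is only for illustration.

```scheme
;; Literal and predefined sets are immutable.
(char-set-immutable? #[a-z])                 ; ⇒ #t
(char-set-immutable? char-set:ascii-digit)   ; ⇒ #t

;; cs (illustrative name) has the same members as the original.
(define cs (char-set-copy #[a-z]))
(char-set-size cs)                           ; ⇒ 26

;; Assumption: the copy is mutable, so this is expected to return #f.
(char-set-immutable? cs)
```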

Function: char-set-complement char-set
Function: char-set-complement! char-set

[R7RS charset] Returns a complement set of char-set. The former always returns a new set, while the latter may reuse the given charset.
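
A minimal sketch of both variants follows. Because char-set-complement! may or may not reuse its argument, the code always uses the return value, and it passes a copy rather than an immutable literal. The names non-lower and non-digit are only for illustration.

```scheme
;; Functional variant: the argument is left untouched.
(define non-lower (char-set-complement #[a-z]))
(char-set-contains? non-lower #\A)    ; ⇒ #t
(char-set-contains? non-lower #\a)    ; ⇒ #f

;; Linear-update variant: hand it a copy and use the result.
(define non-digit (char-set-complement! (char-set-copy #[0-9])))
(char-set-contains? non-digit #\5)    ; ⇒ #f
(char-set-contains? non-digit #\x)    ; ⇒ #t
```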


