Character set class. A character set object represents a set of characters. Gauche provides built-in support for creating character sets and a predicate that tests whether a character is in a set.
The class implements the collection protocol (see Collection framework),
so that the standard collection methods provided in the
gauche.collection module can be used.
An instance of
<char-set> is applicable to a character,
and works as a membership predicate; see
Further operations, such as set algebra, are defined in the SRFI-14 module (see Character-set library).
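As a rough illustration of the three points above, the following sketch uses one character set as a collection, as a membership predicate, and as an operand of SRFI-14 set algebra. The variable names are illustrative only, and the sketch assumes the SRFI-14 module can be loaded under the name srfi-14.

```scheme
(use gauche.collection)   ; generic collection methods such as size-of
(use srfi-14)             ; set algebra such as char-set-intersection

(define vowels #[aeiou])  ; illustrative name

;; Collection protocol: count the members with a generic method.
(define vowel-count (size-of vowels))

;; Applicable object: the set itself acts as a membership predicate.
(define has-e? (vowels #\e))
(define has-z? (vowels #\z))

;; Set algebra from SRFI-14: the vowels that fall within a to m.
(define early-vowels (char-set-intersection #[a-m] vowels))
```

Here early-vowels should compare equal (with char-set=) to #[aei], since o and u lie outside the range a to m.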
You can write a literal character set as #[char-set-spec]. char-set-spec is a sequence of characters to be included in the set. You can also include the following special sequences:
x-y
    Characters between x and y, inclusive. x must be smaller than y in the internal encoding.
^
    If char-set-spec begins with a caret, the resulting character set is the complement of what the rest of char-set-spec indicates.
\xN;
    A character whose Unicode codepoint is the hexadecimal number N.
\uXXXX, \UXXXXXXXX
    Legacy Gauche syntax for a Unicode character whose codepoint is given by a 4-digit or 8-digit hexadecimal number, respectively.
\s
    Whitespace characters.
\S
    Complement of whitespace characters.
\d
    Decimal digit characters.
\D
    Complement of decimal digit characters.
\w
    Word constituent characters. Currently, these are alphanumeric characters and underscore.
\W
    Complement of word constituent characters.
\\
    A backslash character.
\-
    A minus character.
\^
    A caret character.
Character set a la POSIX. The following character set name is
#[aeiou]              ; a character set consisting of vowels
#[a-zA-Z]             ; alphabet
#[[:alpha:]]          ; alphabet (using POSIX notation)
#[\\\-]               ; backslash and minus
#[]                   ; empty charset
#[\x0d;\x0a;\x3000;]  ; carriage return, newline, and ideographic space
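The special sequences can also be checked one at a time with char-set-contains?. The sketch below collects a few such checks in a list named checks (an illustrative name); under the default reader mode, every element is expected to be true.

```scheme
;; Each element tests one piece of the literal syntax.
(define checks
  (list
   (char-set-contains? #[a-f0-9] #\c)      ; ranges x-y
   (not (char-set-contains? #[^a-z] #\q))  ; leading caret complements the set
   (char-set-contains? #[\x41;] #\A)       ; \xN; escape, U+0041 is A
   (char-set-contains? #[\d] #\7)          ; decimal digits
   (char-set-contains? #[\w] #\_)          ; word constituents include underscore
   (char-set-contains? #[\\\-] #\-)))      ; escaped backslash and minus
```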
Literal character sets are immutable, like other literal data. An error is signalled when you attempt to modify an immutable character set.
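Because a literal cannot be changed in place, one option is to derive a new set with a non-destructive SRFI-14 procedure instead. A minimal sketch, assuming the SRFI-14 module is loadable as srfi-14; base and extended are illustrative names:

```scheme
(use srfi-14)

(define base #[a-c])                          ; a literal, hence immutable

;; char-set-adjoin does not modify its argument; it returns a new set.
(define extended (char-set-adjoin base #\z))

(define base-has-z?     (char-set-contains? base #\z))      ; base is untouched
(define extended-has-z? (char-set-contains? extended #\z))
```

The literal stays usable anywhere else it is referenced, which is the point of treating literal data as read-only.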
Compatibility note:
We used to recognize the syntax \xNN (a two-digit hexadecimal number, without a semicolon terminator) as a character; for example, #[\x0d\x0a] as a return and a newline. For compatibility, we still support it when there is no terminating semicolon.
There are ambiguous cases: #[\x0a;] means only a newline in the current syntax, but a newline and a semicolon in the legacy syntax.
Setting the reader mode to legacy restores the old behavior.
Setting the reader mode to warn-legacy makes it work like the default behavior, but prints a warning when it finds legacy syntax.
See Reader lexical mode for the details.
To write code that can work both in new and old syntax, use
[SRFI-14] Returns true if and only if obj is a character set object.
Returns #t if char-set is an immutable char-set,
and #f if it is a mutable char-set.
[SRFI-14] Returns true if and only if a character set object char-set contains a character char.
(char-set-contains? #[a-z] #\y) ⇒ #t
(char-set-contains? #[a-z] #\3) ⇒ #f
(char-set-contains? #[^ABC] #\A) ⇒ #f
(char-set-contains? #[^ABC] #\D) ⇒ #t
A char-set object can be applied to a character, and it
works just like
(char-set-contains? char-set char).
(#[a-z] #\a) ⇒ #t
(#[a-z] #\A) ⇒ #f

(use gauche.collection)
(filter #[a-z] "CharSet") ⇒ (#\h #\a #\r #\e #\t)
[SRFI-14] Creates a character set that contains char ….
(char-set #\a #\b #\c) ⇒ #[a-c]
[SRFI-14] Returns the number of characters in the given charset.
gosh> (char-set-size #[])
0
gosh> (char-set-size #[[:alnum:]])
62
[SRFI-14] Copies a character set char-set.
[SRFI-14] Returns the complement set of char-set. char-set-complement always returns a new set, while char-set-complement! may reuse the given charset.
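The difference between the two variants can be sketched as follows, using the standard SRFI-14 names char-set-complement and char-set-complement!. The variable names are illustrative, and the sketch assumes the SRFI-14 module is loadable as srfi-14 (harmless if the procedures are already built in).

```scheme
(use srfi-14)

(define lower #[a-z])

;; Always returns a fresh set; lower itself is left as it was.
(define not-lower (char-set-complement lower))

;; The ! variant may reuse its argument, so pass it a freshly created set
;; that nothing else needs, and rely only on the return value.
(define not-012 (char-set-complement! (char-set #\0 #\1 #\2)))
```

With these definitions, not-lower contains #\A but not #\a, and not-012 contains #\3 but not #\0.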