Character set class. A character set object represents a set of characters. Gauche provides built-in support for creating character sets and a predicate that tests whether a character is in a set.
The class implements the collection protocol (see gauche.collection - Collection framework), so the standard collection methods provided in the gauche.collection module can be used.
An instance of <char-set> is applicable to a character and works as a membership predicate; see char-set-contains? below.
Further operations, such as set algebra, are defined in the SRFI-14 module (see scheme.charset - R7RS character sets).
• Character set literals
• Predefined character sets
• Character set operations
#[char-set-spec]
You can write a literal character set in this syntax. char-set-spec is a sequence of characters to be included in the set. You can include the following special sequences:
x-y
Characters between x and y, inclusive. The Unicode codepoint of x must be smaller than that of y.
^
If char-set-spec begins with caret, the actual character set is a complement of what the rest of char-set-spec indicates.
\xN;
A character whose Unicode codepoint is a hexadecimal number N.
\uXXXX
\UXXXXXXXX
These are legacy Gauche syntaxes for a character whose Unicode codepoint is given by a 4-digit (\u) or an 8-digit (\U) hexadecimal number, respectively.
\s
Whitespace characters (space, newline, tab, form feed, vertical tab, carriage return). Members of char-set:ascii-whitespace.
\S
Complement of whitespace characters.
\d
Decimal digit characters. Members of char-set:ascii-digit.
\D
Complement of decimal digit characters.
\w
Word constituent characters (#[A-Za-z0-9_]). Members of char-set:ascii-word.
\W
Complement of word constituent characters.
\\
A backslash character.
\-
A minus character.
\^
A caret character.
[:alnum:] …
POSIX-style character set. See the table below for the complete list of recognized character set names. The set name must be all lower case. This notation only includes characters in the ASCII range.
[:^alnum:] …
Complement set of [:alnum:] etc.
[:ALNUM:] …
Gauche’s extension of the POSIX-style character set; the name must be all upper case, and the set covers the full Unicode range. See the table below for the recognized names.
[:^ALNUM:] …
Complement set of [:ALNUM:] etc.
Here’s the list of POSIX-style character class names:
:alpha: | ASCII alphabets. char-set:ascii-letter, #[A-Za-z]. |
:alnum: | ASCII alphabets and digits. char-set:ascii-letter+digit, #[0-9A-Za-z]. |
:blank: | ASCII blanks. char-set:ascii-blank, tab and space. |
:cntrl: | ASCII control characters. char-set:ascii-control, U+0000 to U+001f and U+007f. |
:digit: | ASCII digits. char-set:ascii-digit, #[0-9]. |
:graph: | ASCII graphic characters. char-set:ascii-graphic. |
:lower: | ASCII lower-case alphabets. char-set:ascii-lower-case, #[a-z]. |
:print: | ASCII printing characters. char-set:ascii-printing. |
:punct: | ASCII punctuation characters. char-set:ascii-punctuation. |
:space: | ASCII whitespaces. char-set:ascii-whitespace. |
:upper: | ASCII upper-case characters. char-set:ascii-upper-case, #[A-Z]. |
:word: | ASCII word characters (not POSIX). char-set:ascii-word, #[0-9A-Za-z_]. |
:xdigit: | Hexadecimal digits. char-set:hex-digit, #[0-9a-fA-F]. |
:ascii: | ASCII characters (not POSIX). char-set:ascii. |
:ALPHA: | Unicode letters (category L*). char-set:letter. |
:ALNUM: | Unicode letters and digits. char-set:letter+digit. |
:BLANK: | Unicode blanks (tab and category Zs). char-set:blank. |
:CNTRL: | Unicode control characters (category Cc). char-set:iso-control. |
:DIGIT: | Unicode digits (category Nd). char-set:digit. |
:GRAPH: | Unicode graphic characters (letters, digits, punctuation, symbols, and categories Nl and No). char-set:graphic. |
:LOWER: | Unicode lower-case letters (category Ll). char-set:lower-case. |
:PRINT: | Unicode printing characters (graphic and whitespace). char-set:printing. |
:PUNCT: | Unicode punctuation characters (category P*). char-set:punctuation. |
:SPACE: | Unicode whitespaces (tab, LF, vertical tab, FF, CR, and category Z*). char-set:whitespace. |
:TITLE: | Unicode titlecase letters (category Lt). char-set:title-case. |
:UPPER: | Unicode upper-case letters (category Lu). char-set:upper-case. |
:WORD: | Unicode word characters. char-set:word. |
:XDIGIT: | Hexadecimal digits (same as :xdigit:). |
Here are some examples:
#[aeiou]               ; a character set consisting of vowels
#[a-zA-Z]              ; alphabet
#[[:alpha:]]           ; alphabet (using POSIX notation)
#[\\\-]                ; backslash and minus
#[]                    ; empty charset
#[\x0d;\x0a;\x3000;]   ; carriage return, newline, and ideographic space
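As a further sketch, the special sequences described above can be checked with char-set-contains? (described under the character set operations). The sample characters are chosen only for illustration.

```scheme
;; Range and complement
(char-set-contains? #[a-f] #\c)          ; ⇒ #t
(char-set-contains? #[^a-f] #\c)         ; ⇒ #f

;; \d stands for ASCII decimal digits
(char-set-contains? #[\d] #\7)           ; ⇒ #t

;; POSIX-style class and its complement
(char-set-contains? #[[:upper:]] #\Q)    ; ⇒ #t
(char-set-contains? #[[:^upper:]] #\Q)   ; ⇒ #f

;; An escaped caret is a literal caret, not a complement marker
(char-set-contains? #[\^a] #\^)          ; ⇒ #t
```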
Literal character sets are immutable, like other literal data. An error is signalled when you attempt to modify an immutable character set.
Compatibility note: We used to recognize the syntax \xNN (a two-digit hexadecimal number without a semicolon terminator) as a character; for example, #[\x0d\x0a] as a return and a newline. For compatibility, we still support it when there is no terminating semicolon. There are ambiguous cases: #[\x0a;] means only a newline in the current syntax, but a newline and a semicolon in the legacy syntax. Setting the reader mode to legacy restores the old behavior. Setting the reader mode to warn-legacy makes it work like the default behavior, but prints a warning when it finds legacy syntax. See Reader lexical mode, for the details.
To write code that works in both the new and the old syntax, use the \u escape.
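For instance, a set of carriage return and newline can be written with \u so that it does not depend on how \x is terminated. This is a minimal sketch; the name crlf is made up for the example.

```scheme
;; \uXXXX always takes exactly four hexadecimal digits, so there is no
;; ambiguity about a trailing semicolon.
(define crlf #[\u000d\u000a])

(char-set-size crlf)                  ; ⇒ 2
(char-set-contains? crlf #\return)    ; ⇒ #t
(char-set-contains? crlf #\newline)   ; ⇒ #t
```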
We provide a bunch of predefined character sets, including the ones defined in the R7RS charset library (see scheme.charset - R7RS character sets). Those character sets are immutable.
char-set:letter
[R7RS charset] Letters (Unicode general category Lu, Ll, Lt, Lm and Lo).
char-set:lower-case
char-set:upper-case
char-set:title-case
[R7RS charset] Lower case, upper case and title case letters (Unicode general category Ll, Lu and Lt, respectively).
char-set:digit
[R7RS charset] Digit characters (Unicode general category Nd). Note that this contains many more characters than ASCII 0 to 9. If you need #[0-9], use char-set:ascii-digit.
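The difference between the Unicode digit set and the ASCII one can be seen with a non-ASCII digit. This sketch assumes Gauche is built with its default UTF-8 internal encoding and that the predefined sets are available without loading an extra module; arabic-three is a name made up for the example.

```scheme
;; U+0663 ARABIC-INDIC DIGIT THREE has general category Nd.
(define arabic-three (integer->char #x0663))

(char-set-contains? char-set:digit arabic-three)         ; ⇒ #t
(char-set-contains? char-set:ascii-digit arabic-three)   ; ⇒ #f
(char-set-contains? char-set:ascii-digit #\3)            ; ⇒ #t
```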
char-set:hex-digit
[R7RS charset] Digit characters used for hexadecimal, i.e. #[0-9A-Fa-f]. This does not contain other Unicode digit characters, for it isn’t practical to mix non-ASCII digit characters with hexadecimal notation.
char-set:letter+digit
[R7RS charset] Union of char-set:letter and char-set:digit.
char-set:graphic
[R7RS charset] Characters that have some glyph. Union of letters, numbers, punctuation and symbols.
char-set:printing
[R7RS charset] Union of char-set:graphic and char-set:whitespace.
char-set:whitespace
char-set:blank
[R7RS charset] Whitespace and blank characters; char-set:whitespace includes #\tab, #\newline, #\u000B (vertical tab), #\page, #\return, and all characters in general category Zs, Zl, Zp, while char-set:blank includes #\tab and all characters in general category Zs.
Note that char-set:whitespace is the same set of characters that the Scheme reader treats as whitespace characters.
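A short check of the difference between the two sets, following the definitions above (a sketch, assuming the predefined sets are available without loading an extra module):

```scheme
;; Newline is whitespace but not a blank.
(char-set-contains? char-set:whitespace #\newline)   ; ⇒ #t
(char-set-contains? char-set:blank #\newline)        ; ⇒ #f

;; Tab and space (category Zs) are blanks.
(char-set-contains? char-set:blank #\tab)            ; ⇒ #t
(char-set-contains? char-set:blank #\space)          ; ⇒ #t
```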
char-set:iso-control
[R7RS charset] Control characters (Unicode general category Cc).
char-set:punctuation
[R7RS charset] Punctuation characters (Unicode general category Pc, Pd, Ps, Pe, Pi, Pf and Po).
char-set:symbol
[R7RS charset] Symbol characters (Unicode general category Sm, Sc, Sk and So).
char-set:ascii
[R7RS charset] Contains all ASCII characters (U+0000 to U+007f).
char-set:empty
[R7RS charset] An empty character set.
char-set:full
[R7RS charset] A character set that includes all characters.
char-set:word
Word constituent characters. In the current version, it is equivalent to char-set:ascii-word (#[0-9A-Za-z_]), but in future versions we may extend this to other Unicode characters. If you mean ASCII-only words, use char-set:ascii-word.
The character sets whose names begin with char-set:ascii- are the intersections of char-set:ascii and the corresponding character sets without ascii- (char-set:ascii-control corresponds to char-set:iso-control).
The \d, \s and \w notations in char-set literals and regexp literals correspond to char-set:ascii-digit, char-set:ascii-whitespace, and char-set:ascii-word, respectively (not the Unicode sets).
The POSIX character class notation, such as [:alpha:] in char-set literals and regexp literals, refers to these ASCII-only charsets.
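The following sketch contrasts the ASCII-only notations with Gauche's upper-case Unicode variants, using a Greek letter as the sample. It assumes the default UTF-8 internal encoding; lambda-char is a name made up for the example.

```scheme
;; U+03BB GREEK SMALL LETTER LAMDA has general category Ll.
(define lambda-char (integer->char #x03bb))

(char-set-contains? #[[:alpha:]] lambda-char)   ; ⇒ #f  (ASCII only)
(char-set-contains? #[[:ALPHA:]] lambda-char)   ; ⇒ #t  (full Unicode range)
(char-set-contains? #[\w] lambda-char)          ; ⇒ #f  (\w is char-set:ascii-word)
```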
Note: We don’t have char-set:ascii-title-case and char-set:ascii-hex-digit. There’s no titlecase letter in the ASCII range, and char-set:hex-digit is limited to ASCII by definition.
Each character set contains the Unicode characters with the given general category; e.g. char-set:Lu contains all characters of the general category Lu.
Each character set contains the Unicode characters whose general category starts with the given letter; e.g. char-set:L is the union of char-set:Lu, char-set:Ll, char-set:Lt, char-set:Lm and char-set:Lo.
char-set:LC is for cased letters, the union of char-set:Lt, char-set:Ll and char-set:Lu.
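A few membership checks illustrate how the per-category sets relate to each other. This is a sketch that uses only the set names mentioned above and assumes they are available without loading an extra module.

```scheme
(char-set-contains? char-set:Lu #\A)   ; ⇒ #t  (upper-case letter)
(char-set-contains? char-set:Lu #\a)   ; ⇒ #f
(char-set-contains? char-set:Ll #\a)   ; ⇒ #t  (lower-case letter)

;; char-set:L and char-set:LC are unions, so they contain both.
(char-set-contains? char-set:L  #\a)   ; ⇒ #t
(char-set-contains? char-set:LC #\A)   ; ⇒ #t
```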
See also scheme.charset - R7RS character sets for the comprehensive character set operations.
char-set? obj
[R7RS charset] Returns true if and only if obj is a character set object.
char-set-immutable? char-set
Returns #t if char-set is an immutable char-set, #f if it’s a mutable char-set.
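As a sketch of the immutability predicate, assuming it is named char-set-immutable? (the entry name is not shown in this text): literal and predefined character sets are stated to be immutable in this section. Whether a copy made with char-set-copy is mutable is not stated here, so that call is left without an expected value.

```scheme
(char-set-immutable? #[a-z])                   ; ⇒ #t  (literal)
(char-set-immutable? char-set:letter)          ; ⇒ #t  (predefined)

;; Assumption: a copy is a fresh, mutable set; check on your Gauche version.
(char-set-immutable? (char-set-copy #[a-z]))
```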
char-set-contains? char-set char
[R7RS charset] Returns true if and only if the character set object char-set contains the character char.
(char-set-contains? #[a-z] #\y) ⇒ #t
(char-set-contains? #[a-z] #\3) ⇒ #f
(char-set-contains? #[^ABC] #\A) ⇒ #f
(char-set-contains? #[^ABC] #\D) ⇒ #t
A char-set object can be applied to a character, and it works just like (char-set-contains? char-set char).
(#[a-z] #\a) ⇒ #t
(#[a-z] #\A) ⇒ #f
(use gauche.collection)
(filter #[a-z] "CharSet") ⇒ (#\h #\a #\r #\e #\t)
char-set char …
[R7RS charset] Creates a character set that contains char ….
(char-set #\a #\b #\c) ⇒ #[a-c]
char-set-size char-set
[R7RS charset] Returns the number of characters in the given character set.
gosh> (char-set-size #[])
0
gosh> (char-set-size #[[:alnum:]])
62
char-set-copy char-set
[R7RS charset] Copies a character set char-set.