
6.11 Character Set

Builtin Class: <char-set>

Character set class. A character set object represents a set of characters. Gauche provides built-in support for creating character sets, and a predicate that tests whether a character is in a set.

The class implements the collection protocol (See section gauche.collection - Collection framework), so that the standard collection methods provided in the gauche.collection module can be used.

An instance of <char-set> is applicable to a character, and works as a membership predicate; see char-set-contains? below.

Further operations, such as set algebra, are defined in the SRFI-14 module (See section srfi-14 - Character-set library).
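As an illustration of the collection protocol mentioned above, the following sketch applies a few generic functions from gauche.collection to a literal character set. The variable names n, q and len are chosen only for this example.

```scheme
(use gauche.collection)

;; size-of counts the elements of any collection, including a char-set.
(define n (size-of #[a-c]))                      ; n is 3

;; find returns an element satisfying the predicate, or #f if there is none.
(define q (find char-upper-case? #[a-cQ]))       ; q is #\Q

;; map over a collection returns a list with one result per element.
(define len (length (map char-upcase #[a-c])))   ; len is 3
```

The example avoids relying on the order in which a character set is traversed.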

Reader Syntax: #[char-set-spec]

You can write a literal character set in this syntax. char-set-spec is a sequence of characters to be included in the set. You can include the following special sequences:

x-y

Characters between x and y, inclusive. x must be smaller than y in the internal encoding.

^

If char-set-spec begins with a caret, the resulting character set is the complement of what the rest of char-set-spec indicates.

\xN;

A character whose Unicode codepoint is a hexadecimal number N.

\uXXXX
\UXXXXXXXX

These are legacy Gauche syntaxes for a Unicode character whose codepoint is given by a 4-digit or an 8-digit hexadecimal number, respectively.

\s

Whitespace characters.

\S

Complement of whitespace characters.

\d

Decimal digit characters.

\D

Complement of decimal digit characters.

\w

Word constituent characters. Currently, these are the alphanumeric characters and the underscore.

\W

Complement of word constituent characters.

\\

A backslash character.

\-

A minus character.

\^

A caret character.

[:alnum:] …

Character set a la POSIX. The following character set names are recognized: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper and xdigit.

 
#[aeiou]       ; a character set consisting of vowels
#[a-zA-Z]      ; alphabet
#[[:alpha:]]   ; alphabet (using POSIX notation)
#[\\\-]        ; backslash and minus
#[]            ; empty charset
#[\x0d;\x0a;\x3000;] ; carriage return, newline, and ideographic space
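The class escapes, the leading caret, and the POSIX names can be mixed within one literal. The following sketch checks a few of the sequences described above with char-set-contains?; the expected values follow from the descriptions in this section.

```scheme
;; Class escapes and their complements.
(char-set-contains? #[\d] #\5)                    ; ⇒ #t
(char-set-contains? #[\D] #\5)                    ; ⇒ #f
(char-set-contains? #[\w] #\_)                    ; ⇒ #t
(char-set-contains? #[\s] #\space)                ; ⇒ #t

;; A leading caret complements the whole specification.
(char-set-contains? #[^\s] #\space)               ; ⇒ #f

;; Ranges, class escapes and POSIX names can be combined.
(char-set-contains? #[a-z\d] #\7)                 ; ⇒ #t
(char-set-contains? #[[:upper:][:digit:]] #\q)    ; ⇒ #f

;; Escaped caret and minus are ordinary members.
(char-set-size #[\^\-])                           ; ⇒ 2
```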

Compatibility note: We used to recognize the syntax \xNN (a two-digit hexadecimal number, without a terminating semicolon) as a character; for example, #[\x0d\x0a] denoted a return and a newline. For compatibility, this form is still supported when no terminating semicolon follows. Some cases are ambiguous: #[\x0a;] means only a newline in the current syntax, but a newline and a semicolon in the legacy syntax.

Setting the reader mode to legacy restores the old behavior. Setting the reader mode to warn-legacy keeps the default behavior, but prints a warning when legacy syntax is found. See section Reader lexical mode, for the details.

To write code that works in both the new and the old syntax, use the \u escape.
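A minimal sketch of the portable form: since \uXXXX always takes exactly four hexadecimal digits, no terminator is involved, and the literal should read the same way regardless of the reader mode. The name crlf is used only for this example.

```scheme
;; Return and newline, written so that no semicolon terminator is needed.
(define crlf #[\u000d\u000a])

(char-set-size crlf)                  ; ⇒ 2
(char-set-contains? crlf #\return)    ; ⇒ #t
(char-set-contains? crlf #\newline)   ; ⇒ #t
(char-set-contains? crlf #\;)         ; ⇒ #f
```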

Function: char-set? obj

[SRFI-14] Returns true if and only if obj is a character set object.
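For illustration, a few calls contrasting character sets with other objects that merely hold characters:

```scheme
(char-set? #[a-z])            ; ⇒ #t
(char-set? (char-set #\a))    ; ⇒ #t
(char-set? "a-z")             ; ⇒ #f  (a string is not a character set)
(char-set? #\a)               ; ⇒ #f  (nor is a single character)
(char-set? '(#\a #\b))        ; ⇒ #f  (nor is a list of characters)
```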

Function: char-set-contains? char-set char

[SRFI-14] Returns true if and only if the character set char-set contains the character char.

 
(char-set-contains? #[a-z] #\y) ⇒ #t
(char-set-contains? #[a-z] #\3) ⇒ #f

(char-set-contains? #[^ABC] #\A) ⇒ #f
(char-set-contains? #[^ABC] #\D) ⇒ #t

Generic application: char-set char

A char-set object can be applied to a character, and it works just like (char-set-contains? char-set char).

 
(#[a-z] #\a) ⇒ #t
(#[a-z] #\A) ⇒ #f

(use gauche.collection)
(filter #[a-z] "CharSet") ⇒ (#\h #\a #\r #\e #\t)

Function: char-set char …

[SRFI-14] Creates a character set that contains char ….

 
(char-set #\a #\b #\c)   ⇒ #[a-c]

Function: char-set-size char-set

[SRFI-14] Returns the number of characters in the given character set.

 
gosh> (char-set-size #[])
0
gosh> (char-set-size #[[:alnum:]])
62

Function: char-set-copy char-set

[SRFI-14] Copies a character set char-set.
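A short sketch of what copying gives you: a distinct object with the same members. The names original and duplicate are chosen only for this example.

```scheme
(define original  (char-set #\a #\b #\c))
(define duplicate (char-set-copy original))

(eq? original duplicate)              ; ⇒ #f  (a separate object)
(char-set-size duplicate)             ; ⇒ 3
(char-set-contains? duplicate #\b)    ; ⇒ #t
```

Copying first is useful before passing a set to an operation that may modify its argument.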

Function: char-set-complement char-set
Function: char-set-complement! char-set

[SRFI-14] Returns the complement of char-set. The former always returns a new set, while the latter may reuse the given character set.
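The following sketch contrasts the two variants. Because char-set-complement! may or may not reuse its argument, the example always uses its return value and does not inspect the argument afterwards. The names vowels, non-vowels, scratch and result are chosen only for this example.

```scheme
(define vowels     (char-set #\a #\e #\i #\o #\u))
(define non-vowels (char-set-complement vowels))

(char-set-contains? non-vowels #\a)   ; ⇒ #f
(char-set-contains? non-vowels #\b)   ; ⇒ #t
(char-set-contains? vowels #\a)       ; ⇒ #t  (vowels itself is unchanged)

;; The ! variant may reuse its argument, so hand it a private copy
;; and use only the value it returns.
(define scratch (char-set-copy vowels))
(define result  (char-set-complement! scratch))

(char-set-contains? result #\a)       ; ⇒ #f
(char-set-contains? result #\b)       ; ⇒ #t
```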



This document was generated on July 19, 2014 using texi2html 1.82.