
6.10 Characters

Builtin Class: <char>
Reader Syntax: #\charname

[R7RS] Denotes a literal character.

When the reader reads #\, it fetches the next character. If that character is one of ()[]{}" \|;#, the literal denotes that character itself. Otherwise, the reader keeps reading characters until it sees a non-word-constituent character. If only one character has been read, the literal denotes that character. Otherwise, the reader matches the characters read against the predefined character names, and signals an error if none of them matches.

The following character names are recognized. These character names are case insensitive.

space

Whitespace (ASCII #x20)

newline, nl, lf

Newline (ASCII #x0a)

return, cr

Carriage return (ASCII #x0d)

tab, ht

Horizontal tab (ASCII #x09)

page

Form feed (ASCII #x0c)

escape, esc

Escape (ASCII #x1b)

delete, del

Delete (ASCII #x7f)

null

NUL character (ASCII #x00)

xN

A character whose Unicode codepoint is the integer N, where N is a hexadecimal integer. This is R7RS lexical syntax. (See the compatibility note below.)

uN

A character whose Unicode codepoint is the integer N, where N is a 4-digit or 8-digit hexadecimal number.

This is legacy Gauche lexical syntax. Use the #\xN syntax in new code. (See the compatibility note below.)

 
#\newline ⇒ #\newline ; newline character
#\x0a     ⇒ #\newline ; ditto
#\x41     ⇒ #\A       ; ASCII letter 'A'
#\x3042   ⇒ #\あ      ; Hiragana letter A
#\x2a6b2  ⇒ #\𪚲      ; JISX0213 Kanji 2-94-86

Compatibility note: Before 0.9.4, the #\xNN syntax used Gauche’s internal character encoding rather than the Unicode codepoint. The two are the same if Gauche is compiled with internal encoding utf-8 or none (if it’s none, only characters up to U+00ff are supported, and in this range the characters are the same as Unicode characters). If Gauche is compiled with encoding euc-jp or sjis, the meaning of #\xNN beyond the ASCII range differs from that in 0.9.3.3 or before.

If you set the reader mode to legacy (see section Reader lexical mode), #\xNN is read as before, keeping compatibility with older code (but not with R7RS). Alternatively, you can use #\uNNNN, or the character itself, to make the code work in both new and old versions of Gauche.
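The alternative spellings described above all denote the same character object, which can be checked with char=?. The following is a minimal sketch that assumes the default (non-legacy) reader mode and uses only ASCII characters, so it does not depend on the internal encoding.

```scheme
;; Named, hexadecimal, and legacy #\u forms of the same characters.
(char=? #\newline #\nl #\lf #\x0a)   ; ⇒ #t
(char=? #\tab #\ht #\x09)            ; ⇒ #t
(char=? #\A #\x41 #\u0041)           ; ⇒ #t

;; A delimiter character directly after #\ stands for itself.
(char=? #\( #\x28)                   ; ⇒ #t
```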

Function: char? obj

[R7RS] Returns #t if obj is a character, #f otherwise.

Function: char=? char1 char2 char3 …
Function: char<? char1 char2 char3 …
Function: char<=? char1 char2 char3 …
Function: char>? char1 char2 char3 …
Function: char>=? char1 char2 char3 …

[R7RS] Compares characters. Characters are compared by their codes in the internal character encoding.
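As a brief illustration, the comparison procedures accept two or more arguments and test that the whole sequence is ordered. The examples use only ASCII characters, whose codes are the same in every internal encoding Gauche supports.

```scheme
(char=? #\a #\a #\a)    ; ⇒ #t
(char<? #\a #\b #\c)    ; ⇒ #t
(char<? #\a #\c #\b)    ; ⇒ #f  (#\c is not less than #\b)
(char<? #\A #\a)        ; ⇒ #t  (#x41 < #x61)
(char>=? #\z #\z #\a)   ; ⇒ #t
```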

Function: char-ci=? char1 char2 char3 …
Function: char-ci<? char1 char2 char3 …
Function: char-ci<=? char1 char2 char3 …
Function: char-ci>? char1 char2 char3 …
Function: char-ci>=? char1 char2 char3 …

[R7RS] Compares characters in a case-insensitive way. The comparison is done on the internal character codes of the folded-case form of each character; see char-foldcase below.

In R7RS, these procedures are in the (scheme char) library.
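The difference from the case-sensitive procedures can be seen by comparing the same ASCII characters both ways. This is only an illustrative sketch.

```scheme
(char=? #\a #\A)        ; ⇒ #f
(char-ci=? #\a #\A)     ; ⇒ #t

;; #\B folds to #\b, so the case-insensitive order is a < b,
;; while the raw codes give #x61 > #x42.
(char-ci<? #\a #\B)     ; ⇒ #t
(char<? #\a #\B)        ; ⇒ #f
```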

Function: char-alphabetic? char
Function: char-numeric? char
Function: char-whitespace? char
Function: char-upper-case? char
Function: char-lower-case? char

[R7RS] Returns true if the character char is an alphabetic character (Unicode character category Lu, Ll, Lt, Lm, Lo, or Nl), a numeric character (Unicode character category Nd), a whitespace character (Unicode character category Zs, Zp, or Zl), an upper case character (Unicode character category Lu), or a lower case character (Unicode character category Ll), respectively.

In R7RS, these procedures are in the (scheme char) library.
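A few ASCII cases illustrate the predicates. The helper count-upper-case is a hypothetical name introduced only for this sketch; it is not part of Gauche.

```scheme
(char-alphabetic? #\a)      ; ⇒ #t
(char-alphabetic? #\1)      ; ⇒ #f
(char-numeric? #\1)         ; ⇒ #t
(char-whitespace? #\space)  ; ⇒ #t
(char-upper-case? #\A)      ; ⇒ #t
(char-lower-case? #\A)      ; ⇒ #f

;; Hypothetical helper: count the upper case characters in a string.
(define (count-upper-case str)
  (let loop ((chars (string->list str)) (n 0))
    (cond ((null? chars) n)
          ((char-upper-case? (car chars)) (loop (cdr chars) (+ n 1)))
          (else (loop (cdr chars) n)))))

(count-upper-case "Hello World")  ; ⇒ 2
```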

Function: char-general-category char

[R6RS] Returns one of the following symbols, representing the Unicode general category of char.

Cc   Other, Control
Cf   Other, Format
Cn   Other, Not Assigned
Co   Other, Private Use
Cs   Other, Surrogate
Ll   Letter, Lowercase
Lm   Letter, Modifier
Lo   Letter, Other
Lt   Letter, Titlecase
Lu   Letter, Uppercase
Mc   Mark, Spacing Combining
Me   Mark, Enclosing
Mn   Mark, Nonspacing
Nd   Number, Decimal Digit
Nl   Number, Letter
No   Number, Other
Pc   Punctuation, Connector
Pd   Punctuation, Dash
Pe   Punctuation, Close
Pf   Punctuation, Final quote
Pi   Punctuation, Initial quote
Po   Punctuation, Other
Ps   Punctuation, Open
Sc   Symbol, Currency
Sk   Symbol, Modifier
Sm   Symbol, Math
So   Symbol, Other
Zl   Separator, Line
Zp   Separator, Paragraph
Zs   Separator, Space

If Gauche is compiled with euc-jp or shift_jis encoding, there are characters that don’t have a corresponding Unicode codepoint (each of them is represented by one Unicode character plus one Unicode modifier character). A provisional category is assigned to those characters. If a future version of Unicode incorporates these characters, the category may be reassigned.

SJIS   EUC    Cat   Unicode
82F5   A4F7   Lo    U+304B U+309A (Semi-voiced Hiragana KA)
82F6   A4F8   Lo    U+304D U+309A (Semi-voiced Hiragana KI)
82F7   A4F9   Lo    U+304F U+309A (Semi-voiced Hiragana KU)
82F8   A4FA   Lo    U+3051 U+309A (Semi-voiced Hiragana KE)
82F9   A4FB   Lo    U+3053 U+309A (Semi-voiced Hiragana KO)
8397   A5F7   Lo    U+30AB U+309A (Semi-voiced Katakana KA)
8398   A5F8   Lo    U+30AD U+309A (Semi-voiced Katakana KI)
8399   A5F9   Lo    U+30AF U+309A (Semi-voiced Katakana KU)
839A   A5FA   Lo    U+30B1 U+309A (Semi-voiced Katakana KE)
839B   A5FB   Lo    U+30B3 U+309A (Semi-voiced Katakana KO)
839C   A5FC   Lo    U+30BB U+309A (Semi-voiced Katakana SE)
839D   A5FD   Lo    U+30C4 U+309A (Semi-voiced Katakana TSU)
839E   A5FE   Lo    U+30C8 U+309A (Semi-voiced Katakana TO)
83F6   A6F8   Lo    U+31F7 U+309A (Semi-voiced small Katakana FU)
8663   ABC4   Ll    U+00E6 U+0300 (Accented latin small ae)
8667   ABC8   Ll    U+0254 U+0300 (Accented latin small open o)
8668   ABC9   Ll    U+0254 U+0301 (Accented latin small open o)
8669   ABCA   Ll    U+028C U+0300 (Accented latin small turned v)
866A   ABCB   Ll    U+028C U+0301 (Accented latin small turned v)
866B   ABCC   Ll    U+0259 U+0300 (Accented latin small schwa)
866C   ABCD   Ll    U+0259 U+0301 (Accented latin small schwa)
866D   ABCE   Ll    U+025A U+0300 (Accented latin small schwa w/hook)
866E   ABCF   Ll    U+025A U+0301 (Accented latin small schwa w/hook)
8685   ABE5   Sk    U+02E9 U+02E5
8686   ABE6   Sk    U+02E5 U+02E9

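
For ordinary ASCII characters, char-general-category returns the symbols one would expect from the Unicode category list above. A short sketch:

```scheme
(char-general-category #\a)        ; ⇒ Ll
(char-general-category #\A)        ; ⇒ Lu
(char-general-category #\5)        ; ⇒ Nd
(char-general-category #\space)    ; ⇒ Zs
(char-general-category #\()        ; ⇒ Ps
(char-general-category #\newline)  ; ⇒ Cc

;; The result is a symbol, so it can be tested with eq? or memq.
(memq (char-general-category #\a) '(Lu Ll Lt Lm Lo Nl))  ; true: alphabetic
```
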
Function: char->integer char
Function: integer->char n

[R7RS] char->integer returns an exact integer that represents the internal encoding of the character char. integer->char returns the character whose internal encoding is the exact integer n. The following expression is always true for a valid character char:

 
(eq? char (integer->char (char->integer char)))

Note: R7RS defines these procedures to deal with Unicode codepoints. Gauche complies with this when compiled with the utf-8 or none internal encoding (for the latter, only characters up to U+00ff are supported). If Gauche is compiled with the euc-jp or sjis internal encoding, you need to use char->ucs/ucs->char below to convert between Unicode codepoints and characters.

The result is undefined if you pass integer->char an integer n that doesn’t have a corresponding character.
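For ASCII characters the internal code is the ASCII code in every supported encoding, so the following sketch gives the same results regardless of how Gauche was compiled.

```scheme
(char->integer #\A)                         ; ⇒ 65
(integer->char 97)                          ; ⇒ #\a

;; Arithmetic on character codes: the character after #\a.
(integer->char (+ (char->integer #\a) 1))   ; ⇒ #\b

;; The round trip mentioned above.
(eq? #\A (integer->char (char->integer #\A)))  ; ⇒ #t
```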

Function: char->ucs char
Function: ucs->char n

Converts a character char to integer UCS codepoint, and integer UCS codepoint n to a character, respectively.

If Gauche is compiled with UTF-8 encoding, these procedures are the same as char->integer and integer->char.

When Gauche’s internal encoding differs from UTF-8, these procedures implicitly load the gauche.charconv module to convert the internal character code to UCS or vice versa (see section gauche.charconv - Character Code Conversion). If char doesn’t have a corresponding UCS codepoint, char->ucs returns #f. If the UCS codepoint n can’t be represented in the internal character encoding, ucs->char returns #f, unless the conversion routine provides a substitution character.
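Because char->ucs may return #f, code that needs a codepoint should check the result. The following sketch shows one way to do so; char->codepoint-string is a hypothetical helper introduced only for illustration.

```scheme
(char->ucs #\A)     ; ⇒ 65
(ucs->char #x41)    ; ⇒ #\A

;; Hypothetical helper: format a character's UCS codepoint as a
;; "U+" string in hexadecimal, or return #f if the character has
;; no UCS codepoint.
(define (char->codepoint-string ch)
  (let ((ucs (char->ucs ch)))
    (and ucs
         (string-append "U+" (number->string ucs 16)))))

(char->codepoint-string #\A)   ; ⇒ "U+41"
```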

Function: char-upcase char
Function: char-downcase char
Function: char-titlecase char
Function: char-foldcase char

[R6RS][R7RS] Returns the upper case, lower case, title case and folded case of char, respectively.

The mapping is done according to the Unicode-defined character-by-character case mapping whenever possible. If the native encoding doesn’t support the mapped character defined in Unicode, the operation becomes a no-op. If the native encoding is none, we treat the characters as if they are Latin-1 (ISO-8859-1) characters. So, upcasing the Latin-1 character small y with diaeresis (U+00ff) yields capital y with diaeresis (U+0178) if the internal encoding is utf-8, but it is a no-op if the internal encoding is none, because U+0178 is outside Latin-1.

R7RS doesn’t have char-titlecase; other three procedures are defined in the (scheme char) library. R6RS defines all of them.

The character-by-character case mapping doesn’t handle a character that maps to more than one character; a notable example is eszett (latin small letter sharp S, U+00df), which is mapped to two capital S’s in string context, but (char-upcase #\ß) returns #\ß. To get a full mapping, use string-upcase etc. in the text.unicode module (see section Full string case conversion).
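The following sketch illustrates the simple case mappings. The ASCII results hold in every encoding; the last expression writes eszett as #\xdf and assumes the utf-8 or none internal encoding.

```scheme
(char-upcase #\a)      ; ⇒ #\A
(char-downcase #\A)    ; ⇒ #\a
(char-titlecase #\a)   ; ⇒ #\A
(char-foldcase #\A)    ; ⇒ #\a
(char-upcase #\1)      ; ⇒ #\1  (no case; returned unchanged)

;; Eszett has no single-character upper case, so it maps to itself.
(char=? (char-upcase #\xdf) #\xdf)   ; ⇒ #t
```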

Function: digit->integer char :optional (radix 10) (extended-range? #f)

If the given character char is a valid digit character in a number of radix radix, the corresponding integer is returned. Otherwise #f is returned.

 
(digit->integer #\4) ⇒ 4
(digit->integer #\e 16) ⇒ 14
(digit->integer #\9 8) ⇒ #f

If the optional extended-range? argument is true, this procedure recognizes not only ASCII digits, but also all characters with Nd general category—such as FULLWIDTH DIGIT ZERO to NINE (U+ff10 - U+ff19).

R7RS has digit-value, which is equivalent to (digit->integer char 10 #t).

Note: Common Lisp has a similar function with a rather confusing name, digit-char-p.
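Since digit->integer returns #f for a non-digit, it combines naturally with the => clause of cond. The following is a minimal sketch of a digit-string parser built on it; digits->integer is a hypothetical helper, not a Gauche procedure.

```scheme
;; Hypothetical helper: convert a string of digits in the given
;; radix to an integer, or return #f if any character is not a
;; valid digit.  Note that 0 is a true value in Scheme, so the
;; digit #\0 is handled correctly by the => clause.
(define (digits->integer str radix)
  (let loop ((chars (string->list str)) (acc 0))
    (cond ((null? chars) acc)
          ((digit->integer (car chars) radix)
           => (lambda (d) (loop (cdr chars) (+ (* acc radix) d))))
          (else #f))))

(digits->integer "1024" 10)   ; ⇒ 1024
(digits->integer "ff" 16)     ; ⇒ 255
(digits->integer "12x" 10)    ; ⇒ #f
```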

Function: integer->digit integer :optional (radix 10) (basechar1 #\0) (basechar2 #\a)

Reverse operation of digit->integer. Returns a character that represents the number integer in the radix radix system. If integer is out of the valid range, #f is returned.

 
(integer->digit 13 16) ⇒ #\d
(integer->digit 10) ⇒ #f

The optional basechar1 argument specifies the character that stands for zero; by default, it’s #\0. You can give an alternative character, for example U+0660 (ARABIC-INDIC DIGIT ZERO), to convert an integer to an Arabic-Indic digit character.

The other optional argument, basechar2, specifies the character that stands for 10 and is used for integers of 10 or more. The default value is #\a. You can pass #\A to get upper-case hex digits, for example.

Note: This procedure corresponds to Common Lisp’s digit-char.
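The effect of the optional arguments can be sketched as follows. The last example uses a non-ASCII base character and assumes Gauche is compiled with the utf-8 internal encoding.

```scheme
(integer->digit 5)                ; ⇒ #\5
(integer->digit 7 8)              ; ⇒ #\7
(integer->digit 8 8)              ; ⇒ #f  (8 is not an octal digit)

;; Upper-case hexadecimal digits via basechar2.
(integer->digit 13 16 #\0 #\A)    ; ⇒ #\D

;; Arabic-Indic digits via basechar1 (assumes utf-8).
(char=? (integer->digit 3 10 #\x0660) #\x0663)   ; ⇒ #t
```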

Function: gauche-character-encoding

Returns a symbol that designates the native character encoding, selected at compile time. The possible return values are the following:

euc-jp

EUC-JP

utf-8

UTF-8

sjis

Shift JIS

none

No multibyte character support (8-bit fixed-length character).

To switch code at compile time according to the internal encoding, you can use the feature identifiers gauche.ces.*; see Using platform-dependent features.
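The returned symbol can also be examined at run time. The following sketch dispatches on it to obtain a Unicode codepoint in any build, following the note under char->integer; codepoint-of is a hypothetical helper introduced only for illustration.

```scheme
;; Hypothetical helper: return the Unicode codepoint of ch.
;; With utf-8 or none, the internal code is already the codepoint;
;; with euc-jp or sjis, char->ucs does the conversion (and may
;; return #f for a character without a UCS codepoint).
(define (codepoint-of ch)
  (case (gauche-character-encoding)
    ((utf-8 none) (char->integer ch))
    (else (char->ucs ch))))

(codepoint-of #\A)   ; ⇒ 65
```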

Function: supported-character-encodings

Returns a list of string names of character encoding schemes that are supported in the native multibyte encoding scheme.



This document was generated on July 19, 2014 using texi2html 1.82.