text.unicode - Unicode utilities

This module provides various operations on a sequence of Unicode codepoints.
Gauche can be compiled with a native encoding other than Unicode, and full Unicode-compatible behavior on characters and strings may not be available on such systems. So we provide most operations in two flavors: operations on characters and strings, and operations on codepoints represented as a sequence of integers.
If Gauche is compiled with its native encoding being none,
euc-jp or sjis, the character-and-string operations
are likely to be partial implementations of the operations defined
in the Unicode standard. That is, if an operation would yield a
character that is not supported in the native encoding, the character
may be remapped to an alternative character. Each manual entry
explains the detailed behavior.
The codepoint operations are independent of Gauche’s native
encoding and support the full specification as defined in the Unicode standard.
If Gauche is compiled with the utf-8 native encoding,
they are essentially the same as the character-and-string flavors
once you convert between codepoints and characters with char->integer and
integer->char. The codepoint operations are handy when
you need to fully support the algorithms described in the Unicode standard,
no matter what the running Gauche’s native encoding is.
11.47.1 Unicode transfer encodings
11.47.2 Unicode text segmentation
11.47.3 Full string case conversion
11.47.1 Unicode transfer encodings
The procedures in this group operate on codepoints represented as integers. In the following descriptions, an ‘octet’ is an integer between 0 and 255, inclusive.
They take an optional strictness argument, which specifies what to do when the procedure encounters a datum outside of the defined domain. Its value must be one of the following symbols:
strict
    Raises an error when the procedure encounters such input. This is the default behavior.
permissive
    Whenever possible, treats the datum as if it were a valid value. For example, a codepoint value beyond #x10ffff is invalid in the Unicode standard, but it may be useful for applications that just want to use UTF-8 as an encoding scheme for binary data.
ignore
    Whenever possible, treats the invalid input as if it did not exist.
The procedure may still raise an error in permissive or ignore strictness mode if there is no sensible way to handle the input data.
ucs4->utf8 takes an integer codepoint and returns a list of octets that encodes the input in UTF-8.
(ucs4->utf8 #x3bb) ⇒ (206 187)
(ucs4->utf8 #x3042) ⇒ (227 129 130)
If strictness is strict (default), input codepoints
between #xd800 and #xdfff, and those at #x110000 or above,
are rejected. If strictness is permissive, it accepts
input between 0 and #x7fffffff, inclusive; it may produce
5 or 6 octets if the input is large (as in the original UTF-8 definition).
If strictness is ignore, it returns an empty list
for invalid codepoints.
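The three strictness modes can be compared side by side. The following is a minimal sketch, assuming a Gauche installation where text.unicode is available; try-encode is a hypothetical helper introduced only for this illustration, and it uses guard to turn the error raised in strict mode into the symbol error.

```scheme
(use text.unicode)

;; Hypothetical helper (not part of text.unicode): encode CODEPOINT with
;; the given strictness, returning the symbol `error` instead of raising.
(define (try-encode codepoint strictness)
  (guard (e (else 'error))
    (ucs4->utf8 codepoint strictness)))

;; #xd800 is in the surrogate range, so it is invalid in strict mode.
(define strict-result (try-encode #xd800 'strict))
(define ignore-result (try-encode #xd800 'ignore))

;; A value far beyond #x10ffff, accepted only in permissive mode.
(define permissive-result (try-encode #x7fffffff 'permissive))

(print strict-result)              ; error
(print ignore-result)              ; ()
;; The text above allows 5 or 6 octets for large input; the original
;; UTF-8 scheme needs 6 octets for a 31-bit value.
(print (length permissive-result))
```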
Takes octet as the first octet of a UTF-8 sequence, and returns the total number of octets required to decode the codepoint.
If strictness is strict (default), this
procedure returns 1, 2, 3 or 4. An error is
raised if octet cannot be a leading octet of
a proper UTF-8 encoded Unicode codepoint.
If strictness is permissive, this procedure
may return an integer between 0 and 6, inclusive.
It allows the codepoint range #x110000 to
#x7fffffff as in the original UTF-8 specification, so
the number of octets can be up to 6.
If the input is in the range between #xc0
and #xdf, inclusive, this procedure returns
1; it’s up to the application how to treat these illegal
octets. For other values, it returns 0.
If strictness is ignore, this procedure
returns 0 where it would raise an error if
strictness were strict. Other than that,
it works the same as the default case.
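As an illustration, the sketch below queries the sequence length for a few leading octets. It assumes the procedure described above is exported from text.unicode under the name utf8-length (the name does not appear in the text above, so treat it as an assumption), and it sticks to inputs whose behavior follows directly from the description.

```scheme
(use text.unicode)

;; Leading octets of 1-, 2-, 3- and 4-octet UTF-8 sequences.
;; Assumption: the procedure is named utf8-length.
(define lengths (map utf8-length '(#x41 #xce #xe3 #xf0)))
(print lengths) ; (1 2 3 4)

;; #x80 is a continuation octet, so it cannot lead a sequence:
;; strict mode would raise an error, ignore mode reports 0 instead.
(define ignored (utf8-length #x80 'ignore))
(print ignored) ; 0
```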
Takes a list of octets, and decodes it as a UTF-8 sequence. Returns two values: the decoded UCS-4 codepoint, and the rest of the input list.
An invalid UTF-8 sequence causes an error if strictness
is strict, or is skipped if it is ignore.
If strictness is permissive, the procedure accepts
the original UTF-8 sequences, which can produce codepoints in the surrogate pair
range (between #xd800 and #xdfff) and in the range
between #x110000 and #x7fffffff. An invalid
octet sequence is still an error in permissive mode.
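Because the decoder returns the remaining octets as its second value, it can be called repeatedly to decode a whole octet list. The sketch below assumes the procedure described above is exported as utf8->ucs4 (an assumption, since the name is not shown in the text); utf8-octets->codepoints is a hypothetical helper written for this example only.

```scheme
(use text.unicode)

;; Decode one codepoint and look at both return values.
;; Assumption: the decoding procedure is named utf8->ucs4.
(define first-step
  (call-with-values (lambda () (utf8->ucs4 '(206 187 65)))
    list))
(print first-step) ; (955 (65))

;; Hypothetical helper: decode an entire list of octets by feeding the
;; "rest" value back into the decoder until the input is exhausted.
(define (utf8-octets->codepoints octets)
  (let loop ((rest octets) (acc '()))
    (if (null? rest)
        (reverse acc)
        (call-with-values (lambda () (utf8->ucs4 rest))
          (lambda (codepoint more)
            (loop more (cons codepoint acc)))))))

(define decoded (utf8-octets->codepoints '(206 187 227 129 130 65)))
(print decoded) ; (955 12354 65)
```

The default strict mode is used here, so malformed input raises an error instead of looping.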
11.47.2 Unicode text segmentation
These procedures implement the grapheme-cluster and word breaking algorithms defined in UAX #29: Unicode Text Segmentation.
string->words and codepoints->words take a string or a codepoint
sequence (a <sequence> object containing codepoints), respectively,
and return a list of words. Each word is represented as a string, or
a sequence of the same type as the input, respectively.
(string->words "That's it.")
  ⇒ ("That's" " " "it" ".")
(codepoints->words '(84 104 97 116 39 115 32 105 116 46))
  ⇒ ((84 104 97 116 39 115) (32) (105 116) (46))
In the second example, the argument is the list of codepoints of the characters in "That's it."
The grapheme-cluster counterparts of the above: given a string or a
codepoint sequence (a <sequence> object containing codepoints),
they return a list of grapheme clusters. Each cluster is represented as a string,
or a sequence of the same type as the input, respectively.
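A short sketch of grapheme-cluster segmentation on codepoints follows. It assumes the codepoint flavor is exported as codepoints->grapheme-clusters (the name is not given in the text above, so this is an assumption). Working on codepoints keeps the example independent of Gauche's native encoding.

```scheme
(use text.unicode)

;; "e" followed by COMBINING ACUTE ACCENT (U+0301), then "a".
;; Assumption: the procedure is named codepoints->grapheme-clusters.
(define clusters
  (codepoints->grapheme-clusters '(#x65 #x301 #x61)))

;; UAX #29 does not break before a combining mark, so the accent is
;; expected to stay with its base letter: ((101 769) (97))
(print clusters)
```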
The following procedures are low-level building blocks
used to build the above string->words etc.
A generator argument is a procedure
that takes no arguments and returns a value (or some values) one at a time
on every call, until it returns EOF.
Given generator, which is a generator of characters or codepoints,
returns a generator that returns two values: the first value is the
character or codepoint generated from the original generator, and the
second value is a boolean flag, which is #t if a word
or a grapheme cluster
breaks before the character/codepoint, and #f otherwise.
Suppose a generator g returns characters in a string
"That's it.", one at a time. Then the created generator
will work as follows:
(define brk (make-word-breaker g))
(brk) ⇒ #\T and #t
(brk) ⇒ #\h and #f
(brk) ⇒ #\a and #f
(brk) ⇒ #\t and #f
(brk) ⇒ #\' and #f
(brk) ⇒ #\s and #f
(brk) ⇒ #\space and #t
(brk) ⇒ #\i and #t
(brk) ⇒ #\t and #f
(brk) ⇒ #\. and #t
(brk) ⇒ #<eof> and #t
It shows the word breaks at the character boundaries marked
by the caret ^ below (for clarity, _ is used to indicate
the space).
 T h a t ' s _ i t .
^           ^ ^   ^ ^
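The trace above leaves the generator g unspecified. The sketch below shows one way to build it by hand and to collect the break positions; string->char-generator and break-positions are hypothetical helpers written for this example, not part of text.unicode.

```scheme
(use text.unicode)

;; Hypothetical helper: a generator that returns the characters of STR
;; one at a time, then EOF forever.
(define (string->char-generator str)
  (let ((chars (string->list str)))
    (lambda ()
      (if (null? chars)
          (eof-object)
          (let ((c (car chars)))
            (set! chars (cdr chars))
            c)))))

;; Hypothetical helper: run a breaker to exhaustion and return the
;; indices of the items (including the final EOF) preceded by a break.
(define (break-positions breaker)
  (let loop ((index 0) (acc '()))
    (receive (item break?) (breaker)
      (let ((acc (if break? (cons index acc) acc)))
        (if (eof-object? item)
            (reverse acc)
            (loop (+ index 1) acc))))))

(define positions
  (break-positions
   (make-word-breaker (string->char-generator "That's it."))))

;; Matches the #t flags in the trace above: (0 6 7 9 10)
(print positions)
```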
The input generator is a generator of characters or codepoints, and return is a procedure that takes a list of characters or codepoints and returns an object. These procedures create a generator that returns one object at a time, each built by return from a word or a grapheme cluster, respectively.
Suppose a generator g returns characters in a string
"That's it.", one at a time, again.
Then the created generator works as follows:
(define brk (make-word-reader g list->string))
(brk) ⇒ "That's"
(brk) ⇒ " "
(brk) ⇒ "it"
(brk) ⇒ "."
(brk) ⇒ #<eof>
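To consume such a reader in practice, one can drain it into a list. This is a minimal sketch; string->char-generator and drain are hypothetical helpers defined here for illustration only.

```scheme
(use text.unicode)

;; Hypothetical helper: a generator over the characters of STR.
(define (string->char-generator str)
  (let ((chars (string->list str)))
    (lambda ()
      (if (null? chars)
          (eof-object)
          (let ((c (car chars)))
            (set! chars (cdr chars))
            c)))))

;; Hypothetical helper: call GEN until it returns EOF and collect the
;; returned objects in order.
(define (drain gen)
  (let loop ((acc '()))
    (let ((item (gen)))
      (if (eof-object? item)
          (reverse acc)
          (loop (cons item acc))))))

(define words
  (drain (make-word-reader (string->char-generator "That's it.")
                           list->string)))

(write words) ; ("That's" " " "it" ".")
(newline)
```

Since return receives each word as a list of characters, passing a different procedure in place of list->string changes what the reader yields without touching the segmentation itself.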
11.47.3 Full string case conversion
[R6RS] string-upcase, string-downcase, string-titlecase and string-foldcase convert the given string to upper case, lower case, title case, or folded case, respectively, using the language-independent full case conversion defined by the Unicode standard. They differ from srfi-13’s procedures with the same names (see section String case mapping), which simply use character-by-character case mapping. Notably, the length of the resulting string may differ from that of the source string, and some conversions are sensitive to whether the character is at a word boundary or not. The word boundaries are determined according to the UAX #29 text segmentation rules.
(string-upcase "straße") ⇒ "STRASSE"
(string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.") ⇒ "χαοσχαοσ.χαος. σ."
(string-titlecase "You're talking about R6RS, right?") ⇒ "You're Talking About R6rs, Right?"
(string-foldcase "straße") ⇒ "strasse"
(string-foldcase "ΧΑΟΣΣ") ⇒ "χαοσσ"
Like string-upcase etc., but these work on a sequence of
codepoints instead. They return a sequence of the same type as the input.
(codepoints-upcase '#(115 116 114 97 223 101)) ⇒ #(83 84 82 65 83 83 69)
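The codepoint flavor is convenient when the input cannot be represented in the native encoding. The sketch below passes a list instead of a vector, relying on the statement above that the result has the same type as the input, and then converts the all-ASCII result back to a string.

```scheme
(use text.unicode)

;; Codepoints of "straße", written as integers so that the example does
;; not depend on the native encoding of the running Gauche.
(define source '(115 116 114 97 223 101))

;; A list goes in, so a list is expected to come out.
(define upper (codepoints-upcase source))
(print upper) ; (83 84 82 65 83 83 69)

;; The result is pure ASCII here, so it converts safely to a string.
(define upper-string (list->string (map integer->char upper)))
(print upper-string) ; STRASSE
```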
This document was generated by Shiro Kawai on May 28, 2012 using texi2html 1.82.