[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

11.47 text.unicode - Unicode utilities

Module: text.unicode

This module provides various operations on a sequence of Unicode codepoints.

Gauche can be compiled with a native encoding other than Unicode, and the full Unicode-compatible behavior on characters and strings may not be available on such systems. So we provide most operations in two flavors: operations on characters and strings, and operations on codepoints represented as a sequence of integers.

If Gauche is compiled with none, euc-jp or sjis as its native encoding, the character-and-string operations are likely to be partial functions of the operations defined in the Unicode standard. That is, if an operation can yield a character that is not supported in the native encoding, the character may be remapped to an alternative character. Each manual entry explains the detailed behavior.

The codepoint operations are independent of Gauche’s native encoding and support the full specification as defined in the Unicode standard. If Gauche is compiled with the utf-8 native encoding, the operations are essentially the same as the character-and-string flavors when you convert between codepoints and characters with char->integer and integer->char. The codepoint operations are handy when you need to fully support the algorithms described in the Unicode standard, no matter what the running Gauche’s native encoding is.
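
As a minimal sketch of moving between the two flavors, the helpers below convert a string to a list of codepoints and back. The names string->codepoints and codepoints->string are hypothetical (they are not part of this module), and the sketch assumes the utf-8 native encoding, where char->integer returns the Unicode codepoint.

```scheme
;; Hypothetical helpers, not part of text.unicode.
;; Assumption: utf-8 native encoding, so char->integer yields
;; the Unicode codepoint of the character.
(define (string->codepoints str)
  (map char->integer (string->list str)))

(define (codepoints->string cps)
  (list->string (map integer->char cps)))

(string->codepoints "abc")          ; ⇒ (97 98 99)
(codepoints->string '(97 98 99))    ; ⇒ "abc"
```

A list produced this way can be passed to any of the codepoint-flavored procedures described in the following sections.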


11.47.1 Unicode transfer encodings

The procedures in this group operate on codepoints represented as integers. In the following descriptions, an ‘octet’ is an integer between 0 and 255, inclusive.

They take an optional strictness argument. It specifies what to do when the procedure encounters a datum outside of the defined domain. Its value can be one of the following symbols:

strict

Raises an error when the procedure encounters such input. This is the default behavior.

permissive

Whenever possible, treat the datum as if it were a valid value. For example, a codepoint value beyond #x10ffff is invalid in the Unicode standard, but accepting it may be useful for an application that just wants to use UTF-8 as an encoding scheme for binary data.

ignore

Whenever possible, treat the invalid input as if it did not exist.

The procedure may still raise an error in permissive or ignore strictness mode if there is no sensible way to handle the input data.

Function: ucs4->utf8 codepoint :optional strictness

Takes an integer codepoint and returns a list of octets that encodes the input in UTF-8.

 
(ucs4->utf8 #x3bb)  ⇒ (206 187)
(ucs4->utf8 #x3042) ⇒ (227 129 130)

If strictness is strict (the default), input codepoints between #xd800 and #xdfff, and those at #x110000 or above, are rejected. If strictness is permissive, it accepts input between 0 and #x7fffffff, inclusive; it may produce 5 or 6 octets if the input is large (as in the original UTF-8 definition). If strictness is ignore, it returns an empty list for invalid codepoints.
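
To make the octet layout concrete, here is a minimal stand-alone sketch of UTF-8 encoding restricted to the strict Unicode range. The procedure encode-utf8 is hypothetical and is not Gauche's implementation of ucs4->utf8; it only illustrates how the octets in the examples above are derived.

```scheme
;; Hypothetical stand-alone encoder (strict Unicode range only).
;; Not Gauche's implementation; for illustration.
(define (encode-utf8 cp)
  ;; Continuation octet 10xxxxxx carrying 6 bits of cp,
  ;; starting at bit position `shift'.
  (define (cont shift)
    (+ #x80 (modulo (quotient cp (expt 2 shift)) 64)))
  (cond ((or (< cp 0) (> cp #x10ffff) (<= #xd800 cp #xdfff))
         (error "codepoint outside strict UTF-8 range:" cp))
        ((< cp #x80)                       ; 1 octet:  0xxxxxxx
         (list cp))
        ((< cp #x800)                      ; 2 octets: 110xxxxx 10xxxxxx
         (list (+ #xc0 (quotient cp 64)) (cont 0)))
        ((< cp #x10000)                    ; 3 octets: 1110xxxx ...
         (list (+ #xe0 (quotient cp 4096)) (cont 6) (cont 0)))
        (else                              ; 4 octets: 11110xxx ...
         (list (+ #xf0 (quotient cp 262144))
               (cont 12) (cont 6) (cont 0)))))

(encode-utf8 #x3bb)    ; ⇒ (206 187)
(encode-utf8 #x3042)   ; ⇒ (227 129 130)
(encode-utf8 #x1f600)  ; ⇒ (240 159 152 128)
```

The permissive mode described above extends the same pattern to 5 and 6 octets; the sketch deliberately leaves that out.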

Function: utf8-length octet :optional strictness

Takes octet as the first octet of a UTF-8 sequence, and returns the total number of octets required to decode the codepoint.

If strictness is strict (default), this procedure returns either 1, 2, 3 or 4. An error is thrown if octet cannot be a leading octet of a proper UTF-8 encoded Unicode codepoint.

If strictness is permissive, this procedure may return an integer between 0 and 6, inclusive. It allows the codepoint range #x110000 to #x7fffffff as the original utf-8 spec, so the maximum number of octets can be up to 6. If the input is in the range between #xc0 and #xdf, inclusive, this procedure returns 1; it’s up to the application how to treat these illegal octets. For other values, it returns 0.

If strictness is ignore, this procedure returns 0 where it would raise an error in strict mode. Other than that, it works the same as the default case.
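
The idea behind the strict case can be sketched with the standard UTF-8 leading-octet layout. The procedure leading-octet-length below is hypothetical and follows the standard layout rather than Gauche's source, so edge cases may differ from utf8-length; it returns #f where a strict decoder would reject the octet.

```scheme
;; Hypothetical classifier based on the standard UTF-8 layout.
;; Not Gauche's utf8-length; edge cases may differ.
(define (leading-octet-length octet)
  (cond ((<= #x00 octet #x7f) 1)   ; 0xxxxxxx: ASCII
        ((<= #xc2 octet #xdf) 2)   ; 110xxxxx (c0 and c1 would be overlong)
        ((<= #xe0 octet #xef) 3)   ; 1110xxxx
        ((<= #xf0 octet #xf4) 4)   ; 11110xxx, up to U+10FFFF
        (else #f)))                ; continuation or out-of-range octet

(leading-octet-length 206)   ; ⇒ 2  (first octet of U+03BB)
(leading-octet-length 227)   ; ⇒ 3  (first octet of U+3042)
(leading-octet-length #x80)  ; ⇒ #f (a continuation octet)
```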

Function: utf8->ucs4 octet-list :optional strictness

Takes a list of octets, and decodes it as a utf-8 sequence. Returns two values: The decoded ucs4 codepoint, and the rest of the input list.

An invalid utf-8 sequence causes an error if strictness is strict, or is skipped if it is ignore. If strictness is permissive, the procedure accepts the original utf-8 sequence, which can produce codepoints in the surrogate pair range (between #xd800 and #xdfff) and in the range between #x110000 and #x7fffffff. An invalid octet sequence is still an error in permissive mode.
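
Because the second return value is the unconsumed part of the input, utf8->ucs4 can be called repeatedly to decode a whole list. The following is a minimal sketch in the default strict mode; the helper name utf8-octets->codepoints is hypothetical.

```scheme
(use text.unicode)

;; Hypothetical helper: decode every codepoint in an octet list by
;; feeding the "rest" value back into utf8->ucs4 (strict mode).
(define (utf8-octets->codepoints octets)
  (let loop ((rest octets) (acc '()))
    (if (null? rest)
        (reverse acc)
        (receive (cp more) (utf8->ucs4 rest)
          (loop more (cons cp acc))))))

;; The octets of U+03BB followed by the octets of U+3042:
(utf8-octets->codepoints '(206 187 227 129 130))
;; should yield (955 12354), that is, (#x3bb #x3042)
```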

Function: ucs4->utf16 codepoint :optional strictness
Function: utf16-length octet :optional strictness
Function: utf16->ucs4 octet-list :optional strictness

11.47.2 Unicode text segmentation

These procedures implement the grapheme-cluster and word breaking algorithms defined in UAX #29: Unicode Text Segmentation.

Function: string->words string
Function: codepoints->words sequence

From the given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of words. Each word is represented as a string, or a sequence of the same type as the input, respectively.

 
(string->words "That's it.")
 ⇒ ("That's" " " "it" ".")
(codepoints->words '(84 104 97 116 39 115 32 105 116 46))
 ⇒ ((84 104 97 116 39 115) (32) (105 116) (46))

In the second example, the list is a list of codepoints of characters in "That's it."
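
The relation between the two examples can be checked with a short sketch. It assumes the utf-8 native encoding (the text here is all ASCII), so that char->integer and integer->char map between characters and codepoints.

```scheme
(use text.unicode)

(define text "That's it.")

;; Codepoints of the characters in text.
(define cps (map char->integer (string->list text)))
;; cps ⇒ (84 104 97 116 39 115 32 105 116 46)

;; Split the codepoints into words, then turn each word back into
;; a string so it can be compared with the string flavor.
(define words-from-codepoints
  (map (lambda (w) (list->string (map integer->char w)))
       (codepoints->words cps)))

(equal? words-from-codepoints (string->words text))
;; expected to be #t, given the two results shown above
```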

Function: string->grapheme-clusters string
Function: codepoints->grapheme-clusters sequence

From the given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of grapheme clusters. Each cluster is represented as a string, or a sequence of the same type as the input, respectively.
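
A small illustration, using the codepoint flavor so that it does not depend on the native encoding. The expected result is derived from the UAX #29 rule that no break occurs before a combining mark, not from a recorded run.

```scheme
(use text.unicode)

;; 101 = LATIN SMALL LETTER E
;; #x301 = COMBINING ACUTE ACCENT (attaches to the preceding e)
;; 97 = LATIN SMALL LETTER A
(codepoints->grapheme-clusters '(101 #x301 97))
;; should yield ((101 769) (97)): two clusters from three codepoints
```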

The following procedures are low-level building blocks used to build the above string->words etc. A generator argument is a procedure that takes no arguments and returns a value (or some values) at a time for every call, until it returns EOF.

Function: make-word-breaker generator
Function: make-grapheme-cluster-breaker generator

Given a generator of characters or codepoints, returns a generator that returns two values: the first value is the character or codepoint generated from the original generator, and the second value is a boolean flag, which is #t if a word or a grapheme cluster breaks before the character/codepoint, and #f otherwise.

Suppose a generator g returns the characters of the string "That's it.", one at a time. Then the created generator will work as follows:

 
(define brk (make-word-breaker g))
(brk)  ⇒  #\T     and #t
(brk)  ⇒  #\h     and #f
(brk)  ⇒  #\a     and #f
(brk)  ⇒  #\t     and #f
(brk)  ⇒  #\'     and #f
(brk)  ⇒  #\s     and #f
(brk)  ⇒  #\space and #t
(brk)  ⇒  #\i     and #t
(brk)  ⇒  #\t     and #f
(brk)  ⇒  #\.     and #t
(brk)  ⇒  #<eof>  and #t

This shows that the word breaks are at the character boundaries marked by the caret ^ below (for clarity, _ stands for the space).

 
  T h a t ' s _ i t .
 ^           ^ ^   ^ ^
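
The generator g is left abstract above. One way to build it, sketched under the assumption that a plain closure is acceptable as a generator (as the description of the generator argument suggests), is shown below. The helper names list->gen and word-break-positions are hypothetical.

```scheme
(use text.unicode)

;; Hypothetical helper: a generator that yields the elements of lis
;; one at a time, and EOF once the list is exhausted.
(define (list->gen lis)
  (lambda ()
    (if (null? lis)
        (eof-object)
        (let ((x (car lis)))
          (set! lis (cdr lis))
          x))))

;; Collect the zero-based character positions that have a word break
;; before them. The break reported together with EOF is not recorded.
(define (word-break-positions str)
  (let ((brk (make-word-breaker (list->gen (string->list str)))))
    (let loop ((i 0) (acc '()))
      (receive (ch break?) (brk)
        (cond ((eof-object? ch) (reverse acc))
              (break? (loop (+ i 1) (cons i acc)))
              (else   (loop (+ i 1) acc)))))))

(word-break-positions "That's it.")
;; ⇒ (0 6 7 9), matching the trace above
```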

Function: make-word-reader generator return
Function: make-grapheme-cluster-reader generator return

The input generator is a generator of characters or codepoints, and return is a procedure that takes a list of characters or codepoints and returns an object. These procedures create a generator that returns one object at a time, each built from a word or a grapheme cluster, respectively.

Suppose again that a generator g returns the characters of the string "That's it.", one at a time. Then the created generator works as follows:

 
(define brk (make-word-reader g list->string))
(brk)  ⇒  "That's"
(brk)  ⇒  " "
(brk)  ⇒  "it"
(brk)  ⇒  "."
(brk)  ⇒  #<eof>
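
A reader can be drained into a list with a small loop. The following is a minimal sketch; list->gen and drain are hypothetical helpers, and a plain closure is assumed to be acceptable as the input generator.

```scheme
(use text.unicode)

;; Hypothetical helper: a generator over the elements of a list.
(define (list->gen lis)
  (lambda ()
    (if (null? lis)
        (eof-object)
        (let ((x (car lis)))
          (set! lis (cdr lis))
          x))))

;; Hypothetical helper: call gen until it returns EOF and collect
;; the results in order.
(define (drain gen)
  (let loop ((acc '()))
    (let ((v (gen)))
      (if (eof-object? v)
          (reverse acc)
          (loop (cons v acc))))))

(drain (make-word-reader (list->gen (string->list "That's it."))
                         list->string))
;; ⇒ ("That's" " " "it" "."), the same words string->words returns
```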

11.47.3 Full string case conversion

Function: string-upcase string
Function: string-downcase string
Function: string-titlecase string
Function: string-foldcase string

[R6RS] Converts the given string to upper case, lower case, title case, or its case-folded form, respectively, using the language-independent full case conversion defined by the Unicode standard. They differ from srfi-13’s procedures with the same names (See section String case mapping), which simply use character-by-character case mapping. Notably, the length of the resulting string may differ from that of the source string, and some conversions are sensitive to whether the character is at a word boundary or not. The word boundaries are determined according to UAX #29 text segmentation rules.

 
(string-upcase "straße")
 ⇒ "STRASSE"
(string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.")
 ⇒ "χαοσχαοσ.χαος. σ."
(string-titlecase "You're talking about R6RS, right?")
 ⇒ "You're Talking About R6rs, Right?"
(string-foldcase "straße")
 ⇒ "strasse"
(string-foldcase "ΧΑΟΣΣ")
 ⇒ "χαοσσ"

Function: codepoints-upcase sequence
Function: codepoints-downcase sequence
Function: codepoints-titlecase sequence
Function: codepoints-foldcase sequence

Like string-upcase etc., but these work on a sequence of codepoints instead. Returns a sequence of the same type as the input.

 
(codepoints-upcase '#(115 116 114 97 223 101))
 ⇒ #(83 84 82 65 83 83 69)
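
Since the result has the same type as the input, a list of codepoints comes back as a list. The sketch below also shows the length change caused by full case conversion; the variable names are only for illustration.

```scheme
(use text.unicode)

;; The codepoints of "straße" as a list: s t r a ß e
(define src '(115 116 114 97 223 101))

;; ß (223) upcases to two codepoints, S S (83 83).
(define up (codepoints-upcase src))
;; up ⇒ (83 84 82 65 83 83 69), the list counterpart of the
;; vector result above

(list (length src) (length up))   ; ⇒ (6 7)
```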

This document was generated by Shiro Kawai on May 28, 2012 using texi2html 1.82.