For Gauche 0.9.5



9.34 gauche.unicode - Unicode utilities

Module: gauche.unicode

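This module provides various operations on sequences of Unicode codepoints.

Gauche can be compiled with an internal character encoding other than Unicode. In that case, it may not be able to provide fully Unicode-compatible behavior for strings. For this reason, many operations in this module come in two versions: one that works on strings, and one that works on sequences of codepoints represented as integers.

If Gauche's internal encoding is none, euc-jp or sjis, operations on characters and strings will not exactly match those defined by the Unicode standard. When the result of an operation is a character that is not defined in the internal encoding, it is replaced by a substitute character. See each function's entry for details.

Operations on codepoint sequences are independent of Gauche's internal character encoding, and fully support the definitions of the Unicode standard. If Gauche is compiled with utf-8, operations on codepoint sequences agree with the corresponding operations on strings (apart from converting each element with char->integer and integer->char). Operations on codepoint sequences are useful when you need algorithms that are portable regardless of Gauche's internal character encoding.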



9.34.1 Unicode transfer encodings

The procedures in this group operate on codepoints represented as integers. In the following descriptions, an ‘octet’ is an integer between 0 and 255, inclusive.

They take an optional strictness argument. It specifies what to do when the procedure encounters a datum outside of the defined domain. Its value must be one of the following symbols:

strict

Raises an error when the procedure encounters such input. This is the default behavior.

permissive

Whenever possible, treat the datum as if it were a valid value. For example, a codepoint value beyond #x10ffff is invalid in the Unicode standard, but it may be useful for applications that just want to use UTF-8 as an encoding scheme for binary data.

ignore

Whenever possible, treat invalid input as if it did not exist.

The procedure may still raise an error in permissive or ignore strictness mode if there is no sensible way to handle the input data.

Function: ucs4->utf8 codepoint :optional strictness

Takes an integer codepoint and returns a list of octets that encodes the input in UTF-8.

(ucs4->utf8 #x3bb)  ⇒ (206 187)
(ucs4->utf8 #x3042) ⇒ (227 129 130)

If strictness is strict (default), input codepoints between #xd800 and #xdfff, and those at or beyond #x110000, are rejected. If strictness is permissive, it accepts input between 0 and #x7fffffff, inclusive; it may produce 5 or 6 octets if the input is large (as in the original UTF-8 definition). If strictness is ignore, it returns an empty list for invalid codepoints.
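
The effect of the strictness argument can be sketched as follows. This is only an illustration of the behavior described above; the surrogate codepoint #xd800 is chosen here as an example of invalid input, and guard is used merely to show that strict mode raises an error.

```scheme
(use gauche.unicode)

;; Valid codepoints encode the same way in every mode.
(ucs4->utf8 #x41)      ; ⇒ (65)
(ucs4->utf8 #x1f600)   ; ⇒ (240 159 152 128)

;; A surrogate codepoint is invalid input.
;; With 'ignore, the result is an empty list.
(ucs4->utf8 #xd800 'ignore)   ; ⇒ ()

;; With the default 'strict, the same input raises an error,
;; which is trapped here so that the example keeps running.
(guard (e (else 'rejected))
  (ucs4->utf8 #xd800))        ; ⇒ rejected
```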

Function: utf8-length octet :optional strictness

Takes octet as the first octet of a UTF-8 sequence, and returns the total number of octets required to decode the codepoint.

If strictness is strict (default), this procedure returns either 1, 2, 3 or 4. An error is thrown if octet cannot be a leading octet of a proper UTF-8 encoded Unicode codepoint.

If strictness is permissive, this procedure may return an integer between 0 and 6, inclusive. It allows the codepoint range #x110000 to #x7fffffff, as in the original UTF-8 specification, so up to 6 octets may be needed. If the input is in the range between #xc0 and #xdf, inclusive, this procedure returns 1; it is up to the application how to treat these illegal octets. For other values, it returns 0.

If strictness is ignore, this procedure returns 0 in every case where it would raise an error under strict. Other than that, it works the same as the default case.
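
A few sample calls, as a sketch of the default and ignore modes. The octet #x80 is used here as an example of a value that cannot begin a UTF-8 sequence (it is a continuation octet), so per the description above it should yield 0 in ignore mode.

```scheme
(use gauche.unicode)

(utf8-length #x41)          ; ⇒ 1  (ASCII)
(utf8-length #xce)          ; ⇒ 2  (lead octet of U+03BB, encoded as 206 187)
(utf8-length #xe3)          ; ⇒ 3  (lead octet of U+3042, encoded as 227 129 130)
(utf8-length #xf0)          ; ⇒ 4

;; A continuation octet cannot start a sequence.
(utf8-length #x80 'ignore)  ; ⇒ 0
```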

Function: utf8->ucs4 octet-list :optional strictness

Takes a list of octets, and decodes it as a utf-8 sequence. Returns two values: The decoded ucs4 codepoint, and the rest of the input list.

An invalid utf-8 sequence causes an error if strictness is strict, or is skipped if it is ignore. If strictness is permissive, the procedure accepts sequences allowed by the original utf-8 definition, which can produce codepoints in the surrogate range (between #xd800 and #xdfff) and in the range between #x110000 and #x7fffffff. An invalid octet sequence is still an error in permissive mode.
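
Because utf8->ucs4 returns the remaining input as its second value, it can be called repeatedly to decode a whole list. The following is a minimal sketch in the default strict mode; utf8-octets->codepoints is a hypothetical helper name, not part of this module.

```scheme
(use gauche.unicode)

;; Hypothetical helper: decode a whole list of UTF-8 octets (strict mode).
(define (utf8-octets->codepoints octets)
  (let loop ((octets octets) (acc '()))
    (if (null? octets)
        (reverse acc)
        ;; cp is the decoded codepoint, rest is the undecoded tail.
        (receive (cp rest) (utf8->ucs4 octets)
          (loop rest (cons cp acc))))))

(utf8-octets->codepoints '(206 187 227 129 130))  ; ⇒ (955 12354)
```

The result is the list of codepoints #x3bb and #x3042, matching the ucs4->utf8 examples above in reverse.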

Function: utf8->string u8vector :optional start end

[R7RS] Converts a sequence of utf8 octets in u8vector to a string. Optional start and/or end argument(s) will limit the range of the input.

If Gauche’s native encoding is utf8, u8vector->string (see Uniform vector conversion) will do the job faster; but this routine can be used regardless of Gauche’s native encoding, and it raises an error if u8vector contains octet sequences that are illegal as utf8.

Function: string->utf8 string :optional start end

[R7RS] Converts a string to a u8vector of utf8 octets. Optional start and/or end argument(s) will limit the range of the input.

If Gauche’s native encoding is utf8, string->u8vector (see Uniform vector conversion) will do the job faster; but this routine can be used regardless of Gauche’s native encoding.
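
A short sketch of both conversions, using only ASCII input so that the result does not depend on the internal encoding. It assumes that start is inclusive and end is exclusive, as with other R7RS range arguments.

```scheme
(use gauche.unicode)

(string->utf8 "ABC")                    ; ⇒ #u8(65 66 67)
(utf8->string '#u8(65 66 67))           ; ⇒ "ABC"

;; Optional start/end select a sub-range of the input.
(utf8->string '#u8(65 66 67 68) 1 3)    ; ⇒ "BC"
(string->utf8 "ABCD" 1 3)               ; ⇒ #u8(66 67)
```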

Function: ucs4->utf16 codepoint :optional strictness

Takes an integer codepoint and returns a list of integers that encodes the input in UTF-16. The output is either one integer or two integers, and each integer is in the range between 0 and 65535 (inclusive).

If strictness is strict (default), input codepoints between #xd800 and #xdfff, and those at or beyond #x110000, are rejected. If strictness is permissive, it accepts high surrogates and low surrogates, in which case the result is a single-element list containing the input. If strictness is ignore, an empty list is returned for an invalid codepoint (including surrogates).

Function: utf16-length code :optional strictness

Code must be an integer between 0 and 65535, inclusive. Returns 1 if code is a BMP character codepoint, or 2 if code is a high surrogate codepoint.

If strictness is strict (default), an error is signalled if code is a low surrogate, or it is out of range. If strictness is permissive, 1 is returned for low surrogates, but an error is signalled for out of range arguments. If strictness is ignore, 0 is returned for low surrogates and out of range arguments.

Function: utf16->ucs4 code-list :optional strictness

Takes a list of integers and decodes it as a utf-16 sequence. Returns two values: The decoded ucs4 codepoint, and the rest of the input list.

If strictness is strict (default), an invalid utf-16 sequence and out-of-range integer raise an error. If strictness is permissive, an out-of-range integer causes an error, but a lone surrogate is allowed and returned as is. If strictness is ignore, lone surrogates and out-of-range integers are just ignored.
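
The three UTF-16 procedures can be tried together as below. This is a sketch in the default strict mode; U+1F600 is chosen here as an example of a codepoint outside the BMP, which needs a surrogate pair.

```scheme
(use gauche.unicode)

;; A BMP codepoint fits in a single 16-bit unit.
(ucs4->utf16 #x3042)    ; ⇒ (12354)

;; A codepoint above #xffff becomes a high/low surrogate pair.
(ucs4->utf16 #x1f600)   ; ⇒ (55357 56832), that is, (#xd83d #xde00)

;; The first unit tells how many units the character occupies.
(utf16-length #x3042)   ; ⇒ 1
(utf16-length #xd83d)   ; ⇒ 2

;; Decoding returns the codepoint and the undecoded tail.
(receive (cp rest) (utf16->ucs4 '(#xd83d #xde00 #x41))
  (list cp rest))       ; ⇒ (128512 (65))
```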



9.34.2 Unicode text segmentation

These procedures implement the grapheme-cluster and word breaking algorithms defined in UAX #29: Unicode Text Segmentation.

Function: string->words string
Function: codepoints->words sequence

From the given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of words. Each word is represented as a string, or as a sequence of the same type as the input, respectively.

(string->words "That's it.")
 ⇒ ("That's" " " "it" ".")
(codepoints->words '(84 104 97 116 39 115 32 105 116 46))
 ⇒ ((84 104 97 116 39 115) (32) (105 116) (46))

In the second example, the argument is the list of codepoints of the characters in "That's it."

Function: string->grapheme-clusters string
Function: codepoints->grapheme-clusters sequence

From the given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of grapheme clusters. Each cluster is represented as a string, or as a sequence of the same type as the input, respectively.
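
As a sketch, consider a base letter followed by a combining mark. Under the UAX #29 rules a combining mark does not start a new cluster, so the pair should stay together. The input (the letter a, U+0301 COMBINING ACUTE ACCENT, then the letter b) is an illustrative choice; codepoints are used so that the example does not depend on the internal encoding.

```scheme
(use gauche.unicode)

;; 97 = a, #x301 = COMBINING ACUTE ACCENT, 98 = b
(codepoints->grapheme-clusters '(97 #x301 98))
 ; ⇒ ((97 769) (98))

;; Plain ASCII letters are one cluster each.
(codepoints->grapheme-clusters '(97 98))
 ; ⇒ ((97) (98))
```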

The following procedures are the low-level building blocks of the above string->words etc. A generator argument is a procedure that takes no arguments and returns one value (or one set of values) at a time on every call, until it returns EOF.

Function: make-word-breaker generator
Function: make-grapheme-cluster-breaker generator

The given generator must be a generator of characters or codepoints. Returns a generator that returns two values: The first value is the character or codepoint generated from the original generator, and the second value is a boolean flag, which is #t if a word or a grapheme cluster breaks before the character/codepoint, and #f otherwise.

Suppose a generator g returns the characters in the string "That's it.", one at a time. Then the created generator will work as follows:

(define brk (make-word-breaker g))
(brk)  ⇒  #\T     and #t
(brk)  ⇒  #\h     and #f
(brk)  ⇒  #\a     and #f
(brk)  ⇒  #\t     and #f
(brk)  ⇒  #\'     and #f
(brk)  ⇒  #\s     and #f
(brk)  ⇒  #\space and #t
(brk)  ⇒  #\i     and #t
(brk)  ⇒  #\t     and #f
(brk)  ⇒  #\.     and #t
(brk)  ⇒  #<eof>  and #t

This shows that the word breaks are at the character boundaries marked by the caret ^ below (for clarity, _ is used to indicate the space).

  T h a t ' s _ i t .
 ^           ^ ^   ^ ^
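
The trace above can be reproduced with a small loop. This is a minimal sketch: string->char-generator and word-break-offsets are hypothetical helper names, and the generator is built from a string port with read-char, which returns EOF at the end of input.

```scheme
(use gauche.unicode)

;; Hypothetical helper: a generator that yields the characters of STR.
(define (string->char-generator str)
  (let ((port (open-input-string str)))
    (lambda () (read-char port))))

;; Hypothetical helper: collect the offsets at which a word break occurs.
(define (word-break-offsets str)
  (let ((brk (make-word-breaker (string->char-generator str))))
    (let loop ((index 0) (acc '()))
      (receive (ch break?) (brk)
        (let ((acc (if break? (cons index acc) acc)))
          (if (eof-object? ch)
              (reverse acc)
              (loop (+ index 1) acc)))))))

(word-break-offsets "That's it.")  ; ⇒ (0 6 7 9 10)
```

The offsets correspond to the carets in the diagram, with 10 being the break reported at EOF.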

Function: make-word-reader generator return
Function: make-grapheme-cluster-reader generator return

The input generator is a generator of characters or codepoints, and return is a procedure that takes a list of characters or codepoints and returns an object. These procedures create a generator that returns one object at a time, each consisting of a word or a grapheme cluster, respectively.

Suppose again that a generator g returns the characters in the string "That's it.", one at a time. Then the created generator works as follows:

(define brk (make-word-reader g list->string))
(brk)  ⇒  "That's"
(brk)  ⇒  " "
(brk)  ⇒  "it"
(brk)  ⇒  "."
(brk)  ⇒  #<eof>
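
A self-contained version of this example might look like the following sketch. string->char-generator and drain are hypothetical helper names; the second call shows that return may be any procedure that accepts the list of characters.

```scheme
(use gauche.unicode)

;; Hypothetical helper: a generator that yields the characters of STR.
(define (string->char-generator str)
  (let ((port (open-input-string str)))
    (lambda () (read-char port))))

;; Hypothetical helper: call GEN until EOF and collect the results.
(define (drain gen)
  (let loop ((acc '()))
    (let ((item (gen)))
      (if (eof-object? item)
          (reverse acc)
          (loop (cons item acc))))))

;; Each word is turned into a string by list->string.
(drain (make-word-reader (string->char-generator "That's it.")
                         list->string))
 ; ⇒ ("That's" " " "it" ".")

;; Here each word is reduced to its length instead.
(drain (make-word-reader (string->char-generator "That's it.")
                         length))
 ; ⇒ (6 1 2 1)
```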


9.34.3 Full string case conversion

Function: string-upcase string
Function: string-downcase string
Function: string-titlecase string
Function: string-foldcase string

[R6RS][R7RS] Convert the given string to upper case, lower case, title case, or case-folded form, respectively, using the language-independent full case conversion defined by the Unicode standard. They differ from srfi-13’s procedures with the same names (see String case mapping), which simply use character-by-character case mapping. Notably, the length of the resulting string may differ from that of the source string, and some conversions are sensitive to whether a character is at a word boundary. Word boundaries are determined according to the UAX #29 text segmentation rules.

(string-upcase "straße")
 ⇒ "STRASSE"
(string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.")
 ⇒ "χαοσχαοσ.χαος. σ."
(string-titlecase "You're talking about R6RS, right?")
 ⇒ "You're Talking About R6rs, Right?"
(string-foldcase "straße")
 ⇒ "strasse"
(string-foldcase "ΧΑΟΣΣ")
 ⇒ "χαοσσ"

Function: codepoints-upcase sequence
Function: codepoints-downcase sequence
Function: codepoints-titlecase sequence
Function: codepoints-foldcase sequence

Like string-upcase etc., but these work on a sequence of codepoints instead. Returns a sequence of the same type as the input.

(codepoints-upcase '#(115 116 114 97 223 101))
 ⇒ #(83 84 82 65 83 83 69)

Function: string-ci=? string1 string2 string3 …
Function: string-ci<? string1 string2 string3 …
Function: string-ci<=? string1 string2 string3 …
Function: string-ci>? string1 string2 string3 …
Function: string-ci>=? string1 string2 string3 …

[R7RS] Case-insensitive string comparison, using full-string case conversion.

Note that Gauche has builtin string-ci=? etc., which use character-wise case folding (see String comparison). These are different procedures.

(string-ci=? "\u00df" "SS") ⇒ #t
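
The result above is what one would expect from comparing the case-folded forms of the arguments. The following sketch spells that out with string-foldcase; caseless=? is a hypothetical helper name, and this is an illustration of the idea rather than a description of how string-ci=? is implemented.

```scheme
(use gauche.unicode)

;; Hypothetical helper: compare two strings after full case folding.
(define (caseless=? a b)
  (string=? (string-foldcase a) (string-foldcase b)))

;; "\u00df" is the German sharp s, which folds to "ss".
(caseless=? "stra\u00dfe" "STRASSE")  ; ⇒ #t
(caseless=? "stra\u00dfe" "STRASE")   ; ⇒ #f
```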
