9.37 gauche.unicode - Unicode utilities

Module: gauche.unicode

This module provides various operations on a sequence of Unicode codepoints.

Gauche used to support a native encoding other than Unicode, and the full Unicode-compatible behavior on characters and strings may not have been available on such systems. So we provide most operations in two flavors: Operations on characters and strings, or operations on codepoints represented as a sequence of integers.


9.37.1 Unicode transfer encodings

The procedures in this group operate on codepoints represented as integers. In the following descriptions, an octet refers to an integer between 0 and 255, inclusive.

They take an optional strictness argument. It specifies what to do when the procedure encounters a datum outside of the defined domain. Its value can be one of the following symbols:

strict

Raises an error when the procedure encounters such input. This is the default behavior.

permissive

Whenever possible, treats the datum as if it were a valid value. For example, a codepoint value beyond #x10ffff is invalid in the Unicode standard, but it may be useful for applications that just want to use UTF-8 as an encoding scheme for binary data.

replace

If the procedure sees invalid input, it replaces the input with the Unicode replacement character U+FFFD and proceeds.

ignore

Whenever possible, treats the invalid input as if it does not exist.

The procedure may still raise an error in permissive, replace or ignore strictness mode, if there can’t be a sensible way to handle the input data.

Function: ucs4->utf8 codepoint :optional strictness

{gauche.unicode} Takes an integer codepoint and returns a list of octets that encodes the input in UTF-8.

(ucs4->utf8 #x3bb)  ⇒ (206 187)
(ucs4->utf8 #x3042) ⇒ (227 129 130)

If strictness is strict (default), input codepoints between #xd800 and #xdfff, and those beyond #x10ffff, are rejected. If strictness is replace, such input yields the utf8 sequence #xef, #xbf, #xbd, which encodes U+FFFD. If strictness is permissive, it accepts input between 0 and #x7fffffff, inclusive; it may produce 5 or 6 octets if the input is large (as in the original UTF-8 definition). If strictness is ignore, it returns an empty list for invalid codepoints.
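
The strictness modes can be compared side by side on a single invalid input. The following is a small sketch, not part of the original manual; the variable names are illustrative only, and a lone surrogate is used as the out-of-domain datum.

```scheme
(use gauche.unicode)

;; #xd800 is a surrogate, so it is outside the domain of strict UTF-8.
(define lone-surrogate #xd800)

;; strict (the default) raises an error; guard turns it into a symbol here.
(define strict-result
  (guard (e (else 'rejected))
    (ucs4->utf8 lone-surrogate)))           ; ⇒ rejected

;; replace yields the encoding of U+FFFD.
(define replace-result
  (ucs4->utf8 lone-surrogate 'replace))     ; ⇒ (239 191 189)

;; ignore drops the invalid codepoint.
(define ignore-result
  (ucs4->utf8 lone-surrogate 'ignore))      ; ⇒ ()
```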

Function: utf8-length octet :optional strictness

{gauche.unicode} Takes octet as the first octet of UTF-8 sequence, and returns the number of total octets required to decode the codepoint.

If octet is not an exact integer between 0 and 255 (inclusive), an error is thrown, regardless of strictness argument.

If strictness is strict (default), this procedure returns either 1, 2, 3 or 4. An error is thrown if octet cannot be a leading octet of a proper UTF-8 encoded Unicode codepoint.

If strictness is permissive or replace, this procedure may return an integer between 0 and 6, inclusive. If the input is from #xf8 to #xfd, inclusive, this returns 5 or 6, according to the original utf-8 spec (these values correspond to the codepoint range #x110000 to #x7fffffff). If the input is in the range between #x80 and #xbf, inclusive, or is #xfe or #xff, this procedure returns 1; it's up to the application how to treat these illegal octets.

If strictness is ignore, this procedure returns 0 when it would raise an error if strictness is strict. Other than that, it works the same as the default case.
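
As a brief illustration of the cases described above (a sketch with illustrative variable names, not taken from the original manual):

```scheme
(use gauche.unicode)

;; Leading octets of 1-, 2-, 3- and 4-octet sequences.
(define strict-lengths
  (map utf8-length '(#x41 #xce #xe3 #xf0)))                      ; ⇒ (1 2 3 4)

;; #x80 is a continuation octet, so it cannot start a sequence.
(define continuation-permissive (utf8-length #x80 'permissive))  ; ⇒ 1
(define continuation-ignore     (utf8-length #x80 'ignore))      ; ⇒ 0
(define continuation-strict
  (guard (e (else 'rejected))
    (utf8-length #x80)))                                         ; ⇒ rejected
```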

Function: utf8->ucs4 octet-list :optional strictness

{gauche.unicode} Takes a list of octets, and decodes it as a utf-8 sequence. Returns two values: The decoded ucs4 codepoint, and the rest of the input list.

If it finds a value other than exact integer between 0 and 255 in the input list, an error is thrown regardless of the value of strictness.

An invalid utf8 sequence causes an error if strictness is strict, or is skipped if it is ignore. If strictness is replace, such a utf8 sequence yields U+FFFD. If strictness is permissive, the procedure accepts the original utf-8 sequences, which can produce codepoints in the surrogate range (between #xd800 and #xdfff) and in the range between #x110000 and #x7fffffff. An invalid octet sequence is still an error in permissive mode.
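
Because utf8->ucs4 returns the undecoded remainder as its second value, a whole octet list can be decoded with a simple loop. The helper below, utf8-octets->codepoints, is a hypothetical name introduced only for this sketch; it uses the default strict mode, so invalid input raises an error.

```scheme
(use gauche.unicode)

;; Decode every codepoint in a list of octets (strict mode).
(define (utf8-octets->codepoints octets)
  (let loop ((octets octets) (acc '()))
    (if (null? octets)
        (reverse acc)
        ;; cp is one decoded codepoint, rest is the remaining input.
        (receive (cp rest) (utf8->ucs4 octets)
          (loop rest (cons cp acc))))))

(define decoded
  (utf8-octets->codepoints '(206 187 227 129 130)))  ; ⇒ (955 12354)
```

The result corresponds to #x3bb and #x3042, the inverse of the ucs4->utf8 examples.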

Function: utf8->string u8vector :optional start end

[R7RS base] {gauche.unicode} Converts a sequence of utf8 octets in u8vector to a string. Optional start and/or end argument(s) will limit the range of the input.

If Gauche’s native encoding is utf8, this procedure first tries u8vector->string (see Uniform vectors). If the input utf8 sequence is valid, this is the fastest way. If the input contains an invalid utf8 sequence, the procedure falls back to constructing a string one character at a time, replacing each invalid sequence with the Unicode replacement character U+FFFD. Hence it always returns a complete string. To know if the input contains an invalid utf8 sequence, you can use u8vector->string directly.

If Gauche’s native encoding is other than utf8, there’s no U+FFFD so invalid utf8 sequence throws an error.

Function: string->utf8 string :optional start end

[R7RS base] {gauche.unicode} Converts a string to a u8vector of utf8 octets. Optional start and/or end argument(s) will limit the range of the input.

If Gauche’s native encoding is utf8, this procedure just calls string->u8vector (see Uniform vectors). Otherwise, it first converts the input string to utf-8, then string->u8vector is called.
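
A round trip between the two procedures might look as follows. This is a sketch that assumes the native encoding is utf8; the variable names are illustrative.

```scheme
(use gauche.unicode)

(define octets (string->utf8 "λあ"))      ; ⇒ #u8(206 187 227 129 130)
(define text   (utf8->string octets))     ; ⇒ "λあ"

;; start and end of utf8->string index octets of the u8vector, not characters.
(define tail   (utf8->string octets 2))   ; ⇒ "あ"
```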

Function: ucs4->utf16 codepoint :optional strictness

{gauche.unicode} Takes an integer codepoint and returns a list of integers that encodes the input in UTF-16.

If strictness is strict (default), the input must be either between 0 and #xd7ff or between #xe000 and #x10ffff. An error is thrown otherwise. The ’hole’ is the range of codepoints reserved for surrogates, and no valid mapping from them to utf-16 is defined.

If strictness is replace, such input is replaced with #xfffd, which encodes Unicode replacement character.

If strictness is permissive, it accepts high surrogates and low surrogates, in which case the result is a single-element list containing the input. An error is still thrown for negative input and input greater than or equal to #x110000.

If strictness is ignore, an empty list is returned for an invalid codepoint (including surrogates).

Note: We can encode values larger than #x10ffff in utf-8 in the permissive mode, but not in utf-16.
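
The following sketch shows a BMP codepoint, a codepoint that needs a surrogate pair, and the treatment of a surrogate in each non-strict mode. The variable names are illustrative only.

```scheme
(use gauche.unicode)

(define bmp    (ucs4->utf16 #x3bb))     ; ⇒ (#x3bb)
(define astral (ucs4->utf16 #x1f600))   ; ⇒ (#xd83d #xde00), a surrogate pair

;; A surrogate given as input, under each non-strict mode.
(define surrogate-permissive (ucs4->utf16 #xd800 'permissive))  ; ⇒ (#xd800)
(define surrogate-replace    (ucs4->utf16 #xd800 'replace))     ; ⇒ (#xfffd)
(define surrogate-ignore     (ucs4->utf16 #xd800 'ignore))      ; ⇒ ()
```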

Function: utf16-length code :optional strictness

{gauche.unicode} Code must be an exact integer between 0 and 65535, inclusive. Returns 1 if code is a BMP character codepoint, or 2 if code is a high surrogate.

If strictness is strict (default), an error is signalled if code is a low surrogate, or it is out of range. If strictness is permissive or replace, 1 is returned for low surrogates, but an error is signalled for out of range arguments. If strictness is ignore, 0 is returned for low surrogates and out of range arguments.

Function: utf16->ucs4 code-list :optional strictness

{gauche.unicode} Takes a list of exact integers and decodes it as a utf-16 sequence. Returns two values: The decoded ucs4 codepoint, and the rest of input list.

If strictness is strict (default), an invalid utf-16 sequence and out-of-range integer raise an error. If strictness is permissive, an out-of-range integer causes an error, but a lone surrogate is allowed and returned as is. If strictness is replace, a lone surrogate is replaced with U+FFFD. If strictness is ignore, lone surrogates and out-of-range integers are just ignored.
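
A short sketch of decoding, with illustrative variable names. The first call decodes a surrogate pair and leaves one code unit in the rest; the second shows replace mode on a high surrogate that is not followed by a low surrogate.

```scheme
(use gauche.unicode)

;; Number of code units implied by the first code unit.
(define unit-counts (map utf16-length '(#x41 #xd83d)))   ; ⇒ (1 2)

;; Decode one codepoint; keep both return values in a list for inspection.
(define pair-result
  (receive (cp rest) (utf16->ucs4 '(#xd83d #xde00 #x41))
    (list cp rest)))                                     ; ⇒ (#x1f600 (#x41))

;; A lone high surrogate is replaced with U+FFFD in replace mode.
(define replaced
  (receive (cp rest) (utf16->ucs4 '(#xd83d #x41) 'replace)
    cp))                                                 ; ⇒ #xfffd
```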

Function: utf16->string u8vector :optional endian ignore-bom? start end
Function: utf32->string u8vector :optional endian ignore-bom? start end

{gauche.unicode} [R7RS scheme.bytevector] Convert a utf16 or utf32 sequence stored in u8vector to a string, respectively. For utf16->string, if the input contains an invalid utf16 sequence (an unpaired surrogate), it is replaced with the Unicode replacement character U+FFFD. If the number of input octets is not a multiple of the unit (2 octets for utf16, and 4 octets for utf32), an error is thrown.

The optional endian and ignore-bom? arguments determine whether the input is in UTF16BE/UTF32BE or UTF16LE/UTF32LE. If ignore-bom? is #f or omitted, the first two octets of input are examined; if they are a BOM, it determines the endianness regardless of the endian argument, and the BOM won’t be included in the output. If the input does not begin with a BOM, or ignore-bom? is true, then the endianness is determined by the endian argument: It can be big-endian or big for UTF16BE/UTF32BE, and little-endian, little or arm-little-endian for UTF16LE/UTF32LE (see Endianness, for the details of endianness).

Note that if ignore-bom? is given and true, the initial BOM is interpreted as a codepoint U+FEFF. If endian is #f or omitted, UTF16BE/UTF32BE is assumed (it is defined so in R7RS scheme.bytevector).

In R7RS scheme.bytevector, ignore-bom? argument is called endianness-mandatory. The behavior is the same.

The optional arguments start and end trim the input octet sequence before other processing (including BOM detection). These arguments are Gauche’s extension, and not part of R7RS scheme.bytevector.
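
The interaction of endian and ignore-bom? can be sketched as follows, assuming the native encoding is utf8. The variable names are illustrative; each input encodes U+03BB (λ).

```scheme
(use gauche.unicode)

;; No BOM and no endian argument: big-endian is assumed.
(define from-be  (utf16->string #u8(#x03 #xbb)))              ; ⇒ "λ"

;; No BOM, explicit little-endian.
(define from-le  (utf16->string #u8(#xbb #x03) 'little))      ; ⇒ "λ"

;; A little-endian BOM (#xff #xfe) decides the endianness and is dropped.
(define from-bom (utf16->string #u8(#xff #xfe #xbb #x03)))    ; ⇒ "λ"

;; With ignore-bom? true, the BOM is decoded as the character U+FEFF.
(define kept-bom-length
  (string-length (utf16->string #u8(#xfe #xff #x03 #xbb) 'big #t)))  ; ⇒ 2

(define from-utf32 (utf32->string #u8(0 0 #x03 #xbb)))        ; ⇒ "λ"
```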

Function: string->utf16 str :optional endian add-bom? start end
Function: string->utf32 str :optional endian add-bom? start end

{gauche.unicode} [R7RS scheme.bytevector] Encode a string str to utf-16 and utf-32 sequences stored in a u8vector, respectively.

The optional endian argument specifies whether the encoding is UTF16BE/UTF32BE or UTF16LE/UTF32LE. If it is a symbol big-endian or big, the encoding is UTF16BE/UTF32BE. If it is a symbol little-endian, little or arm-little-endian, the encoding is UTF16LE/UTF32LE. See Endianness, for the details of endianness. If it is omitted or #f, UTF16BE/UTF32BE is assumed.

If a true value is given to the second optional argument add-bom?, the output begins with a BOM. When it is omitted, no BOM is added.

The start and end arguments limit the range of str to be converted.

R7RS scheme.bytevector only defines endian optional argument. The rest is Gauche’s extension.
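
For illustration, here is the same one-character string encoded with different argument combinations. This is a sketch with illustrative variable names, assuming the native encoding is utf8.

```scheme
(use gauche.unicode)

(define be       (string->utf16 "λ"))            ; ⇒ #u8(3 187)
(define le       (string->utf16 "λ" 'little))    ; ⇒ #u8(187 3)
(define be+bom   (string->utf16 "λ" 'big #t))    ; ⇒ #u8(254 255 3 187)
(define utf32-be (string->utf32 "λ"))            ; ⇒ #u8(0 0 3 187)
```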


9.37.2 Unicode text segmentation

These procedures implement the grapheme-cluster and word breaking algorithms defined in UAX #29: Unicode Text Segmentation.

Function: string->words string
Function: codepoints->words sequence

{gauche.unicode} From given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of words. Each word is represented as a string, or a sequence of the same type as input, respectively.

(string->words "That's it.")
 ⇒ ("That's" " " "it" ".")
(codepoints->words '(84 104 97 116 39 115 32 105 116 46))
 ⇒ ((84 104 97 116 39 115) (32) (105 116) (46))
(codepoints->words '#(84 104 97 116 39 115 32 105 116 46))
 ⇒ (#(84 104 97 116 39 115) #(32) #(105 116) #(46))

In the second and third examples, the input is a sequence of codepoints of characters in "That’s it."

Function: string->grapheme-clusters string
Function: codepoints->grapheme-clusters sequence

{gauche.unicode} From given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of grapheme clusters. Each cluster is represented as a string, or a sequence of the same type as input, respectively.
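
A grapheme cluster may span several characters. The sketch below builds a string in which a base letter is followed by a combining mark; the variable names are illustrative, and the string is built with integer->char to keep the combining character visible in the source.

```scheme
(use gauche.unicode)

;; "e", U+0301 COMBINING ACUTE ACCENT, "a": three characters.
(define s (string #\e (integer->char #x301) #\a))

(define clusters (string->grapheme-clusters s))

(define char-count    (string-length s))     ; ⇒ 3
(define cluster-count (length clusters))     ; ⇒ 2

;; The combining mark stays attached to its base letter.
(define first-cluster-length
  (string-length (car clusters)))            ; ⇒ 2
```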

The following procedures are low-level building blocks to build the above string->words etc.

Function: make-word-breaker generator
Function: make-grapheme-cluster-breaker generator

{gauche.unicode} From given generator, which is a generator of characters or codepoints, returns a generator that returns two values: The first value is the character or codepoint generated from the original generator, and the second value is a boolean flag, which is #t if a word or a grapheme cluster breaks before the character/codepoint, and #f otherwise.

Suppose a generator g returns characters in a string That's it., one at a time. Then the created generator will work as follows:

(define brk (make-word-breaker g))
(brk)  ⇒  #\T     and #t
(brk)  ⇒  #\h     and #f
(brk)  ⇒  #\a     and #f
(brk)  ⇒  #\t     and #f
(brk)  ⇒  #\'     and #f
(brk)  ⇒  #\s     and #f
(brk)  ⇒  #\space and #t
(brk)  ⇒  #\i     and #t
(brk)  ⇒  #\t     and #f
(brk)  ⇒  #\.     and #t
(brk)  ⇒  #<eof>  and #t

It shows the word breaks at those character boundaries shown by the caret ^ below (for clarity, _ is used to indicate the space).

  T h a t ' s _ i t .
 ^           ^ ^   ^ ^
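
One way to obtain such a generator g and to collect the break positions is sketched below. It assumes string->generator from gauche.generator; the helper name word-break-positions is hypothetical and used only here.

```scheme
(use gauche.unicode)
(use gauche.generator)

;; Return the list of character indexes before which a word break occurs.
;; The index equal to the string length stands for the break at the end.
(define (word-break-positions str)
  (let ((brk (make-word-breaker (string->generator str))))
    (let loop ((i 0) (acc '()))
      (receive (ch break?) (brk)
        (let ((acc (if break? (cons i acc) acc)))
          (if (eof-object? ch)
              (reverse acc)
              (loop (+ i 1) acc)))))))

(define positions (word-break-positions "That's it."))  ; ⇒ (0 6 7 9 10)
```
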

Function: make-word-reader generator return
Function: make-grapheme-cluster-reader generator return

{gauche.unicode} The input generator is a generator of characters or codepoints, and return is a procedure that takes a list of characters or codepoints, and returns an object. These procedures create a generator that returns one object at a time, each of which consists of a word or a grapheme cluster, respectively.

Suppose a generator g returns characters in a string That's it., one at a time, again. Then the created generator works as follows:

(define brk (make-word-reader g list->string))
(brk)  ⇒  "That's"
(brk)  ⇒  " "
(brk)  ⇒  "it"
(brk)  ⇒  "."
(brk)  ⇒  #<eof>
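
Since the result is an ordinary generator, it can be combined with other generator utilities. A minimal sketch, assuming string->generator and generator->list from gauche.generator (the variable names are illustrative):

```scheme
(use gauche.unicode)
(use gauche.generator)

;; Collect all words of a string through the reader generator.
(define words
  (generator->list
   (make-word-reader (string->generator "That's it.") list->string)))
;; ⇒ ("That's" " " "it" ".")

;; Passing a different return procedure changes the shape of each item;
;; here each word is reported as its number of characters.
(define word-lengths
  (generator->list
   (make-word-reader (string->generator "That's it.") length)))
;; ⇒ (6 1 2 1)
```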

9.37.3 Full string case conversion

Function: string-upcase string
Function: string-downcase string
Function: string-titlecase string
Function: string-foldcase string

[R6RS][R7RS char][SRFI-129] {gauche.unicode} Converts the case of the given string using the language-independent full case folding defined by the Unicode standard. They differ from SRFI-13’s procedures with the same names (see String case mapping), which simply use character-by-character case mapping. Notably, the length of the resulting string may differ from the source string, and some conversions are sensitive to whether the character is at a word boundary or not. The word boundaries are determined according to the UAX #29 text segmentation rules.

(string-upcase "straße")
 ⇒ "STRASSE"
(string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.")
 ⇒ "χαοσχαοσ.χαος. σ."
(string-titlecase "You're talking about R6RS, right?")
 ⇒ "You're Talking About R6rs, Right?"
(string-foldcase "straße")
 ⇒ "strasse"
(string-foldcase "ΧΑΟΣΣ")
 ⇒ "χαοσσ"

Procedures string-upcase, string-downcase, and string-foldcase are also in R7RS scheme.char module.

Procedure string-titlecase is also defined in SRFI-129.

Function: codepoints-upcase sequence
Function: codepoints-downcase sequence
Function: codepoints-titlecase sequence
Function: codepoints-foldcase sequence

{gauche.unicode} Like string-upcase etc., but these work on a sequence of codepoints instead. Returns a sequence of the same type as the input.

(codepoints-upcase '#(115 116 114 97 223 101))
 ⇒ #(83 84 82 65 83 83 69)
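
The codepoint flavor can be connected back to strings by converting characters to integers and back. The sketch below assumes the native encoding is utf8, so that char->integer yields Unicode codepoints; the helper name is hypothetical.

```scheme
(use gauche.unicode)

;; Upcase a string by way of a list of codepoints.
(define (string-upcase-via-codepoints str)
  (list->string
   (map integer->char
        (codepoints-upcase (map char->integer (string->list str))))))

(define upcased (string-upcase-via-codepoints "straße"))  ; ⇒ "STRASSE"
```
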

Function: string-ci=? string1 string2 string3 …
Function: string-ci<? string1 string2 string3 …
Function: string-ci<=? string1 string2 string3 …
Function: string-ci>? string1 string2 string3 …
Function: string-ci>=? string1 string2 string3 …

[R7RS char] {gauche.unicode} Case-insensitive string comparison, using full-string case conversion.

Note that Gauche has builtin string-ci=? etc., which use character-wise case folding (see String comparison). These are different procedures.

(string-ci=? "\u00df" "SS") ⇒ #t

9.37.4 East Asian width property

Function: char-east-asian-width char-or-codepoint

{gauche.unicode} The argument may be a character or a nonnegative integer of Unicode codepoint. Returns one of the symbols N (neutral), F (fullwidth), H (halfwidth), W (wide), Na (narrow), and A (ambiguous).

The meaning of this property is explained in Unicode standard annex #11, http://unicode.org/reports/tr11/.
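
A few representative values, given as a sketch with illustrative variable names. The codepoints are LATIN SMALL LETTER A, HIRAGANA LETTER A, FULLWIDTH LATIN CAPITAL LETTER A and HALFWIDTH KATAKANA LETTER A.

```scheme
(use gauche.unicode)

;; The argument may be an integer codepoint ...
(define widths
  (map char-east-asian-width '(#x61 #x3042 #xff21 #xff71)))  ; ⇒ (Na W F H)

;; ... or a character.
(define ascii-width (char-east-asian-width #\a))             ; ⇒ Na
```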

Function: string-east-asian-width str :key F H W Na N A

{gauche.unicode} Computes the ’width’ of the given string str, taking each character’s East Asian Width into account. It gives you a rough estimate of how much space the string will occupy on the screen, when displayed with monospace fonts.

Although it is true that the exact width is generally undecidable without actually rendering the characters with glyphs, heuristics can give a good estimate if the string consists of limited scripts. For example, if the string consists of ASCII characters and CJK ideographs, counting ’Full-width’ and ’Wide’ letters as twice as wide as an ASCII character is likely to work.

This procedure assigns the following widths for each character’s East Asian Width category. They can be altered by giving keyword arguments F, H, W, Na, N, and A, respectively:

F (Full width)

2

H (Half width)

1

W (Wide)

2

Na (Narrow)

1

N (Neutral)

1

A (Ambiguous)

2

See UAX #11 (http://unicode.org/reports/tr11/), section 5, for more detailed discussions of computing the width of displayed strings.

(string-east-asian-width "abc") ⇒ 3
(string-east-asian-width "いろは") ⇒ 6
(string-east-asian-width "1番目" :W 1.5) ⇒ 4.0
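
The keyword arguments matter most for the Ambiguous category, whose display width depends on the context. As a sketch (variable names are illustrative), Greek small letters are classified as A in UAX #11, so they count as 2 by default and can be counted as 1 for a non-East-Asian context:

```scheme
(use gauche.unicode)

(define default-width (string-east-asian-width "αβγ"))        ; ⇒ 6
(define narrow-width  (string-east-asian-width "αβγ" :A 1))   ; ⇒ 3
```
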

Function: string-take-width str width :key F H W Na N A
Function: string-drop-width str width :key F H W Na N A

{gauche.unicode} Like string-take and string-drop (see srfi.13 - String library), these return a prefix and a suffix of the input string str, respectively, but use the string’s width instead of the number of characters.

string-take-width returns the longest prefix of a string str such that its string-east-asian-width does not exceed width. On the other hand, string-drop-width removes such prefix from str and returns the rest.

The keyword arguments are to customize mappings of East Asian Width categories to numerical widths. See string-east-asian-width above for the details.
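
For example, with the default widths each hiragana letter counts as 2, so a limit of 5 columns admits only two of them. The helper fit-to-width below is a hypothetical name for this sketch; it truncates a string to a column limit and pads the remainder with spaces, which can be handy when laying out table cells.

```scheme
(use gauche.unicode)

(define head (string-take-width "いろはabc" 5))   ; ⇒ "いろ"    (width 4)
(define tail (string-drop-width "いろはabc" 5))   ; ⇒ "はabc"

;; Truncate str to at most width columns, then pad with spaces.
;; Assumes the default integer widths.
(define (fit-to-width str width)
  (let* ((s   (string-take-width str width))
         (pad (- width (string-east-asian-width s))))
    (string-append s (make-string pad #\space))))

(define cell (fit-to-width "いろはabc" 5))        ; ⇒ "いろ " (one padding space)
```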


