gauche.unicode - Unicode utilities
This module provides various operations on a sequence of Unicode codepoints.
Gauche used to support a native encoding other than Unicode, and the full Unicode-compatible behavior on characters and strings may not have been available on such systems. So we provide most operations in two flavors: Operations on characters and strings, or operations on codepoints represented as a sequence of integers.
• Unicode transfer encodings
• Unicode text segmentation
• Full string case conversion
• East asian width property
The procedures in this group operate on codepoints represented as integers. In the following descriptions, an octet refers to an integer between 0 and 255, inclusive.
They take an optional strictness argument. It specifies what to do when the procedure encounters a datum outside of the defined domain. Its value can be one of the following symbols:
strict
Raises an error when the procedure encounters such input. This is the default behavior.
permissive
Whenever possible, treats the datum as if it were a valid value. For example, a codepoint value beyond #x10ffff is invalid in the Unicode standard, but it may be useful for other purposes that just want to use UTF-8 as an encoding scheme for binary data.
replace
If the procedure sees invalid input, it replaces the input with the Unicode replacement character U+FFFD and proceeds.
ignore
Whenever possible, treats the invalid input as if it did not exist.
The procedure may still raise an error in permissive, replace or ignore strictness mode if there is no sensible way to handle the input data.
{gauche.unicode}
Takes an integer codepoint and returns a list of octets that encodes the input in UTF-8.
(ucs4->utf8 #x3bb) ⇒ (206 187)
(ucs4->utf8 #x3042) ⇒ (227 129 130)
If strictness is strict (default), input codepoints between #xd800 and #xdfff, and those beyond #x10ffff, are rejected. If strictness is replace, such input yields the UTF-8 sequence #xef, #xbf, #xbd, which encodes U+FFFD.
If strictness is permissive, it accepts input between 0 and #x7fffffff, inclusive; it may produce 5 or 6 octets if the input is large (as in the original UTF-8 definition).
If strictness is ignore, it returns an empty list for invalid codepoints.
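The strictness modes described above can be compared side by side on a single invalid input. The following is a small sketch rather than an exhaustive test; it assumes the strictness symbol is passed as the second, optional argument of ucs4->utf8, and the expected values are the ones stated in the text.

```scheme
(use gauche.unicode)

;; A valid codepoint encodes the same way in every mode.
(ucs4->utf8 #x3bb)            ; ⇒ (206 187)

;; #xd800 is a surrogate, so it lies outside the valid domain.
(ucs4->utf8 #xd800 'replace)  ; ⇒ (239 191 189), the encoding of U+FFFD
(ucs4->utf8 #xd800 'ignore)   ; ⇒ ()

;; In the default strict mode the same input raises an error,
;; which is trapped here with guard.
(guard (e [else 'rejected])
  (ucs4->utf8 #xd800))        ; ⇒ rejected
```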
{gauche.unicode}
Takes octet as the first octet of a UTF-8 sequence, and returns the total number of octets required to decode the codepoint.
If octet is not an exact integer between 0 and 255 (inclusive), an error is thrown, regardless of the strictness argument.
If strictness is strict (default), this procedure returns either 1, 2, 3 or 4. An error is thrown if octet cannot be a leading octet of a properly UTF-8 encoded Unicode codepoint.
If strictness is permissive or replace, this procedure may return an integer between 0 and 6, inclusive. If the input is from #xf8 to #xfd, inclusive, this returns 5 or 6, according to the original UTF-8 spec (these values correspond to the codepoint range #x110000 to #x7fffffff). If the input is in the range between #x80 and #xbf, inclusive, or is #xfe or #xff, this procedure returns 1; it is up to the application how to treat these illegal octets.
If strictness is ignore, this procedure returns 0 where it would raise an error if strictness were strict. Other than that, it works the same as the default case.
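As an illustration of the rules above, the sketch below probes a few leading octets. The procedure name utf8-length is an assumption here, since the entry heading is not shown in this text; the expected values are derived from the description.

```scheme
(use gauche.unicode)

;; NOTE: utf8-length is the assumed name of the procedure described above.
(utf8-length #x41)              ; ⇒ 1  (ASCII)
(utf8-length #xce)              ; ⇒ 2  (first octet of U+03BB)
(utf8-length #xe3)              ; ⇒ 3  (first octet of U+3042)
(utf8-length #xf0)              ; ⇒ 4

;; #xf8 starts a 5-octet sequence only in the original UTF-8 definition.
(utf8-length #xf8 'permissive)  ; ⇒ 5

;; #x80 is a continuation octet, never a leading one.
(utf8-length #x80 'permissive)  ; ⇒ 1
(utf8-length #x80 'ignore)      ; ⇒ 0
```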
{gauche.unicode}
Takes a list of octets, and decodes it as a UTF-8 sequence. Returns two values: the decoded UCS-4 codepoint, and the rest of the input list.
If it finds a value other than an exact integer between 0 and 255 in the input list, an error is thrown regardless of the value of strictness.
An invalid UTF-8 sequence causes an error if strictness is strict, or is skipped if it is ignore. If strictness is replace, such a sequence yields U+FFFD.
If strictness is permissive, the procedure accepts the original UTF-8 sequences, which can produce codepoints in the surrogate range (between #xd800 and #xdfff) and in the range between #x110000 and #x7fffffff. An invalid octet sequence is still an error in permissive mode.
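Because the procedure returns the remaining octets as its second value, it can be called in a loop to decode a whole list. The sketch below assumes the procedure is named utf8->ucs4 (the entry heading is not shown in this text); utf8-octets->codepoints is a hypothetical helper introduced only for illustration, and it uses the default strict mode.

```scheme
(use gauche.unicode)

;; Hypothetical helper: decode every codepoint in a list of octets.
;; NOTE: utf8->ucs4 is the assumed name of the procedure described above.
(define (utf8-octets->codepoints octets)
  (let loop ((octets octets) (acc '()))
    (if (null? octets)
        (reverse acc)
        ;; cp is the decoded codepoint, rest is the undecoded tail.
        (receive (cp rest) (utf8->ucs4 octets)
          (loop rest (cons cp acc))))))

;; U+03BB followed by U+3042
(utf8-octets->codepoints '(206 187 227 129 130))  ; ⇒ (955 12354)
```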
[R7RS base]
{gauche.unicode}
Converts a sequence of UTF-8 octets in u8vector to a string. The optional start and end arguments limit the range of the input.
If Gauche’s native encoding is UTF-8, this procedure first tries u8vector->string (see Uniform vectors). If the input UTF-8 sequence is valid, this is the fastest way. If the input contains an invalid UTF-8 sequence, the procedure falls back to constructing the string one character at a time, replacing each invalid sequence with the Unicode replacement character U+FFFD. Hence it always returns a complete string. To find out whether the input contains an invalid UTF-8 sequence, you can use u8vector->string directly.
If Gauche’s native encoding is other than UTF-8, there’s no U+FFFD, so an invalid UTF-8 sequence throws an error.
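A short sketch of the conversion, assuming the procedure is the R7RS utf8->string (the entry heading is not shown in this text) and that Gauche's native encoding is UTF-8. The start and end arguments are taken to be octet indexes into the u8vector, as in R7RS.

```scheme
(use gauche.unicode)

;; NOTE: utf8->string is the assumed name of the procedure described above.
;; 206 187 is the UTF-8 encoding of U+03BB (λ).
(utf8->string #u8(206 187))            ; ⇒ "λ"

;; start = 1 and end = 3 select only the two octets of λ.
(utf8->string #u8(97 206 187 98) 1 3)  ; ⇒ "λ"
```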
[R7RS base]
{gauche.unicode}
Converts a string to a u8vector of UTF-8 octets. The optional start and end arguments limit the range of the input.
If Gauche’s native encoding is UTF-8, this procedure just calls string->u8vector (see Uniform vectors). Otherwise, it first converts the input string to UTF-8, then calls string->u8vector.
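The reverse direction can be sketched the same way, assuming the procedure is the R7RS string->utf8 (the entry heading is not shown in this text). Here start and end are taken to be character indexes into the string, as in R7RS.

```scheme
(use gauche.unicode)

;; NOTE: string->utf8 is the assumed name of the procedure described above.
(string->utf8 "λ")        ; ⇒ #u8(206 187)

;; Characters 1 (inclusive) to 2 (exclusive), i.e. just "λ".
(string->utf8 "aλb" 1 2)  ; ⇒ #u8(206 187)
```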
{gauche.unicode}
Takes an integer codepoint and returns a list of integers that encodes the input in UTF-16.
If strictness is strict (default), the input must be either between 0 and #xd7ff or between #xe000 and #x10ffff. An error is thrown otherwise. The ’hole’ is the range of codepoints reserved for surrogates, and no valid mapping from them to UTF-16 is defined.
If strictness is replace, such input is replaced with #xfffd, which encodes the Unicode replacement character.
If strictness is permissive, it accepts high surrogates and low surrogates, in which case the result is a single-element list containing the input. An error is still thrown for negative input and input greater than or equal to #x110000.
If strictness is ignore, an empty list is returned for an invalid codepoint (including surrogates).
Note: We can encode values larger than #x10ffff in UTF-8 in the permissive mode, but not in UTF-16.
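The following sketch shows a BMP codepoint, a codepoint that needs a surrogate pair, and a surrogate under each non-strict mode. It assumes the procedure is named ucs4->utf16 (the entry heading is not shown in this text); the surrogate pair for U+1F600 follows from the standard UTF-16 arithmetic.

```scheme
(use gauche.unicode)

;; NOTE: ucs4->utf16 is the assumed name of the procedure described above.
(ucs4->utf16 #x3bb)                ; ⇒ (955)          one code unit
(ucs4->utf16 #x1f600)              ; ⇒ (55357 56832)  i.e. #xd83d #xde00

;; #xd800 is a high surrogate, which has no UTF-16 mapping of its own.
(ucs4->utf16 #xd800 'permissive)   ; ⇒ (55296)  passed through as is
(ucs4->utf16 #xd800 'replace)      ; ⇒ (65533)  i.e. #xfffd
(ucs4->utf16 #xd800 'ignore)       ; ⇒ ()
```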
{gauche.unicode}
Code must be an exact integer between 0 and 65535, inclusive. Returns 1 if code is a BMP character codepoint, or 2 if code is a high surrogate.
If strictness is strict (default), an error is signalled if code is a low surrogate, or if it is out of range.
If strictness is permissive or replace, 1 is returned for low surrogates, but an error is signalled for out-of-range arguments.
If strictness is ignore, 0 is returned for low surrogates and out-of-range arguments.
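A brief sketch of these cases, assuming the procedure is named utf16-length (the entry heading is not shown in this text). The expected values are taken from the description above.

```scheme
(use gauche.unicode)

;; NOTE: utf16-length is the assumed name of the procedure described above.
(utf16-length #x3bb)               ; ⇒ 1  BMP codepoint
(utf16-length #xd83d)              ; ⇒ 2  high surrogate, one more unit follows

;; #xde00 is a low surrogate, which cannot start a sequence.
(utf16-length #xde00 'permissive)  ; ⇒ 1
(utf16-length #xde00 'ignore)      ; ⇒ 0
```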
{gauche.unicode}
Takes a list of exact integers and decodes it as a UTF-16 sequence. Returns two values: the decoded UCS-4 codepoint, and the rest of the input list.
If strictness is strict (default), an invalid UTF-16 sequence or an out-of-range integer raises an error. If strictness is permissive, an out-of-range integer causes an error, but a lone surrogate is allowed and returned as is. If strictness is replace, a lone surrogate is replaced with U+FFFD. If strictness is ignore, lone surrogates and out-of-range integers are just ignored.
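As a sketch of the two-value return, the example below decodes a surrogate pair followed by one more code unit. The procedure name utf16->ucs4 is an assumption, since the entry heading is not shown in this text.

```scheme
(use gauche.unicode)

;; NOTE: utf16->ucs4 is the assumed name of the procedure described above.
;; #xd83d #xde00 is the surrogate pair for U+1F600; #x41 is left over.
(receive (cp rest) (utf16->ucs4 '(#xd83d #xde00 #x41))
  (list cp rest))                  ; ⇒ (128512 (65))

;; A BMP code unit consumes a single element.
(receive (cp rest) (utf16->ucs4 '(#x3bb #x41))
  (list cp rest))                  ; ⇒ (955 (65))
```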
{gauche.unicode}
[R7RS scheme.bytevector]
Convert a UTF-16 or UTF-32 sequence stored in u8vector to a string, respectively.
For utf16->string, if the input contains an invalid UTF-16 sequence (an unpaired surrogate), it is replaced with the Unicode replacement character U+FFFD.
If the number of input octets is not a multiple of the unit (2 octets for UTF-16, and 4 octets for UTF-32), an error is thrown.
The optional endian and ignore-bom? arguments determine whether the input is in UTF16BE/UTF32BE or UTF16LE/UTF32LE. If ignore-bom? is #f or omitted, the first two octets of the input are examined; if they are a BOM, it determines the endianness regardless of the endian argument, and the BOM won’t be included in the output. If the input does not begin with a BOM, or ignore-bom? is true, then the endianness is determined by the endian argument: it can be big-endian or big for UTF16BE/UTF32BE, and little-endian, little or arm-little-endian for UTF16LE/UTF32LE (see Endianness, for the details of endianness). Note that if ignore-bom? is given and true, the initial BOM is interpreted as the codepoint U+FEFF.
If endian is #f or omitted, UTF16BE/UTF32BE is assumed (it is defined so in R7RS scheme.bytevector).
In R7RS scheme.bytevector, the ignore-bom? argument is called endianness-mandatory. The behavior is the same.
The optional arguments start and end trim the input octet sequence before other processing (including BOM detection). These arguments are Gauche’s extension, and not part of R7RS scheme.bytevector.
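The interaction of the endian argument and the BOM can be sketched as follows. The name utf32->string is an assumption (only utf16->string is named in the text above), and the argument order endian, then ignore-bom? is taken from the description.

```scheme
(use gauche.unicode)

;; U+03BB (λ) is the single UTF-16 code unit #x03bb.
(utf16->string #u8(#x03 #xbb))                 ; ⇒ "λ"  big-endian by default
(utf16->string #u8(#xbb #x03) 'little-endian)  ; ⇒ "λ"

;; The leading octets #xff #xfe are a little-endian BOM. Since ignore-bom?
;; is omitted, the BOM wins over the endian argument and is dropped.
(utf16->string #u8(#xff #xfe #xbb #x03) 'big-endian)  ; ⇒ "λ"

;; NOTE: utf32->string is the assumed name of the UTF-32 counterpart.
(utf32->string #u8(0 0 #x03 #xbb))             ; ⇒ "λ"
```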
{gauche.unicode}
[R7RS scheme.bytevector]
Encode a string str into a UTF-16 or UTF-32 sequence stored in a u8vector, respectively.
The optional endian argument specifies whether the encoding is UTF16BE/UTF32BE or UTF16LE/UTF32LE. If it is the symbol big-endian or big, the encoding is UTF16BE/UTF32BE. If it is the symbol little-endian, little or arm-little-endian, the encoding is UTF16LE/UTF32LE. See Endianness, for the details of endianness. If it is omitted or #f, UTF16BE/UTF32BE is assumed.
The second optional argument add-bom? specifies whether the output contains a BOM: if a true value is given, a BOM is added. When it is omitted, no BOM is added.
The start and end arguments limit the range of str to be converted.
R7RS scheme.bytevector only defines the optional endian argument. The rest is Gauche’s extension.
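The encoding side can be sketched in the same way. The names string->utf16 and string->utf32 are assumptions, since the entry headings are not shown in this text; the argument order endian, then add-bom? follows the description above.

```scheme
(use gauche.unicode)

;; NOTE: string->utf16 and string->utf32 are the assumed procedure names.
(string->utf16 "λ")                  ; ⇒ #u8(3 187)          UTF16BE by default
(string->utf16 "λ" 'little-endian)   ; ⇒ #u8(187 3)

;; With add-bom? true, the BOM U+FEFF is emitted first.
(string->utf16 "λ" 'big-endian #t)   ; ⇒ #u8(254 255 3 187)

(string->utf32 "λ")                  ; ⇒ #u8(0 0 3 187)
```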
These procedures implement the grapheme-cluster and word breaking algorithms defined in UAX #29: Unicode Text Segmentation.
{gauche.unicode}
From a given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of words. Each word is represented as a string, or a sequence of the same type as the input, respectively.
(string->words "That's it.") ⇒ ("That's" " " "it" ".")
(codepoints->words '(84 104 97 116 39 115 32 105 116 46)) ⇒ ((84 104 97 116 39 115) (32) (105 116) (46))
(codepoints->words '#(84 104 97 116 39 115 32 105 116 46)) ⇒ (#(84 104 97 116 39 115) #(32) #(105 116) #(46))
In the second and third examples, the input is the sequence of codepoints of the characters in "That's it."
{gauche.unicode}
From a given string or codepoint sequence (a <sequence> object containing codepoints), returns a list of grapheme clusters. Each cluster is represented as a string, or a sequence of the same type as the input, respectively.
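To illustrate the difference between characters and grapheme clusters, the sketch below splits a string containing a combining mark. The name string->grapheme-clusters is an assumption (the entry heading is not shown in this text); the result follows from the UAX #29 rule that no break occurs before a combining mark.

```scheme
(use gauche.unicode)

;; "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a":
;; three characters, but only two user-perceived characters.
(define s (string #\e (integer->char #x301) #\a))

(string-length s)                                  ; ⇒ 3

;; NOTE: string->grapheme-clusters is the assumed procedure name.
(map string-length (string->grapheme-clusters s))  ; ⇒ (2 1)
```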
The following procedures are low-level building blocks to build the above string->words etc.
{gauche.unicode}
From the given generator, which is a generator of characters or codepoints, returns a generator that returns two values: the first value is the character or codepoint generated from the original generator, and the second value is a boolean flag, which is #t if a word or a grapheme cluster breaks before the character/codepoint, and #f otherwise.
Suppose a generator g returns the characters in the string That's it., one at a time. Then the created generator will work as follows:
(define brk (make-word-breaker g))
(brk) ⇒ #\T and #t
(brk) ⇒ #\h and #f
(brk) ⇒ #\a and #f
(brk) ⇒ #\t and #f
(brk) ⇒ #\' and #f
(brk) ⇒ #\s and #f
(brk) ⇒ #\space and #t
(brk) ⇒ #\i and #t
(brk) ⇒ #\t and #f
(brk) ⇒ #\. and #t
(brk) ⇒ #<eof> and #t
It shows the word breaks at those character boundaries shown by the caret ^ below (for clarity, I use _ to indicate the space).
 T h a t ' s _ i t .
^           ^ ^   ^ ^
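One way to obtain the generator g used above is string->generator from gauche.generator. The sketch below collects the indexes of the characters that have a word break before them; word-break-indexes is a hypothetical helper introduced only for illustration.

```scheme
(use gauche.unicode)
(use gauche.generator)

;; Hypothetical helper: indexes of characters preceded by a word break.
(define (word-break-indexes str)
  (let ((brk (make-word-breaker (string->generator str))))
    (let loop ((i 0) (acc '()))
      ;; Each call yields the next character and its break flag.
      (receive (ch break?) (brk)
        (cond ((eof-object? ch) (reverse acc))
              (break? (loop (+ i 1) (cons i acc)))
              (else (loop (+ i 1) acc)))))))

;; Breaks fall before T, the space, i, and the period.
(word-break-indexes "That's it.")   ; ⇒ (0 6 7 9)
```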
{gauche.unicode}
The input generator is a generator of characters or codepoints, and return is a procedure that takes a list of characters or codepoints and returns an object. These procedures create a generator that returns one object at a time, each consisting of a word or a grapheme cluster, respectively.
Suppose a generator g returns the characters in the string That's it., one at a time, again. Then the created generator works as follows:
(define brk (make-word-reader g list->string))
(brk) ⇒ "That's"
(brk) ⇒ " "
(brk) ⇒ "it"
(brk) ⇒ "."
(brk) ⇒ #<eof>
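Since the return procedure receives each word as a list of characters, it decides what the reader yields. The sketch below, which again builds the input with string->generator from gauche.generator, passes list->string to get the words and length to get only their sizes; read-words is a hypothetical helper introduced for illustration.

```scheme
(use gauche.unicode)
(use gauche.generator)

;; Hypothetical helper: run a word reader to exhaustion and collect results.
(define (read-words str return)
  (generator->list (make-word-reader (string->generator str) return)))

(read-words "That's it." list->string)  ; ⇒ ("That's" " " "it" ".")
(read-words "That's it." length)        ; ⇒ (6 1 2 1)
```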
[R6RS][R7RS char][SRFI-129]
{gauche.unicode}
Converts the case of the given string using the language-independent full case folding defined by the Unicode standard. They differ from SRFI-13’s procedures with the same names (see String case mapping), which simply use character-by-character case mapping. Notably, the length of the resulting string may differ from the source string, and some conversions are sensitive to whether the character is at a word boundary or not. The word boundaries are determined according to the UAX #29 text segmentation rules.
(string-upcase "straße") ⇒ "STRASSE"
(string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.") ⇒ "χαοσχαοσ.χαος. σ."
(string-titlecase "You're talking about R6RS, right?") ⇒ "You're Talking About R6rs, Right?"
(string-foldcase "straße") ⇒ "strasse"
(string-foldcase "ΧΑΟΣΣ") ⇒ "χαοσσ"
Procedures string-upcase, string-downcase, and string-foldcase are also in the R7RS scheme.char module. Procedure string-titlecase is also defined in SRFI-129.
{gauche.unicode}
Like string-upcase etc., but these work on a sequence of codepoints instead. Returns a sequence of the same type as the input.
(codepoints-upcase '#(115 116 114 97 223 101)) ⇒ #(83 84 82 65 83 83 69)
[R7RS char]
{gauche.unicode}
Case-insensitive string comparison, using full-string case conversion. Note that Gauche has builtin string-ci=? etc., which use character-wise case folding (see String comparison). These are different procedures.
(string-ci=? "\u00df" "SS") ⇒ #t
{gauche.unicode}
The argument may be a character or a nonnegative integer Unicode codepoint. Returns one of the symbols N (neutral), F (fullwidth), H (halfwidth), W (wide), Na (narrow), and A (ambiguous).
The meaning of this property is explained in Unicode Standard Annex #11, http://unicode.org/reports/tr11/.
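A few representative lookups are sketched below. The name char-east-asian-width is an assumption, since the entry heading is not shown in this text; the categories are those that UAX #11 assigns to these particular codepoints.

```scheme
(use gauche.unicode)

;; NOTE: char-east-asian-width is the assumed name of the procedure above.
(char-east-asian-width #\a)     ; ⇒ Na  ASCII letter
(char-east-asian-width #x3042)  ; ⇒ W   HIRAGANA LETTER A, given as an integer
(char-east-asian-width #xff21)  ; ⇒ F   FULLWIDTH LATIN CAPITAL LETTER A
(char-east-asian-width #xff71)  ; ⇒ H   HALFWIDTH KATAKANA LETTER A
```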
{gauche.unicode}
Computes the ’width’ of the given string str, taking each character’s East Asian Width into account. It gives you a rough estimate of how much space the string will occupy on the screen when displayed with monospace fonts.
Although it is true that the exact width is generally undecidable without actually rendering the characters with glyphs, heuristics can give a good estimate if the string consists of limited scripts. For example, if the string consists of ASCII and CJK ideographs, counting ’Full-width’ and ’Wide’ letters as twice as wide as an ASCII character is likely to work.
This procedure assigns the following widths to each character’s East Asian Width category. They can be altered by giving the keyword arguments F, H, W, Na, N, and A, respectively:
F: 2
H: 1
W: 2
Na: 1
N: 1
A: 2
See UAX #11 (http://unicode.org/reports/tr11/), section 5, for more detailed discussions of computing the width of displayed strings.
(string-east-asian-width "abc") ⇒ 3
(string-east-asian-width "いろは") ⇒ 6
(string-east-asian-width "1番目" :W 1.5) ⇒ 4.0
{gauche.unicode}
Like string-take and string-drop (see srfi.13 - String library), returns a prefix and a suffix of the input string str, but using the string’s width instead of the number of characters.
string-take-width returns the longest prefix of a string str such that its string-east-asian-width does not exceed width. On the other hand, string-drop-width removes such a prefix from str and returns the rest.
The keyword arguments customize the mapping of East Asian Width categories to numerical widths. See string-east-asian-width above for the details.
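A sketch of how the two procedures split a string by width, assuming the argument order is str followed by width, mirroring string-take and string-drop. Each hiragana character below has the default width 2.

```scheme
(use gauche.unicode)

;; Width 4 fits exactly two wide characters.
(string-take-width "いろは" 4)   ; ⇒ "いろ"
(string-drop-width "いろは" 4)   ; ⇒ "は"

;; Width 3 cannot fit a second wide character, so only one is taken.
(string-take-width "いろは" 3)   ; ⇒ "い"
(string-drop-width "いろは" 3)   ; ⇒ "ろは"
```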