For Gauche 0.9.5



2.2 Multibyte strings

Traditionally, a string is considered a simple array of bytes. Programmers therefore tend to imagine a string as a simple array of characters (though a character may occupy more than one byte). That is not the case in Gauche.

Gauche supports multibyte strings natively, which means a character may be represented by a variable number of bytes within a string. Gauche retains semantic compatibility with Scheme strings, so such details are usually hidden, but it helps to know a few points.
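As a small illustration of the difference between characters and bytes, the sketch below compares string-length (characters) with string-size (bytes). The byte count in the comment assumes Gauche was compiled with the utf-8 internal encoding and that the source file is read as UTF-8; under another internal encoding the size would differ. The name sample is only for illustration.

```scheme
;; A three-character string written with Japanese characters.
(define sample "日本語")

;; Number of characters, independent of how many bytes each one occupies.
(string-length sample)   ; 3

;; Number of bytes in the string body. Under the utf-8 internal
;; encoding (an assumption here) each of these characters takes 3 bytes.
(string-size sample)     ; 9 under utf-8
```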

A string object keeps a type tag and a pointer to the storage of the string body. The storage of the body is managed in a sort of “copy-on-write” way: if you take a substring, e.g. directly with substring or through the regular expression matcher, or even if you copy a string with string-copy, the underlying storage is shared (the “anchor” of the string is different, so the copied string is not eq? to the original string). The actual string body is copied only if you destructively modify it.
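The visible consequences of this scheme can be sketched as follows. The sharing of storage itself is an implementation detail and cannot be observed from Scheme code; what the example shows is only that a copy is a distinct object and that modifying it leaves the original untouched. The variable names are illustrative.

```scheme
(define original "hello, world")

;; Copying and taking a substring are cheap: no string body is
;; duplicated at this point (per the description above).
(define copied (string-copy original))
(define part   (substring original 0 5))

(eq? original copied)     ; #f, the two string objects are distinct
(equal? original copied)  ; #t, their contents are the same
part                      ; "hello"

;; A destructive modification affects only the modified string.
(string-set! copied 0 #\H)
copied                    ; "Hello, world"
original                  ; "hello, world"
```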

Consequently, an algorithm that pre-allocates a string with make-string and fills it with string-set! becomes extremely inefficient in Gauche. Don’t do it. (It doesn’t work with multibyte strings anyway.) Sequential access to a string is much more efficient with string ports (see String ports).
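A minimal sketch of the string-port approach: instead of allocating a result string and filling it by index, read characters from an input string port and write the transformed characters to an output string port. The procedure name string-map-via-port is hypothetical and exists only for this example.

```scheme
;; Apply PROC to every character of STR and collect the results
;; into a new string, using string ports for sequential access.
(define (string-map-via-port proc str)
  (call-with-output-string
    (lambda (out)
      (call-with-input-string str
        (lambda (in)
          (let loop ((ch (read-char in)))
            (unless (eof-object? ch)
              (write-char (proc ch) out)
              (loop (read-char in)))))))))

(string-map-via-port char-upcase "gauche")  ; "GAUCHE"
```

Because each character is read and written sequentially, the code never needs to know how many bytes a character occupies.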

String search primitives such as string-scan (see String utilities) and the regular expression matcher (see Regular expressions) can return a matched string directly, without using index access at all.
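For example, the following sketch retrieves pieces of a string without doing any index arithmetic. The sample strings and variable names are made up for illustration.

```scheme
(define text "abracadabra")

;; By default string-scan returns the index of the match.
(string-scan text "ca")           ; 4

;; With a return-mode argument it returns substrings directly.
(string-scan text "ca" 'before)   ; "abra"
(string-scan text "ca" 'after)    ; "dabra"

;; The regular expression matcher yields a match object from which
;; the matched substring and its surroundings can be extracted.
(define m (rxmatch #/\d+/ "order 1234 shipped"))
(rxmatch-substring m)             ; "1234"
(rxmatch-after m)                 ; " shipped"
```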

You can choose the internal encoding scheme when compiling Gauche. At runtime, the procedure gauche-character-encoding can be used to query the internal encoding. At compile time, you can use a feature identifier to check the internal encoding (see Platform-dependent features). Currently, the following internal encodings are supported.

utf-8

UTF-8 encoding of Unicode. This is the default. The feature identifier gauche.ces.utf8 indicates Gauche is compiled with this internal encoding.

euc-jp

EUC-JP encoding of the ASCII, JIS X 0201 kana, JIS X 0212 and JIS X 0213:2000 Japanese character sets. The feature identifier gauche.ces.eucjp indicates Gauche is compiled with this internal encoding.

sjis

Shift-JIS encoding of the JIS X 0201 kana and JIS X 0213:2000 Japanese character sets. For source-code compatibility, the character codes between 0 and 0x7f are mapped to ASCII. The feature identifier gauche.ces.sjis indicates Gauche is compiled with this internal encoding.

none

An 8-bit fixed-length character encoding, in which the codes between 0 and 0x7f match ASCII. It’s up to the application to interpret the string in a particular character encoding. The feature identifier gauche.ces.none indicates Gauche is compiled with this internal encoding.
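The two ways of checking the internal encoding mentioned above can be sketched like this. The variable name encoding-note is hypothetical; the feature identifiers and the procedure are the ones named in this section.

```scheme
;; Runtime query: returns a symbol naming the internal encoding,
;; i.e. one of utf-8, euc-jp, sjis or none.
(define native-encoding (gauche-character-encoding))

;; Compile-time selection with feature identifiers.
(cond-expand
  (gauche.ces.utf8
   (define encoding-note "internal encoding is UTF-8"))
  (gauche.ces.eucjp
   (define encoding-note "internal encoding is EUC-JP"))
  (gauche.ces.sjis
   (define encoding-note "internal encoding is Shift-JIS"))
  (else
   (define encoding-note "no multibyte internal encoding")))
```

The cond-expand form is resolved when the code is read and compiled, so only the matching clause ends up in the program.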

Conversion from other encoding schemes is provided as a special port. See Character code conversion for details.
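As a rough sketch of what such a port looks like in use, the procedure below reads a file stored in EUC-JP and returns its contents as a string in the internal encoding. It assumes the gauche.charconv module and its open-input-conversion-port procedure; the name read-euc-jp-file and the choice of EUC-JP as the source encoding are only for illustration.

```scheme
(use gauche.charconv)

;; Read the file at PATH, treating its bytes as EUC-JP, and return
;; the text converted to Gauche's internal encoding.
(define (read-euc-jp-file path)
  (call-with-input-file path
    (lambda (in)
      ;; Wrap the raw file port in a port that converts on the fly.
      (let ((conv (open-input-conversion-port in "EUC-JP")))
        (port->string conv)))))
```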

The way to specify the encoding of source programs is explained in the next section.

