For Development HEAD DRAFT

9.5 gauche.charconv - Character Code Conversion

Module: gauche.charconv

This module defines a set of functions that convert the character encoding scheme (CES) of a given data stream.

This module is implicitly loaded when the :encoding keyword argument is given to file-stream-creating functions (such as open-input-file and call-with-output-file).

For portable programs, you can use transcoded ports defined in SRFI-181 (see Transcoded ports).


9.5.1 Supported character encoding schemes

A CES is represented by its name as a string or a symbol. Case is ignored. There may be several aliases defined for a single encoding.

A CES name "none" is special; it means the string is an octet sequence and it’s up to the application to interpret the sequence in an appropriate encoding. So, conversion to and from CES "none" does nothing.

Gauche natively supports conversions between Unicode transfer encodings (UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE), Latin-N encodings (ISO8859-1 to 16), and typical Japanese character encodings: ISO2022JP, ISO2022JP-3, EUC-JP (EUC-JISX0213), Shift_JISX0213.

Conversions between other encodings are handled by iconv(3) by default. However, the iconv(3) API lacks a feature to customize the behavior when an input character can’t be encoded in the output CES. If you need control over that behavior, you can disable delegation to iconv(3) with the following parameter.

Parameter: external-conversion-library

The value of this parameter can be a symbol iconv or #f. The default value is iconv.

Conversion ports opened while this parameter is iconv will use the iconv(3) library if the requested conversion isn’t supported by Gauche’s native converters. The parameter is consulted only when the conversion port is opened; once the port is open, the parameter value is irrelevant.

You can check whether a specific conversion is supported on your system with the following function.

Function: ces-conversion-supported? from-ces to-ces

{gauche.charconv} Returns #t if conversion from the character encoding scheme (CES) from-ces to to-ces is supported in this system.

Note that this procedure may return true even if the system supports only partial conversion between from-ces and to-ces. In such a case, the actual conversion might lose information by coercing characters in from-ces that are not supported in to-ces. (For example, conversion from Unicode to EUC-JP is "supported", although Unicode has characters that are not in EUC-JP).

Also note that this procedure always returns #t if from-ces and/or to-ces is "none", for conversion to/from CES "none" always succeeds (in fact, it does nothing).

This procedure may be affected by the value of the parameter external-conversion-library.
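As a minimal sketch of how the parameter and the predicate can be combined, the following asks whether a conversion can be handled without delegating to iconv(3). The helper name native-conversion-supported? is hypothetical and not part of gauche.charconv.

```scheme
(use gauche.charconv)

;; Hypothetical helper: with external-conversion-library bound to #f,
;; only Gauche's native converters are considered.
(define (native-conversion-supported? from-ces to-ces)
  (parameterize ((external-conversion-library #f))
    (ces-conversion-supported? from-ces to-ces)))

;; Unicode and Japanese encodings are listed above as natively supported.
(native-conversion-supported? "utf-8" "eucjp")
```

A program that depends on illegal-output being honored could run such a check at startup rather than discovering the problem when a port is opened.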

There are also two useful procedures for dealing with CES names.

Function: ces-equivalent? ces-a ces-b :optional unknown-value

{gauche.charconv} Returns true if two CESes ces-a and ces-b are equivalent to the knowledge of the system. Returns false if they are not. If the system doesn’t know about equivalency, unknown-value is returned, whose default is #f.

CES "none" works like a wild card; it is "equivalent" to any CES. (Thus, ces-equivalent? is not transitive. The intended use of ces-equivalent? is to compare two given CES names and see if conversion is required or not).

(ces-equivalent? 'eucjp "EUC-JP")            ⇒ #t
(ces-equivalent? 'shift_jis "EUC-JP")        ⇒ #f
(ces-equivalent? "NoSuchEncoding" 'utf-8 '?) ⇒ ?

Function: ces-upper-compatible? ces-a ces-b :optional unknown-value

{gauche.charconv} Returns true if a string encoded in CES ces-b can also be regarded as a string encoded in ces-a without conversion, to the knowledge of the system. Returns false if not. Returns unknown-value if the system can’t determine which is the case.

Like ces-equivalent?, CES "none" works like a wildcard. It is upper-compatible to any CES, and any CES is upper-compatible to "none".

(ces-upper-compatible? "eucjp" "ASCII")             ⇒ #t
(ces-upper-compatible? "eucjp" "utf-8")             ⇒ #f
(ces-upper-compatible? "utf-8" "NoSuchEncoding" '?) ⇒ ?
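
The intended use, deciding whether a conversion is required, might be sketched as follows. The helper name conversion-needed? is hypothetical; it treats an unknown relationship as "conversion needed", which is the conservative choice.

```scheme
(use gauche.charconv)

;; Hypothetical helper: data encoded in FROM needs no conversion if it
;; can already be regarded as data encoded in TO.
(define (conversion-needed? from to)
  ;; Note the argument order: ces-upper-compatible? takes the
  ;; "larger" CES first.  #f is returned for unknown CES names.
  (not (ces-upper-compatible? to from #f)))

(conversion-needed? "ASCII" "eucjp")   ; ASCII data is valid EUC-JP
(conversion-needed? "utf-8" "eucjp")   ; UTF-8 data is not
```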

When Gauche’s internal conversion routine encounters a character that can’t be mapped, the behavior depends on the illegal output handling mode of the conversion port, specified by the illegal-output keyword argument. If the mode is raise, an <io-encoding-error> is thrown. If the mode is replace, the character is replaced with a replacement character.

A replacement character is U+FFFD (REPLACEMENT CHARACTER) if it is available. For Japanese encodings, U+FFFD isn’t available, and we use U+3013 (geta mark), for it is traditionally used as the replacement character. If neither one is available, ? is used.

If that happens in iconv, the handling of such a character depends on the iconv implementation (the glibc implementation returns an error).

If the conversion routine encounters an input sequence that is illegal in the input CES, an <io-decoding-error> is signaled.

UTF encoding and BOM

The Unicode character U+FEFF (Zero-Width No-Break Space) can have a special meaning if it appears at the very beginning of a UTF stream. It serves as a BOM (Byte-order mark) to signify the byte order of the following UTF data. For UTF-16 and UTF-32, it is critical to know the byte order. UTF-8 does not need one, for the byte order doesn’t matter. Nevertheless, some software adds a BOM to UTF-8 data just to indicate it is in UTF-8.

Technically, a BOM is not a part of the text content, but rather a piece of meta-information about the format. That poses an issue; when you deal with a data stream, sometimes you just want to deal with the content, while at other times you want to deal with the entire data, including the meta-information. Traditionally those two are not strictly distinguished and are dealt with in an ad-hoc way. We take the following approach, depending on the specified encoding.

UTF-8

We don’t treat BOM specially; if the first codepoint is U+FEFF, it is read as the character #\ufeff. For output, no BOM will be produced. This is the default behavior of I/O.

UTF-8-BOM

This is a ’pseudo’ encoding: it is UTF-8, but if the input data begins with a BOM, the BOM is simply ignored. This is for the convenience of programs that just don’t want to be bothered by an optional BOM at the beginning of a UTF-8 stream. This encoding can’t be used for output. If you absolutely need to produce UTF-8 with a BOM, just write #\ufeff at the beginning of the UTF-8 stream.

UTF-16, UTF-32

The input recognizes BOM and decides the byte order; BOM itself won’t appear in the read data. If BOM is missing, big-endian (UTF-16BE, UTF-32BE) is assumed. The output emits BOM at the beginning of the data.

UTF-16LE, UTF-32LE, UTF-16BE, UTF-32BE

We assume the byte-order meta-information is given via a separate channel, so that the caller already knows the byte order of the input. These do not treat BOM specially; if the first codepoint is U+FEFF, it is read as the character #\ufeff. For output, no BOM will be produced.
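
The rules above can be illustrated with a small sketch that uses ces-convert and ces-convert-to from this module. It assumes the native encoding is UTF-8; the variable names are only for illustration.

```scheme
(use gauche.charconv)
(use gauche.uvector)

;; The letter "a" in little-endian UTF-16, preceded by a BOM (FF FE).
(define utf16-with-bom (u8vector #xff #xfe #x61 #x00))

;; With CES "UTF-16", the BOM selects the byte order and is not
;; included in the data that is read.
(define s1 (ces-convert utf16-with-bom "utf-16"))

;; Without a BOM, "UTF-16" input is assumed to be big-endian.
(define s2 (ces-convert (u8vector #x00 #x61) "utf-16"))

;; On output, "UTF-16" emits a BOM, while "UTF-16BE" does not.
(define with-bom    (ces-convert-to <u8vector> "a" "utf-8" "utf-16"))
(define without-bom (ces-convert-to <u8vector> "a" "utf-8" "utf-16be"))
```

Comparing the lengths of with-bom and without-bom shows the extra two octets taken by the BOM.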

Details of Gauche’s native conversion algorithm

Between EUC_JP, Shift JIS and ISO2022JP, Gauche uses arithmetic conversion whenever possible. This even maps undefined codepoints properly. Between Unicode (UTF-8) and EUC_JP, Gauche uses lookup tables. Between Unicode and Shift JIS or ISO2022JP, Gauche converts the input CES to EUC_JP, then converts it to the output CES. ISO8859-N are converted to Unicode using tables, then converted to the output CES if necessary. If the same CES is specified for input and output, Gauche’s conversion routine just copies input characters to output characters, without checking the validity of the encodings.

EUC_JP, EUCJP, EUCJ, EUC_JISX0213

Covers the ASCII, JIS X 0201 kana, JIS X 0212 and JIS X 0213 character sets. The JIS X 0212 character set is supported merely because it uses a code region that JIS X 0213 doesn’t use; JIS X 0212 characters are not converted properly to Shift JIS and UTF-8. Use JIS X 0213 instead.

SHIFT_JIS, SHIFTJIS, SJIS

Covers Shift_JISX0213, except that 0x5c and 0x7e are mapped to the ASCII character set (REVERSE SOLIDUS and TILDE), instead of JIS X 0201 Roman (YEN SIGN and OVERLINE).

UTF-8, UTF8

Unicode. Note that some JIS X 0213 characters are mapped to Extension B (U+20000 and up). Some JIS X 0213 characters are mapped to two Unicode characters (one base character plus a combining character).

ISO2022JP, CSISO2022JP, ISO2022JP-1, ISO2022JP-2, ISO2022JP-3

These encodings differ a bit (except ISO2022JP and CSISO2022JP, which are synonyms), but Gauche handles them the same way. If one of these CESes is specified as input, Gauche recognizes the escape sequences of any of them. ISO2022JP-2 defines several non-Japanese escape sequences, and they are recognized by Gauche, but mapped to a substitution character (’?’ or geta mark).

For output, Gauche assumes ISO2022JP first, and uses an ISO2022JP-1 escape sequence to put a JIS X 0212 character, or an ISO2022JP-3 escape sequence to put a JIS X 0213 plane 2 character. Thus, if the string contains only JIS X 0208 characters, the output is compatible with ISO2022JP. Precisely speaking, JIS X 0213 specifies some characters in JIS X 0208 codepoints that shouldn’t be mixed with JIS X 0208 characters; Gauche outputs those characters as JIS X 0208 for compatibility. (This is the same policy as Emacs-Mule’s iso2022jp-3-compatible mode).


9.5.2 Autodetecting the encoding scheme

There are cases where you don’t know the CES of the input, but you know it is one of several possible encodings. The charconv module has a mechanism to guess the input encoding. There can be multiple algorithms, and each algorithm has a name (wildcard CES). Right now, there’s only one algorithm implemented:

"*JP"

Guesses the character encoding of Japanese text from among ISO2022-JP(-1,2,3), EUCJP, SHIFT_JIS and UTF-8.

(Even when the input is UTF-8 with BOM, it is still recognized as UTF-8, not UTF-8-BOM).

The wildcard CES can be used in place of a CES name for some conversion functions.

Function: ces-guess-from-string string scheme

{gauche.charconv} Guesses the CES of string by the character guessing scheme scheme (e.g. "*JP"). Returns a CES name that can be used by other charconv functions. It may return #f if the guessing scheme finds no possible encoding in string. Note that if there is more than one possible encoding in string, the guessing scheme returns one of them, usually in favor of the native CES.
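
A minimal sketch of guessing, assuming the native encoding is UTF-8. The sample data is produced with ces-convert from this module; because the guess is heuristic, the result is checked with ces-equivalent? rather than assumed.

```scheme
(use gauche.charconv)

;; Build some EUC-JP data to guess from.  Since it is not in the
;; native CES, the result may be an incomplete string.
(define eucjp-data (ces-convert "日本語のテキスト" "utf-8" "eucjp"))

;; The result is a CES name, or #f if no candidate fits.
(define guessed (ces-guess-from-string eucjp-data "*JP"))

;; CES names have aliases, so compare with ces-equivalent?
;; instead of comparing the names directly.
(and guessed (ces-equivalent? guessed "eucjp"))
```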


9.5.3 Conversion ports

Function: open-input-conversion-port source from-code :key to-code buffer-size owner? illegal-output

{gauche.charconv} Takes an input port source, which feeds characters encoded in from-code, and returns another input port, from which you can read characters encoded in to-code.

If to-code is omitted, the native CES is assumed.

buffer-size specifies the size of the internal buffer used for conversion. The default size is about 1 kilobyte, which is suitable for typical cases.

The illegal-output argument specifies the behavior when the output CES has no character corresponding to an input character. It can be the symbol raise, to raise an <io-encoding-error> in such cases, or the symbol replace, to replace the character with a replacement character appropriate for the output CES. If omitted, raise is assumed.

Note that the iconv(3) library API doesn’t offer an option to choose the illegal-output handling mode. So when the conversion is delegated to iconv(3), illegal-output is ignored and the behavior follows the underlying iconv(3) implementation. If you need to make sure illegal-output is honored, you can bind the parameter external-conversion-library to #f when calling this procedure; then the conversion port won’t use iconv(3), and an unsupported encoding error is raised if the conversion can’t be handled entirely within Gauche.

By default, open-input-conversion-port leaves source open. If you specify a true value for owner?, the function closes source after it reads EOF from the port.

If you don’t know the source’s CES, you can specify a CES guessing scheme, such as "*JP", in place of from-code. The conversion port tries to guess the encoding by prefetching data from source up to the buffer size. It signals an error if the code guessing routine finds no appropriate CES. If the guessing routine finds ambiguous input, however, it silently assumes one of the possible CESes, in favor of the native CES. Hence it is possible that the guess is wrong if the buffer size is too small. The default size is usually enough for most text documents, but the guess may fail if a large text contains mostly ASCII characters and multibyte characters appear only at the very end of the document. To be safe in the worst case, you have to specify a buffer size large enough to hold the entire text.

For example, the following code copies a file unknown.txt to a file eucjp.txt, converting an unknown Japanese CES to EUC-JP.

(call-with-output-file "eucjp.txt"
  (lambda (out)
    (copy-port (open-input-conversion-port
                 (open-input-file "unknown.txt")
                 "*jp"             ;guess code
                 :to-code "eucjp"
                 :owner? #t)       ;close unknown.txt afterwards
               out)))

For portable code, you can also use SRFI-181 transcoded-port (see Transcoded ports).
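
For the worst case mentioned above, one option is to make the buffer as large as the file itself. The following is a sketch under that assumption; the helper name read-japanese-file is hypothetical.

```scheme
(use gauche.charconv)

;; Hypothetical helper: read a whole file of unknown Japanese encoding
;; into a string in the native CES.
(define (read-japanese-file path)
  (let* ((size (slot-ref (sys-stat path) 'size))   ; file size in octets
         (in   (open-input-conversion-port
                (open-input-file path)
                "*JP"                              ; guess the encoding
                :buffer-size (max 1024 size)       ; prefetch the whole file
                :owner? #t)))                      ; close the file at EOF
    (let ((text (port->string in)))                ; reads until EOF
      (close-input-port in)
      text)))
```

This trades memory for accuracy, so it is only reasonable for files that fit comfortably in memory.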

Function: open-output-conversion-port sink to-code :key from-code buffer-size owner? illegal-output

{gauche.charconv} Creates and returns an output port that converts given characters from from-code to to-code and feeds them to an output port sink. If from-code is omitted, the native CES is assumed. You can’t specify a character guessing scheme (such as "*JP") for either from-code or to-code.

buffer-size specifies the size of internal conversion buffer. The characters put to the returned port may stay in the buffer, until the port is explicitly flushed (by flush) or the port is closed.

By default, the returned port doesn’t close sink when it is closed. If the keyword argument owner? is provided and true, however, it closes sink when it is closed.

The illegal-output keyword argument is the same as for open-input-conversion-port.

For portable code, you can also use SRFI-181 transcoded-port (see Transcoded ports).
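
A minimal sketch of writing a file in EUC-JP with an output conversion port. The helper name write-eucjp-file is hypothetical.

```scheme
(use gauche.charconv)

;; Hypothetical helper: write TEXT to PATH, encoded in EUC-JP.
(define (write-eucjp-file path text)
  (call-with-output-file path
    (lambda (out)
      (let ((conv (open-output-conversion-port out "eucjp")))
        (display text conv)
        ;; Closing the conversion port flushes any buffered output.
        ;; owner? is not given, so OUT itself stays open and is
        ;; closed by call-with-output-file.
        (close-output-port conv)))))
```

Closing the conversion port, rather than only flushing it, matters for stateful encodings that need a trailing sequence.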

Function: ces-convert-to return-type source from-code :optional to-code :key illegal-output
Function: ces-convert source from-code :optional to-code :key illegal-output

{gauche.charconv} Converts source, which is a string or a u8vector holding a multibyte sequence encoded in from-code, to a string or u8vector encoded in to-code. If to-code is omitted, the native CES is assumed.

In ces-convert-to, you can specify the return type by return-type argument; it must be either a class object <string> or <u8vector>. On the other hand, ces-convert always returns a string, regardless of the type of source.

If to-code is different from the native CES and a string is returned, it can be an incomplete string. This is for backward compatibility; in general, we recommend using a u8vector to represent a multibyte sequence in a CES other than the native encoding.

from-code can be a name of character guessing scheme (e.g. "*JP").

The keyword argument illegal-output controls the behavior when input contains a character that can’t be encoded in the output. See open-input-conversion-port above for the description. By default, an <io-encoding-error> is raised, except when the conversion is delegated to iconv(3), in which case the behavior depends on the external library.

For portable code, you can also use SRFI-181 bytevector->string and string->bytevector (see Transcoded ports).
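
The following sketch shows a round trip through a u8vector and the two illegal-output modes. It assumes the native encoding is UTF-8 and that the conversion is handled by Gauche's native converters; if it is delegated to iconv(3), the error behavior may differ. The helper names are hypothetical.

```scheme
(use gauche.charconv)
(use gauche.uvector)

;; Native string -> Shift JIS octets -> native string.
(define sjis-bytes (ces-convert-to <u8vector> "こんにちは" "utf-8" "sjis"))
(define back (ces-convert sjis-bytes "sjis"))   ; to-code defaults to native

;; Hypothetical helper: strict conversion to Latin-1.  Returns #f
;; instead of raising when a character can't be encoded.
(define (latin1-or-false str)
  (guard (e ((condition-has-type? e <io-encoding-error>) #f))
    (ces-convert-to <u8vector> str "utf-8" "ISO8859-1")))

;; Hypothetical helper: lossy conversion to Latin-1.  Characters that
;; can't be encoded are replaced with a replacement character.
(define (latin1-lossy str)
  (ces-convert-to <u8vector> str "utf-8" "ISO8859-1"
                  :illegal-output 'replace))
```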

Function: call-with-input-conversion iport proc :key encoding conversion-buffer-size illegal-output
Function: call-with-output-conversion oport proc :key encoding conversion-buffer-size illegal-output

{gauche.charconv} These procedures can be used to temporarily perform character I/O in an encoding different from the original port’s encoding.

call-with-input-conversion takes an input port iport which uses the character encoding encoding, and calls proc with one argument, a conversion input port. From the port, proc can read characters in utf-8. Note that once proc is called, it has to read all the characters until EOF; see the note below.

call-with-output-conversion takes an output port oport which expects the character encoding encoding, and calls proc with one argument, a temporary conversion output port. To the port, proc can write characters in utf-8. When proc returns, or it exits with an error, the temporary conversion output port is flushed and closed. The caller of call-with-output-conversion can continue to use oport with original encoding afterwards.

Both procedures return the value(s) that proc returns. The default value of encoding is Gauche’s internal encoding. These procedures don’t create a conversion port when it is not necessary. If conversion-buffer-size is given, it is used as the buffer-size argument when the conversion port is opened.

You shouldn’t use iport/oport directly while proc is active—character encoding is a stateful process, and mixing I/O from/to the conversion port and the underlying port will screw up the state.

Note: with call-with-input-conversion, you can’t use iport again unless proc reads EOF from the conversion port. This is because a conversion port needs to buffer the input, and there’s no way to return the buffered input to iport when proc returns.
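
A minimal sketch of both procedures; the helper names are hypothetical. The input helper reads until EOF, as the note above requires.

```scheme
(use gauche.charconv)

;; Hypothetical helper: read everything from IPORT, whose data is
;; encoded in ENCODING, and return it as a native string.
(define (read-all/encoding iport encoding)
  (call-with-input-conversion iport
    (lambda (in) (port->string in))   ; consumes input up to EOF
    :encoding encoding))

;; Hypothetical helper: write one line to OPORT in ENCODING.  OPORT
;; can be used with its original encoding again afterwards.
(define (write-line/encoding oport encoding line)
  (call-with-output-conversion oport
    (lambda (out)
      (display line out)
      (newline out))
    :encoding encoding))
```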

Function: with-input-conversion iport thunk :key encoding conversion-buffer-size illegal-output
Function: with-output-conversion oport thunk :key encoding conversion-buffer-size illegal-output

{gauche.charconv} Similar to call-with-*-conversion, but these procedures call thunk without arguments, while the conversion port is set as the current input or output port, respectively. The meanings of the keyword arguments are the same as for call-with-*-conversion.
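
As a sketch, the following writes a short report in EUC-JP using ordinary output procedures that write to the current output port. The helper name emit-report is hypothetical.

```scheme
(use gauche.charconv)

;; Hypothetical helper: everything written to the current output port
;; inside the thunk reaches OPORT encoded in EUC-JP.
(define (emit-report oport)
  (with-output-conversion oport
    (lambda ()
      (display "結果: ")
      (write 42)
      (newline))
    :encoding "eucjp"))
```

This form is convenient when the code inside the thunk calls existing procedures that don't take a port argument.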

Function: wrap-with-input-conversion port from-code :key to-code owner? buffer-size illegal-output
Function: wrap-with-output-conversion port to-code :key from-code owner? buffer-size illegal-output

{gauche.charconv} Convenience procedures that avoid creating an unnecessary conversion port. They work like open-input-conversion-port and open-output-conversion-port, respectively, except that if the system knows no conversion is needed, no conversion port is created and port is returned as is.

When a conversion port is created, port is always owned by the conversion port. When you want to close the port, always close the port returned by wrap-with-*-conversion instead of the original port. If you close the original port first, the pending conversion won’t be flushed. (Some conversions require a trailing sequence that is generated only when the conversion port is closing, so simply calling flush isn’t enough.)

The buffer-size and illegal-output arguments are passed to the open-*-conversion-port.
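
The closing rule above can be sketched as follows. The helper name call-with-input-file/ces is hypothetical.

```scheme
(use gauche.charconv)

;; Hypothetical helper: open PATH, whose contents are encoded in CES,
;; and call PROC with a port that yields native characters.
(define (call-with-input-file/ces path ces proc)
  (let ((in (wrap-with-input-conversion (open-input-file path) ces)))
    (unwind-protect
        (proc in)
      ;; Close the returned port, not the original file port.  If no
      ;; conversion was needed, IN is the file port itself.
      (close-input-port in))))
```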


