9.2 gauche.charconv - Character Code Conversion

Module: gauche.charconv

This module defines a set of functions that convert the character encoding scheme (CES) of a given data stream.

This module is implicitly loaded when the :encoding keyword argument is given to a procedure that creates a file stream (such as open-input-file or call-with-output-file).
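
As a minimal sketch of the implicit loading described above, the following writes a file in EUC-JP and reads it back without an explicit (use gauche.charconv). The file name sample-eucjp.txt is hypothetical, and the sketch assumes the native encoding is UTF-8 and that the source file is saved in it.

```scheme
;; "sample-eucjp.txt" is a hypothetical file name used only for illustration.
;; Passing :encoding is enough; gauche.charconv is loaded on demand.
(call-with-output-file "sample-eucjp.txt"
  (lambda (out) (display "日本語\n" out))
  :encoding 'euc-jp)

;; Read the first line back, converting from EUC-JP to the native encoding.
(define line-read
  (call-with-input-file "sample-eucjp.txt"
    read-line
    :encoding 'euc-jp))
```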

As of release 0.5.6, Gauche natively supports conversions between typical Japanese character encodings: ISO2022JP, ISO2022JP-3, EUC-JP (EUC-JISX0213), Shift_JISX0213, UTF-8 (Unicode 3.2). Conversions between other encodings are handled by iconv(3). See section Supported character encoding schemes, for details.


9.2.1 Supported character encoding schemes

A CES is represented by its name as a string or a symbol. Case is ignored. There may be several aliases defined for a single encoding.

The CES name "none" is special. When Gauche’s native encoding is none, Gauche treats a string as a plain byte sequence, and it’s up to the application to interpret the sequence in an appropriate encoding. Consequently, conversion to or from CES "none" does nothing.

You can check whether a specific conversion is supported on your system with the following function.

Function: ces-conversion-supported? from-ces to-ces

Returns #t if conversion from the character encoding scheme (CES) from-ces to to-ces is supported in this system.

Note that this procedure may return true even if the system supports only partial conversion between from-ces and to-ces. In such a case, the actual conversion might lose information by coercing characters in from-ces that are not supported in to-ces. (For example, conversion from Unicode to EUC-JP is "supported", although Unicode has characters that are not in EUC-JP).

Also note that this procedure always returns #t if from-ces and/or to-ces is "none", for conversion to/from CES "none" always succeeds (in fact, it does nothing).

 
;; see if you can convert the internal encoding to EUC-JP
(ces-conversion-supported? (gauche-character-encoding) "euc-jp")

Also there are two useful procedures to deal with CES names.

Function: ces-equivalent? ces-a ces-b :optional unknown-value

Returns true if two CESes ces-a and ces-b are equivalent to the knowledge of the system. Returns false if they are not. If the system doesn’t know about equivalency, unknown-value is returned, whose default is #f.

CES "none" works like a wildcard; it is "equivalent" to any CES. (Thus, ces-equivalent? is not transitive. The intended use of ces-equivalent? is to compare two given CES names and see whether conversion is required or not).

 
(ces-equivalent? 'eucjp "EUC-JP")            ⇒ #t
(ces-equivalent? 'shift_jis "EUC-JP")        ⇒ #f
(ces-equivalent? "NoSuchEncoding" 'utf-8 '?) ⇒ ?
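
The intended use mentioned above, deciding whether conversion is required, could be sketched as follows. The helper needs-conversion? is hypothetical and not part of gauche.charconv; it treats an unknown equivalency as "conversion needed", which is an assumption about what a caller would want.

```scheme
(use gauche.charconv)

;; needs-conversion? is a hypothetical helper for illustration only.
;; An unknown equivalency yields #f from ces-equivalent?, so we convert.
(define (needs-conversion? from-ces to-ces)
  (not (ces-equivalent? from-ces to-ces #f)))

(define alias-case    (needs-conversion? 'eucjp "EUC-JP"))     ; aliases of one CES
(define distinct-case (needs-conversion? 'shift_jis "EUC-JP")) ; different CESes
(define none-case     (needs-conversion? "none" 'utf-8))       ; "none" is a wildcard
```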

Function: ces-upper-compatible? ces-a ces-b :optional unknown-value

Returns true if a string encoded in CES ces-b can also be regarded as a string encoded in ces-a without conversion, to the knowledge of the system. Returns false if not. Returns unknown-value if the system can’t determine which is the case.

Like ces-equivalent?, CES "none" works like a wildcard. It is upper-compatible to any CES, and any CES is upper-compatible to "none".

 
(ces-upper-compatible? "eucjp" "ASCII")             ⇒ #t
(ces-upper-compatible? "eucjp" "utf-8")             ⇒ #f
(ces-upper-compatible? "utf-8" "NoSuchEncoding" '?) ⇒ ?

Conversion between common Japanese CESes (EUC_JP, Shift JIS, UTF-8 and ISO2022-JP) of the character sets JIS X 0201 and JIS X 0213 is handled by Gauche’s built-in algorithm (see below for details). When another CES name is given, Gauche uses iconv(3) if it is linked.

When Gauche’s conversion routine encounters a character that can’t be mapped, it replaces the character with the "geta mark" (U+3013) if it’s a multibyte character in the input encoding, or with ’?’ if it’s a single-byte character in the input encoding. If that happens in iconv, handling of such a character depends on the iconv implementation (the glibc implementation returns an error).

If the conversion routine encounters an input sequence that is illegal in the input CES, an error is signaled.

Details of Gauche’s native conversion algorithm: Between EUC_JP, Shift JIS and ISO2022JP, Gauche uses arithmetic conversion whenever possible. This maps even undefined codepoints properly. Between Unicode (UTF-8) and EUC_JP, Gauche uses lookup tables. Between Unicode and Shift JIS or ISO2022JP, Gauche converts the input CES to EUC_JP, then converts it to the output CES. If the same CES is specified for input and output, Gauche’s conversion routine just copies input characters to output characters, without checking the validity of the encoding.

EUC_JP, EUCJP, EUCJ, EUC_JISX0213

Covers ASCII, JIS X 0201 kana, JIS X 0212 and JIS X 0213 character sets. JIS X 0212 character set is supported merely because it uses the code region JIS X 0213 doesn’t use, and JIS X 0212 characters are not converted properly to Shift JIS and UTF-8. Use JIS X 0213.

SHIFT_JIS, SHIFTJIS, SJIS

Covers Shift_JISX0213, except that 0x5c and 0x7e are mapped to the ASCII character set (REVERSE SOLIDUS and TILDE), instead of JIS X 0201 Roman (YEN SIGN and OVERLINE).

UTF-8, UTF8

Unicode 3.2. Note that some JIS X 0213 characters are mapped to Extension B (U+20000 and up). Some JIS X 0213 characters are mapped to two Unicode characters (one base character plus a combining character).

ISO2022JP, CSISO2022JP, ISO2022JP-1, ISO2022JP-2, ISO2022JP-3

These encodings differ a bit (except ISO2022JP and CSISO2022JP, which are synonyms), but Gauche handles them the same way. If one of these CESes is specified as input, Gauche recognizes the escape sequences of any of them. ISO2022JP-2 defines several non-Japanese escape sequences; Gauche recognizes them, but maps the characters to a substitution character (’?’ or geta mark).

For output, Gauche assumes ISO2022JP first, and uses an ISO2022JP-1 escape sequence to output a JIS X 0212 character, or an ISO2022JP-3 escape sequence to output a JIS X 0213 plane 2 character. Thus, if the string contains only JIS X 0208 characters, the output is compatible with ISO2022JP. Precisely speaking, JIS X 0213 specifies some characters in JIS X 0208 codepoints that shouldn’t be mixed with JIS X 0208 characters; Gauche outputs those characters as JIS X 0208 for compatibility. (This is the same policy as Emacs-Mule’s iso2022jp-3-compatible mode).


9.2.2 Autodetecting the encoding scheme

There are cases where you don’t know the CES of the input, but you know it is one of several possible encodings. The charconv module has a mechanism to guess the input encoding. There can be multiple algorithms, and each algorithm has a name (a wildcard CES). Right now, only one algorithm is implemented:

"*JP"

Guesses the character encoding of Japanese text from among ISO2022-JP(-1,2,3), EUCJP, SHIFT_JIS and UTF-8.

The wildcard CES can be used in place of CES name for some conversion functions.

Function: ces-guess-from-string string scheme

Guesses the CES of string by the character guessing scheme scheme (e.g. "*JP"). Returns a CES name that can be used by other charconv functions. It may return #f if the guessing scheme finds no possible encoding in string. Note that if more than one encoding is possible for string, the guessing scheme returns one of them, usually in favor of the native CES.
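
A small sketch of guessing, assuming the native encoding is UTF-8: it first builds an EUC-JP byte string with ces-convert, then asks the "*JP" scheme what it is. Short inputs can be ambiguous, so no particular result is claimed here; the sample text is arbitrary.

```scheme
(use gauche.charconv)

;; Produce some EUC-JP encoded bytes to guess from (sample text is arbitrary).
(define euc-bytes
  (ces-convert "日本語の文章です。" (gauche-character-encoding) 'euc-jp))

;; guessed is a CES name usable by other charconv functions, or #f
;; when the scheme finds no possible encoding.
(define guessed (ces-guess-from-string euc-bytes "*JP"))
```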


9.2.3 Conversion ports

Function: open-input-conversion-port source from-code :key to-code buffer-size owner?

Takes an input port source, which feeds characters encoded in from-code, and returns another input port, from which you can read characters encoded in to-code.

If to-code is omitted, the native CES is assumed.

buffer-size specifies the size of the internal conversion buffer. The default is about 1 kilobyte, which is suitable for typical cases.

If you don’t know the source’s CES, you can specify a CES guessing scheme, such as "*JP", in place of from-code. The conversion port tries to guess the encoding by prefetching data from source, up to the buffer size. It signals an error if the guessing routine finds no appropriate CES. If the input is ambiguous, however, it silently assumes one of the possible CESes, in favor of the native CES. Hence the guess can be wrong if the buffer size is too small. The default size is usually enough for most text documents, but guessing may fail if a large text consists mostly of ASCII characters and multibyte characters appear only at the very end. To be safe in the worst case, specify a buffer size large enough to hold the entire text.

By default, open-input-conversion-port leaves source open. If you specify a true value for owner?, the function closes source after it reads EOF from the port.

For example, the following code copies a file ‘unknown.txt’ to a file ‘eucjp.txt’, converting an unknown Japanese CES to EUC-JP.

 
(call-with-output-file "eucjp.txt"
  (lambda (out)
    (copy-port (open-input-conversion-port
                 (open-input-file "unknown.txt")
                 "*jp"             ;guess code
                 :to-code "eucjp"
                 :owner? #t)       ;close unknown.txt afterwards
               out)))
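
To illustrate the buffer-size remark above, here is a hedged in-memory variant that gives the guessing routine a larger prefetch buffer. The value 65536 is an arbitrary choice for illustration, and the sketch assumes the native encoding is UTF-8. Because guessing can pick another candidate for ambiguous input, the decoded text is not guaranteed to match the original.

```scheme
(use gauche.charconv)

;; EUC-JP bytes standing in for a file of unknown Japanese encoding.
(define euc-bytes
  (ces-convert "日本語の文章です。" (gauche-character-encoding) 'euc-jp))

;; Guess with "*JP", letting the port prefetch up to 64KB (arbitrary size).
;; to-code is omitted, so characters are read in the native CES.
(define in
  (open-input-conversion-port (open-input-string euc-bytes)
                              "*JP"
                              :buffer-size 65536
                              :owner? #t))

(define first-line (read-line in))
(close-input-port in)
```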

Function: open-output-conversion-port sink to-code :key from-code buffer-size owner?

Creates and returns an output port that converts given characters from from-code to to-code and feeds them to an output port sink. If from-code is omitted, the native CES is assumed. You can’t specify a character guessing scheme (such as "*JP") for either from-code or to-code.

buffer-size specifies the size of the internal conversion buffer. The characters put to the returned port may stay in the buffer until the port is explicitly flushed (by flush) or closed.

By default, the returned port doesn’t close sink when it is closed. If the keyword argument owner? is provided and true, however, it closes sink when it is closed.
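
A minimal sketch of the buffering and ownership behavior described above, using a string port as sink and assuming the native encoding is UTF-8. Closing the conversion port flushes its buffer, and sink stays usable because owner? is left at its default.

```scheme
(use gauche.charconv)

(define sink (open-output-string))
(define out (open-output-conversion-port sink 'euc-jp))

(display "日本語" out)
;; Output may still sit in the conversion buffer here; closing flushes it.
(close-output-port out)

;; sink was not closed, so its accumulated bytes can still be retrieved.
(define euc-bytes (get-output-string sink))
(define byte-count (string-size euc-bytes)) ; three 2-byte EUC-JP characters
```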

Function: ces-convert string from-code :optional to-code

Converts string’s character encoding from from-code to to-code, and returns the converted string. The returned string may be a byte-string if to-code is different from the native CES.

from-code can be the name of a character guessing scheme (e.g. "*JP"). When to-code is omitted, the native CES is assumed.
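
As a quick sketch, assuming the native encoding is UTF-8, a round trip through Shift JIS looks like this. The second call omits to-code and so converts back to the native CES.

```scheme
(use gauche.charconv)

(define original "日本語")

;; Native -> Shift JIS; the result is a byte-string, not readable text.
(define as-sjis
  (ces-convert original (gauche-character-encoding) 'shift_jis))

;; Shift JIS -> native (to-code omitted).
(define back (ces-convert as-sjis 'shift_jis))
```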

Function: call-with-input-conversion iport proc :key encoding conversion-buffer-size
Function: call-with-output-conversion oport proc :key encoding conversion-buffer-size

These procedures can be used to perform character I/O temporarily in an encoding different from the original port’s encoding.

call-with-input-conversion takes an input port iport which uses the character encoding encoding, and calls proc with one argument, a conversion input port. From the port, proc can read characters in Gauche’s internal encoding. Note that once proc is called, it has to read all the characters until EOF; see the note below.

call-with-output-conversion takes an output port oport which expects the character encoding encoding, and calls proc with one argument, a temporary conversion output port. To the port, proc can write characters in Gauche’s internal encoding. When proc returns, or it exits with an error, the temporary conversion output port is flushed and closed. The caller of call-with-output-conversion can continue to use oport with original encoding afterwards.

Both procedures return the value(s) that proc returns. The default value of encoding is Gauche’s internal encoding. These procedures don’t create a conversion port when it is not necessary. If conversion-buffer-size is given, it is used as the buffer-size argument when the conversion port is opened.

You shouldn’t use iport/oport directly while proc is active. Character encoding conversion is a stateful process, and mixing I/O from/to the conversion port and the underlying port will corrupt the state.

Note: for the call-with-input-conversion, you can’t use iport again unless proc reads EOF from it. It’s because a conversion port needs to buffer the input, and there’s no way to undo the buffered input to iport when proc returns.
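
The two procedures might be used together as in the sketch below, which assumes the native encoding is UTF-8. The helper read-all-chars is hypothetical; it exists only to honor the requirement that proc read the input up to EOF.

```scheme
(use gauche.charconv)

;; Write native characters; sink receives EUC-JP bytes.
(define sink (open-output-string))
(call-with-output-conversion sink
  (lambda (out) (display "日本語" out))
  :encoding 'euc-jp)
(define euc-bytes (get-output-string sink))

;; read-all-chars is a hypothetical helper that consumes the port until EOF.
(define (read-all-chars port)
  (let loop ((ch (read-char port)) (acc '()))
    (if (eof-object? ch)
        (list->string (reverse acc))
        (loop (read-char port) (cons ch acc)))))

;; Read the EUC-JP bytes back as native characters.
(define decoded
  (call-with-input-conversion (open-input-string euc-bytes)
    read-all-chars
    :encoding 'euc-jp))
```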

Function: with-input-conversion iport thunk :key encoding conversion-buffer-size
Function: with-output-conversion oport thunk :key encoding conversion-buffer-size

Similar to call-with-*-conversion, but these procedures call thunk without arguments, while the conversion port is set as the current input or output port, respectively. The meaning of the keyword arguments is the same as for call-with-*-conversion.
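
A short sketch of the thunk-based form, assuming the native encoding is UTF-8: inside the thunk, a plain display goes through the conversion port because it is the current output port.

```scheme
(use gauche.charconv)

(define sink (open-output-string))

;; display has no port argument; it writes to the current output port,
;; which is the conversion port while the thunk runs.
(with-output-conversion sink
  (lambda () (display "日本語"))
  :encoding 'shift_jis)

(define sjis-bytes (get-output-string sink))
```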

Function: wrap-with-input-conversion port from-code :key to-code owner? buffer-size
Function: wrap-with-output-conversion port to-code :key from-code owner? buffer-size

Convenience procedures that avoid adding an unnecessary conversion port. They work like open-input-conversion-port and open-output-conversion-port, respectively, except that if the system knows no conversion is needed, no conversion port is created and port is returned as is.

When a conversion port is created, port is always owned by the conversion port. When you want to close the port, always close the port returned by wrap-with-*-conversion instead of the original port. If you close the original port first, the pending conversion won’t be flushed. (Some conversions require a trailing sequence that is generated only when the conversion port is closed, so simply calling flush isn’t enough.)

The buffer-size argument is passed to the open-*-conversion-port.
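
Both outcomes can be sketched as follows, assuming the native encoding is UTF-8. In the first case the encodings match, so the original port should come back unchanged; in the second a conversion port is created and is the one to close.

```scheme
(use gauche.charconv)

;; Case 1: from-code equals the native CES, so no conversion port is needed.
(define src (open-input-string "plain text"))
(define same (wrap-with-input-conversion src (gauche-character-encoding)))
(define returned-as-is? (eq? same src))

;; Case 2: EUC-JP input needs conversion, so a wrapper port is created.
(define euc-src
  (open-input-string
   (ces-convert "日本語" (gauche-character-encoding) 'euc-jp)))
(define wrapped (wrap-with-input-conversion euc-src 'euc-jp))
(define text (read-line wrapped))

;; Close the returned port, not euc-src; the wrapper owns euc-src.
(close-input-port wrapped)
```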


This document was generated on July 19, 2014 using texi2html 1.82.