gauche.charconv - Character Code Conversion
This module defines a set of functions that convert the character encoding scheme (CES) of a given data stream.
This module is implicitly loaded when the :encoding keyword argument
is given to the file stream creating functions.
As of release 0.5.6, Gauche natively supports conversions between
typical Japanese character encodings: ISO2022JP, ISO2022JP-3,
EUC-JP (EUC-JISX0213), Shift_JISX0213, UTF-8 (Unicode 3.2).
Conversions between other encodings are handled by iconv.
See Supported character encoding schemes, for details.
• Supported character encoding schemes
• Autodetecting the encoding scheme
• Conversion ports
A CES is represented by its name as a string or a symbol. Case is ignored. There may be several aliases defined for a single encoding.
A CES name "none" is special. When Gauche’s native encoding is "none",
Gauche just treats a string as a byte sequence, and it’s up to the application
to interpret the sequence in an appropriate encoding. So, conversion
to and from CES "none" does nothing.
You can check whether a specific conversion is supported on your system with the function ces-conversion-supported?.
It returns #t if conversion from the character encoding scheme
(CES) from-ces to to-ces is supported in this system.
Note that this procedure may return true even if the system only supports partial conversion between from-ces and to-ces. In such a case, actual conversion might lose information by coercing characters in from-ces which are not supported in to-ces. (For example, conversion from Unicode to EUC-JP is "supported", although Unicode has characters that are not in EUC-JP.)
Also note that this procedure always returns #t
if from-ces and/or to-ces is "none",
because conversion to/from CES "none" always succeeds (in fact, it does nothing).
;; see if you can convert the internal encoding to EUC-JP
(ces-conversion-supported? (gauche-character-encoding) "euc-jp")
Also there are two useful procedures to deal with CES names.
ces-equivalent? returns true if two CESes ces-a and ces-b are equivalent
to the knowledge of the system. It returns false if they are not.
If the system doesn’t know about equivalency, unknown-value
is returned, whose default is #f.
CES "none" works like a wild card; it is "equivalent" to any CES.
As a consequence, ces-equivalent? is not transitive.
The intended use of ces-equivalent?
is to compare two given CES names and see whether conversion is required.
(ces-equivalent? 'eucjp "EUC-JP") ⇒ #t
(ces-equivalent? 'shift_jis "EUC-JP") ⇒ #f
(ces-equivalent? "NoSuchEncoding" 'utf-8 '?) ⇒ ?
ces-upper-compatible? returns true if a string encoded in CES ces-b can also be regarded as a string encoded in ces-a without conversion, to the knowledge of the system. It returns false if not, and returns unknown-value if the system can’t determine which is the case.
As with ces-equivalent?, CES "none" works like a wildcard.
It is upper-compatible to any CES, and any CES is upper-compatible to "none".
(ces-upper-compatible? "eucjp" "ASCII") ⇒ #t
(ces-upper-compatible? "eucjp" "utf-8") ⇒ #f
(ces-upper-compatible? "utf-8" "NoSuchEncoding" '?) ⇒ ?
Conversion between common Japanese CESes (EUC_JP, Shift JIS, UTF-8
and ISO2022-JP) of the character sets JIS X 0201 and JIS X 0213
is handled by Gauche’s built-in algorithm (see below for details).
When another CES name is given, Gauche uses
iconv(3) if it is linked.
When Gauche’s conversion routine encounters a character that can’t be mapped, it replaces the character with a "geta mark" (U+3013) if it’s a multibyte character in the input encoding, or with ’?’ if it’s a singlebyte character in the input encoding. If that happens in iconv, the handling of such a character depends on the iconv implementation (the glibc implementation returns an error).
If the conversion routine encounters an input sequence that is illegal in the input CES, an error is signaled.
Details of Gauche’s native conversion algorithm: Between EUC_JP, Shift JIS and ISO2022JP, Gauche uses arithmetic conversion whenever possible. This maps even undefined codepoints properly. Between Unicode (UTF-8) and EUC_JP, Gauche uses lookup tables. Between Unicode and Shift JIS or ISO2022JP, Gauche converts the input CES to EUC_JP, then converts it to the output CES. If the same CES is specified for input and output, Gauche’s conversion routine just copies input characters to output characters, without checking the validity of the encoding.
EUC_JP, EUCJP, EUCJ, EUC_JISX0213
Covers ASCII, JIS X 0201 kana, JIS X 0212 and JIS X 0213 character sets. JIS X 0212 character set is supported merely because it uses the code region JIS X 0213 doesn’t use, and JIS X 0212 characters are not converted properly to Shift JIS and UTF-8. Use JIS X 0213.
SHIFT_JIS, SHIFTJIS, SJIS
Covers Shift_JISX0213, except that 0x5c and 0x7e are mapped to the ASCII character set (REVERSE SOLIDUS and TILDE), instead of JIS X 0201 Roman (YEN SIGN and OVERLINE).
UTF-8, UTF8
Unicode 3.2. Note that some JIS X 0213 characters are mapped to Extension B (U+20000 and up). Some JIS X 0213 characters are mapped to two Unicode characters (one base character plus a combining character).
ISO2022JP, CSISO2022JP, ISO2022JP-1, ISO2022JP-2, ISO2022JP-3
These encodings differ a bit (except ISO2022JP and CSISO2022JP, which are synonyms), but Gauche handles them the same way. If one of these CESes is specified as input, Gauche recognizes the escape sequences of any of them. ISO2022JP-2 defines several non-Japanese escape sequences; Gauche recognizes them, but maps the characters to a substitution character (’?’ or geta mark).
For output, Gauche assumes ISO2022JP first, and uses an ISO2022JP-1 escape sequence to put a JIS X 0212 character, or an ISO2022JP-3 escape sequence to put a JIS X 0213 plane 2 character. Thus, if the string contains only JIS X 0208 characters, the output is compatible with ISO2022JP. Precisely speaking, JIS X 0213 specifies some characters in JIS X 0208 codepoints that shouldn’t be mixed with JIS X 0208 characters; Gauche outputs those characters as JIS X 0208 for compatibility. (This is the same policy as Emacs-Mule’s iso2022jp-3-compatible mode.)
There are cases where you don’t know the CES of the input, but you know it is one of several possible encodings. The charconv module has a mechanism to guess the input encoding. There can be multiple algorithms, and each algorithm has a name (a wildcard CES). Right now, there’s only one algorithm implemented:
"*JP"
To guess the character encoding of Japanese text, among ISO2022-JP(-1,2,3), EUCJP, SHIFT_JIS and UTF-8.
The wildcard CES can be used in place of CES name for some conversion functions.
Guesses the CES of string by the character guessing scheme
scheme (e.g. "*JP"). Returns a CES name that can be used
by other charconv functions. It may return #f if the
guessing scheme finds no possible encoding in string.
Note that if there is more than one possible encoding for
string, the guessing scheme returns one of them,
usually in favor of the native CES.
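The guessing procedure described above can be tried on a small byte string. The following is a minimal sketch, assuming the procedure is exported as ces-guess-from-string (the name used in the Gauche manual; it does not appear in the text above) and that gauche.uvector's u8vector->string is available to build the input. The six bytes are intended to be a short Japanese word in EUC-JP; no particular result is claimed here, since the guesser may pick any encoding it considers possible.

```scheme
(use gauche.charconv)
(use gauche.uvector)

;; Six bytes meant to be a three-character Japanese word in EUC-JP.
;; u8vector->string keeps them as a byte string if they are not valid
;; in the native encoding.
(define sample
  (u8vector->string (u8vector #xc6 #xfc #xcb #xdc #xb8 #xec)))

;; ASSUMPTION: ces-guess-from-string takes the string and the scheme name.
(define guessed (ces-guess-from-string sample "*JP"))

;; The result is either a CES name or #f, so handle both cases.
(if guessed
  (print "guessed CES: " guessed)
  (print "no candidate encoding matched"))
```

Checking for #f before passing the result on avoids handing a non-name to the other charconv functions.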
Takes an input port source, which feeds characters encoded in from-code, and returns another input port, from which you can read characters encoded in to-code.
If to-code is omitted, the native CES is assumed.
buffer-size specifies the size of the internal buffer used for conversion. The default size is about 1 kilobyte, which is suitable for typical cases.
If you don’t know the source’s CES, you can specify
a CES guessing scheme, such as
"*JP", in place of from-code.
The conversion port tries to guess the encoding by prefetching
the data from source up to the buffer size. It signals an error
if the code guessing routine finds no appropriate CES.
If the guessing routine finds ambiguous input, however, it silently
assumes one of the possible CESes, in favor of the native CES.
Hence it is possible that the guess is wrong if the buffer
size is too small. The default size is usually enough for most
text documents, but guessing may fail if a large text contains mostly ASCII
characters and multibyte characters appear only at the very end of
the document. To be sure in the worst case,
you have to specify a buffer size large enough to
hold the entire text.
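As an illustration of the worst-case advice above, the following sketch sizes the conversion buffer to the whole file before letting "*JP" guess the encoding. The file name guess-sample.txt is hypothetical and is created by the sketch itself so that it is self-contained; the 1024-byte floor is an arbitrary choice for this example, not a documented minimum.

```scheme
(use gauche.charconv)

;; Hypothetical sample file, written here only to make the sketch runnable.
(define path "guess-sample.txt")
(call-with-output-file path
  (lambda (out) (display "mostly ASCII text\n" out)))

;; Use the file size as the buffer size so the guesser can see every byte.
(define size (max 1024 (slot-ref (sys-stat path) 'size)))

(define text
  (let* ((in (open-input-conversion-port (open-input-file path) "*JP"
                                         :buffer-size size
                                         :owner? #t)) ; close the file at EOF
         (s  (port->string in)))
    (close-input-port in)
    s))
```

For very large inputs this trades memory for reliability of the guess, so it is best reserved for cases where a wrong guess would be costly.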
By default, open-input-conversion-port leaves source open.
If you specify a true value for owner?, the function closes
source after it reads EOF from the port.
For example, the following code copies a file unknown.txt to a file eucjp.txt, converting an unknown Japanese CES to EUC-JP.
(call-with-output-file "eucjp.txt"
  (lambda (out)
    (copy-port (open-input-conversion-port
                 (open-input-file "unknown.txt")
                 "*jp"             ;guess code
                 :to-code "eucjp"
                 :owner? #t)       ;close unknown.txt afterwards
               out)))
Creates and returns an output port that converts given characters from from-code to to-code and feeds them to an output port sink. If from-code is omitted, the native CES is assumed. You can’t specify a character guessing scheme (such as "*JP") for either from-code or to-code.
buffer-size specifies the size of the internal conversion buffer.
The characters put to the returned port may stay in the buffer
until the port is explicitly flushed (by flush) or
the port is closed.
By default, the returned port doesn’t close sink when it is closed itself. If the keyword argument owner? is provided and true, however, it closes sink when it is closed.
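The output side can be sketched in the same way as the input example. This assumes the procedure is open-output-conversion-port, taking the sink and to-code as positional arguments as in the Gauche manual (the signature line is not shown in the text above); the file name eucjp-out.txt is hypothetical.

```scheme
(use gauche.charconv)

;; Hypothetical output file; the sink is an ordinary file port.
(define sink (open-output-file "eucjp-out.txt"))

;; Characters written to conv are converted from the native CES to EUC-JP.
(define conv (open-output-conversion-port sink "euc-jp" :owner? #t))

(display "plain ASCII text\n" conv)

;; Closing conv flushes the buffered output and, because owner? is true,
;; also closes sink.
(close-output-port conv)
```

Passing :owner? #t means only the conversion port has to be closed, which avoids closing sink while converted data is still buffered.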
Converts string’s character encoding from from-code to to-code, and returns the converted string. The returned string may be a byte-string if to-code is different from the native CES.
from-code can be the name of a character guessing scheme (e.g. "*JP"). When to-code is omitted, the native CES is assumed.
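A short round trip shows how the optional to-code behaves. This sketch assumes the procedure described here is ces-convert, as in the Gauche manual (the name is not shown in the text above). It uses an ASCII-only string so that the example does not depend on the encoding of the source file.

```scheme
(use gauche.charconv)

(define original "hello")

;; Native CES -> EUC-JP. The result may be a byte-string.
(define as-eucjp
  (ces-convert original (gauche-character-encoding) "euc-jp"))

;; EUC-JP -> native CES; to-code is omitted, so the native CES is assumed.
(define back (ces-convert as-eucjp "euc-jp"))

(print (string=? original back))
```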
These procedures can be used to perform character I/O with an encoding temporarily different from the original port’s encoding.
call-with-input-conversion takes an input port iport
which uses the character encoding encoding, and
calls proc with one argument, a conversion input port.
From the port, proc can read characters in
Gauche’s internal encoding.
Note that once proc is called, it has to read all the
characters until EOF; see the note below.
call-with-output-conversion takes an output port oport
which expects the character encoding encoding, and
calls proc with one argument,
a temporary conversion output port.
To the port, proc can write characters in
Gauche’s internal encoding.
When proc returns, or exits with an error,
the temporary conversion output port is flushed and closed.
The caller of call-with-output-conversion
can continue to use oport with its original encoding afterwards.
Both procedures return the value(s) that proc returns. The default value of encoding is Gauche’s internal encoding. These procedures don’t create a conversion port when it is not necessary. If conversion-buffer-size is given, it is used as the buffer-size argument when the conversion port is opened.
You shouldn’t use iport/oport directly while proc is active: character encoding conversion is a stateful process, and mixing I/O on the conversion port with I/O on the underlying port will corrupt the state.
Note: with call-with-input-conversion, you can’t
use iport again unless proc reads EOF from it.
This is because a conversion port needs to buffer the input, and
there’s no way to put the buffered input back to iport
when proc returns.
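The following sketch shows the output variant switching encoding for part of a file and then continuing with the original port. The file name report.txt and the text written are hypothetical; the keyword :encoding is the one described above.

```scheme
(use gauche.charconv)

(call-with-output-file "report.txt"
  (lambda (oport)
    ;; Inside proc, write only to conv, never to oport directly.
    (call-with-output-conversion oport
      (lambda (conv)
        (display "body written through the conversion port\n" conv))
      :encoding "euc-jp")
    ;; The temporary port has been flushed and closed at this point,
    ;; so oport can be used again.
    (display "trailer written directly\n" oport)))
```

Because the example writes only ASCII text, the bytes in the file are the same whichever encoding is in effect; with non-ASCII text the two parts would be encoded differently.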
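Similar to call-with-input-conversion and call-with-output-conversion,
but these procedures call thunk without arguments,
while the conversion port is set as the current input or output port, respectively.
The meanings of the keyword arguments are the same as for call-with-input-conversion and call-with-output-conversion.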
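A minimal sketch of the thunk-based form, assuming the output procedure is named with-output-conversion as in the Gauche manual (the name is not shown in the text above). The file name current-port.txt is hypothetical.

```scheme
(use gauche.charconv)

(call-with-output-file "current-port.txt"
  (lambda (oport)
    ;; ASSUMPTION: with-output-conversion takes the port, a thunk, and
    ;; the same :encoding keyword as call-with-output-conversion.
    (with-output-conversion oport
      (lambda ()
        ;; print writes to the current output port, which is the
        ;; conversion port while the thunk runs.
        (print "written via the current output port"))
      :encoding "euc-jp")))
```

This form is convenient when the body calls existing code that writes to the current output port and cannot easily be passed a port argument.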
Convenience procedures that avoid adding an unnecessary conversion port.
Each procedure works like
open-input-conversion-port or open-output-conversion-port, respectively,
except that if the system knows no conversion is needed,
no conversion port is created and port is returned as is.
When a conversion port is created, port is always owned by the conversion port.
When you want to close the port, always close the port returned by
wrap-with-*-conversion, instead of the original port.
If you close the original port first, the pending conversion
won’t be flushed. (Some conversions require a trailing sequence that
is generated only when the conversion port is closing, so a simple
flush isn’t enough.)
The buffer-size argument is passed to
open-input-conversion-port or open-output-conversion-port.
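The closing rule can be illustrated with a short sketch. It assumes the output procedure is wrap-with-output-conversion, taking the port and to-code positionally as in the Gauche manual; the file name wrapped.txt is hypothetical.

```scheme
(use gauche.charconv)

(define raw (open-output-file "wrapped.txt"))

;; out is either a new conversion port that owns raw, or raw itself
;; when the system knows that no conversion is needed.
(define out (wrap-with-output-conversion raw "euc-jp"))

(display "text for the wrapped port\n" out)

;; Always close the returned port, not raw. In both cases this flushes
;; any pending conversion and closes the underlying file.
(close-output-port out)
```

Closing only the returned port works in both cases, so the caller does not need to know whether a conversion port was actually created.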