gauche.charconv - Character Code Conversion
This module defines a set of functions that convert the character encoding scheme (CES) of a given data stream.
This module is implicitly loaded when the :encoding keyword argument is given to the file-port creating procedures (such as open-input-file and call-with-output-file).
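For instance, the following call loads gauche.charconv implicitly and returns an input port that reads the file as EUC-JP text (the file name is just for illustration):
(open-input-file "japanese.txt" :encoding "eucjp")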
For portable programs, you can use transcoded ports defined in SRFI-181 (see Transcoded ports).
• Supported character encoding schemes
• Autodetecting the encoding scheme
• Conversion ports
A CES is represented by its name as a string or a symbol. Case is ignored. There may be several aliases defined for a single encoding.
A CES name "none" is special; it means the string is an octet sequence and it’s up to the application to interpret the sequence in an appropriate encoding. So, conversion to and from CES "none" does nothing.
Gauche natively supports conversions between Unicode transfer encodings (UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE), Latin-N encodings (ISO8859-1 to 16), and typical Japanese character encodings: ISO2022JP, ISO2022JP-3, EUC-JP (EUC-JISX0213), Shift_JISX0213.
Conversions between other encodings are handled by iconv(3) by default. However, the iconv(3) API lacks a way to customize the behavior when an input character can't be encoded in the output CES. If you need control over that behavior, you can disable the delegation to iconv(3) with the following parameter.
Parameter: external-conversion-library
{gauche.charconv}
The value of this parameter can be the symbol iconv or #f. The default value is iconv.
Conversion ports opened while this parameter is iconv will use the iconv(3) library if the requested conversion isn't supported by Gauche's native converters. This only matters when the conversion port is opened; once it is opened, the parameter value is irrelevant.
You can check whether a specific conversion is supported on your system by the following function.
Function: ces-conversion-supported? from-ces to-ces
{gauche.charconv}
Returns #t
if conversion from the character encoding scheme
(CES) from-ces to to-ces is supported in this system.
Note that this procedure may return true even if the system only supports partial conversion between from-ces and to-ces. In such a case, the actual conversion might lose information by coercing characters in from-ces which are not supported in to-ces. (For example, conversion from Unicode to EUC-JP is "supported", although Unicode has characters that are not in EUC-JP.)
Also note that this procedure always returns #t
if from-ces and/or to-ces is "none",
for conversion to/from CES "none" always succeeds (in fact, it does nothing).
This procedure may be affected by the value of the parameter external-conversion-library.
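For example (a sketch; the exact results depend on your system, and koi8-r is used merely as an example of an encoding Gauche doesn't convert natively):
(ces-conversion-supported? "utf-8" "eucjp")   ⇒ #t   ; handled natively
(ces-conversion-supported? "none" "eucjp")    ⇒ #t   ; "none" always succeeds
;; With iconv(3) delegation disabled, only native conversions count:
(parameterize ((external-conversion-library #f))
  (ces-conversion-supported? "koi8-r" "eucjp"))      ⇒ #f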
There are also two useful procedures to deal with CES names.
Function: ces-equivalent? ces-a ces-b :optional unknown-value
{gauche.charconv}
Returns true if two CESes ces-a and ces-b are equivalent
to the knowledge of the system. Returns false if they are not.
If the system doesn't know about equivalency, unknown-value is returned, whose default is #f.
CES "none" works like a wild card; it is "equivalent" to any CES.
(Thus, ces-equivalent?
is not transitive.
The intended use of ces-equivalent?
is to compare two given CES names and see if conversion is required or not).
(ces-equivalent? 'eucjp "EUC-JP")             ⇒ #t
(ces-equivalent? 'shift_jis "EUC-JP")         ⇒ #f
(ces-equivalent? "NoSuchEncoding" 'utf-8 '?)  ⇒ ?
Function: ces-upper-compatible? ces-a ces-b :optional unknown-value
{gauche.charconv}
Returns true if a string encoded in CES ces-b can also
be regarded as a string encoded in ces-a without conversion,
to the knowledge of the system.
Returns false if not. Returns unknown-value
if the system can’t determine which is the case.
Like ces-equivalent?
, CES "none" works like a wildcard.
It is upper-compatible to any CES, and any CES is upper-compatible to
"none".
(ces-upper-compatible? "eucjp" "ASCII")              ⇒ #t
(ces-upper-compatible? "eucjp" "utf-8")              ⇒ #f
(ces-upper-compatible? "utf-8" "NoSuchEncoding" '?)  ⇒ ?
When Gauche's internal conversion routine encounters a character that can't be mapped, the behavior depends on the illegal output handling mode of the conversion port, specified by the illegal-output keyword argument.
If the mode is raise, an <io-encoding-error> is thrown. If the mode is replace, the character is replaced with a replacement character. The replacement character is U+FFFD (REPLACEMENT CHARACTER) if it is available. For Japanese encodings, U+FFFD isn't available, and we use U+3013 (geta mark), for it is traditionally used as the replacement character. If neither one is available, ? is used.
If that happens in iconv, the handling of such a character depends on the iconv implementation (the glibc implementation returns an error).
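For instance, with ces-convert (described below), which raises by default (a sketch assuming the native encoding is utf-8; U+1F600 is used as a character with no counterpart in EUC-JP):
(ces-convert "\x1f600;" "utf-8" "eucjp")
  ;; ⇒ raises <io-encoding-error>
(ces-convert "\x1f600;" "utf-8" "eucjp" :illegal-output 'replace)
  ;; ⇒ a string containing U+3013 (geta mark) encoded in EUC-JP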
If the conversion routine encounters an input sequence that
is illegal in the input CES, an <io-decoding-error>
is signaled.
Unicode character U+FEFF (Zero-Width No-Break Space) can have a special meaning if it appears at the very beginning of UTF stream. It serves as a BOM (Byte-order mark) to signify the byte order of the following UTF data. For UTF-16 and UTF-32, it is critical to know the byte order. UTF-8 does not need one, for the byte order doesn’t matter. Nevertheless, some software adds BOM to a UTF-8 data just to indicate it is in UTF-8.
Technically, BOM is not a part of the text content, but rather a piece of meta-information about the format. That poses an issue; when you deal with a data stream, sometimes you just want to deal with the content, while at other times you want to deal with the entire data, including the meta-information. Traditionally those two are not strictly distinguished and are dealt with in an ad-hoc way. We take the following approach, depending on the specified encoding (see the example after the list).
UTF-8
We don't treat BOM specially; if the first codepoint is U+FEFF, it is read as the character #\ufeff. For output, no BOM will be produced. This is the default behavior of I/O.
UTF-8-BOM
This is a 'pseudo' encoding: it is UTF-8, but if the input data begins with a BOM, it is simply ignored. This is for the convenience of programs that just don't want to be bothered by an optional BOM at the beginning of a UTF-8 stream. This encoding can't be used for output. If you absolutely need to produce UTF-8 with a BOM, just write #\ufeff at the beginning of the UTF-8 stream.
UTF-16, UTF-32
The input recognizes BOM and decides the byte order; BOM itself won’t appear in the read data. If BOM is missing, big-endian (UTF-16BE, UTF-32BE) is assumed. The output emits BOM at the beginning of the data.
UTF-16LE, UTF-32LE, UTF-16BE, UTF-32BE
We assume the byte-order meta-information is given via a separate channel, so that the caller already knows the byte order of the input. These do not treat BOM specially; if the first codepoint is U+FEFF, it is read as the character #\ufeff. For output, no BOM will be produced.
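For example, the same little-endian octet sequence is decoded differently by "utf-16" and "utf-16le" (a sketch using ces-convert, described below, with the native encoding assumed to be utf-8):
;; #xff #xfe is a little-endian BOM; #x41 #x00 is "A" in UTF-16LE.
(ces-convert #u8(#xff #xfe #x41 #x00) "utf-16")    ⇒ "A"         ; BOM consumed, LE detected
(ces-convert #u8(#xff #xfe #x41 #x00) "utf-16le")  ⇒ "\ufeffA"   ; BOM read as #\ufeff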
Between EUC_JP, Shift JIS and ISO2022JP, Gauche uses arithmetic conversion whenever possible. This even maps undefined codepoints properly. Between Unicode (UTF-8) and EUC_JP, Gauche uses lookup tables. Between Unicode and Shift JIS or ISO2022JP, Gauche converts the input CES to EUC_JP, then converts it to the output CES. ISO8859-N are converted to Unicode using tables, then converted to the output CES if necessary. If the same CES is specified for input and output, Gauche's conversion routine just copies input characters to output characters, without checking the validity of the encodings.
EUC_JP, EUCJP, EUCJ, EUC_JISX0213
Covers ASCII, JIS X 0201 kana, JIS X 0212 and JIS X 0213 character sets. JIS X 0212 character set is supported merely because it uses the code region JIS X 0213 doesn’t use, and JIS X 0212 characters are not converted properly to Shift JIS and UTF-8. Use JIS X 0213.
SHIFT_JIS, SHIFTJIS, SJIS
Covers Shift_JISX0213, except that 0x5c and 0x7e are mapped to the ASCII character set (REVERSE SOLIDUS and TILDE), instead of JIS X 0201 Roman (YEN SIGN and OVERLINE).
UTF-8, UTF8
Unicode. Note that some JIS X 0213 characters are mapped to Extension B (U+20000 and up). Some JIS X 0213 characters are mapped to two Unicode characters (one base character plus a combining character).
ISO2022JP, CSISO2022JP, ISO2022JP-1, ISO2022JP-2, ISO2022JP-3
These encodings differ a bit (except ISO2022JP and CSISO2022JP, which are synonyms), but Gauche handles them the same. If one of these CESes is specified as input, Gauche recognizes the escape sequences of any of them. ISO2022JP-2 defines several non-Japanese escape sequences; they are recognized by Gauche, but mapped to the substitution character ('?' or geta mark).
For output, Gauche assumes ISO2022JP first, and uses the ISO2022JP-1 escape sequence to put a JIS X 0212 character, or the ISO2022JP-3 escape sequence to put a JIS X 0213 plane 2 character. Thus, if the string contains only JIS X 0208 characters, the output is compatible with ISO2022JP. Precisely speaking, JIS X 0213 specifies some characters in JIS X 0208 codepoints that shouldn't be mixed with JIS X 0208 characters; Gauche outputs those characters as JIS X 0208 for compatibility. (This is the same policy as Emacs-Mule's iso2022jp-3-compatible mode.)
There are cases where you don't know the CES of the input, but you know it is one of several possible encodings. The charconv module has a mechanism to guess the input encoding. There can be multiple algorithms, and each algorithm has a name (a wildcard CES). Right now, only one algorithm is implemented:
"*JP"
Guesses the character encoding of Japanese text, from among ISO2022JP(-1,2,3), EUC-JP, Shift_JIS and UTF-8.
(Even when the input is UTF-8 with BOM, it is still recognized as UTF-8, not UTF-8-BOM).
The wildcard CES can be used in place of CES name for some conversion functions.
Function: ces-guess-from-string string scheme
{gauche.charconv}
Guesses the CES of string by the character guessing scheme scheme (e.g. "*JP"). Returns a CES name that can be used by other charconv functions. It may return #f if the guessing scheme finds no possible encoding in string. Note that if there is more than one possible encoding for string, the guessing scheme returns one of them, usually in favor of the native CES.
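A typical use is to guess the encoding of raw input before converting it. The following is a minimal sketch; read-japanese-text is a hypothetical helper, and the file is read with CES "none" so that its octets are kept untouched:
(define (read-japanese-text file)
  (let* ((str (call-with-input-file file port->string :encoding "none"))
         (ces (ces-guess-from-string str "*JP")))
    (if ces
        (ces-convert str ces)        ; convert to the native CES
        (error "cannot determine the encoding:" file))))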
Function: open-input-conversion-port source from-code :key to-code owner? buffer-size illegal-output
{gauche.charconv}
Takes an input port source, which feeds characters
encoded in from-code, and returns another input port,
from which you can read characters encoded in to-code.
If to-code is omitted, the native CES is assumed.
buffer-size specifies the size of the internal conversion buffer. The default size is about 1 kilobyte, which is suitable for typical cases.
The illegal-output keyword argument specifies the behavior when the output CES doesn't have a character corresponding to an input character. It can be the symbol raise, to raise an <io-encoding-error> in such cases, or the symbol replace, to replace the character with a replacement character appropriate for the output CES. If omitted, raise is assumed.
Note that the iconv(3) library API doesn't offer an option to choose the illegal-output handling mode. So when the conversion is delegated to iconv(3), illegal-output is ignored and the behavior follows the underlying iconv(3) implementation. If you need to make sure illegal-output is honored, you can bind the parameter external-conversion-library to #f when calling this procedure; then the conversion port won't use iconv(3), and an unsupported-encoding error is raised if the conversion can't be handled entirely within Gauche.
By default, open-input-conversion-port
leaves source open.
If you specify a true value for owner?, the function closes source after it reads EOF from the port.
If you don't know the source's CES, you can specify a CES guessing scheme, such as "*JP", in place of from-code. The conversion port tries to guess the encoding by prefetching data from source, up to the buffer size. It signals an error if the code guessing routine finds no appropriate CES. If the guessing routine finds the input ambiguous, however, it silently assumes one of the possible CESes, in favor of the native CES. Hence it is possible that the guess is wrong if the buffer size is too small. The default size is usually enough for most text documents, but it may fail if a large text consists mostly of ASCII characters and multibyte characters appear only at the very end of the document. To be sure in the worst case, you have to specify a buffer size large enough to hold the entire text.
For example, the following code copies a file unknown.txt to a file eucjp.txt, converting an unknown Japanese CES to EUC-JP.
(call-with-output-file "eucjp.txt"
  (lambda (out)
    (copy-port (open-input-conversion-port
                 (open-input-file "unknown.txt")
                 "*jp"               ;guess code
                 :to-code "eucjp"
                 :owner? #t)         ;close unknown.txt afterwards
               out)))
For portable code, you can also use SRFI-181 transcoded-port (see Transcoded ports).
Function: open-output-conversion-port sink to-code :key from-code owner? buffer-size illegal-output
{gauche.charconv}
Creates and returns an output port that converts given characters from from-code to to-code and feeds them to an output port sink. If from-code is omitted, the native CES is assumed. You can't specify a character guessing scheme (such as "*JP") for either from-code or to-code.
buffer-size specifies the size of the internal conversion buffer. The characters written to the returned port may stay in the buffer until the port is explicitly flushed (by flush) or closed.
By default, the returned port doesn't close sink when it is closed itself. If the keyword argument owner? is given and true, however, closing the returned port also closes sink.
The illegal-output keyword argument is the same as for open-input-conversion-port.
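For example, the following sketch writes EUC-JP text to a file, converting from the native encoding (the file name and the text are just for illustration):
(call-with-output-file "eucjp.txt"
  (lambda (out)
    (let ((cout (open-output-conversion-port out "eucjp")))
      (display "日本語テキスト" cout)
      (close-output-port cout))))   ; flushes pending conversion; out stays open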
For portable code, you can also use SRFI-181 transcoded-port (see Transcoded ports).
Function: ces-convert source from-code :optional to-code :key illegal-output
Function: ces-convert-to return-type source from-code :optional to-code :key illegal-output
{gauche.charconv}
Converts source, which is a string or a u8vector encoded in from-code, to a string or u8vector encoded in to-code. If to-code is omitted, the native CES is assumed.
In ces-convert-to, you can specify the return type by the return-type argument; it must be either the class object <string> or <u8vector>. On the other hand, ces-convert always returns a string, regardless of the type of source.
If to-code is different from the native CES and a string is returned, it can be an incomplete string. This is for backward compatibility; in general, we recommend using a u8vector to represent a multibyte sequence in a CES other than the native encoding.
from-code can be a name of character guessing scheme (e.g. "*JP").
The keyword argument illegal-output controls the behavior when the input contains a character that can't be encoded in the output. See open-input-conversion-port above for the description. By default, an <io-encoding-error> is raised, except when the conversion is delegated to iconv(3), in which case the behavior depends on the external library.
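Some examples (a sketch, assuming the native encoding is utf-8; the class <u8vector> is provided by gauche.uvector):
(ces-convert #u8(#xa4 #xa2) "eucjp")              ⇒ "あ"
(ces-convert-to <u8vector> "あ" "utf-8" "eucjp")   ⇒ #u8(#xa4 #xa2)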
For portable code, you can also use SRFI-181 bytevector->string and string->bytevector (see Transcoded ports).
Function: call-with-input-conversion iport proc :key encoding conversion-buffer-size
Function: call-with-output-conversion oport proc :key encoding conversion-buffer-size
{gauche.charconv}
These procedures can be used to perform character I/O with an encoding temporarily different from the original port's encoding.
call-with-input-conversion
takes an input port iport
which uses the character encoding encoding, and
calls proc with one argument, a conversion input port.
From the port, proc can read characters in utf-8.
Note that once proc is called, it has to read all the
characters until EOF; see the note below.
call-with-output-conversion
takes an output port oport
which expects the character encoding encoding, and
calls proc with one argument,
a temporary conversion output port.
To the port, proc can write characters in utf-8.
When proc returns, or it exits with an error,
the temporary conversion output port is flushed and closed.
The caller of call-with-output-conversion
can continue to use oport with original encoding afterwards.
Both procedures return the value(s) that proc returns. The default value of encoding is Gauche's internal encoding. These procedures don't create a conversion port when it is not necessary. If conversion-buffer-size is given, it is used as the buffer-size argument when the conversion port is opened.
You shouldn’t use iport/oport directly while proc is active—character encoding is a stateful process, and mixing I/O from/to the conversion port and the underlying port will screw up the state.
Note: for the call-with-input-conversion
, you can’t
use iport again unless proc reads EOF from it.
It’s because a conversion port needs to buffer the input, and
there’s no way to undo the buffered input to iport
when proc returns.
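For example (a minimal sketch; the file name is just for illustration, and the sink is assumed to expect EUC-JP):
(define oport (open-output-file "eucjp.txt"))   ; hypothetical sink expecting EUC-JP
(call-with-output-conversion oport
  (lambda (p) (display "日本語" p))   ; written in the internal encoding
  :encoding "eucjp")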
Function: with-input-conversion iport thunk :key encoding conversion-buffer-size
Function: with-output-conversion oport thunk :key encoding conversion-buffer-size
{gauche.charconv}
Similar to call-with-*-conversion
,
but these procedures call thunk without arguments,
while the conversion port is set as the current input or output port,
respectively.
The meaning of the keyword arguments is the same as for call-with-*-conversion.
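The previous example can be written with with-output-conversion, which makes the conversion port the current output port (again a sketch, using the same oport as above):
(with-output-conversion oport
  (lambda () (display "日本語"))      ; goes to the conversion port
  :encoding "eucjp")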
Function: wrap-with-input-conversion port from-code :key to-code buffer-size illegal-output
Function: wrap-with-output-conversion port to-code :key from-code buffer-size illegal-output
{gauche.charconv}
Convenience procedures to avoid adding an unnecessary conversion port. Each works like open-input-conversion-port and open-output-conversion-port, respectively, except that if the system knows no conversion is needed, no conversion port is created and port is returned as is.
When a conversion port is created, port is always owned by it. When you want to close the port, always close the port returned by wrap-with-*-conversion, instead of the original port. If you close the original port first, the pending conversion won't be flushed. (Some conversions require a trailing sequence that is generated only when the conversion port is being closed, so simply calling flush isn't enough.)
The buffer-size and illegal-output arguments are passed to open-*-conversion-port.
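For example (a sketch; the file name is just for illustration). Note that we close the wrapped port, not the original file port:
(let* ((f (open-output-file "sjis.txt"))
       (p (wrap-with-output-conversion f "sjis")))
  (display "日本語" p)
  ;; Closing p flushes the conversion; if a conversion port was created,
  ;; it owns f and closes it as well.
  (close-output-port p))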