Utf8Only


We're planning to use utf-8 exclusively as the internal character encoding scheme (CES) in the future. That is, we'll eventually drop the --enable-multibyte=ENCODING configure option.

This document is for those who have been using Gauche with an internal CES other than utf-8. In particular, it describes how to migrate your system to a utf-8 Gauche.

Rationale

When we started developing Gauche, it wasn't unusual for most of the text a user dealt with to be in their local CES, such as Latin-1 or EUC-JP, and aligning the internal CES with the main CES they used had the advantage of reducing conversion overhead. These days, however, it has become inevitable that you receive text in various encodings, most likely in one of the Unicode encodings, and having a limited subset of characters internally is a disadvantage, because conversion would lose information.

Besides, text processing is not only about CES. Unicode standardizes text segmentation, normalization, case conversion, etc. If the internal CES is non-Unicode, those rules can be partial functions, and we need to come up with a reasonable way to fill the gap for each case. Maintaining the same level of features consistently across various encodings is difficult.

So we decided that the only reasonable choice is to maintain a single internal encoding that is a common multiple, that is, one that can represent the characters of all the other encodings.

What to expect

You will still be able to keep your code and data in your favorite CES. If you don't specify a CES when you open a file, Gauche uses the value of default-file-encoding. Its default value is Gauche's native CES. If you switch the native CES, you might want to set the value of this parameter somewhere in your application.
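The following is a minimal sketch of setting that parameter. It assumes default-file-encoding behaves as an ordinary Gauche parameter and accepts a CES name given as a symbol; the choice of euc-jp and the helper name read-first-line are made up for illustration.

```scheme
;; Hypothetical setup: the application's files are in EUC-JP.
;; Set the default once, e.g. near the top of the main script.
(default-file-encoding 'euc-jp)

;; Alternatively, override it only around a particular piece of I/O.
;; read-first-line is an illustrative helper, not part of Gauche.
(define (read-first-line path)
  (parameterize ((default-file-encoding 'euc-jp))
    (call-with-input-file path read-line)))
```

Setting it globally keeps existing code unchanged, while parameterize limits the effect to the enclosed I/O.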

Regarding the source code, you can mark its CES by placing a magic comment near the beginning of your source. A magic comment is something like this:

;; coding: latin-1

See Multibyte scripts for the details.

Regarding the data, it depends on how you read and write them. Gauche's port opening operations (e.g. with-input-from-file) universally take an :encoding keyword argument that specifies the external encoding of the data. If you have relied on the default encoding, add :encoding arguments; the code then works with both versions of Gauche (the one with your CES, and the one with utf-8).
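As a rough sketch of what adding :encoding looks like, the helpers below read and write a file in an explicitly named external CES. The function names and the choice of euc-jp are assumptions for illustration only.

```scheme
;; Read a whole file whose external encoding is EUC-JP.
;; The result is a string in whatever the internal CES is.
(define (read-legacy-text path)
  (call-with-input-file path port->string :encoding 'euc-jp))

;; Write text back out in EUC-JP, regardless of the internal CES.
(define (write-legacy-text path text)
  (call-with-output-file path
    (lambda (out) (display text out))
    :encoding 'euc-jp))
```

Because the external encoding is stated at every open, the same code should behave identically on a Gauche built with EUC-JP and one built with utf-8.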

If your data file is plain text and you just need to read it, another option is to use a coding-aware port to read it (see Coding-aware ports). Then you can use the coding: magic comment in the input text.
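A minimal sketch of that approach, assuming the file is opened without conversion and then wrapped with open-coding-aware-port; the helper name is hypothetical.

```scheme
;; Read an entire text file, honoring a "coding:" magic comment
;; if one appears near the beginning of the file.
;; read-text-with-magic-comment is an illustrative name.
(define (read-text-with-magic-comment path)
  (call-with-input-file path
    (lambda (in)
      (port->string (open-coding-aware-port in)))))
```

This moves the encoding declaration into the data file itself, so the reading code does not need to know the CES in advance.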

Sometimes it may be easier to batch-convert the data files to utf-8; then you avoid the conversion overhead every time you read or write the data.
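One possible way to do such a one-off conversion from Gauche itself is sketched below. The function name convert-file-to-utf-8 and the file names in the usage line are made up; only the :encoding arguments and copy-port do the actual work.

```scheme
;; Copy SRC to DST, decoding SRC from FROM-CES and encoding DST as utf-8.
;; convert-file-to-utf-8 is an illustrative helper, not part of Gauche.
(define (convert-file-to-utf-8 src dst from-ces)
  (call-with-input-file src
    (lambda (in)
      (call-with-output-file dst
        (lambda (out)
          ;; Copy character by character so that both ports apply
          ;; their respective conversions.
          (copy-port in out :unit 'char))
        :encoding 'utf-8))
    :encoding from-ces))

;; Example use (hypothetical file names):
;; (convert-file-to-utf-8 "data.euc.txt" "data.utf8.txt" 'euc-jp)
```

Writing to a separate destination file keeps the original intact until the converted copy has been checked.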

Be aware of raw code point values. If you compare the value of (char->integer <char>) with a raw integer, that value depends on the internal encoding. If you've been using Latin-1 with Gauche configured with --enable-multibyte=none, you might have assumed that character codes are in the range [0,255], but that will no longer be the case.
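One way to make such range checks independent of the internal encoding is to ask for the Unicode code point explicitly with char->ucs. The sketch below is only an illustration; the predicate name latin-1-range? is hypothetical.

```scheme
;; Fragile: the integer depends on the internal CES.
;;   (< (char->integer c) 256)

;; More robust: compare against the Unicode code point instead.
;; char->ucs returns #f when the character has no Unicode mapping,
;; so that case is treated as "not in range".
(define (latin-1-range? c)
  (let1 u (char->ucs c)
    (and u (< u 256))))
```

Where the code needs the reverse direction, ucs->char can be used in the same spirit instead of integer->char.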

Starting with 0.9.12, we'll issue a warning when you configure Gauche with a non-utf-8 encoding.


Last modified : 2023/10/09 10:46:57 UTC