For Development HEAD DRAFT

12.81 text.tr - Transliterate characters

Module: text.tr

This module implements a transliterate function, which substitutes characters of the input string. This functionality is found in the Unix tr(1) command, and is incorporated in various programs such as sed(1) and perl.

Gauche’s tr is aware of multibyte characters.

Function: tr from-list to-list :key :complement :delete :squeeze :table-size :input :output

{text.tr} Reads from input and writes to output, transliterating characters in from-list to the corresponding ones in to-list. Characters that don’t appear in from-list are passed through.

The default values of input and output are current input port and current output port, respectively.
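As a minimal sketch of the input and output keyword arguments, the following passes string ports explicitly instead of relying on the current ports. The sample text is made up for illustration; open-input-string and call-with-output-string are the usual Gauche string port procedures.

```scheme
(use text.tr)

;; Read from a string port and collect the result in a string port.
(call-with-output-string
  (lambda (out)
    (tr "abc" "xyz"
        :input (open-input-string "cabbage")
        :output out)))
;; ⇒ "zxyyxge"
```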

Both from-list and to-list must be strings. They may contain the following special syntax. Other characters that don’t fit the syntax are taken as they are.

x-y

Expanded to the increasing sequence of characters from x to y, inclusive. The order is determined by the internal character encoding system; generally it is safer to limit use of this within the range of the same character class. The character x must be before y.

x*n

Repeats x n times, where n is written in decimal notation. Meaningful only in to-list; it is an error to use this form in from-list. If n is omitted or zero, x is repeated until to-list matches the length of from-list (any character after it is ignored).

\x

Represents x itself. Use this escape to prevent a special character from being interpreted as part of the special syntax. Note that if you place a backslash in a string, you must write \\, since the Scheme reader also interprets backslash as a special character.

There’s no special sequence to represent non-graphical characters, since you can include such characters using the string syntax.

Here are some basic examples.

;; swaps case of input
(tr "A-Za-z" "a-zA-Z")

;; replaces 7-bit non-graphical characters with ‘?’
(tr "\x00-\x1f\x7f" "?*")
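The range, repetition, and escape forms described above can be tried on small strings with string-tr (described below). This is only an illustrative sketch; the sample strings are made up.

```scheme
(use text.tr)

;; x-y: map each lowercase letter to the next one
(string-tr "hello" "a-y" "b-z")      ; ⇒ "ifmmp"

;; x*n and x*: a, b, c become x; the remaining d, e, f become y
(string-tr "abcdef" "a-f" "x*3y*")   ; ⇒ "xxxyyy"

;; \x: the escaped hyphen is a literal character, not a range operator
(string-tr "a-b-c" "a\\-c" "123")    ; ⇒ "12b23"
```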

If from-list contains duplicated characters, the first correspondence is used, and the subsequent correspondences are ignored.

(string-tr "abc" "aabbcc" "123456")
 ⇒ "135"

If to-list is shorter than from-list, the behavior depends on the keyword argument delete. If a true value is given, characters that appear in from-list but not in to-list are deleted. Otherwise, the extra characters in from-list are just passed through.

(string-tr "abracadabra" "abc" "" :delete #t)
 ⇒ "rdr"
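To contrast the two behaviors when to-list is non-empty but shorter than from-list, here is a small sketch based on the description above. The to-list "AB" is chosen only for illustration.

```scheme
(use text.tr)

;; Without :delete, c has no counterpart in to-list and is passed through.
(string-tr "abracadabra" "abc" "AB")             ; ⇒ "ABrAcAdABrA"

;; With :delete #t, c is removed from the output.
(string-tr "abracadabra" "abc" "AB" :delete #t)  ; ⇒ "ABrAAdABrA"
```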

When a true value is given to complement, the character set in from-list is complemented. Note that this implies a huge set of characters, so it is not very useful unless either the output character set is a single character (using ‘*’) or it is used with the delete keyword.
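The two useful combinations mentioned above might look like the following sketch, which keeps ASCII letters and either masks or drops everything else. The sample string is made up.

```scheme
(use text.tr)

;; Every character that is NOT a letter is replaced by ?
(string-tr "Hello, World!" "A-Za-z" "?*" :complement #t)
;; ⇒ "Hello??World?"

;; Every character that is NOT a letter is deleted
(string-tr "Hello, World!" "A-Za-z" "" :complement #t :delete #t)
;; ⇒ "HelloWorld"
```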

When a true value is given to squeeze, a run of the same replaced character is squeezed to one. If to-list is empty, a run of the same character in from-list is squeezed.
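A short sketch of both squeeze cases, following the description above. The word is chosen only because it contains several doubled letters.

```scheme
(use text.tr)

;; Runs of the same replaced character collapse to a single character.
(string-tr "bookkeeper" "a-z" "A-Z" :squeeze #t)  ; ⇒ "BOKEPER"

;; Empty to-list: runs of the same from-list character are squeezed as is.
(string-tr "bookkeeper" "a-z" "" :squeeze #t)     ; ⇒ "bokeper"
```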

Internally, tr builds a table to map the characters for efficiency. Since Gauche can deal with a potentially huge set of characters, it limits the use of the table to characters with small codes (below 256 by default). If you transliterate multibyte characters in a large text, however, you might want to use a larger table, trading memory usage for speed. You can specify the internal table size with the table-size keyword argument. For example, if you transliterate lots of EUC-JP hiragana text to katakana, you may want to set the table size greater than 42483 (the character code of the last katakana).
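As a hedged sketch of the table-size argument, the following assumes a Gauche built with UTF-8 as the internal encoding, so that ranges follow Unicode code points rather than the EUC-JP codes mentioned above. Under that assumption the katakana ン is U+30F3 (12531), and the value 12544 is simply an illustrative size above it. The table size should affect only speed and memory, not the result.

```scheme
(use text.tr)

;; Assumption: UTF-8 internal encoding, where ぁ-ん and ァ-ン are
;; parallel ranges of the same length.
(string-tr "ひらがな" "ぁ-ん" "ァ-ン" :table-size 12544)
;; ⇒ "ヒラガナ"
```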

Note that the pre-calculation to build the transliteration table has some overhead. If you want to call tr many times inside a loop, consider using build-transliterator described below.

Function: string-tr string from-list to-list :key :complement :delete :squeeze :table-size

{text.tr} Works like tr, except that input is taken from the string string and the result is returned as a string.

Function: build-transliterator from-list to-list :key :complement :delete :squeeze :table-size :input :output

{text.tr} Returns a procedure that does the actual transliteration. This effectively “pre-compiles” the internal data structure. If you want to run tr with the same sets repeatedly, you may build the procedure once and apply it repeatedly, saving the overhead of initialization.

A note on an edge case: When the input and/or output keyword arguments are omitted, the created transliterator uses the current input port and/or current output port in effect at the time the transliterator is called.

(with-input-from-file "huge-file.txt"
  (lambda ()
    (let loop ((line (read-line)))
      (unless (eof-object? line)
        (with-input-from-string line
          (lambda () (tr "A-Za-z" "a-zA-Z")))
        (newline)
        (loop (read-line))))))

;; runs more efficiently...

(with-input-from-file "huge-file.txt"
  (lambda ()
    (let ((ptr (build-transliterator "A-Za-z" "a-zA-Z")))
      (let loop ((line (read-line)))
        (unless (eof-object? line)
          (with-input-from-string line ptr)
          (newline)
          (loop (read-line)))))))
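The edge case noted above, where the ports are looked up at call time, makes it possible to reuse one transliterator across different ports. The following is a minimal sketch; swap-case and swap-case-string are hypothetical names introduced only for this illustration.

```scheme
(use text.tr)

;; Built once. No :input or :output is given, so the current ports are
;; looked up each time the procedure is called.
(define swap-case (build-transliterator "A-Za-z" "a-zA-Z"))

;; Hypothetical helper: run the transliterator over a string by
;; rebinding the current input and output ports to string ports.
(define (swap-case-string s)
  (with-output-to-string
    (lambda () (with-input-from-string s swap-case))))

(swap-case-string "Hello, World!")  ; ⇒ "hELLO, wORLD!"
(swap-case-string "Gauche")         ; ⇒ "gAUCHE"
```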

