Textutils - Text manipulation utility for STk

version 0.3.3

Shiro Kawai (shiro@acm.org)


1. Introduction

1.1 Overview

The Scheme library textutils is a miscellaneous collection of functions and macros for processing strings and text data. Most of them are small and simple enough that I used to rewrite them whenever I needed them, until I finally decided to avoid the duplicated effort.

Some of the ideas are borrowed from unix tools or from functions in other scripting languages that I find handy in my daily chores as a programmer.

The SRFI-13 draft ("String Libraries", http://srfi.schemers.org/srfi-13/srfi-13.html) defines a fairly complete set of string-manipulation functions. Some of the functions in this library offer features similar to those in the SRFI. The SRFI versions take more optional arguments to customize their behavior and are also stricter about their arguments. My versions can be seen as convenient wrappers over the SRFI's.

The area this library covers is somewhat vague. You may find only a few of the features useful; in that case, just pick out the code you are interested in.

The functions provided so far are:

conversion
Convenient macros to filter out irregular input: ensure-string, ensure-number
string modification
Convenient string routines: strjoin, strsplit, strtrim, strtrim-left, strtrim-right, strchomp, strchop, strpad
regexp utilities
Scsh-like regexp functions and macros on top of STk's regexp: rxmatch, rxmatch-start, rxmatch-end, rxmatch-substring, rxmatch-let, rxmatch-if, rxmatch-cond, rxmatch-case
filters
Filtering a character stream. Provides functionality similar to unix tr, grep and awk: transliterate, string-transliterate, build-transliterator, input-lines-for-each, input-lines-map, input-lines-fold, grep, sawk
input and output
linebreak, set-line-delimiter, with-line-delimiter, print, println, csv-records-map, csv-records-for-each, csv-write-record

To use the module, require it as follows:

(require "textutils")

1.2 Installation

The newest version of the library at the time of this writing is 0.3.3. It can be obtained from http://practical-scheme.net/vault/textutils-0.3.3.tar.gz.

If you are reading this document off-line, check for the newest version on line at http://practical-scheme.net/vault/textutils.html.

Once you get the package, you can install it by the following procedure.

1.3 Notation Conventions

Only the procedures explicitly documented here are exported from the module.

Some procedure entries use CL-like &key argument notation, although standard Scheme doesn't have it; I use the notation just for convenience. If you give a keyword argument that is not listed, it is usually just ignored.

1.4 Portability

The module uses some STk-specific features listed below.

2. Conversion

In real life, you have to deal with an outside world that is not under your control. When you expect strings to be passed, and even say so in the interface specification, callers will still pass symbols, #f, and every other possible kind of object. Better to be prepared.

Macro: ensure-string MOSTLY-STRING
If MOSTLY-STRING is a string, it is returned as is; this is expected to be the most common case. Otherwise, a string which reasonably represents MOSTLY-STRING is returned. This macro is used to check arguments that are expected to be strings most of the time, and keeps the script from failing when it gets an unexpected object. "Reasonably" here means the following conversions.
#f, ()
null string
symbol
result of symbol->string
number
result of number->string
other objects
"write" representation of the object

Macro: ensure-number MOSTLY-STRING [DEFAULT]
Expects MOSTLY-STRING to be, most of the time, a string representing a number, and returns that number. If MOSTLY-STRING is already a number, it is returned as is. Otherwise, DEFAULT is returned, or 0 if DEFAULT is omitted.
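
For example:

(ensure-number "42")    => 42
(ensure-number 42)      => 42
(ensure-number #f 100)  => 100
(ensure-number #f)      => 0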

3. String manipulation

Functions in this category take a string and return a modified string. The string argument is passed through ensure-string first, so they are more tolerant (or sloppier, depending on your usual programming style) than SRFI-13's similar functions.

3.1 Trim and pad

Trimming leading and trailing whitespace characters, and cutting various newline characters at the end of line are very common in CGI scripts.

Function: strtrim STRING
Removes leading and trailing whitespace characters (space, tab, return and newline) from STRING and returns the result.

Function: strtrim-left STRING
Removes leading whitespace characters (space, tab, return and newline) from STRING and returns the result.

Function: strtrim-right STRING
Removes trailing whitespace characters (space, tab, return and newline) from STRING and returns the result.
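
For example:

(strtrim "  hello  ")        => "hello"
(strtrim-left "  hello  ")   => "hello  "
(strtrim-right "  hello  ")  => "  hello"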

Function: strchomp STRING
Cuts the trailing newline character(s) from STRING, if any. This function handles CRLF, LF-only, and CR-only styles.

Function: strpad STRING WIDTH [PADCHAR [NOCHOP]]
WIDTH must be an exact integer. PADCHAR must be a character; its default is #\space.

If the length of STRING is shorter than the absolute value of WIDTH, the string is widened by adding PADCHAR. If WIDTH is positive, the string is left-adjusted (padding on the right). If WIDTH is negative, the string is right-adjusted (padding on the left).

If the string is longer than the absolute value of WIDTH, the right part of the string is chopped off when WIDTH is positive, or the left part when WIDTH is negative, so that the result has length WIDTH. If the optional argument NOCHOP is provided and true, however, a string longer than the width is left unchopped.
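
For example:

(strpad "abc" 6)       => "abc   "
(strpad "abc" -6)      => "   abc"
(strpad "abcdefg" 5)   => "abcde"
(strpad "abcdefg" -5)  => "cdefg"
(strpad "abcdefg" 5 #\space #t) => "abcdefg"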

Function: strchop STRING [WIDTH]
If WIDTH is positive and STRING is longer than WIDTH, the right part of the string is chopped off so that the result has length WIDTH. If the string is shorter than the width, the string itself is returned.

If WIDTH is negative, -WIDTH characters are removed from the right of STRING. If the string is shorter than -WIDTH, a null string is returned.

The default value of WIDTH is -1, with which the behavior of strchop resembles Perl's chop.
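
For example:

(strchop "hello!")      => "hello"
(strchop "hello!" 3)    => "hel"
(strchop "hello!" -2)   => "hell"
(strchop "hi" 5)        => "hi"
(strchop "hi" -5)       => ""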

3.2 Join and split

Function: strjoin DELIMITER LIST

This function is a Scheme version of Perl's join function. Elements of LIST are concatenated, glued together by DELIMITER. DELIMITER must be a string.

LIST is a list of strings. Elements are filtered through ensure-string, so non-string objects are tolerated. It can even be an improper list.

(strjoin ", " '("apple" "orange" "mango"))
  => "apple, orange, mango"

(strjoin ":" '(a b c d "e"))
  => "a:b:c:d:e"

(strjoin "," '(4 . 3))
  => "4,3"

(strjoin "," "x")
  => "x"

Note: SRFI-13 has string-join. (strjoin delim list) is roughly equivalent to (string-join list delim 'infix).

Function: strsplit REGEXP STRING

Splits STRING at character sequences matching REGEXP, and returns a list of strings. REGEXP can be either a string representing a regular expression, or a regexp object created by string->regexp. Unlike split-string, which comes with STk, this function lets you specify the delimiter as a regular expression rather than just a character sequence. It is like Perl's `split' function. For example,

(strsplit "/" "/usr/local//lib")
  => ("" "usr" "local" "" "lib")

(strsplit "/+" "/usr/local//lib")
  => ("" "usr" "local" "lib")

If REGEXP can match a null string, this function effectively breaks the string into individual characters, consuming whatever sequences match the regular expression along the way. This behavior is also taken from Perl's split function.

(strsplit "/*" "/usr/local//lib")
  => ("" "u" "s" "r" "l" "o" "c" "a" "l" "l" "i" "b")

4. Regexp utilities

Functions and macros convenient for writing conditional branches based on regexp matches. The interface is borrowed from scsh.

Unfortunately, Bigloo's pattern matching library, which comes with STk, defines match-lambda and match-case, and those sound too similar to scsh's match-cond. This suggests the term "match" is too broad to reserve for regular expression matching, so I chose more specific and consistent names here.

Method: rxmatch RE STRING

RE can be either an STk regular expression object created by string->regexp, or a string describing a regular expression. STRING is matched against RE. If it matches, the function returns an opaque match object; otherwise it returns #f.

This method can be extended to accept other types in RE, to extend matching functionality.

This corresponds to scsh's regexp-search.

Function: rxmatch-start MATCH [I]
Function: rxmatch-end MATCH [I]
Function: rxmatch-substring MATCH [I]
MATCH is a match object returned by rxmatch. Without I, or with I equal to zero, the functions return the start, the end, or the substring of the entire match, respectively. With a positive integer I, they return those of the I-th submatch. It is an error to pass other values as I.

For convenience, #f may be passed as MATCH; in that case the functions simply return #f.

These functions correspond to scsh's match:start, match:end and match:substring.
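
For example:

(define m (rxmatch "([0-9]+):([0-9]+)" "time is 12:34"))
(rxmatch-substring m)     => "12:34"
(rxmatch-substring m 1)   => "12"
(rxmatch-substring m 2)   => "34"
(rxmatch-substring #f)    => #f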

In the following macros, match-expr is an expression which produces a match object or #f.

Macro: rxmatch-let MATCH-EXPR (VAR ...) FORM ...

Evaluates MATCH-EXPR, and if it matches, binds VAR ... to the matched strings, then evaluates the FORMs. The first VAR receives the entire match, and subsequent variables receive the submatches. If there are fewer submatches than variables, the remaining variables are bound to #f.

It is possible to put #f in a variable position, meaning you don't care about that binding.

This macro corresponds to scsh's let-match.
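
For example:

(rxmatch-let (rxmatch "([0-9]+):([0-9]+)" "Time: 12:34")
    (whole hh mm)
  (list whole hh mm))
  => ("12:34" "12" "34")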

Macro: rxmatch-if MATCH-EXPR (VAR ...) THEN-FORM ELSE-FORM
Evaluates MATCH-EXPR, and if it matches, binds VAR ... to the matched strings and evaluates THEN-FORM. Otherwise evaluates ELSE-FORM.

This macro corresponds to scsh's if-match.
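
For example, to extract the first number in a string str (a hypothetical variable), or return #f if there is none:

(rxmatch-if (rxmatch "[0-9]+" str)
    (num)
  (string->number num)
  #f)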

Macro: rxmatch-cond CLAUSE ...
CLAUSE may take one of the following patterns.
(MATCH-EXPR (VAR ...) FORM ...)
If MATCH-EXPR returns a match, works like rxmatch-let.
(test EXPR FORM ...)
Has nothing to do with regexps. Evaluates EXPR, and if it is true, evaluates the FORMs.
(test EXPR => PROC)
Has nothing to do with regexps. Evaluates EXPR, and if it is true, calls PROC with the result of EXPR as the only argument.
(else FORM ...)
If all other clauses fail, the FORMs are evaluated.

This macro corresponds to scsh's match-cond.

Macro: rxmatch-case STRING CLAUSE ...
CLAUSE may take one of the following patterns.
(RE (VAR ...) FORM ...)
RE must be either a literal string describing a regexp, or a regexp object. If it matches the given STRING, the rest of the clause works like rxmatch-let.
(test EXPR FORM ...)
Has nothing to do with regexps. Evaluates EXPR, and if it is true, evaluates the FORMs.
(test EXPR => PROC)
Has nothing to do with regexps. Evaluates EXPR, and if it is true, calls PROC with the result of EXPR as the only argument.
(else FORM ...)
If all other clauses fail, the FORMs are evaluated.
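
For example (command, do-get, do-put and do-quit are hypothetical names used for illustration):

(rxmatch-case command
  ("^get +(.*)" (#f arg) (do-get arg))
  ("^put +(.*)" (#f arg) (do-put arg))
  (test (string=? command "quit") (do-quit))
  (else (error "unknown command: " command)))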

5. Filters

Functions in this category read characters from an input stream and process them. They simulate popular unix filter programs.

5.1 Transliteration

"Transliteration" is the function found in unix tr(1) command, as well as Perl's tr operator.

Function: transliterate FROM TO &key INPUT OUTPUT DELETE-UNREPLACED SQUASH COMPLEMENT-MATCH

FROM and TO must be strings. The basic operation is to read the character stream from the input and write it to the output, replacing each character found in FROM with the corresponding character in TO. You can use the range operator `-' to specify a character range. Unless overridden by the keyword arguments, input and output are the current input and output ports, respectively.

(with-input-from-string "Hello, world"
  (lambda () (transliterate "a-z" "A-Z")))
  => prints "HELLO, WORLD"

Note that a character range depends on the underlying character encoding. It's safer to keep a range within a single character class (digits, lowercase letters, or uppercase letters).

If TO is shorter than FROM, the last character of TO is repeated, unless a keyword argument :delete-unreplaced is provided and not #f.

If a keyword argument :delete-unreplaced is provided and not #f, any characters specified by FROM not found in TO are deleted. (It is equivalent to the `d' modifier of Perl's tr operator.)

If a keyword argument :squash is provided and not #f, sequences of characters which were replaced to the same character are squashed down to single instance of the character. (It is equivalent to the `s' modifier of Perl's tr operator.)

If a keyword argument :complement-match is provided and not #f, the FROM character list is complemented. (It is equivalent to the `c' modifier of Perl's tr operator.) Note that the complemented character set may include unexpected characters if multibyte character sets are supported in the future.

Function: string-transliterate STRING FROM TO &key :delete-unreplaced :squash :complement-match

A convenient wrapper for transliterate. Input is taken from STRING, and the result is returned as a string.

(string-transliterate "Hello, world" "a-z" "A-Z")
  => "HELLO, WORLD"

(string-transliterate "Hello, world" "0-9A-Za-z" "z-aZ-A9-0")
  => "iLEEB, 3B8EM"

Actually, the transliteration is done in two steps. First, a procedure is built which takes a string and performs the transliteration. Then the procedure is applied to the specified string. You can use the following function to do only the first step.

Function: build-transliterator FROM TO &key :delete-unreplaced :squash :complement-match

Returns a procedure which takes one string argument, performs the transliteration specified by FROM, TO and the keyword arguments, and returns the transliterated string. The meaning of the arguments is the same as for string-transliterate.

If you apply the same transliteration to many different strings, it is more efficient to build a procedure with this function and then apply it to the strings, rather than calling string-transliterate many times.
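
For example (to-upper is a name chosen for illustration):

(define to-upper (build-transliterator "a-z" "A-Z"))
(to-upper "Hello, world")      => "HELLO, WORLD"
(map to-upper '("foo" "bar"))  => ("FOO" "BAR")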

5.2 Generic filter

These functions read the input line by line and pass each line to the specified procedure.

Function: input-lines-for-each PROC INPUT
Function: input-lines-map PROC INPUT

INPUT may be either an input port, a string specifying an input file, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list.

For each line from INPUT, a procedure PROC is called with two arguments: the line itself, and the line number. Line number counts from 1.

input-lines-for-each doesn't return a meaningful result. input-lines-map collects the results of applying PROC and returns them as a list.
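
For example, with a list as the input:

(input-lines-map (lambda (line n) (format #f "~a: ~a" n line))
                 '("foo" "bar"))
  => ("1: foo" "2: bar")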

Function: input-lines-fold PROC INIT INPUT

A generalized iterator. For each line from INPUT, PROC is called with three arguments: the line itself, the line count, and the partial result returned by the previous application of PROC. For the first line, INIT is passed as that partial result. The result of the last application of PROC is returned.

If we write the n-th line as line[n] and INPUT has N lines, the call to input-lines-fold is equivalent to:

(proc line[N] N (proc line[N-1] N-1 (proc .... (proc line[2] 2 (proc line[1] 1 init)) ...)))

INPUT may be either an input port, a string specifying an input file, or a list of strings.
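
For example, following the argument order shown in the expansion above, this sums the lengths of all lines:

(input-lines-fold (lambda (line n result) (+ result (string-length line)))
                  0
                  '("ab" "cde"))
  => 5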

In future versions, the concept of a "line" may be expanded to a chunk of characters delimited by a given record separator.

5.3 Grep

The famous grep.

Function: grep PATTERN [OPTION VALUE ...] INPUT ...
Each INPUT can be an input port, a string, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list. PATTERN can be a regexp object or a string describing regexp.

This function takes each line from the input, and if the line matches PATTERN, collects the matched line into the result (by default).

You can give a few options to modify what is returned. An option is a keyword followed by a value (much like unix command options). The following keywords are recognized:

:show-line boolean
If true, includes the line number of the matched line. Line numbers start from 1. The default is false. This corresponds to the -n option of grep(1).
:show-file boolean
If true, includes the name of the input file, when the INPUT is a filename. The default is false.
:show-match boolean
If true, includes the matched line itself. The default is true.

If exactly one of those flags is true, a list of the specified item for each matched line is returned. If more than one of the flags is true, each matched line yields a list of the specified items, and a list of those lists is returned.

(grep "[0-9]+" '("C" "C++" "Algol-60" "Perl" "Scheme" "Lisp" "Fortran-77"))
  => ("Algol-60" "Fortran-77")

(grep "[0-9]+" :show-line #t '("C" "C++" "Algol-60" "Perl" "Scheme" "Lisp" "Fortran-77"))
  => ((2 "Algol-60") (6 "Fortran-77"))

(grep "[0-9]+" :show-line #t :show-match #f '("C" "C++" "Algol-60" "Perl" "Scheme" "Lisp" "Fortran-77"))
  => (2 6)

(grep "stk" "| ps -efa")    ; you can use STk's pipe syntax as well
  => ("shiro     1393   737  0 11:01 ttyp1    00:00:00 /usr/local/bin/stk")

(apply grep "myfunc" :show-file #t :show-line #t :show-match #f (glob "*.stk"))
  => lists the file names and line numbers of lines which contain 'myfunc'

(grep "\\.stk~[0-9]+~$" (glob "*"))  ; scans the current directory with regexp

5.4 Awk (sort of...)

A macro providing awk-like control flow.

Sometimes I want the expressive power of Scheme inside awk, and it's much easier to build an awk-like environment in Scheme than a Scheme-like environment in awk.

This is just a first try; the code is not well tested yet.

Macro: sawk CONTEXT (CLAUSE ...) [OPTION VALUE ...] [INPUT ...]
Reads records from the given INPUTs and performs the actions specified by the CLAUSEs. In the current implementation, `record' is a synonym for `line'.

INPUT may be either an input port, a string, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list.

For each input, sawk reads line by line, breaks each line into fields, then tests each CLAUSE against the line. If a CLAUSE's condition is satisfied, the action associated with the CLAUSE is evaluated. Like awk (and unlike cond or case), the actions 'cascade', i.e. a line can match multiple clauses.

CONTEXT is a symbol which is bound to a procedure that lets you access various pieces of current context information inside a CLAUSE. It is described in detail later.

You can give options to customize sawk's behavior. An option is a keyword followed by a value (much like unix command options). Currently, the following options are recognized.

:delimiter DELIMITER
Specifies how to break a record into fields. DELIMITER must be either a regexp or a string specifying a regexp. The default delimiter is (string->regexp "[ \t]+").
:input-list EXPR
When evaluated, EXPR must produce a list of values valid as INPUT parameters to sawk. These inputs are processed as if they were prepended to the actual input parameters. Since sawk is a macro, you can't use apply when you have an expression producing a list of inputs, so use this option instead. See the example at the end of this section.

CLAUSE takes one of the following forms, where RE may be either a regexp object or a string specifying a regexp.

(RE (VARS ...) FORM ...)
If the current line matches RE, binds the match and submatches to the VARS and evaluates the FORMs. The VARS work the same way as in rxmatch-let.
(! RE FORM ...)
The negation of the above: the FORMs are evaluated when the current line does not match RE.
(begin FORM ...)
For every INPUT, FORM is evaluated just after the input is opened.
(end FORM ...)
For every INPUT, FORM is evaluated just after an eof is read from the input.
(test EXPR FORM ...)
Evaluates EXPR and if it is true, evaluates FORMs.
(test EXPR => PROC)
Evaluates EXPR and if it is true, calls PROC with the result as the only argument.

Inside a CLAUSE, you can use the context accessor to retrieve and modify various pieces of sawk's internal state. Suppose you give the symbol `$' as the CONTEXT parameter; then you can use the following expressions.

($ 0)
Returns the current line.
($ n) where n is a positive integer
Returns the n-th field. Field numbering begins at 1, as in awk.
($ :nr), ($ :record-number)
Returns the current record number, i.e. the number of records sawk has read so far.
($ :fnr), ($ :file-record-number)
The current record number in the current input. This number is reset for every input.
($ :nf), ($ :number-of-fields)
The number of fields in the current input record.
($ :input)
Returns the input currently being processed.
($ :collect value)
Collects a result. VALUE is accumulated into an internal result list, which is returned as the result of sawk.

Some examples:

Check the password file for entries with an empty password field.

(sawk $
  ((test (equal? ($ 2) "")
         (format #t "User ~a doesn't have password\n" ($ 1))))
  :delimiter ":"
  "/etc/passwd")

Same, but returns a list of users instead of printing messages.

(sawk $
  ((test (equal? ($ 2) "") ($ :collect ($ 1))))
  :delimiter ":"
  "/etc/passwd")

From all the STk sources in the current directory, pick up what look like toplevel function definitions.

(sawk $
  (("^\\(define *\\( *([^ |\"';]+)" 
     (#f name) 
     ($ :collect (list ($ :input) ($ :fnr) name))))
  :input-list (glob "*.stk"))

6. Input and Output

This section collects a handful of functions that do input and output. I'm still wondering whether I should move them into a separate module, something like ioutils.... I put them here for the time being, since I use them a lot in my daily work.

6.1 Line termination

Different systems use different conventions to represent the end of a line. Usually the difference is carefully hidden by library wrappers, and you can live happily as long as you stay within one such environment... Once you have to exchange information across systems, you have to deal with the difference explicitly.

STk's read-line can already read both unix-style (newline-only) and DOS-style (CRLF) files transparently. The problem arises, for example, when you want to create a text file on unix that should be read by some DOS/Windows program, or when you communicate over the network.

Function: linebreak [PORT]
Prints the current line break sequence to PORT. If PORT is omitted, the current output port is used.

Function: set-line-delimiter DELIMITER [PORT]
Sets the line break sequence (line delimiter) for PORT. If PORT is omitted, the current output port is used.

DELIMITER can be one of the keywords :lf-only, :crlf or :cr-only. It can also be #f, in which case the line delimiter of the port is reset to the default.

This doesn't affect the built-in newline function; it always prints the system's default line delimiter.

Macro: with-line-delimiter (DELIMITER [PORT]) BODY ...
Sets the line break sequence of PORT to DELIMITER, then evaluates the BODY forms. The line delimiter is restored upon exit from this form.
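
For example, to write one CRLF-terminated line to the current output port (a sketch based on the syntax above):

(with-line-delimiter (:crlf)
  (display "field1,field2")
  (linebreak))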

Note: I've only used these functions in a Unix environment to produce DOS-style files. I'm not sure whether the current version works the other way around.

6.2 Printing string

These small functions are quite useful for composing write-in-10-minutes-and-use-only-once scripts.

The interface of these functions is not nice in a Scheme-ish way: they pack two distinct functionalities into one function, use an unusual optional argument syntax, and one of the names conflicts with a Common Lisp library function. However, I couldn't resist the temptation to insert debugging print stubs like (print "args=~s" args) in my code... maybe I've been exposed to Perl too much and am fatally infected. These functions are in a separate file, so if you don't like them, just remove them.

Function: print [PORT] FORMAT-OR-STRING ARG ...

If a string and one or more arguments are given, it works like (format #t FORMAT ARG ...). If only a string argument is given, it works like (display STRING). If the first argument is an output port or #f, the output goes to that port or to a string, respectively.

(print "~s ~a" x y) 
  == (format #t "~s ~a" x y) 

(print "foobar")
  == (display "foobar")
 
(print port "bye.")
  == (display "bye." port)

(print #f "Answer: ~s~%" answer)
  == (format #f "Answer: ~s~%" answer)

;; some boundary conditions:
(print #f) == ""
(print port) == (display "" port)

Function: println [PORT] FORMAT-OR-STRING ARG ...

Same as print, except that linebreak is called on the same port after the string has been output.
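
For example (count is a hypothetical variable):

(println "~a items processed" count)
  == (begin (format #t "~a items processed" count) (linebreak))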

Well, when I was in a programming course they taught me Pascal.

6.3 CSV File

CSV format (I don't know exactly what it stands for... comma separated values?) is one of the popular text formats for keeping table-type data. It's handy for exchanging data with the outside world. Most spreadsheet and database programs have an option to save data in this format. When your customer comes to your office and asks you to import their Excel data into your application, the easiest way is to tell them to export the sheet in "text format (comma separated)", unless you're a big fan of Visual Basic for Applications.

I've never seen a formal definition of the CSV format, but it appears that fields are separated by commas, and records are separated by newlines. A field can be surrounded by double-quote characters, and literal commas and newlines can be included in the field value as long as the entire field is quoted. To include a double-quote character itself in the value, two consecutive double-quote characters are used.

This is an example of a CSV file with three records, each of which has four fields. (In fact, the first record serves as a list of field names, but it is up to each application to interpret it that way.)

"serial number","title","authors","description"
6943,"Structure of Computer Programming","Abelson, Sussman and Sussman","`Must read' of the students who wants to
capture the spirit of programming."
2113,"The C++ Programming Language, 3rd Ed.","Strouptrup","The definite book from
the creator of the language.
3rd edition covers ""modern"" features of the language,
such as namespaces."

Now, these are the functions to read such files:

Function: csv-records-map PROC [INPUT]
Function: csv-records-for-each PROC [INPUT]
INPUT may be either an input port, a string, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list. If it is omitted, the current input port is used.

These functions read INPUT as a CSV file and, for each record, call PROC with two arguments: the record represented as a list of fields, and the record count. The record count begins at 1. The results of applying PROC are collected by csv-records-map and discarded by csv-records-for-each.
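
For example, counting the fields of each record (books.csv is a hypothetical file holding the example data shown above):

(csv-records-map (lambda (record n) (length record)) "books.csv")
  => (4 4 4)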

Function: csv-records-fold PROC INIT [INPUT]
A generalized iterator. Input is read as in csv-records-map and csv-records-for-each, but PROC is called with three arguments: a record, a record count, and the partial result returned by the previous call of PROC. The first time PROC is called, INIT is passed as the third argument. The result of the last call of PROC is returned from csv-records-fold.

For example, you can pick out the records which match your criteria with the following code:

(csv-records-fold (lambda (rec cnt result)
                     (if (satisfy-criteria? rec) (cons rec result) result))
                  '())

To write a CSV file, use the following function.

Function: csv-write-record RECORD PORT
RECORD should be a list of field values. Each value is converted to a string and written out to PORT in CSV format.

The end-of-line sequence can be set by set-line-delimiter. See "Line Termination" section above.
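
For example (a sketch; whether fields without special characters also get quoted may differ):

(csv-write-record '(6943 "Abelson, Sussman and Sussman" "SICP")
                  (current-output-port))
  ;; writes something like: 6943,"Abelson, Sussman and Sussman",SICP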

Index

  • build-transliterator
  • csv-records-fold
  • csv-records-for-each
  • csv-records-map
  • csv-write-record
  • ensure-number
  • ensure-string
  • grep
  • input-lines-fold
  • input-lines-for-each
  • input-lines-map
  • linebreak
  • print
  • println
  • rxmatch
  • rxmatch-case
  • rxmatch-cond
  • rxmatch-end
  • rxmatch-if
  • rxmatch-let
  • rxmatch-start
  • rxmatch-substring
  • sawk
  • set-line-delimiter
  • strchomp
  • strchop
  • string-transliterate
  • strjoin
  • strpad
  • strsplit
  • strtrim
  • strtrim-left
  • strtrim-right
  • transliterate
  • with-line-delimiter

This document was generated on 10 February 2001 using texi2html 1.56k.