The Scheme library textutils is a random collection of functions and macros for processing strings and text data. Most of them are small and easy enough that I used to write them each time I needed one, until I finally decided to avoid the duplication of effort. Some of the ideas are borrowed from other unix tools or from functions in other scripting languages, which I find handy in my daily chores as a programmer.
The SRFI-13 draft ("string libraries", http://srfi.schemers.org/srfi-13/srfi-13.html) defines a fairly complete set of string-manipulation functions. Some of the functions in this library have features similar to their SRFI counterparts. The SRFI versions take more optional arguments to customize their behavior and are also more strict about their arguments. My versions can be seen as convenient wrappers over the SRFI's.
The area this library covers is somewhat vague. You may find only a few of the features useful, but not all of them. In that case, just pick up the code you are interested in.
The functions and macros provided so far are:
ensure-string, ensure-number
strjoin, strsplit, strtrim, strtrim-left, strtrim-right, strchomp, strchop, strpad
rxmatch, rxmatch-start, rxmatch-end, rxmatch-substring, rxmatch-let, rxmatch-if, rxmatch-cond, rxmatch-case
transliterate, string-transliterate, build-transliterator
input-lines-for-each, input-lines-map, input-lines-fold, grep, sawk
linebreak, set-line-delimiter, with-line-delimiter, print, println
csv-records-map, csv-records-for-each, csv-write-record
To use the module, require and import it as follows:
(require "textutils")
The newest version of the library is 0.3.3 at the time I'm writing this. It can be obtained from http://practical-scheme.net/vault/textutils-0.3.3.tar.gz.
If you are reading this document off-line, check the newest version of the document on line at http://practical-scheme.net/vault/textutils.html.
Once you get the package, you can install it with the following procedure:
cd textutils
stk Makefile.stk
make install
Only the procedures explicitly documented here are exported from the module.
Some procedure entries use CL-like &key argument notation, although standard Scheme doesn't have it. I use the notation just for convenience. If you give a keyword argument that is not specified, it is usually just ignored.
The module uses some STk-specific features, listed below.
- The module system and autoload: textutils.stk defines the module and uses autoload to load the individual files. If your Scheme doesn't have these, or has a different syntax for them, take the individual files directly.
- String ports (with-input-from-string, with-output-to-string), and some STk built-in string functions such as string-index. I used them wherever they are faster than the general way.
- Regular expressions (string->regexp, etc.).
- Keywords (get-keyword, keyword?).
In real life, you have to coordinate with the outside world, which is not under your control. When you expect strings to be passed, and even state so in the interface specification, callers will still pass symbols, #f's, and all sorts of other stuff. Better to be prepared.
If MOSTLY-STRING is a string, ensure-string returns it as is, expecting that's the most common case. Otherwise, it returns a string which reasonably represents MOSTLY-STRING. This macro is used to check arguments which are expected to be mostly strings, and to prevent the script from failing when it gets an unexpected object. "Reasonably" here means the following conversions.
#f, () : a null string ""
symbol : converted by symbol->string
number : converted by number->string
ensure-number expects MOSTLY-STRING to be a string representing a number, and returns that number. If MOSTLY-STRING is already a number, it is returned as is. Otherwise, DEFAULT is returned, or 0 if DEFAULT is omitted.
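A quick sketch of the intended behavior, following the conversion rules above (the results are illustrative, not taken from the library's test suite; DEFAULT is assumed to be the second argument of ensure-number):

(ensure-string "foo")    => "foo"
(ensure-string 'foo)     => "foo"
(ensure-string 42)       => "42"
(ensure-number "42")     => 42
(ensure-number 'foo 10)  => 10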
Functions in this category take a string and return a modified string. The string argument is passed to ensure-string first, so they are more tolerant (or sloppier, depending on your usual programming style) than SRFI-13's similar functions.
Trimming leading and trailing whitespace characters, and cutting various newline characters at the end of a line, are very common tasks in CGI scripts.
For strpad, WIDTH must be an exact integer. PADCHAR must be a character; its default is #\space.
If the length of STRING is shorter than the absolute value of WIDTH, the string is widened by adding PADCHAR. If WIDTH is positive, the string is left adjusted (the pad goes on the right). If WIDTH is negative, the string is right adjusted (the pad goes on the left).
If the string is wider than the absolute value of WIDTH, the right part of the string is chopped when WIDTH is positive, or the left part when WIDTH is negative, so that the entire length becomes WIDTH.
If the optional argument NOCHOP is provided and true, however, the string is not chopped even if it is wider than the width.
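A sketch of the behavior described above, assuming the argument order (strpad STRING WIDTH [PADCHAR ...]); the results are illustrative:

(strpad "ab" 5)       => "ab   "
(strpad "ab" -5)      => "   ab"
(strpad "abcdef" 4)   => "abcd"
(strpad "abcdef" -4)  => "cdef"
(strpad "ab" 5 #\*)   => "ab***"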
For strchop, if WIDTH is positive and the length of STRING is longer than WIDTH, the right part of the string is chopped to fit WIDTH. If the string is narrower than the width, the string itself is returned.
If WIDTH is negative, -WIDTH characters are removed from the right of STRING. If the string is shorter than -WIDTH, a null string is returned.
The default value of WIDTH is -1, with which the behavior of strchop resembles Perl's chop.
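For illustration, assuming the argument order (strchop STRING [WIDTH]):

(strchop "hello" 3)   => "hel"
(strchop "hello" -2)  => "hel"
(strchop "hello")     => "hell"   ; default WIDTH -1 behaves like Perl's chop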
strjoin is a Scheme version of Perl's join function. The elements in LIST are concatenated, glued by DELIMITER. DELIMITER must be a string. LIST is a list of strings. Elements are filtered by ensure-string, so non-string objects are tolerated. It can even be an improper list.
(strjoin ", " '("apple" "orange" "mango")) => "apple, orange, mango" (strjoin ":" '(a b c d "e")) => "a:b:c:d:e" (strjoin "," '(4 . 3)) => "4,3" (strjoin "," "x") => "x"
Note: SRFI-13 has string-join. (strjoin delim list) is roughly equivalent to (string-join list delim 'infix).
strsplit splits a given STRING by character sequences which match REGEXP, and returns a list of strings. REGEXP can be either a string representing a regular expression, or a regexp object created by string->regexp.
Unlike split-string, which comes with STk, this function allows you to specify the delimiter by a regular expression rather than just a character sequence. It is like Perl's split function. For example,
(strsplit "/" "/usr/local//lib") => ("" "usr" "local" "" "lib") (strsplit "/+" "/usr/local//lib") => ("" "usr" "local" "lib")
If REGEXP matches a null string, this function acts as if it breaks the string up into individual characters, consuming whatever sequences match the regular expression. This behavior is taken from Perl's split function, too.
(strsplit "/*" "/usr/local//lib") => ("" "u" "s" "r" "l" "o" "c" "a" "l" "l" "i" "b")
The functions and macros in this section are convenient for writing conditional branches based on regexp matches. The interface is borrowed from scsh.
Unfortunately, Bigloo's pattern matching library, which comes with STk, defines match-lambda and match-case, and they sound too similar to scsh's match-cond. It indicates that the term "match" is too broad to be used just for regular expression matching. So I chose more specific and consistent names here.
For rxmatch, RE can be either an STk regular expression object created by string->regexp, or a string which describes a regular expression. The string STRING is matched against RE. If it matches, the function returns an opaque match object. Otherwise it returns #f.
This method can be extended to accept other types in RE, to extend the matching functionality.
This corresponds to scsh's regexp-search.
For rxmatch-start, rxmatch-end and rxmatch-substring, MATCH is a match object returned by rxmatch. Without I, or when I equals zero, the functions return the start, the end, or the substring of the entire match, respectively. With a positive integer I, they return those of the I-th submatch. It is an error to pass other values as I.
For convenience, it is allowed to pass #f as MATCH; in that case the functions return #f.
These functions correspond to scsh's match:start, match:end and match:substring.
In the following macros, MATCH-EXPR is an expression which produces a match object or #f.
rxmatch-let evaluates MATCH-EXPR, and if it matched, binds VAR ... to the matched strings, then evaluates the FORMs. The first VAR receives the entire match, and the subsequent variables receive the submatches. If the number of submatches is smaller than the number of variables to receive them, the remaining variables get #f.
It is possible to put #f in a variable position, which means you don't care about that variable.
This macro corresponds to scsh's let-match.
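An illustrative sketch, assuming the surface syntax (rxmatch-let MATCH-EXPR (VAR ...) FORM ...):

(rxmatch-let (rxmatch "([0-9]+):([0-9]+)" "time 12:34")
    (whole hour minute)
  (list whole hour minute))
 => ("12:34" "12" "34")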
rxmatch-if evaluates MATCH-EXPR, and if it matched, binds VAR ... to the matched strings and evaluates THEN-FORM. Otherwise it evaluates ELSE-FORM.
This macro corresponds to scsh's if-match.
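An illustrative sketch, assuming the surface syntax (rxmatch-if MATCH-EXPR (VAR ...) THEN-FORM ELSE-FORM):

(rxmatch-if (rxmatch "([0-9]+)" "width=80")
    (#f num)
  (string->number num)
  0)
 => 80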
For rxmatch-cond, each CLAUSE may be one of the following patterns.
(MATCH-EXPR (VAR ...) FORM ...)
If MATCH-EXPR returns a match, works like rxmatch-let.
(test EXPR FORM ...)
Evaluates EXPR, and if it is true, evaluates the FORMs.
(test EXPR => PROC)
Evaluates EXPR, and if it is true, calls PROC with the result of EXPR as the only argument.
(else FORM ...)
If no other clause matched, the FORMs are evaluated.
This macro corresponds to scsh's match-cond.
For rxmatch-case, each CLAUSE may be one of the following patterns.
(RE (VAR ...) FORM ...)
RE must be either a literal string describing a regexp, or a regexp object. If it matches the given STRING, the rest of the clause works like rxmatch-let.
(test EXPR FORM ...)
Evaluates EXPR, and if it is true, evaluates the FORMs.
(test EXPR => PROC)
Evaluates EXPR, and if it is true, calls PROC with the result of EXPR as the only argument.
(else FORM ...)
If no other clause matched, the FORMs are evaluated.
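For illustration, here is a sketch assuming the surface syntax (rxmatch-case STRING CLAUSE ...); the procedure name and patterns are hypothetical:

(define (classify-line line)
  (rxmatch-case line
    ("^#"                  (#f)          'comment)
    ("^([a-z]+) *= *(.*)$" (#f key val)  (cons key val))
    (test (string=? line "") 'blank)
    (else 'other)))

(classify-line "color = blue")  => ("color" . "blue")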
The functions in this category read characters from an input stream and process them. They simulate popular unix filter programs.
"Transliteration" is the operation found in the unix tr(1) command, as well as in Perl's tr operator.
For transliterate, FROM and TO must be strings. The basic operation is to take a character stream from the input and put it to the output, replacing each character found in FROM with the corresponding character in TO. You can use the range operator `-' to denote a character range. Unless specified by the keyword arguments, the input and output are the current input and output ports, respectively.

(with-input-from-string "Hello, world"
  (lambda () (transliterate "a-z" "A-Z")))
 => prints "HELLO, WORLD"

Note that the character range depends on the underlying character encoding. It's safer to use it within the same character class (digits, lowercase letters, or uppercase letters).
If TO is shorter than FROM, the last character of TO is repeated, unless the keyword argument :delete-unreplaced is provided and not #f.
If the keyword argument :delete-unreplaced is provided and not #f, any characters specified by FROM that are not found in TO are deleted. (This is equivalent to the `d' modifier of Perl's tr operator.)
If the keyword argument :squash is provided and not #f, sequences of characters which were replaced by the same character are squashed down to a single instance of that character. (This is equivalent to the `s' modifier of Perl's tr operator.)
If the keyword argument :complement-match is provided and not #f, the FROM character list is complemented. (This is equivalent to the `c' modifier of Perl's tr operator.) Note that the complemented character set may include unexpected characters if a multibyte character set is supported in the future.
string-transliterate is a convenient wrapper for transliterate. The input is taken from STRING, and the result is returned as a string.
(string-transliterate "Hello, world" "a-z" "A-Z") => "HELLO, WORLD" (string-transliterate "Hello, world" "0-9A-Za-z" "z-aZ-A9-0") => "iLEEB, 3B8EM"
Actually, the transliteration is done in two steps. First, it builds a procedure which receives a string and does the transliteration. Then the procedure is applied to the specified string. You can use the following function to do the first step only.
build-transliterator returns a procedure which takes one string argument, does the transliteration specified by FROM, TO and the keyword arguments, then returns the transliterated string. The meaning of the arguments is the same as for string-transliterate.
If you apply the same transliteration to lots of different strings, it is more efficient to build a procedure with this function and then apply it to the strings, rather than calling string-transliterate many times.
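For example (an illustrative sketch of the intended usage):

(define rot13 (build-transliterator "A-Za-z" "N-ZA-Mn-za-m"))
(rot13 "Hello")  => "Uryyb"
(rot13 "Uryyb")  => "Hello"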
These functions read input line by line and pass each line to the specified procedure.
INPUT may be either an input port, a string specifying an input file, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list.
For each line from INPUT, the procedure PROC is called with two arguments: the line itself, and the line number. Line numbers count from 1.
input-lines-for-each doesn't return a meaningful result. input-lines-map collects the results of applying PROC and returns them.
input-lines-fold is a generalized iterator. For each line from INPUT, PROC is called with three arguments: the line itself, the line count, and the partial result of the previous application of PROC. For the first line, INIT is passed as the last argument of PROC. The result of the last application of PROC is returned.
If we write the n-th line as line[n] and INPUT has N lines, the application of input-lines-fold is the same as this:

(proc line[N] N
  (proc line[N-1] N-1
    (proc ....
      (proc line[2] 2
        (proc line[1] 1 init)) ...)))
INPUT
may be either an input port, a string
specifying an input file, or a list of strings.
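A sketch of typical usage; the order of the PROC, INIT and INPUT arguments is assumed here, following the description above, and the file name is hypothetical:

(input-lines-map
  (lambda (line n) (format #f "~a: ~a" n line))
  '("foo" "bar"))
 => ("1: foo" "2: bar")

(input-lines-fold
  (lambda (line n count)   ; line, line count, partial result
    (if (string=? (strtrim line) "") count (+ count 1)))
  0
  "input.txt")
 => the number of non-blank lines in input.txt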
In future versions, the concept of "line" may be expanded to a chunk of characters delimited by a given record separator.
The famous grep.
INPUT can be an input port, a string, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list. PATTERN can be a regexp object or a string describing a regexp.
This function takes each line from the input, and if the line matches PATTERN, includes the matched line in the result (by default).
You can give a few options to modify the output. An option is a keyword followed by a value (much like unix command options). The following keywords are recognized:
:show-line boolean
If true, the line number of the matched line is included in the result, like the -n option of grep(1). Default is false.
:show-file boolean
If true, the file name is included in the result, when INPUT is a filename. Default is false.
:show-match boolean
If true, the matched line itself is included in the result. Default is true.
If exactly one of those flags is true, a list of the specified item for each matched line is returned. If more than one of those flags is true, a list of the specified items is returned for each matched line.
(grep "[0-9]+" '("C" "C++" "Algol-60" "Perl" "Scheme" "Lisp" "Fortran-77")) => ("Algol-60" "Fortran-77") (grep "[0-9]+" :show-line #t '("C" "C++" "Algol-60" "Perl" "Scheme" "Lisp" "Fortran-77")) => ((2 "Algol-60") (6 "Fortran-77")) (grep "[0-9]+" :show-line #t :show-match #f '("C" "C++" "Algol-60" "Perl" "Scheme" "Lisp" "Fortran-77")) => (2 6) (grep "stk" "| ps -efa") ; you can use STk's pipe syntax as well => ("shiro 1393 737 0 11:01 ttyp1 00:00:00 /usr/local/bin/stk") (apply grep "myfunc" :show-file #t :show-line #t :show-match #f (glob "*.stk")) => lists file names and line numbers which contains 'myfunc' (grep "\\.stk~[0-9]+~$" (glob "*")) ; scans the current directory with regexp
sawk is a macro providing awk-like control flow.
Sometimes I want the expressive power of Scheme inside awk; and it's much easier to write an awk-like environment in Scheme than to write a Scheme-like environment in awk.
This is just a first try. The code is not tested well yet.
sawk reads the INPUTs and performs the actions specified by the CLAUSEs. In the current implementation, `record' is a synonym for `line'.
Each INPUT may be either an input port, a string, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list.
For each input, sawk reads from it line by line, breaks the line up into a number of fields, then tests each CLAUSE against the line. If a CLAUSE satisfies its condition, the action associated with the CLAUSE is evaluated. Like awk (and unlike cond or case), those actions can be 'cascaded', i.e. the line can match multiple clauses.
CONTEXT is a symbol which is bound to a procedure and allows you to access various pieces of current context information inside the CLAUSEs. It is described later in detail.
You can give options to customize sawk's behavior. An option is a keyword followed by a value (much like unix command options). Currently, the following options are recognized.
:delimiter DELIMITER
Specifies the field delimiter. The default is (string->regexp "[ \t]+").
:input-list EXPR
EXPR should evaluate to a list of inputs (for example, a list of filenames); each element is processed as an INPUT in turn.
Each CLAUSE takes one of the following forms, where RE may be either a regexp object or a string specifying a regexp.
(RE (VARS ...) FORM ...)
If the line matches RE, binds the match and submatches to the VARS and evaluates the FORMs. The VARS work the same way as in rxmatch-let.
(! RE FORM ...)
If the line does not match RE, evaluates the FORMs.
(begin FORM ...)
For each INPUT, the FORMs are evaluated just after the input is opened.
(end FORM ...)
For each INPUT, the FORMs are evaluated just after an eof is read from the input.
(test EXPR FORM ...)
Evaluates EXPR, and if it is true, evaluates the FORMs.
(test EXPR => PROC)
Evaluates EXPR, and if it is true, calls PROC with the result as the only argument.
Inside a CLAUSE, you can use the context accessor to retrieve and modify various pieces of internal state of sawk. Suppose you give the symbol `$' as the CONTEXT parameter. Then you can use the following expressions.
($ 0)
The whole of the current line (record).
($ n) where n is a positive integer
The n-th field of the current line.
($ :nr), ($ :record-number)
The number of records sawk has read so far.
($ :fnr), ($ :file-record-number)
The number of records read from the current input.
($ :nf), ($ :number-of-fields)
The number of fields in the current line.
($ :input)
The current input (the filename, if the input was given as a file).
($ :collect value)
Collects value; the list of collected values is returned as the result of sawk.
Some examples:
Checks for password file entries with no password field.
(sawk $
  ((test (equal? ($ 2) "")
         (format #t "User ~a doesn't have password\n" ($ 1))))
  :delimiter ":" "/etc/passwd")
Same, but returns a list of users instead of printing messages.
(sawk $
  ((test (equal? ($ 2) "")
         ($ :collect ($ 1))))
  :delimiter ":" "/etc/passwd")
From all the stk sources in the current directory, picks up what look like toplevel function definitions.
(sawk $ (("^\\(define *\\( *([^ |\"';]+)" (#f name) ($ :collect (list ($ :input) ($ :fnr) name)))) :input-list (glob "*.stk"))
This section collects a few functions which do input and output. I'm still wondering whether I should move these functions into a separate module, something like ioutils.... I put them here for the time being, since I use them a lot in my daily work.
Different systems use different conventions to represent the end of a line. Usually the difference is carefully covered by library wrappers, and you can live happily as long as you stay within one such environment... Once you have to exchange information, you have to deal with the difference explicitly.
STk's read-line can already read both unix-style (newline-only) and DOS-style (CRLF) files transparently. The problem arises, for example, when you want to create a text file on unix which should be read by some DOS/Windows program, or when you communicate over the network.
linebreak outputs a line break to PORT, using the line delimiter currently in effect for the port. If PORT is omitted, the current output port is used.
set-line-delimiter sets the line delimiter of PORT. If PORT is omitted, the current output port is used. DELIMITER can be one of the keywords :lf-only, :crlf or :cr-only. It can also be #f, in which case the line delimiter of the port is reset to the default line delimiter. This doesn't affect the built-in newline function; it always prints the system's default line delimiter.
with-line-delimiter sets the line delimiter of PORT to DELIMITER, then evaluates BODY. The line delimiter is restored upon exit from this form.
Note: I've only used these functions in a Unix environment to produce DOS-style files. I'm not sure whether the current version works the other way around.
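A minimal sketch of producing a DOS-style (CRLF) file from unix, assuming set-line-delimiter takes the delimiter first and defaults to the current output port, and that the file name is hypothetical:

(with-output-to-file "dos.txt"
  (lambda ()
    (set-line-delimiter :crlf)
    (println "first line")      ; each line ends with CRLF
    (println "second line")))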
These small functions are quite useful for composing write-in-10-minutes-and-used-only-once type scripts.
The interface of these functions is not nice, in a Scheme-ish way. They have two distinct functionalities in one function, they have different optional-argument syntax, and one of the names conflicts with a Common Lisp library function. However, I couldn't resist the temptation to insert debugging print stubs like (print "args=~s" args) in my code... Maybe I've been exposed to Perl too much and am fatally infected. These functions are in a separate file, so if you don't like them, just remove them.
If print is given a format string and one or more arguments, it works as if (format #t FORMAT ARG ...). If only a string argument is given, it works as (display STRING). If the first argument is an output port or #f, the output goes to the port or to a string, respectively.
(print "~s ~a" x y) == (format #t "~s ~a" x y) (print "foobar") == (display "foobar") (print port "bye.") == (display "bye." port) (print #f "Answer: ~s~%" answer) == (format #f "Answer: ~s~%" answer) ;; some boundary conditions: (print #f) == "" (print port) == (display "" port)
println is the same as print, except that linebreak is called on the same port after the string is produced.
Well, when I was in a programming course they taught me Pascal.
CSV format (I don't know what it stands for... comma separated values?) is one of the popular text formats for keeping table-type data. It's handy for exchanging data with the outside world. Most spreadsheet or database programs have an option to save data in this format. When your customer comes to your office and asks you to import his/her Excel data into your application, the easiest way is to tell him/her to export the sheet in "text format (comma separated)", unless you're a big fan of Visual Basic for Applications.
I've never seen a formal definition of the CSV format, but it appears that each field is separated by a comma, and each record is separated by a newline. A field can be surrounded by double-quote characters, and literal commas and newlines can be included in the field value as long as the entire field is quoted. To include a double-quote character itself in the value, two consecutive double-quote characters are used.
This is an example of a CSV file, which has three records (in fact, the first record serves as a list of field names, but each application has to take care of that). Each record has four fields.
"serial number","title","authors","description" 6943,"Structure of Computer Programming","Abelson, Sussman and Sussman","`Must read' of the students who wants to capture the spirit of programming." 2113,"The C++ Programming Language, 3rd Ed.","Strouptrup","The definite book from the creator of the language. 3rd edition covers ""modern"" features of the language, such as namespaces."
Now, these are the functions to read such files:
INPUT may be either an input port, a string, or a list of strings. If it is a string, it is regarded as a filename and the input is taken from the file. If it is a list, it is regarded as a list of "lines", and the input is taken element by element from the list. If it is omitted, the current input port is used.
These functions read INPUT as a CSV file, and for each record, call PROC with two arguments: the record, represented as a list of fields, and the record count. The record count begins at 1. The results of applying PROC are collected by csv-records-map, and discarded by csv-records-for-each.
csv-records-fold works like csv-records-map and csv-records-for-each, but PROC is called with three arguments: a record, the record count, and the partial result returned by the previous call of PROC. The first time PROC is called, INIT is passed as the third argument. The result of the last call of PROC is returned from csv-records-fold.
For example, you can pick the records which match your criteria with the following code:
(csv-records-fold
  (lambda (rec cnt result)
    (if (satisfy-criteria? rec) (cons rec result) result))
  '())
To write a CSV file, use this function.
For csv-write-record, RECORD should be a list of field values. Each value is converted to a string and written out to PORT in CSV format. The end-of-line sequence can be set by set-line-delimiter; see the "Line Termination" section above.