
6.12 Strings

Builtin Class: <string>

A string class. In Gauche, a string can be viewed in two ways: a sequence of characters, or a sequence of bytes.

It should be emphasized that Gauche’s internal string object, the string body, is immutable. To comply with R5RS, in which strings are mutable, a Scheme-level string object is an indirect pointer to a string body. Mutating a string means that Gauche creates a new immutable string body that reflects the changes, then swaps the pointer in the Scheme-level string object.

This may affect some assumptions on the cost of string operations.

Gauche does not attempt to make string mutation faster; (string-set! s k c) is exactly as slow as taking two substrings, before and after the k-th character, and concatenating them with a single-character string in between. So, just avoid string mutations; we believe it’s a better practice. See also String Constructors.
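
The cost described above can be pictured with a small sketch. The helper string-replace-char is a hypothetical name, not a Gauche built-in; it builds the result non-destructively, doing the same substring-and-concatenate work that the text attributes to string-set!.

```scheme
;; Hypothetical helper (not part of Gauche): returns a new string equal
;; to s except that its k-th character is replaced by c.
(define (string-replace-char s k c)
  (string-append (substring s 0 k)
                 (string c)
                 (substring s (+ k 1) (string-length s))))

(string-replace-char "orange" 2 #\X)
;; ⇒ "orXnge"
```

Since the mutating and non-mutating versions cost about the same, the non-mutating one is usually the clearer choice.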

R5RS string operations are very minimal. Gauche supports some extra built-in operations, and also a rich string library defined in SRFI-13. See section srfi-13 - String library, for details about SRFI-13.



6.12.1 String syntax

Reader syntax: ""

[R5RS+] Denotes a literal string. Inside the double quotes, the following backslash escape sequences are recognized.

\"

[R5RS] Double-quote character.

\\

[R5RS] Backslash character.

\n

Newline character (ASCII 0x0a).

\r

Return character (ASCII 0x0d).

\f

Form-feed character (ASCII 0x0c).

\t

Tab character (ASCII 0x09).

\0

ASCII NUL character (ASCII 0x00).

\<whitespace>*<newline><whitespace>*

Ignored. This can be used to break a long string literal for readability. This escape sequence was introduced in R6RS.

\xNN

A byte represented by two-digit hexadecimal number NN. The byte is interpreted according to the internal multibyte encoding.

\uNNNN

A character whose UCS2 code is represented by four-digit hexadecimal number NNNN.

\UNNNNNNNN

A character whose UCS4 code is represented by eight-digit hexadecimal number NNNNNNNN.

If Gauche is compiled with an internal encoding other than UTF-8, the reader uses the gauche.charconv module to interpret the \uNNNN and \UNNNNNNNN escape sequences.

The following code is an example of the backslash-newline escape sequence:

 
(define *message* "\
  This is a long message \
  in a literal string.")

*message*
  ⇒ "This is a long message in a literal string."

Note the whitespace just after ‘message’. Since any whitespace before ‘in’ is eaten by the reader, you have to put a whitespace character between ‘message’ and the following backslash. If you want to include an actual newline character in a string, and any indentation after it, you can put ‘\n’ in the next line like this:

 
(define *message/newline* "\
  This is a long message, \
  \n   with a line break.")

Reader syntax: #*""

Denotes an incomplete string. The same escape sequences as the complete string syntax are recognized.

Rationale of the syntax: ‘#*’ is used for bit vectors in Common Lisp. Since an incomplete string is really a byte vector, the two are similar. (Bit vectors can be added later, if necessary, and the two can coexist.)



6.12.2 String Predicates

Function: string? obj

[R5RS] Returns #t if obj is a string, #f otherwise.

Function: string-immutable? obj

Returns #t if obj is an immutable string, #f otherwise.

Function: string-incomplete? obj

Returns #t if obj is an incomplete string, #f otherwise.
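
A brief illustration of the three predicates follows. The variable names are chosen for this sketch only; the incomplete string is written with the #*"" reader syntax described in the previous section.

```scheme
;; Illustrative names only; string-copy returns a fresh, mutable string.
(define complete-str   (string-copy "abc"))
(define incomplete-str #*"abc")

(string? complete-str)               ; ⇒ #t
(string? 'abc)                       ; ⇒ #f
(string-incomplete? complete-str)    ; ⇒ #f
(string-incomplete? incomplete-str)  ; ⇒ #t
(string-immutable? complete-str)     ; ⇒ #f
```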



6.12.3 String Constructors

Function: make-string k :optional char

[R5RS] Returns a string of length k. If the optional char is given, the new string is filled with it. Otherwise, the string is filled with whitespace. The resulting string is always complete.

 
(make-string 5 #\x) ⇒ "xxxxx"

Note that allocating a string with make-string and then filling it one character at a time is extremely inefficient in Gauche, and should be avoided. That kind of algorithm makes unnecessary assumptions about the underlying string allocation and representation mechanism, which Gauche doesn’t follow. You can use an output string port to construct a string (See section String ports). Even creating a list of characters and using list->string is faster than using make-string and string-set!.
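
As a minimal sketch of the output-string-port approach, the following builds a string piece by piece without any string-set!. The procedure name digits-string is made up for this example; call-with-output-string is the standard Gauche procedure that passes a fresh output string port to its argument and returns the accumulated string.

```scheme
;; Hypothetical example: build a string of n decimal digits by writing
;; to an output string port instead of make-string + string-set!.
(define (digits-string n)
  (call-with-output-string
    (lambda (out)
      (let loop ((i 0))
        (when (< i n)
          (display (modulo i 10) out)
          (loop (+ i 1)))))))

(digits-string 5)
;; ⇒ "01234"
```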

Function: make-byte-string k :optional byte

Creates and returns an incomplete string of size k. If byte is given, it must be an exact integer, and its lower 8 bits are used to initialize every byte in the created string.

Function: string char …

[R5RS] Returns a string consisting of char ….

Generic Function: x->string obj

A generic coercion function. Returns a string representation of obj. The default methods are defined as follows: strings are returned as is, numbers are converted by number->string, symbols are converted by symbol->string, and other objects are converted by display.

Other classes may provide methods to customize the behavior.
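
The default methods listed above behave as in the following short illustration. The last case goes through display, so the nested string and character appear without quotes or the #\ prefix.

```scheme
(x->string "already a string")  ; ⇒ "already a string"
(x->string 42)                  ; ⇒ "42"
(x->string 'foo)                ; ⇒ "foo"
(x->string '(1 "two" #\3))      ; ⇒ "(1 two 3)"
```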



6.12.4 String interpolation

The term "string interpolation" is used in various scripting languages such as Perl and Python to refer to a feature that embeds expressions in a string literal; the expressions are evaluated at run time and their results are inserted into the string.

Scheme doesn’t define such a feature, but Gauche implements it as a reader macro.

Reader syntax: #`string-literal

Evaluates to a string. If string-literal contains the character sequence ,expr, where expr is a valid external representation of a Scheme expression, expr is evaluated and its result is inserted in the original place (by using x->string, see String Constructors).

The comma and the following expression must be adjacent (without containing any whitespace characters), or it is not recognized as a special sequence.

Two adjacent commas are converted to a single comma. This lets you embed a literal comma before a non-whitespace character in string-literal.

Other characters in the string-literal are copied as is.

If you use a variable as expr and need to delimit it from the subsequent string, you can use the symbol escape syntax using ‘|’ character, as shown in the last two examples below.

 
#`"This is Gauche, version ,(gauche-version)."
 ⇒ "This is Gauche, version 0.9.3.3."

#`"Date: ,(sys-strftime \"%Y/%m/%d\" (sys-localtime (sys-time)))"
 ⇒ "Date: 2002/02/18"

(let ((a "AAA")
      (b "BBB"))
 #`"xxx ,a ,b zzz")
 ⇒ "xxx AAA BBB zzz"

#`"123,,456,,789"
 ⇒ "123,456,789"

(let ((n 5)) #`"R,|n|RS")
 ⇒ "R5RS"

(let ((x "bar")) #`"foo,|x|.")
 ⇒ "foobar."

In fact, the reader expands this syntax into a macro call, which is then expanded into a call of string-append as follows:

 
#`"This is Gauche, version ,(gauche-version)."
 ≡
(string-append "This is Gauche, version "
               (x->string (gauche-version))
               ".")

Rationale of the syntax: Some other scripting languages use ‘$expr’ or ‘#{...}’. I chose this syntax with respect to the quasiquote (See section Quasiquotation). Although it may be awkward to delimit variable names by ‘|’, the comma syntax should be easier for seasoned Scheme programmers to read than other exotic syntaxes.

Note that Scheme allows wider range of characters for valid identifier names than usual scripting languages. Consequently, you will almost always need to use ‘|’ delimiters when you interpolate the value of a variable. For example, while you can write "$year/$month/$day $hour:$minutes:$seconds" in Perl, you should write #`",|year|/,|month|/,day ,|hour|:,|minutes|:,seconds". It may be better always to delimit direct variable references in this syntax to avoid confusion.
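
To make the role of the delimiters concrete, here is roughly what a fully delimited literal such as #`",|year|/,|month|/,|day|" amounts to after expansion, following the string-append expansion shown earlier. The procedure name format-date is hypothetical and exists only to wrap the expression for this sketch.

```scheme
;; Hypothetical wrapper; the body mirrors the string-append form that
;; #`",|year|/,|month|/,|day|" would expand into.
(define (format-date year month day)
  (string-append (x->string year) "/"
                 (x->string month) "/"
                 (x->string day)))

(format-date 2012 5 28)
;; ⇒ "2012/5/28"
```

Without the ‘|’ delimiters, the reader would take year/ and month/ as identifier names, since ‘/’ is a valid identifier character in Scheme.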



6.12.5 String Accessors & Modifiers

Function: string-length string

[R5RS] Returns the length of a (possibly incomplete) string string.

Function: string-size string

Returns the size of a (possibly incomplete) string. The size of a string is the number of bytes the string occupies in memory. The same string may have different sizes if the native encoding scheme differs.

For an incomplete string, its length and its size always match.
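
The difference between length and size shows up as soon as a string contains multibyte characters. The sketch below assumes that Gauche was built with UTF-8 as the internal encoding, in which each of the two hiragana characters occupies three bytes; with another encoding the sizes would differ. The variable names are illustrative.

```scheme
;; Assumes the internal encoding is UTF-8.
(define ascii-str "abc")
(define kana-str  "\u3042\u3044")   ; two hiragana characters

(string-length ascii-str)   ; ⇒ 3
(string-size   ascii-str)   ; ⇒ 3
(string-length kana-str)    ; ⇒ 2
(string-size   kana-str)    ; ⇒ 6 (under UTF-8)
```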

Function: string-ref cstring k :optional fallback

[R5RS+] Returns the k-th character of a complete string cstring. It is an error to pass an incomplete string.

By default, an error is signalled if k is out of range (negative, or greater than or equal to the length of cstring). However, if an optional argument fallback is given, it is returned in that case. This is Gauche’s extension.

Function: string-byte-ref string k

Returns the k-th byte of a (possibly incomplete) string string. The returned value is an integer in the range between 0 and 255. k must be greater than or equal to zero, and less than (string-size string).
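
A few calls illustrate the two accessors, including the fallback extension of string-ref. The byte value shown relies only on ‘a’ being ASCII 0x61.

```scheme
(string-ref "abc" 1)         ; ⇒ #\b
(string-ref "abc" 10 #f)     ; ⇒ #f   (out of range, fallback returned)
(string-ref "abc" 10 #\?)    ; ⇒ #\?
(string-byte-ref "abc" 0)    ; ⇒ 97
```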

Function: string-set! string k char

[R5RS] Replaces string’s k-th character with char. k must be greater than or equal to zero, and less than (string-length string). The return value is undefined.

If string is an incomplete string, the integer value of the lower 8 bits of char is used to set string’s k-th byte.

See the notes in make-string about performance considerations.

Function: string-byte-set! string k byte

Replaces string’s k-th byte with the integer byte. byte must be in the range between 0 and 255, inclusive. k must be greater than or equal to zero, and less than (string-size string). If string is a complete string, it is turned into an incomplete string by this operation. The return value is undefined.



6.12.6 String Comparison

Function: string=? string1 string2
Function: string-ci=? string1 string2

[R5RS] Returns #t if string1 and string2 are equal, #f otherwise. string-ci=? ignores case differences.

Function: string<? string1 string2
Function: string<=? string1 string2
Function: string>? string1 string2
Function: string>=? string1 string2
Function: string-ci<? string1 string2
Function: string-ci<=? string1 string2
Function: string-ci>? string1 string2
Function: string-ci>=? string1 string2

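[R5RS] Compares string1 and string2 lexicographically, character by character. The -ci variants ignore case differences.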
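
A short illustration of the comparison procedures, using plain ASCII strings so that the results do not depend on the internal encoding:

```scheme
(string=?    "abc" "abc")       ; ⇒ #t
(string=?    "abc" "ABC")       ; ⇒ #f
(string-ci=? "abc" "ABC")       ; ⇒ #t
(string<?    "apple" "banana")  ; ⇒ #t
(string>=?   "apple" "banana")  ; ⇒ #f
(string-ci<? "Apple" "banana")  ; ⇒ #t
```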



6.12.7 String utilities

Function: substring string start end

[R5RS] Returns a substring of string, starting from the start-th character (inclusive) and ending at the end-th character (exclusive). The start and end arguments must satisfy 0 <= start <= N, 0 <= end <= N, and start <= end, where N is the length of the string.

When start is zero and end is N, this procedure returns a copy of string.

Actually, the extended string-copy explained below is a superset of substring. This procedure is kept mostly for compatibility with R5RS programs. See also subseq in gauche.sequence - Sequence framework, for the generic version.

Function: string-append string …

[R5RS] Returns a newly allocated string whose content is the concatenation of string ….

See also string-concatenate in String reverse & append.

Function: string->list string :optional start end
Function: list->string list

[R5RS+][SRFI-13] Converts a string to a list of characters or vice versa.

You can give optional start/end indexes to string->list, as specified in SRFI-13.

For list->string, every element of list must be a character, or an error is signalled. If you want to build a string out of a mixed list of strings and characters, you may want to use tree->string in text.tree - Lazy text construction.
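
The conversions can be illustrated as follows. The tree->string call needs the text.tree module to be loaded first; the nested list passed to it is an arbitrary example of mixed strings and characters.

```scheme
(use text.tree)   ; provides tree->string

(string->list "abc")          ; ⇒ (#\a #\b #\c)
(string->list "abcde" 2)      ; ⇒ (#\c #\d #\e)
(string->list "abcde" 1 3)    ; ⇒ (#\b #\c)
(list->string '(#\a #\b))     ; ⇒ "ab"

;; Mixed strings and characters, possibly nested:
(tree->string '("ab" #\c ("de" #\f)))   ; ⇒ "abcdef"
```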

Function: string-copy string :optional start end

[R5RS+][SRFI-13] Returns a copy of string. You can give a start and/or end index to extract part of the original string (this effectively makes string-copy a superset of substring).

If only the start argument is given, a substring beginning from the start-th character (inclusive) to the end of string is returned. If both start and end arguments are given, a substring from the start-th character (inclusive) to the end-th character (exclusive) is returned. See substring above for the conditions that start and end should satisfy.
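
The following calls illustrate the three argument patterns of string-copy and how the two-index form coincides with substring:

```scheme
(string-copy "abcdef")        ; ⇒ "abcdef"
(string-copy "abcdef" 2)      ; ⇒ "cdef"
(string-copy "abcdef" 2 4)    ; ⇒ "cd"
(substring   "abcdef" 2 4)    ; ⇒ "cd"
```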

Function: string-fill! string char :optional start end

[R5RS+][SRFI-13] Fills string with char. The optional start and end arguments limit the affected area.

 
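;; Work on copies, since string literals should not be mutated.
(define s1 (string-copy "orange"))
(string-fill! s1 #\X)
s1 ⇒ "XXXXXX"
(define s2 (string-copy "orange"))
(string-fill! s2 #\X 2 4)
s2 ⇒ "orXXge"

Function: string-join strs :optional delim grammar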

[SRFI-13] Concatenates the strings in the list strs, with the string delim as ‘glue’.

The argument grammar may be one of the following symbols to specify how the strings are concatenated.

infix

Use delim between each string. This mode is the default. Note that this mode introduces ambiguity when strs is an empty list or a list containing only an empty string.

 
(string-join '("apple" "mango" "banana") ", ")
  ⇒ "apple, mango, banana"
(string-join '() ":")
  ⇒ ""
(string-join '("") ":")
  ⇒ ""

strict-infix

Works like infix, but an empty list is not allowed for strs, thus avoiding the ambiguity.

prefix

Use delim before each string.

 
(string-join '("usr" "local" "bin") "/" 'prefix)
  ⇒ "/usr/local/bin"
(string-join '() "/" 'prefix)
  ⇒ ""
(string-join '("") "/" 'prefix)
  ⇒ "/"

suffix

Use delim after each string.

 
(string-join '("a" "b" "c") "&" 'suffix)
  ⇒ "a&b&c&"
(string-join '() "&" 'suffix)
  ⇒ ""
(string-join '("") "&" 'suffix)
  ⇒ "&"

Function: string-scan string item :optional return
Function: string-scan-right string item :optional return

Scans string for item (either a string or a character). While string-scan finds the leftmost match, string-scan-right finds the rightmost match.

The return argument specifies what value should be returned when item is found in string. It must be one of the following symbols.

index

Returns the index in string if item is found, or #f. This is the default behavior.

 
(string-scan "abracadabra" "ada") ⇒ 5
(string-scan "abracadabra" #\c) ⇒ 4
(string-scan "abracadabra" "aba") ⇒ #f

before

Returns a substring of string before item, or #f if item is not found.

 
(string-scan "abracadabra" "ada" 'before) ⇒ "abrac"
(string-scan "abracadabra" #\c 'before) ⇒ "abra"

after

Returns a substring of string after item, or #f if item is not found.

 
(string-scan "abracadabra" "ada" 'after) ⇒ "bra"
(string-scan "abracadabra" #\c 'after) ⇒ "adabra"

before*

Returns two values: the substring of string before item, and the rest of string starting from item. If item is not found, returns (values #f #f).

 
(string-scan "abracadabra" "ada" 'before*)
  ⇒ "abrac" and "adabra"
(string-scan "abracadabra" #\c 'before*)
  ⇒ "abra" and "cadabra"

after*

Returns a substring of string up to the end of item, and the rest. If item is not found, returns (values #f #f).

 
(string-scan "abracadabra" "ada" 'after*)
  ⇒ "abracada" and "bra"
(string-scan "abracadabra" #\c 'after*)
  ⇒ "abrac" and "adabra"

both

Returns two values: the substring of string before item and the substring after item. If item is not found, returns (values #f #f).

 
(string-scan "abracadabra" "ada" 'both)
  ⇒ "abrac" and "bra"
(string-scan "abracadabra" #\c 'both)
  ⇒ "abra" and "adabra"
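
The before*, after* and both modes return two values, so the caller needs receive or call-with-values to pick them up. The following is a minimal sketch; split-key-value is a hypothetical helper, not part of Gauche.

```scheme
;; Hypothetical helper: split "key=value" at the first "=".
;; Returns a pair (key . value), or #f if there is no "=".
(define (split-key-value line)
  (receive (key value) (string-scan line "=" 'both)
    (if key
        (cons key value)
        #f)))

(split-key-value "name=gauche")    ; ⇒ ("name" . "gauche")
(split-key-value "no-separator")   ; ⇒ #f
```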

Function: string-split string splitter

Splits string by splitter and returns a list of strings. splitter can be a character, a character set, a string, a regexp, or a procedure.

If splitter is a character, the character is used as a delimiter.

If splitter is a character set, any consecutive characters that are members of the character set are used as a delimiter.

If a procedure is given to splitter, it is called for each character in string, and the consecutive characters that caused splitter to return a true value are used as a delimiter.

 
(string-split "/aa/bb//cc" #\/)    ⇒ ("" "aa" "bb" "" "cc")
(string-split "/aa/bb//cc" "/")    ⇒ ("" "aa" "bb" "" "cc")
(string-split "/aa/bb//cc" "//")   ⇒ ("/aa/bb" "cc")
(string-split "/aa/bb//cc" #[/])   ⇒ ("" "aa" "bb" "cc")
(string-split "/aa/bb//cc" #/\/+/) ⇒ ("" "aa" "bb" "cc")
(string-split "/aa/bb//cc" #[\w])  ⇒ ("/" "/" "//" "")
(string-split "/aa/bb//cc" char-alphabetic?) ⇒ ("/" "/" "//" "")

;; some boundary cases
(string-split "abc" #\/) ⇒ ("abc")
(string-split ""    #\/) ⇒ ("")

See also string-tokenize (See section Other string operations).



6.12.8 Incomplete strings

A string can be flagged as "incomplete" if it may contain byte sequences that do not constitute valid multibyte characters in Gauche’s native encoding.

Incomplete strings may be generated in several circumstances: reading binary data as a string, reading string data that has been ‘chopped’ in the middle of a multibyte character, or concatenating a string with other incomplete strings, for example.

Incomplete strings should be regarded as an exceptional case. They used to be a way to handle byte strings, but now we have u8vector (See section gauche.uvector - Uniform vectors) for that purpose. In fact, we’re planning to remove them in future releases.

Just in case, if you happen to get an incomplete string, you can convert it to a complete string by the following procedure:

Function: string-incomplete->complete str :optional handling

Reinterprets the content of an incomplete string str and returns a newly created complete string from it. The handling argument specifies how to handle the illegal byte sequences in str.

#f

If str contains an illegal byte sequence, gives up the conversion and returns #f. This is the default behavior.

:omit

Omit any illegal byte sequences. Always returns a complete string.

a character

Replace each byte in illegal byte sequences by the given character. Always returns a complete string.

If str is already a complete string, its copy is returned.
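
The following sketch shows the default and :omit handling. It assumes a UTF-8 internal encoding, in which the byte 0xff can never be part of a valid character; the variable name broken is illustrative. The incomplete string is produced by appending a one-byte string made with make-byte-string.

```scheme
;; Assumes UTF-8 as the internal encoding, where 0xff is always illegal.
(define broken (string-append "ab" (make-byte-string 1 #xff)))

(string-incomplete? broken)                  ; ⇒ #t
(string-incomplete->complete broken)         ; ⇒ #f
(string-incomplete->complete broken :omit)   ; ⇒ "ab"

;; An incomplete string whose bytes are all valid converts cleanly.
(string-incomplete->complete #*"abc")        ; ⇒ "abc"
```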



This document was generated by Shiro Kawai on May 28, 2012 using texi2html 1.82.