Strings (Gauche Users’ Reference)

6.10 Strings ¶

Builtin Class: <string> ¶: A string class.

It should be emphasized that Gauche’s internal string object, the string body, is immutable. To comply with R7RS, in which strings are mutable, a Scheme-level string object is an indirect pointer to a string body. Mutating a string means that Gauche creates a new immutable string body that reflects the changes, then swaps the pointer in the Scheme-level string object.

This may affect some assumptions on the cost of string operations.

Copying a string is O(1), no matter how long the string is, since the same string body is shared.
Taking a substring is usually also O(1), for the resulting string shares part of the original string body. Gauche may copy a part of the string for better memory management, but the visible cost should stay pretty close to O(1). (However, note that accessing a specific point by index within the original string may cost O(N) because of multibyte encoding, which is a different story.)
On the other hand, mutating a string costs O(N), where N is the length of the string, even for replacing a single character.

Gauche does not attempt to make string mutation faster; (string-set! s k c) is exactly as slow as taking two substrings, before and after the k-th character, and concatenating them with a single-character string in between. So, just avoid string mutations; we believe it’s a better practice. See also String constructors.
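
To make the cost concrete, the following sketch spells out what replacing a single character amounts to when written without mutation. The name string-replace-char is hypothetical and used only for illustration; it is not a Gauche procedure.

```scheme
;; Hypothetical helper (not part of Gauche): returns a new string in which
;; the k-th character of s is replaced by c.  The original s is untouched.
(define (string-replace-char s k c)
  (string-append (substring s 0 k)
                 (string c)
                 (substring s (+ k 1) (string-length s))))

(string-replace-char "hello" 1 #\a)  ; ⇒ "hallo"
```

If you need many such replacements, it is usually better to build the result once (for example with an output string port) than to apply this step repeatedly.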

R7RS string operations are very minimal. Gauche supports some extra built-in operations, and also a rich string library defined in SRFI-13. See srfi.13 - String library, for details about SRFI-13.

6.10.1 String syntax ¶

Reader Syntax: "…" ¶

[R7RS+] Denotes a literal string. Inside the double quotes, the following backslash escape sequences are recognized.

\": [R7RS] Double-quote character.
\\: [R7RS] Backslash character.
\n: [R7RS] Newline character (ASCII 0x0a).
\r: [R7RS] Return character (ASCII 0x0d).
\f: Form-feed character (ASCII 0x0c).
\t: [R7RS] Tab character (ASCII 0x09).
\a: [R7RS] Alarm character (ASCII 0x07).
\b: [R7RS] Backspace character (ASCII 0x08).
\0: ASCII NUL character (ASCII 0x00).
\<whitespace>*<newline><whitespace>*: [R7RS] Ignored. This can be used to break a long string literal for readability. This escape sequence was introduced in R6RS.
\xN;: [R7RS] A character whose Unicode codepoint is represented by hexadecimal number N, which is any number of hexadecimal digits. (See the compatibility notes below.)
\uNNNN: A character whose UCS2 code is represented by four-digit hexadecimal number NNNN.
\UNNNNNNNN: A character whose UCS4 code is represented by eight-digit hexadecimal number NNNNNNNN.

The following code is an example of backslash-newline escape sequence:

(define *message* "\
  This is a long message \
  in a literal string.")

*message*
  ⇒ "This is a long message in a literal string."

Note the whitespace just after ‘message’. Since any whitespace before ‘in’ is eaten by the reader, you have to put a whitespace between ‘message’ and the following backslash. If you want to include an actual newline character in a string, and any indentation after it, you can put ‘\n’ on the next line like this:

(define *message/newline* "\
  This is a long message, \
  \n   with a line break.")

Note for the compatibility: We used to recognize the syntax \xNN (a two-digit hexadecimal number, without the semicolon terminator) as a character in a string; for example, "\x0d\x0a" was the same as "\r\n". For compatibility, we still support it when we don’t see the terminating semicolon. There are ambiguous cases: "\x0a;" means "\n" in the current syntax, while it means "\n;" in the legacy syntax.

Setting the reader mode to legacy restores the old behavior. Setting the reader mode to warn-legacy makes it work like the default behavior, but prints a warning when it finds legacy syntax. See Reader lexical mode, for the details.
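
As a small illustration of the current syntax, the following sketch writes a few characters with semicolon-terminated hexadecimal escapes. It assumes the default reader mode; the variable names are only for illustration.

```scheme
;; Current syntax: hexadecimal digits terminated by a semicolon.
(define hex-ab "\x41;\x42;")   ; the two characters A and B
(define crlf   "\x0d;\x0a;")   ; same content as "\r\n"

(string=? hex-ab "AB")    ; ⇒ #t
(string=? crlf "\r\n")    ; ⇒ #t
(string-length crlf)      ; ⇒ 2
```

Always writing the terminating semicolon avoids the ambiguous cases described above.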

6.10.2 String predicates ¶

Function: string? obj ¶: [R7RS base] Returns #t if obj is a string, #f otherwise.

Function: string-immutable? obj ¶

Returns #t if obj is an immutable string, #f otherwise.

String literals, and the strings returned from certain procedures such as symbol->string are immutable. To ensure you get an immutable string in a program, you can use string-copy-immutable.
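
The following sketch shows how the predicate behaves for a literal, a mutable copy, and an immutable copy. The variable names are only for illustration, and the results assume the behavior described above (literals are immutable, string-copy returns a mutable string).

```scheme
(define literal "abc")                                 ; a string literal
(define mutable-copy (string-copy literal))            ; a fresh mutable copy
(define frozen (string-copy-immutable mutable-copy))   ; an immutable copy

(string-immutable? literal)       ; ⇒ #t
(string-immutable? mutable-copy)  ; ⇒ #f
(string-immutable? frozen)        ; ⇒ #t
(string-immutable? 42)            ; ⇒ #f, not a string at all
```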

Function: string-incomplete? obj ¶: Returns #t if obj is an incomplete string, #f otherwise.

6.10.3 String constructors ¶

Function: make-string k :optional char ¶

[R7RS base] Returns a string of length k. If optional char is given, the new string is filled with it. Otherwise, the string is filled with a whitespace. The result string is always complete.

(make-string 5 #\x) ⇒ "xxxxx"

Note that the algorithm of allocating a string by make-string and then filling it one character at a time is extremely inefficient in Gauche, and should be avoided.

In Gauche, a string is simply a pointer to an immutable string content. If you mutate a string by, e.g. string-set!, Gauche allocates a whole new immutable string content, copies the original content with the modification, then swaps the pointer of the original string. It is no more efficient than making a new copy.

You can use an output string port for a string construction (see String ports). Even creating a list of characters and using list->string is faster than using make-string and string-set!.
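
As a minimal sketch of the two recommended approaches, the following procedures build a string of n decimal digit characters without any mutation. The procedure names are hypothetical, and n is assumed to be between 0 and 10.

```scheme
;; Hypothetical helper: the character for the digit i (0 <= i <= 9).
(define (digit-char i)
  (integer->char (+ i (char->integer #\0))))

;; Build the string by writing characters to an output string port.
(define (digits-via-port n)
  (call-with-output-string
    (lambda (out)
      (do ((i 0 (+ i 1)))
          ((= i n))
        (write-char (digit-char i) out)))))

;; Build the string from a list of characters.
(define (digits-via-list n)
  (list->string (map digit-char (iota n))))

(digits-via-port 10)  ; ⇒ "0123456789"
(digits-via-list 5)   ; ⇒ "01234"
```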

Function: make-byte-string k :optional byte ¶: Creates and returns an incomplete string of size k. If byte is given, it must be an exact integer, and its lower 8 bits are used to initialize every byte in the created string.

Function: string char … ¶: [R7RS base] Returns a string consisting of char ….

Generic Function: x->string obj ¶

A generic coercion function. Returns a string representation of obj. The default methods are defined as follows: strings are returned as is, numbers are converted by number->string, symbols are converted by symbol->string, and other objects are converted by display.

Other classes may provide methods to customize the behavior.
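
The following sketch shows the default conversions and one way a class might customize x->string. The class <point> and its slots are hypothetical, introduced only for this example.

```scheme
;; Default methods.
(x->string "abc")     ; ⇒ "abc"
(x->string 42)        ; ⇒ "42"
(x->string 'foo)      ; ⇒ "foo"
(x->string '(1 2))    ; ⇒ "(1 2)", converted by display

;; A hypothetical class with its own x->string method.
(define-class <point> ()
  ((x :init-keyword :x)
   (y :init-keyword :y)))

(define-method x->string ((p <point>))
  (format #f "(~a, ~a)" (slot-ref p 'x) (slot-ref p 'y)))

(x->string (make <point> :x 1 :y 2))  ; ⇒ "(1, 2)"
```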

6.10.4 String interpolation ¶

The term "string interpolation" is used in various scripting languages such as Perl and Python to refer to the feature to embed expressions in a string literal, which are evaluated and then their results are inserted into the string literal at run time.

Scheme doesn’t define such a feature, but Gauche implements it as a reader macro.

Reader Syntax: #string-literal ¶

Evaluates to a string. If string-literal contains the character sequence ~expr, where expr is a valid external representation of a Scheme expression, expr is evaluated and its result is inserted in the original place (by using x->string, see String constructors).

The tilde and the following expression must be adjacent (without containing any whitespace characters), or it is not recognized as a special sequence.

To include a tilde itself immediately followed by non-delimiting character, use ~~.

Other characters in the string-literal are copied as is.

If you use a variable as expr and need to delimit it from the subsequent string, you can use the symbol escape syntax using ‘|’ character, as shown in the last two examples below.

#"This is Gauche, version ~(gauche-version)."
 ⇒ "This is Gauche, version 0.9.16_pre3."

#"Date: ~(sys-strftime \"%Y/%m/%d\" (sys-localtime (sys-time)))"
 ⇒ "Date: 2002/02/18"

(let ((a "AAA")
      (b "BBB"))
 #"xxx ~a ~b zzz")
 ⇒ "xxx AAA BBB zzz"

#"123~~456~~789"
 ⇒ "123~456~789"

(let ((n 7)) #"R~|n|RS")
 ⇒ "R7RS"

(let ((x "bar")) #"foo~|x|.")
 ⇒ "foobar."

In fact, the reader expands this syntax into a macro call, which is then expanded into a call of string-append as follows:

#"This is Gauche, version ~(gauche-version)."
 ≡
(string-interpolate* ("This is Gauche, version "
                      (gauche-version)
                      "."))

;; then, it expands to...

(string-append "This is Gauche, version "
               (x->string (gauche-version))
               ".")

(NB: The exact spec of string-interpolate* might change in future, so do not rely on the current behavior.)

Since the #"..." syntax is equivalent to a macro call of string-interpolate*, which is provided in the Gauche module, it must be visible from where you use the interpolation syntax. When you write Gauche code, typically you implicitly inherit the Gauche module so you don’t need to worry; however, if you start from R7RS code, make sure you import string-interpolate* (by (import (gauche base)), for example) whenever you use string interpolation syntax. Also be careful not to shadow string-interpolate* locally.

Reader Syntax: #`string-literal ¶

This is the old style of string interpolation. It is still recognized, but discouraged in new code.

Inside string-literal, you can use ,expr (instead of ~expr) to evaluate expr. If a comma isn’t immediately followed by a character starting an expression, it loses its special meaning.

#`"This is Gauche, version ,(gauche-version)"

Rationale of the syntax: There is wide variation in string interpolation syntax among scripting languages. The syntax is usually linked with other syntax of the language (e.g. prefixing $ to mark the place to evaluate is in sync with variable reference syntax in some languages).

The old style of string interpolation syntax was taken from quasiquote syntax, because those two are conceptually similar operations (see Quasiquotation). However, since the comma character is frequently used in string literals, it was rather awkward.

We decided that tilde is more suitable as the unquote character for the following reasons.

Traditionally, Lisp’s string formatter format uses ~ to introduce format directives (see Formatting output). Lispers are used to scanning for ~’s in a string as variable portions.
Gauche’s ~ is a universal accessor, and the operator has a nuance of “taking something out of it” (see Universal accessor).
Clojure, a new Lisp dialect, adopted ~ as the unquote character in the quasiquote syntax, instead of commas.

Note that Scheme allows a wider range of characters in valid identifier names than usual scripting languages do. Consequently, you will almost always need to use ‘|’ delimiters when you interpolate the value of a variable. For example, while you can write "$year/$month/$day $hour:$minutes:$seconds" in Perl, you should write #"~|year|/~|month|/~day ~|hour|:~|minutes|:~seconds". It may be better always to delimit direct variable references in this syntax to avoid confusion.
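
As a small sketch of the always-delimit style, the following procedure interpolates three variables, each wrapped in ‘|’. The procedure name format-date is hypothetical.

```scheme
;; Hypothetical helper: formats a date with every variable delimited.
;; Without the delimiters, the reader would take year/ as part of the
;; identifier that follows the tilde.
(define (format-date year month day)
  #"~|year|/~|month|/~|day|")

(format-date 2002 2 18)  ; ⇒ "2002/2/18"
```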

6.10.5 String cursors ¶

String cursors are opaque objects that point into strings, similar to indexes. Cursors, however, are more efficient. For example, to get a character with string-ref using an index on a multibyte string, Gauche needs to iterate from the beginning of the string up to that position, which takes O(n). With cursors, access is O(1). (For single-byte (ASCII) strings or indexed strings, Gauche does it in O(1) even with an index. See String indexing, for the details of indexed strings.)

For a string of length n, there can be n+1 cursors. The last cursor, at the end of the string, does not point to any valid character; it’s usually used to indicate that nothing is found.

A string cursor is associated with a specific string and should not be used with another string. A string cursor also becomes invalid when the associated string is modified. Accessing an invalid cursor does not always fail though. Running gosh with -fsafe-string-cursors could help catch these issues, with some performance overhead. See Command-line options.

Most of the time, string cursors aren’t heap-allocated. A cursor is allocated on the heap only (1) when it points at a huge byte index, or (2) when you use -fsafe-string-cursors to enable extra run-time checks.

The threshold of byte index that causes a string cursor to be heap-allocated is 2^56 on 64bit systems, and 2^24 on 32bit systems, in the current implementation. On 64bit systems you will never hit the threshold in practice. On 32bit systems you may, if you have a huge string, but then you may want to consider using another data structure rather than keeping such data in one string object.

Most procedures that take indexes in Gauche can also take cursors. Relying on this though is unportable. For example, the substring procedure in RnRS standards does not mention anything about cursors even though the Gauche version accepts cursors. For portable programs, you should only use cursors on procedures from srfi.130 module (see srfi.130 - Cursor-based string library).

Builtin Class: <string-cursor> ¶

Represents a cursor. When printed out, you’ll see the byte offset from the beginning of the string, not the character index.

(string-index->cursor "あかさたな" 2)
 ⇒ #<string-cursor 6>

Function: string-cursor? obj ¶: [SRFI-130] Returns #t if obj is a string cursor, #f otherwise.

Function: string-cursor-start str ¶: [SRFI-130] Returns a cursor pointing to the start of a string str. It returns a valid cursor on an empty string too. It’s the same as string-cursor-end in that case.

Function: string-cursor-end str ¶: [SRFI-130] Returns a cursor pointing to the end of str (the point after the last character.) If str is empty, it is the same as string-cursor-start. This cursor does not point to any valid character of the string.

Function: string-cursor-next str cur ¶: [SRFI-130] Returns the cursor into str following cur. cur can also be an index. An error is signaled if cur points to the end of the string.

Function: string-cursor-prev str cur ¶: [SRFI-130] Returns the cursor into str preceding cur. cur can also be an index. An error is signaled if cur points to the beginning of the string.

Function: string-cursor-forward str cur n ¶: [SRFI-130] Returns the cursor into str following cur by n characters. cur can also be an index.

Function: string-cursor-back str cur n ¶: [SRFI-130] Returns the cursor into str preceding cur by n characters. cur can also be an index.

Function: string-index->cursor str index ¶: [SRFI-130] Convert an index to a cursor. If index is a cursor it will be returned as-is.

Function: string-cursor->index str cur ¶: [SRFI-130] Convert a cursor to an index. If cur is an index it will be returned as-is.

Function: string-cursor-diff str start end ¶: [SRFI-130] Returns the number of characters between start and end. It should be non-negative if start precedes end, non-positive otherwise. start and end also accept index.

Function: string-cursor=? cur1 cur2 ¶
Function: string-cursor<? cur1 cur2 ¶
Function: string-cursor<=? cur1 cur2 ¶
Function: string-cursor>? cur1 cur2 ¶
Function: string-cursor>=? cur1 cur2 ¶: [SRFI-130] Compares two cursors or two indexes (but not a cursor and an index) and returns #t or #f accordingly.
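
The cursor procedures above can be combined into a simple left-to-right scan. The following is a minimal sketch; the procedure name count-char is hypothetical, and passing a cursor to string-ref relies on Gauche’s extension rather than on SRFI-130.

```scheme
;; Hypothetical helper: counts occurrences of ch in str by walking
;; the string with a cursor instead of an integer index.
(define (count-char str ch)
  (let ((end (string-cursor-end str)))
    (let loop ((cur (string-cursor-start str))
               (count 0))
      (if (string-cursor=? cur end)
          count
          (loop (string-cursor-next str cur)
                (if (char=? (string-ref str cur) ch) (+ count 1) count))))))

(count-char "abracadabra" #\a)  ; ⇒ 5

;; Converting between cursors and indexes.
(define s "abracadabra")
(string-cursor->index s (string-cursor-forward s (string-cursor-start s) 4))  ; ⇒ 4
(string-cursor-diff s (string-cursor-start s) (string-cursor-end s))          ; ⇒ 11
```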

6.10.6 String indexing ¶

Since Gauche stores strings in multibyte encoding, random access requires O(N) by default. In most cases, string access is either sequential or follows a search-and-extract pattern, and Gauche provides direct means for these operations, so you don’t need to deal with indexed access. However, there may be cases where you need more efficient random access to a string (mostly when porting third-party code, we imagine).

There are a couple of ways to achieve O(1) random access.

First, instead of integer character indexes, you can use string cursors (see String cursors). They are defined by srfi.130, and you can use code that’s using SRFI-130 as is, without worrying about slow access. However, if an external interface gives you integer character indexes, converting an index to a cursor and vice versa takes O(N) after all.

There’s another way. You can precompute a string index, mapping from integer character index to the position in the multibyte string. It costs O(N) of time and space to compute it, but once computed, you have O(1) random access. (We store positions for every K characters, where K is between 16 and 256, so it won’t take up as much storage as the actual string body.)

For portability, SRFI-135 Immutable Texts provides O(1) accessible string as “texts”. On Gauche, a text is just an immutable string with index attached.

Function: string-build-index! str ¶

Computes and attaches an index to a string str, and returns str itself. The operation doesn’t alter the content of str, and you can pass an immutable string as well.

If str is a single-byte string (ASCII-only, or incomplete), or a short one (less than 64 octets), no index is attached. It is ok to pass a string which already has an index; then index computation is skipped.

The index is attached to the string’s content. If you alter str by e.g. string-set!, the index is discarded.

Function: string-fast-indexable? str ¶: Returns #t iff index access of a string str is effectively O(1), that is, str is either a single-byte string, a short string, or a long multibyte string with index computed.
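
The following sketch attaches an index to a long multibyte string and then accesses it by integer index. It assumes the native encoding is UTF-8, so that the chosen character takes more than one byte; the variable names are only for illustration.

```scheme
(define hiragana-a (integer->char #x3042))   ; U+3042, multibyte in UTF-8

;; 100 multibyte characters: long enough for an index to be attached.
(define text (string-build-index! (make-string 100 hiragana-a)))

(string-fast-indexable? text)             ; ⇒ #t, the index has been computed
(string-fast-indexable? "abc")            ; ⇒ #t, a short single-byte string
(char=? (string-ref text 50) hiragana-a)  ; ⇒ #t
```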

6.10.7 String accessors & modifiers ¶

Function: string-length string ¶: [R7RS base] Returns the length of (possibly incomplete) string string.

Function: string-size string ¶

Returns the size of (possibly incomplete) string. The size of a string is the number of bytes the string occupies in memory. The same string may have different sizes if the native encoding scheme differs.

For an incomplete string, its length and its size always match.
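
A short sketch of the difference between length and size follows. The byte count shown for the multibyte string assumes the native encoding is UTF-8; the variable names are only for illustration.

```scheme
(define kana "\u3042\u3044\u3046")   ; three hiragana characters

(string-length kana)   ; ⇒ 3 (characters)
(string-size kana)     ; ⇒ 9 when the native encoding is UTF-8 (3 bytes each)

(define bytes (make-byte-string 4 0))   ; incomplete string of four zero bytes
(string-incomplete? bytes)   ; ⇒ #t
(string-length bytes)        ; ⇒ 4
(string-size bytes)          ; ⇒ 4
```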

Function: string-ref cstring k :optional fallback ¶

[R7RS+ base] Returns k-th character of a complete string cstring. It is an error to pass an incomplete string.

By default, an error is signaled if k is out of range (negative, or greater than or equal to the length of cstring). However, if an optional argument fallback is given, it is returned in such case. This is Gauche’s extension.

If cstring is a multibyte string without index attached, this procedure takes O(k) time. See String indexing, for ensuring O(1) access.

k can also be a string cursor (also Gauche’s extension). Cursor access is O(1).
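
The fallback argument can replace an explicit range check, as in this small sketch. The procedure name first-char is hypothetical.

```scheme
(string-ref "abc" 1)      ; ⇒ #\b
(string-ref "abc" 3 #f)   ; ⇒ #f, since 3 is out of range

;; Hypothetical helper: the first character of s, or #f if s is empty.
(define (first-char s)
  (string-ref s 0 #f))

(first-char "gauche")  ; ⇒ #\g
(first-char "")        ; ⇒ #f
```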

Function: string-byte-ref string k ¶: Returns k-th byte of a (possibly incomplete) string string. Returned value is an integer in the range between 0 and 255. k must be greater than or equal to zero, and less than (string-size string).

Function: string-set! string k char ¶

[R7RS base] Substitute string’s k-th character by char. k must be greater than or equal to zero, and less than (string-length string). Return value is undefined.

If string is an incomplete string, integer value of the lower 8 bits of char is used to set string’s k-th byte.

See the notes in make-string about performance consideration.

Function: string-byte-set! string k byte ¶: Substitute string’s k-th byte by integer byte. byte must be in the range between 0 and 255, inclusive. k must be greater than or equal to zero, and less than (string-size string). If string is a complete string, it is turned into an incomplete string by this operation. Return value is undefined.

6.10.8 String comparison ¶

Function: string=? string1 string2 string3 … ¶

[R7RS base] Returns #t iff all arguments are strings with the same content.

If any of the arguments is an incomplete string, it returns #t iff all arguments are incomplete and have exactly the same content. In other words, a complete string and an incomplete string never equal each other.

Function: string<? string1 string2 string3 … ¶

Function: string<=? string1 string2 string3 … ¶

Function: string>? string1 string2 string3 … ¶

Function: string>=? string1 string2 string3 … ¶

[R7RS base] Compares strings in codepoint order. Returns #t iff all the arguments are ordered.

Comparison between an incomplete string and a complete string, or between two incomplete strings, is done by octet-to-octet comparison. If a complete string and an incomplete string have exactly the same binary representation of the content, the complete string is smaller.
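
Since these procedures accept more than two arguments, they can check a whole sequence at once. The following is a minimal sketch; the procedure name sorted-strings? is hypothetical.

```scheme
(string=? "abc" "abc" "abc")           ; ⇒ #t
(string<? "apple" "banana" "cherry")   ; ⇒ #t
(string<? "apple" "cherry" "banana")   ; ⇒ #f
(string<=? "apple" "apple" "banana")   ; ⇒ #t

;; Hypothetical helper: is the list of strings in non-decreasing order?
;; Lists with fewer than two elements are treated as sorted.
(define (sorted-strings? strs)
  (or (null? strs)
      (null? (cdr strs))
      (apply string<=? strs)))

(sorted-strings? '("a" "b" "b" "c"))  ; ⇒ #t
(sorted-strings? '("b" "a"))          ; ⇒ #f
```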

Function: string-ci=? string1 string2 string3 … ¶

Function: string-ci<? string1 string2 string3 … ¶

Function: string-ci<=? string1 string2 string3 … ¶

Function: string-ci>? string1 string2 string3 … ¶

Function: string-ci>=? string1 string2 string3 … ¶

Case-insensitive string comparison.

These procedures fold arguments character-wise, according to Unicode-defined character-by-character case mapping. See char-foldcase for the details (Characters). Character-wise case folding doesn’t handle cases like the German eszett:

(string-ci=? "\u00df" "SS") ⇒ #f

R7RS requires string-ci* procedures to use string case folding. Gauche provides R7RS-conformant case insensitive comparison procedures in gauche.unicode (see Full string case conversion). If you write in R7RS, importing (scheme char) library, you’ll use gauche.unicode’s string-ci* procedures.

6.10.9 String utilities ¶

Function: substring string start end ¶

[R7RS+ base] Returns a substring of string, starting from start-th character (inclusive) and ending at end-th character (exclusive). The start and end arguments must satisfy 0 <= start <= N, 0 <= end <= N, and start <= end, where N is the length of the string.

start and end can also be string cursors, but this is an extension of Gauche.

When start is zero and end is N, this procedure returns a copy of string. (See also opt-substring below, if you want to avoid copying when it is not necessary.)

Actually, extended string-copy explained below is a superset of substring. This procedure is kept mostly for compatibility of R7RS programs. See also subseq in gauche.sequence - Sequence framework, for the generic version.

Function: opt-substring string :optional start end ¶

Like substring, returns a part of string between start-th character (inclusive) and end-th character (exclusive). However, if the entire string is used (e.g. start is 0 and end is the length of string, or the arguments are omitted, etc.), string is returned as is, without copying.

This is a typical handling of optional start/end indexes for many string utilities. Note that using substring forces copying the input string even when it’s not necessary.

Besides exact integers, #f or #<undef> is allowed as start and end, to indicate the argument is missing. In that case, 0 is assumed for start, and the length of string is assumed for end.
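
As a sketch of this typical usage, the following utility accepts optional start/end arguments and passes them straight to opt-substring. The procedure name count-spaces is hypothetical.

```scheme
;; Hypothetical utility: counts space characters in str, optionally
;; restricted to the range from start (inclusive) to end (exclusive).
;; Omitted arguments default to #f, which opt-substring treats as missing.
(define (count-spaces str :optional (start #f) (end #f))
  (let ((target (opt-substring str start end))
        (count 0))
    (string-for-each (lambda (c)
                       (when (char=? c #\space)
                         (set! count (+ count 1))))
                     target)
    count))

(count-spaces "a b c")       ; ⇒ 2
(count-spaces "a b c" 2)     ; ⇒ 1
(count-spaces "a b c" 0 2)   ; ⇒ 1
```

When both optional arguments are omitted, opt-substring hands back str itself, so the common whole-string case involves no copying.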

Function: string-append string … ¶

[R7RS base] Returns a newly allocated string whose content is the concatenation of string ….

See also string-concatenate in String reverse & append.

Function: string->list string :optional start end ¶

Function: list->string list ¶

[R7RS base] Converts a string to a list of characters or vice versa.

You can give optional start/end indexes to string->list.

For list->string, every element of list must be a character, or an error is signaled. If you want to build a string out of a mixed list of strings and characters, you may want to use tree->string in text.tree - Lazy text construction.
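
A brief sketch of the two conversions follows. The procedure name string-reverse-via-list is hypothetical and only illustrates a round trip through a list.

```scheme
(string->list "abcde")        ; ⇒ (#\a #\b #\c #\d #\e)
(string->list "abcde" 1 3)    ; ⇒ (#\b #\c)
(list->string '(#\a #\b))     ; ⇒ "ab"

;; Hypothetical helper: reverses a string by going through a list.
(define (string-reverse-via-list s)
  (list->string (reverse (string->list s))))

(string-reverse-via-list "abc")  ; ⇒ "cba"
```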

Function: string-copy string :optional start end ¶

[R7RS base] Returns a copy of string. You can give start and/or end index to extract a part of the original string (it makes string-copy effectively a superset of substring).

If only start argument is given, a substring beginning from start-th character (inclusive) to the end of string is returned. If both start and end argument are given, a substring from start-th character (inclusive) to end-th character (exclusive) is returned. See substring above for the condition that start and end should satisfy.
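
The three calling patterns can be sketched as follows.

```scheme
(string-copy "abcdef")       ; ⇒ "abcdef"
(string-copy "abcdef" 2)     ; ⇒ "cdef"
(string-copy "abcdef" 2 4)   ; ⇒ "cd"

;; With both indexes given, the result matches substring.
(string=? (string-copy "abcdef" 2 4)
          (substring "abcdef" 2 4))   ; ⇒ #t
```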

Note: R7RS’s destructive version string-copy! is provided by srfi.13 module (see srfi.13 - String library).

Function: string-copy-immutable string :optional start end ¶

If string is immutable, return it as is. Otherwise, returns an immutable copy of string. It is a dual of string-copy which always returns a mutable copy.

The optional start and end argument may be a nonnegative integer character index and/or string cursors to restrict the range of string to be copied.

Function: string-fill! string char :optional start end ¶

[R7RS base] Fills string by char. Optional start and end limits the effective area.

(let ((s (string-copy "orange")))
  (string-fill! s #\X)
  s)
  ⇒ "XXXXXX"
(let ((s (string-copy "orange")))
  (string-fill! s #\X 2 4)
  s)
  ⇒ "orXXge"

See the notes in make-string about performance consideration.

Function: string-join strs :optional delim grammar ¶

[SRFI-13] Concatenate strings in the list strs, with a string delim as ‘glue’.

The argument grammar may be one of the following symbol to specify how the strings are concatenated.

infix

Use delim between each string. This mode is the default. Note that this mode introduces ambiguity when strs is an empty list or a list containing a single null string.

(string-join '("apple" "mango" "banana") ", ")
  ⇒ "apple, mango, banana"
(string-join '() ":")
  ⇒ ""
(string-join '("") ":")
  ⇒ ""

strict-infix

Works like infix, but an empty list is not allowed as strs, thus avoiding ambiguity.

prefix

Use delim before each string.

(string-join '("usr" "local" "bin") "/" 'prefix)
  ⇒ "/usr/local/bin"
(string-join '() "/" 'prefix)
  ⇒ ""
(string-join '("") "/" 'prefix)
  ⇒ "/"

suffix

Use delim after each string.

(string-join '("a" "b" "c") "&" 'suffix)
  ⇒ "a&b&c&"
(string-join '() "&" 'suffix)
  ⇒ ""
(string-join '("") "&" 'suffix)
  ⇒ "&"

Function: string-scan string item :optional return ¶

Function: string-scan-right string item :optional return ¶

Scans string for item (either a string or a character). While string-scan finds the leftmost match, string-scan-right finds the rightmost match.

The return argument specifies what value should be returned when item is found in string. It must be one of the following symbols.

index

Returns the index in string if item is found, or #f. This is the default behavior.

(string-scan "abracadabra" "ada") ⇒ 5
(string-scan "abracadabra" #\c) ⇒ 4
(string-scan "abracadabra" "aba") ⇒ #f

before

Returns a substring of string before item, or #f if item is not found.

(string-scan "abracadabra" "ada" 'before) ⇒ "abrac"
(string-scan "abracadabra" #\c 'before) ⇒ "abra"

after

Returns a substring of string after item, or #f if item is not found.

(string-scan "abracadabra" "ada" 'after) ⇒ "bra"
(string-scan "abracadabra" #\c 'after) ⇒ "adabra"

before*

Returns a substring of string before item, and the rest of string starting at item. If item is not found, returns (values #f #f).

(string-scan "abracadabra" "ada" 'before*)
  ⇒ "abrac" and "adabra"
(string-scan "abracadabra" #\c 'before*)
  ⇒ "abra" and "cadabra"

after*

Returns a substring of string up to the end of item, and the rest. If item is not found, returns (values #f #f).

(string-scan "abracadabra" "ada" 'after*)
  ⇒ "abracada" and "bra"
(string-scan "abracadabra" #\c 'after*)
  ⇒ "abrac" and "adabra"

both

Returns a substring of string before item and after item. If item is not found, returns (values #f #f).

(string-scan "abracadabra" "ada" 'both)
  ⇒ "abrac" and "bra"
(string-scan "abracadabra" #\c 'both)
  ⇒ "abra" and "adabra"
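
The modes that return two values are convenient with receive, as in the following sketch. The procedure names split-key-value and last-segment are hypothetical.

```scheme
;; Hypothetical helper: splits "key=value" at the leftmost "=" into a pair,
;; or returns #f when the line has no "=".
(define (split-key-value line)
  (receive (key value) (string-scan line "=" 'both)
    (and key (cons key value))))

;; Hypothetical helper: the part of path after the last "/",
;; or path itself when it contains no "/".
(define (last-segment path)
  (or (string-scan-right path "/" 'after) path))

(split-key-value "name=gauche")    ; ⇒ ("name" . "gauche")
(split-key-value "no-delimiter")   ; ⇒ #f
(last-segment "usr/local/bin")     ; ⇒ "bin"
(last-segment "bin")               ; ⇒ "bin"
```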

Function: string-split string splitter :optional grammar limit start end ¶

Function: string-split string splitter :optional limit start end ¶

[SRFI-152+] Splits string by splitter and returns a list of strings. splitter can be a character, a character set, a string, a regexp, or a procedure.

If splitter is a character or a string, it is used as a delimiter. Note that SRFI-152’s string-split only allows strings for splitter (it also interprets the first optional argument as a grammar; see below for the compatibility note.)

If splitter is a character set, any consecutive characters that are members of the character set are used as a delimiter.

If a procedure is given to splitter, it is called for each character in string, and the consecutive characters that caused splitter to return a true value are used as a delimiter.

(string-split "/aa/bb//cc" #\/)    ⇒ ("" "aa" "bb" "" "cc")
(string-split "/aa/bb//cc" "/")    ⇒ ("" "aa" "bb" "" "cc")
(string-split "/aa/bb//cc" "//")   ⇒ ("/aa/bb" "cc")
(string-split "/aa/bb//cc" #[/])   ⇒ ("" "aa" "bb" "cc")
(string-split "/aa/bb//cc" #/\/+/) ⇒ ("" "aa" "bb" "cc")
(string-split "/aa/bb//cc" #[\w])  ⇒ ("/" "/" "//" "")
(string-split "/aa/bb//cc" char-alphabetic?) ⇒ ("/" "/" "//" "")

;; some boundary cases
(string-split "abc" #\/) ⇒ ("abc")
(string-split ""    #\/) ⇒ ("")

The grammar argument is the same as string-join above; it must be one of symbols infix, strict-infix, prefix or suffix. When omitted, infix is assumed.

(string-split "/a/b/c/" "/" 'infix)  ⇒ ("" "a" "b" "c" "")
(string-split "/a/b/c/" "/" 'prefix) ⇒ ("a" "b" "c" "")
(string-split "/a/b/c/" "/" 'suffix) ⇒ ("" "a" "b" "c")

In general, the following relationship holds:

(string-join XS DELIM GRAMMAR) ⇒ S
(string-split S DELIM GRAMMAR) ⇒ XS
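
The relationship can be checked with a small round trip. This is only a sketch with sample data; the variable names are for illustration.

```scheme
(define parts '("usr" "local" "bin"))

;; Default (infix) grammar.
(define joined-infix (string-join parts "/"))            ; ⇒ "usr/local/bin"
(equal? (string-split joined-infix "/") parts)           ; ⇒ #t

;; Prefix grammar: delim comes before each string.
(define joined-prefix (string-join parts "/" 'prefix))   ; ⇒ "/usr/local/bin"
(equal? (string-split joined-prefix "/" 'prefix) parts)  ; ⇒ #t
```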

If limit is given and not #f, it must be a nonnegative integer and specifies the maximum number of matches to the splitter. Once the limit is reached, the rest of string is included in the result as is.

(string-split "a.b..c" "." 'infix 0)   ⇒ ("a.b..c")
(string-split "a.b..c" "." 'infix 1)   ⇒ ("a" "b..c")
(string-split "a.b..c" "." 'infix 2)   ⇒ ("a" "b" ".c")

Compatibility note: The grammar argument was added for consistency with SRFIs (SRFI-130, SRFI-152, see srfi.152 - String library (reduced)). However, for backward compatibility and convenience, string-split also accepts limit without the grammar argument; the two are distinguishable since grammar is a symbol and limit is an integer. For code that’s compatible with SRFI-152, use the first form that takes the grammar argument.

(string-split "a.b..c" "." 2)   ⇒ ("a" "b" ".c")

The start and end arguments limit the input string to the given range before splitting.

See also string-tokenize (see Other string operations).

Function: string-map proc str str2 … ¶

Function: string-map proc str :optional start end ¶

[R7RS base][SRFI-13] Applies proc over each character in the input string, and gathers the characters returned from proc into a string and returns it. It is an error if proc returns non-character.

Because of historical reasons, this procedure has two interfaces. The first one takes one or more input strings, and proc receives as many characters as the number of input strings, each character being taken from each string. Iteration stops on the shortest string. This is defined in R7RS-small, and consistent with map, vector-map, etc.

The second one takes only one string argument, and optional start/end arguments, which may be nonnegative integer indexes or string cursors to limit the input range of the string. This is defined in SRFI-13, string library.

The order in which proc is applied is not guaranteed to be left to right. You shouldn’t depend on the order.

If proc saves a continuation and it is invoked later, the result already returned from string-map won’t be affected (as specified in R7RS).

(string-map char-upcase "apple") ⇒ "APPLE"
(string-map (^[a b] (if (char>? a b) a b)) "orange" "apple") ⇒ "orpng"
(string-map char-upcase "pineapple" 0 4) ⇒ "PINE"

Function: string-for-each proc str str2 … ¶

Function: string-for-each proc str :optional start end ¶

[R7RS base][SRFI-13] Applies proc over each character in the input string in left-to-right order. The results of proc are discarded.

For historical reasons, this procedure has two interfaces, the first one defined in R7RS and the second one defined in SRFI-13. See string-map above for the explanation.
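
As a sketch of the R7RS interface with two input strings, the following procedure counts the positions at which two strings differ. The name hamming-distance is hypothetical, and the two strings are assumed to have the same length.

```scheme
;; Hypothetical helper: number of positions where s1 and s2 differ.
;; proc receives one character from each string at a time.
(define (hamming-distance s1 s2)
  (let ((count 0))
    (string-for-each (lambda (a b)
                       (unless (char=? a b)
                         (set! count (+ count 1))))
                     s1 s2)
    count))

(hamming-distance "karolin" "kathrin")  ; ⇒ 3
```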

6.10.10 Incomplete strings ¶

A string can be flagged as "incomplete" if it may contain byte sequences that do not constitute valid multibyte characters in Gauche’s native encoding.

Incomplete strings may be generated in several circumstances: reading binary data as a string, reading string data that has been ‘chopped’ in the middle of a multibyte character, or concatenating a string with other incomplete strings, for example.

Incomplete strings should be regarded as an exceptional case. They used to be a way to handle byte strings, but now we have u8vector (see Uniform vectors) for that purpose. In fact, we’re planning to remove them in future releases.

Just in case you happen to get an incomplete string, you can convert it to a complete string with string-incomplete->complete.
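
For new code that handles raw bytes, a u8vector is the suggested alternative. The following is a minimal sketch using conversion procedures from the gauche.uvector module; the variable name is only for illustration.

```scheme
(use gauche.uvector)

;; Convert an ASCII string to its bytes and back.
(define bytes (string->u8vector "abc"))   ; ⇒ #u8(97 98 99)
(u8vector-ref bytes 0)                    ; ⇒ 97
(u8vector->string (u8vector 104 105))     ; ⇒ "hi"
```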

Reader Syntax: #**"…" ¶

Denotes an incomplete string. The same escape sequences as the complete string syntax are recognized.

Rationale of the syntax: #* is used for bit vectors. Since an incomplete string is really a byte vector, it has similarity.

Note: We used #*"...." for an incomplete string on 0.9.9 and before. It turned out that it couldn’t coexist with bitvectors, for #* is a valid bitvector literal (zero-length vector), and " is a delimiter, so #*"...." can be parsed as a zero-length bitvector followed by a string. From 0.9.10, we changed the incomplete string literal to #**"...". It’s a bit lengthy, but incomplete strings are anomalies and shouldn’t be used often anyway.

For the backward compatibility, #*"..." is still read as an incomplete string literal, unless the reader lexical mode is strict-r7 (see Reader lexical mode, for the details). If the reader lexical mode is warn-legacy, it is read as an incomplete string, but a warning is issued. If the mode is strict-r7, it is read as a zero-length bitvector followed by a string.

In future releases, #*"..." will be warned about by default, and later we’ll gradually move to the strict-r7 behavior.

Function: string-incomplete->complete str :optional handling filler ¶

Reinterprets the content of an incomplete string str and returns a newly created complete string from it. The handling argument specifies how to handle the illegal byte sequences in str.

#f: If str contains an illegal byte sequence, gives up the conversion and returns #f. This is the default behavior.
:omit: Omit any illegal byte sequences.
:replace: Replace each byte in illegal byte sequences by a character given in filler argument, defaulted to ?.
:escape: Replace each byte in illegal byte sequences by a sequence of filler <hexdigit> <hexdigit>. Besides, the filler characters in the original string are replaced with filler filler.

If str is already a complete string, a copy of it is returned.

The procedure always returns a complete string, except when the handling argument is #f (default) and the input is an incomplete string, in which case #f is returned.

(string-incomplete->complete #**"_abc")
  ⇒ "_abc"     ; can be represented as a complete string

(string-incomplete->complete #**"_ab\x80;c")
  ⇒ #f        ; can't be represented as a complete string

(string-incomplete->complete #**"_ab\x80;c" :omit)
  ⇒ "_abc"     ; omit the illegal bytes

(string-incomplete->complete #**"_ab\x80;c" :replace #\_)
  ⇒ "_ab_c"    ; replace the illegal bytes

(string-incomplete->complete #**"_ab\x80;c" :escape #\_)
  ⇒ "__ab_80c" ; escape the illegal bytes and escape char itself
