text.csv
- CSV tables ¶Provides a function to parse/generate CSV (comma separated value) tables, including the format defined in RFC4180. You can customize the separator and quoter character to deal with variations of CSV formats.
CSV table is consisted by a series of records, separated by a newline. Each record contains number of fields, separated by a separator character (by default, a comma). A field can contain comma or newline if quoted, i.e. surrounded by double-quote characters. To include double-quote character in a quoted field, use two consecutive double-quote character. Usually, the whitespaces around the field are ignored.
Since use cases of CSV-like files vary, we provide layered API to be combined flexibly.
The bottom layer of API is to convert text into list of lists and vice versa.
{text.csv
}
Returns a procedure with one optional argument, an input port.
When the procedure is called, it reads one record from the port
(or, if omitted, from the current input port)
and returns a list of fields.
If input reaches EOF, it returns EOF.
{text.csv
}
Returns a procedure with two arguments, output port and
a list of fields. When the procedure is called, it
outputs a separator-separated fields with proper escapes,
to the output port. Each field value must be a string.
The separator argument can be a character or a string.
You can also specify the record delimiter
string by newline; for example, you can pass "\r\n"
to prepare a file to be read by Windows programs.
The output of field is quoted when it contains special characters—
which automatically includes characters
in separator, quote-char and newline argument,
plus the characters in the char-set given to special-char-set;
its default is #[;\s]
.
Occasionally, CSV files generated from spreadsheet contains superfluous rows/columns and we need to make sense of them. Here are some utilities to help them.
A typical format of such spreadsheet-generated CSV file has the following properties:
The main purpose of middle-level CSV parser is to take the output of low-level parser, which is a list of lists of strings, and find the header row, and then convert the subsequent record rows into tuples according to the header row. A tuple is just a list of strings, but ordered in the same way as the specified header spec.
{text.csv
}
Convert input rows (a list of lists of strings)
to a list of tuples.
A tuple is a list of slot values.
First, it looks for a header row that
matches the given header-spec. Once the header row is found,
parse the subsequent rows as record row according to the header
and convert them to tuples.
If no header is found, #f
is returned.
Header-specs is a list of header spec, each of which can be either a string, a regexp, or a predicate on a string. If it’s a string, a column that exactly matches the string is picked. If it’s a regexp, a column that matches the regexp is picked. And if it’s a predicate, as you might have already guessed, a column that satisfies the predicate is picked.
The order fo header-specs determines the order of columns of output tuples.
Required-slots determines if the input row is a valid record row or not. The structure of required-slots is as follows:
<required-slots> : (<spec> ...) <spec> : <header-spec> | (<header-spec> <predicate>)
The <header-spec>
compared to the elements of header-slot
(by equal?
) to figure out which columns to check. A single
<header-spec>
in <spec>
means that the column shouldn’t be
empty for a valid record row. If <spec> is a list of
<header-spec>
and <predicate>
, then the value of the
column corresponds to the <header-spec> is passed to <predicate>
to determine if it’s a valid record row.
If required-slots is omitted or an empty list, any row with at least one non-empty column to be included in the tuple.
If allow-gap? is #t
, it keeps reading rows until the end,
skipping invalid rows. If allow-gap? is #f
(default),
it stops reading once it sees an invalid row after headers.
Let’s see an example. Suppose we have the following CSV file as data.csv. It has extra rows and columns, as is often seen in spreadsheet-exported files.
,,,,,,,, "Exported data",,,,,,,, ,,,,,,,, ,,Year,Country,,Population,GDP,,Note ,,1958,"Land of Lisp",,39994,"551,435,453",, ,,1957,"United States of Formula Translators",,115333,"4,343,225,434",,Estimated ,,1959,"People's Republic of COBOL",,82524,"3,357,551,143",, ,,1970,"Kingdom of Pascal",,3785,,,"GDP missing" ,,,,,,,, ,,1962,"APL Republic",,1545,"342,335,151",,
You can extract tuples of Country, Year, GDP and Population, as follows:
(use text.csv) (use gauche.generator) (call-with-input-file "data.csv" (^p (csv-rows->tuples (generator->list (cute (make-csv-reader #\,) p)) '("Country" "Year" "GDP" "Population")))) ⇒ (("Land of Lisp" "1958" "551,435,453" "39994") ("United States of Formula Translators" "1957" "4,343,225,434" "115333") ("People's Republic of COBOL" "1959" "3,357,551,143" "82524") ("Kingdom of Pascal" "1970" "" "3785"))
Note that irrelevant rows are skipped, and columns in the results are ordered as specified in the header-specs.
Since there’s a gap (empty row) after the “Kingdom of Pascal” entry,
csv-rows->tuples
stops processing there by default. If you want
to include “APL Republic”, you have to pass :allow-gap? #t
to csv-rows->tuples
.
The next example gives :required-slots
option to eliminate
rows with missing some of Year, Country or GDP—thus “Kingdom of Pascal”
is omitted from the result, while “APL Republic” is included
because of :allow-gap?
argument.
(It also checks Year has exactly 4 digits.)
(call-with-input-file "data.csv" (^p (csv-rows->tuples (generator->list (cute (make-csv-reader #\,) p)) '("Country" "Year" "GDP" "Population") :required-slots '(("Year" #/^\d{4}$/) "Country" "GDP") :allow-gap? #t))) ⇒ (("Land of Lisp" "1958" "551,435,453" "39994") ("United States of Formula Translators" "1957" "4,343,225,434" "115333") ("People's Republic of COBOL" "1959" "3,357,551,143" "82524") ("APL Republic" "1962" "342,335,151" "1545"))
The following two procedures are ingredients of
csv-rows->tuples
:
{text.csv
}
Create a procedure that takes a row (a list of strings) and checks if
if it matches the criteria specified by header-specs.
(See csv-rows->tuples
above about header-specs.)
If the input satisfies the spec, it returns a permuter vector that maps
the tuple positions to the input column numbers.
Otherwise, it returns #f
.
The permuter vector is a vector of integers, where K-th element being I means the K-th item of the tuple should be taken from I-th column.
Let’s see the example. Suppose we know that the input contains the following row as the header row:
(define *input-row* '("" "" "Year" "Country" "" "Population" "GDP" "Notes"))
We want to detect that row, but we only needs Country, Year, GDP and Population columns, in that order. So we create a header parser as follows:
(define header-parser (make-csv-header-parser '("Country" "Year" "GDP" "Population")))
Applying this header parser to the input data returns the permuter vector:
(header-parser *input-row*) ⇒ #(3 2 6 5)
It means, the first item of tuple (Country) is in the 3rd column of the input, the second item of tuple (Year) is in the 2nd column of the input, and so on. This permuter vector can be used to parse record rows to get tuples.
{text.csv
}
Create a procedure that converts one input row into a tuple.
Permuter is the vector returned by make-csv-header-parser
.
See cvs-rows->tuples
above for header-slots and
required-slots arguments.