For Development HEAD DRAFTSearch (procedure/syntax/module):

12.64 text.csv - CSV tables

Module: text.csv

Provides a function to parse/generate CSV (comma separated value) tables, including the format defined in RFC4180. You can customize the separator and quoter character to deal with variations of CSV formats.

CSV table is consisted by a series of records, separated by a newline. Each record contains number of fields, separated by a separator character (by default, a comma). A field can contain comma or newline if quoted, i.e. surrounded by double-quote characters. To include double-quote character in a quoted field, use two consecutive double-quote character. Usually, the whitespaces around the field are ignored.

Since use cases of CSV-like files vary, we provide layered API to be combined flexibly.

Low-level API

The bottom layer of API is to convert text into list of lists and vice versa.

Function: make-csv-reader separator :optional (quote-char #\")

{text.csv} Returns a procedure with one optional argument, an input port. When the procedure is called, it reads one record from the port (or, if omitted, from the current input port) and returns a list of fields. If input reaches EOF, it returns EOF.

Function: make-csv-writer separator :optional newline (quote-char #\") special-char-set

{text.csv} Returns a procedure with two arguments, output port and a list of fields. When the procedure is called, it outputs a separator-separated fields with proper escapes, to the output port. Each field value must be a string. The separator argument can be a character or a string.

You can also specify the record delimiter string by newline; for example, you can pass "\r\n" to prepare a file to be read by Windows programs.

The output of field is quoted when it contains special characters— which automatically includes characters in separator, quote-char and newline argument, plus the characters in the char-set given to special-char-set; its default is #[;\s].

Middle-level API

Occasionally, CSV files generated from spreadsheet contains superfluous rows/columns and we need to make sense of them. Here are some utilities to help them.

A typical format of such spreadsheet-generated CSV file has the following properties:

  1. There’s a “header row” near the top; not necessarily the very first row, but certainly it comes before any real data. It signifies the meaning of each column of the data. There may be superfluous columns inserted just for cosmetics, and sometimes the order of columns are changed when the original spreadsheet is edited. So we need some flexibility to interpret the input data.
  2. “Record rows” follow the header row. It contains actual data. There may be superfluous rows inserted just for cosmetics. Also, it’s often the case that the end of data isn’t marked clearly (you find large number of rows of empty strings, for example).

The main purpose of middle-level CSV parser is to take the output of low-level parser, which is a list of lists of strings, and find the header row, and then convert the subsequent record rows into tuples according to the header row. A tuple is just a list of strings, but ordered in the same way as the specified header spec.

Function: csv-rows->tuples rows header-specs :key required-slots allow-gap?

{text.csv} Convert input rows (a list of lists of strings) to a list of tuples. A tuple is a list of slot values.

First, it looks for a header row that matches the given header-spec. Once the header row is found, parse the subsequent rows as record row according to the header and convert them to tuples. If no header is found, #f is returned.

Header-specs is a list of header spec, each of which can be either a string, a regexp, or a predicate on a string. If it’s a string, a column that exactly matches the string is picked. If it’s a regexp, a column that matches the regexp is picked. And if it’s a predicate, as you might have already guessed, a column that satisfies the predicate is picked.

The order fo header-specs determines the order of columns of output tuples.

Required-slots determines if the input row is a valid record row or not. The structure of required-slots is as follows:

   <required-slots> : (<spec> ...)
   <spec> : <header-spec> | (<header-spec> <predicate>)

The <header-spec> compared to the elements of header-slot (by equal?) to figure out which columns to check. A single <header-spec> in <spec> means that the column shouldn’t be empty for a valid record row. If <spec> is a list of <header-spec> and <predicate>, then the value of the column corresponds to the <header-spec> is passed to <predicate> to determine if it’s a valid record row.

If required-slots is omitted or an empty list, any row with at least one non-empty column to be included in the tuple.

If allow-gap? is #t, it keeps reading rows until the end, skipping invalid rows. If allow-gap? is #f (default), it stops reading once it sees an invalid row after headers.

Let’s see an example. Suppose we have the following CSV file as data.csv. It has extra rows and columns, as is often seen in spreadsheet-exported files.

,,,,,,,,
"Exported data",,,,,,,,
,,,,,,,,
,,Year,Country,,Population,GDP,,Note
,,1958,"Land of Lisp",,39994,"551,435,453",,
,,1957,"United States of Formula Translators",,115333,"4,343,225,434",,Estimated
,,1959,"People's Republic of COBOL",,82524,"3,357,551,143",,
,,1970,"Kingdom of Pascal",,3785,,,"GDP missing"
,,,,,,,,
,,1962,"APL Republic",,1545,"342,335,151",,

You can extract tuples of Country, Year, GDP and Population, as follows:

(use text.csv)
(use gauche.generator)

(call-with-input-file "data.csv"
  (^p (csv-rows->tuples
       (generator->list (cute (make-csv-reader #\,) p))
       '("Country" "Year" "GDP" "Population"))))
 ⇒
  (("Land of Lisp" "1958" "551,435,453" "39994")
   ("United States of Formula Translators" "1957" "4,343,225,434" "115333")
   ("People's Republic of COBOL" "1959" "3,357,551,143" "82524")
   ("Kingdom of Pascal" "1970" "" "3785"))

Note that irrelevant rows are skipped, and columns in the results are ordered as specified in the header-specs.

Since there’s a gap (empty row) after the “Kingdom of Pascal” entry, csv-rows->tuples stops processing there by default. If you want to include “APL Republic”, you have to pass :allow-gap? #t to csv-rows->tuples.

The next example gives :required-slots option to eliminate rows with missing some of Year, Country or GDP—thus “Kingdom of Pascal” is omitted from the result, while “APL Republic” is included because of :allow-gap? argument. (It also checks Year has exactly 4 digits.)

(call-with-input-file "data.csv"
  (^p (csv-rows->tuples
       (generator->list (cute (make-csv-reader #\,) p))
       '("Country" "Year" "GDP" "Population")
        :required-slots '(("Year" #/^\d{4}$/) "Country" "GDP")
        :allow-gap? #t)))
 ⇒
 (("Land of Lisp" "1958" "551,435,453" "39994")
  ("United States of Formula Translators" "1957" "4,343,225,434" "115333")
  ("People's Republic of COBOL" "1959" "3,357,551,143" "82524")
  ("APL Republic" "1962" "342,335,151" "1545"))

The following two procedures are ingredients of csv-rows->tuples:

Function: make-csv-header-parser header-specs

{text.csv} Create a procedure that takes a row (a list of strings) and checks if if it matches the criteria specified by header-specs. (See csv-rows->tuples above about header-specs.) If the input satisfies the spec, it returns a permuter vector that maps the tuple positions to the input column numbers. Otherwise, it returns #f.

The permuter vector is a vector of integers, where K-th element being I means the K-th item of the tuple should be taken from I-th column.

Let’s see the example. Suppose we know that the input contains the following row as the header row:

(define *input-row*
  '("" "" "Year" "Country" "" "Population" "GDP" "Notes"))

We want to detect that row, but we only needs Country, Year, GDP and Population columns, in that order. So we create a header parser as follows:

(define header-parser
  (make-csv-header-parser '("Country" "Year" "GDP" "Population")))

Applying this header parser to the input data returns the permuter vector:

(header-parser *input-row*)
 ⇒ #(3 2 6 5)

It means, the first item of tuple (Country) is in the 3rd column of the input, the second item of tuple (Year) is in the 2nd column of the input, and so on. This permuter vector can be used to parse record rows to get tuples.

Function: make-csv-record-parser header-slots permuter :optional required-slots

{text.csv} Create a procedure that converts one input row into a tuple.

Permuter is the vector returned by make-csv-header-parser.

See cvs-rows->tuples above for header-slots and required-slots arguments.



For Development HEAD DRAFTSearch (procedure/syntax/module):
DRAFT