RFC822 message parsing (Gauche Users’ Reference)

12.41 `rfc.822` - RFC822 message parsing

Module: rfc.822 ¶: Defines a set of functions that parse and construct the “Internet Message Format”, a text format used to exchange e-mails. The most recent specification can be found in RFC5322. The format was originally defined in RFC 822, and people still call it “RFC822 format”, hence the name of this module. In the following document, I also refer to the format as “RFC822 format”.

Parsing message headers

Function: rfc822-read-headers iport :key strict? reader ¶

{rfc.822} Reads RFC822 format message from an input port iport, until it reaches the end of the message header. The header fields are broken into a list of the following format:

((name body) …)

Name … are the field names, and body … are the corresponding field bodies, both as strings. Field names are converted to lower-case characters. Field bodies are not modified, except that folded lines are unfolded. The order of fields is preserved.

By default, the parser works permissively. If EOF is encountered while parsing the header, it is taken as the end of the message. If a line is neither a continuation (folded) line nor the start of a new header field, it is simply ignored. You can change this behavior by giving a true value to the keyword argument strict?; then the parser raises an error for such a malformed header.

The keyword argument reader takes a procedure that reads a line from iport. Its default is read-line, which should be enough for most cases.
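The following is a minimal sketch of the behavior described above, not an excerpt from the module's own tests. The message text and the names sample-message and headers are made up for illustration; the header is read from a string port, and one field is folded across two lines.

```scheme
(use rfc.822)

;; A made-up message: three header fields (one of them folded),
;; an empty line, then the body.
(define sample-message
  (string-append "From: foo@example.com\n"
                 "To: bar@example.com\n"
                 "Subject: a folded\n"
                 " subject line\n"
                 "\n"
                 "Body text\n"))

;; call-with-input-string opens a string port and passes it to
;; rfc822-read-headers, which stops at the end of the header.
(define headers
  (call-with-input-string sample-message rfc822-read-headers))

headers
 ;; ⇒ (("from" "foo@example.com") ("to" "bar@example.com")
 ;;    ("subject" "a folded subject line"))
```

Note that the field names come back in lower case, and that the folded Subject field is returned as a single string.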

Function: rfc822-header->list iport :key strict? reader ¶: {rfc.822} Deprecated. This is the old name of rfc822-read-headers, kept for backward compatibility. New code should use rfc822-read-headers instead.

Function: rfc822-header-ref header-list field-name :optional default ¶

{rfc.822} A utility procedure to get a specific field from the parsed header list, which is returned by rfc822-read-headers.

Field-name specifies the field name in a lowercase string. If the field with given name is in header-list, the procedure returns its value in a string. Otherwise, if default is given, it is returned, and if not, #f is returned.

If there is more than one header with the same field name, the value of the first one is returned. To get all values of multiple header fields, use rfc822-header-ref* below.

This procedure can actually be used not only for the result of rfc822-read-headers, but also for retrieving a value keyed by strings in any list-of-lists structure: ((name value option ...) ...). For example, the return value of parse-cookie-string can be passed to rfc822-header-ref (see rfc.cookie - HTTP cookie handling, for parse-cookie-string).

(rfc822-header-ref
  '(("from" "foo@example.com") ("to" "bar@example.com"))
  "from")
 ⇒ "foo@example.com"

;; If no entry matches, #f is returned by default
(rfc822-header-ref
  '(("from" "foo@example.com") ("to" "bar@example.com"))
  "reply-to")
 ⇒ #f

;; You can give the default value for no-match case
(rfc822-header-ref
  '(("from" "foo@example.com") ("to" "bar@example.com"))
  "reply-to" 'none)
 ⇒ none

;; By giving the default value, you can distinguish
;; the no-match case from an entry that actually has the value #f.
(rfc822-header-ref
  '(("from" "foo@example.com") ("reply-to" #f))
  "reply-to" 'none)
 ⇒ #f

Function: rfc822-header-ref* header-list field-name ¶: Like rfc822-header-ref, looks up header entries in header-list with the name field-name. However, this procedure returns the values of all matching headers in a list. If there are no matching headers, an empty list is returned.
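As an illustration of the difference between the two lookup procedures, here is a small sketch. The header list is made up, and the example assumes the matching values are returned in the order in which they appear in header-list.

```scheme
(use rfc.822)

;; A made-up header list with a repeated field.
(define headers
  '(("received" "from a.example.com")
    ("received" "from b.example.com")
    ("from" "foo@example.com")))

;; rfc822-header-ref returns only the first match ...
(rfc822-header-ref headers "received")
 ;; ⇒ "from a.example.com"

;; ... while rfc822-header-ref* collects every match.
(rfc822-header-ref* headers "received")
 ;; ⇒ ("from a.example.com" "from b.example.com")

;; No match gives an empty list rather than #f.
(rfc822-header-ref* headers "reply-to")
 ;; ⇒ ()
```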

Function: rfc822-header-put header-list field-name field-value ¶: Returns an rfc822 header list which is the same as header-list except that a header with field-name and field-value is added. Field-name is converted to lowercase letters. If header-list already contains headers with field-name, such headers are excluded from the output. The header-list won’t be modified.
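A short sketch of rfc822-header-put follows. The header data and the names headers and updated are made up. The position of the new entry within the returned list is not stated above, so the example only looks the value up instead of showing the whole list.

```scheme
(use rfc.822)

(define headers
  '(("from" "foo@example.com") ("subject" "old subject")))

;; The field name is lowercased, and the existing "subject" entry
;; does not appear in the returned list.
(define updated
  (rfc822-header-put headers "Subject" "new subject"))

(rfc822-header-ref* updated "subject")
 ;; ⇒ ("new subject")

;; The original list is left untouched.
(rfc822-header-ref headers "subject")
 ;; ⇒ "old subject"
```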

Basic field parsers

Several procedures are provided to parse "structured" header fields of RFC2822 messages. These procedures deal with the body of a header field, i.e. if the header field is "To: Wandering Schemer <schemer@example.com>", they parse "Wandering Schemer <schemer@example.com>".

Most of these procedures take an input port. Usually you first parse the entire header with rfc822-read-headers, obtain the body of a field with rfc822-header-ref, then open an input string port on the body and use these procedures to parse it.

The reason for this complexity is that you need different tokenization schemes depending on the type of the field. RFC2822 also allows comments to appear between tokens in most cases, so a simple-minded regexp won’t do the job: RFC2822 comments can be nested and can’t be represented by a regular grammar. So, this layer of procedures is designed to be flexible enough to handle various syntaxes. For the standard header types, high-level parsers are also provided; see "specific field parsers" below.
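The three steps in the paragraph above can be sketched as follows. The message text and the variable names are made up for illustration, and the tokens are read with rfc822-next-token (described later in this section) using its default tokenizer-specs.

```scheme
(use rfc.822)

;; Step 1: parse the whole header.
(define headers
  (call-with-input-string
      "To: Wandering Schemer <schemer@example.com>\n\n"
    rfc822-read-headers))

;; Step 2: pick the body of the field of interest.
(define to-body (rfc822-header-ref headers "to"))

;; Step 3: open a string port on the body and read tokens from it.
(define iport (open-input-string to-body))

(define first-token  (rfc822-next-token iport))  ; ⇒ "Wandering"
(define second-token (rfc822-next-token iport))  ; ⇒ "Schemer"
(define third-token  (rfc822-next-token iport))  ; ⇒ #\<
```

A character that matches none of the tokenizer specs, such as #\< here, is returned as a character rather than as a string.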

Function: rfc822-next-token iport :optional tokenizer-specs ¶

{rfc.822} A basic tokenizer. First it skips whitespace and/or comments (CFWS) from iport, if any. Then it reads one token according to tokenizer-specs. If iport reaches EOF before any token is read, EOF is returned.

Tokenizer-specs is a list of tokenizer specs, each of which is either a char-set or a cons of a char-set and a procedure.

After skipping CFWS, the procedure peeks a character at the head of iport, and checks it against the char-sets in tokenizer-specs one by one. If a char-set that contains the character is found, then a token is retrieved as follows: If the tokenizer spec is just a char-set, a sequence of characters that belong to the char-set constitutes a token. If it is a cons, the procedure is called with iport to read a token.

If the head character doesn’t match any char-sets, the character is taken from iport and returned.

The default tokenizer-specs is as follows:

(list (cons #["] rfc822-quoted-string)
      (cons *rfc822-atext-chars* rfc822-dot-atom))

Where rfc822-quoted-string and rfc822-dot-atom are tokenizer procedures described below, and *rfc822-atext-chars* is bound to a char-set of atext specified in RFC2822. This means that, by default, rfc822-next-token retrieves a token that is either a quoted-string or a dot-atom as specified in RFC2822.

Using tokenizer-specs, you can customize how the header field is parsed. For example, if you want to retrieve a token that is either (1) a word constructed by alphabetic characters, or (2) a quoted string, then you can call rfc822-next-token as follows:

(rfc822-next-token iport
   `(#[[:alpha:]] (#["] . ,rfc822-quoted-string)))

Function: rfc822-field->tokens field :optional tokenizer-specs ¶: {rfc.822} A convenience procedure. Creates an input string port for a field body field, and calls rfc822-next-token repeatedly on it until it consumes all input, then returns a list of tokens. Tokenizer-specs is passed to rfc822-next-token.
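To make the effect of tokenizer-specs concrete, here is a small sketch that tokenizes two made-up field bodies, first with the default specs and then with the custom specs shown earlier. The names default-tokens and custom-tokens are only for illustration.

```scheme
(use rfc.822)

;; Default specs: quoted-string or dot-atom. The comment, including
;; the nested one, is skipped as CFWS.
(define default-tokens
  (rfc822-field->tokens
   "Wandering Schemer (a (nested) comment) <schemer@example.com>"))

default-tokens
 ;; ⇒ ("Wandering" "Schemer" #\< "schemer" #\@ "example.com" #\>)

;; Custom specs: alphabetic words or quoted strings. Anything else
;; is returned one character at a time.
(define custom-tokens
  (rfc822-field->tokens
   "attachment; filename=\"my file.txt\""
   `(#[[:alpha:]] (#["] . ,rfc822-quoted-string))))

custom-tokens
 ;; ⇒ ("attachment" #\; "filename" #\= "my file.txt")
```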

Function: rfc822-skip-cfws iport ¶: {rfc.822} A utility procedure that consumes any comments and/or whitespace characters from iport, and returns the head character that is neither a whitespace nor a comment. The returned character remains in iport.
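A brief sketch of rfc822-skip-cfws on a made-up input string; the names iport, head and next are only for illustration. It shows that the returned character is peeked, not consumed.

```scheme
(use rfc.822)

(define iport (open-input-string "  (a comment)  value"))

;; Skips the whitespace and the comment, and returns the first
;; character that follows them.
(define head (rfc822-skip-cfws iport))   ; ⇒ #\v

;; The character is still in the port, so read-char sees it again.
(define next (read-char iport))          ; ⇒ #\v
```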

Constant: *rfc822-atext-chars* ¶: {rfc.822} Bound to a char-set of the characters that are valid constituents of an atom.

Constant: *rfc822-standard-tokenizers* ¶: {rfc.822} Bound to the default tokenizer-specs.

Function: rfc822-atom iport ¶
Function: rfc822-dot-atom iport ¶
Function: rfc822-quoted-string iport ¶: {rfc.822} Tokenizers for atom, dot-atom and quoted-string, respectively. The double-quotes and escaping backslashes within quoted-string are removed by rfc822-quoted-string.
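The three tokenizers can also be called directly. The following sketch uses made-up inputs, and assumes each port is positioned at the first character of the token, which is how rfc822-next-token hands the port over after peeking.

```scheme
(use rfc.822)

;; An atom stops at the first non-atext character, including ".".
(define atom-token
  (rfc822-atom (open-input-string "example.com")))
 ;; ⇒ "example"

;; A dot-atom also accepts "." between atext characters.
(define dot-atom-token
  (rfc822-dot-atom (open-input-string "example.com")))
 ;; ⇒ "example.com"

;; The surrounding double-quotes and the escaping backslashes are
;; removed; the input here is  "a \"quoted\" name" rest
(define quoted-token
  (rfc822-quoted-string
   (open-input-string "\"a \\\"quoted\\\" name\" rest")))
 ;; ⇒ "a \"quoted\" name"
```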

Specific field parsers

Function: rfc822-parse-date string ¶

{rfc.822} Takes an RFC822 type date string, and returns eight values:

year, month, day-of-month, hour, minutes, seconds, timezone, day-of-week.

Timezone is an offset from UT in minutes. Day-of-week is the day of the week counted from Sunday, and may be #f if that information is not available. Month is an integer between 1 and 12, inclusive. If the string is not parsable, all the values are #f.
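Since the result is returned as multiple values, receive is a convenient way to pick them up. The sketch below uses a made-up date string and only shows the six date and time values; the names parsed and unparsable are for illustration.

```scheme
(use rfc.822)

;; receive binds the eight returned values to names.
(define parsed
  (receive (year month day hour minutes seconds timezone day-of-week)
      (rfc822-parse-date "Tue, 4 Mar 2003 12:34:56 +0900")
    (list year month day hour minutes seconds)))

parsed
 ;; ⇒ (2003 3 4 12 34 56)

;; An unparsable string yields #f for every value.
(define unparsable
  (receive all-values (rfc822-parse-date "not a date")
    all-values))

unparsable
 ;; ⇒ (#f #f #f #f #f #f #f #f)
```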

Function: rfc822-date->date string ¶

{rfc.822} Parses an RFC822 type date string and returns an SRFI-19 <date> object (see Date). If string can’t be parsed, returns #f instead.

To construct an RFC822 date string from an SRFI-19 date, you can use date->rfc822-date below.
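A minimal sketch of rfc822-date->date with a made-up date string. It assumes the srfi-19 module is loaded for the <date> accessors; the names d and fields are only for illustration.

```scheme
(use rfc.822)
(use srfi-19)   ; for the <date> accessors

(define d (rfc822-date->date "Tue, 4 Mar 2003 12:34:56 +0900"))

(define fields
  (list (date-year d) (date-month d) (date-day d)
        (date-hour d) (date-minute d) (date-second d)))

fields
 ;; ⇒ (2003 3 4 12 34 56)

;; An unparsable string gives #f instead of a <date>.
(rfc822-date->date "not a date")
 ;; ⇒ #f
```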

Message constructors

Function: rfc822-write-headers headers :key output continue check ¶

{rfc.822} This is a sort of inverse function of rfc822-read-headers. It receives a list of header data, in which each entry has the form (<name> <body>), and writes them out in RFC822 header field format to the output port specified by the output keyword argument. The default output is the current output port.

By default, the procedure assumes headers contains all the header fields, and adds an empty line at the end of the output to indicate the end of the header. You can pass a true value to the continue keyword argument to prevent this, so that more headers can be added later.

I said “a sort of” above. That’s because this function doesn’t (and can’t) do the exact inverse. Specifically, the caller is responsible for line folding and for making sure each header line doesn’t exceed the “hard limit” defined by RFC2822 (998 octets). If the line length of header data exceeds that, the caller should insert a newline (\r\n) followed by one or more whitespace characters as needed. This procedure cannot do the line folding on behalf of the caller, because the places where line folding is possible depend on the semantics of each header field.

It is also the caller’s responsibility to make sure header field bodies contain only non-NUL US-ASCII characters. If you want to include characters outside of that range, you should convert them in the way allowed by the protocol, e.g. MIME. The rfc.mime module (see rfc.mime - MIME message handling) provides a convenience procedure mime-encode-text for such a purpose. Again, this procedure cannot do the encoding automatically, since the way the field should be encoded depends on the header field.
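The output and continue keyword arguments can be sketched as follows. The header data and the names headers, text and partial are made up. The output is collected in a string and read back with rfc822-read-headers, so the example does not depend on the exact bytes that are written.

```scheme
(use rfc.822)

(define headers
  '(("from" "foo@example.com") ("subject" "hello")))

;; Collect the output in a string instead of the current output port.
(define text
  (call-with-output-string
    (lambda (out) (rfc822-write-headers headers :output out))))

;; Reading the text back gives the original list.
(call-with-input-string text rfc822-read-headers)
 ;; ⇒ (("from" "foo@example.com") ("subject" "hello"))

;; With :continue #t the terminating empty line is not written,
;; so the second call appends to the same header.
(define partial
  (call-with-output-string
    (lambda (out)
      (rfc822-write-headers '(("from" "foo@example.com"))
                            :output out :continue #t)
      (rfc822-write-headers '(("subject" "hello")) :output out))))

(call-with-input-string partial rfc822-read-headers)
 ;; ⇒ (("from" "foo@example.com") ("subject" "hello"))
```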

What this procedure can do is to check and report such violations. By default, it runs several checks and signals an error if it finds any violations of RFC2822. You can control this checking behavior by the check keyword argument. It can take one of the following values:

:error: Default. Signals an error if a violation is found.
#f, :ignore: Doesn’t perform any check. Trust the caller.
procedure: When rfc822-write-headers finds a violation, the procedure is called with three arguments: the header field name, the header field body, and the type of violation explained below. The procedure may correct the problem and return two values, the corrected header field name and body. The returned values are checked again. If the procedure returns the header field name and body unchanged, an error is signaled in the same way as when :error is specified.

The third argument passed to the procedure given to the check argument is one of the following symbols. New symbols may be added in future versions for more checks.

incomplete-string: An incomplete string is passed.
bad-character: Header field contains NUL or characters outside of US-ASCII.
line-too-long: Line length exceeds the 998 octet limit.
stray-crlf: The string contains a CR and/or LF character that isn’t part of proper line folding.
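As a sketch of how a procedure given to the check argument might look, the following hypothetical handler fix-stray-crlf repairs one kind of violation and leaves the others alone. The handler name, the header data, and the choice to replace stray CR and LF characters with a space are assumptions made for this example; a real application would decide how to repair each field based on its semantics.

```scheme
(use rfc.822)

;; Hypothetical handler: replaces stray CR and LF characters with a
;; space. For any other violation it returns the name and body
;; unchanged, which makes rfc822-write-headers signal an error.
(define (fix-stray-crlf name body violation)
  (if (eq? violation 'stray-crlf)
    (values name
            (string-map (lambda (c)
                          (if (memv c '(#\return #\newline)) #\space c))
                        body))
    (values name body)))

;; The body contains a bare LF that is not part of line folding.
(define text
  (call-with-output-string
    (lambda (out)
      (rfc822-write-headers '(("subject" "line1\nline2"))
                            :output out :check fix-stray-crlf))))

(call-with-input-string text rfc822-read-headers)
 ;; ⇒ (("subject" "line1 line2"))
```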

Function: date->rfc822-date date ¶: {rfc.822} Takes an SRFI-19 <date> object (see Date) and returns a string of its RFC822 date representation. This is the reverse operation of rfc822-date->date.
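A round trip between the two date procedures can be sketched like this. The date is made up, and the example assumes the srfi-19 module is loaded for make-date and the <date> accessors; the names d, text, d2 and fields are only for illustration.

```scheme
(use rfc.822)
(use srfi-19)

;; make-date takes nanosecond, second, minute, hour, day, month,
;; year, and the zone offset in seconds (here +09:00).
(define d (make-date 0 56 34 12 4 3 2003 32400))

;; Format the date as an RFC822 date string.
(define text (date->rfc822-date d))

;; Parsing the string again recovers the same date and time fields.
(define d2 (rfc822-date->date text))

(define fields
  (list (date-year d2) (date-month d2) (date-day d2)
        (date-hour d2) (date-minute d2) (date-second d2)))

fields
 ;; ⇒ (2003 3 4 12 34 56)
```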
