MIME message handling (Gauche Users’ Reference)

12.51 `rfc.mime` - MIME message handling

Module: rfc.mime ¶

This module provides utility procedures to handle Multipurpose Internet Mail Extensions (MIME) messages, defined in RFC2045 thorough RFC2049. Provided APIs include procedures to parse or compose MIME-specific header fields, and parse or compose MIME-encoded message bodies.

This module mainly focuses on providing low-level building-block procedures, on top of which application-specific modules are to be built. For example, rfc.http uses this module to compose multipart/form-data message for the body of POST requests (see rfc.http - HTTP client).

This module is supposed to be used with rfc.822 module (see rfc.822 - RFC822 message parsing).

Utilities for header fields

A few utility procedures to parse and generate MIME-specific header fields.

Function: mime-parse-version field ¶

{rfc.mime} If field is a valid header field for MIME-Version, returns its major and minor versions in a list. Otherwise, returns #f. It is allowed to pass #f to field, so that you can directly pass the result of rfc822-header-ref to it. Given parsed header list by rfc822-read-headers, you can get mime version (currently, it should be (1 0)) by the following code.

(mime-parse-version (rfc822-header-ref headers "mime-version"))

Note: simple regexp such as #/\d+\.\d+/ doesn’t do this job, for field may contain comments between tokens.

Function: mime-parse-content-type field ¶

{rfc.mime} Parses the "content-type" header field, and returns a list such as:

(type subtype (attribute . value) …)

where type and subtype are MIME media type and subtype in a string, respectively

(mime-parse-content-type "text/html; charset=iso-2022-jp")
 ⇒ ("text" "html" ("charset" . "iso-2022-jp"))

If field is not a valid content-type field, #f is returned.

Function: mime-parse-content-disposition field ¶: {rfc.mime} Parses Content-disposition header field as specified in RFC2183. (mime-parse-content-disposition "attachment; filename=genome.jpeg;\ modification-date=\"Wed, 12 Feb 1997 16:29:51 -0500\";") ⇒ ("attachment" ("filename" . "genome.jpeg") ("modification-date" . "Wed, 12 Feb 1997 16:29:51 -0500"))

Function: mime-parse-parameters :optional iport ¶

Function: mime-compose-parameters params :optional oport :key start-column ¶

{rfc.mime} These are low-level utility procedures to parse and compose parameter part of header fields (as appeared in RFC2045 Section 5.1 etc).

Mime-parse-parameters reads the parameter part of the header body from an input port iport, and returns an assoc list of the parameter names and values. Conversely, mime-compose-parameters takes an assoc list of names and values, compose parameter part and emit it to oport. When omitted, the current input port and the current output port are used for iport and oport, respectively. You can pass #f to oport and mime-compose-parameters returns the result in a string instead of emitting it to a port.

(call-with-input-string
   "; name=foo; filename=\"foo/bar/baz\""
   mime-parse-parameters)
 ⇒ (("name" . "foo") ("filename" . "foo/bar/baz"))

(mime-compose-parameters
 '(("name" . "foo") ("filename" . "foo/bar/baz"))
 #f)
 ⇒ "; name=foo; filename=\"foo/bar/baz\""

Mime-compose-parameters tries to insert folding line breaks between parameters to avoid the header line becomes too long. You can pass the beginning column position of the parameter part via start-column argument.

We plan to make these procedures handle RFC2231’s parameter value extension transparently in future.

Function: mime-decode-word word ¶

{rfc.mime} Decodes RFC2047-encoded word. If word isn’t an encoded word, it is returned as is.

Note that this procedure decodes only if the entire word is an “encoded word” defined in RFC2047. If you are dealing with a field that may contain multiple encoded word and/or unencoded parts, use mime-decode-text below.

(mime-decode-word "=?iso-8859-1?q?this=20is=20some=20text?=")
 ⇒ "this is some text"

Function: mime-decode-text text ¶

{rfc.mime} Returns a string in which all encoded words contained within text are decoded. This procedure can deal with a header field body that may contain mixture of non-encoded and encoded parts, and/or multiple encoded parts. One of such header field is the Subject field of email.

(mime-decode-text "This is =?US-ASCII?q?some=20text?=")
 ⇒ "This is some text"

Care should be taken if you apply this procedure to a “structured” header field body (see RFC2822 section 2.2.2). The proper way of parsing a structured header field body is to tokenize it first, then to decode each word using mime-decode-word. since the decoded text may contain characters that affects the tokenization. (However, if you can just show the header field in human readable way for informational purposes, you may just use mime-decode-text on entire header field for the convenience).

Function: mime-encode-word word :key charset transfer-encoding ¶

{rfc.mime} Encodes word in the RFC2047 format. The keyword argument charset specifies the character encoding scheme in string or symbol. whose default is utf-8. If charset is other than utf-8 and word is a complete string, the procedure converts the character encoding to charset, then performs transfer encoding.

(mime-encode-word "this is some text")
 ⇒ "=?utf-8?B?dGhpcyBpcyBzb21lIHRleHQ=?="

The keyword argument transfer-encoding specifies how the octets are encoded to transfer-safe characters. You can give a symbol b, B or base64 for Base64, and Q, q, quoted-printable for Quoted-printable transfer encodings. An error is raised if you pass values other than those. The default is Base64 encoding.

This procedure does not consider the length of the resulting encoded word, which RFC2047 recommends to be less than 75 octets. Use mime-encode-text below to conform the line length limit.

(Note: In most Gauche procedures, a keyword argument encoding is used to specify character encodings. In this context we have two encodings, however, and to avoid the confusion we chose to use the terms “charset” and “transfer-encoding” that appear in RFC documents.)

Function: mime-encode-text text :key charset transfer-encoding line-width start-column force ¶

{rfc.mime} Encode text in RFC2047 format if necessary, and considering line folding if the result gets too long.

The keyword arguments charset and transfer-encoding are the same as mime-encode-word.

If the text only consists of printable ASCII characters, no encoding is done, and only line folding is considered. However, if a true value is given to the force argument, even ASCII-only text is encoded.

The line-width specifies the maximum line width of the result. Its default is 76. If the encoded word gets too long, it is splitted to multiple encoded words and CR LF SPC sequence (“folding white space” defined in RFC2822) are inserted inbetween. You can suppress this behavior by passing #f or 0 to line-width. Since encoded word needs some overhead characters, it doesn’t make much sense to specify small value to line-width. Current implementation rejects line-width smaller than 30.

The start-column keyword argument can be used to shorten the first of folded lines to make room for header field name. For example, if you want to encode the body of a Subject header field, you can pass the value of (string-length "Subject: ") so that the encoded result can directly concatenated after the header field name. The default value is 0.

This procedure is not designed to encode parts of structured header fields, which have further restrictions such as which parts can be encoded and where the folding white spaces can be inserted. The robust way is to encode some parts first, then construct a structured header fields, considering line folding.

Streaming parser

The streaming parser is designed so that you can decide how to do with the message body before the entire message is read.

Function: mime-parse-message port headers handler ¶

{rfc.mime} The fundamental streaming parser. Port is an input port from where the message is read. Headers is a list of headers parsed by rfc822-read-headers; that is, this procedure is supposed to be called after the header part of the message is parsed from port:

(let* ((headers (rfc822-read-headers port)))
  (if (mime-parse-version (rfc822-header-ref headers "mime-version"))
     ;; parse MIME message
     (mime-parse-message port headers handler)
     ;; retrieve a non-MIME body
     ...))

Mime-parse-message analyzes headers, and calls handler on each message body with two arguments:

(handler part-info xport)

Part-Info is a <mime-part> structure described below that encapsulates the information of this part of the message. Xport is an input port, initially points to the beginning of the body of message. The handler can read from the port as if it is reading from the original port. However, xport recognizes MIME boundary internally, and returns EOF when it reaches the end of the part. (Do not read from the original port directly, or it will mess up the internal state of vport).

Handler can read the part into the memory, or save it to the disk, or even discard the part. Whatever it does, it has to read from vport until it returns EOF.

The return value of handler will be set in the content slot of part-info. If the message has nested multipart messages, handler is called for each "leaf" part, in depth-first order. Handler can know its nesting level by examining part-info structure. The message doesn’t need to be a multipart type; if it is a MIME message type, handler is called on the body of enclosed message. If it is other media types such as text or application, handler is called on the (only) message body.

Class: <mime-part> ¶

{rfc.mime} A structure that encloses metainformation about a MIME part. It is constructed when the header of the part is read, and passed to the handler that reads the body of the part.

It has the following slots:

Instance Variable of <mime-part>: type ¶: MIME media type string. If content-type header is omitted to the part, an appropriate default value is set.

Instance Variable of <mime-part>: subtype ¶: MIME media subtype string. If content-type header is omitted to the part, an appropriate default value is set.

Instance Variable of <mime-part>: parameters ¶: Associative list of parameters given to content-type header field.

Instance Variable of <mime-part>: transfer-encoding ¶: The value of content-transfer-encoding header field. If the header field is omitted, an appropriate default value is set.

Instance Variable of <mime-part>: headers ¶: The list of header fields, as parsed by rfc822-read-headers.

Instance Variable of <mime-part>: parent ¶: If this is a part of multipart message or encapsulated message, points to the enclosing part’s <mime-part> structure. Otherwise #f.

Instance Variable of <mime-part>: index ¶: Sequence number of this part within the same parent.

Instance Variable of <mime-part>: content ¶: If this part is multipart/* or message/* media type, this slot contains a list of parts within it. Otherwise, the return value of handler is stored.

Instance Variable of <mime-part>: source ¶: This slot is only used when composing a MIME message. The caller can set this slot a name of the file to be inserted into this part, instead of setting the entire content of the file to the content slot. See mime-compose-message below for the more details.

Function: mime-retrieve-body part-info xport outp ¶

{rfc.mime} A procedure to retrieve message body. It is intended to to be a building block of handler to be passed to mime-parse-message.

Part-info is a <mime-part> object. Xport is an input port passed to the handler, from which the MIME part can be read. This procedure read from xport until it returns EOF. It also looks at the transfer-encoding of part-info, and decodes the body accordingly; that is, base64 encoding and quoted-printable encoding is handled. The result is written out to an output port outp.

This procedure does not handle charset conversion. The caller must use CES conversion port as outp (see gauche.charconv - Character Code Conversion) if desired.

A couple of convenience procedures are defined for typical cases on top of mime-retrieve-body.

Function: mime-body->string part-info xport ¶
Function: mime-body->file part-info xport filename ¶: {rfc.mime} Reads in the body of mime message, decoding transfer encoding, and returns it as a string or writes it to a file, respectively.

The simplest form of MIME message parser would be like this:

(let ((headers (rfc822-read-headers port)))
  (mime-parse-message port headers
                      (cut mime-body->string <> <>)))

This reads all the message on memory (i.e. the "leaf" <mime-part> objects’ content field would hold the part’s body as a string), and returns the top <mime-part> object. Content transfer encoding is recognized and handled, but character set conversion isn’t done.

You may want to feed the message body to a file directly, or even want to skip some body according to mime media types and/or other header information. Then you can put the logic in the handler closure. That’s the reason that this module provides building blocks, instead of all-in-one procedure.

Message composer

Function: mime-compose-message parts :optional port :key boundary ¶

Function: mime-compose-message-string parts :key boundary ¶

{rfc.mime} Composes a MIME multipart message. Mime-compose-message emits the result to an output port port, whose default is the current output port. Mime-compose-message-string makes the result into a string. You can give a boundary string via boundary argument; when omitted, a fresh boundary string is automatically generated by mime-make-boundary below.

Mime-compose-message returns the boundary string. Mime-compose-message-string returns two values, the result string and the boundary string.

The content of the message is provided by the parts argument, which can be a list of instances of <mime-part> (see above) or lists that describe parts. The list form is supported for the caller’s convenience, and internally it is converted to a list of <mime-part>s.

The syntax of each part element in parts are defined as follow.

<part>           : <mime-part> | <mime-part-desc>

<mime-part>      : an instance of the class <mime-part>

<mime-part-desc> : (<content-type> (<header> ...) <body>)
<content-type>   : (<type> <subtype> <header-param> ...)
<header-param>   : (<key> . <value>) ...
<header>         : (<header-name> <encoded-header-value>)
                 | (<header-name> (<header-value> <header-param> ...))
<body>           : a string
                 | (file <filename>)
                 | (subparts <part> ...)

Note: In the first form of <header>, <encoded-header-value> must already be encoded using RFC2047 or RFC2231 if the original value contains non-ascii characters. In the second form, we plan to do RFC2231 encoding on behalf of the caller; but the current version does not implement it. The caller should not pass encoded words in this form, since it may result double-encoding when we implement the auto encoding feature; for the time being, the second form restricts ASCII-only values.

If <body> is a string, it is used as the part’s content. If <body> is (file filename), the content is read from the named file. If <body> is (subparts part …), the part becomes nested MIME part.

It is the caller’s responsibility to match the content type and the content. For example, if <body> is in the third form, the part must have multipart content type.

The caller needs to provide proper content-transfer-encoding header, depending on the application. If none is given, the content is inserted into the message as is, which may be appropriate for some applications, but if you want to use the result in email message you certainly want to encode binary part with base64, for example.

Function: mime-make-boundary ¶: {rfc.mime} Returns a unique string that can be used as a boundary of a MIME multipart message.

12.51 rfc.mime - MIME message handling

Utilities for header fields

Streaming parser

Message composer

12.51 `rfc.mime` - MIME message handling