For Gauche 0.9.5


Next: , Previous: , Up: Library modules - Utilities   [Contents][Index]

12.41 rfc.uri - URI parsing and construction

Module: rfc.uri

Provides a set of procedures to parse and construct Uniform Resource Identifiers defined in RFC 2396 (RFC2396), as well as Data URI scheme defined in RFC2397.

First, lets review the structur of URI briefly. The following graph shows how the URI is constructed:

URI-+-scheme
    |
    +-specific--+--authority-+--userinfo
                |            +--host
                |            +--port
                +--path
                +--query
                +--fragment

Not all URIs have this full hierachy. For exmaple, mailto:admin@example.com has only scheme (mailto) and specific (admin@example.com) parts.

Most popular URI schemes, however, organize resources in a tree, so they adopt authority (which usually identifies the server) and the hierarchical path. In the URI http://example.com:8080/search?q=key#results, the authority part is exmaple.com:8080, the path is /search, the query is key and the fragment is results. The userinfo can be provided before hostname, such as anonymous in ftp://anonymous@example.com/pub/.

We have procedures that decompose a URI into those parts, and that compose a URI from those parts.

Parsing URI

Function: uri-ref uri parts

Extract specific part(s) from the given URI. You can fully decompose URI by the procedures described below, but in actual applications, you often need only some of the parts. This procedure comes handy for it.

The parts argument may be a symbol, or a list of symbols, to name the desired parts. The recognized symbos are as follows.

scheme

The scheme part, as string..

authority

The authority part, as string. If URI doesn’t have the part, #f.

userinfo

The userinfo part, as string. If URI doesn’t have the part, #f.

host

The host part, as string. If URI doesn’t have the part, #f.

port

The port part, as integer. If URI doesn’t have the part, #f.

path

The path part, as string. If URI isn’t hierarchical, this returns the specific part.

query

The query part, as string. If URI doesn’t have the part, #f.

fragment

The fragment part, as string. If URI doesn’t have the part, #f.

scheme+authority

The scheme and authority part.

host+port

The host and port part.

userinfo+host+port

The userinfo, host and port part.

path+query

The path and query part.

path+query+fragment

The path, query and fragment part.

(define uri "http://foo:bar@example.com:8080/search?q=word#results")

(uri-ref uri 'scheme)             ⇒ "http"
(uri-ref uri 'authority)          ⇒ "//foo:bar@example.com:8080/"
(uri-ref uri 'userinfo)           ⇒ "foo:bar"
(uri-ref uri 'host)               ⇒ "example.com"
(uri-ref uri 'port)               ⇒ 8080
(uri-ref uri 'path)               ⇒ "/search"
(uri-ref uri 'query)              ⇒ "q=word"
(uri-ref uri 'fragment)           ⇒ "results"
(uri-ref uri 'scheme+authority)   ⇒ "http://foo:bar@example.com:8080/"
(uri-ref uri 'host+port)          ⇒ "example.com:8080"
(uri-ref uri 'userinfo+host+port) ⇒ "foo:bar@example.com:8080"
(uri-ref uri 'path+query)         ⇒ "/search?q=word"
(uri-ref uri 'path+query+fragment)⇒ "/search?q=word#results"

You can extract multiple parts at once by specifying a list of parts. A list of parts is returned.

(uri-ref uri '(host+port path+query))
  ⇒ ("example.com:8080" "/search?q=word")
Function: uri-parse uri
Function: uri-scheme&specific uri
Function: uri-decompose-hierarchical specific
Function: uri-decompose-authority authority

General parser of URI. These functions does not decode URI encoding, since the parts to be decoded differ among the uri schemes. After parsing uri, use uri-decode below to decode them.

uri-parse is the most handy procedure. It breaks the uri into the following parts and returns them as multiple values. If the uri doesn’t have the corresponding parts, #f are returned for the parts.

The following procedures are finer grained and break up uris with different stages.

uri-scheme&specific takes a URI uri, and returns two values, its scheme part and its scheme-specific part. If uri doesn’t have a scheme part, #f is returned for it.

(uri-scheme&specific "mailto:sclaus@north.pole")
  ⇒ "mailto" and "sclaus@north.pole"
(uri-scheme&specific "/icons/new.gif")
  ⇒ #f and "/icons/new.gif"

If the URI scheme uses hierarchical notation, i.e. “//authority/path?query#fragment”, you can pass the scheme-specific part to uri-decompose-hierarchical and it returns four values, authority, path, query and fragment.

(uri-decompose-hierarchical "//www.foo.com/about/company.html")
  ⇒ "www.foo.com", "/about/company.html", #f and #f
(uri-decompose-hierarchical "//zzz.org/search?key=%3fhelp")
  ⇒ "zzz.org", "/search", "key=%3fhelp" and #f
(uri-decompose-hierarchical "//jjj.jp/index.html#whatsnew")
  ⇒ "jjj.jp", "/index.html", #f and "whatsnew"
(uri-decompose-hierarchical "my@address")
  ⇒ #f, #f, #f and #f

Furthermore, you can parse authority part of the hierarchical URI by uri-decompose-authority. It returns userinfo, host and port.

(uri-decompose-authority "yyy.jp:8080")
  ⇒ #f, "yyy.jp" and "8080"
(uri-decompose-authority "[::1]:8080")  ;(IPv6 host address)
  ⇒ #f, "::1" and "8080"
(uri-decompose-authority "mylogin@yyy.jp")
  ⇒ "mylogin", "yyy.jp" and #f
Function: uri-decompose-data uri

Parse a Data URI string uri. You can either pass the entire uri including data: scheme part, or just the specific part. If the passed uri is invalid as a data uri, an error is signalled.

Returns two values: parsed content type and the decoded data. The data is a string if the content type is text/*, and a u8vector otherwise.

The content-type is parsed by mime-parse-content-type (see MIME message handling). The result format is a list as follows:

(type subtype (attribute . value) …).

Here are a couple of examples:

(uri-decompose-data
 "data:text/plain;charset=utf-8;base64,KGhlbGxvIHdvcmxkKQ==")
  ⇒ ("text" "plain" ("charset" . "utf-8")) and "(hello world)"

(uri-decompose-data
 "data:application/octet-stream;base64,AAECAw==")
  ⇒ ("application" "octet-stream") and #u8(0 1 2 3)

Constructing URI

Function: uri-compose :key scheme userinfo host port authority path path* query fragment specific

Compose a URI from given components. There can be various combinations of components to create a valid URI—the following diagram shows the possible ’paths’ of combinations:

        /-----------------specific-------------------\
        |                                            |
 scheme-+------authority-----+-+-------path*---------+-
        |                    | |                     |
        \-userinfo-host-port-/ \-path-query-fragment-/

If #f is given to a keyword argument, it is equivalent to the absence of that keyword argument. It is particularly useful to pass the results of parsed uri.

If a component contains a character that is not appropriate for that component, it must be properly escaped before being passed to url-compose.

Some examples:

(uri-compose :scheme "http" :host "foo.com" :port 80
             :path "/index.html" :fragment "top")
  ⇒ "http://foo.com:80/index.html#top"

(uri-compose :scheme "http" :host "foo.net"
             :path* "/cgi-bin/query.cgi?keyword=foo")
  ⇒ "http://foo.net/cgi-bin/query.cgi?keyword=foo"

(uri-compose :scheme "mailto" :specific "a@foo.org")
  ⇒ "mailto:a@foo.org"

(receive (authority path query fragment)
   (uri-decompose-hierarchical "//foo.jp/index.html#whatsnew")
 (uri-compose :authority authority :path path
              :query query :fragment fragment))
  ⇒ "//foo.jp/index.html#whatsnew"
Function: uri-merge base-uri relative-uri relative-uri2 …

Arguments are strings representing full or part of URIs. This procedure resolves relative-uri in relative to base-uri, as defined in RFC3986 Section 5.2. “Relative Resolution”.

If more realtive-uri2s are given, first relative-uri is merged to base-uri, then the next argument is merged to the resulting uri, and so on.

(uri-merge "http://example.com/foo/index.html" "a/b/c")
 ⇒ "http://example.com/foo/a/b/c"

(uri-merge "http://example.com/foo/search?q=abc" "../about#me")
 ⇒ "http://example.com/about#me"

(uri-merge "http://example.com/foo" "http://example.net/bar")
 ⇒ "http://example.net/bar"

(uri-merge "http://example.com/foo/" "q" "?xyz")
 ⇒ "http://example.com/foo/q?xyz"
Function: uri-compose-data data :key content-type encoding

Creates a Data URI of the given data, with specified content-type and transfer encoding. Returns a string.

The data argument must be a string or a u8vector.

The content-type argument can be #f (default), a string that represents a content type (e.g. "text/plain;charset=utf-8"), or a list form of parsed content type (e.g. ("application" "octet-stream"). If it is #f, text/plain with the gauche’s native character encoding is used when data is a complete string, and application/octet-stream is used otherwise.

The encoding argument can be either #f (default), or a symbol uri or base64. This is for transfer encoding, not character encoding. If it is #f, URI encoding is used for text data and base64 encoding is used for binary data.

(uri-compose-data "(hello world)")
 ⇒ "data:text/plain;charset=utf-8,%28hello%20world%29"

(uri-compose-data "(hello world)" :encoding 'base64)
 ⇒ "data:text/plain;charset=utf-8;base64,KGhlbGxvIHdvcmxkKQ=="

(uri-compose-data '#u8(0 1 2 3))
 ⇒ "data:application/octet-stream;base64,AAECAw=="

URI Encoding and decoding

Function: uri-decode :key :cgi-decode
Function: uri-decode-string string :key :cgi-decode :encoding

Decodes “URI encoding”, i.e. %-escapes. uri-decode takes input from the current input port, and writes decoded result to the current output port. uri-decode-string takes input from string and returns decoded string.

If cgi-decode is true, also replaces + to a space character.

To uri-decode-string you can provide the external character encoding by the encoding keyword argument. When it is given, the decoded octet sequence is assumed to be in the specified encoding and converted to the Gauche’s internal character encoding.

Function: uri-encode :key :noescape
Function: uri-encode-string string :key :noescape :encoding

Encodes unsafe characters by %-escape. uri-encode takes input from the current input port and writes the result to the current output port. uri-encode-string takes input from string and returns the encoded string.

By default, characters that are not specified “unreserved” in RFC3986 are escaped. You can pass different character set to noescape argument to keep from being encoded. For example, the older RFC2396 has several more “unreserved” characters, and passing *rfc2396-unreserved-char-set* (see below) prevents those characters from being escaped.

The multibyte characters are encoded as the octet stream of Gauche’s native multibyte representation by default. However, you can pass the encoding keyword argument to uri-encode-string, to convert string to the specified character encoding.

Constant: *rfc2396-unreserved-char-set*
Constant: *rfc3986-unreserved-char-set*

These constants are bound to character sets that represents “unreserved” characters defined in RFC2396 and RFC3986, respectively. (See Character set, and Character-set library, for operations on character sets).


Next: , Previous: , Up: Library modules - Utilities   [Contents][Index]