rfc.uri - URI parsing and construction
Provides a set of procedures to parse and construct Uniform Resource Identifiers as defined in RFC 2396 (https://www.ietf.org/rfc/rfc2396.txt), as well as the Data URI scheme defined in RFC 2397.
First, let's briefly review the structure of a URI. The following diagram shows how a URI is constructed:
URI-+-scheme
    |
    +-specific--+--authority-+--userinfo
                |            +--host
                |            +--port
                +--path
                +--query
                +--fragment
Not all URIs have this full hierarchy. For example, mailto:admin@example.com has only the scheme (mailto) and specific (admin@example.com) parts.
Most popular URI schemes, however, organize resources in a tree, so they adopt an authority (which usually identifies the server) and a hierarchical path. In the URI http://example.com:8080/search?q=key#results, the authority part is example.com:8080, the path is /search, the query is q=key and the fragment is results.
The userinfo can be provided before the hostname, such as anonymous in ftp://anonymous@example.com/pub/.
We have procedures that decompose a URI into those parts, and that compose a URI from those parts.
Function: uri-ref uri parts
{rfc.uri}
Extract specific part(s) from the given URI. This module offers a set of procedures to fully decompose a URI, but in actual applications you often need only some of the parts. This procedure comes in handy for that.
The parts argument may be a symbol, or a list of symbols, naming the desired parts. The recognized symbols are as follows.
scheme
The scheme part, as a string.
authority
The authority part, as a string. If the URI doesn't have the part, #f.
userinfo
The userinfo part, as a string. If the URI doesn't have the part, #f.
host
The host part, as a string. If the URI doesn't have the part, #f.
port
The port part, as an integer. If the URI doesn't have the part, #f.
path
The path part, as a string. If the URI isn't hierarchical, this returns the specific part.
query
The query part, as a string. If the URI doesn't have the part, #f.
fragment
The fragment part, as a string. If the URI doesn't have the part, #f.
scheme+authority
The scheme and authority parts, as a string.
host+port
The host and port parts, as a string.
userinfo+host+port
The userinfo, host and port parts, as a string.
path+query
The path and query parts.
path+query+fragment
The path, query and fragment parts.
(define uri "http://foo:bar@example.com:8080/search?q=word#results")

(uri-ref uri 'scheme)              ⇒ "http"
(uri-ref uri 'authority)           ⇒ "//foo:bar@example.com:8080/"
(uri-ref uri 'userinfo)            ⇒ "foo:bar"
(uri-ref uri 'host)                ⇒ "example.com"
(uri-ref uri 'port)                ⇒ 8080
(uri-ref uri 'path)                ⇒ "/search"
(uri-ref uri 'query)               ⇒ "q=word"
(uri-ref uri 'fragment)            ⇒ "results"
(uri-ref uri 'scheme+authority)    ⇒ "http://foo:bar@example.com:8080/"
(uri-ref uri 'host+port)           ⇒ "example.com:8080"
(uri-ref uri 'userinfo+host+port)  ⇒ "foo:bar@example.com:8080"
(uri-ref uri 'path+query)          ⇒ "/search?q=word"
(uri-ref uri 'path+query+fragment) ⇒ "/search?q=word#results"
You can extract multiple parts at once by specifying a list of parts. A list of parts is returned.
(uri-ref uri '(host+port path+query)) ⇒ ("example.com:8080" "/search?q=word")
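As an illustration of using a single extracted part in application code, here is a minimal sketch that compares the scheme+authority part of two URIs. The helper name same-origin? is hypothetical and not part of rfc.uri, and the comparison is a plain string comparison, so it treats hosts that differ only in letter case as different.

```scheme
(use rfc.uri)

;; Hypothetical helper (not part of rfc.uri): two URIs are considered to
;; share an origin when their scheme+authority parts are equal strings.
(define (same-origin? uri-a uri-b)
  (equal? (uri-ref uri-a 'scheme+authority)
          (uri-ref uri-b 'scheme+authority)))

;; Same scheme, host and port: true.
(same-origin? "http://example.com:8080/search?q=word"
              "http://example.com:8080/about#me")

;; Different host: false.
(same-origin? "http://example.com:8080/search?q=word"
              "http://example.net/about#me")
```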
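Function: uri-parse uri
Function: uri-scheme&specific uri
Function: uri-decompose-hierarchical specific
Function: uri-decompose-authority authority
{rfc.uri}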
General parser of URI. These functions do not decode URI encoding, since the parts to be decoded differ among the URI schemes. After parsing a URI, use uri-decode below to decode them.
uri-parse is the handiest procedure. It breaks the URI into the following parts and returns them as multiple values. If the URI doesn't have the corresponding part, #f is returned for that part.
The scheme, as a string (e.g. "mailto" in "mailto:foo@example.com").
The userinfo in the authority part (e.g. "anonymous" in ftp://anonymous@ftp.example.com/pub/foo).
The host in the authority part (e.g. "ftp.example.com" in ftp://anonymous@ftp.example.com/pub/foo).
The port number in the authority part, as an integer (e.g. 8080 in http://www.example.com:8080/).
The path (e.g. "/index.html" in http://www.example.com/index.html).
The query (e.g. "key=xyz&lang=en" in http://www.example.com/search?key=xyz&lang=en).
The fragment (e.g. "section4" in http://www.example.com/document.html#section4).
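Since uri-parse returns multiple values, the usual way to consume them is receive. The following is a small sketch, assuming the values come back in the order scheme, userinfo, host, port, path, query, fragment; the example URI is made up for illustration.

```scheme
(use rfc.uri)

;; Bind all seven values at once, then keep the ones we care about.
(define (parse-to-list uri)
  (receive (scheme userinfo host port path query fragment) (uri-parse uri)
    (list scheme userinfo host port path query fragment)))

(parse-to-list "http://www.example.com:8080/search?key=xyz&lang=en#section4")
;; expected, following the descriptions above:
;; ("http" #f "www.example.com" 8080 "/search" "key=xyz&lang=en" "section4")
```

Note that the missing userinfo shows up as #f, so callers should test each part before using it as a string.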
The following procedures are finer grained and break up URIs in stages.
uri-scheme&specific takes a URI uri, and returns two values: its scheme part and its scheme-specific part. If uri doesn't have a scheme part, #f is returned for it.
(uri-scheme&specific "mailto:sclaus@north.pole")
  ⇒ "mailto" and "sclaus@north.pole"
(uri-scheme&specific "/icons/new.gif")
  ⇒ #f and "/icons/new.gif"
If the URI scheme uses hierarchical notation, i.e. “//authority/path?query#fragment”, you can pass the scheme-specific part to uri-decompose-hierarchical and it returns four values: authority, path, query and fragment.
(uri-decompose-hierarchical "//www.foo.com/about/company.html")
  ⇒ "www.foo.com", "/about/company.html", #f and #f
(uri-decompose-hierarchical "//zzz.org/search?key=%3fhelp")
  ⇒ "zzz.org", "/search", "key=%3fhelp" and #f
(uri-decompose-hierarchical "//jjj.jp/index.html#whatsnew")
  ⇒ "jjj.jp", "/index.html", #f and "whatsnew"
(uri-decompose-hierarchical "my@address")
  ⇒ #f, #f, #f and #f
Furthermore, you can parse the authority part of a hierarchical URI with uri-decompose-authority. It returns userinfo, host and port.
(uri-decompose-authority "yyy.jp:8080")
  ⇒ #f, "yyy.jp" and "8080"
(uri-decompose-authority "[::1]:8080") ; (IPv6 host address)
  ⇒ #f, "::1" and "8080"
(uri-decompose-authority "mylogin@yyy.jp")
  ⇒ "mylogin", "yyy.jp" and #f
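The three staged procedures can be chained by feeding each result into the next. The sketch below does that with nested receive forms; the helper name decompose-in-stages and the example URI are hypothetical. Note that, per the examples above, uri-decompose-authority returns the port as a string.

```scheme
(use rfc.uri)

;; Hypothetical helper: run the three stages in sequence and collect
;; the pieces into one list.
(define (decompose-in-stages uri)
  (receive (scheme specific) (uri-scheme&specific uri)
    (receive (authority path query fragment)
        (uri-decompose-hierarchical specific)
      (receive (userinfo host port)
          ;; Non-hierarchical URIs have no authority to decompose.
          (if authority
            (uri-decompose-authority authority)
            (values #f #f #f))
        (list scheme userinfo host port path query fragment)))))

(decompose-in-stages "http://mylogin@yyy.jp:8080/index.html#whatsnew")
;; expected, following the examples above:
;; ("http" "mylogin" "yyy.jp" "8080" "/index.html" #f "whatsnew")
```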
Function: uri-decompose-query string :optional separators
{rfc.uri}
Decompose a query string such as "foo=abc&bar" into a list of parameters (("foo" "abc") ("bar" #t)), where each parameter is represented by a list of its name (string) and value (string or #t).
If you're writing a CGI script, cgi-parse-parameters in www.cgi is handier, for it integrates handling of query strings, form parameters, and cookies on top of this procedure (see www.cgi - CGI utility).
(uri-decompose-query "a=b&a=c") ⇒ (("a" "b") ("a" "c"))
A parameter that appears without a value gets #t as its value.
(uri-decompose-query "a&b") ⇒ (("a" #t) ("b" #t))
The optional separators argument takes a char-set object used to separate parameters. The default is #[&;], since historically both & and ; have been used. However, some applications only allow &.
See also uri-compose-query below, for the inverse of this procedure.
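Because the result is a list of (name value) lists, it can be searched with assoc. Here is a minimal sketch of a lookup helper; query-param is a hypothetical name, not something rfc.uri provides.

```scheme
(use rfc.uri)

;; Hypothetical helper: return the first value bound to NAME, or
;; FALLBACK when the parameter is absent.
(define (query-param params name :optional (fallback #f))
  (cond [(assoc name params) => cadr]
        [else fallback]))

(define params (uri-decompose-query "foo=abc&bar"))

(query-param params "foo")        ; "abc"
(query-param params "bar")        ; #t, the parameter has no value
(query-param params "baz" "none") ; "none", the parameter is absent
```

Only the first occurrence of a repeated name is returned; walk the list directly if all values are needed.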
Function: uri-decompose-data uri
{rfc.uri}
Parse a Data URI. You can either pass the entire URI including the data: scheme part, or just the specific part. If the passed uri is invalid as a Data URI, an error is signalled.
Returns two values: the parsed content type and the decoded data. The data is a string if the content type is text/*, and a u8vector otherwise.
The content type is parsed by mime-parse-content-type (see rfc.mime - MIME message handling). The result is a list of the form (type subtype (attribute . value) ...).
Here are a couple of examples:
(uri-decompose-data "data:text/plain;charset=utf-8;base64,KGhlbGxvIHdvcmxkKQ==")
  ⇒ ("text" "plain" ("charset" . "utf-8")) and "(hello world)"
(uri-decompose-data "application/octet-stream;base64,AAECAw==")
  ⇒ ("application" "octet-stream") and #u8(0 1 2 3)
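Since the decoded data may be either a string or a u8vector, callers usually need to branch on its type. The following sketch summarizes a Data URI as its type, subtype and payload length; describe-data-uri is a hypothetical helper name.

```scheme
(use rfc.uri)
(use gauche.uvector)

;; Hypothetical helper: return (type subtype length), where length is
;; counted in characters for text data and in octets for binary data.
(define (describe-data-uri uri)
  (receive (content-type data) (uri-decompose-data uri)
    (list (car content-type)
          (cadr content-type)
          (if (string? data)
            (string-length data)
            (u8vector-length data)))))

(describe-data-uri "data:text/plain;charset=utf-8;base64,KGhlbGxvIHdvcmxkKQ==")
;; ("text" "plain" 13)
(describe-data-uri "application/octet-stream;base64,AAECAw==")
;; ("application" "octet-stream" 4)
```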
Function: uri-compose :key scheme userinfo host port authority path path* query fragment specific
{rfc.uri}
Compose a URI from given components.
There can be various combinations of components to create a valid
URI—the following diagram shows the possible ’paths’ of
combinations:
        /-----------------specific-------------------\
        |                                            |
 scheme-+------authority-----+-+-------path*---------+-
        |                    | |                     |
        \-userinfo-host-port-/ \-path-query-fragment-/
If #f is given to a keyword argument, it is equivalent to the absence of that keyword argument. This is particularly useful for passing along the results of a parsed URI.
If a component contains a character that is not appropriate for that component, it must be properly escaped before being passed to uri-compose.
Some examples:
(uri-compose :scheme "http" :host "foo.com" :port 80
             :path "/index.html" :fragment "top")
  ⇒ "http://foo.com:80/index.html#top"

(uri-compose :scheme "http" :host "foo.net"
             :path* "/cgi-bin/query.cgi?keyword=foo")
  ⇒ "http://foo.net/cgi-bin/query.cgi?keyword=foo"

(uri-compose :scheme "mailto" :specific "a@foo.org")
  ⇒ "mailto:a@foo.org"

(receive (authority path query fragment)
    (uri-decompose-hierarchical "//foo.jp/index.html#whatsnew")
  (uri-compose :authority authority :path path
               :query query :fragment fragment))
  ⇒ "//foo.jp/index.html#whatsnew"
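Because #f is treated like an absent keyword, the values returned by uri-parse can be handed straight back to uri-compose. The sketch below uses that to rebuild a URI with a different host; uri-with-host is a hypothetical helper name, and the example assumes none of the components need re-escaping.

```scheme
(use rfc.uri)

;; Hypothetical helper: rebuild URI with NEW-HOST, keeping every other
;; component exactly as parsed. Parts missing from URI come back as #f
;; from uri-parse and are therefore omitted by uri-compose.
(define (uri-with-host uri new-host)
  (receive (scheme userinfo host port path query fragment) (uri-parse uri)
    (uri-compose :scheme scheme :userinfo userinfo :host new-host
                 :port port :path path :query query :fragment fragment)))

(uri-with-host "http://foo.com:80/index.html#top" "foo.net")
;; expected: "http://foo.net:80/index.html#top"
```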
Function: uri-merge base-uri relative-uri relative-uri2 …
{rfc.uri}
Arguments are strings representing full or partial URIs. This procedure resolves relative-uri relative to base-uri, as defined in RFC 3986 Section 5.2, “Relative Resolution”.
If more arguments (relative-uri2 and so on) are given, relative-uri is first merged to base-uri, then the next argument is merged to the resulting URI, and so on.
(uri-merge "http://example.com/foo/index.html" "a/b/c")
  ⇒ "http://example.com/foo/a/b/c"
(uri-merge "http://example.com/foo/search?q=abc" "../about#me")
  ⇒ "http://example.com/about#me"
(uri-merge "http://example.com/foo" "http://example.net/bar")
  ⇒ "http://example.net/bar"
(uri-merge "http://example.com/foo/" "q" "?xyz")
  ⇒ "http://example.com/foo/q?xyz"
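A common use is resolving a batch of links found in a document against the document's own URI. Here is a minimal sketch, assuming the links have already been extracted as strings; the variable names are illustrative only.

```scheme
(use rfc.uri)

;; The URI of the document the links were found in.
(define base "http://example.com/foo/index.html")

;; Resolve each link against BASE. Absolute links pass through unchanged.
(define (resolve-links base links)
  (map (cut uri-merge base <>) links))

(resolve-links base '("a/b/c" "http://example.net/bar"))
;; ("http://example.com/foo/a/b/c" "http://example.net/bar")
```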
Function: uri-compose-query params :optional encoding
{rfc.uri}
The argument params is a list of parameter specs. Each parameter spec must be in the form of (name value), where name is a string and value is either a string or #t (see uri-decompose-query above).
Each parameter's name and value are urlencoded and concatenated into a URL query string. If a parameter's value is #t, the output includes only the parameter's name, without a value.
(uri-compose-query '(("foo" "abc") ("bar" #t))) ⇒ "foo=abc&bar"
The optional encoding argument specifies the character encoding of the output. The default is utf-8. If it is anything else, the strings are converted to the specified encoding before urlencoding.
A higher-level utility, http-compose-query in rfc.http, is built on top of this (see Http client utilities).
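The composed query string can be passed to uri-compose through its query keyword to build a complete URI. The sketch below does so with made-up host and parameter names.

```scheme
(use rfc.uri)

;; Build the query string from parameter specs. A value of #t yields a
;; bare parameter name.
(define query (uri-compose-query '(("keyword" "foo") ("verbose" #t))))
;; query is "keyword=foo&verbose"

;; Attach it to the rest of the URI.
(define search-uri
  (uri-compose :scheme "http" :host "foo.net"
               :path "/cgi-bin/query.cgi" :query query))
;; search-uri is "http://foo.net/cgi-bin/query.cgi?keyword=foo&verbose"
```

Since uri-compose-query already urlencodes names and values, the result needs no further escaping before being passed as the query component.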
Function: uri-compose-data data :key content-type encoding
{rfc.uri}
Creates a Data URI of the given data, with specified content-type
and transfer encoding. Returns a string.
The data argument must be a string or a u8vector.
The content-type argument can be #f (default), a string that represents a content type (e.g. "text/plain;charset=utf-8"), or a list form of a parsed content type (e.g. ("application" "octet-stream")). If it is #f, text/plain with Gauche's native character encoding is used when data is a complete string, and application/octet-stream is used otherwise.
The encoding argument can be either #f (default), or a symbol uri or base64. This is for transfer encoding, not character encoding. If it is #f, URI encoding is used for text data and base64 encoding is used for binary data.
(uri-compose-data "(hello world)")
  ⇒ "data:text/plain;charset=utf-8,%28hello%20world%29"
(uri-compose-data "(hello world)" :encoding 'base64)
  ⇒ "data:text/plain;charset=utf-8;base64,KGhlbGxvIHdvcmxkKQ=="
(uri-compose-data '#u8(0 1 2 3))
  ⇒ "data:application/octet-stream;base64,AAECAw=="
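uri-compose-data and uri-decompose-data are inverses of each other, so a round trip should give back the original data. A small sketch with binary data, where data-round-trip is a hypothetical helper name:

```scheme
(use rfc.uri)

;; Hypothetical helper: encode DATA as a Data URI, decode it again, and
;; return the parsed content type together with the recovered data.
(define (data-round-trip data)
  (receive (content-type decoded)
      (uri-decompose-data (uri-compose-data data))
    (list content-type decoded)))

(data-round-trip '#u8(0 1 2 3))
;; (("application" "octet-stream") #u8(0 1 2 3))
```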
Function: uri-decode :key cgi-decode
Function: uri-decode-string string :key cgi-decode encoding
{rfc.uri}
Decodes “URI encoding”, i.e. %-escapes. uri-decode takes input from the current input port, and writes the decoded result to the current output port. uri-decode-string takes input from string and returns the decoded string.
If cgi-decode is true, + is also replaced with a space character.
To uri-decode-string you can provide the external character encoding with the encoding keyword argument. When it is given, the decoded octet sequence is assumed to be in the specified encoding and is converted to Gauche's internal character encoding.
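The following sketch shows both entry points: the string version with and without cgi-decode, and the port version wrapped with string ports. The input strings are made up for illustration.

```scheme
(use rfc.uri)

;; %-escapes are decoded; + is left alone by default.
(uri-decode-string "%28hello%20world%29")         ; "(hello world)"
(uri-decode-string "hello+world")                 ; "hello+world"

;; With cgi-decode, + is also turned into a space.
(uri-decode-string "hello+world" :cgi-decode #t)  ; "hello world"

;; uri-decode works on the current ports, so string ports can be used
;; to feed it input and capture its output.
(define (decode-via-ports str)
  (with-output-to-string
    (lambda () (with-input-from-string str uri-decode))))

(decode-via-ports "a%20b")                        ; "a b"
```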
Function: uri-encode :key noescape
Function: uri-encode-string string :key noescape encoding
{rfc.uri}
Encodes unsafe characters with %-escapes. uri-encode takes input from the current input port and writes the result to the current output port. uri-encode-string takes input from string and returns the encoded string.
By default, characters that are not specified as “unreserved” in RFC 3986 are escaped. You can pass a different character set to the noescape argument to specify which characters are left unescaped. For example, the older RFC 2396 has several more “unreserved” characters, and passing *rfc2396-unreserved-char-set* (see below) prevents those characters from being escaped.
Multibyte characters are encoded as the octet stream of Gauche's native multibyte representation by default. However, you can pass the encoding keyword argument to uri-encode-string to convert the string to the specified character encoding.
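To make the effect of noescape concrete, the sketch below encodes the same string with the default character set and with the RFC 2396 set, in which parentheses count as unreserved. The input string is chosen for illustration.

```scheme
(use rfc.uri)

;; Default: only RFC 3986 unreserved characters are left as is, so the
;; parentheses and the space are escaped.
(uri-encode-string "(hello world)")
;; "%28hello%20world%29"

;; RFC 2396 treats parentheses as unreserved, so only the space is escaped.
(uri-encode-string "(hello world)" :noescape *rfc2396-unreserved-char-set*)
;; "(hello%20world)"

;; Decoding the encoded string gives back the original.
(uri-decode-string (uri-encode-string "(hello world)"))
;; "(hello world)"
```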
Constant: *rfc2396-unreserved-char-set*
Constant: *rfc3986-unreserved-char-set*
{rfc.uri}
These constants are bound to character sets that represent the “unreserved” characters defined in RFC 2396 and RFC 3986, respectively. (See Character Sets, and scheme.charset - R7RS character sets, for operations on character sets.)
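The difference between the two sets can be checked with char-set-contains?. This sketch assumes the RFC 3986 constant is named *rfc3986-unreserved-char-set*, by analogy with the RFC 2396 one mentioned above.

```scheme
(use rfc.uri)

;; "!" is unreserved in RFC 2396 but not in RFC 3986.
(char-set-contains? *rfc2396-unreserved-char-set* #\!)  ; #t
(char-set-contains? *rfc3986-unreserved-char-set* #\!)  ; #f

;; "~" is unreserved in both.
(char-set-contains? *rfc2396-unreserved-char-set* #\~)  ; #t
(char-set-contains? *rfc3986-unreserved-char-set* #\~)  ; #t
```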