util.levenshtein- Levenshtein edit distance
This module provides procedures to calculate edit distance between two sequences. Edit distance is the minimum number of edit operations required to match one sequence to another. Three algorithms are implemented:
Count deletion of one element, insertion of one element, and susbstitution of one element.
Besides deletion, insertion and substitution, we allow transposition of adjacent elements.
Also called optimal string alignment distance. Like Damerau-Levenshtein, but once transposition is applied, no further editing on those elements are allowed.
These algorithms are often used to compare strings, but the procedures in this module can handle any type of sequence (see Sequence framework).
Calculates Levenshtein distance (
l-*), restricted edit
re-*) and Damerau-Levenshtein distance (
between sequences, respectively. Each algorithm comes in two
flavors: The singular form
*-distance takes two sequences,
seq-A and seq-B, and calculates distance between them.
The plural form
*-distances takes a sequence seq-A
and a list of sequences seq-Bs, and calculates distances
between seq-A and each in seq-Bs.
If you need to calculate distances from a single sequence to many sequences, using the plural version is much faster than repeatedly calling the singular version, for the plural version can reuse internal data structures and save allocation and setup time.
Sequences can be any object that satisfy the
(see Sequence framework).
The keyword argument elt= is used to compare elements in
the sequences. Its default is
The keyword argument cutoff must be, if given, a nonnegative
exact integer. Once the possible minimum distance between two sequences
becomes greater than this number, the algorithm stops and gives
as the result, and moves on to the next calculation.
This is useful when you run the algorithm on large set of sequences
and you only need to look for the pairs closer than the certain limit.
In our implementation, Levenshtein is the fastest, Damerau-Levenshtein
is the slowest and Restricted edit is somewhere inbetween. If you don’t
need to take into account of transpositions, use Levenshtein; it counts
act, while other algorithms yield 1 for it.
If you need to consider transpositions, choose either
dl-. The catch in
re- is that it does not
satisfy triangular inequality, i.e. for given three sequences X, Y and Z,
(Damerau-)Levenshtein distance L always satisfy
L(X;Z) <= L(X;Y) + L(Y;Z), but restricted edit distance doesn’t
(l-distance "cat" "act") ⇒ 2 (l-distances "cat" '("Cathy" "scathe" "stack") :elt= char-ci=?) ⇒ (2 3 4) (re-distance "cat" "act") ⇒ 1 (re-distances "pepper" '("peter" "piper" "picked" "peck" "pickled" "peppers") :cutoff 4) ⇒ (2 2 4 4 #f 1) (dl-distance '(a b c d e) '(c d a b e)) ⇒ 4
Note: If you pass list of sequences to the second argument of the singular version by accident, you might not get an error immediately because a list is also a sequence.