util.levenshtein
- Levenshtein edit distance ¶This module provides procedures to calculate edit distance between two sequences. Edit distance is the minimum number of edit operations required to match one sequence to another. Three algorithms are implemented:
Count deletion of one element, insertion of one element, and substitution of one element.
Besides deletion, insertion and substitution, we allow transposition of adjacent elements.
Also called optimal string alignment distance. Like Damerau-Levenshtein, but once transposition is applied, no further editing on those elements are allowed.
These algorithms are often used to compare strings, but the procedures
in this module can handle any type of sequence (see gauche.sequence
- Sequence framework).
{util.levenshtein
}
Calculates Levenshtein distance (l-*
), restricted edit
distance (re-*
) and Damerau-Levenshtein distance (dl-*
)
between sequences, respectively. Each algorithm comes in two
flavors: The singular form *-distance
takes two sequences,
seq-A and seq-B, and calculates distance between them.
The plural form *-distances
takes a sequence seq-A
and a list of sequences seq-Bs, and calculates distances
between seq-A and each in seq-Bs.
If you need to calculate distances from a single sequence to many sequences, using the plural version is much faster than repeatedly calling the singular version, for the plural version can reuse internal data structures and save allocation and setup time.
Sequences can be any object that satisfy the <sequence>
protocol
(see gauche.sequence
- Sequence framework).
The keyword argument elt= is used to compare elements in
the sequences. Its default is eqv?
.
The keyword argument cutoff must be, if given, a nonnegative
exact integer. Once the possible minimum distance between two sequences
becomes greater than this number, the algorithm stops and gives #f
as the result, and moves on to the next calculation.
This is useful when you run the algorithm on large set of sequences
and you only need to look for the pairs closer than the certain limit.
In our implementation, Levenshtein is the fastest, Damerau-Levenshtein
is the slowest and Restricted edit is somewhere inbetween. If you don’t
need to take into account of transpositions, use Levenshtein; it counts
2 for cat
-> act
, while other algorithms yield 1 for it.
If you need to consider transpositions, choose either
re-
or dl-
. The catch in re-
is that it does not
satisfy triangular inequality, i.e. for given three sequences X, Y and Z,
(Damerau-)Levenshtein distance L always satisfy
L(X;Z) <= L(X;Y) + L(Y;Z), but restricted edit distance doesn’t
guarantee that.
(l-distance "cat" "act") ⇒ 2 (l-distances "cat" '("Cathy" "scathe" "stack") :elt= char-ci=?) ⇒ (2 3 4) (re-distance "cat" "act") ⇒ 1 (re-distances "pepper" '("peter" "piper" "picked" "peck" "pickled" "peppers") :cutoff 4) ⇒ (2 2 4 4 #f 1) (dl-distance '(a b c d e) '(c d a b e)) ⇒ 4
Note: If you pass a list of sequences to the second argument of the singular version by accident, you might not get an error immediately because a list is also a sequence.