Next: Pattern matching, Previous: The longest common subsequence, Up: Library modules - Utilities [Contents][Index]

`util.levenshtein`

- Levenshtein edit distance- Module:
**util.levenshtein** -
This module provides procedures to calculate edit distance between two sequences. Edit distance is the minimum number of edit operations required to match one sequence to another. Three algorithms are implemented:

- Levenshtein distance
Count deletion of one element, insertion of one element, and substitution of one element.

- Damerau-Levenshtein distance
Besides deletion, insertion and substitution, we allow transposition of adjacent elements.

- Restricted edit distance
Also called optimal string alignment distance. Like Damerau-Levenshtein, but once transposition is applied, no further editing on those elements are allowed.

These algorithms are often used to compare strings, but the procedures in this module can handle any type of sequence (see Sequence framework).

- Function:
**l-distance***seq-A seq-B :key elt= cutoff* - Function:
**l-distances***seq-A seq-Bs :key elt= cutoff* - Function:
**re-distance***seq-A seq-B :key elt= cutoff* - Function:
**re-distances***seq-A seq-Bs :key elt= cutoff* - Function:
**dl-distance***seq-A seq-B :key elt= cutoff* - Function:
**dl-distances***seq-A seq-Bs :key elt= cutoff* {

`util.levenshtein`} Calculates Levenshtein distance (`l-*`

), restricted edit distance (`re-*`

) and Damerau-Levenshtein distance (`dl-*`

) between sequences, respectively. Each algorithm comes in two flavors: The singular form`*-distance`

takes two sequences,`seq-A`and`seq-B`, and calculates distance between them. The plural form`*-distances`

takes a sequence`seq-A`and a list of sequences`seq-Bs`, and calculates distances between`seq-A`and each in`seq-Bs`.If you need to calculate distances from a single sequence to many sequences, using the plural version is much faster than repeatedly calling the singular version, for the plural version can reuse internal data structures and save allocation and setup time.

Sequences can be any object that satisfy the

`<sequence>`

protocol (see Sequence framework).The keyword argument

`elt=`is used to compare elements in the sequences. Its default is`eqv?`

.The keyword argument

`cutoff`must be, if given, a nonnegative exact integer. Once the possible minimum distance between two sequences becomes greater than this number, the algorithm stops and gives`#f`

as the result, and moves on to the next calculation. This is useful when you run the algorithm on large set of sequences and you only need to look for the pairs closer than the certain limit.In our implementation, Levenshtein is the fastest, Damerau-Levenshtein is the slowest and Restricted edit is somewhere inbetween. If you don’t need to take into account of transpositions, use Levenshtein; it counts 2 for

`cat`

->`act`

, while other algorithms yield 1 for it. If you need to consider transpositions, choose either`re-`

or`dl-`

. The catch in`re-`

is that it does not satisfy triangular inequality, i.e. for given three sequences X, Y and Z, (Damerau-)Levenshtein distance L always satisfy L(X;Z) <= L(X;Y) + L(Y;Z), but restricted edit distance doesn’t guarantee that.(l-distance "cat" "act") ⇒ 2 (l-distances "cat" '("Cathy" "scathe" "stack") :elt= char-ci=?) ⇒ (2 3 4) (re-distance "cat" "act") ⇒ 1 (re-distances "pepper" '("peter" "piper" "picked" "peck" "pickled" "peppers") :cutoff 4) ⇒ (2 2 4 4 #f 1) (dl-distance '(a b c d e) '(c d a b e)) ⇒ 4

Note: If you pass a list of sequences to the second argument of the singular version by accident, you might not get an error immediately because a list is also a sequence.

DRAFT