For Development HEAD DRAFTSearch (procedure/syntax/module):

12.89 util.levenshtein - Levenshtein edit distance

Module: util.levenshtein

This module provides procedures to calculate edit distance between two sequences. Edit distance is the minimum number of edit operations required to match one sequence to another. Three algorithms are implemented:

Levenshtein distance

Count deletion of one element, insertion of one element, and substitution of one element.

Damerau-Levenshtein distance

Besides deletion, insertion and substitution, we allow transposition of adjacent elements.

Restricted edit distance

Also called optimal string alignment distance. Like Damerau-Levenshtein, but once transposition is applied, no further editing on those elements are allowed.

These algorithms are often used to compare strings, but the procedures in this module can handle any type of sequence (see gauche.sequence - Sequence framework).

Function: l-distance seq-A seq-B :key elt= cutoff
Function: l-distances seq-A seq-Bs :key elt= cutoff
Function: re-distance seq-A seq-B :key elt= cutoff
Function: re-distances seq-A seq-Bs :key elt= cutoff
Function: dl-distance seq-A seq-B :key elt= cutoff
Function: dl-distances seq-A seq-Bs :key elt= cutoff

{util.levenshtein} Calculates Levenshtein distance (l-*), restricted edit distance (re-*) and Damerau-Levenshtein distance (dl-*) between sequences, respectively. Each algorithm comes in two flavors: The singular form *-distance takes two sequences, seq-A and seq-B, and calculates distance between them. The plural form *-distances takes a sequence seq-A and a list of sequences seq-Bs, and calculates distances between seq-A and each in seq-Bs.

If you need to calculate distances from a single sequence to many sequences, using the plural version is much faster than repeatedly calling the singular version, for the plural version can reuse internal data structures and save allocation and setup time.

Sequences can be any object that satisfy the <sequence> protocol (see gauche.sequence - Sequence framework).

The keyword argument elt= is used to compare elements in the sequences. Its default is eqv?.

The keyword argument cutoff must be, if given, a nonnegative exact integer. Once the possible minimum distance between two sequences becomes greater than this number, the algorithm stops and gives #f as the result, and moves on to the next calculation. This is useful when you run the algorithm on large set of sequences and you only need to look for the pairs closer than the certain limit.

In our implementation, Levenshtein is the fastest, Damerau-Levenshtein is the slowest and Restricted edit is somewhere inbetween. If you don’t need to take into account of transpositions, use Levenshtein; it counts 2 for cat -> act, while other algorithms yield 1 for it. If you need to consider transpositions, choose either re- or dl-. The catch in re- is that it does not satisfy triangular inequality, i.e. for given three sequences X, Y and Z, (Damerau-)Levenshtein distance L always satisfy L(X;Z) <= L(X;Y) + L(Y;Z), but restricted edit distance doesn’t guarantee that.

(l-distance "cat" "act")  ⇒ 2
(l-distances "cat" '("Cathy" "scathe" "stack")
  :elt= char-ci=?)
  ⇒ (2 3 4)

(re-distance "cat" "act") ⇒ 1

(re-distances "pepper"
  '("peter" "piper" "picked" "peck" "pickled" "peppers")
  :cutoff 4)
  ⇒ (2 2 4 4 #f 1)

(dl-distance '(a b c d e) '(c d a b e)) ⇒ 4

Note: If you pass a list of sequences to the second argument of the singular version by accident, you might not get an error immediately because a list is also a sequence.



For Development HEAD DRAFTSearch (procedure/syntax/module):
DRAFT