The longest common subsequence (Gauche Users’ Reference)

12.85 `util.lcs` - The longest common subsequence

Module: util.lcs ¶

This module implements the algorithm to find the longest common subsequence of two given sequences. The implemented algorithm is based on Eugene Myers’ O(ND) algorithm (Eugene Myers, An O(ND) Difference Algorithm and Its Variations, Algorithmica Vol. 1 No. 2, pp. 251-266, 1986.).

One of the applications of this algorithm is to calculate the difference of two text streams; see text.diff - Calculate difference of text streams.

Function: lcs seq-a seq-b :optional eq-fn ¶

{util.lcs} Calculates and returns the longest common sequence of two lists, seq-a and seq-b. Optional eq-fn specifies the comparison predicate; if omitted, equal? is used.

(lcs '(x a b y) '(p a q b))
 ⇒ (a b)

Function: lcs-with-positions seq-a seq-b :optional eq-fn ¶

{util.lcs} This is the detailed version of lcs. The arguments are the same.

Returns a list of the following structure:

(length ((elt a-pos b-pos) …))

Length is an integer showing the length of the found LCS. What follows is a list of elements of LCS; each sublist consists of the element, the integer position of the element in seq-a, then the integer position of the element in seq-b.

(lcs-with-positions '(a) '(a))
 ⇒ (1 ((a 0 0)))

(lcs-with-positions '(x a b y) '(p q a b))
 ⇒ (2 ((a 1 2) (b 2 3)))

(lcs-with-positions '(x a b y) '(p a q b))
 ⇒ (2 ((a 1 1) (b 2 3)))

(lcs-with-positions '(x y) '(p q))
 ⇒ (0 ())

Function: lcs-fold a-proc b-proc both-proc seed a b :optional eq-fn ¶

{util.lcs} A fundamental iterator over the "edit list" derived from two lists a and b.

A-proc, b-proc, both-proc are all procedures that take two arguments. The second argument is a intermediate state value of the calculation. The first value is an element only in a for a-proc, or an element only in b for b-proc, or an element in both a and b for both-proc. The return value of each procedure is used as the state value of the next call of either one of the procedures. Seed is used as the initial value of the state value. The last state value is returned from lcs-fold.

The three procedures are called in the following order: Suppose the sequence a consists of a’ca”, and b consists of b’cb”, where a’, b’, a”, and b” are subsequences, and c is the head of the LCS of a and b. Then a-proc is called first on each element in a’, b-proc is called second on each element in b’, then both-proc is called on c. Afterwards, the process is repeated using a” and b”.

Function: lcs-edit-list a b :optional eq-fn ¶

{util.lcs} Calculates ’edit-list’ from two lists a and b, which is the smallest set of commands (additions and deletions) that changes a into b. This procedure is built on top of lcs-fold above.

Returns a list of hunks, which is a contiguous section of additions and deletions. Each hunk consists of a list of directives, which is a form of:

(+|- position element)

Here’s an example. Suppose a and b are the following lists, respectively.

a ≡ ("A" "B" "C" "E" "H" "J" "L" "M" "N" "P")
b ≡ ("B" "C" "D" "E" "F" "J" "K" "L" "M" "R" "S" "T")

Then, (lcs-edit-list a b equal?) returns the following list.

(((- 0 "A"))
 ((+ 2 "D"))
 ((- 4 "H") (+ 4 "F"))
 ((+ 6 "K"))
 ((- 8 "N") (- 9 "P") (+ 9 "R") (+ 10 "S") (+ 11 "T"))
)

The result consists of five hunks. The first hunk consists of one directive, (- 0 "A"), which means the element "A" at the position 0 of list a has to be deleted. The second hunk also consists of one directive, (+ 2 "D"), meaning the element "D" at the position 2 of list b has to be added. The third hunk means "H" at the position 4 of list a should be removed and "F" at the position 4 of list b should be added, and so on.

If you are familiar with Perl’s Algorithm::Diff module, you may notice that this is the same structure that its diff procedure returns.

Function: lcs-edit-list/context a b :optional eq-fn :key context-size ¶

Function: lcs-edit-list/unified a b :optional eq-fn :key context-size ¶

{util.lcs} Calculates edit list of two sequences a and b, including context surrounding the changes. It is similar to what you get with diff -c and diff -u commands, respectively. The text.diff module provides context diff on top of this procedure (see text.diff - Calculate difference of text streams).

The eq-fn argument specifies the equality predicate on the elements. Its default is equal?. The context-size keyword argument specifies the maximum size of the unchanged elements surrounding each edits. It must be an exact positive integer. Its default is 3.

Returns a list of hunks. Each hunk represents a chunk of difference between a and b. The format of hunk differs between lcs-edit-list/context and lcs-edit-list/unified.

A hunk of lcs-edit-list/context has the following form:

<hunk> : #(<a-edits> <b-edits>)

<a-edits> : <edits>
<b-edits> : <edits>

<edits> : (<start-pos> <end-pos> <edit> ...)

<edit> : (= <element>)     ; same in both a and b
       | (- <element>)     ; deleted: only in a
       | (+ <element>)     ; inserted: only in b
       | (! <element>)     ; changed

The <start-pos> and <end-pos> are exact integer indexes of the sequence the hunk is covering; the start position inclusive, the end position exclusive, and 0-based.

Each <edit> shows the element in that range, and whether the element is to be unchanged, deleted, inserted or changed, to edit the sequence a to make b.

Here’s an example:

(lcs-edit-list/context '(a b c) '(a B c d))
 ⇒
   (#((0 3 (= a) (! b) (= c))
     ((0 4 (= a) (! B) (= c) (+ d)))))

The result contains one hunk. It shows that you should change b to B, and insert d.

A hunk of lcs-edit-list/unified has the following form:

<hunk> : #(<a-start> <a-size> <b-start> <b-size> (<edit> ...))

<edit> : (= <element>)     ; unchanged
       | (- <element>)     ; deleted
       | (+ <element>)     ; inserted

Each start and size in the hunk specifies the start position within the sequence (0-based) and the length of the hunk in the sequence, respectively.

(lcs-edit-list/unified '(a b c) '(a B c d))
  ⇒
  (#(0 3 0 4 ((= a) (- b) (+ B) (= c) (+ d))))

Context size: In each hunk, up to context-size unchanged elements are attached before and after the inserted/delete/changed elements, to show the context. If there’s less than (+ 1 (* conext-size 2)) elements between two changes, they are merged into one hunk. In the followig example, the change of b to B and e to E are merged into one hunk for they’re not far enough, while the change of i to I is in a separate hunk:

(lcs-edit-list/context '(a b c d e f g h i j)
                       '(a B c d E f g h I j)
                       equal? :context-size 1)
 ⇒
   (#((0 6 (= a) (! b) (= c) (= d) (! e) (= f))
      (0 6 (= a) (! B) (= c) (= d) (! E) (= f)))
    #((7 10 (= h) (! i) (= j))
      (7 10 (= h) (! I) (= j))))

(lcs-edit-list/unified '(a b c d e f g h i j)
                       '(a B c d E f g h I j)
                       equal? :context-size 1)
 ⇒
   (#(0 6 0 6 ((= a) (- b) (+ B) (= c) (= d) (- e) (+ E) (= f)))
    #(7 3 7 3 ((= h) (- i) (+ I) (= j))))

12.85 util.lcs - The longest common subsequence

12.85 `util.lcs` - The longest common subsequence