Gauche:SpellingCorrection

Gauche:SpellingCorrection

Shiro: Peter Norvig has shown how to write a spell checker in 20 lines of Python: How to Write a Spelling Corrector. Well, it's beautiful. So much that I got envious and wanted to see if I could write it up in Gauche. Here's more or less a literal translation of the original code, with some tweaks to adapt Gauche's style. (I didn't have English text for training handy, so I just grab Gauche's English manual, which may have bias in the word frequency as well as large number of misspellings in itself. Replace the NWORDS definition to use some other texts if you have any.)

(use srfi-1)
(use srfi-13)
(use srfi-14)
(use srfi-42)
(use gauche.collection)
(use file.util)

(define (words text) (string-split (string-downcase text) #[^a-z]))

(define (train features)
  (let1 model (make-hash-table 'string=?)
    (for-each (cut hash-table-update! model <> (cut + <> 1) 1) features)
    model))

(define NWORDS (train (words (file->string "src/Gauche/doc/gauche-refe.texi"))))

(define (edits1 w)
  (let ((n (string-length w))
        (set (make-hash-table 'string=?)))
    (define (reg s) (hash-table-put! set s #t))
    (do-ec (: i n) (reg #`",(string-take w i),(string-drop w (+ i 1))"))
    (do-ec (: i (- n 1))
           (reg #`",(string-take w i),(ref w (+ i 1)),(ref w i),(string-drop w (+ i 2))"))
    (do-ec (: i n) (: c (char-set->list char-set:lower-case))
           (reg #`",(string-take w i),c,(string-drop w (+ i 1))"))
    (do-ec (: i (+ n 1)) (: c (char-set->list char-set:lower-case))
           (reg #`",(string-take w i),c,(string-drop w i)"))
    (hash-table-keys set)))

(define (known-edits2 w)
  (let1 set (make-hash-table 'string=?)
    (do-ec (: w1 (edits1 w))
           (: w2 (edits1 w1))
           (if (hash-table-exists? NWORDS w2))
           (hash-table-put! set w2 #t))
    (hash-table-keys set)))

(define (known words) (filter (cut hash-table-exists? NWORDS <>) words))

(define (correct word)
  (let1 candidates
      (cond ((known `(,word)) pair? => values)
            ((known (edits1 word)) pair? => values)
            ((known-edits2 word) pair? => values)
            (else `(,word)))
    (values-ref (fold2 (lambda (word maxword maxscore)
                         (let1 score (hash-table-get NWORDS word 1)
                           (if (> score maxscore)
                             (values word score)
                             (values maxword maxscore))))
                       "" -1 candidates)
                0)))

At this moment I admit it's far uglier than the Python version.

Let's see what I can improve in Gauche from this.

Lack of set operations

The comprehensions in Python code can be translated easily into Gauche with srfi-42 Eager comprehensions. However, applying set operation on the result is a pain.

Originally I just use srfi-1's lset-* operations and other list-as-set functions like delete-duplicates. For example, the original version of edits1 was like this:

(define (edits1 w)
  (let1 n (string-length w)
    (lset-union
     string=?
     (list-ec (: i n) #`",(string-take w i),(string-drop w (+ i 1))")
     (list-ec (: i (- n 1))
              #`",(string-take w i),(ref w (+ i 1)),(ref w i),(string-drop w (+ i 2))")
     (list-ec (: i n) (: c (char-set->list char-set:lower-case))
              #`",(string-take w i),c,(string-drop w (+ i 1))")
     (list-ec (: i (+ n 1)) (: c (char-set->list char-set:lower-case))
              #`",(string-take w i),c,(string-drop w i)"))))

Unfortunately, this made known-edits2 (which applies edits1 twice) very slow (e.g. searching distance-2 edits of "unnecessarily" took more than a minute). I dropped back to use hash tables explicitly. This is where something I can improve a lot. Some random ideas:

Shortcut op with predicate

The first half of 'correct' function is to find a candidate set:

  1. If the given word is known, just use it.
  2. If there's any known words in the distance-1 edits of the word, use them.
  3. If there's any known words in the distance-2 edits, use them.
  4. otherwise, use the given word.

The straightforward code can be like this:

(find pair? (list (known `(,word)) (known (edits1 word)) ...))

But it calculates edit sets even they are unnecessary.

Python version avoids the unnecessary calculation using the fact that empty set counts false, but in Scheme we can't do that. I came up using srfi-61 extension of cond, but it is very kludgy.

One possible improvement is to have an OR-like macro that takes optional predicate. It's along the line of cond-list (in list.util) and srfi-61, so I might be able to come up some generic framework for that line.

Max with key

Particulary ugly part is the last part of 'correct' function. I hope I'm missing something, but it appears that we don't have a function that chooses an element that has maximum attribute. 'Max' and 'max-ec' can only return the maximum attribute value, not the element that has it. Definitely it is something we like to have it.

Shiro(2007/04/11 21:27:41 PDT): Well, actually this one's shorter, but the score lookup may be done more than once per element. (In this code it probably doesn't matter, since the candidates list is typically pretty short).

(define (correct word)
  (let1 candidates
      (cond ((known `(,word)) pair? => values)
            ((known (edits1 word)) pair? => values)
            ((known-edits2 word) pair? => values)
            (else `(,word)))
    (reduce (lambda (a b)
              (if (> (hash-table-get NWORDS a 1) (hash-table-get NWORDS b 1)) a b))
            #f candidates)))

More generic sequence operations

Another idea I have for some time is to make all the list operations of srfi-1 generic, so that all 'string-drop'/'string-take's above can be replaced by simple 'drop'/'take'. (You can use 'subseq' right now, but 'drop'/'take' will be simpler).

The reason we don't have that yet is a concern on performance; generic function dispach is much slower than the current list-specific operations. However, I have an idea to provide list-specific fact operations on different name (e.g. list-drop, list-take, list-fold etc.) and to make the unqualified names (drop, take, fold, ...) generic by default. It will benefit both those who write something quick and those who want to tune for particular type of data.

Comments and Discussions

(Feel free to leave comments/discussions here; both English and Japanese is OK.)

Jens Axel Søgaard?(2007/04/11): Here is a variation:

(define (correct word)
  (define (falsify o) (if (null? o) #f o))
  (let1 candidates
     (or (falsify (known `(,word))
         (falsify (known (edits1 word)))
         (falsify (known-edits2 word))
         `(,word))
    <as before>))

Alternatively, one can define an or+

(define-syntax or+
  (syntax-rules ()
    ((_ falsify expr ...)
     (let ((f falsify)) 
       (or (falsify expr) ...)))))

and write

(define (correct word)
  (define (falsify o) (if (null? o) #f o))
  (let1 candidates
     (or+ falsify 
         (known `(,word))
         (known (edits1 word))
         (known-edits2 word)
         `(,word))
     <as before>))

Shiro(2007/04/16 14:28:59 PDT): Yep. By "to have an OR-like macro that takes optional predicate" I meant something in the line of your or+ macro. Do you think it is worth to have it built-in?

Soegaard?(2007/04/29): I am not sure. The first variation is probably easier to understand, since there is one less construct to remember.

More ...