Constructing a Spell Checker

Authors

Publication

Pub Details

Date

Pages

See all articles from QL Hacker's Journal 26

A spell checker is usually comprised of two parts: 1) word lookup (to see if a word is spelled correctly) and 2) word suggestion (to suggest the correct spelling of the word). Past issues of the QHJ have looked at different algorithms to tell how close two words are, a key part of word suggestion. This article will focus on word lookup.

The key thing to decide in creating the word lookup algorithm is the data structure for storing the words and quickly looking them up. If the word list was fairly short, a brute force method would work. Since most spell checkers will need a word list in the tens of thousands, the lookup algorithm will need to be smarter. We also need to keep in mind that the words will be of many different lengths.

At first the most obvious data structure would be a tree structure. A word would walk down the tree structure letter by letter. When it reached the end of its length it would check the current tree node to see if it is a valid word. Let’s take a look at three words, bar, bard, and barr, and the following tree structure.

                      b
/ \
a b
/ \
r m
/ \
d n

With the word BAR, the B is valid, with leads to an A which is valid and it leads to an R which is valid. The R node will have a value of 1 to signify that it is the end of a valid word. This way the structure can parse both BAR and BARN and distinguish between the two. When parsing BARR, the B is fine, which leads to A, which leads to R, but now there is no R path in the tree and the word is determined to be invalid.

The problem with this data structure is two fold: one, you need to construct it out of the dictionary file at run time, which can take some time, or you need to find a way to store it so it can be read in easily. The second problems is that the language we are going to construct the spell checker in is SuperBasic, which does not easily support making tree structures. They are easily created with C structures or Pascal records.

We could use a hashing algorithm since it is designed for very quick look up, but with a very static list of words, our hashing algorithm may require more dataspace than we really need.

We need to come up with some data structure that is tailored to our needs. One that will provide a fairly quick look up and minimize on the data space needed to store the word list.

Here is a suggestion: Store the words in a flat array. The words will be pre-sorted on disk, first by the length of the word and then alphabetically. This means that all of the two letter words will be grouped together and sorted alphabetically, then the three letter words, etc. Word length is one way to distinguish one word from an other.

Create a two-dimensional array called start_array(x,y). The X value will be LENGTH and the Y value will be FIRST_CHAR. As the words are read in, the array will be used to keep track of where the first 2 letter starts in the array, where the first three letter word starts, and so on. It will also keep track of where words start by the first letter. When you need to do a lookup of the word BAR, LENGTH is 3, FIRST_CHAR is equal to B, so you would look up start_array(3,’B’). this will return where the first 3 letter word that starts with B is stored in the word array. From there the search can be a simple brute force search that compares all three letter B words to see if they match BAR.

To determine where the search should end, you will also need to know where the first three letter C word is at. This can also be looked up in the start_array. Below is a little pseudo code showing how this would work.

    start = start_array(3,'B')
stop = start_array(3,'C')

FOR x = start TO stop
IF word_array$(x) = BAR THEN EXIT success
NEXT x
EXIT fail

Products

 

Downloadable Media

 

Image Gallery

Scroll to Top