Through work I have come to know one Alex Bocast. When I first met Alex, he mentioned working on a Pattern Matching Algorithm, which he calls Token Reconstruction and that he was pursuing a patent on it. About a few months ago I saw Alex again and queried him on how his patent was going. After informing me that it had been granted, he gave me some brief details on the algorithm. Since he had a copy of the patent with him, he provide me a copy and challenged me to work out the code.
Now, patent documents are not the easiest things to read, but there were some easy to follow flow charts that really helped. After a few days (with many interruptions), I had the program running and generating the same results listed in the patent.
The patent is titled “Method and Apparatus for Reconstructing a Token from a Token Fragment.” The application that the patent was applied for is this: given an incoming word/string/token, find what word/string/token it matches closest. The incoming word/token may be misspelled or have lost some characters in transmission, so a pattern match must be made.
The core of the algorithm is a fast pattern match that determines the percent of alikeness between two tokens, the unreconstructed token and the vocabulary token. The algorithm was designed to be computationally fast. The algorithm appears not to be robust, but Alex has some run some tests to show that it produces good results. Other algorithms are more robust, but at the cost of being computationally slow.
Alex best described the algorithm is this way: both tokens are compared from the left (beginning/forward) and right (end/rear). A “vector” is calculated during both matches and a calculation is made to see if forward and rear “vectors” are pointing toward each other, as in the vocabulary token. The “vectors” will diverge if there are few matches. The results are then calculated given a tolerance. The tolerance defines how “tight” you want the algorithm to be. Tolerance usually goes up for longer strings.
The algorithm described in the patent can have many variations. It can be fast or slow, forward-biased, rearward-biased, or balanced. I picked the slow and balanced variation.
I ran some tests with this algorithm verses the Ratcliff/Obershelp Pattern Matching algorithm published in the first QHJ. Both algorithms have been written in SuperBasic. My tests showed that the Bocast algorithm was faster, less correct, and the R/O algorithm to be slower and more correct.
After I ran these tests, I provided Alex with the draft of my document and my source code. I sent him three pages of text and he returned six pages of comments.
Alex feels that is not valid to compare individual results from different algorithms, but better to compare the overall results of what token/word that they determine to be the correct one. It’s now how one determines the result as long as the result is correct. Alex does have a good point.
He prefers to describe it as: “The measure returned by Token Reconstruction, called the Reconstruction Index, is the relative likelihood that a vocabulary token V is a reconstruction of the unreconstructed token U, given a tolerance for error T.”
So, I will not reveal my results. Mostly based on the above and the fact that my code needed a few changes, and this would affect my initial results. Instead I’ll list some results that Alex has come up with in testing his algorithm with others.
| Technique | 521-Word | 1142-Word | 1872-Word |
|---|---|---|---|
| Token Reconstruction | 80% | 78% | 75% |
| Minimum Edit Distance | 64% | 62% | 60% |
| Simple Correlation Matrix | |||
| Dot Product | 58% | 54% | 52% |
| Hamming Distance | 69% | 68% | 67% |
| Cosine Distance | 76% | 75% | 74% |
| SVD Correlation Matrix | |||
| Cosine Distance | 81% | 76% | 74% |
These results are testing published in Communications of the ACM May 1992. Personally, I’m not familiar with the various algorithms mentioned.
Here are some results testing Token Reconstruction verses the spelling checkers in some popular programs.
Position of Intended word in list of suggested Replacements.
| 1 | 2 | 3 | 4 | 5 | 6+ | |
|---|---|---|---|---|---|---|
| Token Reconst. | 78% | 89% | 94% | 95% | 95% | 97% |
| Alis/Proximity | 77% | 85% | 85% | 87% | 87% | 89% |
| WordPerfect 5.0 | 64% | 80% | 82% | 84% | 85% | 88% |
| Enable | 72% | 78% | 80% | 80% | 80% | 80% |
Alex had a few comments about my initial results that do show some more insight into the algorithm. Alex took the examples in my test and provided these results:
| U vs. V | f(U,V,T) | f(V,U,T) |
|---|---|---|
| pencilvania vs. pennsylvania | 67.5% (T=2) | 44.4% (T=3) |
| misisipi vs. mississippi | 73.9% (T=1) | 9% (T=4) |
| *abcdef vs. abcdef | 60.6% (T=1) | 89.7% (T=1) |
U = unreconstructed token
V = vocabulary token
T = tolerance
The second column of results ( f(V,U,T) ) is the the same algorithm with the two tokens reversed.
One thing that I did discover about Token Reconstruction when implementing, is that it has two almost-worst cases and one worst case. But, in Alex’s testing, he showed that by reversing the strings ( switching U and V ) the worst case goes away. Alex has asked me not to reveal the worst case, as this is one way he can determine patent infringement. I’ll leave it up the the reader to figure it out. I must point out that the worst case is very unlikely to come up in real life. It only causes the algorithm problems if any proof is applied to it.
Now for the legal stuff
Mr. Bocast has assured me that anyone is free to use this patented algorithm for purely personal, non-commercial use without paying a license fee. Anyone who requests to use the algorithm for such non-commercial personal use will be gladly granted a royalty-free exploratory license by writing that request to Design Services Group, Inc., 1357 MacBeth Street, McLean, VA 22102 USA. Alex will provide mail/phone support to these licensed users — the algorithm is simple to code but requires care to configure correctly for optimum performance on any given problem.
However, all rights are reserved for all commercial applications. This means that software that contains this patented algorithm may not be distributed in any way to any person without a licensing agreement to cover such distribution. In other words, this algorithm is not in the public domain.
If your application is dependent on speed, then the Bocast algorithm may fit the bill. Just keep in mind the caveats that come with it. For those interested, I can make copies of the patent, or they can be ordered from the U.S. Patent Office for $1.50 each. The patent number is 5,008,818.
** bocast_ssb
** Impementation of Bocast Pattern Matching Algorithm
** described in U.S. Patent 5,008,818
** Method and Apparatus for Reconstructing a Token From
** a Token Fragment
**
INPUT "Enter Check String :";s$
INPUT "Enter Vocabulary String :";v$
INPUT "Enter Tolerance :";tolerance
** Optional Code to make the shorter
** string as s$
** IF LEN(s$) > LEN(v$) THEN
** temp$ = s$
** s$ = v$
** v$ = temp$
** END IF
j = 1 : jlim = LEN(s$) : jlen = jlim
k = 1 : klim = LEN(v$) : klen = klim : lastk = 0
forward_index = 0
rear_index = 0
PRINT s$;" vs. ";v$
** Front-Emphasis IDA/FCR
REPEAT loop
IF (j <= jlim) AND (k <= klim) THEN
IF s$(j) = v$(k) THEN
j = j + 1
lastk = k
END IF
k = k + 1
ELSE
EXIT loop
END IF
END REPEAT loop
** Front-Emphasis RDA/LCR
REPEAT loop
IF (j <= jlim) AND (lastk < klim) THEN
IF s$(jlim) = v$(klim) THEN
jlim = jlim - 1
END IF
klim = klim - 1
ELSE
EXIT loop
END IF
END REPEAT loop
IF ( jlim < LEN(s$)) OR ( j > 1 ) THEN
IF (jlim - j) < tolerance THEN
jj = j + jlim
kk = k + klim
IF jj < kk THEN
forward_index = jj/kk
ELSE
forward_index = kk/jj
END IF
END IF
END IF
j = 1 : jlim = LEN(s$)
k = 1 : klim = LEN(v$) : lastk = klen + 1
** Rear-Emphasis
REPEAT loop
IF (jlim > 0) AND (klim > 0) THEN
IF s$(jlim) = v$(klim) THEN
jlim = jlim - 1
lastk = klim
END IF
klim = klim - 1
ELSE
EXIT loop
END IF
END REPEAT loop
** Rear-Emphasis RDA/LCR
REPEAT loop
IF (j <= jlim) AND (k < lastk) THEN
IF s$(j) = v$(k) THEN
j = j + 1
END IF
k = k + 1
ELSE
EXIT loop
END IF
END REPEAT loop
IF (jlim < LEN(s$)) OR ( j > 1) THEN
IF (jlim - j) < tolerance THEN
jj = jlen*2 - j - jlim + 2
kk = klen*2 - k - klim + 2
IF jj < kk THEN
rear_index = jj/kk
ELSE
rear_index = kk/jj
END IF
END IF
END IF
answer = (forward_index + rear_index)/2
PRINT "Forward Index is ";forward_index
PRINT "Rear Index is ";rear_index
PRINT "Result is ";answer