LZW Compression

Authors

Tim Swenson

Publication

QL Hacker's Journal

Pub Details

QL Hacker's Journal 11 Issue: 11

Date

November 1992

Pages

See all articles from QL Hacker's Journal 11

Through work I have come to know one Alex Bocast. When I first met Alex, he mentioned working on a Pattern Matching Algorithm, which he calls Token Reconstruction and that he was pursuing a patent on it. About a few months ago I saw Alex again and queried him on how his patent was going. After informing me that it had been granted, he gave me some brief details on the algorithm. Since he had a copy of the patent with him, he provide me a copy and challenged me to work out the code.

Now, patent documents are not the easiest things to read, but there were some easy to follow flow charts that really helped. After a few days (with many interruptions), I had the program running and generating the same results listed in the patent.

The patent is titled “Method and Apparatus for Reconstructing a Token from a Token Fragment.” The application that the patent was applied for is this: given an incoming word/string/token, find what word/string/token it matches closest. The incoming word/token may be misspelled or have lost some characters in transmission, so a pattern match must be made.

The core of the algorithm is a fast pattern match that determines the percent of alikeness between two tokens, the unreconstructed token and the vocabulary token. The algorithm was designed to be computationally fast. The algorithm appears not to be robust, but Alex has some run some tests to show that it produces good results. Other algorithms are more robust, but at the cost of being computationally slow.

Alex best described the algorithm is this way: both tokens are compared from the left (beginning/forward) and right (end/rear). A “vector” is calculated during both matches and a calculation is made to see if forward and rear “vectors” are pointing toward each other, as in the vocabulary token. The “vectors” will diverge if there are few matches. The results are then calculated given a tolerance. The tolerance defines how “tight” you want the algorithm to be. Tolerance usually goes up for longer strings.

The algorithm described in the patent can have many variations. It can be fast or slow, forward-biased, rearward-biased, or balanced. I picked the slow and balanced variation.

I ran some tests with this algorithm verses the Ratcliff/Obershelp Pattern Matching algorithm published in the first QHJ. Both algorithms have been written in SuperBasic. My tests showed that the Bocast algorithm was faster, less correct, and the R/O algorithm to be slower and more correct.

After I ran these tests, I provided Alex with the draft of my document and my source code. I sent him three pages of text and he returned six pages of comments.

Alex feels that is not valid to compare individual results from different algorithms, but better to compare the overall results of what token/word that they determine to be the correct one. It’s now how one determines the result as long as the result is correct. Alex does have a good point.

He prefers to describe it as: “The measure returned by Token Reconstruction, called the Reconstruction Index, is the relative likelihood that a vocabulary token V is a reconstruction of the unreconstructed token U, given a tolerance for error T.”

So, I will not reveal my results. Mostly based on the above and the fact that my code needed a few changes, and this would affect my initial results. Instead I’ll list some results that Alex has come up with in testing his algorithm with others.

Technique	521-Word	1142-Word	1872-Word
Token Reconstruction	80%	78%	75%
Minimum Edit Distance	64%	62%	60%
Simple Correlation Matrix
Dot Product	58%	54%	52%
Hamming Distance	69%	68%	67%
Cosine Distance	76%	75%	74%
SVD Correlation Matrix
Cosine Distance	81%	76%	74%

These results are testing published in Communications of the ACM May 1992. Personally, I’m not familiar with the various algorithms mentioned.

Here are some results testing Token Reconstruction verses the spelling checkers in some popular programs.

Position of Intended word in list of suggested Replacements.

	1	2	3	4	5	6+
Token Reconst.	78%	89%	94%	95%	95%	97%
Alis/Proximity	77%	85%	85%	87%	87%	89%
WordPerfect 5.0	64%	80%	82%	84%	85%	88%
Enable	72%	78%	80%	80%	80%	80%

Alex had a few comments about my initial results that do show some more insight into the algorithm. Alex took the examples in my test and provided these results:

U vs. V	f(U,V,T)	f(V,U,T)
pencilvania vs. pennsylvania	67.5% (T=2)	44.4% (T=3)
misisipi vs. mississippi	73.9% (T=1)	9% (T=4)
*abcdef vs. abcdef	60.6% (T=1)	89.7% (T=1)

U = unreconstructed token
V = vocabulary token
T = tolerance
The second column of results ( f(V,U,T) ) is the the same algorithm with the two tokens reversed.

One thing that I did discover about Token Reconstruction when implementing, is that it has two almost-worst cases and one worst case. But, in Alex’s testing, he showed that by reversing the strings ( switching U and V ) the worst case goes away. Alex has asked me not to reveal the worst case, as this is one way he can determine patent infringement. I’ll leave it up the the reader to figure it out. I must point out that the worst case is very unlikely to come up in real life. It only causes the algorithm problems if any proof is applied to it.

Now for the legal stuff

Mr. Bocast has assured me that anyone is free to use this patented algorithm for purely personal, non-commercial use without paying a license fee. Anyone who requests to use the algorithm for such non-commercial personal use will be gladly granted a royalty-free exploratory license by writing that request to Design Services Group, Inc., 1357 MacBeth Street, McLean, VA 22102 USA. Alex will provide mail/phone support to these licensed users — the algorithm is simple to code but requires care to configure correctly for optimum performance on any given problem.

However, all rights are reserved for all commercial applications. This means that software that contains this patented algorithm may not be distributed in any way to any person without a licensing agreement to cover such distribution. In other words, this algorithm is not in the public domain.

If your application is dependent on speed, then the Bocast algorithm may fit the bill. Just keep in mind the caveats that come with it. For those interested, I can make copies of the patent, or they can be ordered from the U.S. Patent Office for $1.50 each. The patent number is 5,008,818.

** bocast_ssb
**   Impementation of Bocast Pattern Matching Algorithm
**     described in U.S. Patent 5,008,818
**     Method and Apparatus for Reconstructing a Token From
**       a Token Fragment
**

INPUT "Enter Check String :";s$
INPUT "Enter Vocabulary String :";v$
INPUT "Enter Tolerance :";tolerance

** Optional Code to make the shorter
** string as s$
** IF LEN(s$) > LEN(v$) THEN
**    temp$ = s$
**    s$ = v$
**    v$ = temp$
** END IF

j = 1 : jlim = LEN(s$) : jlen = jlim
k = 1 : klim = LEN(v$) : klen = klim : lastk = 0
forward_index = 0
rear_index = 0

PRINT s$;" vs. ";v$

** Front-Emphasis IDA/FCR

REPEAT loop
   IF  (j <= jlim) AND (k <= klim) THEN
      IF s$(j) = v$(k) THEN
         j = j + 1
         lastk = k
      END IF
      k = k + 1
   ELSE
      EXIT loop
   END IF
END REPEAT loop

** Front-Emphasis RDA/LCR

REPEAT loop
   IF (j <= jlim) AND (lastk < klim) THEN
      IF s$(jlim) = v$(klim) THEN
         jlim = jlim - 1
      END IF
      klim = klim - 1
   ELSE
      EXIT loop
   END IF
END REPEAT loop

IF ( jlim < LEN(s$)) OR ( j > 1 ) THEN
   IF (jlim - j) < tolerance THEN
      jj = j + jlim
      kk = k + klim
      IF jj < kk THEN
         forward_index = jj/kk
      ELSE
         forward_index = kk/jj
      END IF
   END IF
END IF

j = 1 : jlim = LEN(s$)
k = 1 : klim = LEN(v$) : lastk = klen + 1

** Rear-Emphasis

REPEAT loop
   IF (jlim > 0) AND (klim > 0) THEN
      IF s$(jlim) = v$(klim) THEN
         jlim = jlim - 1
         lastk = klim
      END IF
      klim = klim - 1
   ELSE
      EXIT loop
   END IF
END REPEAT loop

** Rear-Emphasis RDA/LCR

REPEAT loop
   IF (j <= jlim) AND (k < lastk) THEN
      IF s$(j) = v$(k) THEN
         j = j + 1
      END IF
      k = k + 1
   ELSE
      EXIT loop
   END IF
END REPEAT loop

IF (jlim < LEN(s$)) OR ( j > 1) THEN
   IF (jlim - j) < tolerance THEN
      jj = jlen*2 - j - jlim + 2
      kk = klen*2 - k - klim + 2
      IF jj < kk THEN
         rear_index = jj/kk
      ELSE
         rear_index = kk/jj
      END IF
   END IF
END IF

answer = (forward_index + rear_index)/2
PRINT "Forward Index is ";forward_index
PRINT "Rear Index is    ";rear_index
PRINT "Result is        ";answer

Now for the legal stuff

Products

Downloadable Media

Related Articles

Image Gallery

Tags

People