Approximate String Matching

Authors

Publication

Pub Details

Date

Pages

See all articles from QL Hacker's Journal 18

In past issues, string matching has been a reaccuring theme. In the April 1994 issue of “C Users Journal” there was an article on string matching that uses a different algorithm for guessing how close two strings are; Frequency Distribution.

Frequency Distribution counts how many occurances of each letter appear in the word. For “test”, t=2, e=s=1. For “shasta”, s=a=2, h=t=1. To guess how close two words are the frequency distribution is compared. If a word has about the same frequency distribution, then it is assumed to be equal.

There lies the weakness of this algorithm. The author states that anagrams, like “BAT” and “TAB”, would appear to be the same, since they have the same frequency distribution, where they are not. I think this can be extended to all words that are spelled differently but have the same letters in them. A simple example would be “TEA” and “ATE”. They are not an anagram, but they have the same frequency distribution.

I would expect that only a small percentage of words would fail in this algorithm. This algorithm would be useful for a rough comparison. Adding another algorithm to check the cases that might cause this algorithm to fail would be useful.

#include <stdio_h>

int freq_match(mask, test)
char *mask;
char *test;
{
static int freq_count[256]; /* the frequency */
/* distribution */

int divergence;
int i;

/* freq_match("maskwork",""); */
if (test[0] == '¥0) {
/* initialize the distribution array */
for (i=0; i<256; i++)
freq_count[i++] = 0;

/* compute the distribution */
for ( i=0; mask[i] != '¥0'; i++)
freq_count[mask[i]] += 1;

/* return a zero for initialization */
return 0;
}

/* freq_match("don't care","testword"); */
else {
/* subtract the freq. dist. of the test word */
for (i=0; test[i] != '¥0'; i++)
freq_count[test[i]] -= 1;

/* compute the divergence */
for (divergence=0, i=0; i<256; i++)
divergence += abs(freq_count[i]);

/* this code is to reset the freq. dist. */
/* back to the settings for the mask word */
for ( i=0; test[i] != '¥0'; i++ )
freq_count[test[i]] += 1;

return divergence;
}
}

void main()
{
char mask[80], test[80];

printf("Mask:");
gets(mask);

printf("Test:");
gets(test);

freq_match(mask,"");
printf("The divergence is %d.¥n",
freq_match("",test));

}

Products

 

Downloadable Media

 

Image Gallery

Scroll to Top