Recently, I was thinking of building an issue tracker. One of the features I thought would be cool to have, was to see previous issues related to the current issue (it may possibly give us ideas for the fix, or it may show us the cause of the present problem). I was thinking of implementing this in a search type method (search against the previous issues).
Doing exact word matches is quite easy in any database. I wanted this to be a little more flexible, being able to pick up words that may be related/mispelled etc. As a starting point, one algorithm to be talked about is the Levenshtien distance. What this algorithm does is it computes the minimum number of insertions, substitutions, and deletions required to convert one string to another. For e.g. mispelling "String" with "tring", would cause the algorithm to give a distance of 1, as you would expect (add 'S' to get string 1 from string 2).
This algorithm may not do as good a job with getting related words, since the different forms of words (like tense, tension, tensed etc) would get scores based on the number of letters different from the base word. There may be other words like "hence" (for the case above) which would get a lower distance score (hence would be determined a closer match). Other methods or implementation changes on the algorithm could help solve this.
But moving on, I will describe the algorithm here. Lets say you have 2 strings A and B. Let string A be TALLY and let string B be TOLL. Lets take a guess on the minimum number of changes needed - I would guess 2 - replace the the O with the A, and insert an Y in the end, to transform TOLL into TALLY. Now I will walk through the algorithm:
We will create a 6 x 5 matrix D - 6 is the (length of String A + 1) and the 5 is the (length of String B + 1).
We will populate row 0, col i with value i; and col 0, row j with value j.
for i = 0 to length(A) // Note size of Array is 1 + Length(A)
D[i,0] = i;
for j = 1 to length(B) // Note size of Array is 1 + Length(B)
D[0,j] = j;
After this we will parse through the matrix using nested for loops (lines 1 and 2) and determine the values for each spot,
1 for i = 1 to length(A) {
2 for j = 1 to length(B) {
3 if A[i-1] = B[j-1] then cost = 0; Else cost = 1;
4 D[i,j] = min(D[i-1,j]+1,D[i-1,j-1]+cost,D[i,j-1]+1);
5 }
6 }
7 return D[i,j]; // This is the minimum number of steps neededThe cost is computed in line 3. As may be seen, the cost is 0 when letter j is the same as letter i. In our example, one case where this would occur is, for the position [1,1], since the first letter of A ('t' ) matches the first letter of B ('t' ).
D[i-1,j]+1 represents a deletion operation.
D[i-1,j-1]+cost represents the substitution operation (if the letters match the cost of substitution is 0 - see line 3).
D[i,j-1]+1 represents the insertion operation.
From my interpretation, the value at D[x,y] represents how many operations are needed for transforming the first x characters of String A into the first y characters of String B. Let me illustrate this with the following:
T O L L
0 1 2 3 4
T 1 0
A 2
L 3
L 4
Y 5
At location 1,1 we can get from the T in TOLL to the T in TALLY doing 0 changes. Moving on,
T O L L
0 1 2 3 4
T 1 0
A 2 1
L 3
L 4
Y 5
At location 2,1 we can get from the TA in TALLY to T in TOLL doing 1 change, that is leaving the first T as is, and then deleting the A. As you may observe, D[i-1,j]+1 represents deletion.
Similarly D[i,j-1]+1 represents insertion.
When calculating a new D[i,j] we want to pick the optimal solution at each point, this way we will have an overall optimal solution. Doing a quick fill-up of the table above,
T O L L
0 1 2 3 4
T 1 0 1 2 3
A 2 1 1 2 3
L 3 2 2 1 1
L 4 3 3 2 1
Y 5 4 4 3 2 <-- Number of changes needed to change all of TALLY into TOLL
The lower right corner represents the value of transforming whole of String A into String B (consistent with our interpretation from before). From this we can determine that the minimum number of changes needed is 2, which confirms what we had guessed at the start of this post!
[ add comment ] | [ 0 trackbacks ] | permalink |




( 3 / 103 )
Calendar



