Recently, I was thinking of building an issue tracker. One of the features I thought would be cool to have, was to see previous issues related to the current issue (it may possibly give us ideas for the fix, or it may show us the cause of the present problem). I was thinking of implementing this in a search type method (search against the previous issues).
Doing exact word matches is quite easy in any database. I wanted this to be a little more flexible, being able to pick up words that may be related/mispelled etc. As a starting point, one algorithm to be talked about is the Levenshtien distance. What this algorithm does is it computes the minimum number of insertions, substitutions, and deletions required to convert one string to another. For e.g. mispelling "String" with "tring", would cause the algorithm to give a distance of 1, as you would expect (add 'S' to get string 1 from string 2).
This algorithm may not do as good a job with getting related words, since the different forms of words (like tense, tension, tensed etc) would get scores based on the number of letters different from the base word. There may be other words like "hence" (for the case above) which would get a lower distance score (hence would be determined a closer match). Other methods or implementation changes on the algorithm could help solve this.
But moving on, I will describe the algorithm here. Lets say you have 2 strings A and B. Let string A be TALLY and let string B be TOLL. Lets take a guess on the minimum number of changes needed - I would guess 2 - replace the the O with the A, and insert an Y in the end, to transform TOLL into TALLY. Now I will walk through the algorithm:
We will create a 6 x 5 matrix D - 6 is the (length of String A + 1) and the 5 is the (length of String B + 1).
We will populate row 0, col i with value i; and col 0, row j with value j.
for i = 0 to length(A) // Note size of Array is 1 + Length(A)
D[i,0] = i;
for j = 1 to length(B) // Note size of Array is 1 + Length(B)
D[0,j] = j;
After this we will parse through the matrix using nested for loops (lines 1 and 2) and determine the values for each spot,
1 for i = 1 to length(A) {
2 for j = 1 to length(B) {
3 if A[i-1] = B[j-1] then cost = 0; Else cost = 1;
4 D[i,j] = min(D[i-1,j]+1,D[i-1,j-1]+cost,D[i,j-1]+1);
5 }
6 }
7 return D[i,j]; // This is the minimum number of steps neededThe cost is computed in line 3. As may be seen, the cost is 0 when letter j is the same as letter i. In our example, one case where this would occur is, for the position [1,1], since the first letter of A ('t' ) matches the first letter of B ('t' ).
D[i-1,j]+1 represents a deletion operation.
D[i-1,j-1]+cost represents the substitution operation (if the letters match the cost of substitution is 0 - see line 3).
D[i,j-1]+1 represents the insertion operation.
From my interpretation, the value at D[x,y] represents how many operations are needed for transforming the first x characters of String A into the first y characters of String B. Let me illustrate this with the following:
T O L L
0 1 2 3 4
T 1 0
A 2
L 3
L 4
Y 5
At location 1,1 we can get from the T in TOLL to the T in TALLY doing 0 changes. Moving on,
T O L L
0 1 2 3 4
T 1 0
A 2 1
L 3
L 4
Y 5
At location 2,1 we can get from the TA in TALLY to T in TOLL doing 1 change, that is leaving the first T as is, and then deleting the A. As you may observe, D[i-1,j]+1 represents deletion.
Similarly D[i,j-1]+1 represents insertion.
When calculating a new D[i,j] we want to pick the optimal solution at each point, this way we will have an overall optimal solution. Doing a quick fill-up of the table above,
T O L L
0 1 2 3 4
T 1 0 1 2 3
A 2 1 1 2 3
L 3 2 2 1 1
L 4 3 3 2 1
Y 5 4 4 3 2 <-- Number of changes needed to change all of TALLY into TOLL
The lower right corner represents the value of transforming whole of String A into String B (consistent with our interpretation from before). From this we can determine that the minimum number of changes needed is 2, which confirms what we had guessed at the start of this post!
[ add comment ] | [ 0 trackbacks ] | permalink |




( 3 / 103 )I was looking into the TEA (Tiny Encryption Algorithm) article in Wikipedia. It mentioned about Fiestel Ciphers, which I have found out is a type of block cipher. Here I will share my findings:
Source: Block Ciphers
Block Ciphers
Block Cipher means that the algorithm encrypts block by block. A block is some specified length of bits. This number is usually 48, 56, 64,80,128,256 etc.... A block cipher produces the same cipher text for the same block of plain text when using the same key (this is not the case with stream ciphers, that is why this note is here!). Also, since block ciphers have a 1 to 1 mapping, the decryption algorithm is the inverse of the encryption algorithm, i.e. E(x) = y; E^(-1)(y) = x; given some key k used;
There are 3 variations of the block cipher. They are:
Iteration Block Ciphers: Apply the same algorithm over and over again on the resulting cipher text, thus encrypting it further and further. This is what the usually talked number of rounds is about. In each round, a new key may be chosen or the same key may be used.
Fiestel Ciphers: The plain text is split into two halves. During each round, the the right half and the key are passed into the encryption algorithm. This is then XORed with the left half. The two halves are then swapped. The same process is repeated.
Cipher Block Chaining: The cipher text is not only a function of the key and the encryption algorithm but also of the blocks previously encoded. In this method, a block of cipher text is XORed with the next block of plain text that is to be encoded. The first block is the initialization vector, which is a block of dummy text. Overall, this means that to decrypt some block of the plain text, the prior block has to be decrypted as well. This carries the danger that if some block is lost, then all subsequent information is lost as well!
The following images are from Wikipedia, they do an excellent job in illustrating Cipher Block Chaining


[ add comment ] | [ 0 trackbacks ] | permalink |




( 2.8 / 64 )Missed a week, but moving on..
I have had interest in cryptography, and have lightly scratched the top of its surface many times in the past. So, I thought I'd be a bit more focussed and do something this time (the inspiration to do this again, probably came from reading Digital Fortress - note, I do realize it does not describe the most sophisticated/complex theories, never the less, it was entertaining
Source for the info below: An Introduction to Cryptography
The Basics
As most would realize encryption is the process of "coding" some plain text message into some unreadable 'cipher text'. This cipher text is the encrypted message, this is then decrypted back to get the plan text message.
Symmetric Cipher
It means that the "key" used for encryption is the same as the key used for decryption
Asymmetric Cryptography (Public-key Cryptography)
It means the key for encryption and decryption are not the same. In public key cryptography, you publish your encryption key to the public. Anyone can use this information to send encrypted messaged to you. But you hold the private key which you would use to decrypt the message.
The good thing about this setup is the need to share the key between the sender and receiver is eliminated. All communication involve the public key, and no private key is ever transmitted or shared - I guess what this means is there is no longer a need to ensure a secure medium of transmission, any public infrastructure would satisfy.
As you may realize, keys are what create a specific cipher text for a given plain text. Generally, the size of the key is a good indicator of the security of the text. Of course, it also depends on the algorithm (more so).
One interesting thing is, public key cryptography key size and conventional cryptography key size are unrelated. A 80 bit conventional key has the equivalent strength of a 1024 bit public key
With that I'll end this session of AONTW!
Wish everyone a happy work week
[ add comment ] | [ 0 trackbacks ] | permalink |




( 2.9 / 52 )Lets see how long I keep this up...
I think I'll post something new I learned every week (mostly weekends). Here goes the first. This is about Video Encoding. Here are somethings I learned in the process of research.
Sources:
Quantization (Image Processing)
HOWTO Mencoder Introduction Guide
Webopedia
Video codecs and formats are not the same thing. MPEG 4 for e.g. is a video format, Xvid is a codec. Codecs create the actual videos.
Then there are multimedia containers. The container is what will contain the encoded video and audio. You can put anything into the container format (as long as it supports it - e.g. video and audio). One e.g. of a container format is AVI.
Some programs to do video encoding:
- VirtualDub (gui + commandline)
- mencoder (commandline)
Some terminology/concepts useful when using these programs:
- Quantization: a lossy compression technique achieved by compressing a range of values to a single quantum value.
- I-frame: also known as the Key frame, these frames contain information frame information without reference to any other frames (think of it as 1 snapshot in a movie; this will make more sense as you read P-frame and B-frame(s) below). Hence I-frames take the most bits to store, but improve the video quality
- P-frame: P-frames follow I-frames and contain information that has changed since that I-frame (such as color information and content change). Hence, they depend on the I-frame to fill in their data. P-frames are also aptly called delta-frames.
- B-frame: B-frames or bi-directional predictive frames rely on the frames preceeding and following them. They contain data of what has changed between the 2 frames.
- GOP: stands for Group of Pictures
The [I/P/B]-frame quantization values range between 1-31. The higher the number, the more compression (hence more loss of information --> smaller file size --> lower quality)
Bitrate is how much bits per second to store the data (higher means more bits are used to store per second of data, meaning more information stored).
Based on the above 2 factors (at least at a basic level), the quality and file size of videos can be controlled. The challenge is to find the right values that would optimize the quality to fit your space needs (possibly time needs as well, i.e. how long you have to encode)
Of course, one thing to be remembered is the encode's quality would depend on the source's quality. It wouldn't matter if you had quantization values of 1 and bitrate of 2000Kbps if the source is very poor)
Also, there are other options you can play with that would have an effect in the quality and time taken like motion detection etc..but think of this as a basic starting point
[ add comment ] | [ 0 trackbacks ] | permalink |




( 2.8 / 24 )Time to upgrade
[ add comment ] | [ 0 trackbacks ] | permalink |




( 3 / 66 )
Calendar



