This blog post is about an amazing pattern matching algorithm which are going to help us reduce time complexity of brute force O(n*m) solution to O(n+m)!
Knuth Morris Pratt’s (KMP) Algorithm
Before writing anything complex let me first tell you where kmp algorithm makes optimization :
- Remember when you coded the brute force solution and you were so happy that 5 of the characters have matched but suddenly 6th character of pattern is different from the n’th character of the string? You were like aww man we are going back to 0’th index of pattern and (n – 4)’th character of string. Well We are not doing that with KMP. We never go back but only forward!
string = "kabcdesdk"
pattern = "kabcdh"
In brute force solution, after matching characters from ‘k’ to ‘d’ we will find a mismatch between ‘e’ and ‘h’. So we will go back to index 1 of string and index 0 of pattern and start matching again.
Now that I have told you the answer of ‘where?’, it’s time to figure out ‘how?’.
To be able to never go back, there must be a crazy magic spell that can let us know how many characters have already matched behind our current index,right? Suppose,
txt = 'abababf'
pattern = 'ababf'
In the above example, We find our first mismatch between ‘a’ (at index 4 in txt) and ‘f'(at index 4 in pattern). Now if we aim to stay at index 4 in txt and not go back, we somehow must already know that characters from index 2 to 3 in txt have already matched to the characters from index 0 to 1 in pattern because solution is obviously from index 2 to 6 in txt. So how are we going to do that? Let’s find out!
Now things get pretty interesting here, And you would be amazed to find out how this method works (at least I was when I did). The magic spell I talked about consists preprocessing the pattern and finding out the count of characters which are same in both suffix and prefix of every substring that starts from index 0 and ends at index k ( k < length of pattern) in pattern.
Too much information at-once? Let’s break it down..
[Notice we are only focusing on pattern right now and also we have taken our first steps to the solution (finally :D)]
We have built this count array but how would it be useful to our solution? To answer this question let’s consider a scenario where a random txt string has already matched ‘aba’ with our pattern but gets mismatched at next character, i.e. the fourth character of our pattern is ‘b’ but txt has anything else but ‘b’, i.e.ception
txt = ‘pqrsabad…….’
Suppose we are at character ‘d’ of txt in above example, Now that this ‘d’ has mismatched to ‘b’ of our pattern, all we want to know is how many characters have already matched before ‘d’ in txt without iteration backwards so that we can continue our search from ‘d’ again.
You know it’s like life where you come to realize oh, this thing didn’t work out but you don’t actually wanna think about it too much as it hurts but still want to take the lesson from it before moving on to next thing so that you won’t repeat same stuff again.
Anyway, where were we? yeah ‘d’. Damn this ‘d’.
So ‘d’ in txt didn’t match the ‘b’ from our pattern. Now focus on this point, we want to directly jump to THAT index in our pattern where all the previous characters before that index have matched to the maximum number of characters before the current character in txt (in this case ‘d’).
Notice that until now I have emphasized on not going backwards in txt to know the matching characters and now I am telling to jump directly to the ‘right’ index in pattern, These two are basically the same thing as if you are staying at certain index at txt and continue from there, you would want to start search again from the right index in pattern.
How do we make this jump? This is where our count array shines. Whenever we will find a mismatch between pattern character and txt character, we will get the jump index by count[indexOfMismatchCharacterInPattern – 1] and we will keep jumping back until we have found either index 0 or a matching character on jump index in pattern with the current character in txt . All these jump indexes will always take us to a point where all the characters before jump index in pattern are matching to certain characters before current index in txt, all we have to do is check if current character in txt matches with the jump index character in pattern before moving on. You can check for this fact yourself by making few test cases.
Now with our found jump index, we will start searching again one by one every character in txt starting from current position and pattern starting from our jump index.
The last thing that we have to do is make a halt, this is simple, whenever you reach at the end of pattern (i.e. all characters matched) just subtract the length of the pattern from the current index of txt and you will find the starting and ending point of occurrence.
I encourage you to write your own code and then match it from mine or any other resource from the internet. I am writing a quick untested code, If you find any bug please let me know in comments.
Code to compute count array is actually simple. We are just applying kmp search to the pattern itself. Two simple steps:
- Initialize maximum matching prefix index to 0.
- Iterate through the pattern:
- If the current iterating character matches to the character on maximum matching prefix index, just add 1 to maximum matching prefix index and save this value to count array at current iterating index.
- If not matches the find count[currentIteratingIndex – 1] and keep jumping back until you find index 0 or a matching character with currentIteratingIndex on jumping index. This is already discussed before when matching txt with pattern.
Code to match pattern and txt. Very similar to before, now just apply kmp search between pattern and txt.
(I don’t actually know how to write code in wordpress so I am just taking screenshots and uploading them)
I hope I made sense, If you have any doubt or find any correction please be kind to tell me in comments.