Page 184 - Algorithms Notes for Professionals

P. 184

Chapter 40: Substring Search

Section 40.1: Introduction To Knuth-Morris-Pratt (KMP)
Algorithm

Suppose that we have a text and a pattern. We need to determine if the pattern exists in the text or not. For
example:

+-------+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------+---+---+---+---+---+---+---+---+
| Text | a | b | c | b | c | g | l | x |
+-------+---+---+---+---+---+---+---+---+

+---------+---+---+---+---+
| Index | 0 | 1 | 2 | 3 |
+---------+---+---+---+---+
| Pattern | b | c | g | l |
+---------+---+---+---+---+

This pattern does exist in the text. So our substring search should return 3, the index of the position from which this
pattern starts. So how does our brute force substring search procedure work?

What we usually do is: we start from the 0th index of the text and the 0th index of our *pattern and we compare
Text[0] with Pattern[0]. Since they are not a match, we go to the next index of our text and we compare Text[1]
with Pattern[0]. Since this is a match, we increment the index of our pattern and the index of the Text also. We
compare Text[2] with Pattern[1]. They are also a match. Following the same procedure stated before, we now
compare Text[3] with Pattern[2]. As they do not match, we start from the next position where we started ﬁnding
the match. That is index 2 of the Text. We compare Text[2] with Pattern[0]. They don't match. Then incrementing
index of the Text, we compare Text[3] with Pattern[0]. They match. Again Text[4] and Pattern[1] match, Text[5]
and Pattern[2] match and Text[6] and Pattern[3] match. Since we've reached the end of our Pattern, we now
return the index from which our match started, that is 3. If our pattern was: bcgll, that means if the pattern didn't
exist in our text, our search should return exception or -1 or any other predeﬁned value. We can clearly see that, in
the worst case, this algorithm would take O(mn) time where m is the length of the Text and n is the length of the
Pattern. How do we reduce this time complexity? This is where KMP Substring Search Algorithm comes into the
picture.

The Knuth-Morris-Pratt String Searching Algorithm or KMP Algorithm searches for occurrences of a "Pattern" within
a main "Text" by employing the observation that when a mismatch occurs, the word itself embodies suﬃcient
information to determine where the next match could begin, thus bypassing re-examination of previously matched
characters. The algorithm was conceived in 1970 by Donuld Knuth and Vaughan Pratt and independently by James
H. Morris. The trio published it jointly in 1977.

Let's extend our example Text and Pattern for better understanding:

+-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| Index |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |10|11|12|13|14|15|16|17|18|19|20|21|22|
+-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| Text |a |b |c |x |a |b |c |d |a |b |x |a |b |c |d |a |b |c |d |a |b |c |y |
+-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
+---------+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

colegiohispanomexicano.net – Algorithms Notes 180

179 180 181 182 183 184 185 186 187 188 189