Page 184 - Algorithms Notes for Professionals
P. 184

Chapter 40: Substring Search



       Section 40.1: Introduction To Knuth-Morris-Pratt (KMP)
       Algorithm


       Suppose that we have a text and a pattern. We need to determine if the pattern exists in the text or not. For
       example:



       +-------+---+---+---+---+---+---+---+---+
       | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
       +-------+---+---+---+---+---+---+---+---+
       |  Text | a | b | c | b | c | g | l | x |
       +-------+---+---+---+---+---+---+---+---+

       +---------+---+---+---+---+
       | Index   | 0 | 1 | 2 | 3 |
       +---------+---+---+---+---+
       | Pattern | b | c | g | l |
       +---------+---+---+---+---+


       This pattern does exist in the text. So our substring search should return 3, the index of the position from which this
       pattern starts. So how does our brute force substring search procedure work?


       What we usually do is: we start from the 0th index of the text and the 0th index of our *pattern and we compare
       Text[0] with Pattern[0]. Since they are not a match, we go to the next index of our text and we compare Text[1]
       with Pattern[0]. Since this is a match, we increment the index of our pattern and the index of the Text also. We
       compare Text[2] with Pattern[1]. They are also a match. Following the same procedure stated before, we now
       compare Text[3] with Pattern[2]. As they do not match, we start from the next position where we started finding
       the match. That is index 2 of the Text. We compare Text[2] with Pattern[0]. They don't match. Then incrementing
       index of the Text, we compare Text[3] with Pattern[0]. They match. Again Text[4] and Pattern[1] match, Text[5]
       and Pattern[2] match and Text[6] and Pattern[3] match. Since we've reached the end of our Pattern, we now
       return the index from which our match started, that is 3. If our pattern was: bcgll, that means if the pattern didn't
       exist in our text, our search should return exception or -1 or any other predefined value. We can clearly see that, in
       the worst case, this algorithm would take O(mn) time where m is the length of the Text and n is the length of the
       Pattern. How do we reduce this time complexity? This is where KMP Substring Search Algorithm comes into the
       picture.

       The Knuth-Morris-Pratt String Searching Algorithm or KMP Algorithm searches for occurrences of a "Pattern" within
       a main "Text" by employing the observation that when a mismatch occurs, the word itself embodies sufficient
       information to determine where the next match could begin, thus bypassing re-examination of previously matched
       characters. The algorithm was conceived in 1970 by Donuld Knuth and Vaughan Pratt and independently by James
       H. Morris. The trio published it jointly in 1977.

       Let's extend our example Text and Pattern for better understanding:



       +-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
       | Index |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |10|11|12|13|14|15|16|17|18|19|20|21|22|
       +-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
       |  Text |a |b |c |x |a |b |c |d |a |b |x |a |b |c |d |a |b |c |d |a |b |c |y |
       +-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
       +---------+---+---+---+---+---+---+---+---+
       |  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

       colegiohispanomexicano.net – Algorithms Notes                                                           180
   179   180   181   182   183   184   185   186   187   188   189