String Matching Algorithms

STRING MATCHING
Aditya Pratap Singh

215/CO/15
Netaji Subhas Institute Of Technology
CONTENTS
● Introduction
● String Matching
● Basic Classification
● Naive Algorithm
● Rabin-Karp Algorithm
○ String Hashing
○ Hash value for substrings
● Knuth-Morris-Pratt Algorithm
○ Prefix Function
○ KMP Matcher
● Summary
INTRODUCTION
● String matching algorithms are an important class of string

algorithms that tries to find one or many indices where one
or several strings(or patterns) are found in the larger string(or
text)
● Why do we need string matching?

String matching is used in various applications like spell
checkers, spam filters, search engines, plagiarism detectors,
bioinformatics and DNA sequencing etc.
STRING MATCHING
● To find all occurrences of a pattern in a given text

● Formally, given a pattern P[1..m] and a text T[1..n], find all
occurrences of P in T. Both P and T belongs to Σ*
● P occurs with shift s(beginning at s+1): P[1] = T[s+1], P[2] =
T[s+2],…, P[m] = T[s+m]
● If so, s is called a valid shift, otherwise an invalid shift
● Note: one occurrence can start within another one ie.
overlapping is allowed. eg P=abab T=abcabababbc, P occurs
at s=3 and s=5.
*text is the string that we are searching

*pattern is the string that we are searching for
*Shift is an offset into a string
BASIC CLASSIFICATION
1. Naive Algorithm - The naive approach is accomplished by

performing a brute-force comparison of each character in the
pattern at each possible placement of the pattern in the
string. It is O(mn) in the worst case scenario
2. Rabin-Karp Algorithm - It compares the string’s hash values,

rather than string themselves. Performs well in practice and
generalized to other algorithm for related problems such as
2D-string matching
3. Knuth-Morris-Pratt Algorithm - It is improved on brute-force

algorithm and is capable of running O(m+n) in the worst
case. It improves the running time by taking advantage of
prefix function
NAIVE ALGORITHM
One of the most obvious approach towards the string matching

problem would be to compare the first element of the pattern to
be searched ‘p’, with the first element of the string ‘s’ in which to
locate ‘p’.
If the first element of ‘p’ matches the first element of ‘s’ ,

compare the second element and so on. If match found proceed
likewise until entire ‘p’ is found. If a mismatch is found at any
position , shift index to one position to the right and continue
comparison
This approach is easy to understand and implement but it can be

too slow in some cases.
In worst case it may take (m*n) iterations to complete the task.
PSEUDOCODE
function naive(text[], pattern[]){

for(i = 0; i < n; i++) {
for(j = 0; j < m && i + j < n; j++) {
if(text[i + j] != pattern[j]) break; // mismatch found
if(j == m) // match found
}
}
}
ILLUSTRATION
String S = a b c a b a a b c a b a c
Pattern P = a b a a
Step 1: Compare P[1] with S[1]

abcabaabcabac
abaa

abcabaabcabac
abaa
ILLUSTRATION

abcabaabcabac
abaa
Since mismatch is detected, shift ‘p’ one position to the left and
perform steps analogous to those from step 1 to step 3. At
position where mismatch is detected, shift ‘p’ one position to
right and repeat matching procedure.
ILLUSTRATION
Finally, a match is found after shifting ‘p’ three times to the right
side.
abcabaabcabac
abaa
Drawbacks : If ‘m’ is the length of pattern P and ‘n’ is the length

of text T, then the matching time is O(n*m), which is certainly a
very slow running time
RABIN-KARP ALGORITHM
This is actually the naive approach augmented with a powerful

programming technique - hash function
Algorithm :
1. Calculate the hash for the pattern P
2. Calculate the hash values for all the prefixes of the text T.
3. Now, we can compare a substring of length |s| in constant
time using the calculated hashes.
This algorithm was authored by Michael Rabin and Richard Karp

in 1987.
STRING HASHING
Problem - Given a string S of length n = |S| . Calculate the hash

value of S
Solution -
where p and m are suitably chosen prime numbers.

CHOICE OF PARAMETERS
‘p’ should be taken roughly equal to the number of characters in

the input alphabet. If input is composed of only lowercase
characters of English alphabet, p=31 is a good choice. If the
input may contain both uppercase and lowercase letters, then
p=53 is a good choice.
‘m’ should be a large prime. A popular choice is m = 10^9+7

This is a large number but still small enough so that we can
perform multiplication of two values using 64 bit integers.
HASH CALCULATION OF SUBSTRINGS OF GIVEN STRING
Problem : Given string S and indices i and j . Find the hash value
of S[i..j]
Solution :
By definition we have,
Multiplying by pi gives,
So by knowing the hash value of each prefix of string S, we can

compute the hash of any substring in constant O(1) time.
PSEUDOCODE
vector<int> rabin_karp(string const& pat, string const& text) {
const int p = 31, m = 1e9 + 9;
int S = pat.size(), T = text.size();
vector<long long> p_pow(max(S, T));

p_pow[0] = 1;
for (int i = 1; i < (int)p_pow.size(); i++)
p_pow[i] = (p_pow[i-1] * p) % m;
vector<long long> h(T + 1, 0);

for (int i = 0; i < T; i++)
h[i+1] = (h[i] + (text[i] - 'a' + 1) * p_pow[i]) % m;
long long h_s = 0;
for (int i = 0; i < S; i++)
h_s = (h_s + (pat[i] - 'a' + 1) * p_pow[i]) % m;
vector<int> occurrences;
for (int i = 0; i + S - 1 < T; i++) {
long long cur_h = (h[i+S] + m - h[i]) % m;
if (cur_h == h_s * p_pow[i] % m)
occurrences.push_back(i);
}
return occurrences;
}
KNUTH-MORRIS-PRATT ALGORITHM
Knuth, Morris and Pratt proposed a linear time algorithm for the
string matching problem.
A matching time of O(n) is achieved by avoiding comparisons with

elements of ‘S’ that have previously been involved in comparison
with some element of the pattern ‘p’ to be matched ie.
backtracking on the string ‘S’ never occurs.
KMP makes use of ‘prefix function’

PREFIX FUNCTION
The prefix function of a string is defined as an array Ⲡ of length n,

where Ⲡ[i] is the length of the longest proper prefix of the
substring s[0..i] which is also a suffix of this substring.
A proper prefix of a string is a prefix that is not equal to the string
itself. So by definition Ⲡ[0] = 0
Mathematically,
EXAMPLE
S = “aabaaab”
PREFIX Ⲡ[i]
a a 0
aa aa 1
aab aab 0
aaba aaba 1
aabaa aabaa 2
aabaaa aabaaa 2
aabaaab aabaaab 3
ALGORITHM TO COMPUTE PREFIX FUNCTION
● We compute the prefix values Ⲡ[i] in a loop iterating from i=1

to i=n-1 (Ⲡ[0] just gets assigned with 0)
● To calculate the current value Ⲡ[i] we set the variable j

denoting the length of the best suffix for ‘i-1’ . Initially j = Ⲡ[i-1]
● Test if the suffix of length ‘j+1’ is also a prefix by comparing s[j]

and s[i]. If they are equal then we assign Ⲡ[i] = j+1 . Otherwise,
we reduce j to Ⲡ[j-1] and repeat this step.
● If we have reached the length j=0 and still don’t have the
match, then we assign Ⲡ[i] = 0 and go to the next index ‘i+1’
PSEUDOCODE
vector<int> prefix_function(string s){

int n = (int)s.length();
vector<int> pi(n);
for(int i=1;i<n;i++){
int j = pi[i-1];
while(j>0 and s[i]!=s[j]) j = pi[j-1];
if(s[i] == s[j]) ++j;
pi[i] = j;
}
return pi;
}
Runtime - O(n)
KMP MATCHER
● This is a classical application of prefix function, which we just

learned
● Given text T and string S, we need to find all occurrences of S
in T
● Denote with n the length of the string S and with m the length
of the string T ie. n = |S| and m = |T|
● Generate a string S + # + T , where # is a separator that neither
appears in S nor T . Now calculate the prefix function of this
string
● By definition, Ⲡ[i] in this string corresponds to the largest block
that coincides with S and ends at position ‘i’ .
● Note: Ⲡ[i] can not be larger than ‘n’ because of the separator #
that we used
● If Ⲡ[i] == n, then we can say that string S appears completely at
this position.
EXAMPLE
S = “aba”
T = “aababac”
Generated string(G) = “aba#aababac”
Index (i) PREFIX Ⲡ[i]
4 a 1
5 aa 1
6 aab 2
7 aaba 3
8 aabab 2
9 aababa 3
10 aababac 0
Ⲡ[i] = n(=3) at positions i = 7 and 9 of G , which means at indices

i = 1 and i=3 in the Text , there is occurrence of the pattern(S)
PSEUDOCODE
vector<int> kmp(string pattern,string text){

string str = pattern + "#" + text;
int n = pattern.length(), m = str.length();
vector<int> pi = prefix_function(str);
vector<int> ret;
for(int i=n+1;i<m;i++) {
if(pi[i] == n) ret.pb(i-2*n);
}
return ret;
}
Runtime: O(n+m)
SUMMARY
Algorithm Time Complexity Key Ideas Approach
Brute Force (Naive) O(m*n) Searching with all Linear Searching

alphabets
Rabin-Karp Θ(m+n) Compare the text Hashing Based

and patterns using
their hash functions
Knuth-Morris-Pratt O(m+n) Constructs an Heuristic Based

automaton from the
pattern
n = |pattern| , length of pattern

m = |text| , length of text
THANK YOU

String Matching Algorithms

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

String Matching Algorithms

Caricato da

Copyright:

Formati disponibili

STRING MATCHING

Aditya Pratap Singh

● String matching algorithms are an important class of string

● Why do we need string matching?

● To find all occurrences of a pattern in a given text

*text is the string that we are searching

1. Naive Algorithm - The naive approach is accomplished by

2. Rabin-Karp Algorithm - It compares the string’s hash values,

3. Knuth-Morris-Pratt Algorithm - It is improved on brute-force

One of the most obvious approach towards the string matching

If the first element of ‘p’ matches the first element of ‘s’ ,

This approach is easy to understand and implement but it can be

function naive(text[], pattern[]){

Step 1: Compare P[1] with S[1]

Step 2: Compare P[2] with S[2]

Step 3: Compare P[3] with S[3]

Drawbacks : If ‘m’ is the length of pattern P and ‘n’ is the length

This is actually the naive approach augmented with a powerful

This algorithm was authored by Michael Rabin and Richard Karp

Problem - Given a string S of length n = |S| . Calculate the hash

where p and m are suitably chosen prime numbers.

‘p’ should be taken roughly equal to the number of characters in

‘m’ should be a large prime. A popular choice is m = 10^9+7

So by knowing the hash value of each prefix of string S, we can

vector<long long> p_pow(max(S, T));

vector<long long> h(T + 1, 0);

A matching time of O(n) is achieved by avoiding comparisons with

KMP makes use of ‘prefix function’

The prefix function of a string is defined as an array Ⲡ of length n,

● We compute the prefix values Ⲡ[i] in a loop iterating from i=1

● To calculate the current value Ⲡ[i] we set the variable j

● Test if the suffix of length ‘j+1’ is also a prefix by comparing s[j]

vector<int> prefix_function(string s){

● This is a classical application of prefix function, which we just

Ⲡ[i] = n(=3) at positions i = 7 and 9 of G , which means at indices

vector<int> kmp(string pattern,string text){

Algorithm Time Complexity Key Ideas Approach

Brute Force (Naive) O(m*n) Searching with all Linear Searching

Rabin-Karp Θ(m+n) Compare the text Hashing Based

Knuth-Morris-Pratt O(m+n) Constructs an Heuristic Based

n = |pattern| , length of pattern

Potrebbero piacerti anche