## KMP Prefix Table Notes

### The Tricky Part Explained

A very popular implementation of the KMP(Knuth–Morris–Pratt) pattern searching algorithm is as follows:

```
// KMP pattern matching.
// Returns the index of the first occurrence of `pattern` in `data`.
// Returns (size_t)-1 if not found.
// The caller is responsible for allocating enough memory for lps_table.
template <typename T>
size_t KMPSearch(T* data, size_t data_size, T* pattern, size_t* lps_table, size_t pattern_size) {
if (data_size < pattern_size || !pattern_size) return (size_t)-1;
if (pattern_size == 1) {
// just do a naive search
for (size_t index = 0; index < data_size; index++) {
if (*(data + index) == *pattern) {
return index;
}
}
return (size_t)-1;
}
// first build the lps table
lps_table[0] = 0;
size_t i = 0, j = 1;
while (j < pattern_size) {
if (pattern[i] == pattern[j]) {
lps_table[j++] = ++i;
}
else {
if (i) {
i = lps_table[i - 1]; // <---- why can and should we do this?
}
else {
lps_table[j] = 0;
j++;
}
}
}
// now that the lps table has been built
// begin the search
i = 0;
j = 0;
while (i < data_size) {
if (data[i] == pattern[j]) {
i++;
j++;
if (j == pattern_size) {
return i - pattern_size; // for single-target searches
// emit (i-pattern_size); j = lps_table[j-1] // for multiple-target searches
}
}
else {
if (j) {
j = lps_table[j - 1];
}
else {
i++;
}
}
}
return (size_t)-1;
}
```

One tricky part of this implementation is when building the lps table, we’ll update `i`

as follows:

```
i = lps_table[i - 1]; // <---- why can and should we do this?
```

Here are some of my own thoughts.

When `pattern[i] != pattern[j]`

, we know that:

A…B are inclusive boundaries.

Next, we want to 1) start \(i_{next}\) after the **second longest prefix suffix** of \(Pattern_{0...j-1}\), which is shorter than \(Pattern_{0...i-1}\) (being the current longest), so that next time when the comparison `pattern[i] == pattern[j]`

succeeds, 2) we are able to get the longest prefix suffix for \(Pattern_{0...j}\), which is:

Now, questions are:

1) How do we find the longest

prefix suffixshorter than \(Pattern_{0...i-1}\)?

We just refer to `lps[i-1]`

, it stores the length of the longest **proper** prefix suffix of \(Pattern_{0...i-1}\), which is shorter than \(Pattern_{0...i-1}\) itself.

2) By setting

`i`

to`lps[i-1]`

, how can we know that when`pattern[i] == pattern[j]`

succeeds, we get another longest prefix suffix?

According to \((2)\), the longest **prefix** suffix of \(Pattern_{0...i-1}\) will also be \(Pattern_{j-1-(i-1)...j-1}\)’s longest prefix **suffix**, thus at the same time being the **second** longest prefix suffix for \(Pattern_{0...j-1}\) (Why?). Therefore, when a new comparison `pattern[i] == pattern[j]`

succeeds, we get the longest prefix suffix for \(Pattern_{0...j}\).

### Why?

We can prove this by contradiction. Assuming `B`

is the longest prefix suffix of \(Pattern_{0...i-1}\) `A`

and `B`

is not the second longest prefix suffix of \(Pattern_{0...j-1}\), then there exists a second longest prefix suffix of \(Pattern_{0...j-1}\) `C`

, which is longer than `B`

and shorter than `A`

. `C`

is both a prefix of `A`

and a suffix of \(Pattern_{j-1-(i-1)...j-1}\), and according to \((2)\), `C`

is also a suffix of `A`

, meaning `C`

is a prefix suffix of `A`

that’s longer than `B`

, which contradicts the assumption that `B`

is the longest prefix suffix of `A`

. Q.E.D.