KMP Prefix Table Notes

The Tricky Part Explained

A very popular implementation of the KMP(Knuth–Morris–Pratt) pattern searching algorithm is as follows:

// KMP pattern matching.
// Returns the index of the first occurrence of `pattern` in `data`.
// Returns (size_t)-1 if not found.
// The caller is responsible for allocating enough memory for lps_table.
template <typename T>
size_t KMPSearch(T* data, size_t data_size, T* pattern, size_t* lps_table, size_t pattern_size) {
    if (data_size < pattern_size || !pattern_size) return (size_t)-1;
    if (pattern_size == 1) {
        // just do a naive search
        for (size_t index = 0; index < data_size; index++) {
            if (*(data + index) == *pattern) {
                return index;
            }
        }
        return (size_t)-1;
    }
    // first build the lps table
    lps_table[0] = 0;
    size_t i = 0, j = 1;
    while (j < pattern_size) {
        if (pattern[i] == pattern[j]) {
            lps_table[j++] = ++i;
        }
        else {
            if (i) {
                i = lps_table[i - 1]; // <---- why can and should we do this?
            }
            else {
                lps_table[j] = 0;
                j++;
            }
        }
    }
    // now that the lps table has been built
    // begin the search
    i = 0;
    j = 0;
    while (i < data_size) {
        if (data[i] == pattern[j]) {
            i++;
            j++;
            if (j == pattern_size) {
                return i - pattern_size; // for single-target searches
                // emit (i-pattern_size); j = lps_table[j-1] // for multiple-target searches
            }
        }
        else {
            if (j) {
                j = lps_table[j - 1];
            }
            else {
                i++;
            }
        }
    }
    return (size_t)-1;
}

One tricky part of this implementation is when building the lps table, we’ll update i as follows:

i = lps_table[i - 1]; // <---- why can and should we do this?

Here are some of my own thoughts.

When pattern[i] != pattern[j], we know that:

\[\begin{align} Pattern_{i} &\neq Pattern_{j}\\ Pattern_{0...i-1} &= Pattern_{j-1-(i-1)...j-1} \end{align}\]

A…B are inclusive boundaries.

Next, we want to 1) start \(i_{next}\) after the second longest prefix suffix of \(Pattern_{0...j-1}\), which is shorter than \(Pattern_{0...i-1}\) (being the current longest), so that next time when the comparison pattern[i] == pattern[j] succeeds, 2) we are able to get the longest prefix suffix for \(Pattern_{0...j}\), which is:

\[Pattern_{0...i_{next}} (=Pattern_{j-i_{next}...j})\]

Now, questions are:

1) How do we find the longest prefix suffix shorter than \(Pattern_{0...i-1}\)?

We just refer to lps[i-1], it stores the length of the longest proper prefix suffix of \(Pattern_{0...i-1}\), which is shorter than \(Pattern_{0...i-1}\) itself.

2) By setting i to lps[i-1], how can we know that when pattern[i] == pattern[j] succeeds, we get another longest prefix suffix?

According to \((2)\), the longest prefix suffix of \(Pattern_{0...i-1}\) will also be \(Pattern_{j-1-(i-1)...j-1}\)’s longest prefix suffix, thus at the same time being the second longest prefix suffix for \(Pattern_{0...j-1}\) (Why?). Therefore, when a new comparison pattern[i] == pattern[j] succeeds, we get the longest prefix suffix for \(Pattern_{0...j}\).

Why?

We can prove this by contradiction. Assuming B is the longest prefix suffix of \(Pattern_{0...i-1}\) A and B is not the second longest prefix suffix of \(Pattern_{0...j-1}\), then there exists a second longest prefix suffix of \(Pattern_{0...j-1}\) C, which is longer than B and shorter than A. C is both a prefix of A and a suffix of \(Pattern_{j-1-(i-1)...j-1}\), and according to \((2)\), C is also a suffix of A, meaning C is a prefix suffix of A that’s longer than B, which contradicts the assumption that B is the longest prefix suffix of A. Q.E.D.

· Algorithm, KMP