The following is a write-up on an implementation of the Knuth-Morris-Pratt (or KMP) text search algorithm in OCaml. The algorithm itself is rather straight forward but implementing it in a functional style serves as a good example of how lazy data structures may be used as an optimization technique.
The algorithm
KMP solves the problem of efficiently searching for occurrences of one string within another.
To illustrate, searching for the string ABCD
in --ABC-ABCF-ABCD--ABCDEF
should recognize two matches at position 12 and 18, counting from one.
Efficiency is achieved by realizing that backtracking, after matching a prefix of the search pattern, may be eliminated if the pattern is analyzed upfront.
For instance searching for the above pattern from left to right, the sequence of the first
three characters (A
, B
and C
) matches the initial characters of the string being
searched; When realizing that the match fails at the fourth character, a naive
implementation would shift back two characters, starting the search from scratch at
B
.
By pre examining the search pattern, it is clear that backtracking in this
case may be safely omitted since no previous sequence of characters is a
prefix of the pattern. In other words, the only reason for ever backtracking
k
steps after failing to match after n
steps into the pattern, is if the
sequence of the pattern starting at n - k
and ending at n
is a prefix of
the pattern itself! Thus the KMP algorithm contains an initial step for
generating a lookup table where each index of the search pattern is mapped to
a corresponding number of steps required for backtracking. The table for the
search string: ABCD AB ABC DEF
looks like:
ABCD AB ABC DEF
000000120123000
Here the first character for which backtracking is required is the second
occurrence of B
(position 7); That is because the previous character is A
,
which may indeed be a starting point of a complete match. Even in this case
however, backtracking would be redundant since the alternative prefix match is
already known; One may simply continue to match the next character of the
string being searched, while shifting the pattern itself. This ensures a
strict linear bound on the complexity of the algorithm after the initial
lookup table has been constructed.
Implementation
The lookup table from above precisely defines the type of shifting required for each sequence of matches of the input pattern. As such, the table is really an encoding of a state machine where a transition between two steps can be decided exclusively by looking at the result of matching the next character. Such an automata can be represented in OCaml using the following type:
type pattern = { is_match : bool; step: char -> pattern }
The boolean flag indicates whether the pattern is in a matching state or not,
that is if a complete match of the pattern has been observed. The function
step defines the transition. For simplicity I’m using char
here although
there is little point in constraining the type.
Using this definition of pattern, it is straight forward to write a transformation turning a pattern into a search function:
let run (pattern: pattern) (text: string) : int list =
let accum (pattern,ix,acc) c =
let pattern = next c pattern in
let acc = if is_match pattern then (ix :: acc) else acc in
(pattern, ix + 1, acc)
in
let (_,_,acc) = List.fold_left accum (pattern, 0, []) text in
List.rev acc
The function run
accepts a pattern and returns a function than can be
applied on any string, yielding a list of positions.
More interesting is the generation of the pattern itself. How can we define define a function that when given a list of characters returns a pattern representing the table described by KMP?
Looking at the structure of the table, it’s clear that the state machine it encodes contains cycles. That is, failing to match at one state means transitioning to a prior state; Most of the times the initial state. Cyclic structures and functional programming may sound like at odds, although a cyclic structure are semantically equivalent with infinite trees. In this case however, we better be careful; The whole purpose of KMP is to avoid redundant re-computations why we need to make sure that each step of the automata is generated once and once only.
Logically constructing the pattern corresponding to the KMP table is simple.
At each character of the input search string, a step is generated; Unless the
input string is empty the result is an incomplete state with the is_match
property set to false
. The step function looks at an incoming character and
decide to continue to the next step or return to a previous step depending on
whether the character matches the current character of the pattern or not.
The challenge is to figure out how to refer to the previous state. Applying
the following thought experiment might help; Pretend for a second that an
already have a completed pattern automata, p0
is given, with its current
state set to the initial one. One could replicate the pattern by
looking at the input from left two right, using p0
as a reference for
backtracking in case of failed matches. Consider the example above with input
pattern:
ABCD AB ABC DEF
Starting by matching p0
at the second character, that is feeding it
with B
, would bring it back to its initial state expecting an A
. Next
step, given C
and again it would reset itself. A few more characters into
the sequence, matching A
and B
it would progress two steps where next step
expects C
. After eight steps a space character is expected and p0
has now
matched A
and B
and thus expects a C
. When given p0
, the step
function is simple to define: If a space character is provided continue to the
next step, otherwise go to p0
.
Technically this approach poses a problem - In order to generate the pattern
we need the pattern! Sounds like being trapped in a catch twenty two
situation. However, examining the problem carefully should make it clear that
it is not necessary to peek inside p0
until at least one step has been
constructed; So instead of passing a value that is fully evaluated one might
just get away by passing a lazy value and be careful not to force its
evaluation until after it’s created. The lazy
keyword in OCaml provides
such a mechanism.
Here is the complete implementation of pattern generation function implementing the described strategy:
open Lazy
let mk b f = { is_match = b ; step = f }
let step c p = p.step c
let rec const b = { is_match = b; step = fun _ -> const b }
let generate_pattern (cs: char list) : pattern =
let rec pattern = lazy (gen pattern cs)
and gen curr_pattern = function
| [] ->
const true
| [c] ->
mk false @@ fun x ->
let next_pattern = force curr_pattern in
if x = c then
mk true (fun _ -> next_pattern)
else
next_pattern
| c :: cs ->
let next_pattern = lazy (step c @@ force curr_pattern) in
let cont_pattern = lazy (gen next_pattern cs) in
mk false @@ fun x ->
force @@ if x = c then cont_pattern else curr_pattern
in
force pattern
The one line that may stick out is the recursive definition of pattern: let
rec pattern = lazy (gen pattern cs)
. The operator lazy
creates an
unevaluated thunk which enables forward referencing.
The function gen
is the work horse and its parameter curr_pattern
corresponds to p0
from above, and is a handler to a lazy pattern.
The function matches a list of characters constituting the given pattern and
differentiates between three cases; An empty list is a special case and will
result in a pattern that always matches and resets itself to the same state.
For a single character, a step is constructed which when given a character,
forces evaluation of curr_pattern
and branches depending on the match.
When more than one character is matched, the succeeding step is constructed
lazily by applying the first character to the current pattern and recursively
generating the continuation pattern. Note that the evaluation of
cont_pattern
is only forced from within the step function.
It is important to realize that he automata generated by the function is indeed cyclic.
For completeness, here’s the final search function composing run
and generate_pattern
:
let search cs = run @@ generate_pattern cs
Perhaps a bit mind boggling (at least for me) but I hope this example helps to showcase how lazy data structures may be used as an optimization technique, at least providing an alternative to more imperative approaches such as mutable arrays.
Thanks to Shayne Fletcher for coming up with the challenge in the first place.