Parser combinators are sets of functions for building parsers in a composable fashion. Haskell’s Parsec library and OCaml’s Angstrom are two examples. Both of these libraries expose monadic interfaces for describing context-sensitive grammars. This post looks at implementing a more restricted parsing library, structured around applicative functors rather than monads.
What could justify giving up on monads? Depending on the design, one may get a few things in return. In this exercise, the aim is for an API with the following features:
- The ability to extract all valid symbols from a parser.
- Allow some form of pretty printing.
- Support multiple evaluation strategies (e.g. backtracking and non-backtracking).
To see why (1) is not possible to accomplish with monads, consider a parser constructed using the monadic bind-operator:
let p = p1 >>= f
f is a function of the form
'a -> 'b parser; that means we don’t
know what sort of parser it produces until it’s provided with a value. Therefore,
we cannot infer all possible symbols consumed by those parsers. The same
reasoning applies to pretty printing.
Designing a parser type
The essence of a parser is a function with a type analogous to:
char list -> ('a * char list) option
A parser takes a list of characters as input, and, when successful, returns a parsed value along with the remaining input.
This representation, however, falls short of supporting symbol extraction, pretty printing, or allowing for multiple evaluation strategies. To accommodate for those we’d have to embellish the type with more context. The problem is we don’t necessarily know the complete feature set upfront. Every time a request for some new capability comes in – be it error reporting, logging or something else – we would have to go back and change the definition to accommodate for the new functionality.
Rather than trying to anticipate all use cases upfront, an alternative approach is to choose a representation that preserves as much structure as possible, so that alternative interpreters may be added later on. To do that, we’re effectively going to enumerate the set of parsers, and ways of combining them, by defining a type for representing an abstract syntax tree (AST). To this purpose, we’ll use a GADT, that also expresses that different parsers are indexed by different types. The initial constructors are split into two sets – primitive ones (the leaf nodes of the tree), and combinators for composing parsers:
type 'a t = (* Primitive parsers *) | Fail : string -> 'a t | Empty : unit t | Return : 'a -> 'a t | Symbol : char -> char t (* Composition - applicative *) | Map : ('a -> 'b) * 'a t -> 'b t | Product : 'a t * 'b t -> ('a * 'b) t (* Composition - alternative *) | Either : 'a t * 'a t -> 'a t
The primitive parsers are:
Fail msg- a parser that always fails.
Empty- a parser that succeeds when given empty input.
Return x- a parser that does not consume any input and always returns
Symbol c- a parser that matches input when the first character is
The applicative interface is what provides sequential composition, as in:
x with parser
p1, then parse
y with parser
p2 and combine
Either constructor makes it possible to provide alternative execution
paths, i.e. parsers that succeed on different types of input.
As we don’t want to give users direct access to the type, we’ll mechanically add some smart constructors:
let empty = Empty let fail m = Fail m let return x = Return x let symbol c = Symbol c let map f x = Map (f, x) let product p q = Product (p,q) let either p q = Either (p,q)
Having an applicative interface means that it is also possible to leverage
the new syntax extension – the
let+ .. and+ notation – which I’ve described
Assuming OCaml 4.08 or the dune
, we can add a syntax module, like so:
module Syntax = struct let (let+) p f = map f p let (and+) p q = product p q end
In addition to these, we’ll include an
Ops module with infix versions
and some derived combinators:
module Ops = struct open Syntax let ( <$> ) f p = map f p let ( <|> ) p q = either p q let ( <*> ) pf px = let+ f = pf and+ x = px in f x let ( *> ) p q = (fun _ x -> x) <$> p <*> q let ( <* ) p q = const <$> p <*> q end
Before moving on, some of the code examples also assume a few general utility functions:
(* Identity *) val id : 'a -> 'a (* Const *) val const : 'a -> 'b -> 'a (* Forward composition *) val ( >> ) : ('a -> 'b) -> ('b -> 'c) -> 'a -> 'c (* Converting a char list to a string *) val string_of_list : char list -> string (* And back again *) val list_of_string : string -> char list
Their implementation, along with the complete code is available here.
Building simple parsers
How can we use the API to build actual parsers? Let’s consider a few simple examples. First, a parser that parses a specific string, i.e. a function:
val string : string -> string t
To implement the
string parser, we can use the
symbol primitive and fold over the
given string to combine the parsers using applicative (
let+ .. and+) syntax:
let string s = let accum c p = (* Parse 'x' using the 'symbol c' parser *) let+ x = symbol c (* Then parse 'xs' using the 'p' parser *) and+ xs = p in (* Combine 'x' and 'xs' in a string *) Printf.sprintf "%c%s" x xs in List.fold_right accum (string_to_list s) (return "")
Next, let’s attempt a parser for parsing digits. A version that detects a single digit may be defined as:
let digit = list_of_string "0123456789" |> List.map symbol |> List.fold_left either fail
Choosing between a list of parsers is a natural generalization of the binary
combinator, and deserves its own version:
let choose xs = List.fold_left either fail
We may also generalize the
digit definition to take a list of symbols as an argument:
let one_of cs = choose @@ List.map symbol cs
digit is achieved by:
let digit = one_of "0123456789"
Hitting the boundaries
Next, consider a parser that recognizes arbitrary integers? An integer consists of at least one digit but we don’t know exactly how many. How can we define a parser that captures this semantics? As a first try:
let int = let rec digits () = let+ d = digit and+ ds = either (digits ()) (return ) in d :: ds in map (string_of_list >> int_of_string) @@ digits ()
Can you spot the problem with the above definition? If not, just try running
it and you’ll find it throwing a stack-overflow exception. The problem is
that the recursive call is never conditional on any base case and is always
eagerly evaluated. We need a way to describe parsers that may consume
arbitrarily large inputs without constructing infinite parsing expressions!
To generalize from the
int example, we’re aiming for a combinator with the
val many : 'a t -> ('a t) list
One solution would be to introduce a constructor, say
for representing lazy parsers:
type 'a t = ... | Delay : (unit -> 'a t) -> 'a t let delay f = Delay f
That is a parser with delayed construction. We may use
delay to define
many, as in:
let rec many p = let many_one = let+ x = p and+ xs = delay @@ fun _ -> many p) in x :: xs in either many_one (return )
Now, each step of the recursion is evaluated on demand rather than upfront.
This solution would work well if it weren’t for the more ambitious set of constraints
having to do with pretty printing and symbol extraction. The problem is that
in order to extract all possible symbols of a delayed parser, we’d need to evaluate
it; this would unroll the infinite recursion expressed in the
many definition, and
once again kill the stack.
Fixing the parser definition
Is there any fix for the problem of simultaneously having a finite traversable representation, and providing sufficient expressive power for describing infinite parsers? The clue is in the question. A way of expressing recursive structures without recursion is exactly what is offered by the fixed-point combinator.
Let’s extend the parser type with a fixed-point constructor (and also get rid of
type 'a t = ... | Fix : ('a t -> 'a t) -> 'a t let fix f = Fix f
We can then use
fix as a remedy for the recursiveness of the definition of
let many p = fix @@ fun many -> let many_one = let+ x = p and+ xs = many in x :: xs in either many_one (return )
Note that there is no
rec keyword in sight. The function
fix is just a
handy tool for allowing us to mimic recursive functions. You may be left
wondering how exactly this solves the problem of symbol extraction, given
that we still end up with
Fix nodes – functions of type
('a t -> 'a t)
– that need to be evaluated. Hopefully, the next sections on interpreting
parsers will bring some clarity on that matter.
Coming back to the example of integer-parsing, here’s how
int may be defined
in terms of
let int = let+ d = digit and+ ds = many digit in int_of_string @@ string_of_list (d :: ds)
Again, we may extract the common pattern of parsing one or more times using the same parser (one or more digits in the example above) by adding a new combinator, as in:
let many_one p = let+ x = p and+ xs = many p x :: xs
Given a parser
p, the parser
many p succeeds if it can apply
p at least one
time on its input.
As a side note, the code generously makes use of the new applicative syntax to demonstrate how it may be used to write declarative code that also avoids infix operators. As an alternative, we can define the same functions without relying on the syntax extension; for instance:
(* Requires Ops module to be open *) let many p = fix @@ fun many -> either (List.cons <$> p <*> many) (return ) let many_one p = List.cons <$> p <*> many p let int = (string_of_list >> int_of_string) <$> many_one digit
I’ll leave it to the reader to conclude which style is preferable.
(* Requires Ops module to be open *) let int = (string_of_list >> int_of_string string_of_list) <$> many_one digit
To wrap up with the initial set of parser examples, here is an implementation of a float parser that also supports parsing values in scientific notation:
let float = (* Ex: 123. *) let p1 = let+ ds = many_one digit and+ d = symbol '.' in ds @ [d] in (* Ex: 123.45 *) let p2 = let+ ds1 = p1 and+ ds2 = many_one digit in ds1 @ ds2 in (* Ex: 12.34e56 or 12e34 *) let p3 = let+ ds1 = p2 <|> many_one digit and+ e = symbol 'e' and+ ds2 = many_one digit in ds1 @ [e] @ ds2 in let fol cs = float_of_string @@ string_of_list cs in fol <$> choice [p1; p2; p3]
Up until this point, the focus has been on designing a sufficiently expressive parser language and we’ve even implemented some simple examples, for parsing decimal numbers. However, there is not a single line of code concerning how to run a parser on actual input. We also have not made any decisions about the exact semantics of what that would look like. For instance, should running a parser return a list of all possible results or only the first valid one it finds? Do we prefer the parser to backtrack one or more steps in case it gets stuck?
The lack of such design decisions is a feature, not a bug! Carefully designing a frontend for constructing parsers, allows us to provide multiple backends for running them. At first we’ll look at a simple non-backtracking evaluator.
For executing a parser we need to turn it into a function. For a simplistic
version that does not deal with error reporting, we can use the same type as
described above – a function of type:
char list -> ('a * char list) option:
val eval : 'a t -> (char list -> ('a * char list) option)
I’ve included a redundant set of parentheses to emphasize the fact that
eval take a parser and returns a function.
Given the recursive definition of the parser type, we naturally need a recursive implementation.
Because of the GADT structure, each recursive call may invoke the function
with a different type of parser; for that reason, we need to use the syntax
for polymorphic locally abstract
Following is a complete implementation. It assumes a module
for applicative and monadic combinators for option types, as described
let rec eval : type a. a t -> char list -> (a * char list) option = fun p -> match p with | Fail _ -> const None | Empty -> fun cs -> ( match cs with |  -> Some ((), ) | _ -> None ) | Return x -> fun cs -> Some (x, cs) | Symbol s -> fun cs -> ( match cs with | c :: cs when c = s -> Some (s, cs) | _ -> None ) | Map (f, p) -> let ep = eval p in fun cs -> Option.Syntax( let+ (x, cs) = ep cs in (f x, cs) ) | Product (p, q) -> let ep = eval p in let eq = eval q in fun cs -> Option.Syntax( let* (x, cs) = ep cs in let* (y, cs) = eq cs in Some ((x,y), cs) ) | Either (p, q) -> let ep = eval p in let eq = eval q in fun cs -> ( match ep cs with | Some r -> Some r | None -> eq cs ) | Fix f -> fun cs -> eval (f (Fix f)) cs
Each branch of the pattern match returns a function that accepts
char list and produces an optional result.
For convenience we can expose it as a function that transforms a parser specification into a function consuming strings for actually doing the parsing:
val eval' : 'a t -> string -> ('a * char list) option
let eval' p = list_of_string >> eval p
Following are a few examples, testing the evaluator on the
# eval' float "12.34";; - : (float * char list) option = Some (12.34, ) # eval' float "1.2e3";; - : (float * char list) option = Some (1200., ) # eval' float "a1.23";; - : (float * char list) option = None
The implementation of
eval is not backtracking and only returns one possible
match of the parser for a given input. An alternative implementation would
return a list of results, as in:
val eval : 'a t -> 'a char list -> 'a list
This implementation is left as an exercise.
Symbol extraction is another example of an interpreter for parsers – a function that identifies the set of all possible inputs to a parser by traversing its tree:
val symbols : 'a parser -> char list
eval, it is defined as a recursive function that handles all parser
module CS = Set.Make (Char) let symbols p = let rec aux : type a. a t -> CS.t = function | Fail _ -> CS.empty | Empty -> CS.empty | Return _ -> CS.empty | Symbol c -> CS.singleton c | Map (_, p) -> aux p | Product (p,q) -> CS.union (aux p) (aux q) | Either (p,q) -> CS.union (aux p) (aux q) | Fix f -> aux @@ f @@ Fix (fun _ -> Fail "") in aux p |> CS.to_seq |> List.of_seq
Most cases are trivial – for instance,
Symbol c returns a singleton set
Either returns the union of the symbols from each branch.
Fix f is perhaps the most interesting case. The payload of
Fix is a
function of type:
'a t -> 'a t, so we just just need to pass it some parser
that does not introduce any additional symbol, hence the
We can try it out on the
# symbols float;; - : CS.elt list = ['.'; '0'; '1'; '2'; '3'; '4'; '5'; '6'; '7'; '8'; '9'; 'e']
As a final example of a parser interpreter, we’ll write a function for converting parsers into strings:
val show : 'a t -> string
Once again, it follows a familiar pattern where we map each constructor:
let rec show : type a. a t -> string = function | Fail msg -> msg | Empty -> "empty" | Symbol c -> Printf.sprintf "symbol '%c'" c | Return x -> "return ?" | Map (f, p) -> Printf.sprintf "map ? (%s)" (show p) | Product (p,q) -> Printf.sprintf "product (%s) (%s)" (show p) (show q) | Either (p,q) -> Printf.sprintf "either (%s) (%s)" (show p) (show q) | Fix f -> Printf.sprintf "let rec f () = %s) in f ()" @@ show (f (Fail "f ()"))
Note that we’re not able to fully display all constructions. The payload
Map consists of a function that we just don’t know how to print. We also
lack a printer for the argument to
return. For fully transparent parsers
we’d have to impose some additional structure – such as pairing arguments with
printers – but for now we just insert some question marks to indicate the
opaque arguments. Note also how we recover the recursive structure of
by printing it as
let rec .. . As an example, consider printing a parser
obtained by the
# let p = many (symbol 'a');; val p : char list P.t = Fix <fun> # print_endline @@ show p;; let rec f () = either (map ? (product (map ? (symbol 'a')) (f ()))) (return ?)) in f ()
To see how this makes sense, here’s a version where the question marks have been filled in:
let rec f () = either (map List.cons (product (map id (symbol 'a')) (f ()) ) ) (return ) in f ()
This very much resembles the initial recursive version of
many that we
started off with, before introducing the fixed-point combinator.
Optimizing the evaluator
Looking at the implementation of
eval, you may have noticed that in every
case – except for
Fix f – there are no recursive calls in the body of the
returned functions. For instance, in the
Product (p,q) case, both
q are evaluated upfront, before returning a function that consumes actual
input. Thus, for parsers not using
fix – all traces of the parser AST are
eval, and we pay no additional run-time cost during actual
Is it possible to amend the case for
Fix f to achieve evaluation upfront?
A first attempt would be to simply factor out the recursive call:
| Fix f -> let g = eval @@ f (Fix f) in fun cs -> g cs
One realizes quickly, however, that this wont’ quite cut it as evaluation may
get stuck on ever-expanding
f (Fix f) expressions. Remember that
fix was introduced to encode recursive definitions required for
many combinator. The strategy is to transform it back to a
recursive definition at evaluation level, similar to what we did for printing
in the implementation of
In other words, we need to go from:
To, an expression of the form:
let rec aux () = ... aux () in aux ()
Where the body of
aux is given by applying the function
f to some
other expression that may call back to itself. The problem, however, is
that the argument to
f must itself be parser expression. We can work around
this by introducing another node for keeping a raw evaluator:
type 'a t = ... | Raw : (char list -> ('a * char list) option) -> 'a t
Exposing such a constructor would of course defeat the purpose of having
a fixed-point combinator in the first place, as we’d have to give up symbol
extraction and printing anyway. However, we’ll only use it as an internal
helper node. If we exclusively export the smart constructors, and make the type
'a t', either abstract or private, users would never be able to construct
raw nodes. We can thus safely ignore such cases for any other evaluator.
Coming back to the implementation of
eval, here’s how the recursive transformation
let rec eval : type a. a t -> char list -> (a * char list) option = fun p -> ... | Fix f -> let rec k = lazy (eval @@ f (Raw (fun cs -> Lazy.force k cs))) in Lazy.force k | Raw f -> f
The recursive definition of
k, is calling
eval lazily. However, since we
immediately force it, we only ever evaluate it once! The expression returned
f may contain a sub-node
Raw g, where
g refers back to
An alternative option would be to use ref-cells instead of the lazy constructs.
A few more tweaks
Before looking at another example of how to use the parsing library, there are a few utility functions that will come in handy:
val one_of : string -> char t val between :'a t -> 'b t -> 'c t -> 'c t val forget :'a t -> a t
- The parser
one_of ssucceeds on any symbol from the given string
between p1 p2 pis a parser that first parser
p1, followed by
p, followed by
p2and only returns the result of
- The parser
forget p, runs
pand returns its result without consuming any input
between may be written in terms of the
existing combinators. As for
forget, it’s useful for determining that a parser is
followed by another parser by looking ahead, and requires a new node
in the parser AST:
type 'a t = ... | Forget : 'a t -> 'a t
Finally, it’s a bit silly to only expose a primitive
for parsing a single character, when so many parsers are based on parsing
specific strings – such as keywords in some grammar. As a pure optimization
strategy, we can replace
Token with a more generalized version that accepts
any character from a given set:
module CS = Set.Make (Char) type 'a t = ... | One_of : CS.t -> char t
The complementing code changes are straight forward.
An example – parsing s-expressions
To give a more elaborate example of writing a parser using this library, we’ll consider parsing s-expressions.
An s-expression may be represented in OCaml by a data type like:
type atom = | Int of int | Float of float | String of string | Symbol of string type sexp = | Atom of atom | List of sexp list
The goal is to define a parser of type
sexp t, that can parse strings into
sexp values. Here’s an example:
(var "x" ((times (plus 1 2) (val "y"))))
This string would be parsed into the following OCaml value:
List [Atom (Symbol "var"); Atom (String "x"); List [List [Atom (Symbol "times"); List [Atom (Symbol "plus"); Atom (Int 1); Atom (Int 2)]; List [Atom (Symbol "val"); Atom (String "y")]]]]
The parser should differentiate between different types of input, such as symbols, integers, floats and (quoted) strings, which may also contain spaces and parentheses.
We’ll start with a few helpers for parsing different types of strings:
let space = one_of "\n\t\r " let left_paren = symbol '(' let right_paren = symbol ')' let regular_char = let special_char = list_of_string "()\n\r\t\" " in let is_regular c = not @@ List.mem c special_char in List.init 256 Char.chr |> List.filter is_regular |> string_of_list |> one_of let regular_string = string_of_list <$> many_one regular_char let quoted_string = let quote = symbol '\"' in let string = many (regular_char <|> space <|> left_paren <|> right_paren) in string_of_list <$> between quote quote string
It’s now fairly straight forward to combine the functions for parsing
let atom = let end_num = forget (empty <|> map ignore space <|> map ignore right_paren) in choice [ map (fun i -> Int i ) int <* end_num ; map (fun f -> Float f) float <* end_num ; map (fun s -> String s) quoted_string ; map (fun s -> Symbol s) regular_string ]
sexp type is recursive, we’ll again turn to
fix to express
that as a parser:
let sexpr = fix @@ fun sexpr -> let expr = either ((fun es -> List es) <$> between left_paren right_paren (many sexpr)) ((fun a -> Atom a ) <$> atom) in between (many space) (many space) expr
This post covered quite a few different concepts, including:
- Parser combinators.
- Difference between monadic and applicative parser APIs.
- Applicative functor composition and complementary
let+ .. and+ ..syntax.
- GADTs for representing abstract syntax trees.
- Fixed-point combinators for encoding recursive expressions.
It serves to exemplify a more general pattern – related to initial algebras – where the approach is to identify a data type, for a particular domain, while making as few commitments as possible. Whenever we define a value in the algebra – in our case, the parser algebra – we don’t loose any information that went into constructing it. We can thus map it to other more refined representations. For parsers, this meant providing multiple backends in the form of evaluator and printer functions.
You’ll find the full code, including the examples, here.