Parser combinators are sets of functions for building parsers in a composable fashion. Haskell’s Parsec library and OCaml’s Angstrom are two examples. Both of these libraries expose monadic interfaces for describing context-sensitive grammars. This post looks at implementing a more restricted parsing library, structured around applicative functors rather than monads.
What could justify giving up on monads? Depending on the design, one may get a few things in return. In this exercise, the aim is for an API with the following features:
- The ability to extract all valid symbols from a parser.
- Allow some form of pretty printing.
- Support multiple evaluation strategies (e.g. backtracking and non-backtracking).
To see why (1) is not possible to accomplish with monads, consider a parser constructed using the monadic bind-operator:
let p = p1 >>= f
Here, f
is a function of the form 'a -> 'b parser
; that means we don’t
know what sort of parser it produces until it’s provided with a value. Therefore,
we cannot infer all possible symbols consumed by those parsers. The same
reasoning applies to pretty printing.
Designing a parser type
The essence of a parser is a function with a type analogous to:
char list -> ('a * char list) option
A parser takes a list of characters as input, and, when successful, returns a parsed value along with the remaining input.
This representation, however, falls short of supporting symbol extraction, pretty printing, or allowing for multiple evaluation strategies. To accommodate for those we’d have to embellish the type with more context. The problem is we don’t necessarily know the complete feature set upfront. Every time a request for some new capability comes in – be it error reporting, logging or something else – we would have to go back and change the definition to accommodate for the new functionality.
Rather than trying to anticipate all use cases upfront, an alternative approach is to choose a representation that preserves as much structure as possible, so that alternative interpreters may be added later on. To do that, we’re effectively going to enumerate the set of parsers, and ways of combining them, by defining a type for representing an abstract syntax tree (AST). To this purpose, we’ll use a GADT, that also expresses that different parsers are indexed by different types. The initial constructors are split into two sets – primitive ones (the leaf nodes of the tree), and combinators for composing parsers:
type 'a t =
(* Primitive parsers *)
| Fail : string -> 'a t
| Empty : unit t
| Return : 'a -> 'a t
| Symbol : char -> char t
(* Composition - applicative *)
| Map : ('a -> 'b) * 'a t -> 'b t
| Product : 'a t * 'b t -> ('a * 'b) t
(* Composition - alternative *)
| Either : 'a t * 'a t -> 'a t
The primitive parsers are:
Fail msg
- a parser that always fails.Empty
- a parser that succeeds when given empty input.Return x
- a parser that does not consume any input and always returnsx
.Symbol c
- a parser that matches input when the first character isc
.
The applicative interface is what provides sequential composition, as in:
first parse x
with parser p1
, then parse y
with parser p2
and combine
their results.
The Either
constructor makes it possible to provide alternative execution
paths, i.e. parsers that succeed on different types of input.
As we don’t want to give users direct access to the type, we’ll mechanically add some smart constructors:
let empty = Empty
let fail m = Fail m
let return x = Return x
let symbol c = Symbol c
let map f x = Map (f, x)
let product p q = Product (p,q)
let either p q = Either (p,q)
Having an applicative interface means that it is also possible to leverage
the new syntax extension – the let+ .. and+
notation – which I’ve described
here.
Assuming OCaml 4.08 or the dune
future_syntax stanza
, we can add a syntax module, like so:
module Syntax = struct
let (let+) p f = map f p
let (and+) p q = product p q
end
In addition to these, we’ll include an Ops
module with infix versions
and some derived combinators:
module Ops = struct
open Syntax
let ( <$> ) f p = map f p
let ( <|> ) p q = either p q
let ( <*> ) pf px = let+ f = pf and+ x = px in f x
let ( *> ) p q = (fun _ x -> x) <$> p <*> q
let ( <* ) p q = const <$> p <*> q
end
Before moving on, some of the code examples also assume a few general utility functions:
(* Identity *)
val id : 'a -> 'a
(* Const *)
val const : 'a -> 'b -> 'a
(* Forward composition *)
val ( >> ) : ('a -> 'b) -> ('b -> 'c) -> 'a -> 'c
(* Converting a char list to a string *)
val string_of_list : char list -> string
(* And back again *)
val list_of_string : string -> char list
Their implementation, along with the complete code is available here.
Building simple parsers
How can we use the API to build actual parsers? Let’s consider a few simple examples. First, a parser that parses a specific string, i.e. a function:
val string : string -> string t
To implement the string
parser, we can use the symbol
primitive and fold over the
given string to combine the parsers using applicative (let+ .. and+)
syntax:
let string s =
let accum c p =
(* Parse 'x' using the 'symbol c' parser *)
let+ x = symbol c
(* Then parse 'xs' using the 'p' parser *)
and+ xs = p in
(* Combine 'x' and 'xs' in a string *)
Printf.sprintf "%c%s" x xs
in
List.fold_right accum (list_of_string s) (return "")
Next, let’s attempt a parser for parsing digits. A version that detects a single digit may be defined as:
let digit =
list_of_string "0123456789"
|> List.map symbol
|> List.fold_left either fail
Choosing between a list of parsers is a natural generalization of the binary either
combinator, and deserves its own version:
let choose xs = List.fold_left either fail
We may also generalize the digit
definition to take a list of symbols as an argument:
let one_of cs = choose @@ List.map symbol cs
Now, digit
is achieved by:
let digit = one_of "0123456789"
Hitting the boundaries
Next, consider a parser that recognizes arbitrary integers? An integer consists of at least one digit but we don’t know exactly how many. How can we define a parser that captures this semantics? As a first try:
let int =
let rec digits () =
let+ d = digit
and+ ds = either (digits ()) (return []) in
d :: ds
in
map (string_of_list >> int_of_string) @@ digits ()
Can you spot the problem with the above definition? If not, just try running
it and you’ll find it throwing a stack-overflow exception. The problem is
that the recursive call is never conditional on any base case and is always
eagerly evaluated. We need a way to describe parsers that may consume
arbitrarily large inputs without constructing infinite parsing expressions!
To generalize from the int
example, we’re aiming for a combinator with the
following signature:
val many : 'a t -> ('a t) list
One solution would be to introduce a constructor, say Delay
,
for representing lazy parsers:
type 'a t =
...
| Delay : (unit -> 'a t) -> 'a t
let delay f = Delay f
That is a parser with delayed construction. We may use delay
to define
many
, as in:
let rec many p =
let many_one =
let+ x = p
and+ xs = delay @@ fun _ -> many p) in
x :: xs
in
either many_one (return [])
Now, each step of the recursion is evaluated on demand rather than upfront.
This solution would work well if it weren’t for the more ambitious set of constraints
having to do with pretty printing and symbol extraction. The problem is that
in order to extract all possible symbols of a delayed parser, we’d need to evaluate
it; this would unroll the infinite recursion expressed in the many
definition, and
once again kill the stack.
Fixing the parser definition
Is there any fix for the problem of simultaneously having a finite traversable representation, and providing sufficient expressive power for describing infinite parsers? The clue is in the question. A way of expressing recursive structures without recursion is exactly what is offered by the fixed-point combinator.
Let’s extend the parser type with a fixed-point constructor (and also get rid of Delay
):
type 'a t =
...
| Fix : ('a t -> 'a t) -> 'a t
let fix f = Fix f
We can then use fix
as a remedy for the recursiveness of the definition of many
,
from above:
let many p =
fix @@ fun many ->
let many_one =
let+ x = p
and+ xs = many in
x :: xs
in
either many_one (return [])
Note that there is no rec
keyword in sight. The function fix
is just a
handy tool for allowing us to mimic recursive functions. You may be left
wondering how exactly this solves the problem of symbol extraction, given
that we still end up with Fix
nodes – functions of type ('a t -> 'a t)
– that need to be evaluated. Hopefully, the next sections on interpreting
parsers will bring some clarity on that matter.
Coming back to the example of integer-parsing, here’s how int
may be defined
in terms of many
:
let int =
let+ d = digit
and+ ds = many digit in
int_of_string @@ string_of_list (d :: ds)
Again, we may extract the common pattern of parsing one or more times using the same parser (one or more digits in the example above) by adding a new combinator, as in:
let many_one p =
let+ x = p
and+ xs = many p
x :: xs
Given a parser p
, the parser many p
succeeds if it can apply p
at least one
time on its input.
As a side note, the code generously makes use of the new applicative syntax to demonstrate how it may be used to write declarative code that also avoids infix operators. As an alternative, we can define the same functions without relying on the syntax extension; for instance:
(* Requires Ops module to be open *)
let many p = fix @@ fun many ->
either (List.cons <$> p <*> many) (return [])
let many_one p = List.cons <$> p <*> many p
let int =
(string_of_list >> int_of_string) <$> many_one digit
I’ll leave it to the reader to conclude which style is preferable.
(* Requires Ops module to be open *)
let int =
(string_of_list >> int_of_string string_of_list) <$> many_one digit
To wrap up with the initial set of parser examples, here is an implementation of a float parser that also supports parsing values in scientific notation:
let float =
(* Ex: 123. *)
let p1 =
let+ ds = many_one digit
and+ d = symbol '.' in
ds @ [d]
in
(* Ex: 123.45 *)
let p2 =
let+ ds1 = p1
and+ ds2 = many_one digit in
ds1 @ ds2
in
(* Ex: 12.34e56 or 12e34 *)
let p3 =
let+ ds1 = p2 <|> many_one digit
and+ e = symbol 'e'
and+ ds2 = many_one digit in
ds1 @ [e] @ ds2
in
let fol cs = float_of_string @@ string_of_list cs in
fol <$> choice [p1; p2; p3]
Evaluating parsers
Up until this point, the focus has been on designing a sufficiently expressive parser language and we’ve even implemented some simple examples, for parsing decimal numbers. However, there is not a single line of code concerning how to run a parser on actual input. We also have not made any decisions about the exact semantics of what that would look like. For instance, should running a parser return a list of all possible results or only the first valid one it finds? Do we prefer the parser to backtrack one or more steps in case it gets stuck?
The lack of such design decisions is a feature, not a bug! Carefully designing a frontend for constructing parsers, allows us to provide multiple backends for running them. At first we’ll look at a simple non-backtracking evaluator.
For executing a parser we need to turn it into a function. For a simplistic
version that does not deal with error reporting, we can use the same type as
described above – a function of type: char list -> ('a * char list) option
:
val eval : 'a t -> (char list -> ('a * char list) option)
I’ve included a redundant set of parentheses to emphasize the fact that
eval
take a parser and returns a function.
Given the recursive definition of the parser type, we naturally need a recursive implementation.
Because of the GADT structure, each recursive call may invoke the function
with a different type of parser; for that reason, we need to use the syntax
for polymorphic locally abstract
types.
Following is a complete implementation. It assumes a module Option.Syntax
for applicative and monadic combinators for option types, as described
here:
let rec eval : type a. a t -> char list -> (a * char list) option = fun p ->
match p with
| Fail _ ->
const None
| Empty ->
fun cs ->
( match cs with
| [] -> Some ((), [])
| _ -> None
)
| Return x ->
fun cs -> Some (x, cs)
| Symbol s ->
fun cs ->
( match cs with
| c :: cs when c = s -> Some (s, cs)
| _ -> None
)
| Map (f, p) ->
let ep = eval p in
fun cs ->
Option.Syntax(
let+ (x, cs) = ep cs in
(f x, cs)
)
| Product (p, q) ->
let ep = eval p in
let eq = eval q in
fun cs ->
Option.Syntax(
let* (x, cs) = ep cs in
let* (y, cs) = eq cs in
Some ((x,y), cs)
)
| Either (p, q) ->
let ep = eval p in
let eq = eval q in
fun cs ->
( match ep cs with
| Some r -> Some r
| None -> eq cs
)
| Fix f ->
fun cs -> eval (f (Fix f)) cs
Each branch of the pattern match returns a function that accepts
a char list
and produces an optional result.
For convenience we can expose it as a function that transforms a parser specification into a function consuming strings for actually doing the parsing:
val eval' : 'a t -> string -> ('a * char list) option
Implemented as:
let eval' p = list_of_string >> eval p
Following are a few examples, testing the evaluator on the float
parser,
defined above:
# eval' float "12.34";;
- : (float * char list) option = Some (12.34, [])
# eval' float "1.2e3";;
- : (float * char list) option = Some (1200., [])
# eval' float "a1.23";;
- : (float * char list) option = None
The implementation of eval
is not backtracking and only returns one possible
match of the parser for a given input. An alternative implementation would
return a list of results, as in:
val eval : 'a t -> 'a char list -> 'a list
This implementation is left as an exercise.
Extracting symbols
Symbol extraction is another example of an interpreter for parsers – a function that identifies the set of all possible inputs to a parser by traversing its tree:
val symbols : 'a parser -> char list
As with eval
, it is defined as a recursive function that handles all parser
constructors:
module CS = Set.Make (Char)
let symbols p =
let rec aux : type a. a t -> CS.t = function
| Fail _ -> CS.empty
| Empty -> CS.empty
| Return _ -> CS.empty
| Symbol c -> CS.singleton c
| Map (_, p) -> aux p
| Product (p,q) -> CS.union (aux p) (aux q)
| Either (p,q) -> CS.union (aux p) (aux q)
| Fix f -> aux @@ f @@ Fix (fun _ -> Fail "")
in
aux p |> CS.to_seq |> List.of_seq
Most cases are trivial – for instance, Symbol c
returns a singleton set
with c
, and Either
returns the union of the symbols from each branch.
The Fix f
is perhaps the most interesting case. The payload of Fix
is a
function of type: 'a t -> 'a t
, so we just just need to pass it some parser
that does not introduce any additional symbol, hence the Fail
.
We can try it out on the float
parser:
# symbols float;;
- : CS.elt list =
['.'; '0'; '1'; '2'; '3'; '4'; '5'; '6'; '7'; '8'; '9'; 'e']
Showing parsers
As a final example of a parser interpreter, we’ll write a function for converting parsers into strings:
val show : 'a t -> string
Once again, it follows a familiar pattern where we map each constructor:
let rec show : type a. a t -> string = function
| Fail msg ->
msg
| Empty ->
"empty"
| Symbol c ->
Printf.sprintf "symbol '%c'" c
| Return x ->
"return ?"
| Map (f, p) ->
Printf.sprintf "map ? (%s)" (show p)
| Product (p,q) ->
Printf.sprintf "product (%s) (%s)" (show p) (show q)
| Either (p,q) ->
Printf.sprintf "either (%s) (%s)" (show p) (show q)
| Fix f ->
Printf.sprintf "let rec f () = %s) in f ()" @@ show (f (Fail "f ()"))
Note that we’re not able to fully display all constructions. The payload
of Map
consists of a function that we just don’t know how to print. We also
lack a printer for the argument to return
. For fully transparent parsers
we’d have to impose some additional structure – such as pairing arguments with
printers – but for now we just insert some question marks to indicate the
opaque arguments. Note also how we recover the recursive structure of Fix f
by printing it as let rec ..
. As an example, consider printing a parser
obtained by the many
combinator:
# let p = many (symbol 'a');;
val p : char list P.t = Fix <fun>
# print_endline @@ show p;;
let rec f () = either (map ? (product (map ? (symbol 'a')) (f ()))) (return ?)) in f ()
To see how this makes sense, here’s a version where the question marks have been filled in:
let rec f () =
either
(map
List.cons
(product
(map id (symbol 'a'))
(f ())
)
)
(return [])
in
f ()
This very much resembles the initial recursive version of many
that we
started off with, before introducing the fixed-point combinator.
Optimizing the evaluator
Looking at the implementation of eval
, you may have noticed that in every
case – except for Fix f
– there are no recursive calls in the body of the
returned functions. For instance, in the Product (p,q)
case, both p
and
q
are evaluated upfront, before returning a function that consumes actual
input. Thus, for parsers not using fix
– all traces of the parser AST are
eliminated by eval
, and we pay no additional run-time cost during actual
parsing.
Is it possible to amend the case for Fix f
to achieve evaluation upfront?
A first attempt would be to simply factor out the recursive call:
| Fix f ->
let g = eval @@ f (Fix f) in
fun cs -> g cs
One realizes quickly, however, that this wont’ quite cut it as evaluation may
get stuck on ever-expanding f (Fix f)
expressions. Remember that
fix
was introduced to encode recursive definitions required for
the many
combinator. The strategy is to transform it back to a
recursive definition at evaluation level, similar to what we did for printing
in the implementation of show
, above.
In other words, we need to go from:
Fix f
To, an expression of the form:
let rec aux () =
...
aux ()
in
aux ()
Where the body of aux
is given by applying the function f
to some
other expression that may call back to itself. The problem, however, is
that the argument to f
must itself be parser expression. We can work around
this by introducing another node for keeping a raw evaluator:
type 'a t =
...
| Raw : (char list -> ('a * char list) option) -> 'a t
Exposing such a constructor would of course defeat the purpose of having
a fixed-point combinator in the first place, as we’d have to give up symbol
extraction and printing anyway. However, we’ll only use it as an internal
helper node. If we exclusively export the smart constructors, and make the type
'a t'
, either abstract or private, users would never be able to construct
raw nodes. We can thus safely ignore such cases for any other evaluator.
Coming back to the implementation of eval
, here’s how the recursive transformation
is achieved:
let rec eval : type a. a t -> char list -> (a * char list) option = fun p ->
...
| Fix f ->
let rec k = lazy (eval @@ f (Raw (fun cs -> Lazy.force k cs))) in
Lazy.force k
| Raw f ->
f
The recursive definition of k
, is calling eval
lazily. However, since we
immediately force it, we only ever evaluate it once! The expression returned
by applying f
may contain a sub-node Raw g
, where g
refers back to k
.
An alternative option would be to use ref-cells instead of the lazy constructs.
A few more tweaks
Before looking at another example of how to use the parsing library, there are a few utility functions that will come in handy:
val one_of : string -> char t
val between :'a t -> 'b t -> 'c t -> 'c t
val forget :'a t -> a t
Here:
- The parser
one_of s
succeeds on any symbol from the given strings
. between p1 p2 p
is a parser that first parserp1
, followed byp
, followed byp2
and only returns the result ofp
.- The parser
forget p
, runsp
and returns its result without consuming any input
The functions one_of
and between
may be written in terms of the
existing combinators. As for forget
, it’s useful for determining that a parser is
followed by another parser by looking ahead, and requires a new node
in the parser AST:
type 'a t =
...
| Forget : 'a t -> 'a t
Finally, it’s a bit silly to only expose a primitive Token
constructor
for parsing a single character, when so many parsers are based on parsing
specific strings – such as keywords in some grammar. As a pure optimization
strategy, we can replace Token
with a more generalized version that accepts
any character from a given set:
module CS = Set.Make (Char)
type 'a t =
...
| One_of : CS.t -> char t
The complementing code changes are straight forward.
An example – parsing s-expressions
To give a more elaborate example of writing a parser using this library, we’ll consider parsing s-expressions.
An s-expression may be represented in OCaml by a data type like:
type atom =
| Int of int
| Float of float
| String of string
| Symbol of string
type sexp =
| Atom of atom
| List of sexp list
The goal is to define a parser of type sexp t
, that can parse strings into
sexp
values. Here’s an example:
(var "x" ((times (plus 1 2) (val "y"))))
This string would be parsed into the following OCaml value:
List
[Atom (Symbol "var"); Atom (String "x");
List
[List
[Atom (Symbol "times");
List
[Atom (Symbol "plus"); Atom (Int 1);
Atom (Int 2)];
List
[Atom (Symbol "val"); Atom (String "y")]]]]
The parser should differentiate between different types of input, such as symbols, integers, floats and (quoted) strings, which may also contain spaces and parentheses.
We’ll start with a few helpers for parsing different types of strings:
let space = one_of "\n\t\r "
let left_paren = symbol '('
let right_paren = symbol ')'
let regular_char =
let special_char = list_of_string "()\n\r\t\" " in
let is_regular c = not @@ List.mem c special_char in
List.init 256 Char.chr |> List.filter is_regular |> string_of_list |> one_of
let regular_string = string_of_list <$> many_one regular_char
let quoted_string =
let quote = symbol '\"' in
let string = many (regular_char <|> space <|> left_paren <|> right_paren) in
string_of_list <$> between quote quote string
It’s now fairly straight forward to combine the functions for parsing atom
s:
let atom =
let end_num =
forget (empty <|> map ignore space <|> map ignore right_paren)
in
choice
[ map (fun i -> Int i ) int <* end_num
; map (fun f -> Float f) float <* end_num
; map (fun s -> String s) quoted_string
; map (fun s -> Symbol s) regular_string
]
Since the sexp
type is recursive, we’ll again turn to fix
to express
that as a parser:
let sexpr =
fix @@ fun sexpr ->
let expr =
either
((fun es -> List es) <$> between left_paren right_paren (many sexpr))
((fun a -> Atom a ) <$> atom)
in
between (many space) (many space) expr
Conclusion
This post covered quite a few different concepts, including:
- Parser combinators.
- Difference between monadic and applicative parser APIs.
- Applicative functor composition and complementary
let+ .. and+ ..
syntax. - GADTs for representing abstract syntax trees.
- Fixed-point combinators for encoding recursive expressions.
It serves to exemplify a more general pattern – related to initial algebras – where the approach is to identify a data type, for a particular domain, while making as few commitments as possible. Whenever we define a value in the algebra – in our case, the parser algebra – we don’t lose any information that went into constructing it. We can thus map it to other more refined representations. For parsers, this meant providing multiple backends in the form of evaluator and printer functions.
You’ll find the full code, including the examples, here.