Parser combinators are sets of functions for building parsers in a composable fashion. Haskell’s Parsec library and OCaml’s Angstrom are two examples. Both of these libraries expose monadic interfaces for describing context-sensitive grammars. This post looks at implementing a more restricted parsing library, structured around applicative functors rather than monads.

What could justify giving up on monads? Depending on the design, one may get a few things in return. In this exercise, the aim is for an API with the following features:

  1. The ability to extract all valid symbols from a parser.
  2. Allow some form of pretty printing.
  3. Support multiple evaluation strategies (e.g. backtracking and non-backtracking).

To see why (1) is not possible to accomplish with monads, consider a parser constructed using the monadic bind-operator:

let p = p1 >>= f

Here, f is a function of the form 'a -> 'b parser; that means we don’t know what sort of parser it produces until it’s provided with a value. Therefore, we cannot infer all possible symbols consumed by those parsers. The same reasoning applies to pretty printing.

Designing a parser type

The essence of a parser is a function with a type analogous to:

char list -> ('a * char list) option

A parser takes a list of characters as input, and, when successful, returns a parsed value along with the remaining input.

This representation, however, falls short of supporting symbol extraction, pretty printing, or allowing for multiple evaluation strategies. To accommodate for those we’d have to embellish the type with more context. The problem is we don’t necessarily know the complete feature set upfront. Every time a request for some new capability comes in – be it error reporting, logging or something else – we would have to go back and change the definition to accommodate for the new functionality.

Rather than trying to anticipate all use cases upfront, an alternative approach is to choose a representation that preserves as much structure as possible, so that alternative interpreters may be added later on. To do that, we’re effectively going to enumerate the set of parsers, and ways of combining them, by defining a type for representing an abstract syntax tree (AST). To this purpose, we’ll use a GADT, that also expresses that different parsers are indexed by different types. The initial constructors are split into two sets – primitive ones (the leaf nodes of the tree), and combinators for composing parsers:

type 'a t =
  (* Primitive parsers *)
  | Fail      : string            -> 'a t
  | Empty     :                       unit t
  | Return    : 'a                -> 'a t
  | Symbol    : char              -> char t
  (* Composition - applicative *)
  | Map       : ('a -> 'b) * 'a t -> 'b t
  | Product   : 'a t * 'b t       -> ('a * 'b) t
  (* Composition - alternative *)
  | Either    : 'a t * 'a t       -> 'a t

The primitive parsers are:

  • Fail msg - a parser that always fails.
  • Empty - a parser that succeeds when given empty input.
  • Return x - a parser that does not consume any input and always returns x.
  • Symbol c - a parser that matches input when the first character is c.

The applicative interface is what provides sequential composition, as in: first parse x with parser p1, then parse y with parser p2 and combine their results.

The Either constructor makes it possible to provide alternative execution paths, i.e. parsers that succeed on different types of input.

As we don’t want to give users direct access to the type, we’ll mechanically add some smart constructors:

let empty = Empty

let fail m = Fail m

let return x = Return x

let symbol c = Symbol c

let map f x = Map (f, x)

let product p q = Product (p,q)

let either p q = Either (p,q)

Having an applicative interface means that it is also possible to leverage the new syntax extension – the let+ .. and+ notation – which I’ve described here. Assuming OCaml 4.08 or the dune future_syntax stanza , we can add a syntax module, like so:

module Syntax = struct
  let (let+) p f = map f p
  let (and+) p q = product p q
end

In addition to these, we’ll include an Ops module with infix versions and some derived combinators:

module Ops = struct
  open Syntax
  let ( <$> ) f p   = map f p
  let ( <|> ) p q   = either p q
  let ( <*> ) pf px = let+ f = pf and+ x = px in f x
  let ( *>  ) p q   = (fun _ x -> x) <$> p <*> q
  let ( <*  ) p q   = const <$> p <*> q
end

Before moving on, some of the code examples also assume a few general utility functions:

(* Identity *)
val id : 'a -> 'a

(* Const *)
val const : 'a -> 'b -> 'a

(* Forward composition *)
val ( >> ) : ('a -> 'b) -> ('b -> 'c) -> 'a -> 'c

(* Converting a char list to a string *)
val string_of_list : char list -> string

(* And back again *)
val list_of_string : string -> char list

Their implementation, along with the complete code is available here.

Building simple parsers

How can we use the API to build actual parsers? Let’s consider a few simple examples. First, a parser that parses a specific string, i.e. a function:

val string : string -> string t

To implement the string parser, we can use the symbol primitive and fold over the given string to combine the parsers using applicative (let+ .. and+) syntax:

let string s =
  let accum c p =
    (* Parse 'x' using the 'symbol c' parser *)
    let+ x  = symbol c
    (* Then parse 'xs' using the 'p' parser *)
    and+ xs = p in
    (* Combine 'x' and 'xs' in a string *)
    Printf.sprintf "%c%s" x xs
  in
  List.fold_right accum (string_to_list s) (return "")

Next, let’s attempt a parser for parsing digits. A version that detects a single digit may be defined as:

let digit =
  list_of_string "0123456789"
  |> List.map symbol
  |> List.fold_left either fail

Choosing between a list of parsers is a natural generalization of the binary either combinator, and deserves its own version:

let choose xs = List.fold_left either fail

We may also generalize the digit definition to take a list of symbols as an argument:

let one_of cs = choose @@ List.map symbol cs

Now, digit is achieved by:

let digit = one_of "0123456789"

Hitting the boundaries

Next, consider a parser that recognizes arbitrary integers? An integer consists of at least one digit but we don’t know exactly how many. How can we define a parser that captures this semantics? As a first try:

let int =
  let rec digits () =
    let+ d  = digit
    and+ ds = either (digits ()) (return []) in
    d :: ds
  in
  map (string_of_list >> int_of_string) @@ digits ()

Can you spot the problem with the above definition? If not, just try running it and you’ll find it throwing a stack-overflow exception. The problem is that the recursive call is never conditional on any base case and is always eagerly evaluated. We need a way to describe parsers that may consume arbitrarily large inputs without constructing infinite parsing expressions! To generalize from the int example, we’re aiming for a combinator with the following signature:

val many : 'a t -> ('a t) list

One solution would be to introduce a constructor, say Delay, for representing lazy parsers:

type 'a t =
  ...
  | Delay : (unit -> 'a t) -> 'a t

let delay f = Delay f

That is a parser with delayed construction. We may use delay to define many, as in:

let rec many p =
  let many_one =
    let+ x  = p
    and+ xs = delay @@ fun _ -> many p) in
    x :: xs
  in
  either many_one (return [])

Now, each step of the recursion is evaluated on demand rather than upfront. This solution would work well if it weren’t for the more ambitious set of constraints having to do with pretty printing and symbol extraction. The problem is that in order to extract all possible symbols of a delayed parser, we’d need to evaluate it; this would unroll the infinite recursion expressed in the many definition, and once again kill the stack.

Fixing the parser definition

Is there any fix for the problem of simultaneously having a finite traversable representation, and providing sufficient expressive power for describing infinite parsers? The clue is in the question. A way of expressing recursive structures without recursion is exactly what is offered by the fixed-point combinator.

Let’s extend the parser type with a fixed-point constructor (and also get rid of Delay):

type 'a t =
  ...
  | Fix : ('a t -> 'a t) -> 'a t

let fix f = Fix f

We can then use fix as a remedy for the recursiveness of the definition of many, from above:

let many p =
  fix @@ fun many ->
    let many_one =
      let+ x  = p
      and+ xs = many in
      x :: xs
    in
    either many_one (return [])

Note that there is no rec keyword in sight. The function fix is just a handy tool for allowing us to mimic recursive functions. You may be left wondering how exactly this solves the problem of symbol extraction, given that we still end up with Fix nodes – functions of type ('a t -> 'a t) – that need to be evaluated. Hopefully, the next sections on interpreting parsers will bring some clarity on that matter.

Coming back to the example of integer-parsing, here’s how int may be defined in terms of many:

let int =
  let+ d  = digit
  and+ ds = many digit in
  int_of_string @@ string_of_list (d :: ds)

Again, we may extract the common pattern of parsing one or more times using the same parser (one or more digits in the example above) by adding a new combinator, as in:

let many_one p =
  let+ x = p
  and+ xs = many p
  x :: xs

Given a parser p, the parser many p succeeds if it can apply p at least one time on its input.

As a side note, the code generously makes use of the new applicative syntax to demonstrate how it may be used to write declarative code that also avoids infix operators. As an alternative, we can define the same functions without relying on the syntax extension; for instance:

(* Requires Ops module to be open *)

let many p = fix @@ fun many ->
  either (List.cons <$> p <*> many) (return [])

let many_one p = List.cons <$> p <*> many p

let int =
  (string_of_list >> int_of_string) <$> many_one digit

I’ll leave it to the reader to conclude which style is preferable.

(* Requires Ops module to be open *)
let int =
  (string_of_list >> int_of_string string_of_list) <$> many_one digit

To wrap up with the initial set of parser examples, here is an implementation of a float parser that also supports parsing values in scientific notation:

let float =
  (* Ex: 123. *)
  let p1 =
    let+ ds = many_one digit
    and+ d  = symbol '.' in
    ds @ [d]
  in
  (* Ex: 123.45 *)
  let p2 =
    let+ ds1 = p1
    and+ ds2 = many_one digit in
    ds1 @ ds2
  in
  (* Ex: 12.34e56 or 12e34 *)
  let p3 =
    let+ ds1 = p2 <|> many_one digit
    and+ e   = symbol 'e'
    and+ ds2 = many_one digit in
    ds1 @ [e] @ ds2
  in
  let fol cs = float_of_string @@ string_of_list cs in
  fol <$> choice [p1; p2; p3]

Evaluating parsers

Up until this point, the focus has been on designing a sufficiently expressive parser language and we’ve even implemented some simple examples, for parsing decimal numbers. However, there is not a single line of code concerning how to run a parser on actual input. We also have not made any decisions about the exact semantics of what that would look like. For instance, should running a parser return a list of all possible results or only the first valid one it finds? Do we prefer the parser to backtrack one or more steps in case it gets stuck?

The lack of such design decisions is a feature, not a bug! Carefully designing a frontend for constructing parsers, allows us to provide multiple backends for running them. At first we’ll look at a simple non-backtracking evaluator.

For executing a parser we need to turn it into a function. For a simplistic version that does not deal with error reporting, we can use the same type as described above – a function of type: char list -> ('a * char list) option:

val eval : 'a t -> (char list -> ('a * char list) option)

I’ve included a redundant set of parentheses to emphasize the fact that eval take a parser and returns a function.

Given the recursive definition of the parser type, we naturally need a recursive implementation.

Because of the GADT structure, each recursive call may invoke the function with a different type of parser; for that reason, we need to use the syntax for polymorphic locally abstract types. Following is a complete implementation. It assumes a module Option.Syntax for applicative and monadic combinators for option types, as described here:

let rec eval :  type a. a t -> char list -> (a * char list) option = fun p ->
  match p with
  | Fail _ ->
    const None
  | Empty ->
    fun cs ->
      ( match cs with
        | []  -> Some ((), [])
        | _   -> None
      )
  | Return x  ->
    fun cs -> Some (x, cs)
  | Symbol s ->
    fun cs ->
      ( match cs with
        | c :: cs when c = s  -> Some (s, cs)
        | _                   -> None
      )
  | Map (f, p) ->
    let ep = eval p in
    fun cs ->
      Option.Syntax(
        let+ (x, cs) = ep cs in
        (f x, cs)
      )
  | Product (p, q) ->
    let ep = eval p in
    let eq = eval q in
    fun cs ->
      Option.Syntax(
        let* (x, cs) = ep cs in
        let* (y, cs) = eq cs in
        Some ((x,y), cs)
      )
  | Either (p, q) ->
    let ep = eval p in
    let eq = eval q in
    fun cs ->
      ( match ep cs with
        | Some r -> Some r
        | None   -> eq cs
      )
  | Fix f ->
    fun cs -> eval (f (Fix f)) cs

Each branch of the pattern match returns a function that accepts a char list and produces an optional result.

For convenience we can expose it as a function that transforms a parser specification into a function consuming strings for actually doing the parsing:

val eval' : 'a t -> string -> ('a * char list) option

Implemented as:

let eval' p = list_of_string >> eval p

Following are a few examples, testing the evaluator on the float parser, defined above:

# eval' float "12.34";;
- : (float * char list) option = Some (12.34, [])

# eval' float "1.2e3";;
- : (float * char list) option = Some (1200., [])

# eval' float "a1.23";;
- : (float * char list) option = None

The implementation of eval is not backtracking and only returns one possible match of the parser for a given input. An alternative implementation would return a list of results, as in:

val eval : 'a t -> 'a char list -> 'a list

This implementation is left as an exercise.

Extracting symbols

Symbol extraction is another example of an interpreter for parsers – a function that identifies the set of all possible inputs to a parser by traversing its tree:

val symbols : 'a parser -> char list

As with eval, it is defined as a recursive function that handles all parser constructors:

module CS = Set.Make (Char)

let symbols p =
  let rec aux : type a. a t -> CS.t = function
    | Fail _        -> CS.empty
    | Empty         -> CS.empty
    | Return _      -> CS.empty
    | Symbol c      -> CS.singleton c
    | Map (_, p)    -> aux p
    | Product (p,q) -> CS.union (aux p) (aux q)
    | Either (p,q)  -> CS.union (aux p) (aux q)
    | Fix f         -> aux @@ f @@ Fix (fun _ -> Fail "")
  in
  aux p |> CS.to_seq |> List.of_seq

Most cases are trivial – for instance, Symbol c returns a singleton set with c, and Either returns the union of the symbols from each branch. The Fix f is perhaps the most interesting case. The payload of Fix is a function of type: 'a t -> 'a t, so we just just need to pass it some parser that does not introduce any additional symbol, hence the Fail.

We can try it out on the float parser:

# symbols float;;
- : CS.elt list =
['.'; '0'; '1'; '2'; '3'; '4'; '5'; '6'; '7'; '8'; '9'; 'e']

Showing parsers

As a final example of a parser interpreter, we’ll write a function for converting parsers into strings:

val show : 'a t -> string

Once again, it follows a familiar pattern where we map each constructor:

let rec show : type a. a t -> string = function
  | Fail msg ->
    msg
  | Empty ->
    "empty"
  | Symbol c ->
    Printf.sprintf "symbol '%c'" c
  | Return x ->
    "return ?"
  | Map (f, p) ->
    Printf.sprintf "map ? (%s)" (show p)
  | Product (p,q) ->
    Printf.sprintf "product (%s) (%s)" (show p) (show q)
  | Either (p,q) ->
    Printf.sprintf "either (%s) (%s)" (show p) (show q)
  | Fix f ->
    Printf.sprintf "let rec f () = %s) in f ()" @@ show (f (Fail "f ()"))

Note that we’re not able to fully display all constructions. The payload of Map consists of a function that we just don’t know how to print. We also lack a printer for the argument to return. For fully transparent parsers we’d have to impose some additional structure – such as pairing arguments with printers – but for now we just insert some question marks to indicate the opaque arguments. Note also how we recover the recursive structure of Fix f by printing it as let rec .. . As an example, consider printing a parser obtained by the many combinator:

# let p = many (symbol 'a');;
val p : char list P.t = Fix <fun>
# print_endline @@ show p;;
let rec f () = either (map ? (product (map ? (symbol 'a')) (f ()))) (return ?)) in f ()

To see how this makes sense, here’s a version where the question marks have been filled in:

let rec f () =
  either
    (map
      List.cons
      (product
        (map id (symbol 'a'))
        (f ())
      )
    )
    (return [])
in
f ()

This very much resembles the initial recursive version of many that we started off with, before introducing the fixed-point combinator.

Optimizing the evaluator

Looking at the implementation of eval, you may have noticed that in every case – except for Fix f – there are no recursive calls in the body of the returned functions. For instance, in the Product (p,q) case, both p and q are evaluated upfront, before returning a function that consumes actual input. Thus, for parsers not using fix – all traces of the parser AST are eliminated by eval, and we pay no additional run-time cost during actual parsing.

Is it possible to amend the case for Fix f to achieve evaluation upfront? A first attempt would be to simply factor out the recursive call:

    | Fix f ->
      let g = eval @@ f (Fix f) in
      fun cs -> g cs

One realizes quickly, however, that this wont’ quite cut it as evaluation may get stuck on ever-expanding f (Fix f) expressions. Remember that fix was introduced to encode recursive definitions required for the many combinator. The strategy is to transform it back to a recursive definition at evaluation level, similar to what we did for printing in the implementation of show, above.

In other words, we need to go from:

Fix f

To, an expression of the form:

let rec aux () =
  ...
  aux ()
in
aux ()

Where the body of aux is given by applying the function f to some other expression that may call back to itself. The problem, however, is that the argument to f must itself be parser expression. We can work around this by introducing another node for keeping a raw evaluator:

  type 'a t =
    ...
    | Raw : (char list -> ('a * char list) option) -> 'a t

Exposing such a constructor would of course defeat the purpose of having a fixed-point combinator in the first place, as we’d have to give up symbol extraction and printing anyway. However, we’ll only use it as an internal helper node. If we exclusively export the smart constructors, and make the type 'a t', either abstract or private, users would never be able to construct raw nodes. We can thus safely ignore such cases for any other evaluator.

Coming back to the implementation of eval, here’s how the recursive transformation is achieved:

let rec eval :  type a. a t -> char list -> (a * char list) option = fun p ->
  ...
  | Fix f ->
    let rec k = lazy (eval @@ f (Raw (fun cs -> Lazy.force k cs))) in
    Lazy.force k
  | Raw f ->
    f

The recursive definition of k, is calling eval lazily. However, since we immediately force it, we only ever evaluate it once! The expression returned by applying f may contain a sub-node Raw g, where g refers back to k. An alternative option would be to use ref-cells instead of the lazy constructs.

A few more tweaks

Before looking at another example of how to use the parsing library, there are a few utility functions that will come in handy:

val one_of  : string -> char t
val between :'a t -> 'b t -> 'c t -> 'c t
val forget  :'a t -> a t

Here:

  • The parser one_of s succeeds on any symbol from the given string s.
  • between p1 p2 p is a parser that first parser p1, followed by p, followed by p2 and only returns the result of p.
  • The parser forget p, runs p and returns its result without consuming any input

The functions one_of and between may be written in terms of the existing combinators. As for forget, it’s useful for determining that a parser is followed by another parser by looking ahead, and requires a new node in the parser AST:

type 'a t =
  ...
  | Forget    : 'a t -> 'a t

Finally, it’s a bit silly to only expose a primitive Token constructor for parsing a single character, when so many parsers are based on parsing specific strings – such as keywords in some grammar. As a pure optimization strategy, we can replace Token with a more generalized version that accepts any character from a given set:

module CS = Set.Make (Char)

type 'a t =
  ...
  | One_of    : CS.t -> char t

The complementing code changes are straight forward.

An example – parsing s-expressions

To give a more elaborate example of writing a parser using this library, we’ll consider parsing s-expressions.

An s-expression may be represented in OCaml by a data type like:

type atom =
  | Int of int
  | Float of float
  | String of string
  | Symbol of string

type sexp =
  | Atom of atom
  | List of sexp list

The goal is to define a parser of type sexp t, that can parse strings into sexp values. Here’s an example:

(var "x" ((times (plus 1 2) (val "y"))))

This string would be parsed into the following OCaml value:

List
  [Atom (Symbol "var"); Atom (String "x");
  List
    [List
      [Atom (Symbol "times");
      List
        [Atom (Symbol "plus"); Atom (Int 1);
        Atom (Int 2)];
      List
        [Atom (Symbol "val"); Atom (String "y")]]]]

The parser should differentiate between different types of input, such as symbols, integers, floats and (quoted) strings, which may also contain spaces and parentheses.

We’ll start with a few helpers for parsing different types of strings:

let space         = one_of "\n\t\r "
let left_paren    = symbol '('
let right_paren   = symbol ')'

let regular_char =
  let special_char  = list_of_string "()\n\r\t\" " in
  let is_regular c = not @@ List.mem c special_char in
  List.init 256 Char.chr |> List.filter is_regular |> string_of_list |> one_of

let regular_string = string_of_list <$> many_one regular_char

let quoted_string =
  let quote   = symbol '\"' in
  let string  = many (regular_char <|> space <|> left_paren <|> right_paren) in
  string_of_list <$> between quote quote string

It’s now fairly straight forward to combine the functions for parsing atoms:

let atom =
  let end_num =
    forget (empty <|> map ignore space <|> map ignore right_paren)
  in
  choice
    [ map (fun i -> Int i )   int <* end_num
    ; map (fun f -> Float f)  float <* end_num
    ; map (fun s -> String s) quoted_string
    ; map (fun s -> Symbol s) regular_string
    ]

Since the sexp type is recursive, we’ll again turn to fix to express that as a parser:

let sexpr =
  fix @@ fun sexpr ->
    let expr =
      either
        ((fun es -> List es) <$> between left_paren right_paren (many sexpr))
        ((fun a  -> Atom a ) <$> atom)
    in
    between (many space) (many space) expr

Conclusion

This post covered quite a few different concepts, including:

  • Parser combinators.
  • Difference between monadic and applicative parser APIs.
  • Applicative functor composition and complementary let+ .. and+ .. syntax.
  • GADTs for representing abstract syntax trees.
  • Fixed-point combinators for encoding recursive expressions.

It serves to exemplify a more general pattern – related to initial algebras – where the approach is to identify a data type, for a particular domain, while making as few commitments as possible. Whenever we define a value in the algebra – in our case, the parser algebra – we don’t loose any information that went into constructing it. We can thus map it to other more refined representations. For parsers, this meant providing multiple backends in the form of evaluator and printer functions.

You’ll find the full code, including the examples, here.