
Parsing a Functional Language

I’m (slowly) working on a small compiler for a tiny functional language called PCF. Probably influenced by too much time on Hacker News, I’m writing the compiler in Rust.

I recently got the parser working, and I figured it'd be a nice opportunity to share a few fun tidbits I learned. I used a recursive descent strategy, and the AST uses De Bruijn indices.

The source language

PCF is a pretty small language. We don’t have any built-in arithmetic primitives: just the zero constant and the successor function. We can define addition like this:

plus = fix rec : nat -> nat -> nat in
  \ x : nat.
  \ y : nat.
  ifz x {
    Zero => y
  | Succ x' => Succ (rec x' y)
  }

This function takes two numbers and recurses on the first: each call peels one Succ off of x and wraps one Succ around the result of the recursive call, and once x hits Zero we just return y.

The fix operator brings the name "rec" into scope in the block following "in", where "rec" refers to that whole block itself - this is what makes the recursive call work.

Lexing

Writing the lexer actually turned out to be pretty fun. The lexer turns the source program into a vector of tokens, which are much easier to deal with than strings.

I had to take some care to make sure that lexing keywords didn’t interfere with the possible identifier names. I had a bug for a bit where if an identifier started with an “i”, the lexer would chop the “i” off the beginning.

Writing the lexer also made me wish that Rust could pattern match strings against regex. I learned SML has this feature, which improves my opinion of SML a bit. Seems like it could be a cool feature to implement.

My favorite part of writing the lexer was allowing variable names to contain single quotes. It only took a couple of minutes and felt really nice to have that sort of aesthetic control.
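
To give a feel for what this looks like, here's a rough sketch of a token type and lexing loop. This isn't the compiler's actual code - the token set is abbreviated and the names are invented - but it shows the overall shape: skip whitespace, turn punctuation into tokens, and read a whole word before deciding whether it's a keyword (which is the fix for the "i" bug above). Allowing single quotes in identifiers really is just one extra check in the identifier loop.

// A sketch, not the real lexer: abbreviated token set, invented names.
#[derive(Debug)]
enum Token {
    Lambda,          // '\'
    Colon,           // ':'
    Period,          // '.'
    Nat,             // the "nat" type keyword
    Ifz,
    Fix,
    In,
    Ident(String),   // variable names; single quotes allowed
}

fn lex(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            '\\' => { chars.next(); tokens.push(Token::Lambda); }
            ':' => { chars.next(); tokens.push(Token::Colon); }
            '.' => { chars.next(); tokens.push(Token::Period); }
            c if c.is_whitespace() => { chars.next(); }
            c if c.is_alphabetic() => {
                // Read the whole word *before* deciding whether it's a keyword;
                // matching keyword prefixes too eagerly is exactly what chopped
                // the "i" off of identifiers.
                let mut word = String::new();
                while let Some(&c) = chars.peek() {
                    if c.is_alphanumeric() || c == '\'' {
                        word.push(c);
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(match word.as_str() {
                    "nat" => Token::Nat,
                    "ifz" => Token::Ifz,
                    "fix" => Token::Fix,
                    "in" => Token::In,
                    _ => Token::Ident(word),
                });
            }
            _ => panic!("unexpected character {:?}", c),
        }
    }
    tokens
}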

Parsing

This was my first time handwriting a parser. It definitely changed the way I think about parsers. Before, I’d only used parser generators like YACC and Bison, so parser code has always felt like some weird DSL. Somehow, it was very freeing to realize that parsers are just “normal code.”

There were a handful of fun challenges PCF presented in parsing, which I cover below.

Recursive Descent

We read the token stream from left to right. Ideally, we could read the stream just once and have a correct parse tree in one go.

To do this, we need to be able to make correct parsing decisions by peeking only a couple of tokens ahead. No backtracking allowed. Grammars that satisfy this requirement are called LL(k) grammars, where k is the number of lookahead tokens.

Recursive descent parsers typically assume an LL(1) grammar, with only a single lookahead token - though I think this rule is frequently broken in practice.

Recursive descent parsing is essentially DFS: it builds the AST up in a single pass, depth-first.

To give you a feel, let's look at parsing lambdas. The relevant production from the grammar looks like

$E \;\to\; \backslash \, I : T \,.\, E$

(writing $E$ for expressions), which is to say, "if I'm a lambda, I look like a backslash followed by an identifier followed by a colon followed by a type followed by a dot followed by an expression."

And here’s the code:

    Token::Lambda => {
      // we're sitting on a '\' token: parse the parameter, its type, and the body
      let param = self.consume_ident();      // parameter name
      self.bindings.enter_bindsite(&param);  // bring it into scope for the body
      self.expect(Token::Colon);
      let param_type = self.build_type();    // recursively parse the type
      self.expect(Token::Period);
      let func_body = self.build_expr();     // recursively parse the body
      self.bindings.exit_bindsite(&param);
      // the parameter's name isn't stored: variables become De Bruijn indices
      ast::Expr::Lambda(
        param_type,
        Box::new(func_body)
      )
    },

If we’re sitting on a lambda token, then we eat the parameter name, eat the colon, recursively parse the type, eat the period, and then recursively parse the body expression.

Hopefully you can see how this is just a DFS :)

If you’re interested, here’s the full grammar for expressions:

$T$ is the variable for types and $I$ is the variable for identifiers.

Function application

Parsing function application turned out to be pretty tricky. PCF has an ML-style function application syntax, so if we have a function f and its argument x, applying f to x is just written "f x", without parentheses.

I was convinced for a while that this was impossible with recursive descent parsing. If you only have a single token of lookahead and you’re sitting on a lambda, how can you possibly know if this lambda is being applied or not?

I figured it out while reading about grammar factoring here. The parser just happily reads in a whole list of expressions, and treats it as one big curried function application.

I got the associativity wrong initially, which turned out to be pretty fun to fix. The solution was basically “turn a thing that is subtly a foldr into a thing that is less subtly a foldl.”
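
To make the foldl fix concrete, here's a tiny self-contained sketch using a toy Expr type (not the compiler's actual AST): given the already-parsed pieces of "f x y", we fold them together from the left.

// Sketch only: a toy AST and a left fold over the parsed pieces of an application.
#[derive(Debug)]
enum Expr {
    Var(String),
    App(Box<Expr>, Box<Expr>),
}

fn fold_application(atoms: Vec<Expr>) -> Expr {
    let mut atoms = atoms.into_iter();
    let first = atoms.next().expect("an application needs at least one expression");
    // foldl: each new piece becomes the argument of everything folded so far
    atoms.fold(first, |func, arg| Expr::App(Box::new(func), Box::new(arg)))
}

fn main() {
    let pieces = vec![
        Expr::Var("f".into()),
        Expr::Var("x".into()),
        Expr::Var("y".into()),
    ];
    // prints App(App(Var("f"), Var("x")), Var("y"))
    println!("{:?}", fold_application(pieces));
}

A right fold over the same list would instead produce App(f, App(x, y)), i.e. f applied to (x y), which is the wrong associativity.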

In the grammar, the $Z$ variable is the result of factoring. It makes sure that function application isn’t ambiguously generated.

Let’s talk about that bindsite stuff now.

De Bruijn Indices

This is a technique for representing variables that sidesteps issues with shadowing. Here's an example:

\ x : nat . 
\ y : nat . ifz x { 
              Zero => Succ (y) 
            | Succ x => Succ (Succ (x)) 
            }

Here, the variable x bound on line 4 shadows the variable x bound on line 1. De Bruijn indices identify variables by integers, allowing us to distinguish which version of a shadowed variable we're talking about.

With De Bruijn indices, variable 1 is the one bound at the most recent bindsite, variable 2 is the one bound just before that, and so on. Our sample program gets parsed as

Lambda(Nat, 
Lambda(Nat, Ifz(Var("x", 2), 
            Succ(Var("y", 1)), 
            Succ(Succ(Var("x", 1)))
            )))

You can see that the x referenced on line 2 differs from the x referenced on line 4! As desired :)
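
For the curious, here's a minimal sketch of how a binding stack can hand out these indices. It mirrors the enter_bindsite/exit_bindsite calls from the lambda-parsing excerpt earlier, but the type and its internals are purely illustrative, not the compiler's actual implementation.

// Illustrative only: a name stack that resolves variables to De Bruijn indices.
struct Bindings {
    stack: Vec<String>, // most recent binding at the end
}

impl Bindings {
    fn enter_bindsite(&mut self, name: &str) {
        self.stack.push(name.to_string());
    }

    fn exit_bindsite(&mut self, _name: &str) {
        // the name is only here to mirror the call sites in the excerpt above
        self.stack.pop().expect("exit_bindsite without a matching enter_bindsite");
    }

    // 1 = most recently bound, 2 = the one bound before that, and so on.
    fn lookup(&self, name: &str) -> Option<usize> {
        self.stack
            .iter()
            .rev()
            .position(|bound| bound.as_str() == name)
            .map(|depth| depth + 1)
    }
}

fn main() {
    let mut bindings = Bindings { stack: Vec::new() };
    bindings.enter_bindsite("x");
    bindings.enter_bindsite("y");
    assert_eq!(bindings.lookup("x"), Some(2)); // x is two binders up
    assert_eq!(bindings.lookup("y"), Some(1));
    bindings.enter_bindsite("x"); // like the Succ branch: shadows the outer x
    assert_eq!(bindings.lookup("x"), Some(1));
    bindings.exit_bindsite("x");
    assert_eq!(bindings.lookup("x"), Some(2));
}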

Wrap-up

If you made it this far, thanks for reading! Hope you learned something from this post; I know I certainly learned a lot writing it.

As I get more work done on this compiler, I’ll be sure to post updates here.

Good luck with your lives ~

Traversing the Infinite Complete $\omega$-nary Tree

The infinite complete $\omega$-nary tree is one where every node has $\mathbb{N}$-many children. There are no leaves; the tree just extends downward infinitely. Call this graph $\mathfrak{G}$.

We can’t BFS or DFS over $\mathfrak{G}$. A DFS would simply get stuck on the leftmost branch forever and a BFS would never reach depth 2. How then are we to traverse it?

In the infinite complete binary tree, nodes are uniquely identified by a finite-length binary string. In $\mathfrak{G}$, nodes are uniquely identified by a finite sequence of natural numbers. Let $s(v)$ be $v$'s corresponding sequence. In $\mathfrak{G}$, $u$ is the parent of $v$ iff $s(u)$'s length is one less than $s(v)$'s and $s(u)$ is a prefix of $s(v)$.

Any tree traversal produces a well order on the tree's vertices. BFS on the complete infinite binary tree gives the shortlex ordering (sort first by length, then lexicographically). In fact, on level $i$, the corresponding binary strings are exactly the $i$-bit representations of the natural numbers $0$ through $2^i - 1$, and the nodes are visited in increasing order.

Further, any such tree traversal has order type $\omega$.

A traversal of $\mathfrak{G}$ is a well order on the nodes of $\mathfrak{G}$. What does this order look like? Here’s the idea (nodes are represented as int tuples):

def traverse(visit=print):
    # maps each visited node (an int tuple) to the index of its next unvisited child
    visitedNextChild = {(): 0}
    while True:
        # snapshot the keys, since we add new entries to the dict inside the loop
        currentlyVisited = list(visitedNextChild.keys())
        for v in currentlyVisited:
            nextChild = v + (visitedNextChild[v],)
            visit(nextChild)
            visitedNextChild[v] += 1
            visitedNextChild[nextChild] = 0

Here, we start with the root node, which we represent as the empty tuple. We maintain a mapping from visited nodes to the next child of theirs to visit. At each iteration, we visit each of these prescribed next children and update the mapping.
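
For instance, the first pass of the loop visits (0); the second visits (1) and (0, 0); the third visits (2), (0, 1), (1, 0), and (0, 0, 0); and so on - every pass visits one new child for each node visited so far.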

The fact that this visits every node in $\mathfrak{G}$ follows easily by induction.

In math symbols, if $S_i$ is the set of visited nodes at iteration $i$, then

(there are totally duplicates being added here, but that’s the beauty of sets).

Fix the nodes $u = s_1\ldots s_{n-1}$ and $v = s_1\ldots s_{n-1}s_n$. Define $t(x)$ to be the iteration at which $x$ is visited. Then $t(v) = t(u) + s_n + 1$. This leads to this gorgeous fact:

$s_1\ldots s_n$ is visited at iteration $\sum_{i=1}^n (s_i + 1) = n + \sum_{i=1}^n s_i$.
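
For example (still zero-indexing children), the node $(2, 0, 1)$ is visited at iteration $(2 + 1) + (0 + 1) + (1 + 1) = 6$: its ancestors $(2)$ and $(2, 0)$ are visited at iterations $3$ and $4$, and its own $s_3 = 1$ costs two more iterations.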

This means that our tree traversal has a pretty interesting sub-relation: namely, $u$ comes before $v$ whenever $u$'s length plus $u$'s digit sum is less than $v$'s length plus $v$'s digit sum. Or, if we one-index, whenever $u$'s digit sum alone is less than $v$'s.

From here on out, we’ll one-index for simplicity’s sake. (That is, assume $\mathbb{N}$ starts at 1).

Let's see if we can characterize the entire ordering. (That is, instead of building a relation based on iteration number, build one based on the precise order of the traversal.)

It's essentially the same relation: compare digit sums, and if they tie, recurse on the largest proper prefix of each.

def lessThan(u, v):
    # is u < v in the traversal order?  (u and v are tuples of positive ints)
    if sum(u) != sum(v):            # first compare digit sums
        return sum(u) < sum(v)
    # digit sums tie: recurse on the largest proper prefix of each
    return len(u) > 0 and lessThan(u[:-1], v[:-1])
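
For example, (1) comes before (2) because its digit sum is smaller, while (2) comes before (1, 1): their digit sums tie at 2, so we compare their prefixes () and (1), and the root comes first.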

So the empty sequence is the least element (as we visit the root of $\mathfrak{G}$ first). I’m fairly certain that if you create the corresponding $\leq$ relation, this becomes a total order.

Here’s the cool thing: we’ve produced an order on $\mathbb{N}^*$ that has order type $\omega$! (The normal shortlex trick doesn’t work when our alphabet is countably infinite).

In general, if we want to produce an ordering of order type $\omega$ on $\mathbb{N}^*$, it suffices to partition $\mathbb{N}^*$ into countably many classes, each of finite size. Then the "concatenation" of these classes yields order type $\omega$.
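
Our traversal is an instance of exactly this trick: the $n$-th class is the (finite) set of sequences whose digit sum is $n$.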

Just some fun observations :)

Consider Your Hardware

This is a story about a compiler optimization that didn’t optimize.

My compilers class last semester had a lab focused on optimizations. Our compiler targeted x86_64 assembly and the optimization in question removed register-register moves. That is, if you have an instruction of the form

mov %rax, %rbx

the compiler will modify the generated code, renaming occurrences of rbx after this instruction to rax. This is called register coalescing.

We had a good bit of trouble implementing it (mostly because of a poorly-thought-out interference graph implementation). After a couple frustrating days, we got it working.

The result? No statistically significant improvement. Dang.

At first glance, maybe this isn’t all that surprising. Movs are supposed to be cheap, right?

But there's a little more to the story. It turns out (I didn't know this at the time) that modern Intel processors have way more than 16 registers - they often have over 100. There's something called a register file, and the processor maintains a mapping from the x86_64 architectural registers onto those physical registers. In fact, register-register movs can become nothing more than a rename in that mapping.

So the optimization that we spent a couple days implementing was already taken care of for us at the hardware level.

Being the impetuous 20-year-old I am, I quickly developed the opinion that register coalescing is dumb. An antiquated optimization, made obsolete by modern hardware.

But a conversation with the OS professor here changed my mind.

In general, he says, just because you can do something at the hardware level doesn't mean you should. With the rise of mobile devices and the #internetOfThings, a lot of widely-used processors don't have much power to spare. If you can compile a binary once and take care of all the register-register moves once and for all, sometimes that's worthwhile.

It’s an interesting topic - very cool to think of the compiler as just a small part of a larger whole. The compiler and the hardware together share the responsibility of making code run efficiently - how the work is divided is a complex question.

And more generally, it’s a reminder to think critically about the hardware your code is running on prior to jumping into a nontrivial task. Premature optimization strikes again!