Parsing a Functional Language
I’m (slowly) working on a small compiler for a tiny functional language called PCF. Probably influenced by too much time on Hacker News, I’m writing the compiler in Rust.
I recently got the parser working, and I figured it’d be a nice opportunity to share some little fun tidbits I learned. I used a recursive descent strategy, and the AST uses De Bruijn Indices.
The source language
PCF is a pretty small language. We don’t have any built-in arithmetic primitives: just the zero constant and the successor function. We can define addition like this:
This program takes two numbers and, in each recursive call, decrements the first argument by one while incrementing the second. Then we’re done once the first argument is 0.
The fix operator binds the name “rec” to the entire block following “in”.
Writing the lexer actually turned out to be pretty fun. The lexer turns the source program into a vector of tokens, which are much easier to deal with than strings.
I had to take some care to make sure that lexing keywords didn’t interfere with the possible identifier names. I had a bug for a bit where if an identifier started with an “i”, the lexer would chop the “i” off the beginning.
Writing the lexer also made me wish that Rust could pattern match strings against regex. I learned SML has this feature, which improves my opinion of SML a bit. Seems like it could be a cool feature to implement.
My favorite part of writing the lexer was allowing variable names to contain single quotes. It only took a couple of minutes and felt really nice to have that sort of aesthetic control.
This was my first time handwriting a parser. It definitely changed the way I think about parsers. Before, I’d only used parser generators like YACC and Bison, so parser code has always felt like some weird DSL. Somehow, it was very freeing to realize that parsers are just “normal code.”
There were a handful of fun challenges PCF presented in parsing, which I cover below.
We read the token stream from left to right. Ideally, we could just read the stream once and have a correct parse tree in the first go.
In order to do this, we need to able to make correct parsing decisions by just peeking a couple characters ahead. No backtracking allowed. The set of grammars that satisfy this requirement are called LL(k) grammars.
Recursive descent parsers rely on LL(1) grammars, with only a single lookahead token. Though I think this rule is frequently broken in practice.
Recursive descent parsing is just DFS. It builds the AST up in a single pass, DFS-style.
To give you a feel, let’s look at parsing lambdas. The relevant line from the grammar looks like
Which is to say, “if I’m a lambda, I look like a backslash followed by an identifier followed by a colon followed by a type followed by a dot followed by an expression.”
And here’s the code:
If we’re sitting on a lambda token, then we eat the parameter name, eat the colon, recursively parse the type, eat the period, and then recursively parse the body expression.
Hopefully you can see how this is just a DFS :)
If you’re interested, here’s the full grammar for expressions:
$T$ is the variable for types and $I$ is the variable for identifiers.
Parsing function application turned out to be pretty tricky. PCF has a ML-style function application syntax, so if we have a function f and its argument x, applying f to x is just written “f x” without parenthesis.
I was convinced for a while that this was impossible with recursive descent parsing. If you only have a single token of lookahead and you’re sitting on a lambda, how can you possibly know if this lambda is being applied or not?
I figured it out while reading about grammar factoring here. The parser just happily reads in a whole list of expressions, and treats it as one big curried function application.
I got the associativity wrong initially, which turned out to be pretty fun to fix. The solution was basically “turn a thing that is subtly a foldr into a thing that is less subtly a foldl.”
In the grammar, the $Z$ variable is the result of factoring. It makes sure that function application isn’t ambiguously generated.
Let’s talk about that bindsite stuff now.
De Bruijin Indices
This is a technique to avoid issues with variable shadowing. Here’s an example:
Here, the variable x bound on line 4 shadows the variable x bound on line 1. De Bruijin indices identify variables by integers, allowing us to distinguish which version of a shadowed variable we’re talking about.
In De Bruijn indices, the variable 1 is the variable bound by the most recent bindsite. Variable 2 is the one most recently bound after that. And so on. Our sample program gets parsed as
You can see that the x referenced on line 2 differs from the x referenced on line 4! As desired :)
If you made it this far, thanks for reading! Hope you learned something from this post; I know I certainly learned a lot writing it.
As I get more work done on this compiler, I’ll be sure to post updates here.
Good luck with your lives ~