
Chinese Learning Resources

I’ve spent many hours honing my learning-Chinese setup, and this page is intended to share some of the resources I’ve found.

Define-on-highlight in Kindle (with pinyin)

I consider define-on-highlight an essential feature for reading ‘real’ Chinese long-form material. I don’t have the patience to handwrite characters into Pleco every time I don’t understand what’s going on. This problem is solved with the following (very-very-very-critical-to-my-life) file [here]. It’s a mobi file for the Chinese-English dictionary CC-CEDICT.

I put this file in the documents/dictionaries directory on my Kindle; it might be different for your e-reader. If your e-reader doesn’t support mobi, I suggest the wonderful tool ebook-convert as a means to convert it to epub.

This isn’t a perfect solution; CC-CEDICT is probably my least favorite Chinese-English dictionary around, but I have been unable to find any other dictionaries in electronic form. The Kindle has a Chinese-English dictionary by default, but for baffling reasons, it does not include pinyin.

Another good approach would be to add pinyin to the built-in Kindle dictionary, but the file format might be obfuscated.

Programmatically collecting vocab words from books

ChineseTextAnalyzer is the best program of this type I’ve seen so far (shout out to the incredible 书博 for telling me about this one). It does a lot of stuff, but I’ll just outline my favorite part here.

You feed it a text file and it splits the file up into words (presumably with longest-match against CC-CEDICT). Based on a set of ‘known words,’ ChineseTextAnalyzer can give you a list of unknown words from the corpus, sorted by frequency of appearance in the text. It’s then super easy to export these words into a CSV file and import into Anki.
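To make the longest-match idea concrete, here’s a toy Python sketch (my guess at the approach, not ChineseTextAnalyzer’s actual code): segment greedily against a dictionary, then rank the words you don’t know by frequency.

```python
from collections import Counter

def segment(text, dictionary, max_len=4):
    """Greedy longest-match segmentation against a word list."""
    words, i = [], 0
    while i < len(text):
        # try the longest dictionary match first; fall back to one character
        for n in range(max_len, 0, -1):
            if n == 1 or text[i:i + n] in dictionary:
                words.append(text[i:i + n])
                i += n
                break
    return words

def unknown_by_frequency(text, dictionary, known):
    """Unknown words in the text, most frequent first."""
    counts = Counter(w for w in segment(text, dictionary) if w not in known)
    return [w for w, _ in counts.most_common()]
```

Exporting the resulting list as CSV is then trivial, and Anki imports CSV directly.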

I find this method of gathering vocab to be much more motivating than studying lists of words from a textbook. This way, I get longer-form reading with content I’m interested in, and a vocab list custom-tailored to the content I want to consume.

Parsing a Functional Language

I’m (slowly) working on a small compiler for a tiny functional language called PCF. Probably influenced by too much time on Hacker News, I’m writing the compiler in Rust.

I recently got the parser working, and I figured it’d be a nice opportunity to share some little fun tidbits I learned. I used a recursive descent strategy, and the AST uses De Bruijn Indices.

The source language

PCF is a pretty small language. We don’t have any built-in arithmetic primitives: just the zero constant and the successor function. We can define addition like this:

plus = fix rec : nat -> nat -> nat in
  \ x : nat.
  \ y : nat.
  ifz x {
      Zero => y
    | Succ x' => Succ (rec x' y)
  }

This program takes two numbers and, in each recursive call, decrements the first argument by one while incrementing the second. Then we’re done once the first argument is 0.

The fix operator binds the name “rec” to the entire block following “in”.

The lexer

Writing the lexer actually turned out to be pretty fun. The lexer turns the source program into a vector of tokens, which are much easier to deal with than strings.

I had to take some care to make sure that lexing keywords didn’t interfere with the possible identifier names. I had a bug for a bit where if an identifier started with an “i”, the lexer would chop the “i” off the beginning.

Writing the lexer also made me wish that Rust could pattern match strings against regex. I learned SML has this feature, which improves my opinion of SML a bit. Seems like it could be a cool feature to implement.

My favorite part of writing the lexer was allowing variable names to contain single quotes. It only took a couple of minutes and felt really nice to have that sort of aesthetic control.
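Here’s a small Python sketch of the fix (illustrative, not the Rust lexer itself): take the maximal identifier first, then check whether the whole word is a keyword. That way an identifier like input never loses its leading “i”, and the identifier pattern can allow single quotes.

```python
import re

KEYWORDS = {"ifz", "in", "fix"}  # an illustrative subset

def lex_word(src, i):
    # maximal munch: read the whole word, allowing single quotes
    m = re.match(r"[A-Za-z][A-Za-z0-9']*", src[i:])
    word = m.group(0)
    # only now decide whether the whole word is a keyword
    kind = "KEYWORD" if word in KEYWORDS else "IDENT"
    return (kind, word), i + len(word)
```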

The parser

This was my first time handwriting a parser. It definitely changed the way I think about parsers. Before, I’d only used parser generators like YACC and Bison, so parser code has always felt like some weird DSL. Somehow, it was very freeing to realize that parsers are just “normal code.”

There were a handful of fun challenges PCF presented in parsing, which I cover below.

Recursive Descent

We read the token stream from left to right. Ideally, we could read the stream just once and have a correct parse tree on the first go.

In order to do this, we need to be able to make correct parsing decisions by peeking just a few tokens ahead. No backtracking allowed. The grammars that satisfy this requirement are called LL(k) grammars, where k is the number of lookahead tokens.

Recursive descent parsers rely on LL(1) grammars, using only a single token of lookahead, though I think this rule is frequently broken in practice.

Recursive descent parsing is just DFS: each grammar rule becomes a function, and the parser builds the AST in a single pass by recursing into subexpressions, depth-first.
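A minimal skeleton in Python (a sketch of the pattern, not my Rust code) shows the shape: one method per grammar rule, one token of lookahead.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # single token of lookahead
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        assert self.peek() == tok, f"expected {tok}, got {self.peek()}"
        self.pos += 1

    def build_nat(self):
        # Nat ::= "zero" | "succ" Nat -- the rule becomes a recursive method
        if self.peek() == "zero":
            self.pos += 1
            return ("Zero",)
        self.expect("succ")
        return ("Succ", self.build_nat())
```

Calling `Parser(["succ", "succ", "zero"]).build_nat()` walks the token stream left to right, recursing once per `succ`.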

To give you a feel, let’s look at parsing lambdas. The relevant line from the grammar looks like

$E \rightarrow \backslash \; I : T \; . \; E$

which is to say, “if I’m a lambda, I look like a backslash followed by an identifier followed by a colon followed by a type followed by a dot followed by an expression.”

And here’s the code:

    Token::Lambda => {
        let param = self.consume_ident();
        self.consume(Token::Colon);
        let param_type = self.build_type();
        self.consume(Token::Dot);
        let func_body = self.build_expr();
        Expr::Lambda(param, param_type, Box::new(func_body))
    }

If we’re sitting on a lambda token, then we eat the parameter name, eat the colon, recursively parse the type, eat the period, and then recursively parse the body expression.

Hopefully you can see how this is just a DFS :)

If you’re interested, here’s the full grammar for expressions:

$T$ is the variable for types and $I$ is the variable for identifiers.

Function application

Parsing function application turned out to be pretty tricky. PCF has an ML-style function application syntax, so if we have a function f and its argument x, applying f to x is just written “f x” without parentheses.

I was convinced for a while that this was impossible with recursive descent parsing. If you only have a single token of lookahead and you’re sitting on a lambda, how can you possibly know if this lambda is being applied or not?

I figured it out while reading about grammar factoring here. The parser just happily reads in a whole list of expressions, and treats it as one big curried function application.

I got the associativity wrong initially, which turned out to be pretty fun to fix. The solution was basically “turn a thing that is subtly a foldr into a thing that is less subtly a foldl.”
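The fix, in a toy Python sketch (names are mine, not the compiler’s): gather the list of atomic expressions, then fold it left, so that “f x y” becomes an application of an application.

```python
def parse_application(atoms):
    # atoms: already-parsed atomic expressions, left to right.
    # Folding LEFT makes application left-associative:
    # f x y  =>  App(App(f, x), y), not App(f, App(x, y)).
    expr = atoms[0]
    for arg in atoms[1:]:
        expr = ("App", expr, arg)
    return expr
```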

In the grammar, the $Z$ variable is the result of factoring. It makes sure that function application isn’t ambiguously generated.

Let’s talk about that bindsite stuff now.

De Bruijn Indices

This is a technique to avoid issues with variable shadowing. Here’s an example:

\ x : nat .
\ y : nat . ifz x {
  Zero => Succ (y)
  | Succ x => Succ (Succ (x))
}

Here, the variable x bound on line 4 shadows the variable x bound on line 1. De Bruijn indices identify variables by integers, allowing us to distinguish which version of a shadowed variable we’re talking about.

In De Bruijn indices, the variable 1 is the variable bound by the most recent bindsite. Variable 2 is the one most recently bound after that. And so on. Our sample program gets parsed as

Lambda(Nat,
  Lambda(Nat,
    Ifz(Var("x", 2),
        Succ(Var("y", 1)),
        Succ(Succ(Var("x", 1))))))

You can see that the x referenced on line 2 differs from the x referenced on line 4! As desired :)
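A toy Python converter (a sketch of the idea, not the Rust implementation) makes the numbering mechanical: carry a list of binder names, innermost first, and replace each variable with its 1-based position in that list.

```python
def to_debruijn(expr, env=()):
    # env holds binder names, innermost first
    kind = expr[0]
    if kind == "Var":
        # 1-based index of the nearest enclosing binder with this name
        return ("Var", expr[1], env.index(expr[1]) + 1)
    if kind == "Lambda":  # ("Lambda", name, body)
        _, name, body = expr
        return ("Lambda", to_debruijn(body, (name,) + env))
    # other node kinds (Succ, Ifz, ...) would recurse the same way
    raise ValueError(f"unhandled node: {kind}")
```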


If you made it this far, thanks for reading! Hope you learned something from this post; I know I certainly learned a lot writing it.

As I get more work done on this compiler, I’ll be sure to post updates here.

Good luck with your lives ~

Traversing the Infinite Complete $\omega$-nary Tree

The infinite complete $\omega$-nary tree is one where every node has $\mathbb{N}$-many children. There are no leaves; the tree just extends downward infinitely. Call this graph $\mathfrak{G}$.

We can’t BFS or DFS over $\mathfrak{G}$. A DFS would simply get stuck on the leftmost branch forever and a BFS would never reach depth 2. How then are we to traverse it?

In the infinite complete binary tree, nodes are uniquely identified by a finite-length binary string. In $\mathfrak{G}$, nodes are uniquely identified by a finite sequence of natural numbers. Let $s(v)$ be $v$’s corresponding sequence. In $\mathfrak{G}$, $u$ is the parent of $v$ iff $s(u)$’s length is one less than $s(v)$’s and $s(u)$ is a prefix of $s(v)$.

Any tree traversal produces a well order on the tree’s vertices. BFS on the complete infinite binary tree is the shortlex ordering (sort first by length, then lexicographically). In fact, on level $i$, the set of corresponding binary strings is the set of all $i$-bit natural numbers, and the nodes are visited in increasing order.
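Shortlex is easy to state as a sort key, and sorting the first couple of levels of binary strings with it reproduces the BFS visiting order (a quick Python check):

```python
def shortlex_key(s):
    # compare by length first, then lexicographically
    return (len(s), s)

# the first two levels of the infinite binary tree, in BFS order
bfs_order = sorted(["10", "0", "11", "", "01", "1", "00"], key=shortlex_key)
```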

Further, any such tree traversal has order type $\omega$.

A traversal of $\mathfrak{G}$ is a well order on the nodes of $\mathfrak{G}$. What does this order look like? Here’s the idea (nodes are represented as int tuples):

def traverse():
    # map each visited node (an int tuple) to the index of its next unvisited child
    visitedNextChild = {(): 0}
    while True:
        # snapshot the keys: the dict grows while we iterate
        currentlyVisited = list(visitedNextChild.keys())
        for v in currentlyVisited:
            nextChild = v + (visitedNextChild[v],)
            visitedNextChild[v] += 1
            visitedNextChild[nextChild] = 0
            yield nextChild

Here, we start with the root node, which we represent as the empty tuple. We maintain a mapping from visited nodes to the index of the next child of theirs to visit. At each iteration, we visit each of the prescribed next children and update the mapping.

The fact that this visits every node in $\mathfrak{G}$ follows easily by induction.

In math symbols, if $S_i$ is the set of visited nodes after iteration $i$, and $t(v)$ denotes the iteration at which $v$ is first visited (made precise below), then

$$S_{i+1} = S_i \cup \{\, v \cdot (c) : v \in S_i,\ 0 \le c \le i - t(v) \,\}$$

(there are totally duplicates being added here, since every previously-visited child of $v$ shows up again, but that’s the beauty of sets).

Fix the nodes $u = s_1\ldots s_{n-1}$ and $v = s_1\ldots s_{n-1}s_n$. Define $t(x)$ to be the iteration at which $x$ is visited. Then $t(v) = t(u) + s_n + 1$. This leads to this gorgeous fact:

$s_1\ldots s_n$ is visited at iteration $\sum_{i=1}^n (s_i + 1) = n + \sum_{i=1}^n s_i$.
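As a quick sanity check in Python (with children 0-indexed, as above):

```python
def iteration_visited(node):
    # iteration at which s_1...s_n is first visited:
    # sum of (s_i + 1), i.e. length plus digit sum
    return len(node) + sum(node)

# e.g. the root is visited at iteration 0, its child 2 at iteration 3
```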

This means that our tree traversal has a pretty interesting sub-relation: namely, $u$ comes before $v$ if $u$’s length plus $u$’s digit sum is less than $v$’s length plus $v$’s digit sum. Or, if we one-index, we can compare just the digit sums.

From here on out, we’ll one-index for simplicity’s sake. (That is, assume $\mathbb{N}$ starts at 1).

Let’s see if we can characterize the entire ordering. (That is, instead of building a relation based on iteration, build a relation built on precise ordering of traversal).

It’s exactly the same relation, but if they tied, you recurse on the largest proper prefix of each.

def lessThan(u, v):
    # u < v in traversal order; digit sums tie-break by
    # recursing on the largest proper prefixes
    if sum(u) != sum(v):
        return sum(u) < sum(v)
    return u != () and lessThan(u[:-1], v[:-1])

So the empty sequence is the least element (as we visit the root of $\mathfrak{G}$ first). I’m fairly certain that if you create the corresponding $\leq$ relation, this becomes a total order.

Here’s the cool thing: we’ve produced an order on $\mathbb{N}^*$ that has order type $\omega$! (The normal shortlex trick doesn’t work when our alphabet is countably infinite).

In general, if we want to produce an ordering of order type $\omega$ on $\mathbb{N}^*$, it suffices to partition $\mathbb{N}^*$ into countably many classes, each of finite size. Then the “concatenation” of these classes yields order type $\omega$.
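A Python sketch of that recipe for $\mathbb{N}^*$ (one-indexed alphabet): class $k$ holds the sequences with digit sum $k$, which are exactly the compositions of $k$, and there are only finitely many of those.

```python
from itertools import count, product

def enumerate_nstar():
    # order type omega: finite class 0, then finite class 1, ...
    for k in count(0):
        for n in range(k + 1):
            # sequences of n positive integers summing to k
            for seq in product(range(1, k + 1), repeat=n):
                if sum(seq) == k:
                    yield seq
```

Every sequence of positive integers appears exactly once, inside its digit-sum class.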

Just some fun observations :)