Hello again! I’m picking up my series on Automata, with this post that goes into what I had always meant to get to: parsers. We’ll check out the old-school linear time parsing algorithms, which only need to go over the input once, without backtracking or caching. Originally this was one big post, but given the feedback I’ve gotten from (non-)readers, I’ve now split it up into two. In this first post we’ll check out LL, parse tables, and recursive descent. This post is meant to be readable for people unfamiliar with parsing, and yet be interesting for those who are familiar with the more traditional explanations! It’s still interesting because I like to explain things from an automata point of view instead of a procedural algorithm. I’ll use examples of grammars, and tables, and automata, and even some Rust code to show you how to implement a parser. The second post is on LR parsing. Enjoy!
Refresher from Pushy Automata
We’ll start with a brief refresher from the previous post of the series, pushy automata, since that was a little while back.
Push-Down Automata
Push-down automata (PDAs) are automata with a stack. Normal finite automata just consume input and have fixed memory via their states. PDAs can remember things on a single stack too, by pushing onto it and popping from it. Here’s a deterministic PDA for recognising the language of words that start with zeroes, followed by an equal number of ones:
So we start in the initial state, see if there is a 0 as input, ignore the top of the stack, and put a marker for the end of the stack on the stack. Now we’re in the second state, in which we can consume more zeroes from the input and put those on the stack. If we find a one as input, we remove a zero from the stack by not pushing anything new on the stack. Now we’re in the third state, where we remove zeroes from the stack for every one in the input, until we consume the final one by removing the end-of-stack marker from the stack.
Aside: This is one of the examples from the old blog post, and I now see that it is missing a transition! This automaton rejects the input 01, because there is no transition for reading a one while the end-of-stack marker is on top. Oops ^_^
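Just to make the transition structure concrete, here’s a minimal sketch of this recogniser as Rust code. The state names and the `'$'` end-of-stack marker are my own picks rather than the diagram’s, and it includes the transition that the diagram is missing:

```rust
// A minimal sketch of the 0^n 1^n recogniser; state names and the '$'
// end-of-stack marker are my own, not the diagram's.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Start,
    Zeroes,
    Ones,
    Accept,
}

fn accepts(input: &str) -> bool {
    let mut state = State::Start;
    let mut stack: Vec<char> = Vec::new();
    for c in input.chars() {
        let top = stack.pop();
        state = match (state, c, top) {
            // The first zero pushes the end-of-stack marker.
            (State::Start, '0', None) => {
                stack.push('$');
                State::Zeroes
            }
            // Further zeroes are remembered on the stack.
            (State::Zeroes, '0', Some(t)) => {
                stack.push(t);
                stack.push('0');
                State::Zeroes
            }
            // A one cancels a zero: pop it and push nothing back.
            (State::Zeroes, '1', Some('0')) | (State::Ones, '1', Some('0')) => State::Ones,
            // The final one removes the marker. Allowing this from both states
            // also covers the input 01 from the aside above.
            (State::Zeroes, '1', Some('$')) | (State::Ones, '1', Some('$')) => State::Accept,
            // Everything else gets stuck, i.e. the input is rejected.
            _ => return false,
        };
    }
    state == State::Accept && stack.is_empty()
}

fn main() {
    assert!(accepts("01"));
    assert!(accepts("000111"));
    assert!(!accepts("001"));
    assert!(!accepts("0110"));
}
```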
Context-Free Grammars, Derivations, and a naive PDA translation
A context-free grammar that describes the above language is:
(step) | $ S = 0\ S\ 1 $ |
() | $ S = \varepsilon $ |
Sort $S$ is the start symbol, the starting point in the grammar. If we’re using the grammar productively we start from the start symbol and use the rules left-to-right to replace sorts until we get the sentence in the corresponding context-free language that we want. Something like: $ S \to 0 S 1 \to 0 0 S 1 1 \to 0 0 1 1 $. This is called a derivation.
The most general, naive way to derive a push-down automaton for any context-free grammar is by pushing the end-of-stack and start symbol at the start, having a “main” state that uses the grammar rules with the body reversed (to deal with the stack order), and an accept state that pops the end-of-stack:
Here the stack grows left-to-right, so the lowest symbol on the stack is $ (end of stack), followed by S (the grammar start symbol). By the rules of the grammar we can manipulate the top of the stack and rewrite it to the body. If the input lines up with what we have on the stack, we can eliminate both. It’s simple, but inefficient because of all the nondeterminism.
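As a small worked example with the grammar above: on the input 0011, the naive PDA alternates between rewriting the topmost sort with a rule body (reversed) and cancelling a terminal on top of the stack against the next input symbol. Writing the top of the stack on the right, a run looks like this:

stack | remaining input | action |
---|---|---|
$ S | 0011 | rewrite S with the step rule (body reversed) |
$ 1 S 0 | 0011 | match 0 |
$ 1 S | 011 | rewrite S with the step rule (body reversed) |
$ 1 1 S 0 | 011 | match 0 |
$ 1 1 S | 11 | rewrite S with the ε rule |
$ 1 1 | 11 | match 1 |
$ 1 | 1 | match 1 |
$ | (empty) | pop the $ and accept |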
Derivations, Parse Trees and Ambiguity
Let’s look at a slightly more interesting grammar from a parser perspective:
(add) | $ S = S + S $ |
(mul) | $ S = S * S $ |
() | $ S = a $ |
When you want to derive $ a + a * a $, you can do this in all manner of ways. The following derivation picks just an arbitrary sort on which to apply a rule from the grammar:
- $ S + S $ (first S)
- $ a + S $ (first S)
- $ a + S * S $ (first S)
- $ a + S * a $ (second S)
- $ a + a * a $ (first S)
Notice how in some steps the leftmost $S$ was replaced, while in others the rightmost one was. Generally speaking, you’ll want either a leftmost or a rightmost derivation for parsers, which is to say: a grammar rule is always applied to the leftmost or rightmost sort. There are three reasons for this. The first is that you want a parser to be predictable in when it applies grammar rules, as you may connect so-called semantic actions to each rule. These are pieces of code that are run when the parser applies the particular rule. (A typical example is a simple calculator.) Such actions could perform side-effects, therefore order matters. For this reason, the difference between leftmost and rightmost derivation is observable. The two other reasons to want this predictable derivation order are ease of implementation, and ease of proving things about your algorithm. These last two care less about whether it’s leftmost or rightmost.
The most common semantic action I’m aware of is building a syntax tree with the parser. This builds a tree structure out of the parsed text. A parse tree, or concrete syntax tree, contains all the rule applications as seen in the grammar. An abstract syntax tree abstracts over some parts of the syntax tree, such as leaving out whitespace, or parentheses (the shape of the tree captures the precedence anyway), or injections (grammar rules of the form $ A = B $, with just a single sort as the body). Let’s look at some parse trees of the last example, $ a + a * a $:
Notice how the leaves of the two trees are in the same order left-to-right as the input, but for the left tree the plus is higher up in the tree, while in the right tree the star is higher up. If we want to interpret the input as simple arithmetic, where multiplication binds tighter than addition, the left tree is the one we want. This is the precedence of the operators: the star binds tighter than the plus.
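To make that concrete, here’s a hypothetical abstract syntax tree type for this little expression grammar, with the two readings of $ a + a * a $ as values; the type and constructor names are my own:

```rust
// A hypothetical AST for the expression grammar above; names are mine.
#[derive(Debug)]
enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    A,
}

fn main() {
    // Left tree: the plus is higher up, so `a * a` is grouped under it.
    let left = Expr::Add(
        Box::new(Expr::A),
        Box::new(Expr::Mul(Box::new(Expr::A), Box::new(Expr::A))),
    );
    // Right tree: the star is higher up, grouping `a + a` first.
    let right = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::A), Box::new(Expr::A))),
        Box::new(Expr::A),
    );
    println!("{left:?}\n{right:?}");
}
```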
When you can get multiple trees like this, the grammar is called ambiguous. More formally, if you use only leftmost derivations (or only rightmost) and still find two distinct derivations that give the same sentence, the grammar is ambiguous. So to be clear: the above trees can be created with only leftmost derivations, it’s not a matter of choosing one or the other for the two trees. Derivation order (leftmost or rightmost) has to do with side-effect order of semantic actions only. When you build trees you don’t really need side-effects, so the derivation order has no effect on it.
With that recap out of the way: For the purposes of this blog post, we’ll look at unambiguous grammars for the languages we want to parse. Still, whether you use leftmost derivation or rightmost derivation in a parser that parses unambiguous grammars matters quite a lot in terms of what languages you can describe deterministically. It also influences how easily you can write a parser by hand for such a grammar, and how easily you can (programmatically) explain why your parser doesn’t accept certain inputs (parser error messages). So let’s have a look at LL and LR parsing techniques, where the first L in those abbreviations stands for Left-to-right (as in reading direction in text), and the second letters are respectively leftmost and rightmost derivation.
Topdown, (Strong) LL parsing
To take a good look at LL parsing, we will first work with a grammar that is not ambiguous or left-recursive:
(1) | $ S = F $ |
(2) | $ S = (\ S + F\ ) $ |
(3) | $ F = a $ |
So sort $S$ is the start symbol, we also have sort $F$, and we have round brackets, plusses, and $a$’s. This is enough information to create a table that, based on (1) the next sort to be parsed and (2) the next symbol in the input, predicts which rule from the grammar to use to parse the input further. In other words, if you know where you are in the input and grammar, you can look ahead at the next symbol of input and tell which unique grammar rule predicts the next bit of input (assuming the input fits the grammar). The table for the above grammar looks like so:
 | ( | a |
---|---|---|
S | 2 | 1 |
F | | 3 |
A table like the above is an LL(1) parse table, because it uses only 1 symbol of “look-ahead” in the columns. LL(1) grammars are always strong LL grammars, which means that they only need the combination of the sort to be parsed and the next symbol(s) to decide on a unique grammar rule to apply. In general, LL(k) grammars do not have to be strong, and if they are not, you also need to know what was already parsed from the input (the “left context”) in order to choose a unique grammar rule[1]. For example, the following grammar is LL(2), and not strong:
(1) | $ S = A\ a\ b\ A\ b\ a $ |
(2) | $ A = a $ |
(3) | $ A = \varepsilon $ |
You can see this if you try to write an LL(2) parse table for it:
 | a a | a b | b a |
---|---|---|---|
S | 1 | 1 | |
A | 2 | 2,3 | 3 |
If you look ahead to `a b` on the input, and the next sort to parse is $A$, then it really depends on whether you are at the start of the input or in the middle of rule 1. If you’re at the start, you must choose rule 3 so you can parse `a b` as part of rule 1 itself, but if you’re already in the middle of rule 1, you must choose rule 2 for $A$ so you can continue to parse `b a` of rule 1.
If you mark the $A$’s in rule 1 with where you are in the rule ($ S = A₁\ a\ b\ A₂\ b\ a $), you get an LL(2) grammar that is strong, although the table for it is larger[2]:
 | a a | a b | b a |
---|---|---|---|
S | 1 | 1 | |
A₁ | 2 | 3 | |
A₂ | | 2 | 3 |
In general, you can always use this trick to construct a strong, structurally equivalent LL grammar with the same look-ahead. This is quite useful for constructing simple LL parsers. However, the downside of these parsers is that on wrong input they can fail later than a more complicated LL(k) parser that works for the non-strong grammar. And that matters if you want to give nice error messages.
An intuition for table construction by automaton
Building the above tables was a matter of keeping in mind what they mean, and squinting a little. But in the case of a larger grammar, or a parse table generator, of course you want an exact process. Before I dive into the First and Follow sets that are the traditional method for building these tables, let me give you a method that is less practical but in my opinion more intuitive.
Step 1: Let’s build a simple automaton for each rule of the grammar, where we assume both sorts and terminals are on the input.
Note how each node of a rule automaton has the number of the rule followed by the offset into the body of the rule. The state represents where we are in the rule while parsing by that rule. The last node doesn’t have this offset so you can easily identify it, even when it’s no longer a final state.
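If you want to put this state encoding into code, a minimal sketch (my own encoding; the post itself only needs the picture) is a rule number plus an offset, where following the single outgoing edge just bumps the offset:

```rust
// A sketch (my own encoding) of the per-rule automaton states: a rule number
// plus an offset into that rule's body, e.g. S₁₁ is { rule: 1, offset: 1 }.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct RuleState {
    rule: usize,
    offset: usize,
}

impl RuleState {
    /// Follow the single outgoing transition of the rule automaton,
    /// as long as there are body symbols left to consume.
    fn step(self, body_len: usize) -> Option<RuleState> {
        if self.offset < body_len {
            Some(RuleState { rule: self.rule, offset: self.offset + 1 })
        } else {
            None // the last node: no offset, and (possibly) a final state
        }
    }
}
```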
Typically you’ll find texts on parsers display the position in a rule more visually with “LR item” notation. This uses the actual rule and a dot to indicate where we are in the rule. While this makes individual automata states more readable, it makes the automata large and therefore harder to display in a readable way as a whole. That’s why you won’t find that notation in this post’s automata. But just to illustrate an example of the notation:
Shorthand in this blog | LR Item notation |
---|---|
S₁₀ | $ S = \bullet\ A\ a\ b\ A\ b\ a $ |
S₁₁ | $ S = A\ \bullet\ a\ b\ A\ b\ a $ |
S₁₅ | $ S = A\ a\ b\ A\ b\ \bullet\ a $ |
S₁ | $ S = A\ a\ b\ A\ b\ a\ \bullet $ |
Step 2: Now instead of consuming a sort with an automaton, we jump to the automata of the rules for that sort instead. We’ll use the PDA stack, with unique labels, to get back to where you would be after consuming the sort.
The push edges are an abbreviation for edges that push a symbol on the stack unconditionally; it was hard to get graphviz to cooperate on node placement of this graph otherwise… Now at every node that had a sort transition you have multiple transition options, and these are where you need to look ahead. Soooo…
Step 3: at a sort transition node, for each transition, follow transitions until you’ve consumed k terminals (2 in this example) from the input. These are the terminals of the column in the parse table, and the rule of the transition gets put into that cell. You can also put the look-ahead into the automaton:
Building LL tables for strong LL grammars by the traditional method
While the above building of automata gives a visual intuition, it’s not the most efficient way to express how we can build LL tables. The traditional method does the same thing in essence, but using some pre-computed sets to calculate the cells in the table.
A cell at the row labeled with sort $A$ and the column labeled with terminal(s) $w$ should have the grammar rule $ A = \alpha $ (where $\alpha$ is a mix of terminals and sorts, or $\varepsilon$), under the following condition: $w$ is in the First set of $\alpha$, or $\varepsilon$ is in the First set of $\alpha$ and $w$ is in the Follow set of $A$. In other words: $ w \in \mathrm{First}(\alpha) \cdot \mathrm{Follow}(A) $.
Let’s unpack that. The First set of a terminal is a singleton set with just that terminal. The First set of a sort is the set of terminals that can appear first in anything that sort can expand to, directly or indirectly. So a rule $ A = a\ \alpha $ causes $a$ to appear in the First set of $A$, $ A = B\ \alpha $ causes the First set of $B$ to be included in the First set of $A$, and $ A = \varepsilon $ causes $\varepsilon$ to appear in the First set of $A$. This last rule says $A$ can be expanded to “nothing”, so if that’s an option we need to check the Follow set of $A$.
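For k = 1 these three rules translate almost directly into a fixpoint computation. Here’s a sketch with my own (hypothetical) grammar encoding; `None` plays the role of the “can expand to nothing” case:

```rust
use std::collections::{HashMap, HashSet};

// A sketch of the First-set computation for k = 1, using my own grammar
// encoding (the post works with strings of up to k terminals instead).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Sym {
    T(char),         // terminal
    N(&'static str), // sort
}

type Grammar = Vec<(&'static str, Vec<Sym>)>;

fn first_sets(grammar: &Grammar) -> HashMap<&'static str, HashSet<Option<char>>> {
    let mut first: HashMap<&'static str, HashSet<Option<char>>> = HashMap::new();
    for (sort, _) in grammar {
        first.entry(*sort).or_default();
    }
    // Keep applying the three rules from the paragraph above until nothing changes.
    let mut changed = true;
    while changed {
        changed = false;
        for (sort, body) in grammar {
            let mut found: HashSet<Option<char>> = HashSet::new();
            let mut nullable = true; // an empty body is the ε case
            for sym in body {
                match sym {
                    // A = a α: a is in First(A).
                    Sym::T(c) => {
                        found.insert(Some(*c));
                        nullable = false;
                    }
                    // A = B α: First(B) is included in First(A).
                    Sym::N(n) => {
                        let of_n = first[n].clone();
                        nullable = of_n.contains(&None);
                        found.extend(of_n.into_iter().filter(Option::is_some));
                    }
                }
                if !nullable {
                    break; // for k = 1 the rest of the body no longer contributes
                }
            }
            if nullable {
                found.insert(None); // A can expand to nothing, directly or indirectly
            }
            let entry = first.get_mut(sort).unwrap();
            let before = entry.len();
            entry.extend(found);
            changed |= entry.len() != before;
        }
    }
    first
}

fn main() {
    // The LL(2) example grammar: S = A a b A b a, A = a, A = ε.
    let grammar: Grammar = vec![
        ("S", vec![Sym::N("A"), Sym::T('a'), Sym::T('b'), Sym::N("A"), Sym::T('b'), Sym::T('a')]),
        ("A", vec![Sym::T('a')]),
        ("A", vec![]),
    ];
    println!("{:?}", first_sets(&grammar));
}
```

Running this on the LL(2) example grammar gives First(A) = {a, ε} and First(S) = {a}; for the length-2 look-aheads in the tables above you’d do the same thing with strings of up to two terminals.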
The Follow set of a sort $A$ is basically every terminal that can follow $A$ somewhere in the grammar. So when you have $ B = \alpha\ A\ a\ \beta $, $a$ is in the Follow set of $A$. A rule $ B = \alpha\ A $ causes the Follow set of $B$ to be included in the Follow set of $A$. And the Follow set of the start symbol has the end-of-file ‘terminal’ $\$$.
Finally, there is the dot operator between the First and Follow sets: this is a truncated product, which takes every combination of the two sets, sticks them together (in order), and truncates to length k. That’s a bit of an abstraction over the k in LL(k), which I didn’t take into account in the explanation of First and Follow sets. The First sets should really contain strings of up to length k of course, and so you may need to take more First/Follow sets into account when computing these. Another thing I glossed over is that we actually use the First set of $\alpha$, a mix of terminals and sorts on the right-hand side of our grammar rules. If $\alpha$ is $ X_1\ X_2 \ldots X_n $, then its First set is $ \mathrm{First}(X_1) \cdot \mathrm{First}(X_2) \cdot \ldots \cdot \mathrm{First}(X_n) $.
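The truncated product itself is tiny. As a sketch (not the post’s code): for the strong grammar above, First(A₂) · Follow(A₂) with k = 2 is {a, ε} · {b a} = {a b, b a}, which is exactly where the A₂ row of the table has its entries.

```rust
use std::collections::HashSet;

// A sketch of the truncated product between two sets of look-ahead strings.
// e.g. {"a", ""} ·₂ {"ba"} = {"ab", "ba"}.
fn truncated_product(left: &HashSet<String>, right: &HashSet<String>, k: usize) -> HashSet<String> {
    let mut result = HashSet::new();
    for l in left {
        for r in right {
            // Stick the two strings together in order, keep only the first k terminals.
            result.insert(l.chars().chain(r.chars()).take(k).collect());
        }
    }
    result
}
```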
Ok, with that all done, we can use those tables. But before we do, a quick word about expressive power, because LL is not particularly powerful…
Limitations and Expressive power
For every k, there are languages that cannot be captured by an LL(k) grammar but can be captured by an LL(k+1) grammar. In other words, look-ahead size is important for the expressivity of an LL grammar, and LL(k) for any specific k does not capture all deterministic context-free languages[3].
We’ve already seen an example of an LL(2) grammar, but what would be a language that cannot be captured by any LL(k)? Take the language of a’s followed by b’s, where the number of a’s the number of b’s. Or as a grammar:
(1) | |
(2) | |
(3) | |
(4) |
The problem for LL here is that we would have to look ahead in the input until we read the entire input before we could decide whether we can start consuming the input with Rule 1 or Rule 2 (and then Rule 3).
There is a class of grammars called LL-regular (LLR) grammars that captures all LL(k) grammars for any k, and slightly more. These LLR grammars are cool in that they are still parseable in linear time, as long as you have something called a “regular partition” of your grammar. Getting that is an undecidable problem though. And since there is an LR(1) grammar that is not in LLR, this stuff is the trifecta of confusing, impractical, and less powerful[4] than a much more useful technique that we will cover in the next post: LR. But first, going from tables to parsers!
Predictive Parsing
Since we already know what the tables mean, we can write a simple parse table interpreter to finish our predictive parser. The parser is called predictive because based on the k look-ahead terminals, we decide the grammar rule to use to continue parsing, which typically predicts some of the structure of the input well beyond the part we peeked at for the look-ahead.
Ok, let’s write a quick parse table interpreter for our LL(2) example. We’ll start with some definitions.
```rust
use std::collections::HashMap;
use std::env;

use lazy_static::lazy_static;
use peekmore::PeekMore;

type Terminal = char;

#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
enum Sort {
    S,
    A1,
    A2,
}

enum Symbol {
    Sort(Sort),
    Terminal(Terminal),
}

#[derive(Debug, Eq, PartialEq)]
enum Rule {
    S,
    Aa,
    AEpsilon,
}
```
The imports become useful in a second, for now we have our terminals, sorts, a combination type `Symbol`, and the names of our grammar rules. Assuming we keep around a proper PDA stack of symbols, we can write our grammar rules now:
```rust
impl Rule {
    fn apply(&self, stack: &mut Vec<Symbol>) {
        match self {
            Rule::S => Self::s(stack),
            Rule::Aa => Self::aa(stack),
            Rule::AEpsilon => Self::a_epsilon(stack),
        }
    }

    fn s(stack: &mut Vec<Symbol>) {
        stack.push(Symbol::Terminal('a'));
        stack.push(Symbol::Terminal('b'));
        stack.push(Symbol::Sort(Sort::A2));
        stack.push(Symbol::Terminal('b'));
        stack.push(Symbol::Terminal('a'));
        stack.push(Symbol::Sort(Sort::A1));
    }

    fn aa(stack: &mut Vec<Symbol>) {
        stack.push(Symbol::Terminal('a'));
    }

    #[allow(clippy::ptr_arg)]
    fn a_epsilon(_stack: &mut Vec<Symbol>) {}
}
```
Clippy is great for catching all kinds of poor code, but for consistency I’ve chosen to `#[allow]` this time. Note that to effectively run a context-free grammar on a PDA, you need to push the symbols in your rules on the stack in reverse, as mentioned in the recap.
```rust
lazy_static! {
    static ref TABLE: HashMap<(Sort, Terminal, Terminal), Rule> = {
        let mut table = HashMap::new();
        assert_eq!(None, table.insert((Sort::S, 'a', 'a'), Rule::S));
        assert_eq!(None, table.insert((Sort::S, 'a', 'b'), Rule::S));
        assert_eq!(None, table.insert((Sort::A1, 'a', 'a'), Rule::Aa));
        assert_eq!(None, table.insert((Sort::A1, 'a', 'b'), Rule::AEpsilon));
        assert_eq!(None, table.insert((Sort::A2, 'a', 'b'), Rule::Aa));
        assert_eq!(None, table.insert((Sort::A2, 'b', 'a'), Rule::AEpsilon));
        table
    };
}
```
Nothing very special really, just encoding what we had already. The main parse loop is also very unexciting now that we have implemented most of the logic of the grammar already. We basically manage the stack, eliminating terminals on the stack with those from the input and applying rules from the table based on sort and look-ahead, and give errors if we get unexpected input:
```rust
pub fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    let input = env::args().nth(1).expect("Argument string to parse");
    let input = lex(input);
    let mut input = input.iter().peekmore();
    let mut stack = Vec::new();
    stack.push(Symbol::Sort(Sort::S));
    while let Some(symbol) = stack.pop() {
        return match symbol {
            // A terminal on the stack must match the next input symbol.
            Symbol::Terminal(predicted) => {
                if let Some(&actual) = input.next() {
                    if predicted == actual {
                        continue;
                    }
                    Err(format!(
                        "Expected terminal {predicted:?}, but got {actual:?}."
                    ))
                } else {
                    Err(format!("Expected terminal {predicted:?}, but got EOF."))
                }
            }
            // A sort on the stack is expanded by the rule the table predicts
            // for the two look-ahead terminals.
            Symbol::Sort(sort) => {
                if let &[Some(&term1), Some(&term2)] = input.peek_amount(2) {
                    if let Some(r) = TABLE.get(&(sort, term1, term2)) {
                        r.apply(&mut stack);
                        continue;
                    } else {
                        Err(format!(
                            "Unexpected {term1:?} {term2:?} while parsing {sort:?}"
                        ))
                    }
                } else {
                    Err("Unexpected end of input.".to_owned())
                }
            }
        };
    }
    Ok(())
}
```
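To see the loop in action, this is how it should process the input abba, writing the top of the stack on the left since that’s what gets popped first:

stack (top on the left) | remaining input | action |
---|---|---|
S | abba | look-ahead a b: apply Rule::S |
A1 a b A2 b a | abba | look-ahead a b: apply Rule::AEpsilon |
a b A2 b a | abba | match a |
b A2 b a | bba | match b |
A2 b a | ba | look-ahead b a: apply Rule::AEpsilon |
b a | ba | match b |
a | a | match a |
(empty) | (empty) | stack empty: Ok(()) |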
Recursive Descent
By encoding the parse table in data, we get some amount of interpretive overhead. We have a parse table interpreter with a stack we manage ourselves, but the stack is not really used any differently from a call stack. So what if we use function calls instead? That’s the idea of recursive descent parsing. It actually makes our code smaller and more straightforward, which is why it’s so popular as a technique for hand-written parsers.
```rust
use std::env;

use peekmore::PeekMore;
use peekmore::PeekMoreIterator;

type Iter<'a> = PeekMoreIterator<std::slice::Iter<'a, Terminal>>;
type Terminal = char;

fn consume(input: &mut Iter, predicted: Terminal) -> Result<(), String> {
    if let Some(&actual) = input.next() {
        if actual == predicted {
            Ok(())
        } else {
            Err(format!(
                "Expected terminal {predicted:?}, but got {actual:?}."
            ))
        }
    } else {
        Err("Unexpected end of file.".to_owned())
    }
}
```
This time we only need terminals as a type, the rest is gone, and so is the hashmap import for the parse table. We will need the input, and be able to remove predicted terminals from it, so `consume` comes in handy.
```rust
fn sort_s(input: &mut Iter) -> Result<(), String> {
    // S
    match input.peek_amount(2) {
        &[Some('a'), Some('a')] => s(input),
        &[Some('a'), Some('b')] => s(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing S")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn sort_A1(input: &mut Iter) -> Result<(), String> {
    // A1
    match input.peek_amount(2) {
        &[Some('a'), Some('a')] => a_a(input),
        &[Some('a'), Some('b')] => a_epsilon(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing A")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn sort_A2(input: &mut Iter) -> Result<(), String> {
    // A2
    match input.peek_amount(2) {
        &[Some('a'), Some('b')] => a_a(input),
        &[Some('b'), Some('a')] => a_epsilon(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing A")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn s(input: &mut Iter) -> Result<(), String> {
    sort_A1(input)?;
    consume(input, 'a')?;
    consume(input, 'b')?;
    sort_A2(input)?;
    consume(input, 'b')?;
    consume(input, 'a')
}

fn a_a(input: &mut Iter) -> Result<(), String> {
    consume(input, 'a')
}

fn a_epsilon(_input: &mut Iter) -> Result<(), String> {
    Ok(())
}
```
Our parse table has now become code directly, with these functions named after the sorts of the rows. They call rules that are also functions, which in turn use the sort functions. Those rules also use `consume`, this time without having to reverse the order of the rule body.
```rust
pub fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    let input = env::args().nth(1).expect("Argument string to parse");
    let input = lex(input);
    let mut input = input.iter().peekmore();
    sort_s(&mut input)
}
```
Finally, our main function just calls the right sort function instead of putting that sort on the stack. And the loop is gone, since we now use recursion.
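If you want to convince yourself that this behaves like the table-driven version, a couple of hypothetical unit tests (not from the post) are easy to add:

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use peekmore::PeekMore;

    #[test]
    fn accepts_abba() {
        // abba is rule 1 with A₁ = ε and A₂ = ε.
        let input = lex("abba".to_owned());
        assert!(sort_s(&mut input.iter().peekmore()).is_ok());
    }

    #[test]
    fn rejects_abab() {
        // After the first a b, the look-ahead a b predicts A₂ = a,
        // and the final b a of rule 1 is then missing.
        let input = lex("abab".to_owned());
        assert!(sort_s(&mut input.iter().peekmore()).is_err());
    }
}
```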
Summary of LL, and an insight from the automaton
We’ve now seen LL(k) parsing, left-to-right leftmost derivation. This leftmost derivation directly corresponds to walking through the parse tree topdown, depth-first, leftmost child first. Whenever we expand a leftmost sort by a rule for that sort, we have to choose a rule, therefore we use the look-ahead (with a length of k) to see ahead and choose based on this.
We’ve seen an LL(1) and an LL(2) grammar, and in general more look-ahead allows us to parse more grammars and more languages. Both are important: certain languages cannot be expressed in LL(1) or LL(2), and some LL(1) grammars are harder to read and write than the LL(2) grammar of the same language.
We’ve seen how we can construct simple DFAs for each rule in our grammar, and then replace each sort transition by push transitions from its source node to the starts of all DFAs corresponding to rules of that sort (pushing a unique label for the transition’s target node), and pop transitions from the ends of those DFAs back to that target node (popping the label again). Then the LL table, the decision table of sort + look-ahead = rule, naturally follows from this PDA by finding what input will be consumed if a certain rule is chosen, and using that as the look-ahead to make the decision for that rule.
The recursive descent way of writing a parser directly as code is nice and simple, it really just follows the grammar. Since you’re writing plain old code with function calls, you can imagine people have found nice ways to extend and adapt the pattern of recursive descent parsers. For one, it’s quite easy to reason about where you are in the parse when hitting an error state, which makes it fairly easy to give friendly error messages when the parser doesn’t accept an input. You can also use a trick to fix up direct left-recursion called node reparenting, where you use a loop or tail-recursion to locally construct the tree bottom-up. You could argue that such a parser is a hybrid between recursive descent and ascent, a “recursive descent-ascent parser”.
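Here’s a sketch of that node reparenting trick on a small example of my own, the directly left-recursive rule $ E = E + a \mid a $: parse one a up front, then loop, wrapping the tree built so far as the left child of a new node:

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug)]
enum Expr {
    A,
    Add(Box<Expr>, Box<Expr>),
}

/// E = E + a | a, parsed without left recursion: the loop plays the role of
/// the recursion, and reparents the tree so far as the left child ("ascent").
fn expr(input: &mut Peekable<Chars>) -> Result<Expr, String> {
    let mut tree = atom(input)?;
    while input.peek() == Some(&'+') {
        input.next(); // consume the '+'
        let right = atom(input)?;
        tree = Expr::Add(Box::new(tree), Box::new(right));
    }
    Ok(tree)
}

fn atom(input: &mut Peekable<Chars>) -> Result<Expr, String> {
    match input.next() {
        Some('a') => Ok(Expr::A),
        other => Err(format!("Expected 'a', but got {other:?}.")),
    }
}

fn main() {
    // a+a+a comes out left-associative: Add(Add(A, A), A).
    println!("{:?}", expr(&mut "a+a+a".chars().peekable()));
}
```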
Finally, if we look back at the automaton, we can see that the PDAs we build have a very strict shape. We either have a non-deterministic choice due to multiple push transitions for a sort, or we have predicted input, a single path of terminals to consume from the input. If we think back to the NFAs and DFAs from early on in this blog post series, those used the input to choose what state to go to next. Now we have single-path DFAs that just consume input, and a separate table on a look-ahead to resolve non-determinism from the pushes and pops. The strict shape here indicates that we’re not really making full use of the power of automata. This will change with the next parsing technique.
Continue?
Now that you’ve learned all about LL parsing, would you like to learn about LR parsing? About how it’s more powerful, also uses parse tables with some of the same construction tricks, also has nice corresponding push-down automata, and even has an analogue to recursive descent? Then click on to part 2! Or bookmark it for later, when you are able to absorb information again or whatever ¯\_(ツ)_/¯
- I’m fairly sure my prose description there is the same as a formal definition, and it feels a bit nicer to think about than the ones you can find on Wikipedia. ↩
- Technically you’d need to see A₁ and A₂ as separate symbols and duplicate the rules for A, resulting in a larger grammar in correspondence with the larger table. But I couldn’t be bothered, and the parse table as shown works just as well. This is relevant to the code size of a recursive descent parser too, since you can just reuse the code for rules 2 and 3 instead of having duplicate code for the two extra rules. What’s a recursive descent parser? That comes just a little later in the post, so keep reading ;) ↩
- Yes, there are non-deterministic context-free languages. Those are the context-free languages that can only be parsed with a non-deterministic PDA. Since this post is about deterministic parsers, we’ll ignore the non-deterministic languages. ↩
- While I find the Wikipedia article on LLR confusing, and it makes a good case for why it’s not really used, I’m still somewhat intrigued. This is one of those things that will stay on my reading list for a while I think, something I still want to understand further… ↩