Hello again! I’m picking up my series on Automata, with this post that goes into what I had always meant to get to: parsers. We’ll look at the old-school linear-time parsing algorithms, which need only a single pass over the input, without backtracking or caching: LL and LR, parse tables, recursive descent and recursive ascent. Welcome to the world of deterministic parsing…

Refresher from Pushy Automata

We’ll start with a brief refresher from the previous post of the series, pushy automata, since that was a little while back.

Push-Down Automata

Push-down automata (PDAs) are automata with a stack. Normal finite automata just consume input and have fixed memory via their states. PDAs can remember things on a single stack too, by pushing onto it and popping from it. Here’s a deterministic PDA for recognising the language of words that start with zeroes, followed by an equal number of ones:

*[Figure: “Non-regular language example, deterministic” — a PDA with transitions q₀ →(0, ε → $) q₁; q₁ →(0, ε → 0) q₁; q₁ →(1, 0 → ε) q₂; q₂ →(1, 0 → ε) q₂; q₂ →(1, $ → ε) q₃; states q₀ and q₃ are accepting.]*

So we start at $q_0$, see if there is a $0$ as input, ignore the top of the stack, and push a $\$$ on the stack as a marker for the end of the stack. Now we’re in state $q_1$, in which we can consume more zeroes from the input and push those on the stack. If we find a one as input, we remove a zero from the stack (popping it and pushing nothing new). Now we’re in state $q_2$, where we remove a zero from the stack for every one in the input, until we consume the final one by removing the $\$$ from the stack.

Aside: This is one of the examples from the old blog post, and I now see that it is missing a transition! This automaton rejects the input $01$, because there is no transition $q_1 \xrightarrow{1,\ \$\ \to\ \varepsilon} q_3$. Oops ^_^
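If you’d rather read code than state machines, here’s a minimal sketch of the same machine in Rust (my own hand-rolled recogniser, not code from the old post, with the missing transition from the aside put back in):

/// Recognises 0ⁿ1ⁿ (n ≥ 0) with an explicit stack, following the PDA above.
fn accepts(input: &str) -> bool {
    let mut stack: Vec<char> = Vec::new();
    let mut state = "q0";
    for c in input.chars() {
        state = match (state, c, stack.last().copied()) {
            // q₀: the first 0 pushes the end-of-stack marker $.
            ("q0", '0', _) => { stack.push('$'); "q1" }
            // q₁: every further 0 is pushed on the stack.
            ("q1", '0', _) => { stack.push('0'); "q1" }
            // q₁: a 1 pops a 0 ...
            ("q1", '1', Some('0')) => { stack.pop(); "q2" }
            // ... or pops $ directly (the transition missing from the old post).
            ("q1", '1', Some('$')) => { stack.pop(); "q3" }
            // q₂: each further 1 pops a 0; the final one pops $.
            ("q2", '1', Some('0')) => { stack.pop(); "q2" }
            ("q2", '1', Some('$')) => { stack.pop(); "q3" }
            // No transition: reject.
            _ => return false,
        };
    }
    state == "q0" || state == "q3" // the double-circled states accept
}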

Context-Free Grammars, Derivations, and a naive PDA translation

A context-free grammar that describes the above language is:

$S = 0\ S\ 1$ (step)
$S = \varepsilon$ (ε)

Sort $S$ is the start symbol, the starting point in the grammar. If we’re using the grammar productively, we start from the start symbol and use the rules left-to-right to replace sorts until we get the sentence in the corresponding context-free language that we want. Something like: $S \to 0S1 \to 00S11 \to 0011$. This is called a derivation.

The most general, naive way to derive a push-down automaton for any context-free grammar is by pushing the end-of-stack and start symbol at the start, having a “main” state that uses the grammar rules with the body reversed (to deal with the stack order), and an accept state that pops the end-of-stack:

*[Figure: the naive PDA for this grammar — the start edge pushes $ S; q₀ loops on ε, S → 1 S 0; ε, S → ε; 0, 0 → ε; 1, 1 → ε; and q₀ →(ε, $ → ε) q₁, the accepting state.]*

Here the stack grows left-to-right, so the lowest symbol on the stack is $ (end of stack), followed by S (the grammar start symbol). By the rules of the grammar we can manipulate the top of the stack and rewrite it to the body. If the input lines up with what we have on the stack, we can eliminate both. It’s simple, but inefficient because of all the nondeterminism.
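To make that nondeterminism concrete, here’s a sketch of this naive PDA as a backtracking recogniser in Rust: each choice between the two S rules is simply tried both ways, which is exactly the inefficiency.

/// The naive PDA for S = 0 S 1 | ε, run with backtracking.
/// The stack grows to the right, so the top is the last element.
fn run(input: &[char], stack: &[char]) -> bool {
    match stack.split_last() {
        // ε, $ → ε into the accept state: succeed if the input is used up too.
        Some(('$', _)) => input.is_empty(),
        // A terminal on top of the stack must match the next input symbol.
        Some((&t, rest)) if t == '0' || t == '1' => {
            input.first() == Some(&t) && run(&input[1..], rest)
        }
        // Nondeterminism: try both rules for S, bodies pushed in reverse.
        Some((&'S', rest)) => {
            let mut step = rest.to_vec();
            step.extend(['1', 'S', '0']); // ε, S → 1 S 0
            run(input, &step) || run(input, rest) // ε, S → ε
        }
        _ => false,
    }
}

fn accepts(input: &str) -> bool {
    let chars: Vec<char> = input.chars().collect();
    run(&chars, &['$', 'S']) // the stack starts as $ S, with S on top
}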

Derivations, Parse Trees and Ambiguity

Let’s look at a slightly more interesting grammar from a parser perspective:

$S = S + S$ (add)
$S = S * S$ (mul)
$S = 1$ (one)

When you want to derive $1 + 1 * 1$, you can do this in all manner of ways. The following derivation picks an arbitrary sort at each step on which to apply a rule from the grammar:

1. $S$ (first $S$)
2. $S + S$ (first $S$)
3. $1 + S$ (first $S$)
4. $1 + S * S$ (second $S$)
5. $1 + S * 1$ (first $S$)
6. $1 + 1 * 1$

Notice how in some steps the leftmost $S$ was replaced, while in others the rightmost was. Generally speaking, you’ll want either a leftmost or a rightmost derivation for parsers, which is to say: a grammar rule is always applied to the leftmost or the rightmost sort. There are three reasons for this. The first is that you want a parser to be predictable in when it applies grammar rules, as you may connect so-called semantic actions to each rule. These are pieces of code that are run when the parser applies that particular rule (a typical example is a simple calculator). Such actions can perform side-effects, so order matters; for this reason, the choice of leftmost vs. rightmost is also observable. The two other reasons to want a predictable derivation order are ease of implementation and ease of proving things about your algorithm; those last two care less about whether it’s leftmost or rightmost.
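As a small illustration (hypothetical code, not tied to any particular parser), a calculator could attach to each rule of the grammar above an action that runs against a value stack whenever the parser applies that rule; the order of those calls is exactly the side-effect order that the derivation order pins down.

// Hypothetical semantic actions for S = S + S / S * S / 1, as a calculator.
// A (bottom-up) parser would call this each time it applies a rule.
fn apply_rule(rule: &str, values: &mut Vec<i64>) {
    match rule {
        "one" => values.push(1),
        "add" => {
            let (right, left) = (values.pop().unwrap(), values.pop().unwrap());
            values.push(left + right);
        }
        "mul" => {
            let (right, left) = (values.pop().unwrap(), values.pop().unwrap());
            values.push(left * right);
        }
        rule => unreachable!("unknown rule {rule}"),
    }
}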

The most common semantic action I’m aware of is building a syntax tree with the parser: a tree structure built out of the parsed text. A parse tree, or concrete syntax tree, contains all the rule applications as seen in the grammar. An abstract syntax tree abstracts over some parts of the parse tree, such as leaving out whitespace, parentheses (the shape of the tree captures the precedence anyway), or injections (grammar rules of the form $S_1 = S_2$). Let’s look at some parse trees of the last example, $1 + 1 * 1$:

*[Figure: parse trees of 1 + (1 * 1) and (1 + 1) * 1 — two trees over the same leaves 1 + 1 * 1; in the left tree the + node is the root, in the right tree the * node is.]*

Notice how the leaves of the two trees are in the same order left-to-right as the input, but in the left tree the plus is higher up in the tree, while in the right tree the star is. If we want to interpret the input as simple arithmetic, where multiplication binds tighter than addition, the left tree is the one we want. This is the precedence of the operators: $* > +$.

When you can get multiple trees like this, the grammar is called ambiguous. More formally: if you use only leftmost derivations (or only rightmost ones) and still find two distinct derivations that give the same sentence, the grammar is ambiguous. So to be clear: both of the above trees can be created with only leftmost derivations; it’s not a matter of choosing leftmost for one tree and rightmost for the other. Derivation order (leftmost or rightmost) has to do with the side-effect order of semantic actions only. When you build trees you don’t really need side-effects, so the derivation order has no effect on the trees you can get.

With that recap out of the way: for the purposes of this blog post, we’ll look at unambiguous grammars for the languages we want to parse. Still, whether a parser of unambiguous grammars uses leftmost or rightmost derivation matters quite a lot for what languages you can describe deterministically. It also influences how easily you can write a parser by hand for such a grammar, and how easily you can (programmatically) explain why your parser doesn’t accept certain inputs (parser error messages). So let’s have a look at LL and LR parsing techniques, where the first L in those abbreviations stands for Left-to-right (as in the reading direction of text), and the second letter stands for leftmost and rightmost derivation respectively.

Topdown, (Strong) LL parsing

To take a good look at LL parsing, we will first work with a grammar that is not ambiguous or left-recursive:

$S = F$ (1)
$S = ( S + F )$ (2)
$F = a$ (3)

So sort $S$ is the start symbol, we also have sort $F$, and we have round brackets, plusses, and $a$’s. This is enough information to create a table that, based on (1) the next sort to be parsed and (2) the next symbol in the input, predicts which rule from the grammar to use to parse the input further. In other words, if you know where you are in the input and the grammar, you can look ahead at the next symbol of input and tell which unique grammar rule predicts the next bit of input (assuming the input fits the grammar). The table for the above grammar looks like so:

|     | ( | a |
|-----|---|---|
| $S$ | 2 | 1 |
| $F$ |   | 3 |

A table like the above is an LL(1) parse table, because it uses only 1 symbol of “look-ahead” in the columns. LL(1) grammars are always strong LL grammars, which means that they only need the combination of the sort to be parsed and the next symbol(s) to decide on a unique grammar rule to apply. In general, LL(k) grammars do not have to be strong, and if they are not, you also need to know what was already parsed from the input (the “left context”) in order to choose a unique grammar rule¹. For example, the following grammar is LL(2), and not strong:

$S = A\ a\ b\ A\ b\ a$ (1)
$A = a$ (2)
$A = \varepsilon$ (3)

You can see this if you try to write an LL(2) parse table for it:

|     | a a | a b  | b a |
|-----|-----|------|-----|
| $S$ | 1   | 1    |     |
| $A$ | 2   | 2, 3 | 3   |

If you look ahead to `a b` in the input, and the next sort is $A$, then it really depends on whether you are at the start of the input or in the middle of rule 1. If you’re at the start, you must choose rule 3, so you can parse `a b` as part of rule 1; but if you’re already in the middle of rule 1, you must choose rule 2 for $A$, so you can continue to parse the `b a` of rule 1.

If you mark each $A$ in rule 1 with its position in that rule ($S = A_1\ a\ b\ A_2\ b\ a$), you get an LL(2) grammar that is strong, although the table for it is larger²:

|       | a a | a b | b a |
|-------|-----|-----|-----|
| $S$   | 1   | 1   |     |
| $A_1$ | 2   | 3   |     |
| $A_2$ |     | 2   | 3   |

In general, you can always use this trick to construct a strong, structurally equivalent LL grammar with the same look-ahead. This is quite useful for constructing simple LL parsers. However, the downside of these parsers is that on wrong input they can fail later than a more complicated LL(k) parser that works for the non-strong grammar. And that matters if you want to give nice error messages.

An intuition for table construction by automaton

Building the above tables was a matter of keeping in mind what they mean, and squinting a little. But for a larger grammar, or a parse table generator, you of course want an exact process. Before I dive into the First and Follow sets that are the traditional method for building these tables, let me give you a method that is less practical but, in my opinion, more intuitive.

Step 1: Let’s build a simple automaton for each rule of the grammar, where we assume both sorts and terminals are on the input.

*[Figure: “Simple automata for each grammar rule from the last example” — three linear automata: S₁₀ →A S₁₁ →a S₁₂ →b S₁₃ →A S₁₄ →b S₁₅ →a S₁ for rule 1; A₂₀ →a A₂ for rule 2; and the lone accepting state A₃ for rule 3.]*

Note how each node of a rule automaton has the number of the rule followed by the offset into the body of the rule. The state represents where we are in the rule while parsing by that rule. The last node doesn’t have this offset so you can easily identify it, even when it’s no longer a final state.

Typically you’ll find texts on parsers display the position in a rule more visually with “LR item” notation. This uses the actual rule and a dot to indicate where we are in the rule. While this makes individual automata states more readable, it makes the automata large and therefore harder to display in a readable way as a whole. That’s why you won’t find that notation in this post’s automata. But just to illustrate an example of the notation:

| Shorthand in this blog | LR item notation |
|------------------------|------------------|
| S₁₀ | $S = .\ A\ a\ b\ A\ b\ a$ |
| S₁₁ | $S = A\ .\ a\ b\ A\ b\ a$ |
| S₁₅ | $S = A\ a\ b\ A\ b\ .\ a$ |
| S₁  | $S = A\ a\ b\ A\ b\ a\ .$ |

Step 2: Now, instead of consuming a sort within an automaton, we’ll use $\varepsilon$ transitions to jump to the automata of the rules for that sort. We’ll use the PDA stack with unique labels to get back to where you would be after consuming the sort.

*[Figure: the single PDA stitched together from the rule automata — the sort transitions S₁₀ →A S₁₁ and S₁₃ →A S₁₄ are replaced by push edges ↓S₁₁ resp. ↓S₁₄ into the rule automata for A (to A₂₀ and A₃), and pop edges ↑S₁₁ resp. ↑S₁₄ from their final states (A₂ and A₃) back.]*

The $\downarrow X$ is an abbreviation for an $\varepsilon, \varepsilon \to X$ edge that pushes a symbol on the stack unconditionally; it was hard to get graphviz to cooperate on node placement of this graph otherwise… Now every node that had a sort transition has multiple transition options, and these are where you need to look ahead. Soooo…

Step 3: at a sort transition node, for each $\downarrow$ transition, follow transitions until you’ve consumed k terminals (2 in this example) from the input. These terminals form the column of the parse table, and the rule of the $\downarrow$ transition gets put into that cell. You can also put the look-ahead into the automaton:

*[Figure: the same PDA with look-ahead — the push edges are now labelled with the 2-terminal look-ahead that selects them: (aa), ↓S₁₁ and (ab), ↓S₁₁ out of S₁₀; (ab), ↓S₁₄ and (ba), ↓S₁₄ out of S₁₃.]*

Building LL tables for strong LL grammars by the traditional method

While the above building of automata gives a visual intuition, it’s not the most efficient way to express how we can build LL tables. The traditional method does the same thing in essence, but using some pre-computed sets to calculate the cells in the table.

A cell at the row labeled with sort $A$ and the column labeled with terminal(s) $v$ should hold the grammar rule $A = w$ (where $w$ is a mix of terminals and sorts, or $\varepsilon$), under the following condition: $v$ is in the First set of $w$, or $\varepsilon$ is in the First set of $w$ and $v$ is in the Follow set of $A$. In other words: $v \in \textit{First}(w) \cdot \textit{Follow}(A)$.

Let’s unpack that. The First set of a terminal is a singleton set with just that terminal. The First set of a sort is the set of first terminals that the sort can expand to, directly or indirectly. So a rule $A = a\ [\ldots]$ causes $a$ to appear in the First set of $A$, $A = B\ [\ldots]$ causes the First set of $B$ to be included in the First set of $A$, and $A = \varepsilon$ causes $\varepsilon$ to appear in the First set of $A$. This last rule says $A$ can be expanded to “nothing”, so if that’s an option, we need to check the Follow set of $A$.

The Follow set of $A$ is basically every terminal that can follow $A$ in the grammar. So when you have $B = [\ldots]\ A\ a\ [\ldots]$, $a$ is in the Follow set of $A$. A rule $B = [\ldots]\ A$ causes the Follow set of $B$ to be included in the Follow set of $A$. And the Follow set of the start symbol contains the end-of-file ‘terminal’ $\$$.


Finally, there is the dot operator between the First and Follow sets: this is a truncated product, which takes every combination of the two sets, sticks the strings together (in order), and truncates them to length k. That is a bit of an abstraction over the k in LL(k), which I didn’t take into account in the explanation of First and Follow sets: for k > 1 the First sets should of course hold length-k strings, and so you may need to take more First/Follow sets into account when computing them. Another thing I glossed over is that we actually use the First set of $w$, a mix of terminals and sorts on the right-hand side of our grammar rules. If $w$ is $a\ A\ B\ b$, then its First set is $\{a\} \cdot \textit{First}(A) \cdot \textit{First}(B) \cdot \{b\}$.
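Here’s a sketch of how these sets can be computed for k = 1 with the usual fixpoint iteration over the rules (my own throwaway encoding, not from any library: uppercase characters are sorts, everything else is a terminal, and an empty body is an ε-rule):

use std::collections::{HashMap, HashSet};

type Sym = char; // uppercase = sort, anything else = terminal
const EPS: Sym = 'ε'; // stands for "can expand to nothing"

fn is_sort(s: Sym) -> bool { s.is_uppercase() }

/// First set of a sequence of symbols, given the First sets computed so far.
fn first_of(seq: &[Sym], first: &HashMap<Sym, HashSet<Sym>>) -> HashSet<Sym> {
    let mut out = HashSet::new();
    for &sym in seq {
        let f: HashSet<Sym> = if is_sort(sym) {
            first.get(&sym).cloned().unwrap_or_default()
        } else {
            HashSet::from([sym])
        };
        let has_eps = f.contains(&EPS);
        out.extend(f.into_iter().filter(|&t| t != EPS));
        if !has_eps {
            return out; // this symbol can't vanish, so nothing further is "first"
        }
    }
    out.insert(EPS); // the whole sequence can expand to nothing
    out
}

/// First and Follow sets (k = 1) by fixpoint iteration over the rules.
fn first_follow(
    rules: &[(Sym, Vec<Sym>)],
    start: Sym,
) -> (HashMap<Sym, HashSet<Sym>>, HashMap<Sym, HashSet<Sym>>) {
    let mut first: HashMap<Sym, HashSet<Sym>> = HashMap::new();
    let mut follow: HashMap<Sym, HashSet<Sym>> = HashMap::new();
    follow.entry(start).or_default().insert('$'); // EOF follows the start symbol
    let mut changed = true;
    while changed {
        changed = false;
        for (sort, body) in rules {
            // First(sort) grows with First(body).
            for t in first_of(body, &first) {
                changed |= first.entry(*sort).or_default().insert(t);
            }
            // For every sort in the body, Follow grows with First(rest);
            // if rest can be ε, it also grows with Follow(sort).
            for (i, &sym) in body.iter().enumerate() {
                if !is_sort(sym) {
                    continue;
                }
                let mut add = first_of(&body[i + 1..], &first);
                if add.remove(&EPS) {
                    add.extend(follow.get(sort).cloned().unwrap_or_default());
                }
                for t in add {
                    changed |= follow.entry(sym).or_default().insert(t);
                }
            }
        }
    }
    (first, follow)
}

// For S = F / ( S + F ) and F = a from earlier, this gives
// First(S) = {a, (}, First(F) = {a}, Follow(S) = {$, +}, Follow(F) = {$, +, )}.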

Ok, with that all done, we can use those tables. But before we do, a quick word about expressive power, because LL is not particularly powerful…

Limitations and Expressive power

There are always languages that can be captured by an LL(k+1) grammar but not by any LL(k) grammar. In other words, look-ahead size matters for the expressivity of LL grammars, and LL(k) for any specific k does not capture all deterministic context-free languages³.

We’ve already seen an example of an LL(2) grammar, but what would be a language that cannot be captured by any LL(k)? Take the language of $a$’s followed by $b$’s, where the number of $a$’s is $\geq$ the number of $b$’s. Or as a grammar:

$S = a\ S$ (1)
$S = A$ (2)
$A = a\ A\ b$ (3)
$A = \varepsilon$ (4)

The problem for LL here is that we would have to look ahead to the very end of the input before we could decide whether to start consuming the input with rule 1 or rule 2 (and then rule 3).

There is a class of grammars called LL-regular (LLR) grammars that captures all LL(k) grammars for any k, and slightly more. These LLR grammars are cool in that they are still parseable in linear time, as long as you have something called a “regular partition” of your grammar. Getting that is an undecidable problem though. And since there is an LR(1) grammar that is not in LLR, this stuff is the trifecta of confusing, impractical, and less powerful⁴ than a much more useful technique that we will cover later in this post: LR. But first, going from tables to parsers!

Predictive Parsing

Since we already know what the tables mean, we can write a simple parse table interpreter to finish our predictive parser. The parser is called predictive because based on the k look-ahead terminals, we decide the grammar rule to use to continue parsing, which typically predicts some of the structure of the input well beyond the part we peeked at for the look-ahead.

Ok, let’s write a quick parse table interpreter for our LL(2) example. We’ll start with some definitions.

use std::collections::HashMap;
use std::env;

use lazy_static::lazy_static;
use peekmore::PeekMore;

type Terminal = char;

#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
enum Sort {
    S,
    A1,
    A2,
}

enum Symbol {
    Sort(Sort),
    Terminal(Terminal),
}

#[derive(Debug, Eq, PartialEq)]
enum Rule {
    S,
    Aa,
    AEpsilon,
}

The imports become useful in a second; for now we have our terminals, sorts, a combination type Symbol, and the names of our grammar rules. Assuming we keep around a proper PDA stack of symbols, we can write our grammar rules now:

impl Rule {
    fn apply(&self, stack: &mut Vec<Symbol>) {
        match self {
            Rule::S => Self::s(stack),
            Rule::Aa => Self::aa(stack),
            Rule::AEpsilon => Self::a_epsilon(stack),
        }
    }

    fn s(stack: &mut Vec<Symbol>) {
        stack.push(Symbol::Terminal('a'));
        stack.push(Symbol::Terminal('b'));
        stack.push(Symbol::Sort(Sort::A2));
        stack.push(Symbol::Terminal('b'));
        stack.push(Symbol::Terminal('a'));
        stack.push(Symbol::Sort(Sort::A1));
    }

    fn aa(stack: &mut Vec<Symbol>) {
        stack.push(Symbol::Terminal('a'));
    }

    #[allow(clippy::ptr_arg)]
    fn a_epsilon(_stack: &mut Vec<Symbol>) {}
}

Clippy is great for catching all kinds of poor code, but for consistency I’ve chosen to #[allow] this time. Note that to effectively run a context-free grammar on a PDA, you need to push the symbols in your rules on the stack in reverse, as mentioned in the recap.

lazy_static! {
    static ref TABLE: HashMap<(Sort, Terminal, Terminal), Rule> = {
        let mut table = HashMap::new();
        assert_eq!(None, table.insert((Sort::S,  'a', 'a'), Rule::S));
        assert_eq!(None, table.insert((Sort::S,  'a', 'b'), Rule::S));
        assert_eq!(None, table.insert((Sort::A1, 'a', 'a'), Rule::Aa));
        assert_eq!(None, table.insert((Sort::A1, 'a', 'b'), Rule::AEpsilon));
        assert_eq!(None, table.insert((Sort::A2, 'a', 'b'), Rule::Aa));
        assert_eq!(None, table.insert((Sort::A2, 'b', 'a'), Rule::AEpsilon));
        table
    };
}

Nothing very special really, just encoding what we had already. The main parse loop is also very unexciting now that we have implemented most of the logic of the grammar already. We basically manage the stack, eliminating terminals on the stack with those from the input and applying rules from the table based on sort and look-ahead, and give errors if we get unexpected input:

pub fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    let input = env::args().nth(1).expect("Argument string to parse");
    let input = lex(input);
    let mut input = input.iter().peekmore();
    let mut stack = Vec::new();
    stack.push(Symbol::Sort(Sort::S));
    while let Some(symbol) = stack.pop() {
        return match symbol {
            Symbol::Terminal(predicted) => {
                if let Some(&actual) = input.next() {
                    if predicted == actual {
                        continue;
                    }
                    Err(format!(
                        "Expected terminal {predicted:?}, but got {actual:?}."
                    ))
                } else {
                    Err(format!("Expected terminal {predicted:?}, but got EOF."))
                }
            }
            Symbol::Sort(sort) => {
                if let &[Some(&term1), Some(&term2)] = input.peek_amount(2) {
                    if let Some(r) = TABLE.get(&(sort, term1, term2)) {
                        r.apply(&mut stack);
                        continue;
                    } else {
                        Err(format!(
                            "Unexpected {term1:?} {term2:?} while parsing {sort:?}"
                        ))
                    }
                } else {
                    Err("Unexpected end of input.".to_owned())
                }
            }
        };
    }
    // The stack is empty; check that the input is fully consumed too.
    if input.next().is_none() {
        Ok(())
    } else {
        Err("Expected EOF, but found more input.".to_owned())
    }
}

Recursive Descent

By encoding the parse table in data, we get some amount of interpretive overhead. We have a parse table interpreter with a stack we manage ourselves, but the stack is not really used any different from a call stack. So what if we use function calls instead? That’s the idea of recursive descent parsing. It actually makes our code smaller and more straight-forward, which is why it’s so popular as a technique for hand-written parsers.

use std::env;

use peekmore::PeekMore;
use peekmore::PeekMoreIterator;

type Iter<'a> = PeekMoreIterator<std::slice::Iter<'a, Terminal>>;

type Terminal = char;

fn consume(input: &mut Iter, predicted: Terminal) -> Result<(), String> {
    if let Some(&actual) = input.next() {
        if actual == predicted {
            Ok(())
        } else {
            Err(format!(
                "Expected terminal {predicted:?}, but got {actual:?}."
            ))
        }
    } else {
        Err("Unexpected end of file.".to_owned())
    }
}

This time we only need terminals as a type; the rest is gone, and so is the HashMap import for the parse table. We will need the input, and to be able to remove predicted terminals from it, so consume comes in handy.

fn sort_s(input: &mut Iter) -> Result<(), String> {
    // S
    match input.peek_amount(2) {
        &[Some('a'), Some('a')] => s(input),
        &[Some('a'), Some('b')] => s(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing S")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn sort_A1(input: &mut Iter) -> Result<(), String> {
    // A1
    match input.peek_amount(2) {
        &[Some('a'), Some('a')] => a_a(input),
        &[Some('a'), Some('b')] => a_epsilon(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing A")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn sort_A2(input: &mut Iter) -> Result<(), String> {
    // A2
    match input.peek_amount(2) {
        &[Some('a'), Some('b')] => a_a(input),
        &[Some('b'), Some('a')] => a_epsilon(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing A")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn s(input: &mut Iter) -> Result<(), String> {
    sort_A1(input)?;
    consume(input, 'a')?;
    consume(input, 'b')?;
    sort_A2(input)?;
    consume(input, 'b')?;
    consume(input, 'a')
}

fn a_a(input: &mut Iter) -> Result<(), String> {
    consume(input, 'a')
}

fn a_epsilon(_input: &mut Iter) -> Result<(), String> {
    Ok(())
}

Our parse table has now become code directly, with these functions named after the sorts of the rows. They call rules that are also functions, which in turn use the sort functions. Those rules also use consume, this time without having to reverse the order of the rule body.

pub fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    let input = env::args().nth(1).expect("Argument string to parse");
    let input = lex(input);
    let mut input = input.iter().peekmore();
    sort_s(&mut input)?;
    // Check that the input is fully consumed.
    if input.next().is_none() {
        Ok(())
    } else {
        Err("Expected EOF, but found more input.".to_owned())
    }
}

Finally, our main function just calls the right sort function instead of putting that sort on the stack. And the loop is gone, since we now use recursion.

Summary of LL, and an insight from the automaton

We’ve now seen LL(k) parsing: Left-to-right, Leftmost derivation. This leftmost derivation directly corresponds to walking through the parse tree top-down, depth-first, leftmost child first. Whenever we expand the leftmost sort by a rule for that sort, we have to choose a rule, and we use the look-ahead (of length k) to make that choice.

We’ve seen an LL(1) and an LL(2) grammar, and in general more look-ahead allows us to parse more grammars and more languages. Both are important: certain languages cannot be expressed in LL(1) or LL(2), and some LL(1) grammars are harder to read and write than the LL(2) grammar of the same language.

We’ve seen how we can construct simple DFAs for each rule in our grammar, and then replace the sort transitions $N_1 \xrightarrow{A} N_2$ by a (PDA) push transition ($\downarrow A$) from $N_1$ to all starts of DFAs corresponding to rules of $A$, and a pop transition ($\uparrow A$) from the ends of those DFAs to $N_2$. Then the LL table, the decision table of sort + look-ahead = rule, naturally follows from this PDA by finding what input will be consumed if a certain rule is chosen, and using that as the look-ahead to make the decision for that rule.

The recursive descent way of writing a parser directly as code is nice and simple; it really just follows the grammar. And since you’re writing plain old code with function calls, you can imagine people have found nice ways to extend and adapt the pattern of recursive descent parsers. For one, it’s quite easy to reason about where you are in the parse when hitting an error state, which makes it fairly easy to give friendly error messages when the parser doesn’t accept an input. You can also use a trick called node reparenting to fix up direct left-recursion, where you use a loop or tail-recursion to locally construct the tree bottom-up (see the sketch below). You could argue that such a parser is a hybrid between recursive descent and ascent, a “recursive descent-ascent parser”.
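Here’s a sketch of that trick on a hypothetical, directly left-recursive rule E = E + 1 | 1: the loop consumes operands left to right, and each time around it reparents the tree built so far as the left child of a new node, so the result still comes out left-associative.

#[derive(Debug)]
enum Expr {
    One,
    Add(Box<Expr>, Box<Expr>),
}

/// E = E + 1 | 1, parsed with a loop instead of left recursion: start with
/// the leftmost 1, then keep wrapping the tree built so far ("reparenting").
fn expr(input: &mut std::iter::Peekable<std::str::Chars<'_>>) -> Result<Expr, String> {
    let mut tree = one(input)?;
    while input.peek() == Some(&'+') {
        input.next(); // consume the '+'
        let right = one(input)?;
        tree = Expr::Add(Box::new(tree), Box::new(right)); // reparent
    }
    Ok(tree)
}

fn one(input: &mut std::iter::Peekable<std::str::Chars<'_>>) -> Result<Expr, String> {
    match input.next() {
        Some('1') => Ok(Expr::One),
        other => Err(format!("Expected '1', but got {other:?}.")),
    }
}

// expr(&mut "1+1+1".chars().peekable()) gives Add(Add(One, One), One).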

Finally, if we look back at the automaton, we can see that the PDAs we build have a very strict shape. We either have a non-deterministic choice due to multiple push transitions for a sort, or we have predicted input: a single path of terminals to consume from the input. If we think back to the NFAs and DFAs from early on in this blog post series, those used the input to choose what state to go to next. Now we have single-path DFAs that just consume input, and a separate table over the look-ahead to resolve the non-determinism from the pushes and pops. The strict shape here indicates that we’re not really making full use of the power of automata. This will change with the next parsing technique.

Bottomup, LR parsing

LR stands for left-to-right, rightmost derivation in reverse. If you think about it, left-to-right and rightmost derivation are incompatible: the rightmost derivation chooses the rule for the rightmost sort first every time, but that means skipping over some unknown amount of input if you read left-to-right to even get to that point. However, the reverse of a rightmost derivation is a left-to-right form of parsing. This reverse derivation describes going bottom-up, left-to-right through the parse tree.

Expressive power and relation to LL

One of the biggest upsides of LR(k) parsing is its expressivity. The set of all LL(k) languages for any k is a strict subset of the LR(1) languages. Note that this is speaking of languages, not grammars. For grammars it holds that any LL(k) grammar for a specific k is also an LR(k) grammar, but not necessarily the other way around.

An LR(k) grammar for any k greater than 1 can be automatically transformed into an LR(1) grammar, though the result is not necessarily structurally equivalent. This highlights the difference between grammar- and language-level equivalence: we can capture any LR language in an LR(1) grammar, but LR with a larger k may be able to describe the language in a nicer way (a smaller grammar).

A good overview of how LL and LR relate to each other on the grammar and language level is summarised on the Computer Science Stack Exchange. In the comments someone suggests making a list of examples for each of these relationships, which seems like a great idea, but not something I have the patience for right now. This blog post has enough scope creep already.

How LR works

In order to give a reverse rightmost derivation, we need to figure out what sorts can be at the leftmost leaf of the parse tree for our LR grammar. Then we try to apply the rules for those sorts all simultaneously. And to do so, we can’t just use the automaton construction we’ve used for LL.

Remember that the automata we used previously mapped well onto recursive descent, and showed us where to use an LL parse table with look-ahead to resolve non-determinism. Crucially, those automata observe every rule we go into. But for LR we need to explore down all the rules simultaneously. Let’s see if we can’t get somewhere with that idea and the example grammar of the language that wasn’t LL:

$S = a\ S$ (1)
$S = A$ (2)
$A = a\ A\ b$ (3)
$A = \varepsilon$ (4)

We start again with the separate automata for each rule:

*[Figure: the four rule automata — S₁₀ →a S₁₁ →S S₁ for rule 1; S₂₀ →A S₂ for rule 2; A₃₀ →a A₃₁ →A A₃₂ →b A₃ for rule 3; and the lone accepting state A₄ for rule 4.]*

Now in order to explore to the bottom-left of the parse tree, we need to be free to go into any rule. So we will connect the rules again to the nodes that expect a certain sort, but with epsilon transitions so we don’t observe how far down we are or with what rule in particular we got there. We’ll need that later, but let’s not worry about that until we have the downward exploration.

*[Figure: the rule automata connected with ε-transitions — every state that expects a sort gets ε-edges to the start states of that sort’s rules: S₁₁ → S₁₀ and S₂₀; S₂₀ → A₃₀ and A₄; A₃₁ → A₃₀ and A₄.]*

Obviously this is not a full automaton model of a parser yet, but it allows us to always go down to the next leaf of the parse tree without using the stack. Let’s clean up the automaton with an NFA-to-DFA conversion:

*[Figure: after the NFA-to-DFA conversion — Box0 {S₁₀, S₂₀, A₃₀, A₄} →a Box1 {S₁₀, S₁₁, S₂₀, A₃₀, A₃₁, A₄}, which loops on a; Box1 →A Box2 {S₂, A₃₂} →b A₃; Box0 →A S₂; Box1 →S S₁.]*

This is almost exactly how an LR(0) automaton would be drawn. Instead of S₁₀ and S₁₁, you’d write out the LR items S = . a S and S = a . S, but otherwise it would be pretty much this. This is considered a PDA, though what’s happening on the stack is left implicit. That’s because what’s actually happening on the stack of an LR automaton is very structured, but a little involved. That makes the PDA harder to read and draw, but I’ll demonstrate it once:

*[Figure: the fully explicit PDA — shift edges consume a terminal and push it paired with the originating box, e.g. a, ↓[a,0] from Box0 to Box1; reduce edges pop a whole rule body and push the rule’s sort, e.g. ↑[b,2] [A,1] [a,1] ↓[A,1] from A₃ back to Box2, or ↑[A,0] from S₂ to the accepting state.]*

This should look quite familiar. We’re pushing inputs on the stack as we consume them, paired with the state we’re in at that point. And then we’re popping whole bodies of rules off the stack and replacing them with the sort of that rule. The first thing is called a shift action, the second is called a reduce action. We’ve seen this trick before in the naive PDA built from a CFG, all the way at the start of this post in the refresher. But this time we get an automaton with more states.

Notice that where a reduce action goes depends on the originating state of the last thing that’s popped. That’s why we keep track of those states on the stack. When we reduce by rule 3 (state A₃), we go to different places depending on whether the a came from box 1 or box 0. This is easier to see in our proper LR(0) automaton, where box 1 points to box 2 with a transition labeled A. This is a goto action. In an LR parse table interpreter, the goto is a separate action that immediately follows a reduce action, which merely returns to the last popped state. When a reduce just returns, that’s also more like a function call and return, so that’s nice. Anyway, that’s also why a reduce transition in the above automaton always labels the originating state of the pushed sort the same as the last thing popped from the stack.

Something worth repeating now that we’re looking at the details: LL decides what rule to take before consuming the input for that rule, whereas LR decides what rule to take after consuming all the input for that rule. In other words, we only reduce by a rule when we’ve seen the entire body of the rule; that’s why there’s less trouble with look-ahead.

Speaking of look-ahead: we have some shift-reduce problems in our automaton. And by that I mean: how do we choose when to shift and when to reduce when both are an option? This is a determinism issue in our current automaton, and just like in our LL automaton, we solve it with look-ahead (and yes, that can and will later be summarised in a parse table). Our latest automaton gives a clear view of what we will be able to do if we reduce, so the look-ahead follows from what input can be consumed next after each reduce:

*[Figure: the same fully explicit PDA, now with look-ahead on the reduce edges — e.g. ($), ↓[A,0] from Box0 to S₂; (b,$), ↓[A,1] from Box1 to Box2; ($), ↑[A,1] ↓[S,1] from Box2 to S₁.]*

As you can see, we need at most 1 look-ahead to deterministically parse this grammar. We’re sometimes looking ahead to the end-of-input represented with $. The look-ahead makes this an LALR(1) grammar; what that is and why it’s different from normal LR(1) is what we’ll see in the next section.

LR parsetable construction and expressivity

Let’s look at some example grammars, how to construct their tables, and when you need a better parsetable construction method.

LR(0)

LR(0) does not look ahead, but just reduces whenever possible. If there are multiple options, you have a shift-reduce or a reduce-reduce conflict. Shift-shift conflicts don’t exist in LR, since the NFA-to-DFA process would have merged the two states that such conflicting transitions would point to. Let’s look at an LR(0) grammar:

$S = E\ 2$ (1)
$E = E\ 1$ (2)
$E = 1$ (3)

The LR automaton for this grammar is:

*[Figure: the LR(0) automaton — Box0 {S₁₀, E₂₀, E₃₀} →1 E₃; Box0 →E Box1 {S₁₁, E₂₁}; Box1 →2 S₁; Box1 →1 E₂; Box0 →S accept.]*

The corresponding parse table follows this automaton:

|      | 1    | 2    | $      | E    |
|------|------|------|--------|------|
| Box0 | s E₃ |      | accept | Box1 |
| Box1 | s E₂ | s S₁ |        |      |
| E₃   | r 3  | r 3  | r 3    |      |
| E₂   | r 2  | r 2  | r 2    |      |
| S₁   | r 1  | r 1  | r 1    |      |

The transition from box 0 to E₃ that shifts 1 becomes a shift action to E₃ in the row of box 0 and the column of 1. The transition from box 0 to box 1 with E becomes a goto to box 1 in the row of box 0 and the column of E. Finally, a state that’s at the end of a rule gets a reduce action by that rule (indicated by its number) in every input column. Accepting the input is typically based on a look-ahead of the end-of-input.
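To make that concrete, here’s a sketch of a parse loop driven by this table, with the states and actions hard-coded (E₃ is spelled E3 and so on, and accepting is indeed based on seeing the end-of-input):

/// An LR parse loop for S = E 2, E = E 1 | 1, hard-coded from the table above.
#[derive(Clone, Copy, Debug)]
enum State { Box0, Box1, E3, E2, S1 }

fn parse(input: &str) -> Result<(), String> {
    use State::*;
    let mut input = input.chars().chain(['$']).peekable(); // '$' = end of input
    let mut stack = vec![Box0]; // a stack of states, like the [_, n] pairs above
    loop {
        match (*stack.last().unwrap(), *input.peek().unwrap()) {
            // Shift actions: consume the terminal and push the new state.
            (Box0, '1') => { input.next(); stack.push(E3); }
            (Box1, '1') => { input.next(); stack.push(E2); }
            (Box1, '2') => { input.next(); stack.push(S1); }
            // Reduce actions: pop one state per body symbol, then goto on E.
            (E3, _) => { stack.pop(); goto_e(&mut stack)?; }                     // r 3: E = 1
            (E2, _) => { stack.truncate(stack.len() - 2); goto_e(&mut stack)?; } // r 2: E = E 1
            // r 1 (S = E 2), folded together with the accept on end-of-input.
            (S1, '$') => return Ok(()),
            (s, t) => return Err(format!("Unexpected {t:?} in state {s:?}.")),
        }
    }
}

/// The goto column for E: where to continue after reducing to E.
fn goto_e(stack: &mut Vec<State>) -> Result<(), String> {
    match stack.last() {
        Some(State::Box0) => { stack.push(State::Box1); Ok(()) }
        other => Err(format!("No goto on E from {other:?}.")),
    }
}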

Simple LR (SLR)

The smallest motivating example for Simple LR is the following grammar that parses the same language as before:

$S = E\ 2$ (1)
$E = 1\ E$ (2)
$E = 1$ (3)

Notice how rule 2 is now right-recursive instead of left-recursive. It’s a nice symmetry: left-recursive rules give you trouble in LL, and right-recursive rules (can) give you trouble in LR⁵.

*[Figure: the LR(0) automaton — Box0 {S₁₀, E₂₀, E₃₀} →1 Box1 {E₂₀, E₂₁, E₃}, which loops on 1; Box1 →E E₂; Box0 →E S₁₁ →2 S₁; Box0 →S accept.]*
|      | 1            | 2    | $      | E   |
|------|--------------|------|--------|-----|
| Box0 | s Box1       |      | accept | S₁₁ |
| Box1 | s Box1 / r 3 | r 3  | r 3    | E₂  |
| S₁₁  |              | s S₁ |        |     |
| S₁   | r 1          | r 1  | r 1    |     |
| E₂   | r 2          | r 2  | r 2    |     |

Yay, we have a shift-reduce conflict. How do we solve it? By not blindly putting a reduce in the entire row of a state that can reduce. If we check the Follow set of the sort we’re reducing to (we defined that when we built LL parse tables, remember?), we can put the reduce action in only the columns of the terminals in that Follow set. If we look at the grammar, we can see that only 2 can follow E. So the SLR table for this grammar is:

|      | 1      | 2    | $      | E   |
|------|--------|------|--------|-----|
| Box0 | s Box1 |      | accept | S₁₁ |
| Box1 | s Box1 | r 3  |        | E₂  |
| S₁₁  |        | s S₁ |        |     |
| S₁   |        |      | r 1    |     |
| E₂   |        | r 2  |        |     |

Look-Ahead LR (LALR)

From now on we’ll be looking at reduce-reduce conflicts only. While you can get shift-reduce conflicts with the following algorithms through grammars that don’t fit (due to ambiguity, or requiring more look-ahead than you’re taking into account), when you give an LALR(k) grammar to an SLR(k) algorithm you can only get reduce-reduce conflicts. The same goes for an LR(k) grammar put through the LALR(k) algorithm.

Here is our example grammar that just barely doesn’t work with SLR:

$S = a\ E\ c$ (1)
$S = a\ F\ d$ (2)
$S = b\ F\ c$ (3)
$E = e$ (4)
$F = e$ (5)

See how rules 4 and 5 are the same except they have different sort names? Yeah, that’s going to be fun if they’re used with the same prefix like in rules 1 and 2. Let’s have a look at the automaton and SLR parse table.

*[Figure: the LR(0) automaton — Box0 {S₁₀, S₂₀, S₃₀} →a Box1 {S₁₁, S₂₁, E₄₀, F₅₀} and →b Box2 {S₃₁, E₄₀, F₅₀}; Box1 →e Box3 {E₄, F₅}, →E S₁₂ →c S₁, and →F S₂₂ →d S₂; Box2 →e F₅ and →F S₃₂ →c S₃; Box0 →S accept.]*
     | a      | b      | c         | d    | e      | $      | E   | F
Box0 | s Box1 | s Box2 |           |      |        | accept |     |
Box1 |        |        |           |      | s Box3 |        | S₁₂ | S₂₂
Box2 |        |        |           |      | s F₅   |        |     | S₃₂
Box3 |        |        | r 4 / r 5 | r 5  |        |        |     |
S₁₂  |        |        | s S₁      |      |        |        |     |
S₁   |        |        |           |      |        | r 1    |     |
S₂₂  |        |        |           | s S₂ |        |        |     |
S₂   |        |        |           |      |        | r 2    |     |
S₃₂  |        |        | s S₃      |      |        |        |     |
S₃   |        |        |           |      |        | r 3    |     |
F₅   |        |        | r 5       | r 5  |        |        |     |

The reduce-reduce conflict, as promised. It's in box 3, where we can reduce by E₄ or F₅ when the look-ahead is c. This is because the Follow sets of both E and F contain c, due to rules 1 and 3 respectively. If we look at the automaton though, we can clearly see that when we're in box 3 and have a c next, we should reduce by E₄.

Look-Ahead LR parsing uses basically this method, analysing which shifts can happen after which reduces. Putting it in algorithmic terms: LALR doesn't use the LL Follow sets, but defines more accurate Follow sets based on the automaton. Each instance of the start of a rule in the automaton (like F₅₀ in boxes 1 and 2) gets its own Follow set. Here that means the E₄₀ in box 1 is only followed by c (its goto leads to S₁₂, which expects c), and the F₅₀ in box 1 only by d. That's how we resolve the conflict with LALR:

     | a | b | c   | d   | e | $ | E | F
Box3 |   |   | r 4 | r 5 |   |   |   |

Note that since the LALR Follow sets follow directly from the automaton, this is basically the same as the intuition given at the end of the previous section.

LR(1)

I like this LALR parsing story. It's so intuitive with the NFA-to-DFA conversion, just looking at the automaton to see the Follow sets. But it doesn't give you the complete power of deterministic push-down automata. I present to you the previous example grammar with one more rule:

S = a E c (1)
S = a F d (2)
S = b F c (3)
E = e (4)
F = e (5)
S = b E d (6)

This results in an automaton that’s almost the same as before:

A LR(0) automaton for the above grammar digraph "A LR(0) automaton for the above grammar" { bgcolor="transparent"; node [shape=circle, fixedsize=shape, width=0.5, fontcolor="#010101", color="#010101"]; edge [fontcolor="#010101", color="#010101"]; rankdir=LR; subgraph { rank=same; Box0 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box0</td></tr> <tr><td>S₁₀</td></tr> <tr><td bgcolor="#ddd">S₂₀</td></tr> <tr><td>S₃₀</td></tr> <tr><td bgcolor="#ddd">S₆₀</td></tr> </table>>]; fin [shape=doublecircle, width=0.4, label=""]; Box0:t:s -> fin:n [xlabel="S "]; } start1 [shape=none, label="", width=0]; fin [shape=doublecircle, width=0.4, label=""]; S₁₂; S₁; S₂₂ [style=filled, fillcolor="#ddd"]; S₂ [style=filled, fillcolor="#ddd"]; S₃₂; S₃; S₆₂ [style=filled, fillcolor="#ddd"]; S₆ [style=filled, fillcolor="#ddd"]; Box1 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box1</td></tr> <tr><td>S₁₁</td></tr> <tr><td bgcolor="#ddd">S₂₁</td></tr> <tr><td bgcolor="#aaa">E₄₀</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₀</font></td></tr> </table>>]; Box2 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box2</td></tr> <tr><td>S₃₁</td></tr> <tr><td bgcolor="#ddd">S₆₁</td></tr> <tr><td bgcolor="#aaa">E₄₀</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₀</font></td></tr> </table>>]; Box3 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box3</td></tr> <tr><td bgcolor="#aaa">E₄</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅</font></td></tr> </table>>]; start1 -> Box0:t; Box0:t -> Box1:t [label="a"]; Box0:t -> Box2:t [label="b"]; Box1:t -> Box3:t [label="e"]; Box1:t -> S₁₂ [label="E"]; Box1:t -> S₂₂ [label="F"]; S₁₂ -> S₁ [label="c"]; S₂₂ -> S₂ [label="d"]; Box2:t -> Box3:t [label="e"]; Box2:t -> S₃₂ [label="F"]; S₃₂ -> S₃ [label="c"]; Box2:t -> S₆₂ [label="E"]; S₆₂ -> S₆ [label="d"]; } A LR(0) automaton for the above grammar Box0 Box0 S₁₀ S₂₀ S₃₀ S₆₀ fin Box0:s->fin:n S   Box1 Box1 S₁₁ S₂₁ E₄₀ F₅₀ Box0:t->Box1:t a Box2 Box2 S₃₁ S₆₁ E₄₀ F₅₀ Box0:t->Box2:t b start1 start1->Box0:t S₁₂ S₁₂ S₁ S₁ S₁₂->S₁ c S₂₂ S₂₂ S₂ S₂ S₂₂->S₂ d S₃₂ S₃₂ S₃ S₃ S₃₂->S₃ c S₆₂ S₆₂ S₆ S₆ S₆₂->S₆ d Box1:t->S₁₂ E Box1:t->S₂₂ F Box3 Box3 E₄ F₅ Box1:t->Box3:t e Box2:t->S₃₂ F Box2:t->S₆₂ E Box2:t->Box3:t e

We now have a reduce-reduce conflict in box 3 again. With look-ahead c you can reduce to both E and F. Same for look-ahead d, by the way. It is deterministically decidable which one we should reduce to, but it now depends on which state we came from.

With LALR we build an automaton for each rule, and try to reuse that rule independent of the context in which it is used. That keeps our process simple and our automaton small, but it also causes us to lose exactly the information we need to resolve the reduce-reduce conflict in box 3 above: the left context. I know the information is technically on the stack, but our parsing process decides on the rule to reduce by based on the state and look-ahead only.

LR(k) automata/parsers still keep the same parsing process; they just have larger automata whose states summarise the left context. We're basically going to distinguish almost every occurrence of a sort in the grammar, similar to when we made our LL(2) grammar strong:

A LR(1) automaton for the above grammar digraph "A LR(1) automaton for the above grammar" { bgcolor="transparent"; node [shape=circle, fixedsize=shape, width=0.5, fontcolor="#010101", color="#010101"]; edge [fontcolor="#010101", color="#010101"]; rankdir=LR; subgraph { rank=same; Box0 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box0</td></tr> <tr><td>S₁₀ ($)</td></tr> <tr><td bgcolor="#ddd">S₂₀ ($)</td></tr> <tr><td>S₃₀ ($)</td></tr> <tr><td bgcolor="#ddd">S₆₀ ($)</td></tr> </table>>]; fin [shape=doublecircle, width=0.4, label=""]; Box0:t:s -> fin:n [xlabel="S "]; } start1 [shape=none, label="", width=0]; fin [shape=doublecircle, width=0.4, label=""]; S₁₂; S₁; S₂₂ [style=filled, fillcolor="#ddd"]; S₂ [style=filled, fillcolor="#ddd"]; Box1 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box1</td></tr> <tr><td>S₁₁ ($)</td></tr> <tr><td bgcolor="#ddd">S₂₁ ($)</td></tr> <tr><td bgcolor="#aaa">E₄₀ (c)</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₀ (d)</font></td></tr> </table>>]; Box2 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box2</td></tr> <tr><td>S₃₁ ($)</td></tr> <tr><td bgcolor="#ddd">S₆₁ ($)</td></tr> <tr><td bgcolor="#aaa">E₄₀ (d)</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₀ (c)</font></td></tr> </table>>]; Box3 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box3</td></tr> <tr><td bgcolor="#aaa">E₄₁ (c)</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₁ (d)</font></td></tr> </table>>]; Box4 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box4</td></tr> <tr><td bgcolor="#aaa">E₄₁ (d)</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₁ (c)</font></td></tr> </table>>]; S₃₂; S₃; S₆₂ [style=filled, fillcolor="#ddd"]; S₆ [style=filled, fillcolor="#ddd"]; start1 -> Box0:t; Box0:t -> Box1:t [label="a"]; Box0:t -> Box2:t [label="b"]; Box1:t -> Box3:t [label="e"]; Box1:t -> S₁₂ [label="E"]; Box1:t -> S₂₂ [label="F"]; S₁₂ -> S₁ [label="c"]; S₂₂ -> S₂ [label="d"]; Box2:t -> Box4:t [label="e"]; Box2:t -> S₃₂ [label="F"]; S₃₂ -> S₃ [label="c"]; Box2:t -> S₆₂ [label="E"]; S₆₂ -> S₆ [label="d"]; } A LR(1) automaton for the above grammar Box0 Box0 S₁₀ ($) S₂₀ ($) S₃₀ ($) S₆₀ ($) fin Box0:s->fin:n S   Box1 Box1 S₁₁ ($) S₂₁ ($) E₄₀ (c) F₅₀ (d) Box0:t->Box1:t a Box2 Box2 S₃₁ ($) S₆₁ ($) E₄₀ (d) F₅₀ (c) Box0:t->Box2:t b start1 start1->Box0:t S₁₂ S₁₂ S₁ S₁ S₁₂->S₁ c S₂₂ S₂₂ S₂ S₂ S₂₂->S₂ d Box1:t->S₁₂ E Box1:t->S₂₂ F Box3 Box3 E₄₁ (c) F₅₁ (d) Box1:t->Box3:t e Box4 Box4 E₄₁ (d) F₅₁ (c) Box2:t->Box4:t e S₃₂ S₃₂ Box2:t->S₃₂ F S₆₂ S₆₂ Box2:t->S₆₂ E S₃ S₃ S₃₂->S₃ c S₆ S₆ S₆₂->S₆ d

How do we do this? We duplicate each rule for each terminal in the LL Follow set of the sort of that rule, and annotate each copy with that terminal. Now we do our usual thing: rule to automaton, epsilons, NFA-to-DFA. But when wiring the epsilons, the terminal annotations now have to match up with the LALR Follow set of that occurrence of the sort.
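
For example, the LL Follow sets here are Follow(E) = {c, d} and Follow(F) = {c, d}, so rule 4 is duplicated into the two annotated copies E (c) = e and E (d) = e, and rule 5 likewise for F. When we then wire up the epsilons for the item S = a . E c, only the copy annotated with c gets connected, which is exactly the E₄₀ (c) you can see in box 1 of the diagram above.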

With this particular example, the automaton looks almost the same. There's a bit more fluff with the annotations, but they basically propagate the look-ahead for each rule. Which means we can distinguish the contexts in which E and F are used! In general though, duplicating each rule for each terminal in the LL Follow set leads to a very large number of rules, and plenty of the time this isn't necessary… LR(1) automata have lots of redundant states that do basically the same thing and would have been merged in LALR without any reduce-reduce conflicts.

Parse table construction algorithm

You’ve already seen parse table construction by automaton for both LL and the many flavours of LR now. And you’ve seen parse table construction by First and Follow set for LL. Parse table construction for LR will of course also require First and Follow sets, sometimes including more accurate Follow sets for particular occurrences of sorts. It’s mostly an iterative build-up of the NFA-to-DFA (powerset construction) though. I’m not going to detail that in this post.

While researching the material, I found some claims for minimal LR(1) algorithms, which create LALR(1) tables when possible, and slightly larger tables when necessary. They look interesting, but quite something to figure out, and I haven’t gotten to what I wanted to write about most yet, so that will have to wait until some other time. Perhaps I’ll include the normal LR parse table construction algorithm there too as a start.

Recursive Ascent

We finally get to the original impetus for this blog post: recursive ascent parsing. As you might be able to guess, this is the LR analogue to recursive descent for LL. So we’re going to write code that directly executes the LR automaton instead of simulating it by parse table interpretation.

Before, in recursive descent parsing, we saw that rules and sorts become functions. Rule functions call sort functions to parse a sort, and sort functions check the look-ahead to choose the rule by which to parse that sort. Basically, grammar rules became functions, and the parse table was split into functions for each sort.

In recursive ascent parsing we will turn states of the LR automaton into functions. Each function will shift or reduce based on the input and call the corresponding state for that edge. Let’s expand our LR(1) example a little bit, and then take a look at the recursive ascent parsing:

S = a E c (1)
S = a F d (2)
S = b F c (3)
E = e e (4)
F = e e (5)
S = b E d (6)
S = b e e a (7)

The reason for the extra e's in rules 4 and 5 is to show how that increases the LR(1) automaton size: we'll now have 4 states here, where LALR merging would give 2 plus a reduce-reduce conflict. The reason for adding rule 7 is so we have a state where we might shift or reduce depending on the look-ahead, which influences the code we generate. Let's check out the automaton first:

A LR(0) automaton for the above grammar digraph "A LR(0) automaton for the above grammar" { bgcolor="transparent"; node [shape=circle, fixedsize=shape, width=0.5, fontcolor="#010101", color="#010101"]; edge [fontcolor="#010101", color="#010101"]; rankdir=LR; subgraph { rank=same; Box0 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box0</td></tr> <tr><td>S₁₀</td></tr><tr><td bgcolor="#ddd">S₂₀</td></tr> <tr><td>S₃₀</td></tr> <tr><td bgcolor="#ddd">S₆₀</td></tr> <tr><td>S₇₀</td></tr> </table>>]; fin [shape=doublecircle, width=0.4, label=""]; Box0:t:s -> fin:n [xlabel="S "]; } start1 [shape=none, label="", width=0]; fin [shape=doublecircle, width=0.4, label=""]; S₁₂; S₁; S₂₂ [style=filled, fillcolor="#ddd"]; S₂ [style=filled, fillcolor="#ddd"]; S₃₂; S₃; S₆₂ [style=filled, fillcolor="#ddd"]; S₆ [style=filled, fillcolor="#ddd"]; Box1 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box1</td></tr> <tr><td>S₁₁</td></tr> <tr><td bgcolor="#ddd">S₂₁</td></tr> <tr><td bgcolor="#aaa">E₄₀</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₀</font></td></tr> </table>>]; Box2 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box2</td></tr> <tr><td>S₃₁</td></tr> <tr><td bgcolor="#ddd">S₆₁</td></tr> <tr><td>S₇₁</td></tr> <tr><td bgcolor="#aaa">E₄₀</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₀</font></td></tr> </table>>]; Box3 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box3</td></tr> <tr><td bgcolor="#aaa">E₄₁</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₁</font></td></tr> </table>>]; Box4 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box4</td></tr> <tr><td>S₇₂</td></tr> <tr><td bgcolor="#aaa">E₄₁</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅₁</font></td></tr> </table>>]; Box5 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box5</td></tr> <tr><td bgcolor="#aaa">E₄ (c)</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅ (d)</font></td></tr> </table>>]; Box6 [shape=none, label=<<table cellborder="0" port="t"> <tr><td>Box6</td></tr> <tr><td>S₇₃</td></tr> <tr><td bgcolor="#aaa">E₄ (d)</td></tr> <tr><td bgcolor="#777"><font color="#fefefe">F₅ (c)</font></td></tr> </table>>]; start1 -> Box0:t; Box0:t -> Box1:t [label="a"]; Box0:t -> Box2:t [label="b"]; Box1:t -> Box3:t [label="e"]; Box1:t -> S₁₂ [label="E"]; Box1:t -> S₂₂ [label="F"]; Box3:t -> Box5:t [label="e"]; S₁₂ -> S₁ [label="c"]; S₂₂ -> S₂ [label="d"]; Box2:t -> Box4:t [label="e"]; Box2:t -> S₃₂ [label="F"]; S₃₂ -> S₃ [label="c"]; Box2:t -> S₆₂ [label="E"]; S₆₂ -> S₆ [label="d"]; Box4:t -> Box6:t [label="e"]; Box6:t -> S₇ [label="a"]; } A LR(0) automaton for the above grammar Box0 Box0 S₁₀ S₂₀ S₃₀ S₆₀ S₇₀ fin Box0:s->fin:n S   Box1 Box1 S₁₁ S₂₁ E₄₀ F₅₀ Box0:t->Box1:t a Box2 Box2 S₃₁ S₆₁ S₇₁ E₄₀ F₅₀ Box0:t->Box2:t b start1 start1->Box0:t S₁₂ S₁₂ S₁ S₁ S₁₂->S₁ c S₂₂ S₂₂ S₂ S₂ S₂₂->S₂ d S₃₂ S₃₂ S₃ S₃ S₃₂->S₃ c S₆₂ S₆₂ S₆ S₆ S₆₂->S₆ d Box1:t->S₁₂ E Box1:t->S₂₂ F Box3 Box3 E₄₁ F₅₁ Box1:t->Box3:t e Box2:t->S₃₂ F Box2:t->S₆₂ E Box4 Box4 S₇₂ E₄₁ F₅₁ Box2:t->Box4:t e Box5 Box5 E₄ (c) F₅ (d) Box3:t->Box5:t e Box6 Box6 S₇₃ E₄ (d) F₅ (c) Box4:t->Box6:t e S₇ S₇ Box6:t->S₇ a

Perhaps making both changes at the same time makes this a bad example to show off LR(1) automaton size… If you imagine the automaton without rule 7 you’ll see that boxes 3 and 4 are the same except for their ingoing and outgoing edges. This is what happens with longer rules and having to distinguish the final state of the rules for a different look-ahead (boxes 5 and 6 here).

The other notable difference is that we now have a box 6 that can both shift and reduce. This will make the code for the recursive ascent more interesting. Let’s start with the basics:

use std::env;
use std::iter::Peekable;

type Iter<'a> = Peekable<std::slice::Iter<'a, Terminal>>;

type Terminal = char;

#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
enum Sort {
    S,
    E,
    F,
}

/// Box0, the starting state of our automaton.
/// Itemset:
/// S = . a E c
/// S = . a F d
/// S = . b F c
/// S = . b E d
/// S = . b e e a
fn box0(input: &mut Iter) -> Result<(), String> {
    match input.next() {
        None => Err(String::from("Unexpected end of input: S = .")),
        Some('a') => match box1(input)? {
            Sort::S => Ok(()),
            s => unreachable!("Unexpected sort: S = a {s:?}"),
        },
        Some('b') => match box2(input)? {
            Sort::S => Ok(()),
            s => unreachable!("Unexpected sort: S = b {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = . {c}")),
    }
}

This should look familiar from the recursive descent parser code. The notable difference is that the function names are now less connected to the grammar and more to the LR automaton, which makes the code, stack traces, etc. harder to understand.

/// Box1
/// Itemset:
/// S = a . E c
/// S = a . F d
/// E = . e
/// F = . e
fn box1(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = a . E; S = a . F",
        )),
        Some('e') => match box3(input)? {
            Sort::E => s12(input),
            Sort::F => s22(input),
            s => unreachable!("Unexpected sort: S = a {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = a . {c}")),
    }
}

/// S = a E . c
fn s12(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = a E . c",
        )),
        Some('c') => s1(input),
        Some(c) => Err(format!("Unexpected input: S = a E . {c}")),
    }
}

/// S = a E c .
fn s1(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = a E c . {c}")),
    }
}

/// S = a F . d
fn s22(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = a F . d",
        )),
        Some('d') => s2(input),
        Some(c) => Err(format!("Unexpected input: S = a F . {c}")),
    }
}

/// S = a F d .
fn s2(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = a F d . {c}")),
    }
}

/// Box3
/// Itemset:
/// E = e . e (c)
/// F = e . e (d)
fn box3(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => box5(input),
        Some(c) => Err(format!("Unexpected input: E = e . {c} ; F = e . {c}")),
    }
}

/// Box5
/// Itemset:
/// E = e e . (c)
/// F = e e . (d)
fn box5(input: &mut Iter) -> Result<Sort, String> {
    match input.peek() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('c') => Ok(Sort::E),
        Some('d') => Ok(Sort::F),
        Some(c) => Err(format!("Unexpected input: E = e e . {c} ; F = e e . {c}")),
    }
}

This bit of code should give you an idea of the code pattern in the “easy case”. Each state either shifts further along in the rule(s) it's in (e.g. s12, box3), shifts into a new rule and expects a sort back to use for the goto (e.g. box1), or reduces (e.g. s1, box5).

/// Box2
/// Itemset:
/// S = b . F c
/// S = b . E d
/// S = b . e e a
/// E = . e
/// F = . e
fn box2(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => match box4(input)? {
            (0, Sort::F) => s32(input),
            (0, Sort::E) => s62(input),
            (1, Sort::S) => Ok(Sort::S),
            s => unreachable!("Unexpected return/sort: S = b {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = b . {c}")),
    }
}

This is the point where things start looking different. In box 2 we might shift e because we’ve entered rules 4 or 5 which will reduce to E or F. But we could also be in rule 7. If the result from box 4 is that we were in rule 7, we need to go back to the previous caller. So function box4 returns a pair of the number of returns left to go and the sort we’re reducing to. This way we can distinguish the two cases and take the appropriate action.

If you want to keep a recursive ascent code generator simple, you can of course always return a pair. You could also generate the code in continuation passing style, where you pass in a function that takes the sort and does the goto action, instead of accepting a pair as a result. But because the Rust compiler is not very good at tail call optimisation, I'm not doing that pattern here.
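
To give a hypothetical taste of that shape anyway, here is a sketch shrunk down to just S = a E c with E = e, reusing the Iter alias defined above. None of these functions are part of the parser in this post; all the names are made up.

// The goto continuation that a reducing state calls instead of returning a pair.
type GotoE = dyn Fn(&mut Iter) -> Result<(), String>;

/// S = . a E c
fn box0_cps(input: &mut Iter) -> Result<(), String> {
    match input.next() {
        Some('a') => box1_cps(input),
        _ => Err(String::from("Expected a")),
    }
}

/// S = a . E c ; E = . e — shift e and pass our own goto on E along.
fn box1_cps(input: &mut Iter) -> Result<(), String> {
    match input.next() {
        Some('e') => e_state_cps(input, &s12_cps),
        _ => Err(String::from("Expected e")),
    }
}

/// E = e . — reduce by calling the goto continuation: no pair, no counting.
fn e_state_cps(input: &mut Iter, goto_e: &GotoE) -> Result<(), String> {
    goto_e(input)
}

/// S = a E . c — the goto target for E that box1_cps passed down.
fn s12_cps(input: &mut Iter) -> Result<(), String> {
    match input.next() {
        Some('c') => Ok(()), // S = a E c .
        _ => Err(String::from("Expected c")),
    }
}

In the full grammar every state would have to thread through the gotos of the states below it on the stack, and reducing a rule with a body of length n would call the continuation of the state n frames back. Each of those calls grows the native stack unless it's compiled as a tail call, which Rust doesn't guarantee.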

/// S = b F . c
fn s32(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = b F . c",
        )),
        Some('c') => s3(input),
        Some(c) => Err(format!("Unexpected input: S = b F . {c}")),
    }
}

/// S = b F c .
fn s3(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = b F c . {c}")),
    }
}

/// S = b E . d
fn s62(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = b E . d",
        )),
        Some('d') => s6(input),
        Some(c) => Err(format!("Unexpected input: S = b E . {c}")),
    }
}

/// S = b E d .
fn s6(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = b E d . {c}")),
    }
}

/// Box4
/// Itemset:
/// S = b e . e a
/// E = e . e (d)
/// F = e . e (c)
fn box4(input: &mut Iter) -> Result<(usize, Sort), String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => box6(input).map(decr),
        Some(c) => Err(format!(
            "Unexpected input: S = b e . {c}; E = e . {c} ; F = e . {c}"
        )),
    }
}

/// Helper that counts one return by decrementing the returns-left counter.
fn decr((c, s): (usize, Sort)) -> (usize, Sort) {
    (c - 1, s)
}

Note how in box4 we call the decrement helper function on the result of box6, to count the one return we're going to do immediately after.

/// Box6
/// Itemset:
/// S = b e e . a
/// E = e e . (d)
/// F = e e . (c)
fn box6(input: &mut Iter) -> Result<(usize, Sort), String> {
    match input.peek() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('c') => Ok((2, Sort::F)).map(decr),
        Some('d') => Ok((2, Sort::E)).map(decr),
        Some('a') => {
            input.next();
            s7(input).map(decr)
        }
        Some(c) => Err(format!("Unexpected input: S = b e e . {c} ; E = e e . {c} ; F = e e . {c}")),
    }
}

The number of returns to do is equal to the size of the body of the rule we are reducing. Of course we decrement right away because we are about to return ourselves, hence the map(decr).

/// S = b e e a .
fn s7(input: &mut Iter) -> Result<(usize, Sort), String> {
    match input.next() {
        // Like the other completed S rules, check that we're at the end of the input.
        None => Ok((4, Sort::S)).map(decr),
        Some(c) => Err(format!("Unexpected input: S = b e e a . {c}")),
    }
}
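
To see the counting in action, trace the input beea: box0 consumes b and calls box2, box2 consumes the first e and calls box4, box4 consumes the second e and calls box6, and box6 consumes the a and calls s7. Then s7 checks that the input is done and returns (3, Sort::S) to box6, box6 returns (2, Sort::S) to box4, box4 returns (1, Sort::S) to box2, and box2, seeing there's exactly one return left to go, plainly returns Sort::S to box0, which performs the goto for S and accepts.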

fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    let input = env::args().nth(1).expect("Argument string to parse"); // nth(1): args() starts with the program name
    let input = lex(input);
    let mut input = input.iter().peekable();
    box0(&mut input)
}

In our main function we can call box0 with the input. Since this is LR(1), we only need a peekable iterator, which can look ahead 1 terminal.
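
For example, assuming the compiled binary is called parser: ./parser aeec accepts via rules 1 and 4, ./parser aeed via rules 2 and 5, and ./parser beea via rule 7; anything else comes back with one of the error messages above.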

Table size = Code size

With both recursive descent and recursive ascent parsing, we’re representing the parsing logic directly in code, not as an explicit data representation of a parse table. As such, if you have a larger parse table, you get more code. In LR, when LALR doesn’t suffice, parse tables can potentially grow quite large, as we saw to a limited extent with the last example.

Recursive Ascent-Descent Parsing

Have you noticed that in the recursive ascent code there are some pretty boring and tiny looking functions? I’m talking about s12, s1, s22, s2, s32, s3, s62, s6. These will likely be targeted by the inliner of the Rust compiler6, but aren’t they a bit much to even generate?

The common denominator of these functions, and the states of the LR automaton they correspond to, is that they have already honed in on a single rule from the grammar and are only parsing that. Kind of like in an LL parser, except we used the LR automaton mechanism to select the rule instead of an LL look-ahead. If we follow that idea to its logical conclusion, we can do LL parsing from any point where we know there’s only one rule left (or equivalently, inline those simple functions). This means we only have box functions left:

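/// Box1
/// Itemset:
/// S = a . E c
/// S = a . F d
/// E = . e e
/// F = . e e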
fn box1(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => match box3(input)? {
            Sort::E => {
                consume(input, 'c')?;
                Ok(Sort::S)
            }
            Sort::F => {
                consume(input, 'd')?;
                Ok(Sort::S)
            }
            s => unreachable!("Unexpected sort: S = a {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = a . {c}")),
    }
}

This is using the consume function from the recursive descent parser example from before.
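
For reference, here's a sketch of what consume looks like, assuming it simply checks that the next terminal is the expected one (the actual definition is back in the recursive descent section):

fn consume(input: &mut Iter, expected: Terminal) -> Result<(), String> {
    match input.next() {
        Some(&c) if c == expected => Ok(()),
        Some(c) => Err(format!("Unexpected input: {c}, expected {expected}")),
        None => Err(format!("Unexpected end of input, expected {expected}")),
    }
}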

/// Box6
/// Itemset:
/// S = b e e . a
/// E = e e . (d)
/// F = e e . (c)
fn box6(input: &mut Iter) -> Result<(usize, Sort), String> {
    match input.peek() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('c') => Ok((2, Sort::F)).map(decr),
        Some('d') => Ok((2, Sort::E)).map(decr),
        Some('a') => {
            input.next(); // consume 'a'
            match input.next() {
                // S = b e e a . — like the other completed S rules, check
                // that we're at the end of the input
                None => Ok((3, Sort::S)).map(decr),
                Some(c) => Err(format!("Unexpected input: S = b e e a . {c}")),
            }
        }
        Some(c) => Err(format!("Unexpected input: S = b e e . {c} ; E = e e . {c} ; F = e e . {c}")),
    }
}

Note that in box 6 we now count the number of symbols in the body of the rule before the dot to come up with the number of returns.

Left Corners?

The left corner of a rule in the grammar is the left-most symbol in the body of the rule, plus (recursively) the left corners of that symbol's rules if it is a sort. So it's basically a First set with the sorts included. For a hypothetical rule S = E d in the grammar above, the left corner would be E and, through rule 4, e. I found this in some of the older literature, and figured I'd add a note for myself in here.

There is/was such a thing as left-corner parsing, seemingly mostly used in natural language processing (NLP). NLP mostly uses ambiguous context-free grammars, and typically uses (used?) a backtracking parser to deal with that. Those can be slow of course. And it turns out left corners helped with this, by adding some “filtering” that allows the parser to backtrack less. This is connected to recursive ascent-descent parsing, which you could also see as filtering with LR before finishing the parse with LL. In our case we just don't do any backtracking.

Fin

I really need to stop working on this blog post and publish it already. It's been over a year since I started working on it (on and off, during holidays when I had enough focus time)7. I already had an idea of where to go next (generalised parsers), but now I also want to study minimal LR(1) automaton/parse table algorithms, and look at continuation passing style again, because I think you can pass the left context as a function argument. That would give you an LALR automaton structure with LR parsing power. Is that a good idea? Don't know, needs testing (or reading papers/blog posts; probably someone else has already tried this before).

I usually have a pithy remark or sneak the Kaomoji into the footnotes, but I must be out of practice, because I can’t think of a good way to do that…

Ehh, whatever ¯\_(ツ)_/¯

Footnotes

  1. I’m fairly sure my prose description there is the same as a formal definition, and it feel a bit nicer to think about than the ones you can find on Wikipedia

  2. Technically you’d need to see A1 A₁ and A2 A₂ as separate symbols and duplicate the rules for A A , resulting in a larger grammar in correspondence with the larger table. But I couldn’t be bothered, and the parse table as shown works just as well. This is relevant to the code size of a recursive descent parser too, since you can just reuse the code for rules 2 and 3 instead of having duplicate code for the two extra rules. What’s a recursive descent parser? That comes just a little later in the post, so keep reading ;) 

  3. Yes, there are non-deterministic context-free languages. Those are the context-free languages that can only be parsed with a non-deterministic PDA. Since this post is about deterministic parsers, we’ll ignore the non-deterministic languages. 

  4. While I find the Wikipedia article on LLR confusing, and it makes a good case for why it’s not really used, I’m still somewhat intrigued. This is one of those things that will stay on my reading list for a while I think, something I still want to understand further… 

  5. Indirect left recursion is even worse in LL. At least the direct version can still be dealt with by an automatic grammar rewrite algorithm. That’s more or less what the node-reparenting trick mentioned at the end of the LL section does. Similarly, there are automatic grammar rewrites for direct right-recursion for LR, and indirect right recursion can be more problematic… 

  6. Actually, I checked in Compiler Explorer how this turns out, and while s7 is inlined and compiled away entirely, adapting box1 to consume directly will make the assembly at opt-level=3 smaller. Adding an #[inline] hint on consume helps as well. Though I may just be seeing the effect of uniform error messages through consume. Actually following and understanding the optimised assembly code is a pain, so I just clicked around a bit to verify that the example code is reduced to a state machine with jumps and labels instead of using function call instructions. So that’s neat, basically what I was hoping for :) 

  7. I hope you appreciate how much time it took to find example grammars to steal (or occasionally develop myself) and especially how much time it took to get GraphViz to output somewhat decent automata of those examples!