Pratt Parsing: Introduction and Implementation in TypeScript
Introduction
When hand-writing recursive descent parsers, one often encounters the issue of left recursion. The conventional approach is to rewrite the grammar to eliminate left recursion, but this requires introducing additional grammar rules, making the code less maintainable. Additionally, operators of different precedence and associativity pose further challenges. These reasons make the implementation of traditional recursive descent parsers more difficult.
In this article, we will introduce the Pratt Parsing algorithm proposed by Vaughan Pratt, which can address these problems.
Basic Concepts
First, let’s introduce some basic concepts in parsing.
-
Infix Operators: For example, the positive and negative signs
+
-
. The characteristic of infix operators is that they appear between operands. -
Prefix Operators: For example, addition, subtraction, multiplication, and division
+
-
*
/
. The characteristic of prefix operators is that they appear before operands. -
Postfix Operators: For example, factorial
!
. The characteristic of postfix operators is that they appear after operands.
In this article, we define:
1type InfixOp = '+' | '-' | '*' | '/' | '^'
2type PrefixOp = '-' | '+'
3type PostfixOp = '!'
AST (Abstract Syntax Tree) is a tree-like structure used to represent the syntactic structure of a program. For an expression, we define the following AST structure:
1class ExprNode {
2 toString(): string {
3 throw new Error('not implemented')
4 }
5}
6
7class ValueNode extends ExprNode {
8 constructor(public val: number) {
9 super()
10 }
11 toString() {
12 return this.val.toString()
13 }
14}
15
16class PrefixOpNode extends ExprNode {
17 constructor(public op: PrefixOp, public rhs: ExprNode) {
18 super()
19 }
20 toString() {
21 return `(${this.op}${this.rhs})`
22 }
23}
24class InfixOpNode extends ExprNode {
25 constructor(public op: InfixOp, public lhs: ExprNode, public rhs: ExprNode) {
26 super()
27 }
28 toString() {
29 return `(${this.lhs}${this.op}${this.rhs})`
30 }
31}
32class PostfixOpNode extends ExprNode {
33 constructor(public op: PostfixOp, public lhs: ExprNode) {
34 super()
35 }
36 toString() {
37 return `(${this.lhs}${this.op})`
38 }
39}
The tokenizer is responsible for splitting the input string into a sequence of tokens. It typically provides an interface in the form of an iterator. Tokenization is relatively simple, so we won’t go into detail here. Let’s assume that we already have a stream of tokens.
In this example, we define the following token types:
1
2enum TokenType { None, Num, Add, Sub, Mul, Div, Fac, Pow, LPa, RPa, Eoi }
3
4class Token {
5 constructor(public type: TokenType = TokenType.None, public value: string = '') { }
6}
-
TokenType.None
: Empty token used as a placeholder. -
TokenType.Num
: Number. -
TokenType.Add
,TokenType.Sub
,TokenType.Mul
,TokenType.Div
,TokenType.Pow
: Addition, subtraction, multiplication, division, exponentiation. -
TokenType.Fac
: Factorial. -
TokenType.LPa
,TokenType.RPa
: Left and right parentheses. -
TokenType.Eoi
: End of Input, indicating the end of the input.
Defining precedence:
1 precOf(token: string): number {
2 const precs = {
3 '+': 10,
4 '-': 10,
5 '*': 20,
6 '/': 20,
7 '^': 30,
8 '!': 40,
9 } as Record<string, number>
10
11 return precs[token] || 0
12 }
Define an iterator:
1class Iterator<T> {
2 private pos: number
3 constructor(private inner: Array<T> = []) {
4 this.pos = 0
5 }
6 next(): T | undefined {
7 var peek = this.peek()
8 this.pos++
9 return peek
10 }
11 peek(): T | undefined {
12 return this.inner[this.pos] || undefined
13 }
14}
We will wrap a token array with an iterator.
-
The
next()
method returns the current token and moves the iterator pointer to the next token. This operation is also known asconsume
oreat
. -
The
peek()
method returns the current token without changing the iterator pointer.
Our token stream is defined as follows:
1let tokens = new Iterator<Token>([
2 // - 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
3 new Token(TokenType.Sub, '-'),
4 new Token(TokenType.Num, '1'),
5 new Token(TokenType.Add, '+'),
6 new Token(TokenType.LPa, '('),
7 new Token(TokenType.Num, '2'),
8 new Token(TokenType.Sub, '-'),
9 new Token(TokenType.Num, '3'),
10 new Token(TokenType.RPa, ')'),
11 new Token(TokenType.Mul, '*'),
12 new Token(TokenType.Num, '6'),
13 new Token(TokenType.Div, '/'),
14 new Token(TokenType.Num, '3'),
15 new Token(TokenType.Fac, '!'),
16 new Token(TokenType.Sub, '-'),
17 new Token(TokenType.Num, '2'),
18 new Token(TokenType.Pow, '^'),
19 new Token(TokenType.Num, '3'),
20 new Token(TokenType.Pow, '^'),
21 new Token(TokenType.Num, '4'),
22 new Token(TokenType.Eoi),
23])
Pratt Parser
Pratt Parser is a top-down parser that works based on the following principles:
-
Divide Tokens into two categories: prefix operators and infix operators. Assume that prefix operators have the highest precedence, followed by infix operators in descending order. Note: For postfix operators, we treat them as a special form of infix operators. (The specific reason will be explained later.) Therefore, we can categorize them into three classes without any problem.
-
Each Token is associated with a parsing function, which is responsible for parsing expressions starting with that Token.
-
For example, for Num, we parse and obtain a ValueNode.
-
For example, for Add, we parse and obtain an InfixOpNode.
Since we divide Tokens into three classes, each class has its corresponding parser, specifically:
-
PrefixParser: Parser for prefix Tokens.
-
InfixParser: Parser for infix Tokens.
-
PostfixParser: Parser for postfix Tokens. (Can be seen as a special form of InfixParser)
-
-
Each Token is associated with a precedence, which determines the parsing order. In the Parse function, we actually have two precedence levels: the context precedence (which may not be the precedence of the current Token) derived from the arguments of the Parse function, and the precedence of the next Token obtained through peeking.
-
We parse the prefix expression first. Here comes the key part. If the precedence of the next Token is greater than the context precedence, we consider it as part of the infix/postfix expression and continue parsing. Otherwise, we return the prefix expression.
Let’s take an example to understand the core code below:
1 parse(prec: number = 0): ExprNode {
2 let token = this.tokens.next()!
3 let prefixParser: PrefixFn = this.parsers.prefix[token.type]
4 if (!prefixParser) {
5 throw new Error(`Unexpected prefix token ${token.type}`)
6 }
7 let lhs: ExprNode = prefixParser(token)
8 let precRight = this.precOf(this.tokens.peek()!.value)
9
10
11 while (prec < precRight) {
12 token = this.tokens.next()!
13 let infixParser: InfixFn | PostfixFn = this.parsers.infix[token.type] || this.parsers.postfix[token.type]
14 if (!infixParser) {
15 throw new Error(`Unexpected infix or postfix token ${token.value}`)
16 }
17 lhs = infixParser(lhs, token)
18 precRight = this.precOf(this.tokens.peek()!.value)
19 }
20
21 return lhs
22 }
Initial state,
-
The current context precedence is 0, and the following token stream is
+ 1 - 2 * 3
. Using the prefix+
parser, we first get(+1)
. Then, since the precedence of-
is the same, we directly return the LHS, which is+ 1
as the AST node. -
At this point, the context precedence is 0, and the remaining token stream is
- 2 * 3
. We see that the precedence of-
is 10, which is higher than 0. Therefore, we consume-
and call the infix-
parser.-
Now, the context precedence is 10, and the remaining token stream is
2 * 3
. We first treat 2 as a prefix and call its prefix parser, which gives us aValueNode(2)
.- Then, we peek at
*
and find that the precedence of*
is 20, which is higher than 10. So we consume*
and call the infix*
parser.-
The infix parser calls
Parse
.- Now, the context precedence is 20, and the remaining token stream is
3
. We consume it as a prefix. Then, we find that the precedence ofEoi
is 0, which is lower than 20. So we directly return the AST node for3
.
- Now, the context precedence is 20, and the remaining token stream is
-
The infix parser obtains 3 as the RHS. Combining it with the LHS, which is 2 obtained as the actual argument, we get an InfixOpNode(
2 * 3
).
-
- Then, we peek at
-
Then we return from the infix parser for
-
, and with LHS=(+1) and RHS=(2 * 3), we get an InfixOpNode((+1) - (2 * 3)
).
-
-
Finally, since there are no more tokens, we return directly.
Our Prefix/Infix/PostfixParser definitions are as follows:
1 public parsers = {
2 prefix: {
3 [TokenType.Num]: (token: Token) => {
4 return new ValueNode(parseInt(token.value))
5 },
6 [TokenType.Add]: (token: Token) => {
7 return new PrefixOpNode('+', this.parse(this.precOf('+')))
8 },
9 [TokenType.Sub]: (token: Token) => {
10 return new PrefixOpNode('-', this.parse(this.precOf('-')))
11 },
12 [TokenType.LPa]: (token: Token) => {
13 const expr = this.parse(0)
14 const next = this.tokens.next()!
15 if (next.type !== TokenType.RPa) {
16 throw new Error('Expected )')
17 }
18 return expr
19 },
20 } as { [k: number]: PrefixFn },
21 infix: {
22 [TokenType.Add]: (lhs: ExprNode, token: Token) => {
23 return new InfixOpNode('+', lhs, this.parse(this.precOf('+')))
24 },
25 [TokenType.Sub]: (lhs: ExprNode, token: Token) => {
26 return new InfixOpNode('-', lhs, this.parse(this.precOf('-')))
27 },
28 [TokenType.Mul]: (lhs: ExprNode, token: Token) => {
29 return new InfixOpNode('*', lhs, this.parse(this.precOf('*')))
30 },
31 [TokenType.Div]: (lhs: ExprNode, token: Token) => {
32 return new InfixOpNode('/', lhs, this.parse(this.precOf('/')))
33 },
34 [TokenType.Pow]: (lhs: ExprNode, token: Token) => {
35 return new InfixOpNode('^', lhs, this.parse(this.precOf('^') - 1)) // Right associative
36 },
37 } as { [k: number]: InfixFn },
38 postfix: {
39 [TokenType.Fac]: (lhs: ExprNode, token: Token) => {
40 return new PostfixOpNode('!', lhs)
41 }
42 } as { [k: number]: PostfixFn }
43 }
Step-by-Step Analysis
The above example was just for quick understanding. In reality, we will encounter other problems as well:
-
Cases with suffixes
-
Cases with equal precedence
-
Cases with left associativity and right associativity
So let’s provide a more comprehensive example.
1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
First, we read the Token -
, which is a prefix operator. Therefore, we find its PrefixParser.
1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2^
1[TokenType.Sub]: (token: Token) => {
2 return new PrefixOpNode('-', this.parse(this.precOf('-')))
3},
We continue parsing based on the precedence of -
. That is, this.parse(this.precOf('-'))
.
So we read the Token 1
, which is a number. Therefore, we find its PrefixParser.
1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2 ^
1[TokenType.Num]: (token: Token) => {
2 return new ValueNode(parseInt(token.value))
3},
We get a ValueNode(1)
, and then continue parsing. We peek at the next Token, which is +
. Since the precedence of -
is higher than +
, we continue parsing.
1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2 ^
Then we encounter Token (
on the right. It has no precedence (or we can say its precedence is 0). So we don’t enter the while loop, and instead return. After returning two levels, we get PrefixOpNode('-', ValueNode(1))
.
At this point, the current precedence is 0, and the current Token is -
. We can peek at Token +
. Since the precedence of +
is higher than 0, we enter the while loop. Then we get the InfixParser for +
.
1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2^ ^
3 peek
Therefore, we continue parsing based on the precedence of +
. That is, this.parse(this.precOf('+'))
.
At this point, the prefix we read outside the while loop is (
. So we use the PrefixParser for (
.
1[TokenType.LPa]: (token: Token) => {
2 const expr = this.parse(0)
3 const next = this.tokens.next()!
4 if (next.type !== TokenType.RPa) {
5 throw new Error('Expected )')
6 }
7 return expr
8},
In the (
, we actually treat the expression inside the parentheses as a completely new expression. As you can see, the precedence is reset to zero (line 2 of the code above).
We will parse the internal expression and get the sub-expression InfixOpNode('-', ValueNode(2), ValueNode(3))
.
Now, how is the right parenthesis handled? In the parse
function, since the precedence of the right parenthesis is 0, it will be directly returned. Therefore, in the PrefixParser for LPa, we will consume the right parenthesis. And then continue parsing.
Thus, the precedence is restored to the precedence of +
as the function returns. And we continue parsing.
1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2 ^
Since the precedence of *
is higher than +
, we will call the InfixParser for *
. And thus, we parse the right sub-expression 6 / 3!
.
It is worth mentioning that the InfixParser and PostfixParser are actually equivalent. Because Postfix is just a special case of Infix. You can compare the following two sub-parsers:
1[TokenType.Mul]: (lhs: ExprNode, token: Token) => {
2 return new InfixOpNode('*', lhs, this.parse(this.precOf('*')))
3},
4[TokenType.Fac]: (lhs: ExprNode, token: Token) => {
5 return new PostfixOpNode('!', lhs)
6}
It can be observed that their implementations are exactly the same. The only difference is that the InfixParser will continue parsing the right subexpression, while the PostfixParser will not.
Lastly, let’s take a look at how associativity is handled. Since operators like +
and *
are left-associative, they don’t require any special handling. However, the ^
operator is right-associative. For example, in the expression 2 ^ 3 ^ 4
, it should be parsed as 2 ^ (3 ^ 4)
. Therefore, we need to handle this special case in the InfixParser.
1[TokenType.Pow]: (lhs: ExprNode, token: Token) => {
2 return new InfixOpNode('^', lhs, this.parse(this.precOf('^') - 1)) // Right associativity
3},
Why do we subtract one from the precedence? This is done to make the left ^
operator have a higher precedence, so that the right 3 ^ 4
will be parsed as a subexpression.
In the end, we obtain the following AST:
1InfixOpNode {
2 op: '-',
3 lhs: InfixOpNode {
4 op: '+',
5 lhs: PrefixOpNode { op: '-', rhs: ValueNode { val: 1 } },
6 rhs: InfixOpNode {
7 op: '/',
8 lhs: InfixOpNode {
9 op: '*',
10 lhs: InfixOpNode {
11 op: '-',
12 lhs: ValueNode { val: 2 },
13 rhs: ValueNode { val: 3 }
14 },
15 rhs: ValueNode { val: 6 }
16 },
17 rhs: PostfixOpNode { op: '!', lhs: ValueNode { val: 3 } }
18 }
19 },
20 rhs: InfixOpNode {
21 op: '^',
22 lhs: ValueNode { val: 2 },
23 rhs: InfixOpNode {
24 op: '^',
25 lhs: ValueNode { val: 3 },
26 rhs: ValueNode { val: 4 }
27 }
28 }
29}
The corresponding parenthesized expression is:
(((-1) + (((2 - 3) * 6) / (3!))) - (2 ^ (3 ^ 4)))
Full Code
Here is the complete code:
1/**
2 * Author: Zijing Zhang (Pluveto)
3 * Date: 2023-01-11 15:28:00
4 * Description: A Pratt Parser implemented in TypeScript.
5 */
6const util = require('util')
7
8type InfixOp = '+' | '-' | '*' | '/' | '^'
9type PrefixOp = '-' | '+'
10type PostfixOp = '!'
11
12class ExprNode {
13 toString(): string {
14 throw new Error('not implemented')
15 }
16}
17
18class ValueNode extends ExprNode {
19 constructor(public val: number) {
20 super()
21 }
22 toString() {
23 return this.val.toString()
24 }
25}
26
27class PrefixOpNode extends ExprNode {
28 constructor(public op: PrefixOp, public rhs: ExprNode) {
29 super()
30 }
31 toString() {
32 return `(${this.op}${this.rhs})`
33 }
34}
35class InfixOpNode extends ExprNode {
36 constructor(public op: InfixOp, public lhs: ExprNode, public rhs: ExprNode) {
37 super()
38 }
39 toString() {
40 return `(${this.lhs}${this.op}${this.rhs})`
41 }
42}
43class PostfixOpNode extends ExprNode {
44 constructor(public op: PostfixOp, public lhs: ExprNode) {
45 super()
46 }
47 toString() {
48 return `(${this.lhs}${this.op})`
49 }
50}
51
52class Iterator<T> {
53 private pos: number
54 constructor(private inner: Array<T> = []) {
55 this.pos = 0
56 }
57 next(): T | undefined {
58 var peek = this.peek()
59 this.pos++
60 return peek
61 }
62 peek(): T | undefined {
63 return this.inner[this.pos] || undefined
64 }
65}
66
67
68enum TokenType { None, Num, Add, Sub, Mul, Div, Fac, Pow, LPa, RPa, Eoi }
69
70class Token {
71 constructor(public type: TokenType = TokenType.None, public value: string = '') { }
72}
73
74type PrefixFn = (token: Token) => ExprNode
75type InfixFn = (lhs: ExprNode, token: Token) => ExprNode
76type PostfixFn = (lhs: ExprNode, token: Token) => ExprNode
77
78class Parser {
79 constructor(public tokens: Iterator<Token>) { }
80 public parsers = {
81 prefix: {
82 [TokenType.Num]: (token: Token) => {
83 return new ValueNode(parseInt(token.value))
84 },
85 [TokenType.Add]: (token: Token) => {
86 return new PrefixOpNode('+', this.parse(this.precOf('+')))
87 },
88 [TokenType.Sub]: (token: Token) => {
89 return new PrefixOpNode('-', this.parse(this.precOf('-')))
90 },
91 [TokenType.LPa]: (token: Token) => {
92 const expr = this.parse(0)
93 const next = this.tokens.next()!
94 if (next.type !== TokenType.RPa) {
95 throw new Error('Expected )')
96 }
97 return expr
98 },
99 } as { [k: number]: PrefixFn },
100 infix: {
101 [TokenType.Add]: (lhs: ExprNode, token: Token) => {
102 return new InfixOpNode('+', lhs, this.parse(this.precOf('+')))
103 },
104 [TokenType.Sub]: (lhs: ExprNode, token: Token) => {
105 return new InfixOpNode('-', lhs, this.parse(this.precOf('-')))
106 },
107 [TokenType.Mul]: (lhs: ExprNode, token: Token) => {
108 return new InfixOpNode('*', lhs, this.parse(this.precOf('*')))
109 },
110 [TokenType.Div]: (lhs: ExprNode, token: Token) => {
111 return new InfixOpNode('/', lhs, this.parse(this.precOf('/')))
112 },
113 [TokenType.Pow]: (lhs: ExprNode, token: Token) => {
114 return new InfixOpNode('^', lhs, this.parse(this.precOf('^') - 1)) // Right associative
115 },
116 } as { [k: number]: InfixFn },
117 postfix: {
118 [TokenType.Fac]: (lhs: ExprNode, token: Token) => {
119 return new PostfixOpNode('!', lhs)
120 }
121 } as { [k: number]: PostfixFn }
122 }
123 precOf(token: string): number {
124 const precs = {
125 '+': 10,
126 '-': 10,
127 '*': 20,
128 '/': 20,
129 '^': 30,
130 '!': 40,
131 } as Record<string, number>
132
133 return precs[token] || 0
134 }
135 parse(prec: number = 0): ExprNode {
136 let token = this.tokens.next()!
137 let prefixParser: PrefixFn = this.parsers.prefix[token.type]
138 if (!prefixParser) {
139 throw new Error(`Unexpected prefix token ${token.type}`)
140 }
141 let lhs: ExprNode = prefixParser(token)
142 let precRight = this.precOf(this.tokens.peek()!.value)
143
144
145 while (prec < precRight) {
146 token = this.tokens.next()!
147 let infixParser: InfixFn | PostfixFn = this.parsers.infix[token.type] || this.parsers.postfix[token.type]
148 if (!infixParser) {
149 throw new Error(`Unexpected infix or postfix token ${token.value}`)
150 }
151 lhs = infixParser(lhs, token)
152 precRight = this.precOf(this.tokens.peek()!.value)
153 }
154
155 return lhs
156 }
157}
158
159
160let tokens = new Iterator<Token>([
161 // - 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
162 new Token(TokenType.Sub, '-'),
163 new Token(TokenType.Num, '1'),
164 new Token(TokenType.Add, '+'),
165 new Token(TokenType.LPa, '('),
166 new Token(TokenType.Num, '2'),
167 new Token(TokenType.Sub, '-'),
168 new Token(TokenType.Num, '3'),
169 new Token(TokenType.RPa, ')'),
170 new Token(TokenType.Mul, '*'),
171 new Token(TokenType.Num, '6'),
172 new Token(TokenType.Div, '/'),
173 new Token(TokenType.Num, '3'),
174 new Token(TokenType.Fac, '!'),
175 new Token(TokenType.Sub, '-'),
176 new Token(TokenType.Num, '2'),
177 new Token(TokenType.Pow, '^'),
178 new Token(TokenType.Num, '3'),
179 new Token(TokenType.Pow, '^'),
180 new Token(TokenType.Num, '4'),
181 new Token(TokenType.Eoi),
182])
183
184let parser = new Parser(tokens)
185let ast = parser.parse()
186
187console.log(util.inspect(ast, { showHidden: false, depth: null, colors: true }))
188console.log("Equivalent expression: ", ast.toString());