Algorithms for Compiler Design: THE PREDICTIVE TOP-DOWN PARSER

7/24/2010 7:52:26 PM

4.3 THE PREDICTIVE TOP-DOWN PARSER

A backtracking parser is a non-deterministic recognizer of the language generated by the grammar. The backtracking problems in the top-down parser can be solved; that is, a top-down parser can function as a deterministic recognizer if it is capable of predicting or detecting which alternatives are right choices for the expansion of nonterminals (that derive to more than one alternative) during the parsing of input string w. By carefully writing a grammar, eliminating left recursion, and left-factoring the result, we obtain a grammar that can be parsed by a top-down parser. This grammar will be able to predict the right alternative for the expansion of a nonterminal during the parsing process; and hence, it need not backtrack.

If A → α₁ | α₂ | … | α_n are the A-productions in the grammar, then a top-down parser can decide if a nonterminal A is to be expanded or not. And if it is to be expanded, the parser decides which A-production should be used. It looks at the next input symbol and finds out which of the α_i derivatives to a string that start with the terminal symbol comes next in the input. If none of the α_i derives to a string starting with a terminal symbol, the parser reports the failure; otherwise, it carries out the derivation of A using a production A → α_i, where α_i derives to a string whose first terminal symbol is the symbol coming next in the input. Therefore, we conclude that if the set of first-terminal symbols of the strings derivable from α_i is computed for each α_i, and this set is made available to the parser, then the parser can predict the right choice for the expansion of nonterminal A. This information can be easily computed using the productions of the grammar. We define a function FIRST(α), where α is in (V ∪ T)*, as follows:

FIRST(α) = Set of those terminals with which the strings derivable from α start

If α = XYZ, then FIRST(α) is computed as follows:

FIRST(α) = FIRST(XYZ) = { X } if X is terminal.

Otherwise,

FIRST(α) = FIRST(XYZ) = FIRST(X) if X does not derive to an empty string; that is, if

FIRST(X) does not contain ∈.

If FIRST(X) contains ∈, then

FIRST(α) = FIRST(XYZ) = FIRST(X) − { ∈ } ∪ FIRST(YZ)

FIRST(YZ) is computed in an identical manner:

FIRST(YZ) = { Y } if Y is terminal.

Otherwise,

FIRST(YZ) = FIRST(Y) if Y does not derive to an empty string (i.e., if FIRST(Y) does not contain ∈). If FIRST(Y) contains ∈, then

FIRST(YZ) = FIRST(Y) − { ∈ } ∪ FIRST(Z)

For example, consider the grammar:

FIRST(S) = FIRST(ACB) ∪ FIRST(CbB) ∪

FIRST(A) = FIRST(da) ∪ FIRST(BC)

FIRST(B) = FIRST(g) ∪ FIRST(∈ )

FIRST(C) = FIRST(h) ∪ FIRST(∈)

Therefore:

FIRST(BC) = FIRST(B) − { ∈ } ∪ FIRST(C)

Substituting in (II) we get:

FIRST(A)={ d } ∪ { g, h, ∈ }

FIRST(ACB) =FIRST(A) − { ∈ } ∪ FIRST(CB)

FIRST(CB) =FIRST(C) − { ∈ } ∪ FIRST(B)

Therefore, substituting in (III) we get:

FIRST(ACB)={ d, g, h, ∈ } ∪ { g, h, ∈ }

Similarly,

FIRST(CbB) =FIRST(C) − { ∈ } ∪ FIRST(bB)

Similarly,

FIRST(Ba) =FIRST(B) − { ∈ } ∪ FIRST(a)

Therefore, substituting in (I), we get:

FIRST(S)={ d, g, h, ∈ } ∪ { b, h, ∈ } ∪ { a, g, ∈ }

EXAMPLE 1

Consider the following grammar:

FIRST(aAb)= { a }

FIRST(cd)= { c }, and

FIRST(ef)= { e }

Hence, while deriving S, the parser looks at the next input symbol. And if it happens to be the terminal a, then the parser derives S using S → aAb. Otherwise, the parser reports an error. Similarly, when expanding A, the parser looks at the next input symbol; if it happens to be the terminal c, then the parser derives A using A → cd. If the next terminal input symbol happens to be e, then the parser derives A using A → ef. Otherwise, an error is reported.

Therefore, we conclude that if the right-hand FIRST for the production S → aAb is computed, we can decide when the parser should do the derivation using the production S → aAb. Similarly, if the right-hand FIRST for the productions A → cd and A → ef are computed, then we can decide when derivation is to be done using A → cd and A → ef, respectively. These decisions can be encoded in the form of table, as shown in Table 1, and can be made available to the parser for the correct selection of productions for derivations during parsing.

Table 1: Production Selections for Parsing Derivations
�	a	b	c	d	e	f	$
S	S → aAb	�	�	�	�	�	�
A	�	�	A → cd	�	A → ef	�	�

The number of rows of the table are equal to the number of nonterminals, whereas the number of columns are equal to the number of terminals, including the end marker. The parser uses of the nonterminal to be derived as the row index of the table, and the next input symbol is used as the column index when the parser decides which production will be derived. Here, the production S → aAb is added in the table at [S, a] because FIRST(aAb) contains a terminal a. Hence, S must be derived using S → aAb if and only if the terminal symbol coming next in the input is a. Similarly, the production A → cd is added at [A, c], because FIRST(cd) contain c. Hence, A must be derived using A → cd if and only if the terminal symbol coming next in the input is c. Finally, A must be derived using A → ef if and only if the terminal symbol coming next in the input is e. Hence, the production A → ef is added at [A, e]. Therefore, we conclude that the table can be constructed as follows:

for every production A → α do
  for every a in FIRST(α) do
    TABLE[A, a] = A → α

Using the above method, every production of the grammar gets added into the table at the proper place when the grammar is ∈-free. But when the grammar is not ∈-free, ∈-productions will not get added to the table. If there is an ∈-production A → ∈ in the grammar, then deciding when A is to be derived to ∈ is not possible using the production's right-hand FIRST. Some additional information is required to decide where the production A → ∈ is to be added to the table.

Tip�

The derivation by A → ∈ is a right choice when the parser is on the verge of expanding the nonterminal A and the next input symbol happens to be a terminal, which can occur immediately following A in any string occurring on the right side of the production. This will lead to the expansion of A to ∈, and the next leaf in the parse tree will be considered, which is labeled by the symbol immediately following A and, therefore, may match the next input symbol.

Therefore, we conclude that the production A → ∈ is to be added in the table at [A, b] for every b that immediately follows A in any of the production grammar's right-hand strings. To compute the set of all such terminals, we make use of the function FOLLOW(A), where A is a nonterminal, as defined below:

FOLLOW(A) = Set of terminals that immediately follow A in any string occurring on the right side of productions of the grammar

For example, if A → αBβ is a production, then FOLLOW(B) can be computed using A → αBβ, as shown below:

FOLLOW(B) = FIRST(β) if FIRST(β) does not contain ∈.

Therefore, we conclude that when the grammar is not ∈-free, then the table can be constructed as follows:

Compute FIRST and FOLLOW for every nonterminal of the grammar.

For every production A → α, do:

{
 for every non-∈ member a in FIRST(α) do
 TABLE[A, a] = A → α
 If FIRST(α) contain ∈ then
 For every b in FOLLOW(A) do
 TABLE[A, b] = A → α
}

Therefore, we conclude that if the table is constructed using the above algorithm, a top-down parser can be constructed that will be a nonbacktracking, or ‘predictive’ parser.

1 Implementation of a Table-Driven Predictive Parser

A table-driven parser can be implemented using an input buffer, a stack, and a parsing table. The input buffer is used to hold the string to be parsed. The string is followed by a "$" symbol that is used as a right-end maker to indicate the end of the input string. The stack is used to hold the sequence of grammar symbols. A "$" indicates bottom of the stack. Initially, the stack has the start symbol of a grammar above the $. It is a two-dimensional array TABLE[A, a], where A is a nonterminal and a is a terminal, or $ symbol. The parser is controlled by a program that behaves as follows:

The program considers X, the symbol on the top of the stack, and the next input symbol a.
If X = a = $, then parser announces the successful completion of the parsing and halts.
If X = a ≠ $, then the parser pops the X off the stack and advances the input pointer to the next input symbol.
If X is a nonterminal, then the program consults the parsing table entry TABLE[x, a]. If TABLE[x, a] = x → UVW, then the parser replaces X on the top of the stack by UVW in such a manner that U will come on the top. If TABLE[x, a] = error, then the parser calls the error-recovery routine.

For example consider the following grammar:

FIRST(S) = FIRST(aABb) = { a }

FIRST(A) = FIRST(c) ∪ FIRST(∈) = { c, ∈ }

FIRST(B) = FIRST(d) ∪ FIRST(∈) = { d, ∈ }

Since the right-end marker $ is used to mark the bottom of the stack, $ will initially be immediately below S (the start symbol) on the stack; and hence, $ will be in the FOLLOW(S). Therefore:

Using S → aABb, we get:

Therefore, the parsing table is as shown in Table 2.

Table 2: Production Selections for Parsing Derivations
�	a	b	c	d	$
S	S → aABb	�	�	�	�
A	�	A → ∈	A → c	A → ∈	�
B	�	B → ∈	�	B → d	�

Consider an input string acdb. The various steps in the parsing of this string, in terms of the contents of the stack and unspent input, are shown in Table 3.

Table 3: Steps Involved in Parsing the String *acdb*
Stack Contents	Unspent Input	Moves

$S	acdb$	Derivation using S → aABb
$bBAa	acdb$	Popping a off the stack and advancing one position in the input
$bBA	cdb$	Derivation using A → c
$bBc	cdb$	Popping c off the stack and advancing one position in the input
$bB	db$	Derivation using B → d
$bd	db$	Popping d off the stack and advancing one position in the input
$b	b$	Popping b off the stack and advancing one position in the input
$	$	Announce successful completion of the parsing

Similarly, for the input string ab, the various steps in the parsing of the string, in terms of the contents of the stack and unspent input, are shown in Table 4.

Table 4: Production Selections for String ab Parsing Derivations
Stack Contents	Unspent Input	Moves

$S	ab$	Derivation using S → aABb
$bBAa	ab$	Popping a off the stack and advancing one position in the input
$bBA	b$	Derivation using A → ∈
$bB	b$	Derivation using B → ∈
$b	b$	Popping b off the stack and advancing one position in the input
$	$	Announce successful completion of the parsing

For a string adb, the various steps in the parsing of the string, in terms of the contents of the stack and unspent input, are shown in Table 5.

Table 5: Production Selections for Parsing Derivations for the String *adb*
Stack Contents	Unspent Input	Moves

$S	adb$	Derivation using S → aABb
$bBAa	adb$	Popping a off the stack and advancing one position in the input
$bBA	ab$	Calling an error-handling routine

The heart of the table-driven parser is the parsing table-the parser looks at the parsing table to decide which alternative is a right choice for the expansion of a nonterminal during the parsing of the input string. Hence, constructing a table-driven predictive parser can be considered as equivalent to constructing the parsing table.

A parsing table for any grammar can be obtained by the application of the above algorithm; but for some grammars, some of the entries in the parsing table may end up being multiple defined entries. Whereas for certain grammars, all of the entries in the parsing table are singly defined entries. If the parsing table contains multiple entries, then the parser is still non-deterministic. The parser will be a deterministic recognizer if and only if there are no multiple entries in the parsing table. All such grammars (i.e., those grammars that, after applying the algorithm above, contain no multiple entries in the parsing table) constitute a subset of CFGs called "LL(1)" grammars. Therefore, a given grammar is LL(1) if its parsing table, constructed by algorithm above, contains no multiple entries. If the table contains multiple entries, then the grammar is not LL(1).

In the acronym LL(1), the first L stands for the left-to-right scan of the input, the second L stands for the left-most derivation, and the (1) indicates that the next input symbol is used to decide the next parsing process (i.e., length of the lookahead is "1").

In the LL(1) parsing system, parsing is done by scanning the input from left to right, and an attempt is made to derive the input string in a left-most order. The next input symbol is used to decide what is to be done next in the parsing process. The predictive parser discussed above, therefore, is a LL(1) parser, because it also scans the input from left to right and attempts to obtain the left-most derivation of it; and it also makes use of the next input symbol to decide what is to be done next. And if the parsing table used by the predictive parser does not contain multiple entries, then the parser acts as a recognizer of only the members of L(G); hence, the grammar is LL(1).

Therefore, LL(1) is the grammar for which an LL(1) parser can be constructed, which acts as a deterministic recognizer of L(G). If a grammar is LL(1), then a deterministic top-down table-driven recognizer can be constructed to recognize L(G). A parsing table constructed for a given grammar G will have multiple entries if the grammar contains multiple productions that derive the same nonterminal-that is, the grammar contains the productions A → α | β, and both α and β derive to a string that starts with the same terminal symbol. Therefore, one of the basic requirements for a grammar to be considered LL(1) is when the grammar contains multiple productions that derive the same nonterminal, such as:

for every pair of productions A → α | β

FIRST(α) ∩ FIRST(β) = φ (i.e., FIRST(α) and FIRST(β) should be disjoint sets for every pair of productions A → α | β)

For a grammar to be LL(1), the satisfaction of the condition above is necessary as well sufficient if the grammar is ∈-free. When the grammar is not ∈-free, then the satisfaction of the above condition is necessary but not sufficient, because either FIRST(α) or FIRST(β) might contain ∈, but not both. The above condition will still be satisfied; but if FIRST(β) contains ∈, then production A → β will be added in the table on all terminals in FOLLOW(A). Hence, it also required that FIRST(α) and FOLLOW(A) contain no common symbols. Therefore, an additional condition must be satisfied in order for a grammar to be LL(1). When the grammar is not ∈-free: for every pair of productions A → α | β

if FIRST(β) contains ∈, and FIRST(α) does not contain ∈, then

FIRST(α) ∩ FOLLOW(A) = φ

Therefore, for a grammar to be LL(1), the following conditions must be satisfied:

For every pair of productions

{
  (1) FIRST(α) ∩ FIRST(β) = φ
  and
  if FIRST(β) contains ∈, and FIRST(α) does not contain ∈
  then
  (1) FIRST(α) ∩ FOLLOW(A) = φ
}

2 Examples

EXAMPLE 2

Test whether the grammar is LL(1) or not, and construct a predictive parsing table for it.

Since the grammar contains a pair of productions S → AaAb | BbBa, for the grammar to be LL(1), it is required that:

Hence, the grammar is LL(1).

To construct a parsing table, the FIRST and FOLLOW sets are computed, as shown below:

Using S → AaAb, we get:
Using S → BbBa, we get

Table 6: Production Selections for Example 2 Parsing Derivations
�	a	b	$
S	S → AaAb	S → BbBa	�
A	A → ∈	A → ∈	�
B	B → ∈	B → ∈	�

EXAMPLE 3

Consider the following grammar, and test whether the grammar is LL(1) or not.

For a pair of productions S → 1AB | ∈:

because FOLLOW(S) = { $ } (i.e., it contains only the end marker. Similarly, for a pair of productions A → 1AC | 0C:

Hence, the grammar is LL(1). Now, show that no left-recursive grammar can be LL(1).

One of the basic requirements for a grammar to be LL(1) is: for every pair of productions A → α | β in the grammar's set of productions, FIRST(α) and FIRST(β) should be disjointed.

If a grammar is left-recursive, then the set of productions will contain at least one pair of the form A → Aα | β; and hence, FIRST(Aα) and FIRST(β) will not be disjointed sets, because everything in the FIRST(β) will also be in the FIRST(Aα). It thereby violates the condition for LL(1) grammar. Hence, a grammar containing a pair of productions A → Aα | β (i.e., a left-recursive grammar) cannot be LL(1).

Now, let X be a nullable nonterminal that derives to at least two terminal strings. Show that in LL(1) grammar, no production rule can have two consecutive occurrences of X on the right side of the production.

Since X is a nullable X ∈, X is also deriving to at least to two terminal strings-Xw₁ and Xw₂-where w₁ and w₂ are the strings of terminals. Therefore, for a grammar using X to be LL(1), it is required that:

FIRST(w₁) ∩ FIRST(w₂) = φ

FIRST (w₁) ∩ FOLLOW(X) and FIRST(w₂) ∩ FOLLOW(X) = φ

If this grammar contains a production rule A → αXXβ-a production whose right side has two consecutive occurrences of X-then everything in FIRST(X) will also be in the FOLLOW(X); and since FIRST(X) contains FIRST(w₁) as well as FIRST(w₂), the second condition will therefore not be satisfied. Hence, a grammar containing a production of the form A → αXXβ will never be LL(1), thereby proving that in LL(1) grammar, no production rule can have two consecutive occurrences of X on the right side of the production.

EXAMPLE 4

Construct a predictive parsing table for the following grammar where S| is a start symbol and # is the end marker.

Here, # is taken as one of the grammar symbols. And therefore, the initial configuration of the parser will be (S|, w#), where the first member of the pair is the contents of the stack and the second member is the contents of input buffer.

Therefore, by substituting in (I), we get:

Using S| → S# we get:
Using S → qABC we get:

Substituting in (II) we get:
Using A → bbD we get:

Therefore, the parsing table is derived as shown in Table 7.

Table 7: Production Selections for Example 4 Parsing Derivations
�	q	a	b	c	#
S	S → S#	�	�	�	�
S	S → qabc	�	�	�	�
A	�	A → a	A → bbD	�	�
B	�	B → a	B → ∈	�	B → ∈
C	�	�	C → b	�	C → ∈
D	�	D → ∈	D → ∈	D → c	D → ∈

EXAMPLE 5

Construct predictive parsing table for the following grammar:

Since the grammar is ∈-free, FOLLOW sets are not required to be computed in order to enter the productions into the parsing table. Therefore the parsing table is as shown in Table 8.

Table 8: Production Selections for Example 5 Parsing Derivations
�	a	b	f	g	d
S	S → A	�	�	�	�
A	A → aS	�	�	A → d	�
B	�	B → bBC	B → f	�	�
C	�	�	�	C → g	�

EXAMPLE 6

Construct a predictive parsing table for the following grammar, where S is a start symbol.

Using S → iEtSS₁:
Using S₁ → eS:

Therefore, the parsing table is as shown in Table 9.

Table 9: Production Selections for Example 6 Parsing Derivations
�	i	a	b	e	T	$
S	S → iEtSS₁	S → a	�	�	�	�
S₁	�	�	�	S1 → eS	�	S₁ → ∈
S₁	�	�	�	S1 → ∈	�	�
E	�	�	E → b	�	�	�

EXAMPLE 7

Construct an LL(1) parsing table for the following grammar:

Computation of FIRST and FOLLOW:

Therefore by substituting in (I) we get:

Using the production S → aBDh we get:
Using the production B → cC, we get:
Using the production C → bC, we get:
Using the production D → EF, we get:

Therefore, the parsing table is as shown in Table 10.

Table 10: Production Selections for Example 7 Parsing Derivations
�	a	b	c	g	f	h	$
S	S→aBDh	�	�	�	�	�	�
B	�	�	B → cC	�	�	�	�
C	�	C → bC	�	C → ∈	C → ∈	C → ∈	�
D	�	�	�	D → EF	D → EF	D → EF	�
E	�	�	�	E → g	E → ∈	E → ∈	�
F	�	�	�	�	F → f	F → ∈	�

Other

Algorithms for Compiler Design: A HANDLE OF A RIGHT SENTENTIAL FORM

Algorithms for Compiler Design: IMPLEMENTATION in Bottom-up Parsing

Algorithms for Compiler Design: THE LR PARSER

Algorithms for Compiler Design: DATA STRUCTURES FOR REPRESENTING PARSING TABLES

Algorithms for Compiler Design: WHY LR PARSING IS ATTRACTIVE

Algorithms for Compiler Design: EXAMPLES for Bottom-up Parsing

Combinations and Permutations with F#

Production Diagnostics Improvements in CLR 4

Using the Dynamic Keyword in C# 4.0