|
Traditionally, context–free grammars have been used as a basis of the syntax analysis phase of compilation. A context–free grammar is sometimes called a BNF (Backus–Naur form) grammar.
Informally, a context–free grammar is simply a set of rewriting rules or productions. A production is of the form
A → B C D … Z
A is the left–hand–side (LHS) of the production. B C D … Z constitute the right–hand–side (RHS) of the production. Every production has exactly one symbol in its LHS; it can have any number of symbols (zero or more) on its RHS. A production represents the rule that any occurrence of its LHS symbol can be represented by the symbols on its RHS. The production
<program> → begin <statement list> end
states that a program is required to be a statement list delimited by a begin and end.
Two kinds of symbols may appear in a context–free grammar: nonterminal and terminals.
In this tutorial, nonterminals are often delimited by < and > for ease of recognition. However,
nonterminals can also be recognized by the fact that they appear on the left–hand sides of
productions. A nonterminal is, in effect, a placeholder. All nonterminals must be replaced,
or rewritten, by a production having the appropriate nonterminal on its LHS. In constrast,
terminals are never changed or rewritten. Rather, they represent the tokens of a language.
Thus the overall purpose of a set of productions (a context–free grammar) is to specify
what sequences of terminals (tokens) are legal. A context–free grammar does this in a
remarkably elegant way: We start with a single nonterminal symbol called the start or goal
symbol. We then apply productions, rewriting nonterminals until only terminals remain.
Any sequence of terminals that can be produced by doing this is considered legal. To see
how this works, let us look at a context–free grammar for a small subset of Pascal that we
call Small Pascal. λ will represent the empty or null string. Thus a production
A → λ
states that A can be replaced by the empty string, effectively erasing it.
Programming language constructs often involve optional items, or lists of items. To cleanly represent such features, an extended BNF notation is often utilised. An optional item sequence is enclosed in square brackets, [ and ]. For example, in
<program> → begin <statement sequence> end
a program can be optionally labelled. Optional sequences are enclosed by braces, { and }. Thus in
<statement sequence> → <statement> {<statement>}
a statement sequence is defined to be a single a statement, optionally followed by zero or more additional statements.
An extended BNF has the same definitional capability as ordinary BNFs. In particular, the following transforms can be used to map extended BNFs into standard form. An optional sequence is replaced by a new nonterminal that generates λ or the items in the sequence. Similarly, an optional sequence is replaced by a new nonterminal that genereates λ or the sequence of followed by the nonterminal. Thus our statement sequence can be transformed into
<statement sequence> → <statement> <statement tail>
<statement tail> → λ
<statement tail> → <statement> <statement tail>
The advantage of extended BNFs is that they are more compact and readable. We can envision a preprocessor that takes extended BNFs and produces standard BNFs.
Figure 3.1 shows a context–free grammar for a Small Pascal. An augmenting production
<system goal> → <program> EOF
appears in the grammar to make sure that the string matched by <system goal> includeds all the input. It specifies that the end–of–file marker, EOF, must follow after the last valid token of a program.
1. <system goal> → <program> EOF
2. <program> → begin <statement sequence> end .
3. <statement sequence> → <statement> {<statement>}
4. <statement> → ID := <expression> ;
5. <statement> → readln ( <identifier list> ) ;
6. <statement> → writeln ( <expression list> ) ;
7. <identifier list> → ID {, ID}
8. <expression list> → <expression> {, <expression>}
9. <expression> → <factor> <addition operator> <factor>
10. <factor> → ( <expression> )
11. <factor> → ID
12. <factor> → INTLITERAL
13. <addition operator> → PLUS
14. <addition operator> → MINUS
Figure 3.1 – Extended BNF defining Small Pascal
To see how this grammar defines legal Small Pascal programs, let’s follow the derivation of one such program, begin ID := ID + ( INTLITERAL – ID ) ; end EOF, starting from the nonterminal <system goal>.
<system goal>
<program> EOF (rule 1)
begin <statement sequence> end EOF (rule 2)
begin <statement> {, <statement} end EOF (rule 3)
begin <statement> end EOF (Choose 0 repetitions)
begin ID := <expression> ; end EOF (rule 4)
begin ID := <factor> <add op > <factor> ; end EOF (rule 9)
begin ID := <factor> + <factor> ; end EOF (rule 13)
begin ID := ID + <factor> ; end EOF (rule 11)
begin ID := ID + ( <expression> ) ; end EOF (rule 10)
begin ID := ID + ( <factor> {<add op> <factor>} ) end EOF (rule 9)
begin ID := ID + ( <factor> <add op> <factor> ) end EOF (Choose 1 repetition)
begin ID := ID + ( <factor> – <factor> ) end EOF (Apply rule 14)
begin ID := ID + ( INTLITERAL – <factor> ) end EOF (Apply rule 12)
begin ID := ID + ( INTLITERAL – ID ) end EOF (Apply rule 11)
At this point no nonterminals remain, so our derivation of a Small Pascal program is completed.
A context–free grammar defines a language, which is a set of sequences of tokens. Any sequence of tokens that can be derived using the grammar is valid; any sequence that cannot be derived is not valid. Actually, to be precise, any token sequence derivable from a context–free grammar is considered syntactically valid. When static semantics are checked by the semantic routines, semantic errors in a syntactically valid program may be discovered. For example, in Pascal the statement
A := ‘X’ + True;
has no syntax errors but it does have a semantic error: the operator + is not defined for adding a character to a boolean value.
Structure as well as syntax can be defined in a context–free grammar. For expressions, this
includes associctivity and operator precedence. Associativity is concerned with the order in
which consecutive instances of an operator are applied (as in A–B–C). Operator precedence
refers to the relative priority of operators. For example, we expect A+B*C to mean A+(B*C),
since * is usually considered to be a higher precedence operator than +. Small Pascal has only
one level of precedence, because it does not include multiplication nor addition. If
multiplication were included, it would be at a higher level of precedence than addition. The
following grammar fragment defines such a precedence relationship.
<expression> → <term> {<add op> <term>}
<term> → <factor> {<mult op> <factor>}
<factor> → ( <expression> )
<factor> → ID
<factor> → INTLITERAL
Examining a derivation tree for the expression A+B*C, which is derived from this
grammar fragment, illustrates how this definition works. A derivation tree is formed by
showing nonterminals expansions as subtrees. Figure 3.2 shows the tree for this expression.

Figure 3.2 – Derivation Tree for A+B*C
The tree shows that * has a higher precedence than + because the second and third Ids are
grouped together (with *) in a subtree, and then this subtree is grouped together with the first
ID the first ID (using +) in the main derivation tree.
Now we may ask: can a derivation tree with wrong precedence ever be formed in this
grammar? No – the production rules don’t allow it. Try to build ID+ID*ID with + applied
first. Since a * can only be generated by a <term>, ID+ID must appear unless the
subexpression is enclosed in parentheses. With parentheses, the desired grouping can be
forced.
We have so far introduced regular expressions and context–free grammars, which are
respectively used for defining the lexer and parser of a grammar. There is one more thing the
reader must know, which is the type of grammar used in SableCC. In this paper we will give
just a brief idea of a particular grammar, LALR(1), a subset of LR(1), which is commonly
used in the description of almost any programming language. If you would like to
know more about other types of grammars, such as LL(k) or LR(k), then you should refer to [1] or [2].
Instead of describing the LALR(1) type of grammar in full, we will just outline the
conflicts that may occur in this type of grammar. There are two types of conflicts that may occur in this type of grammar: Reduce⁄Reduce and Shift⁄Reduce conflicts. A reduce⁄reduce conflict occurs when the parser does not know which production to reduce. For example, the grammar
A → Bc | d B →
is not LALR(1) because if the input to the parser is d, the parser will not know if it reduces to B or if it reduces to A, because both A and B can be replaced by d. To solve this conflicts, we eliminate production B, which is a subset of A, and change the grammar to
A → Ac | d
Now if the input to the parser is a d, it will reduce to A.
As for shift⁄reduce conflicts, they occur when the parser does not know if it reduces or if it shifts (pushes the token or nonterminal to the stack) the current input. For instance, let’s analyse the grammar
A → BaDe | BaDeCe
B → c
D → b
C → s
Suppose that the input is cabe. When the parser receives this input, it first reduces c to B, then it shifts it (pushes it to the stack). It then reads a and shifts it, next it reads b, reduces it to D and shifts it, finally it reads e and know it does not know if it shifts e or if it reduces the string cabe to A.
Now that you know the theory behind regular expressions and grammars you can dive into the Sablecc tutorial, where I show you how to write a small compiler using a compiler generator tool called SableCC.
If you have any queries regarding this tutorial, just drop me an email.
References:
- Andrew W. Appel, “Modern compiler implementation in Java”, Cambridge University Press, 1998.
- Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers Principles, Techniques, and Tools”, Addison-Wesley, 1986.
|