Home | AI | Physics | Mathematics | Compiler Design | Games | Downloads | About Me
You are visitor number 147043 since February 2003
Compiler Design
Introduction
Regular Expressions
Grammars
SableCC
 
Free Books


Context – Free Grammars

Traditionally, context–free grammars have been used as a basis of the syntax analysis phase of compilation. A context–free grammar is sometimes called a BNF (Backus–Naur form) grammar.

Informally, a context–free grammar is simply a set of rewriting rules or productions. A production is of the form

A → B C D … Z


A is the left–hand–side (LHS) of the production. B C D … Z constitute the right–hand–side (RHS) of the production. Every production has exactly one symbol in its LHS; it can have any number of symbols (zero or more) on its RHS. A production represents the rule that any occurrence of its LHS symbol can be represented by the symbols on its RHS. The production

<program> → begin <statement list> end

states that a program is required to be a statement list delimited by a begin and end.

Two kinds of symbols may appear in a context–free grammar: nonterminal and terminals. In this tutorial, nonterminals are often delimited by < and > for ease of recognition. However, nonterminals can also be recognized by the fact that they appear on the left–hand sides of productions. A nonterminal is, in effect, a placeholder. All nonterminals must be replaced, or rewritten, by a production having the appropriate nonterminal on its LHS. In constrast, terminals are never changed or rewritten. Rather, they represent the tokens of a language. Thus the overall purpose of a set of productions (a context–free grammar) is to specify what sequences of terminals (tokens) are legal. A context–free grammar does this in a remarkably elegant way: We start with a single nonterminal symbol called the start or goal symbol. We then apply productions, rewriting nonterminals until only terminals remain. Any sequence of terminals that can be produced by doing this is considered legal. To see how this works, let us look at a context–free grammar for a small subset of Pascal that we call Small Pascal. λ will represent the empty or null string. Thus a production

A → λ


states that A can be replaced by the empty string, effectively erasing it.

Programming language constructs often involve optional items, or lists of items. To cleanly represent such features, an extended BNF notation is often utilised. An optional item sequence is enclosed in square brackets, [ and ]. For example, in

<program> → begin <statement sequence> end


a program can be optionally labelled. Optional sequences are enclosed by braces, { and }. Thus in

<statement sequence> → <statement> {<statement>}


a statement sequence is defined to be a single a statement, optionally followed by zero or more additional statements.

An extended BNF has the same definitional capability as ordinary BNFs. In particular, the following transforms can be used to map extended BNFs into standard form. An optional sequence is replaced by a new nonterminal that generates λ or the items in the sequence. Similarly, an optional sequence is replaced by a new nonterminal that genereates λ or the sequence of followed by the nonterminal. Thus our statement sequence can be transformed into

<statement sequence> → <statement> <statement tail>
<statement tail> → λ
<statement tail> → <statement> <statement tail>

The advantage of extended BNFs is that they are more compact and readable. We can envision a preprocessor that takes extended BNFs and produces standard BNFs.

Figure 3.1 shows a context–free grammar for a Small Pascal. An augmenting production

<system goal> → <program> EOF


appears in the grammar to make sure that the string matched by <system goal> includeds all the input. It specifies that the end–of–file marker, EOF, must follow after the last valid token of a program.

                                        1. <system goal> → <program> EOF
                                        2. <program> → begin <statement sequence> end .
                                        3. <statement sequence> → <statement> {<statement>}
                                        4. <statement> → ID := <expression> ;
                                        5. <statement> → readln ( <identifier list> ) ;
                                        6. <statement> → writeln ( <expression list> ) ;
                                        7. <identifier list> → ID {, ID}
                                        8. <expression list> → <expression> {, <expression>}
                                        9. <expression> → <factor> <addition operator> <factor>
                                        10. <factor> → ( <expression> )
                                        11. <factor> → ID
                                        12. <factor> → INTLITERAL
                                        13. <addition operator> → PLUS
                                        14. <addition operator> → MINUS

Figure 3.1 – Extended BNF defining Small Pascal

To see how this grammar defines legal Small Pascal programs, let’s follow the derivation of one such program, begin ID := ID + ( INTLITERAL – ID ) ; end EOF, starting from the nonterminal <system goal>.

<system goal>
<program> EOF                                                                 (rule 1)
begin <statement sequence> end EOF                                 (rule 2)
begin <statement> {, <statement} end EOF                         (rule 3)
begin <statement> end EOF                                                (Choose 0 repetitions)
begin ID := <expression> ; end EOF                                    (rule 4)
begin ID := <factor> <add op > <factor> ; end EOF             (rule 9)
begin ID := <factor> + <factor> ; end EOF                          (rule 13)
begin ID := ID + <factor> ; end EOF                                    (rule 11)
begin ID := ID + ( <expression> ) ; end EOF                         (rule 10)
begin ID := ID + ( <factor> {<add op> <factor>} ) end EOF  (rule 9)
begin ID := ID + ( <factor> <add op> <factor> ) end EOF     (Choose 1 repetition)
begin ID := ID + ( <factor> – <factor> ) end EOF                  (Apply rule 14)
begin ID := ID + ( INTLITERAL – <factor> ) end EOF              (Apply rule 12)
begin ID := ID + ( INTLITERAL – ID ) end EOF                        (Apply rule 11)

At this point no nonterminals remain, so our derivation of a Small Pascal program is completed.

A context–free grammar defines a language, which is a set of sequences of tokens. Any sequence of tokens that can be derived using the grammar is valid; any sequence that cannot be derived is not valid. Actually, to be precise, any token sequence derivable from a context–free grammar is considered syntactically valid. When static semantics are checked by the semantic routines, semantic errors in a syntactically valid program may be discovered. For example, in Pascal the statement

A := ‘X’ + True;


has no syntax errors but it does have a semantic error: the operator + is not defined for adding a character to a boolean value.

Structure as well as syntax can be defined in a context–free grammar. For expressions, this includes associctivity and operator precedence. Associativity is concerned with the order in which consecutive instances of an operator are applied (as in A–B–C). Operator precedence refers to the relative priority of operators. For example, we expect A+B*C to mean A+(B*C), since * is usually considered to be a higher precedence operator than +. Small Pascal has only one level of precedence, because it does not include multiplication nor addition. If multiplication were included, it would be at a higher level of precedence than addition. The following grammar fragment defines such a precedence relationship.

                                        <expression> → <term> {<add op> <term>}                                         <term> → <factor> {<mult op> <factor>}
                                        <factor> → ( <expression> )
                                        <factor> → ID
                                        <factor> → INTLITERAL

Examining a derivation tree for the expression A+B*C, which is derived from this grammar fragment, illustrates how this definition works. A derivation tree is formed by showing nonterminals expansions as subtrees. Figure 3.2 shows the tree for this expression.


Figure 3.2 – Derivation Tree for A+B*C

The tree shows that * has a higher precedence than + because the second and third Ids are grouped together (with *) in a subtree, and then this subtree is grouped together with the first ID the first ID (using +) in the main derivation tree.

Now we may ask: can a derivation tree with wrong precedence ever be formed in this grammar? No – the production rules don’t allow it. Try to build ID+ID*ID with + applied first. Since a * can only be generated by a <term>, ID+ID must appear unless the subexpression is enclosed in parentheses. With parentheses, the desired grouping can be forced.

We have so far introduced regular expressions and context–free grammars, which are respectively used for defining the lexer and parser of a grammar. There is one more thing the reader must know, which is the type of grammar used in SableCC. In this paper we will give just a brief idea of a particular grammar, LALR(1), a subset of LR(1), which is commonly used in the description of almost any programming language. If you would like to know more about other types of grammars, such as LL(k) or LR(k), then you should refer to [1] or [2].

Instead of describing the LALR(1) type of grammar in full, we will just outline the conflicts that may occur in this type of grammar. There are two types of conflicts that may occur in this type of grammar: Reduce⁄Reduce and Shift⁄Reduce conflicts. A reduce⁄reduce conflict occurs when the parser does not know which production to reduce. For example, the grammar

A → Bc | d B →


is not LALR(1) because if the input to the parser is d, the parser will not know if it reduces to B or if it reduces to A, because both A and B can be replaced by d. To solve this conflicts, we eliminate production B, which is a subset of A, and change the grammar to

A → Ac | d


Now if the input to the parser is a d, it will reduce to A.

As for shift⁄reduce conflicts, they occur when the parser does not know if it reduces or if it shifts (pushes the token or nonterminal to the stack) the current input. For instance, let’s analyse the grammar

                                        A → BaDe | BaDeCe
                                        B → c
                                        D → b
                                        C → s

Suppose that the input is cabe. When the parser receives this input, it first reduces c to B, then it shifts it (pushes it to the stack). It then reads a and shifts it, next it reads b, reduces it to D and shifts it, finally it reads e and know it does not know if it shifts e or if it reduces the string cabe to A.

Now that you know the theory behind regular expressions and grammars you can dive into the Sablecc tutorial, where I show you how to write a small compiler using a compiler generator tool called SableCC.

If you have any queries regarding this tutorial, just drop me an email.

References:

  1. Andrew W. Appel, “Modern compiler implementation in Java”, Cambridge University Press, 1998.
  2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers Principles, Techniques, and Tools”, Addison-Wesley, 1986.


Books



© 2002, 2003 Fidel Viegas. All rights reserved.