
AN INTRODUCTION TO

COMPILER CONSTRUCTION

O S ADEWALE
Department of Computer Science
The Federal University of Technology
AKURE, NIGERIA

PREFACE
Compilers and interpreters are a necessary part of any computer system;
without them, we would all be programming in assembly language or even
machine language. This has made compiler construction an important,
practical area of research in computer science. The object of this book is to
present in a coherent fashion the major techniques used in compiler writing,
in order to make it easier for the novice to enter the field and for the expert
to reference the literature.
The book is intended to serve two needs; it can be used as a self-study and
reference book for the professional programmer interested in or involved in
compiler construction, and as a text in compiler construction at the
undergraduate or graduate level. The emphasis is on solving the problems
universally encountered in designing a compiler, regardless of the source
language or the target machine.
A number of ideas and techniques discussed in this book can be profitably
used in general software design. For example, the finite-state techniques and
regular expressions used to build lexical analysers have also been used in text
editors, bibliographic search systems and pattern recognition programs.
Context-free grammars and syntax-directed translation schemes have been
used to build text processors of many sorts. Techniques of code optimisation
also have applicability to program verifiers and to programs that produce
structured programs from unstructured ones.


TABLE OF CONTENTS
Chapter One       The Compiling Process
Chapter Two       Lexical Analysis
Chapter Three     Formal Grammars
Chapter Four      Top-Down Parsing
Chapter Five      Bottom-Up Parsing
Chapter Six       SLR and LR Parsing
Chapter Seven     LALR Parsing
Chapter Eight     Parsing Miscellany
Chapter Nine      Introduction to YACC
Chapter Ten       Syntax-Directed Translation
Chapter Eleven    Semantic Analysis
Chapter Twelve    Intermediate Representations
Chapter Thirteen  Code Optimisation
Bibliography


CHAPTER ONE
The Compiling Process
What is a compiler?
A compiler is a program that takes as input a program written in one
language (the source language) and translates it into a functionally
equivalent program in another language (the target language). The source
language is usually a high-level language like Pascal or C, and the target
language is usually a low-level language like assembly or machine
language. As it translates, a compiler also reports errors and warnings to
help the programmer make corrections to the source, so the translation can
be completed. Theoretically, the source and target can be any language,
but the most common use of a compiler is translating an ASCII source
program written in a language such as C into a machine specific result like
SPARC assembly that can execute on that designated hardware.
Although we will focus on writing a compiler for a programming
language, the techniques you learn can be valuable and useful for a wide
variety of parsing and translating tasks, e.g. converting javadoc
comments to HTML, generating a table from the results of an SQL query,
collating responses from an e-mail survey, implementing a server that
responds to a network protocol like http or imap, or screen-scraping
information from an on-line source. Your printer uses parsing to render
PostScript files. Hardware engineers use a full-blown compiler to translate
from a hardware description language to the schematic of the circuit. Your
email spam filter quite possibly uses scanning and parsing to detect
unwanted email. And the list goes on.

How does a compiler work?


From the previous diagram, you can see there are two main stages in the
compiling process: analysis and synthesis. The analysis stage breaks up
the source program into pieces and creates a generic (language-independent)
intermediate representation of the program. Then, the
synthesis stage constructs the desired target program from the intermediate
representation. Typically, a compiler's analysis stage is called its front-end
and the synthesis stage its back-end. Each of the stages is broken down
into a set of phases that handle different parts of the tasks.
What advantage do you see to structuring the compiler this way
(separating the front and back ends)?
The analysis stage (front-end)
There are four phases in the analysis stage of compiling:
1) Lexical Analysis or Scanning: The stream of characters making up
   a source program is read from left to right and grouped into tokens,
   which are sequences of characters that have a collective meaning.
   Examples of tokens are identifiers (user-defined names), reserved
   words, integers, doubles or floats, delimiters, operators, and special
   symbols.
Example of lexical analysis:
int a;
a = a + 2;

A lexical analyser scanning the code fragment above might return:


int T_INT (reserved word)
a T_IDENTIFIER (variable name)
; T_SPECIAL (special symbol with value of ;)
a T_IDENTIFIER (variable name)
= T_OP (operator with value of =)
a T_IDENTIFIER (variable name)
+ T_OP (operator with value of +)
2 T_INTCONSTANT (integer constant with value of 2)
; T_SPECIAL (special symbol with value of ;)
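
Concretely, a scanner usually hands the parser each token as a small record
pairing the token's class with its lexeme and any attribute value. The C
sketch below is purely illustrative; the type and field names (TokenKind,
Token, lexeme, intValue) are assumptions, not part of any particular compiler.

/* Illustrative token record a scanner might return; names are assumed. */
typedef enum {
    T_INT, T_IDENTIFIER, T_OP, T_INTCONSTANT, T_SPECIAL
} TokenKind;

typedef struct {
    TokenKind kind;        /* general class, e.g. T_IDENTIFIER         */
    char      lexeme[64];  /* the actual characters, e.g. "a" or "2"   */
    int       intValue;    /* attribute used by T_INTCONSTANT tokens   */
} Token;

/* The statement "a = a + 2;" would then yield the sequence:
   {T_IDENTIFIER,"a"} {T_OP,"="} {T_IDENTIFIER,"a"} {T_OP,"+"}
   {T_INTCONSTANT,"2",2} {T_SPECIAL,";"}                               */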

2) Syntax Analysis or Parsing: The tokens found during scanning are
   grouped together using a context-free grammar. A grammar is a set
   of rules that define valid structures in the programming language.
   Each token is associated with a specific rule, and grouped together
   accordingly. This process is called parsing. The output of this
   phase is called a parse tree or a derivation, i.e., a record of which
   grammar rules were used to create the source program.

Example of syntax analysis:


Part of a grammar for simple arithmetic expressions in C might
look like this:
Expression -> Expression + Expression |
              Expression - Expression |
              ...
              Variable |
              Constant |
              ...
Variable   -> T_IDENTIFIER
Constant   -> T_INTCONSTANT |
              T_DOUBLECONSTANT

The symbol on the left side of the -> in each rule can be replaced
by the symbols on the right. To parse a + 2, we would apply the
following rules:
Expression -> Expression + Expression
           -> Variable + Expression
           -> T_IDENTIFIER + Expression
           -> T_IDENTIFIER + Constant
           -> T_IDENTIFIER + T_INTCONSTANT

When we reach a point in the parse where we have only tokens, we
have finished. By knowing which rules are used to parse, we can
determine the structures present in the source program.
3) Semantic Analysis: The parse tree or derivation is checked for
   semantic errors, i.e., a statement that is syntactically correct
   (associates with a grammar rule correctly) but disobeys the
   semantic rules of the source language. Semantic analysis is the
   phase where we detect such things as use of an undeclared
   variable, a function called with improper arguments, access
   violations, and incompatible operands and type mismatches, e.g. an
   array variable added to a function name.
Example of semantic analysis:
int arr[2], c;
c = arr * 10;

Most semantic analysis pertains to the checking of types. Although
the C fragment above will scan into valid tokens and successfully
match the rules for a valid expression, it isn't semantically valid. In
the semantic analysis phase, the compiler checks the types and
reports that you cannot use an array variable in a multiplication
expression and that the type of the right-hand side of the
assignment is not compatible with the left.

4) Intermediate Code Generation: This is where the intermediate
   representation of the source program is created. We want this
   representation to be easy to generate, and easy to translate into the
   target program. The representation can have a variety of forms, but
   a common one is called three-address code (TAC), which is a lot
   like a generic assembly language. Three-address code is a sequence
   of simple instructions, each of which can have at most three
   operands.
Example of intermediate code generation:
a = b * c + b * d

_t1 = b * c
_t2 = b * d
_t3 = _t1 + _t2
a   = _t3

The single C statement above is translated into a sequence of
four instructions in three-address code. Note the use of
temporary variables that are created by the compiler as needed to keep
the number of operands down to three.
Of course, it's a little more complicated than this, because we have
to translate branching and looping instructions, as well as function
calls. Here is some TAC for a branching translation:
if (a <= b)
    a = a - c;
c = b * c;

    _t1 = a > b
    if _t1 goto L0
    _t2 = a - c
    a = _t2
L0: _t3 = b * c
    c = _t3
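
As a side illustration of the "at most three operands" property, a
three-address instruction can be held in a small record with an operator, a
destination and up to two source operands. The names below (TacOp, TacInstr)
are hypothetical, chosen only for this sketch.

/* A minimal sketch of a three-address-code instruction record. */
typedef enum { TAC_ADD, TAC_SUB, TAC_MUL, TAC_ASSIGN,
               TAC_GT, TAC_IFGOTO, TAC_LABEL } TacOp;

typedef struct {
    TacOp       op;
    const char *result;   /* destination, e.g. "_t1" or "a"         */
    const char *arg1;     /* first source operand                   */
    const char *arg2;     /* second source operand (NULL if unused) */
} TacInstr;

/* The instruction  _t1 = a > b  would be stored as: */
static const TacInstr example = { TAC_GT, "_t1", "a", "b" };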

The synthesis stage (back-end)


There can be up to three phases in the synthesis stage of compiling:
1) Intermediate Code Optimisation: The optimiser accepts input in the
   intermediate representation (e.g. TAC) and outputs a streamlined
   version still in the intermediate representation. In this phase, the
   compiler attempts to produce the smallest, fastest and most
   efficient running result by applying various techniques such as:
   - suppressing code generation for unreachable code segments
   - getting rid of unused variables
   - eliminating multiplication by 1 and addition of 0
   - loop optimisation (e.g., moving computations that do not change
     inside the loop out of it)
   - common sub-expression elimination
   - strength reduction
   - ...

The optimisation phase can really slow down a compiler, so
typically it is an optional phase. The compiler may even have fine-grain
controls that allow the developer to make tradeoffs between
time spent compiling versus optimisation quality.
Example of code optimisation:
_t1 = b * c
_t2 = _t1 + 0
_t3 = b * c
_t4 = _t2 + _t3
a   = _t4

is optimised to:

_t1 = b * c
_t2 = _t1 + _t1
a   = _t2

In the example shown above, the optimiser was able to eliminate
an addition of zero and a re-evaluation of the same expression,
allowing the original five TAC statements to be re-written in just
three statements using two fewer temporary variables.
2) Object Code Generation: This is where the target program is
   generated. The output of this phase is usually machine code or
   assembly code. Memory locations are selected for each variable.
   Instructions are chosen for each operation. The three-address code
   is translated into a sequence of assembly or machine language
   instructions that perform the same tasks.
Example of code generation:
_t1 = b * c
_t2 = _t1 + _t1
a = _t2

lw  $t1, -16($fp)    # load
lw  $t2, -20($fp)    # load
mul $t0, $t1, $t2    # mult
add $t3, $t0, $t0    # add
sw  $t3, -24($fp)    # store

In the example above, the code generator translated the TAC input
into MIPS assembly output.
3) Object Code Optimisation: There may also be another optimisation
   pass that follows code generation, this time transforming the object
   code into tighter, more efficient object code. This is where we
   consider features of the hardware itself to make efficient usage of
   the processor(s) and registers. The compiler can take advantage of
   machine-specific idioms (specialised instructions, pipelining,
   branch prediction, and other peephole optimisations) in
   reorganising and streamlining the object code itself. As with IR
   optimisation, this phase of the compiler is usually configurable or
   can be skipped entirely.

The symbol table
There are a few activities that interact with various phases across both
stages. One is symbol table management; a symbol table contains
information about all the identifiers in the program along with important
attributes such as type and scope. Identifiers can be found in the lexical
analysis phase and added to the symbol table. During the two phases that
follow (syntax and semantic analysis), the compiler updates the identifier
entry in the table to include information about its type and scope.
When generating intermediate code, the type of the variable is used to
determine which instructions to emit. During optimisation, the live
range of each variable may be placed in the table to aid in register
allocation. The memory location determined in the code generation phase
might also be kept in the symbol table.
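
As an illustration of the kind of record involved, a symbol-table entry and a
simple lookup might look like the C sketch below. The structure and names
(Symbol, SymbolTable, lookup) are assumptions; a production compiler would
normally use a hash table and track nested scopes.

#include <string.h>

typedef struct {
    char name[64];   /* identifier lexeme                            */
    int  type;       /* e.g. T_INT or T_DOUBLE, filled in during
                        syntax and semantic analysis                 */
    int  scope;      /* nesting level of the declaring block         */
    int  offset;     /* memory location, set at code generation time */
} Symbol;

typedef struct {
    Symbol entries[1024];
    int    count;
} SymbolTable;

Symbol *lookup(SymbolTable *t, const char *name) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return &t->entries[i];
    return NULL;     /* not found: possibly an undeclared identifier */
}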
Error-handling
Another activity that occurs across several phases is error handling. Most
error handling occurs in the first three phases of the analysis stage. The
scanner keeps an eye out for stray tokens, the syntax analysis phase reports
invalid combinations of tokens, and the semantic analysis phase reports
type errors and the like. Sometimes these are fatal errors that stop the
entire process, while at other times, the compiler can recover and continue.
One-pass versus multi-pass
In looking at this phased approach to the compiling process, one might
think that each phase generates output that is then passed on to the next
phase. For example, the scanner reads through the entire source program
and generates a list of tokens. This list is the input to the parser that reads
through the entire list of tokens and generates a parse tree or derivation. If
a compiler works in this manner, we call it a multi-pass compiler. The
"pass" refers to how many times the compiler must read through the
source program. In reality, most compilers are one-pass up to the code
optimisation phase. Thus, scanning, parsing, semantic analysis and
intermediate code generation are all done simultaneously as the compiler
reads through the source program once. Once we get to code optimisation,
several passes are usually required which is why this phase slows the
compiler down so much.

CHAPTER TWO
Lexical Analysis
The basics
Lexical analysis or scanning is the process where the stream of characters
making up the source program is read from left-to-right and grouped into
tokens. Tokens are sequences of characters with a collective meaning.
There are usually only a small number of tokens for a programming
language: constants (integer, double, char, string, etc.), operators
(arithmetic, relational, logical), punctuation, and reserved words.

The lexical analyser takes a source program as input, and produces a
stream of tokens as output. The lexical analyser might recognise particular
instances of tokens such as:
3 or 255 for an integer constant token
"Fred" or "Wilma" for a string constant token
numTickets or queue for a variable token

Such specific instances are called lexemes. A lexeme is the actual
character sequence forming a token; the token is the general class that a
lexeme belongs to. Some tokens have exactly one lexeme (e.g. the >
character); for others, there are an infinite number of lexemes (e.g. integer
constants).
The scanner is tasked with determining that the input stream can be divided into
valid symbols in the source language, but has no smarts about which token
should come where. Few errors can be detected at the lexical level alone because
the scanner has a very localised view of the source program without any context.
The scanner can report about characters that are not valid tokens (e.g., an illegal
or unrecognised symbol) and a few other malformed entities (illegal characters
within a string constant, unterminated comments, etc.). It does not look for or
detect garbled sequences, tokens out of place, undeclared identifiers, misspelled
keywords, mismatched types and the like. For example, the following input will
not generate any errors in the lexical analysis phase, because the scanner has no
concept of the appropriate arrangement of tokens for a declaration. The syntax
analyser will catch this error later in the next phase.
int a double } switch b[2] =;


Furthermore, the scanner has no idea how tokens are grouped. In the above
sequence, it returns b, [, 2, and ] as four separate tokens, having no idea they
collectively form an array access.
The lexical analyser can be a convenient place to carry out some other chores like
stripping out comments and white space between tokens and perhaps even some
features like macros and conditional compilation (although often these are
handled by some sort of preprocessor which filters the input before the compiler
runs).

Scanner implementation 1: loop & switch


There are two primary methods for implementing a scanner. The first is a
program that is hardcoded to perform the scanning tasks. The second is the use of
regular expressions and finite automata to model the scanning process.
A loop & switch implementation consists of a main loop that reads characters
one by one from the input file and uses a switch statement to process the
character(s) just read. The output is a list of tokens and lexemes from the source
program. The following program fragment shows a skeletal implementation of a
simple loop and switch scanner. The main program calls InitScanner() and
loops calling ScanOneToken() until EOF. ScanOneToken() reads the
next character from the file and switches off that char to decide how to handle
what is coming up next in the file. The return values from the scanner can be
passed on to the parser in the next phase.
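
A sketch of what such a scanner might look like is given below. Only the names
InitScanner() and ScanOneToken() come from the text; the token codes, the file
handling and the details of each case are assumptions, and only "//" comments
are handled to keep the sketch short. It follows the convention, described in
the next paragraph, that reserved words are all upper-case and identifiers all
lower-case.

#include <stdio.h>
#include <ctype.h>

typedef enum { T_EOF, T_IDENTIFIER, T_RESERVED, T_INTCONSTANT,
               T_OP, T_SPECIAL } TokenType;

static FILE *src;

void InitScanner(const char *filename) {
    src = fopen(filename, "r");                   /* error handling omitted */
}

TokenType ScanOneToken(void) {
    int c = getc(src);
    while (c == ' ' || c == '\t' || c == '\n')    /* skip white space */
        c = getc(src);

    switch (c) {
    case EOF:
        return T_EOF;
    case ';': case ',': case '{': case '}':
        return T_SPECIAL;                         /* single-char symbols */
    case '+': case '-': case '=':
        return T_OP;
    case '/': {                                   /* peek for a comment  */
        int next = getc(src);
        if (next == '/') {                        /* skip to end of line */
            while ((c = getc(src)) != '\n' && c != EOF)
                ;
            return ScanOneToken();
        }
        ungetc(next, src);                        /* push back: plain '/' */
        return T_OP;
    }
    default:
        if (isdigit(c)) {                         /* gather an integer    */
            while (isdigit(c = getc(src)))
                ;
            ungetc(c, src);
            return T_INTCONSTANT;
        }
        if (isupper(c)) {                         /* reserved words are   */
            while (isupper(c = getc(src)))        /* all upper-case here  */
                ;
            ungetc(c, src);
            return T_RESERVED;
        }
        if (islower(c)) {                         /* identifiers are      */
            while (islower(c = getc(src)))        /* all lower-case here  */
                ;
            ungetc(c, src);
            return T_IDENTIFIER;
        }
        return T_SPECIAL;                         /* unrecognised char    */
    }
}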


The mythical source language tokenised by the above scanner requires that
reserved words be in all upper-case and identifiers in all lower-case. This
convenient feature makes it easy for the scanner to choose which path to
pursue after reading just one character. It is sometimes necessary to design
the scanner to look ahead before deciding what path to follow; notice
the handling for the '/' character, which peeks at the next character to check
whether the first slash is followed by another slash or star, which indicates
the beginning of a comment. If not, the extra character is pushed back onto
the input stream and the token is interpreted as the single-char operator for
division.
Loop-and-switch scanners are sometimes called ad hoc scanners,
indicating their design and purpose of solving a specific instance rather
than a general problem. For a sufficiently reasonable set of token types, a
hand-coded loop-and-switch scanner might be all that's needed; it requires no
other tools and can be coded rather efficiently. The gcc front-end uses an
ad hoc scanner, in fact.
Scanner implementation 2: regular expressions & finite automata

The other method of implementing a scanner is using regular expressions
and finite automata. A quick detour for some background review is given
and then we look at how we can generate a scanner using this approach.
Regular expression review
We assume that you are well acquainted with regular expressions and all
this is old news to you.
symbol
    an abstract entity that we shall not define formally (such as 'point'
    in geometry). Letters, digits and punctuation are examples of symbols.
alphabet
    a finite set of symbols out of which we build larger structures. An
    alphabet is typically denoted using the Greek sigma, e.g., Σ = {0,1}.
string
    a finite sequence of symbols from a particular alphabet juxtaposed.
    For example: a, b, c are symbols and abcb is a string.
empty string
    denoted ε (or sometimes λ), it is the string consisting of zero
    symbols.
formal language
    Σ*, the set of all possible strings that can be generated from a
    given alphabet.
regular expressions
    rules that define exactly the set of words that are valid tokens in
    a formal language. The rules are built up from three operators:
        concatenation   xy
        alternation     x|y     x or y
        repetition      x*      x repeated 0 or more times

Formally, the set of regular expressions can be defined by the following
recursive rules:

1) every symbol of Σ is a regular expression
2) ε is a regular expression
3) if r1 and r2 are regular expressions, so are
       (r1)     r1 r2     r1 | r2     r1*
4) nothing else is a regular expression.


We can use regular expressions to define the tokens in a programming
language. For example, here is a regular expression for an integer, which
consists of one or more digits (+ is extended regex syntax for 1 or more
repetitions)
(0|1|2|3|4|5|6|7|8|9)+

Finite automata review

Once we have all our tokens defined using regular expressions, we can
create a finite automaton for recognising them. To review, a finite
automaton has:
1) A finite set of states, one of which is designated the initial state or
   start state, and some (maybe none) of which are designated as final
   states.
2) An alphabet of possible input symbols.
3) A finite set of transitions that specifies for each state and for each
   symbol of the input alphabet, which state to go to next.

Now that we know what FAs are, here is a regular expression and a simple
finite automaton that recognises an integer.

Here is an FA that recognises a subset of tokens in the Pascal language:


This FA handles only a subset of all Pascal tokens but it should give you
an idea of how an FA can be used to drive a scanner. The
numbered/lettered states are final states. The loops on states 1 and 2
continue to execute until a character other than a letter or digit is read. For
example, when scanning "temp:=temp+1;" it would report the first token
at final state 1 after reading the ":" having recognised the lexeme "temp"
as an identifier token.
What happens in an FA-driven scanner is we read the source program one
character at a time beginning with the start state. As we read each
character, we move from our current state to the next by following the
appropriate transition for that character. When we end up in a final state, we perform
an action associated with that final state. For example, the action
associated with state 1 is to first check if the token is a reserved word by
looking it up in the reserved word list. If it is, the reserved word is passed
to the token stream being generated as output. If it is not a reserved word,
it is an identifier so a procedure is called to check if the name is in the
symbol table. If it is not there, it is inserted into the table.
Once a final state is reached and the associated action is performed, we
pick up where we left off at the first character of the next token and begin
again at the start state. If we do not end in a final state or encounter an
unexpected symbol while in any state, we have an error condition. For
example, if you run "ASC@I" through the above FA, we would error out
of state 1.
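Such an FA can drive a scanner with a small loop over a transition table, as
in the hedged C sketch below. The table contents, the char_class() mapping and
the function name ScanWithFA are assumptions; a real driver would also push
back the character that carried it into the final state, since that character
usually begins the next token.

#include <stdio.h>

#define NUM_STATES  8      /* illustrative sizes                       */
#define NUM_CLASSES 6      /* e.g. letter, digit, ':', '=', ';', other */
#define ERR (-1)

/* transition[state][class] gives the next state (or ERR); is_final marks
   accepting states. Both tables are built from the token definitions.  */
static int transition[NUM_STATES][NUM_CLASSES];
static int is_final[NUM_STATES];

/* Map a character to a column of the table; the real mapping depends on
   the language's character classes.                                    */
static int char_class(int c) { (void)c; return 0; /* placeholder */ }

/* Drive the FA: start in state 0 and follow transitions until a final
   state is reached. The caller then performs the action associated with
   that final state.                                                    */
int ScanWithFA(FILE *in) {
    int state = 0;
    while (!is_final[state]) {
        int c = getc(in);
        if (c == EOF) return ERR;
        int next = transition[state][char_class(c)];
        if (next == ERR) return ERR;   /* unexpected symbol: lexical error */
        state = next;
    }
    return state;
}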
From regular expressions to NFA
So that's how FAs can be used to implement scanners. Now we need to
look at how to create an FA given the regular expressions for our tokens.
There is a looser definition of an FA that is especially useful to us in this
process. A nondeterministic finite automaton (NFA) has:
1) A finite set of states with one start state and some
   (maybe none) final states.
2) An alphabet of possible input symbols.
3) A finite set of transitions that describe how to proceed
   from one state to another along edges labelled with
   symbols from the alphabet (but not ε). We allow the
   possibility of more than one edge with the same label
   from any state, and some states for which certain
   input letters have no edge.

Here is an NFA that accepts the language (0|1)*(000|111)(0|1)*

Notice that there is more than one path through the machine for a given
string. For example, 000 can take you to a final state, or it can leave you
in the start state. This is where the non-determinism (choice) comes in. If
any of the possible paths for a string leads to a final state, that string is in
the language of this automaton.
There is a third type of finite automaton called an ε-NFA, which has
transitions labelled with ε, the empty string. The interpretation for such
transitions is that one can travel over an empty-string transition without
using any input symbols.
A famous proof in formal language theory (Kleene's Theorem) shows that
FAs are equivalent to NFAs, which are equivalent to ε-NFAs. And all
these types of FAs are equivalent in language-generating power to regular
expressions. In other words,
    If R is a regular expression, and L is the language
    corresponding to R, then there is an FA that recognises L.
    Conversely, if M is an FA recognising a language L, there is
    a regular expression R corresponding to L.
It is quite easy to take a regular expression and convert it to an equivalent
NFA or ε-NFA, thanks to the simple rules of Thompson's construction:
Rule 1: There is an NFA that accepts any particular symbol of the alphabet:

Rule 2: There is an NFA that accepts only ε:

Rule 3: There is an ε-NFA that accepts r1|r2:

Rule 4: There is an ε-NFA that accepts r1r2:

Rule 5: There is an ε-NFA that accepts r1*:
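
To suggest how Rules 1 and 3 to 5 compose, the C sketch below builds ε-NFA
fragments, each with one start and one accept state. Every name in it (State,
Frag, symbol, concat, alternate, star) is an assumption made for this
illustration; it is not code from these notes.

#include <stdlib.h>

#define EPSILON 0            /* label used for an epsilon edge            */

typedef struct State State;
struct State {
    int    label1, label2;   /* labels on the two possible outgoing edges */
    State *out1, *out2;      /* an edge exists only if its pointer is set */
};

typedef struct { State *start, *accept; } Frag;   /* one NFA fragment     */

static State *new_state(void) { return calloc(1, sizeof(State)); }

/* Rule 1: a fragment accepting the single symbol c */
Frag symbol(int c) {
    Frag f = { new_state(), new_state() };
    f.start->label1 = c;  f.start->out1 = f.accept;
    return f;
}

/* Rule 4: concatenation r1 r2 - link r1's accept state to r2's start */
Frag concat(Frag a, Frag b) {
    a.accept->label1 = EPSILON;  a.accept->out1 = b.start;
    return (Frag){ a.start, b.accept };
}

/* Rule 3: alternation r1 | r2 - new start/accept joined by epsilon edges */
Frag alternate(Frag a, Frag b) {
    Frag f = { new_state(), new_state() };
    f.start->label1 = EPSILON;  f.start->out1 = a.start;
    f.start->label2 = EPSILON;  f.start->out2 = b.start;
    a.accept->label1 = EPSILON; a.accept->out1 = f.accept;
    b.accept->label1 = EPSILON; b.accept->out1 = f.accept;
    return f;
}

/* Rule 5: repetition r1* - loop back over the fragment, or skip it */
Frag star(Frag a) {
    Frag f = { new_state(), new_state() };
    f.start->label1 = EPSILON;  f.start->out1 = a.start;
    f.start->label2 = EPSILON;  f.start->out2 = f.accept;   /* zero times */
    a.accept->label1 = EPSILON; a.accept->out1 = a.start;   /* repeat     */
    a.accept->label2 = EPSILON; a.accept->out2 = f.accept;  /* finish     */
    return f;
}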

From NFA to FA via subset construction


Using Thompson's construction, we can build an NFA from a regular
expression; we can then employ subset construction to convert the NFA to a
DFA. Subset construction is an algorithm for constructing the
deterministic FA that recognises the same language as the original
nondeterministic FA. Each state in the new DFA is made up of a set of
states from the original NFA. The start state of the DFA will be the start
state of the NFA. The alphabet for both automata is the same.
So, given a state of the DFA, which corresponds to a set of states of the
original NFA, an input symbol x takes us from this state to the union of
original states that we can get to on that symbol x. We then have to analyse
this new state, with its definition in terms of the original states, for each
possible input symbol, building a new state in the DFA. The states of the
DFA are all subsets of S, the set of original states. There will be at most
2^n of these (because we might need to explore the entire power set), but
there are usually far fewer. The final states of the DFA are those sets that
contain a final state of the original NFA.
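
To make the algorithm concrete, the hedged C sketch below represents each DFA
state as a bitset of NFA states. It assumes at most 32 NFA states, assumes
that ε-transitions have already been folded into the move sets, and invents
all of its names (nfa_move, nfa_final, dfa_states, find_or_add and so on); it
is an illustration of the idea, not code from these notes.

#include <stdint.h>

#define MAX_NFA   32          /* sketch limit: one bit per NFA state       */
#define MAX_DFA   1024
#define NUM_SYMS  2           /* e.g. alphabet {a, b}                      */

/* nfa_move[s][x] = bitset of NFA states reachable from state s on symbol x
   (for brevity, epsilon transitions are assumed already folded in).       */
static uint32_t nfa_move[MAX_NFA][NUM_SYMS];
static uint32_t nfa_final;    /* bitset of the NFA's final states          */

static uint32_t dfa_states[MAX_DFA];          /* each DFA state = a set    */
static int      dfa_trans[MAX_DFA][NUM_SYMS];
static int      dfa_final[MAX_DFA];
static int      dfa_count;

/* Return the index of this set of NFA states, adding it if it is new. */
static int find_or_add(uint32_t set) {
    for (int i = 0; i < dfa_count; i++)
        if (dfa_states[i] == set) return i;
    dfa_states[dfa_count] = set;
    dfa_final[dfa_count]  = (set & nfa_final) != 0;
    return dfa_count++;
}

void subset_construction(uint32_t start_set) {
    dfa_count = 0;
    find_or_add(start_set);
    for (int i = 0; i < dfa_count; i++) {         /* i doubles as a worklist */
        for (int x = 0; x < NUM_SYMS; x++) {
            uint32_t next = 0;
            for (int s = 0; s < MAX_NFA; s++)     /* union of moves on x     */
                if (dfa_states[i] & (1u << s))
                    next |= nfa_move[s][x];
            dfa_trans[i][x] = find_or_add(next);  /* the empty set is a sink */
        }
    }
}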

Here is an example:

This is non-deterministic for several reasons. For example, there are two
transitions on b coming out of the final state, and no b transition
coming out of the start state. To create an equivalent deterministic FA, we
begin by creating a start state, and analysing where we go from the start
state in the NFA, on all the symbols of the alphabet. We create a set of
states where applicable (or a sink hole if there were no such transitions).
Notice if a final state is in the set of states, then that state in the DFA
becomes a final state.

We continue with this process analysing all the new states that we create.
We need to determine where we go in the NFA from each state, on all the
symbols of the alphabet.

And finally, filling in the transitions from the {X2, X3} state brings us full
circle. This is now a deterministic FA that accepts the same language as
the original NFA. We have 5 states instead of the original 4, a rather modest
increase in this case.


The process then goes like this: from a regular expression for a token, we
construct an NFA that recognises it using Thompson's algorithm.
NFAs are not useful as drivers for programs because non-determinism
implies choices and thus expensive exhaustive backtracking algorithms.
So, we use subset construction to convert that NFA to a DFA. Once we
have the DFA, we can use it as the basis for an efficient non-backtracking
scanner.
Lex, a Scanner Generator
The reason we have spent so much time looking at how to go from regular
expressions to finite automata is because this is exactly the process that lex
goes through in creating a scanner. Lex is a lexical analysis generator that
takes as input a series of regular expressions and builds a finite automaton
and a driver program for it in C through the mechanical steps shown
above. Theory in practice!
The first phase in a compiler reads the input source and converts strings in
the source to tokens. Using regular expressions, we can specify patterns to
lex that allow it to scan and match strings in the input. Each pattern in lex
has an associated action. Typically an action returns a token, representing
the matched string, for subsequent use by the parser. To begin with,
however, we will simply print the matched string rather than return a token
value. We may scan for identifiers using the regular expression
letter(letter|digit)*
This pattern matches a string of characters that begins with a single letter,
and is followed by zero or more letters or digits. This example nicely
illustrates operations allowed in regular expressions:
- repetition, expressed by the "*" operator
- alternation, expressed by the "|" operator
- concatenation

Any regular expression may be expressed as a finite state
automaton (FSA). We can represent an FSA using states, and transitions
between states. There is one start state, and one or more final or accepting
states.

In the above figure, state 0 is the start state, and state 2 is the accepting
state. As characters are read, we make a transition from one state to
another. When the first letter is read, we transition to state 1. We remain in
state 1 as more letters or digits are read. When we read a character other
than a letter or digit, we transition to state 2, the accepting state. Any FSA
may be expressed as a computer program. For example, our 3-state
machine is easily programmed:

start:  goto state0

state0: read c
        if c = letter goto state1
        goto state0

state1: read c
        if c = letter goto state1
        if c = digit goto state1
        goto state2

state2: accept string

This is the technique used by lex. Regular expressions are translated by lex
to a computer program that mimics an FSA. Using the next input
character, and current state, the next state is easily determined by indexing
into a computer-generated state table.
Now we can easily understand some of lex's limitations. For example, lex
cannot be used to recognize nested structures such as parentheses. Nested
structures are handled by incorporating a stack. Whenever we encounter a
"(", we push it on the stack. When a ")" is encountered, we match it with
the top of the stack, and pop the stack. Lex, however, only has states and
transitions between states. Since it has no stack, it is not well suited for
parsing nested structures. Yacc augments an FSA with a stack, and can
process constructs such as parentheses with ease. The important thing is to
use the right tool for the job. Lex is good at pattern matching. Yacc is
appropriate for more challenging tasks. We shall consider yacc in the
future.


Regular expressions in lex are composed of metacharacters (Table 2.1).
Pattern matching examples are shown in Table 2.2. Within a character
class, normal operators lose their meaning. Two operators allowed in a
character class are the hyphen (-) and circumflex (^). When used
between two characters, the hyphen represents a range of characters. The
circumflex, when used as the first character, negates the expression. If two
patterns match the same string, the longest match wins. In case both
matches are the same length, then the first pattern listed is used.

Table 2.1 Pattern Matching Primitives

Table 2.2 Pattern Matching Examples


Input to Lex is divided into three sections, with %% dividing the sections.
This is best illustrated by example. The first example is the shortest
possible lex file:
%%

Input is copied to output, one character at a time. The first %% is always
required, as there must always be a rules section. However, if we don't
specify any rules, then the default action is to match everything and copy it
to output. Defaults for input and output are stdin and stdout,
respectively. Here is the same example, with defaults explicitly coded:
%%
    /* match everything except newline */
.   ECHO;
    /* match newline */
\n  ECHO;
%%
int yywrap(void) {
    return 1;
}
int main(void) {
    yylex();
    return 0;
}

Two patterns have been specified in the rules section. Each pattern must
begin in column one. This is followed by whitespace (space, tab or
newline), and an optional action associated with the pattern. The action
may be a single C statement, or multiple C statements enclosed in braces.
Anything not starting in column one is copied verbatim to the generated C
file. We may take advantage of this behavior to specify comments in our
lex file. In this example there are two patterns, "." and "\n", with an
ECHO action associated with each pattern. Several macros and variables are
predefined by lex. ECHO is a macro that writes the text matched by the
pattern. This is the default action for any unmatched strings. Typically,
ECHO is defined as:
#define ECHO fwrite(yytext, yyleng, 1, yyout)

Variable yytext is a pointer to the matched string (NULL-terminated),
and yyleng is the length of the matched string. Variable yyout is the
output file, and defaults to stdout. Function yywrap is called by lex
when input is exhausted. Return 1 if you are done, or 0 if more processing
is required. Every C program requires a main function. In this case, we
simply call yylex, the main entry-point for lex. Some implementations of
lex include copies of main and yywrap in a library, eliminating the need
to code them explicitly. This is why our first example, the shortest lex
program, functioned properly.


Table 2.3 Lex Predefined Variables


Here's a program that does nothing at all. All input is matched, but no
action is associated with any pattern, so there will be no output.
%%
.
\n

The following example prepends line numbers to each line in a file. Some
implementations of lex predefine and calculate yylineno. The input file
for lex is yyin, and defaults to stdin.
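
A plausible lex file for this (a sketch, not necessarily the exact code these
notes had in mind) is shown below; to stay portable it keeps its own counter,
lineno, rather than relying on a predefined yylineno.

%{
    #include <stdio.h>
    int lineno = 0;
%}
%%
^(.*)\n    printf("%4d\t%s", ++lineno, yytext);
%%
int yywrap(void) { return 1; }
int main(int argc, char *argv[]) {
    if (argc > 1) yyin = fopen(argv[1], "r");
    yylex();
    return 0;
}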

The definitions section is composed of substitutions, code, and start states.
Code in the definitions section is simply copied as-is to the top of the
generated C file, and must be bracketed with %{ and %} markers.
Substitutions simplify pattern-matching rules. For example, we may define
digits and letters:
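
A sketch of such a definitions section, together with a rule that uses the
substitutions, is shown below; the count variable and the user code are
assumptions added only to make the sketch complete.

digit    [0-9]
letter   [A-Za-z]
%{
    #include <stdio.h>
    int count;
%}
%%
    /* match an identifier and count it */
{letter}({letter}|{digit})*    count++;
%%
int yywrap(void) { return 1; }
int main(void) {
    yylex();
    printf("number of identifiers = %d\n", count);
    return 0;
}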


Whitespace must separate the defining term and the associated expression.
References to substitutions in the rules section are surrounded by braces
({letter}) to distinguish them from literals. When we have a match in
the rules section, the associated C code is executed. Here is a scanner that
counts the number of characters, words, and lines in a file:
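
A plausible version of such a scanner (a sketch, not necessarily the exact
code intended here) is:

%{
    #include <stdio.h>
    int nchar, nword, nline;
%}
%%
\n          { nline++; nchar++; }
[^ \t\n]+   { nword++; nchar += yyleng; }
.           { nchar++; }
%%
int yywrap(void) { return 1; }
int main(void) {
    yylex();
    printf("%d\t%d\t%d\n", nchar, nword, nline);
    return 0;
}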


CHAPTER THREE
Formal Grammars
What is a grammar?
A grammar is a powerful tool for describing and analysing languages. It is
a set of rules by which valid sentences in a language are constructed.
Here's a trivial example of English grammar:
sentence    -> <subject> <verb-phrase> <object>
subject     -> This | Computers | I
verb-phrase -> <adverb> <verb> | <verb>
adverb      -> never
verb        -> is | run | am | tell
object      -> the <noun> | a <noun> | <noun>
noun        -> university | world | cheese | lies

Using the above rules or productions, we can derive simple sentences such
as these:
This is a university.
Computers run the world.
I am the cheese.
I never tell lies.

Here is a leftmost derivation of the first sentence using these productions.

sentence -> <subject> <verb-phrase> <object>
         -> This <verb-phrase> <object>
         -> This <verb> <object>
         -> This is <object>
         -> This is a <noun>
         -> This is a university

In addition to several reasonable sentences, we can also derive nonsense
like "Computers run cheese" and "This am a lies". These sentences don't
make semantic sense, but they are syntactically correct because they follow
the sequence of subject, verb-phrase, and object. Grammars are a tool for
syntax, not semantics. We worry about semantics at a later point in the
compiling process. In the syntax analysis phase, we verify the correctness
of the structure only.
Vocabulary
We need to review some definitions before we can proceed:
grammar
    a set of rules by which valid sentences in a language are
    constructed.
nonterminal
    a grammar symbol that can be replaced/expanded to a sequence of
    symbols.
terminal
    an actual word in a language; these are the symbols in a grammar
    that cannot be replaced by anything else. "Terminal" is supposed to
    conjure up the idea that it is a dead-end: no further expansion is
    possible from here.
production
    a grammar rule that describes how to replace/exchange symbols. The
    general form of a production for a nonterminal is:
        X -> Y1Y2Y3...Yn
    The nonterminal X is declared equivalent to the concatenation of the
    symbols Y1Y2Y3...Yn. The production means that anywhere we encounter
    X, we may replace it by the string Y1Y2Y3...Yn. Eventually we will
    have a string containing nothing that can be expanded further, i.e.,
    it will consist of only terminals. Such a string is called a
    sentence. In the context of programming languages, a sentence is a
    syntactically correct and complete program.
derivation
    a sequence of applications of the rules of a grammar that produces a
    finished string of terminals. A leftmost derivation is where we
    always substitute for the leftmost nonterminal as we apply the rules
    (we can similarly define a rightmost derivation). A derivation is
    also called a parse.
start symbol
    a grammar has a single nonterminal (the start symbol) from which all
    sentences derive:
        S -> X1X2X3...Xn
    All sentences are derived from S by successive replacement using the
    productions of the grammar.
null symbol
    it is sometimes useful to specify that a symbol can be replaced by
    nothing at all. To indicate this, we use the null symbol ε, e.g.,
    A -> B | ε.
BNF
    a way of specifying programming languages using formal grammars and
    production rules with a particular form of notation (Backus-Naur
    form).

A few grammar exercises to try on your own (the alphabet in each case is
{a, b}):
- Define a grammar for the language of strings with one or more a's
  followed by zero or more b's.
- Define a grammar for even-length palindromes.
- Define a grammar for strings where the number of a's is equal to the
  number of b's.
- Define a grammar where the number of a's is not equal to the number
  of b's. (Hint: think about it as two separate cases...)
- Can you write regular expressions for these languages? Why or why
  not?

Parse Representation
In working with grammars, we can represent the application of the rules to
derive a sentence in two ways. The first is a derivation, as shown earlier for
"This is a university", where the rules are applied step-by-step and we
substitute for one nonterminal at a time. Think of a derivation as a history
of how the sentence was parsed because it not only includes which
productions were applied, but also the order they were applied (i.e. which
nonterminal was chosen for expansion at each step). There can be many
different derivations for the same sentence (the leftmost, the rightmost,
and so on).
A parse tree is the second method for representation. It diagrams how each
symbol derives from other symbols in a hierarchical manner. Here is a
parse tree for "This is a university":

Although the parse tree includes all of the productions that were applied, it
does not encode the order they were applied. For an unambiguous
grammar, there is exactly one parse tree for a particular sentence.
More Formal Definitions
Here are some other definitions we will need, described in reference to this
example grammar:
S -> AB
A -> Ax | y
B -> z

alphabet
The alphabet is {S, A, B, x, y, z}. It is divided into two disjoint sets.
The terminal alphabet consists of terminals, which appear in the
sentences of the language: {x, y, z}. The remaining symbols are the
nonterminal alphabet; these are the symbols that appear on the left
side of productions and can be replaced during the course of a
derivation: {S, A, B}. Formally, we use V for the alphabet, T for the
terminal alphabet and N for the nonterminal alphabet, giving us:
V = T ∪ N, and T ∩ N = ∅.
The convention used in our lecture notes is a sans-serif font for
grammar elements, lowercase for terminals, uppercase for
nonterminals, and underlined lowercase (e.g., u, v) to denote
arbitrary strings of terminal and nonterminal symbols (possibly
null). In some textbooks, Greek symbols are used for arbitrary
strings of terminal and nonterminal symbols (e.g., α, β).
context-free grammar
To define a language, we need a set of productions, of the general
form: u -> v. In a context-free grammar, u is a single nonterminal
and v is an arbitrary string of terminal and nonterminal symbols.
When parsing, we can replace u by v wherever it occurs. We shall
refer to this set of productions symbolically as P.
formal grammar
We formally define a grammar as a 4-tuple {S, P, N, T}. S is the
start symbol (with S ∈ N), P is the set of productions, and N and T
are the nonterminal and terminal alphabets. A sentence is a string
of symbols in T derived from S using one or more applications of
productions in P. A string of symbols derived from S but possibly
including nonterminals is called a sentential form or a working
string.
A production u -> v is used to replace an occurrence of u by v.
Formally, if we apply a production p ∈ P to a string of symbols w
in V to yield a new string of symbols z in V, we say that z derives
from w using p, written as follows: w =>p z. We also use:
    w => z    z derives from w (production unspecified)
    w =>* z   z derives from w using zero or more productions
    w =>+ z   z derives from w using one or more productions

equivalence
The language L(G) defined by grammar G is the set of sentences
derivable using G. Two grammars G and G' are said to be
equivalent if the languages they generate L(G) and L(G') are the
same.
A Hierarchy of Grammars
We owe a lot of our understanding of grammars to the work of the
American linguist Noam Chomsky (yes, the Noam Chomsky known for
his politics). There are four categories of formal grammars in the Chomsky
Hierarchy; they span from Type 0, the most general, to Type 3, the most
restrictive. More restrictions on the grammar make it easier to describe and
efficiently parse, but reduce the expressive power.


Type 0: free or unrestricted grammars
These are the most general. Productions are of the form u -> v,
where both u and v are arbitrary strings of symbols in V, with u
non-null. There are no restrictions on what appears on the left or
right-hand side other than the left-hand side must be non-empty.
Type 1: context-sensitive grammars
Productions are of the form uXw -> uvw, where u, v and w are
arbitrary strings of symbols in V, with v non-null, and X a single
nonterminal. In other words, X may be replaced by v, but only when
it is surrounded by u and w (i.e. in a particular context).
Type 2: context-free grammars
Productions are of the form X -> v, where v is an arbitrary string of
symbols in V, and X is a single nonterminal. Wherever you find X,
you can replace it with v (regardless of context).
Type 3: regular grammars
Productions are of the form X -> a or X -> aY, where X and Y are
nonterminals and a is a terminal. That is, the left-hand side must be
a single nonterminal and the right-hand side can be either a single
terminal by itself or a terminal followed by a single nonterminal.
These grammars are the most limited in terms of expressive power.
Every Type 3 grammar is a Type 2 grammar, and every Type 2 is a Type 1
and so on. Type 3 grammars are particularly easy to parse because of the
lack of recursive constructs. Efficient parsers exist for many classes of
Type 2 grammars. Although Type 1 and Type 0 grammars are more
powerful than Type 2 and 3, they are far less useful since we cannot create
efficient parsers for them. In designing programming languages using
formal grammars, we will use Type 2 or context-free grammars, often just
abbreviated as CFG.
Issues in Parsing Context-Free Grammars
There are several efficient approaches to parsing most Type 2 grammars
and we will talk through them over the next few lectures. However, there
are some issues that can interfere with parsing that we must take into
consideration when designing the grammar. Let's take a look at three of
them: ambiguity, recursive rules, and left-factoring.
Ambiguity
If a grammar permits more than one parse tree for some sentences, it is
said to be ambiguous. For example, consider the following classic
arithmetic expression grammar:
E  -> E op E | ( E ) | int
op -> + | - | * | /


This grammar denotes expressions that consist of integers joined by binary
operators and possibly including parentheses. As defined above, this
grammar is ambiguous because for certain sentences we can construct
more than one parse tree. For example, consider the expression 10 - 2 * 5.
We parse by first applying the production E -> E op E. The parse tree on
the left chooses to expand that first op to *, the one on the right to -. We
have two completely different parse trees. Which one is correct?

Both trees are legal in the grammar as stated and thus either interpretation
is valid. Although natural languages can tolerate some kinds of ambiguity
(puns, plays on words, etc.), it is not acceptable in computer languages.
We don't want the compiler just haphazardly deciding which way to
interpret our expressions! Given our expectations from algebra concerning
precedence, only one of the trees seems right. The right-hand tree fits our
expectation that * binds tighter, so that its result is computed first and
then integrated into the outer expression, which has a lower-precedence
operator.
It's fairly easy for a grammar to become ambiguous if you are not careful
in its construction. Unfortunately, there is no magical technique that can be
used to resolve all varieties of ambiguity. It is an undecidable problem to
determine whether any grammar is ambiguous, much less attempt to
mechanically remove all ambiguity. However, that doesn't mean in
practice that we cannot detect ambiguity or can't do something about it.
For programming language grammars, we usually take pains to construct
an unambiguous grammar or introduce additional disambiguating rules to
throw away the undesirable parse trees, leaving only one for each
sentence.
Using the above ambiguous expression grammar, one technique would
leave the grammar as is, but add disambiguating rules into the parser
implementation. We could code into the parser knowledge of precedence
and associativity to break the tie and force the parser to build the tree on
the right rather than the left. The advantage of this is that the grammar
remains simple and less complicated. But as a downside, the syntactic
structure of the language is no longer given by the grammar alone.
Another approach is to change the grammar to only allow the one tree that
correctly reflects our intention and eliminate the others. For the expression


grammar, we can separate expressions into multiplicative and additive
subgroups and force them to be expanded in the desired order.
E    -> E t_op E | T
t_op -> + | -
T    -> T f_op T | F
f_op -> * | /
F    -> ( E ) | int

Terms are addition/subtraction expressions and factors are used for
multiplication and division. Since the base case for an expression is a term,
addition and subtraction will appear higher in the parse tree, and thus
receive lower precedence.
After verifying that the above re-written grammar has only one parse tree
for the earlier ambiguous expression, you might think we were home free,
but now consider the expression 10 - 2 - 5. The recursion on both sides of
the binary operator allows either side to match repetitions. The arithmetic
operators usually associate to the left, so replacing the right-hand side
with the base case will force the repetitive matches onto the left side. The
final result is:
E    -> E t_op T | T
t_op -> + | -
T    -> T f_op F | F
f_op -> * | /
F    -> ( E ) | int

Whew! The obvious disadvantage of changing the grammar to remove
ambiguity is that it may complicate and obscure the original grammar
definitions. There is no mechanical means to change any ambiguous
grammar into an unambiguous one; it is known to be an undecidable
problem (in fact, even determining that a CFG is ambiguous is
undecidable). However, most programming languages have only limited
issues with ambiguity that can be resolved using ad hoc techniques.
Recursive Productions
Productions are often defined in terms of themselves. For example a list of
variables in a programming language grammar could be specified by this
production:
variable_list -> variable | variable_list , variable

Such productions are said to be recursive. If the recursive nonterminal is at
the left of the right-side of the production, e.g. A -> u | Av, we call the
production left-recursive. Similarly, we can define a right-recursive
production: A -> u | vA. Some parsing techniques have trouble with one or
the other variants of recursive productions and so sometimes we have to
massage the grammar into a different but equivalent form. Left-recursive
productions can be especially troublesome in top-down parsers (we'll
see why a bit later). Handily, there is a simple technique for rewriting the
grammar to move the recursion to the other side. For example, consider
this left-recursive rule:
X -> Xa | Xb | AB | C | DEF

To convert the rule, we introduce a new nonterminal X' that we append to
the end of all non-left-recursive productions for X. The expansion for the
new nonterminal is basically the reverse of the original left-recursive rule.
The re-written productions are:
X  -> ABX' | CX' | DEFX'
X' -> aX' | bX' | ε

It appears we just exchanged the left-recursive rules for an equivalent
right-recursive version. This might seem pointless, but some parsing
algorithms prefer or even require only left or right recursion.
Left-Factoring
The parser usually reads tokens from left to right and it is convenient if
upon reading a token it can make an immediate decision about which
production from the grammar to expand. However, this can be trouble if
there are productions that have common first symbol(s) on the right side of
the productions. Here is an example we often see in programming
language grammars:
Stmt -> if Cond then Stmt else Stmt | if Cond then Stmt | Other | ...

The common prefix is if Cond then Stmt. This causes problems because
when a parser encounters an if, it does not know which production to use.
A useful technique called left-factoring allows us to restructure the
grammar to avoid this situation. We rewrite the productions to defer the
decision about which of the options to choose until we have seen enough
of the input to make the appropriate choice. We factor out the common
part of the two options into a shared rule that both will use and then add a
new rule that picks up where the tokens diverge.
Stmt    -> if Cond then Stmt OptElse | Other | ...
OptElse -> else Stmt | ε

In the re-written grammar, upon reading an if we expand the first
production and wait until if Cond then Stmt has been seen to decide whether
to expand OptElse to else Stmt or ε.


Hidden Left-Factors and Hidden Left Recursion
A grammar may not appear to have left recursion or left factors, yet still
have issues that will interfere with parsing. This may be because the issues
are hidden and need to be first exposed via substitution.
For example, consider this grammar:
A -> da | acB
B -> abB | daA | Af

A cursory examination of the grammar may not detect that the first and
second productions of B overlap with the third. We substitute the
expansions for A into the third production to expose this:
A -> da | acB
B -> abB | daA | daf | acBf

This exchanges the original third production of B for several new
productions, one for each of the productions for A. These directly show
the overlap, and we can then left-factor:
A -> da | acB
B -> aM | daN
M -> bB | cBf
N -> A | f

Similarly, the following grammar does not appear to have any left-recursion:
S -> Tu | wx
T -> Sq | vvS

Yet after substitution of S into T, the left-recursion comes to light:
S -> Tu | wx
T -> Tuq | wxq | vvS

If we then eliminate left-recursion, we get:
S  -> Tu | wx
T  -> wxqT' | vvST'
T' -> uqT' | ε


CHAPTER FOUR
Top-Down Parsing
Approaches to Parsing
The syntax analysis phase of a compiler verifies that the sequence of
tokens extracted by the scanner represents a valid sentence in the grammar
of the programming language. There are two major parsing approaches:
top-down and bottom-up. In top-down parsing, you start with the start
symbol and apply the productions until you arrive at the desired string. In
bottom-up parsing, you start with the string and reduce it to the start
symbol by applying the productions backwards. As an example, let's trace
through the two approaches on this simple grammar that recognises strings
consisting of any number of a's followed by at least one (and possibly
more) b's:
S -> AB
A -> aA | ε
B -> b | bB

Here is a top-down parse of aaab. We begin with the start symbol and at
each step, expand one of the remaining nonterminals by replacing it with
the right side of one of its productions. We repeat until only terminals
remain. The top-down parse produces a leftmost derivation of the
sentence.
S
AB      S -> AB
aAB     A -> aA
aaAB    A -> aA
aaaAB   A -> aA
aaaB    A -> ε
aaab    B -> b

A bottom-up parse works in reverse. We begin with the sentence of
terminals and each step applies a production in reverse, replacing a
substring that matches the right side with the nonterminal on the left. We
continue until we have substituted our way back to the start symbol. If you
read from bottom to top, the bottom-up parse prints out a rightmost
derivation of the sentence.
aaab
aaaεb   (insert ε)
aaaAb   A -> ε
aaAb    A -> aA
aAb     A -> aA
Ab      A -> aA
AB      B -> b
S       S -> AB


In creating a parser for a compiler, we normally have to place some
restrictions on how we process the input. In the above example, it was
easy for us to see which productions were appropriate because we could
see the entire string aaab. In a compiler's parser, however, we do not have
such long-distance vision. We are usually limited to just one symbol of
lookahead. The lookahead symbol is the next symbol coming up in the
input. This restriction certainly makes the parsing more challenging. Using
the same grammar from above, if the parser sees only a single b in the
input and it cannot look ahead any further than the symbol we are on, it
can't know whether to use the production B -> b or B -> bB.
Backtracking
One solution to parsing would be to implement backtracking. Based on the
information the parser currently has about the input, a decision is made to
go with one particular production. If this choice leads to a dead end, the
parser would have to backtrack to that decision point, moving backwards
through the input, and start again making a different choice and so on until
it either found the production that was the appropriate one or ran out of
choices. For example, consider this simple grammar:
S -> bab | bA
A -> d | cA

Let's follow parsing the input bcd. In the trace below, the column on the
left will be the expansion thus far, the middle is the remaining input, and
the right is the action attempted at each step:

S      bcd     Try S -> bab
bab    bcd     match b
ab     cd      dead-end, backtrack
S      bcd     Try S -> bA
bA     bcd     match b
A      cd      Try A -> d
d      cd      dead-end, backtrack
A      cd      Try A -> cA
cA     cd      match c
A      d       Try A -> d
d      d       match d
               Success!

As you can see, each time we hit a dead-end, we back up to the last
decision point, unmake that decision and try another alternative. If all
alternatives have been exhausted, we back up to the preceding decision
point and so on. This continues until we either find a working parse or
have exhaustively tried all combinations without success.
A number of authors have described backtracking parsers; the appeal is
that they can be used for a variety of grammars without requiring them to
fit any specific form. For a small grammar such as above, a backtracking


approach may be tractable, but most programming language grammars
have dozens of nonterminals, each with several options, and the resulting
combinatorial explosion makes this approach very slow and impractical.
We will instead look at ways to parse via efficient methods that have
restrictions about the form of the grammar, but usually those requirements
are not so onerous that we cannot rearrange a programming language
grammar to meet them.
Top-Down Predictive Parsing
First, we will focus in on top-down parsing. We will look at two different
ways to implement a non-backtracking top-down parser called a predictive
parser. A predictive parser is characterised by its ability to choose the
production to apply solely on the basis of the next input symbol and the
current nonterminal being processed. To enable this, the grammar must
take a particular form. We call such a grammar LL(1). The first L means
we scan the input from left to right; the second L means we create a
leftmost derivation; and the 1 means one input symbol of lookahead.
Informally, an LL(1) grammar has no left-recursive productions and has been
left-factored. Note that these are necessary conditions for LL(1) but not
sufficient, i.e., there exist grammars with no left recursion or common
prefixes that are not LL(1). Note also that there exist many grammars that
cannot be modified to become LL(1). In such cases, another parsing
technique must be employed, or special rules must be embedded into the
predictive parser.
Recursive-Descent
The first technique for implementing a predictive parser is called
recursive-descent. A recursive descent parser consists of several small
functions, one for each nonterminal in the grammar. As we parse a
sentence, we call the functions that correspond to the left side nonterminal
of the productions we are applying. If these productions are recursive, we
end up calling the functions recursively.
Let's start by examining some productions from a grammar for a simple
Pascal-like programming language. In this programming language, all
functions are preceded by the reserved word FUNC:
program       > function_list
function_list > function_list function | function
function      > FUNC identifier ( parameter_list ) statements

What might the C function that is responsible for parsing a function
definition look like? It expects to first find the token FUNC, then it expects
an identifier (the name of the function), followed by an opening
parenthesis, and so on. As it pulls each token from the scanner, it must
ensure that the token matches what is expected, and if not, it halts with an
error. For each nonterminal, this function calls the associated function to
handle its part of the parsing.

To make things a little cleaner, let's introduce a utility function that can be
used to verify that the next token is what is expected, and that will report an
error and exit otherwise. We will need this again and again in writing the parsing
routines.
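A minimal sketch of such a utility in C follows. The integer token codes, the global lookahead variable, and the use of the scanner routine yylex() to fetch tokens are assumptions of this sketch rather than details given here.

#include <stdio.h>
#include <stdlib.h>

extern int yylex(void);   /* the scanner returns the next token code */

static int lookahead;     /* the single token of lookahead */

/* Verify that the next token is the one expected; consume it and fetch
   the following token on success, report an error and exit otherwise. */
static void MatchToken(int expected)
{
    if (lookahead != expected) {
        printf("syntax error: unexpected token %d\n", lookahead);
        exit(1);
    }
    lookahead = yylex();
}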

Now we can tidy up the ParseFunction() routine and make it clearer what it does:
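One way the tidied routine might look, using MatchToken() from above; the token names (T_FUNC, T_IDENTIFIER, T_LPAREN, T_RPAREN) and the other parsing routines are illustrative assumptions.

/* function > FUNC identifier ( parameter_list ) statements */
void ParseFunction(void)
{
    MatchToken(T_FUNC);        /* the reserved word FUNC */
    MatchToken(T_IDENTIFIER);  /* the function name */
    MatchToken(T_LPAREN);      /* ( */
    ParseParameterList();      /* nonterminal: call its routine */
    MatchToken(T_RPAREN);      /* ) */
    ParseStatements();         /* nonterminal: call its routine */
}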

The following diagram illustrates how the parse tree is built:

Here is the production for an if-statement in this language:


if_statement > IF expression THEN statement ENDIF |
IF expression THEN statement ELSE statement ENDIF

To prepare this grammar for recursive-descent, we must left-factor to share the common parts:
if_statement > IF expression THEN statement close_if
close_if     > ENDIF | ELSE statement ENDIF

Now, let's look at the recursive-descent functions to parse an if statement:
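A sketch of these routines in the same style as before (the token names are again illustrative); ParseCloseIf() decides between its two alternatives by examining the lookahead, as discussed next.

/* if_statement > IF expression THEN statement close_if */
void ParseIfStatement(void)
{
    MatchToken(T_IF);
    ParseExpression();
    MatchToken(T_THEN);
    ParseStatement();
    ParseCloseIf();
}

/* close_if > ENDIF | ELSE statement ENDIF */
void ParseCloseIf(void)
{
    if (lookahead == T_ENDIF) {
        MatchToken(T_ENDIF);
    } else {
        MatchToken(T_ELSE);    /* reports an error if the token is neither */
        ParseStatement();
        MatchToken(T_ENDIF);
    }
}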

When parsing the closing portion of the if, we have to decide which of the
two right-hand side options to expand. In this case, it isn't too difficult.
We try to match the first token against ENDIF and, on a non-match, we try to
match the ELSE clause; if that doesn't match either, an error is reported.
Navigating through two choices seemed simple enough; however, what
happens when we have many alternatives on the right side?
statement > assg_statement | return_statement | print_statement | null_statement
| if_statement | while_statement | block_of_statements

When implementing the ParseStatement function, how are we going to
be able to determine which of the seven options to match for any given
input? Remember, we are trying to do this without backtracking and with just
one token of lookahead, so we have to be able to make an immediate decision
with minimal information; this can be a challenge!
To understand how to recognise and solve this problem, we need a definition:
The first set of a sequence of symbols u, written as First(u), is the
set of terminals which start the sequences of symbols derivable
from u. A bit more formally, consider all strings derivable from u.

If u =>* v, where v begins with some terminal, that terminal is in
First(u). If u =>* ε, then ε is in First(u).
Informally, the first set of a sequence is a list of all the possible terminals
that could start a string derived from that sequence. We will work an
example of calculating the first sets a bit later. For now, just keep in mind
the intuitive meaning. Finding our lookahead token in one of the first sets
of the possible expansions tells us which path to follow.
Given a production with a number of alternatives, A > u1 | u2 | ..., we can
write a recursive-descent routine only if all the sets First(ui) are disjoint.
The general form of such a routine would be:
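A sketch of that general form, instantiated for the statement nonterminal discussed above; which token begins each kind of statement is an assumption here, and the essential requirement is that the case labels (the first sets of the alternatives) do not overlap.

void ParseStatement(void)
{
    switch (lookahead) {
    case T_IDENTIFIER: ParseAssgStatement();     break;  /* First(assg_statement) */
    case T_RETURN:     ParseReturnStatement();   break;
    case T_PRINT:      ParsePrintStatement();    break;
    case T_SEMICOLON:  ParseNullStatement();     break;
    case T_IF:         ParseIfStatement();       break;
    case T_WHILE:      ParseWhileStatement();    break;
    case T_BEGIN:      ParseBlockOfStatements(); break;
    default:
        printf("syntax error: no statement can begin with this token\n");
        exit(1);
    }
}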

If the first sets of the various productions for a nonterminal are not
disjoint, a predictive parser doesn't know which choice to make. We would
either need to re-write the grammar or use a different parsing technique for
this nonterminal. For programming languages, it is usually possible to restructure
the productions or embed certain rules into the parser to resolve
conflicts, but this constraint is one of the weaknesses of the top-down non-backtracking approach.
It is a bit trickier if the nonterminal we are trying to recognise is nullable.
A nonterminal A is nullable if there is a derivation of A that results in ε (i.e.
that nonterminal would completely disappear in the parsed string), i.e.,
ε is in First(A). In this case A could be replaced by nothing, and the next token
would be the first token of the symbol following A in the sentence being
parsed. Thus if A is nullable, our predictive parser also needs to consider
the possibility that the path to choose is the one corresponding to A =>* ε.
To deal with this we define the following:
The follow set of a nonterminal A is the set of terminal
symbols that can appear immediately to the right of A in a
valid sentence. A bit more formally, for every valid sentence
S =>* uAv, where v begins with some terminal, that terminal
is in Follow(A).

Informally, you can think about the follow set like this: A can appear in
various places within a valid sentence. The follow set describes what
terminals could have followed the sentential form that was expanded from
A. We will detail how to calculate the follow set a bit later. For now,
realise that follow sets are useful because they define the right context
consistent with a given nonterminal and provide the lookahead that might
signal that a nullable nonterminal should be expanded to ε.
With these two definitions, we can now generalise how to handle A > u1 |
u2 | ... in a recursive-descent parser. In all situations, we need a case to
handle each member in First(ui). In addition, if there is a derivation from
any ui that could yield ε (i.e. if it is nullable), then we also need to handle
the members in Follow(A).

What about left-recursive productions? Now we see why these are such a
problem in a predictive parser. Consider this left-recursive production that
matches a list of one or more functions.
function_list > function_list function | function
function      > FUNC identifier ( parameter_list ) statement

Such a production will send a recursive-descent parser into an infinite
loop! We need to remove the left-recursion in order to be able to write the
parsing function for a function_list.
function_list > function_list function | function

becomes

function_list > function function_list | function

then we must left-factor the common parts


function_list > function more_functions
more_functions > function more_functions | ε

And now the parsing function looks like this:
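A sketch, continuing the conventions of the earlier routines; the ε alternative of more_functions is chosen silently when the lookahead is not FUNC, i.e. when it is a token that can only belong to Follow(more_functions).

/* function_list > function more_functions */
void ParseFunctionList(void)
{
    ParseFunction();
    ParseMoreFunctions();
}

/* more_functions > function more_functions | ε */
void ParseMoreFunctions(void)
{
    if (lookahead == T_FUNC) {   /* First(function) = { FUNC } */
        ParseFunction();
        ParseMoreFunctions();
    }
    /* otherwise take the ε production: consume nothing and return */
}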

Computing First and Follow


For a grammar to be suitable for LL(1) parsing, you must first remove
ambiguity, left-factor, and eliminate left recursion from all productions.
Now, follow these steps to compute first and follow:
Calculating first sets. To calculate First(u), where u has the form
X1X2...Xn, do the following:
1. If X1 is a terminal, then add X1 to First(u); otherwise add First(X1) - ε to First(u).
2. If X1 is a nullable nonterminal, i.e., X1 =>* ε, add First(X2) - ε to First(u). Furthermore, if X2 can also go to ε, then add First(X3) - ε, and so on, through all Xn until the first non-nullable symbol is encountered.
3. If X1X2...Xn =>* ε, add ε to the first set.

Calculating follow sets. For each nonterminal in the grammar, do the following:
1. Place EOF in Follow(S) where S is the start symbol and EOF is the input's right endmarker. The endmarker might be end of file, newline, or a special symbol, whatever is the expected end-of-input indication for this grammar. We will typically use $ as the endmarker.
2. For every production A > uBv, where u and v are any strings of grammar symbols and B is a nonterminal, everything in First(v) except ε is placed in Follow(B).
3. For every production A > uB, or a production A > uBv where First(v) contains ε (i.e. v is nullable), everything in Follow(A) is added to Follow(B).

Here is a complete example of first and follow set computation, starting with this grammar:
S > AB
A > Ca | ε
B > BaAC | c
C > b | ε

Notice we have a left-recursive production that must be fixed if we are to use LL(1) parsing:

B > BaAC | c

becomes

B > cB'
B' > aACB' | ε

The new grammar is:


S > AB
A > Ca | ε
B > cB'
B' > aACB' | ε
C > b | ε

It helps to first compute the nullable set (i.e. those nonterminals X such that X
=>* ε), since you need to refer to the nullable status of various
nonterminals when computing the first and follow sets:
Nullable(G) = {A B' C}

The first sets for each nonterminal are:

First(C) = {b ε}
First(B') = {a ε}
First(B) = {c}
First(A) = {b a ε}
Start with First(C) - ε, add a (since C is nullable) and ε (since A itself is nullable).
First(S) = {b a c}
Start with First(A) - ε, add First(B) (since A is nullable). We don't add ε (since S itself is not nullable: A can go away, but B cannot).
It is usually convenient to compute the first sets for the nonterminals that
appear toward the bottom of the parse tree and work your way upward,
since the nonterminals toward the top may need to incorporate the first sets
of the nonterminals that appear beneath them in the tree.
To compute the follow sets, take each nonterminal and go through all the
productions in which that nonterminal appears on the right side, matching to the steps
given earlier:

Follow(S) = {$}
S doesn't appear in the right hand side of any productions.
We put $ in the follow set because S is the start symbol.
Follow(B) = {$}
B appears on the right hand side of the S > AB production.
Its follow set is the same as S.
Follow(B') = {$}
B' appears on the right hand side of two productions. The B'
> aACB' production tells us its follow set includes the
follow set of B', which is tautological. From B > cB', we
learn its follow set is the same as B.
Follow(C) = {a $}
C appears in the right hand side of two productions. The
production A > Ca tells us a is in the follow set. From B' >
aACB' , we add the First(B') which is just a again. Because B'
is nullable, we must also add Follow(B') which is $.
Follow(A) = {c b a $}
A appears in the right hand side of two productions. From S
> AB we add First(B) which is just c. B is not nullable. From
B' > aACB', we add First(C) which is b. Since C is nullable,
we also include First(B') which is a. B' is also nullable, so
we include Follow(B') which adds $.

It can be convenient to compute the follow sets for the nonterminals that
appear toward the top of the parse tree and work your way down, but
sometimes you have to circle around, computing the follow sets of other
nonterminals, in order to complete the one you're on.
The calculation of the first and follow sets follows mechanical algorithms,
but it is very easy to get tripped up in the details and make mistakes even
when you know the rules. Be careful!
Table-Driven LL(1) Parsing
In a recursive-descent parser, the production information is embedded in
the individual parse functions for each nonterminal and the run-time
execution stack is keeping track of our progress through the parse. There is
another method for implementing a predictive parser that uses a table to
store the production information, along with an explicit stack to keep track of where
we are in the parse.
This grammar for add/multiply expressions is already set up to handle
precedence and associativity:
E > E + T | T
T > T * F | F
F > (E) | int

After removal of left recursion, we get:

E > TE'
E' > + TE' | ε
T > FT'
T' > * FT' | ε
F > (E) | int

One way to illustrate the process is to study some transition graphs that
represent the grammar:

A predictive parser behaves as follows. Let's assume the input string is 3 +
4 * 5. Parsing begins in the start state of the symbol E and moves to the
next state. This transition is marked with a T, which sends us to the start
state for T. This in turn, sends us to the start state for F. F has only
terminals, so we read a token from the input string. It must either be an
open parenthesis or an integer in order for this parse to be valid. We
consume the integer token, and thus we have hit a final state in the F
transition diagram, so we return to where we came from which is the T
diagram; we have just finished processing the F nonterminal. We continue
with T', and go to that start state. The current lookahead is +, which doesn't
match the * required by the first production, but + is in the follow set for T'
so we match the second production which allows T' to disappear entirely.
We finish T' and return to T, where we are also in a final state. We return
to the E diagram where we have just finished processing the T. We move
on to E', and so on.
A table-driven predictive parser uses a stack to store the productions to
which it must return. A parsing table stores the actions the parser should
take based on the input token and what value is on top of the stack. $ is the
end of input symbol.
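The table can be reconstructed from the grammar's first and follow sets (computed in the next section); one layout of it, with blank entries denoting errors, is:

        int        +            *            (          )         $
E       E > TE'                              E > TE'
E'                 E' > + TE'                           E' > ε    E' > ε
T       T > FT'                              T > FT'
T'                 T' > ε      T' > * FT'               T' > ε    T' > ε
F       F > int                              F > (E)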

Tracing a Table-Driven Predictive Parser


Here is how a predictive parser works. We push the start symbol on the
stack and read the first input token. As the parser works through the input,
there are the following possibilities for the top stack symbol X and the
input token a, using table M:
1. If X = a and a = end of input ($), the parser halts and the parse is completed successfully.
2. If X = a and a != $, we have a successful match; pop X and advance to the next input token. This is called a match action.
3. If X != a and X is a nonterminal, pop X and consult the table at M[X,a] to see which production applies; push the right side of that production on the stack. This is called a predict action.
4. If none of the preceding cases applies or the table entry from step 3 is blank, there has been a parse error.
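A compact sketch of this loop in C for the expression grammar above; the single-character symbol encoding and the predict() helper standing in for the table M are assumptions of the sketch.

/* Symbols: nonterminals E, e (E'), T, t (T'), F; terminals i (int), +, *, (, ), $. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

static char stack[100];
static int  top = -1;

static void push(const char *s)                /* push a string of symbols,   */
{                                              /* leftmost symbol ends on top */
    for (int i = (int)strlen(s) - 1; i >= 0; i--)
        stack[++top] = s[i];
}

/* The parse table M[X,a]: the right-hand side to push, "" for an epsilon
   production, NULL for an error entry. */
static const char *predict(char X, char a)
{
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "Te" : NULL;
    case 'e': return a == '+' ? "+Te" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "Ft" : NULL;
    case 't': return a == '*' ? "*Ft" : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return a == 'i' ? "i" : a == '(' ? "(E)" : NULL;
    default:  return NULL;
    }
}

int main(void)
{
    const char *input = "i+i*i$";              /* int + int * int, already tokenised */
    int pos = 0;
    push("$"); push("E");                      /* stack starts as E $ */
    for (;;) {
        char X = stack[top], a = input[pos];
        if (X == a && a == '$') { printf("accept\n"); return 0; }      /* case 1 */
        else if (X == a) { top--; pos++; }                             /* case 2: match */
        else if (isupper((unsigned char)X) || X == 'e' || X == 't') {  /* case 3: predict */
            const char *rhs = predict(X, a);
            if (rhs == NULL) { printf("parse error\n"); return 1; }
            top--;
            push(rhs);
        }
        else { printf("parse error\n"); return 1; }                    /* case 4 */
    }
}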

Here is an example parse of the string int + int * int:
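The parse can be reconstructed as follows, writing the parse stack with its top on the left:

Parse Stack      Remaining Input      Parser Action
E $              int + int * int $    predict E > TE'
T E' $           int + int * int $    predict T > FT'
F T' E' $        int + int * int $    predict F > int
int T' E' $      int + int * int $    match int
T' E' $          + int * int $        predict T' > ε
E' $             + int * int $        predict E' > + TE'
+ T E' $         + int * int $        match +
T E' $           int * int $          predict T > FT'
F T' E' $        int * int $          predict F > int
int T' E' $      int * int $          match int
T' E' $          * int $              predict T' > * FT'
* F T' E' $      * int $              match *
F T' E' $        int $                predict F > int
int T' E' $      int $                match int
T' E' $          $                    predict T' > ε
E' $             $                    predict E' > ε
$                $                    accept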

Suppose, instead, that we were trying to parse the input +$. The first step
of the parse would give an error because there is no entry at M[E, +].
Constructing the Parse Table
The next task is to figure out how we built the table. The construction of
the table is somewhat involved and tedious (the perfect task for a

computer, but error-prone for humans). The first thing we need to do is
compute the first and follow sets for the grammar:
E > TE'
E' > + TE' | ε
T > FT'
T' > * FT' | ε
F > (E) | int
First(E) = First(T) = First(F) = { ( int }
First(T') = { * ε }
First(E') = { + ε }
Follow(E) = Follow(E') = { $ ) }
Follow(T) = Follow(T') = { + $ ) }
Follow(F) = { * + $ ) }

Once we have the first and follow sets, we build a table M with the
leftmost column labelled with all the nonterminals in the grammar, and the
top row labelled with all the terminals in the grammar, along with $. The
following algorithm fills in the table cells:
1. For each production A > u of the grammar, do steps 2 and 3.
2. For each terminal a in First(u), add A > u to M[A,a].
3. If ε is in First(u) (i.e. A is nullable), add A > u to M[A,b] for each terminal b in Follow(A). If ε is in First(u) and $ is in Follow(A), add A > u to M[A,$].
4. All undefined entries are errors.

The concept used here is to consider a production A > u with a in First(u).
The parser should expand A to u when the current input symbol is a. It's a
little trickier when u = ε or u =>* ε. In this case, we should expand A to u if
the current input symbol is in Follow(A), or if the $ at the end of the input
has been reached, and $ is in Follow(A).
If the procedure ever tries to fill in an entry of the table that already has a
non-error entry, the procedure fails (the grammar is not LL(1)).
Properties of LL(1) Grammars
These predictive top-down techniques (either recursive-descent or table-driven) require a grammar that is LL(1). One fully-general way to
determine if a grammar is LL(1) is to build the table and see if you have
conflicts. In some cases, you will be able to determine that a grammar is or
isn't LL(1) via a shortcut (such as identifying obvious left-factors). To give
a formal statement of what is required for a grammar to be LL(1):

No ambiguity
No left recursion
A grammar G is LL(1) iff whenever A > u | v are two
distinct productions of G, the following conditions hold:

for no terminal a do both u and v derive strings beginning with a (i.e. their first sets are disjoint)
at most one of u and v can derive the empty string
if v =>* ε then u does not derive any string beginning with a terminal in Follow(A) (i.e. first and follow must be disjoint if nullable)

All of this translates intuitively to the requirement that, when trying to recognise A, the parser
must be able to examine just one input symbol of lookahead and uniquely
determine which production to use.
Error-Reporting and Recovery
A few general principles apply to errors found, regardless of the parsing
technique being used:

A parser should try to determine that an error has occurred
as soon as possible. Waiting too long before declaring an
error can cause the parser to lose the actual location of the
error.
A suitable and comprehensive message should be reported. "Missing
semicolon on line 36" is helpful; "unable to shift
in state 425" is not.
After an error has occurred, the parser must pick a likely
place to resume the parse. Rather than giving up at the first
problem, a parser should always try to parse as much of the
code as possible in order to find as many real errors as
possible during a single run.
A parser should avoid cascading errors, which is when one
error generates a lengthy sequence of spurious error
messages.

Recognising that the input is not syntactically valid can be relatively
straightforward. An error is detected in predictive parsing when the
terminal on top of the stack does not match the next input symbol or when
nonterminal A is on top of the stack, a is the next input symbol and the
parsing table entry M[A,a] is empty.
Deciding how to handle the error is a bit more complicated. By inserting
specific error actions into the empty slots of the table, you can determine
how a predictive parser will handle a given error condition. At the least,
you can provide a precise error message that describes the mismatch
between expected and found.
Recovering from errors and being able to resume and successfully parse is
more difficult. The entire compilation could be aborted on the first error,
but most users would like to find out more than one error per compilation.

The problem is how to fix the error in some way to allow parsing to
continue.
Many errors are relatively minor and involve syntactic violations for
which the parser has a correction that it believes is likely to be what the
programmer intended. For example, a missing semicolon at the end of the
line or a misspelled keyword can usually be recognised. For many minor
errors, the parser can fix the program by guessing at what was intended
and reporting a warning, but allowing compilation to proceed unhindered.
The parser might skip what appears to be an erroneous token in the input
or insert a necessary, but missing, token or change a token into the one
expected (substituting BEGIN for BGEIN). For more major or complex
errors, the parser may have no reliable correction. The parser will attempt
to continue but will probably have to skip over part of the input or take
some other exceptional action to do so.
Panic-mode error recovery is a simple technique that just bails out of the
current construct, looking for a safe symbol at which to restart parsing.
The parser just discards input tokens until it finds what is called a
synchronising token. The set of synchronising tokens are those that we
believe confirm the end of the invalid statement and allow us to pick up at
the next piece of code. For a nonterminal A, we could place all the symbols
in Follow(A) into its synchronising set. If A is the nonterminal for a variable
declaration and the garbled input is something like duoble d; the parser
might skip ahead to the semi-colon and act as though the declaration didn't
exist. This will surely cause some more cascading errors when the variable
is later used, but it might get through the trouble spot. We could also use
the symbols in First(A) as a synchronising set for re-starting the parse of A.
This would allow input junk double d; to parse as a valid variable
declaration.

CHAPTER FIVE
Bottom-Up Parsing
As the name suggests, bottom-up parsing works in the opposite direction
from top-down. A top-down parser begins with the start symbol at the top
of the parse tree and works downward, driving productions in forward
order until it gets to the terminal leaves. A bottom-up parse starts with the
string of terminals itself and builds from the leaves upward, working
backwards to the start symbol by applying the productions in reverse.
Along the way, a bottom-up parser searches for substrings of the working
string that match the right side of some production. When it finds such a
substring, it reduces it, i.e., substitutes the left side nonterminal for the
matching right side. The goal is to reduce all the way up to the start
symbol and report a successful parse.
In general, bottom-up parsing algorithms are more powerful than top-down methods, but not surprisingly, the constructions required are also
more complex. It is difficult to write a bottom-up parser by hand for
anything but trivial grammars, but fortunately, there are excellent parser
generator tools like yacc that build a parser from an input specification,
not unlike the way lex builds a scanner to your spec.
Shift-reduce parsing is the most commonly used and most powerful of the
bottom-up techniques. It takes as input a stream of tokens and develops the
list of productions used to build the parse tree, but the productions are
discovered in reverse order of a top-down parser. Like a table-driven
predictive parser, a bottom-up parser makes use of a stack to keep track of
the position in the parse and a parsing table to determine what to do next.
To illustrate stack-based shift-reduce parsing, consider this simplified
expression grammar:
S > E
E > T | E + T
T > id | (E)

The shift-reduce strategy divides the string we are trying to parse into two
parts: an undigested part and a semi-digested part. The undigested part
contains the tokens that are still to come in the input, and the semi-digested part is put on a stack. If parsing the string v, it starts out
completely undigested, so the input is initialised to v, and the stack is
initialised to empty. A shift-reduce parser proceeds by taking one of three
actions at each step:
Reduce:

If we can find a rule A > w, and if the contents of the stack


are qw for some q (q may be empty), then we can reduce the

stack to qA. We are applying the production for the


nonterminal A backwards. For example, using the grammar
above, if the stack contained (id we can use the rule T > id
to reduce the stack to (T.
There is also one special case: reducing the entire contents
of the stack to the start symbol with no remaining input
means we have recognised the input as a valid sentence
(e.g. the stack contains just w, the input is empty, and we
apply S > w ). This is the last step in a successful parse.
The w being reduced is referred to as a handle. Formally, a
handle of a right sentential form u is a production A > w ,
and a position within u where the string w may be found
and replaced by A to produce the previous right-sentential
form in a rightmost derivation of u. Recognising valid
handles is the difficult part of shift-reduce parsing.
Shift:

If it is impossible to perform a reduction and there are


tokens remaining in the undigested input, then we transfer a
token from the input onto the stack. This is called a shift.
For example, using the grammar above, suppose the stack
contained ( and the input contained id+id). It is impossible to
perform a reduction on ( since it does not match the entire
right side of any of our productions. So, we shift the first
character of the input onto the stack, giving us (id on the
stack and +id) remaining in the input.

Error:

If neither of the two above cases apply, we have an error. If


the sequence on the stack does not match the right-hand
side of any production, we cannot reduce. And if shifting
the next input token would create a sequence on the stack
that cannot eventually be reduced to the start symbol, a
shift action would be futile. Thus, we have hit a dead end
where the next token conclusively determines the input
cannot form a valid sentence. This would happen in the
above grammar on the input id+). The first id would be
shifted, then reduced to T and again to E, next + is shifted.
At this point, the stack contains E+ and the next input token
is ) . The sequence on the stack cannot be reduced, and
shifting the ) would create a sequence that is not viable, so
we have an error.

The general idea is to read tokens from the input and push them onto the
stack attempting to build sequences we recognise as the right side of a
production. When we find a match, we replace that sequence with the
nonterminal from the left side and continue working our way up the parse
tree. This process builds the parse tree from the leaves upward, the inverse
of the top-down parser. If all goes well, we will end up moving everything
from the input to the stack and eventually construct a sequence on the
stack that we recognise as a right-hand side for the start symbol.
Let's trace the operation of a shift-reduce parser in terms of its actions
(shift or reduce) and its data structure (a stack). The chart below traces a
parse of (id+id) using the previous example grammar:
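One reconstruction of that chart, showing the stack and remaining input after each numbered action, is:

Step   Action             Stack      Remaining Input
                                     (id+id)
1      shift (            (          id+id)
2      shift id           (id        +id)
3      reduce T > id      (T         +id)
4      reduce E > T       (E         +id)
5      shift +            (E+        id)
6      shift id           (E+id      )
7      reduce T > id      (E+T       )
8      reduce E > E+T     (E         )
9      shift )            (E)
10     reduce T > (E)     T
11     reduce E > T       E
12     reduce S > E       S          accept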

In the above parse, on step 7 we ignored the possibility of reducing E > T
because that would have created the sequence (E + E on the stack
which is not a viable prefix of a right sentential form. Formally, viable
prefixes are the set of prefixes of right sentential forms that can appear on
the stack of a shift-reduce parser, i.e. prefixes of right sentential forms that
do not extend past the end of the rightmost handle. Basically, a shift-reduce parser will only create sequences on the stack that can lead to an
eventual reduction to the start symbol. Because there is no right-hand side
that matches the sequence (E + E and no possible reduction that transforms
it to such, this is a dead end and is not considered. Later, we will see how
the parser can determine which reductions are valid in a particular
situation.
As they were for top-down parsers, ambiguous grammars are problematic
for bottom-up parsers because these grammars could yield more than one
handle under some circumstances. These types of grammars create either
shift-reduce or reduce-reduce conflicts. The former refers to a state where
the parser cannot decide whether to shift or reduce. The latter refers to a
state where the parser has more than one choice of production for

reduction. An example of a shift-reduce conflict occurs with the if-then-else construct in programming languages. A typical production might be:
S > if E then S | if E then S else S

Consider what would happen to a shift-reduce parser deriving this string:


if E then if E then S else S

At some point the parser's stack would have:


if E then if E then S

with else as the next token. It could reduce because the contents of the
stack match the right-hand side of the first production or shift the else
trying to build the right-hand side of the second production. Reducing
would close off the inner if and thus associate the else with the outer if.
Shifting would continue building and later reduce the inner if with the else.
Either is syntactically valid given the grammar, but two different parse
trees result, showing the ambiguity. This quandary is commonly referred
to as the dangling else. Does an else appearing within a nested if statement
belong to the inner or the outer? The C and Java languages agree that an
else is associated with its nearest unclosed if. Other languages, such as Ada
and Modula, avoid the ambiguity by requiring a closing endif delimiter.
Reduce-reduce conflicts are not common and usually indicate a problem in
the grammar definition.
Now that we have a general idea of how a shift-reduce parser operates, we
will look at how it recognises a handle, and how it decides which
production to use in a reduction. To deal with these two issues, we will
look at a specific shift-reduce implementation called LR parsing.
LR Parsing
LR parsers (L for left to right scan of input; R for rightmost
derivation) are efficient, table-driven shift-reduce parsers. The class of
grammars that can be parsed using LR methods is a proper superset of the
class of grammars that can be parsed with predictive LL parsers. In fact,
virtually all programming language constructs for which CFGs can be
written can be parsed with LR techniques. As an added advantage, there is
no need for lots of grammar rearrangement to make it acceptable for LR
parsing the way that LL parsing requires.
The primary disadvantage is the amount of work it takes to build the tables
by hand, which makes it infeasible to hand-code an LR parser for most
grammars. Fortunately, there exist LR parser generator tools that create the
parser from a CFG specification. The parser tool does all the tedious and

complex work to build the necessary tables and can report any ambiguities
or language constructs that interfere with the ability to parse it using LR
techniques.
We begin by tracing how an LR parser works. Determining the handle to
reduce in a sentential form depends on the sequence of tokens on the stack,
not only the topmost ones that are to be reduced, but also the context in which
we find ourselves in the parse. Rather than reading and shifting tokens onto a stack,
an LR parser pushes "states" onto the stack; these states describe what is
on the stack so far. Think of each state as encoding the current left context.
The state on top of the stack, possibly augmented by peeking at a
lookahead token, enables us to figure out whether we have a handle to
reduce, or whether we need to shift a new state on top of the stack for the
next input token.
An LR parser uses two tables:
1. The action table Action[s,a] tells the parser what to do when the state on top of the stack is s and terminal a is the next input token. The possible actions are to shift a state onto the stack, to reduce the handle on top of the stack, to accept the input, or to report an error.
2. The goto table Goto[s,X] indicates the new state to place on top of the stack after a reduce of the nonterminal X while state s is on top of the stack.
The two tables are usually combined, with the action table specifying
entries for terminals, and the goto table specifying entries for
nonterminals.
Tracing an LR Parser
We start with the initial state s0 on the stack. The next input token is the
terminal a and the current state is st. The action of the parser is as follows:

If Action[st,a] is shift, we push the specified state onto the stack. We then call yylex() to get the next token a from the input.
If Action[st,a] is reduce Y > X1...Xk, then we pop k states off the stack (one for each symbol in the right side of the production), leaving state su on top. Goto[su,Y] gives a new state sv to push on the stack. The input token is still a (i.e. the input remains unchanged).
If Action[st,a] is accept, then the parse is successful and we are done.
If Action[st,a] is error (the table location is blank), then we have a syntax error. With the current top of stack and next input we can never arrive at a sentential form with a handle to reduce.
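A compact sketch of this driver in C, using the action and goto tables for the expression grammar presented next; the numeric table encoding (positive entries for shifts, negative entries for reductions) is an assumption of the sketch.

#include <stdio.h>

#define ACC 100
enum { T_ID, T_PLUS, T_LPAREN, T_RPAREN, T_EOF };   /* terminals: id + ( ) $ */

/* action[state][terminal]: >0 shift to that state, <0 reduce by that
   production number, ACC accept, 0 error */
static const int action[9][5] = {
/* 0 */ {  4,  0,  3,  0,   0 },
/* 1 */ {  0,  5,  0,  0, ACC },
/* 2 */ { -2, -2, -2, -2,  -2 },
/* 3 */ {  4,  0,  3,  0,   0 },
/* 4 */ { -4, -4, -4, -4,  -4 },
/* 5 */ {  4,  0,  3,  0,   0 },
/* 6 */ {  0,  5,  0,  7,   0 },
/* 7 */ { -3, -3, -3, -3,  -3 },
/* 8 */ { -1, -1, -1, -1,  -1 },
};
/* goto_[state][nonterminal], nonterminals 0 = E, 1 = T */
static const int goto_[9][2] = {
    {1,2}, {0,0}, {0,0}, {6,2}, {0,0}, {0,8}, {0,0}, {0,0}, {0,0}
};
/* productions: 1) E > E + T  2) E > T  3) T > (E)  4) T > id */
static const int rhs_len[5] = { 0, 3, 1, 3, 1 };   /* symbols on each right side */
static const int lhs[5]     = { 0, 0, 0, 1, 1 };   /* left-side nonterminal      */

int main(void)
{
    int input[] = { T_ID, T_PLUS, T_LPAREN, T_ID, T_RPAREN, T_EOF };  /* id + (id) */
    int stack[100], top = 0, pos = 0;
    stack[0] = 0;                                     /* start in state s0 */
    for (;;) {
        int a = input[pos], act = action[stack[top]][a];
        if (act == ACC) { printf("accept\n"); return 0; }
        else if (act > 0) { stack[++top] = act; pos++; }              /* shift  */
        else if (act < 0) {                                           /* reduce */
            int p = -act;
            top -= rhs_len[p];        /* pop one state per right-side symbol */
            stack[top + 1] = goto_[stack[top]][lhs[p]];
            top++;
            printf("reduce by production %d\n", p);
        }
        else { printf("syntax error\n"); return 1; }                  /* error  */
    }
}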

As an example, consider the following simplified expression grammar.


The productions have been sequentially numbered so we can refer to them
in the action table:
1) E > E + T
2) E > T
3) T > (E)
4) T > id

Here is the combined action and goto table. In the action columns, sN
means shift the state numbered N onto the stack, and rN means
reduce using the production numbered N. The goto column entries are the
number of the new state to push onto the stack after reducing the specified
nonterminal. This is an LR(0) table (more details on table construction will
come shortly).
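One reconstruction of that table; the state numbers match the configurating sets and the trace discussed later in this chapter, and blank entries are errors.

        Action                               Goto
State   id     +      (      )      $       E     T
0       s4            s3                    1     2
1              s5                    acc
2       r2     r2     r2     r2     r2
3       s4            s3                    6     2
4       r4     r4     r4     r4     r4
5       s4            s3                          8
6              s5            s7
7       r3     r3     r3     r3     r3
8       r1     r1     r1     r1     r1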

Here is a parse of id + (id) using the LR algorithm with the above action
and goto table:

Types of LR Parsers
There are three types of LR parsers: LR(k), simple LR(k), and lookahead
LR(k) (abbreviated to LR(k), SLR(k), LALR(k)). The k identifies the
number of tokens of lookahead. We will usually only concern ourselves

with 0 or 1 tokens of lookahead, but the techniques do generalise to k > 1.


The different classes of parsers all operate the same way (as shown above,
being driven by their action and goto tables), but they differ in how their
action and goto tables are constructed, and the size of those tables.
We will consider LR(0) parsing first, which is the simplest of all the LR
parsing methods. It is also the weakest and although of theoretical
importance, it is not used much in practice because of its extreme
limitations. LR(0) parses without using any lookahead at all. Adding just
one token of lookahead to get LR(1) vastly increases the parsing power.
Very few grammars can be parsed with LR(0), but most unambiguous
CFGs can be parsed with LR(1). The drawback of adding the lookahead is
that the algorithm becomes somewhat more complex and the parsing table
has to be much larger. The full LR(1) parsing table for a typical
programming language has many thousands of states compared to the few
hundred needed for LR(0). A compromise in the middle is found in the
two variants SLR(1) and LALR(1) which also use one token of lookahead
but employ techniques to keep the table as small as LR(0). SLR(k) is an
improvement over LR(0) but much weaker than full LR(k) in terms of the
number of grammars for which it is applicable. LALR(k) parses a larger
set of languages than SLR(k) but not quite as many as LR(k). LALR(1) is
the method used by the yacc parser generator.
In order to begin to understand how LR parsers work, we need to delve
into how their tables are derived. The tables contain all the information
that drives the parser. As an example, we will show how to construct an
LR(0) parsing table since they are the simplest and then discuss how to do
SLR(1), LR(1), and LALR(1).
The essence of LR parsing is identifying a handle on the top of the stack
that can be reduced. Recognising a handle is actually easier than predicting
a production was in top-down parsing. The weakness of LL(k) parsing
techniques is that they must be able to predict which production to use,
having seen only k symbols of the right-hand side. For LL(1), this means
just one symbol has to tell all. In contrast, an LR(k) parser is able to
postpone the decision until it has seen tokens corresponding to the entire
right-hand side (plus k more tokens of lookahead). This doesn't mean the
task is trivial. More than one production may have the same right-hand
side and what looks like a right-hand side may not really be because of its
context. But in general, the fact that we see the entire right side before we
have to commit to a production is a useful advantage.
Constructing LR(0) Parsing Tables
Generating an LR parsing table consists of identifying the possible states and
arranging the transitions among them. At the heart of the table
construction is the notion of an LR(0) configuration or item. A

configuration is a production of the grammar with a dot at some position


on its right side. For example, A > XYZ has four possible items:
A > •XYZ
A > X•YZ
A > XY•Z
A > XYZ•

This dot marks how far we have gotten in parsing the production.
Everything to the left of the dot has been shifted onto the parsing stack and
the next input token is in the first set of the symbol after the dot (or in the
follow set if that symbol is nullable). A dot at the right end of a
configuration indicates that we have the entire right side of that configuration on the stack,
i.e., we have a handle that we can reduce. A dot in the middle of the
configuration indicates that to continue further, we need to shift a token
that could start the symbol following the dot. For example, if we are
currently in this position:
A > X•YZ

We want to shift something from First(Y) (something that matches the next
input token). Say we have productions Y > u | w. Given that, these three
configurations all correspond to the same state of the shift-reduce parser:
A > X•YZ
Y > •u
Y > •w

At the above point in parsing, we have just recognised an X and expect the
upcoming input to contain a sequence derivable from YZ. Examining the
expansions for Y, we furthermore expect the sequence to be derivable from
either u or w . We can put these three items into a set and call it a
configurating set of the LR parser. The action of adding equivalent
configurations to create a configurating set is called closure. Our parsing
tables will have one state corresponding to each configurating set.
These configurating sets represent states that the parser can be in as it
parses a string. Each state must contain all the items corresponding to
each of the possible paths that are concurrently being explored at that point
in the parse. We could model this as a finite automaton where we move
from one state to another via transitions marked with a symbol of the CFG.
For example,

Recall that we push states onto the stack in an LR parser. These states
describe what is on the stack so far. The state on top of the stack

(potentially combined with some lookahead) enables us to figure out


whether we have a handle to reduce, or whether we need to read the next
input token and shift a new state on top of the stack. We shift until we
reach a state where the dot is at the end of a production, at which point we
reduce. This finite automaton is the basis for an LR parser: each time we
perform a shift we are following a transition to a new state.
Now for the formal rule for what to put in a configurating set. We start
with a configuration:
A > X1...Xi • Xi+1...Xj

which we place in the configurating set. We then perform the closure


operation on the items in the configurating set. For each item in the
configurating set where the dot precedes a nonterminal, we add
configurations derived from the productions defining that nonterminal
with the dot at the start of the right side of those productions. So, if we
have
Xi+1 > Y1...Yg | Z1...Zh

in the above example, we would add the following to the configurating set.
Xi+1 > •Y1...Yg
Xi+1 > •Z1...Zh

We repeat this operation for all configurations in the configurating set


where a dot precedes a nonterminal until no more configurations can be
added. So, if Y1 and Z1 are terminals in the above example, we would just
have the three productions in our configurating set. If they are
nonterminals, we would need to add the Y1 and Z1 productions as well.
In summary, to create a configurating set for the starting configuration A
> •u, we follow the closure operation:
1. A > •u is in the configurating set.
2. If u begins with a terminal, we are done with this production.
3. If u begins with a nonterminal B, add all productions with B on the left side, with the dot at the start of the right side: B > •v.
4. Repeat steps 2 and 3 for any productions added in step 3. Continue until you reach a fixed point.

The other information we need to build our tables is the transitions


between configurating sets. For this, we define the successor function.
Given a configurating set C and a grammar symbol X, the successor
function computes the successor configurating set C' = successor(C,X). The

successor function describes what set the parser moves to upon


recognising a given symbol.
The successor function is quite simple to compute. We take all the
configurations in C where there is a dot preceding X, move the dot past X
and put the new configurations in C', then we apply the closure operation
to C'. The successor configurating set C' represents the state we move to
when encountering symbol X in state C.
The successor function is defined to only recognise viable prefixes. There
is a transition from A > u•xv to A > ux•v on the input x. If what we had
already seen was a viable prefix and we have just seen an x, then
we can extend the prefix by adding this symbol without destroying
viability.
Here is an example of building a configurating set, performing closure,
and computing the successor function. Consider the following item from
our example expression grammar:
E > E• + T

To obtain the successor configurating set on +, we first put the following configuration in C':
E > E +•T

We then perform a closure on this set:


E > E +•T
T > •(E)
T > •id

Now, to create the action and goto tables, we need to construct all the
configurating sets and successor functions for the expression grammar. At
the highest level, we want to start with a configuration with a dot before
the start symbol and move to a configuration with a dot after the start
symbol. This represents shifting and reducing an entire sentence of the
grammar. To do this, we need the start symbol to appear on the right side
of a production. This may not happen in the grammar so we modify it. We
create an augmented grammar by adding the production:
S' > S

where S is the start symbol. So we start with the initial configurating set C0
which is the closure of S' > •S. The augmented grammar for the example
expression grammar:
0) E' > E

1) E > E + T
2) E > T
3) T > (E)
4) T > id

We create the complete family F of configurating sets as follows:


1. Start with F containing the configurating set C0, derived from the configuration S' > •S.
2. For each configurating set C in F and each grammar symbol X such that successor(C,X) is not empty, add successor(C,X) to F.
3. Repeat step 2 until no more configurating sets can be added to F.

Here is the full family of configurating sets for the grammar given above.
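One reconstruction of that family (the numbering matches the states referred to in the trace later in this chapter):

I0: E' > •E, E > •E + T, E > •T, T > •(E), T > •id
I1 = successor(I0, E):  E' > E•, E > E• + T
I2 = successor(I0, T):  E > T•
I3 = successor(I0, ():  T > (•E), E > •E + T, E > •T, T > •(E), T > •id
I4 = successor(I0, id): T > id•
I5 = successor(I1, +):  E > E +•T, T > •(E), T > •id
I6 = successor(I3, E):  T > (E•), E > E• + T
I7 = successor(I6, )):  T > (E)•
I8 = successor(I5, T):  E > E + T•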

Note that the order of defining and numbering the sets is not important;
what is important is that all the sets are included.
A useful means to visualise the configurating sets and successors is with a
diagram like the one shown below. The transitions mark the successor
relationship between sets. We call this a goto graph or transition diagram.

To construct the LR(0) table, we use the following algorithm. The input is
an augmented grammar G' and the output is the action/goto tables:
1. Construct F = {I0, I1, ... In}, the collection of configurating sets for G'.
2. State i is determined from Ii. The parsing actions for the state are determined as follows:
a) If A > u• is in Ii then set Action[i,a] to reduce A > u for all input (A not equal to S').
b) If S' > S• is in Ii then set Action[i,$] to accept.
c) If A > u•av is in Ii and successor(Ii, a) = Ij, then set Action[i,a] to shift j (a is a terminal).
3. The goto transitions for state i are constructed for all nonterminals A using the rule: If successor(Ii, A) = Ij, then Goto[i, A] = j.
4. All entries not defined by rules 2 and 3 are errors.
5. The initial state is the one constructed from the configurating set containing S' > •S.

Notice how the shifts in the action table and the goto table are just
transitions to new states. The reductions are where we have a handle on
the stack that we pop off and replace with the nonterminal for the handle;
this occurs in the states where the dot is at the end of a production.
At this point, we should go back and look at the parse of id + (id) from
earlier in this chapter and trace what the states mean. (Refer to the action
and goto tables and the parse diagrammed earlier.)

Here is the parse (notice it is a reverse rightmost derivation; if you read
from the bottom upwards, it is always the rightmost nonterminal that was
operated on).

id + (id)
T + (id)    T > id
E + (id)    E > T
E + (T)     T > id
E + (E)     E > T
E + T       T > (E)
E           E > E + T
E'          E' > E

Now let's examine the action of the parser. We start by pushing s0 on the
stack. The first token we read is an id. In configurating set I0, the successor
of id is set I4; this means pushing s4 onto the stack. This is a final state for id
(the dot is at the end of the production) so we reduce the production T > id.
We pop s4 to match the id being reduced and we are back in state s0. We
reduced the handle into a T, so we use the goto part of the table, and
Goto[0, T] tells us to push s2 on the stack. (In set I0 the successor for T was
set I2). In set I2 the action is to reduce E > T, so we pop off the s2 state and
are back in s0. Goto[0, E] tells us to push s1. From set I1 seeing a + takes us
to set I5 (push s5 on the stack).
From set I5 we read an open (, which takes us to set I3 (push s3 on the
stack). We have an id coming up and so we shift state s4. Set I4 reduces T >
id, so we pop s4 to remove the right side and we are back in state s3. We use the
goto table Goto[3, T] to get to set I2. From here we reduce E > T and pop s2 to
get back to state s3; now we goto s6. Action[6, )] tells us to shift s7. Now in s7
we reduce T > (E). We pop the top three states off (one for each symbol in
the right-hand side of the production being reduced) and we are back in s5
again. Goto[5, T] tells us to push s8. We reduce by E > E + T, which pops off
three states to return to s0. Because we just reduced E we goto s1. The next
input symbol is $, which means we have completed the production E' > E and the
parse is successful.
The stack allows us to keep track of what we have seen so far and what we
are in the middle of processing. We shift states that represent the
amalgamation of the possible options onto the stack until we reach the end
of a production in one of the states. Then we reduce. After a reduce, states
are popped off the stack to match the symbols of the matching right-side.
What's left on the stack is what we have yet to process.
Consider what happens when we try to parse id++. We start in s0 and do the
same as above to reduce the id to T and then to E. Now we are in set I5 and
we encounter another +. This is an error because the action table is empty
for that transition. There is no successor for + from that configurating set,
because there is no viable prefix that begins E++.

Subset Construction and Closure


You may have noticed a similarity between subset construction and the
closure operation. If you think back to the earlier material on lexical analysis, we explored the
subset construction algorithm for converting an NFA into a DFA. The
basic idea was to create new states that represent the non-determinism by
grouping the possibilities that look the same at that stage and only
diverging when you get more information. The same idea applies to
creating the configurating sets for the grammar and the successor function
for the transitions. We create an NFA whose states are all the different
individual configurations. We put all the initial configurations into one
start state. Then we draw all the transitions from this state to the other states,
where all the other states have only one configuration each. This is the
NFA we do subset construction on to convert into a DFA. Here is a simple
example starting from the grammar consisting of strings with one or more
a's:
1) S' > S
2) S > Sa
3) S > a

Close on the augmented production and put all those configurations in a set:
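S' > •S
S > •Sa
S > •a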

Do subset construction on the resulting NFA to get the configurating sets:
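I0: S' > •S, S > •Sa, S > •a
I1 = successor(I0, S): S' > S•, S > S•a
I2 = successor(I0, a): S > a•
I3 = successor(I1, a): S > Sa•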

Interesting, isn't it, to see the parallels between the two processes? They
both are grouping the possibilities into states that only diverge once we get
further along and can be sure of which path to follow.
Limitations of LR(0) Parsing
The LR(0) method may appear to be a strategy for creating a parser that
can handle any context-free grammar, but in fact, the grammars we used as
examples in this chapter were specifically selected to fit the criteria
needed for LR(0) parsing. Remember that LR(0) means we are parsing
with zero tokens of lookahead. The parser must be able to determine what
action to take in each state without looking at any further input symbols,
i.e. by only considering what the parsing stack contains so far. In an LR(0)
table, each state must either only shift or only reduce. Thus an LR(0) configurating set
cannot have both shift and reduce items, and can have at most one
reduce item. This turns out to be a rather limiting constraint.
To be precise, a grammar is LR(0) if the following two conditions hold:
1. For any configurating set containing the item A > u•xv, there is no complete item B > w• in that set. In the tables, this translates to no shift-reduce conflict on any state. This means the successor function from that set either shifts to a new state or reduces, but not both.
2. There is at most one complete item A > u• in each configurating set. This translates to no reduce-reduce conflict on any state. The successor function has at most one reduction.

Very few grammars meet the requirements to be LR(0). For example, any
grammar with an ε-rule will be problematic. If the grammar contains the
production A > ε, then the complete item A > ε• will create a shift-reduce conflict if
there is any other non-null production for A. ε-rules are fairly common in
programming language grammars, for example, for optional features such
as type qualifiers or variable declarations.
Even modest extensions to the earlier example grammar cause trouble.
Suppose we extend it to allow array elements, by adding the production
rule T > id[E]. When we construct the configurating sets, we will have one
containing the items T > id• and T > id•[E], which is a shift-reduce
conflict.
Or suppose we allow assignments by adding the productions E > V = E
and V > id. One of the configurating sets for this grammar contains the
items V > id• and T > id•, leading to a reduce-reduce conflict.
The above examples show that the LR(0) method is just too weak to be
useful. This is caused by the fact that we try to decide what action to take
only by considering what we have seen so far, without using any
information about the upcoming input. By adding just a single token of
lookahead, we can vastly increase the power of the LR parsing technique
and work around these conflicts. There are three ways to use one token of
lookahead: SLR(1), LR(1) and LALR(1), each of which we will consider
in turn in the next few chapters.

CHAPTER SIX
SLR and LR Parsing
The problem with LR(0)
LR(0) is the simplest technique in the LR family. Although that makes it
the easiest to learn, these parsers are too weak to be of practical use for
anything but a very limited set of grammars. The examples given at the
end of the previous chapter show how even small additions to an LR(0)
grammar can introduce conflicts that make it no longer LR(0). The
fundamental limitation of LR(0) is the zero, meaning no lookahead tokens
are used. It is a stifling constraint to have to make decisions using only
what has already been read, without even glancing at what comes next in
the input. If we could peek at the next token and use that as part of the
decision-making, we will find that it allows for a much larger class of
grammars to be parsed.
SLR(1)
We will first consider SLR(1) where the S stands for Simple. SLR(1)
parsers use the same LR(0) configurating sets and have the same table
structure and parser operation, so everything you've already learned about
LR(0) applies here. The difference comes in assigning table actions, where
we are going to use one token of lookahead to help arbitrate among the
conflicts. If we think back to the kind of conflicts we encountered in LR(0)
parsing, it was the reduce actions that caused us grief. A state in an LR(0)
parser can have at most one reduce action and cannot have both shift and
reduce actions. Since a reduce is indicated for any completed item, this
dictates that each completed item must be in a state by itself. But let's
revisit the assumption that if the item is complete, the parser must choose
to reduce. Is that always appropriate? If we peeked at the next upcoming
token, it may tell us something that invalidates that reduction. If the
sequence on top of the stack could be reduced to the non-terminal A, what
tokens do we expect to find as the next input? What tokens would tell us
that the reduction is not appropriate? Perhaps Follow(A) could be useful
here!
The simple improvement that SLR(1) makes on the basic LR(0) parser is
to reduce only if the next input token is a member of the follow set of the
non-terminal being reduced. When filling in the table, we don't assume a
reduce on all inputs as we did in LR(0), we selectively choose the
reduction only when the next input symbol is a member of the follow set.
To be more precise, here is the algorithm for SLR(1) table construction
(note all steps are the same as for LR(0) table construction except for 2a):
1. Construct F = {I0, I1, ... In}, the collection of LR(0) configurating sets for G'.
2. State i is determined from Ii. The parsing actions for the state are determined as follows:
a) If A > u• is in Ii then set Action[i,a] to reduce A > u for all a in Follow(A) (A is not S').
b) If S' > S• is in Ii then set Action[i,$] to accept.
c) If A > u•av is in Ii and successor(Ii, a) = Ij, then set Action[i,a] to shift j (a must be a terminal).
3. The goto transitions for state i are constructed for all nonterminals A using the rule: If successor(Ii, A) = Ij, then Goto[i, A] = j.
4. All entries not defined by rules 2 and 3 are errors.
5. The initial state is the one constructed from the configurating set containing S' > •S.

In an SLR(1) parser, it is allowable for there to be both shift and reduce items
in the same state as well as multiple reduce items. The SLR(1) parser will
be able to determine which action to take as long as the follow sets are
disjoint.
Let's consider those changes from the end of the previous chapter to the
simplified expression grammar that would have made it no longer LR(0).
Here is the version with the addition of array access:
E' > E
E > E + T | T
T > (E) | id | id[E]

Here are the first two LR(0) configurating sets entered if id is the first
token of the input.
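The start set is:

E' > •E
E > •E + T
E > •T
T > •(E)
T > •id
T > •id[E]

and its successor on id (the set referred to below as the one on the right) is:

T > id•
T > id•[E]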

In an LR(0) parser, the set on the right has a shift-reduce conflict.
However, an SLR(1) parser will compute Follow(T) = { + ) ] $ } and only enter the
reduce action on those tokens. The input [ will shift and there is no
conflict. Thus this grammar is SLR(1) even though it is not LR(0).
Similarly, the simplified expression grammar with the assignment
addition:
E' > E
E > E + T | T | V = E
T > (E) | id
V > id

Here are the first two LR(0) configurating sets entered if id is the first
token of the input.
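The start set is:

E' > •E
E > •E + T
E > •T
E > •V = E
T > •(E)
T > •id
V > •id

and its successor on id (the set on the right) is:

T > id•
V > id•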

In an LR(0) parser, the set on the right has a reduce-reduce conflict.


However, an SLR(1) parser will compute Follow(T) = { + ) $ } and Follow(V)
= { = } and thus can distinguish which reduction to apply depending on the
next input token. The modified grammar is SLR(1).
SLR(1) Grammars
A grammar is SLR(1) if the following two conditions hold for each
configurating set:
1. For any item A > u•xv in the set, with terminal x, there is no complete item B > w• in that set with x in Follow(B). In the tables, this translates to no shift-reduce conflict on any state. This means the successor function for x from that set either shifts to a new state or reduces, but not both.
2. For any two complete items A > u• and B > v• in the same configurating set, the follow sets must be disjoint, i.e. Follow(A) ∩ Follow(B) is empty. This translates to no reduce-reduce conflict on any state. If more than one nonterminal could be reduced from this set, it must be possible to uniquely determine which using only one token of lookahead.
All LR(0) grammars are SLR(1) but the reverse is not true, as the two
extensions to our expression grammar demonstrated. The addition of just
one token of lookahead and use of the follow set greatly expands the class
of grammars that can be parsed without conflict.
The Limitations of SLR(1)
The SLR(1) technique still leaves something to be desired, because we are
not using all the information that we have at our disposal. When we have a
completed configuration (i.e. dot at the end) such as X > u•, we know this
corresponds to a situation in which we have u as a handle on top of the
stack which we then can reduce, i.e., replacing u by X. We allow such a
reduction whenever the next symbol is in Follow(X). However, it may be
that we should not reduce for every symbol in Follow(X), because the
symbols below u on the stack preclude u being a handle for reduction in
this case. In other words, SLR(1) states only tell us about the sequence on
top of the stack, not what is below it on the stack. We may need to divide

an SLR(1) state into separate states to differentiate the possible means by


which that sequence has appeared on the stack. By carrying more
information in the state, it will allow us to rule out these invalid
reductions.
Consider this example from Aho/Sethi/Ullman that defines a small
grammar for assignment statements, using the non-terminal L for l-value
and R for r-value and * for contents-of.
S' > S
S > L = R
S > R
L > *R
L > id
R > L

Consider parsing the expression id = id. After working our way to


configurating set I2, having reduced the first id to L, we have a choice upon
seeing = coming up in the input. The first item in the set wants to set
Action[2,=] to shift 6, which corresponds to moving on to find the rest of
the assignment. However, = is also in Follow(R), because S => L = R => *R =
R. Thus, the second configuration wants to reduce R > L in that slot. This is
a shift-reduce conflict, but not because of any problem with the grammar.
An SLR parser does not remember enough left context to decide what
should happen when it encounters a = in the input having seen a string
reducible to L. Although the sequence on top of the stack could be reduced
to R, we don't want to choose this reduction because there is no
possible right sentential form that begins R = ... (there is one beginning *R
= ..., which is not the same). Thus, the correct choice is to shift.
It's not further lookahead that the SLR tables are missing (we don't need
to see additional symbols beyond the first token in the input); we have
already seen the information that allows us to determine the correct choice.
What we need is to retain a little more of the left context that brought us
here. In this example grammar, the only time we should consider reducing
by the production R > L is during a derivation that has already seen a * or
an =.


Just using the entire follow set is not discriminating enough as the guide
for when to reduce. The follow set contains symbols that can follow R in
any position within a valid sentence but it does not precisely indicate
which symbols follow R at this particular point in a derivation. So we will
augment our states to include information about what portion of the follow
set is appropriate given the path we have taken to that state.
We can be in state 2 for one of two reasons: we are trying to build from
S > L = R or from S > R > L. If the upcoming symbol is =, then that rules
out the second choice and we must be building the first, which tells us to
shift. The reduction should only be applied if the next input symbol is $.
Even though = is in Follow(R) because of the other contexts in which an R
can appear, in this particular situation it is not appropriate, because when
deriving a sentence S => R => L, = cannot follow R.
Constructing LR(1) Parsing Tables
LR or canonical LR parsing incorporates the required extra information
into the state by redefining configurations to include a terminal symbol as
an added component. LR(1) configurations have the general form:
A > X1...Xi • Xi+1...Xj , a

This means we have states corresponding to X1...Xi on the stack and we are
looking to put states corresponding to Xi+1...Xj on the stack and then reduce,
but only if the token following Xj is the terminal a. The symbol a is called
the lookahead of the configuration. The lookahead only comes into play with
LR(1) configurations with a dot at the right end:
A > X1...Xj • , a

This means we have states corresponding to X1...Xj on the stack but we may
only reduce when the next symbol is a. The symbol a is either a terminal or
$ (the end-of-input marker). With SLR(1) parsing, we would reduce if the
next token was any of those in Follow(A). With LR(1) parsing, we reduce only
if the next token is exactly a. We may have more than one symbol in the
lookahead for a configuration; as a convenience, we list those symbols
separated by a forward slash. Thus, the configuration A > u•, a/b/c says
that it is valid to reduce u to A only if the next token is equal to a, b, or c.
The configuration lookahead will always be a subset of Follow(A).
Recall the definition of a viable prefix from the discussion of SLR parsing.
Viable prefixes are those prefixes of right sentential forms that can appear
on the stack of a shift-reduce parser. Formally, we say that a configuration
[A > u•v, a] is valid for a viable prefix α if there is a rightmost
derivation S =>* βAw =>* βuvw where α = βu and either a is the first symbol
of w, or w is ε and a is $. For example, consider the grammar

S > ZZ
Z > xZ | y

There is a rightmost derivation S =>* xxZxy => xxxZxy. We see that
configuration [Z > x•Z, x] is valid for viable prefix α = xxx by letting
β = xx, A = Z, w = xy, u = x and v = Z. Another example is from the rightmost
derivation S =>* ZxZ => ZxxZ, making [Z > x•Z, $] valid for viable prefix
Zxx.
Often we have a number of LR(1) configurations that differ only in their
lookahead components. The addition of a lookahead component to LR(1)
configurations allows us to make parsing decisions beyond the capability
of SLR(1) parsers. There is, however, a big price to be paid. There will be
more distinct configurations and thus many more possible configurating
sets. This increases the size of the goto and action tables considerably. In
the past when memory was smaller, it was difficult to find storage-efficient
ways of representing these tables, but now this is not as much of
an issue. Still, it's a big job building LR tables for any substantial
grammar by hand.
The method for constructing the configurating sets of LR(1)
configurations is essentially the same as for SLR, but there are some
changes in the closure and successor operations because we must respect
the configuration lookahead. To compute the closure of an LR(1)
configurating set I:
Repeat the following until no more configurations can be added to
state I:

For each configuration [A > u•Bv, a] in I, for each
production B > w in G', and for each terminal b in First(va)
such that [B > •w, b] is not in I: add [B > •w, b] to I.
What does this mean? We have a configuration with the dot before the
non-terminal B. In LR(0), we computed the closure by adding all B
productions with no indication of what was expected to follow them. In
LR(1), we are a little more precise: we add each B production but insist
that each have a lookahead of First(va), since this
is what follows B in this production. Remember that we can compute first
sets not just for a single non-terminal, but also a sequence of terminal and
non-terminals. First(va) includes the first set of the first symbol of v and
then if that symbol is nullable, we include the first set of the following
symbol, and so on. If the entire sequence v is nullable, we add the
lookahead a already required by this configuration.
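To make the First(va) computation concrete, here is a small sketch in C. The
single-character encoding of grammar symbols, the helper names, and the
hard-coded FIRST sets for the grammar S > XX, X > aX | b (used later in this
chapter) are all assumptions made purely for the illustration.

/* A minimal sketch of the lookahead computation First(va) used when
 * closing an LR(1) item [A > u•Bv, a].  Grammar symbols are single
 * characters: upper case = non-terminal, anything else = terminal.
 * FIRST sets and nullability are hard-coded for S > XX, X > aX | b. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *first_of_nonterminal(char nt)
{
    /* FIRST(S) = FIRST(X) = {a, b} in this toy grammar. */
    return (nt == 'S' || nt == 'X') ? "ab" : "";
}

static int nullable(char nt)
{
    (void)nt;                 /* no nullable non-terminals here */
    return 0;
}

/* Collect First(va) into out: scan v left to right, adding the FIRST set
 * of each symbol and stopping at the first non-nullable one.  If all of
 * v derives the empty string, fall back to the item's own lookahead a. */
static void first_of_string(const char *v, char a, char out[])
{
    size_t n = strlen(out);
    for (; *v; v++) {
        if (!isupper((unsigned char)*v)) {            /* terminal */
            if (!strchr(out, *v)) { out[n++] = *v; out[n] = '\0'; }
            return;
        }
        for (const char *f = first_of_nonterminal(*v); *f; f++)
            if (!strchr(out, *f)) { out[n++] = *f; out[n] = '\0'; }
        if (!nullable(*v))
            return;
    }
    if (!strchr(out, a)) { out[n++] = a; out[n] = '\0'; }
}

int main(void)
{
    char look[16] = "";
    /* Closing [S' > •S, $]: the new S items get lookahead First(ε $). */
    first_of_string("", '$', look);
    printf("lookahead for the S items: %s\n", look);   /* prints $  */

    look[0] = '\0';
    /* Closing [S > •XX, $]: the new X items get lookahead First(X $). */
    first_of_string("X", '$', look);
    printf("lookahead for the X items: %s\n", look);   /* prints ab */
    return 0;
}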
The successor function for the configurating set I and symbol X is
computed as follows: let J be the set of all configurations [A > uX•v, a]
such that [A > u•Xv, a] is in I; successor(I,X) is the closure of
configurating set J.
We take each production in a configurating set, move the dot over a
symbol and close on the resulting production. This is basically the same
successor function as defined for LR(0), but we have to propagate the
lookahead when computing the transitions.
We construct the complete family of all configurating sets F just as we did
before. F is initialised to contain the closure of [S' > •S, $]. For each
configurating set I and each grammar symbol X such that successor(I,X) is
not empty and not in F, add successor(I,X) to F, until no other configurating
set can be added to F.
Let's consider an example. The augmented grammar below recognises the
regular language a*ba*b (this example is from pp. 231-236 of
Aho/Sethi/Ullman).
0) S' > S
1) S > XX
2) X > aX
3) X > b

Here is the family of LR(1) configurating sets:

I0: S' > •S, $    S > •XX, $     X > •aX, a/b    X > •b, a/b
I1: S' > S•, $
I2: S > X•X, $    X > •aX, $     X > •b, $
I3: X > a•X, a/b  X > •aX, a/b   X > •b, a/b
I4: X > b•, a/b
I5: S > XX•, $
I6: X > a•X, $    X > •aX, $     X > •b, $
I7: X > b•, $
I8: X > aX•, a/b
I9: X > aX•, $

The above grammar has only seven SLR states, but ten in canonical LR. We
end up with additional states because we have split states that differ only
in lookahead: state 3 corresponds to the context where we are in the middle
of parsing the first X, while state 6 is the second X. Similarly, states 4
and 7 are completing the first and second X respectively. In SLR, those
states are not distinguished, and if we were attempting to parse a single b
by itself, we would allow that b to be reduced to X, even though this will
not lead to a valid sentence. The SLR parser will eventually notice the
syntax error, too, but the LR parser figures it out a bit sooner.


To fill in the entries in the action and goto tables, we use a similar
algorithm as we did for SLR(1), but instead of assigning reduce actions
using the follow set, we use the specific lookaheads. Here are the steps to
build an LR(1) parse table:
1.  Construct F = {I0, I1, ... In}, the collection of configurating
    sets for the augmented grammar G' (augmented by adding the
    special production S' > S).
2.  State i is determined from Ii. The parsing actions for the
    state are determined as follows:
    a)  If [A > u•, a] is in Ii then set Action[i,a] to reduce A > u
        (A is not S').
    b)  If [S' > S•, $] is in Ii then set Action[i,$] to accept.
    c)  If [A > u•av, b] is in Ii and succ(Ii, a) = Ij, then set
        Action[i,a] to shift j (a must be a terminal).
3.  The goto transitions for state i are constructed for all
    non-terminals A using the rule: if succ(Ii, A) = Ij, then
    Goto[i, A] = j.
4.  All entries not defined by rules 2 and 3 are errors.
5.  The initial state is the one constructed from the configurating
    set containing S' > •S.

Following the algorithm using the configurating sets given above, we


construct this canonical LR parse table:

Let's parse the example string baab. It is a valid sentence in this language,
as shown by this leftmost derivation:

S => XX => bX => baX => baaX => baab


Now, let's consider what the states mean. S4 is where X > b is completed;
S2 and S6 are where we are in the middle of processing the two a's; S7 is
where we process the final b; S9 is where we complete the X > aX production;
S5 is where we complete S > XX; and S1 is where we accept.

LR(1) Grammars
Every SLR(1) grammar is a canonical LR(1) grammar, but the canonical
LR(1) parser may have more states than the SLR(1) parser. An LR(1)
grammar is not necessarily SLR(1); the grammar given earlier is an
example. Because an LR(1) parser splits states based on differing
lookaheads, it may avoid conflicts that would otherwise result if using the
full follow set.
A grammar is LR(1) if the following two conditions are satisfied for each
configurating set:
1.  For any item [A > u•xv, a] in the set, with x a terminal, there
    is no item in the set of the form [B > w•, x]. In the action
    table, this translates to no shift-reduce conflict for any state.
    The successor function for x either shifts to a new state or
    reduces, but not both.
2.  The lookaheads for all complete items within the set must be
    disjoint, i.e. the set cannot have both [A > u•, a] and
    [B > v•, a]. This translates to no reduce-reduce conflict on any
    state. If more than one non-terminal could be reduced from this
    set, it must be possible to uniquely determine which is
    appropriate from the next input token.

As long as there is a unique shift or reduce action on each input symbol


from each state, we can parse using an LR(1) algorithm. The above state
conditions are similar to what is required for SLR(1), but rather than the
looser constraint about disjoint follow sets and so on, canonical LR(1)
computes a more precise notion of the appropriate lookahead within a
particular context and thus is able to resolve conflicts that SLR(1) would
encounter.


CHAPTER SEVEN
LALR Parsing
The Motivation for LALR
Because a canonical LR(1) parser splits states based on differing
lookahead sets, it can have many more states than the corresponding
SLR(1) or LR(0) parser. Potentially it could require splitting a state with
just one item into a different state for each subset of the possible
lookaheads; in a pathological case, this means the entire power set of its
follow set (which theoretically could contain all terminals). It never
actually gets that bad in practice, but a canonical LR(1) parser for a
programming language might have an order of magnitude more states than
an SLR(1) parser. Is there something in between?
With LALR (lookahead LR) parsing, we attempt to reduce the number of
states in an LR(1) parser by merging similar states. This reduces the
number of states to the same as SLR(1), but still retains some of the power
of the LR(1) lookaheads. Let's examine the LR(1) configurating sets for the
example given in the chapter on LR parsing.
S' > S
S > XX
X > aX
X > b

Notice that some of the LR(1) states look suspiciously similar. Take I3 and
I6 for example. These two states are virtually identical: they have the
same number of items, the core of each item is identical, and they differ
only in their lookahead sets. This observation may make you wonder if it is
possible to merge them into one state. The same is true of I4 and I7, and of
I8 and I9. If we did merge, we would end up replacing those six states with
just these three:
I36: X > a•X, a/b/$
     X > •aX, a/b/$
     X > •b,  a/b/$

I47: X > b•,  a/b/$

I89: X > aX•, a/b/$


But isn't this just SLR(1) all over again? In the above example, yes, since
after the merging we did end up with the complete follow sets as the
lookahead. This is not always the case, however. Consider this example:
S' > S
S > Bbb | aab | bBa
B > a

In an SLR(1) parser there is a shift-reduce conflict in state 3 when the next
input is anything in Follow(B), which includes a and b. In LALR(1), state 3
will shift on a and reduce on b. Intuitively, this is because the LALR(1)
state remembers that we arrived at state 3 after seeing an a. Thus we are
trying to parse either Bbb or aab. In order for that first a to be a valid
reduction to B, the next input has to be exactly b since that is the only
symbol that can follow B in this particular context. Although elsewhere an
expansion of B can be followed by an a, we consider only the subset of the
follow set that can appear here, and thus avoid the conflict an SLR(1)
parser would have.
Conflicts in LALR Mergings
Can merging states in this way ever introduce new conflicts? A shift-reduce
conflict cannot exist in a merged set unless the conflict was already
present in one of the original LR(1) configurating sets. When merging, the
two sets must have the same core items. If the merged set has a
configuration that shifts on a and another that reduces on a, both
configurations must have been present in the original sets, and at least one
of those sets had a conflict already.
Reduce-reduce conflicts, however, are another story. Consider the
following grammar:
S' > S
S > aBc | bCc | aCd | bBd
B > e
C > e

The LR(1) configurating sets are as follows:


We try to merge I6 and I9 since they have the same core items and they only
differ in lookahead:
I69: C > e•, c/d
     B > e•, d/c

However, this creates a problem. The merged configurating set allows a
reduction to either B or C when the next token is c or d. This is a reduce-reduce
conflict and can be an unintended consequence of merging LR(1) states.
When such a conflict arises in doing a merging, we say the grammar is not
LALR(1).
LALR Table Construction
An LALR(1) parsing table is built from the configurating sets in the same
way as canonical LR(1); the lookaheads determine where to place reduce
actions. In fact, if there are no mergeable states in the configurating sets,
the LALR(1) table will be identical to the corresponding LR(1) table and no
advantage or disadvantage is gained.
In the common case, however, there will be states that can be merged and
the LALR table will have fewer rows than LR. The LR table for a typical
programming language may have several thousand rows, which can be
merged into just a few hundred for LALR. Due to merging, the LALR(1)
table is more similar to the SLR(1) and LR(0) tables: all three have the
same number of states (rows), but the LALR table may have fewer reduce
actions, since some reductions are not valid if we are more precise about
the lookahead. Thus, some conflicts are avoided because an action cell with
conflicting actions in an SLR(1) or LR(0) table may have a unique entry in
an LALR(1) table once some erroneous reduce actions have been eliminated.
The Brute-Force Method
There are two ways to construct LALR(1) parsing tables. The first and
most obvious way is to construct the LR(1) table and merge the sets
manually. This is sometimes referred to as the brute-force way. If you
don't mind first finding all the multitude of states required by the
canonical parser, compressing the LR table into the LALR version is
straightforward.


1.  Construct all canonical LR(1) states.
2.  Merge those states that are identical if the lookaheads are
    ignored, i.e. two states being merged must have the same
    number of items and the items must have the same core (the
    same productions, differing only in lookahead). The lookahead
    on each merged item is the union of the lookaheads from the
    states being merged.
3.  The successor function for the new LALR(1) state is the union
    of the successors of the merged states. If the two
    configurations have the same core, then the original
    successors must have the same core as well, and thus the new
    state has the same successors.
4.  The action and goto entries are constructed from the LALR(1)
    states as for the canonical LR(1) parser.
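The heart of step 2 is a union of lookahead sets over items that share a
core. A small sketch in C of that merging step is shown below; the item and
state representation (production number, dot position, a lookahead bit mask)
is an assumption made for the illustration, not taken from any particular
parser generator.

#include <stdio.h>

#define MAX_ITEMS 16

/* One LR(1) item: the core (production number and dot position) plus a
 * lookahead set kept as a bit mask, one bit per terminal. */
struct item  { int prod; int dot; unsigned lookaheads; };
struct state { int nitems; struct item items[MAX_ITEMS]; };

/* Two states may be merged only if they have the same core: the same
 * items, ignoring lookaheads. */
static int same_core(const struct state *s, const struct state *t)
{
    if (s->nitems != t->nitems) return 0;
    for (int i = 0; i < s->nitems; i++)
        if (s->items[i].prod != t->items[i].prod ||
            s->items[i].dot  != t->items[i].dot)
            return 0;
    return 1;
}

/* Merge t into s: the lookahead of each merged item is the union of the
 * lookaheads from the two states (step 2 of the brute-force method). */
static int merge_states(struct state *s, const struct state *t)
{
    if (!same_core(s, t)) return 0;
    for (int i = 0; i < s->nitems; i++)
        s->items[i].lookaheads |= t->items[i].lookaheads;
    return 1;
}

int main(void)
{
    /* Hypothetical states I4 and I7 from the a*ba*b example: the single
     * item X > b• with lookaheads {a,b} and {$} respectively. */
    struct state i4 = { 1, { { 3, 1, 0x3 } } };   /* bits: a=1, b=2 */
    struct state i7 = { 1, { { 3, 1, 0x4 } } };   /* bit:  $=4      */
    if (merge_states(&i4, &i7))
        printf("merged lookahead mask: 0x%x\n", i4.items[0].lookaheads);
    return 0;
}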

Let's do an example to make this more concrete. Consider the LR(1) table for
the grammar given earlier, with its ten states.

Looking at the configurating sets, we saw that states 3 and 6 can be
merged, and so can 4 and 7, and 8 and 9. Now we build the LALR(1) table
with the seven remaining states:

A More Efficient Method
Having to compute the LR(1) configurating sets first means we won't save
any time or effort in building an LALR parser. However, the work wasn't
all for naught, because when the parser is executing, it only needs the
compressed table, which saves us memory. The savings can be an order of
magnitude in the number of states.


However, there is a more efficient strategy for building the LALR(1) states
called step-by-step merging. The idea is that you merge the configurating
sets as you go, rather than waiting until the end to find the identical ones.
Sets of states are constructed as in the LR(1) method, but at each point
where a new set is spawned, you first check to see whether it may be
merged with an existing set. This means examining the other states to see
if one with the same core already exists. If so, you merge the new set with
the existing one; otherwise you add it normally.
Here is an example of this method in action:
S' > S
S  > V = E
E  > F | E + F
F  > V | int | (E)
V  > id

Start building the LR(1) collection of configurating sets as you would
normally:

When we construct state I11, we get something we've seen before:

I11: E > F•, )/+

It has the same core as I6, so rather than add a new state, we go ahead and
merge with that one to get:

I6: E > F•, $/+/)

We have a similar situation with state I12, which can be merged with state I7.
The algorithm continues like this, merging into existing states where
possible and only adding new states when necessary. When we finish
creating the sets, we construct the table just as in LR(1).
LALR(1) Grammars
A formal definition of what makes a grammar LALR(1) cannot be easily
encapsulated in a set of rules, because it needs to look beyond the
particulars of a production in isolation to consider the other situations
where the production appears on the top of the stack and what happens
when we merge those situations. Instead we state that what makes a
grammar LALR(1) is the absence of conflicts in its parser. If you build the
parser and it is conflict-free, it implies the grammar is LALR(1), and vice
versa.
LALR(1) is a subset of LR(1) and a superset of SLR(1). A grammar that is
not LR(1) is definitely not LALR(1), since whatever conflict occurred in
the original LR(1) parser will still be present in the LALR(1). A grammar
that is LR(1) may or may not be LALR(1) depending on whether merging
introduces conflicts. A grammar that is SLR(1) is definitely LALR(1). A
grammar that is not SLR(1) may or may not be LALR(1) depending on
whether the more precise lookaheads resolve the SLR(1) conflicts.
LALR(1) has proven to be the most used variant of the LR family. The
weakness of the SLR(1) and LR(0) parsers means they are only capable of
handling a small set of grammars. The expansive memory needs of LR(1)
caused it to languish for several years as a theoretically interesting but
intractable approach. It was the advent of LALR(1) that offered a good
balance between the power of the specific lookaheads and table size. The
popular tools yacc and bison generate LALR(1) parsers and most
programming language constructs can be described with an LALR(1)
grammar (perhaps with a little grammar massaging or parser trickery to
skirt some isolated issues).
Error Handling
As in LL(1) parsing tables, we can implement error processing for any of
the variations of LR parsing by placing appropriate actions in the parse
table. Here is a parse table for a simple arithmetic expression grammar
with error actions inserted into what would have been the blank entries in
the table.

Error e1 is called from states 0, 2, 4, 5 when we encounter an operator. All
of these states expect to see the beginning of an expression, i.e., an id or a
left parenthesis. One way to fix this is for the parser to act as though id
was seen in the input and shift state 3 on the stack (the successor for id in
these states), effectively faking that the necessary token was found. The
error message printed might be something like "missing operand".
Error e2 is called from states 0, 1, 2, 4, 5 on finding a right parenthesis
where we were expecting either the beginning of a new expression (or
potentially the end of input for state 1). A possible fix: remove right
parenthesis from the input and discard it. The message printed could be
"unbalanced right parenthesis."
Error e3 is called from states 1, 3, 6, 7, 8, 9 on finding id or left parenthesis.
What were these states expecting? What might be a good fix? How should
you report the error to the user?
Error e4 is called from state 6 on finding $. What is a reasonable fix? What
do you tell the user?


CHAPTER EIGHT
Parsing Miscellany
A different way of resolving ambiguity
Recall that ambiguity means we have two or more leftmost derivations for
the same input string, or equivalently, that we can build more than one
parse tree for the same input string. A simple arithmetic expression
grammar is a common example:
E > E + E | E * E | (E) | id

Parsing the input id + id * id can produce two different parse trees because
of this ambiguity.

Earlier we discussed how to fix the problem by re-writing the grammar to


introduce new intermediate non-terminals that enforce the desired
precedence. Instead, let's consider what happens if we go ahead and create
the LR(0) configurating sets for the ambiguous version of this grammar:

Let's say we were building an SLR(1) table for this grammar. Look
carefully at state 7. In the action table, there are two conflicting entries
under the column labeled * : s5 and r1, a shift/reduce conflict. Trace the
parse of input id + id * id up to the point where we arrive in state 7:


State stack          Remaining input
S0S1S4S7             * id $              (state 7: next input is *)

At this point during the parse, we have the handle E + E on top of the stack,
and the lookahead is *. Since * is in the follow set for E, we can reduce that
E + E to E. But we could also shift the * and keep going. What choice should
we make if we want to preserve the usual arithmetic precedence?
What about if we were parsing id + id + id? We have a similar shift/reduce
conflict in state 7, now on next input +. How do we want to resolve the
shift/reduce conflict here? (Because addition is associative, it actually
doesn't much matter here, but it will for subtraction!)
State stack          Remaining input
S0S1S4S7             + id $              (state 7: next input is +)

Now consider parsing id * id + id. A similar shift/reduce conflict comes up
in state 8.

State stack          Remaining input
S0S1S5S8             + id $              (state 8: next input is +)

And what about parsing id * id * id?

State stack          Remaining input
S0S1S5S8             * id $              (state 8: next input is *)

Instead of rearranging the grammar to add all the different precedence
levels, another way of resolving these conflicts is to build the precedence
rules right into the table. Where there are two or more entries in the table,
we pick the one to keep and throw the others out. In the above example, if
we are currently in the midst of parsing a lower-precedence operation, we
shift the higher precedence operator and keep going. If the next operator is
lower-precedence, we reduce. How we break the tie when the two
operators are the same precedence determines the associativity. By
choosing to reduce, we enforce left-to-right associativity.
There are a few reasons we might want to resolve the conflicts in the
parser instead of re-writing the grammar. As originally written, the
grammar reads more naturally, for one. In the parser, we can easily tweak
the associativity and precedence of the operators without disturbing the
productions or adding extra states to the parser. Also, the parser will not
need to spend extra time reducing through the single productions (T> F,
E> T) introduced in the grammar re-write.
Note that just because there are conflicts in an LR table does not imply the
grammar is ambiguous; it just means the grammar isn't LR(1), or whatever
technique you are using. The reverse, though, is true: no ambiguous grammar
is LR(1), and an ambiguous grammar will have conflicts in any type of LR
table (or LL table, for that matter).
Dangling else Revisited
Another example of ambiguity in programming language grammars is the
famous dangling else.
Stmt > if Expr then Stmt else Stmt | if Expr then Stmt | Other

which we rewrite for brevity as:


S > iSeS | iS | a

Here are the LR(0) configurating sets. Where is the conflict in the
following collection?

Say we are parsing i S e S. When we arrive in state 4, we have i S on the
stack and the next input is e. Do we shift the e or do we reduce what is on
the stack? To follow the rules of C and Java, we want the e to associate
with the nearest if. Which action do we keep to get that behaviour?
Relationships Between LL(1) and the Various LR(1) Grammars
A picture is worth a thousand words:


Note this diagram refers to grammars, not languages, e.g. there may be
an equivalent LR(1) grammar that accepts the same language as another
non-LR(1) grammar. No ambiguous grammar is LL(1) or LR(1), so we
must either re-write the grammar to remove the ambiguity or resolve
conflicts in the parser table or implementation.
The hierarchy of LR variants is clear: every LR(0) grammar is SLR(1) and
every SLR(1) is LALR(1) which in turn is LR(1). But there are grammars
that don't meet the requirements for the weaker forms but can be parsed
by the more powerful variations.
We've seen several examples of grammars that are not LL(1) but are
LR(1). The reverse is not possible; every LL(1) grammar is guaranteed
to be LR(1). A rigorous proof is fairly straightforward from the definitions
of LL(1) and LR(1) grammars. Your intuition should tell you that an
LR(1) parser uses more information than the LL(1) parser since it
postpones the decision about which production is being expanded until it
sees the entire right side rather than attempting to predict after seeing just
the first terminal.
Comparing LL(1) Parsers to LALR(1)
The two dominant parsing techniques in real compilers are LL(1) and
LALR(1). These techniques are the ones to stash away in your brain cells
for further usage. Here are some thoughts on how to weigh the two
approaches against one another:
Implementation. Because the underlying algorithms are more
complicated, most LALR(1) parsers are built using parser
generators such as yacc and bison. LL(1) parsers may be
implemented via hand-coded recursive-descent or via LL(1)
table-driven predictive parser generators like LLgen. There
are those who like managing details and writing all the code
themselves, no errors result from misunderstanding how the
tools work, and so on. But as projects get bigger, the
automated tools can be a help, yacc/bison can find
ambiguities and conflicts that you might have missed doing
the work by hand, for example. The implementation chosen
also has an effect on maintenance. Which would you rather
do: add new productions into a grammar specification being
fed to a generator, add new entries into a table, or write new
functions for a recursive-descent parser?
Simplicity: Both techniques have fairly simple drivers. The
algorithm underlying LL(1) is more intuitively
understandable and thus easier to visualise and debug. The
myriad details of the LALR(1) configurations can be messy
and when trying to debug can be a bit overwhelming.

Generality: All LL(1) grammars are LR(1) and virtually all


are also LALR(1). On the fringes there exist grammars that
can be handled by one technique or the other exclusively.
This isn't much of an obstacle in practice since simple
grammar transformation and/or parser tricks can usually
resolve the problem. As a rule of thumb, LL(1) and LALR(1)
grammars can be constructed for any reasonable
programming language.
Grammar conditioning: An LL(1) parser has strict rules on
the structure of productions, so you will need to massage the
grammar into the proper form first. If extensive grammar
conditioning is required, you may not even recognise the
grammar you started out with. The most troublesome area for
programming language grammars is usually the handling of
arithmetic expressions. If you can stomach what is needed to
transform those to LL(1), you're through the hard part; the
rest of the grammar is smoother sailing. LALR(1) parsers are
much less restrictive on grammar forms, and thus allow you
to express the grammar more naturally and clearly. The
LALR(1) grammar will also be smaller than its LL(1)
equivalent because LL(1) requires extra nonterminals and
productions to factor the common prefixes, rearrange left
recursion, and so on.
Error repair: Both LL(1) and LALR(1) parsers possess the
valid prefix property. What is on the stack will always be a
valid prefix of a sentential form. Errors in both types of
parsers can be detected at the earliest possible point without
pushing the next input symbol onto the stack. LL(1) parse
stacks contain symbols that are predicted but not yet
matched. This information can be valuable in determining
proper repairs. LALR(1) parse stacks contain information
about what has already been seen, but do not have the same
information about the right context that is expected. This
means deciding possible continuations is somewhat easier in
an LL(1) parser.
Table sizes: Both require parse tables that can be sizeable.
For LL(1) parsers, the uncompressed table has one row for
each non-terminal and one column for each terminal, so the
total table size is |T| x |N|. An LALR table has a row for each
state and a column for each terminal and each non-terminal,
so the total table size is |S| x (|N| + |T|). The number of states
can be exponential in the worst case (i.e. the states form the
power set of all productions). So for a pathologically


designed grammar, the LALR(1) could be much, much
larger. However, for average-case inputs, the LALR(1) table
is usually about twice as big as the LL(1). For a language like
Pascal, the LL(1) table might have 1500 non-error entries,
the LALR(1) table has around 3000. This sort of thing used
to be important, but with the capabilities of today's
machines, a factor of 2 is not likely to be a significant issue.
Efficiency: Both require a stack of some sort to manage the
input. That stack can grow to a maximum depth of n, where n
is the number of symbols in the input. If you are using the
runtime stack (i.e. function calls) rather than pushing and
popping on a data stack, you will probably pay some
significant overhead for that convenience (i.e. a recursive-descent parser
takes that hit). If both parsers are using the
same sort of stack, LL(1) and LALR(1) each examine every
non-terminal and terminal when building the parse tree, and
so parsing speeds tend to be comparable between the two.


CHAPTER NINE
INTRODUCTION TO YACC
Grammars for yacc are described using a variant of Backus Naur Form
(BNF). This technique was pioneered by John Backus and Peter Naur, and
used to describe ALGOL60. A BNF grammar can be used to express
context-free languages. Most constructs in modern programming
languages can be represented in BNF.
Input to yacc is divided into three sections.
... definitions ...
%%
... rules ...
%%
... subroutines ...

The definitions section consists of token declarations, and C code
bracketed by %{ and %}. The BNF grammar is placed in the rules
section, and user subroutines are added in the subroutines section.
This is best illustrated by constructing a small calculator that can add and
subtract numbers. We'll begin by examining the linkage between lex and
yacc. Here is the definitions section for the yacc input file:
%token INTEGER

This definition declares an INTEGER token. When we run yacc, it


generates a parser in file y.tab.c, and also creates an include file,
y.tab.h:
#ifndef YYSTYPE
#define YYSTYPE int
#endif
#define INTEGER 258
extern YYSTYPE yylval;

Lex includes this file and utilizes the definitions for token values. To
obtain tokens, yacc calls yylex. Function yylex has a return type of
int, and returns the token value. Values associated with the token are
returned by lex in variable yylval. For example,
[0-9]+      { yylval = atoi(yytext);
              return INTEGER;
            }


would store the value of the integer in yylval, and return token INTEGER
to yacc. The type of yylval is determined by YYSTYPE. Since the default
type is integer, this works well in this case. Token values 0-255 are
reserved for character values. For example, if you had a rule such as
[-+] return *yytext; /* return operator */

the character value for minus or plus is returned. Note that we placed the
minus sign first so that it wouldn't be mistaken for a range designator.
Generated token values typically start around 258, as lex reserves several
values for end-of-file and error processing. Here is the complete lex input
specification for a simple calculator:
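A minimal specification along these lines is sketched below; the exact rules
may differ in detail from the original listing, but they follow the
description above.

%{
#include <stdlib.h>
#include "y.tab.h"
void yyerror(char *);
%}

%%

[0-9]+      { yylval = atoi(yytext); return INTEGER; }
[-+\n]      { return *yytext; }
[ \t]       { /* skip whitespace */ }
.           { yyerror("invalid character"); }

%%

int yywrap(void) {
    return 1;
}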

Internally, yacc maintains two stacks in memory; a parse stack and a value
stack. The parse stack contains terminals and nonterminals, and represents
the current parsing state. The value stack is an array of YYSTYPE
elements, and associates a value with each element in the parse stack. For
example, when lex returns an INTEGER token, yacc shifts this token to the
parse stack. At the same time, the corresponding yylval is shifted to the
value stack. The parse and value stacks are always synchronized, so
finding a value related to a token on the stack is easily accomplished. Here
is the yacc input specification for a simple calculator:
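A sketch of such a yacc specification, consistent with the rules discussed
below (details may differ from the original listing), is:

%{
#include <stdio.h>
int yylex(void);
void yyerror(char *);
%}

%token INTEGER

%%

program:
        program expr '\n'   { printf("%d\n", $2); }
        |
        ;

expr:
        INTEGER
        | expr '+' expr     { $$ = $1 + $3; }
        | expr '-' expr     { $$ = $1 - $3; }
        ;

%%

void yyerror(char *s) {
    fprintf(stderr, "%s\n", s);
}

int main(void) {
    yyparse();
    return 0;
}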


The rules section resembles the BNF grammar discussed earlier. The
left-hand side of a production, or nonterminal, is entered left-justified,
followed by a colon. This is followed by the right-hand side of the
production. Actions associated with a rule are entered in braces.
By utilizing left-recursion, we have specified that a program consists of
zero or more expressions. Each expression terminates with a newline.
When a newline is detected, we print the value of the expression. When we
apply the rule
expr: expr '+' expr { $$ = $1 + $3; }

we replace the right-hand side of the production in the parse stack with the
left-hand side of the same production. In this case, we pop expr '+'
expr and push expr. We have reduced the stack by popping three
terms off the stack, and pushing back one term. We may reference
positions in the value stack in our C code by specifying $1 for the first
term on the right-hand side of the production, $2 for the second, and so
on. $$ designates the top of the stack after reduction has taken place.
The above action adds the value associated with two expressions, pops
three terms off the value stack, and pushes back a single sum. Thus, the
parse and value stacks remain synchronised.
Numeric values are initially entered on the stack when we reduce from
INTEGER to expr. After INTEGER is shifted to the stack, we apply the
rule


expr: INTEGER { $$ = $1; }

The INTEGER token is popped off the parse stack, followed by a push of
expr. For the value stack, we pop the integer value off the stack, and then
push it back on again. In other words, we do nothing. In fact, this is the
default action, and need not be specified. Finally, when a newline is
encountered, the value associated with expr is printed.
In the event of syntax errors, yacc calls the user-supplied function
yyerror. If you need to modify the interface to yyerror, you can alter
the canned file that yacc includes to fit your needs. The last function in our
yacc specification is main in case you were wondering where it was.
This example still has an ambiguous grammar. Yacc will issue shift-reduce
warnings, but will still process the grammar using shift as the
default operation.
In the remainder of this chapter we will extend the simple calculator by
incorporating some new functionality. New features include the arithmetic
operators multiply and divide. Parentheses may be used to override
operator precedence, and single-character variables may be specified in
assignment statements. The following illustrates sample input and
calculator output:
user: 3 * (4 + 5)
calc: 27
user: x = 3 * (5 + 4)
user: y = 5
user: x
calc: 27
user: y
calc: 5
user: x + 2*y
calc: 37

The lexical analyser returns VARIABLE and INTEGER tokens. For
variables, yylval specifies an index into sym, our symbol table. For this
program, sym merely holds the value of the associated variable. When
INTEGER tokens are returned, yylval contains the number scanned.
Here is the input specification for lex:
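A sketch of such a lex specification follows; the exact rules are assumed
from the description above rather than reproduced from the original listing.

%{
#include <stdlib.h>
#include "y.tab.h"
void yyerror(char *);
%}

%%

[a-z]       { yylval = *yytext - 'a'; return VARIABLE; }
[0-9]+      { yylval = atoi(yytext); return INTEGER; }
[-+()=/*\n] { return *yytext; }
[ \t]       { /* skip whitespace */ }
.           { yyerror("invalid character"); }

%%

int yywrap(void) {
    return 1;
}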


The input specification for yacc follows. The tokens for INTEGER and
VARIABLE are utilized by yacc to create #defines in y.tab.h for use
in lex. This is followed by definitions for the arithmetic operators. We may
specify %left, for left-associative, or %right, for right-associative. The
last definition listed has the highest precedence. Thus, multiplication and
division have higher precedence than addition and subtraction. All four
operators are left-associative. Using this simple technique, we are able to
disambiguate our grammar.
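The definitions section being described might look like the following
sketch; the token names and the 26-entry sym array for the single-letter
variables are assumptions based on the description rather than the original
listing.

%{
    int sym[26];              /* symbol table: one value per variable a-z */
    int yylex(void);
    void yyerror(char *);
%}

%token INTEGER VARIABLE
%left '+' '-'
%left '*' '/'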


CHAPTER TEN
Syntax-Directed Translation
Syntax-directed translation refers to a method of compiler implementation
where the source language translation is completely driven by the parser.
In other words, the parsing process and parse trees are used to direct
semantic analysis and the translation of the source program. This can be a
separate phase of a compiler or we can augment our conventional grammar
with information to control the semantic analysis and translation. Such
grammars are called attribute grammars.
We augment a grammar by associating attributes with each grammar
symbol that describes its properties. An attribute has a name and an
associated value a string, a number, a type, a memory location, an
assigned register, whatever information we need. For example, variables
may have an attribute "type" (which records the declared type of a
variable, useful later in type-checking) or an integer constant may have an
attribute "value" (which we will later need to generate code).
With each production in a grammar, we give semantic rules or actions,
which describe how to compute the attribute values associated with each
grammar symbol in a production. The attribute value for a parse node may
depend on information from its children nodes below or its siblings and
parent node above.
Consider this production, augmented with a set of actions that use the
value attribute for a digit node to store the appropriate numeric value.
Below, we use the syntax X.a to refer to the attribute a associated with
symbol X.
digit > 0 {digit.value = 0}
      | 1 {digit.value = 1}
      | 2 {digit.value = 2}
      ...
      | 9 {digit.value = 9}

Attributes may be passed up a parse tree to be used by other productions:


int1 > digit {int1.value = digit.value}
| int2 digit {int1.value=int2.value*10+digit.value}

(We are using subscripts in this example to clarify which attribute we are
referring to, so int1 and int2 are different instances of the same nonterminal symbol.)
There are two types of attributes we might encounter: synthesised or
inherited. Synthesised attributes are those attributes that are passed up a
parse tree, that is, the left-side attribute is computed from the right-side


attributes. The lexical analyser usually supplies the attributes of terminals


and the synthesised ones are built up for the nonterminals and passed up
the tree.

Inherited attributes are those that are passed down a parse tree, i.e., the
right-side attributes are derived from the left-side attributes (or other
right-side attributes). These attributes are used for passing information
about the context to nodes further down the tree.

X > Y1 Y2 ... Yn
Yk.a = f(X.a, Y1.a, Y2.a, ..., Yk-1.a, Yk+1.a, ..., Yn.a)

Consider the following grammar that defines declarations and simple


expressions in a Pascal-like syntax:
P > D S
D > var V; D | ε
S > V := E; S | ε
V > x | y | z

Now we add two attributes to this grammar, name and dl, for the name of a
variable and the list of declarations. Each time a new variable is declared,
a synthesised attribute for its name is attached to it. That name is added to
a list of variables declared so far in the synthesised attribute dl that is
created from the declaration block. The list of variables is then passed as
an inherited attribute to the statements following the declarations for use in
checking that variables are declared before use.
P  > D S              {S.dl = D.dl}
D1 > var V; D2        {D1.dl = addlist(V.name, D2.dl)}
   | ε                {D1.dl = NULL}
S1 > V := E; S2       {check(V.name, S1.dl); S2.dl = S1.dl}
   | ε
V  > x                {V.name = 'x'}
   | y                {V.name = 'y'}
   | z                {V.name = 'z'}

If we were to parse the following code, what would the attribute structure
look like?


var x;
var y;
x := ...;
y := ...;

Typically the way we handle attributes is to associate with each symbol


some sort of structure with all the necessary attributes, and we can pick out
the attribute of interest by fieldname.

Top-Down SDT
We can implement syntax-directed translation in either a top-down or a
bottom-up parser and we'll briefly investigate each approach. First, let's
look at adding attribute information to a hand-constructed top-down
recursive-descent parser. Our example will be a very simple FTP client,
where the parser accepts user commands and uses a syntax-directed
translation to act upon those requests. Here's the grammar we'll use,
already in LL(1) form:
Session     > CommandList T_QUIT
CommandList > Command CommandList | ε
Command     > Login | Get | Logout
Login       > User Pass
User        > T_USER T_IDENT
Pass        > T_PASS T_IDENT
Get         > T_GET T_IDENT MoreFiles
MoreFiles   > T_IDENT MoreFiles | ε
Logout      > T_LOGOUT


Now, let's see how the attributes, such as the username, filename, and
connection, can be passed around during the parsing. This recursive-descent
parser uses the lookahead/token-matching utility functions from the chapter
on top-down parsing.
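A sketch in C of the Login portion of such a parser is shown below; the
MatchToken helper, the token codes, and the OpenConnection call are assumed
stand-ins for the utilities described in the top-down parsing chapter, not
definitions taken from the text.

#include <stddef.h>

enum { T_USER, T_PASS, T_IDENT, T_GET, T_QUIT, T_LOGOUT };

typedef struct { char username[32]; char password[32]; } LoginAttrs;
typedef struct Connection Connection;

/* Assumed helpers: consume one token, and open an FTP connection. */
extern void MatchToken(int type, char *lexeme);
extern Connection *OpenConnection(const char *user, const char *pass);

void ParseUser(LoginAttrs *attrs)
{
    MatchToken(T_USER, NULL);
    MatchToken(T_IDENT, attrs->username);   /* synthesised: username */
}

void ParsePass(LoginAttrs *attrs)
{
    MatchToken(T_PASS, NULL);
    MatchToken(T_IDENT, attrs->password);   /* synthesised: password */
}

/* Login > User Pass: the attributes gathered from the children are used
 * to create the connection, which is then passed up to the caller. */
Connection *ParseLogin(void)
{
    LoginAttrs attrs;
    ParseUser(&attrs);
    ParsePass(&attrs);
    return OpenConnection(attrs.username, attrs.password);
}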

During processing of the Login command, the parser gathers the username
and password returned from the children nodes and uses that information
to create a new connection attribute to pass up the tree. In this situation
the username, password, and connection are all acting as synthesised
attributes, working their way from the leaves upward to the parent. That
open connection is saved and then later passed downward into other
commands. The connection is being used as an inherited attribute when
processing Get and Logout; those commands are receiving information
from the parent about parts of the parse that have already happened (their
left siblings).
Bottom-Up SDT
Here is a simple expression grammar that has associativity and precedence
already built in.
E' > E
E  > T | E A T
T  > F | T M F
F  > (E) | int
A  > + | -
M  > * | /

During the bottom-up parse, as we push symbols onto the parse stack, we
will associate with each operand/expression symbol (E, T, F, etc.) an
integer value. For each operator (A, M) we will store the operator code.
When performing a reduction, we will synthesise the attribute for the
left-side nonterminal from the attributes of the right-side symbols, the
handle that is currently on top of the stack. See how the associated actions
below will evaluate the arithmetic expression during parsing.
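The actions might be written as the following yacc-style fragment (a sketch:
the INT token name and the use of character codes for the operators are
assumptions, and the definitions section is omitted).

E : T          { $$ = $1; }
  | E A T      { $$ = ($2 == '+') ? $1 + $3 : $1 - $3; }
  ;
T : F          { $$ = $1; }
  | T M F      { $$ = ($2 == '*') ? $1 * $3 : $1 / $3; }
  ;
F : '(' E ')'  { $$ = $2; }
  | INT        { $$ = $1; }       /* value supplied by the scanner */
  ;
A : '+'        { $$ = '+'; }
  | '-'        { $$ = '-'; }
  ;
M : '*'        { $$ = '*'; }
  | '/'        { $$ = '/'; }
  ;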

The attribute value of terminals is assumed to have been assigned by the
scanner. The attribute values of non-terminals are explicitly assigned in
the parser action code. The symbol attributes can usually be stored along
with the symbol itself in the bottom-up parse stack, which is mighty
convenient. However, given the way a bottom-up parser constructs the
leaves first and works its way up to the parent, it is trivial to support
synthesised attributes (like those needed in the expression parser), but
more awkward to allow for inherited attributes.
Attributes and Yacc
Access to attributes in yacc looks pretty much as shown above in the
abstract example, but let's make it more concrete with a real specification.
The syntax $1, $2, and so on is used to access the attribute of the nth
symbol on the right side of the production. The global variable yylval is
set by the scanner and that value is saved with the token when it is placed
on the parse stack. When a rule is reduced, a new state is placed on the
stack; the default behaviour is to just copy the attribute of $1 for that
new state, but this can be controlled by assigning to $$ in the action for
the rule. By default the attribute type is an integer, but it can be
extended to allow for storage of more diverse types using the %union
specification in the yacc input
file. Here's a simple infix calculator that shows all these techniques in
conjunction. It parses ordinary arithmetic expressions, uses a symbol table
for setting and retrieving values of simple variables, and even has some
primitive error-handling.
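A compact sketch of such a specification is given below; the token names
and the 26-entry sym array for single-letter variables are assumptions
consistent with the description, not the original listing.

%{
#include <stdio.h>
int sym[26];                        /* values of the variables a-z */
int yylex(void);
void yyerror(char *);
%}

%token INTEGER VARIABLE
%left '+' '-'
%left '*' '/'

%%

program:
        program statement '\n'
        |
        ;

statement:
        expr                        { printf("%d\n", $1); }
        | VARIABLE '=' expr         { sym[$1] = $3; }
        ;

expr:
        INTEGER
        | VARIABLE                  { $$ = sym[$1]; }
        | expr '+' expr             { $$ = $1 + $3; }
        | expr '-' expr             { $$ = $1 - $3; }
        | expr '*' expr             { $$ = $1 * $3; }
        | expr '/' expr             { $$ = $1 / $3; }
        | '(' expr ')'              { $$ = $2; }
        ;

%%

void yyerror(char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { yyparse(); return 0; }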


Here is the scanner to go with:
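A matching scanner can be sketched as follows (again with assumed details
rather than the original listing):

%{
#include <stdlib.h>
#include "y.tab.h"
void yyerror(char *);
%}

%%

[a-z]       { yylval = *yytext - 'a'; return VARIABLE; }
[0-9]+      { yylval = atoi(yytext); return INTEGER; }
[-+()=/*\n] { return *yytext; }
[ \t]       { /* skip whitespace */ }
.           { yyerror("invalid character"); }

%%

int yywrap(void) { return 1; }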


CHAPTER ELEVEN
Semantic Analysis
What is semantic analysis?
Parsing only verifies that the program consists of tokens arranged in a
syntactically-valid combination. We now move on to semantic analysis,
where we delve deeper to check whether they form a sensible set of
instructions in the programming language. Whereas any old noun phrase
followed by some verb phrase makes a syntactically-correct English
sentence, a semantically-correct one has subject-verb agreement, gender is
properly used, and the components go together to express an idea that
makes sense. For a program to be semantically correct, all variables,
functions, classes, etc. must be properly defined, expressions and variables
must be used in ways that respect the type system, access control must not
be violated, and so on. Semantic analysis is the next-to-last phase of the
front end and the compiler's last chance to weed out incorrect programs. We
need to ensure
the program is well-formed enough to continue on to the next phase where
we generate code.
A large part of semantic analysis consists of tracking
variable/function/type declarations and type checking. In many languages,
identifiers have to be declared before use. As the compiler encounters a
new declaration, it records the type information assigned to that identifier.
Then, as it continues examining the rest of the program, it verifies that the
type of an identifier is respected in terms of the operations being
performed. For example, the type of the right-side expression of an
assignment statement should match the type of the left-side, and the
left-side needs to be a properly declared and assignable identifier (i.e. not some
sort of constant). The parameters of a function should match the arguments
of a function call in both number and type. The language may require that
identifiers are unique, disallowing a global variable and function of the
same name. The operands to a multiplication operation will need to be of
numeric type, perhaps even the exact same type depending on the
strictness of the language. These are examples of the things checked in the
semantic analysis phase.
Semantic analysis can be done right in the midst of parsing. As a particular
construct is recognised, say an addition expression, the parser action
would be to check the two operands and verify they are of numeric type
and compatible for this operation. In fact, in a one-pass compiler, the code
is generated right then and there as well. In a compiler that runs in more
than one pass, the first pass only parses the input and builds the tree
representation of the program. Then in a second pass, we go back and
traverse the tree we built to verify the semantic rules of the language are
being respected. The single-pass strategy offers space and time


efficiencies, to be traded off against the clean separation of the multiple


passes and programmer conveniences (i.e. can often order things
arbitrarily in the source program).
Types and Declarations
We begin with some basic definitions to set the stage for performing
semantic analysis. A type is a set of values and a set of operations
operating on those values. There are three categories of types in most
programming languages:
Base types:      int, float, double, char, bool, etc. These are the
                 primitive types provided more or less directly by the
                 underlying machine. There may be a facility for
                 user-defined variants on the base types (such as C enums).

Compound types:  arrays, pointers, records, structs, unions, classes,
                 and so on. These types are constructed from
                 aggregations of the base types.

Complex types:   lists, stacks, queues, trees, heaps, tables, etc. You
                 may recognise these as ADTs. A language may or may not
                 have support for these sorts of higher-level abstractions.

In many languages, a programmer must first establish the name and type
of any data object (variable, function, type, etc). In addition, the
programmer usually defines the lifetime. A declaration is a statement in a
program that communicates to the compiler this information. The basic
declaration is just a name and type, but in many languages may include
modifiers that control visibility and lifetime (i.e. static in C, private in
Java). Some languages also allow declarations to initialise variables,
such as in C, where you can declare and initialise in one statement. The

following C statements show some example declarations:
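For instance (these particular declarations are illustrative rather than the
original figure):

double pi = 3.14159;            /* declare and initialise in one statement   */
int count, sum;                 /* two variables of the same base type       */
char *msg = "hello";            /* pointer to a compound type                */
static int calls;               /* modifier controlling visibility/lifetime  */
struct point { int x, y; } p;   /* user-defined compound type                */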


Function declarations or prototypes serve a similar purpose for functions
that variable declarations do for variables: they establish the type for
that identifier so that later usage of that identifier can be validated for
type-correctness. The compiler uses the prototype to check the number and
types of arguments in function calls. The location and qualifiers establish
the visibility of the function: is the function global, local to the module,
nested in another procedure, attached to a class, and so on. Type
declarations (i.e. C typedef, C++ classes) have similar behaviours with
respect to declaration and use of the new typename.
Type Checking
Type checking is the process of verifying that each operation executed in a
program respects the type system of the language. This generally means
that all operands in any expression are of appropriate types and number.
Much of what we do in the semantic analysis phase is type checking.
Sometimes the rules regarding operations are defined by other parts of the
code (as in function prototypes), and sometimes such rules are a part of the
definition of the language itself (as in both operands of a binary
arithmetic operation must be of the same type). If a problem is found,
e.g., one tries to add a char pointer to a double in C, we encounter a type
error. A language is considered strongly-typed if no type error goes
undetected, weakly-typed if there are various loopholes by which type
errors can slip through. Type checking can be done when the program is
compiled, when it is executed, or a combination of both.
Static type checking is done at compile-time. The information the type
checker needs is obtained via declarations and stored in the symbol table.
After this information is collected, the types involved in each operation are
checked. It is very difficult for a language that only does static type
checking to meet the full definition of strongly-typed. Even motherly old
Pascal, which would appear to be so because of its use of declarations and
strict type rules, cannot find every type error at compile-time. This is
because many type errors can sneak through the type checker. For
example, if a and b are of type int and we assign very large values to
them, a * b may not be in the acceptable range of ints, or an attempt to
compute the ratio between two integers may raise a division by zero.
These kinds of type errors cannot be detected at compile-time. C makes
a somewhat paltry attempt at strong type checking; things such as the lack of
array bounds checking, no enforcement of variable initialisation or
function return create loopholes. The typecast operation is particularly
dangerous. By taking the address of a location, casting to something
inappropriate, dereferencing and assigning, you can wreak havoc on the
type rules. The typecast basically suspends type checking, which, in
general, is a pretty risky thing to do.
Dynamic type checking is implemented by including type information for
each data location at runtime. For example, a variable of type double
would contain both the actual double value and some kind of tag
indicating double type. The execution of any operation begins by first
checking these type tags. The operation is performed only if everything
checks out. Otherwise, a type error occurs and usually halts execution. For
example, when an add operation is invoked, it first examines the type tags


of the two operands to ensure they are compatible. LISP is an example of a


language that relies on dynamic type checking. Because LISP does not
require the programmer to state the types of variables at compile-time, the
compiler cannot perform any analysis to determine if the type system is
being violated. But the runtime type system takes over during execution
and ensures that type integrity is maintained. Dynamic type checking
clearly comes with a runtime performance penalty, but it is usually much
more difficult to subvert and can report errors that are not possible to
detect at compile-time.
Many compilers have built-in functionality for correcting some type errors
that can occur. Implicit type conversion, or coercion, is when a compiler
finds a type error and then changes the type of the variable to an
appropriate type. This happens in C, for example, when an addition
operation is performed on a mix of integer and floating point values. The
integer values are implicitly promoted before the addition is performed. In
fact, any time a type error occurs in C, the compiler searches for an
appropriate conversion operation to insert into the compiled code to fix the
error. Only if no conversion can be done, does a type error occur. In a
language like C++, the space of possible automatic conversions can be
enormous, which makes the compiler run more slowly and sometimes
gives surprising results.
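For example, in C the following assignment quietly inserts an int-to-double
conversion (a small illustrative snippet, not taken from the text):

#include <stdio.h>

int main(void)
{
    int    i = 7;
    double d = i + 2.5;   /* i is implicitly promoted to double before the add */
    printf("%f\n", d);    /* prints 9.500000 */
    return 0;
}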
Other languages are much stricter about type-coercion. Ada and Pascal, for
example, provide almost no automatic coercions, requiring the
programmer to take explicit actions to convert between various numeric
types. The question of whether to provide a coercion capability or not is
controversial. Coercions can free a programmer from worrying about
details, but they can also hide serious errors that might otherwise have
popped up during compilation. PL/I compilers are especially notorious for
taking a minor error like a misspelled name and re-interpreting it in some
unintended way. Here's a classic example:
DECLARE (A, B, C) CHAR(3);
B = "123"; C = "456"; A = B + C;

The above PL/1 code declares A, B, and C each as 3-character


array/strings. It assigns B and C string values and then adds them together.
Wow, does that work? Sure, PL/I automatically coerces strings to numbers
in an arithmetic context, so it turns B and C into 123 and 456, then it adds
them to get 579. How about trying to assign a number to a string? Sure,
why not! It will convert a number to string if needed. However, herein lies
the rub: it converts the number to a string using a default width of 8, so
it actually converted the result to "     579" (five leading blanks). And
because A was declared to only hold 3 characters, it will truncate
(silently), and A gets assigned "   " (three blanks). Probably not what you
wanted!


Case study: ML data types


ML, or Meta-Language, is an important functional language developed in
Edinburgh in the 1970s. It was developed to implement a theorem prover,
but recently, it has gained popularity as a general purpose language. ML
deals with data types in a novel way. Rather than require declarations, ML
works very hard to infer the data types of the arguments to functions. For
example,
fun mult x = x * 10;

requires no type information because ML infers that x is an integer since it


is being multiplied by an integer. The only time a programmer must
supply a declaration is if ML cannot infer the types. For example,
fun sqr x = x * x;

would result in a type error because the multiplication operator is
overloaded, i.e., there exists a multiplication operation for reals and one
for integers. ML cannot determine which one to use in this function, so the
programmer would have to clarify:
fun sqr x:int = x * x;

The fact that types do not have to be declared unless necessary makes it
possible for ML to provide one of its most important features:
polymorphism. A polymorphic function is one that takes parameters of
different types on different activations. For example, a function that
returns the number of elements in a list:
fun length(L) = if L = nil then 0 else length (tl(L)) + 1;

(Note: tl is a built-in function that returns all the elements after the first
element of a list.) This function will work on a list of integers, reals,
characters, strings, lists, etc. Polymorphism is an important feature of most
object-oriented languages also. It introduces some interesting problems in
semantic analysis, as we will see a bit later.
Designing a Type Checker
When designing a type checker for a compiler, here is the process to
follow:
1. identify the types that are available in the language
2. identify the language constructs that have types associated with them
3. identify the semantic rules for the language
Type equivalence of compound types
The equivalence of base types is usually very easy to establish: an int is
equivalent only to an int, a bool only to a bool. Many languages also
require the ability to determine the equivalence of compound types. A
common technique in dealing with compound types is to store the basic
information defining the type in tree structures.
Here is a set of rules for building type trees:

arrays:   two subtrees, one for the number of elements and one for the base type
structs:  one subtree for each field
pointers: one subtree for the type being referenced by the pointer

If we store the type information in this manner, checking the equivalence
of two types turns out to be a simple recursive tree operation. Here's an
outline of a recursive test for structural equivalence:
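The original outline does not survive in this text, so here is a minimal
sketch in C of what such a recursive test might look like, assuming one
particular node layout (the TypeKind and TypeNode names are illustrative,
not taken from any specific compiler):

typedef enum { T_INT, T_BOOL, T_ARRAY, T_STRUCT, T_PTR } TypeKind;

typedef struct TypeNode {
    TypeKind kind;
    int num_elements;              /* used only when kind == T_ARRAY         */
    int num_children;              /* base type, field types, or pointee     */
    struct TypeNode **children;
} TypeNode;

/* Returns 1 if the two type trees are structurally equivalent, 0 otherwise. */
int AreEquivalent(TypeNode *a, TypeNode *b)
{
    if (a == b) return 1;                        /* same node, or both absent */
    if (a == NULL || b == NULL) return 0;
    if (a->kind != b->kind) return 0;
    if (a->kind == T_ARRAY && a->num_elements != b->num_elements) return 0;
    if (a->num_children != b->num_children) return 0;
    for (int i = 0; i < a->num_children; i++)
        if (!AreEquivalent(a->children[i], b->children[i]))
            return 0;
    return 1;
}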

The function looks simple enough, but is it guaranteed to terminate? What
if the type being compared is a record that is recursive (i.e. contains a
pointer to a record of the same type)? We need to be a bit more careful! To
get around this, we could mark the tree nodes during traversal to allow us
to detect a cycle, or limit equivalence on pointer types to name equivalence only.
User-defined types
The question of equivalence becomes more complicated when you add
user-defined types. Many languages allow users to define their own types
(e.g., using typedef in C, or type in Pascal). Here is a Pascal example:
type
little = array[1..5] of integer;
small = array[1..5] of integer;
big = array[1..10] of integer;
var
a, b: array[1..5] of integer;
c:    array[1..5] of integer;
d, e: little;
f, g: small;
h, i: big;

When are two types the same? Which of the types are equivalent in the
above example? It depends on how one defines equivalence; the two
main options are named versus structural equivalence. If the language
supports named equivalence, two types are the same if and only if they
have the same name. Thus d and e are type-equivalent, as are f and g,
and h and i. The variables a and b are also type-equivalent because they
have identical (but unnamed) types. (Any variables declared in the same
statement have the same type.) But c is a different, anonymous type. And
even though the small type is a synonym for little which is a
synonym for an array of 5 integers, Pascal, which only supports named
equivalence, does not consider d to be type-equivalent to a or f. The
more general form of equivalence is structural equivalence. Two types are
structurally equivalent if a recursive traversal of the two type definition
trees matches in entirety. Thus, the variables a-g are all structurally
equivalent but are distinct from h and i.
Which definition of equivalence a language supports is a part of the
definition of the language. This, of course, has an impact on the
implementation of the type checker of the compiler for the language.
Clearly, a language supporting named equivalence is much easier and
quicker to type check than one supporting structural equivalence. But there
is a trade-off. Named equivalence does not always reflect what is really
being represented in a user-defined type. Which version of equivalence
does C support? Do you know? How could you find out?
Type Compatibility and Subtyping
In addition to establishing rules for type equivalency, the type system also
defines type compatibility. Certain language constructs may require
equivalent types, but most allow for substitution of coercible or
compatible types.
We've already talked a bit about type coercion. An int and a double are not
type equivalent, but a function that takes a double parameter may allow an
integer argument to be passed because an integer can be coerced to a
double without loss of precision. The reverse may or may not be true: in C,
a double is substitutable for an int (it is truncated); in Java, a typecast is
required to force the truncation. This sort of automatic coercion affects
both the type checker and the code generator, since we need to recognise
which coercions are valid in a particular context and if required, generate
the appropriate instructions to actually do the conversion.
Subtypes are a way of designating freely compatible types. If a data type
has all of the behaviour and features of another type, such that it is freely
substitutable for that type, we say it is a subtype of that type. C's enums,
for example, allow you to define new subtypes of int, similar to Pascal's
subrange types. A subtype is compatible with its parent type, which means
an expression of the subtype can always be substituted where the general
type was expected. If a function expects an int as a parameter,
substituting an enum that will only have value 1 or 2, for example, is
perfectly fine. Thus the type checker needs to have an awareness of
compatible types to allow such a substitution.
In object-oriented languages, inheritance and interfaces allow other ways
for subtypes to be defined. The type checker allows an instance of a
subclass to be freely substituted for an instance of the parent class.
Scope Checking
A simple shell scripting language may require all variables to be declared
at the top-level and visible everywhere, thus throwing all identifiers into
one big pot and not allowing names to ever be re-used. More usually, a
language offers some sort of control over scope, allowing the visibility of
an identifier to be constrained to some subsection of the program. Global
variables and functions are available anywhere in the program. Local
variables are only visible within certain sections. To understand how this
is handled in a compiler, we need a few definitions. A scope is a section of
program text enclosed by basic program delimiters, e.g., {} in C, or
begin-end in Pascal. Many languages allow nested scopes that are scopes
defined within other scopes. The scope defined by the innermost such unit
is called the current scope. The scope defined by the current scope and by
any enclosing program units are known as open scopes. Any other scope is
a closed scope.
As we encounter identifiers in a program, we need to determine if the
identifier is accessible at that point in the program. This is called scope
checking. If we try to access a local variable declared in one function in
another function, we should get an error message. This is because only
variables declared in the current scope and in the open scopes containing
the current scope are accessible.
An interesting situation can arise if a variable is declared in more than one
open scope. Consider the following C program:
int a;
void Binky(int a) {
int a;
a = 2;
...
}
When we assign to a, should we use the global variable, the local variable,
or the parameter? Normally it is the innermost declaration, the one nearest
the reference, which wins out. Thus, the local variable is assigned the
value 2. When a variable name is re-used like this, we say the innermost
declaration shadows the outer one. Inside the Binky function, there is no
way to access the other two a variables because the local variable is
shadowing them and C has no mechanism to explicitly specify which
scope to search.
There are two common approaches to the implementation of scope
checking in a compiler. The first is to implement an individual symbol
table for each scope. We organise all these symbol tables into a scope
stack with one entry for each open scope. The innermost scope is stored at
the top of the stack, the next containing scope is underneath it, etc. When a
new scope is opened, a new symbol table is created and the variables
declared in that scope are placed in the symbol table. We then push the
symbol table on the stack. When a scope is closed, the top symbol table is
popped. To find a name, we start at the top of the stack and work our way
down until we find it. If we do not find it, the variable is not accessible and
an error should be generated.
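As a rough illustration (not tied to any particular compiler), a scope stack
with one symbol table per open scope might be sketched in C like this, using
a simple linked list as the per-scope table; all names here are made up for
the example:

#include <stdlib.h>
#include <string.h>

/* One entry in a scope's symbol table (a simple linked list per scope). */
typedef struct Symbol {
    const char *name;
    struct Symbol *next;
} Symbol;

#define MAX_SCOPES 128
static Symbol *scope_stack[MAX_SCOPES];   /* one table per open scope   */
static int scope_top = -1;                /* index of the current scope */

void OpenScope(void)  { scope_stack[++scope_top] = NULL; }

void CloseScope(void)
{
    Symbol *s = scope_stack[scope_top--];
    while (s) { Symbol *dead = s; s = s->next; free(dead); }
}

void Declare(const char *name)            /* add to the current scope   */
{
    Symbol *s = malloc(sizeof *s);
    s->name = name;
    s->next = scope_stack[scope_top];
    scope_stack[scope_top] = s;
}

/* Scope checking: search from the innermost open scope outward. */
Symbol *LookUp(const char *name)
{
    for (int i = scope_top; i >= 0; i--)
        for (Symbol *s = scope_stack[i]; s; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
    return NULL;                          /* not visible: report an error */
}

A declaration goes through Declare, and every identifier use goes through
LookUp; a NULL result at a use is exactly the "variable not accessible"
error described above.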
There is an important disadvantage to this approach, besides the obvious
overhead of creating additional symbol tables and doing the stack
processing. All global variables will be at the bottom of the stack, so scope
checking of a program that accesses a lot of global variables through many
levels of nesting can run slowly. The overhead of a table per scope can
also contribute to memory bloat in the compiler.
The other approach to the implementation of scope checking is to have a
single global table for all the scopes. We assign to each scope a scope
number. Each entry in the symbol table is assigned the scope number of
the scope it is contained in. A name may appear in the symbol table more
than once as long as each repetition has a different scope number.
When we encounter a new scope, we increment a scope counter. All
variables declared in this scope are placed in the symbol table and
assigned this scope's number. If we then encounter a nested scope, the
scope counter is incremented once again and any newly declared variables
are assigned this new number. Using a hash table, new names are always
entered at the front of the chains to simplify the searches. Thus, if we have
the same name declared in different nested scopes, the first occurrence of
the name on the chain is the one we want.
When a scope is closed, all entries with the closing scope number are
deleted from the table. Any previously shadowed variables will now be
accessible again. If we try to access a name in a closed scope, we will not
find it in the symbol table causing an error to be generated. The
disadvantage of the single combined symbol table is that closing a scope
can be an expensive operation if it requires traversing the entire symbol
table.
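A sketch of the single combined table, again only illustrative: names are
hashed into chains, new declarations are pushed at the front, and (as a
simplification of the scheme described above) the nesting depth is used as
the scope number rather than an ever-increasing counter:

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211

typedef struct Entry {
    const char *name;
    int scope;                      /* number of the scope that declared it */
    struct Entry *next;
} Entry;

static Entry *table[TABLE_SIZE];
static int current_scope = 0;

static unsigned Hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

void EnterScope(void) { current_scope++; }

void Declare(const char *name)
{
    Entry *e = malloc(sizeof *e);
    e->name = name;
    e->scope = current_scope;
    e->next = table[Hash(name)];    /* front of chain: shadows outer decls  */
    table[Hash(name)] = e;
}

/* The first matching entry on the chain is the innermost declaration. */
Entry *LookUp(const char *name)
{
    for (Entry *e = table[Hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;
}

/* Closing a scope removes every entry carrying that scope number; those
   entries always sit at the front of their chains, since any deeper scopes
   were already closed and removed. */
void ExitScope(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        while (table[i] && table[i]->scope == current_scope) {
            Entry *dead = table[i];
            table[i] = dead->next;
            free(dead);
        }
    current_scope--;
}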
There are two scoping rules that can be used in block-structured
languages: static scoping and dynamic scoping. In static scoping, a
function is called in the environment of its definition (i.e. its lexical
placement in the source text), whereas in dynamic scoping, a function is
called in the environment of its caller (i.e. using the runtime stack of
function calls). For a language like C, the point is moot, because functions
cannot be nested and can only access their local variables or global ones,
but not the local variables of any other functions. But other languages such
as Pascal or LISP allow non-local access and thus need to establish
scoping rules for this. If inside the Binky function, you access a non-local
identifier x, does it consider the static structure of the program (i.e. the
context in which Binky was defined, which may be nested inside other
functions in those languages)? Or does it use the dynamic structure to
examine the call stack to find the nearest caller who has such a named
variable? What if there is no x in the enclosing context? Can this be
determined at compile-time for static scoping? What about dynamic? What
kinds of data structures are necessary at compile-time and run-time to
support static or dynamic scoping? What can you do with static scoping
that you can't with dynamic or vice versa? Over time, static scoping has
won out over dynamic; what might be the reason it is preferred?
Object-Oriented Issues
Consider the two distinct pieces of a class definition: its interface and its
implementation. Some object-oriented languages tie the two together and a
subclass inherits both. In such a case, the subclass has an "is-a"
relationship with its parent, and an instance of the subclass can be
substituted wherever an instance of the superclass is needed; it is a
subtype.
A language may support inheritance of implementation without interface,
via a feature like C++ private inheritance. The new class has the behavior
internally, but it is not exposed externally. A Stack class might use private
inheritance from a generic Vector class to get all the code for managing a
linear collection but without allowing clients to call methods like insertAt
that defeat the stack discipline. In such a case, the Stack is not a subtype of
a Vector. (A similar effect could also be achieved by composition, keeping a
Vector as an instance variable inside the Stack class.)
It is also possible to inherit interface without implementation through a
feature such as Java interfaces. An interface is a listing of method
prototypes only. There is no code, no instance variables, nothing more
than declarations establishing the function signatures. When a class
declares that it implements the interface, it is required to provide the
implementation for all of the required methods. In this case, the
relationship between the interface and the class is an agreement to
adhere to a behaviour, not a shared implementation. The class is a
subtype of the interface type.
In a common form of inheritance, a derived class inherits both interface
and implementation from the parent. Often a copy-based approach is used
to implement inheritance. The storage for each derived class or subtype
contains all the information needed from the parent because it was copied
directly into the object.
Another issue in object-oriented languages is polymorphism. We saw this
feature earlier when discussing ML, but the term takes on a slightly
different meaning in a language like C++ or Java. Polymorphism in C++
refers to the situation in which objects of different classes can respond to
the same messages. For example, if we have classes for boat, airplane, and
car, the objects of these classes might all understand the message go(), but
the action will be different for each class. In C++, polymorphism is
implemented using virtual functions. Methods of the same name can have
different definitions in different classes. For example, consider this excerpt
from a hypothetical drawing program:
class Shape {
public:
virtual void Draw();
virtual void Rotate(int degrees);
...
};
class Rect: public Shape {
public:
void Draw();
void Rotate(int degrees) {}
...
};
class Oval: public Shape {
public:
void Draw();
void Rotate(int degrees) {}
...
};

If we have an array of different shape objects, we can rotate all of them by
placing the following statement inside a loop that traverses the array:
shapes[i]->Rotate(45);

We are rotating all the objects in the list without providing any type
information. The receiving object itself will respond with the correct
implementation for its shape type.
The primary difference between virtual functions and non-virtual functions
is their binding times. Binding means associating a name with a definition
or storage location. In C++, the names of nonvirtual functions are bound at
compile time. The names of virtual functions are bound at run-time, at the
time of the call to the method. Thus, the binding is determined by the class
of the object at the time of the function call. To implement this, each
virtual method in a derived class reserves a slot in the class definition
record, which is created at run-time. A constructor fills in this slot with the
location of the virtual function defined in the derived class, if one exists. If
it does not exist, it fills in the location with the function from the base
class.
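To make the run-time binding concrete, here is a rough C sketch of the idea
behind such a table of function pointers for the Shape/Rect example; the
layout and names are purely illustrative and not how any particular C++
compiler actually arranges things:

#include <stdio.h>

typedef struct Shape Shape;

typedef struct {                        /* one slot per virtual method        */
    void (*Draw)(Shape *self);
    void (*Rotate)(Shape *self, int degrees);
} VTable;

struct Shape {
    const VTable *vtable;               /* filled in when the object is built */
    /* instance variables would follow here */
};

static void RectDraw(Shape *self)            { printf("drawing a rect\n"); }
static void RectRotate(Shape *self, int deg) { printf("rotating rect %d\n", deg); }

static const VTable rect_vtable = { RectDraw, RectRotate };

void InitRect(Shape *r)                 /* plays the role of the constructor  */
{
    r->vtable = &rect_vtable;           /* bind the derived-class versions    */
}

void RotateAny(Shape *s, int degrees)   /* a "virtual call"                   */
{
    s->vtable->Rotate(s, degrees);      /* dispatch through the object's table */
}

The call through s->vtable is what makes the binding depend on the class of
the object at run-time rather than on the static type of the pointer.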
CHAPTER TWELVE
Intermediate Representations
Most compilers translate the source program first to some form of
intermediate representation (IR) and convert from there into machine
code. The intermediate representation is a machine- and language-independent
version of the original source code. Although converting the
code twice introduces another step and thus incurs a loss in compiler
efficiency, use of an intermediate representation provides advantages in
increased abstraction, cleaner separation between the front and back ends,
and added possibilities for re-targeting/cross-compilation. Intermediate
representations also lend themselves to supporting advanced compiler
optimisations, and most optimisation is done on this form of the code.
There are many intermediate representations in use (one author suggests
there may be as many as one for each existing compiler) but the
various representations are actually more alike than they are different.
Once you become familiar with one, it's not hard to learn others.
Intermediate representations are usually categorised according to where
they fall between a high-level language and machine code. IRs that are
close to a high-level language are called high-level IRs, and IRs that are
close to assembly are called low-level IRs. For example, a high-level IR
might preserve things like array subscripts or field accesses whereas a
low-level IR converts those into explicit addresses and offsets. For example,
consider the following three code examples (from Muchnick), offering
three translations of a 2-dimensional array access:

The thing to observe here isn't so much the details of how this is done (we
will get to that later), as the fact that the low-level IR has different
information than the high-level IR. What information does a high-level IR
have that a low-level one does not? What information does a low level IR
have that a high-level one does not? What kind of optimisation might be
possible in one form that might not in another?
High-level IRs usually preserve information such as loop structure and
if-then-else statements. They tend to reflect the source language they are
compiling more than lower-level IRs. Medium-level IRs often attempt to
be independent of both the source language and the target machine.
Low-level IRs tend to reflect the target architecture very closely, and as such are
often machine dependent. They differ from actual assembly code in that
there may be choices for generating a certain sequence of operations, and
the IR stores this data in such a way as to make it clear that a choice must
be made. Sometimes a compiler will start out with a high-level IR, perform
some optimisations, translate the result to a lower-level IR and optimise
again, then translate to a still lower IR, and repeat the process until final
code generation.
Abstract Syntax Trees
A parse tree is an example of a very high-level intermediate
representation. You can usually completely reconstruct the actual source
code from a parse tree since it contains all the information about the parsed
program. (It's fairly unusual that you can work backwards in that way
from most IRs since much information has been removed in translation).
More likely, a tree representation used as an IR is not quite the literal parse
tree (intermediate nodes may be collapsed, grouping units can be
dispensed with, etc.), but it is winnowed down to the structure sufficient to
drive the semantic processing and code generation. Such a tree is usually
referred to as an abstract syntax tree. In the programming projects so far,
you have already been immersed in creating and manipulating such a tree.
Each node represents a piece of the program structure and the node will
have references to its children subtrees (or none if the node is a leaf) and
possibly also have a reference to its parent.
Consider the following excerpt of a programming language grammar:
program        -> function_list
function_list  -> function_list function | function
function       -> PROCEDURE ident ( params ) body
params         -> ...

A sample program for this language:


PROCEDURE main()
BEGIN
statement...
END
PROCEDURE factorial(n:INTEGER)
BEGIN
statement...
END

The literal parse tree for the sample program looks something like:
Here is what the abstract syntax tree looks like (notice how some pieces
like the parentheses and keywords are no longer needed in this representation):

The parser actions to construct the tree might look something like this:
function: PROCEDURE ident ( params ) body
{ $$ = MakeFunctionNode($2, $4, $6); }
function_list: function_list function
{ $$ = $1; $1->AppendNode($2); }

What about the terminals at the leaves? Those nodes have no children;
usually these will be nodes that represent constants and simple variables.
When we recognise those parts of the grammar that represent leaf nodes,
we store the data immediately in that node and pass it upwards for joining
in the larger tree.
constant : int_constant
{ $$ = MakeIntConstantNode($1); }

Generating Assembly Code from Syntax Trees


With an abstract syntax tree in place, we can now explore how it can be
used to drive a translation. As an example application, consider how a
syntax tree for an arithmetic expression might be used to directly generate
assembly language code. Let's assume the assembly we are working with
is for a simple machine that has a collection of numbered registers. Its
limited set of operations includes LOAD and STORE to read and write
from memory and two-address forms of ADD, MULT, etc. which
overwrite the first operand with the result. We want to translate
expressions into the proper sequence of assembly instructions during a
syntax-directed translation. Here is a parse tree representing an arithmetic
expression:
In going from the parse tree to the abstract syntax tree, we get rid of the
unnecessary non-terminals, and leave just the core nodes that need to be
there for code generation:

Here is one possible data structure for an abstract expression tree:


typedef struct _tnode {
char label;
struct _tnode *lchild, *rchild;
} tnode, *tree;

To generate code for the entire tree, we first generate code for each of the
subtrees, storing the result in some agreed-upon location (usually a
register), and then combine those results. The function GenerateCode
below takes two arguments: the subtree for which it is to generate
assembly code and the number of the register in which the computed result
will be stored.
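The listing for GenerateCode is not reproduced in this text, so the following
is a C sketch of what it might look like, built on the tnode structure above
and on the description that follows; the IsOperator and OpName helpers are
assumptions introduced just for this example:

#include <stdio.h>
#include <string.h>

static int IsOperator(char label)       /* leaf labels are variable names    */
{
    return strchr("+-*/", label) != NULL;
}

static const char *OpName(char label)   /* two-address mnemonic for the label */
{
    switch (label) {
        case '+': return "ADD";
        case '-': return "SUB";
        case '*': return "MULT";
        default:  return "DIV";
    }
}

/* Generates code for subtree t, leaving its value in register number result. */
void GenerateCode(tree t, int result)
{
    if (!IsOperator(t->label)) {
        printf("LOAD %c, R%d\n", t->label, result);
    } else {
        GenerateCode(t->lchild, result);        /* left value into Rresult    */
        GenerateCode(t->rchild, result + 1);    /* right value into Rresult+1 */
        /* two-address form: Rresult = Rresult op Rresult+1 */
        printf("%s R%d, R%d\n", OpName(t->label), result, result + 1);
    }
}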
In the first line of GenerateCode, we test if the label of the root node is an
operator. If it's not, we emit a load instruction to fetch the current value of
the variable and store it in the result register. If the label is an operator, we
call GenerateCode recursively for the left and right expression subtrees,
storing the results in the result register and the next higher numbered
register, and then emit the instruction applying the operator to the two
results. Note that the code as written above will only work if the number
of available registers is greater than the height of the expression tree. (We
could certainly be smarter about re-using them as we move through the
tree, but the code above is just to give you the general idea of how we go
about generating the assembly instructions).
Let's trace a call to GenerateCode for the following tree:

The initial call to GenerateCode is with a pointer to the '+' and result
register 0.
We end up with this set of generated instructions:


LOAD a, R0
LOAD b, R1
LOAD c, R2
SUB R1, R2
LOAD d, R2
MULT R1, R2
ADD R0, R1

Notice how using the tree height for the register number (adding one as we
go down the side) allows our use of registers to not conflict. It also reuses
registers (R2 is used for both c and d). It is clearly not an optimal
strategy for assigning registers.
A quick aside: Directed Acyclic Graphs
In a tree, there is only one path from the root to each leaf. In
compiler terms, this means there is only one route from the start symbol to
each terminal. When using trees as intermediate representations, it is often
the case that some subtrees are duplicated. A logical optimisation is to
share the common subtree. We now have a data structure with more than
one path from start symbol to terminals. Such a structure is called a
directed acyclic graph (DAG). They are harder to construct internally, but
provide an obvious savings in space. They also highlight equivalent
sections of code and that will be useful later when we study optimisation
techniques, such as only computing the needed result once and saving it,
rather than re-generating it several times.
a * b + a * b;
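One common way to build such a DAG is to check, before creating a node,
whether an identical node already exists and re-use it if so. A small
illustrative C sketch (names invented for the example):

#include <stdlib.h>

typedef struct DagNode {
    char op;                        /* operator, or the variable name at a leaf */
    struct DagNode *left, *right;   /* NULL for leaves                          */
    struct DagNode *next;           /* chains every node built so far           */
} DagNode;

static DagNode *all_nodes = NULL;

DagNode *MakeNode(char op, DagNode *left, DagNode *right)
{
    for (DagNode *n = all_nodes; n != NULL; n = n->next)
        if (n->op == op && n->left == left && n->right == right)
            return n;               /* common subtree: share the existing node  */
    DagNode *n = malloc(sizeof *n);
    n->op = op;
    n->left = left;
    n->right = right;
    n->next = all_nodes;
    all_nodes = n;
    return n;
}

For a * b + a * b, the second attempt to build a '*' node with the same
children returns the node built the first time, so the '+' node ends up with
two references to a single shared subtree.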
Gcc's Intermediate Representation


The gcc/g++ compiler uses an abstract syntax tree to capture the results
from its yacc/bison parser. The compiler is written in C, not C++, thus the
nodes are not C++ objects. Instead, each tree is a C pointer to a structure
with a tag field to identify the node type (function, for loop, etc,) and there
are various functions and macros to pick out the individual fields that are
appropriate for each type of node. The tree representation is used only in
passing; it is almost immediately translated to an intermediate language
called RTL (register-transfer language). RTL is inspired by Lisp lists. It
has both an internal form, made up of structures that point at other
structures, and a textual form that is used in the machine description and in
printed debugging dumps. The textual form uses nested parentheses to
indicate the pointers in the internal form. Here is the RTL output for a
simple "hello, world" program:


RTL is a fairly low-level IR. It assumes a general purpose register machine
and incorporates some notion of register allocation and instruction
scheduling. The gcc compiler does most of its optimisations on the RTL
representation, saving only machine-dependent tweaks to be done as part
of final code generation.
Java's Intermediate Representation
A Java compiler compiles to bytecode which is itself an intermediate
representation. Here is the Java bytecode generated for a simple "hello,
world" program:
Method Main()
0 aload_0
1 invokespecial #1 <Method java.lang.Object()>
4 return
Method void main(java.lang.String[])
0 getstatic #2 <Field java.io.PrintStream out>
3 ldc #3 <String "Hello world">
5 invokevirtual #4 <Method void
println(java.lang.String)>
8 return

Java bytecode is a fairly high-level IR. It is based on a stack-based machine
architecture and includes abstract notions such as getstatic and
invokevirtual along with more low-level instructions such as ldc (load
constant) and add.
A Java compiler usually stops after producing bytecode. Bytecode is
targeted to a virtual machine; although there are, in fact, hardware
embodiments of this machine (e.g. the "Java chip" CPU), in most cases
the bytecodes need to be further translated to machine code. The compiler
does not handle the usual back-end task of converting the IR to machine
code; it is instead done by an interpreter within the Java virtual machine
(or possibly by a JIT code generator instead of an interpreter).
It is possible to build a Java compiler that goes all the way to native machine
code by statically translating the bytecodes to machine code. Rather than
start from scratch, plug a bytecode-to-RTL translator between a Java front
end and the gcc back-end, and you have native Java compilation for the
whole collection of architectures that gcc supports. This converter is not
exactly a trivial piece of code to write, but it's certainly easier than starting
from scratch on the whole thing.
CHAPTER THIRTEEN
Code Optimisation
Optimisation is the process of transforming a piece of code to make it more
efficient (either in terms of time or space) without changing its output or
side-effects. The only difference visible to the code's user should be that it
runs faster and/or consumes less memory. The name is really a misnomer,
since it implies you are finding an optimal solution; in truth,
optimisation aims to improve, not perfect, the result.
Optimisation is the field where most compiler research is done today. The
tasks of the front-end (scanning, parsing, semantic analysis) are well
understood and unoptimised code generation is relatively straightforward.
Optimisation, on the other hand, still retains a sizable measure of
mysticism. High-quality optimisation is more of an art than a science.
Compilers for mature languages aren't judged by how well they parse or
analyse the code (you just expect it to do it right with a minimum of
hassle) but instead by the quality of the object code they produce.
Many optimisation problems are NP-complete and thus most optimisation
algorithms rely on heuristics and approximations. It may be possible to
come up with a case where a particular algorithm fails to produce better
code or perhaps even makes it worse. However, the algorithms tend to do
rather well overall.
It's worth reiterating here that efficient code starts with intelligent
decisions by the programmer. No one expects a compiler to replace
BubbleSort with Quicksort. If a programmer uses a lousy algorithm, no
amount of optimisation can make it zippy. In terms of big-O, a compiler
can only make improvements to constant factors. But, all else being equal,
you want an algorithm with low constant factors.
First let me note that you probably shouldn't try to optimise the way we
will discuss in your favourite high-level language. Consider the following
two code snippets, each of which walks through an array and sets every
element to one. Which one is faster?
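The original pair of snippets is not reproduced here; the classic version of
this comparison contrasts a plain indexed loop with a hand-optimised pointer
loop, something like the following C:

/* Version 1: straightforward indexed loop. */
void SetAll1(int a[], int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 1;
}

/* Version 2: the "hand-optimised" pointer-walking loop. */
void SetAll2(int a[], int n)
{
    for (int *p = a; p < a + n; p++)
        *p = 1;
}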

You will invariably encounter people who think the second one is faster.
And they are probably right, if using a compiler without optimisation.
But many modern compilers emit the same object code for both, by use of
clever techniques (in particular, this one is called loop-induction variable
elimination) that work particularly well on idiomatic usage. The moral of
this story is that most often you should write code that is easier to
understand and let the compiler do the optimisation.
Correctness above all!
It may seem obvious, but it bears repeating that optimisation should not
change the correctness of the generated code. Transforming the code to
something that runs faster but incorrectly is of little value. It is expected
that the unoptimised and optimised variants give the same output for all
inputs. This may not hold for an incorrectly written program (e.g., one that
uses an uninitialised variable).
When and in what representation to Optimise
There are a variety of tactics for attacking optimisation. Some techniques
are applied to the intermediate code, to streamline, rearrange, compress,
etc. in an effort to reduce the size of the abstract syntax tree or shrink the
number of TAC instructions. Others are applied as part of final code
generationchoosing which instructions to emit, how to allocate registers
and when/what to spill, and the like. And still other optimisations may
occur after final code generation, attempting to re-work the assembly code
itself into something more efficient.
Optimisation can be very complex and time-consuming; it often involves
multiple sub-phases, some of which are applied more than once. Most
compilers allow optimisation to be turned off to speed up compilation (gcc
even has specific flags to turn on and off individual optimisations).
Control-Flow Analysis
Consider all that has happened up to this point in the compiling process:
lexical analysis, syntactic analysis, semantic analysis and finally
intermediate-code generation. The compiler has done an enormous amount
of analysis, but it still doesn't really know how the program does what it
does. In control-flow analysis, the compiler figures out even more
information about how the program does its work, only now it can assume
that there are no syntactic or semantic errors in the code.
As an example, here is TAC code that computes Fibonacci numbers:
Control-flow analysis begins by constructing a control-flow graph, which
is a graph of the different possible paths program flow could take through
a function. To build the graph, we first divide the code into basic blocks. A
basic block is a segment of the code that a program must enter at the
beginning and exit only at the end. This means that only the first statement
can be reached from outside the block (there are no branches into the
middle of the block) and all statements are executed consecutively after
the first one is (no branches or halts until the exit). Thus a basic block has
exactly one entry point and one exit point. If a program executes the first
instruction in a basic block, it must execute every instruction in the block
sequentially after it.
A basic block begins in one of several ways:
the entry point into the function
the target of a branch (in our example, any label)
the instruction immediately following a branch or a return
A basic block ends in any of the following ways:
a jump statement
a conditional or unconditional branch
a return statement
Using the rules above, let's divide the Fibonacci TAC code into basic
blocks:
Now we can construct the control-flow graph between the blocks. Each
basic block is a node in the graph, and the possible different routes a
program might take are the connections, i.e. if a block ends with a branch,
there will be a path leading from that block to the branch target. The
blocks that can follow a block are called its successors. There may be
multiple successors or just one. Similarly the block may have many, one,
or no predecessors.
Connect up the flow graph for Fibonacci basic blocks given above. What
does an if-then-else look like in a flow graph? What about a loop?
You probably have all seen the gcc warning or javac error about:
Unreachable code at line XXX. How can the compiler tell when code is
unreachable?
Local Optimisations
Optimisations performed exclusively within a basic block are called local
optimisations. These are typically the easiest to perform since we do not
consider any control flow information; we just work with the statements
within the block. Many of the local optimisations we will discuss have
corresponding global optimisations that operate on the same principle, but
require additional analysis to perform. We'll consider some of the more
common local optimisations as examples.
Constant Folding
Constant folding refers to the evaluation at compile-time of expressions
whose operands are known to be constant. In its simplest form, it involves
determining that all of the operands in an expression are constant-valued,
performing the evaluation of the expression at compile-time, and then
replacing the expression by its value. If an expression such as 10+2*3 is
encountered, the compiler can compute the result at compile-time (16) and
emit code as if the input contained the result rather than the original
expression. Similarly, constant conditions, such as a conditional branch if
a < b goto L1 else goto L2 where a and b are constant can be
replaced by a Goto L1 or Goto L2 depending on the truth of the
expression evaluated at compile-time.
The constant expression has to be evaluated at least once, but if the
compiler does it, it means you don't have to do it again as needed during
runtime. One thing to be careful about is that the compiler must obey the
grammar and semantic rules from the source language that apply to
expression evaluation, which may not necessarily match the language you
are writing the compiler in. (For example, if you were writing an APL
compiler, you would need to take care that you were respecting its
Iversonian precedence rules). It should also respect the expected treatment
of any exceptional conditions (divide by zero, over/underflow).
Consider the unoptimised TAC translation on the left, which is transformed
by constant-folding on the right:
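The side-by-side TAC figure is not reproduced in this text. To show the idea
in code instead, here is a small C sketch that folds constant sub-expressions
in an expression tree; the Expr layout is invented for the example:

typedef struct Expr {
    char op;                   /* '+', '-', '*', '/', or 0 for a constant leaf */
    int value;                 /* meaningful only when op == 0                 */
    struct Expr *left, *right;
} Expr;

/* Folds constants bottom-up, in place.  Returns 1 if e is now a constant. */
int Fold(Expr *e)
{
    if (e->op == 0)
        return 1;
    int lconst = Fold(e->left);
    int rconst = Fold(e->right);
    if (!lconst || !rconst)
        return 0;
    if (e->op == '/' && e->right->value == 0)
        return 0;                           /* leave divide-by-zero for runtime */
    int l = e->left->value, r = e->right->value;
    switch (e->op) {
        case '+': e->value = l + r; break;
        case '-': e->value = l - r; break;
        case '*': e->value = l * r; break;
        case '/': e->value = l / r; break;
    }
    e->op = 0;                          /* the operator node becomes a constant */
    return 1;
}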

Constant-folding is what allows a language to accept constant expressions
where a constant is required (such as a case label or array size) as in these
C language examples:
int arr[20 * 4 + 3];
switch (i) {
case 10 * 5: ...
}

In both snippets shown above, the expression can be resolved to an integer
constant at compile time and thus, we have the information needed to
generate code. If either expression involved a variable, though, there
would be an error. How could you re-write the grammar to allow
constant folding to be done in case statements? This situation is a
classic example of the gray area between syntactic and semantic analysis.
Constant Propagation
If a variable is assigned a constant value, then subsequent uses of that
variable can be replaced by the constant as long as no intervening
assignment has changed the value of the variable. Consider this section
from our earlier Fibonacci example. On the left is the original, on the right
is the improved version after constant propagation, which saves three
instructions and removes the need for three temporary variables:

Constant propagation is particularly important in RISC architectures
because it moves integer constants to the place they are used. This may
reduce both the number of registers needed and instructions executed. For
example, MIPS has an addressing mode that uses the sum of a register and
a constant offset, but not one that uses the sum of two registers.
Propagating a constant to such an address construction can eliminate the
extra add instruction as well as the additional registers needed for that
computation. Here is an example of this in action. First, we have the
unoptimised version, TAC on left, MIPS on right:

A bit of constant propagation and a little rearrangement on the second load
instruction cuts the number of registers needed from 4 to 2 and the
number of instructions likewise in the optimised version:
_tmp0 = *(arr + 12) ;

lw $t0, -8($fp)
lw $t1, 12($t0)

Algebraic Simplification and Reassociation


Simplifications use algebraic properties or particular operator-operand
combinations to simplify expressions. Reassociation refers to using
properties such as associativity, commutativity and distributivity to
rearrange an expression to enable other optimisations such as
constant-folding or loop-invariant code motion.
The most obvious of these are the optimisations that can remove useless
instructions entirely via algebraic identities. The rules of arithmetic can
come in handy when looking for redundant calculations to eliminate.
Consider the examples below, which allow you to replace an expression
on the left with a simpler equivalent on the right:
x+0 = x
0+x = x
x*1 = x
1*x = x
0/x = 0
x-0 = x
b && true = b
b && false = false
b || true = true
b || false = b

The use of algebraic rearrangement can restructure an expression to enable
constant-folding or common sub-expression elimination and so on.
Consider the unoptimised TAC code on the left and rearranged and
constant-folded TAC on right:

Operator Strength Reduction


Operator strength reduction replaces an operator by a "less expensive" one.
Given each group of identities below, which operations are the most and
least expensive, assuming f is a float and i is an int? (Trick question: it
may differ for differing architectures; you need to know your target
machine to optimise well!)
a) i*2 = 2*i = i+i = i << 1
b) i/2 = (int)(i*0.5)
c) 0-i = -i
d) f*2 = 2.0 * f = f + f
e) f/2.0 = f*0.5
Strength reduction is often performed as part of loop-induction variable
elimination. An idiomatic loop to zero all the elements of an array might
look like this in TAC:
Each time through the loop, we multiply i by 4 (the element size) and add
to the array base. Instead, we could maintain the address to the current
element and instead just add 4 each time:
      _tmp4 = arr ;
L0:   _tmp2 = i < 100 ;
      IfZ _tmp2 Goto _L1 ;
      *_tmp4 = 0 ;
      _tmp4 = _tmp4 + 4 ;
      i = i + 1 ;
      Goto L0 ;
_L1:

This eliminates the multiplication entirely and reduces the need for an
extra temporary. By rewriting the loop termination test in terms of arr, we
could remove the variable i entirely and not bother tracking and
incrementing it at all.
Copy Propagation
This optimisation is similar to constant propagation, but generalised to
non-constant values. If we have an assignment a = b in our instruction
stream, we can replace later occurrences of a with b (assuming there are
no changes to either variable in-between). Given the way we generate
TAC code, this is a particularly valuable optimisation since it is able to
eliminate a large number of instructions that only serve to copy values
from one variable to another.
The code on the left makes a copy of tmp1 in tmp2 and a copy of tmp3 in
tmp4. In the optimised version on the right, we eliminated those
unnecessary copies and propagated the original variable into the later uses:

We can also drive this optimisation "backwards", where we can recognise
that the original assignment made to a temporary can be eliminated in
favour of direct assignment to the final goal:

Dead Code Elimination


If an instruction's result is never used, the instruction is considered dead
and can be removed from the instruction stream. So if we have
tmp1 = tmp2 + tmp3 ;
and tmp1 is never used again, we can eliminate this instruction altogether.
However, we have to be a little careful about making assumptions, for
example, if tmp1 holds the result of a function call:
tmp1 = LCall _Binky;

Even if tmp1 is never used again, we cannot eliminate the instruction
because we can't be sure the called function has no side-effects. Dead
code can occur in the original source program but is more likely to have
resulted from some of the optimisation techniques run previously.
Common Sub-Expression Elimination
Two operations are common if they produce the same result. In such a
case, it is likely more efficient to compute the result once and reference it
the second time rather than re-evaluate it. An expression is alive if the
operands used to compute the expression have not been changed. An
expression that is no longer alive is dead.

What sub-expressions can be eliminated? How can valid common
sub-expressions (live ones) be determined? Here is an optimised version, after
constant folding and propagation and elimination of common sub-expressions:
tmp2 = -x ;
x = 21 * tmp2 ;
tmp3 = x * x ;
tmp4 = x / y ;
y = tmp3 + tmp4 ;
tmp5 = x / y ;
z = tmp5 / tmp3 ;
y = z ;

Global Optimisations and Data-Flow Analysis


So far we have only been considering changes within one basic block.
With some additional analysis, we can apply similar optimisations across
basic blocks, making them global optimisations. It's worth pointing out
that global in this case does not mean across the entire program. We
usually only optimise one function at a time. Inter-procedural analysis is
an even larger task, one not even attempted by some compilers.
The additional analysis the optimiser must do to perform optimisations
across basic blocks is called data-flow analysis. Data-flow analysis is
much more complicated than control-flow analysis, and we can only
scratch the surface here.
Let's consider a global common sub-expression elimination optimisation
as our example. Careful analysis across blocks can determine whether an
expression is alive on entry to a block. Such an expression is said to be
available at that point. Once the set of available expressions is known,
common sub-expressions can be eliminated on a global basis.
Each block is a node in the flow graph of a program. The successor set
(succ(x)) for a node x is the set of all nodes that x directly flows into. The
predecessor set (pred(x)) for a node x is the set of all nodes that flow
directly into x. An expression is defined at the point where it is assigned a
value and killed when one of its operands is subsequently assigned a new
value. An expression is available at some point p in a flow graph if every
path leading to p contains a prior definition of that expression which is not
subsequently killed.
avail[B] = set of expressions available on entry to block B
exit[B]  = set of expressions available on exit from B

avail[B] = ∩ exit[x] for all x ∈ pred[B]
           (i.e. B has available the intersection of the exit sets of its
           predecessors)

killed[B]  = set of the expressions killed in B
defined[B] = set of expressions defined in B

exit[B]  = avail[B] - killed[B] + defined[B]
avail[B] = ∩ (avail[x] - killed[x] + defined[x]) for all x ∈ pred[B]
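These equations are usually solved by iterating until nothing changes. Below
is a minimal C sketch of that computation, assuming a bit-vector with one bit
per distinct expression (at most 64 of them) and block 0 as the function
entry; all names are illustrative:

#define MAX_BLOCKS 64

typedef unsigned long long BitSet;      /* one bit per expression, <= 64 exprs */

int    num_blocks;
int    num_preds[MAX_BLOCKS];
int    pred[MAX_BLOCKS][MAX_BLOCKS];
BitSet defined[MAX_BLOCKS], killed[MAX_BLOCKS]; /* computed per block beforehand */
BitSet avail[MAX_BLOCKS],   exitset[MAX_BLOCKS]; /* exitset[B] is exit[B] above  */

void ComputeAvailable(void)
{
    for (int b = 0; b < num_blocks; b++) {       /* entry block has nothing avail */
        avail[b]   = (b == 0) ? 0 : ~0ULL;       /* other blocks start optimistic */
        exitset[b] = (avail[b] & ~killed[b]) | defined[b];
    }
    int changed = 1;
    while (changed) {                            /* iterate to a stable fixed point */
        changed = 0;
        for (int b = 1; b < num_blocks; b++) {
            BitSet in = ~0ULL;
            for (int i = 0; i < num_preds[b]; i++)
                in &= exitset[pred[b][i]];       /* intersection over predecessors */
            BitSet out = (in & ~killed[b]) | defined[b];
            if (in != avail[b] || out != exitset[b]) {
                avail[b]   = in;
                exitset[b] = out;
                changed = 1;
            }
        }
    }
}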

Here is an algorithm for global common sub-expression elimination:


1) First, compute defined and killed sets for each basic block (this does
   not involve any of its predecessors or successors).
2) Iteratively compute the avail and exit sets for each block by running
   the following algorithm until you hit a stable fixed point:
   a) Identify each statement s of the form a = b op c in some block B
      such that b op c is available at the entry to B and neither b nor c
      is redefined in B prior to s.
   b) Follow flow of control backward in the graph, passing back to but
      not through each block that defines b op c. The last computation of
      b op c in such a block reaches s.
   c) After each computation d = b op c identified in step 2a, add
      statement t = d to that block, where t is a new temp.
   d) Replace s by a = t.
Try an example to make things clearer:
main:
BeginFunc 28;
b = a + 2 ;
c = 4 * b ;
tmp1 = b < c;
ifNZ tmp1 goto L1 ;
b = 1 ;
L1:
d = a + 2 ;
EndFunc ;

First, divide the code above into basic blocks. Now calculate the available
expressions for each block. Then find an expression available in a block
and perform step 2c above. What common subexpression can you share
between the two blocks?
What if the above code were:
main:
BeginFunc 28;
b = a + 2 ;
c = 4 * b ;
tmp1 = b < c ;
IfNZ tmp1 Goto L1 ;
b = 1 ;
z = a + 2 ;   <=== an additional line here
L1:
d = a + 2 ;
EndFunc ;

Code Motion
Code motion (also called code hoisting) unifies sequences of code
common to one or more basic blocks to reduce code size and potentially
avoid expensive re-evaluation. The most common form of code motion is
loop-invariant code motion that moves statements that evaluate to the
same value every iteration of the loop to somewhere outside the loop.
What statements inside the following TAC code can be moved outside the
loop body?
L0:
tmp1 = tmp2 + tmp3 ;
tmp4 = tmp4 + 1 ;
PushParam tmp4 ;
LCall _PrintInt ;
PopParams 4;
tmp6 = 10 ;
tmp5 = tmp4 == tmp6 ;
IfZ tmp5 Goto L0 ;
We have an intuition of what makes a loop in a flowgraph, but here is a
more formal definition. A loop is a set of basic blocks which satisfies two
conditions:
1. All are strongly connected, i.e. there is a path between any two blocks.
2. The set has a unique entry point, i.e. every path from outside the loop
   that reaches any block inside the loop enters through a single node.
A block n dominates m if all paths from the starting block to m must travel
through n. Every block dominates itself.

For loop L, moving invariant statement s in block B which defines
variable v outside the loop is a safe optimisation if:
1. B dominates all exits from L
2. No other statement assigns a value to v
3. All uses of v inside L are from the definition in s

Loop invariant code can be moved to just above the entry point to the
loop.
Machine Optimisations
In final code generation, there is a lot of opportunity for cleverness in
generating efficient target code. In this pass, specific machine features
(specialised instructions, hardware pipeline abilities, register details) are
taken into account to produce code optimised for this particular
architecture.
Register Allocation
One machine optimisation of particular importance is register allocation,
which is perhaps the single most effective optimisation for all
architectures. Registers are the fastest kind of memory available, but as a
resource, they are scarce. The problem is how to minimise traffic between
the registers and what lies beyond them in the memory hierarchy to
eliminate time wasted sending data back and forth across the bus.
One common register allocation technique is called register colouring,
after the central idea to view register allocation as a graph colouring
problem. If we have 8 registers, then we try to colour a graph with eight
different colours. The graph's nodes are made of webs and the arcs are
determined by calculating interference between the webs. A web
represents a variable's definitions, places where it is assigned a value (as
in x = ...), and the possible different uses of those definitions (as in y = x
+ 2). This problem, in fact, can be approached as another graph. The
definition and uses of a variable are nodes, and if a definition reaches a
use, there is an arc between the two nodes. If two portions of a variables
definition-use graph are unconnected, then we have two separate webs for
a variable. In the interference graph for the routine, each node is a web.
We seek to determine which webs don't interfere with one another, so we
know we can use the same register for those two variables. For example,
consider the following code:
i = 10;
j = 20;
x = i + j;
y = j + k;

We say that i interferes with j because at least one pair of i's definitions
and uses is separated by a definition or use of j; thus, i and j are alive
at the same time. A variable is alive between the time it has been defined
and that definition's last use, after which the variable is dead. If two
variables interfere, then we cannot use the same register for each. But two
variables that don't interfere can occupy the same register, since there is
no overlap in their liveness.
Once we have the interference graph constructed, we r-colour it so that no
two adjacent nodes share the same colour (r is the number of registers we
have, each colour represents a different register). You may recall that
graph-colouring is NP-complete, so we employ a heuristic rather than an
optimal algorithm. Here is a simplified version of something that might be
used:
1. Find the node with the least neighbours. (Break ties arbitrarily.)
2. Remove it from the interference graph and push it onto a stack.
3. Repeat steps 1 and 2 until the graph is empty.
4. Now, rebuild the graph as follows:
   a. Take the top node off the stack and reinsert it into the graph.
   b. Choose a colour for it based on the colours of any of its neighbours
      presently in the graph, rotating colours in case there is more than
      one choice.
   c. Repeat a and b until the graph is either completely rebuilt, or there
      is no colour available to colour the node.

If we get stuck, then the graph may not be r-colourable; we could try again
with a different heuristic, say reusing colours as often as possible. If there
is no other choice, we have to spill a variable to memory.
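As a rough sketch of the simplify-and-rebuild heuristic just described
(adjacency-matrix representation and all names invented for the example,
with a colour of -1 meaning the web must be spilled):

#include <string.h>

#define MAX_WEBS 64

int num_webs, num_regs;                 /* num_regs is assumed <= MAX_WEBS */
int interferes[MAX_WEBS][MAX_WEBS];     /* interference graph adjacency    */
int colour[MAX_WEBS];                   /* assigned register, or -1        */

static int Degree(int w, const int removed[])
{
    int d = 0;
    for (int v = 0; v < num_webs; v++)
        if (v != w && !removed[v] && interferes[w][v])
            d++;
    return d;
}

void ColourWebs(void)
{
    int removed[MAX_WEBS] = {0}, stack[MAX_WEBS], top = 0;

    /* Steps 1-3: repeatedly remove the node with the fewest neighbours. */
    for (int n = 0; n < num_webs; n++) {
        int best = -1;
        for (int w = 0; w < num_webs; w++)
            if (!removed[w] &&
                (best < 0 || Degree(w, removed) < Degree(best, removed)))
                best = w;
        removed[best] = 1;
        stack[top++] = best;
    }

    /* Step 4: rebuild, colouring each node as it is re-inserted. */
    memset(colour, -1, sizeof colour);
    while (top > 0) {
        int w = stack[--top];
        int used[MAX_WEBS] = {0};
        for (int v = 0; v < num_webs; v++)
            if (colour[v] >= 0 && interferes[w][v])
                used[colour[v]] = 1;    /* colours taken by coloured neighbours */
        for (int c = 0; c < num_regs; c++)
            if (!used[c]) { colour[w] = c; break; }
        /* colour[w] stays -1 if no register is free: this web gets spilled */
    }
}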
Instruction Scheduling
Another extremely important optimisation of the final code generator is
instruction scheduling. Because many machines, including most RISC
architectures, have some sort of pipelining capability, effectively
harnessing that capability requires judicious ordering of instructions.
In MIPS, each instruction is issued in one cycle, but some take multiple
cycles to complete. It takes an additional cycle before the value of a load is
available and two cycles for a branch to reach its destination, but an
instruction can be placed in the "delay slot" after a branch and executed in
that slack time. On the left is one arrangement of a set of instructions that
requires 7 cycles. It assumes no hardware interlock and thus explicitly
stalls between the second and third slots while the load completes and has
a dead cycle after the branch because the delay slot holds a noop. On the
right, a more favourable rearrangement of the same instructions will
execute in 5 cycles with no dead cycles.

Peephole Optimisations
Peephole optimisation is a pass that operates on the target assembly and
only considers a few instructions at a time (through a "peephole") and
attempts to do simple, machine-dependent code improvements. For
example, peephole optimisations might include elimination of
multiplication by 1, elimination of load of a value into a register when the
previous instruction stored that value from the register to a memory
location, or replacing a sequence of instructions by a single instruction
with the same effect. Because of its myopic view, a peephole optimiser
does not have the potential payoff of a full-scale optimiser, but it can
significantly improve code at a very local level and can be useful for
cleaning up the final code that resulted from more complex optimisations.
Much of the work done in peephole optimisation can be thought of as
find-and-replace activity, looking for certain idiomatic patterns in a single
instruction or a sequence of two to three instructions that can be replaced by more
efficient alternatives. For example, MIPS has instructions that can add a
small integer constant to the value in a register without loading the
constant into a register first, so the sequence on the left can be replaced
with that on the right:

What would you replace the following sequence with?


lw $t0, -8($fp)
sw $t0, -8($fp)

What about this one?


mul $t1, $t0, 2

Optimisation Soup
You might wonder about the interactions between the various optimisation
techniques. Some transformations may expose possibilities for others, and
even the reverse is true: one optimisation may obscure or remove
possibilities for others. Algebraic rearrangement may allow for common
subexpression elimination or code motion. Constant folding usually paves
the way for constant propagation, and then it turns out to be useful to run
another round of constant-folding, and so on. How do you know you are
done? You don't!
As one compiler textbook author (Pyster) puts it:
Adding optimisations to a compiler is a lot like eating chicken
soup when you have a cold. Having a bowl full never hurts, but
who knows if it really helps. If the optimisations are structured
modularly so that the addition of one does not increase compiler
complexity, the temptation to fold in another is hard to resist. How
well the techniques work together or against each other is hard to
determine.
BIBLIOGRAPHY
1. Aho, A.; Sethi, R.; Ullman, J. Compilers: Principles, Techniques, and
   Tools. Reading, MA: Addison-Wesley, 1986.
2. Appleby, D. Programming Languages: Paradigm and Practice. New York, NY:
   McGraw-Hill, 1991.
3. Backus, J. "The History of FORTRAN I, II, III", SIGPLAN Notices, Vol. 13,
   No. 8, August 1978, pp. 165-180.
4. Bennett, J. P. Introduction to Compiling Techniques. Berkshire, England:
   McGraw-Hill, 1990.
5. Chomsky, N. "On Certain Formal Properties of Grammars", Information and
   Control, Vol. 2, 1959, pp. 137-167.
6. Cohen, D. Introduction to Computer Theory. New York: Wiley, 1986.
7. DeRemer, F. Practical Translators for LR(k) Languages. Ph.D. dissertation,
   MIT, 1969.
8. DeRemer, F. "Simple LR(k) Grammars", Communications of the ACM, Vol. 14,
   No. 7, 1971.
9. Fischer, C.; LeBlanc, R. Crafting a Compiler. Menlo Park, CA:
   Benjamin/Cummings, 1988.
10. Grune, D.; Bal, H.; Jacobs, C.; Langendoen, K. Modern Compiler Design.
    West Sussex, England: Wiley, 2000.
11. Hopcroft, J.; Ullman, J. Introduction to Automata Theory, Languages, and
    Computation. Reading, MA: Addison-Wesley, 1979.
12. Johnson, M. Compiler. Stanford University, 2001.
13. Korenjak, A. "A Practical Method for Constructing LR(k) Processors",
    Communications of the ACM, Vol. 12, No. 11, 1969.
14. Kleene, S. "Representation of Events in Nerve Nets and Finite Automata",
    in Shannon, C. and McCarthy, J. (eds), Automata Studies. Princeton, NJ:
    Princeton University Press, 1956.
15. Knuth, D. "On the Translation of Languages from Left to Right",
    Information and Control, Vol. 8, No. 6, 1965.
16. Knuth, D. "Top-Down Syntax Analysis", Acta Informatica, Vol. 1, No. 2,
    1971.
17. Lewis, P.; Stearns, R. "Syntax-Directed Transduction", Journal of the
    ACM, Vol. 15, No. 3, 1968.
18. Lewis, P.; Rosenkrantz, D.; Stearns, R. Compiler Design Theory. Reading,
    MA: Addison-Wesley, 1976.
19. Loudon, K. Compiler Construction. Boston, MA: PWS, 1997.
20. MacLennan, B. Principles of Programming Languages. New York, NY: Holt,
    Rinehart, Winston, 1987.
21. Mak, R. Writing Compilers and Interpreters. New York, NY: Wiley, 1991.
22. McGettrick, A. The Definition of Programming Languages. Cambridge:
    Cambridge University Press, 1980.
23. Muchnick, S. Advanced Compiler Design and Implementation. San Francisco,
    CA: Morgan Kaufmann, 1997.
24. Neimann, T. A Compact Guide to Lex and Yacc. Portland: ePaper Press,
    2000.
25. Pyster, A. Compiler Design and Construction. New York, NY: Van Nostrand
    Reinhold, 1988.
26. Sudkamp, T. Languages and Machines: An Introduction to the Theory of
    Computer Science. Reading, MA: Addison-Wesley, 1988.
27. Tremblay, J.; Sorenson, P. The Theory and Practice of Compiler Writing.
    New York, NY: McGraw-Hill, 1985.
28. Wexelblat, R. L. History of Programming Languages. London: Academic
    Press, 1981.
29. Wirth, N. "The Design of a Pascal Compiler", Software - Practice and
    Experience, Vol. 1, No. 4, 1971.