Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
CAMBRIDGE
Compiler Construction
(nal version)
A 16-lecture course
Alan Mycroft
Computer Laboratory, Cambridge University
http://www.cl.cam.ac.uk/users/am/
20082009: Lent Term
Compiler Construction 1 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Course Plan
Part A : intro/background
Part B : a simple compiler for a simple language
Part C : implementing harder things, selected additional detail
Compiler Construction 2 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
A compiler
A compiler is a program which translates the source form of a
program into a semantically equivalent target form.
Traditionally this was machine code or relocatable binary form,
but nowadays the target form may be a virtual machine (e.g.
JVM) or indeed another language such as C.
Can appear a very hard program to write.
How can one even start?
Its just like juggling too many balls (picking instructions while
determining whether this + is part of ++ or whether its right
operand is just a variable or an expression . . . ).
Compiler Construction 3 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How to even start?
When nding it hard to juggle 4 balls at once, juggle them each in
turn instead . . .
character
stream
-
lex
token
stream
-
syn
parse
tree
-
trans
intermediate
code
-
cg
target
code
A multi-pass compiler does one simple thing at once and passes its
output to the next stage.
These are pretty standard stages, and indeed language and (e.g.
JVM) system design has co-evolved around them.
Compiler Construction 4 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Compilers can be big and hard to understand
Compilers can be very large. In 2004 the Gnu Compiler Collection
(GCC) was noted to [consist] of about 2.1 million lines of code and
has been in development for over 15 years.
But, if we choose a simple language to compile (well use the
intersection of C, Java and ML) and dont seek perfect code and
perfect error messages then a couple thousand lines will suce.
Compiler Construction 5 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Overviews
lex (lexical analysis) Converts a stream of characters into a
stream of tokens
syn (syntax analysis) Converts a stream of tokens into a parse
treea.k.a. (abstract) syntax tree.
trans (translation/linearisation) Converts a tree into simple
(linear) intermediate codewell use JVM code for this.
cg (target code generation) Translates intermediate code into
target machine code often as (text form) assembly code.
Compiler Construction 6 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
But text form does not run
use an assembler to convert text form instructions into binary
instructions (Linux: .s to .o le format; Windows: .asm to .obj
le format).
use a linker (ld on linux) to make an executable (.exe on
Windows) including both users compiled code and necessary
libraries (e.g. println).
And thats all there is to do!
Compiler Construction 7 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Overview of lex
Converts a stream of characters into a stream of tokens.
From (e.g.)
{ let x = 1;
x := x + y;
}
to
LBRACE LET ID/x EQ NUM/1 SEMIC ID/x ASS ID/x PLUS ID/y
SEMIC RBRACE
Compiler Construction 8 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Overview of syn
Converts the stream of tokens into a parse tree.
id exp exp exp exp
definition exp
declaration command
block
LBRACE LET ID/x EQ NUM/1 SEMIC ID/x ASS ID/x PLUS ID/y SEMIC RBRACE
Compiler Construction 9 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Overview of syn (2)
Want an an abstract syntax tree, not just concrete structure above:
{ let x = 1;
x := x + y;
}
might produce (repeated tree notes are shown shared)
LET EQDEF
ASS
NUMB
ID
PLUS
ID
1
x
y
Compiler Construction 10 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Overview of trans
Converts a tree into simple (linear) intermediate code. Thus
y := x<=3 ? -x : x
might produce (using JVM as our intermediate code):
iload 4 load x (4th local variable, say)
iconst 3 load 3
if_icmpgt L36 if greater (i.e. condition false) then jump to L36
iload 4 load x
ineg negate it
goto L37 jump to L37
label L36
iload 4 load x
label L37
istore 7 store y (7th local variable, say)
Compiler Construction 11 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Overview of cg
Translates intermediate code into target machine code.
y := x<=3 ? -x : x
can produce (simple if inecient blow-by-blow) MIPS code:
lw $a0,-4-16($fp) load x (4th local variable)
ori $a1,$zero,3 load 3
slt $t0,$a1,$a0 swap args for <= instead of <
bne $t0,$zero,L36 if greater then jump to L36
lw $a0,-4-16($fp) load x
sub $a0,$zero,$a0 negate it
addi $sp,$sp,-4 first part of PUSH...
sw $a0,0($sp) ... PUSH r0 (to local stack)
B L37 jump to L37
L36: lw $a0,-4-16($fp) load x
addi $sp,$sp,-4 first part of PUSH...
sw $a0,0($sp) ... PUSH r0 (to local stack)
L37: lw $a0,0($sp) i.e. POP r0 (from local stack)...
addi $sp,$sp,4 ... 2nd part of POP
sw $a0,-4-28($sp) store y (7th local variable)
Compiler Construction 12 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Commercial justication for multi-pass compiler
Write n front-ends (lex/syn) and m back-ends (cg) and you get
n m compilers (lots of cash!) for compilers translating any of n
languages into any of m target architectures.
Also, separate teams can work on separate passes. (passes are also
called phases).
Compiler Construction 13 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Machine Code
Compilers typically translate a high-level language (e.g. Java)
into machine instructions for some machine.
This course doesnt care what machine we use, but examples will
mainly use MIPS or x86 code.
We only use the most common instructions so you dont need to
be an expert on Part Ib Computer Design.
So heres a very minimal subset we need to use:
Compiler Construction 14 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
MIPS Machine Code (1)
Instructions to:
load a constant into a register, e.g. 0x12345678 by
movhi $a0,0x1234
ori $a0,$a0,0x5678
load/store local variable at oset <nn>
lw $a0,<nn>($fp)
sw $a0,<nn>($fp)
load/store global variable at address 0x00be3f04
movhi $a3,0x00be
lw $a0,0x3f04($a3)
sw $a0,0x3f04($a3)
Compiler Construction 15 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
MIPS Machine Code (2)
Instructions to:
do basic arithmetic/logic/comparison
add $a2,$a0,$a1
xor $a2,$a0,$a1
slt $a2,$a0,$a1 ; comparison
function calling: complicated (and were cheating a bit): caller
pushes the arguments to a function on the stack ($sp) then uses
jal; callee then makes a new stack frame by pushing the old
value of $fp (and the return addresspc following caller) then
sets $fp to $sp to form the new stack frame.
function return is largely the opposite of function call; on the
MIPS put result in $v0 then return using jr.
Compiler Construction 16 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
JVM code subset (1)
We need only a small subset.
Arithmetic:
iconst n) push integer n onto the stack.
iload k) push the kth local variable onto the stack.
istore k) pop the stack into the kth local variable.
getstatic class:eld) push a static eld (logically a global
variable) onto the stack.
putstatic class:eld) pop the stack into a static eld (logically a
global variable).
iadd, isub, ineg etc. arithmetic on top of stack.
Compiler Construction 17 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
JVM code subset (2)
Branching:
invokestatic f call a function.
ireturn return (from a function) with value at top of stack
if icmpeq , also if icmpgt, etc. pop two stack items, compare
them and perform a conditional branch on the result.
goto unconditional branch.
label not an instruction: just declares a label.
NB: apart from MIPS using registers and JVM using a stack the two
subsets provided give very similar functionality.
Compiler Construction 18 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How do I see these in use?
Reading assembly-level output is often really useful to aid
understanding of how language features are implemented.
gcc -S foo.c # option -O2 is often clearer
will write a le foo.s containing assembly instructions for your
current architecture
Otherwise, use a disassembler to convert the object le back into
assembler level form, e.g. in Java
javac foo.java
javap -c foo
Compiler Construction 19 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 2
Stacks, Stack Frames, and the like . . .
Compiler Construction 20 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Stacks and Stack Frames
Static/Global variables: allocated to a xed location in memory.
Local variables: need multiple copies for recursion etc.use a
stack.
A stack is a block of memory in which stack frames are allocated.
Function call allocates a new stack frame; function return
de-allocates it.
MIPS register $fp points to stack frame of the currently active
function. When a function returns, its stack frame is deallocated
and $fp restored to point to the stack frame of the caller.
Local variables: allocated to a xed oset from $fp; 5th local
variable typically at -20($fp)
Compiler Construction 21 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Stacks and Stack Frames (2)
A downward-growing stack exemplied for main() which calls f()
which calls f():
stack
frame for main frame for f frame for f
$fp
6
direction of growth
unused)
Compiler Construction 22 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Stacks and Stack Frames (3)
Stack frame needs to save pointer to previous stack frame (FP
) and
also return address (RA):
local vars
FP
6
FP
RA
Compiler Construction 23 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Stacks and Stack Frames (4)
A stack now looks like:
stack
frame for main frame for f frame for f
$fp
6
FPRA FPRA FPRA locals locals locals
H
H
H
6
H
H
H
6
$sp
6
unused)
$sp points to the lowest used location in:
1. the stack as a whole; and
2. the currently active stack frame.
So, memory below $sp can be used for temporary work space
(evaluation stack) and for preparing parameters for a callee.
Compiler Construction 24 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Stacks and Stack Frames (5): parameter passing
Were cheating: the MIPS procedure standard uses registers
($a0$a3) to communicate the rst 4 arguments, and the stack for
the rest (eciency). Well use the stack for all of them!
Treaty:
the caller and callee agree that the parameters are left in memory
cells at $sp, $sp+4, etc. at the instant of call.
Compiler Construction 25 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Stacks and Stack Frames (6): parameter passing
Done by:
the caller evaluates each argument in turn pushing it onto $sp.
I.e. *--SP = arg; in C.
the callee rst stores the linkage information (contiguous with
the received parameters) and so parameters can be addressed as
$fp+8, $fp+12, etc. (assuming 2-word linkage information
pointed at by $fp).
So, the callee sees its parameters at positive osets from $fp and its
local variables at negative osets from $fp with linkage info in
between.
Compiler Construction 26 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Stacks and Stack Frames (7): parameter passing
Better view of a stack frame:
local vars
SP
6
FP+8
6
FP
6
FP
RA parameters
Space below (to the left of) the stack frame is used to construct the
argument list (possibly empty) of any called routinesthe called
routine then turns this into a proper stack frame.
Compiler Construction 27 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Typical code for procedure entry/return
Caller to foo does:
addi $sp,$sp,-4 ; make space for (single) argument
sw $a0,0($sp) ; push argument
jal foo ; do the call (puts r37 into $ra)
r37:
At entry to callee:
foo: sw $ra,-4($sp) ; save $ra in new stack location
sw $fp,-8($sp) ; save $fp in new stack location
addi $sp,$sp,-8 ; make space for what we stored above
addi $fp,$sp,0 ; $fp points to this new frame
On return from callee (result in $v0):
fooxit: addi $sp,$fp,8 ; restore $sp at time of call
lw $ra,-4($sp) ; load return address
lw $fp,-8($sp) ; restore $fp to be callers stack frame
jr $ra ; branch back to caller
Compiler Construction 28 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Who removes the arguments to a call?
One subtlety (below the level of examination) which Ive omitted is:
Who removes the arguments to a call? Caller or callee?
On the MIPS, the caller does it (and doesnt happen very often
on the real MIPS procedure calling standard because of the rst
4 arguments in registers rule).
On the JVM, the callee does it (see two slides on).
On the x86 there are two standardsone of each.
Why? C, but not Java, oers support for vararg functions which
take variable numbers of arguments.
Compiler Construction 29 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Sample Java procedure calling code
Simpleras expectedreal machines have other trade-os than
simple representation of Java in the JVM design.
class fntest {
public static void main(String args[]) {
System.out.println("Hello World!" + f(f(1,2),f(3,4)));
}
static int f(int a, int b) { int y = a+b; return y*a; }
}
The JVM code generated for the function f might be:
f: ; <say meta data here: 2 args, 1 local>
iload 0 ; load a
iload 1 ; load b
iadd
istore 2 ; store result to y
iload 2 ; re-load y
iload 0 ; re-load a
imul
ireturn ; return from fn with top-of-stack value as result
Compiler Construction 30 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Sample Java procedure calling code (2)
Given
public static void main(String args[]) {
System.out.println("Hello World!" + f(f(1,2),f(3,4)));
}
the series of calls in the println would be
iconst 1
iconst 2
invokestatic f
iconst 3
iconst 4
invokestatic f
invokestatic f
Note how in the JVM a two-argument procedure call looks just like a
binary operator (iadd etc.).
Compiler Construction 31 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Address space map
0x00000000 0xffffffff
. . . code . . . static data . . . stack . . . heap . . .
The items listed above are often called segments: thus the code
segment or the stack segment. We will only discuss the heap segment
in Part C of this course.
Compiler Construction 32 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
What is just in time compilation (JIT)?
character
stream
-
lex
token
stream
-
syn
parse
tree
-
trans
intermediate
code
-
cg
target
code
A classical compiler does all these on one machine. To distribute a
system for multiple architectures we compile it once per architecture.
When running Java in a Browser, the JVM le is transported after
the rst 3 stages of compilation. The recipient browser may:
Interpret the JVM code (see later).
Do the last stage of compilation (CG) now the host architecture
is known (this is called just in time compilation).
Compiler Construction 33 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Our simple language
Use a language of your choice to implement the compiler.
The source language we use is in the intersection of C/Java/ML!
only 32-bit integer variables (declared with int), constants and
operators;
no nested function denitions, but recursion is allowed.
no classes, objects etc.
Compiler Construction 34 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Language syntax
<expr> ::= <number>
| <var>
| <expr> <binop> <expr> ;; e.g. + - * / & | ^ &&
| <monop> <expr> ;; unary operators: - ~ !
| <fnname>(<expr>*)
| <expr> ? <expr> : <expr>
<cmd> ::= <var> = <expr>;
| if (<expr>) <cmd> else <cmd>
| while (<expr>) <cmd>
| return <expr>;
| { <decl>* <cmd>* }
<decl> ::= int <var> = <expr>;
| int <fnname>(int <var> ... int <var>) <cmd>
<program> ::= <decl>*
Plus various other restrictions (see notes).
Compiler Construction 35 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Forms of Interpreter
character-stream form while early Basic interpreters would have
happily re-lexed and re-parsed a statement in Basic whenever it
was encountered, the complexity of doing so (even for our
minimal language) makes this no longer sensible;
token-stream form again this no longer makes sense, parsing is
now so cheap that it can be done when a program is read;
historically BBC Basic stored programs in tokenised form and
re-parsed them on execution (probably for space reasonsonly
one form of the program was stored);
Compiler Construction 36 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Forms of Interpreter
syntax-tree form this is a natural and simple form to interpret
(also link to operational semantics). Syntax tree interpreters
are commonly used for PHP or Python.
intermediate-code form the suitability of this for interpretation
depends on the choice of intermediate language; in this course we
have chosen JVM as the intermediate codeand historically
JVM code was downloaded and interpreted.
target-code form if the target code is identical to our hardware
then (in principle) we just load it and branch to it! Otherwise we
can write an interpreter (normally interpreters for another
physical machine are called emulators) in the same manner as we
might write a JVM interpreter.
Compiler Construction 37 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 3
Interpreters
Compiler Construction 38 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Interpreters
In general doing it makes sense to do as much work as possible before
interpreting (or direct execution): Never put o till run-time what
you can do at compile-time. [Gries].
Done once versus potentially done many times.
This particularly makes sense for statically typed languages.
BTW, not said in notes: systems people tend to call invented
machines virtual machines; theorists tend to call them abstract
machines, but same concept.
Compiler Construction 39 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How to write a JVM interpreter
read in a .class le
put code into a byte array imem[] byte code instructions.
make PC point to entry to main
allocate a word array dmem[] we only support integers. Put
static data at base of this; make SP and FP index top of it.
(mumble about relocation/use of library routines)
simulate the fetch/execute cycle until we hit a halt instruction.
Compiler Construction 40 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How to write a JVM interpreter (2)
void interpret()
{ byte [] imem; // instruction memory
int [] dmem; // data memory
int PC, SP, FP; // JVM registers
int T; // a temporary
...
for (;;) switch (imem[PC++])
{
/* special case opcodes for small values (smaller .class files): */
case OP_iconst_0: dmem[--SP] = 0; break;
case OP_iconst_1: dmem[--SP] = 1; break;
case OP_iconst_B: dmem[--SP] = imem[PC++]; break;
case OP_iconst_W: T = imem[PC++]; dmem[--SP] = T<<8 | imem[PC++]; break;
/* Note use of FP-k in the following -- downwards growing stack */
case OP_iload_0: dmem[--SP] = dmem[FP]; break;
case OP_iload_1: dmem[--SP] = dmem[FP-1]; break;
case OP_iload_B: dmem[--SP] = dmem[FP-imem[PC++]]; break;
Compiler Construction 41 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How to write a JVM interpreter (3)
case OP_iadd: dmem[SP+1] = dmem[SP+1]+dmem[SP]; SP++; break;
case OP_istore_0: dmem[FP] = dmem[SP++]; break;
case OP_istore_1: dmem[FP-1] = dmem[SP++]; break;
case OP_istore_B: dmem[FP-imem[PC++]] = dmem[SP++]; break;
case OP_goto_B: PC += imem[PC++]; break;
/* etc etc etc */
}
}
}
Compiler Construction 42 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How to write a JVM interpreter (4)
Theres a worry here: the JVM opcodes just use contiguous osets
(to iload and istore) for arguments and localswhereas previously
we required a 2-word gap between them for linkage information.
when interpreting its simpler to have a (yet another) stack (or
indeed two separate stacks) which just holds return addresses
and previous frame pointers
when compiling to a single-stack-segment solution (more exible)
such as MIPS, its easy to insert a gap:
0 +12; 1 +8; 2 4; 3 8; . . .
Compiler Construction 43 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How to write a JVM interpreter (5)
case OP_invokestatic:
T = <get callee start address from PC>
linkagestack[--LSP] = PC;
linkagestack[--LSP] = FP;
PC = T
FP = SP + <n_p>;
SP = SP - <n_v>; /////////// FIX
case OP_ireturn: ...
/* etc etc etc */
}
}
}
And that really is allits just coding.
Compiler Construction 44 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
How to write an emulator for another machine
Suppose we want to execute the output from a compiler which
produces code for a machine we dont have (e.g. obsolete, or not yet
manufactured).
Just write a JVM-style interpreter for its code.
This is traditionally called an emulator or simulator. If youre a
hardware person you might want a cycle-accurate emulator which
also tells you exactly how long the program would take to run on the
real architecture.
If youre trying to sell your customers a new architecture and want to
tell them their existing binary programs will still run you might want
a dynamic binary translator (JIT translator looking like a fast
emulator).
Compiler Construction 45 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Syntax tree interpreter
Were going to cheat. The Expr/Cmd/Decl language is still a bit too
big for lectures, so Im going to ban Cmds:
require function bodies to be of the form return e;
re-allow limited local Decls by adding let x=e in e
(inline
expansion/beta-reduction).
Compiler Construction 52 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 4
Lexical Analysis or Tokenisation.
Compiler Construction 53 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lexical Analysis
Converts a character stream to a token stream (a.k.a.
tokenisation).
Tokens are things like left shift symbol, integer constant or
string, or even plus symbol formed of a single character.
Typically removes whitespace (including comments!)
whitespace might be needed to separate tokens but is not a token
itself.
Most common interface is procedural: TokType lex();.
Compare the corresponding int getchar(); in C which gives a
character stream. Note lex() will probably need a 1-place buer
to tokenise things like abc+1 as we only know the abc is
complete after reading the +.
Compiler Construction 54 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lexical Analysis (2)
Question: does one regard System.out.println as one token or ve?
Answer: it depends on the language, but in Java its most
appropriate to think of it as ve (and thats what the language
denition says). A good reason is that the language requires things
like println or even x.println for a suitable variable x to refer to
the same name. (We dont want to be matching substrings during
later phases, only subtrees.)
Compiler Construction 55 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lexical Analysis (3)
Languages have evolved to use regular expressions (over the alphabet
of characters and possibly EOF, end-of-le) for tokens. First noted in
Algol60 by Backus and Naur.
Recall Regular Languages and Finite Automata:
Regular Expression regular language Finite Automaton;
(warning: the notes tend to write Finite State Automaton).
Well come back to whether this is deterministic or not, but in the
meantime also recall the subset construction which, given a NDFA,
gives a DFA.
Compiler Construction 56 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lexical Analysis (4)
So, lexical analysis is easy:
Write or download a regular expression description of the tokens
of your language;
Create a DFA which accepts this language, and turn it into
C/Java/ML code which (beware see next slide) emits a token
every time it hits an accepting state.
Job done.
There are even automatic tools which read in the regular expression
description, construct the DFA and write the code for you (see Lex
and Yacc later in the course).
Now we look at the details.
Compiler Construction 57 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lexical Analysis Details
Theres an implicit additional understanding in tokenisation beyond
accepting state in DFAs. Consider input abc+1 and the
identier token being dened by regular expression
(a [ [ z)(a [ [ z)
unsigned integer
I = s J [ J signed integer
H = J [ p J [ J p J digits maybe with .
G = H [ e I [ H e I H maybe with exponent
F = G [ s G G optionally signed
Note that some of the complexity is due to expressing things
precisely, e.g. H allows three cases: an digit string, a digit string
preceded a point, or a point with digits either side, but disallows
things like 3.. [You might pick a better/prettier denition.]
Compiler Construction 61 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Example oating point number syntax (3)
Can get the following DFA but getting something so small requires
careful state minimisation (RLFA or Hardware courses) possibly by a
tool:
1 2 3 8 7 6 5 4
p p
d
d
d
e s
d d
d e e
e
d d p s
with states S3, S5 and S8 being accepting states (but see proviso
earlier).
Compiler Construction 62 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Mapping the DFA to code
One could represent the above DFA with lots of C labels and gotos,
but its simpler to represent it as a tablenote that this also encodes
only accept at the rst invalid character:
s d p e other
S1 S2 S3 S4 S6 .
S2 . S3 S4 S6 .
S3 . S3 S4 S6 acc
S4 . S5 . . .
S5 . S5 . S6 acc
S6 S7 S8 . . .
S7 . S8 . . .
S8 . S8 . . acc
Eciency hack: Ive also indexed by s, d etc. instead of characters!
Compiler Construction 63 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
One nal problem with regular expressions
Suppose I say int is a keyword, but identiers are [a z]
then
these regular expressions overlap. Its really hard to write sequences
of a to z not including int as a regular expression (try it!),
so many notations and tools allow rst one (left-right) wins in case
of a tie.
Moral: although tokens are just regular expressions, in practice these
have lots of mathematical/programming short-hand to keep their size
low and their expressivity high.
Compiler Construction 64 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 4
1
2
Syntax Analysis or Parsing.
Compiler Construction 65 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Parsing big picture
Regular Expressions/Finite Automata are too weak for this e.g.
they cant match brackets.
We use the next strongest bit of theory Context-free grammars.
The syntax of programming languages is traditionally expressed
using such grammars, often referred to as BNF, Backus-Naur
Form.
Logically we just repeat the progression of the previous lecture,
but everything is much richer now, particularly we need typically
to return a tree for a whole program, not just the next token.
Need to learn a bit of theory before we can program a parser.
Compiler Construction 66 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Grammars
A context-free grammar is a 4-tuple (T, N, S, R)
T set of terminal symbols (things which occur in the source)
N set of non-terminal symbols (names for syntactic elements)
R set of (production) rules: U B
1
B
2
B
n
S N is the start symbol
A symbol is either a T or an N.
We use U, V to range over N, and A, B to range over N T
Compiler Construction 67 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Sentences
Given a grammar (T, N, S, R)
A sentential form is any sequence of symbols (in N T) which
can be produced from S by using a sequence of rules in R.
A sentence is just a sentential form with all its symbols in T.
(E.g. 1+2 but not 1+<expr>).
Compiler Construction 68 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Common notation (1)
Because context-free grammars typically have several productions
having the same terminal on the left-hand side, the notation
U A
1
A
2
A
k
[ [ B
1
B
2
B
is used to abbreviate
U A
1
A
2
A
k
U B
1
B
2
B
.
But beware when counting: there are still multiple productions for
U, not one.
There is various other shorthand, such as for repetition, EBNF.
Compiler Construction 69 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Common notation (2)
Alternatives:
lower case for non-terminals, upper case for terminals (toy
examples)
<expr> etc for non-terminals, ordinary text for terminals
(standards documents)
ordinary identiers for non-terminals, quoted text for terminals
(input to yacc etc.)
Note that is often written ::=
Compiler Construction 70 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 5
Syntax Analysis (continued)
Compiler Construction 71 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Grammar Engineering
A grammar is ambiguous if a sentence can be produced in two
dierent ways (using two dierent derivationssee later).
a) S A B
A a | a c
B b | c b
a b, a c b, a c c b
b) C if E then C else C | if E then C
if E then if E then C else C
c) <sheepnoise> "baa" | <sheepnoise> <sheepnoise>
baa baa baa
This is a more serious version of the overlapping token description
problem from last lecture.
Compiler Construction 72 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Getting rid of ambiguity
Re-write grammarusually to keep the same set of sentences, but
where each sentence has a unique derivation. E.g.
(before) E ::= Num | E+E | (E)
(after: option 1) E ::= E + T | T (left-associative)
T ::= Num | (E)
(after: option 2) E ::= T + E | T (right-associative)
T ::= Num | (E)
(after: option 3) E ::= T + T | T (non-associative)
T ::= Num | (E)
Non-associative disallows (say) 1+2+3forcing the user to
parenthesise (and here we fortunately remembered to include
parentheses in the syntax!).
Compiler Construction 73 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Precedence
The grammar
E ::= E + T | E - T | E * T | E / T | E ^ T | T
T ::= Num | (E)
may be unambiguous, but its probably not what you wantconsider
2*3+4^5*6+7. Want operators to have varying precedence (a.k.a.
priority or binding power). E.g.
E ::= E + T | E - T | T lowest prio, l-assoc
T ::= T * F | T / F | F medium prio, l-assoc
F ::= P ^ F | P highest prio, r-assoc
P ::= Num | (E)
Compiler Construction 74 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Null productions
Note that we usually allow (e.g.)
E ::= X; E
E ::=
This encodes zero or more occurrences of X;. The second rule is an
empty production also written E .
However, apart from particular uses such as the one above, empty
productions can be hard to deal with when parsing (theres an
string of zero characters wherever one looks...), and are often best
avoided when possible.
Compiler Construction 75 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Leftmost and Rightmost Derivations
Ill not make any real use of this, but it occurs in past exam
questions.
A leftmost derivation (of a sentence from the start symbol) is when
the sentence is generated by always taking the leftmost non-terminal
and choosing a rule with which to re-write it. (The sequence of rules
then exactly determines the string, at least for context free
grammars). Given rules S ::= A+A and A ::= 1 we might have
S A+A 1+A 1+1
A rightmost derivation is when the sentence is generated by always
taking the rightmost non-terminal . . . .
Compiler Construction 76 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Dual uses for grammars
Describing languagesweve now just about done this.
Parsing languageshow do I write a parser from a grammar?
Two answers to this question:
1. Just write iti.e. encode the grammar as code.
2. Use a toolthis encodes the grammar as a table (data) along
with a pre-implemented table interpreter.
Well start with 1 and leave 2 to lecture 14.
Compiler Construction 77 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Parsing by Recursive Descent
Easy in principle. For each non-terminal, E say, write a function
rdE() which reads an E. We start by making rdE() return voidso
this is a syntax checkerit says OK or syntax error.
So, given
F ::= P ^ F | P highest prio, r-assoc
P ::= Num | (E)
Just write
int token; // holds current token from lexing
void rdF() { rdP();
if (token==^) { token=lex(); rdF(); }
}
Compiler Construction 78 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Parsing by Recursive Descent (2)
Similarly
P ::= Num | (E)
gives
void rdP() { if (token==Num) { token=lex(); }
else if (token==()
{ token=lex(); rdE();
if (token==)) token=lex();
else die("no ));
}
else die("unexpected token");
}
Compiler Construction 79 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Parsing by Recursive Descent (3)
But what about:
E ::= E + T | E - T | T lowest prio, l-assoc
How do we know whether we are reading an E or a T rst?
And do we really want to write the following?:
void rdE() { rdE(); }
Answer: re-write to avoid left recursion in the grammar.
Compiler Construction 80 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 6
1
2
Recursive Descent Continued; abstract syntax trees
Compiler Construction 81 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Parsing by Recursive Descent (4)
What about:
E ::= E + T | E - T | T
How do we know whether we are reading an E or a T rst?
Solution: nd another (similar grammar) for the same language
which (a) which only uses terminals to choose which way to parse
and (b) has no left-recursion.
Note theres no general algorithm to do do this (indeed not always
even possible), but humans can often do it (especially for common
language cases).
Compiler Construction 82 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Parsing by Recursive Descent (5)
Easy in this case
E ::= E + T | E - T | T
just means any number of Ts separated by + or - ; so re-write to
E
::= T + E
| T - E
character
stream
-
lex
token
stream
-
syn
parse
tree
-
trans
intermediate
code
-
cg
target
code
Type checking often not mentioned explicitly. Here you can think of
it as being an arrow from parse tree to parse tree which checks types
(rejecting ill-typed programs) and xes up the parse tree.
Well cheat and do it on-the-y during trans. . .
Compiler Construction 98 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Trans: what do we have to do?
Convert the abstract syntax tree representation of a program into
intermediate object code (here JVM code).
character
stream
-
lex
token
stream
-
syn
parse
tree
-
trans
intermediate
code
-
cg
target
code
Compiler Construction 99 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
What do we have to do (2)?
The translation phase deals with
the scope and allocation of variables,
determining the type of all expressions,
the selection of overloaded operators (type-based!), and
generating the intermediate code.
Compiler Construction 100 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
What we want to happen:
Given, for example,
static int f(int a, int b) { int y = a+b; ... }
we want the translation phase to issue a series of calls of the
following form for the declaration and initialisation of y:
gen2(OP_iload, 0);
gen2(OP_iload, 1);
gen1(OP_iadd);
gen2(OP_istore, 2);
Well assume (1) OP xxx above are enumeration constants
representing opcodes and (2) gen1() and gen2() write the
intermediate code instructions to a le or append them to some other
data structure.
Compiler Construction 101 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Reminderattening a tree in ML
datatype tree = Leaf of int | Branch of tree*tree;
fun flatten(Leaf n) = [n]
| flatten(Branch(t,t)) = flatten t @ flatten t;
val test = Branch(Branch(Leaf 1, Leaf 2),
Branch(Leaf 3, Leaf 4));
flatten(test);
gives:
val it = [1,2,3,4] : int list
Compiler Construction 102 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Reminderattening a tree (alternative)
datatype tree = Leaf of int | Branch of tree*tree;
fun walk(Leaf n) = (print (Int.toString n);
print ";")
| walk(Branch(t,t)) = (walk t;
walk t);
val test = Branch(Branch(Leaf 1, Leaf 2),
Branch(Leaf 3, Leaf 4));
walk(test);
instead of making a list, this just prints the values in the leaves of the
tree:
1;2;3;4;
Compiler Construction 103 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Adjusting things a bit
datatype sourceop = Add | Mul;
datatype tree = Num of int | Diad of sourceop * tree * tree;
datatype jvmop = Iconst of int | Iadd | Imul;
fun trnop(Add) = Iadd
| trnop(Mul) = Imul;
fun flatten(Num n) = [Iconst n]
| flatten(Diad(binop,t,t)) = flatten t @ flatten t @ [trnop binop];
val test = Diad(Add, Diad(Mul, Num 1, Num 2),
Diad(Mul, Num 3, Num 4));
flatten(test);
gives:
val it = [Iconst 1,Iconst 2,Imul,Iconst 3,Iconst 4,Imul,Iadd] : jvmop list
A postorder tree walk is pretty exactly a compiler from syntax trees
to JVM code!
Compiler Construction 104 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Tree walking
Essentially need one tree-walker for each type in the abstract syntax
tree:
void trexp(Expr e) translate an expression
void trcmd(Cmd c) translate a command
void trdecl(Decl d) translate a declaration
Here well mainly consider trexp() but the others are similar.
Compiler Construction 105 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Dealing with names (and hence scoping)
class A {
static int g;
int n,m; /* non-static members just for illustration */
static int f(int x) { int y = x+1; return foo(g,n,m,x,y); }
}
Use a compile-time data structure to remember the names in
scopethe symbol table. At the return this might be:
"g" static variable
"n" class variable 0
"m" class variable 1
"f" method
"x" local variable 0
"y" local variable 1
Compiler Construction 106 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Symbol table
Symbol table is just an abstract data type.
Decls and scope exit call methords to add/remove items from the
symbol table, and well assume trname() looks up things in the table:
void trname(int op, String s)
Rather sloppily for this year Ill assume it not only looks up the oset
of name s but also emits it along with op using gen2().
Compiler Construction 107 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Translation of Expressions
fun trexp(Num(k)) = gen2(OP_iconst, k);
| trexp(Id(s)) = trname(OP_iload,s);
| trexp(Add(x,y)) = (trexp(x); trexp(y); gen1(OP_iadd))
| trexp(Sub(x,y)) = (trexp(x); trexp(y); gen1(OP_isub))
| trexp(Mul(x,y)) = (trexp(x); trexp(y); gen1(OP_imul))
| trexp(Div(x,y)) = (trexp(x); trexp(y); gen1(OP_idiv))
| trexp(Neg(x)) = (trexp(x); gen1(OP_ineg))
| trexp(Apply(f, el)) =
( trexplist(el); // translate args
trname(OP_invokestatic, f)) // Compile call to f
| ...
fun trexplist[] = ()
| trexplist(e::es) = (trexp(e); trexplist(es));
Compiler Construction 108 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Translation of Expressions (2)
Note the invariant: a call to trexp() emits code which when
executed has the net result of pushing one item to the stack.
(Prove by induction assuming the result for sub-expressions.)
Compiler Construction 109 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 7
Type Checking and Translating a parse tree into stack-based
intermediate code (continued)
Compiler Construction 110 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Translation of Conditional Expressions
fun trexp(Num(k)) = gen2(OP_iconst, k);
| trexp(Cond(b,x,y)) =
let val p = ++label; // Allocate two labels
val q = ++label in
trexp(b); // eval the test
gen2(OP_iconst, 0); // put zero on stack...
gen2(OP_if_icmpeq, p); // ... branch if b false
trexp(x); // code to put x on stack
gen2(OP_goto,q); // jump to common point
gen2(OP_Lab,p);
trexp(y); // code to put y on stack
gen2(OP_Lab,q) // common point; result on stack
end;
| ...
Compiler Construction 111 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Short-circuit boolean operations
Cant translate Java && and || in the way we translate + etc. E.g.
this is bad:
| trexp(Or(x,y)) = (trexp(x); trexp(y); gen1(...))
| trexp(And(x,y)) = (trexp(x); trexp(y); gen1(...))
Must treat e||e
as e?1:(e
as e?(e
?1:0):0.
One lazy way to do this is just to call trexp recursively with the
equivalent code above (which does not use And and Or):
| trexp(Or(x,y)) = trexp(Cond(x, Num(1),
Cond(y,Num(1),Num(0))))
| trexp(And(x,y))= trexp(Cond(x, Cond(y,Num(1),Num(0)),
Num(0)))
Compiler Construction 112 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Relational Operators
Logically, code for Java relational operators (Eq, Ne, Lt, Gt, Le, Ge) is
simply done by (e.g.):
| trexp(Eq(x,y)) = (trexp(x); trexp(y); gen1(OP_EQ))
and this is OK for exams. Sadly in reality JVM does not have such
operations which push a boolean onto the stack, so we instead
generate a branch around code which puts zero/one on the stack
(just like && and ||):
// note the mapping for branch-false: Eq -> CmpNe etc.
| trexp(Eq(x,y)) = trboolop(OP_if_icmpne, x, y)
| ...
| trexp(Gt(x,y)) = trboolop(OP_if_icmple, x, y);
Compiler Construction 113 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Relational Operators (2)
fun trboolop(brop,x,y) =
let val p = ++label; val q = ++label in
trexp(x); // load operand 1
trexp(y); // load operand 2
gen2(brop, p); // do conditional branch
trexp(Num(1)); // code to put true on stack
gen2(OP_goto,q); // jump to common point
gen2(OP_Lab,p);
trexp(Num(0)); // code to put false on stack
gen2(OP_Lab,q) // common point; result on stack
end;
This gives ugly code for a>b?a:b (rst we branch to make 0/1 then
we compare it with zero and branch again), but hey, it works.
(Its the JVMs fault, and we could x it up with a bit more work.)
Compiler Construction 114 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Translation of declarations and commands
Rather left as an exercise, but one which you are encouraged to
sketch out, as it uses simple variants of ideas occurring in trexp and
is therefore not necessarily beyond the scope of examinations.
Hint: start with
fun trcmd(Assign(s,e)) = (trexp(e); trname(OP_istore,s))
| trcmd(Return e) = (trexp(e); gen1(OP_ireturn))
| trcmd(Seq(c,c)) = (trcmd(c); trcmd(c))
| trcmd(If3(e,c,c)) = ...
Think also how variable declarations call methods to add names to
the symbol table and also increment the compilers knowledge of the
oset from FP of where to allocate the next local variable . . .
Compiler Construction 115 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Labels vs addresses the assembler
In the above explanation, given a Java procedure
static int f(int x, int y) { return x<y ? 1:0; }
I have happily generated JVM code like
iload 0
iload 1
if_icmpge label6
iconst 1
goto label7
label6: // written "Lab 6" earlier
iconst 0
label7: // written "Lab 7" earlier
ireturn
Compiler Construction 116 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Labels vs addresses the assembler (2)
But, given
static int f(int x, int y) { return x<y ? 1:0; }
and looking at the JVM code using javap -c, I get
0: iload_0
1: iload_1
2: if_icmpge 9
5: iconst_1
6: goto 10
9: iconst_0
10: ireturn
Did I cheat? Only a little...
Compiler Construction 117 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Labels vs addresses the assembler (3)
The actual JVM binary code has numeric addresses for instructions
(printed to the left by javap -c) and if icmpge and goto use the
address of destination instructions as their operands instead of a
label.
A separate pass of the compiler determines the size of each JVM
instructionto calculate the address of each instruction (relative to
the start of the procedure) which then determines the numeric
address for each of the labels. Each use of a label in a if icmpge and
goto instruction can now be substituted by a numeric oset and the
labels deleted.
This process (of converting symbolic JVM [or other] code to binary
JVM [or other] code) is called assembly and the program which does
it an assembler.
Compiler Construction 118 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Labels vs addresses the assembler (4)
While being a vital system component (and additional pass in
compilation), assemblers are often disregarded in a simple
explanation because they merely map text-form instructions to
binary-form in a 11 manner.
One nal remark: this assembly process is only done at the end of
compilationif we are intending to use the JVM code to generate
further code then we will want to keep the symbolic label nnn form.
Indeed, if we download a Java .class le which contains binary
JVM code with the intention of JITing it (compiling it to native
binary code), the rst thing we need to do is to identify all binary
branch osets and turn them to symbolic labels (disassemble it).
Compiler Construction 119 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Type Checking
Weve used int for everything so far. While types are better treated
in part C of this course. What would happen if we also had type
float (Java/C-style in which every variable is given a type when
declared)?
We have additional JVM ops fload, fstore fadd etc.
So: put the type in the symbol table (along with global/local etc).
But how does e +e
(y.e
) e
So, for example,
let f(y) = y*2
in let x = 3
in f(x+1)
can be simplied to
(f. (x. f(x+1)) (3)) (y. y*2)
Compiler Construction 147 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
The power of Lambda (2)
This translation cannot immediately do rec:
let f(n) = n=0 ? 1 : n*f(n-1) in f(4)
translates to
(f. f(4)) (n. n=0 ? 1 : n*f(n-1) )
in which the right-most use of f is unbound rather than recursive.
Compiler Construction 148 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
The power of Lambda (3)
One might think that recursion must inevitably require an additional
keyword, but note that it is possible to call a function recursively
without dening it recursively:
let f(g,n) = ... g(g,n-1) ... // NB: no f in body
in f(f, 5)
Here the call g(g,n-1) makes a recursive call of (non-recursive) f . . .
And this trick can be extended giving the xed point combinator Y .
Compiler Construction 149 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
The power of Lambda (4)
By generalising this idea it is possible to represent a recursive
denition let rec f = e as the non-recursive form
let f = Y (f.e)
(NB: this at least binds all the variables to the right places.)
Surprisingly at rst, this Y can even be expressed directly in the
lambda-calculus.
Y = f. (g. (f(a. (gg)a)))(g. (f(a. (gg)a))).
(Experts beware: this is the form for the call-by-value lambda
calculus as bets the following interpreter.)
Compiler Construction 150 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
A lambda-calculus evaluator
Why do this? (In 200809 it is also be covered in Foundations of
functional programming for dierent purposes.)
It is a simple language which directly models:
nested function denitions e.g. x.y.x +y and the nature of
function values.
dynamic types (the identity function can rst be applied to an
integer and then to another function).
It extends the simple interpreter in Part A of the notes.
Compiler Construction 151 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
A lambda-interpreter in ML
Syntax of the -calculus with constants in ML as
datatype Expr = Name of string |
Numb of int |
Plus of Expr * Expr |
Fn of string * Expr |
Apply of Expr * Expr;
Values are of either integers or functions (closures):
datatype Val = IntVal of int |
FnVal of string * Expr * Env;
Compiler Construction 152 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
A lambda-interpreter in ML (2)
Environments are just a list of (name,value) pairs as before:
datatype Env = Empty | Defn of string * Val * Env;
and name lookup is natural:
fun lookup(n, Defn(s, v, r)) =
if s=n then v else lookup(n, r);
| lookup(n, Empty) = raise oddity("unbound name");
Compiler Construction 153 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
A lambda-interpreter in ML (3)
The main code of the interpreter is as follows:
fun eval(Name(s), r) = lookup(s, r)
| eval(Numb(n), r) = IntVal(n)
| eval(Plus(e, e), r) =
let val v = eval(e,r);
val v = eval(e,r)
in case (v,v) of (IntVal(i), IntVal(i)) => IntVal(i+i)
| (v, v) => raise oddity("plus of non-number") end
| eval(Fn(s, e), r) = FnVal(s, e, r)
| eval(Apply(e, e), r) =
case eval(e, r)
of IntVal(i) => raise oddity("apply of non-function")
| FnVal(bv, body, r_fromdef) =>
let val arg = eval(e, r)
in eval(body, Defn(bv, arg, r_fromdef)) end;
Compiler Construction 154 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
A lambda-interpreter in ML (4)
Note particularly the way in which dynamic typing is handled (Plus
and Apply have to check the type of arguments and make
appropriate results). Also note the two dierent environments (r,
r fromdef) being used when a function is being called.
A fuller version of this code (with test examples and with the tying
the knot version of Y ) appears on the course web page.
Compiler Construction 155 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Static Scoping and Dynamic Scoping
Modern programming languages normally look up free variables in
the environment where the function was dened rather than when it
is called (static scoping or static binding or even lexical scoping).
The alternative of using the calling environment is called dynamic
binding (or dynamic scoping) and was used in many dialects of Lisp.
The dierence is most easily seen in the following example:
let a = 1;
let f() = a;
let g(a) = f();
print g(2);
Replacing r_fromdef with r in the interpreter moves from static to
dynamic scoping!
Compiler Construction 156 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Implementing the environment
Searching lookup() for names of variables is inecient.
Before running the program we know, given a particular variable
access, how many iterations lookup lookup() will take.
Its the number of variables declared (and still in scope) between the
variable being looked up and where we are now. So we could use
(de Bruijn) indices instead (translating with an additional compiler
phase).
Lam("x",Name("x")) becomes Lam("x",NameIndex(1))
And we dont even need the names anymore:
Lam("x",Name("x")) Lam(NameIndex(1))
Lam("x",Lam("y",Name("x"))) Lam(Lam(NameIndex(2)))
Compiler Construction 157 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Implementing the environment (2)
This is still inecient.
Accessing the nth element of a list by index is O(n) just like
searching for the nth element by name!
What about using an array for the environment to get O(1) access?
Yes, but scope entry and scope exit then costs O(n) with n variables
in scope.
Practical idea: group variables in a single function scope putting
their values in an array(*), and use a list of arrays for the
environment. Scope entry and exit is just a cons or tl.
Lookup costs O(k) where k is the maximum procedure nesting.
(*) think of this array as a stack frame.
Compiler Construction 158 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Implementing the environment (3)
BEWARE: this list of arrays/stack frames is not the same as the
stack frames encountered by following the Old FP stored in the
linkage information its the static nesting structure.
Another point: De Bruijn indices become not a single integer but a
pair (i, j) meaning access the jth variable in the ith array.
Compiler Construction 159 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Implementing the environment (4)
let f(a,b,c) =
( let g(x,y,z) = (let h(t) = E in ...)
in g((let k(u,v) = E
1
a:(1,1) b:(1,2) c:(1,3) level 1
2
x:(2,1) y:(2,2) z:(2,3) level 2
3
t:(3,1) level 3
4
u:(2,1) v:(2,2) also level 2
Compiler Construction 160 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Implementing the environment (5)
We put these entries in the symbol table.
Now, given an access to a variable x (with 2D address (i, j) from a
point at function nesting level d, instead of accessing x by name we
can instead use 2D index (relative address) of (d i, j). For example,
access to c (whose 2D address (1, 3)) is (2, 3) in E (in environment
3
of depth 3) is (2, 3), whereas access to the same variable in E
(in
4
of depth 2) is (1, 3).
Compiler Construction 161 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 11
Static link method, ML free variables, etc.
Compiler Construction 162 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Static Link
Add a pointer to stack of caller to linkage information:
local vars
SP
6
FP
6
FP+12
6
FP
RA SL parameters
SL is the static linka pointer to the frame of the dener
Note that FP
dier on this.
Compiler Construction 194 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Compiler Correctness
Suppose S, T are source and target languages of a compiler f.
See f as a function f : S T (in practice f is implemented by a
compiler program C written in an implementation language L)
Now we need semantics of S and T, written as [[ ]]
S
: S M and
[[ ]]
T
: T M for some set of meanings M.
We can now say that f is a correct compiler provided that
(s S)[[f(s)]]
T
= [[s]]
S
.
Compiler Construction 195 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Tombstone diagrams
(Non-examinable in 2008/09.)
As above, let L, S, T, U be languages.
Write f : S T for a function from S to T.
Similarly write C : S
L
T for C a compiler from S to T written in
language L.
Functions f : S T and g : T U can be composed to give S U.
So can compilers (output of one as input to the other):
(S
L
T) (T
L
U) (S
L
U).
Compiler Construction 196 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Tombstone diagrams (2)
A semantics for L now turns a program in L into a function, in
particular it now also has type [[ ]]
L
: (S
L
T) (S T).
Let H be our host architecture.
Then [[ ]]
H
means execute a program in language H.
The only useful compilers are ones of type S
H
H.
Compilation types can also be vertically composed: use a L
H
H
compiler to compile a S
L
H one to yield a usable compiler S
H
H.
Compiler Construction 197 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Bootstrapping
Weve designed a great new language U, and write a compiler C for
U in U, i.e. C : U
U
H.
How do we make it useful, i.e. U
H
H?
Write a quick-and-nasty prototype compiler U
H
H and use that to
compile C to get a better compiler U
H
H.
This is called bootstrapping (lift oneself up by ones own bootlaces).
Does this always work? Does it terminate? Is it unique?
Compiler Construction 198 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Trojan compilers
Its not unique. Adjust C to make C
such that:
1. it miscompiles (say) the login program
2. it miscompiles the compiler so that when it compiles something
which looks like the compiler then re-introduces bugs 1. and 2. if
they have been removed from the source.
Now C has no visible bugs in the source code, but whenever it is
compiled on a descendent of C
token
stream
-
standard
driver
-
parse
tree
?
6
work
stack
?
6
table for
grammar
In LR parsing, the table represents the characteristic nite state
machine (CFSM) for the grammar; the standard driver (grammar
independent) merely interprets this.
Compiler Construction 225 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing
So, we only have to learn:
how do we construct the CFSM?
whats the driver program?
Compiler Construction 226 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing example grammar
To exemplify this style of syntax analysis, consider the following
grammar (here E, T, P abbreviate expression, term and
primarya sub-grammar of our previous grammar):
#0 S E eof
#1 E E + T l-assoc +
#2 E T
#3 T P ** T r-assoc **
#4 T P
#5 P i
#6 P ( E )
The form of production #0 dening the sentence symbol S is
important. Its RHS is a single non-terminal followed by the special
terminal symbol eof (which occurs nowhere else in the grammar).
Compiler Construction 227 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing items and states
An item is a production with a position marker (represented by .)
marking some position on its right hand side. There are four possible
items involving production #1:
E .E + T
E E .+ T
E E + .T
E E + T .
So around 20 items altogether (there are 13 symbols on the RHS of
7 productions, and the marker can precede or follow each one).
Think of the marker as a progress indicator.
A state (in the CFSM) is just a set of items (but not just any set ...).
Compiler Construction 228 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing items and states (2)
If the marker in an item is at the beginning of the right hand
side then the item is called an initial item.
If it is at the right hand end then the item is called a completed
item.
In forming item sets a closure operation must be performed to
ensure that whenever the marker in an item of a set precedes a
non-terminal, E say, then initial items must be included in the set
for all productions with E on the left hand side.
Compiler Construction 229 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing items and states (3)
The rst item set is formed by taking the initial item for the
production dening the sentence symbol (S .E eof ) and then
performing the closure operation, giving the item set:
1: { S .E eof
E .E + T
E .T
T .P ** T
T .P
P .i
P .( E )
}
(Remember: item sets are the states of the CFSM.)
Compiler Construction 230 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing items and states (4)
OK, so thats the rst state, what are the rest?
I tell you the transitions which gives new items; you then turn
these into a state by forming the closure again.
Compiler Construction 231 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing items and states (5)
States have successor states formed by advancing the marker over the
symbol it precedes. For state 1 there are successor states reached by
advancing the marker over the symbols E, T, P, i or (. Consider,
rst, the E successor (state 2), it contains two items derived from
state 1 and the closure operation adds no more (since neither marker
precedes a non terminal). State 2 is thus:
2: { S E . eof
E E .+ T
}
The other successor states are dened similarly, except that the
successor of eof is always the special state accept. If a new item set
is identical to an already existing set then the existing set is used.
Compiler Construction 232 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR parsing completed items and states
The successor of a completed item is a special state represented by $
and the transition is labelled by the production number (#i) of the
production involved.
The process of forming the complete collection of item sets continues
until all successors of all item sets have been formed. This necessarily
terminates because there are only a nite number of dierent item
sets.
Compiler Construction 233 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
CFSM for the grammar
5
6
7
10 12
8
1
4 3 $
$
$
#2 #1
#4 P
T
T
E
(
T +
2
eof
accept
**
$
$
i
9
#5
$
11
) E #6
#3
Start to think what happens when I feed this 1**2+3**3.
Compiler Construction 234 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR(0) parser
From the CFSM we can construct the two matrices action and goto:
1. If there is a transition from state i to state j under the terminal
symbol k, then set action[i, k] to Sj.
2. If there is a transition under a non-terminal symbol A, say, from
state i to state j, set goto[i, A] to Sj.
3. If state i contains a transition under eof set action[i, eof ] to
acc.
4. If there is a reduce transition #p from state i, set action[i, k] to
#p for all terminals k.
If any entry is multiply dened then the grammar is not SLR(0).
Compiler Construction 235 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Is our grammar SLR(0)?
The example grammar gives matrices (using dash (-) to mark blank
entries):
action goto
state eof ( i ) + ** P T E
S1 - S10 S9 - - - S6 S5 S2
S2 acc - - - S3 - - - -
S3 - S10 S9 - - - S6 S4 -
S4 #1 #1 #1 #1 #1 #1 - - -
S5 #2 #2 #2 #2 #2 #2 - - -
S6 #4 #4 #4 #4 #4 XXX - - -
S7 - S10 S9 - - - S6 S8 -
S8 #3 #3 #3 #3 #3 #3 - - -
S9 #5 #5 #5 #5 #5 #5 - - -
S10 - S10 S9 - - - S6 S5 S11
S11 - - - S12 S3 - - - -
S12 #6 #6 #6 #6 #6 #6 - - -
Compiler Construction 236 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Is our grammar SLR(0)?
No: because (state S6, symbol ) is marked XXX to indicate that
it admits both a shift transition (S7) and a reduce transition (#4) for
the terminal . In general right associative operators do not give
SLR(0) grammars.
So: use lookaheadthe construction then succeeds, so our grammar
is SLR(1) but not SLR(0).
Compiler Construction 237 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Lecture 16
SLR(1) grammars and LR driver code
Compiler Construction 238 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
LR and look-ahead
Key observation (for this grammar) is: after reading a P, the only
possible sentential forms continue with
the token ** as part of P ** T (rule #3)
the token + or ) or eof as part of a surrounding E or P or S
(respectively).
So a shift (rule #3) transition is always appropriate for lookahead
being **; and a reduce (rule #4) transition is always appropriate for
lookahead being + or ) or eof .
In general: construct sets FOLLOW(U) for all non-terminal symbols U.
To do this it helps to start by constructing Left(U).
Compiler Construction 239 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Left sets
Left(U) is the set of symbols (terminal and non-terminal) which can
appear at the start of a sentential form generated from the
non-terminal symbol U.
Algorithm for Left(U):
1. Initialise all sets Left(U) to empty.
2. For each production U B
1
B
n
enter B
1
into Left(U).
3. For each production U B
1
B
n
where B
1
is also a
non-terminal enter all the elements of Left(B
1
) into Left(U)
4. Repeat 3. until no further change.
Compiler Construction 240 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Left sets continued
For the example grammar the Left sets are as follows:
U Left(U)
S E T P ( i
E E T P ( i
T P ( i
P ( i
Compiler Construction 241 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Follow sets
Algorithm for FOLLOW(U):
1. If there is a production of the form X . . . Y Z . . . put Z and
all symbols in Left(Z) into FOLLOW(Y ).
2. If there is a production of the form X . . . Y put all symbols
in FOLLOW(X) into FOLLOW(Y ).
Compiler Construction 242 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Follow sets continued
For our example grammar, the FOLLOW sets are as follows:
U FOLLOW(U)
E eof + )
T eof + )
P eof + ) **
Compiler Construction 243 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
SLR(1) table construction
Form the action and goto matrices are formed from the CFSM as in
the SLR(0) case, but with rule 4 modied:
4 If there is a reduce transition #p from state i, set action[i, k] to
#p for all terminals k belonging to FOLLOW(U) where U is the
subject of production #p.
If any entry is multiply dened then the grammar is not SLR(1).
Compiler Construction 244 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Is our grammar SLR(1)?
YesSLR(1) is sucient for our example grammar.
action goto
state eof ( i ) + ** P T E
S1 - S10 S9 - - - S6 S5 S2
S2 acc - - - S3 - - - -
S3 - S10 S9 - - - S6 S4 -
S4 #1 - - #1 #1 - - - -
S5 #2 - - #2 #2 - - - -
S6 #4 - - #4 #4 S7 - - -
S7 - S10 S9 - - - S6 S8 -
S8 #3 - - #3 #3 - - - -
S9 #5 - - #5 #5 #5 - - -
S10 - S10 S9 - - - S6 S5 S11
S11 - - - S12 S3 - - - -
S12 #6 - - #6 #6 #6 - - -
Note now SLR(1) has no clashes (in SLR(0) S6/** clashed).
Compiler Construction 245 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
LR parser runtime code
This is the standard driver from last lecture.
We use a stack that contains alternately state numbers and symbols
from the grammar, and a list of input terminal symbols terminated
by eof . A typical situation:
a A b B c C d D e E f | u v w x y z eof
Here a ... f are state numbers, A ... E are grammar symbols
(either terminal or non-terminal) and u ... z are the terminal
symbols of the text still to be parsed. If the original text was
syntactically correct, then
A B C D E u v w x y z
will be a sentential form.
Compiler Construction 246 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
LR parser runtime code (2)
The parsing algorithm starts in state S1 with the whole program, i.e.
conguration
1 | the whole program upto eof )
and then repeatedly applies the following rules until either a
syntactic error is found or the parse is complete.
Compiler Construction 247 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
LR parser runtime code (3)
shift transition If action[f, u] = Si, then transform
a A b B c C d D e E f | u v w x y z eof
to
a A b B c C d D e E f u i | v w x y z eof
reduce transition If action[f, u] = #p, and production #p is of
length 3, say, necessarily P C D E where C D E exactly matches
the top three symbols on the stack. Then transform
a A b B c C d D e E f | u v w x y z eof
to (assuming goto[c, P] = g)
a A b B c P g | u v w x y z eof
Compiler Construction 248 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
LR parser runtime code (4)
stop transition If action[f, u] = acc then the situation will be
as follows:
a Q f | eof
and the parse will be complete. (Here Q will necessarily be the
single non-terminal in the start symbol production (#0) and u
will be the symbol eof .)
error transition If action[f, u] = - then the text being parsed
is syntactically incorrect.
Compiler Construction 249 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
LR parser sample execution
Exampleparsing i+i:
Stack text production to use
1 i + i eof
1 i 9 + i eof P i
1 P 6 + i eof T P
1 T 5 + i eof E T
1 E 2 + i eof
1 E 2 + 3 i eof
1 E 2 + 3 i 9 eof P i
1 E 2 + 3 P 6 eof T P
1 E 2 + 3 T 4 eof E E + T
1 E 2 eof acc (E is result)
Compiler Construction 250 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
Why is this LR-parsing?
Look at the productions used (backwards, starting at the bottom of
the page since we are parsing, not deriving strings from the start
symbol).
We see
E E+T E+P E+i T+i P+i i+i
i.e. a rightmost derivation.
Compiler Construction 251 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
What about the parse tree?
In practice a tree will be produced and stored attached to terminals
and non-terminals on the stack. Thus the nal E will in reality be a
pair of values: the non-terminal E along with a tree representing i+i.
(Exactly what we want!).
Compiler Construction 252 20082009: Lent Term
UNIVERSITY OF
CAMBRIDGE
The end
Come along to Optimising Compilers in Part II if you want to
know how to do things better.
Compiler Construction 253 20082009: Lent Term