Overview of the PL/0 Compiler
M A Smith, December 16, 2009

Contents

    1.1 Lexical Analysis
    1.2 Syntax Analysis
    1.3 Semantic Analysis
    1.4 Code generation
    2 The PL/0 Language
    2.1 PL/0
    2.2 Syntax of PL/0 using EBNF
    2.3 Syntax Diagram for PL/0 statement
    3 Symbol Table of PL/0
    4 Convert to reverse polish notation
    5 Code generated by the Compiler
    5.1 Reverse Polish Notation
    5.2 Code table
    5.3 Examples of intermediate code for PL/0 constructs
    5.4 Pascal code to generate code for the "if condition then statement" construct
    5.5 Pascal code to generate code for the "while condition do statement" construct
    5.6 Example of Lexical Levels
    5.7 Use of a Display

1 Major phases of a compiler

1.1 Lexical Analysis

Splits the program up into tokens, which are more convenient for later stages of the compiler to handle.

    if, else, variable, =, !=    // are examples of tokens in Java

1.2 Syntax Analysis

Checks that the formation of the program conforms to the syntax of the programming language as specified in the syntax diagrams.

    if ( cost < 20 ) cheap++;

The above construct is syntactically valid in Java,

    if ( cost < 20 ) cheap+++;

whilst the above is syntactically invalid.

1.3 Semantic Analysis

Checks that the meaning behind the syntax of the program being compiled is valid.

    amount = cost * VAT;

A semantic check would be that amount, cost and VAT had been declared and that they were of the correct type for the operation being performed.

1.4 Code generation

Generates code that may be executed. Execution is realised by one of:

    The CPU of a computer.
    An interpreter which simulates the actions of the "executable" code.
    A mix of CPU execution and interpretation.
[Figure: Source Code -> Compiler -> Executable code]

[Figure: the phases Lexical Analysis, Syntax Analysis, Semantic Analysis and Code Generation, shown with the symbol table, turning source code into machine code]

Symbol Table

    Name    Type    Loc
    Spend   Int     10
    Money   Int     20

Source Code

    Money = Money-Spend;

Machine code

    move.w money,d0
    sub.w  spend,d0
    move.w d0,money

2 The PL/0 Language

A toy language.

Example program

    var countdown;
    begin
      countdown := 10;
      while countdown >= 0 do
      begin
        write countdown;              // Extension
        countdown := countdown - 1;
      end
    end.

Output

    10 9 8 7 6 5 4 3 2 1 0

2.1 PL/0

A block structured language:

    begin   similar to { in Java
    end     similar to } in Java

:= is used instead of = for assignment, so = can now be used as equals in a boolean expression.

Only 1 (one) data type: int. Reminiscent of BCPL and B.

Can only output integers:

    write countdown;

This is not part of the original language PL/0, for which there were no I/O statements.

Conceived in about 1976 by Niklaus Wirth as a toy language used to illustrate the compiler writing process. He wrote the compiler for PL/0 in the language Pascal.

2.2 Syntax of PL/0 using EBNF

    program    = block "."
    block      = [ "const" ident "=" number {"," ident "=" number} ";" ]
                 [ "var" ident {"," ident} ";" ]
                 { "procedure" ident ";" block ";" } statement
    statement  = [ ident ":=" expression
                 | "call" ident
                 | "begin" statement {";" statement} "end"
                 | "if" condition "then" statement
                 | "while" condition "do" statement ]
    condition  = "odd" expression
               | expression ("="|"#"|"<"|"<="|">"|">=") expression
    expression = [ "+"|"-" ] term { ("+"|"-") term }
    term       = factor { ("*"|"/") factor }
    factor     = ident | number | "(" expression ")"
    number     = // What would it be
    ident      = // What would it be

Elements of EBNF

    Element                   Notation    Meaning
    definition                =           Is rewritten as
    concatenation             ,           No intervening white space
    termination               ;
    separation                |           or
    option                    [ ... ]     0 or 1 times
    repetition                { ... }     0 or more times
    grouping                  ( ... )
    double quotation marks    " ... "     Name
    single quotation marks    ' ... '     Name

2.3 Syntax Diagram for PL/0 statement

[Syntax diagram for statement, with one path for each of: ident := expression; call ident; begin statement { ; statement } end; if condition then statement; while condition do statement]

    statement  = [ ident ":=" expression
                 | "call" ident
                 | "begin" statement {";" statement} "end"
                 | "if" condition "then" statement
                 | "while" condition "do" statement ]

Treat the diagram as a railway track and simply follow the line through it.

[Syntax diagram for statement repeated, picking out the paths for the two constructs below]

    "if" condition "then" statement
    "begin" statement {";" statement } "end"

Syntax analysis

    type symbol =
      (nul, ident, number, plus, minus, times, slash, oddsym,
       eql, neq, lss, leq, gtr, geq, lparen, rparen, comma,
       semicolon, period, becomes, beginsym, endsym, ifsym,
       thensym, whilesym, dosym, callsym, constsym, varsym, procsym);

    var ch:  char    (* last char. read *);
        sym: symbol  (* last symb. read *);
        id:  alfa    (* last id. read *);
        num: integer (* last number read *);

The lexical analyser getsym (lines 87-152) returns in sym the type of the next token read from the PL/0 program.

    If it is an identifier then id will contain the characters of the identifier.
    If it is a number then num will contain the binary number.
    If it is a punctuation character then the table ssym will be used to find the type of the punctuation character.
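The behaviour of getsym described above can be sketched in a few lines. This is a Python illustration, not the Pascal of the compiler itself. The names WORDS, SSYM, DOUBLE and tokens are hypothetical, and recognising := , <= and >= as two-character tokens is an assumption made for the sketch.

```python
# Reserved words and single-character symbols, named after the symbol type.
WORDS = {
    "begin": "beginsym", "call": "callsym", "const": "constsym",
    "do": "dosym", "end": "endsym", "if": "ifsym", "odd": "oddsym",
    "procedure": "procsym", "then": "thensym", "var": "varsym",
    "while": "whilesym",
}
SSYM = {
    "+": "plus", "-": "minus", "*": "times", "/": "slash",
    "(": "lparen", ")": "rparen", "=": "eql", ",": "comma",
    ".": "period", "#": "neq", "<": "lss", ">": "gtr", ";": "semicolon",
}
# Two-character symbols (an assumption of this sketch).
DOUBLE = {":=": "becomes", "<=": "leq", ">=": "geq"}


def tokens(text):
    """Yield (sym, value) pairs, playing the role of repeated getsym calls."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha():
            # Identifier or reserved word: letters followed by letters/digits.
            j = i
            while j < n and text[j].isalnum():
                j += 1
            word = text[i:j]
            yield (WORDS.get(word, "ident"), word)
            i = j
        elif ch.isdigit():
            # Number: the value is converted to an integer, like num.
            j = i
            while j < n and text[j].isdigit():
                j += 1
            yield ("number", int(text[i:j]))
            i = j
        elif text[i:i + 2] in DOUBLE:
            yield (DOUBLE[text[i:i + 2]], text[i:i + 2])
            i += 2
        else:
            # Punctuation: look the character up, defaulting to nul.
            yield (SSYM.get(ch, "nul"), ch)
            i += 1
```

For example, list(tokens("countdown := 10;")) gives an ident, a becomes, a number and a semicolon token in that order. A generator is used here only because it is compact; the compiler instead keeps the current token in the global variables sym, id and num.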
M A Smith Page 9 December 16, 2009 Syntax analysis Initialisation of tables used by lexical analysis for ch := chr(0) to chr(127) do ssym[ch] := nul; word[1] := 'begin '; word[2] := 'call '; word[3] := 'const '; word[4] := 'do '; word[5] := 'end '; word[6] := 'if '; word[7] := 'odd '; word[8] := 'procedur'; word[9] := 'then '; word[10] := 'var '; word[11] := 'while ; wsym[1] := beginsym; wsym[2] := callsym; wsym[3] := constsym; wsym[4] := dosym; wsym[5] := endsym; wsym[6] := ifsym; wsym[7] := oddsym; wsym[8] := procsym; wsym[9] := thensym; wsym[10] := varsym; wsym[11]:= whilesym; ssym['+'] := plus; ssym['-'] := minus; ssym['*'] := times; ssym['/'] := slash; ssym['('] := lparen; ssym[')'] := rparen; ssym['='] := eql; ssym[','] := comma; ssym['.'] := period; ssym['#'] := neq; ssym['<'] := lss; ssym['>'] := gtr; ssym['$'] := leq; ssym['@'] := geq; ssym[';'] := semicolon; M A Smith Page 10 December 16, 2009 Syntax analysis var a,b; The code required to check if a var declaration in PL/0 is valid would be: if sym = varsym then begin getsym; while sym = ident do begin getsym; if sym in [comma,semicolon] then begin if sym = comma then getsym end else error(5); end; if sym = semicolon then getsym else error(5); end; [This code does not attempt any error recovery.] The actual code used by PL/0 is as follows which attempts to recover from errors made by a user in specifying a var declaration. procedure vardeclaration; begin if sym = ident then getsym else error(4) end; if sym = varsym then begin getsym; repeat vardeclaration; while sym = comma do begin getsym; vardeclaration end; if sym = semicolon then getsym else error(5) until sym <> ident; end; M A Smith Page 11 December 16, 2009 Error Recovery in PL/0 Apart from some minor concessions to error recovery in the construction of the syntax analysis there is a generalised strategy which is employed on the detection of an error. 
That is, on detection of a syntax error, the analyser skips to a major syntactic unit and then proceeds in the normal way.

    type symset = set of symbol;
    var declbegsys, statbegsys, facbegsys: symset;

    declbegsys := [constsym, varsym, procsym];
    statbegsys := [beginsym, callsym, ifsym, whilesym];
    facbegsys  := [ident, number, lparen];

declbegsys, statbegsys and facbegsys are all used to inform the syntax analyser which token to skip to on detection of an error.

For example, in the call to block at line 649:

    block(0, 0, [period] + declbegsys + statbegsys);

The third parameter is the set of symbols that the syntax analyser will skip to on detection of an error.

Also in the call to statement at line 438:

    statement([semicolon, endsym] + fsys);

This is usually done with the procedure test (lines 154-161):

    procedure test(s1, s2: symset; n: integer);
    begin
      if not (sym in s1) then
      begin
        error(n);
        s1 := s1 + s2;
        while not (sym in s1) do getsym
      end
    end

If the current symbol is not in s1, test generates error message n and then skips tokens until it finds a token in s1 or s2.

3 Symbol Table of PL/0

The following declarations define the symbol table in PL/0:

    type alfa   = packed array [1..al] of char;
         object = (constant, variable, proc);

    var table: array [0..txmax] of
          record
            name: alfa;
            case kind: object of
              constant:       (val: integer);
              variable, proc: (level, adr: integer)
          end;

The major routines to manipulate this are:

    enter      (lines 183-203)  enters a name, its type and value into the symbol table.
    position   (lines 206-214)  finds the position of an identifier in the table.
                                If the identifier is not there it returns 0.

Semantic analysis is responsible for checking that the meaning of a syntactically correct construct is valid.
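Such checks rest on the two symbol table routines just described. A minimal Python sketch of enter and position follows; it is an illustration, not the Pascal of the compiler. The class and function names are hypothetical, and the backwards linear search with slot 0 left unused is an assumption of the sketch. The error numbers 11 and 12 are the ones the compiler reports for an undeclared identifier and for assignment to a non-variable.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    kind: str       # "constant", "variable" or "proc"
    val: int = 0    # used by constants
    level: int = 0  # used by variables and procedures
    adr: int = 0    # used by variables and procedures


class SymbolTable:
    def __init__(self):
        # Slot 0 is never a real entry, so 0 can mean "not found".
        self.table = [Entry("", "variable")]

    def enter(self, name, kind, val=0, level=0, adr=0):
        """Append an entry and return its index."""
        self.table.append(Entry(name, kind, val, level, adr))
        return len(self.table) - 1

    def position(self, name):
        """Return the index of the most recent entry called name, or 0."""
        # Searching backwards lets an inner declaration hide an outer one.
        for i in range(len(self.table) - 1, 0, -1):
            if self.table[i].name == name:
                return i
        return 0


def check_assignment_target(symtab, name):
    """Return 0 if name may be assigned to, else an error number."""
    i = symtab.position(name)
    if i == 0:
        return 11   # undeclared identifier
    if symtab.table[i].kind != "variable":
        return 12   # assignment to a constant or procedure
    return 0
```

With a constant max and a variable a entered, check_assignment_target returns 0 for a, 12 for max and 11 for any name that was never entered.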
The major semantic checks are that a variable has been declared and that it is of the correct type. Consider the code to process an assignment statement:

    begin (* statement *)
      if sym = ident then
      begin
        i := position(id);
        if i = 0 then error(11) else
        if table[i].kind <> variable then
        begin (* Assignment to non-variable *)
          error(12); i := 0
        end;
        getsym;
        if sym = becomes then getsym else error(13);
        expression(fsys);
        if i <> 0 then
          with table[i] do gen(sto, lev - level, adr)
      end

The syntax check for an identifier is followed by semantic checks for:

    a) is the variable declared
           i := position(id) if i = 0 ....
    b) is it of the correct type
           if table[i].kind <> variable then begin error(12);

Note that in the second case the index into the symbol table is used to access the kind of the variable.

4 Convert to reverse polish notation

The conversion from infix notation to reverse polish is as follows:

[Figure: Input, Output and Stack]

    INPUT consists of an expression in infix notation

    while ( true )
    begin
      get next symbol/token from INPUT

        identifier          move to OUTPUT

        operator            while priority(INPUT op) <= priority(op on STACK) do
                            begin
                              pop item on STACK and move to OUTPUT
                            end
                            push operator on INPUT onto STACK

        (                   push '(' on stack

        )                   while op on stack <> '(' do
                            begin
                              pop item on STACK and move to OUTPUT
                            end
                            discard '(' on top of stack

        end of expression   while stack not empty do
                            begin
                              pop item on STACK and move to OUTPUT
                            end
                            exit
    end

Priority of operators

    High    **
            * /
    Low     + -

    program convertoreversepolish(input,output);
    const MAX = 20;
    var stack : record
                  items : array[1..MAX] of char;
                  tos   : 0 .. MAX
                end;

    (*
     * Function to return the priority of an operator
     *)
    function priority(c:char):integer;
    begin
      if c in ['*','-','+','/','^','#','('] then
        case c of
          '+' : priority := 1;
          '-' : priority := 1;
          '*' : priority := 2;
          '/' : priority := 2;
          '^' : priority := 3;
          '#' : priority := -1;   (* End marker in stack *)
          '(' : priority := -1;
        end
      else
        write('Panic: Error in stack ch = ', c );
    end;

    (*
     * Push an object onto the stack
     *)
    procedure push(c:char);
    begin
      if stack.tos >= MAX then
      begin
        writeln('Panic: Stack Full'); halt;
      end;
      stack.tos := stack.tos + 1;
      stack.items[stack.tos] := c;
    end;

    (*
     * Pull the top object from the stack
     *)
    function pop:char;
    begin
      if stack.tos = 0 then
      begin
        writeln('Panic: Stack empty'); halt;
      end;
      pop := stack.items[stack.tos];
      stack.tos := stack.tos - 1;
    end;

    (*
     * Return the top item on the stack [Not removing the top item]
     *)
    function lookattos:char;
    begin
      lookattos := stack.items[stack.tos];
    end;

    procedure initialisestack;
    begin
      stack.tos := 0;
    end;

    (*
     * Do all the work of converting infix expression to reverse polish
     *)
    procedure convertoreversepolish;
    var junk:char; c:char;
    begin
      initialisestack;
      push('#');                      (* End of stack marker *)
      repeat
        if eoln then c := '$' else read(c);
        if c in ['(',')','$','+','-','*', '/','^'] then
          case c of
            '(' : push('(');
            ')' : begin
                    while not ( lookattos in ['(','#'] ) do
                    begin
                      write( pop );
                    end;
                    if lookattos = '('
                      then junk := pop   { Dispose of '(' on stack }
                      else writeln(' Error Missing ( ');
                  end;
            '+','-','/','*','^' :
                  begin
                    while priority( c ) <= priority( lookattos ) do
                    begin
                      write( pop );
                    end;
                    push( c );
                  end;
            '$' : while lookattos <> '#' do write( pop );
          end
        else if c in ['a' .. 'z' ] then
          write(c)
        else
        begin
          writeln(' Panic: Ch ',c,' not valid Abort'); c:='$'
        end;
      until c = '$';
      readln; writeln;
    end;

    begin
      writeln('Infix to Reverse notation');
      while not eof do
      begin
        convertoreversepolish;
      end;
    end .

5 Code generated by the Compiler

    Code    Operand    Explanation
    LIT     a          push( literal a );
    OPR     op         T1:=pop; T2:=pop; push( T2 op T1 );
    LOD     l,a        push( location a in lexical level l );
    STO     l,a        Location a in lexical level l := pop;
    INT     a          tos := tos + a
    JMP     a          pc := a
    JPC     a          if ( pop = 0 ) pc := a
    CALL    l,a        Call routine at location a lexical level l

In the code the following variables are used:

    pc     Is the program counter
    tos    Is the top of stack
    a      Is the address of the variable
    l      Is the lexical level of the variable

Arguments to opr are:

    Arg    Action       Arg    Action
    0      Return       1      neg
    2      +            3      -
    4      *            5      DIV
    6      ODD          7      Undefined
    8      =            9      <>
    10     <            11     >=
    12     >            13     <=

5.1 Reverse Polish Notation

In this notation the operator comes after the operands.

    Infix Notation         Reverse Polish Notation
    A + B                  A B +
    A + B * C              A B C * +
    (A + B) * (C + D)      A B + C D + *
    A + B * C + D          A B C * + D +

In Reverse Polish notation there is no need to use brackets.

The PL/0 intermediate code is in effect a Reverse Polish instruction set. Thus the code to execute 1 + 2 * 5 is:

    LIT 1
    LIT 2
    LIT 5
    OPR *
    OPR +

In reverse polish notation the expression is:

    1 2 5 * +

5.2 Code table

The code table is defined by:

    type fct = (lit,opr,lod,sto,cal,int,jmp,jpc) (*functions*);
         instruction = packed record
                         f: fct         (* func. code *);
                         l: 0..levmax   (* level *);
                         a: 0..amax     (* displacement *);
                       end;
    var  cx: integer (* code allocation index *);
         code: array [0..cxmax] of instruction;

Code is entered into the table with the procedure gen, the parameters to which are:

    x: Function code
    y: Lexical level
    z: Offset in the lexical level

    procedure gen( x:fct; y,z:integer );   { lines 154-161 }
    begin
      if cx > cxmax then
      begin
        write(' program too long'); goto 99
      end;
      with code[cx] do
      begin
        f := x; l := y; a := z
      end;
      cx := cx + 1
    end (* gen *);

    cx is a pointer to the next free cell in this table.
    The "goto 99" is a panic exit taken when the code table is full.

5.3 Examples of intermediate code for PL/0 constructs

    if hours > 40 then bonus := 30;

    Address    Intermediate code
    10         LOD hours
    11         LIT 40
    12         OPR >
    13         JPC 16
    14         LIT 30
    15         STO bonus
    16

    while count < 20 do
    begin
      count := count + 1;
    end;

    Address    Intermediate code
    10         LOD count
    11         LIT 20
    12         OPR <
    13         JPC 19
    14         LOD count
    15         LIT 1
    16         OPR +
    17         STO count
    18         JMP 10
    19

5.4 Pascal code to generate code for the "if condition then statement" construct

    if sym = ifsym then
    begin
      getsym;
      condition([thensym, dosym] + fsys);
      if sym = thensym then getsym else error(16);
      cx1 := cx;
      gen(jpc, 0, 0);
      statement(fsys);
      code[cx1].a := cx
    end

The above performs both syntax analysis and code generation for the if statement.

cx is the index into the code table giving the next free cell for an instruction.

After generating code for the condition, which will leave the result (true, false) on top of the stack, the address of the next instruction to be generated is remembered in cx1.

Next the instruction JPC is generated; its address will be contained in cx1.
The operand of this instruction (where to transfer control to if the condition is false) cannot yet be filled in and is set to 0.

Following this, the code for the statement is generated by a call to "statement".

Then the JPC instruction can be "patched" to fill in the address to transfer control to if the condition of the if statement is false (remember cx points to the next free instruction).

5.5 Pascal code to generate code for the "while condition do statement" construct

    if sym = whilesym then
    begin
      cx1 := cx;
      getsym;
      condition([dosym] + fsys);
      cx2 := cx;
      gen(jpc, 0, 0);
      if sym = dosym then getsym else error(18);
      statement(fsys);
      gen(jmp, 0, cx1);
      code[cx2].a := cx
    end;

The above performs syntax analysis and code generation for the while statement.

cx is the index into the code table giving the next free cell for an instruction.

Before generating code for the condition, which will leave the result (true, false) on top of the stack, the address of the first instruction of the evaluation of the condition is remembered in cx1.

After a call to the procedure "condition" to generate the code for the condition, the address of the JPC instruction which will be generated next is remembered in cx2.

Next the instruction JPC is generated; its address will be contained in cx2. The operand of this instruction (where to transfer control to if the condition is false) cannot yet be filled in and is set to 0.

Following this, the code for the body of the while statement is generated (call on procedure "statement").

Then the JMP instruction back to the top of the while loop is generated (the address of which is in cx1).
Then the JPC instruction can be "patched" to fill in the address to transfer control to if the condition of the while statement is false (remember cx points to the next free instruction).

Consider the infix expression:

    1 * 2 + 3 * 4

This can be thought of as a tree, with the operators with the highest priority at the bottom of the tree.
            +
          /   \
         *     *
        / \   / \
       1   2 3   4

The way of evaluating this tree is to work down the tree until a part of the subtree can be replaced with a result, and then to repeat the process until the whole tree is evaluated.

In this example (1*2) could be evaluated, then (3*4) [working left to right] and finally (1*2) + (3*4).

Formally this is evaluating the tree LHS item RHS (recursively).

The syntax diagrams are organised to reflect this structure, with EXPRESSION, TERM and FACTOR each being responsible for parsing a part of the tree:

            +            Expression
          /   \
         *     *         Term
        / \   / \
       1   2 3   4       Factor

Code for an arithmetic expression is generated by essentially using the procedures EXPRESSION, TERM and FACTOR to generate a tree which, if printed out in the order (LHS RHS item), gives the reverse polish formula for the infix expression.

The main complication is that the tree is not generated as such but is formed by the recursion of the three procedures EXPRESSION, TERM and FACTOR.
This means that no tree data structure is actually created; the recursive descent process both parses the tree and generates the code.

Procedure TERM

    begin (* term *)
      factor(fsys + [times, slash]);
      while sym in [times, slash] do
      begin
        mulop := sym;
        getsym;
        factor(fsys + [times, slash]);
        if mulop = times then gen(opr, 0, 4) else gen(opr, 0, 5)
      end
    end (* term *);

The way the expression is parsed follows the pattern of the syntax diagrams, which for TERM is:

[Syntax diagram for TERM: a factor, followed by zero or more repetitions of * or / and a further factor]

Now the problem is that the correct order for generating reverse polish from the tree is LHS RHS item, not LHS item RHS.

This is simply solved by delaying the output of the code for the operator * or / until after the RHS has been parsed.

Generation of code for the operands

Consider the code for FACTOR:

    procedure factor(fsys: symset);
    var i: integer;
    begin
      test(facbegsys, fsys, 24);
      while sym in facbegsys do
      begin
        if sym = ident then
        begin
          i := position(id);
          if i = 0 then error(11) else
            with table[i] do
              case kind of
                constant: gen(lit, 0, val);
                variable: gen(lod, lev - level, adr);
                proc:     error(21)
              end;
          getsym
        end
        else if sym = number then
        begin
          if num > amax then
          begin
            error(31); num := 0
          end;
          gen(lit, 0, num);
          getsym
        end
        else if sym = lparen then
        begin
          getsym;
          expression([rparen] + fsys);
          if sym = rparen then getsym else error(22)
        end;
        test(fsys, [lparen], 23)
      end
    end (* factor *);

This is all done in factor, which checks that the operand is either a number, constant or variable and generates the correct code accordingly. Note the semantic check for the token identifier to check that it is a variable or constant.
Also note the recursive call to expression to process an open bracket.

The Interpreter

The major variables used in the interpreter are:

    s       array [1..stacksize] of integer; the stack in which all evaluations are done
    code    The array of instructions to be executed
    t       Top of stack, used as the stack pointer
    p       The program counter
    b       Base of the current stack frame

The code for the add and sub instructions is:

    case a of
      2: begin
           t := t - 1; s[t] := s[t] + s[t+1];
         end;
      3: begin
           t := t - 1; s[t] := s[t] - s[t+1];
         end;

5.6 Example of Lexical Levels

    var a,b;
    procedure p1;
      var p1a,p1b;
    begin
      a := 2; b := 3;
      p1a := 4; p1b := 5;
    end;
    begin
      a := -1; b := -2;
      call p1;
    end .

The compiler listing for this program is shown below. The number before each source line is the address of the next instruction to be generated.

        0 var a,b;
        1 procedure p1;
        1   var p1a,p1b;
        2 begin
        3   a := 2; b := 3;
        7   p1a := 4; p1b := 5;
       11 end;
        2  int  0  5
        3  lit  0  2
        4  sto  1  3      { a := 2; }
        5  lit  0  3
        6  sto  1  4      { b := 3; }
        7  lit  0  4
        8  sto  0  3      { p1a := 4; }
        9  lit  0  5
       10  sto  0  4      { p1b := 5; }
       11  opr  0  0      { RETURN }
       12 begin
       13   a := -1; b := -2; call p1;
       20 end .
       12  int  0  5      { Space for vars/SL/DL/RA }
       13  lit  0  1
       14  opr  0  1
       15  sto  0  3      { a := -1 }
       16  lit  0  2
       17  opr  0  1
       18  sto  0  4      { b := -2 }
       19  cal  0  2      { Call P1 }
       20  opr  0  0      { Return }

Note that the lexical level in an instruction is the number of lexical levels back from the current one to the level in which the variable is stored.
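To see the level difference at work, the listing above can be run on a small stack machine. What follows is a Python sketch under stated assumptions, not the interpreter of the compiler itself. Each frame is assumed to start with a static link, a dynamic link and a return address, the stack is indexed from 0 rather than 1, only the instructions used by this listing are implemented, and the two jmp instructions at addresses 0 and 1 are assumed because the listing does not show them. The names run, base and CODE are hypothetical.

```python
def run(code, stack_size=32):
    """Execute (f, l, a) instructions and return the final stack."""
    s = [0] * stack_size     # run-time stack
    p, b, t = 0, 0, -1       # program counter, frame base, top of stack

    def base(l):
        """Follow l static links back from the current frame."""
        b1 = b
        while l > 0:
            b1 = s[b1]
            l -= 1
        return b1

    while True:
        f, l, a = code[p]
        p += 1
        if f == "lit":
            t += 1
            s[t] = a
        elif f == "lod":
            t += 1
            s[t] = s[base(l) + a]
        elif f == "sto":
            s[base(l) + a] = s[t]
            t -= 1
        elif f == "int":
            t += a
        elif f == "jmp":
            p = a
        elif f == "cal":
            # New frame: static link, dynamic link, return address.
            s[t + 1], s[t + 2], s[t + 3] = base(l), b, p
            b, p = t + 1, a
        elif f == "opr" and a == 0:      # return
            t = b - 1
            p, b = s[t + 3], s[t + 2]
        elif f == "opr" and a == 1:      # negate
            s[t] = -s[t]
        else:
            raise ValueError("instruction not covered by this sketch")
        if p == 0:                       # returned from the main block
            return s


CODE = [
    ("jmp", 0, 12),  # 0: assumed jump to the main block
    ("jmp", 0, 2),   # 1: assumed jump to the body of p1
    ("int", 0, 5), ("lit", 0, 2), ("sto", 1, 3),    # 2..4   a := 2
    ("lit", 0, 3), ("sto", 1, 4),                   # 5..6   b := 3
    ("lit", 0, 4), ("sto", 0, 3),                   # 7..8   p1a := 4
    ("lit", 0, 5), ("sto", 0, 4),                   # 9..10  p1b := 5
    ("opr", 0, 0),                                  # 11     return
    ("int", 0, 5), ("lit", 0, 1), ("opr", 0, 1),    # 12..14
    ("sto", 0, 3),                                  # 15     a := -1
    ("lit", 0, 2), ("opr", 0, 1), ("sto", 0, 4),    # 16..18 b := -2
    ("cal", 0, 2),                                  # 19     call p1
    ("opr", 0, 0),                                  # 20     return
]
```

Calling run(CODE) leaves a and b in slots 3 and 4 of the stack holding 2 and 3. Those values were written from inside p1 by sto 1 3 and sto 1 4, which reach the global frame by following one static link, whereas sto 0 3 inside p1 writes p1a in the current frame.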
Stack frame of the main program:

      SL  DL  RA  A   B
    +---+---+---+---+---+
    | - | - | 0 | -1| -2|
    +---+---+---+---+---+

    SL    Static Link       Base of previous lexical level
    DL    Dynamic Link      Base of previous active stack frame
    RA    Return Address

Stack after the call to p1:

      SL  DL  RA  a   b   SL  DL  RA p1a p1b
    +---+---+---+---+---+---+---+---+---+---+
    | - | - | 0 | -1| -2| * | * | 20| 4 | 5 |
    +---+---+---+---+---+---+---+---+---+---+
      ^                   |   |
      |                   |   |
      +-------------------+   |
      |                       |
      +-----------------------+

Note how the PL/0 compiler accesses variables:

    Line 6    p1a := 4;    LIT 4     STO 0,3
    Line 5    a := 2;      LIT 2     STO 1,3
    Line 9    a := -1;     LIT -1    STO 0,3

5.7 Use of a Display

Rather than use a chain of static links, it is better to use a display.

    +---+
    | * | ---> Base of global variables
    +---+
    | * | ---> Base of variables 1st lexical level
    +---+
    | * | ---> Base of variables 2nd lexical level
    +---+
    | * | ---> Base of variables 3rd lexical level
    +---+
     etc

This would mean having to change:

    1) Code generation of the LOD/STO instructions
    2) The interpreter, to maintain the display
    3) The definition of the call instruction and exit instruction (OPR 0), to maintain the display

The reason for using a display is that access to the lexical levels by chaining through static links at run time would be grossly inefficient for compiled code.

Usually when using a display in a compiled program the display is held in registers; in this way the addressing mode base+offset can be used. Thus if register 0 contained the base address of the current lexical level, then to access the variable at displacement 4 in that lexical level the following addressing mode could be used:

    MOV 4(r0),TARGET    {PDP-11/VAX instruction}
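The idea of a display can be sketched as follows. This is a Python illustration under stated assumptions, not a change to the PL/0 interpreter. Instructions are assumed to carry the absolute lexical level of a variable rather than a level difference, each frame is assumed to save the display entry it overwrites in its first cell, and the class and method names are hypothetical.

```python
class DisplayMachine:
    """Variable access through a display instead of a static chain."""

    def __init__(self, size=64, levels=4):
        self.s = [0] * size            # run-time stack
        self.display = [0] * levels    # display[k] = base of active frame at level k
        self.t = -1                    # top of stack

    def enter(self, level, frame_size):
        """Open a frame for a block declared at lexical level `level`."""
        base = self.t + 1
        self.s[base] = self.display[level]   # save the entry being replaced
        self.display[level] = base
        self.t += frame_size
        return base

    def leave(self, level):
        """Close the current frame at `level` and restore the display."""
        base = self.display[level]
        self.display[level] = self.s[base]
        self.t = base - 1

    def load(self, level, offset):
        # One indexed access, however far away the level is.
        return self.s[self.display[level] + offset]

    def store(self, level, offset, value):
        self.s[self.display[level] + offset] = value
```

For the program of section 5.6 the main block would call enter(0, 3) and p1 would call enter(1, 3). Inside p1, store(0, 1, 2) then reaches the global a with a single index into the display, where the static chain version has to follow one link for every level of difference.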