
TERM PAPER ON COMPILER WRITING

CPT 623
Osaigbovo Timothy
School of ICT,
Federal University of Technology Minna.
+2348034635653
timothyosaigbovo@yahoo.com
MTECH/SNAS/2012/3955
Submitted to
PROF. H. C. INYIAMA










COMPILER WRITING
Osaigbovo Timothy
School of ICT,
Federal University of Technology Minna.
+2348034635653
timothyosaigbovo@yahoo.com
Abstract
This paper introduces compiler writing techniques. It
was written with the following observations in mind:
(1) While compiler writing can be done without much
knowledge of formal language or automata theory, most
books on compiler writing devote the majority of their
chapters to these topics. This inevitably gives most
readers the false impression that compiler writing is a
difficult task. I have deliberately avoided this formal
approach; each aspect of compiler writing is presented
informally enough that even a novice reader can
comprehend it. (2) There are several different aspects of
compiler writing. Most publications give detailed
presentations of every aspect and still fail to give the
reader a feel for how to write a compiler. Most important
of all, the internal data structures of a simple compiler
will be clearly presented. Examples are given throughout,
and a reader will be able to understand how a compiler
works by studying them. In other words, I make sure that
the reader does not lose the global view of compiler
writing; he sees not only the trees but also the forest.
Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - Code
generation, Compilers, Parsing, Run-time environments
General Terms
Design, Management
Keywords
Compiler, Cross-Compiler, Bootstrapping, Compiler
Design, Object Oriented Programming, High Level
Language, Design Patterns, Tools
1.0 Introduction
The study of compiler design forms a central theme in
the field of computer science. An understanding of the
techniques used by high level language compilers gives
the programmer a set of skills applicable in many aspects
of software design - one does not have to be a compiler
writer to make use of them.
Compiler writing is not confined to one discipline only;
rather, it spans several others: programming languages,
computer architecture, theory of programming languages,
algorithms, etc. This paper is intended as an introduction
to the basic essential features of compiler writing.


1.1 What Is A Compiler?
A compiler is software (a program) that reads a program
written in a source language and translates it into an
equivalent program in another language - the target
language. An important aspect of the compilation process
is the production of diagnostics (error messages) for the
source program. These error messages mainly report
grammatical mistakes made by the programmer.
There are thousands of source languages, ranging from C
and Pascal to specialised languages that have arisen in
virtually every area of computer application. Target
languages also number in the thousands: a target language
may be another programming language, a machine
language or an assembly language. Compilers are
classified as single-pass, multi-pass, debugging or
optimizing, depending on how they have been
constructed or on what functions they are supposed to
perform. In the early days (the 1950s), a compiler was
considered a very difficult program to write; the first
FORTRAN compiler, for example, took 18 staff-years to
implement. Since then, many new techniques and tools
have been developed for handling the important tasks that
occur during the compilation process. Good
implementation languages, programming environments
(editors, debuggers, etc.) and software tools have also
been developed. With these developments, the compiler
writing exercise has become much easier.
1.2 Approaches to Compiler Development
There are several approaches to compiler development.
Here we will look at some of them.









1.2.1 Assembly Language Coding
Early compilers were mostly coded in assembly
language. The main consideration was to increase
efficiency. This approach worked very well for small
High Level Languages (HLLs). As languages and their
compilers became larger, many bugs started surfacing
that were difficult to remove. The major difficulty with
assembly language implementation was poor software
maintainability.
Around this time, it was realised that coding compilers
in a high level language would overcome this
disadvantage of poor maintenance.
Many compilers were therefore coded in FORTRAN, the
only widely available HLL at that time. For example, the
FORTRAN H compiler for the IBM/360 was coded in
FORTRAN. Later, many system programming languages
were developed to ensure the efficiency of compilers
written in an HLL. Assembly language is still used, but
the trend is towards compiler implementation in an HLL.

1.2.2 Cross-Compiler
A cross-compiler is a compiler that runs on one
machine and generates code for another machine. The
only difference between a cross-compiler and a normal
compiler is the code it generates. For example, consider
the problem of implementing a Pascal compiler on a new
piece of hardware (a computer called X) on which
assembly language is the only programming language
already available. Under these circumstances, the
obvious approach is to write the Pascal compiler in
assembler. Hence, the compiler in this case is a program
that takes Pascal source as input, produces machine code
for the target machine as output and is written in the
assembly language of the target machine.
This implementation involves a great deal of work, since
a large assembly language program has to be written for
X. Notice that in this case the compiler is very machine
specific: not only does it run on X, but it also produces
machine code suitable for running on X. Furthermore,
only one computer is involved in the entire
implementation process.

The use of a high-level language for coding the compiler
can offer great savings in implementation effort. If the
language in which the compiler is being written is
already available on the computer in use, then the
process is simple. For example, Pascal might already be
available on machine X, thus permitting the coding of,
say, a Modula-2 compiler in Pascal.
If the language in which the compiler is being written is
not available on the machine, all is not lost, since it
may be possible to make use of an implementation of that
language on another machine. For example, a Modula-2
compiler could be implemented in Pascal on machine Y,
producing object code for machine X.
The object code for X generated on machine Y would of
course have to be transferred to X for its execution. This
process of generating code on one machine for execution
on another is called cross-compilation.
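
As a concrete illustration (not from the paper): with the
GNU toolchain, cross-compilers are conventionally named
by a target triple, so on an x86-64 Linux host (machine Y)
one might produce and ship ARM code (machine X) roughly
as follows. The toolchain and host names here are
assumptions for the sketch, not prescribed by the paper.

arm-linux-gnueabihf-gcc -o hello hello.c   # compile on Y, emitting machine code for X
scp hello user@machine-x:                  # transfer the object code to X
# ...then log in to machine X and run ./hello there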
1.2.3 Bootstrapping
Bootstrapping is the technique of developing a compiler
for a language using a subset (a small part) of the same
language.
Suppose that a Modula-2 compiler is required for machine
X, but that the compiler itself is to be coded in Modula-2.
Coding a compiler in the language it is to compile is
nothing special and, as will be seen, it has a great deal in
its favour. Suppose further that Modula-2 is already
available on machine Y. In this case, the compiler can be
run on machine Y, producing object code for machine X.
This is the same situation as before, except that the
compiler is coded in Modula-2 rather than Pascal. The
special feature of this approach appears in the next step.
The compiler, running on Y, is nothing more than a large
program written in Modula-2. Its function is to transform
an input file of Modula-2 statements into a functionally
equivalent sequence of statements in X's machine code.
Therefore, the source statements of this Modula-2
compiler can be fed into itself running on Y to
produce a file containing X's machine code. This file is
of course a Modula-2 compiler capable of being
run on X. By making the compiler compile itself, a
version of the compiler that runs on X has been created.
Once this machine code has been transferred to X, a self-
sufficient Modula-2 compiler is available on X; hence
there is no further need for machine Y to support
Modula-2 compilation.
This implementation plan is very attractive. Machine Y is
only required for compiler development, and once this
development has reached the stage at which the compiler
can (correctly) compile itself, machine Y is no longer
required. Consequently, the original compiler
implemented on Y need not be of the highest quality; for
example, optimization can be completely disregarded.
Further development (and, obviously, conventional use)
of the compiler can then continue at leisure on machine X.
This approach to compiler implementation is called
bootstrapping. Many languages, including C, Pascal,
FORTRAN and LISP, have been implemented in this
way.
Pascal was first implemented by writing a compiler in
Pascal itself. This was done through several
bootstrapping processes: the compiler was first translated
"by hand" into an available low-level language.
1.3 Compiler Design Phases
The compiler, being a complex program, is developed
through several phases. Each phase transforms the source
program from one representation to another. The tasks of
a compiler can be divided very broadly into two sub-
tasks:
i. The analysis of the source program
ii. The synthesis of the object program
In a typical compiler, the analysis task consists of three
phases:
i. Lexical analysis
ii. Syntax analysis
iii. Semantic analysis
The synthesis task is usually considered a single code
generation phase, but it can be divided into further
distinct phases such as intermediate code generation and
code optimization. These phases, functioning in
sequence, are shown in figure 1. The nature of the
interfaces between the phases depends on the compiler;
it is perfectly possible for the four phases to
exist as four separate programs.
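
To make this division of labour concrete, here is a
minimal end-to-end sketch in C (entirely illustrative: the
token names, the one-rule "grammar" of sums, and the
stack-machine instructions PUSH/ADD are invented for this
example). It scans a sum of numbers, checks its syntax,
and generates code in a single fused pass, the way very
small compilers often do:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* --- lexical analysis: characters -> tokens --- */
typedef enum { T_NUM, T_PLUS, T_END } TokKind;
typedef struct { TokKind kind; int value; } Token;

static const char *src;                      /* source program text */

static Token next_token(void)
{
    while (*src == ' ') src++;               /* discard blanks */
    if (isdigit((unsigned char)*src)) {
        Token t = { T_NUM, 0 };
        while (isdigit((unsigned char)*src))
            t.value = t.value * 10 + (*src++ - '0');
        return t;
    }
    if (*src == '+') { src++; return (Token){ T_PLUS, 0 }; }
    return (Token){ T_END, 0 };
}

/* --- syntax analysis and code generation, fused --- */
static void compile_sum(void)
{
    Token t = next_token();
    if (t.kind != T_NUM) { fprintf(stderr, "syntax error\n"); exit(1); }
    printf("PUSH %d\n", t.value);            /* code for the first operand */
    for (t = next_token(); t.kind == T_PLUS; t = next_token()) {
        Token rhs = next_token();
        if (rhs.kind != T_NUM) { fprintf(stderr, "syntax error\n"); exit(1); }
        printf("PUSH %d\n", rhs.value);      /* code for the next operand  */
        printf("ADD\n");                     /* combine the top two values */
    }
}

int main(void)
{
    src = "1 + 22 + 333";                    /* the "source program" */
    compile_sum();                           /* emits target code on stdout */
    return 0;
}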
1.3.1 Lexical Analysis
Lexical analysis is the first phase of a compiler. Lexical
analysis, also called scanning, performs two important
tasks. First, it scans the source program character by
character from left to right and groups the characters into
tokens (or syntactic elements) having a collective
meaning. Each token, or basic syntactic element,
represents a logically cohesive sequence of characters,
such as an identifier (also called a variable), a keyword
(if, then, else, etc.) or a multi-character operator such as
<=. The output of this phase goes to the next phase, i.e.
syntax analysis or parsing. The interaction between the
two phases is shown below in figure 2.
The second task performed during lexical analysis is to
make an entry for each token in a symbol table if it is not
already there. Some other tasks performed during lexical
analysis are:
to remove all comments, tabs, blank spaces and machine
control characters;
to produce error messages (also called diagnostics) for
errors in the source program.
Let us consider the following Pascal language statement:
FOR i := 1 TO 50 DO sum := sum + x[i];  { sum of numbers stored in array x }
After going through the statement, the lexical analyzer
transforms it into the sequence of tokens:
FOR  i  :=  1  TO  50  DO  sum  :=  sum  +  x  [  i  ]  ;
Tokens are based on certain grammatical structures.
Regular expressions are important notations for
specifying these tokens. A regular expression consists of
symbols (in the alphabet of the language that is being
defined) and a set of operators that allow:
(i) concatenation (combination of strings),
(ii) repetition, and
(iii) alternation.
Figure 1: Compiler Design Phases
Figure 2: Interaction between the first two phases

Examples of Regular Expressions
(i) ab denotes the set of strings {ab}
(ii) a | b denotes either a or b
(iii) a* denotes {ε, a, aa, aaa, ...}
(iv) ab* denotes {a, ab, abb, abbb, ...}
(v) [a-zA-Z][a-zA-Z0-9]* gives a definition of a variable,
which means that a variable starts with an alphabetic
character followed by any number of alphabetic
characters or digits.
Writing a lexical analyzer completely from scratch is a
fairly challenging task. Several tools have been built for
constructing lexical analyzers from special-purpose
notations based on regular expressions. Perhaps the most
famous of these tools is Lex, one of the many utilities
available with the UNIX operating system. Lex requires
that the syntax of each lexical token be defined in terms
of a regular expression. Associated with each regular
expression is a fragment of code that defines the action to
be taken when that expression is recognised.
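
Before relying on such tools, it is instructive to see how
small a hand-coded scanner can be. The following sketch
(illustrative only; it handles only single-character
operators and does not separate keywords from identifiers)
groups an input string into the token classes described
above:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *p = "IF count42 THEN sum := sum + 1";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip blanks */
        if (isalpha((unsigned char)*p)) {                    /* identifier (or keyword) */
            printf("IDENTIFIER: ");
            while (isalnum((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else if (isdigit((unsigned char)*p)) {             /* number */
            printf("NUMBER: ");
            while (isdigit((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else {                                             /* single-character operator */
            printf("OPERATOR: %c\n", *p);
            p++;
        }
    }
    return 0;
}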
The Symbol Table
An essential function of a compiler is to record each
identifier and related information about its attributes: its
type (numeric or character), its scope (where in the
program it is valid) and, in the case of a procedure or
function name, such things as the number and types of its
arguments, the mechanism for passing each argument
and the type of result it returns.
A symbol table is a set of locations containing a record
for each identifier, with fields for the attributes of the
identifier. A symbol table allows us to find the record for
each identifier (variable) and to store or retrieve data
from that record quickly.
For example, take a declaration written in C such as
int x, y, z;
The lexical analyzer, after going through this declaration,
will enter x, y and z into the symbol table. This is shown
in the figure given below.

Figure 3: Symbol Table

Variable   Address
x          Location 1
y          Location 2
z          Location 3
...        ...

The first column of this table contains the variables and
the second contains the addresses of the memory
locations where the values of these variables will be
stored. The remaining phases enter further information
about identifiers into the symbol table and then use this
information in various ways.
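
As a sketch of such a structure (the field names and the
linear-search organisation are illustrative; production
compilers usually use hash tables), the int x, y, z;
example above might be recorded like this:

#include <stdio.h>
#include <string.h>

#define MAXSYMS 100

struct symbol {
    char name[32];        /* the identifier itself           */
    char type[16];        /* attribute: its data type        */
    int  address;         /* attribute: its storage location */
};

static struct symbol table[MAXSYMS];
static int nsyms = 0;

/* Find name in the table, entering it if absent (which is
   what the lexical analyzer does on first seeing an
   identifier); returns the record's index. */
int lookup_or_insert(const char *name, const char *type)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strcpy(table[nsyms].name, name);
    strcpy(table[nsyms].type, type);
    table[nsyms].address = nsyms;   /* pretend addresses are just slot numbers */
    return nsyms++;
}

int main(void)
{
    /* int x, y, z;  -- the C declaration used as the example above */
    lookup_or_insert("x", "int");
    lookup_or_insert("y", "int");
    lookup_or_insert("z", "int");
    for (int i = 0; i < nsyms; i++)
        printf("%-4s %-6s location %d\n",
               table[i].name, table[i].type, table[i].address);
    return 0;
}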
1.3.2 Syntax Analysis
Every language, whether a programming language or a
natural language, follows certain grammatical rules that
define the syntactic structures of the language. In the C
language, for example, a program is made out of a main
function consisting of blocks, a block out of statements, a
statement out of expressions, and an expression out of
tokens, and so on. The syntax of programming language
constructs can be described by Backus-Naur Form
(BNF) notations. These notations are also called
context-free grammars. Well-formed grammars offer
significant advantages to the compiler designer:
A grammar gives a precise, yet easy to
understand, syntactic specification of a
programming language.
Parsers that determine whether a source
program is syntactically correct can be
constructed automatically for certain classes of
grammars.
A well-designed grammar imparts a structure to
a programming language that is useful for the
translation of source programs into correct
object code.
Syntax analysis is the second phase of the compilation
process. This process is also called parsing. It performs
the following operations:
1. Obtains a stream of tokens from the lexical
analyser.
2. Determines whether the string of tokens can be
generated by the grammar of the language, i.e. it
checks whether the expression is syntactically
correct or not.
3. Reports syntax error(s), if any.
The output of parsing is a representation of the syntactic
structure of a statement in the form of a parse tree (syntax
tree).
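
As an illustration, here is a sketch of a recursive-descent
parser in C for the expression grammar used later in this
paper (expr -> expr + term | term; term -> term * factor |
factor; factor -> ( expr ) | digit), with the left
recursion rewritten as iteration. It answers only the
question the parser phase asks: is the input syntactically
correct?

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p;    /* current position in the token stream */

static void expr(void);  /* forward declaration: grammar is recursive */

static void error(void)
{
    fprintf(stderr, "syntax error at '%c'\n", *p);
    exit(1);
}

/* factor -> ( expr ) | digit */
static void factor(void)
{
    if (*p == '(') { p++; expr(); if (*p++ != ')') error(); }
    else if (isdigit((unsigned char)*p)) p++;
    else error();
}

/* term -> factor { * factor } */
static void term(void)
{
    factor();
    while (*p == '*') { p++; factor(); }
}

/* expr -> term { + term } */
static void expr(void)
{
    term();
    while (*p == '+') { p++; term(); }
}

int main(void)
{
    p = "(1+2)*3+4";
    expr();
    if (*p != '\0') error();     /* trailing garbage is also an error */
    printf("syntactically correct\n");
    return 0;
}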
1.3.3 Semantic Analysis
The role of the semantic analyzer is to provide methods
by which the structures constructed by the syntax
analyzer may be evaluated or analyzed.
The semantic analysis phase checks the source program
for semantic errors and gathers data type information for
the subsequent code-generation phase. An important
component of semantic analysis is type checking. Here
the compiler checks that each operator has operands that
are permitted by the source language specification. For
example, many programming language definitions
require a compiler to report an error every time a real
number is used to index an array: in a[5.6], the index 5.6
is a real value, not an integer.
To illustrate some of the actions of a semantic analyzer,
consider the expression a + b - c * d in a language such as
Pascal, where a, b and c have data type integer and d has
type real. The syntax analyzer produces a parse tree of
the form shown in figure 4(a).
One of the tasks of the semantic analyzer is to perform
type checking within this expression. By consulting the
symbol table, the data types of all the variables can be
inserted into the tree, as shown in figure 4(b); the
analyzer then performs semantic type conversions and
labels each node accordingly.


Figure 4: Semantic Analysis of an arithmetic
expression
The semantic analyser can determine the types of the
intermediate results and thus propagate the type attributes
through the tree, checking for compatibility as it goes. In
our example, the semantic analyzer first considers the
result of c * d. According to the Pascal semantic rule
integer * real -> real, the * node can be labelled as real.
This is shown in figure 4(c).
Compilers vary widely in the role taken by the semantic
analyzer. In some simpler compilers there is no easily
identifiable semantic analysis phase; the syntax analyzer
itself does semantic analysis and intermediate code
generation directly. In other compilers, syntax analysis,
semantic analysis and code generation are separate
phases. In the next section we discuss the code
generation phase.
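
The propagation just described can be sketched in a few
lines of C. Everything here (the node layout, the two-type
system, the leaf initialisation standing in for a symbol
table lookup) is an illustrative assumption, not the
paper's implementation:

#include <stdio.h>

typedef enum { T_INT, T_REAL } Type;

typedef struct Node {
    char op;                  /* '+', '-', '*' for operators, 0 for leaves */
    Type type;                /* filled in by the semantic analyzer        */
    struct Node *left, *right;
} Node;

/* Pascal-style rule: integer op integer -> integer;
   anything involving a real -> real (the integer operand
   would be converted). */
Type check(Node *n)
{
    if (n->op == 0) return n->type;   /* leaf: type came from the symbol table */
    Type lt = check(n->left);
    Type rt = check(n->right);
    n->type = (lt == T_REAL || rt == T_REAL) ? T_REAL : T_INT;
    return n->type;
}

int main(void)
{
    /* a + b - c * d  with a, b, c integer and d real, as in figure 4 */
    Node a = {0, T_INT}, b = {0, T_INT}, c = {0, T_INT}, d = {0, T_REAL};
    Node mul = {'*', T_INT, &c, &d};
    Node add = {'+', T_INT, &a, &b};
    Node sub = {'-', T_INT, &add, &mul};
    printf("expression type: %s\n",
           check(&sub) == T_REAL ? "real" : "integer");
    return 0;
}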
1.3.4 Code Generation and Optimization
The final phase of the compiler is the code generator.
The code generator takes as input the intermediate
representation (in the form of a parse tree) of the source
program and produces as output an equivalent target
program (figure 5).
Figure 5: Code generation phase
The target program may take on a variety of forms:
absolute machine language, relocatable machine
language or assembly language. Producing an absolute
machine language program as output has the advantage
that it can be placed in a fixed location in memory and
immediately executed.
Producing a relocatable machine language program
(an object module) as output allows sub-programs to be
compiled separately. A set of relocatable object modules
can be linked together and loaded for execution by a
linking loader. The process of linking and loading
relocatable object code may be a little time consuming,
but it provides the flexibility of compiling subroutines
separately and of calling other previously compiled
programs from an object module. If the target machine
does not handle relocation automatically, the compiler
must provide explicit relocation information to the loader
to link the separately compiled program segments.
Producing an assembly-language program as output
makes the process of code generation somewhat
simpler. We can generate symbolic instructions and use
the macro facilities of the assembler to help generate
code.
A thorough knowledge of the target machine's
architecture, as well as its instruction set, is required to
write a good code generator. The code generator is
concerned with the choice of machine instructions, the
allocation of machine registers, addressing, and
interfacing with the operating system. To produce faster
and more compact code, the code generator should
include some form of code optimization. This may
exploit techniques such as the use of special-purpose
machine instructions or addressing modes, register
optimization, and so on. Such code optimization may
incorporate both machine-dependent and machine-
independent techniques.
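
As a sketch of the postorder tree walk a simple code
generator performs (the three-address output format and
the node layout are invented for illustration, not taken
from the paper):

#include <stdio.h>

typedef struct Node {
    char op;                    /* '+', '*', or 0 for a leaf */
    const char *name;           /* variable name for a leaf  */
    struct Node *left, *right;
} Node;

static int ntemp = 0;

/* Postorder walk: generate code for both children, then emit
   one three-address instruction; returns the name of the
   location holding the node's value. */
static const char *gen(Node *n)
{
    static char temps[32][8];
    if (n->op == 0) return n->name;
    const char *l = gen(n->left);
    const char *r = gen(n->right);
    char *t = temps[ntemp];
    sprintf(t, "t%d", ntemp++);
    printf("%s = %s %c %s\n", t, l, n->op, r);
    return t;
}

int main(void)
{
    /* (a + b) * c */
    Node a = {0, "a"}, b = {0, "b"}, c = {0, "c"};
    Node add = {'+', NULL, &a, &b};
    Node mul = {'*', NULL, &add, &c};
    gen(&mul);    /* prints: t0 = a + b, then t1 = t0 * c */
    return 0;
}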
1.4 Software Writing Tools
The two best known software tools for compiler
construction are Lex (a lexical analyzer generator) and
Yacc (a parser generator), both of which are available
under the UNIX operating system. Their continuing
popularity is partly due to their widespread availability,
but also because they are powerful and easy to use, with
a wide range of applicability. This section describes these
two software tools.
1.4.1 Lex
Lex is a software tool that takes as input a specification
of a set of regular expressions, together with actions to be
taken on recognising each of these expressions. The
output of Lex is a program that recognises the regular
expressions and acts appropriately on each. Lex is
generally used in the manner depicted in figure 6.
Figure 6: Creating a lexical analyser with lex.
First, a specification of a lexical analyzer is prepared by
creating a program lex.l in the Lex language. Then, lex.l
is run through the Lex compiler to produce a C program
lex.yy.c, which consists of a C representation of a
recogniser for the regular expressions, together with the
user-supplied code. Finally, lex.yy.c is run through the C
compiler to produce an object program a.out, which is a
lexical analyzer that transforms an input stream into a
sequence of tokens.
Lex specifications:
A Lex program consists of three parts:
declarations
%%
translation rules
%%
user routines
Any of these three sections may be empty, but the %%
separator between the definitions and the rules cannot be
omitted.
The declarations section includes declarations of
variables, constants and regular definitions. Regular
definitions are statements used as components of the
regular expressions appearing in the translation rules.
The translation rules of a Lex program, which are the key
part of the Lex input, are statements of the form:

p1   { action 1 }
p2   { action 2 }
...
pn   { action n }

where each pi is a regular expression and each action is a
program fragment describing what the lexical analyzer
should do when pattern pi matches a token. In Lex the
actions are generally written in C; in principle, however,
they could be in any implementation language.
The third section contains user routines that are needed
by the actions. Alternatively, these procedures can be
compiled separately and loaded with the lexical analyzer.
Lex supports a very powerful range of operators for the
construction of regular expressions. For example, a
regular expression for an identifier can be written as:
[A-Za-z][A-Za-z0-9]*
which represents an arbitrary string of letters and digits
beginning with a letter, suitable for matching a variable
in many programming languages.
Here is a list of Lex operators with examples:

Operator        Example   Meaning
* (asterisk)    a*        Set of all strings of zero or more a's, i.e. {ε, a, aa, aaa, ...}
| (or)          a|b       Either a or b
+ (plus)        a+        One or more instances of a, i.e. a, aa, aaa, etc.
? (question)    a?        Zero or one instance of a
[ ] (class)     [abc]     a | b | c. A character class such as [a-z] denotes the regular expression a | b | ... | z.
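
Putting the three parts together, the following is a small
but complete Lex specification, offered as an illustrative
sketch (the token names and printf actions are this
sketch's own, not from any particular compiler). It
recognises identifiers and unsigned integers and reports
anything else character by character:

%{
#include <stdio.h>
%}

letter  [A-Za-z]
digit   [0-9]

%%
{letter}({letter}|{digit})*   { printf("IDENTIFIER: %s\n", yytext); }
{digit}+                      { printf("NUMBER: %s\n", yytext); }
[ \t\n]+                      ; /* discard blanks, tabs and newlines */
.                             { printf("OTHER: %s\n", yytext); }
%%

int yywrap(void) { return 1; }

int main(void)
{
    yylex();        /* run the generated scanner on standard input */
    return 0;
}

Saved as lex.l, this specification would be run through
lex (or flex) to produce lex.yy.c, which is then compiled
with the C compiler, exactly as described above.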



1.4.2 Yacc
Yacc (Yet Another Compiler Compiler) assists in the
construction of the next phase of the compiler: it creates
a parser, output in a form suitable for inclusion in the
rest of the compiler. Yacc is available as a command
(utility) on the UNIX system and has been used to help
implement hundreds of compilers.
A parser can be constructed using Yacc in the manner
illustrated in figure 7.

Figure 7: Yacc Functioning
First a file, say parse.y, containing a Yacc specification
for an expression is prepared. The UNIX system command
yacc parse.y transforms the file parse.y into a C program
called y.tab.c, which is a representation of the parser
written in C, along with any other C routines that the
user may have prepared. y.tab.c is run through the C
compiler to produce an object program a.out that
performs the translation specified by the original Yacc
program. A Yacc source program also has three parts,
just as a Lex program does:
declarations
%%
translation rules
%%
supporting C routines
Example: To illustrate how to prepare a Yacc source
program, let us construct a simple desk calculator that
reads an arithmetic expression, evaluates it, and then
prints its numeric value. We shall build the desk
calculator starting with the following grammar for
arithmetic expressions:
expr   -> expr + term | term
term   -> term * factor | factor
factor -> ( expr ) | digit
The token digit is a single digit ranging from 0 to 9. A
Yacc desk calculator program derived from this grammar
is shown in figure 8.
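
Figure 8 has not survived reproduction here, so the
following is a reconstruction of the specification that the
surrounding text walks through (the %token DIGIT
declaration, the rules with their semantic actions, and the
yylex() routine are all described below); read it as a
sketch in the style of the classic desk-calculator example
rather than as the paper's exact figure.

%{
#include <ctype.h>
#include <stdio.h>
%}

%token DIGIT

%%
line   : expr '\n'          { printf("%d\n", $1); }
       ;
expr   : expr '+' term      { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor    { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'       { $$ = $2; }
       | DIGIT
       ;
%%

/* The lexical analyzer: a digit becomes a DIGIT token, with
   its numeric value passed to the parser through yylval; any
   other character is returned as a token by itself. */
int yylex(void)
{
    int c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}

int yyerror(const char *s) { fprintf(stderr, "%s\n", s); return 0; }

int main(void) { return yyparse(); }

Figure 8: Yacc specification of a simple desk calculator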

The declarations part. There are two optional sections in
the declarations part of a Yacc program. In the first
section, we write ordinary C declarations, delimited by
%{ and %}. Here we place declarations of any
temporaries used by the translation rules or procedures of
the second and third sections. In figure 8, this section
contains only the include statement #include <ctype.h>,
which causes the C pre-processor to include the standard
header file <ctype.h> containing the predicate isdigit.
Also in the declarations part are declarations of grammar
tokens. In figure 8 the statement
%token DIGIT
declares DIGIT to be a token. Tokens declared in this
section can then be used in the second and third parts of
the Yacc specification.
The translation rules part. In the part of the Yacc
specification after the first %% pair, we put the translation
rules. Each rule consists of a grammar production and the
associated semantic action. A set of productions that we
have been writing as
<left side> -> <alt 1> | <alt 2> | ... | <alt n>
would be written in Yacc as
<left side> : <alt 1>  { semantic action 1 }
            | <alt 2>  { semantic action 2 }
            ...
            | <alt n>  { semantic action n }
            ;
In a Yacc production, a quoted single character 'c' is
taken to be the terminal symbol c, and unquoted strings
of letters and digits not declared to be tokens are taken to
be nonterminals. Alternative right sides are separated by
a vertical bar, and a semicolon follows each left side
together with its alternatives and their semantic actions.
The left side of the first production is taken to be the
start symbol.
A Yacc semantic action is a sequence of C statements. In
a semantic action, the symbol $$ refers to the attribute
value associated with the nonterminal on the left, while
$i refers to the value associated with the ith grammar
symbol (terminal or nonterminal) on the right. The
semantic action is performed whenever we reduce by the
associated production, so normally the semantic action
computes a value for $$ in terms of the $i's. In the Yacc
specification, we have written the two production rules
for expressions,
expr -> expr + term | term
and their associated semantic actions as:
expr : expr '+' term   { $$ = $1 + $3; }
     | term
     ;
Note that the nonterminal term in the first production is
the third grammar symbol on the right, while '+' is the
second. The semantic action associated with the first
production adds the values of the expr and the term on
the right and assigns the result as the value of the
nonterminal expr on the left. We have omitted the
semantic action for the second production altogether,
since copying the value is the default action for
productions with a single grammar symbol on the right.
In general, { $$ = $1; } is the default semantic action.
Notice that we have added a new starting production
line : expr '\n'   { printf("%d\n", $1); }
to the Yacc specification. This production says that an
input to the desk calculator is to be an expression
followed by a newline. The semantic action associated
with this production prints the decimal value of the
expression.
The supporting C-routines part. The third part of a Yacc
specification consists of supporting C routines. A lexical
analyzer named yylex() must be provided; other
procedures, such as error recovery routines, may be
added as necessary.
The lexical analyzer yylex() produces pairs consisting of
a token and its associated attribute value. If a token such
as DIGIT is to be returned, it must be declared in the first
section of the Yacc specification. The attribute value
associated with a token is communicated to the parser
through the Yacc-defined variable yylval. The routine
reads input characters one at a time using getchar(). If
the character is a digit, the value of the digit is stored in
the variable yylval and the token DIGIT is returned;
otherwise, the character itself is returned as the token.
The power and utility of Yacc should not be
underestimated. The effort saved during compiler
implementation by using Yacc rather than a handwritten
parser can be considerable.
1.5 Program Development Tools
There are some other language support tools a
programmer has access to for reducing the cost of
software development. These additional facilities include
tools for program editing, debugging, analysis and
documentation. Ideally, such tools should be closely
integrated with the language implementation so that, for
example, when the compiler detects syntactic errors in
the source program, the editor can be entered
automatically with the cursor indicating the position of
the error.
The rapid production of syntactically correct programs is
a major goal. The use of syntax-directed editors is
increasing, and they can assist the user in eliminating
syntax errors as the program is being input. Such an
editor only accepts syntactically correct input and
prompts the user, if necessary, to input only those
constructs that are syntactically correct at the current
position. These editors can considerably improve
efficiency by removing the need for repeated runs of the
compiler just to remove syntax errors.
Debugging tools should also be integrated with the
programming language implementation, for finding and
removing bugs in the shortest period of time.
Integrated program development environments are now
considered an important aid for the rapid construction of
correct software. Considerable progress has been made
over the last decade in the provision of powerful software
tools to ease the programmer's burden, and the
application of the basic principles of language and
compiler design will help this development continue.
1.6 Conclusion

This paper discussed several issues related to compilers.
The initial discussion focused on approaches to compiler
development and on the compiler design phases, which
include lexical analysis, parsing, semantic analysis and
code generation, while the latter part examined two
important software tools, Lex and Yacc, as well as
program development tools, all of which greatly simplify
the implementation of a compiler.
