
Compiler Structure: Compilers and Translators

A program can be written in assembly language as well as in a high-level language; the written program is called the source program. The source program must be converted into machine language, and the converted program is called the object program. A translator is required for such a translation.

A program translator translates the source code of a programming language into machine-language instruction code. Generally, computer programs are written in languages like COBOL, C, BASIC and assembly language, which must be translated into machine language before execution.

Translators are as follows.

 Assembler
 Compiler
 Interpreter

Compiler: A compiler is a special program that processes statements written in a particular programming language and turns them into machine language, or "code", that a computer's processor uses. Typically, a programmer writes language statements in a language such as Pascal or C one line at a time using an editor. The file that is created contains what are called the source statements. The programmer then runs the appropriate language compiler, specifying the name of the file that contains the source statements.

In other words, “a compiler is a software program that compiles program source code files into an executable program. It is included as part of the integrated development environment (IDE) with most programming software packages”.
Assembler: A program which translates an assembly language program into a machine language program is called an assembler. An assembler which runs on a computer and produces machine code for that same computer is called a self assembler or resident assembler. An assembler which runs on one computer and produces machine code for another computer is called a cross assembler.

Assemblers are further divided into two types: one-pass assemblers and two-pass assemblers. A one-pass assembler assigns memory addresses to the variables and translates the source code into machine code in a single, simultaneous pass. A two-pass assembler reads the source code twice: in the first pass, it reads all the variables and assigns them memory addresses; in the second pass, it reads the source code and translates it into object code (a sketch of this two-pass idea follows).
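
A minimal sketch of the two-pass idea in C (the one-word instruction format, table sizes, and function names here are invented for illustration, not taken from any real assembler):

#include <string.h>

/* Hypothetical source format: a line is either "name:" (a label
   definition) or an instruction occupying exactly one machine word. */
struct Symbol { char name[32]; int address; };
static struct Symbol symtab[100];
static int nsyms = 0;

/* Pass 1: assign a memory address to every label. */
void pass1(char *lines[], int n) {
    int addr = 0;
    for (int i = 0; i < n; i++) {
        size_t len = strlen(lines[i]);
        if (len > 0 && lines[i][len - 1] == ':') {   /* label definition */
            strncpy(symtab[nsyms].name, lines[i], len - 1);
            symtab[nsyms].name[len - 1] = '\0';
            symtab[nsyms].address = addr;
            nsyms++;
        } else {
            addr++;                                  /* one instruction word */
        }
    }
}

/* Pass 2 would translate each instruction, using this lookup to
   resolve label operands (including forward references). */
int lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].address;
    return -1;                                       /* undefined symbol */
}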

Interpreter: An interpreter is a program which translates the statements of a program into machine code one statement at a time. It reads one statement of the program, translates it and executes it; then it reads the next statement, translates it and executes it, and proceeds in this way until all the statements are translated and executed. A compiler, on the other hand, goes through the entire program and translates the whole program into machine code at once. A compiled program typically runs 5 to 25 times faster than an interpreted one.

With a compiler, the machine code is saved permanently for future use; the machine code produced by an interpreter, on the other hand, is not saved. An interpreter is a small program compared to a compiler and occupies less memory space, so it can be used on a smaller system with limited memory.

Applications of compilers

1. Traditionally, a compiler is thought of as translating a so-called “high level language” such
as C or Modula-2 into assembly language. Since assembly language cannot be directly
executed, a further translation between assembly language and (relocatable) machine
language is necessary. Such programs are usually called assemblers, but it is clear that
an assembler is just a special (easier) case of a compiler.
2. Sometimes, a compiler translates between high level languages. E.g. the first
C++ implementations used a compiler called “cfront” which translated C++ code to C
code. Such a compiler is often called a “source-to-source compiler” (or transpiler).
3. A compiler need not target a real assembly (or machine) language. E.g. Java compilers
generate code for a virtual machine called the “Java Virtual Machine” (JVM). The
JVM interpreter then interprets JVM instructions without any further translation.
4. Compilers (and interpreters) have wider applications than just translating
programming languages. Conceivably any large and complex application might define
its own “command language” which can be translated to a virtual machine associated
with the application.
5. Compilation techniques used in a lexical analyzer can be used in text editors,
information retrieval systems, and pattern recognition programs.
6. Compilation techniques used in a parser can be used in a query processing system such as
SQL.
7. Much software having a complex front-end may need techniques used in compiler design.
8. Most of the techniques used in compiler design can be used in Natural Language
Processing (NLP) systems.

Interpreter Vs Compiler

S.No. | Interpreter | Compiler
------|-------------|---------
1 | Translates the program one statement at a time. | Scans the entire program and translates it as a whole into machine code.
2 | Takes less time to analyze the source code, but the overall execution time is slower. | Takes more time to analyze the source code, but the overall execution time is comparatively faster.
3 | Generates no intermediate object code, and hence is memory efficient. | Generates intermediate object code which further requires linking, and hence requires more memory.
4 | Continues translating the program until the first error is met, in which case it stops. Hence debugging is easy. | Generates error messages only after scanning the whole program. Hence debugging is comparatively hard.
5 | Programming languages like Python and Ruby use interpreters. | Programming languages like C and C++ use compilers.

Language Processing System


Preprocessors: Preprocessors produce input to the compiler. They perform the following functions:
1. Macro processing: a preprocessor allows a user to define macros (short forms for longer constructs).
2. File inclusion: a preprocessor includes header files into the program text.
3. Rational preprocessing: a preprocessor provides the user with built-in macros for constructs like while-statements or if-statements, if none exist in the language itself.

4. Language extension: a preprocessor adds capabilities to the language by what amounts to built-in macros, e.g. embedded SQL, a database query language embedded in C.

Assembler
Programmers found it difficult to write or read programs in machine language, so they began to use a mnemonic (symbol) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler is called the source program; the output is a machine language translation (the object program).

Linker: In high-level languages, some built-in header files or libraries are provided. These libraries are predefined, and they contain basic functions which are essential for executing a program. These functions are linked to the libraries by a program called a linker. If the linker does not find a library for a function, it informs the compiler, and the compiler then generates an error. The compiler automatically invokes the linker as the last step in compiling a program.

Besides built-in libraries, the linker also links user-defined functions to user-defined libraries. Usually a longer program is divided into smaller subprograms called modules, and these modules must be combined before the program can execute. The process of combining the modules is done by the linker.

The two primary tasks of the linker are:

1. Relocating relative addresses.
2. Resolving external references.

Relocating relative addresses:

The assembler processes one file at a time. Thus the symbol table produced while processing file
A is independent of the symbols defined in file B, and conversely. Thus, it is likely that the same
address will be used for different symbols in each program. The technical term is that the
(local) addresses in the symbol table for file A are relative to file A; they must be relocated
by the linker. This is accomplished by adding the starting address of file A (which in turn is the
sum of the lengths of all the files processed previously in this run) to the relative address.
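
A minimal sketch of relocation in C, assuming (purely for illustration) that each object file carries a relocation list naming the code words that hold file-relative addresses:

/* Hypothetical object-file image; the layout is invented. */
struct ObjFile {
    int code[1024];   /* the assembled words                          */
    int length;       /* number of words used in code[]               */
    int reloc[128];   /* indices of words holding relative addresses  */
    int nreloc;       /* number of entries in reloc[]                 */
};

/* Relocate file f, given its starting address: the sum of the
   lengths of all the files processed before it in this run. */
void relocate(struct ObjFile *f, int start_address) {
    for (int i = 0; i < f->nreloc; i++)
        f->code[f->reloc[i]] += start_address;
}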

Resolving external references:

Assume procedure f, in file A, and procedure g, in file B, are compiled (and assembled)
separately. Assume also that f invokes g. Since the compiler and assembler do not see g when
processing f, it appears impossible for procedure f to know where in memory to find g. The
solution is for the compiler to indicate in the output of the file A compilation that the address of g
is needed. This is called a use of g. When processing file B, the compiler outputs the (relative)
address of g. This is called the definition of g. The assembler passes this information to the linker.
The simplest linker technique is to again make two passes. During the first pass, the linker records
in its external symbol table (a table of external symbols, not a symbol table that is stored
externally) all the definitions encountered. During the second pass, every use can be resolved by
access to the table.
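
A sketch of the second pass in C; the Def and Use records are invented stand-ins for whatever the assembler actually emits, and the external symbol table (defs) is assumed to have been filled, and relocated, during the first pass:

#include <string.h>

struct Def { char name[32]; int abs_addr;  };  /* a recorded definition of g */
struct Use { char name[32]; int patch_loc; };  /* a word in code[] to patch  */

/* Pass 2: resolve every use against the external symbol table. */
int resolve(struct Def defs[], int ndefs,
            struct Use uses[], int nuses, int code[]) {
    for (int u = 0; u < nuses; u++) {
        int found = 0;
        for (int d = 0; d < ndefs; d++)
            if (strcmp(uses[u].name, defs[d].name) == 0) {
                code[uses[u].patch_loc] = defs[d].abs_addr;
                found = 1;
                break;
            }
        if (!found)
            return -1;   /* unresolved external reference */
    }
    return 0;
}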
Loader: A loader is a program that loads the machine code of a program into system memory. In computing, a loader is the part of an operating system that is responsible for loading programs; it is one of the essential stages in starting a program, because it places the program into memory and prepares it for execution. Loading a program involves reading the contents of the executable file into memory. Once loading is complete, the operating system starts the program by passing control to the loaded program code. All operating systems that support program loading have loaders, and in many operating systems the loader is permanently resident in memory.

Examples of Compilers:

1. ECJ (Eclipse Compiler for Java)
2. Javac (Sun Microsystems, owned by Oracle)
3. Power J (Sybase, owned by SAP)
4. QuickC (Microsoft)
5. SAS/C (SAS Institute)
6. Borland C++ (Borland/CodeGear)
7. Turbo C++ for DOS (Borland/CodeGear), etc.

Phases of a Compiler

A compiler operates in phases, each of which transforms the source program from one
representation to another. There are six phases of a compiler:

1. Lexical analysis (“scanning”): reads in the program and groups characters into “tokens”.
2. Syntax analysis (“parsing”): structures the token sequence according to the grammar rules of the language.
3. Semantic analysis: checks the semantic constraints of the language.
4. Intermediate code generation: translates the program into a “lower level” representation.
5. Program analysis and code optimization: improves code quality.
6. Target code generation.
Symbol table and error handling interact with the six phases. Some of the phases may be
grouped together.

Lexical Analysis Phase: The lexical phase reads the characters in the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier, a keyword, or a punctuation character. The character sequence forming a token is called the lexeme for the token.

Syntax Analysis Phase: Syntax analysis imposes a hierarchical structure on the token stream. This hierarchical structure is called a syntax tree. In a syntax tree, an interior node is a record with a field for the operator and two fields containing pointers to the records for the left and right children. A leaf is a record with two or more fields, one to identify the token at the leaf, and the others to record information about the token.

Semantic Analysis Phase: This phase checks the source program for semantic errors and gathers
type information for the subsequent code-generation phase. It uses the hierarchical structure
determined by the syntax-analysis phase to identify the operators and operands of expressions
and statements. An important component of semantic analysis is type checking.

Intermediate Code Generation: After syntax and semantic analysis, the compiler generates an explicit intermediate representation of the source program. The intermediate representation should have two important properties: it should be easy to produce, and easy to translate into the target program. Intermediate representations can take a variety of forms. One such form is three-address code, which is like the assembly language for a machine in which every memory location can act like a register. Three-address code consists of a sequence of instructions, each of which has at most three operands.
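
For example (the statement itself is illustrative, not from the text above), the assignment position = initial + rate * 60 might be translated into three-address code such as:

t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3

Each instruction has at most three operands, and the compiler generates a temporary name (t1, t2, t3) for each interior node of the syntax tree.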

Code Optimization: The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result.
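
Continuing the illustration above, the optimizer might perform the int-to-float conversion once at compile time and eliminate the single-use temporary t3, leaving:

t1 = rate * 60.0
position = initial + t1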

Code Generation: The final phase of the compiler is the generation of target code, consisting normally of relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program; then each intermediate instruction is translated into a sequence of machine instructions that perform the same task.

Symbol Table Management: A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. It records the identifiers used in the source program and collects information about each identifier, such as:

• its type (determined during semantic analysis and intermediate code generation)
• its scope (determined during semantic analysis and intermediate code generation)
• its storage allocation (determined during code generation)
• for procedures, the number and types of the arguments and the type returned.
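
A minimal sketch of such a record in C (the field names and types are illustrative, not a fixed format):

/* One record per identifier; each phase fills in the fields it is
   responsible for, as noted in the list above. */
struct SymtabEntry {
    char *name;         /* the lexeme recorded by the lexical analyzer    */
    int   type;         /* set by semantic analysis / intermediate code   */
    int   scope;        /* set by semantic analysis / intermediate code   */
    int   offset;       /* storage location, set by code generation       */
    int   nargs;        /* for procedures: number of arguments            */
    int  *arg_types;    /* for procedures: the type of each argument      */
    int   return_type;  /* for procedures: the type returned              */
};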

Error Detecting and Reporting: Each phase can encounter errors. The lexical phase detects input that does not form a token; the syntax phase detects token sequences that violate the syntax rules; the semantic phase detects constructs that have no meaning for their operands.
Analysis and Synthesis phases of Compiler

A compiler can broadly be divided into two phases based on the way it compiles. (A pass refers to one traversal of the compiler through the entire program.) The analysis phase consists of the lexical, syntax and semantic phases, while the synthesis phase consists of intermediate code generation, code optimization and target code generation.
The Analysis consists of three steps.

 Lexical Analysis: the stream of characters making up the source program is read from left to right and converted into a stream of words (tokens), where a word is a sequence of characters with a collective meaning.

 Syntactic Analysis: words are grouped into nested collections (grammatical phrases) with a collective meaning, represented by a PARSE TREE.

 Semantic Analysis: certain checks are performed to ensure that the components of a program fit together meaningfully (type checking, type conversion), and errors are reported.

The Synthesis consists of two steps.

 Code generator: the code generator produces the object code by deciding on the memory locations for data, selecting code to access each datum, and selecting the registers in which each computation is to be done. Many computers have only a few high-speed registers in which computations can be performed quickly, so a good code generator attempts to utilize registers as efficiently as possible.

 Code Optimization: this optional phase improves the intermediate code so that the output runs faster and takes less space. Its output is another intermediate code program that does the same job as the original, but in a way that saves time and/or space.

Compiler structure
 Front End (Language specific) and Back End (Machine specific) parts of compilation

Passes of Compiler
1. Phase and Pass are two terms used in the area of compilers.
2. Pass is a reading of a file followed by processing of data from file.
3. Phase is a logical part of the compilation process.
4. A pass is a single time the compiler passes over (goes through) the source code or some
other representation of it.
5. Typically, most compilers have at least two phases called front end and backend, while
they could be either one-pass or multi-pass.
Note: "phase" classifies compilers according to their construction, while "pass" classifies
compilers according to how they operate.
One pass Compiler: A single pass compiler makes a single pass through the source text, parsing, analyzing, and generating code only once. In other words, it passes through each compilation unit only once, immediately translating each code section into its final machine code.

The main stages of a single pass compiler are lexical analysis, syntactical analysis and code generation. First, lexical analysis scans the source code and divides it into tokens. Every programming language has a grammar, which represents the syntax and the legal statements of the language. Syntactical analysis then determines the language constructs described by the grammar. Finally, the code generator generates the target code. Overall, a single pass compiler does not optimize the code, and there is no intermediate code generation.

Fig: One pass compiler

Fig: Dependency diagram of a typical one pass compiler
Multi-pass Compiler: A multi-pass compiler converts the program into one or more intermediate representations between source code and machine code, and reprocesses the entire compilation unit in each sequential pass. It processes the source code or abstract syntax tree of a program several times; for this reason, multi-pass compilers are sometimes called wide compilers. Each pass takes the result of the previous pass as its input and creates an intermediate output, so the code improves pass by pass until the final pass emits the final code. A multipass compiler also performs additional tasks such as intermediate code generation, machine-dependent code optimization and machine-independent code optimization.

Difference between Single Pass and Multipass Compiler

S.No. | Single Pass Compiler | Multipass Compiler
------|----------------------|-------------------
1 | A type of compiler that passes through each part of each compilation unit only once, immediately translating each code section into its final machine code. | A type of compiler that processes the source code or abstract syntax tree of a program several times.
2 | Faster than a multipass compiler. | Slower than a single pass compiler, because each pass reads and writes an intermediate file.
3 | Also called a narrow compiler. | Also called a wide compiler.
4 | Has a limited scope of available information. | Has a greater scope of available information.
5 | Performs no code optimization. | Performs code optimization.
6 | Performs no intermediate code generation. | Performs intermediate code generation.
7 | Takes less time to compile. | Takes more time to compile.
8 | Has a higher memory requirement. | Has a lower memory requirement.
9 | Languages such as Pascal can be implemented using a single pass compiler. | Languages such as Java can be implemented using a multipass compiler.

Types of Compilers:

1. Native code compiler: a compiler used to compile source code for the same type of platform only. The output generated by this type of compiler can only be run on the same type of computer system and OS that the compiler itself runs on.

2. Cross compiler: a compiler used to compile source code for a different kind of platform. It is used in making software for embedded systems that can be used on multiple platforms.

3. Source to source compiler: a compiler that takes high-level language code as input and outputs source code of another high-level language. Unlike other compilers, which convert a high-level language into low-level machine language, it can take code written in Pascal, say, and transform it into C: a conversion of one high-level language into another high-level language at the same level of abstraction. Thus, it is also known as a transpiler.

4. One pass compiler: a compiler that performs the whole compilation in only one pass.

5. Threaded code compiler: a compiler which simply replaces a string by the appropriate binary code.

6. Incremental compiler: a compiler which compiles only the modified/changed lines of the source code and updates the object code.

7. Source compiler: a compiler which converts high-level language source code into assembly language only.

Lexical analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code produced by language preprocessors, written in the form of sentences, and breaks it into a series of tokens, removing any whitespace and comments in the source code.

If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer: it reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands.

Tokens

Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns
are defined by means of regular expressions.

In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols can be considered tokens.

For example, in the C language, the variable declaration line

int value = 100;

contains the following tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:

Alphabets
Any finite set of symbols is an alphabet: {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the set of English-language alphabetic characters.

Strings
Any finite sequence of alphabet symbols is called a string. The length of a string is the total number of occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string having no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).

Special Symbols
A typical high-level language contains the following symbols:

Arithmetic Symbols: Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)

Punctuation: Comma (,), Semicolon (;), Dot (.), Arrow (->)

Assignment: =

Special Assignment: +=, /=, *=, -=

Comparison: ==, !=, <, <=, >, >=

Preprocessor: #

Location Specifier: &

Logical: &, &&, |, ||, !

Shift Operators: >>, >>>, <<

Language
A language is considered as a set of strings over some finite alphabet. Computer languages are considered as sets, and mathematically set operations can be performed on them. Finite languages can be described by means of regular expressions.

Bootstrapping
In computer science, bootstrapping is the technique for producing a self-compiling compiler — that is, a compiler (or assembler) written in the source programming language that it intends to compile. An initial core version of the compiler (the bootstrap compiler) is written in a different language (which could be assembly language); successive, expanded versions of the compiler are then developed using this minimal subset of the language.

Many compilers for many programming languages are bootstrapped, including compilers for
BASIC, ALGOL, C, D, Pascal, PL/I, Factor, Haskell, Modula-2, Oberon, OCaml, Common Lisp,
Scheme, Go, Java, Rust, Python, Scala, Nim, Eiffel, and more.

It has two steps:

1. Bootstrapping is used to produce a self-hosting compiler: a compiler that can compile its own source code.
2. The bootstrap compiler is used to compile the compiler; the compiled compiler can then compile everything else, as well as future versions of itself.

A compiler can be characterized by three languages:

1. Source Language
2. Target Language
3. Implementation Language

T-Diagram

Tombstone diagrams (or T-diagrams) consist of a set of “puzzle pieces” representing compilers and
other related language processing programs. They are used to illustrate and reason about
transformations from a source language (left of T) to a target language (right of T) realized in an
implementation language (bottom of T). They are most commonly found describing complicated processes for bootstrapping, porting, and self-compiling of compilers, interpreters, and macro-processors.

T-diagrams were first introduced for describing bootstrapping and cross-compiling compilers by
McKeeman et al. in 1971.
A T-diagram SCIT denotes a compiler for source language S, producing target language T, and implemented in language I (S at the left of the T, T at the right, and I at the bottom).

Steps to produce a compiler for a new language L for machine A:

1. Create a compiler SCAA for a subset S of the desired language L (S ⊂ L), written in language A and running on machine A.

2. Create a compiler LCSA for the full language L, written in the subset S of L and producing code for machine A.

3. Compile LCSA using the compiler SCAA to obtain LCAA: a compiler for language L which runs on machine A and produces code for machine A.

LCSA + SCAA → LCAA

The process described by these T-diagrams is called bootstrapping.

Cross Compiler
A cross compiler is a compiler that runs on one computer but produces machine code for a different type of computer. Cross compilers are used to generate software that can run on computers with a new architecture or on special-purpose devices that cannot host their own compilers. For example, a compiler that runs on a Windows 7 PC but generates code that runs on an Android smartphone is a cross compiler.

A cross compiler is written in a language different from the target language. E.g. SCNM is a compiler for language S, written in the language of machine N, that generates output code which runs on machine M.

Steps to produce a compiler for a different machine B (i.e., a cross compiler):

1. Convert LCSA into LCLB (by hand, if necessary). Recall that language S is a subset of language L.

2. Compile LCLB with LCAA (the available compiler on machine A) to produce LCAB, a cross compiler for L which runs on machine A and produces code for machine B.

3. Compile LCLB with the cross compiler LCAB to produce LCBB, a compiler for language L which runs on machine B.

Input Buffering

• To ensure that the right lexeme is found, one or more characters may have to be examined beyond the next lexeme.
• Hence a two-buffer scheme is introduced to handle large lookaheads safely.
• Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.
• There are three general approaches for the implementation of a lexical analyzer:
1. By using a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. By writing the lexical analyzer in a conventional systems-programming language, using I/O
facilities of that language to read the input.
3. By writing the lexical analyzer in assembly language and explicitly managing the reading of
input.

Buffer Pairs

Because of the large amount of time consumed in moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

The figure below shows the buffer pairs which are used to hold the input data.

Fig: Input buffer pair

Scheme

• Consists of two buffers, each of N-character size, which are reloaded alternately.
• N is the number of characters in one disk block, e.g., 4096.
• N characters are read from the input file into a buffer half using one system read command.
• eof is inserted at the end if the number of characters read is less than N.

Pointers

• Two pointers, lexemeBegin and forward, are maintained.

lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
forward scans ahead until a match for a pattern is found.

• Once a lexeme is found, forward is set to the character at its right end, and lexemeBegin is then set to the character immediately after the lexeme just found.
• The current lexeme is the set of characters between the two pointers.
Disadvantages of this scheme

• This scheme works well most of the time, but the amount of lookahead is limited.
• This limited lookahead may make it impossible to recognize tokens in situations where the distance that the forward pointer must travel is more than the length of the buffer.

e.g. DECLARE (ARG1, ARG2, . . . , ARGn) in a PL/I program:

• It cannot be determined whether DECLARE is a keyword or an array name until the character that follows the right parenthesis is seen.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1;
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half;
end
else
    forward := forward + 1;

Fig: Code to advance forward pointer

Sentinels

• In the previous scheme, each time the forward pointer is moved, a check is done to ensure that it has not moved off one half of the buffer; if it has, the other half must be reloaded.
• Therefore the ends of the buffer halves require two tests for each advance of the forward pointer.

Test 1: for the end of the buffer.
Test 2: to determine what character is read.

• The use of a sentinel reduces the two tests to one by extending each buffer half to hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the source program (the eof character is used as the sentinel).
Fig: Sentinels at end of each buffer pair

forward := forward + 1;
if forward = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1;
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half;
    end
    else /* eof within a buffer signifies end of input */
        terminate lexical analysis;
end

Fig: Lookahead code with sentinels

Advantages

• Most of the time, it performs only one test: whether forward points to an eof.
• Only when it reaches the end of a buffer half, or the real end of input, does it perform more tests.
• Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.
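
The same logic can be sketched in C (the buffer layout and reload routines are assumed for illustration, and eof is represented by a reserved sentinel byte):

#include <stdio.h>

#define N 4096                        /* characters per buffer half */
static char buf[2 * N + 2];           /* two halves, each followed by a sentinel */
static char *forward = buf;

extern void reload_first_half(void);  /* assumed I/O routines */
extern void reload_second_half(void);
extern void terminate_lexical_analysis(void);

/* Advance forward by one character; normally a single sentinel test. */
char advance(void) {
    forward++;
    if (*forward == (char)EOF) {                  /* the one common test    */
        if (forward == buf + N) {                 /* sentinel ending half 1 */
            reload_second_half();
            forward++;                            /* step into half 2       */
        } else if (forward == buf + 2 * N + 1) {  /* sentinel ending half 2 */
            reload_first_half();
            forward = buf;                        /* wrap to start of half 1 */
        } else {
            terminate_lexical_analysis();         /* real eof inside a half */
        }
    }
    return *forward;
}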

Introduction to Lex & Yacc

A compiler or an interpreter for a programming language is often decomposed into two parts:

1. Read the source program and discover its structure.
2. Process this structure, e.g. to generate the target program.

Lex and Yacc can generate program fragments (handling tokens and hierarchical structure, respectively) that solve the first task. The task of discovering the source structure is itself decomposed into subtasks:
1. Split the source file into tokens (Lex).
2. Find the hierarchical structure of the program (Yacc).

LEX: A Scanner Generator

Lex is officially known as a "Lexical Analyser". Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited to editor-script type transformations and to segmenting input in preparation for a parsing routine. The main job of Lex is to break up an input stream into more usable elements or, in other words, to identify the "interesting bits" in a text file.

Lex source is a table of regular expressions and corresponding program fragments. The table is
translated to a program which reads an input stream, copying it to an output stream and
partitioning the input into strings which match the given expressions. As each such string is
recognized the corresponding program fragment is executed. The recognition of the expressions is
performed by a deterministic finite automaton generated by Lex. The program fragments written
by the user are executed in the order in which the corresponding regular expressions occur in the
input stream.

For example, if we are writing a compiler for the C programming language:

1. The symbols { } ( ) ; all have significance on their own.
2. The letter ‘a’ usually appears as part of a keyword or variable name, and is not interesting on its own; instead, we are interested in the whole word.
3. Spaces and newlines are completely uninteresting, and we want to ignore them completely, unless they appear within quotes "like this".

All of these things are handled by the lexical analyzer.

LEX specifications:

A LEX program (the .l file) consists of three parts:

{declarations}
%%
{translation rules }
%%
{auxiliary procedures}
where the definitions and the user subroutines are often omitted. The second %% is optional, but
the first is required to mark the beginning of the rules. The absolute minimum LEX program is thus

%%

(no definitions, no rules) which translates into a program which copies the input to the output
unchanged.

1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier declared to represent a constant, e.g. #define PI 3.14), and regular definitions.
2. The translation rules of a LEX program are statements of the form :

p1 {action 1}

p2 {action 2}

p3 {action 3}

……

……

where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex the actions are written in C.

3. The third section holds whatever auxiliary procedures are needed by the actions.
Alternatively these procedures can be compiled separately and loaded with the lexical
analyzer.
LEX is one such lexical-analyzer generator: it produces C code based on the token specifications. The LEX tool has been widely used to specify lexical analyzers for a variety of languages. We refer to the tool as the Lex compiler, and to its input specification as the Lex language. Lex is generally used in the following manner:

1. First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language.
2. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c. The program lex.yy.c consists of a tabular representation of a transition diagram constructed from the regular expressions of lex.l, together with a standard routine that uses the table to recognize lexemes. The actions associated with the regular expressions in lex.l are pieces of C code and are carried over directly to lex.yy.c.
3. Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms the input stream into a sequence of tokens.

Working process of LEX:

Input to the generator:
1. A list of regular expressions in priority order (regular expressions describe the languages that can be recognized by finite automata).
2. An associated action for each regular expression (generating the kind of token and other bookkeeping information).

The generator then:
1. Translates each token's regular expression into a nondeterministic finite automaton (NFA).
2. Converts the NFA into an equivalent DFA.
3. Minimizes the DFA to reduce the number of states.
4. Emits code driven by the DFA tables.

Output of the generator: a lexical analyzer that reports lexical errors (unexpected characters), if any.
We assume that we have a specification of lexical analyzers in the form of regular expressions and corresponding action parameters. An action parameter is the program segment to be executed whenever a lexeme matched by a regular expression is found in the input. So, the input to the generator is a list of regular expressions in priority order together with an associated action for each; these actions generate the kind of token and other bookkeeping information. Our problem is to construct a recognizer that looks for lexemes in the input buffer. If more than one pattern matches, the recognizer chooses the longest lexeme matched. If two or more patterns match the longest lexeme, the first listed matching pattern is chosen. The output of the generator is a program that reads the input character stream and breaks it into tokens. It also reports lexical errors, i.e. when unexpected characters occur or an input string doesn't match any of the regular expressions.

Example:

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}

%%

int installID() {
    /* function to install the lexeme, whose first character is pointed to
       by yytext and whose length is yyleng, into the symbol table, and
       return a pointer thereto */
}

int installNum() {
    /* similar to installID, but puts numerical constants into a
       separate table */
}

Figure: Lex program for the tokens

YACC (Yet Another Compiler Compiler)

YACC stands for "Yet Another Compiler Compiler" and is officially known as a parser generator. The main job of YACC is to analyse the structure of the input stream and operate on the "big picture". YACC builds a bottom-up (shift-reduce) syntax analyser from a context-free grammar.

Structure of YACC

{ declarations}
%%
{Translation rules}
%%
{programs}

YACC generates a C file named y.tab.c, which is then passed to the C compiler.
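
As a minimal illustrative sketch (a hand-written one-digit calculator, not taken from the text above), a YACC source file has the same three-part shape:

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s);
%}
%token DIGIT
%%
line : expr '\n'         { printf("%d\n", $1); }
     ;
expr : expr '+' term     { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor   { $$ = $1 * $3; }
     | factor
     ;
factor : '(' expr ')'    { $$ = $2; }
       | DIGIT
       ;
%%
/* A trivial hand-written lexer: single digits become DIGIT tokens,
   every other character is returned as-is. */
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;
}
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }

Running this file through YACC produces y.tab.c, which is then compiled with a C compiler.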

CFG (Context-Free Grammars)

A CFG is a 4-tuple G = (V, T, P, S) where

1. V is the (finite) set of variables (or non-terminals). Each variable represents a language, i.e., a set of strings.
2. T is a finite set of terminals, i.e., the symbols that form the strings of the language being defined. (T is disjoint from V.)
3. P is a set of production rules that represent the recursive definition of the language.
4. S ∈ V is the start symbol that represents the language being defined.

Each production rule consists of:

1. A variable that is being (partially) defined by the production. This variable is often
called the head of the production.
2. The production symbol →
3. A string of zero or more terminals and variables.

Example of CFG: Given a grammar G = ({S}, {a, b}, P, S). The set of productions P is

S →aSb
S →SS
S→ ε
This grammar generates strings such as abab, aaabbb, and aababb. If we assume that a is left
parenthesis ‘(’ and b is right parenthesis ‘)’, then L(G) is the language of all strings of properly
nested parentheses.
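
For example, abab has the leftmost derivation

S ⇒ SS ⇒ aSbS ⇒ abS ⇒ abaSb ⇒ abab

which, reading a as ‘(’ and b as ‘)’, is the properly nested string ()().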

Derivation Trees
A ‘derivation tree’ is an ordered tree in which the nodes are labeled with the left sides of productions and in which the children of a node represent the corresponding right sides.

Definition of a Derivation Tree

Let G = (V, T, S, P) be a CFG. An ordered tree is a derivation tree for G iff it has the following properties:

(i) The root of the derivation tree is labeled S.
(ii) Every leaf in the tree has a label from T ∪ {λ}.
(iii) Every interior vertex (a vertex which is not a leaf) has a label from V.
(iv) If a vertex has label A ∈ V, and its children are labeled (from left to right) a1, a2, ..., an, then P must contain a production of the form A → a1 a2 ... an.
(v) A leaf labeled λ has no siblings; that is, a vertex with a child labeled λ can have no other children.
Sentential Form: For a given CFG with productions S → aA, A → aB, B → bB, B → a, the derivation tree is as shown in the omitted figure. The resultant of the derivation tree is the word w = aaba; this is said to be in “sentential form”.

Left Most Derivation and Right Most Derivation

Consider the grammar G with productions

1. S → aSS
2. S → b

For the string w = aababbb we have:

Leftmost derivation, following the rule sequence “1121222”:
S ⇒ aSS ⇒ aaSSS ⇒ aabSS ⇒ aabaSSS ⇒ aababSS ⇒ aababbS ⇒ aababbb

Rightmost derivation, following the rule sequence “1211222”:
S ⇒ aSS ⇒ aSb ⇒ aaSSb ⇒ aaSaSSb ⇒ aaSaSbb ⇒ aaSabbb ⇒ aababbb

Both derive the string “aababbb”.


Example: A context-free grammar G has the productions shown in the omitted figure, from which the word w = acbabc is derived. Obtain the derivation tree.

Ambiguity of a Grammar
The grammar given in the omitted figure generates strings having an equal number of a’s and b’s. The string “abab” can be generated from this grammar in two distinct ways, as shown by two distinct derivation trees; similarly, “abab” has two distinct leftmost derivations and two distinct rightmost derivations (figures omitted).

Each of the above derivation trees can be turned into a unique rightmost derivation, or into a
unique leftmost derivation. Each leftmost or rightmost derivation can be turned into a unique
derivation tree. These representations are largely interchangeable.

Ambiguous Grammars and Ambiguous Languages:

Since derivation trees, leftmost derivations, and rightmost derivations are equivalent notations, the following definitions are equivalent:

Definition: Let G = (N, T, P, S) be a CFG. A string w ∈ L(G) is said to be “ambiguously derivable” if there are two or more different derivation trees for that string in G.

Definition: A CFG given by G = (N, T, P, S) is said to be “ambiguous” if there exists at least one string in L(G) which is ambiguously derivable; otherwise it is unambiguous.

Ambiguity is a property of a grammar, and it is usually, but not always, possible to find an equivalent unambiguous grammar. An “inherently ambiguous language” is a language for which no unambiguous grammar exists.

Example: Show that the grammar S → SbS, S → a is ambiguous.

Solution: In order to show that G is ambiguous, we need to find a w ∈ L(G) which is ambiguously derivable. Take w = abababa. The two derivation trees for w = abababa, shown in Figs. (a) and (b) (omitted), are distinct; therefore, the grammar G is ambiguous.
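
In place of the omitted figures, two distinct leftmost derivations of w = abababa (each corresponding to a different derivation tree) are:

S ⇒ SbS ⇒ abS ⇒ abSbS ⇒ ababS ⇒ ababSbS ⇒ abababS ⇒ abababa

S ⇒ SbS ⇒ SbSbS ⇒ SbSbSbS ⇒ abSbSbS ⇒ ababSbS ⇒ abababS ⇒ abababa

The first groups w as a b (a b (a b a)); the second as ((a b a) b a) b a.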

Potrebbero piacerti anche