

February 2011
Master of Computer Application (MCA) – Semester 3
MC0073 – System Programming– 4 Credits
(Book ID: B0811) Assignment Set – 1

1. Describe the following with respect to Language Specification:

A) Programming Language Grammars
Ans: The lexical and syntactic features of a programming language are specified by its grammar.
This section discusses key concepts and notions from formal language grammars. A
language L can be considered to be a collection of valid sentences. Each sentence can be
looked upon as a sequence of words and each word as a sequence of letters or graphic
symbols acceptable in L. A language specified in this manner is known as a formal language.
A formal language grammar is a set of rules which precisely specify the sentences of L. It is
clear that natural languages are not formal languages due to their rich vocabulary. However,
PLs are formal languages.

Terminal symbols, alphabet and strings

The alphabet of L, denoted by the Greek symbol ∑ , is the collection of symbols in its
character set. We will use lower case letters a, b, c, etc. to denote symbols in ∑ . A symbol in
the alphabet is known as a terminal symbol (T) of L. The alphabet can be represented using
the mathematical notation of a set, e.g. ∑ = {a, b, …, z, 0, 1, …, 9}

Here the symbols {, ‘,’ and } are part of the notation. We call them metasymbols to differentiate
them from terminal symbols. Throughout this discussion we assume that metasymbols are
distinct from the terminal symbols. If this is not the case, i.e. if a terminal symbol and a
metasymbol are identical, we enclose the terminal symbol in quotes to differentiate it from the
metasymbol. For example, in a set defining the punctuation symbols of English, ‘,’
denotes the terminal symbol ‘comma’.

A string is a finite sequence of symbols. We will represent strings by Greek symbols α, β, γ,
etc. Thus α = axy is a string over ∑. The length of a string is the number of symbols in it. Note
that the absence of any symbol is also a string, the null string ε. The concatenation operation
combines two strings into a single string. It is used to build larger strings from existing strings.
Thus, given two strings α and β, concatenation of α with β yields a string which is formed by
putting the sequence of symbols forming α before the sequence of symbols forming β. For
example, if α = ab, β = axy, then concatenation of α and β, represented as α.β or simply αβ,
gives the string abaxy. The null string can also participate in a concatenation, thus α.ε = ε.α = α.


Nonterminal symbols

A nonterminal symbol (NT) is the name of a syntax category of a language, e.g. noun, verb,
etc. An NT is written as a single capital letter, or as a name enclosed between <…>, e.g. A or
< Noun >. During grammatical analysis, a nonterminal symbol represents an instance of the
category. Thus, < Noun > represents a noun.


A production, also called a rewriting rule, is a rule of the grammar. A production has the form

A nonterminal symbol ::= String of Ts and NTs

and defines the fact that the NT on the LHS of the production can be rewritten as the string of
Ts and NTs appearing on the RHS. When an NT can be written as one of many different strings,
the symbol ‘|’ (standing for ‘or’) is used to separate the strings on the RHS, e.g.

< Article > ::= a | an | the

The string on the RHS of a production can be a concatenation of component strings, e.g. the
production < Noun Phrase > ::= < Article >< Noun >

expresses the fact that the noun phrase consists of an article followed by a noun.

Each grammar G defines a language Lg. G contains an NT called the distinguished symbol or
the start NT of G. Unless otherwise specified, we use the symbol S as the distinguished
symbol of G. A valid string α of Lg is obtained by using the following procedure:

1. Let α= ‘S’.

2. While α is not a string of terminal symbols

(a) Select an NT appearing in α, say X.

(b) Replace X by a string appearing on the RHS of a production of X.


Grammar (1.1) defines a language consisting of noun phrases in English

< Noun Phrase > :: = < Article > < Noun >

< Article > ::= a | an | the

<Noun> ::= boy | apple


< Noun Phrase > is the distinguished symbol of the grammar; the boy and an apple are some
valid strings in the language.
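
The derivation procedure can be tried out on Grammar (1.1). The following is only a minimal sketch: the dictionary layout, the function name sentences and the angle-bracket spelling of the NTs are choices made for this illustration, and the approach works only because the grammar is finite (it has no recursive productions).

```python
from itertools import product

# Grammar (1.1): each NT maps to its list of RHS alternatives.
GRAMMAR = {
    "<Noun Phrase>": [["<Article>", "<Noun>"]],
    "<Article>": [["a"], ["an"], ["the"]],
    "<Noun>": [["boy"], ["apple"]],
}

def sentences(symbol):
    """Return every terminal string derivable from symbol (finite grammars only)."""
    if symbol not in GRAMMAR:
        # A terminal symbol derives only itself.
        return [symbol]
    results = []
    for alternative in GRAMMAR[symbol]:
        # Expand every symbol of the RHS, then combine the expansions in order.
        expansions = [sentences(s) for s in alternative]
        for combination in product(*expansions):
            results.append(" ".join(combination))
    return results

all_phrases = sentences("<Noun Phrase>")
```

The grammar is purely syntactic, so the generated list also contains strings such as a apple that are valid in the language even though they are not good English.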

Definition (Grammar)

A grammar G of a language Lg is a quadruple (∑, SNT, S, P) where

∑ is the alphabet of Lg, i.e. the set of Ts,

SNT is the set of NTs,

S is the distinguished symbol, and

P is the set of productions.

Derivation, reduction and parse trees

A grammar G is used for two purposes: to generate valid strings of Lg and to ‘recognize’ valid
strings of Lg. The derivation operation helps to generate valid strings while the reduction
operation helps to recognize valid strings. A parse tree is used to depict the syntactic
structure of a valid string as it emerges during a sequence of derivations or reductions.


Let production p1 of grammar G be of the form

p1: A ::= α

and let β be a string such that β = γAθ, then replacement of A by α in string β constitutes a
derivation according to production p1. We use the notation N => η to denote direct derivation of
η from N and N =>* η to denote transitive derivation of η (i.e. derivation in zero or more steps)
from N, respectively. Thus, A => α only if A ::= α is a production of G and A =>* δ if A => … => δ.
We can use this notation to define a valid string according to a grammar G as follows: δ is a
valid string according to G only if S =>* δ, where S is the distinguished symbol of G.

Example: Derivation of the string the boy according to grammar (1.1) can be depicted as

< Noun Phrase > => < Article > < Noun >

=> the < Noun >

=> the boy


A string α such that S =>* α is a sentential form of Lg. The string α is a sentence of Lg if it
consists of only Ts.

Example: Consider the grammar G

< Sentence >::= < Noun Phrase > < Verb Phrase >

< Noun Phrase >::= < Article >< Noun >

< Verb Phrase >::= < Verb > < Noun Phrase >

< Article >::= a | an | the

< Noun >::= boy | apple

< Verb >::= ate

The following strings are sentential forms of Lg

< Noun Phrase > < Verb Phrase >

the boy < Verb Phrase >

< Noun Phrase > ate < Noun Phrase >

the boy ate < Noun Phrase >

the boy ate an apple

However, only the boy ate an apple is a sentence.

Reduction: To determine the validity of the string

the boy ate an apple

according to the grammar we perform the following reductions:

Step String

the boy ate an apple

1. < Article > boy ate an apple

2. < Article > < Noun > ate an apple

3. < Article > < Noun > < Verb > an apple


4. < Article > < Noun > < Verb > < Article > apple

5. < Article > < Noun > < Verb > < Article > < Noun >

6. < Noun Phrase > < Verb > < Article > < Noun >

7. < Noun Phrase > < Verb > < Noun Phrase >

8. < Noun Phrase > < Verb Phrase >

9. < Sentence >

The string is a sentence of Lg since we are able to construct the reduction sequence the boy
ate an apple -> < Sentence >.
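
The reduction sequence can be reproduced mechanically. The sketch below is an illustration only: it greedily applies the first production whose RHS occurs in the current form, which happens to succeed for this small grammar but is not a general parsing algorithm. The names PRODUCTIONS, reduce_once and reduction_sequence are invented for the example, and the order in which reductions are applied differs from the left-to-right order of the table above.

```python
# Productions of grammar G as (LHS, RHS) pairs, most basic categories first.
PRODUCTIONS = [
    ("<Article>", ["a"]), ("<Article>", ["an"]), ("<Article>", ["the"]),
    ("<Noun>", ["boy"]), ("<Noun>", ["apple"]),
    ("<Verb>", ["ate"]),
    ("<Noun Phrase>", ["<Article>", "<Noun>"]),
    ("<Verb Phrase>", ["<Verb>", "<Noun Phrase>"]),
    ("<Sentence>", ["<Noun Phrase>", "<Verb Phrase>"]),
]

def reduce_once(form):
    """Apply the first applicable reduction; return None if none applies."""
    for lhs, rhs in PRODUCTIONS:
        n = len(rhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == rhs:
                # Replace the matched RHS by the NT on the LHS.
                return form[:i] + [lhs] + form[i + n:]
    return None

def reduction_sequence(text):
    """Return the successive forms produced while reducing text."""
    form = text.split()
    steps = [form]
    while True:
        reduced = reduce_once(form)
        if reduced is None:
            return steps
        form = reduced
        steps.append(form)

steps = reduction_sequence("the boy ate an apple")
is_sentence = steps[-1] == ["<Sentence>"]
```

A string is accepted when the last form is the single symbol <Sentence>; a string such as boy the ate gets stuck before reaching it.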

Parse trees

A sequence of derivations or reductions reveals the syntactic structure of a string with respect
to G. We depict the syntactic structure in the form of a parse tree. Derivation according to the
production A :: = α gives rise to the following elemental parse tree.
B) Classification of Grammars
Ans: Grammars are classified on the basis of the nature of productions used in them (Chomsky,
1963). Each grammar class has its own characteristics and limitations.

Type – 0 Grammars

These grammars, known as phrase structure grammars, contain productions of the form

α ::= β

where both α and β can be strings of Ts and NTs. Such productions permit arbitrary
substitution of strings during derivation or reduction, hence they are not relevant to
specification of programming languages.

Type – 1 grammars

These grammars are known as context sensitive grammars because their productions specify
that derivation or reduction of strings can take place only in specific contexts. A Type-1
production has the form

αAβ ::= απβ

Thus, a string π in a sentential form can be replaced by ‘A’ (or vice versa) only
when it is enclosed by the strings α and β. These grammars are also not particularly relevant
for PL specification since recognition of PL constructs is not context sensitive in nature.


Type – 2 grammars

These grammars impose no context requirements on derivations or reductions. A typical
Type-2 production is of the form

A ::= π

which can be applied independent of its context. These grammars are therefore
known as context free grammars (CFG). CFGs are ideally suited for programming language
specification.

Type – 3 grammars

Type-3 grammars are characterized by productions of the form

A::= tB | t or A ::= Bt | t

Note that these productions also satisfy the requirements of Type-2 grammars. The specific
form of the RHS alternatives—namely a single T or a string containing a single T and a single
NT—gives some practical advantages in scanning.

Type-3 grammars are also known as linear grammars or regular grammars. These are further
categorized into left-linear and right-linear grammars depending on whether the NT in the
RHS alternative appears at the extreme left or extreme right.

Operator grammars

Definition (Operator grammar (OG)) An operator grammar is a grammar none of whose
productions contain two or more consecutive NTs in any RHS alternative.

Thus, nonterminals occurring in an RHS string are separated by one or more terminal
symbols. All terminal symbols occurring in the RHS strings are called operators of the
grammar.
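
The operator grammar condition is easy to check mechanically. The sketch below assumes, purely for illustration, that NTs are written between angle brackets and that a grammar is given as a dictionary of RHS alternatives; the arithmetic expression grammar used as test data is a hypothetical example and does not come from the text above.

```python
def is_nonterminal(symbol):
    # Assumption of this sketch: NTs are written between angle brackets.
    return symbol.startswith("<") and symbol.endswith(">")

def is_operator_grammar(productions):
    """productions maps each NT to a list of RHS alternatives (lists of symbols)."""
    for alternatives in productions.values():
        for rhs in alternatives:
            # Compare every symbol with its right-hand neighbour.
            for left, right in zip(rhs, rhs[1:]):
                if is_nonterminal(left) and is_nonterminal(right):
                    return False
    return True

# Hypothetical expression grammar: NTs are always separated by an operator.
expr_grammar = {
    "<E>": [["<E>", "+", "<T>"], ["<T>"]],
    "<T>": [["<T>", "*", "<F>"], ["<F>"]],
    "<F>": [["(", "<E>", ")"], ["id"]],
}

# The noun phrase production has two consecutive NTs in its RHS.
noun_phrase_grammar = {"<Noun Phrase>": [["<Article>", "<Noun>"]]}
```

Here is_operator_grammar(expr_grammar) holds, while the noun phrase grammar fails the test because < Article > and < Noun > are adjacent.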
C) Binding and Binding Times
Ans: Definition: Binding: A binding is the association of an attribute of a program entity with a value.

Binding time is the time at which a binding is performed. Thus the type attribute of variable var
is bound to type, when its declaration is processed. The size attribute of type is bound to a
value sometime prior to this binding. We are interested in the following binding times:

1. Language definition time of L

2. Language implementation time of L

3. Compilation time of P


4. Execution init time of proc

5. Execution time of proc.

Where L is a programming language, P is a program written in L and proc is a procedure in P.

Note that language implementation time is the time when a language translator is designed.
The preceding list of binding times is not exhaustive; other binding times can be defined, viz.
binding at the linking time of P. The language definition of L specifies binding times for the
attributes of various entities of programs written in L.

Binding of the keywords of Pascal to their meanings is performed at language definition time.
This is how keywords like program, procedure, begin and end get their meanings. These
bindings apply to all programs written in Pascal. At language implementation time, the
compiler designer performs certain bindings. For example, the size of type ‘integer’ is bound
to n bytes where n is a number determined by the architecture of the target machine. Binding
of type attributes of variables is performed at compilation time of program bindings. The
memory addresses of local variables info and p of procedure proc are bound at every
execution init time of procedure proc. The value attributes of variables are bound (possibly
more than once) during an execution of proc. The memory address of p↑ is bound when the
procedure call new (p) is executed.

Static and dynamic bindings

Definition (Static binding) A static binding is a binding performed before the execution of a
program begins.

Definition (Dynamic binding) A dynamic binding is a binding performed after the execution
of a program has begun.
2. What is RISC and how is it different from CISC?
Ans: CISC: A Complex Instruction Set Computer (CISC) supplies a large number of complex
instructions at the assembly language level. Assembly language is a low-level computer
programming language in which each statement corresponds to a single machine instruction.
CISC instructions facilitate the extensive manipulation of low-level computational elements
and events such as memory, binary arithmetic, and addressing. The goal of the CISC
architectural philosophy is to make microprocessors easy and flexible to program and to
provide for more efficient memory use.

The CISC philosophy was unquestioned during the 1960s when the early computing
machines such as the popular Digital Equipment Corporation PDP 11 family of
minicomputers were being programmed in assembly language and memory was slow and
expensive.

CISC machines merely used the then-available technologies to optimize computer
performance. Their advantages included the following:

1. A new processor design could incorporate the instruction set of its predecessor as a subset
of an ever-growing language, with no need to reinvent the wheel, code-wise, with each design
cycle.

2. Fewer instructions were needed to implement a particular computing task, which led to
lower memory use for program storage and fewer time-consuming instruction fetches from
memory.

3. Simpler compilers sufficed, as complex CISC instructions could be written that closely
resembled the instructions of high-level languages. In effect, CISC made a computer’s
assembly language more like a high-level language to begin with, leaving the compiler less
to do.

Some disadvantages of the CISC design philosophy are as follows:

(1) The first advantage listed above could be viewed as a disadvantage. That is, the
incorporation of older instruction sets into new generations of processors tended to force
growing complexity.

(2) Many specialized CISC instructions were not used frequently enough to justify their
existence. The existence of each instruction needed to be justified because each one requires
the storage of more microcode in the central processing unit (the final and lowest layer of
code translation), which must be built in at some cost.

(3) Because each CISC command must be translated by the processor into tens or even
hundreds of lines of microcode, it tends to run slower than an equivalent series of simpler
commands that do not require so much translation. All translation requires time.

(4) Because a CISC machine builds complexity into the processor, where all its various
commands must be translated into microcode for actual execution, the design of CISC
hardware is more difficult and the CISC design cycle correspondingly long; this means delay
in getting to market with a new chip.

The terms CISC and RISC (Reduced Instruction Set Computer) were coined at this time to
reflect the widening split in computer-architectural philosophy.

RISC: The Reduced Instruction Set Computer, or RISC, is a microprocessor CPU design
philosophy that favors a simpler set of instructions that all take about the same amount of
time to execute. Common RISC microprocessors include AVR, PIC, ARM and DEC Alpha,
among others.

A RISC is a type of microprocessor architecture that utilizes a small, highly optimized set
of instructions, rather than the more specialized set of instructions often found in other
types of architectures.

RISC characteristics

• Small number of machine instructions : less than 150

• Small number of addressing modes : less than 4

• Small number of instruction formats : less than 4

• Instructions of the same length : 32 bits (or 64 bits)

• Single cycle execution

• Load / Store architecture

• Large number of GPRs (General Purpose Registers): more than 32

• Hardwired control

• Support for HLL (High Level Language).

RISC and x86

However, despite many successes, RISC has made few inroads into the desktop PC and
commodity server markets, where Intel’s x86 platform remains the dominant processor
architecture (Intel is facing increased competition from AMD, but even AMD’s processors
implement the x86 platform, or a 64-bit superset known as x86-64). There are three main
reasons for this. One, the very large base of proprietary PC applications are written for x86,
whereas no RISC platform has a similar installed base, and this meant PC users were locked
into the x86. The second is that, although RISC was indeed able to scale up in performance
quite quickly and cheaply, Intel took advantage of its large market by spending vast amounts
of money on processor development. Intel could spend many times as much as any RISC
manufacturer on improving low level design and manufacturing. The same could not be said
about smaller firms like Cyrix and NexGen, but they realized that they could apply pipelined
design philosophies and practices to the x86-architecture – either directly as in the 6x86 and
MII series, or indirectly (via extra decoding stages) as in Nx586 and AMD K5. Later, more
powerful processors such as Intel P6 and AMD K6 had similar RISC-like units that executed a
stream of micro-operations generated from decoding stages that split most x86 instructions
into several pieces. Today, these principles have been further refined and are used by
modern x86 processors such as Intel Core 2 and AMD K8. The first available chip deploying
such techniques was the NexGen Nx586, released in 1994 (while the AMD K5 was severely
delayed and released in 1995). As of 2007, the x86 designs (whether Intel’s or AMD’s) are as
fast as (if not faster than) the fastest true RISC single-chip solutions available.

Addressing Modes of CISC

The Motorola 68000 addressing modes

• Register to Register,

• Register to Memory,

• Memory to Register, and

• Memory to Memory

The 68000 supports a wide variety of addressing modes.

• Immediate mode – the operand immediately follows the instruction

• Absolute address – the address (in either the "short" 16-bit form or "long" 32-bit form)
of the operand immediately follows the instruction

• Program Counter relative with displacement – A displacement value is added to the
program counter to calculate the operand’s address. The displacement can be positive
or negative.

• Program Counter relative with index and displacement – The instruction contains both
the identity of an "index register" and a trailing displacement value. The contents of the
index register, the displacement value, and the program counter are added together to
get the final address.

• Register direct – The operand is contained in an address or data register.

• Address register indirect – An address register contains the address of the operand.

• Address register indirect with predecrement or postincrement – An address register
contains the address of the operand in memory. With the predecrement option set, a
predetermined value is subtracted from the register before the (new) address is used.
With the postincrement option set, a predetermined value is added to the register after
the operation completes.

• Address register indirect with displacement – A displacement value is added to the
register’s contents to calculate the operand’s address. The displacement can be
positive or negative.

• Address register relative with index and displacement – The instruction contains both
the identity of an "index register" and a trailing displacement value. The contents of the
index register, the displacement value, and the specified address register are added
together to get the final address.
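
The address arithmetic behind several of these modes can be sketched in a few lines. This is a simplified model, not a description of the real hardware: registers are held in a plain dictionary, the function names are invented for the illustration, the "predetermined value" is taken to be the operand size in bytes, and details such as sign extension and address wrap-around are ignored.

```python
def indirect_predecrement(regs, an, size):
    """Subtract size from the address register, then use the new value."""
    regs[an] -= size
    return regs[an]

def indirect_postincrement(regs, an, size):
    """Use the current value of the address register, then add size to it."""
    address = regs[an]
    regs[an] += size
    return address

def indirect_displacement(regs, an, displacement):
    """Address register contents plus a signed displacement."""
    return regs[an] + displacement

def indirect_index_displacement(regs, an, xn, displacement):
    """Address register + index register + displacement."""
    return regs[an] + regs[xn] + displacement

def pc_relative(pc, displacement):
    """Program counter plus a signed displacement."""
    return pc + displacement

# A0 is used as the address register and D1 as the index register.
regs = {"A0": 0x1000, "D1": 8}
first = indirect_postincrement(regs, "A0", 2)    # uses 0x1000, A0 becomes 0x1002
second = indirect_predecrement(regs, "A0", 2)    # A0 back to 0x1000, uses 0x1000
third = indirect_index_displacement(regs, "A0", "D1", 4)
```

Postincrement followed by predecrement with the same size returns to the same address, which is why this pair of modes is convenient for stack-like access.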


CISC                              RISC

Emphasis on hardware              Emphasis on software

Includes multi-clock              Single-clock,
complex instructions              reduced instruction only

Memory-to-memory:                 Register to register:
"LOAD" and "STORE"                "LOAD" and "STORE"
incorporated in instructions      are independent instructions

Small code sizes,                 Low cycles per second,
high cycles per second            large code sizes

Transistors used for storing      Spends more transistors
complex instructions              on memory registers

3. Explain the following with respect to the design specifications of an Assembler:

A) Data Structures
Ans: The second step in our design procedure is to establish the databases that we have to work with.

Pass 1 Data Structures

1. Input source program

2. A Location Counter (LC), used to keep track of each instruction’s location.

3. A table, the Machine-operation Table (MOT), that indicates the symbolic mnemonic for
each instruction and its length (two, four, or six bytes).

4. A table, the Pseudo-Operation Table (POT) that indicates the symbolic mnemonic and
action to be taken for each pseudo-op in pass 1.

5. A table, the Symbol Table (ST) that is used to store each label and its corresponding value.

6. A table, the literal table (LT) that is used to store each literal encountered and its
corresponding assignment location.

7. A copy of the input to be used by pass 2.


Pass 2 Data Structures

1. Copy of source program input to pass1.

2. Location Counter (LC)

3. A table, the Machine-operation Table (MOT), that indicates for each instruction its symbolic
mnemonic, length (two, four, or six bytes), binary machine opcode and format of
instruction.

4. A table, the Pseudo-Operation Table (POT), that indicates the symbolic mnemonic and
action to be taken for each pseudo-op in pass 2.

5. A table, the Symbol Table (ST), prepared by pass1, containing each label and
corresponding value.

6. A Table, the base table (BT), that indicates which registers are currently specified as base
registers by USING pseudo-ops and what the specified contents of these registers are.

7. A work space INST that is used to hold each instruction as its various parts are being
assembled together.

8. A work space, PRINT LINE, used to produce a printed listing.

9. A work space, PUNCH CARD, used prior to actual outputting for converting the
assembled instructions into the format needed by the loader.

10. An output deck of assembled instructions in the format needed by the loader.

Format of Data Structures

The third step in our design procedure is to specify the format and content of each of the data
structures. Pass 2 requires a machine operation table (MOT) containing the name, length,
binary code and format; pass 1 requires only name and length. Instead of using two different
tables, we construct a single MOT. The machine-operation table (MOT) and pseudo-operation
table (POT) are examples of fixed tables. The contents of these tables are not filled in or altered
during the assembly process.

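The following figure depicts the format of the machine-op table (MOT):

Mnemonic opcode   Binary opcode   Instruction length   Instruction format
(characters)      (hexadecimal)   (binary)             (binary)

“Abbb”            5A              10                   001
“AHbb”            4A              10                   001
“ALbb”            5E              10                   001
“ALRb”            1E              01                   000
…….               …….             …….                  …….

‘b’ represents “blank”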
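
Because the MOT is a fixed table, it can be modelled as a constant lookup structure. The following is a minimal sketch using the four entries shown above. The dictionary layout and the function name lookup are choices made for this illustration, and the conversion of the 2-bit length code to bytes assumes that the code counts 2-byte halfwords, which is consistent with the stated lengths of two, four or six bytes but is not spelled out in the text.

```python
# Mnemonic padded with blanks to 4 characters ->
# (binary opcode, length code, format code), as in the table above.
MOT = {
    "A   ": (0x5A, 0b10, 0b001),
    "AH  ": (0x4A, 0b10, 0b001),
    "AL  ": (0x5E, 0b10, 0b001),
    "ALR ": (0x1E, 0b01, 0b000),
}

def lookup(mnemonic):
    """Return the MOT entry for a mnemonic; raises KeyError if it is unknown."""
    key = mnemonic.upper().ljust(4)          # pad with blanks, like 'b' in the figure
    opcode, length_code, format_code = MOT[key]
    return {
        "opcode": opcode,
        # Assumption: the length code counts 2-byte halfwords.
        "length_bytes": length_code * 2,
        "format": format_code,
    }

entry = lookup("ALR")
```

Pass 1 would use only the length field of an entry to advance the location counter, while pass 2 would also use the opcode and format fields.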

B) pass1 & pass2 Assembler flow chart
Ans: Pass Structure of Assemblers: We discuss two pass and single pass assembly
schemes in this section.

Two pass translation

Two pass translation of an assembly language program can handle forward references easily.
LC processing is performed in the first pass and symbols defined in the program are entered
into the symbol table. The second pass synthesizes the target form using the address
information found in the symbol table. In effect, the first pass performs analysis of the source
program while the second pass performs synthesis of the target program. The first pass
constructs an intermediate representation (IR) of the source program for use by the second
pass. This representation consists of two main components–data structures, e.g. the symbol
table, and a processed form of the source program. The latter component is called
intermediate code (IC).
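
The division of work between the two passes can be sketched as follows. This is a toy model under several assumptions made only for the illustration: each statement is a (label, opcode, operand) triple, every statement occupies one word, register operands are left out, and the names SOURCE, pass_one and pass_two are invented for the example.

```python
# A small hypothetical program; ONE and N are forward references.
SOURCE = [
    ("",      "READ", "N"),
    ("AGAIN", "ADD",  "ONE"),
    ("",      "BC",   "AGAIN"),
    ("N",     "DS",   "1"),
    ("ONE",   "DC",   "1"),
]

def pass_one(source, start):
    """LC processing: build the symbol table and the intermediate code."""
    symtab, intermediate, lc = {}, [], start
    for label, opcode, operand in source:
        if label:
            symtab[label] = lc               # symbol is defined at the current LC
        intermediate.append((lc, opcode, operand))
        lc += 1                              # assumption: one word per statement
    return symtab, intermediate

def pass_two(symtab, intermediate):
    """Synthesis: replace symbolic operands by their addresses."""
    target = []
    for lc, opcode, operand in intermediate:
        # Constants such as "1" are not in the table and pass through unchanged.
        target.append((lc, opcode, symtab.get(operand, operand)))
    return target

symtab, intermediate = pass_one(SOURCE, 101)
target = pass_two(symtab, intermediate)
```

Forward references cause no difficulty here because pass_two runs only after pass_one has seen every label.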

Single pass translation

LC processing and construction of the symbol table proceed as in two pass translation. The
problem of forward references is tackled using a process called backpatching. The operand
field of an instruction containing a forward reference is left blank initially. The address of the
forward referenced symbol is put into this field when its definition is encountered.

Look at the following instructions:

        READ   N              101) + 09 0 113
        MOVER  BREG, ONE      102) + 04 2 115
        MOVEM  BREG, TERM     103) + 05 2 116
AGAIN   MULT   BREG, TERM     104) + 03 2 116
        MOVER  CREG, TERM     105) + 04 3 116
        ADD    CREG, ONE      106) + 01 3 115
        MOVEM  CREG, TERM     107) + 05 3 116
        COMP   CREG, N        108) + 06 3 113
        BC     LE, AGAIN      109) + 07 2 104
        MOVEM  BREG, RESULT   110) + 05 2 114
        PRINT  RESULT         111) + 10 0 114
        STOP                  112) + 00 0 000
N       DS     1              113)
RESULT  DS     1              114)
ONE     DC     ‘1’            115) + 00 0 001
TERM    DS     1              116)

In the above program, the instruction corresponding to the statement

MOVER BREG, ONE

can be only partially synthesized since ONE is a forward reference. Hence the instruction
opcode and address of BREG will be assembled to reside in location 102. The need for
inserting the second operand’s address at a later stage can be indicated by adding an entry to
the Table of Incomplete Instructions (TII). This entry is a pair (<instruction address>,
<symbol>), e.g. (102, ONE) in this case.

By the time the END statement is processed, the symbol table would contain the addresses of
all symbols defined in the source program and TII would contain information describing all
forward references. The assembler can now process each entry in TII to complete the
concerned instruction. For example, the entry (102, ONE) would be processed by obtaining
the address of ONE from symbol table and inserting it in the operand address field of the
instruction with assembled address 102. Alternatively, entries in TII can be processed in an
incremental manner. Thus, when definition of some symbol symb is encountered, all forward
references to symb can be processed.
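
Backpatching with a TII can be sketched as follows. The program, the start address 200 and the function name single_pass are hypothetical and chosen only for this illustration; as a simplification, each statement is a (label, opcode, operand) triple that occupies one word, register operands are omitted, and the TII is processed once at the end of the source rather than incrementally.

```python
# A small hypothetical program in which ONE is referenced before it is defined.
SOURCE = [
    ("",      "MOVER", "ONE"),
    ("AGAIN", "ADD",   "ONE"),
    ("",      "BC",    "AGAIN"),
    ("ONE",   "DC",    "1"),
]

def single_pass(source, start):
    symtab, tii, memory, lc = {}, [], {}, start
    for label, opcode, operand in source:
        if label:
            symtab[label] = lc
        if operand in symtab or not operand.isidentifier():
            # Already-defined symbol or a constant: assemble completely.
            memory[lc] = [opcode, symtab.get(operand, operand)]
        else:
            # Forward reference: leave the operand field blank and record it.
            memory[lc] = [opcode, None]
            tii.append((lc, operand))
        lc += 1                              # assumption: one word per statement
    # End of source: every symbol is now defined, so complete each entry.
    for address, symbol in tii:
        memory[address][1] = symtab[symbol]  # KeyError here means an undefined symbol
    return memory, tii

memory, tii = single_pass(SOURCE, 200)
```

The backward reference to AGAIN is resolved immediately, whereas both references to ONE pass through the TII.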

Design of A Two Pass Assembler

Tasks performed by the passes of a two pass assembler are as follows:

Pass I:

1. Separate the symbol, mnemonic opcode and operand fields.

2. Build the symbol table.


3. Perform LC processing.

4. Construct intermediate representation.

Pass II: Synthesize the target program.

Pass I performs analysis of the source program and synthesis of the intermediate
representation while Pass II processes the intermediate representation to synthesize the
target program. The design details of assembler passes are discussed after introducing
advanced assembler directives and their influence on LC processing.
4. Define the following,
A) Parsing
Ans: Parsing transforms input text or a string into a data structure, usually a tree, which is suitable for
later processing and which captures the implied hierarchy of the input. Lexical analysis
creates tokens from a sequence of input characters, and it is these tokens that are processed
by a parser to build a data structure such as a parse tree or an abstract syntax tree.

Parsing is the process of analyzing a sequence of tokens to determine its grammatical
structure with respect to a given formal grammar. A parser is the component of a compiler
that carries out this task.

Conceptually, the parser accepts a sequence of tokens and produces a parse tree. In practice
this might not occur.

1. The source program might have errors. Shamefully, we will do very little error handling.

2. Real compilers produce (abstract) syntax trees not parse trees (concrete syntax trees). We
don’t do this for the pedagogical reasons given previously.

There are three classes for grammar-based parsers.

1. Universal

2. Top-down

3. Bottom-up

The universal parsers are not used in practice as they are inefficient; we will not discuss
them.

As expected, top-down parsers start from the root of the tree and proceed downward,
whereas bottom-up parsers start from the leaves and proceed upward. The commonly used
top-down and bottom-up parsers are not universal. That is, there are (context-free) grammars
that cannot be used with them.

The LL and LR parsers are important in practice. Hand written parsers are often LL.
Specifically, the predictive parsers we looked at in chapter two are for LL grammars. The LR
grammars form a larger class. Parsers for this class are usually constructed with the aid of
automatic tools.

Parse tree

A parse tree depicts the steps in parsing, hence it is useful for understanding the process of
parsing.

A valid parse tree for a grammar G is a tree

• whose root is the start symbol of G,
• whose interior nodes are nonterminals of G,
• in which the children of a node T (from left to right) correspond to the symbols on the right hand
side of some production for T in G, and
• whose leaf nodes are terminal symbols of G.

• Every sentence generated by a grammar has a corresponding parse tree

• Every valid parse tree exactly covers a sentence generated by the grammar

Example parse tree for the arithmetic expression 1+2*3.
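
Since the figure is not reproduced here, the structure of 1+2*3 can be recovered with a small top-down parser. The sketch below makes several assumptions of its own: the grammar (expr ::= term { + term }, term ::= number { * number }) is not given in the text, the function names are invented, and the result is a simplified (abstract) tree of nested tuples rather than a full parse tree with one node per nonterminal.

```python
def parse_expression(tokens):
    """expr ::= term { '+' term }"""
    tree, rest = parse_term(tokens)
    while rest and rest[0] == "+":
        right, rest = parse_term(rest[1:])
        tree = ("+", tree, right)
    return tree, rest

def parse_term(tokens):
    """term ::= number { '*' number }"""
    tree, rest = parse_number(tokens)
    while rest and rest[0] == "*":
        right, rest = parse_number(rest[1:])
        tree = ("*", tree, right)
    return tree, rest

def parse_number(tokens):
    """number ::= a token consisting of digits"""
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("number expected")
    return int(tokens[0]), tokens[1:]

tree, remaining = parse_expression(["1", "+", "2", "*", "3"])
```

The resulting tree is ("+", 1, ("*", 2, 3)): the multiplication sits lower in the tree than the addition, which reflects its higher precedence.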

Overview of process

The following example demonstrates the common case of parsing a computer language with
two levels of grammar: lexical and syntactic.

The first stage is the token generation, or lexical analysis, by which the input character
stream is split into meaningful symbols defined by a grammar of regular expressions.

For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into
the tokens 12, *, (, 3, +, 4, ), ^ and 2, each of which is a meaningful symbol in the context of
an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, (
and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be
generated.
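
The token generation stage for the calculator example can be sketched with a regular expression. This is a minimal illustration: the names TOKEN_PATTERN and tokenize are invented for the example, only unsigned integers and the five operator characters are recognised, and white space is not handled.

```python
import re

# A token is either a run of digits or one of the characters * + ^ ( )
TOKEN_PATTERN = re.compile(r"\d+|[*+^()]")

def tokenize(text):
    """Split text into tokens, scanning from left to right."""
    tokens, position = [], 0
    while position < len(text):
        match = TOKEN_PATTERN.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character {text[position]!r}")
        tokens.append(match.group())
        position = match.end()               # continue after the matched token
    return tokens

tokens = tokenize("12*(3+4)^2")
```

Because an operator character can never be part of a number, a fragment such as 12* is always split into the two tokens 12 and *.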

The next stage is syntactic parsing or syntactic analysis, which is checking that the tokens
form an allowable expression. This is usually done with reference to a context-free grammar
which recursively defines components that can make up an expression and the order in which
they must appear. However, not all rules defining programming languages can be expressed
by context-free grammars alone, for example type validity and proper declaration of
identifiers. These rules can be formally expressed with attribute grammars.

Types of Parsers

The task of the parser is essentially to determine if and how the input can be derived from the
start symbol of the grammar. This can be done in essentially two ways:

Top-Down Parsing – A parser can start with the start symbol and try to transform it to the
input. Intuitively, the parser starts from the largest elements and breaks them down into
incrementally smaller parts. LL parsers are examples of top-down parsers.

Bottom-Up Parsing – A parser can start with the input and attempt to rewrite it to the start
symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements
containing these, and so on. LR parsers are examples of bottom-up parsers. Another term
used for this type of parser is Shift-Reduce parsing.

Another important distinction is whether the parser generates a leftmost derivation or a
rightmost derivation (see context-free grammar). LL parsers will generate a leftmost derivation
and LR parsers will generate a rightmost derivation (although usually in reverse).

Top-down Parsing

The compiler parses input from a programming language to assembly language or an internal
representation by matching the incoming symbols to Backus-Naur form production rules. An
LL parser, also called a top-bottom parser or top-down parser, applies each production rule to
the incoming symbols by working from the left-most symbol yielded on a production rule and
then proceeding to the next production rule for each non-terminal symbol encountered. In this
way the parsing starts on the Left of the result side (right side) of the production rule and
evaluates non-terminals from the Left first and, thus, proceeds down the parse tree for each
new non-terminal before continuing to the next symbol for a production rule.

Bottom-up parsing

Bottom-up parsing is a parsing method that works by identifying terminal symbols first, and
combines them successively to produce nonterminals. The productions of the parser can be
used to build a parse tree of a program written in human-readable source code that can be
compiled to assembly language or pseudo code.

Different computer languages require different parsing techniques, although it is not
uncommon to use a parsing technique that is more powerful than that actually required.

Bottom-up parsing methods have an advantage over top-down parsing in that they are less
fussy in the grammars they can use.

The most popular bottom-up technique is LALR(1). Every LL(1) grammar is also LR(1), and most
LL(1) grammars met in practice are LALR(1) as well, but many LALR(1) grammars, including the
most natural grammars for a variety of common programming language constructs, are not LL(1).
Unfortunately, bottom-up techniques are more complicated to understand and to implement than
top-down techniques.
B) Scanning
Ans: Scanning and parsing are two important phases of compiler construction. A compiler is a
translator: a program which converts the source program into machine-level language.
Conceptually, it performs its analysis in three phases, with the output of one phase serving
as the input of the next. Each of these phases changes the representation of the program
being compiled. Lexical analysis, or scanning, transforms the program from a string of
characters into a string of tokens; syntax analysis, or parsing, transforms the program into
some kind of syntax tree; and semantic analysis decorates the tree with semantic information.

The character stream input is grouped into meaningful units called lexemes, which are then
mapped into tokens, the latter constituting the output of the lexical analyzer.

For example, any one of the following C statements

x3 = y + 3;

x3 = y + 3 ;

x3 = y+ 3 ;

but not

x 3 = y + 3;

would be grouped into the lexemes x3, =, y, +, 3, and ;. A token is a
<token-name, attribute-value> pair. The hierarchical decomposition of the above statement is
given in the figure below.
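
The grouping of characters into lexemes can be sketched with a regular expression. The
pattern below is an assumption that covers only the identifiers, integers and the three
symbols used in this example; characters it does not recognise (including blanks) are simply
skipped, whereas a real scanner would report an error for illegal characters.

```python
import re

# identifier | unsigned integer | one of the symbols = + ;
LEXEME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[=+;]")

def lexemes(source):
    return LEXEME.findall(source)

for stmt in ("x3 = y + 3;", "x3  =  y  +  3 ;", "x3 = y+ 3 ;", "x 3 = y + 3;"):
    print(repr(stmt), "->", lexemes(stmt))
```

The first three spellings yield the same six lexemes; in the last one the blank splits x3
into the two lexemes x and 3.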


C) Token
Ans: A token is a <token-name, attribute-value> pair.

For example

1. The lexeme x3 would be mapped to a token such as <id,1>. The name id is short for
identifier. The value 1 is the index of the entry for x3 in the symbol table produced by the
compiler. This table is used to gather information about the identifiers and to pass this
information to subsequent phases.

2. The lexeme = would be mapped to the token <=>. In reality it is probably mapped to a pair,
whose second component is ignored. The point is that there are many different identifiers
so we need the second component, but there is only one assignment symbol =.

3. The lexeme y is mapped to the token <id,2>

4. The lexeme + is mapped to the token <+>.

5. The number 3 is mapped to <number, something>, but what is the something? On the one
hand, there is only one 3, so we could just use the token <number,3>.

6. However, there can be a difference between how this should be printed (e.g., in an error
message produced by subsequent phases) and how it should be stored (fixed vs. float vs.
double). Perhaps the token should point to the symbol table where an entry for this kind of 3
is stored. Another possibility is to have a separate numbers table.

7. The lexeme ; is mapped to the token <;>.
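
The mapping listed above can be sketched as a scanner that returns <token-name,
attribute-value> pairs together with a symbol table. The names TOKEN_SPEC and scan, the use
of 1-based symbol table indices and the choice of storing the integer value directly in the
number token are assumptions made for this illustration.

```python
import re

TOKEN_SPEC = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<punct>[=+;])")

def scan(source):
    symbol_table = []   # entry i-1 holds the identifier whose index is i
    tokens = []
    for m in TOKEN_SPEC.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "id":
            if text not in symbol_table:
                symbol_table.append(text)
            tokens.append(("id", symbol_table.index(text) + 1))
        elif kind == "number":
            tokens.append(("number", int(text)))   # simplest choice: keep the value
        else:
            tokens.append((text, None))            # second component is ignored
    return tokens, symbol_table

tokens, table = scan("x3 = y + 3;")
print(tokens)
print(table)
```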

Note that non-significant blanks are normally removed during scanning. In C, most blanks are
non-significant. That does not mean the blanks are unnecessary. Consider

int x;

The blank between int and x is clearly necessary, but it does not become part of any token.
Blanks inside strings are an exception: they are part of the token (or more likely the table
entry pointed to by the second component of the token).

Note that we can define identifiers, numbers, and the various symbols and punctuation
without using recursion (compare with parsing below).

Parsing involves a further grouping in which tokens are grouped into grammatical phrases,
which are often represented in a parse tree.

For example

x3 = y + 3; would be parsed into the tree on the right.

This parsing would result from a grammar containing rules such as

asst-stmt → id = expr ;

expr → number
     | id
     | expr + expr

Note the recursive definition of expression (expr). Note also the hierarchical decomposition in
the figure on the right. The division between scanning and parsing is somewhat arbitrary, in
that some tasks can be accomplished by either. However, if a recursive definition is involved,
it is considered parsing not scanning.
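
A minimal sketch of a parser for the rules above follows. Because expr → expr + expr is
ambiguous as written, the sketch assumes that + groups to the left and handles the recursion
with a loop; the function names and the nested-tuple tree layout are likewise assumptions.

```python
import re

OPERATORS = {"=", "+", ";"}

def parse_assignment(source):
    """Parse 'id = expr ;' and return a nested-tuple parse tree."""
    toks = re.findall(r"[A-Za-z_]\w*|\d+|[=+;]", source)
    pos = 0

    def expect(symbol):
        nonlocal pos
        if pos >= len(toks) or toks[pos] != symbol:
            raise SyntaxError(f"'{symbol}' expected at token {pos}")
        pos += 1

    def operand():
        # expr -> number | id
        nonlocal pos
        if pos >= len(toks) or toks[pos] in OPERATORS:
            raise SyntaxError(f"id or number expected at token {pos}")
        tok = toks[pos]
        pos += 1
        return ("number", int(tok)) if tok.isdigit() else ("id", tok)

    def expr():
        # expr -> expr + expr, read here as left-associative repetition
        node = operand()
        while pos < len(toks) and toks[pos] == "+":
            expect("+")
            node = ("expr", node, "+", operand())
        return node

    target = operand()
    if target[0] != "id":
        raise SyntaxError("assignment target must be an id")
    expect("=")
    value = expr()
    expect(";")
    if pos != len(toks):
        raise SyntaxError("unexpected input after ';'")
    return ("asst-stmt", target, "=", value, ";")

print(parse_assignment("x3 = y + 3;"))
```

The statement x 3 = y + 3; is rejected here, because after the id x the parser expects = but
finds the number 3.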
5. Describe the process of Bootstrapping in the context of Linkers.
Ans: In computing, bootstrapping refers to a process where a simple system activates another
more complicated system that serves the same purpose. It is a solution to the
chicken-and-egg problem of starting a certain system without the system already functioning.
The term is most often applied to the process of starting up a computer, in which a mechanism
is needed to execute the software program that is responsible for executing software programs
(the operating system).

Bootstrap loading

The discussions of loading up to this point have all presumed that there’s already an
operating system or at least a program loader resident in the computer to load the program of
interest. The chain of programs being loaded by other programs has to start somewhere, so
the obvious question is how is the first program loaded into the computer?

In modern computers, the first program the computer runs after a hardware reset invariably is
stored in a ROM known as the bootstrap ROM, as in "pulling one’s self up by the bootstraps."
When the CPU is powered on or reset, it sets its registers to a known state. On x86 systems,
for example, the reset sequence jumps to the address 16 bytes below the top of the system’s
address space. The bootstrap ROM occupies the top 64K of the address space and ROM
code then starts up the computer. On IBM-compatible x86 systems, the boot ROM code
reads the first block of the floppy disk or, if that fails, the first block of the first
hard disk into memory at address 0x7C00 and jumps to that address. The program in block zero
in turn loads a slightly larger operating system boot program from a known place on the disk
into memory, and jumps to that program, which in turn loads the operating system and starts
it. (There can be even more steps, e.g., a boot manager that decides from which disk
partition to read the operating system boot program, but the sequence of increasingly capable
loaders remains.)
Why not just load the operating system directly? Because you can’t fit an operating system
loader into the 512 bytes of the boot block. The first level loader typically is only able to
load a single-segment program from a file with a fixed name in the top-level directory of the
boot disk. The operating system loader contains more sophisticated code that can read and
interpret a configuration file, uncompress a compressed operating system executable, and
address large amounts of memory (on an x86 the loader usually runs in real mode, which means
that it’s tricky to address more than 1MB of memory). The full operating system can turn on
the virtual memory system, load the drivers it needs, and then proceed to run user-level
programs.

Many Unix systems use a similar bootstrap process to get user-mode programs running. The
kernel creates a process, then stuffs a tiny little program, only a few dozen bytes long, into
that process. The tiny program executes a system call that runs /etc/init, the user mode
initialization program that in turn runs configuration files and starts the daemons and login
programs that a running system needs.

None of this matters much to the application level programmer, but it becomes more
interesting if you want to write programs that run on the bare hardware of the machine, since
then you need to arrange to intercept the bootstrap sequence somewhere and run your
program rather than the usual operating system. Some systems make this quite easy (just
stick the name of your program in AUTOEXEC.BAT and reboot Windows 95, for example),
others make it nearly impossible. It also presents opportunities for customized systems. For
example, a single-application system could be built over a Unix kernel by naming the
application /etc/init.

Software Bootstrapping & Compiler Bootstrapping

Bootstrapping can also refer to the development of successively more complex, faster
programming environments. The simplest environment will be, perhaps, a very basic text
editor (e.g. ed) and an assembler program. Using these tools, one can write a more complex
text editor, and a simple compiler for a higher-level language and so on, until one can have a
graphical IDE and an extremely high-level programming language.

Compiler Bootstrapping

In compiler design, a bootstrap or bootstrapping compiler is a compiler that is written in the
language, or a subset of the language, that it compiles. Examples include gcc, GHC,
OCaml, BASIC, PL/I and, more recently, the Mono C# compiler.
6. Describe the procedure for design of a Linker.
Ans: Design of a linker: Relocation and linking requirements in segmented addressing

The relocation requirements of a program are influenced by the addressing structure of the
computer system on which it is to execute. Use of the segmented addressing structure
reduces the relocation requirements of a program.

Implementation Examples: A Linker for MS-DOS

Example: Consider a program written in the assembly language of the Intel 8088. The
ASSUME statement declares the segment registers CS and DS to be available for memory
addressing. Hence all memory addressing is performed by using suitable displacements from
their contents. The translation time address of A is 0196. In statement 16, a reference to A
is assembled as a displacement of 0196 from the contents of the CS register. This avoids the
use of an absolute address, hence the instruction is not address sensitive. Now no relocation
is needed if segment SAMPLE is to be loaded at address 2000 by a calling program (or by the
OS). The effective operand address would be calculated as <CS>+0196, which is the correct
address 2196. A similar situation exists with the reference to B in statement 17. The reference
to B is assembled as a displacement of 0002 from the contents of the DS register. Since the
DS register would be loaded with the execution time address of DATA_HERE, the reference
to B would be automatically relocated to the correct address.
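
The reason no relocation is needed can be seen from the address arithmetic alone. The sketch
below follows the simplified calculation used in the text (<CS> + displacement) and reads
2000 and 0196 as hexadecimal, which is an assumption; on real 8088 hardware the contents of
the segment register are additionally multiplied by 16 before the displacement is added. The
second load address is hypothetical.

```python
def effective_address(segment_base, displacement):
    # Operand address = contents of the segment register + assembled displacement.
    return segment_base + displacement

DISPLACEMENT_OF_A = 0x0196   # fixed at translation time and stored in the instruction

for load_address in (0x2000, 0x5000):   # 0x5000 is a hypothetical second load address
    print(hex(load_address), "->", hex(effective_address(load_address, DISPLACEMENT_OF_A)))
```

The assembled displacement stays the same for every load address; only the register contents
change, so the instruction itself never has to be patched.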


Though use of segment registers reduces the relocation requirements, it does not completely
eliminate the need for relocation. Consider statement 14, which loads the segment base of
DATA_HERE into the AX register preparatory to its transfer into the DS register. Since the
assembler knows DATA_HERE to be a segment, it makes provision to load the higher order 16
bits of the address of DATA_HERE into the AX register. However, it does not know the link
time address of DATA_HERE, hence it assembles the MOV instruction in the immediate operand
format and puts zeroes in the operand field. It also makes an entry for this instruction in
RELOCTAB so that the linker would put the appropriate address in the operand field.
Inter-segment calls and jumps are handled in a similar way.
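
What the linker does with such an entry can be sketched as follows. The representation of
RELOCTAB as a list of byte offsets, the function name and the segment base value 0x0200 are
assumptions for illustration; the byte 0xB8 is the 8086 opcode for MOV AX with a 16-bit
immediate operand, and operands are stored low byte first.

```python
def relocate(code, reloctab, segment_base):
    """Return a copy of code with every operand field listed in reloctab patched."""
    patched = bytearray(code)
    for offset in reloctab:
        # Each entry names a 2-byte field that must hold the segment base.
        patched[offset:offset + 2] = segment_base.to_bytes(2, "little")
    return bytes(patched)

assembled = bytes([0xB8, 0x00, 0x00])   # MOV AX, <zeroes left by the assembler>
reloctab = [1]                          # the operand field starts at byte 1
linked = relocate(assembled, reloctab, 0x0200)
print(linked.hex())
```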

Relocation is somewhat more involved in the case of intra-segment jumps assembled in the
FAR format. For example, consider the following program:



Here the displacement and the segment base of FAR_LAB are to be put in the JMP
instruction itself. The assembler puts the displacement of FAR_LAB in the first two operand
bytes of the instruction, and makes a RELOCTAB entry for the third and fourth operand bytes
which are to hold the segment base address. A statement like


(which is an ‘address constant’) does not need any relocation since the assembler can itself
put the required offset in the bytes. In summary, the only RELOCTAB entries that must exist
for a program using segmented memory addressing are for the bytes that contain a segment
base address.

For linking, however, both the segment base address and the offset of the external symbol
must be computed by the linker. Hence there is no reduction in the linking requirements.
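
The difference from relocation can be sketched as below: for an external reference the
linker has to fill in both the offset and the segment base. The table layouts, the names
EXT_LAB and OTHER_SEG and all numeric values are hypothetical; the only hardware facts
assumed are that 0xEA is the 8086 opcode for a FAR JMP and that its operand holds the offset
in the first two bytes and the segment base in the next two.

```python
def link_far_reference(code, operand_at, symbol, public_defs, segment_bases):
    """Patch a 4-byte far operand with the offset and segment base of symbol."""
    segment, offset = public_defs[symbol]       # where the symbol is defined
    base = segment_bases[segment]               # link time base of that segment
    patched = bytearray(code)
    patched[operand_at:operand_at + 2] = offset.to_bytes(2, "little")
    patched[operand_at + 2:operand_at + 4] = base.to_bytes(2, "little")
    return bytes(patched)

public_defs = {"EXT_LAB": ("OTHER_SEG", 0x0010)}   # symbol -> (segment, offset)
segment_bases = {"OTHER_SEG": 0x0300}              # segment -> base address
jump = bytes([0xEA, 0, 0, 0, 0])                   # JMP FAR with an empty operand
print(link_far_reference(jump, 1, "EXT_LAB", public_defs, segment_bases).hex())
```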