3.1 INTRODUCTION
Computer hardware can understand only machine-level instructions, so the instructions of a
program written in a high-level language must be converted to machine instructions before the
computer can execute the program. This job is carried out by the compiler. A compiler can
therefore be defined as a computer program that translates a program in a source language into
an equivalent program in a target language. This is shown in Figure 3.1.
such errors will be successfully compiled and the object code will be obtained without any error
message. However, such programs will not produce the right answers when executed.
b) Syntax Analysis
c) Semantic Analysis
[Figure: phases of a compiler. Recoverable labels: Lexical Analysis, Syntax Analysis, Intermediate Code Generation, Code Optimization, Code Generation]
Lexeme     Token
old_sum    identifier
+          operator
value      identifier
/          division operator
analysis on a sentence in a natural language. Syntax analysis determines the structural elements
of the program as well as their relationships. The result of syntax analysis is usually represented
as a parse tree or a syntax tree.
For example, consider a line of code a=b+i.
a=b+i can be represented as a parse tree as follows:
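As a rough illustration of what such a tree contains, the sketch below uses Python's standard `ast` module to parse the same statement. This is an assumption made for illustration only: Python produces an abstract syntax tree for its own grammar, which is more compact than a full parse tree, but it shows the same structure, with the assignment at the root, the target a on one side, and the + operation over b and i on the other.

```python
import ast

# Parse the statement with Python's own parser (an illustrative stand-in
# for the syntax analyzer described in the text).
tree = ast.parse("a = b + i")
stmt = tree.body[0]          # the single assignment statement

# The root of the statement is an assignment node.
root_kind = type(stmt).__name__            # "Assign"
target = stmt.targets[0].id                # left-hand side: a
operator = type(stmt.value.op).__name__    # "Add"
left = stmt.value.left.id                  # b
right = stmt.value.right.id                # i

print(root_kind, target, operator, left, right)
```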
x=y+z;
p=q+t
else
x=y-z;
p=q*t
endif
Figure 3.7: Interactions between the lexical analyzer and the parser
The main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce as output a sequence of tokens, one for each
lexeme in the source program. The stream of tokens is then sent to the parser for syntax analysis.
In practice, the syntax analyzer is the master program: it sends a request to the lexical analyzer
for the next token. The lexical analyzer reads the input string, converts the next lexeme into a
token, and passes the result to the syntax analyzer. The syntax analyzer checks the token against
the grammar and then requests the next token from the lexical analyzer. This process repeats
until the entire input string is consumed.
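The request-and-reply interaction described above can be sketched as follows. This is a minimal sketch, not a real compiler front end: the class name `Lexer`, the method `next_token`, the token kinds (`id`, `num`, `op`), and the `parse` driver are all hypothetical names chosen for illustration, and the "parser" here only collects tokens instead of checking a grammar.

```python
import re

# One alternative per token kind; leading white space is skipped.
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[-+*/=]))")

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self):
        """Return the next (kind, lexeme) pair, or ("eof", "") at end of input."""
        if self.text[self.pos:].strip() == "":
            return ("eof", "")
        m = TOKEN_RE.match(self.text, self.pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {self.pos}")
        self.pos = m.end()
        return (m.lastgroup, m.group(m.lastgroup))

def parse(text):
    """Stand-in for the parser: it drives the loop by requesting one token at a time."""
    lexer = Lexer(text)
    tokens = []
    while True:
        tok = lexer.next_token()      # the parser asks, the lexer answers
        if tok[0] == "eof":
            break
        tokens.append(tok)            # a real parser would check the grammar here
    return tokens

tokens = parse("old_sum + value / 100")
```

The point of the design is that the lexer never runs ahead on its own; it produces exactly one token each time the parser asks for one.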
Lexical analysis is separated out to simplify the overall design of the compiler. It is
easier to specify the structure of tokens than the syntactic structure of the source
program. Therefore we can construct a more specialized, and hence more efficient, recognizer for
tokens than for syntactic structure.
Another function of the lexical analyzer is stripping out comments and white space
(blank, newline, and tab). Also, if the program contains erroneous input, the lexical analyzer
correlates the error with the source file and the line number.
The lexical analyzer is kept as an independent module for the following reasons:
The overall design of the compiler is simplified.
Compiler efficiency is improved.
Compiler portability is enhanced.
Once the next lexeme is determined, forward is set to the character at its right
end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the lexeme just found. In Fig. 3.8, we see
forward has passed the end of the next lexeme, ** (the FORTRAN exponentiation operator),
and must be retracted one position to its left.
Advancing forward requires that we first test whether we have reached the end
of one of the buffers, and if so, we must reload the other buffer from the input, and move forward
to the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the
actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than
N, we shall never overwrite the lexeme in its buffer before determining it.
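The buffer-switching test can be sketched as follows. This is a simplified sketch under stated assumptions: the class name `TwoBufferReader` and its fields are hypothetical, only the forward pointer is modelled (lexemeBegin and retraction are omitted), and a short read from the input is taken to mean end of input.

```python
import io

class TwoBufferReader:
    """Hypothetical two-buffer input scheme; each buffer holds at most n characters."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        self.halves = [stream.read(n), ""]   # second buffer is loaded on demand
        self.half = 0                        # buffer that forward currently points into
        self.pos = 0                         # offset of forward inside that buffer

    def advance(self):
        """Return the next character, or "" at end of input."""
        if self.pos == len(self.halves[self.half]):
            if len(self.halves[self.half]) < self.n:
                return ""                    # short buffer: the input is exhausted
            # Reached the end of one buffer: reload the other one and
            # move forward to the beginning of the newly loaded buffer.
            other = 1 - self.half
            self.halves[other] = self.stream.read(self.n)
            self.half, self.pos = other, 0
            if not self.halves[other]:
                return ""
        ch = self.halves[self.half][self.pos]
        self.pos += 1
        return ch

def read_all(text, n):
    reader = TwoBufferReader(io.StringIO(text), n)
    out = []
    while True:
        ch = reader.advance()
        if ch == "":
            return "".join(out)
        out.append(ch)
```

Because the previous buffer is left untouched until the next reload, a lexeme that started there is still available as long as the lexeme plus lookahead stays within N characters.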
One way to begin the design of any program is to describe the behavior of the
program by a flow chart. This approach is particularly useful when the program is a lexical
analyzer, because the action taken is highly dependent on what characters have been seen
recently. The specialized kind of flow chart used for lexical analyzers is called a transition
diagram. In a transition diagram, the boxes of the flow chart are drawn as circles and called states.
The states are connected by arrows called edges. The labels on the various edges leaving a state
indicate the input characters that can appear after that state.
Consider the transition diagram given in figure 3.9. The code for state0 might be:
state 0 : C:=GETCHAR();
if LETTER(C) then goto state1
else FAIL()
DIGIT is a procedure which returns true if and only if C is one of the digits
0, 1, …, 9. DELIMITER is a procedure which returns true whenever C is a character that is not a
letter or digit. State 2 indicates that an identifier has been found. Since the delimiter is not part
of the identifier, we must retract the forward pointer one character.
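The three states described here can be sketched in code as follows. The diagram itself is not reproduced in this text, so the sketch assumes the usual shape: state 0 moves to state 1 on a letter, state 1 loops on letters and digits, and any delimiter leads to state 2. The function name `scan_identifier` is hypothetical, and end of input is treated as a delimiter.

```python
def is_letter(c):
    return c.isascii() and c.isalpha()

def is_digit(c):
    return c.isascii() and c.isdigit()

def scan_identifier(text, start=0):
    """Return (lexeme, next_position), or None if no identifier begins at start."""
    forward = start
    state = 0
    while True:
        # "" stands for end of input and behaves like a delimiter.
        c = text[forward] if forward < len(text) else ""
        forward += 1
        if state == 0:
            if is_letter(c):
                state = 1
            else:
                return None            # FAIL()
        elif state == 1:
            if not (is_letter(c) or is_digit(c)):
                state = 2              # delimiter seen
        if state == 2:
            forward -= 1               # retract: the delimiter is not part of the lexeme
            return text[start:forward], forward
```

For example, `scan_identifier("sum1+x")` stops at the `+`, retracts one character, and reports the lexeme `sum1` together with the position at which scanning should resume.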
3. the set of all ASCII characters, or set of all printable ASCII characters.
Strings
A string (or word) is a finite sequence of symbols chosen from some alphabet.
For example, 01101 is a string from the binary alphabet Σ = {0,1}. The string 111 is another
string chosen from this alphabet.
Powers of an alphabet
The set of all strings of a certain length from an alphabet can be represented
using exponential notation. Σ^k is defined to be the set of strings of length k, each of whose symbols is in Σ.
For example, if Σ = {0,1}, then Σ^1 = {0,1}; Σ^2 = {00,01,10,11}; Σ^0 = {ε}, where ε is the empty string.
The set of all strings over an alphabet Σ is denoted Σ*. For example,
{0,1}* = {ε,0,1,00,01,10,11,000,…}. This can be written as:
Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ …
Sometimes, we want to exclude the empty string from the set of strings. The set of nonempty strings
over the alphabet Σ is denoted by Σ+. Thus,
Σ+ = Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ …
Σ* = Σ+ ∪ {ε}
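The powers of an alphabet can be enumerated directly for small cases. The following is a small sketch using `itertools.product` from the Python standard library; the function names `power` and `star_up_to` are hypothetical, and the empty Python string stands for the empty string. Since the set of all strings is infinite, `star_up_to` lists only the strings of length at most n.

```python
from itertools import product

def power(alphabet, k):
    """All strings of length k over the alphabet, in sorted order."""
    return ["".join(p) for p in product(sorted(alphabet), repeat=k)]

def star_up_to(alphabet, n):
    """The strings of length 0, 1, ..., n: a finite slice of the star of the alphabet."""
    return [w for k in range(n + 1) for w in power(alphabet, k)]

print(power("01", 2))        # the four strings of length 2
print(power("01", 0))        # a list containing only the empty string
print(star_up_to("01", 2))
```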
Concatenation of Strings
Let x and y be strings. Then xy or x.y denotes the concatenation of x and y,
that is, the string formed by making a copy of x and following it by a copy of y. For example,
if x is the string composed of i symbols x = a1a2…ai and y is the string composed of j symbols
y = b1b2…bj, then xy is the string of length i+j: xy = a1a2…aib1b2…bj.
The empty string ε is the identity under concatenation; more formally, εx = xε = x for any string x.
We can also take powers of strings, i.e., x^1 = x, x^2 = xx, x^3 = xxx, and so on. In
general, x^i is the string x repeated i times. We define x^0 to be ε for any string x. Thus, ε plays
the role of 1.
A prefix of a string s is any string obtained by removing zero or more symbols
from the end of s. For example, ban, banana, and ε are prefixes of banana. A suffix of a
string s is any string obtained by removing zero or more symbols from the beginning of s. For
example, nana, banana, and ε are suffixes of banana. A substring of s is obtained by deleting
any prefix and any suffix from s.
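These string operations map directly onto Python string slicing, which gives a quick way to check the examples. The helper names below are hypothetical; the empty Python string stands for the empty string.

```python
def prefixes(s):
    """Remove zero or more symbols from the end of s."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """Remove zero or more symbols from the beginning of s."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """Delete any prefix and any suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

x, y = "ban", "ana"
xy = x + y                 # concatenation
x_cubed = x * 3            # string power: x repeated three times
x_zero = x * 0             # power zero gives the empty string
```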
Languages
A language is any countable set of strings over some fixed alphabet. Abstract
languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages
under this definition. So are the set of all syntactically well-formed C programs and the set of
all grammatically correct English sentences.
The concatenation can also be applied to languages. If L and M are two
languages, then L.M or just LM, is the language consisting of all strings xy which can be
formed by selecting a string x from L, a string y from M and concatenating them in that order.
That is, LM = {xy | x is in L and y is in M }
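For finite languages, this definition can be written almost literally as a set comprehension. The sketch below represents a language as a Python set of strings; the function name `concat` is hypothetical.

```python
def concat(L, M):
    """LM = { xy | x is in L and y is in M }, for finite languages L and M."""
    return {x + y for x in L for y in M}

L = {"a", "ab"}
M = {"b", ""}              # "" is the empty string
LM = concat(L, M)          # contains "ab" (twice derived), "a" and "abb"
```

Note that the language containing only the empty string acts as an identity for concatenation, while concatenating with the empty language yields the empty language.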
Definition of Regular Expression
With this notation we define identifiers as strings that begin with a letter and
may include any number of letters or digits. The vertical bar means "or". Parentheses are used
to group subexpressions. The '*' means zero or more instances of the parenthesized expression, and
the juxtaposition of letter with the remainder of the expression means concatenation.
Each regular expression r denotes a language L(r). The defining rules specify
how L(r) is formed by combining, in various ways, the languages denoted by the subexpressions of r.
1. ε is a regular expression that denotes {ε}, i.e., the set containing only the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing
the string a.
3. If r and s are regular expressions denoting the languages L(r) and L(s), then
a. (r) | (s) is a regular expression denoting L(r) ∪ L(s).
b. (r)(s) is a regular expression denoting L(r)L(s).
c. (r)* is a regular expression denoting (L(r))*.
e.g., let Σ = {a, b}.
4. The regular expression (a|b)* denotes the set of all strings containing zero or more
instances of a or b, i.e., the set of all strings of a's and b's.
5. a|a*b denotes the language {a, b, ab, aab, aaab, …}, i.e., the string a and all strings
consisting of zero or more a's and ending in b.
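These two examples can be checked with Python's `re` module, whose syntax for `|`, `*`, and parentheses agrees with the notation used here for these simple cases. The helper name `in_language` is hypothetical; `re.fullmatch` is used so that the whole string, not just a part of it, must match.

```python
import re

def in_language(pattern, s):
    """True if the whole string s belongs to the language denoted by pattern."""
    return re.fullmatch(pattern, s) is not None

# Example 4: every string of a's and b's, including the empty string.
print(in_language(r"(a|b)*", "abba"))

# Example 5: the string a, or zero or more a's followed by one b.
print(in_language(r"a|a*b", "aaab"))
print(in_language(r"a|a*b", "aa"))
```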
We can give names to certain regular expressions and use those names in
subsequent expressions, as if the names were themselves symbols.
Example: C identifiers are strings of letters, digits, and underscores. Here is a regular definition
for the language of C identifiers.
letter → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
Example: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234,
6.336E4, or 1.89E-4. The regular definition is:
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
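Regular definitions translate naturally into named pattern fragments. The sketch below is one possible rendering with Python's `re` module, assuming that the letter class for C identifiers includes the underscore; an alternative with the empty string is written as an optional group `(?:...)?`. The constant names mirror the definitions above and are otherwise arbitrary.

```python
import re

LETTER = r"[A-Za-z_]"                       # letters and underscore
DIGIT = r"[0-9]"
ID = rf"{LETTER}(?:{LETTER}|{DIGIT})*"

DIGITS = rf"{DIGIT}{DIGIT}*"
OPTIONAL_FRACTION = rf"(?:\.{DIGITS})?"     # ". digits" or nothing
OPTIONAL_EXPONENT = rf"(?:E[+-]?{DIGITS})?" # "E", optional sign, digits, or nothing
NUMBER = rf"{DIGITS}{OPTIONAL_FRACTION}{OPTIONAL_EXPONENT}"

def is_identifier(s):
    return re.fullmatch(ID, s) is not None

def is_number(s):
    return re.fullmatch(NUMBER, s) is not None
```

Each later definition refers only to names defined before it, which is what makes the substitution into a single pattern possible.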
3.10) is distinguished as the start state, and one or more states may be distinguished as accepting
states (or final states). In Figure 3.10, state 3 is accepting, as indicated by the double circle.
Transition Table
State    Input symbol
         a        b
0        {0,1}    {0}
1        ∅        {2}
2        ∅        {3}
The transition table has the advantage that we can easily find the transitions for a
given state and input. Its disadvantage is that it takes a lot of space when the input alphabet is
large and most transitions are to the empty set.
An NFA accepts input string x if and only if there is some path in the
transition graph from the start state to one of the accepting states, such that the symbols along the
path spell out x. Note that ε labels along the path are effectively ignored, since the empty string
does not contribute to the string constructed along the path.
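The acceptance condition can be tested without enumerating paths by tracking the set of states the NFA could be in after each input symbol. The sketch below encodes the transition table given earlier in this section as a dictionary, takes state 3 as the only accepting state, and assumes the NFA has no ε-transitions; the names `NFA`, `ACCEPTING`, and `nfa_accepts` are hypothetical.

```python
# (state, symbol) -> set of next states; a missing entry means the empty set.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ACCEPTING = {3}

def nfa_accepts(s, start=0):
    """True if some path labeled s leads from start to an accepting state."""
    current = {start}
    for c in s:
        current = {t for q in current for t in NFA.get((q, c), ())}
    return bool(current & ACCEPTING)

print(nfa_accepts("aabb"))
```

Storing only the non-empty entries in a dictionary also sidesteps the space cost of a mostly empty table.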
Example: The string aabb is accepted by the NFA of Figure 3.10. The path labeled by aabb from
state 0 to state 3 is:
0 -a-> 0 -a-> 1 -b-> 2 -b-> 3
The language defined (or accepted) by an NFA is the set of input strings it accepts.
Figure 3.12 is an NFA accepting L(aa* | bb*). String aaa is accepted because of the path
s = s0;
c = nextChar();
while (c != eof)
{
    s = move(s, c);
    c = nextChar();
}
if (s is in F) return "yes";
else return "no";
Example: Figure 3.13 shows the transition graph of a DFA accepting the language (a|b)*abb, the
same language as that accepted by the NFA of Figure 3.10. Given the input string ababb, this DFA
enters the sequence of states 0,1,2,1,2,3 and returns "yes".
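The simulation loop can be made runnable as follows. Since the figure is not reproduced in this text, the transition table below is an assumption: it is the usual four-state DFA for (a|b)*abb, chosen because it is consistent with the state sequence 0,1,2,1,2,3 quoted above. The names `DFA`, `FINAL`, and `simulate` are hypothetical.

```python
# Assumed DFA for (a|b)*abb: (state, symbol) -> next state.
DFA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
FINAL = {3}

def simulate(text, start=0):
    """Return (answer, visited_states) for the input string."""
    s = start
    visited = [s]
    for c in text:                # plays the role of nextChar() until eof
        s = DFA[(s, c)]           # s = move(s, c)
        visited.append(s)
    return ("yes" if s in FINAL else "no"), visited

answer, visited = simulate("ababb")
```

Unlike the NFA case, there is exactly one next state for each state and symbol, so the loop tracks a single state instead of a set.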
THOMPSON'S CONSTRUCTION
To construct an NFA from a regular expression, we decompose the regular expression into
subexpressions and then construct an NFA for each of the basic symbols in r.
Input: A regular expression r over alphabet Σ
Output: An NFA N accepting the language L(r)
Method: First, we decompose r into subexpressions. Then, using the rules, we construct NFAs for
each of the basic symbols in r. Then, guided by the syntactic structure of the regular expression, we
combine these NFAs until we obtain an NFA for the entire expression.
Each intermediate NFA produced during the course of construction corresponds to a
subexpression of r and has several important properties
1. For ε in R
2. For a in R we construct
3. For a | b
4. For ab
5. For a*
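The five cases can be sketched in code as follows. This is a simplified sketch, not the figures from the text: an NFA is represented as a triple (start, final, transitions), the function names are hypothetical, `None` is used as the label for ε-transitions, and concatenation links the two NFAs with an ε-transition rather than merging states, which accepts the same language. A small acceptance test is included so the result can be checked.

```python
import itertools

EPS = None                      # label used here for ε-transitions
_ids = itertools.count()

def symbol(a):
    """Cases 1 and 2: one transition labeled ε (a=EPS) or a single symbol."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, a, f)]

def union(n1, n2):
    """Case 3: new start and final states joined to both operands by ε."""
    (s1, f1, t1), (s2, f2, t2) = n1, n2
    s, f = next(_ids), next(_ids)
    return s, f, t1 + t2 + [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]

def concat(n1, n2):
    """Case 4: the final state of the first NFA leads to the start of the second."""
    (s1, f1, t1), (s2, f2, t2) = n1, n2
    return s1, f2, t1 + t2 + [(f1, EPS, s2)]

def star(n):
    """Case 5: allow skipping the operand or repeating it any number of times."""
    s1, f1, t1 = n
    s, f = next(_ids), next(_ids)
    return s, f, t1 + [(s, EPS, s1), (f1, EPS, s1), (s, EPS, f), (f1, EPS, f)]

def accepts(nfa, text):
    start, final, trans = nfa

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for p, a, r in trans:
                if p == q and a is EPS and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = closure({start})
    for c in text:
        current = closure({r for p, a, r in trans if p in current and a == c})
    return final in current

# (a|b)*abb, built bottom-up from its subexpressions.
a_or_b_star = star(union(symbol("a"), symbol("b")))
nfa = concat(concat(concat(a_or_b_star, symbol("a")), symbol("b")), symbol("b"))
```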
The algorithm constructs a transition table Dtran for the DFA D. Each DFA state
in the table corresponds to a set of NFA states. Another notation used in the
algorithm is ε-closure(s), which means the set of NFA states reachable from NFA state s on
ε-transitions alone. ε-closure(T) means the set of NFA states reachable from some NFA state s in
set T on ε-transitions alone, and move(T, a) gives the set of NFA states to which there is a
transition on input symbol a from some state s in T.
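The three operations and the construction of Dtran can be sketched as follows. The NFA used for the demonstration is an assumption: it is the usual 11-state NFA for (a|b)*abb with states 0 to 10, encoded by hand because the figure is not reproduced in this text. The function names and the use of the empty Python string as the ε label are also choices made for this sketch.

```python
from collections import deque

EPS = ""   # label used here for ε-transitions

def eps_closure(nfa, states):
    """NFA states reachable from any state in `states` on ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(nfa, states, a):
    """NFA states reachable on symbol a from some state in `states`."""
    return {t for s in states for t in nfa.get((s, a), ())}

def subset_construction(nfa, start, alphabet):
    d_start = eps_closure(nfa, {start})
    dtran, seen, work = {}, {d_start}, deque([d_start])
    while work:
        T = work.popleft()
        for a in alphabet:
            U = eps_closure(nfa, move(nfa, T, a))
            if not U:
                continue                 # no transition on a from T
            dtran[(T, a)] = U
            if U not in seen:            # a new DFA state was discovered
                seen.add(U)
                work.append(U)
    return d_start, dtran, seen

# Assumed NFA for (a|b)*abb.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
d_start, dtran, d_states = subset_construction(NFA, 0, "ab")
```

Each frozenset in `d_states` is one DFA state; a DFA state is accepting if it contains an accepting NFA state.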
Example:
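Here, the start state of this NFA is state 0. We have to find ε-closure(start state) in order to find
the start state of the DFA, D.
1). ε-closure(0) = A
2). move(A, a) = {3, 8}
ε-closure({3, 8}) = B
3). move(A, b) = {5}
ε-closure({5}) = C
4). move(B, a) = {3, 8}
ε-closure({3, 8}) = B
5). move(B, b) = {5, 9}
ε-closure({5, 9}) = D
6). move(C, a) = {3, 8}
ε-closure({3, 8}) = B
7). move(C, b) = {5}
ε-closure({5}) = C
8). move(D, a) = {3, 8}
ε-closure({3, 8}) = B
9). move(D, b) = {5, 10}
ε-closure({5, 10}) = E
10). move(E, a) = {3, 8}
ε-closure({3, 8}) = B
11). move(E, b) = {5}
ε-closure({5}) = C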
Transition Table: