The different tokens that our lexical analyzer identifies are as follows:
KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break, return, typedef, void
IDENTIFIERS: main, fopen, getch, etc.
NUMBERS: positive and negative integers, positive and negative floating-point numbers
OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&
BRACKETS: [ ], { }, ( )
STRINGS: sets of characters enclosed within quotes
COMMENT LINES: single-line and multi-line comments, which are ignored
For tokenizing into identifiers and keywords we use a symbol table, which initially contains only the predefined keywords. Tokens are read from an input file. If the token encountered is an identifier or a keyword, the lexical analyzer looks it up in the symbol table to check whether an entry for it already exists. If an entry exists, we proceed to the next token; if not, the token and its token value are written into the symbol table. All other tokens are written directly to an output file.
The output file contains all the tokens present in the input file along with their respective token values.
INTRODUCTION
Lexical analysis involves scanning the program to be compiled and recognizing the tokens that make up the source statements. Scanners, or lexical analyzers, are usually designed to recognize keywords, operators, and identifiers, as well as integers, floating-point numbers, character strings, and other similar items that are written as part of the source program. The exact set of tokens to be recognized depends, of course, on the programming language being compiled.
A sequence of input characters that comprises a single token is called a lexeme. A lexical analyzer can insulate a parser from the lexeme representation of tokens. The following are functions that lexical analyzers perform.
Removal of white space and comments
Many languages allow "white space" to appear between tokens. Comments can likewise be ignored by the parser and translator, so they may also be treated as white space. If white space is eliminated by the lexical analyzer, the parser never has to consider it.
Constants
An integer constant is a sequence of digits. Integer constants can be allowed by adding productions to the grammar for expressions, or by creating a token for such constants. The job of collecting digits into integers is generally given to the lexical analyzer, because numbers can be treated as single units during translation. The lexical analyzer passes both the token and its attribute to the parser.
Recognizing identifiers and keywords
Languages use identifiers as names of variables, arrays, and functions. A grammar for a language often treats an identifier as a token. Many languages use fixed character strings, such as begin, end, if, and so on, as punctuation marks or to identify certain constructs. These character strings, called keywords, generally satisfy the rules for forming identifiers.
System Analysis
SYSTEM DESIGN
Process
The lexical analyzer is the first phase of a compiler. Its main task is to read
the input characters and produce as output a sequence of tokens that the
parser uses for syntax analysis. This interaction is summarized schematically in the figure below.
Upon receiving a "get next token" command from the parser, the lexical analyzer reads the input characters until it can identify the next token. Sometimes, lexical analyzers are divided into a cascade of two phases, the first called "scanning" and the second "lexical analysis". The scanner is responsible for the simple tasks, while the lexical analyzer proper performs the more complex operations.
The lexical analyzer we have designed takes its input from an input file. It reads one character at a time from the input file, and continues reading until the end of the file is reached. It recognizes valid identifiers and keywords and specifies the token values of the keywords. It also identifies header files, #define statements, numbers, special characters, and various relational and logical operators, and it ignores white space and comments. It prints the output to a separate file, specifying the line number of each token.
BLOCK DIAGRAM
Compiler Design
Functions
Verify(): The lexical analyzer uses the verify operation to determine whether there is an entry for a lexeme in the symbol table.
Data structures
Structure: A structure in C is a collection of variables containing related data items of similar and/or dissimilar data types that are logically related. Each variable in the structure represents an item and is called a member or field of the structure. Complex data can be represented in a more meaningful way using structures, which are one of the most useful features available in C.
An array of structures is basically an array of records. Each record has the same format, consisting of similar or dissimilar data types that are logically related entities.
Algorithms
procedure main
begin
    if symb='#' then
    begin
        advance to next token in input file
        if symb='i' then
        begin
            advance to next token in input file
            while symb!='\n' do
            begin
                advance to next token in input file
            end {while}
            print symb is a preprocessor directive
        end {if symb='i'}
        if symb='d' then
        begin
            advance to next token in input file
            while symb!=' ' do
            begin
                advance to next token in input file
            end {while}
            advance to next token in input file
            print symb is a constant
            advance to next token in input file
            while symb!='\n' do
            begin
                advance to the next token in input file
            end {while}
        end {if symb='d'}
    end {if symb='#'}
    if symb is an alphabet or symb='_' then
    begin
        advance to the next token in input file
        while symb is a digit or alphabet or symb='_' do
        begin
            advance to the next token of input file
        end {while}
        call function verify to check whether symb is an identifier or keyword
    end {if}
    if symb='+' then
    begin
        advance to the next token in input file
        if symb='+' then
            print symb is ++ operator
        else
            ungetc symb from the input file
            print symb is + operator
    end {if}
    if symb='-' then
    begin
        advance to the next token in input file
        if symb='-' then
            print symb is -- operator
        else
            ungetc symb from the input file
            print symb is - operator
    end {if}
    if symb='|' then
    begin
        advance to the next token in input file
        if symb='|' then
            print symb is logical or operator
        else
            ungetc symb from the input file
            print symb is bitwise or operator
    end {if}
    if symb='*' then
    begin
        print symb is a multiplication operator
    end {if}
    if symb='?' then
    begin
        print symb is a conditional operator
    end {if}
    if symb='!' or symb='>' or symb='<' then
    begin
        advance to the next token in input file
Contd...

        if symb='=' then
            print symb is !=, >= or <= operator
        else
            ungetc symb from the input file
            print symb is !, > or < operator
    end {if}
    if symb='=' then
    begin
        advance to the next token in input file
        if symb='=' then
            print symb is == operator
        else
            ungetc symb from the input file
            print symb is assignment operator
    end {if}
    if symb='&' then
    begin
        advance to the next token in input file
        if symb='&' then
            print symb is logical and operator
        else
            ungetc symb from the input file
            print symb is bitwise and operator
    end {if}
    if symb='/' then
    begin
        advance to the next token in input file
        if symb='*' then
        begin
            while symb!='/' do
            begin
                advance to next token in input file
            end {while}
            print multi line comment is ignored
        end {if}
        if symb='/' then
        begin
            while symb!='\n' do
            begin
                advance to next token in input file
            end {while}
            print single line comment is ignored
        end {if}
        else
            ungetc symb from the input file
            print symb is division operator
    end {if}
    if symb='"' then
    begin
        advance to next token in input file
        while symb!='"' do
        begin
            advance to next token in input file
        end {while}
        print symb is a string
    end {if}
    if symb='{' then print symb is open brace
    if symb='}' then print symb is close brace
    if symb='[' then print symb is open bracket
    if symb=']' then print symb is close bracket
    if symb='(' then print symb is open parenthesis
    if symb=')' then print symb is close parenthesis
end {procedure main}

procedure verify
begin
    look up symb in the symbol table
    if an entry exists then
        print the token value of the keyword or identifier
    else
        enter symb and a new token value into the symbol table
        print symb is an identifier with its token value
end {procedure}
USER MANUAL
The code for the modules appears in two files: lex.c and output.c. The file lex.c contains the main source code of the lexical analyzer, and the input to the lexical analyzer is contained in test.c. Under the DOS operating system, the program is compiled with Alt+F9 and executed with Ctrl+F9. The output, i.e. the token types, is stored in the output file output.txt.
Sample Input
#include<stdio.h>
#include<stdlib.h>

void main()
{
int a,b=30;
scanf("%d%d",&a,&b);
/* scanf
statement*/
if(a<20)
a=a+1;
}
Sample Output:
LINE NO TOKENS
-----------------------------------------------
(: open parenthesis
): close parenthesis
5: {: open brace
, : comma
=: assignment operator
30 is a number
; : semi colon
(: open parenthesis
): close parenthesis
;: semi colon
(: open parenthesis
%d%d : is a string
,: comma
, : comma
): close parenthesis
;: semi colon
9:
10:
(: open parenthesis
20 is a number
): close parenthesis
=: assignment operator
a: token value : 18
+: plus operator
1 is a number
;: semi colon
CONCLUSION
Generally, when syntactic analysis is carried out by the parser, it may call upon the scanner to tokenize the input. But the LEXICAL ANALYZER designed by us is an independent program. It takes as input a file containing C code. Therefore, the parser cannot make use of the designed scanner as and when required.
Consider as an example an array ch[20]. The designed lexical analyzer will tokenize 'ch' as an identifier, '[' as an opening bracket, '20' as a number, and ']' as a closing bracket. But the parser might require ch[20] to be identified as an array. Similarly, there may arise a number of cases where the parser has to identify a token in a manner different from the one specified and designed. Hence, we conclude that the LEXICAL ANALYZER so designed is an independent program that is not flexible.