Sei sulla pagina 1di 13

Lexical analyzer converts stream of input characters into a stream of tokens.

The different tokens that our lexical analyzer identifies are as follows:
KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf,
scanf, case, break, return, typedef, void
IDENTIFIERS: main, fopen, getch etc
NUMBERS: positive and negative integers, positive and negative floating
point numbers.
OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&.
BRACKETS: [ ], { }, ( ).
STRINGS : Set of characters enclosed within the quotes
COMMENT LINES: Ignores single line, multi line comments
For tokenizing into identifiers and keywords we incorporate a symbol table
which initially consists of predefined keywords. The tokens are read from an
input file. If the encountered token is an identifier or a keyword the lexical
analyzer will look up in the symbol table to check the existence of the
respective token. If an entry does exist then we proceed to the next token.
If not then that particular token along with the token value is written into
the symbol table. The rest of the tokens are directly displayed by writing into
an output file.
The output file will consist of all the tokens present in our input file along
with their respective token values.
ADVERTISEMENT

INTRODUCTION
Lexical analysis involves scanning the program to be compiled and
recognizing the tokens that make up the source statements Scanners or
lexical analyzers are usually designed to recognize keywords , operators ,
and identifiers , as well as integers, floating point numbers , character
strings , and other similar items that are written as part of the source
program . The exact set of tokens to be recognized of course, depends upon
the programming language being used to describe it.
A sequence of input characters that comprises a single token is called a
lexeme. A lexical analyzer can insulate a parser from the lexeme
representation of tokens. Following are the list of functions that lexical
analyzers perform.
Removal of white space and comment
Many languages allow “white space” to appear between tokens. Comments
can likewise be ignored by the parser and translator , so they may also be
treated as white space. If white space is eliminated by the lexical analyzer,
the parser will never have to consider it.
Constants
An integer constant is a sequence of digits, integer constants can be allowed
by adding productions to the grammar for expressions, or by creating a
token for such constants . The job of collecting digits into integers is
generally given to a lexical analyzer because numbers can be treated as
single units during translation. The lexical analyzer passes both the token
and attribute to the parser.
Recognizing identifiers and keywords
Languages use identifiers as names of variables, arrays and functions. A
grammar for a language often treats an identifier as a token. Many
languages use fixed character strings such as begin, end , if, and so on , as
punctuation marks or to identify certain constructs. These character strings,
called keywords, generally satisfy the rules for forming identifiers.
System Analysis

The operating system used is MS-DOS


MS-DOS : The MS-DOS is a single user, single process, single processor
operating system. Due to the confinement of the device dependent code into
one layer, the porting of MS-DOS has theoretically been reduced to writing
of the BIOS code for the new hardware. At the command level it provides a
hierarchical file system, input output redirection, pipes and filters. User
written commands can be invoked in the same way as the standard system
commands, giving the appearance as if the basic system functionality has
been extended.
Being a single user system, it provides only file protection and access
control. The disk space is allocated in terms of a cluster of consecutive
sectors. The command language of MS-DOS has been designed to allow the
user to interact directly with the operating system using a CRT terminal. The
principal use of the command language is to initiate the execution of
commands and programs. A user program can be executed under MS-DOS
by simply typing the name of the executable file of the user program at the
DOS command prompt.
The programming language used here is C programming
ADVERTISEMENT

SYSTEM DESIGN
Process
The lexical analyzer is the first phase of a compiler. Its main task is to read
the input characters and produce as output a sequence of tokens that the
parser uses for syntax analysis. This interaction, summarized schematically
in fig below.
Upon receiving a “get next token “command from the parser, the lexical
analyzer reads the input characters until it can identify next token.
Sometimes , lexical analyzers are divided into a cascade of two phases, the
first called “scanning”, and the second “lexical analysis”.
The scanner is responsible for doing simple tasks, while the lexical analyzer
proper does the more complex operations.
The lexical analyzer which we have designed takes the input from a input
file. It reads one character at a time from the input file, and continues to
read until end of the file is reached. It recognizes the valid identifiers,
keywords and specifies the token values of the keywords.
It also identifies the header files, #define statements, numbers, special
characters, various relational and logical operators, ignores the white spaces
and comments. It prints the output in a separate file specifying the line
number .
BLOCK DIAGRAM

Compiler Design
Looking for GATE Preparation Material? Join & Get here now!

<<Previous Next>>

Overview Of Module And Algorithm

Functions
Verify(): The lexical analyzer uses the verify operation to determine whether
there is entry for a lexeme in the symbol table.
Data structures
Structure : A structure in C is a collection of variables which contains related
data items of similar and /or dissimilar data types but logically related items.
Each variable in the structure represents an item and is called as a member
or field of the structure. Complex data can be important represented in more
meaningful way using structures and is one of the very features available in
C.
An array of structures is basically a array of records. Each an every record
has the same format consisting of similar or dissimilar data types but logically
related entities.
ADVERTISEMENT

Algorithms
Procedure main begin if symb=’#’ then begin advance to next token in input file if
symb=’i’ then begin advance to next token in input file while symb!=’\n’ do begin
advance to next token in input file end {while } print symb is a preprocessor
directive end {if symb=’i’} if symb=’d’ then begin advance to next token input file
while symb!=’ ‘ do begin advance to next token in input file end{while} advance to
next token in input file print symb is a constant advance to next token in input file
while symb!=’\n’ do begin advance to the next token in input file end {while} end {if
symb=’d’} end {if symb=’#’} if symb is a alphabet or symb=’_’ then begin advance
to the next token in input file while symb is a digit or alphabet or symb=’_’ do begin
advance to the next token of input file end {while} call function verify to check
whether symb is a identifier or keyword end {if} if symb=’+’ then begin advance to
the next token in input file if symb=’+’ print symb is ++ operator else ungetc symb
from the input file print symb is + operator end {if} if symb=’-’ then begin advance
to the next token in input file if symb=’-’ print symb is -- operator else ungetc symb
from the input file print symb is - operator end {if} if symb=’|’ then begin advance
to the next token in input file if symb=’|’ print symb is logical or operator else
ungetc symb from the input file print symb is bitwise or operator end {if} if
symb=’*’ then begin print symb is a multiplication operator end {if} if symb=’?’ then
begin print symb is a conditional operator end{if} if symb=’!’or symb=’>’or
symb=’<’then begin advance to the next token in input file

Compiler Design
Looking for GATE Preparation Material? Join & Get here now!
<<Previous Next>>

Overview Of Module And Algorithm Continue...

Contd...

if symb=’=’

print symb is a relational operator

else

ungetc symb from output file

print symb is a operator

end{if}

if symb=’=’

begin

advance to next token in input file

if symb=’=’then

print symb is equal to operator

else

ungetc symb from output file

print symb is assignment operator

end{if}

if symb=’&’ then

begin

advance to next token in input file

if symb=’&’ then

print symb is a logical and operator


else

print & symb is an address operator

end{if}

if symb=’/’ then

begin

advance to next token in input file

if symb=’*’ then

begin

advance to next token in input file

while symb!=’/’ do

advance to next token in input file

end{while}

end{if}

else if symb=’/’ then

begin

advance to next token in input file

while symb!=’\n’ do

advance to next token in input file

end{while}

end{if}

else

ungetc symb from output file

print symb is a division operator

end{if}

if symb is a digit then

begin
advance to next token in input file

while symb is a digit or symb=’.’ then

begin

advance to next token in input file

end {while}

print symb is a number

end{if}

if symb =’”’ then

begin

advance to next token in input file

while symb!=’”’ do

begin

advance to next token in input file

end{while}

print symb is a string

end{if}}

if symb= ‘{‘ then

print open brace

if symb=’}’ then

print close brace

if symb=’[‘ then

print open bracket

if symb=’]’ then

print close bracket

if symb=’(‘ then

print open parenthesis


if symb=’)’ then

print close parenthesis

end {procedure main}

procedure verify

begin

scan the symbol table to check if encountered token exists

if exists

return token value

end{procedure}
ADVERTISEMENT

USER MANUAL
The code for modules appears in two files: lex.c and output.c. The file lex.c
contains the main source code of the lexical analyzer. And the input to the
lexical analyzer is contained in test.c. Under the DOS operating system, the
program is compiled by using alt F9, and is executed by using ctrl F9. The
output i.e token types are stored in the output file, output.txt
Sample Input

#include<stdio.h>

#include<stdlib.h>

#define abc 100

void main()

int a_,b=30;

printf("enter 2 no.s\n"); // printf statement

scanf("%d%d",&a,&b);

/* scanf

statement*/

if(a<20)
a=a+1;

Sample Output:

LINE NO TOKENS
-----------------------------------------------

1: #include<stdio.h> is a header file

2: #include<stdlib.h> is a header file

3: #define statement: abc is a constant

4: void: token value : 7

main :identifier, token value : 18

(: open parenthesis

): close parenthesis

5: {: open brace

6: int: token value : 1

a_ :identifier, token value : 18

, : comma

b :identifier, token value : 18

=: assignment operator

30 is a number

; : semi colon

7: printf: token value : 5

(: open parenthesis

enter 2 no.s\n : is a string

): close parenthesis

;: semi colon

8: scanf: token value : 6

(: open parenthesis
%d%d : is a string

,: comma

&a: address operator

, : comma

&b: address operator

): close parenthesis

;: semi colon

9:

10:

11: if: token value : 8

(: open parenthesis

a :identifier, token value : 18

<: less than operator

20 is a number

): close parenthesis

12: a: token value : 18

=: assignment operator

a: token value : 18

+: plus operator

1 is a number

;: semi colon

13: }: close parenthesis

CONCLUSION
Generally, when syntactic analysis is being carried out by the parser it may
call upon the scanner for tokenizing the input. But the LEXICAL ANALYZER
designed by us is an independent program. It takes as input a file with an
executable code in C. There fore, the parser cannot make use of the designed
scanner as and when required.
Consider as an example an array ch[20].The designed lexical analyzer will
tokenize 'ch' as an identifier,'[' as an opening brace,'20' as a number, and ']'
as a closing brace. But the parser might require a[5] to be identified as an
array. Similarly, there may arise a number of cases where the parser has to
identify a token by a different mannerism than the one specified and
designed. Hence, we conclude that the LEXICAL ANALYZER so designed is an
independent program which is not flexible.

Potrebbero piacerti anche