The different tokens that our lexical analyzer identifies are as follows:
KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break, return, typedef, void
IDENTIFIERS: main, fopen, getch, etc.
NUMBERS: positive and negative integers, positive and negative floating-point numbers
OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&
BRACKETS: [ ], { }, ( )
STRINGS: sets of characters enclosed within quotes
COMMENT LINES: single-line and multi-line comments, which are ignored
For tokenizing into identifiers and keywords we use a symbol table, which initially contains only the predefined keywords. Tokens are read from an input file. If the token encountered is an identifier or a keyword, the lexical analyzer looks it up in the symbol table to check whether an entry for it already exists. If an entry exists, we proceed to the next token; if not, the token and its token value are written into the symbol table. All other tokens are written directly to an output file.
The output file contains all the tokens present in the input file along with their respective token values.
INTRODUCTION
Lexical analysis involves scanning the program to be compiled and recognizing the tokens that make up the source statements. Scanners, or lexical analyzers, are usually designed to recognize keywords, operators, and identifiers, as well as integers, floating-point numbers, character strings, and other similar items that are written as part of the source program. The exact set of tokens to be recognized depends, of course, on the programming language being compiled.
A sequence of input characters that comprises a single token is called a lexeme. A lexical analyzer can insulate a parser from the lexeme representation of tokens. The following are functions that lexical analyzers perform.
Removal of white space and comments
Many languages allow "white space" to appear between tokens. Comments can likewise be ignored by the parser and translator, so they may also be treated as white space. If white space is eliminated by the lexical analyzer, the parser never has to consider it.
Constants
An integer constant is a sequence of digits. Integer constants can be allowed by adding productions to the grammar for expressions, or by creating a token for such constants. The job of collecting digits into integers is generally given to the lexical analyzer, because numbers can be treated as single units during translation. The lexical analyzer passes both the token and its attribute to the parser.
Recognizing identifiers and keywords
Languages use identifiers as names of variables, arrays, and functions. A grammar for a language often treats an identifier as a token. Many languages use fixed character strings, such as begin, end, if, and so on, as punctuation marks or to identify certain constructs. These character strings, called keywords, generally satisfy the rules for forming identifiers.
System Analysis
SYSTEM DESIGN
Process
The lexical analyzer is the first phase of a compiler. Its main task is to read
the input characters and produce as output a sequence of tokens that the
parser uses for syntax analysis. This interaction is summarized schematically in the figure below.
Upon receiving a "get next token" command from the parser, the lexical analyzer reads the input characters until it can identify the next token. Sometimes, lexical analyzers are divided into a cascade of two phases, the first called "scanning" and the second "lexical analysis". The scanner is responsible for the simple tasks, while the lexical analyzer proper performs the more complex operations.
The lexical analyzer we have designed takes its input from an input file. It reads one character at a time from the input file, and continues reading until the end of the file is reached. It recognizes valid identifiers and keywords and specifies the token values of the keywords. It also identifies header files, #define statements, numbers, special characters, and various relational and logical operators, and it ignores white space and comments. It prints the output to a separate file, specifying the line number of each token.
BLOCK DIAGRAM
Compiler Design
Functions
Verify(): The lexical analyzer uses the verify operation to determine whether there is an entry for a lexeme in the symbol table.
Data structures
Structure: A structure in C is a collection of variables containing related data items of similar and/or dissimilar data types that are logically related. Each variable in the structure represents an item and is called a member or field of the structure. Complex data can be represented in a more meaningful way using structures, which are one of the most useful features available in C.
An array of structures is basically an array of records. Each record has the same format, consisting of similar or dissimilar data types that are logically related entities.
Algorithms
procedure main
begin
    if symb='#' then
    begin
        advance to next token in input file
        if symb='i' then
        begin
            advance to next token in input file
            while symb!='\n' do
            begin
                advance to next token in input file
            end {while}
            print symb is a preprocessor directive
        end {if symb='i'}
        if symb='d' then
        begin
            advance to next token in input file
            while symb!=' ' do
            begin
                advance to next token in input file
            end {while}
            advance to next token in input file
            print symb is a constant
            advance to next token in input file
            while symb!='\n' do
            begin
                advance to the next token in input file
            end {while}
        end {if symb='d'}
    end {if symb='#'}
    if symb is an alphabet or symb='_' then
    begin
        advance to the next token in input file
        while symb is a digit or alphabet or symb='_' do
        begin
            advance to the next token of input file
        end {while}
        call function verify to check whether symb is an identifier or keyword
    end {if}
    if symb='+' then
    begin
        advance to the next token in input file
        if symb='+' then
            print symb is ++ operator
        else
            ungetc symb from the input file
            print symb is + operator
    end {if}
    if symb='-' then
    begin
        advance to the next token in input file
        if symb='-' then
            print symb is -- operator
        else
            ungetc symb from the input file
            print symb is - operator
    end {if}
    if symb='|' then
    begin
        advance to the next token in input file
        if symb='|' then
            print symb is logical or operator
        else
            ungetc symb from the input file
            print symb is bitwise or operator
    end {if}
    if symb='*' then
    begin
        print symb is a multiplication operator
    end {if}
    if symb='?' then
    begin
        print symb is a conditional operator
    end {if}
    if symb='!' or symb='>' or symb='<' then
    begin
        advance to the next token in input file
Contd...

        if symb='=' then
            print symb is !=, >= or <= operator
        else
            ungetc symb from the input file
            print symb is !, > or < operator
    end {if}
    if symb='=' then
    begin
        advance to the next token in input file
        if symb='=' then
            print symb is == operator
        else
            ungetc symb from the input file
            print symb is assignment operator
    end {if}
    if symb='&' then
    begin
        advance to the next token in input file
        if symb='&' then
            print symb is logical and operator
        else
            ungetc symb from the input file
            print symb is bitwise and operator
    end {if}
    if symb='/' then
    begin
        advance to the next token in input file
        if symb='*' then
        begin
            while symb!='/' do
            begin
                advance to next token in input file
            end {while}
            print multi line comment is ignored
        end {if}
        if symb='/' then
        begin
            while symb!='\n' do
            begin
                advance to next token in input file
            end {while}
            print single line comment is ignored
        end {if}
        else
            ungetc symb from the input file
            print symb is division operator
    end {if}
    if symb='"' then
    begin
        advance to next token in input file
        while symb!='"' do
        begin
            advance to next token in input file
        end {while}
        print symb is a string
    end {if}
    if symb='{' then print symb is open brace
    if symb='}' then print symb is close brace
    if symb='[' then print symb is open bracket
    if symb=']' then print symb is close bracket
    if symb='(' then print symb is open parenthesis
    if symb=')' then print symb is close parenthesis
end {procedure main}

procedure verify
begin
    look up symb in the symbol table
    if an entry exists then
        print the token value of the keyword or identifier
    else
        enter symb and a new token value into the symbol table
        print symb is an identifier with its token value
end {procedure}
USER MANUAL
The code for the modules appears in two files: lex.c and output.c. The file lex.c contains the main source code of the lexical analyzer, and the input to the lexical analyzer is contained in test.c. Under the DOS operating system, the program is compiled with Alt+F9 and executed with Ctrl+F9. The output, i.e. the token types, is stored in the output file output.txt.
Sample Input
#include<stdio.h>
#include<stdlib.h>

void main()
{
int a,b=30;
scanf("%d%d",&a,&b);
/* scanf
statement*/
if(a<20)
a=a+1;
}
Sample Output:
LINE NO TOKENS
-----------------------------------------------
(: open parenthesis
): close parenthesis
5: {: open brace
, : comma
=: assignment operator
30 is a number
; : semi colon
(: open parenthesis
): close parenthesis
;: semi colon
(: open parenthesis
%d%d : is a string
,: comma
, : comma
): close parenthesis
;: semi colon
9:
10:
(: open parenthesis
20 is a number
): close parenthesis
=: assignment operator
a: token value : 18
+: plus operator
1 is a number
;: semi colon
CONCLUSION
Generally, when syntactic analysis is carried out by the parser, it may call upon the scanner to tokenize the input. But the LEXICAL ANALYZER designed by us is an independent program. It takes as input a file containing C code. Therefore, the parser cannot make use of the designed scanner as and when required.
Consider as an example an array ch[20]. The designed lexical analyzer will tokenize 'ch' as an identifier, '[' as an opening bracket, '20' as a number, and ']' as a closing bracket. But the parser might require ch[20] to be identified as an array. Similarly, there may arise a number of cases where the parser has to identify a token in a manner different from the one specified and designed. Hence, we conclude that the LEXICAL ANALYZER so designed is an independent program that is not flexible.