LEXICAL ANALYSIS
First Step in Compilation
Source code (character stream)
  → Lexical analysis → Token stream
  → Parsing → Abstract syntax tree
  → Intermediate Code Generation → Intermediate code
  → Optimization/Code Generation → Assembly code
Lexical Analysis
Source code (character stream):   if (b == 0) a = "hi";
  → Lexical analysis
  → Token stream:   if ( b == 0 ) a = "hi" ;
  → Parsing → Semantic Analysis
A Closer Look
• Lexical analysis converts a character stream to
a token stream of pairs <token_type, value>
• Example: the source text
      if (x1 * x2 < 1.0) {
          y = x1;
      }
  arrives as the raw character stream  i f ( x 1 * x 2 < 1 . 0 ) { \n …
• Example token stream (for a snippet like  if (b == 0) a = 63; ):
      IFKEY LPAREN ID(b) RELOP(EQ) CONSTANT(0) RPAREN ID(a) ASSIGNOP CONSTANT(63) SEMI
• The lexer sends identifier information to the symbol table (ST) to be stored;
  syntax-directed translation later requests information from the ST to generate
  intermediate code.
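One way to picture the <token_type, value> pairs is as a small Java class. This is only an illustrative sketch; the type names mirror the example token stream above and nothing here comes from the course code:

enum TokenType { IFKEY, LPAREN, RPAREN, ID, RELOP, ASSIGNOP, CONSTANT, SEMI }

class Token {
    final TokenType type;
    final String value;          // the lexeme, or null for tokens like "(" that carry no value

    Token(TokenType type, String value) { this.type = type; this.value = value; }

    public String toString() { return value == null ? type.toString() : type + "(" + value + ")"; }
}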
Lexical Analysis: Terminology
• Token :
– Terminal symbol in a grammar
– Classes of sequences of characters with a
collective meaning, e.g., IDENTIFIER
– Constants, Operators, Punctuation, Reserved
words (keywords)
• Lexeme:
– the character sequence matched by an instance
of the token, e.g., "sqrt" is one lexeme for an
instance of IDENTIFIER
Token Types
• Identifiers: x y11 elsex_i00
• Keywords: if else while
• Integers: 2 1000 -500 +6663554
• Floating point: 2.0 0.00020 .02 1. 1e5
0.e-10
• Punctuation: [ ] .. ( ) . , : ;
• Operators: + * - < >= /
• Comments: { don’t change this }
Token Values
Type Value
IDENTIFIER xyz
INTCONSTANT 1
RELOP >
ADDOP -
Lexical errors
• What if the user omits the space in "real f"?
– No lexical error: the single token IDENT("realf") is
produced instead of the sequence REAL, IDENT("f")!
(REAL is a keyword)
• Speed
– Lexical analysis can become a bottleneck
– Minimize processing per character
• Skip blanks fast
• I/O is also an issue (read large blocks)
– We compile frequently
• Compilation time is important
– Especially during development
How To Describe Tokens
• Lookahead may be required to decide where
one token ends and the next one begins
– Modern languages designed to avoid this (for the
most part)
• Lexical Analysis in C++
– Unfortunately, the problems still exist
– C++ template syntax: FOO<Bar>
– C++ stream syntax: cin >> var
– Now the problem: in FOO<Bar<Bazz>> the closing >> looks like the stream
operator, so older C++ required a space: FOO<Bar<Bazz> >
How to Describe Tokens
• Modern Approach: Programming
language tokens described using regular
expressions
• A regular expression R describes some
set of strings L(R)
– L(R) is the "language" defined by R
• L(abc) = { abc }
• L(hello | goodbye) = {hello, goodbye}
• L([1-9][0-9]*) = all positive integer constants
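As a quick check of the L(R) idea, java.util.regex can test whether a string belongs to the language of an RE; this sketch only assumes the standard library:

import java.util.regex.Pattern;

class ReDemo {
    public static void main(String[] args) {
        // L([1-9][0-9]*) = all positive integer constants (no leading zeros)
        Pattern posInt = Pattern.compile("[1-9][0-9]*");
        System.out.println(posInt.matcher("1000").matches());      // true:  "1000" is in L(R)
        System.out.println(posInt.matcher("007").matches());       // false: leading zero

        // L(hello | goodbye) = { hello, goodbye }
        Pattern greeting = Pattern.compile("hello|goodbye");
        System.out.println(greeting.matcher("goodbye").matches()); // true
    }
}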
Define each kind of token using an RE (e.g., whitespace, identifier, number),
then combine the individual automata via ε-transitions into one machine.
RE to FA
1. ID ::= letter (letter | digit)*
2. INTCONSTANT ::= digit digit*
4. MULOP ::= * | /
5. ADDOP ::= + | -
6. ASSNOP ::= :=
7. COLON ::= :
8. LESSTHAN ::= <
9. NOTEQUAL ::= <>
A. LT_OR_EQUAL ::= <=
B. GT ::= >
C. GT_OR_EQUAL ::= >=
D. EQUAL ::= =
E. ENDMARKER ::= .
F. SEMICOLON ::= ;
G. LEFTPAREN ::= (
H. RIGHTPAREN ::= )
[Figure: a single FA combining these tokens, with transitions labeled letter,
digit, <, >, =, :, ., ;, (, ), +, -, *, /, and an accepting state per token.]
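For the multi-character operators above (LESSTHAN, NOTEQUAL, LT_OR_EQUAL), the FA needs one character of lookahead after '<'. A hedged hand-coded sketch (not the course's code) of that corner of the automaton:

import java.io.*;

class LessThanFamily {
    enum TokenType { LESSTHAN, NOTEQUAL, LT_OR_EQUAL }

    // Called after a '<' has already been consumed from the input.
    static TokenType scan(PushbackReader in) throws IOException {
        int next = in.read();
        if (next == '>') return TokenType.NOTEQUAL;       // "<>"
        if (next == '=') return TokenType.LT_OR_EQUAL;     // "<="
        if (next != -1) in.unread(next);                    // lookahead char belongs to the next token
        return TokenType.LESSTHAN;                          // plain "<"
    }
}

The same pattern handles GT vs. GT_OR_EQUAL after reading '>'.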
Augmenting the FA
Example: recognizing the keyword "if": state 0 →(i)→ state 1 →(f)→ state 2.
Action on reaching the accepting state 2: return (IFKEY, -)
Matching Tokens
Example: elsex = 0;  is tokenized as the identifier elsex, not as  else x = 0
• REs alone are not enough: we need a rule for choosing among matches
• Most languages: the longest matching token wins
– even if a shorter token is the only way the syntax would be correct
(the lexical analyzer knows nothing about syntax!)
– Exception: early FORTRAN (totally whitespace-insensitive)
• Ties in length are resolved by prioritizing tokens
• REs + priorities + the longest-matching-token rule = a lexer definition
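The rule "longest match wins, ties broken by priority" can be demonstrated directly with java.util.regex; the token names and REs below are illustrative, not the course's token set:

import java.util.regex.*;

class LongestMatch {
    // Rules in priority order: earlier rules win length ties,
    // so the keyword rule must come before the identifier rule.
    static final String[][] RULES = {
        { "IFKEY",       "if" },
        { "ID",          "[a-zA-Z][a-zA-Z0-9]*" },
        { "INTCONSTANT", "[0-9]+" },
        { "WHITESPACE",  "[ \\t\\n]+" },
    };

    public static void main(String[] args) {
        String input = "ifx if 42";
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null, bestLexeme = null;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) {
                    String lexeme = m.group();
                    // a strictly longer match wins; a tie keeps the earlier (higher-priority) rule
                    if (bestLexeme == null || lexeme.length() > bestLexeme.length()) {
                        bestType = rule[0];
                        bestLexeme = lexeme;
                    }
                }
            }
            if (bestLexeme == null) throw new RuntimeException("lexical error at position " + pos);
            if (!bestType.equals("WHITESPACE"))
                System.out.println(bestType + "(" + bestLexeme + ")");
            pos += bestLexeme.length();
        }
        // prints ID(ifx), then IFKEY(if), then INTCONSTANT(42)
    }
}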
Delimiters
…?
GetNextToken:
– Calls the appropriate FSA
– Changes identifiers to keywords where necessary
– Returns the next token in the input stream
FSAs (ReadIdentifier, ReadNumber, ReadSymbol):
– Assemble character sequences into valid tokens
– Return a simple token
GetNextChar:
– Returns the next significant character in the input stream
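A hedged Java sketch of this layering (the method names mirror the slide; everything else, including the string token format and the keyword set, is an assumption):

import java.io.*;

class HandScanner {
    private final PushbackReader in;
    HandScanner(Reader r) { in = new PushbackReader(r); }

    // GetNextToken: picks the appropriate FSA based on the first significant character.
    String getNextToken() throws IOException {
        int c = getNextChar();
        if (c == -1) return null;                              // end of input
        if (Character.isLetter(c)) return readIdentifier(c);   // may be mapped to a keyword
        if (Character.isDigit(c))  return readNumber(c);
        return readSymbol(c);
    }

    // One FSA: identifiers (letter (letter | digit)*), mapped to keywords where necessary.
    private String readIdentifier(int first) throws IOException {
        StringBuilder buf = new StringBuilder();
        int c = first;
        while (c != -1 && Character.isLetterOrDigit((char) c)) { buf.append((char) c); c = in.read(); }
        if (c != -1) in.unread(c);                             // the delimiter may start the next token
        String lexeme = buf.toString();
        return isKeyword(lexeme) ? lexeme.toUpperCase() + "KEY" : "ID(" + lexeme + ")";
    }

    // Another FSA: integer constants (digit digit*).
    private String readNumber(int first) throws IOException {
        StringBuilder buf = new StringBuilder();
        int c = first;
        while (c != -1 && Character.isDigit((char) c)) { buf.append((char) c); c = in.read(); }
        if (c != -1) in.unread(c);
        return "INTCONSTANT(" + buf + ")";
    }

    private String readSymbol(int c) { return "SYMBOL(" + (char) c + ")"; }

    private boolean isKeyword(String s) { return s.equals("if") || s.equals("else") || s.equals("while"); }

    // GetNextChar: returns the next significant character, skipping whitespace.
    private int getNextChar() throws IOException {
        int c = in.read();
        while (c == ' ' || c == '\t' || c == '\n' || c == '\r') c = in.read();
        return c;
    }
}

A real scanner would return Token objects rather than strings; strings just keep the sketch short.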
Finite Automata
Example: DFA for string literals  "[^"]*"
• An automaton (DFA) can be represented as
– a transition table:
        "       Non-"
  0     1       Error
  1     2       1
  2     Error   Error
– a graph:  state 0 →(")→ state 1;  state 1 →(Non-")→ state 1;  state 1 →(")→ state 2
Lexical Analysis with Lexer Generator
Lexical Specification
– manual conversion → hand-written lexer (table-driven method)
– auto conversion → lexer generated automatically

Hand written lexer (table-driven method):

boolean accept_state[NSTATES] = { … };
int trans_table[NSTATES][VALID_CHARS] = { … };
int state = 0;
…
return token;
}
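A hedged, concrete instance of the table-driven fragment above, using the string-literal DFA ("[^"]*") from the previous slide; the array names follow the fragment, everything else is an assumption:

class TableDrivenStringScanner {
    static final int ERROR = 3;                   // states 0, 1, 2 from the slide, plus an explicit error state
    static final boolean[] accept_state = { false, false, true, false };

    // trans_table[state][input class]: input class 0 = '"' , class 1 = any other character
    static final int[][] trans_table = {
        { 1,     ERROR },   // state 0
        { 2,     1     },   // state 1
        { ERROR, ERROR },   // state 2 (accepting)
        { ERROR, ERROR },   // error state
    };

    // Returns the longest string literal starting at position pos, or null if none matches.
    static String scan(String input, int pos) {
        int state = 0, lastAccept = -1;
        for (int i = pos; i < input.length() && state != ERROR; i++) {
            int cls = (input.charAt(i) == '"') ? 0 : 1;
            state = trans_table[state][cls];
            if (state != ERROR && accept_state[state]) lastAccept = i;
        }
        return lastAccept < 0 ? null : input.substring(pos, lastAccept + 1);
    }

    public static void main(String[] args) {
        System.out.println(scan("\"hi\";", 0));   // prints "hi" including the quotes
    }
}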
Hand written lexer: Identifiers
FA: state 0 →(letter)→ state 1, with a self-loop on (letter | digit) at state 1.
End of the corresponding code, after the loop that consumes letters and digits:

if (!(nextChar == CharStream.BLANK))
    stream.pushBack(nextChar);
token.setType(TokenType.IDENTIFIER);
token.setValue(buffer.toString());
return token;
}
How much should we match?
In general, find the longest match possible.
E.g., on input 123.45, match this as
num_const(123.45)
rather than
num_const(123), ".", num_const(45).
Using PushBack
• Many tokens require looking beyond the end of the
lexeme
• Consider three REs: { aa, ba, aabb } and the input: aaba
Combined FA:  0 →(a)→ 1 →(a)→ 2 →(b)→ 3 →(b)→ 4   (accepts aa at state 2, aabb at state 4)
              0 →(b)→ 5 →(a)→ 6                   (accepts ba at state 6)
On aaba the FA passes the accepting state for aa, then fails to complete aabb,
so it must push back "ba" and report aa; "ba" is matched on the next call.
Input:  e l s e x   (with next pointing at the character after the lexeme)

char next;
…
while (identifierChar(next)) {
    id = id + next;            // append the character to the lexeme
    next = input.read();
}
Lookahead and Pushback
• In many instances, you read a character or two beyond
the end of a token (e.g., when you read a delimiter that
could be part of the next token)
• But sometimes you don’t
• Need a way to retain previously seen lookahead chars
• Simple solution: use a stack
– Push back a char by pushing on stack
– When called to read next token, get next char from stack (if
not empty), else read from input
• Our lexer: need two character lookahead for the
DOUBLEDOT
5..10 vs 5.10
Pushback
5..10
• Read character "5": we have a number
• Read character ".": could be a decimal point, part of a real number
• Read character ".": oops, we needed a digit; the dots are not part of the number
ACTION:
Push back the second dot, then push back the first dot
On the next call for the next character, read from the stack instead of the input
5.10+
• Read character "5": we have a number
• Read character ".": could be a decimal point, part of a real number
• Read character "1": a digit, we have a real number
• Read character "0": a digit, more of the real number
• Read character "+": not a digit or "E" (scientific notation), end of the number
ACTION:
Push back "+"
On the next call for the next character, read from the stack instead of the input
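A hedged sketch of the stack-based pushback described above (class and method names are assumptions, not the course's code):

import java.io.*;
import java.util.ArrayDeque;
import java.util.Deque;

class CharSource {
    private final Reader in;
    private final Deque<Character> pushedBack = new ArrayDeque<>();   // the pushback stack

    CharSource(Reader in) { this.in = in; }

    // Retain a previously read lookahead character for a later call.
    void pushBack(char c) { pushedBack.push(c); }

    // Get the next character: from the stack if non-empty, else from the input.
    int getNextChar() throws IOException {
        if (!pushedBack.isEmpty()) return pushedBack.pop();
        return in.read();
    }
}

Because the stack is LIFO, pushing back the second dot and then the first dot (as in the 5..10 walkthrough) makes the first dot come back out first, so the characters reappear in their original input order as the start of a DOUBLEDOT token.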
Handling Reserved Words
On reaching a delimiter, examine the lexeme. Action:

if keyword(lexeme)
    return (type, -)
else
    return (ID, lexeme)
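One common way to implement the keyword check is a table lookup; this sketch assumes a Map from lexemes to token names and is not the course's actual code:

import java.util.Map;

class Keywords {
    private static final Map<String, String> KEYWORDS =
        Map.of("if", "IFKEY", "else", "ELSEKEY", "while", "WHILEKEY");

    // Action at the delimiter: keyword lexemes get their own token type,
    // everything else becomes an identifier carrying the lexeme.
    static String classify(String lexeme) {
        String kw = KEYWORDS.get(lexeme);
        return (kw != null) ? "(" + kw + ", -)" : "(ID, " + lexeme + ")";
    }
}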
Error Recovery
Compile and link the resulting code, and you've got a lexical analyzer
JLex
• http://www.cs.princeton.edu/~appel/modern/java/JLex/
• Workflow: a .jlex specification file is given to JLex, which generates Java
source; compiling that source produces the scanner class Yylex.class
A .jlex specification has three sections separated by %%:
– User code: copied verbatim into the generated .java file
– Directives and macro definitions
– Rules: regular expressions with actions

RE operators:
– | means "or"
– * means zero or more instances of
– + means one or more instances of
– ? means zero or one instance of
– () are used for grouping
• Regular expressions in JLex (where e and f
are regular expressions):
– c any character c other than: ? * + | ( ) ^ $ . [ ] { } " \
– \c any character c, but \n is newline, \^c is control-c, etc
– . any character except \n
– "..." the concatenation of all the characters in the string
– ef concatenation
– e | f alternation
– e* Kleene closure
– e+ ee*
– e? optional e
– {name} macro expansion
– [...] any character enclosed in [ ] (but only one character), from:
• c a character c (or use \c)
• ef any character from e or from f
• a-b any character from a to b
• "..." any character in the string
– [^...] any character except those enclosed by [ ]
More operators
• ^ matches beginning of line
^main matches the string "main" only when it
appears at the beginning of a line.
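Putting the pieces together, a minimal .jlex specification might look like the sketch below. This is an untested illustration: the macro names, the whitespace rule, and the use of the %type and %eofval directives are assumptions based on the JLex conventions described above; the Token class matches the one shown on the next slide.

class Token {
    String text;
    Token(String t) { text = t; }
}
%%
%type Token
%eofval{
    return null;
%eofval}
LETTER=[a-zA-Z]
DIGIT=[0-9]
%%
(" "|\t|\n)+                    { /* skip whitespace */ }
{LETTER}({LETTER}|{DIGIT})*     { return new Token(yytext()); }
{DIGIT}+                        { return new Token(yytext()); }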
class Token {
    String text;
    Token(String t) { text = t; }
}

class Yylex {
    ... ...                                // internal scanner code
    switch (yy_last_accept_state) {        // actions copied from the spec's rules (directives)
    case 1:
    case 2:
        { /* skip */ }
    case 3:
        { return new Token(yytext()); }
    ... ...
    }
}
Java program that uses the scanner
import java.io.*;
class MyLexer{
Automata theory guarantees that you can write regular expressions, give them to a
program like JLex, and it will generate a machine that accepts exactly the
language those expressions describe.
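The MyLexer fragment above is truncated in the source. A hedged sketch of a complete driver follows; it assumes the spec declared %type Token, kept JLex's default scanning-function name yylex(), returns null at end of file, and that the generated Yylex class accepts a Reader:

import java.io.*;

class MyLexer {
    public static void main(String[] args) throws IOException {
        // Yylex is the scanner class generated by JLex from the .jlex specification
        Yylex lexer = new Yylex(new FileReader(args[0]));
        for (Token t = lexer.yylex(); t != null; t = lexer.yylex()) {
            System.out.println(t.text);        // print each token's text, one per line
        }
    }
}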
Lexer generators
• The power
– Programmer describes tokens as regular expressions
– Lex turns description of tokens into code
– Generated code compiles into a scanner
• The pitfalls
– Source code generated by lex hard to debug
– Without understanding basis in formal languages, lex can be a
quirky black box
Comparison of Methods
• Hand-coded scanner
– Programmer creates types, defines data and
procedures, designs flow of control, implements in
source language
• Lex-generated scanner:
– Programmer writes patterns
– (Declarative, not procedural)
– Lex/flex implements flow of control
– Much less hand-coding, but
• code looks pretty alien, tricky to debug
Summary
• Lexical analyzer converts a text stream to
tokens
• For most languages, legal tokens are
conveniently and precisely defined using
regular expressions
• Two ways to write a lexer:
– Hand code it
– Use a lexer generator to generate the lexer code
automatically from the token REs and their priorities
APPENDIX
Regular Expression Notation
a      an ordinary character: stands for itself
ε      the empty string
R|S    any string from either L(R) or L(S)
RS     a string from L(R) followed by one from L(S)
R*     zero or more strings from L(R), concatenated:
       ε | R | RR | RRR | RRRR | RRRRR …
Convenient RE Shorthand
R+      one or more strings from L(R): R(R*)
R?      optional R: (R | ε)
[abce]  one of the listed characters: (a|b|c|e)
[a-z]   one character from this range: (a|b|c|d|e|...)
[^ab]   anything but one of the listed characters
[^a-z]  one character not from this range
Examples