
CMPU 331 • Compiler Design

LEXICAL ANALYSIS
First Step in Compilation
Source code (character stream)
  → Lexical analysis
  → Token stream
  → Parsing
  → Abstract syntax tree
  → Intermediate Code Generation
  → Intermediate code
  → Optimization/Code Generation
  → Assembly code
Lexical Analysis
Source code (character stream):

    if (b == 0) a = "hi";

  → Lexical analysis

Token stream:

    if  (  b  ==  0  )  a  =  "hi"  ;

  → Parsing
  → Semantic Analysis
A Closer Look
• Lexical analysis converts a character stream to
a token stream of pairs <token_type, value>
    if (x1 * x2 < 1.0) {
        y = x1;
    }

Character stream:

    i f   ( x 1   * x 2   < 1 . 0 )   { \n …

Token stream:

    KEY:if  LPAREN  ID:x1  OP:*  ID:x2  RELOP:<  NUM:1.0  RPAREN  LBRACE

Why a Separate Lexical Analysis Phase?
• Programs could be made from characters, and parse trees would go down to
  the character level
  – Machine specific, obfuscates parsing, cumbersome
• Lexical analysis is a firewall between program representation (input
  source) and parsing
Grammar (token values are irrelevant for parsing):

    <STMT> → IFKEY LPAREN <COND> RPAREN <STMT>
           | ID ASSIGNOP <EXPR> SEMI
    <COND> → <EXPR> RELOP <EXPR>
    <EXPR> → ID | CONSTANT

The parser groups tokens according to the grammar; the parse tree for the
token stream below unfolds as:

    <STMT>
    → IFKEY LPAREN <COND> RPAREN <STMT>
    → IFKEY LPAREN <EXPR> RELOP <EXPR> RPAREN ID ASSIGNOP <EXPR> SEMI
    → IFKEY LPAREN ID RELOP CONSTANT RPAREN ID ASSIGNOP CONSTANT SEMI

Read in the character stream (character by character):

    i f   ( b   = =   0 )   a   =   6 3 ;

Lexical analyzer (phase 1) groups characters into lexemes:

    if  (  b  ==  0  )  a  =  63  ;

Lexical analyzer (phase 2) turns lexemes into tokens:

    IFKEY LPAREN ID(b) RELOP(EQ) CONSTANT(0) RPAREN ID(a) ASSIGNOP CONSTANT(63) SEMI
Interface to Lexical Analyzer
• Either: Convert entire file to a file of tokens,
pass to parser
– Lexical analyzer is separate phase
• Or: Parser calls lexical analyzer to supply next
token
– This approach avoids extra I/O
– Parser builds tree incrementally, using successive
tokens as tree nodes
Interaction of Lexical Analyzer with Parser

[Diagram] The parser requests tokens via getNextToken(); the lexical analyzer
reads the source program and returns the next token. Both the lexical
analyzer and the parser call error routines when necessary. The lexical
analyzer sends information to the symbol table (ST) to be stored;
syntax-directed translation requests information from the ST to generate
intermediate code.
Lexical Analysis: Terminology
• Token :
– Terminal symbol in a grammar
– Classes of sequences of characters with a
collective meaning, e.g., IDENTIFIER
– Constants, Operators, Punctuation, Reserved
words (keywords)
• Lexeme :
  – character sequence matched by an instance of the token,
    e.g., "sqrt" is one value for an instance of OPERATOR
Token Types
• Identifiers: x y11 elsex _i00
• Keywords: if else while
• Integers: 2 1000 -500 +6663554
• Floating point: 2.0 0.00020 .02 1. 1e5
0.e-10
• Punctuation: [ ] .. ( ) . , : ;
• Operators: + * - < >= /
• Comments: { don’t change this }
Token Values

• Some token types have values associated with them

    Type          Value
    IDENTIFIER    xyz
    INTCONSTANT   1
    RELOP         >
    ADDOP         -
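In code, such a pair is just a small record. A minimal sketch in Java (the
names are illustrative, not the course compiler's):

    // Minimal sketch of a <token_type, value> pair
    enum TokenType { IDENTIFIER, INTCONSTANT, RELOP, ADDOP }

    class Token {
        final TokenType type;
        final String value;      // null for tokens that carry no value
        Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // e.g., new Token(TokenType.IDENTIFIER, "xyz"),
    //       new Token(TokenType.RELOP, ">")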
Lexical errors
• What if the user omits the space in "real f", writing "realf"?
  – No lexical error: the single token IDENT("realf") is produced instead of
    the sequence REAL, IDENT("f")! (REAL is a keyword)
• Typically very few lexical error types:
  – illegal characters in the character stream
  – unterminated comments
  – ill-formed constants
Issues
• How to describe tokens unambiguously
2.e0 20.e-01 2.0000
• How to break text up into tokens
if (x == 0) a = x<<1;
iff (x == 0) a = x<1;
• How to write the lexical analyzer
Free form vs Fixed form
• Free form languages (all modern ones)
– White space does not matter (except to delimit
tokens where necessary)
• Ignore tabs, spaces, new lines, carriage returns
– Only the ordering of tokens is important
• Fixed format languages (historical)
– Layout is critical
• Fortran, label in cols 1-6
• COBOL, Area A and Area B
• Lexical analyzer must know about layout to find tokens
Punctuation
• Typically individual special characters such as
; , ) ( { } : .. (two dots)
– Sometimes double characters: lexical scanner looks
for longest token:
• (*, /* -- comment openers in various languages
• <>, >= relational operators
– Returned just as identity (type) of token
• E.g., [ COMMA , - ], [ LEFT_PAREN , - ], etc.
• Character string not retained
Whitespace
• Modern languages: Whitespace is a separator
• FORTRAN compilation rule: whitespace is insignificant
– rule was motivated by the inaccuracy of card punching by operators
– lexical analysis and parsing intertwined
• Consider
  – DO 5 I=1,25
    • a DO loop: I iterates from 1 to 25; the loop body ends at the
      statement labeled 5
  – DO 5 I=1.25
    • an assignment to the variable DO5I
• Reading left-to-right, cannot tell if DO5I is a variable or a DO statement
  – Have to continue until the "," or "." is reached
Famous FORTRAN punctuation error
• Occurred on Mariner 1 in 1962

"There is a useful lesson to be learned from the failure of one of the
earliest planetary probes launched by NASA. The cause of the failure was
eventually traced to a statement in its control software similar to this:
    DO 15 I = 1.100
when what should have been written was:
    DO 15 I = 1,100
but somehow a dot had replaced the comma. Because Fortran ignores spaces,
this was seen by the compiler as:
    DO15I = 1.100
which is a perfectly valid assignment to a variable called DO15I and not at
all what was intended."
Operators
• Like punctuation
– No real difference for lexical analyzer
– Typically single or double special chars
• Operators: + - == <=
• Assignment: :=
– Returned as kind of token
• Often grouped into classes: ADDOP, RELOP,
MULOP
Keywords
• Reserved identifiers
– E.g. BEGIN END in Pascal, if in C, catch in
C++, Java
– Returned as TYPE of token
• E.g., [ BEGIN, - ], [ END, - ] [ IF , - ] [ THEN , - ]
[ ELSE , - ]
– Oddity: unreserved keywords in PL/1
– IF IF THEN THEN = THEN + 1;
• Handled as identifiers (parser disambiguates)
Identifiers
• Rules differ
– Length, allowed characters, separators
• Need to build a names table (symbol table)
– Single entry for all occurrences of Var1
– Some languages are case-insensitive (Pascal, Ada),
some are not (C, Java)
• Case insensitive: same entry for VAR1, vAr1, Var1
– Typical structure: hash table
• Lexical analyzer returns token kind with value
set to the string of its name
– E.g., [ IDENTIFIER, “abc” ]
String Literals
• Text must be stored
• Actual characters are important
– Not like identifiers: must preserve casing
– Character set issues: uniform internal representation
– Table needed (literal table)
• Lexical analyzer returns key into table
• May or may not be worth hashing to avoid
duplicates
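A minimal sketch of such a literal table in Java (hypothetical names):
hashing collapses duplicates to a single entry, and the lexer hands back a
small integer key instead of the characters themselves.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical literal table: one entry per distinct string
    class LiteralTable {
        private final Map<String, Integer> keys = new HashMap<>();

        // Returns the key for this literal, adding an entry on first sight
        int intern(String literal) {
            return keys.computeIfAbsent(literal, k -> keys.size());
        }
    }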
Character Literals
• Similar issues to string literals
• Lexical Analyzer returns
– Token kind
– Identity of character
• Cannot assume character set of host
machine, may be different
Numeric Literals
• Need a Constant Table to store numeric values
– E.g. 123 = 0123
– One entry for all occurrences of the same constant
value
• Programmers often repeat values like 0, 1, etc. multiple
times in a program
• Floating point representations more complex
– Correct rounding
– Very delicate to compute correct value
– Host / target issues
Handling Comments
• Comments have no effect on program
• Can be eliminated by scanner
• But may need to be retrieved by tools
• Error detection issues
– E.g. unclosed comments
• Scanner skips over comments and returns next
meaningful token
Performance Issues

• Speed
– Lexical analysis can become a bottleneck
– Minimize processing per character
• Skip blanks fast
• I/O is also an issue (read large blocks)
– We compile frequently
• Compilation time is important
– Especially during development
How To Describe Tokens
• Lookahead may be required to decide where
one token ends and the next one begins
– Modern languages designed to avoid this (for the
most part)
• Lexical Analysis in C++
– Unfortunately, the problems still exist
– C++ template syntax: FOO<Bar>
– C++ stream syntax: cin >> var
– Now, the problem: FOO<Bar<Bazz> > (without the space, >> would be lexed
  as a single shift operator)
How to Describe Tokens
• Modern Approach: Programming
language tokens described using regular
expressions
• A regular expression R describes some
set of strings L(R)
– L(R) is the "language" defined by R
• L(abc) = { abc }
• L(hello | goodbye) = {hello, goodbye}
• L([1-9][0-9]*) = all positive integer constants
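As a quick check of the last example, java.util.regex can test membership in
L(R) directly (a sketch; Java's regex dialect is richer than pure REs, but
these patterns mean the same thing):

    import java.util.regex.Pattern;

    public class LangDemo {
        public static void main(String[] args) {
            // L([1-9][0-9]*) = all positive integer constants (no leading zero)
            Pattern posInt = Pattern.compile("[1-9][0-9]*");
            System.out.println(posInt.matcher("42").matches());   // true
            System.out.println(posInt.matcher("042").matches());  // false
        }
    }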
Define Each Kind of Token Using an RE

• Keywords, punctuation are easy
  • IF keyword: if
  • Left paren: (
• Identifiers, constants a bit more complicated
  • Identifiers: letter (letter | digit)*
  • Constants: reals are more complicated…
    – (+ | -)? digit (digit)* . …

N.B. extended (UNIX-like) RE syntax: ? stands for "0 or 1"
Implementation Options
• Hand written lexer
– Implement a finite state automaton
• start in some initial state
• look at each input character in sequence, update lexer
state accordingly
• if state at end of input is an accepting state, the input string
matches the RE
• Lexer generator
– generates tokenizer automatically (e.g., flex, jlex)
– Uses RE to NFA to DFA algorithm
– Generates a table-driven lexer (also an FSA)
Implementing the Lexer
• Lexer implemented with a finite state automaton that
corresponds to the regular expressions describing the
tokens
– finite set of states
– set of transitions between states
– transitions taken on input symbols
– one starting state q0 and a set of final states
• Automaton for RE for IF:

    (0) --i--> (1) --f--> ((2))

• Combine the automata for each token type to create the lexer
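The IF automaton above is small enough to code directly. A sketch (not the
course lexer; the method name is illustrative):

    // States: 0 --'i'--> 1 --'f'--> 2, where 2 is the only accepting state
    static boolean matchesIf(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (state == 0 && c == 'i')      state = 1;
            else if (state == 1 && c == 'f') state = 2;
            else return false;               // no transition: reject
        }
        return state == 2;                   // accept iff we end in a final state
    }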
Handling Multiple REs
• Construct one NFA for each RE
• Associate the final state of each NFA with the given RE
• Combine NFAs for all REs into one NFA
• Convert NFA to (minimized) DFA, associating each final DFA state with
the highest priority RE of the corresponding NFA states
NFAs → Minimized DFA

[Diagram] NFAs for keywords, whitespace, identifier, and number hang off a
common start state via ε-transitions, forming one combined NFA; this is then
converted to a minimized DFA.
RE to FA
1. ID          ::= letter (letter | digit)*
2. INTCONSTANT ::= digit digit*
4. MULOP       ::= * | /
5. ADDOP       ::= + | -
6. ASSNOP      ::= :=
7. COLON       ::= :
8. LESSTHAN    ::= <
9. NOTEQUAL    ::= <>
A. LT_OR_EQUAL ::= <=
B. GT          ::= >
C. GT_OR_EQUAL ::= >=
D. EQUAL       ::= =
E. ENDMARKER   ::= .
F. SEMICOLON   ::= ;
G. LEFTPAREN   ::= (
H. RIGHTPAREN  ::= )

[Diagram] All of these combine into a single FA: from start state 0, letter
leads to state 1 (looping on letter | digit), digit leads to state 2 (looping
on digit), < leads to state 8 with > (NOTEQUAL) and = (LT_OR_EQUAL)
continuing to states 9 and A, > leads to B with = continuing to C, and the
single-character tokens go directly to their own final states (4-7, D-H).
Augmenting the FA

• Following recognition of a token, an action is specified that provides for
  – returning the appropriate token (a type, value pair)
  – in some cases, other housekeeping

    (0) --i--> (1) --f--> ((2))    Action: return (IFKEY, -)
Matching Tokens

    elsex=0;   →   elsex  =  0   (not  else  x  =  0)

• REs alone not enough: need a rule for choosing
• Most languages: longest matching token wins
  – even if a shorter token is the only way the syntax is correct
    (the lexical analyzer knows nothing about syntax!)
  – Exception: early FORTRAN (totally whitespace-insensitive)
• Ties in length resolved by prioritizing tokens
• REs + priorities + longest-matching-token rule = lexer definition
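The rule is easy to state in code. A sketch of maximal munch over a
prioritized rule list, using java.util.regex for the individual REs (all
names here are hypothetical):

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    record Rule(String name, Pattern pattern) {}

    // Longest match wins; list order encodes priority, so on a tie in
    // length the earlier rule is kept (strict '>' never overwrites a tie)
    static String nextTokenType(String input, int pos, List<Rule> rules) {
        String best = null;
        int bestLen = 0;
        for (Rule r : rules) {
            Matcher m = r.pattern().matcher(input).region(pos, input.length());
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                best = r.name();
            }
        }
        return best;   // null means a lexical error at pos
    }

With keyword rules listed before the identifier rule, input "iff" matches as
one identifier (longest match), while "if" alone matches the keyword
(priority breaks the tie).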
Delimiters

• The "longest matching token" rule has an impact on the REs (and FAs)
• IF: i f delimiter

    (0) --i--> (1) --f--> (2) --delim--> ((3))   …?

• For some tokens, delimiters are not an issue: leftparen, rightparen, =, …
Comments and Whitespace

• Not part of tokens
  – Lexer skips over them
• Function as delimiters: el{}se is two tokens, identifier "el" and
  identifier "se"
• Whitespace
  – Blanks, newlines, tabs
  – Only the first in a sequence is relevant; the rest can be thrown away
• Comments
  – May want to preserve for printing user source, use by IDEs
Hand-written lexer

• Overall structure:

    Driver:
        Calls Tokenizer
        (Maybe) prints token type and value

    GetNextToken:
        Calls the appropriate FSA
        Changes IDs to keywords where necessary
        Returns the next token in the input stream

    FSAs (ReadIdentifier, ReadNumber, ReadSymbol):
        Assemble char sequences into valid tokens
        Return a simple token

    GetNextChar:
        Returns the next significant character in the input stream
Finite Automata

• An automaton (DFA) for the string-literal RE "[^"]*" can be represented as
  – a transition table:

        state    "       non-"
        0        1       Error
        1        2       1
        2        Error   Error

  – a graph:

        (0) --"--> (1) --"--> ((2))     (state 1 loops on non-")
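Encoded directly in Java, the table might look like this (a sketch; -1 plays
the role of Error, and characters are mapped to column 0 for '"' and column 1
for anything else):

    // Transition table for "[^"]*": rows are states, columns char classes
    static final int QUOTE = 0, OTHER = 1;
    static final int[][] TRANS = {
        { 1, -1 },    // state 0: must see the opening quote
        { 2,  1 },    // state 1: inside the literal; non-" characters loop
        { -1, -1 }    // state 2: closing quote seen; accepting, no way out
    };
    static final boolean[] ACCEPT = { false, false, true };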
Lexical Analysis with a Lexer Generator

    Lexical Specification
      --(manual conversion)--> Regular Expression
      --(auto conversion)-->   NFA --> DFA --> Table-driven implementation
Hand-written lexer (table-driven method)

boolean accept_state[NSTATES] = { … };
int trans_table[NSTATES][VALID_CHARS] = { … };
int state = 0, prev_state = 0;

while (state != ERROR_STATE) {
    c = getNextChar();
    if (!valid(c)) {
        throw LexicalError.IllegalCharacter(c, lineNumber);
    }
    prev_state = state;              // remember the last good state
    state = trans_table[state][c];
}
return accept_state[prev_state];

accept_state indicates which token was identified
Hand-written lexer (non-table-driven method)

public Token GetNextToken() throws LexicalError {
    buffer.clear();
    token.clear();
    char c = getChar();
    // Check to see if next character is whitespace, skip if it is
    if (c == CharStream.BLANK)
        c = getChar();
    if (Character.isLetter(c))
        token = readIdentifier(c);
    else if (Character.isDigit(c))
        token = readNumber(c);
    else
        token = readSymbol(c);
    return token;
}
Hand-written lexer: Identifiers

    (0) --letter--> ((1))     (state 1 loops on letter | digit)

protected Token readIdentifier(char nextChar) throws LexicalError {
    while (Character.isDigit(nextChar) || Character.isLetter(nextChar)) {
        buffer.addChar(nextChar);
        nextChar = getChar();
    }

    if (!(nextChar == CharStream.BLANK))
        stream.pushBack(nextChar);

    token.setType(TokenType.IDENTIFIER);
    token.setValue(buffer.toString());
    return token;
}
How much should we match?
In general, find the longest match possible.
E.g., on input 123.45, match this as
num_const(123.45)
rather than
num_const(123), ".", num_const(45).
Using PushBack
• Many tokens require looking beyond the end of the
lexeme
• Consider three REs {aa, ba, aabb} and input: aaba

    (0) --a--> (1) --a--> ((2)) --b--> (3) --b--> ((4))
    (0) --b--> (5) --a--> ((6))

• Reach state 3 with no transition on the next character a
• Must roll input back to the position on entering state 2 (i.e., having
  read aa)
• Emit the token for aa
• On the next call to the scanner, start in state 0 again with input ba
Input Buffering
• Input must be retained somehow for re-use
• When need long strings of lookahead, use input
buffering
– However
• Buffering systems in standard languages (C, etc.) are poor
– Copy from disk to OS buffer, OS buffer to buffer in FILE structure, FILE
structure to string
• Lexer should be optimized for speed
– Largest percentage of compile time spent in the lexical analyzer
– Solution: buffer input yourself
• Get two buffers with a size of a disk block each
• Two pointers keep track of location in each
• Load input in buffers; reload one when done
Input Buffering
• Scanner performance is crucial:
– This is the only part of the compiler that examines
the entire input program one character at a time
– Disk input can be slow
– The scanner accounts for ~25-30% of total
compile time
• We need lookahead to determine when a
match has been found
• Scanners use double-buffering to minimize the
overheads associated with this
Buffer Pairs

• Use two N-byte buffers (N = size of a disk block; typically, N = 1024 or
  4096)
• Read N bytes into one half of the buffer each time. If the input has fewer
  than N bytes, put a special EOF marker in the buffer
• When one buffer has been processed, read N bytes into the other buffer
  ("circular buffers")
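A minimal sketch of the scheme in Java (names illustrative; a real scanner
would also keep a lexeme-start pointer for rollback, and the EOF marker here
is simplistic):

    import java.io.IOException;
    import java.io.InputStream;

    class BufferPair {
        static final int N = 4096;               // one disk block per half
        private final byte[] buf = new byte[2 * N];
        private final InputStream in;
        private int pos = 0;

        BufferPair(InputStream in) throws IOException {
            this.in = in;
            fill(0);                             // preload the first half
        }

        // Assumes read() fills the block or hits end of input (sketch only)
        private void fill(int start) throws IOException {
            int n = in.read(buf, start, N);
            if (n < N) buf[start + Math.max(n, 0)] = 0;   // EOF marker
        }

        byte next() throws IOException {
            byte c = buf[pos++];
            if (pos == N) fill(N);               // entering second half: refill it
            else if (pos == 2 * N) { fill(0); pos = 0; }  // wrap: refill first half
            return c;
        }
    }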
Look-ahead

• If only short strings need to be seen, use one- or two-character lookahead
  – Scan text one character at a time
  – Use the look-ahead character (next) to determine when the current token
    ends (e.g., scanning the identifier in "elsex")

    char next = (char) input.read();
    while (identifierChar(next)) {
        id = id + next;                  // append to the lexeme
        next = (char) input.read();
    }
Lookahead and Pushback
• In many instances, you read a character or two beyond the end of a token
  (e.g., when you read a delimiter that could be part of the next token)
• But sometimes you don’t
• Need a way to retain previously seen lookahead chars
• Simple solution: use a stack
– Push back a char by pushing on stack
– When called to read next token, get next char from stack (if
not empty), else read from input
• Our lexer: need two character lookahead for the
DOUBLEDOT
5..10 vs 5.10
Pushback
5..10
• Read character "5": we have a number
• Read character ".": could be a decimal point, part of a real number
• Read character ".": oops, we need a digit; the dots are not part of the number
ACTION:
  Push back the second dot, then push back the first dot
  On the next call for a character, read from the stack instead of the input

5.10+
• Read character "5": we have a number
• Read character ".": could be a decimal point, part of a real number
• Read character "1": digit, we have a real number
• Read character "0": digit, more of the real number
• Read character "+": not a digit or "E" (scientific notation), end of the number
ACTION:
  Push back "+"
  On the next call for a character, read from the stack instead of the input
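A stack-based pushback stream might look like this in Java (a sketch with
hypothetical names, in the spirit of the CharStream used elsewhere in these
slides):

    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayDeque;
    import java.util.Deque;

    class PushbackCharStream {
        private final Reader in;
        private final Deque<Character> pushedBack = new ArrayDeque<>();

        PushbackCharStream(Reader in) { this.in = in; }

        // Retain a lookahead character for the next call to getChar()
        void pushBack(char c) { pushedBack.push(c); }

        // Serve pushed-back characters first, then fresh input (-1 at EOF)
        int getChar() throws IOException {
            return pushedBack.isEmpty() ? in.read() : pushedBack.pop();
        }
    }

For 5..10, pushing back the second dot and then the first means the next two
calls to getChar() return the dots in their original order (stack LIFO).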
Handling Reserved Words

• Two possible ways to handle:
  1. Hard-wire them directly into the scanner
     • harder to modify
     • increases the size and complexity of the automaton
     • performance benefits unclear (fewer tests, but cache effects due to
       larger code size)
  2. Fold them into the identifier case, then look up in a keyword table
     (sketched below)
     • simpler, smaller code
     • table lookup cost can be mitigated using good hashing
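Option 2 in sketch form, reusing the Token and TokenType shapes sketched
earlier (the keyword members such as IFKEY are illustrative additions):

    import java.util.Map;

    // Keyword table: every word is scanned as an identifier, then looked up
    static final Map<String, TokenType> KEYWORDS = Map.of(
        "if",   TokenType.IFKEY,
        "then", TokenType.THENKEY,
        "else", TokenType.ELSEKEY
    );

    static Token classify(String lexeme) {
        TokenType kw = KEYWORDS.get(lexeme);
        return (kw != null) ? new Token(kw, null)                       // (type, -)
                            : new Token(TokenType.IDENTIFIER, lexeme);  // (ID, lexeme)
    }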
Reserved Words

[Diagram] The identifier FA: (0) --letter--> ((1)), looping on letter | digit
and exiting on a delimiter.

  Action, if keywords are hard-wired elsewhere:
      return (ID, lexeme)

  Action, with keywords folded into the identifier case:
      if keyword(lexeme) return (type, -)
      else return (ID, lexeme)
Error Recovery

• Not too many types of lexical errors


– Illegal character
– Ill-formed constant
– Unclosed comment
• How is it handled?
– Discard and print a message
– BUT:
• If a character in the middle of a lexeme is wrong, do you
discard the char or the whole lexeme?
• Try to correct?
Practical Considerations for Our
Lexer: Identifiers
• In the final compiler, the value portion of the type-
value pair for the identifier token will be [a pointer to]
the symbol table entry for that identifier
• For now, send the lexeme as value
– Implement when symbol table routines are written
– Token type is an enum
– Token value is a string (for now)—will need additional
possibilities later
• Also an issue with type of constants…
– Only the lexer knows if real or int
– Kludge for our compiler
• two token types, INTCONSTANT and REALCONSTANT
– Parser will treat as one token type: CONSTANT
Summary
• To write your own Lexer:
– Describe tokens using Regular Expressions
– Construct NFAs for those tokens
• If you have no ambiguities in the NFA, or you have a DFA
directly from the regular expressions, you are done
– Construct DFA from NFA using the
algorithm described
– Systematically implement the DFA using
transition tables
OPTION 2
Lexer Generator
• LEX: Lexical processing of character input streams
• FLEX/FLEX++: "Fast lexical analyzer generator" for C/C++
• JLex: Lexical Analyzer Generator for Java
• Input:
– list of regular expressions describing tokens in language, in priority order
– associated action for each RE (generates appropriate kind of token, other
bookkeeping)
• Process:
– Reads patterns
– Builds finite automaton to accept valid tokens
• Output:
– Code (C/C++ or Java) implementation of FA that reads an input stream and
breaks it up into tokens according to the REs (or reports lexical error:
"Unexpected character" )

Compile and link the resulting code, you've got a lexical analyzer
JLex

• http://www.cs.princeton.edu/~appel/modern/java/JLex/
• Download from this site or use on lab machines
• On the site:
  – Installation instructions
  – User manual
JLex: a scanner generator

    jlex specification (xxx.jlex)  --JLex.Main (java)-->  generated scanner (xxx.jlex.java)

    xxx.jlex.java                  --javac-->             Yylex.class

    input program (test.sim)
      + Yylex.class                --P2.main (java)-->    output of P2.main
.jlex file

    User code                   copied verbatim into the .java file
    %%
    JLex directives             includes macros and parameters for changing
                                how JLex runs
    %%
    Regular expression rules    explain how to divide up user input into
                                tokens
Regular expression rules

    regular-expression   { action }
    (pattern to be matched; code to be executed when the pattern is matched)

• When the next_token() method is called, it repeats:
  – Find the longest sequence of characters in the input (starting with the
    current character) that matches a pattern
  – Perform the associated action (plus "consume the matched lexeme")
  until a return in an action is executed
• This process is equivalent to deciding whether the input is in the
  language of the regular expression (R1|…|Rn)*
Matching rules
• If several patterns match a sequence of characters, the
longest pattern is considered to be matched
• If several patterns match the same (longest) sequence
of characters, the first such pattern is considered to be
matched
– so the order of the patterns can be important!
• If an input character is not matched in any pattern, the scanner throws an
  exception
  – make sure that there can be no unmatched characters, otherwise the
    scanner will "crash" on bad input (one common guard is sketched below)
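One common guard (a hypothetical rule in the regular-expression { action }
format above) is a lowest-priority catch-all placed last, so any character no
other rule matches is reported instead of crashing the scanner. Since `.`
does not match \n, an earlier whitespace rule should already consume
newlines:

    .   { throw new RuntimeException("Unexpected character: " + yytext()); }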
Regular expressions
• Similar to those you are familiar with
– most characters match themselves:
• abc
• ==
• while

– characters in quotes, including special characters except ", match
  themselves
  • Escape ": \"
  • "a|b" matches a|b, not a or b
  • "a\"\"\tb" matches a""\tb, not a""<TAB>b
Regular-expression operators
• the traditional ones, plus the ? operator

| means "or"
* means zero or more instances of
+ means one or more instances of
? means zero or one instance of
() are used for grouping
• Regular expressions in JLex (where e and f
are regular expressions):
– c any character c other than: ? * + | ( ) ^ $ . [ ] { } " \
– \c any character c, but \n is newline, \^c is control-c, etc
– . any character except \n
– "..." the concatenation of all the characters in the string
– ef concatenation
– e | f alternation
– e* Kleene closure
– e+ ee*
– e? optional e
– {name} macro expansion
– [...] any character enclosed in [ ] (but only one character), from:
• c a character c (or use \c)
• ef any character from e or from f
• a-b any character from a to b
• "..." any character in the string
– [^...] any character except those enclosed by [ ]
More operators
• ^ matches beginning of line
^main matches string "main" only when it
appears at the beginning of line.

• $ matches end of line


main$ matches string "main" only when it
appears at the end of line.
Character classes
• [abc]
– matches one character (either a or b or c)
• [a-z]
– matches any character between a and z, inclusive
• [^abc]
– matches any character except a, b, or c.
– ^ has special meaning only at 1st position in […]
• [\t\\]
– matches tab or \
• [a bc] is equivalent to a|" "|b|c
– white-space in char class and strings matches itself
JLex directives
• Specified in the second part of xxx.jlex.
– can also specify (see the manual for details)
• the value to be returned on end-of-file
• that line counting should be turned on
• Directives include macro definitions (very useful):
– name = regular-expression
• name is any valid Java identifier
– DIGIT = [0-9]
– LETTER = [a-zA-Z]
– WHITESPACE = [ \t\n]
• To use a macro, use its name inside curly braces.
– {LETTER}({LETTER}|{DIGIT})*
Example
%%
digits = 0|[1-9][0-9]*
letter = [A-Za-z]
identifier = {letter}({letter}|[0-9_])*
whitespace = [\ \t\n\r]+
%%
{whitespace} {/* discard */}
{digits} { return new Token(INT, Integer.parseInt(yytext())); }
"if" { return new Token(IF, yytext()); }
"while" { return new Token(WHILE, yytext()); }

{identifier} { return new Token(ID, yytext()); }
Example Output
• Java Lexer which implements the
functionality described in the language
specification
• For instance:

    case 5: { return new Token(WHILE, yytext()); }
    case 6: break;
    case 2: { return new Token(ID, yytext()); }
    case 7: break;
    case 4: { return new Token(IF, yytext()); }
    case 8: break;
    case 1: { return new Token(INT, Integer.parseInt(yytext())); }
• Available variables
– yytext (null terminated string)
– yyleng (length of the matching string)
– yyin : the file handle
• yyin = fopen(args[0], "r")
• Available functions
– yylex() (the primary function generated)
– input() - Returns the next character from the
input
– unput – put back in input ("pushback")
Character and line counting
• Sometimes it is useful to know where exactly the token is in the
text
– Token position is implemented using line counting and char counting.
• Character counting is turned off by default, activated with the
directive %char
– Create an instance variable yychar in the scanner
– Contains zero-based character index of the first character on the
matched region of text
• Line counting is turned off by default, activated with the
directive %line
– Create an instance variable yyline in the scanner
– Contains the zero-based line index at the beginning of the matched region
  of text

Example:
    "int" { return (new Yytoken(4, yytext(), yyline, yychar, yychar+3)); }
Context Checking
• JLex allows context-dependent REs
– r/x The regular expression r will be matched
only if it is followed by an occurrence of
regular expression x.
• Makes it easy to deal with our ADDOP
vs. UNARYPLUS problem
How to Use JLex
• Install Jlex
– Instructions at http://www.cs.princeton.edu/~appel/modern/java/JLex/current/README
• Once installed, the lexical analyzer generator is the Main class in the JLex
folder
• Write regular expressions, actions, and directives and save in a file
filename.jlex
• Use the command
– java JLex.Main filename.jlex
to produce a file filename.jlex.java
• Compile this lexical analyzer source file with the Java compiler:
– javac filename.jlex.java
• Result is a lexical analyzer class file, which can be used in your program
• Default:
– lexical analyzer class is called Yylex
– class files are Yylex.class and Yytoken.class

See the User Manual at http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html for full information


Simple example

A JLex scanner that looks for five-letter words that begin with "P" and end
with "T":

class Token {
    String text;
    Token(String t) { text = t; }
}
%%
Digit=[0-9]
AnyLet=[A-Za-z]
Others=[0-9'&.]
WhiteSp=[\040\n]
%type Token // tell jlex to have yylex() return a Token
%eofval{
return new Token(null);
%eofval} // tell jlex what to return on eof
%%
[Pp]{AnyLet}{AnyLet}{AnyLet}[Tt]{WhiteSp}+   {return new Token(yytext());}
({AnyLet}|{Others})+{WhiteSp}+               {/*skip*/}
MPB-NI:jlex ide$ java JLex.Main simple.jlex
Processing first section -- user code.
Processing second section -- JLex declarations.
Processing third section -- lexical rules.
Creating NFA machine representation.
NFA comprised of 26 states.
Working on character classes.::..::.:.:
NFA has 7 distinct character classes.
Creating DFA transition table.
Working on DFA states............
Minimizing DFA transition table.
10 states after removal of redundant states.
Outputting lexical analyzer code.
Code generated in simple.lex.java

class Token {
String text;
Token(String t){text = t;}
}

class Yylex {                                  // internal code
    ... ...
    switch (yy_last_accept_state) {
        case 1:
        case 2:
            {/*skip*/}                         // copied from directive
        case 3:
            {return new Token(yytext());}      // copied from directive
        ... ...
    }
}
Java program that uses the scanner
import java.io.*;

class MyLexer {
    public static void main(String args[]) throws java.io.IOException {
        Yylex lex = new Yylex(System.in);
        Token token = lex.yylex();
        while (token.text != null) {
            System.out.print("\t" + token.text);
            token = lex.yylex();   // get next token
        }
    }
}

After compiling MyLexer.java:


MPB-NI:jlex ide$ echo "Plant" | java -cp . MyLexer
Plant
How does JLex build the FA?

• Programmer writes the regular expression
• JLex generates the corresponding NFA-ε
  – Thompson's construction: 5 rules for making an NFA-ε for any regular
    expression
• Kleene's Theorem proves that any NFA-ε is equivalent to some NFA, which is
  in turn equivalent to a DFA
• So, JLex can generate deterministic code
• JLex matches the longest token, then accepts

Automata theory proves you can write regular expressions and give them to a
program like JLex, which will generate a machine to accept exactly those
expressions
Lexer generators
• The power
– Programmer describes tokens as regular expressions
– Lex turns description of tokens into code
– Generated code compiles into a scanner
• The pitfalls
– Source code generated by lex hard to debug
– Without understanding basis in formal languages, lex can be a
quirky black box
Comparison of Methods
• Hand-coded scanner
– Programmer creates types, defines data and
procedures, designs flow of control, implements in
source language
• Lex-generated scanner:
– Programmer writes patterns
– (Declarative, not procedural)
– Lex/flex implements flow of control
– Much less hand-coding, but
• code looks pretty alien, tricky to debug
Summary
• Lexical analyzer converts a text stream to
tokens
• For most languages, legal tokens
conveniently, precisely defined using
regular expressions
• Two ways to write lexer:
– Hand code
– Use a lexer generator to generate lexer code automatically from token
  REs and precedence
APPENDIX
Regular Expression Notation

a       ordinary character stands for itself
ε       the empty string
R|S     any string from either L(R) or L(S)
RS      a string from L(R) followed by one from L(S)
R*      zero or more strings from L(R), concatenated:
        ε | R | RR | RRR | RRRR | RRRRR …

Convenient RE Shorthand

R+      one or more strings from L(R): R(R*)
R?      optional R: (R | ε)
[abce]  one of the listed characters: (a|b|c|e)
[a-z]   one character from this range: (a|b|c|d|e|...)
[^ab]   anything but one of the listed chars
[^a-z]  one character not from this range
Examples

Regular Expression    Strings in L(R)
a                     "a"
ab                    "ab"
a|b                   "a" "b"
(ab)*                 "" "ab" "abab" …
(a | ε) b             "ab" "b"
More Examples

Regular Expression              Strings in L(R)
digit  = [0-9]                  "0" "1" "2" "3" …
posint = digit+                 "8" "412" …
int    = -? posint              "-42" "1024" …
real   = int (ε | (. posint))   "-1.56" "12" "1.0"
       = -?[0-9]+(ε | (. [0-9]+))
[a-zA-Z_][a-zA-Z0-9_]*          C identifiers
