
Implementation of Lexical Analysis

Outline

• Specifying lexical structure using regular expressions

• Finite automata
– Deterministic Finite Automata (DFAs)
– Non-deterministic Finite Automata (NFAs)

• Implementation of regular expressions

RegExp ⇒ NFA ⇒ DFA ⇒ Tables

Compiler Design 1 (2011) 2

Notation

• For convenience, we use a variation (allow user-defined abbreviations) in regular expression notation

• Union: A + B ≡ A|B
• Option: A + ε ≡ A?
• Range: ‘a’+‘b’+…+‘z’ ≡ [a-z]
• Excluded range: complement of [a-z] ≡ [^a-z]

Regular Expressions in Lexical Specification

• Last lecture: a specification for the predicate s ∈ L(R)
• But a yes/no answer is not enough!
• Instead: partition the input into tokens
• We will adapt regular expressions to this goal



Regular Expressions ⇒ Lexical Spec. (1)

1. Select a set of tokens
• Integer, Keyword, Identifier, OpenPar, ...

2. Write a regular expression (pattern) for the lexemes of each token
• Integer = digit⁺ (one or more digits)
• Keyword = ‘if’ + ‘else’ + …
• Identifier = letter (letter + digit)*
• OpenPar = ‘(’
• …

Regular Expressions ⇒ Lexical Spec. (2)

3. Construct R, matching all lexemes for all tokens

R = Keyword + Identifier + Integer + …
  = R1 + R2 + R3 + …

Facts: If s ∈ L(R) then s is a lexeme
– Furthermore s ∈ L(Ri) for some “i”
– This “i” determines the token that is reported
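Steps 1 to 3 can be sketched with Python's `re` module. The choice of Python is an assumption, since the slides do not prescribe a language. The names `TOKEN_SPEC`, `matching_tokens` and `is_lexeme` are hypothetical, and the keyword list is limited to the two keywords the slide names.

```python
import re

# One (token name, pattern) pair per R_i, in the order R1, R2, R3, ...
# The slide's '+' (union) is written '|' in Python's regex syntax.
TOKEN_SPEC = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Integer", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
]

def matching_tokens(s):
    """Return the names of every R_i with s in L(R_i), in rule order."""
    return [name for name, pattern in TOKEN_SPEC if pattern.fullmatch(s)]

def is_lexeme(s):
    """s is in L(R) exactly when it is in L(R_i) for at least one i."""
    return bool(matching_tokens(s))
```

Note that `matching_tokens("if")` reports both Keyword and Identifier, so the index i is not always unique. The ambiguity slides address this case.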


Regular Expressions ⇒ Lexical Spec. (3)

4. Let input be x1…xn
• (x1 ... xn are characters)
• For 1 ≤ i ≤ n check
  x1…xi ∈ L(R) ?

5. It must be that
  x1…xi ∈ L(Rj) for some j
  (if there is a choice, pick a smallest such j)

6. Remove x1…xi from input and go back to step 4

How to Handle Spaces and Comments?

1. We could create a token Whitespace
  Whitespace = (‘ ’ + ‘\n’ + ‘\t’)⁺
– We could also add comments in there
– An input “ \t\n 5555 ” is transformed into Whitespace Integer Whitespace

2. Lexer skips spaces (preferred)
• Modify step 5 from before as follows:
  It must be that xk ... xi ∈ L(Rj) for some j such that x1 ... xk-1 ∈ L(Whitespace)
• Parser is not bothered with spaces



Ambiguities (1)

• There are ambiguities in the algorithm

• How much input is used? What if
  • x1…xi ∈ L(R) and also
  • x1…xK ∈ L(R)
– Rule: Pick the longest possible substring
– The “maximal munch”

Ambiguities (2)

• Which token is used? What if
  • x1…xi ∈ L(Rj) and also
  • x1…xi ∈ L(Rk)
– Rule: use rule listed first (j if j < k)

• Example:
– R1 = Keyword and R2 = Identifier
– “if” matches both
– Treats “if” as a keyword not an identifier
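The two rules (longest match, then first listed rule) can be combined with the whitespace-skipping variant of the algorithm. The following is a deliberately naive sketch: it re-tests each candidate prefix with Python's `re` module instead of using the single-pass automaton developed later in the lecture. The names `RULES`, `next_token` and `tokenize` are hypothetical.

```python
import re

# Rules in priority order: an earlier rule wins a tie (rule listed first).
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Integer", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
]
WHITESPACE = re.compile(r"[ \t\n]+")

def next_token(text, start):
    """Return (token name, end index) for the longest lexeme at `start`."""
    # Try the longest candidate prefix first: maximal munch.
    for end in range(len(text), start, -1):
        for name, pattern in RULES:  # the first listed rule wins
            if pattern.fullmatch(text, start, end):
                return name, end
    raise ValueError(f"no rule matches at position {start}")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)  # the lexer skips spaces itself
        if ws:
            pos = ws.end()
            continue
        name, end = next_token(text, pos)
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens
```

With these rules, “if” is reported as a Keyword, while “iffy” is reported as a single Identifier because the longer match takes precedence over rule order.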


Error Handling

• What if no rule matches a prefix of input?
• Problem: Can’t just get stuck …
• Solution:
– Write a rule matching all “bad” strings
– Put it last
• Lexer tools allow the writing of:
  R = R1 + ... + Rn + Error
– Token Error matches if nothing else matches

Summary

• Regular expressions provide a concise notation for string patterns
• Use in lexical analysis requires small extensions
– To resolve ambiguities
– To handle errors
• Good algorithms known (next)
– Require only single pass over the input
– Few operations per character (table lookup)
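The form R = R1 + ... + Rn + Error from the Error Handling slide maps onto a single Python regular expression with one named group per token. This is only a sketch under two assumptions: the rule set is a small hypothetical one, and Python's `|` takes the first alternative that matches rather than the longest one, which happens to give the same result for these particular rules.

```python
import re

# R = R1 + ... + Rn + Error, written as one alternation with named groups.
# Error is listed last and matches any single character.
LEXER = re.compile(
    r"(?P<Integer>[0-9]+)"
    r"|(?P<Identifier>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<Whitespace>[ \t\n]+)"
    r"|(?P<Error>.)"
)

def tokenize(text):
    """Yield (token name, lexeme) pairs, dropping whitespace."""
    for m in LEXER.finditer(text):
        if m.lastgroup != "Whitespace":
            yield m.lastgroup, m.group()
```

Because Error comes last, it is chosen only at positions where no other rule matches, so the lexer never gets stuck and can report the offending character.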



Regular Languages & Finite Automata

Basic formal language theory result:
Regular expressions and finite automata both define the class of regular languages.

Thus, we are going to use:
• Regular expressions for specification
• Finite automata for implementation (automatic generation of lexical analyzers)

Finite Automata

A finite automaton is a recognizer for the strings of a regular language

A finite automaton consists of
– A finite input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state →input state


Finite Automata

• Transition
  s1 →a s2
• Is read
  In state s1 on input “a” go to state s2

• If end of input (or no transition possible)
– If in accepting state ⇒ accept
– Otherwise ⇒ reject

Finite Automata State Graphs

• A state
• The start state
• An accepting state
• A transition (labelled with its input symbol, here a)
[Figure: the graphical symbol for each element; drawings not recoverable from the source]



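A Simple Example

• A finite automaton that accepts only “1”
[Figure: state graph with a transition labelled 1]

Another Simple Example

• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
[Figure: state graph with a transition labelled 1; remainder not recoverable from the source]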
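The second automaton (any number of 1’s followed by a single 0) can be simulated directly from its transition set. In this sketch the state names `A` and `B` are hypothetical, and a missing transition with input left over is treated as rejection, the usual convention when a whole string is to be recognized.

```python
# DFA for "any number of 1's followed by a single 0" over the alphabet {0, 1}.
# State names are hypothetical: "A" is the start state, "B" is accepting.
START = "A"
ACCEPTING = {"B"}
TRANSITIONS = {
    ("A", "1"): "A",  # stay in A while reading 1's
    ("A", "0"): "B",  # the single 0 moves to the accepting state
}

def accepts(text):
    state = START
    for symbol in text:
        if (state, symbol) not in TRANSITIONS:
            return False  # stuck with input left over: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING
```

For example, `accepts("1110")` and `accepts("0")` hold, while `accepts("11")` and `accepts("100")` do not.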


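And Another Example

• Alphabet {0,1}
• What language does this recognize?
[Figure: state graph with transitions labelled 0 and 1; not recoverable from the source]

And Another Example

• Alphabet still { 0, 1 }
[Figure: state graph with two transitions labelled 1; not recoverable from the source]
• The operation of the automaton is not completely defined by the input
– On input “11” the automaton could be in either state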



Epsilon Moves

• Another kind of transition: ε-moves
[Figure: state A with an ε-transition to state B]
• Machine can move from state A to state B without reading input

Deterministic and Non-Deterministic Automata

• Deterministic Finite Automata (DFA)
– One transition per input per state
– No ε-moves
• Non-deterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a given state
– Can have ε-moves
• Finite automata have finite memory
– Enough to only encode the current state


Execution of Finite Automata

• A DFA can take only one path through the state graph
– Completely determined by input

• NFAs can choose
– Whether to make ε-moves
– Which of multiple transitions for a single input to take

Acceptance of NFAs

• An NFA can get into multiple states
[Figure: NFA state graph with transitions labelled 0 and 1; not recoverable from the source]

• Input: 1 0 1

• Rule: NFA accepts an input if it can get in a final state
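The acceptance rule can be checked by tracking the set of all states the NFA could be in after each input symbol. The NFA below is a small hypothetical one, not the automaton from the slide's figure, and using the empty string as the ε label is a convention of this sketch.

```python
EPS = ""  # label used for ε-moves (a convention of this sketch)

# Hypothetical NFA: S --ε--> A, A loops on 0 and 1, A --1--> B (accepting).
NFA = {
    ("S", EPS): {"A"},
    ("A", "0"): {"A"},
    ("A", "1"): {"A", "B"},
}
START, ACCEPTING = "S", {"B"}

def eps_closure(states):
    """All states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        for nxt in NFA.get((stack.pop(), EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def accepts(text):
    current = eps_closure({START})
    for symbol in text:
        moved = set()
        for state in current:
            moved |= NFA.get((state, symbol), set())
        current = eps_closure(moved)
    # Accept if at least one possible run ends in a final state.
    return bool(current & ACCEPTING)
```

On input 1 0 1 the tracked sets are {S, A}, {A, B}, {A}, {A, B}, so the input is accepted because the last set contains a final state.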
NFA vs. DFA (1)

• NFAs and DFAs recognize the same set of languages (regular languages)

• DFAs are easier to implement
– There are no choices to consider

NFA vs. DFA (2)

• For a given language the NFA can be simpler than the DFA
[Figure: an NFA and a DFA for the same language, with transitions labelled 0 and 1; not recoverable from the source]

• DFA can be exponentially larger than NFA


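Regular Expressions to Finite Automata

• High-level sketch
[Figure: Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA]

Regular Expressions to NFA (1)

• For each kind of reg. expr, define an NFA
– Notation: NFA for regular expression M
[Figure: box labelled M]

• For ε
[Figure: a single ε-transition from the start state to the accepting state]

• For input a
[Figure: a single transition labelled a from the start state to the accepting state]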



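Regular Expressions to NFA (2)

• For AB
[Figure: the NFA for A connected by an ε-move to the NFA for B]

• For A + B
[Figure: ε-moves from a new start state into the NFAs for A and B, and ε-moves from both to a new accepting state]

Regular Expressions to NFA (3)

• For A*
[Figure: the NFA for A with additional ε-moves; exact layout not recoverable from the source]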


Example of Regular Expression → NFA conversion

• Consider the regular expression
  (1+0)*1
• The NFA is
[Figure: NFA with states A to J, transitions labelled 1 and 0, and ε-moves; edges not recoverable from the source]

NFA to DFA. The Trick

• Simulate the NFA
• Each state of DFA
  = a non-empty subset of states of the NFA
• Start state
  = the set of NFA states reachable through ε-moves from NFA start state
• Add a transition S →a S’ to DFA iff
– S’ is the set of NFA states reachable from any state in S after seeing the input a
  • considering ε-moves as well
ε



NFA to DFA. Remark

• An NFA may be in many states at any time

• How many different states?

• If there are N states, the NFA must be in some subset of those N states

• How many subsets are there?
– 2^N − 1 = finitely many

NFA to DFA Example

[Figure: the NFA for (1+0)*1 with states A to J, and the resulting DFA with states ABCDHI, FGABCDHI and EJGABCDHI and transitions labelled 0 and 1]
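The subset construction can be sketched for the example above. The edge list below is an assumption: the figure's edges are lost in this copy, so they were reconstructed so that the ε-closures reproduce the three DFA state names ABCDHI, FGABCDHI and EJGABCDHI. All function and variable names are hypothetical.

```python
EPS = ""  # label for ε-moves (a convention of this sketch)

# Edge list reconstructed for the NFA of (1+0)*1 (states A..J, J accepting).
NFA = {
    ("A", EPS): {"B", "H"}, ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"}, ("D", "0"): {"F"},
    ("E", EPS): {"G"}, ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"}, ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
NFA_START, NFA_ACCEPTING, ALPHABET = "A", {"J"}, "01"

def eps_closure(states):
    """All NFA states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        for nxt in NFA.get((stack.pop(), EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)

def subset_construction():
    start = eps_closure({NFA_START})
    transitions, seen, worklist = {}, {start}, [start]
    while worklist:
        subset = worklist.pop()
        for a in ALPHABET:
            moved = set()
            for state in subset:
                moved |= NFA.get((state, a), set())
            target = eps_closure(moved)
            if not target:
                continue  # DFA states are non-empty subsets only
            transitions[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    accepting = {s for s in seen if s & NFA_ACCEPTING}
    return start, transitions, accepting

def name(subset):
    """Readable DFA state name: the NFA state letters in sorted order."""
    return "".join(sorted(subset))
```

Only three of the 2^N − 1 possible subsets are reachable here. `name` sorts the letters, so the slide's FGABCDHI appears as ABCDFGHI and EJGABCDHI as ABCDEGHIJ.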


Implementation

• A DFA can be implemented by a 2D table T
– One dimension is “states”
– Other dimension is “input symbols”
– For every transition Si →a Sk define T[i,a] = k

• DFA “execution”
– If in state Si and input a, read T[i,a] = k and skip to state Sk
– Very efficient

Table Implementation of a DFA

[Figure: DFA with states S, T and U and transitions labelled 0 and 1]

      0   1
  S   T   U
  T   T   U
  U   T   U
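The table above translates directly into a 2D array indexed by state and input symbol. In this sketch the accepting state is an assumption: the figure's markings are lost, and U is chosen because the table matches the DFA for (1+0)*1, where the state reached on input 1 is accepting.

```python
# Transition table from the slide: rows are states, columns are input symbols.
S, T, U = 0, 1, 2
SYMBOLS = {"0": 0, "1": 1}
TABLE = [
    # on 0, on 1
    [T, U],  # from S
    [T, U],  # from T
    [T, U],  # from U
]
START = S
ACCEPTING = {U}  # assumption: the accepting marking in the figure is lost

def run(text):
    state = START
    for ch in text:
        state = TABLE[state][SYMBOLS[ch]]  # one table lookup per character
    return state in ACCEPTING
```

Each input character costs a single lookup, which is the efficiency the slide refers to. A character outside {0, 1} raises `KeyError` here, since the table has no column for it.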



Implementation (Cont.)

• NFA → DFA conversion is at the heart of tools such as lex, ML-Lex or flex

• But, DFAs can be huge

• In practice, lex/ML-Lex/flex-like tools trade off speed for space in the choice of NFA and DFA representations

Theory vs. Practice

Two differences:

• DFAs recognize lexemes. A lexer must return a type of acceptance (token type) rather than simply an accept/reject indication.

• DFAs consume the complete string and accept or reject it. A lexer must find the end of the lexeme in the input stream and then find the next one, etc.
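One common way to bridge both differences is to label each accepting state with a token type and to remember the last accepting position while scanning. The sketch below assumes a small hypothetical DFA for Integer and Identifier only; `char_class`, `TOKEN_OF` and `next_token` are illustrative names, not part of any lexer tool.

```python
def char_class(ch):
    """Map a character to the input class used by the DFA."""
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "letter"
    return "other"

# Hypothetical DFA: state 0 is the start state.
TRANSITIONS = {
    (0, "digit"): 1, (1, "digit"): 1,
    (0, "letter"): 2, (2, "letter"): 2, (2, "digit"): 2,
}
# Accepting states carry the token type they report.
TOKEN_OF = {1: "Integer", 2: "Identifier"}

def next_token(text, start):
    """Return (token type, end index) of the longest lexeme at `start`,
    or None if no prefix of the remaining input is a lexeme."""
    state, last = 0, None
    for pos in range(start, len(text)):
        state = TRANSITIONS.get((state, char_class(text[pos])))
        if state is None:
            break  # the DFA is stuck: fall back to the last acceptance
        if state in TOKEN_OF:
            last = (TOKEN_OF[state], pos + 1)
    return last
```

The returned end index tells the lexer where the lexeme stops, so the next call can start there. A `None` result corresponds to the case in which no rule matches.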

