Implementation of The Regular Expression

Implementation of Lexical Analysis

Outline
• Specifying lexical structure using regular expressions
• Finite automata
– Deterministic Finite Automata (DFAs)
– Non-deterministic Finite Automata (NFAs)
• Implementation of regular expressions
RegExp ⇒ NFA ⇒ DFA ⇒ Tables
Compiler Design 1 (2011) 2
Notation
• For convenience, we use a variation (allow user-defined abbreviations) in regular expression notation
• Union: A + B ≡ A|B
• Option: A + ε ≡ A?
• Range: ‘a’+‘b’+…+‘z’ ≡ [a-z]
• Excluded range:
complement of [a-z] ≡ [^a-z]

Regular Expressions in Lexical Specification
• Last lecture: a specification for the predicate s ∈ L(R)
• But a yes/no answer is not enough!
• Instead: partition the input into tokens
• We will adapt regular expressions to this goal
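The abbreviated notation and the predicate s ∈ L(R) can be tried out directly. The sketch below is only an illustration, assuming Python's standard `re` module as a stand-in for the slide notation; the helper name `in_language` is hypothetical.

```python
import re

def in_language(pattern, s):
    """Yes/no answer to the question: is the whole string s in L(pattern)?"""
    return re.fullmatch(pattern, s) is not None

# Slide notation on the left, Python pattern on the right.
print(in_language(r"a|b", "b"))      # Union:          A + B  as  A|B
print(in_language(r"ab?", "a"))      # Option:         A + ε  as  A?
print(in_language(r"[a-z]", "q"))    # Range:          [a-z]
print(in_language(r"[^a-z]", "Q"))   # Excluded range: [^a-z]
```

Note that `in_language` only answers yes or no for a complete string; it does not split the input into tokens.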

Regular Expressions ⇒ Lexical Spec. (1)
1. Select a set of tokens
• Integer, Keyword, Identifier, OpenPar, ...
2. Write a regular expression (pattern) for the lexemes of each token
• Integer = digit⁺ (one or more digits)
• Keyword = ‘if’ + ‘else’ + …
• Identifier = letter (letter + digit)*
• OpenPar = ‘(’
• …

Regular Expressions ⇒ Lexical Spec. (2)
3. Construct R, matching all lexemes for all tokens
R = Keyword + Identifier + Integer + …
= R1 + R2 + R3 + …
Facts: If s ∈ L(R) then s is a lexeme
– Furthermore s ∈ L(Ri) for some “i”
– This “i” determines the token that is reported

Regular Expressions ⇒ Lexical Spec. (3)
4. Let input be x1…xn
• (x1 ... xn are characters)
• For 1 ≤ i ≤ n check
x1…xi ∈ L(R) ?
5. It must be that
x1…xi ∈ L(Rj) for some j
(if there is a choice, pick a smallest such j)
6. Remove x1…xi from input and go to step 4

How to Handle Spaces and Comments?
1. We could create a token Whitespace
Whitespace = (‘ ’ + ‘\n’ + ‘\t’)⁺
– We could also add comments in there
– An input “ \t\n 5555 ” is transformed into
Whitespace Integer Whitespace
2. Lexer skips spaces (preferred)
• Modify step 5 from before as follows:
It must be that xk ... xi ∈ L(Rj) for some j such that x1 ... xk-1 ∈ L(Whitespace)
• Parser is not bothered with spaces

Ambiguities (1)
• There are ambiguities in the algorithm
• How much input is used? What if
• x1…xi ∈ L(R) and also
• x1…xK ∈ L(R)
– Rule: Pick the longest possible substring
– The “maximal munch”

Ambiguities (2)
• Which token is used? What if
• x1…xi ∈ L(Rj) and also
• x1…xi ∈ L(Rk)
– Rule: use rule listed first (j if j < k)
• Example:
– R1 = Keyword and R2 = Identifier
– “if” matches both
– Treats “if” as a keyword not an identifier

Error Handling
• What if no rule matches a prefix of input?
• Problem: Can’t just get stuck …
• Solution:
– Write a rule matching all “bad” strings
– Put it last
• Lexer tools allow the writing of:
R = R1 + ... + Rn + Error
– Token Error matches if nothing else matches

Summary
• Regular expressions provide a concise notation for string patterns
• Use in lexical analysis requires small extensions
– To resolve ambiguities
– To handle errors
• Good algorithms known (next)
– Require only single pass over the input
– Few operations per character (table lookup)
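The tokenizing loop with its three extensions (longest match, first-listed rule on ties, an Error rule placed last) can be sketched as follows. This is a minimal illustration, not how lex-style tools work internally: it assumes Python's `re` module, a hypothetical rule list `RULES`, and patterns simple enough that each rule's own match is also its longest match.

```python
import re

# Hypothetical rules, in priority order: earlier rules win ties.
RULES = [
    ("Keyword",    re.compile(r"if|else")),
    ("Identifier", re.compile(r"[a-z][a-z0-9]*")),
    ("Integer",    re.compile(r"[0-9]+")),
    ("OpenPar",    re.compile(r"\(")),
    ("Error",      re.compile(r".", re.DOTALL)),  # last: any single character
]
WHITESPACE = re.compile(r"[ \n\t]+")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:                               # the lexer skips spaces
            pos = ws.end()
            continue
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best (maximal munch);
            # equal-length matches do not, so the rule listed first wins.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end                       # remove the lexeme and repeat
    return tokens

print(tokenize("if iffy (42 #"))
```

Because the Error rule always matches one character, the loop never gets stuck on input that no other rule accepts.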

Regular Languages & Finite Automata
Basic formal language theory result:
Regular expressions and finite automata both define the class of regular languages.
Thus, we are going to use:
• Regular expressions for specification
• Finite automata for implementation
(automatic generation of lexical analyzers)

Finite Automata
A finite automaton is a recognizer for the strings of a regular language
A finite automaton consists of
– A finite input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state →input state

Finite Automata
• Transition
s1 →a s2
• Is read
In state s1 on input “a” go to state s2
• If end of input (or no transition possible)
– If in accepting state ⇒ accept
– Otherwise ⇒ reject

Finite Automata State Graphs
• A state
• The start state
• An accepting state
• A transition
[Figure: the graph symbol for each item is not recoverable from the source; the transition is shown as an edge labeled with its input, a]

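A Simple Example
• A finite automaton that accepts only “1”
[Figure: state graph with a transition labeled 1]

Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
[Figure: state graph; only an edge label 1 survives in the source]
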
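The second example (any number of 1’s followed by a single 0) can be written down as a deterministic automaton and run on whole strings. This is a sketch under assumptions: the state names `start` and `done` are hypothetical, and a missing transition is treated as a reject, since the goal here is to recognize the complete input string.

```python
# Hypothetical state names: "start" loops on 1; a single 0 leads to "done".
TRANSITIONS = {
    ("start", "1"): "start",
    ("start", "0"): "done",
}
START = "start"
ACCEPTING = {"done"}

def accepts(text):
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition possible: reject
            return False
    return state in ACCEPTING      # end of input: accept iff in an accepting state

for s in ["0", "1110", "111", "100", ""]:
    print(repr(s), accepts(s))
```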
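And Another Example
• Alphabet {0,1}
• What language does this recognize?
[Figure: state graph with edges labeled 0 and 1; not recoverable from the source]

And Another Example
• Alphabet still { 0, 1 }
[Figure: state graph with edges labeled 1; not recoverable from the source]
• The operation of the automaton is not completely defined by the input
– On input “11” the automaton could be in either state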

Epsilon Moves
• Another kind of transition: ε-moves
A →ε B
• Machine can move from state A to state B without reading input

Deterministic and Non-Deterministic Automata
• Deterministic Finite Automata (DFA)
– One transition per input per state
– No ε-moves
• Non-deterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a given state
– Can have ε-moves
• Finite automata have finite memory
– Enough to only encode the current state

Execution of Finite Automata
• A DFA can take only one path through the state graph
– Completely determined by input
• NFAs can choose
– Whether to make ε-moves
– Which of multiple transitions for a single input to take

Acceptance of NFAs
• An NFA can get into multiple states
[Figure: NFA state graph with edges labeled 0 and 1; not recoverable from the source]
• Input: 1 0 1
• Rule: NFA accepts an input if it can get in a final state

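The acceptance rule for NFAs can be checked by tracking every state the NFA could be in at once. The sketch below does this for a small hypothetical NFA (not the one drawn on the slide, whose graph did not survive extraction) that accepts strings over {0,1} ending in 1; `EPS`, `DELTA`, and the state names are illustrative assumptions.

```python
EPS = None  # label used here for ε-moves

# Hypothetical NFA: strings over {0,1} that end in 1.
DELTA = {
    ("p", "0"): {"p"},
    ("p", "1"): {"p", "q"},   # two possible moves on input 1
    ("q", EPS): {"r"},        # an ε-move
}
START, ACCEPTING = "p", {"r"}

def eps_closure(states):
    """All states reachable from `states` using only ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in DELTA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(text):
    current = eps_closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= DELTA.get((s, ch), set())
        current = eps_closure(moved)
    # Accept if at least one of the possible states is final.
    return bool(current & ACCEPTING)

print(accepts("101"), accepts("10"))
```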
NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of languages (regular languages)
• DFAs are easier to implement
– There are no choices to consider

NFA vs. DFA (2)
• For a given language the NFA can be simpler than the DFA
[Figure: state graphs of an NFA and an equivalent DFA over {0,1}; not recoverable from the source]
• DFA can be exponentially larger than NFA

Regular Expressions to Finite Automata
• High-level sketch
Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA

Regular Expressions to NFA (1)
• For each kind of reg. expr, define an NFA
– Notation: NFA for regular expression M
[Figure: a box labeled M]
• For ε
[Figure: a single ε-move]
• For input a
[Figure: a single transition labeled a]

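Regular Expressions to NFA (2)
• For AB
[Figure: the NFA for A joined to the NFA for B by an ε-move]
• For A + B
[Figure: the NFAs for A and B side by side, connected by ε-moves]

Regular Expressions to NFA (3)
• For A*
[Figure: the NFA for A with added ε-moves]
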
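Since the construction diagrams are hard to read here, a code sketch may help. The version below follows the standard Thompson-style construction, which may differ in small details from the diagrams on the slides; the function names (`eps`, `sym`, `concat`, `union`, `star`, `accepts`) and the tuple representation are assumptions made for this illustration.

```python
from itertools import count

_ids = count()  # source of fresh state numbers

# An NFA fragment is (start, accept, edges); each edge is
# (source, label, target), where label None stands for an ε-move.

def eps():
    s, f = next(_ids), next(_ids)
    return s, f, [(s, None, f)]

def sym(a):
    s, f = next(_ids), next(_ids)
    return s, f, [(s, a, f)]

def concat(A, B):
    # AB: ε-move from A's accepting state to B's start state.
    return A[0], B[1], A[2] + B[2] + [(A[1], None, B[0])]

def union(A, B):
    # A + B: new start and accepting states, joined to both parts by ε-moves.
    s, f = next(_ids), next(_ids)
    extra = [(s, None, A[0]), (s, None, B[0]), (A[1], None, f), (B[1], None, f)]
    return s, f, A[2] + B[2] + extra

def star(A):
    # A*: ε-moves allow skipping A entirely or repeating it.
    s, f = next(_ids), next(_ids)
    extra = [(s, None, A[0]), (s, None, f), (A[1], None, A[0]), (A[1], None, f)]
    return s, f, A[2] + extra

def accepts(nfa, text):
    start, accept, edges = nfa
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for src, label, dst in edges:
                if src == q and label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({start})
    for ch in text:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == ch})
    return accept in current

# The regular expression (1+0)*1
nfa = concat(star(union(sym("1"), sym("0"))), sym("1"))
print(accepts(nfa, "0101"), accepts(nfa, "10"))
```

Each combinator adds only a constant number of states and ε-moves, so the NFA grows linearly with the size of the regular expression.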
Example of Regular Expression → NFA conversion
• Consider the regular expression
(1+0)*1
• The NFA is
[Figure: NFA with states A to J; the labeled transitions are C →1 E, D →0 F and I →1 J, and the remaining edges are ε-moves]

NFA to DFA. The Trick
• Simulate the NFA
• Each state of DFA
= a non-empty subset of states of the NFA
• Start state
= the set of NFA states reachable through ε-moves from NFA start state
• Add a transition S →a S’ to DFA iff
– S’ is the set of NFA states reachable from any state in S after seeing the input a
• considering ε-moves as well

NFA to DFA. Remark
• An NFA may be in many states at any time
• How many different states ?
• If there are N states, the NFA must be in some subset of those N states
• How many subsets are there?
– 2^N - 1 = finitely many

NFA to DFA Example
[Figure: the NFA for (1+0)*1 with states A to J, repeated from the earlier example]
[Figure: the resulting DFA with states ABCDHI, FGABCDHI and EJGABCDHI and edges labeled 0 and 1]

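The subset construction can be run on this example to reproduce the three DFA states. The NFA diagram is garbled in the source, so the edge list below is an assumption: it is one reading of the diagram, chosen because it yields exactly the subsets ABCDHI, FGABCDHI and EJGABCDHI shown in the example.

```python
EPS = None  # label used here for ε-moves

# Assumed edge list for the NFA of (1+0)*1 (reconstructed, see lead-in).
NFA = {
    ("A", EPS): {"B", "H"},
    ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPS): {"G"},
    ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"},
    ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
START, ACCEPTING, ALPHABET = "A", {"J"}, "01"

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in NFA.get((q, EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def subset_construction():
    start = eps_closure({START})      # DFA start state
    table, worklist = {}, [start]
    while worklist:
        S = worklist.pop()
        if S in table:
            continue
        table[S] = {}
        for a in ALPHABET:
            moved = set()
            for q in S:
                moved |= NFA.get((q, a), set())
            if moved:                 # only non-empty subsets become DFA states
                T = eps_closure(moved)
                table[S][a] = T
                worklist.append(T)
    accepting = {S for S in table if S & ACCEPTING}
    return start, table, accepting

start, table, accepting = subset_construction()
for S, row in table.items():
    print("".join(sorted(S)), {a: "".join(sorted(T)) for a, T in row.items()})
```

Only subsets that are actually reachable from the start state are built, which is why this example produces 3 DFA states rather than 2^10 - 1.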
Implementation
• A DFA can be implemented by a 2D table T
– One dimension is “states”
– Other dimension is “input symbols”
– For every transition Si →a Sk define T[i,a] = k
• DFA “execution”
– If in state Si and input a, read T[i,a] = k and skip to state Sk
– Very efficient

Table Implementation of a DFA
[Figure: DFA with states S, T and U and edges labeled 0 and 1]

      0   1
  S   T   U
  T   T   U
  U   T   U
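The table for states S, T and U translates directly into a 2D array indexed by state and input symbol. In this sketch the choice of S as start state and U as the only accepting state is an assumption (it matches the DFA derived for (1+0)*1); the table entries themselves are taken from the slide.

```python
# States are numbered so they can index the table: T[i][a] = k.
S, T, U = 0, 1, 2

TABLE = [
    # on 0, on 1
    [T, U],   # row for S
    [T, U],   # row for T
    [T, U],   # row for U
]
ACCEPTING = {U}   # assumption: U is the accepting state

def run(text):
    state = S                           # assumption: S is the start state
    for ch in text:
        state = TABLE[state][int(ch)]   # one table lookup per input character
    return state in ACCEPTING

print(run("0101"), run("10"))
```

The loop does a single pass over the input with one lookup per character, which is what makes the table-driven form efficient.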

Implementation (Cont.)
• NFA → DFA conversion is at the heart of tools such as lex, ML-Lex or flex
• But, DFAs can be huge
• In practice, lex/ML-Lex/flex-like tools trade off speed for space in the choice of NFA and DFA representations

Theory vs. Practice
Two differences:
• DFAs recognize lexemes. A lexer must return a type of acceptance (token type) rather than simply an accept/reject indication.
• DFAs consume the complete string and accept or reject it. A lexer must find the end of the lexeme in the input stream and then find the next one, etc.
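Both differences can be handled with a small change to DFA execution: label each accepting state with a token type, and remember the last accepting position seen while scanning. The sketch below is one common way to do this, not necessarily what lex or flex do internally; the DFA, the `char_class` helper, and the state names are hypothetical and cover only identifiers and integers.

```python
def char_class(ch):
    """Map a character to an input class (hypothetical helper)."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Hypothetical DFA: Identifier = letter (letter + digit)*, Integer = digit+
TABLE = {
    ("start", "letter"): "ident",
    ("start", "digit"): "int",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("int", "digit"): "int",
}
TOKEN_OF = {"ident": "Identifier", "int": "Integer"}  # accepting state -> token type

def next_token(text, pos):
    """Run the DFA from pos; return (token type, end) of the longest lexeme, or None."""
    state, last, i = "start", None, pos
    while i < len(text):
        state = TABLE.get((state, char_class(text[i])))
        if state is None:            # no transition possible: stop scanning
            break
        i += 1
        if state in TOKEN_OF:        # remember the most recent accepting position
            last = (TOKEN_OF[state], i)
    return last

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        found = next_token(text, pos)
        if found is None:            # nothing matches: report one bad character
            out.append(("Error", text[pos]))
            pos += 1
        else:
            kind, end = found
            out.append((kind, text[pos:end]))
            pos = end                # continue with the next lexeme
    return out

print(tokenize("x1 42 12ab"))
```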
