3/8/2019

Limits of R.E

Context-Free Grammars

The Limits of Regular Languages

When scanning, we used regular expressions to define each token.
Unfortunately, regular expressions are (usually) too weak to define programming languages.
Cannot define a regular expression matching all expressions with properly balanced parentheses.
Cannot define a regular expression matching all functions with properly nested block structure.
We need a more powerful formalism.

Informal Comments

A context-free grammar is a notation for describing syntactic specifications of programming languages.
Gives precise and easy-to-understand syntactic specifications.
An efficient parser can be constructed automatically.
Imparts structure to a program, which is useful for translating it into object code and for finding errors.
It is more powerful than finite automata or REs, but still cannot define all possible languages.
Useful for nested structures, e.g., parentheses in programming languages.
Programming languages have a recursive structure, which suits CFGs.
Grammars for programming languages are often written in BNF (Backus-Naur Form).
The syntax analyzer checks the input against the specified grammar and builds a parse tree.


Syntax analysis

Grammars are often written in Backus-Naur form (BNF).
Example:
1. <goal> ::= <expr>
2. <expr> ::= <expr> <op> <expr>
3.          | num
4.          | id
5. <op>   ::= +
6.          | -
7.          | *
8.          | /
In a BNF for a grammar, we represent
1. non-terminals with <angle brackets> or CAPITAL LETTERS
2. terminals with typewriter font or underline
3. productions as in the example

Scanning vs. parsing

Where do we draw the line?
term ::= [a-zA-Z] ( [a-zA-Z] | [0-9] )*
       | 0 | [1-9][0-9]*
op   ::= + | - | * | /
expr ::= (term op)* term
Regular expressions:
- Normally used to classify identifiers, numbers, keywords …
- Simpler and more concise for tokens than a grammar
- More efficient scanners can be built from REs
CFGs are used to impose structure
- Brackets: (), begin … end, if … then … else
- Expressions, declarations …
Factoring out lexical analysis simplifies the compiler
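The split between scanning and parsing can be sketched as follows. This is a minimal illustration, assuming Python's standard `re` module; the names `TOKEN_RE`, `tokenize` and `is_expr` are hypothetical and not part of the slides. Regular expressions classify the tokens, and a separate check imposes the structure expr ::= (term op)* term.

```python
import re

# Token patterns taken from the slide: a term is an identifier or a number,
# an op is one of + - * /. TOKEN_RE is an illustrative name.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<term>[a-zA-Z][a-zA-Z0-9]*|0|[1-9][0-9]*)|(?P<op>[-+*/]))"
)

def tokenize(text):
    """Scanner: split text into ('term', lexeme) and ('op', lexeme) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind = m.lastgroup              # name of the group that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

def is_expr(tokens):
    """Structure check for expr ::= (term op)* term on the token kinds."""
    kinds = [kind for kind, _ in tokens]
    if len(kinds) % 2 == 0:             # must be term (op term)*: odd length
        return False
    return all(k == ("term" if i % 2 == 0 else "op")
               for i, k in enumerate(kinds))
```

For instance, `is_expr(tokenize("x1 + 42 * y"))` holds, while `is_expr(tokenize("x1 +"))` does not.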

CFG vs R.E

CFG is more expressive than RE:
Any language that can be generated using regular expressions can be generated by a context-free grammar.
There are languages that can be generated by a context-free grammar that cannot be generated by any regular expression.
As a corollary, CFGs are strictly more powerful than DFAs and NDFAs.
The proof is in two parts:
Given a regular expression R, we can generate a CFG G such that L(R) == L(G).
We can define a grammar G for which there is no FA F such that L(F) == L(G).

So why not just use CFGs instead of regular expressions ("one big parser")?

1. Don't need the full power of CFGs.
2. Regular expressions are more concise and easier to understand.
3. Can build more efficient lexers from regular expressions.
4. Get a more modular design by separating lexer from parser (easier to design and maintain).


Definition: A context-free grammar (CFG), consisting of a finite set of grammar rules, is a quadruple (N, T, P, S) where
N is a set of non-terminal symbols.
T is a set of terminals, where N ∩ T = ∅.
P is a set of rules, P: N → (N ∪ T)*.
S is the start symbol.
Example
The grammar ({A}, {a, b, c}, P, A), P: A → aA, A → abc.
The grammar ({S}, {a, b}, P, S), P: S → aSa, S → bSb, S → ε
The grammar ({S, F}, {0, 1}, P, S), P: S → 00S | 11F, F → 00F | ε

CFG Formalism

Terminals = symbols of the alphabet of the language being defined.
Variables = nonterminals = a finite set of other symbols, each of which represents a language.
Start symbol = the variable (nonterminal) whose language is the one being defined.
A production (or rewriting rule) has the form variable -> string of variables and terminals.
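The quadruple (N, T, P, S) maps directly onto a small data structure. The sketch below is one possible encoding, not a standard one: the class name `CFG`, its field names and the helper `is_well_formed` are assumptions made for illustration. It checks the side conditions stated in the definition.

```python
from typing import NamedTuple

class CFG(NamedTuple):
    """Illustrative encoding of the quadruple (N, T, P, S)."""
    nonterminals: frozenset
    terminals: frozenset
    productions: tuple   # pairs (head, body); body is a tuple of symbols, () for ε
    start: str

def is_well_formed(g):
    """Check the side conditions of the definition."""
    symbols = g.nonterminals | g.terminals
    return (
        not (g.nonterminals & g.terminals)        # N ∩ T = ∅
        and g.start in g.nonterminals             # S is a non-terminal
        and all(head in g.nonterminals and all(s in symbols for s in body)
                for head, body in g.productions)  # each rule maps N to (N ∪ T)*
    )

# The second example grammar from the slide: S → aSa, S → bSb, S → ε
example = CFG(
    frozenset({"S"}),
    frozenset({"a", "b"}),
    (("S", ("a", "S", "a")), ("S", ("b", "S", "b")), ("S", ())),
    "S",
)
```

Here `is_well_formed(example)` holds; a quadruple whose terminal set overlaps its non-terminal set would be rejected.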

A CFG for Arithmetic Expressions

<expression> --> number
<expression> --> ( <expression> )
<expression> --> <expression> + <expression>
<expression> --> <expression> - <expression>
<expression> --> <expression> * <expression>
<expression> --> <expression> / <expression>

The only nonterminal symbol in this grammar is <expression>, which is also the start symbol. The terminal symbols are {+,-,*,/,(,),number}.

A CFG for C++ compound statements:

<compound stmt> --> { <stmt list> }
<stmt list> --> <stmt> <stmt list> | epsilon
<stmt> --> <compound stmt>
<stmt> --> if ( <expr> ) <stmt>
<stmt> --> if ( <expr> ) <stmt> else <stmt>
<stmt> --> while ( <expr> ) <stmt>
<stmt> --> do <stmt> while ( <expr> ) ;
<stmt> --> for ( <stmt> <expr> ; <expr> ) <stmt>
<stmt> --> case <expr> : <stmt>
<stmt> --> switch ( <expr> ) <stmt>
<stmt> --> break ; | continue ;
<stmt> --> return <expr> ; | goto <id> ;


Convention:

Lower case names like expression, statement, operator and A, B, C, … are variables.
a, b, c, …, operator symbols, punctuation such as the comma and parentheses, the digits 0, 1, 2, …, 9, and id or if are terminals.
Capital symbols near the end of the alphabet, chiefly X, Y, Z, represent grammar symbols that are either non-terminals or terminals.
…, w, x, y, z are strings of terminals only.
α, β, γ, … are strings of terminals and/or variables.

Example: Formal CFG

Here is a formal CFG for {0^n1^n | n ≥ 1}.
Terminals = {0, 1}.
Variables = {S}.
Start symbol = S.
Productions =
S -> 01
S -> 0S1
Basis: 01 is in the language.
Induction: if w is in the language, then so is 0w1.
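The basis and induction for this grammar translate into a recursive membership test. The function name `in_language` is an assumption for illustration; the sketch simply mirrors the two productions S -> 01 and S -> 0S1.

```python
def in_language(w):
    """Membership test for {0^n 1^n | n >= 1}, following S -> 01 | 0S1."""
    if w == "01":                       # basis: S -> 01
        return True
    if len(w) >= 4 and w[0] == "0" and w[-1] == "1":
        return in_language(w[1:-1])     # induction: S -> 0S1 wraps a shorter member
    return False
```

For example, `in_language("000111")` holds, while the empty string and `"0101"` are rejected.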

Conversion of NFA to CFG

Conversion of RE to CFG

The basic idea is to use "variables" to stand for sets of strings (i.e., languages).
These variables are defined recursively, in terms of one another.
Recursive rules ("productions") involve only concatenation.
Alternative rules for a variable allow union.


Conversion of RE to CFG

The syntax for regular expressions does not carry over to CFGs.
Cannot use *, +, or parentheses.
R.E: a*b
• S → Ab
• A → Aa | ε
R.E: a(b|c*)
• S → aX
• X → (b|c*), which is rewritten step by step:
  – X → b | c*
  – X → b | C
  – C → Cc | ε

Example: (a | b)*abb
A0 → aA0 | bA0 | aA1
A1 → bA2
A2 → bA3
A3 → ε
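One way to gain confidence in such a conversion is to compare, up to a length bound, the strings a grammar generates with the strings the regular expression matches. The sketch below does this by brute force; `language_up_to` and `regex_language` are hypothetical helper names, single uppercase letters are assumed to be the variables, and the pruning rule is only claimed to terminate for small grammars like the two shown here.

```python
import re
from collections import deque
from itertools import product

def language_up_to(grammar, start, max_len):
    """All terminal strings of length <= max_len derivable from start.

    grammar maps each variable (one uppercase letter) to a list of bodies;
    "" stands for ε. Forms with more than max_len terminals are pruned.
    """
    results, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        i = next((k for k, ch in enumerate(form) if ch in grammar), None)
        if i is None:                        # no variables left: a terminal string
            results.add(form)
            continue
        for body in grammar[form[i]]:        # expand the leftmost variable
            new = form[:i] + body + form[i + 1:]
            terminals = sum(ch not in grammar for ch in new)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results

def regex_language(pattern, alphabet, max_len):
    """All strings over alphabet of length <= max_len that fully match pattern."""
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)
            if re.fullmatch(pattern, "".join(p))}

g1 = {"S": ["Ab"], "A": ["Aa", ""]}                    # grammar for a*b
g2 = {"S": ["aX"], "X": ["b", "C"], "C": ["Cc", ""]}   # grammar for a(b|c*)
```

Agreement up to a bound is evidence, not a proof, that the grammar and the regular expression define the same language.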

Derivations – Intuition

We derive strings in the language of a CFG by starting with the start symbol, and repeatedly replacing some variable A by the right side of one of its productions.
That is, the "productions for A" are those that have A on the left side of the ->.

Derivations – Formalism

We say αAβ => αγβ if A -> γ is a production and α and β are arbitrary strings of grammar symbols.
Example: S -> 01; S -> 0S1.
S => 0S1 => 00S11 => 000111.


Iterated Derivation

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ.
⇒* and ⇒+ denote derivations of ≥0 and ≥1 steps.
Basis: α =>* α for any string α.
Induction: if α =>* β and β => γ, then α =>* γ.

Example: Iterated Derivation

S -> 01; S -> 0S1.
S => 0S1 => 00S11 => 000111.
So
S =>* S;
S =>* 0S1;
S =>* 00S11;
S =>* 000111.
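The single-step relation => and its closure =>* can be sketched directly from the basis and induction. The names `one_step` and `derives` are illustrative; the length-based pruning assumes the grammar has no ε-productions (true for S -> 01 | 0S1), so sentential forms never shrink.

```python
def one_step(form, productions):
    """All forms reachable in one step: replace one variable by one of its bodies."""
    out = set()
    for i, sym in enumerate(form):
        for body in productions.get(sym, []):
            out.add(form[:i] + body + form[i + 1:])
    return out

def derives(alpha, beta, productions, max_steps=10):
    """Bounded check of alpha =>* beta: zero steps (basis) or repeated single steps."""
    frontier = {alpha}
    for _ in range(max_steps + 1):
        if beta in frontier:
            return True
        # Assumes no ε-productions, so forms longer than beta can be dropped.
        frontier = {g for f in frontier for g in one_step(f, productions)
                    if len(g) <= len(beta)}
    return False

P = {"S": ["01", "0S1"]}
```

With this `P`, `derives("S", "S", P)`, `derives("S", "00S11", P)` and `derives("S", "000111", P)` all hold, matching the four =>* facts on the slide.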

Language of a Grammar

If G is a CFG, then L(G), the language of G, is {w | S =>* w}.
Note: w must be a terminal string, S is the start symbol.
w is in L(G) iff S =>+ w. w is called a sentence of G.
Example: G has productions S -> ε and S -> 0S1. Find L(G).
L(G) = {0^n1^n | n ≥ 0}. Note: ε is a legitimate right side.

Leftmost and Rightmost Derivations

Derivations allow us to replace any of the variables in a string.
This leads to many different derivations of the same string.
By forcing the leftmost variable (or alternatively, the rightmost variable) to be replaced, we avoid these "distinctions without a difference."


Leftmost Derivations

Say wAα =>lm wβα if w is a string of terminals only and A -> β is a production.
Also, α =>*lm β if α becomes β by a sequence of 0 or more =>lm steps.

Example: Leftmost Derivations

Balanced-parentheses grammar:
S -> SS | (S) | ()
S =>lm SS =>lm (S)S =>lm (())S =>lm (())()
Thus, S =>*lm (())()
S => SS => S() => (S)() => (())() is a derivation, but not a leftmost derivation.

Rightmost Derivations

Say αAw =>rm αβw if w is a string of terminals only and A -> β is a production.
Also, α =>*rm β if α becomes β by a sequence of 0 or more =>rm steps.

Example: Rightmost Derivations

Balanced-parentheses grammar:
S -> SS | (S) | ()
S =>rm SS =>rm S() =>rm (S)() =>rm (())()
Thus, S =>*rm (())()
S => SS => SSS => S()S => ()()S => ()()() is neither a rightmost nor a leftmost derivation.
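The leftmost and rightmost derivations of (())() can be replayed mechanically. In this sketch `PRODUCTIONS`, `step` and `derive` are illustrative names; a derivation is given as the list of production bodies to apply, and each step expands either the leftmost or the rightmost variable.

```python
PRODUCTIONS = {"S": ["SS", "(S)", "()"]}   # balanced-parentheses grammar

def step(form, body, leftmost=True):
    """Replace the leftmost (or rightmost) variable in form by body."""
    positions = [i for i, ch in enumerate(form) if ch in PRODUCTIONS]
    if not positions:
        raise ValueError("no variable left to expand")
    i = positions[0] if leftmost else positions[-1]
    if body not in PRODUCTIONS[form[i]]:
        raise ValueError(f"{form[i]} -> {body} is not a production")
    return form[:i] + body + form[i + 1:]

def derive(bodies, leftmost=True, start="S"):
    """Apply the given bodies in order and return every sentential form."""
    forms = [start]
    for body in bodies:
        forms.append(step(forms[-1], body, leftmost))
    return forms

lm = derive(["SS", "(S)", "()", "()"], leftmost=True)
rm = derive(["SS", "()", "(S)", "()"], leftmost=False)
```

Both `lm` and `rm` end in `(())()`, but they pass through different intermediate sentential forms, as on the slides.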


Sentential Forms

Any string of variables and/or terminals derived from the start symbol is called a sentential form.
Formally, α is a sentential form iff S =>* α.
If S =>*lm α we say that α is a left-sentential form, and if S =>*rm α we say that α is a right-sentential form.

For example, E ∗ (I + E) is a sentential form, since
E ⇒ E∗E ⇒ E∗(E) ⇒ E∗(E+E) ⇒ E∗(I+E)
This derivation is neither leftmost nor rightmost.
Example: a ∗ E is a left-sentential form, since
E ⇒lm E ∗ E ⇒lm I ∗ E ⇒lm a ∗ E
Example: E∗(E+E) is a right-sentential form, since
E ⇒rm E ∗ E ⇒rm E ∗ (E) ⇒rm E ∗ (E + E)

Parse Trees

Parse trees are trees labeled by symbols of a particular CFG.
Leaves: labeled by a terminal or ε.
Interior nodes: labeled by a variable.
Children are labeled by the right side of a production for the parent.
Root: must be labeled by the start symbol.

Constructing a parse tree

Let G = (V, T, P, S) be a CFG. A tree is a parse tree for G if:
1. Each interior node is labeled by a variable in V.
2. Each leaf is labeled by a symbol in V ∪ T ∪ {ε}. Any ε-labeled leaf is the only child of its parent.
3. If an interior node is labeled A, and its children (from left to right) are labeled X1, X2, ..., Xk, then A → X1X2...Xk ∈ P.


Example: Parse Tree

S -> SS | (S) | ()

          S
        /   \
       S     S
     / | \   / \
    (  S  ) (   )
      / \
     (   )

Yield of a Parse Tree

The concatenation of the labels of the leaves in left-to-right order (that is, in the order of a preorder traversal) is called the yield (or frontier) of the parse tree.
Example: the yield of the parse tree above is (())()
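The yield can be computed by a left-to-right traversal of the leaves. The nested-tuple encoding below is an assumption made for illustration: an interior node is `(label, child, ...)` and a leaf is a plain string. `is_parse_tree` checks that every interior node and its children form a production of the balanced-parentheses grammar.

```python
PRODUCTIONS = {"S": {("S", "S"), ("(", "S", ")"), ("(", ")")}}

# The parse tree of (())(): root S with children S S.
tree = ("S",
        ("S", "(", ("S", "(", ")"), ")"),
        ("S", "(", ")"))

def label(node):
    return node if isinstance(node, str) else node[0]

def is_parse_tree(node):
    """Every interior node A with children X1..Xk must satisfy A -> X1..Xk in P."""
    if isinstance(node, str):
        return True
    head, *children = node
    return (tuple(label(c) for c in children) in PRODUCTIONS.get(head, ())
            and all(is_parse_tree(c) for c in children))

def tree_yield(node):
    """Concatenate leaf labels left to right; an ε leaf contributes nothing."""
    if isinstance(node, str):
        return "" if node == "ε" else node
    _head, *children = node
    return "".join(tree_yield(c) for c in children)
```

Here `tree_yield(tree)` gives the string `(())()`.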

Importance of Parse Tree

If w ∈ L(G), for some CFG, then w has a parse tree, which tells us the (syntactic) structure of w.
w could be a program, a SQL query, an XML document, etc.
Parse trees are an alternative representation to derivations and recursive inferences.
There can be several parse trees for the same string.
Ideally there should be only one parse tree (the "true" structure) for each string, i.e. the language should be unambiguous. Unfortunately, we cannot always remove the ambiguity.

Parse Trees, Left- and Rightmost Derivations

For every parse tree, there is a unique leftmost, and a unique rightmost derivation.


Ambiguous Grammars

A CFG is ambiguous if there is a string in the language that is the yield of two or more parse trees.
Example: S -> SS | (S) | ()
Two parse trees for ()()() on next slide.

Example – Continued

          S                      S
        /   \                  /   \
       S     S                S     S
      / \   / \              / \   / \
     S   S (   )            (   ) S   S
    / \ / \                      / \ / \
   (  ) (  )                    (  ) (  )
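For this particular grammar the number of parse trees of a string can be counted by trying every production on every substring. The function name `count_parse_trees` is hypothetical, and the sketch is hard-wired to S -> SS | (S) | (); a count above 1 for some string shows that the grammar is ambiguous.

```python
from functools import lru_cache

def count_parse_trees(w):
    """Number of parse trees with yield w for S -> SS | (S) | ()."""
    @lru_cache(maxsize=None)
    def count(i, j):                    # trees whose yield is w[i:j]
        if j - i < 2:
            return 0
        total = 0
        if w[i:j] == "()":              # S -> ()
            total += 1
        if w[i] == "(" and w[j - 1] == ")":
            total += count(i + 1, j - 1)            # S -> (S)
        for k in range(i + 1, j):                   # S -> SS, split at k
            total += count(i, k) * count(k, j)
        return total
    return count(0, len(w))
```

`count_parse_trees("()()()")` is 2, matching the two trees above, while `count_parse_trees("(())()")` is 1.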

Ambiguity

Thus, equivalent definitions of "ambiguous grammar" are:
1. There is a string in the language that has two different leftmost derivations.
2. There is a string in the language that has two different rightmost derivations.

Ambiguous Grammar

E -> E + E | E * E | id
V = {E}, T = {+, *, id}
Derive id + id * id
2 approaches: leftmost or rightmost

Leftmost derivation        Rightmost derivation
E => E + E                 E => E + E
  => id + E                  => E + E * E
  => id + E * E              => E + E * id
  => id + id * E             => E + id * id
  => id + id * id            => id + id * id


Is this the only possible way to derive the string? No.

Leftmost derivation        Rightmost derivation
E => E * E                 E => E * E
  => E + E * E               => E * id
  => id + E * E              => E + E * id
  => id + id * E             => E + id * id
  => id + id * id            => id + id * id

Two different parse trees for the same string

Considering leftmost derivations, the same string has two different parse trees.
Problems for parsers: which one is the correct derivation?
Example: 2 + 3 * 4
First tree: 2 + (3 * 4)
Second tree: (2 + 3) * 4
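Why the choice of tree matters to a compiler can be seen by evaluating both parse trees of 2 + 3 * 4. The encoding of a tree as a number or a triple `(left, op, right)` and the names `evaluate`, `first_tree` and `second_tree` are assumptions for illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Evaluate a parse tree given as a number or a triple (left, op, right)."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))

first_tree = (2, "+", (3, "*", 4))     # E => E + E first: * sits lower in the tree
second_tree = ((2, "+", 3), "*", 4)    # E => E * E first: + sits lower in the tree
```

The first tree evaluates to 14 and the second to 20, so the two parse trees give the same string different meanings.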

How to find out whether a grammar is ambiguous or not?

There is no algorithm for converting an arbitrary ambiguous grammar into an unambiguous one.
Some languages are inherently ambiguous, meaning that no unambiguous grammar exists for them.
There is no algorithm for detecting whether an arbitrary grammar is ambiguous.
Find a string that produces more than one parse tree.
Example: S -> aS | Sa | a. For the string w = aa, we can build two parse trees:

      S            S
     / \          / \
    S   a        a   S
    |                |
    a                a

S -> aSbS | bSaS | epsilon
w = abab
Two parse trees, written in bracket form ($ marks epsilon):
Tree 1: S[ a S[ b S[$] a S[$] ] b S[$] ]
Tree 2: S[ a S[$] b S[ a S[$] b S[$] ] ]


Regular Expression Grammar

R -> R + R | RR | R* | a | b | c
Parse the expression a + bc: two parse trees are possible.
1. In tree 1, concatenation has higher precedence, as it is at a lower level.
2. In tree 2 it is the opposite.

E -> E + E | E * E | id
generated two different parse trees for id + id * id, as shown earlier.
Take another string: id + id + id

This means the grammar is ambiguous.

Reasons:
Association of operators (left or right).
For +, this has no effect on the final output, but it affects other operators like -.
+ is left associative, so which tree is correct?
In id + id * id, * has to be evaluated first, so one tree is wrong (precedence).

Conversion to non ambiguous Grammar

If a grammar can be made unambiguous at all, it is usually made unambiguous through layering.
Have exactly one way to build each piece of the string.
Have exactly one way of combining those pieces back together.
Restrict the growth of the parse tree to a particular direction.
Consider
• E -> E + id | id
For id + id + id:

          E
        / | \
       E  +  id
     / | \
    E  +  id
    |
    id

The grammar is defined as left recursive.
To make an operator left associative, make the grammar left recursive.


For Precedence

The new grammar may be
E -> E + T
T -> T * F
Points to note:
- In order to derive * we have to go through +.
- Once we have generated *, there is no way to generate + again, i.e. all + in the expression will be generated first.
Problem in the grammar: what if the expression does not contain +?
- Therefore add new productions to generate T from E:
- E -> T and T -> F
E -> E + T | T
T -> T * F | F

More issues?

Add a new operator ↑ (raise to power).
↑ is right associative (2 ↑ (3 ↑ 2)).
How to add ↑ to the grammar:
Make the tree grow towards the right-hand side.
Add productions like
F -> G ↑ F | G
The complete grammar becomes
E -> E + T | T
T -> T * F | F
F -> G ↑ F | G
G -> id
What will be the precedence of +, *, ↑?
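The effect of the layered grammar on precedence and associativity can be checked with a small recursive-descent evaluator. This is a sketch under stated assumptions: numbers stand in for id, `^` stands in for ↑, the left-recursive rules are unrolled into loops (a common hand-written-parser technique, not something the slides prescribe), and the function names are illustrative. Characters that are not tokens are silently skipped by `re.findall`.

```python
import re

def evaluate(text):
    """Evaluate text with E -> E+T | T, T -> T*F | F, F -> G^F | G, G -> number."""
    tokens = re.findall(r"\d+|[+*^]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_e():                  # E -> E + T | T, unrolled into a loop
        value = parse_t()
        while peek() == "+":
            advance()
            value = value + parse_t()   # combines left to right
        return value

    def parse_t():                  # T -> T * F | F
        value = parse_f()
        while peek() == "*":
            advance()
            value = value * parse_f()
        return value

    def parse_f():                  # F -> G ^ F | G, right recursive
        base = parse_g()
        if peek() == "^":
            advance()
            return base ** parse_f()    # right associative
        return base

    def parse_g():                  # G -> id (a number in this sketch)
        tok = peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(advance())

    result = parse_e()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

`evaluate("2+3*4")` gives 14 because * is generated at a lower layer than +, and `evaluate("2^3^2")` gives 512, i.e. 2 ↑ (3 ↑ 2), because the rule for ↑ is right recursive.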

Another Example

bExpr -> bExpr or bExpr | bExpr and bExpr | not bExpr | True | False
Find the precedence and associativity of the operators, and generate a non-ambiguous grammar.
E -> E or F | F
F -> F and G | G
G -> not G | True | False

Regular Expression Grammar

Ambiguous:
R -> R + R | RR | R* | a | b | c
Unambiguous:
R -> R + T | T
T -> TF | F
F -> F* | a | b | c


A -> A $ B | B
B -> B # C | C
C -> C @ D | D
D -> d
Find out the precedence and associativity of the operators.
$ is left recursive, therefore left associative.
# is left recursive, therefore left associative.
Same for @.
For $: the $ on the left-hand side shall be executed first ($ .> $).
For #: the # on the left-hand side shall be executed first (# .> #).
For @: the @ on the left-hand side shall be executed first (@ .> @).

Precedence

$ has the least precedence, followed by #, then @.

Another Example?

E -> E * F | F + E | F
F -> F - F | id
• Association
  • * .> *
  • + <. +
  • - (???)
• Precedence
  • * == +
  • - ????

Ambiguity is a Property of Grammars, not Languages

For the balanced-parentheses language, here is another CFG, which is unambiguous.
B -> (RB | ε
R -> ) | (RR
B, the start symbol, derives balanced strings.
R generates strings that have one more right paren than left.

Example: Unambiguous Grammar

B -> (RB | ε    R -> ) | (RR
Construct a unique leftmost derivation for a given balanced string of parentheses by scanning the string from left to right.
If we need to expand B, then use B -> (RB if the next symbol is "(" and ε if at the end.
If we need to expand R, use R -> ) if the next symbol is ")" and (RR if it is "(".


The Parsing Process

B -> (RB | ε    R -> ) | (RR

Remaining input    Steps of leftmost derivation
(())()             B
())()              (RB
))()               ((RRB
)()                (()RB
()                 (())B
)                  (())(RB
(none)             (())()B
(none)             (())()
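The parsing process traced above can be sketched as a predictive parser that chooses each production from the next input symbol alone. The function name `leftmost_derivation` and the list-based stack are illustrative choices, not taken from the slides.

```python
def leftmost_derivation(text):
    """Leftmost derivation of text under B -> (RB | ε, R -> ) | (RR."""
    done = ""            # terminals already matched against the input
    stack = ["B"]        # symbols still to process, leftmost first
    steps = ["B"]
    pos = 0
    while stack:
        top = stack.pop(0)
        nxt = text[pos] if pos < len(text) else None
        if top == "B":   # B -> (RB on "(", B -> ε at the end of the input
            body = "(RB" if nxt == "(" else "" if nxt is None else None
        elif top == "R": # R -> ) on ")", R -> (RR on "("
            body = ")" if nxt == ")" else "(RR" if nxt == "(" else None
        else:            # terminal: must match the next input symbol
            if top != nxt:
                raise SyntaxError(f"expected {top!r} at position {pos}")
            done += top
            pos += 1
            continue
        if body is None:
            raise SyntaxError(f"cannot expand {top} at position {pos}")
        stack = list(body) + stack
        steps.append(done + "".join(stack))
    if pos != len(text):
        raise SyntaxError("input left over")
    return steps
```

Calling `leftmost_derivation("(())()")` reproduces the eight sentential forms of the trace, from `B` to `(())()`; an unbalanced input such as `())` raises `SyntaxError`.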


LL(1) Grammars

As an aside, a grammar such as B -> (RB | ε, R -> ) | (RR, where you can always figure out the production to use in a leftmost derivation by scanning the given string left-to-right and looking only at the next symbol, is called LL(1).
"Leftmost derivation, left-to-right scan, one symbol of lookahead."
