1/13/15, 5:10 AM
Table of Contents
1 About this document
1.1 What I want...
1.2 I achieve enlightenment
2 The Language Definition
3 The Source Text
4 The Scanner
4.1 The Character Class
4.2 The Scanner Class
4.3 A driver for the scanner
4.4 A bit of source code for the scanner to process
5 The Lexer
5.1 What is whitespace?
6 The Rules for a Programming Language
6.1 Tokenizing Rules
6.2 Writing a Lexer
6.3 Specifying keywords, symbols, token types, etc.
6.4 A Token class
6.5 A Lexer class
6.6 A driver for the lexer
6.7 Some source text to test the lexer
7 The Parser
7.1 Writing a Recursive Descent Parser
8 I achieve enlightenment (again)
8.1 Generating an Abstract Syntax Tree (AST)
8.2 A Node class for the AST
8.3 A recursive descent parser
8.4 A driver program for the recursive descent parser
8.5 The AST it generated
http://parsingintro.sourceforge.net/
Here are some notes that I made during that project. Maybe they will be of use to you.
Scanner: This is the first module in a compiler or interpreter. Its job is to read the source file one
character at a time. It also keeps track of the line number and column position of the character
currently being read. For now, assume that each time the scanner is called, it returns the next
character in the file.
Lexer: This module breaks the source file up into chunks (called tokens). It calls the scanner to get
characters one at a time, groups them into tokens, and identifies each token's type for the language
parser (which is the next stage).
Parser: This is the part of the compiler that really understands the syntax of the language. It calls the
lexer to get tokens and processes the tokens per the syntax of the language.
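The division of labor just described can be sketched end to end. The toy pipeline below is my own illustration (all names are hypothetical, not from this article); the article's real modules follow in the sections below.

```python
# A toy illustration of the scanner -> lexer -> parser layering.
# The real scanner/lexer/parser in this article are much richer.

def scanner(source_text):
    """Yield the source one character at a time, then an end marker."""
    for ch in source_text:
        yield ch
    yield "\0"  # end-of-file marker

def lexer(chars):
    """Group characters into whitespace-separated word tokens."""
    word = ""
    for ch in chars:
        if ch.isspace() or ch == "\0":
            if word:
                yield word
                word = ""
        else:
            word += ch

def parse(source_text):
    """A stand-in 'parser' that just collects the tokens it is handed."""
    return list(lexer(scanner(source_text)))

print(parse("alpha = 16 ;"))   # ['alpha', '=', '16', ';']
```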
4 The Scanner
The scanner's job is to read the source file one character at a time. For each character, it keeps
track of the line and character position where the character was found. Each time the scanner is
called, it reads the next character from the file and returns it.
So let's write a scanner.
ENDMARK = "\0"   # aka "lowvalues"

#----------------------------------------------------------------------
#
#                              Character
#
#----------------------------------------------------------------------
class Character:
    """
    A Character object holds
        - one character (self.cargo)
        - the index of the character's position in the sourceText.
        - the index of the line where the character was found in the sourceText.
        - the index of the column in the line where the character was found in the sourceText.
        - (a reference to) the entire sourceText (self.sourceText)

    This information will be available to a token that uses this character.
    If an error occurs, the token can use this information to report the
    line/column number where the error occurred, and to show an image of the
    line in sourceText where the error occurred.
    """
    #------------------------------------------------------------------
    #
    #------------------------------------------------------------------
    def __init__(self, c, lineIndex, colIndex, sourceIndex, sourceText):
        """
        In Python, the __init__ method is the constructor.
        """
        self.cargo       = c
        self.sourceIndex = sourceIndex
        self.lineIndex   = lineIndex
        self.colIndex    = colIndex
        self.sourceText  = sourceText

    #------------------------------------------------------------------
    # return a displayable string representation of the Character object
    #------------------------------------------------------------------
    def __str__(self):
        """
        In Python, the __str__ method returns a string representation
        of an object. In Java, this would be the toString() method.
        """
        cargo = self.cargo
        if   cargo == " "     : cargo = "   space"
        elif cargo == "\n"    : cargo = "   newline"
        elif cargo == "\t"    : cargo = "   tab"
        elif cargo == ENDMARK : cargo = "   eof"

        return (
              str(self.lineIndex).rjust(6)
            + str(self.colIndex).rjust(4)
            + "  "
            + cargo
            )
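The Character class is only half of the machinery; section 4.2 covers the Scanner class itself. As a minimal sketch, a scanner that produces these Character objects could look like the reconstruction below. The class shape and the get()-per-character interface are my assumptions, consistent with the Character class above and with the driver output shown later.

```python
# A minimal, illustrative scanner producing Character objects.
# The interface (one get() call per character) is an assumption.
ENDMARK = "\0"

class Character:
    # condensed version of the Character class shown above
    def __init__(self, c, lineIndex, colIndex, sourceIndex, sourceText):
        self.cargo       = c
        self.lineIndex   = lineIndex
        self.colIndex    = colIndex
        self.sourceIndex = sourceIndex
        self.sourceText  = sourceText

class Scanner:
    def __init__(self, sourceText):
        self.sourceText  = sourceText
        self.lastIndex   = len(sourceText) - 1
        self.sourceIndex = -1   # index of the last character read
        self.lineIndex   = 0
        self.colIndex    = -1

    def get(self):
        """Return the next character in sourceText as a Character object."""
        self.sourceIndex += 1
        if self.sourceIndex > 0:
            # if the previous character was a newline, bump the line counter
            if self.sourceText[self.sourceIndex - 1] == "\n":
                self.lineIndex += 1
                self.colIndex = -1
        self.colIndex += 1
        if self.sourceIndex > self.lastIndex:
            # we are past the end of sourceText: return the end marker
            return Character(ENDMARK, self.lineIndex, self.colIndex,
                             self.sourceIndex, self.sourceText)
        c = self.sourceText[self.sourceIndex]
        return Character(c, self.lineIndex, self.colIndex,
                         self.sourceIndex, self.sourceText)
```

Each call to get() advances one character and reports a 0-based line and column index, which matches the listing the driver prints below.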
Having created the scanner machinery, let's put it through its paces.
I create a driver program that sets up a string of source text, creates a scanner object to scan that source
text, and then displays the characters that it gets back from the scanner. [SourceCode]
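A driver in that spirit might look like the sketch below. This is a hypothetical stand-in, not the linked [SourceCode]; the stub scanner stands in for the article's scanner module so the sketch runs on its own.

```python
# A sketch of the scanner driver's overall shape. scanner.initialize()
# and scanner.get() mirror the interface the article's scanner module
# exposes; the stub below stands in for it.
class _StubScanner:
    """Stand-in for the article's scanner module."""
    def initialize(self, sourceText):
        self.chars = list(sourceText) + ["\0"]
        self.index = -1
    def get(self):
        self.index += 1
        return self.chars[self.index]

scanner = _StubScanner()

def main(sourceText):
    scanner.initialize(sourceText)
    lines = []
    while True:
        character = scanner.get()
        lines.append(str(character))   # the real driver prints each Character
        if character == "\0":
            break
    return lines

print(main("ab"))  # ['a', 'b', '\x00']
```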
When we run the scanner driver program, it produces the following results.
     0   0  /
     0   1  *
     0   2     space
     0   3  P
     0   4  R
     0   5  O
     0   6  G
     0   7  R
     0   8  A
     0   9  M
     0  10     space
     0  11  N
     0  12  A
     0  13  M
     0  14  E
     0  15  :
  (... one line per character, continuing for several pages ...)
    18  10     space
    18  11  ;
    18  12     eof

Each line of the listing shows the line index, the column index, and the
character itself (with space, newline, tab, and end-of-file spelled out by
name). Reading the character column top to bottom recovers the source text
that was scanned:

/* PROGRAM NAME: nxx1.txt

nxx is a simple programming language that provides:
 numbers
 strings
 assignment statements
 string concatenation
 simple arithmetic operations
 print capability

comments may be enclosed in slash+asterisk .. asterisk+slash
*/
alpha = 16 ;
beta = 2   ;
resultName = "delta" ;
delta = alpha / beta ;
print "Value of " || resultName || " is: " ;
print delta ;
print "\n" ;
5 The Lexer
A lexical analyser is also called a lexer or a tokenizer.
The lexer's job is to group the characters of the source file into chunks called tokens. (If the source text
was written in a natural language (English, Spanish, French, etc.) the tokens would correspond to
the words and punctuation marks in the text.) Each time the lexer is called, it calls the scanner
(perhaps several times) to get as many characters as it needs in order to assemble the characters into
a token. It determines the type of token that it has found (a string, a number, an identifier, a
comment, etc.) and returns the token.
A scanner can be pretty much language-agnostic, but a lexer needs a precise specification for the
language that it must tokenize. Suppose we want to process a language called nxx. Then the lexer needs to
know the answers to questions like these about nxx:
What counts as whitespace?
How are strings (string literals) delimited: single quotes? double quotes? both? something else?
What symbols and operators does the language support? For example: ( ) + - = ;
What are the rules governing the formation of names (identifiers)? Can names contain dashes?
underscores?
What are the rules governing the formation of numbers (numeric literals)?
What are the rules for writing comments?
The spaces aren't significant in themselves, but without them the COBOL compiler would see:

    MOVESTATE-IDTOHOLD-STATE-ID.

instead of:

    MOVE STATE-ID TO HOLD-STATE-ID.
In many languages, the whitespace characters consist of the usual suspects: SPACE, TAB, NEWLINE. But
consider a language in which each statement must exist on its own line. For such a language the
NEWLINE character is not whitespace at all but a token indicating the end of a statement in the same
way that a semi-colon does in Java and PL/I. Another example: Python uses indentation (tabs or spaces)
rather than keywords or symbols (e.g. "do..end" or "{..}") to control scope. So for Python (in at least some
contexts) spaces and tabs are not whitespace characters.
Another case in which the lexer would pass whitespace tokens back to its caller is if the calling module is
making some modifications to the input text (for example, removing comments from the source code) but
otherwise leaving the source text intact, whitespace and all.
An identifier token must start with one of the following characters: <a list of characters goes
here>
For example:
Identifiers can contain letters, numeric digits, and underscores, and must begin with a letter.
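That identifier rule (letters, digits, underscores, beginning with a letter) translates directly into two character sets, one for the first character and one for the rest. The sketch below is mine, not the article's; note that the article's Python 2 code uses string.letters, for which the Python 3 spelling is string.ascii_letters.

```python
import string

# Character sets expressing the identifier rule stated above.
IDENTIFIER_STARTCHARS = string.ascii_letters
IDENTIFIER_CHARS      = string.ascii_letters + string.digits + "_"

def is_valid_identifier(s):
    """Check a candidate name against the identifier rule."""
    if not s or s[0] not in IDENTIFIER_STARTCHARS:
        return False
    return all(ch in IDENTIFIER_CHARS for ch in s)

print(is_valid_identifier("resultName"))   # True
print(is_valid_identifier("2fast"))        # False
print(is_valid_identifier("hold_state"))   # True
```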
<>
!=
++
**
+=
-=
||
"""
TwoCharacterSymbols = TwoCharacterSymbols.split()

import string

IDENTIFIER_STARTCHARS = string.letters
IDENTIFIER_CHARS      = string.letters + string.digits + "_"

NUMBER_STARTCHARS     = string.digits
NUMBER_CHARS          = string.digits + "."
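The triple-quoted string followed by split() is simply a compact way to build the symbol list; the lexer later tests its two-character lookahead against it. A quick self-contained check of the idiom (this particular symbol list is illustrative):

```python
# The triple-quoted-string-plus-split() idiom for building a symbol list.
TwoCharacterSymbols = """
<>
!=
++
**
+=
-=
||
"""
TwoCharacterSymbols = TwoCharacterSymbols.split()

print(TwoCharacterSymbols)          # ['<>', '!=', '++', '**', '+=', '-=', '||']
print("||" in TwoCharacterSymbols)  # True
```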
class Token:
    """
    A Token object holds information about a token, including
    - the line number and column index where the token starts
    """
    #------------------------------------------------------------------
    #
    #------------------------------------------------------------------
    def __init__(self, startChar):
        """
        The constructor of the Token class
        """
        self.cargo = startChar.cargo

        #--------------------------------------------------------------
        # The token picks up information
        # about its location in the sourceText
        #--------------------------------------------------------------
        self.sourceText = startChar.sourceText
        self.lineIndex  = startChar.lineIndex
        self.colIndex   = startChar.colIndex

        #--------------------------------------------------------------
        # We won't know what kind of token we have until we have
        # finished processing all of the characters in the token.
        # So when we start, the token.type is None (aka null).
        #--------------------------------------------------------------
        self.type = None

    #------------------------------------------------------------------
    # return a displayable string representation of the token
    #------------------------------------------------------------------
    def show(self, showLineNumbers=False, **kwargs):
        """
        align=True shows token type left justified with dot leaders.
        Specify align=False to turn this feature OFF.
        """
        align = kwargs.get("align", True)
        if align:
            tokenTypeLen = 12
            space = " "
        else:
            tokenTypeLen = 0
            space = ""

        if showLineNumbers:
            s = str(self.lineIndex).rjust(6) + str(self.colIndex).rjust(4) + "  "
        else:
            s = ""

        if self.type == self.cargo:
            s = s + "Symbol".ljust(tokenTypeLen, ".") + ":" + space + self.type
        elif self.type == "Whitespace":
            s = s + "Whitespace".ljust(tokenTypeLen, ".") + ":" + space + repr(self.cargo)
        else:
            s = s + self.type.ljust(tokenTypeLen, ".") + ":" + space + self.cargo
        return s

    guts = property(show)
#------------------------------------------------------------------
#
#------------------------------------------------------------------
def initialize(sourceText):
    """
    """
    global scanner
    # initialize the scanner with the sourceText
    scanner.initialize(sourceText)
    # use the scanner to read the first character from the sourceText
    getChar()

#------------------------------------------------------------------
#
#------------------------------------------------------------------
def get():
    """
    Construct and return the next token in the sourceText.
    """
    #--------------------------------------------------------------
    # read past and ignore any whitespace characters or any comments -- START
    #--------------------------------------------------------------
    while c1 in WHITESPACE_CHARS or c2 == "/*":

        # process whitespace
        while c1 in WHITESPACE_CHARS:
            token = Token(character)
            token.type = WHITESPACE
            getChar()

            while c1 in WHITESPACE_CHARS:
                token.cargo += c1
                getChar()

            # return token

        # process comments
        while c2 == "/*":
            # we found comment start
            token = Token(character)
            token.type = COMMENT
            token.cargo = c2

            getChar()  # read past the first character of a 2-character token
            getChar()  # read past the second character of a 2-character token

            while not (c2 == "*/"):
                if c1 == ENDMARK:
                    token.abort("Found end of file before end of comment")
                token.cargo += c1
                getChar()

            token.cargo += c2
    if c2 in TwoCharacterSymbols:
        token.cargo = c2
        token.type  = token.cargo  # for symbols, the token type is same as the cargo
        getChar()  # read past the first character of a 2-character token
        getChar()  # read past the second character of a 2-character token
        return token

    if c1 in OneCharacterSymbols:
        token.type = token.cargo  # for symbols, the token type is same as the cargo
        getChar()  # read past the symbol
        return token

    # else.... We have encountered something that we don't recognize.
    token.abort("I found a character or symbol that I do not recognize: " + dq(c1))
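Between the comment handling and the symbol handling, get() also assembles identifier, number, and string tokens; those loops fall on pages not reproduced above. Their general shape, reconstructed from the character-set definitions (IDENTIFIER_STARTCHARS, NUMBER_CHARS, and so on) and therefore an assumption rather than the article's exact code, is roughly this, shown here over plain strings so the sketch runs on its own:

```python
# Hedged reconstruction of the lexer's identifier/number loops, using
# plain strings instead of the article's Character/Token machinery.
import string

IDENTIFIER_STARTCHARS = string.ascii_letters
IDENTIFIER_CHARS      = string.ascii_letters + string.digits + "_"
NUMBER_STARTCHARS     = string.digits
NUMBER_CHARS          = string.digits + "."

def read_token(text, i):
    """Return (tokenType, cargo, nextIndex) for the token starting at i."""
    c1 = text[i]
    if c1 in IDENTIFIER_STARTCHARS:
        # keep absorbing characters as long as they are legal in a name
        j = i + 1
        while j < len(text) and text[j] in IDENTIFIER_CHARS:
            j += 1
        return ("Identifier", text[i:j], j)
    if c1 in NUMBER_STARTCHARS:
        # keep absorbing digits (and a decimal point)
        j = i + 1
        while j < len(text) and text[j] in NUMBER_CHARS:
            j += 1
        return ("Number", text[i:j], j)
    return ("Symbol", c1, i + 1)

print(read_token("alpha = 16 ;", 0))   # ('Identifier', 'alpha', 5)
print(read_token("alpha = 16 ;", 8))   # ('Number', '16', 10)
```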
#----------------------------------------------------------------------
#
#                              main
#
#----------------------------------------------------------------------
def main(sourceText):
    global f
    f = open(outputFilename, "w")
    writeln("Here are the tokens returned by the lexer:")

    # create an instance of a lexer
    lexer.initialize(sourceText)

    #------------------------------------------------------------------
    # use the lexer.get() method repeatedly to get the tokens in
    # the sourceText. Then print the tokens.
    #------------------------------------------------------------------
    while True:
        token = lexer.get()
        writeln(token.show(True))
        if token.type == EOF: break
    f.close()
print delta ;
print "\n" ;
When we run the lexer on this source code, it produces the following list of tokens.
7 The Parser
7.1 Writing a Recursive Descent Parser
Parsing Techniques (first edition, 1990) by Dick Grune and Ceriel Jacobs is a great book about
parsing techniques. This book is freely downloadable from http://www.cs.vu.nl/~dick/PTAPG.html.
I learned a lot from the book.
The first thing I tried to do was to implement (in Python) what they consider the best-ever recursive-descent
parsing technique. It is explained on pp. 137-140 of their book. Unfortunately, the results were
disappointing. I won't dispute their claim that this is a fabulous technique, but I found it clumsy and
difficult to understand. I certainly couldn't just sit down and write a parser using the technique.
I had always heard that it wasn't difficult to implement a recursive-descent parser. But that technique was
much too difficult (for me at least).
token = ""

ASSIGNMENT = ":="
BEGIN      = "begin"
CALL       = "call"
COMMA      = ","
CONST      = "const"
DO         = "do"
END        = "end"
EQ         = "=="
GE         = ">="
GT         = ">"
IF         = "if"
LE         = "<="
LPAREN     = "("
LT         = "<"
MINUS      = "-"
MULTIPLY   = "*"
NE         = "!="
ODD        = "odd"
PERIOD     = "."
PLUS       = "+"
PROC       = "proc"
RPAREN     = ")"
SEMICOLON  = ";"
SLASH      = "/"
THEN       = "then"
VAR        = "var"
WHILE      = "while"

IDENTIFIER = "IDENTIFIER"
NUMBER     = "NUMBER"
class ParserException(Exception):
    pass

def error(msg):
    quotedToken = '"%s"' % token
    msg = msg + " while processing token " + quotedToken
    raise ParserException("\n\n" + msg)

def found(argToken):
    if token == argToken:
        getToken()
        return True
    return False

def expect(argToken):
    if found(argToken):
        return  # no problem
    else:
        quotedToken = '"%s"' % argToken
        error("I was expecting to find token "
              + quotedToken
              + "\n but I found something else")

#--------------------------------------------------
def factor():
    """
    factor = IDENTIFIER | NUMBER | "(" expression ")" .
    """
    if found(IDENTIFIER):
        pass
    elif found(NUMBER):
        pass
    elif found(LPAREN):
        expression()
        expect(RPAREN)
    else:
        error("factor: syntax error")
        getToken()

#--------------------------------------------------
def term():
    """
    term = factor {("*"|"/") factor} .
    """
    factor()
    while found(MULTIPLY) or found(SLASH):
        factor()
#--------------------------------------------------
def expression():
    """
    expression = ["+"|"-"] term {("+"|"-") term} .
    """
    if found(PLUS) or found(MINUS):
        pass
    term()
    while found(PLUS) or found(MINUS):
        term()

#--------------------------------------------------
def condition():
    """
    condition =
          "odd" expression
        | expression ("="|"#"|"<"|"<="|">"|">=") expression
        .
    """
    if found(ODD):
        expression()
    else:
        expression()
        if (   found(EQ) or found(NE) or found(LT)
            or found(LE) or found(GT) or found(GE) ):
            expression()
        else:
            error("condition: found invalid operator")
            getToken()

#--------------------------------------------------
def statement():
    """
    statement =
        [ IDENTIFIER ":=" expression
        | "call" IDENTIFIER
        | "begin" statement {";" statement} "end"
        | "if" condition "then" statement
        | "while" condition "do" statement
        ]
        .
    """
    if found(IDENTIFIER):
        expect(ASSIGNMENT)
        expression()
    elif found(CALL):
        expect(IDENTIFIER)
    elif found(BEGIN):
        statement()
        while found(SEMICOLON):
            statement()
        expect(END)
    elif found(IF):
        condition()
        expect(THEN)
        statement()
    elif found(WHILE):
        condition()
        expect(DO)
        statement()

#--------------------------------------------------
def block():
    """
    block =
        ["const" IDENTIFIER "=" NUMBER {"," IDENTIFIER "=" NUMBER} ";"]
        ["var" IDENTIFIER {"," IDENTIFIER} ";"]
        {"procedure" IDENTIFIER ";" block ";"} statement
        .
    """
    if found(CONST):
        expect(IDENTIFIER)
        expect(EQ)
        expect(NUMBER)
        while found(COMMA):
            expect(IDENTIFIER)
            expect(EQ)
            expect(NUMBER)
        expect(SEMICOLON)

    if found(VAR):
        expect(IDENTIFIER)
        while found(COMMA):
            expect(IDENTIFIER)
        expect(SEMICOLON)

    while found(PROC):
        expect(IDENTIFIER)
        expect(SEMICOLON)
        block()
        expect(SEMICOLON)

    statement()

#--------------------------------------------------
def program():
    """
    program = block "." .
    """
    getToken()
    block()
    expect(PERIOD)
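To exercise a recognizer like this, getToken() needs to feed it a stream of tokens. As a self-contained miniature of the same recursive-descent pattern, here is the expression/term/factor subset driven by a prepared token list. This is a toy harness of my own, not the article's code; it mirrors the getToken/found/expect helpers above.

```python
# A self-contained miniature of the recursive-descent pattern:
# expression/term/factor over a prepared token list.
tokens = []
pos = 0
token = ""

def getToken():
    global token, pos
    token = tokens[pos] if pos < len(tokens) else "<eof>"
    pos += 1

def found(argToken):
    if token == argToken:
        getToken()
        return True
    return False

def expect(argToken):
    if not found(argToken):
        raise SyntaxError("expected " + argToken + ", found " + token)

def factor():
    # factor = IDENTIFIER | NUMBER | "(" expression ")"
    if token.isdigit() or token.isalpha():
        getToken()
    elif found("("):
        expression()
        expect(")")
    else:
        raise SyntaxError("factor: syntax error at " + token)

def term():
    # term = factor {("*"|"/") factor}
    factor()
    while found("*") or found("/"):
        factor()

def expression():
    # expression = ["+"|"-"] term {("+"|"-") term}
    if found("+") or found("-"):
        pass
    term()
    while found("+") or found("-"):
        term()

def recognize(tokenList):
    """Return True if tokenList is exactly one well-formed expression."""
    global tokens, pos
    tokens, pos = tokenList, 0
    getToken()
    expression()
    return token == "<eof>"

print(recognize(["x", "*", "(", "y", "+", "2", ")"]))  # True
print(recognize(["x", "y"]))                           # False
```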
"""
A recursive descent parser for nxx1,
as defined in nxx1ebnf.txt
"""
import nxxLexer as lexer
from
nxxSymbols import *
from
genericAstNode import Node
class ParserError(Exception): pass
def dq(s): return '"%s"' %s
token
= None
verbose = False
indent = 0
numberOperator = ["+","-","/","*"]
#-------------------------------------------------------------------
#
#-------------------------------------------------------------------
def getToken():
    global token
    if verbose:
        if token:
            # print the current token, before we get the next one
            #print (" "*40) + token.show()
            print((" "*indent) + "   (" + token.show(align=False) + ")")
    token = lexer.get()

#-------------------------------------------------------------------
#                         push and pop
#-------------------------------------------------------------------
def push(s):
    global indent
    indent += 1
    if verbose: print((" "*indent) + " " + s)

def pop(s):
    global indent
    if verbose:
        pass
    indent -= 1
"""
for argTokenType in argTokenTypes:
#print "foundOneOf", argTokenType, token.type
if token.type == argTokenType:
return True
return False
#------------------------------------------------------------------#
found
#------------------------------------------------------------------def found(argTokenType):
if token.type == argTokenType:
return True
return False
#------------------------------------------------------------------#
consume
#------------------------------------------------------------------def consume(argTokenType):
"""
Consume a token of a given type and get the next token.
If the current token is NOT of the expected type, then
raise an error.
"""
if token.type == argTokenType:
getToken()
else:
error("I was expecting to find "
+ dq(argTokenType)
+ " but I found "
+ token.show(align=False)
)
#------------------------------------------------------------------#
parse
#------------------------------------------------------------------def parse(sourceText, **kwargs):
global lexer, verbose
verbose = kwargs.get("verbose",False)
# create a Lexer object & pass it the sourceText
lexer.initialize(sourceText)
getToken()
program()
if verbose:
print "~"*80
print "Successful parse!"
print "~"*80
return ast
#-------------------------------------------------------#
program
#-------------------------------------------------------@track0
def program():
"""
program = statement {statement} EOF.
"""
global ast
node = Node()
http://parsingintro.sourceforge.net/
Page 34 of 39
1/13/15, 5:10 AM
statement(node)
while not found(EOF):
statement(node)
consume(EOF)
ast = node
#--------------------------------------------------------
#                         statement
#--------------------------------------------------------
@track
def statement(node):
    """
    statement = printStatement | assignmentStatement .
    assignmentStatement = variable "=" expression ";".
    printStatement      = "print" expression ";".
    """
    if found("print"):
        printStatement(node)
    else:
        assignmentStatement(node)

#--------------------------------------------------------
#                         expression
#--------------------------------------------------------
@track
def expression(node):
    """
    expression = stringExpression | numberExpression.

    /* "||" is the concatenation operator, as in PL/I */
    stringExpression = (stringLiteral | variable) {"||" stringExpression}.
    numberExpression = (numberLiteral | variable) {numberOperator numberExpression}.
    numberOperator = "+" | "-" | "/" | "*" .
    """
    if found(STRING):
        stringLiteral(node)
        while found("||"):
            getToken()
            stringExpression(node)
    elif found(NUMBER):
        numberLiteral(node)
        while foundOneOf(numberOperator):
            node.add(token)
            getToken()
            numberExpression(node)
    else:
        node.add(token)
        consume(IDENTIFIER)
        if found("||"):
            while found("||"):
                getToken()
                stringExpression(node)
        elif foundOneOf(numberOperator):
            while foundOneOf(numberOperator):
                node.add(token)
                getToken()
                numberExpression(node)
#--------------------------------------------------------
#                         assignmentStatement
#--------------------------------------------------------
@track
def assignmentStatement(node):
    """
    assignmentStatement = variable "=" expression ";".
    """
    identifierNode = Node(token)
    consume(IDENTIFIER)

    operatorNode = Node(token)
    consume("=")

    node.addNode(operatorNode)
    operatorNode.addNode(identifierNode)

    expression(operatorNode)
    consume(";")

#--------------------------------------------------------
#                         printStatement
#--------------------------------------------------------
@track
def printStatement(node):
    """
    printStatement = "print" expression ";".
    """
    statementNode = Node(token)
    consume("print")

    node.addNode(statementNode)

    expression(statementNode)
    consume(";")

#--------------------------------------------------------
#                         stringExpression
#--------------------------------------------------------
@track
def stringExpression(node):
    """
    /* "||" is the concatenation operator, as in PL/I */
    stringExpression = (stringLiteral | variable) {"||" stringExpression}.
    """
    if found(STRING):
        node.add(token)
        getToken()
        while found("||"):
            getToken()
            stringExpression(node)
    else:
        node.add(token)
        consume(IDENTIFIER)
        while found("||"):
            getToken()
            stringExpression(node)

#--------------------------------------------------------
#                         numberExpression
#--------------------------------------------------------
@track
def numberExpression(node):
    """
    numberExpression = (numberLiteral | variable) {numberOperator numberExpression}.
    numberOperator = "+" | "-" | "/" | "*" .
    """
    if found(NUMBER):
        numberLiteral(node)
    else:
        node.add(token)
        consume(IDENTIFIER)

    while foundOneOf(numberOperator):
        node.add(token)
        getToken()
        numberExpression(node)

#--------------------------------------------------------
#                         stringLiteral
#--------------------------------------------------------
def stringLiteral(node):
    node.add(token)
    getToken()

#--------------------------------------------------------
#                         numberLiteral
#--------------------------------------------------------
def numberLiteral(node):
    node.add(token)
    getToken()
outputFilename = "output\\nxxParserDriver_output.txt"
sourceFilename = "input\\nxx1.txt"
sourceText = open(sourceFilename).read()
ast = parser.parse(sourceText, verbose=False)
print "~"*80
print "Here is the abstract syntax tree:"
print "~"*80
f = open(outputFilename,"w")
f.write(ast.toString())
f.close()
print(open(outputFilename).read())
ROOT
   =
      alpha
      16
   =
      beta
      2
   =
      resultName
      "delta"
   =
      delta
      alpha
      /
      beta
   print
      "Value of "
      resultName
      " is: "
   print
      delta
   print
      "\n"
Download this zip file to obtain the source code of files discussed in this article.
End of this article/web page.