Compiler Design in C

Allen I. Holub
Prentice Hall Software Series
Brian W. Kernighan, Editor
PRENTICE HALL
Englewood Cliffs, New Jersey 07632
Library of Congress Cataloging-in-Publication Data
Holub, Allen I.
Compiler design in C / Allen I. Holub.
p. cm. (Prentice-Hall software series)
Includes bibliographical references.
ISBN 0-13-155045-4
1. Compilers (Computer programs) 2. C (Computer program language)
I. Title. II. Series.
QA76.76.C65H65 1990
005.4'53 dc20 89-38733
CIP
Editorial/Production supervision: Kathleen Schiaparelli
Cover design: Allen I. Holub and Lundgren Graphics Ltd.
Manufacturing buyer: Margaret Rizzi
© 1990 by Allen I. Holub.
Published by Prentice-Hall, Inc.
A division of Simon & Schuster
Englewood Cliffs, New Jersey 07632
All Rights Reserved. No part of the book may be reproduced in any form or by any means without
permission in writing from the author.
Trademark Acknowledgments: TeX is a trademark of the American Mathematical Society. LEX,
because it is a visual pun on TeX, is used with the kind permission of Donald Knuth. There is no
other connection between either Dr. Knuth or the AMS and the programs or text in this book. LEX,
occs, LLama, autopic, and arachne are all trademarks of Allen I. Holub. UNIX is a trademark of
Bell Laboratories. MS-DOS, Microsoft, and QuickC are trademarks of Microsoft, Inc. Turbo-C is a
trademark of Borland, Inc. PostScript is a trademark of Adobe Systems. AutoCad and AutoSketch
are trademarks of AutoDesk, Inc. EROFF is a trademark of the Elan Computer Group. DEC, PDP,
and VAX are trademarks of Digital Equipment Corporation. Macintosh is a trademark of Apple
Computer, Inc.
LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY: The author and publisher
have used their best efforts in preparing this book. These efforts include the development, research,
and testing of the theories and programs to determine their effectiveness. The author and publisher
make no warranty of any kind, expressed or implied, with regard to these programs or the
documentation contained in this book. The author and publisher shall not be liable in any event for
incidental or consequential damages in connection with, or arising out of, the furnishing,
performance, or use of these programs.
Printed in the United States of America
10 9 8 7 6 5 4
ISBN 0-13-155045-4
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Simon & Schuster Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltda, Rio de Janeiro
For Deirdre
errata follows page 924
Contents
1. Basic Concepts.................................................................................... 1
1.1 The Parts of a Compiler.................................................................................................1
1.1.1 The Lexical Analyzer..............................................................................................3
1.1.2 The Parser...............................................................................................................3
1.1.3 The Code Generator................................................................................................ 5
1.2 Representing Computer Languages.............................................................................6
1.2.1 Grammars and Parse Trees....................................................................................7
1.2.2 An Expression Grammar.........................................................................................9
1.2.3 Syntax Diagrams...................................................................................................12
1.3 A Recursive-Descent Expression Compiler...............................................................13
1.3.1 The Lexical Analyzer............................................................................................13
1.3.2 The Basic Parser...................................................................................................17
1.3.3 Improving the Parser............................................................................................21
1.3.4 Code Generation...................................................................................................24
1.4 Exercises......................................................................................................................30
2. Input and Lexical Analysis.................................................................. 32
2.1 The Lexical Analyzer as Part of a Compiler*............................................................32
2.2 Error Recovery in Lexical Analysis*........................................................................34
2.3 Input Systems*............................................................................................................ 35
2.3.1 An Example Input System*..................................................................................36
2.3.2 An Example Input System - Implementation.....................................................39
2.4 Lexical Analysis*....................................................................................................... 50
2.4.1 Languages*............................................................................................................ 52
2.4.2 Regular Expressions*........................................................................................... 54
2.4.3 Regular Definitions* ........................................................................................... 56
2.4.4 Finite Automata*.................................................................................................. 56
2.4.5 State-Machine-Driven Lexical Analyzers*.......................................................60
2.4.6 Implementing a State-Machine-Driven Lexical Analyzer................................ 63
2.5 LEX - A Lexical-Analyzer Generator*......................................................................81
2.5.1 Thompson's Construction: From a Regular Expression to an NFA*.............. 81
2.5.2 Implementing Thompson's Construction............................................................83
2.5.2.1 Data Structures..............................................................................................83
2.5.2.2 A Regular-Expression Grammar.................................................................87
2.5.2.3 File Header.....................................................................................................87
2.5.2.4 Error-Message Processing.............................................................................88
2.5.2.5 Memory Management.................................................................................... 88
2.5.2.6 Macro Support................................................................................................93
2.5.2.7 LEX's Lexical Analyzer..................................................................................95
2.5.2.8 Parsing...........................................................................................................101
2.5.3 Interpreting an NFA - Theory*........................................................................ 113
2.5.4 Interpreting an NFA - Implementation............................................................ 115
2.5.5 Subset Construction: Converting an NFA to a DFA - Theory*.................... 122
2.5.6 Subset Construction: Converting an NFA to a DFA - Implementation.........124
2.5.7 DFA Minimization - Theory*...........................................................................132
2.5.8 DFA Minimization - Implementation...............................................................135
2.5.9 Compressing and Printing the Tables.............................................................. 140
2.5.9.1 Uncompressed Tables..................................................................................140
2.5.9.2 Pair-Compressed Tables..............................................................................141
2.5.9.3 Redundant-Row-and-Column-Compressed Tables................................. 146
2.5.10 Tying It All Together.......................................................................................152
2.6 Exercises ....................................................................................................................162
3. Context-Free Grammars.................................................................... 166
3.1 Sentences, Phrases, and Context-Free Grammars.................................................. 166
3.2 Derivations and Sentential Forms............................................................................. 168
3.2.1 LL and LR Grammars......................................................................................... 170
3.3 Parse Trees and Semantic Difficulties......................................................................170
3.4 ε Productions............................................................................................................... 173
3.5 The End-of-Input Marker......................................................................................... 173
3.6 Right-Linear Grammars............................................................................................174
3.7 Lists, Recursion, and Associativity...........................................................................175
3.7.1 Simple Lists ........................................................................................................175
3.7.2 The Number of Elements in a List ................................................................... 178
3.7.3 Lists with Delimiters ......................................................................................... 179
3.8 Expressions................................................................................................................. 180
3.9 Ambiguous Grammars..............................................................................................182
3.10 Syntax-Directed Translation ..................................................................................183
3.10.1 Augmented Grammars .................................................................................... 183
3.10.2 Attributed Grammars.......................................................................................186
3.11 Representing Generic Grammars.......................................................................... 192
3.12 Exercises....................................................................................................................193
4. Top-Down Parsing ................................................................................................195
4.1 Push-Down Automata*..............................................................................................195
4.1.1 Recursive-Descent Parsers as Push-Down Automata*....................................198
4.2 Using a PDA for a Top-Down Parse*......................................................................201
4.3 Error Recovery in a Top-Down Parser*...................................................................201
4.4 Augmented Grammars and Table-Driven Parsers*............................................... 202
4.4.1 Implementing Attributed Grammars in a PDA*............................................. 203
4.5 Automating the Top-Down Parse Process* ........................................................... 208
4.5.1 Top-Down Parse Tables* ................................................................................. 208
4.6 LL(1) Grammars and Their Limitations*..................................................................211
4.7 Making the Parse Tables*..........................................................................................213
4.7.1 FIRST Sets* .........................................................................................................213
4.7.2 FOLLOW Sets*....................................................................................................215
4.7.3 LL(1) Selection Sets*..........................................................................................217
4.8 Modifying Grammars*...............................................................................................218
4.8.1 Unreachable Productions*...................................................................................219
4.8.2 Left Factoring*....................................................................................................219
4.8.3 Corner Substitution*.............................................................................................221
4.8.4 Singleton Substitution*........................................................................................223
4.8.5 Eliminating Ambiguity*..................................................................................... 223
4.8.6 Eliminating Left Recursion*..............................................................................226
4.9 Implementing LL(1) Parsers.....................................................................................229
4.9.1 Top-Down, Table-Driven Parsing - The LLama Output File........................ 229
4.9.2 Occs and LLama Debugging Support - yydebug.c......................................... 242
4.10 LLama - Implementing an LL(1) Parser-Generator............................................270
4.10.1 LLama's Parser.................................................................................................270
4.10.2 Creating The Tables..........................................................................................304
4.10.2.1 Computing FIRST, FOLLOW, and SELECT Sets................................304
4.10.3 The Rest of LLama..........................................................................................304
4.11 Exercises.....................................................................................................................333
5. Bottom-Up Parsing.................................................................................................337
5.1 How Bottom-Up Parsing Works*..............................................................................338
5.2 Recursion in Bottom-Up Parsing* ........................................................................... 340
5.3 Implementing the Parser as a State Machine* ........................................................343
5.4 Error Recovery in an LR Parser*..............................................................................348
5.5 The Value Stack and Attribute Processing*.............................................................348
5.5.1 A Notation for Bottom-Up Attributes*.............................................................353
5.5.2 Imbedded Actions* ............................................................................................ 354
5.6 Creating LR Parse Tables - Theory* ...................................................................... 354
5.6.1 LR(0) Grammars*...............................................................................................354
5.6.2 SLR(1) Grammars*.............................................................................................361
5.6.3 LR(1) Grammars*............................................................................................... 361
5.6.4 LALR(1) Grammars*..........................................................................................365
5.7 Representing LR State Tables...................................................................................368
5.8 Eliminating Single-Reduction States*.......................................................................373
5.9 Using Ambiguous Grammars*...................................................................................375
5.10 Implementing an LALR(1) Parser - The Occs Output File.................................. 381
5.11 Implementing an LALR(1) Parser Generator - Occs Internals...........................401
5.11.1 Modifying the Symbol Table for LALR(1) Grammars.................................. 401
5.12 Parser-File Generation ............................................................................................408
5.13 Generating LALR(1) Parse Tables.........................................................................408
5.14 Exercises.................................................................................................................... 442
6. Code Generation......................................................................................................445
6.1 Intermediate Languages............................................................................................446
6.2 C-code: An Intermediate Language and Virtual Machine ....................................449
6.2.1 Names and White Space.................................................................................... 450
6.2.2 Basic Types.......................................................................................................... 450
6.2.3 The Virtual Machine: Registers, Stack, and Memory .....................................451
6.2.4 Memory Organization: Segments...................................................................... 455
6.2.5 Variable Declarations: Storage Classes and Alignment..................................457
6.2.6 Addressing Modes.............................................................................................462
6.2.7 Manipulating the Stack......................................................................................465
6.2.8 Subroutines......................................................................................................... 466
6.2.9 Stack Frames: Subroutine Arguments and Automatic Variables.................. 467
6.2.10 Subroutine Return Values...............................................................................472
6.2.11 Operators......................................................................................................... 473
6.2.12 Type Conversions............................................................................................. 473
6.2.13 Labels and Control Flow.................................................................................474
6.2.14 Macros and Constant Expressions...................................................................475
6.2.15 File Organization............................................................................................. 476
6.2.16 Miscellany .......................................................................................................476
6.2.17 Caveats.............................................................................................................. 478
6.3 The Symbol Table.....................................................................................................478
6.3.1 Symbol-Table Requirements............................................................................ 478
6.3.2 Symbol-Table Data-Base Data Structures...................................................... 480
6.3.3 Implementing the Symbol Table........................................................................485
6.3.4 Representing Types - Theory ..........................................................................489
6.3.5 Representing Types - Implementation ...........................................................490
6.3.6 Implementing the Symbol-Table Maintenance Layer...................................497
6.4 The Parser: Configuration.........................................................................................509
6.5 The Lexical Analyzer................................................................................................518
6.6 Declarations.............................................................................................................. 522
6.6.1 Simple Variable Declarations ..........................................................................522
6.6.2 Structure and Union Declarations.....................................................................543
6.6.3 Enumerated-Type Declarations....................................................................... 550
6.6.4 Function Declarations........................................................................................ 552
6.6.5 Compound Statements and Local Variables....................................................559
6.6.6 Front-End/Back-End Considerations................................................................ 563
6.7 The gen() Subroutine............................................................................................. 564
6.8 Expressions.................................................................................................................572
6.8.1 Temporary-Variable Allocation....................................................................... 572
6.8.2 Lvalues and Rvalues...........................................................................................578
6.8.3 Implementing Values, Higher-Level Temporary-Variable Support............. 583
6.8.4 Unary Operators ................................................................................................593
6.8.5 Binary Operators................................................................................................617
6.9 Statements and Control Flow....................................................................................637
6.9.1 Simple Statements and if/else .....................................................................637
6.9.2 Loops, break, and continue ....................................................................... 642
6.9.3 The switch Statement......................................................................................642
6.10 Exercises................................................................................................................... 651
7. Optimization Strategies......................................................................................657
7.1 Parser Optimizations ................................................................................................657
7.2 Linear (Peephole) Optimizations............................................................................658
7.2.1 Strength Reduction.............................................................................................658
7.2.2 Constant Folding and Constant Propagation....................................................659
7.2.3 Dead Variables and Dead Code.......................................................................660
7.2.4 Peephole Optimization: An Example ..............................................................665
7.3 Structural Optimizations...........................................................................................667
7.3.1 Postfix and Syntax Trees................................................................................... 667
7.3.2 Common-Subexpression Elimination............................................................. 672
7.3.3 Register Allocation ...........................................................................................673
7.3.4 Lifetime Analysis
7.3.5 Loop Unwinding
7.3.6 Replacing Indexes with Pointers
7.3.7 Loop-Invariant Code Motion
7.3.8 Loop Induction
7.4 Aliasing Problems
7.5 Exercises
Appendix A. Support Functions
A.1 Miscellaneous Include Files
A.1.1 debug.h - Miscellaneous Macros
A.1.2 stack.h and yystack.h - Generic Stack Maintenance
A.1.3 l.h and compiler.h
A.2 Set Manipulation
A.2.1 Using the Set Functions and Macros
A.2.2 Set Implementation
A.3 Database Maintenance - Hashing
A.3.1 Hashing - Implementation
A.3.2 Two Hash Functions
A.4 The ANSI Variable-Argument Mechanism
A.5 Conversion Functions
A.6 Print Functions
A.7 Sorting
A.7.1 Shell Sort - Theory
A.7.2 Shell Sort - Implementation
A.8 Miscellaneous Functions
A.9 Low-Level Video I/O Functions for the IBM PC
A.9.1 IBM Video I/O - Overview
A.9.2 Video I/O - Implementation
A.10 Low-level-I/O, Glue Functions
A.11 Window Management: Curses
A.11.1 Configuration and Compiling
A.11.2 Using Curses
A.11.2.1 Initialization Functions
A.11.2.2 Configuration Functions
A.11.2.3 Creating and Deleting Windows
A.11.2.4 Subroutines That Affect Entire Windows
A.11.2.5 Cursor Movement and Character I/O
A.11.3 Curses - Implementation
Appendix B. Notes on Pascal Compilers
B.1 Subroutine Arguments
B.2 Return Values
B.3 Stack Frames
Appendix C. A Grammar for C
Appendix D. LEX
D.1 Using LEX and Occs Together
D.2 The LEX Input File: Organization
D.3 The LEX Rules Section
D.3.1 LEX Regular Expressions
D.3.2 The Code Part of the Rule
D.4 LEX Command-Line Switches
D.5 Limits and Bugs....................................................................................................... 826
D.6 Example: A Lexical Analyzer for C ........................................................................829
D.7 Exercises....................................................................................................................834
Appendix E. LLama and Occs .............................................................. 836
E.1 Using The Compiler Compiler................................................................................836
E.2 The Input File............................................................................................................ 837
E.3 The Definitions Section ........................................................................................... 837
E.4 The Rules Section.....................................................................................................839
E.5 The Code Section .....................................................................................................842
E.6 Output Files...............................................................................................................842
E.7 Command-Line Switches.........................................................................................843
E.8 The Visible Parser..................................................................................................... 845
E.9 Useful Subroutines and Variables .......................................................................... 852
E.10 Using Your Own Lexical Analyzer........................................................................855
E.11 Occs...........................................................................................................................856
E.11.1 Using Ambiguous Grammars......................................................................856
E.11.2 Attributes and the Occs Value Stack..........................................................858
E.11.3 Printing the Value Stack..............................................................................863
E.11.4 Grammatical Transformations....................................................................865
E.11.5 The yyout.sym File.......................................................................................867
E.11.6 The yyout.doc File.......................................................................................868
E.11.7 Shift/Reduce and Reduce/Reduce Conflicts..............................................871
E.11.8 Error Recovery ............................................................................................875
E.11.9 Putting the Parser and Actions in Different Files.......................................875
E.11.10 Shifting a Token's Attributes....................................................................877
E.11.11 Sample Occs Input File..............................................................................879
E.11.12 Hints and Warnings...................................................................................881
E.12 LLama.......................................................................................................................883
E.12.1 Percent Directives............................................................................................883
E.12.2 Top-Down Attributes.......................................................................................883
E.12.3 The LLama Value Stack.................................................................................. 884
E.12.4 The llout.sym File............................................................................................885
E.12.5 Sample LLama Input File................................................................................885
Appendix F. A C-code Summary...........................................................889
Bibliography ............................................................................................ 894
Index.......................................................................................................... 897
Cross Reference by Symbol .................................................................. 913
Preface
This book presents the subject of Compiler Design in a way that's understandable to
a programmer, rather than a mathematician. My basic premise is that the best way to
learn how to write a compiler is to look at one in depth; the best way to understand the
theory is to build tools that use that theory for practical ends. So, this book is built
around working code that provides immediate practical examples of how given theories
are applied. I have deliberately avoided mathematical notation, foreign to many
programmers, in favor of English descriptions of the theory and using the code itself to
explain a process. If a theoretical discussion isn't clear, you can look at the code that
implements the theory. I make no claims that the code presented here is the only (or the
best) implementation of the concepts presented. I've found, however, that looking at an
implementation (at any implementation) can be a very useful adjunct to understanding
the theory, and the reader is well able to adapt the concepts presented here to alternate
implementations.
The disadvantage of my approach is that there is, by necessity, a tremendous amount
of low-level detail in this book. It is my belief, however, that this detail is both critically
important to understanding how to actually build a real compiler, and is missing from
virtually every other book on the subject. Similarly, a lot of the low-level details are
more related to program implementation in general than to compilers in particular. One
of the secondary reasons for learning how to build a compiler, however, is to learn how
to put together a large and complex program, and presenting complete programs, rather
than just the directly compiler-related portions of those programs, furthers this end. I've
resolved the too-many-details problem, to some extent, by isolating the theoretical
materials into their own sections, all marked with asterisks in the table of contents and in
the header on the top of the page. If you aren't interested in the nuts and bolts, you can
just skip over the sections that discuss code.
In general, I've opted for clarity in the code rather than cleverness or theoretical
efficiency. That is, since the main purpose of this book is to teach compiler-design
concepts, it seemed reasonable always to go with a more understandable algorithm rather
than an efficient but opaque algorithm. For example, I've used Thompson's Construction
to make DFAs in LEX rather than the more direct approach recommended in Aho's book,
because Thompson's construction is more understandable (it also introduces certain key
concepts [like closure], in a relatively easy-to-understand context). My method for
computing LALR(1) lookaheads is also less efficient than it could be, but the algorithm is
understandable and fast enough. I usually point the reader to the source for the more
efficient algorithms, should he or she want to implement them.
In a sense, this book is really an in-depth presentation of several very
well-documented programs: the complete sources for three compiler-generation tools are
presented, as is a complete C compiler. (A lexical-analyzer generator modeled after the
UNIX lex utility is presented along with two yacc-like compiler compilers.) As such, it is
more of a compiler-engineering book than are most texts; a strong emphasis is placed
on teaching you how to write a real compiler. On the other hand, a lot of theory is
covered on the way to understanding the practice, and this theory is central to the
discussion. Though I've presented complete implementations of the programs as an aid to
understanding, the implementation details aren't nearly as important as the processes
that are used by those programs to do what they do. It's important that you be able to
apply these processes to your own programs.
The utilities are designed to be used as a learning aid. For example, LLama and
occs (the two compiler compilers) can create an interactive window-oriented debugging
environment, incorporating a visible parser that lets you watch the parse process in
action. (One window shows the state and value stacks, others show the input, output,
and a running commentary of what the parser's doing.) You can see the parse stack grow
and shrink, watch attributes being inherited and synthesized, set breakpoints on certain
conditions (a token being input, reduction by a particular production, a certain symbol
on the top of stack, and so forth), and if necessary, log the entire parse (including
snapshots of the stack) to a file. I've found that actually watching a bottom-up parse in
action helps considerably in understanding how the parse process works, and I regularly
use this visible parser in the classroom to good effect.
The C Compiler presented in Chapter Six implements an ANSI-compatible subset of
C. I've left out a few things, like floating point, which would make the chapter even
larger than it is without adding any significant value. I have tried to cover all of the hard
implementation details that are usually omitted from books such as the present one. For
example, the complete declaration syntax of C is supported, including structures and
declarations of arbitrary complexity. Similarly, block nesting of declarations and the
more complex control structures, such as switches, are also covered.
All the source code presented here is ANSI C, and is all portable to UNIX as well. For
example, window management is done on the IBM-PC by emulating a standard UNIX
window-management package (curses). The complete source code for the emulator is
provided. All of the software is available electronically; versions are available for the
UNIX, MS-DOS, and Macintosh environments.
I'm assuming throughout this book that you will actually read the code as well as the
prose descriptions of the code. I don't waste space in the text discussing implementation
details that are described adequately by the code itself. I do, however, use a lot of space
describing the nonobvious parts of the programs.
The primary prerequisite for using this book is a thorough knowledge of ANSI C in
particular and programming in general. You should be familiar both with the language
itself and with the standard library functions described in the ANSI standard.
I've used structured programming techniques throughout the book, and have made
heavy use of data abstraction and similar techniques, but I haven't described what these
techniques are or why I'm using them. The more complicated data structures are
explained thoroughly in the text, but a previous knowledge of basic data structures like
stacks and binary trees is assumed throughout. Similarly, a knowledge of basic set
theory and a familiarity with graphs is also useful. Finally, familiarity with
assembly-language concepts (like how a subroutine call works) is mandatory, but an in-depth
knowledge of a specific assembly language is not required (because I've used a C subset
for generated code rather than assembly language).
Though a knowledge of C is mandatory, a knowledge of UNIX or MS-DOS is not. This
book is UNIX oriented only in that several UNIX tools are constructed. The tools were all
developed on an IBM-PC and run nicely in that environment. By the same token,
several MS-DOS implementation details and portability concerns are discussed in depth
here, but I've been careful to make the code itself as portable as possible. The only
potential confusion for non-UNIX users is in Appendixes D and E, where differences
between my own programs and the UNIX versions are occasionally mentioned in
footnotes. Just ignore these notes if you're not going to use the UNIX tools.
The book is organized so that you can use it in two ways. If you have no interest in
theory and just want to build a compiler, an overview of compiler design in general is
presented in Chapter One, instructions for using the compiler construction tools (LEX and
occs) are in Appendixes D and E, and code generation is discussed in Chapter Six. With
these chapters behind you, you can get to the business of writing a compiler
immediately, and go back later and absorb the theory.
A more rigorous approach goes through the book sequentially. You don't have to
read every word of every chapter; if you're not interested in the nuts-and-bolts, just skip
past those sections that describe the programs' inner workings. I'd strongly suggest
reading the code, however, and reading Appendixes D and E before leaping into the text
that describes these programs.
Various support functions that are used internally by the programs are concentrated
in Appendix A. Covered functions do set manipulation, window management, and so
forth. I've put them in one place so that the rest of the book isn't cluttered with function
descriptions that aren't germane to the discussion at hand. This appendix shouldn't be
viewed as an adjunct, however. The functions described in it are used heavily
throughout the rest of the book, and you should be familiar with them (or at least with
their calling syntax) before trying to read any of the other code.
One major organizational issue is the positioning of the theoretical and practical
parts of the book. It's tempting to put all the theoretical material together at the head of
each chapter so that you can skip past implementation details if you're not interested.
I've opted, however, to intermix theoretical material with actual code because an
examination of the code can often clarify things on the theoretical side. As I mentioned
earlier, I've resolved the problem, somewhat, by isolating the theoretical material into
individual sections. I often discuss the theory in one section and put a practical
implementation in the following section. The theoretical sections in chapters that mix
theoretical and practical material are marked with asterisks in the table of contents. This
way you can skip past the implementation-related material with little difficulty, should
you desire to do so.
A few compiler-related subjects are not covered here. Optimization is not discussed
beyond the overview presented in Chapter Seven. I discuss how various optimizations
move the code around, but I don't discuss the mechanics of optimization itself beyond a
few, simple examples. Similarly, only the most common parse strategies are discussed
(operator-precedence parsing is not mentioned, for example). All the material usually
covered in an upper-division, undergraduate compiler-design course is covered here.
All the code in this book is written in ANSI C (I've used the Microsoft C compiler,
version 5.1, for development purposes). For the most part, the MS-DOS code can be
converted to the UNIX compiler by changing a few #defines before compiling. The disk
that contains the code (see below) is shipped with the source of a small UNIX
preprocessor that handles other conversion details (it does string concatenation, token
pasting, etc.), and the output of this preprocessor can be submitted directly to UNIX cc. I'm
assuming that the UNIX compiler supports those ANSI features that are implemented in
most UNIX systems (like structure assignment).
If you intend to use the code directly (without UNIX preprocessing), you'll need an
ANSI-compatible compiler that supports function prototypes, and so forth. In particular:
<stdarg.h> is used for variable-argument lists.
white space around the # in a preprocessor directive must be permitted.
structure assignment is used.
unsigned char must be supported.
function prototypes are used heavily.
isdigit(), etc., may not have side effects.
string concatenation is used in a few places.
16-character names must be permitted.
My only deviation from strict ANSI is in name lengths. In theory ANSI allows only six
characters in an external name but I'm assuming that 16 are supported. I am also using
the old, Kernighan & Ritchie style, subroutine-declaration syntax:
    lorenzo( arg1, arg2 )
    char   *arg1;
    double *arg2;
rather than:
    lorenzo( char *arg1, double *arg2 )
I've done this because many compilers do not yet support the new syntax, and the old
syntax is still legal in the standard (even though it's declared obsolescent).
I've deliberately avoided using special features of the Microsoft compiler: I've
ignored things like the existence of the far keyword and huge pointers in favor of using
the compact model even though the foregoing would make some of the code more
efficient. By the same token, I haven't used register variables because the Microsoft
compiler does a better job of assigning registers than I can do myself.
Unfortunately, the 8086 has an architecture that forces you to worry about the
underlying machine on a regular basis, so the code has a certain amount of 8086-specific
detail. All of these details are isolated into macros, however, so that it's easy to port the
code to a different environment by changing the macros and recompiling. I do discuss
the foibles of the 8086 here and there; just skip over this material if you're not interested.
All of the source code in this book, along with executable versions of LEX, LLama,
and occs, is available on disk from:
Software Engineering Consultants
P.O. Box 5679
Berkeley, California 94705
(415) 540-7954
The software is available right now for the IBM-PC and UNIX. (The UNIX version is
shipped on an IBM-PC, 5-1/4" disk, however. You'll have to upload it using KERMIT or
some other file-transfer protocol. It has been tested under UNIX System V and BSD 4.3; I
can't vouch for any other UNIX variant.) The cost is $60.00 by a check or money order
drawn on a U.S. bank. Please specify the disk size (5-1/4" or 3-1/2"). California residents
must add local sales tax. No purchase orders or credit cards (sorry). A Macintosh
version will be available eventually. Binary site licenses are available for educational
institutions.
The code in this book is bound to have a few bugs in it, though I've done my best to test
it as thoroughly as possible. The version distributed on disk will always be the most
recent. If you find a bug, please report it to me, either at the above address or
electronically. My internet address is holub@violet.berkeley.edu. CompuServe users can access
internet from the email system by prefixing this address with >INTERNET: (type
help internet for information). My UUCP address is ...!ucbvax!violet!holub.
The UNIX USENET network is the official channel for bug fixes and general discussion
of the material in this book. The comp.compilers newsgroup should be used for this
purpose. USENET messages have a way of filtering over to other networks, like BIX, but
the best way to get up-to-date information is via USENET itself. Most universities are
connected to this network, and you can get access to it by getting an account on a
machine at a local university. (Most schools have a mechanism for people in the
community to get such accounts.) I'd prefer for all postings to be sent to me. I'll digest
them and post them to comp.compilers via its moderator. If you want to make a
submission to comp.compilers directly, you have to mail it to the moderator, who will post it to
the network. Type help bboard Usenet moderators to get his or her name.
This book was written largely because my students found the standard text (Alfred
Aho, Ravi Sethi, and Jeffrey Ullman's excellent, but at times abstruse, Compilers:
Principles, Techniques, and Tools) to be too theoretically oriented. The current volume
owes a lot to Aho et al., however. I've used many of their algorithms, and their insights
into the compiler-design practice are invaluable. I'm also indebted to Mike Lesk, Eric
Schmidt, and Steve Johnson, the creators of UNIX's lex and yacc utilities, after which the
programs in this book are modeled. My neighbor, Bill Wong, provided invaluable
comments on the early drafts of this book, as did many of my students. Finally, I'm grateful
to Brian Kernighan, Johnson M. Hart, Andrew Appel, Norman C. Hutchinson, and N.H.
Madhavji (of Bell Labs, Boston University, Princeton University, The University of
Arizona, and McGill University respectively) all of whom reviewed this book before it
went to press. Their comments and suggestions have made this a much better book. I
am particularly indebted to Brian Kernighan, whose careful scrutiny of the entire
book (both the text and the code) caught many errors that otherwise would have made
it into print.
Allen Holub
Berkeley, California
This book was typeset on an IBM PC/AT using EROFF, a version of the UNIX troff
typesetter ported to MS-DOS by the Elan Computer Group. PostScript Times Roman and
Italic were used for the text, Helvetica for chapter headings, and Courier, Courier Bold,
and Courier Italic for the listings. Page proofs were generated using an Apple
LaserWriter, and the final typesetting was done on a Linotronic phototypesetter using
EROFF-generated PostScript. The following command line was used throughout:
    arachne file... | autopic | tbl | troff -mm
The arachne preprocessor is a version of Knuth's WEB documentation system that's
tailored for C and troff (rather than Pascal and TeX). It runs under MS-DOS on an
IBM-PC. With it, you can put the code and documentation together in a single input file. Used
one way, it extracts the code and writes it out to the correct files for compilation. In a
second mode it processes the code for troff, performing the necessary font changes, and
so forth, needed to pretty print the code. It also automatically generates index entries
for subroutine declarations. It adds line numbers to the listings and lets you reference
these line numbers symbolically from the text (that is, you can add lines to the listings
and the line numbers in the text automatically adjust themselves). Finally, it lets you
discuss global variables and so forth where they're used, because it automatically moves
them to the top of the output C program.
The second preprocessor, autopic, translates drawings generated by two
commercially available drafting programs (AutoCad and AutoSketch) into troff graphics
primitives. It is much more useful than pic in that you have both a WYSIWYG capability
and a much more powerful drawing system at your disposal. Since troff commands are
generated as autopic output, the drawings are readily portable to any troff system.
Autopic and arachne are both compilers, and as such serve as an example of how
you can apply the techniques presented in this book to applications other than writing
compilers for standard programming languages. MS-DOS versions of autopic and
arachne are available from Software Engineering at the address given earlier. Write for
details.
This chapter introduces the basic concepts of compiler design. I'll discuss the
internal organization of a compiler, introduce formal grammars and parse trees, and build a
small recursive-descent expression compiler. Before leaping into the text, however, a
word of encouragement, both about this chapter and the book in general, seems in order.
Compilers are not particularly difficult programs to understand once you're familiar with
the structure of a compiler in a general sort of way. The main problem is not that any
one part of a compiler is hard to understand but, rather, that there are so many parts
and you need to have absorbed most of these parts before any of them make sense. For
now, my advice is to forge ahead without trying to figure out how it all ties together.
You'll find that you will eventually reach a "click point" where the system as a whole
suddenly makes sense.
Compilers are complex programs. As a consequence, they're often broken into
several distinct chunks, called passes, that communicate with one another via temporary
files. The passes themselves are only part of the compilation process, however. The
process of creating an executable image from a source-code file can involve several stages
other than compilation (preprocessing, assembly, linking, and so forth). In fact, some
operating systems (such as Microsoft's OS/2) can delay the final creation of an
executable image until a program is actually loaded at run-time. The situation is muddled
further by driver programs like UNIX's cc or Microsoft C's cl, which hide a good deal of
the compilation process from you. These driver programs act as executives, controlling
the various component programs that make up the compiler in such a way that you don't
know that the components are being used. For the purposes of this book, I'll define a
compiler as a program or group of programs that translates one language into another;
in this case the source code of a high-level computer language is translated into
assembly language. The assembler, linker, and so forth are not considered to be part of the
compiler.
The structure of a typical four-pass compiler is shown in Figure 1.1. The
preprocessor is the first pass. Preprocessors typically do macro substitutions, strip comments from
the source code, and handle various housekeeping tasks with which you don't want to
burden the compiler proper. The second pass is the heart of the compiler. It is made up
of a lexical analyzer, parser, and code generator, and it translates the source code into an
intermediate language that is much like assembly language. The third pass is the
optimizer, which improves the quality of the generated intermediate code, and the fourth pass,
the back end, translates the optimized code to real assembly language or some form of
binary, executable code. Of course, there are many variations to this structure. Many
compilers don't have preprocessors; others generate assembly language in the second
pass, optimize the assembly language directly, and don't have a fourth pass; still others
generate binary instructions directly, without going through an ASCII intermediate
language like assembler.
Figure 1.1. Structure of a Typical Four-Pass Compiler
This book concentrates on the second pass of our model. There are several
operations here too, but they interact in more complicated ways than the higher-level passes,
and they share data structures (such as the symbol table) as well.
Section 1.1 The Parts of a Compiler 3
1.1.1 The Lexical Analyzer
A phase is an independent task used in the compilation process. Typically, several
phases are combined into a single pass. The lexical analyzer phase of a compiler (often
called a scanner or tokenizer) translates the input into a form that's more usable by the
rest of the compiler. The lexical analyzer looks at the input stream as a collection of
basic language elements called tokens. That is, a token is an indivisible lexical unit. In
C, keywords like while or for are tokens (you can't say wh ile), symbols like >, >=,
>>, and >>= are tokens, names and numbers are tokens, and so forth. The original string
that comprises the token is called a lexeme. Note that there is not a one-to-one
relationship between lexemes and tokens. A name or number token, for example, can have
many possible lexemes associated with it; a while token always matches a single lexeme.
The situation is complicated by tokens that overlap (such as the >, >=, >>, and >>=, used
earlier). In general, a lexical analyzer recognizes the token that matches the longest
lexeme; many languages build this behavior into the language specification itself.
Given the input >>, a shift token is recognized rather than two greater-than tokens.
A lexical analyzer translates lexemes into tokens. The tokens are typically
represented internally as unique integers or an enumerated type. Both components are
always required: the token itself and the lexeme, which is needed in this example to
differentiate the various name or number tokens from one another.
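The longest-match rule can be sketched in a few lines of C. This is only an illustration, not the lexical analyzer developed later in the book: the token names, the operator table, and the next_token() routine are all invented for the example, and only the four overlapping operators discussed above are recognized.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token codes, invented for this example. */
enum token { TOK_EOI, TOK_GT, TOK_GE, TOK_SHR, TOK_SHR_ASSIGN, TOK_UNKNOWN };

/* The table is ordered longest lexeme first, so the first entry that
 * matches the input is also the longest possible match.
 */
static const struct { const char *lexeme; enum token tok; } ops[] = {
    { ">>=", TOK_SHR_ASSIGN },
    { ">>",  TOK_SHR        },
    { ">=",  TOK_GE         },
    { ">",   TOK_GT         },
};

/* Recognize one token at *input and advance *input past it. On return,
 * *lexeme points at the matched text (inside the input string) and
 * *length is the number of characters in it.
 */
enum token next_token(const char **input, const char **lexeme, size_t *length)
{
    size_t i;

    assert(input && *input && lexeme && length);

    while (**input == ' ' || **input == '\t' || **input == '\n')
        ++*input;                               /* skip white space */

    *lexeme = *input;
    *length = 0;
    if (**input == '\0')
        return TOK_EOI;                         /* end of input */

    for (i = 0; i < sizeof ops / sizeof ops[0]; ++i) {
        size_t n = strlen(ops[i].lexeme);
        if (strncmp(*input, ops[i].lexeme, n) == 0) {
            *length = n;
            *input += n;
            return ops[i].tok;
        }
    }

    *length = 1;                                /* anything else: one character */
    ++*input;
    return TOK_UNKNOWN;
}
```

Note that the routine hands back both components: the token as its return value and the lexeme through the two pointer arguments. Given the input >>>, it reports a shift token followed by a greater-than token, because >> is the longest lexeme that matches at the first position.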
One of the early design decisions that can affect the structure of the entire compiler is
the choice of a token set. You can have a token for every input symbol, or several
symbols can be merged into a single token. For example, the >, >=, >>, and >>= can be
treated either as four tokens, or as a single comparison-operator token, in which case the
lexeme is used to disambiguate the tokens. The former approach can sometimes make code
generation easier to do. Too many tokens, however, can make the parser larger than
necessary and difficult to write. There's no hard-and-fast rule as to which is better, but by the
time you've worked through this book, you'll understand the design considerations and
will be able to make intelligent choices. In general, arithmetic operators with the same
precedence and associativity can be grouped together, type-declaration keywords (like
int and char) can be combined, and so forth.
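A small sketch may make the merged-token approach more concrete. It assumes a hypothetical token set in which the relational operators (which share a precedence level in C) are all reported as one token, and the shift operators as another; the names classify() and apply_rel_op() are made up for the example and do not come from the book's code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical merged token set: one token stands for a whole operator class. */
enum token { TOK_REL_OP, TOK_SHIFT_OP, TOK_OTHER };

/* Map an operator lexeme to its (merged) token. */
enum token classify(const char *lexeme)
{
    assert(lexeme != NULL);
    if (!strcmp(lexeme, ">") || !strcmp(lexeme, ">=") ||
        !strcmp(lexeme, "<") || !strcmp(lexeme, "<="))
        return TOK_REL_OP;
    if (!strcmp(lexeme, ">>") || !strcmp(lexeme, "<<"))
        return TOK_SHIFT_OP;
    return TOK_OTHER;
}

/* The parser only ever sees TOK_REL_OP. The code that acts on the token
 * looks at the lexeme to find out which comparison was actually written.
 */
int apply_rel_op(const char *lexeme, int left, int right)
{
    assert(classify(lexeme) == TOK_REL_OP);
    if (!strcmp(lexeme, ">"))  return left >  right;
    if (!strcmp(lexeme, ">=")) return left >= right;
    if (!strcmp(lexeme, "<"))  return left <  right;
    return left <= right;                       /* only "<=" is left */
}
```

The trade-off described above shows up directly: the parser needs one rule for TOK_REL_OP instead of four, but the routine that processes the token has to examine the lexeme, a step that would be unnecessary if every operator had its own token.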
The lexical analyzer is typically a self-contained unit that interfaces with the rest of
the compiler via a small number (typically one or two) of subroutines and global
variables. The parser calls the lexical analyzer every time it needs a new token, and the
analyzer returns that token and the associated lexeme. Since the actual input mechanism
is hidden from the parser, you can modify or replace the lexical analyzer without
affecting the rest of the compiler.
1.1.2 The Parser
Compilers are language translators: they translate a high-level language like C into
a low-level language like 8086 assembler. Consequently, a good deal of the theoretical
side of the subject is borrowed from linguistics. One such concept is the idea of parsing.
To parse an English sentence is to break it up into its component parts in order to
analyze it grammatically. For example, a sentence like this:
    Jane sees Spot run.
is broken up into a subject ("Jane") and a predicate ("sees Spot run"). The predicate is
in turn broken up into a verb ("sees"), a direct object ("Spot"), and a participle that
modifies the direct object ("run"). Figure 1.2 shows how the sentence is represented by
a conventional sentence diagram like the ones you learned to make in the sixth grade.
A compiler performs this same process (of decomposing a sentence into its
component parts) in the parser phase, though it usually represents the parsed sentence in a
tree form rather than as a sentence diagram. (In this case, the sentence is an entire
program.)
Figure 1.2. A Sentence Diagram for Jane Sees Spot Run
The sentence diagram itself shows the syntactic relationships between the parts of
the sentence, so this kind of graph is formally called a syntax diagram (or, if it's in tree
form, a syntax tree). You can expand the syntax tree, however, to show the grammatical
structure as well as the syntactic structure. This second representation is called a parse
tree. A parse tree for our earlier sentence diagram is shown in Figure 1.3. Syntax and
parse trees for the expression A*B+C*D are shown in Figure 1.4. A tree structure is used
here primarily because it's easy to represent in a computer program, unlike a sentence
diagram.
Figure 1.3. A Parse Tree for Jane Sees Spot Run
A sentence, by the way, is also a technical term, though it means the same thing as it
does in English. It's a collection of tokens that follow a well-defined grammatical
structure. In the case of a compiler, the sentence is typically an entire computer program.
The analogy is evident in a language like Pascal, which mirrors English punctuation as
well as its grammar. A Pascal program ends with a period, just like an English sentence.
Similarly, a semicolon is used as punctuation to separate two complete ideas, just as it
separates two independent clauses in English.
To summarize: A parser is a group of subroutines that converts a token stream into a
parse tree, and a parse tree is a structural representation of the sentence being parsed.
Looked at another way, the parse tree represents the sentence in a hierarchical fashion,
moving from a general description of the sentence (at the root of the tree) down to the
specific sentence being parsed (the actual tokens) at the leaves. Some compilers create a
physical parse tree, made up of structures, pointers, and so forth, but most represent the
parse tree implicitly. Other parse methods just keep track of where they are in the tree,
without creating a physical tree (we'll see how this works shortly). The parse tree itself
is a very useful concept, however, for understanding how the parse process works.
Figure 1.4. Syntax and Parse Trees for A*B + C*D
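The difference between a physical tree and an implicit one can be seen in a small recursive-descent sketch. It is an illustration under simplifying assumptions, not the parser developed later in this book: operands are single letters, only + and * are supported, and the node structure and the routines parse() and show() are invented for the example. The nodes it builds form a syntax tree; the parse tree is never built, but is traced implicitly by the chain of calls from expr() to term() to factor().

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    char         op;            /* '+' or '*' in an interior node, 0 in a leaf */
    char         name;          /* one-letter operand name, leaves only       */
    struct node *left, *right;
} node;

static const char *in;          /* current position in the input */

static node *new_node(char op, char name, node *left, node *right)
{
    node *n = malloc(sizeof *n);
    assert(n != NULL);          /* the sketch does not recover from this */
    memset(n, 0, sizeof *n);
    n->op = op;  n->name = name;  n->left = left;  n->right = right;
    return n;
}

static void skip_space(void)
{
    while (isspace((unsigned char)*in))
        ++in;
}

static node *factor(void)       /* factor -> NAME */
{
    skip_space();
    return isalpha((unsigned char)*in) ? new_node(0, *in++, NULL, NULL) : NULL;
}

static node *term(void)         /* term -> factor { '*' factor } */
{
    node *right, *left = factor();

    for (skip_space(); left && *in == '*'; skip_space()) {
        ++in;
        if ((right = factor()) == NULL)
            return NULL;
        left = new_node('*', 0, left, right);
    }
    return left;
}

static node *expr(void)         /* expr -> term { '+' term } */
{
    node *right, *left = term();

    for (skip_space(); left && *in == '+'; skip_space()) {
        ++in;
        if ((right = term()) == NULL)
            return NULL;
        left = new_node('+', 0, left, right);
    }
    return left;
}

/* Parse text; returns the root of the tree, or NULL on a syntax error.
 * Nodes are never freed, which is acceptable only in a sketch.
 */
node *parse(const char *text)
{
    node *root;

    assert(text != NULL);
    in   = text;
    root = expr();
    skip_space();
    return (root != NULL && *in == '\0') ? root : NULL;
}

/* Write the tree into buf, fully parenthesized; returns the end of the text. */
char *show(const node *n, char *buf)
{
    if (n->op == 0) {
        *buf++ = n->name;
    } else {
        *buf++ = '(';
        buf = show(n->left, buf);
        *buf++ = n->op;
        buf = show(n->right, buf);
        *buf++ = ')';
    }
    *buf = '\0';
    return buf;
}
```

Parsing A*B + C*D and passing the result to show() yields ((A*B)+(C*D)), which is the shape of the syntax tree in Figure 1.4: the two multiplications sit below the addition because term() finishes its work before expr() sees the plus sign.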
1.1.3 The Code Generator
The last part of the compiler proper is the code generator. It's somewhat misleading
to represent this phase as a separate component from the parser proper, because most
compilers generate code as the parse progresses. That is, the code is generated by the
same subroutines that are parsing the input stream. It is possible, however, for the parser
to create a parse tree for the entire input file, which is then traversed by a distinct code
generator, and some compilers work in this way. A third possibility is for the parser to
create an intermediate-language representation of the input from which a syntax tree can
be reconstructed by an optimization pass. Some optimizations are easier to perform on a
syntax tree than on a linear instruction stream. A final, code-generation pass can
traverse the optimizer-modified syntax tree to generate code.
Though compilers can generate object code directly, they often defer code
generation to a second program. Instead of generating machine language directly, they create a
program in an intermediate language that is translated by the compiler's back end into
actual machine language. You can look at an intermediate language as a sort of
super assembly language that's designed for performing specific tasks (such as optimization).
As you might expect, there are many flavors of intermediate languages, each useful in
different applications.
There are advantages and disadvantages to an intermediate-language approach to
compiler writing. The main disadvantage is lack of speed. A parser that goes straight
from tokens to binary object code will be very fast, whereas an extra stage to process the
intermediate code can often double the compile time. The advantages, however, are
usually enough to justify the loss of speed. These are, in a nutshell, optimization and
flexibility. A few optimizations, such as simple constant folding (the evaluation of
constant expressions at compile time rather than run time), can be done in the parser. Most
optimizations, however, are difficult, if not impossible, for a parser to perform.
Consequently, parsers for optimizing compilers output an intermediate language that's easy for
a second pass to optimize.
1. A physical parse tree is useful for some kinds of optimizations, discussed further in Chapter Seven.
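Constant folding is simple enough to sketch here. The version below works on an explicit expression tree so that it can stand on its own; a parser that generates code as it goes could make the same check at the moment it recognizes an operator and its two operands. The expr structure and the fold() routine are hypothetical, and arithmetic overflow is ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef struct expr {
    char         op;            /* '+' or '*', 'k' for a constant, 'v' for a variable */
    int          value;         /* meaningful only when op == 'k'                    */
    struct expr *left, *right;  /* operands of '+' and '*', NULL otherwise           */
} expr;

/* Replace every operator node whose operands are both constants with a
 * single constant node, working from the leaves toward the root.
 */
void fold(expr *e)
{
    assert(e != NULL);
    if (e->op != '+' && e->op != '*')
        return;                                 /* leaves need no work */

    assert(e->left != NULL && e->right != NULL);
    fold(e->left);
    fold(e->right);

    if (e->left->op == 'k' && e->right->op == 'k') {
        e->value = (e->op == '+') ? e->left->value + e->right->value
                                  : e->left->value * e->right->value;
        e->op    = 'k';                         /* the node is now a constant */
        e->left  = e->right = NULL;
    }
}
```

Applied to a tree for 2*3+x, the multiplication node becomes the constant 6, while the addition is left alone because one of its operands is a variable whose value is not known until run time.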
Intermediate languages give you flexibility as well. A single lexical-analyzer/parser
front end can be used to generate code for several different machines by providing
separate back ends that translate a common intermediate language to a machine-specific
assembly language. Conversely, you can write several front ends that parse several
different high-level languages, but which all output the same intermediate language.
This way, compilers for several languages can share a single optimizer and back end.
A final use of an intermediate language is found in incremental compilers or
interpreters. These programs shorten the development cycle by executing intermediate code
directly, rather than translating it to binary first, thereby saving the time necessary for
assembling and linking a real program. An interpreter can also give you an improved
debugging environment because it can check for things like out-of-bounds array
indexing at run time.
The compiler developed in Chapter Six uses an intermediate language for the output
code. The language itself is described in depth in that chapter, but some mention of it is
necessary here, because I ll be using the language informally for code-generation exam
ples throughout this book. Put simply, the intermediate language is a subset of C in
which most instructions translate directly to a small number of assembly-language
instructions on a typical machine (usually one or two). For example, an expression like
x=a+b*c+d is translated into something like this:
    t0  = _a ;
    t1  = _b ;
    t1 *= _c ;
    t0 += t1 ;
    t0 += _d ;
The t0 and t1 in the foregoing code are temporary variables that the compiler allocates
to hold the result of the partially-evaluated expression. These are called anonymous
temporaries (often shortened to just temporaries) and are discussed in greater depth below.
The underscores are added to the names of declared variables by the compiler so that
they won't be confused with variables generated by the compiler itself, such as t0 and
t1 (which don't have underscores in their names).
Since the intermediate language is so C-like, I'm going to just use it for now without
a formal language definition. Remember, though, that the intermediate language is not C
(there would be little point in a compiler that translated good C into bad C); it is really
an assembly language with a C-like syntax.
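Because the intermediate language is C-like, the five-statement sequence for x=a+b*c+d can be checked by wrapping it in an ordinary C function. The following is only a sketch: the function name eval_example and the choice of int for every variable are assumptions made for illustration, not part of the intermediate language defined in Chapter Six.

```c
/* Hypothetical wrapper: t0 and t1 play the role of the anonymous
 * temporaries, and the underscored names stand for declared variables. */
int eval_example(int _a, int _b, int _c, int _d)
{
    int t0, t1;     /* anonymous temporaries */

    t0  = _a;       /* t0 = a           */
    t1  = _b;       /* t1 = b           */
    t1 *= _c;       /* t1 = b * c       */
    t0 += t1;       /* t0 = a + b*c     */
    t0 += _d;       /* t0 = a + b*c + d */
    return t0;
}
```

Each statement does one small operation, which is why a line of this code usually maps onto one or two machine instructions.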
1.2 Representing Computer Languages
A compiler is like every other program in that some sort of design abstraction is
useful when constructing the code. Flow charts, Warnier-Orr diagrams, and structure charts
are examples of a design abstraction. In compiler applications, the best abstraction is
one that describes the language being compiled in a way that reflects the internal
structure of the compiler itself.
1.2.1 Grammars and Parse Trees
The most common method used to describe a programming language in a formal way
is also borrowed from linguistics. This method is a formal grammar, originally
developed by M.I.T.'s Noam Chomsky and applied to computer programs by J.W.
Backus for the first FORTRAN compilers.
Formal grammars are most often represented in a modified Backus-Naur Form (also
called Backus-Normal Form), BNF for short. A strict BNF representation starts with a
set of tokens, called terminal symbols, and a set of definitions, called nonterminal
symbols. The definitions create a system in which every legal structure in the language can
be represented. One operator is supported, the ::= operator, translated by the phrase
"is defined as" or "goes to." For example, the following BNF rule might start a grammar for
an English sentence:
    sentence ::= subject predicate
A sentence is defined as a subject followed by a predicate. You can also say "a sentence
goes to a subject followed by a predicate." Each rule of this type is called a production.
The nonterminal to the left of the ::= is the left-hand side and everything to the right of
the ::= is the right-hand side of the production. In the grammars used in this book, the
left-hand side of a production always consists of a single, nonterminal symbol, and every
nonterminal that's used on a right-hand side must also appear on a left-hand side. All
symbols that don't appear on a left-hand side, such as the tokens in the input language,
are terminal symbols.
A real grammar continues with further definitions until all the terminal symbols are
accounted for. For example, the grammar could continue with:
    subject ::= noun
    noun    ::= JANE
where JANE is a terminal symbol (a token that matches the string "Jane" in the input).
The strict BNF is usually modified to make a grammar easier to type, and I'll use a
modified BNF in this book. The first modification is the addition of an OR operator,
represented by a vertical bar (|). For example,
    noun ::= JANE
    noun ::= DICK
    noun ::= SPOT
is represented as follows:
    noun ::= DICK | JANE | SPOT
Similarly, a → is often substituted for the ::= as in
    noun → DICK | JANE
I use the → in most of this book. I also consistently use italics for nonterminal symbols
and boldface for terminals (symbols such as + and * are also always terminals; they'll
also be in boldface, but sometimes it's hard to tell).
There's one other important concept. Grammars must be as flexible as possible, and
one of the ways to get that flexibility is to make the application of certain rules optional.
A rule like this:
    article → THE
says that THE is an article, and you can use that production like this:
    object → article noun
In English, an object is an article followed by a noun. A rule like the foregoing requires
that all nouns that comprise an object be preceded by an article. But what if you want
the article to be optional? You can do this by saying that an article can either be the
word THE or an empty string. The following is used to do this:
    article → THE | ε
The ε (pronounced "epsilon") represents an empty string. If the THE token is present in
the input, then the
    article → THE
production is used. If it is not there, however, then the article matches an empty string,
and
    article → ε
is used. So, the parser determines which of the two productions to apply by examining
the next input symbol.
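The choice between a production and its ε alternative can be sketched in a few lines of C. This is a minimal illustration under stated assumptions: the token codes T_THE, T_NOUN, and T_END and the functions article() and object() are hypothetical names invented here, and the input is simply an array of token codes.

```c
/* Hypothetical token codes for this sketch only. */
enum { T_THE, T_NOUN, T_END };

/* article -> THE | epsilon
 * Returns the number of tokens consumed: 1 if THE was matched,
 * 0 if the epsilon production was applied. */
int article(const int *tokens)
{
    if (tokens[0] == T_THE)
        return 1;           /* article -> THE     */
    return 0;               /* article -> epsilon */
}

/* object -> article noun
 * Returns 1 if the token array starts with a legal object, 0 otherwise. */
int object(const int *tokens)
{
    int used = article(tokens);     /* 0 or 1 tokens consumed */
    return tokens[used] == T_NOUN;
}
```

Note that article() never fails: when the next token is not THE it consumes nothing, which is exactly what applying the ε production means.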
A grammar that recognizes a limited set of English sentences is shown below:
    sentence        → subject predicate
    subject         → noun
    predicate       → verb object
    object          → noun opt_participle
    opt_participle  → participle | ε
    noun            → SPOT | JANE | DICK
    participle      → RUN
    verb            → SEES
An input sentence can be recognized using this grammar, with a series of
replacements, as follows:
(1) Start out with the topmost symbol in the grammar, the goal symbol.
(2) Replace that symbol with one of its right-hand sides.
(3) Continue replacing nonterminals, always replacing the leftmost nonterminal with
    its right-hand side, until there are no more nonterminals to replace.
For example, the grammar can be used to recognize "Jane sees Spot run" as follows:
    sentence                         apply sentence → subject predicate to get:
    subject predicate                apply subject → noun to get:
    noun predicate                   apply noun → JANE to get:
    JANE predicate                   apply predicate → verb object to get:
    JANE verb object                 apply verb → SEES to get:
    JANE SEES object                 apply object → noun opt_participle to get:
    JANE SEES noun opt_participle    apply noun → SPOT to get:
    JANE SEES SPOT opt_participle    apply opt_participle → participle to get:
    JANE SEES SPOT participle        apply participle → RUN to get:
    JANE SEES SPOT RUN               done; there are no more nonterminals to replace
These replacements can be used to build the parse tree. For example, replacing sentence
with subject predicate is represented in tree form like this:
         sentence
         /      \
    subject    predicate
The second replacement, of subject with noun, would modify the tree like this:
         sentence
         /      \
    subject    predicate
       |
     noun
and so forth. The evolution of the entire parse tree is pictured in Figure 1.5.
A glance at the parse tree tells you where the terms terminal and nonterminal come
from. Terminal symbols are always leaves in the tree (they're at the end of a branch),
and nonterminal symbols are always interior nodes.
1.2.2 An Expression Grammar
Table 1.1 shows a grammar that recognizes a list of one or more statements, each of
which is an arithmetic expression followed by a semicolon. Statements are made up of a
series of semicolon-delimited expressions, each comprising a series of numbers
separated either by asterisks (for multiplication) or plus signs (for addition).
Note that the grammar is recursive. For example, Production 2 has statements on
both the left- and right-hand sides. There's also third-order recursion in Production 8,
since it contains an expression, but the only way to get to it is through Production 3,
which has an expression on its left-hand side. This last recursion is made clear if you
make a few algebraic substitutions in the grammar. You can substitute the right-hand
side of Production 6 in place of the reference to term in Production 4, yielding
    expression → factor
and then substitute the right-hand side of Production 8 in place of the factor:
    expression → ( expression )
Since the grammar itself is recursive, it stands to reason that recursion can also be used
to parse the grammar; I'll show how in a moment. The recursion is also important from
a structural perspective: it is the recursion that makes it possible for a finite grammar to
recognize an infinite number of sentences.
[Figure 1.5. Evolution of a Parse Tree]
Table 1.1. A Simple Expression Grammar
    1. statements  → expression ;
    2.             | expression ; statements
    3. expression  → expression + term
    4.             | term
    5. term        → term * factor
    6.             | factor
    7. factor      → number
    8.             | ( expression )
The strength of the foregoing grammar is that it is intuitive: its structure directly
reflects the way that an expression goes together. It has a major problem, however. The
leftmost symbol on the right-hand side of several of the productions is the same symbol
that appears on the left-hand side. In Production 3, for example, expression appears both
on the left-hand side and at the far left of the right-hand side. The property is called left
recursion, and certain parsers (such as the recursive-descent parser that I'll discuss in a
moment) can't handle left-recursive productions. They just loop forever, repetitively
replacing the leftmost symbol in the right-hand side with the entire right-hand side.
You can understand the problem by considering how the parser decides to apply a
particular production when it is replacing a nonterminal that has more than one
right-hand side. The simple case is evident in Productions 7 and 8. The parser can choose
which production to apply when it's expanding a factor by looking at the next input
symbol. If this symbol is a number, then the compiler applies Production 7 and replaces the
factor with a number. If the next input symbol is an open parenthesis, the parser
uses Production 8. The choice between Productions 5 and 6 cannot be solved in
this way, however. In the case of Production 6, the right-hand side of term starts with a
factor which, in turn, starts with either a number or left parenthesis. Consequently, the
parser would like to apply Production 6 when a term is being replaced and the next input
symbol is a number or left parenthesis. Production 5 (the other right-hand side) starts
with a term, which can start with a factor, which can start with a number or left
parenthesis, and these are the same symbols that were used to choose Production 6. To
summarize, the parser must be able to choose between one of several right-hand sides by
looking at the next input symbol. It could make this decision in Productions 7 and 8, but
it cannot make this decision in Productions 5 and 6, because both of the latter
productions can start with the same set of terminal symbols.
The previous situation, where the parser can't decide which production to apply, is
called a conflict, and one of the more difficult tasks of a compiler designer is creating a
grammar that has no conflicts in it. The next input symbol is called the lookahead
symbol because the parser looks ahead at it to resolve a conflict.
Unfortunately, for reasons that are discussed in Chapter Four, you can't get rid of the
recursion by swapping the first and last production element, like this:
    expression → term + expression
so the grammar must be modified in a very counterintuitive way in order to build a
recursive-descent parser for it. Several techniques can be used to modify grammars so
that a parser can handle them, and all of these are discussed in depth in Chapter Four.
I'll use one of them now, however, without any real explanation of why it works. Take it
on faith that the grammar in Table 1.2 recognizes the same input as the one we've been
using. (I'll discuss the ⊢ and ε that appear in the grammar momentarily.) The modified
grammar is obviously an inferior grammar in terms of self-documentation: it is difficult
to look at it and see the language that's represented. On the other hand, it works with a
recursive-descent parser, and the previous grammar doesn't.
Table 1.2. Modified Simple-Expression Grammar
     1. statements   → ⊢
     2.              | expression ; statements
     3. expression   → term expression'
     4. expression'  → + term expression'
     5.              | ε
     6. term         → factor term'
     7. term'        → * factor term'
     8.              | ε
     9. factor       → number
    10.              | ( expression )
The ⊢ symbol is an end-of-input marker. For the purposes of parsing, end of file is
treated as an input token, and ⊢ represents end of input in the grammar. In this grammar,
Production 1 is expanded if the current input symbol is end of input, otherwise
Production 2 is used. Note that an explicit end-of-input marker is often omitted from a
grammar, in which case ⊢ is implied as the rightmost symbol of the starting production
(the production whose left-hand side appears at the apex of the parse tree). Since
eliminating the ⊢ symbol removes the entire right-hand side in the current grammar, you can
use the following as an alternate starting production:
    statements → ε | expression ; statements
In English: statements can go to an empty string followed by an implied end-of-input
marker.
The replacement of the left-hand side by ε (the empty string) occurs whenever the
current input symbol doesn't match a legal lookahead symbol. In the current grammar, a
term' is replaced with the right-hand side * factor term' if the lookahead symbol (the next
input symbol) is a *. The term' is replaced with ε if the next input symbol isn't a *. The
process is demonstrated in Figure 1.6, which shows a parse of 1+2 using the modified
grammar in Table 1.2. The ε production stops things from going on forever.
[Figure 1.6. A Parse of 1+2]
Note that ε is a terminal symbol that is not a token. It always appears at the end of a
branch in the parse tree, so it is a terminal, but it does not represent a corresponding
token in the input stream (just the opposite in fact: it represents the absence of a
particular token in the input stream).
1.2.3 Syntax Diagrams
You can prove to yourself that the grammar in Table 1.2 works as expected by
representing it in a different way: as a syntax diagram. We saw earlier that a syntax
diagram can represent the entire syntactic structure of a parse, but you can also use it in a
more limited sense to represent the syntax of a single production. Syntax diagrams are
useful in writing recursive-descent compilers because they translate directly into flow
charts (that's the main reason we're looking at them now). You can use them as a map
that describes the structure of the parser (more on this in a moment). They are also
somewhat more intuitive to an uninitiated reader, so they often make better
documentation than does a formal grammar.
I'll translate our grammar into a syntax diagram in two steps. First, several of the
productions can be merged together into a single diagram. Figure 1.7 represents
Productions 3, 4, and 5 of the grammar in Table 1.2. The ε production is represented
by the uninterrupted line that doesn't go through a box. You can combine these two
graphs by substituting the bottom graph for the reference to it in the top graph, and the
same process can be applied to Productions 6, 7, and 8.
[Figure 1.7. Syntax Diagram for Productions 3, 4, and 5]
The entire grammar in Table 1.2 is represented as a syntax diagram in Figure 1.8. The
topmost diagram, for example, defines a statement as a list of one or more
semicolon-delimited expressions. The same thing is accomplished by
    statements → expression ;
               | expression ; statements
but the BNF form is harder to understand.
The merged diagram also demonstrates graphically how the modified grammar
works. Just look at it like a flow chart, where each box is a subroutine, and each circle or
ellipse is the symbol that must be in the input when the subroutine returns. Passing
through the circled symbol removes a terminal from the input stream, and passing
through a box represents a subroutine call that evaluates a nonterminal.
1.3 A Recursive-Descent Expression Compiler
We now know enough to build a small compiler, using the expression grammar
we've been looking at (in Table 1.2). Our goal is to take simple arithmetic
expressions as input and generate code that evaluates those expressions at run time. An
expression like a+b*c+d is translated to the following intermediate code:
    t0  = _a ;
    t1  = _b ;
    t1 *= _c ;
    t0 += t1 ;
    t0 += _d ;
1.3.1 The Lexical Analyzer
[Figure 1.8. A Syntax Diagram]
The first order of business is defining a token set. With the exception of numbers and
identifiers, all the lexemes are single characters. (Remember, a token is an input symbol
taken as a unit, a lexeme is the string that represents that symbol.) A NUM_OR_ID
token is used both for numbers and identifiers; so, they are made up of a series of
contiguous characters in the range '0'-'9', 'a'-'z', or 'A'-'Z'. The tokens
themselves are defined with the macros at the top of lex.h, Listing 1.1. The lexical analyzer
translates a semicolon into a SEMI token, a series of digits into a NUM_OR_ID token,
and so on. The three external variables at the bottom of lex.h are used by the lexical
analyzer to pass information to the parser: yytext points at the current lexeme, which
is not '\0' terminated; yyleng is the number of characters in the lexeme; and
yylineno is the current input line number. (I've used these somewhat strange names
because both lex and LEX use the same names. Usually, I try to make global-variable
names begin with an upper-case letter and macro names are all caps. This way you can
distinguish these names from local-variable names, which are always made up of
lower-case letters only. It seemed best to retain UNIX compatibility in the current situation,
however.)
Listing 1.1. lex.h Token Definitions and extern statements
    #define EOI        0     /* end of input                  */
    #define SEMI       1     /* ;                             */
    #define PLUS       2     /* +                             */
    #define TIMES      3     /* *                             */
    #define LP         4     /* (                             */
    #define RP         5     /* )                             */
    #define NUM_OR_ID  6     /* decimal number or identifier  */

    extern char *yytext;     /* in lex.c                      */
    extern int  yyleng;
    extern int  yylineno;
The lexical analyzer itself starts on line nine of lex.c, Listing 1.2. It uses a simple,
buffered, input system, getting characters a line at a time from standard input, and then
isolating tokens, one at a time, from the line. Another input line is fetched only when the
entire line is exhausted. There are two main advantages to a buffered system (neither of
which is really exercised here, though the situation is different in the more
sophisticated input system discussed in Chapter Two). These are speed and speed. Computers
like to read data in large chunks. Generally, the larger the chunk the faster the
throughput. Though a 128-byte buffer isn't really a large enough chunk to make a
difference, once the buffer size gets above the size of a disk cluster, the changes are more
noticeable, especially if you can use the unbuffered I/O functions to do your reading and
writing. The second speed issue has to do with lookahead and pushback. Lexical
analyzers often have to know what the next input character is going to be without
actually reading past it. They must look ahead by some number of characters. Similarly,
they often need to read past the end of a lexeme in order to recognize it, and then push
back the unnecessary characters onto the input stream. Consequently, there are often
extra characters that must be handled specially. This special handling is both difficult
and slow when you're using single-character input. Going backwards in a buffer,
however, is simply a matter of moving a pointer.
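The claim that lookahead and pushback reduce to pointer arithmetic in a buffered system can be sketched as follows. This is an illustration only, not the input system of Chapter Two: the Input structure and the functions in_get(), in_peek(), and in_unget() are hypothetical names introduced here, and the buffer is assumed to be a '\0'-terminated string already in memory.

```c
#include <stddef.h>

/* Hypothetical in-memory input buffer. */
typedef struct {
    const char *start;   /* first character of the buffered line */
    const char *cur;     /* next character to be read            */
} Input;

/* Read one character, or return -1 at the end of the buffer. */
int in_get(Input *in)
{
    return *in->cur ? (unsigned char)*in->cur++ : -1;
}

/* Look ahead without reading: the pointer is not moved. */
int in_peek(const Input *in)
{
    return *in->cur ? (unsigned char)*in->cur : -1;
}

/* Push back up to n characters by moving the pointer backwards.
 * Returns the number of characters actually pushed back, which is
 * smaller than n if fewer than n characters have been read. */
size_t in_unget(Input *in, size_t n)
{
    size_t avail = (size_t)(in->cur - in->start);

    if (n > avail)
        n = avail;
    in->cur -= n;
    return n;
}
```

Pushing back a whole lexeme costs the same as pushing back one character, which is the advantage over a character-at-a-time scheme.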
Listing 1.2. lex.c A Simple Lexical Analyzer
 1  #include "lex.h"
 2  #include <stdio.h>
 3  #include <ctype.h>
 4
 5  char    *yytext   = "";    /* Lexeme (not '\0' terminated)   */
 6  int     yyleng    = 0;     /* Lexeme length.                 */
 7  int     yylineno  = 0;     /* Input line number              */
 8
 9  lex()
10  {
11      static char input_buffer[128];
12      char        *current;
13
14      current = yytext + yyleng;      /* Skip current lexeme  */
15
16      while( 1 )                      /* Get the next one     */
17      {
18          while( !*current )
19          {
20              /* Get new lines, skipping any leading white space on the line,
21               * until a nonblank line is found.
22               */
23
24              current = input_buffer;
25              if( !fgets( input_buffer, sizeof(input_buffer), stdin ) )
26              {
27                  *current = '\0' ;
28                  return EOI;
29              }
30
31              ++yylineno;
32
33              while( isspace(*current) )
34                  ++current;
35          }
36
37          for( ; *current ; ++current )
38          {
39              /* Get the next token */
40
41              yytext = current;
42              yyleng = 1;
43
44              switch( *current )
45              {
46              case EOF: return EOI   ;
47              case ';': return SEMI  ;
48              case '+': return PLUS  ;
49              case '*': return TIMES ;
50              case '(': return LP    ;
51              case ')': return RP    ;
52
53              case '\n':
54              case '\t':
55              case ' ' : break;
56
57              default:
58                  if( !isalnum(*current) )
59                      fprintf(stderr, "Ignoring illegal input <%c>\n", *current);
60                  else
61                  {
62                      while( isalnum(*current) )
63                          ++current;
64
65                      yyleng = current - yytext;
66                      return NUM_OR_ID;
67                  }
68
69                  break;
70              }
71          }
72      }
73  }
The input buffer used by lex() is declared on line 11 of Listing 1.2. current (on
line 12) points at the current position in the buffer. On the first call, the increment on
line 14 initializes current to point at an empty string (yyleng is 0 at this juncture, and
yytext points at an empty string because of the initializer on line five). The while
statement on line 18 tests true as a consequence. This while loop has two purposes. It
gets lines (and increments the line number), and it skips past all blank lines (including
lines that contain only white space). The loop doesn't terminate until input_buffer
holds a nonblank line, and current will point at the first nonwhite character on that
line.
The for loop starting on line 37 does the actual tokenization. Single-character
lexemes are recognized on lines 46-51, white space is ignored by the cases on lines 53-55,
and the multiple-character NUM_OR_ID token is handled in the else clause on lines
60-67. An error message is printed if an illegal character is found. When the loop
terminates, yytext points at the first character of the lexeme, and yyleng holds its length.
The next time lex() is called, the code on line 14 adjusts the current pointer to
point past the previous lexeme, and then, if the input buffer hasn't been exhausted, the
while test on line 18 fails and you go straight to the token-isolation code. lex() won't
get another input line until it reaches the end of the line; *current is '\0' in this
case.
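The pointer-plus-length representation of a lexeme can be tried out without standard input by tokenizing a string in memory. The following is a simplified sketch, not the lex() of Listing 1.2: the Lexeme structure, the next_token() function, and the ILLEGAL code are hypothetical names introduced for the illustration; only the other token names and values follow lex.h.

```c
#include <ctype.h>

/* Token codes as in lex.h, plus a hypothetical ILLEGAL code. */
enum { ILLEGAL = -1, EOI, SEMI, PLUS, TIMES, LP, RP, NUM_OR_ID };

typedef struct {
    int         token;   /* token code                               */
    const char *text;    /* start of the lexeme (not '\0' terminated) */
    int         leng;    /* number of characters in the lexeme       */
} Lexeme;

/* Isolate the next token starting at *pp and advance *pp past it. */
Lexeme next_token(const char **pp)
{
    const char *p = *pp;
    Lexeme lx;

    while (isspace((unsigned char)*p))      /* skip white space */
        ++p;

    lx.text = p;
    lx.leng = 1;

    switch (*p) {
    case '\0': lx.token = EOI; lx.leng = 0; break;  /* stay at the end */
    case ';':  lx.token = SEMI;  ++p; break;
    case '+':  lx.token = PLUS;  ++p; break;
    case '*':  lx.token = TIMES; ++p; break;
    case '(':  lx.token = LP;    ++p; break;
    case ')':  lx.token = RP;    ++p; break;
    default:
        if (isalnum((unsigned char)*p)) {
            while (isalnum((unsigned char)*p))
                ++p;
            lx.token = NUM_OR_ID;
            lx.leng  = (int)(p - lx.text);
        } else {
            lx.token = ILLEGAL;             /* caller decides what to do */
            ++p;
        }
        break;
    }

    *pp = p;
    return lx;
}
```

Because the lexeme is not copied or terminated, a caller that wants to print it has to use the length, for example with printf("%.*s", lx.leng, lx.text).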
The remainder of lex.c (in Listing 1.3) addresses the problem of lookahead. The
parser must look at the next input token without actually reading it. Though a
read/pushback scheme similar to getc()/ungetc() could be used for this purpose, it's
generally a good idea to avoid going backwards in the input, especially if you have to
push back entire lexemes rather than single characters. The problem is solved by using
two subroutines: match(token) evaluates to true if the next token in the input stream
matches its argument; it looks ahead at the next input symbol without reading it.
advance() discards the current token and advances to the next one. This strategy
eliminates the necessity of a push-back subroutine such as ungetc().
The Lookahead variable (on line 74) holds the lookahead token. It's initialized
to -1, which is not used for any of the input tokens. It's modified to hold a real token the
first time match() is called. Thereafter, the test on line 81 will become inoperative and
match() simply returns true if Lookahead matches its argument. This approach is
relatively foolproof, though the fool in this case is myself. I know that I'll regularly
forget to call an initialization routine before calling match(), so I'll let match()
initialize itself the first time it's called. The advance() function just calls lex() to
assign a new value to Lookahead.
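The behavior of the match()/advance() pair can be observed in isolation by replacing the real lexical analyzer with a stub. In this sketch the stub_lex() function and the Stub_tokens array are hypothetical stand-ins for lex(); they replay the token sequence for an input such as "1+2;" so that nothing has to be typed.

```c
/* Token codes with the same values as in lex.h. */
enum { EOI, SEMI, PLUS, TIMES, LP, RP, NUM_OR_ID };

/* Hypothetical canned input: the tokens for something like "1+2;". */
static const int Stub_tokens[] = { NUM_OR_ID, PLUS, NUM_OR_ID, SEMI, EOI };
static int       Stub_pos      = 0;

static int stub_lex(void)
{
    int tok = Stub_tokens[Stub_pos];

    if (tok != EOI)
        ++Stub_pos;             /* keep returning EOI once it is reached */
    return tok;
}

static int Lookahead = -1;      /* -1 means "not yet initialized" */

int match(int token)
{
    if (Lookahead == -1)        /* self-initialize on the first call */
        Lookahead = stub_lex();

    return token == Lookahead;
}

void advance(void)
{
    Lookahead = stub_lex();
}
```

Calling match() any number of times leaves the input where it is; only advance() consumes a token.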
Listing 1.3. lex.c Match and Advance Functions
74  static int Lookahead = -1;      /* Lookahead token  */
75
76  int match( token )
77  int token;
78  {
79      /* Return true if "token" matches the current lookahead symbol. */
80
81      if( Lookahead == -1 )
82          Lookahead = lex();
83
84      return token == Lookahead;
85  }
86
87  void advance()
88  {
89      /* Advance the lookahead to the next input symbol. */
90
91      Lookahead = lex();
92  }
1.3.2 The Basic Parser
Moving on to the parser, since I'm planning to start with a naive implementation and
refine it, I've isolated main() into a small file (Listing 1.4) that I can compile once and
then link with the parser proper. The parser itself is called statements().
Listing 1.4. main.c
    main()
    {
        statements();
    }
The most naive parser for our grammar is shown in Listing 1.5. I've reproduced the
grammar here for convenience:
     1. statements   → expression ; ⊢
     2.              | expression ; statements
     3. expression   → term expression'
     4. expression'  → + term expression'
     5.              | ε
     6. term         → factor term'
     7. term'        → * factor term'
     8.              | ε
     9. factor       → num_or_id
    10.              | ( expression )
The parser generates no code, it just parses the input. Each subroutine corresponds to
the left-hand side in the original grammar that bears the same name. Similarly, the
structure of each subroutine exactly matches the grammar. For example, the production
    expression → term expression'
is implemented by the following subroutine (on line 23 of Listing 1.5):
    expression()
    {
        term();
        expr_prime();
    }
The ε production in
    expression' → + term expression' | ε
is implemented implicitly when the test on line 37 fails (if it's not a PLUS, it's an ε).
Note that each subroutine is responsible for advancing past any tokens that are on the
equivalent production's right-hand side.
Listing 1.5. plain.c A Naive Recursive-Descent Expression Parser
 1  /* Basic parser, shows the structure but there's no code generation */
 2
 3  #include <stdio.h>
 4  #include "lex.h"
 5
 6  statements()
 7  {
 8      /*  statements -> expression SEMI
 9       *             |  expression SEMI statements
10       */
11
12      expression();
13
14      if( match( SEMI ) )
15          advance();
16      else
17          fprintf( stderr, "%d: Inserting missing semicolon\n", yylineno );
18
19      if( !match(EOI) )
20          statements();               /* Do another statement. */
21  }
22
23  expression()
24  {
25      /* expression -> term expression' */
26
27      term();
28      expr_prime();
29  }
30
31  expr_prime()
32  {
33      /* expression' -> PLUS term expression'
34       *              | epsilon
35       */
36
37      if( match( PLUS ) )
38      {
39          advance();
40          term();
41          expr_prime();
42      }
43  }
44
45  term()
46  {
47      /* term -> factor term' */
48
49      factor();
50      term_prime();
51  }
52
53  term_prime()
54  {
55      /* term' -> TIMES factor term'
56       *        | epsilon
57       */
58
59      if( match( TIMES ) )
60      {
61          advance();
62          factor();
63          term_prime();
64      }
65  }
66
67  factor()
68  {
69      /* factor -> NUM_OR_ID
70       *         | LP expression RP
71       */
72
73      if( match(NUM_OR_ID) )
74          advance();
75
76      else if( match(LP) )
77      {
78          advance();
79          expression();
80          if( match(RP) )
81              advance();
82          else
83              fprintf( stderr, "%d: Mismatched parenthesis\n", yylineno );
84      }
85      else
86          fprintf( stderr, "%d: Number or identifier expected\n", yylineno );
87  }
You can now see why a production like
    expression → expression + term
can't be used by a recursive-descent parser. You can implement the foregoing as
follows:
    expression()
    {
        expression();
        if( !match( PLUS ) )
            error();
        else
            advance();
        term();
    }
But the first thing that expression() does is call itself, so the recursion never stops, and
the program never terminates, at least not until it runs out of stack space and is
abnormally terminated by the operating system.
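The runaway recursion can be observed safely by adding an artificial depth limit to a left-recursive subroutine. This is a demonstration under stated assumptions, not code from the book: MAX_DEPTH, expression_lr(), and the simplified single-digit term() are hypothetical, and the input is a string rather than a token stream.

```c
#include <ctype.h>

#define MAX_DEPTH 1000      /* hypothetical safety limit, for the demo only */

/* term: a single digit in this simplified sketch. */
static int term(const char **in)
{
    if (!isdigit((unsigned char)**in))
        return 0;
    ++*in;                  /* advance past the number */
    return 1;
}

/* expression -> expression + term, coded exactly as the production reads.
 * Returns 1 on success and 0 on failure. Without the depth test the
 * function would recurse until the stack overflowed. */
int expression_lr(const char **in, int depth)
{
    if (depth >= MAX_DEPTH)
        return 0;                           /* give up */

    if (!expression_lr(in, depth + 1))      /* called before any input is examined */
        return 0;

    if (**in != '+')
        return 0;
    ++*in;
    return term(in);
}
```

Even for a legal input such as "1+2" the function fails with the input pointer still at the first character, because every level of the recursion calls the next level before looking at the input.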
At this point I'd suggest doing an exercise. Using a pencil and paper, trace what
happens as the expression 1+2 is evaluated by the parser in Listing 1.5. Every time a
subroutine is called, draw a downward pointing arrow and write the name of the called
subroutine under the arrow; every time the subroutine returns, draw an arrow at the other
end of the same line. As the parser advances past tokens, write them down under the
name of the current subroutine. A partial subroutine trace for this expression is shown in
Figure 1.9. The diagram shows the condition of the parser when it is in subroutine
expr_prime() just before the term() call on line 40. It's advanced past the 1 and
the current lookahead token is the plus sign. If you finish this diagram, an interesting
fact emerges. The subroutine trace is identical to the parse tree for the same expression
in Figure 1.6. So, even though no physical parse tree is created, a parse tree
is implicit in the subroutine-calling sequence.
[Figure 1.9. A Partial Subroutine Trace for 1+2]
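The pencil-and-paper trace can also be produced mechanically. The sketch below is a hypothetical variant of the naive parser, not the code of Listing 1.5: it reads from a string instead of calling lex(), handles a single expression, and records every subroutine entry and exit in a buffer. The names trace_parse(), emit(), In, and Trace are assumptions made for the illustration.

```c
#include <ctype.h>
#include <string.h>

static const char *In;          /* next unread input character */
static char        Trace[256];  /* recorded calling sequence   */

/* Append s to the trace, silently truncating if the buffer is full. */
static void emit(const char *s)
{
    strncat(Trace, s, sizeof(Trace) - strlen(Trace) - 1);
}

static void expression(void);

static void factor(void)        /* factor -> number | ( expression ) */
{
    emit("factor(");
    if (isdigit((unsigned char)*In)) {
        while (isdigit((unsigned char)*In)) {
            char digit[2] = { *In++, '\0' };
            emit(digit);
        }
    } else if (*In == '(') {
        ++In;
        expression();
        if (*In == ')')
            ++In;
    }
    emit(")");
}

static void term_prime(void)    /* term' -> * factor term' | epsilon */
{
    emit("term'(");
    if (*In == '*') { emit("*"); ++In; factor(); term_prime(); }
    emit(")");
}

static void term(void)          /* term -> factor term' */
{
    emit("term(");
    factor();
    term_prime();
    emit(")");
}

static void expr_prime(void)    /* expression' -> + term expression' | epsilon */
{
    emit("expr'(");
    if (*In == '+') { emit("+"); ++In; term(); expr_prime(); }
    emit(")");
}

static void expression(void)    /* expression -> term expression' */
{
    emit("expression(");
    term();
    expr_prime();
    emit(")");
}

/* Parse one expression and return the recorded trace. */
const char *trace_parse(const char *input)
{
    In = input;
    Trace[0] = '\0';
    expression();
    return Trace;
}
```

Each opening parenthesis in the result marks a call and each closing parenthesis a return, so the nesting of the string is the parse tree; an empty pair such as term'() is a subroutine that applied its ε production.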
1.3.3 Improving the Parser
The naive parser discussed in the previous section is useful for explaining things, but
is not much use in practice. The main difficulty is the tremendous amount of
unnecessary recursion. Glancing at the syntax diagram for our grammar (in Figure 1.8),
two changes come to mind immediately. First, all the right recursion (productions
in which the left-hand side also appears at the far right of the right-hand side) can be
replaced by loops: If the last thing that a subroutine does is call itself, then that recursive
call can be replaced by a loop. Right recursion is often called tail recursion.
The second obvious improvement is that the same merging together of productions
that was done to make the second and third graphs in Figure 1.8 can be applied to the
subroutines that implement these productions. Both of these changes are made in Listing
1.6.
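The replacement of tail recursion by a loop can be seen in miniature by writing the same production both ways. This is a simplified sketch under stated assumptions: the input is a string, a term is a single digit, and the function names expr_prime_recursive(), expr_prime_loop(), and plus_term() are hypothetical.

```c
#include <ctype.h>

/* Does s start with "+d", where d is a single digit? */
static int plus_term(const char *s)
{
    return s[0] == '+' && isdigit((unsigned char)s[1]);
}

/* expression' -> + term expression' | epsilon, written recursively.
 * Returns a pointer to the first character that was not consumed. */
const char *expr_prime_recursive(const char *s)
{
    if (plus_term(s))
        return expr_prime_recursive(s + 2);   /* the call is the last action */
    return s;                                  /* epsilon */
}

/* The same production with the tail call replaced by a loop. */
const char *expr_prime_loop(const char *s)
{
    while (plus_term(s))
        s += 2;
    return s;
}
```

Both functions stop at the same place for any input; the loop version simply avoids one subroutine call per operator.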
Listing 1.6. improved.c An Improved Parser
 1  /* Revised parser */
 2
 3  #include <stdio.h>
 4  #include "lex.h"
 5
 6  void    factor      ( void );
 7  void    term        ( void );
 8  void    expression  ( void );
 9
10  statements()
11  {
12      /*  statements -> expression SEMI | expression SEMI statements */
13
14      while( !match(EOI) )
15      {
16          expression();
17
18          if( match( SEMI ) )
19              advance();
20          else
21              fprintf( stderr, "%d: Inserting missing semicolon\n", yylineno );
22      }
23  }
24
25  void    expression()
26  {
27      /* expression  -> term expression'
28       * expression' -> PLUS term expression' | epsilon
29       */
30
31      if( !legal_lookahead( NUM_OR_ID, LP, 0 ) )
32          return;
33
34      term();
35      while( match( PLUS ) )
36      {
37          advance();
38          term();
39      }
40  }
41
42  void    term()
43  {
44      if( !legal_lookahead( NUM_OR_ID, LP, 0 ) )
45          return;
46
47      factor();
48      while( match( TIMES ) )
49      {
50          advance();
51          factor();
52      }
53  }
54
55  void    factor()
56  {
57      if( !legal_lookahead( NUM_OR_ID, LP, 0 ) )
58          return;
59
60      if( match(NUM_OR_ID) )
61          advance();
62
63      else if( match(LP) )
64      {
65          advance();
66          expression();
67          if( match(RP) )
68              advance();
69          else
70              fprintf( stderr, "%d: Mismatched parenthesis\n", yylineno );
71      }
72      else
73          fprintf( stderr, "%d: Number or identifier expected\n", yylineno );
74  }
I've made one additional change here as well. I've introduced a little error recovery
by adding code to each subroutine that examines the current lookahead token before
doing anything else. It's an error if that token cannot legitimately occur in the input. For
example, expressions all have to start with either a NUM_OR_ID or an LP. If the lookahead
character is a PLUS at the top of expression(), then something's wrong.
This set of legitimate leading tokens is called a FIRST set, and every nonterminal
symbol has its own FIRST set. FIRST sets are discussed in depth in Chapter Three; the
informal definition will do for now, though.
The legal_lookahead() subroutine in Listing 1.7 checks these FIRST sets, and
if the next input symbol is not legitimate, tries to recover from the error by discarding all
input symbols up to the first token that matches one of the arguments. I've used the
ANSI variable-argument mechanism here, so the routine can take any number of
arguments, but the last one must be a 0. (This mechanism, and the <stdarg.h>
file, is described in Appendix A if you're not already familiar with it.)
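The zero-terminated argument convention can be sketched in isolation. The function below only tests membership in such a list; it does no error recovery, and its name, token_in_set(), is an assumption made for this example. It also assumes that no real token has the value 0, since 0 marks the end of the list.

```c
#include <stdarg.h>

/* Return 1 if tok appears in the 0-terminated list that starts with first,
 * 0 otherwise. An empty list is written token_in_set(tok, 0).
 */
int token_in_set(int tok, int first, ...)
{
    va_list args;
    int     candidate;
    int     found = 0;

    va_start(args, first);
    for (candidate = first; candidate != 0; candidate = va_arg(args, int)) {
        if (candidate == tok) {
            found = 1;
            break;
        }
    }
    va_end(args);               /* reached on every path, found or not */
    return found;
}
```

A call such as token_in_set(lookahead, NUM_OR_ID, LP, 0) would then ask whether the lookahead is in the FIRST set of an expression. Forgetting the trailing 0 makes va_arg() read past the real arguments, which is why the convention has to be followed at every call site.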
One final C style note is needed to head off the inevitable criticism of the goto statement
on line 118 of Listing 1.7. Though many programmers contend with almost religious
fervor that the goto should be obliterated from all structured programs, I strongly
feel that there are a few situations where a judiciously used goto makes for better code.
Here, the goto branch to a single label is vastly preferable to multiple return statements.
A subroutine that has a single exit point is much more maintainable than one
with several exit points. You can put a single breakpoint or debugging diagnostic at the
end of the subroutine instead of having to sprinkle them all over the place. You also
minimize the possibility of accidentally falling off the bottom of the subroutine without
returning a valid value. My rules of thumb about gotos are as follows:
• Gotos are appropriate in two situations: (1) to eliminate multiple return statements
  and (2) to break out of nested loops. You can also do (2) with a flag of some sort
  (while(!done)), but flags tend to make the code both larger and harder to read, so
  should be avoided.
• Don't use a goto unless it is the only solution to the problem. You can often
  eliminate the need for a goto by reorganizing the code.
• A subroutine should have at most one label.
• All goto branches should be in a downwards direction to that single label.
• The target of a goto branch should be in the same block or at a higher (more outer)
  nesting level than the goto itself. Don't do this:
      {
          goto label;
      }

      {
          label:
      }
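As an illustration of situation (2) in the rules above, the following sketch uses one downward goto to leave two nested loops and reach a single exit point. The function and its name, find_in_grid(), are hypothetical and are not part of the book's code.

```c
/* Search a rows-by-cols grid (stored row by row) for target.
 * Returns the index of the first match, or -1 if target is not present.
 */
int find_in_grid(const int *grid, int rows, int cols, int target)
{
    int r, c;
    int index = -1;

    for (r = 0; r < rows; ++r)
        for (c = 0; c < cols; ++c)
            if (grid[r * cols + c] == target) {
                index = r * cols + c;
                goto done;      /* leave both loops at once, branching downward */
            }

done:
    return index;               /* the single exit point */
}
```

The label sits at the outermost nesting level of the function, so the branch obeys the last rule in the list as well. Writing the same search with a done flag would need an extra test in both loop conditions.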
Listing 1.7. improved.c: Error Recovery for the Improved Parser

 75  #include <stdarg.h>
 76
 77  #define MAXFIRST 16
 78  #define SYNCH    SEMI
 79
 80  int legal_lookahead( int first_arg, ... )
 81      /* ANSI variadic form; va_start() requires the "..." */
 82  {
 83      /* Simple error detection and recovery. Arguments are a 0-terminated list of
 84       * those tokens that can legitimately come next in the input. If the list is
 85       * empty, the end of file must come next. Print an error message if
 86       * necessary. Error recovery is performed by discarding all input symbols
 87       * until one that's in the input list is found.
 88       *
 89       * Return true if there's no error or if we recovered from the error,
 90       * false if we can't recover.
 91       */
 92
 93      va_list  args;
 94      int      tok;
 95      int      lookaheads[MAXFIRST], *p = lookaheads, *current;
 96      int      error_printed = 0;
 97      int      rval          = 0;
 98
 99      va_start( args, first_arg );
100
101      if( !first_arg )
102      {
103          if( match(EOI) )
104              rval = 1;
105      }
106      else
107      {
108          *p++ = first_arg;
109          while( (tok = va_arg(args, int)) && p < &lookaheads[MAXFIRST] )
110              *p++ = tok;         /* store in the next free slot */
111
112          while( !match( SYNCH ) )
113          {
114              for( current = lookaheads; current < p; ++current )
115                  if( match( *current ) )
116                  {
117                      rval = 1;
118                      goto exit;
119                  }
120
121              if( !error_printed )
122              {
123                  fprintf( stderr, "Line %d: Syntax error\n", yylineno );
124                  error_printed = 1;
125              }
126
127              advance();
128          }
129      }
130
131  exit:
132      va_end( args );
133      return rval;
134  }
1.3.4 Code Generation
The parsers that we just looked at are, strictly speaking, recognizer programs in that,
if they terminate without an error, the input sentence is a legal sentence in the grammar.
All they do is recognize legal input sentences. Our goal is to build a compiler, however,
and to do this, you need to add code generation to the bare-bones recognizer. Given the
input
    1 + 2 * 3 + 4
a typical compiler generates the following code:
    t0 = 1
    t1 = 2
    t2 = 3
    t1 *= t2
    t0 += t1
    t1 = 4
    t0 += t1
An optimizer will clean up the unnecessary assignments. It's useful, for now, to look at
the raw output, however. The temporary variables (t0, and so forth) are maintained
internally by the compiler. A real compiler typically uses registers or the run-time stack
for temporaries. Here, they're just global variables. The expression is evaluated operator
by operator, with each temporary holding the result of evaluating the current
subexpression. Sometimes (as is the case in this example) several temporaries must exist
simultaneously in order to defer application of an operator because of precedence or
associativity problems. Here, t0 holds the left operand of the addition operator until the
higher-precedence multiply is performed.
You can also look at the temporary-variable assignments in terms of a syntax tree.
The syntax tree for 1+2*3+4 is shown in Figure 1.10. The nodes are marked with the
names of the temporaries that hold the evaluated subexpression represented by the
subtree.
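One way to convince yourself that the generated code is right is to transcribe it into C and run it. The sketch below does that under the assumption stated in the text that the temporaries are ordinary global variables; the wrapper function run_generated_code() is added here only so that the result can be inspected.

```c
/* The seven generated instructions for 1 + 2 * 3 + 4, transcribed as C. */
int t0, t1, t2;

int run_generated_code(void)
{
    t0  = 1;
    t1  = 2;
    t2  = 3;
    t1 *= t2;   /* t1 = 2 * 3 = 6; t0 still holds the pending left operand */
    t0 += t1;   /* t0 = 1 + 6 = 7  */
    t1  = 4;    /* t1 is recycled for the last operand */
    t0 += t1;   /* t0 = 7 + 4 = 11 */
    return t0;
}
```

The function returns 11, which is the value of 1+2*3+4 under the usual precedence rules. Note that t1 is used twice: once for the product and again for the final operand.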
The first thing you need to generate code is a mechanism for allocating temporaries.
Ideally, they should be recycled: the temporaries should be reused after they are no
longer needed in the current subexpression. A simple, but effective, mechanism to do
this is shown in Listing 1.8. (We'll look at a more elaborate system in Chapter Six.) A
stack of temporary-variable names is declared on line one. When a new name is
required, newname() pops one off the stack. When the temporary is no longer needed, a
freename() call pushes it back.
The next code-generation problem is determining the name of the temporary that
holds the partially-evaluated expression at any given moment. This information is
passed around between the subroutines in the normal way, using arguments and return
values.
To demonstrate the differences between the two methods, I'll show two parsers, one
that uses return values exclusively, and another that uses arguments exclusively. The
subroutine-calling tree for a parse of 1+2*3+4; (using the improved parser) is in Figure
1.11. The subroutines that generated this tree (and the earlier code) are in Listing 1.9.
The arrows are marked to show the flow of temporary-variable names during the parse.
Control flow goes counterclockwise around the drawing (the nodes are subscripted to
show the order in which they're visited).
Listing 1.8. name.c: Temporary-Variable Allocation Routines

 1  char  *Names[] = { "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7" };
 2  char  **Namep  = Names;
 3
 4  char  *newname( void )
 5  {
 6      if( Namep >= &Names[ sizeof(Names)/sizeof(*Names) ] )
 7      {
 8          fprintf( stderr, "%d: Expression too complex\n", yylineno );
 9          exit( 1 );
10      }
11
12      return( *Namep++ );
13  }
14
15  void  freename( s )
16  char  *s;
17  {
18      if( Namep > Names )
19          *--Namep = s;
20      else
21          fprintf( stderr, "%d: (Internal error) Name stack underflow\n",
22                                                                  yylineno );
23  }
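The recycling behavior of such a name stack is easy to check in isolation. The following sketch re-creates the idea with a smaller table and without the error messages; alloc_temp(), free_temp(), and temp_is_recycled() are hypothetical names chosen so that the example does not clash with the listing's own routines.

```c
#include <stddef.h>

static char  *names[] = { "t0", "t1", "t2", "t3" };
static char **namep   = names;          /* next free slot on the name stack */

/* Pop the next free name, or return NULL if every name is in use. */
char *alloc_temp(void)
{
    if (namep >= names + sizeof(names) / sizeof(*names))
        return NULL;
    return *namep++;
}

/* Push a name back so that the next alloc_temp() call reuses it. */
void free_temp(char *s)
{
    if (namep > names)
        *--namep = s;
}

/* Allocate two names, release the second, and allocate again.
 * Returns 1 if the released name was handed out again.
 */
int temp_is_recycled(void)
{
    char *a = alloc_temp();             /* "t0" */
    char *b = alloc_temp();             /* "t1" */
    char *c;

    free_temp(b);
    c = alloc_temp();                   /* "t1" again */

    free_temp(c);                       /* leave the stack as we found it */
    free_temp(a);
    return c == b;
}
```

Because names are pushed and popped at the same end, the most recently freed temporary is always the next one handed out. That is why the generated code for 1+2*3+4 reuses t1 for the final operand.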
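Figure 1.11. A Subroutine Trace of 1+2*3+4 (Improved Parser)
[Figure: subroutine-calling tree with the nodes statements, expression, term, and factor,
subscripted in the order visited; the arrows carry the temporaries t0 and t1, and the
leaves are the input tokens 1, 2, 3, and 4.]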
Listing 1.9. retval.c: Code Generation Using Return Values

#include <stdio.h>
#include "lex.h"

char    *factor     ( void );
char    *term       ( void );
char    *expression ( void );

extern char *newname ( void       );
extern void  freename( char *name );

void statements( void )
{
    /* statements -> expression SEMI | expression SEMI statements */

    char *tempvar;

    while( !match(EOI) )
    {
        tempvar = expression();

        if( match( SEMI ) )
            advance();
        else    /* message text is illegible in the source; this wording is assumed */
            fprintf( stderr, "%d: Inserting missing semicolon\n", yylineno );

        freename( tempvar );
    }
}

char    *expression( void )
{
    /* expression  -> term expression'
     * expression' -> PLUS term expression' | epsilon
     */

    char *tempvar, *tempvar2;

    tempvar = term();
    while( match( PLUS ) )
    {
        advance();
        tempvar2 = term();
        printf("    %s += %s\n", tempvar, tempvar2 );
        freename( tempvar2 );
    }

    return tempvar;
}

char    *term( void )
{
    char *tempvar, *tempvar2;

    tempvar = factor();
    while( match( TIMES ) )
    {
        advance();
        tempvar2 = factor();
        printf("    %s *= %s\n", tempvar, tempvar2 );
        freename( tempvar2 );
    }

    return tempvar;
}

char    *factor( void )
{
    char *tempvar;

    if( match(NUM_OR_ID) )
    {
        /* Print the assignment instruction. The %.*s conversion is a form of
         * %X.Ys, where X is the field width and Y is the maximum number of
         * characters that will be printed (even if the string is longer). I'm
         * using it to print the string because it's not \0 terminated.
         * No width is given, so the field grows to the size needed
         * to print the string. The .* tells printf() to take the maximum-
         * number-of-characters count from the next argument (yyleng).
         */

        printf("    %s = %.*s\n", tempvar = newname(), yyleng, yytext );
        advance();
    }
    else if( match(LP) )
    {
        advance();
        tempvar = expression();
        if( match(RP) )
            advance();
        else    /* message text is missing from the source; this wording is assumed */
            fprintf( stderr, "%d: Mismatched parenthesis\n", yylineno );
    }
    else
    {
        fprintf( stderr, "%d: Number or identifier expected\n", yylineno );
        tempvar = newname();    /* give the caller a valid name to use and free */
    }

    return tempvar;
}
A likely place to generate instructions of the form t0=1 is in factor(), the subroutine
that reads the 1. factor() calls newname() to get an anonymous-temporary
name, generates the code to copy the input number into the temporary, and then returns
the name to its parent. Similarly, the best place to generate multiplication instructions is
the place where the times signs are read: in term(). After the two factor() calls,
tempvar and tempvar2 hold the names of the two temporaries. Code is generated to
do the multiplication, one of the temporaries is freed, and the other (which holds the
result of the multiply) is passed back up to the parent. So, this temporary, the one that
holds the result of the subexpression evaluation, is used in the next step. Addition is
handled the same way in expression().
Just to make sure that you understand the process, I suggest taking a moment to run
through a parse of 1*(2+3)*4 by hand, creating a subroutine-calling graph as in the
previous example and watching how the code is generated.
Listing 1.10 shows the parser, modified once more to use subroutine arguments rather
than return values. Here, instead of allocating the names when the instructions are
generated, the temporary variables are allocated high up in the tree. Each subroutine passes
to its child the name of the temporary in which it wants the subexpression result. That is,
the high-level routine is saying: "Do what you must to evaluate any subexpressions, but
by the time you're finished, I want the result to be in the temporary variable whose name
I'm passing to you." Recursive-descent compilers often use both of the methods just
discussed; neither is more or less desirable. The code can be easier to understand if you
restrict yourself to one or the other, though that's not always possible.
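If you want to experiment with the return-value method without building the lexical analyzer, a self-contained sketch along the following lines may help. It is a simplification, not the book's code: it reads from a string with no white space, identifies temporaries by number rather than by name, frees a temporary by decrementing a counter (which works only because temporaries are released in the reverse of the order in which they were allocated), assumes small numeric operands, and collects its output in a buffer. All of the names in it are assumptions.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *in;          /* next unread input character         */
static char        out[256];    /* generated code accumulates here     */
static int         next_temp;   /* number of the next unused temporary */

static int expression(void);

/* Append one formatted instruction to out, truncating if out is full. */
static void emit(const char *fmt, int dst, int src)
{
    size_t used = strlen(out);
    snprintf(out + used, sizeof(out) - used, fmt, dst, src);
}

/* factor -> NUMBER | '(' expression ')'
 * Returns the number of the temporary that holds the value.
 */
static int factor(void)
{
    int temp;

    if (*in == '(') {
        ++in;
        temp = expression();
        if (*in == ')')
            ++in;
    } else {
        int value = 0;
        while (isdigit((unsigned char)*in))
            value = value * 10 + (*in++ - '0');
        temp = next_temp++;                 /* plays the role of newname() */
        emit("t%d = %d\n", temp, value);
    }
    return temp;
}

/* term -> factor ('*' factor)* */
static int term(void)
{
    int temp = factor();

    while (*in == '*') {
        int temp2;
        ++in;
        temp2 = factor();
        emit("t%d *= t%d\n", temp, temp2);
        --next_temp;                        /* plays the role of freename(temp2) */
    }
    return temp;
}

/* expression -> term ('+' term)* */
static int expression(void)
{
    int temp = term();

    while (*in == '+') {
        int temp2;
        ++in;
        temp2 = term();
        emit("t%d += t%d\n", temp, temp2);
        --next_temp;
    }
    return temp;
}

/* Compile one expression and return the generated code. */
const char *compile(const char *source)
{
    in        = source;
    out[0]    = '\0';
    next_temp = 0;
    expression();
    return out;
}
```

Calling compile("1+2*3+4") produces the same seven instructions shown at the start of this section. Converting the sketch to the argument-passing method is a matter of allocating the temporary in the caller and handing its number down to term() and factor().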
Listing 1.10. args.c: Code Generation Using Subroutine Arguments

#include <stdio.h>
#include "lex.h"

void    factor     ( char *tempvar );
void    term       ( char *tempvar );
void    expression ( char *tempvar );

extern char *newname ( void       );
extern void  freename( char *name );

void statements( void )
{
    /* statements -> expression SEMI | expression SEMI statements */

    char *tempvar;

    while( !match(EOI) )
    {
        expression( tempvar = newname() );
        freename( tempvar );

        if( match( SEMI ) )
            advance();
        else    /* message text is illegible in the source; this wording is assumed */
            fprintf( stderr, "%d: Inserting missing semicolon\n", yylineno );
    }
}

void    expression( char *tempvar )
{
    /* expression  -> term expression'
     * expression' -> PLUS term expression' | epsilon
     */

    char *tempvar2;

    term( tempvar );
    while( match( PLUS ) )
    {
        advance();

        term( tempvar2 = newname() );

        printf("    %s += %s\n", tempvar, tempvar2 );
        freename( tempvar2 );
    }
}

void    term( char *tempvar )
{
    char *tempvar2;

    factor( tempvar );
    while( match( TIMES ) )
    {
        advance();

        factor( tempvar2 = newname() );

        printf("    %s *= %s\n", tempvar, tempvar2 );
        freename( tempvar2 );
    }
}

void    factor( char *tempvar )
{
    if( match(NUM_OR_ID) )
    {
        /* yytext is not \0 terminated, so print at most yyleng characters. */

        printf("    %s = %.*s\n", tempvar, yyleng, yytext );
        advance();
    }
    else if( match(LP) )
    {
        advance();
        expression( tempvar );
        if( match(RP) )
            advance();
        else    /* message text is missing from the source; this wording is assumed */
            fprintf( stderr, "%d: Mismatched parenthesis\n", yylineno );
    }
    else
        fprintf( stderr, "%d: Number or identifier expected\n", yylineno );
}
1.4 Exercises
1.1.  Write a grammar that recognizes a C variable declaration made up of the following
      keywords:
          int char long float double signed unsigned short
          const volatile
      and a variable name.
1.2.  Write a grammar that recognizes a C variable declaration made up only of legal
      combinations of the following keywords:
          int char long float double signed unsigned short
          const volatile
      and a variable name. The grammar should be able to accept all such legal
      declarations. For example, all the following should be accepted:
          volatile unsigned long int x;
          unsigned long volatile int x;
          long unsigned volatile int x;
          long volatile unsigned int x;
      but something like
          unsigned short long x;
      should not be accepted. Remember that the int keyword is optional in a
      declaration.
1.3.  Modify your solution to the previous exercise so that declarations for arrays,
      pointers, pointers to arrays, arrays of pointers, arrays of pointers to arrays, and so
      on, are also recognized. That is, all legal combinations of stars, brackets,
      parentheses, and names should be recognized.
1.4.  Write a grammar (and a recursive-descent compiler for that grammar) that
      translates an English description of a C variable into a C-style variable
      declaration. For example, the input:
          x is a pointer to an array of 10 pointers to functions that return int.
          y is an array of 10 floats.
          z is a pointer to a struct of type a_struct.
      should be translated to:
          int (*(*x)[10])();
          float y[10];
          struct a_struct *z;
1.5.  Modify either of the expression compilers (in Figures 1.11 or 1.10) so that the C
      ++ and -- operators are supported.
1.6.  LISP uses a prefix notation for arithmetic expressions. For example, 1+2 is
      represented as (+ 1 2), and 1+2*3 is (+ 1 (* 2 3)). Modify the expression
      compiler so that it translates infix expressions to prefix.
1.7.  Write a LISP-to-infix translator.
1.8.  Modify the expression compiler so that it translates expressions into postfix
      notation, such as that used by a Hewlett-Packard calculator. For example, the
      expression (1+2)*(3+4) should be translated to:
          1 2 + 3 4 + *
1.9.  Modify the expression compiler so that it prints the parse tree created by its input.
      I suggest creating a physical parse tree (with structures and so forth) and then
      printing the tree by traversing the physical parse tree.
1.10. (This is a very difficult problem.)
      a. Try to write a context-free grammar that correctly parses both "time flies like
         an arrow" and "fruit flies like a banana."
      b. One of the things that defines context-free grammars is that the left-hand side
         always consists of a single nonterminal symbol. How would the foregoing work
         if you were permitted to use more than one terminal or nonterminal symbol on a
         left-hand side? Try to write a parser for this sort of grammar.
2 Input and Lexical Analysis
This chapter looks at input strategies and at lexical analysis. I'll discuss a set of
buffered input routines and construct LEX, a program modeled after the UNIX lex utility,
that translates regular expressions into a lexical analyzer. It's worth understanding how
LEX works, even if you're not going to build a version of your own. Various theoretical
concepts such as finite automata and closure will crop up again when I discuss how
programs like occs and yacc generate bottom-up parse tables, and you need to be able to
understand how these tables are created in order to use these programs effectively. The
concepts are easier to understand in the context of LEX, however, so the theoretical
material in this chapter is actually an introduction to the concepts that you'll need later.
I'm using a bootstrap approach in that LEX itself uses a hard-coded lexical analyzer and a
recursive-descent parser to do its work. As such, it's a good example of a compiler built
by hand, without special tools.
The techniques used for lexical analysis are useful in many programming applications
other than compilers. Efficient I/O is a concern in virtually every computer program.
Similarly, lexical analyzers are pattern-recognition engines: the concepts discussed
here can be applied to many programs that need to recognize patterns, such as editors,
bibliographic data-base programs, and so forth. You can extend the techniques to do
things like assembly-line quality control and network-protocol processing.
If you intend to read the implementation parts of this chapter rather than just the
theory, you should read Appendix D, which contains a user's manual for LEX, before
proceeding. Also, many of the support routines used by LEX are presented in Appendix A
(the set routines are used heavily in the code that follows, and the hash functions are
used as well).
2.1 The Lexical Analyzer as Part of a Compiler*
* An asterisk appended to a section heading is used throughout this and subsequent chapters to indicate
  theoretical material. Implementation-oriented sections are not so marked.
The main purpose of a lexical analyzer in a compiler application is to translate the
input stream into a form that is more manageable by the parser. It translates input strings,
or lexemes, into tokens: arbitrary integer values that represent the lexemes. A token can
have a one-to-one relationship with a lexeme. For example, the keyword while is
associated with a single token. More generic tokens such as identifiers or numbers have
several lexemes associated with them. Lexical analyzers often have auxiliary functions
as well. For example, a lexical analyzer can discard comments and skip over white
space. Isolating this housekeeping from the parser can simplify the parser design (and
the grammar of the language). The analyzer can keep track of the current line number so
that intelligent error messages can be output by the parser. Program listings that show
the source code intermixed with error messages are usually created by the lexical
analyzer.
The lexical analyzer is an independent compilation phase that communicates with
the parser over a well-defined and simple interface. The relationship is pictured in
Figure 2.1. The parser calls a single lexical-analyzer subroutine every time it needs a new
token, and that subroutine returns the token and associated lexeme.
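The interface just described can be sketched as a single function that the parser calls repeatedly. The example below is an illustration only: the token values, the lexer structure, and the names next_token() and count_tokens() are all assumptions, and the analyzer is hand-coded rather than generated by LEX.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token values; a real compiler would share these with the parser. */
enum token { TOK_EOI, TOK_NUMBER, TOK_IDENT, TOK_WHILE, TOK_OTHER };

struct lexer {
    const char *next;       /* next unread input character                      */
    const char *lexeme;     /* start of the current lexeme (not NUL-terminated) */
    int         length;     /* number of characters in the lexeme               */
};

/* The single entry point that the parser calls: returns the next token and
 * leaves the associated lexeme in lx->lexeme and lx->length.
 */
enum token next_token(struct lexer *lx)
{
    const char *p = lx->next;
    enum token  tok;

    while (isspace((unsigned char)*p))      /* white space never reaches the parser */
        ++p;
    lx->lexeme = p;

    if (*p == '\0') {
        tok = TOK_EOI;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            ++p;
        tok = TOK_NUMBER;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p))
            ++p;
        /* A keyword has exactly one lexeme; TOK_IDENT has many. */
        tok = (p - lx->lexeme == 5 && strncmp(lx->lexeme, "while", 5) == 0)
                  ? TOK_WHILE : TOK_IDENT;
    } else {
        ++p;                                /* any other single character */
        tok = TOK_OTHER;
    }

    lx->length = (int)(p - lx->lexeme);
    lx->next   = p;
    return tok;
}

/* Example driver: count how many tokens of one kind appear in source. */
int count_tokens(const char *source, enum token wanted)
{
    struct lexer lx;
    enum token   tok;
    int          count = 0;

    lx.next = source;
    while ((tok = next_token(&lx)) != TOK_EOI)
        if (tok == wanted)
            ++count;
    return count;
}
```

The driver stands in for the parser: it never looks at individual characters, only at tokens, which is the separation of concerns that the text argues for.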
Figure 2.1. Interaction Between the Lexical Analyzer and Parser
This organization has several things going for it. Since it's an independent phase, the
lexical analyzer is easy to maintain because changes to the analyzer do not affect the
compiler as a whole, provided that the interface is not changed. Moreover, much of the
code that makes up the lexical analyzer is the same for every compiler, regardless of the
input language, so you can recycle much of the lexical analyzer's code. The only things
that change from language to language in the table-driven lexical analyzers described
later in this chapter are the tables themselves. Other advantages include speed (an
independent lexical analyzer can optimize character-read times because it can read large
amounts of data at once) and portability (the peculiarities of reading the source code
under a particular operating system are all confined to the lexical analyzer itself). Notice
that the actual input system in Figure 2.1 is isolated completely from the parser, even
though it's closely linked to the lexical analyzer.
Sometimes a more complex interaction between lexical analyzer and parser is
required. For example, the typedef statement in C effectively creates new keywords in
the language. After the parser has processed the statement:
    typedef int alphonso;
the lexical analyzer must treat the input string alphonso as if it were a type token rather
than as an identifier token. This sort of high-level communication is usually done
through a shared data structure such as the symbol table. In this case, the parser can enter
alphonso into the symbol table, identifying it as a typedef name, and the lexical
analyzer can check the symbol table to determine if a string is a type or identifier token.
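A minimal sketch of this kind of shared-table communication follows. A small array of names stands in for the symbol table, and the token values and the function names enter_typedef() and classify_name() are hypothetical. The table stores pointers rather than copies, so the sketch also assumes that each name outlives the table.

```c
#include <string.h>

enum { TOK_IDENTIFIER = 1, TOK_TYPE = 2 };      /* hypothetical token values */

#define MAX_TYPEDEFS 32

static const char *typedef_names[MAX_TYPEDEFS]; /* stands in for the symbol table */
static int         typedef_count;

/* Parser side: called after a typedef declaration has been processed.
 * Returns 0 on success, -1 if the table is full.
 */
int enter_typedef(const char *name)
{
    if (typedef_count >= MAX_TYPEDEFS)
        return -1;
    typedef_names[typedef_count++] = name;
    return 0;
}

/* Lexical-analyzer side: classify a name by consulting the shared table. */
int classify_name(const char *name)
{
    int i;

    for (i = 0; i < typedef_count; ++i)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPE;
    return TOK_IDENTIFIER;
}
```

Before the parser sees the typedef, classify_name("alphonso") yields the identifier token; after enter_typedef("alphonso") it yields the type token. A real symbol table would also have to handle scope, which this sketch ignores.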
A lexical analyzer can also do work other than simple pattern recognition. For
example, when it reads a numeric constant, it can translate that constant into the
associated number [in a manner similar to atoi()] and return that number along with the
token and lexeme. When an identifier is read, the analyzer can look up the identifier in
the symbol table and return a pointer to a symbol-table entry along with the token.
These additional values associated with individual tokens are called attributes. (Note
that the lexeme is also an attribute of the token, because it's a quantum of information
that is associated with the token.) In general, it's best to restrict the lexical analyzer to
simple pattern-recognition tasks in order to make it easier to maintain. If the analyzer is
an independent module that performs only one task (pattern recognition), it's a simple
matter to replace it if necessary.
There's one final point to make. Lexical analysis is often complicated if a language is
not designed with ease-of-compilation in mind. For example,¹ PL/1 keywords are not
reserved (you can use them as identifiers) and the lexical analyzer has to determine
what it's looking at based on surrounding context. You can say something like this in
PL/1:
    if then then then = else; else else = then;
Separating the keyword then from the identifier then can be quite difficult.
2.2 Error Recovery in Lexical Analysis*
It's possible, of course, for errors to occur in the lexical-analysis as well as the parsing
phase. For example, the at sign (@) and backquote (`) are both illegal outside of a
string in a C program. The lexical analyzer can recover from these errors in several
ways, the simplest of which just discards the offending character and prints an appropriate
error message. Even here, there are some choices that are driven by the application,
however. If the last character of a multiple-character lexeme is incorrect, the analyzer
can discard the entire malformed lexeme or it can discard only the first character of the
lexeme and then try to rescan. Similarly, the lexical analyzer could try to correct the
error. Some operating systems have a "do what I mean" feature that works along these
lines. When faced with an error, the operating system's command-line interpreter (which
is a compiler) tries to determine what the user meant to type and proceeds accordingly.
If a word has only one misspelled letter, it's not too difficult to correct the problem by
simple inspection.²
1. This example (and the other PL/1 example, below) is borrowed from [Aho], p. 87 and p. 90.
2. There are other, more sophisticated techniques that can be used to determine the similarity of two words.
   The most common technique is the soundex algorithm developed by Margaret Odell and Robert Russell,
   and described in [Knuth], vol. 3, pp. 391-392. Also of interest is the Ratcliff/Obershelp algorithm,
   described in [Ratcliff], and Allen Bickel's algorithm, described in [Bickel] and implemented in C in
   [Howell].
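The simple-inspection test for a word with one misspelled letter might be sketched as follows. This is an assumption about what such a check could look like, not code from the book: it handles only a single substituted letter, not a dropped, added, or transposed one, and the name one_letter_off() is hypothetical.

```c
#include <string.h>

/* Return 1 if the two words have the same length and differ in exactly
 * one character position, 0 otherwise.
 */
int one_letter_off(const char *typed, const char *known)
{
    int differences = 0;

    if (strlen(typed) != strlen(known))
        return 0;

    for (; *typed != '\0'; ++typed, ++known)
        if (*typed != *known)
            ++differences;

    return differences == 1;
}
```

An error-correcting analyzer could compare an unrecognized word against each keyword and substitute the keyword if exactly one of them passes this test. The algorithms cited in the footnote are better suited to the general problem.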
2.3 Input Systems*
Since the input system is usually an independent module, and since the concerns here
are divorced from the mechanics of recognizing tokens, I'll look at input systems in
depth before moving on to the issues of lexical analysis per se.
The lowest-level layer of the lexical analyzer is the input system: the group of
functions that actually read data from the operating system. For the same reason that the
analyzer should be a distinct module, it's useful for the input system itself to be an
independent module that communicates with the analyzer via well-defined function
calls. Since the analyzer itself can be isolated from the input mechanics, the resulting
code is more portable. Most of the system-dependent operations of the analyzer are
concentrated into the input layer.
Issues of optimization aside, most compilers spend a good portion of their time in the
lexical analysis phase. Consequently, it's worthwhile to optimize lexical analysis for
speed. The standard C buffered input system is actually a poor choice for several
reasons. First, most buffered systems copy the input characters at least three times before
your program can use them: from the disk to a buffer maintained by the operating
system, from that buffer to a second buffer that's part of the FILE structure, and finally from
the FILE buffer to the string that holds the lexeme. All this copying takes both time and
buffer space. Moreover, the buffer size is not optimal. The more you can read from the
disk at one time, the faster your input routines tend to be (though this is operating-system
dependent: there's not much advantage under UNIX in reading more than one block at a
time; MS-DOS, however, performs much better with very large reads).
The other issue is lookahead and pushback. The lexical analyzer may have to look
ahead several characters in the input to distinguish one token from another, and then it
must push the extra characters back into the input. Consider the earlier PL/1 expression:
    if then then then = else; else else = then;
The lexical analyzer can distinguish the else keyword from the else identifier by looking
at the characters that follow the lexeme. The else must be an identifier if it is followed
by an equal sign, for example. Another example is a PL/1 declare statement
like this:
    declare ( arg1, arg2, ..., argN )
The lexical analyzer can't distinguish the declare keyword from an identifier (a
subroutine name in this case) until it has read past the rightmost parenthesis. It must read
several characters past the end of the lexeme, and then push the extra characters back
into the input stream once a decision has been made, and this lookahead and pushback
must be done as efficiently as possible.
A final, admittedly contrived, example demonstrates that pushback is necessary even
in recognizing individual tokens. If a language has the three tokens xxyy, xx, and y, and
the lexical analyzer is given the input xxy, it should return an xx token followed by a y
token. In order to distinguish, however, it must read at least four characters (to see if
xxyy is present) and then push the two y's back into the input stream.
Most programming languages are designed so that problems such as the foregoing
are not issues. If the tokens don't overlap, you can get by with only one character of
pushback. LEX, however, can't make assumptions about the structure of the lexemes, so
must assume the worst.
The pushback problem means that you can't use the normal buffered input functions
because ungetc() gives you only one character of pushback. You can add a layer
around getc() that gives you more pushback by using a stack, as is demonstrated in
Listing 2.1. Push back a character by pushing it onto the stack, and get the next input
character either from the stack (if it's not empty) or from the real input stream (if it is).
UNIX lex uses this method. A better method is described in the remainder of this section.
Listing 2.1. Using a Stack for Multiple-Character Pushback

#include <stdio.h>
#include <string.h>

#define SIZE 128        /* Maximum number of pushed-back characters */

/* Pbackbuf is the pushback stack.
 * Pbackp   is the stack pointer. The stack grows down, so a push is:
 *          *--Pbackp=c  and a pop is: c=*Pbackp++
 * get()    evaluates to the next input character, either popping it off the
 *          stack (if it's not empty) or by calling getc().
 * unget(c) pushes c back. It evaluates to c if successful, or to -1 if the
 *          pushback stack was full.
 */

int Pbackbuf[SIZE];
int *Pbackp = &Pbackbuf[SIZE];

#define get(stream) (Pbackp <  &Pbackbuf[SIZE] ? *Pbackp++ : getc(stream) )
#define unget(c)    (Pbackp <= Pbackbuf        ? -1        : (*--Pbackp=(c)) )

void ungets( char *start, int n )
{
    /* Push back the last n characters of the string by working backwards
     * through the string.
     */

    char *p = start + strlen(start);    /* Find the end of the string. */

    while( p > start && n > 0 )
    {
        --p;            /* Step outside the macro call: unget() may not evaluate */
        --n;            /* its argument when the stack is full.                  */

        if( unget( (unsigned char)*p ) == -1 )
            fprintf( stderr, "Pushback-stack overflow\n" );
    }
}
2.3.1 An Example Input System*
This section describes the input system used by a LEX-generated lexical analyzer
(though not by LEX itself). This input system has many of the qualities that are desirable
in a compiler's input system, and as such can be taken as characteristic. There are, of
course, other solutions to the input problem,[3] but the current one has proven quite
workable and provides a good example of the sorts of problems that come up. Several design
criteria must be met:
  • The routines should be as fast as possible, with little or no copying of the input
    strings.
  • Several characters of pushback and lookahead must be available.
  • Lexemes of a reasonable length must be supported.
  • Both the current and previous lexeme must be available.
  • Disk access should be efficient.
[3] A system that's more appropriate in a Pascal environment is described in [Aho] pp. 88-92.
To meet the last criterion, consider how a disk is accessed by most operating systems.
All disks are organized into sectors which must be read as a unit. Disk reads must be
performed in sector-sized chunks. If your disk has 512-byte sectors, then you must read
512 bytes at a time. This limitation is imposed by the hardware. If you request one byte
from the disk, the operating system will read an entire sector into a buffer, and then
return a single byte from the buffer. Subsequent reads get characters from the buffer
until it is exhausted, and only then is a new sector read. Some operating systems
(MS-DOS is a case in point) impose further constraints in that a group of several sectors
(called a cluster or block) is the smallest possible unit that can be accessed directly, so
you have to read an entire cluster at once. The minimum number of bytes that can be
read is called an allocation unit, and, for the sake of efficiency, all reads must be done in
terms of allocation units. That is, the number of bytes read from the disk at any one time
should be a multiple of the allocation unit. Typically, the larger the buffer, the shorter
the read time. Many operating systems reward you for doing block-sized transfers by
not buffering the input themselves, as would be the case when an odd-sized block was
requested from a low-level read call. The operating system transfers the data directly
from the disk into your own buffer, thereby eliminating one level of copying and
decreasing the read time. For example, MS-DOS read and write times improve
dramatically when you read 32K bytes at a time.
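As a rough illustration of block-sized transfers, the sketch below reads a file 32K bytes at a time with the low-level read() call. It assumes a POSIX-style open()/read()/close() interface and assumes that 32K is a multiple of the allocation unit; CHUNK and count_bytes() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (32 * 1024)   /* assumed to be a multiple of the allocation unit */

/* Count the bytes in a file by reading it CHUNK bytes at a time directly
 * into our own buffer. Returns -1 if the file can't be opened or read.
 */
long count_bytes(const char *path)
{
    static unsigned char buf[CHUNK];
    long    total = 0;
    ssize_t got;
    int     fd;

    assert(path != NULL);
    if ((fd = open(path, O_RDONLY)) == -1) {
        perror(path);
        return -1;
    }

    while ((got = read(fd, buf, sizeof buf)) > 0)   /* one block per call */
        total += got;

    close(fd);
    return got < 0 ? -1 : total;
}
```

Whether a given chunk size actually bypasses the operating system's own buffering depends on the system; the point here is only that every request is the same block-sized multiple.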
The other design criteria are met by using a single input buffer and several pointers.
My system is pictured in Figure 2.2. The drawing on the top shows the condition of the
buffer just after it is loaded the first time. BUFSIZE is the actual buffer size. MAXLEX is
the maximum lexeme length, and the disk reads are always in multiples of this number.
Start_buf marks the physical start of the buffer, and END marks the physical end of the
buffer. End_buf points at the logical end of buffer. (Since reads are in multiples of
MAXLEX, and since the buffer itself isn't an even multiple of MAXLEX in length, there is
usually a scrap of wasted space at the end of the buffer. End_buf points just past the
last valid character in the buffer.) Finally, Next points at the next input character. (I'll
discuss DANGER and MAXLOOK momentarily.)
The middle picture in Figure 2.2 shows the buffer in its normal state, after the lexical
analyzer has processed several tokens. Various pointers have been set to mark the
boundaries of various lexemes: pMark points at the beginning of the previous lexeme,
sMark points at the beginning of the current lexeme, and eMark points at the end of the
current lexeme. The lexical analyzer has scanned several characters past the end of the
current lexeme (Next is to the right of eMark). If the lexical analyzer finds a longer
lexeme than the current one, all it need do is move the eMark to the current input position.
If, on the other hand, it finds that it has read too far in the input, it can push back all the
extra characters by setting Next back to the eMark.
Returning to MAXLOOK, this constant is the number of lookahead characters that are
supported. The DANGER marker tells the input routines when the Next pointer is getting
too close to the end of the buffer (there must be at least MAXLOOK characters to the right
of Next). When Next crosses the DANGER point, a buffer flush is triggered, giving us the
situation shown in the bottom picture in Figure 2.2. All characters between the pMark
and the last valid character (pointed to by End_buf) have been shifted to the far left of
the buffer. The input routines fill the remainder of the buffer from the disk, reading in as
many MAXLEX-sized chunks as will fit. The End_buf pointer is adjusted to mark the
new end of the buffer, and DANGER scales automatically: it's positioned relative to the
new end of buffer. This may seem like a lot of copying, but in practice the lexemes are
not that large, especially in comparison to the buffer size. Consequently, flushes don't
happen very often, and only a few characters are copied when they do happen.
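Figure 2.2. The Input Buffer
[Figure not reproduced: three views of the input buffer (just after the first load, in
normal operation, and after a flush). Labels: Start_buf, DANGER, End_buf, END, BUFSIZE,
"After flush", n x MAXLEX.]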
This approach has many advantages, the main one being a lack of copying. The
lexical analyzer can just return the sMark as the pointer to the next lexeme, without having
to copy it anywhere. Similarly, a pushback is a single assignment (Next = eMark) rather
than a series of pushes and pops. Finally, the disk reads themselves are reasonably
efficient because they're done in block-sized chunks.
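The marker scheme can be sketched with a toy, fixed-size buffer. This is not the input system developed in this chapter: the toy_input structure and the toy_ function names are assumptions made for illustration, there is no disk input, and no flush logic.

```c
#include <assert.h>
#include <string.h>

/* A toy version of the buffer-and-pointers scheme. The field names echo
 * the markers in the text, but the struct itself is only an illustration.
 */
typedef struct {
    char  buf[64];
    char *next;    /* next input character          */
    char *smark;   /* start of the current lexeme   */
    char *emark;   /* end of the current lexeme     */
    char *end;     /* just past the last valid char */
} toy_input;

void toy_init(toy_input *in, const char *text)
{
    size_t len = strlen(text);

    assert(len < sizeof in->buf);
    memcpy(in->buf, text, len + 1);
    in->next = in->smark = in->emark = in->buf;
    in->end  = in->buf + len;
}

/* Return the next character and advance past it; 0 at end of input. */
int  toy_advance(toy_input *in)  { return in->next < in->end ? *in->next++ : 0; }

/* Remember the current position as the end of the lexeme. */
void toy_mark_end(toy_input *in) { in->emark = in->next; }

/* Pushback: a single assignment, however many characters were overshot. */
void toy_to_mark(toy_input *in)  { in->next = in->emark; }
```

The lexeme is never copied: it is just the emark - smark characters starting at smark, and overshooting costs nothing beyond the assignment in toy_to_mark().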
2.3.2 An Example Input System: Implementation
The foregoing is all implemented by the variables and macros declared at the top of
input.c, in Listing 2.2. At this point we've looked at most of them; the others are
discussed as they're used in the code. The macro definitions on lines 15 to 19 take care of a
few system dependencies: COPY() is mapped to memmove() for the Microsoft C
compiler [because Microsoft's memcpy() doesn't support overlapping strings].
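The overlap issue is easy to see in isolation. In standard C, memcpy() has undefined behavior when the source and destination regions overlap, while memmove() is defined for that case. The helper below, shift_left(), is a hypothetical name used only for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Slide the n bytes that start at buf + from down to the start of buf and
 * return n. Source and destination may overlap, so memmove() is used.
 */
size_t shift_left(char *buf, size_t from, size_t n)
{
    assert(buf != NULL);
    memmove(buf, buf + from, n);
    return n;
}
```

Moving the tail of a buffer to its own front is exactly the overlapping case, which is why a copy routine that tolerates overlap is needed wherever the two regions can share bytes.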
Listing 2.2. input.c Macros and Data Structures
 1  #include <stdio.h>
 2  #include <stdlib.h>
 3  #include <fcntl.h>
 4  #include <tools/debug.h>
 5
 6  #include <tools/l.h>        /* Needed only for prototypes */
 7  #include <string.h>         /*    "     "   "      "      */
 8
 9  /*------------------------------------------------------------------------
10   * INPUT.C: The input system used by LeX lexical analyzers.
11   *------------------------------------------------------------------------
12   * System-dependent defines.
13   */
14
15  #ifdef MSDOS
16  #     define COPY(d,s,a) memmove(d,s,a)
17  #else
18  #     define COPY(d,s,a) memcpy(d,s,a)
19  #endif
20
21  #define STDIN   0                   /* standard input */
22
23  /*----------------------------------------------------------------------*/
24
25  #define MAXLOOK 16                  /* Maximum amount of lookahead */
26  #define MAXLEX  1024                /* Maximum lexeme sizes.       */
27
28  #define BUFSIZE ( (MAXLEX * 3) + (2 * MAXLOOK) )    /* Change the 3 only */
29
30  #define DANGER  ( End_buf - MAXLOOK )   /* Flush buffer when Next     */
31                                          /* passes this address        */
32
33  #define END     (&Start_buf[BUFSIZE])   /* Just past last char in buf */
34
35  #define NO_MORE_CHARS()  ( Eof_read && Next >= End_buf )
36
37  typedef unsigned char   uchar;
38
39  PRIVATE uchar   Start_buf[BUFSIZE];     /* Input buffer                  */
40  PRIVATE uchar   *End_buf  = END;        /* Just past last character      */
41  PRIVATE uchar   *Next     = END;        /* Next input character          */
42  PRIVATE uchar   *sMark    = END;        /* Start of current lexeme       */
43  PRIVATE uchar   *eMark    = END;        /* End of current lexeme         */
44  PRIVATE uchar   *pMark    = NULL;       /* Start of previous lexeme      */
45  PRIVATE int     pLineno   = 0;          /* Line # of previous lexeme     */
46  PRIVATE int     pLength   = 0;          /* Length of previous lexeme     */
47
48  PRIVATE int     Inp_file  = STDIN;      /* Input file handle             */
49  PRIVATE int     Lineno    = 1;          /* Current line number           */
50  PRIVATE int     Mline     = 1;          /* Line # when mark_end() called */
51  PRIVATE int     Termchar  = 0;          /* Holds the character that was  */
52                                          /* overwritten by a \0 when we   */
53                                          /* null terminated the last      */
54                                          /* lexeme.                       */
55  PRIVATE int     Eof_read  = 0;          /* End of file has been read.    */
56                                          /* It's possible for this to be  */
57                                          /* true and for characters to    */
58                                          /* still be in the input buffer. */
59
60  extern int  open(), close(), read();
61
62  PRIVATE int     (*Openp)()  = open;     /* Pointer to open function      */
63  PRIVATE int     (*Closep)() = close;    /* Pointer to close function     */
64  PRIVATE int     (*Readp)()  = read;     /* Pointer to read function      */
The actual code starts in Listing 2.3 with two initialization functions. The first,
ii_io(), on line 65 of Listing 2.3, is used to change the low-level input functions that
are used to open files and fill the buffer. You may want to do this if you're getting input
directly from the hardware, from a string, or doing something else that circumvents the
normal input mechanism. This way you can use a LEX-generated lexical analyzer in an
unusual situation without having to rewrite the input system.
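The idea of replaceable low-level routines can be sketched on its own. The code below is not input.c: set_io() merely plays the role that ii_io() plays in the text, the string-backed str_open(), str_close(), and str_read() are hypothetical replacements, and ANSI prototypes are assumed in place of the K&R declarations used in the listings.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical types for the three replaceable low-level routines. */
typedef int (*open_fn)(const char *name, int mode);
typedef int (*close_fn)(int fd);
typedef int (*read_fn)(int fd, void *buf, unsigned n);

static open_fn  Openp;
static close_fn Closep;
static read_fn  Readp;

void set_io(open_fn o, close_fn c, read_fn r)   /* stands in for ii_io() */
{
    Openp = o;  Closep = c;  Readp = r;
}

/* String-backed "file": open remembers the text, read copies from it. */
static const char *Src;
static size_t      Src_pos;

int str_open(const char *text, int mode)
{
    (void)mode;
    Src     = text;
    Src_pos = 0;
    return 3;               /* any descriptor other than 0 (standard input) */
}

int str_close(int fd) { (void)fd; Src = NULL; return 0; }

int str_read(int fd, void *buf, unsigned n)
{
    size_t left = strlen(Src) - Src_pos;

    (void)fd;
    if (n > left)
        n = (unsigned)left;
    memcpy(buf, Src + Src_pos, n);
    Src_pos += n;
    return (int)n;          /* 0 signals end of input, as read() does */
}

/* Read everything through the installed pointers; returns bytes read. */
int slurp(const char *name, char *out, unsigned cap)
{
    int fd = (*Openp)(name, 0), got, total = 0;

    if (fd == -1)
        return -1;
    while (cap > 0 && (got = (*Readp)(fd, out + total, cap)) > 0) {
        total += got;
        cap   -= (unsigned)got;
    }
    (*Closep)(fd);
    return total;
}
```

Because slurp() only ever calls through the three pointers, the same reading code works unchanged whether the pointers refer to real file routines or to the string-backed ones.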
The ii_newfile() routine on line 81 of Listing 2.3 is the normal mechanism for
opening a new input file. It is passed the file name and returns the file descriptor (not the
FILE pointer) for the opened file, or -1 if the file couldn't be opened. The previous
input file is closed unless it was standard input. ii_newfile() does not actually read
the first buffer; rather, it sets up the various pointers so that the buffer is loaded the first
time a character is requested. This way, programs that never call ii_newfile() will
work successfully, getting input from standard input. The problem with this approach is
that you must read at least one character before you can look ahead in the input
(otherwise the buffer won't be initialized). If you need to look ahead before advancing, use:

    ii_advance();       /* Read first bufferfull of input    */
    ii_pushback(1);     /* but put back the first character  */

The default input stream [used if ii_newfile() is never called] is standard input.
You can reassign the input to standard input (say, after you get input from a file) by
calling:

    ii_newfile(NULL);

It's also okay to do a ii_newfile("/dev/tty") (in both MS-DOS and UNIX), but
input is actually taken from the physical console in this case. Redirection won't work.
An ii_newfile(NULL) allows for redirected input, however.
Note that the indirect open() call on line 103 of Listing 2.3 uses the O_BINARY
input mode in MS-DOS systems (it's mapped to zero in UNIX systems). A CR-LF
(carriage-return, linefeed) pair is not translated into a single '\n' when binary-mode
input is active. This behavior is desirable in most LEX applications, which treat both CR
and LF as white space. There's no point wasting time doing the translation. The lack of
translation might cause problems if you're looking for an explicit '\n' in the input,
though.
Note that the input buffer is not read by ii_newfile(); rather, the various pointers
are initialized to point at the end of the buffer on lines 111 to 114 of Listing 2.3. The
actual input routine (ii_advance(), discussed below) treats this situation the same as it
would the Next pointer crossing the DANGER point. It shifts the buffer's tail all the way
to the left (in this case the tail is empty so no characters are shifted, but the pointers are
moved), and then loads the buffer from the disk. I've taken this approach because it's
sometimes convenient to open a default input file at the top of a program, which is then
overridden by a command-line switch or the equivalent later on in the same program.
There's no point in reading from a file that's not going to be used, so the initial read is
delayed until a character is requested.
Listing 2.3. input.c Initialization Routines
 65  void ii_io( open_funct, close_funct, read_funct )
 66  int (*open_funct)();
 67  int (*close_funct)();
 68  int (*read_funct)();
 69  {
 70      /* This function lets you modify the open(), close(), and read() functions
 71       * used by the i/o system. Your own routines must work like the real open,
 72       * close, and read (at least in terms of the external interface). Open should
 73       * return a number that can't be confused with standard input (not 0).
 74       */
 75
 76      Openp  = open_funct;
 77      Closep = close_funct;
 78      Readp  = read_funct;
 79  }
 80  /*----------------------------------------------------------------------*/
 81  int ii_newfile( name )
 82  char *name;
 83  {
 84      /* Prepare a new input file for reading. If newfile() isn't called before
 85       * input() or input_line() then stdin is used. The current input file is
 86       * closed after successfully opening the new one (but stdin isn't closed).
 87       *
 88       * Return -1 if the file can't be opened; otherwise, return the file
 89       * descriptor returned from open(). Note that the old input file won't be
 90       * closed unless the new file is opened successfully. The error code (errno)
 91       * generated by the bad open() will still be valid, so you can call perror()
 92       * to find out what went wrong if you like. At least one free file
 93       * descriptor must be available when newfile() is called. Note in the open
 94       * call that O_BINARY, which is needed in MS-DOS applications, is mapped
 95       * to 0 under UNIX (with a define in <tools/debug.h>).
 96       */
 97
 98      int fd;         /* File descriptor */
 99
100      MS( if( strcmp(name, "/dev/tty") == 0 )     )
101      MS(     name = "CON" ;                      )
102
103      if( (fd = !name ? STDIN : (*Openp)(name, O_RDONLY|O_BINARY)) != -1 )
104      {
105          if( Inp_file != STDIN )
106              (*Closep)( Inp_file );
107
108          Inp_file = fd;
109          Eof_read = 0;
110
111          Next     = END;
112          sMark    = END;
113          eMark    = END;
114          End_buf  = END;
115          Lineno   = 1;
116          Mline    = 1;
117      }
118      return fd;
119  }
The input.c file continues in Listing 2.4 with several small access functions. For
maintenance reasons, it is desirable to limit external access of global variables, because
the linker assumes that two global variables with the same name are the same variable.
If you inadvertently declare two variables with the same name, one of them will seem to
magically change its value when a subroutine that accesses the other is called. You can
avoid this problem by declaring the variables static, thereby limiting their scope to
the current file. PRIVATE is mapped to static in debug.h, discussed in Appendix A.
It's still necessary for external subroutines to access these variables, however, and the
safest way to do so is through the small routines in Listing 2.4. These subroutines are
used for maintenance reasons only: two subroutines with the same name will result in
an error message from the linker, unlike two variables with the same name, which are
silently merged.
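The pattern is small enough to show in a few lines. The sketch below uses static directly rather than the PRIVATE macro, and the function names line_number() and count_newline() are hypothetical, chosen only for this example.

```c
#include <assert.h>

/* File-scope variable: static keeps the name out of the linker's view, so
 * another module's Lineno cannot be merged with this one.
 */
static int Lineno = 1;

/* Small access routines: the only way code in other files can reach Lineno.
 * A second definition of either function elsewhere would be a link error.
 */
int  line_number(void)    { return Lineno; }
void count_newline(int c) { if (c == '\n') ++Lineno; }
```

The cost is one extra function call per access, which is the trade the text accepts in exchange for a linker diagnostic on name clashes.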
The ii_text(), ii_length(), and ii_lineno() routines (lines 120 to 122 of
Listing 2.4) return a pointer to the current lexeme, the lexeme's length, and the line
number for the last character in the lexeme. The ii_ptext(), ii_plength(), and
ii_plineno() routines (lines 123 to 125) do the same thing, but for the previous
lexeme. The ii_mark_start() routine (line 127) moves the sMark to the current input
position (pointed to by Next). It also makes sure that the end-of-lexeme marker
(eMark) is not to the left of the start marker. ii_mark_end() (line 134) does the
same for the end marker (eMark). It also saves the current line number in Mline,
because the lexical analyzer might sweep past a newline when it scans forward looking
for a new lexeme. The input line number must be restored to the condition it was in
before the extra newline was scanned when the analyzer returns to the previous end
marker.
The ii_move_start() routine on line 140 of Listing 2.4 lets you move the start
marker one space to the right. It returns the new start marker on success, NULL if you
tried to move past the end marker (sMark is not modified in this last case).
ii_to_mark() (line 148) restores the input pointer to the last end mark. Finally,
ii_mark_prev() (line 154) modifies the previous-lexeme marker to reference the same
lexeme as the current-lexeme marker. Typically, ii_mark_prev() is called by the lexical
analyzer just before calling ii_mark_start() (that is, just before it begins to search
for the next lexeme).
The next group of subroutines, in Listings 2.5 and 2.6, comprise the advance and
buffer-flush functions. ii_advance(), on line 168 of Listing 2.5, returns the next
input character and advances past it. The code on lines 180 to 191 is provided for those
situations where you want an extra newline inserted at the beginning of a file. LEX
needs this capability for processing the start-of-line anchor, a mechanism for
recognizing strings only if they appear at the far left of a line. Such strings must be preceded
by a newline, so an extra newline has to be inserted in front of the first line of the file;
otherwise, the anchored expression wouldn't be recognized on the first line.[4]
[4] LEX Usage Note: This pushback could conceivably cause problems if there is no regular
expression in the LEX input file to absorb the newline, and YYBADINP is also #defined
(you'll get an error message in this case). A regular expression that absorbs white space is
usually present, however.
Listing 2.4. input.c Small Access Routines and Marker Movement
120  PUBLIC char *ii_text    () { return( sMark         ); }
121  PUBLIC int   ii_length  () { return( eMark - sMark ); }
122  PUBLIC int   ii_lineno  () { return( Lineno        ); }
123  PUBLIC char *ii_ptext   () { return( pMark         ); }
124  PUBLIC int   ii_plength () { return( pLength       ); }
125  PUBLIC int   ii_plineno () { return( pLineno       ); }
126
127  char *ii_mark_start()
128  {
129      Mline = Lineno;
130      eMark = sMark = Next;
131      return( sMark );
132  }
133
134  PUBLIC char *ii_mark_end()
135  {
136      Mline = Lineno;
137      return( eMark = Next );
138  }
139
140  PUBLIC char *ii_move_start()
141  {
142      if( sMark >= eMark )
143          return NULL;
144      else
145          return ++sMark;
146  }
147
148  PUBLIC char *ii_to_mark()
149  {
150      Lineno = Mline;
151      return( Next = eMark );
152  }
153
154  char *ii_mark_prev()
155  {
156      /* Set the pMark. Be careful with this routine. A buffer flush won't go past
157       * pMark so, once you've set it, you must move it every time you move sMark.
158       * I'm not doing this automatically because I might want to remember the
159       * token before last rather than the last one. If ii_mark_prev() is never
160       * called, pMark is just ignored and you don't have to worry about it.
161       */
162
163      pMark   = sMark;
164      pLineno = Lineno;
165      pLength = eMark - sMark;
166      return( pMark );
167  }
The NO_MORE_CHARS() macro is used on line 193 to detect end of file. It was
defined in the header as follows:

    #define NO_MORE_CHARS()  ( Eof_read && Next >= End_buf )

Eof_read is set to true when end of file is encountered. You must use both Eof_read
and Next to detect end of input because EOF might have been read while the lexical
analyzer was looking ahead. In this case, characters may have been pushed back after
reading the EOF. You have to see both if end of file has been encountered and if the
buffer is empty. This is a case where end of input and end of file are different things,
because there still may be characters in the input buffer long after end of file has been
read. The ii_flush() call on line 196 flushes the buffer if necessary, and the line
number is advanced on line 199. The next input character is returned normally, 0 is
returned on end of file, and -1 is returned if the buffer couldn't be flushed for some
reason.
Listing 2.5. input.c The Advance Function
168  int ii_advance()
169  {
170      /* ii_advance() is the real input function. It returns the next character
171       * from input and advances past it. The buffer is flushed if the current
172       * character is within MAXLOOK characters of the end of the buffer. 0 is
173       * returned at end of file. -1 is returned if the buffer can't be flushed
174       * because it's too full. In this case you can call ii_flush(1) to do a
175       * buffer flush but you'll lose the current lexeme as a consequence.
176       */
177
178      static int been_called = 0;
179
180      if( !been_called )
181      {
182          /* Push a newline into the empty buffer so that the LeX start-of-line
183           * anchor will work on the first input line.
184           */
185
186          Next = sMark = eMark = END - 1;
187          *Next = '\n';
188          --Lineno;
189          --Mline;
190          been_called = 1;
191      }
192
193      if( NO_MORE_CHARS() )
194          return 0;
195
196      if( !Eof_read && ii_flush(0) < 0 )
197          return -1;
198
199      if( *Next == '\n' )
200          Lineno++;
201
202      return( *Next++ );
203  }
The actual buffer flush is done by ii_flush(), which starts at the top of Listing
2.6. The test on line 248 checks to see that there will be enough room after the move to
load a new MAXLEX-sized bufferfull of characters; there might not be if the buffer
contains two abnormally long lexemes. The test evaluates true if there isn't enough room.
Normally, the routine returns -1 if there's no room, and 1 is returned if everything is
okay. If the force argument is true, however, the buffer is flushed even if there's no
room, and 1 is returned. The flush is forced by setting the start marker to the current
input position and the left_edge of the characters to be shifted to the Next pointer,
effectively destroying the current lexeme. The code on lines 259 and 246 figures out
how many characters have to be copied (copy_amt) and the distance that they have to
be moved (shift_amt). The shift is done on line 260, and a new buffer is loaded by the
ii_fillbuf() call on line 262. COPY was defined earlier (on line 16 of Listing 2.2) to
map to either memmove() or memcpy(), depending on the compilation environment.
The rest of the routine adjusts the various markers to compensate for the move.
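The shift-and-adjust arithmetic can be tried on a toy buffer. This sketch is a simplification made under stated assumptions: the toy_buf structure and toy_flush() are hypothetical, there is no pMark, no refill from disk, and no room check, and the logical end of the buffer is moved along with the markers instead of being reset by a fill routine.

```c
#include <assert.h>
#include <string.h>

/* Toy buffer state; the field names echo the text but the struct itself
 * is an assumption made for this example.
 */
typedef struct {
    char  buf[32];
    char *smark, *emark, *next, *end_buf;
} toy_buf;

/* Shift everything from smark to end_buf down to the start of the buffer
 * and move the markers by the same distance. Returns the shift amount,
 * which is also the number of bytes freed at the right-hand end.
 */
size_t toy_flush(toy_buf *b)
{
    size_t shift_amt, copy_amt;

    assert(b->buf <= b->smark && b->smark <= b->end_buf);
    shift_amt = (size_t)(b->smark   - b->buf);
    copy_amt  = (size_t)(b->end_buf - b->smark);

    memmove(b->buf, b->smark, copy_amt);    /* regions may overlap */

    b->smark   -= shift_amt;
    b->emark   -= shift_amt;
    b->next    -= shift_amt;
    b->end_buf -= shift_amt;
    return shift_amt;
}
```

Every marker moves left by the same shift_amt, so the lexeme boundaries and the read position keep their positions relative to the text.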
Listing 2.6. input.c Buffer Flushing
204  int ii_flush( force )
205  int force;
206  {
207      /* Flush the input buffer. Do nothing if the current input character isn't
208       * in the danger zone, otherwise move all unread characters to the left end
209       * of the buffer and fill the remainder of the buffer. Note that input()
210       * flushes the buffer willy-nilly if you read past the end of buffer.
211       * Similarly, input_line() flushes the buffer at the beginning of each line.
212       *
213       *                            pMark       DANGER
214       *                              |            |
215       * Start_buf                    |sMark eMark |Next End_buf
216       *     |                        ||     |     ||    |
217       *     V                        VV     V     VV    V
218       *     +------------------------+------------------+-------+
219       *     |  this is already read  |  to be done yet  | waste |
220       *     +------------------------+------------------+-------+
221       *     |                        |                  |       |
222       *     |<------ shift_amt ----->|<--- copy_amt --->|       |
223       *     |                                                   |
224       *     |<-------------------- BUFSIZE -------------------->|
225       *
226       * Either the pMark or sMark (whichever is smaller) is used as the leftmost
227       * edge of the buffer. None of the text to the right of the mark will be
228       * lost. Return 1 if everything's ok, -1 if the buffer is so full that it
229       * can't be flushed. 0 if we're at end of file. If "force" is true, a buffer
230       * flush is forced and the characters already in it are discarded. Don't
231       * call this function on a buffer that's been terminated by ii_term().
232       */
233
234      int   copy_amt, shift_amt;
235      uchar *left_edge;
236
237      if( NO_MORE_CHARS() )
238          return 0;
239
240      if( Eof_read )                      /* nothing more to read */
241          return 1;
242
243      if( Next >= DANGER || force )
244      {
245          left_edge = pMark ? min(sMark, pMark) : sMark;
246          shift_amt = left_edge - Start_buf;
247
248          if( shift_amt < MAXLEX )        /* if(not enough room) */
249          {
250              if( !force )
251                  return -1;
252
253              left_edge = ii_mark_start();    /* Reset start to current character */
254              ii_mark_prev();
255
256              shift_amt = left_edge - Start_buf;
257          }
258
259          copy_amt = End_buf - left_edge;
260          COPY( Start_buf, left_edge, copy_amt );
261
262          if( !ii_fillbuf( Start_buf + copy_amt ) )
263              ferr("INTERNAL ERROR, ii_flush: Buffer full, can't read.\n");
264
265          if( pMark )
266              pMark -= shift_amt;
267
268          sMark -= shift_amt;
269          eMark -= shift_amt;
270          Next  -= shift_amt;
271      }
272
273      return 1;
274  }
275
276  /*----------------------------------------------------------------------*/
277
278  PRIVATE int ii_fillbuf( starting_at )
279  unsigned char *starting_at;
280  {
281      /* Fill the input buffer from starting_at to the end of the buffer.
282       * The input file is not closed when EOF is reached. Buffers are read
283       * in units of MAXLEX characters; it's an error if that many characters
284       * cannot be read (0 is returned in this case). For example, if MAXLEX
285       * is 1024, then 1024 characters will be read at a time. The number of
286       * characters read is returned. Eof_read is true as soon as the last
287       * buffer is read.
288       *
289       * PORTABILITY NOTE: I'm assuming that the read function actually returns
290       * the number of characters loaded into the buffer, and
291       * that that number will be < need only when the last chunk of the file is
292       * read. It's possible for read() to always return fewer than the number of
293       * requested characters in MS-DOS untranslated-input mode, however (if the
294       * file is opened without the O_BINARY flag). That's not a problem here
295       * because the file is opened in binary mode, but it could cause problems
296       * if you change from binary to text mode at some point.
297       */
298
299      register unsigned need,     /* Number of bytes required from input. */
300                        got;      /* Number of bytes actually read.       */
301
302      need = ((END - starting_at) / MAXLEX) * MAXLEX ;
303
304      D( printf( "Reading %d bytes\n", need ); )
305
306      if( need < 0 )
307          ferr("INTERNAL ERROR (ii_fillbuf): Bad read-request starting addr.\n");
308
309      if( need == 0 )
310          return 0;
311
312      if( (got = (*Readp)(Inp_file, starting_at, need)) == -1 )
313          ferr("Can't read input file\n");
314
315      End_buf = starting_at + got;
316
317      if( got < need )
318          Eof_read = 1;               /* At end of file */
319
320      return got;
321  }
The final routine in Listing 2.6 is ii_fillbuf(), starting on line 278. It is passed
a base address, and loads as many MAXLEX-sized chunks into the buffer as will fit. The
need variable, initialized on line 302, is the amount needed. The logical-end-of-buffer
marker is adjusted on line 315. Note that a single read() call does the actual read on
line 312. (Readp is initialized to point at read() when it is declared up at the top of
the file.) This can cause problems when a lexeme can span a line, and input is fetched
from a line-buffered input device (such as the console). You'll have to use ii_io() to
supply an alternate read function in this case.
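The size of a read request is simple integer arithmetic: round the free space down to a multiple of MAXLEX. The sketch below works with the constants from Listing 2.2; bytes_needed() is a hypothetical helper that takes the number of bytes already in use at the front of the buffer rather than a pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLOOK 16
#define MAXLEX  1024
#define BUFSIZE ((MAXLEX * 3) + (2 * MAXLOOK))   /* 3104 with these values */

/* Number of bytes to request when `used` bytes at the front of the buffer
 * are already occupied: the largest multiple of MAXLEX that still fits.
 */
size_t bytes_needed(size_t used)
{
    assert(used <= BUFSIZE);
    return ((BUFSIZE - used) / MAXLEX) * MAXLEX;
}
```

With an empty buffer the request is 3072 bytes, leaving a 32-byte scrap at the end; once 33 or more bytes are retained, only two chunks (2048 bytes) fit, and a request of 0 means there is no room for even one chunk.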
Listing 2.7 shows the lookahead function, ii_look(). It returns the character at
the offset from the current character that's specified in its argument. An ii_look(0)
returns the character that was returned by the most recent ii_advance() call,
ii_look(1) is the following character, and ii_look(-1) is the character that precedes
the current one. MAXLOOK characters of lookahead are guaranteed, though fewer might
be available if you're close to end of file. Similarly, lookback (with a negative offset) is
only guaranteed as far as the start of the buffer (the pMark or sMark, whichever is
smaller). Zero is returned if you try to look past end or start of the buffer, EOF if you try
to look past end of file.
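The offset rule can be checked with a miniature version that uses array indexes instead of pointers. The function toy_look() and its parameters are assumptions for this sketch: next is the index one past the most recently read character, and eof_read plays the role of the Eof_read flag.

```c
#include <assert.h>
#include <stdio.h>      /* for EOF */

/* Return the character at offset n from the current character, where the
 * current character is the one just before index `next`. Yields EOF when
 * looking past the end after end of file has been read, and 0 when looking
 * past either end of the buffer otherwise.
 */
int toy_look(const char *buf, long len, long next, int eof_read, int n)
{
    long i = next + (n - 1);

    assert(buf != NULL && 0 <= next && next <= len);
    if (eof_read && i >= len)
        return EOF;
    return (i < 0 || i >= len) ? 0 : (unsigned char)buf[i];
}
```

For the buffer "abc" with one character read (next is 1), an offset of 0 yields 'a', an offset of 1 yields 'b', and an offset of -1 falls off the left edge and yields 0.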
Listing 2.7. input.c Lookahead
322  int ii_look( n )
323  {
324      /* Return the nth character of lookahead, EOF if you try to look past
325       * end of file, or 0 if you try to look past either end of the buffer.
326       */
327
328      uchar *p;
329
330      p = Next + (n-1);
331
332      if( Eof_read && p >= End_buf )
333          return EOF;
334
335      return( p < Start_buf || p >= End_buf ) ? 0 : *p ;
336  }
Listing 2.8 contains the pushback function, ii_pushback(n). It is passed the
number of characters to push back. For example, ii_pushback(5) pushes back the
five most recently read characters. If you try to push past the sMark, only the characters
as far as the sMark are pushed and 0 is returned (1 is returned on a successful push). If
you push past the eMark, the eMark is moved back to match the current character.
Unlike ungetc(), you can indeed push back characters after EOF has been reached.
The remainder of input.c, in Listing 2.9, provides support for '\0'-terminated
strings. These routines are not, strictly speaking, necessary, because the lexeme
Listing 2.8. input.c Pushback
337  int  ii_pushback( n )
338  {
339      /* Push n characters back into the input. You can't push past the current
340       * sMark. You can, however, push back characters after end of file has
341       * been encountered.
342       */
343
344      while( --n >= 0  &&  Next > sMark )
345      {
346          if( *--Next == '\n'  ||  !*Next )
347              --Lineno;
348      }
349
350      if( Next < eMark )
351      {
352          eMark = Next;
353          Mline = Lineno;
354      }
355
356      return( Next > sMark );
357  }
length is always available. It's occasionally useful to have a terminator on the string,
however. Note that these functions should be used exclusively after the string has been
terminated; the other input functions will not work properly in this case.
The termination is done with a call to ii_term() (on line 358). It saves the character
pointed to by Next in a variable called Termchar, and then overwrites the character
with a '\0'. The ii_unterm() function (on line 366) puts everything back.
Listing 2.9. input.c Support for '\0'-terminated Strings
358  void ii_term()
359  {
360      Termchar = *Next ;
361      *Next    = '\0'  ;
362  }
363
364  /* ------------------------------------------------------------------ */
365
366  void ii_unterm()
367  {
368      if( Termchar )
369      {
370          *Next    = Termchar;
371          Termchar = 0;
372      }
373  }
374
375  /* ------------------------------------------------------------------ */
376
377  int  ii_input()
378  {
379      int rval;
380
381
382      if( Termchar )
383      {
384          ii_unterm();
385          rval = ii_advance();
386          ii_mark_end();
387          ii_term();
388      }
389      else
390      {
391          rval = ii_advance();
392          ii_mark_end();
393      }
394
395      return rval;
396  }
397
398  /* ------------------------------------------------------------------ */
399
400  void ii_unput( c )
401  {
402      if( Termchar )
403      {
404          ii_unterm();
405          if( ii_pushback(1) )
406              *Next = c;
407          ii_term();
408      }
409      else
410      {
411          if( ii_pushback(1) )
412              *Next = c;
413      }
414  }
415
416  /* ------------------------------------------------------------------ */
417
418  int  ii_lookahead( n )
419  {
420      return (n == 1 && Termchar) ? Termchar : ii_look(n) ;
421  }
422
423  /* ------------------------------------------------------------------ */
424
425  int  ii_flushbuf()
426  {
427      if( Termchar )
428          ii_unterm();
429
430      return ii_flush(1);
431  }
This approach is better than putting the ii_unterm() code into ii_advance(),
because the latter approach slows down all ii_advance() calls. On the other hand,
you have to remember to call ii_unterm() before calling ii_advance(). For this
reason, an ii_input() function has been provided (on line 377) to make sure that the
lexeme is unterminated and then reterminated correctly. That is, ii_input() is a
well-behaved input function meant to be used directly by the user. The function also
moves the end marker, making the lexeme one character longer (moving the null terminator
if necessary), and it returns the new input character, or 0 at end of file. -1 is
returned if another character couldn't be read because the buffer was full.
ii_unput() (on line 400) is a reverse-input function. It backs up the input one
notch and then overwrites the character at that position with its argument.
ii_unput() works correctly on both terminated and unterminated buffers, unlike
ii_pushback(), which can't handle the terminator.
The ii_lookahead() function bears the same relation to ii_look() that
ii_input() bears to ii_advance(). That is, ii_lookahead(1) functions
correctly for strings that have been terminated with ii_term() calls; ii_look() does
not. Similarly, ii_flushbuf() flushes a terminated buffer by unterminating it before
calling ii_flush().
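The save-and-restore idea behind ii_term() and ii_unterm() can be tried in isolation. The following is only a minimal sketch, assuming a plain character array in place of the real input buffer; struct lexbuf, lex_term(), and lex_unterm() are hypothetical names that do not appear in input.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the input system's state: `next` plays the
 * role of Next and `termchar` the role of Termchar.
 */
struct lexbuf
{
    char *next;       /* first character past the current lexeme  */
    char  termchar;   /* character hidden by the '\0', or 0 if none */
};

/* Overwrite the character after the lexeme with '\0', remembering it. */
void lex_term( struct lexbuf *b )
{
    assert( b != NULL && b->next != NULL );
    b->termchar = *b->next;
    *b->next    = '\0';
}

/* Put the remembered character back, if one is being held. */
void lex_unterm( struct lexbuf *b )
{
    assert( b != NULL && b->next != NULL );
    if( b->termchar )
    {
        *b->next    = b->termchar;
        b->termchar = 0;
    }
}
```

With the buffer "while(x)" and next pointing at the '(', lex_term() leaves the array reading "while" as a C string, and lex_unterm() restores the '(' so that scanning can continue.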
One final note on strategy. The buffer-flush approach that I've used here allows me to
take advantage of C's pointer mechanism when scanning the input. This approach isn't
appropriate in a language like FORTRAN, where arrays must be referenced using an
index. Here, you're better off using a circular array or ring buffer. For example, the
input buffer would be declared with
    char input_buf[ SIZE ];
and the next character would be accessed with
    x = input_buf[ current_character % SIZE ];
You would load a new chunk from the disk into the far left of the array when
current_character was greater than or equal to SIZE, being careful not to overwrite
the current lexeme in the process.
The problem here is that a lexeme can span the buffer. That is, a situation may arise
where the first half of a lexeme is at the far right of input_buf and the other half is at the far
left. As long as you're accessing all the characters with an array index modulo the array
size, this is not a problem. C, however, wants its strings in contiguous memory so that it
can scan through them using a pointer. Moreover, the array index and modulus operation
needed to access every character is inherently less efficient than a simple pointer access;
more inefficient, even, than the moves that are part of a buffer flush. Consequently, a
ring buffer isn't particularly appropriate in a C implementation.
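The spanning problem is easy to see in a small experiment. This is a sketch only: the tiny SIZE and the helper names ring_get(), ring_put(), and ring_copy_lexeme() are assumptions made for illustration, not part of the input system described above.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 8                      /* assumed, deliberately tiny */

static char input_buf[ SIZE ];

/* Fetch the character at absolute input position `pos`. */
char ring_get( size_t pos )
{
    return input_buf[ pos % SIZE ];
}

/* Store a character at absolute input position `pos`. */
void ring_put( size_t pos, char c )
{
    input_buf[ pos % SIZE ] = c;
}

/* Copy the lexeme occupying positions [start, start+len) into `dst` so
 * that it is contiguous and '\0' terminated. `dst` must have room for
 * len+1 characters.
 */
void ring_copy_lexeme( size_t start, size_t len, char *dst )
{
    size_t i;

    assert( dst != NULL && len < SIZE );
    for( i = 0; i < len; ++i )
        dst[i] = ring_get( start + i );
    dst[len] = '\0';
}
```

Storing the five characters of "while" at positions 6 through 10 puts "wh" in the last two cells of the array and "ile" in the first three. Every access pays for a modulus, and the lexeme has to be copied out before it can be handed to code that expects an ordinary C string.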
2.4 Lexical Analysis*
Now that we've developed a set of input routines, we need to apply them in a
lexical-analysis application. There are two approaches to lexical analysis, both useful.
First, you can hard code the analyzer, recognizing lexemes with nested if/else statements,
switches, and so forth. If the lexemes aren't too long, one effective approach uses
a series of lookup tables to recognize tokens. (Lookup tables tend to be faster than
switches or if/else statements.) Listing 2.10 shows such a system for recognizing the
following tokens:
    >   >=   <   <=   ==   !=
The basic strategy is to vector from one table to another until a complete lexeme is
identified.
The hard-coded approach has its advantages: hard-coded lexical analyzers tend to
be very efficient. They are difficult to maintain, however. When you're
developing a new language, it's handy to be able to add new tokens to the language or
take some away without too much work. This problem is solved by programs like lex
Listing 2.10. Using Lookup Tables for Character Recognition
#define LESS_THAN               1
#define GREATER_THAN            2
#define EQUAL                   3
#define NOT                     4
#define LESS_THAN_OR_EQUAL      5
#define GREATER_THAN_OR_EQUAL   6
#define NOT_EQUAL               7
#define ASSIGN                  8

#define ERROR                  -1
#define CONTINUE                0

#define SIZE_OF_CHARACTER_SET   128

char first [ SIZE_OF_CHARACTER_SET ];
char second[ SIZE_OF_CHARACTER_SET ];

int c, s, f;

memset( first,  -1, SIZE_OF_CHARACTER_SET );   /* Initialize to error token.    */
memset( second, -1, SIZE_OF_CHARACTER_SET );   /* Note that there's an implicit */
                                               /* conversion of -1 to 255 here  */
                                               /* (signed int to unsigned char) */
first [ '>' ] = GREATER_THAN;
first [ '<' ] = LESS_THAN;
first [ '!' ] = NOT;
first [ '=' ] = ASSIGN;
second[ '=' ] = EQUAL;

c = getchar();

if( (f = first[c]) == ERROR )          /* discard bad character   */
    return ERROR;

c = getchar();                         /* fetch the character that follows */

if( (s = second[c]) == ERROR )         /* 1-character lexeme      */
{
    ungetc( c, stdin );
    return( f );
}
else                                   /* 2-character lexeme      */
{
    if( s == EQUAL )
        switch( f )
        {
        case ASSIGN:        return EQUAL ;
        case LESS_THAN:     return LESS_THAN_OR_EQUAL ;
        case GREATER_THAN:  return GREATER_THAN_OR_EQUAL ;
        case NOT:           return NOT_EQUAL ;
        }

    return ERROR;                      /* discard both characters */
}
and LEX, which translate a description of a token set into a table-driven lexical analyzer.
(Hereafter, when I say LEX, I'm actually referring to both programs.) LEX itself just
creates the tables; the remainder of the analyzer is the same for all LEX-generated source
code. The fact that a LEX-generated analyzer is typically slower than a hard-coded one
is, more often than not, a small price to pay for faster development time. Once the
language is stable, you can always go back and hard code a lexical analyzer if it's really
necessary to speed up the front end.
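The table-vectoring strategy of Listing 2.10 can also be exercised as a self-contained function. What follows is a sketch under stated assumptions rather than the listing itself: it scans a string instead of standard input, it uses signed char tables so that the comparison against ERROR behaves the same on every compiler, and the names init_tables() and next_token() are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { ERROR = -1, LESS_THAN = 1, GREATER_THAN, EQUAL, NOT,
       LESS_THAN_OR_EQUAL, GREATER_THAN_OR_EQUAL, NOT_EQUAL, ASSIGN };

#define SIZE_OF_CHARACTER_SET 128

static signed char first [ SIZE_OF_CHARACTER_SET ];
static signed char second[ SIZE_OF_CHARACTER_SET ];

/* Fill both tables with ERROR, then mark the characters of interest. */
void init_tables( void )
{
    memset( first,  ERROR, sizeof first  );
    memset( second, ERROR, sizeof second );
    first ['>'] = GREATER_THAN;
    first ['<'] = LESS_THAN;
    first ['!'] = NOT;
    first ['='] = ASSIGN;
    second['='] = EQUAL;
}

/* Recognize one operator at *pp and advance *pp past it. A character
 * that cannot start an operator is skipped and ERROR is returned; the
 * terminating '\0' is never skipped.
 */
int next_token( const char **pp )
{
    int c, f;

    assert( pp != NULL && *pp != NULL );

    c = (unsigned char) **pp;
    if( c >= SIZE_OF_CHARACTER_SET || (f = first[c]) == ERROR )
    {
        if( c != '\0' )
            ++*pp;
        return ERROR;
    }

    ++*pp;
    c = (unsigned char) **pp;          /* vector to the second table */
    if( c >= SIZE_OF_CHARACTER_SET || second[c] == ERROR )
        return f;                      /* 1-character lexeme         */

    ++*pp;                             /* 2-character lexeme         */
    switch( f )
    {
    case ASSIGN:        return EQUAL;
    case LESS_THAN:     return LESS_THAN_OR_EQUAL;
    case GREATER_THAN:  return GREATER_THAN_OR_EQUAL;
    case NOT:           return NOT_EQUAL;
    }
    return ERROR;
}
```

Called repeatedly on the string "<=>!==", next_token() yields LESS_THAN_OR_EQUAL, GREATER_THAN, NOT_EQUAL, and ASSIGN before reporting ERROR at the end of the string. Adding a token means editing the tables and the switch by hand, which is the maintenance cost described above.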
2.4.1 Languages*
Before looking at LEX itself, we'll need a little theory. First some definitions. An
alphabet is any finite set of symbols. For example, the ASCII character set is an alphabet;
the set {'0', '1'} is a more-restricted, binary alphabet. A string or word is a
sequence of alphabetic symbols. In practical terms, a string is an array of characters.
There is also the special case of an empty string, represented by the symbol ε (pronounced
epsilon). In C, the '\0' is not part of the input alphabet. As a consequence, it
can be used as an end-of-string marker because it cannot be confused with any of the
characters in the string itself, all of which are part of the input alphabet. An empty string
in C can then be represented by an array containing a single '\0' character. Note, here,
that there's an important difference between ε, an empty string, and a null string. The
former is an array containing the end-of-string marker. The latter is represented by a
NULL pointer, a pointer that doesn't point anywhere. In other words, there is no array
associated with a null string.
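The distinction between the empty string and a null string shows up directly in C code. A minimal sketch follows; the function names classify_string() and string_length() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Classify a C string: 0 = null string (no array at all),
 * 1 = empty string (an array holding only '\0'), 2 = nonempty string.
 */
int classify_string( const char *s )
{
    if( s == NULL )
        return 0;
    return ( *s == '\0' ) ? 1 : 2;
}

/* The empty string has a length (zero). A null string has none, so
 * asking for it is treated as a programming error here.
 */
size_t string_length( const char *s )
{
    assert( s != NULL );
    return strlen( s );
}
```

Here classify_string("") returns 1 and string_length("") returns 0, while classify_string(NULL) returns 0 and a call to string_length(NULL) would trip the assertion.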
A language is a set of strings that can be formed from the input alphabet. A sentence
is a sequence of the strings that comprise a language. A language can be as small as one
string and still be useful. (Zero-element languages are possible, but not of much utility.)
The ordering of strings within the sentence is defined by a collection of syntactic rules
called a grammar. Note that this definition does not attribute meaning to any of the
strings, and this limitation has important practical consequences. The lexical analyzer
doesn't understand meaning. It has to distinguish tokens solely on the basis of surrounding
context: by looking at the characters that surround the current word, without regard
to the syntactic or semantic structure of the input sentence (the tokens that precede and
follow the current token). [Aho] introduces several other useful terms and definitions,
paraphrased here:⁵
prefix        A prefix is a string composed of the characters remaining after zero or
              more symbols have been deleted from the end of a string: "in" is a
              prefix of "inconsequential". Officially, ε is a prefix of every string.
suffix        A suffix is a string formed by deleting zero or more symbols from the
              front of a string. "ible" is a suffix of "incomprehensible". The
              suffix is what's left after you've removed a prefix. ε is a suffix of every
              string.
substring     A substring is what's left when you remove both a suffix and a prefix
              from a string such as "unmanageable". Note that suffixes and
              prefixes are substrings (but not the other way around). Also ε, the empty
              string, is a substring of every string.
proper X      A proper prefix, suffix, or substring of the string x has at least one
              element and it is not the same as x. That is, it can't be ε, and it can't be
              identical to the original string.
sub-sequence  A sub-sequence of a string is formed by deleting zero or more symbols
              from the string. The symbols don't have to be contiguous, so "iiii"
              and "ssss" are both sub-sequences of "Mississippi".
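The definitions above translate almost word for word into C. This is an illustrative sketch built on the standard string functions; the function names are hypothetical, and each returns 1 when its first argument stands in the named relation to its second.

```c
#include <assert.h>
#include <string.h>

/* p is a prefix of s: s begins with all of p. */
int is_prefix( const char *p, const char *s )
{
    assert( p != NULL && s != NULL );
    return strncmp( p, s, strlen( p ) ) == 0;
}

/* x is a suffix of s: s ends with all of x. */
int is_suffix( const char *x, const char *s )
{
    size_t nx, ns;

    assert( x != NULL && s != NULL );
    nx = strlen( x );
    ns = strlen( s );
    return nx <= ns && strcmp( s + (ns - nx), x ) == 0;
}

/* x is a substring of s: x appears in s as a contiguous run. */
int is_substring( const char *x, const char *s )
{
    assert( x != NULL && s != NULL );
    return strstr( s, x ) != NULL;
}

/* x is a sub-sequence of s: the symbols of x appear in s in order,
 * though not necessarily next to one another.
 */
int is_subsequence( const char *x, const char *s )
{
    assert( x != NULL && s != NULL );
    for( ; *s && *x; ++s )
        if( *s == *x )
            ++x;
    return *x == '\0';
}
```

All four functions accept the empty string "" as their first argument and return 1, which matches the rule that ε is a prefix, suffix, and substring of every string.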
5. [Aho], p. 93.
Several useful operations can be performed on strings. The concatenation of two
strings is formed by appending all characters of one string to the end of another string.
The concatenation of "fire" and "water" is "firewater". The empty string, ε, can
be concatenated to any other string without modifying it. (In set theory, ε is the identity
element for the concatenation operation. An arithmetic analogy is found in multiplication:
1 is the identity element for multiplication because x = x × 1.) The concatenation
operation is sometimes specified with an operator (typically a × or ·), so you can say that
    fire · water = firewater
If you look at concatenation as a sort of multiplication, then exponentiation makes
sense. An expression like xⁿ represents x, repeated n times. You could define a language
consisting of the eight legal octal digits with the following:
    L(octal) = { 0, 1, 2, 3, 4, 5, 6, 7 }
and then you could specify a three-digit octal number with L(octal)³.
The exponentiation process can be generalized into the closure operations. If L is a
language, then the Kleene closure of L is L repeated zero or more times. This operation
is usually represented as L*. In the case of a language comprised of a single character,
L* is that character repeated zero or more times. If the language elements are strings
rather than single characters, L* are the strings repeated zero or more times. For example,
L(octal)* is zero or more octal digits. If L(v1) is a language comprised of the string
Va and L(v2) is a language comprised of the string Voom, then
    L(v1)* L(v2)
describes all of the following strings:
    Voom   VaVoom   VaVaVoom   VaVaVaVoom   etc.
The positive closure of L is L repeated one or more times, usually denoted L+. It's otherwise
just like Kleene closure.
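The difference between the two closures can be seen in a pair of tiny recognizers for the Va/Voom example. This is a sketch only; the function names are hypothetical, and the simple loop works because Voom does not itself begin with Va.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if s is in L(v1)* L(v2): zero or more copies of "Va"
 * followed by exactly one "Voom".
 */
int in_va_star_voom( const char *s )
{
    assert( s != NULL );
    while( strncmp( s, "Va", 2 ) == 0 )    /* Kleene closure: 0 or more */
        s += 2;
    return strcmp( s, "Voom" ) == 0;
}

/* Return 1 if s is in L(v1)+ L(v2): at least one "Va" is required. */
int in_va_plus_voom( const char *s )
{
    assert( s != NULL );
    return strncmp( s, "Va", 2 ) == 0 && in_va_star_voom( s + 2 );
}
```

The string Voom is accepted by the first function but rejected by the second; VaVaVoom is accepted by both.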
Since languages are sets of symbols, most of the standard set operations can be
applied to them. The most useful of these is union, denoted with the ∪ operator. For
example, if letters is a language containing all 26 letters [denoted by L(letters)] and
digits is a set containing all 10 digits [denoted by L(digits)], then (L(letters) ∪ L(digits))
is the set of alphanumeric characters. Union is the equivalent of a logical OR operator.
(If x is an element of A ∪ B, then it is a member of either A OR B.) Other set operations
(like intersection) are, of course, possible, but have less practical application.
The foregoing can all be applied to build a language from an alphabet and to define
large languages (such as token sets) in terms of smaller languages (letters, digits, and so
forth). For example, given
    L(digit) = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
    L(alpha) = { a, b, c, ..., z }
you can say:
    L(digit)+                          is a decimal constant in C (one or more digits),
    L(digit)*                          is an optional decimal constant (zero or more digits),
    L(alpha) ∪ L(digit)                is the set of alphanumeric characters,
    (L(alpha) ∪ L(digit))*             is any number of alphanumeric characters,
    L(alpha) (L(alpha) ∪ L(digit))*    is a C identifier.
2.4.2 Regular Expressions*
Programs like LEX use the foregoing language theory to specify a token set for a lexical
analyzer. The possible lexemes that correspond to individual tokens are all defined
using a series of set operations applied to previously defined languages, with a base
alphabet of the ASCII character set. The programs then translate that language
specification into the C source code for a computer program that recognizes strings in
the language.
Both programs use a notation called regular expressions for this purpose. Strictly
speaking, a regular expression is any well-formed formula over union, concatenation, and
Kleene closure, as was the case with the examples in the previous section. A practical
implementation of regular expressions usually adds other operations, however, to make
them easier to use. I'll examine an extended regular-expression syntax in the current
section.
The simplest regular expression is just a series of letters that match a sequence of the
same letters in the input. Several special characters, called metacharacters, can be used
to describe more complex strings. Though there are variations in the notation used for
regular expressions (other UNIX utilities, like grep, vi, and sed, use a subset of these
rules), the following rules are used by LEX to form a regular expression and
can be taken as characteristic:
c             A single character that is not a metacharacter is a regular expression. The
              character c forms a regular expression that matches the single character c.
ee            Two regular expressions concatenated form a regular expression that
              recognizes a match of the first expression followed by a match of the second. If a,
              n, and d are regular expressions recognizing the characters a, n, and d, they
              can be concatenated to form the expression and, which matches the pattern
              and in the input. Note that there's no explicit concatenation operator here;
              the two strings are just placed next to each other.
.             A period (pronounced dot) matches any character except a newline. For
              example, the expression a.y matches any, amy, and the agy in magyar.
^             An up arrow anchors the pattern to the start of the line. The pattern ^and
              matches the string and only if it comprises the first three characters on the
              line (no preceding white space). Note that any newline character that precedes
              the and is not matched. That is, the newline is not part of the lexeme,
              even though its presence (or a start-of-file marker) is required for a successful
              match.
$             A dollar sign anchors the pattern to end of line. The pattern and$ matches
              the string only if it is the last three characters on the line (no following
              white space). Again, the newline character is not part of the lexeme. The
              pattern ^and$ matches the word only if it's the only thing on a line.
[...] [^...]  Brackets match any of the characters enclosed in the brackets. The [ and ]
              metacharacters form a character class which matches any of the characters
              listed. For example, [0123456789] matches any single decimal digit.
              Ranges of characters can be abbreviated using a dash, so [0-9] also
              matches a single decimal digit. [0-9A-Fa-f] matches a hexadecimal
              digit. [a-zA-Z] matches an alphabetic character. If the first character
              following the bracket is an up arrow (^), a negative character class (which
              matches any character except the ones specified) is formed. [^a-z]
              matches any character except a lower-case, alphabetic character. Only
              seven characters have special meaning inside a character class:
                  {   Start of macro name.
                  }   End of macro name.
                  ]   End of character class.
                  -   Range of characters.
                  ^   Indicates negative character class.
                  "   Takes away special meaning of characters up to next quote mark.
                  \   Takes away special meaning of next character.
              Use \], \-, \\, and so forth, to put these into a class. Since other
              metacharacters such as *, ?, and + are not special here, the expression [*?+] matches
              a star, question mark, or plus sign. Also, a negative character class does not
              match a newline character. That is, [^a-z] actually matches anything
              except a lower-case character or newline. Note that a negative character
              class must match a character. That is, [^a-z]$ does not match an empty
              line. The line must have at least one character, and its last character must
              be something other than a lower-case letter.
* + ?         A regular expression followed by a * (pronounced star) matches that
              expression repeated zero or more times; a + matches one or more repetitions;
              a ? matches zero or one repetitions. These three metacharacters
              represent closure operations. They are higher precedence than concatenation.
              ll?ama matches two strings: llama and lama. The expression
              l+ama matches lama, llama, and llllllllllllama. The expression
              l*ama matches all of the above, but also matches ama. The expression
              0[xX][0-9a-fA-F]+ matches a hexadecimal number in C syntax;
              [0-7][0-7]? matches one or two octal digits.
e{n,m}        Matches n to m repetitions of the expression e. This operator is recognized
              by lex, but not LEX.
e|e           Two regular expressions separated by a vertical bar recognize a match of
              the first expression OR a match of the second. OR is lower precedence than
              concatenation. The expression either|or matches either either or or.
(e)           Parentheses are used for grouping. The expression:
                  (frank|john)ie
              matches both frankie and johnie. The expression
                  (frank|john)(ie)?
              matches frank and john as well. You can add a newline to the characters
              recognized by a negative character class with something like this:
                  ([^a-z]|\n)
Surrounding a string that contains metacharacters with double quotes ("*") or
preceding a single metacharacter with a backslash (\*) takes away its special meaning.
(A character preceded by a backslash is said to be escaped.) The operator precedence is
summarized in Table 2.1. All operators associate left to right.
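Several of the expressions above can be tried without LEX by handing them to the POSIX regular-expression library. This is a sketch under stated assumptions: regcomp() and regexec() implement POSIX extended regular expressions, not LEX's dialect (there are no quoted strings or macro names, for instance), and without REG_NEWLINE the ^ and $ anchors refer to the ends of the whole string rather than to line boundaries. The wrapper name matches() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if `text` matches the POSIX extended regular expression
 * `pattern`, 0 if it doesn't, and -1 if the pattern won't compile.
 * Callers anchor the pattern with ^ and $ to force a whole-string match.
 */
int matches( const char *pattern, const char *text )
{
    regex_t re;
    int     rval;

    assert( pattern != NULL && text != NULL );

    if( regcomp( &re, pattern, REG_EXTENDED | REG_NOSUB ) != 0 )
        return -1;

    rval = ( regexec( &re, text, 0, NULL, 0 ) == 0 );
    regfree( &re );
    return rval;
}
```

For example, matches("^ll?ama$", "lama") and matches("^l*ama$", "ama") both return 1, while matches("^l+ama$", "ama") returns 0 because + demands at least one l.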
Note that regular expressions can only define sequences of characters. They cannot
do things like recognize any number of properly nested parentheses, something that can
be recognized grammatically (by the parser). This is one of the main reasons that the
lexical analyzer and parser are separate modules. The lexical analyzer is in charge of
recognizing simple sequences of characters, and the parser recognizes more complex
combinations.
Table 2.1. Regular-Expression Operator Precedence
operator   description                              level
( )        parentheses for grouping                 1 (highest)
[ ]        character classes                        2
* + ?      closure: 0 or more, 1 or more, 0 or 1    3
cc         concatenation                            4
|          OR                                       5
^ $        anchors to beginning and end of line     6 (lowest)
2.4.3 Regular Definitions*
There is an alternate way of describing a language's token set that takes a more
grammatical approach, and which is used in many language specifications. A regular
definition builds up a language specification using a combination of regular-expression
operators and production-like specifiers. For example:
    keyword            → long | int | double | while
    digit              → 0 | 1 | 2 | ... | 9
    digit_sequence     → digit+
    sign               → + | -
    exponent_part      → e sign? digit_sequence
                       | E sign? digit_sequence
    floating_constant  → digit_sequence . digit_sequence? exponent_part?
                       | digit_sequence? . digit_sequence exponent_part?
                       | digit_sequence exponent_part
Occasionally you see an opt subscript used to denote an optional element, such as digit_opt
rather than digit?. This grammatical approach to languages is discussed in greater depth
in the next chapter.
2.4.4 Finite Automata*
A recognizer program, such as a lexical analyzer, reads a string as input and outputs
yes if the string is a sentence in a language, no if it isn't. A lexical analyzer has to do
more than say yes or no to be useful, so an extra layer is usually added around the recognizer
itself. When a certain string is recognized, the second layer performs an action
associated with that string. LEX takes an input file comprised of regular expressions and
associated actions (code). It then builds a recognizer program that executes the code in
the actions when a string is recognized. LEX builds the recognizer component of the
analyzer by translating regular expressions that represent the lexemes into a finite automaton
or finite state machine (usually abbreviated to state machine or FSM).
Strictly speaking, an FSM consists of the following:
• A finite set of states.
• A set of transitions (or moves) from one state to another. Each transition is labeled
  with a character from the input alphabet.
• A special start state.
• A set of final or accepting states.
State machines are best understood by looking at one. Figure 2.3 is a transition
diagram for a state machine that recognizes the four strings he, she, his, and
hers.
Figure 2.3. A State Machine
The circles are individual states, marked with the state number, an arbitrary number
that identifies the state. State 0 is the start state, and the machine is initially in this state.
The lines connecting the states represent the transitions. These lines are called edges, and
the label on an edge represents characters that cause the transition from one state to
another (in the direction of the arrow). From the start state, reading an h from the input
causes a transition to State 1; from State 1, an e gets the machine to State 2, and an i
causes a transition to State 5; and so on. A transition from State N to State M on the character
c is often represented with the notation: next(N,c)=M. This function is called the
move function by some authors, [Aho] among them, but I feel that next better describes
what the function is doing.
The states with double circles are called accepting states. Entering an accepting
state signifies recognition of a particular input string, and there is usually some sort of
action associated with the accepting state (in lexical-analyzer applications, a token is
returned). Unmarked edges (for example, there are no outgoing edges marked with an i,
r, or e from State 0) are all implied transitions to a special implicit error state.
State machines such as the foregoing can be modeled with two data structures: a single
variable holding the current state number and a two-dimensional array for computing
the next state. One axis is indexed by the input character, the other by the current state,
and the array holds the next state. For example, the previous machine can be represented
by the arrays in Table 2.2. Two arrays are used, one to hold the state transitions and
another to tell you whether a state is accepting or not. (You could also use a single,
two-dimensional array of structures, one element of which was the next state and the other of
which was the accepting-state marker, but that would waste space.) The next state is
determined from the current state and input character, by looking it up in the table as follows:
    next_state = Transition_table[ input_character ][ current_state ];
    if( Accepting[ next_state ] == 1 )
        do_an_accepting_action( next_state );
This input character is usually called the lookahead character because it's not removed
from the input until the next-state transition is made. The machine derives the next state
from the current state and lookahead character. If the next state is not the error state,
then set the current state to that state and advance past the lookahead character (typically
by reading, and discarding, it).
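The lookup just described can be wrapped in a small driver. The following is a sketch, not LEX's own driver: it recognizes he, she, his, and hers using the state numbering of Table 2.2 as best it can be read (an assumption), it indexes the table by state first and by a compressed column number second, and the names FAIL, column(), and recognize() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NSTATES 10
#define FAIL    (-1)                  /* the implicit error state */

enum { COL_E, COL_H, COL_I, COL_R, COL_S, NCOLS };

static const signed char Transition_table[ NSTATES ][ NCOLS ] =
{
    /*            e     h     i     r     s  */
    /* 0 */  { FAIL,    1, FAIL, FAIL,    7 },
    /* 1 */  {    2, FAIL,    5, FAIL, FAIL },
    /* 2 */  { FAIL, FAIL, FAIL,    3, FAIL },
    /* 3 */  { FAIL, FAIL, FAIL, FAIL,    4 },
    /* 4 */  { FAIL, FAIL, FAIL, FAIL, FAIL },
    /* 5 */  { FAIL, FAIL, FAIL, FAIL,    6 },
    /* 6 */  { FAIL, FAIL, FAIL, FAIL, FAIL },
    /* 7 */  { FAIL,    8, FAIL, FAIL, FAIL },
    /* 8 */  {    9, FAIL, FAIL, FAIL, FAIL },
    /* 9 */  { FAIL, FAIL, FAIL, FAIL, FAIL }
};

static const char Accepting[ NSTATES ] = { 0, 0, 1, 0, 1, 0, 1, 0, 0, 1 };

/* Map an input character to its table column, or FAIL if it has none. */
static int column( int c )
{
    switch( c )
    {
    case 'e': return COL_E;
    case 'h': return COL_H;
    case 'i': return COL_I;
    case 'r': return COL_R;
    case 's': return COL_S;
    }
    return FAIL;
}

/* Return 1 if the whole string is he, she, his, or hers; 0 otherwise. */
int recognize( const char *s )
{
    int state = 0, col;

    assert( s != NULL );
    for( ; *s; ++s )
    {
        if( (col = column( *s )) == FAIL )
            return 0;
        if( (state = Transition_table[ state ][ col ]) == FAIL )
            return 0;
    }
    return Accepting[ state ];
}
```

A real lexical analyzer would remember the last accepting state it passed through rather than demanding that the entire string match, but the table lookup at the heart of the loop is the same.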
The machine we just looked at is called a deterministic finite automaton or DFA. A
DFA is deterministic in that the next state can always be determined by knowing the
current state and the current lookahead character. To be more specific, a DFA is a state
Table 2.2. Representing the State Machine
                                Transition Table
                              Lookahead Character
               Accepting      e    h    i    r    s
           0       0          -    1    -    -    7
           1       0          2    -    5    -    -
           2       1          -    -    -    3    -
           3       0          -    -    -    -    4
current    4       1          -    -    -    -    -
state      5       0          -    -    -    -    6
           6       1          -    -    -    -    -
           7       0          -    8    -    -    -
           8       0          9    -    -    -    -
           9       1          -    -    -    -    -
machine in which all outgoing edges are labeled with an input character, and no two
edges leaving a given state have the same label. There is also a second, more general
type of state machine called a nondeterministic finite automaton or NFA, which is more
useful in many applications, including the current one. (All DFAs are also NFAs, but not
the other way around.) An NFA has no limitations on the number and type of edges:
Two outgoing edges can have the same label, and edges can be labeled with the empty
string, ε. This last type of edge is called an epsilon edge or epsilon transition. Since an ε
transition matches an empty string, it is taken without advancing the input and is always
taken, regardless of the input character. For example, how can the regular expression
(and|any) be represented as a state machine? A DFA for this expression looks like this:
[Figure: DFA for (and|any)]
Unfortunately, DFAs are difficult to construct directly from regular expressions⁷;
NFAs are easy to construct. Two possibilities are:
[Figure: first NFA for (and|any)]
and
[Figure: second NFA for (and|any)]
7. It is possible to construct a DFA directly from a regular expression, though I won't discuss how to do it
here. See both [McNaughton] and [Aho] pp. 135-141.
The second machine is preferable because it's easier to represent in a computer program
(we'll see how in a moment). As you can see, the NFA can be an awkward data structure
to use. It can have many more states than the equivalent DFA,⁸ and it's difficult to write a
state-machine driver (a program that uses the state machine to do something, such as
recognize tokens) that can use it directly. LEX solves the problem by creating the state
machine in a two-step process. It first makes an NFA representing the input regular
expressions, and it then converts that NFA to a DFA, which it in turn outputs. I'll discuss
how LEX performs this feat later in this chapter.
The state-machine representations we've been looking at are, of course, just one of
many ways to represent them. You can generalize the definitions for NFAs and DFAs. A
nondeterministic finite automaton, or NFA, is a mathematical model consisting of:
(1) A set of states, S.
(2) A special state in S called the start state. The machine is initially in this state.
(3) A set of states in S called accepting states, entry into which denotes recognition of
    a string. Some sort of action is usually associated with each of the accepting states.
(4) A set of input symbols (an input alphabet).
(5) A next function that, when given a state and an input symbol, returns the set of
    states to which control is transferred on that symbol from the indicated state. I'll
    describe this next function in greater detail in a moment. Note, however, that the
    next function returns a set of states. The main implementation difference between
    an NFA and a DFA is this next function. The next function for a DFA always
    yields a single next state. The equivalent NFA function can yield several next
    states.
A deterministic finite automaton or DFA is an NFA with the following restrictions:
(1) No state can have an outgoing ε transition (an edge labeled with ε, the empty
    string).
(2) There may be no more than one outgoing transition from any state that is labeled
    with the same character.
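The set-valued next function is the part of this definition that most affects code. The following sketch represents a set of states as a bit mask and hand-builds one possible NFA for (and|any), with two ε edges leaving the start state. The state numbering, the edge-list representation, and every name in the block are assumptions made for illustration; this is not the representation LEX uses.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int StateSet;        /* bit i set: state i is in the set */
#define BIT(i) (1u << (i))

struct edge { int from; int label; int to; };   /* label 0 means epsilon */

/* 0 -e-> 1 -a-> 2 -n-> 3 -d-> 4 (accept)
 * 0 -e-> 5 -a-> 6 -n-> 7 -y-> 8 (accept)
 */
static const struct edge Edges[] =
{
    { 0,  0,  1 }, { 0,  0,  5 },
    { 1, 'a', 2 }, { 2, 'n', 3 }, { 3, 'd', 4 },
    { 5, 'a', 6 }, { 6, 'n', 7 }, { 7, 'y', 8 }
};
#define NEDGES (sizeof Edges / sizeof Edges[0])

static const StateSet Accepting = BIT(4) | BIT(8);

/* All states reachable from `set` by one edge labeled `label`. */
static StateSet step( StateSet set, int label )
{
    StateSet out = 0;
    size_t   i;

    for( i = 0; i < NEDGES; ++i )
        if( (set & BIT( Edges[i].from )) && Edges[i].label == label )
            out |= BIT( Edges[i].to );
    return out;
}

/* Add every state reachable through epsilon edges alone. */
static StateSet e_closure( StateSet set )
{
    StateSet more;

    while( (more = step( set, 0 ) & ~set) != 0 )
        set |= more;
    return set;
}

/* The NFA next function: a set of states in, a set of states out. */
StateSet nfa_next( StateSet set, int c )
{
    return e_closure( step( e_closure( set ), c ) );
}

/* Return 1 if the NFA accepts the whole string, 0 otherwise. */
int nfa_accepts( const char *s )
{
    StateSet cur = e_closure( BIT(0) );

    assert( s != NULL );
    for( ; *s && cur; ++s )
        cur = nfa_next( cur, (unsigned char) *s );
    return *s == '\0' && (cur & Accepting) != 0;
}
```

After reading an a, this machine is in states 2 and 6 at once, which is exactly the situation a DFA's next function is never allowed to produce.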
In practical terms, the foregoing definition describes only the data structures (the set
of states and the way that the transitions are represented) and the next function that
determines the next state from the current one. In other words, it tells us how to make a
transition matrix. There is no information here about how the state machine is used; and
automata can, in fact, be used in different ways depending on the application. The state
machine is itself distinct from the driver program (the program that uses that machine).
8. In fact, a theoretical NFA often has fewer states than an equivalent DFA because it can have more edges
leaving a single state than the DFA has. Nonetheless, this sort of NFA is difficult to represent in a
computer program because it has an indeterminate number of outgoing edges. The NFAs discussed in the
current chapter all have more states than the equivalent DFAs because the extra states help smooth over
these difficulties. I'll show how in a moment.
This section demonstrates how state machines are used for lexical analysis by looking,
at a high level, at the method used by a LEX-generated lexical analyzer. I'll describe
a simple table-driven lexical analyzer that recognizes decimal and floating-point
constants. The following regular expressions describe these constants:
    [0-9]+                                              return ICON;
    ([0-9]+|[0-9]*\.[0-9]+|[0-9]+\.[0-9]*)(e[0-9]+)?    return FCON;
The code to the right of the regular expression is executed by the lexical analyzer when
an input string that matches that expression is recognized. The first expression
recognizes a simple sequence of one or more digits. The second expression recognizes a
floating-point constant. The (e[0-9]+)? at the end of the second regular expression is
the optional engineering notation at the end of the number. I've simplified by not
allowing the usual + or - to follow the e, and only a lower-case e is recognized. The
    ( [0-9]+ | [0-9]*\.[0-9]+ | [0-9]+\.[0-9]* )
recognizes one of three patterns (I've added the spaces to clarify what's going on;
they're not really there): The [0-9]+ is a simple sequence of decimal digits. It's for
numbers like 10e3. Because of the way that LEX works, the [0-9]+ on the previous
line of the input specification takes precedence over the current one: an ICON is
returned if a number does not have a trailing e, otherwise an FCON is returned. The
[0-9]*\.[0-9]+ recognizes numbers with at least one digit to the right of the decimal
point, the [0-9]+\.[0-9]* recognizes numbers with at least one digit to the left. You
can't use [0-9]*\.[0-9]* because that pattern would accept a decimal point without
numbers on either side. All of the following numbers are accepted:
    1.2    1.    .1    1.2e3    2e3    1
and, of these, the last is an ICON and the others are FCONs.
LEX uses a state-machine approach to recognize regular expressions, and a DFA that
recognizes the previous expressions is shown in Figure 2.4. The same machine is
represented as an array in Table 2.3. The next state is computed using that array with:
    next_state = array[ current_state ][ input ]
A dash indicates a failure transition (no legal outgoing transition on the current input
character from the current state). This array is typically called a transition matrix or
transition table. There are three accepting states (states from which a token is
recognized) in the machine: 1, 2, and 4. State 1 accepts an integer constant, and the other two
recognize floating-point constants. The accepting states are identified by an auxiliary
array that is also indexed by state number, and which indicates whether or not a state is
accepting.
As I mentioned earlier, the state machine itself and the driver program that uses that
machine are distinct from one another. Two algorithms are commonly used in lexical
analysis applications, and the same state machine (transition matrix) is used by both
algorithms. A greedy algorithm, shown in Table 2.4, is used by LEX (because that's
what's required by most programming-language specifications). This algorithm finds the
longest possible sequence of input characters that can form a token. The algorithm can
be stated informally as follows: If there's an outgoing transition from the current state,
take it. If the new state is an accepting state, remember it along with the input position.
If there's no outgoing transition (the table has a dash in it), do the action associated
with the most-recently seen accepting state. If there is no such state, then an error has
occurred (LEX just ignores the partially-collected lexeme and starts over from State 0 in
this situation).
2.4.5 State-Machine-Driven Lexical Analyzers*
Figure 2.4. State Machine That Recognizes Floating-Point Constants
[State diagram not reproduced. Its transitions are the ones tabulated in Table 2.3. State 1
is labeled return ICON; and States 2 and 4 are labeled return FCON;.]
Table 2.3. State Machine in Figure 2.4 Represented as an Array

                      lookahead character
    current state     .      0-9     e       accepting action
         0            3       1      -       -
         1            2       1      5       return ICON;
         2            -       2      5       return FCON;
         3            -       2      -       -
         4            -       4      -       return FCON;
         5            -       4      -       -
I'll do two examples to show the workings of the machine, the first with the input
1.2e4. LEX starts in State 0. The 1 causes a transition to State 1 and the input is
advanced. Since State 1 is a potential accepting state, the current input position and state
number are remembered. The dot now gets us to State 2 (and the input is advanced again).
Since State 2 is also an accepting state, the previously remembered input position and
state number are overwritten by the current ones. The next input character (the 2) causes
us to go from State 2 to itself. State 2 is still an accepting state, so the current input
position overwrites the previously saved one. The e now gets us to State 5, which isn't an
accepting state, so no other action is performed; and the final 4 causes a transition to
State 4, which is an accepting state, so the current input position overwrites the previous
one. The next input character is the end-of-input marker. There is no legal transition out
of State 4 on end of input, so the machine enters the failure state. Here, the action
associated with the most-recently seen accepting state (4) is performed and the machine
returns FCON. The next time the subroutine is called, it returns zero immediately,
because the lookahead character is end of input.
The second example looks at the incorrect input 1.2e, with no number following the
e. This input causes a failure transition from State 5, because there's no legal outgoing
transition on end of input from State 5. When the failure occurs, the most recently seen
accepting state is State 2, so the input is backed up to the condition it was in when the
machine was in State 2 (the next input character is an e) and an FCON is returned. The next
time the algorithm is entered, there will be a failure transition from the start state,
because an e can't occur at the beginning of the number. The e is discarded, and the
algorithm goes to State 0 (and terminates).
Table 2.4. Algorithm Used by the LEX State-Machine Driver

    current_state = 0;
    previously_seen_accepting_state = none_seen;
    if( lookahead character is end-of-input )
        return 0;
    while( lookahead character is not end-of-input )
    {
        if( there is a transition from the current state on the current lookahead character )
        {
            current_state = that state;
            advance the input;
            if( the current state is an accepting state )
            {
                remember the current position in the input
                and the action associated with the current state;
            }
        }
        else
        {
            if( no accepting state has been seen )
            {
                There's an error:
                Discard the current lexeme and input character.
                current_state = 0;
            }
            else
            {
                back up the input to the position it was in when it saw the last accepting state;
                perform the action associated with that accepting state;
            }
        }
    }
Note that the greedy algorithm does have its disadvantages: It's tricky to implement
and tends to be relatively slow. It can also cause the recognizer to behave in sometimes
unexpected ways. (The LEX input expression (\n|.)* tries to absorb the entire input
file, for example.) It is nonetheless the best (and sometimes the only) choice in most real
lexical-analysis applications.
The second type of algorithm (the nongreedy algorithm) is much simpler. Here, the
shortest possible input string is recognized, and the machine just accepts as soon as an
accepting state is entered. A nongreedy recognizer program is much simpler to
implement than a greedy one, and is much faster as well. Nonetheless, this algorithm can be
used only when all the accepting states in the machine are terminal nodes, that is, when
they have no outgoing transitions.
2.4.6 Implementing a State-Machine-Driven Lexical Analyzer
This section shows how the machine in the previous section is implemented by
analyzing a LEX output file in depth. You may want to skip over this section if you're not
interested in this level of detail. You should read Appendixes A and D, in which various
support functions and LEX itself are described, before continuing. A LEX input file that
recognizes the floating-point constants we've been looking at is shown in Listing 2.11.
Listing 2.11. numbers.lex: A LEX Input File to Recognize Floating-Point Constants
  1  %{
  2  #define FCON 1
  3  #define ICON 2
  4  %}
  5  D    [0-9]      /* a single decimal digit */
  6  %%
  7  {D}+                                      return ICON;
  8  ({D}+|{D}*\.{D}+|{D}+\.{D}*)(e{D}+)?      return FCON;
  9  %%
The LEX output file, lexyy.c, begins in Listing 2.12 (see footnote 9). The first two lines are
from the header portion of the original LEX input file. They are followed by a comment that
describes the state machine that LEX created. If a state is an accepting state, the first few
characters of the equivalent code are printed (tabs are mapped to \t), along with the
input line number. The goto transitions are shown along with the characters that cause
the transitions. If several edges (outgoing transitions) all go to the same state, they are
represented like this:
    goto 2 on 0123456789
The state goes to State 2 if the input character is a digit. The entire comment is
surrounded with an #ifdef __NEVER__ in case a */ should accidentally come up in one
of these lists of transition characters.
The next part of the LEX output file is copied directly from the template file
(/lib/lex.par by default, but a different file can be specified with the -m command-line
switch). This template file is separated into three parts by formfeed (Ctrl-L) characters.
Everything from the beginning of the file up to the first formfeed is copied into the
output file at this juncture. The relevant code is shown in Listing 2.13.
The #ifndef directive on line 35 of Listing 2.13 lets you define YYPRIVATE in the
LEX-input-file header, without having to #undef it first. Most of the global variables in
the file are declared as YYPRIVATE, which normally translates to the keyword static.
Redefining this macro to an empty string makes these variables true globals, which can
be accessed from outside the current file. I'm using the definition of NULL on line 39 to
determine if <stdio.h> was included previously, and including it if not. Finally, if
YYDEBUG is defined, various debugging diagnostics are activated. These are printed
only if the variable yydebug is also true, thus the if statement on line 45. It's best to
9. Note that there are two kinds of text in lexyy.c: (1) text copied verbatim from the template file lex.par and
(2) text generated by LEX itself. The listings that describe those parts of lexyy.c that are copied from the
template file are labeled lex.par in the following discussion. LEX-generated text is in listings labeled
lexyy.c. Line numbers carry over from one listing to another because there's really only a single output
file.
Listing 2.12. lexyy.c: State-Machine Description
  1  #define FCON 1
  2
  3  #define ICON 2
  4  #ifdef __NEVER__
  5  /*---------------------------------------------------
  6   * DFA (start state is 0) is:
  7   *
  8   * State 0 [nonaccepting]
  9   *     goto 3 on .
 10   *     goto 1 on 0123456789
 11   * State 1 [accepting, line 7 <return ICON;>]
 12   *     goto 2 on .
 13   *     goto 1 on 0123456789
 14   *     goto 5 on e
 15   * State 2 [accepting, line 8 <return\tFCON;>]
 16   *     goto 2 on 0123456789
 17   *     goto 5 on e
 18   * State 3 [nonaccepting]
 19   *     goto 2 on 0123456789
 20   * State 4 [accepting, line 8 <return\tFCON;>]
 21   *     goto 4 on 0123456789
 22   * State 5 [nonaccepting]
 23   *     goto 4 on 0123456789
 24   */
 25
 26  #endif
see how the macro works by looking at an example; if YYDEBUG is defined, then a
debugging diagnostic like:
    YY_D( printf("aaaaaghhhh!!!") );
is expanded to:
    if( yydebug ){ printf("aaaaaghhhh!!!"); } else;
Note that the semicolon following the else comes from the original macro invocation
and the semicolon following the printf() follows the x in the macro definition. That
trailing else is important in order to make something like the following work correctly:
    if( something )
        YY_D( printf("aaaaaghhhh!!!") );
    else
        something_else();
The foregoing expands to:
    if( something )
        if( yydebug )
        {
            printf("aaaaaghhhh!!!");
        }
        else
            ;
    else
        something_else();
If the else weren't present in the macro definition, then the else
something_else() clause in the original code would incorrectly bind to the
Listing 2.13. lex.par: Various Definitions Copied Into lexyy.c
 27  /* YY_TTYPE is used for the DFA transition table: Yy_nxt[], declared below.
 28   * YYF marks failure transitions in the DFA transition table. There's no failure
 29   * state in the table itself, these transitions must be handled by the driver
 30   * program. The DFA start state is State 0. YYPRIVATE is defined here only
 31   * if it hasn't been #defined earlier. I'm assuming that if NULL is undefined,
 32   * <stdio.h> hasn't been included.
 33   */
 34
 35  #ifndef YYPRIVATE
 36  #       define YYPRIVATE static
 37  #endif
 38
 39  #ifndef NULL
 40  #       include <stdio.h>
 41  #endif
 42
 43  #ifdef YYDEBUG
 44          int yydebug = 0;
 45  #       define YY_D(x) if( yydebug ){ x; }else
 46  #else
 47  #       define YY_D(x)
 48  #endif
 49
 50  typedef unsigned char YY_TTYPE;
 51  #define YYF ((YY_TTYPE)(-1))
 52
 53  unsigned char *ii_text();
if( yydebug ) rather than the if( something ). If YYDEBUG isn't defined in the
header, then the argument to YY_D effectively disappears from the input (the macro
expands to an empty string). In this case, the printf() statements go away.
The code on lines 50 and 51 of Listing 2.13 is used to declare and access the
transition-matrix array. YY_TTYPE is the type of one array element, and YYF marks
failure transitions in the array. This latter value cannot be used as a state number. Note
that the cast to unsigned char effectively translates -1 to 255. Similarly, -1s in the
tables are all silently converted to 255 as part of the initialization.
The next part of the LEX output file is the state-machine transition matrix. It is used
to compute the next state from the current state and lookahead symbol. This array can
take three forms. The first, uncompressed form is shown in Figure 2.5 and Listing
2.14. I've simplified the picture by leaving all the error transitions blank. (They're
initialized to -1 in Listing 2.14.) An uncompressed array is generated by specifying a -f
(for fast) switch on the LEX command line. The next state is computed with:
    Yy_nxt[ current_state ][ lookahead_character ]
This operation is encapsulated into the yy_next(state,c) macro on line 147 of
Listing 2.14.
Notice that several columns in the uncompressed array are identical. All columns
not associated with a period, e, or digit are the same: they're all error transitions. All
the columns for the digits are the same. By the same token, the rows associated with
States 4 and 5 are the same. (The states aren't equivalent because one is an accepting
state and the other isn't.) This situation holds with most state machines that recognize
real token sets. Taking C as a case in point, all the control characters and the space
character are ignored, so the columns for these are identical. With a few exceptions like L
Figure 2.5. The Uncompressed Transition Table

    Yy_nxt[][]     .     0123456789     e      accepting action
        0          3     1111111111
        1          2     1111111111     5      return ICON
        2                2222222222     5      return FCON
        3                2222222222
        4                4444444444            return FCON
        5                4444444444
Listing 2.14. lexyy.c: The Uncompressed Transition Table
 54  YYPRIVATE YY_TTYPE Yy_nxt[ 6 ][ 128 ] =
 55  {
 56  /* 00 */ { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 57             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 58             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 59             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 60             -1, -1, -1, -1, -1, -1,  3, -1,  1,  1,
 61              1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
 62             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 63             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 64             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 65             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 66             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 67             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 68             -1, -1, -1, -1, -1, -1, -1, -1
 69           },
 70  /* 01 */ { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 71             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 72             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 73             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 74             -1, -1, -1, -1, -1, -1,  2, -1,  1,  1,
 75              1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
 76             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 77             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 78             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 79             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 80             -1,  5, -1, -1, -1, -1, -1, -1, -1, -1,
 81             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 82             -1, -1, -1, -1, -1, -1, -1, -1
 83           },
 84  /* 02 */ { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 85             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 86             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 87             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 88             -1, -1, -1, -1, -1, -1, -1, -1,  2,  2,
 89              2,  2,  2,  2,  2,  2,  2,  2, -1, -1,
 90             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 91             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 92             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 93             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 94             -1,  5, -1, -1, -1, -1, -1, -1, -1, -1,
 95             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 96             -1, -1, -1, -1, -1, -1, -1, -1
 97           },
 98  /* 03 */ { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 99             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
100             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
101             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
102             -1, -1, -1, -1, -1, -1, -1, -1,  2,  2,
103              2,  2,  2,  2,  2,  2,  2,  2, -1, -1,
104             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
105             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
106             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
107             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
108             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
109             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
110             -1, -1, -1, -1, -1, -1, -1, -1
111           },
112  /* 04 */ { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
113             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
114             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
115             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
116             -1, -1, -1, -1, -1, -1, -1, -1,  4,  4,
117              4,  4,  4,  4,  4,  4,  4,  4, -1, -1,
118             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
119             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
120             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
121             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
122             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
123             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
124             -1, -1, -1, -1, -1, -1, -1, -1
125           },
126  /* 05 */ { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
127             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
128             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
129             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
130             -1, -1, -1, -1, -1, -1, -1, -1,  4,  4,
131              4,  4,  4,  4,  4,  4,  4,  4, -1, -1,
132             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
133             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
134             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
135             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
136             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
137             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
138             -1, -1, -1, -1, -1, -1, -1, -1
139           }
140  };
141
142  /*--------------------------------------------------
143   * yy_next(state,c) is given the current state and input
144   * character and evaluates to the next state.
145   */
146
147  #define yy_next(state, c) Yy_nxt[ state ][ c ]
and x, all the columns for the letters are identical, as are most of the digits' columns.
Moreover, at least half of the states in a typical machine have no legal outgoing
transitions, so the rows associated with these states are identical: every cell holds -1.
LEX's default compression technique takes advantage of this situation and eliminates
the redundant rows and columns by creating two supplemental arrays. The compressed
table is shown in Figure 2.6 and in Listing 2.15. The Yy_cmap[] array is indexed by
lookahead character and holds the index of one of the columns in the Yy_nxt[] array.
When several columns in the original array are equivalent, the matching entries in
Yy_cmap[] hold the index of a single column in Yy_nxt[]. For example, the columns
associated with digits in the original table are all identical. Only one of these columns is
present in the compressed array (at Yy_nxt[x][2]), and all columns corresponding to
digits in Yy_cmap hold a 2. The rows are compressed in the same way using
Yy_rmap[]. Since rows 4 and 5 are identical in the uncompressed array, Yy_rmap[4]
and Yy_rmap[5] both hold a 4, and Yy_nxt[4][x] holds the original row from the
uncompressed table.
Figure 2.6. Transition Table With Redundant Rows and Columns Eliminated
[Diagram not reproduced: Yy_cmap[], indexed by lookahead character, and Yy_rmap[], indexed by
state, select a column and a row of the compressed Yy_nxt[][]. The same data appears in Listing 2.15.]
An array element is accessed using:
    Yy_nxt[ Yy_rmap[current_state] ][ Yy_cmap[lookahead_character] ]
rather than the
    Yy_nxt[ current_state ][ lookahead_character ]
that's used for the uncompressed table. The yy_next macro for this type of table is
defined on line 102 of Listing 2.15.
Redundant-row-and-column elimination is usually the best practical compression
technique. The access time is fast, and the compression ratio is usually quite good. (In
this example, the ratio is about 4:1, or 154 bytes as compared to 640 bytes. The C
lexical analyzer presented in Appendix D does even better, with a compression ratio of
about 7:1, or 1,514 bytes versus 10,062 bytes.)
A second compression method yields better compression if the transition matrix is
particularly sparse, though the access time is slower. The rows are split into distinct
one-dimensional arrays, accessed indirectly through an array of pointers (see Figure 2.7).
The rows are all named Yy_nxtN, where N is the original row index in the
uncompressed table (row 5 is in Yy_nxt5[]), and the array of pointers is called
Yy_nxt[]. The current tables are compressed in this way in Listing 2.16.
If the first cell of the Yy_nxtN array is zero, then the remainder of the array is
identical to the original row, and can be accessed directly using the lookahead character.
To simplify a little:
Listing 2.15. lexyy.c: Transition Table With Redundant Rows and Columns Eliminated
 54  /*--------------------------------------------------
 55   * The Yy_cmap[] and Yy_rmap arrays are used as follows:
 56   *
 57   * next_state = Yy_nxt[ Yy_rmap[current_state] ][ Yy_cmap[input_char] ];
 58   *
 59   * Character positions in the Yy_cmap array are:
 60   *
 61   *  ^@ ^A ^B ^C ^D ^E ^F ^G ^H ^I ^J ^K ^L ^M ^N ^O
 62   *  ^P ^Q ^R ^S ^T ^U ^V ^W ^X ^Y ^Z ^[ ^\ ^] ^^ ^_
 63   *      !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
 64   *   0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
 65   *   @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
 66   *   P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
 67   *   `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
 68   *   p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~ DEL
 69   */
 70
 71  YYPRIVATE YY_TTYPE Yy_cmap[128] =
 72  {
 73      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 74      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 75      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
 76      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0,
 77      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 78      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 79      0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 80      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 81  };
 82
 83  YYPRIVATE YY_TTYPE Yy_rmap[6] =
 84  {
 85      0, 1, 2, 3, 4, 4
 86  };
 87
 88  YYPRIVATE YY_TTYPE Yy_nxt[ 5 ][ 4 ] =
 89  {
 90  /* 00 */ { -1,  3,  1, -1 },
 91  /* 01 */ { -1,  2,  1,  5 },
 92  /* 02 */ { -1, -1,  2,  5 },
 93  /* 03 */ { -1, -1,  2, -1 },
 94  /* 04 */ { -1, -1,  4, -1 }
 95  };
 96
 97  /*--------------------------------------------------
 98   * yy_next(state,c) is given the current state number and input
 99   * character and evaluates to the next state.
100   */
101
102  #define yy_next(state,c) (Yy_nxt[ Yy_rmap[state] ][ Yy_cmap[c] ])
    YY_TTYPE *row;
    row = Yy_nxt[ current_state ];
    if( *row == 0 )
        next_state = (row + 1)[ lookahead_character ];
If the first cell of the Yy_nxtN array is nonzero, then the array holds a sequence of
character/next-state pairs and the first cell holds the number of pairs. For example,
Figure 2.7. Pair-Compressed Transition Table
[Diagram not reproduced: the pointer array Yy_nxt[] with one pointer to each of the row arrays
Yy_nxt0[] through Yy_nxt5[]. Yy_nxt1[] begins with 0 and is stored uncompressed. The other rows
begin with a pair count followed by character/next-state pairs. The same data appears in Listing 2.16.]
Listing 2.16. lexyy.c: Pair-Compressed Transition Table
 54  YYPRIVATE YY_TTYPE Yy_nxt0 [] = { 11,
 55                       '.',3,  '0',1,  '1',1,  '2',1,  '3',1,
 56                       '4',1,  '5',1,  '6',1,  '7',1,  '8',1,
 57                       '9',1 };
 58  YYPRIVATE YY_TTYPE Yy_nxt1 [] = { 0,
 59             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 60             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 61             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 62             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 63             -1, -1, -1, -1, -1, -1,  2, -1,  1,  1,
 64              1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
 65             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 66             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 67             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 68             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 69             -1,  5, -1, -1, -1, -1, -1, -1, -1, -1,
 70             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 71             -1, -1, -1, -1, -1, -1, -1, -1 };
 72  YYPRIVATE YY_TTYPE Yy_nxt2 [] = { 11,
 73                       '0',2,  '1',2,  '2',2,  '3',2,  '4',2,
 74                       '5',2,  '6',2,  '7',2,  '8',2,  '9',2,
 75                       'e',5 };
 76  YYPRIVATE YY_TTYPE Yy_nxt3 [] = { 10,
 77                       '0',2,  '1',2,  '2',2,  '3',2,  '4',2,
 78                       '5',2,  '6',2,  '7',2,  '8',2,  '9',2
 79                     };
 80  YYPRIVATE YY_TTYPE Yy_nxt4 [] = { 10,
 81                       '0',4,  '1',4,  '2',4,  '3',4,  '4',4,
 82                       '5',4,  '6',4,  '7',4,  '8',4,  '9',4
 83                     };
 84  YYPRIVATE YY_TTYPE Yy_nxt5 [] = { 10,
 85                       '0',4,  '1',4,  '2',4,  '3',4,  '4',4,
 86                       '5',4,  '6',4,  '7',4,  '8',4,  '9',4
 87                     };
 88
 89  YYPRIVATE YY_TTYPE *Yy_nxt[ 6 ] =
 90  {
 91      Yy_nxt0, Yy_nxt1, Yy_nxt2, Yy_nxt3, Yy_nxt4, Yy_nxt5
 92  };
 93
 94  /*--------------------------------------------------*/
 95
 96  YYPRIVATE YY_TTYPE yy_next( cur_state, c )
 97  unsigned int c ;
 98  int          cur_state ;
 99  {
100      /* Given the current state and the current input character
101       * return the next state.
102       */
103
104      YY_TTYPE *p = Yy_nxt[ cur_state ] ;
105      register int i;
106
107      if( p )                             /* there are transitions */
108      {
109          if( (i = *p++) == 0 )           /* row is uncompressed   */
110              return p[ c ];
111
112          for( ; --i >= 0 ; p += 2 )      /* row is in pairs       */
113              if( c == p[0] )
114                  return p[1];
115      }
116      return YYF;
117  }
Yy_nxt0[] in Figure 2.7 contains 11 pairs (because 11 is in the first cell). The first pair
is ['.', 3], which means that if the next input character is a dot, then the next state is
State 3. The next pair is ['0', 1]: if the lookahead character is a '0', go to State 1, and
so forth. If you get to the end of the list without finding the current lookahead character,
there is a failure transition on that character. Again, simplifying a little:
    YY_TTYPE *row;
    row = Yy_nxt[ current_state ];
    if( *row != 0 )
    {
        for( num_pairs = *row++ ; --num_pairs >= 0 ; row += 2 )
            if( row[0] == lookahead_character )
            {
                next_state = row[1];
                break;
            }
    }
The foregoing is implemented by the code on line 112 of Listing 2.16.
A third situation, unfortunately not illustrated in this example, occurs when a state
has no legal outgoing transitions. In this case, Yy_nxt[state] is set to NULL, so:
    if( Yy_nxt[ current_state ] == NULL )
        next_state = FAILURE ;
This transition is activated when the test on line 107 of Listing 2.16 fails.
Note that further compression can be achieved at the expense of error recovery by
providing a default state other than the error state. Hitherto, an error was indicated when
you got to the end of the pair list. You can use the most common nonerror transition
instead of the error transition in this case, however. For example, Yy_nxt2 contains
these pairs:
    ['0',2] ['1',2] ['2',2] ['3',2] ['4',2]
    ['5',2] ['6',2] ['7',2] ['8',2] ['9',2] ['e',5]
All but one of the transitions are to State 2, so if you use the transition to State 2 as the
default (rather than a transition to the error state), the row could be compressed as
follows:
    YYPRIVATE YY_TTYPE Yy_nxt2 [] = { 1, 2, 'e', 5 };
The first number (1) is the pair count, as before. The second number (2) is the default
next-state transition. If the next function gets through the entire list of pairs without
finding a match, it will go to this default next state. The remainder of the array is the one
pair that doesn't go to the default state. This extreme compression isn't very useful in a
lexical-analyzer application because you can't really afford to discard all the error
information, but it's sometimes useful in parser applications. The UNIX yacc utility uses a
variation on this method to compress its tables. Of course, if there are more transitions
to an explicit state than to the error state, you can put the error transitions into the pair
list and use the explicit state as the default state. This way you won't lose any
information.
Pair compression is activated in LEX with a -cN command-line switch. N is the
threshold beyond which pairs are abandoned in favor of a simple array indexed by lookahead
character. The example we've been looking at had the threshold set at 11, so any state
with more than 11 outgoing transitions is handled with a simple array, and states with 11
or fewer outgoing transitions are represented with character/next-state pairs. The default
threshold, used when no N is given on the command line, is four.
The compression ratio here tends not to be as good as with redundant-row-and-
column elimination in programming-language applications. The current example uses
247 bytes, versus 154 for the other method. The C lexical analyzer uses 3,272 bytes for
the pair-compressed tables, versus 1,514 for the default method. It does do better when
the data in the matrix is both sparse and randomly distributed, however. The -v
command-line switch to LEX causes the final table sizes to be printed, so you can judge
which method is more appropriate for a given application.
Youll note that the redundant-row-and-column elimination could be combined with
the pair-compression technique. For example, since the last two rows in the table shown
in Figure 2.7 are the same, you really need to store only one of them and keep two
pointers to it. I havent implemented this combined method.
The next part of lexyy.c is the accepting-state array, Yyaccept[], shown in
Listing 2.17. The array is indexed by state number. It evaluates to 0 if the state is not an
accepting state. Other values set the conditions under which the lexeme is accepted. It
holds 1 if the string is anchored to the start of the line (a ^ was the leftmost symbol in the
original regular expression; an extra newline will be at the far left of the lexeme in this
case). It holds 2 if the string is anchored to the end of the line (a $ was the rightmost
symbol in the original regular expression; an extra newline will be at the far right of the
lexeme in this case). It holds 3 if the lexeme is anchored both to the beginning and the
end of the line, and 4 if the lexeme is always accepted (no anchors were present in the
original regular expression).
10. The method is described in [Aho], pp. 144-146. They use separate arrays for the pairs and the default
transitions, but the rationale is the same.
Listing 2.17. lexyy.c Accepting-State Identification Array
    /*-----------------------------------------------------------------
     * The Yyaccept array has two purposes. If Yyaccept[i] is 0 then state
     * i is nonaccepting. If it's nonzero then the number determines whether
     * the string is anchored, 1=anchored at start of line, 2=at end of
     * line, 3=both, 4=line not anchored
     */

    YYPRIVATE YY_TTYPE Yyaccept[] =
    {
        0,    /* State 0 */
        4,    /* State 1 */
        4,    /* State 2 */
        0,    /* State 3 */
        4,    /* State 4 */
        0     /* State 5 */
    };
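The encoding can be read back with a few one-line tests, sketched below. The helper names are hypothetical and YY_TTYPE is assumed to be an unsigned char; the bit tests mirror the way the driver later checks the anchor value with & 1 and & 2.

```c
typedef unsigned char YY_TTYPE;   /* assumption: table entries fit in a byte */

/* Same contents as Listing 2.17. */
static const YY_TTYPE Yyaccept[] = { 0, 4, 4, 0, 4, 0 };

/* Hypothetical helpers that decode one entry of the array. */
static int is_accepting(int state)     { return Yyaccept[state] != 0;       }
static int has_start_anchor(int state) { return (Yyaccept[state] & 1) != 0; }
static int has_end_anchor(int state)   { return (Yyaccept[state] & 2) != 0; }
```

Because 3 is just 1 | 2, the "both" case needs no test of its own, and the value 4 has neither anchor bit set.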
The remainder of the lexyy.c file is the actual state-machine driver, shown in Listing
2.19. The first and last parts of this listing are the second and third parts of the
Ctrl-L-delimited template file discussed earlier. The case statements in the middle (on lines
287 to 295 of Listing 2.19) correspond to the original code attached to the regular
expressions in the input file and are generated by LeX itself.
The various global variables that communicate with the parser are declared on lines
138 to 141. Note that yyout is provided for UNIX compatibility, but you shouldn't use it
if you're using occs (because it will mess up the windows in the debugging system).
The same goes for the output() and ECHO macros on lines 147 and 148. UNIX supports
them, but they shouldn't be used in an occs environment. It's best to use the actual output
functions, or to supply similarly named replacement functions that you can use to debug
your lexical analyzer (assuming that the functions you supply will eventually be
replaced by the occs versions). These replacement functions are shown in lexio.c
(Listing 2.18). Link this file to the lexical analyzer when you're debugging a LeX output file
without a parser. Use the versions of the routines that are in l.lib, and which support the
occs debugging environment, when you're using an occs-generated parser.
The YYERROR() macro on line 151 of Listing 2.19 prints internal error messages.
There's no UNIX equivalent because lex doesn't print error messages. In the occs
environment, you should redefine YYERROR() to use yyerror() rather than
fprintf(). (Do it in a %{ %} block in the definitions section.)
The yymore() macro on line 154 of Listing 2.19 just sets a flag to true. It forces
the driver to continue processing the current lexeme, ignoring the current accepting
action. unput() and yyless() (on lines 156 and 157) are the two pushback
functions. They unterminate the current lexeme, push back any requested characters, and
then reterminate the lexeme. I've made extensive use of the comma operator here in
order to squeeze several instructions into a single macro. The comma operator just
executes the comma-delimited expressions in sequence. It evaluates to the rightmost
expression in the list, but neither of these macros takes advantage of this behavior (they'd be
declared void if they were functions). Braces can't be used here because of the binding
problems discussed earlier. Note that the conditional on line 158 sets yyleng to zero if
the ii_pushback() call fails, as it will if you try to push back too many characters;
otherwise, yyleng is just reduced by the number of pushed-back characters.
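The shape of the yyless() macro is easier to see in isolation. The sketch below is a stand-alone imitation, not the real input system: buf, len, term(), unterm(), pushback(), load(), and my_yyless() are all hypothetical stand-ins for the lexeme buffer, yyleng, and the ii_ routines.

```c
#include <string.h>

static char buf[32];      /* stand-in for the lexeme buffer               */
static int  len;          /* stand-in for yyleng                          */
static char saved;        /* character overwritten by the terminator      */

static void term(void)   { saved = buf[len]; buf[len] = '\0'; }
static void unterm(void) { buf[len] = saved; }

/* Returns nonzero on success, zero if asked to push back too much. */
static int pushback(int n) { return n <= len; }

/* Load a lexeme and terminate it, as the driver does before an action. */
static void load(const char *s)
{
    strncpy(buf, s, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    len = (int)strlen(buf);
    term();
}

/* Three expressions squeezed into one with the comma operator:
 * unterminate, adjust the length, reterminate.
 */
#define my_yyless(n) ( unterm(),                            \
                       (len -= pushback(n) ? (n) : len),    \
                       term() )
```

After load("hello"), the call my_yyless(2) leaves the buffer holding "hel" with len equal to 3; asking for more characters than the lexeme holds takes the failure arm of the conditional and sets len to zero.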
The input() function on line 162 of Listing 2.19 is complicated enough to be a
subroutine. It has to return a value, and the contortions necessary to do this with a
comma operator are not worth the effort.
Listing 2.18. lexio.c Debugging Output Routines for LeX-Generated Analyzer
    #include <stdio.h>
    #include <stdarg.h>

    /* This file contains two output routines that replace the ones in yydebug.c,
     * found in l.lib and used by occs for output. Link this file to a LeX-
     * generated lexical analyzer when an occs-generated parser is not present.
     * Then use yycomment() for messages to stdout, yyerror() for messages to
     * stderr.
     */

    extern int  yylineno;       /* Input line number, defined in lexyy.c */
    extern char *yytext;        /* Current lexeme, defined in lexyy.c    */

    PUBLIC void yycomment( char *fmt, ... )
    {
        /* Works like printf(). */

        va_list args;
        va_start( args, fmt );
        vfprintf( stdout, fmt, args );
        va_end  ( args );
    }

    PUBLIC void yyerror( char *fmt, ... )
    {
        /* Works like printf() but prints an error message along with the
         * current line number and lexeme.
         */

        va_list args;
        va_start( args, fmt );

        fprintf ( stderr, "ERROR on line %d, near <%s>\n", yylineno, yytext );
        vfprintf( stderr, fmt, args );
        va_end  ( args );
    }
The lexical analyzer itself, yylex(), starts on line 177 of Listing 2.19. The current
state is held in yystate, which is initially set to -1, a value that's not used as a state
number. The code on lines 187 to 192 is executed only once because yystate is set to
-1 only the first time the subroutine is called. The
    ii_advance();
    ii_pushback(1);
forces an initial buffer load so that ii_look() can be used later on.
The actual control flow through the program is unusual in that one branch of the
main loop exits the subroutine entirely and reenters the loop from the top. In other
words, if an action in the original input file contains a return statement, then control
passes out of the loop at that point, and passes back into the loop on the next call. A
normal path through the loop is also available when no such return is executed. The
situation is illustrated in Figure 2.8. The initializations can't be done at the top of the
loop because they're performed only on accepting a lexeme, not on every iteration of the
loop.
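The same control-flow pattern can be seen in a much smaller scanner. The sketch below is hypothetical and unrelated to the LeX tables: it recognizes runs of digits and runs of letters, and it skips everything else. The point is only that the initialization appears twice, once on entry (for calls that follow a return) and once at the bottom of the loop (for actions that don't return).

```c
#include <ctype.h>

enum { T_EOF = 0, T_NUM, T_WORD };

static const char *src          = "";  /* hypothetical input cursor       */
static const char *lexeme_start = "";  /* start of the lexeme being built */

static void set_input(const char *s) { src = lexeme_start = s; }

static int next_token(void)
{
    lexeme_start = src;          /* entry initialization: runs on every call */

    for (;;) {
        int c = (unsigned char)*src;

        if (c == '\0')
            return T_EOF;

        if (isdigit(c)) {        /* action with a return: leaves the loop */
            while (isdigit((unsigned char)*src))
                ++src;
            return T_NUM;
        }
        if (isalpha(c)) {
            while (isalpha((unsigned char)*src))
                ++src;
            return T_WORD;
        }

        ++src;                   /* action with no return: skip the character */
        lexeme_start = src;      /* so the initialization is repeated here    */
    }
}
```

With the input "12 ab", successive calls return T_NUM, T_WORD, and T_EOF; the blank is consumed inside the loop without leaving the function.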
Figure 2.8. Flow of Control Within yylex()
The initializations on lines 194 to 198 of Listing 2.19 are executed every time the
loop is entered from the top, and these initializations are duplicated on lines 303 to 315
for those situations where an accepting action does not contain a return statement.
This code unterminates the previous lexeme (ii_unterm()), sets the start marker for
the current lexeme (ii_mark_start()), and sets yylastaccept to zero to signify
that no accepting state has been seen (the start state, which is always State 0, cannot be
an accepting state). The ii_unterm() call does nothing if the string is not '\0'
terminated, as is initially the case.
Listing 2.19. lex.par State-Machine Driver Copied to lexyy.c
134  /*-----------------------------------------------------------------
135   * Global variables used by the parser
136   */
137
138  char    *yytext;              /* Pointer to lexeme.           */
139  int     yyleng;               /* Length of lexeme.            */
140  int     yylineno;             /* Input line number.           */
141  FILE    *yyout = stdout;
142
143  /*-----------------------------------------------------------------
144   * Macros that duplicate functions in UNIX lex:
145   */
146
147  #define output(c)   putc(c, yyout)
148  #define ECHO        fprintf(yyout, "%s", yytext)
149
150  #ifndef YYERROR
151  #    define YYERROR(t)  fprintf(stderr, "%s", t)
152  #endif
153
154  #define yymore()    yymoreflg = 1
155
156  #define unput(c)    (ii_unput(c), --yyleng)
157  #define yyless(n)   (   ii_unterm(),                               \
158                          (yyleng -= ii_pushback(n) ? n : yyleng),   \
159                          ii_term()                                  \
160                      )
161
162  int input()                   /* This is a macro in UNIX lex */
163  {
164      int c;
165
166      if( (c = ii_input()) && (c != -1) )
167      {
168          yytext   = ii_text();
169          yylineno = ii_lineno();
170          ++yyleng;
171      }
172      return c;
173  }
174
175  /*----------------------------------------------------------------*/
176
177  yylex()
178  {
179      int        yymoreflg;        /* Set when yymore() is executed       */
180      static int yystate = -1;     /* Current state.                      */
181      int        yylastaccept;     /* Most recently seen accept state     */
182      int        yyprev;           /* State before yylastaccept           */
183      int        yynstate;         /* Next state, given lookahead.        */
184      int        yylook;           /* Lookahead character                 */
185      int        yyanchor;         /* Anchor point for most recently seen */
186                                   /* accepting state.                    */
187      if( yystate == -1 )
188      {
189          yy_init_lex();           /* One-time initializations */
190          ii_advance();
191          ii_pushback(1);
192      }
193
194      yystate      = 0;            /* Top-of-loop initializations */
195      yylastaccept = 0;
196      yymoreflg    = 0;
197      ii_unterm();
198      ii_mark_start();
199
200      while( 1 )
201      {
202          /* Check end of file. If there's an unprocessed accepting state,
203           * yylastaccept will be nonzero. In this case, ignore EOF for now so
204           * that you can do the accepting action; otherwise, try to open another
205           * file and return if you can't.
206           */
207
208          while( 1 )
209          {
210              if( (yylook = ii_look(1)) != EOF )
211              {
212                  yynstate = yy_next( yystate, yylook );
213                  break;
214              }
215              else
216              {
217                  if( yylastaccept )          /* still something to do */
218                  {
219                      yynstate = YYF;
220                      break;
221                  }
222                  else if( yywrap() )         /* another file?  */
223                  {                           /* no             */
224                      yytext = "";
225                      yyleng = 0;
226                      return 0;
227                  }
228                  else
229                  {
230                      ii_advance();           /* load a new buffer */
231                      ii_pushback(1);
232                  }
233              }
234          }
235
236
237          if( yynstate != YYF )
238          {
239              YY_D( printf("    Transition from state %d", yystate) );
240              YY_D( printf(" to state %d on <%c>\n", yynstate, yylook) );
241
242              if( ii_advance() < 0 )          /* Buffer full */
243              {
244                  YYERROR( "Lexeme too long, discarding characters\n" );
245                  ii_flush(1);
246              }
247
248              if( yyanchor = Yyaccept[ yynstate ] )   /* saw an accept state */
249              {
250                  yyprev       = yystate;
251                  yylastaccept = yynstate;
252                  ii_mark_end();      /* Mark input at current character. */
253                                      /* A subsequent ii_to_mark()        */
254                                      /* returns us to this position.     */
255              }
256
257              yystate = yynstate;
258          }
259          else
260          {
261              if( !yylastaccept )                     /* illegal input */
262              {
263  #ifdef YYBADINP
264                  YYERROR( "Ignoring bad input\n" );
265  #endif
266                  ii_advance();       /* Skip char that caused failure. */
267              }
268              else
269              {
270                  ii_to_mark();       /* Back up to previous accept state */
271                  if( yyanchor & 2 )  /* If end anchor is active          */
272                      ii_pushback(1); /* push back the CR or LF           */
273
274                  if( yyanchor & 1 )  /* if start anchor is active        */
275                      ii_move_start();/* skip the leading newline         */
276
277                  ii_term();          /* Null-terminate the string        */
278                  yyleng   = ii_length();
279                  yytext   = ii_text();
280                  yylineno = ii_lineno();
281
282                  YY_D( printf("Accepting state %d, ", yylastaccept) );
283                  YY_D( printf("line %d: <%s>\n", yylineno, yytext) );
284
285                  switch( yylastaccept )
286                  {
287                  case 1:                     /* State 1 */
288                      return ICON;
289                      break;
290                  case 2:                     /* State 2 */
291                      return FCON;
292                      break;
293                  case 4:                     /* State 4 */
294                      return FCON;
295                      break;
296
297                  default:
298                      YYERROR( "INTERNAL ERROR, yylex\n" );
299                      break;
300                  }
301              }
302
303              ii_unterm();
304              yylastaccept = 0;
305
306              if( !yymoreflg )
307              {
308                  yystate = 0;
309                  ii_mark_start();
310              }
311              else
312              {
313                  yystate   = yyprev;         /* Back up */
314                  yymoreflg = 0;
315              }
316          }
317      }
318  }
The yymoreflg that is tested on line 306 of Listing 2.19 is set true by yymore().
If yymoreflg is false, the machine behaves normally: the next state is State 0 and the
start-of-lexeme marker is reset so that the machine can collect a new lexeme (on lines
308 and 309). If yymoreflg is true, the machine backs up one state rather than going to
State 0, and the start-of-lexeme marker isn't modified, so additional characters are added
to the end of the current lexeme. The only use of yyprev is to remember the last state
so that the machine can back up.
Generally, backing up is the correct action to take for yymore(). For example, the
naive string-processing algorithm discussed in Chapter Two looked like this:
    \"[^\"]*\"      if( yytext[yyleng-2] == '\\' )
                        yymore();
                    else
                        return STRING;
This expression creates the following machine:
[Figure: three-state machine. A " takes State 0 to State 1, State 1 loops on anything but ", and a " takes State 1 to State 2.]
The problem is that the machine is in State 2 after the close quote is recognized. In order
to continue processing the string, it needs to back up to State 1 (so that another close
quote can get it back to State 2). The backup that's initiated by yymore() can cause
problems (not very often, but, as Shiva says, to be four-armed is to be forewarned). One
of my original attempts to handle the escape-sequence-in-a-string problem looked like
this:
    \"[^\"\\]*      if( ii_lookahead(1) == '\\' )
                    {
                        input();        /* Skip the backslash              */
                        input();        /* and the character that follows. */
                        yymore();
                    }
                    else                /* it's a " */
                    {
                        input();        /* Get the close quote */
                        return STRING;
                    }
The idea was to break out of the regular expression if either a backslash or quote was
encountered, look ahead to see which of the two possible characters was there, and then
absorb the escape sequence with input() calls. The state machine looks like this:
[Figure: two-state machine. A " takes State 0 to State 1, and State 1 loops on anything but " or \.]
The problem arises when you try to handle a string like:
    "\""
The machine starts out in State 0, the quote puts it into State 1, and then the machine
terminates because there's no outgoing transition from State 1 on a backslash. The code is
now activated and the if statement tests true. The two input() calls absorb the
backslash and the second quote, and yymore() backs us up to the previous state, State
0. Now the third quote is encountered, but the machine treats it as a start-of-string
character, not as an end-of-string character. The code associated with State 1 won't be
executed until a fourth quote or another backslash is encountered. The best solution to this
problem prevents the backup by using:
    yyprev = yystate;
    yymore();
rather than a simple yymore() invocation. To see what's happening here, consider that
yyprev holds the number of the state to back up to (the previous state); yystate holds
the current state number. The assignment of the current state to the previous state means
that the machine will back up to the current state when yymore() is invoked. That is, it
won't back up at all.
Alternately, you could use the following:
    ii_unterm();
    continue;
instead of yymore(), but I think that this latter solution is more confusing. (Look at the
code to see what a continue will do here.) You don't want to break out of the loop
because the existing lexeme is discarded in this case.
Returning to Listing 2.19, the while loop on lines 208 to 234 gets the next
input character and puts it into yylook (the input is not advanced by ii_look()).
The normal situation of not being at end of file is handled on line 212, where the next
state is computed. The else clause deals with end of file. If yylastaccept is true on
line 217, then the machine hasn't executed the accepting action for the last lexeme in the
file, so end-of-file processing is delayed until the action is done. If it's false, then
yywrap() is called to open a new input file. This user-supplied subroutine can be used
to chain together several input files. The default library version just returns one, so the
analyzer wraps up at the first end of file. If you have several input files, you can replace
the default version with one that opens the next input file (and returns 0 until there are
no more files to open). So, if yywrap() returns false, it's not time to wrap up and the code
loops back up to line 208. The next state is then recomputed using the first character in
the new file (as if the EOF had not been encountered). The code loops until a nonempty
file is opened. There's some potential confusion here in that yywrap() returns true if the
program should terminate, even though a false return value would be more intuitive in
this situation. Remember, the name yywrap() stands for "go ahead and wrap up."
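A replacement yywrap() that chains files together might look something like the following sketch. Everything except the name and return convention of yywrap() is an assumption made for illustration: set_files(), Next_file, Files_left, and Input are hypothetical, and a real replacement would hand the newly opened file to the input system rather than just holding a FILE pointer.

```c
#include <stdio.h>

static const char **Next_file;    /* hypothetical: next file name to open   */
static int          Files_left;   /* hypothetical: names not yet opened     */
static FILE        *Input;        /* hypothetical: the current input stream */

/* Register the list of input files before scanning starts. */
static void set_files(const char **names, int count)
{
    Next_file  = names;
    Files_left = count;
}

/* Return 0 if another input file was opened (keep scanning),
 * 1 if there is nothing left and it's time to wrap up.
 */
int yywrap(void)
{
    if (Input) {                          /* done with the previous file */
        fclose(Input);
        Input = NULL;
    }

    while (Files_left > 0) {
        const char *name = *Next_file++;
        --Files_left;

        if ((Input = fopen(name, "r")) != NULL)
            return 0;                     /* another file: don't wrap up */

        fprintf(stderr, "Can't open %s\n", name);
    }
    return 1;                             /* no more files: wrap up */
}
```

Files that can't be opened are reported and skipped, so the function returns 1 only when the list is exhausted.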
When the loop terminates, yylook holds the current lookahead character, and the
potential next state is figured on line 212. The machine has not changed state yet
because yystate hasn't been modified. The machine is looking ahead at what the next
state is going to be, given the current lookahead character.
If the next state is not a failure transition, the input is advanced (on line 242) and the
machine looks to see if the new state is an accepting state (on line 248). If so, the
accepting state is remembered in yylastaccept, and the state preceding the accepting
state is also remembered for yymore() processing. Finally, the driver switches to a
new state by modifying yystate on line 257.
The else clause that starts on line 259 handles failure transitions. In this case you
want to perform the accepting action associated with the most recently seen accepting
state (which you just remembered in yylastaccept). If yylastaccept is zero, then
no such accepting state was encountered and you're looking at a bad lexeme (one that is
not described by any regular expression in the input file). An error message is printed if
YYBADINP is defined, and ii_advance() is called to skip the offending character.
If an accepting state had been encountered, the input is restored to the condition it
was in at that time by the ii_to_mark() call on line 270. The test on line 271 checks
for an end-of-line anchor, in which case the newline (which is part of the lexeme at this
point) must be pushed back into the input (in case it is needed to match a start-of-line
anchor in the next lexeme). The lexeme is terminated, and the global variables that
communicate with the parser are initialized on lines 277 to 280. The if clause on line 274
removes a newline that's at the start of the lexeme as the result of a beginning-of-line
anchor.
11. I've, perhaps wrongly, perpetuated the problem in order to keep UNIX compatibility.
The switch on line 285 contains all the accepting actions that were part of the
original input file. The case statements are all generated by LeX itself; the case values are
the state numbers of the associated accepting state. The default case on line 297
should never be executed. It's here as insurance, in case an unknown bug in LeX
generates a bad state number in the switch.
2.5 LeX: A Lexical-Analyzer Generator*
The remainder of this chapter presents the complete source code for LeX, along with
the underlying theory. You must read Appendix A if you intend to look at the
implementation details. The set routines presented there are used heavily in this chapter, and a
familiarity with the calling conventions for these routines will be useful.
2.5.1 Thompson's Construction: From a Regular Expression to an NFA*
LeX constructs NFAs from regular expressions using a system called Thompson's
Construction, developed by Ken Thompson at Bell Labs for the QED editor. It works as
follows:
The simplest possible regular expression is a single character, and this expression
can be represented by a correspondingly simple NFA. For example, a machine that
matches an a is shown below:
[Figure: two states joined by an edge labeled a.]
The concatenation of two regular expressions is also straightforward. The following
machine represents the expression ab by constructing individual machines for each
subexpression (the a and b), and then connecting the machines with an ε edge:
[Figure: the machines for a and b joined by an ε edge.]
This method needlessly wastes states, however. A better solution merges the ending
state of the first machine with the start state of the second one, like this:
[Figure: three states joined by an edge labeled a followed by an edge labeled b.]
There are two situations in a LeX application where an OR of two regular expressions is
required. The first is the input specification itself. That is, the LeX input contains many
regular expressions, but a single machine must be output that recognizes all of these
expressions. This means that all the input expressions are effectively ORed together to
create the output DFA. LeX does this high-level OR using the system shown in Figure
2.9. Each of the boxes is an NFA that represents an entire regular expression, and all of
these are connected together using several dummy states and edges.
The second OR situation is the OR operator (the vertical bar) which can appear in the
regular expression itself (as in a|b). LeX processes the OR operator by constructing the
machine shown in Figure 2.10. Again, the seemingly empty boxes in the pictures
represent machines for entire subexpressions. Figure 2.11 shows how the expression
((a|b)|cd) would be represented. LeX starts out by making two machines to recognize the
a and b, and connects the two using the OR construction shown in Figure 2.10. LeX then
creates two more machines to recognize the c and d, concatenating them together by
merging the end state of the first machine with the start state of the second. Finally, it
processes the second OR operator, applying the same construction that it used earlier,
but this time using the machines representing the more-complicated subexpressions (a|b)
and cd in place of the boxes in Figure 2.10.
Figure 2.9. Connecting the Regular Expressions in a LeX Input File
Figure 2.10. Generic NFA for the OR operator
The machines to recognize the three closure operators are a little more complicated
looking. They are shown in Figure 2.12. Note that the machines that recognize + and ?
are special cases of the machine for the * operator.
Figure 2.13 shows the evolution of a machine that recognizes a subset of the
floating-point constants discussed earlier. The expression used is (D*\.D|D\.D*). It
recognizes numbers with one digit to the right of the point and zero or more digits
preceding it, and it also recognizes the inverse: one digit to the left of the decimal point
and zero or more digits following the decimal point. LeX starts out constructing a
machine for the first D in the expression [in Figure 2.13(a)]. It then reads the leftmost
* and substitutes the first machine into the closure machine, yielding Figure 2.13(b). It
then reads the dot, and tacks it on to the right edge of the partially constructed machine
[Figure 2.13(c)], and it does the same for the next D [Figure 2.13(d)]. Encountering the |
operator, LeX holds onto the previously constructed expression for a moment, and then
constructs a second machine for the next subexpression (D\.D*), not shown in the
figure. Finally, it connects the two machines for the subexpressions together, using the
OR construction [Figure 2.13(e)].
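The constructions described so far can be sketched in a few dozen lines of C. This is an illustration under simplifying assumptions, not the LeX source: the NFA structure is cut down to three fields, states come from a small fixed pool, and the names MACHINE, Pool, new_state(), mk_char(), mk_cat(), mk_or(), mk_star(), closure(), and nfa_match() are all hypothetical. The matcher at the end simulates the NFA directly (by tracking the set of active states) so that the constructions can be checked.

```c
#include <stdlib.h>
#include <string.h>

#define EPSILON     (-1)          /* edge is an epsilon transition   */
#define EMPTY       (-3)          /* state has no outgoing edges     */
#define MAX_STATES  64

typedef struct nfa {
    int         edge;             /* character, EPSILON, or EMPTY    */
    struct nfa *next;             /* next state (or NULL)            */
    struct nfa *next2;            /* second epsilon edge (or NULL)   */
} NFA;

typedef struct { NFA *start, *end; } MACHINE;

static NFA Pool[MAX_STATES];
static int Nstates;

static NFA *new_state(void)
{
    NFA *s;
    if (Nstates >= MAX_STATES)
        abort();                  /* sketch only: no graceful recovery */
    s = &Pool[Nstates++];
    s->edge = EMPTY;
    s->next = s->next2 = NULL;
    return s;
}

static MACHINE mk_char(int c)                 /* single character */
{
    MACHINE m;
    m.start = new_state();
    m.end   = new_state();
    m.start->edge = c;
    m.start->next = m.end;
    return m;
}

static MACHINE mk_cat(MACHINE a, MACHINE b)   /* concatenation */
{
    MACHINE m;
    *a.end  = *b.start;           /* merge a's end state with b's start */
    m.start = a.start;
    m.end   = b.end;
    return m;
}

static MACHINE mk_or(MACHINE a, MACHINE b)    /* a|b */
{
    MACHINE m;
    m.start = new_state();
    m.end   = new_state();
    m.start->edge = EPSILON; m.start->next = a.start; m.start->next2 = b.start;
    a.end->edge   = EPSILON; a.end->next   = m.end;
    b.end->edge   = EPSILON; b.end->next   = m.end;
    return m;
}

static MACHINE mk_star(MACHINE a)             /* a* */
{
    MACHINE m;
    m.start = new_state();
    m.end   = new_state();
    m.start->edge = EPSILON; m.start->next = a.start; m.start->next2 = m.end;
    a.end->edge   = EPSILON; a.end->next   = a.start; a.end->next2   = m.end;
    return m;
}

/* Add s, and everything reachable from it on epsilon edges, to set. */
static void closure(const NFA *s, char *set)
{
    int i = (int)(s - Pool);
    if (set[i])
        return;
    set[i] = 1;
    if (s->edge == EPSILON) {
        if (s->next)  closure(s->next,  set);
        if (s->next2) closure(s->next2, set);
    }
}

/* Return 1 if the machine accepts the whole string, 0 otherwise. */
static int nfa_match(MACHINE m, const char *str)
{
    char cur[MAX_STATES] = { 0 }, nxt[MAX_STATES];
    int  i;

    closure(m.start, cur);
    for (; *str; ++str) {
        memset(nxt, 0, sizeof nxt);
        for (i = 0; i < Nstates; ++i)
            if (cur[i] && Pool[i].edge == (unsigned char)*str)
                closure(Pool[i].next, nxt);
        memcpy(cur, nxt, sizeof cur);
    }
    return cur[m.end - Pool];
}
```

The expression ((a|b)|cd) is then built with mk_or(mk_or(mk_char('a'), mk_char('b')), mk_cat(mk_char('c'), mk_char('d'))). Note that mk_cat() leaves the second machine's old start state unused in the pool; that is the price of merging by copying in this sketch.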
Figure 2.11. An NFA That Recognizes ((a|b)|cd)
Figure 2.12. Representing the Closure Operators
2.5.2 Implementing Thompson's Construction
This section presents a low-level implementation of the theory in the previous
section. Skip forward to Section 2.5.3 if you're not interested in this level of detail.
2.5.2.1 Data Structures. A machine built with Thompson's construction has several
useful characteristics:
• All the machines that recognize subexpressions, no matter how complicated, have
  a single start state and a single end state.
• No state has more than two outgoing edges.
• There are only three possibilities for the labels on the edges: (1) there is only one
  outgoing edge labeled with a single input character, (2) there is only one outgoing
  edge labeled with ε, and (3) there are two outgoing edges labeled with ε. There are
  never two outgoing edges labeled with input characters, and there are never two
  edges, one of which is labeled with an input character and the other of which is
  labeled with ε.
Figure 2.13. Constructing an NFA for (D*\.D|D\.D*)
The NFA data structure in Listing 2.20 uses these characteristics to implement an
NFA state. The next field either points at the next state, or is set to NULL if there are no
outgoing edges. The next2 field is used only for states with two outgoing ε edges. It is
set to NULL if there's only one such transition. The edge field holds one of four values
that determine what the label looks like:
• If there is a single outgoing edge labeled with an input character, edge holds that
  character.
• If the state has an outgoing ε edge, then edge is set to EPSILON.
• If a transition is made on a character class, all characters in the class are elements of
  the SET pointed to by bitset, and edge holds the value CCL. The set elements are
  just the ASCII values of the characters. For example, if an ASCII '0', which has the
  value 48 (decimal), is in the character class, the number 48 will be in the set.
• If the state has no outgoing transitions, edge is set to EMPTY.
Listing 2.20. nfa.h Data Structures and Macros
 1  /*-----------------------------------------------------------------
 2   * Nfa state:
 3   */
 4
 5  typedef struct nfa
 6  {
 7      int         edge;       /* Label for edge: character, CCL, EMPTY, or */
 8                              /* EPSILON.                                  */
 9      SET         *bitset;    /* Set to store character classes.           */
10      struct nfa  *next;      /* Next state (or NULL if none)              */
11      struct nfa  *next2;     /* Another next state if edge==EPSILON       */
12                              /* NULL if this state isn't used             */
13      char        *accept;    /* NULL if not an accepting state, else      */
14                              /* a pointer to the action string            */
15      int         anchor;     /* Says whether pattern is anchored and, if  */
16                              /* so, where (uses #defines below).          */
17  } NFA;
18
19  #define EPSILON  -1         /* Non-character values of NFA.edge          */
20  #define CCL      -2
21  #define EMPTY    -3
22
23                              /* Values of the anchor field:               */
24  #define NONE     0          /* Not anchored                              */
25  #define START    1          /* Anchored at start of line                 */
26  #define END      2          /* Anchored at end of line                   */
27  #define BOTH     (START | END)   /* Anchored in both places              */
The accept field in the NFA structure is NULL for nonaccepting states; otherwise, it
points at a string holding the action part of the original input rule (the code to be
executed when that state is accepted). The string itself is of variable length. The first int's
worth of characters hold the line number and the remainder of the array holds the string.
For example, in a machine with a 16-bit int, the first two bytes of the string are the
line number (in binary) and the remainder of the string is the actual input text. The
pointer to the actual string is stored:
    | input line number | accepting string ...
    |<-- sizeof(int) -->|
                        ^
                        saved pointer (accept)
If p is a pointer to an NFA structure, the actual string is accessed with p->accept, and
the line number can be accessed by casting accept into a pointer to int and backing up
one notch, like this:
    ((int *)(p->accept))[-1]
Because of various alignment problems, some care has to be taken to integrate the line
number into the string in this fashion; the mechanics are discussed shortly. The
technique is useful, not only in the current application, but also for doing things like
attaching a precomputed string length to a string. The string can still be handled in the normal
way, but the count is there when you need it.
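A minimal sketch of the idea follows. It is not the mechanism used in nfa.c: here each string gets its own malloc() block (which sidesteps the alignment problem, because malloc() returns memory that is suitably aligned for an int), and the names save_action(), action_lineno(), and free_action() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: store lineno in the int that immediately precedes a
 * copy of text, and return a pointer to the copy (not to the block).
 */
static char *save_action(const char *text, int lineno)
{
    size_t len   = strlen(text) + 1;
    int   *block = malloc(sizeof(int) + len);  /* aligned for int by malloc */

    if (!block)
        return NULL;

    block[0] = lineno;                 /* line number goes first...     */
    memcpy(block + 1, text, len);      /* ...and the string follows it  */
    return (char *)(block + 1);
}

/* Back up one int from the string to recover the line number. */
static int action_lineno(const char *accept)
{
    return ((const int *)accept)[-1];
}

/* The block starts one int before the string, so free it from there. */
static void free_action(char *accept)
{
    free((int *)accept - 1);
}
```

The returned pointer can be passed to any ordinary string function; only the code that knows about the hidden int ever looks in front of it.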
The macros on lines 24 to 27 of Listing 2.20 are possible values of the anchor field.
They describe whether a ^, $, or both were present in the regular expression.
The remainder of nfa.h, in Listing 2.21, holds other definitions needed to make the
machine. NFA_MAX is the maximum number of NFA states that can be in the machine,
and STR_MAX is the total space available for all the accepting actions combined. The
rest of the file is just prototypes for the externally accessible functions discussed below.
Listing 2.21. nfa.h Other Definitions and Prototypes
    #define NFA_MAX  768            /* Maximum number of NFA states in a     */
                                    /* single machine. NFA_MAX * sizeof(NFA) */
                                    /* can't exceed 64K.                     */
    #define STR_MAX  (10 * 1024)    /* Total space that can be used by the   */
                                    /* accept strings.                       */

    void  new_macro ( char *definition );       /* these three are in nfa.c */
    void  printmacs ( void );
    NFA   *thompson ( char *(*input_funct)(), int *max_state, NFA **start_state );
    void  print_nfa ( NFA *nfa, int len, NFA *start );    /* in printnfa.c */
The other file you need to look at before starting is globals.h, in Listing 2.22. This
file holds definitions for all the true global variables in the program (globals that are
shared between modules). All other globals are declared static, so their scope is limited
to the file in which they are declared. For maintenance reasons, it's desirable that
both the definition and declaration of a variable be in a single file. That way you don't
have to worry about maintaining two files, one in which space is allocated and the other
containing extern statements describing the variables. The problem is solved in
globals.h using the CLASS and I(x) macros defined on lines three to seven of Listing
2.22. The following two lines are found in only one file [typically in the same file that
contains main()]:
    #define ALLOC
    #include "globals.h"
All other files include globals.h without the previous ALLOC definition. When ALLOC
exists, CLASS evaluates to an empty string and I(x) evaluates to its argument. So, the
input line:
    CLASS char *Template I( = "lex.par" );
expands to
    char *Template = "lex.par";
If ALLOC doesn't exist, then CLASS expands to extern and I(x) expands to an empty
string. The earlier input line expands to:
    extern char *Template;
The variables on lines 11 to 15 of globals.h are set by command-line switches; the ones
on lines 16 to 22 are used by the input routines to communicate with one another.
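The effect of the two macros can be tried out in a single file. The following is only a sketch: the header contents are pasted inline rather than included, ALLOC is defined at the top as it would be in the file that contains main(), and the driver_template() helper is hypothetical, added just so that the variable is used.

```c
#define ALLOC                 /* define this in exactly one source file */

/* --- what globals.h would contain --- */
#ifdef ALLOC
#   define CLASS              /* definitions: no storage-class keyword  */
#   define I(x) x             /* keep the initializer                   */
#else
#   define CLASS extern       /* declarations only                      */
#   define I(x)               /* discard the initializer                */
#endif

CLASS int   Verbose   I( = 0 );          /* becomes: int Verbose = 0;            */
CLASS char *Template  I( = "lex.par" );  /* becomes: char *Template = "lex.par"; */

#undef CLASS
#undef I
/* --- end of header contents --- */

/* Hypothetical helper: reports the template file name in use. */
const char *driver_template(void)
{
    return Template;
}
```

Removing the #define ALLOC line turns both lines into extern declarations, so the file would then compile but fail to link unless some other file supplied the definitions.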
Listing 2.22. globals.h: Global-Variable Definitions
 1  /* GLOBALS.H: Global variables shared between modules */
 2  #ifdef ALLOC
 3  #    define CLASS
 4  #    define I(x) x
 5  #else
 6  #    define CLASS extern
 7  #    define I(x)
 8  #endif
 9  #define MAXINP 2048                         /* Maximum rule size             */
10
11  CLASS int  Verbose         I( = 0 );        /* Print statistics              */
12  CLASS int  No_lines        I( = 0 );        /* Suppress #line directives     */
13  CLASS int  Unix            I( = 0 );        /* Use UNIX-style newlines       */
14  CLASS int  Public          I( = 0 );        /* Make static symbols public    */
15  CLASS char *Template       I( = "lex.par" ); /* State-machine driver template */
16  CLASS int  Actual_lineno   I( = 1 );        /* Current input line number     */
17  CLASS int  Lineno          I( = 1 );        /* Line number of first line of  */
18                                              /* a multiple-line rule.         */
19  CLASS char Input_buf[MAXINP];               /* Line buffer for input         */
20  CLASS char *Input_file_name;                /* Input file name (for #line)   */
21  CLASS FILE *Ifile;                          /* Input stream.                 */
22  CLASS FILE *Ofile;                          /* Output stream.                */
23  #undef CLASS
24  #undef I
2.5.2.2 A Regular-Expression Grammar. The code in nfa.c, which starts in Listing
2.23, reads a regular expression and converts it to an NFA using Thompson's construction.
The file is really a small compiler, comprising a lexical analyzer, parser, and
code generator (though in this case, the generated code is a state-machine description,
not assembly language). The grammar used to recognize a LeX input specification is
summarized in Table 2.5. This is an informal grammar: it describes the input syntax in
a general sort of way. Clarity is more important here than strict accuracy. I'll fudge a
bit in the implementation in order to get the grammar to work. Precedence and associativity
are built into the grammar (the mechanics are described in depth in the next
chapter). Concatenation is higher precedence than |; closure is higher precedence still;
everything associates left to right. The various left-recursive productions have not yet
been translated into an acceptable form, as was discussed in Chapter One; I'll do that as
I implement them.

Table 2.5. A Grammar for LeX
Productions                           Notes
machine     -> rule machine           A list of rules.
             | rule END_OF_INPUT
rule        -> expr EOS action        A single regular expression followed by an accepting action.
             | ^ expr EOS action      Expression anchored to start of line.
             | expr $ EOS action      Expression anchored to end of line.
action      -> white_space string     An optional accepting action.
             | white_space
             | ε
expr        -> expr | cat_expr        A list of expressions delimited by vertical bars.
             | cat_expr
cat_expr    -> cat_expr factor        A list of concatenated expressions.
             | factor
factor      -> term *                 A subexpression followed by a *.
             | term +                 A subexpression followed by a +.
             | term ?                 A subexpression followed by a ?.
             | term
term        -> [ string ]             A character class.
             | [ ^ string ]           A negative character class.
             | [ ]                    (nonstandard) Matches white space.
             | [ ^ ]                  (nonstandard) Matches everything but white space.
             | .                      Matches any character except newline.
             | character              A single character.
             | ( expr )               A parenthesized subexpression.
white_space -> one or more tabs or spaces
character   -> any single ASCII character except white space
string      -> one or more ASCII characters

2.5.2.3 File Header. The header portion of nfa.c is in Listing 2.23. The ENTER and
LEAVE macros on lines 21 to 28 are for debugging. They expand to empty strings when
DEBUG is not defined. When debugging, they print the current subroutine name (which is
passed in as an argument), the current lexeme, and what's left of the current input line.
An ENTER invocation is placed at the top of every subroutine of interest, and a LEAVE
macro is put at the bottom. The text is indented by an amount proportional to the
subroutine-nesting level: Lev is incremented by every ENTER invocation and decremented
by every LEAVE. Lev×4 spaces are printed to the left of every string using
printf()'s * field-width capability. To simplify, the following printf() statement
outputs Lev spaces by printing an empty string in a field whose width is controlled by
Lev.
    printf( "%*s", Lev, "" );
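The indenting trick can be isolated in a few lines. This is a minimal sketch, not the macros from nfa.c: the trace_enter() and trace_leave() functions are hypothetical stand-ins for ENTER and LEAVE, and they format into a caller-supplied buffer with snprintf() so that the result can be inspected.

```c
#include <stdio.h>

static int Lev = 0;                 /* current subroutine-nesting level */

/* Format "enter <name>", indented by four spaces per nesting level.
 * The empty string is printed in a field Lev*4 characters wide, which
 * produces the indent. Lev is incremented after it is used.
 */
int trace_enter(char *buf, size_t size, const char *name)
{
    return snprintf(buf, size, "%*senter %s", Lev++ * 4, "", name);
}

/* Format "leave <name>" at the same indent as the matching enter.
 * Lev is decremented before it is used.
 */
int trace_leave(char *buf, size_t size, const char *name)
{
    return snprintf(buf, size, "%*sleave %s", --Lev * 4, "", name);
}
```

A field width of zero prints nothing at all for the empty string, so the outermost level starts in the first column.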
2.5.2.4 Error-Message Processing. The next part of nfa.c is the error-message
routines in Listing 2.24. I've borrowed the method used by the C buffered I/O system:
possible error codes are defined in the enumerated type on lines 35 to 51, and a global
variable is set to one of these values when an error occurs. The Errmsgs array on lines
53 to 68 is indexed by error code and evaluates to an appropriate error message. Finally,
the parse_err() subroutine on line 70 is passed an error code and prints an appropriate
message. The while loop on line 76 tries to highlight the point at which the error
occurred by printing an up arrow. The up arrow will (hopefully) be close to the point of
error. parse_err() does not return.
2.5.2.5 Memory Management. Listing 2.25 contains the memory-management routines
that allocate and free the NFA structures used for the states. Two routines are used
for this purpose: new(), on line 105, allocates a new node and discard(), on line
131, frees the node. I'm not using malloc() and free() because they're too slow;
rather, a large array (pointed to by Nfa_states) is allocated the first time new() is
called (on line 112; the entire if statement is executed only once, during the first call).
A simple stack strategy is used for memory management: discard() pushes a
pointer to the discarded node onto a stack, and new() uses a node from the stack if one
Listing 2.23. nfa.c: File Header
 1  /* NFA.C -- Make an NFA from a LeX input file using Thompson's construction */
 2
 3  #include <stdio.h>
 4  #ifdef MSDOS
 5  #    include <stdlib.h>
 6  #else
 7  #    include <malloc.h>
 8  #endif
 9
10  #include <ctype.h>
11  #include <string.h>
12  #include <tools/debug.h>
13  #include <tools/set.h>
14  #include <tools/hash.h>
15  #include <tools/compiler.h>
16  #include <tools/stack.h>
17  #include "nfa.h"                /* defines for NFA, EPSILON, CCL */
18  #include "globals.h"            /* externs for Verbose, etc.     */
19
20  #ifdef DEBUG
21       int Lev = 0;
22  #    define ENTER(f) printf("%*senter %s [%c][%1.10s] \n",      \
23                                        Lev++ * 4, "", f, Lexeme, Input)
24  #    define LEAVE(f) printf("%*sleave %s [%c][%1.10s] \n",      \
25                                        --Lev * 4, "", f, Lexeme, Input)
26  #else
27  #    define ENTER(f)
28  #    define LEAVE(f)
29  #endif
is available, otherwise it gets a new node from the Nfa_states[] array (on line 124).
new() prints an error message and terminates the program if it can't get the node. The
new node is initialized with NULL pointers [the memory is actually cleared in
discard() with the memset() call on line 136] and the edge field is set to EPSILON on
line 125. The stack pointer (Sp) is initialized at run time on line 116 because of a bug in
the Microsoft C compact model that's discussed in Appendix A.
There's an added advantage to the memory-management strategy used here. It's convenient
when constructing the NFA to create a physical machine with one node per state
and actual pointers to the next state. Later on, it will be convenient to have the NFA
represented as an array because you can use the array index as the state number. The
stack gives you both representations in a single data structure. The only disadvantage is
that any nodes that are still on the stack when the NFA construction is complete will be
holes in the array. It turns out that there is at most one hole, but there's no way to know
in advance where it's going to be.
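The array-plus-stack strategy can be sketched independently of the NFA code. Everything in the following block is an assumption made for illustration: the NODE type, the tiny NODE_MAX and SSIZE limits, and the use of an integer count for the free stack (Listing 2.25 uses a stack pointer instead). It shows only the idea: recycled nodes come off the stack, fresh ones come from the array, and a node's array index serves as its number.

```c
#include <stddef.h>
#include <string.h>

#define NODE_MAX 8                   /* hypothetical pool size          */
#define SSIZE    4                   /* hypothetical free-stack depth   */

typedef struct node
{
    struct node *next;               /* graph edge to the next node     */
    int          edge;               /* label; -1 marks a fresh node    */
} NODE;

static NODE  Nodes[NODE_MAX];        /* the pool: both array and graph  */
static int   Next_alloc = 0;         /* first never-used array element  */
static NODE *Free_stack[SSIZE];      /* discarded nodes awaiting reuse  */
static int   Nfree = 0;              /* number of nodes on Free_stack   */

/* Get a node: reuse a discarded one if possible, else take the next
 * unused array element. Returns NULL when the pool is exhausted.
 */
NODE *node_new(void)
{
    NODE *p;

    if (Nfree > 0)
        p = Free_stack[--Nfree];
    else if (Next_alloc < NODE_MAX)
        p = &Nodes[Next_alloc++];
    else
        return NULL;

    p->edge = -1;
    return p;
}

/* Clear a node and push it onto the free stack.
 * Returns 0 (and does nothing) if the stack is full.
 */
int node_discard(NODE *p)
{
    if (Nfree >= SSIZE)
        return 0;

    memset(p, 0, sizeof *p);
    Free_stack[Nfree++] = p;
    return 1;
}

/* The array index doubles as the node (state) number. */
int node_index(const NODE *p)
{
    return (int)(p - Nodes);
}
```

Any node left on the free stack when construction finishes is one of the holes in the array mentioned above.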
The other memory-management function in nfa.c is the string-management function,
save(), also in Listing 2.25. This function is passed a pointer to a string, and returns a
pointer to a copy (in static memory) of that string. The copy is preceded in memory
by an int-sized line number, as was discussed earlier on page 85. The array pointer
(Strings, on line 100) is declared as a pointer to int for portability reasons. Many
machines require ints to be aligned at more restrictive addresses than chars. For
example, an int might have to be at an address that is an even multiple of four, but a
char could be at any address. A run-time error would happen if you tried to put an int
into an illegal address (one that was not an even multiple of four). Making Strings an
Listing 2.24. nfa.c: Error-Processing Routines
37      E_MEM,
38      E_BADEXPR,
39      E_PAREN,
40      E_STACK,
41      E_LENGTH,
42      E_BRACKET,
43      E_BOL,
44      E_CLOSE,
45      E_STRINGS,
46      E_NEWLINE,
47      E_BADMAC,
48      E_NOMAC,
49      E_MACDEPTH
50
51  } ERR_NUM;
int pointer takes care of the alignment problem at the cost of a little wasted space: the
size of the region used to store the string itself must be rounded up to an even multiple of
the int size. I'm assuming that a char will have less restrictive alignment rules than an
int, a pretty safe assumption.
A single large array that will hold all the strings is allocated the first time save() is
called on line 155 of Listing 2.25, thereby avoiding multiple inefficient malloc() calls
every time a new string is required. The line number is put into the string on line 164,
the pointer is incremented past the number, and the string itself is copied by the loop on
line 166. A char pointer (textp) is used for the purpose of copying the string component.
The test on line 161 is for lines starting with a vertical bar, which say that the
action for the next line should be used for the current rule. No strings are copied in this
situation; the same pointer is returned here as will be returned by the next save() call.
That is, when the input string is a "|", several pointers to the same accepting string are
returned by consecutive save() calls. A pointer to the string itself (as compared to the
line number) is returned on line 182 in order to facilitate debugging; this way you can
examine the string directly without having to skip past the line number. The line number
can be accessed later on, using an expression like the following:
    char *str = save( "string" );
    line_number = ((int *)str)[-1];
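The rounding arithmetic is easier to follow in a stand-alone version. The sketch below is an illustration under stated assumptions, not the save() from Listing 2.25: the pool is a static int array of hypothetical size POOL_INTS, the line number is passed in as an argument rather than read from Lineno, the "|" case is left out, and the function returns NULL instead of calling parse_err() when the pool is full.

```c
#include <stddef.h>
#include <string.h>

#define POOL_INTS 1024               /* hypothetical pool size, in ints */

static int  Pool[POOL_INTS];         /* int array, so every cell is     */
static int *Next = Pool;             /* correctly aligned for an int    */

/* Copy str into the pool, preceded by line_no. Returns a pointer to
 * the copied text (not to the line number), or NULL if there's no room.
 */
char *save_with_line(const char *str, int line_no)
{
    size_t len   = strlen(str) + 1;                  /* include the '\0'  */
    size_t cells = len / sizeof(int)                 /* whole ints, plus  */
                 + (len % sizeof(int) != 0);         /* one if a partial  */
                                                     /* int is left over  */
    char  *text;

    if ((size_t)(Pool + POOL_INTS - Next) < cells + 1)
        return NULL;

    *Next++ = line_no;               /* header goes first                 */
    text    = (char *)Next;          /* text starts right after it        */
    memcpy(text, str, len);
    Next   += cells;                 /* stay on an int boundary           */
    return text;
}

/* Recover the line number stored just in front of a saved string. */
int line_of(const char *saved)
{
    return ((const int *)saved)[-1];
}
```

Because the pool pointer only ever moves in whole ints, every header lands on a legal int address no matter how long the preceding string was.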
Listing 2.25. nfa.c: Memory Management, States and Strings
 82  PRIVATE NFA  *Nfa_states;           /* State-machine array                */
 83  PRIVATE int  Nstates = 0;           /* # of NFA states in machine         */
 84  PRIVATE int  Next_alloc;            /* Index of next element of the array */
 85
 86  #define SSIZE 32
 87
 88  PRIVATE NFA  *Sstack[ SSIZE ];      /* Stack used by new()                */
 89  PRIVATE NFA  **Sp = &Sstack[ -1 ];  /* Stack pointer                      */
 90
 91  #define STACK_OK()     ( INBOUNDS(Sstack, Sp) )  /* true if stack not     */
 92                                                   /* full or empty         */
 93  #define STACK_USED()   ( (Sp-Sstack) + 1 )       /* slots used            */
 94  #define CLEAR_STACK()  ( Sp = Sstack - 1 )       /* reset the stack       */
 95  #define PUSH(x)        ( *++Sp = (x) )           /* put x on stack        */
 96  #define POP()          ( *Sp-- )                 /* get x from stack      */
 97
 98  /*--------------------------------------------------------------------*/
 99
100  PRIVATE int  *Strings;              /* Place to save accepting strings    */
101  PRIVATE int  *Savep;                /* Current position in Strings array  */
102
103  /*--------------------------------------------------------------------*/
104
105  PRIVATE NFA *new()                  /* NFA management function */
106  {
107      NFA *p;
108      static int first_time = 1;
109
110      if( first_time )
111      {
112          if( !( Nfa_states = (NFA *) calloc(NFA_MAX, sizeof(NFA)) ))
113              parse_err( E_MEM );
114
115          first_time = 0;
116          Sp = &Sstack[ -1 ];
117      }
118
119      if( ++Nstates >= NFA_MAX )
120          parse_err( E_LENGTH );
121
122      /* If the stack is not ok, it's empty */
123
124      p = !STACK_OK() ? &Nfa_states[Next_alloc++] : POP();
125      p->edge = EPSILON;
126      return p;
127  }
128
129  /*--------------------------------------------------------------------*/
130
131  PRIVATE void discard( nfa_to_discard )
132  NFA *nfa_to_discard;
133  {
134      --Nstates;
135
136      memset( nfa_to_discard, 0, sizeof(NFA) );
137      nfa_to_discard->edge = EMPTY ;
138      PUSH( nfa_to_discard );
139
140      if( !STACK_OK() )
141          parse_err( E_STACK );
142  }
143
144  /*--------------------------------------------------------------------*/
145
146  PRIVATE char *save( str )           /* String-management function. */
147  char *str;
148  {
149      char *textp, *startp;
150      int  len;
151      static int first_time = 1;
152
153      if( first_time )
154      {
155          if( !(Savep = Strings = (int *) malloc( STR_MAX )) )
156              parse_err( E_MEM );
157
158          first_time = 0;
159      }
160
161      if( *str == '|' )
162          return (char *)( Savep + 1 );
163
164      *Savep++ = Lineno;
165
166      for( textp = (char *)Savep ; *str ; *textp++ = *str++ )
167          if( textp > (char *)(Strings + (STR_MAX-1)) )
168              parse_err( E_STRINGS );
169
170      *textp++ = '\0' ;
171
172      /* Increment past the text. "len" is initialized to the string length.
173       * The "len/sizeof(int)" truncates the size down to an even multiple of the
174       * current int size. The "+(len % sizeof(int) != 0)" adds 1 to the truncated
175       * size if the string length isn't an even multiple of the int size (the !=
176       * operator evaluates to 1 or 0). Return a pointer to the string itself.
177       */
178
179      startp = (char *) Savep;
180      len    = textp - startp;
181      Savep += (len / sizeof(int)) + (len % sizeof(int) != 0);
182      return startp;
183  }
2.5.2.6 Macro Support. The next code segment (in Listing 2.26) comprises the
macro-support routines. new_macro() (on line 202) is passed a pointer to a line that
contains a macro definition and it files the macro in a small symbol table. The
expand_macro(char **namep) routine on line 264 is passed a pointer to a character
pointer, which in turn points at a macro invocation. The routine advances *namep
past the invocation, and returns a string holding the macro's contents. The
printmacs() subroutine on line 300 of Listing 2.26 prints the macros. The various hash
routines and the HASH_TAB structure that are used here are discussed in Appendix A.
Listing 2.26. nfa.c: Macro Support
184  /*
185   * MACRO support:
186   */
187
188  #define MAC_NAME_MAX  34        /* Maximum name length               */
189  #define MAC_TEXT_MAX  80        /* Maximum amount of expansion text  */
190
191  typedef struct
192  {
193      char name[ MAC_NAME_MAX ];
194      char text[ MAC_TEXT_MAX ];
195
196  } MACRO;
197
198  PRIVATE HASH_TAB *Macros;       /* Symbol table for macro definitions */
199
200  /*--------------------------------------------------------------------*/
201
202  PUBLIC void new_macro( def )
203  char *def;
204  {
205      /* Add a new macro to the table. If two macros have the same name, the
206       * second one takes precedence. A definition takes the form:
207       *         name <whitespace> text [<whitespace>]
208       * whitespace at the end of the line is ignored.
209       */
210
211      unsigned hash_add();
212
213      char  *name;                /* Name component of macro definition */
214      char  *text;                /* text part of macro definition      */
215      char  *edef;                /* pointer to end of text part        */
216      MACRO *p;
217      static int first_time = 1;
218
219      if( first_time )
220      {
221          first_time = 0;
222          Macros = maketab( 31, hash_add, strcmp );
223      }
224
225      for( name = def; *def && !isspace(*def) ; def++ )    /* Isolate name */
226          ;
227      if( *def )
228          *def++ = '\0' ;
229
230      /* Isolate the definition text. This process is complicated because you need
231       * to discard any trailing whitespace on the line. The first while loop
232       * skips the preceding whitespace. The for loop is looking for end of
233       * string. If you find a white character (and the \n at the end of string
234       * is white), remember the position as a potential end of string.
235       */
236
237      while( isspace( *def ) )            /* skip up to macro body */
238          ++def;
239
240      text = def;                         /* Remember start of replacement text */
241
242      edef = NULL;                        /* strip trailing white space */
243      while( *def )
244      {
245          if( !isspace(*def) )
246              ++def;
247          else
248              for( edef = def++; isspace(*def) ; ++def )
249                  ;
250      }
251
252      if( edef )
253          *edef = '\0';
254                                          /* Add the macro to the symbol table */
255
256      p = (MACRO *) newsym( sizeof(MACRO) );
257      strncpy( p->name, name, MAC_NAME_MAX );
258      strncpy( p->text, text, MAC_TEXT_MAX );
259      addsym( Macros, p );
260  }
261
262  /*--------------------------------------------------------------------*/
263
264  PRIVATE char *expand_macro( namep )
265  char **namep;
266  {
267      /* Return a pointer to the contents of a macro having the indicated
268       * name. Abort with a message if no macro exists. The macro name includes
269       * the brackets, which are destroyed by the expansion process. *namep
270       * is modified to point past the close brace.
271       */
272
273      char  *p;
274      MACRO *mac;
275
276      if( !(p = strchr( ++(*namep), '}' )) )      /* skip { and find } */
277          parse_err( E_BADMAC );                  /* print msg & abort */
278      else
279      {
280          *p++ = '\0';                            /* Overwrite close brace. */
281
282          if( !(mac = (MACRO *) findsym( Macros, *namep )) )
283              parse_err( E_NOMAC );
284
285          *namep = p ;                            /* Update name pointer.   */
286          return mac->text;
287      }
288
289      return "ERROR";                 /* If you get here, it's a bug */
290  }
291
292  /*--------------------------------------------------------------------*/
293
294  PRIVATE print_a_macro( mac )        /* Workhorse function needed by       */
295  MACRO *mac;                         /* ptab() call in printmacs(), below  */
296  {
297      printf( "%-16s--[%s]--\n", mac->name, mac->text );
298  }
299
300  PUBLIC void printmacs()             /* Print all the macros to stdout */
301  {
302      if( !Macros )
303          printf( "\tThere are no macros\n" );
304      else
305      {
306          printf( "\nMACROS:\n" );
307          ptab( Macros, print_a_macro, NULL, 1 );
308      }
309  }
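The name/text split performed at the top of new_macro() can be tried without the symbol-table routines. The following is a sketch only: split_definition() is a hypothetical name, and it trims trailing white space by scanning backward from the end of the line rather than by remembering a candidate end position as the listing does. Like the listing, it modifies the input line in place.

```c
#include <ctype.h>
#include <string.h>

/* Split a definition line of the form
 *         name <whitespace> text [<whitespace>]
 * in place. On return, *name and *text point at '\0'-terminated
 * pieces of def; trailing white space (including '\n') is removed.
 */
void split_definition(char *def, char **name, char **text)
{
    char *end;

    *name = def;                                   /* isolate the name  */
    while (*def && !isspace((unsigned char)*def))
        ++def;
    if (*def)
        *def++ = '\0';

    while (isspace((unsigned char)*def))           /* skip to the body  */
        ++def;
    *text = def;

    end = def + strlen(def);                       /* trim the tail     */
    while (end > def && isspace((unsigned char)end[-1]))
        --end;
    *end = '\0';
}
```

White space inside the body is preserved; only the run at the very end of the line is discarded.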
2.5.2.7 LeX's Lexical Analyzer. The lowest-level input functions are in input.c,
Listing 2.27. get_expr() on line eight is the actual input function. It gets an entire
rule (both the regular expression and any following code) from the input file (pointed
to by Ifile) and puts it into Input_buf[]. Multiple-line rules are handled here in
that lines that start with white space are concatenated to the previous line. Two
line-number variables are modified in this routine. They are Lineno, which holds the input
line number of the first line of the rule, and Actual_lineno, which holds the current
input line number. get_expr() normally returns a pointer to the input string (in
Input_buf[]). It returns NULL either at end of file or when a line starting with a %% is
encountered. Since %% is treated as an end of file, the third part of the input file, which
contains C source code that is passed directly to the output, is ignored by the parser.
Listing 2.28 holds the lexical analyzer itself. The token set is defined in the
enumerated type on lines 310 to 330. The L token (L for literal) is used for all characters
that aren't represented by explicitly defined tokens. Escaped characters and characters
within quoted strings are also returned as L tokens, even if the lexeme would normally
be an explicit token. The EOS token is returned at end of the regular expression, but the
input buffer holds the entire rule, including a multiple-line accepting action. The parser
uses this fact to pick up the accepting action when an EOS is encountered. Note that end
of input is also treated as a token.
Listing 2.27. input.c: Low-Level Input Functions
 1  #include <stdio.h>
 2  #include <ctype.h>
 3  #include <tools/debug.h>
 4  #include "globals.h"
 5
 6  /* INPUT.C   Lowest-level input functions. */
 7
 8  PUBLIC char *get_expr()
 9  {
10      /* Input routine for nfa(). Gets a regular expression and the associated
11       * string from the input stream. Returns a pointer to the input string
12       * normally. Returns NULL on end of file or if a line beginning with % is
13       * encountered. All blank lines are discarded and all lines that start with
14       * whitespace are concatenated to the previous line. The global variable
15       * Lineno is set to the line number of the top line of a multiple-line
16       * block. Actual_lineno holds the real line number.
17       */
18
19      static int lookahead = 0;
20      int        space_left;
21      char       *p;
22
23      p          = Input_buf;
24      space_left = MAXINP;
25      if( Verbose > 1 )
26          printf( "b%d: ", Actual_lineno );
27
28      if( lookahead == '%' )          /* next line starts with a % sign */
29          return NULL;                /* return End-of-input marker     */
30
31      Lineno = Actual_lineno ;
32
33      while( (lookahead = getline(&p, space_left-1, Ifile)) != EOF )
34      {
35          if( lookahead == 0 )
36              lerror( 1, "Rule too long\n" );
37
38          Actual_lineno++;
39
40          if( !Input_buf[0] )         /* Ignore blank lines */
41              continue;
42
43          space_left = MAXINP - (p - Input_buf);
44
45          if( !isspace(lookahead) )
46              break;
47
48          *p++ = '\n' ;
49      }
50
51      if( Verbose > 1 )
52          printf( "%s\n", lookahead ? Input_buf : "--EOF--" );
53
54      return lookahead ? Input_buf : NULL ;
55  }
56
57  /*--------------------------------------------------------------------*/
58
59  PRIVATE int getline( stringp, n, stream )
60  char **stringp;
61  FILE *stream;
62  {
63      /* Gets a line of input. Gets at most n-1 characters. Updates *stringp
64       * to point at the '\0' at the end of the string. Return a lookahead
65       * character (the character that follows the \n in the input). The '\n'
66       * is not put into the string.
67       *
68       * Return the character following the \n normally,
69       *        EOF  at end of file,
70       *        0    if the line is too long.
71       */
72
73      static int lookahead = 0;
74      char       *str, *startstr;
75
76      startstr = str = *stringp;
77
78      if( lookahead == 0 )                    /* initialize */
79          lookahead = getc( stream );
80
81      if( n > 0 && lookahead != EOF )
82      {
83          while( --n > 0 )
84          {
85              *str      = lookahead ;
86              lookahead = getc( stream );
87
88              if( *str == '\n' || *str == EOF )
89                  break;
90              ++str;
91          }
92          *str     = '\0';
93          *stringp = str ;
94      }
95      return (n <= 0) ? 0 : lookahead ;
96  }
Listing 2.28. nfa.c: LeX's own Lexical Analyzer
310  typedef enum token
311  {
312      EOS = 1,            /* end of string     */
313      ANY,                /* .                 */
314      AT_BOL,             /* ^                 */
315      AT_EOL,             /* $                 */
316      CCL_END,            /* ]                 */
317      CCL_START,          /* [                 */
318      CLOSE_CURLY,        /* }                 */
319      CLOSE_PAREN,        /* )                 */
320      CLOSURE,            /* *                 */
321      DASH,               /* -                 */
322      END_OF_INPUT,       /* EOF               */
323      L,                  /* literal character */
324      OPEN_CURLY,         /* {                 */
325      OPEN_PAREN,         /* (                 */
326      OPTIONAL,           /* ?                 */
327      OR,                 /* |                 */
328      PLUS_CLOSE          /* +                 */
329
330  } TOKEN;
331
332  PRIVATE TOKEN Tokmap[] =
333  {
334  /*  ^@  ^A  ^B  ^C  ^D  ^E  ^F  ^G  ^H  ^I  ^J  ^K  ^L  ^M  ^N  */
335      L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,
336
337  /*  ^O  ^P  ^Q  ^R  ^S  ^T  ^U  ^V  ^W  ^X  ^Y  ^Z  ^[  ^\  ^]  */
338      L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,
339
340  /*  ^^  ^_  SPACE  !   "   #   $        %   &   '   */
341      L,  L,  L,     L,  L,  L,  AT_EOL,  L,  L,  L,
342
343  /*  (            )             *         +            ,   -      .    */
344      OPEN_PAREN,  CLOSE_PAREN,  CLOSURE,  PLUS_CLOSE,  L,  DASH,  ANY,
345
346  /*  /   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   */
347      L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,
348
349  /*  >   ?          */
350      L,  OPTIONAL,
351
352  /*  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   */
353      L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,
354
355  /*  O   P   Q   R   S   T   U   V   W   X   Y   Z   */
356      L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,
357
358  /*  [           \   ]         ^        */
359      CCL_START,  L,  CCL_END,  AT_BOL,
360
361  /*  _   `   a   b   c   d   e   f   g   h   i   j   k   l   m   */
362      L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,
363
364  /*  n   o   p   q   r   s   t   u   v   w   x   y   z   */
365      L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,  L,
366
367  /*  {            |    }             DEL  */
368      OPEN_CURLY,  OR,  CLOSE_CURLY,  L
369  };
370
371  PRIVATE char  *(*Ifunct)();     /* Input function pointer            */
372  PRIVATE char  *Input = "";      /* Current position in input string  */
373  PRIVATE char  *S_input;         /* Beginning of input string         */
374  PRIVATE TOKEN Current_tok;      /* Current token                     */
375  PRIVATE int   Lexeme;           /* Value associated with LITERAL     */
376
377  #define MATCH(t) ( Current_tok == (t) )
378
379  /*--------------------------------------------------------------------
380   * Lexical analyzer:
381   *
382   * Lexical analysis is trivial because all lexemes are single-character values.
383   * The only complications are escape sequences and quoted strings, both
384   * of which are handled by advance(), below. This routine advances past the
385   * current token, putting the new token into Current_tok and the equivalent
386   * lexeme into Lexeme. If the character was escaped, Lexeme holds the actual
387   * value. For example, if a "\s" is encountered, Lexeme will hold a space
388   * character. The MATCH(x) macro returns true if x matches the current token.
389   * Advance both modifies Current_tok to the current token and returns it.
390   */
391
392  PRIVATE int advance()
393  {
394      static int  inquote = 0;            /* Processing quoted string  */
395      int         saw_esc;                /* Saw a backslash           */
396      static char *stack[ SSIZE ],        /* Input-source stack        */
397                  **sp = NULL;            /* and stack pointer.        */
398
399      if( !sp )                           /* Initialize sp.            */
400          sp = stack - 1;                 /* Necessary for large model */
401
402      if( Current_tok == EOS )            /* Get another line          */
403      {
404          if( inquote )
405              parse_err( E_NEWLINE );
406          do
407          {
408              if( !(Input = (*Ifunct)()) )        /* End of file       */
409              {
410                  Current_tok = END_OF_INPUT;
411                  goto exit;
412              }
413              while( isspace( *Input ) )          /* Ignore leading    */
414                  Input++;                        /* white space...    */
415
416          } while( !*Input );                     /* and blank lines.  */
417
418          S_input = Input;                        /* Remember start of line */
419      }                                           /* for error messages.    */
420
421      while( *Input == '\0' )
422      {
423          if( INBOUNDS(stack, sp) )       /* Restore previous input source */
424          {
425              Input = *sp-- ;
426              continue;
427          }
428
429          Current_tok = EOS;              /* No more input sources to restore */
430          Lexeme = '\0';                  /* ie. you're at the real end of    */
431          goto exit;                      /* string.                          */
432      }
433
434      if( !inquote )
435      {
436          while( *Input == '{' )              /* Macro expansion required   }*/
437          {
438              *++sp = Input ;                 /* Stack current input string   */
439              Input = expand_macro( sp );     /* and replace it with the macro */
440                                              /* body.                        */
441              if( TOOHIGH(stack, sp) )
442                  parse_err( E_MACDEPTH );    /* Stack overflow */
443          }
444      }
445
446      if( *Input == '"' )
447      {                                   /* At either start and end of a quoted */
448          inquote = ~inquote;             /* string. All characters are treated  */
449          if( !*++Input )                 /* as literals while inquote is true). */
450          {
451              Current_tok = EOS ;
452              Lexeme = '\0';
453              goto exit;
454          }
455      }
456
457      saw_esc = (*Input == '\\');
458
459      if( !inquote )
460      {
461          if( isspace(*Input) )
462          {
463              Current_tok = EOS ;
464              Lexeme = '\0';
465              goto exit;
466          }
467          Lexeme = esc( &Input );
468      }
469      else
470      {
471          if( saw_esc && Input[1] == '"' )
472          {
473              Input  += 2;
474              Lexeme = '"';
475          }
476          else
477              Lexeme = *Input++ ;
478      }
479
480      Current_tok = (inquote || saw_esc) ? L : Tokmap[Lexeme] ;
481  exit:
482      return Current_tok;
483  }
The translation problem is simplified by the fact that all lexemes are single characters.
([^, which starts a negative character class, is treated as a CCL_START/AT_BOL
pair.) The Tokmap[] array on lines 332 to 369 is used to translate these single-
character lexemes into tokens. It is indexed by ASCII character, and evaluates to the
token value associated with that character.
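A lookup table of this kind can be sketched in a few lines. This is only an illustration: the token names and the handful of entries below are assumptions, not the contents of the real Tokmap[], and the literal flag stands in for the quoted-string and escape tests that advance() makes before it consults the table. The sketch uses C99 designated initializers rather than the K&R style of the listings.

```c
#include <assert.h>

/* Hypothetical token names; the set defined in nfa.c is larger. */
typedef enum { TOK_L, TOK_ANY, TOK_CLOSURE, TOK_OR,
               TOK_OPEN_PAREN, TOK_CLOSE_PAREN } token_t;

/* Indexed by ASCII code. Every entry that is not named here defaults
 * to TOK_L (value 0), that is, to an ordinary literal character.
 */
static const token_t tokmap[128] = {
    ['.'] = TOK_ANY,
    ['*'] = TOK_CLOSURE,
    ['|'] = TOK_OR,
    ['('] = TOK_OPEN_PAREN,
    [')'] = TOK_CLOSE_PAREN,
};

/* Translate a single-character lexeme into a token. A character that
 * is quoted or escaped (literal != 0) is always a literal.
 */
token_t char_to_token(int lexeme, int literal)
{
    assert(lexeme >= 0 && lexeme < 128);
    return literal ? TOK_L : tokmap[lexeme];
}
```

With this table, char_to_token('*', 0) yields TOK_CLOSURE, while the same character inside quotes, char_to_token('*', 1), yields TOK_L.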
Various other variables are needed by the lexical analyzer, all declared on lines 371
to 375 of Listing 2.28. Ifunct points at the current input function, which should work
like gets(), returning either the next input line or NULL at end of file. Input points at
the current position on the current input line, and S_input points at the start of that
line (it is remembered for error messages). Current_tok holds the current token, and
Lexeme holds the associated lexeme (which is only needed for L tokens). The MATCH(t)
macro on line 377 of Listing 2.28 evaluates true if t is the current lookahead token.
The advance() function on lines 392 to 483 of Listing 2.28 advances the input by
one character, modifying Current_tok and Lexeme as appropriate, and also returning
the current token. An END_OF_INPUT token is returned at end of file. This subroutine
is probably the most complicated routine in the parser, and is also a little long for my
taste, but it seemed best to keep the whole lexical analysis phase in a single subroutine
for maintenance reasons. A new line is fetched on lines 402 to 419 only if the end of the
previous line has been reached (Current_tok is EOS).
Lines 421 to 444 of Listing 2.28 handle macro expansion. Macros are delimited by
braces, and they are recognized in the while loop on line 436, which finds the leading
brace. It's a loop because macro definitions might be nested: if the first character of the
macro body is also an open brace, the loop will expand this inner macro as well as the
current one. Nested macros are handled with a stack. When an inner macro is encountered,
the current input buffer is pushed onto a stack, and Input is modified to point at
the macro-replacement text. The loop on line 421 is activated when you get to the end of
the replacement text. The previous input string is restored at the top of the loop by popping
it off the stack. The code on lines 429 to 431 is activated only if the stack is empty
and no more characters are found in the current input source, in which case end of string
has been reached. The goto statement on line 431 is functioning as a return statement
here. I generally shun multiple return statements in favor of multiple goto branches
to a label that precedes a single return statement. This way, the subroutine has only a
single exit point, so it's easier to set up breakpoints and debugging diagnostics.
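The stacking strategy can be sketched in isolation. The names here (set_input(), push_input(), next_char(), MAX_DEPTH) are hypothetical and the macro lookup itself is left out; the sketch shows only how a saved input pointer is resumed when the replacement text runs out, and how a single exit label replaces multiple return statements.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8                     /* assumed macro-nesting limit   */

static const char *stack[MAX_DEPTH];    /* saved input pointers          */
static int         depth = 0;           /* number of pointers saved      */
static const char *input = "";          /* current read position         */

void set_input(const char *text)
{
    input = text;
    depth = 0;
}

/* Switch to a macro body, remembering where the outer text resumes.
 * Returns 0 if the macros are nested too deeply.
 */
int push_input(const char *resume_at, const char *macro_body)
{
    assert(resume_at != NULL && macro_body != NULL);
    if (depth >= MAX_DEPTH)
        return 0;
    stack[depth++] = resume_at;
    input = macro_body;
    return 1;
}

/* Return the next character, or '\0' at the real end of the input. */
int next_char(void)
{
    int c;

    while (*input == '\0') {
        if (depth == 0) {               /* nothing left to restore       */
            c = '\0';
            goto exit;
        }
        input = stack[--depth];         /* resume the enclosing text     */
    }
    c = (unsigned char)*input++;
exit:
    return c;                           /* the only exit point           */
}
```

If the outer text is to resume at "c" after a macro body "b", then push_input("c", "b") followed by three next_char() calls yields 'b', 'c', and finally '\0'.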
Quotes are recognized on line 446, and inquote is set to true when processing a
quoted string. Similarly, saw_esc is set to true on line 457 when a backslash is
encountered. The clause on lines 461 to 467 handles normal text. EOS, which marks
the end of the regular expression, is returned if any white space is encountered, and
escape sequences are expanded by the esc() call on line 467. The following else
clause handles quoted strings. The test on line 471 is looking for a \", which must be
treated as a literal quote mark. All other characters, including white space, are considered
to be part of the regular expression and are just returned in the normal way.
Finally, the current token is put into Current_tok on line 480. If you're in a quoted
string or if the current character is preceded by a backslash, then the current character is
treated literally and an L token is returned; otherwise, the character is translated to a
token by looking it up in Tokmap[].
2.5.2.8 Parsing. LEX's parser begins in Listing 2.29. The prototypes at the top of the
listing are necessary, both for debugging and because the parser itself is highly
recursive; there is no way to arrange the subroutines to avoid all forward references.
The parser is a straightforward implementation of the grammar presented earlier. The
topmost routine, machine() on line 508 of Listing 2.29, collects a series of rules and
chains them together using ε transitions and dummy states (as was pictured in Figure 2.9
on page 82). The rule() calls on lines 516 and 522 return pointers to NFAs that
represent each regular expression on the input line.
The rule() subroutine on line 531 of Listing 2.29 gets a single regular expression
and associated action. Most of the work is done by expr(), called on lines 554 and
557. The routine is passed two pointers to NFA pointers. When expr() returns, these
two pointers will have been modified to point at the first and last nodes in the machine
(there will be only one of each). That is, synthesized attributes are used here (and
throughout the rest of the parser), but I can't use the actual return value (the argument to
a return statement) because there are two attributes. Consequently, I'm passing
pointers to variables to be modified. (Put another way, I'm doing a call by reference
here, passing pointers to the objects to be modified.)
Beginning- and end-of-line anchors are processed directly in rule() on lines 550 to
554 of Listing 2.29. An extra node is created with an outgoing transition on a newline so
that the ^ is treated as if it were a regular expression that matched a newline. The
anchor field is modified on line 552 to remember that this newline is there as the result
of an anchor, as compared to a specific match of a \n. The newline has to be discarded
in the former case; it remains in the lexeme if a specific match was requested, however.
Listing 2.29. nfa.c Parser, Part 1: machine() and rule()
484  PRIVATE int   advance      ( void           );
485  PRIVATE void  cat_expr     ( NFA**, NFA**   );
486  PRIVATE void  discard      ( NFA*           );
487  PRIVATE void  dodash       ( SET*           );
488  PRIVATE void  expr         ( NFA**, NFA**   );
489  PRIVATE void  factor       ( NFA**, NFA**   );
490  PRIVATE int   first_in_cat ( TOKEN          );
491  PRIVATE NFA   *machine     ( void           );
492  PRIVATE NFA   *new         ( void           );
493  PRIVATE void  parse_err    ( ERR_NUM        );
494  PRIVATE NFA   *rule        ( void           );
495  PRIVATE char  *save        ( char*          );
496  PRIVATE void  term         ( NFA**, NFA**   );
497
498  /*--------------------------------------------------------------
499   * The Parser:
500   *   A simple recursive descent parser that creates a Thompson NFA for
501   *   a regular expression. The access routine [thompson()] is at the
502   *   bottom. The NFA is created as a directed graph, with each node
503   *   containing pointers to the next node. Since the structures are
504   *   allocated from an array, the machine can also be considered
505   *   as an array where the state number is the array index.
506   */
507
508  PRIVATE NFA *machine()
509  {
510      NFA *start;
511      NFA *p;
512
513      ENTER("machine");
514
515      p       = start = new();
516      p->next = rule();
517
518      while( !MATCH(END_OF_INPUT) )
519      {
520          p->next2 = new();
521          p        = p->next2;
522          p->next  = rule();
523      }
524
525      LEAVE("machine");
526      return start;
527  }
528
529  /*--------------------------------------------------------------*/
530
531  PRIVATE NFA *rule()
532  {
533      /*  rule    --> expr  EOS action
534       *              ^expr EOS action
535       *              expr$ EOS action
536       *
537       *  action  --> <tabs> <string of characters>
538       *              epsilon
539       */
540
541      NFA *p;
542      NFA *start = NULL;
543      NFA *end   = NULL;
544      int anchor = NONE;
545
546      ENTER("rule");
547
548      if( MATCH( AT_BOL ) )
549      {
550          start       = new();
551          start->edge = '\n';
552          anchor     |= START;
553          advance();
554          expr( &start->next, &end );
555      }
556      else
557          expr( &start, &end );
558
559
560      if( MATCH( AT_EOL ) )
561      {
562          /* pattern followed by a carriage-return or linefeed (use a
563           * character class).
564           */
565
566          advance();
567          end->next = new();
568          end->edge = CCL;
569
570          if( !( end->bitset = newset()) )
571              parse_err( E_MEM );
572
573          ADD( end->bitset, '\n' );
574
575          if( !Unix )
576              ADD( end->bitset, '\r' );
577
578          end     = end->next;
579          anchor |= END;
580      }
581
582      while( isspace(*Input) )
583          Input++;
584
585      end->accept = save( Input );
586      end->anchor = anchor;
587      advance();                         /* skip past EOS */
588
589      LEAVE("rule");
590      return start;
591  }
This information is eventually output as a table that will be used by the LEX-generated
driver when it processes the newline at run time. The end-of-line anchor is handled in a
similar way on lines 566 to 579 of Listing 2.29, though the extra node is put at the end of
the machine rather than at the beginning.
Anchors are recognized with a character class that matches either a carriage return or
linefeed (as compared to a literal match of a '\n' character). You must recognize both
characters in MS-DOS binary-mode input, because all input lines are terminated with a
CR-LF pair (in that order). Lines are terminated by a single newline only in
translated mode. Since I don't want to worry about which of the two input modes is
used, I'm testing for both possibilities. When expr() returns, the input is positioned at
the start of the accepting action, which is saved on line 585 (remember, an entire
multiple-line action is collected into the input string).
Subroutines expr() and cat_expr() are in Listing 2.30. These routines handle the
binary operations: | (OR) and concatenation. I'll show how expr() works by watching
it process the expression A|B. The cat_expr() call on line 621 creates a machine that
recognizes the A:
[Figure: Node 0 connected to Node 1 by an edge labeled A]
*startp and *endp are modified to point at Nodes 0 and 1, and the input is advanced
to the OR operator. The MATCH on line 623 succeeds and the OR is skipped by the subsequent
advance() call. The second cat_expr() call (on line 625) creates a machine
that recognizes the B:
[Figure: Node 2 connected to Node 3 by an edge labeled B]
and modifies e2_start and e2_end to point at nodes representing States 2 and 3. The
two machines are then joined together to create the following machine:
[Figure: Node 4 with ε edges to Nodes 0 and 2; Nodes 1 and 3 with ε edges to Node 5]
Node 4 is created on lines 628 to 630 and *startp is modified to point at Node 4 on
line 631. Node 5 is set up in a similar way on the next four lines.
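The joining step can be tried out on its own with a pared-down node structure. The following is a sketch, not the code from nfa.c: nfa_t, new_state(), leaf(), and or_join() are hypothetical stand-ins, and the allocator is a plain array with no error recovery. The body of or_join() follows the sequence of assignments described above.

```c
#include <assert.h>
#include <stddef.h>

#define EPSILON    (-1)         /* assumed label for an ε edge          */
#define MAX_STATES 32

typedef struct nfa {
    int         edge;           /* character label, or EPSILON          */
    struct nfa *next;           /* first outgoing edge (NULL if none)   */
    struct nfa *next2;          /* second ε edge       (NULL if none)   */
} nfa_t;

static nfa_t pool[MAX_STATES];  /* a state number is its pool index     */
static int   next_alloc = 0;

nfa_t *new_state(void)
{
    nfa_t *p;

    assert(next_alloc < MAX_STATES);
    p        = &pool[next_alloc++];
    p->edge  = EPSILON;
    p->next  = NULL;
    p->next2 = NULL;
    return p;
}

/* Make a two-node machine that recognizes the single character c. */
void leaf(int c, nfa_t **startp, nfa_t **endp)
{
    *startp         = new_state();
    (*startp)->edge = c;
    *endp = (*startp)->next = new_state();
}

/* OR the machine (*startp, *endp) with the machine (e2_start, e2_end).
 * On return, *startp and *endp are the ends of the combined machine.
 */
void or_join(nfa_t **startp, nfa_t **endp, nfa_t *e2_start, nfa_t *e2_end)
{
    nfa_t *p = new_state();     /* new start state (Node 4 in the text) */

    p->next2 = *startp;
    p->next  = e2_start;
    *startp  = p;

    p = new_state();            /* new end state (Node 5 in the text)   */
    (*endp)->next = p;
    e2_end->next  = p;
    *endp         = p;
}
```

Calling leaf() for A and then for B allocates states 0 through 3, so a following or_join() call creates states 4 and 5 in the same roles as Nodes 4 and 5 in the walk-through.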
Listing 2.30. nfa.c Parser, Part 2: Binary Operators
592  PRIVATE void expr( startp, endp )
593  NFA **startp, **endp;
594  {
595      /* Because a recursive descent compiler can't handle left recursion,
596       * the productions:
597       *
598       *   expr    -> expr OR cat_expr
599       *           |  cat_expr
600       *
601       * must be translated into:
602       *
603       *   expr    -> cat_expr expr'
604       *   expr'   -> OR cat_expr expr'
605       *           |  epsilon
606       *
607       * which can be implemented with this loop:
608       *
609       *   cat_expr
610       *   while( match(OR) )
611       *       cat_expr
612       *       do the OR
613       */
614
615      NFA *e2_start = NULL;          /* expression to right of | */
616      NFA *e2_end   = NULL;
617      NFA *p;
618
619      ENTER("expr");
620
621      cat_expr( startp, endp );
622
623      while( MATCH( OR ) )
624      {
625          advance();
626          cat_expr( &e2_start, &e2_end );
627
628          p        = new();
629          p->next2 = *startp;
630          p->next  = e2_start;
631          *startp  = p;
632
633          p             = new();
634          (*endp)->next = p;
635          e2_end->next  = p;
636          *endp         = p;
637      }
638      LEAVE("expr");
639  }
640
641  /*--------------------------------------------------------------*/
642
643  PRIVATE void cat_expr( startp, endp )
644  NFA **startp, **endp;
645  {
646      /* The same translations that were needed in the expr rules are needed again
647       * here:
648       *
649       *   cat_expr  -> cat_expr factor
650       *             |  factor
651       *
652       * is translated to:
653       *
654       *   cat_expr  -> factor cat_expr'
655       *   cat_expr' -> factor cat_expr'
656       *             |  epsilon
657       */
658
659      NFA *e2_start, *e2_end;
660
661      ENTER("cat_expr");
662
663      if( first_in_cat( Current_tok ) )
664          factor( startp, endp );
665
666      while( first_in_cat( Current_tok ) )
667      {
668          factor( &e2_start, &e2_end );
669
670          memcpy( *endp, e2_start, sizeof(NFA) );
671          discard( e2_start );
672
673          *endp = e2_end;
674      }
675
676      LEAVE("cat_expr");
677  }
678
679  /*--------------------------------------------------------------*/
680
681  PRIVATE int first_in_cat( tok )
682  TOKEN tok;
683  {
684      switch( tok )
685      {
686      case CLOSE_PAREN:
687      case AT_EOL:
688      case OR:
689      case EOS:         return 0;
690
691      case CLOSURE:
692      case PLUS_CLOSE:
693      case OPTIONAL:    parse_err( E_CLOSE   );  return 0;
694
695      case CCL_END:     parse_err( E_BRACKET );  return 0;
696      case AT_BOL:      parse_err( E_BOL     );  return 0;
697      }
698
699      return 1;
700  }
Concatenation is a somewhat harder problem because there's no operator to look for.
The problem is solved by first_in_cat() on line 681 of Listing 2.30, which looks to
see if the next input token can reasonably be concatenated to the current one. That is,
there is a set of tokens that can not just be concatenated (such as the parenthesis that
terminates a parenthesized subexpression or an OR token), and the loop must terminate
when one of these is encountered. These symbols are identified by the case statements
on lines 686 to 689. The other cases test for obvious error conditions such as a close
bracket (CCL_END) without a preceding open bracket or one of the closure operators
without anything in front of it.
Concatenation is performed in a manner similar to OR. The first operand is fetched
with the factor() call on line 664 (which returns with *startp and *endp modified
to point at the endpoints of a machine that recognizes the operand). The second and
subsequent operands are fetched by the factor() call on line 668, and the two are
concatenated by overwriting the contents of the end node of the first operand with the
contents of the starting node of the second operand (with the memcpy() call on line 670).
The now redundant start node of the second operand is then discarded.
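The overwrite-and-discard step can be sketched as follows. As before, nfa_t, new_state(), leaf(), and concat() are hypothetical names; the free-list array is a simplified stand-in for the push-back stack behind discard(), chosen so that the most recently discarded node is the next one reused.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EPSILON    (-1)         /* assumed label for an ε edge          */
#define MAX_STATES 32

typedef struct nfa {
    int         edge;
    struct nfa *next;
    struct nfa *next2;
} nfa_t;

static nfa_t  pool[MAX_STATES];
static int    next_alloc = 0;
static nfa_t *free_list[MAX_STATES];    /* stack of discarded nodes     */
static int    nfree = 0;

nfa_t *new_state(void)
{
    nfa_t *p;

    if (nfree > 0)
        p = free_list[--nfree];         /* reuse a discarded node first */
    else {
        assert(next_alloc < MAX_STATES);
        p = &pool[next_alloc++];
    }
    p->edge  = EPSILON;
    p->next  = NULL;
    p->next2 = NULL;
    return p;
}

void discard(nfa_t *p)
{
    assert(nfree < MAX_STATES);
    free_list[nfree++] = p;
}

/* Make a two-node machine that recognizes the single character c. */
void leaf(int c, nfa_t **startp, nfa_t **endp)
{
    *startp         = new_state();
    (*startp)->edge = c;
    *endp = (*startp)->next = new_state();
}

/* Append the machine (e2_start, e2_end) to the machine ending at *endp.
 * The old end node takes over the role of e2_start, which is discarded.
 */
void concat(nfa_t **endp, nfa_t *e2_start, nfa_t *e2_end)
{
    memcpy(*endp, e2_start, sizeof(nfa_t));
    discard(e2_start);
    *endp = e2_end;
}
```

Copying the whole structure works because every edge that leaves e2_start is stored inside the node itself, and nothing outside the second machine points at that node yet.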
As an example of the process, if the input expression is D*\.D, the first factor()
call processes the D*, modifying *startp and *endp to point at nodes representing
States 2 and 3 of the following machine:
[Figure: the machine for D*]
The second call modifies e2_start and e2_end to point at Nodes 4 and 5 of the following
machine:
[Figure: the machine for \.]
They are concatenated together by overwriting Node 3 with Node 4, yielding:
[Figure: the machine for D*\.]
Node 3 is then discarded. The factor() call in the next iteration of the loop modifies
e2_start and e2_end to point at the ends of:
[Figure: the machine for D]
(The discarded Node 3 is picked up again here because it will be at the top of the pushback
stack.) This new machine is concatenated to the previous one by overwriting Node
5 with the contents of Node 3.
The unary closure operators are all handled by factor(), in Listing 2.31. It
behaves just like the earlier routines, except that it builds a closure machine like the ones
shown in Figure 2.12 on page 83. The code is simplified because the machines for + and
? are subsets of the one for *. The same number of extra nodes is created in all three
situations. The backwards-pointing ε edge is created only if a * or a + is being processed;
the forward-pointing edge is created only for a * or a ?.
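The way the three operators share one piece of code can be sketched like this. The names (nfa_t, new_state(), leaf(), closure()) are hypothetical, and the operator is passed in as a character rather than being read from the current token as the real routine does.

```c
#include <assert.h>
#include <stddef.h>

#define EPSILON    (-1)         /* assumed label for an ε edge          */
#define MAX_STATES 32

typedef struct nfa {
    int         edge;
    struct nfa *next;
    struct nfa *next2;
} nfa_t;

static nfa_t pool[MAX_STATES];
static int   next_alloc = 0;

nfa_t *new_state(void)
{
    nfa_t *p;

    assert(next_alloc < MAX_STATES);
    p        = &pool[next_alloc++];
    p->edge  = EPSILON;
    p->next  = NULL;
    p->next2 = NULL;
    return p;
}

/* Make a two-node machine that recognizes the single character c. */
void leaf(int c, nfa_t **startp, nfa_t **endp)
{
    *startp         = new_state();
    (*startp)->edge = c;
    *endp = (*startp)->next = new_state();
}

/* Wrap the machine (*startp, *endp) in a closure; op is '*', '+', or '?'.
 * Two new nodes are added in every case; only the optional edges differ.
 */
void closure(int op, nfa_t **startp, nfa_t **endp)
{
    nfa_t *start = new_state();
    nfa_t *end   = new_state();

    assert(op == '*' || op == '+' || op == '?');

    start->next   = *startp;
    (*endp)->next = end;

    if (op == '*' || op == '?')         /* forward edge: skip operand   */
        start->next2 = end;

    if (op == '*' || op == '+')         /* backward edge: repeat it     */
        (*endp)->next2 = *startp;

    *startp = start;
    *endp   = end;
}
```

For a + closure the new start state gets no second edge, so the operand cannot be skipped, while the old end state gets a backward ε edge to the old start state so that it can be repeated.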
Single-character matches and parenthesized subexpressions are handled by term(),
in Listing 2.32. The actual NFA structures are allocated and connected to each other on
lines 761 and 762. The edge field is then initialized to the character or CCL, as appropriate.
A dot is treated as a character class that matches everything but a newline. Normal
character classes are assembled by dodash() (footnote 12) on line 811 of Listing 2.32. This
routine converts the input tokens representing the class into a SET. A set can have one
element ([x]) or several elements. [a-zA-Z] and the following large character class are
equivalent:
[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]
Note that the dash notation is expanded as sequential numbers. Since your machine
probably uses the ASCII character set, this means that [A-z] contains the entire alphabet
plus the following symbols:
[ \ ] ^ _ `
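The effect of expanding a range as sequential character codes is easy to check. The sketch below assumes the ASCII character set and the default "C" locale, and it uses a plain byte array in place of the SET type; add_range() and non_letters() are hypothetical helpers.

```c
#include <assert.h>
#include <ctype.h>

#define NCHARS 128              /* 7-bit ASCII is assumed               */

/* Add every character code from first through last to the class. */
void add_range(unsigned char set[NCHARS], int first, int last)
{
    assert(0 <= first && last < NCHARS);
    for (; first <= last; first++)
        set[first] = 1;
}

/* Copy the members of the class that are not letters into out (which
 * must have room for NCHARS + 1 characters). Returns their number.
 */
int non_letters(const unsigned char set[NCHARS], char *out)
{
    int c, n = 0;

    for (c = 0; c < NCHARS; c++)
        if (set[c] && !isalpha(c))
            out[n++] = (char)c;

    out[n] = '\0';
    return n;
}
```

After add_range(set, 'A', 'z') on an empty set, non_letters() reports six characters: the ASCII codes 91 through 96, which lie between Z and a.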
12. Camptown ladies sing this song, dodash(), dodash(), Camptown race track five miles long...
Listing 2.31. nfa.c Parser, Part 3: Unary Operators (Closure)
701  PRIVATE void factor( startp, endp )
702  NFA **startp, **endp;
703  {
704      /*  factor --> term*  |  term+  |  term?
705       */
706
707      NFA *start, *end;
708
709      ENTER("factor");
710
711      term( startp, endp );
712
713      if( MATCH(CLOSURE) || MATCH(PLUS_CLOSE) || MATCH(OPTIONAL) )
714      {
715          start         = new();
716          end           = new();
717          start->next   = *startp;
718          (*endp)->next = end;
719
720          if( MATCH(CLOSURE) || MATCH(OPTIONAL) )        /* * or ? */
721              start->next2 = end;
722
723          if( MATCH(CLOSURE) || MATCH(PLUS_CLOSE) )      /* * or + */
724              (*endp)->next2 = *startp;
725
726          *startp = start;
727          *endp   = end;
728          advance();
729      }
730
731      LEAVE("factor");
732  }
You can get a dash into a character class with a \-, as in [_\-], which recognizes either
a dash or an underscore. Note that the only metacharacters recognized as such in a character
class are a ^ that immediately follows the bracket, a dash, and a close bracket (]);
so [*?.] recognizes a star, question mark, or dot. The " and { (which triggers a macro
expansion) are also active in a character class, but they are handled by the input routines.
You can use \] to get a ] into a class: [[\]] recognizes either an open or close bracket.
Listing 2.32. nfa.c Parser, Part 4: Single Character Matches
733  PRIVATE void term( startp, endp )
734  NFA **startp, **endp;
735  {
736      /* Process the term productions:
737       *
738       * term  --> [...]  |  [^...]  |  []  |  [^]  |  .  |  (expr)  |  <character>
739       *
740       * The [] is nonstandard. It matches a space, tab, formfeed, or newline,
741       * but not a carriage return (\r). All of these are single nodes in the
742       * NFA.
743       */
744
745      NFA *start;
746      int c;
747
748      ENTER("term");
749
750      if( MATCH( OPEN_PAREN ) )
751      {
752          advance();
753          expr( startp, endp );
754          if( MATCH( CLOSE_PAREN ) )
755              advance();
756          else
757              parse_err( E_PAREN );
758      }
759      else
760      {
761          *startp = start       = new();
762          *endp   = start->next = new();
763
764          if( !( MATCH( ANY ) || MATCH( CCL_START ) ))
765          {
766              start->edge = Lexeme;
767              advance();
768          }
769          else
770          {
771              start->edge = CCL;
772
773              if( !( start->bitset = newset()) )
774                  parse_err( E_MEM );
775
776              if( MATCH( ANY ) )                     /* dot (.) */
777              {
778                  ADD( start->bitset, '\n' );
779                  if( !Unix )
780                      ADD( start->bitset, '\r' );
781
782                  COMPLEMENT( start->bitset );
783              }
784              else
785              {
786                  advance();
787                  if( MATCH( AT_BOL ) )              /* Negative character class */
788                  {
789                      advance();
790
791                      ADD( start->bitset, '\n' );    /* Don't include \n in class */
792                      if( !Unix )
793                          ADD( start->bitset, '\r' );
794
795                      COMPLEMENT( start->bitset );
796                  }
797                  if( !MATCH( CCL_END ) )
798                      dodash( start->bitset );
799                  else                               /* [] or [^] */
800                      for( c = 0; c <= ' '; ++c )
801                          ADD( start->bitset, c );
802              }
803              advance();
804          }
805      }
806      LEAVE("term");
807  }
808
809  /*--------------------------------------------------------------*/
810
811  PRIVATE void dodash( set )
812  SET *set;                          /* Pointer to ccl character set */
813  {
814      register int first;
815
816      for( ; !MATCH( EOS ) && !MATCH( CCL_END ) ; advance() )
817      {
818          if( !MATCH( DASH ) )
819          {
820              first = Lexeme;
821              ADD( set, Lexeme );
822          }
823          else
824          {
825              advance();
826              for( ; first <= Lexeme ; first++ )
827                  ADD( set, first );
828          }
829      }
830  }
The final workhorse function is print_nfa(), starting on line 57 of Listing 2.33,
which is for debugging. It prints out the entire machine in human-readable form, showing
the various pointers, and so forth.
Nfa.c finishes up with a high-level access routine, thompson(), in Listing 2.34.
(Everything else was declared PRIVATE, so was inaccessible from outside the current
file.) It is passed a pointer to an input function, and it returns two things: a pointer to an
array of NFA structures that represents the state machine and the size of that array (the
number of states in use).
Listing 2.33. printnfa.c Print NFA to Standard Output
 1  #include <stdio.h>
 2  #include <tools/debug.h>
 3  #include <tools/set.h>
 4  #include <tools/hash.h>
 5  #include <tools/compiler.h>
 6  #include "nfa.h"
 7
 8  PRIVATE void printccl  ( SET*            );
 9  PRIVATE char *plab     ( NFA*, NFA*      );
10  PUBLIC  void print_nfa ( NFA*, int, NFA* );
11
12  /*--------------------------------------------------------------
13   * PRINTNFA.C  Routine to print out a NFA structure in human-readable form.
14   */
15
16  PRIVATE void printccl( set )
17  SET *set;
18  {
19      static int i;
20
21      putchar('[');
22      for( i = 0 ; i <= 0x7f; i++ )
23      {
24          if( TEST(set, i) )
25          {
26              if( i < ' ' )
27                  printf( "^%c", i + '@' );
28              else
29                  printf( "%c", i );
30          }
31      }
32
33      putchar(']');
34  }
35
36  /*--------------------------------------------------------------*/
37
38  PRIVATE char *plab( nfa, state )
39  NFA *nfa, *state;
40  {
41      /* Return a pointer to a buffer containing the state number. The buffer is
42       * overwritten on each call so don't put more than one plab() call in an
43       * argument to printf().
44       */
45
46      static char buf[ 32 ];
47
48      if( !nfa || !state )
49          return("--");
50
51      sprintf( buf, "%2d", (int)(state - nfa) );
52      return ( buf );
53  }
54
55  /*--------------------------------------------------------------*/
56
57  PUBLIC void print_nfa( nfa, len, start )
58  NFA *nfa, *start;
59  int len;
60  {
61      NFA *s = nfa;
62
63      printf( "\n----------------- NFA ---------------\n" );
64
65      for( ; --len >= 0 ; nfa++ )
66      {
67          printf( "NFA state %s: ", plab(s, nfa) );
68
69          if( !nfa->next )
70              printf( "(TERMINAL)" );
71          else
72          {
73              printf( "--> %s ",  plab(s, nfa->next ) );
74              printf( "(%s) on ", plab(s, nfa->next2) );
75
76              switch( nfa->edge )
77              {
78              case CCL:     printccl( nfa->bitset );      break;
79              case EPSILON: printf( "EPSILON " );         break;
80              default:      pchar( nfa->edge, stdout );   break;
81              }
82          }
83
84          if( nfa == start )
85              printf(" (START STATE)");
86
87          if( nfa->accept )
88              printf(" accepting %s<%s>%s", nfa->anchor & START ? "^" : "",
89                                            nfa->accept,
90                                            nfa->anchor & END   ? "$" : "" );
91          printf( "\n" );
92      }
93      printf( "\n-------------------------------------\n" );
94  }
Listing 2.34. nfa.c The High-Level Access Function
831  PUBLIC NFA *thompson( input_function, max_state, start_state )
832  char *(*input_function)();
833  int  *max_state;
834  NFA  **start_state;
835  {
836      /* Access routine to this module. Return a pointer to a NFA transition
837       * table that represents the regular expression pointed to by expr or
838       * NULL if there's not enough memory. Modify *max_state to reflect the
839       * largest state number used. This number will probably be a larger
840       * number than the total number of states. Modify *start_state to point
841       * to the start state. This pointer is garbage if thompson() returned 0.
842       * The memory for the table is fetched from malloc(); use free() to
843       * discard it.
844       */
845
846      CLEAR_STACK();
847
848      Ifunct = input_function;
849
850      Current_tok = EOS;                 /* Load first token     */
851      advance();
852
853      Nstates    = 0;
854      Next_alloc = 0;
855
856      *start_state = machine();          /* Manufacture the NFA  */
857      *max_state   = Next_alloc;         /* Max state # in NFA   */
858
859      if( Verbose > 1 )
860          print_nfa( Nfa_states, *max_state, *start_state );
861
862      if( Verbose )
863      {
864          printf("%d/%d NFA states used.\n", *max_state, NFA_MAX );
865          printf("%d/%d bytes used for accept strings.\n\n", Savep - Strings,
866                                                                  STR_MAX );
867      }
868      return Nfa_states;
869  }
2.5.3 Interpreting an NFA: Theory*
Now that we've constructed an NFA, we need to turn it into a DFA. The method
used here is called subset construction; it is developed over the next few sections. First,
let's look at how an NFA can be used directly to recognize a string. (Note that interpreting
an NFA directly can make more sense in some applications, such as editors, in which
the time required to manufacture the DFA from the NFA can be greater than the search
time.)
The basic strategy makes use of the fact that, from any state, you are, for all practical
purposes, also in every state that can be reached by traversing ε edges from the current
state. I'll demonstrate with a concrete example. The NFA for
(D*\.D|D\.D*)
is reproduced in Figure 2.14, and I'll recognize the string 1.2 using this NFA.
Figure 2.14. NFA for (D*\.D|D\.D*)
The terminal states (the ones with no outgoing edges) are all accepting states. Starting in
the start state (State 0), you can take all ε transitions, regardless of the input character,
and without reading any input. If the machine is in State 0, it is, for all practical purposes,
simultaneously in States 1, 3, 4, 5, and 12 as well. (These states are all at the end
of ε edges that start in State 0. For example, State 4 is there by traversing three edges,
from State 0 to State 12 to State 3 to State 4.) This set of states (the ones that can be
reached by making ε transitions from a given set of start states) is called the ε-closure
set. This set also includes the start state; there's an implied ε transition from every state
to itself. In this case, the set of start states has only one element: {0}, and you can make
transitions to a set of five states: {1, 3, 4, 5, 12}. So:
ε-closure( {0} ) = {0, 1, 3, 4, 5, 12}.
Reading the 1 from the input, the next state is determined by looking at all transitions
that can legitimately be made on a D from any of the states in the ε-closure set that was
just created. There is a transition from State 1 to 2 on a D, and another from State 5 to 8;
so from the states in the set {0, 1, 3, 4, 5, 12}, you can make transitions to states in the
set {2, 8} on a D. This set of states is called the move set by [Aho], and it's created by a
move() function (footnote 13). More formally, move(S,c), where S is a set of states and c is
an input character, is the set of states that can be reached by making transitions on c from
any of the states in S. In this case:
move( {0, 1, 3, 4, 5, 12}, D ) = {2, 8}.
From States 2 and 8 the machine is also in all states that can be reached by making ε
transitions out of States 2 and 8. There are ε transitions from State 2 to States 1 and 4,
but there are no ε transitions out of State 8. So:
ε-closure( {2, 8} ) = {1, 2, 4, 8}
(Remember, the ε-closure set includes the original states.) Now get another input character,
this time the dot, and look for outgoing transitions from the previously computed
ε-closure set. There is a transition from State 8 to 10 on a dot, and another from State 4
to 6:
move( {1, 2, 4, 8}, . ) = {6, 10}.
Continuing the process:
ε-closure( {6, 10} ) = {6, 9, 10, 13, 14}
State 14 is an accepting state, and the machine can accept if any of the states in the
ε-closure set are accepting states. So the machine can accept at this point. Since there's
more input, however, more processing is required; the same greedy algorithm that's
used by the LEX-generated analyzer is used here. Reading the 2:
move( {6, 9, 10, 13, 14}, D ) = {7, 11}
and:
ε-closure( {7, 11} ) = {7, 9, 11, 13, 14}.
This represents an accepting state because State 14 is present. Reading the end-of-input
marker,
move( {7, 9, 11, 13, 14}, END_OF_INPUT ) = ∅
The resulting set is empty; there's nowhere to go. This situation is the equivalent of a
failure transition, so the machine executes the most recently seen accepting action and
accepts State 14.
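The two set operations used in this walk-through can be sketched compactly if a set of states is kept as a bit mask. Everything in the sketch is an assumption made for illustration: the five-state machine is a made-up NFA for a*b rather than the one in Figure 2.14, states are referenced by array index instead of by pointer, and the names set_t, e_closure(), and nfa_move() are hypothetical.

```c
#include <assert.h>

#define EPSILON (-1)            /* assumed label for an ε edge          */
#define NONE    (-1)            /* no edge                              */

typedef struct {
    int edge;                   /* character label, or EPSILON          */
    int next;                   /* target of the first edge             */
    int next2;                  /* target of a second ε edge            */
} state_t;

/* Hypothetical NFA for a*b. State 4 is the only accepting state. */
static const state_t nfa[] = {
    /* 0 */ { EPSILON, 1,    3    },
    /* 1 */ { 'a',     2,    NONE },
    /* 2 */ { EPSILON, 1,    3    },
    /* 3 */ { 'b',     4,    NONE },
    /* 4 */ { EPSILON, NONE, NONE },
};
#define NSTATES ((int)(sizeof nfa / sizeof nfa[0]))

typedef unsigned set_t;         /* bit i is set if state i is a member  */

/* Add every state that is reachable along ε edges from a member of set.
 * The loop repeats until a pass adds nothing new.
 */
set_t e_closure(set_t set)
{
    set_t prev;
    int   i;

    do {
        prev = set;
        for (i = 0; i < NSTATES; i++)
            if (((set >> i) & 1u) && nfa[i].edge == EPSILON) {
                if (nfa[i].next  != NONE) set |= 1u << nfa[i].next;
                if (nfa[i].next2 != NONE) set |= 1u << nfa[i].next2;
            }
    } while (set != prev);

    return set;
}

/* The states reachable from a member of set by one transition on c. */
set_t nfa_move(set_t set, int c)
{
    set_t out = 0;
    int   i;

    assert(c >= 0);             /* input characters, never EPSILON      */
    for (i = 0; i < NSTATES; i++)
        if (((set >> i) & 1u) && nfa[i].edge == c && nfa[i].next != NONE)
            out |= 1u << nfa[i].next;

    return out;
}
```

For this machine, e_closure(1u << 0) is the set {0, 1, 3}, a move on 'a' from that set gives {2}, and the ε-closure of {2} is {1, 2, 3}, which mirrors the alternation of closure and move steps traced above.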
The algorithm is stated more formally in Tables 2.6 and 2.7. The first of these algorithms
is a greedy algorithm. It recognizes the longest possible string that matches a regular
expression. The second algorithm is nongreedy. It accepts the first string that
matches a regular expression. The main difference between these algorithms is that the
greedy algorithm doesn't accept an input string until a failure transition out of some state
occurs, whereupon it accepts the string associated with the most recently seen accepting
state (which could be the current state if the machine fails out of an accepting state).
The nongreedy algorithm just accepts as soon as it enters any accepting state.
13. The move() function is the same thing as the next() function that we've been using to find the next state in a
DFA.
Table 2.6. Greedy NFA Interpretation (Terminates on Failure)
current   is the set of NFA states that represents the current position in the machine.
c         is the current input character.
accept    if the most recently computed ε-closure set includes an accepting state, this is the
          state number, otherwise it's false (0).
last_accept = FALSE;
current = ε-closure( start_state );
while( c = nextchar() )
    if( (next = ε-closure( move(current, c) )) ≠ ∅ )
        if( accept )
            last_accept = accept;
        current = next;
    else if( last_accept )
        ACCEPT( last_accept );
        last_accept = FALSE;
    else
        ERROR;
Table 2.7. Non-Greedy NFA Interpretation (Terminates on First Accepting State)
current   is the set of NFA states that represents the current position in the machine.
c         is the current input character.
accept    if the most recently computed ε-closure set includes an accepting state, this is
          the state number, otherwise it's false (0).

while( c = nextchar() )
    if( (next = ε-closure( move(current, c) )) is not ∅ )
        if( accept )
            ACCEPT( accept );
        else
            current = next;
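The difference between the two acceptance policies can be sketched in a few lines of C. This is only an illustration under stated assumptions: to keep it short it uses a hand-coded deterministic machine for "one or more digits" rather than sets of NFA states, and the names next_state(), is_accepting(), greedy_match(), and nongreedy_match() are hypothetical; none of them appears in terp.c. End of input is treated as a failure transition, as in the example above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define FAIL (-1)                 /* failure transition */

/* Hypothetical two-state machine for D+ : state 0 is the start state,
 * state 1 is the (only) accepting state.
 */
static int next_state(int state, int c)
{
    assert(state == 0 || state == 1);
    return isdigit((unsigned char)c) ? 1 : FAIL;
}

static int is_accepting(int state)
{
    return state == 1;
}

/* Greedy: remember the most recent accepting position and keep going until
 * a failure transition (or end of input). Returns the match length, or 0.
 */
size_t greedy_match(const char *s)
{
    int    state = 0;
    size_t i, last_accept = 0;

    for (i = 0; s[i] != '\0'; ++i) {
        state = next_state(state, s[i]);
        if (state == FAIL)
            break;
        if (is_accepting(state))
            last_accept = i + 1;
    }
    return last_accept;
}

/* Nongreedy: accept as soon as any accepting state is entered. */
size_t nongreedy_match(const char *s)
{
    int    state = 0;
    size_t i;

    for (i = 0; s[i] != '\0'; ++i) {
        state = next_state(state, s[i]);
        if (state == FAIL)
            return 0;
        if (is_accepting(state))
            return i + 1;
    }
    return 0;
}
```

Given the input "123+4", the greedy version reports a three-character match and the nongreedy version stops after the first digit.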
2.5.4 Interpreting an NFA - Implementation
The foregoing theory is implemented by the code in terp.c, which starts in Listing 2.35.
The LARGEST_INT macro on line 19 evaluates to the largest positive integer that can be
represented in an int. It is portable: the actual size of an int is immaterial. It works by
shifting a number that contains all 1s to the right by one bit, shifting a 0 into the high
bit. The cast to unsigned defeats sign extension on the right shift. Nfa is a pointer to the
NFA that represents the regular expression, and Nfa_states is the number of states in that
NFA. The nfa() subroutine on line 26 is passed a pointer to an input routine. It creates an
NFA using the routines developed earlier, and returns the state number of the start state
(its index in the array). The free_nfa() routine on line 43 discards the NFA created with
the previous nfa() call.
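The shift trick is easy to check against the standard library. The sketch below repeats the macro as it appears in the listing and compares it with INT_MAX from <limits.h>; the check_largest_int() helper is hypothetical and is not part of terp.c. It assumes the usual arrangement in which unsigned has exactly one more value bit than int, which holds on mainstream compilers.

```c
#include <assert.h>
#include <limits.h>

/* ~0 is an int with every bit set. Casting it to unsigned makes the right
 * shift a logical shift, so a 0 (not a copy of the sign bit) enters the
 * high bit. What remains is a 0 followed by all 1s.
 */
#define LARGEST_INT ((int)(((unsigned)(~0)) >> 1))

/* Hypothetical self-check: the macro should agree with INT_MAX. */
void check_largest_int(void)
{
    assert(LARGEST_INT > 0);
    assert(LARGEST_INT == INT_MAX);
}
```

In new code INT_MAX is the more direct choice; the macro is shown only to make the bit manipulation concrete.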
Listing 2.35. terp.c - File Header
  1  #include <stdio.h>
  2  #include <ctype.h>
  3  #include <tools/debug.h>
  4  #include <tools/set.h>
  5  #include <tools/compiler.h>
  6  #include "nfa.h"                 /* for NFA, EPSILON, CCL */
  7
  8  /*----------------------------------------------------------------------
  9   * Prototypes for subroutines in this file
 10   */
 11
 12  PUBLIC int   nfa        ( char *(*)()           );
 13  PUBLIC void  free_nfa   ( void                  );
 14  PUBLIC SET   *e_closure ( SET*, char **, int *  );
 15  PUBLIC SET   *move      ( SET*, int             );
 16
 17  /*----------------------------------------------------------------------*/
 18
 19  #define LARGEST_INT (int)(((unsigned)(~0)) >> 1)
 20
 21  PRIVATE NFA  *Nfa;               /* Base address of NFA array */
 22  PRIVATE int  Nfa_states;         /* Number of states in NFA   */
 23
 24  /*----------------------------------------------------------------------*/
 25
 26  PUBLIC int nfa( input_routine )
 27  char *(*input_routine)();
 28  {
 29      /* Compile the NFA and initialize the various global variables used by
 30       * move() and e_closure(). Return the state number (index) of the NFA start
 31       * state. This routine must be called before either e_closure() or move()
 32       * are called. The memory used for the nfa can be freed with free_nfa()
 33       * (in thompson.c).
 34       */
 35
 36      NFA *sstate;
 37      Nfa = thompson( input_routine, &Nfa_states, &sstate );
 38      return( sstate - Nfa );
 39  }
 40
 41  /*----------------------------------------------------------------------*/
 42
 43  PUBLIC void free_nfa()
 44  {
 45      free( Nfa );
 46  }
The ε-closure set is computed by e_closure(), in Listing 2.36. It uses the NFA created by a
previous nfa() call. It is passed a pointer to a set of states (input) and returns the
ε-closure set, which is empty if either the input set or the ε-closure set is empty. If the
ε-closure set contains an accepting state, it returns two additional values indirectly
through pointers: *accept is modified to point at the string that holds the action code,
and *anchor is modified to hold the value of the NFA structure's anchor field. *accept is
set to NULL if there is no accepting state in the closure set. Since all members of the
original set are also in the closure set, the output set is created by modifying the input
set. That is, e_closure() returns its own first argument, but the set is modified to
contain the elements of the closure set in addition to the initial elements.
If the ε-closure set contains more than one accepting state, the accepting action that has
the lowest NFA state number is used. This way, conflicting actions that appear higher
(earlier) in the input file take precedence over the ones that occur later. The accept_num
variable declared on line 69 of Listing 2.36 holds the state number of the last-assigned
accepting state. If the current state has a lower number, the other state is overwritten.
accept_num is initialized to the largest positive integer. The algorithm in Table 2.8 is
used. The numbers in the algorithm reference comments in the code.
Table 2.8. Algorithm for Computing ε-Closure
input   set of input states.
i       the current NFA state being examined.
N       state number of a next state that can be reached from State i.

(1) Push all states in the input set onto a stack.
(2) while( the stack is not empty )
(3)     Pop the top element into i.
        if( State i is an accepting state )
            accept = the accept string;
(4)     if( there's an ε transition from State i to State N )
(5)         if( N isn't in the closure set )
(6)             Add N to the closure set.
(7)             Push N onto the stack.
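The stack-based algorithm in the table can be sketched compactly if a set of states is represented as the bits of an unsigned integer. Everything in the sketch is an assumption made for illustration: the five-state NFA (for the hypothetical expression a*b), the STATE structure, the EMPTY and NONE markers, and the closure_of() name. The real e_closure() uses the SET routines and the NFA structure from nfa.h, and returns an accepting string rather than a state number.

```c
#include <assert.h>

#define EPSILON  (-1)    /* edge label: epsilon transition              */
#define EMPTY    (-2)    /* edge label: no outgoing edge                */
#define NONE     (-1)    /* "no next state" and "no accepting state"    */
#define NSTATES  5

typedef struct {
    int edge;            /* a character, EPSILON, or EMPTY              */
    int next;            /* first successor, or NONE                    */
    int next2;           /* second epsilon successor, or NONE           */
    int accepting;       /* nonzero if this is an accepting state       */
} STATE;

/* Hypothetical NFA for a*b. State 0 is the start state. */
static const STATE Nfa[NSTATES] = {
    { EPSILON, 1,    3,    0 },
    { 'a',     2,    NONE, 0 },
    { EPSILON, 1,    3,    0 },
    { 'b',     4,    NONE, 0 },
    { EMPTY,   NONE, NONE, 1 },
};

/* Return the epsilon closure of set (bit i set means State i is a member).
 * *accept gets the lowest-numbered accepting state in the closure, or NONE.
 */
unsigned closure_of(unsigned set, int *accept)
{
    int stack[NSTATES];              /* each state is pushed at most once */
    int tos = 0;
    int i, k;

    *accept = NONE;
    for (i = 0; i < NSTATES; ++i)    /* (1) push every input state        */
        if (set & (1u << i))
            stack[tos++] = i;

    while (tos > 0) {                /* (2) */
        const STATE *p;
        int succ[2];

        i = stack[--tos];            /* (3) */
        p = &Nfa[i];
        if (p->accepting && (*accept == NONE || i < *accept))
            *accept = i;

        if (p->edge != EPSILON)      /* (4) */
            continue;

        succ[0] = p->next;
        succ[1] = p->next2;
        for (k = 0; k < 2; ++k) {
            int n = succ[k];
            if (n != NONE && !(set & (1u << n))) {   /* (5) */
                set |= 1u << n;                      /* (6) */
                stack[tos++] = n;                    /* (7) */
            }
        }
    }
    return set;
}
```

For this NFA, the closure of {0} is {0, 1, 3} and contains no accepting state, while the closure of {4} is just {4}, with State 4 reported as accepting.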
The move set is computed by the move() subroutine in Listing 2.37. It returns either the
set of states that can be reached by making transitions on a specified input character from
a specified set of states, or NULL if there are no such transitions.
The for loop on lines 129 to 143 does the work. The if statement at the top of the loop
evaluates true for every element of the input set, and any next states are added to the
output set on line 140. Note that the output set is not created unless there's something to
put into it.
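The same idea can be sketched with a bit-mask set. The NFA below is a hypothetical five-state machine for a*b with its epsilon edges left out (move never follows them), and the STATE layout and the move_on() name are assumptions for this sketch only. Unlike the real move(), which returns NULL when there are no transitions, the sketch returns an empty mask (0), and it handles only single-character edges, not character classes.

```c
#include <assert.h>

#define NSTATES 5
#define NONE    (-1)

typedef struct {
    int edge;     /* character label; 0 means "no character edge" here */
    int next;     /* target state, or NONE                             */
} STATE;

/* Hypothetical NFA for a*b, character edges only. */
static const STATE Nfa[NSTATES] = {
    { 0,   NONE },    /* 0: epsilon edges only (not shown) */
    { 'a', 2    },    /* 1: --a--> 2                       */
    { 0,   NONE },    /* 2: epsilon edges only (not shown) */
    { 'b', 4    },    /* 3: --b--> 4                       */
    { 0,   NONE },    /* 4: accepting, no outgoing edges   */
};

/* Return the set of states reachable from any member of set by a single
 * transition on c. Bit i of a set is 1 when State i is a member.
 */
unsigned move_on(unsigned set, int c)
{
    unsigned out = 0;
    int      i;

    assert(c != 0);                   /* 0 is reserved for "no edge" */

    for (i = NSTATES; --i >= 0; )
        if ((set & (1u << i)) && Nfa[i].edge == c && Nfa[i].next != NONE)
            out |= 1u << Nfa[i].next;

    return out;
}
```

With the current set {0, 1, 3}, a transition on 'a' yields {2}, a transition on 'b' yields {4}, and any other character yields the empty set.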
14. The algorithm is derived from the one in [Aho], p. 119.
Listing 2.36. terp.c - The ε-Closure Function
 47  PUBLIC SET *e_closure( input, accept, anchor )
 48  SET   *input;
 49  char  **accept;
 50  int   *anchor;
 51  {
 52      /* input is the set of start states to examine.
 53       * *accept is modified to point at the string associated with an accepting
 54       * state (or to NULL if the state isn't an accepting state).
 55       * *anchor is modified to hold the anchor point, if any.
 56       *
 57       * Computes the epsilon closure set for the input states. The output set
 58       * will contain all states that can be reached by making epsilon transitions
 59       * from all NFA states in the input set. Returns an empty set if the input
 60       * set or the closure set is empty, modifies *accept to point at the
 61       * accepting string if one of the elements of the output state is an
 62       * accepting state.
 63       */
 64
 65      int  stack[ NFA_MAX ];               /* Stack of untested states  */
 66      int  *tos;                           /* Stack pointer             */
 67      NFA  *p;                             /* NFA state being examined  */
 68      int  i;                              /* State number of "         */
 69      int  accept_num = LARGEST_INT;
 70
 71      if( !input )
 72          goto abort;
 73
 74      *accept = NULL;                      /* Reference to algorithm:   */
 75      tos     = &stack[-1];                /* 1 */
 76
 77      for( next_member(NULL); (i = next_member(input)) >= 0 ;)
 78          *++tos = i;
 79
 80      while( INBOUNDS(stack, tos) )        /* 2 */
 81      {
 82          i = *tos--;                      /* 3 */
 83          p = &Nfa[ i ];
 84          if( p->accept && (i < accept_num) )
 85          {
 86              accept_num = i;
 87              *accept    = p->accept;
 88              *anchor    = p->anchor;
 89          }
 90
 91          if( p->edge == EPSILON )         /* 4 */
 92          {
 93              if( p->next )
 94              {
 95                  i = p->next - Nfa;
 96                  if( !MEMBER(input, i) )  /* 5 */
 97                  {
 98                      ADD( input, i );     /* 6 */
 99                      *++tos = i;          /* 7 */
100                  }
101              }
102              if( p->next2 )
103              {
104                  i = p->next2 - Nfa;
105                  if( !MEMBER(input, i) )  /* 5 */
106                  {
107                      ADD( input, i );     /* 6 */
108                      *++tos = i;          /* 7 */
109                  }
110              }
111          }
112      }
113  abort:
114      return input;
115  }
Listing 2.37. terp.c - The Move Function
116  PUBLIC SET *move( inp_set, c )
117  SET  *inp_set;                   /* input set                    */
118  int  c;                          /* transition on this character */
119  {
120      /* Return a set that contains all NFA states that can be reached by making
121       * transitions on "c" from any NFA state in "inp_set." Returns NULL if
122       * there are no such transitions. The inp_set is not modified.
123       */
124
125      int  i;
126      NFA  *p;                     /* current NFA state */
127      SET  *outset = NULL;         /* output set        */
128
129      for( i = Nfa_states; --i >= 0 ; )
130      {
131          if( MEMBER(inp_set, i) )
132          {
133              p = &Nfa[i];
134
135              if( p->edge==c || (p->edge==CCL && TEST(p->bitset, c)) )
136              {
137                  if( !outset )
138                      outset = newset();
139
140                  ADD( outset, p->next - Nfa );
141              }
142          }
143      }
144      return( outset );
145  }
Listing 2.38 uses the previous subroutines to build a small egrep-like program that copies
standard input to standard output, printing only those lines that contain a match of a
regular expression passed into the program on the command line. The next_char() subroutine
on line 156 of Listing 2.38 gets the next input character. It buffers entire lines because
terp must print the entire line when a match is found [otherwise getc() could be used].
getline() on line 177 is the input function passed to nfa(). It just passes along the
regular expression that was on the command line. (Expr is initialized on line 215.) The
remainder of the subroutine is a straightforward implementation of the nongreedy algorithm
discussed in the previous section.
Listing 2.38. terp.c - A Simplified Egrep
146  #ifdef MAIN
147  #define ALLOCATE
148  #include "globals.h"             /* externs for Verbose */
149
150  #define BSIZE 256
151
152  PRIVATE char  Buf[ BSIZE ];      /* input buffer                         */
153  PRIVATE char  *Pbuf = Buf;       /* current position in input buffer     */
154  PRIVATE char  *Expr;             /* regular expression from command line */
155
156  int next_char()
157  {
158      if( !*Pbuf )
159      {
160          if( !fgets(Buf, BSIZE, stdin) )
161              return 0;            /* end of input */
162          Pbuf = Buf;
163      }
164      return *Pbuf++;
165  }
166
167  /*----------------------------------------------------------------------*/
168
169  PRIVATE void printbuf()
170  {
171      fputs( Buf, stdout );        /* Print the buffer and force a read */
172      *Pbuf = 0;                   /* on the next call to next_char()   */
173  }
174
175  /*----------------------------------------------------------------------*/
176
177  PRIVATE char *getline()
178  {
179      static int first_time_called = 1;
180
181      if( !first_time_called )
182          return NULL;
183
184      first_time_called = 0;
185      return Expr;
186  }
187
188  /*----------------------------------------------------------------------*/
189
190  main( argc, argv )
191  char **argv;
192  {
193      int   sstate;                /* Starting NFA state              */
194      SET   *start_dfastate;       /* Set of starting nfa states      */
195      SET   *current;              /* current DFA state               */
196      SET   *next;
197      char  *accept;               /* non-NULL if cur. DFA state is an accept */
198      int   c;                     /* current input character         */
199      int   anchor;
200
201      if( argc == 2 )
202          fprintf( stderr, "expression is %s\n", argv[1] );
203      else
204      {
205          fprintf( stderr, "usage: terp pattern <input\n" );
206          exit( 1 );
207      }
208
209      /* 1: Compile the NFA; initialize move() & e_closure().
210       * 2: Create the initial state, the set of all NFA states that can
211       *    be reached by making epsilon transitions from the NFA start state.
212       * 3: Initialize the current state to the start state.
213       */
214
215      Expr   = argv[1];                                        /* 1 */
216      sstate = nfa( getline );
217
218      next = newset();                                         /* 2 */
219      ADD( next, sstate );
220      if( !(start_dfastate = e_closure(next, &accept, &anchor)) )
221      {
222          fprintf( stderr, "Internal error: State machine is empty\n" );
223          exit( 1 );
224      }
225
226      current = newset();                                      /* 3 */
227      assign( current, start_dfastate );
228
229      /* Now interpret the NFA: The next state is the set of all NFA states that
230       * can be reached after we've made a transition on the current input
231       * character from any of the NFA states in the current state. The current
232       * input line is printed every time an accept state is encountered.
233       * The machine is reset to the initial state when a failure transition is
234       * encountered.
235       */
236
237      while( c = next_char() )
238      {
239          if( next = e_closure( move(current, c), &accept, &anchor ) )
240          {
241              if( accept )
242                  printbuf();                  /* accept       */
243              else
244              {                                /* keep looking */
245                  delset( current );
246                  current = next;
247                  continue;
248              }
249          }
250
251          delset( next );                      /* reset        */
252          assign( current, start_dfastate );
253      }
254  }
255  #endif
2.5.5 Subset Construction: Converting an NFA to a DFA - Theory*
You can apply the foregoing procedure to translate NFAs to DFAs by computing the ε-closure
and move sets for every possible input character. This generalized method, called subset
construction, takes advantage of the fact that all NFA states connected with ε edges are
effectively the same state. A single DFA state represents all NFA states that are connected
in this manner. The outgoing edges from the DFA state are just the sum of the outgoing
edges from all states in the ε-closure set. I'll demonstrate with an example. Using the NFA
in Figure 2.14 on page 113, the starting DFA state consists of the ε-closure of the
starting NFA state:
    ε-closure({12}) = {0, 1, 3, 4, 5, 12}            (new DFA State 0)
The next states in the DFA are then computed by figuring move(current_state, c), for every
possible input character. The input alphabet (the set of all possible input characters)
contains only two elements: D and dot (.), so DFA States 1 and 2 are computed as follows:
    move({0, 1, 3, 4, 5, 12}, D) = {2, 8}            (new DFA State 1)
    move({0, 1, 3, 4, 5, 12}, .) = {6}               (new DFA State 2)
This procedure is then applied to the two new states. Starting with State 2:
    DFA State 2 = {6}
    ε-closure({6}) = {6}
    move({6}, .)   = ∅
    move({6}, D)   = {7}                             (new DFA State 3)

    DFA State 1 = {2, 8}
    ε-closure({2, 8})      = {1, 2, 4, 8}
    move({1, 2, 4, 8}, D)  = {2}                     (new DFA State 4)
    move({1, 2, 4, 8}, .)  = {6, 10}                 (new DFA State 5)
If the move operation results in an empty set (as is the case with move({6}, .), above),
then there are no outgoing transitions from that state. Applying the procedure again to the
three new DFA States:
    DFA State 3 = {7}
    ε-closure({7})         = {7, 13, 14}
    move({7, 13, 14}, .)   = ∅
    move({7, 13, 14}, D)   = ∅

    DFA State 4 = {2}
    ε-closure({2})         = {1, 2, 4}
    move({1, 2, 4}, .)     = {6}                     (existing DFA State 2)
    move({1, 2, 4}, D)     = {2}                     (existing DFA State 4)

    DFA State 5 = {6, 10}
    ε-closure({6, 10})             = {6, 9, 10, 13, 14}
    move({6, 9, 10, 13, 14}, .)    = ∅
    move({6, 9, 10, 13, 14}, D)    = {7, 11}         (new DFA State 6)
DFA States 3 and 5 are accepting states because their ε-closure sets all contain an
accepting state (NFA State 14). This last expansion introduced only one new state, which is
now expanded:
    DFA State 6 = {7, 11}
    ε-closure({7, 11})             = {7, 9, 11, 13, 14}
    move({7, 9, 11, 13, 14}, .)    = ∅
    move({7, 9, 11, 13, 14}, D)    = {11}            (new DFA State 7)
And then this new state is expanded in turn:
    DFA State 7 = {11}
    ε-closure({11})                = {9, 11, 13, 14}
    move({9, 11, 13, 14}, .)       = ∅
    move({9, 11, 13, 14}, D)       = {11}            (existing DFA State 7)
DFA States 6 and 7 are accepting states because they contain NFA accepting State 14. The
process is summarized in Table 2.9, and the complete DFA is pictured in Figure 2.15. The
procedure is formalized with the algorithm in Table 2.10 (see note 15). Note that the state
machine that this algorithm generates is not optimal. It has several more states than
necessary. I'll look at how to remove these extra states in a moment.
Table 2.9. Converting NFA in Figure 2.14 (page 113) to a DFA

DFA     NFA                           move(D)          move(.)
State   State    ε-closure            NFA      DFA     NFA      DFA    Accepting
0       {0}      {0,1,3,4,5,12}       {2,8}    1       {6}      2      no
1       {2,8}    {1,2,4,8}            {2}      4       {6,10}   5      no
2       {6}      {6}                  {7}      3       ∅        -      no
3       {7}      {7,13,14}            ∅        -       ∅        -      yes
4       {2}      {1,2,4}              {2}      4       {6}      2      no
5       {6,10}   {6,9,10,13,14}       {7,11}   6       ∅        -      yes
6       {7,11}   {7,9,11,13,14}       {11}     7       ∅        -      yes
7       {11}     {9,11,13,14}         {11}     7       ∅        -      yes
Figure 2.15. DFA for NFA in Figure 2.14
[Figure not reproduced: state diagram of the DFA summarized in Table 2.9.]
Table 2.10. Algorithm to Convert an NFA to a DFA
Dstates   is an array of DFA states. Each state is represented by a set of NFA states and a
          Boolean mark field.
Dtran     is the DFA transition matrix, an array as wide as the input character set and as
          deep as the maximum number of states.
          Dtran[current_state][input_character] = the next state
i         is the index in Dtran of the current state.
Nstates   is the index in Dtran of the place where a new state will be inserted.
S         is a set of NFA states that defines a DFA state.
c         is a possible input symbol.

Initially: Dstates[0].set = ε-closure of NFA start state
           Dstates[0] is not marked.
           Nstates = 1.
           i = 0.
           S is empty.
           Dtran is initialized to all failure transitions.

while( there is an unmarked state at Dstates[i] )
{
    Dstates[i].mark = TRUE;
    for( each input symbol c )
    {
        S = ε-closure( move( Dstates[i].set, c ) );
        if( S is not ∅ )
        {
            if( a state with set=S isn't in Dstates )
                add a new unmarked state at Dstates[Nstates], with set=S
                Dtran[i][c] = Nstates++
            else
                Dtran[i][c] = index of existing state in Dstates
        }
    }
}

15. Adapted from [Aho], p. 118.
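The subset-construction loop can be sketched end to end on a small machine. The sketch makes several simplifying assumptions that are not in the book's code: a set of NFA states is a bit mask in an unsigned, the NFA is a hypothetical five-state machine for a*b, the alphabet has only the two symbols a and b, the closure is computed by iterating to a fixed point rather than with a stack, and a state counts as marked once the loop index has passed it. The names closure(), move_on(), and subset_construction() are illustrative only.

```c
#include <assert.h>

#define EPSILON  (-1)
#define NONE     (-1)
#define F        (-1)            /* failure transition                  */
#define NFA_N    5
#define DFA_MAX  32
#define NSYMS    2

typedef struct { int edge, next, next2; } STATE;

static const STATE Nfa[NFA_N] = {          /* hypothetical NFA for a*b  */
    { EPSILON, 1,    3    },
    { 'a',     2,    NONE },
    { EPSILON, 1,    3    },
    { 'b',     4,    NONE },
    { 0,       NONE, NONE },               /* accepting, no edges       */
};
static const char Alphabet[NSYMS] = { 'a', 'b' };

unsigned Dstates[DFA_MAX];       /* NFA-state set for each DFA state    */
int      Dtran[DFA_MAX][NSYMS];  /* DFA transition matrix               */
int      Nstates;                /* number of DFA states created so far */

static unsigned closure(unsigned set)
{
    unsigned old;
    int      i;

    do {                         /* repeat until no new state is added  */
        old = set;
        for (i = 0; i < NFA_N; ++i)
            if ((set & (1u << i)) && Nfa[i].edge == EPSILON) {
                if (Nfa[i].next  != NONE) set |= 1u << Nfa[i].next;
                if (Nfa[i].next2 != NONE) set |= 1u << Nfa[i].next2;
            }
    } while (set != old);
    return set;
}

static unsigned move_on(unsigned set, int c)
{
    unsigned out = 0;
    int      i;

    for (i = 0; i < NFA_N; ++i)
        if ((set & (1u << i)) && Nfa[i].edge == c && Nfa[i].next != NONE)
            out |= 1u << Nfa[i].next;
    return out;
}

static int in_dstates(unsigned set)      /* linear search, -1 if absent */
{
    int i;
    for (i = 0; i < Nstates; ++i)
        if (Dstates[i] == set)
            return i;
    return -1;
}

/* Build Dstates and Dtran from the NFA start state; return Nstates. */
int subset_construction(int start)
{
    int      i, c, found;
    unsigned s;

    Nstates = 0;
    Dstates[Nstates++] = closure(1u << start);

    for (i = 0; i < Nstates; ++i) {      /* states below i are "marked" */
        for (c = 0; c < NSYMS; ++c) {
            s = closure(move_on(Dstates[i], Alphabet[c]));
            if (s == 0) {
                Dtran[i][c] = F;
                continue;
            }
            found = in_dstates(s);
            if (found < 0) {             /* new, unmarked DFA state     */
                assert(Nstates < DFA_MAX);
                Dstates[Nstates] = s;
                found = Nstates++;
            }
            Dtran[i][c] = found;
        }
    }
    return Nstates;
}
```

For this NFA the construction produces three DFA states: {0,1,3}, {1,2,3}, and the accepting state {4}. The first two both go to {1,2,3} on a and to {4} on b; {4} has only failure transitions.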
2.5.6 Subset Construction: Converting an NFA to a DFA - Implementation
This section shows how LeX applies the theory in the previous section to make the output
DFA. It uses the e_closure() and move() functions to build a complete DFA transition matrix
from a previously constructed NFA. As before, you need to start with some constant and
data-structure definitions, found in dfa.h, Listing 2.39.

Listing 2.39. dfa.h - Definitions for a DFA
  1  /*----------------------------------------------------------------------
  2   * DFA.H: The following definitions are used in dfa.c and in
  3   * minimize.c to represent DFA's.
  4   *----------------------------------------------------------------------
  5   */
  6
  7  #define DFA_MAX 254              /* Maximum number of DFA states. If this
  8                                    * number is made larger, you'll have to
  9                                    * change the output routines and driver.
 10                                    * States are numbered from 0 to DFA_MAX-1
 11                                    */
 12  typedef unsigned char TTYPE;     /* This is the type of the output DFA
 13                                    * transition table (the internal one is
 14                                    * an array of int). It is used only to
 15                                    * figure the various table sizes printed
 16                                    * by -V.
 17                                    */
 18  #define F          -1            /* Marks failure states in the table.    */
 19  #define MAX_CHARS  128           /* Maximum width of dfa transition table */
 20
 21  typedef int ROW[ MAX_CHARS ];    /* One full row of Dtran, which is itself
 22                                    * an array, DFA_MAX elements long, of
 23                                    * ROWs.
 24                                    */
 25  /*----------------------------------------------------------------------*/
 26  typedef struct accept
 27  {
 28      char  *string;   /* Accepting string; NULL if nonaccepting            */
 29      int   anchor;    /* Anchor point, if any. Values are defined in NFA.H */
 30  } ACCEPT;
 31
 32
 33  /*----------------------------------------------------------------------
 34   * External subroutines:
 35   */
 36
 37  SET   *e_closure ( SET*, char **, int *           );   /* terp.c     */
 38  void  free_nfa   ( void                           );   /* terp.c     */
 39  SET   *move      ( SET*, int                      );   /* terp.c     */
 40  int   nfa        ( char *(*)()                    );   /* terp.c     */
 41  int   dfa        ( char *(*)(), ROW*[], ACCEPT**  );   /* dfa.c      */
 42  int   min_dfa    ( char *(*)(), ROW*[], ACCEPT**  );   /* minimize.c */
 43  int   ttypesize  ( int                            );
 44  void  pdefines   ( FILE*, int                     );
 45  int   columns    ( FILE*, ROW*, int, int, char*   );   /* columns.c  */
 46  void  cnext      ( FILE*, char*                   );   /* columns.c  */
 47  void  pheader    ( FILE *fp, ROW dtran[],
 48                     int nrows, ACCEPT *accept      );   /* print.c    */
 49  void  pdriver    ( FILE *output, int nrows, ACCEPT *accept ); /* print.c */

The maximum number of DFA states is defined on line seven of Listing 2.39 to be 254. This
number is more than adequate for most applications and has the added advantage of fitting
into an unsigned char. The tables used for the transition matrix can be smaller as a
consequence. You can make the number larger if you wish, but you'll have to change the
driver discussed at the beginning of this chapter to compensate for the change in type. The
TTYPE definition on line 12 is the type of the output tables; it must match the declared
type of the tables in the state-machine driver. The internal versions of the same tables
are all arrays of int. The F macro, defined on line 18, marks failure transitions in the
internal tables. It must be -1, but hiding the -1 in a macro makes the code a little more
readable. The ROW type on line 21 is one row of the DFA transition matrix. Declaring a row
as a distinct type helps you use pointers to traverse individual columns, because
incrementing a pointer to an entire row will skip as many array elements as are in the row,
effectively moving you down to the next element of the current column. For example, the
following code allocates a NUM_ROWS element array (a two-dimensional array of size
MAX_CHARS x NUM_ROWS) and then prints column 6. It initializes the column pointer (p) to
the head of column 6, and moves the pointer to the next element of the same column with a
simple increment on each iteration of the loop. Note that two stars are needed to get an
array element in the printf() statement (see note 16).

    #define NUM_ROWS 10

    static ROW  Array[ NUM_ROWS ];
    ROW         *p;
    int         i;

    p = (ROW *)( &Array[0][6] );             /* Initialize p to head */
                                             /* of column 6.         */
    for( i = NUM_ROWS; --i >= 0; printf("%d ", **p++) )
        ;
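A complete version of the column walk is sketched below. It differs from the fragment above in two ways, both of which are choices made for this sketch rather than anything taken from dfa.c: the pointer starts at the first row and the column is selected with (*p)[col], which avoids the cast, and the values are summed instead of printed so that the result is easy to check. The fill_array() and column_sum() helpers are hypothetical.

```c
#include <assert.h>

#define MAX_CHARS 128
#define NUM_ROWS  10

typedef int ROW[MAX_CHARS];          /* one row, as in dfa.h */

static ROW Array[NUM_ROWS];

/* Give every element a recognizable value: row * 1000 + column. */
void fill_array(void)
{
    int r, c;

    for (r = 0; r < NUM_ROWS; ++r)
        for (c = 0; c < MAX_CHARS; ++c)
            Array[r][c] = r * 1000 + c;
}

/* Sum one column. p has type "pointer to ROW", so ++p advances by a whole
 * row (MAX_CHARS ints) and (*p)[col] is the element in the same column of
 * the next row.
 */
long column_sum(int col)
{
    ROW   *p  = Array;
    long  sum = 0;
    int   i;

    assert(0 <= col && col < MAX_CHARS);

    for (i = NUM_ROWS; --i >= 0; ++p)
        sum += (*p)[col];

    return sum;
}
```

After fill_array(), column 6 holds 6, 1006, 2006, and so on through 9006, so column_sum(6) is 45060.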
Returning to Listing 2.39, the ACCEPT structure on line 26 keeps track of the accepting
strings, which are copied to an array of ACCEPT structures so that the memory used for the
NFA can be discarded once it is translated into a DFA. The remainder of the file is just
function prototypes.
The actual conversion routines start in Listing 2.40 with a few more definitions. The
DFA_STATE structure, typedefed on lines 23 to 31, represents a single state in the DFA. The
group field is used for minimization, discussed in the next section. The accept and anchor
fields are the accepting string and anchor point copied from the NFA state in the case of
an accepting state; set is the set of NFA states that comprise the current DFA state.
Dstates, an array of DFA_STATE structures, is defined on line 33. The DFA state number is
the array index. The actual transition matrix Dtran is declared on line 36, and the number
of states Nstates is declared on the next line.
The routines that create the DFA are all local to dfa.c and are accessed via the dfa()
routine, also in Listing 2.40. It is passed an input function, and returns the number of
states in the transition matrix, the matrix itself, and an array holding the accepting
strings (the last two are returned indirectly through pointers). For the most part, this
function is just allocating memory for the new tables and discarding memory for other
tables once they're no longer needed. The NFA is constructed on line 69; it's converted to
a DFA on line 81; the accepting strings are copied to the target array on line 91; and if
verbose mode is active, a diagnostic message is printed on lines 101 to 114 of Listing
2.40.
16. C note: You need two stars because p points at a two-dimensional array. To see why, in
C the square brackets can always be interpreted with the following identity:
p[i] ≡ *(p+i). If i is 0, then p[0] ≡ *p. Applying this identity to a two-dimensional
array: p[0][0] ≡ (*p)[0] ≡ **p. Looked at another way, if p is a pointer to an array, then
*p must be the array itself. But arrays are always represented in C by a pointer to the
first element. Consequently, if p is a pointer to an array of int, then *p must be an array
of int, which is represented by a pointer to the first element: a pointer to an int. Both
p and *p will hold the same address, but p is of type pointer to array of int and *p is of
type pointer to int.
Using a pointer here is much more efficient than something like this:
    for( i = 0; i < NUM_ROWS; i++ )
        printf( "%d ", Array[i][6] );
because the array index forces the compiler to multiply i by the size of a row at every
iteration of the loop. The following is better, but still not as good as the original
version:
    ROW *p;
    for( p = Array, i = NUM_ROWS; --i >= 0; printf("%d ", (*p++)[6]) )
        ;
Listing 2.40. dfa.c - NFA to DFA Conversion: File Header and Access Function
  2  #ifdef MSDOS
  4  #else
  5  #   include <malloc.h>
  6  #endif
  8  #include <tools/set.h>
  9  #include <tools/compiler.h>
 10  #include "dfa.h"
 11  #include "globals.h"             /* externs for Verbose, etc. */
 12
 13  /*----------------------------------------------------------------------
 14   * DFA.C  Make a DFA transition table from an NFA created with
 15   *        Thompson's construction.
 16   *----------------------------------------------------------------------
 17   * Dtran is the deterministic transition table. It is indexed by state number
 18   * along the major axis and by input character along the minor axis. Dstates
 19   * is a list of deterministic states represented as sets of NFA states.
 20   * Nstates is the number of valid entries in Dtran.
 21   */
 22
 23  typedef struct dfa_state
 24  {
 25      unsigned  group : 8;         /* Group id, used by minimize()     */
 26      unsigned  mark  : 1;         /* Mark used by make_dtran()        */
 27      char      *accept;           /* accept action if accept state    */
 28      int       anchor;            /* Anchor point if an accept state  */
 29      SET       *set;              /* Set of NFA states represented by */
 30
 31  } DFA_STATE;                     /* this DFA state                   */
 32
 33  PRIVATE DFA_STATE  *Dstates;     /* DFA states table                 */
 34  /*----------------------------------------------------------------------*/
 35
 36  PRIVATE ROW        *Dtran;       /* DFA transition table             */
 37  PRIVATE int        Nstates;      /* Number of DFA states             */
 38  PRIVATE DFA_STATE  *Last_marked; /* Most-recently marked DFA state in Dtran */
 39
 40  extern char *bin_to_ascii( int, int );      /* in compiler.lib */
 41
 42  /*----------------------------------------------------------------------
 43   * Prototypes for subroutines in this file:
 44   */
 45
 46  PRIVATE DFA_STATE  *get_unmarked  ( void );
 47  PRIVATE int        in_dstates     ( SET* );
 48  PRIVATE void       free_sets      ( void );
 49  PRIVATE void       make_dtran     ( int );
 50  PUBLIC  int        dfa            ( char *(*)(), ROW*[], ACCEPT** );
 51  /*----------------------------------------------------------------------*/
 52
 53  int dfa( ifunct, dfap, acceptp )
 54  char    *( *ifunct )();
 55  ROW     *( dfap[] );
 56  ACCEPT  *( *acceptp );
 57  {
 58      /* Turns an NFA with the indicated start state (sstate) into a DFA and
 59       * returns the number of states in the DFA transition table. *dfap is
 60       * modified to point at that transition table and *acceptp is modified
 61       * to point at an array of accepting states (indexed by state number).
 62       * dfa() discards all the memory used for the initial NFA.
 63       */
 64
 65      ACCEPT  *accept_states;
 66      int     i;
 67      int     start;
 68
 69      start       = nfa( ifunct );                 /* make the nfa */
 70      Nstates     = 0;
 71      Dstates     = (DFA_STATE *) calloc( DFA_MAX, sizeof(DFA_STATE) );
 72      Dtran       = (ROW *)       calloc( DFA_MAX, sizeof(ROW)       );
 73      Last_marked = Dstates;
 74
 75      if( Verbose )
 76          fputs( "making DFA: ", stdout );
 77
 78      if( !Dstates || !Dtran )
 79          ferr( "Out of memory!" );
 80
 81      make_dtran( start );         /* convert the NFA to a DFA            */
 82      free_nfa();                  /* Free the memory used for the nfa    */
 83                                   /* itself (but not the accept strings). */
 84
 85      Dtran         = (ROW *)   realloc( Dtran, Nstates * sizeof(ROW) );
 86      accept_states = (ACCEPT*) malloc ( Nstates * sizeof(ACCEPT) );
 87
 88      if( !accept_states || !Dtran )
 89          ferr( "Out of memory!!" );
 90
 91      for( i = Nstates; --i >= 0 ; )
 92      {
 93          accept_states[i].string = Dstates[i].accept;
 94          accept_states[i].anchor = Dstates[i].anchor;
 95      }
 96
 97      free( Dstates );
 98      *dfap    = Dtran;
 99      *acceptp = accept_states;
100
101      if( Verbose )
102      {
103          printf( "\n%d out of %d DFA states in initial machine.\n",
104                                                  Nstates, DFA_MAX );
105
106          printf( "%d bytes required for uncompressed tables.\n\n",
107                  Nstates * MAX_CHARS * sizeof(TTYPE)      /* dtran  */
108                + Nstates * sizeof(TTYPE) );               /* accept */
109
110          if( Verbose > 1 )
111          {
112              printf( "The un-minimized DFA looks like this:\n\n" );
113              pheader( stdout, Dtran, Nstates, accept_states );
114          }
115      }
116
117      return Nstates;
118  }
Several support functions are needed to do the work, all in Listing 2.41. The
add_to_dstates() function on line 119 adds a new DFA state to the Dstates array and
increments the number-of-states counter, Nstates. It returns the state number (the index in
Dstates) of the newly added state. in_dstates() on line 139 is passed a set of NFA states,
and returns the state number of an existing state that uses the same set (or -1 if there is
no such state). The routine just does a linear search, which is probably not the best
strategy here. Something like a binary tree would be better.
Listing 2.41. dfa.c NFA to DFA Conversion: Support Functions
119  int add_to_dstates( NFA_set, accepting_string, anchor )
120  SET   *NFA_set;
121  char  *accepting_string;
122  int   anchor;
123  {
124      int nextstate;
125
126      if( Nstates > (DFA_MAX-1) )
127          ferr( "Too many DFA states\n" );
128
129      nextstate = Nstates++ ;
130      Dstates[ nextstate ].set    = NFA_set;
131      Dstates[ nextstate ].accept = accepting_string;
132      Dstates[ nextstate ].anchor = anchor;
133
134      return nextstate;
135  }
136
137  /*----------------------------------------------------------------------*/
138
139  PRIVATE int in_dstates( NFA_set )
140  SET *NFA_set;
141  {
142      /* If there's a set in Dstates that is identical to NFA_set, return the
143       * index of the Dstate entry, else return -1.
144       */
145
146      DFA_STATE *p;
147
148      for( p = &Dstates[Nstates]; --p >= Dstates ; )
149          if( IS_EQUIVALENT( NFA_set, p->set ) )
150              return( p - Dstates );
151
152      return( -1 );
153  }
154
155  /*----------------------------------------------------------------------*/
156
157  PRIVATE DFA_STATE *get_unmarked()
158  {
159      /* Return a pointer to an unmarked state in Dstates. If no such state
160       * exists, return NULL. Print an asterisk for each state to tell the
161       * user that the program hasn't died while the table is being constructed.
162       */
163
164      for( ; Last_marked < &Dstates[Nstates] ; ++Last_marked )
165      {
166          if( !Last_marked->mark )
167          {
168              putc( '*', stderr );
169              fflush( stderr );
170
171              if( Verbose > 1 )
172              {
173                  fputs( "\n", stdout );
174                  printf( "working on DFA state %d = NFA states: ",
175                                                  Last_marked - Dstates );
176                  pset( Last_marked->set, fprintf, stdout );
177                  putchar( '\n' );
178              }
179
180              return Last_marked;
181          }
182      }
183      return NULL;
184  }
185
186  /*----------------------------------------------------------------------*/
187
188  PRIVATE void free_sets()
189  {
190      /* Free the memory used for the NFA sets in all Dstate entries. */
191
192      DFA_STATE *p;
193
194      for( p = &Dstates[Nstates]; --p >= Dstates ; )
195          delset( p->set );
196  }
A pointer to the next unmarked state is returned by get_unmarked(), declared on
line 157. The Last_marked variable was initialized by dfa() on line 38 of Listing
2.40. Since new states are always added to the end of the table, everything above
Last_marked will have been marked with a previous expansion, so these states don't
have to be examined now. The asterisk that's printed on line 168 of Listing 2.41 actually
has two purposes. It tells the user that the program hasn't hung (it takes a while to make
the tables), and it lets the user terminate the program with a Ctrl-Break (a SIGINT under
UNIX). It's needed because MS-DOS ignores Ctrl-Breaks until a system call of some sort is
made by a program. No such call is made while creating the tables, but a system call is
required to print the asterisks.
Finally, free_sets() on line 188 of Listing 2.41 goes through the entire Dstates
array, deallocating the memory used for the sets of NFA states.
The actual NFA to DFA conversion is done by make_dtran(), in Listing 2.42. It is
a straightforward implementation of the algorithm in Table 2.10 on page 124.
Listing 2.42. dfa.c NFA to DFA Conversion: Conversion Function
197  PRIVATE void make_dtran( sstate )
198  int sstate;                    /* Starting NFA state.                */
199  {
200      SET        *NFA_set;       /* Set of NFA states that define      */
201                                 /* the next DFA state.                */
202      DFA_STATE  *current;       /* State currently being expanded.    */
203      int        nextstate;      /* Goto DFA state for current char.   */
204      char       *isaccept;      /* Current DFA state is an accept     */
205                                 /* (this is the accepting string).    */
206      int        anchor;         /* Anchor point, if any.              */
207      int        c;              /* Current input character.           */
208
209      /* Initially Dstates contains a single, unmarked, start state formed by
210       * taking the epsilon closure of the NFA start state. So, Dstates[0]
211       * (and Dtran[0]) is the DFA start state.
212       */
213
214      NFA_set = newset();
215      ADD( NFA_set, sstate );
216
217      Nstates         = 1;
218      Dstates[0].set  = e_closure( NFA_set, &Dstates[0].accept, &Dstates[0].anchor );
219      Dstates[0].mark = 0;
220
221      while( current = get_unmarked() )           /* Make the table */
222      {
223          current->mark = 1;
224
225          for( c = MAX_CHARS ; --c >= 0 ; )
226          {
227              if( NFA_set = move( current->set, c ) )
228                  NFA_set = e_closure( NFA_set, &isaccept, &anchor );
229
230              if( !NFA_set )                      /* no outgoing transitions */
231                  nextstate = F;
232
233              else if( (nextstate = in_dstates(NFA_set)) != -1 )
234                  delset( NFA_set );
235
236              else
237                  nextstate = add_to_dstates( NFA_set, isaccept, anchor );
238
239              Dtran[ current - Dstates ][ c ] = nextstate ;
240          }
241      }
242
243      putc( '\n', stderr );      /* Terminate string of *'s printed in
244                                  * get_unmarked();
245                                  */
246
247      free_sets();               /* Free the memory used for the DFA_STATE sets */
248  }
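The conversion loop can also be seen in isolation in the following minimal sketch. It is not the dfa.c code: it assumes sets small enough to fit in a 32-bit mask, a two-symbol alphabet, and a hypothetical four-state NFA for a*b. All names here (Set, closure(), move_on(), build_dfa(), FAIL) are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NFA_STATES 4
#define NSYMS      2            /* symbol 0 is 'a', symbol 1 is 'b'       */
#define MAX_DFA    16
#define FAIL       (-1)         /* plays the role of F in the listing     */

typedef uint32_t Set;           /* bit i set <=> NFA state i is a member  */

/* Hypothetical NFA for a*b:
 * 0 -eps-> 1, 0 -eps-> 2, 1 -a-> 0, 2 -b-> 3 (state 3 accepts).
 */
static const Set eps[NFA_STATES]         = { 0x6, 0, 0, 0 };
static const Set edge[NFA_STATES][NSYMS] = { {0, 0}, {0x1, 0}, {0, 0x8}, {0, 0} };

static Set closure(Set s)                /* epsilon closure of s */
{
    Set prev;
    do {
        prev = s;
        for (int i = 0; i < NFA_STATES; i++)
            if (s & (1u << i))
                s |= eps[i];
    } while (s != prev);
    return s;
}

static Set move_on(Set s, int c)         /* states reachable from s on c */
{
    Set out = 0;
    for (int i = 0; i < NFA_STATES; i++)
        if (s & (1u << i))
            out |= edge[i][c];
    return out;
}

/* Fill dtran[][] and dstates[]; return the number of DFA states made. */
int build_dfa(int dtran[MAX_DFA][NSYMS], Set dstates[MAX_DFA])
{
    int nstates = 1;
    dstates[0] = closure(1u << 0);               /* DFA start state */

    /* States are appended at the end, so a simple index can stand in
     * for the "next unmarked state" search. */
    for (int cur = 0; cur < nstates; cur++) {
        for (int c = 0; c < NSYMS; c++) {
            Set next = move_on(dstates[cur], c);
            if (next == 0) {                     /* no outgoing transition */
                dtran[cur][c] = FAIL;
                continue;
            }
            next = closure(next);

            int found = -1;                      /* linear search, as in in_dstates() */
            for (int k = 0; k < nstates; k++)
                if (dstates[k] == next)
                    found = k;

            if (found < 0) {                     /* a new DFA state */
                assert(nstates < MAX_DFA);
                dstates[nstates] = next;
                found = nstates++;
            }
            dtran[cur][c] = found;
        }
    }
    return nstates;
}
```

For this NFA the sketch produces two DFA states: the start state {0,1,2}, which loops to itself on 'a', and {3}, which is reached on 'b' and has no outgoing transitions.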
2.5.7 DFA Minimization: Theory*
The DFA created earlier (in Figure 2.15 on page 123) is not an optimal DFA. Only
six of the eight states are actually required. A somewhat simplified version of the
transition matrix for the original machine is reproduced in Table 2.11. A quick inspection of
this table shows that States 6 and 7 are equivalent because they have identical next-state
transitions and they're both accepting states. If all transitions into State 6 were replaced
with transitions to State 7, the machine would still accept the same strings as before.
One additional state can be eliminated, however, and simple inspection is not good
enough for finding this extra state. I'll discuss how to find the extra state in this section.
Table 2.11. DFA Transition Matrix for DFA in Figure 2.15

                       Lookahead
    Current state      D      .      Accepting
          0            1      2         no
          1            4      5         no
          2            3                no
          3                             yes
          4            4      2         no
          5            6                yes
          6            7                yes
          7            7                yes
You find equivalent states by systematically eliminating those states that can't be
equivalent: partitioning the initial array into potentially equivalent states, and then
gradually partitioning the partitions. When you're done, states that share a partition are
equivalent. Initially, you partition the matrix into two parts, one for accepting states and
another for nonaccepting states:
    State    D    .    Partition
      0      1    2
      1      4    5        0
      2      3
      4      4    2
    ------------------------------
      3
      5      6             1
      6      7
      7      7
(The partition number is on the right.) The implicit failure state is alone in a special,
unnumbered partition that's not shown in the foregoing table. The next step in the
minimization process creates a set of new partitions by looking at the goto transitions
for each state and eliminating states that are not equivalent. If the outgoing transitions
from two states go to different partitions, then the states are not equivalent. You do the
elimination on a column-by-column basis. Starting with the D column, States 0, 1, and 4
all go to a state in Partition 0 on a D, but State 2 goes to Partition 1 on a D.
Consequently, State 2 must be removed from Partition 0. Formally, you say that State 2 is
distinguished from the other states in the partition by a D. Continuing in this manner, State
3 is also distinguished from States 5, 6, and 7 by a D because State 3 goes to the failure
state on a D, but the other states go to a state in Partition 1. The failure state is not in
Partition 1, so State 3 is distinguished by a D.
    State    D    .    Partition
      0      1    2
      1      4    5        0
      4      4    2
    ------------------------------
      2      3             2
    ------------------------------
      3                    3
    ------------------------------
      5      6
      6      7             1
      7      7
Now you go down the dot (.) column. The dot distinguishes State 1 from States 0 and 4
because State 1 goes to a state in Partition 1 on a dot, but States 0 and 4 both go to states
in Partition 2. The new partitions are:
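    State    D    .    Partition
      0      1    2        0
      4      4    2
    ------------------------------
      1      4    5        4
    ------------------------------
      2      3             2
    ------------------------------
      3                    3
    ------------------------------
      5      6
      6      7             1
      7      7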
Next, you go through the array a second time, column by column. Now, D distinguishes
State 0 from State 4 because State 0 goes to a state in Partition 4 on a D, but State 4 goes
to a state in Partition 0 on a D. No other states can be distinguished from each other, so
we're done. The final partitions look like this:
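    State    D    .    Partition
      0      1    2        0
    ------------------------------
      4      4    2        5
    ------------------------------
      1      4    5        4
    ------------------------------
      2      3             2
    ------------------------------
      3                    3
    ------------------------------
      5      6
      6      7             1
      7      7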
The next step is to build a new transition matrix. Each partition is a single state in
the minimized DFA, and all next states in the original table are replaced by the number
of the partition in which the state is found. For example, since States 5, 6, and 7 are all
in Partition 1, all references to one of these states in the original table are replaced by a
reference to the new State 1. The new table looks like this:
    State    D    .
      0      4    2
      1      1
      2      3
      3
      4      5    1
      5      5    2
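The renumbering step can be checked mechanically. The sketch below is only an illustration: it hard-codes the original matrix of Table 2.11 and the final partition numbers worked out above, and the names old_dtran, ingroup, and build_new_table() are hypothetical. It takes the first state in each partition as the representative and replaces every next state by the partition that contains it.

```c
#include <assert.h>

#define NSTATES 8
#define NSYMS   2               /* column 0 is D, column 1 is the dot */
#define F       (-1)            /* failure transition                 */

/* Original transition matrix (Table 2.11). */
static const int old_dtran[NSTATES][NSYMS] = {
    {1, 2}, {4, 5}, {3, F}, {F, F}, {4, 2}, {6, F}, {7, F}, {7, F}
};

/* Final partition number of each original state. */
static const int ingroup[NSTATES] = { 0, 4, 2, 3, 5, 1, 1, 1 };

/* Build one row per partition from the partition's first member. */
void build_new_table(int ngroups, int new_dtran[][NSYMS])
{
    for (int g = 0; g < ngroups; g++) {
        int rep = 0;                            /* representative state */
        while (rep < NSTATES && ingroup[rep] != g)
            rep++;
        assert(rep < NSTATES);                  /* every group is nonempty */

        for (int c = 0; c < NSYMS; c++) {
            int old_next = old_dtran[rep][c];
            new_dtran[g][c] = (old_next == F) ? F : ingroup[old_next];
        }
    }
}
```

Calling build_new_table(6, table) fills table with the six rows of the new matrix shown above, with F standing in for the empty cells.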
The algorithm is formalized in Table 2.12. The transition diagram for the new state
machine is in Figure 2.16.
Table 2.12. DFA-Minimization Algorithm

    c            A character in the alphabet used by the DFA.
    group        A set of potentially equivalent states (a partition of the original transition matrix).
    groups       A collection of groups.
    new          The set of states that have been distinguished from other states in the current group.
    first        First state in a group.
    next         One of the other states in a group, FALSE if there is no such state.
    goto_first   A transition on c out of the first state comes here.
    goto_next    A transition on c out of the next state comes here.

Initially: Partition the original states into a series of groups. All nonaccepting states are in a single
group, and accepting states that have the same accepting string are grouped together. A
one-element group containing a single accepting state is permissible.

Repeat the following until no new groups are added to groups:

    for( each group in groups )
    {
        new   = the empty set
        first = the first state in the current group
        next  = the next state of the current group or FALSE if none
        while( next )
        {
            for( each character c )
            {
                goto_first = state reached by making a transition on c out of first.
                goto_next  = state reached by making a transition on c out of next.
                if( goto_first is not in the same group as goto_next )
                    move next from the current group into new
            }
            next = the next state in the current group or FALSE if none.
        }
        if( new is not empty )
            add it to groups
    }
Figure 2.16. Minimized DFA for (D*\.D|D\.D*)
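The algorithm in Table 2.12 can be tried out on the machine of Table 2.11 with a sketch like the following. It is a simplified illustration, not the minimize.c code: it keeps only a state-to-group array instead of real sets, assumes that all accepting states share one accepting string, and uses the hypothetical names dtran, accepting, group_of(), and minimize_groups(). Because it splits group by group rather than column by column, the group numbers it assigns differ from the hand-worked partition numbers, but the groupings themselves are the same.

```c
#include <assert.h>

#define NSTATES 8
#define NSYMS   2               /* column 0 is D, column 1 is the dot */
#define F       (-1)            /* failure transition                 */

static const int dtran[NSTATES][NSYMS] = {
    {1, 2}, {4, 5}, {3, F}, {F, F}, {4, 2}, {6, F}, {7, F}, {7, F}
};
static const int accepting[NSTATES] = { 0, 0, 0, 1, 0, 1, 1, 1 };

/* The failure state lives in its own pseudo-group, numbered -1. */
static int group_of(const int ingroup[], int state)
{
    return (state == F) ? -1 : ingroup[state];
}

/* Fill ingroup[] with a group number per state; return the group count. */
int minimize_groups(int ingroup[NSTATES])
{
    int ngroups = 2;
    int changed;

    for (int s = 0; s < NSTATES; s++)       /* initial partition */
        ingroup[s] = accepting[s];          /* 0: nonaccepting, 1: accepting */

    do {
        changed = 0;
        for (int g = 0; g < ngroups; g++) {
            int first = -1, made_new = 0;

            for (int next = 0; next < NSTATES; next++) {
                if (ingroup[next] != g)
                    continue;
                if (first < 0) {            /* first member stays put */
                    first = next;
                    continue;
                }
                for (int c = 0; c < NSYMS; c++) {
                    if (group_of(ingroup, dtran[first][c]) !=
                        group_of(ingroup, dtran[next][c])) {
                        ingroup[next] = ngroups;    /* move to new group */
                        made_new = 1;
                        break;
                    }
                }
            }
            if (made_new) {
                ++ngroups;
                changed = 1;
            }
        }
    } while (changed);

    return ngroups;
}
```

On this machine the function returns 6. States 5, 6, and 7 end up sharing a group, and each of the other five states is alone in its group, which agrees with the final partitions derived by hand.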
2.5.8 DFA Minimization: Implementation
The algorithm in the previous section is implemented by the code starting in Listing
2.43. The Groups[] array on line 16 is the collection of partitions. The array index is
used as the partition number, and each group is a SET of DFA state numbers.
Numgroups, on line 17, is the number of groups currently in the array. Ingroup[], on line
18, is the inverse of Groups[]. It is indexed by state number and evaluates to the group
in which the state is found. All three of these are initialized to zeros by the access
function min_dfa() (on lines 43 to 45). The DFA is created by the dfa() call on line 47,
and it's minimized by the subroutine minimize(), called on the next line.
The Groups[] and Ingroup[] arrays are initialized in init_groups(), at the
top of Listing 2.44. The program decides which group to place a state in by looking at
the accepting-string pointer for that state. Since all nonaccepting states have NULL
accepting strings, they'll all end up in a single group (Group 0, because DFA State 0 is
always nonaccepting). The other groups in the initial partitioning contain accepting
states. If two such states have the same accepting string, they are considered equivalent
and are put into the same group. Note that since the starting state (0) always goes into
Group 0, and since the minimization algorithm never moves the first element of the
group to another partition, the starting state is still 0 after the minimization is complete.
The goto statement on line 78 of Listing 2.44 is necessary because it has to break
out of a doubly nested loop in order to avoid forming a new group. The same thing
could probably be done with a Boolean flag of some sort, but the goto seems cleaner.
The actual minimization is done by minimize(), starting on line 104 of Listing
2.44. It is passed the number of states in the DFA transition table, a pointer to the DFA
table itself, and a pointer to the accepting-state table. Note that there's an extra level of
indirection in both of these last two parameters. That is, addresses of pointers to the two
tables are passed to minimize(). I've done this because entirely new tables are
created to replace the existing ones: *dfap and *acceptp are modified to point at the
new tables. The memory used for the old tables is discarded automatically by
minimize().
The routine is a straightforward implementation of the algorithm in Table 2.12 on
page 134. It terminates when an entire pass is made through the set of groups without
adding a new group. The test on lines 143 to 146 is complicated by the failure
transitions, which have no explicit existence. There is no physical group that holds the failure
state, so I have to test here to make sure that neither of the two next states are failure
Listing 2.43. minimize.c Data Structures and Access Function
 1  #include <stdio.h>
 2  #ifdef MSDOS
 3  #    include <stdlib.h>
 4  #else
 5  #    include <malloc.h>
 6  #endif
 7  #include <tools/debug.h>
 8  #include <tools/set.h>
 9  #include <tools/hash.h>
10  #include <tools/compiler.h>
11  #include "dfa.h"
12  #include "globals.h"            /* externs for Verbose */
13
14  /* MINIMIZE.C: Make a minimal DFA by eliminating equivalent states.
15   */
16  PRIVATE SET  *Groups [ DFA_MAX ];   /* Groups of equivalent states in Dtran */
17  PRIVATE int  Numgroups;             /* Number of groups in Groups           */
18  PRIVATE int  Ingroup [ DFA_MAX ];   /* the Inverse of Group                 */
19
20  /*----------------------------------------------------------------------
21   * Prototypes for subroutines in this file:
22   */
23
24  PRIVATE void fix_dtran   ( ROW*[], ACCEPT**               );
25  PRIVATE void init_groups ( int, ACCEPT*                   );
26  PUBLIC  int  min_dfa     ( char *(*)(), ROW*[], ACCEPT**  );
27  PRIVATE void minimize    ( int, ROW*[], ACCEPT**          );
28  PRIVATE void pgroups     ( int                            );
29
30  /*----------------------------------------------------------------------*/
31
32  PUBLIC  int min_dfa( ifunct, dfap, acceptp )
33  char    *( *ifunct )();
34  ROW     *( dfap[] );
35  ACCEPT  *( *acceptp );
36  {
37      /* Make a minimal DFA, eliminating equivalent states. Return the number of
38       * states in the minimized machine. *sstatep = the new start state.
39       */
40
41      int nstates;                    /* Number of DFA states */
42
43      memset( Groups,  0, sizeof(Groups)  );
44      memset( Ingroup, 0, sizeof(Ingroup) );
45      Numgroups = 0;
46
47      nstates = dfa( ifunct, dfap, acceptp );
48      minimize ( nstates, dfap, acceptp );
49
50      return Numgroups;
51  }
transitions before allowing an index into the Ingroup[] array.
The fix_dtran() routine on line 177 of Listing 2.44 uses the previously-computed
Groups[] to remove redundant states from the table. It's difficult to do the
compression in place because the original states are randomly distributed throughout the
groups, so two new arrays are allocated on lines 199 and 200. The original arrays are
freed on lines 218 and 219, and the array pointers are changed to point at the new arrays
on the next couple of lines.
The remainder of minimize.c consists of a few debugging routines in Listing 2.45.
pgroups() prints out all the existing groups and is used for very-verbose-mode (-V)
diagnostics.
Listing 2.44. minimize.c Minimization Functions
 52  PRIVATE void init_groups( nstates, accept )
 53  ACCEPT *accept;
 54  {
 55      SET **last ;
 56      int i, j ;
 57
 58      last      = &Groups[0] ;
 59      Numgroups = 0;
 60
 61      for( i = 0; i < nstates ; i++ )
 62      {
 63          for( j = i; --j >= 0 ; )
 64          {
 65              /* Check to see if a group already exists that has the same
 66               * accepting string as the current state. If so, add the current
 67               * state to the already existing group and skip past the code that
 68               * would create a new group. Note that since all nonaccepting states
 69               * have NULL accept strings, this loop puts all of these together
 70               * into a single group. Also note that the test in the for loop
 71               * always fails for group 0, which can't be an accepting state.
 72               */
 73
 74              if( accept[i].string == accept[j].string )
 75              {
 76                  ADD( Groups[ Ingroup[j] ], i );
 77                  Ingroup[i] = Ingroup[j];
 78                  goto match;
 79              }
 80          }
 81
 82          /* Create a new group and put the current state into it. Note that ADD()
 83           * has side effects, so "last" can't be incremented in the ADD
 84           * invocation.
 85           */
 86
 87          *last = newset();
 88          ADD( *last, i );
 89          Ingroup[i] = Numgroups++;
 90          ++last;
 91
 92  match:;     /* Group already exists, keep going */
 93      }
 94
 95      if( Verbose > 1 )
 96      {
 97          printf( "Initial groupings:\n" );
 98          pgroups( nstates );
 99      }
100  }
101
102  /*----------------------------------------------------------------------*/
103
104  PRIVATE void minimize( nstates, dfap, acceptp )
105  int     nstates;            /* number of states in dtran[]          */
106  ROW     *( dfap[] );        /* DFA transition table to compress     */
107  ACCEPT  *( *acceptp );      /* Set of accept states                 */
108  {
109      int  old_numgroups;     /* Used to see if we did anything in this pass. */
110      int  c;                 /* Current character.                           */
111      SET  **current;         /* Current group being processed.               */
112      SET  **new ;            /* New partition being created.                 */
113      int  first;             /* State # of first element of current group.   */
114      int  next;              /* State # of next element of current group.    */
115      int  goto_first;        /* target of transition from first[c].          */
116      int  goto_next;         /* other target of transition from first[c].    */
117
118      ROW     *dtran  = *dfap;
119      ACCEPT  *accept = *acceptp;
120
121      init_groups( nstates, accept );
122      do
123      {
124          old_numgroups = Numgroups;
125
126          for( current = &Groups[0]; current < &Groups[Numgroups]; ++current )
127          {
128              if( num_ele( *current ) <= 1 )
129                  continue;
130
131              new  = &Groups[ Numgroups ];
132              *new = newset();
133              next_member( NULL );
134              first = next_member( *current );
135
136              while( (next = next_member(*current)) >= 0 )
137              {
138                  for( c = MAX_CHARS; --c >= 0 ;)
139                  {
140                      goto_first = dtran[ first ][ c ];
141                      goto_next  = dtran[ next  ][ c ];
142
143                      if( goto_first != goto_next
144                          &&(    goto_first == F
145                              || goto_next  == F
146                              || Ingroup[goto_first] != Ingroup[goto_next]
147                            )
148                        )
149                      {
150                          REMOVE( *current, next );   /* Move the state to  */
151                          ADD   ( *new,     next );   /* the new partition  */
152                          Ingroup[ next ] = Numgroups ;
153                      }
154                  }
155              }
156
157              if( IS_EMPTY( *new ) )
158                  delset( *new );
159
160              else
161                  ++Numgroups;
162          }
163
164      } while( old_numgroups != Numgroups );
165
166      if( Verbose > 1 )
167      {
168          printf( "\nStates grouped as follows after minimization:\n" );
169          pgroups( nstates );
170      }
171
172      fix_dtran( dfap, acceptp );
173  }
174
175  /*----------------------------------------------------------------------*/
176
177  PRIVATE void fix_dtran( dfap, acceptp )
178  ROW     *( dfap[] );
179  ACCEPT  *( *acceptp );
180  {
181      /* Reduce the size of the dtran, using the group set made by minimize().
182       * Return the state number of the start state. The original dtran and accept
183       * arrays are destroyed and replaced with the smaller versions.
184       * Consider the first element of each group (current) to be a
185       * "representative" state. Copy that state to the new transition table,
186       * modifying all the transitions in the representative state so that they'll
187       * go to the group within which the old state is found.
188       */
189
190      SET     **current ;
191      ROW     *newdtran ;
192      ACCEPT  *newaccept ;
193      int     state ;
194      int     i ;
195      int     *src, *dest ;
196      ROW     *dtran  = *dfap;
197      ACCEPT  *accept = *acceptp;
198
199      newdtran  = (ROW    *) calloc( Numgroups, sizeof(ROW)    );
200      newaccept = (ACCEPT *) calloc( Numgroups, sizeof(ACCEPT) );
201
202      if( !newdtran || !newaccept )
203          ferr( "Out of memory!!!" );
204
205      next_member( NULL );
206      for( current = &Groups[Numgroups]; --current >= Groups ; )
207      {
208          dest  = &newdtran[ current-Groups ][ 0 ];
209          state = next_member( *current );            /* All groups have at */
210          src   = &dtran[ state ][ 0 ];               /* least one element. */
211
212          newaccept[ current-Groups ] = accept[state];    /* Struct. assignment */
213
214          for( i = MAX_CHARS; --i >= 0 ; src++, dest++ )
215              *dest = (*src == F) ? F : Ingroup[ *src ];
216      }
217
218      free( *dfap    );
219      free( *acceptp );
220      *dfap    = newdtran ;
221      *acceptp = newaccept ;
222  }
Listing 2.45. minimize.c Debugging Routines
223  PRIVATE void pgroups( nstates )
224  int nstates ;
225  {
226      /* Print all the groups used for minimization. */
227
228      SET **current;
229
230      for( current = &Groups[Numgroups] ; --current >= Groups ; )
231      {
232          printf( "\tgroup %d: {", current - Groups );
233          pset( *current, fprintf, stdout );
234          printf( "}\n" );
235      }
236      printf( "\n" );
237      while( --nstates >= 0 )
238          printf( "\tstate %2d is in group %2d\n", nstates, Ingroup[nstates] );
239  }
There are actually two additional minimizations that could be done here and are not:
dead states (nonaccepting states that have no outgoing transitions other than ones that
loop back into the current state) can be removed, as can unreachable states (states with
no incoming edges). In practice, both types of states are rare, so it didn't seem worth the
effort to remove them. I'll leave this further minimization as an exercise.
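One possible starting point for that exercise is sketched below. It is an assumption-laden illustration rather than part of minimize.c: the table is a plain int array with F marking failure transitions, the names find_reachable() and is_dead() are hypothetical, and "unreachable" is taken in the slightly broader sense of "cannot be reached from State 0," which also catches states whose only incoming edges come from other unreachable states.

```c
#include <assert.h>
#include <string.h>

#define NSYMS      2
#define F          (-1)         /* failure transition */
#define MAX_STATES 64

/* Set reachable[s] to 1 for every state that can be reached from state 0. */
void find_reachable(int dtran[][NSYMS], int nstates, int reachable[])
{
    int stack[MAX_STATES];
    int top = 0;

    assert(nstates > 0 && nstates <= MAX_STATES);
    memset(reachable, 0, nstates * sizeof(int));

    reachable[0] = 1;
    stack[top++] = 0;

    while (top > 0) {                       /* depth-first traversal */
        int s = stack[--top];
        for (int c = 0; c < NSYMS; c++) {
            int t = dtran[s][c];
            if (t != F && !reachable[t]) {  /* each state is pushed once */
                reachable[t] = 1;
                stack[top++] = t;
            }
        }
    }
}

/* A dead state is nonaccepting and can only fail or loop back to itself. */
int is_dead(int dtran[][NSYMS], const int accepting[], int s)
{
    if (accepting[s])
        return 0;
    for (int c = 0; c < NSYMS; c++)
        if (dtran[s][c] != F && dtran[s][c] != s)
            return 0;
    return 1;
}
```

Actually removing the states found this way would still require renumbering the surviving rows, much as fix_dtran() does for groups.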
2.5.9 Compressing and Printing the Tables
The remainder of this chapter describes the remainder of LEX, primarily housekeeping
and table-printing functions. Skip to the next chapter if you're not following the
implementation details.
This section discusses three table-output routines: one for uncompressed tables, one
for pair-compressed tables, and a third routine that eliminates redundant rows and
columns and prints the result. The first two of these are of sufficient general utility to be
put in a library. The third routine is used only by LEX.
2.5.9.1 Uncompressed Tables. The uncompressed tables are printed by two
subroutines in two separate input files. The defnext() routine in Listing 2.46 prints the
yy_next() macro definition; yy_next() figures the next state from the current state
and input character. The array access is hidden in a macro so that the state-machine
driver can call yy_next() for the next state, regardless of the compression method
used. Each compression method supplies a different version of yy_next().
The uncompressed two-dimensional array is output by print_array() in Listing
2.47. The routine is passed a FILE pointer for the output stream, a pointer to the array,
and the two dimensions. It is configured to print an array of ints, but you can change
Listing 2.46. defnext.c Print Next-State Macro: Uncompressed Table
 1  #include <stdio.h>
 2  #include <tools/debug.h>
 3  #include <tools/compiler.h>     /* needed only for prototype */
 4
 5  PUBLIC void defnext( fp, name )
 6  FILE  *fp;
 7  char  *name;
 8  {
 9      /* Print the default yy_next(s,c) subroutine for an uncompressed table. */
10
11      static char *comment_text[] =
12      {
13          "yy_next(state,c) is given the current state and input character and",
14          "evaluates to the next state.",
15          NULL
16      };
17
18      comment( fp, comment_text );
19      fprintf( fp, "#define yy_next(state, c) %s[ state ][ c ]\n", name );
20  }
this type by modifying the typedef for ATYPE on line eight. The array is treated
internally as a one-dimensional array having nrows × ncols cells. The main use of the row
size (nrows) is to print the inner brackets on lines 28 and 43. NCOLS, defined on line
nine, determines how many array elements are printed on each output line. It controls
formatting only, not any of the actual array dimensions. print_array() just prints
the initialization part of the declaration (and the terminating semicolon). It doesn't print
the actual declaration. Use it like this:
    printf( "unsigned char Array[%d][%d]=\n", nrows, ncols );
    print_array( stdout, array, nrows, ncols );
The declared type of the output array (unsigned char above) is immaterial, even
though the input array must be ints. It's up to you to make sure that all values in the
input array fit into the cells of the output array. In the current example, no cell in the
input array can have a value greater than 255 or the number will effectively be truncated.
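A check along the following lines can catch such values before the table is printed. This is a sketch only: fits_in_uchar() is a hypothetical helper, not one of the library routines, and it simply scans the int table for cells outside the range of an unsigned char.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Return 1 if every cell of the nrows x ncols table can be stored in an
 * unsigned char without changing its value, 0 otherwise.
 */
int fits_in_uchar(const int *array, int nrows, int ncols)
{
    assert(array != NULL && nrows >= 0 && ncols >= 0);

    for (long i = 0; i < (long)nrows * ncols; i++)
        if (array[i] < 0 || array[i] > UCHAR_MAX)
            return 0;
    return 1;
}
```

On a machine with 8-bit characters, a cell holding 300 would come out as 44 (300 modulo 256) once it is stored in an unsigned char, which is the kind of silent change the check is meant to flag.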
2.5.9.2 Pair-Compressed Tables. The next set of routines, in Listing 2.48, produce
pair-compressed tables, as were pictured in Figure 2.7 on page 70. Various macros on
lines 21 to 26 control the names and types used for both the input and output array.
ATYPE and NCOLS are used as before: they are the declared type of the input array and
the number of columns on one line of the output array. The other three definitions are
used for the storage class of the array and the type of one array element. The printed tables
assume that the table type and storage class are defined somewhere; you have to put
something like the following at the top of the file in which the tables are placed:
    typedef unsigned char YY_TTYPE;
    #define YYPRIVATE static
Note that the yy_next(), which figures the next state, can be declared with a different
storage class than the tables themselves (though it isn't in the current configuration).
This way you can have a public decoding routine and static tables.
The table-generation routine, pairs(), starts on line 29 of Listing 2.48. It is passed
three more arguments than print_array(). The name argument is the name used for
the output arrays. The names for the arrays that represent rows are manufactured by
Listing 2.47. print_ar.c Print A Two Dimensional Array
 1  #include <stdio.h>
 2  #include <tools/debug.h>
 3  #include <tools/compiler.h>     /* for prototypes only */
 4  /*----------------------------------------------------------------------
 5   * PRINT_AR.C: General-purpose subroutine to print out a 2-dimensional array.
 6   */
 7
 8  typedef int ATYPE;
 9  #define NCOLS 10                /* Number of columns used to print arrays */
10  /*----------------------------------------------------------------------*/
11
12  PUBLIC void print_array( fp, array, nrows, ncols )
13  FILE   *fp;
14  ATYPE  *array;                  /* DFA transition table         */
15  int    nrows;                   /* Number of rows in array[]    */
16  int    ncols;                   /* Number of columns in array[] */
17  {
18      /* Print the C source code to initialize the two-dimensional array pointed
19       * to by "array." Print only the initialization part of the declaration.
20       */
21
22      int i;
23      int col;                    /* Output column */
24
25      fprintf( fp, "{\n" );
26      for( i = 0; i < nrows ; i++ )
27      {
28          fprintf( fp, "/* %02d */ { ", i );
29
30          for( col = 0; col < ncols; col++ )
31          {
32              fprintf( fp, "%3d", *array++ );
33              if( col < ncols-1 )
34                  fprintf( fp, ", " );
35
36              if( (col % NCOLS) == NCOLS-1 && col != ncols-1 )
37                  fprintf( fp, "\n            " );
38          }
39
40          if( col > NCOLS )
41              fprintf( fp, "\n         " );
42
43          fprintf( fp, " }%c\n", i < nrows-1 ? ',' : ' ' );
44      }
45      fprintf( fp, "};\n" );
46  }
putting a numeric suffix at the end of this string. The array of pointers to rows uses the
string without modification. The threshold argument determines the point at which
character/next-state pairs are abandoned in favor of a one-dimensional array. If the
number of transitions in a row is less than or equal to threshold, then pairs are used,
otherwise an array is used. Finally, numbers controls the output format of the character
part of the pair. If it's true, a decimal number is used (an ASCII 'a', for example, is
output as 97); if false, character constants are used (an 'a' is printed as an 'a').
The pairs() routine returns the number of cells used to store the row arrays. The
number of bytes used by the entire table is:
    (number_of_cells × sizeof(TYPE)) + (nrows × sizeof(TYPE*))
The rows are printed in the loop that starts on line 52 of Listing 2.48. The loop on
line 55 counts the number of nonfailure transitions in the row, and the row is generated
only if there are nonfailure transitions. The elements themselves are printed in the loop
on line 75. The second half of the subroutine, on lines 110 to 129, prints out the array of
pointers to the rows. The pnext() subroutine that starts on line 135 prints the
next-state subroutine (it's too big to be a macro).
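The following sketch shows how rows laid out this way might be decoded, together with the size formula above. It is not the routine that pnext() prints. It assumes int cells (so that a failure value of -1 can be stored directly), and the names Cell, FAIL, next_state(), and table_bytes() are hypothetical. A row whose first cell is 0 is treated as a plain array indexed by character, a row whose first cell is a count is treated as that many character/next-state pairs, and a NULL row has no nonfailure transitions.

```c
#include <assert.h>
#include <stddef.h>

typedef int Cell;               /* stand-in for the table's cell type */
#define FAIL (-1)               /* failure transition                 */

/* Look up the next state for (state, c) in a pair-compressed table.
 * For array rows the caller must make sure c is a valid column index.
 */
int next_state(const Cell *const rows[], int state, int c)
{
    const Cell *row = rows[state];

    assert(state >= 0 && c >= 0);

    if (row == NULL)                    /* no nonfailure transitions   */
        return FAIL;

    if (row[0] == 0)                    /* array row: index directly   */
        return row[1 + c];

    for (int i = 0; i < row[0]; i++)    /* pairs: (character, state)   */
        if (row[1 + 2 * i] == c)
            return row[2 + 2 * i];

    return FAIL;
}

/* Bytes used by the row arrays plus the array of row pointers. */
size_t table_bytes(size_t number_of_cells, size_t nrows)
{
    return number_of_cells * sizeof(Cell) + nrows * sizeof(Cell *);
}
```

The linear scan over the pairs is the reason for the threshold argument: once a row has many transitions, the scan costs more than a direct array index and the pairs save little space.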
Listing 2.48. pairs.c Pair Compression
1    #include <stdio.h>
2    #include <tools/debug.h>
3    #include <tools/compiler.h>     /* for prototypes only */
4
5    /* PAIRS.C    This module contains the routines to compress a table
6     *            horizontally (using char/next pairs) and then print the
7     *            compressed table. The compressed array looks like this:
8     *
9     *  Yy_nxt:     Yy_nxtDD:
10    *  +------+    +-------+----------------------------------------+
11    *  |  *---|--->|   0   | Next state array, indexed by character |
12    *  +------+    +-------+----------------------------------------+
13    *  |      |
14    *  +------+    +-------+----+----+----+----+-----+
15    *  |  *---|--->| count | c1 | s1 | c2 | s2 | ... |
16    *  +------+    +-------+----+----+----+----+-----+
17    *  | NULL |    (there are no non-failure transitions if NULL)
18    *  +------+
19    */
20
21   typedef int ATYPE;              /* Declared type of input tables          */
22
23   #define NCOLS     10            /* Number of columns used to print arrays */
24   #define TYPE      "YY_TTYPE"    /* Declared type of output tables.        */
25   #define SCLASS    "YYPRIVATE"   /* Storage class of all the tables        */
26   #define D_SCLASS  "YYPRIVATE"   /* Storage class of the decoding routine  */
27
28   /*----------------------------------------------------------------------*/
29
30   PUBLIC int pairs( fp, array, nrows, ncols, name, threshold, numbers )
31
32   FILE    *fp;                    /* output file                        */
33   ATYPE   *array;                 /* DFA transition table               */
34   int     nrows;                  /* Number of rows in array[]          */
35   int     ncols;                  /* Number of columns in array[]       */
36   char    *name;                  /* Name used for output array         */
37   int     threshold;              /* Array vs. pairs threshold          */
38   int     numbers;                /* Use numbers for char. part of pair */
39   {
40       /* Generate the C source code for a pair-compressed DTRAN. Returns the
41        * number of cells used for the YysDD arrays. The "numbers" argument
42        * determines the output format of the character part of a
43        * character/next-state pair. If numbers is true, then normal numbers
44        * are used, otherwise ASCII characters are used. For example: 'a',100
45        * as compared to: 97,100 ('a' == 0x61 == 97, decimal).
46        */
47       int     i, j, ntransitions, nprinted, ncommas;
48       int     num_cells = 0;      /* cells used for rows */
49       ATYPE   *p;
50       char    *bin_to_ascii();
51
52       for( i = 0; i < nrows ; i++ )
53       {
54           ntransitions = 0;
55           for( p = array + (i * ncols), j = ncols; --j >= 0; p++ )
56               if( *p != -1 )
57                   ++ntransitions;
58
59           if( ntransitions > 0 )
60           {
61               fprintf(fp, "%s %s %s%-2d[] = { ", SCLASS, TYPE, name, i );   /*}*/
62               ++num_cells;
63               if( ntransitions > threshold )                  /* array */
64                   fprintf(fp, "0,\n" );
65               else                                            /* pairs */
66               {
67                   fprintf(fp, "%2d, ", ntransitions );
68                   if( threshold > 5 )
69                       fprintf(fp, "\n" );
70               }
71
72               nprinted = NCOLS;
73               ncommas  = ntransitions;
74
75               for( p = array + (i * ncols), j = 0; j < ncols; j++, p++ )
76               {
77                   if( ntransitions > threshold )              /* array */
78                   {
79                       ++num_cells;
80                       --nprinted;
81
82                       fprintf(fp, "%3d", *p );
83                       if( j < ncols-1 )
84                           fprintf(fp, ", " );
85                   }
86                   else if( *p != -1 )                         /* pairs */
87                   {
88                       num_cells += 2;
89
90                       if( numbers )
91                           fprintf(fp, "%d,%d", j, *p );
92                       else
93                           fprintf(fp, "'%s',%d", bin_to_ascii(j,0), *p );
94
95                       nprinted -= 2;
96                       if( --ncommas > 0 )
97                           fprintf(fp, ", " );
98                   }
99
100                  if( nprinted <= 0 )
101                  {
102                      fprintf(fp, "\n" );
103                      nprinted = NCOLS;
104                  }
105              }
106              fprintf(fp, "};\n" );
107          }
108      }
109
170      fprintf(fp, "\n/*--------------------------------------------------*/\n" );
171      fprintf(fp, "%s %s yy_next( cur_state, c )\n", D_SCLASS, TYPE );
172      printv (fp, toptext );
173      fprintf(fp, "    %s *p = %s[ cur_state ];\n", TYPE, name );
174      printv (fp, boptext );
175  }
2.5.9.3 Redundant-Row-and-Column-Compressed Tables. The routines to
print the array with redundant rows and columns removed are in Listing 2.49. Since these
are used only in LeX, I've taken fewer pains to make them flexible. The subroutine
goes through the array column by column using the algorithm in Table 2.13. The
same algorithm is used to compress the rows. Note that the row compression creates a
single row that contains nothing but failure transitions. This approach seems better than
setting the map-array entries to a failure marker because it makes the array access both
more consistent and faster at the cost of only a few bytes: you don't have to test for
failure before indexing into the map array.
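The benefit of the all-failure row shows up in the lookup itself. The sketch below uses a tiny hand-built table to illustrate it. The data, the table sizes, and the name yy_next_squashed are hypothetical; only the Yy_rmap and Yy_cmap names are borrowed from the generated tables.

```c
/* Hypothetical 4-state, 4-character machine after compression. */
enum { NSTATES = 4, NCHARS = 4, FAIL = -1 };

static const int Yy_cmap[NCHARS]  = { 0, 1, 1, 0 };   /* character -> column */
static const int Yy_rmap[NSTATES] = { 0, 1, 2, 2 };   /* state     -> row    */

static const int Dtran[3][2] =
{
    {    1,    2 },
    {    3, FAIL },
    { FAIL, FAIL }      /* the single row shared by all dead states */
};

/* Both maps always hold a valid index, so no failure test is needed
 * before indexing into the compressed array.
 */
int yy_next_squashed(int state, int c)
{
    return Dtran[ Yy_rmap[state] ][ Yy_cmap[c] ];
}
```

Had the row map stored a failure marker for states 2 and 3 instead, every lookup would need an extra comparison before the array access; the shared row trades one short row of storage for that test.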
Table 2.13. Algorithm to Eliminate Redundant Columns
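As an informal companion to the table, the following sketch shows one way to find redundant columns. It follows the same general idea as reduce() in Listing 2.49 (mark each unmapped column, then give every later identical column the same map entry), but the fixed array sizes and the names col_same and map_columns are assumptions made to keep the example self-contained.

```c
#include <string.h>

#define N_ROWS 3
#define N_COLS 5

/* Return 1 if columns a and b hold the same value in every row. */
static int col_same(int table[N_ROWS][N_COLS], int a, int b)
{
    int r;

    for (r = 0; r < N_ROWS; ++r)
        if (table[r][a] != table[r][b])
            return 0;
    return 1;
}

/* Fill col_map[] so that col_map[c] is the index that column c will have
 * in the reduced table. Return the number of distinct columns.
 */
int map_columns(int table[N_ROWS][N_COLS], int col_map[N_COLS])
{
    int i, j, kept = 0;

    memset(col_map, -1, N_COLS * sizeof(int));      /* -1 == not yet mapped */

    for (i = 0; i < N_COLS; ++i)
    {
        if (col_map[i] != -1)
            continue;                   /* duplicate of an earlier column */

        col_map[i] = kept;
        for (j = i + 1; j < N_COLS; ++j)
            if (col_map[j] == -1 && col_same(table, i, j))
                col_map[j] = kept;
        ++kept;
    }
    return kept;
}
```

Once the map is built, the reduced table is formed by copying one representative of each group of identical columns; the same procedure applied to rows yields the row map.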
Listing 2.49. squash.c Redundant Row and Column Elimination
#include <stdio.h>
#include <ctype.h>
#ifdef MSDOS
#    include <stdlib.h>
#else
#    include <malloc.h>
#endif
#include <tools/compiler.h>
#include "dfa.h"
#include "globals.h"

/* SQUASH.C: This module contains the routines to compress a table
 * horizontally and vertically by removing redundant columns and rows, and then
 * print the compressed table. I haven't been as careful about making this
 * routine general purpose because it's only useful to LeX. The pairs
 * compression routines in pairs.c are used to compress the occs and
 * llama tables so they're a little more complicated.
 */

PRIVATE int Col_map[ MAX_CHARS ];
PRIVATE int Row_map[ DFA_MAX   ];

#define NCOLS   16
#define TYPE    "YY_TTYPE"      /* Declared type of output tables.  */
#define SCLASS  "YYPRIVATE"     /* Storage class of all the tables  */

/*----------------------------------------------------------------------
 * Subroutines in this file:
 */

PRIVATE int  col_equiv      ( int*, int*, int            );
PRIVATE void col_cpy        ( int*, int*, int, int, int  );
PRIVATE void reduce         ( ROW*, int*, int*           );
PRIVATE void print_col_map  ( FILE*                      );
PRIVATE void print_row_map  ( FILE*, int                 );
PUBLIC  void pmap           ( FILE*, int*, int           );
PUBLIC  void cnext          ( FILE*, char*               );
PUBLIC  int  squash         ( FILE*, ROW*, int, int, char* );

/*----------------------------------------------------------------------*/

#define ROW_EQUIV(r1,r2,ncols) (memcmp( r1, r2, ncols * sizeof(int) ) == 0 )
#define ROW_CPY(r1,r2,ncols)   (memcpy( r1, r2, ncols * sizeof(int) )      )

/*----------------------------------------------------------------------*/

PUBLIC int squash( fp, dtran, nrows, ncols, name )
FILE    *fp;
ROW     *dtran;
int     nrows, ncols;
char    *name;
{
    /* Compress (and output) dtran using equivalent-column elimination.
     * Return the number of bytes required for the compressed tables
     * (including the map but not the accepting array).
     */

    int oncols = ncols;                     /* original column count */
    int onrows = nrows;                     /* original row count    */

    reduce( dtran, &nrows, &ncols );        /* Compress the tables   */

    print_col_map( fp );
    print_row_map( fp, onrows );

    fprintf(fp, "%s %s %s[ %d ][ %d ]=\n", SCLASS, TYPE, name, nrows, ncols );
    print_array( fp, (int *)dtran, nrows, ncols );

    return(   ( nrows * ncols * sizeof(TTYPE) )         /* dtran   */
            + (        onrows * sizeof(TTYPE) )         /* row map */
            + (        oncols * sizeof(TTYPE) ) );      /* col map */
}

/*----------------------------------------------------------------------*/

PRIVATE int col_equiv( col1, col2, nrows )
int     *col1, *col2;
{
    /* Return 1 if the two columns are equivalent, else return 0 */

    while( --nrows >= 0  &&  *col1 == *col2 )
    {
        col1 += MAX_CHARS;      /* Advance to next cell in the column */
        col2 += MAX_CHARS;
    }

    return( !(nrows >= 0) );
}

/*----------------------------------------------------------------------*/

PRIVATE void col_cpy( dest, src, nrows, n_src_cols, n_dest_cols )
int     *dest;          /* Top of destination column              */
int     *src;           /* Top of source column                   */
int     nrows;          /* Number of rows                         */
int     n_src_cols;     /* Number of columns in source array      */
int     n_dest_cols;    /* Number of columns in destination array */
{
    /* Copy a column from one array to another. Both arrays are nrows deep,
     * the source array is n_src_cols wide and the destination array is
     * n_dest_cols wide.
     */

    while( --nrows >= 0 )
    {
        *dest  = *src;
        dest  += n_dest_cols;
        src   += n_src_cols;
    }
}

/*----------------------------------------------------------------------*/

PRIVATE void reduce( dtran, p_nrows, p_ncols )
ROW     *dtran;         /* DFA transition table     */
int     *p_nrows;       /* # of states in dtran     */
int     *p_ncols;       /* Pointer to column count  */
{
    /* Reduce dtran horizontally and vertically, filling the two map arrays
     * with the character mapping to the final, reduced transition table.
     * Return the number of columns in the reduced dtran table.
     *
     * Col_map is the x (character) axis, Row_map is the y (next state) axis.
     */

    int  ncols = *p_ncols;  /* number of columns in original machine     */
    int  nrows = *p_nrows;  /* number of rows    in original machine     */
    int  r_ncols;           /* number of columns in reduced machine      */
    int  r_nrows;           /* number of rows    in reduced machine      */
    SET  *save;             /* rows or columns that will remain in table */
    int  *current;          /* first of several identical columns        */
    int  *compressed;       /* pointer to compressed array               */
    int  *p;
    int  i, j;

    /* ...........................................................
     * First do the columns.
     */

    memset( Col_map, -1, sizeof(Col_map) );
    save = newset();

    for( r_ncols = 0 ;; r_ncols++ )
    {
        /* Skip past any states in the Col_map that have already been
         * processed. If the entire Col_map has been processed, break.
         * The bound is tested first so that Col_map is never read past
         * its end.
         */

        for( i = r_ncols; i < ncols && Col_map[i] != -1 ; i++ )
            ;

        if( i >= ncols )
            break;

        /* Add the current column to the save set. It eventually ends up
         * in the reduced array as column "r_ncols" so modify the Col_map
         * entry accordingly. Now, scan through the array looking for
         * duplicates of the current column (pointed to by current). If you
         * find a duplicate, make the associated Col_map entry also point to
         * "r_ncols."
         */

        ADD( save, i );
        Col_map[i] = r_ncols;

        current    = &dtran[0][i];
        p          = current + 1;

        for( j = i; ++j < ncols ; p++ )
            if( Col_map[j] == -1 && col_equiv(current, p, nrows) )
                Col_map[j] = r_ncols;
    }

    /* Compress the array horizontally by removing all of the columns that
     * aren't in the save set. We're doing this by moving all the columns
     * that are in the save set to the proper position in a newly allocated
     * array. You can't do it in place because there's no guarantee that the
     * equivalent rows are next to each other.
     */

    if( !(compressed = (int *) malloc( nrows * r_ncols * sizeof(int) )) )
        ferr( "Out of memory" );

    p = compressed;
    for( next_member(NULL); (i = next_member(save)) != -1; )
        col_cpy( p++, &dtran[0][i], nrows, ncols, r_ncols );

    /* ...........................................................
     * Eliminate equivalent rows, working on the reduced array
     * created in the previous step. The algorithm used is the
     * same.
     */

    memset( Row_map, -1, sizeof(Row_map) );
    CLEAR( save );

    for( r_nrows = 0 ;; r_nrows++ )
    {
        for( i = r_nrows; i < nrows && Row_map[i] != -1 ; i++ )
            ;

        if( i >= nrows )
            break;

        ADD( save, i );
        Row_map[i] = r_nrows;

        current    = compressed + ( (i  ) * r_ncols );
        p          = compressed + ( (i+1) * r_ncols );

        for( j = i; ++j < nrows ; p += r_ncols )
            if( Row_map[j] == -1 && ROW_EQUIV(current, p, r_ncols) )
                Row_map[j] = r_nrows;
    }

    /* ...........................................................
     * Actually compress rows, copying back into the original array space.
     * Note that both dimensions of the array have been changed.
     */

    p = (int *)dtran;

    for( next_member(NULL); (i = next_member(save)) != -1; )
    {
        ROW_CPY( p, compressed + (i * r_ncols), r_ncols );
        p += r_ncols;
    }

    delset( save       );
    free  ( compressed );

    *p_ncols = r_ncols;
    *p_nrows = r_nrows;
}

/*----------------------------------------------------------------------*/

PRIVATE void print_col_map( fp )
FILE    *fp;
{
    static char *text[] =
    {
        "The Yy_cmap[] and Yy_rmap arrays are used as follows:",
        "",
        " next_state= Yydtran[ Yy_rmap[current_state] ][ Yy_cmap[input_char] ];",
        "",
        "Character positions in the Yy_cmap array are:",
        "",
        "   ^@  ^A  ^B  ^C  ^D  ^E  ^F  ^G  ^H  ^I  ^J  ^K  ^L  ^M  ^N  ^O",
        "   ^P  ^Q  ^R  ^S  ^T  ^U  ^V  ^W  ^X  ^Y  ^Z  ^[  ^\\  ^]  ^^  ^_",
        "       !   \"   #   $   %   &   '   (   )   *   +   ,   -   .   /",
        "   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?",
        "   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O",
        "   P   Q   R   S   T   U   V   W   X   Y   Z   [   \\   ]   ^   _",
        "   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o",
        "   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL",
        NULL
    };

    comment( fp, text );
    fprintf( fp, "%s %s Yy_cmap[%d] =\n{\n     ", SCLASS, TYPE, MAX_CHARS );
    pmap   ( fp, Col_map, MAX_CHARS );
}

PRIVATE void print_row_map( fp, nrows )
FILE    *fp;
int     nrows;
{
    fprintf( fp, "%s %s Yy_rmap[%d] =\n{\n     ", SCLASS, TYPE, nrows );
    pmap   ( fp, Row_map, nrows );
}

PUBLIC void pmap( fp, p, n )
FILE    *fp;            /* output stream    */
int     *p;             /* pointer to array */
int     n;              /* array size       */
{
    /* Print a one-dimensional array. */

    int j;

    for( j = 0; j < (n - 1); j++ )
    {
        fprintf( fp, "%3d,", *p++ );

        if( (j % NCOLS) == NCOLS-1 )
            fprintf( fp, "\n     " );
    }

    fprintf( fp, "%3d\n};\n\n", *p );
}

/*----------------------------------------------------------------------*/

PUBLIC void cnext( fp, name )
FILE    *fp;
char    *name;
{
    /* Print out a yy_next(state,c) subroutine for the compressed table. */

    static char *text[] =
    {
        "yy_next(state,c) is given the current state number and input",
        "character and evaluates to the next state.",
        NULL
    };

    comment( fp, text );
    fprintf( fp,
        "#define yy_next(state,c) (%s[ Yy_rmap[state] ][ Yy_cmap[c] ])\n",
        name
    );
}
2.5.10 Tying It All Together
LeX is completed by the subroutines presented in this section (in Listings 2.50, 2.51,
2.52, and 2.53). These listings are commented sufficiently that additional comments here
would be tedious. Read the code if you're interested.
I should also add, at this point, that there are more efficient methods than the one I've
just described for converting a regular expression to a finite automaton. (See [Aho] pp.
134-141.) I've chosen the current method because it serves to introduce several
concepts (like closure) that will be useful later on.
Listing 2.50. signon.c Print Sign-on Message
#include <stdio.h>
#include <tools/debug.h>

signon()
{
    /* Print the sign-on message. Since the console is opened explicitly, the
     * message is printed even if both stdout and stderr are redirected.
     */

    FILE *screen;

    UX( if( !(screen = fopen("/dev/tty", "w")) )   )
    MS( if( !(screen = fopen("con:",     "w")) )   )
            screen = stderr;

    /* The ANSI __DATE__ macro yields a string of the form: "Sep 01 1989". */
    /* The __DATE__+7 gets the year portion of that string.                */

    fprintf( screen, "LeX 1.0 [%s]. (c) %s, Allen I. Holub. "
                     "All rights reserved.\n", __DATE__, __DATE__+7 );

    if( screen != stderr )
        fclose( screen );
}
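The __DATE__+7 idiom in signon() relies on the fixed layout of the __DATE__ string. The fragment below isolates it; the function name build_year is an assumption made for the example.

```c
/* __DATE__ always has the form "Mmm dd yyyy" (11 characters), so the
 * four-digit year starts at offset 7. The returned pointer refers to a
 * string literal and must not be modified.
 */
const char *build_year(void)
{
    return __DATE__ + 7;
}
```

Because the day field is padded to two characters even for single-digit days, the offset is the same on every day of the month.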
Listing 2.51. print.c Print Remainder of Output File
#include <stdio.h>
#include <ctype.h>
#include <tools/debug.h>
#include <tools/set.h>
#include <tools/compiler.h>
#include "dfa.h"
#include "nfa.h"
#include "globals.h"

/* PRINT.C: This module contains miscellaneous print routines that do
 * everything except print the actual tables.
 */

PUBLIC void pheader( FILE *fp, ROW dtran[], int nrows, ACCEPT *accept );
PUBLIC void pdriver( FILE *output, int nrows, ACCEPT *accept );
/*----------------------------------------------------------------------*/

PUBLIC void pheader( fp, dtran, nrows, accept )
FILE    *fp;            /* output stream                    */
ROW     dtran[];        /* DFA transition table             */
int     nrows;          /* Number of states in dtran[]      */
ACCEPT  *accept;        /* Set of accept states in dtran[]  */
{
    /* Print out a header comment that describes the uncompressed DFA. */

    int  i, j;
    int  last_transition;
    int  chars_printed;
    char *bin_to_ascii();

    fprintf(fp, "#ifdef __NEVER__\n" );
    fprintf(fp, "/*---------------------------------------------------\n" );
    fprintf(fp, " * DFA (start state is 0) is:\n *\n" );

    for( i = 0; i < nrows ; i++ )
    {
        if( !accept[i].string )
            fprintf(fp, " * State %d [nonaccepting]", i );
        else
        {
            fprintf(fp, " * State %d [accepting, line %d <",
                                i, ((int *)(accept[i].string))[-1] );

            fputstr( accept[i].string, 20, fp );
            fprintf(fp, ">]" );

            if( accept[i].anchor )
                fprintf(fp, " Anchor: %s%s",
                            accept[i].anchor & START ? "start " : "",
                            accept[i].anchor & END   ? "end"    : "" );
        }

        last_transition = -1;
        for( j = 0; j < MAX_CHARS; j++ )
        {
            if( dtran[i][j] != F )
            {
                if( dtran[i][j] != last_transition )
                {
                    fprintf(fp, "\n *    goto %2d on ", dtran[i][j] );
                    chars_printed = 0;
                }

                fprintf(fp, "%s", bin_to_ascii(j,1) );

                if( (chars_printed += strlen(bin_to_ascii(j,1))) > 56 )
                {
                    fprintf(fp, "\n *               " );
                    chars_printed = 0;
                }

                last_transition = dtran[i][j];
            }
        }
        fprintf(fp, "\n" );
    }
    fprintf(fp, " */\n\n" );
    fprintf(fp, "#endif\n" );
}

/*----------------------------------------------------------------------*/

PUBLIC void pdriver( output, nrows, accept )
FILE    *output;
int     nrows;          /* Number of states in dtran[]      */
ACCEPT  *accept;        /* Set of accept states in dtran[]  */
{
    /* Print the array of accepting states, the driver itself, and the case
     * statements for the accepting strings.
     */

    int i;
    static char *text[] =
    {
        "The Yyaccept array has two purposes. If Yyaccept[i] is 0 then state",
        "i is nonaccepting. If it's nonzero then the number determines whether",
        "the string is anchored, 1=anchored at start of line, 2=at end of",
        "line, 3=both, 4=line not anchored",
        NULL
    };

    comment( output, text );
    fprintf(output, "YYPRIVATE YY_TTYPE Yyaccept[] =\n" );
    fprintf(output, "{\n" );

    for( i = 0 ; i < nrows ; i++ )                  /* accepting array */
    {
        if( !accept[i].string )
            fprintf(output, "\t0  " );
        else
            fprintf(output, "\t%-3d", accept[i].anchor ? accept[i].anchor : 4 );

        fprintf(output, "%c    /* State %-3d */\n",
                                i == (nrows - 1) ? ' ' : ',', i );
    }
    fprintf(output, "};\n\n" );

    driver_2( output, !No_lines );                  /* code above cases */

    for( i = 0 ; i < nrows ; i++ )                  /* case statements  */
    {
        if( accept[i].string )
        {
            fprintf(output, "\t\tcase %d:\t\t\t\t\t/* State %-3d */\n", i, i );
            if( !No_lines )
                fprintf(output, "#line %d \"%s\"\n",
                                *( (int *)(accept[i].string) - 1 ),
                                Input_file_name );

            fprintf(output, "\t\t    %s\n", accept[i].string );
            fprintf(output, "\t\t    break;\n" );
        }
    }

    driver_2( output, !No_lines );                  /* code below cases */
}
Listing 2.52. lex.c main () and Other High-Level Functions (Part 1)
#include <stdio.h>
#include <ctype.h>
#include <stdarg.h>             /* For vprintf() */
#include <tools/set.h>
#include <tools/hash.h>
#include "dfa.h"

#define ALLOCATE
#include "globals.h"

/*----------------------------------------------------------------------*/

PRIVATE void cmd_line_error (int usage,  char *fmt, ...);
PUBLIC  void lerror         (int status, char *fmt, ...);
PUBLIC  int  main           (int argc, char **argv);
PRIVATE int  getline        (char **stringp, int n, FILE *stream);
PRIVATE void do_file        (void);
PRIVATE void tail           (void);
extern  char *get_expr      (void);             /* in input.c */

/*----------------------------------------------------------------------*/

#define DTRAN_NAME "Yy_nxt"     /* Name used for DFA transition table. Up to */
                                /* 3 characters are appended to the end of   */
                                /* this name in the row-compressed tables.   */

#define E(x)    fprintf(stderr, "%s\n", x)

PRIVATE int Column_compress = 1;    /* Variables for command-line switches */
PRIVATE int No_compression  = 0;
PRIVATE int Threshold       = 4;
PRIVATE int No_header       = 0;
PRIVATE int Header_only     = 0;

extern int Verbose;                 /* in globals.h                    */
extern int No_lines;                /* in globals.h                    */
extern int Lineno;                  /* In globals.h, the line number   */
                                    /* used to print #line directives. */

/*----------------------------------------------------------------------*/

PRIVATE void cmd_line_error( usage, fmt, ... )
char    *fmt;
{
    /* Print an error message and exit to the operating system. This routine is
     * used much like printf(), except that it has an extra first argument.
     * If "usage" is 0, an error message associated with the current value of
     * errno is printed after printing the format/arg string in the normal way.
     * If "usage" is nonzero, a list of legal command-line switches is printed.
     *
     * I'm using the ANSI conventions for a subroutine with a variable number of
     * arguments. These differ a little from the Unix V conventions. Look up
     * doprnt() [it's in the "printf(3S)" entry] in the Unix Programmer's
     * Manual for details. A version of the <stdarg.h> macros used here is in
     * Appendix A.
     */

    extern char *sys_errlist[];
    extern int  errno;
    va_list     args;

    va_start( args, fmt );
    fprintf ( stderr, "lex: " );
    vfprintf( stderr, fmt, args );
    va_end  ( args );

    if( !usage )
        fprintf( stderr, "(%s)\n", sys_errlist[errno] );
    else
    {
        E("\n\nUsage is:     LeX [options] file"                            );
        E("-f  for (f)ast. Don't compress tables."                         );
        E("-h  suppress (h)eader comment that describes state machine."    );
        E("-H  print the (H)eader only."                                   );
        E("-l  Suppress #(l)ine directives in the output."                 );
        E("-cN use pair (c)ompression, N = threshold (default 4)."         );
        E("-t  Send output to standard output instead of lexyy.c"          );
        E("-u  UNIX mode, newline is \\n, not \\n or \\r"                  );
        E("-v  (v)erbose mode, print statistics."                          );
        E("-V  More (V)erbose, print internal diagnostics as lex runs."    );
    }
    exit( 1 );
}

/*----------------------------------------------------------------------*/

PUBLIC void lerror( status, fmt, ... )
char    *fmt;
{
    /* Print an error message and input line number. Exit with
     * indicated status if "status" is nonzero.
     */

    extern int  errno;
    va_list     args;

    va_start( args, fmt );
    fprintf ( stderr, "lex, input line %d: ", Actual_lineno );
    vfprintf( stderr, fmt, args );
    va_end  ( args );

    if( status )
        exit( status );
}

/*----------------------------------------------------------------------*/

PUBLIC main( argc, argv )
char    **argv;
{
    static char *p;
    static int  use_stdout = 0;

    signon();

    for( ++argv, --argc; argc && *(p = *argv) == '-'; ++argv, --argc )
    {
        while( *++p )
        {
            switch( *p )
            {
            case 'f': No_compression  = 1;      break;
            case 'h': No_header       = 1;      break;
            case 'H': Header_only     = 1;      break;
            case 'l': No_lines        = 1;      break;
            case 'm': Template        = p + 1;  goto out;
            case 'p': Public          = 1;      break;
            case 't': use_stdout      = 1;      break;
            case 'u': Unix            = 1;      break;
            case 'v': Verbose         = 1;      break;
            case 'V': Verbose         = 2;      break;
            case 'c': Column_compress = 0;

                      if( !isdigit(p[1]) )
                          Threshold = 4;
                      else
                      {
                          Threshold = atoi( ++p );
                          while( *p && isdigit( p[1] ) )
                              ++p;
                      }
                      break;

            default:  cmd_line_error( 1, "-%c illegal argument.", *p );
                      break;
            }
        }
out: ;
    }

    if( argc > 1 )
        cmd_line_error( 1, "Too many arguments. Only one file name permitted" );

    else if( argc <= 0 )
        cmd_line_error( 1, "File name required" );

    else
    {
        if( Ifile = fopen( *argv, "r" ) )
            Input_file_name = *argv;
        else
            cmd_line_error( 0, "Can't open input file %s", *argv );
    }

    if( !use_stdout )
        if( !(Ofile = fopen( Header_only ? "lexyy.h" : "lexyy.c", "w" )) )
            cmd_line_error( 0, "Can't open output file lexyy.[ch]" );

    do_file();
    fclose ( Ofile );
    fclose ( Ifile );
    exit   ( 0 );
}

/*----------------------------------------------------------------------*/

PRIVATE void do_file( )
{
    int     nstates;                /* Number of DFA states         */
    ROW     *dtran;                 /* Transition table             */
    ACCEPT  *accept;                /* Set of accept states in dfa  */
    FILE    *input, *driver_1();    /* Template file for driver     */
    int     i;

    /* Process the input file */

    head( Header_only );            /* print everything up to first %% */

    nstates = min_dfa( get_expr, &dtran, &accept );     /* make DFA */

    if( Verbose )
    {
        printf("%d out of %d DFA states in minimized machine\n", nstates,
                                                                DFA_MAX );
        printf("%d bytes required for minimized tables\n\n",
                  nstates * MAX_CHARS * sizeof(TTYPE)       /* dtran  */
                + nstates *             sizeof(TTYPE) );    /* accept */
    }

    if( !No_header )
        pheader( Ofile, dtran, nstates, accept );       /* print header */
                                                        /* comment      */
    if( !Header_only )
    {
                                                        /* first part   */
                                                        /* of driver    */
        if( !(input = driver_1( Ofile, !No_lines, Template )) )
        {
            perror( Template );
            exit( 1 );
        }

        if( No_compression )                    /* compressed tables */
        {
            fprintf( Ofile, "YYPRIVATE YY_TTYPE %s[ %d ][ %d ] =\n",
                                        DTRAN_NAME, nstates, MAX_CHARS );

            print_array( Ofile, (int *)dtran, nstates, MAX_CHARS );
            defnext    ( Ofile, DTRAN_NAME );
        }
        else if( Column_compress )              /* column-compressed tables */
        {
            i = squash( Ofile, dtran, nstates, MAX_CHARS, DTRAN_NAME );
            cnext     ( Ofile, DTRAN_NAME );

            if( Verbose )
                printf("%d bytes required for column-compressed tables\n\n",
                          i                                 /* dtran     */
                        + (nstates * sizeof(int)) );        /* Yy_accept */
        }
        else                                    /* pair-compressed tables */
        {
            i = pairs( Ofile, (int *)dtran, nstates,
                                MAX_CHARS, DTRAN_NAME, Threshold, 0 );
            if( Verbose )
            {
                /* Figure the space occupied for the various tables. The
                 * Microsoft compiler uses roughly 100 bytes for the yy_next()
                 * subroutine. Column compression does yy_next in line with a
                 * macro so the overhead is negligible.
                 */

                i =   ( i       * sizeof(TTYPE ) )      /* YysDD arrays */
                    + ( nstates * sizeof(TTYPE*) )      /* Dtran[]      */
                    + ( nstates * sizeof(TTYPE ) )      /* Yy_accept[]  */
                    + 100;                              /* yy_next()    */

                printf("%d bytes required for pair-compressed tables\n", i );
            }

            pnext( Ofile, DTRAN_NAME );
        }

        pdriver( Ofile, nstates, accept );      /* print rest of driver   */
        tail();                                 /* and everything following */
                                                /* the second %%          */
    }
}

/*----------------------------------------------------------------------
 * Head processes everything up to the first %%. Any lines that begin
 * with white space or are surrounded by %{ and %} are passed to the
 * output. All other lines are assumed to be macro definitions.
 * A %% can not be concealed in a %{ %} but it must be anchored at start
 * of line so a %% in a printf statement (for example) is passed to the
 * output correctly. Similarly, a %{ and %} must be the first two characters
 * on the line.
 */

PRIVATE head( suppress_output )
int     suppress_output;
{
    int transparent = 0;                /* True if in a %{ %} block */

    if( !suppress_output && Public )
        fputs( "#define YYPRIVATE\n\n", Ofile );

    if( !No_lines )
        fprintf( Ofile, "#line 1 \"%s\"\n", Input_file_name );

    while( fgets( Input_buf, MAXINP, Ifile ) )
    {
        ++Actual_lineno;

        if( !transparent )                      /* Don't strip comments */
            strip_comments( Input_buf );        /* from code blocks.    */

        if( Verbose > 1 )
            printf( "h%d: %s", Actual_lineno, Input_buf );

        if( Input_buf[0] == '%' )
        {
            if( Input_buf[1] == '%' )
            {
                if( !suppress_output )
                    fputs( "\n", Ofile );
                break;
            }
            else
            {
                if( Input_buf[1] == '{' )               /*}*/
                    transparent = 1;

                else if( Input_buf[1] == '}' )
                    transparent = 0;

                else
                    lerror( 0, "Ignoring illegal %%%c directive\n",
                                                        Input_buf[1] );
            }
        }
        else if( transparent || isspace( Input_buf[0] ) )
        {
            if( !suppress_output )
                fputs( Input_buf, Ofile );
        }
        else
        {
            new_macro( Input_buf );
            if( !suppress_output )
                fputs( "\n", Ofile );   /* Replace macro def with a blank */
                                        /* line so that the line numbers  */
                                        /* won't get messed up.           */
        }
    }

    if( Verbose > 1 )
        printmacs();
}
329
/*----------------------------------------------------------------------*/

strip_comments( string )
char    *string;
{
    /* Scan through the string, replacing C-like comments with space
     * characters. Multiple-line comments are supported.
     */

    static int incomment = 0;

    for( ; *string ; ++string )
    {
        if( incomment )
        {
            if( string[0] == '*' && string[1] == '/' )
            {
                incomment = 0;
                *string++ = ' ';
            }

            if( !isspace( *string ) )
                *string = ' ';
        }
        else
        {
            if( string[0] == '/' && string[1] == '*' )
            {
                incomment = 1;
                *string++ = ' ';
                *string   = ' ';
            }
        }
    }
}

/*----------------------------------------------------------------------*/

PRIVATE void tail()
{
    fgets( Input_buf, MAXINP, Ifile );  /* Throw away the line that */
                                        /* had the %% on it.        */

    if( !No_lines )
        fprintf( Ofile, "#line %d \"%s\"\n",
                                Actual_lineno + 1, Input_file_name );

    while( fgets( Input_buf, MAXINP, Ifile ) )
    {
        if( Verbose > 1 )
            printf( "t%d: %s", Actual_lineno++, Input_buf );

        fputs( Input_buf, Ofile );
    }
}
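
The comment-stripping logic in the listing can be tried in isolation. The following is a minimal sketch in ANSI C, not the listing's own code: the name strip_comments_r and the in_comment parameter are hypothetical, introduced here so that the "inside a comment" state is passed explicitly rather than kept in a static variable.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Blank out C-style comments in one line of text, in place.
 * *in_comment carries the "inside a comment" state from one call to
 * the next, so a comment may span several lines. (Hypothetical variant:
 * the listing keeps this state in a static variable instead.)
 */
void strip_comments_r(char *s, int *in_comment)
{
    assert(s != NULL && in_comment != NULL);

    for (; *s; ++s) {
        if (*in_comment) {
            if (s[0] == '*' && s[1] == '/') {
                *in_comment = 0;
                *s++ = ' ';             /* erase '*', step to '/'      */
            }
            if (!isspace((unsigned char)*s))
                *s = ' ';               /* white space is left alone   */
        } else if (s[0] == '/' && s[1] == '*') {
            *in_comment = 1;
            *s++ = ' ';                 /* erase '/', step to '*'      */
            *s = ' ';
        }
    }
}
```

Call it once per input line with the same in_comment variable (initialized to 0). Because white space inside a comment is preserved, newlines survive and the line count of the text does not change.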
2.6 Exercises
2.1. Modify the simple expression grammar used in Chapter 1 so that COMMENT
tokens can appear anywhere in the expression. This exercise demonstrates
graphically why it's useful for a lexical analyzer to do things like remove
comments from the input.
2.2. Write a lexical analyzer that processes the PL/1 statement
correctly, distinguishing the keywords from the identifiers.
2.3. Write a regular expression that can recognize all possible integer constants in C.
All of the following should be recognized:
    0x89ab  0123  45  'z'  '\t'  '\xab'  '\012'
2.4. Write a regular expression that recognizes a C identifier. Construct an NFA for
this expression using Thompson's Construction, convert the NFA to a DFA using
Subset Construction, and minimize the DFA.
2.5. Write a set of input functions that can handle interactive input from a human
being more elegantly than the ones in input.c. (They should support command-line
editing, and so forth.) Your routines must work with the default driver generated
by LEX (they must use the same interface as the default ones). The routines
should be able to get input directly from the console, rather than from standard
input.
2.6. Use LEX to rewrite its own lexical analyzer.
2.7. Write a set of regular expressions that can recognize C subroutine declarations.
You can assume that the entire declaration is on one line. Declarations such as
the following should be recognized:
    TYPE *doo( wha, ditty )
    unsigned char (*foo())[10]
    int (*signal(signal, ptr))()
but not the following:
    extern int (*signal(int signal, (*ptr)()))();
    sam( dave );
2.8. Use the regular expressions developed in the previous exercise to write a LEX
program that inputs a list of C programs whose names are listed on the command
line, and which outputs all subroutine declarations along with the name of the file
in which they are found and their input line numbers.
2.9. Modify the previous exercise so that only the subroutine name is printed, rather
than the entire declaration.
2.10. Various UNIX editors support the \( and \) operators, which bracket
subexpressions in a regular expression. For example, an expression like this:
    \(abc\)def\(ghi\)
recognizes the string abcdefghi, and the two substrings abc and ghi can be
accessed later using the notation \1 and \2 respectively. Modify LEX to support a
similar feature, but as follows:
A subexpression in a regular expression is indicated with parentheses.
a. Modify LEX so that repeated subexpressions in the regular expressions
themselves can be referenced using the notation \0, \1, and so forth. \0
references the first subexpression, \1 the next one, and so forth. A
subexpression's number is derived by counting the number of open
parentheses to the left of the one that starts that subexpression. For example,
the following regular expression recognizes three, comma-delimited decimal
numbers:
    [0-9]+,[0-9]+,[0-9]+
You should be able to express the same thing as follows:
    ([0-9]+)(,)\0\1\0
The \0 in the regular expression tells LEX to duplicate the first subexpression
([0-9]+), the \1 duplicates the second (,).
The double-quote character and double backslash should still work as
expected. Both of the following expressions should recognize an input string
like "1234\1\2\1":
    ([0-9]+)(,)"\1\2\1"
    ([0-9]+)(,)\\1\\2\\1
The backslashes are treated as literal characters here.
b. Modify LEX so that the lexemes associated with the matched subexpressions
can be accessed in an action using an argv-like array of string pointers called
yyv: yyv[0] holds the lexeme that matches the first subexpression, yyv[1]
the second, and so on. An integer called yyc should hold the number of valid
entries in yyv. For example, the expression in the following rule recognizes
the input pattern abcdefghi:
    ((abc(def))(ghi))    while( --yyc >= 0 )
                             printf( "%s\n", *yyv++ );
The associated action prints:
    abcdefghi
    abcdef
    def
    ghi
2.11. Modify LEX to support a unary NOT operator. It should be higher precedence
than the closure operators and lower precedence than concatenation. It should
precede its operand. For example, !c matches anything except the character c; the
expression a!(bc) should match an a followed by any sequence of characters
other than the two-character sequence bc; !(and|or) should match any string
except and or or; finally, ![a-z] should work just like [^a-z]. Be careful of the
newline, which should not be matched by any of the foregoing patterns.
2.12. Add the / lookahead operator to LEX. The expression a/b (where a and b are
both regular expressions) matches a, but only if it is followed by b. The
characters that match b should not be put into the lexeme. In fact, they should be
pushed back into the input after being recognized. For example, the regular expression
abc/def should match an abc only if it's followed by a def, and the def should be
pushed back into the input for subsequent processing. You can simulate the $
operator with xxx/[\r\n]. Similarly, the following pattern matches the string
if, but only if the next nonwhite character isn't a semicolon, plus sign, minus sign,
or asterisk:
    if/[\s\t\n\r]*[^;+-*]
By the same token, if/[\s\t\n\r]*( matches an if only if the next nonwhite
character is an open parenthesis.
2.13. Add the REJECT directive to LEX. This directive selects the next alternative
accepting action when regular expressions conflict. For example, the following
recognizes the word he, even if it's embedded in she:
        int he = 0, she = 0;
    %%
    he      { ++he;  REJECT; }
    she     { ++she; REJECT; }
    (.|\n)  ;       /* ignore */
    %%
    main()
    {
        yylex();
        printf( "Saw %d he's and %d she's\n", he, she );
    }
It recognizes the she first, then the REJECT action causes it to execute the code
associated with the he regular expression too. Note that it's not doing anything
special with the input, it's only executing all possible accepting actions for the
expression. The program should print
    Saw 2 he's and 1 she's
when given the input:
    he she
2.14. Add the X{a,b} operator to LEX. This expression matches X repeated a to b times.
For example, l{1,3}ama recognizes lama, llama, and lllama.
2.15. Add left-context sensitivity to LEX. Do it like this: The statement
    %START NAME
is placed in the first, definitions part of the LEX input file. It creates a class of
regular expressions that are active only in certain situations (discussed below). NAME
can be an arbitrary name having up to eight characters. All regular expressions
that start with <NAME> are placed in the class. The angle brackets must be present,
like this:
    <NAME>(a|b)*    { some_code(); }
Activate all states in the class with the following line anywhere in the code part of
a rule:
    BEGIN NAME;
That is, the earlier regular expression, (a|b)*, won't be recognized until a BEGIN
NAME has been executed in a code block. Use BEGIN 0; to disable all special
regular expressions. An expression can be put into several classes like this:
    <NAME1,NAME2,NAME3>expression    { some_code(); }
Once an expression is activated, the usual input-file precedence rules are also
active. That is, if a special expression occurs earlier in the input file than a
conflicting expression, the special one takes precedence once it's activated.
2.16. Modify LEX to detect which of the two compression methods yields the smallest
tables, and then use that method automatically.
2.17. Add a command-line switch to LEX that causes the two compression methods to
be combined: Pair compression should be applied first, then a map array should
be used to compress the uncompressed rows in the pair-compressed tables. Only
one copy of two or more identical rows should be kept, with several pointers in
Yy_nxt[] addressing the single copy of the row.
2.18. Modify the minimization routines to remove dead states: nonaccepting states
that have no outgoing transitions other than ones that loop back into the current
state.
2.19. Modify the minimization routines to remove unreachable states: states that have
no incoming edges.
2.20. Sometimes it's useful for a program's users to be able to modify the token set
when they install a program. This way, they can customize a program's input
syntax to fit their own preferences. Modify LEX so that it produces a file containing
binary representations of the state and auxiliary tables rather than C source code.
Modify yylex() so that it reads the tables from this file into memory the first
time it is called. This way, the user can change the LEX input file and generate a
new state table without having to recompile the program that is using that state
table.
Note that the actions still have to be put into the compiled program; there's no
way to read them from disk at run time. Consequently, though the regular
expressions can be modified at installation time, the number of regular expressions must
stay the same. The position of the regular expression from the top of the file
determines the action to execute when that expression is recognized. That is, the
second action in the original input file is executed when the second regular
expression is matched, even if that expression is changed at installation time.
The discussion of formal grammars started in Chapter One continues here. I discuss
the general, more theoretical aspects of grammars, but with emphasis on building real
grammars, discussing precedence and associativity, list and expression grammars, and so
forth. Attributed and augmented grammars are introduced as well, and they are put into
the context of a recursive-descent parser. Grammars are discussed further in Chapter
Four, where various grammatical transformations are presented.
3.1 Sentences, Phrases, and Context-Free Grammars
This section quickly summarizes the terms introduced in Chapters One and Two and
extends these terms somewhat.
An alphabet is a collection of characters. The ASCII character set is a good example
of an alphabet. A string or word is a specific sequence of symbols from an input
alphabet. In compiler parlance, a word in the language is called a token: an indivisible
lexical unit of the language.
A language is a set of words, and a sentence is a sequence of one or more of the
words that comprise the language. Strictly speaking, there is no inherent ordering of the
words within the sentence, and that's where a grammar comes into play. A formal
grammar is a system of rules (called productions) that control the order in which words may
occur in a sentence.
The syntax of a sentence determines the relationships between the words and phrases
in a sentence. That is, the syntax of a language controls the structure of a sentence and
nothing more. Taking English as an example, the syntax of English requires that the
subject precede the verb, or that an adjective precede the noun that it modifies. The
syntax rules tell you nothing about what the sentence actually means, however.
Informally, a context-free grammar is a system of definitions that can be used to
break up a sentence into phrases solely on the basis of the sequence of strings in the
input sentence. A context-free grammar is usually represented in Backus-Naur form
(BNF), in which productions are represented as follows:
    part-of-speech → definition
The → operator (often represented as ::=) is the "is defined as" or "goes to" operator.
The left-hand side of the production (often abbreviated to LHS) is the name of the
syntactic unit within the language (noun, verb, prepositional phrase, and so forth). The
right-hand side (or RHS) is a list of symbols that define the LHS. For example:
    sentence → subject predicate PERIOD
A sentence is a subject, followed by a predicate, followed by a PERIOD. There are two
types of symbols on the right-hand side: Nonterminal symbols are terms that have to be
defined further somewhere in the grammar (they have to appear on a left-hand side), and
terminal symbols, such as tokens, need no further definition. The nonterminal
symbols form the interior nodes of the parse tree; the terminal symbols are all at the
ends of branches. There are terminal symbols other than tokens, discussed below.
So, a context-free grammar is one in which all input sentences can be parsed strictly
on the basis of syntax. Formally, a context-free grammar is composed of the following:
• A finite set of terminal symbols or tokens.
• A finite set of nonterminal symbols.
• A finite set of productions of the form s → α, where s is a nonterminal symbol, and α
  is a list of zero or more terminal and nonterminal symbols. s is called the left-hand
  side of the production and α is called the right-hand side. Every nonterminal symbol
  that appears in the right-hand side of some production must also appear on a
  left-hand side. No terminal symbols may appear on a left-hand side.
• A single start or goal symbol from which all the productions derive.
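
The formal definition above maps naturally onto a small data structure. The following is a minimal sketch in C under stated assumptions: the type names Production and Grammar, the MAX_RHS limit, and the function grammar_is_well_formed are all hypothetical, chosen for illustration. The function checks only the rules listed above: every left-hand side is a nonterminal (so no terminal appears on a left-hand side), every nonterminal used on a right-hand side also appears on some left-hand side, and the start symbol is a defined nonterminal.

```c
#include <assert.h>
#include <string.h>

#define MAX_RHS 8                       /* hypothetical limit */

/* One production: lhs -> rhs[0] ... rhs[rhs_len-1].
 * rhs_len == 0 represents an empty right-hand side.
 */
typedef struct {
    const char *lhs;
    int         rhs_len;
    const char *rhs[MAX_RHS];
} Production;

typedef struct {
    const char      **terminals;     int n_terminals;
    const char      **nonterminals;  int n_nonterminals;
    const Production *productions;   int n_productions;
    const char       *start;
} Grammar;

static int in_set(const char *name, const char **set, int n)
{
    int i;
    for (i = 0; i < n; ++i)
        if (strcmp(set[i], name) == 0)
            return 1;
    return 0;
}

/* Does some production have this symbol on its left-hand side? */
static int has_lhs(const Grammar *g, const char *name)
{
    int i;
    for (i = 0; i < g->n_productions; ++i)
        if (strcmp(g->productions[i].lhs, name) == 0)
            return 1;
    return 0;
}

/* Return 1 if the grammar obeys the rules listed above, else 0. */
int grammar_is_well_formed(const Grammar *g)
{
    int i, k;

    assert(g != NULL);
    for (i = 0; i < g->n_productions; ++i) {
        const Production *p = &g->productions[i];

        assert(p->rhs_len >= 0 && p->rhs_len <= MAX_RHS);
        if (in_set(p->lhs, g->terminals, g->n_terminals))
            return 0;                   /* terminal on a left-hand side */
        if (!in_set(p->lhs, g->nonterminals, g->n_nonterminals))
            return 0;                   /* undeclared symbol            */

        for (k = 0; k < p->rhs_len; ++k) {
            if (in_set(p->rhs[k], g->terminals, g->n_terminals))
                continue;               /* terminals need no definition */
            if (!in_set(p->rhs[k], g->nonterminals, g->n_nonterminals))
                return 0;               /* undeclared symbol            */
            if (!has_lhs(g, p->rhs[k]))
                return 0;               /* nonterminal never defined    */
        }
    }
    return in_set(g->start, g->nonterminals, g->n_nonterminals)
        && has_lhs(g, g->start);
}
```

To use it, fill in a Grammar with arrays of symbol names and a table of productions, then call grammar_is_well_formed(). Strings are used for symbols only to keep the sketch readable; a real tool would more likely use small integers.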
It's theoretically possible to have a context-sensitive grammar that does consider
semantics when analyzing a sentence, but this type of grammar is beyond the scope of
this book and won't be discussed further.
A grammar can also distinguish groups of words that are related syntactically. For
example, in English, the syntax of the language helps separate a subject from a predicate,
recognize prepositional phrases, and so forth. Each of these syntactically related lists of
words is called a phrase. In a computer language, a loop-control statement and
associated body is a phrase, as is an entire subroutine declaration. Note that the phrases can be
organized hierarchically: a sentence can contain a predicate, which can in turn contain
a prepositional phrase. A grammatical phrase can be as short as a single word: in C, a
single identifier can comprise a phrase. It can also be empty; I'll discuss this case
below. A production is a rule that describes the syntax of a single phrase.
Syntactic rules are used to break a sentence into its component parts of speech and
analyze the relationship of one part to another (this process is called parsing the
sentence). The term is used in both linguistic and compiler theory. A parser is a computer
program that uses a context-free grammar to parse an input sentence, that is, to isolate the
component parts of speech for subsequent processing. Most parser programs are also
recognizer programs. They accept (return yes) only if the input forms a syntactically
correct sentence. The fact that a parser can also generate code is actually immaterial
from a theoretical point of view.
The parser can analyze only the structure of the sentence. Even in English, the
syntax of a sentence tells you nothing about what the sentence means. Meaning comes
under the purview of semantics. A context-free grammar defines the structure of the
sentence only. It tells you nothing about the semantics (the meanings of the words)
except that you can sometimes infer meaning from the position of the word within a
sentence. For example, a grammar could tell you that a variable declaration is comprised of
a sequence of type-specification tokens (int, long, short, and so forth) followed by a
single identifier, but it is just manipulating sequences of strings; it has no idea what the
strings mean. In the case of a declaration, it knows nothing about the characteristics of
the individual types. As far as the grammar is concerned, the following declarations are
all reasonable because they're all sequences of type specifiers followed by a single
identifier:
    long john_silver;
    short stop;
    long short muddle;
What the grammar does do is distinguish the specifiers from the identifiers. If a type
specifier were redefined so that it could represent an int only, then the meaning of the
specifier could be extrapolated from the syntax. That is, when the parser encountered a
type specifier, it would know that the specifier was an int, because that was the only
possibility. This sort of degenerate case is not much use in natural languages like
English, but it can be used effectively in languages with limited syntactic and semantic
scope, such as programming languages, which might have only a few verbs (while, do,
for), and the same number of nouns (int, float, double).
In syntax-directed translation, code is generated (the input language is translated to
the output language) strictly on the basis of the syntax of a sentence: the meaning of a
phrase or word is inferred from its position within the sentence. In practical terms, the
parser performs certain code-generation actions as it traverses a parse tree that is
created using syntactic rules. It performs a code-generation action every time a node on
the tree is visited.
3.2 Derivations and Sentential Forms
A derivation is a way of showing how an input sentence can be recognized with a
grammar. You can also look at it as a type of algebraic substitution using the grammar.
As an example of the derivation process, you start a leftmost derivation with the topmost
nonterminal in a grammar, called the start or goal symbol. You then replace this
nonterminal with one of its right-hand sides. Continue in this manner (but replace the
leftmost nonterminal created from the previous step with one of the right-hand sides of that
nonterminal) until there are no more nonterminals to replace.
The following grammar recognizes a simple semicolon-terminated expression
consisting of interspersed numbers and PLUS tokens:
    1: stmt   → expr SEMI
    2: expr   → factor PLUS expr
    3:        | factor
    4: factor → NUMBER
All you need do to prove that an input sentence (like 1+2;) can be recognized by this
grammar is come up with a derivation that can match the input sentence, token for token.
The following derivation does just that (the => symbol means derives):
    stmt => expr SEMI                   by Production 1
         => factor PLUS expr SEMI       by Production 2
         => NUMBER PLUS expr SEMI       by Production 4
         => NUMBER PLUS factor SEMI     by Production 3
         => NUMBER PLUS NUMBER SEMI     by Production 4
This process is called a leftmost derivation because you always replace the leftmost
nonterminal in the partially-parsed sentence with the equivalent production's right-hand
side. Start with the goal symbol (the topmost symbol in the grammar) and then replace
the leftmost nonterminal at each step. The =>L operator (a => marked with an L)
signifies a leftmost derivation: X =>L Y means "X derives Y by a leftmost derivation."
You can say that
    expr SEMI =>L factor PLUS expr SEMI
by substituting expr for its right-hand side. The foregoing derivation was created in a
top-down fashion, starting with the topmost symbol in the grammar and working down to
the tokens. This process is called a top-down parse.
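
The substitution step of a leftmost derivation is mechanical enough to sketch in code. The following C fragment is an illustration only, not a parser: the names Form, Prod, leftmost_step, and form_to_string are hypothetical, and the caller chooses which production to apply at each step. It encodes the four productions of the grammar above and replaces the leftmost nonterminal of a sentential form with the chosen right-hand side.

```c
#include <assert.h>
#include <string.h>

enum { NUMBER, PLUS, SEMI, STMT, EXPR, FACTOR };     /* grammar symbols */
static const char *Names[] =
    { "NUMBER", "PLUS", "SEMI", "stmt", "expr", "factor" };

#define MAXFORM 16
typedef struct { int sym[MAXFORM]; int len; } Form;  /* sentential form */

typedef struct { int lhs; int rhs_len; int rhs[3]; } Prod;
static const Prod Prods[] = {
    { -1,     0, { 0 } },                   /* unused: index == prod. number */
    { STMT,   2, { EXPR, SEMI } },          /* 1: stmt   -> expr SEMI        */
    { EXPR,   3, { FACTOR, PLUS, EXPR } },  /* 2: expr   -> factor PLUS expr */
    { EXPR,   1, { FACTOR } },              /* 3: expr   -> factor           */
    { FACTOR, 1, { NUMBER } },              /* 4: factor -> NUMBER           */
};

static int is_nonterminal(int s) { return s >= STMT; }

/* Replace the leftmost nonterminal in f with the right-hand side of
 * production p. Returns 1 on success, 0 if f has no nonterminal or if
 * its leftmost nonterminal is not the left-hand side of production p.
 */
int leftmost_step(Form *f, int p)
{
    const Prod *pr;
    int i, k;

    assert(f != NULL && p >= 1 && p <= 4);
    pr = &Prods[p];

    for (i = 0; i < f->len && !is_nonterminal(f->sym[i]); ++i)
        ;
    if (i == f->len || f->sym[i] != pr->lhs)
        return 0;

    assert(f->len - 1 + pr->rhs_len <= MAXFORM);

    /* Slide the tail over to make room, then copy in the right-hand side. */
    memmove(&f->sym[i + pr->rhs_len], &f->sym[i + 1],
            (size_t)(f->len - i - 1) * sizeof f->sym[0]);
    for (k = 0; k < pr->rhs_len; ++k)
        f->sym[i + k] = pr->rhs[k];
    f->len += pr->rhs_len - 1;
    return 1;
}

/* Write the form into buf as space-separated symbol names. */
void form_to_string(const Form *f, char *buf, size_t size)
{
    int i;

    assert(size > 0);
    buf[0] = '\0';
    for (i = 0; i < f->len; ++i) {
        if (i > 0)
            strncat(buf, " ", size - strlen(buf) - 1);
        strncat(buf, Names[f->sym[i]], size - strlen(buf) - 1);
    }
}
```

Starting from a form that holds only stmt and applying productions 1, 2, 4, 3, and 4 in that order reproduces the derivation shown earlier; printing the form after each call shows the intermediate steps.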
Each of the steps in the earlier example (each partial derivation) is formally called a
viable prefix. You can look at a viable prefix as a partially parsed sentence. Another
useful term is a handle, which is the portion of the viable prefix that is replaced at each
step in the derivation. The handle is always a contiguous sequence of zero or more
symbols at the far right or far left of the viable prefix (depending on how the derivation is
done).
You can also have a rightmost derivation, which starts with the goal symbol and
replaces the rightmost nonterminal with each step. The =>R operator (a => marked with
an R) can be used to indicate this kind of derivation. It's also possible to have a
bottom-up parse: one that starts with the leaves of the tree and works its way up to the
root. Bottom-up parsers use rightmost derivations. They start with the input symbols and,
when a right-hand side is encountered in a viable prefix, replace the symbols that
comprise the right-hand side with the equivalent left-hand side. You still go through the
input from left to right, but you always replace the rightmost handle in the viable prefix.
Using the same grammar and input as before, you can derive the goal symbol from the
input symbols as follows:
    NUMBER PLUS NUMBER SEMI     Start with the input.
    factor PLUS NUMBER SEMI     Apply factor → NUMBER
    factor PLUS factor SEMI     Apply factor → NUMBER
    factor PLUS expr SEMI       Apply expr → factor
    expr SEMI                   Apply expr → factor PLUS expr
    stmt                        Apply stmt → expr SEMI
The parser scans the viable prefix (the partially-parsed sentence) for a handle that
matches the right-hand side of some production, and then replaces the handle with the
left-hand side of that production. For example, NUMBER is the right-hand side of
factor → NUMBER, and this production is applied in the first step, replacing the
NUMBER with factor. factor PLUS doesn't form a right-hand side, so the parser
moves to the next symbol, the second NUMBER, for the next replacement, and so forth.
A practical bottom-up parser, which uses a rightmost derivation, is discussed in great
detail in Chapter Five, so don't worry about it if the foregoing didn't sink in.
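
The replace-the-handle process can be sketched with a stack. The following C fragment is a hand-coded illustration for this one grammar only; it is not the table-driven method of Chapter Five, and the names reduce and parse are hypothetical. Tokens are shifted onto a stack from left to right, and whenever the top of the stack matches a right-hand side, it is replaced with the left-hand side. One token of lookahead decides whether a factor is a complete expr or the start of factor PLUS expr.

```c
#include <assert.h>
#include <stddef.h>

enum { NUMBER, PLUS, SEMI, END, STMT, EXPR, FACTOR };

#define MAXSTACK 32

/* Try one reduction at the top of the stack. Returns the number of the
 * production that was applied, or 0 if no right-hand side matches.
 */
static int reduce(int *stack, int *sp, int lookahead)
{
    int n = *sp;

    if (n >= 1 && stack[n-1] == NUMBER) {               /* factor -> NUMBER */
        stack[n-1] = FACTOR;
        return 4;
    }
    if (n >= 3 && stack[n-3] == FACTOR && stack[n-2] == PLUS
               && stack[n-1] == EXPR) {         /* expr -> factor PLUS expr */
        stack[n-3] = EXPR;
        *sp = n - 2;
        return 2;
    }
    if (n >= 1 && stack[n-1] == FACTOR && lookahead != PLUS) {
        stack[n-1] = EXPR;                              /* expr -> factor   */
        return 3;
    }
    if (n >= 2 && stack[n-2] == EXPR && stack[n-1] == SEMI) {
        stack[n-2] = STMT;                              /* stmt -> expr SEMI */
        *sp = n - 1;
        return 1;
    }
    return 0;
}

/* Parse an END-terminated token array. The production numbers are stored
 * in trace[] in the order in which they are applied. Returns the number
 * of reductions, or -1 if the input is rejected or a limit is exceeded.
 */
int parse(const int *input, int *trace, int max_trace)
{
    int stack[MAXSTACK];
    int sp = 0, n = 0, p;

    assert(input != NULL && trace != NULL);
    for (;;) {
        while ((p = reduce(stack, &sp, *input)) != 0) {
            if (n >= max_trace)
                return -1;
            trace[n++] = p;
        }
        if (*input == END)
            break;
        if (sp == MAXSTACK)
            return -1;
        stack[sp++] = *input++;                         /* shift */
    }
    return (sp == 1 && stack[0] == STMT) ? n : -1;
}
```

For the input NUMBER PLUS NUMBER SEMI, the productions are applied in the order 4, 4, 3, 2, 1, which is the same order as in the bottom-up example above.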
Note that top-down parsers that do rightmost derivations and bottom-up parsers that
do leftmost derivations are theoretically possible, but they are difficult to implement.
Similarly, you can go through the input from right to left rather than left to right, but
again, the practice is not common.
In general, the =>, =>L, and =>R symbols (derives, derives by a leftmost derivation,
and derives by a rightmost derivation) mean that only one substitution has occurred
between the left- and right-hand side of the equation. Several other forms of the =>
symbol are commonly used: =>* and =>+ mean "derives in zero or more" and "derives in
one or more" steps, respectively. The various forms can also be combined in reasonable
ways: =>L*, =>R*, =>L+, and =>R+.
We need one more term before continuing: if S =>L* α (if there is a leftmost derivation
of the viable prefix α from the nonterminal S), then α is sometimes called a
left-sentential form of S. Right-sentential form is used for a rightmost derivation, and a just
plain sentential form is used when the type of derivation isn't known or doesn't matter.
For example, NUMBER PLUS expr SEMI is a left-sentential form of stmt because
stmt =>L* NUMBER PLUS expr SEMI.
3.2.1 LL and LR Grammars
The concepts of left and right derivations apply directly to the two types of parsers
that are discussed in Chapters Four and Five. In fact, the derivation closely follows the
process that a parser uses to analyze the input sentence. Chapter Four discusses LL
parsers, which go through the input stream from left to right (that's the first L),
performing a leftmost derivation (that's the second L). Recursive-descent parsers, such as the
ones examined in Chapter 1, are also LL parsers.
An LL grammar is a grammar that can be parsed by an LL parser. (There are many
grammars that can't be so parsed.) An LL(1) grammar is a grammar that can be parsed
by an LL parser with one symbol of lookahead. That is, if a nonterminal has more than
one right-hand side, the parser can decide which one to apply by looking at the next
input symbol.
It's possible to have an LL(k) grammar, where k is some number other than 1, and
which requires k symbols of lookahead, but such grammars are not very practical. If the
number is missing, 1 is assumed.
The other type of parser that we'll look at in this book is an LR parser, which goes
through the input from left to right, but does a bottom-up, rightmost derivation. The
class of grammars that can be parsed by LR parsers are LR(k) grammars, and as before,
we're interested primarily in LR(1) grammars, which require the parser to look ahead
only one symbol to recognize a handle. LALR(k) grammars (Look-Ahead LR
grammars), which are also discussed in depth in Chapter Five, are a special case of the more
general LR(k) grammars.
3.3 Parse Trees and Semantic Difficulties
The grammar in Table 3.1 recognizes a very limited set of English sentences: A
sentence is a subject and a predicate. Subjects are made up of nouns with an optional
preceding article; predicates are either single verbs or a verb followed by a prepositional
phrase; a prepositional phrase is a preposition followed by an object; and a noun
can be one of the tokens time or arrow.
You can use this grammar to parse the sentence "Time flies like an arrow" with the
following leftmost derivation:
    sentence => subject predicate                   by Production 1
             => noun predicate                      by Production 2
             => time predicate                      by Production 9
             => time verb prep_phrase               by Production 5
             => time flies prep_phrase              by Production 13
             => time flies preposition object       by Production 6
             => time flies like object              by Production 14
             => time flies like article noun        by Production 8
             => time flies like an noun             by Production 12
             => time flies like an arrow            by Production 10
The derivation process can be represented diagrammatically as a parse tree, a graph in
which the root node of each subtree is a nonterminal symbol in the language, and the
subtree itself represents the right-hand side of a replaced symbol. The goal symbol is
always at the apex of the parse tree. The parse tree for the previous derivation is in
Figure 3.1.
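
A parse tree of this sort is easy to represent in C. The following is a minimal sketch, assuming hypothetical names (Node, mk, time_flies_tree, frontier, free_tree): interior nodes hold nonterminal names, leaves hold the words of the sentence, and reading the leaves from left to right recovers the input sentence. Two children per node are enough here because no right-hand side in this grammar has more than two symbols.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXKIDS 2       /* no right-hand side here has more than 2 symbols */

typedef struct node {
    const char  *name;          /* nonterminal name, or the word at a leaf */
    int          nkids;         /* 0 for a leaf (a terminal symbol)        */
    struct node *kid[MAXKIDS];
} Node;

/* Make a node with zero, one, or two children (pass NULL for "none"). */
static Node *mk(const char *name, Node *k0, Node *k1)
{
    Node *n = malloc(sizeof *n);

    if (n == NULL)
        abort();
    n->name   = name;
    n->nkids  = 0;
    n->kid[0] = n->kid[1] = NULL;
    if (k0) n->kid[n->nkids++] = k0;
    if (k1) n->kid[n->nkids++] = k1;
    return n;
}

/* Build the finished tree for "time flies like an arrow" by hand. */
Node *time_flies_tree(void)
{
    Node *subject = mk("subject",
                        mk("noun", mk("time", NULL, NULL), NULL), NULL);
    Node *object  = mk("object",
                        mk("article", mk("an", NULL, NULL), NULL),
                        mk("noun",    mk("arrow", NULL, NULL), NULL));
    Node *prep    = mk("prep_phrase",
                        mk("preposition", mk("like", NULL, NULL), NULL),
                        object);
    Node *pred    = mk("predicate",
                        mk("verb", mk("flies", NULL, NULL), NULL), prep);

    return mk("sentence", subject, pred);
}

/* Append the leaves of the tree to buf, left to right, separated by
 * single spaces. The caller must set buf[0] = '\0' beforehand.
 */
void frontier(const Node *n, char *buf, size_t size)
{
    int i;

    assert(n != NULL && size > 0);
    if (n->nkids == 0) {
        if (buf[0] != '\0')
            strncat(buf, " ", size - strlen(buf) - 1);
        strncat(buf, n->name, size - strlen(buf) - 1);
        return;
    }
    for (i = 0; i < n->nkids; ++i)
        frontier(n->kid[i], buf, size);
}

void free_tree(Node *n)
{
    int i;

    if (n == NULL)
        return;
    for (i = 0; i < n->nkids; ++i)
        free_tree(n->kid[i]);
    free(n);
}
```

The tree is built by hand here only to show its shape. A parser would create each node at the moment it applies the corresponding production.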
Table 3.1. A Grammar for an English Sentence

    Production
    Number      Left-hand Side      Right-hand Side
     1.         sentence        →   subject predicate
     2.         subject         →   noun
     3.                         |   article noun
     4.         predicate       →   verb
     5.                         |   verb prep_phrase
     6.         prep_phrase     →   preposition object
     7.         object          →   noun
     8.                         |   article noun
     9.         noun            →   time
    10.                         |   arrow
    11.         article         →   a
    12.                         |   an
    13.         verb            →   flies
    14.         preposition     →   like

Note that the parser would happily have applied noun → time in the last step of the
derivation, even though "time flies like a time" is not a reasonable sentence, and this is an
example of a semantic versus a syntactic problem: a meaning-related versus a structural
problem. This situation often arises in computer-language implementations. The
foregoing problem could probably be fixed grammatically, but it's not always possible to do
so, and the resulting grammar will always be larger than necessary (the larger the
grammar, the slower the parser). There's an even more serious problem that becomes clear
when you add the following productions to the grammar:
    noun → fruit | banana
The grammar now parses:
    Fruit flies like a banana
without errors. Nonetheless, the sentence is actually parsed incorrectly because flies is
being treated as a verb, not a noun. That is, you've just given bananas the capability of
independent flight. You could try to rectify the situation by adding the following
productions:
    adjective → fruit
    noun      → flies
and modifying subject as follows:
    subject → noun | adjective noun
But now, the grammar has become almost impossible to parse because the parser has to
know the meanings of the various words in order to figure out which of the productions
to apply. Making the wrong choice causes further errors (like the existence of a time fly,
which can move forward or backwards in time to avoid being swatted). To parse both
"time flies like an arrow" and "fruit flies like a banana" correctly, you'd have to know that
there are no such things as time flies and that fruit can't fly, and it's very difficult (if not
impossible) to build this semantic knowledge into a context-free grammar.
[Figure 3.1. Evolution of a Parse Tree for Time Flies Like An Arrow, panels (a) through (l)]

The foregoing example illustrates the limitations of context-free grammars, which
can only specify syntax, not semantics. Though there are a few exceptions in
artificial-intelligence applications, most real parsers are limited by the limitations of the
grammar: they only know about syntax. Consequently, if semantic knowledge is
necessary to control the parse of a specific grammar, it becomes very difficult to build a
computer program that can parse an input sentence. It's usually not possible to resolve
all the semantic ambiguities of a computer language in the grammar. For example, the
grammar itself can't tell you whether a variable is of the correct type when it's used in an
expression, it can't recognize multiple declarations like int x, long x;, and so
forth. To do the foregoing, you need to know something about meaning. Errors like the
foregoing are best detected by the code-generation part of the compiler or by auxiliary
code in the parser, not in the grammar itself.
3.4 ε Productions
I mentioned earlier that a grammatical phrase (a right-hand side) can be empty.
More correctly, it's permitted for a right-hand side to match the empty string, ε. Productions
of the form s→ε are called ε productions (pronounced "epsilon productions").
Consider the following grammar that recognizes compound statements:
    1: compound_stmt → LEFT_CURLY stmt RIGHT_CURLY
    2: stmt          → NUMBER
    3:               | ε
I've simplified by defining a statement as either a single number or an empty string. This
grammar supports empty compound statements, as does C. First, note that an ε production
can effectively disappear from a derivation. The input {} generates the following
derivation:
    compound_stmt ⇒ LEFT_CURLY stmt RIGHT_CURLY
                  ⇒ LEFT_CURLY RIGHT_CURLY
The application of stmt→ε effectively removes the nonterminal from the derivation by
replacing it with an empty string. This disappearance is important when you consider
how an LL parser decides which of several productions to apply when the nonterminal
being replaced has several right-hand sides. There's no problem if all of the right-hand
sides start with different terminal symbols; the parser can decide on the basis of the lookahead
symbol. If the current lookahead symbol matches one of the terminals, then the
production whose right-hand side starts with that symbol is used. This kind of
grammar, where the leftmost symbols of all the right-hand sides of any given nonterminal
are different terminal symbols, is called an S grammar, and S grammars are among
the simplest of the LL(1) grammars to recognize.
An ε production causes problems because the parser has to look beyond the current
production to make its decision. The parser still decides whether or not a symbol
matches an empty string by looking at the next input symbol, but it also has to look at the
grammatical symbols that can follow the nonterminal that goes to ε. In the current
example, if the next input symbol is a NUMBER, then the parser can apply
stmt→NUMBER. It applies the ε production if the next input symbol is a
RIGHT_CURLY. The reasoning here is that, if an empty string is matched, then the
current nonterminal can effectively disappear from the derivation, and as a consequence,
the next input symbol must be a symbol that can follow that nonterminal.
Since the parser can decide which production to apply in the current grammar by
looking at a single lookahead symbol, this is an LL(1) grammar. If a stmt could
start with a RIGHT_CURLY, or if a NUMBER could follow a stmt, then the parser
wouldn't be able to decide what to do, and the grammar would not be LL(1). The real
situation is a little more complicated than the foregoing would indicate; I'll discuss the
ins and outs of LL(1) grammars in depth in the next chapter. The current example serves
to illustrate the sorts of problems that are involved, however.
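The lookahead decision just described can be sketched as a tiny recursive-descent recognizer. This is only an illustration under stated assumptions: the one-token scanner (advance()), the match() helper, the token enumeration, and the parse() wrapper are all made up for this sketch and are not the book's lexical analyzer.

```c
#include <ctype.h>
#include <stdio.h>

/* Token codes mirror the grammar's terminals; END stands for end of input. */
enum token { LEFT_CURLY, RIGHT_CURLY, NUMBER, END, BAD };

static const char *input;      /* remaining input text (hypothetical scanner state) */
static enum token  lookahead;  /* the single lookahead symbol                       */

static void advance(void)      /* hypothetical one-token scanner */
{
    while (isspace((unsigned char)*input))
        ++input;

    if (*input == '\0')      lookahead = END;
    else if (*input == '{')  { lookahead = LEFT_CURLY;  ++input; }
    else if (*input == '}')  { lookahead = RIGHT_CURLY; ++input; }
    else if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input))
            ++input;
        lookahead = NUMBER;
    }
    else lookahead = BAD;
}

static int match(enum token t)
{
    if (lookahead != t)
        return 0;
    advance();
    return 1;
}

/* stmt -> NUMBER | epsilon */
static int stmt(void)
{
    if (lookahead == NUMBER)        /* NUMBER starts the first right-hand side   */
        return match(NUMBER);
    if (lookahead == RIGHT_CURLY)   /* RIGHT_CURLY can follow stmt: use epsilon  */
        return 1;                   /* nothing is consumed                       */
    return 0;                       /* neither production applies                */
}

/* compound_stmt -> LEFT_CURLY stmt RIGHT_CURLY */
static int compound_stmt(void)
{
    return match(LEFT_CURLY) && stmt() && match(RIGHT_CURLY);
}

int parse(const char *text)         /* returns 1 if text is a compound_stmt */
{
    input = text;
    advance();
    return compound_stmt() && lookahead == END;
}
```

Calling parse("{}") takes the ε branch in stmt(), while parse("{ 42 }") takes the NUMBER branch. Note that stmt() chooses ε by checking for a symbol that can follow stmt, not a symbol that starts it.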
3.5 The End-of-Input Marker
Another symbol that bears mentioning is the end-of-input marker, represented by ⊢.
Strictly speaking, the end-of-input marker is treated as a token, and it always follows the
rightmost symbol on the right-hand side of the start production (the one with the goal
symbol on its left-hand side). An explicit ⊢ is often omitted from a grammar, however.
The ⊢ is still there; it's just not shown. This omission doesn't present problems in many
situations. Consider the case of a Pascal program that has to end in a period. The starting
production in a Pascal grammar could look like this:
    program → definitions PERIOD
and the parser would just look for the period to detect the end of the input sentence. Any
input that followed the period, including the ⊢, could be ignored in this situation. The
situation is complicated in C, because the start production can look like this:
    program → definitions | ε
In this case, the parser has to look for an ⊢ marker, even though there's no explicit
marker in the grammar. Just mentally tack a ⊢ to the end of all the right-hand sides of
the start production. (Remember, ε is an identity element for string concatenation, so ε⊢
is the same thing as ⊢.)
3.6 Right-Linear Grammars
This section moves away from a general discussion of grammars to the specifics of
implementing them. Chapter Two described how state machines can be used for lexical
analysis, and the notion of a regular definition, a grammatical representation of the input
language, was also introduced. It turns out that all languages that can be represented as
state machines can also be represented grammatically, and I'll use this similarity to
demonstrate the relationships between the two systems here.
Grammars that can be translated to DFAs are called right-linear grammars. A grammar
is right-linear if the right-hand side of each nonterminal has at most one nonterminal
in it, and that nonterminal is at the far right of the right-hand side. (A left-linear grammar
is the same, except that the nonterminal symbol must be at the far left.)
A right-linear grammar can be translated directly into a DFA if all productions in the
grammar are either of the form a→ε or a→X b. That is, the right-hand side must either
be ε, or it must be made up of a single terminal symbol followed by a single nonterminal
symbol. To make a DFA, you must add the further restriction that, if a production has
more than one right-hand side, the right-hand sides must all start with different symbols.
(If they don't, you have an NFA.) The state machine is created from the grammar as follows:
(1) The set of terminal symbols in the grammar forms the DFA's input alphabet.
(2) The set of nonterminal symbols in the grammar forms the states in the DFA. The
    start production is the DFA start state.
(3) If a production takes the form a→ε, then State a is an accepting state.
(4) If a production takes the form a→X b, then a transition is made from State a to
    State b on the character X.
Figure 3.2 shows both a state machine and grammar for recognizing a subset of the C
floating-point constants (the regular expression D*\.D|D\.D* is recognized).
A parser for this sort of grammar is, of course, trivial to implement. In fact, that's
what the LEX-generated state-machine driver is: a parser for a right-linear grammar.
State transitions are made based on the current state and input character, and an action is
executed when an accepting state is entered. A right-linear grammar that doesn't have
the properties discussed earlier can easily be modified to have the required properties
using the transformation rules discussed in Chapter Four.
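Rules (1) through (4) can be applied mechanically, as the following sketch tries to show. The grammar in the comment is a made-up example (it is not the grammar of Figure 3.2), and the table layout and the recognize() name are assumptions chosen for illustration rather than the output of any particular tool.

```c
#include <ctype.h>

/* Hypothetical right-linear grammar (one or more digits, a dot, one or
 * more digits):
 *
 *   a -> DIGIT b
 *   b -> DIGIT b | DOT c
 *   c -> DIGIT d
 *   d -> DIGIT d | epsilon
 */
enum state  { A, B, C, D, NSTATES, ERR = -1 };  /* rule 2: nonterminals are states    */
enum symbol { DIGIT, DOT, NSYMBOLS };           /* rule 1: terminals are the alphabet */

/* Rule 4: a -> X b becomes a transition from State a to State b on X. */
static const int next_state[NSTATES][NSYMBOLS] = {
    /*         DIGIT  DOT  */
    /* A */  { B,     ERR },
    /* B */  { B,     C   },
    /* C */  { D,     ERR },
    /* D */  { D,     ERR },
};

/* Rule 3: a state accepts if its nonterminal has an epsilon production. */
static const int accepting[NSTATES] = { 0, 0, 0, 1 };

int recognize(const char *s)
{
    int state = A;                      /* the start production is the start state */

    for (; *s; ++s) {
        int sym;

        if (isdigit((unsigned char)*s)) sym = DIGIT;
        else if (*s == '.')             sym = DOT;
        else                            return 0;   /* not in the input alphabet */

        state = next_state[state][sym];
        if (state == ERR)
            return 0;
    }
    return accepting[state];
}
```

Because every nonterminal's right-hand sides start with different terminals, each table cell holds at most one target state, which is exactly the condition that separates a DFA from an NFA.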
Figure 3.2. State-machine and Grammar That Recognizes D*\.D|D\.D*
    0 → DIGIT 4
      | DOT 2
    1 → DIGIT 1
      | DOT accept
    2 → DIGIT 3
      | DOT error
    3 → DIGIT accept
      | DOT accept
    4 → DIGIT error
      | DOT 1
    5 → DIGIT 5
      | DOT 2
3.7 Lists, Recursion, and Associativity
Probably the most common grammatical structure is a list. Programs are lists of variable
declarations and subroutines, subroutines are lists of statements, expressions are
lists of identifiers and constants delimited by operators, and so forth. We've actually
been using lists informally up until now, but it's worthwhile to look at them in depth.
3.7.1 Simple Lists
Lists are loops, and loops are implemented grammatically using recursion. You can
see what's going on graphically by looking at the grammatical representations of States
1 and 5 in Figure 3.2. The loops in these states recognize lists of digits, but
you can use the same grammatical structure to recognize loops of any sort.
The recursion can be implemented in two ways. A left-recursive production is one
where the left-hand side appears as the leftmost symbol on the right-hand side. A right-
recursive production is the other way around: the left-hand side is duplicated at the far
right. Note that a grammar can be right (or left) recursive and not be right (or left) linear
because there can be more than one nonterminal on the right-hand side.
The following grammar provides an example of a simple, left-recursive list of statements:
    stmt_list → stmt_list stmt
              | stmt
If you also supply a simple definition of stmt as one of three terminal symbols:
    stmt → A | B | C
then the list ABC generates the parse tree in Figure 3.3.
A typical parser creates the parse tree from left to right as it reads the input, but it
does a left-to-right, depth-first traversal of the parse tree as it generates code. (The tree is
traversed left-to-right, but the nodes further down on the tree are visited first.) The subscripts
in Figure 3.3 show the order in which nodes are visited. Assuming that the code
that processes a statement is executed when the stmt_list nodes are traversed, the list elements
are processed from left to right. That is, the elements in the list associate from left
to right. This is always the case. Left-recursive productions always associate from left to
right.
Figure 3.3. Parse Tree for a Simple, Left-Recursive List
Changing the grammar to a right-recursive form illustrates the change to right associativity.
The new grammar looks like this:
    stmt_list → stmt stmt_list | stmt
    stmt      → A | B | C
and the parse tree for ABC is shown in Figure 3.4. Again, a depth-first traversal causes
the nodes to be visited in the indicated order, and the important thing is the order in
which the stmt_list nodes are visited. The elements in this list are processed from right
to left.
Figure 3.4. Parse Tree for a Simple, Right-Recursive List
Recursive-descent parsers can cheat and generate code as the tree is built from the
top down. This cheating can give surprising results, because a list can be processed from
left to right, even though the grammar would indicate otherwise. For example, Listing
3.1 shows the list being processed from left to right, and Listing 3.2 processes the same
input from right to left. The grammar is the same, right-recursive grammar in both listings,
but the position of the process_statement() subroutine has been changed. In
Listing 3.1, process_statement() is called before the recursive stmt_list()
call, so the list element is processed before the subtree is built. In the second listing, the
processing happens after the recursive call, so the subtree will have been traversed
before the list element is processed.
Listing 3.1. Left Associativity with a Right-Recursive Grammar
    stmt_list()
    {
        /* Code is generated as you create the tree, before the subtree is
         * processed.
         */

        remember = stmt();

        process_statement( remember );

        if( not_at_end_of_input() )
            stmt_list();
    }

    stmt()
    {
        return( read() );
    }
Listing 3.2. Right Associativity with a Right-Recursive Grammar
    stmt_list()
    {
        /* Code is generated as you create the tree, after the subtree is
         * processed.
         */

        remember = stmt();

        if( not_at_end_of_input() )
            stmt_list();

        process_statement( remember );
    }

    stmt()
    {
        return( read() );
    }
Since you can't have left recursion in a recursive-descent parser, this technique is
often useful, but it can also cause maintenance problems, because some code is generated
as the tree is created, but other code must be generated as the tree is traversed.
You can't apply this technique in processing expressions, for example. Consequently, it's
usually best always to generate code after the call to a subroutine that visits a subtree,
not before it. In Chapter Four, we'll look at a method for modifying grammars that can
eliminate the left recursion but maintain the left associativity; in general, this is the best
approach.
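To watch the two orderings happen, here is a compilable sketch of the same idea as Listings 3.1 and 3.2. It is an illustration, not the book's code: the input is assumed to be a string with one character per statement, process_statement() merely records the order in a buffer, and the order() wrapper is a hypothetical convenience.

```c
#include <string.h>

static const char *in;        /* remaining input, one character per stmt   */
static char out[32];          /* order in which statements were processed  */

static void process_statement(char c)
{
    size_t n = strlen(out);

    if (n + 1 < sizeof out) {
        out[n]     = c;
        out[n + 1] = '\0';
    }
}

static char stmt(void) { return *in++; }        /* stmt -> A | B | C */

/* stmt_list -> stmt stmt_list | stmt, action BEFORE the recursive call. */
static void list_action_first(void)
{
    char remember = stmt();

    process_statement(remember);
    if (*in)
        list_action_first();
}

/* Same grammar, action AFTER the recursive call. */
static void list_action_last(void)
{
    char remember = stmt();

    if (*in)
        list_action_last();
    process_statement(remember);
}

/* Returns the processing order for text under the chosen action placement. */
const char *order(const char *text, int action_first)
{
    in = text;
    out[0] = '\0';
    if (*in) {
        if (action_first) list_action_first();
        else              list_action_last();
    }
    return out;
}
```

With the input ABC, order("ABC", 1) records the statements as ABC and order("ABC", 0) records them as CBA, even though the grammar (and the recursion) is identical in both functions.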
There is one maintainable way to get left associativity with a right-associative grammar,
though this method can be used only occasionally. Looking again at Figure 3.4,
you'll notice that the stmt nodes are processed from left to right, even though the
stmt_list nodes go from right to left. Consequently, if the statements can be processed in
stmt rather than in stmt_list, the associativity is effectively reversed.
Strictly speaking, the grammars we've been looking at are self left or right recursive.
Self-recursive productions are those where the left-hand side appears at one end or the
other of the right-hand side. It's also possible for a grammar to have indirect recursion,
however. Here, one or more steps have to be added to a derivation before the recursion
becomes evident. For example, the following grammar has indirect left recursion in it:
    1. s → a AHH
    2. a → c CHOO
    3. c → s DOO_WOP
because all derivations that start with s ultimately end up with an s on the far left of the
viable prefix (of the partial derivation):
    s ⇒ a AHH                  by Production 1
      ⇒ c CHOO AHH             by Production 2
      ⇒ s DOO_WOP CHOO AHH     by Production 3
In general, indirect recursion is not desirable in a grammar because it masks the fact
that the grammar is recursive.
3.7.2 The Number of Elements in a List
The grammar we just looked at creates a list with at least one element in it. It's possible
for a list to be empty, however. The following productions recognize zero or more
declarations:
    decl_list → decl_list declaration
              | ε
Since a decl_list can go to ε, the list could be empty. Parse trees for left and right recursive
lists of this type are shown in Figure 3.5. Notice here that the decl_list→ε production
is the first list element that's processed in the left-recursive list, and it's the last
list element processed in the right-recursive list. The same sort of thing was also true in
the nonempty lists that we looked at in the last section: the stmt_list→stmt production
was processed only once in both types of lists, and it was processed first in the left-
recursive list and last in the right-recursive list. This fact comes in handy when you're
doing code generation, because you can use the ε production to trigger initializations or
clean-up actions. The technique is used heavily in Chapter Six.
It's also possible to specify a limited number of repetitions, using something like the following:
    list → element
         | element element
         | element element element
The method is cumbersome, but workable. Though this method is the only way to do it
with a grammar, you can solve the limited-number-of-repetitions problem semantically
rather than syntactically. Instead of building the counts into the grammar, you can use a
more general-purpose list and have the parser keep track of how many times the list productions
are applied, printing an error message if there is an incorrect number of repetitions.
Figure 3.5. Lists Terminated with ε Productions
    Left recursive:   decl_list → decl_list declaration | ε
    Right recursive:  decl_list → declaration decl_list | ε
    [Figure: parse trees for two-declaration lists. The ε node is the deepest, leftmost leaf of
    the left-recursive tree and the deepest, rightmost leaf of the right-recursive tree.]
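The semantic approach to limiting repetitions, mentioned in this section, might look something like the following sketch. Everything here is assumed for illustration: each 'e' character stands for one element, the limits MIN_ELEMENTS and MAX_ELEMENTS are arbitrary, and checked_list() is a hypothetical wrapper around a general-purpose list production.

```c
#include <stdio.h>

#define MIN_ELEMENTS 1   /* hypothetical limits, chosen only for illustration */
#define MAX_ELEMENTS 3

/* list -> element list | epsilon
 * Returns the number of times the recursive production was applied.
 */
static int list(const char **p)
{
    if (**p != 'e')          /* list -> epsilon      */
        return 0;
    ++*p;                    /* list -> element list */
    return 1 + list(p);
}

/* Returns the element count, or -1 after printing an error message. */
int checked_list(const char *text)
{
    const char *p     = text;
    int         count = list(&p);

    if (*p != '\0') {
        fprintf(stderr, "unexpected character '%c'\n", *p);
        return -1;
    }
    if (count < MIN_ELEMENTS || count > MAX_ELEMENTS) {
        fprintf(stderr, "list has %d elements; expected %d to %d\n",
                count, MIN_ELEMENTS, MAX_ELEMENTS);
        return -1;
    }
    return count;
}
```

The grammar itself accepts a list of any length; the count check is done by auxiliary code after the list has been recognized, so changing the limits doesn't require touching the productions.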
3.7.3 Lists with Delimiters
Most lists must incorporate delimiters of some sort into them, and the delimiters can
be handled in two ways. In C, for example, a comma separates multiple names in a
declaration, as in:
    int x, y, z;
Semicolons terminate elements of a list of several declarations, however. Both types of
lists are illustrated in the following grammar:
    decl_list       → decl_list declaration
                    | ε
    declaration     → declarator_list specifier_list SEMICOLON
    declarator_list → TYPE declarator_list
                    | ε
    specifier_list  → specifier_list COMMA NAME
                    | NAME
A decl_list is a list of zero or more declarations (it's left associative). The declaration
production defines a single list element. It is made up of declarators followed by
specifiers, and is terminated with a SEMICOLON. Since SEMICOLON is a terminator,
it's at the far right of the production that describes the list element, rather than being
in the list production (the decl_list) itself. A declarator_list is a simple list of zero or
more TYPE tokens. Finally, a specifier_list is a COMMA-separated list of identifiers.
Here, the COMMA, since it is a separator, is part of the list-definition production itself.
The specifier_list is left recursive. A right-recursive version of the same thing could
be done like this:
    specifier_list → NAME COMMA specifier_list
                   | NAME
The relative positions of the COMMA and NAME must also be reversed here. Common
ways to do the various types of lists are all summarized in Table 3.2.
Table 3.2. List Grammars

No Separator
                        Right associative                    Left associative
At least one element    list → MEMBER list | MEMBER          list → list MEMBER | MEMBER
Zero elements okay      list → MEMBER list | ε               list → list MEMBER | ε

Separator Between List Elements
                        Right associative                    Left associative
At least one element    list → MEMBER delim list | MEMBER    list → list delim MEMBER | MEMBER
Zero elements okay      opt_list → list | ε                  opt_list → list | ε
                        list → MEMBER delim list | MEMBER    list → list delim MEMBER | MEMBER

A MEMBER is a list element; it can be a terminal, a nonterminal, or a collection of terminals and nonterminals.
If you want the list to be a list of terminated objects such as semicolon-terminated declarations,
MEMBER should take the form: MEMBER → α TERMINATOR, where α is a collection of one or more terminal
or nonterminal symbols.
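The difference between a separator and a terminator shows up clearly when the declaration grammar from this section is hand-coded. The sketch below uses the right-recursive specifier_list and is only a toy: the scanner, the two type keywords it knows ("int" and "long"), and the declaration() wrapper that returns the number of names are all assumptions made for illustration.

```c
#include <ctype.h>
#include <string.h>

enum tok { TYPE, NAME, COMMA, SEMICOLON, END, BAD };

static const char *src;      /* remaining input (hypothetical scanner state) */
static enum tok    look;     /* current lookahead token                      */

static void advance(void)
{
    while (isspace((unsigned char)*src))
        ++src;

    if (*src == '\0')      look = END;
    else if (*src == ',')  { look = COMMA;     ++src; }
    else if (*src == ';')  { look = SEMICOLON; ++src; }
    else if (isalpha((unsigned char)*src)) {
        const char *start = src;
        size_t      len;

        while (isalnum((unsigned char)*src))
            ++src;
        len = (size_t)(src - start);

        /* Only two type keywords are known to this toy scanner. */
        if ((len == 3 && strncmp(start, "int",  3) == 0) ||
            (len == 4 && strncmp(start, "long", 4) == 0))
            look = TYPE;
        else
            look = NAME;
    }
    else look = BAD;
}

/* declarator_list -> TYPE declarator_list | epsilon */
static void declarator_list(void)
{
    if (look == TYPE) {
        advance();
        declarator_list();
    }
}

/* specifier_list -> NAME COMMA specifier_list | NAME
 * Returns the number of names, or -1 on a syntax error.
 */
static int specifier_list(void)
{
    int rest;

    if (look != NAME)
        return -1;
    advance();
    if (look != COMMA)       /* specifier_list -> NAME                     */
        return 1;
    advance();               /* the separator lives in the list production */
    rest = specifier_list();
    return rest < 0 ? -1 : 1 + rest;
}

/* declaration -> declarator_list specifier_list SEMICOLON */
int declaration(const char *text)
{
    int names;

    src = text;
    advance();
    declarator_list();
    names = specifier_list();
    if (names < 0 || look != SEMICOLON)
        return -1;
    advance();               /* the terminator belongs to the list element */
    return look == END ? names : -1;
}
```

The COMMA is consumed inside specifier_list(), because a separator is part of the list production, while the SEMICOLON is consumed in declaration(), because a terminator is part of the element.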
3.8 Expressions
One special form of a list is an expression. For example, 1+2+3 is a plus-sign-
delimited list of numbers. Table 3.3 shows a grammar that recognizes a list of one or
more statements, each of which is an arithmetic expression followed by a semicolon. A
stmt is made up of a series of semicolon-delimited expressions (exprs), each a series of
numbers separated either by asterisks (for multiplication) or plus signs (for addition).
This grammar has several properties that are worth discussing in depth.
    1. stmt   → expr ;
    2.        | expr ; stmt
    3. expr   → expr + term
    4.        | term
    5. term   → term * factor
    6.        | factor
    7. factor → number
    8.        | ( expr )
Figure 3.6 shows the parse tree generated by the foregoing grammar when the input
statement 1+2*(3+4)+5; is processed. The code-generation pass does a depth-first
traversal of the tree: the deeper nodes are always processed first. The subscripts in the
figure show the order in which the nodes are visited. Assuming that code is generated
only after an entire right-hand side has been processed, the depth-first traversal forces
the 3+4 to be done first (because it's deepest in the tree and the expr associated with the
subexpression 3+4 is processed before any of the other exprs); then, moving up the tree,
the result of the previous addition is multiplied by the 2; then 1 is added to the accumulated
subexpression; and, finally, the 5 is added into the total.
Figure 3.6. A Parse Tree for 1+2*(3+4)+5;
Note that the order of evaluation is what you would expect, assuming that multiplication
is higher precedence than addition, that addition associates from left to right, and
that parenthesized expressions are evaluated first. Since the parse tree in Figure 3.6 is
the only possible one, given the grammar used, this means that associativity and precedence
are actually side effects of the grammatical structure. In particular, subtrees that
are lower down on the parse tree are always processed first, and positioning a production
further down in the grammar guarantees a lower relative position in the parse tree.
Higher-precedence operators should then be placed lower down in the grammar, as is the
case in the grammar used here. (The only way to get to a term is through an expr, so
multiplication is always further down the tree than addition.) Parentheses, since they are
at the bottom of the grammar, have higher precedence than any operator. The associativity
is still determined by left or right recursion, as is the case in a simple list.
Unary operators tend to be lower in the grammar because they are usually higher precedence
than the binary operators. You can add unary minus to the previous grammar by
modifying the factor rules as follows:
    1.  stmt   → expr ;
    2.         | expr ; stmt
    3.  expr   → expr + term
    4.         | term
    5.  term   → term * unop
    6.         | unop
    7.  unop   → - factor
    8.         | factor
    9.  factor → number
    10.        | ( expr )
The placement of the new productions ensures that unary minus can be applied to
parenthesized subexpressions as well as single operands.
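One way to convince yourself that precedence falls out of the grammar's layering is to code the grammar as an evaluator, one subroutine per nonterminal. The sketch below is an assumption-laden illustration rather than the book's parser: it computes values directly instead of generating code, it writes the left-recursive expr and term productions as loops (a recursive-descent parser can't use left recursion directly), and it omits error handling.

```c
#include <ctype.h>

static const char *p;                       /* remaining input */

static void skip(void) { while (isspace((unsigned char)*p)) ++p; }

static long expr(void);

/* factor -> number | ( expr ) */
static long factor(void)
{
    long v = 0;

    skip();
    if (*p == '(') {
        ++p;
        v = expr();
        skip();
        if (*p == ')')                      /* error handling omitted */
            ++p;
        return v;
    }
    while (isdigit((unsigned char)*p))
        v = v * 10 + (*p++ - '0');
    return v;
}

/* unop -> - factor | factor */
static long unop(void)
{
    skip();
    if (*p == '-') {
        ++p;
        return -factor();
    }
    return factor();
}

/* term -> term * unop | unop   (left recursion written as a loop) */
static long term(void)
{
    long v = unop();

    for (skip(); *p == '*'; skip()) {
        ++p;
        v *= unop();
    }
    return v;
}

/* expr -> expr + term | term   (left recursion written as a loop) */
static long expr(void)
{
    long v = term();

    for (skip(); *p == '+'; skip()) {
        ++p;
        v += term();
    }
    return v;
}

long evaluate(const char *text)
{
    p = text;
    return expr();
}
```

Since expr() can reach a multiplication only by calling term(), every product is finished before the enclosing addition uses it, and unop() sits below term() so the minus binds tighter still. Each loop accumulates from left to right, which preserves the left associativity of the original left-recursive productions.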
Unary operators can sometimes be treated as lists. For example, the C pointer-
dereference operator (*) can pile up to the left of the operand, like this: ***p. You can
introduce a star into the previous grammar by adding the following list production:
    factor → * factor
Note that the right recursion correctly forces right-to-left associativity for this operator.
3.9 Ambiguous Grammars
The expression grammar just discussed is an unambiguous grammar because only one
possible parse tree can be created from any given input stream. The same parse tree is
generated, regardless of the derivation used. It's possible to write an ambiguous grammar,
however. For example, expressions could be represented as follows:
    statement → expr ;
    expr      → expr + expr
              | expr * expr
              | ( expr )
              | number
A grammar like this one is ambiguous because the same nonterminal appears twice on a
right-hand side, so the order in which the nonterminals are evaluated is dependent on the
derivation. Two of the possible parse trees that can be generated for A+B*C are shown in
Figure 3.7.
Ambiguous grammars are generally to be avoided exactly because they're ambiguous.
That is, because there are two possible parse trees, the expression can be evaluated
in two different ways, and there's no way to predict which one of these ways will be used
from the grammar itself. Precedence and associativity can't be controlled by the grammar
alone. Ambiguous grammars do tend to be smaller, however, and easier to read. So,
parser-generation programs like yacc and occs generally provide mechanisms for using
them. These programs can force specific associativity or precedence by controlling the
parse in predetermined ways when an ambiguous production is encountered. The technique
is discussed in depth in Chapter Five.
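The practical cost of the ambiguity is easy to demonstrate by building both trees for A+B*C by hand and evaluating them depth first. The node structure and function names below are hypothetical, invented for this sketch; a real parser would build only one of the two trees.

```c
struct node {
    char op;                            /* '+', '*', or 0 for a number leaf */
    int  value;                         /* used only by leaves              */
    const struct node *left, *right;    /* used only by operators           */
};

static int eval(const struct node *n)   /* depth first: children before parent */
{
    if (n->op == 0)
        return n->value;
    if (n->op == '+')
        return eval(n->left) + eval(n->right);
    return eval(n->left) * eval(n->right);
}

/* Tree 1: expr -> expr + expr at the root, so B*C is deeper and done first. */
int multiply_first(int a, int b, int c)
{
    struct node A = { 0, a, 0, 0 }, B = { 0, b, 0, 0 }, C = { 0, c, 0, 0 };
    struct node mul = { '*', 0, &B, &C };
    struct node add = { '+', 0, &A, &mul };

    return eval(&add);
}

/* Tree 2: expr -> expr * expr at the root, so A+B is deeper and done first. */
int add_first(int a, int b, int c)
{
    struct node A = { 0, a, 0, 0 }, B = { 0, b, 0, 0 }, C = { 0, c, 0, 0 };
    struct node add = { '+', 0, &A, &B };
    struct node mul = { '*', 0, &add, &C };

    return eval(&mul);
}
```

For A=2, B=3, and C=4, the first tree yields 14 and the second yields 20. Both trees are legal under the ambiguous grammar, which is why something outside the grammar has to pick one.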
Figure 3.7. Parse Trees Derived from an Ambiguous Grammar
3.10 Syntax-Directed Translation
I mentioned earlier that if the syntactic scope of a language was sufficiently limited,
you could derive semantic information from a formal grammar: from the position of
particular symbols in the input sentence, and by extension in the grammar itself. This is
how compilers, in fact, generate code: by executing code-generation actions at times
that are controlled by the positions of various strings in the input sentence. We did this
on an ad hoc basis in Chapter One, first by building a parser for a grammar, and then by
adding code-generation actions to the parser. The process is discussed in greater depth
in this section.
3.10.1 Augmented Grammars
In an augmented grammar, code-generation actions are placed in the grammar itself,
and the relative position of the action in the grammar determines when it is executed.
For example, a production like this:
    expr' → + term {op('+');} expr'
could be coded in a recursive-descent parser as follows:
    expr_prime()
    {
        if( match( PLUS ) )
        {
            term();
            op('+');
            expr_prime();
        }
    }
I'll demonstrate how an augmented grammar works by generating code to evaluate
expressions. You need two subroutines for this purpose, both of which generate code. In
addition, between the two of them, they manage a local stack of temporary-variable
names. An expression of the form 1+2+3 generates the following code:
    t0 = 1;
    t1 = 2;
    t0 += t1;
    t1 = 3;
    t0 += t1;
The variables t0 and t1 are known as anonymous temporaries (or just plain temporaries)
because the compiler makes up the names, not the programmer. The actual
code is generated with the following sequence of actions (the numbers identify individual
actions, several of which are repeated):
(1) Read the 1.
(2) Get a temporary-variable name (t0), and output t0 = 1.
(3) Push the temporary-variable name onto a local stack.
(4) Read the "+" and remember it.
(1) Read the 2.
(2) Get a temporary-variable name (t1), and output t1 = 2.
(3) Push the temporary-variable name onto a local stack.
(5) Pop two temporary-variable names off the stack and generate code to add them
    together: t0 += t1 [you're adding because you remembered it in (4)].
(6) Push the name of the temporary that holds the result of the previous operation and
    discard the second one.
(4) Read the "+" and remember it.
(1) Read the 3.
(2) Get a temporary-variable name (t1), and output t1 = 3.
(3) Push the temporary name onto a local stack.
(5) Pop two temporary-variable names off the stack and generate code to add them
    together: t0 += t1.
(6) Push the name of the temporary that holds the result of the previous operation and
    discard the second one.
When you're done, the name of the anonymous temporary that holds the evaluated
expression is at the top of the stack. Many of the previous actions are identical; the
time that the action is executed is the important factor here.
The subroutines that do the work are in Listing 3.3. Temporary-variable names are
allocated and freed with the newname() and freename() calls on lines nine and ten
using the method described in Chapter One. You allocate a name by popping it off a
stack (declared on line six), and free the name by pushing it back. I've removed the
error-detection code to simplify things.
The rest of Listing 3.3 shows the code-generation actions. The create_tmp() subroutine
(on line 14) does two things: it generates the code that copies the current lexeme
into an anonymous temporary, and it pushes the name of that temporary onto the
Temporaries stack. The op() subroutine on line 22 generates code to do the actual operation.
It takes two temporary-variable names from the Temporaries stack and generates the
code necessary to perform the required operation. Note that only one of the temporaries
is actually popped. The other remains on the stack because it holds the result of the generated
operation. The popped name is recycled for future use on line 31.
Listing 3.3. expr.y Action Code for the Augmented Grammar
     1  #include <tools/stack.h>              /* Described in Appendix A.             */
     2
     3  stack_dcl(Temporaries, char*, 128);   /* Temporaries: stack of 128 char ptrs. */
     4
     5  /*----------------------------------------------------------------------*/
     6  char  *Names[] = { "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7" };
     7  char  **Namep  = Names;
     8
     9  char  *newname()                { return( *Namep++ ); }
    10  void  freename(s)  char *s;     { *--Namep = s;       }
    11
    12  /*----------------------------------------------------------------------*/
    13
    14  create_tmp( str )
    15  char *str;           /* Create a temporary holding str and push its name. */
    16  {
    17      char *name = newname();
    18
    19      yy_code(" %s = %s;\n", name, str );
    20      push( Temporaries, name );
    21  }
    22  op( what )
    23  int what;
    24  {
    25      char *left, *right;
    26
    27      right = pop( Temporaries );
    28      left  = tos( Temporaries );
    29
    30      yy_code(" %s %c= %s;\n", left, what, right );
    31      freename( right );
    32  }
Now, armed with the proper tools, you can design the parser. The augmented grammar
in Table 3.4 performs the actions described earlier in the correct order. The
"remember it" action is implicit in the grammar itself (because, to get to the op() call
in Production 4, for example, you must have passed a previous +). The order of evaluation
is controlled, as usual, by the structure of the parse tree and the fact that you're
doing a bottom-up traversal. Action symbols in the grammar are treated like other terminal
symbols: they're just put onto the parse tree in the appropriate places and executed
when that node is traversed. A parse tree for 1+2+3 is shown in Figure 3.8. The subscripts
indicate the order in which nodes are visited.
An exercise is in order at this juncture. Take a pencil and piece of paper and draw
both the parse tree and the contents of the Temporaries stack (which holds the
temporary-variable names) as the expression 1+2*(3+4) is parsed. Note how the
Temporaries stack keeps track of the name of the variable that holds the 1 until after the
higher-precedence multiplication operator is processed. The 2 is stored in the same way
until the parenthesized subexpression can be handled. Also, note how the name at top of
stack always holds the value of the most recently processed subexpression.
Table 3.4. An Augmented Expression Grammar
    1.  stmt   → ε
    2.         | expr ; stmt
    3.  expr   → term expr'
    4.  expr'  → + term {op('+');} expr'
    5.         | ε
    6.  term   → factor term'
    7.  term'  → * factor {op('*');} term'
    8.         | ε
    9.  factor → number_or_id {create_tmp(yytext);}
    10.        | ( expr )
Figure 3.8. Augmented Parse Tree for 1+2+3
[Figure: augmented parse tree for 1+2+3, with the action symbols attached as leaves and subscripts giving the order in which nodes are visited.]
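To see the augmented grammar of Table 3.4 at work without the book's support library, here is a self-contained sketch. Several things are assumptions made for this illustration: the Temporaries stack is a plain array instead of the <tools/stack.h> macros, output is collected in a buffer by a hypothetical emit() helper instead of yy_code(), the scanner reads characters directly with no white-space handling, the stmt productions are left out, and error detection is omitted.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *in;                  /* remaining input                    */
static char code[512];                  /* generated code accumulates here    */

static const char  *names[] = { "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7" };
static const char **namep   = names;    /* next free temporary name           */
static const char  *temporaries[8];     /* stand-in for the Temporaries stack */
static int          depth;

static void emit(const char *left, const char *oper, const char *right)
{
    size_t n = strlen(code);

    snprintf(code + n, sizeof code - n, "%s %s %s;\n", left, oper, right);
}

static void create_tmp(const char *lexeme)      /* {create_tmp(yytext);} */
{
    const char *name = *namep++;                /* newname()             */

    emit(name, "=", lexeme);
    temporaries[depth++] = name;                /* push                  */
}

static void op(int what)                        /* {op('+');} or {op('*');} */
{
    const char *right = temporaries[--depth];   /* popped                   */
    const char *left  = temporaries[depth - 1]; /* stays: holds the result  */
    char assign[3] = { (char)what, '=', '\0' };

    emit(left, assign, right);
    *--namep = right;                           /* freename()               */
}

static void expr(void);

static void factor(void)        /* factor -> number_or_id {create_tmp} | ( expr ) */
{
    char   lexeme[32];
    size_t n = 0;

    if (*in == '(') {
        ++in;
        expr();
        if (*in == ')')
            ++in;
        return;
    }
    while (isalnum((unsigned char)*in) && n + 1 < sizeof lexeme)
        lexeme[n++] = *in++;
    lexeme[n] = '\0';
    create_tmp(lexeme);
}

static void term_prime(void)    /* term' -> * factor {op('*');} term' | epsilon */
{
    if (*in == '*') { ++in; factor(); op('*'); term_prime(); }
}

static void term(void) { factor(); term_prime(); }

static void expr_prime(void)    /* expr' -> + term {op('+');} expr' | epsilon */
{
    if (*in == '+') { ++in; term(); op('+'); expr_prime(); }
}

static void expr(void) { term(); expr_prime(); }

const char *generate(const char *text)
{
    in = text;  code[0] = '\0';  namep = names;  depth = 0;
    expr();
    return code;
}
```

Calling generate("1+2+3") produces the same five assignments shown earlier in this section, and each op() call sits in the subroutine at exactly the spot where the action symbol sits in the production.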
3.10.2 Attributed Grammars
So far, the concept of an attribute has been used in a limited, informal way. An attribute
is a piece of information that is associated with a grammatical symbol. Tokens all
have at least one attribute: the associated lexeme. They can have other attributes as
well. For example, a NUMBER token could have an integer attribute that was the
number represented by the lexeme. A NAME token could have an attribute that was a
pointer to a symbol-table entry for that name. Attributes of this type are usually
represented with subscripts. For example, the two attributes attached to a NUMBER
could be represented as follows:
    NUMBER("156", 156)
The first attribute is the lexeme, the second is the numeric value of the lexeme.
Nonterminal symbols can have attributes as well. Take, for example, a recursive-
descent parser. You can look at the subroutines that implement nonterminals as symbols
that represent the nonterminal, not the other way around. Looked at in this way, a
subroutine's argument or return value represents a quantum of information that is
attached to the grammatical symbol (to the nonterminal that the subroutine is implementing),
so the subroutine arguments and return values can be viewed as the attributes
of the associated nonterminal symbol.
There are two types of attributes that are attached to nonterminals, and these two
types correspond to the two methods used to pass information around in the two parsers
presented at the end of Chapter One. An inherited attribute is passed down the parse
tree, from a parent to a child, so in a recursive-descent compiler, inherited attributes are
subroutine arguments. A synthesized attribute goes in the other direction, up the tree
from a child to a parent. In a recursive-descent compiler, synthesized attributes are
return values. 1The best way to keep track of which term is which is to think of the actual
English usage of the two words. Children inherit something from their parents, not
the other way around. So, an inherited attribute is one thats passed from the parent to
the child in the parse tree. The code-generation in the previous section was done without
the benefit of attributes. That is, all information was passed between code-generation
actions using global variables rather than attaching the information to the grammatical
elements themselves. Drawing an analogy to recursive-descent parsers, its as if you
didnt use subroutine arguments or return values anywhere in the parser.
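The two directions can be seen side by side in a small recursive-descent routine. This is a sketch under stated assumptions, not one of the book's parsers: list() and max_depth() are hypothetical names, and the grammar is a simple nested-parenthesis grammar chosen only because it is short. The depth argument is an inherited attribute (it travels down from caller to callee), and the return value is a synthesized attribute (it travels back up).

```c
#include <assert.h>
#include <stddef.h>

/* list -> '(' list ')' list | epsilon
 * (the trailing recursive "list" is implemented by the while loop)
 *
 * depth        : inherited attribute, passed down from the caller.
 * return value : synthesized attribute, the deepest nesting level seen,
 *                or -1 if a parenthesis is unmatched.
 */
static int list( const char **p, int depth )
{
    int deepest = depth;

    assert( p != NULL && *p != NULL );
    while( **p == '(' )
    {
        int inner;

        ++*p;                               /* advance past '('          */
        inner = list( p, depth + 1 );       /* pass the attribute down   */
        if( inner < 0 || **p != ')' )
            return -1;
        ++*p;                               /* advance past ')'          */
        if( inner > deepest )
            deepest = inner;                /* pass the result back up   */
    }
    return deepest;
}

/* Returns the maximum nesting depth of s, or -1 if s is not well formed. */
int max_depth( const char *s )
{
    const char *p = s;
    int         d = list( &p, 0 );

    return ( d >= 0 && *p == '\0' ) ? d : -1;
}
```

Both attribute types appear in one routine here purely to show the two directions of flow; the parsers in this chapter stick to one type at a time.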
The attribute mechanism can be extended to grammars in general and is very useful
in designing a compiler. The attributes can help you specify code-generation actions in
the grammar in greater detail than would be possible with an augmented grammar alone.
A grammar to which attributes have been added in this way is called an attributed
grammar. In a typical compiler design, you put attributes into the grammar in a manner
similar to the augmentations used in the previous sections. That is, the first step in the design
process is the creation of an augmented and attributed grammar. It's easier to
understand the concepts, however, if you start with working code, so I'll demonstrate the
process by going backwards from a working parser to an attributed, augmented grammar.
The naive recursive-descent parser in Listing 3.4 implements the following grammar:
    1.   stmt    →  ε
    2.           |  expr ; stmt
    3.   expr    →  term expr'
    4.   expr'   →  + term expr'
    5.           |  ε
    6.   term    →  factor term'
    7.   term'   →  * factor term'
    8.           |  ε
    9.   factor  →  number
    10.          |  ( expr )
This parser is essentially the naive parser developed in Chapter One, where you'll find
the match() and advance() subroutines that comprise the lexical analyzer;
newname() and freename() were discussed earlier. The parser uses inherited
attributes, which are passed down the parse tree from parent to child as subroutine
arguments.
[1] In ALGOL-like languages such as Pascal, an inherited attribute is a subroutine argument that's passed by
value; a synthesized attribute is either passed by reference or is a function's return value.
Listing 3.4. naive.c: Code Generation with Inherited Attributes
    #include <stdio.h>

    /* match(), advance(), the token names, yytext, and yyleng come from the
     * lexical analyzer developed in Chapter One; newname() and freename()
     * were discussed earlier.
     */

    void factor     ( char *t );    /* Prototypes to avoid  */
    void term       ( char *t );    /* forward references   */
    void expr       ( char *t );    /* in the code.         */
    void expr_prime ( char *t );
    void term_prime ( char *t );

    void stmt( void )               /* stmt -> expr SEMI | expr SEMI stmt */
    {
        char *t;

        while( !match( EOI ) )
        {
            expr( t = newname() );
            freename( t );

            if( match( SEMI ) )
                advance();
        }
    }
    /*----------------------------------------------------------------------*/
    void expr( char *t )            /* expr -> term expr' */
    {
        term( t );
        expr_prime( t );
    }
    /*----------------------------------------------------------------------*/
    void expr_prime( char *t )      /* expr' -> PLUS term expr' | epsilon */
    {
        char *t2;

        if( match( PLUS ) )
        {
            advance();

            term( t2 = newname() );

            printf( "    %s += %s\n", t, t2 );
            freename( t2 );

            expr_prime( t );
        }
    }
    /*----------------------------------------------------------------------*/
    void term( char *t )            /* term -> factor term' */
    {
        factor    ( t );
        term_prime( t );
    }
    /*----------------------------------------------------------------------*/
    void term_prime( char *t )      /* term' -> TIMES factor term' | epsilon */
    {
        char *t2;

        if( match( TIMES ) )
        {
            advance();

            factor( t2 = newname() );

            printf( "    %s *= %s\n", t, t2 );
            freename( t2 );

            term_prime( t );
        }
    }
    /*----------------------------------------------------------------------*/
    void factor( char *t )          /* factor -> NUMBER_OR_ID | LP expr RP */
    {
        if( match( NUMBER_OR_ID ) )
        {
            printf( "    %s = %0.*s\n", t, yyleng, yytext );
            advance();
        }
        else if( match( LP ) )
        {
            advance();

            expr( t );

            if( match( RP ) )
                advance();
        }
    }
Now, consider the flow of attributes (the temporary-variable names) through the
parse tree as the expression 1+2*3 is processed. The parser generates the following
output:
    t0 = 1
    t1 = 2
    t2 = 3
    t1 *= t2
    t0 += t1
and the associated parse tree is shown in Figure 3.9. The flow of attributes is shown in
this figure by labeling edges in the graph with the attribute's value. That is, the fact that
stmt() passes t0 to expr() is indicated by labeling the edge between the stmt and
expr nodes with a t0. The attribute is in boldface when the calling function creates that
attribute with a newname() call; otherwise, the attribute came into a function as an
argument and was passed to a child function. You should trace through the first few
subroutine calls in the parse to see what's happening. Remember, each child node in the
tree represents a subroutine call made by the parent, and the attributes are subroutine
arguments, so they flow down the tree from the parent to the child.
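To trace the attribute flow without the Chapter One lexical analyzer at hand, you can use a stand-alone approximation like the one below. It is a sketch, not the book's code: compile(), peek(), and emit() are hypothetical names, the lexical analyzer is reduced to single characters and decimal numbers (identifiers are not handled), the output is collected in a buffer rather than printed, and the fixed pool of names behind newname() and freename() is an assumption about how those routines behave. The parsing subroutines themselves pass t down exactly as described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *in;                  /* current input position          */
static char        out[256];            /* generated code accumulates here */

static const char  *names[] = { "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7" };
static const char **namep   = names;

static const char *newname( void )
{
    assert( namep < names + 8 );        /* expression too complex otherwise */
    return *namep++;
}
static void freename( const char *s ) { *--namep = s; }

static int peek( void )                 /* skip white space, return next character */
{
    while( isspace( (unsigned char)*in ) )
        ++in;
    return (unsigned char)*in;
}

static void emit( const char *dst, const char *op, const char *src, int len )
{
    size_t used = strlen( out );
    snprintf( out + used, sizeof out - used, "%s %s %.*s\n", dst, op, len, src );
}

static void expr( const char *t );

static void factor( const char *t )     /* factor(t) -> number | ( expr(t) ) */
{
    if( isdigit( peek() ) )
    {
        const char *start = in;
        while( isdigit( (unsigned char)*in ) )
            ++in;
        emit( t, "=", start, (int)( in - start ) );
    }
    else if( peek() == '(' )
    {
        ++in;
        expr( t );
        if( peek() == ')' )
            ++in;
    }
}

static void term_prime( const char *t ) /* term'(t) -> * factor(t2) term'(t) | epsilon */
{
    if( peek() == '*' )
    {
        const char *t2;
        ++in;
        factor( t2 = newname() );
        emit( t, "*=", t2, (int)strlen( t2 ) );
        freename( t2 );
        term_prime( t );
    }
}
static void term( const char *t ) { factor( t ); term_prime( t ); }

static void expr_prime( const char *t ) /* expr'(t) -> + term(t2) expr'(t) | epsilon */
{
    if( peek() == '+' )
    {
        const char *t2;
        ++in;
        term( t2 = newname() );
        emit( t, "+=", t2, (int)strlen( t2 ) );
        freename( t2 );
        expr_prime( t );
    }
}
static void expr( const char *t ) { term( t ); expr_prime( t ); }

const char *compile( const char *src )  /* stmt -> expr(t) ; stmt | epsilon */
{
    in = src;  out[0] = '\0';  namep = names;

    while( peek() != '\0' )
    {
        const char *before = in;
        const char *t;

        expr( t = newname() );
        freename( t );
        if( peek() == ';' )
            ++in;
        if( in == before )              /* no progress: give up on bad input */
            break;
    }
    return out;
}
```

Calling compile("1+2*3;") returns the five lines of output listed above, one per line. Every name that shows up in the output reached the routine that printed it as an argument, which is the flow that Figure 3.9 depicts.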
Figure 3.9. The Flow of Inherited Attributes Through a Parse Tree for 1+2*3
You can represent the flow of attributes by borrowing the notation used for
arguments in a programming language. Each attribute has a name, just as subroutine
arguments have names. All attributes that are passed into a production are listed next to the
name on the left-hand side of that production, like a formal argument list in a subroutine
declaration. Those attributes can then be referenced by name by the symbols on the
right-hand side of the production, as if they were used in the subroutine. For example,
say that an attribute, t, represents a temporary variable name, and it is attached to an
expr; it is represented like this in the subroutine representing the expr:
    void expr( char *t )
    {
        /* ... */
    }
and like this in the grammar:
    expr(t) → ...
A t attribute would also be attached to every expr on a right-hand side, as if it were the
argument to a recursive subroutine call:
    stmt → expr(t) ; stmt
If the same attribute appears on both the left- and right-hand sides, then that attribute
comes into the subroutine as an argument, and is passed down to a child subroutine, in
turn. An attribute can also be used internally in an action, as if it were a variable.
Attributes can also be created within an action in a manner analogous to local
variables, and that new attribute can be, in turn, passed to another grammatical symbol (to a
subroutine) as an attribute (as an argument). Note that a leftmost derivation requires the
flow of attributes to be from left to right across the production, just like the flow of
control in a recursive-descent parser that implements the grammar. So, in an LL parser
(which uses a leftmost derivation), inherited attributes can be passed from a left-hand
side to a symbol on the right-hand side, and an attribute can be passed from a code-
generation action to a symbol to its right. Synthesized attributes are used differently, and
are discussed in depth in Chapter Five along with the bottom-up parsers that use this
type of attribute.
An attributed version of our expression grammar is shown in Table 3.5. The attribute
names are taken from the previous recursive-descent compiler. An attribute attached to
a left-hand side is treated like a formal argument in a subroutine declaration: it's the
attribute that is passed to the current production from its parent in the tree. An attribute
on the right-hand side is treated like an argument in a subroutine call; it is passed down
the tree to the indicated production.
Table 3.5. An Attributed Grammar
    1.   stmt       →  ε
    2.              |  expr(t) ; stmt
    3.   expr(t)    →  term(t) expr'(t)
    4.   expr'(t)   →  + term(t2) expr'(t)
    5.              |  ε
    6.   term(t)    →  factor(t) term'(t)
    7.   term'(t)   →  * factor(t2) term'(t)
    8.              |  ε
    9.   factor(t)  →  number_or_id
    10.             |  ( expr(t) )
The attributes by themselves are not much use. You need to augment the grammar to
show how the attributes are manipulated. Remember that the flow of control in a top-
down parser is from left to right across the grammar, so an action can affect only those
symbols to its right and can use only those symbols that are initialized to its left (or
which appear as an attribute of the left-hand side). An augmented, attributed grammar is
shown in Table 3.6. The code-generation actions are taken from the previous recursive-
descent parser.
Table 3.6. An Augmented, Attributed Grammar
    1.   stmt       →  ε
    2.              |  {t=newname();} expr(t) ; stmt
    3.   expr(t)    →  term(t) expr'(t)
    4.   expr'(t)   →  + {t2=newname();} term(t2) {printf("%s+=%s\n",t,t2); freename(t2);} expr'(t)
    5.              |  ε
    6.   term(t)    →  factor(t) term'(t)
    7.   term'(t)   →  * {t2=newname();} factor(t2) {printf("%s*=%s\n",t,t2); freename(t2);} term'(t)
    8.              |  ε
    9.   factor(t)  →  number_or_id {printf("%s=%0.*s\n",t,yyleng,yytext);}
    10.             |  ( expr(t) )
As I said at the beginning of this section, I've demonstrated the process in a topsy-
turvy fashion. Generally, you would create the augmented, attributed grammar first,
and then use that grammar as a specification when coding. The compiler-generation
tools developed in the next two chapters take augmented, attributed grammars as their
input, and output parsers for these grammars, thereby eliminating the necessity of coding
a parser at all. You still have to write the grammar, however, and figure out how to
generate code by adding actions to it in appropriate places.
One final note: It's possible for a single grammar to use both inherited and
synthesized attributes, but it's not a good idea. In a recursive-descent parser (the only kind
of parser in which combined attribute types are really practical) maintenance becomes a
problem because it's more difficult to follow the flow of data as the parser executes.
Other types of parsers, such as the one-pass, table-driven ones discussed in the next two
chapters, can handle only one type of attribute.
3.11 Representing Generic Grammars
As we discuss grammars in the remainder of this book, it will be handy, occasionally,
to discuss generic productions. In these discussions, when I don't say otherwise, the term
symbol, without the word terminal or nonterminal, is used when something can
be either a terminal or nonterminal. As usual, italics are used for nonterminals and
boldface is used for terminals, but I'll use upper-case letters that look like this:
    𝒜 ℬ 𝒞 𝒟 ℰ ℱ 𝒢 ℋ ℐ 𝒥 𝒦 ℒ ℳ 𝒩 𝒪 𝒫 𝒬 ℛ 𝒮 𝒯 𝒰 𝒱 𝒲 𝒳 𝒴 𝒵
to represent an unspecified symbol. A production like this:
    s → 𝒜 ℬ
has two symbols on its right-hand side, and these can be either terminals or nonterminals.
The following production has three symbols on its right-hand side: the first is a
nonterminal, the second is a terminal, and the third can be either a terminal or nonterminal:
    s → a T 𝒜
To confuse matters further, I'll use Greek letters to represent a sequence of symbols.
For example, a generic production can be written like this:
    s → α
s is a nonterminal because it's in italics, and α represents an arbitrary sequence of
terminals and nonterminals. Unless I say otherwise, a Greek letter represents zero or more
symbols, so something like this:
    s → α 𝒜 β
represents all productions that have s on their left-hand side and an 𝒜 somewhere on
their right-hand side. The 𝒜 can be preceded by zero or more symbols (terminals or
nonterminals), and it can be followed by zero or more symbols. A sequence of symbols
represented by a Greek letter can go to ε. That is, it can be replaced in the equation by
an empty string, effectively disappearing from the production. For example, the
production
    s → 𝒜 α ℬ
has to start with an 𝒜 and end with a ℬ, but it can have any number of terminals and
nonterminals (including zero) in between.
If a generic production has several right-hand sides, I'll represent these with
subscripts. For example, in the following generic grammar, the nonterminal s has n different
right-hand sides, one of which could be ε:
    s → α₁
      | α₂
      ...
      | αₙ
By the same token (so to speak), the following represents n different terminal
symbols: T₁, T₂, ... Tₙ.
All the foregoing may seem abstruse, but it will sink in once you've used it a few
times. I'll call out exactly what's going on if the generic production gets too
complicated.
3.12 Exercises
3.1. Show a step-by-step leftmost derivation of the expression
    1 + 2 * ( (3+4) + 5) + 6;
using the following grammar:
    1.   statements  →  ε
    2.               |  expr ; statements
    3.   expr        →  term expr'
    4.   expr'       →  + term expr'
    5.               |  ε
    6.   term        →  factor term'
    7.   term'       →  * factor term'
    8.               |  ε
    9.   factor      →  number
    10.              |  ( expr )
3.2. Build a recursive-descent parser that implements the augmented grammar in Table
3.6 on page 191.
3.3. The following regular expression recognizes a series of binary digits, with the
pattern 000 used to terminate the series:
    (0|1)*000
Translate that expression into a right-linear grammar.
3.4. Assuming that each character in the ASCII character set forms a token, create a
grammar that recognizes all C integer and character constants. All of the
following should be recognized by your grammar:
    0xabcd 0123 45 'a' '\t' '\x0a' '\123'
3.5. In LISP, a simple expression is formed as follows: (+ a b) represents a+b,
(+ a (* b c)) represents a+b*c, and so forth. Write a grammar that
recognizes all such LISP expressions, with an arbitrary level of parentheses nesting.
The +, -, *, and / operators must be supported; + and - should have lower
precedence than * and /.
3.6. Write an expression grammar that supports variable names, an assignment
operator, a plus operator, and array references. Array references are done in a
FORTRAN-like fashion: a three-dimensional array element can be accessed
using the notation a[x,y,z]. x is the minor axis, so the list of array indexes
should be processed from right to left. Similarly, an expression like a = b[2] =
c[1,2] should be recognized, and the assignment operator should associate right to
left. Addition associates left to right, however. The order of precedence should
be brackets (highest precedence), then addition, then assignment.
3.7. Write a grammar that recognizes Pascal subroutine declarations. Draw the parse
tree that results from a top-down parse of the following declaration:
    function subr( var arg1: string; x: integer; s: string ):
    integer;
3.8. Write a grammar for 8086 assembly language (or whatever assembly language
you know).
3.9. Write a grammar that describes the input language used by your calculator. Every
key on the calculator represents a token in the language.
3.10. Listing 3.5 shows a specification for a DFA transition matrix. The state
statements define the rows of the matrix; each goto statement controls the contents of
a column in that row. The accept statements say that the current state is an
accepting state, and the code that follows is to be executed when the accepting
action occurs. The outer braces are part of the accept statement itself, but the
included code must also be able to have braces in it. Unspecified transitions are
filled with the number specified in the error statement.
Listing 3.5. A State Machine Specification
    rows    4;      /* Number of rows in the table.                        */
    columns 128;    /* Number of columns in the table.                     */

    error   -1;     /* Use this value for all unspecified table elements.  */
    state 0
    {
        goto 0 on 'a';
        goto 1 on 'b';
        goto 2 on 'c';
        goto 2 on 100;      /* = 0x64 = 'd'.                               */
        loop   on 0x65;     /* 0x65 = 'e', same as "goto <current state>." */
        accept              /* Define an accepting action for this state.  */
        {
            execute_this_code();
        }
    }

    state 1 { goto 2 on b; }                    /* Not accepting. */
    state 2 { goto 3 on c; }                    /* Not accepting. */
    state 3 { accept{ hi_there("BOO!"); } }
(a) Write a grammar for the language just described.
(b) Augment the grammar with sufficient code-generation actions and attributes
that a parser could be coded that reads in a state-machine specification, and
which outputs C source code, which, when compiled, implements those
tables. Two array declarations and a subroutine should be output. The first
array should be the transition matrix itself; the second array should be
indexed by state number and should evaluate to 1 if that state is an accepting
state. The subroutine should contain a switch statement which, when
passed the state number of an accepting state, executes the associated code.
(c) Implement the grammar developed in (b).
3.11. Modify your solution to the foregoing exercise so that you can give symbolic
names to a state and then use those symbolic names in goto statements. Names
are declared implicitly by using them instead of a number in a state statement.
The compiler must assign state numbers in this case, and you should make sure
that a state number supplied by the compiler doesn't conflict with one supplied in
a state statement. Forward references should be permitted.
4 Top-Down Parsing
This chapter develops the idea of recursive-descent parsing by discussing ways to do
the same thing in a more maintainable, table-driven fashion. The chapter includes a
description of top-down parsing techniques and an in-depth discussion of LL grammars,
including techniques for modifying grammars to be LL(1). A yacc-like utility called
LLama, which translates an augmented, attributed LL(1) grammar into a parser, is
constructed. If you intend to read the implementation parts of this chapter, you should
read Appendix E, which contains a user's manual for LLama and occs, before
continuing.
One thing I'm not doing here is presenting an extended example of how to use
LLama. I've left this out because LLama is, in itself, not nearly as useful a tool as is
occs. The main reason this chapter is here, in fact, is to present several concepts and
procedures that are used later to construct occs (but in a somewhat simpler context than occs
itself). That is, LLama is just a step on the way to occs. I do present a short example of
using LLama in that I use LLama to rewrite its own parser. Also, a significant part of
the code used by LLama is also used by occs, and the common code is presented only in
the current chapter.
4.1 Push-Down Automata*
We saw in Chapter Three that certain right-linear grammars can be converted
directly to DFAs. If you can represent a programming language in this way, you can
implement a parser for it as a state machine with a code-generation action associated
with each state. Consider the following simple grammar that recognizes expressions of
the form number + number:
    E  →  number A
    A  →  + B
    B  →  number C
    C  →  ε
* As in Chapter Two, asterisks mark sections containing theoretical material.
This grammar satisfies the conditions necessary for implementation as a DFA: it's right
linear, and every production is of the form a → ε or b → X c, where a, b, and c are
nonterminals and X is a terminal. Consequently, you can represent this grammar with the state
machine in Figure 4.1 and can parse that grammar using the state machine.
Figure 4.1. Using a State Machine for Parsing
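One way to realize such a machine in C is a transition table with one row per nonterminal. The sketch below is an illustration under assumptions, not a transcription of Figure 4.1: the token codes, the state names, and dfa_accepts() are hypothetical. Each production of the form b → X c becomes a transition from state b to state c on X, and C → ε makes C the accepting state.

```c
#include <assert.h>
#include <stddef.h>

enum token { NUMBER, PLUS, END_OF_INPUT };      /* hypothetical token codes */
enum state { S_E, S_A, S_B, S_C, N_STATES, S_ERR = -1 };

static const int next_state[ N_STATES ][ 2 ] =
{
    /*             NUMBER   PLUS   */
    /* E */   {    S_A,     S_ERR  },     /* E -> number A */
    /* A */   {    S_ERR,   S_B    },     /* A -> + B      */
    /* B */   {    S_C,     S_ERR  },     /* B -> number C */
    /* C */   {    S_ERR,   S_ERR  }      /* C -> epsilon  */
};

/* tok points at an array of tokens terminated by END_OF_INPUT. */
int dfa_accepts( const enum token *tok )
{
    int state = S_E;

    assert( tok != NULL );
    for( ; *tok != END_OF_INPUT; ++tok )
    {
        state = next_state[ state ][ *tok ];
        if( state == S_ERR )
            return 0;
    }
    return state == S_C;    /* C -> epsilon makes C the accepting state */
}
```

A code-generation action could be attached to each transition by storing a function pointer alongside the next-state number.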
The problem with a straightforward state-machine implementation of a real grammar
is that a simple state machine can't count; or to be more precise, it can count only by
adding additional states. Of course, most grammars can't count beyond 1 either. For
example, the only way to get one, two, or three of something is with:
    s → stooge | stooge stooge | stooge stooge stooge
Nonetheless, a grammar can handle nested structures, such as parentheses, and keep
track of the nesting level without difficulty, and a simple DFA cannot. For example, the
following grammar recognizes properly nested lists of parentheses:
    list   →  plist | plist list
    plist  →  ( list ) | ε
The grammar accepts input like the following: (()(())), but it rejects expressions
without properly nested parentheses. This grammar can't be parsed with a state machine
alone, precisely because the state machine can't count. A straightforward state machine
that recognizes nested parentheses is shown in Figure 4.2.
Figure 4.2. A State Machine to Recognize Nested Parentheses
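The limitation is easy to see in code. The following sketch assumes a machine with one state per nesting level, in the spirit of the one just described; the table layout and the name dfa_parens() are hypothetical. Nothing in it can represent a fourth level of nesting, because there is no state to move to.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 3     /* one state per nesting level, plus state 0 */

/* The state number is the current nesting level; -1 marks an error. */
static const int on_open [ MAX_DEPTH + 1 ] = {  1, 2, 3, -1 };
static const int on_close[ MAX_DEPTH + 1 ] = { -1, 0, 1,  2 };

int dfa_parens( const char *s )
{
    int state = 0;

    assert( s != NULL );
    for( ; *s != '\0'; ++s )
    {
        if     ( *s == '(' ) state = on_open [ state ];
        else if( *s == ')' ) state = on_close[ state ];
        else                 state = -1;

        if( state < 0 )
            return 0;
    }
    return state == 0;      /* every open parenthesis has been closed */
}
```

The input (()(())) is accepted, but ((((())))) is rejected even though it is properly nested: supporting it means adding entries to both tables.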
This machine works fine as long as you don't have more than three levels of nesting.
Each additional nesting level requires an additional state tacked on to the right edge of
the machine, however. You don't want to limit the nesting level in the grammar itself,
and it's not a great idea to modify the state machine on the fly by adding extra states as
open parentheses are encountered. Fortunately, the situation can be rectified by using a
state machine augmented with a stack. This state-machine/stack combination is called a
push-down automaton or PDA. Most table-driven parsers are push-down automata.
They are driven by a state machine and use a stack to keep track of the progress of that
state machine.
A push-down automaton that recognizes nested parentheses is shown in Figure 4.3.
The associated state table is in Table 4.1. A PDA is different from a normal state
machine in several ways. First of all, the stack is an integral part of the machine. The
number at the top of stack is the current state number. Secondly, the contents of the state
table are not next-state numbers; rather, they are actions to perform, given a current state
(on the top of stack) and input symbol. Four types of actions are supported:
    accept   A sentence in the input grammar has been recognized.
    error    A syntax error has been detected in the input.
    push N   Push N onto the stack, effectively changing the current state to N.
    pop      Pop one item from the stack, changing the current state to whatever state
             number is uncovered by the pop.
The input is advanced with each push or pop.
Figure 4.3. A Push-Down Automaton to Recognize Nested Parentheses
Table 4.1. State Table for PDA in Figure 4.3
                 Input Symbol
    State    (         )         ⊢
    0        push 1    error     accept
    1        push 1    pop       error
The following algorithm is used to parse an input sentence:
    Push the number of the start state.
    while( (action = state_table[ top_of_stack_symbol ][ input_symbol ]) != accept )
    {
        if( action == error )
            reject;
        else
            do the indicated action.
    }
The parse stack and input stream for a parse of (()(())) is summarized in Table 4.2.
This PDA is using the stack as a counting device. The 0 marks the bottom of the stack,
and the 1 is used as a nesting-level marker. The marker is pushed every time an open
parenthesis is encountered and popped when the matching close parenthesis is found.
The machine accepts an input string only if the start-state marker is on the stack when
end of input is encountered.
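The state table and the parsing loop translate almost directly into C. This is a minimal sketch under assumptions rather than a definitive implementation: the enumeration names, the fixed STACK_SIZE limit, and pda_accepts() are hypothetical, and the end-of-input marker is represented by the string's terminating NUL.

```c
#include <assert.h>
#include <stddef.h>

enum action { ERROR, ACCEPT, POP, PUSH_1 };
enum input  { LP, RP, END_MARK, N_INPUTS };

static const enum action state_table[ 2 ][ N_INPUTS ] =
{
    /*             (         )       end of input */
    /* 0 */   {  PUSH_1,   ERROR,   ACCEPT  },
    /* 1 */   {  PUSH_1,   POP,     ERROR   }
};

#define STACK_SIZE 64   /* arbitrary limit on nesting depth for this sketch */

int pda_accepts( const char *s )
{
    int stack[ STACK_SIZE ];
    int sp = 0;

    assert( s != NULL );
    stack[ sp ] = 0;                                /* push the start state */

    for( ;; )
    {
        enum input sym;

        if     ( *s == '('  ) sym = LP;
        else if( *s == ')'  ) sym = RP;
        else if( *s == '\0' ) sym = END_MARK;
        else                  return 0;             /* not in the alphabet  */

        switch( state_table[ stack[ sp ] ][ sym ] )
        {
        case ACCEPT: return 1;
        case ERROR:  return 0;
        case PUSH_1: if( sp + 1 >= STACK_SIZE )
                         return 0;                  /* nested too deeply    */
                     stack[ ++sp ] = 1;
                     ++s;                           /* advance the input    */
                     break;
        case POP:    --sp;                          /* uncover the previous state */
                     ++s;
                     break;
        }
    }
}
```

A pop can occur only in state 1, and state 1 is never at the bottom of the stack, so the start-state marker is never popped. Raising the nesting limit means changing STACK_SIZE; the table stays the same.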
Table 4.2. A Parse of (()(()))
    Parse Stack    Input         Next Action
    0              (()(()))⊢     push 1 and advance
    0 1            ()(()))⊢      push 1 and advance
    0 1 1          )(()))⊢       pop and advance
    0 1            (()))⊢        push 1 and advance
    0 1 1          ()))⊢         push 1 and advance
    0 1 1 1        )))⊢          pop and advance
    0 1 1          ))⊢           pop and advance
    0 1            )⊢            pop and advance
    0              ⊢             accept
This machine has several advantages over the previous one. Not only can it handle
any level of parenthesis nesting (or, at least, the only limitation in that regard is the depth of
the parse stack), but it has many fewer states than would a straight DFA for a reasonable
nesting level. We'll look at how to use push-down automata in more sophisticated ways
in a moment.
4.1.1 Recursive-Descent Parsers as Push-Down Automata*
It's instructive to see how the foregoing applies to a recursive-descent parser. Most
programming languages use a run-time stack to control subroutine calls and returns. The
simplest possible situation is an assembly-language, subroutine-call instruction like
JSR addr, which pushes the address of the instruction following the JSR onto the
run-time stack, and then starts executing instructions at addr. A matching RET instruction
pops an address off the stack and starts executing code at that address. In this simple
case, the stack frame consists of nothing but the return address.
A programming language that supports recursion must use the stack more
extensively. The actual details are discussed in depth in Chapter Six, but for now, suffice it to
say that space for subroutine arguments and automatic local variables is allocated from
the stack as part of the subroutine-call process. A C subroutine call, for example, usually
involves the following sequence of operations:
(1) Push the arguments onto the stack.
(2) Call the subroutine.
(3) Allocate space for local variables by decrementing the stack pointer by some
constant (I'm assuming that the stack grows towards low memory, as is usually the
case).
Everything is done in reverse order when the subroutine returns, except that the
arguments are discarded rather than being popped.
This entire block (arguments, return address, and local variables) comprises the
subroutine's stack frame. Figure 4.4 shows the various stack frames as they're created
and destroyed as the naive recursive-descent parser in Chapter One processes the
expression 1+2 using the grammar in Table 4.3. Each new stack frame represents a subroutine
call, and when the stack frame disappears, the subroutine has returned. The input is
shown on the right as it is gradually absorbed. In terms of compilers, the entire stack
frame represents a nonterminal (the left-hand side of some production), and the local
variables and argument components of the stack frame are the attributes associated with
that nonterminal.
Table 4.3. An LL(1) Expression Grammar
    1.   stmt    →  ε
    2.           |  expr ; stmt
    3.   expr    →  term expr'
    4.   expr'   →  + term expr'
    5.           |  ε
    6.   term    →  factor term'
    7.   term'   →  * factor term'
    8.           |  ε
    9.   factor  →  number_or_id
    10.          |  ( expr )
Figure 4.4. The Parser's Stack Frames
    statements
    statements expression
    statements expression term
    statements expression term factor
    statements expression term
    statements expression expr_prime
    statements expression expr_prime term
    statements expression expr_prime term factor
    statements expression expr_prime term
    statements expression expr_prime expr_prime
    statements
The question to ask is: what are these stack frames really doing as the parse
progresses? They're doing two things. First, they're keeping track of the current
position on the parse tree. The stack always holds the path from the root down to the current
node on the tree. You can traverse up to a parent by popping one stack element. Second,
the stack keeps track of attributes; since a stack frame represents a subroutine call, it also
represents a nonterminal symbol in the parse tree. If you view a subroutine's arguments
and local variables as attributes of the nonterminal, then the portion of the stack frame
that holds these objects represents the attributes. The same nonterminal symbol can be
on the stack in several places, but since each instance of the associated subroutine has its
own stack frame (with its own local variables and arguments), each instance of the
associated nonterminal can have a unique set of attributes attached to it. For example,
when expr_prime() calls itself recursively (in the fifth stack from the bottom of
Figure 4.4), there are two stack frames, each with a unique set of local variables and
arguments.
You can make the transition from recursive descent to a table-driven parser by
realizing that neither the recursion nor the subroutines themselves are actually necessary,
provided that you can simulate the recursive-descent parser's stack frames using a local
stack. That is, instead of using the implicit, run-time stack to keep track of the attributes
and the position in the parse, you can maintain an explicit stack that does the same thing.
That explicit stack can then be controlled by a nonrecursive subroutine and a table. For
example, productions like these:
    factor  →  MINUS unop
    unop    →  NUMBER
            |  IDENTIFIER
can be implemented in a recursive-descent parser as follows:
    void unop( void );

    void factor( void )
    {
        if( match( MINUS ) ) { advance(); unop(); }
        else                 { error();           }
    }

    void unop( void )
    {
        if     ( match( NUMBER     ) ) advance();
        else if( match( IDENTIFIER ) ) advance();
        else                           error();
    }
And these subroutines can be simulated as follows:
    #define factor 1
    #define unop   2

    push( factor );                     /* Push the goal symbol */
    while( stack_not_empty() )
    {
        switch( top_of_stack_symbol )
        {
        case factor: pop();             /* Pop first, so that the pop doesn't */
                                        /* discard a newly pushed symbol.     */
                     if( match( MINUS ) ) { advance(); push( unop ); }
                     else                 { error();                 }
                     break;

        case unop:   pop();
                     if     ( match( NUMBER     ) ) advance();
                     else if( match( IDENTIFIER ) ) advance();
                     else                           error();
                     break;
        }
    }
Rather than calling a subroutine, a symbol representing that subroutine is pushed onto a
local stack. The pops replace the return statements. Since the foregoing loop is, in
effect, a DFA, you can translate it to a table with little difficulty.
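If you want to experiment with the simulation, the following self-contained variant can be compiled as is. It is a sketch under assumptions: the token codes, STACK_SIZE, and the name simulate() are hypothetical, the input is an array of token codes rather than the output of a lexical analyzer, and an error simply makes the function return 0.

```c
#include <assert.h>
#include <stddef.h>

enum { MINUS, NUMBER, IDENTIFIER, END_OF_INPUT,     /* tokens       */
       FACTOR, UNOP };                              /* nonterminals */

#define STACK_SIZE 8

/* tok points at an array of token codes terminated by END_OF_INPUT. */
int simulate( const int *tok )
{
    int stack[ STACK_SIZE ];
    int sp = 0;

    assert( tok != NULL );
    stack[ sp++ ] = FACTOR;                 /* push the goal symbol */

    while( sp > 0 )
    {
        switch( stack[ --sp ] )             /* pop the symbol being processed */
        {
        case FACTOR:
            if( *tok != MINUS )
                return 0;                   /* takes the place of error()     */
            ++tok;                          /* takes the place of advance()   */
            assert( sp < STACK_SIZE );
            stack[ sp++ ] = UNOP;           /* takes the place of calling unop() */
            break;

        case UNOP:
            if( *tok != NUMBER && *tok != IDENTIFIER )
                return 0;
            ++tok;
            break;
        }
    }
    return *tok == END_OF_INPUT;            /* all input must be consumed */
}
```

The explicit stack never holds more than one symbol for this tiny grammar, but the same loop works unchanged for grammars whose right-hand sides push several symbols.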
4.2 Using a PDA for a Top-Down Parse*
A top-down parse can be implemented using a grammar, a stack, and the following
algorithm. The parse stack is initialized by pushing the goal symbol (the topmost
production in the grammar).
(0) If the parse stack is empty, the parse is complete.
(1) If the item at top of stack is a nonterminal, replace it with its right-hand side,
pushing the symbols in reverse order. For example, if you have a production a → b c d,
and a is at the top of stack, pop the a and push the right-hand side in reverse order:
first the d, then the c, and then the b. In the case of an ε production (a production
with ε as its right-hand side), an item is popped, but nothing is pushed in its place.
Goto (0).
(2) Otherwise, if the item at top of stack is a terminal symbol, that symbol must also be
the current lookahead symbol. If it's not, there's a syntax error; otherwise, pop the
terminal symbol and advance the input. Goto (0).
Note that step (2), above, requires certain symbols to be in the input at specific times
because tokens at the top of stack must match the current input symbol. The parser is
effectively predicting what the next input token will be. For this reason, the parser just
described is often called a predictive parser. Note that only certain classes of grammars
(which I'll discuss shortly) can be parsed by predictive parsers. Table 4.4 shows the
parse stack as 1+2; is processed using the grammar in Table 4.3 on page 199 and a
predictive parser.
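Steps (0) through (2) can be sketched in C for the grammar in Table 4.3. Treat this as an illustration under assumptions rather than the parser developed later in the chapter: the symbol names, the character-level lookahead() and advance() routines, and parse() are hypothetical, and the choice of which right-hand side to apply is hand-coded in expand() by looking at the lookahead symbol instead of being read from a table.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum sym { NUM, PLUS, TIMES, LP, RP, SEMI, END, BAD,            /* terminals    */
           STMT, EXPR, EXPR_P, TERM, TERM_P, FACTOR };          /* nonterminals */

#define STACK_SIZE 128
static enum sym    stack[ STACK_SIZE ];
static int         sp;
static const char *in;

static void push( enum sym s ) { assert( sp < STACK_SIZE ); stack[ sp++ ] = s; }

static enum sym lookahead( void )
{
    while( *in == ' ' )
        ++in;
    if( isalnum( (unsigned char)*in ) )
        return NUM;                     /* a number or an identifier */
    switch( *in )
    {
    case '+':  return PLUS;
    case '*':  return TIMES;
    case '(':  return LP;
    case ')':  return RP;
    case ';':  return SEMI;
    case '\0': return END;
    default:   return BAD;
    }
}

static void advance( void )             /* call only right after lookahead() */
{
    if( isalnum( (unsigned char)*in ) )
        while( isalnum( (unsigned char)*in ) )
            ++in;
    else if( *in != '\0' )
        ++in;
}

/* Replace nonterminal nt by a right-hand side, pushed in reverse order.
 * Pushing nothing applies an epsilon production.
 * Returns 0 if no production applies for lookahead la.
 */
static int expand( enum sym nt, enum sym la )
{
    switch( nt )
    {
    case STMT:   if( la != END ) { push(STMT); push(SEMI); push(EXPR); }
                 return 1;
    case EXPR:   push(EXPR_P); push(TERM);
                 return 1;
    case EXPR_P: if( la == PLUS ) { push(EXPR_P); push(TERM); push(PLUS); }
                 return 1;
    case TERM:   push(TERM_P); push(FACTOR);
                 return 1;
    case TERM_P: if( la == TIMES ) { push(TERM_P); push(FACTOR); push(TIMES); }
                 return 1;
    case FACTOR: if( la == NUM ) { push(NUM); return 1; }
                 if( la == LP  ) { push(RP); push(EXPR); push(LP); return 1; }
                 return 0;
    default:     return 0;
    }
}

int parse( const char *src )            /* returns 1 if src is a sentence */
{
    in = src;
    sp = 0;
    push( STMT );                       /* push the goal symbol */

    while( sp > 0 )                     /* step (0) */
    {
        enum sym top = stack[ --sp ];
        enum sym la  = lookahead();

        if( top >= STMT )               /* step (1): nonterminal */
        {
            if( !expand( top, la ) )
                return 0;
        }
        else if( top == la )            /* step (2): terminal matches */
            advance();
        else
            return 0;                   /* syntax error */
    }
    return lookahead() == END;
}
```

Calling parse("1+2;") walks through the same sequence of stack configurations that Table 4.4 shows.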
4.3 Error Recovery in a Top-Down Parser*
One of the advantages of top-down parsers is that effective error recovery is easy to
implement. The basic strategy makes use of a set of tokens called a synchronization set.
Symbols in this set are called synchronization tokens. Members of the synchronization
set are typically symbols that can end blocks of code.
A syntax error occurs in a predictive parser when a token is at top of stack and that
same token is not the current lookahead character. The synchronization set is used to
recover from the error as follows:
(1) Pop items off the parse stack until a member of the synchronization set is at top of
stack. Error recovery is not possible if no such item is on the stack.
(2) Read input symbols until the current lookahead symbol matches the symbol at top
of stack or you reach end of input.
(3) If you are at end of input, error recovery failed; otherwise, you have recovered.
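The three recovery steps can be sketched as a single C function. This is only an illustration under assumptions: the token codes, in_set(), and recover() are hypothetical names, the parse stack is a plain array of symbol codes, and the input is an array of tokens terminated by END_OF_INPUT.

```c
#include <assert.h>
#include <stddef.h>

enum { END_OF_INPUT, NUM, PLUS, SEMI, RP,      /* tokens (hypothetical codes) */
       STMT, EXPR_P, TERM };                   /* nonterminals                */

static int in_set( int sym, const int *set, size_t n )
{
    size_t i;

    for( i = 0; i < n; ++i )
        if( set[ i ] == sym )
            return 1;
    return 0;
}

/* stack[0] .. stack[*sp - 1] is the parse stack, with the top at *sp - 1.
 * *look points at the current lookahead in an END_OF_INPUT-terminated array.
 * Returns 1 if the parser can resume, 0 if recovery failed.
 */
int recover( const int *stack, int *sp, const int **look,
             const int *sync, size_t n_sync )
{
    assert( stack && sp && look && *look && sync );

    /* (1) Pop until a synchronization token is at top of stack. */
    while( *sp > 0 && !in_set( stack[ *sp - 1 ], sync, n_sync ) )
        --*sp;
    if( *sp == 0 )
        return 0;

    /* (2) Skip input until the lookahead matches the top of stack. */
    while( **look != END_OF_INPUT && **look != stack[ *sp - 1 ] )
        ++*look;

    /* (3) Reaching end of input means that recovery failed. */
    return **look != END_OF_INPUT;
}
```

After a successful call, the top of stack and the lookahead are the same synchronization token, so the normal parsing loop can pop it, advance, and carry on.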
An alternate method, which is harder to implement but more robust, examines the
parse stack and creates a list of those synchronization symbols that are on the stack when
the error occurred. It then reads the input until any of these symbols is encountered,
and pops items off the stack until the same symbol is at top of stack.
Ideally, every production should have its own set of synchronization symbols, which
are changed when the production is activated. To do this properly, you must have a stack
of synchronization sets, and replacing a nonterminal with its right-hand side involves the
following operations:
(1) Push the associated synchronization set onto the synchronization-set stack.
(2) Pop the nonterminal from the parse stack.
(3) Push a special pop marker onto the parse stack.
(4) Push the right-hand side of the nonterminal onto the parse stack in reverse order.
Table 4.4. A Top-Down Parse of 1+2;

Parse Stack                    Input    Comments
stmt                           1+2;⊢    Apply stmt→expr ; stmt
                               1+2;⊢
stmt                           1+2;⊢    (This stmt is the one on the right of stmt→expr ; stmt)
stmt ;                         1+2;⊢
stmt ; expr                    1+2;⊢    Apply expr→term expr'
stmt ;                         1+2;⊢
stmt ; expr'                   1+2;⊢
stmt ; expr' term              1+2;⊢    Apply term→factor term'
stmt ; expr'                   1+2;⊢
stmt ; expr' term'             1+2;⊢
stmt ; expr' term' factor      1+2;⊢    Apply factor→num_or_id
stmt ; expr' term'             1+2;⊢
stmt ; expr' term' num_or_id   1+2;⊢    TOS symbol matches lookahead, pop and advance
stmt ; expr' term'             +2;⊢     Apply term'→ε
stmt ; expr'                   +2;⊢     Apply expr'→+ term expr'
stmt ;                         +2;⊢
stmt ; expr'                   +2;⊢
stmt ; expr' term              +2;⊢
stmt ; expr' term +            +2;⊢     TOS symbol matches lookahead, pop and advance
stmt ; expr' term              2;⊢      Apply term→factor term'
stmt ; expr'                   2;⊢
stmt ; expr' term'             2;⊢
stmt ; expr' term' factor      2;⊢      Apply factor→num_or_id
stmt ; expr' term'             2;⊢
stmt ; expr' term' num_or_id   2;⊢      TOS symbol matches lookahead, pop and advance
stmt ; expr' term'             ;⊢       Apply term'→ε
stmt ; expr'                   ;⊢       Apply expr'→ε
stmt ;                         ;⊢       TOS symbol matches lookahead, pop and advance
stmt                           ⊢        Apply stmt→ε
                               ⊢        Done
When a pop marker is found on the parse stack, one set is popped from the
synchronization-set stack. The set at the top of the synchronization-set stack is used
when an error is encountered.
In practice, a synchronization-set stack is usually not worth the effort to implement
because most synchronization sets have the same tokens in them. It's usually adequate
to have one synchronization set that can work in most situations and is used universally
whenever an error is encountered. Taking C as an example, a good choice of synchronization
tokens would be a semicolon, comma, close brace, and close parenthesis. In other
words, symbols that end commonly occurring phrases in the input language are good
choices.
4.4 Augmented Grammars and Table-Driven Parsers*
Though the parser just described is nice, it's not very useful because there's no provision
for code generation. It's like the initial stab at a recursive-descent parser back in
Chapter One, before the code generation had been added to the actions. Fortunately this
omission is easy to rectify.
As was discussed in Chapter Three, an augmented grammar is one in which code-
generation actions are inserted directly into the grammar. An augmented version of the
previous grammar is shown in Table 4.5. (It's the augmented grammar from the last
chapter.) The augmented grammar is parsed just like the unaugmented one. The actions
are just pushed onto the parse stack along with other symbols. When an action is at the
top of the stack, it's executed.
Table 4.5. An Augmented LL(1) Expression Grammar

1. statements   →  ⊢
2.              |  expression ; statements
3. expression   →  term expression'
4. expression'  →  + term {op('+');} expression'
5.              |  ε
6. term         →  factor term'
7. term'        →  * factor {op('*');} term'
8.              |  ε
9. factor       →  number_or_id {create_tmp(yytext);}
The action subroutines that do the code generation (op() and create_tmp())
were discussed in Chapter Three, but to summarize:
The create_tmp() subroutine generates code to copy the operand into an
anonymous temporary variable, and it pushes the name onto a local Temporaries
stack.
The op() subroutine performs the operation on the two previously generated temporary
variables, popping their names off the stack, generating code to perform the
operation, and then pushing the name of the temporary that holds the result.
A sample parse of 1+2 is shown in Table 4.6. Note that the op() call follows the term
in expr'→+ term {op('+');} expr'. Since the temporaries are generated indirectly by
the terms, two names will exist on the Temporaries stack when the op() call is executed.
4.4.1 Implementing Attributed Grammars in a PDA*
The previous example generated code without benefit of attributes. It is possible to
use attributes in a PDA, however. A top-down parser such as the one we're looking at
uses inherited attributes (the equivalent of subroutine arguments) exclusively. It's very
difficult to represent return values (synthesized attributes) in this kind of parser, though
the situation is reversed in the bottom-up parsers discussed in the next chapter.
We'll support attributes by introducing a second stack, called an attribute or value
stack, which simulates that part of a subroutine's stack frame that holds the subroutine
arguments. This stack is a stack of structures, one field of which is used to hold the attribute
associated with a production's left-hand side, and the other field of which is used
to hold a symbol's own attribute. Every time a symbol is pushed onto the normal parse
stack as part of a replacement operation, the attribute attached to the left-hand side that's
being replaced is pushed onto the value stack (it's copied to both fields of the structure).
An action is executed when it's at the top of the stack, and all symbols to the right of
the action in the production are still on the parse stack when the execution occurs. For
example, when the action() in the following production is executed:
Table 4.6. A Top-Down Parse of 1+2; Using the Augmented Grammar

Parse Stack                                             Input    Comments
stmt                                                    1+2;⊢    Apply stmt→expr ; stmt
                                                        1+2;⊢
stmt                                                    1+2;⊢
stmt ;                                                  1+2;⊢
stmt ; expr                                             1+2;⊢    Apply expr→term expr'
stmt ;                                                  1+2;⊢
stmt ; expr'                                            1+2;⊢
stmt ; expr' term                                       1+2;⊢    Apply term→factor term'
stmt ; expr'                                            1+2;⊢
stmt ; expr' term'                                      1+2;⊢
stmt ; expr' term' factor                               1+2;⊢    Apply factor→n {create_tmp(yytext);}
stmt ; expr' term'                                      1+2;⊢
stmt ; expr' term' {create_tmp(yytext);}                1+2;⊢
stmt ; expr' term' {create_tmp(yytext);} n              1+2;⊢    TOS symbol matches lookahead, pop and advance
stmt ; expr' term' {create_tmp(yytext);}                +2;⊢     Generate t0=1;
stmt ; expr' term'                                      +2;⊢     Apply term'→ε
stmt ; expr'                                            +2;⊢     Apply expr'→+ term {op('+');} expr'
stmt ;                                                  +2;⊢
stmt ; expr'                                            +2;⊢
stmt ; expr' {op('+');}                                 +2;⊢
stmt ; expr' {op('+');} term                            +2;⊢
stmt ; expr' {op('+');} term +                          +2;⊢     TOS symbol matches lookahead, pop and advance
stmt ; expr' {op('+');} term                            2;⊢      Apply term→factor term'
stmt ; expr' {op('+');}                                 2;⊢
stmt ; expr' {op('+');} term'                           2;⊢
stmt ; expr' {op('+');} term' factor                    2;⊢      Apply factor→n {create_tmp(yytext);}
stmt ; expr' {op('+');} term'                           2;⊢
stmt ; expr' {op('+');} term' {create_tmp(yytext);}     2;⊢
stmt ; expr' {op('+');} term' {create_tmp(yytext);} n   2;⊢      TOS symbol matches lookahead, pop and advance
stmt ; expr' {op('+');} term' {create_tmp(yytext);}     ;⊢       Generate t1=2;
stmt ; expr' {op('+');} term'                           ;⊢       Apply term'→ε
stmt ; expr' {op('+');}                                 ;⊢       Generate t0+=t1;
stmt ; expr'                                            ;⊢       Apply expr'→ε
stmt ;                                                  ;⊢       TOS symbol matches lookahead, pop and advance
stmt                                                    ⊢        Apply stmt→ε
                                                        ⊢        Done

num_or_id is abbreviated as n so that the table can fit onto the page.
    a b {action();} c d
The action symbol itself is at the top of the stack, c is directly underneath it, and d is
under the c. The attribute for a symbol can be modified from within an action by modifying
the right-hand-side field of an attribute-stack structure. The distance from the top
of stack to the required structure is the same as the distance in the grammar from the
action to the required symbol. In the earlier example, d is two symbols to the right of the
{action()} and its attributes are two symbols beneath the {action()}'s attributes
on the value stack. If that action wants to modify the attribute for d, it need only modify
the attribute at offset 2 from the top of the attribute stack.¹ The value stack can be implemented
using the modified top-down parse algorithm in Table 4.7.
1. It's also possible to keep track of attributes on the parse stack itself rather than on an auxiliary value stack.
See [Lewis], pp. 311-337. This method is difficult to implement, so it is not discussed here.
Table 4.7. Top-Down Parse Algorithm with Attributes

Data Structures: A parse stack of ints.
                 A value stack of the following structures:

                     typedef struct
                     {
                         YYSTYPE left;
                         YYSTYPE right;
                     }
                     yyvstype;

                 YYSTYPE is an arbitrary type; it is a character pointer in the current example.
                 A variable, lhs, of type YYSTYPE.
Initially:       Push the start symbol on the parse stack.
                 Push garbage onto the value stack.
(0) If the parse stack is empty, the parse is complete.
(1) If the item at the top of the parse stack is an action symbol: execute the associated code and pop
    the item.
(2) If the item at the top of the parse stack is a nonterminal:
    (a) lhs = the right field of the structure currently at the top of the value stack.
    (b) Replace the symbol at the top of the parse stack with its right-hand side, pushing the symbols
        in reverse order. Every time an item is pushed onto the parse stack, push an item onto
        the value stack, initializing both fields to lhs.
    (c) Goto (0).
(3) Otherwise, if the item at the top of the parse stack is a terminal symbol, that symbol must also be
    the current lookahead symbol. If it's not, there's a syntax error; otherwise pop the terminal symbol
    and advance the input. Goto (0).
I'll give an example of how to use this attribute-passing mechanism in a moment.
First, however, I need to introduce a new notation for attribute management. Attributes
need only be referenced explicitly inside an action. Rather than giving attributes arbitrary
names, all you need is a mechanism to reference them from the perspective of the
action itself. The following notation, similar to the one used by yacc, does the job: The
attribute associated with a left-hand side is always called $$, and attributes for symbols
on the right-hand side are referenced by the notation $1, $2, and so forth, where the
number represents the offset of the desired symbol to the right of the current action.
Consider this production:
    expr' → + {$1=$2=newname();}
              term {printf("%s+=%s\n",$$,$0); freename($0);}
              expr'
(That's one long production, split up into several lines to make it easier to read, not three
right-hand sides.) The $1 in {$1=$2=newname();} attaches an attribute to the term
because term is one symbol to the right of the action. When the parser replaces this term
with its right-hand side, the symbols on that right-hand side can all use $$ to access the
attribute associated with the term. In the current right-hand side, the $$ in the
printf() statement references an attribute that was attached to the expr' at some earlier
point in the parse. The $2 in {$1=$2=newname();} references the second action,
which is two symbols to the right of {$1=$2=newname();}. The second action can
get at this attribute using $0. That is, the value assigned to $2 in the first action is
accessed with $0 in the second action.
LLama can translate the dollar notation directly into value-stack references when it
outputs the switch statement that holds the action code. When an action is executed,
$$ is the left-hand-side field of the structure at the top of the value stack, $0 is the right-
hand-side field of the same structure, $1 is the structure at offset one from the top of the
stack, $2 is at offset two, and so forth.
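To make the mapping concrete, here is a minimal sketch of how the dollar notation could be expressed as C macros over a value stack. It is not LLama's actual output: the names Vstack, Vtop, YY_LHS, YY_RHS, vpush, vpop, and action_assign are all hypothetical, and the name argument stands in for the result of newname().

```c
#include <assert.h>

typedef const char *YYSTYPE;            /* a character pointer, as in the text */

typedef struct
{
    YYSTYPE left;                       /* attribute of the left-hand side ($$) */
    YYSTYPE right;                      /* the symbol's own attribute ($N)      */
}
yyvstype;

#define VSTACK_SIZE 64
static yyvstype Vstack[VSTACK_SIZE];    /* hypothetical value stack             */
static int      Vtop = -1;              /* index of the top element             */

#define YY_LHS     (Vstack[Vtop].left)          /* $$            */
#define YY_RHS(n)  (Vstack[Vtop - (n)].right)   /* $0, $1, $2... */

/* Push one value-stack item, copying the inherited attribute to both fields. */
void vpush(YYSTYPE inherited)
{
    assert(Vtop + 1 < VSTACK_SIZE);
    ++Vtop;
    Vstack[Vtop].left  = inherited;
    Vstack[Vtop].right = inherited;
}

void vpop(void)
{
    assert(Vtop >= 0);
    --Vtop;
}

/* What {$1=$2=newname();} might expand to; name stands in for newname(). */
void action_assign(YYSTYPE name)
{
    assert(Vtop >= 2);
    YY_RHS(1) = YY_RHS(2) = name;
}
```

With three items pushed for {second action}, term, and {first action} (in that order, so the first action is on top), calling action_assign() changes the right fields of the two items beneath the action and leaves their left fields alone.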
The entire grammar, modified to use our new notation, is shown in Table 4.8. You
can see how all this works by looking at the parse of 1+2 that is shown in Table 4.9.
Remember that the symbols on the right-hand side of a replaced nonterminal inherit the
attributes in the nonterminal's right field; both the left and right fields of the
attribute-stack structure are initialized to hold the right field of the parent nonterminal.
Table 4.8. An Augmented, Attributed Grammar Using the $ Notation

 1. stmt    →  ε
 2.         |  {$1=$2=newname();} expr {freename($0);} ; stmt
 3. expr    →  term expr'
 4. expr'   →  + {$1=$2=newname();} term {printf("%s+=%s\n",$$,$0); freename($0);} expr'
 5.         |  ε
 6. term    →  factor term'
 7. term'   →  * {$1=$2=newname();} factor {printf("%s*=%s\n",$$,$0); freename($0);} term'
 8.         |  ε
 9. factor  →  number_or_id {printf("%s=%0.*s\n",$$,yyleng,yytext);}
10.         |  ( expr )
Table 4.9. A Parse of 1+2 Showing Value Stacks

Each entry shows the parse and value stacks (top of stack at right), followed by the notes for that step.

stmt[?,?]
    stmt→expr ; stmt
stmt[?,?] ;[?,?] {1}[?,?] expr[?,?] {0}[?,?]
    $1=$2=newname();  $1 references expr, $2 references {1}
stmt[?,?] ;[?,?] {1}[?,t0] expr[?,t0]
    expr→term expr'
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] term[t0,t0]
    term→factor term'
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] term'[t0,t0] factor[t0,t0]
    factor→num_or_id
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] term'[t0,t0] {3}[t0,t0] n[t0,t0]
    num_or_id token matches 1 in input
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] term'[t0,t0] {3}[t0,t0]
    printf("%s=%0.*s\n",$$,yyleng,yytext);  outputs: t0=1
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] term'[t0,t0]
    term'→ε
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0]
    expr'→+ term expr'
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t0] term[t0,t0] {0}[t0,t0] +[t0,t0]
    + token matches + in input
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t1] term[t0,t1] {0}[t0,t0]
    $1=$2=newname();  $1 references term, $2 references {2}
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t1] term[t0,t1]
    term→factor term'
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t1] term'[t1,t1] factor[t1,t1]
    factor→num_or_id
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t1] term'[t1,t1] {3}[t1,t1] n[t1,t1]
    num_or_id token matches 2 in input
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t1] term'[t1,t1] {3}[t1,t1]
    printf("%s=%0.*s\n",$$,yyleng,yytext);  outputs: t1=2
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t1] term'[t1,t1]
    term'→ε
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0] {2}[t0,t1]
    printf("%s+=%s\n",$$,$0); freename($0);  outputs: t0+=t1
stmt[?,?] ;[?,?] {1}[?,t0] expr'[t0,t0]
    expr'→ε
stmt[?,?] ;[?,?] {1}[?,t0]
    freename($0);
stmt[?,?] ;[?,?]
    ; token matches ; in input
stmt[?,?]
    stmt→ε

n   = num_or_id
{0} = {$1=$2=newname();}
{1} = {freename($0);}
{2} = {printf("%s+=%s\n",$$,$0); freename($0);}
{3} = {printf("%s=%0.*s\n",$$,yyleng,yytext);}
Value-stack elements are shown as [left,right] pairs; [$$,$0] in attribute notation.
4.5 Automating the Top-Down Parse Process*
Now that we have an algorithm, we need a way to automate the parse process. The
solution is a push-down automaton in which a state machine controls the activity on the
stack. I'll start by looking at how a table-driven top-down parser works in a general
way, and then I'll examine a LLama output file in depth by way of example. I'll discuss
how LLama itself works, and how the tables are generated, in the next section.
4.5.1 Top-Down Parse Tables*
The tables for the push-down automaton used by the parser are relatively straightforward.
The basic strategy is to give a numeric value to all the symbols in the grammar,
and then make tables of these numeric values. For example, the numeric values in Table
4.10 represent the symbols in the augmented grammar in Table 4.5 on page 203.
Table 4.10. Numeric and Symbolic Values For Symbols in Expression Grammar

Symbolic Value           Numeric Value   Notes
LP                       5               Terminal Symbols
NUM_OR_ID                4
PLUS                     2
RP                       6
SEMI                     1
TIMES                    3
_EOI_                    0               (End-of-input marker, created by LLama)
expr                     257             Nonterminal Symbols
expr'                    259
factor                   260
stmt                     256
term                     258
term'                    261
{op('+');}               512             Actions
{op('*');}               513
{create_tmp(yytext);}    514
The various symbol types can be distinguished by their numeric value. The end-of-input
marker is always 0, tokens are all in the range 1-6, nonterminals are in the range
256-261, and actions are in the range 512-514. The parse stack contains these numbers
rather than the terminals, nonterminals, and actions that they represent. For example,
every appearance of an expr in a stack picture corresponds to a 257 on the real stack.
The earlier parse of 1+2 is shown in both symbolic and numeric form in Table 4.11. For
clarity, I will continue to use the symbolic names rather than the numbers, but remember
that the stack itself actually has the numeric, not the symbolic, values on it.
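Because the ranges don't overlap, a parser can tell what kind of symbol is at top of stack with a couple of comparisons. The following sketch shows one way to do it; the boundary values 256 and 512 come from Table 4.10, but the macro names, the enum, and the classify() function are hypothetical and are not taken from LLama's output.

```c
#include <assert.h>

/* Range boundaries assumed from Table 4.10; the names are hypothetical. */
#define MIN_NONTERM 256     /* smallest nonterminal value */
#define MIN_ACTION  512     /* smallest action value      */

enum sym_kind { K_EOI, K_TERMINAL, K_NONTERMINAL, K_ACTION };

/* Classify a parse-stack item by its numeric value. */
enum sym_kind classify(int sym)
{
    assert(sym >= 0);

    if (sym == 0)
        return K_EOI;           /* end-of-input marker */
    if (sym < MIN_NONTERM)
        return K_TERMINAL;      /* tokens              */
    if (sym < MIN_ACTION)
        return K_NONTERMINAL;
    return K_ACTION;
}
```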
The parser takes care of actions with a switch statement controlled by symbolic
constants, like the one in Listing 4.1.² The switch is executed when a number
representing an action is encountered at top of stack; that same number (actnum) is
used as the argument to the switch, and there's a case for every action symbol.
The parse itself is accomplished using two tables, Yy_pushtab and Yy_d, shown in
Figures 4.5 and 4.6. The algorithm is in Table 4.12. Yy_pushtab holds the right-hand
2. This switch statement could be made easier to read by using macros to convert the numbers to a more
readable string, but this mapping would make both LLama itself and the output code more complex, so I
decided not to do it. The symbol table is available from LLama if you need to examine the output code.
Table 4.11. A Parse of 1+2 Showing Both Symbols and Numeric Equivalents

Stack (Symbols)               Stack (Numbers)
stmt                          256
stmt                          256
stmt ;                        256 1
stmt ; expr                   256 1 257
stmt ;                        256 1
stmt ; expr'                  256 1 259
stmt ; expr' term             256 1 259 258
stmt ; expr'                  256 1 259
stmt ; expr' term'            256 1 259 261
stmt ; expr' term' factor     256 1 259 261 260
stmt ; expr' term' {2}        256 1 259 261 514
stmt ; expr' term' {2} n      256 1 259 261 514 4
stmt ; expr' term' {2}        256 1 259 261 514
stmt ;                        256 1
stmt ; expr' {0}              256 1 259 512
stmt ; expr' {0} term         256 1 259 512 258
stmt ; expr' {0} term +       256 1 259 512 258 2
stmt ; expr' {0} term         256 1 259 512 258
stmt ; expr' {0}              256 1 259 512
stmt ; expr' {0} term'        256 1 259 512 261
stmt ; expr' {0} term' factor 256 1 259 512 261 260
stmt ; expr' {0} term'        256 1 259 512 261
stmt ; expr' {0} term' {2}    256 1 259 512 261 514
stmt ; expr' {0} term' {2} n  256 1 259 512 261 514 4
stmt ; expr' {0} term' {2}    256 1 259 512 261 514
stmt ; expr' {0} term'        256 1 259 512 261
stmt ; expr' {0}              256 1 259 512
stmt ;                        256 1
stmt                          256

{0} represents {op('+');}
{2} represents {create_tmp(yytext);}
Listing 4.1. Executing Actions from a switch

    switch( actnum )
    {
    case 512:  { op('+'); }               break;
    case 513:  { op('*'); }               break;
    case 514:  { create_tmp(yytext); }    break;

    default:   printf("INTERNAL ERROR: Illegal action number\n");
               break;
    }
sides of the productions in reverse order (because you have to push them in reverse order
when you do a replace operation). The index into the Yy_pushtab array is the production
number, an arbitrary but unique number assigned to each production.³ Here, the
topmost production in the grammar is Production 0, the next one is Production 1, and so
forth. The right-hand sides are represented by zero-terminated arrays of the earlier symbolic
values. Yy_d is used to determine the actual replacement to perform. It is indexed
by potential lookahead symbols along one axis, and nonterminals along the other, and is
used with the parse algorithm in Table 4.12.
Figure 4.5. LLama Parse Tables: Yy_pushtab[]

Yy_pushtab entry   Array contents           Symbols (right-hand side, reversed)
Yyp00              0
Yyp01              256, 1, 257, 0           stmt SEMI expr
Yyp02              259, 258, 0              expr' term
Yyp03              259, 512, 258, 2, 0      expr' {op(ADD);} term PLUS
Yyp04              0
Yyp05              261, 260, 0              term' factor
Yyp06              261, 513, 260, 3, 0      term' {op(multiply);} factor TIMES
Yyp07              0
Yyp08              514, 4, 0                {rvalue(yytext)} NUM_OR_ID
Yyp09              6, 257, 5, 0             RP expr LP
Figure 4.6. LLama Parse Tables: Yy_d[]

Yy_d      ⊢    SEMI  PLUS  TIMES  NUM_OR_ID  LP   RP
stmt       0   -1    -1    -1      1          1   -1
expr      -1   -1    -1    -1      2          2   -1
term      -1   -1    -1    -1      5          5   -1
expr'     -1    4     3    -1     -1         -1    4
factor    -1   -1    -1    -1      8          9   -1
term'     -1    7     7     6     -1         -1    7
At this point, I'd suggest running through the parse of 1+2 again, using the algorithm
and tables just discussed.
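If you'd rather let the machine do the run-through, the following sketch drives a parse from the two tables. It is an illustration under stated assumptions, not LLama's generated parser: the function name parse(), the fixed STACK_SIZE, and the decision to record action numbers in an array (rather than executing them from a switch) are all hypothetical. The table contents are transcribed from Figures 4.5 and 4.6; I've assumed that the entry for stmt→expr ; stmt is 256, 1, 257, 0 and that Yy_d selects Production 9 for factor when the lookahead is LP, both of which follow from the grammar.

```c
#include <assert.h>

#define EOI          0      /* end-of-input marker                  */
#define MIN_NONTERM  256    /* stmt; nonterminals are 256-261       */
#define MIN_ACTION   512    /* actions are 512-514                  */
#define NUM_TERMS    7      /* _EOI_ plus the six tokens            */
#define STACK_SIZE   64     /* hypothetical parse-stack depth       */

/* Right-hand sides in reverse order, zero terminated. */
static const int Yyp00[] = { 0 };
static const int Yyp01[] = { 256, 1, 257, 0 };
static const int Yyp02[] = { 259, 258, 0 };
static const int Yyp03[] = { 259, 512, 258, 2, 0 };
static const int Yyp04[] = { 0 };
static const int Yyp05[] = { 261, 260, 0 };
static const int Yyp06[] = { 261, 513, 260, 3, 0 };
static const int Yyp07[] = { 0 };
static const int Yyp08[] = { 514, 4, 0 };
static const int Yyp09[] = { 6, 257, 5, 0 };

static const int *const Yy_pushtab[] =
{
    Yyp00, Yyp01, Yyp02, Yyp03, Yyp04, Yyp05, Yyp06, Yyp07, Yyp08, Yyp09
};

/* Yy_d[nonterminal - 256][lookahead]; -1 marks an error. */
static const int Yy_d[6][NUM_TERMS] =
{
    /* stmt   */ {  0, -1, -1, -1,  1,  1, -1 },
    /* expr   */ { -1, -1, -1, -1,  2,  2, -1 },
    /* term   */ { -1, -1, -1, -1,  5,  5, -1 },
    /* expr'  */ { -1,  4,  3, -1, -1, -1,  4 },
    /* factor */ { -1, -1, -1, -1,  8,  9, -1 },
    /* term'  */ { -1,  7,  7,  6, -1, -1,  7 }
};

/* Parse an EOI-terminated token array.  Action numbers are stored in
 * actions[] (up to max_actions of them) in the order they would be
 * executed.  Returns the number of actions seen, or -1 on a syntax error.
 */
int parse(const int *input, int *actions, int max_actions)
{
    int stack[STACK_SIZE];
    int sp = 0, nactions = 0;

    stack[sp++] = MIN_NONTERM;              /* push the start symbol */

    while (sp > 0) {
        int top = stack[--sp];              /* every case pops the top item */

        if (top >= MIN_ACTION) {            /* action: "execute" it */
            if (nactions < max_actions)
                actions[nactions] = top;
            ++nactions;
        } else if (top >= MIN_NONTERM) {    /* nonterminal: replace it */
            const int *rhs;
            int prod;

            if (*input < 0 || *input >= NUM_TERMS)
                return -1;
            prod = Yy_d[top - MIN_NONTERM][*input];
            if (prod < 0)
                return -1;
            for (rhs = Yy_pushtab[prod]; *rhs != 0; ++rhs) {
                assert(sp < STACK_SIZE);
                stack[sp++] = *rhs;
            }
        } else {                            /* terminal: must match lookahead */
            if (top != *input)
                return -1;
            ++input;
        }
    }
    return (*input == EOI) ? nactions : -1;
}
```

Given the token string NUM_OR_ID PLUS NUM_OR_ID SEMI _EOI_ (that is, 1+2;), the sketch records actions 514, 514, and 512, which is the order in which the code is generated in Table 4.6.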
3. Production numbers for LLama-generated parsers can be found by looking at the llout.sym file generated
by LLama's -D, -S, or -s switch.
Table 4.12. Top-Down Parse Algorithm, Final Version
4.6 LL(1) Grammars and Their Limitations*
The concept of an LL(1) grammar was introduced in Chapter Three. This class of
grammar can be parsed by a parser that reads the input from left to right and does a leftmost
derivation of the parse tree. It needs at most one character of lookahead. There are
LL(0) grammars, which require no lookahead, but they're too limited to be useful in a
compiler; all that they can do is recognize finite strings. LL(0) grammars are a subset
of the right-linear grammars discussed in the last chapter; they can have no ε productions
in them.
The main task of a program like LLama is to manufacture the parse tables from the
tokenized input grammar. The basic parse table (called Yy_d in the previous section) is a
two-dimensional array; the rows are indexed by nonterminal symbols and the columns by
terminal symbols. The array contains either an error indicator or the number of the production
to apply when a specified nonterminal is at the top of the parse stack and a particular
terminal is in the input stream. This parse table can only be created if the input
grammar is an LL(1) grammar.
An LL(1) grammar is limited in many ways, which are best explained by looking at
how the parser determines what productions to apply in a given situation, and by looking
at grammars that the parser can't use because it can't make this determination. Consider
the following productions, which recognize simple expressions composed of numbers,
parentheses, and minus signs.
    expression → OPEN_PAREN expression CLOSE_PAREN
               | NUMBER MINUS expression
If an expression is on top of the parse stack, the parser must determine which of the two
possible productions to apply. It makes the decision by looking at the current lookahead
token. If this token is an OPEN_PAREN, the parser applies the top production; if it's a
NUMBER, the bottom production is applied. If any other symbol is the lookahead symbol,
then an error has occurred. Note that if you were to add a production of the form
    expression → NUMBER PLUS expression
to the foregoing grammar, you could no longer determine which of the two right-hand
sides that start with a NUMBER to apply when an expression was on top of the stack.
The grammar would not be LL(1) in this case.
The situation is complicated somewhat in the following grammar:
    1. expression → OPEN_PAREN expression CLOSE_PAREN
    2.            | term MINUS expression
    3. term       → NUMBER
    4.            | IDENTIFIER
Here also, the top production is applied when an expression is on top of the stack and an
OPEN_PAREN is in the input. You need to look further than the actual production to
determine whether you can apply Production 2, however. Since the second production
starts with a nonterminal, you can't tell what to do by looking at Production 2 only. You
can resolve the problem by tracing down the grammar, looking at the symbols that can
occur at the beginning of the nonterminal. Since a term begins the second production,
and a term can start with either a NUMBER or IDENTIFIER, the parser can apply the
second production if an expression is on top of the stack and the input symbol is a
NUMBER or IDENTIFIER. Note that problems would arise if you added a production
of the form
    expression → NUMBER
in the previous grammar, because, if a NUMBER were the lookahead symbol, the parser
couldn't determine whether to apply Production 2 or the new production. If this situation
existed, the grammar would not be LL(1).
Let's complicate the situation further with this grammar, which describes a simple
compound statement that can either contain another statement or be empty:
    1. statement  → OPEN_CURLY expression CLOSE_CURLY
    2.            | expression SEMICOLON
    3. expression → OPEN_PAREN expression CLOSE_PAREN
    4.            | term MINUS expression
    5.            | ε
    6. term       → NUMBER
    7.            | IDENTIFIER
Production 1 is applied if a statement is on top of the stack and the input symbol is an
OPEN_CURLY. Similarly, Production 2 is applied when a statement is on top of the
stack and the input symbol is an OPEN_PAREN, NUMBER, or IDENTIFIER (an
OPEN_PAREN because an expression can start with an OPEN_PAREN by Production
3, a NUMBER or IDENTIFIER because an expression can start with a term, which can,
in turn, start with a NUMBER or IDENTIFIER). The situation is complicated when an
expression is on top of the stack, however. You can use the same rules as before to
figure out whether to apply Productions 3 or 4, but what about the ε production (Production
5)? The situation is resolved by looking at the symbols that can follow an expression
in the grammar. If expression goes to ε, it effectively disappears from the current
derivation (from the parse tree); it becomes transparent. So, if an expression is on top
of the stack, apply Production 5 if the current lookahead symbol can follow an expression
(if it is a CLOSE_CURLY, CLOSE_PAREN, or SEMICOLON). In this last
situation, there would be serious problems if CLOSE_CURLY could also start an
expression. The grammar would not be LL(1) were this the case.
4.7 Making the Parse Tables*
I'll now formalize the foregoing rules and show how to use these rules to make a
top-down parse table such as the one in Figure 4.6 on page 210. This section discusses
the theory; code that implements this theory is presented later on in the chapter.
4.7.1 FIRST Sets*
The set of terminal symbols that can appear at the far left of any parse tree derived
from a particular nonterminal is that nonterminal's FIRST set. Informally, if you had a
bunch of productions that all had expression on the left-hand side, FIRST(expression)
(pronounced "first of expression") would comprise all terminals that can start an expression.
In the case of the C language, this set would include numbers, open parentheses,
minus signs, and so forth, but it would not include a close parenthesis, because that symbol
can't start an expression. Note that ε is considered to be a terminal symbol, so it can
appear in a FIRST set. FIRST sets can be formed using the rules in Table 4.13. I'll demonstrate
how to find FIRST sets with an example, using the grammar in Table 4.14.
The FIRST sets are put together in a multiple-pass process, starting out with the easy
ones. Initially, add those terminals that are at the far left of a right-hand side:
    FIRST(stmt)   = { }
    FIRST(expr)   = {ε}
    FIRST(expr')  = {PLUS, ε}
    FIRST(term)   = { }
    FIRST(term')  = {TIMES, ε}
Table 4.13. Finding FIRST Sets

(1) FIRST(A), where A is a terminal symbol, is {A}. If A is ε, then ε is put into the FIRST set.
(2) Given a production of the form
        s → A α
    where s is a nonterminal symbol, A is a terminal symbol, and α is a collection of zero or more
    terminals and nonterminals, A is a member of FIRST(s).
(3) Given a production of the form
        s → b α
    where s and b are single nonterminal symbols, and α is a collection of terminals and nonterminals,
    everything in FIRST(b) is also in FIRST(s).
    This rule can be generalized. Given a production of the form:
        s → α B β
    where s is a nonterminal symbol, α is a collection of zero or more nullable nonterminals,† B is a
    single terminal or nonterminal symbol, and β is a collection of terminals and nonterminals, then
    FIRST(s) includes the union of FIRST(B) and FIRST(α). For example, if α consists of the three
    nullable nonterminals x, y, and z, then FIRST(s) includes all the members of FIRST(x),
    FIRST(y), and FIRST(z), along with everything in FIRST(B).
† A nonterminal is nullable if it can go to ε by some derivation. ε is always a member of a nullable
  nonterminal's FIRST set.
Table 4.14. Yet Another Expression Grammar

 1. stmt    →  expr SEMI
 2. expr    →  term expr'
 3.         |  ε
 4. expr'   →  PLUS term expr'
 5.         |  ε
 6. term    →  factor term'
 7. term'   →  TIMES factor term'
 8.         |  ε
 9. factor  →  LEFT_PAREN expr RIGHT_PAREN
10.         |  NUMBER
    FIRST(factor) = {LEFT_PAREN, NUMBER}
Next, close the sets (perform a closure operation on the initial sets) using the foregoing
rules. Everything in FIRST(factor) is also in FIRST(term) because factor is the leftmost
symbol on the right-hand side of term. Similarly, everything in FIRST(term) is
also in FIRST(expr), and everything in FIRST(expr) is also in FIRST(stmt). Finally,
expr is a nullable nonterminal at the left of stmt's right-hand side, so I'll add SEMI to
FIRST(stmt). Applying these relationships yields the following FIRST sets:
    FIRST(stmt)   = {LEFT_PAREN, NUMBER, SEMI}
    FIRST(expr)   = {LEFT_PAREN, NUMBER, ε}
    FIRST(expr')  = {PLUS, ε}
    FIRST(term)   = {LEFT_PAREN, NUMBER}
    FIRST(term')  = {TIMES, ε}
    FIRST(factor) = {LEFT_PAREN, NUMBER}
One final note: the FIRST notation is a little confusing because you see it used in
three ways:
FIRST(A) (A is a terminal) is {A}. Since ε is a terminal, FIRST(ε) = {ε}.
FIRST(x) (x is a nonterminal) is that nonterminal's FIRST set, described above.
FIRST(α) (α is a collection of terminals and nonterminals) is the FIRST set computed
using the procedure in Rule (3), above: FIRST(α) always includes the
FIRST set of the leftmost symbol in α. If that symbol is nullable, then it is
the union of the FIRST sets of the first two symbols; if both of these symbols
are nullable, then it is the union of the FIRST sets of the first three symbols, and so forth.
If all the symbols in α are nullable, then FIRST(α) includes ε.
A subroutine that computes FIRST sets is presented towards the end of the current
chapter in Listing 4.27 on page 305.
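In the meantime, here is a small, self-contained sketch of the closure process applied to the grammar in Table 4.14. It is not the subroutine from Listing 4.27: the bit-set representation, the symbol numbering, and the names (Prods, First, compute_first, BIT) are all assumptions made for the illustration. An ε production is represented by an empty right-hand side, and ε itself is given a bit so that it can appear in a FIRST set.

```c
#include <assert.h>

/* Hypothetical symbol numbering: terminals (and ε) first, then nonterminals. */
enum
{
    SEMI, PLUS, TIMES, LEFT_PAREN, RIGHT_PAREN, NUMBER, EPSILON, NTERMS,
    STMT = NTERMS, EXPR, EXPR_P, TERM, TERM_P, FACTOR, NSYMS
};

#define END     (-1)                /* marks the end of a right-hand side */
#define BIT(x)  (1u << (x))

/* Each row: left-hand side, right-hand side symbols, END. */
static const int Prods[][5] =
{
    { STMT,   EXPR, SEMI, END },
    { EXPR,   TERM, EXPR_P, END },
    { EXPR,   END },                            /* expr  -> ε */
    { EXPR_P, PLUS, TERM, EXPR_P, END },
    { EXPR_P, END },                            /* expr' -> ε */
    { TERM,   FACTOR, TERM_P, END },
    { TERM_P, TIMES, FACTOR, TERM_P, END },
    { TERM_P, END },                            /* term' -> ε */
    { FACTOR, LEFT_PAREN, EXPR, RIGHT_PAREN, END },
    { FACTOR, NUMBER, END }
};
#define NPRODS ((int)(sizeof(Prods) / sizeof(Prods[0])))

unsigned First[NSYMS];              /* First[x] is a bit set of terminals */

void compute_first(void)
{
    int i, changed;

    /* Rule 1: FIRST of a terminal is the terminal itself. */
    for (i = 0; i < NSYMS; ++i)
        First[i] = (i < NTERMS) ? BIT(i) : 0;

    do {                            /* repeat until nothing more is added */
        changed = 0;
        for (i = 0; i < NPRODS; ++i) {
            int lhs = Prods[i][0];
            const int *p = &Prods[i][1];
            unsigned add = 0;

            /* Rules 2 and 3: walk across leading nullable symbols. */
            for (; *p != END; ++p) {
                add |= First[*p] & ~BIT(EPSILON);
                if (!(First[*p] & BIT(EPSILON)))
                    break;          /* this symbol isn't nullable */
            }
            if (*p == END)          /* every symbol was nullable */
                add |= BIT(EPSILON);

            if ((First[lhs] | add) != First[lhs]) {
                First[lhs] |= add;
                changed = 1;
            }
        }
    } while (changed);
}
```

After compute_first() returns, testing First[EXPR] & BIT(EPSILON) tells you whether expr is nullable, which is the test the generalized form of Rule 3 depends on.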
4.7.2 FOLLOW Sets
The other set of symbols that you need in order to make the parse tables are the FOLLOW
sets. A terminal symbol is in a nonterminal's FOLLOW set if it can follow that
nonterminal in some derivation. You can find a nonterminal's FOLLOW set with the
rules in Table 4.15. To see how Rule 3 in Table 4.15 works, consider the following
grammar:
    1. compound_stmt → OPEN_CURLY stmt_list CLOSE_CURLY
    2. stmt_list     → stmt_list stmt
    3. stmt          → expr SEMI
CLOSE_CURLY is in FOLLOW(stmt_list) because it follows stmt_list in Production 1.
CLOSE_CURLY is also in FOLLOW(stmt) because of the second of the following
derivations:
    compound_stmt ⇒ OPEN_CURLY stmt_list CLOSE_CURLY
                  ⇒ OPEN_CURLY stmt_list stmt CLOSE_CURLY
                  ⇒ OPEN_CURLY stmt_list expr SEMI CLOSE_CURLY
The stmt_list in the first sentential form was replaced by the right-hand side of Production
2 (stmt_list stmt), and in that derivation, a CLOSE_CURLY followed the stmt.
I'll demonstrate how to compute FOLLOW sets with the earlier grammar in Table
4.14 on page 214. In the initial pass, apply the first two rules for forming FOLLOW sets:
SEMI and RIGHT_PAREN are added to FOLLOW(expr) because they actually follow
it in Productions 1 and 9; PLUS is added to FOLLOW(term) because everything in
FIRST(expr') must be in FOLLOW(term) by Production 4; TIMES is added to
FOLLOW(factor) because everything in FIRST(term') must be in FOLLOW(factor)
by Productions 6 and 7. The initial pass looks like this:
FOLLOW(stmt)   = {⊢}
FOLLOW(expr)   = {SEMI, RIGHT_PAREN}
FOLLOW(expr')  = {}
Computing FOLLOW
sets, an example.
Table 4.15. Finding FOLLOW Sets
(1) If s is the goal symbol, ⊢ (the end-of-input marker) is in FOLLOW(s);
(2) Given a production of the form:
        s → ... a ℬ ...
    where a is a nonterminal and ℬ is either a terminal or nonterminal, FIRST(ℬ) is in FOLLOW(a);
    To generalize further, given a production of the form
        s → ... a α ℬ ...
    where s and a are nonterminals, α is a collection of zero or more nullable nonterminals and ℬ is
    either a terminal or nonterminal, FOLLOW(a) includes the union of FIRST(α) and FIRST(ℬ).
(3) Given a production of the form:
        s → ... a
    where a is the rightmost nonterminal on the right-hand side of a production, everything in
    FOLLOW(s) is also in FOLLOW(a). (I'll describe how this works in a moment.) To generalize
    further, given a production of the form:
        s → ... a α
    where s and a are nonterminals, and α is a collection of zero or more nullable nonterminals,
    everything in FOLLOW(s) is also in FOLLOW(a).
FOLLOW(term)   = {PLUS}
FOLLOW(term')  = {}
FOLLOW(factor) = {TIMES}
Now close the FOLLOW sets by making several passes through them, applying Rule 3
repetitively until nothing more is added to any FOLLOW set. The following holds:
Everything in FOLLOW(expr) is also in FOLLOW(expr') by Production 2.
Everything in FOLLOW(term) is also in FOLLOW(term') by Production 7.
Since expr' is nullable, everything in FOLLOW(expr) is also in FOLLOW(term)
by Production 4.
Since term' is nullable, everything in FOLLOW(term') is also in FOLLOW(factor)
by Production 7.
The first closure pass applies these identities to the original FOLLOW sets, yielding the
following sets:
FOLLOW(stmt)   = {⊢}
FOLLOW(expr)   = {SEMI, RIGHT_PAREN}
FOLLOW(expr')  = {SEMI, RIGHT_PAREN}
FOLLOW(term)   = {PLUS, SEMI, RIGHT_PAREN}
FOLLOW(term')  = {PLUS}
FOLLOW(factor) = {TIMES, PLUS}
Another pass, using the same identities, adds a few more elements:
FOLLOW(stmt)   = {⊢}
FOLLOW(expr)   = {SEMI, RIGHT_PAREN}
FOLLOW(expr')  = {SEMI, RIGHT_PAREN}
FOLLOW(term)   = {PLUS, SEMI, RIGHT_PAREN}
FOLLOW(term')  = {PLUS, SEMI, RIGHT_PAREN}
FOLLOW(factor) = {TIMES, PLUS, SEMI, RIGHT_PAREN}
A third pass adds nothing to the FOLLOW sets, so you're done. A subroutine that computes
FOLLOW sets is presented towards the end of the current chapter in Listing 4.28
on page 307.
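The closure process is easier to believe once you've watched it run, so here is a small sketch of it in C. It is not the subroutine from Listing 4.28. The grammar encoding, the bit-mask sets, and every name in it (production, compute_follow(), BIT(), and so on) are assumptions made for the illustration. The productions are my reading of the expression grammar that the example above works with, and the FIRST sets and nullability flags are written out by hand rather than computed. Unlike the walkthrough, which applies Rules 1 and 2 once and then iterates on Rule 3, the sketch simply applies all the rules on every pass until nothing changes; the final sets are the same.

```c
#include <stddef.h>

enum { STMT, EXPR, EXPR_P, TERM, TERM_P, FACTOR, NNONTERM };      /* nonterminals */
enum { T_SEMI = NNONTERM, T_PLUS, T_TIMES, T_LP, T_RP, T_NUM, T_EOI }; /* terminals */

#define BIT(t)  (1u << ((t) - NNONTERM))   /* one bit per terminal */
#define MAXRHS  3

typedef struct { int lhs; int len; int rhs[MAXRHS]; } production;

/* Assumed grammar (expr' is EXPR_P, term' is TERM_P). */
static const production grammar[] = {
    { STMT,   2, { EXPR,    T_SEMI           } },
    { EXPR,   2, { TERM,    EXPR_P           } },
    { EXPR_P, 3, { T_PLUS,  TERM,   EXPR_P   } },
    { EXPR_P, 0, { 0                         } },   /* epsilon */
    { TERM,   2, { FACTOR,  TERM_P           } },
    { TERM_P, 3, { T_TIMES, FACTOR, TERM_P   } },
    { TERM_P, 0, { 0                         } },   /* epsilon */
    { FACTOR, 3, { T_LP,    EXPR,   T_RP     } },
    { FACTOR, 1, { T_NUM                     } },
};
#define NPROD (sizeof grammar / sizeof grammar[0])

/* Hand-computed FIRST sets (epsilon left out) and nullability flags. */
static const unsigned first[NNONTERM] = {
    BIT(T_LP) | BIT(T_NUM), BIT(T_LP) | BIT(T_NUM), BIT(T_PLUS),
    BIT(T_LP) | BIT(T_NUM), BIT(T_TIMES),           BIT(T_LP) | BIT(T_NUM)
};
static const int nullable[NNONTERM] = { 0, 0, 1, 0, 1, 0 };

void compute_follow(unsigned follow[NNONTERM])
{
    int    changed, i, j, k;
    size_t p;

    for (i = 0; i < NNONTERM; ++i)
        follow[i] = 0;
    follow[STMT] = BIT(T_EOI);                  /* Rule 1: goal symbol */

    do {                                        /* repeat until nothing is added */
        changed = 0;
        for (p = 0; p < NPROD; ++p) {
            const production *pr = &grammar[p];
            for (j = 0; j < pr->len; ++j) {
                int      a   = pr->rhs[j];
                unsigned add = 0;

                if (a >= NNONTERM)              /* terminals have no FOLLOW set */
                    continue;
                /* Rule 2: look right, skipping over nullable nonterminals. */
                for (k = j + 1; k < pr->len; ++k) {
                    int b = pr->rhs[k];
                    if (b >= NNONTERM) { add |= BIT(b); break; }
                    add |= first[b];
                    if (!nullable[b])
                        break;
                }
                /* Rule 3: only nullable symbols (or nothing) to the right. */
                if (k >= pr->len)
                    add |= follow[pr->lhs];

                if ((follow[a] | add) != follow[a]) {
                    follow[a] |= add;
                    changed = 1;
                }
            }
        }
    } while (changed);
}
```

After compute_follow() returns, follow[FACTOR] holds TIMES, PLUS, SEMI, and RIGHT_PAREN (T_RP here), which agrees with the final pass shown above.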
4.7.3 LL(1) Selection Sets
To review a bit, an LL(1) parse table looks like this:

input symbol
nonterminal
The columns are indexed by input symbol, the rows by nonterminal symbol. The table
holds either a marker that signifies a syntax error or the number of a production to apply,
given the current top-of-stack and input symbols. Each production has a unique, but
arbitrary, production number. Typically, the start production is Production 0, the next
one is Production 1, and so forth.
The LL(1) selection set for a given production is the set of input symbols (terminals) for
which that production is the legal entry in its left-hand side's row of the parse table. For
example, a grammar could have the following productions in it:
1. terminal → PERKIN_ELMER pk
2.          | ADM3 adm
3.          | dec_term
4. dec_term → VT_52
5.          | VT_100
The parse table for this grammar looks like this:
            PERKIN_ELMER   ADM3    VT_52   VT_100
terminal          1          2       3       3
dec_term        error      error     4       5
The number 1 is in the PERKIN_ELMER column of the terminal row, because Production
1 is applied if a terminal is at top of stack and PERKIN_ELMER is the lookahead symbol,
and so forth. The same relationships are indicated by these selection sets:
SELECT(1) = { PERKIN_ELMER }
SELECT(2) = { ADM3 }
SELECT(3) = { VT_52, VT_100 }
SELECT(4) = { VT_52 }
SELECT(5) = { VT_100 }
SELECT(3) indicates that Production 3 is selected if the left-hand side of Production 3 is
on the top of the stack, and the current lookahead symbol is VT_52 or VT_100. In general,
if you have both a production N (which takes the form s → α) and a token T, then T
is in SELECT(N) if production N should be applied when s is on the top of the parse
stack and T is the current lookahead symbol. Note that the selection sets are attached to
the individual productions, not to the nonterminals. For a grammar to be LL(1), all productions
that share a left-hand side must have disjoint selection sets; otherwise the parser
wouldn't know what to do in a given situation. [This is the real definition of LL(1).] The
LL(1) selection set.
Table 4.16. Finding LL(1) Selection Sets
A production is nullable if the entire right-hand side can go to ε. This is the case, both when the
right-hand side consists only of ε, and when all symbols on the right-hand side can go to ε by
some derivation.
For nonnullable productions: Given a production of the form
    s → α ℬ ...
where s is a nonterminal, α is a collection of one or more nullable nonterminals, and ℬ is either a
terminal or a nonnullable nonterminal (one that can't go to ε) followed by any number of additional
symbols: the LL(1) select set for that production is the union of FIRST(α) and FIRST(ℬ).
That is, it's the union of the FIRST sets for every nonterminal in α plus FIRST(ℬ). If α doesn't
exist (there are no nullable nonterminals to the left of ℬ), then SELECT(s) = FIRST(ℬ).
For nullable productions: Given a production of the form
    s → α
where s is a nonterminal and α is a collection of zero or more nullable nonterminals (it can be ε):
the LL(1) select set for that production is the union of FIRST(α) and FOLLOW(s). In plain
words: if a production is nullable, it can be transparent; that is, it can disappear entirely in some
derivation (be replaced by an empty string). Consequently, if the production is transparent, you have to
look through it to the symbols that can follow it to determine whether it can be applied in a given
situation.
LL(1) selection sets are formed using the rules in Table 4.16.
Note that ε is used to compute the selection sets because s is nullable if FIRST(s)
contains ε. Nonetheless, ε is not itself a member of any selection set. The selection sets
can be translated into a parse table with the algorithm in Table 4.17.
Translating SELECT sets into an LL(1) parse table.
Table 4.17. Translating SELECT Sets into LL(1) Parse Tables.
Initialize the table to all error transitions;
for( each production, N, in the grammar )
{
    lhs = the left-hand side of production N;
    for( every token in SELECT(N) )
        parse_table[ lhs ][ token ] = N;
}
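Table 4.17 translates almost word for word into C. The following sketch applies it to the five terminal/dec_term productions used as the example in this section. The bit-mask representation of the SELECT sets and all the names (production, make_parse_table(), PARSE_ERROR, TOK()) are assumptions made for the illustration; they are not taken from LLama. The conflict count is an addition of mine: a nonzero count means that two productions with the same left-hand side claimed the same token, so the grammar isn't LL(1).

```c
enum { PERKIN_ELMER, ADM3, VT_52, VT_100, NTOKENS };    /* input symbols */
enum { TERMINAL, DEC_TERM, NNONTERMS };                 /* nonterminals  */

#define PARSE_ERROR (-1)            /* marks an error transition         */
#define TOK(t)      (1u << (t))     /* one bit per token in a SELECT set */

typedef struct { int lhs; unsigned select; } production;

/* Productions 1 to 5 of the example; entry n holds Production n+1. */
static const production prods[] = {
    { TERMINAL, TOK(PERKIN_ELMER)        },
    { TERMINAL, TOK(ADM3)                },
    { TERMINAL, TOK(VT_52) | TOK(VT_100) },
    { DEC_TERM, TOK(VT_52)               },
    { DEC_TERM, TOK(VT_100)              },
};
#define NPRODS ((int)(sizeof prods / sizeof prods[0]))

/* Fill in the table and return the number of selection-set conflicts. */
int make_parse_table(int table[NNONTERMS][NTOKENS])
{
    int n, tok, row, conflicts = 0;

    for (row = 0; row < NNONTERMS; ++row)       /* all error transitions */
        for (tok = 0; tok < NTOKENS; ++tok)
            table[row][tok] = PARSE_ERROR;

    for (n = 0; n < NPRODS; ++n)
        for (tok = 0; tok < NTOKENS; ++tok)
            if (prods[n].select & TOK(tok)) {
                if (table[prods[n].lhs][tok] != PARSE_ERROR)
                    ++conflicts;                /* slot already taken    */
                table[prods[n].lhs][tok] = n + 1;
            }
    return conflicts;
}
```

Run on the example, the function fills the terminal row with 1, 2, 3, 3 and the dec_term row with error, error, 4, 5, the same table shown earlier in this section.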
4.8 Modifying Grammars*
All top-down parsers, including recursive-descent parsers, must use LL(1) grammars,
which are quite limited. The main problem is that LL(1) grammars can't be left
recursive. Since you can't just dispense with left associativity, LL(1) grammars would
not be very useful unless some mechanism existed to translate left-recursive grammars
into left-associative, LL(1) grammars. This section looks at various ways that you can
manipulate grammars algebraically to make them LL(1); the techniques are useful with
other classes of grammars as well.
The following discussion uses the generic-grammar notation discussed at the end of
Chapter Three without always calling out what everything means. Just remember that
Greek letters represent collections of terminals and nonterminals, script letters like 𝒜 and ℬ
represent a single terminal or nonterminal, italics are used for single nonterminals, and
boldface is used for single terminals.
4.8.1 Unreachable Productions
An unreachable nonterminal is one that can't possibly appear in a parse tree rooted
at the goal symbol. That is, there's no derivation from the goal symbol in which the
unreachable nonterminal can appear in the viable prefix. For example, in the following
grammar:
1. s → a
2. a → TERM
3. b → TERM_TWO
Production 3 is clearly unreachable because b appears on no right-hand side. The situation
is not always so cut and dried. Productions 3 and 4 are both unreachable in the following
grammar, even though b and c both appear on right-hand sides:
1. s → a
2. a → TERM
3. b → TERM_TWO c
4. c → TERM_TOO b
Many of the transformation techniques discussed below create unreachable nonterminals
and, since productions with these nonterminals on their left-hand sides have no useful
function in the grammar, they should be removed. You can use the algorithm in Table
4.18.
4.8.2 Left Factoring*
A production like the following has two problems: It is clearly not LL(1) because the
right-hand sides of both productions start with an IF token so they can't possibly have disjoint
selection sets. Also, it is ambiguous because the same nonterminal appears twice
on the right-hand side:
statement → IF test THEN statement ELSE statement
          | IF test THEN statement
Note that this is also a pretty serious ambiguity because it controls the ways that if and
else statements bind. Input like the following:
if( expr ) then
    if( expr2 ) then
        statement();
else
    statement();
can create either of the trees pictured in Figure 4.7. The top parse causes the else to
bind to the closest preceding if, the behavior required by most programming
languages. In the second parse, though, the else incorrectly binds to the first if.
Both problems can be solved by a process known as left factoring, which isolates the
common parts of two productions into a single production. Any production of the form:
Unreachable nonterminal.
Ambiguity in i f / e l s e ,
binding problems.
Left factoring.
Table 4.18. Eliminating Unreachable Productions
Data structures: A stack.
                 A list of reachable nonterminals.
Initially:       Both the stack and the list are empty.
Add the goal symbol to the set of reachable nonterminals
and push the goal symbol onto the stack.
while( the stack is not empty )
{
    s = pop one item off the stack
    for( each nonterminal, x, on a right-hand side of s )
        if( x is not in the list of reachable nonterminals )
        {
            push x;
            add x to the list of reachable nonterminals
        }
}
Remove from the grammar all productions whose left-hand
sides are not in the list of reachable nonterminals.
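Here is a minimal sketch of the algorithm in Table 4.18 in C, run against the four-production grammar in which b and c refer only to each other. The grammar encoding and all the names (production, find_reachable(), TERM_SYM) are assumptions made for the illustration. Terminals don't matter to the algorithm, so they are all represented by a single negative value.

```c
enum { S, A, B, C, NNONTERM };      /* nonterminals, numbered from 0      */
#define TERM_SYM (-1)               /* stands for any terminal symbol     */

typedef struct { int lhs; int len; int rhs[2]; } production;

static const production grammar[] = {
    { S, 1, { A           } },      /* 1. s -> a                          */
    { A, 1, { TERM_SYM    } },      /* 2. a -> TERM                       */
    { B, 2, { TERM_SYM, C } },      /* 3. b -> TERM_TWO c                 */
    { C, 2, { TERM_SYM, B } },      /* 4. c -> TERM_TOO b                 */
};
#define NPROD ((int)(sizeof grammar / sizeof grammar[0]))

/* Set reachable[x] to 1 for every nonterminal that can appear in a parse
 * tree rooted at goal, and to 0 for all others.
 */
void find_reachable(int goal, int reachable[NNONTERM])
{
    int stack[NNONTERM];    /* big enough: each symbol is pushed only once */
    int sp = 0, p, i;

    for (i = 0; i < NNONTERM; ++i)
        reachable[i] = 0;

    reachable[goal] = 1;
    stack[sp++] = goal;

    while (sp > 0) {
        int s = stack[--sp];                    /* pop one item            */
        for (p = 0; p < NPROD; ++p) {
            if (grammar[p].lhs != s)
                continue;
            for (i = 0; i < grammar[p].len; ++i) {
                int x = grammar[p].rhs[i];
                if (x >= 0 && !reachable[x]) {  /* an unseen nonterminal   */
                    reachable[x] = 1;
                    stack[sp++] = x;
                }
            }
        }
    }
}
```

Calling find_reachable(S, r) marks only s and a, so Productions 3 and 4 can be removed. The last step of the table, actually deleting the productions, is left out because it depends on how the grammar is stored.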
Figure 4.7. Parse Trees of an Ambiguous IF/ELSE Grammar
[Figure: the two parse trees that the production statement → IF test THEN statement ELSE statement | IF test THEN statement yields for the nested if/else input.]
a → α β₁ | α β₂ | ... | α βₙ
where α is a sequence of one or more symbols that appear at the start of every right-hand
side, and β₁ to βₙ are dissimilar collections of zero or more symbols, can be replaced by
the following:
a  → α a'
a' → β₁ | β₂ | ... | βₙ
The a' is just an arbitrary name for the new left-hand side. In the current example:
a  corresponds to statement
α  corresponds to IF test THEN statement
β₁ corresponds to ELSE statement
β₂ corresponds to ε
β₂ goes to ε because there's nothing in the second production that corresponds to the
ELSE statement in the first production. You can plug the foregoing into our equation
and replace the original productions with the following:
statement       → IF test THEN statement opt_else_clause
opt_else_clause → ELSE statement
                | ε
Note that this replacement has also eliminated the ambiguity, because only one parse
tree can now be generated from the earlier input.
4.8.3 Corner Substitution
In general, given a left-hand side with one or more right-hand sides:
p → α₁ | α₂ | ... | αₙ
and given a production of the form
s → β p γ
this last production can be replaced with several productions of the form:
s → β α₁ γ | β α₂ γ | ... | β αₙ γ
This process is called a substitution.
In a production of the form a → A α, the leftmost symbol on the right-hand side (A) is
said to be a corner of a. An ε production doesn't have a corner. In a corner substitution
you replace one or more of the nonterminal corners in a grammar with that corner's
right-hand sides.
Substitution.
Corner.
Corner substitution preserves LL(1) properties.
Q Grammars.
For example, consider the following grammar, which recognizes a list of one or more
ATOMs. The ATOMs can be single ATOMs, they can be followed by a bracketed
number (LB and RB stand for left and right bracket), or they can be preceded by any
number of STARs.
1. atom_list → ATOM
2.           | list_ele atom_list
3. list_ele  → ATOM LB NUMBER RB
4.           | STAR list_ele
This grammar is not LL(1) because ATOM is in FIRST(list_ele) by Production 3, and
as a consequence, is also in SELECT(2). A substitution of the list_ele corner of Production
2 can help fix this situation. First, you replace all instances of list_ele in Production
2 with the right-hand sides of the two list_ele productions (3 and 4), yielding:
1.  atom_list → ATOM
2a.           | ATOM LB NUMBER RB atom_list
2b.           | STAR list_ele atom_list
3.  list_ele  → ATOM LB NUMBER RB
4.            | STAR list_ele
The grammar is still not LL(1) because Productions 1 and 2a both start with an ATOM,
but that situation can be rectified by left factoring:
1a. atom_list  → ATOM atom_list'
2b.            | STAR list_ele atom_list
1b. atom_list' → LB NUMBER RB atom_list
1c.            | ε
3.  list_ele   → ATOM LB NUMBER RB
4.             | STAR list_ele
Corner substitution preserves the LL(1) properties of a grammar: grammars that start
out LL(1) are still LL(1) after the corner substitution is made. You can see why if you
consider how FIRST sets are computed. If x is a corner of production s, then FIRST(s)
includes FIRST(x), and FIRST(x) is computed by looking at the right-hand side of x.
Replacing x with its right-hand sides does not affect this computation.
One common use of corner substitution and left factoring takes advantage of this
property by rearranging the grammar so that every right-hand side in the grammar either
starts with a unique terminal symbol or is ε. That is, if a nonterminal has several right-hand
sides, then all those right-hand sides will start with different terminal symbols or be
ε. This particular type of grammar is called a Q grammar and is handy when you build
recursive-descent parsers because it's very easy to code from a Q grammar. That is,
given a Q grammar like the following:
p → T₁ α₁
  | T₂ α₂
  ...
  | Tₙ αₙ
  | ε
where T₁ ... Tₙ are unique terminal symbols, you can code it like this:
p()
{
    switch( lookahead_character )
    {
    case T1: advance(); /* code that processes α₁ */ break;
    case T2: advance(); /* code that processes α₂ */ break;
    ...
    case Tn: advance(); /* code that processes αₙ */ break;

    default: /* Handle the epsilon production */
    }
}
4.8.4 Singleton Substitution*
Generally, it's not productive to replace symbols that aren't corners: you shouldn't
replace nonterminals that aren't at the far left of a right-hand side. The problem here is
that the substitution usually creates several productions that have the same corner, and as
a consequence the resulting grammar won't be LL(1). For example, replacing
num_or_id with its right-hand sides in the following grammar:
1. expr      → UN_OP num_or_id
2. num_or_id → NAME
3.           | IDENTIFIER
yields the following, non-LL(1) grammar:
1.  expr → UN_OP NAME
1a.      | UN_OP IDENTIFIER
You've just done a reverse left factoring.
If a production has only one right-hand side, of course, the substitution is harmless.
In fact, it can reduce the size of the grammar by eliminating productions and is often
useful for this reason. This sort of production is called a singleton.
It's sometimes desirable to create singletons by breaking out a group of symbols
from the middle of a production. That is, the following:
s → α β γ
can be changed into:
s  → α s' γ
s' → β
This technique is used primarily in bottom-up parsers, in which action symbols must be
at the far right of the production. As a consequence, a production like this:
s → tempest {act(1);} scene_5
must be implemented like this:
s  → tempest s' scene_5
s' → {act(1);}
4.8.5 Eliminating Ambiguity*
Substitutions can create ambiguity.
Singleton substitution used to isolate actions for bottom-up parser.
Ambiguous productions are those that have more than one occurrence of a given
nonterminal on their right-hand side. As we've seen, left factoring can be used to
eliminate ambiguity, moving the rightmost of these nonterminals further down the parse
tree by introducing a new production. In general, given an ambiguous production of the
form:
s → α p β p γ
you can eliminate the ambiguity by introducing a new production, as follows:
s  → α p β s'
s' → p γ
Controlling associativity when eliminating ambiguity.
If the grammar has a production of the form:
p → s
this transformation makes the grammar left-associative, because there is now an indirect
left recursion, demonstrated with the following derivation:
s' ⇒ p γ ⇒ s γ
If an ambiguous right-hand side is one of several, then all of these right-hand sides
must move as part of the substitution. For example, given:
e → e + e
  | NUM
you transform the grammar to be left-associative like this:
e → e + t | t
t → e
  | NUM
or right-associative like this:
e → t + e | t
t → e
  | NUM
Note that the foregoing transformation introduced a production of the form e → t into
the grammar. You can use this production to modify the grammar again:
e → t + e | t
t → t
  | NUM
A production of the form t → t does nothing useful, so it can be eliminated:
e → t + e | t
t → NUM
To generalize, given grammars of the form:
s → α s β s γ | α₁ | ... | αₙ
you can eliminate the ambiguity in favor of right associativity by replacing the foregoing
with the following:
s  → α s' β s γ | s'
s' → s | α₁ | ... | αₙ
Now, because a production of the form s → s' has been created, you can substitute s'
for s in the new production, yielding:
s  → α s' β s γ | s'
s' → s' | α₁ | ... | αₙ
and since that production of the form s' → s' doesn't do anything useful, it can be eliminated,
yielding:
s  → α s' β s γ | s'
s' → α₁ | ... | αₙ
The same applies in a left-associative situation. Productions of the form:
s → α s β s γ
can be disambiguated in favor of left-associativity by replacing the right occurrence of s
rather than the left one in the initial step.
If a grammar has several ambiguous productions, it can be modified by repetitive
application of the previous transformation. Similarly, if the s in the earlier general
example appears on more than one right-hand side, then it can be replaced with s' in all
the right-hand sides. For example, starting with the following expression grammar:
1. e → e + e
2.   | e * e
3.   | NUM
I'll eliminate the ambiguity one production at a time. Note that the order in which the
disambiguating rules are applied also affects operator precedence: the operators that are
lower down in the grammar are of higher precedence. In an expression grammar like the
current one, you want to start with the right-hand side that has the lowest-precedence
operator, Production 1. Since addition must be left associative, the transformed production
must also be left associative, so you must replace the right e in Production 1, yielding:
1a. e  → e + e'
1b.    | e'
1c. e' → e
2.     | e * e
3.     | NUM
Now, since a production of the form e → e' exists, you can substitute e' for e in
Productions 1c and 2, yielding:
1a. e  → e + e'
1b.    | e'
1c. e' → e'
2.     | e' * e'
3.     | NUM
Production 1c now does nothing useful because the left- and right-hand sides are identical,
so it can be removed from the grammar:
1a. e  → e + e'
1b.    | e'
2.  e' → e' * e'
3.     | NUM
Controlling precedence when eliminating ambiguity.
You can now apply the same process to Production 2 to get the following grammar:
1a. e   → e + e'
1b.     | e'
2a. e'  → e' * e''
2b.     | e''
2c. e'' → e'
3.      | NUM
and substituting e'' for e' in Production 2c, you get:
1a. e   → e + e'
1b.     | e'
2a. e'  → e' * e''
2b.     | e''
2c. e'' → e''
3.      | NUM
Finally, removing the redundant production yields:
1a. e   → e + e'
1b.     | e'
2a. e'  → e' * e''
2b.     | e''
3.  e'' → NUM
This grammar is left associative, and + is lower precedence than *.
LL(1) grammars cannot be left recursive.
4.8.6 Eliminating Left Recursion*
LL(1) grammars cannot be left recursive. The basic proof of this statement is as follows:
(1) In all practical grammars, if any right-hand side of a production is left recursive,
there must be at least one nonrecursive right-hand side for the same nonterminal.
For example, if a grammar has a singleton production of the form s → s α, all derivations
that use that production go on forever:
s ⇒ s α
  ⇒ s α α
  ⇒ s α α α
  ...
You need to add a nonrecursive right-hand side of s for the derivation to terminate.
(2) If x is a left-recursive nonterminal, the selection sets for all productions with x on
their left-hand side must contain FIRST(x), so, in all practical grammars, left-recursive
productions must have selection-set conflicts.
Unfortunately, you need left recursion to get left associativity in a list. Fortunately,
you can always eliminate left recursion from a grammar in such a way that the translated
grammar recognizes the same input strings. The basic strategy looks at what left-recursive
list grammars actually do. A list such as an expression (a list of operands
separated by operators) is either a single operand or a single operand followed by one or
more operator/operand pairs. You can express this structure with the following left-recursive
grammar:
list → operand
     | list operator operand
The following nonrecursive grammar recognizes the same input strings, however:
list  → operand list'
list' → operator operand list'
      | ε
So, our goal is to translate lists of the first form into lists of the second form. If a production
is a self left recursive like the following:
s → s α | β
(the left-hand side is duplicated as the first symbol on the right-hand side: s is a single
nonterminal; α and β are collections of terminals and nonterminals), you can make it
nonrecursive by shuffling the production around as follows:
s  → β s'
s' → α s'
   | ε
Applying this translation rule to the following productions:
expr → expr + term {act2} | term {act1}
the following relationships exist between the s, α, and β in the rule and the real productions:
expr → expr + term {act2} | term {act1}
s    → s    α             | β
s, which is a single nonterminal, corresponds to an expr; α, which is a collection of
terminals and nonterminals, corresponds to the + term, and β, which is also a collection
of terminals and nonterminals, corresponds to the second term (in this case, it's a collection
of only one symbol). Productions like the foregoing can be shuffled around to look
like this:
s  → β s'
s' → α s'
   | ε
The s' is a new nonterminal. You could call it anything, but just adding a ' is easiest.
Applying the foregoing to the real productions yields:
expr  → term {act1} expr'
expr' → + term {act2} expr'
      | ε
Figure 4.8 shows a parse for the input 1+2+3 for both the untranslated and translated
grammars. The important thing to notice is that, even though the parse tree for the
translated grammar is now right associative (as you would expect from looking at the
grammar), the order in which the terminal nodes are processed is identical in both trees.
That is, the actions are performed in the same sequence relative to the input in both
grammars. So there is still left associativity from the point of view of code generation,
even though the parse itself is right associative.
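You can convince yourself of this with a few lines of recursive-descent code. The sketch below parses strings such as 1+2+3 with the translated grammar (expr → term {act1} expr' and expr' → + term {act2} expr' | ε) and has the actions build a fully parenthesized copy of the input so that the grouping is visible. Everything in it is an assumption made for the illustration: the names (parse_expr(), expr_prime(), acc), the restriction of a term to a single digit, and the fixed-size buffer, which is fine for a demonstration but too small for arbitrary input.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *in;      /* current input position (the lookahead)     */
static char acc[128];       /* the expression built by the actions so far */
static int  ok;             /* cleared when a syntax error is seen         */

/* term -> a single digit (a stand-in for a real operand). */
static int term(char *out)
{
    if (!isdigit((unsigned char)*in))
        return 0;
    out[0] = *in++;
    out[1] = '\0';
    return 1;
}

/* expr' -> + term {act2} expr'  |  epsilon */
static void expr_prime(void)
{
    char t[2], tmp[sizeof acc];

    if (*in != '+')                     /* epsilon production              */
        return;
    ++in;
    if (!term(t)) { ok = 0; return; }
    /* {act2}: combine what has been seen so far with the new term. */
    snprintf(tmp, sizeof tmp, "(%s+%s)", acc, t);
    strcpy(acc, tmp);
    expr_prime();
}

/* expr -> term {act1} expr'.  Returns the parenthesized expression, or
 * NULL if the input isn't a legal expression.
 */
const char *parse_expr(const char *text)
{
    in = text;
    ok = 1;
    if (!term(acc))                     /* {act1}: acc holds the first term */
        return NULL;
    expr_prime();
    return (ok && *in == '\0') ? acc : NULL;
}
```

Given the input 1+2+3, the actions group the two leftmost operands first and produce ((1+2)+3), even though the recursion in expr_prime() nests to the right.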
The foregoing substitution process can be generalized for more complex grammars
as follows:
s → s α₁ | s α₂ | ... | s αₙ | β₁ | β₂ | ... | βₘ
you can replace the foregoing with:
s  → β₁ s' | β₂ s' | ... | βₘ s'
s' → α₁ s' | α₂ s' | ... | αₙ s' | ε
Order in which actions are executed with transformed grammar.
Corner substitution translates indirect recursion to self recursion.
Figure 4.8. Parse Trees for Translated and Untranslated, Left-Associative Grammars
[Figure: parse trees for the input 1+2+3 under the untranslated and translated grammars.]
Though the method just described works only with self-recursive productions, it's an
easy matter to use a corner substitution to make an indirectly recursive grammar self
recursive. For example, in a grammar like this:
1. expr     → ele_list
2. ele_list → NUMBER
3.          | expr PLUS NUMBER
you can use a corner substitution to make the grammar self-recursive, replacing the
ele_list corner in Production 1 with the right-hand sides of Productions 2 and 3 to yield
the following:
1a. expr     → NUMBER
1b.          | expr PLUS NUMBER
2.  ele_list → NUMBER
3.           | expr PLUS NUMBER
Productions 2 and 3 are now unreachable, so can be eliminated from the grammar.
4.9 Implementing LL(1) Parsers
The remainder of this chapter discusses implementation details for the foregoing
theory. Skip to the next chapter if you're not interested.
4.9.1 Top-Down, Table-Driven Parsing: The LLama Output File
First, an in-depth look at the LLama output file, in which the top-down parse algorithm
is actually implemented, seems in order. A user's manual for LLama is presented
in Appendix E, and you should read that appendix before continuing. This section examines
the LLama output file created from the input file at the end of Appendix E. Listing
4.2 shows token definitions that are generated from the %term directives in the LLama
input file. These are the same symbolic values that are used in the previous section.
Listing 4.2. llout.h: Symbolic Values for Tokens
1  #define EOI        0
2  #define PLUS       1
3  #define TIMES      2
4  #define NUM_OR_ID  3
5  #define LP         4
6  #define RP         5
7  #define SEMI       6
The LLama C-source-code output file begins in Listing 4.3. Most of the file is
copied from the template file llama.par, in a manner identical to the LeX template file
described in Chapter Three.⁴ Lines three to 23 were copied from the input file, and the
remainder of the listing was generated by LLama. Lines 28 to 33 define the boundaries
of the numbers that represent terminals, nonterminals, and actions on the stack (as were
described in the previous section). The minimums won't change, but the maximums
will.
The next part of llout.c (in Listing 4.4) is copied from the template file. <stdarg.h>
is included on Line 42 only if it hasn't been included earlier. The <tools/yystack.h> file
included on the next line contains various stack-manipulation macros, and is described
in Appendix A. The macros on lines 45 to 47 determine what x is, using the earlier limits.
The remainder of the listing is macros that define various parser limits described in
Appendix E. These definitions are active only if you haven't defined the macros yourself
in a %{ %} block at the top of the input file. YY_TTYPE (on line 34) is used to declare
the tables.
Various stacks are declared on lines 89 to 108 of Listing 4.4. The <yystack.h> macros
described in Appendix A are used for this purpose, and various <yystack.h> macros
are customized on lines 90 to 94 for the current use. yyerror() works like printf()
but doesn't mess up the windows when it's used. The parse stack is defined on line 98,
and the value stack is on lines 100 to 108. Note that the <yystack.h> macros can't be
used here because the value stack is a stack of structures.
The next part of llout.c (in Listing 4.5) is generated by LLama. It contains the
definitions for the parse tables discussed in the previous section. The tables in Listing
Definitions generated from %term.
Numerical limits of tokenized input symbols.
Parse- and value-stack declarations.
Parse-table declarations.
4. That's why some of the following listings are labeled llama.par and others are labeled llout.c. The tables
(which are generated by LLama) are in listings labeled llout.c. I'm sorry if this is confusing, but it's the
only way I can keep track of what's where.
Listing 4.3. llout.c: File Header, Numeric Limits
 1  #include <stdio.h>
 2  #define YYDEBUG
 3  /*------------------------------------------------------------------------
 4   * Temporary-variable names are stored on a stack. name() pops a name off
 5   * the stack and freename(x) puts it back. A real compiler would do some
 6   * checking for stack overflow here, but there's no point in cluttering the
 7   * code for now.
 8   */
 9
10  char *Namepool[] =
11  {
12      "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"
13  };
14
15  char **Namep = Namepool;
16
17  char *newname()           { return( *Namep++ );     }
18  char *freename( char *x ) { return( *--Namep = x ); }
19
20  extern char *yytext;
21  extern int  yyleng;
22
23  #define YYSTYPE char*
24
25  /*----------------------------------------------------------------------*/
26  #include "llout.h"
27
28  #define YY_MINTERM      1    /* Smallest terminal.                  */
29  #define YY_MAXTERM      6    /* Largest terminal.                   */
30  #define YY_MINNONTERM   256  /* Smallest nonterminal.               */
31  #define YY_MAXNONTERM   261  /* Largest nonterminal.                */
32  #define YY_START_STATE  256  /* Goal symbol (push to start parse).  */
33  #define YY_MINACT       512  /* Smallest action.                    */
Listing 4.4. llama.par: File Header, Macro Definitions
 34  typedef unsigned char YY_TTYPE;   /* Type used for tables.          */
 35  #define YYF ((YY_TTYPE)(-1))      /* Failure transition in table.   */
 36
 37  /*----------------------------------------------------------------------
 38   * Parser for llama-generated tables
 39   */
 40
 41  #ifndef va_start
 42  #include <stdarg.h>
 43  #endif
 44
 45  #define YY_ISTERM(x)    (YY_MINTERM    <= (x) && (x) <= YY_MAXTERM   )
 46  #define YY_ISNONTERM(x) (YY_MINNONTERM <= (x) && (x) <= YY_MAXNONTERM)
 47  #define YY_ISACT(x)     (YY_MINACT     <= (x)                        )
 48
 49  #ifndef YYACCEPT
 50  #    define YYACCEPT return(0)   /* Action taken when input is accepted. */
 51  #endif
 52
 53  #ifndef YYABORT
 54  #    define YYABORT return(1)    /* Action taken when input is rejected. */
 55  #endif
 56
 57  #ifndef YYPRIVATE
 58  #    define YYPRIVATE static     /* Define to a null string to make public. */
 59  #endif
 60
 61  #ifndef YYMAXERR
 62  #    define YYMAXERR 25          /* Abort after this many errors. */
 63  #endif
 64
 65  #ifndef YYMAXDEPTH               /* Parse- and value-stack depth. */
 66  #    define YYMAXDEPTH 128
 67  #endif
 68
 69  #ifndef YYSTYPE                  /* Used to declare fields in value stack. */
 70  #    define YYSTYPE int
 71  #endif
 72
 73  extern int   yylineno;           /* Supplied by LeX. */
 74  extern int   yyleng;
 75  extern char  *yytext;
 76
 77  extern int   ii_plineno();       /* in l.lib::input.c. */
 78  extern char  *ii_ptext();
 79  extern int   ii_lineno();
 80  extern char  *ii_text();
 81  extern void  ii_mark_prev();
 82
 83  void yyerror( char *fmt, ... );
 84
 85  /*----------------------------------------------------------------------
 86   * Parse and value stacks:
 87   */
 88
 89  #include <tools/yystack.h>
 90  #undef  yystk_cls
 91  #define yystk_cls YYPRIVATE
 92  #undef  yystk_err
 93  #define yystk_err(o) ((o) ? (yyerror("Stack overflow\n" ), exit(1)) \
 94                            : (yyerror("Stack underflow\n"), exit(1)) )
 95
 96  #define yytos(stk) yystk_item( stk, 0 )   /* Evaluates to top-of-stack item. */
 97
 98  yystk_dcl( Yy_stack, int, YYMAXDEPTH );
 99
100  typedef struct                   /* Typedef for value-stack elements. */
101  {
102      YYSTYPE left;                /* Holds value of left-hand side attribute.   */
103      YYSTYPE right;               /* Holds value of current-symbol's attribute. */
104
105  } yyvstype;
106
107  yyvstype Yy_vstack[ YYMAXDEPTH ];            /* Value stack.         */
108  yyvstype *Yy_vsp = Yy_vstack + YYMAXDEPTH;   /* Value-stack pointer. */
109
110  @
111  @ Tables go here. LLama removes all lines that begin with @ when it copies
112  @ llama.par to the output file.
113  ^L
4.5 are identical in content to the ones pictured in Figure 4.6 on page 210. Note that the
Yyd table on lines 179 to 184 is not compressed because this output file was generated
with the -f switch active. Were -f not specified, the tables would be pair compressed, as
is described in Chapter Two. The yy_act() subroutine on lines 199 to 234 contains the
switch that holds the action code. Note that $ references have been translated to explicit
value-stack references (Yy_vsp is the value-stack pointer). The Yy_synch array on
lines 243 to 248 is a -1-terminated array of the synchronization tokens specified in the
%synch directive.
Listing 4.5. llout.c Parse Tables
114 /*--------------------------------------------------------------------
115  * The YypNN arrays hold the right-hand sides of the various productions, listed
116  * back to front (so that they will be pushed in reverse order), NN is the
117  * production number (to be found in the symbol-table listing output with a -s
118  * command-line switch).
119  *
120  * Yy_pushtab[] is indexed by production number and points to the appropriate
121  * right-hand-side (YypNN) array.
122  */
123
124 YYPRIVATE int Yyp07[]={ 0 };
125 YYPRIVATE int Yyp06[]={ 261, 517, 260, 516, 2, 0 };
126 YYPRIVATE int Yyp09[]={ 5, 257, 4, 0 };
127 YYPRIVATE int Yyp08[]={ 518, 3, 0 };
128 YYPRIVATE int Yyp04[]={ 0 };
129 YYPRIVATE int Yyp03[]={ 259, 515, 258, 514, 1, 0 };
130 YYPRIVATE int Yyp05[]={ 261, 260, 0 };
131 YYPRIVATE int Yyp02[]={ 259, 258, 0 };
132 YYPRIVATE int Yyp01[]={ 256, 6, 513, 257, 512, 0 };
133 YYPRIVATE int Yyp00[]={ 0 };
134
135 YYPRIVATE int *Yy_pushtab[] =
136 {
137     Yyp00,
138     Yyp01,
139     Yyp02,
140     Yyp03,
141     Yyp04,
142     Yyp05,
143     Yyp06,
144     Yyp07,
145     Yyp08,
146     Yyp09
147 };
148
149 /*--------------------------------------------------------------------
150  * Yyd[][] is the DFA transition table for the parser. It is indexed
151  * as follows:
152  *
153  *              Input symbol
154  *          +----------------------+
155  *       L  |  Production number   |
156  *       H  |       or YYF         |
157  *       S  |                      |
158  *          +----------------------+
159  *
160  * The production number is used as an index into Yy_pushtab, which
161  * looks like this:
162  *
163  *      Yy_pushtab        YypDD
164  *      +-------+         +----------------+
165  *      |   *---|-------->|                |
166  *      +-------+         +----------------+
167  *      |   *---|---->
168  *      +-------+
169  *      |   *---|---->
170  *      +-------+
171  * YypDD is the tokenized right-hand side of the production.
172  * Generate a symbol table listing with llama's -l command-line
173  * switch to get both production numbers and the meanings of the
174  * YypDD string contents.
175  */
176
177 YYPRIVATE YY_TTYPE Yyd[ 6 ][ 7 ] =
178 {
179 /* 00 */ {  0, -1, -1,  1,  1, -1, -1 },
180 /* 01 */ { -1, -1, -1,  2,  2, -1, -1 },
181 /* 02 */ { -1, -1, -1,  5,  5, -1, -1 },
182 /* 03 */ { -1,  3, -1, -1, -1,  4,  4 },
183 /* 04 */ { -1, -1, -1,  8,  9, -1, -1 },
184 /* 05 */ { -1,  7,  6, -1, -1,  7,  7 }
185 };
186
187 /*--------------------------------------------------------------------
188  * yy_next(state,c) is given the current state and input
189  * character and evaluates to the next state.
190  */
191
192 #define yy_next(state, c)  Yyd[ state ][ c ]
193
194 /*--------------------------------------------------------------------
195  * Yy_act() is the action subroutine. It is passed the tokenized value
196  * of an action and executes the corresponding code.
197  */
198
199 YYPRIVATE int yy_act( actnum )
200 {
201     /* The actions. Returns 0 normally but a nonzero error code can be returned
202      * if one of the acts causes the parser to terminate abnormally.
203      */
204
205     switch( actnum )
206     {
207     case 512:
208         {(Yy_vsp[1].right)=(Yy_vsp[2].right)=newname();}
209         break;
210     case 513:
211         {freename((Yy_vsp[0].right));}
212         break;
213     case 514:
214         {(Yy_vsp[1].right)=(Yy_vsp[2].right)=newname();}
215         break;
216     case 515:
217         {yycode("%s+=%s\n",Yy_vsp->left,(Yy_vsp[0].right));
218         freename((Yy_vsp[0].right));}
219         break;
220     case 516:
221         {(Yy_vsp[1].right)=(Yy_vsp[2].right)=newname();}
222         break;
223     case 517:
224         {yycode("%s*=%s\n",Yy_vsp->left,(Yy_vsp[0].right));
225         freename((Yy_vsp[0].right));}
226         break;
227     case 518:
228         {yycode("%s=%0.*s\n",Yy_vsp->left,yyleng,yytext);}
229         break;
230     default: printf("INTERNAL ERROR: Illegal act number (%d)\n", actnum);
231         break;
232     }
233     return 0;
234 }
235
236 /*--------------------------------------------------------------------
237  * Yy_synch[] is an array of synchronization tokens. When an error is
238  * detected, stack items are popped until one of the tokens in this
239  * array is encountered. The input is then read until the same item is
240  * found. Then parsing continues.
241  */
242
243 YYPRIVATE int Yy_synch[] =
244 {
245     RP,
246     SEMI,
247     -1
248 };
249
250 /*--------------------------------------------------------------------
251  * Yy_stok[] is used for debugging and error messages. It is indexed
252  * by the internal value used for a token (as used for a column index in
253  * the transition matrix) and evaluates to a string naming that token.
254  */
255
256 char *Yy_stok[] =
257 {
258     /*   0 */   "_EOI_",
259     /*   1 */   "PLUS",
260     /*   2 */   "TIMES",
261     /*   3 */   "NUM_OR_ID",
262     /*   4 */   "LP",
263     /*   5 */   "RP",
264     /*   6 */   "SEMI"
265 };
Symbol-to-string conversion arrays: Yy_stok, Yy_snonterm, Yy_sact.
Output streams.
Debugging functions and macros, YYDEBUG.
Symbol stack, Yy_dstack.
The arrays following line 256 provide human-readable diagnostics; they translate the
various numeric values of the symbols into strings representing those symbols.
Yy_stok, on lines 256 to 265, is indexed by token value and evaluates to a string
representing the token's name. Similarly, Yy_snonterm[] on line 276 translates
nonterminals, and Yy_sact[] on line 431 puts the actions in some sort of readable form.
The {0}, and so forth, are mapped to the actual code in the llout.sym file, generated by
giving LLama a -s, -S, or -D command-line switch.
Listing 4.6 starts moving into the parser proper. The streams declared on lines 300 to
302 are used for output to the code, bss, and data segments respectively. They're all
initialized to stdout, but you can change them with an fopen() call.
The remainder of Listing 4.6 provides alternate versions of functions and macros for
debugging (YYDEBUG defined) and production modes. The macros and subroutines on lines
306 to 431 are used if debug mode is active; otherwise, the same macros and subroutines
are redefined on lines 431 to 497 so that they don't print diagnostics, and so forth.
If YYDEBUG is defined, a second parse stack (Yy_dstack) is defined on line 308.
This second, symbol stack exactly parallels the normal parse stack, but it holds strings
representing the symbols, which are in turn represented by the numbers on the normal
parse stack. For example, an expr is represented by the number 257 on the parse stack,
and every time a 257 is pushed on the normal parse stack, the string "expr" is pushed
onto the debugging stack. A string is popped (and discarded) from the symbol stack
when a number is popped from the normal parse stack. This symbol stack is used in the
debugging environment to make stack activity a little easier to follow.
#ifdef YYDEBUG
/*--------------------------------------------------------------------
 * Yy_snonterm[] is used only for debugging. It is indexed by the
 * tokenized left-hand side (as used for a row index in Yyd[]) and
 * evaluates to a string naming that left-hand side.
 */

char *Yy_snonterm[] =
{
    /* 256 */   "stmt",
    /* 257 */   "expr",
    /* 258 */   "term",
    /* 259 */   "expr'",
    /* 260 */   "factor",
    /* 261 */   "term'"
};

/*--------------------------------------------------------------------
 * Yy_sact[] is also used only for debugging. It is indexed by the
 * internal value used for an action symbol and evaluates to a string
 * naming that token symbol.
 */

char *Yy_sact[] =
{
    "{0}", "{1}", "{2}", "{3}", "{4}", "{5}", "{6}"
};
#endif
Listing 4.6. llama.par Macros for Parsing and Debugging Support
298 /*----------------------------------------------------------------------*/
299
300 FILE *yycodeout = stdout;       /* Output stream for code.               */
301 FILE *yybssout  = stdout;       /* Output stream for uninitialized data. */
302 FILE *yydataout = stdout;       /* Output stream for initialized data.   */
303 int  yynerrs    = 0;            /* Error count.                          */
304
305 /*----------------------------------------------------------------------*/
306 #ifdef YYDEBUG
307
308 yystk_dcl( Yy_dstack, char *, YYMAXDEPTH );     /* Debugging parse stack. */
309
310 YYPRIVATE char *yy_sym( sym )
311 {
312     /* Return a pointer to a string representing the symbol, using the
313      * appropriate table to map the symbol to a string.
314      */
315
316     return ( YY_ISTERM( sym ) || !sym ) ? Yy_stok    [ sym ]
317          : ( YY_ISNONTERM( sym )      ) ? Yy_snonterm[ sym - YY_MINNONTERM ]
318          : /* assume it's an act     */   Yy_sact    [ sym - YY_MINACT     ] ;
319 }
320
321 /* Stack-maintenance. yy_push and yy_pop push and pop items from both the
322  * parse and symbol stacks simultaneously. The yycomment calls in both routines
323  * print a message to the comment window saying what just happened. The
324  * yy_pstack() call refreshes the stack window and also requests a new command
325  * from the user. That is, in most situations, yy_pstack() won't return until
326  * the user has typed another command (exceptions are go mode, and so forth).
327  * yy_pstack() also checks for break points and halts the parse if a breakpoint
328  * condition is met.
329  */
330
331 YYPRIVATE yy_push( x, val )
332 int     x;                      /* Push this onto the state stack. */
333 YYSTYPE val;                    /* Push this onto the value stack. */
334 {
335     yypush( Yy_stack,  x           );
336     yypush( Yy_dstack, yy_sym( x ) );
337
338     --Yy_vsp;                                 /* The push() macro checked */
339     Yy_vsp->left = Yy_vsp->right = val;       /* for overflow already.    */
340
341     yycomment( "push %s\n", yy_sym( x ) );
342     yy_pstack( 0, 1 );
343 }
344
345 YYPRIVATE yy_pop()
346 {
347     int prev_tos = yypop( Yy_stack );
348     ++Yy_vsp;
349
350     yycomment( "pop %s\n", yypop( Yy_dstack ) );
351     yy_pstack( 0, 1 );
352
353     return prev_tos;
354 }
355
356
357 YYPRIVATE yy_say_whats_happening( tos_item, production )
358 int tos_item;                   /* Item at top of stack.                   */
359 int production;                 /* Production number we're about to apply. */
360 {
361     /* Print a message in the comment window describing the replace operation
362      * about to be performed. The main problem is that you must assemble a
363      * string that represents the right-hand side of the indicated production.
364      * I do this using the appropriate Yy_pushtab element, but go through the
365      * array backwards (the Yy_pushtab arrays have been reversed to make the
366      * production-mode parse more efficient--you need to unreverse them here).
367      */
368
369     char buf[80];       /* Assemble string representing right-hand side here. */
370     int  count;         /* Maximum size of string representing RHS.           */
371     int  *start, *end;  /* Start and end of Yy_pushtab array that holds RHS.  */
372
373     for( start = end = Yy_pushtab[ production ]; *end; end++ )  /* Find end. */
374         ;
375
376     *buf = '\0';
377     for( count = sizeof(buf); --end >= start && count > 0 ;)    /* Assemble */
378     {                                                           /* string.  */
379         strncat( buf, yy_sym(*end), count );
380         if( (count -= strlen( yy_sym(*end) ) + 1) < 1 )
381             break;
382         strncat( buf, " ", count );
383     }
384
385     yycomment( "Applying %s->%s\n", yy_sym(tos_item), buf );
386 }
387
388 /* Use the following routines just like printf() to create output. In debug
389  * mode, all three routines print to the output window (yy_output() is in
390  * yydebug.c). In production mode, equivalent routines print to the associated
391  * streams (yycodeout, yybssout, or yydataout). The first argument to
392  * yy_output() tells the routine which stream is being used. If the stream is
393  * still set to the default stdout, then the message is written only to the
394  * window. If the stream has been changed, however, the output is sent both
395  * to the window and the associated stream.
396  */
397
398 yycode( fmt )           /* Write something to the code-segment stream. */
399 char *fmt;
400 {
401     va_list args;
402     va_start( args, fmt );
403     yy_output( 0, fmt, args );
404 }
405
406 yydata( fmt )           /* Write something to the data-segment stream. */
407 char *fmt;
408 {
409     va_list args;
410     va_start( args, fmt );
411     yy_output( 1, fmt, args );
412 }
413
414 yybss( fmt )            /* Write something to the bss-segment stream. */
415 char *fmt;
416 {
417     va_list args;
418     va_start( args, fmt );
419     yy_output( 2, fmt, args );
420 }
421
422 /* Debugging versions of yycomment() and yy_error() are pulled out of yydebug.c
423  * when YYDEBUG is defined. Similarly, yy_break(), which halts the parse if a
424  * break-on-production-applied breakpoint has been triggered, is found in
425  * yydebug.c. It is eliminated from the production-mode output by defining it as
426  * an empty macro, below. Finally, yy_nextoken(), which evaluates to a yylex()
427  * call in production mode, and which is defined in yydebug.c, both gets the
428  * next token and prints it in the TOKEN window.
429  */
430
431 #else
432
433 #    define yy_push(x,v)    ( yypush(Yy_stack, x),                       \
434                               --Yy_vsp, Yy_vsp->left = Yy_vsp->right = v )
435
436 #    define yy_pop()        ( ++Yy_vsp, yypop(Yy_stack) )
437 #    define yy_nextoken()   yylex()
438 #    define yy_quit_debug()
439 #    define yy_sym()
440 #    define yy_say_whats_happening(tos_item, prod)
441 #    define yy_redraw_stack()
442 #    define yy_pstack(refresh, print_it)
443 #    define yy_break(x)
444
445 #ifndef va_list
446 #    include <stdarg.h>
447 #else
448 #    ifdef va_dcl
449          MUST USE ANSI VARIABLE-ARGUMENT CONVENTIONS IN <stdarg.h>
450 #    endif
451 #endif
452
453 void yycode( fmt, ... )
454 char *fmt;
455 {
456     va_list args;
457     va_start( args, fmt );
458     vfprintf( yycodeout, fmt, args );
459 }
460
461 void yydata( fmt, ... )
462 char *fmt;
463 {
464     va_list args;
465     va_start( args, fmt );
466     vfprintf( yydataout, fmt, args );
467 }
468
469 void yybss( fmt, ... )
470 char *fmt;
471 {
472     va_list args;
473     va_start( args, fmt );
474     vfprintf( yybssout, fmt, args );
475 }
476
477 void yycomment( fmt, ... )
478 char *fmt;
479 {
480     va_list args;
481     va_start( args, fmt );
482     vfprintf( stdout, fmt, args );
483 }
484
485 void yyerror( fmt, ... )
486 char *fmt;
487 {
488     va_list     args;
489     extern char *yytext;
490     extern int  yylineno;
491
492     va_start( args, fmt );
493     fprintf ( stderr, "ERROR (line %d near %s): ", yylineno, yytext );
494     vfprintf( stderr, fmt, args );
495     fprintf ( stderr, "\n" );
496 }
497 #endif
The symbol stack is manipulated in the yy_push() and yy_pop() subroutines on lines 331
to 354. Stack boundary checking is done only on the actual parse stack because, since
all stacks are the same size and they all run in parallel, additional checking is redundant.
These subroutines are translated into macros on lines 433 to 436 if YYDEBUG isn't
active. Most other details are called out in comments in the listing, to which you are
referred. I'll look at yydebug.c, which holds all the debug-mode support routines, in a
moment.
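The idea of two stacks that move in lock step, with the bounds check made on only one of them, can be sketched as follows. This is an illustration under simplifying assumptions, not the book's code: the names demo_push, demo_pop, symbol_name, and DEPTH are hypothetical, and the comment-window and stack-window calls are left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 8                         /* hypothetical stack depth */

static int         parse_stack[DEPTH];  /* numeric symbols               */
static const char *symbol_stack[DEPTH]; /* names, parallel to the above  */
static int         sp  = DEPTH;         /* both stacks grow downward ... */
static int         dsp = DEPTH;         /* ... and move in lock step     */

/* Hypothetical stand-in for a symbol-to-string table lookup. */
static const char *symbol_name(int sym)
{
    switch (sym) {
    case 256: return "stmt";
    case 257: return "expr";
    case 258: return "term";
    default:  return "?";
    }
}

/* Push onto both stacks. Only the parse stack is checked for overflow;
 * the symbol stack is the same size, so a second check would be redundant. */
int demo_push(int sym)
{
    if (sp == 0)
        return 0;
    parse_stack[--sp]   = sym;
    symbol_stack[--dsp] = symbol_name(sym);
    assert(sp == dsp);
    return 1;
}

/* Pop from both stacks; returns the popped name, or NULL on underflow. */
const char *demo_pop(int *sym)
{
    if (sp == DEPTH)
        return NULL;
    *sym = parse_stack[sp++];
    return symbol_stack[dsp++];
}
```

A production build would keep only the numeric stack, which is what replacing the subroutines with macros accomplishes in the listing.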
The rest of llout.c is the error-recovery code and the actual parser, in Listing 4.7.
Error recovery is done with two subroutines on lines 502 to 545. yy_in_synch() (on
line 502) checks to see if the symbol passed to it is in the set of synchronization symbols
given to the %synch directive and listed in the Yy_synch table. yy_synch() does the
actual error recovery. The error-recovery algorithm is described in the comment on line
516.
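The three-step recovery strategy can be tried out in isolation with a sketch like the one below. It is an assumption-laden illustration rather than the generated code: the stack is a plain array whose top is the last element, the input is a 0-terminated token array, the error-count limit is omitted, and the names in_synch_set and recover are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if sym is in the -1-terminated synchronization set. */
static int in_synch_set(const int *synch, int sym)
{
    for (; *synch != -1; ++synch)
        if (*synch == sym)
            return 1;
    return 0;
}

/* stack[0 .. *depth-1] holds the parse stack, top of stack last.
 * *input points at the current lookahead in a 0-terminated token array.
 * Returns the lookahead after recovery, or 0 if recovery is impossible. */
int recover(int *stack, int *depth, const int *synch, const int **input)
{
    int tok;

    assert(stack && depth && synch && input && *input);

    /* (1) Pop until a synchronization symbol is on top of the stack. */
    while (*depth > 0 && !in_synch_set(synch, stack[*depth - 1]))
        --*depth;

    /* (2) No such symbol on the stack: give up. */
    if (*depth == 0)
        return 0;

    /* (3) Skip input until it matches the uncovered symbol (or ends). */
    tok = stack[*depth - 1];
    while (**input != 0 && **input != tok)
        ++*input;

    return **input;
}
```

With a stack of { 256, 6, 257, 3 }, a synchronization set of { 5, 6, -1 }, and the input 3 1 3 6 3, the sketch pops down to the 6 and then skips input up to the first 6, so parsing can resume with a matching terminal on top of the stack.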
Listing 4.7. llama.par The Parser
498 /*--------------------------------------------------------------------
499  * ERROR RECOVERY:
500  */
501
502 YYPRIVATE yy_in_synch( sym )
503 {
504     /* Return 1 if sym is in the synchronization set defined in Yy_synch. */
505
506     int *p;
507
508     for( p = Yy_synch; *p && *p > 0 ; p++ )
509         if( *p == sym )
510             return 1;
511     return 0;
512 }
513
514 YYPRIVATE yy_synch( lookahead )
515 {
516     /* Recover from an error by trying to synchronize the input stream and the
517      * stack. Return the next lookahead token or 0 if we can't recover. yyparse()
518      * terminates if none of the synchronization symbols are on the stack. The
519      * following algorithm is used:
520      *
521      * (1) Pop symbols off the stack until you find one in the synchronization
522      *     set.
523      * (2) If no such symbol is found, you can't recover from the error. Return
524      *     an error condition.
525      * (3) Otherwise, advance the input symbol either until you find one that
526      *     matches the stack symbol uncovered in (1) or you reach end of file.
527      */
528
529     int tok;
530
531     if( ++yynerrs > YYMAXERR )
532         return 0;
533
534     while( !yy_in_synch( tok = yytos( Yy_stack )) \
535                         && !yystk_empty( Yy_stack ))            /* 1 */
536         yy_pop();
537
538     if( yystk_empty(Yy_stack) )                                 /* 2 */
539         return 0;
540
541     while( lookahead && lookahead != tok )                      /* 3 */
542         lookahead = yy_nextoken();
543
544     return lookahead;
545 }
546
547 /*--------------------------------------------------------------------
548  * The actual parser. Returns 0 normally, -1 if it can't synchronize after an
549  * error, otherwise returns a nonzero value returned by one of the actions.
550  */
551
552 int yyparse()
553 {
554     int      *p;              /* General-purpose pointer.              */
555     YY_TTYPE prod;            /* Production being processed.           */
556     int      lookahead;       /* Lookahead token.                      */
557     int      errcode = 0;     /* Error code returned by an act.        */
558     int      tchar;           /* Holds terminal character in yytext.   */
559     int      actual_lineno;   /* Actual input line number.             */
560     char     *actual_text;    /* Text of current lexeme.               */
561     int      actual_leng;     /* Length of current lexeme.             */
562     YYSTYPE  val;             /* Holds $$ value for replaced nonterm.  */
563
564 #   ifdef YYDEBUG
565         if( !yy_init_debug( Yy_stack,  &yystk_p(Yy_stack ),
566                             Yy_dstack, &yystk_p(Yy_dstack),
567                             Yy_vstack, sizeof(yyvstype), YYMAXDEPTH) )
568             YYABORT;
569
570         yystk_clear(Yy_dstack);
571 #   endif
572
573     yystk_clear(Yy_stack);
574     Yy_vsp = Yy_vstack + YYMAXDEPTH;
575
576     yy_push( YY_START_STATE, (Yy_vsp-1)->left );    /* Push start state onto */
577                                                     /* parse stack and junk  */
578                                                     /* onto the value stack. */
579     yy_init_llama( Yy_vsp );                        /* User-supplied init.   */
580     lookahead = yy_nextoken();
581
582     while( !yystk_empty(Yy_stack) )
583     {
584         if( YY_ISACT( yytos(Yy_stack) ) )       /* if TOS is an action, do */
585         {                                       /* it and pop the action.  */
586             yylineno = ii_plineno();
587             yytext   = ii_ptext();
588             tchar    = yytext[ yyleng = ii_plength() ];
589             yytext[ yyleng ] = '\0';
590
591             if( errcode = yy_act( yytos(Yy_stack) ) )
592                 return errcode;
593
594             yy_pop();
595             yy_redraw_stack();
596             yytext[ yyleng ] = tchar;
597         }
598         else if( YY_ISTERM( yytos(Yy_stack) ) ) /* Advance if it's a terminal. */
599         {
600             if( yytos(Yy_stack) != lookahead )  /* ERROR if it's not there.    */
601             {
602                 yyerror( "%s expected\n", Yy_stok[ yytos(Yy_stack) ] );
603                 if( !(lookahead = yy_synch(lookahead)) )
604                     YYABORT;
605             }
606             else
607             {
608                 /* Pop the terminal symbol at top of stack. Mark the current
609                  * token as the previous one (we'll use the previous one as
610                  * yytext in subsequent actions), and advance.
611                  */
612
613                 yy_pop();
614                 ii_mark_prev();
615
616                 lookahead     = yy_nextoken();
617                 actual_lineno = yylineno;
618                 actual_text   = yytext;
619                 actual_leng   = yyleng;
620             }
621         }
622         else
623         {
624             /* Replace a nonterminal at top of stack with its right-hand side.
625              * First look up the production number in the table with the
626              * yy_next call. If prod==YYF, there was no legal transition and
627              * error-processing is activated. Otherwise the replace operation
628              * is done by popping the nonterminal, and pushing the right-hand
629              * side from the appropriate Yy_pushtab entry.
Top-down parser: yyparse().
Phase problems with yytext.
The actual parser, yyparse(), starts on line 552 of Listing 4.7. For the most part, it
is a straightforward implementation of the algorithm in Table 4.12 on page 211. The one
difficulty not covered in the algorithm is the lexeme, yytext, which must be valid when
the action is executed. The problem is that the match-and-advance operation that's
triggered when a terminal symbol is on top of the stack overwrites yytext with the new lexeme.
Consequently, you have to mark the current lexeme as the previous one before advancing,
using the ii_mark_prev() call on line 614 (which was put into the input routines
described in Chapter Three for this very purpose). When a subsequent action is actually
performed (starting on line 586), yytext is modified to reference the previous lexeme,
not the current one; that is, it references the lexeme associated with the token that we just
advanced past, not the lookahead token. The code on lines 588, 589, and 596 is just null
terminating the lexeme.
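The save, terminate, and restore pattern used around the action call can be shown on its own. The sketch below assumes a lexeme that lives inside a larger writable input buffer; with_terminated_lexeme and its callback parameter are hypothetical names introduced for illustration, not part of the book's input system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Temporarily turn lexeme[0 .. len-1] into a NUL-terminated string, hand it
 * to use(), then put back the character that the '\0' overwrote. The buffer
 * must be writable and must extend at least one character past the lexeme. */
size_t with_terminated_lexeme(char *lexeme, size_t len,
                              size_t (*use)(const char *))
{
    char   saved;
    size_t result;

    assert(lexeme != NULL && use != NULL);

    saved       = lexeme[len];      /* remember the character after the lexeme */
    lexeme[len] = '\0';             /* terminate in place, without copying     */
    result      = use(lexeme);      /* the "action" sees an ordinary C string  */
    lexeme[len] = saved;            /* restore the input buffer                */
    return result;
}
```

For example, calling with_terminated_lexeme(buf, 2, strlen) on a buffer holding "12+34;" lets strlen see only "12", and the buffer is unchanged afterwards. Terminating in place avoids copying every lexeme, at the cost of briefly modifying the input buffer.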
The remainder of llout.c is just copied from the input file at the end of Appendix D.
I won't reproduce it here.
4.9.2 Occs and LLama Debugging Support: yydebug.c
This section discusses the debug-mode support routines used by the LLama-generated
parser in the previous section. The same routines are used by the occs-generated
parser discussed in the next chapter. You should be familiar with the interface
to the curses window-management functions described in Appendix A before continuing.
The debugging module, yydebug.c, starts in Listing 4.8 with various definitions,
which you should read now (there are extensive comments that describe what the
variables and definitions do; I'll discuss some of them a bit more in a moment).
Listing 4.8. yydebug.c Definitions
1  #include <stdio.h>
2  #include <ctype.h>
3  #include <sys/types.h>      /* ANSI/UNIX time functions.                  */
4  #include <sys/timeb.h>      /* ANSI/UNIX time functions.                  */
5  #include <curses.h>         /* Window functions.                          */
6  #include <signal.h>         /* Needed by signal.                          */
7  #include <stdarg.h>         /* ANSI variable-argument lists.              */
8  #include                    /* Prototypes for access().                   */
9  #include <string.h>         /* Prototypes for string functions.           */
10 #include <stdlib.h>         /* Prototypes for other library functions.    */
11 #include <tools/debug.h>    /* Various macros.                            */
12 #include <tools/l.h>        /* Prototypes for all of l.lib, including all */
13                             /* functions in the current file.             */
14 extern char *yytext;        /* Generated by LeX and lex.                  */
16 extern int  yyleng;
17
18 /* If your system doesn't have an <stdarg.h>, use the following:
19  *
20  * typedef char *va_list;
21  * #define va_start(arg_ptr,first) arg_ptr = (va_list)&first + sizeof(first)
22  * #define va_arg(arg_ptr,type)    ((type *)(arg_ptr += sizeof(type)))[-1]
23  * #define va_end()
24  *--------------------------------------------------------------------
25  * The following macros take care of system dependencies. They assume a 25-line
26  * screen on the IBM and a 24-line screen under Unix. Code put inside an MS()
27  * macro compiles only if MSDOS is #defined. Code in a UX() macro compiles only
28  * if MSDOS is not #defined. The NEWLINE define takes care of a bug in the UNIX
29  * curses package that isn't present in the DOS version presented in this book
30  * (it clears the bottom line after a scroll). box.h (in Appendix A) holds
31  * #defines for the IBM Box-drawing characters. #define NOT_IBM_PC to use the
32  * more portable defines in that file ('+' is used for corners, '|' for vertical
33  * lines, and '-' for horizontal ones) rather than the less-portable IBM
34  * graphics characters. fcntl() is also used only in UNIX mode.
35  */
36
37 #ifdef MSDOS
38 #    include <tools/box.h>
39 #    define SCRNSIZE 25
40 #    define NEWLINE(win) (Interactive ? waddch( win, '\n' ) :0)
41 #else
42 #    define NOT_IBM_PC
43 #    include <tools/box.h>
44 #    include <fcntl.h>
45 #    define SCRNSIZE 24
46 #    define NEWLINE(win) (Interactive ? (waddch( win, '\n' ), wclrtoeol(win))\
47                                       : 0 )
48 #endif
49
50 /*--------------------------------------------------------------------
51  * Defines for the windows. STACK_TOP is the top line of the stack window.
52  * DEFSTACK is the default size of the text area of the stack window.
53  * STACK_WINSIZE is the height of the stack window, including the border. IO_TOP
54  * is the top line of both the I/O and comments windows, and IO_WINSIZE is the
55  * height of the text area of those windows. It should use the whole screen
56  * less the area used for the stack and prompt windows.
57  */
58
59 #define STACK_TOP       0
60 #define DEFSTACK        11      /* Stacksize=DEFSTACK by default.            */
61 #define STACK_WINSIZE   (Stacksize +2)
62 #define PROMPT_TOP      (SCRNSIZE - 3)
63 #define PROMPT_WINSIZE  3
64 #define IO_TOP          (STACK_WINSIZE-1)
65 #define IO_WINSIZE      ((SCRNSIZE - (STACK_WINSIZE + PROMPT_WINSIZE)) + 2)
66
67 #define TOKEN_WIDTH     22      /* Width of token window including border.   */
68 #define PRINTWIDTH      79      /* Only this many characters are printed on each
69                                  * line by the write-screen (w) command. Extra
70                                  * characters are truncated.
71                                  */
72
73 #define ESC             0x1b    /* ASCII ESC character.                      */
74
75 /*--------------------------------------------------------------------
76  * Breakpoints. A breakpoint is set with a 'b' command. It causes automatic-mode
77  * operation to terminate immediately before applying a production or
78  * when a specified symbol is on the top of stack. P_breakpoint holds the
79  * production breakpoint; T_breakpoint holds the top-of-stack breakpoint;
80  * I_breakpoint is the input breakpoint. The former is an int because it's
81  * always a number. The latter two are strings because they can be symbolic
82  * names as well as numbers. The last variable, L_breakpoint, is the input
83  * line breakpoint.
84  */
85
86 #define BRKLEN 33               /* Longest lexeme in a breakpoint + 1. */
87
88 PRIVATE int  P_breakpoint           = -1;
89 PRIVATE int  L_breakpoint           = -1;
90 PRIVATE char S_breakpoint[ BRKLEN ] = { '\0' };
91 PRIVATE char I_breakpoint[ BRKLEN ] = { '\0' };
92
93  /*--------------------------------------------------------------------
94   * I've attempted to isolate these routines as much as possible from the actual
95   * parser. They do need to know where all the stacks are, however. The following
96   * variables are initialized at run-time by an access routine [yy_init_debug()]
97   * and are used to access static variables in the parser itself. Note that the
98   * addresses of the stack pointers are stored, not the contents of the stack
99   * pointers.
100  */
101
102 PRIVATE int   Abort;           /* Force input routine to return EOI.    */
103 PRIVATE char  *Vstack;         /* Base address of value stack (or NULL  */
104                                /* if called by llama).                  */
105 PRIVATE int   Vsize;           /* Size of one element of value stack.   */
106 PRIVATE char  **Dstack;        /* Base address of debug (symbol) stack. */
107 PRIVATE char  ***P_dsp;        /* Pointer to debug-stack pointer.       */
108 PRIVATE int   *Sstack;         /* Base address of state stack.          */
109 PRIVATE int   **P_sp;          /* Pointer to state-stack pointer.       */
110 PRIVATE int   Depth;           /* Stack depth (all three stacks).       */
111
112 /*--------------------------------------------------------------------
113  * The following variables are all used internally
114  */
115
116 PRIVATE WINDOW *Stack_window;      /* Windows for the debugging screen.       */
117 PRIVATE WINDOW *Prompt_window;
118 PRIVATE WINDOW *Code_window;
119 PRIVATE WINDOW *Comment_window;
120 PRIVATE WINDOW *Token_window;
121 PRIVATE int    Stacksize = DEFSTACK; /* Number of active lines in the stack   */
122                                    /* window (doesn't include border).        */
123 PRIVATE int    Onumele     = 0;    /* Number of elements on the stack.        */
124 PRIVATE int    Interactive = 1;    /* Interactive mode (not n or N).          */
125 PRIVATE int    Singlestep  = 1;    /* Single step through parse if true.      */
126 PRIVATE long   Delay       = 0L;   /* Amount of time to wait after printing   */
127                                    /* each stack update when not single       */
128                                    /* stepping (milliseconds).                */
129 PRIVATE int    Inp_fm_file = 0;    /* 1 if input file is open.                */
130 PRIVATE FILE   *Log        = NULL; /* Pointer to the log file if one is open. */
131 PRIVATE int    No_comment_pix  = 0; /* 1 if no comment-window output is printed. */
132 PRIVATE int    No_stack_pix    = 0; /* 1 if no stack pictures are to be printed  */
133                                    /* in the log file.                        */
134 PRIVATE int    Horiz_stack_pix = 0; /* 1 if stack pictures are printed horiz-    */
135                                    /* ontally in the log file.                */
136 PRIVATE int    Parse_pix;          /* if(Horiz_stack_pix), print state stack.   */
137 PRIVATE int    Sym_pix;            /* if(Horiz_stack_pix), print symbol stack.  */
138 PRIVATE int    Attr_pix;           /* if(Horiz_stack_pix), print attrib. stack. */
139
140 #ifndef MSDOS /*------------------ UNIX SYSTEM V ONLY ------------------*/
141                                    /* Since MS-DOS has a system call that
142                                     * gives the keyboard status, detecting
143                                     * if a key has been pressed without
144                                     * reading the character is easy. In
145                                     * UNIX you must use SIGIO to set a
146                                     * flag (Char_avail). kbready() is the
147                                     * SIGIO exception handler.
148                                     */
149 PRIVATE int Char_avail = 0;
150 #define kbhit() Char_avail
151
152 #else /*---------------------- DOS VERSION ONLY ----------------------*/
153
154 extern int kbhit( void );          /* Microsoft function. returns 1 if a */
155                                    /* character is waiting to be read    */
156                                    /* from the keyboard buffer. This     */
157                                    /* function is provided in most       */
158                                    /* MS-DOS compiler's libraries.       */
159
160 /* The Map[] array converts IBM box-drawing characters to something that's
161  * printable. The corners are converted to plus signs, horizontal lines go to
162  * dashes, and vertical lines map to vertical bars. The conv() subroutine is
163  * passed a character and returns a mapped character. It must be a subroutine
164  * because of the way that it's used below. It would have unacceptable
165  * side-effects if rewritten as a macro.
166  */
167
168 PRIVATE unsigned char Map[] =
169 {
170     1 f + ' / ' + t I r ' +' /
171     9 ' +/ t ' +' t ' +' f ' + t 9
172     + ' t ' +' t ' + f ' + f + f + f ' + ' t
173 };
174
175 PRIVATE int conv( c )
176 {
177     return (VERT <= c && c <= UL) ? Map[c - VERT] : c ;
178 }
179 #endif /*--------------------------------------------------------------*/
Initialize debugging: yy_init_debug().
Getting keyboard status under UNIX, SIGIO.
Disable curses on abort.
q command raises SIGINT.
The actual code starts in Listing 4.9. The yy_init_debug() function on line 180
initializes the debugging functions. It is passed pointers to various variables in the parser
proper that it needs to draw the windows (things like the base addresses of the stacks).
Note that pointers to the stack pointers are passed, rather than the contents of the stack
pointers. This way the parser can push and pop items at will without having to tell the
debugging routines every time it does a push or pop. The debugging routines can just
examine the pointers themselves. The vstack argument is NULL if a LLama-generated
parser is active (because LLama doesn't use a value stack). In this case, the value stack
is not printed in the log file or the stack window.
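The pointer-to-stack-pointer idea can be sketched in a few lines. This is only an illustration, not the book's code: DEPTH, dbg_init(), dbg_numele(), dbg_top(), push(), and pop() are hypothetical names, and the stack is assumed to grow downward from the end of the array, as the listing's arithmetic suggests.

```c
#include <stddef.h>

#define DEPTH 8                       /* hypothetical stack depth              */

static int  Stack[DEPTH];             /* the parser's state stack              */
static int *Sp = Stack + DEPTH;       /* parser's stack pointer (empty stack)  */

/* What the debugger remembers: the stack base and the ADDRESS of Sp. */
static int  *Dbg_base;
static int **Dbg_p_sp;

void dbg_init(int *base, int **p_sp)
{
    Dbg_base = base;
    Dbg_p_sp = p_sp;
}

/* Number of items on the stack, computed from the live stack pointer. */
int dbg_numele(void)
{
    return DEPTH - (int)(*Dbg_p_sp - Dbg_base);
}

/* Top-of-stack item as the debugger sees it (stack must be nonempty). */
int dbg_top(void)
{
    return **Dbg_p_sp;
}

/* The parser pushes and pops without notifying the debugger. */
void push(int x) { *--Sp = x; }
int  pop(void)   { return *Sp++; }
```

Because the debugger holds &Sp rather than a copy of Sp, every call to dbg_numele() or dbg_top() sees the parser's current state.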
The signal() call on line 197 is used to determine keyboard status in UNIX
systems. (The UX macro is defined in debug.h. It's discussed in Appendix A.) The SIGIO
signal is generated every time a key is struck. Here, the signal activates the kbready()
function (declared on line 257), which sets a global flag (Char_avail) to true. This flag
can be examined when you're looking to see if a character is ready. The flag is explicitly
set back to zero when the character is finally read. The fcntl() calls on lines 198 and 199
enable SIGIO (the signal is not generated otherwise).
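The flag-setting handler pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: install_kbready(), key_waiting(), and key_consumed() are hypothetical helpers, the SIGIO/fcntl() arming step is omitted, and the signal number is a parameter so the handler can be exercised with raise() on any standard signal.

```c
#include <signal.h>

/* Set by the handler, cleared by whoever finally reads the character. */
static volatile sig_atomic_t Char_avail = 0;

static void kbready(int signo)
{
    (void)signo;            /* unused */
    Char_avail = 1;         /* just note that input is waiting */
}

/* Install the handler for the given signal; returns 1 on success. */
int install_kbready(int signo)
{
    return signal(signo, kbready) != SIG_ERR;
}

int  key_waiting(void)  { return Char_avail != 0; }   /* plays the role of kbhit()   */
void key_consumed(void) { Char_avail = 0; }           /* call after reading the char */
```

The handler does nothing but set a flag, which keeps it safe to run at any point in the main program.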
The second signal() call on line 211 is used to disable curses when a Ctrl-Break
(Ctrl-C or DEL in UNIX) comes along. Curses must be turned off explicitly in this case to
prevent the screen from being left in an unknown state (with characters not echoing, and
so forth). The handler is on line 263. Note that it disables SIGINT on line 265 so that
another Ctrl-Break won't interrupt the shut-down process (this is not a problem on newer
UNIX versions, but it can't hurt). The q command to the debugger quits by raising the
SIGINT signal, which will execute the handler. This way, any cleaning up of temporary
files that is done by a signal handler in the compiler itself is also done when you quit the
debugger. You can use the a command if you want to leave the debugger without
cleaning up.
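A rough sketch of the "quit by raising SIGINT" arrangement follows. The names shut_down(), arm_interrupt(), quit_command(), and cleanup_count() are hypothetical, and the handler here only counts cleanups instead of calling exit(), so that the behavior can be observed.

```c
#include <signal.h>

static volatile sig_atomic_t Cleanups = 0;

/* Stand-in for the real shut-down work (turning curses off, and so on). */
static void shut_down(void)
{
    ++Cleanups;
}

static void on_interrupt(int signo)
{
    signal(signo, SIG_IGN);   /* a second interrupt must not re-enter the shut-down */
    shut_down();
    /* A real handler would call exit() at this point. */
}

void arm_interrupt(void) { signal(SIGINT, on_interrupt); }

/* The quit command takes the same path as a keyboard interrupt. */
void quit_command(void)  { raise(SIGINT); }

int  cleanup_count(void) { return (int)Cleanups; }
```

Routing the quit command through the signal means there is a single shut-down path, whichever way the user leaves.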
Parsing command-line arguments: yy_get_args().
Input buffering and character echo are turned off on lines 213 and 214 of Listing 4.9. Note that
this is necessary in the BSD curses that runs on the UNIX system that I use, but may be
contraindicated on other systems. Ask your system administrator if this is the correct
thing to do with your implementation. The windows themselves are opened on the next
few lines. The boxwin() function (defined on line 287) works much like a standard
newwin() call. It opens up two windows: the outer one as a subwindow to stdscr,
and the inner one as a subwindow to the outer one. The outer window holds the box and
the inner window holds the text. This way, text written to the inner window won't
overwrite the box. The window title is printed, centered, in the top line of the box.
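The geometry behind the two nested windows is simple enough to sketch without curses. The Rect type and the inner_rect() and title_col() helpers are hypothetical. They only mirror the arithmetic that boxwin() passes to subwin() and wmove().

```c
#include <string.h>

/* Hypothetical helper type: a window's size and its upper-left corner. */
typedef struct { int lines, cols, y, x; } Rect;

/* The inner (text) window is one cell smaller than the boxed outer
 * window on every side, so writes to it can never touch the border.
 */
Rect inner_rect(Rect outer)
{
    Rect in;
    in.lines = outer.lines - 2;
    in.cols  = outer.cols  - 2;
    in.y     = outer.y + 1;
    in.x     = outer.x + 1;
    return in;
}

/* Column, relative to the outer window, at which a centered title starts. */
int title_col(int cols, const char *title)
{
    return (cols - (int)strlen(title)) / 2;
}
```

For example, a 10-line by 40-column box at row 5, column 0 leaves an 8 by 38 text area whose corner is at row 6, column 1.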
The final routine of interest in Listing 4.9 is the yy_get_args() function on line
320. This routine parses the command line for a stack size (specified with -s) and an
input-file name. If you don't want to get this information from the command line, you
can do the following:
    floccinaucinihilipilification()
    {
        static char *vects[] =      /* Simulate argv.                  */
        {
            "",                     /* Junk                            */
            "-s18",                 /* Stack-window size == 18 lines.  */
            "test"                  /* Input file name                 */
        };

        yy_get_args( 3, vects );
    }
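The argument-stripping technique used by yy_get_args() can be sketched as a standalone function. This is an illustration under assumptions, not the listing itself: strip_stack_opt() is a hypothetical name, the DEFSTACK and SCRNSIZE values are assumed, the file is not opened, and (unlike the listing) the sketch also compacts the arguments that follow the file name.

```c
#include <stdlib.h>
#include <string.h>

#define DEFSTACK 11    /* assumed default stack-window size */
#define SCRNSIZE 25    /* assumed number of screen lines    */

/* Remove -sN options that precede the file name from argv. The first
 * argument that doesn't start with '-' is the file name; it and
 * everything after it are kept untouched. Returns the new argc.
 */
int strip_stack_opt(int argc, char **argv, int *ssize, char **filename)
{
    int in, out = 1;

    *ssize    = DEFSTACK;
    *filename = NULL;

    for (in = 1; in < argc; ++in) {
        if (argv[in][0] != '-') {           /* first non-option: the file name */
            *filename = argv[in];
            break;
        }
        if (argv[in][1] == 's')
            *ssize = atoi(&argv[in][2]);    /* consumed, so not copied         */
        else
            argv[out++] = argv[in];         /* unrecognized option: keep it    */
    }

    while (in < argc)                       /* keep file name and what follows */
        argv[out++] = argv[in++];

    if (*ssize < 1)                         /* clamp to a usable window size   */
        *ssize = DEFSTACK;
    else if (*ssize > SCRNSIZE - 6)
        *ssize = SCRNSIZE - 6;

    return out;
}
```

Filtering in place this way lets the caller process whatever options remain with its own argument parser.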
Listing 4.9. yydebug. c Initialization
180  PUBLIC int yy_init_debug( sstack, p_sp, dstack, p_dsp, vstack, v_ele_size, depth )
181
182  int   *sstack;        /* Base address of state stack.                   */
183  int   **p_sp;         /* Address of state-stack pointer.                */
184  char  **dstack;       /* Address of debug stack.                        */
185  char  ***p_dsp;       /* Address of debug-stack pointer.                */
186  void  *vstack;        /* Base address of value stack or NULL for LLama. */
187  int   v_ele_size;     /* Size of one element of value stack.            */
188  int   depth;          /* Number of elements in all three stacks.        */
189  {
190      /* Initialize for interactive I/O for curses. Return 1 on a successful
191       * initialization, 0 otherwise.
192       */
193
194      char buf[80];
195
196      UX( int flags;                                           )
197      UX( signal( SIGIO, kbready );                            )
198      UX( flags = fcntl( fileno(stdin), F_GETFL, 0 );          )
199      UX( fcntl( fileno(stdin), F_SETFL, flags | FASYNC );     )
200
201      Sstack = sstack;
202      Dstack = dstack;
203      Vstack = (char *) vstack;
204      Vsize  = v_ele_size;
205      P_sp   = p_sp;
206      P_dsp  = p_dsp;
207      Depth  = depth;
208      Abort  = 0;
209
210      initscr();
211      signal( SIGINT, die_a_horrible_death );
212
213      noecho();        /* Don't echo input characters automatically.         */
214      crmode();        /* Don't buffer input.                                */
215      MS( nosave(); )  /* Don't save region under windows (my curses only).  */
216
217      Stack_window   = boxwin( STACK_WINSIZE, 80, STACK_TOP, 0,  "[stack]"    );
218      Comment_window = boxwin( IO_WINSIZE,    40, IO_TOP,    0,  "[comments]" );
219      Code_window    = boxwin( IO_WINSIZE,    41, IO_TOP,    39, "[output]"   );
220
221      Prompt_window  = boxwin( PROMPT_WINSIZE, (80 - TOKEN_WIDTH) + 1,
222                                              PROMPT_TOP, 0, "[prompts]" );
223
224      Token_window   = boxwin( PROMPT_WINSIZE, TOKEN_WIDTH, PROMPT_TOP,
225                                              80 - TOKEN_WIDTH, "[lookahead]" );
226      scrollok( Stack_window,   TRUE  );
227      scrollok( Comment_window, TRUE  );
228      scrollok( Code_window,    TRUE  );
229      scrollok( Prompt_window,  TRUE  );
230      scrollok( Token_window,   TRUE  );
231      wrapok  ( Token_window,   FALSE );
232
233      Onumele = 0;
234
235      while( !Inp_fm_file )
236      {
237          /* If you don't have an input file yet, get one. yyprompt() prints the
238           * prompt in the PROMPT window, and fills buf with the reply.
239           */
240
241          if( !yyprompt( "Input file name or ESC to exit: ", buf, 1 ) )
242          {
243              yy_quit_debug();
244              return 0;
245          }
246          new_input_file( buf );
247      }
248      delay();         /* Wait for a command before proceeding. */
249      return 1;
250  }
251
252  /*----------------------------------------------------------------------
253   * Exception handlers:
254   */
255
256  #ifndef MSDOS
257  PRIVATE void kbready()           /* Called when new character is available. */
258  {
259      Char_avail = 1;
260  }
261  #endif
262
263  PRIVATE void die_a_horrible_death()          /* Come here on a SIGINT */
264  {                                            /* or 'q' command.       */
265      signal( SIGINT, SIG_IGN );
266      yy_quit_debug();
267      exit( 0 );
268  }
269
270  PUBLIC void yy_quit_debug()                  /* Normal termination.   */
271  {
272      echo();             /* Turn echo and editing back on.               */
273      nocrmode();
274      move( 24, 0 );      /* Put the cursor on the bottom of the screen.  */
275      refresh();
276      endwin();           /* Turn off curses.                             */
277
278      if( Log )
279          fclose( Log );
280
281      stop_prnt();
282      signal( SIGINT, SIG_DFL );
283  }
284
285  /*----------------------------------------------------------------------*/
286
287  PRIVATE WINDOW *boxwin( lines, cols, y_start, x_start, title )
288  int   lines;
289  int   cols;
290  int   y_start;
291  int   x_start;
292  char  *title;
293  {
294      /* This routine works just like newwin() except that the window has a
295       * box around it that won't be destroyed by writes to the window. It
296       * accomplishes this feat by creating two windows, one inside the other,
297       * with a box drawn in the outer one. It prints the optional title centered
298       * on the top line of the box. Set title to NULL (or "") if you don't want
299       * a title. Note that all windows are made subwindows of the default window
300       * to facilitate the print-screen command.
301       */
302
303      WINDOW *outer;
304
305      outer = subwin( stdscr, lines, cols, y_start, x_start );
306      box( outer, VERT, HORIZ );
307
308      if( title && *title )
309      {
310          wmove  ( outer, 0, (cols - strlen(title)) / 2 );
311          wprintw( outer, "%s", title );
312      }
313
314      wrefresh( outer );
315      return subwin( outer, lines-2, cols-2, y_start+1, x_start+1 );
316  }
317
318  /*----------------------------------------------------------------------*/
319
320  PUBLIC int yy_get_args( argc, argv )
321  char **argv;
322  {
323      /* Scan argv arguments for the debugging options and remove the arguments
324       * from argv. Recognized arguments are:
325       *
326       * -sN  Set stack-window size to N lines. The size of the other windows
327       *      scale accordingly. The stack window is not permitted to get so
328       *      large that the other windows will disappear, however.
329       *
330       * The first argument that doesn't begin with a minus sign is taken to be
331       * the input file name. That name is not removed from argv. All other
332       * arguments are ignored and are not removed from argv, so you can process
333       * them in your own program. This routine prints an error message and
334       * terminates the program if it can't open the specified input file.
335       * Command-line processing stops immediately after the file name is
336       * processed. So, given the line:
337       *
338       *      program -x -s15 -y foo -s1 bar
339       *
340       * Argv is modified to:
341       *
342       *      program -x -y foo -s1 bar
343       *
344       * The file "foo" will have been opened for input and the stack window will
345       * be 15 lines high. Return new value of argc that reflects the removed
346       * arguments.
347       */
348
349      char  **newargv;
350      char  **oldargv  = argv;
351      char  *filename  = NULL;
352      int   ssize      = DEFSTACK;
353
354      newargv = ++argv;
355      for( --argc; --argc >= 0; ++argv )
356      {
357          if( argv[0][0] != '-' )
358          {
359              filename = *newargv++ = *argv;
360              break;
361          }
362          else if( argv[0][1] == 's' )                 /* -s                          */
363              ssize = atoi( &argv[0][2] );             /* Don't copy to *newargv here. */
364          else                                         /* -?                          */
365              *newargv++ = *argv;
366      }
367
368      Stacksize = ( ssize < 1 )            ? DEFSTACK   :
369                  ( ssize > (SCRNSIZE-6) ) ? SCRNSIZE-6 :
370                  /* ssize is in bounds */   ssize      ;
371
372      if( filename )
373      {
374          /* Open input file if one was specified on the command line. */
375
376          if( ii_newfile( filename ) != -1 )
377              Inp_fm_file = 1;
378          else
379          {
380              perror( filename );
381              exit( 1 );
382          }
383      }
384      return newargv - oldargv;
385  }
Output functions: yyerror().
The next listing (Listing 4.10) holds all the output functions. There is one such
routine for each window. In addition, yyerror() (on line 510) writes to the comment
window and simulates the standard yyerror() function by adding an input line number
and token value to the error message. In addition to the standard output functions,
display_file() (on line 558 of Listing 4.10) is used by the f command to print a file
in the stack window, and write_screen() is used by the w command to save the
current screen to a file. Note that, since everything's a subwindow to stdscr, the
stdscr functions can be used on line 654 to read the entire screen (rather than doing it
window by window).
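The per-character filtering that the window output routines rely on (newlines dropped, runs of white space collapsed to one blank) can be sketched independently of curses. The squeeze() function below is a hypothetical stand-in that writes into a buffer rather than a window.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy src to dst (capacity cap, terminator included), dropping newlines
 * and collapsing each run of white space to a single blank. Returns the
 * number of characters stored, not counting the terminator.
 */
size_t squeeze(char *dst, size_t cap, const char *src)
{
    size_t n    = 0;
    int    last = 0;        /* previous character examined; 0 at the start */

    if (cap == 0)
        return 0;

    for (; *src && n + 1 < cap; ++src) {
        unsigned char c = (unsigned char)*src;

        if (c == '\n')      /* newlines are suppressed and not remembered */
            continue;

        if (!(isspace(c) && isspace(last)))
            dst[n++] = isspace(c) ? ' ' : (char)c;

        last = c;
    }
    dst[n] = '\0';
    return n;
}
```

Remembering the previous character is what lets a one-character-at-a-time output function collapse white space without buffering a whole line.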
Listing 4.10. yydebug.c Window Output Functions
386  PRIVATE void prnt_putc( c, win )
387  WINDOW *win;
388  {
389      /* All output done through prnt_putc is suppressed in Go mode. Also note
390       * that the arguments are reversed from addch(). This reversal lets you use
391       * the prnt() subroutine (described in Appendix A), which expects a putc()-
392       * like output function. Newlines are suppressed here so that you can have
393       * more control over scrolling. Similarly, sequences of white space are
394       * replaced by a single space character to conserve space in the window.
395       * Test_c is used to take care of the IBM graphics characters that form
396       * the vertical line separating the stream-identification column from the
397       * actual output. The c is mapped to a '|' if it's too large to be an ASCII
398       * character (so isspace() will work properly).
399       */
400
401      static WINDOW *last_win = NULL;
402      static int    last_c    = 0;
403      int           test_c;
404
405      if( Interactive && c != '\n' )
406      {
407          test_c = (c < 0x7f) ? c : '|';
408
409          if( !(win==last_win && isspace(test_c) && isspace(last_c)) )
410              waddch( win, isspace(test_c) ? ' ' : c );
411
412          last_win = win;
413          last_c   = test_c;
414      }
415  }
416
417  PRIVATE void refresh_win( win )
418  WINDOW *win;
419  {
420      /* Refresh the windows if you're not in go mode. (If you are, nothing will
421       * have been written, so there's no point in doing the refresh.)
422       */
423
424      if( Interactive )
425          wrefresh( win );
426  }
427
428  /*----------------------------------------------------------------------*/
429
430  PUBLIC void yy_output( where, fmt, args )        /* Generate code */
431  int     where;
432  char    *fmt;
433  va_list args;
434  {
435      /* Works like vprintf(), but sends output to the code window. In the window,
436       * it ignores any newlines in the string but prints a newline after every
437       * call. All code sent to yycode(), yydata(), and yybss() is funneled
438       * here. "where" should be one of the following:
439       *
440       *      0 : code
441       *      1 : data
442       *      2 : bss
443       *
444       * Note that if the three associated streams (yycodeout, yybssout, and
445       * yydataout, all declared in the parser output file) are not directed to
446       * stdout, output is sent to that stream TOO. Don't modify these to point
447       * at stderr (or any other stream that accesses the console: /dev/tty, con,
448       * etc.) or you'll mess up the screen.
449       *
450       * Note that the real yycode(), etc. (i.e., the ones supplied when YYDEBUG
451       * is not defined) don't do anything special with newlines. In particular,
452       * they are not inserted automatically at the end of the line. To make both
453       * sets of routines compatible, your output strings should all have exactly
454       * one newline, placed at the end of the string (don't embed any in the
455       * middle).
456       */
457
458      extern FILE *yycodeout, *yydataout, *yybssout;
459
460      if( Log )
461      {
462          fprintf( Log, where == 0 ? "CODE->" :
463                        where == 1 ? "DATA->" : "BSS-->" );
464          prnt  ( fputc, Log, fmt, args );
465          fputc ( '\n', Log );
466      }
467
468      NEWLINE( Code_window );
469
470      prnt_putc( where==0 ? 'C' : where==1 ? 'D' : 'B', Code_window );
471      prnt_putc( VERT, Code_window );
472
473      prnt( prnt_putc, Code_window, fmt, args );
474      refresh_win( Code_window );
475
476      if( where == 0 && yycodeout != stdout )
477          vfprintf( yycodeout, fmt, args );
478
479      if( where == 1 && yydataout != stdout )
480          vfprintf( yydataout, fmt, args );
481
482      if( where == 2 && yybssout != stdout )
483          vfprintf( yybssout, fmt, args );
484  }
485
486  /*----------------------------------------------------------------------*/
487
488  PUBLIC void yycomment( fmt, ... )
489  char *fmt;
490  {
491      /* Works like printf() except that it automatically prints a newline
492       * IN FRONT OF the string and ignores any \n's in the string itself. Writes
493       * into the comment window, and outputs a message to the log file if
494       * logging is enabled.
495       */
496
497      va_list args;
498      va_start( args, fmt );
499
500      if( Log && !No_comment_pix )
501          prnt( fputc, Log, fmt, args );
502
503      NEWLINE( Comment_window );
504      prnt( prnt_putc, Comment_window, fmt, args );
505      refresh_win( Comment_window );
506  }
507
508  /*----------------------------------------------------------------------*/
509
510  PUBLIC void yyerror( fmt, ... )
511  char *fmt;
512  {
513      /* Debugging version of the error routine. Works just like the nondebugging
514       * version, but writes to the comment window. Note that yycomment() copies
515       * the error message to the Log file. Interactive mode is temporarily
516       * enabled to assure that error messages get printed.
517       */
518
519      int     old_interactive;
520      va_list args;
521      va_start( args, fmt );
522
523      old_interactive = Interactive;
524      Interactive     = 1;
525
526      yycomment( "ERROR, line %d near <%s>\n", yylineno, yytext );
527
528      if( Log )
529          prnt( fputc, Log, fmt, args );
530
531      NEWLINE( Comment_window );
532      prnt( prnt_putc, Comment_window, fmt, args );
533      refresh_win( Comment_window );
534
535      Interactive = old_interactive;
536      Singlestep  = 1;             /* Force a breakpoint */
537      yy_pstack( 0, 1 );
538  }
539
540  /*----------------------------------------------------------------------*/
541
542  PRIVATE void yy_input( fmt, ... )
543  char *fmt;
544  {
545      /* This is not an input function; rather, it writes to the INPUT window.
546       * It works like printf(). Note that nothing is logged here. The logging
547       * is done in nextoken(). Ignores all \n's in the input string.
548       */
549
550      va_list args;
551      va_start( args, fmt );
552      prnt( prnt_putc, Prompt_window, fmt, args );
553      refresh_win( Prompt_window );
554  }
555
556  /*----------------------------------------------------------------------*/
557
558  PRIVATE void display_file( name, buf_size, print_lines )
559  char  *name;            /* Initially holds the file name, but */
560  int   buf_size;         /* recycled as an input buffer.       */
561  int   print_lines;
562  {
563      /* Display an arbitrary file in the stack window, one page at a time.
564       * The stack window is not refreshed by this routine.
565       */
566
567      FILE  *fd;
568      int   i;
569      int   lineno = 0;
570
571      if( !(fd = fopen( name, "r" )) )
572      {
573          NEWLINE ( Prompt_window );
574          wprintw ( Prompt_window, "Can't open %s", name );
575          wrefresh( Prompt_window );
576          presskey();
577      }
578      else                /* Note that order of evaluation is important in  */
579      {                   /* the following while statement. You don't want  */
580                          /* to get the line if i goes past 0.              */
581
582          for( i = Stacksize-1 ;; i = (*name == ' ') ? 1 : Stacksize-2 )
583          {
584              while( --i >= 0 && fgets(name, buf_size, fd) )
585              {
586                  if( print_lines )
587                      wprintw( Stack_window, "%3d:", ++lineno );
588
589                  wprintw ( Stack_window, "%s", name );
590                  wrefresh( Stack_window );
591              }
592
593              if( i > 0 )
594                  break;
595
596              if( !yyprompt( "ESC quits. Space scrolls 1 line. Enter for screenful",
597                                                                     name, 0 ) )
598                  break;
599          }
600          yyprompt( "*** End of file. Press any key to continue ***", name, 0 );
601          fclose( fd );
602      }
603  }
604
605  /*----------------------------------------------------------------------*/
606
607  PRIVATE void write_screen( filename )
608  char *filename;
609  {
610      /* Print the current screen contents to the indicated file. Note that the
611       * right edge of the box isn't printed in order to let us have 79-character
612       * lines. Otherwise, the saved screen shows up as double-spaced on most
613       * printers. The screen image is appended to the end of the file. In MS-DOS,
614       * use "prn:" as the file name if you want to go to the printer.
615       *
616       * sys_errlist and errno are both defined in <stdlib.h>
617       */
618
619      char  buf[2];
620      char  *mode = "a";
621      int   row, col, y, x;
622      FILE  *file;
623
624      if( access( filename, 0 ) == 0 )
625          if( !yyprompt( "File exists, overwrite or append? (o/a): ", buf, 0 ) )
626          {
627              NEWLINE ( Prompt_window );
628              yy_input( "Aborting command." );
629              presskey();
630              return;
631          }
632          else
633          {
634              if( toupper(*buf) == 'O' )
635                  mode = "w";
636          }
637
638      if( file = fopen( filename, mode ) )
639          yy_input( "...%s %s...",
640                    *mode=='w' ? "overwriting" : "appending to", filename );
641      else
642      {
643          yy_input( "Can't open %s: %s.", filename, sys_errlist[errno] );
644          presskey();
645          return;
646      }
647
648      getyx( Prompt_window, y, x );
649
650      for( row = 0; row < SCRNSIZE; row++ )
651      {
652          for( col = 0; col < PRINTWIDTH; col++ )
653          {
654              UX( fputc(       mvinch(row, col)  , file );  )
655              MS( fputc( conv( mvinch(row, col) ), file );  )
656          }
657
658          fputc( '\n', file );
659      }
660
661      fclose( file );
662      wmove( Prompt_window, y, x );
663  }
The routines in Listing 4.11 do the real work. yy_pstack() (on line 664) is called by
the parser every time the stack is modified. The stack is printed to the log file, if
necessary, on lines 700 to 773. Interactive is false on line 775 if the debugger is in
noninteractive mode (initiated with an n command). In this case, a speedometer readout that
tells you that the program is actually doing something is updated (it can take a while to
parse a big input file, even in noninteractive mode) and the routine returns. A stack
breakpoint is triggered on line 785 if necessary (the debugger is just thrown back into
single-step mode if one is found).
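The stack-breakpoint test can be sketched as a small predicate. The breakpoint_hit() function is a hypothetical name. It follows the rule stated in the listing's comment: a breakpoint that starts with a digit is compared with the state on top of the state stack, and anything else is compared with the string on top of the symbol stack.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if breakpoint string bp matches the top-of-stack state
 * number or the top-of-stack symbol name, 0 otherwise. An empty (or
 * NULL) bp means that no breakpoint is set.
 */
int breakpoint_hit(const char *bp, int state, const char *symbol)
{
    if (!bp || !*bp)
        return 0;

    if (isdigit((unsigned char)*bp))
        return atoi(bp) == state;       /* numeric: match the state  */

    return strcmp(bp, symbol) == 0;     /* textual: match the symbol */
}
```

A caller in the style of the listing would set its single-step flag when this predicate returns true.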
The delay() call on line 874 does one of two things. If you're not single stepping,
it just delays by zero or more seconds (you can modify the number with the d command).
delay() gets and executes a command if you're single stepping or you hit a key during
the delay. delay() doesn't exit until one of the commands that starts up the parse again
(space to singlestep, n to enter noninteractive mode, or g to go) is executed. The
delay() subroutine itself starts at the top of the following listing (Listing 4.12). Note
that I'm using the UNIX/ANSI time function ftime() here to get the time in milliseconds.
It loads the time_buf structure with the elapsed number of seconds since January 1,
1970 (time_buf.time) and the number of milliseconds as well (in
time_buf.millitm). The Delay variable holds the desired delay (in milliseconds).
Update stack window, yy_pstack().
Stack breakpoint.
Main command loop, delay().
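The millisecond arithmetic implied by the two fields can be sketched as follows. The Stamp type is a hypothetical stand-in with only the two fields the text names, and elapsed_ms() and delay_expired() are illustrative helpers rather than routines from the listing.

```c
/* Hypothetical stand-in for the time structure: whole seconds since
 * the epoch plus the millisecond part of the current second.
 */
typedef struct {
    long           time;
    unsigned short millitm;
} Stamp;

/* Milliseconds elapsed between two time stamps. */
long elapsed_ms(Stamp start, Stamp now)
{
    return (now.time - start.time) * 1000L
         + ((long)now.millitm - (long)start.millitm);
}

/* True once at least delay_ms milliseconds have passed since start. */
int delay_expired(Stamp start, Stamp now, long delay_ms)
{
    return elapsed_ms(start, now) >= delay_ms;
}
```

Converting both fields to a single millisecond count makes the comparison with a millisecond delay a one-line test, even when the interval crosses a second boundary.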
Listing 4.11. yydebug.c Stack-Window Maintenance and the Control Loop
664  PUBLIC void yy_pstack( do_refresh, print_it )
665
666  int do_refresh;     /* redraw entire window rather than update   */
667  int print_it;       /* if true, print the stack to the log file  */
668  {
669      /* Print the state, debug, and value stacks.
670       *
671       * The library routine yypstk() (which returns an empty string by default)
672       * is called to print value stack items. It should return a pointer to a
673       * string that represents the value-stack contents. Just provide a similarly
674       * named routine to print your real value stack. The LLAMA parser passes
675       * garbage to yypstk(), so use the default routine in LLAMA. The OCCS parser
676       * passes yypstk() a pointer to a value-stack item and it should return a
677       * pointer to a string representing attributes of interest. The line should
678       * not contain any newlines and it should be at most 58 characters long.
679       * If do_refresh is true, the entire stack window is redrawn, otherwise
680       * only those parts of the window that have been changed are modified.
681       */
682
683      int   numele;                    /* # of elements on the stack         */
684      int   *toss;                     /* top of state stack                 */
685      char  **tods;                    /* top of debug stack                 */
686      char  *tovs;                     /* top of value stack                 */
687      int   *state;                    /* current state-stack pointer        */
688      char  **debug;                   /* current debug-stack pointer        */
689      char  *value;                    /* current value-stack pointer        */
690      int   width;                     /* Width of column in horiz. stack    */
691      static int times_called = -1;    /* # of times this subroutine called  */
692      char  *p;
693      int   i;
694
695      state  = *P_sp;
696      debug  = *P_dsp;
697      numele = Depth - (state - Sstack);
698      value  = (Vstack + (Depth - numele) * Vsize);
699
700      if( Log && !No_stack_pix && print_it )
701      {
702          /* Print the stack contents out to the log file. */
703
704          if( !Horiz_stack_pix )
705          {
706              fprintf( Log, "   +---+------------------+\n" );
707              if( numele <= 0 )
708                  fprintf( Log, "   *************** Stack is empty.\n" );
709              else
710              {
711                  toss = state;
712                  tods = debug;
713                  tovs = value;
714                  for( i = numele; --i >= 0; ++toss, ++tods, tovs += Vsize )
715                      fprintf( Log, "%3d|%3d| %16.16s | %1.52s\n",
716                               toss - Sstack, *toss, *tods, yypstk(tovs, *tods) );
717              }
718              fprintf( Log, "   +---+------------------+\n" );
719          }
720          else
721          {
722              if( state < Sstack )
723                  fprintf( Log, "*** Stack empty ***\n" );
724              else
725              {
726                  /* Print horizontal stack pictures. Note that you have to go
727                   * through the stack from bottom to top to get the top-of-stack
728                   * element on the right.
729                   */
730
731                  for( i = 0; i <= 2; ++i )
732                  {
733                      if( !Parse_pix && i == 0 ) continue;
734                      if( !Sym_pix   && i == 1 ) continue;
735                      if( !Attr_pix  && i == 2 ) continue;
736
737                      switch( i )
738                      {
739                      case 0: fprintf( Log, " PARSE  " ); break;
740                      case 1: fprintf( Log, " SYMBOL " ); break;
741                      case 2: fprintf( Log, " ATTRIB " ); break;
742                      }
743
744                      toss = Sstack + (Depth - 1);
745                      tods = Dstack + (Depth - 1);
746                      tovs = Vstack + ((Depth - 1) * Vsize);
747
748                      for(; toss >= state; --toss, --tods, tovs -= Vsize )
749                      {
750                          /* Find width of the column. I'm assuming that the
751                           * numbers on the state stack are at most 3 digits
752                           * long, if not, change the 3, below.
753                           */
754
755                          p     = yypstk( tovs, *tods );
756                          width = 3;
757
758                          if( Sym_pix  ) width = max( width, strlen(*tods) );
759                          if( Attr_pix ) width = max( width, strlen(p)     );
760
761                          switch( i )
762                          {
763                          case 0: fprintf( Log, "%-*d ", width, *toss ); break;
764                          case 1: fprintf( Log, "%-*s ", width, *tods ); break;
765                          case 2: fprintf( Log, "%-*s ", width, p     ); break;
766                          }
767                      }
768
769                      fputc( '\n', Log );
770                  }
771              }
772          }
773      }
774
775      if( !Interactive )
776      {
777          if( ++times_called % 25 == 0 )
778          {
779              wprintw( Stack_window, "working: %d\r", times_called );
781          }
782          return;
783      }
784
785      if( *S_breakpoint && state < Sstack + Depth )
786      {
787          /* Break if the breakpoint is a digit and the top-of-stack item has that
788           * value, or if the string matches the string currently at the top of
789           * the symbol stack.
790           */
791
792          if( isdigit(*S_breakpoint) )
793          {
794              if( atoi(S_breakpoint) == *state )
795                  Singlestep = 1;
796          }
797          else if( !strcmp(S_breakpoint, *debug) )
798              Singlestep = 1;
799      }
800
801 i f ( d o _ r e f r e s h )
802 y y _ r e d r a w _ s t a c k ( ) ; / * Redr aw ent i r e st ack * /
803
804 el se i f ( numel e > Onumel e )
805 {
806 / * The st ack has gr own. Redr aw onl y t hose par t s of t he st ack t hat have
807 * changed. ( I ' massumi ng onl y by one el ement . ) The mai n di f f i cul t y
808 * her e i s t hat onl y t he t op f ew el ement s of a l ar ge st ack ar e
809 * di spl ayed. Consequent l y, t he st ack wi ndow may have to scr ol l up
810 * o r down a l i ne i f t he st ack si ze i s hover i ng ar ound t he wi ndow si ze.
811 * Ther e' s no por t abl e way t o scr ol l t he wi ndow up under UNI X cur ses, so
812 * we have t o r edr aw t he st ack t o scr ol l up i n t hi s si t uat i on. We' l l
813 * over wr i t e t he t op el ement wi t h i t sel f by t he wpr i nt w() cal l , but
814 * t hat ' s no bi g deal , and i t si mpl i f i es t he code.
815 * /
816
817 i f ( numel e > S t a c k s i z e )/ * scr ol l down, openi ng up t op l i ne */
818 {
819 MS( w s c r o l l ( St ac k_wi ndow, - 1 ) ; )
820 UX( y y _ r e d r a w _ s t a c k ( ) ; )
821
822 wmove( St ac k_wi ndow, 0, 0 ) ;
823 }
824 el se
825 wmove( St ac k_wi ndow, S t a c k s i z e - nume l e , 0 ) ;
826
827 w p r i n t w ( St ac k_wi ndow, "%3d%c %16. 16s %c %1. 52s ",
828 * s t a t e , VERT, *debug, VERT, y y p s t k ( v a l u e , *debug) ) ;
830 }
831 el se
832 {
833 / * The st ack has shr unk, per haps by sever al el ement s. Remove t hemone at
834 * a t i me. ( I t ' s t oo conf usi ng i f sever al el ement s di sappear f r omt he
835 * st ack at once. I t ' s best t o wat ch t hemgo one at a t i me. ) I f t he
836 * number of el ement s on t he st ack (i ) i s gr eat er t han t he wi ndow si ze,
837 * you can pop an el ement by scr ol l i ng up and t hen wr i t i ng i n a new
838 * bot t oml i ne, ot her wi se, j ust go t o t he cor r ect l i ne and er ase i t.
839 * Do a r ef r esh af t er each pop.
840 * /
841
842 f o r ( i = Onumel e; i > nume l e ; i )
843 {
844 i f ( i > S t a c k s i z e )
845 {
846 / * Do a pop by scr ol l i ng up, t he easi est way t o scr ol l i s t o
847 * move t o t he r i ght edge of t he bot t oml i ne and t hen i ssue
848 * a newl i ne. Af t er t he scr ol l , over wr i t e t he now- bl ank bot t om
849 * l i ne wi t h t he appr opr i at e st ack i nf or mat i on. The i nvol ved
850 * expr essi on t hat i s t he f i r st ar gument t o yypst k i s doi ng:
851 * ( Vst ack +Dept h) [ - i + St acksi ze ]
852 * I t must do t he poi nt er ar i t hmet i c expl i ci t l y, however , by
853 * mul t i pl yi ng by t he si ze of one val ue- st ack i t em ( Vsi ze) .
854 */
855
856 wmove ( St ac k_wi ndow, S t a c k s i z e - 1 , 77 ) ;
857 NEWLINE ( St ac k_wi ndow ) ;
858 wpr i nt w ( St ac k_wi ndow, "%3d%c %16. 16s %c %1. 52s ",
859 ( S s t a c k + D e p t h ) [ - i + S t a c k s i z e ] , VERT,
860 ( Ds t a c k + D e p t h ) [ - i + S t a c k s i z e ] , VERT,
861 y y p s t k ( ( Vs t a c k + ( D e p t h * V s i z e ) ) +
862 ( ( - i + S t a c k s i z e ) * V s i z e ) ,
863 ( Ds t a c k + D e p t h ) [ - i + S t a c k s i z e ] )
864 ) ;
865 }
866 el se
867 {
868 wmove ( St ac k_wi ndow, S t a c k s i z e - i , 0 ) ;
869 w c l r t o e o l ( St ac k_wi ndow ) ;
870 }
872 }
873 }
874 d e l a y ( ) ;
876 Onumel e = numel e;
877 }
878
879  /*----------------------------------------------------------------------*/
880
881  PUBLIC void yy_redraw_stack()
882  {
883      /* Redraw the entire stack screen by writing out the top Stacksize elements
884       * of the stack in the stack window. Note that scrolling is turned off so
885       * that the screen won't scroll when you print the last line. Unlike
886       * yy_pstack(), this routine won't pause for a command.
887       */
888
889      int   i;
890      int   numele;                 /* Number of elements on the stack   */
891      int   *state  = *P_sp;        /* Pointer to top of state stack     */
892      char  **debug = *P_dsp;       /* Pointer to top of debug stack     */
893      char  *value;                 /* Pointer to top of value stack     */
894
895      werase  ( Stack_window );
896      scrollok( Stack_window, FALSE );
897
898      numele = Depth - (state - Sstack);
899      value  = Vstack + ((Depth - numele) * Vsize);
900
901      wmove( Stack_window, numele <= Stacksize ? Stacksize - numele : 0, 0 );
902
903      for( i = min(Stacksize, numele); --i >= 0; ++state, ++debug, value += Vsize )
904          wprintw( Stack_window, "%3d%c %16.16s %c %1.52s\n",
905                                   *state, VERT,
906                                   *debug, VERT, yypstk(value, *debug) );
907
908      scrollok( Stack_window, TRUE );
909  }
Listing 4.12. yydebug.c Delay and Main Control Loop
910  PRIVATE void delay()
911  {
912      /* Print a prompt and wait for either a carriage return or another command.
913       * Note that the time returned by time() is the time, in seconds,
914       * from 00:00:00, January 1, 1970 GMT. Since there are roughly 31,557,600
915       * seconds in a year (365.25 * 24 * 60 * 60) and the (signed)
916       * 32-bit long int can hold 2,147,483,647, the time won't roll over until
917       * January 18, 2038 at 2:56:02 A.M. Don't use this program on January 18,
918       * 2038 at 2:56:02 A.M.
919       */
920
921      long          start, current;
922      char          buf[80];
923      int           print_lines;
924      struct timeb  time_buf;           /* defined in sys/timeb.h */
925
926      if( !Interactive )                /* n command (noninteractive) issued */
927          return;
928
929      if( !Singlestep && kbhit() )
930      {
931          /* If we're not single stepping (a 'g' command has been issued) and
932           * there's a key pressed, stop single stepping and get the character.
933           */
934          input_char();
935          Singlestep = 1;
936      }
937
938      if( !Singlestep )
939      {
940          /* If we're still doing a go command (no key was found in the previous
941           * if statement), then delay for a while. Must use two if statements
942           * here because we don't want to delay if we've just stopped go-ing.
943           * If a key is hit while we're delaying, stop looping immediately and
944           * revert back to single-step mode.
945           */
946
947          ftime( &time_buf );
948          start = (time_buf.time * 1000) + time_buf.millitm;
949
950          while( 1 )
951          {
952              ftime( &time_buf );
953              current = (time_buf.time * 1000) + time_buf.millitm;
954
955              if( current - start >= Delay )
956                  break;
957
958              if( kbhit() )            /* If a key is hit, stop delaying  */
959              {                        /* and revert back to single-step  */
960                  input_char();        /* mode.                           */
961                  Singlestep = 1;
962                  break;
963              }
964          }
965          if( !Singlestep )    /* If we're still not single stepping, then */
966              return;          /* we're done (don't get a command),        */
967                               /* otherwise, fall out of this block and    */
968                               /* enter the command loop, below.           */
969      }
970
971      while( 1 )
972      {
973          yyprompt( "Enter command (space to continue, ? for list): ", buf, 0 );
974
975          switch( *buf )
976          {
977          case '\0':
978          case ' ' :
979          case '\n':                              /* singlestep */
980              goto outside;
981
982          case '?':                               /* help */
983              cmd_list();
984              NEWLINE( Prompt_window );
985              presskey();
986              yy_redraw_stack();
987              break;
988
989          case 'a':                               /* abort */
990              Abort      = 1;
991              Singlestep = 0;
992              goto outside;
993
994          case 'b':                               /* breakpoints */
995              breakpoint();
996              yy_redraw_stack();
997              break;
998
999          case 'd':                               /* set delay time */
1000
1001             if( yyprompt( "Delay time (in seconds, CR=0, ESC cancels): ", buf, 1 ) )
1002                 Delay = (long)( atof(buf) * 1000.0 );
1003             break;
1004
1005         case 'f':                               /* read file */
1006
1007             if( !yyprompt( "Print line numbers? (y/n, CR=y, ESC cancels): ",
1008                                                                     buf, 0 ) )
1009                 break;
1010
1011             print_lines = *buf != 'n';
1012             if( !yyprompt( "File name or ESC to cancel: ", buf, 1 ) )
1013                 break;
1014
1015             werase( Stack_window );
1016             display_file( buf, sizeof(buf), print_lines );
1017             yy_redraw_stack();
1018             break;
1019
1020         case 'g':                               /* go! */
1021             Singlestep = 0;
1022             goto outside;
1023
1024         case 'i':
1025             if( yyprompt( "Input file name or ESC to cancel: ", buf, 1 ) )
1026                 new_input_file( buf );
1027             break;
1028
1029         case 'l':                               /* enable logging */
1030             to_log( buf );
1031             break;
1032
1033         case 'N':                               /* noninteractive w/o logging */
1034             Log          = NULL;
1035             No_stack_pix = 1;
1036             Interactive  = 0;
1037             Singlestep   = 0;
1038             Delay        = 0L;
1040             goto outside;
1041
1042         case 'n':                               /* noninteractive mode w/ log */
1043             if( !Log && !to_log( buf ) )
1044                 break;
1045             Interactive = 0;
1046             Singlestep  = 0;
1047             Delay       = 0L;
1049             goto outside;
1050
1051         case 'q':                               /* exit to operating system */
1052             raise( SIGINT );                    /* as if Ctrl-C was entered */
1053             exit( 0 );
1054
1055         case 'r':                               /* redraw the stack window */
1056             yy_redraw_stack();
1057             break;
1058
1059         case 'w':                               /* write screen to file */
1060             if( yyprompt( "Output file name or ESC to cancel: ", buf, 1 ) )
1061                 write_screen( buf );
1062             break;
1063
1064         case 'x':                               /* show lexemes */
1065             yycomment( "current  [%0.*s]\n", yyleng,       yytext     );
1066             yycomment( "previous [%0.*s]\n", ii_plength(), ii_ptext() );
1067             break;
1068
1069         case 0x01: yyhook_a(); break;    /* Ctrl-A debugger hook (see text) */
1070         case 0x02: yyhook_b(); break;    /* Ctrl-B                          */
1071
1072         default:
1073             yyprompt( "Illegal command, press any key to continue", buf, 0 );
1074             break;
1075         }
1076     }
1077 outside:
1078     werase  ( Prompt_window );
1079     wrefresh( Prompt_window );
1080 }
1081
1082 /*----------------------------------------------------------------------*/
1083
1084 PRIVATE void cmd_list()
1085 {
1086     /* Print a list of commands in the stack window & prompt for an action. */
1087
1088     werase  ( Stack_window );
1089     wmove   ( Stack_window, 0, 0 );
1090     wprintw ( Stack_window, "a (a)bort parse by reading EOI \n" );
1091     wprintw ( Stack_window, "b modify or examine (b)reakpoint \n" );
1092     wprintw ( Stack_window, "d set (d)elay time for go mode \n" );
1093     wprintw ( Stack_window, "f read (f)ile \n" );
1094     wprintw ( Stack_window, "g (g)o (any key stops parse) \n" );
1095     wprintw ( Stack_window, "i change (i)nput file \n" );
1096     wmove   ( Stack_window, 0, 39 );
1097     wprintw ( Stack_window, "l (l)og output to file" );
1098     wmove   ( Stack_window, 1, 39 );
1099     wprintw ( Stack_window, "n (n)oninteractive mode" );
1100     wmove   ( Stack_window, 2, 39 );
1101     wprintw ( Stack_window, "q (q)uit (exit to dos)" );
1102     wmove   ( Stack_window, 3, 39 );
1103     wprintw ( Stack_window, "r (r)efresh stack window" );
1104     wmove   ( Stack_window, 4, 39 );
1105     wprintw ( Stack_window, "w (w)rite screen to file or device\n" );
1106     wmove   ( Stack_window, 5, 39 );
1107     wprintw ( Stack_window, "x Show current and prev. le(X)eme\n" );
1108     wmove   ( Stack_window, 7, (78-29)/2 );
1109     wprintw ( Stack_window, "Space or Enter to single step" );
1110     wrefresh( Stack_window );
1111 }
One point of note is the Ctrl-A and Ctrl-B commands, processed on lines 1069 and 1070
of Listing 4.12. These commands call the yyhook_a() and yyhook_b() subroutines
(in Listings 4.13 and 4.14), which do nothing at all. Their purpose is twofold. First, if
you are running the parser under your compiler's debugger, these commands give you a
hook into that debugger. You can set a breakpoint at yyhook_a(), and then issue a
Ctrl-A to transfer from the running parser to the debugger itself. Second, these hooks let
you add commands to the parser without having to recompile it. Since yyhook_a()
and yyhook_b() are alone in separate files, customized versions that you supply are
linked rather than the default ones in the library. You can effectively add commands to
the parser's debugging environment by providing your own version of one or both of
these routines. For example, it's convenient to print the symbol table at various states of
the compilation process so you can watch symbols being added. You can add this
capability to the parser by writing a routine that prints the symbol table and calling that
routine yyhook_a(). Thereafter, you can print the symbol table by issuing a Ctrl-A at the
parser's command prompt.
Listing 4.13. yyhook_a.c Debugger Hook 1
1    void yyhook_a() { }     /* entered with a Ctrl-A command */
Listing 4.14. yyhook_b.c Debugger Hook 2
1    void yyhook_b() { }     /* entered with a Ctrl-B command */
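As a rough illustration of the second use, a replacement yyhook_a() might look like the following sketch. The symbol-table layout, the Symtab and Nsyms variables, and the add_symbol() helper are all hypothetical stand-ins invented for this example; a real compiler would substitute its own table. Writing to stderr is also just an assumption made to keep the sketch short. Inside the windowed debugger you would more likely route the output through yycomment() or the log file.

```c
#include <stdio.h>

#define MAXSYM 16

/* Hypothetical symbol-table entry; the real compiler's table will differ. */
struct symbol
{
    const char *name;
    int         level;      /* nesting level at which the name was declared */
};

static struct symbol Symtab[ MAXSYM ];   /* hypothetical table             */
static int           Nsyms = 0;          /* number of entries in use       */

/* Hypothetical helper: add a symbol and return the new symbol count. */
int add_symbol( const char *name, int level )
{
    if( Nsyms < MAXSYM )
    {
        Symtab[ Nsyms ].name  = name;
        Symtab[ Nsyms ].level = level;
        ++Nsyms;
    }
    return Nsyms;
}

/* Replacement for the library's empty yyhook_a(). When this file is linked
 * with the parser, the library version is not pulled in, so a Ctrl-A at the
 * command prompt prints the table.
 */
void yyhook_a( void )
{
    int i;

    for( i = 0; i < Nsyms; ++i )
        fprintf( stderr, "%-16s level %d\n", Symtab[i].name, Symtab[i].level );
}
```

Because the hook takes no arguments, anything it prints has to be reachable through global (or file-static) data, which is why the table in this sketch is not passed in as a parameter.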
Listing 4.15 contains support for the input and production breakpoints. (Stack
breakpoints were handled in yy_pstack(); the code of interest starts on line 785 of Listing
4.10, page 258.) The yy_nextoken() function (starting on line 1112 of Listing 4.15)
gets an input token from yylex(), though if Abort is true, then an a command has
been issued, and the end-of-input marker (0) is used rather than the next token. The
routine then echoes the new token to the input window, and triggers a breakpoint (on line
1142) if necessary, by setting Singlestep true. Singlestep causes delay() to wait
for a command the next time it's called (after the next stack-window update).
yy_break() (on line 1167) does the same thing, but for production breakpoints. It is
called from the LLama-generated parser just before every production is applied (when a
nonterminal on the stack is replaced by its right-hand side). In occs, it's called just
before a reduction takes place. breakpoint() (on line 1187 of Listing 4.15) takes
care of setting the breakpoints, and so forth. It processes the b command.
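The input-breakpoint test is compact enough to isolate. The sketch below restates it as a free-standing function, under the assumption that a breakpoint is stored as a string that is empty when no breakpoint is set. The function name input_break_hit() and its parameters are invented for this example and do not appear in yydebug.c.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if the input breakpoint bp matches the token just read, 0 if not.
 * bp is "" when no breakpoint is set. tok is the token's numeric value,
 * lexeme is its text, and tokname is its symbolic name.
 */
int input_break_hit( const char *bp, int tok,
                     const char *lexeme, const char *tokname )
{
    if( !*bp )                              /* no breakpoint set            */
        return 0;

    if( isdigit( (unsigned char)*bp ) && tok == atoi( bp ) )
        return 1;                           /* numeric token value matches  */

    return strcmp( lexeme,  bp ) == 0       /* lexeme matches, or           */
        || strcmp( tokname, bp ) == 0;      /* token name matches           */
}
```

Note that a numeric breakpoint such as "4" is tried first as a token value and then, if that fails, as a lexeme, so it also stops on an input token whose text is 4.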
The remainder of the file (in Listing 4.16) comprises little support functions which
are adequately commented and don't require additional discussion here.
Listing 4.15. yydebug.c Breakpoint Support
1112 PUBLIC int yy_nextoken()
1113 {
1114     /* Input a token from yylex() and echo to both the token and comment
1115      * windows. Yy_comment() writes to the log file too. Break if the input
1116      * breakpoint is set and the token has just been read. The token
1117      * name is centered in the token window if it's short enough. It's
1118      * truncated at TOKEN_WIDTH characters otherwise.
1119      */
1120
1121     static int   tok = -1;            /* current token                */
1122     int          width;
1123     char         *str;
1124     char         *lexeme;
1125     char         buf[ TOKEN_WIDTH ];
1126     extern char  *Yy_stok[];          /* Generated by occs and llama  */
1127
1128     if( tok >= 0 && (Interactive || Log) )
1129         yycomment( "Advance past %s\n", Yy_stok[tok] );
1130
1131     lexeme = ((tok = Abort ? 0 : yylex()) == 0) ? "" : yytext;
1132
1133     if( Interactive || Log )
1134         yycomment( "Read %s <%s>\n", str = Yy_stok[tok], lexeme );
1135
1136     if( Interactive )
1137     {
1138         NEWLINE( Token_window );
1139         concat ( TOKEN_WIDTH, buf, str, " ", lexeme, NULL );
1140         wprintw( Token_window, "%0.*s", TOKEN_WIDTH - 2, buf );
1141
1142         if( L_breakpoint != -1 && L_breakpoint <= yylineno )
1143         {
1144             L_breakpoint = -1;
1145             Singlestep   = 1;
1146             yy_pstack( 0, 1 );
1147         }
1148         else if( ( *I_breakpoint &&
1149                    (   ( isdigit(*I_breakpoint) && tok == atoi(I_breakpoint) )
1150                     || !strcmp( lexeme,       I_breakpoint )
1151                     || !strcmp( Yy_stok[tok], I_breakpoint )
1152                    )
1153                  )
1154                )
1155         {
1156             Singlestep = 1;
1157             yy_pstack( 0, 1 );
1158         }
1159     }
1160
1161     delay();
1162     return tok;
1163 }
1164
1165 /*----------------------------------------------------------------------*/
1166
1167 PUBLIC void yy_break( production_number )
1168 int production_number;
1169 {
1170     /* Handles production-number breakpoints. If a break is required, start
1171      * single stepping and print the stack. Stack breakpoints are handled in
1172      * yy_pstack() and input breakpoints are done in yy_nextoken().
1173      *
1174      * If production_number == -1, a break is forced regardless of the value
1175      * of P_breakpoint;
1176      */
1177
1178     if( production_number == P_breakpoint || production_number == -1 )
1179     {
1180         Singlestep = 1;
1181         yy_pstack( 0, 1 );
1182     }
1183 }
1184
1185 /*----------------------------------------------------------------------*/
1186
1187 PRIVATE int breakpoint()
1188 {
1189     /* Set up a breakpoint by prompting the user for any required information.
1190      * Return true if we have to redraw the stack window because a help screen
1191      * was printed there.
1192      */
1193
1194     int   type;
1195     char  **p;
1196     char  buf[80];
1197     int   rval = 0;
1198     static char *text[] =
1199     {
1200         "Select a breakpoint type (i,l,p,or s) or command (c or l):",
1201         "Type:  Description:             Enter breakpoint as follows:",
1202         " i     input .................. number for token value",
1203         "                                or string for lexeme or token name",
1204         " l     input line read ........ line number",
1205         " p     reduce by production ... number for production number",
1206         " s     top-of-stack symbol .... number for state-stack item",
1207         "                                or string for symbol-stack item",
1208         " c = clear all breakpoints",
1209         " d = display (list) all breakpoints",
1210         NULL
1211     };
1212
1213     if( !yyprompt( "Enter type or command, ? for help, ESC aborts: ", buf, 0 ) )
1214         return 1;
1215
1216     if( *buf == '?' )
1217     {
1218         rval = 1;
1219         werase( Stack_window );
1220         wmove ( Stack_window, 0, 0 );
1221
1222         for( p = text; *p; )
1223             wprintw( Stack_window, "%s\n", *p++ );
1224
1225         wrefresh( Stack_window );
1226         if( !yyprompt( "Enter breakpoint type or command, ESC aborts: ", buf, 0 ) )
1227             return rval;
1228     }
1229
1230     if( (type = *buf) == 'p' )
1231     {
1232         if( yyprompt( "Production number or ESC to cancel: ", buf, 1 ) )
1233         {
1234             if( !isdigit( *buf ) )
1235                 yyprompt( "Must be a number, press any key to continue.", buf, 0 );
1236             else
1237                 P_breakpoint = atoi( buf );
1238         }
1239     }
1240     else if( type == 'l' )
1241     {
1242         if( yyprompt( "Input line number or ESC to cancel: ", buf, 1 ) )
1243             L_breakpoint = atoi( buf );
1244     }
1245     else if( type == 'i' || type == 's' )
1246     {
1247         if( yyprompt( "Symbol value or ESC to cancel: ", buf, 1 ) )
1248             strncpy( type == 'i' ? I_breakpoint : S_breakpoint, buf, BRKLEN );
1249     }
1250     else
1251     {
1252         switch( type )
1253         {
1254         case 'c':
1255             P_breakpoint  = -1;
1256             L_breakpoint  = -1;
1257             *S_breakpoint = 0;
1258             *I_breakpoint = 0;
1259             break;
1260
1261         case 'd':
1262             rval = 1;
1263             werase( Stack_window );
1264             wmove ( Stack_window, 0, 0 );
1265
1266             wprintw( Stack_window,
1267                      P_breakpoint == -1 ? "Production = none\n"
1268                                         : "Production = %d\n", P_breakpoint );
1269
1270             wprintw( Stack_window, "Stack      = %s\n",
1271                      *S_breakpoint ? S_breakpoint : "none" );
1272
1273             wprintw( Stack_window, "Input      = %s\n",
1274                      *I_breakpoint ? I_breakpoint : "none" );
1275             wprintw( Stack_window,
1276                      L_breakpoint == -1 ? "Input line = none\n"
1277                                         : "Input line = %d\n", L_breakpoint );
1278             wrefresh( Stack_window );
1279             NEWLINE ( Prompt_window );
1280             presskey();
1281             break;
1282
1283         default:
1284             yyprompt( "Illegal command or type, Press any key.", buf, 0 );
1285             break;
1286         }
1287     }
1288
1289     return rval;
1290 }
Listing 4.16. yydebug.c Input Routines
1291 PRIVATE int new_input_file( buf )
1292 char *buf;
1293 {
1294     /* Open up a new input file. Input must come from a file because the
1295      * keyboard is used to get commands. In theory, you can use both standard
1296      * input and the keyboard (by opening the console as another stream), but I
1297      * had difficulties doing this in a portable way, and eventually gave up.
1298      * It's not that big a deal to require that test input be in a file.
1299      */
1300
1301     NEWLINE ( Prompt_window );
1302     wrefresh( Prompt_window );
1303
1304     if( ii_newfile( buf ) != -1 )
1305         Inp_fm_file = 1;
1306     else
1307     {
1308         wprintw( Prompt_window, "Can't open %s.", buf );
1309         presskey();
1310     }
1311     return Inp_fm_file;
1312 }
1313
1314 /*----------------------------------------------------------------------*/
1315
1316 PRIVATE FILE *to_log( buf )
1317 char *buf;
1318 {
1319     /* Set up everything to log output to a file (open the log file, etc.). */
1320
1321     if( !yyprompt( "Log-file name (CR for \"log\", ESC cancels): ", buf, 1 ) )
1322         return NULL;
1323
1324     if( !*buf )
1325         strcpy( buf, "log" );
1326
1327     if( !(Log = fopen( buf, "w" )) )
1328     {
1329         NEWLINE( Prompt_window );
1330         wprintw( Prompt_window, "Can't open %s", buf );
1331         presskey();
1332         return NULL;
1333     }
1334
1335     if( !yyprompt( "Log comment-window output? (y/n, CR=y): ", buf, 0 ) )
1336         return NULL;
1337     else
1338         No_comment_pix = (*buf == 'n');
1339
1340     if( !yyprompt( "Print stack pictures in log file? (y/n, CR=y): ", buf, 0 ) )
1341         return NULL;
1342
1343     if( !(No_stack_pix = (*buf == 'n')) )
1344     {
1345         if( !yyprompt( "Print stacks horizontally? (y/n, CR=y): ", buf, 0 ) )
1346             return NULL;
1347
1348         if( Horiz_stack_pix = (*buf != 'n') )
1349         {
1350             if( !yyprompt( "Print SYMBOL stack (y/n, CR=y): ", buf, 0 ) )
1351                 return NULL;
1352             Sym_pix = (*buf != 'n');
1353
1354             if( !yyprompt( "Print PARSE stack (y/n, CR=y): ", buf, 0 ) )
1355                 return NULL;
1356             Parse_pix = (*buf != 'n');
1357
1358             if( !yyprompt( "Print VALUE stack (y/n, CR=y): ", buf, 0 ) )
1359                 return NULL;
1360             Attr_pix = (*buf != 'n');
1361         }
1362     }
1363     return Log;
1364 }
1365
1366 /*----------------------------------------------------------------------*/
1367
1368 PRIVATE int input_char( void )
1369 {
1370     /* Get a character from the input window and echo it explicitly. If we've
1371      * compiled under Unix, reset the character-available flag.
1372      */
1373
1374     int c;
1375
1376     if( (c = wgetch( Prompt_window ) & 0x7f) != ESC )
1377         waddch( Prompt_window, c );
1378
1379     UX( Char_avail = 0; )
1380     return c;
1381 }
1382
1383 /*----------------------------------------------------------------------*/
1384
1385 PUBLIC int yyprompt( prompt, buf, getstring )
1386 char *prompt, *buf;
1387 int  getstring;   /* get entire string (as compared to a single character) */
1388 {
1389     /* Print a prompt and then wait for a reply, load the typed characters into
1390      * buf. ^H (destructive backspace) is supported. An ESC causes yyprompt()
1391      * to return 0 immediately, otherwise 1 is returned. The ESC is not put
1392      * into the buffer. If "getstring" is true, an entire string is fetched and
1393      * prompt returns when the newline is typed. If "getstring" is false, then
1394      * one character is fetched and yyprompt() returns immediately after getting
1395      * that character. Leading and trailing white space is stripped from the
1396      * line if getstring is true. (You can get a single space character if it's
1397      * false, however.)
1398      */
1399
1400     register int  c;
1401     int           y, x;
1402     char          *startbuf = buf;
1403
1404     NEWLINE ( Prompt_window );
1405     wprintw ( Prompt_window, "%s", prompt );
1406     wrefresh( Prompt_window );
1407
1408     if( !getstring )
1409         c = *buf++ = input_char();
1410     else
1411     {
1412         while( (c = input_char()) != '\n' && c != ESC )
1413         {
1414             if( isspace(c) && buf == startbuf )
1415                 continue;                   /* skip leading white space */
1416
1417             if( c != '\b' )
1418                 *buf++ = c;
1419             else                            /* handle destructive backspace */
1420             {
1421                 getyx( Prompt_window, y, x );
1422
1423                 if( buf <= startbuf )
1424                     wmove( Prompt_window, y, x+1 );
1425                 else
1426                 {
1427                     waddch( Prompt_window, ' ' );
1428                     wmove ( Prompt_window, y, x );
1429                     --buf;
1430                 }
1431             }
1432             wrefresh( Prompt_window );
1433         }
1434
1435         while( isspace( buf[-1] ) && buf > startbuf )
1436             --buf;                          /* Strip trailing white space */
1437     }
1438     *buf = 0;
1439     return (c != ESC);
1440 }
1441
1442 /*----------------------------------------------------------------------*/
1443
1444 PRIVATE void presskey()
1445 {
1446     /* Ask for a key to be pressed and wait for it. Note that this command
1447      * does a refresh, but it intentionally does not clear the window before
1448      * printing the prompt.
1449      */
1450
1451     wprintw ( Prompt_window, " Press any key: " );
1452     wrefresh( Prompt_window );
1453     input_char();
1454 }
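The buffer handling in the getstring branch of yyprompt() can be tried out without curses. The following is only a sketch of that logic: edit_line() and its keys parameter are invented for this example, and the "typed" characters come from a string rather than from the prompt window, so none of the cursor movement is modeled.

```c
#include <ctype.h>
#include <string.h>

#define ESC 0x1b

/* Hypothetical stand-in for the getstring branch of yyprompt(). The typed
 * characters come from keys instead of from input_char(). Returns 0 if an
 * ESC was seen, 1 otherwise. buf must hold at least strlen(keys)+1 bytes.
 */
int edit_line( const char *keys, char *buf )
{
    char *start = buf;
    int   c     = 0;

    while( (c = (unsigned char)*keys++) != '\0' && c != '\n' && c != ESC )
    {
        if( isspace( c ) && buf == start )
            continue;                       /* skip leading white space     */

        if( c != '\b' )
            *buf++ = (char)c;
        else if( buf > start )
            --buf;                          /* destructive backspace        */
    }

    while( buf > start && isspace( (unsigned char)buf[-1] ) )
        --buf;                              /* strip trailing white space   */

    *buf = '\0';
    return c != ESC;
}
```

One deliberate difference from the listing: the trailing-space loop here tests buf > start before looking at buf[-1], so it never reads the byte in front of an empty buffer.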
4.10 LLama: Implementing an LL(1) Parser-Generator
The remainder of this chapter discusses how LLama works. Much of the following
code is used by both LLama and occs (all the files whose names start with ll are used
only by LLama, everything else is used jointly). As usual, I expect you to actually read
the code (I've not repeated in the text those things that are adequately described in
comments in the code itself). I also expect you to be familiar with the set and hash-table
functions described in Appendix A.
4.10.1 LLama's Parser
This section describes LLama's own parser. It provides a good example of a working
recursive-descent compiler for a small programming language (LLama is itself a
compiler, after all). It translates a very high-level language (a grammatical description
of a programming language) to a high-level language (C).
I built LLama in a two-step process, initially developing yet another recursive-descent
parser (the last one we'll look at in this book) that supports a bare-bones input
language. I then used this recursive-descent version of LLama to rebuild its own parser,
substituting the LLama output file for the recursive-descent parser when I linked the
new version. This process is typical of how a language is brought up on a new machine.
You first construct a simple compiler for a subset of the target language using the tools at
hand, typically an assembler. The output of this compiler is typically the same assembly
language that the compiler itself is written in. Using your subset language, you then
write a compiler for a more complete version of the language.
For example, you could start writing a C compiler with a very small subset of C,
written in assembly language. An adequate subset would support simple expressions (no
fancy operators like conditionals), global variables of type int and char (but no local
variables), one-dimensional arrays of int and char, simple control-flow statements
(while, if, and else are adequate), and some sort of block structure. Note that, at this
level, a compiler is little more than a macro translator. (See [Angermeyer], pp. 51-92,
where all of the foregoing are implemented as macros using the Microsoft macro
assembler, MASM.)
In the next step, you write a larger subset of C using the language defined in the
previous step, and compile it using the subset-of-C compiler. Typically, this second level
would add better typing (introducing structures, for example), support local variables
and subroutine arguments, and so forth. Continuing in this manner, you can bootstrap
yourself up to a full language implementation.
LLama, like most compilers, is easily divided into several distinct phases. The
lexical analyzer is created with LeX. (The specification is in Listing 4.17 and the required
token definitions are in Listing 4.18.) Note that white space (a comment is treated like
white space) is ignored if the global variable Ignore (declared on line 27) is true;
otherwise a WHITESPACE token is returned. The variable is modified using the access
routines on lines 215 and 216. Also note how actions are processed on lines 73 to 153. The
entire action is treated as a single token. LeX recognizes an initial open brace, and the
associated code absorbs everything up to and including the matching close brace. That
is, the code doesn't terminate until a close brace is found at nesting level 1, as defined by
nestlev. Other close braces are absorbed into the lexeme.
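The brace-counting idea can be sketched independently of LeX. The function below is an illustration only, not the code from parser.lex: skip_action() is a name invented here, it scans a string in memory instead of calling input(), and it does not handle comments. It counts down to zero rather than testing for a close brace at level 1, which comes to the same thing.

```c
#include <stddef.h>

/* Given s pointing just past an opening brace, return a pointer just past
 * the matching close brace, or NULL if the text ends first. Braces inside
 * string and character constants are not counted. Comments are not handled.
 */
const char *skip_action( const char *s )
{
    int nestlev = 1;        /* the open brace that has already been read    */
    int quote   = 0;        /* '"' or '\'' while inside a constant, else 0  */

    for( ; *s; ++s )
    {
        if( quote )
        {
            if( *s == '\\' && s[1] )
                ++s;                        /* skip the escaped character   */
            else if( *s == quote )
                quote = 0;                  /* end of the constant          */
        }
        else if( *s == '"' || *s == '\'' )
            quote = *s;
        else if( *s == '{' )
            ++nestlev;
        else if( *s == '}' && --nestlev == 0 )
            return s + 1;                   /* matching close brace found   */
    }
    return NULL;                            /* ran out of input             */
}
```

Everything between the original s and the returned pointer, inner braces included, belongs to the one action lexeme.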
Listing 4.17. parser.lex LeX Input File for occs/LLama Lexical Analyzer
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
%{
#i ncl ude <t ool s/ hash.h>
#i ncl ude " l l out . h"
#def i ne CREATI NG_LLAMA PARSER
#i ncl ude "par ser . h"
/
/
*
*
Suppr ess var i ous def i ni t i ons i n par ser . h
t hat conf l i ct wi t h LeX- gener at ed def s.
*
/
/
/
* _____________________________________________________________________________
*
Lexi cal anal yzer f or bot h l l ama and yacc. Not e t hat l l ama doesn' t suppor t
*
o
o
o
ori ght , %noa or
o
o They ar e r ecogni zed her e so t hat we can
*
*
pr i nt an er r or message when t hey' r e encount er ed. By t he same t oken, yacc
l l ama i nput f i l es can be
*
i gnor es t he %synch di r ect i ve. Though al l
by yacc, t he r ever se i s not t rue.
*
* _____________________________________________________________________________
*
Whi comment s, and ot her wi se i l l egal char act er s must be handl ed
*
speci al l y. When we' r e pr ocessi ng code bl ocks, we need to get at
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
*
t he char act er s so t hat t hey can be t o t he out put , ot her wi se, t he
*
*
*
*
char act er s shoul d be i gnor ed. The ws () and nows () subr out i nes (at t he
bot t omof t he f i l e) swi t ch bet ween t hese behavi or s by changi ng t he val ue
i f I gnor e. I f I gnor e i s t r ue, whi t e space i s i gnor ed.
/
PRIVATE i n t
PRIVATE i n t
I g n o r e 0;
S t a r t l i n e ; /
*
st ar t i ng l i ne number
*
/
/
* _____________________________________________________________________________
*
*
Pr ot ot ypes f or f unct i ons at t he bot t om of t hi s f i l e
/
v o i d s t r i p e r P (( c h a r * s r c ) ) ;
v o i d nows
v o i d ws
}
P ( ( v o i d )) ;
P ( ( v o i d ) ) ;
/
/
/
*
Remove car r i age r et ur ns (but not
*
*
*
l i nef eeds) f r omsr c
I gnor e whi t e space
*
e t c
*
/ * Don' t i gnor e whi t e space et c
*
/
/
/
/
c name [ A- Za- z ] [ A- Za- z 0 - 9 ]
*
%.2-
15o
"/*"    {   /* Absorb a comment (treat it as WHITESPACE) */

            int i;
            int start = yylineno;

            while( i = input() )
            {
                if( i < 0 )
                {
                    ii_unterm();
                    ii_flush(1);
                    ii_term();
                    lerror(NONFATAL, "Comment starting on line %d "
                                     "too long, truncating\n", start);
                }
                else if( i == '*' && ii_lookahead(1) == '/' )
                {
                    input();
                    stripcr( yytext );

                    if( Ignore ) goto end;
                    else         return WHITESPACE;
                }
            }

            lerror(FATAL, "End of file encountered in comment\n");
        end:;
        }
/* Suck up an entire action. Handle nested braces here.
 * This code won't work if the action is longer than the
 * buffer length used by the input functions. If this is a
 * problem, you have to allocate your own buffer and copy
 * the lexeme there as it's processed (in a manner similar to
 * the %{ processing, below). If space is really a problem,
 * the code blocks can be copied to a temporary file and the
 * offset to the start of the text (as returned by ftell())
 * can be stored rather than the string itself.
 */

\{      {
            int i;
            int nestlev;            /* brace-nesting level          */
            int lb1;                /* previous character           */
            int lb2;                /* character before that        */
            int in_string;          /* processing string constant   */
            int in_char_const;      /* processing char. constant    */
            int in_comment;         /* processing a comment         */

            lb1 = lb2     = 0;
            in_string     = 0;
            in_char_const = 0;
            in_comment    = 0;
            Start_line    = yylineno;

            for( nestlev=1; i=input(); lb2=lb1, lb1=i )
            {
                if( lb2=='\n' && lb1=='%' && i=='%' )
                    lerror( FATAL,
                            "%%%% in code block starting on line %d\n",
                            Start_line );

                if( i < 0 )                     /* input-buffer overflow */
                {
                    ii_unterm();
                    ii_flush(1);
                    ii_term();
                    lerror( FATAL,
                            "Code block starting on line %d too long.\n",
                            Start_line );
                }

                /* Take care of \{, "{", '{', \}, "}", '}' */

                if( i == '\\' )
                {
                    if( !(i = input()) )        /* discard backslash    */
                        break;
                    else
                        continue;               /* and following char   */
                }

                if( i == '"' && !(in_char_const || in_comment) )
                    in_string = !in_string;

                else if( i == '\'' && !(in_string || in_comment) )
                    in_char_const = !in_char_const;

                else if( lb1 == '/' && i == '*' && !in_string )
                    in_comment = 1;

                else if( lb1 == '*' && i == '/' && in_comment )
                    in_comment = 0;

                if( !(in_string || in_char_const || in_comment) )
                {
                    if( i == '{' )
                        ++nestlev;

                    if( i == '}' && --nestlev <= 0 )   /* matching brace found */
                    {
                        stripcr( yytext );
                        return ACTION;
                    }
                }
            }

            lerror(FATAL, "EOF in code block starting on line %d\n",
                                                        Start_line );
        }

^"%%"           return SEPARATOR;   /* Must be anchored because  */
                                    /* it can appear in a printf */
                                    /* statement.                */
"%{"[\s\t]*     {
                    /* Copy a code block to the output file. */

                    int c, looking_for_brace = 0;

                    #undef output       /* replace macro with function */
                                        /* in main.c                   */
                    if( !No_lines )
                        output( "\n#line %d \"%s\"\n",
                                        yylineno, Input_file_name );

                    while( c = input() )        /* while not at end of file  */
                    {
                        if( c == -1 )           /* buffer is full, flush it  */
                            ii_flushbuf();

                        else if( c != '\r' )
                        {
                            if( looking_for_brace )     /* last char was a % */
                            {                           /*{*/
                                if( c == '}' ) break;
                                else           output( "%%%c", c );

                                /* Reset, so that later characters are */
                                /* not prefixed with a % as well.      */
                                looking_for_brace = 0;
                            }
                            else
                            {
                                if( c == '%' ) looking_for_brace = 1;
                                else           output( "%c", c );
                            }
                        }
                    }
                    return CODE_BLOCK;
                }

<{c_name}>      return FIELD;           /* for occs only            */
"%union"        return PERCENT_UNION;   /* for occs only            */
"%token"        |
"%term"         return TERM_SPEC;
"%type"         return TYPE;            /* for occs only            */
"%synch"        return SYNCH;           /* for llama only           */
"%left"         return LEFT;            /* for occs only            */
"%right"        return RIGHT;           /* for occs only            */
"%nonassoc"     return NONASSOC;        /* for occs only            */
"%prec"         return PREC;            /* for occs only            */
"%start"        return START;           /* for error messages only  */
":"             return COLON ;
"|"             return OR ;
";"             return SEMI ;
"["             return START_OPT ;
"]"             |
"]*"            return END_OPT ;

[^\x00-\s%\{}[\]()*:|;,<>]+     return NAME;
\x0d                            ; /* discard carriage return (\r) */
[\x00-\x0c\x0e-\s]              if( !Ignore ) return WHITESPACE;
%%
/* ---------------------------------------------------------------------- */

PUBLIC void nows() { Ignore = 1; }  /* Ignore white space, etc.       */
PUBLIC void ws  () { Ignore = 0; }  /* Don't ignore white space, etc. */

PUBLIC int start_action()       /* Return starting line number of most */
{                               /* recently read ACTION block          */
    return Start_line;
}

/* ---------------------------------------------------------------------- */

PRIVATE void stripcr( src )     /* Remove all \r's (but not \n's) from src. */
char *src;
{
    char *dest;
    for( dest = src ; *src ; src++ )
        if( *src != '\r' )
            *dest++ = *src;
    *dest = '\0';
}
LLama's parser uses the LeX-generated lexical analyzer to parse the LLama input
file, creating a symbol table with what it finds there. A symbol table is a large data
structure, indexed by symbol name (the lexeme associated with the name in the input file). A
good analogy is an array of structures, one field of which is a string holding the symbol
name. The other fields in the structure hold things like the numeric value assigned to
each symbol (the token value) and the symbol's type (in the current case, the type is
terminal, nonterminal, or action, as compared to the more usual int, long, and so forth).
LLama's symbol table also stores tokenized versions of the right-hand sides as elements of
the symbol-table entry for each nonterminal. I'll discuss the details of this process in a
moment. The LLama code-generation phase is passed the symbol table, which in this
case serves as a representation of the entire input language, and outputs code: the parse
tables needed to parse an input file in the target language. It also copies the output
parser from the template file to the output file.
The parser.h file, which starts in Listing 4.19, is #included in most of the files that
comprise both LLama and occs. It starts out on lines six to 12 with
conditional-compilation macros that work like the D() macro in <tools/debug.h> (described in
Appendix A). The argument to the LL() macro compiles only if you're making LLama
(I usually define LLAMA on the compiler's command line with a cc -DLLAMA); the
opposite applies to the OX() macro. The listing also contains a few predefined exit codes.
Listing 4.18. llout.h: LLama Token Definitions
#define _EOI_          0
#define ACTION         1
#define CODE_BLOCK     2
#define COLON          3
#define END_OPT        4
#define FIELD          5
#define LEFT           6
#define NAME           7
#define NONASSOC       8
#define OR             9
#define OTHER          10
#define PREC           11
#define RIGHT          12
#define SEMI           13
#define SEPARATOR      14
#define START          15
#define START_OPT      16
#define SYNCH          17
#define TERM_SPEC      18
#define TYPE           19
#define PERCENT_UNION  20
#define WHITESPACE     21
Listing 4.19. parser.h: Compilation Directives and Exit Stati
 1  /* PARSER.H   This file contains those #defines, etc., that
 2   *            are used by both llama and yacc. There's also
 3   *            a yacc.h file that's used only by the yacc code.
 4   *
 5   */
 6  #ifdef LLAMA
 7  #define LL(x) x
 8  #define OX(x)
 9  #else
10  #define LL(x)
11  #define OX(x) x
12  #endif
13
14  /* ----------------------------------------------------------------------
15   * Various error exit stati. Note that other errors found while parsing cause
16   * llama to exit with a status equal to the number of errors (or zero if
17   * there are no errors).
18   */
19
20  #define EXIT_ILLEGAL_ARG  255   /* Illegal command-line switch       */
21  #define EXIT_TOO_MANY     254   /* Too many command-line args        */
22  #define EXIT_NO_DRIVER    253   /* Can't find llama.par              */
23  #define EXIT_OTHER        252   /* Other error (syntax error, etc.)  */
24  #define EXIT_USR_ABRT     251   /* Ctrl-Break                        */
Parser.h continues in Listing 4.20 with the definitions of the numbers that represent
terminals, nonterminals, and actions. You can change the values of MINNONTERM and
MINACT (on lines 29 and 30), but they must always be in the same relative order. The
following must hold:

    (0 = _EOI_) < (MINTERM = 1) < MINNONTERM < MINACT

Zero is reserved for the end-of-input marker (_EOI_), and MINTERM must be 1. Also
note that there must be at least one hole between the maximum terminal and minimum
nonterminal values. This hole is generated with the -2 in the MAXTERM definition on
line 41. It's required because the symbolic value used for ε is one more than the largest
number actually used to represent a terminal symbol; the hole guarantees that there's
enough space for the ε. (EPSILON is defined on line 65.) The listing finishes with a few
miscellaneous definitions for various file names, and so forth.
Listing 4.20. parser.h: Numeric Limits for Token Values
 25  #define MAXNAME       32    /* Maximum length of a terminal or nonterminal name   */
 26  #define MAXPROD       512   /* Maximum number of productions in the input grammar */
 27
 28  #define MINTERM       1     /* Token values assigned to terminals start here      */
 29  #define MINNONTERM    256   /* nonterminals start here                            */
 30  #define MINACT        512   /* acts start here                                    */
 31
 32  /* Maximum numeric values used for terminals
 33   * and nonterminals (MAXTERM and MAXNONTERM), as
 34   * well as the maximum number of terminals and
 35   * nonterminals (NUMTERMS and NUMNONTERMS).
 36   * Finally, USED_TERMS and USED_NONTERMS are
 37   * the number of these actually in use (i.e.
 38   * were declared in the input file).
 39   */
 40
 41  #define MAXTERM       (MINNONTERM -2)
 42  #define MAXNONTERM    (MINACT     -1)
 43
 44  #define NUMTERMS      ((MAXTERM-MINTERM) +1)
 45  #define NUMNONTERMS   ((MAXNONTERM-MINNONTERM)+1)
 46
 47  #define USED_TERMS    ((Cur_term    - MINTERM)    +1)
 48  #define USED_NONTERMS ((Cur_nonterm - MINNONTERM) +1)
 49
 50  /* These macros evaluate to true if x represents
 51   * a terminal (ISTERM), nonterminal (ISNONTERM)
 52   * or action (ISACT)
 53   */
 54
 55  #define ISTERM(x)     ((x) && (MINTERM    <= (x)->val && (x)->val <= MAXTERM   ))
 56  #define ISNONTERM(x)  ((x) && (MINNONTERM <= (x)->val && (x)->val <= MAXNONTERM))
 57  #define ISACT(x)      ((x) && (MINACT     <= (x)->val                          ))
 58
 59  /* Epsilon's value is one more than the largest
 60   * terminal actually used. We can get away with
 61   * this only because EPSILON is not used until
 62   * after all the terminals have been entered
 63   * into the symbol table.
 64   */
 65  #define EPSILON       (Cur_term+1)
 66  /* The following macros are used to adjust the
 67   * nonterminal values so that the smallest
 68   * nonterminal is zero. (You need to do this
 69   * when you output the tables). ADJ_VAL does
 70   * the adjustment, UNADJ_VAL translates the
 71   * adjusted value back to the original value.
 72   */
 73
 74  #define ADJ_VAL(x)    ((x)-MINNONTERM )
 75  #define UNADJ_VAL(x)  ((x)+MINNONTERM )
 76
 77  /* ---------------------------------------------------------------------- */
 78
 79  #define NONFATAL      0     /* Values to pass to error() and lerror(), */
 80  #define FATAL         1     /* defined in main.c.                      */
 81  #define WARNING       2
 82
 83  #define DOLLAR_DOLLAR ((unsigned)~0 >>1)   /* Passed to do_dollar() to */
 84                                             /* indicate that $$ is to   */
 85                                             /* be processed.            */
 86
 87  /* ---------------------------------------------------------------------- */
 88
 89  #ifdef LLAMA                         /* Various file names:                 */
 90  #   define TOKEN_FILE "llout.h"      /* output file for token #defines      */
 91  #   define PARSE_FILE "llout.c"      /* output file for parser              */
 92  #   define SYM_FILE   "llout.sym"    /* output file for symbol table        */
 93  #   define DOC_FILE   "llout.doc"    /* LALR(1) State machine description   */
 94  #   define DEF_EXT    "lma"          /* foo.lma is default input extension  */
 95  #   define PAR_TEMPL  "llama.par"    /* template file for parser            */
 96  #   define PROG_NAME  "llama"
 97  #else
 98  #   define TOKEN_FILE "yyout.h"      /* output file for token #defines      */
 99  #   define PARSE_FILE "yyout.c"      /* output file for parser              */
100  #   define ACT_FILE   "yyact.c"      /* Used for output if -a specified     */
101  #   define TAB_FILE   "yyoutab.c"    /* output file for parser tables (-T)  */
102  #   define SYM_FILE   "yyout.sym"    /* output file for symbol table        */
103  #   define DOC_FILE   "yyout.doc"    /* LALR(1) State machine description   */
104  #   define DEF_EXT    "ox"           /* foo.ox is default input extension   */
105  #   define PAR_TEMPL  "occs.par"     /* template file for PARSE_FILE        */
106  #   define ACT_TEMPL  "occs-act.par" /* template file for ACT_FILE          */
107  #   define PROG_NAME  "occs"
108  #endif
Listing 4.21 shows several definitions used by the output functions. The typedefs
for YY_TTYPE on lines 115 and 116 are used only to predict the size of the output
transition tables. They should agree with the YY_TTYPE definitions in the
parser-template files.
The next part of parser.h, the SYMBOL and PRODUCTION structures declared in
Listing 4.22, are used for LLama's symbol table. Symbol tables tend to be the most
complex data structure in the compiler, and the LLama symbol table is no exception. Figure
4.9 shows a symbol table containing an entry for the following three productions:

    expr → term PLUS expr
         | term
    term → NUMBER {create_tmp(yytext);}

For clarity, I've left out unimportant fields, and NULL pointers are just left blank in the
picture.
The symbol table itself is made up of SYMBOL structures, shown in the dashed box in
Figure 4.9. This box is actually a hash table, put together with the functions described in
Listing 4.21. parser.h: Simple Types
109  /* The following are used to define types of the OUTPUT transition tables. The
110   * ifndef takes care of compiling the llama output file that will be used to
111   * recreate the llama input file. We must let the llama-generated definitions
112   * take precedence over the default ones in parser.h in this case.
113   */
114  #ifndef CREATING_LLAMA_PARSER
115      LL( typedef unsigned char YY_TTYPE ; )
116      OX( typedef int           YY_TTYPE ; )
117  #endif
Listing 4.22. parser.h: The SYMBOL and PRODUCTION Data Structures
118  /* SYMBOL structure (used for symbol table). Note that the name itself */
119  /* is kept by the symbol-table entry maintained by the hash function.  */
120
121  #define NAME_MAX 32                     /* Max name length + 1          */
122
123  typedef struct _symbol_
124  {
125      char           name  [ NAME_MAX ];  /* symbol name. Must be first   */
126      char           field [ NAME_MAX ];  /* %type <field>                */
127      unsigned       val;                 /* numeric value of symbol      */
128      unsigned       used;                /* symbol used on an rhs        */
129      unsigned       set;                 /* symbol defined               */
130      unsigned       lineno;              /* input line num. of string    */
131      unsigned char  *string;             /* code for actions.            */
132      struct _prod_  *productions;        /* right-hand sides if nonterm  */
133      SET            *first;              /* FIRST set                    */
134      LL( SET        *follow; )           /* FOLLOW set                   */
135  } SYMBOL;
136
137  #define NULLABLE(sym) ( ISNONTERM(sym) && MEMBER( (sym)->first, EPSILON ) )
138  /* ---------------------------------------------------------------------- */
139  /* PRODUCTION Structure. Represents right-hand sides.                     */
140
141  #define MAXRHS  31      /* Maximum number of objects on a right-hand side */
142  #define RHSBITS 5       /* Number of bits required to hold MAXRHS         */
143
144  typedef struct _prod_
145  {
146      unsigned       num;                 /* production number            */
147      SYMBOL         *rhs[ MAXRHS + 1 ];  /* Tokenized right-hand side    */
148      SYMBOL         *lhs;                /* Left-hand side               */
149      unsigned char  rhs_len;             /* # of elements in rhs[] array */
150      unsigned char  non_acts;            /* "  that are not actions      */
151      SET            *select;             /* LL(1) select set             */
152      struct _prod_  *next;               /* pointer to next production   */
153                                          /* for this left-hand side.     */
154      OX( int        prec; )              /* Relative precedence          */
155
156  } PRODUCTION;
Appendix A. A SYMBOL contains the name of the symbol (name) and the internal
numeric value (val). If the symbol is an action, the string field points at a string that
holds the code specified in the LLama input file. In the case of a nonterminal, the
Figure 4.9. Structure of the LLama Symbol Table
[Figure: the SYMBOL entries inside the dashed box, each with name, val, string, and productions fields. The legible entries are expr (val 257), term (val 258), PLUS, NUMBER (val 4), and the action {2} (val 514), whose string field holds "{rvalue(yytext);}".]
productions field points at the head of a linked list of PRODUCTION structures, one
element of the list for each right-hand side. The right-hand side is represented within the
PRODUCTION as an array of pointers to symbol-table entries for each element of the
right-hand side. The size of this array obviously limits the size of the right-hand side. I
opted for this approach rather than keeping an array of numeric values for the symbols,
because the latter would require an additional table lookup every time I had to access a
symbol on a right-hand side. Note that the PRODUCTIONs form a doubly linked list in
that the lhs field of each PRODUCTION points back to the production's left-hand side.
The next field is just the next element in the linked list.
Other fields in the PRODUCTION are the production number (num), an arbitrary but
unique number assigned to each production; rhs_len is the number of symbols in the
rhs[] array; non_acts is the number of elements in rhs[] that are not actions;
finally, select is the LL(1) select set. (I'll discuss exactly what this is in a moment.)
The prec field holds the relative precedence level of the current production if you're
compiling for occs. This number is usually the precedence level of the rightmost
terminal on the right-hand side, but it can be changed with a %prec directive.
Other fields in the SYMBOL structure are: lineno, the input line number on which
the symbol is defined; used, which is set true when a symbol is used on a right-hand side; and
set, which is set true when a symbol is defined with a %term or by appearing as a left-hand
side. It's a hard error for a symbol to be used without having been defined (set). It's just
a warning if the symbol is defined but not used, however. Actions are set implicitly by
being used. I'll discuss the first and follow fields in a moment.
The next data structure of interest is the PREC_TAB structure defined on lines 159 to
164 of Listing 4.23. This structure is used only by occs; it holds precedence and
associativity information for the terminal symbols. An array of these (defined on line 200) is
indexed by terminal value. This information could also be incorporated into the
SYMBOL structure, but it's easier to keep it as a separate table. The remainder of parser.h
comprises global-variable and subroutine declarations. Space is allocated for the
variables if ALLOCATE is defined before parser.h is #included.
Listing 4.23. parser.h: Declarations and Type Definitions
157
158  #ifdef OCCS
159  typedef struct _prectab_
160  {
161      unsigned char level;    /* Relative precedence, 0=none, 1=lowest          */
162      unsigned char assoc;    /* associativity: 'l'=left, 'r'=right, '\0'=none  */
163
164  } PREC_TAB;
165
166  #define DEF_FIELD "yy_def"  /* Field name for default field in a */
167                              /* %union.                           */
168  #endif
169  #ifdef ALLOCATE             /* If ALLOCATE is true, allocate space and */
170  #   define CLASS            /* activate the initializer, otherwise the */
171  #   define I(x) x           /* storage class is extern and the         */
172  #else                       /* initializer is gobbled up.              */
173  #   define CLASS extern
174  #   define I(x)
175  #endif                                      /* The following are set in    */
176                                              /* main.c, mostly by command-  */
177                                              /* line switches.              */
178  CLASS int   Debug            I( = 0 );      /* True for debug diagnostics  */
179  CLASS char  *Input_file_name I( = "console" ); /* Input file name          */
180  CLASS int   Make_actions     I( = 1 );      /* ==0 if -p on command line   */
181  CLASS int   Make_parser      I( = 1 );      /* ==0 if -a on command line   */
182  CLASS int   Make_yyoutab     I( = 0 );      /* ==1 if -T on command line   */
183  CLASS int   No_lines         I( = 0 );      /* Suppress #lines in output   */
184  CLASS int   No_warnings      I( = 0 );      /* Suppress warnings if true   */
185  CLASS FILE  *Output          I( = stdout ); /* Output stream.              */
186  CLASS int   Public           I( = 0 );      /* Make static symbols public  */
187  CLASS int   Symbols          I( = 0 );      /* Generate symbol table.      */
188  CLASS int   Threshold        I( = 4 );      /* Compression threshold       */
189  CLASS int   Uncompressed     I( = 0 );      /* Don't compress tables       */
190  CLASS int   Use_stdout       I( = 0 );      /* -t specified on command line */
191  CLASS int   Verbose          I( = 0 );      /* Verbose-mode output (1 for  */
192                                              /* -v and 2 for -V)            */
193
194  CLASS SYMBOL *Terms[ MINACT ];  /* This array is indexed by terminal or
195                                   * nonterminal value and evaluates to a
196                                   * pointer to the equivalent symbol-table
197                                   * entry.
198                                   */
199
200  OX( CLASS PREC_TAB Precedence[ MINNONTERM ]; ) /* Used only by occs. Holds  */
201                                                 /* relative precedence and   */
202                                                 /* associativity information */
203                                                 /* for both terminals and    */
204                                                 /* nonterminals.             */
205
206  LL( CLASS SET *Synch; )                 /* Error-recovery synchronization  */
207                                          /* set (specified with %synch).    */
208                                          /* Synch is initialized in acts.c  */
209                                          /* It is used only by llama.       */
210  CLASS char      *Template I(=PAR_TEMPL); /* Template file for the parser;  */
211                                          /* can be modified in main.c       */
212  CLASS HASH_TAB  *Symtab;                /* The symbol table itself         */
213                                          /* (initialized in yyact.c)        */
214  CLASS SYMBOL    *Goal_symbol I(=NULL);  /* Pointer to symbol-table entry   */
215                                          /* for the start (goal) symbol     */
216                                          /* The following are used          */
217                                          /* by the acts [in llact.c]        */
218  CLASS int   Cur_term        I( = MINTERM-1    ); /* Current terminal       */
219  CLASS int   Cur_nonterm     I( = MINNONTERM-1 ); /*    "    nonterminal    */
220  CLASS int   Cur_act         I( = MINACT-1     ); /*    "    action         */
221  CLASS int   Num_productions I( = 0            ); /* Number of productions  */
222
223  #undef CLASS
224  #undef I
225  /* ---------------------------------------------------------------------- */
226
227  #define outc(c) putc(c,Output)   /* Character routine to complement */
228                                   /* output() in main.c              */
LLama's parser does one thing: it loads the symbol table with the grammar
represented by the input file. The process is analogous to creating a physical parse
tree: the input language can be represented entirely by the symbol table, just as a normal
programming language can be represented by the parse tree.
An augmented grammar for the simple, recursive-descent version of the LLama
parser is shown in Table 4.19. The LLama input file in Listing 4.24 specifies the
complete input grammar for both LLama and occs. I'll build a recursive-descent LLama
first, and then build a more complete parser later, using the stripped-down version of
LLama for this purpose.
The recursive-descent parser for LLama is in Listing 4.25. It is a straightforward
representation of the grammar in Table 4.19.
Table 4.19. LLama Grammar (Small Version)

Production                                                              Implemented in
                                                                        this subroutine
spec        → definitions {first_sym()} body stuff                      yyparse()
definitions → TERM_SPEC tnames definitions                              definitions()
            | CODE_BLOCK definitions                                    definitions()
            | SYNCH snames definitions                                  definitions()
            | SEPARATOR                                                 definitions()
            | _EOI_                                                     definitions()
snames      → NAME {add_synch(yytext)} snames                           definitions()
tnames      → NAME {make_term(yytext)} tnames                           definitions()
body        → rule body                                                 body()
            | rule SEPARATOR                                            body()
            | rule _EOI_                                                body()
rule        → NAME {new_nonterm(yytext, 1)} COLON right_sides           body()
            | ε                                                         body()
right_sides → {new_rhs()} rhs OR right_sides                            right_sides()
            | {new_rhs()} rhs SEMI                                      right_sides()
rhs         → NAME {add_to_rhs(yytext, 0)} rhs                          rhs()
            | ACTION {add_to_rhs(yytext, start_action())} rhs           rhs()
Listing 4.24. parser.lma: LLama Input Specification for Itself
%{
#include <stdarg.h>
#include <tools/debug.h>
#include <tools/set.h>

#define CREATING_LLAMA_PARSER   /* Suppress various definitions in parser.h */
#include "parser.h"             /* that conflict with llama-generated defs. */

/* This file is a llama input file that creates a parser for llama, like a snake
 * eating its tail. The resulting yyparse.c file can be used in place of the
 * recursive-descent parser in llpar.c. Note that, though this file is certainly
 * easier to write than the recursive-descent version, the resulting code is
 * about 1K larger. nows() should be called before firing up the parser.
 * Most of the external subroutines called from this module are in acts.c.
 * Exceptions are:
 */

extern char *yytext;            /* Generated by lex          */
extern int  yylineno;           /* Generated by lex          */
extern void nows(), ws();       /* Declared in llama.lex.    */

#define YYSTYPE char*           /* Value-stack type.         */
%}

%token ACTION           /* {str}            */
%token CODE_BLOCK       /* %{ ... %}        */
%token COLON            /* :                */
%token END_OPT          /* ] ]*             */
%token FIELD            /* <name>           */
%token LEFT             /* %left            */
%token NAME             /* name             */
%token NONASSOC         /* %nonassoc        */
%token OR               /* |                */
%token OTHER            /* anything else    */
%token PREC             /* %prec            */
%token RIGHT            /* %right           */
%token SEMI             /* ;                */
%token SEPARATOR        /* %%               */
%token START            /* %start           */
%token START_OPT        /* [                */
%token SYNCH            /* %synch           */
%token TERM_SPEC        /* %term or %token  */
%token TYPE             /* %type            */
%token PERCENT_UNION    /* %union           */
%token WHITESPACE       /* 0 <= c <= ' '    */

%synch SEMI OR
%%
spec        : defs SEPARATOR { first_sym(); } rules end
            ;
end         : { ws(); } SEPARATOR
            | /* empty */
            ;
defs        : SYNCH snames defs
            | PERCENT_UNION ACTION { union_def( yytext ); } defs
            | TYPE                        fnames { new_field(""); } defs
            | TERM_SPEC { new_lev( 0 );   } tnames { new_field(""); } defs
            | LEFT      { new_lev( 'l' ); } pnames { new_field(""); } defs
            | RIGHT     { new_lev( 'r' ); } pnames { new_field(""); } defs
            | NONASSOC  { new_lev( 'n' ); } pnames { new_field(""); } defs
            | CODE_BLOCK    /* the block is copied out by yylex */    defs
            | START
              {
                  lerror( NONFATAL,
                          "%%start not supported by occs. The first\n"
                          "\t\t\tproduction is the start production\n" );
              }
              opt_names defs
            | /* empty */
            ;
fnames      : NAME  { new_nonterm( yytext, 0 ); } fnames
            | FIELD { new_field  ( yytext );    } fnames
            | /* empty */
            ;
tnames      : NAME  { make_term( yytext ); } tnames
            | FIELD { new_field( yytext ); } tnames
            | /* empty */
            ;
pnames      : NAME  { prec_list( yytext ); } pnames
            | FIELD { new_field( yytext ); } pnames
            | /* empty */
            ;
opt_names   : NAME opt_names
            | /* empty */
            ;
snames      : NAME { add_synch( yytext ); } snames
            | /* empty */
            ;
rules       : rule rules
            | /* empty */
            ;
rule        : NAME  { new_nonterm( yytext, 1 ); } COLON right_sides
            | FIELD { new_nonterm( yytext, 1 ); } COLON right_sides
            ;
right_sides : { new_rhs(); } rhs end_rhs
            ;
end_rhs     : OR right_sides
            | SEMI
            ;
rhs         : NAME      { add_to_rhs( yytext, 0 );              } rhs
            | FIELD     { add_to_rhs( yytext, 0 );              } rhs
            | ACTION    { add_to_rhs( yytext, start_action() ); } rhs
            | PREC NAME { prec( yytext );                       } rhs
            | START_OPT { start_opt( yytext ); }
                  rhs END_OPT { end_opt( yytext ); }
                  rhs
            | /* empty */
            ;
%%
yy_init_llama( tovs )
yyvstype *tovs;
{
    tovs->left = tovs->right = "";
}

char *yypstk( tovs, tods )
yyvstype *tovs;
char     *tods;
{
    static char buf[128];       /* static, because a pointer to it is returned */

    if( *tovs->left || *tovs->right )
    {
        sprintf( buf, "[%s,%s]", tovs->left, tovs->right );
        return buf;
    }
    return "";
}
Listing 4.25. llpar.c: Recursive-Descent Parser for LLama
#include <tools/l.h>
#include "llout.h"
#include "parser.h"

/* LLPAR.C:  A recursive-descent parser for a very stripped down llama.
 *           There's a llama input specification for a table-driven parser
 *           in llama.lma.
 */

void advance      P(( void ));              /* local  */
void lookfor      P(( int first, ... ));
void definitions  P(( void ));
void body         P(( void ));
void right_sides  P(( void ));
void rhs          P(( void ));
int  yyparse      P(( void ));              /* public */

extern int   yylineno;                      /* Created by lex */
extern char  *yytext;
extern int   yylex();

/* ---------------------------------------------------------------------- */

PUBLIC  int yynerrs;            /* Total error count */
PRIVATE int Lookahead;          /* Lookahead token   */

/* ======================================================================
 * Low-level support routines for parser:
 */

#define match(x) ((x) == Lookahead)

PRIVATE void advance()
{
    if( Lookahead != _EOI_ )
        while( (Lookahead = yylex()) == WHITESPACE )
            ;
}

/* ---------------------------------------------------------------------- */

PRIVATE void lookfor( first, ... )
int first;
{
    /* Read input until the current symbol is in the argument list. For example,
     * lookfor(OR, SEMI, 0) terminates when the current Lookahead symbol is an
     * OR or SEMI. Searching starts with the next (not the current) symbol.
     *
     * The argument list is walked with a plain pointer, so the arguments are
     * assumed to be adjacent ints in memory. This is not portable.
     */

    int *obj;

    for( advance() ;; advance() )
    {
        for( obj = &first; *obj && !match(*obj) ; obj++ )
            ;

        if( *obj )
            break;                      /* Found a match */

        if( match(_EOI_) )
            lerror( FATAL, "Unexpected end of file\n" );
    }
}

/*======================================================================
 * The Parser itself
 */

PUBLIC int yyparse()            /* spec : definitions body stuff */
{
    extern int yylineno;

    Lookahead = yylex();        /* Get first input symbol */
    definitions();
    first_sym();
    body();
}

/*----------------------------------------------------------------------*/

PRIVATE void definitions()
{
    /*                                                  implemented at:
     * definitions : TERM_SPEC  tnames definitions          1
     *             | CODE_BLOCK definitions                 2
     *             | SYNCH snames definitions               3
     *             | SEPARATOR                              4
     *             | _EOI_                                  4
     * snames      : NAME {add_synch} snames                5
     * tnames      : NAME {make_term} tnames                6
     *
     * Note that LeX copies the CODE_BLOCK contents to the output file
     * automatically on reading it.
     */

    while( !match(SEPARATOR) && !match(_EOI_) )                     /* 4 */
    {
        if( Lookahead == SYNCH )                                    /* 3 */
        {
            for( advance(); match(NAME); advance() )                /* 5 */
                add_synch( yytext );
        }
        else if( Lookahead == TERM_SPEC )                           /* 1 */
        {
            for( advance(); match(NAME); advance() )                /* 6 */
                make_term( yytext );
        }
        else if( Lookahead != CODE_BLOCK )                          /* 2 */
        {
            lerror(NONFATAL, "Ignoring illegal <%s> in definitions\n", yytext );
            advance();
        }
    }

    advance();                          /* advance past the %% */
}

/*----------------------------------------------------------------------*/

PRIVATE void body()
{
    /*                                                  implemented at:
     * body : rule body                                     1
     *      | rule SEPARATOR                                1
     *      | rule _EOI_                                    1
     * rule : NAME {new_nonterm} COLON right_sides          2
     *      : <epsilon>                                     3
     */

    while( !match(SEPARATOR) && !match(_EOI_) )                     /* 1 */
    {
        if( match(NAME) )                                           /* 2 */
        {
            new_nonterm( yytext, 1 );
            advance();
        }
        else                                                        /* 3 */
        {
            lerror(NONFATAL, "Illegal <%s>, nonterminal expected.\n", yytext );
            lookfor( SEMI, SEPARATOR, 0 );
            if( match(SEMI) )
                advance();
            continue;
        }

        if( match( COLON ) )
            advance();
        else
            lerror(NONFATAL, "Inserted missing ':'\n");

        right_sides();
    }

    ws();                       /* Enable white space (see parser.lex) */
    if( match(SEPARATOR) )
        yylex();                /* Advance past %%                     */
}

/*----------------------------------------------------------------------*/

PRIVATE void right_sides()
{
    /* right_sides : {new_rhs} rhs OR right_sides           1
     *             | {new_rhs} rhs SEMI                     2
     */

    new_rhs();
    rhs();

    while( match(OR) )                                              /* 1 */
    {
        advance();
        new_rhs();
        rhs();
    }

    if( match(SEMI) )                                               /* 2 */
        advance();
    else
        lerror(NONFATAL, "Inserted missing semicolon\n");
}

/*----------------------------------------------------------------------*/

PRIVATE void rhs()
{
    /* rhs : NAME   {add_to_rhs} rhs                        1
     *     | ACTION {add_to_rhs} rhs                        2
     */

    while( match(NAME) || match(ACTION) )
    {
        add_to_rhs( yytext, match(ACTION) ? start_action() : 0 );
        advance();
    }

    if( !match(OR) && !match(SEMI) )
    {
        lerror(NONFATAL, "illegal <%s>, ignoring rest of production\n", yytext );
        lookfor( SEMI, SEPARATOR, OR, 0 );
    }
}
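
The lookfor() routine in Listing 4.25 steps through its arguments by taking the address of the first one, which works only when the compiler lays the arguments out contiguously in memory. The following is a minimal sketch of the same skip-to-a-synchronization-token idea using <stdarg.h> instead. Everything in it is an assumption made for illustration: the token codes, the array-driven set_input() and advance_tok() stand-ins for the lexer, and the name skip_to() are hypothetical and are not part of LLama.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical token codes; the real ones come from llout.h. Zero is
 * reserved as the terminator of the argument list.
 */
enum { TOK_EOI = 1, TOK_NAME, TOK_OR, TOK_SEMI, TOK_SEPARATOR };

static const int *Input;        /* stand-in for the lexer's input   */
static int        Lookahead;    /* current lookahead token          */

/* Start reading tokens from an array that ends with TOK_EOI. */
static void set_input(const int *tokens)
{
    assert(tokens != NULL);
    Input     = tokens;
    Lookahead = *Input;
}

/* Fetch the next token, but never read past end of input. */
static void advance_tok(void)
{
    if (Lookahead != TOK_EOI)
        Lookahead = *++Input;
}

/* Skip input until Lookahead matches one of the listed tokens. The list
 * is terminated by 0 and the search starts with the next (not the
 * current) token. Returns the token found, or TOK_EOI if input ran out.
 */
static int skip_to(int first, ...)
{
    for (;;)
    {
        va_list ap;
        int     tok;

        advance_tok();

        va_start(ap, first);
        for (tok = first; tok != 0; tok = va_arg(ap, int))
            if (tok == Lookahead)
                break;
        va_end(ap);

        if (tok != 0)
            return tok;                 /* found a synchronization token */
        if (Lookahead == TOK_EOI)
            return TOK_EOI;             /* ran out of input              */
    }
}
```

A call such as skip_to(TOK_OR, TOK_SEMI, 0) plays the role of lookfor(OR, SEMI, 0). The sketch returns TOK_EOI at end of input rather than printing a fatal error, so the caller decides how to report the problem.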
The action subroutines, in Listings 4.26, 4.27, and 4.28, work as follows:

add_synch(char *yytext)    (Listing 4.28, lines 628 and 690)
    Add the symbol named by yytext to the synchronization set used for error
    recovery.

add_to_rhs(char *yytext, int isact)    (Listing 4.27, line 488)
    Add the symbol named by yytext to the current right-hand side. The isact
    argument should be true if the symbol is an action.

end_opt(char *yytext)    (Listing 4.28, line 596)
    Mark the end of an optional production. This routine processes the occs
    [...] and [...]* operators by adding extra elements to the production.

first_sym(void)    (Listing 4.27, line 380)
    This routine finds the goal symbol, which is the left-hand side of the
    first production that follows the first %% directive. The problem here is
    that, in occs, a nonterminal can be declared explicitly in a previous
    %type directive, so you can't assume that the first nonterminal that's
    declared is the goal symbol.

lerror(int error_type, char *format, ...)    (In main.c)
    This routine prints an error message. The first argument should be FATAL,
    NONFATAL, or WARNING (these are defined at the top of parser.h). If the
    error is FATAL, all files are closed and LLama is terminated; otherwise
    parsing continues. The input file name and line number are automatically
    added to the front of the message. This routine works like fprintf() in
    other respects.

make_term(char *yytext)    (Listing 4.27, line 335)
    Make an entry for a terminal symbol.

new_field(char *yytext)    (Listing 4.28, lines 675 and 793)
    Change the name of the %union field to which subsequent input symbols are
    attached.

new_lev(int associativity)    (Listing 4.28, lines 648 and 698)
    Increment the precedence level by one notch (all subsequent arguments to a
    %left, %right, or %nonassoc use this new level). Also change the current
    associativity as specified. This argument should be 'l' for left, 'r' for
    right, 'n' for none, or 0 for unspecified.

new_nonterm(char *yytext, int is_lhs)    (Listing 4.27, line 391)
    Create a symbol-table entry for a nonterminal if one doesn't already
    exist. yytext is the symbol name. is_lhs tells the subroutine whether the
    nonterminal was created implicitly (whether it was used before appearing
    on the left-hand side of a production); is_lhs is zero in this case.
    Return a pointer to the SYMBOL or NULL if an attempt is made to use a
    terminal symbol as a left-hand side.

new_rhs(void)    (Listing 4.27, line 460)
    Stop adding symbols to the current right-hand side, and start up a new
    (and empty) right-hand side.

output(char *format, ...)    (In main.c)
    Works like fprintf(Output, ...); sends the string to the current output
    stream, typically standard output, but it can be changed from the command
    line.

prec(char *yytext)    (Listing 4.28, lines 659 and 735)
    Process the argument to a %prec directive. (Change the precedence level of
    the current right-hand side from the default level to the one indicated in
    yytext.)

prec_list(char *yytext)    (Listing 4.28, lines 671 and 710)
    Process an argument to %left, %right, or %nonassoc.

start_opt(char *yytext)    (Listing 4.28, line 581)
    Mark the start of an optional or repeating portion of a right-hand side
    [see end_opt()].

union_def(char *yytext)    (Listing 4.28, lines 665 and 762)
    Create a typedef for the occs value stack, using the %union definition in
    yytext.
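
Since lerror() lives in main.c and is not shown here, the following is only a sketch of how a routine with the behavior described above could be written: it prefixes the message with the input file name and line number, formats the rest like fprintf(), counts errors, and terminates on a fatal error. All names (sk_lerror, sk_message, Sk_file, and so on), the exact prefix format, and the choice not to count warnings as errors are assumptions, not the actual LLama code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { SK_FATAL, SK_NONFATAL, SK_WARNING };   /* hypothetical severity codes */

static int         Sk_nerrs  = 0;             /* running error count         */
static const char *Sk_file   = "input.lma";   /* hypothetical input name     */
static int         Sk_lineno = 1;             /* would be set by the lexer   */

/* Build "<severity> <file>, line <n>: <message>" in buf. */
static void sk_vformat(char *buf, size_t size, int type,
                       const char *fmt, va_list args)
{
    int n = snprintf(buf, size, "%s %s, line %d: ",
                     type == SK_WARNING ? "WARNING" : "ERROR",
                     Sk_file, Sk_lineno);

    if (n >= 0 && (size_t)n < size)
        vsnprintf(buf + n, size - (size_t)n, fmt, args);
}

/* Format a message into a caller-supplied buffer (handy for testing). */
static void sk_message(char *buf, size_t size, int type, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    sk_vformat(buf, size, type, fmt, args);
    va_end(args);
}

/* Print a message to stderr; count it unless it is only a warning and
 * give up entirely if it is fatal.
 */
static void sk_lerror(int type, const char *fmt, ...)
{
    char    buf[256];
    va_list args;

    va_start(args, fmt);
    sk_vformat(buf, sizeof buf, type, fmt, args);
    va_end(args);

    fputs(buf, stderr);

    if (type != SK_WARNING)
        ++Sk_nerrs;
    if (type == SK_FATAL)
        exit(EXIT_FAILURE);
}
```

Splitting the formatting into a va_list helper lets both the printing routine and the buffer-filling routine share one implementation, because a variadic function cannot forward its own arguments to another variadic function directly.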
Listing 4.26. acts.c - LLama Action Subroutines
  2  #include <malloc.h>
  3  #include <ctype.h>
  4  #include <string.h>
  7  #include <tools/hash.h>
  8  #include <tools/compiler.h>
  9  #include <tools/l.h>
 10
 11  #include <tools/stack.h>    /* stack-manipulation macros */
 12  #undef  stack_cls           /* Make all stacks static    */
 13  #define stack_cls static
 14
 15  #include "parser.h"
 16  #include "llout.h"
 17
 18  /* ACTS.C   Action routines used by both llama and occs. These build
 19   *          up the symbol table from the input specification.
 20   */
 21
 22  void    find_problems P(( SYMBOL *sym, void *junk ));             /* local  */
 23  int     c_identifier  P(( char *name ));
 24  void    print_tok     P(( FILE *stream, char *format, int arg )); /* public */
 25  void    pterm         P(( SYMBOL *sym, FILE *stream ));
 26  void    pact          P(( SYMBOL *sym, FILE *stream ));
 27  void    pnonterm      P(( SYMBOL *sym, FILE *stream ));
 28  void    print_symbols P(( FILE *stream ));
 29  int     problems      P(( void ));
 30  void    init_acts     P(( void ));
 31  SYMBOL  *make_term    P(( char *name ));
 32  void    first_sym     P(( void ));
 33  SYMBOL  *new_nonterm  P(( char *name, int is_lhs ));
 34  void    new_rhs       P(( void ));
 35  void    add_to_rhs    P(( char *object, int is_an_action ));
 36  void    start_opt     P(( char *lex ));
 37  void    end_opt       P(( char *lex ));
 38  void    add_synch     P(( char *yytext ));
 39  void    new_lev       P(( int how ));
 40  void    prec_list     P(( char *name ));
 41  void    prec          P(( char *name ));
 42  void    union_def     P(( char *action ));
 43  int     fields_active P(( void ));
 44  void    new_field     P(( char *field_name ));
 45
 46  /*----------------------------------------------------------------------*/
 47
 48  extern  int  yylineno;                 /* Input line number--created by LeX. */
 49  PRIVATE int  Associativity;            /* Current associativity direction.   */
 50  PRIVATE int  Prec_lev = 0;             /* Precedence level. Incremented      */
 51                                         /* after finding %left, etc.,         */
 52                                         /* but before the names are done.     */
 53  PRIVATE char Field_name[NAME_MAX];     /* Field name specified in <name>.    */
 54  PRIVATE int  Fields_active = 0;        /* Fields are used in the input.      */
 55                                         /* (If they're not, then automatic    */
 56                                         /* field-name generation, as per      */
 57                                         /* %union, is not activated.)         */
 58  PRIVATE int  Goal_symbol_is_next = 0;  /* If true, the next nonterminal is   */
 59                                         /* the goal symbol.                   */
 60
 61  /*----------------------------------------------------------------------
 62   * The following stuff (that's a technical term) is used for processing nested
 63   * optional (or repeating) productions defined with the occs [] and []*
 64   * operators. A stack is kept and, every time you go down another layer of
 65   * nesting, the current nonterminal is stacked and a new nonterminal is
 66   * allocated. The previous state is restored when you're done with a level.
 67   */
 68
 69  #define SSIZE 8             /* Max. optional-production nesting level */
 70  typedef struct _cur_sym_
 71  {
 72      char        lhs_name[NAME_MAX]; /* Name associated with left-hand side. */
 73      SYMBOL      *lhs;               /* Pointer to symbol-table entry for    */
 74                                      /* the current left-hand side.          */
 75      PRODUCTION  *rhs;               /* Pointer to current production.       */
 76
 77  } CUR_SYM;
 78
 79  CUR_SYM Stack[ SSIZE ],             /* Stack and                            */
 80          *Sp = Stack + (SSIZE-1);    /* stack pointer. It's inconvenient to use */
 81                                      /* stack.h because stack is of structures. */
 82
 83  /*======================================================================
 84   * Support routines for actions
 85   */
 86
 87  PUBLIC void print_tok( stream, format, arg )
 88  FILE *stream;
 89  char *format;           /* not used here but supplied by pset() */
 90  int  arg;
 91  {
 92      /* Print one nonterminal symbol to the specified stream. */
 93
 94      if     ( arg == -1      ) fprintf( stream, "null "       );
 95      else if( arg == -2      ) fprintf( stream, "empty "      );
 96      else if( arg == _EOI_   ) fprintf( stream, "$ "          );
 97      else if( arg == EPSILON ) fprintf( stream, "<epsilon> "  );
 98      else                      fprintf( stream, "%s ", Terms[arg]->name );
 99  }
100
101  /*----------------------------------------------------------------------
102   * The following three routines print the symbol table. Pterm(), pact(), and
103   * pnonterm() are all called indirectly through ptab(), called in
104   * print_symbols(). They print the terminal, action, and nonterminal symbols
105   * from the symbol table, respectively.
106   */
107
108  PUBLIC void pterm( sym, stream )
109  SYMBOL *sym;
110  FILE   *stream;
111  {
112      int i;
113
114      if( !ISTERM(sym) )
115          return;
116
117      LL( fprintf( stream, "%-16.16s %3d\n", sym->name, sym->val ); )
118      OX( fprintf( stream, "%-16.16s %3d %2d %c <%s>\n", \
119                      sym->name, \
120                      sym->val, \
121                      Precedence[sym->val].level, \
122                      (i = Precedence[sym->val].assoc) ? i : '-', \
123                      sym->field ); )
124  }
125
126  /*----------------------------------------------------------------------*/
127
128  PUBLIC void pact( sym, stream )
129  SYMBOL *sym;
130  FILE   *stream;
131  {
132      if( !ISACT(sym) )
133          return;
134
135      fprintf( stream, "%-5s %3d,", sym->name, sym->val );
136      fprintf( stream, " line %-3d: ", sym->lineno );
137      fputstr( sym->string, 55, stream );
138      fprintf( stream, "\n" );
139  }
140
141  /*----------------------------------------------------------------------*/
142
143  PUBLIC char *production_str( prod )
144  PRODUCTION *prod;
145  {
146      /* return a string representing the production */
147
148      int         i, nchars, avail;
149      static char buf[80];
150      char        *p;
151
152      nchars = sprintf( buf, "%s ->", prod->lhs->name );
153      p      = buf + nchars;
154      avail  = sizeof(buf) - nchars - 1;
155
156      if( !prod->rhs_len )
157          sprintf( p, " (epsilon)" );
158      else
159          for( i = 0; i < prod->rhs_len && avail > 0 ; ++i )
160          {
161              nchars = sprintf( p, " %0.*s", avail-2, prod->rhs[i]->name );
162              avail -= nchars;
163              p     += nchars;
164          }
165
166      return buf;
167  }
168
169  /*----------------------------------------------------------------------*/
170
171  PUBLIC void pnonterm( sym, stream )
172  SYMBOL *sym;
173  FILE   *stream;
174  {
175      int         i;
176      char        *s;
177      PRODUCTION  *p;
178      int         chars_printed;
179
180      stack_dcl( pstack, PRODUCTION *, MAXPROD );
181
182      if( !ISNONTERM(sym) )
183          return;
184
185      fprintf( stream, "%s (%3d) %s", sym->name, sym->val,
186                          sym == Goal_symbol ? "(goal symbol)" : "" );
187
188      OX( fprintf( stream, " <%s>\n", sym->field );   )
189      LL( fprintf( stream, "\n" );                    )
190
191      if( Symbols > 1 )
192      {
193          /* Print first and follow sets only if you want really verbose output. */
194
195          fprintf( stream, "   FIRST : " );
196          pset( sym->first, print_tok, stream );
197
198          LL( fprintf( stream, "\n   FOLLOW: " );         )
199          LL( pset( sym->follow, print_tok, stream );     )
200
201          fprintf( stream, "\n" );
202      }
203
204      /* Productions are put into the SYMBOL in reverse order because it's easier
205       * to tack them on to the beginning of the linked list. It's better to print
206       * them in forward order, however, to make the symbol table more readable.
207       * Solve this problem by stacking all the productions and then popping
208       * elements to print them. Since the pstack has MAXPROD elements, it's not
209       * necessary to test for stack overflow on a push.
210       */
211
212      for( p = sym->productions ; p ; p = p->next )
213          push( pstack, p );
214
215      while( !stack_empty( pstack ) )
216      {
217          p = pop( pstack );
218
219          chars_printed = fprintf( stream, "   %3d: %s",
220                                          p->num, production_str( p ) );
221
222          LL( for( ; chars_printed <= 45; ++chars_printed )       )
223          LL(     putc( '.', stream );                            )
224          LL( fprintf( stream, "SELECT: " );                      )
225          LL( pset( p->select, print_tok, stream );               )
226
227          OX( if( p->prec )                                       )
228          OX( {                                                   )
229          OX(     for( ; chars_printed <= 60; ++chars_printed )   )
230          OX(         putc( '.', stream );                        )
231          OX(     if( p->prec )                                   )
232          OX(         fprintf( stream, "PREC %d", p->prec );      )
233          OX( }                                                   )
234
235          putc( '\n', stream );
236      }
237
238      fprintf( stream, "\n" );
239  }
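
The comment in pnonterm() describes a small trick: productions are linked in at the head of the list, so they come out newest first, and a stack is used to reverse them for printing. The fragment below is a stand-alone sketch of that idea and nothing more. The SK_PROD structure, the SK_MAXPROD limit, and sk_forward_order() are hypothetical names invented for the illustration; the real code uses PRODUCTION and the macros in stack.h.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAXPROD 16               /* hypothetical limit, like MAXPROD */

typedef struct sk_prod
{
    int             num;            /* production number             */
    struct sk_prod *next;           /* next-older production         */
} SK_PROD;

/* Copy the production numbers into out[] in creation order. The list
 * starts at the newest production, so every node is pushed first and
 * the stack is then popped, which reverses the order. Returns the
 * number of entries stored in out[].
 */
static int sk_forward_order(const SK_PROD *head, int out[], int max)
{
    const SK_PROD *stack[SK_MAXPROD];
    int            sp = 0;
    int            n  = 0;

    assert(out != NULL);

    for (; head != NULL && sp < SK_MAXPROD; head = head->next)
        stack[sp++] = head;                 /* push */

    while (sp > 0 && n < max)
        out[n++] = stack[--sp]->num;        /* pop  */

    return n;
}
```

Unlike the listing, the sketch checks the stack bound explicitly, because it cannot assume that the list is no longer than the stack.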
240
241  /*----------------------------------------------------------------------*/
242
243  PUBLIC void print_symbols( stream )
244  FILE *stream;
245  {
246      /* Print out the symbol table. Nonterminal symbols come first for the sake
247       * of the 's' option in yydebug(); symbols other than production numbers can
248       * be entered symbolically. ptab returns 0 if it can't print the symbols
249       * sorted (because there's no memory). If this is the case, try again but
250       * print the table unsorted.
251       */
252
253      putc( '\n', stream );
254
255      fprintf( stream, "---------------- Symbol table ------------------\n" );
256      fprintf( stream, "\nNONTERMINAL SYMBOLS:\n\n" );
257      if( ptab( Symtab, pnonterm, stream, 1 ) == 0 )
258          ptab( Symtab, pnonterm, stream, 0 );
259
260      fprintf( stream, "\nTERMINAL SYMBOLS:\n\n" );
261      OX( fprintf( stream, "name value prec assoc field\n" );  )
262      LL( fprintf( stream, "name value\n" );                   )
263
264      if( ptab( Symtab, pterm, stream, 1 ) == 0 )
265          ptab( Symtab, pterm, stream, 0 );
266
267      LL( fprintf( stream, "\nACTION SYMBOLS:\n\n" );          )
268      LL( if( !ptab( Symtab, pact, stream, 1 ) )               )
269      LL(     ptab( Symtab, pact, stream, 0 );                 )
270      LL( fprintf( stream, "----------------------------------------\n" ); )
271  }
272
273  /*----------------------------------------------------------------------
274   * Problems() and find_problems work together to find unused symbols and
275   * symbols that are used but not defined.
276   */
277
278  PRIVATE void find_problems( sym, junk )
279  SYMBOL *sym;
280  void   *junk;       /* not used */
281  {
282      if( !sym->used && sym != Goal_symbol )
283          error( WARNING, "<%s> not used (defined on line %d)\n",
284                                                  sym->name, sym->set );
285      if( !sym->set && !ISACT(sym) )
286          error( NONFATAL, "<%s> not defined (used on line %d)\n",
287                                                  sym->name, sym->used );
288  }
289
290  PUBLIC int problems()
291  {
292      /* Find, and print an error message, for all symbols that are used but not
293       * defined, and for all symbols that are defined but not used. Return the
294       * number of errors after checking.
295       */
296
297      ptab( Symtab, find_problems, NULL, 0 );
298      return yynerrs;
299  }
300
301  /*----------------------------------------------------------------------*/
302
303  PRIVATE int hash_funct( p )
304  SYMBOL *p;
305  {
306      if( !*p->name )
307          lerror( FATAL, "Illegal empty symbol name\n" );
308
309      return hash_add( p->name );
310  }
311
312  PUBLIC void init_acts()
313  {
314      /* Various initializations that can't be done at compile time. Call this
315       * routine before starting up the parser. The hash-table size (157) is
316       * an arbitrary prime number, roughly the number of symbols in the
317       * table.
318       */
319
378  /*----------------------------------------------------------------------*/
379
380  PUBLIC void first_sym()
381  {
382      /* This routine is called just before the first rule following the
383       * %%. It's used to point out the goal symbol;
384       */
385
386      Goal_symbol_is_next = 1;
387  }
388
389  /*----------------------------------------------------------------------*/
390
391  PUBLIC SYMBOL *new_nonterm( name, is_lhs )
392  char *name;
393  int  is_lhs;
394  {
395      /* Create, and initialize, a new nonterminal. is_lhs is used to
396       * differentiate between implicit and explicit declarations. It's 0 if the
397       * nonterminal is added because it was found on a right-hand side. It's 1 if
398       * the nonterminal is on a left-hand side.
399       *
400       * Return a pointer to the new symbol or NULL if an attempt is made to use a
401       * terminal symbol on a left-hand side.
402       */
403
404      SYMBOL *p;
405
406      if( p = (SYMBOL *) findsym( Symtab, name ) )
407      {
408          if( !ISNONTERM( p ) )
409          {
410              lerror( NONFATAL, "Symbol on left-hand side must be nonterminal\n" );
411              p = NULL;
412          }
413      }
414      else if( Cur_nonterm >= MAXNONTERM )
415      {
416          lerror( FATAL, "Too many nonterminal symbols (%d max.).\n", MAXTERM );
417      }
418      else                        /* Add new nonterminal to symbol table */
419      {
420          p = (SYMBOL *) newsym( sizeof(SYMBOL) );
421          strncpy( p->name,  name,       NAME_MAX );
422          strncpy( p->field, Field_name, NAME_MAX );
423
424          p->val = ++Cur_nonterm ;
425          Terms[Cur_nonterm] = p;
426
427          addsym ( Symtab, p );
428      }
429
430      if( p )                     /* (re)initialize new nonterminal */
431      {
432          if( Goal_symbol_is_next )
433          {
434              Goal_symbol = p;
435              Goal_symbol_is_next = 0;
436          }
437
438          if( !p->first )
439              p->first = newset();
440
441          LL( if( !p->follow )            )
442          LL(     p->follow = newset();   )
443
444          p->lineno = yylineno ;
445
446          if( is_lhs )
447          {
448              strncpy( Sp->lhs_name, name, NAME_MAX );
449              Sp->lhs      = p ;
450              Sp->rhs      = NULL;
451              Sp->lhs->set = yylineno;
452          }
453      }
454
455      return p;
456  }
457
458  /*----------------------------------------------------------------------*/
459
460  PUBLIC void new_rhs()
461  {
462      /* Get a new PRODUCTION and link it to the head of the production chain
463       * of the current nonterminal. Note that the start production MUST be
464       * production 0. As a consequence, the first rhs associated with the first
465       * nonterminal MUST be the start production. Num_productions is initialized
466       * to 0 when it's declared.
467       */
468
469      PRODUCTION *p;
470
471      if( !(p = (PRODUCTION *) calloc(1, sizeof(PRODUCTION))) )
472          lerror( FATAL, "Out of memory\n" );
473
474      p->next              = Sp->lhs->productions;
475      Sp->lhs->productions = p;
476
477      LL( p->select = newset(); )
478
479      if( (p->num = Num_productions++) >= MAXPROD )
480          lerror( FATAL, "Too many productions (%d max.)\n", MAXPROD );
481
482      p->lhs  = Sp->lhs;
483      Sp->rhs = p;
484  }
485
486  /*----------------------------------------------------------------------*/
487
488  PUBLIC void add_to_rhs( object, is_an_action )
489  char *object;
490  int  is_an_action;      /* 0 if not an action, line number otherwise */
491  {
492      SYMBOL      *p;
493      PRODUCTION  *prod;
494      int         i;
495      char        buf[32];
496
497      /* Add a new element to the RHS currently at top of stack. First deal with
498       * forward references. If the item isn't in the table, add it. Note that,
499       * since terminal symbols must be declared with a %term directive, forward
500       * references always refer to nonterminals or actions. When we exit the
501       * if statement, p points at the symbol table entry for the current object.
502       */
503
504      if( !(p = (SYMBOL *) findsym( Symtab, object )) )   /* not in tab yet */
505      {
506          if( !is_an_action )
507          {
508              if( !(p = new_nonterm( object, 0 )) )
509              {
510                  /* Won't get here unless p is a terminal symbol */
511
512                  lerror( FATAL, "(internal) Unexpected terminal symbol\n" );
513                  return;
514              }
515          }
516          else
517          {
518              /* Add an action. All actions are named "{DDD}" where DDD is the
519               * action number. The curly brace in the name guarantees that this
520               * name won't conflict with a normal name. I am assuming that calloc
521               * is used to allocate memory for the new node (ie. that it's
522               * initialized to zeros).
523               */
524
525              sprintf( buf, "{%d}", ++Cur_act - MINACT );
526
527              p = (SYMBOL *) newsym( sizeof(SYMBOL) );
528              strncpy( p->name, buf, NAME_MAX );
529              addsym ( Symtab, p );
530
531              p->val    = Cur_act ;
532              p->lineno = is_an_action ;
533
534              if( !(p->string = strsave( object )) )
535                  lerror( FATAL, "Insufficient memory to save action\n" );
536          }
537      }
538
539      p->used = yylineno;
540
541      if( (i = Sp->rhs->rhs_len++) > MAXRHS )
542          lerror( NONFATAL, "Right-hand side too long (%d max)\n", MAXRHS );
543      else
544      {
545          LL( if( i == 0 && p == Sp->lhs )                                       )
546          LL(     lerror( NONFATAL, "Illegal left recursion in production.\n" ); )
547          OX( if( ISTERM( p ) )                                                  )
548          OX(     Sp->rhs->prec = Precedence[ p->val ].level ;                   )
549
550          Sp->rhs->rhs[ i     ] = p;
551          Sp->rhs->rhs[ i + 1 ] = NULL;       /* NULL terminate the array */
552
553          if( !ISACT(p) )
554              ++( Sp->rhs->non_acts );
555      }
556  }
557
558  /*----------------------------------------------------------------------
559   * The next two subroutines handle repeating or optional subexpressions. The
560   * following mappings are done, depending on the operator:
561   *
562   *      S : A [B]  C ;          S   -> A 001 C
563   *                              001 -> B | epsilon
564   *
565   *      S : A [B]* C ;          S   -> A 001 C                  (occs)
566   *                              001 -> 001 B | epsilon
567   *
568   *      S : A [B]* C ;          S   -> A 001 C                  (llama)
569   *                              001 -> B 001 | epsilon
570   *
571   * In all situations, the right hand side that we've collected so far is
572   * pushed and a new right-hand side is started for the subexpression. Note that
573   * the first character of the created rhs name (001 in the previous examples)
574   * is a space, which is illegal in a user-supplied production name so we don't
575   * have to worry about conflicts. Subsequent symbols are added to this new
576   * right-hand side. When the ), ], or *) is found, we finish the new right-hand
577   * side, pop the stack and add the name of the new right-hand side to the
578   * previously collected left-hand side.
579   */
580
581  PUBLIC void start_opt( lex )            /* Start an optional subexpression */
582  char *lex;
583  {
584      char        name[32], *tname;
585      static int  num = 0;
586
587      --Sp;                               /* Push current stack element      */
588      sprintf( name, " %06d", num++ );    /* Make name for new production    */
589      new_nonterm( name, 1 );             /* Create a nonterminal for it.    */
590      new_rhs();                          /* Create epsilon production.      */
591      new_rhs();                          /* and production for sub-prod.    */
592  }
593
594  /*----------------------------------------------------------------------*/
595
596  PUBLIC void end_opt( lex )              /* end optional subexpression      */
597  char *lex;
598  {
599      char    *name = Sp->lhs_name ;
600      SYMBOL  *p;
601      int     i;
602
603      if( lex[1] == '*' )                 /* Process a [...]*                */
604      {
605          add_to_rhs( name, 0 );          /* Add right-recursive reference.  */
606
607  #ifdef OCCS                             /* If occs, must be left recursive.*/
608          i = Sp->rhs->rhs_len - 1;       /* Shuffle things around.          */
609          p = Sp->rhs->rhs[ i ];
610          memmove( &(Sp->rhs->rhs)[1], &(Sp->rhs->rhs)[0],
611                                          i * sizeof( (Sp->rhs->rhs)[1] ) );
612          Sp->rhs->rhs[0] = p ;
613  #endif
614      }
615
616      ++Sp;                               /* discard top-of-stack element    */
617      add_to_rhs( name, 0 );
618  }
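
The shuffle inside end_opt() is what turns the right-recursive form (001 -> B 001) into the left-recursive form (001 -> 001 B) that occs needs. The following sketch isolates that step on a plain array of strings so that it can be tried by itself. The function name sk_add_recursive_ref(), the SK_MAXRHS limit, and the use of strings in place of SYMBOL pointers are assumptions made for the illustration only.

```c
#include <assert.h>
#include <string.h>

#define SK_MAXRHS 8                 /* hypothetical limit, like MAXRHS */

/* Add a recursive reference to "name" to a right-hand side that holds
 * len symbols. The reference is appended (right recursion); if
 * left_recursive is true, the older symbols are then shifted up one
 * slot with memmove() and the reference is stored in slot 0 instead.
 * Returns the new length.
 */
static int sk_add_recursive_ref(const char *rhs[], int len,
                                const char *name, int left_recursive)
{
    assert(len >= 0 && len < SK_MAXRHS);

    rhs[len] = name;                            /* B ... 001 */

    if (left_recursive)
    {
        memmove(&rhs[1], &rhs[0], (size_t)len * sizeof rhs[0]);
        rhs[0] = name;                          /* 001 B ... */
    }
    return len + 1;
}
```

memmove() rather than memcpy() is required here because the source and destination regions overlap.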
619
620  /*======================================================================
621   * The following routines have alternate versions, one set for llama and another
622   * for occs. The routines corresponding to features that aren't supported in one
623   * or the other of these programs print error messages.
624   */
625
626  #ifdef LLAMA
627
628  PUBLIC void add_synch( name )
629  char *name;
630  {
631      /* Add "name" to the set of synchronization tokens
632       */
633
634      SYMBOL *p;
635
636      if( !(p = (SYMBOL *) findsym( Symtab, name )) )
637          lerror( NONFATAL, "%%synch: undeclared symbol <%s>.\n", name );
638
639      else if( !ISTERM(p) )
640          lerror( NONFATAL, "%%synch: <%s> not a terminal symbol\n", name );
641
642      else
643          ADD( Synch, p->val );
644  }
645
646  /*----------------------------------------------------------------------*/
647
648  PUBLIC void new_lev( how )
649  {
650      switch( how )
651      {
652      case  0 : /* initialization: ignore it */                               break;
653      case 'l': lerror( NONFATAL, "%%left not recognized by LLAMA\n"     );   break;
654      case 'r': lerror( NONFATAL, "%%right not recognized by LLAMA\n"    );   break;
655      default : lerror( NONFATAL, "%%nonassoc not recognized by LLAMA\n" );   break;
656      }
657  }
658
659  PUBLIC void prec( name )
660  char *name;
661  {
662      lerror( NONFATAL, "%%prec not recognized by LLAMA\n" );
663  }
664
665  PUBLIC void union_def( action )
666  char *action;
667  {
668      lerror( NONFATAL, "%%union not recognized by LLAMA\n" );
669  }
670
671  PUBLIC void prec_list( name ) char *name;
672  {
673  }
674
675  PUBLIC void new_field( field_name )
676  char *field_name;
677  {
678      if( *field_name )
679          lerror( NONFATAL, "<name> not supported by LLAMA\n" );
680  }
681
682  PUBLIC make_nonterm( name )
683  char *name;
684  {
685      lerror( NONFATAL, "%%type not supported by LLAMA\n" );
686  }
687
688  #else /*============================================================*/
689
690  PUBLIC void add_synch( yytext )
691  char *yytext;
692  {
693      lerror( NONFATAL, "%%synch not supported by OCCS\n" );
694  }
695
696  /*----------------------------------------------------------------------*/
697
698  PUBLIC void new_lev( how )
699  {
700      /* Increment the current precedence level and modify "Associativity"
701       * to remember if we're going left, right, or neither.
702       */
703
704      if( Associativity = how )   /* 'l', 'r', 'n', (0 if unspecified) */
705          ++Prec_lev;
706  }
707
708  /*----------------------------------------------------------------------*/
709
710  PUBLIC void prec_list( name )
711  char *name;
712  {
713      /* Add current name (in yytext) to the precision list. "Associativity" is
714       * set to 'l', 'r', or 'n', depending on whether we're doing a %left,
715       * %right, or %nonassoc. Also make a nonterminal if it doesn't exist
716       * already.
717       */
718
719      SYMBOL *sym;
720
721      if( !(sym = findsym( Symtab, name )) )
722          sym = make_term( name );
723
724      if( !ISTERM(sym) )
725          lerror( NONFATAL, "%%left or %%right, %s must be a token\n", name );
726      else
727      {
728          Precedence[ sym->val ].level = Prec_lev ;
729          Precedence[ sym->val ].assoc = Associativity ;
730      }
731  }
732
733 /*----------------------------------------------------------------------*/
734
735 PUBLIC void prec( name )
736 char *name;
737 {
738     /* Change the level for the current right-hand side, using
739      * (1) an explicit number if one is specified, or (2) an element from the
740      * Precedence[] table otherwise.
741      */
742
743     SYMBOL *sym;
744
745     if( isdigit(*name) )
746         Sp->rhs->prec = atoi(name);                           /* (1) */
747     else
748     {
749         if( !(sym = findsym(Symtab, name)) )
750             lerror( NONFATAL, "%s (used in %%prec) undefined\n", name );
751
752         else if( !ISTERM(sym) )
753             lerror( NONFATAL, "%s (used in %%prec) must be terminal symbol\n", name );
754
755         else
756             Sp->rhs->prec = Precedence[ sym->val ].level;     /* (2) */
757     }
758 }
759
760 /*----------------------------------------------------------------------*/
761
762 PUBLIC void union_def( action )
763 char *action;
764 {
765     /* Create a YYSTYPE definition for the union, using the fields specified
766      * in the %union directive, and also appending a default integer-sized
767      * field for those situations where no field is attached to the current
768      * symbol.
769      */
770
771     while( *action && *action != '{' )   /* Get rid of everything up to the */
772         ++action;                        /* open brace                      */
773     if( *action )                        /* and the brace itself            */
774         ++action;
775
776     output( "typedef union\n" );
777     output( "{\n" );
778     output( "    int   %s;  /* Default field, used when no %%type found */",
779                                                               DEF_FIELD );
780     output( "%s\n", action );
781     output( "yystype;\n\n" );
782     output( "#define YYSTYPE yystype\n" );
783     Fields_active = 1;
784 }
785
786 PUBLIC fields_active()
787 {
788     return Fields_active ;       /* previous %union was specified */
789 }
790
791 /*----------------------------------------------------------------------*/
792
793 PUBLIC void new_field( field_name )
794 char *field_name;
795 {
796     /* Change the name of the current <field> */
797
798     char *p;
799
800     if( !*field_name )
801         *Field_name = '\0';
802     else
803     {
804         if( p = strchr(++field_name, '>') )
805             *p = '\0';
806
807         strncpy( Field_name, field_name, sizeof(Field_name) );
808     }
809 }
810 #endif
4.10.2 Creating The Tables
The next step is to create the parse tables from the grammar we just assembled.
Doing so requires the LL(1) select sets.
4.10.2.1 Computing FIRST, FOLLOW, and SELECT Sets. LLama starts by
creating the FIRST sets, using the code in Listing 4.27, which implements the algorithm
described earlier. The subroutine first(), on line 21, figures the FIRST sets for all the
nonterminals in the symbol table. The ptab() call on line 30 traverses the table, calling
first_closure() (on line 37) for every table element (ptab() is one of the hash
functions described in Appendix A). Multiple passes are made through the table, until
a complete pass adds nothing to any symbol's FIRST set.
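The repeated-pass idea can be tried on a toy grammar. The following is only a minimal sketch, not LLama's code: the grammar (e -> t e', e' -> + t e' | epsilon, t -> NUM | ( e )), the bit-mask sets, and the names first_set, first_pass(), and compute_first() are all assumptions made for illustration. Unlike first_closure(), it copies epsilon into a FIRST set only when the whole right-hand side is nullable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding: terminals are bit positions, EPS marks epsilon. */
enum { PLUS, NUM, LP, RP, EPS };
enum { E = 16, E1, T, END = -1 };      /* nonterminals start at 16 */
#define N_NONTERM     3
#define IS_NONTERM(s) ((s) >= E)

struct prod { int lhs; int rhs[4]; };  /* rhs is terminated by END */

static const struct prod grammar[] = {
    { E,  { T, E1, END }       },
    { E1, { PLUS, T, E1, END } },
    { E1, { END }              },      /* epsilon production */
    { T,  { NUM, END }         },
    { T,  { LP, E, RP, END }   },
};
#define N_PROD (sizeof grammar / sizeof grammar[0])

unsigned first_set[N_NONTERM];         /* one bit mask per nonterminal */

/* Make one pass over every production; return nonzero if any set grew. */
static int first_pass(void)
{
    int changed = 0;

    for (size_t p = 0; p < N_PROD; ++p) {
        unsigned  *dst = &first_set[grammar[p].lhs - E];
        unsigned   add = 0;
        const int *y   = grammar[p].rhs;

        for (; *y != END; ++y) {
            if (!IS_NONTERM(*y)) {             /* terminal: add it and stop */
                add |= 1u << *y;
                break;
            }
            unsigned f = first_set[*y - E];
            add |= f & ~(1u << EPS);           /* FIRST(y) without epsilon  */
            if (!(f & (1u << EPS)))            /* y is not nullable: stop   */
                break;
        }
        if (*y == END)                         /* whole rhs was nullable    */
            add |= 1u << EPS;

        assert(grammar[p].lhs >= E && grammar[p].lhs < E + N_NONTERM);
        if ((*dst | add) != *dst) {
            *dst |= add;
            changed = 1;
        }
    }
    return changed;
}

/* Repeat passes until one of them adds nothing (the closure loop). */
void compute_first(void)
{
    for (int i = 0; i < N_NONTERM; ++i)
        first_set[i] = 0;
    while (first_pass())
        ;
}
```

Calling compute_first() leaves the masks in first_set[E - E], first_set[E1 - E], and first_set[T - E]. The outer while loop plays the role of the do/while loop around ptab() in first().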
The first_rhs() subroutine on line 97 computes the FIRST set for an entire right-hand
side, represented as an array of SYMBOL pointers. (They're stored this way in a
PRODUCTION structure.) first_rhs() won't work if called before first(), because it uses
the FIRST sets that first() attaches to the nonterminal symbols.
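The same computation can be sketched for a string of symbols. This is an illustrative sketch under assumptions, not the book's routine: the FIRST sets in first_of[] are hard-coded for a toy expression grammar, sets are bit masks, and first_of_rhs() is a hypothetical name. Like first_rhs(), it adds to the destination set and returns 1 only when the whole string is nullable.

```c
#include <assert.h>

/* Hypothetical encoding: terminals are bit positions, EPS marks epsilon. */
enum { PLUS, NUM, LP, RP, EPS };
enum { E = 16, E1, T };                /* nonterminals start at 16 */
#define N_NONTERM 3
#define BIT(x)    (1u << (x))

/* FIRST sets, assumed to have been computed already. */
static const unsigned first_of[N_NONTERM] = {
    BIT(NUM)  | BIT(LP),               /* E  */
    BIT(PLUS) | BIT(EPS),              /* E1 (nullable) */
    BIT(NUM)  | BIT(LP),               /* T  */
};

/* Add FIRST(rhs[0..len-1]) to *dest. Return 1 if every symbol in the
 * string is nullable (or the string is empty), 0 otherwise.
 */
int first_of_rhs(unsigned *dest, const int *rhs, int len)
{
    for (int i = 0; i < len; ++i) {
        if (rhs[i] < E) {                      /* terminal ends the scan */
            *dest |= BIT(rhs[i]);
            return 0;
        }
        assert(rhs[i] < E + N_NONTERM);
        unsigned f = first_of[rhs[i] - E];
        *dest |= f & ~BIT(EPS);
        if (!(f & BIT(EPS)))                   /* not nullable: stop     */
            return 0;
    }
    *dest |= BIT(EPS);                         /* everything was nullable */
    return 1;
}
```

For the string e' ) the scan adds + from FIRST(e'), sees that e' is nullable, adds ), and returns 0. For e' alone it returns 1.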
The subroutines in Listing 4.28 compute FOLLOW sets for all nonterminals in
LLama's symbol table, again using the procedure described earlier in this chapter.
Finally, LLama uses the FIRST and FOLLOW sets to create the LL(1) selection sets in
Listing 4.29.
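The select-set rule itself is small enough to show in isolation. The sketch below assumes bit-mask sets and uses hypothetical names (select_set(), ll1_conflict()); it is not the code in llselect.c. It applies the rule that FOLLOW(lhs) is added only when the right-hand side is nullable, then checks that the select sets of one nonterminal's productions do not overlap, which is the LL(1) condition.

```c
#include <assert.h>

/* Hypothetical encoding: tokens are bit positions, EPS marks epsilon. */
enum { EOI, PLUS, NUM, LP, RP, EPS };
#define BIT(x) (1u << (x))

/* SELECT for one production. first_rhs is FIRST of the right-hand side
 * (with the EPS bit set if the whole side is nullable) and follow_lhs is
 * FOLLOW of the left-hand side.
 */
unsigned select_set(unsigned first_rhs, unsigned follow_lhs)
{
    unsigned sel = first_rhs;

    if (sel & BIT(EPS))            /* nullable: lookahead may come from FOLLOW */
        sel |= follow_lhs;

    return sel & ~BIT(EPS);        /* epsilon is never a lookahead token */
}

/* Return 1 if two productions of the same left-hand side share a token
 * in their select sets (the grammar is then not LL(1)), 0 otherwise.
 */
int ll1_conflict(const unsigned *sel, int n)
{
    unsigned seen = 0;

    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        if (seen & sel[i])
            return 1;
        seen |= sel[i];
    }
    return 0;
}
```

For e' -> + t e' | epsilon with FOLLOW(e') = { ), end of input }, the two select sets come out as { + } and { ), end of input }, which are disjoint.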
4.10.3 The Rest of LLama
The remainder of LLama is in Listings 4.30 to 4.37. All that they do is print out the
tables, and so forth. They are adequately commented and need no further description
here.
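To make the purpose of the printed tables concrete, here is a hedged sketch of how a parser might consume a transition table and a push table of the kind Listing 4.30 describes. Everything in it is an assumption for illustration: the toy grammar, the token values, and the names dtran, pushtab, and ll1_parse() are not LLama's generated output. Right-hand sides are stored back to front and end in 0, and the table holds a production number or -1 for an error.

```c
#include <assert.h>

enum { EOI, PLUS, NUM, LP, RP, N_TOK };   /* token 0 is end of input   */
enum { E = 256, E1, T };                  /* nonterminals: e, e', t    */
#define ERR       (-1)
#define STACKSIZE 64

/* Right-hand sides, listed back to front, each terminated by 0. */
static const int p0[] = { E1, T, 0 };          /* e  -> t e'     */
static const int p1[] = { E1, T, PLUS, 0 };    /* e' -> + t e'   */
static const int p2[] = { 0 };                 /* e' -> epsilon  */
static const int p3[] = { NUM, 0 };            /* t  -> NUM      */
static const int p4[] = { RP, E, LP, 0 };      /* t  -> ( e )    */
static const int *const pushtab[] = { p0, p1, p2, p3, p4 };

/* Rows: top-of-stack nonterminal. Columns: current input token. */
static const int dtran[3][N_TOK] = {
    /*          EOI  PLUS  NUM   LP    RP  */
    /* E  */ {  ERR, ERR,  0,    0,    ERR },
    /* E1 */ {  2,   1,    ERR,  ERR,  2   },
    /* T  */ {  ERR, ERR,  3,    4,    ERR },
};

/* Parse a token array that ends in EOI. Return 1 if it is accepted. */
int ll1_parse(const int *tokens)
{
    int stack[STACKSIZE];
    int sp = 0;

    stack[sp++] = E;                           /* push the goal symbol */

    while (sp > 0) {
        int top = stack[--sp];

        if (top < E) {                         /* terminal: must match input */
            if (top != *tokens)
                return 0;
            ++tokens;
        } else {                               /* nonterminal: replace it    */
            assert(*tokens >= 0 && *tokens < N_TOK);
            int prod = dtran[top - E][*tokens];
            if (prod == ERR)
                return 0;
            for (const int *s = pushtab[prod]; *s; ++s) {
                if (sp >= STACKSIZE)
                    return 0;
                stack[sp++] = *s;
            }
        }
    }
    return *tokens == EOI;
}
```

Because each right-hand side is stored in reverse, pushing the array front to back leaves the leftmost symbol on top of the stack. The sketch has no error recovery and no actions; it only accepts or rejects.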
Listing 4.27. first.c Find FIRST Sets
  3 #include <tools/set.h>
  4 #include <tools/hash.h>
  5 #include <tools/compiler.h>
  6 #include <tools/l.h>
  7 #include "parser.h"
  8
  9 /* FIRST.C      Compute FIRST sets for all productions in a symbol table.
 10  *----------------------------------------------------------------------
 11  */
 12
 13 int     first_closure   P(( SYMBOL *lhs ));                      /* local  */
 14 void    first           P(( void ));                             /* public */
 15 int     first_rhs       P(( SET *dest, SYMBOL **rhs, int len ));
 16
 17 PRIVATE int Did_something;
 18
 19 /*----------------------------------------------------------------------*/
 20
 21 PUBLIC void first( )
 22 {
 23     /* Construct FIRST sets for all nonterminal symbols in the symbol table. */
 24
 25     D( printf("Finding FIRST sets\n"); )
 26
 27     do
 28     {
 29         Did_something = 0;
 30         ptab( Symtab, first_closure, NULL, 0 );
 31
 32     } while( Did_something );
 33 }
 34
 35 /*----------------------------------------------------------------------*/
 36
 37 PRIVATE first_closure( lhs )
 38 SYMBOL *lhs;                        /* Current left-hand side */
 39 {
 40     /* Called for every element in the FIRST sets. Adds elements to the first
 41      * sets. The following rules are used:
 42      *
 43      * 1) given lhs->...Y... where Y is a terminal symbol preceded by any number
 44      *    (including 0) of nullable nonterminal symbols or actions, add Y to
 45      *    FIRST(lhs).
 46      *
 47      * 2) given lhs->...y... where y is a nonterminal symbol preceded by any
 48      *    number (including 0) of nullable nonterminal symbols or actions, add
 49      *    FIRST(y) to FIRST(lhs).
 50      */
 51
 52     PRODUCTION  *prod;          /* Pointer to one production side       */
 53     SYMBOL      **y;            /* Pointer to one element of production */
 54     static SET  *set = NULL;    /* Scratch-space set.                   */
 55     int         i;
 56
 57     if( !ISNONTERM(lhs) )               /* Ignore entries for terminal symbols */
 58         return;
 59
 60     if( !set )                          /* Do this once. The set isn't free()d */
 61         set = newset();
 62
 63     ASSIGN( set, lhs->first );
 64
 65     for( prod = lhs->productions; prod ; prod = prod->next )
 66     {
 67         if( prod->non_acts <= 0 )       /* No nonaction symbols     */
 68         {                               /* add epsilon to first set */
 69             ADD( set, EPSILON );
 70             continue;
 71         }
 72
 73         for( y = prod->rhs, i = prod->rhs_len; --i >= 0; y++ )
 74         {
 75             if( ISACT( *y ) )           /* pretend acts don't exist */
 76                 continue;
 77
 78             if( ISTERM( *y ) )
 79                 ADD( set, (*y)->val );
 80
 81             else if( *y )               /* it's a nonterminal */
 82                 UNION( set, (*y)->first );
 83
 84             if( !NULLABLE( *y ) )       /* it's not a nullable nonterminal */
 85                 break;
 86         }
 87     }
 88
 89     if( !IS_EQUIVALENT(set, lhs->first) )
 90     {
 91         ASSIGN( lhs->first, set );
 92         Did_something = 1;
 93     }
 94 }
 95
 96 /*----------------------------------------------------------------------*/
 97 PUBLIC int first_rhs( dest, rhs, len )
 98 SET     *dest;                      /* Target set            */
 99 SYMBOL  **rhs;                      /* A right-hand side     */
100 int     len;                        /* # of objects in rhs   */
101 {
102     /* Fill the destination set with FIRST(rhs) where rhs is the right-hand side
103      * of a production represented as an array of pointers to symbol-table
104      * elements. Return 1 if the entire right-hand side is nullable, otherwise
105      * return 0.
106      */
107
108     if( len <= 0 )
109     {
110         ADD( dest, EPSILON );
111         return 1;
112     }
113
114     for(; --len >= 0 ; ++rhs )
115     {
116         if( ISACT( rhs[0] ) )
117             continue;
118
119         if( ISTERM( rhs[0] ) )
120             ADD( dest, rhs[0]->val );
121         else
122             UNION( dest, rhs[0]->first );
123
124         if( !NULLABLE( rhs[0] ) )
125             break;
126     }
127
128     return( len < 0 );
129 }
Listing 4.28. follow.c Find FOLLOW Sets
#include <tools/debug.h>
#include <tools/set.h>
#include <tools/compiler.h>
#include <tools/l.h>
#include "parser.h"
#include "llout.h"

/* FOLLOW.C     Compute FOLLOW sets for all productions in a
 *              symbol table. The FIRST sets must be computed
 *              before follow() is called.
 */

int     follow_closure  P(( SYMBOL *lhs ));     /* local  */
int     remove_epsilon  P(( SYMBOL *lhs ));
void    init            P(( SYMBOL *lhs ));
void    follow          P(( void ));            /* public */

/*----------------------------------------------------------------------*/

PRIVATE int Did_something;

/*----------------------------------------------------------------------*/

PUBLIC void follow()
{
    D( int pass = 0; )

    D( printf( "Initializing FOLLOW sets\n" ); )
    ptab( Symtab, init, NULL, 0 );

    do {
        /* This loop makes several passes through the entire grammar, adding
         * FOLLOW sets. The follow_closure() routine is called for each grammar
         * symbol, and sets Did_something to true if it adds any elements to
         * existing FOLLOW sets.
         */

        D( printf( "Closure pass %d\n", ++pass ); )
        D( fprintf( stderr, "%d\n", pass ); )

        Did_something = 0;
        ptab( Symtab, follow_closure, NULL, 0 );

    } while( Did_something );

    /* This last pass is just for nicety and could probably be eliminated (may
     * as well do it right though). Strictly speaking, FOLLOW sets shouldn't
     * contain epsilon. Nonetheless, it was much easier to just add epsilon in
     * the previous steps than try to filter it out there. This last pass just
     * removes epsilon from all the FOLLOW sets.
     */

    ptab( Symtab, remove_epsilon, NULL, 0 );
    D( printf( "Follow set computation done\n" ); )
}

/*----------------------------------------------------------------------*/

PRIVATE void init( lhs )
SYMBOL *lhs;                    /* Current left-hand side */
{
    /* Initialize the FOLLOW sets. This procedure adds to the initial follow set
     * of each production those elements that won't change during the closure
     * process. Note that in all the following cases, actions are just ignored.
     *
     * (1) FOLLOW(start) contains end of input ($).
     *
     * (2) Given s->...xY... where x is a nonterminal and Y is
     *     a terminal, add Y to FOLLOW(x). x and Y can be separated
     *     by any number (including 0) of nullable nonterminals or actions.
     *
     * (3) Given s->...xy... where x and y are both nonterminals,
     *     add FIRST(y) to FOLLOW(x). Again, x and y can be separated
     *     by any number of nullable nonterminals or actions.
     */

    PRODUCTION  *prod;          /* Pointer to one production            */
    SYMBOL      **x;            /* Pointer to one element of production */
    SYMBOL      **y;

    D( printf( "%s:\n", lhs->name ); )

    if( !ISNONTERM(lhs) )
        return;

    if( lhs == Goal_symbol )
    {
        D( printf( "\tAdding _EOI_ to FOLLOW(%s)\n", lhs->name ); )
        ADD( lhs->follow, _EOI_ );
    }

    for( prod = lhs->productions; prod ; prod = prod->next )
    {
        for( x = prod->rhs ; *x ; x++ )
        {
            if( ISNONTERM(*x) )
            {
                for( y = x + 1 ; *y ; ++y )
                {
                    if( ISACT(*y) )
                        continue;

                    if( ISTERM(*y) )
                    {
                        D( printf( "\tAdding %s ", (*y)->name ); )
                        D( printf( "to FOLLOW(%s)\n", (*x)->name ); )

                        ADD( (*x)->follow, (*y)->val );
                        break;
                    }
                    else
                    {
                        UNION( (*x)->follow, (*y)->first );
                        if( !NULLABLE( *y ) )
                            break;
                    }
                }
            }
        }
    }
}

/*----------------------------------------------------------------------*/

PRIVATE follow_closure( lhs )
SYMBOL *lhs;                    /* Current left-hand side */
{
    /* Adds elements to the FOLLOW sets using the following rule:
     *
     * Given s->...x or s->...x... where all symbols following
     * x are nullable nonterminals or actions, add FOLLOW(s) to FOLLOW(x).
     */

    PRODUCTION  *prod;          /* Pointer to one production side       */
    SYMBOL      **x, **y;       /* Pointer to one element of production */

    D( printf( "%s:\n", lhs->name ); )

    if( ISACT(lhs) || ISTERM(lhs) )
        return;

    for( prod = lhs->productions; prod ; prod = prod->next )
    {
        for( x = prod->rhs + prod->rhs_len; --x >= prod->rhs ; )
        {
            if( ISACT(*x) )
                continue;

            if( ISTERM(*x) )
                break;

            if( !subset( (*x)->follow, lhs->follow ) )
            {
                D( printf( "\tAdding FOLLOW(%s) ", lhs->name ); )
                D( printf( "to FOLLOW(%s)\n", (*x)->name ); )

                UNION( (*x)->follow, lhs->follow );
                Did_something = 1;
            }

            if( !NULLABLE(*x) )
                break;
        }
    }
}

/*----------------------------------------------------------------------*/

PRIVATE remove_epsilon( lhs )
SYMBOL *lhs;
{
    /* Remove epsilon from the FOLLOW sets. The presence of epsilon is a
     * side effect of adding FIRST sets to the FOLLOW sets willy nilly.
     */

    if( ISNONTERM(lhs) )
        REMOVE( lhs->follow, EPSILON );
}
Listing 4.29. llselect.c Find LL(1) Select Sets
#include <tools/set.h>
#include <tools/hash.h>
#include <tools/compiler.h>
#include <tools/l.h>

/* LLSELECT.C   Compute LL(1) select sets for all productions in a
 *              symbol table. The FIRST and FOLLOW sets must be computed
 *              before select() is called. These routines are not used
 *              by occs.
 */

/*----------------------------------------------------------------------*/

void find_select_set    P(( SYMBOL *lhs ));     /* local  */
void select             P(( void ));            /* public */

/*----------------------------------------------------------------------*/

PRIVATE void find_select_set( lhs )
SYMBOL *lhs;                    /* left-hand side of production */
{
    /* Find the LL(1) selection set for all productions attached to the
     * indicated left-hand side (lhs). The first_rhs() call puts the FIRST sets
     * for the initial symbols in prod->rhs into the select set. It returns true
     * only if the entire right-hand side was nullable (EPSILON was an element
     * of the FIRST set of every symbol on the right-hand side).
     */

    PRODUCTION *prod;

    for( prod = lhs->productions; prod ; prod = prod->next )
    {
        if( first_rhs( prod->select, prod->rhs, prod->rhs_len ) )
            UNION( prod->select, lhs->follow );

        REMOVE( prod->select, EPSILON );
    }
}

/*----------------------------------------------------------------------*/

PUBLIC void select( )
{
    /* Compute LL(1) selection sets for all productions in the grammar */

    if( Verbose )
        printf( "Finding LL(1) select sets\n" );

    ptab( Symtab, find_select_set, NULL, 0 );
}
Listing 4.30. llcode.c Routines That Make the Parse Tables
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#include <tools/debug.h>
#include <tools/set.h>
#include <tools/hash.h>
#include <tools/compiler.h>
#include <tools/l.h>
#include "parser.h"

/* LLCODE.C     Print the various tables needed for a llama
 *              LL(1) parser. The tables are:
 *
 * Yyd[][]      The parser state machine's DFA transition table. The
 *              horizontal axis is input symbol and the vertical axis is
 *              top-of-stack symbol. Only nonterminal TOS symbols are in
 *              the table. The table contains the production number of a
 *              production to apply or -1 if this is an error
 *              transition.
 *
 * YydN[]       (N is 1-3 decimal digits). Used for compressed tables
 *              only. Holds the compressed rows.
 *
 * Yy_pushtab[] Indexed by production number, evaluates to a list of
 * YypDD[]      objects to push on the stack when that production is
 *              replaced. The YypDD arrays are the lists of objects
 *              and Yy_pushtab is an array of pointers to those lists.
 *
 * Yy_snonterm[] For debugging, indexed by nonterminal, evaluates to the
 *              name of the nonterminal.
 * Yy_sact[]    Same but for the acts.
 * Yy_synch[]   Array of synchronization tokens for error recovery.
 * yy_act()     Subroutine containing the actions.
 *
 * Yy_stok[] is made in stok.c. For the most part, the numbers in these tables
 * are the same as those in the symbol table. The exceptions are the token
 * values, which are shifted down so that the smallest token has the value 1
 * (0 is used for EOI).
 */

#define DTRAN   "Yyd"           /* Name of DFA transition table  */
                                /* array in the PARSE_FILE.      */

/*----------------------------------------------------------------------*/

extern void tables              P(( void ));                     /* public */

static void fill_row            P(( SYMBOL *lhs, void *junk ));  /* local  */
static void make_pushtab        P(( SYMBOL *lhs, void *junk ));
static void make_yy_pushtab     P(( void ));
static void make_yy_dtran       P(( void ));
static void make_yy_synch       P(( void ));
static void make_yy_snonterm    P(( void ));
static void make_yy_sact        P(( void ));
static void make_acts           P(( SYMBOL *lhs, void *junk ));
static void make_yy_act         P(( void ));

/*----------------------------------------------------------------------*/
PRIVATE int *Dtran;     /* Internal representation of the parse table.
                         * Initialization in make_yy_dtran() assumes
                         * that it is an int [it calls memiset()].
                         */

/*----------------------------------------------------------------------*/

PUBLIC void tables()
{
    /* Print the various tables needed by the parser. */

    make_yy_pushtab();
    make_yy_dtran();
    make_yy_act();
    make_yy_synch();
    make_yy_stok();
    make_token_file();

    output( "\n#ifdef YYDEBUG\n" );

    make_yy_snonterm();
    make_yy_sact();

    output( "\n#endif\n" );
}

/*----------------------------------------------------------------------
 * fill_row()
 *
 * Make one row of the parser's DFA transition table.
 * Column 0 is used for the EOI condition; other columns
 * are indexed by terminal (with the number normalized
 * for the smallest terminal). That is, the terminal
 * values in the symbol table are shifted downwards so that
 * the smallest terminal value is 1 rather than MINTERM.
 * The row indexes are adjusted in the same way (so that
 * row 0 is used for MINNONTERM).
 *
 * Note that the code assumes that Dtran consists of byte-size
 * cells.
 */

PRIVATE void fill_row( lhs, junk )
SYMBOL  *lhs;
void    *junk;
{
    PRODUCTION  *prod;
    int         *row;
    int         i;
    int         rowsize;

    if( !ISNONTERM(lhs) )
        return;

    rowsize = USED_TERMS + 1;
    row     = Dtran + ((i = ADJ_VAL(lhs->val)) * rowsize );

    for( prod = lhs->productions; prod ; prod = prod->next )
    {
        next_member( NULL );
        while( (i = next_member(prod->select)) >= 0 )
        {
            if( row[i] == -1 )
                row[i] = prod->num;
            else
                error( NONFATAL, "Grammar not LL(1), select-set conflict in " \
                                 "<%s>, line %d\n", lhs->name, lhs->lineno );
        }
    }
}

/*----------------------------------------------------------------------*/

PRIVATE void make_pushtab( lhs, junk )
SYMBOL  *lhs;
void    *junk;
{
    /* Make the pushtab. The right-hand sides are output in reverse order
     * (to make the pushing easier) by stacking them and then printing
     * items off the stack.
     */

    register int i;
    PRODUCTION   *prod ;
    SYMBOL       **sym ;
    SYMBOL       *stack[ MAXRHS ], **sp;

    sp = &stack[-1];
    for( prod = lhs->productions; prod ; prod = prod->next )
    {
        output( "YYPRIVATE int Yyp%02d[]={ ", prod->num );

        for( sym = prod->rhs, i = prod->rhs_len ; --i >= 0 ; )
            *++sp = *sym++ ;

        for(; INBOUNDS(stack, sp) ; output( "%d, ", (*sp--)->val ) )
            ;

        output( "0 };\n", prod->rhs[0] );
    }
}

/*----------------------------------------------------------------------*/

PRIVATE void make_yy_pushtab()
{
    /* Print out yy_pushtab. */

    int         i;
    int         maxprod = Num_productions - 1;
    static char *text[] =
    {
        "The YypNN arrays hold the right-hand sides of the productions, listed",
        "back to front (so that they are pushed in reverse order), NN is the",
        "production number (to be found in the symbol-table listing output",
        "with a -s command-line switch).",
        "",
        "Yy_pushtab[] is indexed by production number and points to the",
        "appropriate right-hand side (YypNN) array.",
        NULL
    };

    comment( Output, text );
    ptab   ( Symtab, make_pushtab, NULL, 0 );

    output( "\nYYPRIVATE int *Yy_pushtab[] =\n{\n" );

    for( i = 0; i < maxprod; i++ )
        output( "\tYyp%02d,\n", i );

    output( "\tYyp%02d\n};\n", maxprod );
}

/*----------------------------------------------------------------------*/

PRIVATE void make_yy_dtran( )
{
    /* Print the DFA transition table. */

    int         i;
    int         nterms, nnonterms;
    static char *text[] =
    {
        "Yyd[][] is the DFA transition table for the parser. It is indexed",
        "as follows:",
        "",
        "               Input symbol",
        "          +----------------------+",
        "       L  |  Production number   |",
        "       H  |       or YYF         |",
        "       S  |                      |",
        "          +----------------------+",
        "",
        "The production number is used as an index into Yy_pushtab, which",
        "looks like this:",
        "",
        "        Yy_pushtab      YypDD:",
        "        +-------+       +----------------+",
        "        |   *---|------>|                |",
        "        +-------+       +----------------+",
        "        |   *---|---->",
        "        +-------+",
        "        |   *---|---->",
        "        +-------+",
        "",
        "YypDD is the tokenized right-hand side of the production.",
        "Generate a symbol table listing with llama's -l command-line",
        "switch to get both production numbers and the meanings of the",
        "YypDD string contents.",
        NULL
    };

    nterms    = USED_TERMS + 1;                 /* +1 for EOI */
    nnonterms = USED_NONTERMS;

    i = nterms * nnonterms;                     /* Number of cells in array */

    if( !(Dtran = (int *) malloc( i * sizeof(*Dtran) )) )  /* number of bytes */
        ferr( "Out of memory\n" );

    memiset( Dtran, -1, i );            /* Initialize Dtran to all failures */
    ptab( Symtab, fill_row, NULL, 0 );  /* and fill nonfailure transitions  */

    comment( Output, text );            /* Print header comment */

    if( Uncompressed )
    {
        fprintf( Output, "YYPRIVATE YY_TTYPE %s[ %d ][ %d ] =\n",
                                                DTRAN, nnonterms, nterms );

        print_array( Output, Dtran, nnonterms, nterms );
        defnext    ( Output, DTRAN );

        if( Verbose )
            printf( "%d bytes required for tables\n", i * sizeof(YY_TTYPE) );
    }
    else
    {
        i = pairs( Output, Dtran, nnonterms, nterms, DTRAN, Threshold, 1 );
        pnext( Output, DTRAN );

        if( Verbose )
            printf( "%d bytes required for compressed tables\n",
                  (i * sizeof(YY_TTYPE)) + (nnonterms * sizeof(YY_TTYPE*)) );
    }

    output( "\n\n" );
}

/*----------------------------------------------------------------------*/

PRIVATE void make_yy_synch( )
{
    int  mem ;          /* current member of synch set  */
    int  i;             /* number of members in set     */

    static char *text[] =
    {
        "Yy_synch[] is an array of synchronization tokens. When an error is",
        "detected, stack items are popped until one of the tokens in this",
        "array is encountered. The input is then read until the same item is",
        "found. Then parsing continues.",
        NULL
    };

    comment( Output, text );
    output( "YYPRIVATE int Yy_synch[] =\n{\n" );

    i = 0;
    for( next_member(NULL); (mem = next_member(Synch)) >= 0 ; )
    {
        output( "\t%s,\n", Terms[mem]->name );
        ++i;
    }

    if( i == 0 )                /* No members in synch set */
        output( "\t_EOI_\n" );

    output( "\t-1\n};\n" );
    next_member( NULL );
}

/*----------------------------------------------------------------------*/

PRIVATE void make_yy_snonterm()
{
    register int i;

    static char *the_comment[] =
    {
        "Yy_snonterm[] is used only for debugging. It is indexed by the",
        "tokenized left-hand side (as used for a row index in Yyd[]) and",
        "evaluates to a string naming that left-hand side.",
        NULL
    };

    comment( Output, the_comment );

    output( "char *Yy_snonterm[] =\n{\n" );

    for( i = MINNONTERM; i <= Cur_nonterm; i++ )
    {
        if( Terms[i] )
            output( "\t/* %3d */ \"%s\"", i, Terms[i]->name );

        if( i != Cur_nonterm )
            outc( ',' );

        outc( '\n' );
    }

    output( "};\n\n" );
}

/*----------------------------------------------------------------------*/

PRIVATE void make_yy_sact()
{
    /* This subroutine generates the subroutine that implements the actions. */

    register int i;
    static char *the_comment[] =
    {
        "Yy_sact[] is also used only for debugging. It is indexed by the",
        "internal value used for an action symbol and evaluates to a string",
        "naming that token symbol.",
        NULL
    };

    comment( Output, the_comment );
    output( "char *Yy_sact[] =\n{\n\t" );

    if( Cur_act < MINACT )
        output( "NULL    /* There are no actions */" );
    else
        for( i = MINACT; i <= Cur_act; i++ )
        {
            output( "\"{%d}\"%c", i - MINACT, i < Cur_act ? ',' : ' ' );
            if( i % 10 == 9 )
                output( "\n\t" );
        }

    output( "\n};\n" );
}

/*----------------------------------------------------------------------*/

PRIVATE void make_acts( lhs, junk )
SYMBOL  *lhs;
void    *junk;
{
    /* This subroutine is called indirectly from yy_act, through the subroutine
     * ptab(). It prints the text associated with one of the acts.
     */

    char *p;
    int  num;                   /* The N in $N */
    char *do_dollar();
    char fname[80], *fp;
    int  i;

    if( !lhs->string )
        return;

    output( "case %d:\n", lhs->val );

    if( No_lines )
        output( "\t\t" );
    else
        output( "#line %d \"%s\"\n\t\t", lhs->lineno, Input_file_name );

    for( p = lhs->string ; *p ; )
    {
        if( *p == '\r' )        /* skip carriage returns */
        {
            ++p;
            continue;
        }

        if( *p != '$' )
            output( "%c", *p++ );
        else
        {
            /* Skip the attribute reference. The if statement handles $$; the
             * else clause handles the two forms: $N and $-N, where N is a
             * decimal number. When you hit the do_dollar call (in the output()
             * call), "num" holds the number associated with N, or DOLLAR_DOLLAR
             * in the case of $$.
             */

            if( *++p != '<' )
                *fname = '\0';
            else
            {
                ++p;            /* skip the < */
                fp = fname;

                for( i = sizeof(fname); --i > 0 && *p && *p != '>'; *fp++ = *p++ )
                    ;
                *fp = '\0';

                if( *p == '>' )
                    ++p;
            }

            if( *p == '$' )
            {
                num = DOLLAR_DOLLAR;
                ++p;
            }
            else
            {
                num = atoi( p );
                if( *p == '-' )
                    ++p ;
                while( isdigit(*p) )
                    ++p ;
            }

            output( "%s", do_dollar(num, 0, 0, NULL, fname) );
        }
    }

    output( "\nbreak;\n" );
}
441
/*----------------------------------------------------------------------*/

PRIVATE void make_yy_act()
{
    /* Print all the acts inside a subroutine called yy_act() */

    static char *comment_text[] =
    {
        "Yy_act() is the action subroutine. It is passed the tokenized value",
        "of an action and executes the corresponding code.",
        NULL
    };

    static char *top[] =
    {
        "YYPRIVATE int yy_act( actnum )",
        "{",
        "    /* The actions. Returns 0 normally but a nonzero error code can",
        "     * be returned if one of the acts causes the parser to terminate",
        "     * abnormally.",
        "     */",
        "",
        "    switch( actnum )",
        "    {",
        NULL
    };

    static char *bottom[] =
    {
        "    default: printf(\"INTERNAL ERROR: "
                     "Illegal act number (%d)\\n\", actnum);",
        "             break;",
        "    }",
        "    return 0;",
        "}",
        NULL
    };

    comment( Output, comment_text );
    printv ( Output, top );
    ptab   ( Symtab, make_acts, NULL, 0 );
    printv ( Output, bottom );
}
Listing 4.31. lldriver.c  Routines That Make Driver Subroutines

#include <stdlib.h>
#include <tools/l.h>

/*----------------------------------------------------------------------*/

extern void file_header P(( void ));    /* public */
extern void code_header P(( void ));
extern void driver      P(( void ));

/*----------------------------------------------------------------------*/

PRIVATE FILE *Driver_file = stderr;

/*----------------------------------------------------------------------
 * Routines in this file are llama specific. There's a different version
 * of all these routines in yydriver.c.
 *----------------------------------------------------------------------
 */

PUBLIC void file_header()
{
    /* This header is printed at the top of the output file, before
     * the definitions section is processed. Various #defines that
     * you might want to modify are put here.
     */

    output( "#include <stdio.h>\n\n" );

    if( Public )
        output( "#define YYPRIVATE\n" );

    if( Debug )
        output( "#define YYDEBUG\n" );

    output( "\n/*----------------------------------------*/\n\n" );
}

/*----------------------------------------------------------------------*/

PUBLIC void code_header()
{
    /* This header is output after the definitions section is processed,
     * but before any tables or the driver is processed.
     */

    output( "\n\n/*----------------------------------------*/\n\n" );
    output( "#include \"%s\"\n\n",          TOKEN_FILE  );
    output( "#define YY_MINTERM      1\n"               );
    output( "#define YY_MAXTERM      %d\n", Cur_term    );
    output( "#define YY_MINNONTERM   %d\n", MINNONTERM  );
    output( "#define YY_MAXNONTERM   %d\n", Cur_nonterm );
    output( "#define YY_START_STATE  %d\n", MINNONTERM  );
    output( "#define YY_MINACT       %d\n", MINACT      );
    output( "\n" );

    if( !( Driver_file = driver_1( Output, !No_lines, Template ) ))
        error( NONFATAL, "%s not found--output file won't compile\n", Template );
}

/*----------------------------------------------------------------------*/

PUBLIC void driver()
{
    /* Print out the actual parser by copying the file llama.par
     * to the output file.
     */

    driver_2( Output, !No_lines );
    fclose( Driver_file );
}
Listing 4.32. stok.c  Routines to Make yyout.h and yy_stok[]

#include <tools/debug.h>
#include <tools/compiler.h>
#include <tools/l.h>

/*----------------------------------------------------------------------*/

void make_yy_stok    P(( void ));       /* public */
void make_token_file P(( void ));

/*----------------------------------------------------------------------*/

PUBLIC void make_yy_stok()
{
    /* This subroutine generates the Yy_stok[] array that's
     * indexed by token value and evaluates to a string
     * representing the token name. Token values are adjusted
     * so that the smallest token value is 1 (0 is reserved
     * for end of input).
     */

    register int i;
    static char *the_comment[] =
    {
        "Yy_stok[] is used for debugging and error messages. It is indexed",
        "by the internal value used for a token (as used for a column index in",
        "the transition matrix) and evaluates to a string naming that token.",
        NULL
    };

    comment( Output, the_comment );

    output( "char *Yy_stok[] =\n{\n" );
    output( "\t/*   0 */ \"_EOI_\",\n" );

    for( i = MINTERM; i <= Cur_term; i++ )
    {
        output( "\t/* %3d */ \"%s\"", (i-MINTERM)+1, Terms[i]->name );

        if( i != Cur_term )
            outc( ',' );

        if( (i & 0x1) == 0  ||  i == Cur_term )     /* Newline for every */
            outc( '\n' );                           /* other element     */
    }

    output( "};\n\n" );
}

/*----------------------------------------------------------------------*/

PUBLIC void make_token_file()
{
    /* This subroutine generates the yytokens.h file. Tokens have
     * the same value as in make_yy_stok(). A special token
     * named _EOI_ (with a value of 0) is also generated.
     */

    FILE  *tokfile;
    int   i;

    if( !(tokfile = fopen( TOKEN_FILE, "w" )) )
        error( FATAL, "Can't open %s\n", TOKEN_FILE );

    D( else if( Verbose )                           )
    D(     printf( "Generating %s\n", TOKEN_FILE ); )

    fprintf( tokfile, "#define _EOI_      0\n" );

    for( i = MINTERM; i <= Cur_term; i++ )
        fprintf( tokfile, "#define %-10s %d\n",
                                Terms[i]->name, (i-MINTERM)+1 );
    fclose( tokfile );
}
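The numbering scheme used by make_token_file() can be tried in isolation. The following is only a sketch under stated assumptions: emit_token_defines() and its parameters are hypothetical names that do not appear in LLama, and the token names come from a plain array rather than from Terms[]. It shows the one idea that matters, which is that 0 is reserved for end of input and the real tokens are numbered from 1.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of LLama): write one #define per token
 * name, numbering the tokens from 1 and reserving 0 for the end-of-input
 * marker. Returns the number of #define lines written.
 */
int emit_token_defines( FILE *out, const char *const names[], int count )
{
    int i;

    assert( out != NULL && names != NULL && count >= 0 );

    fprintf( out, "#define %-10s %d\n", "_EOI_", 0 );

    for( i = 0; i < count; i++ )
        fprintf( out, "#define %-10s %d\n", names[i], i + 1 );

    return count + 1;
}
```

Passing stdout as the first argument prints the definitions to the screen, which is a convenient way to check the values before writing a real header file.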
Listing 4.33. main.c  Command-Line Parsing and main()

#include <ctype.h>
#include <stdarg.h>
#include <signal.h>
#include <malloc.h>
#include <errno.h>
#include <time.h>
#include <sys/types.h>
#include <sys/timeb.h>
#include "date.h"

#include <tools/hash.h>
#include <tools/compiler.h>
#include <tools/l.h>

#ifdef LLAMA
#       define  ALLOCATE
#       include "parser.h"
#       undef   ALLOCATE
#else
#       define  ALLOCATE
#       include "parser.h"
#       undef   ALLOCATE
#endif

PRIVATE int   Warn_exit     = 0;      /* Set to 1 if -W on command line  */
PRIVATE int   Num_warnings  = 0;      /* Total warnings printed          */
PRIVATE char  *Output_fname = "????"; /* Name of the output file         */
PRIVATE FILE  *Doc_file     = NULL;   /* Error log & machine description */

#define VERBOSE(str)  if(Verbose){ printf( str ":\n" ); }else

/*----------------------------------------------------------------------*/

void onintr       P(( void ));                          /* local */
int  parse_args   P(( int argc, char **argv ));
int  do_file      P(( void ));
void symbols      P(( void ));
void statistics   P(( FILE *fp ));
void tail         P(( void ));

int  main         P(( int argc, char **argv ));         /* public */
void output       P(( char *fmt, ... ));
void lerror       P(( int fatal, char *fmt, ... ));
void error        P(( int fatal, char *fmt, ... ));
char *open_errmsg P(( void ));
char *do_dollar   P(( int num, int rhs_size, int lineno, PRODUCTION *prod, \
                                                         char *fname ));

/*----------------------------------------------------------------------
 * There are two versions of the following subroutines--used in do_file(),
 * depending on whether this is llama or occs.
 * The occs versions are discussed in the next Chapter.
 *
 * subroutine:          llama version in:       occs version in:
 *
 * file_header()        lldriver.c              yydriver.c
 * code_header()        lldriver.c              yydriver.c
 * driver()             lldriver.c              yydriver.c
 * tables()             llcode.c                yycode.c
 * patch()              ------                  yypatch.c
 * select()             llselect.c              ------
 * do_dollar()          lldollar.c              yydollar.c
 *
 * Also, several parts of this file are compiled only for llama, others only
 * for occs. The llama-specific parts are arguments to LL() macros, the occs-
 * specific parts are in OX() macros. We'll look at what the occs-specific
 * parts actually do in the next Chapter.
 *----------------------------------------------------------------------
 */
PUBLIC main( argc, argv )
char **argv;
{
    _amblksiz = 2048;                   /* Declared in malloc.h             */

    signon();                           /* Print sign on message            */
    signal( SIGINT, onintr );           /* Close output files on Ctrl-Break */
    parse_args( argc, argv );

    if( Debug && !Symbols )
        Symbols = 1;

    OX( if( Make_parser )                                           )
    OX( {                                                           )
    OX(     if( Verbose == 1 )                                      )
    OX(     {                                                       )
    OX(         if( !(Doc_file = fopen( DOC_FILE, "w" )) )          )
    OX(             ferr( "Can't open log file %s\n", DOC_FILE );   )
    OX(     }                                                       )
    OX(     else if( Verbose > 1 )                                  )
    OX(         Doc_file = stderr;                                  )
    OX( }                                                           )

    if( Use_stdout )
    {
        Output_fname = "/dev/tty";
        Output       = stdout;
    }
    else
    {
        OX( Output_fname = !Make_parser ? ACT_FILE : PARSE_FILE ; )
        LL( Output_fname = PARSE_FILE;                            )

        if( (Output = fopen( Output_fname, "w" )) == NULL )
            error( FATAL, "Can't open output file %s: %s\n",
                                        Output_fname, open_errmsg() );
    }

    if( (yynerrs = do_file()) == 0 )    /* Do all the work */
    {
        if( Symbols )
            symbols();                  /* Print the symbol table */

        statistics( stdout );           /* And any closing-up statistics. */

        if( Verbose && Doc_file )
        {
            OX( statistics( Doc_file ); )
        }
    }
    else
    {
        if( Output != stdout )
        {
            fclose( Output );
            if( unlink( Output_fname ) == -1 )
                perror( Output_fname );
        }
    }

    /* Exit with the number of hard errors (or, if -W was specified, the sum
     * of the hard errors and warnings) as an exit status. Doc_file and Output
     * are closed implicitly by exit().
     */

    exit( yynerrs + (Warn_exit ? Num_warnings : 0) );
}

/*----------------------------------------------------------------------*/

PRIVATE void onintr()           /* SIGINT (Ctrl-Break, ^C) Handler  */
{
    if( Output != stdout )      /* Destroy parse file so that a     */
    {                           /* subsequent compile will fail     */
        fclose( Output );
        unlink( Output_fname );
    }

    exit( EXIT_USR_ABRT );
}

/*----------------------------------------------------------------------*/
PRIVATE parse_args( argc, argv )
int   argc;
char  **argv;
{
    /* Parse the command line, setting global variables as appropriate */

    char        *p;
    static char name_buf[80];           /* Use to assemble default file names */

    static char *usage_msg[] =
    {
#ifdef LLAMA
        "Usage is: llama [-switch] file",
        "Create an LL(1) parser from the specification in the",
        "input file. Legal command-line switches are:",
        "",
        "-cN      use N as the pairs threshold when (C)ompressing",
        "-D       enable (D)ebug mode in yyparse.c (implies -s)",
        "-f       (F)ast, uncompressed, tables",
#else
        "Usage is: occs [-switch] file",
        "",
        "\tCreate an LALR(1) parser from the specification in the",
        "\tinput file. Legal command-line switches are:",
        "",
        "-a       Output actions only (see -p)",
        "-D       enable (D)ebug mode in yyparse.c (implies -s)",
#endif
        "-g       make static symbols (G)lobal in yyparse.c",
        "-l       suppress #(L)ine directives",
        "-m<file> use <file> for parser te(M)plate",
        "-p       output parser only (can be used with -T also)",
        "-s       make (s)ymbol table",
        "-S       make more-complete (S)ymbol table",
        "-t       print all (T)ables (and the parser) to standard output",
        "-T       move tables from yyout.c to yyoutab.c",
        "-v       print (V)erbose diagnostics (including symbol table)",
        "-V       more verbose than -v. Implies -t, & yyout.doc goes to stderr",
        "-w       suppress all warning messages\n",
        "-W       warnings (as well as errors) generate nonzero exit status",
        NULL
    };

    /* Note that all global variables set by command-line switches are declared
     * in parser.h. Space is allocated because a #define ALLOCATE is present at
     * the top of the current file.
     */

    for( ++argv, --argc; argc && *(p = *argv) == '-'; ++argv, --argc )
    {
        while( *++p )
        {
            switch( *p )
            {
            OX( case 'a':  Make_parser  = 0;            )
            OX(            Template     = ACT_TEMPL;    )
            OX(            break;                       )

                case 'D':  Debug        = 1;     break;
                case 'g':  Public       = 1;     break;
            LL( case 'f':  Uncompressed = 1;     break; )
                case 'l':  No_lines     = 1;     break;
                case 'm':  Template     = p + 1; goto out;
            OX( case 'p':  Make_actions = 0;     break; )
                case 's':  Symbols      = 1;     break;
                case 'S':  Symbols      = 2;     break;
                case 't':  Use_stdout   = 1;     break;
                case 'T':  Make_yyoutab = 1;     break;
                case 'v':  Verbose      = 1;     break;
                case 'V':  Verbose      = 2;     break;
                case 'w':  No_warnings  = 1;     break;
                case 'W':  Warn_exit    = 1;     break;
            LL( case 'c':  Threshold = atoi( ++p );         )
            LL(            while( *p && isdigit( p[1] ) )   )
            LL(                ++p;                         )
            LL(            break;                           )
                default :
                           fprintf( stderr, "<-%c>: illegal argument\n", *p );
                           printv ( stderr, usage_msg );
                           exit( EXIT_ILLEGAL_ARG );
            }
        }
    out: ;
    }

    if( Verbose > 1 )
        Use_stdout = 1;

    if( argc <= 0 )                     /* Input from standard input    */
        No_lines = 1;

    else if( argc > 1 )
    {
        fprintf( stderr, "Too many arguments.\n" );
        printv ( stderr, usage_msg );
        exit   ( EXIT_TOO_MANY );
    }
    else                                /* argc == 1, input from file   */
    {
        if( ii_newfile( Input_file_name = *argv ) < 0 )
        {
            sprintf( name_buf, "%0.70s.%s", *argv, DEF_EXT );

            if( ii_newfile( Input_file_name = name_buf ) < 0 )
                error( FATAL, "Can't open input file %s or %s: %s\n",
                                        *argv, name_buf, open_errmsg() );
        }
    }
}

/*----------------------------------------------------------------------*/
PRIVATE int do_file()
{
    /* Process the input file. Return the number of errors. */

    struct timeb start_time, end_time;
    long         time;

    ftime( &start_time );   /* Initialize times now so that the difference   */
    end_time = start_time;  /* between times will be 0 if we don't build the */
                            /* tables. Note that I'm using structure assign- */
                            /* ment here.                                    */

    init_acts  ();          /* Initialize the action code.                   */
    file_header();          /* Output #defines that you might want to change */

    VERBOSE( "parsing" );

    nows   ();              /* Make lex ignore white space until ws() is called */
    yyparse();              /* Parse the entire input file                      */

    if( !( yynerrs | problems() ) )     /* If no problems in the input file */
    {
        VERBOSE( "analyzing grammar" );

        first();            /* Find FIRST sets,                              */
        LL( follow(); )     /* FOLLOW sets,                                  */
        LL( select(); )     /* and LL(1) select sets if this is llama        */

        code_header();      /* Print various #defines to output file         */
        OX( patch(); )      /* Patch up the grammar (if this is occs)        */
                            /* and output the actions.                       */

        ftime( &start_time );

        if( Make_parser )
        {
            VERBOSE( "making tables" );
            tables();       /* generate the tables,                          */
        }

        ftime( &end_time );
        VERBOSE( "copying driver" );

        driver();           /* the parser,                                   */

        if( Make_actions )
            tail();         /* and the tail end of the source file.          */
    }

    if( Verbose )
    {
        time  = ( end_time.time   * 1000 ) + end_time.millitm ;
        time -= ( start_time.time * 1000 ) + start_time.millitm ;
        printf( "time required to make tables: %ld.%03ld seconds\n",
                                                (time/1000), (time%1000) );
    }

    return yynerrs;
}
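The elapsed-time arithmetic at the end of do_file() is easy to get subtly wrong when it is printed, so it is worth isolating. The sketch below is an illustration, not LLama code: elapsed_ms() and format_seconds() are hypothetical names, and plain integers stand in for the time and millitm fields of struct timeb. The point to notice is the conversion: the millisecond part must be zero padded on the left (%03ld), because a left-justified field would print 1.005 seconds as 1.5.

```c
#include <assert.h>
#include <stdio.h>

/* Milliseconds between two (seconds, milliseconds) readings, mirroring the
 * time and millitm fields that do_file() takes from struct timeb.
 */
long elapsed_ms( long start_sec, int start_ms, long end_sec, int end_ms )
{
    return (end_sec * 1000L + end_ms) - (start_sec * 1000L + start_ms);
}

/* Format a nonnegative millisecond count as "S.mmm". Returns buf. */
char *format_seconds( long ms, char *buf, size_t size )
{
    assert( ms >= 0 && buf != NULL && size > 0 );
    snprintf( buf, size, "%ld.%03ld", ms / 1000, ms % 1000 );
    return buf;
}
```

A caller would read the clock twice, pass the four fields to elapsed_ms(), and hand the result to format_seconds() with a buffer of 32 characters or so.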
Listing 4.34. main.c  Error, Output, and Statistics Routines

PRIVATE void symbols( void )            /* Print the symbol table */
{
    FILE *fd;

    if( !(fd = fopen( SYM_FILE, "w" )) )
        perror( SYM_FILE );
    else
    {
        print_symbols( fd );
        fclose( fd );
    }
}

/*----------------------------------------------------------------------*/

PRIVATE void statistics( fp )
FILE *fp;
{
    /* Print various statistics */

    int conflicts;              /* Number of parse-table conflicts */

    if( Verbose )
    {
        fprintf( fp, "\n" );
        fprintf( fp, "%4d/%-4d terminals\n",    USED_TERMS,      NUMTERMS    );
        fprintf( fp, "%4d/%-4d nonterminals\n", USED_NONTERMS,   NUMNONTERMS );
        fprintf( fp, "%4d/%-4d productions\n",  Num_productions, MAXPROD     );
    LL( fprintf( fp, "%4d      actions\n",      (Cur_act - MINACT) + 1 );   )
    OX( lr_stats( fp );                                                      )
    }

    LL( conflicts = 0;                  )
    OX( conflicts = lr_conflicts( fp ); )

    if( fp == stdout )
        fp = stderr;

    if( Num_warnings - conflicts > 0 )
        fprintf( fp, "%4d      warnings\n", Num_warnings - conflicts );

    if( yynerrs )
        fprintf( fp, "%4d      hard errors\n", yynerrs );
}

/*----------------------------------------------------------------------*/

PUBLIC void output( char *fmt, ... )
{
    /* Works like printf(), but writes to the output file. See also: the
     * outc() macro in parser.h
     */

    va_list args;
    va_start( args, fmt );
    vfprintf( Output, fmt, args );
    va_end  ( args );
}
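A printf()-style wrapper such as output() follows a fixed pattern in ANSI C: start the argument list, hand it to one of the v...printf() functions, and end it again. The following is a minimal sketch of that pattern, assuming a hypothetical stream variable Sketch_out in place of LLama's Output; sketch_output() is likewise an invented name. Unlike output(), it also passes back the vfprintf() return value so that a caller can detect write errors.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the Output stream used by output(). */
static FILE *Sketch_out;

/* printf()-style writer in ANSI prototype form. Returns whatever
 * vfprintf() returns (characters written, or a negative value on error).
 */
int sketch_output( const char *fmt, ... )
{
    va_list args;
    int     n;

    assert( Sketch_out != NULL );   /* caller must set the stream first */

    va_start( args, fmt );
    n = vfprintf( Sketch_out, fmt, args );
    va_end( args );                 /* every va_start needs a va_end    */

    return n;
}
```

Setting Sketch_out to stdout and calling sketch_output( "case %d:\n", 7 ) writes the text "case 7:" followed by a newline.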
        ++yynerrs;
    }

    va_start( args, fmt );
    fprintf ( stdout, "%s %s (%s, line %d): ", PROG_NAME, type,
                                        Input_file_name, yylineno );
    vfprintf( stdout, fmt, args );
    va_end  ( args );

    OX( if( Verbose && Doc_file )                                   )
    OX( {                                                           )
    OX(     va_start( args, fmt );                                  )
    OX(     fprintf ( Doc_file, "%s (line %d) ", type, yylineno );  )
    OX(     vfprintf( Doc_file, fmt, args );                        )
    OX(     va_end  ( args );                                       )
    OX( }                                                           )

    if( fatal == FATAL )
        exit( EXIT_OTHER );
}

/*----------------------------------------------------------------------*/

PUBLIC void error( int fatal, char *fmt, ... )
{
    /* This error routine works like lerror() except that no line number is
     * generated. The global error count is still modified, however.
     */

    va_list  args;
    char     *type;

    if( fatal == WARNING )
    {
        ++Num_warnings;
        if( No_warnings )
            return;
        type = "WARNING: ";
    }
    else
    {
        type = "ERROR: ";
        ++yynerrs;
    }

    va_start( args, fmt );
    fprintf ( stdout, type );
    vfprintf( stdout, fmt, args );
    va_end  ( args );

    OX( if( Verbose && Doc_file )               )
    OX( {                                       )
    OX(     va_start( args, fmt );              )
    OX(     fprintf ( Doc_file, type );         )
    OX(     vfprintf( Doc_file, fmt, args );    )
    OX(     va_end  ( args );                   )
    OX( }                                       )

    if( fatal == FATAL )
        exit( EXIT_OTHER );
}

/*----------------------------------------------------------------------*/

PUBLIC char *open_errmsg()
{
    /* Return an error message that makes sense for a bad open. The errno
     * declaration comes from <errno.h>, included at the top of the file.
     */

    switch( errno )
    {
    case EACCES:  return "File is read only or a directory";
    case EEXIST:  return "File already exists";
    case EMFILE:  return "Too many open files";
    case ENOENT:  return "File not found";
    default:      return "Reason unknown";
    }
}
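open_errmsg() maps a handful of errno values to messages by hand. Where the standard library's own wording is acceptable, strerror() does the same job for every error code. The sketch below shows that alternative; it is not how LLama does it, and open_or_explain() and its parameters are hypothetical names. Because ISO C does not promise that fopen() sets errno, the code falls back to a fixed message when errno is still zero after a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: open path, and on failure copy the C library's
 * description of errno into msg (always NUL-terminated). On success msg
 * is set to the empty string.
 */
FILE *open_or_explain( const char *path, const char *mode,
                       char *msg, size_t msgsize )
{
    FILE *fp;

    assert( path && mode && msg && msgsize > 0 );

    errno = 0;
    fp = fopen( path, mode );

    if( fp )
        msg[0] = '\0';
    else
        snprintf( msg, msgsize, "%s",
                  errno ? strerror( errno ) : "Reason unknown" );

    return fp;
}
```

The exact text placed in msg is implementation defined, so a program that must print identical messages on every system would keep a table like the one in open_errmsg().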
/*----------------------------------------------------------------------*/

PRIVATE void tail()
{
    /* Copy the remainder of input file to standard output. Yyparse will have
     * terminated with the input pointer just past the %%. Attribute mapping
     * ($$ to Yyval, $N to a stack reference, etc.) is done by the do_dollar()
     * call.
     *
     * On entry, the parser will have read one token too far, so the first
     * thing to do is print the current line number and lexeme.
     */

    extern int   yylineno;              /* LeX generated           */
    extern char  *yytext;               /* LeX generated           */
    int          c, i, sign;
    char         fname[80], *p;         /* field name in $<...>n   */

    output( "%s", yytext );             /* Output newline following %% */

    if( !No_lines )
        output( "\n#line %d \"%s\"\n", yylineno, Input_file_name );

    ii_unterm();                        /* Lex will have terminated yytext */

    while( (c = ii_advance()) != 0 )
    {
        if( c == -1 )
        {
            ii_flush(1);
            continue;
        }
        else if( c == '$' )
        {
            ii_mark_start();

            if( (c = ii_advance()) != '<' )
                *fname = '\0';
            else                        /* extract name in $<foo>1 */
            {
                p = fname;
                for( i = sizeof(fname); --i > 0 && (c = ii_advance()) != '>'; )
                    *p++ = c;
                *p++ = '\0';
PUBLIC char *do_dollar( num, rhs_size, lineno, prod, field )
int         num;                /* the N in $N  */
int         rhs_size;           /* not used     */
int         lineno;             /* not used     */
PRODUCTION  *prod;              /* not used     */
char        *field;             /* not used     */
{
    static char buf[32];

    if( num == DOLLAR_DOLLAR )
        return "Yy_vsp->left";
    else
    {
        sprintf( buf, "(Yy_vsp[%d].right)", num );  /* assuming that num */
        return buf;                                 /* has <16 digits    */
    }
}
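The two replacement strings returned by do_dollar() are all that is needed to see the attribute translation at work on a whole action. The sketch below is an illustration under stated assumptions, not LLama's code: translate_action() is a hypothetical function, it writes into a caller-supplied buffer instead of a static one, and it ignores the $<field>N form entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: expand $$ and $N (or $-N) in src into dst, using the two
 * replacement forms produced by the llama do_dollar(). The $<field>N form
 * is not handled. Returns 0 on success, -1 if dst is too small.
 */
int translate_action( const char *src, char *dst, size_t dstsize )
{
    size_t used = 0;

    assert( src != NULL && dst != NULL && dstsize > 0 );

    while( *src )
    {
        char        piece[48];
        const char *text = piece;
        size_t      len;

        if( src[0] == '$' && src[1] == '$' )
        {
            text = "Yy_vsp->left";                      /* $$ */
            src += 2;
        }
        else if( src[0] == '$' &&
                 ( isdigit( (unsigned char)src[1] ) ||
                   ( src[1] == '-' && isdigit( (unsigned char)src[2] ) ) ) )
        {
            char *end;
            long  num = strtol( src + 1, &end, 10 );    /* $N or $-N */

            snprintf( piece, sizeof piece, "(Yy_vsp[%ld].right)", num );
            src = end;
        }
        else
        {
            piece[0] = *src++;          /* ordinary character or lone '$' */
            piece[1] = '\0';
        }

        len = strlen( text );
        if( used + len >= dstsize )     /* leave room for the terminator */
            return -1;

        memcpy( dst + used, text, len );
        used += len;
    }

    dst[used] = '\0';
    return 0;
}
```

Given the action text "$$ = $1 + $2;", the function produces "Yy_vsp->left = (Yy_vsp[1].right) + (Yy_vsp[2].right);".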
Listing 4.36. signon.c  Print Sign-on Message

#include <tools/hash.h>
#include <tools/set.h>
#include "parser.h"

PUBLIC void signon()
{
    /* Print the signon message. Note that since the console is opened
     * explicitly, the message is printed even if both stdout and stderr are
     * redirected. I'm using the ANSI __TIME__ and __DATE__ macros to get the
     * time and date of compilation.
     */

    FILE *screen;

    UX( if( !(screen = fopen( "/dev/tty", "w" )) )  )
    MS( if( !(screen = fopen( "con:",     "w" )) )  )
            screen = stderr;

    LL( fprintf( screen, "LLAMA 1.0 [" __DATE__ " " __TIME__ "]\n" ); )
    OX( fprintf( screen, "OCCS 1.0 ["  __DATE__ " " __TIME__ "]\n" ); )

    fprintf( screen, "(C) " __DATE__ ", Allen I. Holub. All rights reserved.\n" );

    if( screen != stderr )
        fclose( screen );
}
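The sign-on strings in signon() rely on the compiler joining adjacent string literals, which is what lets __DATE__ and __TIME__ be spliced into a format string. A small sketch of the same trick follows; build_banner() is a hypothetical helper, not part of LLama, and it writes into a buffer rather than to the console.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format "NAME VERSION [date time]" into buf, where
 * the date and time are those of compilation. Returns the length that
 * snprintf() reports.
 */
int build_banner( const char *name, const char *version,
                  char *buf, size_t size )
{
    assert( name && version && buf && size > 0 );

    /* Adjacent string literals are concatenated at compile time, so
     * "[" __DATE__ " " __TIME__ "]" becomes a single literal.
     */
    return snprintf( buf, size, "%s %s [" __DATE__ " " __TIME__ "]",
                     name, version );
}
```

Because the date and time are fixed when the file is compiled, the banner changes only when this translation unit is rebuilt.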
4.11 Exercises
4.1. A standard, recursive, in-order, binary-tree traversal routine looks like this:

typedef struct node
{
    int         key;
    struct node *left, *right;
}
NODE;

traverse( root )
NODE *root;
{
    if( !root )
        return;

    traverse( root->left );
    printf( "%d\n", root->key );
    traverse( root->right );
}

Write a nonrecursive binary-tree traversal function that uses a local stack to hold
pointers to the parent of the current subtree rather than using the run-time stack,
as is implicit in the recursive algorithm.
4.2. A palindrome is a word or phrase that reads the same backwards as forwards,
such as the Napoleonic palindrome: Able was I ere I saw Elba.
a. Write a grammar that recognizes simple palindromes made up of digits. Each
digit should appear exactly twice, and the center of the palindrome should be
marked with a | character. For example, the input string 1234|4321 should be
recognized.
b. Write a program that uses a push-down automaton to recognize palindromes
such as the foregoing.
c. Can you modify your grammar to recognize real palindromes, such as "able
was I ere I saw Elba"? If so, implement the grammar; if not, explain why.
4.3. Write a computer program that translates the right-linear grammars discussed in
Chapter 3 into the state-machine grammars discussed in the same chapter.
4.4. Determine the FIRST and FOLLOW sets for all nonterminals in the first 10
productions in the grammar for C, presented in Appendix C. Is this grammar LL(1)?
Why or why not?
4.5. Modify the following grammar so that it is not ambiguous and is LL(1):

expr  →  - expr
      |  * expr
      |  expr * expr
      |  expr / expr
      |  expr = expr
      |  expr + expr
      |  expr - expr
      |  ( expr )

Operator precedence and associativity should be according to the following table
(higher lines in the table are higher precedence operators).

Associativity   Operator   Description
left to right   ( )        parentheses for grouping (highest precedence)
left to right   - *        unary minus and pointer dereference
left to right   * /        multiplication, division
left to right   + -        addition, subtraction
right to left   =          assignment (lowest precedence)

4.6. Write an LL(1) grammar for the state-machine-description language presented as
an exercise in the last chapter, and implement a table-driven, top-down parser for
that language.
4.7. Is the following grammar LL(1)? Why or why not? If it isn't, convert it to LL(1).

              TOKEN TOKEN stuff
           |  star_list stuff
star_list  →  star_list STAR
           |  ε
symbol     →  TOKEN
stuff      →  ANOTHER_TOKEN

4.8. Design and implement a desk-calculator program using LLama and LeX. The
program should process an input file that contains various semicolon-
terminated statements. Support C syntax and the following C operators:

+  -  *  /  ( )

The - is both unary and binary minus. The * is a multiply. Your calculator
should support a reasonable number of variables (say, 128) with arbitrary names.
A variable is declared implicitly the first time it's used. In addition to the
foregoing operators, your calculator should support the following pseudo subroutine:

printf( "{format}", ... );

The printf() pseudo subroutine should work like the normal printf(),
except that it needs to support only the %d, %x, %o, and %f conversions. It does
not have to support field widths, left justification, and so forth.
4.9. Modify the calculator built in the previous exercise so that it generates the
assembly language necessary to evaluate the expression, rather than actually evaluating
the expression at run time.
4.10. Modify LLama so that it automatically makes the following transformations in
the input grammar. Productions of the form:

s  : symbol stuff
   | symbol other stuff

should be transformed to:

s  : symbol s'
s' : stuff
   | other stuff

The transformations should be applied recursively for as long as the initial
symbols on the right-hand side are the same. For example,

s  : symbol other
   | symbol other stuff

should become:

s  : symbol s'
s' : other
   | other stuff

which, in turn, should become:

s   : symbol s'
s'  : other s''
s'' : stuff
    | /* empty */

4.11. Modify the interactive debugging environment in yydebug.c so that you can look
at the actual code associated with an action. You can do this by storing all action
code in a file rather than in internal buffers and keeping a pointer to the file
position for the start of the code in the occs symbol table (rather than a string
pointer). Normally, this intermediate file would be destroyed by LLama, but the
file should remain in place if -D is specified on the command line. To make this
file available from the debugger, LLama will need to output the source code for
an array indexed by production number which evaluates to the correct file
position. Finally, add two commands to yydebug.c: The first prompts for a
production number and displays the associated code in the stack window. The second
causes the code to be printed in the comments window every time a reduction
occurs.
4.12. Several arrays output by LLama are used only for debugging, and these arrays
take up unnecessary space in the executable image. Modify LLama and
yydebug.c so that this information is stored in binary in an auxiliary file that is
read by the debugger at run time. Don't read the entire file into memory; use
lseek() to find the required strings and transfer them directly to the screen.
4.13. The action code and the debugging-environment subroutines in yydebug.c could
be overlays: code in an action doesn't call debugger subroutines and vice versa.
Arrange for overlays to be used for the debugger to save space in the executable
image.
The first part of this chapter looks at bottom-up parsing in general: at how the
bottom-up parse process works, how it can be controlled by a state machine, and how the
state machine itself, and the associated tables, are generated. The second part of the
chapter looks at a practical implementation of the foregoing: the internals of occs, a
yacc-like program that builds a bottom-up parser from a grammar, are discussed in depth,
as is the occs output file that implements the parser. There is no extended example of
how occs is used because Chapter Six, in which code generation is discussed, provides
such an example.
Bottom-up parsers such as the ones created by yacc and occs work with a class of
grammars called LALR(1) grammars, which are special cases of the more inclusive LR(1)
class. As with the LL(1) notation, the (1) in LR(1) means that one token of lookahead is
required to resolve conflicts in the parse. Similarly, the first L means that input is read
left to right. The R means that a rightmost derivation is used to construct the parse
tree: the rightmost nonterminal is always expanded instead of expanding the leftmost
nonterminal, as was done in the previous chapter. (A bottom-up parser discovers this
derivation in reverse order, working from the input back toward the goal symbol.)
The LA in LALR(1) stands for lookahead: information about certain lookahead
characters is actually built into LALR(1)
parse tables. I'll explain the exact characteristics of these grammars when I discuss how
bottom-up parse tables are created, below. All LL(1) grammars are also LR(1)
grammars, but not the other way around.
Most LR grammars cannot be parsed with recursive-descent parsers, and LR and
LALR grammars tend to be more useful than LL grammars because they are easier to
write: there aren't so many restrictions placed on them. This grammatical flexibility is
a real advantage when writing a compiler. LR parsers have many disadvantages,
however. It is difficult to write an LR parser unless you have a tool, such as occs or yacc, to
help you construct the tables used by the parser. Also, the tables are larger than LL parse
tables and the parser tends to be slower. LR error recovery is difficult to do; error
recovery in yacc-generated parsers is notoriously bad.
Chapter 5: Bottom-Up Parsing
5.1 How Bottom-Up Parsing Works*
Bottom-up parsers use push-down automata to build the parse tree from the bottom
up rather than from the top down. Instead of starting at the root, as is the case with a
top-down parser, a bottom-up parser starts with the leaves and, when it has collected
enough leaves, it connects them together with a common root. The process is best
illustrated with an example. I'll use the following expression grammar:
    0. s → e
    1. e → e + t
    2.   | t
    3. t → t * f
    4.   | f
    5. f → ( e )
    6.   | NUM
The nonterminals are e (for expression), t (for term), and f (for factor). The terminals are
+, *, (, ), and NUM. (NUM is a number: a collection of ASCII characters in the range
'0' to '9'.) Multiplication is of higher precedence than addition: since the star is
further down in the grammar, it will be lower on the parse tree than the plus unless
parentheses force it to go otherwise. The special symbol s is called the goal symbol. A
bottom-up grammar must have one unique goal symbol, and only one production can
have that goal symbol on its left-hand side. Note that this grammar can't be parsed from
the top down because it's left recursive.
A bottom-up parser is a push-down automaton; it needs a state machine to drive it
and a stack to remember the current state. For now, we'll ignore the mechanics of the
state machine and just watch the stack. The parse process works as follows:
(1) If the top few items on the stack form the right-hand side of a production, pop these
items and replace them with the equivalent left-hand side. This procedure (popping
the number of elements on a right-hand side and pushing the left-hand side) is
known as a reduce operation.
(2) Otherwise, push the current input symbol onto the stack and advance the input.
This operation is known as a shift.
(3) If the previous operation was a reduce that resulted in the goal symbol at the top of
stack, accept the input; the parse is complete. Otherwise, go to 1.
Don't confuse shifting and reducing with pushing and popping. A shift always
pushes a token, but it also advances the input. A reduce does zero or more pops
followed by a push. (There are no pops on ε productions.) I'll illustrate the process with a
parse of 1*(2+3). Initially, the stack (on the left, below) is empty and none of the input
characters (on the right, below) have been read:
    (empty)                        1 * ( 2 + 3 )
Since there's no legitimate right-hand side on the stack, apply rule two: shift a NUM and
advance past it in the input.
    NUM                            * ( 2 + 3 )
Note that the input character 1 has been translated to a NUM token as part of the input
process. The right-hand side of Production 6 is now at top of stack, so reduce by
f→NUM. Pop the NUM and replace it with an f:
    (empty)                        * ( 2 + 3 )
    f                              * ( 2 + 3 )
The previous operation has put another right-hand side on the stack, this time for
Production 4 (t→f); reduce by popping the f and pushing a t to replace it:
    (empty)                        * ( 2 + 3 )
    t                              * ( 2 + 3 )
The t at top of stack also forms a right-hand side, but there's a problem here. You could
apply Production 2 (e→t), popping the t and replacing it with an e, but you don't want to
do this because there's no production that has an e followed by a * (the next input
symbol) on its right-hand side. There is, however, a production with a t followed by a *
(Production 3: t→t*f). So, by looking ahead at the next input symbol, the conflict is resolved
in favor of the shift (rather than a reduction by Production 2) in the hope that another f
will eventually get onto the stack to complete the right-hand side for Production 3. The
stack now looks like this:
    t *                            ( 2 + 3 )
There's still no reasonable right-hand side on the stack, so shift the next input symbol to
the stack:
    t * (                          2 + 3 )
and the 2 as well:
    t * ( NUM                      + 3 )
Now there is another legitimate right-hand side on the stack: the NUM. The parser can
apply the same productions it used to process the earlier NUM:
    t * (                          + 3 )
    t * ( f                        + 3 )
    t * (                          + 3 )
    t * ( t                        + 3 )
The next input symbol is a plus sign. Since the grammar doesn't have any productions in
which a plus follows a t, it applies Production 2 (e→t) instead of shifting:
    t * (                          + 3 )
    t * ( e                        + 3 )
Again, since there's still input, shift the next input symbol rather than reducing by
Production 0.
    t * ( e +                      3 )
and the next one:
    t * ( e + NUM                  )
The number forms a legitimate right-hand side, so the parser processes it as before:
    t * ( e +                      )
    t * ( e + f                    )
    t * ( e +                      )
    t * ( e + t                    )
Another right-hand side has now appeared on the stack: e+t is the right-hand side of
Production 1, so the parser can apply this production, popping the entire right-hand side
(three objects) and pushing the corresponding left-hand side:
    t * (                          )
    t * ( e                        )
As before, the parser defers the application of Production 0 because there's still input; it
shifts the final input symbol onto the stack:
    t * ( e )
thereby creating another right-hand side: (e) is the right-hand side of Production 5.
Reduce by this production:
    t *
    t * f
The three symbols now at top of stack form the right-hand side of Production 3 (t→t*f);
pop them and push the left-hand side:
    (empty)
    t
Reduce by Production 2:
    e
Finally, since there's no more input, apply Production 0:
    s
Since the goal symbol is at the top of stack, the machine accepts and the parse is
complete. The entire process is summarized in Table 5.1. Figure 5.1 shows the parse tree as
it's created by this parse. Notice that you can determine the state of the stack for any of
the partially created trees by reading across the roots of the various subtrees. For
example, the tops of the subtrees shown in Figure 5.1(m) are (from left to right) t, *, (, e, +,
and f, and these same symbols comprise the stack at the equivalent time in the parse. So,
you can look at the stack as a mechanism for keeping track of the roots of the
partially-collected subtrees.
5.2 Recursion in Bottom-Up Parsing*
Hitherto, left recursion has been something to avoid in a grammar, primarily because
top-down parsers can't handle it. The situation is reversed in a bottom-up parser: Left
recursion should always be used for those lists in which associativity is not an issue.
Right recursion can be used for right-associative lists, but it should be used only when
right associativity is required. The reason for this change can be seen by looking at how
right- and left-recursive lists are parsed by a bottom-up parser. Consider a left-recursive
grammar, such as the following list of numbers:
Table 5.1. A Bottom-Up Parse of 1*(2+3)

    Stack            Input            Next action         See Fig. 5.1
    (empty)          1 * ( 2 + 3 )    Shift NUM           a
    NUM              * ( 2 + 3 )      Reduce by f→NUM     b
    f                * ( 2 + 3 )      Reduce by t→f       c
    t                * ( 2 + 3 )      Shift *             d
    t *              ( 2 + 3 )        Shift (             e
    t * (            2 + 3 )          Shift NUM           f
    t * ( NUM        + 3 )            Reduce by f→NUM     g
    t * ( f          + 3 )            Reduce by t→f       h
    t * ( t          + 3 )            Reduce by e→t       i
    t * ( e          + 3 )            Shift +             j
    t * ( e +        3 )              Shift NUM           k
    t * ( e + NUM    )                Reduce by f→NUM     l
    t * ( e + f      )                Reduce by t→f       m
    t * ( e + t      )                Reduce by e→e+t     n
    t * ( e          )                Shift )             o
    t * ( e )                         Reduce by f→(e)     p
    t * f                             Reduce by t→t*f     q
    t                                 Reduce by e→t       r
    e                                 Reduce by s→e       s
    s                                 Accept              t
    1. list → list NUM
    2.      | NUM
The input 1 2 3 generates the following bottom-up parse:
    Stack        Input    Comments
    (empty)      1 2 3    shift a NUM
    NUM          2 3      reduce by list→NUM
    list         2 3      shift a NUM
    list NUM     3        reduce by list→list NUM
    list         3        shift a NUM
    list NUM              reduce by list→list NUM
    list
The stack never grows by more than two elements because a reduction occurs every time
that a new NUM is shifted. Note that the nonrecursive production, list→NUM, is applied
first because the parser must get a list onto the stack in order to process the rest of the list
elements. This is always the case in left-recursive lists such as the foregoing, and is
handy when doing code generation because it provides a hook for initialization actions.
(More on this in the next chapter.)
Figure 5.1. Evolution of Parse Tree for 1*(2+3) (panels a through t; tree diagrams not reproduced)
Now consider a right-recursive list, such as the following:
    1. list → NUM list
    2.      | NUM
The input 1 2 3 now generates the following bottom-up parse:
    Stack           Input    Comments
    (empty)         1 2 3    Shift a NUM
    NUM             2 3      Shift a NUM
    NUM NUM         3        Shift a NUM
    NUM NUM NUM              Apply list→NUM
    NUM NUM list             Apply list→NUM list
    NUM list                 Apply list→NUM list
    list
Here, all of the list elements are pushed onto the stack before any of them are
processed via a reduction. This behavior is always the case: right-recursive productions
use a lot of stack because all the list elements must be pushed before a reduction can
occur, so it's worth avoiding right associativity if at all possible. Of course, you can't do
this in an expression grammar because certain operators just have to associate in certain
ways. Lists, however, are another matter.
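The difference in stack use between the two list grammars can be checked with a small simulation. The following C sketch is only an illustration, not code from occs: the function names max_depth_left() and max_depth_right(), the symbol codes, and the fixed MAX_STACK limit are all assumptions made for the example. Each function runs the shift and reduce steps shown in the traces above on a list of n NUM tokens and reports the deepest the symbol stack ever gets.

```c
#include <assert.h>

enum { SYM_NUM, SYM_LIST };          /* hypothetical symbol codes            */
#define MAX_STACK 1024               /* arbitrary limit for this sketch      */

/* Left-recursive grammar:  list -> list NUM | NUM
 * A handle appears on the stack every time a NUM is shifted.
 */
int max_depth_left(int n)
{
    int stack[MAX_STACK], sp = 0, max = 0, i;

    for (i = 0; i < n; ++i) {
        assert(sp < MAX_STACK);
        stack[sp++] = SYM_NUM;                   /* shift a NUM              */
        if (sp > max)
            max = sp;
        if (sp >= 2 && stack[sp - 2] == SYM_LIST)
            sp -= 2;                             /* handle is "list NUM"     */
        else
            sp -= 1;                             /* handle is "NUM"          */
        stack[sp++] = SYM_LIST;                  /* push the left-hand side  */
    }
    return max;
}

/* Right-recursive grammar:  list -> NUM list | NUM
 * No handle appears until the last NUM has been shifted.
 */
int max_depth_right(int n)
{
    int stack[MAX_STACK], sp = 0, max = 0, i;

    for (i = 0; i < n; ++i) {
        assert(sp < MAX_STACK);
        stack[sp++] = SYM_NUM;                   /* shift a NUM              */
        if (sp > max)
            max = sp;
    }
    if (n > 0)
        stack[sp - 1] = SYM_LIST;                /* reduce by list -> NUM    */
    while (sp >= 2) {                            /* reduce by list -> NUM list */
        assert(stack[sp - 1] == SYM_LIST && stack[sp - 2] == SYM_NUM);
        sp -= 2;
        stack[sp++] = SYM_LIST;
    }
    return max;
}
```

Calling max_depth_left(n) gives 2 for any list of two or more elements, while max_depth_right(n) gives n, which is the behavior described in the two traces.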
5.3 Implementing the Parser as a State Machine*
I need to introduce a few terms at this point. At the risk of repeating myself, a shift is
the process of moving a token from the input stream to the stack: push the current token
and advance the input. A reduce is the process of replacing a right-hand side on the
stack with the matching left-hand side: pop as many items as there are on the right-hand
side of a production and push the left-hand side. A viable prefix is a partially parsed
expression, an incomplete derivation. In a bottom-up parser, the symbols on the parse
stack, from bottom to top, are identical to the viable prefix, read left to right.
A handle is a right-hand side on the top few items of the stack. The parser scans the
stack looking for a handle, which when found, triggers a reduce operation. Put another
way, a reduce can be performed when the rightmost symbols of the viable prefix form a
right-hand side. Strictly speaking, a handle is a right-hand side which, when found on
the stack, triggers a reduction. The parser might defer a reduction even though a
right-hand side is at the top of the stack: the associativity and precedence rules built into the
grammar might require that more input symbols be read before any reductions occur. In
this situation, a right-hand side does not form a handle because it doesn't trigger a
reduction. In most situations, however, a reduction occurs as soon as the right-hand side
appears on the stack.
To automate a bottom-up parser, some mechanism is needed for recognizing handles
as they appear on the stack. Of course, you could do this by brute force (scanning the
stack and comparing what you find there to a list of right-hand sides), but this method is
so inefficient as to be impractical. It's better to build a state machine that keeps track of
all the push and pop operations. The condition of the stack can then be inferred from a
knowledge of what items have been pushed or popped. The accepting states in the
machine trigger a reduce operation. Nonaccepting states represent either a shift or the
push of a left-hand side that is part of a reduce operation. The state machine in Figure
5.2 serves to illustrate some of the concepts involved. This machine implements the
following grammar, a subset of the one used in the previous example.
    0. s → e
    1. e → e + t
    2.   | t
    3. t → NUM
This grammar recognizes expressions composed either of single numbers or of
alternating numbers and plus signs. The expressions associate from left to right. The state
machine can be viewed as a sort of super syntax diagram that reflects the syntactic
structure of an entire language, not just of the individual productions. It's as if you formed an
NFA by combining the syntax diagrams for all the productions in the grammar.
Figure 5.2. A Simple LR State Machine (state diagram not reproduced)
The automaton's stack keeps track of state transitions. The current state is the one
whose number is at top of stack. The direction of the arrows in the state diagram
represents pushes. (The state number associated with the next state is pushed.) Pops
cause us to retrace the initial series of pushes. That is, the stack serves as a record of the
various state transitions that were made to get to the current state. You can retrace those
transitions backwards by popping the states. For example, starting at State 0, a zero is at
the top of the stack. Shifting a NUM results in a transition from State 0 to State 1, and a
1 is pushed to record this transition. Reducing by t→NUM involves a pop (which causes
a retrace from State 1 back to State 0) and a push of a 3 (which causes a transition from
State 0 to State 3). The parser can retrace the original path through the machine because
that information is on the stack: the state that you came from is at the top of the stack
after the pop part of the reduction.
It's important to notice here that the only way that a terminal symbol can get onto the
stack is by the push that occurs in a shift operation, and the only way that a nonterminal
symbol can get onto the stack is by the push that occurs during a reduction.
The purpose of the state machine is to detect handles. In this machine, States 1, 2, 3,
and 5 represent the recognition of a handle. (A reduce from State 2 also terminates the
parse successfully.) Entering one of these states usually triggers a reduce operation
(depending on the next input character). For example, if the machine is in State 1, the
parser must have just shifted a NUM, so it will reduce by t→NUM. If the machine is in
State 3, the parser must have just reduced by a production that had a t on its left-hand
side, and it will do another reduce (by e→t). If the machine's in State 2, the parser
reduces by s→e only if the next input character is the end-of-input marker, otherwise it
will shift to State 4. State 5 tells the parser that an e, plus, and t must have been pushed
so the handle e+t is recognized, and it reduces by e→e+t. Since the state stack itself,
along with a knowledge of what those states represent, is sufficient to see what's going
on, there's no reason to keep a stack of actual symbols, as I did in the previous section.
The state machine from Figure 5.2 is represented in array form in Table 5.2: the
columns are possible input symbols (in the Action section) or nonterminal symbols (in
the Goto section). The special symbol ⊢ in the Action table is an end-of-input marker.
The rows represent possible states of the machine. The basic parsing algorithm is shown
in Table 5.3.
Table 5.2. Array Representation of an LR State Machine

    Top of  |        Action         |  Goto
    Stack   |    ⊢      NUM     +   |  e   t
    --------+-----------------------+---------
       0    |    -      s1      -   |  2   3
       1    |    r3     -       r3  |  -   -
       2    |  Accept   -       s4  |  -   -
       3    |    r2     -       r2  |  -   -
       4    |    -      s1      -   |  -   5
       5    |    r1     -       r1  |  -   -

I'll demonstrate the parse process by parsing the expression 1+2+3. (The entire
parse is summarized in Table 5.4; I'll do the parse step by step here, however.) The
algorithm starts off by pushing 0, the state number of the start state.
    0                              1 + 2 + 3 ⊢
Action[0][NUM] holds an s1, so shift a 1 onto the stack, entering State 1. Advance the
input as part of the shift operation:
    0 1                            + 2 + 3 ⊢
Now, since the new top-of-stack symbol is a 1, the parser looks at Action[1][+] and finds
an r3 (reduce by Production 3). The algorithm in Table 5.3 tells the parser to execute a
code-generation action and pop one item off the stack (because there's one item on the
right-hand side of t→NUM).
    0                              + 2 + 3 ⊢
The parser now has to figure the Goto transition. The left-hand side of Production 3 is a t
and popping the 1 uncovered the 0, so the parser consults Goto[0][t], finding a 3, which it
pushes onto the state stack:
    0 3                            + 2 + 3 ⊢
Spinning back up to the top of the while loop, Action[3][+] is an r2, a reduction by
Production 2 (e→t). The parser pops one item and then pushes the contents of Goto[0][e],
State 2:
    0                              + 2 + 3 ⊢
    0 2                            + 2 + 3 ⊢
Looping again, Action[2][+] is an s4: shift a 4, advancing the input:
    0 2 4                          2 + 3 ⊢
Action[4][NUM] is an s1 so we shift again:
    0 2 4 1                        + 3 ⊢
Action[1][+] is an r3.
    0 2 4                          + 3 ⊢
This time, however, the uncovered top-of-stack item is a 4, not a 0, as it was the first time
the parser reduced by Production 3. Goto[4][t] is a 5, so the machine goes to State 5 by
Table 5.3. State-Machine-Driven LR Parser Algorithm

    push( 0 )
    while( Action[TOS][input] ≠ Accept )
    {
        if( Action[TOS][input] = - )
            error()
        else if( Action[TOS][input] = sX )
        {
            push( X )
            advance()
        }
        else if( Action[TOS][input] = rX )
        {
            act( X )
            pop( as many items as are in the RHS of Production X )
            push( Goto[ uncovered TOS ][ LHS of Production X ] )
        }
    }
    accept();

Definitions:
    Action      Columns in the Action part of the state-machine array.
    accept()    Return success status. An accept is actually a reduction by Production
                0 on end of input.
    act(X)      Perform the action associated with a reduction of Production X.
                This action usually generates code.
    advance()   Discard current input symbol and get the next one.
    error()     Print error message and do some sort of error recovery.
    Goto        Columns in the Goto part of the state-machine array.
    input       Current input symbol.
    Stack       State stack.
    push(X)     Push state X onto the stack.
    pop(N)      Pop N items from the stack.
    TOS         State number at the top of the stack.
pushing a 5 onto the stack:
    0 2 4 5                        + 3 ⊢
Looping yet again, Action[5][+] is an r1, so an e, plus, and t must be on the stack, and the
parser reduces by Production 1 (e→e+t). Note that the current state (5) can only be
reached by making transitions on an e, +, and t. Since each forward-going transition is a
push, the stack must contain these three symbols. Since there are three items on the
right-hand side of the production, pop three states:
    0                              + 3 ⊢
and then push Goto[0][e] to replace them:
    0 2                            + 3 ⊢
Next, since Action[2][+] is an s4, the parser goes to State 4 with a shift:
    0 2 4                          3 ⊢
Action[4][NUM] is an s1 so it shifts to State 1:
    0 2 4 1                        ⊢
Action[1][⊢] calls for a reduction by Production 3:
    0 2 4                          ⊢
    0 2 4 5                        ⊢
Action[5][⊢] is a reduction by Production 1:
    0                              ⊢
    0 2                            ⊢
Action[2][⊢] is an accept action, so the parse is finished. The tree implied by the
previous parse is shown in Figure 5.3.
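The table-driven algorithm can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the parser that occs generates: the encoding of table entries (a positive number for a shift, a negative number for a reduce, 0 for a blank entry, 99 for Accept), the names lr_parse() and next_token(), and the one-character lexer are all invented for the example. The Action and Goto arrays are transcribed from Table 5.2, and the loop follows Table 5.3 with the act(X) step left out.

```c
#include <assert.h>

enum { T_EOI, T_NUM, T_PLUS, N_TOKENS };   /* Action columns: end of input, NUM, + */
enum { NT_E, NT_T, N_NONTERMS };           /* Goto columns: e, t                   */

#define STACK_MAX 64
#define ERR   0                /* blank table entry (assumed encoding)  */
#define ACC   99               /* Accept (assumed encoding)             */
#define S(n)  (n)              /* sN: shift and go to State N           */
#define R(n)  (-(n))           /* rN: reduce by Production N            */

static const int Action[6][N_TOKENS] = {
    /*          EOI    NUM    +     */
    /* 0 */ {  ERR,   S(1),  ERR   },
    /* 1 */ {  R(3),  ERR,   R(3)  },
    /* 2 */ {  ACC,   ERR,   S(4)  },
    /* 3 */ {  R(2),  ERR,   R(2)  },
    /* 4 */ {  ERR,   S(1),  ERR   },
    /* 5 */ {  R(1),  ERR,   R(1)  },
};
static const int Goto[6][N_NONTERMS] = {
    /*          e    t   */
    /* 0 */ {   2,   3  },
    /* 1 */ {  ERR, ERR },
    /* 2 */ {  ERR, ERR },
    /* 3 */ {  ERR, ERR },
    /* 4 */ {  ERR,  5  },
    /* 5 */ {  ERR, ERR },
};
static const int Rhs_len[] = { 1, 3, 1, 1 };            /* indexed by production */
static const int Lhs[]     = { -1, NT_E, NT_E, NT_T };  /* Production 0 is the accept */

/* Toy lexer: digits form a NUM, '+' is a plus, the end of the string is the
 * end-of-input marker, and anything else is reported as -1.
 */
static int next_token(const char **p)
{
    while (**p == ' ')
        ++*p;
    if (**p == '\0')
        return T_EOI;
    if (**p == '+') {
        ++*p;
        return T_PLUS;
    }
    if (**p < '0' || **p > '9')
        return -1;
    while (**p >= '0' && **p <= '9')
        ++*p;
    return T_NUM;
}

/* Return 1 if src is accepted by the grammar, 0 otherwise. */
int lr_parse(const char *src)
{
    int stack[STACK_MAX], sp = 0;
    int tok = next_token(&src);

    stack[0] = 0;                               /* push( 0 )                  */
    while (tok >= 0) {
        int act = Action[stack[sp]][tok];

        if (act == ACC)
            return 1;
        if (act == ERR)
            return 0;                           /* no error recovery here     */
        if (act > 0) {                          /* sX: push(X), advance()     */
            assert(sp + 1 < STACK_MAX);
            stack[++sp] = act;
            tok = next_token(&src);
        } else {                                /* rX: pop the RHS, push Goto */
            int prod = -act, next;

            sp -= Rhs_len[prod];
            assert(sp >= 0);
            next = Goto[stack[sp]][Lhs[prod]];  /* uses the uncovered state   */
            assert(next != ERR);
            stack[++sp] = next;
        }
    }
    return 0;                                   /* unrecognized character     */
}
```

With these tables, lr_parse("1+2+3") goes through the same sequence of state-stack configurations as the step-by-step parse above. Because the grammar is left recursive, the state stack never holds more than four entries, so the small fixed-size array is adequate for this sketch.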
Table 5.4. A Bottom-Up Parse of 1+2+3 Using a State Machine

    State Stack   Symbols       Input          Notes
    0             $             1 + 2 + 3 ⊢    Push start state
    0 1           $ NUM         + 2 + 3 ⊢      Shift NUM
    0             $             + 2 + 3 ⊢      Reduce by t→NUM
    0 3           $ t           + 2 + 3 ⊢
    0             $             + 2 + 3 ⊢      Reduce by e→t
    0 2           $ e           + 2 + 3 ⊢
    0 2 4         $ e +         2 + 3 ⊢        Shift +
    0 2 4 1       $ e + NUM     + 3 ⊢          Shift NUM
    0 2 4         $ e +         + 3 ⊢          Reduce by t→NUM
    0 2 4 5       $ e + t       + 3 ⊢
    0 2 4         $ e +         + 3 ⊢          Reduce by e→e+t
    0 2           $ e           + 3 ⊢
    0             $             + 3 ⊢
    0 2           $ e           + 3 ⊢
    0 2 4         $ e +         3 ⊢            Shift +
    0 2 4 1       $ e + NUM     ⊢              Shift NUM
    0 2 4         $ e +         ⊢              Reduce by t→NUM
    0 2 4 5       $ e + t       ⊢
    0 2 4         $ e +         ⊢              Reduce by e→e+t
    0 2           $ e           ⊢
    0             $             ⊢
    0 2           $ e           ⊢              Accept
There is one difference between a symbol-oriented stack, as was discussed in the
previous section, and a state stack: the latter requires an initial push of the start state that
wasn't required on the symbol stack. I'm using a $ to hold the place of this extra stack
item on the symbol stack. Also, note that the term accept is often used in two ways when
PDA-based parsers are discussed. First of all, there are normal accepting states, in this
case States 1, 2, 3, and 5. A transition into an accepting state signifies that a handle has
been recognized. The action associated with these states is a reduce (and the associated
code-generation actions, which are discussed below). You can also use the word accept
to signify that a complete program has been parsed, however. If you look at the parser as
a recognizer program, the machine as a whole accepts or rejects an input sentence. (A
sentence in this context is a complete program, as represented by the grammar. A valid
sentence in a grammar forms a complete parse tree.)
Figure 5.3. A Parse Tree for 1+2+3
5.4 Error Recovery in an LR Parser*
Error recovery in a bottom-up parser is notoriously difficult. The problem is that,
unlike a top-down parser's stack, the bottom-up parse stack contains no information
about what symbols are expected in the input. The stack tells you only what's already
been seen. One effective technique, called panic-mode error recovery, tries to get around
this difficulty by using the parse tables themselves to find legal input symbols. It works
as follows:
(0) Error recovery is triggered when an error transition is read from the parse table
entry for the current lookahead and top-of-stack symbols (when there's no legal
outgoing transition from a state on the current input symbol).
(1) Remember the current condition of the stack.
(2) Discard the state at the top of stack.
(3) Look in the parse table and see if there's a legal transition on the current input
symbol and the uncovered stack item. If so, the parser has recovered and the parse is
allowed to progress, using the modified stack. If there's no legal transition, and the
stack is not empty, go to (2).
(4) If all items on the stack are discarded, restore the stack to its original condition,
discard an input symbol, and go to (2).
The algorithm continues either until it can start parsing again or until the complete input
file is absorbed. Table 5.5 shows what happens when the incorrect input 1++2 is parsed
using the simple LR state machine in Table 5.2. Messages should be
suppressed if a second error happens right on the tail of the first one in order to avoid
cascading error messages. Nothing should be printed if an error happens within a limited
number (four or five is a good choice) of parse cycles (shift or reduce operations) from
the point of the previous error.
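The recovery steps above can be sketched in C using the tables from Table 5.2. This is an illustrative sketch only, not the occs error-recovery code: the names recover() and parse_with_recovery(), the table encoding (positive for shift, negative for reduce, 0 for a blank entry, 99 for Accept), and the use of a pre-tokenized input array are assumptions made for the example. The sketch follows steps (1) through (4) literally and counts errors instead of printing messages, so the suppression of cascading messages is not modeled.

```c
#include <assert.h>

enum { T_EOI, T_NUM, T_PLUS };       /* Action columns                      */
enum { NT_E, NT_T };                 /* Goto columns                        */
#define STACK_MAX 64
#define ERR 0                        /* blank entry (assumed encoding)      */
#define ACC 99                       /* Accept (assumed encoding)           */

static const int Action[6][3] = {    /* EOI   NUM   +    (sN = N, rN = -N)  */
    { ERR,  1,  ERR }, {  -3, ERR,  -3 }, { ACC, ERR,   4 },
    {  -2, ERR,  -2 }, { ERR,  1,  ERR }, {  -1, ERR,  -1 }
};
static const int Goto[6][2] = { {2,3}, {0,0}, {0,0}, {0,0}, {0,5}, {0,0} };
static const int Rhs_len[]  = { 1, 3, 1, 1 };
static const int Lhs[]      = { -1, NT_E, NT_E, NT_T };

/* Panic-mode recovery. Returns 1 if parsing can resume, 0 if the input was
 * used up first. Popping only moves *sp, so the cells above it are never
 * overwritten and restoring the saved *sp restores the whole stack.
 */
static int recover(const int *stack, int *sp, const int *toks, int *pos)
{
    int saved_sp = *sp;                                  /* step (1) */

    for (;;) {
        while (*sp > 0) {
            --*sp;                                       /* step (2) */
            if (Action[stack[*sp]][toks[*pos]] != ERR)   /* step (3) */
                return 1;
        }
        if (toks[*pos] == T_EOI)       /* discarding the start state too would  */
            return 0;                  /* empty the stack; no input is left     */
        *sp = saved_sp;                                  /* step (4) */
        ++*pos;
    }
}

/* toks must end with T_EOI. Returns 1 on accept; *errors counts recoveries. */
int parse_with_recovery(const int *toks, int *errors)
{
    int stack[STACK_MAX], sp = 0, pos = 0;

    stack[0] = 0;
    *errors  = 0;
    for (;;) {
        int act = Action[stack[sp]][toks[pos]];

        if (act == ACC)
            return 1;
        if (act == ERR) {                                /* step (0) */
            ++*errors;
            if (!recover(stack, &sp, toks, &pos))
                return 0;
        } else if (act > 0) {                            /* shift    */
            assert(sp + 1 < STACK_MAX);
            stack[++sp] = act;
            ++pos;
        } else {                                         /* reduce   */
            int prod = -act;

            sp -= Rhs_len[prod];
            assert(sp >= 0);
            stack[sp + 1] = Goto[stack[sp]][Lhs[prod]];
            ++sp;
        }
    }
}
```

For the token sequence NUM + + NUM (the input 1++2), the sketch reports one error, pops State 4, finds the legal transition from State 2 on +, and goes on to accept, as in Table 5.5. One consequence of following step (4) literally is that the restored top-of-stack state is never tested against the new input symbol, because the algorithm goes straight back to step (2); a production implementation might choose to check it.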
5.5 The Value Stack and Attribute Processing*
The next issue is how bottom-up parsers, such as the one generated by occs, handle
attributes and use them to generate code. In a recursive-descent parser, attributes can be
computed anywhere in the parse process, but they are only passed in two places:
inherited attributes are passed when a subroutine is called, and synthesized attributes are
passed when the subroutine returns. Since a table-driven LR parser doesn't use
Table 5.5. Error Recovery for 1++2

    Stack      Symbols      Input        Comments
    (empty)                 1 + + 2 ⊢    Shift start state
    0          $            1 + + 2 ⊢    Shift NUM (goto 1)
    0 1        $ NUM        + + 2 ⊢      Reduce by Production 3 (t→NUM)
                                         (Return to 0, goto 3)
    0 3        $ t          + + 2 ⊢      Reduce by Production 2 (e→t)
    0 2        $ e          + + 2 ⊢      Shift + (goto 4)
    0 2 4      $ e +        + 2 ⊢        ERROR (no transition in table)
                                         Pop one state from stack
    0 2        $ e          + 2 ⊢        There is a transition from 2 on +
                                         Error recovery is successful
    0 2        $ e          + 2 ⊢        Shift + (goto 4)
    0 2 4      $ e +        2 ⊢          Shift NUM (goto 1)
    0 2 4 1    $ e + NUM    ⊢            Reduce by Production 3 (t→NUM)
    0 2 4 5    $ e + t      ⊢            Reduce by Production 1 (e→e+t)
    0 2        $ e          ⊢            Accept
subroutine calls, another method is required to pass attributes and execute
code-generation actions. To simplify matters, it's rarely necessary to use inherited attributes
in a bottom-up parser; I'll concentrate on synthesized attributes only.
Reviewing the relationship between a bottom-up parse and the parse tree itself, a
production like e→t+e generates the following tree:
          e
        / | \
       t  +  e
In a bottom-up parse, the t is pushed onto the stack by means of a reduction by t→NUM,
the plus is just shifted, then there is a flurry of activity which terminates by the bottom e
being pushed as part of a reduction by either e→t+e or e→t. In any event, once the t,
plus, and e are on the stack, a handle is recognized; all three symbols are popped, and an
e (the left-hand side) is pushed. Synthesized attributes move up the parse tree, from the
children to the parent, and they are passed during a reduce operation, when the handle
is replaced with the left-hand side. The synthesized attribute is attached to the left-hand
side that is pushed as part of a reduce. Inherited attributes, which go from a parent to a
child, would be pushed as part of a shift operation. The situation is simplified by
ignoring any inherited attributes, so attributes of interest are all passed during the reduce
operation, when the parser moves up the tree. If you reexamine the LR parse algorithm
presented in Table 5.3, you'll notice that code generation occurs only on a
reduce operation.
It turns out that the easiest place to store attributes is on the parse stack itself. In
practical terms, you can do this in one of two ways: either make the state stack a stack of
structures, one field of which is the state number and the rest of which is for attributes, or
implement a second stack that runs parallel to the state stack. This second stack is called
a value or attribute stack. Every push or pop on the state stack causes a corresponding
push or pop on the value stack. All code-generation actions in a bottom-up parser are
executed after the parser decides to reduce, but before the parser pops anything off the
stack. So, you need to arrange for any attributes that are needed by a code-generation
action to be on the value stack at the appropriate time. Similarly, the code-generation
action needs to be able to put attributes onto the value stack for use by subsequent
actions. I'll demonstrate with an example.
Figure 5.4 shows a parse of 1+2+3 with both the state and value stacks shown (the
value stack is on the bottom). Emitted code is printed to the right of the value stack.
The attributes that appear on the value stack are the names of the temporary variables
used in expression evaluation. (I'll demonstrate how they get there shortly.) I've used
a placeholder symbol to mark undefined value-stack entries.
Look, initially, at stack number two in Figure 5.4. Here, a NUM has been shifted
onto the stack and the corresponding value-stack item is undefined. The next stage in the
parse is a reduce by t→NUM. Before the parser does the actual reduction, however, it
executes some code. Two actions are performed: First, the t0=1 instruction is
generated. (The 1 is the lexeme attached to the NUM token that is shifted onto Stack 2 in
Figure 5.4.) Next, the code that outputs the instruction leaves a note for the parser
saying that the attribute t0 should be placed on top of the value stack after the reduction is
performed. When the left-hand side is pushed as the last step of a reduction, the parser
must push a t0 in the corresponding place on the value stack.
The parser now does the actual reduction, popping the NUM from the state stack and
replacing it with a t. It modifies the value stack too, popping one item and, instead of
pushing garbage, it pushes a t0 onto the value stack. In other words, the attribute t0 on
the value stack is now attached to the t on the state stack (because it's at the same
relative position). The t has acquired the attribute t0.[1]
The next step in the parse process is application of e→t. No code is generated here.
There is an associated action, however. The e must acquire the attributes formerly
associated with the t. So, again, the parser leaves a note for itself saying to push t0 onto the
value stack as it pushes the new left-hand side onto the state stack.
The next transition of interest is the one from stack seven to eight. Here, the handle
e+t is recognized and the parser applies e→e+t. Two actions are performed. A t0+=t1
instruction is generated, and the attribute t0 is acquired by the e (is passed to the e). The
important thing to notice here is that the names of the two temporary variables are
attributes of e and t. The names of these variables are on the value stack. The code that
generates the add instruction can discover these names by examining the cells at offsets zero
and two from the top of the value stack. Moreover, the position of these attributes on the
stack can be determined by examining the grammar. The handle on the stack comprises
a right-hand side, so the position of the attribute on the stack corresponds to the position
of the symbol on the right-hand side. The rightmost symbol is at top of stack. The e is
two symbols away from the rightmost symbol in the production, so it's at offset two from
the top of stack.
Figure 5.4. The Value Stack and Code Generation (diagram not reproduced)
1. Note that it's a common, but incorrect, usage to say that the left-hand side inherits attributes from the
right-hand side, probably because that's the way that control flows through the actual code (the child's
code is executed first). I'd recommend against this usage because of the potential confusion involved. It's
better to say that a parent acquires an attribute (or is passed an attribute) that's synthesized by the child.
The synthesized attribute here is the name of the temporary variable that holds the
result of the addition at run time; the t0 that's attached to the e on stack eight is the
name of the anonymous temporary that holds the result of the entire subexpression
evaluation. It's just happenstance that the attribute is the same one that was attached to
the e in stack seven. The e in stack eight is not the same e as the one in stack seven:
the one in stack eight is the one on the left-hand side of e→e+t, the e in stack seven is on
the right-hand side of the same production. If the code generator had decided to emit an
instruction like t3=t0+t1, then the synthesized attribute would have been t3.
The next step in building a parser is to augment the earlier grammar to incorporate
code-generation and attribute-passing rules. This augmentation is done in Table 5.6.
The actions are performed immediately before reducing by the associated production.
The special symbol $$ is used to leave a note about what attribute to push. The value
that's assigned to $$ in an action is, in turn, attached to the left-hand side that's pushed
as part of the reduce operation. ($$ is pushed onto the value stack when the new left-hand
side is pushed onto the parse stack.) The code assumes that the value stack is
declared as an array of character pointers:
    char  *Value_stack[128];
    char **Vsp = Value_stack + (sizeof(Value_stack) / sizeof(*Value_stack));
Vsp is the stack pointer. It starts out one cell past the end of the array, and the stack
grows towards low memory: *--Vsp=value is a push and value=*Vsp++ is a pop.
Consequently, Vsp[0] is the item at top of stack, Vsp[1] is the cell immediately below
that, and so forth.
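The push and pop idioms just described can be wrapped in a pair of small helpers. What follows is only a sketch: the names vpush(), vpop(), and vdepth() and the VSTACK_SIZE macro are mine, not part of the parser's real code, and no overflow or underflow checking is done.

```c
#include <stddef.h>

#define VSTACK_SIZE 128

/* Value stack of string attributes. It grows toward low memory. */
static char  *Value_stack[VSTACK_SIZE];

/* An empty stack points one cell past the end of the array. */
static char **Vsp = Value_stack + VSTACK_SIZE;

/* Hypothetical helpers that wrap the two idioms from the text. */
static void  vpush(char *value) { *--Vsp = value; }   /* push */
static char *vpop(void)         { return *Vsp++;  }   /* pop  */

/* Number of attributes currently on the stack. */
static size_t vdepth(void)
{
    return (size_t)((Value_stack + VSTACK_SIZE) - Vsp);
}
```

After two pushes, Vsp[0] is the most recently pushed attribute and Vsp[1] is the one beneath it, which is the property that the actions in the augmented grammar depend on.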
Table 5.6. Expression Grammar with Actions
     Grammar          Actions
0.   s → e            generate( "answer = %s", Vsp[0] );
1.   e → e + t        generate( "%s += %s", Vsp[2], Vsp[0] );
                      free_name( Vsp[0] );
                      $$ = Vsp[2];
2.   e → t            $$ = Vsp[0];
3.   t → NUM          name = new_name();
                      generate( "%s = %s", name, yytext );
                      $$ = name;
The value stack is maintained directly by the parser. A place marker that corresponds
to the terminal that's pushed onto the parse stack is pushed onto the value stack during a
shift. In a reduce, as many objects as are removed from the parse stack are also popped
off the value stack, then $$ is pushed to correspond with the push of the left-hand side.
The actions in Table 5.6 call three subroutines: generate() works like
printf(), except that it puts a newline at the end of every line; new_name() returns
a string that holds the name of a temporary variable; free_name() puts the name back
into the name pool so that it can be reused by a subsequent new_name() call. The
three routines are implemented in Listing 5.1. (Error detection has been removed for
clarity.) The Names array is a stack of available temporary-variable names, initialized to
hold eight names. Namep is the name-stack pointer. The new_name() subroutine pops
a name and returns it; free_name() pushes the name to be freed.
2. Downward-growing stacks are required by ANSI C unless you want to waste one cell of the array. An ANSI
pointer is assumed always to point into an array, though it can go one cell past the end of the array. It
cannot point below the base address of the array, however. The following code demonstrates the problem:
    int x[10], *p;
    p = &x[10];     /* this is valid       */
    p = x;          /* this is valid       */
    --p;            /* p is now undefined  */
The problem is a real one in an 8086-family machine. If x is at address 1000:0000, the --p in a compact-
or large-model program usually evaluates to 1000:fffe rather than 0fff:fffe. Consequently p<x is always
false because 1000:fffe is larger than 1000:0000.
Listing 5.1. support.c: Support Routines for Actions in Table 5.6
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdarg.h>

    /* Stack of available temporary-variable names. */
    char  *Names[] = { "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7" };
    char **Namep   = Names;                 /* name-stack pointer */

    /* Works like printf(), but puts a newline at the end of every line. */
    void generate( char *fmt, ... )
    {
        va_list args;

        va_start( args, fmt );
        vfprintf( stdout, fmt, args );
        va_end( args );
        fputc( '\n', stdout );
    }

    /* Pop a name off the name stack and return it. */
    char *new_name( void )
    {
        if( Namep >= &Names[ sizeof(Names) / sizeof(*Names) ] )
        {
            printf( "Expression too complex\n" );
            exit( 1 );
        }

        return( *Namep++ );
    }

    /* Push a name back onto the name stack so that it can be reused. */
    void free_name( char *s )
    {
        *--Namep = s;
    }
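The way the parser maintains the value stack around the actions of Table 5.6 can be sketched by replaying, by hand, the shifts and reductions for an input such as 1+2. This is an illustration under several assumptions of mine, not the parser's real code: shift(), reduce(), emit(), and parse_1_plus_2() are hypothetical names, the generated code is collected in a buffer rather than printed so that it can be inspected, the lexeme is passed in as an argument instead of being read from yytext, and Production 0 and all error checking are left out.

```c
#include <stdio.h>
#include <string.h>

static char  *Value_stack[128];
static char **Vsp = Value_stack + 128;        /* grows toward low memory */

static char   Code[256];                      /* generated instructions  */
static char  *Names[] = { "t0", "t1", "t2", "t3" };
static char **Namep   = Names;

static char *new_name(void)      { return *Namep++; }
static void  free_name(char *s)  { *--Namep = s;    }

/* Append one two-operand instruction and a newline to Code. */
static void emit(const char *fmt, const char *a, const char *b)
{
    sprintf(Code + strlen(Code), fmt, a, b);
    strcat(Code, "\n");
}

/* A shift pushes a place marker onto the value stack. */
static void shift(void) { *--Vsp = NULL; }

/* A reduce runs the action, pops one cell per right-hand-side symbol,
 * then pushes $$ to go with the new left-hand side. */
static void reduce(int production, const char *lexeme)
{
    char *dollar_dollar = NULL;
    int   rhs_len       = 0;

    switch (production) {
    case 1:                                   /* e -> e + t */
        emit("%s += %s", Vsp[2], Vsp[0]);
        free_name(Vsp[0]);
        dollar_dollar = Vsp[2];
        rhs_len = 3;
        break;
    case 2:                                   /* e -> t */
        dollar_dollar = Vsp[0];
        rhs_len = 1;
        break;
    case 3:                                   /* t -> NUM */
        dollar_dollar = new_name();
        emit("%s = %s", dollar_dollar, lexeme);
        rhs_len = 1;
        break;
    }
    Vsp += rhs_len;                           /* pop the handle's attributes */
    *--Vsp = dollar_dollar;                   /* push $$                     */
}

/* Replay the parse of 1+2 and return the generated code. */
static const char *parse_1_plus_2(void)
{
    shift(); reduce(3, "1");                  /* NUM, then t -> NUM */
    reduce(2, NULL);                          /* e -> t             */
    shift();                                  /* +                  */
    shift(); reduce(3, "2");                  /* NUM, then t -> NUM */
    reduce(1, NULL);                          /* e -> e + t         */
    return Code;
}
```

When the replay finishes, a single attribute is left on the value stack: the name of the temporary that holds the value of the whole expression.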
5.5.1 A Notation for Bottom-Up Attributes*
Though attributes are actually at specific offsets from the top of the value stack, it's
inconvenient always to refer to them in terms of the stack pointer. Yacc and occs both
use a notation for describing attributes that is sufficiently useful to merit mention here.
(This notation is used heavily in the next chapter.) As before, $$ holds the value that is
pushed as part of a reduction. Other attributes are indicated with the notation $1, $2,
and so forth, where the number represents the position of the symbol on the right-hand
side, for example:
    $$      $1  $2  $3  $4
    s   →   A   B   C   D   ...
In addition, yacc and occs automatically provide a default action of $$ = $1 if no
explicit assignment is made to $$. Don't confuse this notation with the similar notation
used by LLama, in which the number in $n is the offset of a symbol to the right of an
action. Here, n is the absolute position of the symbol on the right-hand side, not a relative
position.
Table 5.7 shows the augmented expression grammar we just looked at modified to
use these conventions. The actions are all enclosed in curly braces to separate them from
other symbols in the production. No explicit actions are now required in Production 2
because the default $$=$1 is used.
Table 5.7. Augmented Expression Grammar with Dollar Notation
0.   s → e            { generate( "answer = %s", $1 ); }
1.   e → e + t        { generate( "%s += %s", $1, $3 );
                        free_name( $3 ); }
2.   e → t
3.   t → NUM          { name = new_name();
                        generate( "%s = %s", $$ = name, yytext ); }
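The correspondence between the dollar notation and the value stack can be made concrete with a little arithmetic. Because the rightmost symbol of the handle is at top of stack, $n for a right-hand side of rhs_len symbols lives at offset rhs_len minus n. The sketch below assumes a downward-growing stack like the one described earlier; dollar_offset() and dollar() are hypothetical names of mine and are not how yacc or occs actually spell the translation.

```c
/* Offset from top of stack of attribute $n, given a right-hand side
 * that is rhs_len symbols long ($1 is the leftmost symbol). */
static int dollar_offset(int rhs_len, int n)
{
    return rhs_len - n;
}

/* Fetch $n, where vsp points at the top-of-stack cell. */
static char *dollar(char **vsp, int rhs_len, int n)
{
    return vsp[dollar_offset(rhs_len, n)];
}
```

For e → e + t, which has three symbols on its right-hand side, $1 is at offset 2 and $3 is at offset 0, which agrees with the Vsp[2] and Vsp[0] used in Table 5.6.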
5.5.2 Imbedded Actions*
Note that the actions are not really grammatical symbols in a bottom-up parser, as
they were in the top-down parser; the actions are not part of the viable prefix, they are
just executed at appropriate times by the parser. In particular, an action is performed
only when the parser performs a reduction. Hitherto, our actions have all been to the far
right of the production to signify this fact. It's convenient sometimes to put an action in
the middle of a production, however, as in the following production:
    function_body → arg_list {args_done();} compound_stmt {funct_done();}
In order to process this sort of production, you must modify the grammar so that the middle
action can be done as part of a reduction. The easiest way to do this is to provide a
dummy ε production as follows:
    function_body → arg_list dummy compound_stmt {funct_done();}
    dummy         → ε {args_done();}
Both yacc and occs do this transformation for you automatically, but the extra ε production
can often create parsing problems called shift/reduce conflicts, discussed below in
detail. You can often eliminate the unwanted shift/reduce conflict by inserting an imbedded
action immediately to the right of a terminal symbol in the production, but it's generally
better to avoid imbedded actions entirely. You could rearrange the previous grammar
as follows without introducing an ε production:
    function_body   → formal_arg_list compound_stmt {funct_done();}
    formal_arg_list → arg_list {args_done();}
5.6 Creating LR Parse Tables: Theory*
This section discusses how LR parse tables are created, and also discusses the LR-
class grammars in depth.
5.6.1 LR(0) Grammars*
I'll start discussing LR-class grammars by considering two, more restrictive LR
grammars and why they're too limited for most practical applications. These are LR(0)
grammars (LR grammars that require no lookahead tokens) and SLR(1) grammars
(simple LR(1) grammars). The expression grammar in Table 5.8 is used throughout the
current section. The terminals are +, *, (, ), and NUM (a number). Note that this
grammar is not LL(1), so it can't be parsed with a top-down parser. The parse table for
this grammar is represented as a graph in Figure 5.5 and in tabular form in Table 5.9.
Table 5.8. An Expression Grammar
0    s → e
1    e → e + t
2    e → t
3    t → t * f
4    t → f
5    f → ( e )
6    f → NUM
Figure 5.5. LALR(1) State-Machine for Grammar in Table 5.8
The table and graph relate to one another as follows:
• Each box is a state, and the state number is the row index in both the action and goto
  parts of the parse table.
• Outgoing edges labeled with nonterminal symbols represent the goto part of the
  parse table: a number representing the nonterminal is used as a column index and
  the table holds the next-state number. Remember that this path represents the push
  part of a reduction.
• All outgoing edges that are labeled with terminal symbols become shift directives in
  the action part of the parse table. The token value is used as a column index, and the
  parser shifts from the current to the indicated state when it encounters that token.
• Reduce directives in the parse table are found by looking at the productions in each
  state. A reduction happens from a state when a production has a dot at its far right.
  A reduction by f→NUM can happen from State 1 because State 1 has the production
  f→NUM. in it. (Remember, the dot to the right of the NUM signifies that a NUM
  has already been read and shifted onto the stack.) I'll describe how occs decides in
  which column to put the reduce directives in a moment.
• Everything else in the table is an error transition.
Table 5.9. State-Transition Table for Grammar in Table 5.8
           Shift/Reduce Table (Yy_action)              Goto Table (Yy_goto)
State      ⊢        NUM    +      *      (      )      s     e     t     f
  0        -        s1     -      -      s2     -      -     3     4     5
  1        r6       -      r6     r6     -      r6     -     -     -     -
  2        -        s1     -      -      s2     -      -     6     4     5
  3        accept   -      s7     -      -      -      -     -     -     -
  4        r2       -      r2     s8     -      r2     -     -     -     -
  5        r4       -      r4     r4     -      r4     -     -     -     -
  6        -        -      s7     -      -      s9     -     -     -     -
  7        -        s1     -      -      s2     -      -     -     10    5
  8        -        s1     -      -      s2     -      -     -     -     11
  9        r5       -      r5     r5     -      r5     -     -     -     -
  10       r1       -      r1     s8     -      r1     -     -     -     -
  11       r3       -      r3     r3     -      r3     -     -     -     -
sN = shift to state N. rN = reduce by production N.
Error transitions are marked with -.
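To see how the two halves of the table cooperate, here is a minimal sketch of a driver that recognizes input using the entries of Table 5.9. Several things are assumptions of mine rather than the way occs really stores its tables: the integer encoding (a positive n shifts to state n, a negative n reduces by production n, zero is an error, and ACC accepts), the Rhs_len and Lhs arrays, the token codes, and the parse() function itself. Only states are stacked; no attributes are processed.

```c
enum { EOI, NUM, PLUS, STAR, LP, RP, NTOKENS };   /* action-table columns */
enum { E, T, F, NNONTERMS };                      /* goto-table columns   */

#define NSTATES 12
#define ACC     99                                /* accept               */

/* n > 0: shift to state n.  n < 0: reduce by production -n.  0: error. */
static const int Yy_action[NSTATES][NTOKENS] = {
    /*        EOI  NUM   +    *    (    )  */
    /*  0 */ {  0,   1,   0,   0,   2,   0 },
    /*  1 */ { -6,   0,  -6,  -6,   0,  -6 },
    /*  2 */ {  0,   1,   0,   0,   2,   0 },
    /*  3 */ { ACC,  0,   7,   0,   0,   0 },
    /*  4 */ { -2,   0,  -2,   8,   0,  -2 },
    /*  5 */ { -4,   0,  -4,  -4,   0,  -4 },
    /*  6 */ {  0,   0,   7,   0,   0,   9 },
    /*  7 */ {  0,   1,   0,   0,   2,   0 },
    /*  8 */ {  0,   1,   0,   0,   2,   0 },
    /*  9 */ { -5,   0,  -5,  -5,   0,  -5 },
    /* 10 */ { -1,   0,  -1,   8,   0,  -1 },
    /* 11 */ { -3,   0,  -3,  -3,   0,  -3 }
};

/* Next state after pushing e, t, or f (0 means no entry). */
static const int Yy_goto[NSTATES][NNONTERMS] = {
    { 3, 4, 5 }, { 0, 0, 0 }, { 6, 4, 5 }, { 0, 0, 0 },
    { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 }, { 0, 10, 5 },
    { 0, 0, 11 }, { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 }
};

/* Right-hand-side length and left-hand side of Productions 0 to 6.
 * Production 0 is never reduced explicitly; accept stands in for it. */
static const int Rhs_len[] = { 1, 3, 1, 3, 1, 3, 1 };
static const int Lhs[]     = { 0, E, E, T, T, F, F };

/* Return 1 if the EOI-terminated token string is accepted, else 0. */
static int parse(const int *tokens)
{
    int  stack[64];                 /* state stack (grows upward here) */
    int *sp = stack;

    *sp = 0;                        /* start in State 0 */
    for (;;) {
        int act = Yy_action[*sp][*tokens];

        if (act == ACC)
            return 1;
        if (act == 0)
            return 0;               /* error transition */
        if (act > 0) {              /* shift */
            if (sp >= stack + 63)
                return 0;           /* state stack is full */
            *++sp = act;
            ++tokens;
        } else {                    /* reduce */
            int prod = -act;
            int next;

            sp  -= Rhs_len[prod];               /* pop the handle       */
            next = Yy_goto[*sp][Lhs[prod]];     /* goto on the lhs      */
            *++sp = next;                       /* push the new state   */
        }
    }
}
```

Note that the reduce branch never advances the input: the lookahead token is consulted again in the new state, which is what lets several reductions happen in a row before the next shift.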
The productions in the bottom part of each state are LR(0) items; an LR(0) item is
different from a production in that it is augmented with a dot, which does two things.
First, it keeps track of what's on the parse stack: in a given state, everything to the left
of the dot is on the stack. When a dot moves to the far right of a production, the entire
right-hand side is on the stack and the parser can reduce by that production. Second, the
dot shows you how much input has been read so far: everything to the right of the dot in
an item for a given state has not yet been input when the parser is in that state. Consequently,
symbols to the right of the dot can be used as lookahead symbols.
When you move from one state to another, the dot moves from the left of the symbol
with which the edge is labeled to the right of that symbol. In the case of a terminal symbol,
the dot's movement means that the parser has read the token and shifted it onto the
stack. In the case of a nonterminal, the parser will have read all the input symbols that
comprise that nonterminal; in terms of parse trees, the parser will have processed the
entire subtree rooted at that nonterminal, and will have read all the terminal symbols that
form leaves in the subtree. In other words, the dot movement signifies that the parser has
pushed a nonterminal onto the stack, something that can happen only as part of a reduction.
When the parser reduces, it will have processed all of the input associated with the
pushed nonterminal.
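An LR(0) item is easy to represent in code: it is just a production number and a dot position. The sketch below is one possible representation, not the one occs uses. The "lhs:rhs" string encoding of the grammar (one character per symbol, with N standing for NUM) and the names Item, after_dot(), can_reduce(), and advance() are all assumptions of mine.

```c
/* The grammar of Table 5.8, one string per production: "lhs:rhs". */
static const char *Prod[] = {
    "s:e", "e:e+t", "e:t", "t:t*f", "t:f", "f:(e)", "f:N"
};

/* An LR(0) item: a production and the number of symbols left of the dot. */
typedef struct { int prod; int dot; } Item;

/* The symbol to the immediate right of the dot ('\0' if there is none). */
static char after_dot(Item it)  { return Prod[it.prod][2 + it.dot]; }

/* A dot at the far right means the whole right-hand side is stacked. */
static int  can_reduce(Item it) { return after_dot(it) == '\0'; }

/* Following an edge moves the dot over the symbol that labels the edge. */
static Item advance(Item it)    { ++it.dot; return it; }
```

Starting from [e→.e+t], one advance() gives [e→e.+t], whose after_dot() symbol is the + that labels the outgoing edge, and two more give [e→e+t.], for which can_reduce() is true.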
Now, I'll demonstrate how particular items get into particular states. First, notice
that there are two classes of states: those that have incoming edges labeled with terminal
symbols, and those with incoming edges labeled with nonterminal symbols. There are
no states with both kinds of incoming edge; it's either one or the other. I'll look at the
two classes of states separately.
Start by considering how the parse works. From State 0, a NUM token causes a shift
to State 1, which in turn causes the dot to move past the NUM. Since the parser just
shifted the NUM to the stack, it's at the top of the stack, and the dot is to its immediate
right. Since all transitions into State 1 involve a shift of a NUM, all items in State 1
must have a NUM immediately preceding the dot. This is the case with all states that
have incoming edges labeled with terminal symbols. The dot follows that terminal in
every item in the state.
You must examine the reduce operation to see what's happening with the other
states. All reductions are done in two steps. First, a previous state is restored by going
backwards through the machine (against the direction of the arrows) by popping as many
states as there are symbols on the right-hand side of the production. Next, the parser
makes a transition to a new state by following an edge labeled with the left-hand side of
the production by which you're reducing. These next states comprise the goto part of the
parse table.
We're interested in the second of these steps, the outgoing edges that are labeled with
left-hand sides. Looking at an example, since the dot is to the far right of the production
in State 1, the parser reduces from State 1 by applying f→NUM. The reduction involves
a single pop, which uncovers State 0, followed by a push of the f, which moves us to
State 5. Looking at the items in State 5, you'll notice that the dot follows the f. Since an
f must be on top of the stack in this state, the dot must be to the immediate right of the f
in all items in State 5. The next reduction is by t→f and gets us, first back to State 0, and
then to State 4. Since a t must now be at top of stack, all items in State 4 have a dot to
the immediate right of t. Let's look at it another way. State 0 contains the item
[e→.e+t]. A reduction by some production that has an e on its left-hand side causes us
to return to State 0 and then go to State 3. When the parser does this reduction, it has
read past all input symbols that comprise the e, and the following item appears in State 3
to signify this fact: [e→e.+t].³ Given the foregoing, you can create the parse table with
the following procedure:
(1) Initialize: State 0 is the start state. Initially, it contains a single LR(0) item consisting
    of the start production (the one with the goal symbol as its left-hand side),
    and a dot at the far left of the right-hand side. The initial set of items in this state
    (and all other states) is called the kernel or seed items. The set of kernel items for
    the start state looks like this:
        0
        s → . e
(2) Close the kernel items: The kernel items for the next states are created with a closure
    operation, which creates a set of closure items. Closure is performed on those
    kernel items that have a nonterminal symbol to the right of a dot. (I'll call this nonterminal
    symbol the closure symbol.) Add to the closure set items for those
    productions which, when reduced, can return us to the current state.
3. Items are usually differentiated from simple productions by surrounding them with brackets, as is done
here.
    (a) Initiate the closure process by adding to the closure set all productions that
        have the closure symbol on their left-hand side. Remember, if the dot is to the
        left of a nonterminal, there will be an outgoing transition from the current
        state (labeled with that nonterminal), and that transition represents the push
        part of a reduction. Consequently, productions with that nonterminal on their
        left-hand side could return us to the current state when a reduction by that
        production is performed.
        The new item is formed by placing a dot at the far left of the right-hand side
        of the production. In the current grammar, since an e is to the right of the dot
        in all the kernel items, items containing all productions with an e on their
        left-hand side are added. The closure set looks like this so far:
            0
            s → . e
            e → . e + t
            e → . t
    (b) Repeat the closure process on the newly added items. If a newly added item
        has a nonterminal to the right of the dot, and closure has not yet been performed
        on that nonterminal, add to the closure set all productions that have
        that nonterminal as their left-hand side. In the current example, the second
        production has a t to the right of the dot, so I'll add all productions with t on
        their left-hand side, yielding the following:
            0
            s → . e
            e → . e + t
            e → . t
            t → . t * f
            t → . f
    (c) Repeat (b) until no more items can be added. A production with an f to the
        right of a dot was added in the previous step, so the closure set is extended to
        include all productions that have f on their left-hand sides, yielding:
            0
            s → . e
            e → . e + t
            e → . t
            t → . t * f
            t → . f
            f → . ( e )
            f → . NUM
        Since none of the new productions have dots to the left of a nonterminal, the
        procedure is complete.
(3) Form the kernel items for the next states:
    (a) Partition the items: Group together all items in the current state (both kernel
        and closure items) that have the same symbol to the right of the dot. I'll call
        this symbol the transition symbol, and all items that share a transition symbol
        form a partition. The following picture shows the items partitioned by symbol
        to the right of the dot.
            0
            s → . e
            e → . e + t
            ----------------
            e → . t
            t → . t * f
            ----------------
            t → . f
            ----------------
            f → . ( e )
            ----------------
            f → . NUM
        Partitions that have dots at the far right of the productions do not form new
        states because they represent reductions from the current state. (There will
        be, at most, one of these partitions, and all ε productions are part of this partition.)
        All other partitions form the kernels of the new states.
    (b) Form the kernel items for the next states: Modify the items in those partitions
        that form the new states by moving the dot one notch to the right, and:
    (c) Add next-state transitions: If a state already exists that has the same set of
        kernel items as the modified partition, add an edge labeled with the transition
        symbol that goes from the current state to the existing state. Otherwise, create
        a new state, using the group of modified items as the kernel items. Add an
        edge from the current state to the new one on the transition symbol.
(4) Repeat steps (2) and (3) on the kernel items of any new states until no more states
    can be added to the table.
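The four steps of the procedure can be sketched in a few dozen lines of C. Treat this as an illustration under stated assumptions, not as the algorithm occs actually implements: the grammar is hard coded as "lhs:rhs" strings (one character per symbol, N standing for NUM), the names Item, State, Next, closure(), same_kernel(), and build() are mine, the table sizes are arbitrary, and ε productions are not handled. Transition symbols are tried in the order N ( e t f + * ), an order I picked only so that the state numbers come out the same as those in Table 5.9.

```c
#include <string.h>

#define MAXITEMS  16
#define MAXSTATES 32

/* The grammar of Table 5.8, one string per production: "lhs:rhs". */
static const char *Prod[] = {
    "s:e", "e:e+t", "e:t", "t:t*f", "t:f", "f:(e)", "f:N"
};
#define NPROD ((int)(sizeof(Prod) / sizeof(*Prod)))

typedef struct { int prod, dot; } Item;
typedef struct { Item items[MAXITEMS]; int nitems, nkernel; } State;

static State States[MAXSTATES];
static int   Nstates;
static int   Next[MAXSTATES][128];   /* Next[state][symbol], -1 if no edge */

static char after_dot(Item it) { return Prod[it.prod][2 + it.dot]; }

static int find_item(const Item *set, int n, Item it)
{
    int i;
    for (i = 0; i < n; ++i)
        if (set[i].prod == it.prod && set[i].dot == it.dot)
            return 1;
    return 0;
}

static void add_item(State *s, int prod, int dot)
{
    Item it;
    it.prod = prod;
    it.dot  = dot;
    if (!find_item(s->items, s->nitems, it) && s->nitems < MAXITEMS)
        s->items[s->nitems++] = it;
}

/* Step (2): add [x -> . rhs] for every nonterminal x that follows a dot. */
static void closure(State *s)
{
    int i, p;
    for (i = 0; i < s->nitems; ++i)      /* nitems grows as items are added */
        for (p = 0; p < NPROD; ++p)
            if (Prod[p][0] == after_dot(s->items[i]))
                add_item(s, p, 0);
}

/* Two states are the same state if their kernel items are the same. */
static int same_kernel(const State *a, const State *b)
{
    int i;
    if (a->nkernel != b->nkernel)
        return 0;
    for (i = 0; i < a->nkernel; ++i)
        if (!find_item(b->items, b->nkernel, a->items[i]))
            return 0;
    return 1;
}

/* Steps (1), (3), and (4). Returns the number of states, or -1 if full. */
static int build(void)
{
    static const char symbols[] = "N(etf+*)";
    int cur, i, j, k;

    for (i = 0; i < MAXSTATES; ++i)
        for (j = 0; j < 128; ++j)
            Next[i][j] = -1;

    States[0].nitems = 0;
    add_item(&States[0], 0, 0);          /* kernel of State 0: [s -> . e] */
    States[0].nkernel = 1;
    Nstates = 1;

    for (cur = 0; cur < Nstates; ++cur) {
        closure(&States[cur]);
        for (k = 0; symbols[k]; ++k) {
            State nxt;
            memset(&nxt, 0, sizeof nxt);
            for (i = 0; i < States[cur].nitems; ++i)   /* one partition */
                if (after_dot(States[cur].items[i]) == symbols[k])
                    add_item(&nxt, States[cur].items[i].prod,
                                   States[cur].items[i].dot + 1);
            if (nxt.nitems == 0)
                continue;                /* no item has this symbol */
            nxt.nkernel = nxt.nitems;
            for (j = 0; j < Nstates; ++j)
                if (same_kernel(&nxt, &States[j]))
                    break;
            if (j == Nstates) {          /* no such state yet: create it */
                if (Nstates == MAXSTATES)
                    return -1;
                States[Nstates++] = nxt;
            }
            Next[cur][(int)symbols[k]] = j;
        }
    }
    return Nstates;
}
```

Items whose dot is at the far right have nothing after the dot, so they never match a transition symbol and never start a partition; they are the reductions. Called on this grammar, build() should find the twelve states of the table, with Next[0]['e'] leading to State 3 and Next[7]['t'] leading to State 10.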
Figure 5.6 shows a complete state machine built from our grammar using the foregoing
procedure. The initial state is State 0, the first pass of the algorithm added States 1 to
5, and so on.
The issue of ε productions deserves further mention. ε productions are special only
in one way. There are no outgoing edges from any state that are labeled ε; rather, if an ε
production is added to a state as part of a closure operation, the machine reduces from
the current state on that production. In other words, the right-hand side of an ε production
effectively has the dot at its far right to begin with; the ε may as well not be there
because it represents an empty string. Consider a grammar such as the following fragment:
    s → a b
    b → ε
      | X
which generates the following states:
    [State diagram for this fragment]
The item [s→a.b] adds the closure items [b→ε.] and [b→.X], the latter of which
appears as [b→X.] in State 4. If b goes to ε in State 2, a reduction is performed from that
state. Since there are no symbols on the right-hand side, you won't pop anything off the
state stack, so you won't change states in the first part of the reduce. Nonetheless, since
b is the left-hand side of the production, you do want to go to State 2 in the goto part of
the reduce operation, so there's a transition on b from State 1.
Figure 5.6. An LR(0) State Machine
Unfortunately, LR(0) state machines have problems. As I said earlier, a reduction is
performed from any given state if a dot is at the far right of an item, and a terminal symbol
is shifted if the dot is to its immediate left. But consider a state like State 4 in Figure
5.6, where one of the items has a dot to the far right, and the other has a dot buried in the
middle of the production. Should the parser shift or reduce? This situation is called a
shift/reduce conflict. It's also possible for a state to have two items with different productions,
both of which have dots at the far right. This situation is called a reduce/reduce
conflict; I'll look at one in a moment. States that contain shift/reduce or reduce/reduce
conflicts are called inadequate states. A grammar is LR(0) if you can use the foregoing
procedures to create a state machine that has no inadequate states. Obviously, our current
grammar is not LR(0).
5.6.2 SLR(1) Grammars*
There is a simple way to resolve some shift/reduce conflicts. Remember how the
FOLLOW sets were used to form LL(1) select sets in the previous chapter. If a right-hand
side was nullable, you could apply that production if the lookahead symbol was in
the FOLLOW set of the left-hand side, because there could be a derivation in which all
elements of the right-hand side could effectively disappear, and the symbol following the
left-hand side would be the next input symbol in this case.
You can use similar reasoning to resolve shift/reduce conflicts. In order to do a
reduction, the next input symbol must be in the FOLLOW set of the nonterminal on the
left-hand side of the production by which we're reducing. Looking at State 3 in the
LR(0) machine in Figure 5.6, you can reduce by s→e if the lookahead symbol is in
FOLLOW(s). Similarly, you must shift if the lookahead symbol is a plus sign. As long
as FOLLOW(s) doesn't contain a plus sign, the shift/reduce conflict in this state can be
resolved by using the next input symbol. If all inadequate states in the LR(0) machine
generated from a grammar can be resolved in this way, then you have an SLR(1) grammar.
The FOLLOW sets for our current grammar are in Table 5.10. Looking at the
shift/reduce conflict in State 4, FOLLOW(e) doesn't contain a *, so the SLR(1) method
works in this case. Similarly, in State 3, FOLLOW(s) doesn't contain a +, so
everything's okay. And finally, in State 10, there is an outgoing edge labeled with a *,
but FOLLOW(e) doesn't contain a *. Since the FOLLOW sets alone are enough to
resolve the shift/reduce conflicts in all three states, this is indeed an SLR(1) grammar.
Table 5.10. FOLLOW Sets for the Expression Grammar
FOLLOW(s) = { ⊢ }
FOLLOW(e) = { ⊢ ) + }
FOLLOW(t) = { ⊢ ) + * }
FOLLOW(f) = { ⊢ ) + * }
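The SLR(1) test amounts to checking that two sets don't intersect: the terminals that label outgoing edges of an inadequate state, and the FOLLOW set of the left-hand side of the item that wants to reduce. Here is a minimal sketch that represents both sets as bit masks. The token bit values and the function names slr_resolves() and slr_action() are assumptions of mine; the FOLLOW sets themselves are the ones in Table 5.10.

```c
/* One bit per terminal. EOI is the end-of-input marker. */
enum { EOI = 1, RP = 2, PLUS = 4, STAR = 8, LP = 16, NUM = 32 };

/* The FOLLOW sets of Table 5.10. */
static const unsigned Follow_s = EOI;
static const unsigned Follow_e = EOI | RP | PLUS;
static const unsigned Follow_t = EOI | RP | PLUS | STAR;
static const unsigned Follow_f = EOI | RP | PLUS | STAR;

/* A shift/reduce conflict is resolved if no terminal that can be shifted
 * is also in FOLLOW of the reducing production's left-hand side. */
static int slr_resolves(unsigned shift_set, unsigned follow_set)
{
    return (shift_set & follow_set) == 0;
}

/* Decide what to do in such a state: 's' to shift, 'r' to reduce,
 * '-' for an error transition. */
static char slr_action(unsigned shift_set, unsigned follow_set,
                       unsigned lookahead)
{
    if (lookahead & shift_set)
        return 's';
    if (lookahead & follow_set)
        return 'r';
    return '-';
}
```

State 4, for example, can shift a * and wants to reduce by e→t, so the call is slr_resolves(STAR, Follow_e); States 3 and 10 are checked the same way with their own shift symbols and FOLLOW sets.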
5.6.3 LR(1) Grammars*
Many grammars are not as tractable as the current one; it's likely that a FOLLOW
set will contain symbols that also label an outgoing edge. A closer look at the machine
yields an interesting fact that can be used to solve this difficulty. A nonterminal's FOLLOW
set includes all symbols that can follow that nonterminal in every possible context.
The state machine, however, is more limited. You don't really care which symbols can
follow a nonterminal in every possible case; you care only about those symbols that can
be in the input when you reduce by a production that has that nonterminal on its left-hand
side. This set of relevant lookahead symbols is typically a subset of the complete
FOLLOW set, and is called the lookahead set.
The lookahead set associated with a nonterminal is created by following a production
as it moves through the state machine, looking at those terminal symbols that can actually
follow the nonterminal just after a reduction. In the case of the inadequate State 10,
there are two paths from State 0 to State 10 (0-3-7-10 and 0-2-6-7-10). For the purposes
of collecting lookaheads, the only important terminals are those that follow nonterminal
symbols preceded by dots. (We're looking for the sequence: dot-nonterminal-terminal.)
Remember, here, that we're really figuring out which elements of the
FOLLOW set tell us whether or not to reduce in a given situation. If a dot precedes a
nonterminal, then that nonterminal can be pushed onto the parse stack as part of the
reduction, and when the nonterminal is pushed, one of the terminals that follows it in the
grammar had better be in the input.
Look at State 0 in Figure 5.6. The parser can enter State 0 in the middle of a reduction
by some production that has an e on its left-hand side, and it will exit State 0 by the
e edge. This edge was created because a dot preceded an e in some item in that state.
Looking at State 0 again, a plus can be in the input when the parser reduces by a production
with e on its left-hand side because of the item:
    [e → . e + t]
If the parser enters State 0 in the middle of a reduction by a production with e on its
left-hand side, then a + could reasonably be the next input symbol. Looked at another
way, after all the tokens that comprise the e are read, + could be the next input symbol in
this context. There's also an implied end-of-input marker in the start production, so
    [s → . e ⊢]
tells us to add ⊢ to the lookahead set. The item:
    [f → . ( e )]
doesn't add anything to the lookahead set, even though it has an e in it, because this item
isn't considered when a reduction is performed (the dot doesn't precede the e). Of the
other states along the two paths from State 0 to State 10, the only other lookahead is
added in State 2, because of the item:
    [f → ( . e )]
If the parser enters State 2 in the middle of a reduction by some production with an e on
its left-hand side, then a right parenthesis is also a possible input character. When you
get to State 10, the only elements of FOLLOW(e) that can actually follow an e in this
context are the elements that you've collected by looking at things that can actually follow
an e in some state on the path from the start state to State 10. This set of lookaheads
has only three elements: +, ), and ⊢. A * is not part of this set, so a * cannot legally follow
the e in this context, and the parser doesn't have to consider the * when it's deciding
whether to reduce from State 10. It can safely reduce if the lookahead symbol is a +, ),
or ⊢, and shift on a *.
Formalizing the foregoing procedure, an LR(1) item is an LR(0) item augmented with
a lookahead symbol. It has three parts (a production, a dot, and a lookahead) and is typically
represented like this: [s→α.β, D], where D is the lookahead symbol. Note that
the lookahead is part of the item; two items with the same production and dot position,
but with different lookaheads, are different items.
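In code, the change from an LR(0) item to an LR(1) item is one extra field, and the lookahead takes part in the comparison. The following is only a sketch of that idea; the Lr1_item type, the same_item() function, and the use of a character code for the lookahead (with '$' standing in for the end-of-input marker) are my own assumptions.

```c
/* An LR(1) item: a production, a dot position, and one lookahead. */
typedef struct {
    int  prod;        /* production number                   */
    int  dot;         /* number of symbols left of the dot   */
    char lookahead;   /* a terminal; '$' means end of input  */
} Lr1_item;

/* Items are equal only if all three parts are equal. */
static int same_item(Lr1_item a, Lr1_item b)
{
    return a.prod == b.prod
        && a.dot == b.dot
        && a.lookahead == b.lookahead;
}
```

With this definition, [e→.e+t, ⊢] and [e→.e+t, +] compare as different items even though the production and dot position are identical.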
The process of creating an LR(1) state machine differs from that used to make
an LR(0) machine only in that LR(1) items are created in the closure operation rather
than LR(0) items. The initial item consists of the start production with the dot at the far
left and ⊢ as the lookahead character. In the grammar we've been using, it is:
    [s → . e, ⊢]
An LR(1) item is created from a kernel item as follows: Given an item and a production
that take the following forms:
    [s → α . x β, C]
    x → γ
(s and x are nonterminals, α, β, and γ are collections of any number of terminals and
nonterminals [any number includes zero], and C is a terminal symbol) add:
    [x → . γ, FIRST(β C)]
to the closure set. FIRST(β C) is computed like FIRST of a right-hand side. If β is not
nullable, then FIRST(β C) is FIRST(β), otherwise it is the union of FIRST(β) and C.
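The FIRST(β C) computation is short enough to sketch directly. In the code below, β is a string with one character per symbol (N stands for NUM), sets are bit masks, and first_of() and is_nullable() are hard coded for the expression grammar of Table 5.8; all of these names and encodings are assumptions of mine made for the sake of the example. No symbol in this grammar is nullable, so C is contributed only when β is empty, but the loop is written for the general case.

```c
/* One bit per terminal. EOI is the end-of-input marker. */
enum { EOI = 1, RP = 2, PLUS = 4, STAR = 8, LP = 16, NUM = 32 };

/* FIRST of a single grammar symbol of Table 5.8. */
static unsigned first_of(char sym)
{
    switch (sym) {
    case '+': return PLUS;
    case '*': return STAR;
    case '(': return LP;
    case ')': return RP;
    case 'N': return NUM;
    case 'e':
    case 't':
    case 'f': return LP | NUM;     /* every expression starts ( or NUM */
    default:  return 0;
    }
}

/* No symbol of this grammar can derive the empty string. */
static int is_nullable(char sym)
{
    (void)sym;
    return 0;
}

/* FIRST(beta C): scan beta left to right, stopping at the first symbol
 * that is not nullable. C is added only if all of beta can vanish. */
static unsigned first_beta_c(const char *beta, unsigned c)
{
    unsigned set = 0;

    for (; *beta; ++beta) {
        set |= first_of(*beta);
        if (!is_nullable(*beta))
            return set;
    }
    return set | c;
}
```

An empty β corresponds to the call first_beta_c("", EOI), which simply hands back the lookahead of the original item; a β of + t corresponds to first_beta_c("+t", EOI), where the + hides everything to its right.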
Looking at a real situation, the start production relates to the previous formula as follows:
    [ s → α . x β, C ]
    [ s →   . e  , ⊢ ]
Note that both α and β match empty strings in this situation. Using the previous formula,
there are two productions that have e on the left-hand side: e→e+t and e→t, so add
the following to the closure set:
    [ x → . γ,     FIRST(β C) ]
    [ e → . e + t, FIRST(ε ⊢) ]
    [ e → . t,     FIRST(ε ⊢) ]
I'm using ε here to signify that β in the original item matched an empty string.
FIRST(ε ⊢) is {⊢}. Continue the closure operation by closing both of the newly added
items: The first one is:
    [ s → α . x β,   C ]
    [ e →   . e + t, ⊢ ]
so add:
    [ x → . γ,     FIRST(β C) ]
    [ e → . e + t, FIRST(+ t ⊢) ]
    [ e → . t,     FIRST(+ t ⊢) ]
Note that these items differ from the previous ones only in the lookahead sets. Since + is
not nullable, FIRST(+ t ⊢) is +. The second production derived directly from the kernel
item was:
    [ s → α . x β, C ]
    [ e →   . t  , ⊢ ]
so add:
    [ x → . γ,     FIRST(β C) ]
    [ t → . t * f, FIRST(ε ⊢) ]
    [ t → . f,     FIRST(ε ⊢) ]
The process continues in this manner until no more new LR(1) items can be created. The
next states are created as before, adding edges for all symbols to the right of the dot and
moving the dots in the kernel items of the new machine. The entire LR(1) state machine
for our grammar is shown in Figure 5.7. I've saved space in the figure by merging
together all items in a state that differ only in lookaheads. The lookaheads for all such
items are shown on a single line in the right column of each state. Figure 5.8 shows how
the other closure items in State 0 are derived. Derivations for items in States 2 and 14 of
the machine are also shown.
The LR(1) lookaheads are the relevant parts of the FOLLOW sets that we looked at
earlier. A reduction from a given state is indicated only if the dot is at the far right of a
production and the current lookahead symbol is in the LR(1) lookahead set for the item
that contains that production. As before, a shift is indicated when the lookahead symbol
matches a terminal label on an outgoing edge; outgoing edges marked with nonterminals
are transitions in the goto part of the parse table. If there's a conflict between the lookahead
sets and one of the outgoing edges, then the input grammar is not LR(1).
ε items are special only in that they're a degenerate case of a normal LR(1) item. In
particular, ε items take the form [s→ε., C], where the lookahead C was inherited from
the item's parent in the derivation tree. An ε item is never a kernel item, so will never
generate transitions to a next state. The action associated with an ε item is always a
reduce on C from the current state.
Figure 5.7. LR(1) State Machine
Figure 5.7. continued. LR(1) State Machine
[State diagram: States 12 through 21.]
5.6.4 LALR(1) Grammars*
The main disadvantage of an LR(1) state machine is that it is typically twice the size
of an equivalent LR(0) machine, which is pretty big to start out with [the C grammar
used in the next chapter uses 287 LR(0) states]. Fortunately, it's possible to have the
best of both worlds. Examining the LR(1) machine in Figure 5.7, you are immediately
struck by the similarity between the two halves of the machine. A glance at the
lookaheads tells you what's happening. A close parenthesis is a valid lookahead character
only in a parenthesized subexpression; it can't appear in a lookahead set until after the
open parenthesis has been processed. Similarly, the end-of-input marker is not a valid
lookahead symbol inside a parenthesized expression because you must go past a close
parenthesis first. The outer part of the machine (all of the left half of Figure 5.7 except
States 6 and 9) handles unparenthesized expressions, and the inner part (States 6 and 9,
and all of the right half of Figure 5.7) handles parenthesized subexpressions. The parser
moves from one part to the other by reading an open parenthesis and entering State 2,
which acts as a portal between the inner and outer machines. The parser moves back
into the outer part of the machine by means of a reduction; it pops enough items so that
the uncovered state number is in the outer machine.
Figure 5.8. Deriving Closure Items for LR(1) States 0, 2 and 14
Given: [s → α . x β, c] and x → γ
Add:   [x → . γ, FIRST(β c)]
[Derivation trees for State 0 and for States 2 and 14.]
You can take advantage of the similarity between the machines by merging them
together. Two LR(1) states can be merged if the production and dot-position
components of all kernel items are the same. The states are merged by combining the
lookahead sets for the items that have the same production and dot position, and then
modifying the next-state transitions to point at the merged state. For example, States 2
and 14 are identical in all respects but the lookaheads. You can merge them together,
forming a new State 2/14, and then go through the machine changing all next-state
transitions to either State 2 or State 14 to point at the new State 2/14. This merging process
creates the machine pictured in Figure 5.9.
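The merge test and the lookahead union can be sketched in a few lines of C. This is only an illustration, not the occs implementation: the Item and State types, the MAX_ITEMS limit, the bit-mask representation of lookahead sets, and the assumption that kernel items are kept sorted by production and dot position are all hypothetical. Redirecting the next-state transitions is not shown.

```c
#include <assert.h>

#define MAX_ITEMS 8              /* hypothetical limit on kernel items per state */

/* One LR(1) kernel item: a production number, a dot position, and a
 * lookahead set stored as a bit mask (bit n set means token n is a lookahead).
 */
typedef struct {
    int      production;
    int      dot;
    unsigned lookaheads;
} Item;

typedef struct {
    int  nitems;
    Item items[MAX_ITEMS];       /* assumed sorted by (production, dot) */
} State;

/* Two states can be merged when their kernel items agree in production
 * and dot position. The lookaheads are deliberately ignored.
 */
int same_core(const State *a, const State *b)
{
    int i;

    if (a->nitems != b->nitems)
        return 0;

    for (i = 0; i < a->nitems; ++i)
        if (a->items[i].production != b->items[i].production ||
            a->items[i].dot        != b->items[i].dot)
            return 0;

    return 1;
}

/* Fold src into dst by OR-ing together the lookahead sets of matching
 * items. Returns 1 if any lookahead was added to dst, 0 otherwise.
 */
int merge_lookaheads(State *dst, const State *src)
{
    int i, changed = 0;

    assert(same_core(dst, src));

    for (i = 0; i < dst->nitems; ++i) {
        unsigned merged = dst->items[i].lookaheads | src->items[i].lookaheads;

        if (merged != dst->items[i].lookaheads) {
            dst->items[i].lookaheads = merged;
            changed = 1;
        }
    }
    return changed;
}
```

The return value of merge_lookaheads() is useful when lookaheads are added to an existing state: if nothing changed, the states derived from it need not be revisited.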
Figure 5.9. LALR(1) State Machine for the Expression Grammar
[State diagram: merged States 0, 1/15, 2/14, 3, 4/13, 5/12, 6/20, 7/16, 8/18, 9/21, 10/17, and 11/19.]
The merging process does not affect error detection in the grammar. It usually takes
a little longer for the merged machine to detect an error, but the merged machine reads
no more input than the unmerged one. There is one pitfall, however. The merging does,
after all, add elements to the lookahead sets. Consequently, there might be inadequate
states in the merged machine that are not present in the original LR(1) machine. (This is
rare, but it can happen.) If no such conflicts are created by the merging process, the
input grammar is said to be an LALR(1) grammar.
If you compare the LALR(1) machine in Figure 5.9 with the LR(0) machine in
Figure 5.6 on page 360 you'll notice that the machines are identical in all respects but the
lookahead characters. This similarity is no accident, and the similarity of the machines
can be exploited when making the tables; you can save space when making an LALR(1)
machine by constructing an LR(0) machine directly and adding lookaheads to it, rather
than creating the larger LR(1) machine and then merging equivalent states. You can do
this in several ways; the method used by occs is the simplest. It proceeds as if it were
making the LR(1) machine, except that before creating a new state, it checks to see if an
equivalent LR(0) state (an LR(1) state with the same productions and dot positions in the
kernel) already exists. If it finds such a state, occs adds lookaheads to the existing state
rather than creating a new one. I'll look at this procedure in depth, below.
There are other, more efficient methods for creating LALR(1) machines directly from
the grammar that I won't discuss here. An alternate method that is essentially the same
as the foregoing, but which uses different data structures and is somewhat faster as a
consequence, is suggested in [Aho], pp. 240-244. An LR(0) state machine is created and
then lookaheads are added to it directly. You can see the basis of the method by looking
at Figure 5.8 on page 366. Lookaheads appear on the tree in one of two ways: either they
propagate down the tree from a parent to a child or they appear spontaneously. A
lookahead character propagates from the item [s→α.xβ, c] to its child when FIRST(β) is
nullable (that is, when β can derive ε). It appears spontaneously when FIRST(β) contains
only terminal symbols. An even more efficient method is described in [DeRemer79] and
[DeRemer82]; it is summarized in [Tremblay], pp. 375-383.
5.7 Representing LR State Tables
The LALR(1) state machine is easily represented by a two-dimensional array of
integers, where the value of the number represents the actions. The row index is
typically the state number, and the columns are indexed by numbers representing the tokens
and nonterminals. The LALR(1) state machine in Figure 5.9 is represented by the
two-dimensional array in Table 5.11. (This is the same table as in Figure 5.9.) Shifts can be
represented by positive numbers, and reductions can be represented with negative
numbers. For example, an s6 directive can be represented by the number 6, and an r5 can
be represented by a -5. A very large integer that's not likely to be a state number can be
used to signify error transitions. The accept action, which is really a reduction by
Production 0, can be represented by a zero.
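The integer encoding just described can be sketched as follows. The names (ERR_ACTION, ActionKind, and the helper functions) are invented for this illustration and are not part of occs; INT_MAX simply stands in for the "very large integer" used to mark errors.

```c
#include <assert.h>
#include <limits.h>

#define ERR_ACTION INT_MAX       /* hypothetical error marker */

typedef enum { ACT_SHIFT, ACT_REDUCE, ACT_ACCEPT, ACT_ERROR } ActionKind;

/* sN is stored as N. State 0 can't be a shift target in this scheme,
 * because 0 is reserved for the accept action.
 */
int encode_shift(int state)
{
    assert(state > 0 && state != ERR_ACTION);
    return state;
}

/* rN is stored as -N. Production 0 is the accept action, stored as 0. */
int encode_reduce(int production)
{
    assert(production > 0);
    return -production;
}

/* Classify a table cell. */
ActionKind action_kind(int cell)
{
    if (cell == ERR_ACTION) return ACT_ERROR;
    if (cell == 0)          return ACT_ACCEPT;
    return (cell > 0) ? ACT_SHIFT : ACT_REDUCE;
}

/* The target state of a shift or the production number of a reduction.
 * The result is meaningless for an error cell.
 */
int action_operand(int cell)
{
    return (cell < 0) ? -cell : cell;
}
```

Reserving zero for accept costs nothing here: every state entered by a shift has a kernel item with the dot to the right of at least one symbol, so the start state is never a shift target.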
Though the table can be represented by a two-dimensional array, such a table is
likely to be very large. Taking the C grammar that's used in the next chapter as
characteristic, the grammar has 182 productions, 50 terminal symbols, and 75 nonterminal
symbols. It generates 310 LALR(1) states. The resulting LALR(1) transition matrix has as
many rows as there are states, and as many columns as there are terminal and
nonterminal symbols, so there are (50+75) = 125 columns (taking the end-of-input marker
to be one of the 50 terminal symbols) and 310 rows, or 38,750 cells. Given a 2-byte int,
that's 77,500 bytes. In most systems, it's worthwhile to minimize the size of the table.
Only 3,837 cells (or 10%) of the 38,750 cells in the uncompressed table are nonerror
transitions. As a consequence, you can get considerable compression by using the
pair-compression method discussed in Chapter Two: Each row in the table is represented by
an array, the first element of which is a count of the number of pairs in the array, and the
remainder of which is a series of [terminal, action] pairs. For example, Row 4 of the
transition matrix can be represented as follows:
    state_4[] = { 4, [),r2], [+,r2], [⊢,r2], [*,s8] };
The initial 4 says that there are four pairs that follow, the first pair says to reduce by
Production 2 if the input symbol is a close parenthesis, and so forth. If the current input
symbol is not attached to any pair, then an error transition is indicated.
Table 5.11. Tabular Representation of Transition Diagram in Figure 5.9
          Shift/Reduce Table (Yy_action)                   Goto Table (Yy_goto)
State     ⊢       NUM     +       *       (       )        s     e     t     f
  0       -       s1      -       -       s2      -        -     3     4     5
  1       r6      -       r6      r6      -       r6       -     -     -     -
  2       -       s1      -       -       s2      -        -     6     4     5
  3       accept  -       s7      -       -       -        -     -     -     -
  4       r2      -       r2      s8      -       r2       -     -     -     -
  5       r4      -       r4      r4      -       r4       -     -     -     -
  6       -       -       s7      -       -       s9       -     -     -     -
  7       -       s1      -       -       s2      -        -     -     10    5
  8       -       s1      -       -       s2      -        -     -     -     11
  9       r5      -       r5      r5      -       r5       -     -     -     -
 10       r1      -       r1      s8      -       r1       -     -     -     -
 11       r3      -       r3      r3      -       r3       -     -     -     -
Error transitions are marked with -
Another array (call it the index array) is indexed by state number and each cell
evaluates to a pointer to one of the pairs arrays. If an element of the index array is NULL,
then there are no outgoing transitions from the indicated state. If several rows of the
table are identical (as is the case in the current table), only a single copy of the row
needs to be kept, and several pointers in the index array can point at the single row. The
goto portion of the table is organized the same way, except that the row array contains
[nonterminal, next state] pairs. Figure 5.10 shows how the table in Table 5.11 is
represented in an occs output file. The actual code is in Listing 5.2.
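As a rough check on what pair compression buys, the sizes can be worked out from the figures quoted above for the C grammar (310 states, 125 columns, 3,837 nonerror cells). The functions below are a hypothetical back-of-the-envelope sketch: the compressed estimate counts one count word per row plus two integers per nonerror cell, and it ignores both the index array and any sharing of identical rows.

```c
#include <assert.h>

/* Cells in the uncompressed two-dimensional table. */
long uncompressed_cells(long states, long columns)
{
    assert(states >= 0 && columns >= 0);
    return states * columns;
}

/* Integers needed by the pair-compressed rows: one count per row plus
 * two integers (symbol, action) for every nonerror cell. The index
 * array and shared rows are not accounted for.
 */
long pair_compressed_ints(long states, long nonerror_cells)
{
    assert(states >= 0 && nonerror_cells >= 0);
    return states + 2 * nonerror_cells;
}
```

With 2-byte integers, uncompressed_cells(310, 125) gives 38,750 cells or 77,500 bytes, while pair_compressed_ints(310, 3837) gives 7,984 integers or 15,968 bytes, before the index array is added and before identical rows are shared.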
The Yya arrays on lines 20 to 28 of Listing 5.2 represent the rows, and Yy_action
(on lines 30 to 34) is the index array; the pairs consist of tokenized input symbols and
action descriptions, as described earlier. The values used for tokens are as follows:
    Token    Value returned from lexical analyzer
    ⊢        0
    NUM      1
    +        2
    *        3
    (        4
    )        5
Looking at row four of the table (Yya004 on line 23), the pair [5,-2] is a reduction by
Production 2 (-2) on a close parenthesis (5). [3,8] is a shift to State 8 (8) on a * (3), and
so on. The tables can be decompressed using the yy_next() subroutine in Listing 5.3.
Figure 5.10. Actual Representation of Parse Tables for Grammar in Listing 5.8
Listing 5.2. C Representation of the Parse Tables
 1  /*-----------------------------------------------------------------------
 2   * The Yy_action table is action part of the LALR(1) transition
 3   * matrix. It's compressed and can be accessed using the yy_next()
 4   * subroutine, declared below.
 5   *
 6   *             Yya000[]={   3,   5,3   ,   2,2   ,   1,1   };
 7   *  state number---+        |    | |
 8   *  number of pairs in list-+    | |
 9   *  input symbol (terminal)------+ |
10   *  action-------------------------+
11   *
12   *  action = yy_next( Yy_action, cur_state, lookahead_symbol );
13   *
14   *      action <  0   -- Reduce by production n,  n == -action.
15   *      action == 0   -- Accept (ie. reduce by production 0)
16   *      action >  0   -- Shift to state n,  n == action.
17   *      action == YYF -- error
18   */
19
20  int Yya000[]={ 2,  4,2   ,  1,1   };
21  int Yya001[]={ 4,  5,-6  ,  3,-6  ,  2,-6  ,  0,-6  };
22  int Yya003[]={ 2,  0,0   ,  2,7   };
23  int Yya004[]={ 4,  5,-2  ,  2,-2  ,  0,-2  ,  3,8   };
24  int Yya005[]={ 4,  5,-4  ,  3,-4  ,  2,-4  ,  0,-4  };
25  int Yya006[]={ 2,  5,9   ,  2,7   };
26  int Yya009[]={ 4,  5,-5  ,  3,-5  ,  2,-5  ,  0,-5  };
27  int Yya010[]={ 4,  5,-1  ,  2,-1  ,  0,-1  ,  3,8   };
28  int Yya011[]={ 4,  5,-3  ,  3,-3  ,  2,-3  ,  0,-3  };
29
30  int *Yy_action[12] =
31  {
32      Yya000, Yya001, Yya000, Yya003, Yya004, Yya005, Yya006, Yya000, Yya000,
33      Yya009, Yya010, Yya011
34  };
35
36  /*-----------------------------------------------------------------------
37   * The Yy_goto table is goto part of the LALR(1) transition matrix. It's com-
38   * pressed and can be accessed using the yy_next() subroutine, declared below.
39   *
40   *  nonterminal = Yy_lhs[ production number by which we just reduced ]
41   *
42   *             Yyg000[]={   3,   5,3   ,   2,2   ,   1,1   };
43   *  uncovered state-+       |    | |
44   *  number of pairs in list-+    | |
45   *  nonterminal------------------+ |
46   *  goto this state----------------+
47   *
48   *  goto_state = yy_next( Yy_goto, cur_state, nonterminal );
49   */
50  int Yyg000[]={ 3,  3,5   ,  2,4   ,  1,3   };
51  int Yyg002[]={ 3,  3,5   ,  2,4   ,  1,6   };
52  int Yyg007[]={ 2,  3,5   ,  2,10  };
53  int Yyg008[]={ 1,  3,11  };
54
55  int *Yy_goto[12] =
56  {
57      Yyg000, NULL  , Yyg002, NULL  , NULL  , NULL  , NULL  , Yyg007, Yyg008,
58      NULL  , NULL  , NULL
59  };
60
61  /*-----------------------------------------------------------------------
62   * The Yy_lhs array is used for reductions. It is indexed by production
63   * number and holds the associated left-hand side adjusted so that the
64   * number can be used as an index into Yy_goto.
65   */
66
67  YYPRIVATE int Yy_lhs[7] =
68  {
69      /*  0 */   0,
70      /*  1 */   1,
71      /*  2 */   1,
72      /*  3 */   2,
73      /*  4 */   2,
74      /*  5 */   3,
75      /*  6 */   3
76  };
77
78  /*-----------------------------------------------------------------------
79   * The Yy_reduce[] array is indexed by production number and holds
80   * the number of symbols on the right-hand side of the production.
81   */
82
83  YYPRIVATE int Yy_reduce[7] =
84  {
85      /*  0 */   1,
86      /*  1 */   3,
87      /*  2 */   1,
88      /*  3 */   3,
89      /*  4 */   1,
90      /*  5 */   3,
91      /*  6 */   1
92  };
Listing 5.3. yynext.c Table Decompression
int yy_next( table, cur_state, symbol )
int **table;        /* table to decompress */
int symbol;         /* column index        */
int cur_state;      /* row index           */
{
    /* Given current state and input symbol, return next state. */

    int *p = table[ cur_state ];
    int i;

    if( p )
        for( i = (int) *p++; --i >= 0; p += 2 )
            if( symbol == p[0] )
                return p[1];

    return YYF;     /* error indicator */
}
The goto transitions are computed using two arrays. Yy_lhs (on line 67 of Listing
5.2) is indexed by production number and evaluates to an integer representing the
symbol on the left-hand side of that production. That integer is, in turn, used to search the
goto part of the array for the next-state transition. A second array, Yy_reduce on line
83 of Listing 5.2, is indexed by production number and evaluates to the number of
symbols on the right-hand side of that production. A simplified version of the occs parser
loop, which demonstrates how the tables are used, is shown in Listing 5.4.
Since there tend to be fewer nonterminals than states, the Yy_goto array could be
made somewhat smaller if it were compressed by column rather than by row. The
Yy_goto array could be indexed by nonterminal symbol and could contain
[current_state, next_state] pairs. Occs doesn't do this both because two
table-decompression subroutines would be needed and because the parser would run more
slowly because the chain lengths would be a bit longer. You can also combine the two
arrays, putting shift actions, reduce actions, and goto transitions together in a single row
array; but again, it would take longer to decompress the table because of the extra chain
lengths.
Listing 5.4. Simplified Parser Loop Demonstrating Parse-Table Usage
push( 0 );                              /* Push start state            */
lookahead = yylex();                    /* get first lookahead symbol  */

while( 1 )
{
    do_this = yy_next( Yy_action, state_at_top_of_stack(), lookahead );

    if( do_this == YYF )
    {
        /* ERROR */
    }
    else if( do_this > 0 )              /* Simple shift action         */
    {
        lookahead = yylex();            /* advance                     */
        push( do_this );                /* push new state              */
    }
    else                                /* reduction                   */
    {
        yy_act( -do_this );             /* do code-generation action   */

        if( do_this == YYACCEPT )
            break;
        else
        {
            production_number = -do_this;

            rhs_length = Yy_reduce[ production_number ];

            while( --rhs_length >= 0 )  /* pop rhs_length items        */
                pop();

            next_state = yy_next( Yy_goto, state_at_top_of_stack(),
                                           Yy_lhs[production_number] );
            push( next_state );
        }
    }
}
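The loop of Listing 5.4 can be exercised against the tables of Listing 5.2 with a small, self-contained driver such as the following sketch. Several things here are assumptions made for the illustration: the value of YYF (the text says only that it is a large integer), the MAX_DEPTH stack limit, the parse() function and its token-array interface, and the use of ANSI prototypes. Instead of calling a code-generation action, the sketch records the production number used by each reduction.

```c
#include <assert.h>
#include <stddef.h>

#define YYF       32767          /* assumed error marker */
#define MAX_DEPTH 64             /* hypothetical state-stack limit */

enum { T_EOI = 0, T_NUM, T_PLUS, T_STAR, T_LP, T_RP };  /* token values from the text */

/* Action rows: a count, then [token, action] pairs (Listing 5.2). */
static int Yya000[] = { 2,  4,2,   1,1 };
static int Yya001[] = { 4,  5,-6,  3,-6,  2,-6,  0,-6 };
static int Yya003[] = { 2,  0,0,   2,7 };
static int Yya004[] = { 4,  5,-2,  2,-2,  0,-2,  3,8 };
static int Yya005[] = { 4,  5,-4,  3,-4,  2,-4,  0,-4 };
static int Yya006[] = { 2,  5,9,   2,7 };
static int Yya009[] = { 4,  5,-5,  3,-5,  2,-5,  0,-5 };
static int Yya010[] = { 4,  5,-1,  2,-1,  0,-1,  3,8 };
static int Yya011[] = { 4,  5,-3,  3,-3,  2,-3,  0,-3 };
static int *Yy_action[12] = { Yya000, Yya001, Yya000, Yya003, Yya004, Yya005,
                              Yya006, Yya000, Yya000, Yya009, Yya010, Yya011 };

/* Goto rows: a count, then [nonterminal, next state] pairs. */
static int Yyg000[] = { 3,  3,5,  2,4,  1,3 };
static int Yyg002[] = { 3,  3,5,  2,4,  1,6 };
static int Yyg007[] = { 2,  3,5,  2,10 };
static int Yyg008[] = { 1,  3,11 };
static int *Yy_goto[12] = { Yyg000, NULL, Yyg002, NULL, NULL, NULL,
                            NULL, Yyg007, Yyg008, NULL, NULL, NULL };

static int Yy_lhs[7]    = { 0, 1, 1, 2, 2, 3, 3 };   /* left-hand sides     */
static int Yy_reduce[7] = { 1, 3, 1, 3, 1, 3, 1 };   /* right-hand lengths  */

/* Same search as Listing 5.3, written with an ANSI prototype. */
static int yy_next(int **table, int cur_state, int symbol)
{
    int *p = table[cur_state];
    int  i;

    if (p)
        for (i = *p++; --i >= 0; p += 2)
            if (symbol == p[0])
                return p[1];
    return YYF;
}

/* Parse a token array that ends with T_EOI. Each reduction's production
 * number is stored in reductions[]. Returns the number of reductions on
 * accept, or -1 on a syntax error or when a buffer is too small.
 */
int parse(const int *tokens, int *reductions, int max_reductions)
{
    int stack[MAX_DEPTH];
    int sp = 0, nreduce = 0;
    int lookahead = *tokens++;

    stack[sp] = 0;                                   /* push start state */
    for (;;) {
        int do_this = yy_next(Yy_action, stack[sp], lookahead);

        if (do_this == YYF)
            return -1;                               /* error            */
        else if (do_this > 0) {                      /* shift            */
            if (sp + 1 >= MAX_DEPTH)
                return -1;
            stack[++sp] = do_this;
            lookahead = *tokens++;                   /* advance          */
        }
        else if (do_this == 0)                       /* accept           */
            return nreduce;
        else {                                       /* reduce           */
            int production = -do_this;
            int next_state;

            if (nreduce >= max_reductions)
                return -1;
            reductions[nreduce++] = production;

            sp -= Yy_reduce[production];             /* pop the rhs      */
            assert(sp >= 0);
            next_state = yy_next(Yy_goto, stack[sp], Yy_lhs[production]);
            stack[++sp] = next_state;
        }
    }
}
```

Given the tokens for NUM + NUM * NUM followed by the end-of-input marker, parse() records reductions by Productions 6, 4, 2, 6, 4, 6, 3, and 1, in that order: the multiplication (Production 3) is reduced before the addition (Production 1). Unlike Listing 5.4, the sketch tests for the accept action before recording anything, so Production 0 does not appear in the output.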
5.8 Eliminating Single-Reduction States*
Reexamining Table 5.11 on page 369, you'll notice that States 1, 5, 9, and 11 all
consist of nothing but error transitions and reductions by a single production. For example,
all the actions in State 1 are reductions by Production 6. These states, called
single-reduction states, are typically caused by productions that are used only to add
precedence rules to a grammar; they are created by productions with a single symbol on
the right-hand side, such as the e→t and t→f in the grammar we've been using.
You can see what these states are actually doing by looking at how one of the states
is used as the parse progresses. If a NUM is found in State 0, for example, the parser
shifts to State 1, where the only legal action is a reduce by Production 6 if the next input
character is ⊢, +, * or close parenthesis. This reduction just gets us back to State 0,
without having advanced the input. All this backing and forthing is really unnecessary
because the parser could reduce by Production 6 directly from State 0 if the next input
character is a NUM and the following character is a ⊢, +, * or close parenthesis. If
something else is in the input, the error is caught the next time the parser tries to shift, so
the error information is actually duplicated in the table.
All single-reduction states can be removed from the table by introducing another
class of action items. If a row of the table contains a shift to a single-reduction state,
replace it with a directive of the form dN, where N is the number of the production by
which the parser would reduce had it gone to the single-reduction state. In the current
table, for example, all s1 directives would be replaced by d6 directives. The parser
processes a d directive just like a normal reduce directive, except that it pops one fewer
than the number of symbols on the right-hand side of the indicated production, because it
didn't push the last of these symbols.
A similar substitution has to be made in the goto portion of the table: all transitions
into single-reduction states must be replaced with equivalent d directives. This change
forces a second modification to the parser because several code-generation actions might
now have to be executed during a reduction if a d directive is encountered in the goto
table. The earlier table is modified to eliminate single-reduction states in Table 5.12. A
parser algorithm for the new table is in Table 5.13.
Table 5.12. Parse Table with Single-Reduction States Removed
          Shift/Reduce Table (Yy_action)                   Goto Table (Yy_goto)
State     ⊢       NUM     +       *       (       )        s     e     t     f
  0       -       d6      -       -       s2      -        -     3     4     d4
  2       -       d6      -       -       s2      -        -     6     4     d4
  3       accept  -       s7      -       -       -        -     -     -     -
  4       r2      -       r2      s8      -       r2       -     -     -     -
  6       -       -       s7      -       -       d5       -     -     -     -
  7       -       d6      -       -       s2      -        -     -     10    d4
  8       -       d6      -       -       s2      -        -     -     -     d3
 10       r1      -       r1      s8      -       r1       -     -     -     -
Error transitions are marked with -
There's one final caveat. You cannot eliminate a single-reduction state if there is a
code-generation action attached to the associated production because the stack will have
one fewer items on it than it should when the action is performed, so you won't be able to
access the attributes correctly. In practice, this limitation is enough of a problem that
occs doesn't use the technique. In any event, the disambiguating rules discussed in the
next section eliminate many of the single-reduction states because the productions that
cause them are no longer necessary.
There is one thing that's easy to do that does not affect the grammar at all, however,
and can significantly reduce the table sizes in a pair-compressed array. Since the error
information associated with single-reduction states is redundant (because the error is
caught with the next shift), you can replace the entire chain of pairs that represent the
single-reduction state with a single, default, reduce action that is performed regardless of
the input symbol. The UNIX yacc utility takes this idea even further by providing a
default action for every row of the table that represents the most common action in that
row. The problem here is the error recovery, which is now very difficult to do because it
is difficult to determine legal lookahead characters while doing the recovery.
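One way to picture the default-action idea is a lookup that falls back on a per-state default when no pair matches. The sketch below is hypothetical: the parallel defaults array, the NO_DEFAULT marker, and the next_with_default() name are inventions for the illustration and are not taken from occs or yacc.

```c
#include <assert.h>
#include <stddef.h>

#define NO_DEFAULT 32767         /* hypothetical marker: no default, so an error */

/* Search a pair-compressed row as before. If no pair matches (or the row
 * is empty), return the default action for the state. A single-reduction
 * state then needs no pairs at all, only a default reduce.
 */
int next_with_default(int **table, const int *defaults, int state, int symbol)
{
    int *p = table[state];
    int  i;

    assert(state >= 0);

    if (p)
        for (i = *p++; --i >= 0; p += 2)
            if (symbol == p[0])
                return p[1];

    return defaults[state];
}
```

With State 1 represented by a NULL row and a default of -6, every lookahead yields a reduction by Production 6, including lookaheads that are not legal there; the error then surfaces at the next shift, as the text describes.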
Table 5.13. LR Parser Algorithm for Single-Reduction-Minimized Tables
push( 0 );
while( Action[TOS][input] ≠ Accept )
{
    if( Action[TOS][input] = - )
        error();
    else if( Action[TOS][input] = sX )
    {
        push( X );
        advance();
    }
    else if( Action[TOS][input] = rX )
    {
        act( X );
        pop( as many items as are in the RHS of production X );
        while( (i = Goto[ uncovered TOS ][ LHS of production X ]) = dX )
        {
            act( X );
            pop( one fewer than the number of symbols in the RHS of production X );
        }
        push( i );
    }
    else if( Action[TOS][input] = dX )
    {
        act( X );
        pop( one fewer than the number of symbols in the RHS of production X );
        while( (i = Goto[ uncovered TOS ][ LHS of production X ]) = dX )
        {
            act( X );
            pop( one fewer than the number of symbols in the RHS of production X );
        }
        push( i );
    }
}
accept();
5.9 Using Ambiguous Grammars*
Even though LR(1) and LALR(1) grammars are more flexible than the SLR(1) and
LR(0), they still can't be ambiguous, because, as you would expect, an ambiguous
grammar yields an ambiguous parse. The ambiguity is reflected in the resulting state machine
as a shift/reduce conflict. Ambiguous productions are, nonetheless, useful. They tend to
make a grammar both smaller and easier to read. The two most common ambiguities are
found in Productions 2, 4, and 5 of the following grammar:
    1.  stmt  →  expr SEMI
    2.        |  IF stmt ELSE stmt
    3.        |  IF stmt
    4.  expr  →  expr PLUS expr
    5.        |  expr TIMES expr
    6.        |  ID
    (ID is an identifier.)
This grammar creates the state machine pictured in Figure 5.11, and the ambiguities
are reflected as shift/reduce conflicts in States 5, 10, and 11. The LALR(1)
lookaheads are shown in brackets next to those items that trigger reductions.
Though the foregoing grammar is not LALR(1), it is nonetheless possible to use it to
create a parser. The basic strategy examines the various shift/reduce conflicts in the
machine, and makes decisions as to whether a shift or reduce is appropriate in that
situation, basing the decision on the semantic rather than syntactic description of the
language. In practical terms, you need to know the relative precedence and the
associativity of those tokens that appear in ambiguous productions, and this information can be
used to disambiguate the tables.
Precedence and associativity information is used to resolve ambiguities in arithmetic
expressions as follows: State 10 contains the following items:
    expr  →  expr . + expr
    expr  →  expr + expr .      [; + *]
    expr  →  expr . * expr
The parser wants to reduce by expr → expr + expr
if the lookahead symbol is a semicolon, plus sign, or star; it also wants to shift on a plus
sign or star. The semicolon poses no problem, but the arithmetic operators do.
Now, consider the parse caused by the input a+b+c. The stack and input look like
this after the a, +, and b are processed:
    Stack                   Input
    expr + expr             + c ;
The parser now looks ahead and sees the plus, and it has a problem. If the plus operator
is left associative, it will want to reduce immediately (before processing the next plus
sign). A left-associative parse would continue as follows:
    Stack                   Input
    expr + expr             + c ;
    expr                    + c ;
    expr +                  c ;
    expr + ID               ;
    expr + expr             ;
    expr                    ;
    expr ;
    stmt
This way the left subexpression would be processed before going on with the input.
Right associativity is handled differently. Here the parser wants to put the whole
expression onto the stack so that it can reduce from right to left, like this:
    Stack                   Input
    expr + expr             + c ;
    expr + expr +           c ;
    expr + expr + ID        ;
    expr + expr + expr      ;
    expr + expr             ;
    expr                    ;
    expr ;
    stmt
So the decision of whether to shift or reduce here is based on the desired associativity of
the plus operator. In this case, you get left associativity by resolving the conflict in favor
of the reduce.
Figure 5.11. State Machine for an Ambiguous Grammar
State 10 has one other problem, illustrated by the input a+b*c. Here, the star is
higher precedence than the plus, so the multiplication must be done first. The conflict is
active when the following stack condition exists:
    Stack                   Input
    expr + expr             * c ;
Reducing at this point would effectively ignore the relative precedence of + and *; the
addition would be done first. Resolving in favor of the shift, however, correctly
generates the following parse:
    Stack                   Input
    expr + expr *           c ;
    expr + expr * ID        ;
    expr + expr * expr      ;
    expr + expr             ;
    expr                    ;
    expr ;
    stmt
and the multiplication is done before the addition. So, the conflict should be resolved in
favor of the shift if the lookahead operator is higher in precedence than the token closest
to the top of the stack.
Precedence and associativity help only if a production implements an arithmetic
expression, however. The sensible default for handling a shift/reduce conflict in
nonarithmetic productions is to decide in favor of the shift. This decision takes care of the
conflict in State 5, which contains the following items:
    stmt  →  IF stmt . ELSE stmt
    stmt  →  IF stmt .              [⊢ ELSE]
Here, the machine wants both to shift and reduce when the lookahead token is an ELSE.
The problem is exemplified by the following input:
    IF
        ID ;
    ELSE
        IF
            ID ;
        ELSE
            ID ;
Most programming languages require an ELSE to bind with the closest preceding IF. A
partial parse of the foregoing generates the following stack:
    Stack                   Input
    IF stmt                 ELSE IF ID ; ELSE ID ;
A shift is required in this situation, because an ELSE is present in the input. Were you
to reduce IF stmt to stmt, the ELSE would be left dangling in the input. That is, since a
stmt can't start with an ELSE, a syntax error would be created if the parser reduced.
Letting the parse continue, shifting on every conflict, you'll eventually get the
following on the stack:
    IF stmt ELSE IF stmt ELSE stmt
which can then be reduced to:
    IF stmt ELSE stmt
and then to:
    stmt
Since the rightmost statement is reduced first, the ELSE binds to the closest IF, as
required. The foregoing rules can be summarized as follows:
(1) Assign precedence and associativity information to all terminal symbols and
    precedence information to productions.⁴ If no precedence and associativity is
    assigned, a terminal symbol is assigned a precedence level of zero (very low) and is
    nonassociative. Productions are assigned the same precedence level as the
    rightmost terminal symbol in the production or zero if there are no terminal symbols on
    the right-hand side.⁵
(2) When an LALR(1) shift/reduce conflict is encountered, the precedence of the
    terminal symbol to be shifted is compared with the precedence of the production by
    which you want to reduce. If either the terminal or the production is of precedence
    zero, then resolve the conflict in favor of the shift. Otherwise, if the precedences
    are equal, resolve using the following table:
        associativity of lookahead symbol     resolve in favor of
        left                                  reduce
        right                                 shift
        nonassociative                        shift
    Otherwise, if the precedences are not equal, use the following table:
        precedence                            resolve in favor of
        lookahead symbol < production         reduce
        lookahead symbol > production         shift
4. Yacc and occs do this with the %left, %right and %nonassoc directives.
5. Yacc and occs let you override this default with a %prec TOKEN directive where TOKEN is a terminal
symbol that was declared with a previous %left, %right, or %nonassoc. The production is assigned the
same precedence level as the token in the %prec directive. Note that the %prec directive must be to the
left of the semicolon or vertical bar that terminates the production.
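The two rules and their tables reduce to a short decision function. The following is a sketch of the rules as stated in the text, not code from occs or yacc; the Assoc and Resolution types and the function name are hypothetical, and precedence levels are plain integers with zero meaning "none assigned".

```c
#include <assert.h>

typedef enum { ASSOC_NONE, ASSOC_LEFT, ASSOC_RIGHT } Assoc;
typedef enum { DO_SHIFT, DO_REDUCE } Resolution;

/* Resolve a shift/reduce conflict.
 *   tok_prec, tok_assoc  precedence and associativity of the lookahead
 *                        terminal (the symbol that would be shifted)
 *   prod_prec            precedence of the production that would be reduced
 */
Resolution resolve_shift_reduce(int tok_prec, Assoc tok_assoc, int prod_prec)
{
    assert(tok_prec >= 0 && prod_prec >= 0);

    if (tok_prec == 0 || prod_prec == 0)    /* no precedence: default to shift */
        return DO_SHIFT;

    if (tok_prec == prod_prec)              /* equal: associativity decides    */
        return (tok_assoc == ASSOC_LEFT) ? DO_REDUCE : DO_SHIFT;

    return (tok_prec < prod_prec) ? DO_REDUCE : DO_SHIFT;
}
```

Taking PLUS as left associative at level 1 and TIMES as left associative at level 2, the conflict in a+b+c resolves to a reduce (equal precedence, left associative), the one in a+b*c resolves to a shift (lookahead is higher), and an ELSE with no assigned precedence always resolves to a shift.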
There's another kind of conflict that can appear in your grammar: a reduce/reduce
conflict. Consider the following grammar that implements one rule of Kernighan and
Cherry's troff preprocessor, eqn:⁶
    1.  expr  →  expr SUB expr SUP expr
    2.        |  expr SUB expr
    3.        |  expr SUP expr
    4.        |  LB expr RB
    5.        |  CONST
Eqn typesets equations. The implemented grammar does the following:
    Input               Typeset as follows:
    X sub 1             X₁
    X sup 2             X²
    X sub 1 sup 2       X with the superscript 2 set directly above the subscript 1
The last case has to be handled specially to prevent the subscript and superscript from
getting out of line, like this:
    X₁ ²
The earlier grammar takes care of the special case by adding an extra production
(Production 1) that takes care of a case that would actually be handled by the rest of the
grammar were it not there; X sub 1 sup 2 would be parsed without difficulty if
Production 1 wasn't in the grammar. It wouldn't be recognized as a special case, however.
The grammar as presented yields a state with the following kernel items:
    expr  →  expr SUB expr SUP expr .       [⊢ SUB SUP RB]
    expr  →  expr SUP expr .                [⊢ SUB SUP RB]
    expr  →  expr . SUB expr SUP expr
    expr  →  expr . SUB expr
    expr  →  expr . SUP expr
Here, the parser can't decide which of the two productions that have dots on the far right
should be reduced. (This situation is a reduce/reduce conflict.) The sensible default is to
resolve the conflict in favor of the production that is higher in the grammar.
Note that reduce/reduce conflicts should generally be avoided by modifying the
grammar. More often than not, this type of conflict is caused by an error in the grammar,
like this:
    foo  →  TOKEN
         |  TOKEN
They can also appear when you do seemingly reasonable things like this, however:
    karamazov  →  dmitry | ivan | alyosha
    dmitry     →  NAME do_something
    ivan       →  NAME do_something_else
    alyosha    →  NAME do_yet_another_thing
This second situation is essentially the same as the earlier example. Several of
karamazov's right-hand sides begin with the same terminal symbol. Here, however, the
fact has been disguised by an intervening nonterminal. Because of the
first-production-has-higher-precedence rule, ivan and alyosha are never executed. The solution is to
restructure the grammar as follows:
    karamazov  →  NAME brother
    brother    →  dmitry | ivan | alyosha
    dmitry     →  do_something
    ivan       →  do_something_else
    alyosha    →  do_yet_another_thing
or to go a step further:
    karamazov  →  NAME brother
    brother    →  do_something
               |  do_something_else
               |  do_yet_another_thing
6. This example is taken from [Aho], pp. 252-254. It's significant that neither I nor any of the other compiler
programmers that I know could think of a real example of a valid reduce/reduce conflict other than the
current one. Reduce/reduce conflicts are almost always caused by errors in the grammar.
5.10 Implementing an LALR(1) ParserThe Occs Output File
The remainder of this chapter discusses both the occs output file and how occs itself
works; you can skip to the next chapter if you're not interested in this level of detail. In
any event, you should read Appendix E, which contains a user's manual for occs, before
continuing, because occs will be used in the next chapter.
Most of the code that comprises occs is taken directly from LLama, and was discussed
in the last chapter. The two main differences are the parser itself and the parse-table
generation routines. This section discusses the parser, using the output file created
from the occs input file in Listing 5.5. (The associated lexical analyzer is defined by the
LeX input file in Listing 5.6.)
The grammar used here is not a particularly good use of occs, because I've built
precedence and associativity information into the grammar itself, rather than using the
%left and %right directives as I did with the expression grammar in Appendix E. I've
done this because occs fudges the tables for ambiguous grammars, using the rules
described in the previous section. For the purposes of discussion, however, it's better to
work with a grammar that is a strict LALR(1) grammar. Note that I've extended the
definition of the NUM token to recognize an identifier as well as a number (and renamed
it to NUM_OR_ID as a consequence).
7. In yacc and occs, be careful of blithely using %prec to solve a problem like the foregoing without
considering the consequences. The problem may well have been caused by a deficiency in the grammar
and, though the %prec may eliminate the warning message, the problem will still exist.
Listing 5.5. expr.y: Occs Input File for an Expression Compiler
    %term  NUM_OR_ID        /* a number or identifier     */

    %left  PLUS             /* +    (lowest precedence)   */
    %left  STAR             /* *                          */
    %left  LP RP            /*      (highest precedence)  */

    %{
    #include <stdio.h>
    #include <ctype.h>
    #include <malloc.h>
    #include <tools/debug.h>
    #include <tools/stack.h>

    extern char *yytext;

    stack_dcl( Namepool, char*, 10 );           /* Stack of 10 temporary-var names */
    #define freename(x) push( Namepool, (x) )   /* Release a temporary variable    */
    #define getname()   pop ( Namepool )        /* Allocate a temporary variable   */

    typedef char *stype;                /* Value stack is stack of char pointers */
    #define YYSTYPE stype

    #define YYSHIFTACT(tos) (*tos = "")         /* Shift a null string */
    %}

    %%
    /* A small expression grammar that recognizes numbers, names, addition (+),
     * multiplication (*), and parentheses. Expressions associate left to right
     * unless parentheses force it to go otherwise. * is higher precedence than +.
     */

    s   : e ;

    e   : e PLUS t      { yycode("%s += %s\n", $1, $3); freename($3); }
        | t             /* $$ = $1 */
        ;

    t   : t STAR f      { yycode("%s *= %s\n", $1, $3); freename($3); }
        | f             /* $$ = $1 */
        ;

    f   : LP e RP       { $$ = $2; }
        | NUM_OR_ID     {
                            /* Copy operand to a temporary. Note that I'm adding an
                             * underscore to external names so that they can't con-
                             * flict with the compiler-generated temporary names
                             * (t0, t1, etc.).
                             */
                            yycode("%s = %s%s\n", $$ = getname(),
                                                  isdigit(*yytext) ? "" : "_",
                                                  yytext );
                        }
        ;
    %%
    /*----------------------------------------------------------------------*/

    char *yypstk( vptr, dptr )
    char **vptr;                /* Value-stack pointer             */
    char *dptr;                 /* Symbol-stack pointer (not used) */
    {
        /* Yypstk is used by the debugging routines. It is passed a pointer to a
         * value-stack item and should return a string representing that item. Since
         * the current value stack is a stack of string pointers, all it has to do
         * is dereference one level of indirection.
         */

        return *vptr ? *vptr : "-" ;
    }

    /*----------------------------------------------------------------------*/

    yy_init_occs()
    {
        /* Called by yyparse just before it starts parsing. Initialize the
         * temporary-variable-name stack and output declarations for the variables.
         */

        push( Namepool, "t9" ); push( Namepool, "t8" ); push( Namepool, "t7" );
        push( Namepool, "t6" ); push( Namepool, "t5" ); push( Namepool, "t4" );
        push( Namepool, "t3" ); push( Namepool, "t2" ); push( Namepool, "t1" );
        push( Namepool, "t0" );

        yycode( "public word t0, t1, t2, t3, t4;\n" );
        yycode( "public word t5, t6, t7, t8, t9;\n" );
    }

    /*----------------------------------------------------------------------*/

    main( argc, argv )
    char **argv;
    {
        yy_get_args( argc, argv );

        if( argc < 2 )
            ferr("Need file name\n");

        else if( ii_newfile(argv[1]) < 0 )
            ferr("Can't open %s\n", argv[1] );

        yyparse();
        exit( 0 );
    }
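The temporary-name pool declared in the header of Listing 5.5 can be tried out in isolation. The following is only a minimal sketch: it assumes a plain array-based stack in place of the macros from <tools/stack.h>, and the names name_pool, pool_init(), get_name(), and free_name() are hypothetical stand-ins for Namepool, the pushes in yy_init_occs(), getname(), and freename().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 10

/* Hypothetical stand-in for the Namepool stack of Listing 5.5. */
static const char *name_pool[POOL_SIZE];
static int         pool_top = 0;        /* number of names currently free */

/* Push the names in reverse order so that "t0" is handed out first,
 * mirroring the order of the push() calls in yy_init_occs().
 */
static void pool_init(void)
{
    static const char *names[POOL_SIZE] =
        { "t9", "t8", "t7", "t6", "t5", "t4", "t3", "t2", "t1", "t0" };
    int i;

    pool_top = 0;
    for (i = 0; i < POOL_SIZE; ++i)
        name_pool[pool_top++] = names[i];
}

/* Allocate a temporary: pop a name, or return NULL if none is left. */
static const char *get_name(void)
{
    return pool_top > 0 ? name_pool[--pool_top] : NULL;
}

/* Release a temporary: push its name back so that it can be reused. */
static void free_name(const char *name)
{
    assert(pool_top < POOL_SIZE);
    name_pool[pool_top++] = name;
}
```

Because the pool is a stack, the most recently released name is the next one allocated, which keeps the number of distinct temporaries in the generated code small.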
Occs generates several output files for these input files. A symbol-table dump is in
Listing 5.7 (yyout.sym), and token definitions are in yyout.h, Listing 5.8. The symbol
table shows the internal values used for the various symbols, including the production
numbers that are assigned by occs to the individual productions. These numbers are
useful for deciphering the tables and debugging. The token-definitions file is typically
#included in the LeX input file so that the analyzer knows which values to return. The
token names are taken directly from the %term and %left declarations in the occs input
file.
Listing 5.6. expr.lex: LeX Input File for an Expression Compiler
    %{
    #include "yyout.h"
    %}
    digit   [0-9]
    alpha   [a-zA-Z_]
    alnum   [0-9a-zA-Z_]
    %%

    "+"                     return PLUS;
    "*"                     return STAR;
    "("                     return LP;
    ")"                     return RP;
    {digit}+                |
    {alpha}{alnum}*         return NUM_OR_ID;
    %%
Listing 5.7. yyout.sym: Symbol Table Generated from Input File in Listing 5.5
    ---------------- Symbol table ------------------

    NONTERMINAL SYMBOLS:

    e (257)  <>
        FIRST : NUM_OR_ID LP
        1: e -> e PLUS t            PREC 1
        2: e -> t

    f (259)  <>
        FIRST : NUM_OR_ID LP
        5: f -> LP e RP             PREC 3
        6: f -> NUM_OR_ID

    s (256)  (goal symbol) <>
        FIRST : NUM_OR_ID LP
        0: s -> e

    t (258)  <>
        FIRST : NUM_OR_ID LP
        3: t -> t STAR f            PREC 2
        4: t -> f


    TERMINAL SYMBOLS:

    name            value   prec    assoc   field
    LP                4       3       l     <>
    NUM_OR_ID         1       0             <>
    PLUS              2       1       l     <>
    RP                5       3       l     <>
    STAR              3       2       l     <>
Listing 5.8. yyout.h: Token Definitions Generated from Input File in Listing 5.5
    #define _EOI_       0
    #define NUM_OR_ID   1
    #define PLUS        2
    #define STAR        3
    #define LP          4
    #define RP          5
The occs-generated parser starts in Listing 5.9. As before, listings for those parts of
the output file that are just copied from the template file are labeled occs.par. The
occs-generated parts of the output file (tables, and so forth) are in listings labeled yyout.c.
Since we are really looking at a single output file, however, line numbers carry from one
listing to the other, regardless of the name. The output starts with the file header in
Listing 5.9. Global variables likely to be used by you (such as the output streams) are
defined here. Note that <stdio.h>, <stdarg.h>, and <tools/yystack.h> are #included on
lines one to three. <stdarg.h> contains macros that implement the ANSI variable-argument
mechanism and <tools/yystack.h> contains stack-maintenance macros. These
last two files are both described in Appendix A. The remainder of the header, in Listing
5.10, is copied from the header portion of the input file.
Listing 5.9. occs.par: File Header
  1  #include <stdio.h>
  2  #include <stdarg.h>
  3  #include <tools/yystack.h>
  4
  5  FILE  *yycodeout = stdout;     /* Output stream (code). */
  6  FILE  *yybssout  = stdout;     /* Output stream (bss ). */
  7  FILE  *yydataout = stdout;     /* Output stream (data). */
  8  int   yylookahead;             /* Lookahead token.      */
  9
 10  extern char *yytext;           /* Declared by lex in lexyy.c */
 11  extern int  yylineno;
 12  extern int  yyleng;
 13
 14  extern char *ii_ptext();       /* Lookback function used by lex  */
 15  extern int  ii_plength();      /* in /src/compiler/lib/input.c.  */
 16  extern int  ii_plineno();
 17
 18  #ifdef YYDEBUG                 /* Define YYD here so that it can be used */
 19  #   define YYD(x) x            /* in the user-supplied header.           */
 20  #else
 21  #   define YYD(x)              /* empty */
 22  #endif
 23
 24  /*----------------------------------------------------------------------*/
The output file continues in Listing 5.11 with various macro definitions. As was the
case in the LeX output file, these macros are not redefined if you defined them in the
header part of the input file. (That's what all the #ifndefs are for.) Note that
printf() calls are mapped to yycode() calls on line 44 if debugging is enabled. I've
done this because the standard printf() circumvents the window-output mechanism
and messes up the screen. This way, all printf() calls in the occs input file, at least,
will behave properly. Direct output to the screen in other files is still a problem,
however. The stack macros, discussed in Appendix A and #included on line three of
Listing 5.9, are customized on line 87 of Listing 5.11. The redefinition of yystk_cls
causes all the stacks declared with subsequent yystk_dcl() invocations to be
static.
Listing 5.10. yyout.c: Code Taken from Input-Specification Header
 25  #include <stdio.h>
 26  #include <ctype.h>
 27  #include <malloc.h>
 28  #include <tools/debug.h>
 29  #include <tools/stack.h>
 30
 31  extern char *yytext;
 32
 33  stk_dcl( Namepool, char*, 10 );            /* Stack of 10 temporary-var names */
 34  #define freename(x) push( Namepool, (x) )  /* Release a temporary variable    */
 35  #define getname()   pop ( Namepool )       /* Allocate a temporary variable   */
 36
 37  typedef char *stype;               /* Value stack is stack of char pointers */
 38  #define YYSTYPE stype
 39
 40  #define YYSHIFTACT(tos) (*tos = "")        /* Shift a null string */
Listing 5.11. occs.par: Definitions
 41  #undef YYD                     /* Redefine YYD in case YYDEBUG was defined   */
 42  #ifdef YYDEBUG                 /* explicitly in the header rather than with  */
 43  #   define YYD(x) x            /* a -D on the occs command line.             */
 44  #   define printf yycode       /* Make printf() calls go to output window    */
 45  #else
 46  #   define YYD(x)              /* empty */
 47  #endif
 48
 49  #ifndef YYACCEPT
 50  #   define YYACCEPT return(0)  /* Action taken when input is accepted. */
 51  #endif
 52
 53  #ifndef YYABORT
 54  #   define YYABORT return(1)   /* Action taken when input is rejected. */
 55  #endif
 56
 57  #ifndef YYPRIVATE
 58  #   define YYPRIVATE static    /* define to a null string to make public */
 59  #endif
 60
 61  #ifndef YYMAXERR
 62  #   define YYMAXERR 25         /* Abort after this many errors */
 63  #endif
 64
 65  #ifndef YYMAXDEPTH             /* State and value stack depth */
 66  #   define YYMAXDEPTH 128
 67  #endif
 68
 69  #ifndef YYCASCADE              /* Suppress error msgs. for this many cycles */
 70  #   define YYCASCADE 5
 71  #endif
 72
 73  #ifndef YYSTYPE                /* Default value stack type */
 74  #   define YYSTYPE int
 75  #endif
 76                                 /* Default shift action: inherit $$ */
 77  #ifndef YYSHIFTACT
 78  #   define YYSHIFTACT(tos) ( (tos)[0] = yylval )
 79  #endif
 80
 81  #ifdef YYVERBOSE
 82  #   define YYV(x) x
 83  #else
 84  #   define YYV(x)
 85  #endif
 86
 87  #undef  yystk_cls              /* redefine stack macros for local */
 88  #define yystk_cls YYPRIVATE
 89
 90  /* ----------------------------------------------------------------------
 91   * #defines used in the tables. Note that the parsing algorithm assumes that
 92   * the start state is State 0. Consequently, since the start state is shifted
 93   * only once when we start up the parser, we can use 0 to signify an accept.
 94   * This is handy in practice because an accept is, by definition, a reduction
 95   * into the start state. Consequently, a YYR(0) in the parse table represents an
 96   * accepting action and the table-generation code doesn't have to treat the
 97   * accepting action any differently than a normal reduce.
 98   *
 99   * Note that if you change YY_TTYPE to something other than short, you can no
100   * longer use the -T command-line switch.
101   */
102
103  #define YY_IS_ACCEPT    0               /* Accepting action (reduce by 0) */
104  #define YY_IS_SHIFT(s)  ((s) > 0)       /* s is a shift action            */
105
106  typedef short YY_TTYPE;
107  #define YYF ((YY_TTYPE)( (unsigned short)~0 >> 1 ))
108
109  /* ----------------------------------------------------------------------
110   * Various global variables used by the parser. They're here because they can
111   * be referenced by the user-supplied actions, which follow these definitions.
112   *
113   * If -p or -a was given to OCCS, make Yy_rhslen and Yy_val (the right-hand
114   * side length and the value used for $$) public, regardless of the value of
115   * YYPRIVATE (yylval is always public). Note that occs generates extern
116   * statements for these in yyacts.c (following the definitions section).
117   */
118
119  #if !defined(YYACTION) || !defined(YYPARSER)
120  #   define YYP                 /* nothing */
121  #else
122  #   define YYP YYPRIVATE
123  #endif
124
125  YYPRIVATE int yynerrs = 0;                  /* Number of errors          */
126
127  yystk_dcl( Yystack, int, YYMAXDEPTH );      /* State stack.              */
128
129  YYSTYPE yylval;                             /* Attribute for last token. */
130  YYP YYSTYPE Yy_val;                         /* Used to hold $$.          */
131  YYP YYSTYPE Yy_vstack[ YYMAXDEPTH ];        /* Value stack. Can't use    */
132  YYP YYSTYPE *Yy_vsp;                        /* yystack.h macros because  */
133                                              /* YYSTYPE could be a struct. */
134  YYP int Yy_rhslen;                          /* Number of nonterminals on */
135                                              /* right-hand side of the    */
136                                              /* production being reduced. */
The definitions on lines 103 to 107 of Listing 5.11 are used for the tables. Shift
actions are represented as positive numbers (the number is the next state), reduce
operations are negative numbers (the absolute value of the number is the production by which
you're reducing), and zero represents an accept action (the input is complete when you
reduce by Production 0). YY_IS_SHIFT on line 104 is used to differentiate between
these. YY_TTYPE, on line 106, is the table type. You should probably change it to
short if your machine uses a 16-bit short and 32-bit int. YY_TTYPE must be
signed, and a char is usually too small because of the number of states in the machine.
YYF, on line 107 of Listing 5.11, represents failure transitions in the parse tables.
(YYF is not stored in the compressed table, but is returned by the table-decompression
subroutine, yy_next(), which I'll discuss in a moment.) It evaluates to the largest
positive short int (with two's complement numbers). Breaking the macro down:
(unsigned)~0 is an unsigned int with all its bits set. The unsigned suppresses sign
extension on the right shift of one bit, which yields a number with all but the high bit set.
The resulting quantity is cast back to a YY_TTYPE so that it will agree with other
elements of the table. Note that this macro is not particularly portable, and might have to
be changed if you change the YY_TTYPE definition on the previous line. Just be sure
that YYF has a value that can't be confused with a normal shift or reduce directive.
The final part of Listing 5.11 comprises declarations for parser-related variables that
might be accessed in one of the actions. The state and value stacks are defined here
(along with the stack pointers), as well as some house-keeping variables. Note that those
variables of class YYP are made public if -a or -p is specified to occs. (In which case,
YYACTION and YYPARSER are not both present; the definition is output by occs itself at
the top of the file, and the test is on line 119.)
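The encoding of table entries described above can be exercised with a few lines of C. This is a sketch rather than code from occs: YY_TTYPE, YY_IS_ACCEPT, YY_IS_SHIFT, and YYF are written as they appear in Listing 5.11, while the action_kind enumeration and the classify() function are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <limits.h>

typedef short YY_TTYPE;

#define YY_IS_ACCEPT    0                   /* Accepting action (reduce by 0) */
#define YY_IS_SHIFT(s)  ((s) > 0)           /* s is a shift action            */
#define YYF ((YY_TTYPE)( (unsigned short)~0 >> 1 ))     /* failure marker    */

/* Hypothetical classification of one table entry. */
enum action_kind { ACT_SHIFT, ACT_REDUCE, ACT_ACCEPT, ACT_ERROR };

/* Decode an entry. *operand receives the next state for a shift or the
 * production number for a reduce. YYF has to be tested before the shift
 * test because it is itself a positive number.
 */
static enum action_kind classify(YY_TTYPE action, int *operand)
{
    *operand = 0;

    if (action == YYF)
        return ACT_ERROR;

    if (action == YY_IS_ACCEPT)
        return ACT_ACCEPT;

    if (YY_IS_SHIFT(action)) {
        *operand = action;                  /* shift: operand is next state   */
        return ACT_SHIFT;
    }

    *operand = -action;                     /* reduce: operand is production  */
    return ACT_REDUCE;
}
```

Since YYF is the largest positive short, it could only collide with a shift to state 32767 on a machine with a 16-bit short, which is far more states than a grammar of this kind produces.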
Listing 5.12 holds the action subroutine, which executes the code-generation actions
from the occs input file. Various tables that occs generates from the input grammar are
also in this listing. The actions imbedded in the input grammar are output as case
statements in the switch in yy_act() (on lines 137 to 169 of Listing 5.12). As in LLama,
the case values are the production numbers. Each production is assigned a unique but
arbitrary number by occs; these numbers can be found in yyout.sym in Listing 5.7 on
page 384, which is generated when occs finds a -D, -s, or -S command-line switch. The
production numbers precede each production in the symbol-table output file. The start
production is always Production 0, the next one in the input file is Production 1, and so
forth.
Note that the dollar attributes have all been translated to references to the value stack
at this juncture. For example, on line 164, the line:
    t : t STAR f    { yycode("%s *= %s\n", $1, $3); freename($3); }
has generated:
    { yycode("%s *= %s\n", yyvsp[2], yyvsp[0]); freename(yyvsp[0]); }
in the action subroutine. yyvsp is the value-stack pointer, and a downward-growing
stack is used. (A push is a *--yyvsp=x; a pop is a *yyvsp++.) The situation is
Listing 5.12. yyout.c: The Action Subroutine and Tables
137  yy_act( yy_production_number, yyvsp )
138  int      yy_production_number;
139  YYSTYPE  *yyvsp;
140  {
141      /* This subroutine holds all the actions in the original input
142       * specification. It normally returns 0, but if any of your actions return a
143       * nonzero number, then the parser will halt immediately, returning that
144       * nonzero number to the calling subroutine. I've violated my usual naming
145       * conventions about local variables so that this routine can be put into a
146       * separate file by occs.
147       */
148
149      switch( yy_production_number )
150      {
151      case 1:
152          { yycode("%s += %s\n", yyvsp[2], yyvsp[0]); freename(yyvsp[0]); }
153          break;
154      case 6:
155          { yycode("%s = %s%s\n", Yy_val = getname(),
156                                  isdigit(*yytext) ? "" : "_",
157                                  yytext );
158          }
159          break;
160      case 5:
161          { Yy_val = yyvsp[1]; }
162          break;
163      case 3:
164          { yycode("%s *= %s\n", yyvsp[2], yyvsp[0]); freename(yyvsp[0]); }
165          break;
166      default: break;             /* In case there are no actions */
167      }
168      return 0;
169  }
170
171  /* ----------------------------------------------------------------------
172   * Yy_stok[] is used for debugging and error messages. It is indexed
173   * by the internal value used for a token (as used for a column index in
174   * the transition matrix) and evaluates to a string naming that token.
175   */
176
177  char *Yy_stok[] =
178  {
179      /*  0 */   "_EOI_",
180      /*  1 */   "NUM_OR_ID",
181      /*  2 */   "PLUS",
182      /*  3 */   "STAR",
183      /*  4 */   "LP",
184      /*  5 */   "RP"
185  };
186
187  /* ----------------------------------------------------------------------
188   * The Yy_action table is action part of the LALR(1) transition matrix. It's
189   * compressed and can be accessed using the yy_next() subroutine, below.
190   *
191   *               Yya000[]={   3,   5,3   ,   2,2   ,   1,1   };
192   *  state number-----+        |    | |
193   *  number of pairs in list---+    | |
194   *  input symbol (terminal)--------+ |
195   *  action---------------------------+
196   *
197   *  action = yy_next( Yy_action, cur_state, lookahead_symbol );
198   *
199   *      action <  0   -- Reduce by production n,  n == -action.
200   *      action == 0   -- Accept (ie. reduce by production 0)
201   *      action >  0   -- Shift to state n,  n == action.
202   *      action == YYF -- error
203   */
204
205  YYPRIVATE YY_TTYPE Yya000[]={ 2,  4,2  ,  1,1  };
206  YYPRIVATE YY_TTYPE Yya001[]={ 4,  5,-6 ,  3,-6 ,  2,-6 ,  0,-6 };
207  YYPRIVATE YY_TTYPE Yya003[]={ 2,  0,0  ,  2,7  };
208  YYPRIVATE YY_TTYPE Yya004[]={ 4,  5,-2 ,  2,-2 ,  0,-2 ,  3,8  };
209  YYPRIVATE YY_TTYPE Yya005[]={ 4,  5,-4 ,  3,-4 ,  2,-4 ,  0,-4 };
210  YYPRIVATE YY_TTYPE Yya006[]={ 2,  5,9  ,  2,7  };
211  YYPRIVATE YY_TTYPE Yya009[]={ 4,  5,-5 ,  3,-5 ,  2,-5 ,  0,-5 };
212  YYPRIVATE YY_TTYPE Yya010[]={ 4,  5,-1 ,  2,-1 ,  0,-1 ,  3,8  };
213  YYPRIVATE YY_TTYPE Yya011[]={ 4,  5,-3 ,  3,-3 ,  2,-3 ,  0,-3 };
214
215  YYPRIVATE YY_TTYPE *Yy_action[12] =
216  {
217      Yya000, Yya001, Yya000, Yya003, Yya004, Yya005, Yya006, Yya000, Yya000,
218      Yya009, Yya010, Yya011
219  };
220
221  /* ----------------------------------------------------------------------
222   * The Yy_goto table is goto part of the LALR(1) transition matrix. It's com-
223   * pressed and can be accessed using the yy_next() subroutine, declared below.
224   *
225   *  nonterminal = Yy_lhs[ production number by which we just reduced ]
226   *
227   *               Yyg000[]={   3,   5,3   ,   2,2   ,   1,1   };
228   *  uncovered state--+        |    | |
229   *  number of pairs in list---+    | |
230   *  nonterminal--------------------+ |
231   *  goto this state------------------+
232   *  goto_state = yy_next( Yy_goto, cur_state, nonterminal );
233   */
234
235  YYPRIVATE YY_TTYPE Yyg000[]={ 3,  3,5  ,  2,4  ,  1,3  };
236  YYPRIVATE YY_TTYPE Yyg002[]={ 3,  3,5  ,  2,4  ,  1,6  };
237  YYPRIVATE YY_TTYPE Yyg007[]={ 2,  3,5  ,  2,10 };
238  YYPRIVATE YY_TTYPE Yyg008[]={ 1,  3,11 };
239
240  YYPRIVATE YY_TTYPE *Yy_goto[12] =
241  {
242      Yyg000, NULL  , Yyg002, NULL  , NULL  , NULL  , NULL  , Yyg007, Yyg008,
243      NULL  , NULL  , NULL
244  };
245
246  /* ----------------------------------------------------------------------
247   * The Yy_lhs array is used for reductions. It is indexed by production number
248   * and holds the associated left-hand side, adjusted so that the number can be
249   * used as an index into Yy_goto.
250   */
251
252  YYPRIVATE int Yy_lhs[7] =
253  {
254      /*  0 */   0,
255      /*  1 */   1,
256      /*  2 */   1,
257      /*  3 */   2,
258      /*  4 */   2,
259      /*  5 */   3,
260      /*  6 */   3
261  };
262
263  /* ----------------------------------------------------------------------
264   * The Yy_reduce[] array is indexed by production number and holds
265   * the number of symbols on the right hand side of the production.
266   */
267
268  YYPRIVATE int Yy_reduce[7] =
269  {
270      /*  0 */   1,
271      /*  1 */   3,
272      /*  2 */   1,
273      /*  3 */   3,
274      /*  4 */   1,
275      /*  5 */   3,
276      /*  6 */   1
277  };
278  #ifdef YYDEBUG
279
280  /* ----------------------------------------------------------------------
281   * Yy_slhs[] is a debugging version of Yy_lhs[]. Indexed by production number,
282   * it evaluates to a string representing the left-hand side of the production.
283   */
284
285  YYPRIVATE char *Yy_slhs[7] =
286  {
287      /*  0 */   "s",
288      /*  1 */   "e",
289      /*  2 */   "e",
290      /*  3 */   "t",
291      /*  4 */   "t",
292      /*  5 */   "f",
293      /*  6 */   "f"
294  };
295
296  /* ----------------------------------------------------------------------
297   * Yy_srhs[] is also used for debugging. It is indexed by production number
298   * and evaluates to a string representing the right-hand side of the production.
299   */
300
301  YYPRIVATE char *Yy_srhs[7] =
302  {
303      /*  0 */   "e",
304      /*  1 */   "e PLUS t",
305      /*  2 */   "t",
306      /*  3 */   "t STAR f",
307      /*  4 */   "f",
308      /*  5 */   "LP e RP",
309      /*  6 */   "NUM_OR_ID"
310  };
311  #endif
complicated by the fact that attributes are numbered from left to right, but the rightmost
(not the leftmost) symbol is at the top of the parse stack. Consequently, the number that
is part of the dollar attribute can't be used directly as an offset from the top of stack. You
can use the size of the right-hand side to compute the correct offset, however. When a
reduction is triggered, the symbols on the parse stack exactly match those on the
right-hand side of the production. Given a production like t→t STAR f, f is at top of stack,
STAR is just under the f, and t is under that. The situation is illustrated in Figure 5.12.
yyvsp, the stack pointer, points at f, so $3 translates to yyvsp[0] in this case.
Similarly, $2 translates to yyvsp[1], and $1 translates to yyvsp[2]. The stack offset for
an attribute is the number of symbols on the right-hand side of the current production
less the number that is part of the dollar attribute. $1 would be at yyvsp[3] if the
right-hand side had four symbols in it.
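The offset rule just described (right-hand-side length less the attribute number) can be checked with a small downward-growing stack. This is an illustrative sketch only: vstack, vsp, vpush(), vpop(), dollar_offset(), and dollar() are hypothetical names, and the stack holds strings so that the symbols are easy to recognize.

```c
#include <assert.h>
#include <string.h>

#define DEPTH 8

/* A downward-growing value stack: vsp starts one past the end of the
 * array, a push is *--vsp = x, and a pop is *vsp++.
 */
static const char  *vstack[DEPTH];
static const char **vsp = vstack + DEPTH;

static void vpush(const char *x)
{
    assert(vsp > vstack);               /* room for one more item */
    *--vsp = x;
}

static const char *vpop(void)
{
    assert(vsp < vstack + DEPTH);       /* stack is not empty */
    return *vsp++;
}

/* Offset from the stack pointer of attribute $n in a production whose
 * right-hand side has rhs_len symbols.
 */
static int dollar_offset(int rhs_len, int n)
{
    return rhs_len - n;
}

/* Fetch $n for the production that is about to be reduced. */
static const char *dollar(int rhs_len, int n)
{
    return vsp[dollar_offset(rhs_len, n)];
}
```

Pushing t, STAR, and f (in that order) leaves f at vsp[0], so dollar(3, 3) yields f and dollar(3, 1) yields t, which is the translation that occs performs at compile time for actions in the rules section.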
Figure 5.12. Translating $N to Stack References
    t → t STAR f

                             in rules section    in section following second %%
    Yy_vsp -->  f      $3    Yy_vsp[0]           Yy_vsp[ Yy_rhslen-3 ]
                STAR   $2    Yy_vsp[1]           Yy_vsp[ Yy_rhslen-2 ]
                t      $1    Yy_vsp[2]           Yy_vsp[ Yy_rhslen-1 ]
Figure 5.12 also shows how attributes are handled when they are found in the third
part of the occs input file rather than imbedded in a production. The problem here is that
the number of symbols on the right-hand side is available only when occs knows which
production is being reduced (as is the case when an action in a production is being
processed). Code in the third section of the input file is isolated from the actual production,
so occs can't determine which production the attribute references. The parser solves the
problem by setting a global variable, Yy_rhslen, to the number of symbols on the
right-hand side of the production being reduced. Yy_rhslen is modified just before
each reduction. This variable can then be used at run-time to get the correct value-stack
item. Note that negative attribute numbers are also handled correctly by occs. Figure
5.13 shows the value stack just before a reduction by b→d e f in the following grammar:
    s → a b c
    b → d e f    { x = $-1; }
The $-1 in the second production references the a in the partially assembled first
production.
Listing 5.12 continues on line 177 with a token-to-string translation table. It is
indexed by token value (as found in yyout.h) and evaluates to a string naming that token.
It's useful both for debugging and for printing error messages. The other conversion
tables are used only for the debugging environment, so are #ifdefed out when YYDEBUG
is not defined. Yy_slhs[] on line 285 is indexed by production number and holds
a string representing the left-hand side of that production. Yy_srhs[] on line 301 is
similar, but it holds strings representing the right-hand sides of the productions.
8. Negative attribute numbers aren't accepted by yacc.
Figure 5.13. Translating $-N to Stack References
    s → a b c
    b → d e f

                             in rules section    in section following second %%
    Yy_vsp -->  f      $3    Yy_vsp[0]           Yy_vsp[ Yy_rhslen-3 ]
                e      $2    Yy_vsp[1]           Yy_vsp[ Yy_rhslen-2 ]
                d      $1    Yy_vsp[2]           Yy_vsp[ Yy_rhslen-1 ]
                a      $-1   Yy_vsp[3]           Yy_vsp[ Yy_rhslen- -1 ]
                                                 = Yy_vsp[ Yy_rhslen+1 ]
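As a small illustration of how a token-to-string table of this kind is typically used, the sketch below wraps the lookup in a bounds check. The table contents are those of Yy_stok[] in Listing 5.12; the token_name() helper, the N_TOKENS macro, and the "<unknown>" fallback string are assumptions made for the example and are not part of the occs output.

```c
#include <assert.h>
#include <string.h>

/* Token names, indexed by the token values defined in yyout.h. */
static const char *Yy_stok[] =
{
    /* 0 */ "_EOI_",
    /* 1 */ "NUM_OR_ID",
    /* 2 */ "PLUS",
    /* 3 */ "STAR",
    /* 4 */ "LP",
    /* 5 */ "RP"
};

#define N_TOKENS ((int)(sizeof Yy_stok / sizeof *Yy_stok))

/* Hypothetical helper: map a token value to a printable name, guarding
 * against values that are outside of the table.
 */
static const char *token_name(int tok)
{
    return (tok >= 0 && tok < N_TOKENS) ? Yy_stok[tok] : "<unknown>";
}
```

An error routine could then print something like "unexpected LP" instead of a bare token number.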
Lines 187 to 277 of Listing 5.12 hold the actual parse tables. The state machine for
our current grammar was presented earlier (in Table 5.11 on page 369) and the
compressed tables were discussed in that section as well. Single-reduction states have
not been eliminated here. I'll demonstrate how the parse tables are used with an example
parse of (1+2). The parser starts up by pushing the number of the start state,
State 0, onto the state stack. It determines the next action by using the state number at
the top of stack and the current input (lookahead) symbol. The input symbol is a left
parenthesis (an LP token), which is defined in yyout.h to have a numeric value of 4. (It's
also listed in yyout.sym.) The parser, then, looks in Yy_action[0][4]. Row 0 of the
table is represented in Yya000 (on line 205 of Listing 5.12), and the parser is interested in
the first pair: [4,2]. The 4 is the column number, and the 2 is the parser directive. Since 2 is
positive, this is a shift action: 2 is pushed onto the parse stack and the input is advanced.
Note that Yya000 also represents rows 2, 7, and 8 of the table because all four rows have
the same contents.
The next token is a number (a NUM_OR_ID token, defined as 1 in yyout.h). The
parser now looks at Yy_action[2][1] (2 is the current state, 1 the input token). Row 2
is also represented by Yya000 on line 205 of Listing 5.12, and Column 1 is
represented by the second pair in the list: [1,1]. The action here is a shift to State 1, so a
1 is pushed and the input is advanced again.
A 1 is now on the top of the stack, and the input symbol is a PLUS, which has the
value 2. Yy_action[1][2] holds a -6 (it's in the third pair in Yya001, on line 206),
which is a reduce-by-Production-6 directive (reduce by f→NUM_OR_ID). The first
thing the parser does is perform the associated action, with a yy_act(6) call. The
actual reduction is done next. Yy_reduce[6] evaluates to the number of objects on the
right-hand side of Production 6 (in this case, 1). So, one object is popped, uncovering
the previous-state number (2). Next, the goto component of the reduce operation is
performed. The parser does this with two table lookups. First, it finds out which left-hand
side is associated with Production 6 by looking it up in Yy_lhs[6] (it finds a 3 there).
It then looks up the next state in Yy_goto. The machine is in State 2, so the row array is
fetched from Yy_goto[2], which holds a pointer to Yyg002. The parser then searches
the pairs in Yyg002 for the left-hand side that it just got from Yy_lhs (the 3), and it
finds the pair [3,5]: the next state is State 5 (a 5 is pushed). The parse continues in this
manner until a reduction by Production 0 occurs.
    State stack     Input
    0               (1+2)
    0 2             1+2)
    0 2 1           +2)
    0 2             +2)
    0 2 5           +2)
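The walk-through above can be replayed mechanically. The sketch below copies the compressed tables of Listing 5.12 and the lookup logic of yy_next(), rewritten with ANSI prototypes, and adds a bare state-stack driver. The driver, parse_tokens(), is a hypothetical simplification: it keeps no value stack, performs no actions, and does no error recovery, so it is not the occs parser itself. Tokens are given as the numeric values from yyout.h, with 0 marking end of input.

```c
#include <assert.h>
#include <stddef.h>

typedef short YY_TTYPE;
#define YYF ((YY_TTYPE)( (unsigned short)~0 >> 1 ))     /* failure marker */

/* Compressed rows: { number of pairs,  symbol,value ,  symbol,value ... } */
static YY_TTYPE Yya000[] = { 2,  4,2 ,  1,1  };
static YY_TTYPE Yya001[] = { 4,  5,-6,  3,-6,  2,-6,  0,-6 };
static YY_TTYPE Yya003[] = { 2,  0,0 ,  2,7  };
static YY_TTYPE Yya004[] = { 4,  5,-2,  2,-2,  0,-2,  3,8  };
static YY_TTYPE Yya005[] = { 4,  5,-4,  3,-4,  2,-4,  0,-4 };
static YY_TTYPE Yya006[] = { 2,  5,9 ,  2,7  };
static YY_TTYPE Yya009[] = { 4,  5,-5,  3,-5,  2,-5,  0,-5 };
static YY_TTYPE Yya010[] = { 4,  5,-1,  2,-1,  0,-1,  3,8  };
static YY_TTYPE Yya011[] = { 4,  5,-3,  3,-3,  2,-3,  0,-3 };

static YY_TTYPE *Yy_action[12] = {
    Yya000, Yya001, Yya000, Yya003, Yya004, Yya005,
    Yya006, Yya000, Yya000, Yya009, Yya010, Yya011
};

static YY_TTYPE Yyg000[] = { 3,  3,5 ,  2,4 ,  1,3 };
static YY_TTYPE Yyg002[] = { 3,  3,5 ,  2,4 ,  1,6 };
static YY_TTYPE Yyg007[] = { 2,  3,5 ,  2,10 };
static YY_TTYPE Yyg008[] = { 1,  3,11 };

static YY_TTYPE *Yy_goto[12] = {
    Yyg000, NULL, Yyg002, NULL, NULL, NULL,
    NULL, Yyg007, Yyg008, NULL, NULL, NULL
};

static int Yy_lhs[7]    = { 0, 1, 1, 2, 2, 3, 3 };
static int Yy_reduce[7] = { 1, 3, 1, 3, 1, 3, 1 };

/* Search one compressed row for inp; YYF means "no such transition". */
static YY_TTYPE yy_next(YY_TTYPE **table, YY_TTYPE cur_state, int inp)
{
    YY_TTYPE *p = table[cur_state];
    int i;

    if (p)
        for (i = (int)*p++; --i >= 0; p += 2)
            if (inp == p[0])
                return p[1];
    return YYF;
}

#define MAX_DEPTH 32

/* Hypothetical driver. tok points at token values terminated by 0.
 * Returns the number of reductions done before the accept, or -1 on error.
 */
static int parse_tokens(const int *tok)
{
    YY_TTYPE stack[MAX_DEPTH];
    int sp = 0, reductions = 0;

    stack[0] = 0;                               /* push the start state */
    for (;;) {
        YY_TTYPE act = yy_next(Yy_action, stack[sp], *tok);

        if (act == YYF)
            return -1;                          /* syntax error */
        if (act == 0)
            return reductions;                  /* accept */
        if (act > 0) {                          /* shift to state act */
            if (sp + 1 >= MAX_DEPTH)
                return -1;
            stack[++sp] = act;
            ++tok;
        } else {                                /* reduce by production -act */
            int prod = -act;
            YY_TTYPE next;

            sp -= Yy_reduce[prod];              /* pop the right-hand side */
            next = yy_next(Yy_goto, stack[sp], Yy_lhs[prod]);
            if (next == YYF)
                return -1;
            stack[++sp] = next;                 /* push the goto state */
            ++reductions;
        }
    }
}
```

Feeding the driver the token values for (1+2), that is LP, NUM_OR_ID, PLUS, NUM_OR_ID, RP, and end of input, reproduces the first steps traced in the text: a shift to State 2, a shift to State 1, a reduction by Production 6, and a goto to State 5.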
Listing 5.13 contains the table-decompression routine mentioned earlier
(yy_next() on line one), and various output subroutines as well. Two versions of the
output routines are presented at the end of the listing, one for debugging and another
version for production mode. The output routines are mapped to window-output functions
if debugging mode is enabled, as in LLama. Also of interest is a third, symbol stack
(Yy_dstack) defined on line 24. This stack is the one that's displayed in the middle of
the debugging environment's stack window. It holds strings representing the symbols
that are on the parse stack.
Listing 5.13. occs.par Table-Decompression and Output Subroutines
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
YYPRI VATE YY_TTYPE y y n e x t ( t a b l e , c u r s t a t e , i n p )
YY TTYPE * * t a b l e ;
YY_TTYPE
i nt
{
c u r _ s t a t e
inp;
/
*
*
Next - st at e r out i ne f or t he compr essed t abl es. Gi ven cur r ent st at e and
i nput symbol (i np), r et ur n next st at e.
*
/
YY_TTYPE
i nt
*
P
t a b l e [ c u r s t a t e ] ;
i ;
( P )
( i (i nt)
( i n p
*P++;
P [0]
i > 0 ; p + 2 )
)
ret urn p[1];
ret urn YYF;
}
21   /*----------------------------------------------------------------------*/
22   #ifdef YYDEBUG

24   yystk_dcl( Yy_dstack, char *, YYMAXDEPTH );        /* Symbol stack */

26   yycode( fmt )
27   char *fmt;
28   {
29       va_list args;
30       va_start( args, fmt );
31       yy_output( 0, fmt, args );
32   }

34   yydata( fmt )
35   char *fmt;
36   {
37       va_list args;
38       va_start( args, fmt );
39       yy_output( 1, fmt, args );
40   }

42   yybss( fmt )
43   char *fmt;
44   {
45       va_list args;
46       va_start( args, fmt );
47       yy_output( 2, fmt, args );
48   }

50   /* yycomment() and yyerror() are defined in yydebug.c */
52   #else  /*----------------------------------------------------------------*/

54   #    define yy_nextoken()    yylex()    /* when YYDEBUG isn't defined. */
55   #    define yy_quit_debug()
56   #    define yy_init_debug()
57   #    define yy_pstack(x,y)
58   #    define yy_sym()

60   /* Use the following routines just like printf() to create output. The only
61    * differences are that yycode is sent to the stream called yycodeout, yydata
62    * goes to yydataout, and yybss goes to yybssout. All of these are initialized
63    * to stdout. It's up to you to close the streams after the parser terminates.
64    */

66   yycode( fmt )
67   char *fmt;
68   {
69       va_list args;
70       va_start( args, fmt );
71       vfprintf( yycodeout, fmt, args );
72   }

74   yydata( fmt )
75   char *fmt;
76   {
77       va_list args;
78       va_start( args, fmt );
79       vfprintf( yydataout, fmt, args );
80   }

82   yybss( fmt )
83   char *fmt;
84   {
85       va_list args;
86       va_start( args, fmt );
87       vfprintf( yybssout, fmt, args );
88   }

90   void yycomment( fmt )
91   char *fmt;
92   {
93       va_list args;
94       va_start( args, fmt );
95       vfprintf( stdout, fmt, args );
96   }

98   void yyerror( fmt, ... )
99   char *fmt;
100  {
101      extern char *yytext;
102      extern int  yylineno;
103      va_list     args;

105      va_start( args, fmt );
106      fprintf ( stderr, "ERROR (line %d near %s): ", yylineno, yytext );
107      vfprintf( stderr, fmt, args );
108      fprintf ( stderr, "\n" );
109  }
110  #endif
Shift subroutine, yy_shift().
The parser itself starts in Listing 5.14. The shift and reduce operations have been
broken out into subroutines (on lines 111 and 128) in order to simplify the code in the
parser itself. A new state is pushed onto the state stack on line 115. Garbage is pushed
onto the value stack on line 116, but the user-supplied, default shift action [in the macro
YYSHIFTACT()] is performed on the next line. This macro is passed a pointer to the top
of the value stack (after the push); the default action (defined on line 78 of Listing 5.11,
page 387) pushes the contents of yylval onto the stack, so you can push an attribute for
a terminal by modifying yylval from an accepting action in a LeX input file. (This
process is discussed in Appendix E.) The debugging stack is kept aligned with the other
stacks by pushing a string representing the shifted token on line 121. The
yy_pstack() call on the next line updates the stack window and waits for another
command if necessary. Breakpoints are also activated there.
Listing 5.14. occs.par The Parser
111  YYPRIVATE yy_shift( new_state, lookahead )     /* shift                */
112  int new_state;                                 /* push this state      */
113  int lookahead;                                 /* Current lookahead    */
114  {
115      yypush( Yy_stack, new_state );
116      --Yy_vsp;                                  /* Push onto value stack  */
117      YYSHIFTACT( Yy_vsp );                      /* Then do default action */

119  #ifdef YYDEBUG
120      yycomment( "Shift %0.16s (%d)\n", Yy_stok[ lookahead ], new_state );
121      yypush( Yy_dstack, Yy_stok[ lookahead ] );
122      yy_pstack( 0, 1 );
123  #endif
124  }
126  /*----------------------------------------------------------------------*/

128  YYPRIVATE yy_reduce( prod_num, amount )
129  int prod_num;         /* Reduce by this production                  */
130  int amount;           /* # symbols on right-hand side of prod_num   */
131  {
132      int next_state;

134      yypopn( Yy_stack, amount );      /* Pop n items off the state stack */
135      Yy_vsp += amount;                /* and the value stack.            */
136      *Yy_vsp = Yy_val;                /* Push $$ onto value stack        */

138      next_state = yy_next( Yy_goto, yystk_item(Yy_stack,0), Yy_lhs[prod_num] );

140  #ifndef YYDEBUG

142      yypush( Yy_stack, next_state );

144  #else

146      yy_break( prod_num );            /* activate production breakpoint */

148      yypopn( Yy_dstack, amount );

150      YYV( yycomment( "pop %d item%s\n", amount, amount==1 ? "" : "s" ); )
151      yy_pstack( 0, 0 );

153      yypush( Yy_stack,  next_state );
154      yypush( Yy_dstack, Yy_slhs[ prod_num ] );

156      YYV( yycomment( "push %0.16s (%d)", Yy_slhs[prod_num], next_state ); )

158      yy_pstack( 0, 1 );

160  #endif
161  }
163  /*----------------------------------------------------------------------*/

165  YYPRIVATE void yy_init_stack()              /* Initialize the stacks */
166  {
167      yystk_clear( Yy_stack );
168      yypush( Yy_stack, 0 );                       /* State stack = 0 */

170      Yy_vsp = Yy_vstack + (YYMAXDEPTH-1);         /* Value stack     */

172  #   ifdef YYDEBUG
173      yystk_clear( Yy_dstack );
174      yypush( Yy_dstack, "$" );
175      yycomment( "Shift start state\n" );
176      yy_pstack( 0, 1 );                           /* refresh stack window */
177  #   endif
178  }
180  /*----------------------------------------------------------------------*/

182  YYPRIVATE int yy_recover( tok, suppress )
183  int tok;              /* token that caused the error          */
184  int suppress;         /* No error message is printed if true  */
185  {
186      int       *old_sp  = yystk_p( Yy_stack );    /* State-stack pointer */
187      YYD( char **old_dsp = yystk_p( Yy_dstack ); )
188      YYD( char *tos; )

190      if( !suppress )
191      {
192          yyerror( "Unexpected %s\n", Yy_stok[tok] );
193          if( ++yynerrs > YYMAXERR )
194          {
195              yyerror( "Too many errors, aborting\n" );
196              return 0;
197          }
198      }

200      do {

202          while( !yystk_empty(Yy_stack)
203                 && yy_next( Yy_action, yystk_item(Yy_stack,0), tok ) == YYF )
204          {
205              yypop( Yy_stack );

207              YYD( tos = yypop( Yy_dstack ); )
208              YYD( yycomment( "Popping %s from state stack\n", tos ); )
209              YYD( yy_pstack( 0, 1 ); )
210          }

213          if( !yystk_empty(Yy_stack) )            /* Recovered successfully */
214          {
215              /* Align the value (and debug) stack to agree with the current
216               * state-stack pointer.
217               */

219              Yy_vsp = Yy_vstack + (yystk_ele(Yy_stack) - 1);

221  #ifdef YYDEBUG
222              yystk_p(Yy_dstack) = Yy_dstack + (yystk_ele(Yy_stack) - 1);
223              yycomment( "Error recovery successful\n" );
224              yy_pstack( 0, 1 );
225  #endif

227              return tok;
228          }

230          yystk_p( Yy_stack ) = old_sp;

232          YYD( yystk_p( Yy_dstack ) = old_dsp; )
233          YYD( yycomment( "Restoring state stack." ); )
234          YYD( yy_pstack( 1, 1 ); )
235          YYD( yycomment( "discarding %s\n", Yy_stok[tok] ); )

237      } while( ii_mark_prev(), tok = yy_nextoken() );

239      YYD( yycomment( "Error recovery failed\n" ); )
240      return 0;
241  }
243  /*----------------------------------------------------------------------*/

245  int yyparse()
246  {
247      /* General-purpose LALR parser. Return 0 normally or 1 if the error
248       * recovery fails. Any other value is supplied by the user as a return
249       * statement in an action.
250       */

252      int act_num ;       /* Contents of current parse table entry        */
253      int errcode ;       /* Error code returned from yy_act()            */
254      int tchar ;         /* Used to \0-terminate the lexeme              */
255      int suppress_err ;  /* Set to YYCASCADE after an error is found     */
256                          /* and decremented on each parse cycle. Error   */
257                          /* messages aren't printed if it's true.        */

259  #ifdef YYDEBUG
260      if( !yy_init_debug( Yy_stack,  &yystk_p(Yy_stack ),
261                          Yy_dstack, &yystk_p(Yy_dstack),
262                          Yy_vstack, sizeof(YYSTYPE), YYMAXDEPTH) )
263          YYABORT;
264  #endif

266      yy_init_stack();                    /* Initialize parse stack */
267      yy_init_occs( Yy_vsp );

269      yy_lookahead = yy_nextoken();       /* Get first input symbol */
270      suppress_err = 0;

272      while( 1 )
273      {
274          act_num = yy_next( Yy_action, yystk_item(Yy_stack,0), yy_lookahead );

276          if( suppress_err )
277              --suppress_err;

279          if( act_num == YYF )
280          {
281              if( !(yy_lookahead = yy_recover( yy_lookahead, suppress_err )) )
282                  YYABORT;

284              suppress_err = YYCASCADE;
285          }
286          else if( YY_IS_SHIFT(act_num) )         /* Simple shift action */
287          {
288              /* Note that yytext and yyleng are undefined at this point because
289               * they were modified in the else clause, below. You must use
290               * ii_text(), etc., to put them into a reasonable condition if
291               * you expect to access them in a YY_SHIFT action.
292               */

294              yy_shift( act_num, yy_lookahead );

296              ii_mark_prev();
297              yy_lookahead = yy_nextoken();
298          }
299          else
300          {
301              /* Do a reduction by -act_num. The activity at 1, below, gives YACC
302               * compatibility. It's just making the current lexeme available in
303               * yytext and '\0' terminating the lexeme. The '\0' is removed at 2.
304               * The problem is that you have to read the next lookahead symbol
305               * before you can reduce by the production that had the previous
306               * symbol at its far right. Note that, since Production 0 has the
307               * goal symbol on its left-hand side, a reduce by 0 is an accept
308               * action. Also note that ii_ptext()[ii_plength()] is used at (2)
309               * rather than yytext[yyleng] because the user might have modified
310               * yytext or yyleng in an action.
311               *
312               * Rather than pushing junk as the $$=$1 action on an epsilon
313               * production, the old tos item is duplicated in this situation.
314               */

316              act_num   = -act_num ;
317              Yy_rhslen = Yy_reduce[ act_num ];
318              Yy_val    = Yy_vsp[ Yy_rhslen ? Yy_rhslen-1 : 0 ];   /* $$ = $1 */

320              yylineno       = ii_plineno();      /* (1) */
321              yytext         = ii_ptext();
322              tchar          = yytext[ yyleng = ii_plength() ];
323              yytext[yyleng] = '\0';

325              YYD( yycomment( "Reduce by (%d) %s->%s\n", act_num, \
326                              Yy_slhs[act_num], Yy_srhs[act_num] ); )

328              if( errcode = yy_act( act_num, Yy_vsp ) )
329                  return errcode;

331              ii_ptext()[ ii_plength() ] = tchar;                 /* (2) */

333              if( act_num == YY_IS_ACCEPT )
334                  break;
335              else
336                  yy_reduce( act_num, Yy_rhslen );
337          }
338      }
339      YYD( yycomment( "Accept\n" ); )
340      YYD( yy_quit_debug(); )

342      YYACCEPT;
343  }
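The reduce step used by the parser can be isolated in a few lines. The following is a simplified sketch, not the occs code: it uses plain, upward-growing arrays instead of the yystack macros and a downward-growing value stack, and the names Parser, reduce, goto_fn, and demo_goto are all hypothetical.

```c
#include <assert.h>

#define MAXDEPTH 64

typedef struct {
    int states[MAXDEPTH];     /* state stack                    */
    int values[MAXDEPTH];     /* parallel attribute stack       */
    int depth;                /* number of items on both stacks */
} Parser;

/* Hypothetical goto function, used only to exercise the sketch. */
static int demo_goto(int state, int lhs)
{
    return state * 10 + lhs;
}

/* Reduce by a production that has rhs_len symbols on its right-hand side.
 * lhs_val is $$, already computed by the action code. The new state is the
 * goto transition on lhs out of the state uncovered by the pop.
 */
static void reduce(Parser *p, int rhs_len, int lhs_val,
                   int (*goto_fn)(int state, int lhs), int lhs)
{
    int uncovered;

    assert(rhs_len >= 0 && p->depth > rhs_len);   /* start state must remain */
    p->depth -= rhs_len;                          /* pop the right-hand side */
    uncovered = p->states[p->depth - 1];

    assert(p->depth < MAXDEPTH);
    p->states[p->depth] = goto_fn(uncovered, lhs);   /* push goto state */
    p->values[p->depth] = lhs_val;                   /* push $$         */
    p->depth++;
}
```

Note that an epsilon production (rhs_len of zero) pops nothing and simply pushes one new item, which matches the behavior described for yy_reduce().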
Reduce subroutine, yy_reduce().
Reductions are done in yy_reduce(), on line 128 of Listing 5.14. The subroutine
pops as many stack items as are on the right-hand side of the production by which the
parser's reducing on lines 134 to 136; the goto state is computed on line 138, and is
pushed on line 142 (153 if debugging is enabled). Nothing is popped in the case of an ε
production because amount is zero in this situation. Yy_val (which holds $$) is
initialized to $1 before yy_reduce() is called.9 Similarly, the action code (in which Yy_val
could be modified via $$) is executed after the initialization but before the
yy_reduce() call. The code on lines 144 to 158 of Listing 5.14 is activated in
debugging mode. The actions that I just described are performed here too, but they're done in
several steps so that breakpoints and screen updates happen in reasonable places.
Stack initialization.
The stacks are all initialized in yy_init_stack() on line 165. This initialization
must be done at run time because the parser might be called several times within a
program (if, for example, it's being used to parse expressions in some large program rather
than for a complete compiler). Initially, 0 is pushed onto the state stack, garbage onto
the value stack, and the string "$" onto the debugging symbol stack.
Error recovery: yy_recover().
The next issue is error recovery, which is handled by yy_recover() on line 182 of
Listing 5.14. This routine is passed the token that caused the error. It prints an error
message if its second argument is false, and tries to recover from the error by
manipulating the stack. The panic-mode error-recovery technique described earlier is used:
(1) Pop items off the stack until you enter a state that has a legitimate action on the
    current token (for which there is a nonerror transition on the current token). If you
    find such a state, you have recovered. Return the input token.
(2) Otherwise, if there are no such states on the stack, advance the input, restore the
    stack to its original condition (that is, to the condition it was in when the error
    occurred) and go to (1).
(3) If you get to the end of input, you can't recover from the error. Return 0 in this
    case.
The algorithm is represented in pseudocode in Table 5.14.

Table 5.14. Bottom-Up, Error-Recovery Algorithm
recover( tok )
{
    remember the_real_stack_pointer ;
    do {
        sp = the_real_stack_pointer ;
        while( stack not empty && no_legal_transition( sp, tok ) )
            --sp ;
        if( the stack is not empty )
        {
            the_real_stack_pointer = sp ;
            return tok ;
        }
    } while( (tok = advance()) != end_of_input );
    return 0 ;
}

The parser itself: yyparse().
The actual parser, yyparse(), begins on line 245 of Listing 5.14. It is a
straightforward implementation of the simplified, bottom-up parse algorithm in Listing 5.4 on page
373. The remainder of the output file is the third portion of the input file from Listing 5.5
on page 383. It is not reproduced here.

9. Yacc hackers: Don't confuse this variable with yylval, discussed earlier. Yacc uses yylval both for $$
   and for the default shift value; occs uses two different variables.
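The panic-mode algorithm can also be sketched as compilable C. This is an illustration under stated assumptions, not the occs routine: the stack is a plain upward-growing array, and the callback types, NO_TRANSITION, demo_action(), demo_advance(), and demo_input are hypothetical names introduced only for the example.

```c
#include <assert.h>

#define NO_TRANSITION (-1)

/* Hypothetical callbacks supplied by the caller:
 *   action(state, tok)  returns NO_TRANSITION if state has no move on tok
 *   advance()           returns the next token, or 0 at end of input
 */
typedef int (*action_fn)(int state, int tok);
typedef int (*advance_fn)(void);

/* Demo helpers: only State 1 has a legal move, and only on ';'. */
static const char *demo_input = "+x;";
static int demo_action(int state, int tok)
{
    return (state == 1 && tok == ';') ? 9 : NO_TRANSITION;
}
static int demo_advance(void)
{
    return *demo_input ? *demo_input++ : 0;
}

/* Panic-mode recovery. On success, *depth is trimmed so that the top state
 * has a legal move on the returned token. On failure, *depth is unchanged
 * and 0 is returned.
 */
static int recover(const int *stack, int *depth, int tok,
                   action_fn action, advance_fn advance)
{
    int sp;

    assert(stack && depth && *depth > 0);
    do {
        for (sp = *depth; sp > 0; --sp)        /* restart from the real top */
            if (action(stack[sp - 1], tok) != NO_TRANSITION)
            {
                *depth = sp;                   /* discard the states above  */
                return tok;
            }
    } while ((tok = advance()) != 0);          /* discard a token and retry */

    return 0;
}
```

Because the scan index sp is separate from *depth, "restoring the stack" after an unsuccessful scan costs nothing here; the occs routine gets the same effect by saving and reloading the stack pointer.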
5.11 Implementing an LALR(1) Parser Generator: Occs Internals
This section discusses the source code for occs, a yacc-like program that translates an
augmented, attributed grammar into the C source code for a parser. If you still haven't
done it, you must read Appendix E, which is a user's manual for occs, before continuing.
(I really mean it this time.)
5.11.1 Modifying the Symbol Table for LALR(1) Grammars
Most of the important differences between LLama and occs are in the do_file()
subroutine (Listing 4.36, page 327, line 272). The OX() macros conditionally compile
the code that is unique to occs. [LL() is used for LLama-specific code.] There are
other trivial changes scattered throughout the program, but the important ones are in
do_file().
Modifying the grammar for occs: do_patch().
The first order of business is modifying the grammar somewhat. LLama's top-down
parser can handle actions imbedded in the middle of a production because it just puts
them onto the parse stack like any other symbol. The bottom-up parser used by occs has
to do the actions as part of a reduction, however. Consequently, it has to handle actions
differently. The patch() subroutine (on line 29 of Listing 5.15) modifies the grammar
to be suitable for occs, and at the same time removes all actions from the PRODUCTIONs
themselves. Actions are output as cases in the switch statement in the yy_act()
subroutine (discussed earlier on page 388). The output case statements reference the
production number with which the action is associated. The action code itself can safely be
discarded once it is output because occs no longer needs it.
Listing 5.15. yypatch.c Modify Grammar for Use in Occs
6    #include <ctype.h>
7    #include <malloc.h>
8    #include <tools/debug.h>
9    #include <tools/hash.h>
10   #include <tools/compiler.h>
11   #include "parser.h"

13   /*----------------------------------------------------------------------*/

15   void patch          P(( void ));                         /* public */
16   void dopatch        P(( SYMBOL *sym ));                   /* local  */
17   void print_one_case P(( int case_val, unsigned char *action, \
18                           int rhs_size, int lineno, struct _prod_ *prod ));

20   /*----------------------------------------------------------------------*/

22   PRIVATE int Last_real_nonterm;  /* This is the number of the last      */
23                                   /* nonterminal to appear in the input  */
24                                   /* grammar [as compared to the ones    */
25                                   /* that patch() creates].              */

27   /*----------------------------------------------------------------------*/

29   PUBLIC void patch()
30   {
31       /* This subroutine does several things:
32        *
33        *    * It modifies the symbol table as described in the text.
34        *    * It prints the action subroutine and deletes the memory associated
35        *      with the actions.
36        *
37        * This is not a particularly good subroutine from a structured programming
38        * perspective because it does two very different things at the same time.
39        * You save a lot of code by combining operations, however.
40        */

42       void dopatch();

44       static char *top[] =
45       {
46           "",
47           "yy_act( yypnum, yyvsp )",
48           "int     yypnum;        /* production number   */",
49           "YYSTYPE *yyvsp;        /* value-stack pointer */",
50           "{",
51           "    /* This subroutine holds all the actions in the original input",
52           "     * specification. It normally returns 0, but if any of your",
53           "     * actions return a non-zero number, then the parser halts",
54           "     * immediately, returning that nonzero number to the calling",
55           "     * subroutine.",
56           "     */",
57           "",
58           "    switch( yypnum )",
59           "    {",
60           NULL
61       };

63       static char *bot[] =
64       {
65           "",
66           "#ifdef YYDEBUG",
67           "        default: yycomment(\"Production %d: no action\\n\", yypnum);",
68           "                 break;",
69           "#endif",
70           "    }",
71           "",
72           "    return 0;",
73           "}",
74           NULL
75       };

77       Last_real_nonterm = Cur_nonterm;

79       if( Make_actions )
80           printv( Output, top );

82       ptab( Symtab, dopatch, NULL, 0 );

84       if( Make_actions )
85           printv( Output, bot );
86   }
88   /*----------------------------------------------------------------------*/

90   PRIVATE void dopatch( sym )
91   SYMBOL *sym;
92   {
93       PRODUCTION *prod;       /* Current right-hand side of sym       */
94       SYMBOL     **p2;        /* General-purpose pointer              */
95       SYMBOL     **pp;        /* Pointer to one symbol on rhs         */
96       SYMBOL     *cur;        /* Current element of right-hand side   */

98       if( !ISNONTERM(sym) || sym->val > Last_real_nonterm )
99       {
100          /* If the current symbol isn't a nonterminal, or if it is a nonterminal
101           * that used to be an action (one that we've transformed), ignore it.
102           */
103          return;
104      }

106      for( prod = sym->productions; prod ; prod = prod->next )
107      {
108          if( prod->rhs_len == 0 )
109              continue;

111          pp  = prod->rhs + (prod->rhs_len - 1);
112          cur = *pp;

114          if( ISACT(cur) )                        /* Check rightmost symbol */
115          {
116              print_one_case( prod->num, cur->string, --(prod->rhs_len),
117                                                      cur->lineno, prod );

119              delsym ( (HASH_TAB*) Symtab, (BUCKET*) cur );
120              free   ( cur->string );
121              freesym( cur );
122              *pp-- = NULL;
123          }
124                          /* cur is no longer valid because of the --pp above. */
125                          /* Count the number of nonactions in the right-hand  */
126                          /* side.                                             */

128          for(; pp >= prod->rhs; --pp )
129          {
130              cur = *pp;

132              if( !ISACT(cur) )
133                  continue;

135              if( Cur_nonterm >= MAXNONTERM )
136                  error(1, "Too many nonterminals & actions (%d max)\n", MAXNONTERM);
137              else
138              {
139                  /* Transform the action into a nonterminal. */

141                  Terms[ cur->val = ++Cur_nonterm ] = cur ;

143                  cur->productions = (PRODUCTION*) malloc( sizeof(PRODUCTION) );
144                  if( !cur->productions )
145                      error(1, "INTERNAL [dopatch]: Out of memory\n");

147                  print_one_case( Num_productions,  /* Case value to use.        */
148                                  cur->string,      /* Source code.              */
149                                  pp - prod->rhs,   /* # symbols to left of act. */
150                                  cur->lineno,      /* Input line # of code.     */
151                                  prod
152                                );

154                  /* Once the case is printed, the string argument can be freed. */

156                  free( cur->string );
157                  cur->string               = NULL;
158                  cur->productions->num     = Num_productions++ ;
159                  cur->productions->lhs     = cur ;
160                  cur->productions->rhs_len = 0;
161                  cur->productions->rhs[0]  = NULL;
162                  cur->productions->next    = NULL;
163                  cur->productions->prec    = 0;

165                  /* Since the new production goes to epsilon and nothing else,
166                   * FIRST(new) == { epsilon }. Don't bother to refigure the
167                   * follow sets because they won't be used in the LALR(1) state-
168                   * machine routines [If you really want them, call follow()
169                   * again.]
170                   */

172                  cur->first = newset();
173                  ADD( cur->first, EPSILON );
174              }
175          }
176      }
177  }
179  PRIVATE void print_one_case( case_val, action, rhs_size, lineno, prod )
180  int           case_val;  /* Numeric value attached to case itself. */
181  unsigned char *action;   /* Source Code to execute in case.        */
182  int           rhs_size;  /* Number of symbols on right-hand side.  */
183  int           lineno;    /* input line number (for #lines).        */
184  PRODUCTION    *prod;     /* Pointer to right-hand side.            */
185  {
186      /* Print out one action as a case statement. All $-specifiers are mapped
187       * to references to the value stack: $$ becomes Yy_vsp[0], $1 becomes
188       * Yy_vsp[-1], etc. The rhs_size argument is used for this purpose.
189       * [see do_dollar() in yydollar.c for details].
190       */

192      int         num, i;
193      char        *do_dollar();        /* source found in yydollar.c  */
194      extern char *production_str();   /* source found in acts.c      */
195      char        fname[40], *fp;      /* place to assemble $<fname>1 */

197      if( !Make_actions )
198          return;

200      output( "\n    case %d: /* %s */\n\n\t", case_val, production_str(prod) );

202      if( !No_lines )
203          output( "#line %d \"%s\"\n\t", lineno, Input_file_name );

205      while( *action )
206      {
207          if( *action != '$' )
208              output( "%c", *action++ );
209          else
210          {
211              /* Skip the attribute reference. The if statement handles $$, the
212               * else clause handles the two forms: $N and $-N, where N is a
213               * decimal number. When we hit the do_dollar call (in the output()
214               * call), "num" holds the number associated with N, or DOLLAR_DOLLAR
215               * in the case of $$.
216               */

218              if( *++action != '<' )
219                  *fname = '\0';
220              else
221              {
222                  ++action;                       /* skip the < */
223                  fp = fname;

225                  for( i = sizeof(fname); --i > 0 && *action && *action != '>'; )
226                      *fp++ = *action++;

228                  *fp = '\0';
229                  if( *action == '>' )
230                      ++action;
231              }

233              if( *action == '$' )
234              {
235                  num = DOLLAR_DOLLAR;
236                  ++action;
237              }
238              else
239              {
240                  num = atoi( action );
241                  if( *action == '-' )
242                      ++action ;
243                  while( isdigit(*action) )
244                      ++action ;
245              }

247              output( "%s", do_dollar( num, rhs_size, lineno, prod, fname ) );
248          }
249      }
250      output( "\n        break;\n" );
251  }
Imbedded actions.
Those actions that are already at the far right of a production are just output with the
production number as the case value (by the if statement in do_patch() on lines 114
to 123 of Listing 5.15). The action is effectively removed from the production with the
*pp-- = NULL on line 122. The for loop that starts on line 128 looks for imbedded
actions, and, if any are found, modifies the grammar accordingly. Productions of the
form:
    s   : symbols { action(); } symbols ;
are translated into:
    s   : symbols 001 symbols ;
    001 : { action(); } ;
The new nonterminal (named 001, above) is created on lines 141 to 145 (and initialized
on lines 156 to 173) by transforming the existing SYMBOL structure (the one that
represents the action) into a nonterminal that has an empty right-hand side. The new
production number is assigned on line 141. Note that the production in which the action
was imbedded doesn't need to be modified. The SYMBOL pointer that used to reference
the action now references the nonterminal, but it's the same pointer in both situations.
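The in-place transformation can be illustrated with a much smaller data structure than the one occs uses. In this sketch the Symbol type, its fields, and the function patch_rhs() are all hypothetical; it also assumes that any action at the far right has already been removed, so every action it meets is an imbedded one.

```c
#include <assert.h>
#include <stddef.h>

enum kind { TERM, NONTERM, ACTION };

typedef struct symbol {
    enum kind   kind;
    int         val;        /* nonterminal number once converted         */
    const char *code;       /* action source text, NULL otherwise        */
    int         prod_num;   /* number of the symbol's epsilon production */
} Symbol;

/* Convert every action in rhs[0..len-1] into a nonterminal that has a
 * single, empty production. The rhs array is not modified: its pointers
 * still reference the same Symbol, which has simply changed kind.
 * Returns the number of symbols converted.
 */
static int patch_rhs(Symbol **rhs, int len,
                     int *cur_nonterm, int *num_productions)
{
    int i, converted = 0;

    assert(rhs && cur_nonterm && num_productions);

    for (i = len; --i >= 0; )                 /* scan right to left */
    {
        Symbol *cur = rhs[i];

        if (cur->kind != ACTION)
            continue;

        cur->kind     = NONTERM;
        cur->val      = ++*cur_nonterm;       /* next nonterminal number */
        cur->prod_num = (*num_productions)++; /* next production number  */
        cur->code     = NULL;                 /* real code prints the case first */
        ++converted;
    }
    return converted;
}
```

The point of the design is visible in the fact that rhs is never written: the enclosing production needs no patching because the symbol it points at changes identity in place.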
Dollar-attribute translation: do_dollar().
The print_one_case() routine, which prints the code associated with the action,
starts on line 179 of Listing 5.15. Dollar attributes ($$, $1, $-1, and so forth) are
handled on lines 209 to 248. The actual work is done by do_dollar(), in Listing 5.16.
It is passed the attribute number (the 1 in $1), the number of symbols on the right-hand side
of the current production, the input line number (for error messages) and a pointer to the
PRODUCTION. It evaluates to a string holding the code necessary to reference the
desired attribute (using the translations discussed earlier on page 388).
%union fields.
Fields that are attached to a %union are also handled here (on lines 25 to 36 and 57
to 75 of Listing 5.16). The code on lines 25 to 36 handles any fields associated with $$,
the names of which are found in *prod->lhs->field. The code on lines 57 to 75
handles other attributes. Since num is the N in $N, (prod->rhs)[num-1]->field
gets the field associated with the specified element of the right-hand side. Note that
fields can't be used with negative attribute numbers (because occs has no way of
knowing what right-hand side corresponds to that field). A warning is printed in this case.
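The core of the translation is a single subtraction, which the following sketch isolates. It follows the arithmetic in Listing 5.16 for $$ and for positive N only; the function name dollar_ref(), the caller-supplied buffer, the use of snprintf(), and the value chosen for DOLLAR_DOLLAR are assumptions of the example, and negative attributes, tail code, and default fields are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DOLLAR_DOLLAR (-32768)     /* assumed marker value for $$ */

/* Write into buf the C expression that references attribute num of a
 * production with rhs_size symbols on its right-hand side, followed by
 * ".field" if field is not empty. Returns buf, or NULL if num is past
 * the end of the right-hand side.
 */
static char *dollar_ref(char *buf, size_t size, int num, int rhs_size,
                        const char *field)
{
    int len;

    assert(buf && size > 0 && field);

    if (num == DOLLAR_DOLLAR)
        len = snprintf(buf, size, "Yy_val");
    else
    {
        if (rhs_size - num < 0)
            return NULL;                        /* illegal $N */
        len = snprintf(buf, size, "yyvsp[%d]", rhs_size - num);
    }

    if (*field && len > 0 && (size_t)len < size)
        snprintf(buf + len, size - (size_t)len, ".%s", field);

    return buf;
}
```

With three symbols on the right-hand side, the rightmost attribute ($3) is at offset 0 from the value-stack pointer and $1 is at offset 2, because the most recently pushed attribute is at the top of the stack.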
Listing 5.16. yydollar.c Process $ Attributes
1    #include <stdio.h>
2    #include <tools/debug.h>
4    #include <tools/hash.h>
5    #include "parser.h"

7    PUBLIC char *do_dollar( num, rhs_size, lineno, prod, fname )
8    int        num;       /* The N in $N, DOLLAR_DOLLAR for $$ (DOLLAR_DOLLAR) */
9                          /* is defined in parser.h, discussed in Chapter Four */
10   int        rhs_size;  /* Number of symbols on right-hand side, 0 for tail  */
11   int        lineno;    /* Input line number for error messages              */
12   PRODUCTION *prod;     /* Only used if rhs_size is >= 0                     */
13   char       *fname;
14   {
15       static char buf[ 128 ];
16       int         i, len ;

18       if( num == DOLLAR_DOLLAR )                          /* Do $$ */
19       {
20           strcpy( buf, "Yy_val" );

22           if( *fname )                                    /* $<name>N */
23               sprintf( buf+6, ".%s", fname );

25           else if( fields_active() )
26           {
27               if( *prod->lhs->field )
28                   sprintf( buf+6, ".%s", prod->lhs->field );
29               else
30               {
31                   error( WARNING, "Line %d: No <field> assigned to $$, "
32                                   "using default int field\n", lineno );

34                   sprintf( buf+6, ".%s", DEF_FIELD );
35               }
36           }
37       }
38       else
39       {
40           if( num < 0 )
41               ++num;

43           if( rhs_size < 0 )                              /* $N is in tail */
44               sprintf( buf, "Yy_vsp[ Yy_rhslen-%d ]", num );
45           else
46           {
47               if( (i = rhs_size - num) < 0 )
48                   error( WARNING, "Line %d: Illegal $%d in production\n",
49                                                              lineno, num );
50               else
51               {
52                   len = sprintf( buf, "yyvsp[%d]", i );

54                   if( *fname )                            /* $<name>N */
55                       sprintf( buf + len, ".%s", fname );

57                   else if( fields_active() )
58                   {
59                       if( num <= 0 )
60                       {
61                           error( NONFATAL, "Can't use %%union field with negative"
62                                            " attributes. Use $<field>-N\n" );
63                       }
64                       else if( *(prod->rhs)[num-1]->field )
65                       {
66                           sprintf( buf + len, ".%s", (prod->rhs)[num-1]->field );
67                       }
68                       else
69                       {
70                           error( WARNING, "Line %d: No <field> assigned to $%d, "
71                                           "using default int field\n",
72                                           lineno, num );
73                           sprintf( buf + len, ".%s", DEF_FIELD );
74                       }
75                   }
76               }
77           }
78       }

80       return buf;
81   }
5.12 Parser-File Generation
The routines in Listings 5.17 and 5.18 handle the occs parser-file generation. They
are much like the LLama routines with the same names. There are two possible
template files, however. The one that we've been looking at (occs.par) is used in most
situations. If -a is specified on the command line, however, the file occs-act.par (in Listing
5.19) is used instead. All that's needed here are external declarations that give us access
to variables declared in the associated parser file (generated with a -p command-line
switch).
Listing 5.17. yycode.c Controller Routine for Table Generation
void tables()
{
    make_yy_stok();         /* in stok.c    */
    make_token_file();      /* in stok.c    */
    make_parse_tables();    /* in yystate.c */
}
5.13 Generating LALR(1) Parse Tables
The only part of occs we've yet to explore is the table-generation subroutines, all
concentrated into a single (somewhat large) module called yystate.c. The routines create
an LR(1) parse table, but before creating a new LR(1) state, the code looks for an
existing state with the same LR(0) kernel items as the new one. If such a state exists,
lookaheads are added to the existing state rather than creating a new one.
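The merge-by-kernel strategy can be sketched independently of the occs data structures. Everything in the following example is an assumption made for illustration: the Core and State types, the fixed-size arrays, the representation of a lookahead set as one bit per terminal, and the changed flag (which a real generator would use to decide whether lookaheads must be propagated further).

```c
#include <assert.h>

#define MAXITEMS  8
#define MAXSTATES 32

typedef struct { int prod, dot; } Core;      /* LR(0) part of an item */

typedef struct {
    int      nitems;
    Core     core[MAXITEMS];                 /* kernel items, canonical order */
    unsigned lookaheads[MAXITEMS];           /* one bit per terminal          */
} State;

static State States[MAXSTATES];
static int   Nstates;

/* Two states match if their LR(0) kernels are identical; lookaheads are
 * deliberately ignored in the comparison.
 */
static int same_kernel(const State *a, const State *b)
{
    int i;

    if (a->nitems != b->nitems)
        return 0;
    for (i = 0; i < a->nitems; ++i)
        if (a->core[i].prod != b->core[i].prod ||
            a->core[i].dot  != b->core[i].dot)
            return 0;
    return 1;
}

/* Add a candidate state and return its state number. If a state with the
 * same kernel exists, the candidate's lookaheads are merged into it.
 * *changed is set when a state was created or a lookahead set grew.
 */
static int add_state(const State *cand, int *changed)
{
    int s, i;

    assert(cand && changed && cand->nitems <= MAXITEMS);
    *changed = 0;

    for (s = 0; s < Nstates; ++s)
        if (same_kernel(&States[s], cand))
        {
            for (i = 0; i < cand->nitems; ++i)
            {
                unsigned merged = States[s].lookaheads[i] | cand->lookaheads[i];

                if (merged != States[s].lookaheads[i])
                {
                    States[s].lookaheads[i] = merged;
                    *changed = 1;
                }
            }
            return s;
        }

    assert(Nstates < MAXSTATES);
    States[Nstates] = *cand;                 /* genuinely new kernel */
    *changed = 1;
    return Nstates++;
}
```

The linear search over existing states keeps the sketch short; a production-quality generator would more likely index the states by kernel.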
Listing 5.18. yydriver.c Routines to Create Occs Output File
#include <tools/hash.h>
#include <tools/compiler.h>

/*----------------------------------------------------------------------*/

void file_header P(( void ));                            /* public */
void code_header P(( void ));
void driver      P(( void ));

/*----------------------------------------------------------------------*/

PRIVATE FILE *Driver_file = stderr ;

/*----------------------------------------------------------------------
 * Routines in this file are occs specific. There's a different version of all
 * these routines in lldriver.c. They MUST be called in the following order:
 *          file_header()
 *          code_header()
 *          driver()
 *----------------------------------------------------------------------
 */

PUBLIC void file_header()
{
    /* This header is printed at the top of the output file, before the
     * definitions section is processed. Various #defines that you might want
     * to modify are put here.
     */

    output( "#include \"%s\"\n\n", TOKEN_FILE );

    if( Public )
        output( "#define PRIVATE\n" );

    if( Debug )
        output( "#define YYDEBUG\n" );

    if( Make_actions )
        output( "#define YYACTION\n" );

    if( Make_parser )
        output( "#define YYPARSER\n" );

    if( !( Driver_file = driver_1( Output, !No_lines, Template ) ) )
        error( NONFATAL, "%s not found output file won't compile\n", Template );
}

/*----------------------------------------------------------------------*/

PUBLIC void code_header()
{
    /* This stuff is output after the definitions section is processed, but
     * before any tables or the driver is processed.
     */

    driver_2( Output, !No_lines );
}

/*----------------------------------------------------------------------*/

PUBLIC void driver()
{
    /* Print out the actual parser by copying the template file to the
     * output file.
     */

    if( Make_parser )
        driver_2( Output, !No_lines );

    fclose( Driver_file );
}
Listing 5.19. occs-act.par File Header for -a Output
1   #ifdef YYDEBUG
2   #    define YYD(x) x
3   #else
4   #    define YYD(x) /* empty */
5   #endif
6   ^L  /* User-supplied code from definitions section goes here */
7   #ifndef YYSTYPE                 /* Default value stack type */
8   #    define YYSTYPE int
9   #endif
10
11  #undef YYD                      /* Redefine YYD in case YYDEBUG was defined  */
12  #ifdef YYDEBUG                  /* explicitly in the header rather than with */
13  #    define YYD(x) x            /* a -D on the occs command line.            */
14  #    define printf yycode       /* Make printf() calls go to output window.  */
15  #else
16  #    define YYD(x) /* empty */
17  #endif
18
19  extern void yycode    ();
20  extern void yydata    ();
21  extern void yybss     ();
22  extern void yycomment ();
23
24  extern YYSTYPE *Yy_vsp;         /* Value-stack pointer                       */
25  extern YYSTYPE Yy_val;          /* Must hold $$ after act is performed       */
26  extern int     Yy_rhslen;       /* number of symbols on RHS of current production */
Representing LR(1) items: ITEM.
Though this process is straightforward, a lot of code and several data structures have
to work in concert to implement it. These data structures start on line 25 of Listing 5.20
with an ITEM declaration, which represents an LR(1) item. There is a certain amount of
redundancy here in order to speed up the table-generation process. For example, the
right_of_dot field points to the symbol to the right of the dot. (Remember, the
PRODUCTION itself is stored as an array of SYMBOL pointers. The field holds NULL if the
dot is at the far right of the production.) Occs could extrapolate this information from
the PRODUCTION and dot position (the offset to which is in dot_posn) every time it used
the ITEM, but it's best, for speed reasons, to do the extrapolation only once, since the dot
position does not change within a given ITEM. Similarly, the production number could
be derived from the PRODUCTION structure, but I've copied it here to avoid an extra
level of indirection every time the production number is used. The RIGHT_OF_DOT
macro on line 35 of Listing 5.20 gets the numeric, tokenized value of the SYMBOL to the
right of the dot. It evaluates to zero (which isn't used for any of the input symbols) if the
dot is at the far right of the production.
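The caching idea can be tried in isolation. The following is a minimal sketch, not occs code: Symbol, Production, Item, item_init(), and item_move_dot() are simplified, hypothetical stand-ins for the SYMBOL, PRODUCTION, and ITEM machinery, and the right-hand side is assumed to be a NULL-terminated array of symbol pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for SYMBOL, PRODUCTION, and ITEM. */
typedef struct { int val; } Symbol;     /* val: numeric, tokenized value  */

typedef struct {
    int      num;                       /* production number              */
    Symbol **rhs;                       /* right-hand side, NULL at end   */
} Production;

typedef struct {
    int           prod_num;             /* copy of prod->num              */
    Production   *prod;
    Symbol       *right_of_dot;         /* cached rhs[dot_posn]           */
    unsigned char dot_posn;
} Item;

/* Value of the symbol to the right of the dot, 0 if the dot is at the end. */
#define RIGHT_OF_DOT(p) ((p)->right_of_dot ? (p)->right_of_dot->val : 0)

static void item_init(Item *it, Production *prod)
{
    assert(it != NULL && prod != NULL);
    it->prod         = prod;
    it->prod_num     = prod->num;       /* saves an indirection later     */
    it->dot_posn     = 0;
    it->right_of_dot = prod->rhs[0];    /* computed once, then reused     */
}

static void item_move_dot(Item *it)
{
    assert(it->right_of_dot != NULL);   /* dot must not already be at end */
    it->right_of_dot = it->prod->rhs[++it->dot_posn];
}
```

With a two-symbol right-hand side, RIGHT_OF_DOT() yields the two symbol values in turn and then 0 once the dot has moved past the last symbol; none of these look-ups touches the production's array.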
Listing 5.20. yystate.c Universal Constants and LR(1) Items
    #include <malloc.h>
    #include <tools/debug.h>
    #include <tools/compiler.h>
    #include "parser.h"
    #include "lltok.h"                  /* For EOI definition */
    /*----------------------------------------------------------------------*/
13                                      /* For statistics only:                */
14  PRIVATE int Nitems        = 0;      /* number of LR(1) items               */
15  PRIVATE int Npairs        = 0;      /* # of pairs in output tables         */
16  PRIVATE int Ntab_entries  = 0;      /* number of transitions in tables     */
17  PRIVATE int Shift_reduce  = 0;      /* number of shift/reduce conflicts    */
18  PRIVATE int Reduce_reduce = 0;      /* number of reduce/reduce conflicts   */
19
20  #define MAXSTATE 512                /* Max # of LALR(1) states.                */
21  #define MAXOBUF  256                /* Buffer size for various output routines */
22
23  /*----------------------------------------------------------------------*/
24
25  typedef struct item                 /* LR(1) item:                            */
26  {
27      int           prod_num;         /* production number                      */
28      PRODUCTION    *prod;            /* the production itself                  */
29      SYMBOL        *right_of_dot;    /* symbol to the right of the dot         */
30      unsigned char dot_posn;         /* offset of dot from start of production */
31      SET           *lookaheads;      /* set of lookahead symbols for this item */
32
33  } ITEM;
34
35  #define RIGHT_OF_DOT(p) ( (p)->right_of_dot ? (p)->right_of_dot->val : 0 )
Representing an LR(1) state: STATE.
The next data structure of interest represents an LR(1) state. The STATE structure is
defined, along with a few related constants, in Listing 5.21. Most of the closure items
don't have to be stored in the STATE because they can be derived from the kernel items
if necessary. The exceptions are items that contain ε productions, because these items
cause reductions from the current state, not a transition to a new state. The closure items
with ε productions are effectively part of the kernel, at least from the perspective of the
parse table. The ε items are nonetheless kept in their own array because the table-creation
code occasionally compares two STATE structures to see if they're equivalent. This
equivalence can be determined by comparing the true kernel items only. There's no
point in also comparing the ε items, because the same kernel items in both states
generate the same closure items, including the ε items, in both states. The
kernel_items[] array is kept sorted to facilitate STATE comparisons. The sort criterion
is pretty much arbitrary, but the sorting serves to get the two sets of kernel items to
appear in the same order in both STATEs. I'll discuss the comparison routine that's used
for this purpose in a moment.
Listing 5.21. yystate.c LR(1) States
36  #define MAXKERNEL   32      /* Maximum number of kernel items in a state.         */
37  #define MAXCLOSE    128     /* Maximum number of closure items in a state (less   */
38                              /* the epsilon productions).                          */
39  #define MAXEPSILON  8       /* Maximum number of epsilon productions that can be  */
40                              /* in a closure set for any given state.              */
41
42  typedef short STATENUM;
43
44  typedef struct state                    /* LR(1) state                        */
45  {
46      ITEM *kernel_items [MAXKERNEL ];    /* Set of kernel items.               */
47      ITEM *epsilon_items[MAXEPSILON];    /* Set of epsilon items.              */
48
49      unsigned nkitems : 7;               /* # items in kernel_items[].         */
50      unsigned neitems : 7;               /* # items in epsilon_items[].        */
51      unsigned closed  : 1;               /* State has had closure performed.   */
52
53      STATENUM num;                       /* State number (0 is start state).   */
54
55  } STATE;
The next listing (Listing 5.22) contains definitions used to build the internal
representation of the actual parse tables. The mechanism used is essentially the same as
that used by the output tables, but the transitions are stored as linked lists of structures
rather than input-symbol/next-state pairs. The Actions[] array (on line 64) is indexed
by current state. Each cell of Actions[] points at the head of a linked list of ACT
structures. One field of each ACT is an input symbol, and another is the action to take
when that symbol is encountered from the current state. The Gotos[] table (on line 68)
is essentially the same.
The table-generation code begins in Listing 5.23 with several routines to manage the
parse table just discussed. The allocation routine, new() on line 73, allocates space for
an ACT structure. Since the space used by these structures need never be freed,
malloc(), which is both slow and expensive in terms of system resources, need not be
called every time a new structure is required; rather, new() allocates an array of
structures with a single malloc() call on line 84. CHUNK, #defined on line 71, determines
the number of structures in the array. new() then returns pointers to structures in this
array until all array elements have been allocated. Only then does it get another array of
structures from malloc(). The remainder of the routines in the listing either put new
elements in the table or return pointers to existing elements.
Internal representation of the parse table: Actions, ACT, Gotos.
Memory management: new(), CHUNK.
Listing 5.22. yystate.c Internal Representation of the Parse Table
56  typedef struct act_or_goto
57  {
58      int sym;                        /* Given this input symbol,                   */
59      int do_this;                    /* do this. >0 == shift, <0 == reduce         */
60      struct act_or_goto *next;       /* Pointer to next ACT in the linked list.    */
61
62  } ACT;
63  typedef ACT GOTO;                   /* GOTO is an alias for ACT                   */
64  PRIVATE ACT  *Actions[MAXSTATE];    /* Array of pointers to the head of the action
65                                       * chains. Indexed by state number.
66                                       * I'm counting on initialization to NULL here.
67                                       */
68  PRIVATE GOTO *Gotos[MAXSTATE];      /* Array of pointers to the head of the goto
69                                       * chains.
70                                       */
Listing 5.23. yystate.c Parse-Table Memory Management
71  #define CHUNK 128       /* New() gets this many structures at once */
72
73  PRIVATE void *new()
74  {
75      /* Return an area of memory that can be used as either an ACT or GOTO.
76       * These objects cannot be freed.
77       */
78
79      static ACT *eheap = (ACT *) 0;
80      static ACT *heap  = (ACT *) 1;
81
82      if( heap >= eheap )
83      {
84          if( !(heap = (ACT *) malloc( sizeof(ACT) * CHUNK ) ))
85              error( FATAL, "No memory for action or goto\n" );
86
87          eheap = heap + CHUNK;
88      }
89      ++Ntab_entries;
90      return heap++;
91  }
92
93  /*----------------------------------------------------------------------*/
94
95  PRIVATE ACT *p_action( state, input_sym )
96  int state, input_sym;
97  {
98      /* Return a pointer to the existing ACT structure representing the indicated
99       * state and input symbol (or NULL if no such symbol exists).
100      */
101
102     ACT *p;
103
104     for( p = Actions[state]; p; p = p->next )
105         if( p->sym == input_sym )
106             return p;
107
108     return NULL;
109 }
110
111 /*----------------------------------------------------------------------*/
112
113 PRIVATE void add_action( state, input_sym, do_this )
114 int state, input_sym, do_this;
115 {
116     /* Add an element to the action part of the parse table. The cell is
117      * indexed by the state number and input symbol, and holds do_this.
118      */
119
120     ACT *p;
121
122     if( Verbose > 1 )
123         printf( "Adding shift or reduce action from state %d: %d on %s\n",
124                             state, do_this, Terms[ input_sym ]->name );
125     p              = (ACT *) new();
126     p->sym         = input_sym;
127     p->do_this     = do_this;
128     p->next        = Actions[state];
129     Actions[state] = p;
130 }
131
132 /*----------------------------------------------------------------------*/
133
134 PRIVATE GOTO *p_goto( state, nonterminal )
135 int state, nonterminal;
136 {
137     /* Return a pointer to the existing GOTO structure representing the
138      * indicated state and nonterminal (or NULL if no such symbol exists). The
139      * value used for the nonterminal is the one in the symbol table; it is
140      * adjusted down (so that the smallest nonterminal has the value 0)
141      * before doing the table look up, however.
142      */
143
144     GOTO *p;
145
146     nonterminal = ADJ_VAL( nonterminal );
147
148     for( p = Gotos[state]; p; p = p->next )
149         if( p->sym == nonterminal )
150             return p;
151
152     return NULL;
153 }
154
155 /*----------------------------------------------------------------------*/
156
157 PRIVATE void add_goto( state, nonterminal, go_here )
158 int state, nonterminal, go_here;
159 {
160     /* Add an element to the goto part of the parse table, the cell is indexed
161      * by current state number and nonterminal value, and holds go_here. Note
162      * that the input nonterminal value is the one that appears in the symbol
163      * table. It is adjusted downwards (so that the smallest nonterminal will
164      * have the value 0) before being inserted into the table, however.
165      */
166
167     GOTO *p;
168     int  unadjusted;                /* Original value of nonterminal */
169
170     unadjusted  = nonterminal;
171     nonterminal = ADJ_VAL( nonterminal );
172
173     if( Verbose > 1 )
174         printf( "Adding goto from state %d to %d on %s\n",
175                             state, go_here, Terms[ unadjusted ]->name );
176     p            = (GOTO *) new();
177     p->sym       = nonterminal;
178     p->do_this   = go_here;
179     p->next      = Gotos[state];
180     Gotos[state] = p;
181 }
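The chunked allocator and the prepend-and-search chains can be exercised outside of occs. What follows is a hedged sketch in ANSI-style C rather than the listing's K&R style. The names Act, act_alloc(), find_action(), and put_action() are made up for the example, and out-of-memory is reported through the return value instead of through error().

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_STATE 512       /* same bound as MAXSTATE in the listing      */
#define CHUNK     128       /* structures fetched per malloc() call       */

typedef struct act {
    int         sym;        /* input symbol                               */
    int         do_this;    /* >0 == shift, <0 == reduce                  */
    struct act *next;
} Act;

static Act *Chains[MAX_STATE];  /* static storage, so every head is NULL  */

/* Hand out structures from a malloc()ed array, fetching a new array only
 * when the current one is used up. Nothing is ever freed.
 */
static Act *act_alloc(void)
{
    static Act *heap = NULL, *end = NULL;

    if (heap == end) {
        heap = malloc(CHUNK * sizeof *heap);
        if (!heap) {
            end = NULL;             /* retry on the next call */
            return NULL;
        }
        end = heap + CHUNK;
    }
    return heap++;
}

/* Return the entry for (state, sym), or NULL if there is none. */
static Act *find_action(int state, int sym)
{
    Act *p;

    assert(state >= 0 && state < MAX_STATE);
    for (p = Chains[state]; p; p = p->next)
        if (p->sym == sym)
            return p;
    return NULL;
}

/* Prepend an entry to the chain for state. Returns 0 if out of memory. */
static int put_action(int state, int sym, int do_this)
{
    Act *p;

    assert(state >= 0 && state < MAX_STATE);
    if (!(p = act_alloc()))
        return 0;
    p->sym        = sym;
    p->do_this    = do_this;
    p->next       = Chains[state];
    Chains[state] = p;
    return 1;
}
```

Because new entries go at the head of the chain, insertion takes constant time; the price is a linear search in find_action(), which is tolerable as long as the chains stay short.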
State management.
The next listing (Listing 5.24) contains definitions for the data structures used to
manage the LR(1) STATEs as the state machine is constructed. Before looking at the
actual code, consider the logic used to create the machine, summarized in Table 5.15.
Table 5.15. An Algorithm for Creating an LALR(1) State Machine
Data structures:
    A state data base.
    A list of unclosed states in the state data base.
Initially:
    Create the start state and put it into both the state data base and the unclosed-state list.
Create LALR(1) state machine:
a.  for( each state in the list of unclosed states )
    {
b.      Perform LR(1) closure on state, generating a set of new states and associated lookaheads.
c.      for( each new state generated by the closure operation )
        {
d.          if( an existing state with the same kernel items as the new state already exists )
            {
e.              Merge appropriate next-state transitions from state with existing state.
f.              Add lookaheads in new state to the lookaheads in existing state.
g.              if( the previous step added new symbols to the lookahead set AND
                    the existing state isn't already in the list of unclosed states )
h.                  Add the existing state to the list of unclosed states.
            }
            else
            {
i.              Add appropriate next-state transitions from state to new state.
j.              Add the new state to the state data base.
k.              Add the new state to the list of unclosed states.
            }
        }
    }
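The control flow of the algorithm can be sketched with a toy model. Everything in the example is an assumption made for illustration: a kernel is reduced to a small integer that doubles as an index into the state data base, a lookahead set is a bit mask, the closure operation is a caller-supplied function, and the unclosed-state list is a simple stack. The next-state transitions of steps (e) and (i) are left out.

```c
#include <assert.h>

#define MAX_STATES 16

typedef struct {
    unsigned lookaheads;    /* bit i set == token i is a lookahead        */
    int      exists;        /* nonzero once the state is in the data base */
    int      unclosed;      /* nonzero while the state is on the worklist */
} State;

typedef struct { int kernel; unsigned lookaheads; } Succ;

/* Stand-in for LR(1) closure: fill out[] with the states reachable from
 * the given state and return how many there are (at most MAX_STATES).
 */
typedef int (*ClosureFn)(int kernel, unsigned lookaheads, Succ out[]);

static State Db[MAX_STATES];        /* state data base, indexed by kernel */
static int   Worklist[MAX_STATES];  /* unclosed states, used as a stack   */
static int   Nwork = 0;

static void add_unclosed(int kernel)
{
    if (!Db[kernel].unclosed) {     /* second half of the test in step g  */
        Db[kernel].unclosed = 1;
        Worklist[Nwork++] = kernel;
    }
}

/* Build the machine. Returns the number of closure operations performed. */
static int build_machine(int start, unsigned start_la, ClosureFn closure)
{
    int nclosures = 0;

    assert(start >= 0 && start < MAX_STATES);
    Db[start].exists     = 1;
    Db[start].lookaheads = start_la;
    add_unclosed(start);

    while (Nwork > 0) {                                     /* step a */
        Succ succ[MAX_STATES];
        int  i, n, k = Worklist[--Nwork];

        Db[k].unclosed = 0;
        n = closure(k, Db[k].lookaheads, succ);             /* step b */
        ++nclosures;

        for (i = 0; i < n; ++i) {                           /* step c */
            State *s;

            assert(succ[i].kernel >= 0 && succ[i].kernel < MAX_STATES);
            s = &Db[succ[i].kernel];
            if (s->exists) {                                /* step d */
                unsigned merged = s->lookaheads | succ[i].lookaheads;
                if (merged != s->lookaheads) {              /* steps f, g */
                    s->lookaheads = merged;
                    add_unclosed(succ[i].kernel);           /* step h */
                }
            } else {
                s->exists     = 1;                          /* step j */
                s->lookaheads = succ[i].lookaheads;
                add_unclosed(succ[i].kernel);               /* step k */
            }
        }
    }
    return nclosures;
}

/* Hypothetical three-state machine used only to exercise the loop. */
static int demo_closure(int kernel, unsigned lookaheads, Succ out[])
{
    if (kernel == 0) {
        out[0].kernel = 1;  out[0].lookaheads = lookaheads;
        out[1].kernel = 2;  out[1].lookaheads = 0x2u;
        return 2;
    }
    if (kernel == 1) {
        out[0].kernel = 2;  out[0].lookaheads = 0x4u;
        return 1;
    }
    return 0;
}
```

Calling build_machine(0, 0x1u, demo_closure) closes state 2 twice: once when it is first created, and again after state 1 contributes a lookahead that state 2 didn't have. That re-closing is the reason step (h) puts an existing state back on the unclosed list.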
Finished-states list.
Two data structures are used by the algorithm to keep track of the states. The states
themselves must be stored in a data base of some sort because every time a new state is
created, you have to see if a state with the same set of kernel items already exists. (If it
Listing 5.24. yystate.c LR-State Memory Management: Data Structures
182 PRIVATE HASH_TAB *States = NULL;        /* LR(1) states                   */
183 PRIVATE int      Nstates = 0;           /* Number of states.              */
184
185 #define MAX_UNFINISHED 128
186
187 typedef struct tnode
188 {
189     STATE        *state;
190     struct tnode *left, *right;
191
192 } TNODE;
193
194 PRIVATE TNODE Heap[ MAX_UNFINISHED ];   /* Source of all TNODEs           */
195 PRIVATE TNODE *Next_allocate = Heap;    /* Ptr to next node to allocate   */
196
197 PRIVATE TNODE *Available = NULL;        /* Free list of available nodes   */
198                                         /* linked list of TNODES. p->left */
199                                         /* is used as the link.           */
200 PRIVATE TNODE *Unfinished = NULL;       /* Tree of unfinished states.     */
201
202 PRIVATE ITEM **State_items;             /* Used to pass info to state_cmp */
203 PRIVATE int  State_nitems;              /*            "                   */
204 PRIVATE int  Sort_by_number = 0;        /*            "                   */
205
206 #define NEW      0                      /* Possible return values from    */
207 #define UNCLOSED 1                      /* newstate().                    */
208 #define CLOSED   2
does exist, the next-state transitions go to the existing state rather than the newly created
one.) An unsorted array is inappropriate for holding the list of existing states because
the search time is too long. Fortunately, we already have a quite workable data-base
manager in the guise of the hash-table functions that we've been using for symbol-table
management, and I use those routines here to keep track of the existing states. The
States pointer (declared on line 182 of Listing 5.24) points at a hash table of LR(1)
states. The maketab() call that initializes the table is done in one of the higher-level
routines discussed later in this section.
The next data structure keeps track of those states that are created in a closure
operation but have not, themselves, been closed. Since each closure operation creates several
new states, a mechanism is needed to keep track of those states on which closure has not
yet been performed. Look-up time is an issue here as well, because step (g) of the
algorithm must test to see if the state is already in the list of unclosed states before adding it
to the list. Though a hash table could also be used here, I've opted for a binary tree
because it's both easy to implement in this application and somewhat more efficient than
the general-purpose hash-table functions. It's also easier, in the current application, to
remove an arbitrary state from a binary tree than from the hash table, because the removed
state is always a leaf here. So, the unclosed-state list is implemented as a binary tree of
pointers to STATEs in the hash table. The TNODE structure used for the tree nodes is
defined on lines 187 to 192 of Listing 5.24. The complete system is pictured in Figure
5.14.
Figure 5.14. The State Data Base and Unclosed-State List
Unfinished-states list: TNODE.
Unfinished-state management: Available, Next_allocate, Heap.
The Available list at the bottom of Figure 5.14 manages nodes in the tree that
have been used at least once. Tree nodes are, initially, managed like the STATEs. An
array of several of them (called Heap) is defined on line 194 of Listing 5.24.
Next_allocate (defined on the next line) is a pointer to the next available node in
Heap[]. Initially, nodes are taken from Heap[] by incrementing the
Next_allocate pointer. When a node is removed from the unclosed-state tree, that
node is added to the linked list pointed to by Available. New nodes are allocated
from the linked list if it isn't empty; otherwise a new node is fetched from Heap[].
This strategy is necessary because malloc() is too slow.
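The allocation policy just described (recycle from the free list first, fall back to the static array second) can be sketched as follows. The names Node, Pool, node_alloc(), and node_free() are hypothetical, and the pool is kept tiny so that exhaustion is easy to observe.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4                     /* tiny on purpose, for the demo   */

typedef struct node {
    int          key;
    struct node *left, *right;          /* left doubles as free-list link  */
} Node;

static Node  Pool[MAX_NODES];           /* source of all nodes             */
static Node *Next_unused = Pool;        /* first node never handed out     */
static Node *Free_list   = NULL;        /* nodes that have been recycled   */

/* Get a node: recycled ones first, then fresh ones. NULL when exhausted. */
static Node *node_alloc(void)
{
    Node *n;

    if (Free_list) {
        n         = Free_list;
        Free_list = Free_list->left;
    } else if (Next_unused < Pool + MAX_NODES) {
        n = Next_unused++;
    } else {
        return NULL;
    }
    n->left = n->right = NULL;
    return n;
}

/* Put a node that has been unlinked from the tree onto the free list. */
static void node_free(Node *n)
{
    assert(n != NULL);
    n->left   = Free_list;
    Free_list = n;
}
```

Neither routine calls the system allocator, so both allocation and release cost a couple of pointer assignments. The trade-off is the fixed upper bound on the number of nodes that can be in use at once.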
STATE allocation: newstate().
New states are created and initialized by newstate(), at the top of Listing 5.25. Note
that you don't have to initialize the new node until after you've decided whether or not
to put it into the tree. You can get away with this delay because the comparison function
used to do the insertion is always called with cmp(existing, new), where existing
is the node already in the tree and new is the new one. The comparison function
doesn't have to actually examine the new node to do the comparison, however. The
ITEM pointer and count that are passed into newstate() can be copied into global
variables, and the comparison function can look at those global variables instead of the
actual contents of the new node. State_items and State_nitems, which were
declared on lines 202 and 203 of Listing 5.24, are used for this purpose. The delayed
initialization gives you a little more speed, because you don't have to do a true
initialization unless it's absolutely necessary.
Add to unfinished list: add_unfinished().
The add_unfinished() subroutine on line 267 of Listing 5.25 adds nodes to the
unfinished list, but only if the node is not already there. As discussed earlier, a binary
tree is used for the list. Note that this is not an ideal data structure because nodes are, at
least initially, inserted in ascending order, yielding a worst-case linked list. Things
improve once the lookaheads start being added because previously removed states are
put back onto the list more or less randomly. This situation could be improved by using
a unique random number as an identifier rather than the state number, but it will most
likely take longer to generate this number than it will to chase down the list. Alternatively,
you could keep the list as an array and do a binary-insertion sort to put in the new
Listing 5.25. yystate.c LR-State Memory Management: Subroutines
209 PRIVATE int newstate( items, nitems, statep )
210 ITEM  **items;
211 int   nitems;
212 STATE **statep;
213 {
214     STATE *state;
215     STATE *existing;
216     int   state_cmp();
217
218     if( nitems > MAXKERNEL )
219         error( FATAL, "Kernel of new state %d too large\n", Nstates );
220
221
222     State_items  = items;       /* set up parameters for state_cmp */
223     State_nitems = nitems;      /* and state_hash.                 */
224
225     if( existing = findsym( States, NULL ) )
226     {
227         /* State exists; by not setting "state" to NULL, we'll recycle */
228         /* the newly allocated state on the next call.                  */
229
230         *statep = existing;
231         if( Verbose > 1 )
232         {
233             printf( "Using existing state (%sclosed): ",
234                                     existing->closed ? "" : "un" );
235             pstate_stdout( existing );
236         }
237         return existing->closed ? CLOSED : UNCLOSED;
238     }
239     else
240     {
241         if( Nstates >= MAXSTATE )
242             error( FATAL, "Too many LALR(1) states\n" );
243
244         if( !(state = (STATE *) newsym( sizeof(STATE) ) ))
245             error( FATAL, "Insufficient memory for states\n" );
246
247         memcpy( state->kernel_items, items, nitems * sizeof(ITEM*) );
248         state->nkitems = nitems;
249         state->neitems = 0;
250         state->closed  = 0;
251         state->num     = Nstates++;
252         *statep        = state;
253         addsym( States, state );
254
255         if( Verbose > 1 )
256         {
257             printf( "Forming new state:" );
258             pstate_stdout( state );
259         }
260
261         return NEW;
262     }
263 }
264
265 /*----------------------------------------------------------------------*/
266
267 PRIVATE void add_unfinished( state )
268 STATE *state;
269 {
270     TNODE **parent, *root;
271     int   cmp;
272
273     parent = &Unfinished;
274     root   = Unfinished;
275     while( root )                       /* look for the node in the tree */
276     {
277         if( (cmp = state->num - root->state->num) == 0 )
278             break;
279         else
280         {
281             parent = (cmp < 0) ? &root->left : &root->right;
282             root   = (cmp < 0) ?  root->left :  root->right;
283         }
284     }
285
286     if( !root )                             /* Node isn't in tree.         */
287     {
288         if( Available )                     /* Allocate a new node and     */
289         {                                   /* put it into the tree.       */
290             *parent   = Available;          /* Use node from Available     */
291             Available = Available->left;    /* list if possible, otherwise */
292         }                                   /* get the node from the Heap. */
293         else
294         {
295             if( Next_allocate >= &Heap[ MAX_UNFINISHED ] )
296                 error( FATAL, "Internal: No memory for unfinished state\n" );
297             *parent = Next_allocate++;
298         }
299
300         (*parent)->state = state;           /* initialize the node */
301         (*parent)->left  = (*parent)->right = NULL;
302     }
303 }
304
305 /*----------------------------------------------------------------------*/
306
307 PRIVATE STATE *get_unfinished()
308 {
309     /* Returns a pointer to the next unfinished state and deletes that
310      * state from the unfinished tree. Returns NULL if the tree is empty.
311      */
312
313     TNODE *root;
314     TNODE **parent;
315
316     if( !Unfinished )
317         return NULL;
318
319     parent = &Unfinished;               /* find leftmost node */
320     if( root = Unfinished )
321     {
322         while( root->left )
323         {
324             parent = &root->left;
325             root   = root->left;
326         }
327     }
328
329     *parent    = root->right;           /* Unlink node from the tree */
330     root->left = Available;             /* Put it into the free list */
331     Available  = root;
332
333     return root->state;
334 }
335
STATE comparison: state_cmp().
elements. Again, the time required to move the tail end of the array is comparable to the
time required to chase down the list. Another solution uses a SET of unfinished-state
numbers. The problem here is that, towards the end of creating a big table (when things
are slower anyway), almost all of the unfinished states will have large state numbers. If
the smallest state number is 256 (for example), you'll have to look at 16 empty words in
the bit map before finding one with a set bit. Empirical testing showed that using a SET
slowed things down by about 5% (which translates to about 10 seconds on a big
grammar, with occs running on my 8MHz IBM-PC/AT). The get_unfinished()
subroutine on line 307 of Listing 5.25 gets a node from the unfinished list. It returns NULL
when the list is empty.
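The cost of the bit-map alternative is easy to reproduce. The sketch below assumes 16-bit words (which is what the figure of 16 empty words for state 256 implies) and uses a hypothetical BitSet type rather than the SET routines used elsewhere in the book. bitset_pop_min() reports how many empty words it had to step over.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MEMBERS 512
#define WORD_BITS   16                  /* assumed word size               */
#define NWORDS      (MAX_MEMBERS / WORD_BITS)

typedef struct { uint16_t word[NWORDS]; } BitSet;

static void bitset_add(BitSet *s, int n)
{
    assert(n >= 0 && n < MAX_MEMBERS);
    s->word[n / WORD_BITS] |= (uint16_t)(1u << (n % WORD_BITS));
}

/* Remove and return the smallest member, or -1 if the set is empty.
 * *skipped is set to the number of empty words examined along the way.
 */
static int bitset_pop_min(BitSet *s, int *skipped)
{
    int w, b;

    *skipped = 0;
    for (w = 0; w < NWORDS; ++w) {
        if (s->word[w] == 0) {
            ++*skipped;
            continue;
        }
        for (b = 0; b < WORD_BITS; ++b) {
            if (s->word[w] & (1u << b)) {
                s->word[w] &= (uint16_t)~(1u << b);
                return w * WORD_BITS + b;
            }
        }
    }
    return -1;
}
```

With 256 as the smallest member, the scan steps over words 0 through 15 before it finds a set bit, and it repeats that walk on every call. The binary tree avoids the repeated walk because its leftmost node is always the next state to return.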
The next listing (Listing 5.26) holds a comparison and a hash function that can be
used to order the states. The comparison function, state_cmp() on line 336 of Listing
5.26, is needed to compare states when searching for a particular state in the array.
States are equivalent if they have the same set of kernel items, and equivalence can be
determined from the production numbers and dot positions only. The lookaheads don't
matter because you're testing for equivalence only to see if the lookaheads for a new
LR(1) state can be merged with those of an existing one. Similarly, the actual ordering
of states is immaterial as long as the ordering is consistent, so you can do easy things for
sort criteria. The rules are:
(1) The STATE with the most items is larger.
(2) If the STATEs have the same number of items, the two lists of items are compared.
    Remember, these lists are sorted, so you can do what amounts to a lexicographic
    comparison here. The for loop on lines 360 to 367 of Listing 5.26 scans the list of
    items until it finds a mismatched item. If there are no mismatched items, the states
    are equivalent.
(3) The items are compared using two criteria: The item with the largest production
    number is the largest (this test is on line 362). Otherwise, if both items have the
    same production number, the item with the dot further to the right is the larger (this
    test is on line 365).
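The three rules can be sketched as a standalone function. In this example the Item type, kernel_cmp(), and kernel_hash() are illustrative names, the kernels are passed as explicit arguments rather than through global variables, and a lookahead bit mask is carried along only to show that it plays no part in either function. The hash sums the same two fields that the comparison looks at, so kernels that compare equal always hash alike.

```c
#include <assert.h>

typedef struct {
    int      prod_num;      /* production number                          */
    int      dot_posn;      /* offset of the dot                          */
    unsigned lookaheads;    /* ignored by both functions below            */
} Item;

/* Compare two kernels whose items are sorted by the same criteria.
 * Returns <0, 0, or >0. Zero means the kernels are equivalent.
 */
static int kernel_cmp(const Item *a, int na, const Item *b, int nb)
{
    int i, cmp;

    if ((cmp = na - nb) != 0)                           /* rule 1 */
        return cmp;

    for (i = 0; i < na; ++i) {                          /* rule 2 */
        if ((cmp = a[i].prod_num - b[i].prod_num) != 0) /* rule 3 */
            return cmp;
        if ((cmp = a[i].dot_posn - b[i].dot_posn) != 0)
            return cmp;
    }
    return 0;
}

/* Sum the production numbers and dot positions of the kernel items. */
static unsigned kernel_hash(const Item *a, int n)
{
    unsigned total = 0;
    int      i;

    for (i = 0; i < n; ++i)
        total += (unsigned)(a[i].prod_num + a[i].dot_posn);
    return total;
}
```

Two kernels that differ only in their lookaheads compare equal and land in the same hash bucket, which is exactly the condition under which an LR(1) state is merged into an existing LALR(1) state.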
STATE hashing: state_hash().
ITEM management: newitem(), free_item(), free_recycled_items(), movedot().
ITEM comparison: item_cmp().
The hash function on line 372 of Listing 5.26 is used to manage the finished-state list.
It is much easier to implement than the sort-comparison function. It just sums together
the production numbers and dot positions of the items.
The next set of subroutines, in Listing 5.27, takes care of ITEM management.
newitem() (on line 394) allocates space for (and initializes) a new item. free_item()
(on line 421) puts an item on a recycle heap for later allocation by newitem().
free_recycled_items() (on line 429) cleans up after all processing finishes: it
frees memory used for the entire set of items. The movedot() function on line 442
moves the dot over one notch and updates associated fields in the ITEM structure.
The item_cmp() function on line 452 is used to sort the items within a STATE
structure. The primary sort criterion is the numeric value that represents the symbol to
Listing 5.26. yystate.c State Comparison Functions
336 PRIVATE int state_cmp( new, tab_node )
337 STATE *new;             /* Pointer to new node (ignored    */
338                         /* if Sort_by_number is false).    */
339 STATE *tab_node;        /* Pointer to existing node        */
340 {
341     /* Compare two states as described in the text. Return a number representing
342      * the relative weight of the states, or 0 if the states are equivalent.
343      */
344
345     ITEM **tab_item;    /* Array of items for existing state */
346     ITEM **item;        /* Array of items for new state      */
347     int  nitem;         /* Size of        "                  */
348     int  cmp;
349
350     if( Sort_by_number )
351         return( new->num - tab_node->num );
352
353     if( cmp = State_nitems - tab_node->nkitems )    /* state with largest */
354         return cmp;                                 /* number of items is */
355                                                     /* larger.            */
356     nitem    = State_nitems;
357     item     = State_items;
358     tab_item = tab_node->kernel_items;
359
360     for(; --nitem >= 0 ; ++tab_item, ++item )
361     {
362         if( cmp = (*item)->prod_num - (*tab_item)->prod_num )
363             return cmp;
364
365         if( cmp = (*item)->dot_posn - (*tab_item)->dot_posn )
366             return cmp;
367     }
368
369     return 0;                                       /* States are equivalent */
370 }
371
372 PRIVATE int state_hash( sym )
373 STATE *sym;             /* ignored */
374 {
375     /* Hash function for STATEs. Sum together production numbers and dot
376      * positions of the kernel items.
377      */
378
379     ITEM **items;       /* Array of items for new state */
380     int  nitems;        /* Size of        "             */
381     int  total;
382
383     items  = State_items;
384     nitems = State_nitems;
385     total  = 0;
386
387     for(; --nitems >= 0 ; ++items )
388         total += (*items)->prod_num + (*items)->dot_posn;
389
390     return total;
391 }
the right of the dot. This means that ε productions float to the top of the list, followed by
productions with terminals to the right of the dot, followed by productions with
nonterminals to the right of the dot. The ordering is handy when you're partitioning the
closure items to create new state kernels: items with the same symbol to the right of the
dot will be adjacent in the sorted list. The second sort criterion (if the same symbol is to
the right of the dot in both items) is the production number, and the third criterion is the
dot position.
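The three-level ordering just described can be sketched with the standard qsort(). This is only an illustration under stated assumptions: demo_item is a hypothetical, simplified stand-in for ITEM (plain integers rather than pointers to SYMBOL and PRODUCTION structures), and a right_of_dot value of 0 is assumed to mean that nothing is to the right of the dot.

```c
#include <stdlib.h>

/* Hypothetical, simplified item: the real ITEM holds pointers instead. */
typedef struct
{
    int right_of_dot;   /* value of symbol right of the dot, 0 if none */
    int prod_num;       /* production number                           */
    int dot_posn;       /* dot position                                */
} demo_item;

/* Compare on right_of_dot first, then prod_num, then dot_posn. */
static int demo_item_cmp( const void *a, const void *b )
{
    const demo_item *x = (const demo_item *) a;
    const demo_item *y = (const demo_item *) b;

    if( x->right_of_dot != y->right_of_dot )
        return x->right_of_dot < y->right_of_dot ? -1 : 1;

    if( x->prod_num != y->prod_num )
        return x->prod_num < y->prod_num ? -1 : 1;

    return (x->dot_posn > y->dot_posn) - (x->dot_posn < y->dot_posn);
}

/* Sort an array of items in place using the ordering above. */
void demo_sort_items( demo_item *items, size_t n )
{
    qsort( items, n, sizeof(demo_item), demo_item_cmp );
}
```

After sorting, items that share a symbol to the right of the dot sit next to each other, so a single left-to-right scan is enough to split the array into groups.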
Listing 5.27. yystate.c ITEM Management
ITEM *Recycled_items = NULL;

PRIVATE ITEM *newitem( production )
PRODUCTION *production;
{
    ITEM *item;

    if( Recycled_items )
    {
        item = Recycled_items;
        Recycled_items = (ITEM *) Recycled_items->prod;
        CLEAR( item->lookaheads );
    }
    else
    {
        if( !(item = (ITEM *) malloc( sizeof(ITEM) )) )
            error( FATAL, "Insufficient memory for all LR(1) items\n" );

        item->lookaheads = newset();
    }

    ++Nitems;
    item->prod         = production;
    item->prod_num     = production->num;
    item->dot_posn     = 0;
    item->right_of_dot = production->rhs[0];
    return item;
}

PRIVATE void freeitem( item )
ITEM *item;
{
    --Nitems;
    item->prod = (PRODUCTION *) Recycled_items;
    Recycled_items = item;
}

PRIVATE void free_recycled_items()
{
    /* empty the recycling heap, freeing all memory used by items there */

    ITEM *p;

    while( p = Recycled_items )
    {
        Recycled_items = (ITEM *) Recycled_items->prod;
        free( p );
    }
}

PRIVATE movedot( item )
ITEM *item;
{
    /* Moves the dot one position to the right and updates the right_of_dot
     * symbol.
     */

    item->right_of_dot = ( item->prod->rhs )[ ++item->dot_posn ];
}

PRIVATE int item_cmp( item1p, item2p )
ITEM **item1p, **item2p;
{
    /* Return the relative weight of two items, 0 if they're equivalent. */

    int  rval;
    ITEM *item1 = *item1p;
    ITEM *item2 = *item2p;

    if( !(rval = RIGHT_OF_DOT(item1) - RIGHT_OF_DOT(item2)) )
        if( !(rval = item1->prod_num - item2->prod_num ) )
            return item1->dot_posn - item2->dot_posn;

    return rval;
}
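The recycling scheme used by newitem() and freeitem() is an intrusive free list: a released structure is chained onto a list through a field that is otherwise idle. The sketch below shows the same idea in isolation. It is a minimal illustration, not the book's code: demo_node and the demo_ functions are hypothetical names, and a union is used for the link where the listing reuses the prod field with a cast.

```c
#include <stdlib.h>

/* Hypothetical node type. The link shares storage with the payload, which
 * is unused while the node sits on the free list.
 */
typedef struct demo_node
{
    union
    {
        struct demo_node *next_free;    /* valid only while recycled */
        int               payload;      /* valid only while in use   */
    } u;
} demo_node;

static demo_node *Demo_free_list = NULL;
static int        Demo_live      = 0;   /* nodes currently in use */

/* Get a node, reusing a recycled one when possible. NULL if out of memory. */
demo_node *demo_new( int payload )
{
    demo_node *p = Demo_free_list;

    if( p )
        Demo_free_list = p->u.next_free;
    else if( !(p = (demo_node *) malloc( sizeof(demo_node) )) )
        return NULL;

    p->u.payload = payload;
    ++Demo_live;
    return p;
}

/* Put a node on the free list rather than returning it to the heap. */
void demo_recycle( demo_node *p )
{
    p->u.next_free = Demo_free_list;
    Demo_free_list = p;
    --Demo_live;
}

/* Release every recycled node to the heap; return how many were freed. */
int demo_drain( void )
{
    int        n = 0;
    demo_node *p;

    while( (p = Demo_free_list) != NULL )
    {
        Demo_free_list = p->u.next_free;
        free( p );
        ++n;
    }
    return n;
}

/* Number of nodes handed out and not yet recycled. */
int demo_live_count( void )
{
    return Demo_live;
}
```

The payoff is the same as in the listing: a program that creates and discards many small objects in bursts avoids most calls to malloc() and free(), at the cost of holding the memory until the list is drained.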
The next three listings contain subroutines that make the LR(1) parse table using the
logic discussed previously on page 415. The remainder of the chapter is boring stuff that
does things like print out the tables. Rather than discuss the routines here, I have
commented them copiously so that you don't have to flip back and forth between the
listing and the text.
Listing 5.28. yystate.c High-Level, Table-Generation Function
PUBLIC void make_parse_tables()
{
    /* Prints an LALR(1) transition matrix for the grammar currently
     * represented in the symbol table.
     */

    ITEM       *item;
    STATE      *state;
    PRODUCTION *start_prod;
    void       mkstates();
    int        state_cmp();
    FILE       *fp, *old_output;

    /* Make data structures used to produce the table, and create an initial
     * LR(1) item containing the start production and the end-of-input marker
     * as a lookahead symbol.
     */

    States = maketab( 257, state_hash, state_cmp );

    if( !Goal_symbol )
        error( FATAL, "No goal symbol.\n" );

    start_prod = Goal_symbol->productions;

    if( start_prod->next )
        error( FATAL, "Start symbol must have only one right-hand side.\n" );

    item = newitem( start_prod );           /* Make item for start production */
    ADD( item->lookaheads, _EOI_ );         /* FOLLOW(S) = {$}                */

    newstate( &item, 1, &state );

    if( lr( state ) )                       /* Add shifts and gotos to the table */
    {
        if( Verbose )
            printf( "Adding reductions:\n" );

        reductions();                       /* add the reductions */

        if( Verbose )
            printf( "Creating tables:\n" );

        if( !Make_yyoutab )                 /* Tables go in yyout.c */
        {
            print_tab( Actions, "Yya", "Yy_action", 1 );
            print_tab( Gotos,   "Yyg", "Yy_goto",   1 );
        }
        else
        {
            if( !(fp = fopen( TAB_FILE, "w" )) )
            {
                error( NONFATAL, "Can't open " TAB_FILE ", ignoring -T\n" );

                print_tab( Actions, "Yya", "Yy_action", 1 );
                print_tab( Gotos,   "Yyg", "Yy_goto",   1 );
            }
            else
            {
                output( "extern YY_TTYPE *Yy_action[]; /* in yyoutab.c */\n" );
                output( "extern YY_TTYPE *Yy_goto[];   /* in yyoutab.c */\n" );

                old_output = Output;
                Output     = fp;

                fprintf( fp, "#include <stdio.h>\n" );
                fprintf( fp, "typedef short YY_TTYPE;\n" );
                fprintf( fp, "#define YYPRIVATE %s\n",
                                        Public ? "/* empty */" : "static" );

                print_tab( Actions, "Yya", "Yy_action", 0 );
                print_tab( Gotos,   "Yyg", "Yy_goto",   0 );

                fclose( fp );
                Output = old_output;
            }
        }
        print_reductions();
    }
}
Listing 5.29. yystate.c Table-Generation Functions
PRIVATE int lr( cur_state )
STATE *cur_state;
{
    /* Make LALR(1) state machine. The shifts and gotos are done here, the
     * reductions are done elsewhere. Return the number of states.
     */

    ITEM   **p;
    ITEM   **first_item;
    ITEM   *closure_items[ MAXCLOSE ];
    STATE  *next;           /* Next state.                               */
    int    isnew;           /* Next state is a new state.                */
    int    nclose;          /* Number of items in closure_items.         */
    int    nitems;          /* # items with same symbol to right of dot. */
    int    val;             /* Value of symbol to right of dot.          */
    SYMBOL *sym;            /* Actual symbol to right of dot.            */
    int    nlr = 0;         /* Nstates + nlr == number of LR(1) states.  */

    add_unfinished( cur_state );

    while( cur_state = get_unfinished() )
    {
        if( Verbose > 1 )
            printf( "Next pass...working on state %d\n", cur_state->num );

        /* closure()  adds normal closure items to closure_items array.
         * kclosure() adds to that set all items in the kernel that have
         *            outgoing transitions (ie. whose dots aren't at the far
         *            right).
         * assort()   sorts the closure items by the symbol to the right
         *            of the dot. Epsilon transitions will sort to the head of
         *            the list, followed by transitions on terminals,
         *            followed by transitions on nonterminals.
         * move_eps() moves the epsilon transitions into the closure kernel set.
         */

        nclose = closure ( cur_state, closure_items, MAXCLOSE );
        nclose = kclosure( cur_state, closure_items, MAXCLOSE, nclose );

        if( nclose )
        {
            assort( closure_items, nclose, sizeof(ITEM*), item_cmp );
            nitems  = move_eps( cur_state, closure_items, nclose );
            p       = closure_items + nitems;
            nclose -= nitems;

            if( Verbose > 1 )
                pclosure( cur_state, p, nclose );
        }

        /* All of the remaining items have at least one symbol to the
         * right of the dot.
         */

        while( nclose > 0 )
        {
            first_item = p;
            sym        = (*first_item)->right_of_dot;
            val        = sym->val;

            /* Collect all items with the same symbol to the right of the dot.
             * On exiting the loop, nitems will hold the number of these items
             * and p will point at the first nonmatching item. Finally nclose is
             * decremented by nitems.
             */

            nitems = 0;
            do
            {
                movedot( *p++ );
                ++nitems;

            } while( --nclose > 0 && RIGHT_OF_DOT(*p) == val );

            /* (1) newstate() gets the next state. It returns NEW if the state
             *     didn't exist previously, CLOSED if LR(0) closure has been
             *     performed on the state, UNCLOSED otherwise.
             * (2) add a transition from the current state to the next state.
             * (3) If it's a brand-new state, add it to the unfinished list.
             * (4) otherwise merge the lookaheads created by the current closure
             *     operation with the ones already in the state.
             * (5) If the merge operation added lookaheads to the existing set,
             *     add it to the unfinished list.
             */

            isnew = newstate( first_item, nitems, &next );          /* 1 */

            if( !cur_state->closed )
            {
                if( ISTERM( sym ) )                                 /* 2 */
                    add_action( cur_state->num, val, next->num );
                else
                    add_goto  ( cur_state->num, val, next->num );
            }

            if( isnew == NEW )
                add_unfinished( next );                             /* 3 */
            else
            {                                                       /* 4 */
                if( merge_lookaheads( next->kernel_items, first_item, nitems ) )
                {
                    add_unfinished( next );                         /* 5 */
                    ++nlr;
                }
                while( --nitems >= 0 )
                    freeitem( *first_item++ );
            }

            fprintf( stderr, "\rLR:%-3d LALR:%-3d", Nstates + nlr, Nstates );
        }
        cur_state->closed = 1;
    }

    free_recycled_items();

    if( Verbose )
        fprintf( stderr, " states, %d items, %d shift and goto transitions\n",
                                                    Nitems, Ntab_entries );
    return Nstates;
}
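The inner do/while loop of lr() relies on the closure items being sorted: every run of items with the same symbol to the right of the dot becomes the kernel of one successor state. The grouping step on its own can be sketched as follows. This is an illustration only; demo_partition_runs() is a hypothetical name, and plain integer keys stand in for the symbol values that the RIGHT_OF_DOT macro yields.

```c
#include <stddef.h>

/* Given keys[] sorted in ascending order, store the length of each run of
 * equal keys in run_len[] (which must have room for n entries) and return
 * the number of runs.
 */
size_t demo_partition_runs( const int *keys, size_t n, size_t *run_len )
{
    size_t nruns = 0;
    size_t i     = 0;

    while( i < n )
    {
        int    val   = keys[i];     /* symbol shared by this group */
        size_t count = 0;

        do                          /* consume the whole run */
        {
            ++count;
            ++i;

        } while( i < n && keys[i] == val );

        run_len[ nruns++ ] = count;
    }
    return nruns;
}
```

Because the input is sorted, each group is found in one pass and no auxiliary table keyed by symbol is needed.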

/*----------------------------------------------------------------------*/

PRIVATE int merge_lookaheads( dst_items, src_items, nitems )
ITEM **src_items;
ITEM **dst_items;
int  nitems;
{
    /* This routine is called if newstate has determined that a state having the
     * specified items already exists. If this is the case, the item list in the
     * STATE and the current item list will be identical in all respects except
     * lookaheads. This routine merges the lookaheads of the input items
     * (src_items) to the items already in the state (dst_items). 0 is returned
     * if nothing was done (all lookaheads in the new state are already in the
     * existing state), 1 otherwise. It's an internal error if the items don't
     * match.
     */

    int did_something = 0;

    while( --nitems >= 0 )
    {
        if(    (*dst_items)->prod     != (*src_items)->prod
            || (*dst_items)->dot_posn != (*src_items)->dot_posn )
        {
            error( FATAL, "INTERNAL [merge_lookaheads], item mismatch\n" );
        }

        if( !subset( (*dst_items)->lookaheads, (*src_items)->lookaheads ) )
        {
            ++did_something;
            UNION( (*dst_items)->lookaheads, (*src_items)->lookaheads );
        }

        ++dst_items;
        ++src_items;
    }

    return did_something;
}
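The return value of merge_lookaheads() is what lets the state-building loop terminate: a state goes back on the unfinished list only if the merge actually added a lookahead, so the process stops once a fixed point is reached. The following sketch shows that "merge and report change" step with lookahead sets modeled as bit masks. The names are hypothetical and the representation is an assumption made for brevity; the listing uses the SET type and the subset() and UNION() operations instead.

```c
/* Hypothetical lookahead set: one bit per token. */
typedef unsigned long demo_set;

/* Nonzero if every member of src is already in dst. */
static int demo_is_subset( demo_set dst, demo_set src )
{
    return (src & ~dst) == 0;
}

/* Merge n source sets into n destination sets, pairwise. Return the number
 * of destination sets that actually gained members.
 */
int demo_merge( demo_set *dst, const demo_set *src, int n )
{
    int changed = 0;

    for(; --n >= 0 ; ++dst, ++src )
    {
        if( !demo_is_subset( *dst, *src ) )
        {
            *dst |= *src;
            ++changed;
        }
    }
    return changed;
}
```

Calling demo_merge() a second time with the same arguments returns 0, which is the condition under which the caller can leave the state alone.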

/*----------------------------------------------------------------------*/

PRIVATE int move_eps( cur_state, closure_items, nclose )
STATE *cur_state;
ITEM  **closure_items;
{
    /* Move the epsilon items from the closure_items set to the kernel of the
     * current state. If epsilon items already exist in the current state,
     * just merge the lookaheads. Note that, because the closure items were
     * sorted to partition them, the epsilon productions in the closure_items
     * set will be in the same order as those already in the kernel. Return
     * the number of items that were moved.
     */

    ITEM **eps_items, **p;
    int  nitems, moved;

    eps_items = cur_state->epsilon_items;
    nitems    = cur_state->neitems;
    moved     = 0;

    /* Test the count first so that *p is never read past the last item. */

    for( p = closure_items; --nclose >= 0 && (*p)->prod->rhs_len == 0 ; )
    {
        if( ++moved > MAXEPSILON )
            error( FATAL, "Too many epsilon productions in state %d\n",
                                                        cur_state->num );
        if( nitems )
            UNION( (*eps_items++)->lookaheads, (*p++)->lookaheads );
        else
            *eps_items++ = *p++;
    }

    if( moved )
        cur_state->neitems = moved;

    return moved;
}

/*----------------------------------------------------------------------*/

PRIVATE int kclosure( kernel, closure_items, maxitems, nclose )
STATE *kernel;              /* Kernel state to close.                  */
ITEM  **closure_items;      /* Array into which closure items are put. */
int   maxitems;             /* Size of the closure_items[] array.      */
int   nclose;               /* # of items already in set.              */
{
    /* Adds to the closure set those items from the kernel that will shift to
     * new states (ie. the items with dots somewhere other than the far right).
     */

    int  nitems;
    ITEM *item, **itemp, *citem;

    closure_items += nclose;            /* Correct for existing items */
    maxitems      -= nclose;

    itemp  = kernel->kernel_items;
    nitems = kernel->nkitems;

    while( --nitems >= 0 )
    {
        item = *itemp++;

        if( item->right_of_dot )
        {
            citem               = newitem( item->prod );
            citem->prod         = item->prod;
            citem->dot_posn     = item->dot_posn;
            citem->right_of_dot = item->right_of_dot;
            citem->lookaheads   = dupset( item->lookaheads );

            if( --maxitems < 0 )
                error( FATAL, "Too many closure items in state %d\n",
                                                        kernel->num );
            *closure_items++ = citem;
            ++nclose;
        }
    }
    return nclose;
}

/*----------------------------------------------------------------------*/

PRIVATE int closure( kernel, closure_items, maxitems )
STATE *kernel;              /* Kernel state to close.                  */
ITEM  *closure_items[];     /* Array into which closure items are put. */
int   maxitems;             /* Size of the closure_items[] array.      */
{
    /* Do LR(1) closure on the kernel items array in the input STATE. When
     * finished, closure_items[] will hold the new items. The logic is:
     *
     * (1) for( each kernel item )
     *         do LR(1) closure on that item.
     * (2) while( items were added in the previous step or are added below )
     *         do LR(1) closure on the items that were added.
     */

    int  i;
    int  nclose        = 0;             /* Number of closure items */
    int  did_something = 0;
    ITEM **p           = kernel->kernel_items;

    for( i = kernel->nkitems; --i >= 0 ; )                          /* (1) */
    {
        did_something |= do_close( *p++, closure_items, &nclose, &maxitems );
    }

    while( did_something )                                          /* (2) */
    {
        did_something = 0;
        p = closure_items;
        for( i = nclose ; --i >= 0 ; )
            did_something |= do_close( *p++, closure_items, &nclose, &maxitems );
    }

    return nclose;
}

/*----------------------------------------------------------------------*/

PRIVATE int do_close( item, closure_items, nitems, maxitems )
ITEM *item;
ITEM *closure_items[];  /* (output) Array of items added by closure process */
int  *nitems;           /* (input)  # of items currently in closure_items[] */
                        /* (output) # of items in closure_items after       */
                        /*          processing                              */
int  *maxitems;         /* (input)  max # of items that can be added        */
                        /* (output) input adjusted for newly added items    */
{
    /* Workhorse function used by closure(). Performs LR(1) closure on the
     * input item ([A->b.Cd, e] add [C->x, FIRST(de)]). The new items are added
     * to the closure_items[] array and *nitems and *maxitems are modified to
     * reflect the number of items in the closure set. Return 1 if do_close()
     * did anything, 0 if no items were added (as will be the case if the dot
     * is at the far right of the production or the symbol to the right of the
     * dot is a terminal).
     */

    int        did_something = 0;
    int        rhs_is_nullable;
    PRODUCTION *prod;
    ITEM       *close_item;
    SET        *closure_set;
    SYMBOL     **symp;

    if( !item->right_of_dot )
        return 0;

    if( !ISNONTERM( item->right_of_dot ) )
        return 0;

    closure_set = newset();

    /* The symbol to the right of the dot is a nonterminal. Do the following:
     *
     * (1) for( every production attached to that nonterminal )
     * (2)     if( the current production is not already in the set of
     *                                                     closure items )
     * (3)         add it;
     * (4)     if( the d in [A->b.Cd, e] doesn't exist )
     * (5)         add e to the lookaheads in the closure production.
     *         else
     * (6)         The d in [A->b.Cd, e] does exist, compute FIRST(de) and add
     *             it to the lookaheads for the current item if necessary.
     */
                                                                    /* (1) */
    for( prod = item->right_of_dot->productions; prod ; prod = prod->next )
    {
                                                                    /* (2) */
        if( !(close_item = in_closure_items( prod, closure_items, *nitems )) )
        {
            if( --(*maxitems) <= 0 )
                error( FATAL, "LR(1) Closure set too large\n" );
                                                                    /* (3) */
            closure_items[ (*nitems)++ ] = close_item = newitem( prod );
            ++did_something;
        }

        if( !*(symp = &( item->prod->rhs[ item->dot_posn + 1 ] )) ) /* (4) */
        {
            did_something |= add_lookahead( close_item->lookaheads, /* (5) */
                                            item->lookaheads );
        }
        else
        {
            truncate( closure_set );                                /* (6) */

            rhs_is_nullable = first_rhs( closure_set, symp,
                                item->prod->rhs_len - item->dot_posn - 1 );

            REMOVE( closure_set, EPSILON );

            if( rhs_is_nullable )
                UNION( closure_set, item->lookaheads );

            did_something |= add_lookahead( close_item->lookaheads, closure_set );
        }
    }

    delset( closure_set );
    return did_something;
}
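Steps (4) through (6) of do_close() compute the lookaheads that a closure item inherits: FIRST of the symbols that follow the nonterminal, plus the lookaheads of the item being closed when everything that follows can vanish. The sketch below isolates that computation. It is a simplified illustration under stated assumptions: symbols are small integers, sets are bit masks, and the first[] and nullable[] tables are hypothetical inputs standing in for what first_rhs() derives from the symbol table.

```c
/* Hypothetical set of terminals: one bit per token. */
typedef unsigned long demo_set;

/* Compute the lookaheads for the items added by closing [A->b.Cd, e].
 *
 *   rest[0..nrest-1]  the symbols of d (the part after C)
 *   first[s]          terminals that can begin symbol s
 *   nullable[s]       nonzero if s can derive the empty string
 *   e                 lookahead set of the item being closed
 *
 * The result is FIRST(d), plus e if all of d is nullable (which includes
 * the case where d is empty).
 */
demo_set demo_closure_lookaheads( const int *rest, int nrest,
                                  const demo_set *first, const int *nullable,
                                  demo_set e )
{
    demo_set result = 0;
    int      i;

    for( i = 0; i < nrest; ++i )
    {
        result |= first[ rest[i] ];

        if( !nullable[ rest[i] ] )
            return result;          /* d cannot vanish: e is not visible */
    }
    return result | e;              /* all of d is nullable (or empty)   */
}
```

For example, if d consists of a single nullable nonterminal whose FIRST set is {b} and e is {$}, the closure items receive both b and $ as lookaheads.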

/*----------------------------------------------------------------------*/

PRIVATE ITEM *in_closure_items( production, closure_item, nitems )
ITEM       **closure_item;
PRODUCTION *production;
{
    /* If the indicated production is in the closure_items already, return a
     * pointer to the existing item, otherwise return NULL.
     */

    for(; --nitems >= 0 ; ++closure_item )
        if( (*closure_item)->prod == production )
            return *closure_item;

    return NULL;
}

/*----------------------------------------------------------------------*/

PRIVATE int add_lookahead( dst, src )
SET *dst, *src;
{
    /* Merge the lookaheads in the src and dst sets. If the original src
     * set was empty, or if it was already a subset of the destination set,
     * return 0, otherwise return 1.
     */

    if( !IS_EMPTY( src ) && !subset( dst, src ) )
    {
        UNION( dst, src );
        return 1;
    }

    return 0;
}
Listing 5.30. yystate.c Adding Reductions to Tables
PRIVATE void reductions()
{
    /* Do the reductions. If there's memory, sort the table by state number */
    /* first so that yyout.doc will look nice.                              */

    void addreductions();               /* below */

    Sort_by_number = 1;
    if( !ptab( States, addreductions, NULL, 1 ) )
        ptab( States, addreductions, NULL, 0 );
}

/*----------------------------------------------------------------------*/

PRIVATE void addreductions( state, junk )
STATE *state;
void  *junk;
{
    /* This routine is called for each state. It adds the reductions using the
     * disambiguating rules described in the text, and then prints the state to
     * yyout.doc if Verbose is true. I don't like the idea of doing two things
     * at once, but it makes for nicer output because the error messages will
     * be next to the state that caused the error.
     */

    int  i;
    ITEM **item_p;

    for( i = state->nkitems, item_p = state->kernel_items;  --i >= 0 ; ++item_p )
        reduce_one_item( state, *item_p );

    for( i = state->neitems, item_p = state->epsilon_items; --i >= 0 ; ++item_p )
        reduce_one_item( state, *item_p );

    if( Verbose )
    {
        pstate( state );

        if( state->num % 10 == 0 )
            fprintf( stderr, "%d\r", state->num );
    }
}

/*----------------------------------------------------------------------*/

PRIVATE void reduce_one_item( state, item )
ITEM  *item;                    /* Reduce on this item */
STATE *state;                   /* from this state     */
{
    int token;                  /* Current lookahead                */
    int pprec;                  /* Precedence of production         */
    int tprec;                  /* Precedence of token              */
    int assoc;                  /* Associativity of token           */
    int reduce_by;
    int resolved;               /* True if conflict can be resolved */
    ACT *ap;

    if( item->right_of_dot )            /* No reduction required */
        return;

    pprec = item->prod->prec;           /* precedence of entire production */

    for( next_member(NULL); (token = next_member(item->lookaheads)) >= 0 ;)
    {
        tprec = Precedence[token].level;    /* precedence of lookahead */
        assoc = Precedence[token].assoc;    /* symbol.                 */

        if( !(ap = p_action( state->num, token )) )     /* No conflicts */
        {
            add_action( state->num, token, -(item->prod_num) );
        }
        else if( ap->do_this <= 0 )
        {
            /* Resolve a reduce/reduce conflict in favor of the production */
            /* with the smaller number. Print a warning.                   */

            ++Reduce_reduce;

            reduce_by   = min( -(ap->do_this), item->prod_num );
            ap->do_this = -reduce_by;

            error( WARNING, "State %2d: reduce/reduce conflict "
                            "%d/%d on %s (choose %d).\n",
                                    state->num,
                                    -(ap->do_this), item->prod_num,
                                    token ? Terms[token]->name : "<_EOI_>",
                                    reduce_by );
        }
        else                                    /* Shift/reduce conflict. */
        {
            if( resolved = (pprec && tprec) )
                if( tprec < pprec || (pprec == tprec && assoc != 'r') )
                    ap->do_this = -( item->prod_num );

            if( Verbose > 1 || !resolved )
            {
                ++Shift_reduce;
                error( WARNING, "State %2d: shift/reduce conflict %s/%d"
                                " (choose %s) %s\n",
                                    state->num,
                                    Terms[token]->name,
                                    item->prod_num,
                                    ap->do_this < 0 ? "reduce" : "shift",
                                    resolved ? "(resolved)" : "" );
            }
        }
    }
}
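The shift/reduce branch of reduce_one_item() applies precedence and associativity in the way that yacc-style parser generators conventionally do. The sketch below restates that conventional rule as a standalone function so that it can be examined in isolation. It is an assumption-laden illustration, not a transcription of the listing: demo_resolve() is a hypothetical name, associativity is assumed to be encoded as the character 'l' or 'r', a precedence level of 0 is assumed to mean "none assigned", and nonassociative tokens are not handled.

```c
/* Possible outcomes of a shift/reduce conflict. */
enum demo_choice { DEMO_SHIFT, DEMO_REDUCE };

/* Decide a shift/reduce conflict between a production with precedence level
 * pprec and a lookahead token with precedence level tprec and associativity
 * assoc ('l' or 'r'). A level of 0 means that no precedence was assigned.
 * *resolved is set to 1 if precedence information settled the conflict and
 * to 0 if the default action (shift) was used.
 */
enum demo_choice demo_resolve( int pprec, int tprec, int assoc, int *resolved )
{
    *resolved = (pprec != 0 && tprec != 0);

    /* Reduce if the production binds tighter than the token, or if they
     * bind equally and the token is left associative.
     */
    if( *resolved && (tprec < pprec || (tprec == pprec && assoc != 'r')) )
        return DEMO_REDUCE;

    return DEMO_SHIFT;
}
```

With this rule, an input such as a - b - c reduces a - b before shifting the second minus when the operator is left associative, and shifts instead when it is right associative.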
Listing 5.31. yystate.c Statistics Functions
PUBLIC void lr_stats( fp )
FILE *fp;
{
    /* Print out various statistics about the table-making process */

    fprintf( fp, "%4d LALR(1) states\n", Nstates );
    fprintf( fp, "%4d items\n", Nitems );
    fprintf( fp, "%4d nonerror transitions in tables\n", Ntab_entries );
    fprintf( fp, "%4d/%-4d unfinished items\n", (int)(Next_allocate - Heap),
                                                MAX_UNFINISHED );
    fprintf( fp, "%4d bytes required for LALR(1) transition matrix\n",
                (int)(   (2 * sizeof(char*) * Nstates)      /* index arrays */
                       + Nstates                            /* count fields */
                       + (Npairs * sizeof(short))           /* pairs        */
                     ) );
    fprintf( fp, "\n" );
}

/*----------------------------------------------------------------------*/

PUBLIC int lr_conflicts( fp )
FILE *fp;
{
    /* Print out statistics for the inadequate states and return the number of
     * conflicts.
     */

    fprintf( fp, "%4d shift/reduce conflicts\n",  Shift_reduce  );
    fprintf( fp, "%4d reduce/reduce conflicts\n", Reduce_reduce );
    return Shift_reduce + Reduce_reduce;
}
Listing 5.32. yystate.c Print Functions
#define MAX_TOK_PER_LINE 10
PRIVATE int Tokens_printed;     /* Controls number of lookaheads printed */
                                /* on a single line of yyout.doc.        */
PRIVATE void sprint_tok( bp, format, arg )
char **bp;
char *format;                   /* not used here, but supplied by pset() */
int  arg;
{
    /* Print one nonterminal symbol to a buffer maintained by the
     * calling routine and update the calling routine's pointer.
     */

    if     ( arg == -1      ) *bp += sprintf( *bp, "null "  );
    else if( arg == -2      ) *bp += sprintf( *bp, "empty " );
    else if( arg == _EOI_   ) *bp += sprintf( *bp, "$ "     );
    else if( arg == EPSILON ) *bp += sprintf( *bp, ""       );
    else                      *bp += sprintf( *bp, "%s ", Terms[arg]->name );

    if( ++Tokens_printed >= MAX_TOK_PER_LINE )
    {
        *bp += sprintf( *bp, "\n\t\t" );
        Tokens_printed = 0;
    }
}

PRIVATE char *stritem( item, lookaheads )
ITEM *item;
int  lookaheads;
{
    /* Return a pointer to a string that holds a representation of an item. The
     * lookaheads are printed too if "lookaheads" is true or Verbose is > 1
     * (-V was specified on the command line).
     */

    static char buf[ MAXOBUF * 2 ];
    char        *bp;
    int         i;

    bp  = buf;
    bp += sprintf( bp, "%s->", item->prod->lhs->name );

    if( item->prod->rhs_len <= 0 )
        bp += sprintf( bp, "<epsilon>. " );
    else
    {
        for( i = 0; i < item->prod->rhs_len ; ++i )
        {
            if( i == item->dot_posn )
                *bp++ = '.';

            bp += sprintf( bp, " %s", item->prod->rhs[i]->name );
        }

        if( i == item->dot_posn )
            *bp++ = '.';
    }

    if( lookaheads || Verbose > 1 )
    {
        bp += sprintf( bp, " (production %d, %d)\n\t\t[",
                                item->prod->num, item->prod->prec );
        Tokens_printed = 0;
        pset( item->lookaheads, sprint_tok, &bp );
        *bp++ = ']';
    }

    if( bp - buf >= (MAXOBUF * 2) )
        error( FATAL, "Internal [stritem], buffer overflow\n" );

    *bp = '\0';
    return buf;
}

/*----------------------------------------------------------------------*/

PRIVATE void pstate( state )
STATE *state;
{
    /* Print one row of the parse table in human-readable form to yyout.doc
     * (stderr if -V is specified).
     */

    int  i;
    ITEM **item;
    ACT  *p;

    document( "State %d:\n", state->num );

    /* Print the kernel and epsilon items for the current state. */

    for( i = state->nkitems, item = state->kernel_items ; --i >= 0 ; ++item )
        document( "    %s\n", stritem( *item, (*item)->right_of_dot == 0 ) );

    for( i = state->neitems, item = state->epsilon_items ; --i >= 0 ; ++item )
        document( "    %s\n", stritem( *item, 1 ) );

    document( "\n" );

    /* Print out the next-state transitions, first the actions, */
    /* then the gotos.                                          */

    for( i = 0; i < MINTERM + USED_TERMS; ++i )
    {
        if( p = p_action( state->num, i ) )
        {
            if( p->do_this == 0 )
            {
                if( p->sym == _EOI_ )
                    document( "    Accept on end of input\n" );
                else
                    error( FATAL, "INTERNAL: state %d, Illegal accept",
                                                        state->num );
            }
            else if( p->do_this < 0 )
                document( "    Reduce by %d on %s\n", -(p->do_this),
                                                        Terms[p->sym]->name );
            else
                document( "    Shift to %d on %s\n", p->do_this,
                                                        Terms[p->sym]->name );
        }
    }

    for( i = MINNONTERM; i < MINNONTERM + USED_NONTERMS; i++ )
        if( p = p_goto( state->num, i ) )
            document( "    Goto %d on %s\n", p->do_this,
                                                        Terms[i]->name );
    document( "\n" );
}
PRIVATE void pstate_stdout( state )
STATE *state;
{
    document_to( stdout );
    pstate( state );
    document_to( NULL );
}

/*----------------------------------------------------------------------*/

PRIVATE pclosure( kernel, closure_items, nitems )
STATE  *kernel;
ITEM   **closure_items;
{
    printf( "\n%d items in Closure of ", nitems );
    pstate_stdout( kernel );

    if( nitems > 0 )
    {
        printf( "    --------- closure items\n" );
        while( --nitems >= 0 )
            printf( "    %s\n", stritem( *closure_items++, 0 ) );
    }
}
Listing 5.33. yystate.c: Routines That Create Auxiliary Tables
PRIVATE void make_yy_lhs( prodtab )
PRODUCTION **prodtab;
{
    static char *text[] =
    {
        "The Yy_lhs array is used for reductions. It is indexed by production",
        "number and holds the associated left-hand side adjusted so that the",
        "number can be used as an index into Yy_goto.",
        NULL
    };
    PRODUCTION *prod;
    int        i;

    comment( Output, text );
    output( "YYPRIVATE int Yy_lhs[%d] =\n{\n", Num_productions );

    for( i = 0; i < Num_productions; ++i )
    {
        prod = *prodtab++;
        output( "\t/* %3d */\t%d", prod->num, ADJ_VAL( prod->lhs->val ) );

        if( i != Num_productions-1 )
            output( "," );

        if( i % 3 == 2 || i == Num_productions-1 )   /* use three columns */
            output( "\n" );
    }
    output( "};\n" );
}
/*----------------------------------------------------------------------*/

PRIVATE void make_yy_reduce( prodtab )
PRODUCTION **prodtab;
{
    static char *text[] =
    {
        "The Yy_reduce[] array is indexed by production number and holds",
        "the number of symbols on the right-hand side of the production",
        NULL
    };
    PRODUCTION *prod;
    int        i;

    comment( Output, text );
    output( "YYPRIVATE int Yy_reduce[%d] =\n{\n", Num_productions );

    for( i = 0; i < Num_productions; ++i )
    {
        prod = *prodtab++;
        output( "\t/* %3d */\t%d", prod->num, prod->rhs_len );

        if( i != Num_productions-1 )
            output( "," );

        if( i % 3 == 2 || i == Num_productions-1 )   /* use three columns */
            output( "\n" );
    }
    output( "};\n" );
}

/*----------------------------------------------------------------------*/

PRIVATE make_yy_slhs( prodtab )
PRODUCTION **prodtab;
{
    static char *text[] =
    {
        "Yy_slhs[] is a debugging version of Yy_lhs[]. It is indexed by",
        "production number and evaluates to a string representing the",
        "left-hand side of the production.",
        NULL
    };

    PRODUCTION *prod;
    int        i;

    comment( Output, text );
    output( "YYPRIVATE char *Yy_slhs[%d] =\n{\n", Num_productions );

    for( i = Num_productions; --i >= 0; )
    {
        prod = *prodtab++;
        output( "\t/* %3d */\t\"%s\"", prod->num, prod->lhs->name );
        output( i != 0 ? ",\n" : "\n" );
    }
    output( "};\n" );
}

PRIVATE make_yy_srhs( prodtab )
PRODUCTION **prodtab;
{
    static char *text[] =
    {
        "Yy_srhs[] is also used for debugging. It is indexed by production",
        "number and evaluates to a string representing the right-hand side of",
        "the production.",
        NULL
    };

    PRODUCTION *prod;
    int        i, j;

    comment( Output, text );
    output( "YYPRIVATE char *Yy_srhs[%d] =\n{\n", Num_productions );

    for( i = Num_productions; --i >= 0; )
    {
        prod = *prodtab++;
        output( "\t/* %3d */\t\"", prod->num );

        for( j = 0; j < prod->rhs_len; ++j )
        {
            output( "%s", prod->rhs[j]->name );
            if( j != prod->rhs_len - 1 )
                outc( ' ' );
        }

        output( i != 0 ? "\",\n" : "\"\n" );
    }
    output( "};\n" );
}
/*----------------------------------------------------------------------
 * The following routines generate compressed parse tables. There's currently
 * no way to do uncompressed tables. The default transition is the error
 * transition.
 */

PRIVATE void print_reductions()
{
    /* Output the various tables needed to do reductions */

    PRODUCTION **prodtab;

    if( !(prodtab = (PRODUCTION**) malloc( sizeof(PRODUCTION*) * Num_productions )) )
        error( FATAL, "Not enough memory to output LALR(1) reduction tables\n" );

    ptab( Symtab, mkprod, prodtab, 0 );

    make_yy_lhs   ( prodtab );
    make_yy_reduce( prodtab );

    output( "#ifdef YYDEBUG\n" );
    make_yy_slhs  ( prodtab );
    make_yy_srhs  ( prodtab );
    output( "#endif\n" );

    free( prodtab );
}
/*----------------------------------------------------------------------*/

PRIVATE void mkprod( sym, prodtab )
SYMBOL      *sym;
PRODUCTION  **prodtab;
{
    PRODUCTION *p;

    if( ISNONTERM(sym) )
        for( p = sym->productions; p; p = p->next )
            prodtab[ p->num ] = p;
}
/*----------------------------------------------------------------------*/

PRIVATE void print_tab( table, row_name, col_name, make_private )
ACT   **table;
char  *row_name;      /* Name to use for the row arrays                  */
char  *col_name;      /* Name to use for the row-pointers array          */
int   make_private;   /* Make index table private (rows always private)  */
{
    /* Output the action or goto table. */

    int   i, j;
    ACT   *ele, **elep;            /* table element and pointer to same  */
    ACT   *e, **p;
    int   count;       /* # of transitions from this state, always >0    */
    int   column;
    SET   *redundant = newset();   /* Mark redundant rows                */

    static char *act_text[] =
    {
        "The Yy_action table is action part of the LALR(1) transition",
        "matrix. It's compressed and can be accessed using the yy_next()",
        "subroutine, declared below.",
        "",
        "              Yya000[]={   3,   5,3   2,2   1,1   };",
        "  state number----+        |    | |",
        "  number of pairs in list--+    | |",
        "  input symbol (terminal)-------+ |",
        "  action--------------------------+",
        "",
        "  action = yy_next( Yy_action, cur_state, lookahead_symbol );",
        "",
        "      action <  0   -- Reduce by production n,  n == -action.",
        "      action == 0   -- Accept. (ie. Reduce by production 0.)",
        "      action >  0   -- Shift to state n,  n == action.",
        "      action == YYF -- error.",
        NULL
    };
    static char *goto_text[] =
    {
        "The Yy_goto table is goto part of the LALR(1) transition matrix",
        "It's compressed and can be accessed using the yy_next() subroutine,",
        "declared below.",
        "",
        "  nonterminal = Yy_lhs[ production number by which we just reduced ]",
        "",
        "              Yyg000[]={   3,   5,3   2,2   1,1   };",
        "  uncovered state-+        |    | |",
        "  number of pairs in list--+    | |",
        "  nonterminal-------------------+ |",
        "  goto this state-----------------+",
        "",
        "  goto_state = yy_next( Yy_goto, cur_state, nonterminal );",
        NULL
    };

    comment( Output, table == Actions ? act_text : goto_text );

    /* ------------------------------------------------------------------
     * Modify the matrix so that, if duplicate rows exist, only one
     * copy of each is kept around. The extra rows are marked as such by
     * setting a bit in the "redundant" set. (The memory used for the chains
     * is just discarded.) The redundant table element is made to point at
     * the row that it duplicates.
     */

    for( elep = table, i = 0; i < Nstates; ++elep, ++i )
    {
        if( MEMBER( redundant, i ) )
            continue;

        for( p = elep+1, j = i; ++j < Nstates; ++p )
        {
            if( MEMBER( redundant, j ) )
                continue;

            ele = *elep;    /* pointer to template chain            */
            e   = *p;       /* chain to compare against template    */

            if( !e || !ele )    /* either or both chains have no elements */
                continue;

            for( ; ele && e; ele = ele->next, e = e->next )
                if( (ele->do_this != e->do_this) || (ele->sym != e->sym) )
                    break;

            if( !e && !ele )
            {
                /* Then the chains are the same. Mark the chain being compared
                 * as redundant, and modify table[j] to hold a pointer to the
                 * template pointer.
                 */

                ADD( redundant, j );
                table[j] = (ACT *) elep;
            }
        }
    }

    /* ------------------------------------------------------------------
     * Output the row arrays
     */

    for( elep = table, i = 0; i < Nstates; ++elep, ++i )
    {
        if( !*elep || MEMBER(redundant, i) )
            continue;

        /* Count the number of transitions from this state */

        count = 0;
        for( ele = *elep; ele; ele = ele->next )
            ++count;

        output( "YYPRIVATE YY_TTYPE %s%03d[]={%2d,", row_name, elep-table, count );
                                                                   /* } */
        column = 0;
        for( ele = *elep; ele; ele = ele->next )
        {
            ++Npairs;
            output( "%3d,%-4d", ele->sym, ele->do_this );

            if( ++column != count )
                outc( ',' );

            if( column % 5 == 0 )
                output( "\n\t\t\t   " );
        }
                                                                   /* { */
        output( "};\n" );
    }

    /* ------------------------------------------------------------------
     * Output the index array
     */

    if( make_private )
        output( "\nYYPRIVATE YY_TTYPE *%s[%d] =\n", col_name, Nstates );
    else
        output( "\nYY_TTYPE *%s[%d] =\n", col_name, Nstates );

    output( "{\n/*   0 */  " );                                    /* } */

    for( elep = table, i = 0; i < Nstates; ++i, ++elep )
    {
        if( MEMBER(redundant, i) )
            output( "%s%03d", row_name, (ACT **)(*elep) - table );
        else
            output( *elep ? "%s%03d" : "  NULL", row_name, i );

        if( i != Nstates-1 )
            output( ", " );

        if( i == 0 || (i % 8) == 0 )
            output( "\n/* %3d */  ", i+1 );
    }
                                                                   /* { */
    delset( redundant );        /* Discard the redundant-row set */
    output( "\n};\n" );
}
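
The comment text emitted by print_tab() describes each compressed row as a count followed by that many (symbol, action) pairs, with an index array of row pointers on top. A minimal sketch of a lookup over that layout follows. It is not the occs source: the YY_TTYPE typedef, the YYF value, and the sample rows are all assumptions made so the fragment compiles on its own.

```c
#include <stddef.h>

typedef short YY_TTYPE;     /* assumed element type for the row arrays      */
#define YYF  (-32768)       /* assumed "no transition" marker (hypothetical) */

/* Sample rows in the format described above:
 *     { number of pairs, symbol, action, symbol, action, ... }
 */
static YY_TTYPE Yya000[] = { 3,  5,3,  2,2,  1,1 };
static YY_TTYPE Yya002[] = { 1,  4,-6 };

/* Index array: one row pointer per state, NULL for a state with no
 * transitions. Redundant states would simply share a row pointer.
 */
static YY_TTYPE *Yy_action[] = { Yya000, NULL, Yya002 };

/* Return the action (or goto state) for `symbol` in `cur_state`, or YYF
 * if the row holds no pair for that symbol.
 */
static int yy_next( YY_TTYPE **table, int cur_state, int symbol )
{
    YY_TTYPE *p = table[ cur_state ];
    int      i;

    if( p )
        for( i = (int) *p++; --i >= 0; p += 2 )   /* step over the pairs */
            if( symbol == p[0] )
                return p[1];

    return YYF;                                   /* default: error      */
}
```

With these sample rows, yy_next(Yy_action, 0, 5) yields a shift to state 3, yy_next(Yy_action, 2, 4) yields a reduction by production 6, and any symbol not listed in a row yields YYF. A linear search is reasonable here because the rows are short; the error transition is the default, so it never needs to be stored.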
5.14 Exercises
5.1. (a) Write an algorithm that creates a physical syntax tree from a postfix
representation of an arithmetic expression. For example, the input ab*cd*+ should
create the following tree:
(b) Implement your algorithm.
5.2. (a) Write an augmented, attributed grammar for a bottom-up parser that builds the
syntax tree required for the previous exercise.
(b) Rewrite your grammar to support infix expressions as input rather than
postfix. Parenthesized subexpressions must be supported.
(c) Implement both of the previous grammars using occs.
5.3. Using the LR parse table in Table 5.11 on page 369, show the contents of the
state and symbol stacks as the following input is processed:
1 + ( 2 * 3 ) * 4
5.4. Create bottom-up, LALR(1) parse tables for the following grammar:

     1.  stmt       →  expr SEMI
     2.             |  WHILE LP expr RP stmt
     3.             |  LC stmt_list RC
     4.  stmt_list  →  stmt_list stmt
     5.             |  ε
     6.  expr       →  expr PLUS term
     7.             |  STAR expr
     8.             |  term
     9.             |  ε
    10.  term       →  IDENTIFIER
    11.             |  LP expr RP

This grammar produces one shift/reduce conflict, which should be resolved
according to the following table:
    Input symbol   Token    Precedence   Associativity
    ( )            LP RP    high         none
    *              STAR     medium       right to left
    +              PLUS     low          left to right
    { }            LC RC    none         left to right
    ;              SEMI     none         none

Hand in the tables and graphs showing both the LR and LALR state machines.
5.5. Modify the grammar in the previous exercise to include an assignment operator
(x=y;) and an exponentiation operator (x ^ y is x to the yth power). Assignment
should be the lowest-precedence operator and exponentiation the highest. Make
LALR(1) parse tables for this modified grammar.
5.6. Implement a parser that uses the parse tables created in Exercise 5.4 or 5.5.
5.7. Add code-generation actions to the grammar created in Exercise 5.4 or 5.5. You
may output either assembly language or C.
5.8. Add the code-generation actions from the previous exercise to the parser
implementation created in Exercise 5.6.
5.9. Why are fewer shift/reduce conflicts generated when imbedded actions are
inserted into a production immediately to the right of terminal symbols?
5.10. Think up a real-world situation where a reduce/reduce conflict is inevitable (other
than the one presented earlier).
5.11. A more-efficient method for creating LALR(1) parse tables than the one
presented in the current chapter is described in [Aho], pp. 240-244. An LR(0)
state machine is created and then lookaheads are added to it directly. You can
see the basis of the method by looking at Figure 5.8 on page 366. Lookaheads
appear on the tree in one of two ways: either they propagate down the tree from a
parent to a child or they appear spontaneously. A lookahead character
propagates from the item [s→α . x β, c] to its child when FIRST(β) is nullable. It
appears spontaneously when FIRST(β) contains only terminal symbols. Modify
occs to use this method rather than the one presented in the current chapter. Is
the method in [Aho] actually more efficient once implementation details are
considered? Why or why not?
5.12. An even more efficient method for creating LALR(1) parse tables, developed by
DeRemer and Pennello, is described in [DeRemer79] and [DeRemer82]. It is
summarized in [Tremblay], pp. 375-383. Modify occs to use this table-creation
method.
5.13. LLama and occs currently support optional and repeating productions,
indicated with brackets:

    s : a [optional] b
      | c [repeating zero or more times]* d
      ;

These are implemented in the start_opt() and end_opt() subroutines
presented in Chapter 4, and the transformations involved are discussed in
Appendix E. Occs uses left-recursive lists and LLama uses right-recursive lists for
these transformations.
(a) Add a + operator to occs and LLama that can be used in place of the star in a
repeating production. It causes the enclosed sub-production to repeat one or
more times.
(b) Add the operator [...](a,b)* for a to b repetitions of the
sub-production, where a and b are decimal numbers.
(c) Add the [...]*> and [...]+> operators to occs. These should work just
like [...]* and [...]+, but should use right-recursive rather than
left-recursive list productions.

6 Code Generation
Having looked at how parsers are created, it's now time to look at the real heart of
the compiler: the code-generation phase. We've actually been looking at code
generation in a backhanded way. LeX, LLama, and occs are all compilers, after all, and
they do generate code of sorts: the tables and drivers. This chapter looks at more
conventional code generation and at the run-time environment by building a C compiler.
I'm assuming that you can generalize from a description of a specific implementation
once you understand the design issues. The grammar used here is summarized in
Appendix C and is discussed piecemeal in the current chapter as it's used.
Because a production-quality, full ANSI compiler would both take up too much space
and actually obscure some code-generation issues, I've implemented a proper subset of
ANSI C. Most of the hard stuff is implemented, but I've omitted some functionality that
you can add on your own if you're interested.1 The modifications are left as exercises at
the end of the chapter. Note that these limitations exist only at the semantic level. The
grammar recognizes ANSI C, but the compiler ignores a few ANSI constructs when they
are encountered. The compiler has the following limitations:

• Floating point is not supported.
• The auto, const, volatile, and register keywords are ignored.
• Structures do not form lvalues. You can't pass or return them by value or assign to a
  structure.
• Compile-time initializations work only on nonaggregate global variables like this:

      int x = 5;

  Arrays, structures, and local variables of any sort may not be initialized at compile
  time.
• Strong type checking on enumerated types is not supported. An enumerated type is
  treated just like an int, and any integer value can be assigned to a variable of that
  type, even if the value was declared as part of a different enumerated type. The
  elements of the enumeration list are treated just as a macro would be: each one is
  replaced by an int constant wherever it's found. The following declaration:

      enum tag { Beaver, Wally, June, Ward } Cleaver;

  is treated like this:

      #define Beaver 0
      #define Wally  1
      #define June   2
      #define Ward   3
      int Cleaver;

• Bit fields are ignored. The associated structure field is treated as if the bit field
  wasn't present in the input.
• Function prototypes are treated as simple extern declarations. The argument or
  type list is ignored. The new C++-style function-definition syntax is supported,
  however.
• One nonstandard escape sequence is supported in character constants: \^C
  (backslash-caret-letter) evaluates to the associated control character. It cannot be
  used in a string constant.

1. Adding some of the missing stuff, like structure lvalues, floating point, and proper
   initializations, is a nontrivial enterprise. Other omissions, like bit fields, are easy.

A simple, one-pass compiler is developed in this chapter: the compiler generates
assembly language directly rather than creating a true intermediate language that is
processed by a back end. I've taken this approach because it represents something of a
worst-case scenario, and so makes a better example. The chapter starts out discussing
the run-time environment and how a typical machine is used by the generated code.
Intermediate languages are discussed briefly, and an output assembly language with a
C-like syntax is developed. Internal data structures such as the symbol table are
discussed, and then the compiler is developed.
One failing of the current chapter is that the recursive nature of the grammar
precludes a strictly hierarchical discussion of the code-generation actions. As a
consequence, there's a certain amount of skipping around in the grammar in the later
sections. I've tried to minimize this as much as possible, but you may have to read the
chapter more than once before the overall structure of the compiler becomes apparent.
The entire grammar is listed hierarchically in Appendix C, and it will be worthwhile for
you to study the grammar thoroughly before proceeding to the discussion of the
code-generation issues.
As a final disclaimer, compilers are complicated programs and are notoriously
difficult to debug. I don't expect the current compiler to be any different in this respect.
The compiler has been tested as well as I am able, but there's nothing like having several
thousand people reading your code to show up bugs that you never dreamed existed.
The electronically distributed version will always be the most up-to-date, and bug fixes
will be posted to USENET on a regular basis. Please notify me when you find a bug so I
can tell everyone else about it (either c/o Software Engineering, or electronically; the
addresses are in the Preface).
You must be thoroughly familiar with the bottom-up parse process described in the
last chapter before continuing.
6.1 Intermediate Languages
Compilers often generate intermediate languages rather than translate an input file
directly to binary. You can look at an intermediate language as a model assembly
language, optimized for a nonexistent, but ideal, computer called a virtual machine. The
compiler's output can be tested by simulating this virtual machine with a computer
program. There are several advantages to an intermediate-language approach. First, you
can design an intermediate language with ease of optimization in mind, and thereby
improve the resultant binary image because it can be more heavily optimized. Next, the
intermediate-to-binary translation is usually done by a separate compilation pass called a
back end, and you can provide several back ends for different target machines, all of
which use the same parser and code generator (called the front end). By the same token,
you can provide several front ends for different languages, all of which generate the
same intermediate language and all of which share the same back end to generate real
code.2
Virtual machines typically have many registers, very orthogonal instruction sets (all
instructions can be performed on all registers, the syntax for register operations is the
same as the memory-access syntax), and so forth. There are trade-offs, though. The
ideal machine hardly ever maps to a real machine in an efficient way, so the generated
code is typically larger and less efficient than it would be if the parser knew about the
actual target machine. The ability to mix and match front and back ends compensates to
some extent for the inefficiencies.
Intermediate languages typically take one of three forms. These are triples, quads
(short for quadruples), and postfix (reverse-Polish) notation.
Most real assembly languages are made up of triples, which are made up of three
parts: an operator, a source, and a destination or target. For example, the 68000
instruction ADD.W D0,D1 is a triple that adds the contents of the D0 and D1 registers
and puts the result in D1. A C representation of a similar triple would be:

    d += s;

and a typical mathematical representation is:

    (+=, d, s)

Triples are sometimes called triplets or 3-tuples. They are also sometimes called
two-address instructions because the binary representation of most instructions
comprises an operator and source and destination addresses.
Quads, also called quadruples or three-address instructions, have four parts. The
following quad has an operator, two sources, and a destination:

    d = s1 + s2;

A more mathematical representation of the same quad would be:

    (+, d, s1, s2)

Note that, in spite of the name, some quads, such as an assignment quad, have only
three parts, with an empty fourth field. This empty element is usually marked with a
dash:

    (=, d, s, -)

The first field defines the operation. Since it's never empty, a dash there signifies
subtraction.
Not all quads and triples involve implicit assignments. The first of the following
triples compares two numbers and remembers the result of the comparison internally. The
second triple, which branches to a specific label, is executed only if the previous
comparison was true. Neither instruction involves an assignment.

    (LESS_THAN, a, b)
    (GOTO, target, -)

Arithmetic operations may not involve assignment either. The following two triples
execute A=B+C. The first one does the addition and stores the result in an internal
register called the accumulator. The second assigns the result of the previous
operation (the .-1) to A.

    (+, B, C)
    (=, A, .-1)

The dot represents the position of the current instruction: .-1 references the previous
instruction, .-2 the instruction before that, and so on.

2. For example, the Microsoft Pascal, FORTRAN, and C compilers all share the same back end.
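
The position-relative references can be modeled directly. The following is only a sketch, not code from the compiler developed in this chapter: the struct layout, the enum names, and the idea of keeping one saved result per executed triple (rather than a single accumulator) are all assumptions made for illustration.

```c
#include <stddef.h>

enum op   { OP_ADD, OP_ASSIGN };
enum kind { VAR, BACKREF };         /* a variable, or a ".-n" reference  */

struct operand { enum kind kind; int n; };   /* n: variable index, or n in ".-n" */
struct triple  { enum op op; struct operand a, b; };

/* Fetch an operand. results[k] holds the value produced by triple k, so
 * ".-n" seen at position pc is results[pc - n]. The caller is assumed
 * never to reference a triple before the start of the list.
 */
static int fetch( struct operand o, const int *vars, const int *results, size_t pc )
{
    return o.kind == VAR ? vars[ o.n ] : results[ pc - (size_t)o.n ];
}

/* Execute count triples. results must have room for count values. */
static void run_triples( const struct triple *code, size_t count,
                         int *vars, int *results )
{
    size_t pc;

    for( pc = 0; pc < count; ++pc )
    {
        const struct triple *t = &code[ pc ];

        switch( t->op )
        {
        case OP_ADD:        /* (+, a, b): no assignment, result is only saved */
            results[pc] = fetch( t->a, vars, results, pc )
                        + fetch( t->b, vars, results, pc );
            break;

        case OP_ASSIGN:     /* (=, a, b): a must name a variable */
            vars[ t->a.n ] = fetch( t->b, vars, results, pc );
            results[pc]    = vars[ t->a.n ];
            break;
        }
    }
}
```

Encoding A=B+C takes two entries: { OP_ADD, {VAR,B}, {VAR,C} } followed by { OP_ASSIGN, {VAR,A}, {BACKREF,1} }. The sketch also makes the position dependence visible: moving the second triple away from the first changes what .-1 refers to.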
Triples have one advantage over quads: they're very close in structure to many real
assembly languages, so they can ease code generation for these assemblers. I'm using
them in the compiler generated in this chapter for this reason. Quads have two
advantages over triples. They tend to be more compact; a quad like (+, d, s1, s2)
requires two triples to do the same work:

    (=,  d, s1)
    (+=, d, s2)

Also, certain optimizations are easier to do on quads because triples are more position
dependent. In the previous example, the two triples must be treated as a single unit by
any optimizations that move the code around, and the ordering of the two triples is
important. Since it's self-contained, the quad can be moved more easily.
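
The expansion of one quad into two triples is mechanical, as the following sketch suggests. The struct names (three_addr, two_addr) and the string representation of operators and operands are hypothetical, chosen only to keep the fragment short.

```c
#include <stdio.h>

struct three_addr { const char *op, *dest, *src1, *src2; };   /* a quad   */
struct two_addr   { const char *op, *dest, *src; };           /* a triple */

/* Expand (op, d, s1, s2) into (=, d, s1) followed by (op=, d, s2).
 * opbuf receives the compound operator string, such as "+=", and must
 * outlive the output triples. The expansion is only valid when d and s2
 * are different locations: otherwise the first triple destroys s2 before
 * the second one reads it.
 */
static void quad_to_triples( const struct three_addr *q, struct two_addr out[2],
                             char *opbuf, size_t opbuf_size )
{
    snprintf( opbuf, opbuf_size, "%s=", q->op );

    out[0].op = "=";    out[0].dest = q->dest;  out[0].src = q->src1;
    out[1].op = opbuf;  out[1].dest = q->dest;  out[1].src = q->src2;
}
```

The two output entries must stay adjacent and in order, which is exactly the restriction on code motion described above; the original quad carries no such constraint.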
The third kind of common intermediate language is postfix or Reverse-Polish
Notation (RPN). Forth, PostScript, and Hewlett-Packard calculators are examples of
postfix languages. This representation has several advantages. The first is that
expressions can be evaluated with less work than usual: no parentheses are needed to
represent them, and the compiler doesn't need to allocate temporary variables to
evaluate postfix expressions; it uses a run-time stack instead. All operands are pushed
onto the stack and all operators affect the top few stack elements. For example, an
expression like this:

    (1+2) * (3+4)

is represented in postfix as follows:

    1 2 + 3 4 + *

and is evaluated as shown in Table 6.1.
Postfix languages often have various operators designed specifically for stack
operations, such as a dup operator that duplicates and pushes the top-of-stack element.
X² can be represented with:

    X dup *

Another common stack operator is the swap operator, which swaps the two elements at
the top of stack.
RPN has one other advantage: It's easy to reconstruct the original syntax tree from a
postfix representation of the input language. (We'll look at this process further in
Chapter Seven.) Since some optimizations require the production of a syntax tree, a
postfix intermediate language is a convenient way to transfer a compiled program to the
optimizer.
Table 6.1. Postfix Evaluation of 1 2 + 3 4 + *

    stack     input            comments
    empty     1 2 + 3 4 + *    push 1
    1         2 + 3 4 + *      push 2
    1 2       + 3 4 + *        add the two items at top of stack and replace them
                               with the result.
    3         3 4 + *          push 3
    3 3       4 + *            push 4
    3 3 4     + *              add the two items at top of stack and replace them
                               with the result.
    3 7       *                multiply the two items at top of stack and replace
                               them with the result.
    21                         the result is at top of stack.
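
The evaluation in Table 6.1 can be sketched in a few lines of C. This is an illustration only, not part of the compiler: the function name eval_postfix, the fixed stack depth, and the restriction to non-negative integers with the +, *, dup, and swap operators are all assumptions.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define STACK_MAX 64        /* arbitrary depth for this sketch */

/* Evaluate a postfix expression made of non-negative integers and the
 * operators + * dup swap, separated by white space. Returns 0 and stores
 * the single remaining stack item in *result on success; returns -1 if
 * the expression is malformed or the stack would overflow.
 */
static int eval_postfix( const char *s, long *result )
{
    long stack[ STACK_MAX ];
    int  sp = 0;                        /* number of items on the stack */

    while( *s )
    {
        if( isspace( (unsigned char)*s ) )
        {
            ++s;
        }
        else if( isdigit( (unsigned char)*s ) )     /* operand: push it */
        {
            char *end;
            if( sp >= STACK_MAX )
                return -1;
            stack[ sp++ ] = strtol( s, &end, 10 );
            s = end;
        }
        else if( *s == '+' || *s == '*' )   /* replace top two with result */
        {
            long rhs;
            if( sp < 2 )
                return -1;
            rhs = stack[ --sp ];
            stack[sp-1] = (*s == '+') ? stack[sp-1] + rhs : stack[sp-1] * rhs;
            ++s;
        }
        else if( strncmp( s, "dup", 3 ) == 0 )      /* duplicate top item */
        {
            if( sp < 1 || sp >= STACK_MAX )
                return -1;
            stack[sp] = stack[sp-1];
            ++sp;
            s += 3;
        }
        else if( strncmp( s, "swap", 4 ) == 0 )     /* exchange top two   */
        {
            long tmp;
            if( sp < 2 )
                return -1;
            tmp = stack[sp-1];  stack[sp-1] = stack[sp-2];  stack[sp-2] = tmp;
            s += 4;
        }
        else
            return -1;
    }

    if( sp != 1 )
        return -1;

    *result = stack[0];
    return 0;
}
```

Calling eval_postfix("1 2 + 3 4 + *", &r) follows the rows of Table 6.1 and leaves 21 in r; "7 dup *" squares its operand. No temporary variables are ever named: the stack holds every intermediate value.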
6.2 C-code: An Intermediate Language and Virtual Machine
The syntax of an intermediate language is pretty much arbitrary. Ideally it reflects
the characteristics of the most likely target machines. For purposes of clarity, however,
I've decided to use a C-like intermediate language in the compiler created later in this
chapter. I've defined a C subset in which all instructions have direct analogs in most
assembly languages: all the instructions translate directly into a small number of
machine instructions (usually one). This way, the assembly-language syntax will be
familiar to you, regardless of your background. In addition, you can use your C compiler
and normal debugging tools to exercise the output from the compiler. The code is simple
enough so that a translation to assembly language is very straightforward. I've dubbed
this intermediate language C-code.
C-code is really more of an assembly language than a true intermediate language.
The main problem is that it is very restrictive about things like word width, alignment,
and storage classes. It forces the compiler itself to worry about details that are usually
handled by the back end and, as a consequence, makes a back end harder to write
because the back end must occasionally undo some of the things that the compiler has
done (like add padding to local variables to get proper alignment).
The current section describes C-code in considerable depth. Though the description
is, by necessity, full of specific implementation details, reading it should give you a good
idea of the sorts of things that are required in a more general-purpose intermediate
language. The current section also serves as a review of assembly-language concepts.
The C-code description is pretty terse in places, and a previous familiarity with a real
assembly language will help considerably in the following discussion. Finally, I've also
used the current section as a vehicle for discussing the memory organization and
subroutine-linkage procedures that are almost universal among C compilers, so you
should read through it even if you're very familiar with assembler.
The C-code virtual machine (the hypothetical machine that would run C-code as its
assembly language) is modeled by a series of macros in a file called <tools/virtual.h>,
which should be #included at the top of every C-code file. This file contains
definitions that allow you to compile and run the generated C-code, using your normal C
compiler as an assembler. Virtual.h is described piecemeal as the various C-code
language elements are discussed. Unfortunately, the code in virtual.h is not particularly
portable. I've made all sorts of assumptions about the sizes of various data types, the
ordering of bytes within a word, and so forth. You can compensate for these
system-dependent problems by modifying the include file, however.

3. For convenience, all the C-code directives described in this section are also summarized in Appendix F.
A typical development cycle looks like this:
    vi file.c       Edit the C source-code file.
    c file.c        Run the source code through the compiler developed in the
                    current chapter. The compiler generates a C-code output file
                    called output.c.
    cc output.c     Assemble the compiler's output, using your normal compiler
                    as an assembler. You can debug the compiler's output using
                    dbx, CodeView, or whatever debugger you normally use.
Table 6.2 shows how a C input file is translated into C-code by the compiler in the
current chapter. The rightmost column demonstrates how the C-code relates to 8086
assembler. Table 6.2 is intended to demonstrate what's happening in a general way.
Don't worry about the details; I'll discuss what's going on in great detail as the chapter
progresses. For now, notice how similar the C-code is to the true assembly language.
It's easy to translate from one to another. C-code is really an assembly language,
regardless of the superficial similarities to C.
6.2.1 Names and White Space
As in C, white space is ignored in C-code except as needed to separate tokens.
Comments delimited with /* ... */ are treated as white space. Multiple-line comments are
not permitted: the /* and */ must be on the same line.
Identifiers can be made up of letters, digits, and underscores only. The first character
in the name may not be a digit, and names are restricted to 31 characters. Names are
truncated if they're longer. In addition to the standard C keywords, all of the C-code
directives discussed below should be treated as keywords and may not be used for
identifiers.
6.2.2 Basic Types
The primary design considerations in the virtual machine are controlled by the sizes
of the variables that are manipulated by the machine. Four basic data types are
supported: 8-bit bytes, 16-bit words, 32-bit long words, and generic pointers (nominally 32
bits). These types are declared with the byte, word, lword, and ptr keywords,
respectively. The array and record keywords can be used as synonyms for byte in order
to provide some self-documentation when declaring a structure or array of structures.
All but ptrs are signed quantities.
Listing 6.1 shows the type definitions from virtual.h. I'm making nonportable
assumptions about word widths here; you may have to change these definitions in your
own system. The ptr type is a character pointer, as compared to a void pointer, so that
it can be incremented if necessary.
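One way to reduce the dependence on word widths, sketched here under the assumption that a C99 compiler with <stdint.h> is available (the book's own virtual.h does not use it), is to build the basic types from exact-width integers. The function name widths_match_c_code is hypothetical and exists only to make the check visible.

```c
#include <stdint.h>

/* Hypothetical variants of the basic types, built on exact-width
 * integers instead of char, short, and long.
 */
typedef int8_t   byte;    /*  8 bit, signed */
typedef int16_t  word;    /* 16 bit, signed */
typedef int32_t  lword;   /* 32 bit, signed */
typedef char    *ptr;     /* host pointer; not necessarily 32 bits wide */

/* Returns 1 if the three integer types have the widths C-code expects. */
int widths_match_c_code(void)
{
    return sizeof(byte) == 1 && sizeof(word) == 2 && sizeof(lword) == 4;
}
```

The ptr type stays a character pointer, so it is still as wide as the host's pointers; only the three integer types become fixed.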
The <tools/c-code.h> file, which is #included on line one of Listing 6.1, contains
various definitions that control the widths of basic types. It is shown in Listing 6.2. This
information has been split into a second header file because it is likely to be used by the
compiler that is generating C-code, but is not much use in interpreting the C-code.
Table 6.2. Translating C to C-code to 8086 Assembly Language

C Input                   C-Code Output                               8086 Assembler
------------------------  ------------------------------------------  ------------------------------
                          #include <tools/virtual.h>
                          #define T(x)
                          SEG(bss)                                    BSS    SEGMENT WORD PUBLIC
                          #define L0 1     /* strcpy: locals */       BSS    ENDS
                          #define L1 2     /* strcpy: temps.  */
                          #undef  T
                          #define T(n) (fp-L0-(n*4))
                          SEG(code)                                   TEXT   SEGMENT WORD PUBLIC
                                                                             ASSUME CS:TEXT
strcpy( dst, src )        PROC(strcpy,public)                                PUBLIC strcpy
                                                                      strcpy PROC NEAR
char *dst, *src;          /* fp+4 = dst [argument] */                 ; [bp+2] = dst
                          /* fp+8 = src [argument] */                 ; [bp+4] = src
{                         link(L0+L1);                                push bp
                                                                      mov  bp,sp
                                                                      sub  sp,6
    char *start = dst;    /* fp-4 = start [variable] */               ; [bp-2] = start
                          BP(fp-4) = BP(fp+4);                        mov  ax,WORD PTR [bp+2]
                                                                      mov  WORD PTR [bp-2],ax
    while( *src )         TST1:                                       TST1:
    {                     EQ(*BP(fp+8),0)                             mov  bx,WORD PTR [bp+4]
                              goto EXIT1;                             mov  al,BYTE PTR [bx]
                                                                      or   al,al
                                                                      jz   EXIT1
        *dst++ = *src++;  BP(T(1))  = BP(fp+4);    /* t1=dst   */     mov  si,WORD PTR [bp+2]
                          BP(fp+4) += 1;           /* dst++    */     inc  WORD PTR [bp+2]
                          BP(T(2))  = BP(fp+8);    /* t2=src   */     mov  di,WORD PTR [bp+4]
                          BP(fp+8) += 1;           /* src++    */     inc  WORD PTR [bp+4]
                          *BP(T(1)) = *BP(T(2));   /* *t1=*t2  */     mov  bx,di
                                                                      mov  al,BYTE PTR [bx]
                                                                      mov  bx,si
                                                                      mov  BYTE PTR [bx],al
    }                     goto TST1;                                  jmp  TST1
                          EXIT1:                                      EXIT1:
    return start;         rF.pp = BP(fp-4);                           mov  bx,WORD PTR [bp-4]
                          goto RET1;                                  jmp  RET1
}                         RET1:                                       RET1:
                          unlink();                                   mov  sp,bp
                          ret();
                          ENDP(strcpy)                                strcpy ENDP
                                                                      TEXT   ENDS
                                                                      END
6.2.3 The Virtual Machine: Registers, Stack, and Memory
The C-code virtual machine is pictured in Figure 6.1. The machine has a set of 16
general-purpose registers named r0, r1, r2, and so forth. An entire 32-bit register can be
accessed from any instruction, as can either of the 16-bit words or any of the four 8-bit bytes
that comprise the same register. These registers are memory locations that are physically
part of the CPU itself; they don't have addresses. Use the syntax shown in Table 6.3 to
access a register. The name must always be fully qualified; one of the forms shown in
Table 6.3 must always be used. The register name by itself (without the dot and type
reference) is illegal.
Listing 6.1. virtual.h: Basic Types
 1  #include <tools/c-code.h>
 2                                  /* Basic types          */
 3  typedef char    byte;           /*  8 bit               */
 4  typedef short   word;           /* 16 bit               */
 5  typedef long    lword;          /* 32 bit               */
 6  typedef char    *ptr;           /* Nominally 32 bit.    */
 7
 8  typedef byte    array;          /* Aliases for "byte."  */
 9  typedef byte    record;
Listing 6.2. c-code.h: Various Widths
 1  #define BYTE_WIDTH      1               /* Widths of the basic types. */
 2  #define WORD_WIDTH      2
 3  #define LWORD_WIDTH     4
 4  #define PTR_WIDTH       4
 5
 6  #define BYTE_HIGH_BIT   "0xff80"        /* High-bit mask. */
 7  #define WORD_HIGH_BIT   "0x8000"
 8  #define LWORD_HIGH_BIT  "0x80000000L"
In addition to the register set, there is a 1024-element, 32-bit wide stack, and two
special-purpose registers that point into the stack: the fp and sp registers, discussed
in depth below.
Finally, there is the ip or instruction-pointer register. This register holds the address
(in the code segment) of the next instruction to execute, not of the current instruction.
It is updated every time an instruction is processed, and is modified indirectly by various
instructions. A call, for example, pushes the ip and then transfers control to
somewhere else. A ret pops the address at top of stack into the ip. A goto modifies the ip
directly. The instruction pointer is not used directly by the current compiler, and its
value won't change if you just use your C compiler to assemble the C-code output.
Access to it is occasionally useful, however, and the register is available in all real
machines. It's included here for completeness' sake.
The register set and stack are implemented in Listings 6.3 and 6.4. Note that the
definition for reg on lines 21 to 28 of Listing 6.4 is not portable because I'm assuming
that an 8086-style byte ordering is used: the least-significant byte is lowest in memory,
and the most-significant byte is highest. Exactly the opposite holds in 68000-style
machines. The LSB is highest in memory, so you'd have to redefine the words and
bytes fields as follows for those machines:
    struct words { word high, low; };
    struct bytes { byte b3, b2, b1, b0; };
The address of an object is the physical address of the least-significant byte in both the
8086 and 68000 family. Other machines may require even more shuffling.
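The byte-order dependence can be observed directly. The sketch below is an illustration only: it rebuilds the reg union with exact-width types from <stdint.h> (an assumption; the book uses char, short, and long), and the two function names are hypothetical.

```c
#include <stdint.h>

struct words { int16_t low, high; };          /* 8086-style field order */
struct bytes { int8_t  b0, b1, b2, b3; };

typedef union reg
{
    char         *pp;
    int32_t       l;
    struct words  w;
    struct bytes  b;
} reg;

/* Returns 1 when the host stores the least-significant byte first,
 * which is the ordering that the field order above assumes.
 */
int host_is_little_endian(void)
{
    reg r;
    r.l = 1;
    return r.b.b0 == 1;
}

/* Returns the 16 bits seen through the w.low field after storing value
 * in the l field. The result is the true low word only on a
 * little-endian host; elsewhere it is the high word.
 */
int16_t low_word_via_union(int32_t value)
{
    reg r;
    r.l = value;
    return r.w.low;
}
```

On a host where host_is_little_endian() returns 0, the fields would have to be reordered as described above before low_word_via_union() returned the low word.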
Figure 6.1. The C-code Virtual Machine
[Figure: sixteen 32-bit registers, r0 through rF, each divided into a high and a low word
and into bytes b3, b2, b1, and b0; the stack[] array with sp pointing into it; the text,
data, and bss segments, with ip pointing into the text segment.]
Table 6.3. The Virtual-Machine Register Set
r1 contains:                        access syntax:
pointer                             r1.pp
32-bit long word                    r1.l
two 16-bit words                    r1.w.high  r1.w.low
four 8-bit bytes (byte 0 is low)    r1.b.b3  r1.b.b2  r1.b.b1  r1.b.b0
Listing 6.3. c-code.h: Stack-size Definitions
 9  #define SWIDTH  LWORD_WIDTH     /* Stack width (in bytes).       */
10  #define SDEPTH  1024            /* Number of elements in stack.  */
Listing 6.4. virtual.h: The Register Set and Stack
10  #ifdef ALLOC
11  #    define I(x)    x
12  #    define CLASS   /* empty */
13  #else
14  #    define I(x)    /* empty */
15  #    define CLASS   extern
16  #endif
17
18  struct words { word low, high; };
19  struct bytes { byte b0, b1, b2, b3; };     /* b0 is LSB, b3 is MSB */
20
21  typedef union reg
22  {
23      char            *pp;                   /* pointer           */
24      lword           l;                     /* long word         */
25      struct words    w;                     /* two 16-bit words  */
26      struct bytes    b;                     /* four 8-bit bytes  */
27  }
28  reg;
29
30  CLASS reg r0, r1, r2, r3, r4, r5, r6, r7;  /* Registers */
31  CLASS reg r8, r9, rA, rB, rC, rD, rE, rF;
32
33  CLASS reg stack[ SDEPTH ];                 /* run-time stack */
34  CLASS reg *___sp I(= &stack[ SDEPTH ]);    /* Stack pointer  */
35  CLASS reg *___fp I(= &stack[ SDEPTH ]);    /* Frame pointer  */
36
37  #define fp ((char *) ___fp)
38  #define sp ((char *) ___sp)
Note that the stack itself is declared as an array of reg unions on line 33 of Listing
6.4, and the stack and frame pointers are reg pointers. The sp and fp registers are
referenced in C-code directives using the macros on lines 37 and 38, however. The cast
assures that pointer arithmetic is defeated in the based addressing modes discussed
below.
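The effect of that cast can be checked with a short sketch. The union here is a simplified stand-in for reg, and the function names are hypothetical; the point is only that adding 1 to a reg pointer moves by a whole element, while adding 1 to the same address cast to char * moves by one byte.

```c
#include <stddef.h>

typedef union reg { char *pp; long l; } reg;   /* simplified stand-in */

static reg stack[4];

/* Distance in bytes covered by adding 1 to a reg pointer. */
ptrdiff_t step_of_reg_pointer(void)
{
    reg *p = &stack[0];
    return (char *)(p + 1) - (char *)p;
}

/* Distance in bytes covered by adding 1 after the cast to char *. */
ptrdiff_t step_of_char_pointer(void)
{
    char *p = (char *)&stack[0];
    return (p + 1) - p;
}
```

With the cast in place, an expression such as fp+4 means four bytes above fp, which is what a byte-addressed machine would do.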
The stack is deliberately made as wide as the worst-case basic type. If a 64-bit
double were supported, I would have made the stack 64 bits wide. This way, all stack
access can be done with a single instruction. I've done this in order to make the back
end's life a little easier. It's a simple matter to translate single push and pop instructions
into multiple instructions (if the target machine's stack is 16 bits wide, for example). It's
difficult to go in the other direction, however: to translate multiple pushes and pops into
single instructions.
You must put the following two lines into only one module of your program to
actually allocate space for the stack and register set:
    #define ALLOC
    #include <tools/virtual.h>
You can create a two-line file, compile it, and link it to your other code if you like.
When ALLOC is defined, the definitions on lines 11 and 12 are activated. Here, CLASS
evaluates to an empty string, so all invocations of the CLASS macro effectively disappear
from the file. The I() macro evaluates to its argument, so an invocation like
    I(= &stack[SDEPTH]);
evaluates to
    = &stack[SDEPTH];
The macro's argument is the entire string = &stack[SDEPTH]. When ALLOC is not
defined, the opposite situation holds: CLASS expands to the keyword extern and the
contents of the I() macro are discarded.
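The CLASS and I() mechanism can be tried in isolation. The following is a minimal single-file sketch of the same pattern; the variable demo_depth and the function get_demo_depth are hypothetical names used only for illustration.

```c
#define ALLOC                 /* remove this line to get declarations only */

#ifdef ALLOC
#   define I(x)   x           /* keep the initializer       */
#   define CLASS              /* no storage-class keyword   */
#else
#   define I(x)               /* drop the initializer       */
#   define CLASS  extern
#endif

/* With ALLOC defined this expands to:     int demo_depth = 1024;
 * Without it, it expands to:              extern int demo_depth ;
 */
CLASS int demo_depth I(= 1024);

int get_demo_depth(void) { return demo_depth; }
```

In a multi-file program the #ifdef block and the CLASS line would sit in a shared header, and exactly one source file would define ALLOC before including it.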
The run-time stack, stack pointer, and frame pointer registers are defined at the
bottom of Listing 6.4. The stack is the same width as a register, and the stack pointer is
initialized to point just above the top stack element. It grows towards low memory with the
first push.
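A stack that starts just above its top element and grows toward low memory can be sketched as follows. This is an illustration, not the C-code push and pop directives themselves: the element type is long rather than reg, the depth is shortened, and the overflow and underflow checks are additions.

```c
#define SDEPTH 8                        /* small depth, for illustration */

static long  stack[SDEPTH];
static long *sp = &stack[SDEPTH];       /* just above the top element */

/* Push: predecrement, so the first push lands in stack[SDEPTH-1].
 * Returns 0 if the stack is full.
 */
int push(long value)
{
    if (sp == stack)
        return 0;
    *--sp = value;
    return 1;
}

/* Pop: postincrement, so sp moves back toward high memory.
 * Returns 0 if the stack is empty.
 */
int pop(long *value)
{
    if (sp == &stack[SDEPTH])
        return 0;
    *value = *sp++;
    return 1;
}

/* Number of elements currently on the stack. */
long depth(void) { return (long)(&stack[SDEPTH] - sp); }
```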
6.2.4 Memory Organization: Segments
The stack represents only a portion of the memory that can be used by the running
program. This memory is partitioned into several segments, one of which is the stack
segment. Figure 6.2 shows how the segments are usually arranged.
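Figure 6.2. Segments
[Figure: the segments stacked from the top of the image downward: heap, stack, bss
(uninitialized data), data (initialized data), text (code), and prefix. Labels mark the
initial stack pointer and the part of the image that is stored on disk.]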
The rationale behind this partitioning is best understood by looking at how a program
is loaded by the operating system. Initially, a contiguous region of memory is allocated
for the program to use. This region can be physical memory, as it is in most
microcomputers, or can be virtual memory, as on a mainframe. In the latter case, the physical
memory might not form a contiguous block, but it's convenient to look at it that way.
The prefix segment holds information that is used by the operating system to load the
program. Typically, the sizes of the various segments are stored there, as are things like
the initial values of the program counter and stack pointer. Some operating systems (like
MS-DOS) read the prefix into memory, and it's available for the program to use at run
time. Other operating systems read the prefix to get the information that it contains, and
then overwrite the prefix with the other segments.
Using the information in the prefix, the operating system then allocates a block of
memory large enough to hold the remainder of the program, called the executable image.
The text and data segments are then copied directly from the disk into memory. The text
or code segment holds the executable code; the data segment holds initialized data only.
That's why an initialized static local variable comes up with an initial value, but once
that value is changed it stays changed: the initial value is just read from the disk into
the proper place in memory. Note that there are two types of initialized data: variables
that have been given an initial value when declared, and initialized constants (like string
constants) whose values are not expected to change. The constant data is sometimes
stored in the text segment, along with the code, in order to make the operating system's
life a little easier. It knows that the text segment is not modified, so it doesn't have to
swap this region out to the disk if it needs the memory for another process.
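The behavior of an initialized static local can be seen in a few lines of C. This sketch is illustrative only; the names are hypothetical, and the comments about segments describe the usual arrangement rather than anything the C language itself promises. The zero value of the uninitialized variable is guaranteed by C.

```c
static int never_initialized;      /* usually placed in bss: zero at load time */

int bss_value(void) { return never_initialized; }

int next_ticket(void)
{
    static int ticket = 100;       /* usually placed in data: the 100 comes
                                      from the executable image             */
    return ticket++;               /* the change persists across calls      */
}
```

The first call to next_ticket() returns 100 and the second returns 101, because the initial value is loaded once and never reapplied.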
One of two things can happen once the text and data segments are loaded, depending
on the operating system. Either control can be transferred directly to the program, which
must continue the initialization process, or the operating system itself initializes the
other segments. In any event, the bss segment (footnote 4), which holds all uninitialized
data, is initialized to all zeros at load time. (There's no point in storing masses of zeros
on the disk.) The stack pointer is then initialized to point at the correct part of the stack
segment, which holds the run-time stack used for subroutine calls, and so forth. Stacks grow
down in most computers, so the stack pointer is typically initialized to the top of the
stack segment. Various pointers that manipulate the heap, the region of memory used for
dynamic storage, are also initialized.
The heap is used, in C, by the memory-allocation subroutines malloc() and free()
(footnote 5). In some languages the heap is also used in code generated by the compiler
itself. The C++ new operator allocates memory for an object,
and a user-defined class might do memory allocation transparently when an object in that
class is declared. Similarly, PL/1 supports dynamically allocable arrays (arrays whose
size is not known until run time), and the PL/1 compiler uses the heap to allocate space
for that array when the scope of that declaration is entered at run time (when the block
that holds the declaration is entered). The compiler just translates a compile-time
declaration into code that allocates space for the array and initializes a pointer to access
the first element; it effectively calls malloc() when the subroutine is entered and
free() at exit. The contents of the stack and heap segments are typically uninitialized
at load time, so the contents of variables allocated from these regions are undefined. The
heap is typically at the top of the image because it may have to grow larger as the
program runs, in which case the memory-allocation functions request that the operating
system enlarge the size of the executable image. The heap and stack segments are often
combined, however. Since the stack pointer typically grows down, and the memory
allocation from the heap can start in low memory and go up, the shared space can be
allocated from both ends as needed, like this:
4. Bss stands for "block starting with symbol," a term dating from the Mesozoic, at least.
5. The source code for a malloc() and free() implementation is in [K&R] pp. 185-189.
[Diagram: a single shared region. The stack starts at the initial stack pointer, at the top
of the region, and grows down; the heap starts at the bottom and grows up.]
This way, if a program has an unusually large stack and unusually small heap or vice
versa, space can be transferred from one area to the other with ease. It's difficult to
expand the size of the heap after the top of heap runs into the stack, however, and such
expansion almost always results in a fragmented heap. The virtual-memory manager
could solve the fragmentation problem, but it might not.
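The shared-space idea can be modeled with one array and two indexes. The sketch below is a toy model under stated assumptions: the arena size, the function names, and the byte-granular allocation are all hypothetical, and a real run-time system would do this with the hardware stack pointer and the allocator's own bookkeeping.

```c
#include <stddef.h>

#define ARENA_SIZE 64

static unsigned char arena[ARENA_SIZE];
static size_t heap_top     = 0;            /* first free byte above the heap */
static size_t stack_bottom = ARENA_SIZE;   /* lowest byte used by the stack  */

/* Bytes still available between the heap and the stack. */
size_t free_bytes(void) { return stack_bottom - heap_top; }

/* The heap grows up from low memory. Returns NULL on collision. */
void *heap_grow(size_t n)
{
    void *p;
    if (n > free_bytes())
        return NULL;
    p = &arena[heap_top];
    heap_top += n;
    return p;
}

/* The stack grows down from high memory. Returns NULL on collision. */
void *stack_grow(size_t n)
{
    if (n > free_bytes())
        return NULL;
    stack_bottom -= n;
    return &arena[stack_bottom];
}
```

Either side can use as much of the region as it needs, but once the two meet, neither can grow without moving the other.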
Many compilers provide a mechanism for the program to determine the positions of
various segments at run time. For example, the UNIX compiler automatically declares a
variable at the end of the bss segment called ebss. All memory whose address is greater
than &ebss must be in the stack or heap segments. Other variables (called etext and
edata) are provided for the other segment boundaries. Similarly, many compilers
generate code that checks the value of the stack pointer when every subroutine is entered
and terminates the program if the stack pointer is not in the stack segment.
C-code lets you change from one segment to another by issuing a SEG() directive,
which takes one of the following forms:
    SEG( text )
    SEG( data )
    SEG( bss )
Note that no semicolon is used here; this is the case for all C-code directives whose
names are in all caps. Everything that follows a SEG(text) (up to the next SEG()
directive or end of file) is in the text segment, and so forth. You can use SEG(code)
instead of SEG(text). There's no direct access to the prefix, stack, and heap segments
because these exist only at run time, and the compiler uses other mechanisms to access
them. You may not change segments in the middle of a subroutine: no SEG directives
can appear between the PROC and ENDP directives, discussed below. The SEG()
directive is defined as an empty macro in virtual.h. It's shown in Listing 6.5.
Listing 6.5. virtual.h: The SEG() Directive
39  #define SEG(segment)    /* empty */
6.2.5 Variable Declarations: Storage Classes and Alignment
C-code supports global-level variable declarations only. All variables must be
declared in the data or bss segments, depending on whether or not they are initialized.
Four storage classes are available (they are related to C in Table 6.4):
private     Space is allocated for the variable, but the variable cannot be accessed from
            outside the current file. In C, this class is used for all static variables, be
            they local or global. Initialized variables go into the data segment, other
            variables go into the bss segment.
public      Space is allocated for the variable, and the variable can be accessed from
            any file in the current program. In C, this class is used for all initialized
            nonstatic global variables. It is illegal for two public variables in the
            same program to have the same name, even if they're declared in different
            files. Since public variables must be initialized when declared, they must
            be in the data segment.
common      Space for this variable is allocated by the linker. If a variable with a given
            name is declared common in one module and public in another, then the
            public definition takes precedence. If there are nothing but common
            definitions for a variable, then the linker allocates space for that variable in
            the bss segment. C uses this storage class for all uninitialized global
            variables.
external    Space for this variable is allocated elsewhere. If a label is external, an
            identical label must be declared common or public in some other module
            of the program. This storage class is not used for variables in the current
            application, all of which are common, public, or private. It is used for
            subroutines, though.
Table 6.4. Converting C Storage Classes to C-code Storage Classes

                                               not static
                            static      definition    declaration (extern
                                                      or prototype)
Subroutine                  private     public        external
  (in text segment).
Uninitialized variable      private     common
  (in bss segment).
Initialized variable        private     public
  (in data segment).
The following example shows how these classes are used in C:
    int        kookla = 1;      /* Public: it's initialized and not static.     */
    static int fran;            /* Private: it's static. Initialized to 0.      */
    int        ollie;           /* Common: it's declared without an extern.     */
    int        ollie;           /* Not a redefinition because default class     */
                                /* is extern. Both instances of ollie are       */
                                /* common.                                      */
    void ronnie()
    {
        static float george;    /* Private: it's declared static.               */
        extern       dan;       /* Common.                                      */
        int          dick = 1;  /* No declaration is generated. (The            */
                                /* memory is allocated on the stack).           */
    }
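The repeated declaration of ollie is legal because C treats a file-scope declaration with no initializer and no extern as a tentative definition, and any number of them may refer to the same object. The following cut-down version of the example shows this; the function sum_of_globals is a hypothetical addition used only to read the variables back.

```c
int        kookla = 1;   /* initialized and not static                     */
static int fran;         /* static, so it starts out as zero               */
int        ollie;        /* tentative definition                           */
int        ollie;        /* second tentative definition of the same object */

/* Adds the three globals; ollie and fran are both zero at start-up. */
int sum_of_globals(void) { return kookla + fran + ollie; }
```

This compiles as C; C++ rejects the second declaration of ollie because it has no tentative definitions.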
The four storage classes can be broken into two broad categories: classes that
allocate space in the executable image (as stored on the disk) and classes that don't allocate
space. I'll have to dip into real assembly language (8086 assembler) for a moment to
show what's actually going on. The assembler actually builds a physical copy of the
executable image in memory as it works, and it copies this image to the disk when
assembly is complete. The assembler keeps track of the current position in the
executable image that it's building with an internal variable called the location counter.
A directive like the following allocates space for a variable:
    _var:  dw  10
This instruction tells the assembler to do the following:
  • Create a symbol-table entry for _var and remember the current value of the location
    counter there.
  • Fill the next two bytes with the number 10. Two bytes are allocated because of the
    dw; other codes are used for different sizes (db allocates one byte, for example). The
    10 is taken from the dw directive.
  • Increment the location counter by two bytes.
The allocated space (and the number 10, which is used here to fill that space) end up
on the disk as part of the executable image. Instructions are processed in much the same
way. For example, a
    MOV ax, _var
instruction moves the contents of _var into the ax register. This instruction tells the
assembler to do the following:
  • Copy a binary code representing the "move into ax" operation into the place
    referenced by the current location counter. This binary code is called the op code.
  • Copy the address of _var into the next two bytes. This address is the
    location-counter value that was remembered in the symbol table when space for _var was
    allocated.
  • Increment the location counter by three.
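The bookkeeping in the two lists above can be sketched in a few lines of C. This is a toy model, not an assembler: the image size, the symbol-table layout, and all of the function names are hypothetical, and only the dw directive is handled.

```c
#include <string.h>

#define IMAGE_SIZE 64
#define MAX_SYMS    8

struct symbol { const char *name; unsigned address; };

static unsigned char image[IMAGE_SIZE];   /* the executable image being built */
static unsigned      location_counter;    /* next free position in the image  */
static struct symbol symtab[MAX_SYMS];
static int           nsyms;

/* Handle "label: dw value". Returns the address given to the label,
 * or -1 if the image or the symbol table is full.
 */
long emit_dw(const char *label, unsigned value)
{
    unsigned address = location_counter;

    if (nsyms >= MAX_SYMS || location_counter + 2 > IMAGE_SIZE)
        return -1;

    symtab[nsyms].name    = label;                    /* remember the location */
    symtab[nsyms].address = address;
    ++nsyms;

    image[location_counter++] = value & 0xff;         /* low byte first */
    image[location_counter++] = (value >> 8) & 0xff;  /* then high byte */
    return (long)address;
}

/* Look a label up in the symbol table; returns its address or -1. */
long lookup(const char *name)
{
    int i;
    for (i = 0; i < nsyms; ++i)
        if (strcmp(symtab[i].name, name) == 0)
            return (long)symtab[i].address;
    return -1;
}

unsigned current_location(void)      { return location_counter; }
unsigned image_byte(unsigned offset) { return image[offset]; }
```

A call such as emit_dw("_var", 10) performs the three steps of the first list; an instruction would be emitted the same way, with an op-code byte followed by the two-byte result of lookup("_var").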
From the assembler's perspective, code and data are the same thing. All it knows
about is the current location counter. There is nothing preventing us from putting a MOV
instruction into memory as follows:
    db  0A0H    ; 0xA0 is the 8086 op code for "MOV into AX"
    dw  _var    ; address of _var
Applying the foregoing to C-code storage classes, public and private definitions
are translated into dw directives in 8086 assembler. They cause the location counter to
be moved, and some value is copied into the allocated space. This value ends up on the
disk as part of the executable image. The external directive is at the other extreme.
No space is allocated; the location counter is not modified. The assembler does create a
symbol-table entry for the associated label, however. If that label is referenced in the
code, place holders are put into the image instead of the actual addresses, which are not
known by the compiler. The linker replaces all place holders with the correct addresses
when it puts together the final program. The external directive must reference a
region of memory allocated with a dw directive or equivalent somewhere else in the
program. The unpatched binary image (with the place holders still in it) is usually called a
relocatable object module (footnote 6).
The common storage class is somewhere in between external and public. If the
name associated with the common is used elsewhere in a public or private
declaration, then the common is treated just like an external. Things change when all
references to the variable are commons, however. Space is allocated for the variable at
load time, not at compile time. Remember, the executable image on the disk represents
only part of the space that is used when the program is running. Space for a common is
always allocated in the bss segment, which is created when the program is loaded; it
has no existence at compile time and won't be part of the binary image on the disk. The
linker replaces the place holders in those instructions that use the common variable with
references to the place at which the variable is found at run time, but no dw directive is
generated. Space is allocated implicitly because the linker leaves a note in the
executable image's file header that tells the loader the size of the bss region. Instead of
incrementing the location counter at compile time, the assembler tells the loader to increment
it at load time, by making the size of the executable image large enough to encompass
the extra space.
6. One common way to organize a relocatable module puts a symbol table into the object file. The table has
one element for each unresolved reference; private symbols are not put into the table. The
symbol-table elements hold the symbol's name and the offset from the start of the file to the first place holder that
references that symbol. That place holder, in turn, holds the offset to the next place holder for the same
symbol, in a manner similar to a linked list. The last element of the list is usually marked with zeros. Most
8086 relocatable-object-module formats derive from the specification described in [Intel] and in
[Armbrust].
Returning to C-code, all variables must be declared using a basic type (byte, word,
lword, or ptr) and one of the foregoing storage classes (private, public, common,
external). In addition, one-dimensional arrays can be declared using trailing brackets.
Multi-dimensioned arrays are not permitted. The following declarations (only) are
available:
    class type name;                /* single variable */
    class type name[ constant ];    /* array           */
A structure must be declared as a byte array, though the keyword record can be
used as a synonym for byte for documentation purposes.
Public and private variables in the data segment may be initialized with an
explicit C-style initializer. Character constants and implicit array sizes are both
supported. For example:
    SEG( data )
    public  byte name     = 'z';
    public  byte name[]   = { 'a', 'b', 'c', 'd', '\0' };
    public  byte name[]   = "abcd";
    private word name     = 10;
    public  word name[3]  = { 10, 11, 12 };
The double-quote syntax can be used only to initialize byte arrays. A C declaration such
as the following:
    char *kings[4] =
    {
        "henry",
        "kong",
        "elvis",
        "balthazar"
    };
must be declared as follows in C-code:
    SEG( data )
    private byte S1[]     = "henry\0";
    private byte S2[]     = "kong\0";
    private byte S3[]     = "elvis\0";
    private byte S4[]     = "balthazar\0";
    public  ptr  kings[4] = { S1, S2, S3, S4 };
These declarations are all in the data segment because they're initialized. The
anonymous string names must be private because the same labels may be used for
other anonymous strings in other files. kings is public because it wasn't declared
static. The virtual.h definitions for the various storage classes are in Listing 6.6.
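For comparison, the same arrangement can be written by hand in ordinary C. This is a sketch of what the C-code declarations amount to, assuming that private corresponds to C's static; the helper king_length is a hypothetical addition, and the explicit \0 used in the C-code strings is omitted because a C string literal already ends in a null byte.

```c
#include <string.h>

static char S1[] = "henry";          /* private: invisible to other files */
static char S2[] = "kong";
static char S3[] = "elvis";
static char S4[] = "balthazar";

char *kings[4] = { S1, S2, S3, S4 }; /* public array of four pointers */

/* Length of the i-th name; i must be in the range 0 to 3. */
size_t king_length(int i) { return strlen(kings[i]); }
```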
Listing 6.6. virtual.h: Storage Classes
40  #define public      /* empty */
41  #define common      /* empty */
42  #define private     static
43  #define external    extern
The C-code assembler assumes that memory is made up of 32-bit wide cells, each
of which occupies four addresses. The various basic types control the way that the four
bytes can be accessed. In particular, the least-significant byte of an object must also be
at an address that is an even multiple of the object's size: a byte can be anywhere, a
word must be at an even address, and an lword or ptr must be at an address that's an even
multiple of four. All objects are stored with the least-significant byte at the lowest
physical address and the most-significant byte at the highest address. A pointer to an object
holds the physical address of the least-significant byte. The system is pictured in Figure
6.3.
Figure 6.3. Memory Alignment
[Figure: memory drawn as 32-bit wide cells, four byte addresses to a row (3 2 1 0, then
7 6 5 4, 11 10 9 8, 15 14 13 12, and 19 18 17 16). A cell can hold four 8-bit bytes, two
16-bit words, or one 32-bit long word; in each case the lsb is at the lowest address and the
msb at the highest.]
A good analogy for these restrictions is a book printed in a typeface that's half a page
high. The typeface is also so wide that only two digits can fit on a line. The rules are
that a two-digit number must be on a single line. A four-digit number, which requires
two lines to print, must be on a single page. An eight-digit number, which requires four
lines to print, must be on two facing pages.
These restrictions are called alignment restrictions. (You would say that words must
be aligned at even addresses, and so forth.) Alignment restrictions are usually imposed
by the hardware for efficiency's sake. If the data path from memory to the CPU is 32 bits
wide, the machine would like to fetch all 32 bits at once, and this is difficult to do if the
32-bit word can be split across a multiple-of-four boundary.
The actual byte ordering is not critical in the present application provided that it's
consistent, but the alignment restrictions are very important when doing something like
allocating space for a structure. For example, the following structure requires 16 bytes
rather than nine, because the 16-bit int must be aligned on an even boundary, the 32-bit
long must start at an address that's an even multiple of four, and padding is required at
the end to assure that the next object in memory is aligned properly, regardless of its
type. A declaration like the following:
    struct
    {
        char c1;
        int  i;
        char c2;
        long l;
        char c3;
    }
allocates fields placed in memory like this:
    offset    contents
    0         c1
    1         (padding)
    2-3       i
    4         c2
    5-7       (padding)
    8-11      l
    12        c3
    13-15     (padding)
Though C guarantees that the ordering of structure members will be preserved, a clever
compiler for another language might shuffle around the ordering of the fields to make the
structure smaller. If c1 were moved next to c2, then four bytes would be saved. The
three bytes of padding at the end of the structure are required in case you have an array
of structures. A clever compiler could actually eliminate the padding in the current
example, but nothing could be done if the first field of the structure was a long, which
needs to be aligned on a multiple-of-four boundary.
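The 16-byte figure, and the four bytes saved by reordering, can be checked with a short calculation. The sketch below applies the C-code rule (each field is aligned on a multiple of its own size, and the whole structure is padded to the worst-case alignment); the function names are hypothetical, and field sizes are assumed to be 1, 2, or 4.

```c
#include <stddef.h>

/* Round offset up to the next multiple of alignment (alignment > 0). */
size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}

/* Size of a structure whose fields have the given sizes, in order.
 * Each field is aligned on a multiple of its own size; the total is
 * then padded out to a multiple of worst_case.
 */
size_t struct_size(const size_t *field_sizes, size_t nfields, size_t worst_case)
{
    size_t offset = 0;
    size_t i;

    for (i = 0; i < nfields; ++i)
    {
        offset  = align_up(offset, field_sizes[i]);   /* padding before field */
        offset += field_sizes[i];                     /* the field itself     */
    }
    return align_up(offset, worst_case);              /* padding at the end   */
}
```

For field sizes {1, 2, 1, 4, 1} (c1, i, c2, l, c3) and a worst-case alignment of 4, struct_size() gives 16; for the reordered {1, 1, 2, 4, 1} it gives 12.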
C-code variable declarations automatically force proper alignment for that variable,
so if you declare a byte-sized variable followed by a 16-bit word, one byte of memory is
wasted because the word requires even alignment. There are occasional situations where
alignment must be forced explicitly, and C-code provides the ALIGN(type) directive for
this purpose. The assembler inserts sufficient padding to assure that the next declared
variable is aligned as if it were of the indicated type. ALIGN is defined in virtual.h in
Listing 6.7. The worst-case alignment boundary is declared for use by the compiler in
c-code.h (Listing 6.8).
Listing 6.7. virtual.h: The ALIGN Directive
44  #define ALIGN(type)     /* empty */
Listing 6.8. c-code.h: Worst-case Alignment Restriction
11  #define ALIGN_WORST  LWORD_WIDTH    /* Long word is worst-case alignment. */
6.2.6 Addressing Modes
Immediate mode.
Constants and all variable and register names are referenced in instructions by means
of various addressing modes, summarized in Table 6.5.
The immediate addressing mode is used for numbers. C-style hex, octal, decimal,
and character constants are all recognized (0xabc 0377 123 'c'); lword (32-bit)
constants should be indicated by a trailing L (0x12345678L). The immediate addressing
mode is used in real assemblers to put a physical number into an instruction rather
than an address. Many real assemblers require a special symbol like # to precede
immediate data.
The direct mode is used to fetch the contents of a variable. There is usually no special
operator here; you just use the variable's name. Note, however, that array and
function names cannot be accessed in direct mode; their name is always treated as the
address of the first element or instruction.
The effective-address mode gets the address of an object. The operand of the &
operator is usually a label (&x where x is a name in a previous declaration), but it can
also be used in conjunction with one of the based addressing modes, discussed shortly.
Note that the & should be omitted from a function or array name.
The indirect modes work only on objects of pointer type: a variable that was
declared ptr, a register name followed by the .pp selector, or the sp or fp registers.
You must do two things to access an object indirectly. You must surround the variable or
register that holds the address with parentheses to indicate indirection, and you must tell
the C-code assembler the kind of object that the pointer is referencing. You do the latter
by preceding the parenthesis with one of the following one- or two-character modifiers:
    Code    Points at object of this type
    B       byte
    W       word
    L       lword
    P       ptr
    BP      pointer to byte
    WP      pointer to word
    LP      pointer to lword
    PP      pointer to ptr
For example, W(r0.pp) fetches the word whose address is in r0.pp. You can also
access an object at a specified offset from a pointer with the following syntax:
    W(p + offset)
The offset may be a number, a numeric register reference (as compared to a pointer
reference), or a numeric variable. The following example references the word whose address
is derived by adding together the contents of the fp register and r0.w.low:
    W(fp + r0.w.low)
The following instruction references the long word at an offset of -16 (bytes) from the
frame pointer:
    L(fp-16)
If the fetched object is of type BP, WP, LP, or PP, a star may be added to fetch the object
to which the pointer points. For example, WP(fp+6) fetches the word pointer at offset 6
from the frame pointer; *WP(fp+6) fetches the object pointed to by that word pointer.
Note that these double indirect modes, though found on most big machines like VAXs,
are missing from many machines. You would have to use two instructions to access the
referenced object, like this:
    r0.pp = WP(fp+6);
    r1.w  = W(r0.pp);
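The equivalence between the double-indirect mode and the two-instruction sequence can be sketched on a small simulated memory. This is an illustration only: the byte array mem[], the 32-bit "addresses" that index it, and every function name here are assumptions made for the sketch, not part of virtual.h. The least-significant byte is stored at the lowest address.

```c
#include <stdint.h>

/* Hypothetical simulated memory: addresses are byte indexes into mem[]. */
static uint8_t mem[64];

static void store_word(uint32_t addr, uint16_t v)
{
    mem[addr]     = (uint8_t)(v & 0xff);
    mem[addr + 1] = (uint8_t)(v >> 8);
}

static uint16_t load_word(uint32_t addr)              /* like W(addr)  */
{
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

static void store_ptr(uint32_t addr, uint32_t p)
{
    store_word(addr,     (uint16_t)(p & 0xffff));
    store_word(addr + 2, (uint16_t)(p >> 16));
}

static uint32_t load_ptr(uint32_t addr)               /* like WP(addr) */
{
    return (uint32_t)load_word(addr) | ((uint32_t)load_word(addr + 2) << 16);
}

/* Double indirect, like *WP(fp+6): one combined access. */
static uint16_t double_indirect(uint32_t fp)
{
    return load_word(load_ptr(fp + 6));
}

/* The two-instruction equivalent: fetch the pointer, then the word. */
static uint16_t two_step(uint32_t fp)
{
    uint32_t r0_pp = load_ptr(fp + 6);    /* r0.pp = WP(fp+6) */

    return load_word(r0_pp);              /* W(r0.pp)         */
}
```

If a word is stored at address 40 and the pointer value 40 is stored at fp+6, both functions return that word; the only difference is whether the intermediate pointer is visible in a register.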
The effective-address and indirect modes can be combined in various ways, summarized
in Table 6.6. The alternate syntaxes are useful when doing code generation,
because variables on the stack can be treated the same way as variables at fixed
addresses. For example, the address of a word pointer that is stored at offset 6 from the
frame pointer can be fetched with &WP(fp+6), and the address of a word pointer, _p, at
a fixed address can be accessed with &WP(&_p). The convenience will become evident
when code-generation for the arithmetic operators is discussed, below. You can look at
these alternate addressing modes as a way to specify an explicit type in the variable
reference. An &W(&p) tells you that p points at a word, information that you would not
have if p were used by itself.
Table 6.5. Addressing Modes
    Mode               Example    Notes
    immediate          93         Decimal number. Use leading 0x for hex, 0 for octal.
    direct             x, r0.l    Contents of variable or register.
    indirect           B(p)       byte whose address is in p.
                       W(p)       word whose address is in p.
                       L(p)       lword whose address is in p.
                       P(p)       ptr whose address is in p.
                       BP(p)      byte pointer whose address is in p.
                       WP(p)      word pointer whose address is in p.
                       LP(p)      lword pointer whose address is in p.
                       PP(p)      ptr pointer whose address is in p.
    double indirect    *BP(p)     byte pointed to by pointer whose address is in p.
                       *WP(p)     word pointed to by word pointer whose address is in p.
                       *LP(p)     lword pointed to by lword pointer whose address is in p.
                       *PP(p)     ptr pointed to by ptr whose address is in p.
    based indirect     B(p+N)     byte at byte offset N from address in p.
                       W(p+N)     word at byte offset N from address in p.
                       L(p+N)     lword at byte offset N from address in p.
    effective address  &name      Address of variable or first element of array.
                       &W(p+N)    Address of word at offset +N from the pointer p. (The
                                  effective-address modes can also be used with other
                                  indirect modes; see below.)
    A generic pointer, p, is a variable declared ptr or a pointer register: rN.pp, fp, sp.
    N is any integer: a number, a numeric register (r0.w.low), or a reference to a
    byte, word, or lword variable. The based indirect modes can take negative offsets
    as in B(p-8).
Note that p may not be preceded by an ampersand if it references a register because
registers don't have addresses. Also, the ampersand preceding the p in the examples in
Table 6.6 is optional if p is the name of an array or function. Since some compilers print
a warning if you use the ampersand in front of an array or function name, you should
leave the ampersand off unless your compiler requires it.
Finally, note that C-style pointer arithmetic is used when a double-indirect directive
tells the assembler the type of the referenced object.
    ptr p;
    p += 1         /* Adds 1 to p because the referenced */
                   /* object's type is unspecified.      */
    BP(&p) += 1    /* Adds 1 (size of a byte) to p.      */
    WP(&p) += 1    /* Adds 2 (size of a word) to p.      */
    LP(&p) += 1    /* Adds 4 (size of a lword) to p.     */
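The same scaling is what ordinary C does when 1 is added to a typed pointer. The following minimal sketch measures how far the address moves for each pointer type; int8_t, int16_t, and int32_t stand in for byte, word, and lword, and the function names are hypothetical. The pointer passed in is assumed to be suitably aligned for the widest type.

```c
#include <stddef.h>
#include <stdint.h>

/* Each helper adds 1 to a pointer of a different type and reports how many
 * bytes the address moved. The argument must be aligned for the type used.
 */
static ptrdiff_t step_as_byte(char *p)
{
    return (char *)((int8_t *)p + 1) - p;
}

static ptrdiff_t step_as_word(char *p)
{
    return (char *)((int16_t *)p + 1) - p;
}

static ptrdiff_t step_as_lword(char *p)
{
    return (char *)((int32_t *)p + 1) - p;
}
```

Called with the address of an int32_t array, the three helpers report steps of 1, 2, and 4 bytes, matching the comments in the C-code fragment above.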
The macros that implement the various type directives are in Listing 6.9. The
prefixes themselves are declared, for the benefit of the compiler that's generating the
C-code, in c-code.h and are shown in Listing 6.10.
Table 6.6. Combined Indirect and Effective-Address Modes
    Syntax:                                    Evaluates to:
    &p      &W(&p)    &WP(&p)    &WP(fp+n)     address of the pointer
    p       &W(p)     WP(&p)     WP(fp+n)      contents of the pointer itself
    W(p)              *WP(&p)    *WP(fp+n)     contents of the word whose address is in the pointer
Listing 6.9. virtual.h Direct-Stack-Access Directives
    #define W     *(word  *)
    #define B     *(byte  *)
    #define L     *(lword *)
    #define P     *(ptr   *)
    #define WP    *(word  **)
    #define BP    *(byte  **)
    #define LP    *(lword **)
    #define PP    *(ptr   **)
Listing 6.10. c-code.h Indirect-Mode Prefixes
    #define BYTE_PREFIX      "B"     /* Indirect-mode prefixes. */
    #define WORD_PREFIX      "W"
    #define LWORD_PREFIX     "L"
    #define PTR_PREFIX       "P"
    #define BYTEPTR_PREFIX   "BP"
    #define WORDPTR_PREFIX   "WP"
    #define LWORDPTR_PREFIX  "LP"
    #define PTRPTR_PREFIX    "PP"
6.2.7 Manipulating the Stack
Two C-code directives are provided to do explicit stack manipulation. These are:
    push( something )
and
    something = pop( type )
The push() macro pushes the indicated object, and the pop() macro pops an object of
the indicated type. For example, if x were declared as an lword, you could say:
x = pop(lword). The lword is the declared type of the target (of x). The target can
also be a register, but the types on the two sides of the equal sign must agree:
    r1.w.low = pop( word );
The stack is 32 bits wide, so part of the stack word is wasted when you push or pop
small objects. For example:
    r1.w.low = pop( word )
pops the bottom word of the current top-of-stack item into the low half of r1. The top
half of the stack item is discarded, and the top half of r1 is not modified. A push
instruction modifies a 32-bit quantity, and the pushed object is right-adjusted in the
32-bit word. If r1 holds 0x12345678, the following instruction:
    push( r1.b.b3 )
pushes the number 0x??????12. The question marks represent undefined values. A
    push( r1.w.low )
directive pushes 0x????5678,
    push( r1.w.high )
pushes 0x????1234, and
    push( r1.l )
pushes the entire number 0x12345678.
A simple push(r1) is not permitted. The register name must be fully qualified.
Use push(r1.l) to push the entire register.
The two stack directives are defined in virtual.h in Listing 6.11.
Listing 6.11. virtual.h Pushing and Popping
    #define push(n)    (--__sp)->l = (lword)(n)
    #define pop(t)     (t)( (__sp++)->l )
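The behavior of a 32-bit-wide stack with narrower operands can be sketched without the register union. This is a simplified stand-in, not the virtual.h implementation: the stack is a plain array of uint32_t, and SDEPTH, push, pop_lword, and pop_word are names chosen for the sketch.

```c
#include <stdint.h>

#define SDEPTH 16                        /* hypothetical stack depth */

static uint32_t  stack[SDEPTH];          /* every stack item is 32 bits wide */
static uint32_t *sp = stack + SDEPTH;    /* the stack grows toward low memory */

/* Push any object; narrower values end up right-adjusted in the item. */
static void push(uint32_t n)
{
    *--sp = n;
}

static uint32_t pop_lword(void)          /* like pop(lword) */
{
    return *sp++;
}

/* Like pop(word): keep the low 16 bits of the item, discard the rest. */
static uint16_t pop_word(void)
{
    return (uint16_t)(*sp++ & 0xffff);
}
```

Pushing 0x12345678 and popping a word yields 0x5678, the bottom word of the item. One difference from the text is worth noting: because the argument is converted to uint32_t here, a short object is pushed with zeros above it, whereas the book treats those upper bits as undefined.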
6.2.8 Subroutines
Subroutines must all be defined in the text segment. External-subroutine declarations
should be output as follows:
    external name();
No return value may be specified.
Subroutine definitions are created by surrounding the code that comprises the subroutine
with PROC(name, class) and ENDP(name) directives. The name is the function
name and the class is the storage class (either public or private). Private
functions cannot be accessed outside the current file. Invocations of PROC() and
ENDP() may not be followed by semicolons.
A subroutine is called using a call(name) directive and control is passed back to
the calling function with a ret() directive. For example:
    SEG( text )
    PROC( _sylvester, public )
        call( _tweety );    /* Call subroutine tweety.      */
        ret();              /* Return to calling function.  */
    ENDP( _sylvester )
The argument to call() can be either a subroutine name or a reference to a variable or
register [like call(r1.pp), which calls the subroutine whose address is in r1.pp].
Note that the compiler has inserted underscores in front of the variable names so that
these names won't conflict with internally generated labels. A ret() statement is supplied
if one is not already present immediately before the ENDP.
The virtual.h definitions for these macros are in Listing 6.12. The call() macro
simulates an assembly-language call instruction which pushes the return address (the
address of the instruction that follows the call) onto the stack and then transfers control
to the address that is the call's argument. Here, the return-address push is simulated by
pushing the stringized version of the subroutine name onto the stack, and then the
subroutine is called. An assembly-language return instruction pops the address at the
top of stack into the instruction pointer. The ret() directive simulates this process by
popping the name from the stack and returning.
Listing 6.12. virtual.h Subroutine Definitions, Calls, and Returns
    55  #define PROC(name,cls)   cls name() {
    56  #define ENDP(name)       ret(); }              /* Name is ignored. */
    57
    58  #define call(name)       (--__sp)->pp = #name, (*(void (*)())(name))()
    59
    60  #define ret()            __sp++; return
The subroutine call on line 58 of Listing 6.12 is complicated by the fact that function
pointers must be handled as well as explicit function names. The argument must first be
cast into a pointer to a subroutine:
    (void (*)())(name)
and then can be called indirectly through the pointer. This cast does nothing if name is
already a subroutine name, but it correctly converts register and variable references into
indirect subroutine calls.
6.2.9 Stack Frames: Subroutine Arguments and Automatic Variables
Of the various memory segments, the stack segment is of particular importance to
subroutines. The run-time stack is used to make subroutine calls in the normal way: The
return address is pushed as part of the call and the return pops the address at top of stack
into the instruction pointer. Languages like C, which support recursion, use the stack for
other purposes as well. In particular, the stack is used to pass arguments to subroutines,
and certain local variables, called automatic variables, are stored on the stack at run
time. In C, all local variables that aren't declared static are automatic. The following
sequence of events occurs when a subroutine is called:
(1) The arguments are pushed in reverse order.
(2) The subroutine is called, pushing the return address as part of the call.
(3) The called subroutine pushes a few housekeeping registers, including the frame
pointer, discussed below.
(4) The subroutine advances the stack pointer so that room is freed on the stack for local,
automatic variables and anonymous temporaries.
7. The # directive is new to ANSI C. It turns the associated macro argument into a string by surrounding the
matching text with implicit quotation marks.
This entire area of the stack, extending from the first argument pushed to the uppermost
local variable or temporary, is called an activation record or stack frame. The
compiler uses the stack and stack-manipulation macros described earlier to translate the
code in Listing 6.13 into the output code in Listing 6.14. The complete stack frame is
pictured in Figure 6.4.
Listing 6.13. call(of, the, wild): Compiler Input
    int dog1, dog2, dog3;

    static call( of, the, wild )
    int of, the, wild;
    {
        int  buck, thornton;
        long john_silver;
        ...
    }

    spitz()
    {
        call( dog1, dog2, dog3 );
    }
Listing 6.14. call(of, the, wild): Compiler Output
     1          SEG( bss )
     2  int     _dog1;
     3  int     _dog2;
     4  int     _dog3;
     5
     6          SEG( text )
     7  PROC( call, private )
     8          push( fp );        /* Save old frame pointer.                   */
     9          fp = sp;           /* Set up new frame pointer.                 */
    10          sp -= 8;           /* Make room for local vars. & temporaries.  */
    11          call( chkstk );    /* Check for stack overflow.                 */
    12                             /* Code goes here.                           */
    13          sp = fp;           /* Discard local variables and temporaries.  */
    14          fp = pop( ptr );   /* Restore previous subroutine's fp.         */
    15          ret();
    16  ENDP( call )
    17  PROC( spitz, public )
    18          push( fp );
    19          fp = sp;           /* No sp-=N because there're no local vars.  */
    20          call( chkstk );
    21          push( _dog3 );     /* call( dog1, dog2, dog3 );                 */
    22          push( _dog2 );
    23          push( _dog1 );
    24          call( call );
    25          sp += 8;           /* Discard the arguments to call().          */
    26          sp = fp;           /* Return from spitz(): discard local vars.  */
    27          fp = pop( ptr );   /* Restore calling routine's frame pointer.  */
    28          ret();
    29  ENDP( spitz )
Figure 6.4. Stack Frame for call(of, the, wild) [figure not reproduced]
The sp-=8 on line ten of Listing 6.14 decrements the stack pointer to make room both
for local variables and for a small scratch space to use for the anonymous temporaries.
(Note that the stack pointer must be decremented in even multiples of 4 [the stack width]
when you modify it explicitly, as compared to modifying it implicitly with a push or
pop.) Unlike the arguments, the positions of local variables and temporaries within their
area are arbitrary.8 Don't be confused by the fact that part of the stack frame is created
by the calling function and the other part is created by the called function. The stack
frame is a single entity. Everybody is responsible for undoing what they did, so the calling
function cleans the arguments off the stack, and the called function gets rid of everything
else. Note that this organization of the stack frame, though characteristic, is not
mandatory. Many compilers put the temporaries in different places or not on the stack at
all. Sometimes, space on the stack frame is reserved to pass arguments to run-time
library functions, to push registers used for register variables, and so on.
Figure 6.5 shows the local-variable and argument portion of the stack frame in
greater detail. The frame-pointer register (fp) provides a fixed reference into the stack
frame. It is used to access all the arguments and automatic variables. The argument
wild, which contains dog3, can be accessed using the indirect addressing mode as
follows:
    W(fp+16)
The arguments are put onto the stack with push directives, and since an int is word
sized, the high half of the 32-bit stack item is undefined. The actual, symbolic names of
the arguments do not appear in the output. All generated code that references wild uses
W(fp+16) instead of the symbolic name, wild.
8. The chkstk() subroutine, called on line 11, checks that the stack pointer hasn't crossed over into a
different segment, as can happen if you allocate a large automatic array. It aborts the program if an error is
discovered. This routine can also be used by a profiler subroutine to log the time at which the subroutine
was entered. There would be a second subroutine call that logs the exit time at the bottom of the function.
Figure 6.5. The Stack Frame, Magnified
    physical    contents, byte 3 (msb) down to byte 0 (lsb),
    address     with the physical address of each byte          Access syntax
    100-103     john_silver (103 102 101 100)                   L(fp-8)
    104-107     buck (107 106), thornton (105 104)              buck: W(fp-2), thornton: W(fp-4)
    108-111     old frame pointer (111 110 109 108)             fp (fp holds the address 108)
    112-115     return address (115 114 113 112)
    116-119     undefined (119 118), of (117 116)               W(fp+8)
    120-123     undefined (123 122), the (121 120)              W(fp+12)
    124-127     undefined (127 126), wild (125 124)             W(fp+16)
The local-variable region differs from the arguments in that the compiler treats a
block of memory on the stack as a region of normal memory, and packs the local variables
into that space as best it can. Again, the indirect modes are the best way to access
these variables. For example, buck can be accessed using W(fp-2). (buck is at physical
addresses 106 and 107, and the frame pointer holds physical address 108, so the byte
offset from one to the other is -2.) The W(fp-2) directive causes a word-size object to
be fetched from address 106. The low byte of buck can be accessed directly with
B(fp-2) and the high byte with B(fp-1). Similarly, john_silver can be accessed
with L(fp-8). (Remember that the object's address is the physical address of the
least-significant byte, so the offset is -8 here.) A pointer-to-word variable can be fetched
with WP(fp-N). The object to which it points can be fetched with *WP(fp-N).
This use of the stack has two real advantages. First, the same relatively small area of
memory (the stack) can be recycled from subroutine to subroutine, so less of the total
memory area need be allocated for variables. Second, this organization makes recursive
subroutines possible because each recursive instance of a subroutine has its own stack
frame with its own set of local variables, and these variables are accessed relative to the
current frame pointer. A recursive subroutine does not know that it's calling itself; it
does the same thing for a recursive call that it would do for a nonrecursive call: push the
arguments and transfer control to the top of the required subroutine. The called routine
doesn't know that it has been called recursively; it just sets up the stack frame in the normal
way. It doesn't matter if more than one stack frame for a given subroutine exists at
once; only the top one is active.
Also note that the address of the leftmost argument is always (fp+8), regardless of
the number of arguments or their type. This is one of the things that makes it possible to
have a variable number of arguments in a C subroutine: you can always find the leftmost
one.
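The frame-building sequence and the resulting offsets can be sketched on a simulated, byte-addressed stack. Everything in this sketch is an assumption made for illustration: the 128-byte memory, the use of byte indexes as addresses, and the names store32, load32, load16, push, and enter. With a 128-byte memory the addresses happen to match those in Figure 6.5 (fp ends up holding 108).

```c
#include <stdint.h>

#define MEMSIZE 128                      /* hypothetical size of the stack area */

static uint8_t  mem[MEMSIZE];
static uint32_t sp = MEMSIZE;            /* byte address; grows toward 0 */
static uint32_t fp = 0;

static void store32(uint32_t addr, uint32_t v)    /* LSB at the lowest address */
{
    int i;

    for (i = 0; i < 4; ++i)
        mem[addr + i] = (uint8_t)(v >> (8 * i));
}

static uint32_t load32(uint32_t addr)
{
    return (uint32_t)mem[addr]
         | (uint32_t)mem[addr + 1] << 8
         | (uint32_t)mem[addr + 2] << 16
         | (uint32_t)mem[addr + 3] << 24;
}

static uint16_t load16(uint32_t addr)             /* like W(addr) */
{
    return (uint16_t)(mem[addr] | mem[addr + 1] << 8);
}

static void push(uint32_t v)
{
    sp -= 4;
    store32(sp, v);
}

/* Build the frame for a three-argument call: arguments in reverse order,
 * return address, old frame pointer, then room for the locals.
 */
static void enter(uint16_t of, uint16_t the, uint16_t wild,
                  uint32_t return_address, uint32_t local_bytes)
{
    push(wild);                  /* rightmost argument first        */
    push(the);
    push(of);
    push(return_address);        /* done by the call itself         */
    push(fp);                    /* save the old frame pointer      */
    fp  = sp;
    sp -= local_bytes;           /* must be a multiple of 4         */
}
```

After enter(10, 20, 30, ret, 8), the leftmost argument is read back with load16(fp + 8), the next with load16(fp + 12), and the last with load16(fp + 16), whatever values were pushed. The sketch zero-fills the upper half of each argument item; in the text that half is undefined.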
The stack frame's organization has disadvantages, too. The problem is that the code
that pushes the arguments is generated by the compiler when it processes the subroutine
call, but the offsets are figured when the compiler processes the subroutine declaration.
Since the call and declaration can be in different files, there's no way that the compiler
can check for consistency unless you use a function prototype in a common header file.
If you don't use the prototype, a particularly nasty bug, called a phase error, can appear
at run time. Figure 6.6 shows the stack frames created, both by the earlier
call(of, the, wild) and an incorrect call(), with no arguments. When the
call() subroutine modifies wild, it just modifies the memory location at fp+4, and on
the incorrect stack, ends up modifying the return address of the calling function. This
means that call() could work correctly, as could the calling function, but the program
would blow up when the calling function returned.
Figure 6.6. A Phase Error
[Figure not reproduced. It shows two stack frames side by side. Left, call(of, the, wild):
temporary variables (sp at fp-12), y at fp-8, x at fp-4, the old frame pointer at fp, the
return address at fp+4, of at fp+8, the at fp+12, and wild at fp+16, followed by the stack
frame of the calling function. Right, call() with no arguments: the same layout down to the
return address at fp+4, but the stack frame of the calling function (its old frame pointer
and return address) begins at fp+8, so offsets fp+8 through fp+16 fall inside the caller's
frame.]
The sequence of instructions that sets up a stack frame is so common that C-code
provides two instructions for this purpose. The link(N) instruction does the following:
    push( fp );
    fp = sp;
    sp -= N * stack_width;    /* Decrement by N stack elements. */
and an unlink() directive does the following:
    sp = fp;
    fp = pop( ptr );
The earlier call(of, the, wild) is modified to use link() and unlink() in
Listing 6.15. Listing 6.16 shows the link() and unlink() implementations in virtual.h.
Listing 6.15. call(of, the, wild): Compiler Output with link() and unlink()
            SEG( bss )
    int     _of;
    int     _the;
    int     _wild;

            SEG( text )
    PROC( call )
            link( 2 );         /* Create stack frame, 2 stack elements for locals. */

            unlink();
            ret();
    ENDP( call )

    PROC( spitz )
            link( 0 );         /* No local variables so argument is zero. */
            push( _wild );     /* call( of, the, wild );                  */
            push( _the );
            push( _of );
            call( call );
            unlink();
            ret();
    ENDP( spitz )
Listing 6.16. virtual.h Subroutine Linkage Directives
    #define link(n)    ((--__sp)->pp = (char *)__fp), (__fp = __sp), (__sp -= (n))

    #define unlink()   (__sp = (reg *)__fp), (__fp = (reg *)((__sp++)->pp))
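The effect of the two linkage directives can be sketched with array indexes in place of pointers. This is not the virtual.h code: the stack is a plain array of 32-bit items, sp and fp are indexes into it, and the names SDEPTH, link_frame, and unlink_frame are chosen for the sketch (the shorter names are avoided because link and unlink are also POSIX function names).

```c
#include <stdint.h>

#define SDEPTH 32                        /* hypothetical stack depth, in items */

static uint32_t stack[SDEPTH];
static uint32_t sp = SDEPTH;             /* index of the top-of-stack item  */
static uint32_t fp = SDEPTH;             /* index used as the frame pointer */

/* Like link(n): save the old frame pointer, start a new frame, and reserve
 * n stack items for local variables and temporaries.
 */
static void link_frame(uint32_t n)
{
    stack[--sp] = fp;                    /* push( fp )          */
    fp  = sp;                            /* fp = sp             */
    sp -= n;                             /* reserve n items     */
}

/* Like unlink(): discard the locals and restore the caller's frame pointer. */
static void unlink_frame(void)
{
    sp = fp;                             /* sp = fp             */
    fp = stack[sp++];                    /* fp = pop( ptr )     */
}
```

Nested calls to link_frame() chain the saved frame pointers together, and a matching sequence of unlink_frame() calls walks back along that chain, leaving sp and fp exactly where they started.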
The stack frames can also be viewed as a linked list of data structures, each
representing an active subroutine. It's possible (in an object-oriented programming
environment, for example) for these structures to be allocated from discontinuous
memory rather than from a stack. The frame pointer is sometimes called the dynamic
link because it links together activation records.9
6.2.10 Subroutine Return Values
A function's return value should be placed into a register, according to Table 6.7.
9. Some languages require a second pointer, called the static link, discussed in Appendix B.
Table 6.7. Return Values: Register Usage
    Type       Returned in:
    char       rF.w.low (A char is always promoted to int.)
    int        rF.w.low
    long       rF.l
    pointer    rF.pp
6.2.11 Operators
A very limited set of operators is supported in C-code. The following arithmetic
operators (only) are available:
    +=  -=  *=  /=  %=  ^=  &=  |=  <<=  >>=  =-  =~  lrs(x,n)
The ones that look like C operators work just like the equivalent C operators. >>= is an
arithmetic right shift (sign extension is assumed). Three special operators are supported.
The =- operator performs a two's complement operation, as in x =- c. The =~ operator
does a one's complement, as in x =~ c. Both of these instructions are perfectly legitimate
C. I'm assuming that the lexical analyzer breaks the foregoing into:
    x = -c;
    x = ~c;
Lint might print an "obsolete syntax" error message, though. You need binary operators
here, because C-code is attempting to mimic real assembly language, and though most
machines support a unary negate operator such as the 8086 NEG AX, which does a
two's-complement negation of the AX register, you can't implement this operator without
an equal sign in C. It seemed better to use a syntax consistent with the other operators
than to bury the operation in a macro.
One additional operation is supported by means of a macro. An lrs(x,n) directive
does a logical right shift rather than an arithmetic shift (>>=). It shifts x, n bits to the
right, with zero fill in the high bits rather than sign extension. This directive is defined in
virtual.h in Listing 6.17.
Listing 6.17. virtual.h The Logical-Right-Shift Directive
    #define lrs(x,n)    ((x) = ((unsigned long)(x) >> (n)))
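The difference between the two kinds of right shift can be seen in a small sketch on 32-bit values. The function names lrs32 and ars32 are hypothetical. The arithmetic version is written out by hand because C leaves the result of >> on a negative signed value implementation-defined. The shift count is assumed to be between 0 and 31, and a two's-complement machine is assumed for the conversions back to int32_t.

```c
#include <stdint.h>

/* Logical right shift: zero fill, whatever the sign of x. */
static int32_t lrs32(int32_t x, unsigned n)
{
    return (int32_t)((uint32_t)x >> n);
}

/* Arithmetic right shift with the sign fill done explicitly. */
static int32_t ars32(int32_t x, unsigned n)
{
    uint32_t ux = (uint32_t)x;

    if (x < 0)
        return (int32_t)~(~ux >> n);     /* shift in ones  */
    return (int32_t)(ux >> n);           /* shift in zeros */
}
```

For a positive operand the two functions agree. For -16 shifted right by two bits, the arithmetic shift gives -4, while the logical shift gives 0x3FFFFFFC because zeros are shifted into the top two bits.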
All C-code expressions must contain exactly two operands and one operator. They
must be semicolon terminated. Only one expression is permitted on a line, and the entire
expression must be on a single line.
6.2.12 Type Conversions
No automatic type conversions are supported in C-code. Both operands in an expression
must be of the same type, but if one of the operands is an explicit (immediate-mode)
number, it is automatically converted to the type of the other operand.
10. Note that, because of this restriction, most real assembly languages attach the type to the operator rather
than the operand. You would move a 32-bit long word in 68000 assembler with a MOV.L d1,d0
instruction. In C-code you'd use r1.l = r0.l to do the same thing.
Cast operators may not be used to do type conversion, but sign extension can be performed
using one of the following directives:
    ext_low(reg)     Duplicate high bit of reg.b.b0 in all bits of reg.b.b1
    ext_high(reg)    Duplicate high bit of reg.b.b2 in all bits of reg.b.b3
    ext_word(reg)    Duplicate high bit of reg.w.low in all bits of reg.w.high
An ext_low(reg) directive fills reg.b.b1 with ones if the sign bit in reg.b.b0 is
set, otherwise it's filled with zeros. These directives only work in a register, so given
input like this:
    int  archy;
    long mehitabel;
    mehitabel = archy + 1;
the compiler should output something like the following:
    public word   archy;
    public lword  mehitabel;
    r0.w.low  =  archy;      /* Get archy.          */
    r0.w.low  += 1;          /* Add 1 to it.        */
    ext_word( r0 );          /* Convert to long.    */
    mehitabel =  r0.l;       /* Do the assignment.  */
The ext_word() directive effectively converts the word into an lword. If archy
were unsigned, a r0.w.high=0 would be used instead of ext_word(r0). The
definitions of the sign-extension directives are in Listing 6.18.
Listing 6.18. virtual.h Sign-Extension Directives
    #define ext_low(reg)     (reg.w.low  = (word )reg.b.b0)
    #define ext_high(reg)    (reg.w.high = (word )reg.b.b2)
    #define ext_word(reg)    (reg.l      = (lword)reg.w.low)
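The bit-level effect of sign extension can be sketched on a plain 32-bit register image, without the register union that the macros rely on. The function names below are hypothetical, and the register is modeled as a uint32_t whose low 16 bits play the role of reg.w.low.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits into the high 16 bits, like ext_word(). */
static uint32_t ext_word32(uint32_t reg)
{
    uint32_t low = reg & 0xffffu;

    return (low & 0x8000u) ? (low | 0xffff0000u) : low;
}

/* Sign-extend the low byte into the second byte, like ext_low();
 * the upper half of the register is left alone.
 */
static uint32_t ext_low32(uint32_t reg)
{
    uint32_t b0 = reg & 0xffu;
    uint32_t b1 = (b0 & 0x80u) ? 0xff00u : 0;

    return (reg & 0xffff0000u) | b1 | b0;
}

/* The unsigned alternative: clear the high word instead of extending. */
static uint32_t zero_extend_word32(uint32_t reg)
{
    return reg & 0xffffu;
}
```

A low word of 0xFFFF (that is, -1) extends to 0xFFFFFFFF, while the unsigned treatment of the same word gives 0x0000FFFF, which is why the compiler must know whether the source variable is signed.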
6.2.13 Labels and Control Flow
Labels in C-code are like normal C labels. They are defined as a name followed by a
colon:
    label:
The only control-flow statement is the goto branch, which is used in the normal way:
    goto label;
The target of the goto branch must be in the same subroutine as the goto itself.
Conditional flow of control is performed using one of the test directives summarized
in Table 6.8. These all compare two operands, and the instruction on the line following
the test is executed only if the test evaluates true; otherwise the instruction on the next
line is ignored. The following code executes the goto branch if all is equal to
things (all things being equal).
            EQ( all, things )
                goto jail;
    go:
            /* collect $200 */
    jail:
All the addressing modes described earlier can be used in a test. The normal
comparisons assume signed numbers, but the U_LT(), U_LE(), U_GT(), and
U_GE() directives compare the two numbers as unsigned quantities. The instruction
following the test may not be another test. The test directives are implemented with the
macros in Listing 6.19.
Table 6.8. C-code Test Directives
    Directive:      Execute following line if:
    EQ( a, b )      a = b
    NE( a, b )      a ≠ b
    LT( a, b )      a < b
    LE( a, b )      a ≤ b
    GT( a, b )      a > b
    GE( a, b )      a ≥ b
    U_LT( a, b )    a < b (unsigned comparison)
    U_LE( a, b )    a ≤ b (unsigned comparison)
    U_GT( a, b )    a > b (unsigned comparison)
    U_GE( a, b )    a ≥ b (unsigned comparison)
    BIT( b, s )     bit b of s is set to 1 (bit 0 is the low bit).
Listing 6.19. virtual.h Comparison Directives
    #define EQ(a,b)      if( (long)(a) == (long)(b) )
    #define NE(a,b)      if( (long)(a) != (long)(b) )
    #define LT(a,b)      if( (long)(a) <  (long)(b) )
    #define LE(a,b)      if( (long)(a) <= (long)(b) )
    #define GT(a,b)      if( (long)(a) >  (long)(b) )
    #define GE(a,b)      if( (long)(a) >= (long)(b) )

    #define U_LT(a,b)    if( (unsigned long)(a) <  (unsigned long)(b) )
    #define U_GT(a,b)    if( (unsigned long)(a) >  (unsigned long)(b) )
    #define U_LE(a,b)    if( (unsigned long)(a) <= (unsigned long)(b) )
    #define U_GE(a,b)    if( (unsigned long)(a) >= (unsigned long)(b) )

    #define BIT(b,s)     if( (s) & (1 << (b)) )
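Why separate signed and unsigned tests are needed can be shown with the same casts used in the comparison macros. The helper functions below are hypothetical; they return 1 or 0 instead of expanding to an if statement, so that the result of each comparison can be examined directly.

```c
/* Signed comparison: both operands are treated as long values. */
static int lt_signed(long a, long b)
{
    return (long)a < (long)b;
}

/* Unsigned comparison: the same bit patterns, reinterpreted as unsigned. */
static int lt_unsigned(long a, long b)
{
    return (unsigned long)a < (unsigned long)b;
}

/* Bit test: nonzero if bit b of s is set (bit 0 is the low bit). */
static int bit_is_set(unsigned b, unsigned long s)
{
    return (s & (1UL << b)) != 0;
}
```

With the operands -1 and 1, the signed test succeeds, but the unsigned test fails because -1 converts to the largest unsigned long value. A compiler generating C-code therefore has to pick the directive family from the declared type of the operands.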
6.2.14 Macros and Constant Expressions
Various C-preprocessor macros, summarized in Table 6.9, may be used in a C-code
file. C-style, parameterized macros are supported, but the ANSI defined() pseudo op, the
concatenation operator (##), and the stringizing operator (#name) are not supported.
Arithmetic expressions that involve nothing but constants and the following operators
can appear in an #if directive and anywhere that a constant would appear in an
operand:
    +  -  *  /  %  &  &&  !=  <  >  <=  >=  ~  !
The - is both unary and binary minus. The * is multiplication. Multiple-operator
expressions such as the following are legal:
    #define L1 (-24)
    WP(fp-4) = WP(fp-L1-6);    /* The L1-6 is a constant expression. */
Table 6.9. C-code Preprocessor Directives
    #line line-number "file"
    #define NAME text
    #define NAME(args) text
    #undef NAME
    #ifdef NAME
    #if constant expression
    #endif
    #else
    #include <file>
    #include "file"
Note that the constant expression must completely precede or follow a variable or register
reference in an operand. For example, the following is not permitted because the
register reference (fp) is embedded in the middle of the expression:
    WP(fp-4) = WP(6 + fp + 10);
6.2.15 File Organization
All variable and macro declarations must precede their use. Forward references are
permitted with goto branches, however.
C-code files should, ideally, take the following form:
    #include <tools/virtual.h>
    SEG( data )
        initialized data declarations
    SEG( bss )
        uninitialized data declarations
    SEG( text )
        subroutines
You can switch back and forth between the data and bss segments as much as you like,
but you can't change segments in the middle of a subroutine definition bounded by
PROC() and ENDP() directives. extern declarations should be placed in the bss segment.
Macro definitions can go anywhere in the file.
6.2.16 Miscellany
One other handy macro is in virtual.h. Since the point of C-code is to create a
language that can be assembled with your C compiler rather than a real assembler, it's
desirable for a main() function in the source to really be main(). Unfortunately,
the compiler puts a leading underscore in front of main when it outputs the
PROC directive. The _main macro on line 81 of Listing 6.20 takes care of this problem.
The last part of virtual.h (on lines 84 to 118 of Listing 6.20) is an actual function,
created only if ALLOC is defined. pm() prints the top few stack elements and the
contents of the virtual-machine registers to standard output as follows:
Listing 6.20. virtual.h Run-time Trace Support
 81  #define _main main
 82
 83  #ifdef ALLOC
 84  pm()
 85  {
 86      reg *p;
 87      int i;
 88
 89      /* Print the virtual machine (registers and top 16 stack elements). */
 90
 91      printf("r0= %08lx r1= %08lx r2= %08lx r3= %08lx\n",
 92                                      r0.l, r1.l, r2.l, r3.l );
 93      printf("r4= %08lx r5= %08lx r6= %08lx r7= %08lx\n",
 94                                      r4.l, r5.l, r6.l, r7.l );
 95      printf("r8= %08lx r9= %08lx rA= %08lx rB= %08lx\n",
 96                                      r8.l, r9.l, rA.l, rB.l );
 97      printf("rC= %08lx rD= %08lx rE= %08lx rF= %08lx\n",
 98                                      rC.l, rD.l, rE.l, rF.l );
 99
100      if( __sp >= &stack[SDEPTH] )
101          printf("Stack is empty\n");
102      else
103          printf("\nitem byte real addr   b3 b2 b1 b0     hi   lo       l\n");
104
105      for( p = __sp, i = 16; p < &stack[SDEPTH] && --i >= 0; ++p )
106      {
107          printf("%04d %04d %9p [%02x|%02x|%02x|%02x] = [%04x|%04x] = [%08lx] ",
108                  p - __sp, (p - __sp) * 4, (void far *)p,
109                  p->b.b3 & 0xff, p->b.b2 & 0xff, p->b.b1 & 0xff, p->b.b0 & 0xff,
110                  p->w.high & 0xffff, p->w.low & 0xffff,
111                  p->l
112                );
113
114          if( p == __sp ) printf("<-SP");
115          if( p == __fp ) printf("<-FP");
116          printf("\n");
117      }
118  }
119  #endif
    r0= 00000000 r1= 00000000 r2= 00000000 r3= 00000000
    r4= 00000000 r5= 00000000 r6= 00000000 r7= 00000000
    r8= 00000000 r9= 00000000 rA= 00000000 rB= 00000000
    rC= 00000000 rD= 00000000 rE= 00000000 rF= 00000000

    item byte real addr   b3 b2 b1 b0      hi   lo        l
    0000 0000 2C8F:1878 [00|00|00|00] == [0000|0000] == [00000000] <-SP
    0001 0004 2C8F:187C [00|00|00|00] == [0000|0000] == [00000000]
    0002 0008 2C8F:1880 [00|00|18|8c] == [0000|188c] == [0000188c] <-FP
    0003 0012 2C8F:1884 [00|00|04|f2] == [0000|04f2] == [000004f2]
    0004 0016 2C8F:1888 [ab|cd|ef|12] == [abcd|ef12] == [abcdef12]
    0005 0020 2C8F:188C [00|00|18|90] == [0000|1890] == [00001890]
The stack is printed three times, broken up as bytes, words, and lwords. The leftmost
column, item, is the offset in stack elements from the top of stack to the current element;
the byte column is the byte offset to the least-significant byte of the stack item; and
real addr is the physical address of the stack element (it's in 8086 segment:offset form).
The position of the sp and fp registers is indicated at the far right. pm() is used in the
compiler to print a run-time trace as the output code executes. The compiler described
below can be put in a mode where most instructions are output like this:
    r0.l = r1.l; printf("r0.l = r1.l;\n"); pm();
so you can see what the virtual machine is doing as the code executes.
6.2.17 Caveats
C-code is similar enough to assembly language that it's easy to forget it's really C,
and make mistakes accordingly. The biggest problem is labels. Most assembly languages
treat all labels the same, regardless of what they are used for: a label always evaluates
to an address, and any label can be used in any instruction. C-code is different, however:
array names evaluate to addresses, most other labels evaluate to the contents of a
variable, but labels that are followed by colons can only be used as targets of goto
branches. These restrictions make C-code a little more difficult to write than real
assembler. They also make some common assembly-language data structures impossible to
implement. For example, most assemblers let you code a switch statement as follows:
                    SEG( data )
    switch_table:   dw L1               /* Array of four labels. */
                    dw L2
                    dw L3
                    dw L4

                    SEG( text )
                    t0 = /* evaluated argument to switch */
                    goto switch_table[ t0 ]
    L1:             code
    L2:             code
    L3:             code
    L4:             code
This data structure, called a jump table, is not legal C-code: you can't have an array of
labels and you can't use a variable as an argument to a goto.
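Since a goto can name only a fixed label, a multiway branch has to be built some other way. One possibility, sketched here in ordinary C rather than in C-code, is a chain of compare-and-branch tests. The function name dispatch, the variable t0, and the case values 1 to 3 are assumptions made up for this illustration; the sketch is not necessarily the code that the compiler described in this chapter generates.

```c
/* Hypothetical illustration: choose one of four code blocks using only
 * comparisons against constants and gotos to fixed labels (no jump table).
 */
int dispatch( int t0 )          /* t0 holds the evaluated switch argument */
{
    int which;

    if( t0 == 1 ) goto L1;
    if( t0 == 2 ) goto L2;
    if( t0 == 3 ) goto L3;
    goto L4;                    /* everything else falls into the default */

L1: which = 1;  goto done;
L2: which = 2;  goto done;
L3: which = 3;  goto done;
L4: which = 4;
done:
    return which;
}
```

The cost is a linear series of tests instead of the single indexed jump that a jump table provides.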
Like Gaul, most computer languages can be divided into three parts: declarations,
expressions, and statements. Looking back at Table 6.2 on page 451, the output code is
divided into three logical sections. Each section is generated from a separate part of the
input file, and the code generation for each of these parts is controlled by distinct parts of
the grammar. Every subroutine has a prefix portion that's generated when the declarations
and argument list are processed. The body of the subroutine, which contains statements
and expressions, comes next. Finally, code to clean up and return from a subroutine
is generated at the end of the definition. This last section is called the suffix. I'll
start by looking at the first of these sections, the subroutine-prefix and declaration
processing.
6.3.1 Symbol-Table Requirements
We need to examine the data structures that are used to process a declaration before
looking at the actual code-generation actions. A compiler's declaration system centers
around a set of data structures collectively called the symbol table. Strictly speaking, the
symbol table is a database that contains information about subroutines, variables, and so
forth. The database is indexed by a key field (here a subroutine or variable's name),
and each record (each entry in the database) contains information about that item such as
the variable's type or subroutine's return value. A record is added to the database by the
code that processes declarations, and it is deleted from the database when the scoping
rules of the language determine that the object can no longer be referenced. C local
variables, for example, are deleted when the compiler finishes the block in which they are
declared.
Symbol tables are used for other purposes as well. Type definitions and constant
declarations may be found in them, for example. The symbol table can also be used to
communicate with the lexical analyzer. In the current compiler, a typedef creates a
symbol-table entry for the new type, as if the type name were a variable name. A bit is
set in the record to indicate that this is a typedef, however. The lexical analyzer then
uses the symbol table to distinguish identifiers from type names. This approach has its
drawbacks (discussed below), but can be quite useful.
Even though a symbol table is a database, it has special needs that must be met in
specific ways. A symbol-table manager must have the following characteristics:
Speed. Because the symbol table must be accessed every time an identifier or type is
referenced, look-up time must be as fast as possible. Consequently, disk-based
data-management systems are not appropriate here. The entire table should be in
memory. On the down side, one of the main limitations on input-file size is often the
maximum memory available for the symbol table.
Ease of maintenance. The symbol table is probably the most complex data structure
in the compiler. Its support functions must be organized so that someone other than
the compiler writer can maintain them.
Flexibility. A language like C does not limit the complexity of a variable declaration,
so the symbol table must be able to represent variables of arbitrary type. This
representation should be optimized for code-generation purposes. Similarly, the
symbol table should be able to grow as symbols are added to it.
Duplicate entries must be supported. Most programming languages allow a variable
at an inner nesting level to have the same name as a variable at an outer nesting
level. These are different variables, in spite of having the same name, and the scoping
rules of the language determine which of these variables is active at a given
moment. The active variable is said to shadow the inactive one. A distinct symbol-table
entry is required for each variable, and the database manager must be able to
handle this situation.
You must be able to quickly delete arbitrary elements and groups of elements from
the table. For example, you should be able to delete all the local variables associated
with a particular block level in an efficient manner, without having to look up each
element separately.
The symbol table used here is organized in two layers. I'll call the innermost of
these the database layer. This layer takes care of physical table maintenance: inserting
new entries in the table, finding them, deleting them, and so forth. I'll call the outer
layer the maintenance layer; it manages the table at a higher level, creating systems of
data structures to represent specific symbols and inserting these structures into the table
using the low-level insert function. Other subroutines at the maintenance level delete
entire classes of symbols (variables declared at the same block level, for example),
traverse all variables of a single class, and so forth.
6.3.2 Symbol-Table Data-Base Data Structures
Several data structures can be used for the database layer of the symbol table, each
appropriate in specific situations. The simplest possible structure is a linear array
organized as a stack. New symbols are added to the end of the array with a push operation and
the array is searched from top to bottom of stack. (The most recently added item is
examined first.) This method, though crude, is quite workable provided that the table is
small enough. The scope rules are handled by the back-to-front searching. Since variables
declared at an inner nesting level are added to the table after those declared at a
higher level, they are always found first. The database manager is trivial to implement.
One real advantage to a stack-based symbol table is that it's very easy to delete a
block of declarations. Variable declarations are done in waves according to the current
scoping level. For example, given input like this:
    int laurel, hardy;
    {
        int larry, curly, moe;
        {
            int house_of_representatives[ 435 ];
        }
    }
laurel and hardy are inserted first as a block, then larry, curly, and moe are
inserted, and then house_of_representatives is inserted. The symbol-table
stack looks like this when the innermost block is processed:
    stack pointer --> house_of_representatives      level 3
                      moe                           level 2
                      curly                         level 2
                      larry                         level 2
                      hardy                         level 1
                      laurel                        level 1
All variables associated with a block can be deleted at one time by adding a constant to
the stack pointer.
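The stack organization just described can be sketched in a few lines of C. This is only an illustration: the names entry, stk_add(), stk_find(), stk_mark(), and stk_release(), and the TABLE_SIZE limit, are hypothetical and are not taken from the compiler in this chapter. The sketch records the stack pointer when a block is opened and restores it when the block closes, which has the same effect as adding a constant to the stack pointer.

```c
#include <string.h>

#define TABLE_SIZE 128              /* worst-case size, fixed at compile time */

typedef struct
{
    const char *name;               /* symbol name                   */
    int         level;              /* nesting level of declaration  */
} entry;

static entry Table[ TABLE_SIZE ];
static int   Top = 0;               /* index of the next free slot   */

/* Push a new symbol. Returns 1 on success, 0 if the table is full. */
int stk_add( const char *name, int level )
{
    if( Top >= TABLE_SIZE )
        return 0;
    Table[Top].name  = name;
    Table[Top].level = level;
    ++Top;
    return 1;
}

/* Search from the most recently added entry back toward the start of the
 * array, so an inner declaration is found before an outer one.
 */
entry *stk_find( const char *name )
{
    int i;
    for( i = Top; --i >= 0; )
        if( strcmp( Table[i].name, name ) == 0 )
            return &Table[i];
    return NULL;
}

/* Remember the stack pointer when a block is opened ... */
int stk_mark( void )            { return Top; }

/* ... and restore it when the block closes, discarding every symbol
 * declared inside the block in one step.
 */
void stk_release( int mark )    { Top = mark; }
```

A caller would invoke stk_mark() at each open curly brace and stk_release() at the matching close brace.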
The stack approach does have disadvantages other than the obvious one of the
inefficient linear search required to find entries close to the beginning of the table. (This
is not an insignificant problem: the compiler spends more time searching for symbol-table
references than any other symbol-table operation. The time required for a linear
search will be prohibitive if the table is large enough.) The maximum size of the
stack-based table must be known at compile time, so the symbol table can't scale its size
dynamically to fit the needs of the current input file. Consequently, the number of
variables that can be handled by the system is limited. You have to allocate a worst-case
table size to make sure that there is enough room.
The search-time and limited-size problems can be solved, to some extent, by using a
binary tree as the basic data structure. Average search times in a balanced binary tree
are logarithmic, and the tree size can grow dynamically as necessary.
Deletion of an arbitrary node from a tree is difficult, but fortunately this is not an
issue with most symbol-table applications, because nodes for a given level are inserted
into the tree as a block, and newer levels are deleted before the older ones. The
most-recently inserted nodes tend to form leaves in the tree. If a variable at the most recent
level is an interior node, then all its children will have been declared either at the same
nesting level or in an inner block. The most recently added block of variables is always
at the end of a branch, and all variables in that block can be removed by breaking the
links to them without having to rearrange the tree. For example, a tree for the earlier
code fragment is pictured in Figure 6.7. The dotted lines show the scoping levels.
house_of_representatives is deleted first by breaking a single link; larry,
curly, and moe are deleted next, again by breaking single links; laurel and hardy
are deleted by breaking the single link that points at laurel.
Figure 6.7. Symbol-Table Trees: Deletion
[Figure: binary tree for the earlier code fragment, including the nodes hardy, moe, and
house_of_representatives, with dotted lines marking the scoping levels.]
Binary trees do have disadvantages. First, it is a common practice for programmers
to declare variables in alphabetical order. Since variables are added to the tree in the
same order as the declarations, a simple binary tree degrades to a linked list in this
situation, and the search times are linear rather than logarithmic. This problem can be solved
at the cost of greater insert and delete times by using a height-balanced tree system such
as an AVL tree, but the shuffling around of nodes that's implicit in the rebalancing can
destroy the ordering that made deletions easy to do. The other disadvantage of a
tree-based symbol table is that collisions, situations where the same name is used for
variables at different scope levels, are difficult to resolve. This problem, too, can be solved at
the expense of lookup time. For example, you could have two key fields in the data
structure used to represent a tree node, one for the name and one for the nesting level:
    typedef struct tree_node
    {
        char              name[32];     /* Variable name.       */
        int               level;        /* Nesting level.       */
        struct tree_node *right;        /* Right-child pointer. */
        struct tree_node *left;         /* Left-child pointer.  */
        INFO_TYPE         info;         /* Other information.   */
    }
    tree_node;
and then use both fields when comparing two nodes:
    compare( node1, node2 )
    struct tree_node *node1, *node2;
    {
        if( node1->level != node2->level )
            return( node1->level - node2->level );
        else
            return( strcmp(node1->name, node2->name) );
    }
11. See: [Kruse] pp. 357-371 and [Tenenbaum] pp. 461-472 for a general discussion of AVL trees. [Holub 1]
contains a C implementation of an AVL-tree database manager.
You could also solve the collision problem by adding an additional field to the tree
node: a pointer to the head of a linked list of conflicting nodes. Newly added entries
would be put at the head of the list, and so would be found first when the list was
searched. The system is pictured in Figure 6.8; it uses the following data structure:
    typedef struct tree_node
    {
        char              name[32];
        struct tree_node *right;        /* Right-child pointer.        */
        struct tree_node *left;         /* Left-child pointer.         */
        struct tree_node *conflicts;    /* Conflicting-name-list head. */
        INFO_TYPE         info;         /* Other information.          */
    }
    tree_node;
Figure 6.8. Using a Linked List to Resolve Collisions in a Tree-Based Table
There is one final problem with a tree-based symbol table. The use of global
variables is discouraged by most proponents of structured programming because it's difficult
to determine how global variables change value. As a consequence, a well-structured
program accesses local variables more often than global ones. The local variables are
added to the symbol table last, however, and these nodes tend to be farther away from
the root in a tree-based table, so it takes longer to find them.
It turns out that the best data structure for most symbol-table applications is a
hash table. An ideal hash table is an array that is indexed directly by the key field of the
object in the table. For example, if the key field is a string made up of letters, digits, and
underscores, the string could be treated as a base 63 number (26 lower-case letters + 26
upper-case letters + 10 digits + an underscore = 63 possible characters). Unfortunately,
an array that could be indexed directly by a 16-character name would require 63^16 - 1 or
roughly 60,000,000,000,000,000,000,000,000,000 elements, most of which would be
wasted because an average symbol table has only a few hundred objects in it.
The solution to this problem is to compress the array. A hash table is an array,
indexed by key, that is compressed so that several elements of the uncompressed array
can be found at a single location in the compressed array. To do this, you convert the
key field that's used as the index in the uncompressed array into a pseudo-random
number which is used as an index into the compressed array. This randomization
process is called hashing and the number so generated is the key's hash value. The same
key should always hash to the same pseudo-random number, but very similar keys
should have very different hash values.
Collisions (situations where two keys hash to the same value) are resolved by making
each array element the head of a linked list of table elements. This method is appropriate
in a symbol-table application because, if you always put the new node at the head of
a chain rather than the end, local variables automatically preempt global variables with
the same name; they'll be found first when the list is searched. They are also found
more quickly than they would be if stored in a binary tree.
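The head-of-chain insertion described above can be sketched as follows. This is a minimal illustration, not the hash-table package of Appendix A: the names bucket, hash_sum(), hash_insert(), and hash_find(), the level field, and the table size of three are all assumptions chosen for the example, and the hash function simply sums the characters of the name.

```c
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 3                 /* deliberately tiny, to force collisions */

typedef struct bucket
{
    struct bucket *next;            /* next element in the collision chain */
    const char    *name;            /* key                                 */
    int            level;           /* nesting level of the declaration    */
} bucket;

static bucket *Hash_tab[ HASH_SIZE ];

/* Add the characters together and reduce the sum to a table index. */
static unsigned hash_sum( const char *name )
{
    unsigned sum = 0;
    for( ; *name; ++name )
        sum += (unsigned char)*name;
    return sum % HASH_SIZE;
}

/* Put the new node at the HEAD of its chain. Returns NULL if out of memory. */
bucket *hash_insert( const char *name, int level )
{
    unsigned h = hash_sum( name );
    bucket  *b = malloc( sizeof *b );

    if( !b )
        return NULL;
    b->name     = name;
    b->level    = level;
    b->next     = Hash_tab[h];
    Hash_tab[h] = b;
    return b;
}

/* Return the first match on the chain, which is the most recently added. */
bucket *hash_find( const char *name )
{
    bucket *b;
    for( b = Hash_tab[ hash_sum(name) ]; b; b = b->next )
        if( strcmp( b->name, name ) == 0 )
            return b;
    return NULL;
}
```

If a global "a" is inserted at level 0 and a local "a" at level 1, hash_find("a") returns the local one because it sits nearer the head of the chain.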
A simple hash table of size three is pictured in Figure 6.9. The table has four
members, with the keys "a", "b", "c", and "d". The hash values are computed by
treating the name strings as numbers. An ASCII 'a' has the decimal value 97, 'b' is
98, and so on. These numbers are truncated down to an array index using a modulus
division by the table size (3 in this case):
    name    numeric value    value MOD table size
    "a"           97                  1
    "b"           98                  2
    "c"           99                  0
    "d"          100                  1
Figure 6.9. A Simple Hash Table
    hash_tab[0] --> "c"
    hash_tab[1] --> "d" --> "a"
    hash_tab[2] --> "b"
A simple hash function could just add together the characters in the name as if they
were numbers, and then truncate the resulting sum to a valid array index with a modulus
operation. A better function is hashpjw, shown in Listing 6.21. Hashpjw uses an
exclusive-or and shift strategy to randomize the hash value.
Listing 6.21. hashpjw.c A 16-bit hashpjw Implementation
 1  unsigned hashpjw( name )
 2  unsigned char *name;
 3  {
 4      unsigned hash_val = 0;
 5      unsigned i;
 6
 7      for(; *name ; ++name )
 8      {
 9          hash_val = (hash_val << 2) + *name ;
10
11          if( i = hash_val & ~0x3fff )
12              hash_val = (hash_val ^ (i >> 12)) & 0x3fff ;
13      }
14      return hash_val;
15  }
There has probably been more waste paper devoted to the subject of hash algorithms
than any other topic in Computer Science. In general, the complex algorithms so
described give very good theoretical results, but are so slow as to be impractical in a real
application. An ideal hash function generates a minimum number of collisions, and the
average search time (which is proportional to the mean chain length) should be roughly
the same as an equivalent data structure such as a binary tree; it should be logarithmic.
The maximum path length in a perfectly balanced tree with N elements is log2 N. A tree
with 255 elements has a maximum chain length of 8 and an average chain length of
roughly 7 [(128x8 + 64x7 + 32x6 + ... + 2x2 + 1x1)/255]. You're okay as long as you can
get a comparable average chain length in a hash table.
12. The collision-resolution method used here is called open hashing. There are other ways of resolving
collisions that won't be discussed because they are not much use in most practical applications. See
[Tenenbaum] pp. 521-574 and [Kruse] pp. 112-135.
13. Hashpjw was developed by P.J. Weinberger and is described in [Aho], p. 436. The version in Listing 6.21
is optimized somewhat from the one in [Aho] and is tailored to a 16-bit int. There's a more general
purpose version of the same function in Appendix A.
It's difficult to predict chain lengths from a given algorithm because the words found
in the input control these lengths. An empirical analysis is quite workable, however,
given a large enough data set. Table 6.10 shows the behavior of the hash_add() and
hash_pjw() functions from Appendix A. The former just adds together the characters
in the name; the latter is in Listing 6.21. 917 unique variable names gleaned from real C
programs were used for the test input, and the table is 127 elements long.
The elapsed time is the amount of time used to run the test program. Since the only
change was the choice of hash function, this number gives us a good indication of
efficiency. You'll note that hash_pjw() is about 9% slower than hash_add(). The
mean chain length for both algorithms is identical (it is simply the number of names
divided by the table size, 917/127), and both algorithms use shorter average
chain lengths than a binary tree of an equivalent size. Finally, the distribution of
chain lengths is actually a little better with hash_add() than with hash_pjw():
there are more shorter chains generated by hash_add().
Weighing the foregoing, it seems as if simple addition is the preferable of the two
algorithms because it yields somewhat shorter chains, executes marginally faster, and is
smaller. Addition has two significant disadvantages, the main one being that characters
are small numbers, so the hash values tend to pile up at one end of a large table. One of
the reasons that addition performed well in the current example is the relatively small
table size. hash_pjw() solves the bunching problem with the left shift on line nine of
Listing 6.21, making it more appropriate for larger tables. Also, addition can't
distinguish between identifiers that are permutations of the same name and hashpjw can, so the
latter is a better choice if this situation comes up. Aho, using names gleaned from Pascal
programs rather than C programs and a larger table, got better results from hashpjw
than I did.
Hash tables are almost ideal data structures for use in a symbol table. They are as
efficient as (if not more so than) binary trees, and are more appropriate for those
applications in which several elements of the table might have the same key, as is the case when
local and global variables share the same name.
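The permutation problem is easy to demonstrate. The sketch below is an illustration only: hash_add() follows the description in the text (it just sums the characters), while hash_pjw16() is a hypothetical name for a portable rendering of the 16-bit shift-and-exclusive-or pattern of Listing 6.21, written with uint16_t so that it behaves the same way on machines whose int is wider than 16 bits. Neither function is copied from Appendix A.

```c
#include <stdint.h>

/* Additive hash: the order of the characters does not affect the sum. */
unsigned hash_add( const unsigned char *name )
{
    unsigned h = 0;
    for( ; *name; ++name )
        h += *name;
    return h;
}

/* 16-bit shift-and-exclusive-or hash in the style of hashpjw. The left
 * shift makes the result depend on character position; the top two bits
 * are folded back into the low bits and then cleared.
 */
uint16_t hash_pjw16( const unsigned char *name )
{
    uint16_t h = 0;
    uint16_t g;

    for( ; *name; ++name )
    {
        h = (uint16_t)( (h << 2) + *name );
        if( (g = (uint16_t)(h & 0xc000)) != 0 )
            h = (uint16_t)( (h ^ (g >> 12)) & 0x3fff );
    }
    return h;
}
```

With these definitions, "abc" and "cba" both give 294 under hash_add(), while hash_pjw16() gives 2043 for "abc" and 2073 for "cba". The caller still reduces either value modulo the table size.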
Table 6.10. Performance of Two Hash Functions
                            Addition        Hashpjw
    Elapsed time            9.06 seconds    9.61 seconds
    Mean chain length       7.22047         7.22047
    Standard deviation      2.37062         2.54901
    Maximum chain length    14              13
    Minimum chain length    2               1
    Chains of length 1      0               1
    Chains of length 2      1               3
    Chains of length 14     1               0
6.3.3 Implementing the Symbol Table
I will use the hash functions described in Appendix A for the database layer of our
symbol table. To summarize the appendix, the hash-table system described in Appendix
A uses two data structures: the hash table itself is an array of pointers to buckets, each
bucket being a single data-base record. The buckets are organized as a linked list. When
two keys hash to the same value, the conflicting node is inserted at the beginning of the
list. This list of buckets is doubly linked: each bucket contains a pointer to both its
predecessor and successor in the list. Arbitrary nodes can then be deleted from the
middle of a chain without having to traverse the entire chain. You should review these
functions now.
The basic hash functions are fine for simple symbol-table applications. They were
used to good effect in occs and LLama, for example. Most programming languages
require a little more complexity, however. First of all, many internal operations require
the compiler to treat all variables declared at a common nesting level as a single block.
For example, local variables in C can be declared at the beginning of any curly-brace
delimited compound statement. A function body is not a special case; it's treated
identically to a compound statement attached to a while statement, for example. The scope
of any local variable is defined by the limits of the compound statement. For all
practical purposes, a variable ceases to exist when the close curly-brace that ends the block in
which it is declared is processed. The variable should be removed from the symbol table
at that time. (Don't confuse compile and run time here. A static local variable
continues to exist at run time, but it can be deleted from the symbol table because it can't be
accessed at compile time once it is out of scope.)
The compiler also has to be able to traverse the list of local variables in the order that
they were declared. For example, when the compiler is setting up the stack frame, it has
to traverse the list of arguments in the proper order to determine the correct offsets from
the frame pointer. Both of those situations can be handled by providing a set of cross
links that connect all variables at a particular nesting level. For example, there are three
nesting levels in the following fragment:
    int Godot;
    waiting( vladimir, estragon )
    {
        int pozzo;
        while( condition )
        {
            int pozzo, lucky;
        }
    }
Godot and waiting are at the outer level, the subroutine arguments and the first pozzo
comprise the second block, and the third block has the second pozzo in it. Note that the
arguments are grouped with the outer local variables and that the inner pozzo shadows
the outer one while the inner block is active; the two pozzos are different variables,
occupying different parts of the stack frame. Figure 6.10 shows how the symbol table
looks when the innermost block is being processed. I'm assuming that vladimir
hashes to the same value as pozzo. Note that the subroutine name [waiting()] is
considered part of the outer block, but the arguments are part of the inner block. The hash
table itself is at the left; the solid arrows are links used to resolve collisions within the
table. Here, the array across the top holds the heads of the cross-link chains. This array
could be eliminated by passing the head-of-chain pointers as attributes on the value
stack, however. You can visit all symbol-table entries for variables at a given scoping
level by traversing the cross links (the dashed lines). If the head-of-chain array were a
stack, you could delete the nodes for a particular level by popping the head-of-list
pointer and traversing the list, deleting nodes. The current C compiler's symbol table is
organized with cross links, as just described. A hash-table element is pictured in Figure
6.11.
The top two fields (next and prev) are maintained by the hash-table functions
described in depth in Appendix A. They point at the previous and next node in the
collision chain. The double indirection lets you delete an arbitrary element from the table
without having to traverse the collision chain to find the predecessor. The bottom fields
are managed by the maintenance-layer, symbol-table functions and are declared as the
symbol structure in Listing 6.22 along with the symbol table itself.
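The reason double indirection makes deletion cheap can be seen in a small sketch. It assumes, as an illustration, that prev holds the address of whichever pointer points at the node (either the table slot or the next field of the predecessor). The names bucket, chain_insert(), and chain_delete() are hypothetical; the real functions are in Appendix A.

```c
#include <stddef.h>

typedef struct bucket
{
    struct bucket  *next;   /* successor in the collision chain              */
    struct bucket **prev;   /* address of the pointer that points at us:     */
                            /* the table slot or the predecessor's next field */
} bucket;

/* Link b in at the head of the chain whose head pointer is *head. */
void chain_insert( bucket **head, bucket *b )
{
    b->next = *head;
    b->prev = head;
    if( *head )
        (*head)->prev = &b->next;
    *head = b;
}

/* Unlink b. No search for the predecessor is needed, and the head of the
 * chain is not a special case, because *b->prev is always the one pointer
 * that has to change.
 */
void chain_delete( bucket *b )
{
    *b->prev = b->next;
    if( b->next )
        b->next->prev = b->prev;
}
```

With an ordinary predecessor pointer, deleting the first node of a chain would need a separate case to update the table slot.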
The name field in the symbol structure is the name as it appears in the input. The
rname field is the symbol's name as it appears in the output. In the case of a global
variable, rname holds the input name with a leading underscore prepended. Local static
variables are given arbitrary names by the compiler because local statics in two
subroutines could have the same name. The names would conflict if both were used in the
output, so an arbitrary name is used. Compiler-supplied names don't have leading
underscores so can't conflict with user-supplied names. The rname field for automatic
variables and subroutine arguments is a string which, when used in an operand, evaluates to
the contents of the variable in question. For example, if an int variable is at offset -8
from the frame pointer, the rname will be "fp-8".
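The two naming rules that the text spells out (a leading underscore for globals, a frame-pointer offset for automatic variables) could be implemented along the following lines. The helper names global_rname() and auto_rname() and the RNAME_LEN constant are hypothetical, and the sketch uses the C99 snprintf() function rather than anything from the compiler's own sources.

```c
#include <stdio.h>

#define RNAME_LEN 32        /* assumed maximum length of an output name */

/* Global variable: the input name with a leading underscore. */
void global_rname( char *rname, const char *name )
{
    snprintf( rname, RNAME_LEN + 1, "_%s", name );
}

/* Automatic variable or argument: an offset from the frame pointer.
 * The %+d conversion always prints the sign, giving "fp-8" or "fp+4".
 */
void auto_rname( char *rname, int offset )
{
    snprintf( rname, RNAME_LEN + 1, "fp%+d", offset );
}
```

The buffer passed to either function must hold at least RNAME_LEN + 1 characters.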
The level field holds the declaration level for the symbol (0 for global symbols, 1 if
they're at the outermost block in a function, and so forth). level is used primarily for
error checking. It helps detect a duplicate declaration.
The duplicate bit marks the extra symbols created by duplicate declarations like
the following:
Figure 6.10. A Cross-linked Symbol Table
Figure 6.11. A Symbol-Table Element
    maintained by hash functions:
        next        --> next node in hash table
        prev        --> previous node in hash table
    symbol structure, maintained by symbol-table functions:
        name
        rname
        level, implicit, duplicate
        type, etype --> system of structures representing type
        args        --> linked list of symbol structures, one per argument
        next        --> (cross link) to next variable at current level;
                        the cross link from the previous variable at the
                        current level points at this structure
Listing 6.22. symtab.h The Symbol Table
    /* SYMTAB.H   Symbol-table definitions. Note that <tools/debug.h> and
     *            <tools/hash.h> must be #included (in that order) before the
     *            #include for the current file.
     */

    #ifdef ALLOC                    /* Allocate variables if ALLOC defined. */
    #    define ALLOC_CLS           /* empty */
    #else
    #    define ALLOC_CLS extern
    #endif

    #define NAME_MAX  32            /* Maximum identifier length.    */
    #define LABEL_MAX 32            /* Maximum output-label length.  */

    typedef struct symbol           /* Symbol-table entry.           */
    {
        unsigned char name [NAME_MAX+1];    /* Input variable name.  */
        unsigned char rname[NAME_MAX+1];    /* Actual variable name. */

        unsigned level     : 13 ;   /* Declaration lev., field offset.   */
        unsigned implicit  : 1  ;   /* Declaration created implicitly.   */
        unsigned duplicate : 1  ;   /* Duplicate declaration.            */

        struct link   *type;        /* First link in declarator chain.   */
        struct link   *etype;       /* Last link in declarator chain.    */
        struct symbol *args;        /* If a funct decl, the arg list.    */
                                    /* If a var, the initializer.        */
        struct symbol *next;        /* Cross link to next variable at    */
                                    /* current nesting level.            */
    } symbol;

    ALLOC_CLS HASH_TAB *Symbol_tab;     /* The actual table. */
    extern int x;
    int x = 5;
It's convenient not to discard these symbols as soon as the duplication is found. The
duplicate symbol is not put into the symbol table, however, and this bit lets you
determine that the symbol has not been inserted into the table so you can delete it at a
later time.
symbol.implicit
The implicit field is used to distinguish undeclared variables from implicit function
declarations created when a subroutine is used before being defined. An implicit
declaration of type int is created when an undeclared identifier is encountered by the
compiler. If that identifier is subsequently used in a function call, its type is modified to
"function returning int" and the implicit bit is cleared. If a symbol is still marked
implicit when the compiler is done with the current block, then it is an undeclared
variable and an appropriate error message is printed.
symbol.type, symbol.etype, symbol.args
The type and etype fields in the symbol structure point to yet another data structure
(discussed shortly) that describes the object's type. args keeps track of function
arguments until they are added to the symbol table. It is the head of a linked list of
symbol structures, one for each argument.
symbol.next
Finally, the next pointer is the cross link to the next variable at the same nesting
level. Note that this field points at the symbol component of the combined structure,
not at the header maintained by the hash functions.
6.3.4 Representing Types: Theory
The next issue is the representation of a variable's type. If a language is simple
enough, types can be represented with a simple numeric coding in the symbol structure.
For example, if a language had only two types, integer and floating point, you could
define two constants like this:
    #define INTEGER 0
    #define FLOAT   1
and add a field to the symbol structure that would be set to one or the other of these
values. Pointers could be represented in a similar way, with a second variable keeping
track of the levels of indirection. For example, given a declaration like int ***p, this
second variable would hold 3. This sort of typing system is called a constrained system
because there are only a limited number of possible types.
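A minimal sketch of such a constrained system follows. The simple_symbol structure and the describe() function are hypothetical names invented for this illustration; they are not part of the compiler developed in this chapter. The whole type is just a numeric code plus an indirection count:

```c
#include <stdio.h>

#define INTEGER 0
#define FLOAT   1

/* Hypothetical symbol for a constrained type system: one numeric type code
 * plus a count of the levels of indirection.
 */
typedef struct simple_symbol
{
    const char *name;
    int         type;         /* INTEGER or FLOAT                        */
    int         indirection;  /* 0 if not a pointer, 3 for "int ***p".   */
} simple_symbol;

/* Write a description such as "p: pointer to integer" into buf, which holds
 * size bytes. Output is truncated if buf is too small. Returns buf.
 */
char *describe(const simple_symbol *sym, char *buf, size_t size)
{
    size_t used = (size_t)snprintf(buf, size, "%s: ", sym->name);
    int    i;

    for (i = 0; i < sym->indirection && used < size; ++i)
        used += (size_t)snprintf(buf + used, size - used, "pointer to ");

    if (used < size)
        snprintf(buf + used, size - used, "%s",
                 sym->type == FLOAT ? "float" : "integer");
    return buf;
}
```

Given a symbol initialized as { "p", INTEGER, 3 }, describe() produces "p: pointer to pointer to pointer to integer". Nothing in this scheme can describe an array of pointers or a pointer to a function, which is the limitation the rest of this section addresses.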
The situation is more complex in a language like C, which has an unconstrained typing
system that permits virtually unlimited complexity in a variable declaration. A C
variable's type must be represented by a system of data structures working in concert.
You can see what's required by looking at how a C variable declaration is organized.
Variable declarations have two parts: a specifier part, which is a list of various keywords
(int, long, extern, struct, and so forth), and a declarator part that is made up of the
variable's name and an arbitrary number of stars, array-size specifiers (like [10]), and
parentheses (used both for grouping and to indicate a function).
The specifier is constrained (there are only a limited number of legal combinations
of keywords that can be used here), so it can be represented by a single structure. The
declarator is not constrained, however: any number of stars, brackets, or parentheses are
permitted in any combination. Because of this organization, a type can be represented
using two kinds of structures, one representing the specifier and another representing the
declarator. The type definition is a linked list of these structures, and the type field in
the symbol structure that we looked at earlier points at the head of the list (etype
points at the end). All type representations have exactly one specifier structure, though
there can be any number of declarators (including none), and it's convenient for the
specifier to be at the end of the linked list. I'll explain why in a moment.
Let's look at some examples. A simple declaration of the form
short Quasimodo; is represented like this:
    symbol (name "Quasimodo") --type, etype--> specifier (short int)
A pointer type, like long *Gringoire; adds a declarator structure, like this:
    symbol (name "Gringoire") --type--> declarator (pointer to) --> specifier (long int) <--etype
You read down the list just like you would parse the declaration in English: Gringoire
is a pointer to a long. The symbol node holds the name, the second holds an indicator
that the variable is a pointer, and the third node holds the long. An array of longs
declared with long Coppenole[10]; is represented as follows:
    symbol (name "Coppenole") --type--> declarator (array, 10 elements) --> specifier (long int) <--etype
A pointer to an array of longs like long (*Frollo)[10]; has a pointer node inserted
in between the symbol structure and the array declarator, like this:
    symbol (name "Frollo") --type--> declarator (pointer to) --> declarator (array of, 10 elements) --> specifier (long int)
An array of pointers to longs such as long *Esmerelda[10] has the "pointer to" and
"array of" nodes transposed, like this:
    symbol (name "Esmerelda") --type--> declarator (array of, 10 elements) --> declarator (pointer to) --> specifier (long int)
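The chains pictured above can be read mechanically from head to tail. The following sketch shows one way to do that; tlink is a simplified, hypothetical stand-in for the chain element (the structures the compiler really uses are developed in the next section), and type_to_english() is likewise an invented name:

```c
#include <stdio.h>

enum { DECLARATOR, SPECIFIER };        /* Kind of list element. */
enum { POINTER, ARRAY, FUNCTION };     /* Kind of declarator.   */

/* Simplified, hypothetical chain element. */
typedef struct tlink
{
    int           kind;      /* DECLARATOR or SPECIFIER                   */
    int           dcl_type;  /* POINTER, ARRAY, or FUNCTION               */
    int           num_ele;   /* Element count if dcl_type == ARRAY        */
    const char   *spec;      /* e.g. "long int" if kind == SPECIFIER      */
    struct tlink *next;      /* NULL in the specifier, which comes last.  */
} tlink;

/* Walk the chain from head to tail, writing an English description of the
 * type into buf (size bytes, truncated if too small). Returns buf.
 */
char *type_to_english(const tlink *p, char *buf, size_t size)
{
    size_t used = 0;

    if (size == 0)
        return buf;
    buf[0] = '\0';

    for (; p && used < size; p = p->next)
    {
        int n;
        if (p->kind == SPECIFIER)
            n = snprintf(buf + used, size - used, "%s", p->spec);
        else if (p->dcl_type == POINTER)
            n = snprintf(buf + used, size - used, "pointer to ");
        else if (p->dcl_type == ARRAY)
            n = snprintf(buf + used, size - used,
                         "array (%d elements) of ", p->num_ele);
        else
            n = snprintf(buf + used, size - used, "function returning ");

        if (n < 0)
            break;
        used += (size_t)n;
    }
    return buf;
}
```

A chain built as array(10), then pointer, then a "long int" specifier (the Esmerelda example) comes out as "array (10 elements) of pointer to long int"; swapping the first two nodes (the Frollo example) gives "pointer to array (10 elements) of long int".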
This system has an important characteristic that will be quite useful for generating
code. The array-bracket, indirection, and structure-access operators generate temporary
variables, just like the simple addition and multiplication operators discussed in earlier
chapters. The types of these temporaries can be derived from the original type
representation, often just by removing the first element in the chain. For example, in the earlier
array-of-pointers-to-long example, an expression like Esmerelda[1] generates a
temporary of type "pointer to long" that holds the referenced array element. The
temporary's type is just the original chain, less the leading "array of" node. The
address-of operator (&) is handled in much the same way by adding a "pointer to" node
to the left of the chain.
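Both derivations amount to a single pointer operation on the chain. The sketch below uses a cut-down, hypothetical tlink node and invented function names (deref_type() and address_of_type()). Sharing the tail of the operand's chain, rather than copying it, is an assumption made here to keep the example short:

```c
#include <stdlib.h>

enum { DECLARATOR, SPECIFIER };
enum { POINTER, ARRAY, FUNCTION };

/* Cut-down, hypothetical chain element. */
typedef struct tlink
{
    int           kind;       /* DECLARATOR or SPECIFIER            */
    int           dcl_type;   /* POINTER, ARRAY, or FUNCTION        */
    int           num_ele;    /* Element count if an array.         */
    struct tlink *next;
} tlink;

/* Type of the temporary produced by subscripting or dereferencing: the
 * operand's chain less its leading "array of" or "pointer to" node.
 * Returns NULL if the operand is neither an array nor a pointer.
 */
const tlink *deref_type(const tlink *type)
{
    if (type && type->kind == DECLARATOR &&
        (type->dcl_type == ARRAY || type->dcl_type == POINTER))
        return type->next;
    return NULL;
}

/* Type of &x: a new "pointer to" node added to the left of x's chain.
 * The new node shares the rest of the chain with the operand (an
 * assumption of this sketch). Returns NULL if memory runs out.
 */
tlink *address_of_type(tlink *type)
{
    tlink *p = calloc(1, sizeof *p);
    if (p)
    {
        p->kind     = DECLARATOR;
        p->dcl_type = POINTER;
        p->next     = type;
    }
    return p;
}
```

Applied to the Esmerelda chain, deref_type() returns the "pointer to" node, which heads the chain for "pointer to long". This is why it is convenient for the specifier to come last: stripping or adding a leading node never disturbs it.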
6.3.5 Representing Types: Implementation
The foregoing system of declarators and specifiers can be implemented with a system
of several structures. A declarator is represented with the structure in Listing 6.23.
There are two fields: dcl_type identifies the declarator as a pointer, array, or function,
using the values defined on lines 33 to 35. If the declarator is an array, num_ele holds
the number of elements. You'll need this number to increment a pointer to an array
correctly, and to figure an array's size.
Listing 6.23. symtab.h: A Declarator
33  #define POINTER   0      /* Values for declarator.dcl_type. */
34  #define ARRAY     1
35  #define FUNCTION  2
36
37  typedef struct declarator
38  {
39      int dcl_type;        /* POINTER, ARRAY, or FUNCTION       */
40      int num_ele;         /* If dcl_type==ARRAY, # of elements */
41  } declarator;
Specifiers are represented by the structure and macros in Listing 6.24. The situation
is simplified here because the current C compiler ignores the const and volatile
keywords. A C specifier consists of a basic type (call it a noun), and various modifiers
(call them adjectives). C supports several nouns: char, int, float, double, void, struct,
union, and enum. Only four of these are necessary in the current compiler, however:
char, int, void, and struct. You would add float to this list if floating point were
supported. A double could be treated as a long float. Enumerated types are treated
as if they were ints. Unions are treated identically to structures; all the offsets to the
fields are zero, however. An int is implied if a noun is missing from a declaration, as in
long x. C also supports one implicit identifier that is declared just by using it: a label.
The LABEL on line 46 takes care of this sort of identifier. The basic nouns are
represented by the noun field on line 64 of Listing 6.24, and the possible values for this
field are defined on lines 42 to 46.
Next come the adjectives. These fall naturally into several categories that have
mutually exclusive values. The most complex of these is the storage class represented
by the sclass field on line 65 of Listing 6.24, which can hold one of the values defined
on lines 47 to 52. These values are mutually exclusive: you can't have an
extern register, for example. Note that the TYPEDEF class is used only to pass
information around during the declaration. typedefs are also marked by a bit in the
first element of the type chain; it's convenient to have this information duplicated while
the declaration is being processed (we'll see why in a moment). Note that AUTO
designates anything that can be on the run-time stack, and FIXED is anything at a fixed
address in memory, whether or not it's declared static. The static keyword is used
by the compiler to decide whether or not to make the variable public when it generates
the code to allocate space for that variable.
The CONSTANT storage class is used for two purposes. First, when you declare an
enumerated type like this:
    enum rabbits
    {
        FLOPSY, MOPSEY, PETER, COTTONTAIL
    };
the compiler puts int entries for each of the elements of the enumerator list (FLOPSY,
MOPSEY, PETER, and COTTONTAIL) into the symbol table and sets the CONSTANT
attribute in the associated specifier. The const_val union on lines 72 to 82 of Listing
6.24 holds the value associated with the constant. A definition for FLOPSY is in Figure
6.12. The v_int field holds the numeric values of integer constants, but it's also used
for string constants. When the compiler sees a string constant, it outputs a definition of
the form
    char S1[] = "contents of string";
and the numeric component of the label is stored in v_int; each string label has a
unique numeric component. v_struct is used only if the current specifier
describes a structure, in which case the noun field is set to STRUCTURE and v_struct
points at yet another data structure (discussed momentarily) that describes the structure.
The oclass field on line 66 of Listing 6.24 remembers the C-code storage class that
is actually output with the variable definition for a global variable. The contents of this
field are undefined if the sclass field is not set to FIXED, and the possible values are
defined on lines 54 to 59.
The _long field on line 67 selects either of two lengths for an integer variable. In
the current application, the long keyword is ignored if the variable is of type char, and
the short keyword is always ignored. All of this is summarized in Table 6.11. The
Listing 6.24. symtab.h: A Specifier
42  #define INT        0     /* specifier.noun. INT has the value 0 so   */
43  #define CHAR       1     /* that an uninitialized structure defaults */
44  #define VOID       2     /* to int, same goes for EXTERN, below.     */
45  #define STRUCTURE  3
46  #define LABEL      4
47                           /* specifier.sclass                         */
48  #define FIXED      0     /*     At a fixed address.                  */
49  #define REGISTER   1     /*     In a register.                       */
50  #define AUTO       2     /*     On the run-time stack.               */
51  #define TYPEDEF    3     /*     Typedef.                             */
52  #define CONSTANT   4     /*     This is a constant.                  */
53
54                           /* Output (C-code) storage class            */
55  #define NO_OCLASS  0     /*     No output class (var is auto).       */
56  #define PUB        1     /*     public                               */
57  #define PRI        2     /*     private                              */
58  #define EXT        3     /*     extern                               */
59  #define COM        4     /*     common                               */
60
61  typedef struct specifier
62  {
63
64      unsigned noun      :3;  /* CHAR INT STRUCTURE LABEL                   */
65      unsigned sclass    :3;  /* REGISTER AUTO FIXED CONSTANT TYPEDEF       */
66      unsigned oclass    :3;  /* Output storage class: PUB PRI COM EXT.     */
67      unsigned _long     :1;  /* 1=long.     0=short.                       */
68      unsigned _unsigned :1;  /* 1=unsigned. 0=signed.                      */
69      unsigned _static   :1;  /* 1=static keyword found in declarations.    */
70      unsigned _extern   :1;  /* 1=extern keyword found in declarations.    */
71
72      union                   /* Value if constant:                         */
73      {
74          int            v_int;    /* Int & char values. If a string const., */
75                                   /* is numeric component of the label.     */
76          unsigned int   v_uint;   /* Unsigned int constant value.           */
77          long           v_long;   /* Signed long constant value.            */
78          unsigned long  v_ulong;  /* Unsigned long constant value.          */
79
80          struct structdef *v_struct;  /* If this is a struct, points at a  */
81                                       /* structure-table element.          */
82      } const_val;
83
84  } specifier;
unsi gned bit on line 68 is used in much the same way as l ong. The ext er n and
st at i c bits on the next two lines remember when the equivalent keyword is found in
the input as the specifier list is parsed. They are needed by the compiler to figure the
output storage class after the entire specifier has been processed.
Theres one final problem. A declaration list can be made up of two types of struc
tures: it can have zero or more decl ar at or s in it, and it always has exactly one
You need some way to determine if a list element is a declarator or a
specifier when all you have to work with is a pointer to an element. The problem is
Encapsulation. solved by encapsulating the two structures into a third structure that can be either a
The l i nk structure. declarator or specifier. This is done with the l i nk structure in Listing 6.25. The cl ass
field tells us what the following union contains. Its either a DECLARATOR or a
Fi gure 6.12. An Enumerator-List Element.
symbol :
name
t ype
"FLOPSY"
l i nk:
cl ass=SPECI FI ER
s:
noun I NT
const val :
v i nt =0
Tabl e 6.11. Processing l ong and short.
input noun l ong length notes
l ong i nt I NT true 32 bits
shor t i nt I NT false 16 bits same as i nt
i nt I NT false 16 bits
l ong char CHAR true 8 bits same as char
shor t char CHAR false 8 bits same as char
char CHAR false 8 bits
SPECI FI ER, as defined on lines 85 and 86. The next field points at the next element in
the type chain. Its NULL if this is the specifier, which must come last in the chain.
Finally, the t def field is used when processing typedefs. I ts used to distinguish
whether a type chain was created by a t ypedef or by a normal declaration. You could
treat t ypedef as a storage class, and mark it as such it in the scl ass field of the
specifier, but its convenient not to have to chase down the length of the chain to the
specifier to get this information.
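Table 6.11 can be read as a two-input function of the noun and the long bit. The following is only a sketch of that reading: the N_INT and N_CHAR constants and the width_in_bits() function are hypothetical names invented here, and the widths are the ones given in the table.

```c
/* Hypothetical stand-ins for the INT and CHAR noun values. */
enum noun { N_INT, N_CHAR };

/* Width in bits of an integral type, following Table 6.11. The long bit is
 * honored only for int. The short keyword is always ignored, so it needs no
 * parameter at all.
 */
int width_in_bits(enum noun noun, int is_long)
{
    if (noun == N_CHAR)
        return 8;                  /* long char and short char are just char */

    return is_long ? 32 : 16;      /* long int, or int and short int         */
}
```

Because short never changes the result, a single bit in the specifier is enough to record both keywords.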
Listing 6.25. symtab.h: A link in the Declaration Chain
 85  #define DECLARATOR  0
 86  #define SPECIFIER   1
 87
 88  typedef struct link
 89  {
 90      unsigned class : 1;     /* DECLARATOR or SPECIFIER              */
 91      unsigned tdef  : 1;     /* For typedefs. If set, current link   */
 92                              /* chain was created by a typedef.      */
 93      union
 94      {
 95          specifier   s;      /* If class == SPECIFIER                */
 96          declarator  d;      /* If class == DECLARATOR               */
 97      }
 98      select;
 99      struct link  *next;     /* Next element of chain.               */
100
101  } link;
102
103  /*----------------------------------------------------------------------
104   * Use the following p->XXX where p is a pointer to a link structure.
105   */
106
107  #define NOUN       select.s.noun
108  #define SCLASS     select.s.sclass
109  #define LONG       select.s._long
110  #define UNSIGNED   select.s._unsigned
111  #define EXTERN     select.s._extern
112  #define STATIC     select.s._static
113  #define OCLASS     select.s.oclass
114
115  #define DCL_TYPE   select.d.dcl_type
116  #define NUM_ELE    select.d.num_ele
117
118  #define VALUE      select.s.const_val
119  #define V_INT      VALUE.v_int
120  #define V_UINT     VALUE.v_uint
121  #define V_LONG     VALUE.v_long
122  #define V_ULONG    VALUE.v_ulong
123  #define V_STRUCT   VALUE.v_struct
124
125  /*----------------------------------------------------------------------
126   * Use the following XXX(p) where p is a pointer to a link structure.
127   */
128
129  #define IS_SPECIFIER(p)     ( (p)->class == SPECIFIER  )
130  #define IS_DECLARATOR(p)    ( (p)->class == DECLARATOR )
131  #define IS_ARRAY(p)         ( (p)->class == DECLARATOR && (p)->DCL_TYPE==ARRAY    )
132  #define IS_POINTER(p)       ( (p)->class == DECLARATOR && (p)->DCL_TYPE==POINTER  )
133  #define IS_FUNCT(p)         ( (p)->class == DECLARATOR && (p)->DCL_TYPE==FUNCTION )
134  #define IS_STRUCT(p)        ( (p)->class == SPECIFIER  && (p)->NOUN == STRUCTURE  )
135  #define IS_LABEL(p)         ( (p)->class == SPECIFIER  && (p)->NOUN == LABEL      )
136
137  #define IS_CHAR(p)          ( (p)->class == SPECIFIER  && (p)->NOUN == CHAR )
138  #define IS_INT(p)           ( (p)->class == SPECIFIER  && (p)->NOUN == INT  )
139  #define IS_UINT(p)          ( IS_INT(p) && (p)->UNSIGNED )
140  #define IS_LONG(p)          ( IS_INT(p) && (p)->LONG     )
141  #define IS_ULONG(p)         ( IS_INT(p) && (p)->LONG && (p)->UNSIGNED )
142
143  #define IS_UNSIGNED(p)      ( (p)->UNSIGNED )
144
145  #define IS_AGGREGATE(p)     ( IS_ARRAY(p) || IS_STRUCT(p)  )
146  #define IS_PTR_TYPE(p)      ( IS_ARRAY(p) || IS_POINTER(p) )
147
148  #define IS_CONSTANT(p)      ( IS_SPECIFIER(p) && (p)->SCLASS == CONSTANT )
149  #define IS_TYPEDEF(p)       ( IS_SPECIFIER(p) && (p)->SCLASS == TYPEDEF  )
150  #define IS_INT_CONSTANT(p)  ( IS_CONSTANT(p)  && (p)->NOUN   == INT      )
The macros on lines 107 to 123 of Listing 6.25 clean up the code a little by getting
rid of some of the dots and field names. For example, if p is a pointer to a link, you can
say p->V_INT rather than
    p->select.s.const_val.v_int
to access that field.
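The following sketch shows the macros in use on a cut-down copy of the structures. Only the fields needed for the example are reproduced, and int_constant_value() is a hypothetical helper invented for this illustration, not a routine from the compiler:

```c
#include <stddef.h>

#define DECLARATOR 0
#define SPECIFIER  1
#define INT        0
#define CONSTANT   4

/* Cut-down versions of the specifier, declarator, and link structures. */
typedef struct specifier
{
    unsigned noun   : 3;
    unsigned sclass : 3;
    union { int v_int; long v_long; } const_val;
} specifier;

typedef struct declarator { int dcl_type; int num_ele; } declarator;

typedef struct link
{
    unsigned class : 1;                     /* DECLARATOR or SPECIFIER */
    union { specifier s; declarator d; } select;
    struct link *next;
} link;

#define NOUN    select.s.noun
#define SCLASS  select.s.sclass
#define VALUE   select.s.const_val
#define V_INT   VALUE.v_int

#define IS_SPECIFIER(p)     ( (p)->class == SPECIFIER )
#define IS_CONSTANT(p)      ( IS_SPECIFIER(p) && (p)->SCLASS == CONSTANT )
#define IS_INT_CONSTANT(p)  ( IS_CONSTANT(p)  && (p)->NOUN   == INT      )

/* Fetch the value of an integer constant into *out and return 1. Returns 0
 * and leaves *out alone if p does not describe an integer constant.
 */
int int_constant_value(const link *p, int *out)
{
    if (!p || !IS_INT_CONSTANT(p))
        return 0;

    *out = p->V_INT;          /* Expands to p->select.s.const_val.v_int */
    return 1;
}
```

The test on class matters: the union is shared, so reading V_INT through a link that is actually a declarator would reinterpret the declarator's fields.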
The foregoing system becomes even more complex when you introduce structures
and unions into the picture. You need two more data structures for this purpose. First of
all, the structure definitions are organized in an auxiliary symbol table called the
structure table, declared on line 159 of Listing 6.26, below. The v_struct field of a
specifier that describes a structure points at the structure-table element for that
structure. The table is indexed by tag name if there is one; untagged structures are
assigned arbitrary names. It contains structdef structures, defined on lines 151 to 157
of Listing 6.26. The structdef contains the tag name (tag), the nesting level at the
point of declaration (level), and a pointer to a linked list of field definitions (fields),
each of which is a symbol structure, one symbol for each field. The level field is
here so that an error message can be printed when a duplicate declaration is found; it's
recycled later to hold the offset to the field from the base address of the structure. These
offsets are all zero in the case of a union; that's the only difference between a structure
and a union, in fact. The symbol's next field links together the field definitions. This
organization means that you must use a linear search to find a field, but it lets you have
an arbitrary number of fields.
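The linear search mentioned above is short. This sketch uses simplified, hypothetical field and sdef structures in place of symbol and structdef, with an explicit offset member where the text recycles level:

```c
#include <string.h>

#define NAME_LEN 32          /* Hypothetical; plays the role of NAME_MAX. */

typedef struct field         /* Stand-in for a symbol used as a field.    */
{
    char          name[NAME_LEN + 1];
    unsigned      offset;    /* Offset from the base of the structure.    */
    struct field *next;      /* Next field in the same structure.         */
} field;

typedef struct sdef          /* Stand-in for a structdef.                 */
{
    char   tag[NAME_LEN + 1];
    field *fields;           /* Head of the field list.                   */
} sdef;

/* Linear search of the field list. Returns NULL if s is NULL or if no field
 * has the indicated name.
 */
field *find_field(const sdef *s, const char *name)
{
    field *f;

    for (f = s ? s->fields : NULL; f; f = f->next)
        if (strcmp(f->name, name) == 0)
            return f;

    return NULL;
}
```

The search cost grows with the number of fields, but structures are usually small, and the list imposes no upper limit on how many fields a structure can have.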
Listing 6.26. symtab.h: Representing Structures
151  typedef struct structdef
152  {
153      char           tag[NAME_MAX+1];  /* Tag part of structure definition.       */
154      unsigned char  level;            /* Nesting level at which struct declared. */
155      symbol         *fields;          /* Linked list of field declarations.      */
156      unsigned       size;             /* Size of the structure in bytes.         */
157  } structdef;
158
159  ALLOC_CLS HASH_TAB *Struct_tab;      /* The actual table. */
Figure 6.13 gives you an idea of how a reasonably complex declaration appears in
the complete symbol-table system. I've left out irrelevant fields in the figure. The
declaration that generated that table is as follows:
    struct argotiers
    {
        int     (*Clopin)();               /* Function pointer */
        double  Mathias[5];
        struct  argotiers *Guillaume;
        struct  pstruct { int a; } Pierre;
    }
    gipsy;
Note that isolating the structdef from the field lets you correctly process a
declaration like the following:
    struct one { struct two *p; };
    struct two { struct one *p; };
because you can create a structdef for struct two without having to know
anything about this second structure's contents. The fields can be added to the structdef
when you get to the struct two declaration.
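A sketch of that two-step process follows. It keeps the structure table as a simple linked list rather than a hash table, and the sdef structure, its defined flag, and the struct_ref() and struct_define() functions are all hypothetical names used only for illustration:

```c
#include <stdlib.h>
#include <string.h>

typedef struct sdef
{
    char         tag[33];
    int          defined;   /* Nonzero once the field list has been seen.  */
    struct sdef *next;      /* List link; the real table is a hash table.  */
} sdef;

static sdef *Table = NULL;  /* The (hypothetical) structure table. */

/* Return the entry for tag, creating an empty, undefined one the first time
 * the tag is mentioned. Returns NULL only if memory runs out.
 */
sdef *struct_ref(const char *tag)
{
    sdef *p;

    for (p = Table; p; p = p->next)
        if (strcmp(p->tag, tag) == 0)
            return p;

    if ((p = calloc(1, sizeof *p)) != NULL)
    {
        strncpy(p->tag, tag, sizeof p->tag - 1);  /* calloc supplied the '\0' */
        p->next = Table;
        Table   = p;
    }
    return p;
}

/* Called when the "{ ... }" body for the tag is finally parsed. */
void struct_define(sdef *p)
{
    p->defined = 1;
}
```

When the parser meets struct two *p inside struct one, struct_ref("two") creates the empty entry, and the pointer's type chain can refer to it immediately. The later struct two declaration finds the same entry and fills it in.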
The final part of symtab.h, in which various mappings from C types to C-code types
are defined, is in Listing 6.27. Note that the various WIDTH macros from
<tools/c-code.h> are used here, so you must include c-code.h before including symtab.h in your
file.
Figure 6.13. Representing a Structure in the Symbol Table
[Figure: the Symbol_tab entry for "gipsy" points through its type field at a specifier
link whose noun is STRUCT. That specifier's value points at the Struct_tab structdef
tagged "argotiers" (size 52), whose fields pointer heads a list of four symbols
cross-linked by their next fields: Clopin (a pointer declarator, then a function declarator,
then an INT specifier), Mathias (an array declarator of 5 elements, then a FLOAT specifier
marked long), Guillaume (a pointer declarator, then a STRUCT specifier), and Pierre (a
STRUCT specifier that refers to a second structdef tagged "pstruct", size 2, whose single
field is the symbol a, level 0, with an INT specifier). The last symbol in each field list
has a NULL next pointer.]
Listing 6.27. symtab.h: Sizes of Various Types
160  #define CSIZE  BYTE_WIDTH      /* char */
161  #define CTYPE  "byte"
162
163  #define ISIZE  WORD_WIDTH      /* int */
164  #define ITYPE  "word"
165
166  #define LSIZE  LWORD_WIDTH     /* long */
167  #define LTYPE  "lword"
168
169  #define PSIZE  PTR_WIDTH       /* pointer: 32-bit (8086 large model) */
170  #define PTYPE  "ptr"
171
172  #define STYPE  "record"        /* structure, size undefined */
173  #define ATYPE  "array"         /* array, size undefined     */
6.3.6 Implementing the Symbol-Table Maintenance Layer
Our next task is to build the symbol-table maintenance layer: the functions that
manipulate the data structures described in the last section. The first set of functions, in
Listing 6.28, take care of memory management. Three sets of similar routines are
provided; they maintain the symbol, link, and structdef structures, respectively.
Taking the symbol-maintenance routines on lines 23 to 81 of Listing 6.28 as
characteristic, the compiler creates and deletes symbols throughout the compilation process,
so it's worthwhile to minimize the create-and-delete time for a node. discard_symbol()
creates a linked list of freed nodes on line 63 rather than calling newsym() and
freesym() [the hash-table-function versions of malloc() and free()] for every
create and delete. Symbol_free points at the head of the free list; it is declared on line
15. new_symbol() calls newsym() on line 30 only if the free list is empty; otherwise
it just unlinks a node from the list on line 33. The routines for the other structures work
in much the same way. The only difference is that new_link(), since it's calling
malloc() directly on line 96, can get nodes ten at a time to speed up the allocation process
even further. LCHUNK, the number of nodes that are allocated at one time, is defined
on line 19.
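The free-list technique can be seen in isolation in the following sketch, which combines the recycling and chunked-allocation ideas in one place. The node type and the new_node() and discard_node() names are hypothetical; unlike the compiler's routine, this version returns NULL instead of exiting when memory runs out:

```c
#include <stdlib.h>
#include <string.h>

#define CHUNK 10                 /* Nodes fetched from malloc() at one time. */

typedef struct node
{
    int          payload;
    struct node *next;
} node;

static node *Free_list = NULL;   /* Head of the list of recycled nodes. */

/* Return a zeroed node, refilling the free list CHUNK nodes at a time.
 * Returns NULL if malloc() fails.
 */
node *new_node(void)
{
    node *p;
    int   i;

    if (!Free_list)
    {
        if (!(Free_list = malloc(sizeof(node) * CHUNK)))
            return NULL;

        for (p = Free_list, i = CHUNK; --i > 0; ++p)  /* Link the first      */
            p->next = p + 1;                          /* CHUNK-1 nodes.      */

        p->next = NULL;                               /* Terminate the last. */
    }

    p         = Free_list;       /* Unlink the first node from the list. */
    Free_list = Free_list->next;
    memset(p, 0, sizeof *p);
    return p;
}

/* Recycle a node by pushing it onto the free list. Nothing is free()d. */
void discard_node(node *p)
{
    if (p)
    {
        p->next   = Free_list;
        Free_list = p;
    }
}
```

Both operations are a couple of pointer assignments in the common case, and malloc() is called only once per CHUNK allocations. The cost is that memory is never returned to the system, which is acceptable for a program, such as a compiler, that runs briefly and then exits.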
Listing 6.28. symtab.c: The Maintenance Layer: Memory-Management Functions
  1  #include <stdio.h>
  2  #include <stdlib.h>
  3  #include <tools/debug.h>
  4  #include <tools/hash.h>
  5  #include <tools/l.h>
  6  #include <tools/compiler.h>
  7  #include <tools/c-code.h>
  8  #include "symtab.h"      /* Symbol-table definitions.                       */
  9  #include "value.h"       /* Value definitions.                              */
 10  #include "proto.h"       /* Prototypes for all functions in this directory. */
 11  #include "label.h"       /* Labels to use for compiler symbols.             */
 12
 13  /*----------------------------------------------------------------------*/
 14
 15  PRIVATE symbol     *Symbol_free = NULL;  /* Free-list of recycled symbols.    */
 16  PRIVATE link       *Link_free   = NULL;  /* Free-list of recycled links.      */
 17  PRIVATE structdef  *Struct_free = NULL;  /* Free-list of recycled structdefs. */
 18
 19  #define LCHUNK 10       /* new_link() gets this many nodes at one shot. */
 20
 21  /*----------------------------------------------------------------------*/
 22
 23  PUBLIC symbol *new_symbol( name, scope )
 24  char *name;
 25  int  scope;
 26  {
 27      symbol *sym_p;
 28
 29      if( !Symbol_free )                              /* Free list is empty. */
 30          sym_p = (symbol *) newsym( sizeof(symbol) );
 31      else                                            /* Unlink node from    */
 32      {                                               /* the free list.      */
 33          sym_p       = Symbol_free;
 34          Symbol_free = Symbol_free->next;
 35          memset( sym_p, 0, sizeof(symbol) );
 36      }
 37
 38      strncpy( sym_p->name, name, sizeof(sym_p->name) );
 39      sym_p->level = scope;
 40      return sym_p;
 41  }
 42
 43  /*----------------------------------------------------------------------*/
 44
 45  PUBLIC void discard_symbol( sym )
 46  symbol *sym;
 47  {
 48      /* Discard a single symbol structure and any attached links and args. Note
 49       * that the args field is recycled for initializers, the process is
 50       * described later on in the text (see value.c in the code), but you have to
 51       * test for a different type here. Sorry about the forward reference.
 52       */
 53
 54      if( sym )
 55      {
 56          if( IS_FUNCT( sym->type ) )
 57              discard_symbol_chain( sym->args );      /* Function arguments. */
 58          else
 59              discard_value( (value *)sym->args );    /* If an initializer.  */
 60
 61          discard_link_chain( sym->type );            /* Discard type chain. */
 62
 63          sym->next   = Symbol_free;                  /* Put current symbol  */
 64          Symbol_free = sym;                          /* in the free list.   */
 65      }
 66  }
 67
 68  /*----------------------------------------------------------------------*/
 69
 70  PUBLIC void discard_symbol_chain( sym )    /* Discard an entire cross-linked */
 71  symbol *sym;                               /* chain of symbols.              */
 72  {
 73      symbol *p = sym;
 74
 75      while( sym )
 76      {
 77          p = sym->next;
 78          discard_symbol( sym );
 79          sym = p;
 80      }
 81  }
 82
 83  /*----------------------------------------------------------------------*/
 84
 85  PUBLIC link *new_link( )
 86  {
 87      /* Return a new link. It's initialized to zeros, so it's a declarator.
 88       * LCHUNK nodes are allocated from malloc() at one time.
 89       */
 90
 91      link *p;
 92      int  i;
 93
 94      if( !Link_free )
 95      {
 96          if( !(Link_free = (link *) malloc( sizeof(link) * LCHUNK )) )
 97          {
 98              yyerror("INTERNAL, new_link: Out of memory\n");
 99              exit( 1 );
100          }
101
102          for( p = Link_free, i = LCHUNK; --i > 0; ++p )   /* Executes LCHUNK-1 */
103              p->next = p + 1;                             /* times.            */
104
105          p->next = NULL;
106      }
107
108      p         = Link_free;
109      Link_free = Link_free->next;
110      memset( p, 0, sizeof(link) );
111      return p;
112  }
113
114  /*----------------------------------------------------------------------*/
115
116  PUBLIC void discard_link_chain( p )
117  link *p;
118  {
119      /* Discard all links in the chain. Nothing is removed from the structure
120       * table, however. There's no point in discarding the nodes one at a time
121       * since they're already linked together, so find the first and last nodes
122       * in the input chain and link the whole list directly.
123       */
124
125      link *start;
126
127      if( start = p )
128      {
129          while( p->next )         /* find last node */
130              p = p->next;
131
132          p->next   = Link_free;
133          Link_free = start;
134      }
135  }
136
137  /*----------------------------------------------------------------------*/
138
139  PUBLIC void discard_link( p )            /* Discard a single link. */
140  link *p;
141  {
142      p->next   = Link_free;
143      Link_free = p;
144  }
145
146  /*----------------------------------------------------------------------*/
147
148  PUBLIC structdef *new_structdef( tag )   /* Allocate a new structdef. */
149  char *tag;
150  {
151      structdef *sdef_p;
152
153      if( !Struct_free )
154          sdef_p = (structdef *) newsym( sizeof(structdef) );
155      else
156      {
157          sdef_p      = Struct_free;
158          Struct_free = (structdef *)(Struct_free->fields);
159          memset( sdef_p, 0, sizeof(structdef) );
160      }
161      strncpy( sdef_p->tag, tag, sizeof(sdef_p->tag) );
162      return sdef_p;
163  }
164
165  /*----------------------------------------------------------------------*/
166
167  PUBLIC void discard_structdef( sdef_p )
168  structdef *sdef_p;
169  {
170      /* Discard a structdef and any attached fields, but don't discard linked
171       * structure definitions.
172       */
173
174      if( sdef_p )
175      {
176          discard_symbol_chain( sdef_p->fields );
177
178          sdef_p->fields = (symbol *)Struct_free;
179          Struct_free    = sdef_p;
180      }
181  }
Subroutines in Listing 6.29 manipulate declarators: add_declarator() adds
declarator nodes to the end of the linked list pointed to by the type and etype fields in
a symbol structure. The routine is passed a symbol pointer and the declarator type
(ARRAY, POINTER, or FUNCTION, declared on line 33 of Listing 6.23, page 491).
Subroutines in Listing 6.30 manipulate specifiers: they copy, create, and initialize
specifier links.
Listing 6.31 contains routines that manipulate entire types and linked lists of
symbols. clone_type(), on line 226, copies an entire type chain. This routine is used in
two places. First, when a variable that uses a typedef rather than a standard type is
declared, the type chain is copied from the symbol representing the typedef to the
Listing 6.29. symtab.c The Maintenance Layer: Declarator Manipulation
182 PUBLIC void add_declarator( sym, type )
183 symbol *sym;
184 int    type;
185 {
186     /* Add a declarator link to the end of the chain, the head of which is
187      * pointed to by sym->type and the tail of which is pointed to by
188      * sym->etype. *head must be NULL when the chain is empty. Both pointers
189      * are modified as necessary.
190      */
191
192     link *link_p;
193
194     if( type == FUNCTION && IS_ARRAY(sym->etype) )
195     {
196         yyerror("Array of functions is illegal, assuming function pointer\n");
197         add_declarator( sym, POINTER );
198     }
199
200     link_p = new_link();            /* The default class is DECLARATOR. */
201     link_p->DCL_TYPE = type;
202
203     if( !sym->type )
204         sym->type = sym->etype = link_p;
205     else
206     {
207         sym->etype->next = link_p;
208         sym->etype       = link_p;
209     }
210 }
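The append-to-tail technique used by add_declarator() can be tried in isolation. What follows is only a sketch: type_link, symbol_sketch, append_declarator(), and chain_length() are simplified stand-ins that I've made up for illustration, not the compiler's own link and symbol structures.

```c
#include <stdlib.h>

/* Simplified, hypothetical stand-ins for the book's link and symbol types. */
enum dcl_type { POINTER, ARRAY, FUNCTION };

typedef struct type_link {
    enum dcl_type     dcl_type;   /* which kind of declarator this node is */
    struct type_link *next;       /* next node, toward the specifier       */
} type_link;

typedef struct symbol_sketch {
    type_link *type;              /* head of the type chain */
    type_link *etype;             /* tail of the type chain */
} symbol_sketch;

/* Append one declarator node. Returns 0 on success, -1 if out of memory. */
int append_declarator(symbol_sketch *sym, enum dcl_type t)
{
    type_link *node = calloc(1, sizeof *node);

    if (!node)
        return -1;
    node->dcl_type = t;

    if (!sym->type)                     /* empty chain: node is head and tail */
        sym->type = sym->etype = node;
    else {                              /* otherwise hook it onto the old tail */
        sym->etype->next = node;
        sym->etype       = node;
    }
    return 0;
}

/* Count the nodes in a chain (handy for checking the result). */
int chain_length(const type_link *p)
{
    int n = 0;

    for (; p; p = p->next)
        ++n;
    return n;
}
```

Keeping a tail pointer next to the head is what makes each append a constant-time operation; without it, every new declarator would force a walk down the whole chain.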
Listing 6.30. symtab.c The Maintenance Layer: Specifier Manipulation
211 PUBLIC spec_cpy( dst, src )     /* Copy all initialized fields in src to dst. */
212 link *dst, *src;
213 {
214
215     if( src->NOUN     ) dst->NOUN     = src->NOUN     ;
216     if( src->SCLASS   ) dst->SCLASS   = src->SCLASS   ;
217     if( src->LONG     ) dst->LONG     = src->LONG     ;
218     if( src->UNSIGNED ) dst->UNSIGNED = src->UNSIGNED ;
219     if( src->STATIC   ) dst->STATIC   = src->STATIC   ;
220     if( src->EXTERN   ) dst->EXTERN   = src->EXTERN   ;
221     if( src->tdef     ) dst->tdef     = src->tdef     ;
222
223     if( src->SCLASS == CONSTANT || src->NOUN == STRUCTURE )
224         memcpy( &dst->VALUE, &src->VALUE, sizeof(src->VALUE) );
225 }
symbol for the variable. You could keep around only one type chain in the typedef
symbol, and make the new symbol's type field point there, but this would complicate
symbol deletion, because you'd have to keep track of the scope level of every link as
you deleted nodes. Though this latter method would be more conservative of memory, it
complicates things enough that I didn't want to use it. clone_type() is also used to
create type chains for temporary variables, though here the copying is less defensible.
I'll discuss this second application when expression processing is discussed, below.
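Here's a minimal sketch of the copy-the-whole-chain approach just described. The chain_node type and clone_chain() are hypothetical names of my own; the real routine works on the compiler's link structure and gets its nodes from new_link(), and the out-of-memory handling shown here is an addition of mine.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, stripped-down type-chain node. */
typedef struct chain_node {
    int                kind;    /* stands in for the declarator/specifier data */
    int                tdef;    /* nonzero if the node belongs to a typedef    */
    struct chain_node *next;
} chain_node;

/* Duplicate a chain node by node. The copy's tdef flag is always cleared.
 * *endp (if endp isn't NULL) gets the last node of the copy. Returns the
 * head of the copy, or NULL if the source was empty or memory ran out.
 */
chain_node *clone_chain(const chain_node *src, chain_node **endp)
{
    chain_node *head = NULL, *last = NULL;

    for (; src; src = src->next) {
        chain_node *copy = malloc(sizeof *copy);

        if (!copy) {                    /* out of memory: undo the partial copy */
            while (head) {
                chain_node *dead = head;
                head = head->next;
                free(dead);
            }
            last = NULL;
            break;
        }

        memcpy(copy, src, sizeof *copy);
        copy->next = NULL;              /* don't share the original's links */
        copy->tdef = 0;

        if (last)
            last->next = copy;
        else
            head = copy;
        last = copy;
    }

    if (endp)
        *endp = last;
    return head;
}
```

Because the clone shares no nodes with the typedef's chain, either one can be discarded without touching the other, which is the property that keeps symbol deletion simple.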
the_same_type(), on line 258 of Listing 6.31, compares two type chains and returns
true if they match. The storage-class components of the specifier are ignored. When the
third argument is true, a POINTER declarator is considered identical to an ARRAY
declarator when they are found in the first positions of both type chains. This relaxation of strict
type checking is necessary, again, for expression processing, because arrays are
represented internally as a pointer to the first element. Strict checking is necessary when
two declarations are compared for equivalence, however, so the third argument lets us
disable this feature.
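The effect of the third argument is easy to see in a cut-down model. This sketch is not the_same_type() itself: tnode, is_ptr_like(), and same_type() are made-up names, the specifier is reduced to a single D_INT node, and I've assumed that two chains match only if they end together.

```c
/* Hypothetical three-kind type chain: pointer, array, or a lone int. */
enum decl_kind { D_POINTER, D_ARRAY, D_INT };

typedef struct tnode {
    enum decl_kind kind;
    int            num_ele;     /* element count, arrays only */
    struct tnode  *next;
} tnode;

static int is_ptr_like(const tnode *p)
{
    return p && (p->kind == D_POINTER || p->kind == D_ARRAY);
}

/* Return 1 if the chains describe the same type, 0 otherwise. When relax is
 * true, a leading pointer and a leading array are treated as equivalent.
 */
int same_type(const tnode *a, const tnode *b, int relax)
{
    if (relax && is_ptr_like(a) && is_ptr_like(b)) {
        a = a->next;                    /* skip the leading declarators */
        b = b->next;
    }

    for (; a && b; a = a->next, b = b->next) {
        if (a->kind != b->kind)
            return 0;
        if (a->kind == D_ARRAY && a->num_ele != b->num_ele)
            return 0;
    }
    return !a && !b;                    /* both chains must end together */
}
```

With int *p and int a[10] as inputs, the relaxed comparison succeeds and the strict one fails, which is the distinction between expression processing and declaration matching described above.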
get_sizeof(), on line 302 of Listing 6.31, returns the size of an object of the type
represented by its argument. Note that recursion is used on line 313 to process arrays. A
declaration like this:
    int a[10][20];
should return 400 (10 x 20 elements x 2 bytes per element). If the current declarator is
an array, the size of the current dimension is remembered, and the rest of the type
chain is passed recursively to get_sizeof() to get the size of an array element. This
could, of course, be done with a loop, but the recursion is more compact and arrays with
more than three or four dimensions are rare. The other subroutines in the listing are
self-explanatory.
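The arithmetic can be checked with a small stand-alone model of the recursion. Everything here is illustrative: type_node and size_of_type() are hypothetical names, and the sizes are assumptions (2-byte int to match the example above, 1-byte char, and a 4-byte pointer that I picked arbitrarily in place of PSIZE).

```c
#include <stddef.h>

enum node_kind { K_ARRAY, K_POINTER, K_INT, K_CHAR };

typedef struct type_node {
    enum node_kind    kind;
    int               num_ele;  /* element count, arrays only */
    struct type_node *next;     /* rest of the type chain     */
} type_node;

enum { INT_SIZE = 2, CHAR_SIZE = 1, PTR_SIZE = 4 };   /* assumed target sizes */

/* Size in bytes of the type described by the chain starting at p. */
int size_of_type(const type_node *p)
{
    switch (p->kind) {
    case K_ARRAY:   return p->num_ele * size_of_type(p->next); /* recurse */
    case K_POINTER: return PTR_SIZE;    /* what it points at doesn't matter */
    case K_INT:     return INT_SIZE;
    case K_CHAR:    return CHAR_SIZE;
    }
    return 0;                           /* unknown node kind */
}
```

For int a[10][20] the chain is array(10), array(20), int, so the calls unwind as 10 * (20 * 2), or 400 bytes.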
The remainder of the maintenance layer, in Listing 6.32, is made up of subroutines
that print symbols and convert fields to printable strings.
Listing 6.31. symtab.c The Maintenance Layer: Type Manipulation
226 PUBLIC link *clone_type( tchain, endp )
227 link *tchain;       /* input:  Type chain to duplicate.                 */
228 link **endp;        /* output: Pointer to last node in cloned chain.    */
229 {
230     /* Manufacture a clone of the type chain in the input symbol. Return a
231      * pointer to the cloned chain, NULL if there were no declarators to clone.
232      * The tdef bit in the copy is always cleared.
233      */
234
235     link *last, *head = NULL;
236
237     for(; tchain ; tchain = tchain->next )
238     {
239         if( !head )                         /* 1st node in chain. */
240             head = last = new_link();
241         else                                /* Subsequent node.   */
242         {
243             last->next = new_link();
244             last       = last->next;
245         }
246
247         memcpy( last, tchain, sizeof(*last) );
248         last->next = NULL;
249         last->tdef = 0;
250     }
251
252     *endp = last;
253     return head;
254 }
255
256 /*----------------------------------------------------------------------*/
257
258 PUBLIC int the_same_type( p1, p2, relax )
259 link *p1, *p2;
260 int  relax;
261 {
262     /* Return 1 if the types match, 0 if they don't. Ignore the storage class.
263      * If "relax" is true and the array declarator is the first link in the
264      * chain, then a pointer is considered equivalent to an array.
265      */
266
267     if( relax && IS_PTR_TYPE(p1) && IS_PTR_TYPE(p2) )
268     {
269         p1 = p1->next;
270         p2 = p2->next;
271     }
272
273     for(; p1 && p2 ; p1 = p1->next, p2 = p2->next )
274     {
275         if( p1->class != p2->class )
276             return 0;
277
278         if( p1->class == DECLARATOR )
279         {
280             if( (p1->DCL_TYPE != p2->DCL_TYPE) ||
281                 (p1->DCL_TYPE==ARRAY && (p1->NUM_ELE != p2->NUM_ELE)) )
282                 return 0;
283         }
284         else                                /* this is done last */
285         {
286             if( (p1->NOUN     == p2->NOUN     ) && (p1->LONG == p2->LONG ) &&
287                 (p1->UNSIGNED == p2->UNSIGNED ) )
288             {
289                 return ( p1->NOUN==STRUCTURE ) ? p1->V_STRUCT == p2->V_STRUCT
290                                                : 1 ;
291             }
292             return 0;
293         }
294     }
295
296     yyerror("INTERNAL the_same_type: Unknown link class\n");
297     return 0;
298 }
299
300 /*----------------------------------------------------------------------*/
301
302 PUBLIC int get_sizeof( p )
303 link *p;
304 {
305     /* Return the size in bytes of an object of the type pointed to by p.
306      * Functions are considered to be pointer sized because that's how they're
307      * represented internally.
308      */
309
310     int size;
311
312     if( p->class == DECLARATOR )
313         size = (p->DCL_TYPE==ARRAY) ? p->NUM_ELE * get_sizeof(p->next) : PSIZE;
314     else
315     {
316         switch( p->NOUN )
317         {
318         case CHAR:      size = CSIZE;                       break;
319         case INT:       size = p->LONG ? LSIZE : ISIZE;     break;
320         case STRUCTURE: size = p->V_STRUCT->size;           break;
321         case VOID:      size = 0;                           break;
322         case LABEL:     size = 0;                           break;
323         }
324     }
325
326     return size;
327 }
328
329 /*----------------------------------------------------------------------*/
330
331 PUBLIC symbol *reverse_links( sym )
332 symbol *sym;
333 {
334     /* Go through the cross-linked chain of "symbols", reversing the direction
335      * of the cross pointers. Return a pointer to the new head of chain
336      * (formerly the end of the chain) or NULL if the chain started out empty.
337      */
338
339     symbol *previous, *current, *next;
340
341     if( !sym )
342         return NULL;
343
344     previous = sym;
345     current  = sym->next;
346
347     while( current )
348     {
349         next          = current->next;
350         current->next = previous;
351         previous      = current;
352         current       = next;
353     }
354
355     sym->next = NULL;
356     return previous;
357 }
Listing 6.32. symtab.c The Maintenance Layer: Symbol-Printing Functions
PUBLIC char *sclass_str( class )    /* Return a string representing the */
int class;                          /* indicated storage class.         */
{
    return class==CONSTANT ? "CON" :
           class==REGISTER ? "REG" :
           class==TYPEDEF  ? "TYP" :
           class==AUTO     ? "AUT" :
           class==FIXED    ? "FIX" : "BAD SCLASS" ;
}

/*----------------------------------------------------------------------*/

PUBLIC char *oclass_str( class )    /* Return a string representing the */
int class;                          /* indicated output storage class.  */
{
    return class==PUB ? "PUB" :
           class==PRI ? "PRI" :
           class==COM ? "COM" :
           class==EXT ? "EXT" : "(NO OCLS)" ;
}

/*----------------------------------------------------------------------*/

PUBLIC char *noun_str( noun )       /* Return a string representing the */
int noun;                           /* indicated noun.                  */
{
    return noun==INT       ? "int"    :
           noun==CHAR      ? "char"   :
           noun==VOID      ? "void"   :
           noun==LABEL     ? "label"  :
           noun==STRUCTURE ? "struct" : "BAD NOUN" ;
}

/*----------------------------------------------------------------------*/

PUBLIC char *attr_str( spec_p )     /* Return a string representing all */
specifier *spec_p;                  /* attributes in a specifier other  */
{                                   /* than the noun and storage class. */
    static char str[5];

    str[0] = ( spec_p->_unsigned ) ? 'u' : '.' ;
    str[1] = ( spec_p->_static   ) ? 's' : '.' ;
    str[2] = ( spec_p->_extern   ) ? 'e' : '.' ;
    str[3] = ( spec_p->_long     ) ? 'l' : '.' ;
    str[4] = '\0';

    return str;
}

/*----------------------------------------------------------------------*/

PUBLIC char *type_str( link_p )     /* Return a string representing the    */
link *link_p;                       /* type represented by the link chain. */
{
    int         i;
    static char target[ 80 ];
    static char buf   [ 64 ];
    int         available = sizeof(target) - 1;

    *buf    = '\0';
    *target = '\0';

    if( !link_p )
        return "(NULL)";

    if( link_p->tdef )
    {
        strcpy( target, "tdef " );
        available -= 5;
    }

    for(; link_p ; link_p = link_p->next )
    {
        if( IS_DECLARATOR(link_p) )
        {
            switch( link_p->DCL_TYPE )
            {
            case POINTER:  i = sprintf( buf, "*" );                       break;
            case ARRAY:    i = sprintf( buf, "[%d]", link_p->NUM_ELE );   break;
            case FUNCTION: i = sprintf( buf, "()" );                      break;
            default:       i = sprintf( buf, "BAD DECL" );                break;
            }
        }
        else                                    /* it's a specifier */
        {
            i = sprintf( buf, "%s %s %s %s", noun_str  ( link_p->NOUN     ),
                                             sclass_str( link_p->SCLASS   ),
                                             oclass_str( link_p->OCLASS   ),
                                             attr_str  ( &link_p->select.s ) );

            if( link_p->NOUN==STRUCTURE || link_p->SCLASS==CONSTANT )
            {
                strncat( target, buf, available );
                available -= i;

                if( link_p->NOUN != STRUCTURE )
                    continue;
                else
                    i = sprintf( buf, " %s", link_p->V_STRUCT->tag ?
                                             link_p->V_STRUCT->tag : "untagged" );
            }
        }

        strncat( target, buf, available );
        available -= i;
    }

    return target;
}

/*----------------------------------------------------------------------*/

PUBLIC char *tconst_str( type )
link *type;                     /* Return a string representing the value  */
{                               /* field at the end of the specified type  */
    static char buf[80];        /* (which must be char*, char, int, long,  */
                                /* unsigned int, or unsigned long). Return */
    buf[0] = '?';               /* "?" if the type isn't any of these.     */
    buf[1] = '\0';

    if( IS_POINTER(type) && IS_CHAR(type->next) )
    {
        sprintf( buf, "%s%d", L_STRING, type->next->V_INT );
    }
    else if( !(IS_AGGREGATE(type) || IS_FUNCT(type)) )
    {
        switch( type->NOUN )
        {
        case CHAR:  sprintf( buf, "'%s' (%d)", bin_to_ascii(
                                        type->UNSIGNED ? type->V_UINT
                                                       : type->V_INT, 1 ),
                                        type->UNSIGNED ? type->V_UINT
                                                       : type->V_INT, 1 );
                    break;

        case INT:   if( type->LONG )
                    {
                        if( type->UNSIGNED )
                            sprintf( buf, "%luL", type->V_ULONG );
                        else
                            sprintf( buf, "%ldL", type->V_LONG  );
                    }
                    else
                    {
                        if( type->UNSIGNED )
                            sprintf( buf, "%u", type->V_UINT );
                        else
                            sprintf( buf, "%d", type->V_INT  );
                    }
                    break;
        }
    }

    if( *buf == '?' )
        yyerror("Internal, tconst_str: Can't make constant for type %s\n",
                                                        type_str( type ) );
    return buf;
}

/*----------------------------------------------------------------------*/

PUBLIC char *sym_chain_str( chain )
symbol *chain;
{
    /* Return a string listing the names of all symbols in the input chain (or
     * a constant value if the symbol is a constant). Note that this routine
     * can't call type_str() because the second-order recursion messes up the
     * buffers. Since the routine is used only for occasional diagnostics, it's
     * not worth fixing this problem.
     */

    int         i;
    static char buf[80];
    char        *p    = buf;
    int         avail = sizeof( buf ) - 1;

    *buf = '\0';
    while( chain && avail > 0 )
    {
        if( IS_CONSTANT(chain->etype) )
            i = sprintf( p, "%0.*s", avail - 2, "const" );
        else
            i = sprintf( p, "%0.*s", avail - 2, chain->name );

        p     += i;
        avail -= i;

        if( chain = chain->next )
        {
            *p++ = ',';
            i -= 2;
        }
    }

    return buf;
}
/*----------------------------------------------------------------------*/

PRIVATE void psym( sym_p, fp )          /* Print one symbol to fp. */
symbol *sym_p;
FILE   *fp;
{
    fprintf( fp, "%-18.18s %-18.18s %2d %p %s\n",
                 sym_p->name,
                 sym_p->type ? sym_p->rname : "------",
                 sym_p->level,
                 (void far *)sym_p->next,
                 type_str( sym_p->type ) );
}

/*----------------------------------------------------------------------*/

PRIVATE void pstruct( sdef_p, fp )      /* Print a structure definition to fp */
structdef *sdef_p;                      /* including all the fields & types.  */
FILE      *fp;
{
    symbol *field_p;

    fprintf( fp, "struct <%s> (level %d, %d bytes)\n",
                 sdef_p->tag, sdef_p->level, sdef_p->size );

    for( field_p = sdef_p->fields; field_p; field_p = field_p->next )
    {
        fprintf( fp, "    %-20s (offset %d) %s\n",
                 field_p->name, field_p->level, type_str(field_p->type) );
    }
}

/*----------------------------------------------------------------------*/

PUBLIC print_syms( filename )       /* Print the entire symbol table to    */
char *filename;                     /* the named file. Previous contents   */
{                                   /* of the file (if any) are destroyed. */
    FILE *fp;

    if( !(fp = fopen(filename, "w")) )
        yyerror("Can't open symbol-table file\n");
    else
    {
        fprintf( fp, "Attributes in type field are:   upel\n"
                     "    unsigned (. for signed)-----+|||\n"
                     "    private  (. for public)------+||\n"
                     "    extern   (. for common)-------+|\n"
                     "    long     (. for short )--------+\n\n" );

        fprintf( fp, "name               rname              lev   next   type\n" );
        ptab( Symbol_tab, psym, fp, 1 );

        fprintf( fp, "\nStructure table:\n\n" );
        ptab( Struct_tab, pstruct, fp, 1 );

        fclose( fp );
    }
}
6.4 The Parser: Configuration
I'll start looking at the compiler proper with the configuration parts of the occs input
file. Listing 6.33 shows most of the header portion of the occs input file. The %union
directive on lines 21 to 30 of Listing 6.33 types the value stack as a C union containing
the indicated fields. Some of these fields are pointers to symbol-table structures, others
to structures that are discussed below, and the last two fields are simple-variable
attributes: num is for integer numbers, ascii for ASCII characters. Further down in the
table, directives like <ascii> tell occs which fields to use for particular symbols. For
example, an ASSIGNOP has an <ascii> attribute because the <ascii> directive precedes
its definition on line 63. The ascii field of the union is used automatically if $N
in a production corresponds to an ASSIGNOP. You don't have to say $N.ascii.
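To make the idea of a union-typed value stack concrete, here's a small sketch in ordinary C. It's only a model: stack_item and fold_assign() are hypothetical names, the union has just three of the fields, and the operator set is cut down. The point is that each grammar symbol uses exactly one field of its stack slot, and that an ASSIGNOP's attribute is the operator part of its lexeme (so *= carries '*').

```c
/* Hypothetical, cut-down version of the value-stack union. */
typedef union stack_item {
    char *p_char;       /* string attribute             */
    int   num;          /* integer attribute            */
    int   ascii;        /* single-character attribute   */
} stack_item;

/* Combine two integer operands according to an ASSIGNOP attribute.
 * dst and src use the num field; op uses the ascii field.
 */
int fold_assign(stack_item dst, stack_item op, stack_item src)
{
    switch (op.ascii) {
    case '=': return src.num;               /* plain assignment */
    case '+': return dst.num + src.num;     /* +=               */
    case '-': return dst.num - src.num;     /* -=               */
    case '*': return dst.num * src.num;     /* *=               */
    default:  return dst.num;               /* unknown: leave value alone */
    }
}
```

In the real grammar you never spell out the field name for a typed symbol; the directive in the definitions section tells occs which one to use.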
Listing 6.33. c.y occs Input File: Definitions Section
21 %union {
22     char        *p_char;
23     symbol      *p_sym;
24     link        *p_link;
25     structdef   *p_sdef;
26     specifier   *p_spec;
27     value       *p_val;
28     int         num;
29     int         ascii;
30 }
31
32 /*----------------------------------------------------------------------*/
33
34 %term STRING            /* String constant.                                 */
35 %term ICON              /* Integer or long constant including '\t', etc.    */
36 %term FCON              /* Floating-point constant.                         */
37
38 %term TYPE              /* int char long float double signed unsigned short */
39                         /* const volatile void                              */
40 %term <ascii> STRUCT    /* struct union                                     */
41 %term ENUM              /* enum                                             */
42
43 %term RETURN GOTO
44 %term IF ELSE
45 %term SWITCH CASE DEFAULT
46 %term BREAK CONTINUE
47 %term WHILE DO FOR
48 %term LC RC             /* { }  */
49 %term SEMI              /* ;    */
50 %term ELLIPSIS          /* ...  */
51
52 /* The attributes used below tend to be the sensible thing. For example, the
53  * ASSIGNOP attribute is the operator component of the lexeme; most other
54  * attributes are the first character of the lexeme. Exceptions are as follows:
55  *      token       attribute
56  *      RELOP >     '>'
57  *      RELOP <     '<'
58  *      RELOP >=    'G'
59  *      RELOP <=    'L'
60  */
61
62 %left  COMMA                         /* ,                                   */
63 %right EQUAL <ascii> ASSIGNOP        /* =  *= /= %= += -= <<= >>= &= |= ^=  */
64 %right QUEST COLON                   /* ? :                                 */
65 %left  OROR                          /* ||                                  */
66 %left  ANDAND                        /* &&                                  */
67 %left  OR                            /* |                                   */
68 %left  XOR                           /* ^                                   */
69 %left  AND                           /* &                                   */
70 %left  <ascii> EQUOP                 /* ==  !=                              */
71 %left  <ascii> RELOP                 /* <=  >=  <  >                        */
72 %left  <ascii> SHIFTOP               /* >>  <<                              */
73 %left  PLUS MINUS                    /* +  -                                */
74 %left  STAR <ascii> DIVOP            /* *  /  %                             */
75 %right SIZEOF <ascii> UNOP INCOP     /* sizeof  !  ~  ++  --                */
76 %left  LB RB LP RP <ascii> STRUCTOP  /* [ ] ( ) . ->                        */
77
78
79                         /* These attributes are shifted by the scanner.     */
80 %term TTYPE             /* Name of a type created with a previous typedef.  */
81                         /* Attribute is a pointer to the symbol table       */
82                         /* entry for that typedef.                          */
83 %nonassoc <ascii> CLASS /* extern register auto static typedef. Attribute   */
84                         /* is the first character of the lexeme.            */
85 %nonassoc NAME          /* Identifier or typedef name. Attribute is NULL    */
86                         /* if the symbol doesn't exist, a pointer to the    */
87                         /* associated "symbol" structure, otherwise.        */
88
89 %nonassoc ELSE          /* This gives a high precedence to ELSE to suppress
90                          * the shift/reduce conflict error message in:
91                          *    s -> IF LP expr RP expr | IF LP expr RP s ELSE s
92                          * The precedence of the first production is the same
93                          * as RP. Making ELSE higher precedence forces
94                          * resolution in favor of the shift.
95                          */
96 %type <num>    args const_expr test
97
98 %type <p_sym>  ext_decl_list ext_decl def_list def decl_list decl
99 %type <p_sym>  var_decl funct_decl local_defs new_name name enumerator
100 %type <p_sym>  name_list var_list param_declaration abs_decl abstract_decl
101 %type <p_val>  expr binary non_comma_expr unary initializer initializer_list
102 %type <p_val>  or_expr and_expr or_list and_list
103
104 %type <p_link> type specifiers opt_specifiers type_or_class type_specifier
105 %type <p_sdef> opt_tag tag struct_specifier
106 %type <p_char> string_const target
107
108 /*----------------------------------------------------------------------
109  * Global and external variables. Initialization to zero for all globals is
110  * assumed. Since occs -a and -p are used, these variables may not be private.
111  */
112
113 %{
114 #ifdef YYACTION
115
116 extern char *yytext;                    /* generated by LeX */
117 extern const int yylineno, yyleng;
118 %}

There are several other variables declared in this part of the
input file. I'll discuss these variables, below, as they are used.

180 %{
181 #endif
182 %}
Lines 34 to 50 of Listing 6.33 contain definitions for those tokens that aren't used in
expressions, and so don't require precedence or associativity information. Lines 62 to 76
comprise a precedence chart in which the other tokens are defined. Lowest-precedence
operators are at the top of the list, and tokens on the same line are at the same
precedence level. Some of these tokens have the <ascii> attribute associated with them.
These tokens all represent multiple input symbols. The attribute, created by the lexical
analyzer and passed to the parser using the yylval mechanism described in Appendix
E, serves to identify which input symbol has been scanned. You can access the attribute
from within a production by using the normal $ mechanism: if a symbol with an <ascii>
attribute is at $1 in the production, then that attribute can be referenced by
using $1 in an action. Most attributes are the first character of the lexeme; a
few exceptions are listed in the comment on line 52 of Listing 6.33.
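The RELOP exceptions from that comment are easy to model. The two functions below are a sketch of how a scanner might pick the attribute and how an action might use it; relop_attribute() and eval_relop() are hypothetical names of mine, not routines from the compiler.

```c
#include <string.h>

/* Map a relational-operator lexeme to its single-character attribute:
 * ">" gives '>', "<" gives '<', ">=" gives 'G', and "<=" gives 'L'.
 */
int relop_attribute(const char *lexeme)
{
    if (strcmp(lexeme, ">=") == 0) return 'G';
    if (strcmp(lexeme, "<=") == 0) return 'L';
    return lexeme[0];           /* the usual rule: first character */
}

/* Apply the operator identified by attr to two integers. */
int eval_relop(int attr, int left, int right)
{
    switch (attr) {
    case '>': return left >  right;
    case '<': return left <  right;
    case 'G': return left >= right;
    case 'L': return left <= right;
    default:  return 0;         /* not a RELOP attribute */
    }
}
```

The special cases exist because the first character alone can't tell > from >= (or < from <=); a one-character code keeps the attribute small enough to fit in an int field.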
TYPE, CLASS, NAME, and ELSE are assigned precedence levels on lines 79 to 95 to
eliminate various shift/reduce conflicts inherent in the grammar. (See Appendix E for a
description of what's going on.) The NAME and TTYPE tokens also have
pointer-to-symbol-table-entry attributes associated with them. (A TTYPE is a type created by a
previous typedef.) The lexical analyzer uses the symbol table to distinguish identifiers
from user-defined types. It returns a NULL attribute if the input symbol isn't in the table,
otherwise the attribute is a pointer to the appropriate symbol-table entry. I'll
demonstrate how this is done in a moment.
The %type directives on lines 96 to 106 of Listing 6.33 attach various attributes to
nonterminal names. The abbreviations that are used in the names are defined in Table
6.12. The remainder of the header holds variable definitions, most of which aren't shown
in Listing 6.33. They're discussed later along with the code that actually uses them.
Table 6.12. Abbreviations Used in Nonterminal Names
Abbreviation   Meaning       Abbreviation   Meaning
abs            abstract      expr           expression
arg            argument      ext            external
const          constant      opt            optional
decl           declarator    param          parameter
def            definition    struct         structure
Skipping past the productions themselves for a moment, the end of the occs input file
starts in Listing 6.34. This section contains various initialization and debugging routines
that are used by the parser. The occs parser's initialization function starts on line 1306
of Listing 6.34. It creates temporary files and installs a signal handler that deletes the
temporaries if SIGINT is issued while the compiler is running. (SIGINT is issued
when a Ctrl-C or Ctrl-Break is encountered under MS-DOS. Some UNIX systems also
recognize DEL or Rubout.) init_output_streams(), starting on line 1251,
creates three temporary files to hold output for the code, data, and bss segments. These
files make it easier to keep the various segments straight as code is generated. The
cleanup code that starts on line 1323 merges the three temporaries together at the end of
the compilation process by renaming the data file and then appending the other two files.
Listing 6.34. c.y occs Input File: Temporary-File Creation
1240 %%
1241 extern char *mktemp();
1242 extern FILE *yycodeout, *yydataout, *yybssout;  /* In occs output file.      */
1243 extern      yyprompt();                         /* In debugger.              */
1244
1245 #define OFILE_NAME "output.c"                   /* Output file name.         */
1246 char *Bss;                                      /* Name of BSS  temporary file */
1247 char *Code;                                     /* Name of Code temporary file */
1248 char *Data;                                     /* Name of Data temporary file */
1249
1250 /*----------------------------------------------------------------------*/
1251 PRIVATE void init_output_streams( p_code, p_data, p_bss )
1252 char **p_code, **p_data, **p_bss;
1253 {
1254     /* Initialize the output streams, making temporary files as necessary.
1255      * Note that the ANSI tmpfile() or the UNIX mkstemp() functions are both
1256      * better choices than the mktemp()/fopen() used here because another
1257      * process could, at least in theory, sneak in between the two calls.
1258      * Since mktemp uses the process id as part of the file name, this
1259      * is not much of a problem, and the current method is more portable
1260      * than tmpfile() or mkstemp(). Be careful in a network environment.
1261      */
1262
1263     if( !(*p_code = mktemp("ccXXXXXX")) || !(*p_data = mktemp("cdXXXXXX")) ||
1264         !(*p_bss  = mktemp("cbXXXXXX"))
1265       )
1266     {
1267         yyerror("Can't create temporary-file names");
1268         exit( 1 );
1269     }
1270
1271     if( !(yycodeout = fopen(*p_code, "w")) || !(yydataout = fopen(*p_data, "w")) ||
1272         !(yybssout  = fopen(*p_bss,  "w"))
1273       )
1274     {
1275         perror("Can't open temporary files");
1276         exit( 1 );
1277     }
1278 }
1279 /*----------------------------------------------------------------------*/
1280 void (*Osig)();     /* Previous SIGINT handler. Initialized in yy_init_occs(). */
1281
1282 PRIVATE sigint_handler()
1283 {
1284     /* Ctrl-C handler. Note that the debugger raises SIGINT on a 'q' command,
1285      * so this routine is executed when you exit the debugger. Also the
1286      * debugger's own SIGINT handler, which cleans up windows and so forth, is
1287      * installed before yy_init_occs() is called. It's called here to clean up
1288      * the environment if necessary. If the debugger isn't installed, the call
1289      * is harmless.
1290      */
1291     signal   ( SIGINT, SIG_IGN );
1292     clean_up ( );
1293     unlink   ( OFILE_NAME );
1294     (*Osig)();
1295     exit( 1 );          /* Needed only if old signal handler returns. */
1296 }
1297
1298 /*----------------------------------------------------------------------*/
1299
1300 sym_cmp    (s1, s2) symbol    *s1, *s2; { return strcmp(s1->name, s2->name); }
1301 struct_cmp (s1, s2) structdef *s1, *s2; { return strcmp(s1->tag,  s2->tag ); }
1302
1303 sym_hash    (s1) symbol    *s1; { return hash_add(s1->name); }
1304 struct_hash (s1) structdef *s1; { return hash_add(s1->tag ); }
1305
1306 PUBLIC void yy_init_occs( val )
1307 yystype *val;
1308 {
1309     yycomment("Initializing\n");
1310
1311     Osig = signal( SIGINT, SIG_IGN );
1312     init_output_streams( &Code, &Data, &Bss );
1313     signal( SIGINT, sigint_handler );
1314
1315     val->p_char = "---";    /* Provide attribute for the start symbol. */
1316
1317     Symbol_tab = maketab( 257, sym_hash,    sym_cmp    );
1318     Struct_tab = maketab( 127, struct_hash, struct_cmp );
1319 }
1320
1321 /*----------------------------------------------------------------------*/
1322
1323 PRIVATE void clean_up()
1324 {
1325     /* Cleanup actions. Mark the ends of the various segments, then merge the
1326      * three temporary files used for the code, data, and bss segments into a
1327      * single file called output.c. Delete the temporaries. Since some compilers
1328      * don't delete an existing file with a rename(), it's best to assume the
1329      * worst. It can't hurt to delete a nonexistent file, you'll just get an
1330      * error back from the operating system.
1331      */
1332
1333     extern FILE *yycodeout, *yydataout, *yybssout;
1334
1335     signal ( SIGINT, SIG_IGN );
1336     fclose ( yycodeout );
1337     fclose ( yydataout );
1338     fclose ( yybssout  );
1339     unlink ( OFILE_NAME );          /* delete old output file (ignore EEXIST) */
1340
1341     if( rename( Data, OFILE_NAME ) )
1342         yyerror("Can't rename temporary (%s) to %s\n", Data, OFILE_NAME );
1343     else
1344     {                                       /* Append the other temporary  */
1345         movefile( OFILE_NAME, Bss,  "a" );  /* files to the end of the     */
1346         movefile( OFILE_NAME, Code, "a" );  /* output file and delete the  */
1347     }                                       /* temporary files. movefile() */
1348 }                                           /* is in appendix A.           */
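The comment in init_output_streams() mentions tmpfile() as a safer alternative to the mktemp()/fopen() pair. Here's a sketch of that alternative, not a drop-in replacement for the listing: it trades the rename-and-append strategy for copying each segment's stream into the output in turn, and append_stream() is a hypothetical helper of my own, not the movefile() routine from Appendix A.

```c
#include <stdio.h>

/* Copy everything in src to the end of dst. src must be open for both
 * reading and writing (as a tmpfile() stream is). Returns 0 on success,
 * -1 on a read or write error.
 */
int append_stream(FILE *dst, FILE *src)
{
    char   buf[4096];
    size_t n;

    rewind(src);                        /* go back to the start of the segment */

    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        if (fwrite(buf, 1, n, dst) != n)
            return -1;

    return ferror(src) ? -1 : 0;
}
```

A tmpfile() stream has no name that another process could guess between two calls, and the file is removed automatically when it's closed or when the program terminates normally, so the cleanup code no longer has to delete it. The cost is that you can't rename such a file into place, so every segment has to be copied.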
The remainder of the occs input file is in Listing 6.35. yypstk(), starting on line
1423, is used by the occs interactive debugger to print the value-stack contents. The
routine is passed a pointer to the value-stack item and a pointer to the equivalent string on
the symbolic debugging stack (the middle column in the stack window). It uses this
string to identify which of the fields in the union are active, by searching the table on
lines 1357 to 1415 with the bsearch() call on line 1432. The following switch then
prints out the stack item.
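The lookup itself is a standard bsearch() over a table sorted by name. The following sketch shows the technique with a five-entry table; tab_entry, entry_cmp(), and field_for() are illustrative names, and the F_ constants stand in for the listing's enumerated field codes.

```c
#include <stdlib.h>
#include <string.h>

enum field { F_NONE, F_ASCII, F_NUM, F_SYM };

typedef struct tab_entry {
    const char *sym;            /* symbol name as it appears on the debug stack */
    enum field  field;          /* which union field that symbol uses           */
} tab_entry;

/* bsearch() requires the table to be sorted in strcmp() order, so uppercase
 * names come before lowercase ones.
 */
static const tab_entry tab[] = {
    { "ASSIGNOP", F_ASCII },
    { "NAME",     F_SYM   },
    { "RELOP",    F_ASCII },
    { "args",     F_NUM   },
    { "test",     F_NUM   },
};

static int entry_cmp(const void *a, const void *b)
{
    return strcmp(((const tab_entry *)a)->sym, ((const tab_entry *)b)->sym);
}

/* Return the field used by the named symbol, F_NONE if it has no attribute. */
enum field field_for(const char *name)
{
    tab_entry        key;
    const tab_entry *hit;

    key.sym   = name;
    key.field = F_NONE;

    hit = bsearch(&key, tab, sizeof tab / sizeof tab[0], sizeof tab[0], entry_cmp);
    return hit ? hit->field : F_NONE;
}
```

The one thing to watch is the ordering: if you add an entry out of sequence, bsearch() won't complain; it will just fail to find some names.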
Listing 6.36 finishes up the start-up code with a main() function. The
yyhook_a() routine is a debugger hook that prints the symbol table when a Ctrl-A is
issued at the IDE's command prompt. The yyhook_b() routine lets you enable or disable the
run-time trace feature discussed along with the gen() subroutine, below, from the
debugger.
Listing 6.35. c.y occs Input File: Initialization and Cleanup
1349  enum union_fields { NONE, P_SYM, P_LINK, P_SDEF, P_FIELD, P_SPEC,
1350                      P_CHAR, P_VALUE, SYM_CHAIN, ASCII, NUM        };
1351  typedef struct tabtype
1352  {
1353      char              *sym;
1354      enum union_fields case_val;
1355  } tabtype;
1356
1357  tabtype Tab[] =
1358  {
1359      /*  nonterminal             field      */
1360      /*  name                    in %union  */
1361
1362      { "ASSIGNOP",               ASCII     },
1363      { "CLASS",                  ASCII     },
1364      { "DIVOP",                  ASCII     },
1365      { "EQUOP",                  ASCII     },
1366      { "INCOP",                  ASCII     },
1367      { "NAME",                   P_SYM     },
1368      { "RELOP",                  ASCII     },
1369      { "SHIFTOP",                ASCII     },
1370      { "STRUCT",                 ASCII     },
1371      { "STRUCTOP",               ASCII     },
1372      { "TTYPE",                  P_SYM     },
1373      { "UNOP",                   ASCII     },
1374      { "abs_decl",               P_SYM     },
1375      { "abstract_decl",          P_SYM     },
1376      { "and_expr",               P_VALUE   },
1377      { "and_list",               P_VALUE   },
1378      { "args",                   NUM       },
1379      { "binary",                 P_VALUE   },
1380      { "const_expr",             NUM       },
1381      { "decl",                   P_SYM     },
1382      { "decl_list",              SYM_CHAIN },
1383      { "def",                    SYM_CHAIN },
1384      { "def_list",               SYM_CHAIN },
1385      { "enumerator",             P_SYM     },
1386      { "expr",                   P_VALUE   },
1387      { "ext_decl",               P_SYM     },
1388      { "ext_decl_list",          SYM_CHAIN },
1389      { "funct_decl",             P_SYM     },
1390      { "initializer",            P_VALUE   },
1391      { "initializer_list",       P_VALUE   },
1392      { "local_defs",             SYM_CHAIN },
1393      { "name",                   P_SYM     },
1394      { "name_list",              SYM_CHAIN },
1395      { "new_name",               P_SYM     },
1396      { "non_comma_expr",         P_VALUE   },
1397      { "opt_specifiers",         P_LINK    },
1398      { "opt_tag",                P_SDEF    },
1399      { "or_expr",                P_VALUE   },
1400      { "or_list",                P_VALUE   },
1401      { "param_declaration",      SYM_CHAIN },
1402      { "specifiers",             P_LINK    },
1403      { "string_const",           P_CHAR    },
1404      { "struct_specifier",       P_SDEF    },
1405      { "tag",                    P_SDEF    },
1406      { "test",                   NUM       },
1407      { "type",                   P_LINK    },
1408      { "type_or_class",          P_LINK    },
1409      { "type_specifier",         P_LINK    },
1410      { "unary",                  P_VALUE   },
1411      { "var_decl",               P_SYM     },
1412      { "var_list",               SYM_CHAIN },
1413      { "{72}",                   NUM       },
1414      { "{73}",                   P_VALUE   }
1415  };
1416
1417  tcmp( p1, p2 )
1418  tabtype *p1, *p2;
1419  {
1420      return( strcmp( p1->sym, p2->sym ) );
1421  }
1422
1423  char *yypstk( v, name )
1424  yystype *v;                  /* Pointer to value-stack item. */
1425  char    *name;               /* Pointer to debug-stack item. */
1426  {
1427      static char buf[128];
1428      char        *text;
1429      tabtype     *tp, template;
1430
1431      template.sym = name;
1432      tp = (tabtype *) bsearch( &template, Tab, sizeof(Tab)/sizeof(*Tab),
1433                                                       sizeof(*Tab), tcmp );
1434
1435      sprintf( buf, "%04x ", v->num );     /* The first four characters in the */
1436      text = buf + 5;                      /* string are the numeric value of  */
1437                                           /* the current stack element.       */
1438                                           /* Other text is written at "text". */
1439      switch( tp ? tp->case_val : NONE )
1440      {
1441      case SYM_CHAIN:
1442          sprintf( text, "sym chain: %s",
1443                   v->p_sym ? sym_chain_str(v->p_sym) : "NULL" );
1444          break;
1445      case P_SYM:
1446          if( !v->p_sym )
1447              sprintf( text, "symbol: NULL" );
1448
1449          else if( IS_FUNCT(v->p_sym->type) )
1450              sprintf( text, "symbol: %s(%s)=%s %1.40s",
1451                       v->p_sym->name,
1452                       sym_chain_str( v->p_sym->args ),
1453                       v->p_sym->type && *(v->p_sym->rname) ?
1454                                             v->p_sym->rname : "",
1455                       type_str( v->p_sym->type ) );
1456          else
1457              sprintf( text, "symbol: %s=%s %1.40s",
1458                       v->p_sym->name,
1459                       v->p_sym->type && *(v->p_sym->rname) ?
1460                                             v->p_sym->rname : "",
1461                       type_str( v->p_sym->type ) );
1462          break;
1463      case P_SPEC:
1464          if( !v->p_spec )
1465              sprintf( text, "specifier: NULL" );
1466          else
1467              sprintf( text, "specifier: %s %s", attr_str( v->p_spec ),
1468                                                 noun_str( v->p_spec->noun ) );
1469          break;
1470      case P_LINK:
1471          if( !v->p_link )
1472              sprintf( text, "specifier: NULL" );
1473          else
1474              sprintf( text, "link: %1.50s", type_str( v->p_link ) );
1475          break;
1476      case P_VALUE:
1477          if( !v->p_val )
1478              sprintf( text, "_value: NULL" );
1479          else
1480          {
1481              sprintf( text, "%cvalue: %s %c/%u %1.40s",
1482                       v->p_val->lvalue   ? 'l' : 'r',
1483                       *(v->p_val->name)  ? v->p_val->name : " ",
1484                       v->p_val->is_tmp   ? 't' : 'v',
1485                       v->p_val->offset,
1486                       type_str( v->p_val->type ) );
1487          }
1488          break;
1489      case P_SDEF:
1490          if( !v->p_sdef )
1491              sprintf( text, "structdef: NULL" );
1492          else
1493              sprintf( text, "structdef: %s lev %d, size %d",
1494                       v->p_sdef->tag,
1495                       v->p_sdef->level,
1496                       v->p_sdef->size );
1497          break;
1498      case P_CHAR:
1499          if( !v->p_sdef )
1500              sprintf( text, "string: NULL" );
1501          else
1502              sprintf( text, "<%s>", v->p_char );
1503          break;
1504      case NUM:
1505          sprintf( text, "num: %d", v->num );
1506          break;
1507      case ASCII:
1508          sprintf( text, "ascii: '%s'", bin_to_ascii( v->ascii, 1 ) );
1509          break;
1510      }
1511      return buf;
1512  }
Listing 6.36. main.c: main() and a Symbol-Table-Printing Hook
#include <tools/debug.h>

int main( argc, argv )
char **argv;
{
    UX( yy_init_occs(); )   /* Needed for yacc, called automatically by occs. */

    argc = yy_get_args( argc, argv );

    if( argc > 2 )          /* Generate trace if anything follows file name on */
        enable_trace();     /* command line.                                   */

    yyparse();
}
/*----------------------------------------------------------------------*/
yyhook_a()                  /* Print symbol table from debugger with Ctrl-A. */
{                           /* Not used in yacc.                             */
    static int x = 0;
    char       buf[32];

    sprintf  ( buf, "sym.%d", x++ );
    yycomment( "Writing symbol table to %s\n", buf );
    print_syms( buf );
}

yyhook_b()                  /* Enable/disable run-time trace with Ctrl-B.        */
{                           /* Not used in yacc.                                 */
                            /* enable_trace() and disable_trace() are discussed  */
                            /* below, when the gen() call is discussed.          */
    char buf[32];

    if( yyprompt( "Enable or disable trace? (e/d): ", buf, 0 ) )
    {
        if( *buf == 'e' ) enable_trace();
        else              disable_trace();
    }
}
6.5 The Lexical Analyzer
Now that we've seen the tokens and the value-stack definition, we can look at the
lexical-analyzer specification (in Listing 6.37). This file is essentially the analyzer at
the end of Appendix D, but it's been customized for the current parser by adding an
attribute-passing mechanism. The union on lines 18 to 28, which mirrors the %union
definition in the occs input file, is used to pass attributes back to the parser. For example,
when an ASSIGNOP is recognized on line 115, the first character of the lexeme is
attached to the token by assigning a value to yylval, which is of the same type as a
value-stack element. (The union definition on lines 18 to 28 doubles as an extern
definition of yylval.) The parser pushes the contents of yylval onto the value stack
when it shifts the state representing the token onto the state stack.
Also note that a newline is now recognized explicitly on line 132. The associated
action prints a comment containing the input line number to the compiler's output file so
that you can relate one to the other when you're debugging the output. You could also
use this action to pass input-line-number information to a debugger.
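The attribute-passing mechanism can be sketched without a scanner generator. In the following example the token names, the Lval variable (which plays the role of yylval), the two-field union, and the little value stack are all hypothetical; only the idea comes from the text: the scanner leaves an attribute in a global of the value-stack type, and the parser copies that global onto the value stack when it shifts the token. The 'G' and 'L' codes for >= and <= follow the RELOP action in Listing 6.37.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token { T_EOI, T_RELOP, T_ASSIGNOP, T_OTHER };   /* Hypothetical token set. */

typedef union { int ascii; int num; } attr;    /* Stand-in for the %union.    */

attr Lval;                                     /* Plays the role of yylval.   */

/* Classify one operator lexeme and leave its attribute in Lval. */
enum token scan_op(const char *lexeme)
{
    assert(lexeme != NULL);

    if ((lexeme[0] == '<' || lexeme[0] == '>') &&
        (lexeme[1] == '\0' || (lexeme[1] == '=' && lexeme[2] == '\0')))
    {
        /* <  >  <=  >= : a one-character code, 'L' for <= and 'G' for >=. */
        Lval.ascii = lexeme[1] ? (lexeme[0] == '>' ? 'G' : 'L') : lexeme[0];
        return T_RELOP;
    }

    if (lexeme[0] != '\0' && strchr("*/%+-&|^", lexeme[0]) != NULL &&
        lexeme[1] == '=' && lexeme[2] == '\0')
    {
        Lval.ascii = lexeme[0];                /* First character of the lexeme. */
        return T_ASSIGNOP;
    }

    return lexeme[0] ? T_OTHER : T_EOI;
}

#define STACK_DEPTH 32
static attr Vstack[STACK_DEPTH];               /* Toy value stack.            */
static int  Vsp = 0;                           /* Number of items on it.      */

/* What the parser does on a shift: copy the scanner's attribute onto the
 * value stack. Returns 0 on success, -1 if the stack is full.
 */
int shift_attribute(void)
{
    if (Vsp >= STACK_DEPTH)
        return -1;
    Vstack[Vsp++] = Lval;
    return 0;
}

/* Return the ascii field of the item on top of the value stack. */
int top_ascii(void)
{
    assert(Vsp > 0);
    return Vstack[Vsp - 1].ascii;
}
```

A call such as scan_op("+=") followed by shift_attribute() leaves '+' on top of the toy value stack, which is the same information that the real ASSIGNOP action makes available to the code-generation actions.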
Listing 6.37. c.lex C Lexical Analyzer
      /* Lexical analyzer specification for C. This is a somewhat extended version
       * of the one in Appendix D. The main difference is that it passes attributes
       * for some tokens back to the parser, using the yylval mechanism to push
       * the attribute onto the value stack.
       */

      %{
      #include <search.h>          /* Function prototype for bsearch().               */
      #include <tools/debug.h>     /* Needed by symtab.h.                             */
      #include <tools/hash.h>      /* Needed by symtab.h.                             */
      #include "yyout.h"           /* Token defs. created by occs. Yacc uses y.tab.h. */
      #include "symtab.h"          /* Needed to pass attributes to parser.            */
      #include "value.h"           /* ditto                                           */

 18   extern union {               /* This definition must duplicate the %union in c.y. */
 19       char      *p_char;
 20       symbol    *p_sym;
 21       link      *p_link;
 22       structdef *p_sdef;
 23       specifier *p_spec;
 24       value     *p_value;
 25       int       integer;
 26       int       ascii;
 27   }
 28   yylval;                      /* Declared by occs in yyout.c. */
 29
 30   extern FILE *yycodeout;      /* Declared by occs in yyout.c. */

      #define YYERROR(text) yyerror("%s\n", text);    /* Does nothing in UNIX lex. */
      %}

 39   let     [_a-zA-Z]            /* Letter                                 */
 40   alnum   [_a-zA-Z0-9]         /* Alphanumeric character                 */
 41   h       [0-9a-fA-F]          /* Hexadecimal digit                      */
 42   o       [0-7]                /* Octal digit                            */
 43   d       [0-9]                /* Decimal digit                          */
 44   suffix  [UuLl]               /* Suffix in integral numeric constant    */
 45   white   [\x00-\x09\x0b\s]    /* White space: all control chars but \n  */
 46
 47   %%
 48   "/*"            {
 49                       int i;

                          while( i = ii_input() )
                          {
                              if( i < 0 )
                                  ii_flushbuf();            /* Discard lexeme. */

                              else if( i == '*'  &&  ii_lookahead(1) == '/' )
                              {
                                  ii_input();
                                  break;
                              }
                          }

                          if( i == 0 )
                              yyerror( "End of file in comment\n" );
                      }

      \"(\\.|[^\"])*\"        { return STRING; }

      \"(\\.|[^\"])*\n        yyerror( "Adding missing \" to string constant\n" );
                              yymore();

      0{o}*{suffix}?          |
      0x{h}+{suffix}?         |
      [1-9]{d}*{suffix}?      return ICON ;

      ({d}+|{d}+\.{d}*|{d}*\.{d}+)([eE][\-+]?{d}+)?[fF]?     return FCON ;

      "("             return LP;
      ")"             return RP;
      "{"             return LC;
      "}"             return RC;
      "["             return LB;
      "]"             return RB;

      "->"            |
      "."             yylval.ascii = *yytext;
                      return STRUCTOP;

      "++"            |
      "--"            yylval.ascii = *yytext;
                      return INCOP;

      [~!]            yylval.ascii = *yytext;
                      return UNOP;

      "*"             return STAR;

      [/%]            yylval.ascii = *yytext;
                      return DIVOP;

      "+"             return PLUS;
      "-"             return MINUS;

      <<|>>           yylval.ascii = *yytext;
                      return SHIFTOP;

      [<>]=?          yylval.ascii = yytext[1] ? (yytext[0]=='>' ? 'G' : 'L')
                                               : (yytext[0]              );
                      return RELOP;

      [!=]=           yylval.ascii = *yytext;
                      return EQUOP;

114   [*/%+\-&|^]=    |
115   (<<|>>)=        yylval.ascii = *yytext;
116                   return ASSIGNOP;
117
118   "="             return EQUAL;
119   "&"             return AND;
120   "^"             return XOR;
121   "|"             return OR;
122   "&&"            return ANDAND;
123   "||"            return OROR;
124   "?"             return QUEST;
125   ":"             return COLON;
126   ","             return COMMA;
127   ";"             return SEMI;
128   "..."           return ELLIPSIS;
129
130   {let}{alnum}*   return id_or_keyword( yytext );
131
132   \n              fprintf( yycodeout, "\t\t\t\t\t\t\t\t\t/*%d*/\n", yylineno );
133   {white}+        ;           /* ignore other white space */
134   %%
135
136   /*----------------------------------------------------------------------*/
137
138   typedef struct              /* Routines to recognize keywords */
139   {
140       char *name;
141       int  val;
142   }
143   KWORD;
144
145   KWORD Ktab[] =              /* Alphabetic keywords */
146   {
147       { "auto",      CLASS     },
148       { "break",     BREAK     },
149       { "case",      CASE      },
150       { "char",      TYPE      },
151       { "continue",  CONTINUE  },
152       { "default",   DEFAULT   },
153       { "do",        DO        },
154       { "double",    TYPE      },
155       { "else",      ELSE      },
156       { "enum",      ENUM      },
157       { "extern",    CLASS     },
158       { "float",     TYPE      },
159       { "for",       FOR       },
160       { "goto",      GOTO      },
161       { "if",        IF        },
162       { "int",       TYPE      },
163       { "long",      TYPE      },
164       { "register",  CLASS     },
165       { "return",    RETURN    },
166       { "short",     TYPE      },
167       { "sizeof",    SIZEOF    },
168       { "static",    CLASS     },
169       { "struct",    STRUCT    },
170       { "switch",    SWITCH    },
171       { "typedef",   CLASS     },
172       { "union",     STRUCT    },
173       { "unsigned",  TYPE      },
174       { "void",      TYPE      },
175       { "while",     WHILE     }
176   };
177
178   static int cmp( a, b )
179   KWORD *a, *b;
180   {
181       return strcmp( a->name, b->name );
182   }
183
184   int id_or_keyword( lex )            /* Do a binary search for a  */
185   char *lex;                          /* possible keyword in Ktab. */
186   {                                   /* Return the token if it's  */
187       KWORD *p;                       /* in the table, NAME        */
188       KWORD dummy;                    /* otherwise.                */
189
190       dummy.name = lex;
191       p = bsearch( &dummy, Ktab, sizeof(Ktab)/sizeof(KWORD), sizeof(KWORD), cmp );
192
193       if( p )                         /* It's a keyword. */
194       {
195           yylval.ascii = *yytext;
196           return p->val;
197       }
198       else if( yylval.p_sym = (symbol *) findsym( Symbol_tab, yytext ) )
199           return (yylval.p_sym->type->tdef) ? TTYPE : NAME ;
200       else
201           return NAME;
202   }
6.6 Declarations
6.6.1 Simple Variable Declarations
This section moves from the data structures themselves to the code that manipulates
them. Declaration processing involves two main tasks: (1) you must assemble
the linked lists that represent the types, attach them to symbols, and put the resulting
structures into the symbol table, and (2) you must both generate C-code definitions for
variables at fixed addresses and compute the frame-pointer-relative offsets for automatic
variables. Many of the productions in the grammar work in concert to produce a single
definition. Since these productions are discussed piecemeal, you should take a moment
to look at the overall structure of the grammar in Appendix C. The declarations are all
handled first, at the top of the grammar.
An example variable-declaration parse.
The best way to understand how the code generation works is to follow along as the
parser works on an explicit example. (This is also a good approach for adding the
code-generation actions in the first place: run a short sample through the parser, observing
the order in which reductions occur. Since actions at the far right of productions are
executed in the same sequence as the reductions, you can see where the various actions need
to be placed.) I'll use the following input to demonstrate simple declarations:
long int *x, y;
The parse of that input is shown, in symbolic form, in Table 6.13 (see footnote 14).
Table 6.13. A Parse of long int *x, y;
    Stack                                                     Next Action Taken by Parser
 1  (empty)                                                   Reduce by ext_def_list→ε
 2  ext_def_list                                              Shift TYPE (long)
 3  ext_def_list TYPE                                         Reduce by type_specifier→TYPE
 4  ext_def_list type_specifier                               Reduce by type_or_class→type_specifier
 5  ext_def_list type_or_class                                Reduce by specifiers→type_or_class
 6  ext_def_list specifiers                                   Shift TYPE (int)
 7  ext_def_list specifiers TYPE                              Reduce by type_specifier→TYPE
 8  ext_def_list specifiers type_specifier                    Reduce by type_or_class→type_specifier
 9  ext_def_list specifiers type_or_class                     Reduce by specifiers→specifiers type_or_class
10  ext_def_list specifiers                                   Reduce by opt_specifiers→specifiers
11  ext_def_list opt_specifiers                               Shift STAR
12  ext_def_list opt_specifiers STAR                          Shift NAME (x)
13  ext_def_list opt_specifiers STAR NAME                     Reduce by new_name→NAME
14  ext_def_list opt_specifiers STAR new_name                 Reduce by var_decl→new_name
15  ext_def_list opt_specifiers STAR var_decl                 Reduce by var_decl→STAR var_decl
16  ext_def_list opt_specifiers var_decl                      Reduce by ext_decl→var_decl
17  ext_def_list opt_specifiers ext_decl                      Reduce by ext_decl_list→ext_decl
18  ext_def_list opt_specifiers ext_decl_list                 Shift COMMA
19  ext_def_list opt_specifiers ext_decl_list COMMA           Shift NAME (y)
20  ext_def_list opt_specifiers ext_decl_list COMMA NAME      Reduce by new_name→NAME
21  ext_def_list opt_specifiers ext_decl_list COMMA new_name  Reduce by var_decl→new_name
22  ext_def_list opt_specifiers ext_decl_list COMMA var_decl  Reduce by ext_decl→var_decl
23  ext_def_list opt_specifiers ext_decl_list COMMA ext_decl  Reduce by ext_decl_list→ext_decl_list COMMA ext_decl
24  ext_def_list opt_specifiers ext_decl_list                 Reduce by {3}→ε
25  ext_def_list opt_specifiers ext_decl_list {3}             Shift SEMI
26  ext_def_list opt_specifiers ext_decl_list {3} SEMI        Reduce by ext_def→opt_specifiers ext_decl_list {3} SEMI
27  ext_def_list ext_def                                      Reduce by ext_def_list→ext_def_list ext_def
28  ext_def_list                                              Reduce by program→ext_def_list (accept)
Since all actions are performed as part of a reduction, the best way to see how the code
works is to follow the order of reductions during the parse. First, notice that the
ε production in
ext_def_list
    : ext_def_list ext_def
    | /* epsilon */
(on line 189 of Listing 6.38) is done first. This is always the case in left-recursive list
productions: the nonrecursive component (whether or not it's ε) must be reduced first in
order to put the recursive left-hand side onto the stack (see footnote 15). In the current
example, the parser must get an ext_def_list onto the stack in order to be able to reduce by
ext_def_list→ext_def_list ext_def at some later time, and the only way to get that initial
ext_def_list onto the stack is to reduce by ext_def_list→ε.
14. If you have the distribution disk mentioned in the Preface, you can use the visible parser to see the parse
process in action. The file c.exe is an executable version of the compiler described in this chapter. Get the
parse started by creating a file called test.c containing the single line:
long int *x, y;
Then invoke the parser with the command: c test.c. Use the space bar to single-step through the parse.
15. The nonrecursive element of a right-recursive list production is always done last.
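The order of reductions in a left-recursive list can be checked with a hand-coded LR driver. The sketch below is not occs output; the three-state machine, the function name, and the one-letter trace codes are all assumptions made for illustration. It parses the grammar list→list item | ε, where every item is the character 'a', and records an 'e' for each reduction by the ε production and an 'r' for each reduction by the recursive production.

```c
#include <assert.h>
#include <stddef.h>

/* Parse a string of 'a' items with the grammar
 *
 *      list : list item        (traced as 'r')
 *           | (epsilon)        (traced as 'e')
 *
 * using a hand-built LR machine:
 *      state 0: start state; the only possible move is to reduce by epsilon
 *      state 1: a list is on the stack; shift an item or accept at end of input
 *      state 2: list item is on the stack; reduce by list -> list item
 *
 * The reductions are written to trace as a string. Returns 0 if the input
 * is accepted, -1 on a syntax error or if trace is too small.
 */
int parse_list(const char *input, char *trace, size_t size)
{
    int    stack[4];                    /* State stack; depth never exceeds 3. */
    int    sp = 0;
    size_t t  = 0;

    assert(input != NULL && trace != NULL && size > 0);

    stack[sp] = 0;
    for (;;)
    {
        switch (stack[sp])
        {
        case 0:                         /* Reduce by list -> epsilon.          */
            if (t + 1 >= size)
                return -1;
            trace[t++]  = 'e';          /* Nothing is popped for epsilon.      */
            stack[++sp] = 1;            /* goto( 0, list ) = 1                 */
            break;

        case 1:
            if (*input == 'a')          /* Shift the item.                     */
            {
                stack[++sp] = 2;
                ++input;
            }
            else if (*input == '\0')    /* Accept.                             */
            {
                trace[t] = '\0';
                return 0;
            }
            else
                return -1;
            break;

        default:                        /* State 2: reduce list -> list item.  */
            if (t + 1 >= size)
                return -1;
            trace[t++]  = 'r';
            sp         -= 2;            /* Pop the item and the list.          */
            stack[++sp] = 1;            /* goto( 0, list ) = 1                 */
            break;
        }
    }
}
```

For an input of three items the trace begins with the single ε reduction and is followed by one recursive reduction per item, which matches lines 1, 27, and 28 of Table 6.13 for the one-element list in the example.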
Listing 6.38. c.y Initialization and Cleanup Productions
183   %%
184   program : ext_def_list  { clean_up(); }
185           ;
186
187   ext_def_list
188           : ext_def_list ext_def
189           | /* epsilon */
190             {
191                 yydata( "#include <tools/virtual.h>\n" );
192                 yydata( "#define T(x)\n" );
193                 yydata( "SEG(data)\n" );
194                 yycode( "\nSEG(code)\n" );
195                 yybss ( "\nSEG(bss)\n" );
196             }
197           ;
Initializations done in nonrecursive list element.
Generate SEG directives.
Syntactically, a program is a list of external definitions because of the productions on
lines 187 to 189 of Listing 6.38. A reduction by the ε production on line 189 is always
the first action taken by the parser, regardless of the input. The current compiler exploits
this behavior to do various initializations on lines 189 to 196 of Listing 6.38. The
appropriate SEG directives are generated at the tops of the various segments, and a
#include for <tools/virtual.h> is output at the top of the data segment, which is at the
top of the output file after the three segments are combined at the end of the compilation.
I'll discuss the empty T() macro on line 192 in a moment.
The associated cleanup actions are all done in the previous production:
program : ext_def_list { clean_up(); }
on line 184. Remember, this is a bottom-up parser, so the reduction to the goal symbol is
the last action taken by the parser. The clean_up() action coalesces the output
streams and deletes temporary files.
Specifier processing.
After initializing via the ε production, the parser starts processing the specifier
component of the declaration. All of the specifier productions are shown together in Listing
6.39. Three types of specifier lists are supported:
opt_specifiers   Zero or more types and storage classes mixed together.
specifiers       One or more types and storage classes mixed together.
type             One or more types. (No storage classes are permitted.)
Note that the parser is not checking semantics here. It just recognizes collections of
possible types and storage classes without testing for illegal combinations like a short
long.
Pointer to link used as an attribute.
The parser starts by shifting the TYPE token (long in the current input) and then
reduces by type_specifier→TYPE (on line 229 of Listing 6.39). The associated action
calls new_type_spec(), which gets a new link and initializes it to a specifier of the
correct type. A pointer to the link structure is attached to the type_specifier by
assigning it to $$. From here on out, that pointer is on the value stack at the position
corresponding to the type_specifier. If a storage class is encountered instead of a TYPE,
the action on line 226 of Listing 6.39 is executed instead of the action on line 229. This
action modifies the storage-class component of the link rather than the type, but is
Listing 6.39. c.y Specifiers
198   opt_specifiers
199           : CLASS TTYPE   { set_class_bit( 0,  $2->etype );    /* Reset class.      */
200                             set_class_bit( $1, $2->etype );    /* Add new class.    */
201                             $$ = $2->type;
202                           }
203           | TTYPE         { set_class_bit( 0, $1->etype );     /* Reset class bits. */
204                             $$ = $1->type;
205                           }
206           | specifiers
207           | /* empty */   %prec COMMA
208                           {
209                               $$ = new_link();
210                               $$->class = SPECIFIER;
211                               $$->NOUN  = INT;
212                           }
213           ;
214   specifiers
215           : type_or_class
216           | specifiers type_or_class      { spec_cpy( $$, $2 );
217                                             discard_link_chain( $2 ); }
218           ;
219   type
220           : type_specifier
221           | type type_specifier           { spec_cpy( $$, $2 );
222                                             discard_link_chain( $2 ); }
223           ;
224   type_or_class
225           : type_specifier
226           | CLASS                         { $$ = new_class_spec( $1 ); }
227           ;
228   type_specifier
229           : TYPE                          { $$ = new_type_spec( yytext ); }
230           | enum_specifier                { $$ = new_type_spec( "int" );  }
231           | struct_specifier              { $$ = new_link();
232                                             $$->class    = SPECIFIER;
233                                             $$->NOUN     = STRUCTURE;
234                                             $$->V_STRUCT = $1;
235                                           }
236           ;
otherwise the same as the earlier one. Both new_type_spec() and new_class_spec()
are in Listing 6.40.
type_or_class→type_specifier, specifiers→type_or_class.
The next reductions are type_or_class→type_specifier and specifiers→type_or_class,
neither of which has an associated action. The pointer-to-link attribute
is carried along with each reduction because of the default $$=$1 action that is supplied
by the parser. The pointer to the link created in the initial reduction is still on the value
stack, but now at the position corresponding to the specifiers nonterminal.
The parser now processes the int. It performs the same set of reductions that we
just looked at, but this time it reduces by
specifiers→specifiers type_or_class
rather than by
specifiers→type_or_class
as was done earlier. (We're on line nine of Table 6.13. The associated action is on lines
Listing 6.40. decl.c Create and Initialize a link
#include <stdlib.h>
#include <tools/c-code.h>
#include "symtab.h"
#include "value.h"
#include "proto.h"

/* DECL.C   This file contains support subroutines for those actions in c.y
 *          that deal with declarations.
 */

extern void yybss(), yydata();

/*----------------------------------------------------------------------*/

PUBLIC link *new_class_spec( first_char_of_lexeme )
int first_char_of_lexeme;
{
    /* Return a new specifier link with the sclass field initialized to hold
     * a storage class, the first character of which is passed in as an argument
     * ('e' for extern, 's' for static, and so forth).
     */

    link *p  = new_link();
    p->class = SPECIFIER;
    set_class_bit( first_char_of_lexeme, p );
    return p;
}

/*----------------------------------------------------------------------*/

PUBLIC void set_class_bit( first_char_of_lexeme, p )
int  first_char_of_lexeme;
link *p;
{
    /* Change the class of the specifier pointed to by p as indicated by the
     * first character in the lexeme. If it's 0, then the defaults are
     * restored (fixed, nonstatic, nonexternal). Note that the TYPEDEF
     * class is used here only to remember that the input class
     * was a typedef, the tdef field in the link is set true (and the
     * class is cleared) before the entry is added to the symbol table.
     */

    switch( first_char_of_lexeme )
    {
    case 0:     p->SCLASS = FIXED;
                p->STATIC = 0;
                p->EXTERN = 0;
                break;

    case 't':   p->SCLASS = TYPEDEF;    break;
    case 'r':   p->SCLASS = REGISTER;   break;
    case 's':   p->STATIC = 1;          break;
    case 'e':   p->EXTERN = 1;          break;

    default:    yyerror( "INTERNAL, set_class_bit: bad storage class '%c'\n",
                                                        first_char_of_lexeme );
                exit( 1 );
                break;
    }
}

/*----------------------------------------------------------------------*/

PUBLIC link *new_type_spec( lexeme )
char *lexeme;
{
    /* Create a new specifier and initialize the type according to the indicated
     * lexeme. Input lexemes are: char const double float int long short
     *                            signed unsigned void volatile
     */

    link *p  = new_link();
    p->class = SPECIFIER;

    switch( lexeme[0] )
    {
    case 'c':   if( lexeme[1] == 'h' )                  /* char | const      */
                    p->NOUN = CHAR;                     /* (Ignore const.)   */
                break;
    case 'd':                                           /* double            */
    case 'f':   yyerror( "No floating point\n" );       /* float             */
                break;

    case 'i':   p->NOUN     = INT;      break;          /* int               */
    case 'l':   p->LONG     = 1;        break;          /* long              */
    case 'u':   p->UNSIGNED = 1;        break;          /* unsigned          */

    case 'v':   if( lexeme[2] == 'i' )                  /* void | volatile   */
                    p->NOUN = VOID;                     /* ignore volatile   */
                break;
    case 's':   break;                                  /* short | signed    */
    }                                                   /* ignore both       */

    return p;
}
216 to 217 of Listing 6.39.) There are currently three attributes of interest: $1 and $$
(which is initialized to $1 by the parser before executing the action) hold a pointer to the
link structure that was assembled when the long was processed; $2 points at the link
for the new list element, the int. The action code merges the two structures by
copying all relevant fields from $2 to $$. The second link ($2) is then discarded. The
action is illustrated in Figure 6.14.
If additional types or storage classes are present in the input, the parser loops some
more, creating a new link for each new list element, merging it into the existing one,
and discarding the new link after it's merged.
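The merge-and-discard step can be sketched with a stripped-down specifier. The spec structure, its field names, and the three functions below are hypothetical simplifications; in particular, spec_merge() is only my guess at the kind of copying that spec_cpy() does (copy a field only when the source actually sets it), not a transcription of the real subroutine.

```c
#include <assert.h>
#include <stdlib.h>

enum noun { N_INT, N_CHAR, N_VOID };    /* N_INT is 0, so it doubles as "not set". */

typedef struct spec                     /* Hypothetical, simplified specifier.    */
{
    enum noun noun;
    unsigned  is_long     : 1;
    unsigned  is_unsigned : 1;
} spec;

/* Make a specifier for one type keyword, in the spirit of new_type_spec().
 * Returns NULL if memory is exhausted.
 */
spec *new_spec(const char *lexeme)
{
    spec *p = calloc(1, sizeof *p);     /* All fields start out "not set". */

    assert(lexeme != NULL);
    if (!p)
        return NULL;

    switch (lexeme[0])
    {
    case 'c': if (lexeme[1] == 'h') p->noun = N_CHAR;  break;   /* char     */
    case 'i': p->noun        = N_INT;                  break;   /* int      */
    case 'l': p->is_long     = 1;                      break;   /* long     */
    case 'u': p->is_unsigned = 1;                      break;   /* unsigned */
    case 'v': if (lexeme[2] == 'i') p->noun = N_VOID;  break;   /* void     */
    }
    return p;
}

/* Copy into dst every field that src actually sets; leave the rest alone. */
void spec_merge(spec *dst, const spec *src)
{
    assert(dst != NULL && src != NULL);

    if (src->noun != N_INT)  dst->noun        = src->noun;
    if (src->is_long)        dst->is_long     = 1;
    if (src->is_unsigned)    dst->is_unsigned = 1;
}

/* The action for  specifiers -> specifiers type_or_class:  merge $2 into $$,
 * then throw $2 away. Returns the surviving specifier.
 */
spec *merge_and_discard(spec *dst, spec *src)
{
    spec_merge(dst, src);
    free(src);
    return dst;
}
```

Merging the specifier for int into the one for long leaves a single structure with the long bit set and a noun of INT, which is the situation shown in the lower half of Figure 6.14. Like the parser, the sketch makes no attempt to reject illegal combinations.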
TTYPE, processing a typedef.
Declarator processing.
Identifier processing in declaration, new_name→NAME.
Having collected the entire specifier list, the parser now reduces by
opt_specifiers→specifiers on line 206 of Listing 6.39. There is no action. If the
Figure 6.14. Merging Links
(Before) Parse stack, bottom to top: ext_def_list, specifiers, type_or_class.
         Value stack: the specifiers item points to a link with class=SPECIFIER, long=1;
         the type_or_class item points to a link with class=SPECIFIER, noun=INT.
(After)  Parse stack, bottom to top: ext_def_list, specifiers.
         Value stack: the specifiers item points to a link with class=SPECIFIER, noun=INT,
         long=1. (The other link is deleted.)
declaration were in terms of a typedef, opt_specifiers→TTYPE (just above the current
production, on line 203 of Listing 6.39) is executed instead of the list-collecting
reductions we just looked at. The scanner attaches an attribute to the TTYPE token: a pointer
to a symbol structure representing the typedef. The name field in the symbol is the
type name rather than a variable name. The action on line 204 of Listing 6.39 passes the
type string for this symbol back up the stack as the opt_specifiers attribute. This type
chain differs from a normal one only in that the tdef bit is set in the leftmost link in
the chain. The parser needs this information because a type that comes from a typedef
could be an entire chain, including the declarators. A non-typedef declaration always
yields a single specifier. If no specifier at all is present, then the action on lines 207 to
212 of Listing 6.39 is executed (see footnote 16). This action just sets things up as if an
int had been found in the input.
16. x;, with no explicit type, is a perfectly legitimate declaration; the default int type and extern storage
class are used.
The parser shelves the specifier on the value stack for a moment to move on to the
declarator component of the declaration: it stores a pointer to the accumulated specifier
link on the value stack as an attribute attached to the most recently pushed specifier
nonterminal. Declarator processing starts with a shift of the STAR and NAME tokens
on lines 11 and 12 of Table 6.13 on page 523. The parser then starts reducing by the
productions in Listing 6.41, below.
The first reduction is new_name→NAME. The action is on line 282 of Listing 6.41.
There are two name productions to take care of two different situations in which names
Listing 6.41. c.y Variable Declarators
237   var_decl
238           : new_name      %prec COMMA     /* This production is done first. */
239
240           | var_decl LP RP                { add_declarator( $$, FUNCTION ); }
241           | var_decl LP var_list RP       { add_declarator( $$, FUNCTION );
242                                             discard_symbol_chain( $3 );
243                                           }
244           | var_decl LB RB
245             {
246                 /* At the global level, this must be treated as an array of
247                  * indeterminate size; at the local level this is equivalent to
248                  * a pointer. The latter case is patched after the declaration
249                  * is assembled.
250                  */
251
252                 add_declarator( $$, ARRAY );
253                 $$->etype->NUM_ELE = 0;
254
255                 YYD( yycomment("Add POINTER specifier\n"); )
256             }
257
258           | var_decl LB const_expr RB
259             {
260                 add_declarator( $$, ARRAY );
261                 $$->etype->NUM_ELE = $3;
262
263                 YYD( yycomment("Add array[%d] spec.\n", $$->etype->NUM_ELE); )
264             }
265           | STAR var_decl   %prec UNOP
266             {
267                 add_declarator( $$ = $2, POINTER );
268                 YYD( yycomment("Add POINTER specifier\n"); )
269             }
270
271           | LP var_decl RP  { $$ = $2; }
272           ;
273
274   /*----------------------------------------------------------------------
275    * Name productions. new_name always creates a new symbol, initialized with the
276    * current lexeme. Name returns a preexisting symbol with the associated name
277    * (if there is one); otherwise, the symbol is allocated. The NAME token itself
278    * has a NULL attribute if the symbol doesn't exist, otherwise it returns a
279    * pointer to a "symbol" for the name.
280    */
281
282   new_name: NAME  { $$ = new_symbol( yytext, Nest_lev ); }
283           ;
284
285   name    : NAME  { if( !$1 || $1->level != Nest_lev )
286                         $$ = new_symbol( yytext, Nest_lev );
287                   }
288           ;
are used. Normally, the scanner looks up all identifiers in the symbol table to see
whether or not they represent types so that it can return a TTYPE token when appropriate.
In order to avoid unnecessary lookups, the scanner attaches a symbol-table pointer
for an existing identifier to the NAME token if the identifier is in the table; otherwise the
attribute is NULL. This way the table lookup doesn't have to be done a second time in
the code-generation action. The new_name→NAME action takes care of newly created
identifiers. It ignores this passed-back symbol pointer and allocates a new symbol structure,
passing a pointer to the symbol back up the parse tree as an attribute attached to the new_name
nonterminal. The name→NAME action (on lines 285 to 286 of Listing 6.41) uses the
existing symbol. The test on line 285 identifies whether the returned symbol is at the
current scoping level. If it is not, then the parser assumes that this is a new symbol (such
as a local variable) that just happens to have the same name as an existing one, and
allocates a new symbol structure. Nest_lev keeps track of the current scope level; it is
declared in Listing 6.42, which is part of the occs-input-file definitions section.
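The scoping test in the name→NAME action can be sketched in isolation. The following is a minimal illustration, not the book's code: the cut-down symbol structure, make_symbol(), and name_action() are hypothetical stand-ins for the real symbol, new_symbol(), and the action on lines 285 to 287.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the book's symbol structure (an assumption). */
typedef struct symbol {
    char name[32];
    int  level;               /* Nesting level at which it was declared. */
} symbol;

static int Nest_lev = 0;      /* Current block-nesting level. */

/* Hypothetical allocator standing in for new_symbol(). */
static symbol *make_symbol(const char *name, int level)
{
    symbol *s = calloc(1, sizeof *s);
    assert(s != NULL);
    strncpy(s->name, name, sizeof s->name - 1);   /* calloc keeps it terminated */
    s->level = level;
    return s;
}

/* Mirrors the test in the name->NAME action: reuse the symbol handed up by
 * the scanner only if it exists and lives at the current scoping level;
 * otherwise allocate a fresh symbol for the current level. */
static symbol *name_action(symbol *from_scanner, const char *lexeme)
{
    if (!from_scanner || from_scanner->level != Nest_lev)
        return make_symbol(lexeme, Nest_lev);
    return from_scanner;
}
```

Calling name_action() with a NULL pointer, or with a symbol from an outer level while Nest_lev is larger, yields a new symbol; a symbol already at the current level is passed through unchanged.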
Listing 6.42. c.y: Identify Nesting Level (from Occs Definitions Section)
119 %{
120 int Nest_lev;        /* Current block-nesting level. */
121 %}
The next reduction (on line 15 of Table 6.13 on page 523) is var_decl→new_name.
The production is on line 238 of Listing 6.41. The only action is the implicit $$=$1,
which causes the symbol pointer returned from new_name to be passed further up the
tree, attached to the var_decl. The %prec at the right of the line eliminates a shift/reduce
conflict by assigning a very low precedence to the current production; the technique is
discussed in Appendix E.
The parser reduces by var_decl→STAR var_decl next. (We're moving from line 15
to line 16 of Table 6.13 on page 523; the action is on line 265 of Listing 6.41.) The
action calls add_declarator() to add a pointer-declarator link to the type chain in
the symbol that was created when the name was processed. The process is similar to
that used to assemble an NFA in Chapter Two. Figure 6.15 shows the parse and value
stacks both before and after this reduction is performed.
Note that the synthesized attribute for every right-hand side of this production is the
same symbol structure that was allocated when the name was processed. The actions
just add link structures to the symbol's type chain.
If the declaration had been int **x, both stars would have been shifted initially,
and the current reduction would have executed twice in succession, thereby adding a
second pointer-declarator link to the end of the type chain in the symbol structure.
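How a declarator link gets tacked onto the end of a type chain can be sketched as follows. This is a minimal sketch under stated assumptions: the link and symbol structures are cut down to the fields needed here, the class field is called lclass, and add_declarator_sketch() and chain_length() are hypothetical names rather than the book's add_declarator().

```c
#include <assert.h>
#include <stdlib.h>

enum { SPECIFIER, DECLARATOR };          /* link classes             */
enum { POINTER, ARRAY, FUNCTION };       /* declarator kinds         */

typedef struct link {
    int          lclass;                 /* SPECIFIER or DECLARATOR  */
    int          dcl_type;               /* valid for declarators    */
    struct link *next;
} link;

typedef struct symbol {
    link *type;                          /* head of the type chain   */
    link *etype;                         /* last link in the chain   */
} symbol;

/* Append a declarator link to the END of sym's type chain. */
static void add_declarator_sketch(symbol *sym, int kind)
{
    link *l = calloc(1, sizeof *l);
    assert(l != NULL);
    l->lclass   = DECLARATOR;
    l->dcl_type = kind;

    if (!sym->type)
        sym->type = l;                   /* first link in the chain  */
    else
        sym->etype->next = l;            /* tack onto the old tail   */
    sym->etype = l;
}

/* Count the links in a symbol's type chain. */
static int chain_length(const symbol *sym)
{
    int n = 0;
    for (const link *p = sym->type; p; p = p->next)
        ++n;
    return n;
}
```

For int **x, two successive calls with POINTER leave a two-link chain in the symbol, with etype pointing at the second link.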
The other productions that share the var_decl left-hand side do similar things, adding
declarator links either for arrays or for functions, as appropriate, to the end of the type
chain in the current symbol structure. Note that the actions on lines 240 to 243 handle
the function component of a function-pointer declaration. I'll look at function
declarations, including the var_list nonterminal, in a moment; briefly, the var_list takes
care of function prototypes, and the associated attribute is a pointer to the head of a linked
list of symbol structures, one for each function argument. The action here just discards
all the symbols in the list. Similarly, const_expr (on line 258) handles integer-constant
expressions. This production is also discussed below, but the associated attribute is an
integer. I'm using the <num> field in the value-stack union to hold the value.
As an aside, a minor problem came up when adding the actions to the grammar: you
can't do the expression-processing actions until the declarations are finished, but you
need to use a constant expression to process an array declaration. I solved the problem
by providing a dummy action of the form:
    const_expr: expr { $$ = 10; }
until the declarations were working. The action was later replaced with something more
reasonable.
Figure 6.15. Adding Declarators to the Type Chain
[Figure: parse and value stacks before and after the reduction. Before: the parse stack holds
ext_def_list, opt_specifiers, STAR, and var_decl; the attribute for opt_specifiers is a link with
class=SPECIFIER, noun=INT, long=1. After: the parse stack holds ext_def_list, opt_specifiers,
and var_decl; the attribute for var_decl is the symbol with name="x", whose type and etype fields
point at the newly added declarator link.]
When the parser finishes with the declarator elements (when the lookahead is a
COMMA), the type chain in the symbol structure holds a linked list of declarator
links, and the specifier is still on the value stack at a position corresponding to the
opt_specifiers nonterminal. The comma is shifted, and the parser goes through the entire
declarator-processing procedure again for the y (on lines 20 to 23 of Table 6.13 on page
523).
Now the parser starts to create the cross links for the symbol-table entry: the links that
join declarations for all variables at the current scoping level. It does this using the
productions in Listing 6.43. The first reduction of interest is ext_decl_list→ext_decl,
executed on line 17 of Table 6.13. The associated action, on line 298 of Listing 6.43, puts a
NULL pointer into the next field of the symbol. This pointer marks the end of the linked list. The
parse proceeds as just described until the declaration for y has been processed,
whereupon the parser links the two declarations together. The parse and value stacks, just
before and just after the reduction by
    ext_decl_list→ext_decl_list COMMA ext_decl
are shown in Figure 6.16, and the code that does the linking is on lines 300 to 308 of
Listing 6.43. If there were more comma-separated declarators in the input, the process
would continue in this manner, each successive element being linked to the head of the
list in turn.
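The two ext_decl_list actions amount to building a singly linked list by prepending. A minimal sketch follows; the reduced symbol structure and the helper names start_list() and link_decl() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct symbol {
    const char    *name;
    struct symbol *next;     /* cross link to the next declarator */
} symbol;

/* ext_decl_list -> ext_decl: the first declarator terminates the chain. */
static symbol *start_list(symbol *first)
{
    first->next = NULL;
    return first;
}

/* ext_decl_list -> ext_decl_list COMMA ext_decl: the new declarator
 * becomes the head, so the list ends up in reverse source order. */
static symbol *link_decl(symbol *head, symbol *new_decl)
{
    new_decl->next = head;
    return new_decl;
}
```

For a declaration of x followed by y, the resulting list head is y, whose next field points at x.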
Listing 6.43. c.y: Function Declarators
289 /*----------------------------------------------------------------------
290  * Global declarations: take care of the declarator part of the declaration.
291  * (The specifiers are handled by specifiers).
292  * Assemble the declarators into a chain, using the cross links.
293  */
294
295 ext_decl_list
296     : ext_decl
297       {
298           $$->next = NULL;          /* First link in chain. */
299       }
300     | ext_decl_list COMMA ext_decl
301       {
302           /* Initially, $1 and $$ point at the head of the chain.
303            * $3 is a pointer to the new declarator.
304            */
305
306           $3->next = $1;
307           $$       = $3;
308       }
309     ;
310
311 ext_decl
312     : var_decl
313     | var_decl EQUAL initializer    { $$->args = (symbol *) $3; }
314     | funct_decl
315     ;
316
The only other issue of interest is the initializer, used on line 313 of Listing 6.43. I'll
defer discussing the details of initializer processing until expressions are discussed, but
the attribute associated with the initializer is a pointer to the head of a linked list of
structures that represent the initial values. The args field of the symbol structure is
used here to remember this pointer. You must use a cast to get the types to match.
Before proceeding with the sample parse, it's useful to back up a notch and finish
looking at the various declarator productions. There are two types of declarators not
used in the current example: function declarators and abstract declarators. The
function-processing productions start in Listing 6.44. First, notice that the funct_decl
productions are almost identical to the var_decl productions that were examined earlier.
They both assemble linked lists of declarator links in a symbol structure that is passed
around as an attribute. The only significant additions are the right-hand sides on lines
329 to 341, which handle function arguments.
The same funct_decl productions are used both for function declarations (externs
and prototypes) and function definitions (where a function body is present); remember,
these productions are handling only the declarator component of the declaration. The
situation is simplified because prototypes are ignored. If they were supported, you'd
have to detect semantic errors such as the following one, in which the arguments don't
have names. The parser accepts the following input without errors:
    foo( int, long )
    {
        /* body */
    }
Figure 6.16. A Reduction by ext_decl_list→ext_decl_list COMMA ext_decl
[Figure: parse and value stacks before and after the reduction. Before: the parse stack holds
ext_def_list, opt_specifiers, ext_decl_list, COMMA, and ext_decl. The attribute for ext_decl is the
symbol with name="y" (type, etype, and next are all NULL); the attribute for ext_decl_list is the
symbol with name="x" (next is NULL), whose type chain holds a link with class=DECLARATOR,
dcl_type=POINTER; the attribute for opt_specifiers is a link with class=SPECIFIER, noun=INT,
long=1. After: the parse stack holds ext_def_list, opt_specifiers, and ext_decl_list; the attribute
for ext_decl_list is the symbol for "y", whose next field now points at the symbol for "x".]
The Nest_lev variable is modified in the imbedded actions on lines 329 and 336
because function arguments are actually at the inner scoping level.
The function arguments can take two forms. A name_list is a simple list of names, as
used for arguments in the older, K&R-style declaration syntax:
    hobbit( frito, bilbo, spam )
    short frito, bilbo;
A var_list takes care of both the new, C++-style syntax:
    hobbit( short frito, short bilbo, int spam );
and function prototypes that use abstract declarators (declarations without names) such
as the following:
    hobbit( short, short, int );
All of these forms are recognized by the parser, but the last one is just ignored in the
Listing 6.44. c.y: Function Declarators
317 funct_decl
318     : STAR funct_decl               { add_declarator( $$ = $2, POINTER );
319                                     }
320     | funct_decl LB RB              { add_declarator( $$, ARRAY );
321                                       $$->etype->NUM_ELE = 0;
322                                     }
323     | funct_decl LB const_expr RB   { add_declarator( $$, ARRAY );
324                                       $$->etype->NUM_ELE = $3;
325                                     }
326     | LP funct_decl RP              { $$ = $2; }
327     | funct_decl LP RP              { add_declarator( $$, FUNCTION ); }
328     | new_name LP RP                { add_declarator( $$, FUNCTION ); }
329     | new_name LP { ++Nest_lev; } name_list { --Nest_lev; } RP
330       {
331           add_declarator( $$, FUNCTION );
332
333           $4       = reverse_links( $4 );
334           $$->args = $4;
335       }
336     | new_name LP { ++Nest_lev; } var_list { --Nest_lev; } RP
337       {
338           add_declarator( $$, FUNCTION );
339           $$->args = $4;
340       }
341     ;
342 name_list
343     : new_name
344       {
345           $$->next         = NULL;
346           $$->type         = new_link();
347           $$->type->class  = SPECIFIER;
348           $$->type->SCLASS = AUTO;
349       }
350     | name_list COMMA new_name {
351           $$               = $3;
352           $$->next         = $1;
353           $$->type         = new_link();
354           $$->type->class  = SPECIFIER;
355           $$->type->SCLASS = AUTO;
356       }
357     ;
358 var_list
359     : param_declaration                 { if( $1 ) $$->next = NULL; }
360     | var_list COMMA param_declaration  { if( $3 )
361                                           {
362                                               $$       = $3;
363                                               $3->next = $1;
364                                           }
365                                         }
366     ;
367 param_declaration : type var_decl       { add_spec_to_decl( $1, $$ = $2 ); }
368     | abstract_decl                     { discard_symbol( $1 ); $$ = NULL; }
369     | ELLIPSIS                          { $$ = NULL; }
370     ;
present application: The abstract_decl production used on line 368 of Listing 6.44
creates a symbol structure, just like the ones we've been discussing. There's no name,
however. The action on line 368 just throws it away. Similarly, the ELLIPSIS that's
used for ANSI variable-argument lists is just ignored on the next line.
The name_list and var_list productions are found on lines 342 to 365 of Listing 6.44.
They differ from one another in only one significant way: A name_list creates a linked
list of symbols, one for each name in the list, and the type chains for all these symbols
are single specifier links representing ints. A var_list takes the types from the
declarations rather than supplying an int type. By the time the parser gets back up to
lines 329 or 336 of Listing 6.44, the argument list will have been processed, and
the attribute attached to the name_list or var_list will be a pointer to the head of a linked
list of symbol structures. The $$->args = $4 attaches this list to the args field of
the symbol that represents the function itself. Figure 6.17 shows the way that
    hobbit( short frito, short bilbo, int spam );
is represented once all the processing is finished.
Figure 6.17. Representing hobbit(short frito, short bilbo, int spam)
[Figure: the symbol with name="hobbit" (next is NULL) has a type chain made up of a link with
class=DECLARATOR, dcl_type=FUNCTION followed by a link with class=SPECIFIER, noun=INT. Its
argument list is a chain of three symbols joined by their next fields, ending with name="bilbo"
and name="spam" (next is NULL). The first two argument symbols each point at a link with
class=SPECIFIER, noun=INT, short=1; the symbol for "spam" points at a link with
class=SPECIFIER, noun=INT.]
Note that the list of arguments is assembled in reverse order because, though the list
is processed from left to right, each new element is added to the head of the list. The
reverse_links() call on line 333 of Listing 6.44 goes through the linked list of symbols
and reverses the direction of the next pointers. It returns a pointer to the new
head of the chain (formerly the end of the chain).
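The pointer reversal just described is the usual in-place reversal of a singly linked list. The sketch below illustrates the idea only; the reduced symbol structure and the name reverse_links_sketch() are assumptions, and the book's own reverse_links() may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

typedef struct symbol {
    const char    *name;
    struct symbol *next;
} symbol;

/* Reverse the direction of the next pointers in place and return the
 * new head of the chain (formerly the end of the chain). */
static symbol *reverse_links_sketch(symbol *sym)
{
    symbol *prev = NULL;

    while (sym) {
        symbol *following = sym->next;   /* remember the rest of the list */
        sym->next = prev;                /* point backwards               */
        prev = sym;
        sym  = following;
    }
    return prev;
}
```

If the parser built the chain spam, bilbo, frito (last argument first), the call returns a pointer to frito, and the next pointers then run frito, bilbo, spam.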
One other subroutine is of particular interest: add_spec_to_decl(), called on
line 367 of Listing 6.44, merges together the specifier and declarator components of a
declaration. It's shown in Listing 6.45, below. Passed a pointer to a link that
represents the specifier, and a pointer to a symbol that contains a type chain representing the
declarator, it makes a copy of the specifier link [with the clone_type() call on line
132] and tacks the copy onto the end of the type chain.
The last kind of declarator in the grammar is an abstract declarator, handled in
Listing 6.46, below. If abstract declarators were used only for declarations, the productions
on lines 371 to 388 could be devoid of actions. You need abstract declarators for the cast
and sizeof operators, however. The actions here work just like all the other
declarator productions; the only difference is that the resulting symbol attribute has a
type but no name. The symbol structure is allocated on line 380; the ε production
takes the place of the identifier in the earlier declarator productions.
Returning to the parse in Table 6.13 on page 523, the parser has just finished with the
ext_decl_list and is about to reduce by {3}→ε. This production is supplied by occs to
process the imbedded action on lines 390 to 401 of Listing 6.47, below. Occs translates:
    ext_def : opt_specifiers ext_decl_list {action...} SEMI
            ;
as follows:
    ext_def : opt_specifiers ext_decl_list {3} SEMI
            ;
    {3}     : /* empty */ {action...}
            ;
so that it can do the imbedded action as part of a reduction.
The action does three things: it merges the specifier and declarator components of
the declaration, puts the new declarations into the symbol table, and generates the actual
declarations in the output. The action must precede the SEMI because of a problem
caused by the way that the parser and lexical analyzer interact with one another. The
lexical analyzer uses the symbol table to distinguish identifiers from the synthetic types
created by a typedef, but symbol-table entries are also put into the symbol table by the
current production. The problem is that the input token is used as a lookahead symbol.
It can be read well in advance of the time when it is shifted, and several reductions can
occur between the read and the subsequent shift. In fact, the lookahead is required to
know which reductions to perform. So the next token is always read immediately after
shifting the previous token. Consider the following code:
    typedef int itype;
    itype x;
If the action that adds the new type to the symbol table followed the SEMI in the
grammar, the following sequence of actions would occur:
  • Shift the SEMI, and read the lookahead symbol. itype has not been added to the
    symbol table yet, so the scanner returns a NAME token.
  • Reduce by ext_def→opt_specifiers ext_decl_list SEMI, adding the itype to the
    symbol table.
The problem is solved by moving the action forward in the production, so that the parser
correctly acts as follows:
Listing 6.45. decl.c: Add a Specifier to a Declaration
102 void add_spec_to_decl( p_spec, decl_chain )
103 link   *p_spec;
104 symbol *decl_chain;
105 {
106     /* p_spec is a pointer either to a specifier/declarator chain created
107      * by a previous typedef or to a single specifier. It is cloned and then
108      * tacked onto the end of every declaration chain in the list pointed to by
109      * decl_chain. Note that the memory used for a single specifier, as compared
110      * to a typedef, may be freed after making this call because a COPY is put
111      * into the symbol's type chain.
112      *
113      * In theory, you could save space by modifying all declarators to point
114      * at a single specifier. This makes deletions much more difficult, because
115      * you can no longer just free every node in the chain as it's used. The
116      * problem is complicated further by typedefs, which may be declared at an
117      * outer level, but can't be deleted when an inner-level symbol is
118      * discarded. It's easiest to just make a copy.
119      *
120      * Typedefs are handled like this: If the incoming storage class is TYPEDEF,
121      * then the typedef appeared in the current declaration and the tdef bit is
122      * set at the head of the cloned type chain and the storage class in the
123      * clone is cleared; otherwise, the clone's tdef bit is cleared (it's just
124      * not copied by clone_type()).
125      */
126
127     link *clone_start, *clone_end;
128     link **p;
129
130     for( ; decl_chain; decl_chain = decl_chain->next )
131     {
132         if( !(clone_start = clone_type(p_spec, &clone_end)) )
133         {
134             yyerror("INTERNAL, add_typedef: Malformed chain (no specifier)\n");
135             exit( 1 );
136         }
137         else
138         {
139             if( !decl_chain->type )                   /* No declarators. */
140                 decl_chain->type = clone_start;
141             else
142                 decl_chain->etype->next = clone_start;
143
144             decl_chain->etype = clone_end;
145
146             if( IS_TYPEDEF(clone_end) )
147             {
148                 set_class_bit( 0, clone_end );
149                 decl_chain->type->tdef = 1;
150             }
151         }
152     }
153 }
Listing 6.46. c.y: Abstract Declarators
371 abstract_decl
372     : type abs_decl               { add_spec_to_decl( $1, $$ = $2 ); }
373     | TTYPE abs_decl              {
374                                     $$ = $2;
375                                     add_spec_to_decl( $1->type, $2 );
376                                   }
377     ;
378
379 abs_decl
380     : /* epsilon */               { $$ = new_symbol( "", 0 ); }
381     | LP abs_decl RP LP RP        { add_declarator( $$ = $2, FUNCTION ); }
382     | STAR abs_decl               { add_declarator( $$ = $2, POINTER ); }
383     | abs_decl LB RB              { add_declarator( $$, POINTER ); }
384     | abs_decl LB const_expr RB   { add_declarator( $$, ARRAY );
385                                     $$->etype->NUM_ELE = $3;
386                                   }
387     | LP abs_decl RP              { $$ = $2; }
388     ;
  • Reduce by {3}→ε, adding the itype to the symbol table.
  • Shift the SEMI, and read the lookahead symbol. itype is in the symbol table this
    time, so the scanner returns a TTYPE token.
  • Reduce by ext_def→opt_specifiers ext_decl_list {3} SEMI.
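The effect of the action's position can be mimicked with a toy model. This is only a sketch of the ordering problem: the typedef table, add_typedef(), scan_identifier(), and the two action_*() functions are hypothetical and stand in for the real scanner and parser.

```c
#include <assert.h>
#include <string.h>

enum token { NAME, TTYPE };

/* Toy symbol table: just a short list of typedef names (an assumption). */
static const char *Typedefs[8];
static int         N_typedefs = 0;

static void add_typedef(const char *name) { Typedefs[N_typedefs++] = name; }

/* What the scanner does when it sees an identifier. */
static enum token scan_identifier(const char *lexeme)
{
    for (int i = 0; i < N_typedefs; ++i)
        if (strcmp(Typedefs[i], lexeme) == 0)
            return TTYPE;
    return NAME;
}

/* Action placed AFTER the SEMI: the lookahead is read when SEMI is
 * shifted, before the reduction enters the typedef in the table. */
static enum token action_after_semi(const char *tdef, const char *lookahead)
{
    enum token t = scan_identifier(lookahead);   /* shift SEMI, read ahead */
    add_typedef(tdef);                           /* reduce: too late       */
    return t;
}

/* Action placed BEFORE the SEMI: the {3} reduction updates the table,
 * then SEMI is shifted and the lookahead is read. */
static enum token action_before_semi(const char *tdef, const char *lookahead)
{
    add_typedef(tdef);                           /* reduce by {3}          */
    return scan_identifier(lookahead);           /* shift SEMI, read ahead */
}
```

Starting from an empty table, action_after_semi("itype", "itype") classifies the second itype as a NAME, while action_before_semi("itype", "itype") classifies it as a TTYPE.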
Listing 6.47. c.y: High-Level, External Definitions (Part One)
389 ext_def : opt_specifiers ext_decl_list
390           {
391               add_spec_to_decl( $1, $2 );
392
393               if( !$1->tdef )
394                   discard_link_chain( $1 );
395
396               add_symbols_to_table( $2 = reverse_links( $2 ) );
397               figure_osclass( $2 );
398               generate_defs_and_free_args( $2 );
399               remove_duplicates( $2 );
400           }
401           SEMI
402
403 /* There are additional right-hand sides listed in subsequent listings.
404  */
The action on lines 390 to 401 of Listing 6.47 needs some discussion. The attribute
associated with the ext_decl_list at $2 is a pointer to a linked list of symbol structures,
one for each variable in the declarator list. The attribute associated with opt_specifiers
at $1 is one of two things: either the specifier component of a declaration, or, if the
declaration used a synthetic type, the complete type chain as was stored in the
symbol-table entry for the typedef. In both cases, the add_spec_to_decl() call on line
391 modifies every type chain in the list of symbols by adding a copy of the type chain
passed in as the first argument to the end of each symbol's type chain. Then, if the
current specifier didn't come from a typedef, the extra copy is discarded on line 394.
The symbols are added to the symbol table on line 396. The figure_osclass() call on
line 397 determines the output storage class of all symbols in the chain,
generate_defs_and_free_args() outputs the actual C-code definitions, and
remove_duplicates() destroys any duplicate declarations in case a declaration and
definition of a global variable are both present. All of these subroutines are in Listings
6.48 and 6.49, below.
Listing 6.48. decl.c: Symbol-Table Manipulation and C-code Declarations
void    add_symbols_to_table( sym )
symbol  *sym;
{
    /* Add declarations to the symbol table.
     *
     * Serious redefinitions (two publics, for example) generate an error
     * message. Harmless redefinitions are processed silently. Bad code is
     * generated when an error message is printed. The symbol table is modified
     * in the case of a harmless duplicate to reflect the higher precedence
     * storage class: (public == private) > common > extern.
     *
     * The sym->rname field is modified as if this were a global variable (an
     * underscore is inserted in front of the name). You should add the symbol
     * chains to the table before modifying this field to hold stack offsets
     * in the case of local variables.
     */

    symbol *exists;         /* Existing symbol if there's a conflict. */
    int    harmless;
    symbol *new;

    for( new = sym; new ; new = new->next )
    {
        exists = (symbol *) findsym( Symbol_tab, new->name );

        if( !exists || exists->level != new->level )
        {
            sprintf( new->rname, "_%1.*s", sizeof(new->rname)-2, new->name );
            addsym ( Symbol_tab, new );
        }
        else
        {
            harmless       = 0;
            new->duplicate = 1;

            if( the_same_type( exists->type, new->type, 0 ) )
            {
                if( exists->etype->OCLASS==EXT || exists->etype->OCLASS==COM )
                {
                    harmless = 1;

                    if( new->etype->OCLASS != EXT )
                    {
                        exists->etype->OCLASS = new->etype->OCLASS;
                        exists->etype->SCLASS = new->etype->SCLASS;
                        exists->etype->EXTERN = new->etype->EXTERN;
                        exists->etype->STATIC = new->etype->STATIC;
                    }
                }
            }
            if( !harmless )
                yyerror( "Duplicate declaration of %s\n", new->name );
        }
    }
}

/*----------------------------------------------------------------------*/

void    figure_osclass( sym )
symbol  *sym;
{
    /* Go through the list figuring the output storage class of all variables.
     * Note that if something is a variable, then the args, if any, are a list
     * of initializers. I'm assuming that the sym has been initialized to zeros;
     * at least the OSCLASS field remains unchanged for nonautomatic local
     * variables, and a value of zero there indicates a nonexistent output class.
     */

    for( ; sym ; sym = sym->next )
    {
        if( sym->level == 0 )
        {
            if( IS_FUNCT( sym->type ) )
            {
                if     ( sym->etype->EXTERN )  sym->etype->OCLASS = EXT;
                else if( sym->etype->STATIC )  sym->etype->OCLASS = PRI;
                else                           sym->etype->OCLASS = PUB;
            }
            else
            {
                if     ( sym->etype->STATIC )  sym->etype->OCLASS = PRI;
                else if( sym->args          )  sym->etype->OCLASS = PUB;
                else                           sym->etype->OCLASS = COM;
            }
        }
        else if( sym->type->SCLASS == FIXED )
        {
            if     (  IS_FUNCT( sym->type ) )  sym->etype->OCLASS = EXT;
            else if( !IS_LABEL( sym->type ) )  sym->etype->OCLASS = PRI;
        }
    }
}

/*----------------------------------------------------------------------*/

void    generate_defs_and_free_args( sym )
symbol  *sym;
{
    /* Generate global-variable definitions, including any necessary
     * initializers. Free the memory used for the initializer (if a variable)
     * or argument list (if a function).
     */

    for( ; sym ; sym = sym->next )
    {
        if( IS_FUNCT( sym->type ) )
        {
            /* Print a definition for the function and discard arguments
             * (you'd keep them if prototypes were supported).
             */

            yydata( "external\t%s();\n", sym->rname );
            discard_symbol_chain( sym->args );
            sym->args = NULL;
        }
        else if( IS_CONSTANT(sym->etype) || sym->type->tdef )
        {
            continue;
        }
        else if( !sym->args )           /* It's an uninitialized variable. */
        {
            print_bss_dcl( sym );       /* Print the declaration.          */
        }
        else                            /* Deal with an initializer.       */
        {
            var_dcl( yydata, sym->etype->OCLASS, sym, "=" );

            if( IS_AGGREGATE( sym->type ) )
                yyerror( "Initialization of aggregate types not supported\n" );

            else if( !IS_CONSTANT( ((value *)sym->args)->etype ) )
                yyerror( "Initializer must be a constant expression\n" );

            else if( !the_same_type( sym->type, ((value *)sym->args)->type, 0 ) )
                yyerror( "Initializer: type mismatch\n" );
            else
                yydata( "%s;\n", CONST_STR( (value *)sym->args ) );

            discard_value( (value *)(sym->args) );
            sym->args = NULL;
        }
    }
}

/*----------------------------------------------------------------------*/

symbol  *remove_duplicates( sym )
symbol  *sym;
{
    /* Remove all nodes marked as duplicates from the linked list and free the
     * memory. These nodes should not be in the symbol table. Return the new
     * head-of-list pointer (the first symbol may have been deleted).
     */

    symbol *prev  = NULL;
    symbol *first = sym;

    while( sym )
    {
        if( !sym->duplicate )       /* Not a duplicate, go to the     */
        {                           /* next list element.             */
            prev = sym;
            sym  = sym->next;
        }
        else if( prev == NULL )     /* Node is at start of the list.  */
        {
            first = sym->next;
            discard_symbol( sym );
            sym = first;
        }
        else                        /* Node is in middle of the list. */
        {
            prev->next = sym->next;
            discard_symbol( sym );
            sym = prev->next;
        }
    }
    return first;
}
Listing 6.49. decl.c: Generate C-Code Definitions
void    print_bss_dcl( sym )    /* Print a declaration to the bss segment. */
symbol  *sym;
{
    if( sym->etype->SCLASS != FIXED )
        yyerror( "Illegal storage class for %s", sym->name );
    else
    {
        if( sym->etype->STATIC && sym->etype->EXTERN )
            yyerror( "%s: Bad storage class\n", sym->name );

        var_dcl( yybss, sym->etype->OCLASS, sym, ";\n" );
    }
}

/*----------------------------------------------------------------------*/

PUBLIC void var_dcl( ofunct, c_code_sclass, sym, terminator )

void    (*ofunct)();        /* Pointer to output function (yybss or yydata). */
int     c_code_sclass;      /* C-code storage class of symbol.               */
symbol  *sym;               /* Symbol itself.                                */
char    *terminator;        /* Print this string at end of the declaration.  */
{
    /* Kick out a variable declaration for the current symbol. */

    char suffix[32];
    char *type          = "";
    int  size           = 1;
    link *p             = sym->type;
    char *storage_class = (c_code_sclass == PUB) ? "public"   :
                          (c_code_sclass == PRI) ? "private"  :
                          (c_code_sclass == EXT) ? "external" : "common" ;
    *suffix = '\0';

    if( IS_FUNCT(p) )
    {
        yyerror( "INTERNAL, var_dcl: object not a variable\n" );
        exit( 1 );
    }
    else if( IS_ARRAY(p) )
    {
        for( ; IS_ARRAY(p) ; p = p->next )
            size *= p->NUM_ELE;

        sprintf( suffix, "[%d]", size );
    }

    if( IS_STRUCT(p) )
    {
        (*ofunct)( "\nALIGN(lword)\n" );
        sprintf( suffix, "[%d]", size * p->V_STRUCT->size );
    }

    if( IS_POINTER(p) )
        type = PTYPE;
    else                                    /* Must be a specifier. */
        switch( p->NOUN )
        {
        case CHAR:      type = CTYPE;                     break;
        case INT:       type = p->LONG ? LTYPE : ITYPE;   break;
        case STRUCTURE: type = STYPE;                     break;
        }

    (*ofunct)( "%s\t%s\t%s%s%s", storage_class, type,
                                 sym->rname, suffix, terminator );
}
6.6.2 Structure and Union Declarations
Now, let's back up for a moment and look at more-complex types: structures and
unions. These were handled at a high level on line 231 of Listing 6.39 on page 525,
where a struct_specifier is recognized in place of a TYPE token. For example,
everything from the struct up to the name is a struct_specifier in the following declaration:
    struct tag
    {
        int        tinker;
        long       tailor;
        char       soldier;
        struct tag *spy;
    }
    name;
A parse for this declaration is shown in Table 6.14. There are several points of interest
here. First, the STRUCT token is returned by the scanner for both the struct and
union lexemes. The associated <ascii> attribute is the first character of the lexeme.
In practice, the only difference between a structure and a union is the way that the offsets
to each field are computed: in a union the offsets are all zero. Since the two data types
are syntactically identical, you need only one token with an attribute that allows the
code-generation action to distinguish the two cases.
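The difference in offset computation can be sketched as follows. This is an illustration under simplifying assumptions, not the compiler's code: assign_offsets() is a hypothetical helper, field sizes are passed in directly, and alignment padding is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Assign an offset to each of n fields, given the field sizes. kind is the
 * first character of the lexeme ('s' for struct, 'u' for union), in the
 * spirit of the <ascii> attribute. Alignment padding is ignored here.
 * Returns the total size of the aggregate. */
static size_t assign_offsets(char kind, const size_t *sizes,
                             size_t *offsets, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; ++i) {
        if (kind == 'u') {
            offsets[i] = 0;                 /* every union field starts at 0 */
            if (sizes[i] > total)
                total = sizes[i];           /* union is as big as its widest */
        } else {
            offsets[i] = total;             /* struct fields are laid end to end */
            total += sizes[i];
        }
    }
    return total;
}
```

With field sizes 2, 4, and 1, the structure case yields offsets 0, 2, and 6 and a total of 7, while the union case yields offsets that are all zero and a total of 4.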
The structure tag is handled in the parse on lines 4 to 6 of Table 6.14. The associated
actions are on lines 426 to 444 of Listing 6.50, below. There are two actions of interest.
opt_tag→ε on line 427 takes care of those situations where an explicit tag is not present.
An arbitrary name is generated on line 430, and a new struct_def is allocated on the
next line. Then the definition (with no fields as of yet) is added to the structure table on
line 433. A pointer to the new structure-table element is passed up as an attribute.
Table 6.14. Parsing a Structure Definition

No. | Stack | Next Action
1 | (empty) | Reduce: ext_def_list→ε
      DATA: #include <tools/virtual.h>
      DATA: #define T(x)
      DATA: SEG(data)
      CODE: SEG(code)
      BSS:  SEG(bss)
2 | ext_def_list | Shift: STRUCT
3 | ext_def_list STRUCT | Shift: NAME
4 | ext_def_list STRUCT NAME | Reduce: tag→NAME
5 | ext_def_list STRUCT tag | Reduce: opt_tag→tag
6 | ext_def_list STRUCT opt_tag | Shift: LC
7 | ext_def_list STRUCT opt_tag LC | Reduce: def_list→ε
8 | ext_def_list STRUCT opt_tag LC def_list | Shift: TYPE
9 | ext_def_list STRUCT opt_tag LC def_list TYPE | Reduce: type_specifier→TYPE
10 | ext_def_list STRUCT opt_tag LC def_list type_specifier | Reduce: type_or_class→type_specifier
11 | ext_def_list STRUCT opt_tag LC def_list type_or_class | Reduce: specifiers→type_or_class
12 | ext_def_list STRUCT opt_tag LC def_list specifiers | Shift: NAME
13 | ext_def_list STRUCT opt_tag LC def_list specifiers NAME | Reduce: new_name→NAME
14 | ext_def_list STRUCT opt_tag LC def_list specifiers new_name | Reduce: var_decl→new_name
15 | ext_def_list STRUCT opt_tag LC def_list specifiers var_decl | Reduce: decl→var_decl
16 | ext_def_list STRUCT opt_tag LC def_list specifiers decl | Reduce: decl_list→decl
17 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list | Reduce: {65}→ε
18 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} | Shift: SEMI
19 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} SEMI | Reduce: def→specifiers decl_list {65} SEMI
20 | ext_def_list STRUCT opt_tag LC def_list def | Reduce: def_list→def_list def
21 | ext_def_list STRUCT opt_tag LC def_list | Shift: TYPE
22 | ext_def_list STRUCT opt_tag LC def_list TYPE | Reduce: type_specifier→TYPE
23 | ext_def_list STRUCT opt_tag LC def_list type_specifier | Reduce: type_or_class→type_specifier
24 | ext_def_list STRUCT opt_tag LC def_list type_or_class | Reduce: specifiers→type_or_class
25 | ext_def_list STRUCT opt_tag LC def_list specifiers | Shift: NAME
26 | ext_def_list STRUCT opt_tag LC def_list specifiers NAME | Reduce: new_name→NAME
27 | ext_def_list STRUCT opt_tag LC def_list specifiers new_name | Reduce: var_decl→new_name
28 | ext_def_list STRUCT opt_tag LC def_list specifiers var_decl | Reduce: decl→var_decl
29 | ext_def_list STRUCT opt_tag LC def_list specifiers decl | Reduce: decl_list→decl
30 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list | Reduce: {65}→ε
31 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} | Shift: SEMI
32 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} SEMI | Reduce: def→specifiers decl_list {65} SEMI
33 | ext_def_list STRUCT opt_tag LC def_list def | Reduce: def_list→def_list def
34 | ext_def_list STRUCT opt_tag LC def_list | Shift: TYPE
35 | ext_def_list STRUCT opt_tag LC def_list TYPE | Reduce: type_specifier→TYPE
36 | ext_def_list STRUCT opt_tag LC def_list type_specifier | Reduce: type_or_class→type_specifier
37 | ext_def_list STRUCT opt_tag LC def_list type_or_class | Reduce: specifiers→type_or_class
38 | ext_def_list STRUCT opt_tag LC def_list specifiers | Shift: NAME
39 | ext_def_list STRUCT opt_tag LC def_list specifiers NAME | Reduce: new_name→NAME
40 | ext_def_list STRUCT opt_tag LC def_list specifiers new_name | Reduce: var_decl→new_name
41 | ext_def_list STRUCT opt_tag LC def_list specifiers var_decl | Reduce: decl→var_decl
42 | ext_def_list STRUCT opt_tag LC def_list specifiers decl | Reduce: decl_list→decl
43 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list | Reduce: {65}→ε
44 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} | Shift: SEMI
continued...
The action on lines 437 to 444 is used when a tag is present in the definition, as is the
case in the current example. The incoming attribute for the NAME token is useless here
because this attribute is generated by a symbol-table lookup, not a structure-table
lookup. The action code looks up the name in the structure table and returns a pointer to
the entry if it's there. Otherwise, a new structure-table element is created and added to
the table, as before. This time, however, the real tag is used rather than an arbitrary name.

Table 6.14. Continued. Parsing a Structure Definition

No. | Stack | Next Action
45 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} SEMI | Reduce: def→specifiers decl_list {65} SEMI
46 | ext_def_list STRUCT opt_tag LC def_list def | Reduce: def_list→def_list def
47 | ext_def_list STRUCT opt_tag LC def_list | Shift: STRUCT
48 | ext_def_list STRUCT opt_tag LC def_list STRUCT | Shift: NAME
49 | ext_def_list STRUCT opt_tag LC def_list STRUCT NAME | Reduce: tag→NAME
50 | ext_def_list STRUCT opt_tag LC def_list STRUCT tag | Reduce: struct_specifier→STRUCT tag
51 | ext_def_list STRUCT opt_tag LC def_list struct_specifier | Reduce: type_specifier→struct_specifier
52 | ext_def_list STRUCT opt_tag LC def_list type_specifier | Reduce: type_or_class→type_specifier
53 | ext_def_list STRUCT opt_tag LC def_list type_or_class | Reduce: specifiers→type_or_class
54 | ext_def_list STRUCT opt_tag LC def_list specifiers | Shift: STAR
55 | ext_def_list STRUCT opt_tag LC def_list specifiers STAR | Shift: NAME
56 | ext_def_list STRUCT opt_tag LC def_list specifiers STAR NAME | Reduce: new_name→NAME
57 | ext_def_list STRUCT opt_tag LC def_list specifiers STAR new_name | Reduce: var_decl→new_name
58 | ext_def_list STRUCT opt_tag LC def_list specifiers STAR var_decl | Reduce: var_decl→STAR var_decl
59 | ext_def_list STRUCT opt_tag LC def_list specifiers var_decl | Reduce: decl→var_decl
60 | ext_def_list STRUCT opt_tag LC def_list specifiers decl | Reduce: decl_list→decl
61 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list | Reduce: {65}→ε
62 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} | Shift: SEMI
63 | ext_def_list STRUCT opt_tag LC def_list specifiers decl_list {65} SEMI | Reduce: def→specifiers decl_list {65} SEMI
64 | ext_def_list STRUCT opt_tag LC def_list def | Reduce: def_list→def_list def
65 | ext_def_list STRUCT opt_tag LC def_list | Shift: RC
66 | ext_def_list STRUCT opt_tag LC def_list RC | Reduce: struct_specifier→STRUCT opt_tag LC def_list RC
67 | ext_def_list struct_specifier | Reduce: type_specifier→struct_specifier
68 | ext_def_list type_specifier | Reduce: type_or_class→type_specifier
69 | ext_def_list type_or_class | Reduce: specifiers→type_or_class
70 | ext_def_list specifiers | Reduce: opt_specifiers→specifiers
71 | ext_def_list opt_specifiers | Shift: NAME
72 | ext_def_list opt_specifiers NAME | Reduce: new_name→NAME
73 | ext_def_list opt_specifiers new_name | Reduce: var_decl→new_name
74 | ext_def_list opt_specifiers var_decl | Reduce: ext_decl→var_decl
75 | ext_def_list opt_specifiers ext_decl | Reduce: ext_decl_list→ext_decl
76 | ext_def_list opt_specifiers ext_decl_list | Reduce: {50}→ε
      BSS:  ALIGN(lword)
      BSS:  common record _name[16];
77 | ext_def_list opt_specifiers ext_decl_list {50} | Shift: SEMI
78 | ext_def_list opt_specifiers ext_decl_list {50} SEMI | Reduce: ext_def→opt_specifiers ext_decl_list {50} SEMI
79 | ext_def_list ext_def | Reduce: ext_def_list→ext_def_list ext_def
80 | ext_def_list | Reduce: program→ext_def_list (Accept)
Listing 6.50. c.y Structures
405  struct_specifier
406      : STRUCT opt_tag LC def_list RC
407        {
408            if( !$2->fields )
409            {
410                $2->fields = reverse_links( $4 );
411
412                if( !illegal_struct_def( $2, $4 ) )
413                    $2->size = figure_struct_offsets( $2->fields, $1=='s' );
414            }
415            else
416            {
417                yyerror( "Ignoring redefinition of %s", $2->tag );
418                discard_symbol_chain( $4 );
419            }
420
421            $$ = $2;
422        }
423      | STRUCT tag            { $$ = $2; }
424      ;
425
426  opt_tag : tag
427          | /* empty */       {
428                                  static unsigned label = 0;
429                                  static char     tag[16];
430                                  sprintf( tag, "%03d", label++ );
431
432                                  $$ = new_structdef( tag );
433                                  addsym( Struct_tab, $$ );
434                              }
435          ;
436
437  tag     : NAME              {
438                                  if( !($$=(structdef *) findsym(Struct_tab, yytext)) )
439                                  {
440                                      $$ = new_structdef( yytext );
441                                      $$->level = Nest_lev;
442                                      addsym( Struct_tab, $$ );
443                                  }
444                              }
445          ;

The parser now moves on to process the field definitions, using the productions in
Listing 6.51, below. These productions are effectively the same as the ext_def_list
productions that handle global-variable definitions. In fact, most of the actions are
identical. You need two sets of productions because function definitions (with bodies) are
permitted only at the global level. As with an ext_def, the attribute associated with a def
is a pointer to a cross-linked chain of symbols, one for each comma-separated declarator
found in a definition, or NULL if no declarators were found. The def_list productions
process a list of semicolon-terminated definitions, linking the cross-linked symbols
from each individual definition together into a larger list. The loop on line 450 finds the
end of the chain for the new definition, and the list accumulated so far is attached there,
so the most recent definition is at the head of the combined list. After the entire list is
processed (on line 65 of the parse), the attribute associated with the def_list is a linked
list of symbols, one for each field in the structure.
Listing 6.51. c.y Local Variables and Function Arguments
446  def_list
447      : def_list def          { symbol *p;
448                                if( p = $2 )
449                                {
450                                    for(; p->next; p = p->next )
451                                        ;
452                                    p->next = $1;
453                                    $$      = $2;
454                                }
455                              }
456      | /* epsilon */         { $$ = NULL; }    /* Initialize end-of-list */
457      ;                                         /* pointer.               */
458
459  def
460      : specifiers decl_list  { add_spec_to_decl( $1, $2 ); }
461        SEMI                  { $$ = $2; }
462
463      | specifiers SEMI       { $$ = NULL; }
464      ;
465
466  decl_list
467      : decl                  { $$->next = NULL; }
468      | decl_list COMMA decl
469        {
470            $3->next = $1;
471            $$       = $3;
472        }
473      ;
474
475  decl
476      : funct_decl
477      | var_decl
478      | var_decl EQUAL initializer { yyerror( "Ignoring initializer.\n" );
479                                     discard_value( $3 );
480                                   }
481      | var_decl COLON const_expr  %prec COMMA
482      |          COLON const_expr  %prec COMMA
483      ;

The parser now reduces by

    struct_specifier → STRUCT opt_tag LC def_list RC

The action is on lines 407 to 421 of Listing 6.50, and the subroutines that are used here
are in Listing 6.52. The illegal_struct_def() call checks the field definitions to
make sure that there's no recursion and that none of the fields are function definitions (as
compared to function pointers, which are legal). The figure_struct_offsets() subroutine
(called on line 413 of Listing 6.50 and defined on line 427 of Listing 6.52) figures the
offsets from the base address of the structure to the individual fields. The basic algorithm
just traverses the linked list of symbols, adding the cumulative size of the preceding
fields to the current offset. A minor problem is caused by alignment restrictions: padding
may have to be added in the middle of the structure in order to get interior fields aligned
properly. The structure is declared as a character array on line 76 of the parse (record is
an alias for byte), and the individual fields are not declared explicitly; they are extracted
from the proper location within the array when expressions are processed. As a consequence,
the compiler has to worry about supplying padding that would normally be supplied by the
assembler.

This alignment problem actually arises in the current example. The first field is a
two-byte int, but the second field requires four-byte alignment, so two bytes of padding
have to be added to get the second field aligned properly. The situation is simplified,
somewhat, by assuming that the first field is always aligned on a worst-case boundary.
An ALIGN(lword) directive is generated by the compiler just above the actual variable
definition for this purpose. Finally, note that the structure size is rounded up to be an
even multiple of the worst-case alignment restriction on lines 464 and 465 of Listing
6.52 so that arrays of structures work correctly.

The structure definition is now reduced to an opt_specifiers, and the parse continues
just like a simple variable definition. One more right-hand side to ext_def is needed for
structures. It is shown in Listing 6.53, and handles structure, union, and enumerated-type
declarations that don't allocate space for a variable (such as a struct definition
with a tag but no variable name). Note that the discard_link_chain() call on line
490 does not delete anything from the structure table.
Listing 6.52. decl.c Structure-Processing Subroutines
401  int illegal_struct_def( cur_struct, fields )
402  structdef *cur_struct;
403  symbol    *fields;
404  {
405      /* Return true if any of the fields are defined recursively or if a function
406       * definition (as compared to a function pointer) is found as a field.
407       */
408
409      for(; fields; fields = fields->next )
410      {
411          if( IS_FUNCT(fields->type) )
412          {
413              yyerror("struct/union member may not be a function");
414              return 1;
415          }
416          if( IS_STRUCT(fields->type) &&
417              !strcmp( fields->type->V_STRUCT->tag, cur_struct->tag) )
418          {
419              yyerror("Recursive struct/union definition\n");
420              return 1;
421          }
422      }
423      return 0;
424  }
425  /*----------------------------------------------------------------------*/
426
427  int figure_struct_offsets( p, is_struct )
428  symbol  *p;             /* Chain of symbols for fields. */
429  int     is_struct;      /* 0 if a union.                */
430  {
431      /* Figure the field offsets and return the total structure size. Assume
432       * that the first element of the structure is aligned on a worst-case
433       * boundary. The returned size is always an even multiple of the worst-case
434       * alignment. The offset to each field is put into the "level" field of the
435       * associated symbol.
436       */
437
438      int align_size, obj_size;
439      int offset = 0;
440
441      for(; p ; p = p->next )
442      {
443          if( !is_struct )                    /* It's a union. */
444          {
445              offset   = max( offset, get_sizeof( p->type ) );
446              p->level = 0;
447          }
448          else
449          {
450              obj_size   = get_sizeof   ( p->type );
451              align_size = get_alignment( p->type );
452
453              while( offset % align_size )
454                  ++offset;
455
456              p->level = offset;
457              offset  += obj_size;
458          }
459      }
460      /* Return the structure size: the current offset rounded up to the     */
461      /* worst-case alignment boundary. You need to waste space here in case */
462      /* this is an array of structures.                                     */
463
464      while( offset % ALIGN_WORST )
465          ++offset;
466      return offset;
467  }
468  /*----------------------------------------------------------------------*/
469
470  int get_alignment( p )
471  link *p;
472  {
473      /* Returns the alignment--the number by which the base address of the object
474       * must be an even multiple. This number is the same one that is returned by
475       * get_sizeof(), except for structures which are worst-case aligned, and
476       * arrays, which are aligned according to the type of the first element.
477       */
478
479      int size;
480
481      if( !p )
482      {
483          yyerror("INTERNAL, get_alignment: NULL pointer\n");
484          exit( 1 );
485      }
486      if( IS_ARRAY( p ) )           return get_alignment( p->next );
487      if( IS_STRUCT( p ) )          return ALIGN_WORST;
488      if( size = get_sizeof( p ) )  return size;
489
490      yyerror("INTERNAL, get_alignment: Object aligned on zero boundary\n");
491      exit( 1 );
492  }
Listing 6.53. c.y High-Level, External Definitions (Part Two)
484  /* ext_def :                                                          */
485      | opt_specifiers
486        {
487            if( !($1->class == SPECIFIER && $1->NOUN == STRUCTURE) )
488                yyerror( "Useless definition (no identifier)\n" );
489            if( !$1->tdef )
490                discard_link_chain( $1 );
491        }
492        SEMI
6.6.3 Enumerated-Type Declarations
The final definition recognized by the grammar is an enumerated type, handled at a
high level on line 230 of Listing 6.39 (on page 525) where an enum_specifier is
recognized in place of a TYPE token. An enumerated-type definition like this:

    enum tag { rich_man, poor_man, beggar_man = 5, thief } x;

is treated as if the following had been used:

    int x;
    #define rich_man     0
    #define poor_man     (rich_man + 1)
    #define beggar_man   5
    #define thief        (beggar_man + 1)

but the compiler recognizes the elements of the enumerated type directly rather than
using a macro preprocessor. The high-level action in Listing 6.39 on page 525 just
creates a specifier for an int, ignoring the tag component of the enumerated type.

The real work happens in the productions in Listing 6.54, which create symbol-table
entries for the symbolic constants (rich_man, and so forth). Internally, the compiler
doesn't distinguish between an element of an enumerated type and any other integer
constant. When an enumerated-type element is referenced, a symbol-table lookup is
necessary to get the value; but thereafter, the value is handled just like any other integer
constant. The Enum_val global variable keeps track of the current constant value. It is
initialized to zero on line 507 of Listing 6.54, when the enum keyword is recognized. The
enumerated-type elements are processed on lines 516 to 518. Enum_val is modified as
necessary if an explicit value is given in the definition. In any event, do_enum(), in
Listing 6.55, is called to create the symbol-table element, which is then added to the
table.
Listing 6.54. c.y Enumerated Types
122  %{
123  int Enum_val;               /* Current enumeration constant value */
124  %}

493  enum_specifier
494      : enum name opt_enum_list   { if( $2->type )
495                                        yyerror( "%s: redefinition", $2->name );
496                                    else
497                                        discard_symbol( $2 );
498                                  }
499      | enum LC enumerator_list RC
500      ;
501
502  opt_enum_list
503      : LC enumerator_list RC
504      | /* empty */
505      ;
506
507  enum    : ENUM                  { Enum_val = 0; }
508      ;
509
510  enumerator_list
511      : enumerator
512      | enumerator_list COMMA enumerator
513      ;
514
515  enumerator
516      : name                      { do_enum( $1, Enum_val++ ); }
517      | name EQUAL const_expr     { Enum_val = $3;
518                                    do_enum( $1, Enum_val++ ); }
519      ;
Listing 6.55. decl.c Enumerated-Type Subroutines
493  PUBLIC void do_enum( sym, val )
494  symbol  *sym;
495  int     val;
496  {
497      if( conv_sym_to_int_const( sym, val ) )
498          addsym( Symbol_tab, sym );
499      else
500      {
501          yyerror( "%s: redefinition", sym->name );
503      }
504  }
505  /*----------------------------------------------------------------------*/
506  PUBLIC int conv_sym_to_int_const( sym, val )
507  symbol  *sym;
508  int     val;
509  {
510      /* Turn an empty symbol into an integer constant by adding a type chain
511       * and initializing the v_int field to val. Any existing type chain is
512       * destroyed. If a type chain is already in place, return 0 and do
513       * nothing, otherwise return 1. This function processes enum's.
514       */
515      link *lp;
516
517      if( sym->type )
518          return 0;
519      lp          = new_link();
520      lp->class   = SPECIFIER;
521      lp->NOUN    = INT;
522      lp->SCLASS  = CONSTANT;
523      lp->V_INT   = val;
524      sym->type   = lp;
525      *sym->rname = '\0';
526      return 1;
527  }
6.6.4 Function Declarations

The next sort of declaration is a function declaration, handled, at the high level, by
the remaining right-hand side to ext_def:

    ext_def → opt_specifiers funct_decl {...} def_list {...} compound_stmt

We'll look at the actual production in a moment. First, look at the parse of the following
code in Table 6.15:

    pooh( piglet, eeyore )
    long eeyore;
    {
    }

This time I've shown the output as well as the parser actions; the three output streams are
indicated by CODE:, DATA:, and BSS: in the table. The final compiler output (after
the three streams have been merged) is in Listing 6.56.

Table 6.15. A Parse of pooh(piglet, eeyore) long eeyore; {}

No. | Stack | Next Action
1 | (empty) | Reduce: ext_def_list→ε
      DATA: #include <tools/virtual.h>
      DATA: #define T(x)
      DATA: SEG(data)
      CODE: SEG(code)
      BSS:  SEG(bss)
2 | ext_def_list | Reduce: opt_specifiers→ε
3 | ext_def_list opt_specifiers | Shift: NAME
4 | ext_def_list opt_specifiers NAME | Reduce: new_name→NAME
5 | ext_def_list opt_specifiers new_name | Shift: LP
6 | ext_def_list opt_specifiers new_name LP | Reduce: {28}→ε
7 | ext_def_list opt_specifiers new_name LP {28} | Shift: NAME
8 | ext_def_list opt_specifiers new_name LP {28} NAME | Reduce: new_name→NAME
9 | ext_def_list opt_specifiers new_name LP {28} new_name | Reduce: name_list→new_name
10 | ext_def_list opt_specifiers new_name LP {28} name_list | Shift: COMMA
11 | ext_def_list opt_specifiers new_name LP {28} name_list COMMA | Shift: NAME
12 | ext_def_list opt_specifiers new_name LP {28} name_list COMMA NAME | Reduce: new_name→NAME
13 | ext_def_list opt_specifiers new_name LP {28} name_list COMMA new_name | Reduce: name_list→name_list COMMA new_name
14 | ext_def_list opt_specifiers new_name LP {28} name_list | Reduce: {29}→ε
15 | ext_def_list opt_specifiers new_name LP {28} name_list {29} | Shift: RP
16 | ext_def_list opt_specifiers new_name LP {28} name_list {29} RP | Reduce: funct_decl→new_name LP {28} name_list {29} RP
17 | ext_def_list opt_specifiers funct_decl | Reduce: {51}→ε
      CODE: #undef T
      CODE: #define T(n) (fp-L0-(n*4))
      CODE: PROC(_pooh,public)
      CODE: link(L0+L1);
18 | ext_def_list opt_specifiers funct_decl {51} | Reduce: def_list→ε
19 | ext_def_list opt_specifiers funct_decl {51} def_list | Shift: TYPE
20 | ext_def_list opt_specifiers funct_decl {51} def_list TYPE | Reduce: type_specifier→TYPE
21 | ext_def_list opt_specifiers funct_decl {51} def_list type_specifier | Reduce: type_or_class→type_specifier
22 | ext_def_list opt_specifiers funct_decl {51} def_list type_or_class | Reduce: specifiers→type_or_class
23 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers | Shift: NAME
24 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers NAME | Reduce: new_name→NAME
25 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers new_name | Reduce: var_decl→new_name
26 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers var_decl | Reduce: decl→var_decl
27 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers decl | Reduce: decl_list→decl
28 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers decl_list | Reduce: {65}→ε
29 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers decl_list {65} | Shift: SEMI
30 | ext_def_list opt_specifiers funct_decl {51} def_list specifiers decl_list {65} SEMI | Reduce: def→specifiers decl_list {65} SEMI
31 | ext_def_list opt_specifiers funct_decl {51} def_list def | Reduce: def_list→def_list def
32 | ext_def_list opt_specifiers funct_decl {51} def_list | Reduce: {52}→ε
      CODE: /* fp+4 = piglet [arg] */
      CODE: /* fp+8 = eeyore [arg] */
33 | ext_def_list opt_specifiers funct_decl {51} def_list {52} | Shift: LC
34 | ext_def_list opt_specifiers funct_decl {51} def_list {52} LC | Reduce: {71}→ε
35 | ext_def_list opt_specifiers funct_decl {51} def_list {52} LC {71} | Reduce: def_list→ε
36 | ext_def_list opt_specifiers funct_decl {51} def_list {52} LC {71} def_list | Reduce: local_defs→def_list
37 | ext_def_list opt_specifiers funct_decl {51} def_list {52} LC {71} local_defs | Reduce: stmt_list→ε
38 | ext_def_list opt_specifiers funct_decl {51} def_list {52} LC {71} local_defs stmt_list | Shift: RC
39 | ext_def_list opt_specifiers funct_decl {51} def_list {52} LC {71} local_defs stmt_list RC | Reduce: compound_stmt→LC {71} local_defs stmt_list RC
40 | ext_def_list opt_specifiers funct_decl {51} def_list {52} compound_stmt | Reduce: ext_def→opt_specifiers funct_decl {51} def_list {52} compound_stmt
      CODE: unlink();
      CODE: ret();
      CODE: ENDP(_pooh)
      BSS:  #define L0 0 /* pooh loc. */
      BSS:  #define L1 0 /* pooh tmp. */
41 | ext_def_list ext_def | Reduce: ext_def_list→ext_def_list ext_def
42 | ext_def_list | Reduce: program→ext_def_list (Accept)

Listing 6.56. Compiler Output for Function Definition
 1  #include <tools/virtual.h>
 2  #define T(x)
 3  SEG(data)
 4  SEG(bss)
 5  #define L0 0    /* pooh: locals */
 6  #define L1 0    /* pooh: temps. */
 7
 8  SEG(code)
 9  #undef T
10  #define T(n) (fp-L0-(n*4))
11  PROC(_pooh,public)
12      link(L0+L1);
13      /* fp+4 = piglet [argument] */
14      /* fp+8 = eeyore [argument] */
15
16      /* Code from the body of the subroutine goes here. */
17
18      unlink();
19      ret();
20  ENDP(_pooh)

Note that the time at which the parser outputs something is not directly related to the
position of that something in the output file. For example, the #defines for L0 and L1
at the top of the output file are not emitted by the parser until after the entire subroutine
has been processed. This shuffling is accomplished by using two different output
streams. The definitions are written to the bss stream; code is written to the code stream.
The streams are merged in the following order: data, bss, and then code. So, all output
to the bss stream appears above all code-stream output in the final program.

The code on lines nine to 14 of Listing 6.56 is generated at the top of every subroutine
(with the obvious customizations). This block of instructions is called the subroutine
prefix. The macro definitions on lines five and six of Listing 6.56 are also part of the
prefix, even though these definitions end up at the top of the output file rather than
immediately above the subroutine definition, because they are generated along with the
rest of the prefix code. The code on lines 18 to 20 of Listing 6.56 is generated at the
bottom of every subroutine and is called the subroutine suffix.

The L0 and L1 macros on lines five and six of Listing 6.56 are used as arguments in
the link instruction at the top of the subroutine. The numeric component of the labels
is unique; the next subroutine in the input file uses L2 and L3. In the current example,
L0 holds the size, in bytes, of the local-variable region of the subroutine's stack frame;
L1 holds the size of the temporary-variable region. Unfortunately, for reasons that I'll
discuss in a moment, neither size is known until after the entire subroutine has been
processed. The compiler solves the problem by generating the L0 and L1 labels at the same
time that it outputs the link instruction. The label names are stored internally in the
Vspace and Tspace arrays in Listing 6.57. The compiler puts the function name into
Funct_name at the same time. Later on in the parse, after the subroutine has been
processed and the sizes of the two regions are known, the compiler emits the #defines to
the bss stream, using the previously generated labels.[17]

17. This technique is necessary only if you are generating code for a single-pass assembler. Two-pass
assemblers let you define a label after it's used; the definition is picked up in the first pass and
substitutions are made in the second. The two-pass VAX/UNIX assembler just generates the labels at the end
of the subroutine code.

Listing 6.57. c.y Global Variables for Function-Declaration Processing (from Occs Definitions Section)
125  %{
126  char Vspace[16];
127  char Tspace[16];        /* The compiler doesn't know the stack-frame size
128                           * when it creates a link() directive, so it outputs
129                           * a link(VSPACE+TSPACE). Later on, it #defines VSPACE
130                           * to the size of the local-variable space and TSPACE
131                           * to the size of the temporary-variable space. Vspace
132                           * holds the actual name of the VSPACE macro, and
133                           * Tspace the TSPACE macro. (There's a different name
134                           * for each subroutine.)
135                           */
136
137  char Funct_name[ NAME_MAX+1 ];     /* Name of the current function */
138  %}

The code to do all of the foregoing is in the first action of the following production:

    ext_def → opt_specifiers funct_decl {...} def_list {...} compound_stmt {...};

found on lines 522 to 546 of Listing 6.58. The attribute attached to the funct_decl at $2
is a pointer to a symbol, and the args field of that structure is itself a pointer to a linked
list of additional symbol structures, one for each argument. The arguments, along with
an entry for the function itself, are put into the symbol table on lines 530 to 532 of
Listing 6.58. Putting the elements in the table does not affect the cross links. The arguments
still form a linked list after the insertion. The gen() calls on lines 540 and 541 actually
emit the PROC and link directives; I'll come back to this subroutine later, when I
discuss expression processing. The L prefix in the variable- and temporary-space labels is
put into the labels on lines 535 and 536 of Listing 6.58. L_LINK is defined along with
several other label prefixes in label.h, Listing 6.59. These other label prefixes are used
for processing if statements, while loops, and so on. I'll discuss them further when the
flow-control statements are presented.
The parser now handles the K&R-style argument definitions. The new symbol-table K&R-style argument
entries are all of type i nt, because the funct decl contains a name list, not a var list.
definitions.
Since no types are specified in the input argument list, i nt is supplied automatically.
The parser processes the defjist next. When the def J is t-proccssing is finished, the attri
bute at $4the previous action was $3will be a pointer to a linked list of symbol s,
one for each formal definition. The f i x t ypes and di scar d syms () call on line
549 of Listing 6.58 looks up each of the redefined symbols in the symbol table. If the
symbol is there, the type is modified to reflect the redefinition; if its not, an error mes
sage is generated. This subroutine also discards the symbol structures used for the
redefinitions. The f i gur e par amof f set s () call on line 550 traverses the argu
ment list again because some elements of the list have just been modified to have new
types. It patches the symbol structures so that the r name field holds an expression that
can be used in a C-code instruction to access the variable. All these expressions are rela
tive to the frame pointer (WP (f p- 8 ) and so forth). The position in the argument list
determines the value of the offset. The pr i nt of f set comment 551 prints
the comment in the middle of the output that shows what all of these offsets are. All
three subroutines are in Listing 6.60.
Skipping past the function body and local-variable processing for a moment (all
these are done in the compound stmt), the end-of-function processing is all done in the
End-of-function
processing.
third action in Listing 6.58, on lines 556 to 575. The r emove symbol s f r om
t abl e ( ) call on line 560 deletes all the subroutine arguments from the symbol table.
The subroutine itself stays in the table, howeverits a global-level symbol. The
di scar d_symbol _chai n () call on the next line frees the memory used for the asso
ciated symbol structures and associated type chains. The structure-table is not
modified, so the structure definition persists even though the variable doesnt. Finally,
the end-of-function code is output with the three gen () calls on the following lines, and
l i nk labels emitted
the link-instruction labels are emitted on lines 568 to 571 of Listing 6.58
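The offset assignment just described can be tried in isolation. What follows is a minimal sketch, not the book's code: the param structure, assign_param_offsets(), and the 16-byte rname buffer are hypothetical stand-ins for the real symbol structure and for figure_param_offsets(). Only the fp+4 starting offset and the 4-byte stack width are taken from Listing 6.60.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SWIDTH 4                /* Stack width in bytes (a 32-bit stack).      */

typedef struct param            /* Hypothetical stand-in for "symbol".         */
{
    const char   *name;         /* Declared name of the argument.              */
    char          rname[16];    /* Frame-pointer-relative access expression.   */
    struct param *next;         /* Next argument in the chain.                 */
} param;

/* Walk the argument chain, giving each argument exactly one stack element.
 * Returns the number of arguments that were processed.
 */
int assign_param_offsets( param *p )
{
    int offset = 4;             /* The first element of the chain is at fp+4.  */
    int count  = 0;

    for( ; p ; p = p->next )
    {
        assert( p->name != NULL );
        snprintf( p->rname, sizeof p->rname, "fp+%d", offset );
        offset += SWIDTH;
        ++count;
    }
    return count;
}
```

Because every legal argument type fits in one stack element, the offset depends only on the position of the argument in the chain, never on its type.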
Listing 6.58. c.y High-Level, External Definitions (Part Three)
520  /*----------------------------------------------------------------------*/
521      | opt_specifiers funct_decl
522          {
523              static unsigned link_val = 0;   /* Labels used for link args. */
524
525              add_spec_to_decl( $1, $2 );     /* Merge the specifier and    */
526                                              /* declarator.                */
527              if( !$1->tdef )
528                  discard_link_chain( $1 );   /* Discard extra specifier.   */
529
530              figure_osclass      ( $2 );         /* Update symbol table.   */
531              add_symbols_to_table( $2 );         /* Add function itself.   */
532              add_symbols_to_table( $2->args );   /* Add the arguments.     */
533
534              strcpy ( Funct_name, $2->name );
535              sprintf( Vspace, "%s%d", L_LINK, link_val++ );
536              sprintf( Tspace, "%s%d", L_LINK, link_val++ );
537              yycode ( "\n#undef   T\n" );
538              yycode ( "#define  T(n) (fp-%s-(n*4))\n\n", Vspace );
539
540              gen( "PROC", $2->rname, $2->etype->STATIC ? "private" : "public" );
541              gen( "link", Vspace, Tspace );
542
543              ++Nest_lev;     /* Make nesting level of definition_list
544                               * match nesting level in the funct_decl
545                               */
546          }
547        def_list
548          {
549              fix_types_and_discard_syms( $4 );
550              figure_param_offsets      ( $2->args );
551              print_offset_comment      ( $2->args, "argument" );
552
553              --Nest_lev;     /* It's incremented again in the compound_stmt */
554          }
555        compound_stmt
556          {
557              purge_undecl();                 /* Deal with implicit declarations */
558                                              /* and undeclared symbols.         */
559
560              remove_symbols_from_table( $2->args );      /* Delete arguments. */
561              discard_symbol_chain     ( $2->args );
562
563              gen( ":%s%d", L_RET, rlabel(1) );           /* End-of-function   */
564              gen( "unlink" );                            /* code.             */
565              gen( "ret" );
566              gen( "ENDP", $2->rname );
567
568              yybss( "\n#define  %s %d\t/* %s: locals */\n",
569                                      Vspace, loc_var_space(), $2->name );
570              yybss( "#define  %s %d\t/* %s: temps. */\n",
571                                      Tspace, tmp_var_space(), $2->name );
572
573              tmp_reset();            /* Reset temporary-variable system. */
574                                      /* (This is just insurance.)        */
575          }
576      ;
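The deferred-size technique used by this production (name the frame size symbolically in the link directive, then define the symbols once the subroutine has been processed) can be sketched as follows. This is an illustration under stated assumptions rather than the compiler's own code: emit_prologue() and emit_link_sizes() are hypothetical helpers that write into caller-supplied buffers instead of calling gen() and yybss(), and the "L" prefix mirrors L_LINK from label.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Emit the start of a subroutine. The frame size isn't known yet, so the
 * link directive names two macros, L<id> and L<id+1>, instead of numbers.
 */
int emit_prologue( char *buf, size_t size, const char *name, unsigned id )
{
    assert( buf && name );
    return snprintf( buf, size, "PROC(%s,public)\n\tlink(L%u+L%u);\n",
                                                        name, id, id + 1 );
}

/* Called after the whole body has been seen: define the two macros, giving
 * the local-variable and temporary-variable sizes in stack elements.
 */
int emit_link_sizes( char *buf, size_t size, unsigned id,
                                        int local_words, int temp_words )
{
    assert( buf && local_words >= 0 && temp_words >= 0 );
    return snprintf( buf, size, "#define L%u %d\n#define L%u %d\n",
                                id, local_words, id + 1, temp_words );
}
```

Using a fresh pair of numbers for every subroutine is what lets each subroutine have its own macro names, as the Vspace and Tspace comment in the declarations section notes.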
Listing 6.59. label.h Output-label Definitions
/* This file contains definitions for the various label prefixes. All labels
 * take the form: <prefix><number>, the <number> supplied by the code-generation
 * action. The prefixes are defined here.
 */

#define L_BODY        "BDY"    /* Top of the body of a for loop.               */
#define L_COND_END    "QE"     /* End of conditional.                          */
#define L_COND_FALSE  "QF"     /* True part of conditional (?:).               */
#define L_DOEXIT      "DXIT"   /* Just after the end of the do/while.          */
#define L_DOTEST      "DTST"   /* Just above the test in a do/while.           */
#define L_DOTOP       "DTOP"   /* Top of do/while loop.                        */
#define L_ELSE        "EL"     /* Used by else processing.                     */
#define L_END         "E"      /* End of relational/logical op.                */
#define L_FALSE       "F"      /* False target of relational/logical op.       */
#define L_INCREMENT   "INC"    /* Just above the increment part of for loop.   */
#define L_LINK        "L"      /* Argument to link instruction.                */
#define L_NEXT        "EXIT"   /* Outside of loop, end of if clause.           */
#define L_RET         "RET"    /* Above cleanup code at end of subroutine.     */
#define L_STRING      "S"      /* Strings.                                     */
#define L_SWITCH      "SW"     /* Used for switches.                           */
#define L_TEST        "TST"    /* Above test in while/for/if.                  */
#define L_TRUE        "T"      /* True target of relational/logical operator.  */
#define L_VAR         "V"      /* Local-static variables.                      */
Listing 6.60. decl.c Process Subroutine Arguments
void    fix_types_and_discard_syms( sym )
symbol  *sym;
{
    /* Patch up subroutine arguments to match formal declarations.
     *
     * Look up each symbol in the list. If it's in the table at the correct
     * level, replace the type field with the type for the symbol in the list,
     * then discard the redundant symbol structure. All symbols in the input
     * list are discarded after they're processed.
     *
     * Type checking and automatic promotions are done here, too, as follows:
     *          chars are converted to int.
     *          arrays are converted to pointers.
     *          structures are not permitted.
     *
     * All new objects are converted to autos.
     */

    symbol *existing, *s;

    while( sym )
    {
        if( !( existing = (symbol *) findsym( Symbol_tab, sym->name ) )
                                        || sym->level != existing->level )
        {
            yyerror( "%s not in argument list\n", sym->name );
            exit( 1 );
        }
        else if( !sym->type || !sym->etype )
        {
            yyerror( "INTERNAL, fix_types: Missing type specification\n" );
            exit( 1 );
        }
        else if( IS_STRUCT( sym->type ) )
        {
            yyerror( "Structure passing not supported, use a pointer\n" );
            exit( 1 );
        }
        else if( !IS_CHAR( sym->type ) )
        {
            /* The existing symbol is of the default int type, don't redefine
             * chars because all chars are promoted to int as part of the call,
             * so can be represented as an int inside the subroutine itself.
             */

            if( IS_ARRAY( sym->type ) )             /* Make it a pointer to the  */
                sym->type->DCL_TYPE = POINTER;      /* first element.            */

            sym->etype->SCLASS = AUTO;              /* Make it an automatic var. */

            discard_link_chain( existing->type );   /* Replace existing type     */
            existing->type  = sym->type;            /* chain with the current    */
            existing->etype = sym->etype;           /* one.                      */
            sym->type = sym->etype = NULL;          /* Must be NULL for discard_ */
        }                                           /* symbol() call, below.     */
        s = sym->next;
        discard_symbol( sym );
        sym = s;
    }
}

/*----------------------------------------------------------------------*/

int     figure_param_offsets( sym )
symbol  *sym;
{
    /* Traverse the chain of parameters, figuring the offsets and initializing
     * the real name (in sym->rname) accordingly. Note that the name chain is
     * assembled in reverse order, which is what you want here because the
     * first argument will have been pushed first, and so will have the largest
     * offset. The stack is 32 bits wide, so every legal type of object will
     * require only one stack element. This would not be the case were floats
     * or structure-passing supported. This also takes care of any alignment
     * difficulties.
     *
     * Return the number of 32-bit stack words required for the parameters.
     */

    int offset = 4;             /* First parameter is always at BP(fp+4). */
    int i;

    for( ; sym ; sym = sym->next )
    {
        if( IS_STRUCT( sym->type ) )
        {
            yyerror( "Structure passing not supported\n" );
            continue;
        }

        sprintf( sym->rname, "fp+%d", offset );
        offset += SWIDTH;
    }

    /* Return the offset in stack elements, rounded up if necessary. */

    return( (offset / SWIDTH) + (offset % SWIDTH != 0) );
}

/*----------------------------------------------------------------------*/

void    print_offset_comment( sym, label )
symbol  *sym;
char    *label;
{
    /* Print a comment listing all the local variables. */

    for( ; sym ; sym = sym->next )
        yycode( "\t/* %16s = %-16s [%s] */\n", sym->rname, sym->name, label );
}
6.6.5 Compound Statements and Local Variables
The next issue is the function body, which consists of a compound_stmt. As you can
see from Listing 6.61, below, a compound statement is a list of statements (a stmt_list)
surrounded by curly braces. Local-variable definitions (local_defs) can appear at the
beginning of any compound statement, and since this same production is also used to
process multiple-statement bodies of loops, and so forth, local variables can be defined at
any nesting level. The inner variables must shadow the outer ones until the compiler
leaves its scoping level, however: the local version of the variable is used instead of
another variable declared at a more outer level with an identical name.
18. This is quite legal C, though the practice is discouraged because it makes it difficult to find the variable
definitions when they're buried in a subroutine.
It's difficult, though certainly possible, to modify the size of the stack frame every
time that the compiler enters or leaves a scoping level. The main difficulty is
temporary-variable management, which is much easier to do if the temporary-variable
space doesn't move around or change size during the life of a subroutine. My solution is
to allocate space for all local variables, regardless of the scoping level, with the single
link instruction at the top of the subroutine. From a space-allocation perspective, all
variables are treated as if they were declared in the outermost scoping level at run time.
At compile time, however, the symbol-table entry for an inner variable is created when
that variable is declared, and it is deleted when the compiler leaves the scoping level for
that variable. Even though an inner variable continues to exist at run time, that variable
cannot be accessed from outside the compound statement because the symbol-table entry
for the variable won't exist.
The obvious problem with this approach is that memory is wasted. In the following
fragment, for example, the stack region used to store castor could be recycled for use
by pollux; it isn't:
{
    {
        int castor;
    }
    {
        int pollux;
    }
}
Listing 6.61. c.y Compound Statements
577  compound_stmt
578      : LC                        {   if( ++Nest_lev == 1 )
579                                          loc_reset();
580                                  }
581        local_defs stmt_list RC   {   --Nest_lev;
582                                      remove_symbols_from_table( $3 );
583                                      discard_symbol_chain     ( $3 );
584                                  }
585      ;
586
587  local_defs
588      : def_list  {   add_symbols_to_table( $$ = reverse_links( $1 ) );
589                      figure_local_offsets( $$, Funct_name );
590                      create_static_locals( $$, Funct_name );
591                      print_offset_comment( $$, "variable" );
592                  }
593      ;
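The interplay between compile-time visibility and run-time storage described in this section can be modeled with a toy symbol table. The sketch below is only a model under stated assumptions: Table, enter_block(), declare(), leave_block(), and lookup() are hypothetical names, alignment is ignored, and a flat array stands in for the compiler's hashed symbol table.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 32

typedef struct local_sym
{
    const char *name;           /* Declared name.                             */
    int         level;          /* Nesting level at which it was declared.    */
    int         offset;         /* Distance below the frame pointer in bytes. */
} local_sym;

static local_sym Table[ MAX_SYMS ];     /* Toy symbol table (hypothetical).   */
static int       Nsyms;                 /* Number of currently visible names. */
static int       Level;                 /* Current nesting level.             */
static int       Frame_size;            /* Bytes allocated; never shrinks.    */

void enter_block( void ) { ++Level; }

void declare( const char *name, int size )
{
    assert( Nsyms < MAX_SYMS && size > 0 );
    Frame_size          += size;        /* Space is taken from one region.    */
    Table[Nsyms].name    = name;
    Table[Nsyms].level   = Level;
    Table[Nsyms].offset  = Frame_size;
    ++Nsyms;
}

void leave_block( void )
{
    /* Forget the names declared at this level; the space is not recycled. */
    while( Nsyms > 0 && Table[Nsyms - 1].level == Level )
        --Nsyms;
    --Level;
}

int lookup( const char *name )          /* Returns the offset, or -1.         */
{
    int i;
    for( i = Nsyms - 1; i >= 0; --i )   /* Newest first, so inner shadows.    */
        if( strcmp( Table[i].name, name ) == 0 )
            return Table[i].offset;
    return -1;
}
```

Running the castor and pollux fragment through this model gives the two variables different offsets, and the frame ends up two variables deep even though only one of them is ever visible at a time.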
Local-variable definitions are handled by the productions in Listing 6.61 and the
subroutines in Listing 6.62. The def_list nonterminal on line 588 of Listing 6.61 is the same
production that's used for structure fields. Its attribute is a pointer to the head of a linked
list of symbols, one for each declaration. These symbols are added to the symbol table
on line 588 in Listing 6.61.
Listing 6.62. local.c Local-Variable Management
1    #include <stdlib.h>
2    #include <tools/compiler.h>
3    #include <tools/c-code.h>
4    #include "symtab.h"
5    #include "proto.h"
6    #include "label.h"
7
8    /* LOCAL.C
9     *
10    * Subroutines in this file take care of local-variable management.
11    */
12
13   PRIVATE int Offset = 0;     /* Offset from the frame pointer (which also */
14                               /* marks the base of the automatic-variable  */
15                               /* region of the stack frame) to the most    */
16                               /* recently allocated variable. Reset to 0   */
17                               /* by loc_reset() at the head of every sub-  */
18                               /* routine.                                  */
19
20
21   extern void yycode(), yydata(), yybss(), yycomment();
22   /*----------------------------------------------------------------------*/
23
24   void    loc_reset()
25   {
26       /* Reset everything back to the virgin state. Call this subroutine just */
27       /* before processing the outermost compound statement in a subroutine.  */
28
29       Offset = 0;
30   }
31   /*----------------------------------------------------------------------*/
32
33   int     loc_var_space()
34   {
35       /* Return the total cumulative size of the local-variable region in
36        * stack elements (not bytes). This call outputs the value of the macro
37        * that specifies the variable-space size in the link instruction. Calling
38        * loc_reset() also resets the return value of this subroutine to zero.
39        */
40
41       return( (Offset + (SWIDTH-1)) / SWIDTH );
42   }
43   /*----------------------------------------------------------------------*/
44
45   void    figure_local_offsets( sym, funct_name )
46   symbol  *sym;
47   char    *funct_name;
48   {
49       /* Add offsets for all local automatic variables in the symlist. */
50
51       for( ; sym ; sym = sym->next )
52           if( !IS_FUNCT( sym->type ) && !sym->etype->STATIC )
53               loc_auto_create( sym );
54   }
55   /*----------------------------------------------------------------------*/
56
57   void    loc_auto_create( sym )
58   symbol  *sym;
59   {
60       /* Create a local automatic variable, modifying the "rname" field of "sym"
61        * to hold a string that can be used as an operand to reference that
62        * variable. This name is a correctly aligned reference of the form:
63        *
64        *          fp + offset
65        *
66        * Local variables are packed as well as possible into the stack frame,
67        * though, as was the case with structures, some padding may be necessary
68        * to get things aligned properly.
69        */
70
71       int align_size = get_alignment( sym->type );
72
73       Offset += get_sizeof( sym->type );      /* Offset from frame pointer */
74                                               /* to variable.              */
75
76       while( Offset % align_size )            /* Add any necessary padding */
77           ++Offset;                           /* to guarantee alignment.   */
78
79       sprintf( sym->rname, "fp-%d", Offset ); /* Create the name.          */
80       sym->etype->SCLASS = AUTO;
81   }
82   /*----------------------------------------------------------------------*/
83
84   void    create_static_locals( sym, funct_name )
85   symbol  *sym;
86   char    *funct_name;
87   {
88       /* Generate definitions for local, static variables in the symlist. */
89
90       for( ; sym ; sym = sym->next )
91           if( !IS_FUNCT( sym->type ) && sym->etype->STATIC )
92               loc_static_create( sym, funct_name );
93   }
94   /*----------------------------------------------------------------------*/
95
96   void    loc_static_create( sym, funct_name )
97   symbol  *sym;
98   char    *funct_name;
99   {
100      static unsigned val;            /* Numeric component of arbitrary name. */
101
102      sprintf( sym->rname, "%s%d", L_VAR, val++ );
103      sym->etype->SCLASS = FIXED;
104      sym->etype->OCLASS = PRI;
105
106      var_dcl( yybss, PRI, sym, ";" );
107      yybss( "\t/* %s [%s(), static local] */\n", sym->name, funct_name );
108  }
109  /*----------------------------------------------------------------------*/
110
111  void    remove_symbols_from_table( sym )
112  symbol  *sym;
113  {
114      /* Remove all symbols in the list from the table. */
115
116      symbol *p;
117
118      for( p = sym; p ; p = p->next )
119          if( !p->duplicate )
120              delsym( Symbol_tab, p );
121          else
122          {
123              yyerror( "INTERNAL, remove_symbol: duplicate sym. in cross-link\n" );
124              exit( 1 );
125          }
126  }
The figure_local_offsets() call on line 589 of Listing 6.61 handles
automatic variables. (The subroutine is on line 45 of Listing 6.62.) It goes through the
list, adjusting the rnames to hold a string which, when used in an operand, references
the variable. Ultimately, the reference will look something like this: WP(fp-6), but
only the fp-6 is created here; the WP and parentheses are added later by the
expression-processing code.
The current offset from the base of the automatic-variable region is remembered in
Offset, which is incremented by the size of each variable as space for it is allocated.
The variables are packed as closely as possible into the local-variable space. If
alignment permits, they are placed in adjacent bytes. As with structures, padding is inserted if
necessary to guarantee alignment. figure_local_offsets() is called at the top of
every block, but Offset is reset to zero only at the top of the subroutine, so the size of
the local-variable region continues to grow over the life of the subroutine as automatic
variables are allocated. The final size of the region is determined once the entire
subroutine has been processed by calling loc_var_space() on line 33 of Listing 6.62. This
value is used at the end of the subroutine-processing code to define one of the macros
that's passed to the link instruction.
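The packing and padding arithmetic can be checked with a few lines of stand-alone C. This is a sketch only: loc_alloc() takes the size and alignment as plain integers (the real loc_auto_create() gets them from get_sizeof() and get_alignment()), and loc_reset_sketch() and loc_space_in_words() are hypothetical names for the reset and size-query operations.

```c
#include <assert.h>

#define SWIDTH 4                /* Stack-element size in bytes.               */

static int Offset;              /* Distance from fp to the newest variable.   */

void loc_reset_sketch( void )   /* Call once, at the top of a subroutine.     */
{
    Offset = 0;
}

/* Allocate one variable and return its distance below the frame pointer.
 * The size is added first, then padding is added until the resulting
 * address (fp - Offset) is a multiple of the alignment.
 */
int loc_alloc( int size, int align )
{
    assert( size > 0 && align > 0 );

    Offset += size;
    while( Offset % align )
        ++Offset;
    return Offset;
}

/* Size of the whole region in stack elements, rounded up. */
int loc_space_in_words( void )
{
    return (Offset + (SWIDTH - 1)) / SWIDTH;
}
```

For example, a 1-byte char followed by a 2-byte word lands the word at offset 4 rather than 3, because one byte of padding is needed to keep the word on an even address.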
The create_static_locals() call on line 590 of Listing 6.61 handles static
locals. The subroutine starts on line 84 of Listing 6.62. It goes through the list a second
time, allocating space for the variable as if it were a static global variable. An
arbitrary name is assigned instead of using the declared name, as was the case with true
globals. This way, two subroutines can use the same name for a static variable without a
conflict. All static locals are declared private, so conflicts with variables in other
modules are not a problem. Symbols are removed from the table when the compiler
leaves the scoping level on lines 582 and 583 of Listing 6.61.
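The arbitrary-name scheme for static locals amounts to a prefix plus a counter that never resets. A minimal sketch, assuming the "V" prefix from label.h; next_static_name() is a hypothetical helper, not a subroutine from the compiler:

```c
#include <assert.h>
#include <stdio.h>

#define L_VAR "V"               /* Prefix for local-static variables.         */

/* Write the next unused static-local name (V0, V1, ...) into rname. The
 * counter is shared by every subroutine in the module, so two static locals
 * with the same declared name can never collide in the output.
 */
void next_static_name( char *rname, size_t size )
{
    static unsigned val = 0;    /* Numeric component of the name.             */

    assert( rname != NULL && size > 0 );
    snprintf( rname, size, "%s%u", L_VAR, val++ );
}
```

The declared name is still what the symbol table is searched for; only the rname used in the generated code is replaced.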
6.6.6 Front-End/Back-End Considerations
Packing variables onto the stack makes sense in a one-pass compiler, but it's
contraindicated if the real code is going to be generated by a back end. The front end has no
way of knowing what alignment restrictions apply to the target machine or the actual
sizes of the various types. The back end's life can be made easier by assuming that all
types are the same size and that there are no alignment restrictions. As it is, the back end
might have to undo some of our work. If an int is 32 bits, it must unpack the variables.
Similarly, if the worst-case alignment restriction is two rather than four, it must get rid of
the extra padding.
If the front end is ignoring size and alignment, some mechanism is needed to pass the
symbol-table information to the back end. Currently, the compiler's just throwing that
information away when it leaves the scoping level. A better approach passes the symbol
table to the back end as part of the intermediate code. For example, you can introduce a
new, local storage class to C-code and generate definitions for all local symbols at the
top of a block along with the static-variable definitions. A matching delete(name)
directive can be generated at the end of the block to tell the back end that the symbol has
gone out of scope.
There's no need to worry about the size at the intermediate-code level if all
variables are the same size. The compiler's been keeping track of the size so that we can use
W(), L(), and so forth to access variables, but it wouldn't have to do so if everything was
the same size. Consequently, both global and local variables can just be called out by
name in the intermediate code. You can dispense with all the size-related, C-code
addressing modes and register-access directives and just use the names: _p, rather than
WP(&_p) or WP(fp-16); r0 rather than r0.pp. The back end can compute the offsets
for local variables and make any necessary adjustments to the generated code, replacing
the symbolic names with stack-relative access directives as necessary. This one change
dramatically simplifies both the design of C-code and the complexity of the front end.
All of the foregoing applies to structures as well as simple variables: the fields
should all be the same size and not be packed into the structure. Better yet, the front end
could make no attempt to determine the offset to the field from the base of the structure.
Structure members could be passed to the back end like this:
    member type structure_name.field_name;
and the fields could be called out by name in the intermediate code rather than
generating explicit offsets to them, using something like struct_name.member_name.
6.7 The gen () Subroutine
The gen() subroutine was used in the last section to print out the few C-code
instructions in the subroutine prefix and suffix. gen() is a general-purpose
code-generation interface for the parser: all C-code instructions are emitted using gen()
calls rather than yycode() calls. It seems reasonable to look at it now, before using it
further. I've concentrated the code emission into a single subroutine for several reasons:
• Clarity in the source code. Once you've added leading tabs, trailing newlines, field
  widths, and so forth, direct yycode() calls are pretty hard to read. Since gen()
  takes care of all the formatting for you, the subroutine calls are more understandable,
  and the code more maintainable as a consequence. gen() also makes the output
  code more readable because the code is formatted consistently.
• Fewer C-code syntax errors. There are slight variations in syntax in the C-code
  instruction set. Some instructions must be followed by semicolons, others by colons,
  and still others by no punctuation at all. Some instructions take parenthesized
  arguments, others do not. gen() takes care of all these details for you, so the odds of a
  syntax error showing up in the output are much smaller.
• Portability. Since all the output is concentrated in one place, it's much easier to
  make changes to the intermediate language. You need only change a single
  subroutine instead of several yycode() calls scattered all over the parser. Similarly, it's
  easy to emit binary output rather than ASCII output: just change gen() to emit
  binary directly.
• Debugging. The compiler takes a command-line switch that causes it to generate a
  run-time trace. Instead of emitting a single C-code directive, the compiler emits the
  C-code directive surrounded by statements that print the directive itself, the contents
  of all the virtual registers, and the top few stack elements. This way, you can watch
  the effect of every output instruction as it's executed. It is much easier to emit these
  extra run-time-trace statements when all output is concentrated in one place.
The first argument to gen() is a string that specifies the instruction to emit, usually
the op code. The number and type of any additional arguments are controlled by the first
one; legal first arguments are summarized in Table 6.16. If an argument is a character
pointer, the string is printed; if it's an int, the number is converted to a string and
printed, and so on. In addition, if the first character of an arithmetic instruction is an @,
the @ is removed and a * is printed to the left of the destination string. This call:
    gen( "@+=", "dst", "src" );
generates this code:
    *dst += src;
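The @ convention is easy to reproduce in a few lines. The following sketch handles only the two-operand arithmetic case and writes into a caller-supplied buffer; gen_math() is a hypothetical helper used for illustration, not the gen() of Listing 6.63, which takes a variable number of arguments and covers the whole instruction set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a two-operand arithmetic instruction such as "+=" or "@+=" into
 * buf. A leading '@' in the operator is stripped and turned into a '*' in
 * front of the destination. Returns the value returned by snprintf().
 */
int gen_math( char *buf, size_t size, const char *op,
                                        const char *dst, const char *src )
{
    const char *star = "";

    assert( buf && op && dst && src );

    if( *op == '@' )
    {
        star = "*";             /* Indirect through the destination.          */
        ++op;                   /* Skip the '@'.                              */
    }
    return snprintf( buf, size, "%s%s %s %s;", star, dst, op, src );
}
```

Keeping the punctuation (the spacing and the trailing semicolon) in this one place is the point made in the list of reasons above: callers supply operands, never syntax.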
The gen() subroutine is implemented in Listing 6.63 along with various support
routines. gen_comment() (on line 81) puts comments in the output. It works like
printf(), except that the output is printed to the right of the instruction emitted by the
next gen() call. gen_comment() stores the comment text in Comment_buf
(declared on line 77) so that it can be printed by a subsequent gen() call.
The enable_trace() and disable_trace() subroutines on lines 100 and 101
enable and disable the generation of run-time trace statements.
19. This argument also holds for the code that creates declarations. I should have funneled all declarations
through a single subroutine rather than using direct yydata() and yybss() calls. You may want to
make that change to the earlier code as an exercise.
Table 6.16. The gen() Interface

First        Second         Third
Argument     Argument       Argument       Output            Description
-----------  -------------  -------------  ----------------  ----------------------------------
"%="         char *dst;     char *src;     dst %= src;       modulus
"&="         char *dst;     char *src;     dst &= src;       bitwise AND
"*="         char *dst;     char *src;     dst *= src;       multiply
"*=%s%d"     char *dst;     int src;       dst *= src;       multiply dst by constant
"+="         char *dst;     char *src;     dst += src;       add
"+=%s%d"     char *dst;     int src;       dst += src;       add constant to dst
"-="         char *dst;     char *src;     dst -= src;       subtract
"-=%s%d"     char *dst;     int src;       dst -= src;       subtract constant from dst
"/="         char *dst;     char *src;     dst /= src;       divide
"/=%s%d"     char *dst;     int src;       dst /= src;       divide dst by constant
"<<="        char *dst;     char *src;     dst <<= src;      left shift dst by src bits
">>="        char *dst;     char *src;     dst >>= src;      right shift dst by src bits
">L="        char *dst;     char *src;     lrs(dst,src);     logical right shift dst by src bits
"=-"         char *dst;     char *src;     dst = -src;       two's complement
"=~"         char *dst;     char *src;     dst = ~src;       one's complement
"|="         char *dst;     char *src;     dst |= src;       bitwise OR
"^="         char *dst;     char *src;     dst ^= src;       bitwise XOR
"="          char *dst;     char *src;     dst = src;        assign
"=&"         char *dst;     char *src;     dst = &src;       load effective address
"=*%s%v"     char *dst;     value *src;    dst = *name;      assign indirect. name is taken from
                                           dst = name;       src->name. If the name is of the
                                                             form &name, then dst=name is output,
                                                             otherwise dst=*name is output.
":"          char *label;   (none)         label:            label
":%s%d"      char *alpha;   int num;       alphanum:         label, but with the alphabetic and
                                                             numeric components specified
                                                             separately. gen(":%s%d","P",10)
                                                             emits P10:.
"BIT"        char *op1;     char *bit;     BIT(op1,bit)      test bit
"EQ"         char *op1;     char *op2;     EQ(op1,op2)       equality
"EQ%s%d"     char *op1;     int op2;       EQ(op1,op2)       equal to constant
"GE"         char *op1;     char *op2;     GE(op1,op2)       greater than or equal
"GT"         char *op1;     char *op2;     GT(op1,op2)       greater than
"LE"         char *op1;     char *op2;     LE(op1,op2)       less than or equal
"LT"         char *op1;     char *op2;     LT(op1,op2)       less than
"NE"         char *op1;     char *op2;     NE(op1,op2)       not equal
"U_GE"       char *op1;     char *op2;     U_GE(op1,op2)     greater than or equal, unsigned
"U_GT"       char *op1;     char *op2;     U_GT(op1,op2)     greater than, unsigned
"U_LE"       char *op1;     char *op2;     U_LE(op1,op2)     less than or equal, unsigned
"U_LT"       char *op1;     char *op2;     U_LT(op1,op2)     less than, unsigned
"PROC"       char *name;    char *cls;     PROC(name,cls)    start procedure
"ENDP"       char *name;    (none)         ENDP(name)        end procedure
"call"       char *label;   (none)         call(label);      call procedure
"ext_high"   char *dst;     (none)         ext_high(dst);    sign extend
"ext_low"    char *dst;     (none)         ext_low(dst);     sign extend
"ext_word"   char *dst;     (none)         ext_word(dst);    sign extend
"goto"       char *label;   (none)         goto label;       unconditional jump
"goto%s%d"   char *alpha;   int num;       goto alphanum;    unconditional jump, but the
                                                             alphabetic and numeric components of
                                                             the target label are specified
                                                             separately. gen("goto%s%d","P",10)
                                                             emits goto P10;.
"link"       char *loc;     char *tmp;     link(loc+tmp);    link
"pop"        char *dst;     char *type;    dst = pop(type);  pop
"push"       char *src;     (none)         push(src);        push
"ret"        (none)         (none)         ret();            return
"unlink"     (none)         (none)         unlink();         unlink
gen() itself starts on line 113. It uses the lookup table on lines 21 to 74 to translate
the first argument into one of the tokens defined on lines 13 to 19. The table lookup is
done by the bsearch() call on line 132. Thereafter, the token determines the number
and types of the arguments, which are pulled off the stack on lines 138 to 149. The ANSI
variable-argument mechanism described in Appendix A is used. gen() does not emit
anything itself; it assembles a string which is passed to a lower-level output routine.
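The table lookup relies on the standard bsearch() function, which requires the table to be sorted in the same order that the comparison function uses. The sketch below shows the idea on a five-entry subset; the ltab_entry type, Small_tab, and lookup_request() are hypothetical names chosen for the example, and the real table in Listing 6.63 is much larger.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { t_goto, t_label, t_math, t_math_int } request;

typedef struct ltab_entry
{
    const char *lexeme;         /* First argument to gen().                   */
    request     token;          /* What kind of instruction it selects.       */
} ltab_entry;

/* The entries must be in strcmp() order, or bsearch() won't find them. */
static const ltab_entry Small_tab[] =
{
    { "*=",      t_math     },
    { "*=%s%d",  t_math_int },
    { "+=",      t_math     },
    { ":",       t_label    },
    { "goto",    t_goto     },
};

/* bsearch() passes the key first and a table element second. */
static int cmp( const void *key, const void *elem )
{
    return strcmp( (const char *) key, ((const ltab_entry *) elem)->lexeme );
}

/* Return the table entry for op, or NULL if op isn't a legal request. */
const ltab_entry *lookup_request( const char *op )
{
    assert( op != NULL );
    return bsearch( op, Small_tab, sizeof Small_tab / sizeof *Small_tab,
                                                sizeof *Small_tab, cmp );
}
```

A binary search keeps the lookup cheap even though gen() is called for every emitted instruction; the cost is that new entries have to be inserted at the right place in the table.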
The switch starting on line 156 takes care of most of the formatting, using sprintf()
calls to initialize the string. Note that a simple optimization, called strength reduction, is
done on lines 201 to 229 in the case of multiplication or division by a constant. If the
constant is an exact power of two, a shift is emitted instead of a multiply or divide
directive. A multiply or divide by 1 generates no code at all. The comment, if any, is added
to the right of the output string on line 236.
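Strength reduction of a multiply by a constant can be sketched as follows. This is an illustration of the technique, not the code on lines 201 to 229: power_of_two_shift() and gen_multiply() are hypothetical helpers, the output goes to a buffer rather than through gen(), and only multiplication is shown (replacing a signed divide with a shift needs more care with negative operands).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* If n is a power of two, return log2(n); otherwise return -1. */
int power_of_two_shift( unsigned n )
{
    int shift = 0;

    if( n == 0 || (n & (n - 1)) != 0 )  /* More than one bit set, or none.    */
        return -1;

    while( (n >>= 1) != 0 )
        ++shift;
    return shift;
}

/* Format "dst *= k" into buf, replacing the multiply with a left shift when
 * k is a power of two and emitting nothing at all when k is 1.
 */
int gen_multiply( char *buf, size_t size, const char *dst, unsigned k )
{
    int shift = power_of_two_shift( k );

    assert( buf && dst && size > 0 );

    if( k == 1 )                        /* Multiply by 1: no code needed.     */
    {
        buf[0] = '\0';
        return 0;
    }
    if( shift > 0 )
        return snprintf( buf, size, "%s <<= %d;", dst, shift );

    return snprintf( buf, size, "%s *= %u;", dst, k );
}
```

The test `(n & (n - 1)) == 0` works because subtracting one from a power of two clears its single set bit and sets all the lower bits.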
The actual output is done in pr i nt i nst r uct i on () on line 247 of Listing 6.63.
This subroutine takes care of all the formatting details: labels are not indented, the state
ment following a test is indented by twice the normal amount, and so forth,
pr i nt i nst r uct i on () also emits the run-time trace directives. These directives are
written directly to yycodeout [with f pr i nt f () calls rather than yycode () calls] so
that they wont show up in the IDE output window. The trace is done using two macros:
_P () prints the instruction, and _T () dumps the stack and registers. Definitions for
these macros are written to the output file the first time that pr i nt i nst r uct i on () is
called with tracing enabled (on Line 262 of Listing 6.63). The output definitions look
like this:
#def ne _ P ( s ) p r i n t f ( s )
#d e f i n e _T( ) pm () , p r i n t f ( \ "------------------------------------------------------------- \ \ n \ " )
_P () just prints its argument, _T () prints the stack using a pm() callpm() is
declared in
Most statements are handled as follows:
_ P( "a = b; " )
a = b; _T ()
The trace directives are printed at the right of the page so that the instructions
themselves will still be readable. The instruction is printed first, then executed, and then the
stack and registers are printed. Exceptions to this order of events are as follows:
    label:                  _P("label:")

    PROC(...)               _P("PROC(...)"); _T();

                            _P("ret(...)")
    ret(...);

                            _P("ENDP(...)")
    ENDP(...)
The trace for a logical test is tricky because both the test and the following instruction
must be treated as a unit. They are handled as follows:
                            _P("NE(a,b)")
    NE(a,b) {
                            _P("goto x");
        instruction;        _T(); }
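The order of events described above (print the instruction, execute it, then dump the machine state) can be tried out with a small stand-alone sketch. Everything here is an assumption made for the illustration: TRACE_P() and TRACE_T() stand in for the generated _P() and _T() macros (the names are changed only because identifiers that start with an underscore and a capital letter are reserved in modern C), pm() is a stub rather than the real stack-dump routine, and a and b are ordinary variables rather than virtual-machine registers.

```c
#include <assert.h>
#include <stdio.h>

int dumps;                      /* number of times the state was dumped     */
int a, b = 5;                   /* stand-ins for two virtual registers      */

void pm(void)                   /* hypothetical stub for the real pm()      */
{
    ++dumps;
    printf("    a=%d b=%d\n", a, b);
}

#define TRACE_P(s) printf(s)
#define TRACE_T()  (pm(), printf("----------------\n"))

/* Run two traced "instructions" and return the number of state dumps. */
int run_traced(void)
{
    TRACE_P("a = b;\n");        /* 1. show the instruction                  */
    a = b;                      /* 2. execute it                            */
    TRACE_T();                  /* 3. dump the state afterwards             */

    TRACE_P("NE(a,b)\n");       /* a test and its target form one unit:     */
    if (a != b) {               /* the braces keep the trace of the target  */
        TRACE_P("goto x;\n");   /* instruction under control of the test    */
        TRACE_T();
    }
    return dumps;
}
```

Because the test fails after the assignment, only the first dump is executed. That is the reason for the braces: without them, the trace call for the target instruction would become the body of the test and the target itself would execute unconditionally.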
Section 6.7 The gen() Subroutine
Listing 6.63. gen.c: C-code Generation
  3  #include <tools/debug.h>
  7  #include "symtab.h"
  8  #include "value.h"
  9  #include "proto.h"
 10
 11  PRIVATE int Trace = 0;          /* Generate run-time trace if true. */
 12
 13  typedef enum request
 14  {
 15      t_assign_addr, t_assign_ind, t_call, t_endp, t_ext, t_goto, t_goto_int,
 16      t_label, t_label_int, t_link, t_logical, t_logical_int, t_lrs, t_math,
 17      t_math_int, t_pop, t_proc, t_push, t_ret, t_unlink
 18
 19  } request;
 20
 21  struct ltab
 22  {
 23      char    *lexeme;
 24      request token;
 25  }
 26  Ltab[] =
 27  {
 28      {"%=",       t_math        },
 29      {"&=",       t_math        },
 30      {"*=",       t_math        },
 31      {"*=%s%d",   t_math_int    },    /* Multiply var by constant. */
 32      {"+=",       t_math        },
 33      {"+=%s%d",   t_math_int    },
 34      {"-=",       t_math        },
 35      {"-=%s%d",   t_math_int    },
 36      {"/=",       t_math        },
 37      {"/=%s%d",   t_math_int    },
 38      {":",        t_label       },
 39      {":%s%d",    t_label_int   },
 40      {"<<=",      t_math        },
 41      {"=",        t_math        },
 42      {"=&",       t_assign_addr },    /* Get effective address.    */
 43      {"=*%s%v",   t_assign_ind  },    /* Assign indirect.          */
 44      {"=-",       t_math        },
 45      {">>=",      t_math        },
 46      {">L=",      t_lrs         },
 47      {"BIT",      t_logical     },
 48      {"ENDP",     t_endp        },
 49      {"EQ",       t_logical     },
 50      {"EQ%s%d",   t_logical_int },
 51      {"GE",       t_logical     },
 52      {"GT",       t_logical     },
 53      {"LE",       t_logical     },
 54      {"LT",       t_logical     },
 55      {"NE",       t_logical     },
 56      {"PROC",     t_proc        },
 57      {"U_GE",     t_logical     },
 58      {"U_GT",     t_logical     },
 59      {"U_LE",     t_logical     },
 60      {"U_LT",     t_logical     },
 61      {"^=",       t_math        },
 62      {"call",     t_call        },
 63      {"ext_high", t_ext         },
 64      {"ext_low",  t_ext         },
 65      {"ext_word", t_ext         },
 66      {"goto",     t_goto        },
 67      {"goto%s%d", t_goto_int    },
 68      {"link",     t_link        },
 69      {"pop",      t_pop         },
 70      {"push",     t_push        },
 71      {"ret",      t_ret         },
 72      {"unlink",   t_unlink      },
 73      {"|=",       t_math        }
 74  };
 75
 76  #define NREQ ( sizeof(Ltab)/sizeof(*Ltab) )    /* Table size (in elements).   */
 77  char Comment_buf[132];                         /* Remember comment text here. */
 78
 79  /*----------------------------------------------------------------------*/
 80
 81  PUBLIC void gen_comment( format, ... )
 82  char *format;
 83  {
 84      /* Works like printf(), but the string is appended as a comment to the end
 85       * of the command by the next gen() call. There's no array-size
 86       * checking; be careful. The maximum generated string length is 132
 87       * characters. Overwrite any comments already in the buffer.
 88       */
 89
 90      va_list   args;
 91      va_start( args, format );
 92      vsprintf( Comment_buf, format, args );
 93      va_end  ( args );
 94  }
 95
 96  /*----------------------------------------------------------------------
 97   * Enable/disable the generation of run-time trace output.
 98   */
 99
100  PUBLIC enable_trace()  { Trace = 1; }    /* Must call before parsing starts. */
101  PUBLIC disable_trace() { Trace = 0; }
102
103  /*----------------------------------------------------------------------*/
104
105  PRIVATE int cmp( a, b )         /* Compare two lexeme fields of an ltab. */
106  struct ltab *a, *b;
107  {
108      return strcmp( a->lexeme, b->lexeme );
109  }
110
111  /*----------------------------------------------------------------------*/
112
113  PUBLIC gen( op, ... )           /* emit code */
114  char *op;
115  {
116      char         *dst_str, *src_str, b[80];
117      int          src_int, dst_int;
118      value        *src_val;
119      struct ltab  *p, dummy;
120      request      tok;
121      va_list      args;
122      char         *prefix = "";
123      int          amt;
124
125      if( *op == '@' )
126      {
127          ++op;
128          prefix = "*";
129      }
130
131      dummy.lexeme = op;
132      if( !(p = (struct ltab *) bsearch(&dummy, Ltab, NREQ, sizeof(*Ltab), cmp)) )
133      {
134          yyerror("INTERNAL gen: bad request <%s>, no code emitted.\n", op );
135          return;
136      }
137
138      va_start( args, op );                       /* Get the arguments. */
139      dst_str = va_arg( args, char* );
140
141      switch( tok = p->token )
142      {
143      case t_math_int:
144      case t_logical_int:
145      case t_goto_int:
146      case t_label_int:   src_int = va_arg( args, int    );   break;
147      case t_assign_ind:  src_val = va_arg( args, value* );   break;
148      default:            src_str = va_arg( args, char*  );   break;
149      }
150
151      /* The following code just assembles the output string. It is printed with
152       * the print_instruction() call under the switch, which also takes care of
153       * inserting trace directives, inserting the proper indent, etc.
154       */
155
156      switch( tok )
157      {
158      case t_call:        sprintf(b, "call(%s);",        dst_str          ); break;
159      case t_endp:        sprintf(b, "ENDP(%s)",         dst_str, src_str ); break;
160      case t_ext:         sprintf(b, "%s(%s);",      op, dst_str          ); break;
161      case t_goto:        sprintf(b, "goto %s;",         dst_str          ); break;
162      case t_goto_int:    sprintf(b, "goto %s%d;",       dst_str, src_int ); break;
163      case t_label:       sprintf(b, "%s:",              dst_str          ); break;
164
165      case t_label_int:   sprintf(b, "%s%d:",            dst_str, src_int );
166                          tok = t_label;
167                          break;
168
169      case t_logical:     sprintf(b, "%s(%s,%s)",    op, dst_str, src_str );
170                          break;
171
172      case t_logical_int: sprintf(b, "%2.2s(%s,%d)", op, dst_str, src_int );
173                          tok = t_logical;
174                          break;
175
176      case t_link:        sprintf(b, "link(%s+%s);",     dst_str, src_str ); break;
177      case t_pop:         sprintf(b, "%-12s = pop(%s);", dst_str, src_str ); break;
178      case t_proc:        sprintf(b, "PROC(%s,%s)",      dst_str, src_str ); break;
179      case t_push:        sprintf(b, "push(%s);",        dst_str          ); break;
180      case t_ret:         sprintf(b, "ret();"                             ); break;
181      case t_unlink:      sprintf(b, "unlink();"                          ); break;
182
183      case t_lrs:         sprintf(b, "%slrs(%s,%s);", prefix, dst_str, src_str );
184                          break;
185
186      case t_assign_addr: sprintf(b, "%s%-12s = &%s;", prefix, dst_str, src_str );
187                          break;
188      case t_assign_ind:
189          if( src_val->name[0] == '&' )
190              sprintf(b, "%s%-12s = %s;",  prefix, dst_str, src_val->name+1 );
191          else
192              sprintf(b, "%s%-12s = *%s;", prefix, dst_str, src_val->name   );
193          break;
194
195      case t_math:        sprintf(b, "%s%-12s %s %s;", prefix, dst_str, op, src_str );
196                          break;
197
198      case t_math_int:
199          if( *op != '*' && *op != '/' )
200              sprintf(b, "%s%-12s %2.2s %d;", prefix, dst_str, op, src_int );
201          else
202          {
203              switch( src_int )
204              {
205              case 1    : amt = 0;  break;
206              case 2    : amt = 1;  break;
207              case 4    : amt = 2;  break;
208              case 8    : amt = 3;  break;
209              case 16   : amt = 4;  break;
210              case 32   : amt = 5;  break;
211              case 64   : amt = 6;  break;
212              case 128  : amt = 7;  break;
213              case 256  : amt = 8;  break;
214              case 512  : amt = 9;  break;
215              case 1024 : amt = 10; break;
216              case 2048 : amt = 11; break;
217              case 4096 : amt = 12; break;
218              default   : amt = -1; break;
219              }
220
221              if( !amt )
222                  sprintf(b, "/* %s%-12s %2.2s 1; */", prefix, dst_str, op );
223
224              else if( amt < 0 )
225                  sprintf(b, "%s%-12s %2.2s %d;", prefix, dst_str, op, src_int );
226              else
227                  sprintf(b, "%s%-12s %s %d;", prefix, dst_str,
228                                          (*op == '*') ? "<<=" : ">>=", amt );
229          }
230          break;
231
232      default:
233          yyerror("INTERNAL, gen: bad token %s, no code emitted.\n", op );
234          break;
235      }
236      if( *Comment_buf )              /* Add optional comment at end of line. */
237      {
238          concat( sizeof(b), b, b, (tok == t_label ? "\t\t\t\t" : "\t"),
239                                          "/* ", Comment_buf, " */", NULL );
240          *Comment_buf = '\0';
241      }
242
243      print_instruction( b, tok );    /* Output the instruction. */
244  }
245
246  /*----------------------------------------------------------------------*/
247  PRIVATE void print_instruction( b, t )
248  char    *b;                 /* Buffer containing the instruction. */
249  request t;                  /* Token representing instruction.    */
250  {
251      /* Print the instruction and, if trace is enabled (Trace is true), print
252       * code to generate a run-time trace.
253       */
254
255      extern FILE  *yycodeout, *yybssout;
256      static int   printed_defs = 0;
257      static int   last_stmt_was_test;
258
259      if( Trace && !printed_defs )
260      {
261          printed_defs = 1;
262          fprintf( yybssout,
263              "#define _P(s) printf( s )\n"
264              "#define _T()  pm(),printf(\"-------------------------------\\n\")"
265              "\n\n" );
266      }
267
268      if( !Trace )                        /* just print the instruction */
269      {
270          yycode("%s%s%s\n", (t==t_label || t==t_endp || t==t_proc) ? "" : "\t",
271                             last_stmt_was_test ? "    " : "",
272                             b );
273
274          last_stmt_was_test = (t == t_logical);
275      }
276      else if( t == t_logical )
277      {
278          fprintf( yycodeout, "\t\t\t\t\t" "_P(\"%s\\n\");" "\n", b );
279          yycode ( "\t%s\t\t{\n", b );                            /* } */
280          last_stmt_was_test = 1;
281      }
282      else
283      {
284          switch( t )
285          {
286          case t_label: yycode("%s", b );
287                        fprintf( yycodeout, "\t\t\t\t\t" "_P(\"%s\\n\");", b );
288                        break;
289
290          case t_proc:  yycode("%s", b );
291                        fprintf( yycodeout, "\t\t\t" "_P(\"%s\\n\");", b );
292                        fprintf( yycodeout,          "_T();"            );
293                        break;
294
295          case t_ret:
296          case t_endp:  fprintf( yycodeout, "\t\t\t\t\t" "_P(\"%s\\n\");" "\n", b );
297                        yycode("%s", b );
298                        break;
299
300          default:      fprintf( yycodeout, "\t\t\t\t\t" "_P(\"%s\\n\");" "\n", b );
301                        yycode("\t%s%s", last_stmt_was_test ? "    " : "", b );
302                        fprintf( yycodeout, "\t\t\t" "_T();" );
303          }
304
305          if( last_stmt_was_test )        /* { */
306          {
307              putc( '}', yycodeout );
308              last_stmt_was_test = 0;
309          }
310          putc( '\n', yycodeout );
311      }
312  }
6.8 Expressions
This section looks at how expressions are processed. Temporary-variable management
is discussed, as are lvalues and rvalues. The code-generation actions that handle
expression processing are covered as well. We've looked at expression parsing
sufficiently that there's no point in including extensive sample parses of every possible
production in the current section. I've included a few sample parses to explain the
harder-to-understand code-generation issues, but I expect you to be able to do the
simpler cases yourself. Just remember that order of precedence and evaluation controls
the order in which the productions that implement particular operators are executed.
Productions are reduced in the same order as the expression is evaluated.
The foregoing notwithstanding, if you have the distribution disk, you may want to
run simple expressions through the compiler (c.exe) as you read this and the following
sections. As you watch the parse, pay particular attention to the order in which
reductions occur and the way that attributes are passed around as the parse progresses.
6.8.1 Temporary-Variable Allocation
All expressions in C are evaluated one operator at a time, with precedence and
associativity determining the order of evaluation as much as is possible. It is conceptually
convenient to look at every operation as creating a temporary variable that somehow
references the result of that operation. Our compiler does things somewhat more
efficiently, but it's best to think in terms of the stupidest possible code. This temporary,
which represents the evaluated subexpression, is then used as an operand at the next
expression-evaluation stage.
Our first task is to provide a mechanism for creating and deleting temporaries. One
common approach is to defer the temporary-variable management to the back end. The
compiler itself references the temporaries as if they existed somewhere as global
variables, and the back end takes over the allocation details. The temporary variable's type
can be encoded in the name, using the same syntax that would be used to access the
temporary from a C-code statement: W(t0) is a word, L(t1) is an lword, WP(t2) is a
word pointer, and so on. The advantage of this approach is that the back end is in a
much better position than the compiler itself to understand the limitations and strengths
of the target machine, and armed with this knowledge it can use registers effectively. I
am not taking this approach here because it's pedagogically useful to look at a
worst-case situation, where the compiler itself has to manage temporaries.
Temporary variables can be put in one of three places: in registers, in static memory,
and on the stack. The obvious advantage of using registers is that they can be accessed
quickly. The registers can be allocated using a stack of register names, essentially the
method that is used in the examples in previous chapters. If there aren't enough registers
in the machine, you can use a few run-time variables (called pseudo registers) as
replacements, declaring them at the top of the output file and using them once the
registers are exhausted. You could use two stacks for this purpose, one of register names and
another of variable names, using the variable names only when there are no more
registers available. Some sort of priority queue could also be used for allocation.
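The two-stack scheme just described can be sketched as follows. This is an illustration under stated assumptions rather than code from the compiler: the register names (r0 to r3), the pseudo-register names (p0 to p3), and the functions temp_init(), temp_alloc(), and temp_free() are all invented for the example, and a name is classified by its first character when it is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NREGS   4               /* assumed number of real registers         */
#define NPSEUDO 4               /* assumed number of pseudo registers       */

static const char *reg_names[NREGS]      = { "r0", "r1", "r2", "r3" };
static const char *pseudo_names[NPSEUDO] = { "p0", "p1", "p2", "p3" };

static const char *reg_stack[NREGS];        /* free real registers          */
static const char *pseudo_stack[NPSEUDO];   /* free pseudo registers        */
static int reg_top, pseudo_top;             /* number of names on each stack */

/* Push every name onto its stack, in reverse so that r0 (p0) pops first. */
void temp_init(void)
{
    int i;

    reg_top = pseudo_top = 0;
    for (i = NREGS - 1; i >= 0; --i)
        reg_stack[reg_top++] = reg_names[i];
    for (i = NPSEUDO - 1; i >= 0; --i)
        pseudo_stack[pseudo_top++] = pseudo_names[i];
}

/* Hand out a real register if one is free, a pseudo register otherwise.
 * Returns NULL when both stacks are empty.
 */
const char *temp_alloc(void)
{
    if (reg_top > 0)
        return reg_stack[--reg_top];
    if (pseudo_top > 0)
        return pseudo_stack[--pseudo_top];
    return NULL;
}

/* Return a name to the stack it came from. */
void temp_free(const char *name)
{
    if (name[0] == 'r') {
        assert(reg_top < NREGS);
        reg_stack[reg_top++] = name;
    } else {
        assert(pseudo_top < NPSEUDO);
        pseudo_stack[pseudo_top++] = name;
    }
}
```

After temp_init(), the first four calls to temp_alloc() hand out real registers and the fifth falls back to a pseudo register. A freed register is reused before any further pseudo register is touched, which is the whole point of keeping the two stacks separate.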
One real advantage to deferring temporary-variable allocation to the back end is that
the register-versus-static-memory problem can be resolved in an efficient way. Many
optimizers construct a syntax tree for the expression being processed, and analysis of this
tree can be used to allocate temporaries efficiently (so that the registers are used more
often than the static memory). This sort of optimization must be done by a postprocessor
or postprocessing stage in the parser, however: the parser must create a physical syntax
or parse tree that a second pass can analyze. Though it's easy for a simple one-pass
compiler to use registers, it's difficult for such a compiler to use the registers effectively.
Another problem is function calls, which can be embedded in the middle of
expressions. Any registers or pseudo registers that are in use as temporaries must be pushed
before calling the function and popped after the return. Alternatively, code at the top of
the called function could push only those registers that are used as temporaries in the
function itself. There's a lot of pushing and popping in either case. This save-and-restore
process adds a certain amount of overhead, at both compile time and run time.
The solution to the temporary-variable problem that's used here is a compromise
between speed and efficiency. A region of the stack frame is used for temporaries. This
way, they don't have to be pushed because they're already on the stack.
Because the maximum size of the temporary-variable region needed by a subroutine
varies (it is controlled by the worst-case expression in the subroutine), the size of the
temporary-variable space changes from subroutine to subroutine. This problem is solved
with the second macro that's used in the link instruction in the subroutine prefix (L1
in the example in the previous section). The macro is defined to the size of the
temporary-variable region once the entire subroutine has been processed.
20. In many machines, such as the Intel 8086 family, a stack-relative memory access is actually more efficient
than a direct-mode memory access. It takes fewer clock cycles. Since none of the 8086-family machines
have any general purpose registers to speak of, putting the temporaries on the stack is actually one of the
most efficient solutions to the problem.
This approach (allocating a single, worst-case-sized temporary-variable region) is
generally better than a dynamic approach where the temporary-variable space is
gradually expanded at run time by subtracting constants from the stack pointer as variables are
needed. The stack is shrunk with matching additions when the variable is no longer
needed. This last approach can be more efficient of stack space, but it is both more
difficult to do at compile time and inefficient at run time because several subtractions
and additions are needed rather than a single link instruction. Since most languages
use very few, relatively small, temporaries as they evaluate expressions, this second
method is usually more trouble than it's worth. Nonetheless, you may want to consider a
dynamic approach if the source language has very large data types. For example, a
special-purpose language that supported a matrix basic type and a matrix-multiplication
operator would need a tremendous amount of temporary-variable space to
process the following expression:
    matrix a[100][100], b[100][100], c[100][100];
    a = a * b * c;
It would be worthwhile not to waste this stack space except when the expression was
actually being evaluated, so a dynamic approach makes sense here. If the stack just isn't
large enough for temporaries such as the foregoing, the compiler could generate run-time
calls to malloc() as part of the subroutine prefix, storing a pointer to the memory
returned from malloc() on the stack. A run-time free() call would have to be generated
as part of the subroutine suffix to delete the memory. Calling malloc() is a
very-high-overhead operation, however, so it's best to use the stack if you can.
The next problem is the temporary variable's type, which varies with the operand
types. The type of the larger operand is used for the temporary. The easiest solution
(and the one used here) is to use the worst-case size for all temporaries, and align them
all on a worst-case boundary. All temporaries take up one lword-sized stack element,
which is guaranteed to be aligned properly because it's a stack word. Again, if the
worst-case type was very large (if doubles were supported, for example), it would be
worthwhile to go to the trouble to pack the temporaries into the stack region in a manner
analogous to the local variables.
Figure 6.18 shows a stack frame with eight bytes of local-variable space and enough
temporary space for two variables. The macro definitions at the top of the figure are
generated at the same time as the subroutine prefix. The L0 macro evaluates to the size, in
stack elements, of the local-variable region. (This is the same L0 that's passed to the
link instruction.) The T() macro simplifies temporary-variable access in instructions.
It is defined at the top of the file as an empty macro, and is then #undefed and redefined
to reference the current local-variable-space size at the top of every subroutine. The
initial dummy definition is required because some compilers won't let you undefine a
nonexistent macro. The following examples demonstrate how the macro is used:
    W(T(0))      accesses the word in the bottom two bytes of the bottom temporary.
    L(T(1))      accesses the lword that takes up the entire top temporary.
    *WP(T(0))    accesses the word whose address is in the top temporary.
Figure 6.18. Storing Temporary Variables in the Stack Frame

    #define T(n)  (fp-L0-(4*n))
    #define L0    2         /* size of local-variable space     */
    #define L1    2         /* size of temporary-variable space */

                        (low memory)
    Temporary-variable region (L1=2):
        T(1)    fp-16
        T(0)    fp-12
    Local-variable region (L0=2):
        fp-5 | fp-6 | fp-7 | fp-8
        fp-1 | fp-2 | fp-3 | fp-4
          3      2      1      0
    <- fp
                        (high memory)
Though temporary variables are not packed into the stack frame, the temporary-variable
space is used intelligently by the compiler: the same region of the stack is
recycled as needed for temporaries. The strategy, implemented in Listing 6.64, is
straightforward. (It's important not to confuse compile time and run time in the following
discussion. It's easy to do. The temporary-variable allocator is part of the compiler, but
it is keeping track of a condition of the stack at run time. All of the temporary-variable
allocation and deallocation is done at compile time. A region of the stack is just used as
a temporary at run time; the compiler determined that that region was available at
compile time.)
Listing 6.64. temp.c: Temporary-Variable Management
  4  #include <tools/hash.h>
  7  #include <tools/c-code.h>
  8  #include "symtab.h"
  9  #include "value.h"
 11
 12  /* Subroutines in this file take care of temporary-variable management. */
 13
 14  #define REGION_MAX 128      /* Maximum number of stack elements that can be */
 15                              /* used for temporaries.                        */
 16  #define MARK       -1       /* Marks cells that are in use but are not the  */
 17                              /* first cell of the region.                    */
 18  typedef int CELL;           /* In-use (Region) map is an array of these.    */
 19
 20  PRIVATE CELL Region[REGION_MAX];
 21  PRIVATE CELL *High_water_mark = Region;
 22  /*----------------------------------------------------------------------*/
 23
 24  PUBLIC int tmp_alloc( size )
 25  int size;                   /* desired number of bytes */
 26  {
 27      /* Allocate a portion of the temporary-variable region of the required size,
 28       * expanding the tmp-region size if necessary. Return the offset in bytes
 29       * from the start of the rvalue region to the first cell of the temporary.
 30       * 0 is returned if no space is available, and an error message is also
 31       * printed in this situation. This way the code-generation can go on as if
 32       * space had been found without having to worry about testing for errors.
 33       * (Bad code is generated, but so what?)
 34       */
 35
 36      CELL *start, *p;
 37      int  i;
 38
 39      /* size = the number of stack cells required to hold "size" bytes. */
 40
 41      size = ((size + SWIDTH) / SWIDTH) - (size % SWIDTH == 0);
 42
 43      if( !size )
 44          yyerror("INTERNAL, tmp_alloc: zero-length region requested\n");
 45
 46
 47      /* Now look for a large-enough hole in the already-allocated cells. */
 48
 49      for( start = Region; start < High_water_mark ;)
 50      {
 51          for( i = size, p = start; --i >= 0 && !*p; ++p )
 52              ;
 53
 54          if( i >= 0 )                /* Cell not found.           */
 55              start = p + 1;
 56          else                        /* Found an area big enough. */
 57              break;
 58      }
 59
 60      if( start < High_water_mark )   /* Found a hole. */
 61          p = start;
 62      else
 63      {
 64          if( (High_water_mark + size) > (Region + REGION_MAX) )  /* No room. */
 65          {
 66              yyerror("Expression too complex, break into smaller pieces\n");
 67              return 0;
 68          }
 69          p = High_water_mark;
 70          High_water_mark += size;
 71      }
 72
 73      for( *p = size; --size > 0; *++p = MARK )   /* 1st cell=size. Others=MARK */
 74          ;
 75
 76      return( start - Region );       /* Return offset to start of region */
 77                                      /* converted to bytes.              */
 78  }
 79
 80  /*----------------------------------------------------------------------*/
 81
 82  PUBLIC void tmp_free( offset )
 83  int offset;                 /* Release a temporary var.; offset should */
 84  {                           /* have been returned from tmp_alloc().    */
 85      CELL *p = Region + offset;
 86      int  size;
 87
 88      if( p < Region || p > High_water_mark || !*p || *p == MARK )
 89          yyerror("INTERNAL, tmp_free: Bad offset (%d)\n", offset );
 90      else
 91          for( size = *p; --size >= 0; *p++ = 0 )
 92              ;
 93  }
 94
 95  /*----------------------------------------------------------------------*/
 96
 97  PUBLIC void tmp_reset()
 98  {
 99      /* Reset everything back to the virgin state, including the high-water mark.
100       * This routine should be called just before a subroutine body is processed,
101       * when the prefix is output. See also: tmp_freeall().
102       */
103      tmp_freeall();
104      High_water_mark = Region;
105  }
106
107  /*----------------------------------------------------------------------*/
108
109  PUBLIC void tmp_freeall()
110  {
111      /* Free all temporaries currently in use (without modifying the high-water
112       * mark). This subroutine should be called after processing arithmetic
113       * statements to clean up any temporaries still kicking around (there is
114       * usually at least one).
115       */
116      memset( Region, 0, sizeof(Region) );
117  }
118
119  /*----------------------------------------------------------------------*/
120
121  PUBLIC int tmp_var_space()
122  {
123      /* Return the total cumulative size of the temporary-variable region in
124       * stack elements, not bytes. This number can be used as an argument to the
125       * link instruction.
126       */
127      return High_water_mark - Region;
128  }
The compiler keeps an internal array called Region (declared on line 20 of Listing
6.64) that tells it which stack elements in the temporary-variable space are in use at run
time. The index of the cell in the array corresponds to the offset of the equivalent stack
element. For example, Region[N] corresponds to the temporary variable at offset N (in
stack elements) from the base of the temporary-variable region (not from the frame
pointer). The array elements are zero if the corresponding stack element is available.
For example, if Region[4] is zero, then the stack element at offset 4 from the base of
the temporary space is available. The size of the array is controlled by REGION_MAX,
declared on line 14 of Listing 6.64. This size limits the amount of temporary-variable
space available to a subroutine. This size limits an expression's complexity because it
limits the number of temporaries that can be used to evaluate the expression. The
compiler prints an error message if the expression gets too complicated (if there aren't
enough temporary variables to evaluate all the subexpressions).
The temporary-variable allocator, tmp_alloc(), starts on line 24 of Listing 6.64.
It is passed the number of bytes required for the temporary, figures out how many stack
elements are required for that temporary, and then allocates that many stack elements
from the temporary-variable region (see footnote 21). The allocation logic on lines 49 to 58
of Listing 6.64 searches the array from bottom to top looking for a free stack element. It
allocates the first available stack element for the temporary. The High_water_mark, used
here and declared on line 21 of Listing 6.64, keeps track of the size of the temporary-variable
space. If the first available element is at a higher index than any previously allocated
element, then High_water_mark is modified to point at the new element.
21. The allocator I'm presenting here is actually more complex than necessary, because all temporary
variables take up exactly one stack element in the current compiler. It seemed reasonable to demonstrate
how to handle the more complex situation, where some temporaries can take up more than one stack
element, however.
The cell is marked as in use on line 73. The Region element corresponding to the
first cell of the allocated space is set to the number of stack elements that are being
allocated. If more than one stack element is required for the temporary, adjacent cells that
are part of the temporary are filled with a place marker. Other subroutines in Listing
6.64 de-allocate a temporary variable by resetting the equivalent Region elements to
zero, de-allocate all temporary variables currently in use, and provide access to the
high-water mark. You should take a moment and review them now.
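If the pointer arithmetic in Listing 6.64 is hard to follow, the same bookkeeping can be written with array indexes. The following is a minimal ANSI C sketch of the idea, not a replacement for the listing: the _sketch function names are invented, SWIDTH is assumed to be 4 bytes per stack element, offsets are returned in stack elements, and (unlike the listing) failure is reported with -1 rather than with an error message.

```c
#include <assert.h>
#include <string.h>

#define SWIDTH      4           /* assumed bytes per stack element          */
#define REGION_MAX  128         /* maximum number of temporary cells        */
#define MARK        (-1)        /* in use, but not first cell of a region   */

static int region[REGION_MAX];  /* 0 = free, n > 0 = first of n cells       */
static int high_water;          /* number of cells ever in use              */

/* Allocate enough cells to hold `bytes` bytes. Returns the offset of the
 * first cell (in stack elements), or -1 if the region is exhausted.
 */
int tmp_alloc_sketch(int bytes)
{
    int cells = (bytes + SWIDTH - 1) / SWIDTH;      /* round up */
    int start, i;

    assert(bytes > 0);

    /* First fit: look for `cells` adjacent free cells below the mark. */
    for (start = 0; start + cells <= high_water; ++start) {
        for (i = 0; i < cells && region[start + i] == 0; ++i)
            ;
        if (i == cells)
            break;                  /* found a large-enough hole        */
        start += i;                 /* skip past the cell that's in use */
    }

    if (start + cells > high_water) {   /* no hole: grow the region     */
        start = high_water;
        if (start + cells > REGION_MAX)
            return -1;
        high_water += cells;
    }

    region[start] = cells;              /* first cell holds the size    */
    for (i = 1; i < cells; ++i)
        region[start + i] = MARK;       /* the others hold a marker     */
    return start;
}

/* Release a temporary. Returns 0 on success, -1 for a bad offset. */
int tmp_free_sketch(int offset)
{
    int cells, i;

    if (offset < 0 || offset >= high_water || region[offset] <= 0)
        return -1;

    cells = region[offset];
    for (i = 0; i < cells; ++i)
        region[offset + i] = 0;
    return 0;
}

/* Size of the temporary region, in stack elements. */
int tmp_var_space_sketch(void) { return high_water; }

/* Forget everything, including the high-water mark. */
void tmp_reset_sketch(void)
{
    memset(region, 0, sizeof region);
    high_water = 0;
}
```

The important property is the same as in the listing: freeing a temporary makes its cells available for reuse, but the high-water mark never shrinks, so the value returned by tmp_var_space_sketch() at the end of a subroutine is the worst-case size needed by any expression in it.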
6.8.2 Lvalues and Rvalues
Expression evaluation, and the way that temporaries are used while doing the
evaluation, is actually a more complex proposition than you would think, based on the
examples in earlier chapters. Hitherto, we've used temporaries for one purpose only, to hold
the value of a subexpression. We could do this because none of the grammars that we've
seen are capable of modifying variables, only of using them. As soon as modification
becomes a possibility, things start to get complicated, especially in a language like C
that provides several ways to modify something (=, ++, --, and so forth).
So far, the parsers that we've looked at always treated identifier references in the
same way: They copied the contents of the referenced variable into a temporary and
passed the name of that temporary back up as an attribute. The following grammar is a
case in point:
    e : e PLUS e   { yycode("%s += %s\n", $1, $3); free_name( $3 ); }
      | ID         { yycode("%s = %s\n", $$ = new_name(), yytext ); }
The value stack is a stack of character pointers. The attribute is the name of the
temporary that holds the partially-evaluated subexpression. Input like x + y generates the
following code:
    t0 = x;
    t1 = y;
    t0 += t1;
The kind of temporary that's used here (one that holds something's value) is called an
rvalue. In the current grammar, the y and z in x = y + z are said to generate rvalues: they
create temporaries that hold a variable's value. (It's an rvalue because it can go to the
right of the equals sign.) The + operator also generates an rvalue because the result of
the operation is a run-time temporary holding the sum, which is the value of the subexpression.
What happens when you introduce an assignment operator to the grammar? Our
first, unsuccessful attempt at a solution treats the assignment operator just like the
addition operator:
    e : e PLUS e   { yycode("%s += %s\n", $1, $3); free_name( $3 ); }
      | e EQUAL e  { yycode("%s = %s\n",  $1, $3); free_name( $3 ); }
      | ID         { yycode("%s = _%s\n", $$ = new_name(), yytext ); }
      ;
The input x = y + z yields the following output:
    t0 = x;
    t1 = y;
    t2 = z;
    t1 += t2;
    t0 = t1;
which doesn't do anything useful.
In order to do assignment correctly, you need to know the address of x (the memory
location to modify), not its contents. To get this information you need to introduce a
second type of temporary, one that holds an address. This kind of temporary is called
an lvalue because it can be generated by subexpressions to the left of an equal sign. The
grammar and code in Listing 6.65 support both kinds of temporaries. They are
distinguished from one another by the first character of the name. Rvalues start with R,
lvalues with L. As was the case in earlier chapters, a simple stack strategy is used to
allocate and free names. Error checking has been omitted to simplify things.
Listing 6.65. Expression Processing with Lvalues and Rvalues
    %{
    #define YYSTYPE char*
    %}
    %%
    s : e ;

    e : e PLUS e
        {
            char *target;

            /* Convert $1 to an rvalue if necessary so that it can be used
             * as the synthesized attribute.
             */

            if( *$1 != 'R' )
            {
                yycode( "%s = %s\n", target = new_rvalue(), rvalue($1) );
                free_value( $1 );
                $1 = target;
            }
            $$ = $1;
            yycode( "%s += %s\n", $1, rvalue($3) );
            free_value( $3 );
        }
      | e EQUAL e
        {
            char *target;

            if( *$1 != 'L' )
                yyerror( "Lvalue required\n" );
            else
            {
                if( *$3 != 'R' )        /* Convert $3 to an rvalue. */
                {
                    yycode( "%s = %s\n", target = new_rvalue(), rvalue($3) );
                    free_value( $3 );
                    $3 = target;
                }
                yycode( "*%s = %s\n", $1, $3 );
                free_value( $1 );
                $$ = $3;
            }
        }
      | ID
        {
            yycode( "%s = &_%s\n", $$ = new_lvalue(), yytext );
        }
      ;
    %%
    char *Lvalues[4] = { "L0", "L1", "L2", "L3" };
    char **Lp        = Lvalues;
    char *Rvalues[4] = { "R0", "R1", "R2", "R3" };
    char **Rp        = Rvalues;

    char *new_lvalue() { return *Lp++; }    /* Allocate a new lvalue. */
    char *new_rvalue() { return *Rp++; }    /* Allocate a new rvalue. */

    void free_value( val )                  /* Free an lvalue or an rvalue. */
    char *val;
    {
        if( *val == 'L' ) *--Lp = val;      /* Push the name back onto its stack. */
        else              *--Rp = val;
    }

    char *rvalue( value )   /* If the argument is an lvalue, return a string   */
    char *value;            /* that can be used in an expression to reference  */
    {                       /* the lvalue's contents. (Add a * to the name.)   */
        static char buf[16];

        if( *value == 'R' )                 /* It's already an rvalue. */
            return value;
        sprintf( buf, "*%s", value );
        return buf;
    }
Using the new grammar, the input x=y+z yields the following output:
    L0 = &_x;
    L1 = &_y;
    L2 = &_z;
    R0 = *L1;
    R0 += *L2;
    *L0 = R0;
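To check that this sequence really performs the assignment, the six generated instructions can be transcribed into C and run. The sketch below assumes int variables and a hypothetical wrapper function run_generated(); the leading underscores of the generated names are dropped because such names are reserved in ordinary C programs.

```c
#include <assert.h>

/* The generated code for x = y + z, transcribed into C. L0..L2 are physical
 * lvalues (they hold addresses); R0 is an rvalue (it holds a value).        */
int run_generated(int y_init, int z_init)
{
    int  x = 0, y = y_init, z = z_init;
    int *L0, *L1, *L2;
    int  R0;

    L0 = &x;        /* ID actions: one address per identifier     */
    L1 = &y;
    L2 = &z;
    R0 = *L1;       /* PLUS action: convert y's lvalue to an rvalue */
    R0 += *L2;      /*              then add z's contents           */
    *L0 = R0;       /* EQUAL action: store through x's lvalue       */

    return x;
}
```

With y set to 2 and z set to 3, the function returns 5: the store through L0 is what finally modifies x.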
which is awkward, but it works. The lvalues are converted to rvalues by preceding their
names with a star. Note that the attribute synthesized in the assignment is the rvalue, not
the lvalue. The assignment operator requires an lvalue to the left of the equals sign, but
it generates an rvalue. That's why code like the following is illegal in C:
    (x=y) = z;
even though
    x = y = z;
is okay. (x=y) generates an rvalue, a temporary that holds the value of y, and explicit
assignment to an rvalue is meaningless. In x=y=z, the operand to the right of the
equal sign is always an rvalue and the operand to the left is always an lvalue. (Run it
through the parser by hand if you don't believe me.)
The code that the compiler just generated is pretty sorry. Fortunately, the compiler
can do a simple optimization to fix things. Instead of passing around the name of a
temporary variable as an attribute, it can pass a string which, when used as an operand in an
instruction, evaluates correctly. This attribute is sometimes a temporary-variable name,
but it can also take on other values. That is, instead of generating
    L0 = &_x;
in the earlier example, and then passing around the string "L0" as an attribute, the
compiler can dispense with the assignment entirely and pass around "&_x" as an
attribute. So you now have two types of lvalues: physical lvalues (temporary variables that
hold the address of something) and logical lvalues (expressions that evaluate to the
address of something when used as an operand). The compiler can distinguish between
the two by looking at the first character in the name; logical lvalues start with an
ampersand.
Listing 6.66 shows the parser, modified to use logical lvalues. Changes are marked
with a comment at the end of the line. Very few changes have to be made. First, the code
for assignment must be modified on line 32 of Listing 6.66 to handle logical lvalues. The
conditional just discards the leading ampersand if it's present, because the two operators
in the expression *&_x just cancel one another, yielding _x. Next, the ID action on lines
39 to 41 of Listing 6.66 must be changed to generate logical, rather than physical, lvalues.
The name is assembled into a string, and a copy of the string is passed as an attribute.
(The copy is freed on line 59.) Finally, the rvalue() subroutine is modified on lines 70
and 71 of Listing 6.66 to recognize both kinds of lvalues.
Listing 6.66. Expression Processing with Logical Lvalues
 1  %{
 2  #define YYSTYPE char*
 3  %}
 4  %%
 5  s : e ;
 6  e : e PLUS e
 7      {
 8          char *target;
 9
10          if( *$1 != 'R' )
11          {
12              yycode( "%s = %s\n", target = new_rvalue(), rvalue($1) );
13              free_value( $1 );
14              $1 = target;
15          }
16          $$ = $1;
17          yycode( "%s += %s\n", $1, rvalue($3) );
18          free_value( $3 );
19      }
20    | e EQUAL e
21      {   char *target;
22          if( *$1 != 'L' && *$1 != '&' )  /* Accept physical and logical lvalues. */
23              yyerror( "Lvalue required\n" );
24          else
25          {
26              if( *$3 != 'R' )            /* Convert $3 to an rvalue. */
27              {
28                  yycode( "%s = %s\n", target = new_rvalue(), rvalue($3) );
29                  free_value( $3 );
30                  $3 = target;
31              }
32              yycode( "%s%s = %s\n", *$1=='&' ? "" : "*", *$1=='&' ? ($1)+1 : $1, $3 ); /* changed */
33              $$ = $3;
34              free_value( $1 );
35          }
36      }
37    | ID
38      {
39          char buf[16];                       /* changed */
40          sprintf( buf, "&_%s", yytext );     /* changed */
41          $$ = strdup( buf );                 /* changed */
42      }
43    ;
44  %%
45  char *Lvalues[4] = { "L0", "L1", "L2", "L3" };
46  char **Lp        = Lvalues;
47
48  char *Rvalues[4] = { "R0", "R1", "R2", "R3" };
49  char **Rp        = Rvalues;
50
51  char *new_lvalue() { return *Lp++; }    /* Allocate a new lvalue. */
52  char *new_rvalue() { return *Rp++; }    /* Allocate a new rvalue. */
53
54  void free_value( val )                  /* Free an lvalue or an rvalue. */
55  char *val;
56  {
57      if     ( *val == 'L' ) *--Lp = val;     /* Physical lvalue */
58      else if( *val == 'R' ) *--Rp = val;
59      else                   free( val );     /* Logical lvalue  */
60  }
61
62  char *rvalue( value )   /* If the argument is an lvalue, return a string   */
63  char *value;            /* that can be used in an expression to reference  */
64  {                       /* the lvalue's contents.                          */
65      static char buf[16];
66
67      if( *value == 'R' )                 /* It's already an rvalue. */
68          return value;
69
70      if( *value == '&' )                 /* Logical lvalue. */
71          return value + 1;               /* *&x cancels. Just use the name. */
72
73      sprintf( buf, "*%s", value );
74      return buf;
75  }
The new-and-improved code-generation actions translate the input x=y+z as
follows:
    R0 = _y;
    R0 += _z;
    _x = R0;
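The attribute handling behind this shorter output can be imitated outside of a parser. The following is a minimal sketch, not the book's code: as_rvalue() mirrors the three cases that the modified rvalue() distinguishes, and gen_assign_sum() is a hypothetical driver that plays the role of the ID, PLUS, and EQUAL actions for an input of the form dst = a + b. Like yycode() in the listings, it emits no trailing semicolons.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the operand string that reads the value behind an attribute:
 *   "R0"  (rvalue)          -> "R0"
 *   "&_x" (logical lvalue)  -> "_x"   (the * and & cancel)
 *   "L0"  (physical lvalue) -> "*L0"  (assembled in buf)               */
const char *as_rvalue(const char *attr, char *buf, size_t size)
{
    if (*attr == 'R') return attr;
    if (*attr == '&') return attr + 1;
    snprintf(buf, size, "*%s", attr);
    return buf;
}

/* Emit code for "dst = a + b" into out (assumed large enough). Only one
 * rvalue temporary is needed, so R0 is hard-wired.                     */
void gen_assign_sum(char *out, size_t size,
                    const char *dst, const char *a, const char *b)
{
    char ld[32], la[32], lb[32], tmp[34];
    int  n = 0;

    /* ID actions: build logical lvalues; no code is emitted at all. */
    snprintf(ld, sizeof ld, "&_%s", dst);
    snprintf(la, sizeof la, "&_%s", a);
    snprintf(lb, sizeof lb, "&_%s", b);

    /* PLUS action: load the left operand into R0, then add the right one. */
    n += snprintf(out + n, size - n, "R0 = %s\n",  as_rvalue(la, tmp, sizeof tmp));
    n += snprintf(out + n, size - n, "R0 += %s\n", as_rvalue(lb, tmp, sizeof tmp));

    /* EQUAL action: the target is the lvalue with one indirection removed. */
    snprintf(out + n, size - n, "%s = R0\n", as_rvalue(ld, tmp, sizeof tmp));
}
```

Because an identifier's attribute is now just a string, the three address-of instructions disappear entirely; only the work that touches R0 remains.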
To summarize all of the foregoing and put it into the context of a real compiler: two
types of temporaries can be created as an expression is evaluated: an lvalue, which holds
the address of the object that, in turn, holds the result of the previous operation, and an
rvalue, which holds the result itself. The terms come from an expression like:
    x = y;
The rvalue, y (on the right of the equal sign), is treated differently from the lvalue, x (on
the left). The y must evaluate to its contents, but the x must evaluate to an address: the
place where the y is to be stored.
An expression that is said to generate an lvalue can be seen as creating a temporary
variable that holds an address of some sort, and the operator controls whether an
lvalue or rvalue is generated. In the current C compiler, all identifier references generate
lvalues: a temporary is created that holds the address of the referenced object. The
following operators also generate lvalues:
    *    []    .    ->
Note that none of the foregoing operators do arithmetic; they all do a reference of some
sort. Operators not in the foregoing list generate rvalues: a temporary is created that
holds the result of the operation. In addition, the following operators require their
operands to be lvalues:
    ++    --    =    +=    -=    etc.
Other operators can handle both lvalue and rvalue operands, but to use them, they must
convert the lvalues to rvalues by removing one level of indirection.
A quick way to determine whether an expression forms an lvalue or an rvalue is to
put the expression to the left of an equals sign and see if it's legal. Given the following
declarations:
    int x;
    int array[10];
    int *p = array;
    struct tag
    {
        int f1;
        int f2[5];
    }
    structure;
the following expressions generate lvalues:
    x
    *array          array[n]
    p               *p                  *(p + n)
    structure.f1    structure.f2[n]     *(structure.f2)
    structure       (only if structure assignment is supported)
Note that both p and *p form lvalues, though the associated temporaries are of different
types. The following are not lvalues because they can't appear to the left of an equal
sign:
    &x
    array
    structure.f2
    structure       (only if structure assignment is not supported)
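The left-of-the-equals-sign test is easy to try with a real C compiler. The sketch below uses the declarations from the text inside a hypothetical function, lvalue_demo(), assigns through every expression in the lvalue list, and leaves the illegal forms in a comment because they do not compile.

```c
#include <assert.h>

struct tag
{
    int f1;
    int f2[5];
};

/* Assign through each lvalue expression, then sum what was stored. */
int lvalue_demo(void)
{
    int        x;
    int        array[10] = { 0 };
    int       *p;
    struct tag structure = { 0 };
    int        n = 2;

    x               = 1;
    *array          = 2;    /* stores into array[0]          */
    array[n]        = 3;
    p               = array + 5;
    *p              = 4;    /* stores into array[5]          */
    *(p + n)        = 5;    /* stores into array[7]          */
    structure.f1    = 6;
    structure.f2[n] = 7;
    *(structure.f2) = 8;    /* stores into structure.f2[0]   */

    /* None of these is an lvalue; a C compiler rejects all three:
     *     &x = p;     array = p;     structure.f2 = p;
     */

    return x + array[0] + array[2] + array[5] + array[7]
             + structure.f1 + structure.f2[2] + structure.f2[0];
}
```

The eight stores put the values 1 through 8 into eight different objects, so the returned sum is 36.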
6.8.3 Implementing Values, Higher-Level Temporary-Variable Support
This section describes the current compiler's implementation of lvalues and rvalues.
The value structure, in Listing 6.67, is used for both of them, and all of the
expression-processing productions pass around pointers to value structures as attributes. The
various fields are used as follows:
lvalue      True if the structure represents an lvalue, false if it's an rvalue.
is_tmp      True if the structure represents a real temporary variable (an rvalue or a
            physical lvalue), false otherwise.
offset      Position of the temporary variable in the stack frame. This number is the
            absolute value of the offset, in stack elements, from the base address of the
            temporary-variable region to the variable itself [i.e. the number that is
            returned from tmp_alloc()].
name        If the current value is an rvalue, this field is a string which, when used as an
            operand, evaluates to the value of the previously computed subexpression. If
            it's an lvalue, this field is a string that evaluates to the address of the object
            generated from the previous subexpression.
type
etype       Point at the start and end of a type chain representing the value's type. If an
            rvalue, the name field references an object of the indicated type; if an lvalue,
            the name field evaluates to a pointer to an object of the indicated type. When
            an lvalue is created from a symbol (when an identifier is processed), a copy
            of the symbol's type chain is made and the copy is used in the value.
sym         When the lvalue is created from a symbol, this field points at that symbol.
            It's used only to print an occasional comment in the output, and is undefined
            if the lvalue was created by an operator like * or [].
Listing 6.67. value.h Lvalues and Rvalues
    /* VALUE.H: Various definitions for (l|r)values. "symtab.h" must be
     *          #included before #including this file.
     */

    #define VALNAME_MAX (NAME_MAX * 2)   /* Max. length of string in value.name */

    typedef struct value
    {
        char     name[ VALNAME_MAX ];    /* Operand that accesses the value.       */
        link     *type;                  /* Variable's type (start of chain).      */
        link     *etype;                 /* Variable's type (end of chain).        */
        symbol   *sym;                   /* Original symbol.                       */
        unsigned lvalue : 1;             /* 1 = lvalue, 0 = rvalue.                */
        unsigned is_tmp : 1;             /* 1 if a temporary variable.             */
        unsigned offset : 14;            /* Absolute value of offset from base of  */
                                         /* temporary region on stack to variable. */
    } value;

    #define LEFT  1         /* Second argument to shift_name() in value.c, */
    #define RIGHT 0         /* discussed below.                            */

    #define CONST_STR(p) tconst_str((p)->type)  /* Simplify tconst_str() calls */
                                                /* with value arguments by     */
                                                /* extracting the type field.  */
Note that values are used for constants as well as temporary variables. (There's a
picture of one in Figure 6.19.) The attribute generated by unop→ICON is a pointer to a
value structure for an rvalue. The type chain identifies the value as an int constant.
[IS_CONSTANT(etype) in symtab.h, which examines the storage class of the rightmost
link in the chain, evaluates true.] Similarly, the numeric value of the constant is
stored in the last link in the type chain (in the const_val union of the specifier
substructure). The value structure's name field holds an ASCII string representing the
number. The unop→NAME action that we'll look at momentarily creates an identical
attribute if the identifier references an element of an enumerated type.
The advantage of this approach, using values for both constants and variables, is
that all expression elements, regardless of what they are, synthesize value pointers as
attributes.
Figure 6.19. A value that Represents an int Constant
[Figure: a value structure whose name field is "10" (the numeric value stored as a
string) and whose lvalue field is 0. Its type and etype fields both point at a single
link with class=SPECIFIER, noun=INT, and const_val.v_int=10 (the numeric value stored
as a number).]
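The arrangement in the figure can be sketched with pared-down structures. Everything in the block below is a hypothetical simplification: type_link stands in for the compiler's link structure, mini_const_value for value, and make_int_constant() for whatever the unop→ICON action actually does; the real structures carry many more fields.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, pared-down type-chain link. */
typedef struct type_link
{
    int               is_specifier;   /* 1 = specifier, 0 = declarator     */
    int               is_constant;    /* storage class marks a constant    */
    int               const_val;      /* numeric value, stored as a number */
    struct type_link *next;
} type_link;

/* Hypothetical, pared-down value structure. */
typedef struct mini_const_value
{
    char       name[32];              /* numeric value, stored as a string */
    type_link *type;                  /* start of type chain               */
    type_link *etype;                 /* end of type chain                 */
    unsigned   lvalue : 1;
    unsigned   is_tmp : 1;
} mini_const_value;

/* Build the rvalue that represents an int constant; NULL if out of memory. */
mini_const_value *make_int_constant(int n)
{
    mini_const_value *v = calloc(1, sizeof *v);
    type_link        *l = calloc(1, sizeof *l);

    if (!v || !l)
    {
        free(v);
        free(l);
        return NULL;
    }

    l->is_specifier = 1;
    l->is_constant  = 1;
    l->const_val    = n;

    v->type = v->etype = l;           /* one-link chain: start == end      */
    v->lvalue = 0;                    /* constants are always rvalues      */
    v->is_tmp = 0;                    /* no stack space is involved        */
    snprintf(v->name, sizeof v->name, "%d", n);
    return v;
}
```

The same number is deliberately stored twice: once as text, ready to be printed as an operand, and once as a number in the type chain, where later compile-time arithmetic can get at it.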
Listing 6.68 contains various subroutines that manipulate value structures. The
low-level allocation and deletion routines (discard_value() on line 46 of Listing
6.68 and new_value() on line 19 of the same listing) work just like the
link-management routines discussed earlier. The values are allocated from malloc(), but
discarded by placing them on the linked free list pointed to by Value_free, declared
on line 14. Several structures are allocated at once; the number of structures is
controlled by VCHUNK, defined on line 15. The type chain associated with the value is
discarded at the same time as the value itself.
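The allocate-in-chunks, recycle-through-a-free-list strategy just described can be shown in isolation. This is a generic sketch, not the book's routine: the node type, CHUNK, new_node(), and discard_node() are hypothetical names, and the sketch uses a dedicated next field where the compiler reuses the structure's type pointer as the free-list link.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK 8                     /* structures obtained per malloc() call   */

typedef struct node
{
    struct node *next;              /* free-list link while the node is unused */
    int          payload;
} node;

static node *Free_list = NULL;

/* Take a zeroed node from the free list, refilling the list if it's empty. */
node *new_node(void)
{
    node *p;
    int   i;

    if (!Free_list)
    {
        if (!(Free_list = malloc(sizeof(node) * CHUNK)))
            return NULL;

        /* Thread the chunk into a list. The loop body runs CHUNK-1 times;
         * the final node terminates the list.                            */
        for (p = Free_list, i = CHUNK; --i > 0; ++p)
            p->next = p + 1;
        p->next = NULL;
    }

    p         = Free_list;
    Free_list = p->next;
    memset(p, 0, sizeof *p);
    return p;
}

/* Return a node to the free list; nothing is handed back to malloc(). */
void discard_node(node *p)
{
    if (p)
    {
        p->next   = Free_list;
        Free_list = p;
    }
}
```

A discarded node is the first one handed out by the next allocation, so a compiler that creates and releases values at a steady rate rarely calls malloc() at all.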
Listing 6.68. value.c Low-Level value Maintenance
  2  #include <tools/debug.h>
  5  #include <tools/c-code.h>
  7  #include "symtab.h"
  8  #include "value.h"
  9  #include "proto.h"
 10  #include "label.h"
 11
 12  /* VALUE.C: Routines to manipulate lvalues and rvalues. */
 13
 14  PRIVATE value *Value_free = NULL;
 15  #define VCHUNK 8            /* Allocate this many structures at a time. */
 16
 17  /*----------------------------------------------------------------------*/
 18
 19  value *new_value()
 20  {
 21      value *p;
 22      int   i;
 23
 24      if( !Value_free )
 25      {
 26          if( !(Value_free = (value *) malloc( sizeof(value) * VCHUNK )) )
 27          {
 28              yyerror("INTERNAL, new_value: Out of memory\n");
 29              exit( 1 );
 30          }
 31
 32          for( p = Value_free, i = VCHUNK; --i > 0; ++p )  /* Executes VCHUNK-1 */
 33              p->type = (link *)( p+1 );                   /*       times.      */
 34
 35          p->type = NULL;
 36      }
 37
 38      p          = Value_free;
 39      Value_free = (value *) Value_free->type;
 40      memset( p, 0, sizeof(value) );
 41      return p;
 42  }
 43
 44  /*----------------------------------------------------------------------*/
 45
 46  void discard_value( p )
 47  value *p;                   /* Discard a value structure */
 48  {                           /* and any associated links. */
 49      if( p )
 50      {
 51          if( p->type )
 52              discard_link_chain( p->type );
 53
 54          p->type    = (link *) Value_free;
 55          Value_free = p;
 56      }
 57  }
 58
 59  /*----------------------------------------------------------------------*/
 60
 61  char *shift_name( val, left )
 62  value *val;
 63  int   left;
 64  {
 65      /* Shift the name one character to the left or right. The string might get
 66       * truncated on a right shift. Returns pointer to shifted string.
 67       * Shifts an & into the new slot on a right shift.
 68       */
 69
 70      if( left )
 71          memmove( val->name, val->name+1, sizeof(val->name) - 1 );
 72      else
 73      {
 74          memmove( val->name+1, val->name, sizeof(val->name) - 1 );
 75          val->name[ sizeof(val->name) - 1 ] = '\0';
 76          val->name[0] = '&';
 77      }
 78      return val->name;
 79  }
 80
 81  /*----------------------------------------------------------------------*/
 82
 83
 84  char *rvalue( val )
 85  value *val;
 86  {
 87      /* If val is an rvalue, do nothing and return val->name. If it's an lvalue,
 88       * convert it to an rvalue and return the string necessary to access the
 89       * rvalue. If it's an rvalue constant, return a string representing the
 90       * number (and also put that string into val->name).
 91       */
 92
 93      static char buf[ NAME_MAX + 2 ] = "*";
 94
 95      if( !val->lvalue )                      /* It's an rvalue already. */
 96      {
 97          if( IS_CONSTANT(val->etype) )
 98              strcpy( val->name, CONST_STR(val) );
 99          return val->name;
100      }
101      else
102      {
103          val->lvalue = 0;
104          if( *val->name == '&' )             /* *& cancels */
105          {
106              shift_name( val, LEFT );
107              return val->name;
108          }
109          else
110          {
111              strncpy( &buf[1], val->name, NAME_MAX+1 );
112              return buf;
113          }
114      }
115  }
116
117  char *rvalue_name( val )
118  value *val;
119  {
120      /* Returns a copy of the rvalue name without actually converting val to an
121       * rvalue. The copy remains valid only until the next rvalue_name call.
122       */
123
124      static char buf[ (NAME_MAX * 2) + 1 ];  /* sizeof(val->name) + 1 */
125
126      if( !val->lvalue )
127      {
128          if( IS_CONSTANT(val->etype) ) strcpy( buf, CONST_STR(val) );
129          else                          strcpy( buf, val->name );
130      }
131      else
132      {
133          if( *val->name == '&' ) strcpy ( buf, val->name + 1 );
134          else                    sprintf( buf, "*%s", val->name );
135      }
136      return buf;
137  }
The rvalue() subroutine on line 84 of Listing 6.68 is used heavily in the
expression-processing code. It is used to fetch the actual value associated with either an
rvalue or lvalue. rvalue() is passed a value structure. If this value is an rvalue that
represents a numeric constant, a string representing the constant's numeric value is
returned. If it's a normal rvalue, the name field is returned.
If the argument is a pointer to an lvalue, the value is converted to an rvalue and a
string that can be used to access the rvalue is returned. The conversion to rvalue
involves modifying two fields. First, the lvalue field is set false. Then, if this is a
logical lvalue, the name field is shifted one character to the left and the resulting string is
returned.
If the incoming value represents a physical lvalue, a string containing the name
preceded by a * is returned, but the name field isn't modified. This behavior is actually an
anomaly because a physical rvalue is not created. That is, to be consistent with the
logical lvalue case, the rvalue() subroutine should emit code to copy the object pointed to
by the lvalue into a physical temporary variable. It should then modify the value
structure to represent this new temporary rather than the original one. I haven't done this here
because I wanted the code-generation actions to be able to control the target. For
example, the following is taken from the code-generation action for a return statement. $2 is
a pointer to a value for the object being returned. If the input takes the form
return x, then $2 points at an lvalue. If the input takes the form return x+1, then
$2 points at an rvalue:
    gen( "=", IS_INT($2->type) ? "rF.w.low" : "rF.l", rvalue($2) );
The target (the physical rvalue) has to be one of two registers, not a temporary variable.
Other situations arise where the target is a temporary, so I simplified the interface to
rvalue() by requiring the caller to generate the physical rvalue if necessary. The
string that's returned from rvalue() can always be used as the source operand in the
assignment. The rvalue_name() subroutine on line 117 returns the same thing as
rvalue(), but it doesn't modify the value structure at all; it just returns a string that
you can use as a source operand.
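The difference between a converting and a non-converting fetch is easiest to see on a stripped-down structure. In this sketch, mini_value, to_rvalue(), and peek_rvalue() are hypothetical names that imitate value, rvalue(), and rvalue_name(); constants and type chains are left out, and "T1" is just a placeholder name for a physical lvalue.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct mini_value
{
    char     name[32];
    unsigned lvalue : 1;
} mini_value;

/* Like rvalue(): converts v to an rvalue and returns the operand string. */
const char *to_rvalue(mini_value *v)
{
    static char buf[34];

    if (!v->lvalue)
        return v->name;                     /* already an rvalue */

    v->lvalue = 0;
    if (v->name[0] == '&')                  /* logical lvalue: drop the & */
    {
        memmove(v->name, v->name + 1, strlen(v->name));  /* moves the '\0' too */
        return v->name;
    }

    /* Physical lvalue: the name itself is left alone (the anomaly noted
     * in the text); only the returned string carries the star.          */
    snprintf(buf, sizeof buf, "*%s", v->name);
    return buf;
}

/* Like rvalue_name(): returns the same string but leaves v untouched. */
const char *peek_rvalue(const mini_value *v)
{
    static char buf[34];

    if (!v->lvalue)             snprintf(buf, sizeof buf, "%s",  v->name);
    else if (v->name[0] == '&') snprintf(buf, sizeof buf, "%s",  v->name + 1);
    else                        snprintf(buf, sizeof buf, "*%s", v->name);
    return buf;
}
```

Both routines hand back pointers to static storage in some cases, so, as with the originals, the returned string should be used before the next call.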
The value-maintenance functions continue in Listing 6.69 with a second layer of
higher-level functions. Temporary-variable values are also handled here. Two routines,
starting on line 138, are provided for temporary-variable creation. tmp_create()
is the lower-level routine. It is passed a type and creates an rvalue for
that type using the low-level allocation routine discussed earlier [tmp_alloc()] to get
space for the variable on the stack. A copy of the input type chain is usually put into the
value. An int rvalue is created from scratch if the type argument is NULL, however.
If the second argument is true, a link for a pointer declarator is added to the left of the
value's type chain. The value's name field is created by tmp_create(). It is
initialized to a string which is output as the operand of an instruction, and which accesses the
temporary variable. The name is generated on line 175, and it looks something like this:
WP(T(0)). The type component (WP) changes with the actual type of the temporary.
It's created by get_prefix() on line 181. The 0 is the offset from the base of the
temporary-variable region to the current variable. We saw the T() macro earlier; it
translates the offset to a frame-pointer reference.
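How such an operand string is put together can be sketched without the type-chain machinery. The block below is a hypothetical simplification: the tmp_kind tags and the one-letter prefixes "B", "W", "L", and "P" are assumptions standing in for the prefix macros that the compiler takes from <tools/c-code.h>, and tmp_operand() imitates the sprintf() call in tmp_create(), including its habit of adding 1 to the stored offset.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type tags; the real compiler walks a type chain instead. */
enum tmp_kind { TMP_CHAR, TMP_INT, TMP_LONG, TMP_POINTER };

/* Assumed prefixes for byte, word, long-word, and pointer accesses. */
const char *prefix_for(enum tmp_kind k)
{
    switch (k)
    {
    case TMP_CHAR:    return "B";
    case TMP_INT:     return "W";
    case TMP_LONG:    return "L";
    case TMP_POINTER: return "P";
    }
    return "?";
}

/* Assemble the operand that accesses the temporary at the given offset
 * from the base of the temporary-variable region.                      */
char *tmp_operand(char *buf, size_t size, enum tmp_kind k, int offset)
{
    snprintf(buf, size, "%s( T(%d) )", prefix_for(k), offset + 1);
    return buf;
}
```

Once the string exists, nothing else in the code generator needs to know where the temporary lives; the name is simply printed wherever an operand is required.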
Listing 6.69. value.c High-Level value Maintenance
138  value *tmp_create( type, add_pointer )
139  link  *type;            /* Type of temporary, NULL to create an int.         */
140  int   add_pointer;      /* If true, add pointer declarator to front of type  */
141  {                       /* before creating the temporary.                    */
142
143      /* Create a temporary variable and return a pointer to it. It's an rvalue
144       * by default.
145       */
146
147      value *val;
148      link  *lp;
149
150      val         = new_value();
151      val->is_tmp = 1;
152
153      if( type )                              /* Copy existing type. */
154          val->type = clone_type( type, &lp );
155      else                                    /* Make an integer.    */
156      {
157          lp        = new_link();
158          lp->class = SPECIFIER;
159          lp->NOUN  = INT;
160          val->type = lp;
161      }
162
163      val->etype = lp;
164      lp->SCLASS = AUTO;              /* It's an auto now, regardless   */
165                                      /* of the previous storage class. */
166      if( add_pointer )
167      {
168          lp           = new_link();
169          lp->DCL_TYPE = POINTER;
170          lp->next     = val->type;
171          val->type    = lp;
172      }
173
174      val->offset = tmp_alloc( get_size( val->type ) );
175      sprintf( val->name, "%s( T(%d) )", get_prefix(val->type), (val->offset + 1) );
176      return( val );
177  }
178
179  /*----------------------------------------------------------------------*/
180
181  char *get_prefix( type )
182  link *type;
183  {
184      /* Return the first character of the LP(), BP(), WP(), etc., directive
185       * that accesses a variable of the given type. Note that an array or
186       * structure type is assumed to be a pointer to the first cell at run time.
187       */
188
189      int c;
190
191      if( type )
192      {
193          if( type->class == DECLARATOR )
194          {
195              switch( type->DCL_TYPE )
196              {
197              case ARRAY:    return( get_prefix( type->next ) );
198              case FUNCTION: return PTR_PREFIX;
199              case POINTER:
200                  c = *get_prefix( type->next );
201
202                  if     ( c == *BYTE_PREFIX  ) return BYTEPTR_PREFIX;
203                  else if( c == *WORD_PREFIX  ) return WORDPTR_PREFIX;
204                  else if( c == *LWORD_PREFIX ) return LWORDPTR_PREFIX;
205
206                  return PTRPTR_PREFIX;
207              }
208          }
209          else
210          {
211              switch( type->NOUN )
212              {
213              case INT:       return (type->LONG) ? LWORD_PREFIX : WORD_PREFIX;
214              case CHAR:
215              case STRUCTURE: return BYTEPTR_PREFIX;
216              }
217          }
218      }
219      yyerror("INTERNAL, get_prefix: Can't process type %s.\n", type_str(type) );
220      exit( 1 );
221  }
222
223  /*----------------------------------------------------------------------*/
224
225  value *tmp_gen( tmp_type, src )
226  link  *tmp_type;        /* type of temporary taken from here           */
227  value *src;             /* source variable that is copied into temporary */
228  {
229      /* Create a temporary variable of the indicated type and then generate
230       * code to copy src into it. Return a pointer to a "value" for the temporary
231       * variable. Truncation is done silently; you may want to add a lint-style
232       * warning message in this situation, however. Src is converted to an
233       * rvalue if necessary, and is released after being copied. If tmp_type
234       * is NULL, an int is created.
235       */
236
237      value *val;             /* temporary variable */
238      char  *reg;
239
240      if( !the_same_type( tmp_type, src->type, 1) )
241      {
242          /* convert_type() copies src to a register and does any necessary type
243           * conversion. It returns a string that can be used to access the
244           * register. Once the src has been copied, it can be released, and
245           * a new temporary (this time of the new type) is created and
246           * initialized from the register.
247           */
248
249          reg = convert_type( tmp_type, src );
250          release_value( src );
251          val = tmp_create( IS_CHAR(tmp_type) ? NULL : tmp_type, 0 );
252          gen( "=", val->name, reg );
253      }
254      else
255      {
256          val = tmp_create( tmp_type, 0 );
257          gen( "=", val->name, rvalue(src) );
258          release_value( src );
259      }
260      return val;
261  }
262
263  /*----------------------------------------------------------------------*/
264
265  char *convert_type( targ_type, src )
266  link  *targ_type;       /* type of target object  */
267  value *src;             /* source to be converted */
268  {
269      int         dsize;      /* dst size                        */
270      static char reg[16];    /* place to assemble register name */
271
272      /* This routine should only be called if the target type (in targ_type)
273       * and the source type (in src->type) differ. It generates code to copy
274       * the source into a register and do any type conversion. It returns a
275       * string that references the register. This string should be used
276       * immediately as the source of an assignment instruction, like this:
277       *
278       *      gen( "=", dst_name, convert_type( dst_type, src ) );
279       */
280
281      sprintf( reg, "r0.%s", get_suffix(src->type) );     /* copy into */
282      gen( "=", reg, rvalue(src) );                       /* register. */
283
284      if( (dsize = get_size(targ_type)) > get_size(src->type) )
285      {
286          if( src->etype->UNSIGNED )                      /* zero fill */
287          {
288              if( dsize == 2 ) gen( "=", "r0.b.b1",   "0" );
289              if( dsize == 4 ) gen( "=", "r0.w.high", "0" );
290          }
291          else                                            /* sign extend */
292          {
293              if( dsize == 2 ) gen( "ext_low",  "r0" );
294              if( dsize == 4 ) gen( "ext_word", "r0" );
295          }
296      }
297
298      sprintf( reg, "r0.%s", get_suffix(targ_type) );
299      return reg;
300  }
301
302  /*----------------------------------------------------------------------*/
303
304  PUBLIC int get_size( type )
305  link *type;
306  {
307      /* Return the number of bytes required to hold the thing referenced by
308       * get_prefix().
309       */
310
311      if( type )
312      {
313          if( type->class == DECLARATOR )
314              return (type->DCL_TYPE == ARRAY) ? get_size(type->next) : PSIZE;
315          else
316          {
317              switch( type->NOUN )
318              {
319              case INT:       return (type->LONG) ? LSIZE : ISIZE;
320              case CHAR:
321              case STRUCTURE: return CSIZE;
322              }
323          }
324      }
325      yyerror("INTERNAL, get_size: Can't size type: %s.\n", type_str(type) );
326      exit( 1 );
327  }
328
329  /*----------------------------------------------------------------------*/
330
331  char *get_suffix( type )
332  link *type;
333  {
334      /* Returns the string for an operand that gets a number of the indicated
335       * type out of a register. (It returns the xx in: r0.xx). "pp" is returned
336       * for pointers, arrays, and structures; get_suffix() is used only for
337       * temporary-variable manipulation, not declarations. If an array or
338       * structure declarator is a component of a temporary-variable's type chain,
339       * then that declarator actually represents a pointer at run time. The
340       * returned string is, at most, five characters long.
341       */
342
343      if( type )
344      {
345          if( type->class == DECLARATOR )
346              return "pp";
347          else
348              switch( type->NOUN )
349              {
350              case INT:       return (type->LONG) ? "l" : "w.low";
351              case CHAR:
352              case STRUCTURE: return "b.b0";
353              }
354      }
355      yyerror("INTERNAL, get_suffix: Can't process type %s.\n", type_str(type) );
356      exit( 1 );
357  }
358
359  /*----------------------------------------------------------------------*/
360
361  void release_value( val )
362  value *val;             /* Discard a value, first freeing any space   */
363  {                       /* used for an associated temporary variable. */
364      if( val )
365      {
366          if( val->is_tmp )
367              tmp_free( val->offset );
368          discard_value( val );
369      }
370  }
tmp_gen(), on line 225 of Listing 6.69, both creates the temporary and emits the
code necessary to initialize it. (tmp_create() doesn't initialize the temporary.) The
subroutine is passed three arguments. The first is a pointer to a type chain representing
the type of the temporary. An int is created if this argument is NULL. The second
... do the copy is generated. If the source is an lvalue, it is converted to an rvalue first.
If the types of the source variable and the temporary variable don't match, code is
generated to do any necessary type conversions. Type conversions are done in convert_type()
on line 265 of Listing 6.69 by copying the original variable into a register; if the new
variable is larger than the source, code to do sign extension or zero fill is emitted. The
subroutine returns a string that holds the name of the target register. For example, the
input code:
    int  i;
    long l;
    foo() { i = i + l; }
generates the following output:
    r0.w.low  = W(&_i);
    ext_word(r0);
    L( T(1) ) = r0.l;
    L( T(1) ) += L(&_l);
    r0.l      = L( T(1) );
    W(&_i)    = r0.w.low;
The first two lines, emitted by convert_type(), create a long temporary variable and
initialize it from an int. The code copies i into a word register and then sign extends
the register to form an lword. The string "r0.l" is passed back to tmp_gen(), which
uses it to emit the assignment on the third line. The last two lines demonstrate how a
truncation is done. convert_type() emits code to copy T(1) into a register. It then
passes the string "r0.w.low" back up to the calling routine, which uses it to generate
the assignment.
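The widening and truncating cases just described can be sketched in a few lines of C. This is only an illustration, not the book's convert_type(): the function name emit_convert(), its parameters, and the restriction to 2-byte words and 4-byte lwords are all assumptions made for the example. It loads the source into r0 using the source's width, emits ext_word() when the destination is wider, and returns the register field that matches the destination's width.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for convert_type(). Writes the C-code that loads
 * the object named by src (src_size bytes, 2 or 4) into register r0, and
 * returns the name of the register field that the caller should use for
 * an object of dst_size bytes (2 or 4).
 */
const char *emit_convert(char *out, size_t outsize,
                         const char *src, int src_size, int dst_size)
{
    int n;

    if (src_size == 2)
        n = snprintf(out, outsize, "r0.w.low = W(%s);\n", src);
    else
        n = snprintf(out, outsize, "r0.l = L(%s);\n", src);

    /* Widening: sign extend the word in r0 to form an lword. */
    if (dst_size > src_size && n >= 0 && (size_t)n < outsize)
        snprintf(out + n, outsize - (size_t)n, "ext_word(r0);\n");

    /* Truncation needs no extra code: just read the low word. */
    return (dst_size == 4) ? "r0.l" : "r0.w.low";
}
```

Calling emit_convert(buf, sizeof buf, "&_i", 2, 4) fills buf with the first two lines of the output shown earlier and hands back "r0.l"; the reverse call, with sizes 4 and 2, hands back "r0.w.low", which is how the truncation falls out of the choice of register field.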
The test on line 284 of Listing 6.69 checks to see if sign extension is necessary (by
comparing the number of bytes used for both variables; get_size() starts on line 304
of Listing 6.69). Code to do the type conversion is on lines 286 to 295 of Listing
6.69. get_suffix(), used on line 298 to access the register, starts on line 331 of Listing 6.69.
The release_value() subroutine on line 361 of Listing 6.69 is a somewhat
higher-level version of discard_value(). It's used for temporary variables and both
discards the value structure and frees any stack space used for the temporary. Note that
the original source value is discarded on line 250 of Listing 6.69 as soon as the variable
is copied into a register, so the memory used by the original variable can be recycled for
the new temporary.
6.8.4 Unary Operators
Armed with the foregoing, you can proceed to tackle expression processing, starting
with the unary operators and working your way up the grammar to complete expressions.
Because you're starting at the bottom, it is helpful to look at the overall structure of the
expression productions in Appendix C before continuing. Start with the expr nonterminal
and go to the end of the grammar.
The unary operators are handled by the right-hand sides of the unary nonterminal,
which you get to like this:
    compound_stmt → LC local_defs stmt_list RC
    stmt_list     → stmt
    stmt          → expr SEMI
    expr          → binary | unary
(I've both simplified and left out a few intermediate productions for clarity. See
Appendix C for the actual productions.) The simplest of the unary productions are in Listing
6.70. The top production just handles parenthesized subexpressions, and the second
right-hand side recognizes, but ignores, floating-point constants.
The right-hand side on line 597 handles integer constants, creating a value structure
like the one examined earlier in Figure 6.19. make_icon(), in Listing 6.71, does the
actual work, creating a value for the integer constant, which is then passed back up as
an attribute. This subroutine is used in several other places, so it does more than is
required by the current production. In particular, the incoming parameter can represent
the number either as an integer or as a string. Numeric input is selected by passing
NULL as the yytext argument; the production on line 597 passes the lexeme instead.
Listing 6.70. c.y -- Unary Operators (Part 1): Constants and Identifiers
594 unary
595     : LP expr RP    { $$ = $2; }
596     | FCON          { yyerror("Floating-point not supported\n"); }
597     | ICON          { $$ = make_icon( yytext, 0 ); }
598     | NAME          { $$ = do_name ( yytext, $1 ); }
Identifiers are handled by the right-hand side on line 598 of Listing 6.70, above. The
attribute attached to the NAME at $1 is a pointer to the symbol-table entry for the
identifier, or NULL if the identifier isn't in the table. The actual work is done by
do_name(), in Listing 6.72, below.
Listing 6.71. value.c -- Make Integer and Integer-Constant Rvalues
371 value *make_icon( yytext, numeric_val )
372 char *yytext;
373 int  numeric_val;
374 {
375     /* Make an integer-constant rvalue. If yytext is NULL, then numeric_val
376      * holds the numeric value, otherwise the second argument is not used and
377      * the string in yytext represents the constant.
378      */
379
380     value *vp;
381     link  *lp;
382     char  *p;
383
384     vp         = make_int();
385     lp         = vp->type;
386     lp->SCLASS = CONSTANT;
387
388     if( !yytext )
389         lp->V_INT = numeric_val;
390
391     else if( *yytext == '\'' )
392     {
393         ++yytext;                       /* Skip the quote. */
394         lp->V_INT = esc( &yytext );
395     }
396     else                                /* Initialize the const_val field      */
397     {                                   /* based on the input type. stoul()    */
398                                         /* converts a string to unsigned long. */
399         for( p = yytext; *p ; ++p )
400         {
401             if     ( *p=='u' || *p=='U' )   { lp->UNSIGNED = 1; }
402             else if( *p=='l' || *p=='L' )   { lp->LONG     = 1; }
403         }
404         if( lp->LONG )
405             lp->V_ULONG = stoul( &yytext );
406         else
407             lp->V_UINT  = (unsigned int) stoul( &yytext );
408     }
409     return vp;
410 }
411
412 /*----------------------------------------------------------------------*/
413
414 value *make_int()
415 {                                       /* Make an unnamed integer rvalue. */
416     link  *lp;
417     value *vp;
418
419     lp        = new_link();
420     lp->class = SPECIFIER;
421     lp->NOUN  = INT;
422     vp        = new_value();            /* It's an rvalue by default. */
423     vp->type  = vp->etype = lp;
424     return vp;
425 }
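The suffix scan in make_icon() is easy to try in isolation. The following is a minimal sketch under stated assumptions, not the book's code: the int_const structure is a made-up stand-in for the UNSIGNED, LONG, and constant-value fields of a link, and the standard library's strtoul() is used in place of the book's own stoul().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fields that make_icon() sets in a link. */
typedef struct
{
    int           is_unsigned;   /* a 'u' or 'U' suffix was seen         */
    int           is_long;       /* an 'l' or 'L' suffix was seen        */
    unsigned long value;         /* the constant's numeric value         */
} int_const;

int_const parse_int_const(const char *text)
{
    int_const   c = { 0, 0, 0 };
    const char *p;

    /* Scanning the whole lexeme is safe: 'u' and 'l' are not legal
     * decimal, octal, or hex digits, so they can only be suffixes.
     */
    for (p = text; *p; ++p)
    {
        if      (*p == 'u' || *p == 'U') c.is_unsigned = 1;
        else if (*p == 'l' || *p == 'L') c.is_long     = 1;
    }

    /* Base 0 lets strtoul() handle 0x and leading-0 octal prefixes;
     * conversion stops at the first suffix character.
     */
    c.value = strtoul(text, NULL, 0);

    if (!c.is_long)
        c.value = (unsigned int) c.value;   /* keep only int-sized bits */

    return c;
}
```

A lexeme such as "0x1FuL" comes back with both flags set and a value of 31. As in the listing, the flags are collected before the conversion so that the right field can be chosen for the result.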
Listing 6.72. op.c -- Identifier Processing
  9 #include "value.h"
 11 #include "label.h"
 12
 13 /* OP.C    This file contains support subroutines for the arithmetic */
 14 /*         operations in c.y.                                        */
 15
 16 symbol *Undecl = NULL;  /* When an undeclared symbol is used in an expression,
 17                          * it is added to the symbol table to suppress subse-
 18                          * quent error messages. This is a pointer to the head
 19                          * of a linked list of such undeclared symbols. It is
 20                          * purged by purge_undecl() at the end of compound
 21                          * statements, at which time error messages are also
 22                          * generated.
 23                          */
 24
 25 /*----------------------------------------------------------------------*/
 26
 27 value *do_name( yytext, sym )
 28 char   *yytext;             /* Lexeme                                   */
 29 symbol *sym;                /* Symbol-table entry for id, NULL if none. */
 30 {
 31     link  *chain_end, *lp;
 32     value *synth;
 33     char  buf[ VALNAME_MAX ];
 34
 35     /* This routine usually returns a logical lvalue for the referenced symbol.
 36      * The symbol's type chain is copied into the value and value->name is a
 37      * string, which when used as an operand, evaluates to the address of the
 38      * object. Exceptions are aggregate types (arrays and structures), which
 39      * generate pointer temporary variables, initialize the temporaries to point
 40      * at the first element of the aggregate, and return an rvalue that
 41      * references that pointer. The type chain is still just copied from the
 42      * source symbol, so a structure or pointer must be interpreted as a pointer
 43      * to the first element/field by the other code-generation subroutines.
 44      * It's also important that the indirection processing (* [] . ->) set up
 45      * the same sort of object when the referenced object is an aggregate.
 46      *
 47      * Note that !sym must be checked twice, below. The problem is the same one
 48      * we had earlier. A statement like foo(){int x;x=1;} fails
 49      * because the second x is read when the semicolon is shifted, before
 50      * putting x into the symbol table. You can't use the trick used earlier
 51      * because def_list is used all over the place, so the symbol table entries
 52      * can't be made until local_defs->def_list is processed. The simplest
 53      * solution is just to call findsym() if NULL was returned from the scanner.
 54      *
 55      * The second if(!sym) is needed when the symbol really isn't there.
 56      */
 57
 58     if( !sym )
 59         sym = (symbol *) findsym( Symbol_tab, yytext );
 60
 61     if( !sym )
 62         sym = make_implicit_declaration( yytext, &Undecl );
 63
 64     if( IS_CONSTANT(sym->etype) )           /* it's an enum member */
 65     {
 66         if( IS_INT(sym->type) )
 67             synth = make_icon( NULL, sym->type->V_INT );
 68         else
 69         {
 70             yyerror( "Unexpected noninteger constant\n" );
 71             synth = make_icon( NULL, 0 );
 72         }
 73     }
 74     else
 75     {
 76         gen_comment( "%s", sym->name );     /* Next instruction will have symbol */
 77                                             /* name as a comment.                */
 78
 79         if( !(lp = clone_type( sym->type, &chain_end )) )
 80         {
 81             yyerror( "INTERNAL do_name: Bad type chain\n" );
 82             synth = make_icon( NULL, 0 );
 83         }
 84         else if( IS_AGGREGATE(sym->type) )
 85         {
 86             /* Manufacture pointer to first element */
 87
 88             sprintf( buf, "&%s(%s)",
 89                      IS_ARRAY(sym->type) ? get_prefix(lp) : BYTE_PREFIX,
 90                      sym->rname );
 91
 92             synth = tmp_create( sym->type, 0 );
 93             gen( "=", synth->name, buf );
 94         }
 95         else
 96         {
 97             synth         = new_value();
 98             synth->lvalue = 1;
 99             synth->type   = lp;
100             synth->etype  = chain_end;
101             synth->sym    = sym;
102
103             if( sym->implicit || IS_FUNCT(lp) )
104                 strcpy( synth->name, sym->rname );
105             else
106                 sprintf( synth->name,
107                          (chain_end->SCLASS == FIXED) ? "&%s(&%s)" : "&%s(%s)",
108                          get_prefix(lp), sym->rname );
109         }
110     }
111
112     return synth;
113 }
114
115 /*----------------------------------------------------------------------*/
116
117 symbol *make_implicit_declaration( name, undeclp )
118 char   *name;
119 symbol **undeclp;
120 {
121     /* Create a symbol for the name, put it into the symbol table, and add it
122      * to the linked list pointed to by *undeclp. The symbol is an int. The
123      * level field is used for the line number.
124      */
125
126     symbol *sym;
127     link   *lp;
128
129     extern int  yylineno;       /* created by LeX */
130     extern char *yytext;
131
132     lp            = new_link();
133     lp->class     = SPECIFIER;
134     lp->NOUN      = INT;
135
136     sym           = new_symbol( name, 0 );
137     sym->implicit = 1;
138     sym->type     = sym->etype = lp;
139     sym->level    = yylineno;   /* Use the line number for the declaration */
140                                 /* level so that you can get at it later   */
141                                 /* if an error message is printed.         */
142
143     sprintf( sym->rname, "_%1.*s", sizeof(sym->rname)-2, yytext );
144     addsym ( Symbol_tab, sym );
145     sym->next = *undeclp;       /* Link into undeclared list. */
146     *undeclp  = sym;
147     return sym;
148 }
149
150 /*----------------------------------------------------------------------*/
151
152 PUBLIC void purge_undecl()
153 {
154     /* Go through the undeclared list. If something is a function, leave it in
155      * the symbol table as an implicit declaration, otherwise print an error
156      * message and remove it from the table. This routine is called at the
157      * end of every subroutine.
158      */
159
160     symbol *sym, *cur;
161
162     for( sym = Undecl; sym; )
163     {
164         cur       = sym;            /* remove current symbol from list */
165         sym       = sym->next;
166         cur->next = NULL;
167
168         if( cur->implicit )
169         {
170             yyerror( "%s (used on line %d) undeclared\n", cur->name, cur->level );
171             delsym( Symbol_tab, cur );
172             discard_symbol( cur );
173         }
174     }
175     Undecl = NULL;
176 }
The incoming sym argument is NULL when the scanner can't find the symbol in the
table. The table lookup on line 58 of Listing 6.72 takes care of the time-delay problem
discussed earlier when types were discussed. Code like this:
    int x;
    x = 1;
won't work because the symbol-table entry for x isn't created until after the second x is
read. Remember, the action is performed after the SEMI is shifted, and the lookahead
character is read as part of the shift operation.
Undeclared identifiers pose a particular problem. The difficulty is function calls: an
implicit declaration for the function must be created the first time the function is used.
Undeclared variables are hard errors, though. Unfortunately, there's no way for the
unary→NAME production to know how the name is being used. It doesn't know
whether the name is part of a function call or not.
If the symbol really isn't in the table, do_name() creates an implicit declaration of
type int for the undeclared identifier (on line 62 of Listing 6.72). The
make_implicit_declaration() subroutine is found on lines 117 to 148 of Listing
6.72. The new symbol is put into the symbol table, and the cross links form a linked list
of undeclared symbols. The head-of-list pointer is Undecl, declared on line 16. The
implicit symbol is marked as such by setting the implicit bit true on line 137 of
Listing 6.72. The implicit symbol is modified when a function call is processed. A
function declarator is added to the front of the type chain, and the implicit bit is turned
off. I'll discuss this process in detail in a moment.
After the entire subroutine is processed and the subroutine suffix is generated,
purge_undecl() (on line 152 of Listing 6.72) is called to find and delete the
undeclared symbols. This subroutine traverses the list of implicitly generated symbols and
prints "undeclared symbol" error messages for any of them that aren't functions. It also
removes the undeclared variables from the symbol table and frees the memory.
One advantage of this approach is that the "undeclared symbol" error message is
printed only once, regardless of the number of times that the symbol is used. The
disadvantage is that the error message isn't printed until the entire subroutine has been
processed.
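The deferred-error strategy can be modeled with a much smaller data structure than the real symbol table. The sketch below is an illustration only: the usym type, lookup_or_declare(), and purge_undeclared() are hypothetical names, and a single linked list stands in for both the symbol table and the Undecl list. It shows why a name that is used several times produces one message: the second use finds the entry that the first use created.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct usym                 /* hypothetical, simplified symbol     */
{
    char         name[32];
    int          implicit;          /* 1 until a function call claims it   */
    int          line;              /* line of first use                   */
    struct usym *next;
} usym;

static usym *undecl = NULL;         /* head of the undeclared-symbol list  */

usym *lookup_or_declare(const char *name, int line)
{
    usym *s;

    for (s = undecl; s; s = s->next)        /* already seen? reuse it      */
        if (strcmp(s->name, name) == 0)
            return s;

    if (!(s = calloc(1, sizeof *s)))
        return NULL;

    strncpy(s->name, name, sizeof s->name - 1);
    s->implicit = 1;
    s->line     = line;
    s->next     = undecl;                   /* link into undeclared list   */
    undecl      = s;
    return s;
}

int purge_undeclared(void)          /* returns the number of errors printed */
{
    int   errors = 0;
    usym *s = undecl, *cur;

    while (s)
    {
        cur       = s;
        s         = s->next;
        cur->next = NULL;

        if (cur->implicit)          /* never used as a function: an error  */
        {
            fprintf(stderr, "%s (used on line %d) undeclared\n",
                    cur->name, cur->line);
            ++errors;
            free(cur);
        }
        /* Symbols claimed by a function call are left for their owner.    */
    }
    undecl = NULL;
    return errors;
}
```

In this model, clearing the implicit field plays the role of the function-call action that adds a function declarator and turns the bit off; anything still marked implicit when the list is purged is reported once and freed.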
Returning to the normal situation, by the time you get to line 64 of Listing 6.72, sym
points at a symbol for the current identifier. If that symbol represents a constant, then it
was created by a previous enumerated-type declaration. The action in this case (on lines
66 to 72) creates an rvalue for the constant, just as if the number, rather than the
symbolic name, had been found.
If the current symbol is an identifier, a value is created on lines 74 to 110 of Listing
6.72. The type chain is cloned. Then, if the symbol represents an aggregate type such as
an array, a temporary variable is generated and initialized to point at the first element of
the aggregate. Note that this temporary is an rvalue, not an lvalue. (An array name
without a star or brackets is illegal to the left of an equal sign.) It's also important to
note that arrays and pointers can be treated identically by subsequent actions because a
physical pointer to the first array element is generated: pointers and arrays are
physically the same thing at run time. A POINTER declarator and an ARRAY declarator are
treated identically by all of the code that follows. I'll explore this issue further, along
with a few examples, in a moment.
If the current symbol is not an aggregate, a logical lvalue is created in the else
clause on lines 95 to 109 of Listing 6.72. The value's name field is initialized to a
string that, when used as an operand in a C-code instruction, evaluates either to the
address of the current object or to the address of the first element if it's an aggregate.
If the symbol is a function or if it was created with an implicit declaration, the name
is used as is. (We're on line 103 of Listing 6.72.) I'm assuming that an implicit symbol
will eventually reference a function. The compiler generates bad code if the symbol
doesn't, but so what? It's a hard error to use an undeclared variable.
All other symbols are handled by the sprintf() call on line 106 of Listing 6.72.
The strategy is to represent all possible symbols in the same way, so that subsequent
actions won't have to worry about how the name is specified. In other words, a
subsequent action shouldn't have to know whether the referenced variable is on the stack, at a
fixed memory address, or whatever. It should be able to use the name without thinking.
You can get this consistency with one of the stranger-looking C-code storage
classes. The one case that gives us no flexibility is a frame-pointer-relative variable,
which must be accessed like this: WP(fp+8). This string starts with a type indicator
(WP), the variable reference itself is an address (fp+8), and an indirect addressing mode
must be used to access the variable. To be consistent, variables at fixed addresses such
as global variables are represented in a similar way, like this: WP(&_p). The fp+8 and
_p are just taken from the symbol's rname field. The type (WP, here) is figured by
get_prefix() by looking at the variable's actual type. Since we are forming lvalues,
the names must evaluate to addresses, so we need one more ampersand in both cases.
&WP(fp+8) and &WP(&_p) are actually put into the name array.
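The naming scheme comes down to a choice between two format strings. Here is a minimal sketch of that choice; make_lvalue_name() and its is_fixed flag are names invented for the example (the listing tests chain_end->SCLASS against FIXED instead), and the prefix is passed in as a string rather than computed from a type chain.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the operand string for a logical lvalue.
 *   prefix    type indicator such as "WP" (assumed to come from get_prefix())
 *   rname     the symbol's run-time name, such as "fp+8" or "_p"
 *   is_fixed  nonzero if the variable lives at a fixed address
 */
void make_lvalue_name(char *out, size_t outsize,
                      const char *prefix, const char *rname, int is_fixed)
{
    /* A fixed-address name needs its own & so that both forms look like
     * prefix(address); the leading & then turns either form into an lvalue.
     */
    snprintf(out, outsize, is_fixed ? "&%s(&%s)" : "&%s(%s)", prefix, rname);
}
```

With these arguments, a stack variable at fp+8 yields "&WP(fp+8)" and a global _p yields "&WP(&_p)", so later actions can strip or keep the leading ampersand without caring where the variable lives.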
The next unop right-hand side, which handles string constants, is in Listing 6.73.
The string_const productions in Listing 6.74 collect adjacent string constants and
concatenate them into the Str_buf array, declared in the occs declarations section at the
top of Listing 6.74. The attribute attached to the string_const is a pointer to the
assembled string, which is then turned into an rvalue by the make_scon() call on line 602 of
Listing 6.73. The subroutine itself is in Listing 6.75. String constants are of type
"pointer to char." The string itself is not stored internally by the compiler beyond the
time necessary to parse it. When a string constant is recognized, a declaration like the
following one is output to the data segment:
    private char S156[] = "contents of string";
The numeric component of the label (156) is remembered in the V_INT field of the last
link in the type chain, and the entire label is also loaded into the value's name array.
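The label bookkeeping can be sketched independently of the value and link structures. In the following illustration, emit_string_decl() is a hypothetical function that combines what make_scon() and the yydata() call do between them; the "S" prefix is taken from the sample declaration above, and the counter is an assumption modeled on the static label in make_scon().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static unsigned next_label = 0;     /* numeric component of the last label */

/* Hypothetical helper: generate a fresh label into name and the matching
 * data-segment declaration into decl. Returns the label's numeric part,
 * which the compiler would also remember in the type chain.
 */
unsigned emit_string_decl(char *name, size_t namesize,
                          char *decl, size_t declsize,
                          const char *contents)
{
    snprintf(name, namesize, "S%u", ++next_label);
    snprintf(decl, declsize, "private char %s[] = \"%s\";\n", name, contents);
    return next_label;
}
```

Each call consumes one label, so two string constants in the same source file can never collide, and the label string is all that later actions need in order to reference the constant.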
The next operator, sizeof, is handled by the productions in Listing 6.76. Three
cases have to be handled separately:
Listing 6.73. c.y -- Unary Operators (Part 2): String Constants
599 /* unop: */
600     | string_const  %prec COMMA
601     {
602         $$ = make_scon( $1 );
603         yydata( "private\tchar\t%s[]=\"%s\";\n", $$->name, $1 );
604     }
Listing 6.74. c.y -- Array in which String Constants are Assembled
139 %{
140 #define STR_MAX 512             /* Maximum size of a string constant.  */
141 char    Str_buf[ STR_MAX ];     /* Place to assemble string constants. */
142 %}

605 string_const
606     : STRING
607     {
608         $$       = Str_buf;
609         *Str_buf = '\0';
610
611         yytext[ strlen(yytext) - 1 ] = '\0';    /* Remove trailing " */
612
613         if( concat( STR_MAX, Str_buf, Str_buf, yytext+1, NULL ) < 0 )
614             yyerror( "String truncated at %d characters\n", STR_MAX );
615     }
616     | string_const STRING
617     {
618         yytext[ strlen(yytext) - 1 ] = '\0';    /* Remove trailing " */
619
620         if( concat( STR_MAX, Str_buf, Str_buf, yytext+1, NULL ) < 0 )
621             yyerror( "String truncated at %d characters\n", STR_MAX );
622     }
623
sizeof("string") evaluates to the number of characters in the string. (Listing 6.76
uses strlen($3)+1, so the terminating null is counted.)
sizeof(expression) evaluates to the size of the temporary variable that holds the
result of the evaluation:
    int  x;
    long y;
    sizeof( x );        /* = 2 (size of a word).   */
    sizeof( y );        /* = 4 (size of an lword). */
    sizeof( x+y );      /* = 4 (size of an lword). */
sizeof(type) evaluates to the number of bytes required by a variable of the
indicated type.
The returned attribute for all three cases is an integer-constant rvalue, the
const_val field of which holds the required size. The attribute attached to the expr
($3 in the second right-hand side in Listing 6.76) is a value for the temporary that holds
the result of the run-time expression evaluation. Its type field determines the size.
Similarly, the attribute attached to the abstract_decl ($3 in the third right-hand side in
Listing 6.76) is a nameless symbol that holds a type chain for the indicated type.
Listing 6.75. value.c -- Make a String-Constant Rvalue
426 value *make_scon( str )
427 char *str;
428 {
429     link  *p;
430     value *synth;
431     static unsigned label = 0;
432
433     synth       = new_value();
434     synth->type = new_link();
435
436     p           = synth->type;
437     p->DCL_TYPE = POINTER;
438     p->next     = new_link();
439
440     p           = p->next;
441     p->class    = SPECIFIER;
442     p->NOUN     = CHAR;
443     p->SCLASS   = CONSTANT;
444     p->V_INT    = ++label;
445
446     synth->etype = p;
447     sprintf( synth->name, "%s%d", L_STRING, label );
448     return synth;
449 }
Listing 6.76. c.y -- Unary Operators (Part 3): sizeof
624 /* unop: */
625     | SIZEOF LP string_const RP     %prec SIZEOF
626             { $$ = make_icon( NULL, strlen($3) + 1 ); }
627
628     | SIZEOF LP expr RP             %prec SIZEOF
629             { $$ = make_icon( NULL, get_sizeof($3->type) );
630               release_value( $3 );
631             }
632     | SIZEOF LP abstract_decl RP    %prec SIZEOF
633             {
634               $$ = make_icon( NULL,
635                               get_sizeof($3->type) );
636               discard_symbol( $3 );
637             }
So far, none of the operators have generated code. With Listing 6.77, we move
on to an operator that actually does something. The right-hand side in Listing 6.77
processes the cast operator. The test on line 641 checks to see if the cast is legal. If it is,
and the source and destination types are both pointers, it just replaces the type chain in
the operand with the one from the cast operator on lines 648 to 651. Note that this
behavior, though common in C compilers, can cause problems because the contents of
the pointer are not changed, only the type. A run-time alignment error occurs the first
time you try to use the pointer if the alignment isn't correct.
The third possibility, on line 655 of Listing 6.77, takes care of all other cases. The
tmp_gen() call creates a temporary of the required type, copies the contents of the
operand into the temporary (making any necessary type conversions and translating
lvalues to rvalues as necessary), and finally, it deletes the value associated with the
operand.
Listing 6.77. c.y -- Unary Operators (Part 4): Casts and Address-of
The next issue is the unary arithmetic operators, handled by the right-hand sides in
Listing 6.78. The first one takes care of unary minus, the second handles the logical
NOT and one's-complement operators (!, ~). The attribute attached to the UNOP is the
lexeme (passed in an int because it's only one character). Again, the work is done by a
subroutine: do_unop(), at the top of Listing 6.79.
The arithmetic operators are easy. The action (on lines 184 to 188 of Listing 6.79)
just modifies the operand, passing the operator through to gen(). If the name field of
the operand holds the string "W(fp+8)", then the following code is generated for unary
minus and one's complement:
    W( T(0) ) =- W(fp+8);       /* -name */
    W( T(0) ) =~ W(fp+8);       /* ~name */
In the first case, the - is supplied in the production; in the second, the attribute is used.
The logical NOT operator is a little harder to do because it must evaluate either to 1
or 0, depending on the truth or falsity of the operand. Assuming that x is at W(fp+6),
the expression !x generates the following code:
            EQ( W(fp+6), 0 )        /* if the operand is false */
                goto T0;
    F0:     W( T(0) ) = 0;          /* temp = 0 */
            goto E0;
    T0:     W( T(0) ) = 1;          /* temp = 1 */
    E0:
The synthesized attribute is the temporary variable that holds the 1 or 0. All but the first
two instructions are created by gen_false_true(), on line 206 of Listing 6.79. The T
label takes care of the true situation, the F label deals with false, and the E (for end) lets
the false case jump over the true one after making the assignment. The numeric component
of all three labels is the same, and is generated by the tf_label() call on line 192 of
Listing 6.79. (tf_label() is declared just a few lines down, on line 199.) This label is
then passed to gen_false_true() as an argument.
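The shape of the generated code is easier to see when the whole sequence is produced by one routine. The sketch below is not the book's code: emit_not() and next_tf_label() are hypothetical names, the output goes to a caller-supplied buffer rather than through gen(), and the labels are printed without the zero padding suggested by the listing's comments.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical counterpart of tf_label(): every call returns a new
 * numeric component that is shared by the T, F, and E labels.
 */
int next_tf_label(void)
{
    static int label;
    return ++label;
}

/* Hypothetical helper: emit the C-code for !operand into out, leaving a
 * 1 or 0 in temp. The first two lines are the test and jump; the rest
 * is the part that gen_false_true() is described as generating.
 */
int emit_not(char *out, size_t outsize,
             const char *operand, const char *temp, int label)
{
    return snprintf(out, outsize,
                    "EQ(%s,0)\n"            /* operand is false?          */
                    "goto T%d;\n"           /*   yes: result is true      */
                    "F%d: %s = 0;\n"        /* fall through: result = 0   */
                    "goto E%d;\n"           /* skip the true assignment   */
                    "T%d: %s = 1;\n"        /* result = 1                 */
                    "E%d:\n",
                    operand, label, label, temp, label, label, temp, label);
}
```

Passing "W(fp+6)", "W(T(0))", and a label of 0 reproduces the six lines shown above. Keeping the label number as a parameter, rather than calling the counter inside emit_not(), mirrors the way the listing generates the number before the jump is emitted and then hands it to gen_false_true().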
All but the first two lines of the output are emitted by gen_false_true(), starting
on line 206 of Listing 6.79. This subroutine is used by the actions for all of the logical
operators. It is passed the numeric component of the target labels and a pointer to the
operand's value structure. This second argument is used only to do a simple
optimization: if the operand is already a temporary variable of type int, then it is
recycled to hold the result of the operation; otherwise a new temporary variable is created.
(The test is made on line 220 and the temporary is created on the next two lines.)
Remember, both the test and the jump to the true or false label are generated before
gen_false_true() is called, and once the jump has been created, the associated
value is no longer needed. The operand's value could be discarded by the calling
routine if you didn't want to bother with the optimization.
Note that the low-level tmp_create() function, which creates but doesn't
initialize the temporary, is used here. Also note that the value is discarded before the
tmp_create() call so that the same region of memory can be recycled for the result of
the operation.
Listing 6.78. c.y -- Unary Operators (Part 5): Unary Arithmetic Operators
658 /* unop: */
659     | MINUS unary   { $$ = do_unop( '-', $2 ); }    %prec UNOP
660     | UNOP  unary   { $$ = do_unop( $1,  $2 ); }
The pre- and postincrement and decrement operators are processed by the
productions in Listing 6.80 and the incop() subroutine in Listing 6.81. The following input
file:
    int x;
    foo(){ ++x; }
generates the following code for the ++x:
    W(&_x) += 1;                /* Increment x itself.                   */
    W( T(1) ) = W(&_x);         /* Copy incremented value to temporary.  */
The synthesized attribute is the temporary variable. A postincrement like x++ works in a
similar manner, but the two instructions are transposed:
    W( T(1) ) = W(&_x);
    W(&_x) += 1;
Again, the synthesized attribute is the temporary; the value of the temporary is
determined by whether a pre- or postincrement is used. The ++ and -- operators require an
lvalue operand, but they both generate rvalue temporaries. Pre- and postdecrements use
-= rather than += but are otherwise the same as the increments.
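The only difference between the four operators is the sign and the order of two instructions, which the following sketch makes explicit. emit_incop() is a hypothetical routine written for this illustration; it takes the operand and temporary names as ready-made strings, and the amount parameter stands in for the inc_amt computed in Listing 6.81 (1 for integers, the size of the pointed-at object for pointers).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: emit the two C-code instructions for ++ or --.
 *   is_pre   nonzero for ++x / --x, zero for x++ / x--
 *   op       '+' for increment, '-' for decrement
 *   name     operand as it appears in C-code, such as "W(&_x)"
 *   temp     temporary that becomes the synthesized attribute
 *   amount   how much to add or subtract
 */
int emit_incop(char *out, size_t outsize, int is_pre, char op,
               const char *name, const char *temp, int amount)
{
    if (is_pre)                 /* modify first, then copy the new value  */
        return snprintf(out, outsize, "%s %c= %d;\n%s = %s;\n",
                        name, op, amount, temp, name);

    /* Post: copy the old value first, then modify the variable.          */
    return snprintf(out, outsize, "%s = %s;\n%s %c= %d;\n",
                    temp, name, name, op, amount);
}
```

Because the copy into the temporary is just one of the two lines, the temporary ends up holding the new value in the pre case and the old value in the post case without any further logic.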
Listing 6.82 contains various right-hand sides for the address-of operator (&) and the
dereference operators (*, [], ., and ->). The structure operators are grouped with the
rest of them because they all do essentially the same thing: add an offset to a pointer and
load the object at the indicated address into a temporary variable. All of these operators
generate rvalues. They all require either an lvalue operand, or an rvalue operand of type
"array," "structure," or "pointer." Remember that aggregate-type references generate a
temporary variable (an rvalue) holding the address of the object.
The first issue is getting an object's address. The address-of operator is processed by
unop→AND unary on line 665 of Listing 6.82. The addr_of() workhorse function is
in Listing 6.83, below. Only two kinds of incoming values are legal here: an lvalue or
an rvalue for an aggregate type such as a structure. In either case, the incoming value
Listing 6.79. op.c -- Unary Minus and Ones-Complement Operator Processing
177 value *do_unop( op, val )
178 int   op;
179 value *val;
180 {
181     char op_buf[] = "=?" ;      /* An array, so that op_buf[1] is writable. */
182     int  i;
183
184     if( op != '!' )             /* ~ or unary - */
185     {
186         op_buf[1] = op;
187         gen( op_buf, val->name, val->name );
188     }
189     else                        /* ! */
190     {
191         gen( "EQ",       rvalue(val), "0" );            /* EQ(x,0)        */
192         gen( "goto%s%d", L_TRUE, i = tf_label() );      /* goto T000;     */
193         val = gen_false_true( i, val );                 /* fall thru to F */
194     }
195     return val;
196 }
197 /*----------------------------------------------------------------------*/
198
199 int tf_label()
200 {                               /* Return the numeric component of a label for use as */
201     static int label;           /* a true/false target for one of the statements      */
202     return ++label;             /* processed by gen_false_true().                     */
203 }
204 /*----------------------------------------------------------------------*/
205
206 value *gen_false_true( label_num, val )
207 int   label_num;
208 value *val;
209 {
210     /* Generate code to assign true or false to the temporary represented by
211      * val. Also, create a temporary to hold the result, using val if it's
212      * already the correct type. Return a pointer to the target. The FALSE
213      * code is at the top. If val is NULL, create a temporary. Label_num must
214      * have been returned from a previous tf_label() call.
215      */
216
217     if( !val )
218         val = tmp_create( NULL, 0 );
219
220     else if( !val->is_tmp || !IS_INT(val->type) )
221     {
222         release_value( val );
223         val = tmp_create( NULL, 0 );
224     }
225     gen( ":%s%d",    L_FALSE, label_num );      /* F000:      */
226     gen( "=",        val->name, "0" );          /* t0 = 0     */
227     gen( "goto%s%d", L_END,   label_num );      /* goto E000; */
228     gen( ":%s%d",    L_TRUE,  label_num );      /* T000:      */
229     gen( "=",        val->name, "1" );          /* t0 = 1;    */
230     gen( ":%s%d",    L_END,   label_num );      /* E000:      */
231     return val;
232 }
Listing 6.80. c.y -- Unary Operators (Part 6): Pre and Postincrement
661 /* unop: */
662     | unary INCOP   { $$ = incop( 0, $2, $1 ); }
663     | INCOP unary   { $$ = incop( 1, $1, $2 ); }
Listing 6.81. op.c: Increment and Decrement Operators

value   *incop( is_preincrement, op, val )      /* ++ or --                         */
int     is_preincrement;        /* Is preincrement or predecrement.     */
int     op;                     /* '-' for decrement, '+' for increment */
value   *val;                   /* lvalue to modify.                    */
{
    char    buf[ VALNAME_MAX ];
    char    *name;
    value   *new;
    char    *out_op = (op == '+') ? "+=%s%d" : "-=%s%d" ;
    int     inc_amt;

    /* You must use rvalue_name() in the following code because rvalue()
     * modifies the val itself--the name field might change. Here, you must use
     * the same name both to create the temporary and do the increment so you
     * can't let the name be modified.
     */

    if( !val->lvalue )
        yyerror( "%c%c: lvalue required\n", op, op );
    else
    {
        inc_amt = (IS_POINTER(val->type)) ? get_sizeof(val->type->next) : 1;
        name    = rvalue_name( val );

        if( is_preincrement )
        {
            gen( out_op, name, inc_amt );
            val = tmp_gen( val->type, val );
        }
        else                                    /* Postincrement. */
        {
            val = tmp_gen( val->type, val );
            gen( out_op, name, inc_amt );
        }
    }
    return val;
}
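The only difference between the two branches of incop() is the order of the copy and the modification. The following is a minimal sketch of that ordering, not the compiler's own code: emit_incop() is a hypothetical helper that formats the two generated lines into a buffer instead of calling gen() and tmp_gen(), and it always names the temporary T(1), which is an assumption made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical miniature of incop(): writes the two C-code lines for a
 * ++ or -- applied to `name` into buf. `objsize` plays the role of inc_amt
 * in the listing: the size of the pointed-to object for a pointer, or 1 for
 * a plain integer. The temporary is always called T(1) here; the real
 * compiler allocates it with tmp_gen().
 */
int emit_incop(char *buf, size_t size, int is_preincrement, int op,
               const char *name, int objsize)
{
    char step[64], copy[64];

    assert(op == '+' || op == '-');
    snprintf(step, sizeof step, "%s %c= %d;\n", name, op, objsize);
    snprintf(copy, sizeof copy, "T(1) = %s;\n", name);

    return is_preincrement
        ? snprintf(buf, size, "%s%s", step, copy)   /* modify, then copy */
        : snprintf(buf, size, "%s%s", copy, step);  /* copy, then modify */
}
```

Under these assumptions, a postincrement of WP(&_p) with an object size of 2 produces the copy into the temporary first and the += 2 second; a preincrement produces the same two lines in the opposite order.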
Listing 6.82. c.y: Unary Operators (Part 7): Indirection

/* unop: */
        | AND unary             { $$ = addr_of( $2 );               }   %prec UNOP
        | STAR unary            { $$ = indirect( NULL, $2 );        }   %prec UNOP
        | unary LB expr RB      { $$ = indirect( $3, $1 );          }   %prec UNOP
        | unary STRUCTOP NAME   { $$ = do_struct( $1, $2, yytext ); }   %prec STRUCTOP
already holds the desired address: The lvalue is an address by definition. An rvalue like the one created by unop→NAME for an aggregate object is a physical temporary variable that also holds the required address. So, all that the address-of action needs to do is modify the value's type chain by adding an explicit pointer declarator at the far left and change its value into an rvalue by clearing the lvalue bit. Keep this action in mind when you read about the * operator, below. I'll start with the pointer-dereference (*) and array-dereference ([ ]) operators, handled by indirect() (in Listing 6.84, below).
* and [ ].
Listing 6.83. op.c: Address-of Operator Processing

value   *addr_of( val )
value   *val;
{
    /* Process the & operator. Since the incoming value already holds the
     * desired address, all you need do is change the type (add an explicit
     * pointer at the far left) and change it into an rvalue. The first argument
     * is returned.
     */

    link *p;

    if( val->lvalue )
    {
        p           = new_link();
        p->DCL_TYPE = POINTER;
        p->next     = val->type;
        val->type   = p;
        val->lvalue = 0;
    }
    else if( !IS_AGGREGATE(val->type) )
        yyerror( "(&) lvalue required\n" );

    return val;
}
Listing 6.84. op.c: Array Access and Pointer Dereferencing

value   *indirect( offset, ptr )
value   *offset;        /* Offset factor (NULL if it's a pointer). */
value   *ptr;           /* Pointer that holds base address.        */
{
    /* Do the indirection. If no offset is given, we're doing a *, otherwise
     * we're doing ptr[offset]. Note that, strictly speaking, you should create
     * the following dumb code:
     *
     *          t0 = rvalue( ptr );     (if ptr isn't a temporary)
     *          t0 += offset            (if doing [offset])
     *          t1 = *t0;               (creates an rvalue)
     *          lvalue attribute = &t1
     *
     * but the first instruction is necessary only if ptr is not already a
     * temporary, the second only if we're processing square brackets.
     *
     * The last two operations cancel if the input is a pointer to a pointer. In
     * this case all you need to do is remove one * declarator from the type
     * chain. Otherwise, you have to create a temporary to take care of the type
     * conversion.
     */

    link    *tmp;
    value   *synth;
    int     objsize;    /* Size of object pointed to (in bytes) */

    if( !IS_PTR_TYPE(ptr->type) )
        yyerror( "Operand for * or [N] Must be pointer type\n" );

    rvalue( ptr );      /* Convert to rvalue internally. The "name" field   */
                        /* is modified by removing leading &'s from logical */
                        /* lvalues.                                         */
    if( offset )        /* Process an array offset.                         */
    {
        if( !IS_INT(offset->type) && !IS_CHAR(offset->type) )
            yyerror( "Array index must be an integral type\n" );

        objsize = get_sizeof( ptr->type->next );    /* Size of dereferenced  */
                                                    /* object.               */

        if( !ptr->is_tmp )                          /* Generate a physical   */
            ptr = tmp_gen( ptr->type, ptr );        /* lvalue.               */

        if( IS_CONSTANT( offset->type ) )           /* Offset is a constant. */
        {
            gen( "+=%s%d", ptr->name, offset->type->V_INT * objsize );
        }
        else                                        /* Offset is not a con-  */
        {                                           /* stant. Do the arith-  */
                                                    /* metic at run time.    */
            if( objsize != 1 )                      /* Multiply offset by    */
            {                                       /* size of one object.   */
                if( !offset->is_tmp )
                    offset = tmp_gen( offset->type, offset );

                gen( "*=%s%d", offset->name, objsize );
            }

            gen( "+=", ptr->name, offset->name );   /* Add offset to base.   */
        }

        release_value( offset );
    }

    /* The temporary just generated (or the input variable if no temporary
     * was generated) now holds the address of the desired cell. This command
     * must generate an lvalue unless the object pointed to is an aggregate,
     * whereupon it's an rvalue. In any event, you can just label the current
     * cell as an lvalue or rvalue as appropriate and continue. The type must
     * be advanced one notch to compensate for the indirection, however.
     */

    synth     = ptr;
    tmp       = ptr->type;              /* Advance type one notch. */
    ptr->type = ptr->type->next;
    discard_link( tmp );

    if( !IS_AGGREGATE(ptr->type->next) )
        synth->lvalue = 1;              /* Convert to lvalue. */

    return synth;
}
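The offset arithmetic in indirect() has three outcomes: a constant index is scaled at compile time, a variable index is scaled at run time, and a variable index into one-byte objects needs no scaling at all. The sketch below mimics just that decision. It is an illustration under stated assumptions, not the compiler's code: emit_offset() and its parameters are hypothetical, and it formats text into a buffer rather than calling gen().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical miniature of the offset logic in indirect(): emit code that
 * adds index*objsize to the pointer temporary `ptr`. A constant index is
 * folded at compile time; otherwise the index temporary is multiplied at
 * run time (unless objsize is 1) and then added to the base.
 */
int emit_offset(char *buf, size_t size, const char *ptr,
                const char *index_name, int is_constant, int index_value,
                int objsize)
{
    assert(objsize > 0);

    if (is_constant)                            /* fold index * objsize now */
        return snprintf(buf, size, "%s += %d;\n", ptr, index_value * objsize);

    if (objsize != 1)                           /* scale, then add          */
        return snprintf(buf, size, "%s *= %d;\n%s += %s;\n",
                        index_name, objsize, ptr, index_name);

    return snprintf(buf, size, "%s += %s;\n", ptr, index_name);
}
```

With a constant index of 7 and a two-byte object, the sketch adds 14 to the base, which is the offset used in the a[7] trace of Table 6.21.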
Operand to * or [ ] must be array or pointer.
Attribute synthesized by * and [ ] operators.
Rules for forming lvalues and rvalues when processing * and [ ].
The best way to understand the code is to analyze what actually happens as the operator is processed. First of all, as was the case with the address-of operator, the operand must represent an address. It is one of two things: (1) an array, in which case the operand is an rvalue that holds the base address of the array, or (2) a pointer, in which case the operand is an lvalue for something of pointer type; the expression in the value's name field evaluates to the address of the pointer.
The next issue is the synthesized attribute, which is controlled by the type of the dereferenced object. The compiler needs to convert the incoming attribute to the synthesized attribute. If the dereferenced object is a nonaggregate, the synthesized attribute is an lvalue that holds the address of the dereferenced object. If the dereferenced object is an aggregate, then the attribute is an rvalue that holds the address of the first element of the object. In other words, the generated attribute must be the same thing that would be created by unop→NAME if it were given an identifier of the same type as the referenced object. Note that both the incoming and outgoing attributes are addresses.
To summarize (if the following isn't clear, hold on for a second; several examples follow):
(1) The dereferenced object is not an aggregate. The synthesized attribute is an lvalue that references that object, and:
    a. if the incoming attribute is an rvalue, then it already contains the necessary address. That is, the incoming rvalue holds the address of the dereferenced object, and the outgoing lvalue must also hold the address of the dereferenced object. Consequently, no code needs to be generated in this case. The compiler does have to modify the value internally, however: first, by setting the lvalue bit true, and second, by removing the first link in the type chain. Remember, the generated lvalue is for the dereferenced object. If the incoming type is "pointer to int," the outgoing object is an lvalue that references the int.
       If the compiler's doing an array access rather than a pointer access, code must also be emitted to add an offset to the base address that's stored in the lvalue.
    b. if the incoming attribute is a logical lvalue, then all the compiler needs to do is remove the leading ampersand and change the type as discussed earlier. The pointer variable is now treated as if it were a physical lvalue.
       If the compiler's processing an array access, it must create a physical lvalue in order to add an offset to it. It can't modify the declared variable to compute an array access.
    c. if the incoming attribute is a physical lvalue for a pointer, then you need to generate code to get rid of one level of indirection, and change the type as was discussed earlier. You can safely add an offset to the physical lvalue to do array access.
(2) The dereferenced object is an aggregate. The synthesized attribute is an rvalue that points at the first element of the referenced array or structure, and:
    a. if the incoming attribute is an rvalue, use it for the synthesized attribute, changing the type by removing the first declarator link and adding an offset as necessary.
    b. if the incoming attribute is a logical lvalue, create a physical rvalue, with the type adjusted, as above.
    c. if the incoming attribute is a physical lvalue, convert it to an rvalue by resetting the lvalue bit and adjust the type, as above.
Note that many of the foregoing operations generate no code at all; they just modify the way that something is represented internally. Code is generated only when an offset needs to be computed or an incoming lvalue references an aggregate object. Also, notice the similarities between processing a & and processing a *. The former adds a pointer declarator to the type chain and turns the incoming attribute into an rvalue; the latter does the exact opposite, removing a pointer declarator from the front of the type chain and turning the incoming attribute into an lvalue.
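The symmetry between & and * is easiest to see on a stripped-down type chain. The sketch below is only a model: the link and value structures are simplified stand-ins for the book's (which carry far more information), the toy_addr_of() and toy_indirect() names are invented for illustration, and no code generation or error reporting is attempted.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the compiler's link and value structures. */
typedef enum { T_INT, T_POINTER, T_ARRAY } kind;

typedef struct link  { kind k; struct link *next; } link;
typedef struct value { link *type; int lvalue;    } value;

/* & : prepend a pointer declarator and turn the attribute into an rvalue. */
void toy_addr_of(value *v)
{
    link *p = malloc(sizeof *p);

    assert(p);
    p->k      = T_POINTER;
    p->next   = v->type;
    v->type   = p;
    v->lvalue = 0;
}

/* * : strip the leading pointer declarator. The result is an lvalue unless
 * the dereferenced object is an aggregate (only arrays are modeled here).
 */
void toy_indirect(value *v)
{
    link *old = v->type;

    assert(old && old->k == T_POINTER && old->next);
    v->type = old->next;
    free(old);
    v->lvalue = (v->type->k != T_ARRAY);
}
```

Applying toy_addr_of() and then toy_indirect() to an lvalue of type int leaves the value exactly as it started, which is the "exact opposite" relationship described above.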
I'll demonstrate how indirection works with some examples. Parses of the following code are shown in Tables 6.17 to 6.23.

int *p, a[10];
foo()
{
    *p++;  *++p;  ++*p;  (*p)++;
    a[7];  ++p[3];  p++[3];
}
Table 6.17. Generate Code for *p++;

Parse and Value Stacks
        Comments

stmt_list
        Shift STAR.
stmt_list STAR
□
        Shift NAME. The attribute for NAME is a pointer to the symbol for p.
stmt_list STAR NAME
               p
        Reduce by unary→NAME. The action creates an lvalue that references the variable.
stmt_list STAR unary
               &WP(&_p)_L
        Shift INCOP.
stmt_list STAR unary       INCOP
               &WP(&_p)_L  '+'
        Reduce by unary→unary INCOP. Emit code to increment p:
            CODE: WP(T(1)) = WP(&_p);
            CODE: WP(&_p) += 2;
        The synthesized attribute is an rvalue (of type pointer to int) for the temporary variable.
stmt_list STAR unary
               WP(T(1))_R
        Reduce by unary→STAR unary. $2 (under the unary) already contains the correct address and is of the correct type. Convert it to a physical lvalue that references *p (not p itself).
stmt_list unary
          WP(T(1))_L
        Subsequent operations cannot access p; they can access *p with *WP(T(1)).
Neither the entire parse nor the entire parse stack is shown in the tables, but there's enough shown that you can see what's going on. The following symbols are on the stack to the left of the ones that are shown.
ext_def_list opt_specifiers funct_decl {70} def_list {71} LC {65} local_defs
I've shown the value stacks under the parse stack. Attributes for most of the nonterminals are value structures; the name field is shown in the table. Subscripts indicate whether it's an lvalue (name_L) or rvalue (name_R). Logical lvalues start with an ampersand. A box (□) is used when there are no attributes. You should look at all these tables now, even though they're scattered over several pages.
I've used the ++ operators in these expressions because ++ requires an lvalue, but it generates an rvalue. The ++ is used to demonstrate how the indirection is handled with both kinds of incoming attributes. Other, more complex expressions [like *(p+1)] are
Table 6.18. Generate Code for *++p;

Parse and Value Stacks
        Comments

stmt_list
        Shift STAR.
stmt_list STAR
□
        Shift INCOP. The attribute for INCOP is the first character of the lexeme.
stmt_list STAR INCOP
               '+'
        Shift NAME. The attribute for NAME is a pointer to the symbol for p.
stmt_list STAR INCOP NAME
               '+'   p
        Reduce by unary→NAME. Create a logical lvalue that references p.
stmt_list STAR INCOP unary
               '+'   &WP(&_p)_L
        Reduce by unary→INCOP unary. Emit code to increment p:
            CODE: WP(&_p) += 2;
            CODE: WP(T(1)) = WP(&_p);
        The synthesized attribute is an rvalue of type pointer to int that holds a copy of p. From this point on, the compiler has forgotten that p ever existed for the purposes of processing the current expression.
stmt_list STAR unary
               WP(T(1))_R
        Reduce by unary→STAR unary. The current reduction turns the attribute from the previous reduction into a physical lvalue that references the dereferenced object (*p).
stmt_list unary
          WP(T(1))_L
Table 6.19. Generate Code for (*p)++;

Parse and Value Stacks
        Comments

stmt_list
        Shift LP.
stmt_list LP
□
        Shift STAR.
stmt_list LP STAR
□
        Shift NAME. The attribute is a pointer to a symbol structure for p.
stmt_list LP STAR NAME
                  p
        Reduce by unary→NAME. The synthesized attribute is a logical lvalue that references p.
stmt_list LP STAR unary
                  &WP(&_p)_L
        Reduce by unary→STAR unary. Convert the incoming logical lvalue for p to a physical lvalue that references *p.
stmt_list LP unary
             WP(&_p)_L
        unary is now converted to expr by a series of reductions, not shown here. The initial attribute is passed through to the expr, however.
stmt_list LP expr
             WP(&_p)_L
        Shift RP.
stmt_list LP expr       RP
             WP(&_p)_L
        Reduce by unary→LP expr RP. $$ = $2;
stmt_list unary
          WP(&_p)_L
        Shift INCOP.
stmt_list unary      INCOP
          WP(&_p)_L  '+'
        Reduce by unary→unary INCOP. Emit code to increment *p:
            CODE: W(T(1)) = *WP(&_p);
            CODE: *WP(&_p) += 1;
        The synthesized attribute is an rvalue for the temporary, which holds the value that *p had before the increment (the copy is made first).
stmt_list unary
          W(T(1))_R
handled in much the same way. (p+1, like ++p, generates an rvalue of type pointer.)
I suggest doing an exercise at this juncture. Parse the following code by hand, starting at the stmt_list, as in the previous examples, showing the generated code and the relevant parts of the parse and value stack at each step:
Table 6.20. Generate Code for ++*p;

Parse and Value Stacks
        Comments

stmt_list
        Shift INCOP.
stmt_list INCOP
          '+'
        Shift STAR.
stmt_list INCOP STAR
          '+'   □
        Shift NAME. The attribute for NAME is a pointer to the symbol structure for p.
stmt_list INCOP STAR NAME
          '+'   □    p
        Reduce by unary→NAME. Create a logical lvalue for p.
stmt_list INCOP STAR unary
          '+'   □    &WP(&_p)_L
        Reduce by unary→STAR unary. This reduction converts the logical lvalue that references p into a physical lvalue that references *p.
stmt_list INCOP unary
          '+'   WP(&_p)_L
        Reduce by unary→INCOP unary. Emit code to increment *p:
            CODE: *WP(&_p) += 1;
            CODE: W(T(1)) = *WP(&_p);
        The generated attribute is an rvalue that holds the contents of *p after the increment; this is a preincrement.
stmt_list unary
          W(T(1))_R
Table 6.21. Generate Code for a[7];

Parse and Value Stacks
        Comments

stmt_list
        Shift NAME. The attribute for NAME is a pointer to the symbol structure for a.
stmt_list NAME
          a
        Reduce by unary→NAME. Since this is an array rather than a simple pointer variable, an rvalue is generated of type pointer to first array element (pointer to int, here):
            CODE: W(T(1)) = &W(_a);
stmt_list unary
          W(T(1))_R
        Shift LB.
stmt_list unary      LB
          W(T(1))_R
        Shift ICON.
stmt_list unary      LB ICON
          W(T(1))_R     "7"
        Reduce by unary→ICON. The synthesized attribute is a value representing the constant 7.
stmt_list unary      LB unary
          W(T(1))_R     7_R
        unary is reduced to expr by a series of reductions, not shown here. The attribute is passed through to the expr, however.
stmt_list unary      LB expr
          W(T(1))_R     7_R
        Shift RB.
stmt_list unary      LB expr RB
          W(T(1))_R     7_R
        Reduce by unary→unary LB expr RB. The incoming attribute is an rvalue of type pointer to int. The following code is generated to compute the offset:
            CODE: W(T(1)) += 14;
        The synthesized attribute is an lvalue holding the address of the eighth array element.
stmt_list unary
          W(T(1))_L
        The final attribute is an lvalue that references a[7].
int *x[5];
**x;
*x[3];
x[1][2];
The other pointer-related operators are the structure-access operators: . and ->. These are handled by the do_struct() subroutine in Listing 6.85. The basic logic is the same as that used for arrays and pointers. This time the offset is determined by the position of the field within the structure, however.
The final unary operator handles function calls. It's implemented by the unop right-hand sides in Listing 6.86 and the associated subroutines, call() and ret_reg(), in Listing 6.87. One of the more interesting design issues in C is that the argument list as a
Structure access: -> and ., do_struct().
Function calls.
Table 6.22. Generate Code for p++[3];

Parse and Value Stacks
        Comments

stmt_list
        Shift NAME.
stmt_list NAME
        Reduce by unary→NAME.
stmt_list unary
          &WP(&_p)_L
        Shift INCOP.
stmt_list unary       INCOP
          &WP(&_p)_L  '+'
        Reduce by unary→unary INCOP. Emit code to increment the pointer. The synthesized attribute is an rvalue of type pointer to int.
            CODE: WP(T(1)) = WP(&_p);
            CODE: WP(&_p) += 2;
stmt_list unary
          WP(T(1))_R
        Shift LB.
stmt_list unary       LB
          WP(T(1))_R
        Shift ICON.
stmt_list unary       LB ICON
          WP(T(1))_R     '3'
        Reduce by unary→ICON. The action creates an rvalue for an int constant. The numeric value is stored in $$->etype->V_INT.
stmt_list unary       LB unary
          WP(T(1))_R     3_R
        Reduce unary to expr by a series of reductions, not shown. The original attribute is passed through to the expr.
stmt_list unary       LB expr
          WP(T(1))_R     3_R
        Shift RB.
stmt_list unary       LB expr RB
          WP(T(1))_R     3_R
        Reduce by unary→unary LB expr RB. Code is generated to compute the address of the fourth cell:
            CODE: WP(T(1)) += 6;
        Since the incoming attribute is already a temporary variable of the correct type, there's no need to generate a second temporary here; the first one is recycled. Note that it is translated to an lvalue that references the fourth cell, however.
stmt_list unary
          WP(T(1))_L
whole can be seen as a function-call operator which can be applied to any function pointer;²² a function name evaluates to a pointer to a function, much like an array name evaluates to a pointer to the first array element. The code-generation action does four things: push the arguments in reverse order, call the function, copy the return value into an rvalue, and discard the arguments by adding a constant to the stack pointer. For example, a call like the following:

doctor( lawyer, merchant, chief );

is translated to:

push( W(&_chief) );             /* push the arguments             */
push( W(&_merchant) );
push( W(&_lawyer) );
call( _doctor );                /* call the subroutine            */
W( T(1) ) = rF.w.low;           /* copy return value to rvalue    */
sp += 3;                        /* discard arguments.             */
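The four steps can be sketched as a single routine that produces the text shown above. This is an illustration under stated assumptions rather than the compiler's implementation: emit_call() is a hypothetical helper, the operands arrive as already-formatted strings, the return value is assumed to be an int (hence W(T(1)) and rF.w.low), and the temporary is always T(1).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch of the calling sequence: push the arguments in
 * reverse order, call the function, copy the return value into a
 * temporary, then pop the arguments by adjusting sp. Returns the number of
 * characters written, or -1 if buf is too small.
 */
int emit_call(char *buf, size_t size, const char *func,
              const char *const args[], int nargs)
{
    size_t used = 0;
    int    i, n;

    assert(buf && func && nargs >= 0);

    for (i = nargs - 1; i >= 0; --i) {          /* rightmost argument first */
        n = snprintf(buf + used, size - used, "push( %s );\n", args[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, size - used,
                 "call( %s );\nW( T(1) ) = rF.w.low;\n", func);
    if (n < 0 || (size_t)n >= size - used)
        return -1;
    used += (size_t)n;

    if (nargs > 0) {                            /* discard the arguments    */
        n = snprintf(buf + used, size - used, "sp += %d;\n", nargs);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Called with the function name _doctor and the operands W(&_lawyer), W(&_merchant), and W(&_chief), in that order, the sketch emits the push for chief first and finishes with the stack adjustment for three arguments.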
The function arguments are processed by the args productions on lines 674 to 681 of Listing 6.86. A non_comma_expr recognizes all C expressions except those that use comma operators. You must do something like the following to get a comma operator into a function call:
Function-argument processing, non_comma_expr.
22. This is actually a C++ism, but it's a handy way to look at it.
Table 6.23. Generate Code for ++p[3];

Parse and Value Stacks
        Comments

stmt_list
        Shift INCOP.
stmt_list INCOP
          '+'
        Shift NAME. The attribute is a symbol representing p.
stmt_list INCOP NAME
          '+'   p
        Reduce by unary→NAME. Convert symbol to an lvalue.
stmt_list INCOP unary
          '+'   &WP(&_p)_L
        Shift LB.
stmt_list INCOP unary       LB
          '+'   &WP(&_p)_L
        Shift ICON.
stmt_list INCOP unary       LB ICON
          '+'   &WP(&_p)_L     '3'
        Reduce by unary→ICON. The synthesized attribute is a value structure representing an int constant. It's an rvalue. The actual value (3) is stored internally in that structure.
stmt_list INCOP unary       LB unary
          '+'   &WP(&_p)_L     3_R
        Reduce unary to expr by a series of reductions, not shown. The original attribute is passed through to the expr.
stmt_list INCOP unary       LB expr
          '+'   &WP(&_p)_L     3_R
        Shift RB.
stmt_list INCOP unary       LB expr RB
          '+'   &WP(&_p)_L     3_R
        Reduce by unary→unary LB expr RB. Code is generated to compute the address of the fourth cell, using the offset in the value at $3 and the base address in the value at $1:
            CODE: WP(T(1)) = WP(&_p);
            CODE: WP(T(1)) += 6;
        You have to generate a physical lvalue because p itself may not be incremented. The synthesized attribute is an lvalue that holds the address of the fourth element.
stmt_list INCOP unary
          '+'   WP(T(1))_L
        Reduce by unary→INCOP unary. Code is generated to increment the array element, the address of which is in the lvalue generated in the previous reduction.
            CODE: *WP(T(1)) += 1;
            CODE: W(T(2)) = *WP(T(1));
        The synthesized attribute is an rvalue that duplicates the contents of that cell.
stmt_list unary
          W(T(2))_R
tinker( (tailor, cowboy), sailor );

This function call has two arguments. The first is the expression (tailor, cowboy), which evaluates to cowboy. The second argument is sailor. The associated attribute is a value for that expression: an lvalue is used if the expression is a simple variable or pointer reference; an rvalue is used for most other expressions.
The args productions just traverse the list of arguments, printing the push instructions and keeping track of the number of arguments pushed. The argument count is returned back up the tree as the synthesized attribute. Note that the args productions form a right-recursive list. Right recursion is generally not a great idea in a bottom-up parser because all of the list elements pile up on the stack before any reductions occur. On the other hand, the list elements are processed in right-to-left order, which is convenient here because arguments have to be pushed from right to left. The recursion shouldn't cause problems unless the subroutine has an abnormally high number of arguments.
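The effect of the right recursion can be modeled with an ordinary recursive function. The sketch below is an analogy, not parser code: reduce_args() is a hypothetical name, arguments are represented only by their positions (0 is the leftmost), and recording a position in order[] stands in for the gen("push", ...) action. In a bottom-up parse the nested args on the right is reduced before the enclosing production, so its action runs first.

```c
#include <assert.h>

/* Models the two args productions. The tail of the list (the nested args)
 * is reduced first, so its push is recorded before the push for the head.
 * `order` receives argument positions in the order they would be pushed;
 * the return value is the argument count, like the $$ of the productions.
 */
int reduce_args(int first, int last, int order[], int *n_pushed)
{
    int count = 1;                      /* args : non_comma_expr            */

    assert(first <= last);
    if (first < last)                   /* args : non_comma_expr COMMA args */
        count = reduce_args(first + 1, last, order, n_pushed) + 1;

    order[(*n_pushed)++] = first;       /* the action: gen( "push", ... )   */
    return count;
}
```

For a three-argument list the function returns 3 and records the positions 2, 1, 0, that is, the rightmost argument is pushed first. The recursion depth equals the number of arguments, which mirrors the way the list elements pile up on the parse stack.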
The call() subroutine at the top of Listing 6.87 generates both the call instruction and the code that handles return values and stack cleanup. It also takes care of implicit subroutine declarations on lines 513 to 526. The action in unary→NAME creates a symbol of type int for an undeclared identifier, and this symbol eventually ends up here as the incoming attribute. The call() subroutine changes the type to "function returning int" by adding another link to the head of the type chain. It also clears the implicit bit to indicate that the symbol is a legal implicit declaration rather than an undeclared variable. Finally, a C-code extern statement is generated for the function.
args→... productions. Right recursion gets arguments pushed in correct order.
call(), unary→NAME.
Listing 6.85. op.c: Structure Access

value   *do_struct( val, op, field_name )
value   *val;
int     op;                     /* . or - (the last is for ->) */
char    *field_name;
{
    value   *new;
    symbol  *field;
    link    *lp;

    /* Structure names generate rvalues of type structure. The associated    */
    /* name evaluates to the structure's address, however. Pointers generate */
    /* lvalues, but are otherwise the same.                                  */

    if( IS_POINTER(val->type) )
    {
        if( op != '-' )
        {
            yyerror( "Object to left of -> must be a pointer\n" );
            return val;
        }
        lp        = val->type;          /* Remove POINTER declarator from */
        val->type = val->type->next;    /* the type chain and discard it. */
        discard_link( lp );
    }

    if( !IS_STRUCT(val->type) )
    {
        yyerror( "Not a structure.\n" );
        return val;
    }
                                /* Look up the field in the structure table: */

    if( !(field = find_field(val->type->V_STRUCT, field_name)) )
    {
        yyerror( "%s not a field\n", field_name );
        return val;
    }

    if( val->lvalue || !val->is_tmp )           /* Generate temporary for     */
        val = tmp_gen( val->type, val );        /* base address if necessary; */
                                                /* then add the offset to the */
    if( field->level > 0 )                      /* desired field.             */
        gen( "+=%s%d", val->name, field->level );

    if( !IS_AGGREGATE(field->type) )            /* If referenced object isn't */
        val->lvalue = 1;                        /* an aggregate, use lvalue.  */

                                                /* Replace value's type chain */
                                                /* with type chain for the    */
                                                /* referenced object:         */
    lp = val->type;
    if( !(val->type = clone_type(field->type, &val->etype)) )
    {
        yyerror( "INTERNAL do_struct: Bad type chain\n" );
        exit( 1 );
    }
    discard_link_chain( lp );
    access_with( val );                         /* Change the value's name    */
                                                /* field to access an object  */
    return val;                                 /* of the new type.           */
}

/*----------------------------------------------------------------------*/

PRIVATE symbol *find_field( s, field_name )
structdef   *s;
char        *field_name;
{
    /* Search for "field_name" in the linked list of fields for the input
     * structdef. Return a pointer to the associated "symbol" if the field
     * is there, otherwise return NULL.
     */

    symbol *sym;
    for( sym = s->fields; sym; sym = sym->next )
    {
        if( !strcmp(field_name, sym->name) )
            return sym;
    }
    return NULL;
}

/*----------------------------------------------------------------------*/

PRIVATE char *access_with( val )
value   *val;
{
    /* Modifies the name string in val so that it references the current type.
     * Returns a pointer to the modified string. Only the type part of the
     * name is changed. For example, if the input name is "WP(fp+4)", and the
     * type chain is for an int, the name is changed to "W(fp+4)". If val is
     * an lvalue, prefix an ampersand to the name as well.
     */

    char *p, buf[ VALNAME_MAX ];

    strcpy( buf, val->name );
    for( p = buf; *p && *p != '(' /*)*/ ; ++p )         /* find name */
        ;

    if( !*p )
        yyerror( "INTERNAL, access_with: missing parenthesis\n" );
    else
        sprintf( val->name, "%s%s%s", val->lvalue ? "&" : "",
                                      get_prefix(val->type), p );
    return val->name;
}
Listing 6.86. c.y: Unary Operators (Part 8): Function Calls

669  /* unop: */
670          | unary LP args RP              { $$ = call( $1, $3 ); }
671          | unary LP RP                   { $$ = call( $1, 0  ); }
672          ;
673
674  args    : non_comma_expr %prec COMMA    { gen( "push", rvalue( $1 ) );
675                                            release_value( $1 );
676                                            $$ = 1;
677                                          }
678          | non_comma_expr COMMA args     { gen( "push", rvalue( $1 ) );
679                                            release_value( $1 );
680                                            $$ = $3 + 1;
681                                          }
682          ;
Listing 6.87. op.c: Function-Call Processing

483  PUBLIC value *call( val, nargs )
484  value   *val;
485  int     nargs;
486  {
487      link    *lp;
488      value   *synth;         /* synthesized attribute */
489
490      /* The incoming attribute is an lvalue for a function if
491       *          funct()
492       * or
493       *          int (*p)() = funct;
494       *          (*p)();
495       *
496       * is processed. It's a pointer to a function if p() is used directly.
497       * In the case of a logical lvalue (with a leading &), the name will be a
498       * function name, and the rvalue can be generated in the normal way by
499       * removing the &. In the case of a physical lvalue the name of a variable
500       * that holds the function's address is given. No star may be added.
501       * If val is an rvalue, then it will never have a leading &.
502       */
503
504      if( val->sym && val->sym->implicit && !IS_FUNCT(val->type) )
505      {
506          /* Implicit symbols are not declared. This must be an implicit function
507           * declaration, so pretend that it's explicit. You have to modify both
508           * the value structure and the original symbol because the type in the
509           * value structure is a copy of the original. Once the modification is
510           * made, the implicit bit can be turned off.
511           */
512
513          lp                 = new_link();
514          lp->DCL_TYPE       = FUNCTION;
515          lp->next           = val->type;
516          val->type          = lp;
517
518          lp                 = new_link();
519          lp->DCL_TYPE       = FUNCTION;
520          lp->next           = val->sym->type;
521          val->sym->type     = lp;
522
523          val->sym->implicit = 0;
524          val->sym->level    = 0;
525
526          yydata( "extern\t%s();\n", val->sym->rname );
527      }
528
529      if( !IS_FUNCT(val->type) )
530      {
531          yyerror( "%s not a function\n", val->name );
532          synth = val;
533      }
534      else
535      {
536          lp    = val->type->next;            /* return-value type */
537          synth = tmp_create( lp, 0 );
538
539          gen( "call", *val->name == '&' ? &val->name[1] : val->name );
540          gen( "=",    synth->name, ret_reg(lp) );
541
542          if( nargs )
543              gen( "+=%s%d", "sp", nargs );
544
546      }
547
548      return synth;
549  }
550
551  /*----------------------------------------------------------------------*/
552
553  char    *ret_reg( p )
554  link    *p;
555  {
556      /* Return a string representing the register used for a return value of
557       * the given type.
558       */
559
560      if( IS_DECLARATOR( p ) )
561          return "rF.pp";
562      else
563          switch( p->NOUN )
564          {
565          case INT:   return (p->LONG) ? "rF.l" : "rF.w.low";
566          case CHAR:  return "rF.w.low";
567          default:    yyerror( "INTERNAL ERROR: ret_reg, bad noun\n" );
568                      return "AAAAAAAAAAAAAAAGH!";
569          }
570  }
6.8.5 Binary Operators
We can now breathe a collective sigh of relief. The declaration system and the unary operators are the hardest parts of the compiler and they're over with. We can now move on to the binary operators. Again, it's useful to look at the overall structure of the binary-operator productions before proceeding. These are summarized in Table 6.24.
Table 6.24. Summary of Binary-Operator Productions

expr            →  expr COMMA non_comma_expr
                |  non_comma_expr
non_comma_expr  →  non_comma_expr ASSIGNOP non_comma_expr
                |  non_comma_expr EQUAL non_comma_expr
                |  non_comma_expr QUEST non_comma_expr COLON non_comma_expr
                |  or_expr
or_expr         →  or_list
or_list         →  or_list OROR and_expr
                |  and_expr
and_expr        →  and_list
and_list        →  and_list ANDAND binary
                |  binary
binary          →  binary PLUS binary
                |  binary MINUS binary
                |  binary STAR binary
                |  binary DIVOP binary
                |  binary SHIFTOP binary
                |  binary AND binary
                |  binary XOR binary
                |  binary OR binary
                |  binary RELOP binary
                |  binary EQUOP binary
                |  unary
This grouping of the productions is a compromise between the strict grammatical
approach, in which the grammar alone determines precedence and associativity (all
operators that share a precedence level are isolated into single productions), and the
more flexible yacc/occs approach, in which all binary operators are combined into a
single production with many right-hand sides, with associativity and precedence
controlled by %left and %right directives. As you can see, the yacc approach has been
used for most of the operators by putting them under the aegis of a single binary
production. The %left and %right directives that control the precedence and
associativity of these operators were discussed back towards the beginning of the
current chapter. They're also at the top of Appendix C if you want to review them.
Isolating the comma operator. Problems with comma-delimited lists. Several of the
lower-precedence operators have been isolated from the rest of the binary operators for
grammatical and code-generation reasons, however. The first issue is the non_comma_expr,
which recognizes all expressions except those that use the comma operator at the
outermost parenthesis-nesting level. The following statement is not recognized by a
non_comma_expr:

    a = foo(), bar();

but this one is okay:

    (a = foo(), bar());

because of the parentheses. (Neither expression is a model of good programming style,
but the language accepts them.) An expr accepts the comma operator at the outer level;
both of the foregoing expressions are recognized. This isolation is required for
grammatical reasons. Comma-delimited lists of expressions are used in several places (for
subroutine-argument lists, initializer lists, and so on). If an unparenthesized comma
operator were permitted in any of these comma-delimited lists, the parser would not be
able to distinguish between a comma that separated list elements and the comma
operator. The practical consequence would be reduce/reduce conflicts all over the place.
Isolating the comma operator lets you use a non_comma_expr in all of the
comma-delimited lists and an expr everywhere else.
The comma operator. The expr production, which handles the comma operator, is at the top
of Listing 6.88. The imbedded action releases the value structure associated with the
left element of the expression; the value for the right element is used as the
synthesized attribute. The imbedded action is necessary to prevent the compiler from
wasting a temporary variable. If $1 were released at the far right of the production,
then both components of the list would be evaluated before the left one was discarded.
(Run a parse of ++a, ++b by hand if you don't believe me.) Consequently, the temporary
that holds the evaluated result of the expr would continue to exist while the
non_comma_expr was being evaluated. Since the left term is going to be discarded anyway,
there's no point in keeping it around longer than necessary.
Listing 6.88. c.y - Binary Operators: Comma, Conditional, and Assignment
683  expr    : expr COMMA { release_value($1); } non_comma_expr { $$=$4; }
684          | non_comma_expr
685          ;
686
687  non_comma_expr
688          : non_comma_expr QUEST  { static int label = 0;
689
690                                    gen( "EQ", rvalue( $1 ), "0" );
691                                    gen( "goto%s%d", L_COND_FALSE,
692                                                     $<num>$ = ++label );
693                                    release_value( $1 );
694                                  }
695            non_comma_expr COLON  { $$ = $4->is_tmp
696                                            ? $4
697                                            : tmp_gen( $4->type, $4 )
698                                            ;
699
700                                    gen( "goto%s%d", L_COND_END,   $<num>3 );
701                                    gen( ":%s%d",    L_COND_FALSE, $<num>3 );
702                                  }
703            non_comma_expr        { $$ = $6;
704
705                                    if( !the_same_type( $$->type, $7->type, 1) )
706                                        yyerror( "Types on two sides of colon "\
707                                                 "must agree\n" );
708
709                                    gen( "=",     $$->name,   rvalue($7) );
710                                    gen( ":%s%d", L_COND_END, $<num>3    );
711                                    release_value( $7 );
712                                  }
713
714          | non_comma_expr ASSIGNOP non_comma_expr {$$ = assignment($2, $1, $3);}
715          | non_comma_expr EQUAL    non_comma_expr {$$ = assignment( 0, $1, $3);}
716
717          | or_expr
718          ;
The conditional operator (a?b:c). The non_comma_expr production, also in Listing 6.88,
handles the conditional and assignment operators. A statement like this:

    int mickey, minnie;
    mickey = minnie ? pluto() : goofy();

generates the code in Listing 6.89.
Listing 6.89. Code Generated for the Conditional Operator.
 1      EQ( W(&_minnie), 0 )        /* if( minnie == 0 )                   */
 2          goto QF1;               /*     branch around the first clause  */
 3
 4      call( _pluto );             /* true part of the conditional:       */
 5      W( T(1) ) = rF.w.low;       /* rvalue = subroutine return value    */
 6      goto QE1;
 7  QF1:
 8      call( _goofy );             /* false part of the conditional       */
 9      W( T(2) ) = rF.w.low;       /* rvalue = subroutine return value    */
10
11      W( T(1) ) = W( T(2) );
12  QE1:
13      W(&_mickey) = W( T(1) );    /* final assignment                    */
The conditional is processed by the right-hand side on lines 688 to 712 of Listing
6.88. The only tricky issue here is the extra assignment just above the QE1 label in
Listing 6.89. The problem here is inherent in the way a bottom-up parser works. It's very
difficult for a bottom-up parser to tell the expression-processing code to put the result in
a particular place; rather, the expression-processing code decides more or less arbitrarily
where the final result of an expression evaluation will be found, and it passes that
information back up the parse tree to the higher-level code. The difficulty with the
conditional operator is the two action clauses (one for the true condition and a second for
the false condition), both of which generate a temporary variable holding the result of the
expression evaluation. Since the entire conditional must evaluate to only one temporary,
code must be generated to copy the value returned from the false clause into the same
temporary that was used for the true clause.
None of this would be a problem with a top-down parser, which can tell the subroutine
that processes the action clause to put the final result in a particular place. For
example, the high-level subroutine that processes the conditional in a recursive-descent
parser can pass a value structure to the lower-level subroutines that generate the
expression-processing code. These lower-level routines could then use that value for
the final result of the expression-evaluation process.
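The top-down alternative just described can be sketched as follows. This is only an illustration, not code from the compiler: gen_conditional(), gen_call_into(), and emit() are hypothetical names, and the output is collected in a caller-supplied buffer rather than sent through gen(). The point is that the caller picks the destination temporary once and hands it to both arms, so the extra copy on line 11 of Listing 6.89 never appears.

```c
#include <stdio.h>
#include <string.h>

/* Append one line of output to a caller-supplied buffer (hypothetical helper). */
static void emit(char *out, size_t size, const char *line)
{
    if (strlen(out) + strlen(line) + 2 <= size) {
        strcat(out, line);
        strcat(out, "\n");
    }
}

/* Hypothetical arm generator: call func and store its return value in dest. */
static void gen_call_into(char *out, size_t size,
                          const char *func, const char *dest)
{
    char line[80];

    snprintf(line, sizeof line, "call(%s);", func);
    emit(out, size, line);
    snprintf(line, sizeof line, "%s = rF.w.low;", dest);
    emit(out, size, line);
}

/* Top-down style: the caller chooses dest and passes it down to both arms. */
void gen_conditional(char *out, size_t size, const char *test,
                     const char *t_func, const char *f_func,
                     const char *dest, int label)
{
    char line[80];

    out[0] = '\0';
    snprintf(line, sizeof line, "EQ(%s,0)", test);      emit(out, size, line);
    snprintf(line, sizeof line, "goto QF%d;", label);   emit(out, size, line);
    gen_call_into(out, size, t_func, dest);             /* true arm  */
    snprintf(line, sizeof line, "goto QE%d;", label);   emit(out, size, line);
    snprintf(line, sizeof line, "QF%d:", label);        emit(out, size, line);
    gen_call_into(out, size, f_func, dest);             /* false arm */
    snprintf(line, sizeof line, "QE%d:", label);        emit(out, size, line);
}
```

Calling gen_conditional(buf, sizeof buf, "W(&_minnie)", "_pluto", "_goofy", "W(T(1))", 1) yields the same test and branches as Listing 6.89, but both arms store directly into W(T(1)) and no second temporary is used.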
Assignment operators. The other two right-hand sides to non_comma_expr (on lines 714 and
715 of Listing 6.88) handle the assignment operators. The assignment() function in
Listing 6.90 does all the work. The subroutine is passed three arguments: an operator (op)
and values for the destination (dst) and source (src). The operator argument is zero for
simple assignment; otherwise it's the first character of the lexeme: '+' for +=, '<' for
<<=, and so on. This operator is, in turn, passed to gen() by adding it to the operator
string on lines 588 and 589 of Listing 6.90.
The destination argument must be an lvalue. If it's a physical lvalue, assignment()
tells gen() to add a star to the left of the destination name by adding an @ to
the operator string on line 587 of Listing 6.90. The source operand is converted to an
rvalue on line 593, then the assignment code is generated.
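The operator-string assembly just described can be isolated in a few lines. The sketch below mirrors lines 587 to 591 of Listing 6.90, but build_op_str() and its parameters are names invented here for illustration; the real code works on the dst value directly and tests the first character of its name.

```c
#include <assert.h>
#include <stddef.h>

/* Build the operator string handed to a gen()-style emitter.
 *   op        0 for plain assignment, else the first character of the lexeme
 *   physical  nonzero if the destination is a physical lvalue
 *   buf       receives the string; must hold at least 5 bytes ("@<<=" + '\0')
 */
char *build_op_str(int op, int physical, char *buf, size_t size)
{
    char *p = buf;

    assert(size >= 5);

    if (physical)
        *p++ = '@';                     /* emitter turns this into a leading * */
    if (op)
        *p++ = (char)op;
    if (op == '<' || op == '>')
        *p++ = (char)op;                /* <<= or >>= */
    *p++ = '=';                         /* always present */
    *p   = '\0';
    return buf;
}
```

For example, build_op_str('<', 1, buf, sizeof buf) produces "@<<=", and build_op_str(0, 0, buf, sizeof buf) produces "=".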
Implicit type conversion. If the source and destination types don't match, an implicit
type conversion must be performed as part of the assignment on line 616. The
convert_type() subroutine is the same one that was used for temporary-variable
initialization. (It's in Listing 6.69 on page 590.) convert_type() generates code to copy
the source variable into a register and do any necessary sign extension or zero fill. It
returns a string that can be used to reference the required register. As an example of the
type-conversion process, given the following declarations:
Listing 6.90. op.c - Assignment Operators
571  value *assignment( op, dst, src )
572  int   op;
573  value *dst, *src;
574  {
575      char  *src_name;
576      char  op_str[8], *p = op_str;
577      value *val;
578
579      if( !dst->lvalue             ) yyerror    ( "(=) lvalue required\n" );
580      if( !dst->is_tmp && dst->sym ) gen_comment( "%s", dst->sym->name );
581
582      /* Assemble the operator string for gen(). A leading @ is translated by
583       * gen() to a * at the far left of the output string. For example,
584       * ("@=", x, y) is output as "*x = y".
585       */
586
587      if( *dst->name != '&'       ) *p++ = '@';
588      if( op                      ) *p++ = op;
589      if( op == '<' || op == '>'  ) *p++ = op;       /* <<= or >>= */
590      /* do always              */  *p++ = '=';
591                                    *p++ = '\0';
592
593      src_name = rvalue( src );
594
595      if( IS_POINTER(dst->type) && IS_PTR_TYPE(src->type) )
596      {
597          if( op )
598              yyerror( "Illegal operation (%c= on two pointers);\n", op );
599
600          else if( !the_same_type( dst->type->next, src->type->next, 0) )
601              yyerror( "Illegal pointer assignment (type mismatch)\n" );
602
603          else
604              gen( "=", dst->name + (*dst->name=='&' ? 1 : 0), src_name );
605      }
606      else
607      {
608          /* If the destination type is larger than the source type, perform an
609           * implicit cast (create a temporary of the correct type); otherwise
610           * just copy into the destination. convert_type() releases the source
611           * value.
612           */
613
614          if( !the_same_type( dst->type, src->type, 1) )
615          {
616              gen( op_str, dst->name + (*dst->name == '&' ? 1 : 0),
617                                           convert_type( dst->type, src ) );
618          }
619          else
620          {
621              gen( op_str, dst->name + (*dst->name == '&' ? 1 : 0), src_name );
622              release_value( src );
623          }
624      }
625      return dst;
626  }
    int  i, j, k;
    long l;
    char *p;

simple input like i=j=k; generates the following output:

    W(&_j) = W(&_k);
    W(&_i) = W(&_j);

A more complex assignment like *p=l=i generates:

    r0.w.low = W(&_i);       /* convert i to long   */
    ext_word(r0);
    L(&_l) = r0.l;           /* assign to l         */
    r0.l = L(&_l);           /* truncate l to char  */
    *BP(&_p) = r0.b.b0;      /* assign to *p        */
The logical OR operator (||). The next level of binary operators handles the logical OR
operator. The productions and related workhorse functions are in Listings 6.91 and 6.92.
Listing 6.91. c.y - Binary Operators: The Logical OR Operator and Auxiliary Stack
143  /* Stacks. The stack macros are all in        */
144  /* <tools/stack.h>, included earlier          */
145  %{
146  #include <tools/stack.h>        /* Stack macros, (see Appendix A) */
147
148  stk_err( o )
149  {
150      yyerror( o ? "Loop/switch nesting too deep or logical expr. too complex.\n"
151                 : "INTERNAL, label stack underflow.\n" );
152      exit( 1 );
153  }
154
155  #undef  stack_err
156  #define stack_err(o)  stk_err(o)
157
158  stack_dcl( S_andor, int, 32 );  /* This stack wouldn't be necessary if I were */
159                                  /* willing to put a structure onto the value  */
160                                  /* stack. or_list and and_list must both      */
161                                  /* return 2 attributes; this stack will hold  */
162                                  /* one of them.                               */
163  %}

719  or_expr : or_list       { int label;
720                            if( label = pop( S_andor ) )
721                                $$ = gen_false_true( label, NULL );
722                          }
723          ;
724  or_list : or_list OROR  { if( $1 )
725                                or( $1, stack_item(S_andor,0) = tf_label() );
726                          }
727            and_expr      { or( $4, stack_item(S_andor,0) );
728                            $$ = NULL;
729                          }
730          | and_expr      { push( S_andor, 0 );
731                          }
732          ;
Listing 6.92. op.c - The Logical OR Operator
627  void or( val, label )
628  value *val;
629  int   label;
630  {
631      val = gen_rvalue( val );
632
633      gen( "NE",       val->name, "0"   );
634      gen( "goto%s%d", L_TRUE,    label );
636  }
637  /*----------------------------------------------------------------------*/
638  value *gen_rvalue( val )
639  value *val;
640  {
641      /* This function is like rvalue(), except that it emits code to generate a
642       * physical rvalue from a physical lvalue (instead of just messing with the
643       * name). It returns the 'value' structure for the new rvalue rather than a
644       * string.
645       */
646
647      if( !val->lvalue || *(val->name) == '&' )  /* rvalue or logical lvalue */
648          rvalue( val );                         /* just change the name     */
649      else
650          val = tmp_gen( val->type, val );       /* actually do indirection  */
651
652      return val;
653  }
The only difficulty here is the requirement that run-time evaluation of expressions
containing logical OR operators must terminate as soon as truth can be determined. Looking
at some generated output shows you what's going on. The expression i||j||k creates
the following output:

        NE(W(&_i),0)
        goto T1;
        NE(W(&_j),0)
        goto T1;
        NE(W(&_k),0)
        goto T1;
    F1:
        W( T(1) ) = 0;
        goto E1;
    T1:
        W( T(1) ) = 1;
    E1:

The productions treat expressions involving || operators as an OR-operator-delimited
list of subexpressions. The code that handles the individual list elements is on lines 724
to 730 of Listing 6.91. It uses the or() subroutine in Listing 6.92 to emit a test/branch
instruction of the form:

        NE(W(&_i),0)
        goto T1;

The code following the F1 label is emitted on line 721 of Listing 6.91 after all the list
elements have been processed.
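The shape of that output is easy to reproduce outside the parser. The following is a minimal sketch, assuming the operands are already available as strings; gen_or_list() and append() are hypothetical names, and the real compiler emits the same lines incrementally from the or_list and or_expr actions rather than from a loop.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append one output line plus a newline to out (capacity size). */
static void append(char *out, size_t size, const char *line)
{
    assert(strlen(out) + strlen(line) + 2 <= size);
    strcat(out, line);
    strcat(out, "\n");
}

/* Emit short-circuit code for operands[0] || ... || operands[n-1].
 * label is the numeric component shared by the T, F, and E labels;
 * tmp names the temporary that receives the 0 or 1 result.
 */
void gen_or_list(const char *operands[], int n, int label,
                 const char *tmp, char *out, size_t size)
{
    char line[80];
    int  i;

    out[0] = '\0';
    for (i = 0; i < n; ++i) {           /* one test/branch pair per element */
        snprintf(line, sizeof line, "NE(%s,0)", operands[i]);
        append(out, size, line);
        snprintf(line, sizeof line, "goto T%d;", label);
        append(out, size, line);
    }

    /* Targets, emitted once after the whole list has been seen. */
    snprintf(line, sizeof line, "F%d:", label);       append(out, size, line);
    snprintf(line, sizeof line, "%s = 0;", tmp);      append(out, size, line);
    snprintf(line, sizeof line, "goto E%d;", label);  append(out, size, line);
    snprintf(line, sizeof line, "T%d:", label);       append(out, size, line);
    snprintf(line, sizeof line, "%s = 1;", tmp);      append(out, size, line);
    snprintf(line, sizeof line, "E%d:", label);       append(out, size, line);
}
```

Every element branches to the same true label as soon as it evaluates to nonzero, which is what makes the evaluation stop early at run time.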
The main implementation-related difficulty is that the or_list productions really want
to return two attributes: the value that represents the operand and the numeric
component of the label used as the target of the output goto statement. You could do this,
of course, by adding a two-element structure to the value-stack union, but I was reluctant
to make the value stack wider than necessary because the parser would slow down as a
consequence: it would have to copy twice as much stuff on every parse cycle. I solved
the problem by introducing an auxiliary stack (S_andor), declared in the occs
declaration section with the code on lines 146 to 163 of Listing 6.91. This extra stack
holds the numeric component of the target label.
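The auxiliary-stack idea can be shown in isolation. This is a sketch only: the function names below are invented for illustration, a plain array stands in for the stack macros of <tools/stack.h>, and a simple counter stands in for tf_label(). It also tests the top-of-stack item for zero instead of testing $1, which is a simplification of what the real actions do.

```c
#include <assert.h>

#define ANDOR_DEPTH 32                  /* same depth as the S_andor stack */

static int andor_stack[ANDOR_DEPTH];
static int andor_sp   = 0;              /* number of items on the stack    */
static int next_label = 0;              /* stand-in for tf_label()         */

/* Leftmost list element seen: no operator has been processed yet. */
void list_begin(void)
{
    assert(andor_sp < ANDOR_DEPTH);
    andor_stack[andor_sp++] = 0;
}

/* An operator was seen. Allocate the label the first time through and
 * reuse it for every later element of the same list.
 */
int list_operator_seen(void)
{
    assert(andor_sp > 0);
    if (andor_stack[andor_sp - 1] == 0)
        andor_stack[andor_sp - 1] = ++next_label;
    return andor_stack[andor_sp - 1];
}

/* End of list: returns 0 if the list never contained an operator. */
int list_end(void)
{
    assert(andor_sp > 0);
    return andor_stack[--andor_sp];
}
```

Because each list pushes its own entry, a nested list gets its own label and the outer list's label is still on the stack when the inner one is popped.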
The best way to see what's going on is to follow the parse of i||j||k in Table 6.25.
You should read the comments in that table now, looking at the source code where
necessary to see what's going on.
Table 6.25. A Parse of i||j||k;

Attributes are shown in brackets after the symbol that holds them; a trailing L marks an
lvalue and a trailing R marks an rvalue.

Parse stack:  stmt_list
S_andor:      (empty)
Comments:     Shift NAME. The shifted attribute is a pointer to the symbol that
              represents i.

Parse stack:  stmt_list NAME[i]
S_andor:      (empty)
Comments:     Reduce by unary→NAME.

Parse stack:  stmt_list unary[W(&_i)L]
S_andor:      (empty)
Comments:     Reduce by binary→unary.

Parse stack:  stmt_list binary[W(&_i)L]
S_andor:      (empty)
Comments:     Reduce by and_list→binary.

Parse stack:  stmt_list and_list[W(&_i)L]
S_andor:      (empty)
Comments:     Reduce by and_expr→and_list.

Parse stack:  stmt_list and_expr[W(&_i)L]
S_andor:      (empty)
Comments:     Reduce by or_list→and_expr. This is the first reduction in the list,
              so push 0 onto the S_andor stack. Note that, since this is the
              leftmost list element, the default $$=$1 is permitted to happen in
              order to tell the next reduction in the series what to do.

Parse stack:  stmt_list or_list[W(&_i)L]
S_andor:      0
Comments:     Shift OROR.

Parse stack:  stmt_list or_list[W(&_i)L] OROR
S_andor:      0
Comments:     Reduce by the imbedded production in or_list→or_list OROR {128}
              and_expr. This is the first expression in the list; the compiler
              knows that it's first because $1 isn't NULL. Call tf_label() to
              get the numeric component of the target label, and then replace
              the 0 at the top of the S_andor stack with the label number. This
              replacement tells the or_expr→or_list production that an ||
              operator has actually been processed. The reduction by
              or_expr→or_list occurs in every expression, even those that don't
              involve logical ORs. You can emit code only when the expression
              had an || operator in it, however. The S_andor stack is empty if
              there wasn't one. The compiler emits the following code:
              CODE:     NE(W(&_i),0)
              CODE:     goto T1;

Parse stack:  stmt_list or_list[W(&_i)L] OROR {128}
S_andor:      1
Comments:     Shift NAME. The shifted attribute is a pointer to the symbol that
              represents j.

Parse stack:  stmt_list or_list[W(&_i)L] OROR {128} NAME[j]
S_andor:      1
Comments:     Reduce by unary→NAME. The synthesized attribute is an lvalue
              for j.

Parse stack:  stmt_list or_list[W(&_i)L] OROR {128} unary[W(&_j)L]
S_andor:      1
Comments:     Reduce by binary→unary.

Parse stack:  stmt_list or_list[W(&_i)L] OROR {128} binary[W(&_j)L]
S_andor:      1
Comments:     Reduce by and_list→binary.

continued...
The logical AND operator (&&). The code to handle the logical AND operator is almost
identical to that for the OR operator. It's shown in Listings 6.93 and 6.94. The
expression i&&j&&k generates the following output:
Table 6.25. Continued. A Parse of i||j||k;

Parse stack:  stmt_list or_list[W(&_i)L] OROR {128} and_list[W(&_j)L]
S_andor:      1
Comments:     Reduce by and_expr→and_list.

Parse stack:  stmt_list or_list[W(&_i)L] OROR {128} and_expr[W(&_j)L]
S_andor:      1
Comments:     Reduce by or_list→or_list OROR {128} and_expr. Emit code
              to handle the second list element:
              CODE:     NE(W(&_j),0)
              CODE:     goto T1;
              The numeric component of the label is at the top of the S_andor
              stack, and is examined with a stack_item() call. The
              post-reduction attribute attached to the or_list is NULL.

Parse stack:  stmt_list or_list[NULL]
S_andor:      1
Comments:     Shift OROR.

Parse stack:  stmt_list or_list[NULL] OROR
S_andor:      1
Comments:     Reduce by the imbedded production in or_list→or_list OROR {128}
              and_expr. This time, the attribute for $1 is NULL, so no code is
              generated.

Parse stack:  stmt_list or_list[NULL] OROR {128}
S_andor:      1
Comments:     Shift NAME. The shifted attribute is a pointer to the symbol that
              represents k.

Parse stack:  stmt_list or_list[NULL] OROR {128} NAME[k]
S_andor:      1
Comments:     Reduce by unary→NAME. The synthesized attribute is an lvalue
              for k.

Parse stack:  stmt_list or_list[NULL] OROR {128} unary[W(&_k)L]
S_andor:      1
Comments:     Reduce by binary→unary.

Parse stack:  stmt_list or_list[NULL] OROR {128} binary[W(&_k)L]
S_andor:      1
Comments:     Reduce by and_list→binary.

Parse stack:  stmt_list or_list[NULL] OROR {128} and_list[W(&_k)L]
S_andor:      1
Comments:     Reduce by and_expr→and_list.

Parse stack:  stmt_list or_list[NULL] OROR {128} and_expr[W(&_k)L]
S_andor:      1
Comments:     Reduce by or_list→or_list OROR {128} and_expr. Emit code
              to process the third list element. The numeric component of the
              label is at the top of the S_andor stack.
              CODE:     NE(W(&_k),0)
              CODE:     goto T1;
              The synthesized attribute is NULL here as well.

Parse stack:  stmt_list or_list[NULL]
S_andor:      1
Comments:     Reduce by or_expr→or_list. The numeric component of the
              label is popped off the S_andor stack. If it's zero, then no ||
              operators were processed. It's 1, however, so emit the targets for
              all the goto branches generated by the previous list elements.
              CODE: F1:
              CODE:     W( T(1) ) = 0;
              CODE:     goto E1;
              CODE: T1:
              CODE:     W( T(1) ) = 1;
              CODE: E1:
              The synthesized attribute is an rvalue for the temporary that holds
              the result of the OR operation.

Parse stack:  stmt_list or_expr[W(T(1))R]

        EQ(W(&_i),0)            /* i && j && k */
        goto F1;
        EQ(W(&_j),0)
        goto F1;
        EQ(W(&_k),0)
        goto F1;
        goto T1;
    F1:
        W( T(1) ) = 0;
        goto E1;
    T1:
        W( T(1) ) = 1;
    E1:

Since the run-time processing has to terminate as soon as a false expression is found, the
test instruction is now an EQ rather than an NE, and the target is the false label rather
than the true one. The same S_andor stack is used for both the AND and OR operators. The
goto T1 just above the F1 label is needed to prevent the last list element from falling
through to the false assignment. Note that the logical AND and OR operators nest
correctly. The expression

    (i || j && k || l)

generates the output in Listing 6.95 (&& is higher precedence than ||). A blow-by-blow
analysis of the parse of the previous expression is left as an exercise.
Listing 6.93. c.y - Binary Operators: The Logical AND Operator
733  and_expr : and_list         { int label;
734                                if( label = pop( S_andor ) )
735                                {
736                                    gen( "goto%s%d", L_TRUE, label );
737                                    $$ = gen_false_true( label, NULL );
738                                }
739                              }
740           ;
741  and_list : and_list ANDAND  { if( $1 )
742                                    and( $1, stack_item(S_andor,0) = tf_label() );
743                              }
744             binary           { and( $4, stack_item(S_andor,0) );
745                                $$ = NULL;
746                              }
747           | binary           { push( S_andor, 0 );
748                              }
749           ;
Listing 6.94. op.c - The Logical AND Operator
654  void and( val, label )
655  value *val;
656  int   label;
657  {
658      val = gen_rvalue( val );
659
660      gen( "EQ",       val->name, "0"   );
661      gen( "goto%s%d", L_FALSE,   label );
662      release_value( val );
663  }
Relational operators. The remainder of the binary operators are handled by the binary
productions, the first two right-hand sides of which are in Listing 6.96. The productions
handle the relational operators, with the work done by the relop() subroutine in Listing
6.97. The EQUOP token matches either == or !=. A RELOP matches any of the following
lexemes:

    <=  >=  <  >

A single token can't be used for all six lexemes because the RELOPs are higher
precedence than the EQUOPs. The associated integer attributes are assigned as follows:
Listing 6.95. Output for (i || j && k || l)
 1      NE(W(&_i),0)
 2      goto T1;
 3      EQ(W(&_j),0)
 4      goto F2;
 5      EQ(W(&_k),0)
 6      goto F2;
 7  F2:
 8      W( T(1) ) = 0;
 9      goto E2;
10  T2:
11      W( T(1) ) = 1;
12  E2:
13      NE(W( T(1) ),0)
14      goto T1;
15      NE(W(&_l),0)
16      goto T1;
17  F1:
18      W( T(1) ) = 0;
19      goto E1;
20  T1:
21      W( T(1) ) = 1;
22  E1:
Token   Lexeme   Attribute
EQUOP   ==       '='
EQUOP   !=       '!'
RELOP   >        '>'
RELOP   <        '<'
RELOP   >=       'G'
RELOP   <=       'L'
The relop() subroutine at the top of Listing 6.97 does all the work. It's passed the
operator's attribute and the two operands; i<j generates the following code:

        LT(W(&_i),W(&_j))       /* compare i and j.                     */
        goto T1;                /* Jump to true case on success;        */
    F1:                         /* otherwise, fall through to false.    */
        W( T(1) ) = 0;
        goto E1;
    T1:
        W( T(1) ) = 1;
    E1:

The comparison directive on the first line (LT) changes with the incoming operator: LT
for <, GT for >, EQ for ==, and so on. relop() generates the code in several steps. Both
operands are converted to rvalues by the gen_rvalue() calls on lines 673 and 674 of
Listing 6.97. The make_types_match() call on the next line applies the standard C
type-conversion rules to the two operands, generating code to promote the smaller of the
two variables to the type of the larger one. The subroutine starts on line 712 of Listing
6.97. It's passed pointers to the two values, and it might modify those values because
the conversion might create another, larger temporary. Normally, make_types_match()
returns 1. It returns 0 if it can't do the conversion, as is the case when one operand is an
int and the other a pointer.
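The conversion rules can be summarized with a simplified model. The sketch below is not the compiler's code: it replaces the link structures with a hypothetical enum kind whose members are ordered by rank, ignores code generation entirely, and treats all pointers as one kind. Under those assumptions the rules in make_types_match() reduce to "promote char to int, then convert the lower-ranked operand to the higher-ranked type."

```c
/* Simplified operand kinds, ordered so that a larger value means a
 * "larger" type under the conversion rules. K_PTR never converts.
 */
enum kind { K_CHAR, K_INT, K_UINT, K_LONG, K_ULONG, K_PTR };

/* Make two operand kinds agree. Returns 1 on success (or if they already
 * agreed), 0 if a pointer is mixed with a different kind; in that case
 * neither operand is changed.
 */
int match_kinds(enum kind *a, enum kind *b)
{
    if (*a == *b && *a != K_CHAR)
        return 1;                       /* same type, nothing to do   */

    if (*a == K_PTR || *b == K_PTR)
        return 0;                       /* pointer vs. something else */

    if (*a == K_CHAR) *a = K_INT;       /* chars are always widened   */
    if (*b == K_CHAR) *b = K_INT;

    if (*a < *b) *a = *b;               /* promote the smaller operand */
    else         *b = *a;

    return 1;
}
```

Note that two char operands are both widened to int even though they started out the same type, which matches the extra IS_CHAR test on line 728 of Listing 6.97.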
The switch on line 680 of Listing 6.97 translates the incoming attribute to an
argument passed to the gen() call that generates the test on line 693. The goto is
generated on the next line, and the numeric component of the target label is fetched from
tf_label() at the same time. The code on lines 696 to 701 is doing a minor optimization.
If the value passed into gen_false_true() is already a temporary of type int, then a
second temporary is not created. If the original v1 isn't an int temporary,
the code on lines 696 to 701 swaps the two values in the hope that v2 is a temporary.
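The attribute-to-directive translation performed by that switch can be sketched as a lookup function. relop_directive() is a hypothetical name, and only LT, GT, EQ, and NE are confirmed by the surrounding text and listings; the names GE and LE for the 'G' and 'L' attributes are assumptions made by analogy.

```c
#include <stddef.h>

/* Map a RELOP/EQUOP attribute to the name of a comparison directive.
 * Returns NULL for an attribute that isn't in the token table.
 */
const char *relop_directive(int attr)
{
    switch (attr) {
    case '>': return "GT";
    case '<': return "LT";
    case 'G': return "GE";      /* >=  (directive name assumed) */
    case 'L': return "LE";      /* <=  (directive name assumed) */
    case '!': return "NE";      /* !=                           */
    case '=': return "EQ";      /* ==                           */
    default:  return NULL;
    }
}
```

A caller would treat a NULL return the way relop() treats a bad attribute: report an internal error and abandon code generation for the expression.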
Listing 6.96. c.y - Binary Operators: Relational Operators
750  binary
751      : binary RELOP binary      { $$ = relop( $1, $2, $3 ); }
752      | binary EQUOP binary      { $$ = relop( $1, $2, $3 ); }
Listing 6.97. op.c - Relational Operators
664  value *relop( v1, op, v2 )
665  value *v1;
666  int   op;
667  value *v2;
668  {
669      value *tmp;
670      int   label;
         . . .
673      v1 = gen_rvalue( v1 );
674      v2 = gen_rvalue( v2 );
675      if( !make_types_match( &v1, &v2 ) )
             yyerror( "Illegal comparison of dissimilar types\n" );
         . . .
680      switch( op )
         . . .
             yyerror( "INTERNAL, relop(): Bad . . . %c\n", . . . );
             goto abort;
         . . .
693      gen( . . . );
694      gen( "goto%s%d", . . . );
         . . .
696      if( . . . )                    /* try to make v1 the temporary */
         {
698          tmp = v1;
699          v1  = v2;
700          v2  = tmp;
         }
         . . .
704      v1 = gen_false_true( label, v1 );
705  abort:
706      release_value( v2 );                /* discard the other value */
707      return v1;
708  }
709
710  /*----------------------------------------------------------------------*/
711
712  PRIVATE int make_types_match( v1p, v2p )
713  value **v1p, **v2p;
714  {
715      /* Takes care of type conversion. If the types are the same, do nothing;
716       * otherwise, apply the standard type-conversion rules to the smaller
717       * of the two operands. Return 1 on success or if the objects started out
718       * the same type. Return 0 (and don't do any conversions) if either operand
719       * is a pointer and the operands aren't the same type.
720       */
721
722      value *v1 = *v1p;
723      value *v2 = *v2p;
724
725      link  *t1 = v1->type;
726      link  *t2 = v2->type;
727
728      if( the_same_type(t1, t2, 0) && !IS_CHAR(t1) )
729          return 1;
730
731      if( IS_POINTER(t1) || IS_POINTER(t2) )
732          return 0;
733
734      if( IS_CHAR(t1) ) { v1 = tmp_gen(t1, v1); t1 = v1->type; }
735      if( IS_CHAR(t2) ) { v2 = tmp_gen(t2, v2); t2 = v2->type; }
736
737      if( IS_ULONG(t1) && !IS_ULONG(t2) )
738      {
739          if( IS_LONG(t2) )
740              v2->type->UNSIGNED = 1;
741          else
742              v2 = tmp_gen( t1, v2 );
743      }
744      else if( !IS_ULONG(t1) && IS_ULONG(t2) )
745      {
746          if( IS_LONG(t1) )
747              v1->type->UNSIGNED = 1;
748          else
749              v1 = tmp_gen( v2->type, v1 );
750      }
751      else if(  IS_LONG(t1) && !IS_LONG(t2) ) v2 = tmp_gen( t1, v2 );
752      else if( !IS_LONG(t1) &&  IS_LONG(t2) ) v1 = tmp_gen( t2, v1 );
753      else if(  IS_UINT(t1) && !IS_UINT(t2) ) v2->type->UNSIGNED = 1;
754      else if( !IS_UINT(t1) &&  IS_UINT(t2) ) v1->type->UNSIGNED = 1;
755
756      /* else they're both normal ints, do nothing */
757
758      *v1p = v1;
759      *v2p = v2;
760      return 1;
761  }
All other operators, binary_op(). Most other binary operators are covered by the
productions and code in Listings 6.98 and 6.99. Everything is covered but addition and
subtraction, which require special handling because they can accept pointer operands. All
the work is done in binary_op(), at the top of Listing 6.99. The routine is passed values
for the two operands and an int that represents the operator. It generates the code that
does the required operation, and returns a value that references the run-time result of
the operation. This returned value is usually the incoming first argument, but it might not
be if neither incoming value is a temporary. In addition, one or both of the incoming
values is released.
Listing 6.98. c.y - Binary Operators: Other Arithmetic Operators
753  /* binary: */
754      | binary STAR    binary    { $$ = binary_op( $1, '*', $3 ); }
755      | binary DIVOP   binary    { $$ = binary_op( $1, $2,  $3 ); }
756      | binary SHIFTOP binary    { $$ = binary_op( $1, $2,  $3 ); }
757      | binary AND     binary    { $$ = binary_op( $1, '&', $3 ); }
758      | binary XOR     binary    { $$ = binary_op( $1, '^', $3 ); }
759      | binary OR      binary    { $$ = binary_op( $1, '|', $3 ); }
Listing 6.99. op.c - Other Arithmetic Operators
762  value *binary_op( v1, op, v2 )
763  value *v1;
764  int   op;
765  value *v2;
766  {
767      char *str_op;
768      int  commutative = 0;           /* operator is commutative */
769
770      if( do_binary_const( &v1, op, &v2 ) )
771      {
772          release_value( v2 );
773          return v1;
774      }
775
776      v1 = gen_rvalue( v1 );
777      v2 = gen_rvalue( v2 );
778
779      if( !make_types_match( &v1, &v2 ) )
780          yyerror( "%c%c: Illegal type conversion\n", . . . );
         . . .
784      switch( op )
         {
         case '*':
         case '&':
         . . .
                     commutative = 1;
         . . .
         }
         . . .
794      dst_opt( &v1, &v2, commutative );
         . . .
             case '+':  t1->V_INT  += t2->V_INT;  break;
             case '-':  t1->V_INT  -= t2->V_INT;  break;
             case '*':  t1->V_INT  *= t2->V_INT;  break;
             case '&':  t1->V_INT  &= t2->V_INT;  break;
             case '|':  t1->V_INT  |= t2->V_INT;  break;
             case '^':  t1->V_INT  ^= t2->V_INT;  break;
             case '/':  t1->V_INT  /= t2->V_INT;  break;
             case '%':  t1->V_INT  %= t2->V_INT;  break;
             case '<':  t1->V_INT <<= t2->V_INT;  break;
         . . .
856      }
857      if( IS_LONG(t1) && IS_LONG(t2) )
858      {
859          switch( op )
860          {
861          case '+':  t1->V_LONG  += t2->V_LONG;  break;
862          case '-':  t1->V_LONG  -= t2->V_LONG;  break;
863          case '*':  t1->V_LONG  *= t2->V_LONG;  break;
864          case '&':  t1->V_LONG  &= t2->V_LONG;  break;
865          case '|':  t1->V_LONG  |= t2->V_LONG;  break;
866          case '^':  t1->V_LONG  ^= t2->V_LONG;  break;
867          case '/':  t1->V_LONG  /= t2->V_LONG;  break;
868          case '%':  t1->V_LONG  %= t2->V_LONG;  break;
869          case '<':  t1->V_LONG <<= t2->V_LONG;  break;
870
871          case '>':  if( IS_UNSIGNED(t1) ) t1->V_ULONG >>= t2->V_LONG;
872                     else                  t1->V_LONG  >>= t2->V_LONG;
873                     break;
874          }
875          return 1;
876      }
877      if( IS_LONG(t1) && IS_INT(t2) )
878      {
879          switch( op )
880          {
881          case '+':  t1->V_LONG  += t2->V_INT;  break;
882          case '-':  t1->V_LONG  -= t2->V_INT;  break;
883          case '*':  t1->V_LONG  *= t2->V_INT;  break;
884          case '&':  t1->V_LONG  &= t2->V_INT;  break;
885          case '|':  t1->V_LONG  |= t2->V_INT;  break;
886          case '^':  t1->V_LONG  ^= t2->V_INT;  break;
887          case '/':  t1->V_LONG  /= t2->V_INT;  break;
888          case '%':  t1->V_LONG  %= t2->V_INT;  break;
889          case '<':  t1->V_LONG <<= t2->V_INT;  break;
890
891          case '>':  if( IS_UNSIGNED(t1) ) t1->V_ULONG >>= t2->V_INT;
892                     else                  t1->V_LONG  >>= t2->V_INT;
893                     break;
894          }
895          return 1;
896      }
897      if( IS_INT(t1) && IS_LONG(t2) )
898      {
899          /*
900           * Avoid commutativity problems by doing the arithmetic first,
901           * then swapping the operand values.
902           */
903          switch( op )
904          {
905          case '+':  x = t1->V_INT  + t2->V_LONG;  break;
906          case '-':  x = t1->V_INT  - t2->V_LONG;  break;
907          case '*':  x = t1->V_INT  * t2->V_LONG;  break;
908          case '&':  x = t1->V_INT  & t2->V_LONG;  break;
909          case '|':  x = t1->V_INT  | t2->V_LONG;  break;
910          case '^':  x = t1->V_INT  ^ t2->V_LONG;  break;
911          case '/':  x = t1->V_INT  / t2->V_LONG;  break;
912          case '%':  x = t1->V_INT  % t2->V_LONG;  break;
913          case '<':  x = t1->V_INT << t2->V_LONG;  break;
914
915          case '>':  if( IS_UINT(t1) ) x = t1->V_UINT >> t2->V_LONG;
916                     else              x = t1->V_INT  >> t2->V_LONG;
917                     break;
918          }
919
920          t2->V_LONG = x;     /* Modify v1 to point at the larger    */
921          tmp  = *v1p;        /* operand by swapping *v1p and *v2p.  */
922          *v1p = *v2p;
923          *v2p = tmp;
924          return 1;
925      }
926      }
927      return 0;
928  }
929
930  /*----------------------------------------------------------------------*/
931  PRIVATE void dst_opt( leftp, rightp, commutative )
932  value **leftp;
933  value **rightp;
934  {
935      /* Optimizes various sources and destination as follows:
936       *
937       * operation is not commutative:
938       *     if *left is a temporary:        do nothing
939       *     else:                           create a temporary and
940       *                                     initialize it to *left,
941       *                                     freeing *left
942       *                                     *left = new temporary
943       * operation is commutative:
944       *     if *left is a temporary         do nothing
945       *     else if *right is a temporary   swap *left and *right
946       *     else                            proceed as if not commutative.
947       */
948
949      value *tmp;
950
951      if( !(*leftp)->is_tmp )
952      {
953          if( commutative && (*rightp)->is_tmp )
954          {
955              tmp     = *leftp;
956              *leftp  = *rightp;
957              *rightp = tmp;
958          }
959          else
960              *leftp = tmp_gen( (*leftp)->type, *leftp );
961      }
962  }
binary_op() starts out by trying to perform a type of optimization called constant
folding. If both of the incoming values represent constants, then the arithmetic is done
internally at compile time rather than generating code. The result is put into the last
link of whichever of the two incoming values was larger, and that value is also the
synthesized attribute. The work is done by do_binary_constant(), starting on line
816 of Listing 6.99. An if clause is provided for each of the possible incoming types.
Note that do_binary_constant() is passed pointers to the value pointers. The
extra indirection is necessary because, if the right operand is larger than the left operand,
the two values are swapped (after doing the arithmetic, of course) so that the first pointer
ends up at the larger operand. The code to do the swapping starts on line 920.
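The idea behind constant folding can be sketched in a few lines. This is only an illustration, not the code from Listing 6.99: the `operand` structure and `fold_ints()` are hypothetical names, and the sketch handles a single integer type rather than the int/long combinations that the real subroutine sorts out.

```c
#include <assert.h>

/* Hypothetical stand-in for a value: a flag marking it as a compile-time
 * constant, plus the constant itself. This is not the book's value structure.
 */
typedef struct operand
{
    int  is_const;   /* nonzero if the value is known at compile time */
    long v;          /* the constant, valid only if is_const is set   */
} operand;

/* Try to fold "left op right" at compile time. On success the result is
 * stored in *left and 1 is returned; otherwise 0 is returned, *left is
 * unchanged, and the caller must generate run-time code instead.
 */
int fold_ints( operand *left, int op, const operand *right )
{
    assert( left && right );

    if( !left->is_const || !right->is_const )
        return 0;

    switch( op )
    {
    case '+': left->v += right->v; break;
    case '-': left->v -= right->v; break;
    case '*': left->v *= right->v; break;
    case '&': left->v &= right->v; break;
    case '|': left->v |= right->v; break;
    case '^': left->v ^= right->v; break;
    case '/':
    case '%':
        if( right->v == 0 )     /* leave division by zero for run time */
            return 0;
        if( op == '/' ) left->v /= right->v;
        else            left->v %= right->v;
        break;
    default:
        return 0;               /* operator not handled by this sketch */
    }
    return 1;
}
```

A caller would try `fold_ints()` first and fall back to emitting an arithmetic instruction only when it returns 0, which mirrors the order of operations described above.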
If constant folding couldn't be performed, then binary_op() must generate some
code. It starts out on lines 776 and 777 of Listing 6.99 by converting the incoming
values to rvalues. It then does any necessary type conversions with the
make_types_match() call on line 779. The switch on line 784 figures out if the
operation is commutative, and the dst_opt() call on line 794 juggles around the operands
to make the code more efficient.
dst_opt() starts on line 931 of Listing 6.99. It is also passed two pointers to
value pointers, and it makes sure that the destination value is a temporary variable. If
it's already a temporary, dst_opt() does nothing; otherwise, if the right operand is a
temporary and the left one isn't, and if the operation is commutative, it swaps the two
operands; otherwise, it generates a temporary to hold the result and copies the left
operand into it.
Returning to binary_op(), the arithmetic instruction is finally generated on line
806 of Listing 6.99.
The last of the binary operators are the addition and subtraction operators, handled
by the productions and code in Listings 6.100 and 6.101. The only real difference
between the action here and the action for the operators we just looked at is that pointers
are legal here. It's legal to subtract two pointers, subtract an integer from a pointer, or
add an integer to a pointer. The extra code that handles pointers is on lines 1019 to 1057
of Listing 6.101.
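The scaling that pointer arithmetic requires can be sketched as follows. This is an illustration only: addresses are modeled as plain long integers, and `ptr_plus_minus_num()` and `ptr_minus_ptr()` are hypothetical helpers that compute at once what the generated `*=` and `/=` instructions compute at run time.

```c
#include <assert.h>

/* ptr [+-] number: the number is scaled by the size of the object that
 * the pointer addresses before it is added to (or subtracted from) the
 * pointer.
 */
long ptr_plus_minus_num( long ptr, int op, long num, long obj_size )
{
    long offset;

    assert( obj_size > 0 );
    offset = num * obj_size;
    return (op == '+') ? ptr + offset : ptr - offset;
}

/* ptr - ptr: the difference in bytes is divided by the object size,
 * yielding the number of objects between the two addresses.
 */
long ptr_minus_ptr( long p1, long p2, long obj_size )
{
    assert( obj_size > 0 );
    return (p1 - p2) / obj_size;
}
```

With 4-byte objects, adding 3 to address 1000 gives 1012, and subtracting address 1000 from address 1012 gives 3 again.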
The final group of expression productions is in Listing 6.102. They are pretty much
self-explanatory.
Listing 6.100. c.y Binary Operator Productions: Addition and Subtraction
760  /* binary: */
761      | binary PLUS  binary       { $$ = plus_minus( $1, '+', $3 ); }
762      | binary MINUS binary       { $$ = plus_minus( $1, '-', $3 ); }
763      | unary
764      ;
Listing 6.101. op.c Addition and Subtraction Processing
 963  value *plus_minus( v1, op, v2 )
 964  value  *v1;
 965  int    op;
 966  value  *v2;
 967  {
 968      value  *tmp;
 969      int    v1_is_ptr;
 970      int    v2_is_ptr;
 971      char   *scratch;
 972      char   *gen_op;
 973
 974      gen_op    = (op == '+') ? "+=" : "-=";
 975      v1        = gen_rvalue( v1 );
 976      v2        = gen_rvalue( v2 );
 977      v2_is_ptr = IS_POINTER( v2->type );
 978      v1_is_ptr = IS_POINTER( v1->type );
 979
 980      /* First, get all the error checking out of the way and return if
 981       * an error is detected.
 982       */
 983
 984      if( v1_is_ptr && v2_is_ptr )
 985      {
 986          if( op == '+' || !the_same_type(v1->type, v2->type, 1) )
 987          {
 988              yyerror( "Illegal types (%c)\n", op );
 989              release_value( v2 );
 990              return v1;
 991          }
 992      }
 993      else if( !v1_is_ptr && v2_is_ptr )
 994      {
 995          yyerror( "%c: left operand must be pointer", op );
 996          release_value( v1 );
 997          return v2;
 998      }
 999
1000      /* Now do the work. At this point one of the following cases exist:
1001       *
1002       *      v1:     op:     v2:
1003       *      number  [+-]    number
1004       *      ptr     [+-]    number
1005       *      ptr      -      ptr         (types must match)
1006       */
1007
1008      if( !(v1_is_ptr || v2_is_ptr) )         /* normal arithmetic */
1009      {
1010          if( !do_binary_const( &v1, op, &v2 ) )
1011          {
1012              make_types_match( &v1, &v2 );
1013              dst_opt( &v1, &v2, op == '+' );
1014              gen( gen_op, v1->name, v2->name );
1015          }
1016          release_value( v2 );
1017          return v1;
1018      }
1019      else
1020      {
1021          if( v1_is_ptr && v2_is_ptr )        /* ptr - ptr */
1022          {
1023              if( !v1->is_tmp )
1024                  v1 = tmp_gen( v1->type, v1 );
1025
1026              gen( gen_op, v1->name, v2->name );
1027
1028              if( IS_AGGREGATE( v1->type->next ) )
1029                  gen( "/=%s%d", v1->name, get_sizeof(v1->type->next) );
1030          }
1031          else if( !IS_AGGREGATE( v1->type->next ) )
1032          {
1033                                  /* ptr_to_nonaggregate [+-] number */
1034              if( !v1->is_tmp )
1035                  v1 = tmp_gen( v1->type, v1 );
1036
1037              gen( gen_op, v1->name, v2->name );
1038          }
1039          else                    /* ptr_to_aggregate [+-] number */
1040          {                       /* do pointer arithmetic        */
1041
1042              scratch = IS_LONG(v2->type) ? "r0.l" : "r0.w.low" ;
1043
1044              gen( "=",      "r1.pp", v1->name );
1045              gen( "=",      scratch, v2->name );
1046              gen( "*=%s%d", scratch, get_sizeof(v1->type->next) );
1047              gen( gen_op,   "r1.pp", scratch );
1048
1049              if( !v1->is_tmp )
1050              {
1051                  tmp = tmp_create( v1->type, 0 );
1052                  release_value( v1 );
1053                  v1 = tmp;
1054              }
1055
1056              gen( "=", v1->name, "r1.pp" );
1057          }
1058      }
1059      release_value( v2 );
1060      return v1;
1061  }
Listing 6.102. c.y High-Level Expression Processing
765  opt_expr
766      : expr          { release_value( $1 ); tmp_freeall(); }
767      | /* epsilon */
768      ;
769
770  const_expr
771      : expr  %prec COMMA
772        {
773            $$ = -1 ;
774
775            if( !IS_CONSTANT( $1->type ) )
776                yyerror( "Constant required." );
777
778            else if( !IS_INT( $1->type ) )
779                yyerror( "Constant expression must be int." );
780
781            else
782                $$ = $1->type->V_INT ;
783
784            release_value( $1 );
785            tmp_freeall();
786        }
787      ;
788
789  initializer : expr  %prec COMMA
790              | LC initializer_list RC        { $$ = $2; }
791              ;
792
793  initializer_list
794      : initializer
795      | initializer_list COMMA initializer
796        {
797            yyerror( "Aggregate initializers are not supported\n" );
798            release_value( $3 );
799        }
800      ;
6.9 Statements and Control Flow
The only part of the compiler we've yet to examine is the statement productions,
discussed in this section.
6.9.1 Simple Statements and if/else
It's best to start by looking at some examples. Table 6.26 shows input and output for
a few simple control-flow statements. Figure 6.20 shows a more complex example of
nested if/else statements. (I've shown the complete compiler output in the figure.)
The productions and subroutines that generate this code are in Listings 6.103 and
6.104. The stmt_list production at the top just assembles a list of zero or more
statements. There are no attributes. The simplest statement is defined on line 810 as a single
semicolon. There is no action. A statement can also comprise a curly-brace-delimited
compound statement (on line 811).
The next line defines a statement as an expression followed by a semicolon. The
associated action frees the value holding the result of the expression evaluation and
releases any temporary variables. Note that many expressions create unnecessary final
values because there's no way for the parser to know whether or not an expression is part
of a larger expression. For example, an assignment to a temporary is emitted as part of
processing the ++ operator in the statement:
    a++;
but that temporary is never used. It is an easy matter for an optimizer to remove this
extra assignment, which is, in any event, harmless.
The two forms of return statement are handled on lines 814 and 816 of Listing
6.103. The first is a simple return, with no value. The second takes care of the value by
copying it into the required return-value register and then releasing the associated
value and temporaries. Because returning from a subroutine involves stack-cleanup
actions, a jump to a label immediately above the end-of-subroutine code is generated
here rather than an explicit ret() instruction. The numeric part of the label is
generated by rlabel(), in Listing 6.104. The end-of-subroutine code is generated during
the reduction by
    ext_def → opt_specifiers funct_decl def_list compound_stmt
(on line 563 of Listing 6.58, page 556), which executes an rlabel(1) to increment the
numeric part of the label for the next subroutine.
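The peek-versus-advance behavior of rlabel() can be sketched as follows. The rlabel() here repeats the idea of Listing 6.104 with a prototype; `emit_return_jump()` and `emit_return_label()` are hypothetical helpers that format text into a buffer rather than calling the compiler's gen() subroutine.

```c
#include <assert.h>
#include <stdio.h>

/* Return the numeric part of the current return label, incrementing it
 * afterwards if incr is true (same idea as Listing 6.104).
 */
int rlabel( int incr )
{
    static int num;
    return incr ? num++ : num;
}

/* Hypothetical helper: what a return statement emits. The label number is
 * only examined, so every return in one subroutine targets the same label.
 */
void emit_return_jump( char *buf, size_t size )
{
    assert( buf != NULL && size > 0 );
    snprintf( buf, size, "goto RET%d;", rlabel(0) );
}

/* Hypothetical helper: what the end-of-subroutine processing emits. The
 * number is advanced so that the next subroutine gets a fresh label.
 */
void emit_return_label( char *buf, size_t size )
{
    assert( buf != NULL && size > 0 );
    snprintf( buf, size, "RET%d:", rlabel(1) );
}
```

Any number of calls to `emit_return_jump()` within one subroutine produce the same target; only the single call to `emit_return_label()` at the end of the subroutine moves on to the next number.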
Table 6.26. Simple Control-Flow: return, goto, and if/else

Input:
        return;
Output:
        goto RET0;              /* Generated by return statement      */
        ...
RET0:                           /* Generated by end-of-subroutine     */
        unlink();               /* code                               */
        ret();

Input:
        return i + j;
Output:
        W( T(1) )  = W(&_i);    /* compute i + j                      */
        W( T(1) ) += W(&_j);
        rF.w.low   = W( T(1) ); /* return value in register           */
        goto RET0;
        ...
RET0:                           /* Generated in end-of-subroutine     */
        unlink();               /* processing                         */
        ret();

Input:
        foo: ;
        goto foo;
Output:
foo:
        goto foo;

Input:
        if( i < j )
            ++i;
Output:
TST1:
        LT(W(&_i),W(&_j))       /* evaluate (i < j) and put           */
        goto T1;                /* the result into T(1)               */
F1:
        W( T(1) ) = 0;
        goto E1;
T1:
        W( T(1) ) = 1;
E1:
        EQ(W( T(1) ),0)         /* this test does loop control        */
        goto EXIT1;             /* don't execute body if false        */
        W(&_i)   += 1;          /* body of the if statement           */
        W( T(1) ) = W(&_i);
EXIT1:

Input:
        if( i < j )
            ++i;
        else
            ++j;
Output:
TST2:
        LT(W(&_i),W(&_j))       /* Evaluate (i < j) and put           */
        goto T2;                /* the result into T(1).              */
F2:
        W( T(1) ) = 0;
        goto E2;
T2:
        W( T(1) ) = 1;
E2:
        EQ(W( T(1) ),0)         /* This test does loop control.       */
        goto EXIT2;             /* Jump to else clause if false.      */
        W(&_i)   += 1;          /* Body of the if clause.             */
        W( T(1) ) = W(&_i);
        goto EL2;               /* Jump over the else.                */
EXIT2:
        W(&_j)   += 1;          /* Body of the else clause.           */
        W( T(1) ) = W(&_j);
EL2:
Figure 6.20. Nested if/else Statements

Input:
        int i, j;

        fred()
        {
            int wilma;

            if( i )
            {
                wilma=0;
                if( j )
                    wilma=1;
            }
            else
            {
                if( j )
                    wilma=2;
                else
                    wilma=3;
                wilma=4;
            }
        }

Output:
        #include <tools/virtual.h>
        #define T(x)
        SEG(data)
        SEG(bss)
        common word _i;
        common word _j;
        #define L0 0            /* fred: locals */
        #define L1 0            /* fred: temps. */
        SEG(code)
        #undef  T
        #define T(n) (fp-L0-(n*4))
        PROC(fred,public)
        link(L0+L1);
                                /* fp-2 = wilma [variable] */
TST1:
        EQ(W(&_i),0)            /* i     */
        goto EXIT1;
        W(fp-2) = 0;            /* wilma */
TST2:
        EQ(W(&_j),0)            /* j     */
        goto EXIT2;
        W(fp-2) = 1;            /* wilma */
EXIT2:
        goto EL1;
EXIT1:
TST3:
        EQ(W(&_j),0)            /* j     */
        goto EXIT3;
        W(fp-2) = 2;            /* wilma */
        goto EL3;
EXIT3:
        W(fp-2) = 3;            /* wilma */
EL3:
        W(fp-2) = 4;            /* wilma */
EL1:
        unlink();
        ret();
        ENDP(fred)
Listing 6.103. c.y Statement Processing: return, goto, and if/else
801  stmt_list
802      : stmt_list statement
803      | /* epsilon */
804      ;
805
806  /*----------------------------------------------------------------------
807   * Statements
808   */
809  statement
810      : SEMI
811      | compound_stmt
812      | expr SEMI           { release_value($1); tmp_freeall(); }
813
814      | RETURN SEMI         { gen( "goto%s%d", L_RET, rlabel(0) ); }
815
816      | RETURN expr SEMI    { gen("=", IS_INT    ($2->type) ? "rF.w.low" :
817                                       IS_POINTER($2->type) ? "rF.pp"    :
818                                                              "rF.l",
819                                                              rvalue($2) );
820                              gen( "goto%s%d", L_RET, rlabel(0) );
821                              release_value( $2 );
822                              tmp_freeall();
823                            }
824      | GOTO target SEMI    { gen( "goto", $2 ); }
825      | target COLON        { gen( ":",    $1 ); }
826        statement
827
828      | IF LP test RP statement       { gen( ":%s%d", L_NEXT, $3 );
829                                      }
830
831      | IF LP test RP statement ELSE  { gen( "goto%s%d", L_ELSE, $3 );
832                                        gen( ":%s%d",    L_NEXT, $3 );
833                                      }
834        statement                     { gen( ":%s%d",    L_ELSE, $3 );
835                                      }
836
Listing 6.104. op.c Get Numeric Part of End-of-Subroutine Label
1062  rlabel( incr )          /* Return the numeric component of the next  */
1063  {                       /* return label, postincrementing it by one  */
1064      static int num;     /* if incr is true.                          */
1065      return incr ? num++ : num;
1066  }
The goto processing on lines 824 and 825 of Listing 6.103 is similarly
straightforward. The compiler just generates jump instructions and labels as needed. The target
nonterminal is at the top of Listing 6.105. It translates the label into a string which is
returned as an attribute. Note that the action is imbedded in the middle of the
label-processing production on line 825 of Listing 6.103. If the action were at the end, then
the label would be generated after the associated statement was processed.
The next two productions, on lines 828 and 831 of Listing 6.103, handle the if and
if/else statements. The best way to understand these actions is to compare the sample
input and output in Table 6.26 and Figure 6.20. The code for the test is generated by the
test nonterminal, on line 843 of Listing 6.105, which outputs a label over the test code,
generates the test code via expr, and then generates the statement that branches out of
the test on failure. The first label isn't used here, but the same test production is used by
the loop-processing productions, below, and these productions need a way to jump back
up to the test code from the bottom of the loop. The presence of an extra label in an if
statement is harmless.
Listing 6.105. c.y Statement Processing: Tests and goto Targets
837  target : NAME       { static char buf[ NAME_MAX ];
838                        sprintf( buf, "%0.*s", NAME_MAX-2, yytext );
839                        $$ = buf;
840                      }
841         ;
842
843  test   :            { static int label = 0;
844                        gen( ":%s%d", L_TEST, $<num>$ = ++label );
845                      }
846           expr       {
847                        $$ = $<num>1;
848                        if( IS_INT_CONSTANT($2->type) )
849                        {
850                            if( !$2->type->V_INT )
851                                yyerror( "Test is always false\n" );
852                        }
853                        else                /* not an endless loop */
854                        {
855                            gen( "EQ",       rvalue($2), "0" );
856                            gen( "goto%s%d", L_NEXT, $$ );
857                        }
858                        release_value( $2 );
859                        tmp_freeall();
860                      }
861         | /* empty */ { $$ = 0;            /* no test */
862                      }
863         ;
The returned attribute is the numeric component of all labels that are associated with
the current statement. You'll need this information here to generate the target label for
the exit branch, and the same numeric component is used to process the else. For
example, the inner if/else at the bottom of Figure 6.20 uses three labels: TST3,
EXIT3, and EL3. The outer if/else uses TST1, EXIT1, and EL1. The numeric
component is generated in the test production, and the various alphabetic prefixes are defined
symbolically in label.h. (It's in Listing 6.59 on page 557.)
Finally, note that if the expr is a constant expression, no test is printed. An error
message is printed if the expression evaluates to zero because the code in the body of the
loop or if statement is unreachable; otherwise, the label is generated, but no test code is
needed because the body is always executed. This way, input like:
    while( 1 )
        ...
evaluates to:
    label:
        ...
        goto label;
The alternative would be an explicit, run-time comparison of one to zero, but there's
little point in that.
6.9.2 Loops, break, and continue
Loops are handled by the productions in Listing 6.106; there's some sample input and
output code in Table 6.27. The code generated for loops is very similar to that generated
for an if statement. The main difference is a jump back up to the test at the bottom of the
loop. The main difficulty with loops is break and continue statements, which are not
syntactically attached to the loop-control productions. break and continue are
treated just like labels by the productions on lines 915 and 923 of Listing 6.106. They
can appear anywhere in a subroutine; there's nothing in the grammar that requires them
to be inside a loop or switch. Nonetheless, you do need to know where to branch when
a break or continue is encountered, and you need to detect a break or continue
outside of a loop or switch.
The problem is solved with a few more auxiliary stacks, declared at the top of Listing
6.106. The top-of-stack item in S_brk is the numeric component of the target label for a
break statement. The alphabetic component of the label is at the top of S_brk_label.
I've used two stacks to save the trouble of calling sprintf() to assemble a physical
label. The compiler pushes a label onto the stack as part of the initial loop-control
processing (on line 866 of Listing 6.106, for example). It pops the label when the loop
processing finishes on line 871 of Listing 6.106. S_con and S_con_label do the same
thing for continue statements. If the stack is empty when a break or continue is
encountered, the statement is outside of a loop, and a semantic error message is printed.
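The two-stack scheme can be sketched as follows for break targets; continue targets would use a second pair of arrays in exactly the same way. This is an illustration under assumed names: the arrays and the `enter_loop()`, `leave_loop()`, and `break_target()` subroutines are hypothetical, and the book's code uses its own stack macros instead.

```c
#include <assert.h>

#define MAX_NEST 32             /* maximum loop nesting in this sketch */

/* Hypothetical fixed-depth stacks for break targets. */
static const char *Brk_label[ MAX_NEST ];  /* alphabetic part of the label */
static int         Brk_num  [ MAX_NEST ];  /* numeric part of the label    */
static int         Brk_depth = 0;

/* Called when loop-control processing starts. */
void enter_loop( const char *prefix, int num )
{
    assert( Brk_depth < MAX_NEST );
    Brk_label[ Brk_depth ] = prefix;
    Brk_num  [ Brk_depth ] = num;
    ++Brk_depth;
}

/* Called when the loop has been completely processed. */
void leave_loop( void )
{
    assert( Brk_depth > 0 );
    --Brk_depth;
}

/* Fetch the target of a break statement. Returns 0 (a semantic error) if
 * the break isn't inside any loop; otherwise stores the two label
 * components in *prefix and *num and returns 1.
 */
int break_target( const char **prefix, int *num )
{
    if( Brk_depth == 0 )
        return 0;

    *prefix = Brk_label[ Brk_depth - 1 ];
    *num    = Brk_num  [ Brk_depth - 1 ];
    return 1;
}
```

Because the innermost loop is always at the top of the stack, a break inside a nested loop picks up the inner loop's exit label, and the outer loop's label becomes visible again once the inner loop has been popped.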
6.9.3 The switch Statement
The final control-flow statement in the language is the switch. Switches can be
processed in several different ways. First, bear in mind that a switch is really a
vectored goto statement. Code like the following is legal, though it's bad style:
    switch( i )
    {
    case 0: if( condition )
                donald();
            else
            {
    case 1:     hewey();
                dewie();
                louie();
            }
            break;
    }
You could do the same thing with goto statements as follows:
Table 6.27. Loops: while, for, and do/while

Input:
        while( i < 10 )
        {
            break;
            continue;
        }
Output:
TST3:
        LT(W(&_i),10)           /* Evaluate (i<10) and put            */
        goto T3;                /* the result into T(1).              */
F3:
        W( T(1) ) = 0;
        goto E3;
T3:
        W( T(1) ) = 1;
E3:
        EQ(W( T(1) ),0)
        goto EXIT3;             /* Exit the loop if test fails.       */
                                /* Body of loop                       */
        goto EXIT3;             /* break                              */
        goto TST3;              /* continue                           */
        goto TST3;              /* jump back up to the test           */
EXIT3:

Input:
        do
        {
            break;
            continue;
        }
        while( --i );
Output:
DTOP1:                          /* Top-of-loop marker                 */
                                /* Body of loop:                      */
        goto DXIT1;             /* break                              */
        goto DTST1;             /* continue                           */
DTST1:
TST1:
        W(&_i)   -= 1;          /* Evaluate --i and put               */
        W( T(1) ) = W(&_i);     /* the result into T(1).              */
        EQ(W( T(1) ),0)
        goto EXIT1;             /* Exit loop if test fails.           */
        goto DTOP1;             /* Jump back to top of loop           */
DXIT1:
EXIT1:

Input:
        for( i = 0; i < 10; ++i )
        {
            break;
            continue;
        }
Output:
        W(&_i) = 0;             /* Initialization part of for         */
TST4:                           /* Top-of-loop marker                 */
        LT(W(&_i),10)           /* Evaluate (i<10) and put            */
        goto T4;                /* the result into T(1)               */
F4:
        W( T(1) ) = 0;
        goto E4;
T4:
        W( T(1) ) = 1;
E4:
        EQ(W( T(1) ),0)
        goto EXIT4;             /* Exit the loop if test fails.       */
        goto BDY4;              /* Skip over the increment            */
INC4:                           /* increment portion of for stmt      */
        W(&_i)   += 1;
        W( T(1) ) = W(&_i);
        goto TST4;              /* Jump up to the test.               */
BDY4:                           /* Body of the loop                   */
        goto EXIT4;             /* break                              */
        goto INC4;              /* continue (jump to increment)       */
        goto INC4;              /* Bottom of loop, jump to increment  */
EXIT4:
Listing 6.106. c.y Statement Processing: Loops, break, and continue
164  %{
165  /* These stacks are necessary because there's no syntactic connection between
166   * break, continue, case, default and the affected loop-control statement.
167   */
168
169  stack_dcl (S_brk,       int,    32); /* number part of current break target    */
170  stack_dcl (S_brk_label, char *, 32); /* string part of current break target    */
171
172  stack_dcl (S_con,       int,    32); /* number part of current continue targ.  */
173  stack_dcl (S_con_label, char *, 32); /* string part of current continue targ.  */
174  %}

864  /* statement: */
865
866      | WHILE LP test RP    { push( S_con, $3 ); push( S_con_label, L_TEST );
867                              push( S_brk, $3 ); push( S_brk_label, L_NEXT );
868                            }
869        statement           { gen( "goto%s%d", L_TEST, $3 );
870                              gen( ":%s%d",    L_NEXT, $3 );
871                              pop( S_con );  pop( S_con_label );
872                              pop( S_brk );  pop( S_brk_label );
873                            }
874
875      | DO                  { static int label;
876
877                              gen( ":%s%d", L_DOTOP, $<num>$ = ++label );
878                              push( S_con,       label    );
879                              push( S_con_label, L_DOTEST );
880                              push( S_brk,       label    );
881                              push( S_brk_label, L_DOEXIT );
882                            }
883        statement WHILE     { gen( ":%s%d", L_DOTEST, $<num>2 ); }
884        LP test RP SEMI     {
885                              gen( "goto%s%d", L_DOTOP,  $<num>2 );
886                              gen( ":%s%d",    L_DOEXIT, $<num>2 );
887                              gen( ":%s%d",    L_NEXT,   $7      );
888                              pop( S_con );
889                              pop( S_con_label );
890                              pop( S_brk );
891                              pop( S_brk_label );
892                            }
893      | FOR LP opt_expr SEMI
894               test SEMI    {
895                              gen( "goto%s%d", L_BODY,      $5 );
896                              gen( ":%s%d",    L_INCREMENT, $5 );
897
898                              push( S_con,       $5          );
899                              push( S_con_label, L_INCREMENT );
900                              push( S_brk,       $5          );
901                              push( S_brk_label, L_NEXT      );
902                            }
903               opt_expr RP  { gen( "goto%s%d", L_TEST, $5 );
904                              gen( ":%s%d",    L_BODY, $5 );
905                            }
906        statement           {
907                              gen( "goto%s%d", L_INCREMENT, $5 );
908                              gen( ":%s%d",    L_NEXT,      $5 );
909                              pop( S_con );
910                              pop( S_con_label );
911                              pop( S_brk );
912                              pop( S_brk_label );
913                            }
914
915      | BREAK SEMI          { if( stack_empty(S_brk) )
916                                  yyerror( "Nothing to break from\n" );
917
918                              gen_comment( "break" );
919                              gen( "goto%s%d", stack_item(S_brk_label,0),
920                                               stack_item(S_brk,0) );
921                            }
922
923      | CONTINUE SEMI       { if( stack_empty(S_con) )
924                                  yyerror( "Continue not in loop\n" );
925
926                              gen_comment( "continue" );
927                              gen( "goto%s%d", stack_item(S_con_label,0),
928                                               stack_item(S_con,0) );
929                            }
    if     ( i == 0 ) goto case0;
    else if( i == 1 ) goto case1;
    else              goto end;

    case0:  if( condition )
                donald();
            else
            {
    case1:      hewey();
                dewie();
                louie();
            }
    end: ;
The simplest, and the least efficient, method of handling switch statements translates
the switch and cases directly into a series of if/else statements. For example, code
like this:
    switch( i )
    {
    case 0: /* code */
    case 1: /* more code */
    ...
    }
can be translated into the following C-code:
            NE( i, 0 )      /* test for case 0: */
            goto SW1;
            /* code */
            goto SW2;       /* Jump around test */
    SW1:
            NE( i, 1 )      /* test for case 1: */
            goto SW3;
    SW2:
            /* more code */
The main disadvantage of this method is run-time speed; a lot of unnecessary goto
branches are needed to find the thing you're looking for. Moreover, you have to test
every case explicitly to find the default one. In practice, translation to if/else is useful
only if there are a very limited number of case statements.
One easy-to-do improvement eliminates the goto branches around the imbedded
case-processing statements by moving all the tests to the bottom of the switch. This
method is used by the current compiler, and is illustrated in Table 6.28. A jump at the
top of the switch gets you to the test code, which then jumps back up to the correct place
in the switch. A final goto branch jumps around the case-selection code if the last
case doesn't have a break in it.
Table 6.28. Vectored Goto: switch

Input:
        switch( i )
        {
        case 0:
                i = 0;
                /* fall through */
        case 1:
                i = 1;
                break;
        default:
                i = 2;
                break;
        }
Output:
        goto SW1;               /* Jump to case-processing code.      */
SW3:                            /* Case 0                             */
        W(&_i) = 0;             /* Fall through to next case.         */
SW4:                            /* Case 1                             */
        W(&_i) = 1;
        goto SW2;               /* break;                             */
SW5:                            /* default:                           */
        W(&_i) = 2;
        goto SW2;               /* break;                             */
        goto SW2;               /* Inserted by compiler in case       */
                                /* last case has no break in it.      */
SW1:                            /* Code to evaluate switch:           */
        EQ( W(&_i), 0 )
        goto SW3;               /* Jump to case 0.                    */
        EQ( W(&_i), 1 )
        goto SW4;               /* Jump to case 1.                    */
        goto SW5;               /* Jump to default case.              */
SW2:
The main reason that I'm using this approach is that limitations in C-code won't let
me use the more efficient methods. It's worthwhile discussing these other methods,
however. All of the alternate switch strategies use a table to compute the goto branches.
The first method uses a data structure called a dispatch table: an array of two-member
structures. The first member is the argument to the case statement, the second is a goto
instruction that branches to the required location in the code. The switch:
    switch( i )
    {
    case 0: washington();
    case 1: jefferson();
    case 3: adams();
    }
can be implemented with this table:
    0       goto case0;
    1       goto case1;
    3       goto case3;
and this code:
    case0:  washington();
    case1:  jefferson();
    case3:  adams();
The compiler generates code to look up the case value in the table, often using a
run-time subroutine call. If the value is found, the matching goto statement is executed,
otherwise a branch to the default case is used. The table can be sorted by case value at
compile time so the compiler can use a binary search at run time to find the proper label.
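A run-time lookup in a sorted dispatch table can be sketched as follows. Standard C has no way to store a goto target in a variable, so this sketch models each target as a function pointer; as a consequence it does not reproduce the fall-through from one case into the next. All of the names (`dispatch`, `find_case()`, `do_switch()`, `Last_case`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*case_fn)( void );

/* Hypothetical dispatch-table element. */
typedef struct dispatch
{
    int     case_val;           /* argument to the case statement */
    case_fn target;             /* code to execute for that case  */
} dispatch;

/* Binary search of a table that is sorted by case_val. Returns the
 * matching target, or dflt if val isn't in the table.
 */
case_fn find_case( const dispatch *tab, size_t n, int val, case_fn dflt )
{
    size_t lo = 0, hi = n;      /* search the half-open range [lo, hi) */

    assert( tab != NULL || n == 0 );

    while( lo < hi )
    {
        size_t mid = lo + (hi - lo) / 2;

        if     ( tab[mid].case_val == val ) return tab[mid].target;
        else if( tab[mid].case_val <  val ) lo = mid + 1;
        else                                hi = mid;
    }
    return dflt;
}

int Last_case = -2;             /* records which target ran (demonstration) */

static void washington( void ) { Last_case =  0; }
static void jefferson ( void ) { Last_case =  1; }
static void adams     ( void ) { Last_case =  3; }
static void no_match  ( void ) { Last_case = -1; }   /* default case */

static const dispatch Table[] =
{
    { 0, washington },
    { 1, jefferson  },
    { 3, adams      }
};

void do_switch( int i )
{
    find_case( Table, sizeof Table / sizeof *Table, i, no_match )();
}
```

A call such as `do_switch(3)` runs adams(), while `do_switch(2)` finds no entry and runs the default target.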
An even more efficient method uses a data structure called a jump table. A jump
table is an array of goto vectors, indexed by case value. The compiler can process the
switch with a single test and an array lookup:
    if( argument_to_switch is in range )
        goto jump_table[ argument_to_switch - value_of_smallest_case ];
    else
        goto default_case;
The table elements are adjusted so that jump_table[0] corresponds to the smallest
number used as an argument to a case statement. Holes in the table (elements for
which there are no corresponding case) are filled with jumps to the default case. Note
that, if the range of case values is less than or equal to twice the total number of case
values, the jump table will be no larger than an equivalent dispatch table.
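The construction and use of a jump table can be sketched as follows. Label numbers stand in for the goto vectors, and the `jump_table` structure, `build_jump_table()`, `jump_target()`, and the TABLE_MAX limit are all hypothetical names chosen for the illustration.

```c
#include <assert.h>

#define TABLE_MAX 64            /* largest range handled by this sketch */

typedef struct jump_table
{
    int smallest;               /* smallest case value                    */
    int size;                   /* largest case value - smallest + 1      */
    int dflt;                   /* label number of the default case       */
    int label[ TABLE_MAX ];     /* label numbers, indexed by val-smallest */
} jump_table;

/* Build the table from n (case value, label number) pairs. Holes are
 * filled with the default label.
 */
void build_jump_table( jump_table *jt, const int *vals, const int *labels,
                       int n, int dflt )
{
    int i, largest;

    assert( n > 0 );
    jt->smallest = largest = vals[0];

    for( i = 1; i < n; ++i )
    {
        if( vals[i] < jt->smallest ) jt->smallest = vals[i];
        if( vals[i] > largest      ) largest      = vals[i];
    }

    jt->size = largest - jt->smallest + 1;
    assert( jt->size <= TABLE_MAX );
    jt->dflt = dflt;

    for( i = 0; i < jt->size; ++i )         /* fill every slot, holes included */
        jt->label[i] = dflt;

    for( i = 0; i < n; ++i )                /* then overwrite the real cases   */
        jt->label[ vals[i] - jt->smallest ] = labels[i];
}

/* A single range test and one array lookup select the target. */
int jump_target( const jump_table *jt, int val )
{
    if( val < jt->smallest || val >= jt->smallest + jt->size )
        return jt->dflt;

    return jt->label[ val - jt->smallest ];
}
```

For cases 10, 11, and 13 the table has four slots; the slot for 12 is a hole that holds the default label, so no special handling is needed at lookup time.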
Though the current compiler can't use tables in the generated code, the method that it
does use is easily adapted to this approach, because it assembles a compile-time model
of the dispatch table. Listing 6.107 contains the data structures used for this purpose.
A dispatch-table element (case_val) is defined on lines three to seven. It contains
two fields: on_this is the argument to the case statement, stored as an int; go_here is
the numeric component of the label associated with this case. The alphabetic prefix is SW.
The stab structure, declared on line nine of Listing 6.107, is used to manage the
table proper. It contains the following fields:
table       The dispatch table: an array of case_val structures.
cur         Points into table at the next available slot.
name        The name field of the value structure to which the expression
            argument of the switch statement evaluates. That is, a switch is recognized
            by the following production:
                statement → SWITCH LP expr RP compound_stmt
            The expr recognizes an entire expression; all code for evaluating the
            expression will have been generated when the expr is put onto the stack
            by the parser. This code creates a single value attribute that identifies
            the temporary variable that holds the run-time result of the expression
            evaluation; the name field from this value is copied to the stab's
            name field.
def_label   The numeric component of the label associated with the default case.
            This number is 5 in the switch in Table 6.28.
stab_label  The numeric component of the labels that precede and follow the
            case-selection code. The label that follows the case-selection code has the
            value stab_label+1. This number is 1 in the switch in Table 6.28:
            SW1 precedes the selection code and SW2 follows it.
Listing 6.107. switch.h Type Definitions for switch Processing
 1  #define CASE_MAX 256        /* Maximum number of cases in a switch       */
 2
 3  typedef struct case_val     /* a single dispatch-table element           */
 4  {
 5      int on_this;            /* The N in a "case N:" statement            */
 6      int go_here;            /* Numeric component of label in output      */
 7  } case_val;                 /* code.                                     */
 8
 9  typedef struct stab         /* a switch table                            */
10  {
11      case_val *cur;                  /* pointer to next available slot in table */
12      case_val table[ CASE_MAX ];     /* switch table itself.                    */
13      char     name [ VALNAME_MAX ];  /* switch on this rvalue                   */
14      int      def_label;             /* label associated with default case      */
15      int      stab_label;            /* label at top and bottom of selector     */
16                                      /* code. Bottom label is stab_label+1.     */
17  } stab;
The compiler keeps around a stack of pointers to stab structures, declared in Listing
6.108. The size of this stack limits the number of nested switch statements that are
permitted in the input. A stab is allocated when the switch is processed, and a pointer to
the stab is pushed. The pointer is popped after the entire switch has been processed.
Every instance of a nested switch pushes a new stab pointer onto the stack, and that
pointer is popped when the compiler is done with the nested switch.
Every case statement does two things: it emits a label of the form SWn, and it adds a
new dispatch-table element for n to the stab structure at top of stack. (It modifies the
entry pointed to by the cur field in the structure at top of stack, and then increments the
cur field.) The numeric components of the label are just allocated sequentially by
incrementing Case_label, also in Listing 6.108. The code that processes a switch is in
Listings 6.108 and 6.109.
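The bookkeeping that each case performs can be sketched as follows. The structure mirrors the layout of Listing 6.107 without the name field, and `init_stab()` and `add_case_sketch()` are hypothetical stand-ins; they are not the subroutines of Listing 6.109.

```c
#include <assert.h>

#define CASE_MAX 256

typedef struct case_val
{
    int on_this;                /* the N in "case N:"            */
    int go_here;                /* numeric part of the SWn label */
} case_val;

/* Hypothetical cut-down switch table (no name field). */
typedef struct stab_sketch
{
    case_val *cur;              /* next available slot in table        */
    case_val  table[ CASE_MAX ];
    int       def_label;        /* label number of the default case    */
    int       stab_label;       /* label number above the selector code */
} stab_sketch;

void init_stab( stab_sketch *p, int label )
{
    assert( p );
    p->cur        = p->table;
    p->def_label  = 0;
    p->stab_label = label;
}

/* Fill in the slot that cur points at, then advance cur.
 * Returns 0 if the table is full, 1 otherwise.
 */
int add_case_sketch( stab_sketch *p, int on_this, int go_here )
{
    assert( p );

    if( p->cur >= &p->table[ CASE_MAX ] )
        return 0;

    p->cur->on_this = on_this;
    p->cur->go_here = go_here;
    ++p->cur;
    return 1;
}
```

Replaying the switch in Table 6.28 with a counter that starts at zero, the first increment yields 1 (the label above the selector code), the second yields 2 (the break target), and the two cases and the default then receive 3, 4, and 5.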
The only difficulty in the code is the point at which the value associated with the
argument to the switch statement is released on line 945 of Listing 6.108. This is
another compile-versus-run-time issue. The order of operations at run time is: (1) jump
to the selector code at the bottom of the switch, (2) pick out the target label, (3) jump
back up to that label. At run time, the value associated with the expr is used in step
(2) to select the goto vector, and is not used further. Consequently, it can be recycled
for use in the code that comprises the body of the switch. The selector code is not
generated until after the body is processed, however. If the value were released after the
selector code was printed, it could not be recycled in the switch body. Consequently,
it's released at the top of the switch, before the body is processed. The value's name
field is remembered in the stab structure so that the selector code can be generated
later.
Listing 6.108. c.y Statement Processing: The switch statement
175  %{
176  int Case_label = 0;                 /* Label used to process case statements. */
177
178  stack_dcl (S_switch, stab *, 32);   /* Switch table for current switch.       */
179  %}

930  /* statement: */
931      | SWITCH LP expr RP
932        {
933            /* Note that the end-of-switch label is the 2nd argument to
934             * new_stab + 1; This label should be used for breaks when in
935             * the switch.
936             */
937
938            push( S_switch, new_stab( $3, ++Case_label ) );
939            gen_comment( "Jump to case-processing code" );
940            gen( "goto%s%d", L_SWITCH, Case_label );
941
942            push( S_brk,       ++Case_label );
943            push( S_brk_label, L_SWITCH );
944
945            release_value( $3 );
946            tmp_freeall();
947        }
948        compound_stmt
949        {
950            gen_stab_and_free_table( pop( S_switch ) );
951        }
952
953      | CASE const_expr COLON
954        {
955            add_case( stack_item(S_switch,0), $2, ++Case_label );
956            gen_comment( "case %d:", $2 );
957            gen( ":%s%d", L_SWITCH, Case_label );
958        }
959
960      | DEFAULT COLON
961        {
962            add_default_case( stack_item(S_switch,0), ++Case_label );
963            gen_comment( "default:" );
964            gen( ":%s%d", L_SWITCH, Case_label );
965        }
966      ;
Listing 6.109. switch.c Switch Processing

#include <tools/debug.h>
#include "value.h"
#include "label.h"
#include "switch.h"

/*----------------------------------------------------------------------*/

PUBLIC stab *new_stab( val, label )
value *val;
int   label;
{
    /* Allocate a new switch table and return a pointer to it. Use free() to
     * discard the table. Val is the value to switch on, it is converted to
     * an rvalue, if necessary, and the name is stored.
     */

    stab *p;

    if( !(p = (stab *) malloc(sizeof(stab)) ) )
    {
        yyerror( "No memory for switch\n" );
        exit( 1 );
    }

    p->cur        = p->table;
    p->stab_label = label;
    p->def_label  = 0;
    strncpy( p->name, rvalue(val), sizeof(p->name) );
    return p;
}

/*----------------------------------------------------------------------*/

PUBLIC void add_case( p, on_this, go_here )
stab *p;
int  on_this;
int  go_here;
{
    /* Add a new case to the stab at top of stack. The 'cur' field identifies
     * the next available slot in the dispatch table.
     */

    if( p->cur > &(p->table[CASE_MAX-1]) )
        yyerror( "Too many cases in switch\n" );
    else
    {
        p->cur->on_this = on_this;
        p->cur->go_here = go_here;
        ++( p->cur );
    }
}

/*----------------------------------------------------------------------*/

PUBLIC void add_default_case( p, go_here )
stab *p;
int  go_here;
{
    /* Add the default case to the current switch by remembering its label */

    if( p->def_label )
        yyerror( "Only one default case permitted in switch\n" );

    p->def_label = go_here;
}

/*----------------------------------------------------------------------*/

PUBLIC void gen_stab_and_free_table( p )
stab *p;
{
    /* Generate the selector code at the bottom of the switch. This routine is
     * emitting what amounts to a series of if/else statements. It could just
     * as easily emit a dispatch or jump table and the code necessary to process
     * that table at run time, however.
     */

    case_val *cp;
    char     nbuf[20];

    gen( "goto%s%d", L_SWITCH, p->stab_label + 1 );
    gen( ":%s%d",    L_SWITCH, p->stab_label     );

    for( cp = p->table ; cp < p->cur ; ++cp )
    {
        gen( "EQ%s%d",   p->name,  cp->on_this );
        gen( "goto%s%d", L_SWITCH, cp->go_here );
    }

    if( p->def_label )
        gen( "goto%s%d", L_SWITCH, p->def_label );

    gen( ":%s%d", L_SWITCH, p->stab_label + 1 );
    free( p );
}
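The shape of the selector code is easier to see in a stand-alone form. The following is a minimal sketch, not the book's code: it assumes that labels print as SWn, that an EQ test prints as EQ(name,value), and it writes into a caller-supplied buffer rather than calling gen(). CASE_MAX, the append() helper, and emit_selector() are all hypothetical names introduced for the illustration.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define CASE_MAX 8              /* hypothetical limit, not the book's value */

typedef struct { int on_this; int go_here; } case_val;

typedef struct
{
    case_val  table[CASE_MAX];  /* dispatch table                      */
    case_val *cur;              /* next free slot in the table         */
    int       stab_label;       /* label of the selector code          */
    int       def_label;        /* label of default case, 0 if none    */
    char      name[16];         /* rvalue that holds the switch value  */
} stab;

/* Append formatted text at out[used]; return the new length. Output that
 * does not fit is truncated. Requires used < size.
 */
static size_t append( char *out, size_t size, size_t used, const char *fmt, ... )
{
    va_list args;
    int     n;

    va_start( args, fmt );
    n = vsnprintf( out + used, size - used, fmt, args );
    va_end( args );

    if( n < 0 )
        return used;
    return ( used + (size_t)n < size ) ? used + (size_t)n : size - 1;
}

/* Write the selector code for *p into out (size >= 1) and return out. */
char *emit_selector( const stab *p, char *out, size_t size )
{
    const case_val *cp;
    size_t          used = 0;

    out[0] = '\0';
    used = append( out, size, used, "goto SW%d;\n", p->stab_label + 1 );
    used = append( out, size, used, "SW%d:\n",      p->stab_label     );

    for( cp = p->table; cp < p->cur; ++cp )      /* one test per case */
    {
        used = append( out, size, used, "EQ(%s,%d)\n",      p->name, cp->on_this );
        used = append( out, size, used, "    goto SW%d;\n", cp->go_here );
    }

    if( p->def_label )                           /* nothing matched   */
        used = append( out, size, used, "goto SW%d;\n", p->def_label );

    append( out, size, used, "SW%d:\n", p->stab_label + 1 );
    return out;
}
```

The first goto is executed when control falls off the bottom of the switch body, so that the selector is skipped. The final label is the end-of-switch label that break statements target.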
6.10 Exercises
6.1. What are the consequences of changing the initial productions in the current C
grammar to the following:
        program      → ext_def_list
        ext_def_list → ext_def ext_def_list | ε
Would the parser still work? For how long? This exercise points out the
differences between theoretical and practical grammars.
6.2. Find all the bugs in the compiler presented in the current chapter and fix them.
6.3. Draw a system of symbol, link, and structdef structures that represents the
following global-variable definition:
        struct Montague
        {
            long            *(*Romeo[2])[5];
            short           **Juliet[10];
            struct Montague *next;
        }
        *(*(*Capulet)[10])();
6.4. Modify the compiler in the current chapter as follows:
a. Add 4-byte floats to the compiler.
b. Add 8-byte doubles to the compiler.
c. Add support for the const, volatile, and register storage classes.
d. Make structures generate lvalues: Add structure assignment and add the
ability to pass a structure to a subroutine (and to return it) by value.
e. Add run-time initialization of automatic variables.
f. Add compile-time initialization of static-local and global aggregates.
g. Add support for signed and unsigned bit fields.
h. Add support for function prototypes. A hard error should occur when the
types in a definition disagree with a prototype. A warning should be printed
(and any required type conversions be performed) when the arguments of a
call differ from those in the prototype.
You'll have to call run-time subroutines to do the floating-point operations
because floating point is not supported in C-code.
6.5. The scope of a structure in the current chapter's compiler is from the point of
declaration to the end of the input file. This is incorrect; structures should
follow the same scoping rules as variables. Fix the problem.
6.6. Though the compiler in the current chapter behaves reasonably well when it is
given legal input, it has difficulty when presented with syntactically correct, but
semantically incorrect input. Semantic errors, such as ++ being applied to a
non-lvalue or an attempt to multiply two pointers, should be handled by printing an
error message, and then pretending that legal input had been encountered by
putting a legitimate attribute onto the value stack instead of garbage. Improve the
semantic-error recovery in the current compiler.
6.7. Most code generation has been concentrated into the gen() subroutine to
facilitate portability to other target languages. Declarations are not handled by gen(),
however. Either modify gen() to handle declarations, or create a new
decl_gen() subroutine that outputs all declarations. Modify the
variable-declaration actions to use this subroutine.
6.8. The compiler currently generates the following code for i && j && k:
                EQ( W(&_i), 0 )
                    goto F1;
                EQ( W(&_j), 0 )
                    goto F1;
                EQ( W(&_k), 0 )
                    goto F1;
                goto T1;
        F1:
                W( T(1) ) = 0;
                goto E1;
        T1:
                W( T(1) ) = 1;
        E1:
This code has an unreachable goto statement above the F1 label. Modify the
compiler to generate the following code for the earlier input:
                EQ( W(&_i), 0 )
                    goto F1;
                EQ( W(&_j), 0 )
                    goto F1;
                NE( W(&_k), 0 )
                    goto T1;
        F1:
                W( T(1) ) = 0;
                goto E1;
        T1:
                W( T(1) ) = 1;
        E1:
6.9. (a) Redesign C-code so that it is a more appropriate language for communication
with a back end: the sizes of all objects should be 1, and the stack should be 1
element wide; space for local variables should not be allocated in the compiler
(information about them should be passed to the back end to do the allocation),
and so forth. Modify the compiler to emit your modified code.
(b) Write a back end that translates that code to a real assembly language.
6.10. The PostScript graphics-description language is a postfix assembly language used
by many printers. It is described in Adobe Systems' PostScript Language
Reference Manual (Reading, Mass.: Addison-Wesley, 1985). Write a C-to-PostScript
translator. You may define a subset of C for this purpose. All the standard
PostScript functions should be supported as intrinsic functions (recognized as
keywords by the compiler and processed directly).
6.11. The UNIX troff typesetting program is really an assembler. Its input is an
assembly-language description of a document made up of interspersed text and
dot commands (assembler directives). Troff translates that input into a typeset
version of that document. Design a C-like language that can describe a document,
and then implement a compiler that translates that language to troff. At minimum
you must support ints, one-dimensional arrays, subroutines (which must be able
to return int values), and if/else, for, and while statements. You should also
support a basic string type, with operators for concatenation, comparison, and
so forth. Finally, you should support an aliasing facility that lets you give
reasonable names to built-in registers and special characters. You can use recursion to
do the loops. For example, the following loop executes ten times:
        .de Lp
        .if \\$1 \{\
        executing iteration \\$1
        .nr x \\$1-1
        .Lp \\nx \}
        ..
        .Lp 10     \" This call loops ten times
6.12. Write compilers that duplicate the UNIX pic and eqn preprocessors, translating
descriptions of pictures and equations into troff commands.
6.13. Write an ANSI-C preprocessor. It must be able to handle the concatenation (##)
and stringizing (#) operators.
6.14. Write a program that reads a C input file and which outputs ANSI function
prototypes for all functions in that file. Your program must be able to handle
#include statements. Your output must be acceptable to a standard, ANSI
compiler.
6.15. Write a program that reads a C file containing old-style function definitions like
this:
        apostles( mat, mark, luke, john, fred )
        char   *mat;
        long   mark;
        double luke;
        {
        }
and which outputs C++-style definitions like this:
        apostles( char *mat, long mark, double luke, int john, int fred )
        {
        }
Note that the default int type must be supplied for john and fred.
6.16. Modify the solution to the previous exercise so that it goes in the other direction,
translating C++ definitions to old-style C definitions.
6.17. Modify the solution to the previous exercise to translate an ANSI-C input file to
pre-ANSI, K&R C. If several input files are listed on the program's command line,
they should be translated individually. Output should be written to files with a
.k&r extension. For example, foo.c should be translated, and the output placed in
foo.k&r. (Use .knr on UNIX systems so that you don't have to quote the name.)
At the very least, function definitions, string concatenation, and function-name
mapping [such as remove() to unlink()] should be handled. If the first eight
characters of an identifier are not unique, the identifier should be replaced with a
unique 8-character name. These replacements should carry from one input file to
another in the case of nonstatic global variables and subroutine names. That is, if
several C input files are specified on your translator's command line, the files
should be translated individually (not merged together), but the too-long global
symbols that are translated in one file should be translated to the same arbitrary
name in all files. Be careful to check that your arbitrary name is not already being
used in the input file for some other purpose.
Structure assignment, passing structures to functions, and the returning of
structures from a function should also be detected and translated into a form that can
be handled by a compiler that supports none of these (you can use implicit
memcpy() calls for this purpose).
6.18. Write a beautifier program that reads a randomly formatted C input file and
which outputs the file with nice indenting showing the block level, lined up
braces, and so forth. You should do intelligent things with comments, trying to
line up the /* and */ tokens in end-of-line comments in neat columns, like this:
        code                    /* comment              */
        more code               /* another comment      */
        yet more code           /* yet another comment  */
If a comment won't fit onto the end of a reformatted line, then it should be moved
to the previous line. Multiple-line comments should be output as follows:
        /* this is a
         * multiple-line
         * comment
         */
with the /* at the same indent level as the surrounding code.
6.19. One of the more useful features of C++ is function overloading. You are
permitted to declare several instances of the same function, each of which takes
arguments of different types, like this:
        overload Sophocles;
        int    Sophocles( long   Jocasta, short Oedipus ) { ... }
        double Sophocles( double Jocasta, char *Oedipus ) { ... }
The overload keyword tells the compiler that multiple definitions of a function
are permitted. The compiler determines which version of the function to call by
examining the types of the actual arguments in a given call. If the types in a call
don't match the definition exactly, the standard type conversions are used to
promote the argument to the correct type (a warning is printed if this is done). Do the
following:
(a) Modify the C compiler in the current chapter to support function overloading.
(b) Write a preprocessor that translates overloaded functions to a form that can be
handled by a standard ANSI compiler.
6.20. The Awk Programming Language is discussed in [Aho2]. Write a compiler that
converts an awk input file to C.
6.21. (a) Write a C-to-Pascal converter.
(b) Write a Pascal-to-C converter. Your program must be able to handle nested
subroutine declarations. The organization of the Pascal stack frame is discussed
in Appendix B. This is a much harder problem than (a).
6.22. (a) Write a FORTRAN-to-C converter.
(b) Write a C-to-FORTRAN converter. All C data types must be handled
properly, especially translation of C pointers to FORTRAN array indexes. This is a
much harder problem than (a).
6.23. Modify virtual.h so that the registers and run-time stack are both 16 bits wide
rather than 32 bits wide. Eliminate the lword basic type and .l register selector,
and redefine the ptr type to be 16 bits wide. Finally, modify the compiler so that
this new stack and register width is supported. The C word widths must remain
unchanged: the compiler should still support a 32-bit long int, 16-bit int,
and 8-bit char. How could you modify the compiler to make this translation
easier?
6.24. Add an autodecrement and autoincrement addressing mode to C-code indirect
modes [W(p++), WP(--p), and so forth], then modify the grammar and
associated code-generation actions so that the following inputs generate a minimum
number of C-code instructions:
        *p++    *++p    *p--    *--p
6.25. Write a program that translates C-code into your favorite assembly language.
6.26. Write a C-code interpreter that takes a C-code program as input and executes that
code directly. The following subroutines should be built into the interpreter (so
that they can be called directly from a C-code program):
        putb()   Print the low byte of the top-of-stack item to standard output (in hex).
        putw()   Print the low word of the top-of-stack item to standard output (in hex).
        putl()   Print the long word at the top of stack to standard output (in hex).
        puti()   Print the long word at the top of stack to standard output (in decimal).
        putc()   Print the low byte of the top-of-stack item to standard output as an
                 ASCII character.
        puts()   Print the string in the array whose address is at the top of stack to
                 standard output.
        getb()   Input a hex byte from standard input and return it in rF.
        getw()   Input a hex word from standard input and return it in rF.w.low.
        getl()   Input a hex long word from standard input and return it in rF.
        geti()   Input a decimal long word and put it in rF.
        getc()   Input an ASCII character from standard input and return it in rF.b.b0.
        dump()   Print the contents of all registers to standard output.
        dumps()  Print the top 20 stack items to standard output.
Execution should begin with a subroutine called main(), which must be supplied
in the source file. You may require this function to be the first (or last) one in the
file if you wish.
6.27. If you haven't done so already, make the interpreter in the previous exercise into a
window-based system. One window displays the code as it's being executed, a
second displays the register and stack contents, a third shows the contents of
selected static variables (entered by you at a command prompt), and a fourth
shows the C input line that corresponds to the output being executed. Use line
numbers that are inserted into the output by the lexical analyzer for this last
function.
6.28. If you haven't done so already, modify the interpreter to support the following
breakpoint types:
        Break when a specified source-code line is executed.
        Break when a specified subroutine is called.
        Break when a specified global variable is modified or takes on a specific
        value.
        Break when a specified C-code instruction is executed.
6.29. If you haven't done so already, modify the interpreter to display the contents of
both global and local variables by specifying their name at a command prompt. A
local symbol should be displayed only when executing a subroutine that contains
that symbol. You will need to pass symbol-table information between the
compiler and interpreter to do this. Use an auxiliary file.
This chapter looks at optimization strategies at a high level. It is not intended to be
an in-depth discussion of the topic, something that would take a book of its own to
cover adequately. The basic types of optimizations are described, however, and
optimization techniques are discussed in a general way. The way that the choice of
intermediate code affects optimization is also discussed.
Optimizations are easily divided into three categories: parser optimizations; linear,
peephole optimizations; and structural optimizations. I'll look at these three categories
one at a time.
The first category includes all optimizations that can be done by the parser itself.
The simplest of these really come under the category of generating good code to begin
with: using logical lvalues rather than physical ones, minimizing the number of goto
branches, and so forth. The other common parser optimization is intrinsic-function
generation. Intrinsic function calls are translated directly by the compiler into code that
does the action normally performed by the function. For example, strcpy() is often
implemented as an intrinsic function. The lexeme strcpy is recognized by the
compiler as a keyword, and there is a production of the form:
        expr : STRCPY LP expr COMMA expr RP
in the grammar. The associated action generates code to copy one string to another
rather than generating code to call the strcpy() function. Intrinsic functions are
particularly useful with small workhorse functions, a call to which can often have a higher
overhead than the code that does the work. Other common intrinsic functions are the
math functions like sin(), cos(), and sqrt().
Some of the optimizations discussed below can also be done directly by the parser
(such as using a left shift to implement multiplication by a constant).
658 Optimization Strategies Chapter 7
The optimizations that can be done directly by the parser are, by necessity, limited in
scope to single productions. It's difficult for the parser to optimize any code that takes
more than one production to process. For example, the following compiler input:
        int i;
        ...
        i = 5;
        ++i;
        return i + 1;
generates this output:
        W(&_i)     = 5;
        W(&_i)    += 1;
        W( T(1) )  = W(&_i);
        W( T(1) )  = W(&_i);
        W( T(1) ) += 1;
        rF.w.low   = W( T(1) );
        ret();
The two identical assignments on the third and fourth lines are unavoidable because they
are generated by two separate statements: the first by the ++i and the second by the
i+1. It's easy for a separate optimizer pass to recognize this situation, however, and
eliminate the redundant assignment.
This section looks at various optimizations that cannot be done in the parser, but
which can be done by an auxiliary optimizer pass that goes through the compiler's
output in a linear fashion, from top to bottom. (It may have to go through the output several
times, however.) This kind of optimizer is called a peephole optimizer: "peephole"
because it examines small blocks of contiguous instructions, one block at a time, as if the
code were being scrolled past a window and only the code visible in the window could
be manipulated. The optimizer scans the code looking for patterns, and then makes
simple replacements. Peephole optimizers are usually small, fast programs, needing little
memory to operate.
The kind of optimizations that you intend to do is a major factor in deciding the type
of intermediate language that the compiler should generate. A peephole optimizer is
happiest working with triples or quads. I'll use the triples generated by the compiler in
the last chapter for examples in the following discussion.
7.2.1 Strength Reduction
A strength reduction replaces an operation with a more efficient operation or series
of operations that yield the same result in fewer machine clock cycles. For example,
multiplication by a power of two can be replaced by a left shift, which executes faster on
most machines. (x*8 can be done with x<<3.) You can divide a positive number by a
power of two with a right shift (x/8 is x>>3 if x is positive) and do a modulus division by
a power of two with a bitwise AND (x%8 is x&7).
Other strength reductions are less obvious. For example, multiplication by small
numbers can be replaced by multiple additions: t0 *= 3 can be replaced with
        t1  = t0;
        t0 += t1;
        t0 += t1;
Combinations of shifts and additions can be used for multiplication by larger numbers:
t0 *= 9 can be replaced by:
        t1   = t0;
        t1 <<= 3;
        t0  += t1;
[That is, t0 x 9 = (t0 x 8) + t0 = (t0 << 3) + t0.] Larger numbers can also be handled this
way: t0 *= 27 can be replaced by:
        t1   = t0;
        t1 <<= 1;
        t0  += t1;
        t1 <<= 2;
        t0  += t1;
        t1 <<= 1;
        t0  += t1;
You can see what's going on by looking at how binary arithmetic is performed. A
binary multiplication is done just like a decimal multiplication. 27 (decimal) is 11011
(binary), so the multiplication is done like this:
                              d d d d d d d d
                          x   0 0 0 1 1 0 1 1
                          -------------------
                              d d d d d d d d
                            d d d d d d d d
                          0 0 0 0 0 0 0 0
                        d d d d d d d d
                      d d d d d d d d
                    0 0 0 0 0 0 0 0
                  0 0 0 0 0 0 0 0
              + 0 0 0 0 0 0 0 0
If there's a 1 in the multiplier, then the multiplicand, shifted left by a number of bits
corresponding to the position of the 1 in the multiplier, is added to the product. At worst,
you need as many shift/add steps as there are bits in the multiplier. It's up to the
optimizer to determine at what point it takes longer to do the shifts and additions than it
does to use a multiplication instruction, but on many machines the shift/add strategy
is faster for all but very large numbers. This last optimization is a classic example of a
code-size-versus-execution-speed trade-off. The faster addition-and-shift code is much
larger than the equivalent multiplication instruction.
There are also nonarithmetic strength reductions. For example, many machines have
several different forms of jump or goto instruction, and the optimizer can sometimes
modify the code so that the more efficient instruction can be used.
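The shift-and-add scheme can be checked with a few lines of C. This is a minimal sketch under stated assumptions: the function name mul_shift_add() is hypothetical, and the loop runs at run time purely for illustration. A real optimizer would walk the bits of the constant at compile time and emit one shift or add instruction per step instead.

```c
#include <stdint.h>

/* Multiply x by k using only shifts and additions. Each 1 bit in the
 * multiplier k adds a copy of the multiplicand, shifted left by that
 * bit's position, to the product. Arithmetic is modulo 2^32, as it is
 * for an ordinary unsigned multiply.
 */
uint32_t mul_shift_add( uint32_t x, uint32_t k )
{
    uint32_t product = 0;

    while( k != 0 )
    {
        if( k & 1 )         /* low bit of multiplier set: add this row  */
            product += x;

        x <<= 1;            /* multiplicand for the next bit position   */
        k >>= 1;            /* examine the next multiplier bit          */
    }
    return product;
}
```

The loop body executes once for every bit up to the highest 1 bit in the multiplier, which is the worst-case step count described above.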
7.2.2 Constant Folding and Constant Propagation
We've already seen constant folding at work in the last chapter, because the parser
itself can do it in a limited way. Put simply, the compiler itself does any arithmetic that
involves constants, if it can. An expression like a + 2 * 3 is treated like (a + 6).
Order of evaluation can prevent the parser from doing constant folding. For
example, the parser can't optimize the following input because of the left-to-right order of
evaluation:
        a + 1 + 3
The parser processes the a+1 first and puts the result into a temporary variable. The
three is then added to the temporary. The following code is generated:
        W( T(1) )  = W(&_i);
        W( T(1) ) += 1;
        W( T(1) ) += 3;
It's easy for an independent optimizer to see that T(1) is modified twice in succession
without being used between the modifications, however, and replace the foregoing with:
        W( T(1) )  = W(&_i);
        W( T(1) ) += 4;
Multiplication by one, addition and subtraction of zero, and shift by zero can be
eliminated entirely.
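A peephole pass for this particular pattern is short. The sketch below is an illustration only: the instr structure, the opcode names, and fold_adds() are hypothetical and are not the triples used by the compiler in the previous chapter. It merges adjacent "dest += constant" instructions that have the same destination and drops additions of zero.

```c
#include <string.h>

typedef enum { OP_ASSIGN, OP_ADD_CONST } opcode; /* dest = src;  dest += k; */

typedef struct
{
    opcode op;
    char   dest[16];
    char   src[16];     /* used by OP_ASSIGN only    */
    int    k;           /* used by OP_ADD_CONST only */
} instr;

/* Fold runs of "dest += k" on the same dest in place, and discard
 * additions of zero. Returns the new instruction count.
 */
int fold_adds( instr *code, int n )
{
    int in, out = 0;

    for( in = 0; in < n; ++in )
    {
        if(    code[in].op    == OP_ADD_CONST
            && out > 0
            && code[out-1].op == OP_ADD_CONST
            && strcmp( code[out-1].dest, code[in].dest ) == 0 )
        {
            code[out-1].k += code[in].k;    /* merge into previous add  */
            if( code[out-1].k == 0 )
                --out;                      /* the sum is a no-op       */
        }
        else if( code[in].op == OP_ADD_CONST && code[in].k == 0 )
            ;                               /* adding zero: discard     */
        else
            code[out++] = code[in];
    }
    return out;
}
```

Given the three instructions generated for a + 1 + 3, the pass returns a count of 2 and leaves a single addition of 4 behind the initial assignment.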
Constant folding is actually a simple case of a more general optimization called
constant propagation. Many variables retain constant values over a large portion of their
lifetimes. The compiler can note when a constant is assigned to a variable and use that
constant rather than the variable when the variable is referenced in the code. The
constant propagates until the variable is modified. For example, code like this:
        _y = 5;
        ...
        _x = _y;
can be replaced with
        _y = 5;
        _x = 5;
(Assignment of a constant to a variable is a more efficient operation than a
memory-to-memory copy on most machines.) At a higher level, the loop in the following code:
        int i, j;
        for( i = 5, j = 0; j < i; ++j )
            foo( i );
can be treated as if the compiler had seen the following:
        for( i = 5, j = 0; j < 5; ++j )
            foo( 5 );
The optimizer is, in effect, keeping track of the contents of all variables that contain
constants. It keeps a local symbol table with an entry for each active variable, and that
internal entry is modified to track modifications in the code. The internal value is
modified every time the variable is modified. For example, this code:
        t0  = 1;
        t0 += 5;
        t1  = t0;
is translated to this:
        t0 = 1;
        t1 = 6;
The compiler initializes an internal copy of t0 when the first line is encountered. It then
modifies that internal copy by adding five to it when the t0 += 5 is encountered and
discards the t0 += 5 instruction. Finally, the modified value is used when the substitution is
made on the third input line.
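The bookkeeping just described can be sketched in C. Everything here is an assumption made for illustration: temporaries are identified by small integers, only three instruction forms exist, and the names instr, NTEMPS, and propagate() are hypothetical. The sketch also assumes straight-line code in which every later use of a known temporary is a copy that the pass can rewrite, which is what makes it safe to discard the additions.

```c
#define NTEMPS 8        /* hypothetical number of temporaries */

typedef enum { SET_CONST, ADD_CONST, COPY } opcode;

typedef struct
{
    opcode op;
    int    dest;        /* destination temporary, 0 to NTEMPS-1        */
    int    arg;         /* constant, or source temporary for COPY      */
} instr;

/* Propagate constants through a straight-line instruction sequence,
 * rewriting it in place. Returns the new instruction count.
 */
int propagate( instr *code, int n )
{
    int known[NTEMPS] = { 0 };  /* nonzero if the temporary holds a constant */
    int value[NTEMPS] = { 0 };  /* the optimizer's internal copy             */
    int in, out = 0;

    for( in = 0; in < n; ++in )
    {
        instr i = code[in];

        switch( i.op )
        {
        case SET_CONST:                     /* start tracking dest       */
            known[i.dest] = 1;
            value[i.dest] = i.arg;
            break;

        case ADD_CONST:
            if( known[i.dest] )             /* update internal copy and  */
            {                               /* discard the instruction   */
                value[i.dest] += i.arg;
                continue;
            }
            break;

        case COPY:
            if( known[i.arg] )              /* substitute the constant   */
            {
                i.op          = SET_CONST;
                i.arg         = value[i.arg];
                known[i.dest] = 1;
                value[i.dest] = i.arg;
            }
            else
                known[i.dest] = 0;          /* dest is no longer constant */
            break;
        }
        code[out++] = i;
    }
    return out;
}
```

Running the pass on the three-instruction example leaves two instructions, the second of which assigns the constant 6.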
7.2.3 Dead Variables and Dead Code
One type of optimization often leads to another. In the previous example, t0 can be
discarded after the constant propagation because it is not used anywhere. t0 is a dead
variable: a variable that is initialized (and perhaps modified), but never referenced in a
function call or to the right of an equals sign. The variable is considered dead from the
last time it is used until it is reinitialized, like this:
        t0  = _a;
        t0 += 5;
        _x  = t0;
                        /* t0 is now dead                       */
        t0 += 1;        /* This instruction can be eliminated.  */
        t0  = _b;       /* t0 is now resurrected                */
At a higher level, the i in the following code is a dead variable; all references to it can
be eliminated:
        foo( x )
        {
            int i;
            i = x;
            ++i;
        }
A dead assignment is an extreme case of a dead variable. A dead assignment is an
assignment to a variable that is never used or modified. The initial assignment of t0=_a
in the earlier example is not a dead assignment because t0 is used later on in the code.
Nonetheless, dead assignments are very common in the output from the compiler in the
previous chapter because of the way that expressions are processed. All expressions
evaluate to something, and there is often a copy into an rvalue as the last step of the
expression evaluation. Code like this:
        int x;
        ++x;
generates the following output:
        W( fp-2 ) += 1;            /* x */
        W( T(1) )  = W( fp-2 );
The final assignment to T(1) is a dead assignment because T(1) is never used. At a
higher level, the earlier constant propagation example translated this code:
        for( i = 5, j = 0; j < i; ++j )
into this:
        for( i = 5, j = 0; j < 5; ++j )
The i=5 is now a dead assignment and can be eliminated.
Dead code is code that can't be reached or does nothing useful. For example, code
generated by the following input can be removed entirely:
        if( 0 )
            do_something();
This input is translated to the following C-code:
                EQ( 0, 0 )
                    goto label;
                call( _do_something );
                W( T(1) ) = rF.w.low;
        label:
It is optimized in two steps. First, the dead assignment on the fourth line can be
eliminated, yielding:
                EQ( 0, 0 )
                    goto label;
                call( _do_something );
        label:
Since the EQ always evaluates true, all code between the following goto and the label
can be eliminated because it's unreachable. Similar optimizations involve the
elimination of useless code (like the
                goto F1;
        F1:
which is generated at the end of a list of && operators). The following, more complex
example takes two passes to optimize:
                t0 = 0;
                NE( t0, 0 )
                    goto label;
                ...
        label:
The first pass folds the constant 0 into t0, yielding this:
                NE( 0, 0 )
                    goto label;
                ...
        label:
The test now always fails, so it and the following goto instruction can both be
eliminated.
Had we started with the following code:
                t0 = 1;
                NE( t0, 0 )
                    goto label;
                ...
        label:
the constant would fold to:
                NE( 1, 0 )
                    goto label;
                ...
        label:
which always tests true. Consequently the NE, the goto, and all code up to the label are
dead and can be discarded. Note that the dead-code elimination is affected if there is a
second entry point into the block. Input like the following:
        switch( hitter )
        {
        case eddie_murray:  x = 1;
                            while( x )
                            {
        case tony_phillips:     ...
                            }
        }
(it's sick,[1] but it's legal) could generate output like this:
                _x = 1;
                NE( W(&_x), 0 )
                    goto label;
                ...                     /* Remove only this code. */
        label2:
                ...
        label:
The constant 1 propagates into the test, resulting in dead code, but only the code up to
label2 can be eliminated because the switch could branch to the interior label. A
region of code that has a single entry point and a single exit point (everything from a
label or PROC directive up to the next goto or ret() instruction) is called a basic block,
and some optimizations, such as the current one, can be performed only within the
confines of a basic block.
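The rule that elimination must stop at the next label is easy to express in code. The following is a minimal sketch, assuming that an earlier pass has already turned every always-true test into an unconditional goto. The instr structure and the function remove_unreachable() are hypothetical names used only for this illustration.

```c
typedef enum { I_LABEL, I_GOTO, I_OTHER } kind;

typedef struct
{
    kind k;
    int  id;    /* label number for I_LABEL and I_GOTO; otherwise a tag */
} instr;

/* Delete, in place, every instruction that follows an unconditional
 * goto and precedes the next label. A label is a possible entry point,
 * so code is assumed to be reachable again from there on. Returns the
 * new instruction count.
 */
int remove_unreachable( instr *code, int n )
{
    int in, out = 0;
    int reachable = 1;

    for( in = 0; in < n; ++in )
    {
        if( code[in].k == I_LABEL )
            reachable = 1;          /* something may jump here          */

        if( !reachable )
            continue;               /* dead: drop the instruction       */

        code[out++] = code[in];

        if( code[in].k == I_GOTO )
            reachable = 0;          /* nothing falls through a goto     */
    }
    return out;
}
```

Applied to a sequence shaped like the switch example (a goto, two ordinary instructions, an interior label, another instruction, and the target label), the pass removes only the two instructions in front of the interior label.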
As we just saw, dead variable and code elimination can interact with constant
propagation in interesting ways. The i in the following code isn't a dead variable, at least
initially:
        foo()
        {
            int i, j, array[10];
            i = 5;
            ++i;
            j = array[i];
        }
The compiler can propagate the constant 5 through to the array access, however:
        foo()
        {
            int i, j, array[10];
            i = 5;
            ++i;
            j = array[6];
        }
Having done that, the i is now dead and can be eliminated:
        foo()
        {
            int j, array[10];
            j = array[6];
        }
A form of dead-assignment elimination also applies to nonconstants. Code like the
following, which is very common compiler output:
Basic block.
Interaction between op
timizations.
Nonconstant dead as
signments.
1. Thats a technical term.
    t0 = x;
    y  = t0;
can be replaced by:
    y = x;
unless x is modified before t0 is used. For example, this code:
    t0 = _x;
    _y = t0;
    t0 = z;
can be optimized to:
    _y = _x;
    t0 = z;
but the following cannot be optimized because _x is modified before t0 is used:
    t0  = _x;
    _x += 1;
    _y  = t0;
All the foregoing optimizations can be a real headache when you're interfacing to
hardware. For example, an 8-bit, memory-mapped I/O port at physical address 0x10 can
be modeled as follows on many machines:
    char *port;
    port = (char *) 0x10;
Thereafter, you can access the port with *port. The following code, for example, is
intended to pulse the low bit of the output port at periodic intervals:
    *port = 0;          /* initialize output port to all zeros */
    while( *port )      /* read input port, terminate if data available */
    {
        *port = 1;      /* pulse the low bit of the output port */
        *port = 0;
        delay();
    }
Unfortunately, many optimizers eliminate all of the previous code. The *port = 1 is a
dead assignment because *port is modified again before the value 1 is used, so the first
optimization results in the following code:
    *port = 0;
    while( *port )
    {
        *port = 0;
        delay();
    }
Next, since the value of *port never changes (the same constant is always assigned to
it), constant propagation takes over and the following code is produced:
2. Or, better yet, with:
    char *const port = (char *) 0x10;
The const says that port itself won't change value. You can also use a macro:
    #define port ((char *) (0x10))
    *port = 0;
    while( 0 )
        delay();
The initial assignment to *port can now be eliminated because *port isn't used
anywhere, and the loop can be discarded because its body is unreachable.
The ANSI volatile keyword is used to suppress these optimizations. No assumptions
are made as to the value of any variable that's declared volatile; the compiler
assumes that the value can change at any time, even if no explicit code modifies it. You
can suppress all optimizations in the previous example by declaring port as follows:
    volatile char *port;
or better yet:
    volatile char *const port = (volatile char *) 0x10;
This declaration says that the object at which port points is liable to change without
notice, but that port itself is a constant whose value never changes.
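The effect of volatile can be tried on a hosted system with a rough sketch like the following. It assumes an ordinary variable (fake_port) standing in for the memory-mapped register, and a hypothetical helper, pulse_port(), that is not part of the book's code. Because the pointed-to object is volatile, a conforming compiler has to perform every store, so neither assignment is treated as dead.

```c
#include <assert.h>

/* Stand-in for a memory-mapped register. On real hardware the pointer
 * would be something like (volatile unsigned char *) 0x10 instead.
 */
volatile unsigned char fake_port;

/* Hypothetical helper: pulse the low bit of *port `count` times and
 * return the number of stores performed. The volatile qualifier on the
 * pointed-to object keeps the optimizer from discarding either store.
 */
int pulse_port(volatile unsigned char *port, int count)
{
    int stores = 0;

    assert(port != 0);
    while (count-- > 0) {
        *port = 1;          /* observable store, not a dead assignment */
        ++stores;
        *port = 0;
        ++stores;
    }
    return stores;
}
```

Calling pulse_port(&fake_port, 3) performs six stores and leaves the stand-in port holding zero.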
7.2.4 Peephole Optimization: An Example
I'll demonstrate the mechanics of peephole optimization with a simple example. The
input:
    i = 5;
    ++i;
    return i + 1;
generates this output:
    W(&_i)     = 5;
    W(&_i)    += 1;
    W( T(1) )  = W(&_i);
    W( T(1) )  = W(&_i);
    W( T(1) ) += 1;
    rF.w.low   = W( T(1) );
    ret();
This output has several inefficiencies built into it, all of which can be eliminated by a
peephole optimizer. The optimizer must operate on a basic block: The current code has
no labels in it, and a single exit point at the bottom, so it comprises a basic block.
The first optimizations to apply are constant folding and propagation. The approach
is straightforward: Go through the compiler output performing those operations that you
can. You'll need a symbol table, accessed by variable name, that holds the contents of
that variable insofar as it can be determined.
The optimizer's first pass evaluates expressions, if possible, modifying the table's
contents to reflect the computed value. It also substitutes any reference to a variable that
holds a constant value with that constant. Here, the optimizer reads the first line, creates
a symbol table entry for _i and initializes it to 5. It now reads the second line. There's
already an entry for _i in the table, and _i is being modified by a constant.
Consequently, it increments the symbol-table entry from five to six, and modifies the code
as follows:
3. The cast can be done like this:
    #define port ((volatile char *) 0x10)
    W(&_i)     = 5;
    W(&_i)     = 6;
    W( T(1) )  = W(&_i);
    W( T(1) )  = W(&_i);
    W( T(1) ) += 1;
    rF.w.low   = W( T(1) );
Now it reads the third and fourth lines. Seeing that _i holds a constant, it replaces the
reference to _i with that constant as follows:
    W(&_i)     = 5;
    W(&_i)     = 6;
    W( T(1) )  = 6;
    W( T(1) )  = 6;
    W( T(1) ) += 1;
    rF.w.low   = W( T(1) );
It also makes a symbol-table entry for T(1), initializing it to 6. Reading the fifth line, it
sees a variable that holds a constant (T(1)) being modified by a constant, so it adjusts
the internal symbol table and the output as before, yielding:
    W(&_i)     = 5;
    W(&_i)     = 6;
    W( T(1) )  = 6;
    W( T(1) )  = 6;
    W( T(1) )  = 7;
    rF.w.low   = W( T(1) );
Reading the final line, since T(1) holds a constant, it can be replaced with its value
(7):
    W(&_i)     = 5;
    W(&_i)     = 6;
    W( T(1) )  = 6;
    W( T(1) )  = 6;
    W( T(1) )  = 7;
    rF.w.low   = 7;
Though there are still the same number of operations as before, they've all been
translated to simple assignments.
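The first pass can be sketched in a few dozen lines. This is only an illustration under simplifying assumptions, not the book's optimizer: every instruction is reduced to a destination, an operator ('=' or '+' for +=), and either a source variable or a constant, and the names insn, sym, find(), and fold_constants() are all hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { MAXSYM = 8, NAMELEN = 12 };

typedef struct {
    char dst[NAMELEN];  /* destination operand                    */
    char op;            /* '=' for assignment, '+' for +=         */
    char src[NAMELEN];  /* source variable, or "" for a constant  */
    int  value;         /* the constant, when src is ""           */
} insn;

typedef struct { char name[NAMELEN]; int value; int known; } sym;

static sym table[MAXSYM];
static int nsyms;

/* Find a symbol, creating an "unknown value" entry if it isn't there. */
static sym *find(const char *name)
{
    int i;
    for (i = 0; i < nsyms; ++i)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    assert(nsyms < MAXSYM);
    strcpy(table[nsyms].name, name);
    table[nsyms].known = 0;
    return &table[nsyms++];
}

/* Fold and propagate constants through one basic block, in place. */
void fold_constants(insn *code, int n)
{
    int i;
    nsyms = 0;
    for (i = 0; i < n; ++i) {
        insn *p = &code[i];
        sym  *d = find(p->dst);

        if (p->src[0]) {                    /* source is a variable      */
            sym *s = find(p->src);
            if (!s->known) {                /* can't say what dst holds  */
                d->known = 0;
                continue;
            }
            p->value  = s->value;           /* propagate the constant    */
            p->src[0] = '\0';
        }
        if (p->op == '+') {                 /* dst += constant           */
            if (!d->known)
                continue;
            p->value += d->value;           /* fold into an assignment   */
            p->op = '=';
        }
        d->value = p->value;
        d->known = 1;
    }
}

/* The six instructions from the example (the ret() is omitted). */
insn example[6] = {
    { "W(&_i)",   '=', "",        5 },
    { "W(&_i)",   '+', "",        1 },
    { "W(T(1))",  '=', "W(&_i)",  0 },
    { "W(T(1))",  '=', "W(&_i)",  0 },
    { "W(T(1))",  '+', "",        1 },
    { "rF.w.low", '=', "W(T(1))", 0 },
};
```

After fold_constants(example, 6), every entry is a plain assignment of a constant, with the values 5, 6, 6, 6, 7, and 7, matching the walk-through above.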
The output from the first pass is now simplified further by going through it a second
time, eliminating all dead variables (variables that aren't used in the current block). The
strategy is, again, straightforward: Clear the symbol table, then go through the code,
creating entries for all variables that appear as a source operand. Make a second pass
eliminating all variables whose destination operand is not in the table. Since T(1) is not
used in the source fields of any of the triples, all assignments to it are eliminated,
yielding:
    W(&_i)   = 5;
    W(&_i)   = 6;
    rF.w.low = 7;
The assignments to _i cannot be eliminated because the _i could be used later on in the
code.
The final optimization eliminates dead assignments, starting with an empty symbol
table and proceeding as follows:
(1) If a variable in the symbol table is modified by some instruction other than
    assignment, remove it from the table.
(2) When a variable appears as the destination in an assignment:
    a. If it's not in the table, add it, remembering the line number of the assignment
       instruction.
    b. If the variable is in the table, discard the line referenced in the symbol-table
       entry, and then replace that symbol-table reference with one for the current
       line.
In the current example, the first assignment to _i is a dead assignment because _i is
reinitialized before being used, so it is eliminated:
    W(&_i)   = 6;
    rF.w.low = 7;
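The two rules above translate almost directly into code. The following is a sketch only: the insn layout and the names find(), forget(), and kill_dead_assignments() are hypothetical. It also adds one rule that the list doesn't state, as an assumption of mine: a variable that is read as a source operand is removed from the table, so that an assignment whose value is actually used is never discarded.

```c
#include <assert.h>
#include <string.h>

enum { MAXSYM = 8, NAMELEN = 12 };

typedef struct {
    char dst[NAMELEN];  /* destination operand                    */
    char op;            /* '=' for assignment, '+' for +=         */
    char src[NAMELEN];  /* source variable, or "" for a constant  */
    int  dead;          /* set to 1 if the pass discards the line */
} insn;

static char names[MAXSYM][NAMELEN];  /* variables with a pending assignment */
static int  lines[MAXSYM];           /* line of that pending assignment     */
static int  nsyms;

static int find(const char *name)
{
    int i;
    for (i = 0; i < nsyms; ++i)
        if (strcmp(names[i], name) == 0)
            return i;
    return -1;
}

static void forget(const char *name)
{
    int i = find(name);
    if (i < 0)
        return;
    --nsyms;
    if (i != nsyms) {                /* move the last entry into the hole */
        strcpy(names[i], names[nsyms]);
        lines[i] = lines[nsyms];
    }
}

/* Mark dead assignments in one basic block; return how many were found. */
int kill_dead_assignments(insn *code, int n)
{
    int i, k, killed = 0;

    nsyms = 0;
    for (i = 0; i < n; ++i) {
        code[i].dead = 0;
        if (code[i].src[0])          /* assumed extra rule: value is read */
            forget(code[i].src);
        if (code[i].op != '=') {     /* rule 1: modified, not assigned    */
            forget(code[i].dst);
            continue;
        }
        k = find(code[i].dst);
        if (k < 0) {                 /* rule 2a: remember this line       */
            assert(nsyms < MAXSYM);
            strcpy(names[nsyms], code[i].dst);
            lines[nsyms++] = i;
        } else {                     /* rule 2b: earlier line is dead     */
            code[lines[k]].dead = 1;
            lines[k] = i;
            ++killed;
        }
    }
    return killed;
}

/* The three assignments left after dead-variable elimination. */
insn example[3] = {
    { "W(&_i)",   '=', "", 0 },      /* W(&_i)   = 5; */
    { "W(&_i)",   '=', "", 0 },      /* W(&_i)   = 6; */
    { "rF.w.low", '=', "", 0 },      /* rF.w.low = 7; */
};
```

Running kill_dead_assignments(example, 3) marks only the first line as dead, which is the result shown above.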
7.3 Structural Optimizations
So far, all the optimizations that we've looked at can be performed by analyzing the
linear output stream of the compiler. Some optimizations must analyze the overall
structure of the code, however, and more sophisticated techniques must be used. All of
these optimizations need to know something about the structure of the code being optimized.
Consequently, they must work on a parse or syntax tree that represents the code, not on a
series of instructions. One solution to this problem is for the parser to build a physical
parse tree with structures and pointers rather than generating code; the parse tree can
then be optimized, and finally traversed depth first to do the code generation. There are
several difficulties with this approach. First, the optimizer can get pretty big, and there
is often insufficient room in memory for it to be combined with the parser, so two
independent programs are needed. Similarly, the more code that you can get into memory at
once, the more of the program's structure can be seen. There's no point in wasting
memory for a parser. Finally, the entire parse tree isn't needed for optimization; a
syntax tree representing the various statements and expressions is sufficient. The usual
solution to the problem is for the parser to generate an intermediate language from which
the syntax tree can be reconstructed, and that intermediate code is processed by the
optimizer.
7.3.1 Postfix and Syntax Trees
The intermediate language most appropriate for this purpose is postfix or Reverse-
Polish Notation (RPN). Users of Hewlett-Packard calculators and UNIX's dc desk-
calculator program will already be familiar with this representation. In postfix notation,
operands are pushed onto a stack without modification. Operators access the top few
items on the stack, replacing them with the result of the operation. For example, the
following C fragment
    A * B + C * D
can be evaluated using the following postfix operations:
    push A
    push B
    pop two items, multiply them, push the result
    push C
    push D
    pop two items, multiply them, push the result
    pop two items, add them, push the result
The result is on the top of the stack when the evaluation finishes. The two
multiplications have to be done before the addition because * has higher precedence than +.
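The stack discipline just described is easy to try out. The following is a minimal sketch of a postfix evaluator, assuming space-separated, non-negative integer operands in place of the variables A through D; eval_postfix() is a hypothetical name, not something from the book's library.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Evaluate a space-separated postfix expression made up of non-negative
 * integers and the binary operators + - * /. Malformed input trips an
 * assert() rather than being diagnosed.
 */
long eval_postfix(const char *s)
{
    long stack[32];
    int  sp = 0;                        /* number of items on the stack */

    while (*s) {
        if (isspace((unsigned char)*s)) {
            ++s;
            continue;
        }
        if (isdigit((unsigned char)*s)) {       /* operand: push it */
            char *end;
            assert(sp < 32);
            stack[sp++] = strtol(s, &end, 10);
            s = end;
            continue;
        }

        assert(sp >= 2);                /* operator: pop two, push one */
        long rhs = stack[--sp];
        long lhs = stack[--sp];

        switch (*s++) {
        case '+': stack[sp++] = lhs + rhs; break;
        case '-': stack[sp++] = lhs - rhs; break;
        case '*': stack[sp++] = lhs * rhs; break;
        case '/': assert(rhs != 0);
                  stack[sp++] = lhs / rhs; break;
        default:  assert(!"unknown operator");
        }
    }
    assert(sp == 1);                    /* the result is on top of the stack */
    return stack[0];
}
```

With A=2, B=3, C=4, and D=5, the expression A*B+C*D becomes the string "2 3 * 4 5 * +", which evaluates to 26.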
A postfix intermediate language is easy to generate because the compiler doesn't
have to worry about assigning temporary variables; it just uses the run-time stack as its
scratch space. Postfix is useful in interpreter implementations because postfix virtual
machines are easy to implement. The interpreter can translate the source code into a
postfix form, which is then executed in software on the virtual machine.
The main advantage of postfix, here, is that the optimizer can reconstruct the entire
parse tree (or to be more exact, a compacted form of the parse tree called a syntax
tree) from the list of instructions.
A common postfix representation uses one instruction per line. If that instruction is
an rvalue reference, the contents of the variable are pushed onto the stack. If it's an
lvalue reference, the address is pushed. If it's an arithmetic instruction, the top one or
two stack elements are manipulated, and they are replaced by the result, which is either
an lvalue or rvalue depending on the operator. The expression A*B+A*B is represented
as follows (the L subscript marks an lvalue reference):
    A_L
    B_L
    *
    A_L
    B_L
    *
    +
There's no need for an explicit "push" operator as long as the operators can be
distinguished from variable names. Similarly, explicit parentheses are never necessary
because the order of evaluation is determined by the sequence of operations, and all
incoming variables are treated as lvalues because there's no need for explicit temporaries
in the intermediate code. (Consequently, the subscripts in the earlier example are
redundant and can be omitted.)
The first order of business is making a parser that generates postfix intermediate code
for expressions. I've done this in Listings 7.1 and 7.2. The code is simplified because no
temporary variables are necessary.
Listing 7.1. postfix.y: Parser with Postfix Code Generation
    %term ICON NAME
    %left PLUS MINUS
    %left TIMES DIVIDE
    %%
    s    : expr
         ;

    expr : NAME              { yycode( "%s\n", yytext ); }
         | ICON              { yycode( "%s\n", yytext ); }
         | expr DIVIDE expr  { yycode( "/\n" ); }
         | expr TIMES  expr  { yycode( "*\n" ); }
         | expr PLUS   expr  { yycode( "+\n" ); }
         | expr MINUS  expr  { yycode( "-\n" ); }
         ;
    %%
    main() { yyparse(); }
Listing 7.2. postfix.lex: Lexical Analyzer for Postfix Code Generation
    %{
    #include "yyout.h"
    %}
    %%
    [a-zA-Z_][a-zA-Z_0-9]*  return NAME;
    [0-9]+                  return ICON;
    "/"                     return DIVIDE;
    "*"                     return TIMES;
    "+"                     return PLUS;
    "-"                     return MINUS;
    .                       ; /* empty */
    %%
The next task is converting the intermediate code generated by our compiler back to
a syntax tree. You can look at the generated intermediate code as a convenient way for
the parser to pass the tree to the optimizer. The following syntax tree can be created
when you run A*B+A*B through the parser we just looked at:
                +
              /   \
           *1       *2
          /  \     /  \
        A1    B1 A2    B2
All internal nodes represent operators and all the leaves reference lvalues. Also, the
grouping of operators and operands is as you would expect, given the operator
precedence built into the grammar.
Before demonstrating how to do this reconstruction, we'll need a data structure to
represent the nodes in the tree. Listing 7.3 shows this data structure (a node) and a
constructor subroutine that makes new nodes [new()]. The node structure is a normal
binary-tree node, having left and right children. In addition, the name field holds
variable names (A and B in this case) or the operator if the node is an internal node. The
contents of this field will be modified by the optimizer, however. The op field usually
holds the operator (* or +), but it is set to 0 in leaf nodes.
Listing 7.3. optimize.c: The node Data Structure, Used to Construct Syntax Trees
    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>
    #include <stdlib.h>         /* <malloc.h> for UNIX */

    typedef struct node
    {
        char         name[16];
        int          op;
        struct node  *left;
        struct node  *right;
    }
    node;

    node *new()
    {
        node *p;
        if( !(p = (node *) calloc( 1, sizeof(node) ) ) )
            exit( 1 );
        return p;
    }
The build() subroutine in Listing 7.4 creates a syntax tree from a postfix input file.
The input file must have one operand or operator per line and it must be perfect. That is,
in order to simplify the code, I've dispensed with error detection. Input lines are read
from standard input and the subroutine returns a pointer to the root node of the tree. The
tree is built in a bottom-up fashion, using a local stack defined on line 22 to keep track of
the partially-constructed tree. Figure 7.1 shows the syntax tree for the input discussed
earlier as it is built. You'll notice the similarity between this process and the bottom-up
parse process.
The default case on line 30 is executed for variable names.⁴ It allocates and
initializes a new node, and then pushes a pointer to the new node onto the stack. The child
pointers are initialized to NULL by new().
Operators are handled differently because they're internal nodes. A new node is
allocated and initialized, then pointers to two existing nodes are popped and the child
pointers of the new internal node are made to point at these. Finally, a pointer to the new
node is pushed.
Listing 7.4. optimize.c: A Postfix to Syntax Tree Constructor
    19  node *build()
    20  {
    21      char    buf[80];
    22      node    *stack[10];
    23      node    **sp = stack - 1;
    24      node    *p;
    25
    26      while( gets(buf) )
    27      {
    28          switch( *buf )
    29          {
    30          default:  p = new();
    31                    strcpy( p->name, buf );
    32                    *++sp = p;
    33                    break;
    34
    35          case '*':
    36          case '+':
    37                    p          = new();
    38                    p->right   = *sp--;
    39                    p->left    = *sp--;
    40                    p->op      = *buf;
    41                    p->name[0] = *buf;
    42                    *++sp      = p;
    43                    break;
    44          }
    45      }
    46      return *sp--;
    47  }
Code can be generated from this tree by doing a depth-first traversal (visit the
children, then the parent). At every lvalue (i.e., variable reference), generate an
instruction of the form temporary = variable. At every internal node, generate the code
necessary to perform the operation on the temporaries that resulted from traversing the
previous level, putting the result into a new temporary. The previously constructed tree
generates the following output:
    t0  = A
    t1  = B
    t1 *= t0
    t2  = A
    t3  = B
    t3 *= t2
    t3 += t1
4. The default case can go anywhere in the switch. It doesn't have to be at the end.
Figure 7.1. Building a Syntax Tree
The trav() subroutine in Listing 7.5 does the traversal. It takes the pointer
returned from the previous build() call as its initial argument. If root->op is zero,
then the current node is a leaf and you generate the code to move it to a temporary
variable. The sprintf() call overwrites the name field with the name of the temporary
variable. If the op field is nonzero, an interior node is being processed. In this case,
trav() does a post-order traversal, visiting both children before the node itself. The if
statement is always true (for now; things will change momentarily). The following
printf() call prints the instruction, using the name fields of the two children to find out
what temporaries to use. The strcpy() call then overwrites the name field of the current
node to reflect the temporary that got the result of the last operation.
Listing 7.5. optimize.c: A Syntax-Tree Traversing, Code-Generation Pass
    trav( root )
    struct node *root;
    {
        static int tnum = 0;

        if( !root )
            return;

        if( !root->op )                         /* leaf */
        {
            printf ( "t%d = %s\n", tnum, root->name );
            sprintf( root->name, "t%d", tnum );
            ++tnum;
        }
        else
        {
            trav( root->left );

            if( root->left != root->right )     /* Always true      */
                trav( root->right );            /* unless optimized */

            printf( "%s %c= %s\n", root->right->name,
                                   root->op, root->left->name );
            strcpy( root->name, root->right->name );
        }
    }
7.3.2 Common-Subexpression Elimination
The code that trav() outputs isn't too great, because the subexpression A*B is
evaluated twice. It would be better to perform the multiplication only once and use the
generated rvalue twice. You'd like the following output:
    t0  = A
    t1  = B
    t1 *= t0
    t1 += t1
This transformation is called common subexpression elimination, and is a good example
of the type of optimization that you can do by analyzing, and then modifying, the syntax
tree. Since both subtrees of the + node are identical, the optimizer can eliminate one
subtree and make both pointers in the + node point at the remaining subtree. The new
syntax tree looks like this:
        +
       / \
       \ /
        *
       / \
      A   B
Both pointers in the + node point at the * node. This modified data structure is called a
Directed Acyclic Graph or DAG. The DAG is created from the syntax tree by the
optimize() function in Listing 7.6. This routine traverses the interior nodes of the tree,
comparing the two subtrees. If the subtrees are identical, the left and right pointers of
the parent node are made to point at the same child, effectively removing the other child
from the tree. The comparison is done using the makesig() function, which traverses
an entire subtree, assembling a string that shows the pre-order traversal (visit the root,
the left subtree, then the right subtree) of the subtree by concatenating all the name
fields. For example, the original syntax tree, when traversed from the root, creates the
following signature string:
    +*<A><B>*<A><B>
If two subtrees generate the same signature, they're equivalent.
Brackets are placed around the identifiers because the left and right subtrees in the
following expression would incorrectly generate the same signatures if they weren't
there:
    (A * ABB) + (AA * BB)
Finally, you traverse the DAG using the trav() function that was developed earlier.
That if statement now comes into play, preventing us from traversing the common
subtree twice.
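To see the whole process run end to end without the parser, the tree for A*B+A*B can be built by hand. The following is a self-contained sketch in present-day C rather than the book's listings: mk(), sig(), cse(), gen(), and demo() are hypothetical names, the signature brackets every name (operators included), and gen() returns the number of instructions it prints so that the effect of the optimization can be checked.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    char         name[16];
    int          op;                    /* 0 in leaf nodes */
    struct node *left, *right;
} node;

node *mk(const char *name, int op, node *l, node *r)
{
    node *p = calloc(1, sizeof *p);
    assert(p != NULL);
    strncpy(p->name, name, sizeof p->name - 1);
    p->op = op;  p->left = l;  p->right = r;
    return p;
}

/* Append a pre-order signature of the subtree to buf (n bytes long). */
void sig(const node *t, char *buf, size_t n)
{
    if (!t)
        return;
    size_t used = strlen(buf);
    snprintf(buf + used, n - used, "<%s>", t->name);
    sig(t->left,  buf, n);
    sig(t->right, buf, n);
}

/* Merge identical sibling subtrees, turning the tree into a DAG. */
void cse(node *t)
{
    char s1[128] = "", s2[128] = "";

    if (!t || !t->left || !t->right)
        return;
    cse(t->left);
    cse(t->right);
    sig(t->left,  s1, sizeof s1);
    sig(t->right, s2, sizeof s2);
    if (strcmp(s1, s2) == 0)
        t->right = t->left;     /* old right subtree is leaked for brevity */
}

/* Print code as trav() does; return the number of instructions printed. */
int gen(node *t, int *tnum)
{
    int count = 0;

    if (!t)
        return 0;
    if (!t->op) {                               /* leaf */
        printf("t%d = %s\n", *tnum, t->name);
        snprintf(t->name, sizeof t->name, "t%d", (*tnum)++);
        return 1;
    }
    count += gen(t->left, tnum);
    if (t->left != t->right)                    /* skip a shared subtree */
        count += gen(t->right, tnum);
    printf("%s %c= %s\n", t->right->name, t->op, t->left->name);
    strcpy(t->name, t->right->name);
    return count + 1;
}

/* Build A*B+A*B, optionally optimize it, and generate code. */
int demo(int optimize)
{
    int   tnum = 0;
    node *tree = mk("+", '+',
                    mk("*", '*', mk("A", 0, 0, 0), mk("B", 0, 0, 0)),
                    mk("*", '*', mk("A", 0, 0, 0), mk("B", 0, 0, 0)));
    if (optimize)
        cse(tree);
    return gen(tree, &tnum);
}
```

demo(0) prints the seven-instruction sequence shown earlier, while demo(1) prints the four-instruction version that ends with t1 += t1.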
7.3.3 Register Allocation
Syntax trees and DAGs are useful for other optimizations as well. The simplest of
these is efficient register allocation. In general, it's better to use registers than memory
for temporary variables. It's difficult for the front end to do register allocation efficiently,
however, because it doesn't know how many registers the target machine has. There are
two problems with translating temporary-variable references to register references.
First, you might need more temporary variables than there are registers, and the register
allocation should be done in such a way that the registers are used as efficiently as
possible: The temporaries that are used most frequently should be put into registers. The
second issue is subroutine calls embedded in expressions. If registers are used as
temporaries, then they have to be pushed before calling the function and restored afterwards.
Both of these problems are easily solved by analyzing the parse or syntax tree for an
expression before generating code for that expression. For example, an expression like
this:
    carl / philip + emanuel(a+b) * bach
generates the syntax tree in Figure 7.2. Each interior node in the tree represents the
generation or modification of a temporary variable, and the temporary can appear at several
interior nodes. The register-versus-nonregister problem is solved by assigning registers
to those temporaries generated along the longest paths in the tree. Similarly, the
optimizer can examine that tree and, noticing the function call, can decide to use registers
only for those temporaries generated after the function returns or, as is the case with the
function's argument, are pushed as part of the call.
Listing 7.6. optimize.c: A Subroutine for Common-Subexpression Elimination
    optimize( root )    /* Simplified optimizer--eliminates common subexpressions. */
    node *root;
    {
        char sig1[ 32 ];
        char sig2[ 32 ];

        if( root->right && root->left )
        {
            optimize( root->right );
            optimize( root->left  );

            *sig1 = *sig2 = '\0';
            makesig( root->right, sig1 );
            makesig( root->left,  sig2 );

            if( strcmp( sig1, sig2 ) == 0 )     /* subtrees match */
                root->right = root->left;
        }
    }

    makesig( root, str )
    node *root;
    char *str;
    {
        if( !root )
            return;

        if( isdigit( *root->name ) )
            strcat( str, root->name );
        else
        {
            strcat( str, "<" );
            strcat( str, root->name );
            strcat( str, ">" );
        }
        makesig( root->left,  str );
        makesig( root->right, str );
    }
7.3.4 Lifetime Analysis
The next structure-related register optimization is lifetime analysis. This
optimization takes care of a situation like the following:
    int i, j;
    for( i = 10000; i >= 0 ; --i )
        ...
    for( j = 10000; j >= 0 ; --j )
        ...
If the target machine has only a limited number of registers available for variables (say,
one), then the register keyword can't be used effectively to improve the foregoing
code. Ideally, because i is used only in the first loop and j only in the second, i should
be placed in a register while the first loop is being executed, and j should be in a register
while the second loop is executed. This determination can be made by analyzing the
parse tree. In the simplest case, if a variable is found in only one subtree, then its
lifetime is restricted to the code represented by that subtree. More sophisticated analysis
can handle the more complex cases. For example, variables used for loop control can be
given precedence over the variables in the loop body, unless the ones in the body are
used more frequently than the control variables.
Figure 7.2. Using the Syntax Tree for Register Allocation
7.3.5 Loop Unwinding
Loop unwinding is a method of optimizing loop execution speed, usually at the cost
of code size. Put simply, the optimizer replaces the entire loop with the code that
comprises the loop body, duplicated the number of times that the loop would execute at
run time. This optimization, clearly, can be done only when the loop-control criteria can
be computed at compile time. A loop like this:
    for( i = 3; i > 0 ; --i )
        foo( i );
can be replaced with the following code:
    foo( 3 );
    foo( 2 );
    foo( 1 );
    i = 0;
Something like this:
    int array[10];
    for( p = array; p <= array + 2; ++p )
        foo( p );
can be translated as follows:
    r0.pp  = array;
    foo( r0.pp );
    r0.pp += sizeof(int);
    foo( r0.pp );
    r0.pp += sizeof(int);
    foo( r0.pp );
    r0.pp += sizeof(int);
    _p = WP( r0.pp );
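One way to convince yourself that an unwound loop is equivalent to the original is to record what the loop body sees in both versions. This is a small sketch under assumptions of my own: foo() just logs its argument, and rolled() and unrolled() are hypothetical names for the two versions of the first example.

```c
#include <assert.h>

int calls[8];       /* arguments seen by foo(), in order */
int ncalls;

void foo(int i)
{
    assert(ncalls < 8);
    calls[ncalls++] = i;
}

/* The original loop; returns the final value of i. */
int rolled(void)
{
    int i;
    for (i = 3; i > 0; --i)
        foo(i);
    return i;
}

/* The unwound version: same calls in the same order, same final i. */
int unrolled(void)
{
    int i;
    foo(3);
    foo(2);
    foo(1);
    i = 0;          /* needed in case i is used after the loop */
    return i;
}
```

Both functions call foo() with 3, 2, and 1, in that order, and both leave i at zero.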
7.3.6 Replacing Indexes with Pointers
If the values of an array index can be computed at compile time, not only can the
loop be unwound, but the inefficient array indexes can be replaced by direct pointer
operations. For example, this loop:
    for( i = 0; i <= 2 ; ++i )
        array[i] = 0xff;
can be translated as follows:
    r0.pp    = array;
    W(r0.pp) = 0xff;
    r0.pp   += sizeof(int);
    W(r0.pp) = 0xff;
    r0.pp   += sizeof(int);
    W(r0.pp) = 0xff;
    _i       = 3;
Note that the final assignment to _i is required in case _i is used further down in the
code. Also, _i would have to be modified along with r0.pp if it were used in the loop
body for something other than the array index.
7.3.7 Loop-Invariant Code Motion
Loop-invariant optimizations analyze the body of a loop and hoist (move above the
test portion of the loop) all of the code that doesn't change as the loop executes. For
example, the division in the following loop is invariant:
    for( i = 0 ; i <= 2 ; ++i )
        array[i] += (numer/denom);
so you can modify the loop as follows:
    r0.l = numer / denom ;
    for( i = 0; i <= 2 ; ++i )
        array[i] += r0.l;
The optimizations that we looked at earlier can also be applied here. Applying both
array-index replacement and loop unwinding, you end up with the following C-code:
    r0.l      = numer;
    r0.l     /= denom;
    r1.pp     = array;
    W(r1.pp) += r0.l;
    r1.pp    += sizeof(int);
    W(r1.pp) += r0.l;
    r1.pp    += sizeof(int);
    W(r1.pp) += r0.l;
    r1.pp    += sizeof(int);
    i         = 3;
Note that hoisting division can cause problems if you're not careful. The following
example bug is lifted from the Microsoft C (ver. 5.1) documentation. This loop:
    for( i = 0; i <= 2 ; ++i )
        if( denom != 0 )
            array[i] += (numer/denom);
is optimized as follows:
    r0.l = numer / denom ;
    for( i = 0; i <= 2 ; ++i )
        if( denom != 0 )
            array[i] += r0.l;
The division is moved outside the loop because it's invariant, but the preceding test is
not moved with it. You'll now get a run-time divide-by-zero error if denom happens to
be zero.
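A hoist of this kind is safe only if the guarding test moves along with the division. The sketch below shows one way to express that; add_quotient() and add_quotient_hoisted() are hypothetical names, and nothing here describes what the Microsoft compiler actually does.

```c
#include <assert.h>

/* Reference version: the test guards the division on every iteration. */
void add_quotient(int *array, int n, int numer, int denom)
{
    int i;

    assert(array != 0);
    for (i = 0; i < n; ++i)
        if (denom != 0)
            array[i] += numer / denom;
}

/* Hoisted version: the invariant test is hoisted together with the
 * division, so the division is never executed when denom is zero.
 */
void add_quotient_hoisted(int *array, int n, int numer, int denom)
{
    int i, quotient;

    assert(array != 0);
    if (denom == 0)
        return;                 /* the loop would have changed nothing */

    quotient = numer / denom;   /* executed once, and only when safe   */
    for (i = 0; i < n; ++i)
        array[i] += quotient;
}

int plain[3]   = { 1, 2, 3 };
int hoisted[3] = { 1, 2, 3 };
```

Both versions leave the arrays identical for any denom, and a zero denom leaves the array untouched rather than faulting.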
You, as a compiler writer, must decide if it's worth the risk of doing this kind of
optimization. It's difficult for the compiler to distinguish between the safe and
dangerous cases, here. For example, many C compilers perform risky optimizations
because the compiler writer has assumed that a C programmer can understand the
problems and take steps to remedy them at the source code level. On that assumption, it's
better to provide the maximum optimization, even if it's dangerous, than to be
conservative at the cost of less efficient code. A Pascal programmer may not have the same
level of sophistication as a C programmer, however, so the better choice in this situation
might be to avoid the risky optimization entirely or to require a special command-line
switch to enable the optimization.
7.3.8 Loop Induction
Loop induction is a sort of strength reduction applied to loop-control statements.
The optimizer tries to eliminate the multiplication implicit in array indexing by
replacing it with the addition of a constant to a pointer. For example, a loop induction on
a1 modifies this code
    int a1[10][10], a2[10][10];
    for( i = 0; i < 10; ++i )
        for( j = 0; j < 10; ++j )
            a1[i][j] = a2[i][j];
as follows (the constants assume a two-byte int):
    t1 = a1 + 10;
    for( t0 = a1; t0 < t1; t0 += 20 )    /* 20 == 10 * sizeof(int) */
    {
        t3 = t0 + 20;                    /* 20 == 10 * sizeof(int) */
        for( t2 = t0; t2 < t3; t2 += sizeof(int) )
            *t2 = a2[i][j];
        j = 10;
    }
    i = 10;
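The same transformation can be written in ordinary C, letting pointer arithmetic supply the element sizes. The following sketch applies the induction to both arrays, which is an assumption that goes a step beyond the example above; copy_indexed(), copy_induced(), fill(), and same() are hypothetical names.

```c
#include <assert.h>
#include <string.h>

enum { ROWS = 10, COLS = 10 };

int src_m[ROWS][COLS], by_index[ROWS][COLS], by_pointer[ROWS][COLS];

void fill(int m[ROWS][COLS])
{
    int i, j;
    for (i = 0; i < ROWS; ++i)
        for (j = 0; j < COLS; ++j)
            m[i][j] = i * COLS + j;
}

/* Original form: each access implies a multiply to locate the row. */
void copy_indexed(int dst[ROWS][COLS], int src[ROWS][COLS])
{
    int i, j;
    for (i = 0; i < ROWS; ++i)
        for (j = 0; j < COLS; ++j)
            dst[i][j] = src[i][j];
}

/* After induction: the indexes are replaced by pointers that are
 * stepped by a constant, one row (outer) or one element (inner).
 */
void copy_induced(int dst[ROWS][COLS], int src[ROWS][COLS])
{
    int (*drow)[COLS] = dst;
    int (*srow)[COLS] = src;
    int (*dend)[COLS] = dst + ROWS;

    assert(dst != 0 && src != 0);
    for (; drow < dend; ++drow, ++srow) {
        int *d = *drow, *s = *srow, *rowend = *drow + COLS;
        while (d < rowend)
            *d++ = *s++;
    }
}

int same(int a[ROWS][COLS], int b[ROWS][COLS])
{
    return memcmp(a, b, sizeof(int) * ROWS * COLS) == 0;
}
```

After fill(src_m), both copy routines produce identical destination arrays.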
7.4 Aliasing Problems
Many of the loop optimizations have problems with pointer aliasing: code in which
a memory location can be accessed in more than one way (both by a pointer and directly,
or by two pointers that contain the same address). For example, the following code
creates an alias because both x and *p refer to the same memory location:
    prometheus()
    {
        int x, *p = &x;
    }
Similarly, the following code creates less obvious aliases:
    pyramus()
    {
        int array[93];
        thisbe( array, array );
    }

    thisbe( p1, p2 )
    int *p1, *p2;
    {
        /* *p1 and *p2 alias one another, here */
    }

    int x;
    pandora()
    {
        int *p = &x;
        /* *p is an alias for x, here */
    }
The problem here is that the loop optimizations might modify one of the aliased
variables without modifying the other one; the compiler has no way to know that the two
variables reference the same object.
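A two-line function is enough to show how an optimization that is valid without aliases goes wrong with them. In this sketch, add_twice() is the code as written and add_twice_cached() is what an optimizer might produce if it assumed that dst and src never point at the same object; both names are hypothetical.

```c
#include <assert.h>

/* As written: *src is fetched again for the second addition. */
void add_twice(int *dst, const int *src)
{
    assert(dst != 0 && src != 0);
    *dst += *src;
    *dst += *src;
}

/* "Optimized" under a no-alias assumption: *src is loaded once and kept
 * in a register, so a change made through dst is not seen.
 */
void add_twice_cached(int *dst, const int *src)
{
    int r;

    assert(dst != 0 && src != 0);
    r = *src;
    *dst += r;
    *dst += r;
}

int a = 1, b = 5;       /* distinct objects: both versions agree */
int c = 1, d = 5;
int x1 = 1, x2 = 1;     /* each passed as both arguments below   */
```

With distinct objects, both functions give the same answer. When the same variable is passed for both arguments, add_twice() yields 4 while add_twice_cached() yields 3.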
7.5 Exercises
7.1. Write a subroutine that multiplies two arbitrary ints using nothing but shift and
add operations. The result should be returned in a long.
7.2. (a) Modify the compiler in the last chapter to replace all multiplication by
constants with shift/add sequences and to replace all division by constants with
shift/subtract sequences.
(b) Do the foregoing with a separate optimizer pass instead of modifying the
compiler proper.
7.3. Write a peephole optimizer that works on the code generated by the compiler in the
last chapter. At the very least, it should do constant folding, dead-variable elimination,
and dead-assignment elimination.
7.4. Add an assignment operator (=) to the postfix tree constructor and optimizer
presented earlier in this chapter. The input:
    a
    b
    *
    a
    b
    *
    +
    c
    =
should generate the following:
    t0  = a
    t1  = b
    t1 *= t0
    t1 += t1
    c   = t1
Your program must be able to handle expressions like:
    a = b = c
which should be translated to the following:
    c
    b
    =
    a
    =
7.5. Modify your solution to the previous exercise to support the following C operators:
    +  -  *  /  %  &  |  ~  ( )
    ==  <  >  <=  >=  !  &&  ||
    ++
Standard associativity and precedence should be used.
7.6. (a) Modify the compiler in the previous chapter to generate a postfix intermediate
language instead of C-code.
(b) Write an optimizer pass that does common-subexpression elimination on that
postfix intermediate code.
(c) Write a back end that traverses the syntax tree created by the common-
subexpression optimizer and generates C-code.
This appendix contains myriad small subroutines that are used throughout the rest
of the book. Collectively, they comprise a library called comp.lib. A few of the routines
in this library are described elsewhere (the input and table-printing functions in Chapter
Two, for example), but I've concentrated the other routines in one place so that the rest
of the book won't be cluttered with distracting references to small subroutines whose
function is not directly applicable to the matter at hand. It's important that you are
familiar with the routines in this appendix, however, at least with the calling conventions,
if not with the code itself.
Since straight code descriptions can be pretty deadly reading, I've organized each
section to start with a description of subroutine-calling conventions and a discussion of
how to use the routines in that section. This way, you can skip the blow-by-blow code
descriptions if you're not interested. Most of the code is commented enough so that
additional descriptive text isn't necessary. There's no point in making an overly long
appendix even longer with descriptions of things that are obvious by reading the code.
Of the subroutines in this appendix, you should be particularly familiar with the
<tools/debug.h> file described in the next section and with the set and hash-table
functions described in the sections after that. It's worthwhile to thoroughly understand this
code. You need to be familiar only with the calling conventions for the remainder of the
functions in this appendix, however. The sources are presented, but they're not
mandatory reading.
The auxiliary files and subroutines described here are organized into several
directories. Sources for subroutines of general utility are all in /src/tools; those that are
more directly compiler related are in /src/compiler/lib. Most of the #include files used by
the routines in this chapter are in /include/tools, and they are included with a
#include <tools/whatever.h> (I've set up /include as my default library directory).
The one exception is curses.h, which is in /include itself, so that you can use a
UNIX-compatible #include <curses.h>.
A.1 Miscellaneous Include Files
Many .h files are included at the tops of the various .c files presented in this book.
With the exception of the files discussed in this chapter, all files included with angle
brackets (<stdio.h>, <signal.h>, and so forth) are ANSI-compatible, standard-library files
and should be supplied by your compiler vendor (in this case, Microsoft). If they aren't,
then your compiler doesn't conform to ANSI and you'll have to work a little harder to port
the code presented here. Most of these standard .h files are also supported by UNIX;
those that aren't are supplied on the distribution disk so that you can port the code to
UNIX without difficulty.
A.1.1 debug.h: Miscellaneous Macros
The debug.h file is #included in virtually every .c file in this book. It contains
several macro definitions that are of general utility, and sets up various definitions that
are useful in debugging. It also contains several macros that take care of various 8086
portability issues. (It's useful, by the way, to look at the 8086-related macros even if
you're not using an 8086. Not only will you get a good idea of the things that you have
to consider in order to make your code portable in general, but, since the 8086 is such a
pervasive architecture, a specific knowledge of the machine lets you write your code so
that it can be as portable as possible in any environment. It's a mistake to assume that
your code will never have to move to the 8086 environment.)
Listing A.1. debug.h: Miscellaneous Macros of General Utility
 1  #ifdef DEBUG
 2  #   define PRIVATE
 3  #   define D(x) x
 4  #else
 5  #   define PRIVATE static
 6  #   define D(x)
 7  #endif
 8  #define PUBLIC
 9
10  #ifdef MSDOS
11  #   define MS(x) x
12  #   define UX(x)
13  #   define ANSI
14  #   define _8086
15  #else
16  #   define MS(x)
17  #   define UX(x) x
18  #   define O_BINARY 0          /* no binary input mode in UNIX open()  */
19      typedef long time_t;       /* for the VAX, may have to change this */
20      typedef unsigned size_t;   /* for the VAX, may have to change this */
21      extern char *strdup();     /* You need to supply one.              */
22  #endif
23
24  #ifdef ANSI                /* If ANSI is defined, put arg lists into          */
25  #define P(x) x             /* function prototypes.                            */
26  #define VA_LIST ...        /* and use ellipsis if a variable number of args   */
27  #else
28  #define P(x) ()            /* Otherwise, discard argument lists and translate */
29  #define void char          /* void keyword to int.                            */
30  #define VA_LIST _a_r_g_s   /* don't use ellipsis                              */
31  #endif
32
33  /* SEG(p)    Evaluates to the segment portion of an 8086 address.
34   * OFF(p)    Evaluates to the offset portion of an 8086 address.
35   * PHYS(p)   Evaluates to a long holding a physical address
36   */
37
38  #ifdef _8086
39  #define SEG(p)   ( ((unsigned *)&(p))[1] )
40  #define OFF(p)   ( ((unsigned *)&(p))[0] )
41  #define PHYS(p)  ( ((unsigned long)OFF(p)) + ((unsigned long)SEG(p) << 4) )
42  #else
43  #define PHYS(p)  (p)
44  #endif
45
46  /* NUMELE(array)       Evaluates to the array size in elements
47   * LASTELE(array)      Evaluates to a pointer to the last element
48   * INBOUNDS(array,p)   Evaluates to true if p points into the array.
49   * RANGE(a,b,c)        Evaluates to true if a <= b <= c
50   * max(a,b)            Evaluates to a or b, whichever is larger
51   * min(a,b)            Evaluates to a or b, whichever is smaller
52   *
53   * NBITS(type)         Returns number of bits in a variable of the indicated
54   *                     type;
55   * MAXINT              Evaluates to the value of the largest signed integer
56   */
57
58  #define NUMELE(a)       ( sizeof(a) / sizeof(*(a)) )
59  #define LASTELE(a)      ( (a) + (NUMELE(a) - 1) )
60  #define TOOHIGH(a,p)    ( (p) - (a) > (NUMELE(a) - 1) )
61  #define TOOLOW(a,p)     ( (p) - (a) < 0 )
62  #define INBOUNDS(a,p)   ( !(TOOHIGH(a,p) || TOOLOW(a,p)) )
63
64  #define _IS(t,x) ( ((t)1 << (x)) != 0 )  /* Evaluate true if the width of a   */
65                                           /* variable of type t is > x. The    */
66                                           /* != 0 assures that the answer is 1 or 0 */
67
68  #define NBITS(t) (4 * (1 + _IS(t, 4) + _IS(t, 8) + _IS(t,12) + _IS(t,16) \
69                           + _IS(t,20) + _IS(t,24) + _IS(t,28) + _IS(t,32)) )
70
71  #define MAXINT ( ((unsigned)~0) >> 1 )
72
73  #ifndef max
74  #   define max(a,b) ( ((a) > (b)) ? (a) : (b) )
75  #endif
76  #ifndef min
77  #   define min(a,b) ( ((a) < (b)) ? (a) : (b) )
78  #endif
79  #define RANGE(a,b,c)    ( (a) <= (b) && (b) <= (c) )
The debugging versions of several macros are activated when DEBUG is #defined
(either with an explicit #define or, more commonly, with a -DDEBUG on the
compiler's command line) before <tools/debug.h> is included. The PRIVATE definition
on line two expands to an empty string in this case; it effectively disappears from the
input. When DEBUG is not defined, the alternative definition on line five is activated, so
all PRIVATE variables become static variables when you're not debugging. This
mechanism lets you put normally invisible variables and subroutines into the link
map (so that you know where they are when you're debugging) and then limit their
scope to the current file when debugging is completed. The complement to PRIVATE is
PUBLIC, defined on line eight. It is provided for documentation purposes only;
PUBLIC variables and subroutines are those whose scope is not limited.
The D(x) macro on lines three and six is used for printing debugging diagnostics.
When DEBUG is defined, the macro expands to its argument, so
    D( printf("a diagnostic\n"); )
expands to
    printf("a diagnostic\n");
Note that the semicolon is inside the parentheses. When DEBUG is not defined, the macro
expands to an empty string and the argument is discarded. The earlier printf() statement
goes away. This method is preferable to the more usual
    #ifdef DEBUG
        printf("a diagnostic\n");
    #endif
for several reasons: it's more readable; it requires only one line, rather than three; it
doesn't mess up your indentation (many compilers require the # to be in the leftmost
column). A multiple-line argument to D(x) must be done with backslashes, like this:
    D( printf("%s in de %s wit' dinah, strummin' on de ol' %s", \
                                    someone, "kitchen", "banjo" ); )
There's a potential problem here in that commas that are part of the argument to D() can
be confused with the comma that separates macro arguments. ANSI says that commas
enclosed by parentheses are not to be treated as argument separators (which is why the
previous definition worked). Your compiler may not handle this case correctly, however.
In any event, you'll have to parenthesize expressions that use the comma operator to pass
them to D():
    D( (statement1, statement2); )
It's better style to use two D() invocations, however:
    D( statement1; )
    D( statement2; )
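The following is a minimal sketch of D() in use, not code from the book's library. The function sum_to() and the counter diagnostics_printed are hypothetical names invented for the illustration, and DEBUG is defined in the file itself only so that the fragment stands alone (normally it would come from -DDEBUG on the command line).

```c
#include <assert.h>
#include <stdio.h>

#define DEBUG              /* assumption: normally supplied with -DDEBUG */

#ifdef DEBUG
#  define D(x) x           /* debugging build: the argument is compiled */
#else
#  define D(x)             /* production build: the argument vanishes   */
#endif

static int diagnostics_printed = 0;   /* hypothetical counter, illustration only */

/* Hypothetical routine that reports its progress only when DEBUG is defined. */
int sum_to(int n)
{
    int i, total = 0;

    for (i = 1; i <= n; ++i) {
        total += i;
        D( printf("i=%d total=%d\n", i, total); )  /* commas are inside ()  */
        D( ++diagnostics_printed; )                /* one statement per D() */
    }
    return total;
}
```

Removing the #define DEBUG line makes both D() invocations disappear, so the loop body shrinks to the single addition.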
The macros on lines ten to 22 help you write portable code. The arguments to MS()
are incorporated into the program only if MSDOS is #defined (it's defined automatically
by the Microsoft compiler). Arguments to UX() (for UNIX) are active only if MSDOS is
missing. I've also defined ANSI and _8086 macros that are active only if MSDOS is also
active. Move the ANSI definition outside of the #ifdef if you've an ANSI-compatible
UNIX compiler.
The P macro on lines 25 and 28 handles another ANSI-related portability problem (see
footnote 1). Many compilers (including many UNIX compilers) can't handle function
prototypes. This macro uses a mechanism similar to the D() macro discussed earlier to
translate prototypes into simple extern declarations if ANSI is not defined. For example,
given the following input:
    int dimitri P(( int x, long y ));
if ANSI is defined, the P(x) macro evaluates to its argument and the following
translation results:
    int dimitri( int x, long y );
otherwise, the macro discards its argument and evaluates to (), so the following will be
created:
    int dimitri();

1. This macro is taken from [Jaeschke], p. 142, and is used in the Whitesmiths compiler.
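Here is a small sketch of the P() mechanism on a compiler that accepts prototypes. ANSI is defined by hand for the purposes of the illustration, and scale() is a hypothetical function, not one of the book's routines.

```c
#include <assert.h>

#define ANSI               /* assumption: this compiler accepts prototypes */

#ifdef ANSI
#  define P(x) x           /* keep the argument list    */
#else
#  define P(x) ()          /* discard the argument list */
#endif

/* Expands to "long scale (int x, long y);" because ANSI is defined.
 * Note the doubled parentheses: the whole list is a single macro argument.
 */
long scale P(( int x, long y ));

long scale(int x, long y)          /* hypothetical function */
{
    return x * y;
}
```

The doubled parentheses are the whole trick: without them the comma between the two parameters would be taken as a macro-argument separator.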
The VA_LIST macro on lines 26 and 30 takes care of yet another problem with non-ANSI
compilers. An ANSI subroutine with a variable number of arguments is defined as
follows:
    printf( format, ... )
    char *format;
The trailing ellipsis means "followed by any number of arguments of an arbitrary type."
Earlier compilers don't support this mechanism. Consequently, you'd like to replace the
ellipsis with a dummy argument name. You can do this with debug.h using the VA_LIST
macro as follows:
    printf( format, VA_LIST )
    char *format;
VA_LIST expands to an ellipsis when ANSI is defined; otherwise it expands to the
dummy argument name _a_r_g_s.
The macros on lines 39 to 41 of Listing A.1 handle portability problems inherent in
the 8086 architecture. The SEG() and OFF() definitions are ignored in other systems,
and the PHYS() macro just evaluates to its argument in non-8086 environments. Before
looking at the definitions for these macros, a short discussion of the 8086 internal
architecture is necessary. The 8086 is really a 16-bit machine in that all internal registers are
16 bits wide. The machine has a 20-bit address space, however. The discrepancy is
made up by using two registers to create a physical address. An address is formed from
two 16-bit components representing the contents of the two registers: the segment and
the offset. These are usually represented as two 4-digit hex numbers separated by a
colon: 1234:5678. The segment portion is on the left, the offset on the right. A physical
address is formed in the machine using the following formula:
    physical_address == (segment << 4) + offset;
Three special registers called CS, DS, and SS hold the segment components of the addresses
of objects in the code, data, and stack regions of your program. (A fourth segment
register, called ES, is used if you need to access an object not in one of these segments.)
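The formula can be checked with a few lines of portable C. This is only a sketch: phys_addr() is a hypothetical helper, and it uses the fixed-width types from <stdint.h> rather than anything in debug.h.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a 16-bit segment and a 16-bit offset into a physical address,
 * following physical_address == (segment << 4) + offset.
 * phys_addr() is a hypothetical name used only for this illustration.
 */
uint32_t phys_addr(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For the address 1234:5678 used above, the segment contributes 0x12340 and the offset 0x5678, giving the physical address 0x179B8.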
This architecture has several ramifications. First, there are two pointer sizes. If the
segment component doesn't change over the life of the program, only the offset needs to
be stored; pointers are 16 bits wide in this situation. For example, the DS register
doesn't have to change if there's less than 64K of data and if the DS register is initialized
to the base address of that 64K region. All cells within that region can be accessed using
the fixed DS register by loading a 16-bit offset into the appropriate offset register; the
segment portion of the address can be loaded once when the program loads. If the data
area is larger than 64K, then 32-bit pointers composed of a 16-bit segment and a 16-bit
offset component must be used. You must load both the segment and offset components
into their respective registers to access the target variable. Typically, the segment is
stored in the top 16 bits of the pointer and the offset is stored in the bottom 16 bits. The
physical address is not stored. A 32-bit pointer that has both a segment and offset
component is called a far pointer; a 16-bit pointer containing the offset component only is
called a near pointer. Note that a program can have 16-bit data pointers and 32-bit
function pointers and vice versa, depending on how much code or data there is.
One of the other problems with the segmented architecture is that there are as many as
4,096 ways to address a given physical cell in memory. For example, the following all
address physical address 0x10000: 1000:0000, 0FFF:0010, 0FFE:0020, 0FFD:0030, and so
forth.
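The aliasing is easy to demonstrate by brute force. The sketch below counts every segment:offset pair that forms a given physical address; count_aliases() is a hypothetical helper written for this illustration, and it assumes the same (segment << 4) + offset formula with 16-bit components.

```c
#include <assert.h>
#include <stdint.h>

/* Count the segment:offset pairs that address the given physical cell.
 * A pair qualifies if the segment base is not above the cell and the
 * remaining distance fits into a 16-bit offset.
 */
unsigned count_aliases(uint32_t physical)
{
    uint32_t segment;
    unsigned count = 0;

    for (segment = 0; segment <= 0xFFFFUL; ++segment) {
        uint32_t base = segment << 4;           /* start of this segment */

        if (base <= physical && physical - base <= 0xFFFFUL)
            ++count;
    }
    return count;
}
```

Cells near the bottom of memory have fewer aliases because fewer segments start at or below them; physical address 0 can be reached only as 0000:0000.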
The PHYS(p) macro on line 41 of Listing A.1 evaluates to the physical address
associated with an 8086 32-bit far pointer (a pointer in which both the segment and offset
components of the address are stored). It computes this address using the same procedure
used by the 8086 itself, shifting the segment component left by four bits and adding it to
the offset. The segment and offset portions are extracted by the SEG and OFF macros on
the previous two lines, reproduced here:
    #define SEG(p)  ( ((unsigned *)&(p))[1] )
    #define OFF(p)  ( ((unsigned *)&(p))[0] )
Taking SEG() as characteristic, &(p) creates a temporary variable of type
pointer-to-pointer which holds the address of that pointer in memory. &(p) can be treated as if it
were the base address of an array of pointers: *(&p) and (&p)[0] both evaluate to p
itself. The cast to (unsigned *) changes things, however. Now instead of being
treated as an array of pointers, &p is treated as a two-element array of 16-bit
unsigned ints, which are half the size of a pointer. The [1] picks up 16 bits of the
32-bit pointer, in this case the segment half. The [0] gets the offset half. These two
macros are not at all portable, but they're more efficient than shift-and-mask operations,
which are.
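For comparison, here is a sketch of the portable shift-and-mask alternative mentioned above. It assumes that the far pointer has already been copied into a 32-bit integer with the segment in the top 16 bits and the offset in the bottom 16 bits; SEG_OF(), OFF_OF(), and PHYS_OF() are hypothetical names, not macros from debug.h.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-mask equivalents of SEG(), OFF(), and PHYS(). The argument is
 * a 32-bit integer image of a far pointer: segment on top, offset below.
 */
#define SEG_OF(fp)   ( (uint16_t)(((uint32_t)(fp) >> 16) & 0xFFFFu) )
#define OFF_OF(fp)   ( (uint16_t)( (uint32_t)(fp)        & 0xFFFFu) )
#define PHYS_OF(fp)  ( ((uint32_t)SEG_OF(fp) << 4) + OFF_OF(fp) )
```

Unlike the array-indexing version, these macros don't depend on the byte order of the machine or on unsigned being 16 bits wide, at the cost of a shift and a mask per access.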
The NBITS(type) macro on line 68 of Listing A.1 evaluates to the number of bits
in a variable of a particular integral type on the current machine. (NBITS(int)
evaluates to the number of bits in an int.) It assumes that this number is 8, 16, 32, or 64, but
the macro is easily modified to accept other values. The _IS macro on line 64
tests for one of these possibilities:
    #define _IS(t,n) ( ((t)1 << (n)) != 0 )
It is passed the type (t) and the current guess (n). It casts the number 1 into the correct
type, and then shifts that number to the left by the required number of bits. The resulting
binary number for several shift values and a 16-bit-wide type are shown below:
    n     1 << n
    0     0000000000000001
    4     0000000000010000
    8     0000000100000000
    12    0001000000000000
    16    0000000000000000
    32    0000000000000000
The expression evaluates to 0 if a variable of type t contains n or fewer bits. The
NBITS macro adds together several terms, one for each group of four bits in the number.
For example, when an int is 16 bits wide, _IS(int,4), _IS(int,8), and
_IS(int,12) all evaluate to 1. The other terms evaluate to 0. Multiplying by four
adjusts the result to an actual bit count. I should point out that this macro, though huge,
usually evaluates to a single constant at compile time. Since all the arguments are
constants, none of the arithmetic needs to be done at run time. Its size might cause
problems with your compiler's running out of macro-expansion space, however. Also, note
that some compilers use greater precision for compile-time calculations than is available
at run time; NBITS() won't work on these compilers.
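On a compiler that supplies <limits.h>, the same number can be had without any shifting. The sketch below is a different technique from the one in debug.h, not a copy of it: NBITS_OF() is a hypothetical name, and the result counts any padding bits that the type might contain.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Width of a type in bits: its size in bytes times the bits in a byte.
 * Includes padding bits, if the type has any.
 */
#define NBITS_OF(type)  ( sizeof(type) * CHAR_BIT )
```

This version also sidesteps two hazards of the shifting approach on newer compilers: types narrower than int are promoted before the shift, and shifting by the full width of the type or more is not defined by the language.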
MAXINT, on line 71, evaluates to the largest positive, two's-complement integer.
(unsigned)~0 evaluates to an unsigned int with all bits set to 1. The right shift
puts a zero into the high bit; the cast to (unsigned) defeats any potential sign
extension on the right shift.
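A quick way to convince yourself that the expression works is to compare it with INT_MAX from <limits.h>. The sketch assumes the usual arrangement in which an unsigned int has exactly one more value bit than an int; largest_int() is a hypothetical wrapper added for the illustration.

```c
#include <assert.h>
#include <limits.h>

#define MAXINT ( ((unsigned)~0) >> 1 )   /* as on line 71 of Listing A.1 */

/* Return MAXINT as an int so that it can be compared with INT_MAX. */
int largest_int(void)
{
    return (int)MAXINT;
}
```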
The definitions on lines 58 to 62 of Listing A.1 are for array manipulations. NUMELE
is passed an array name and evaluates to the number of elements in the array. LASTELE
evaluates to the address of the last element in the array. TOOHIGH(array,p) evaluates
to true only if the pointer p is above the array. TOOLOW(array,p) is true if p is
beneath the start of the array (see footnote 2). INBOUNDS(array,p) is true if p is in bounds.
The macros on lines 73 to 79 of debug.h test for various ranges of integer values.
max(a,b) evaluates to a or b, whichever is larger; min(a,b) evaluates to a or b,
whichever is smaller; and RANGE(a,b,c) evaluates to true if a <= b <= c.
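The array macros can be tried out as follows. The macro definitions are copied from Listing A.1; the array table and the function valid_index() are hypothetical and exist only for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUMELE(a)       ( sizeof(a) / sizeof(*(a)) )
#define LASTELE(a)      ( (a) + (NUMELE(a) - 1) )
#define TOOHIGH(a,p)    ( (p) - (a) > (NUMELE(a) - 1) )
#define TOOLOW(a,p)     ( (p) - (a) < 0 )
#define INBOUNDS(a,p)   ( !(TOOHIGH(a,p) || TOOLOW(a,p)) )
#define RANGE(a,b,c)    ( (a) <= (b) && (b) <= (c) )

static int table[10];          /* hypothetical array, illustration only */

/* Return 1 if index is a legal subscript for table, 0 otherwise. */
int valid_index(int index)
{
    return RANGE(0, index, (int)NUMELE(table) - 1);
}
```

Note that NUMELE() must be handed a real array name. If it is given a pointer (an array that was passed to a subroutine, for example), sizeof yields the size of the pointer and the count is wrong.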
A.1.2 stack.h and yystack.h: Generic Stack Maintenance
Compilers tend to use stacks a lot. To simplify stack maintenance, a series of macros
that declare and manipulate stacks of an arbitrary, nonaggregate type are declared in
stack.h. The macros are used as follows:
stack_dcl(stack, type, size)
    This macro creates a stack consisting of size objects of the given type. The
    type can be a simple type (like int and long) or a pointer (like char*,
    long**, and struct foo *). Stacks of aggregate objects like structures and
    arrays are not supported, however. Note that very complex types (like pointers to
    arrays) are best defined with a previous typedef, like this:
        typedef int (*ptr_to_10_element_array)[10];
        stack_dcl( stack, ptr_to_10_element_array, 128 );
    otherwise they might not compile.
    The stack_dcl macro declares three objects, the names of which are all derived
    from the stack argument as follows:
        stack      the stack itself, an array of size elements
        p_stack    the stack pointer
        t_stack    a typedef for the type of one stack element
    If you put a definition of the form
        #undef stack_cls
        #define stack_cls static
    somewhere after the #include <tools/stack.h>, then the stack and stack
    pointer will have the indicated storage class (static here).
2. Though ANSI requires that a pointer be able to point just past the end of an array, it does not allow a pointer
to go past the start of the array. This behavior has serious practical ramifications in the 8086 compact and
large models, and you won't be able to use TOOLOW with these memory models. Most 8086 compilers
assume that no data object in the compact or large model is larger than 64K. Consequently, once the
initial segment-register load is performed, only the offset portion of the address needs to be checked or
modified when a pointer is modified. This practice causes problems when the base address of the array is
close to the bottom of a segment. If, for example, an integer array is based at 2000:0000, a pointer
increment correctly yields the address 2000:0002, but a decrement will yield 2000:FFFE rather than
1000:FFFE because the segment portion of the address won't be modified. This incorrect pointer is treated
as very large, rather than very small.
stack_p(stack)
    This macro evaluates to the name used internally for the stack pointer. (See also:
    stack_ele(), below.)
push(stack, x)
    Push x onto the stack. The stack argument must have appeared in a previous
    stack_dcl() invocation. This macro checks for stack overflow and invokes
    the stack_err() macro, discussed below, if the stack is full. (The default
    action prints an error message and terminates the program.) The
    push_(stack, x) macro does no overflow checking, so it is faster.
pop(stack)
    Pop an item off the stack (like this: x = pop(stack);). The stack argument
    must have appeared in a previous stack_dcl() invocation. This macro checks
    for stack underflow and invokes the stack_err() macro, discussed below, if the
    stack is empty. (The default action prints an error message and terminates the
    program.) The pop_(stack) macro does no underflow checking, so it is faster.
popn(stack, amt)
    This macro pops amt elements from the indicated stack, and evaluates to the
    element that was at the top of stack before anything was popped. There's also a
    popn_(stack, amt) macro, which does no underflow checking, but is
    otherwise the same as popn().
stack_clear(stack)
    This macro sets the stack back to its initial, empty condition, discarding all stack
    elements.
stack_ele(stack)
    This macro evaluates to the number of elements currently on the indicated stack.
stack_empty(stack)
    This macro evaluates to true if the stack is empty.
stack_full(stack)
    This macro evaluates to true if the stack is full.
stack_item(stack, offset)
    Evaluates to the item at the indicated offset from the top of the stack. Note that
    no range checking is done to see if offset is valid. Use stack_ele() to do
    this checking if you need it.
stack_err(overflow)
    This macro is invoked to handle stack errors. The overflow argument is 1 if a
    stack overflow occurred (you tried to push an item onto a stack that was full). It's
    zero on an underflow (you tried to pop an item off an empty stack). The default
    action prints an error message and terminates the program, but you can #undef
    this macro and redefine it to do something more appropriate if you like. If this
    macro evaluates to something, that value will, in turn, come back from the push
    or pop macros when an error occurs. For example:
        #undef stack_err
        #define stack_err(overflow) 0
    will cause both push and pop to evaluate to 0 if an overflow or underflow occurs.
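The calling conventions can be exercised with the sketch below. To keep it self-contained, it carries cut-down copies of a few of the macros (modeled on Listing A.2, without the stack_cls hook or popn()) instead of including <tools/stack.h>, and it redefines stack_err() to evaluate to 0 as just described. The function fill_plates() is hypothetical.

```c
#include <assert.h>

/* Cut-down versions of the stack.h macros, for illustration only. */
#define stack_dcl(stack,type,size)  typedef type t_##stack;             \
                                    t_##stack stack[size];              \
                                    t_##stack (*p_##stack) = stack + (size)
#define stack_full(stack)   ( (p_##stack) <= stack )
#define stack_empty(stack)  ( (p_##stack) >= (stack + sizeof(stack)/sizeof(*stack)) )
#define stack_ele(stack)    ( (sizeof(stack)/sizeof(*stack)) - (p_##stack - stack) )
#define push_(stack,x)      ( *--p_##stack = (x) )
#define pop_(stack)         ( *p_##stack++ )
#define stack_err(o)        0       /* assumption: errors evaluate to 0 */
#define push(stack,x)       ( stack_full(stack)                         \
                                ? ((t_##stack)(long)(stack_err(1)))     \
                                : push_(stack,x) )
#define pop(stack)          ( stack_empty(stack)                        \
                                ? ((t_##stack)(long)(stack_err(0)))     \
                                : pop_(stack) )

stack_dcl( plates, int, 4 );        /* a four-element stack of int */

/* Push the values 1..n; return how many fit before the stack filled up. */
int fill_plates(int n)
{
    int i, pushed = 0;

    for (i = 1; i <= n; ++i)
        if (push(plates, i) != 0)   /* 0 comes back from stack_err() on overflow */
            ++pushed;
    return pushed;
}
```

Using 0 as the error value works here only because 0 is never pushed; a real program would pick a value that can't be confused with legitimate data, or keep the default action.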
The stack macros are implemented in stack.h, Listing A.2. These macros make
heavy use of the ANSI concatenation operator, ##. This operator is removed from a
macro definition by the preprocessor, effectively concatenating the strings on either side
of it. For example, given the invocation
    stack_dcl( plates, int, 128 );
the typedef on line seven of Listing A.2 evaluates first to
    typedef int t_##plates;
(the int replaces type in the definition, the plates replaces stack) and then the ## is
removed, yielding:
    typedef int t_plates;
Lines eight and nine will evaluate as follows:
    t_plates plates[128];
    t_plates (*p_plates) = plates + (128);
given the earlier stack_dcl() invocation. A downward-growing stack is created, so
the stack pointer (p_plates) is initialized to just past the end of the stack. This
initialization works correctly in ANSI C, which says that a pointer can point one cell beyond the
end of an array and still be valid, but a pointer is not permitted to go to the left of the
array.
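The ## operator is easier to see in a smaller setting. In this sketch, a single hypothetical macro, COUNTER(), manufactures both a variable name and a function name from its argument; none of these names come from stack.h.

```c
#include <assert.h>

/* Hypothetical macro: declare a counter and a function that increments it.
 * Both names are built from the argument with the ## operator.
 */
#define COUNTER(name)                                   \
    static int count_##name = 0;                        \
    int bump_##name(void) { return ++count_##name; }

COUNTER( errors )      /* creates count_errors   and bump_errors()   */
COUNTER( warnings )    /* creates count_warnings and bump_warnings() */
```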
Since a downward-growing stack is implemented, a push is a predecrement (on line
23) and a pop is a postincrement (on line 24). This way the stack pointer always points
at the top-of-stack item. The actual push and pop operations have been isolated into the
push_ and pop_ macros because it's occasionally useful to do a push or pop
without checking for stack overflow or underflow. Note that the popn_ macro on line
34 must evaluate to the top-of-stack element before anything is popped. A
postincrement is simulated here by modifying the stack pointer by amt elements, and then
reaching backwards with a [-amt] to get the previous top-of-stack element.
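The reach-backwards trick looks like this when it is pulled out of the macro. The array cells, the pointer sp, and the function pop_several() are all hypothetical stand-ins for the objects that stack_dcl() would create.

```c
#include <assert.h>

/* A downward-growing stack holding four items; cells[4] is the top. */
static int  cells[8] = { 0, 0, 0, 0, 40, 30, 20, 10 };
static int *sp       = cells + 4;

/* Pop amt items and return the item that was on top before the pop.
 * No underflow check is made, as is the case with popn_().
 */
int pop_several(int amt)
{
    return (sp += amt)[-amt];   /* move the pointer, then look back */
}
```

The subscript undoes the adjustment for the purposes of the fetch only, so the expression yields the old top of stack while leaving the pointer at the new one.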
The stack_err() macro on lines 39 and 40 uses ferr() to print an error message and
then exit the program. (ferr() works like printf() except that it sends characters
to stderr and calls exit(1) when it's done. It is discussed in depth below.) In
theory, the entire expression evaluates to ferr()'s return value, but since ferr()
doesn't return, it's not really an issue here.
You need two casts on lines 27 and 31 where stack_err() is invoked because
there's no telling what sort of value the macro will evaluate to. The default
call evaluates to an int, but since stack_err() can be redefined, you can't
count on that. The stack type is also uncertain because it can be redefined. You can't
just cast stack_err() to the same type as the stack because the stack might be an
array of pointers, and many compilers complain if you try to convert an int to a pointer.
So you have to cast twice (first to long and then to the stack type) to suppress this
error message. Casting to long as an intermediate step fixes this problem,
though it might introduce an unnecessary conversion if the stack type is an int, short,
or char. These multiple type conversions will also cause portability problems if the
stack_err() macro evaluates to something that won't fit into a long (like a double).
Listing A.2. stack.h: Stack-Maintenance Macros
 1  /* Stack.h     Stack-maintenance macros. Creates downward-growing stacks
 2   *             (which should work in all six memory models).
 3   */
 4
 5  #define stack_cls /* empty */
 6
 7  #define stack_dcl(stack,type,size)  typedef type t_##stack;               \
 8                                      stack_cls t_##stack stack[size];      \
 9                                      stack_cls t_##stack (*p_##stack)      \
10                                                            = stack + (size)
11  #define stack_clear(stack)          ( (p_##stack) = (stack +              \
12                                               sizeof(stack)/sizeof(*stack)) )
13
14  #define stack_full(stack)           ( (p_##stack) <= stack )
15  #define stack_empty(stack)          ( (p_##stack) >= (stack +             \
16                                               sizeof(stack)/sizeof(*stack)) )
17
18  #define stack_ele(stack)            ( (sizeof(stack)/sizeof(*stack)) -    \
19                                                       (p_##stack - stack) )
20  #define stack_item(stack,offset)    ( *(p_##stack + (offset)) )
21  #define stack_p(stack)              p_##stack
22
23  #define push_(stack,x)              ( *--p_##stack = (x) )
24  #define pop_(stack)                 ( *p_##stack++ )
25
26  #define push(stack,x)               ( stack_full(stack)                   \
27                                         ? ((t_##stack)(long)(stack_err(1))) \
28                                         : push_(stack,x) )
29
30  #define pop(stack)                  ( stack_empty(stack)                  \
31                                         ? ((t_##stack)(long)(stack_err(0))) \
32                                         : pop_(stack) )
33
34  #define popn_(stack,amt)            ( (p_##stack += amt)[-amt] )
35  #define popn(stack,amt)             ( (stack_ele(stack) < amt)            \
36                                         ? ((t_##stack)(long)(stack_err(0))) \
37                                         : popn_(stack,amt) )
38
39  #define stack_err(o)    ( (o) ? ferr("Stack overflow\n" )                 \
40                                : ferr("Stack underflow\n") )

A second set of stack macros is in the file <yystack.h>, Listing A.3. These macros
differ from the ones just described in three ways only: all the macro names are
preceded by the characters yy, the generated names are prefaced by yyt_ and yyp_ rather
than t_ and p_, and I've abbreviated yystack to yystk. The second set of macros is
used only within the LLama- and occs-generated parsers. The extra yy avoids potential
name conflicts with user-supplied names.
A.1.3 l.h and compiler.h
Two other include files are of interest: l.h contains prototypes for the routines in
l.lib (which contains run-time subroutines for LeX- and occs-generated programs);
compiler.h contains prototypes for the routines in comp.lib (which contains subroutines
used by LeX and occs themselves). These files are not listed here because they are
created automatically from within the makefile that creates the library. The Microsoft
compiler's /Zg switch (which outputs prototypes for all functions in the input file rather
than compiling the file) is used. Copies of the prototype files are included on the
distribution disk, however, and they are #included by most of the source files in this book.
Listing A.3. yystack.h: Stack-Maintenance Macros, LLama/Occs Version
 1  /* yystack.h   Stack-maintenance macros, yacc and llama version
 2   */
 3
 4  #define yystk_cls /* empty */
 5
 6  #define yystk_dcl(stack,type,size)  typedef type yyt_##stack;               \
 7                                      yystk_cls yyt_##stack stack[size];      \
 8                                      yystk_cls yyt_##stack (*yyp_##stack)    \
 9                                                            = stack + (size)
10  #define yystk_clear(stack)          ( (yyp_##stack) = (stack +              \
11                                               sizeof(stack)/sizeof(*stack)) )
12
13  #define yystk_full(stack)           ( (yyp_##stack) <= stack )
14  #define yystk_empty(stack)          ( (yyp_##stack) >= (stack +             \
15                                               sizeof(stack)/sizeof(*stack)) )
16
17  #define yystk_ele(stack)            ( (sizeof(stack)/sizeof(*stack)) -      \
18                                                     (yyp_##stack - stack) )
19
20  #define yystk_item(stack,offset)    ( *(yyp_##stack + (offset)) )
21  #define yystk_p(stack)              yyp_##stack
22
23  #define yypush_(stack,x)            ( *--yyp_##stack = (x) )
24  #define yypop_(stack)               ( *yyp_##stack++ )
25
26  #define yypush(stack,x)             ( yystk_full(stack)                     \
27                                         ? ((yyt_##stack)(long)(yystk_err(1))) \
28                                         : yypush_(stack,x) )
29
30  #define yypop(stack)                ( yystk_empty(stack)                    \
31                                         ? ((yyt_##stack)(long)(yystk_err(0))) \
32                                         : yypop_(stack) )
33
34  #define yypopn_(stack,amt)          ( (yyp_##stack += amt)[-amt] )
35  #define yypopn(stack,amt)           ( (yystk_ele(stack) < amt)              \
36                                         ? ((yyt_##stack)(long)(yystk_err(0))) \
37                                         : yypopn_(stack,amt) )
38
39  #define yystk_err(o)    ( (o) ? ferr("Stack overflow\n" )                   \
40                                : ferr("Stack underflow\n") )
A.2 Set Manipulation
Many of the operations involved in compiler writing (like creating state-machine
tables from regular expressions and creating bottom-up parse tables from a grammar)
involve operations on sets, and C, unlike Pascal, doesn't have a built-in set capability.
Fortunately, it's not too hard to implement sets in C by means of bit maps:
one-dimensional arrays of one-bit numbers. This section presents a package of
bit-map-based, set-manipulation routines.
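To make the idea of a bit map concrete before the real package is described, here is a deliberately simplified sketch. It is not the SET implementation from <tools/set.h>: the type bitset, the fixed capacity MAX_MEMBER, and the bs_ functions are all hypothetical, and no range checking is done.

```c
#include <assert.h>
#include <limits.h>

#define MAX_MEMBER     64                           /* hypothetical capacity */
#define BITS_PER_WORD  (sizeof(unsigned) * CHAR_BIT)
#define NWORDS         ((MAX_MEMBER + BITS_PER_WORD - 1) / BITS_PER_WORD)

typedef struct { unsigned map[NWORDS]; } bitset;    /* hypothetical, not SET */

/* Member x lives in word x / BITS_PER_WORD, at bit x % BITS_PER_WORD. */
void bs_add(bitset *s, unsigned x)
{
    s->map[x / BITS_PER_WORD] |= (1u << (x % BITS_PER_WORD));
}

void bs_remove(bitset *s, unsigned x)
{
    s->map[x / BITS_PER_WORD] &= ~(1u << (x % BITS_PER_WORD));
}

int bs_member(const bitset *s, unsigned x)
{
    return (s->map[x / BITS_PER_WORD] >> (x % BITS_PER_WORD)) & 1u;
}

/* dest = dest UNION src, one machine word at a time. */
void bs_union(bitset *dest, const bitset *src)
{
    unsigned i;

    for (i = 0; i < NWORDS; ++i)
        dest->map[i] |= src->map[i];
}
```

The attraction of the representation shows up in bs_union(): operations on whole sets reduce to a handful of word-wide logical operations, regardless of how many members the sets have.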
A.2.1 Using the Set Functions and Macros
To use the set routines, you must put a #include <tools/set.h> at the top of
your file. The definitions for a SET are found there, and many of the set functions are
actually macros defined in <tools/set.h>. The set routines are described in the following
paragraphs. I'll give a few examples after the calling conventions are described. If a
name is in all caps in the following list, it's implemented as a macro, and most of these
macros have side effects. Be careful.
SET *newset(void)
    Create a new set and return a pointer to it. Print an error message and raise
    SIGABRT if there's insufficient memory. Normally this signal terminates the
    program, but you can use signal() to change the default action (the process is
    described in a moment). NULL is returned if raise() returns.
void delset(SET *set)
    Delete a set created with a previous newset() call and free the associated memory.
    The argument must have been returned from a previous newset() call.
SET *dupset(SET *set)
    Create a new set that has the same members as the input set. This routine is
    more efficient than using newset() to create the set and then copying the
    members one at a time, but otherwise has the same effect.
int _addset(SET *set, int bit)
    This is an internal function used by the ADD macro, and shouldn't be called by
    you.
int num_ele(SET *set)
    Return the number of elements in the input set. NULL sets (described below)
    are considered to be empty.
int _set_test(SET *set1, SET *set2)
    This is another workhorse function used internally by the macros. Don't call it
    directly.
int setcmp(SET *set1, SET *set2)
    Compare two sets in a manner similar to strcmp(), returning 0 if the sets are
    equivalent, <0 if set1<set2, and >0 if set1>set2. This routine lets you sort an array
    of SETs so that equivalent ones are adjacent. The determination of less than and
    greater than is pretty much arbitrary. (The routine just compares the bit maps as
    if you were doing a lexicographic ordering of an array of ints.)
unsigned sethash(SET *set1)
    This function is even more obscure than setcmp(). It is provided for those
    situations where a SET is used as the key in a hash table. It returns the sum of the
    individual words in the bit map.
int subset(SET *set, SET *sub)
    Return 1 if sub is a subset of set, 0 otherwise. Empty and null sets are subsets
    of everything, and 1 is returned if both sets are empty or null.
void _set_op(int op, SET *dest, SET *src)
    Another workhorse function used internally by the macros.
void invert(SET *set)
    Physically invert the bits in the set, setting 1's to 0 and vice versa. In effect, this
    operation removes all existing members from a set and adds all possible members
that weren't there before. Note that the set must be expanded to the maximum
possible size before calling invert(): ADD the largest element and then delete
it. See also COMPLEMENT().
void truncate(SET *set)
Clears the set and shrinks it back to the original, default size. Compare this
routine to the CLEAR() macro, described below, which clears all the bits in the map
but doesn't modify the size. This routine is really a more efficient replacement for
delset(s); s=newset();. If the original set isn't very big, you're better off
using CLEAR().
int next_member(SET *set)
When called several successive times with the same argument, returns the next
element of the set each time it's called, or -1 if there are no more elements. Every
time the set argument changes, the search for elements starts back at the
beginning of the set. A NULL argument also resets the search to the beginning of the
set (and does nothing else). Strange things happen if you add members to the set
between successive calls. If calls to next_member() are interspersed with calls
to pset() (discussed below), next_member() won't work properly. Calls to
next_member() on different sets cannot be interspersed.
void pset(SET *set, int (*out)(), void *param)
Print the set. The output routine pointed to by out is called for each element of
the set with the following arguments:
(*out)( param, "null",  -1 );    Null set
(*out)( param, "empty", -2 );    Empty set
(*out)( param, "%d ",    N );    Set element N
This way you can use fprintf() as a default output routine.
UNION(SET *dest, SET *src)
Modify the dest set to hold the union of the src and dest sets.
INTERSECT(SET *dest, SET *src)
Modify the dest set to hold the intersection of the src and dest sets.
DIFFERENCE(SET *dest, SET *src)
Modify the dest set to hold the symmetric difference of the src and dest
sets. (An element is put into the target set if it is a member of dest but not of
src, or vice versa.)
ASSIGN(SET *dest, SET *src)
Overwrite the dest with src.
CLEAR(SET *s)
Clear all bits in s, creating an empty set.
FILL(SET *s)
Set all bits in s to 1, creating a set that holds every element in the input alphabet.
COMPLEMENT(SET *s)
Complement a set efficiently by modifying the set's complement bit. Sets
complemented in this way cannot be manipulated by UNION(), etc. See also
invert() and INVERT().
INVERT(SET *s)
Complement a set by physically changing the bit map (see text).
IS_DISJOINT(SET *s1, SET *s2)
Evaluate to true only if the two sets are disjoint (have no elements in common).
IS_INTERSECTING(SET *s1, SET *s2)
Evaluate to true only if the two sets intersect (have at least one element in
common).
IS_EMPTY(SET *s)
Evaluate to true only if the set is empty (having no elements) or null (s is NULL).
IS_EQUIVALENT(SET *s1, SET *s2)
Evaluate to true only if the two sets are equivalent (have the same elements).
ADD(SET *s, int x)
Add the element x to set s. It is not an error to add an element to a set more
than once.
REMOVE(SET *s, int x)
Remove the element x from set s. It is not an error to remove an element that is
not in the set.
TEST(SET *s, int x)
Evaluates to true if x is an element of set s.
MEMBER(SET *s, int x)
Evaluates to true if x is an element of set s. This macro doesn't work on
COMPLEMENTed sets, but it's both faster and smaller than TEST, which does. The
distinction is described below.
The elements of sets must be numbers, though in many instances any arbitrary
number will do. Enumerated types are almost ideal for this purpose, though #defines
can be used too. For example:
typedef enum
{
    JAN, FEB, MAR,
    APR, MAY, JUN,
    JUL, AUG, SEP,
    OCT, NOV, DEC
} MONTHS;
creates 12 potential set elements. You can create two sets called winter and spring
by using the following set operations:
#include <tools/set.h>
SET *winter, *spring;
winter = newset();
spring = newset();
ADD( winter, JAN );
ADD( winter, FEB );
ADD( winter, MAR );
ADD( spring, APR );
ADD( spring, MAY );
ADD( spring, JUN );
(Note that the set is the first argument to ADD and the element is the second.)
Set operations can now be performed using the other macros in <tools/set.h>. For
example: IS_DISJOINT(winter, spring) evaluates to true because the sets have no
elements in common; IS_EQUIVALENT(winter, spring) evaluates to false for the
same reason. A third set that contains the union of spring and winter can be created
with:
half_year = dupset( winter );
UNION( half_year, spring );
Something like:
half_year = dupset( winter );
INTERSECT( half_year, spring );
creates an empty set because there are no common elements.
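The winter and spring example can be tried without the library by packing the twelve months into a single 16-bit word. This is only a sketch of the idea: month_set and the ms_ functions are hypothetical names, not part of <tools/set.h>, and a real SET grows dynamically instead of being limited to 16 elements.

```c
#include <assert.h>

typedef enum { JAN, FEB, MAR, APR, MAY, JUN,
               JUL, AUG, SEP, OCT, NOV, DEC } MONTHS;

typedef unsigned short month_set;   /* hypothetical: 12 months fit in one 16-bit word */

/* Return a copy of s with month m added. */
static month_set ms_add(month_set s, MONTHS m)
{
    assert((int)m >= JAN && (int)m <= DEC);
    return (month_set)(s | (1u << m));
}

static month_set ms_union(month_set a, month_set b)     { return (month_set)(a | b); }
static month_set ms_intersect(month_set a, month_set b) { return (month_set)(a & b); }
static int       ms_is_disjoint(month_set a, month_set b) { return (a & b) == 0; }

/* The two sets from the text. */
static month_set winter(void) { return ms_add(ms_add(ms_add(0, JAN), FEB), MAR); }
static month_set spring(void) { return ms_add(ms_add(ms_add(0, APR), MAY), JUN); }
```

Here ms_union(winter(), spring()) plays the role of half_year, and ms_intersect() of the same two sets yields 0, the empty set.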
There are two implementation difficulties with the set routines. The first is the
difference between a null set and an empty set. (I'll bet that you thought that the
difference was just one more obscure mathematical conundrum designed for no other
purpose than to make undergraduates' heads swim.) An empty set is a set that has no
elements. In the case of the routines presented here, newset() and dupset() both
create empty sets. They have allocated an internal data structure for representing the set,
but that set doesn't have anything in it yet. A null set, however, is a SET pointer with
nothing in it. For example:
SET *p = NULL;      /* p represents the null set */
p = newset();       /* p now represents an empty set */
In practice, this difference means that the routines have to be a bit more careful with
pointers than they would be otherwise, and are a little slower as a consequence.
Complemented sets present another problem. You'll notice that the eventual size of
the set doesn't have to be known when the set is created by newset(). The set size is
just expanded as elements are added to it. This can cause problems when you
complement a set, because the complemented set should contain all possible elements of the
input alphabet except the ones that are in the equivalent, uncomplemented, set. For
example, if you're working with a language that's comprised of the set of symbols
{A, B, C, D, E, F, G} and you create a second set {A, C, E, G} from elements of the
language, the complement of this second set should be {B, D, F}.
This ideal situation is difficult to achieve, however. Sets are represented internally as bit
maps, and these maps are of finite size. Moreover, the actual size of the map grows as
elements are added to the set. You can complement a set by inverting the sense of all the
bits in the map, but then you can't expand the set's size dynamically, at least not
without a lot of work. To guarantee that a complemented set contains all the potential
elements, you first must expand the set size by adding an element that is one larger than
any possible legitimate element, and then complement the expanded set. A second
problem has to do with extra elements. The bit-map size is usually a little larger than the
number of potential elements in the set, so you will effectively add members to the set if
you just stupidly complement bits. On the plus side, set operations (union, intersection,
etc.) are much easier to do on physically complemented sets.
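The extra-elements problem is easy to see with the seven-symbol language used above, stored in one 16-bit word. The sketch below is illustrative only: the names are hypothetical, and the masking step in masked_invert() is an assumption of mine about one way to avoid the problem, not something the set package itself does.

```c
#include <assert.h>

enum { A, B, C, D, E, F, G, ALPHABET_SIZE };   /* hypothetical 7-symbol language */

/* Naive physical complement: flips every bit in the word, including the
 * nine positions (7 to 15) that don't correspond to any symbol.
 */
static unsigned short naive_invert(unsigned short map)
{
    return (unsigned short)~map;
}

/* Complement restricted to the alphabet: the unused high bits stay 0. */
static unsigned short masked_invert(unsigned short map)
{
    unsigned short valid = (unsigned short)((1u << ALPHABET_SIZE) - 1);   /* 0x007f */
    return (unsigned short)(~map & valid);
}

/* Count the 1 bits in a word, one bit at a time. */
static int count_members(unsigned short map)
{
    int count = 0;
    assert(sizeof(map) * 8 >= ALPHABET_SIZE);
    for (; map; map >>= 1)
        count += map & 1;
    return count;
}
```

With {A, C, E, G} stored as 0x0055, the naive inversion produces a word with twelve members, nine of which are not symbols of the language, while the masked version produces exactly {B, D, F}.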
An alternate method of complementing the set uses negative-true sets and positive-
true sets. Here, you mark a set as negative or positive by setting a bit in the SET
structure. You don't have to modify the bit map at all. If a set is marked negative-true when
you test for membership, you can just reverse the sense of the test (evaluate to true if the
requested bit is 0 rather than 1). Though this method solves the size problem, operations
on negative-true sets are much harder to perform.
Since the two representations are both useful, but in different applications, I've
implemented both methods. The INVERT() macro performs a ones-complement on all
bits currently in the bit map. Note that if new elements are added, the new bits won't be
complemented. You should always expand a set out to the maximum number of
elements (by adding and then removing the largest element) before inverting it. The
COMPLEMENT() macro implements the second method. It doesn't modify the bit map at all;
rather, it sets a flag in the SET structure to mark a set as negative true.
Because there are two different classes of sets (those that are complemented and
those that are inverted), there are also two different macros for testing membership.
MEMBER() evaluates to true only if the bit corresponding to the requested element is
actually set to one. MEMBER() can't be used reliably on complemented sets. The
TEST() macro can be used with complemented sets, but it's both larger and slower than
MEMBER(). If a set is complemented, the sense of the individual bits is reversed as part
of the testing process. If the set isn't complemented, TEST() works just like MEMBER().
The various set operations (UNION, INTERSECT, and so forth) are only valid on
inverted sets. Use INVERT() if you're going to perform subsequent operations on the
inverted set. I leave it as an exercise to the reader to make _set_op() work on
complemented sets. The complement bit can represent all bits not in the bit map: if the
complement bit is set, every element beyond the end of the map is treated as a member,
and if it is clear, none of them are. I've found these
routines quite workable in their existing state; it seemed pointless to make the code
both larger and slower to correct what has turned out not to be a problem.
A.2.2 Set Implementation
The set routines are implemented in two places, the first of which is the macro file
<tools/set.h>, in Listing A.4. The file starts with various system-dependent definitions,
concentrated in one place to make it easier to port the code. _SETTYPE (on line three) is
used as the basic unit in a bit map. As I mentioned earlier, a bit map is a 1-dimensional
array of 1-bit objects. In terms of set operations, a number is in the set if the array
element at that position in the map is true (if bitmap[5] is true then 5 is in the set). The
map is implemented using an array of _SETTYPEs as follows:
array index   |        2        |        1        |        0        |
bit number    | 47 46 ... 33 32 | 31 30 ... 17 16 | 15 14 ...  1  0 |
Bit 0 is at position 0 in array[0], bit 1 is at position 1 in array[0], bit 20 is at position 4 in
array[1], and so forth. In order to make the array manipulation as efficient as possible,
_SETTYPE should be the largest integral type that can be manipulated with a single
instruction on the target machine. In an 8086, for example, the largest such type is a 16-
bit word (larger numbers require several instructions to manipulate them), so I've used
the 16-bit short as the _SETTYPE.
Listing A.4. set.h: Macro Definitions and Prototypes for Set Functions
 1  /* SET.H:  Macros and function prototypes for the set functions
 2   */
 3  typedef unsigned short     _SETTYPE ;                  /* one cell in bit map      */
 4  #define _BITS_IN_WORD      16
 5  #define _BYTES_IN_ARRAY(x) (x << 1)                    /* # of bytes in bit map    */
 6  #define _DIV_WSIZE(x)      ((unsigned)(x) >> 4)
 7  #define _MOD_WSIZE(x)      ((x) & 0x0f)
 8  #define _DEFWORDS          8                           /* elements in default set  */
 9  #define _DEFBITS           (_DEFWORDS * _BITS_IN_WORD) /* bits in default set      */
10  #define _ROUND(bit)        (((_DIV_WSIZE(bit) + 8) >> 3) << 3)
11
12  typedef struct _set_
13  {
14      unsigned char nwords ;             /* Number of words in map          */
15      unsigned char compl ;              /* is a negative true set if true  */
16      unsigned      nbits ;              /* Number of bits in map           */
17      _SETTYPE      *map ;               /* Pointer to the map              */
18      _SETTYPE      defmap[ _DEFWORDS ]; /* The map itself                  */
19
20  } SET;
21
22  extern int      _addset      P(( SET*, int                ));
23  extern void     delset       P(( SET*                     ));
24  extern SET      *dupset      P(( SET*                     ));
25  extern void     invert       P(( SET*                     ));
26  extern SET      *newset      P(( void                     ));
27  extern int      next_member  P(( SET*                     ));
28  extern int      num_ele      P(( SET*                     ));
29  extern void     pset         P(( SET*, int (*)(), void*   ));
30  extern void     _set_op      P(( int, SET*, SET*          ));
31  extern int      _set_test    P(( SET*, SET*               ));
32  extern int      setcmp       P(( SET*, SET*               ));
33  extern unsigned sethash      P(( SET*                     ));
34  extern int      subset       P(( SET*, SET*               ));
35  extern void     truncate     P(( SET*                     ));
36
37                                  /* Op argument passed to _set_op   */
38  #define _UNION      0           /* x is in s1 or s2                */
39  #define _INTERSECT  1           /* x is in s1 and s2               */
40  #define _DIFFERENCE 2           /* (x in s1) and (x not in s2)     */
41  #define _ASSIGN     4           /* s1 = s2                         */
42
43  #define UNION(d,s)       _set_op( _UNION,      d, s )
44  #define INTERSECT(d,s)   _set_op( _INTERSECT,  d, s )
45  #define DIFFERENCE(d,s)  _set_op( _DIFFERENCE, d, s )
46  #define ASSIGN(d,s)      _set_op( _ASSIGN,     d, s )
47
48  #define CLEAR(s)       memset( (s)->map,  0, (s)->nwords * sizeof(_SETTYPE))
49  #define FILL(s)        memset( (s)->map, ~0, (s)->nwords * sizeof(_SETTYPE))
50  #define COMPLEMENT(s)  ( (s)->compl = ~(s)->compl )
51  #define INVERT(s)      invert(s)
52
53  #define _SET_EQUIV  0   /* Value returned from _set_test, equivalent */
54  #define _SET_DISJ   1   /*                                disjoint   */
55  #define _SET_INTER  2   /*                                intersecting */
56
57  #define IS_DISJOINT(s1,s2)     ( _set_test(s1,s2) == _SET_DISJ  )
58  #define IS_INTERSECTING(s1,s2) ( _set_test(s1,s2) == _SET_INTER )
59  #define IS_EQUIVALENT(a,b)     ( setcmp((a),(b)) == 0           )
60  #define IS_EMPTY(s)            ( num_ele(s) == 0                )
61
62  /* All of the following have heavy-duty side-effects. Be careful.
63   */
64  #define _GBIT(s,x,op) ( ((s)->map)[_DIV_WSIZE(x)] op (1 << _MOD_WSIZE(x)) )
65
66  #define REMOVE(s,x)  (((x) >= (s)->nbits) ? 0            : _GBIT(s,x,&= ~) )
67  #define ADD(s,x)     (((x) >= (s)->nbits) ? _addset(s,x) : _GBIT(s,x,|=  ) )
68  #define MEMBER(s,x)  (((x) >= (s)->nbits) ? 0            : _GBIT(s,x,&   ) )
69  #define TEST(s,x)    (( MEMBER(s,x) )     ? !(s)->compl  : (s)->compl      )
The macros on lines four to seven all reflect the size of the _SETTYPE and will have
to be changed if _SETTYPE is changed. _BITS_IN_WORD is just that, the number of
bits in a variable of type _SETTYPE. _BYTES_IN_ARRAY is passed a count, and it
returns the number of bytes in an array of that many _SETTYPE-sized variables. In this
case, I multiply the count by two using a left shift rather than doing something like
x * sizeof(_SETTYPE), which uses a multiply on my compiler; the shift is a more
efficient operation than the multiply.
The _DIV_WSIZE(x) and _MOD_WSIZE(x) macros help determine the position of
a particular bit in the map. The argument to both macros is the bit for which you want
the position. The _DIV_WSIZE(x) macro evaluates to the index of the array element that
holds the bit. _DIV_WSIZE(20) evaluates to 1 because bit 20 is in array[1]. It's just doing
an integer divide by 16, though I'm using a right shift here for efficiency's sake. The
_MOD_WSIZE(x) macro evaluates to the position of the bit within the word: the offset
in bits from the least-significant bit in the word. _MOD_WSIZE(20) evaluates to 4
because bit 20 is at offset 4 of array[1]. I'm doing an efficient modulus-16 operation by
using a bitwise AND rather than a %.
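The two position computations can be written as ordinary functions and checked against the division and modulus they replace. This is a minimal sketch assuming the 16-bit cell described above; word_index(), bit_offset(), and check_div_mod() are illustrative names, not part of set.h.

```c
#include <assert.h>

#define BITS_IN_WORD 16     /* matches the 16-bit cell used in the text */

/* Index of the array element that holds the bit (same as bit / 16). */
static unsigned word_index(unsigned bit) { return bit >> 4; }

/* Offset of the bit from the least-significant bit of its word (same as bit % 16). */
static unsigned bit_offset(unsigned bit) { return bit & 0x0f; }

/* Confirm that the shift and mask agree with / and % for every bit below limit. */
static void check_div_mod(unsigned limit)
{
    unsigned bit;
    for (bit = 0; bit < limit; ++bit) {
        assert(word_index(bit) == bit / BITS_IN_WORD);
        assert(bit_offset(bit) == bit % BITS_IN_WORD);
    }
}
```

The shortcut works only because the word size is a power of two; a 24-bit cell, for instance, would need a real divide.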
_DEFWORDS, on line eight of Listing A.4, determines the array size for a default bit
map. Initially, all bit maps have this many elements. The size is expanded (in
_DEFWORDS-sized chunks) if needed. _DEFWORDS is set to 8 here, so the default map
can have 128 elements before it needs to be expanded. _DEFBITS, on the next line, is
just the number of bits required for that many words.
The _ROUND macro is used to expand the size of the array. The array grows in
_DEFWORDS-sized chunks. Say, for example, that the array starts out at the default size
of 8 words, and you want to add the number 200 to the set. The array must be expanded
to do so, and after the expansion, the array should have 16 elements in it
(2 x _DEFWORDS). In this situation the macro expands to:
(((_DIV_WSIZE(200) + 8) >> 3) << 3)
and one more level to:
(((((unsigned)(200) >> 4) + 8) >> 3) << 3)
The 200>>4 evaluates to 12, so bit 200 is in array[12]. 12 plus 8 is 20, and 20>>3
yields 2. (The >>3 is an integer divide by 8.) The final multiply by 8 (the <<3) yields
16, so the map array is expanded to 16 elements.
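The rounding arithmetic can be checked with a small function that mirrors the macro. This is a sketch under the same assumptions as the text (16-bit words, growth in chunks of 8 words); round_words() is a hypothetical name.

```c
#include <assert.h>

#define DEFWORDS 8      /* default map size in words, as in the text */

/* Number of words the map needs in order to hold `bit`, rounded up to a
 * multiple of DEFWORDS.
 */
static unsigned round_words(unsigned bit)
{
    unsigned word  = bit >> 4;                        /* word that holds the bit */
    unsigned words = ((word + DEFWORDS) >> 3) << 3;   /* next multiple of 8      */

    assert(words > word);       /* the rounded size always covers the target word */
    return words;
}
```

Bit 127 still fits in the default 8 words, while bit 128 is the first one that forces the map to 16 words.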
The SET itself is represented with the following structure, defined on lines 12 to 20
of Listing A.4.
typedef struct _set_
{
    unsigned char nwords ;              /* Number of words in map     */
    unsigned char compl ;               /* negative true if set       */
    unsigned      nbits ;               /* Number of bits in map      */
    _SETTYPE      *map ;                /* Pointer to the map         */
    _SETTYPE      defmap[ _DEFWORDS ];  /* The default map            */
} SET;
The number of bits in the map (nbits) could be computed from the number of words
(nwords), but it's more efficient (of computation time, not space) to keep both
numbers in the structure. The compl field is set to true if this is a negative-true set. The
defmap[] array is the default bit map. Initially, map just points to it. When the map
grows, however, a new array is allocated with malloc() and map is modified to point at
the new array. This strategy is used rather than a realloc() call for run-time
efficiency, at the cost of wasted memory: realloc() will generally have to copy the
entire structure, but only the map needs to be copied if you do it yourself.
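The default-map strategy can be sketched in isolation as follows. This is not the library's code: small_set and the ss_ functions are hypothetical, and the sketch reports allocation failure with a return value rather than terminating the program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEFWORDS 8

typedef struct
{
    unsigned        nwords;             /* words currently in the map            */
    unsigned short *map;                /* points at defmap until the set grows  */
    unsigned short  defmap[DEFWORDS];   /* built-in default map                  */
} small_set;

static void ss_init(small_set *s)
{
    memset(s, 0, sizeof *s);
    s->map    = s->defmap;
    s->nwords = DEFWORDS;
}

/* Grow the map to `need` words. Only the map is copied, never the structure.
 * Returns 1 on success, 0 if memory isn't available.
 */
static int ss_grow(small_set *s, unsigned need)
{
    unsigned short *bigger;

    assert(s != NULL);
    if (need <= s->nwords)
        return 1;
    if (!(bigger = calloc(need, sizeof *bigger)))     /* new words start at zero */
        return 0;
    memcpy(bigger, s->map, s->nwords * sizeof *bigger);
    if (s->map != s->defmap)                          /* never free the built-in map */
        free(s->map);
    s->map    = bigger;
    s->nwords = need;
    return 1;
}

static void ss_release(small_set *s)
{
    if (s->map != s->defmap)
        free(s->map);
    ss_init(s);
}

/* Exercise the growth path; returns 1 if everything behaved as described. */
static int ss_demo(void)
{
    small_set s;
    int ok;

    ss_init(&s);
    s.map[0] = 0x0020;                                /* element 5 */
    if (!ss_grow(&s, 16))
        return 0;
    ok = s.map != s.defmap && s.nwords == 16 && s.map[0] == 0x0020 && s.map[15] == 0;
    ss_release(&s);
    return ok && s.map == s.defmap;
}
```

The price of this layout is the wasted memory mentioned above: once the map has grown, the built-in default array is dead weight inside the structure.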
<set.h> continues on line 22 of Listing A.4 with function prototypes for the actual
functions. The macros on lines 43 to 60 handle operations that actually modify a set
(union, intersection, symmetric difference, and so forth). Most of them map to
_set_op() calls, passing in a constant to tell the subroutine which function is required.
CLEAR and FILL modify the bit map directly, however, calling memset() to fill the bits
with zeros or ones as appropriate.
The real heart of <set.h> is the group of set-manipulation macros at the end of the file,
reproduced here:
#define _GBIT(s,x,op) ( ((s)->map)[_DIV_WSIZE(x)] op (1 << _MOD_WSIZE(x)) )
#define REMOVE(s,x)  (((x) >= (s)->nbits) ? 0            : _GBIT(s,x,&= ~) )
#define ADD(s,x)     (((x) >= (s)->nbits) ? _addset(s,x) : _GBIT(s,x,|=  ) )
#define MEMBER(s,x)  (((x) >= (s)->nbits) ? 0            : _GBIT(s,x,&   ) )
#define TEST(s,x)    (( MEMBER(s,x) )     ? !(s)->compl  : (s)->compl      )
The REMOVE, ADD, and MEMBER macros all evaluate to _GBIT invocations. The only
difference is the operator (op) passed into the macro. The first part of _GBIT uses the
following to select the array element in which the required bit is found:
((s)->map)[ _DIV_WSIZE(x) ]
The second half of the macro shifts the number 1 to the left so that it will fall in the same
position as the required bit, using
(1 << _MOD_WSIZE(x))
If you're accessing bit 5, the number 1 is shifted left 5 bits, yielding the following binary
mask in a 16-bit word:
0000000000100000
The same shift happens when you access bit 21 (21 modulo 16 is also 5), but in this case
the first half of the macro chooses (s)->map[1] rather than (s)->map[0]. The op argument
now comes into play. The shifted mask is ORed with the existing array element to add it
to the set. Similarly, the inverse of the mask is ANDed to the array element to clear a bit.
AND is also used to test for membership, but since there's no = in MEMBER, the map is not
modified.
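Written as functions rather than macros, the three operations look roughly like this. It is a sketch over a fixed-size map: bit_mask() and the bm_ functions are hypothetical names, and the range check is an assert here where the real macros instead fall back to _addset() or evaluate to 0.

```c
#include <assert.h>

#define NWORDS 8                        /* fixed-size map for the sketch */

/* A 1 shifted into the position the bit occupies within its word. */
static unsigned short bit_mask(unsigned bit)
{
    return (unsigned short)(1u << (bit & 0x0f));
}

static void bm_add(unsigned short *map, unsigned bit)       /* OR the mask in       */
{
    assert(bit < NWORDS * 16);
    map[bit >> 4] |= bit_mask(bit);
}

static void bm_remove(unsigned short *map, unsigned bit)    /* AND the inverse mask */
{
    assert(bit < NWORDS * 16);
    map[bit >> 4] &= (unsigned short)~bit_mask(bit);
}

static int bm_member(const unsigned short *map, unsigned bit)   /* AND, no assignment */
{
    assert(bit < NWORDS * 16);
    return (map[bit >> 4] & bit_mask(bit)) != 0;
}

/* Add elements 5 and 21, then remove 5; returns 1 if the map behaves as expected. */
static int bm_demo(void)
{
    unsigned short map[NWORDS] = { 0 };

    bm_add(map, 5);
    bm_add(map, 21);
    if (map[0] != 0x0020 || map[1] != 0x0020)    /* same mask, different words */
        return 0;
    bm_remove(map, 5);
    return !bm_member(map, 5) && bm_member(map, 21) && map[0] == 0;
}
```

Functions evaluate their arguments exactly once, so they avoid the side-effect hazards of the macros, at the cost of a call unless the compiler inlines them.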
The TEST macro, which can handle complemented sets, works by first determining
whether x is in the map, and then evaluating to the complement flag or its inverse, as
appropriate. For example, the complement flag is 1 in a negative-true set, and MEMBER
tests true if the bit for x is set in the map. In this case TEST evaluates to the inverse of
the complement flag (0), so x is not a member. I've used this somewhat convoluted
approach because of the sheer size of the macro. TEST(s,x) expands to this monster:
( ( ( ((x) >= (s)->nbits) ? 0 :
      (((s)->map)[((unsigned)(x) >> 4)] & (1 << ((x) & 0x0f))) ) ) ?
      !(s)->compl : (s)->compl );
but the more obvious solution:
(s)->compl ? !MEMBER(s,x) : MEMBER(s,x)
turns into this:
(s)->compl
    ? !( ((x) >= (s)->nbits) ? 0 :
         (((s)->map)[((unsigned)(x) >> 4)] & (1 << ((x) & 0x0f))) )
    :  ( ((x) >= (s)->nbits) ? 0 :
         (((s)->map)[((unsigned)(x) >> 4)] & (1 << ((x) & 0x0f))) )
which is even worse.
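The logic of TEST is easier to follow on a one-word set. In the sketch below, flag_set and the fs_ functions are hypothetical, and the flag is called negative rather than compl; the selection in fs_test() has the same shape as the macro.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    unsigned short map;         /* one 16-bit word of members           */
    int            negative;    /* 1 if this is a negative-true set     */
} flag_set;

/* Like MEMBER: looks only at the physical bit, 0 for anything out of range. */
static int fs_member(const flag_set *s, unsigned x)
{
    assert(s != NULL);
    return x < 16 && ((s->map >> x) & 1u);
}

/* Like TEST: pick the flag or its inverse, depending on the physical bit. */
static int fs_test(const flag_set *s, unsigned x)
{
    return fs_member(s, x) ? !s->negative : s->negative;
}

/* Convenience wrapper so that a test can be written as a single expression. */
static int fs_test_bits(unsigned short map, int negative, unsigned x)
{
    flag_set s;
    s.map      = map;
    s.negative = negative;
    return fs_test(&s, x);
}
```

Note the last case in particular: an element that is beyond the end of the map tests true in a negative-true set, which is how the flag stands in for all the bits that aren't physically present.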
The functions needed for set manipulation are all in set.c, the first part of which is
Listing A.5. The newset() function, which actually creates the new set, starts on line
nine. Normally, a pointer to the newly allocated SET is returned. If insufficient memory
is available, an error message is printed and the program is terminated by the
raise(SIGABRT) call on line 21. raise() is an ANSI function which, in this case,
causes the program to terminate. I've chosen to abort the program rather than return
NULL (like malloc() does) because most applications will terminate the program
anyway if memory isn't available. If you want NULL to be returned on an error rather than
aborting the program, just disable the SIGABRT signal as follows:
#include <signal.h>
main()
{
    signal( SIGABRT, SIG_IGN );
}
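The effect of ignoring the signal can be demonstrated without actually running out of memory. The following is a sketch, not the library's code: alloc_or_raise() is a hypothetical stand-in for newset() with an extra simulate_failure argument so that the failure path can be forced.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate size bytes. On failure, print a message and raise SIGABRT; NULL is
 * returned only if raise() returns (the signal is ignored or handled).
 */
static void *alloc_or_raise(size_t size, int simulate_failure)
{
    void *p;

    assert(size > 0);
    p = simulate_failure ? NULL : malloc(size);
    if (!p) {
        fprintf(stderr, "Can't get memory\n");
        raise(SIGABRT);
        return NULL;                    /* Usually won't get here */
    }
    return p;
}

/* With SIGABRT ignored, a failed allocation comes back as NULL. */
static int failure_returns_null(void)
{
    void (*old)(int) = signal(SIGABRT, SIG_IGN);
    void *p = alloc_or_raise(64, 1);

    if (old != SIG_ERR)
        signal(SIGABRT, old);           /* restore the previous disposition */
    return p == NULL;
}

/* The normal path just hands back the memory. */
static int success_returns_memory(void)
{
    void *p  = alloc_or_raise(64, 0);
    int   ok = p != NULL;

    free(p);
    return ok;
}
```

This works because the code calls raise(SIGABRT) rather than abort(); abort() terminates the program even when the signal is being ignored.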
Listing A.5. set.c: SET Creation and Destruction
 1  #include <stdio.h>
 2  #include <ctype.h>
 3  #include <signal.h>
 4  #include <stdlib.h>
 5  #include <string.h>
 6  #include <tools/debug.h>
 7  #include <tools/set.h>
 8
 9  PUBLIC SET *newset()
10  {
11      /* Create a new set and return a pointer to it. Print an error message
12       * and raise SIGABRT if there's insufficient memory. NULL is returned
13       * if raise() returns.
14       */
15
16      SET *p;
17
18      if( !(p = (SET *) malloc( sizeof(SET) )) )
19      {
20          fprintf( stderr, "Can't get memory to create set\n" );
21          raise( SIGABRT );
22          return NULL;                 /* Usually won't get here */
23      }
24      memset( p, 0, sizeof(SET) );
25      p->map    = p->defmap;
26      p->nwords = _DEFWORDS;
27      p->nbits  = _DEFBITS;
28      return p;
29  }
30
31  /*----------------------------------------------------------------------*/
32
33  PUBLIC void delset( set )
34  SET     *set;
35  {
36      /* Delete a set created with a previous newset() call. */
37
38      if( set->map != set->defmap )
39          free( set->map );
40      free( set );
41  }
42
43  /*----------------------------------------------------------------------*/
44
45  PUBLIC SET *dupset( set )
46  SET     *set;
47  {
48      /* Create a new set that has the same members as the input set */
49
50      SET *new;
51
52      if( !(new = (SET *) malloc( sizeof(SET) )) )
53      {
54          fprintf( stderr, "Can't get memory to duplicate set\n" );
55          exit( 1 );
56      }
57
58      memset( new, 0, sizeof(SET) );
59      new->compl  = set->compl;
60      new->nwords = set->nwords;
61      new->nbits  = set->nbits;
62
63      if( set->map == set->defmap )            /* default bit map in use */
64      {
65          new->map = new->defmap;
66          memcpy( new->defmap, set->defmap, _DEFWORDS * sizeof(_SETTYPE) );
67      }
68      else                                     /* bit map has been enlarged */
69      {
70          new->map = (_SETTYPE *) malloc( set->nwords * sizeof(_SETTYPE) );
71          if( !new->map )
72          {
73              fprintf( stderr, "Can't get memory to duplicate set bit map\n" );
74              exit( 1 );
75          }
76          memcpy( new->map, set->map, set->nwords * sizeof(_SETTYPE) );
77      }
78      return new;
79  }
The delset() function starting on line 33 is the SET destructor subroutine. It
frees any memory used for an expanded bit map on line 38, and then frees the memory
used for the SET itself.
dupset() (starting on line 45 of Listing A.5) duplicates an existing set. It's much
more efficient than calling newset() and then adding members to the new set one at a
time.
The functions in Listing A.6 handle set enlargement: _addset() is called from the
ADD macro when the bit map is not large enough to hold the requested bit. All it does is
call enlarge() (which starts on line 95) to make the map larger, and then invokes
_GBIT to set the bit. enlarge() is passed the required number of words in the bit
map (need). The test on line 106 causes the routine to return if no expansion is required.
Note that exit() is called on line 114 if enlarge() can't get memory. I've done this
rather than call raise() because I'm assuming that the value returned from ADD will
not be tested. It would be risky to call raise() in this situation because the signal
handler might have been reassigned to an empty function, as was discussed earlier.
The next part of set.c (in Listing A.7) consists of various testing functions. The
num_ele() function, starting on line 126, determines the number of elements in the set.
It does this by looking at the map array one byte at a time, using a table lookup to do the
counting. The table in question is nbits[], declared on lines 134 to 152. The table is
indexed with a number in the range 0-255, and it evaluates to the number of bits set to 1
in that number. For example, the decimal number 93 is 01011101 in binary. This
number has five ones in it, so nbits[93] holds 5. The loop on lines 162 and 163 just
goes through the map array byte by byte, looking up each byte in nbits[] and
accumulating the count in count.
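A table-driven count along these lines might look as follows. The sketch fills the table at run time on first use, which is an assumption made to keep the example short (the table described above is a static initializer), and init_nbits(), count_bits(), and the two helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

static unsigned char nbits[256];        /* nbits[i] = number of 1 bits in i */

/* Fill the table: the count for i is its low bit plus the count for i/2. */
static void init_nbits(void)
{
    unsigned i;

    nbits[0] = 0;
    for (i = 1; i < 256; ++i)
        nbits[i] = (unsigned char)((i & 1u) + nbits[i >> 1]);
}

/* Count the 1 bits in a map by looking up one byte at a time. */
static int count_bits(const void *map, size_t nbytes)
{
    const unsigned char *p = (const unsigned char *)map;
    int count = 0;

    assert(map != NULL);
    if (nbits[1] == 0)                  /* table not built yet */
        init_nbits();
    while (nbytes-- > 0)
        count += nbits[*p++];
    return count;
}

static int count_in_byte(unsigned char b) { return count_bits(&b, 1); }

/* A two-word map holding elements 5 and 20. */
static int count_demo(void)
{
    unsigned short map[2] = { 0x0020, 0x0010 };
    return count_bits(map, sizeof map);
}
```

Because the count doesn't depend on the order in which bytes are visited, the byte-at-a-time scan gives the same answer on big-endian and little-endian machines.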
The _set_test() function starting on line 170 of Listing A.7 compares two sets,
returning _SET_EQUIV if the sets are equivalent (have the same elements),
_SET_INTER if the sets intersect (have at least one element in common) but aren't
equivalent, and _SET_DISJ if the sets are disjoint (have no elements in common). A
bitwise AND is used to test for intersection on line 203; the test for equivalence was
done on line 195 of Listing A.7. If the AND tests true, there must be at least one bit set at
the same relative position in both bit-map elements.
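The three-way classification can be sketched for two maps that are already the same size. This is not Listing A.7: the enum, classify(), and classify1() are hypothetical, and the sketch makes a single pass that tracks both conditions instead of following the original's control flow.

```c
#include <assert.h>
#include <stddef.h>

enum relation { SETS_EQUIV, SETS_DISJ, SETS_INTER };    /* hypothetical names */

/* Classify two bit maps of the same length. */
static enum relation classify(const unsigned short *a, const unsigned short *b,
                              size_t nwords)
{
    int    equiv = 1, inter = 0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < nwords; ++i) {
        if (a[i] != b[i])
            equiv = 0;          /* some element is in one set but not the other */
        if (a[i] & b[i])
            inter = 1;          /* a bit is set at the same position in both    */
    }
    if (equiv)
        return SETS_EQUIV;
    return inter ? SETS_INTER : SETS_DISJ;
}

/* One-word convenience wrapper. */
static enum relation classify1(unsigned short a, unsigned short b)
{
    return classify(&a, &b, 1);
}
```

Two empty maps come out as equivalent in this sketch, since the equivalence check takes priority over the intersection check.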
Note that the sets are made the same size with the enlarge() calls on lines 187 and
188 before they are compared, and this expansion can waste time if the sets are likely to
be different sizes. For this reason, a second comparison function, setcmp(), is
provided on line 215. This routine works like strcmp(), returning zero if the sets are
equivalent, a negative number if set1<set2, and a positive number if set1>set2.
Since setcmp() does not modify the set sizes, using it can be less time consuming than
using _set_test(). The main purpose of this second comparison function is to let
you sort an array of SET pointers so that the equivalent sets are adjacent. The
determination of less than and greater than is pretty much arbitrary: the routine just compares
the maps as if they were arrays of ints. setcmp() first compares the map elements that
exist in both of the sets on lines 228 to 230. If there's a mismatch, the two mismatched
elements are subtracted to get a relative ordering. The code on lines 236 to 247 is executed
only if the two sets are identical up to this point. The tail end of the larger set is then
scanned to make sure that it's all zeros. If so, the two sets are equivalent. If not, the larger
set is the greater.
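A comparison in this style, for two maps of possibly different lengths, can be sketched as follows. map_cmp() is a hypothetical name, and the sketch returns -1, 0, or 1 from an explicit comparison rather than subtracting the mismatched elements.

```c
#include <assert.h>
#include <stddef.h>

/* Compare two bit maps without resizing either one. */
static int map_cmp(const unsigned short *a, size_t na,
                   const unsigned short *b, size_t nb)
{
    size_t common = na < nb ? na : nb;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < common; ++i)                /* words that exist in both maps */
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;

    for (; i < na; ++i)                         /* tail of a, if a is longer */
        if (a[i])
            return 1;
    for (; i < nb; ++i)                         /* tail of b, if b is longer */
        if (b[i])
            return -1;
    return 0;                                   /* any tail was all zeros */
}

/* A one-word map against three-word maps with an empty and a nonempty tail. */
static int cmp_demo(void)
{
    unsigned short a[1] = { 0x0007 };
    unsigned short b[3] = { 0x0007, 0, 0 };
    unsigned short c[3] = { 0x0007, 0, 1 };

    return map_cmp(a, 1, b, 3) == 0
        && map_cmp(a, 1, c, 3) <  0
        && map_cmp(c, 3, a, 1) >  0;
}
```

Only one of the two tail loops can ever execute, because i has already reached the length of the shorter map when the first loop ends.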
The final comparison function is subset(), starting on line 275 of Listing A.7.
common and tail, initialized on lines 289 to 298, are the number of words that exist in
both sets, and the number of extra bytes that have to be tested in the potential subset. For
example, if set A has 10 bytes and set B has 20, and if you're determining whether set A
is a subset of set B, you need to look only at the first 10 bytes of both sets. It doesn't
matter whether the last 10 bytes of set B have anything in them. If, however, you want
Listing A.6. set.c: Adding Members to and Enlarging a SET
 80  PUBLIC int _addset( set, bit )
 81  SET     *set;
 82  {
 83      /* Addset is called by the ADD() macro when the set isn't big enough. It
 84       * expands the set to the necessary size and sets the indicated bit.
 85       */
 86
 87      void enlarge( int, SET* );           /* immediately following */
 88
 89      enlarge( _ROUND(bit), set );
 90      return _GBIT( set, bit, |= );
 91  }
 92
 93  /*----------------------------------------------------------------------*/
 94
 95  PRIVATE void enlarge( need, set )
 96  SET     *set;
 97  {
 98      /* Enlarge the set to "need" words, filling in the extra words with zeros.
 99       * Print an error message and abort by raising SIGABRT if there's not enough
100       * memory. NULL is returned if raise() returns. Since this routine calls
101       * malloc, it's rather slow and should be avoided if possible.
102       */
103
104      _SETTYPE *new;
105
106      if( !set || need <= set->nwords )
107          return;
108
109      D( printf("enlarging %d word map to %d words\n", set->nwords, need); )
110
111      if( !(new = (_SETTYPE *) malloc( need * sizeof(_SETTYPE))) )
112      {
113          fprintf( stderr, "Can't get memory to expand set\n" );
114          exit( 1 );
115      }
116      memcpy( new, set->map, set->nwords * sizeof(_SETTYPE) );
117      memset( new + set->nwords, 0, (need - set->nwords) * sizeof(_SETTYPE) );
118
119      if( set->map != set->defmap )
120          free( set->map );
121
122      set->map    = new ;
123      set->nwords = (unsigned char) need ;
124      set->nbits  = need * _BITS_IN_WORD ;
125  }
to know whether the longer set (B) is a subset of the shorter one (A), all bytes that are
not in set A (the extra 10 bytes in B) must be scanned to make sure they're all zeros. The
common parts of the sets are scanned by the loop on line 303, and the tail of the larger
set is scanned, if necessary, by the loop on line 307.
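The two-loop subset test can be tried out in isolation with a sketch like the following. The function name is_subset() and the use of byte arrays are assumptions made for the illustration; the library routine works on SET structures.

```c
#include <stddef.h>

/* Hypothetical stand-alone subset test. Returns 1 if every bit set in sub[]
 * is also set in set[], 0 otherwise. Bytes missing from the shorter map are
 * treated as implied zeros.
 */
int is_subset(const unsigned char *set, size_t nset,
              const unsigned char *sub, size_t nsub)
{
    size_t common = nsub < nset ? nsub : nset;
    size_t i;

    for (i = 0; i < common; i++)            /* bytes present in both maps */
        if ((sub[i] & set[i]) != sub[i])
            return 0;

    for (i = common; i < nsub; i++)         /* tail of a longer candidate */
        if (sub[i])                         /* must be all zeros          */
            return 0;

    return 1;
}
```

When the candidate subset is the shorter map, the second loop never executes, which is the "look only at the first 10 bytes" case from the example above.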
The next three functions, in Listing A.8, all modify a set one way or another. The
_set_op() function on line 313 performs the union, intersection, symmetric difference,
and assignment functions. I've used the same strategy here as I used earlier when the
sets were different sizes. The common words are manipulated first, and the tail of the
longer set is modified if
Listing A.7. set.c: Set-Testing Functions
126  PUBLIC int num_ele( set )
127  SET *set;
128  {
129      /* Return the number of elements (nonzero bits) in the set. NULL sets are
130       * considered empty. The table-lookup approach used here was suggested to
131       * me by Doug Merrit. Nbits[] is indexed by any number in the range 0-255
132       * and it evaluates to the number of bits in the number.
133       */
134      static unsigned char nbits[] =
135      {
136          /*   0-15  */  0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
137          /*  16-31  */  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
138          /*  32-47  */  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
139          /*  48-63  */  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
140          /*  64-79  */  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
141          /*  80-95  */  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
142          /*  96-111 */  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
143          /* 112-127 */  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
144          /* 128-143 */  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
145          /* 144-159 */  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
146          /* 160-175 */  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
147          /* 176-191 */  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
148          /* 192-207 */  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
149          /* 208-223 */  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
150          /* 224-239 */  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
151          /* 240-255 */  4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
152      };
153      int            i;
154      unsigned int   count = 0;
155      unsigned char *p;
156
157      if( !set )
158          return 0;
159
160      p = (unsigned char *)set->map ;
161
162      for( i = _BYTES_IN_ARRAY(set->nwords) ; --i >= 0 ; )
163          count += nbits[ *p++ ] ;
164
165      return count;
166  }
167
168  /* ------------------------------------------------------------------- */
169
170  PUBLIC int _set_test( set1, set2 )
171  SET *set1, *set2;
172  {
173      /* Compares two sets. Returns as follows:
174       *
175       * _SET_EQUIV    Sets are equivalent
176       * _SET_INTER    Sets intersect but aren't equivalent
177       * _SET_DISJ     Sets are disjoint
178       *
179       * The smaller set is made larger if the two sets are different sizes.
180       */
181
182      int       i, rval = _SET_EQUIV ;
183      _SETTYPE  *p1, *p2;
184
185      i = max( set1->nwords, set2->nwords );
186
187      enlarge( i, set1 );             /* Make the sets the same size */
188      enlarge( i, set2 );
189
190      p1 = set1->map;
191      p2 = set2->map;
192
193      for(; --i >= 0 ; p1++, p2++ )
194      {
195          if( *p1 != *p2 )
196          {
197              /* You get here if the sets aren't equivalent. You can return
198               * immediately if the sets intersect but have to keep going in the
199               * case of disjoint sets (because the sets might actually intersect
200               * at some byte, as yet unseen).
201               */
202
203              if( *p1 & *p2 )
204                  return _SET_INTER ;
205              else
206                  rval = _SET_DISJ ;
207          }
208      }
209
210      return rval;                    /* They're equivalent */
211  }
212
213  /* ------------------------------------------------------------------- */
214
215  PUBLIC setcmp( set1, set2 )
216  SET *set1, *set2;
217  {
218      /* Yet another comparison function. This one works like strcmp(),
219       * returning 0 if the sets are equivalent, <0 if set1<set2 and >0 if
220       * set1>set2.
221       */
222
223      int       i, j;
224      _SETTYPE  *p1, *p2;
225
226      i = j = min( set1->nwords, set2->nwords );
227
228      for( p1 = set1->map, p2 = set2->map ; --j >= 0 ; p1++, p2++ )
229          if( *p1 != *p2 )
230              return *p1 - *p2;
231
232      /* You get here only if all words that exist in both sets are the same.
233       * Check the tail end of the larger array for all zeros.
234       */
235
236      if( (j = set1->nwords - i) > 0 )        /* Set 1 is the larger */
237      {
238          while( --j >= 0 )
239              if( *p1++ )
240                  return 1;
241      }
242      else if( (j = set2->nwords - i) > 0 )   /* Set 2 is the larger */
243      {
244          while( --j >= 0 )
245              if( *p2++ )
246                  return -1;
247      }
248
249      return 0;                               /* They're equivalent  */
250  }
251
252  /* ------------------------------------------------------------------- */
253
254  PUBLIC unsigned sethash( set1 )
255  SET *set1;
256  {
257      /* hash the set by summing together the words in the bit map */
258
259      _SETTYPE  *p;
260      unsigned  total;
261      int       j;
262
263      total = 0;
264      j     = set1->nwords ;
265      p     = set1->map    ;
266
267      while( --j >= 0 )
268          total += *p++ ;
269
270      return total;
271  }
272
273  /* ------------------------------------------------------------------- */
274
275  PUBLIC int subset( set, possible_subset )
276  SET *set, *possible_subset;
277  {
278      /* Return 1 if "possible_subset" is a subset of "set". One is returned if
279       * it's a subset, zero otherwise. Empty sets are subsets of everything.
280       * The routine silently malfunctions if given a NULL set, however. If the
281       * "possible_subset" is larger than the "set", then the extra bytes must
282       * be all zeros.
283       */
284
285      _SETTYPE  *subsetp, *setp;
286      int       common;       /* This many bytes in potential subset */
287      int       tail;         /* This many implied 0 bytes in b      */
288
289      if( possible_subset->nwords > set->nwords )
290      {
291          common = set->nwords ;
292          tail   = possible_subset->nwords - common ;
293      }
294      else
295      {
296          common = possible_subset->nwords;
297          tail   = 0;
298      }
299
300      subsetp = possible_subset->map;
301      setp    = set->map;
302
303      for(; --common >= 0; subsetp++, setp++ )
304          if( (*subsetp & *setp) != *subsetp )
305              return 0;
306
307      while( --tail >= 0 )
308          if( *subsetp++ )
309              return 0;
310
311      return 1;
312  }
Listing A.8. set.c: Set Manipulation Functions
313  PUBLIC void _set_op( op, dest, src )
314  int op;
315  SET *src, *dest;
316  {
317      /* Performs binary operations depending on op:
318       *
319       * _UNION:       union of src and dest
320       * _INTERSECT:   intersection of src and dest
321       * _DIFFERENCE:  symmetric difference of src and dest
322       * _ASSIGN:      dest = src
323       *
324       * The size of the destination set is adjusted so that it's the same size
325       * as the source set.
326       */
327
328      _SETTYPE  *d;       /* Pointer to destination map   */
329      _SETTYPE  *s;       /* Pointer to map in set1       */
330      int       ssize;    /* Number of words in src set   */
331      int       tail;     /* dest set is this much bigger */
332
333      ssize = src->nwords ;
334
335      if( (unsigned)dest->nwords < ssize )    /* Make sure dest set is at least */
336          enlarge( ssize, dest );             /* as big as the src set.         */
337
338      tail = dest->nwords - ssize ;
339      d    = dest->map ;
340      s    = src->map  ;
341
342      switch( op )
343      {
344      case _UNION:        while( --ssize >= 0 )
345                              *d++ |= *s++ ;
346                          break;
347      case _INTERSECT:    while( --ssize >= 0 )
348                              *d++ &= *s++ ;
349                          while( --tail >= 0 )
350                              *d++ = 0;
351                          break;
352      case _DIFFERENCE:   while( --ssize >= 0 )
353                              *d++ ^= *s++ ;
354                          break;
355      case _ASSIGN:       while( --ssize >= 0 )
356                              *d++ = *s++ ;
357                          while( --tail >= 0 )
358                              *d++ = 0;
359                          break;
360      }
361  }
362
363  /* ------------------------------------------------------------------- */
364
365  PUBLIC void invert( set )
366  SET *set;
367  {
368      /* Physically invert the bits in the set. Compare with the COMPLEMENT()
369       * macro, which just modifies the complement bit.
370       */
371
372      _SETTYPE *p, *end ;
373
374      for( p = set->map, end = p + set->nwords ; p < end ; p++ )
375          *p = ~*p;
376  }
377
378  /* ------------------------------------------------------------------- */
379
380  PUBLIC void truncate( set )
381  SET *set;
382  {
383      /* Clears the set but also sets it back to the original, default size.
384       * Compare this routine to the CLEAR() macro which clears all the bits in
385       * the map but doesn't modify the size.
386       */
387
388      if( set->map != set->defmap )
389      {
390          free( set->map );
391          set->map = set->defmap;
392      }
393      set->nwords = _DEFWORDS;
394      set->nbits  = _DEFBITS;
395      memset( set->defmap, 0, sizeof(set->defmap) );
396  }
necessary. The work is all done by the while loops in the switch on lines 342 to 360.
It's probably better style to put one while statement outside the switch than to put
several identical ones in the cases, but the latter is more efficient because the switch
won't have to be re-evaluated on every iteration of the loop. The first while loop in
every case takes care of all destination elements that correspond to source elements. The
maps are processed one word at a time, ORing the words together for union, ANDing
them for intersection, XORing them for symmetric difference, and just copying them for
assignment. Note that the source set cannot be larger than the destination set because of
the earlier enlarge() call on line 336.
The second while loop in each case won't execute if the two sets are the same size,
because the tail size will be zero. Otherwise, the destination set is larger and you must
do different things to the tail, depending on the operation. Since the missing elements of
the smaller set are all implied zeros, the following operations are performed:
union:        Do nothing else to the destination because there are no more bits in the
              source set to add.
intersection: Clear all the bits in the tail of the destination because no bits are set in the
              source, so there's no possibility of intersection.
difference:   Do nothing because XORing a destination bit with an implied zero in the
              source leaves that bit unchanged.
assignment:   Set all bits in the tail of the destination to 0 (because the implied bits in the
              source are all 0).
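The four tail rules can be checked with a small stand-alone sketch. The names apply_op() and the OP_ constants are hypothetical, and the sketch assumes that the destination has already been made at least as large as the source, as the enlarge() call does in the library version.

```c
#include <stddef.h>

enum set_operation { OP_UNION, OP_INTERSECT, OP_DIFFERENCE, OP_ASSIGN };

/* Hypothetical sketch: dest has ndest words, src has nsrc words, and
 * ndest >= nsrc is assumed. Missing source words are implied zeros.
 */
void apply_op(enum set_operation op, unsigned short *dest, size_t ndest,
              const unsigned short *src, size_t nsrc)
{
    size_t i;

    switch (op) {
    case OP_UNION:
        for (i = 0; i < nsrc; i++)  dest[i] |= src[i];
        break;                                  /* x | 0 == x: tail untouched */
    case OP_INTERSECT:
        for (i = 0; i < nsrc; i++)  dest[i] &= src[i];
        for (; i < ndest; i++)      dest[i] = 0;  /* x & 0 == 0               */
        break;
    case OP_DIFFERENCE:
        for (i = 0; i < nsrc; i++)  dest[i] ^= src[i];
        break;                                  /* x ^ 0 == x: tail untouched */
    case OP_ASSIGN:
        for (i = 0; i < nsrc; i++)  dest[i] = src[i];
        for (; i < ndest; i++)      dest[i] = 0;  /* copy the implied zeros   */
        break;
    }
}
```

As in the listing, the loops sit inside the switch so that the operation is selected once rather than on every word.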
The invert() subroutine starting on line 365 of Listing A.8 just goes through the
map, reversing the sense of the bits. The truncate() function on line 380 restores a
set to its initial, empty, condition. This last routine is really a more efficient replacement
for:
    del_set( s );
    s = newset();
You may be better off, in terms of speed, clearing the existing set with CLEAR rather
than calling truncate(), because free() is pretty slow.
The final two set routines, which access and print entire sets, are in Listing A.9. The
next_member() function on line 397 lets you access all elements of a set sequentially.
When the function is called several successive times with the same argument, it returns
the next element of the set with each call, or -1 if there are no more elements. Every
time the set argument changes, the search for elements starts back at the beginning of
the set. Similarly, next_member(NULL) resets the search to the beginning of the set
(and does nothing else). You should not put any new elements in the set between a
next_member(NULL) call and a subsequent next_member(set) call. Elements
should not be added to the set between successive next_member() calls.
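The element-at-a-time access pattern can be sketched on a bare bit map. This is not the library routine: next_bit() is a hypothetical name, and the sketch keeps the search position in a caller-supplied variable instead of the static variables that next_member() uses, so resetting the search is just a matter of setting that variable back to zero.

```c
#include <limits.h>

/* Hypothetical sketch: return the next set bit at or after *pos in a map
 * that holds nbits bits, or -1 when no set bits remain. *pos is advanced
 * past the bit that is returned.
 */
int next_bit(const unsigned char *map, int nbits, int *pos)
{
    while (*pos < nbits) {
        int bit = (*pos)++;

        if (map[bit / CHAR_BIT] & (1u << (bit % CHAR_BIT)))
            return bit;
    }
    return -1;
}
```

A caller would typically loop with something like `while ((i = next_bit(map, nbits, &pos)) >= 0)`, which mirrors the way pset() drives next_member().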
Listing A.9. set.c: Getting Elements and Printing the Set
397  PUBLIC int next_member( set )
398  SET *set;
399  {
400      /* set == NULL                   Reset
401       * set changed from last call:   Reset and return first element
402       * otherwise                     return next element or -1 if none.
403       */
404
405      static SET  *oset           = NULL; /* "set" arg in last call           */
406      static int  current_member  = 0;    /* last-accessed member of cur. set */
407      _SETTYPE    *map;
408
409      if( !set )
410          return( (int)( oset = NULL ) );
411
412      if( oset != set )
413      {
414          oset           = set;
415          current_member = 0 ;
416
417          for( map = set->map; *map == 0 && current_member < set->nbits; ++map )
418              current_member += _BITS_IN_WORD;
419      }
420
421      /* The increment must be put into the test because, if the TEST() invocation
422       * evaluates true, then an increment on the right of a for() statement
423       * would never be executed.
424       */
425
426      while( current_member++ < set->nbits )
427          if( TEST(set, current_member-1) )
428              return( current_member-1 );
429      return( -1 );
430  }
431
432  /* ------------------------------------------------------------------- */
433
434  PUBLIC void pset( set, output_routine, param )
435  SET  *set;
436  int  (*output_routine)();
437  void *param;
438  {
439      /* Print the contents of the set bit map in human-readable form. The
440       * output routine is called for each element of the set with the following
441       * arguments:
442       *
443       * (*out)( param, "null",  -1 );   Null set ("set" arg == NULL)
444       * (*out)( param, "empty", -2 );   Empty set (no elements)
445       * (*out)( param, "%d ",   N  );   N is an element of the set
446       */
447
448      int i, did_something = 0;
449
450      if( !set )
451          (*output_routine)( param, "null", -1 );
452      else
453      {
454          next_member( NULL );
455          while( (i = next_member(set)) >= 0 )
456          {
457              did_something++;
458              (*output_routine)( param, "%d ", i );
459          }
460          next_member( NULL );
461
462          if( !did_something )
463              (*output_routine)( param, "empty", -2 );
464      }
465  }
bit map that are all zeros.
The pset() function on line 434 prints all the elements in a set. The standard call:
    pset( set, fprintf, stdout );
prints the elements of the set, separated by space characters, and without a newline at the
end of the list. The second argument can actually be a pointer to any function, however.
The third argument is just passed to that function, along with a format string. The
function is called indirectly through a pointer as follows:
    (*out)( param, "null",  -1 );     For null sets (set is NULL)
    (*out)( param, "empty", -2 );     For empty sets (set has no elements)
    (*out)( param, "%d ",   N  );     Normally, N is an element of the set
Calls to pset() and next_member() should not be interspersed.
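Because the output routine only has to accept a parameter, a format string, and an integer, something other than fprintf() can be supplied. The following sketch shows one possibility; append_to_buffer() and BUF_SIZE are hypothetical names, and the sketch assumes that param points at a NUL-terminated buffer of BUF_SIZE bytes.

```c
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 64     /* hypothetical size of the caller's buffer */

/* Hypothetical output routine with the (param, format, number) shape
 * described above. It appends the formatted text to the string that
 * param points at instead of writing to a stream.
 */
int append_to_buffer(void *param, const char *fmt, int n)
{
    char   *buf  = (char *)param;
    size_t  used = strlen(buf);

    if (used >= BUF_SIZE - 1)       /* buffer is already full */
        return 0;

    return snprintf(buf + used, BUF_SIZE - used, fmt, n);
}
```

With the "null" and "empty" format strings the integer argument is simply ignored by snprintf(), just as it is when fprintf() is used as the output routine.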
A.3 Database Maintenance: Hashing
Compilers all need some sort of simple database management system. Typically the
databases are small enough to fit into memory, so an elaborate file-based system is not
only not required, but is contraindicated (because of the excessive access time). Of the
various techniques that are available for data management, the most appropriate is the
hash table.
The eight functions in this section implement a general-purpose database manager
that uses a hash strategy, but you can use these functions without any knowledge of the
mechanics of manipulating a hash table. The hash functions are used in Chapter Two,
but hashing isn't discussed until Chapter Six. You may want to read that discussion
before proceeding. You can also just skim over the function overviews that follow,
skipping over the implementation details until you've read Chapter Six.
Listing A.10 shows a very simple application that creates a database that holds the
argv entries, and then prints the entries. The details of all the function calls will be
discussed in a moment, but to summarize: A single database record is defined with the
typedef on line nine. An empty database is created on line 55 with a maketab() call,
which is passed pointers to two auxiliary functions that control database manipulation:
The hash function on line 17 translates the key field to a pseudo-random number, here
using a shift and exclusive-OR strategy. The comparison function on line 31 does a
lexicographic comparison of the key fields of two database records. The newsym() call on
line 58 allocates space for a new database record, which is put into the database with the
addsym() call on line 60.
Listing A.10. Program to Demonstrate Hash Functions
 4
 5  /* An application that demonstrates how to use the basic hash functions. Creates
 6   * a database holding the argv strings and then prints the database.
 7   */
 8
 9  typedef struct                          /* A database record */
10  {
11      char *key;
12      int  other_stuff;
13  } ENTRY;
14
15  /* ------------------------------------------------------------------- */
16
17  unsigned hash( sym )                    /* Hash function. Convert the key */
18  ENTRY *sym;                             /* to a number.                   */
19  {
20      char *p;
21      int  hash_value = 0;
22
23      for( p = sym->key; *p ; hash_value = (hash_value << 1) ^ *p++ )
24          ;
25
26      return hash_value;
27  }
28
29  /* ------------------------------------------------------------------- */
30
31  int cmp( sym1, sym2 )                   /* Compare two database records. */
32  ENTRY *sym1, *sym2;                     /* Works like strcmp().          */
33  {
34      return strcmp( sym1->key, sym2->key );
35  }
36
37  /* ------------------------------------------------------------------- */
38
39  void print( sym, stream )               /* print a database record to the stream */
40  ENTRY *sym;
41  FILE  *stream;
42  {
43      fprintf( stream, "%s\n", sym->key );
44  }
45
46  /* ------------------------------------------------------------------- */
47
48  main( argc, argv )
49  int  argc;
50  char **argv;
51  {
52      HASH_TAB *tab;
53      ENTRY    *p;
54
55      tab = maketab( 31, hash, cmp );             /* make hash table          */
56      for( ++argv, --argc; --argc >= 0; ++argv )  /* For each element of argv */
57      {
58          p = (ENTRY *) newsym( sizeof(ENTRY) );  /* put it into the table    */
59          p->key = *argv;
60          addsym( tab, p );
61      }
62      ptab( tab, print, stdout, 1 );              /* print the table, stdout is */
63                                                  /* passed through to print(). */
64  }
HASH_TAB *maketab( unsigned maxsym, unsigned (*hash)(), unsigned (*cmp)() )
    Make a hash table of the size specified in maxsym. The hash table is a data
    structure that the manager uses to organize the database entries. It contains a hash
    table array along with various other housekeeping variables. The maxsym
    argument controls the size of the array component of the data structure, but any
    number of entries can be put into the table, regardless of the table size. Different
    table sizes will affect the speed with which a record can be accessed, however. If
    the table is too small, the search time will be unnecessarily long. Ideally the table
    should be about the same size as the expected number of table entries. There is
    no benefit in making the table too large (footnote 3). It's a good idea to make
    maxsym a prime number (footnote 4). Some useful sizes are: 47, 61, 89, 113, 127,
    157, 193, 211, 257, 293, 337, 367, 401. If maxsym is zero, 127 is used.
    The functions referenced by the two pointer arguments (hash and cmp) are used
    to manipulate the database. The hash function is called indirectly, like this:
        (*hash)( sym );
    where sym is a pointer to a region of memory allocated by newsym(), which is
    used like malloc() to allocate a new table element. The assumption is that
    newsym() is getting space for a structure, one field of which is the key. The
    hash function should return a pseudo-random number, the value of which is
    controlled by the key; the same key should always generate the same number, but
    different keys should generate different numbers. The simplest, but by no means
    the best, hash strategy just adds together the characters in the name. Better
    methods are discussed in Chapter Six. Two default hash functions are discussed
    below.
    The comparison function (cmp) is passed two pointers to database records. It
    should compare the key fields and return a value representing the ordering of the
    keys in a manner analogous to strcmp(). A call to (*cmp)(p1, p2) should
    return a negative number if the key field in the structure pointed to by p1 is less
    than the one in *p2. It should return 0 if the two keys are identical, and it should
    return a positive number if the key in *p1 is greater.
    maketab() prints an error message and raises SIGABRT if there's not enough
    memory. (It works the same way as newset() in this regard.)
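The contract for the two auxiliary functions can be sketched with ANSI prototypes. The RECORD type and the names record_hash() and record_cmp() are hypothetical; the hash uses the same shift and exclusive-OR idea as the demonstration program, which is only one of many possible strategies.

```c
#include <string.h>

typedef struct {            /* hypothetical database record */
    const char *key;
    int         value;
} RECORD;

/* The same key always yields the same number, because the result depends
 * on nothing but the characters in the key.
 */
unsigned record_hash(const RECORD *r)
{
    const unsigned char *p;
    unsigned h = 0;

    for (p = (const unsigned char *)r->key; *p; p++)
        h = (h << 1) ^ *p;

    return h;
}

/* Negative if a's key is less than b's, zero if the keys are identical,
 * positive if a's key is greater, in the manner of strcmp().
 */
int record_cmp(const RECORD *a, const RECORD *b)
{
    return strcmp(a->key, b->key);
}
```

Two records with equal keys but different values hash to the same number and compare as equal, which is what lets several entries for one key coexist in the table.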
void *newsym( int size )
    This routine is used like malloc() to get space for a database record. The
    returned memory is initialized to zeros. Typically, you use newsym() to allocate
    a structure, one field of which is the key. The routine prints an error message
    and raises SIGABRT if there's not enough memory. The pointer returned from
    newsym() may not be passed to free(); use freesym(), below.
void freesym( void *sym )
    This routine frees the memory for a symbol created by a previous newsym()
    call. You may not use free() for this purpose. Do not free symbols that are still
    in the table; remove them with a delsym() call first.
3. You tend to get no fewer collisions in a too-long table. You just get holes.
4. The distribution of elements in the table tends to be better if the table size is prime.
void *addsym( HASH_TAB *tabp, void *sym )
    Add a symbol to the hash table pointed to by tabp, a pointer returned from a
    previous maketab() call. The sym argument points at a database record, a
    pointer to which was returned from a previous newsym() call. You must
    initialize the key field of that record prior to the addsym() call.
void *findsym( HASH_TAB *tabp, void *sym )
    Return either a pointer to a previously inserted database record or NULL if the
    record isn't in the database. If more than one entry for a given key is in the
    database, the most recently added one is found. The sym argument is used to identify
    the record for which you're searching. It is not used directly by findsym(), but
    is passed to the hash and comparison functions. The comparison function is
    called as follows:
        (*cmp)( sym, item_in_table );
    Here, item_in_table is a pointer to an arbitrary database element, and sym is
    just the second argument to findsym().
    Strictly speaking, sym should be a pointer to an initialized database record
    returned from newsym(). It's inconvenient, though, to allocate and initialize a
    structure just to pass it to findsym(). You can get around the problem in one
    common situation. If the key field is a character array, and that array is the first
    field in the structure, you can pass a character-string name to findsym() as the
    key. This is a hack, but is nonetheless useful. The technique is illustrated in
    Listing A.11. This technique works only if the key is a character array; the
    string must be physically present as the first few bytes of the structure. Character
    pointers won't work. Note that strcmp() is used as the comparison function.
    This works only because the array is at the top of the structure, so the structure
    pointer passed to strcmp() is also the address of the key array.
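The reason the hack works can be demonstrated without the hash package at all. In the sketch below, ENTRY_A, NAME_SIZE, and entry_cmp() are hypothetical names; the point is only that a pointer to the structure and a pointer to its first member are the same address, so a string comparison sees the key characters either way.

```c
#include <string.h>

#define NAME_SIZE 32        /* hypothetical key size */

typedef struct {
    char name[NAME_SIZE];   /* must be first, and must be an array */
    int  other_stuff;
} ENTRY_A;

/* Hypothetical comparison function. "probe" may point at a complete
 * ENTRY_A or at a bare string; "item" points at a record in the table.
 * Both begin with the characters of the key.
 */
int entry_cmp(const void *probe, const void *item)
{
    return strcmp((const char *)probe, (const char *)item);
}
```

If name were declared as a char pointer, the first bytes of the structure would hold an address rather than the characters of the key, and the comparison would be meaningless.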
void delsym( HASH_TAB *tabp, void *sym )
    Remove a symbol from the hash table. sym is a pointer returned from a previous
    findsym() call and tabp is a pointer returned from maketab(). The record is
    removed from the table, but the associated memory is not freed, so you can
    recycle it: you can reinitialize the record and reinsert it into the table at a
    later time. Use freesym() to free the memory used by sym; you may not use
    free() for this purpose. It's a serious error to delete a symbol that isn't
    actually in the table; addsym() must have been called for a node before
    delsym() can be called.
void *nextsym( HASH_TAB *tabp, void *last )
    This function finds all references to objects in a table that have the same name.
    The first such object is found by findsym(). The second object is found by
    passing the pointer returned from findsym() to nextsym(), which returns
    either a pointer to the next object or NULL if there are no such objects. Use it like
    this:
        p = findsym( tab, "krazy" );        /* Get the first one. */
        if( !(p = nextsym( tab, p )) )      /* Get the next one.  */
            /* no more symbols */
    Third and subsequent objects are found by passing nextsym() the value
    returned from the previous nextsym() call.
Listing A.11. Fooling findsym()
 1  typedef struct
 2  {
 3      char name[ SIZE ];      /* must be first, and must be an array */
 4      int  other_stuff;
 5  } ENTRY;
 6
 7  hash( key )
 8  char *key;
 9  {
10      int i = 0;              /* Add together characters in the name;      */
11      while( *key )           /* use a left shift to randomize the number. */
12          i += *key++ << 1 ;
13      return i;
14  }
15
16  ignaz()
17  {
18      HASH_TAB *tab;
19      extern hash_add();      /* hash function */
20      extern strcmp();
21      ENTRY *p;
22
23      tab = maketab( 61, hash, strcmp );
24
25      p = findsym( tab, "krazy" );
26  }
int ptab( HASH_TAB *tabp, void (*print)(), void *param, int sort )
    Print all records in the database represented by the hash table pointed to by tabp.
    The function pointed to by print is called for every element of the table as
    follows:
        (*print)( sym, param )
    If sort is false, the table elements are printed in random order and 1 is always
    returned. If sort is true, the table is printed only if the routine can get memory
    to sort the table. Zero is returned (and nothing is printed) if memory isn't
    available; otherwise 1 is returned and the table is printed in an order controlled by
    the comparison function passed to maketab(). If this comparison function works as
    described earlier, the table is printed in ascending order. Reverse the sense of the
    return value to print in descending order. In the current example, you can change
    the comparison function as follows, reversing the arguments to strcmp():
        cmp( a, b )             /* Print in descending order */
        ENTRY *a, *b;
        {
            return strcmp( b->key, a->key );
        }
unsigned hash_add( char *name );
unsigned hash_pjw( char *name );
    These two functions are hash functions that you can pass to maketab(). They
    are passed character strings and return a pseudo-random integer generated from
    that string. hash_add() just adds the characters in the name; it's fast but
    doesn't work well if the table size is larger than 128 or if keys are likely to be
    permutations of each other. The hash_pjw() function uses a shift-and-exclusive-OR
    algorithm that yields better results at the cost of execution speed. As with
    findsym(), if the table entries have a character-array key at the top of the
    structure, these functions can be used directly by maketab():
        typedef struct
        {
            char key[ 80 ];
            int  stuff;
        }
        ENTRY;

        maketab( ..., hash_pjw, ... );
    Otherwise, you'll have to encapsulate the hash function inside a second function
    like this:
        typedef struct
        {
            int  stuff;
            char *key;
        }
        ENTRY;

        hash_funct( sym )
        ENTRY *sym;
        {
            return hash_add( sym->key );
        }

        maketab( ..., hash_funct, ... );
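The weakness of an additive hash with permuted keys is easy to see in a sketch. The functions below are stand-ins with hypothetical names, not the library's own code: add_hash() sums the characters, and pjw_hash() is a commonly published 32-bit formulation of P. J. Weinberger's shift-and-exclusive-OR hash, which may differ in detail from the version in the package.

```c
#include <stdint.h>

/* Hypothetical additive hash: the sum of the characters in the name. */
unsigned add_hash(const char *name)
{
    unsigned h = 0;

    while (*name)
        h += (unsigned char)*name++;

    return h;
}

/* A common 32-bit formulation of the PJW hash (assumed here for
 * illustration): shift in four bits per character and fold the top
 * four bits back into the low-order part of the value.
 */
unsigned pjw_hash(const char *name)
{
    uint32_t h = 0, high;

    while (*name) {
        h = (h << 4) + (unsigned char)*name++;
        if ((high = h & 0xF0000000u) != 0) {
            h ^= high >> 24;
            h ^= high;          /* clear the top four bits */
        }
    }
    return (unsigned)h;
}
```

"abc" and "cba" contain the same characters, so the additive hash gives both the same value; the shifting in the second function makes the result depend on the order of the characters as well.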
A.3.1 Hashing: Implementation
I've used data abstraction extensively in the package presented here. That is, the
mechanics of manipulating the database are hidden from the user of the routines. For
example, the user calls one function to allocate a database block of an arbitrary size in a
manner similar to malloc(). The block is then modified as necessary, and it's inserted
into the database with a second function call. The mechanics of allocating space, of
doing the insertion, and so forth, are hidden in the functions. Similarly, the internal data
structures used for database maintenance are also hidden. This abstraction makes these
routines quite flexible. Not only can you use them in disparate applications, but you can
change the way that the database is maintained. As long as the subroutine-call interface
is the same, you don't have to change anything in the application program.
The hash functions work with two data structures. I'll look at the C definitions and
allocation functions first, and then describe how they work.
The first structure is the BUCKET, declared in hash.h (the start of which is in Listing
A.12). The newsym() function (in Listing A.13) simultaneously allocates enough memory
for both the BUCKET header and a user space, the size of which is passed in as a
parameter. It returns a pointer to the area just below the header, which can be used in
any way by the application program:
    pointer returned from malloc() ---->  +--------+
                                          |  next  |
                                          |  prev  |
    pointer returned from newsym() ---->  +--------+
                                          |  user  |
                                          |  area  |
                                          +--------+
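The header-plus-user-area layout can be sketched in a few lines of ANSI C. This is only an illustration of the idea, not the code from hash.c: the names HDR, new_block(), header_of(), and free_block() are hypothetical stand-ins for BUCKET, newsym(), and freesym().

```c
#include <stdlib.h>

/* Hypothetical header; it stands in for the BUCKET of Listing A.12. */
typedef struct hdr
{
    struct hdr  *next;
    struct hdr **prev;
} HDR;

/* Allocate a zeroed header plus `size` bytes of user space in one chunk.
 * Return a pointer to the user space, or NULL if there is no memory.
 */
void *new_block( size_t size )
{
    HDR *h = (HDR *) calloc( 1, sizeof(HDR) + size );
    return h ? (void *)( h + 1 ) : NULL;
}

/* Back up from a user-area pointer to the header in front of it. */
HDR *header_of( void *user )
{
    return (HDR *)user - 1;
}

/* Release a block obtained from new_block(). */
void free_block( void *user )
{
    if( user )
        free( header_of(user) );
}
```

Note that the user area starts sizeof(HDR) bytes into the allocation, so it is aligned for pointers but not necessarily for every type; a production version would probably pad the header.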
Listing A.12. hash.h The BUCKET
typedef struct BUCKET
{
    struct BUCKET  *next;
    struct BUCKET **prev;

} BUCKET;
Listing A.13. hash.c BUCKET Allocation
  9  PUBLIC void *newsym( size )
 10  int size;
 11  {
 12      /* Allocate space for a new symbol; return a pointer to the user space. */
 13
 14      BUCKET *sym;
 15
 16      if( !(sym = (BUCKET *) calloc( size + sizeof(BUCKET), 1 )) )
 17      {
 18          fprintf( stderr, "Can't get memory for BUCKET\n" );
 19          raise( SIGABRT );
 20          return NULL;
 21      }
 22      return (void *)( sym + 1 );     /* return pointer to user space */
 23  }
 24
 25  /*----------------------------------------------------------------------*/
 26
 27  PUBLIC void freesym( sym )
 28  void *sym;
 29  {
 30      free( (BUCKET *)sym - 1 );
 31  }
The freesym() function, also in Listing A.13, frees the memory allocated by a previous
newsym(). It just backs up the sym pointer to its original position and then calls
free().
The second data structure of interest is the HASH_TAB, which holds the actual hash
table itself. (It's declared in Listing A.14.) Like a BUCKET, it's a structure of variable
length. The header contains the table size in elements (size), the number of entries
currently in the table (numsyms), pointers to the hash and comparison functions (hash
and cmp), and the table itself (table is the first element of the table). The numsyms
field is used only for statistical purposes; it's not needed by any of the hash-table
functions.
The maketab() function (in Listing A.15) allocates a single chunk of memory big
enough to hold both the header and an additional area that will be used as the array. The
table is declared as a one-element array, but the array can actually be any size,
provided that there's enough available memory following the header. I'm taking advantage
of the fact that C doesn't do array-boundary checking when the array is accessed.
Listing A.14. hash.h HASH_TAB Definition
typedef struct hash_tab_
{
    int       size;         /* Max number of elements in table         */
    int       numsyms;      /* number of elements currently in table   */
    unsigned  (*hash)();    /* hash function                           */
    int       (*cmp)();     /* comparison funct, cmp(name, bucket_p);  */
    BUCKET   *table[1];     /* First element of actual hash table      */

} HASH_TAB;
Listing A.15. hash.c HASH_TAB Allocation
 32  PUBLIC HASH_TAB *maketab( maxsym, hash_function, cmp_function )
 33  unsigned maxsym;
 34  unsigned (*hash_function)();
 35  int      (*cmp_function)();
 36  {
 37      /* Make a hash table of the indicated size. */
 38
 39      HASH_TAB *p;
 40
 41      if( !maxsym )
 42          maxsym = 127;
 43                             /* |<--- space for table --->|<- and header ->| */
 44      if( p = (HASH_TAB *) calloc( 1, (maxsym * sizeof(BUCKET*)) + sizeof(HASH_TAB) ) )
 45      {
 46          p->size    = maxsym;
 47          p->numsyms = 0;
 48          p->hash    = hash_function;
 49          p->cmp     = cmp_function;
 50      }
 51      else
 52      {
 53          fprintf( stderr, "Insufficient memory for symbol table\n" );
 54          raise( SIGABRT );
 55          return NULL;
 56      }
 57      return p;
 58  }
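The one-element-array trick relies on the compiler not checking array bounds. As a point of comparison, here is a minimal sketch of the same single-allocation layout using a C99 flexible array member, which is the form the later standard sanctions. It is a swapped-in technique rather than the book's code, and TAB and make_tab() are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical table header; the slots live directly after it. */
typedef struct tab
{
    unsigned  size;       /* number of slots that follow the header   */
    void     *table[];    /* C99 flexible array member (no size)      */
} TAB;

/* Allocate header and slots in one zeroed chunk, so every slot starts
 * out as NULL. A size of zero selects a default, as maketab() does.
 */
TAB *make_tab( unsigned nslots )
{
    TAB *t;

    if( nslots == 0 )
        nslots = 127;

    t = (TAB *) calloc( 1, sizeof(TAB) + nslots * sizeof(void *) );
    if( t )
        t->size = nslots;
    return t;
}
```

Because sizeof(TAB) does not include any slots here, the allocation asks for exactly nslots of them; with the one-element form, the header already accounts for the first slot.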
These two structures interact as shown in Figure A.1. The actual table is an array of
BUCKET pointers. Empty table elements are set to NULL, and new BUCKETs are tacked
onto the head of the list.
Figure A.1. A Hash Table with Two Elements in It
The addsym() function, which puts items into the table, is in Listing A.16. A
pointer to the current hash-table element, p, is initialized on line 68 by calling the hash
function indirectly through the hash pointer in the HASH_TAB header. Note that the
BUCKET pointer (sym) comes into the routine pointing at the user area. It is decremented
to point at the BUCKET header on line 68, after the hash function is called. The code on
lines 70 to 73 links the new node to the head of a chain found at the previously computed
array element.
Listing A.16. hash.c Adding a Symbol to the Table
 59  PUBLIC void *addsym( tabp, isym )
 60  HASH_TAB *tabp;
 61  void     *isym;
 62  {
 63      /* Add a symbol to the hash table. */
 64
 65      BUCKET **p, *tmp;
 66      BUCKET *sym = (BUCKET *)isym;
 67
 68      p = &(tabp->table)[ (*tabp->hash)( sym-- ) % tabp->size ];
 69
 70      tmp       = *p;
 71      *p        = sym;
 72      sym->prev = p;
 73      sym->next = tmp;
 74
 75      if( tmp )
 76          tmp->prev = &sym->next;
 77
 78      tabp->numsyms++;
 79      return (void *)( sym + 1 );
 80  }
Note that the chain of BUCKETs is a doubly linked list. You need the backwards
pointers to delete an arbitrary element in the table without having to search for that
element. The only obscure point is the two stars in the definition of the backwards
pointer: the forward pointer (next) is a pointer to a BUCKET, but the backwards pointer
(prev) is a pointer to a BUCKET pointer. You need this extra level of indirection
because the head of the chain is a simple pointer, not an entire BUCKET structure. The
backwards pointer for the leftmost node in the chain points at the head-of-chain pointer.
All other backwards pointers hold the address of the next field from the previous node.
You can see the utility of this system by looking at the code necessary to delete an
arbitrary node. Say that you want to delete the node pointed to by p in the following
picture:
[Figure: a chain of BUCKETs, with p pointing at the node to be deleted.]
The removal can be accomplished with the following statement:

    if( *(p->prev) = p->next )
        p->next->prev = p->prev;

The pointer from the previous node (the one that points at the node to delete) is modified
first so that it points around the deleted node. Then the backwards pointer from the next
node is adjusted. The if is required because the next pointer is NULL on the last node
in the chain. The double indirection on the backwards pointer makes this code work
regardless of the position of the node in the chain; the first and last nodes are not
special cases. The delsym() function, which removes an arbitrary node from the table, is
shown in Listing A.17.
Listing A.17. hash.c Removing a Node from the Table
 81  PUBLIC void delsym( tabp, isym )
 82  HASH_TAB *tabp;
 83  void     *isym;
 84  {
 85      /* Remove a symbol from the hash table. "sym" is a pointer returned from
 86       * a previous findsym() call. It points initially at the user space, but
 87       * is decremented to get at the BUCKET header.
 88       */
 89
 90      BUCKET *sym = (BUCKET *)isym;
 91
 92      if( tabp && sym )
 93      {
 94          --tabp->numsyms;
 95          --sym;
 96
 97          if( *(sym->prev) = sym->next )
 98              sym->next->prev = sym->prev;
 99      }
100  }
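To see why the pointer-to-pointer back link removes the special cases, it can help to try the idea outside the hash table. The following is a minimal sketch under that assumption: NODE, push_front(), and unlink_node() are hypothetical names, and the chain head is an ordinary pointer variable rather than a table slot.

```c
#include <stddef.h>

typedef struct node
{
    struct node  *next;
    struct node **prev;    /* address of whatever pointer points at us */
    int           value;
} NODE;

/* Link n in at the head of the chain, as addsym() does for a BUCKET. */
void push_front( NODE **head, NODE *n )
{
    n->next = *head;
    n->prev = head;
    if( n->next )
        n->next->prev = &n->next;
    *head = n;
}

/* Remove n from whatever chain it is on. No head pointer is needed and
 * the first, middle, and last nodes are all handled by the same code.
 */
void unlink_node( NODE *n )
{
    *(n->prev) = n->next;            /* previous pointer skips over n  */
    if( n->next )
        n->next->prev = n->prev;     /* next node points back past n   */
}
```

Deleting the first node works because its prev field holds the address of the head pointer itself, so the assignment through prev updates the head directly.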
The two symbol-finding functions, findsym() and nextsym(), are in Listing
A.18. findsym() just hashes to the correct place in the table and then chases down the
chain, looking for the required node. It returns a pointer to the user area of the BUCKET
(thus the +1 on line 119), or NULL if it can't find the required node. nextsym() just
continues chasing down the same chain, starting where the last search left off.
Listing A.18. hash.c Finding a Symbol
101  PUBLIC void *findsym( tabp, sym )
102  HASH_TAB *tabp;
103  void     *sym;
104  {
105      /* Return a pointer to the hash table element having a particular name
106       * or NULL if the name isn't in the table.
107       */
108
109      BUCKET *p;
110
111      if( !tabp )                    /* Table empty */
112          return NULL;
113
114      p = (tabp->table)[ (*tabp->hash)(sym) % tabp->size ];
115
116      while( p && (*tabp->cmp)( sym, p+1 ) )
117          p = p->next;
118
119      return (void *)( p ? p + 1 : NULL );
120  }
121
122  /*----------------------------------------------------------------------*/
123
124  PUBLIC void *nextsym( tabp, i_last )
125  HASH_TAB *tabp;
126  void     *i_last;
127  {
128      /* Return a pointer to the next node in the current chain that has the same
129       * key as the last node found (or NULL if there is no such node). "last"
130       * is a pointer returned from a previous findsym() or nextsym() call.
131       */
132
133      BUCKET *last = (BUCKET *)i_last;
134
135      for( --last; last->next ; last = last->next )
136          if( (tabp->cmp)( last+1, last->next+1 ) == 0 )    /* keys match */
137              return (char *)( last->next + 1 );
138      return NULL;
139  }
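The hash-then-chase pattern used by findsym() and nextsym() can be sketched on its own. The version below is deliberately simplified and is not the book's interface: it assumes a fixed-size global table, a key stored at a known field, and the hypothetical names ENTRY, insert(), find(), and find_next().

```c
#include <stddef.h>
#include <string.h>

#define NSLOTS 31                      /* arbitrary table size for the sketch */

typedef struct entry
{
    struct entry *next;
    char          key[16];
    int           value;
} ENTRY;

static ENTRY *slots[ NSLOTS ];         /* chain heads; all NULL to start */

/* Additive hash, the same idea as hash_add(). */
static unsigned add_hash( const char *s )
{
    unsigned h = 0;
    while( *s )
        h += (unsigned char) *s++;
    return h;
}

/* New entries go on the head of the chain, as in addsym(). */
void insert( ENTRY *e )
{
    ENTRY **p = &slots[ add_hash(e->key) % NSLOTS ];
    e->next = *p;
    *p = e;
}

/* Hash to the right chain, then walk it until the key matches. */
ENTRY *find( const char *key )
{
    ENTRY *p = slots[ add_hash(key) % NSLOTS ];
    while( p && strcmp(key, p->key) != 0 )
        p = p->next;
    return p;
}

/* Continue down the same chain, looking for another entry with last's key. */
ENTRY *find_next( ENTRY *last )
{
    ENTRY *p;
    for( p = last->next; p; p = p->next )
        if( strcmp(last->key, p->key) == 0 )
            return p;
    return NULL;
}
```

Because insertion is at the head of the chain, find() returns the most recently added entry for a duplicated key, and find_next() then visits the older ones.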
The last of the support functions is ptab(), which prints the table. It starts on line
142 of Listing A.19. The loop on lines 166 to 173 prints the table in the most
straightforward manner. The outer loop goes through the table from top to bottom looking for
chains. The inner loop traverses the chain, calling the print function indirectly at each
node.
The else clause on lines 177 to 214 handles sorted arrays. It allocates an array of
BUCKET pointers on line 184 and initializes the array to point at every BUCKET in the
table with the loop starting on line 189. (This is essentially the same loop that printed
the unsorted array.) The new array is sorted on line 208 using assort(), a variant of
Listing A.19. hash.c Printing the Table
140  PRIVATE int (*User_cmp)();
141
142  PUBLIC int ptab( tabp, print, param, sort )
143  HASH_TAB *tabp;             /* Pointer to the table                  */
144  void     (*print)();        /* Print function used for output        */
145  void     *param;            /* Parameter passed to print function    */
146  int      sort;              /* Sort the table if true.               */
147  {
148      /* Return 0 if a sorted table can't be printed because of insufficient
149       * memory, else return 1 if the table was printed. The print function
150       * is called with two arguments:
151       *          (*print)( sym, param )
152       *
153       * Sym is a pointer to a BUCKET user area and param is the third
154       * argument to ptab.
155       */
156
157      BUCKET **outtab, **outp, *sym, **symtab;
158      int    internal_cmp();
159      int    i;
160
161      if( !tabp || tabp->size == 0 )         /* Table is empty */
162          return 1;
163
164      if( !sort )
165      {
166          for( symtab = tabp->table, i = tabp->size ; --i >= 0 ; symtab++ )
167          {
168              /* Print all symbols in the current chain. The +1 in the print call
169               * increments the pointer to the applications area of the bucket.
170               */
171              for( sym = *symtab ; sym ; sym = sym->next )
172                  (*print)( sym+1, param );
173          }
174      }
175      else
176      {
177          /* Allocate memory for the outtab, an array of pointers to
178           * BUCKETs, and initialize it. The outtab is different from
179           * the actual hash table in that every outtab element points
180           * to a single BUCKET structure, rather than to a linked list
181           * of them.
182           */
183
184          if( !( outtab = (BUCKET **) malloc( tabp->numsyms * sizeof(BUCKET*) ) ))
185              return 0;
186
187          outp = outtab;
188
189          for( symtab = tabp->table, i = tabp->size ; --i >= 0 ; symtab++ )
190              for( sym = *symtab ; sym ; sym = sym->next )
191              {
192                  if( outp >= outtab + tabp->numsyms )
193                  {
194                      fprintf( stderr, "Internal error [ptab], table overflow\n" );
195                      exit( 1 );
196                  }
197
198                  *outp++ = sym;
199              }
200
201          /* Sort the outtab and then print it. The (*outp)+1 in the
202           * print call increments the pointer past the header part
203           * of the BUCKET structure. During sorting, the increment
204           * is done in internal_cmp.
205           */
206
207          User_cmp = tabp->cmp;
208          assort( outtab, tabp->numsyms, sizeof(BUCKET*), internal_cmp );
209
210          for( outp = outtab, i = tabp->numsyms; --i >= 0 ; outp++ )
211              (*print)( (*outp)+1, param );
212
213          free( outtab );
214      }
215      return 1;
216  }
217
218  PRIVATE int internal_cmp( p1, p2 )
219  BUCKET **p1, **p2;
220  {
221      return (*User_cmp)( *p1 + 1, *p2 + 1 );
222  }
the UNIX qsort() function that uses a Shell sort rather than a quicksort. (It's discussed
later on in this appendix.) The sorted array is printed by the loop on line 210.
The sorting is complicated because, though the comparison function must be passed
two pointers to the user areas of buckets, assort() passes the sort function two
pointers to array elements. That is, assort() passes two pointers to BUCKET pointers
to the comparison function. The problem is solved by putting a layer around the
user-supplied comparison function. assort() calls internal_cmp(), declared on line
218, which strips off one level of indirection and adjusts the BUCKET pointer to point at
the user space before calling the user-supplied function. The user's comparison function
must be passed by a global variable (User_cmp, declared on line 140).
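The same wrapping technique works with the standard qsort(), which also hands the comparison function pointers to array elements. The sketch below uses qsort() in place of assort() and invents the names HDR, REC, sort_headers(), and name_cmp() for illustration; the user area is assumed to be a name string that sits directly after the header.

```c
#include <stdlib.h>
#include <string.h>

typedef struct hdr { struct hdr *next; struct hdr **prev; } HDR;

/* A header followed directly by its user area (here, a short name). */
typedef struct { HDR h; char name[8]; } REC;

/* The user's comparison function travels through a global, like User_cmp. */
static int (*user_cmp)( const void *, const void * );

/* qsort() passes pointers to array elements, that is, pointers to HDR
 * pointers. Strip one level of indirection, step over the header, and
 * hand the two user areas to the user's function.
 */
static int internal_cmp( const void *p1, const void *p2 )
{
    HDR *const *b1 = (HDR *const *) p1;
    HDR *const *b2 = (HDR *const *) p2;
    return (*user_cmp)( *b1 + 1, *b2 + 1 );
}

/* Example user comparison: the user area begins with a name. */
static int name_cmp( const void *a, const void *b )
{
    return strcmp( (const char *)a, (const char *)b );
}

void sort_headers( HDR **tab, size_t n,
                   int (*cmp)(const void *, const void *) )
{
    user_cmp = cmp;
    qsort( tab, n, sizeof(HDR *), internal_cmp );
}
```

The global makes the routine non-reentrant, which is the price of keeping the user's comparison function unaware of the header.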
Finishing up with the hash-table functions themselves, the remainder of hash.h contains
various function prototypes; it's shown in Listing A.20. Similarly, there are a bunch of
#includes at the top of hash.c, shown in Listing A.21.
Listing A.20. hash.h HASH_TAB Function Prototypes
extern HASH_TAB *maketab P(( unsigned maxsym, unsigned (*hash)(), int (*cmp)() ));
extern void     *newsym  P(( int size ));
extern void     freesym  P(( void *sym ));
extern void     *addsym  P(( HASH_TAB *tabp, void *sym ));
extern void     *findsym P(( HASH_TAB *tabp, void *sym ));
extern void     *nextsym P(( HASH_TAB *tabp, void *last ));
extern void     delsym   P(( HASH_TAB *tabp, void *sym ));
extern int      ptab     P(( HASH_TAB *tabp, void (*prnt)(), void *par, int srt ));
unsigned        hash_add P(( unsigned char *name ));    /* in hashadd.c */
unsigned        hash_pjw P(( unsigned char *name ));    /* in hashpjw.c */
Listing A.21. hash.c #includes
  2  #include <ctype.h>
  3  #include <signal.h>
  4  #include <stdlib.h>
  5  #include <string.h>
  6  #include <tools/debug.h>
  7  #include <tools/hash.h>
A.3.2 Two Hash Functions
Implementations of the two default hash functions discussed in Chapter Six are
shown in Listings A.22 and A.23.
Listing A.22. hashadd.c An Addition-Based Hash Function
/*----------------------------------------------------------------------
 * Hash function for use with the functions in hash.c. Just adds together
 * characters in the name.
 */

unsigned hash_add( name )
unsigned char *name;
{
    unsigned h;

    for( h = 0; *name ; h += *name++ )
        ;
    return h;
}
I've modified Aho's version of hash_pjw() considerably in order to make it portable;
the original version assumed that the target machine had a 32-bit unsigned int. All
the macros on lines four to seven of Listing A.23 are for this purpose.
NBITS_IN_UNSIGNED evaluates to the number of bits in an unsigned int, using the
NBITS macro from <tools/debug.h>. You could also multiply sizeof(unsigned) by eight,
but that's risky because a char might not be eight bits wide; multiplying by the ANSI
CHAR_BIT macro (defined in limits.h) avoids that assumption.
SEVENTY_FIVE_PERCENT evaluates to the number of bits required to isolate the bottom
3/4 of the number. Given a 16-bit int, it will evaluate to 12. TWELVE_PERCENT works
the same way, but it gets 1/8 of the number: 2, given a 16-bit int. HIGH_BITS is a mask
that isolates the bits in the top 1/8 of the number; it's 0xc000 for a 16-bit int.
Listing A.23. hashpjw.c The hashpjw Function
  3
  4  #define NBITS_IN_UNSIGNED      ( NBITS(unsigned int) )
  5  #define SEVENTY_FIVE_PERCENT   ( (int)( NBITS_IN_UNSIGNED * .75  ) )
  6  #define TWELVE_PERCENT         ( (int)( NBITS_IN_UNSIGNED * .125 ) )
  7  #define HIGH_BITS              ( ~( (unsigned)(~0) >> TWELVE_PERCENT ) )
  8
  9  unsigned hash_pjw( name )
 10  unsigned char *name;
 11  {
 12      unsigned h = 0;                 /* Hash value */
 13      unsigned g;
 14
 15      for(; *name ; ++name )
 16      {
 17          h = (h << TWELVE_PERCENT) + *name ;
 18          if( g = h & HIGH_BITS )
 19              h = (h ^ (g >> SEVENTY_FIVE_PERCENT)) & ~HIGH_BITS ;
 20      }
 21      return h;
 22  }
Note that most of the computation in the macros is folded out of existence by the
optimizer (because everything's a constant, so the arithmetic can be performed at
compile time). For example:

    h ^= g >> (int)( (sizeof(unsigned) * CHAR_BIT) * .75 );

generates the following code with Microsoft C, version 5.0:

    mov   cl, 12
    shr   ax, cl
    xor   dx, ax

Similarly:

    g = h & ~( (unsigned)(~0) >> ((int)(NBITS_IN_UNSIGNED * .125)) );

becomes:

    mov   dx, ax
    and   ax, -16384

If you modify these constants, don't convert the floating-point numbers to fractions,
or else they'll end up as zero. For example, (NBITS_IN_UNSIGNED * (1/8)) won't
work because the 1/8 will evaluate to 0 (integer arithmetic is used).
The algorithm uses a shift-and-XOR strategy to randomize the input key. The main
iteration of the loop shifts the accumulated hash value to the left by a few bits and adds
in the current character. When the number gets too large, it is randomized by XORing it
with a shifted version of itself.
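For readers without <tools/debug.h>, the same algorithm can be sketched using only <limits.h>. This is an assumption-laden variant, not the listing above: the shift counts are computed with integer arithmetic (multiply before dividing, so nothing truncates to zero) instead of floating point, and pjw_sketch() is a hypothetical name.

```c
#include <limits.h>

#define NBITS_IN_UNSIGNED     ( (int)( sizeof(unsigned) * CHAR_BIT ) )
#define SEVENTY_FIVE_PERCENT  ( NBITS_IN_UNSIGNED * 3 / 4 )
#define TWELVE_PERCENT        ( NBITS_IN_UNSIGNED / 8 )
#define HIGH_BITS             ( ~( ~0u >> TWELVE_PERCENT ) )

/* Shift-and-XOR hash in the style of hash_pjw(). */
unsigned pjw_sketch( const unsigned char *name )
{
    unsigned h = 0;
    unsigned g;

    for( ; *name ; ++name )
    {
        h = (h << TWELVE_PERCENT) + *name;      /* shift, add character  */
        if( (g = h & HIGH_BITS) != 0 )          /* top bits in use?      */
            h = (h ^ (g >> SEVENTY_FIVE_PERCENT)) & ~HIGH_BITS;
    }
    return h;
}
```

As with the original, the caller is expected to reduce the result modulo the table size; the function itself only guarantees that the top eighth of the bits are clear on return.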
A.4 The ANSI Variable-Argument Mechanism
This book uses the ANSI-approved method of supporting subroutines with a variable
number of arguments, which uses various macros declared in <stdarg.h>. Since not all
C environments support <stdarg.h>, this section presents a version of the macros in
this file. Before looking at <stdarg.h> itself, however, the following subroutine
demonstrates how the macros are used. The print_int() subroutine is passed an argument
count followed by an arbitrary number of int-sized arguments, and it prints those
arguments (see footnote 5).

    print_int( arg_count, ... )
    int arg_count;
    {
        va_list args;
        va_start( args, arg_count );
        while( --arg_count >= 0 )
            printf( "%d ", va_arg(args, int) );
        va_end( args );
    }

The args variable is a pointer to the argument list. It is initialized to point at the second
argument in the list with the va_start() call, which is passed the argument pointer
(args) and the name of the first argument (arg_count). This first argument can be
accessed directly in the normal way; the va_arg() macro is used to get the others.
This macro is passed the argument pointer and the expected type of the argument. It
evaluates to the current argument, and advances the argument pointer to the next one. In
this case, the arguments are all ints, but they would be different types in an application
like printf(). The va_end() macro tells the system that there are no more arguments
to get (that the argument pointer has advanced to the end of the list).
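With a compiler that supplies its own <stdarg.h>, the same pattern can be written with an ANSI prototype. The following sketch is not from the book's library; sum_ints() is a hypothetical function that adds its arguments instead of printing them, which makes the result easy to check.

```c
#include <stdarg.h>

/* Add up arg_count int-sized arguments that follow the count. */
int sum_ints( int arg_count, ... )
{
    va_list args;
    int     total = 0;

    va_start( args, arg_count );        /* point at the first unnamed arg  */
    while( --arg_count >= 0 )
        total += va_arg( args, int );   /* fetch one int and advance       */
    va_end( args );
    return total;
}
```

A call such as sum_ints(3, 10, 20, 12) passes the count first and the values after it. Nothing checks that the count matches the number of arguments actually passed, so the caller has to get it right.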
The foregoing is implemented in <stdarg.h>, shown in Listing A.24. The
argument-pointer type, va_list, is a pointer to char. It's declared as such because
pointer arithmetic on a character pointer is just plain arithmetic. When you add 1 to a
character pointer, you actually add 1 to the physical pointer (because the size of a
character is 1). The best way to understand the other two macros is to watch them work. A
typical C compiler passes arguments to subroutines by pushing them onto the run-time
stack in reverse order. A call like

    int  me;
    long Ishmael;
    call( me, Ishmael );

looks like this on the stack (assuming a 16-bit int and a 32-bit long):

              other stuff
              return address
    100:      me
    102:      Ishmael
    104:      (Ishmael, continued)
    106:

The va_start(arg_ptr, me); macro initializes arg_ptr to point at the second
argument as follows: &first evaluates to the address of me (that is, to 100), and
100+sizeof(first) yields 102, the base address of the next argument on the stack.
5. Note that the ellipsis in the print_int() argument list is not supported by UNIX C. The VA_LIST macro
in <debug.h>, discussed on page 684, can be used to correct this deficiency. VA_LIST is used in various
subroutines in subsequent sections.
Listing A.24. stdarg.h Support for Variable-Argument Lists
The cast is required in front of &first in order to defeat pointer arithmetic; otherwise
you'd get 104, because &first evaluates to an int pointer by default. A subsequent
va_arg(arg_ptr, long) call will fetch Ishmael and advance the pointer. It
expands to:

    ((long *)(arg_ptr += sizeof(long)))[-1]

The arg_ptr += sizeof(long) yields 106 in this example. That is, it yields a
pointer to just past the variable that you want. The (long *)(106) is treated as if it
were a pointer to an array of longs that had been initialized to point at cell 106, and the
[-1] gets the long-sized number in front of the pointer. You could think of the whole
thing like this:

    long *p;
    p        = (long *)arg_ptr;      /* pointer to current argument */
    ++p;                             /* skip past it                */
    target   = p[-1];                /* fetch it by backing up.     */
    arg_ptr += sizeof(long);         /* skip to next argument       */

Note that some non-ANSI compilers can do the foregoing with

    *((long *)arg_ptr)++

but this statement isn't acceptable to many compilers (because the cast forms an rvalue,
and ++ must be applied to an lvalue).
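The advance-then-back-up idiom can be tried safely on an ordinary array instead of the real stack. In the sketch below, fetch_long() is a hypothetical helper; the byte pointer plays the role of the book's va_list, and the array of longs stands in for the pushed arguments.

```c
#include <stddef.h>

/* Fetch the long that *pp points at and leave *pp pointing just past it.
 * The pointer is advanced first; the [-1] then reaches back for the value.
 */
long fetch_long( char **pp )
{
    return ( (long *)( *pp += sizeof(long) ) )[ -1 ];
}
```

Starting the byte pointer at the beginning of an array of longs and calling fetch_long() repeatedly returns the elements in order. The pointer stays correctly aligned only because it starts on a long boundary and every step is sizeof(long) bytes; mixing argument sizes, as a real stack does, depends on the machine's alignment rules.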
A.5 Conversion Functions
This section contains descriptions of several small data-conversion subroutines:

long          stol ( char **str )
unsigned long stoul( char **str )

    These routines are somewhat more powerful versions of the standard atoi().
    stol() is passed the address of a character pointer (note the double indirection).
    It returns, in a long, the value of the number represented by the string, and
    updates the pointer to point past the number. If the string begins with a 0x, the
    number is assumed to be hex; otherwise, if it begins with a 0, the number is
    assumed to be octal; otherwise it is decimal. Conversion stops on encountering
    the first character that is not a digit in the indicated radix. Leading white space is
    ignored, and a leading minus sign is recognized. stoul() works much the same,
    except that it returns an unsigned long, and doesn't recognize a leading minus
    sign.

int esc( char **s )

    Returns the character associated with the escape sequence pointed to by *s
    and modifies *s to point past the sequence. The recognized strings are
    summarized in Table A.1.
Table A.1. Escape Sequences Recognized by esc()

Input     Returned
string    value       Notes
------    --------    -----
\b        0x08        Backspace.
\f        0x0c        Formfeed.
\n        0x0a        Newline.
\r        0x0d        Carriage return.
\s        0x20        Space.
\t        0x09        Horizontal tab.
\e        0x1b        ASCII ESC character.
\DDD      0x??        Number formed of 1-3 octal digits.
\xDD      0x??        Number formed of 1-3 hex digits.
\^C       0x??        C is any letter. The equivalent control code is returned.
\c        c           A backslash followed by anything else returns the character
                      following the backslash (and *s is advanced two characters).
c         c           Characters not preceded by a backslash are just returned
                      (and *s is advanced 1 character).

char *bin_to_ascii( int c, int use_hex )

    Return a pointer to a string that represents c in human-readable form. This string
    contains only the character itself for normal characters; it holds an escape
    sequence (\n, \t, \x00, and so forth) for others. A single quote (') is returned as
    the two-character string "\'". The returned string is destroyed the next time
    bin_to_ascii() is called, so don't call it twice in a single printf() statement.
    If use_hex is true then escape sequences of the form \xDD are used for
    nonstandard control characters (D is a hex digit); otherwise, sequences of the
    form \DDD are used (D is an octal digit).

stoul() and stol() are implemented in Listing A.25. The esc() subroutine is
in Listing A.26, and bin_to_ascii() is in Listing A.27. All are commented
sufficiently that no additional description is needed here.
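On a system with a full ANSI library, much of what stol() does is also available from the standard strtol() when it is called with a base of 0. The wrapper below is a sketch of that correspondence, not a replacement for Listing A.25; stol_sketch() is a hypothetical name, and strtol() differs in small ways (it also accepts a leading plus sign, for example).

```c
#include <stdlib.h>

/* Convert the number at *instr and advance *instr past it. With a base
 * of 0, strtol() treats a 0x prefix as hex, a leading 0 as octal, and
 * anything else as decimal; it also skips leading white space and
 * recognizes a leading minus sign.
 */
long stol_sketch( char **instr )
{
    char *end;
    long  value = strtol( *instr, &end, 0 );

    *instr = end;
    return value;
}
```

The double indirection serves the same purpose as in stol(): the caller's pointer ends up on the first character that could not be part of the number, so a scanner can carry on from there.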
Listing A.25. stol.c Convert String-to-Long
#include <ctype.h>

unsigned long stoul( instr )
char **instr;
{
    /* Convert string to long. If string starts with 0x it is interpreted as
     * a hex number, else if it starts with a 0 it is octal, else it is
     * decimal. Conversion stops on encountering the first character which is
     * not a digit in the indicated radix. *instr is updated to point past the
     * end of the number.
     */

    unsigned long num = 0;
    char          *str = *instr;

    while( isspace(*str) )
        ++str;

    if( *str != '0' )
    {
        while( isdigit(*str) )
            num = (num * 10) + (*str++ - '0');
    }
    else
    {
        if( *++str == 'x' || *str == 'X' )              /* hex */
        {
            for( ++str; isxdigit(*str); ++str )
                num = (num * 16) + ( isdigit(*str) ? *str - '0'
                                                   : toupper(*str) - 'A' + 10 );
        }
        else
        {
            while( '0' <= *str && *str <= '7' )         /* octal */
            {
                num *= 8;
                num += *str++ - '0';
            }
        }
    }

    *instr = str;
    return( num );
}

/*----------------------------------------------------------------------*/

long stol( instr )
char **instr;
{
    /* Like stoul(), but recognizes a leading minus sign and returns a signed
     * long.
     */

    while( isspace(**instr) )
        ++*instr;

    if( **instr != '-' )
        return (long)( stoul(instr) );
    else
    {
        ++*instr;                           /* Skip the minus sign. */
        return -(long)( stoul(instr) );
    }
}
Listing A.26. esc.c Map Escape Sequences to Binary
#include <ctype.h>              /* isdigit(), toupper()         */
#include <tools/debug.h>        /* PRIVATE, PUBLIC, and P()     */

PRIVATE int hex2bin P(( int c ));
PRIVATE int oct2bin P(( int c ));
/*----------------------------------------------------------------------*/

#define ISHEXDIGIT(x) (isdigit(x)||('a'<=(x)&&(x)<='f')||('A'<=(x)&&(x)<='F'))
#define ISOCTDIGIT(x) ('0'<=(x) && (x)<='7')

PRIVATE int hex2bin( c )
int c;
{
    /* Convert the hex digit represented by 'c' to an int. 'c' must be one of
     * the following characters: 0123456789abcdefABCDEF
     */
    return (isdigit(c) ? (c)-'0' : ((toupper(c))-'A')+10) & 0xf;
}

PRIVATE int oct2bin( c )
int c;
{
    /* Convert the octal digit represented by 'c' to an int. 'c' must be a
     * digit in the range '0'-'7'.
     */
    return ( ((c)-'0') & 0x7 );
}

/*----------------------------------------------------------------------*/

PUBLIC int esc( s )
char **s;
{
    /* Map escape sequences into their equivalent symbols. Return the equivalent
     * ASCII character. *s is advanced past the escape sequence. If no escape
     * sequence is present, the current character is returned and the string
     * is advanced by one. The following are recognized:
     *
     *  \b      backspace
     *  \f      formfeed
     *  \n      newline
     *  \r      carriage return
     *  \s      space
     *  \t      tab
     *  \e      ASCII ESC character ('\033')
     *  \DDD    number formed of 1-3 octal digits
     *  \xDDD   number formed of 1-3 hex digits (two required)
     *  \^C     C = any letter. Control code
     */

    register int rval;

    if( **s != '\\' )
        rval = *( (*s)++ );
    else
    {
        ++(*s);                                 /* Skip the \ */
        switch( toupper(**s) )
        {
        case '\0':  rval = '\\';        break;
        case 'B':   rval = '\b';        break;
        case 'F':   rval = '\f';        break;
        case 'N':   rval = '\n';        break;
        case 'R':   rval = '\r';        break;
        case 'S':   rval = ' ';         break;
        case 'T':   rval = '\t';        break;
        case 'E':   rval = '\033';      break;

        case '^':   rval = *++(*s);
                    rval = toupper(rval) - '@';
                    break;

        case 'X':   rval = 0;
                    ++(*s);
                    if( ISHEXDIGIT(**s) )
                    {
                        rval = hex2bin( *(*s)++ );
                    }
                    if( ISHEXDIGIT(**s) )
                    {
                        rval <<= 4;
                        rval |= hex2bin( *(*s)++ );
                    }
                    if( ISHEXDIGIT(**s) )
                    {
                        rval <<= 4;
                        rval |= hex2bin( *(*s)++ );
                    }
                    --(*s);
                    break;

        default:    if( !ISOCTDIGIT(**s) )
                        rval = **s;
                    else
                    {
                        rval = oct2bin( *(*s)++ );
                        if( ISOCTDIGIT(**s) )
                        {
                            rval <<= 3;
                            rval |= oct2bin( *(*s)++ );
                        }
                        if( ISOCTDIGIT(**s) )
                        {
                            rval <<= 3;
                            rval |= oct2bin( *(*s)++ );
                        }
                        --(*s);
                    }
                    break;
        }
        ++(*s);
    }
    return rval;
}
Listing A.27. bintoasc.c Convert Binary to Human-Readable String
#include <tools/compiler.h>     /* for prototypes only */

char *bin_to_ascii( c, use_hex )
{
    /* Return a pointer to a string that represents c. This will be the
     * character itself for normal characters and an escape sequence (\n, \t,
     * \x00, etc., for most others). A ' is represented as \'. The string will
     * be destroyed the next time bin_to_ascii() is called. If "use_hex" is true
     * then \xDD escape sequences are used. Otherwise, octal sequences (\DDD)
     * are used. (see also: pchar.c)
     */

    static unsigned char buf[8];

    c &= 0xff;
    if( ' ' <= c && c < 0x7f && c != '\'' && c != '\\' )
    {
        buf[0] = c;
        buf[1] = '\0';
    }
    else
    {
        buf[0] = '\\';
        buf[2] = '\0';

        switch( c )
        {
        case '\\':  buf[1] = '\\';  break;
        case '\'':  buf[1] = '\'';  break;
        case '\b':  buf[1] = 'b';   break;
        case '\f':  buf[1] = 'f';   break;
        case '\t':  buf[1] = 't';   break;
        case '\r':  buf[1] = 'r';   break;
        case '\n':  buf[1] = 'n';   break;
        default  :  sprintf( &buf[1], use_hex ? "x%03x" : "%03o", c );
                    break;
        }
    }
    return buf;
}
A.6 Print Functions
This section describes several output functions, many of which use the ANSI
variable-argument mechanism and the conversion functions described in the last two
sections. They are organized functionally.

void ferr( char *fmt, ... )

    This routine is a version of printf() for fatal error messages. It sends output
    to stderr rather than stdout, and it doesn't return; rather, it terminates the
    program with the following call:

        exit( on_ferr() );

    (on_ferr() is described below.) Normally, the routine works like printf()
    in other respects. If, however, it is called with ferr(NULL, string), it works
    like perror(string) in that it prints an error message associated with the most
    recent I/O-system error. The program still terminates as described earlier,
    however. The source code for ferr() is in Listing A.28. The prnt() call on line
    19 will be described in a moment.
Listing A.28. ferr.c Fatal-Error Processing
int on_ferr( void )    Error handler for ferr().
This routine is the default error handler called by ferr() just before exiting.
It returns the current contents of the system errno variable, which is, in turn,
passed back up to the operating system as the process exit status. You can
provide your own on_ferr() to preempt the library version. The source code is in
Listing A.29.
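The cooperation between the two routines can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of Listing A.28: the names my_ferr() and my_on_ferr() are hypothetical, and the standard vfprintf() stands in for the prnt() call mentioned above.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for on_ferr(): the exit status is the current errno. */
int my_on_ferr(void)
{
    return errno;
}

/* Hypothetical stand-in for ferr(): print to stderr, then exit.
 * my_ferr(NULL, string) behaves like perror(string). */
void my_ferr(const char *fmt, ...)
{
    va_list args;
    int     saved = errno;          /* output calls below may disturb errno */

    va_start(args, fmt);
    if (fmt == NULL)
        perror(va_arg(args, char *));
    else
        vfprintf(stderr, fmt, args);
    va_end(args);

    errno = saved;                  /* let the handler see the original value */
    exit(my_on_ferr());
}
```

A call such as my_ferr("Can't open %s\n", name) never returns, so no error-handling code is needed after it.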
void fputstr( char *str, int maxlen, FILE *stream )    Print human-readable string.
This function writes a string (str) having at most maxlen characters to the
indicated stream in human-readable form. All control characters, and so forth, are
mapped to printable strings using the bin_to_ascii() conversion function
described earlier (on page 727). The source code is in Listing A.30.
void pchar( FILE *stream, int c )    Print human-readable character.
This function works like putc(), except that control characters are mapped to
human-readable strings using bin_to_ascii(). The source code is in Listing
A.31.
Listing A.29. onferr.c Action Function for f e r r ()
#include <stdlib.h>
#include <tools/debug.h>
#include <tools/l.h>        /* Needed only for prototypes */

/* This is the default routine called by ferr when it exits. It should return
 * the exit status. You can supply your own version of this routine if you like.
 */
int on_ferr()
{
    extern int errno;
    return errno;
}
Listing A.30. fputstr.c Fatal-Error ProcessingService Routine
#include <stdio.h>
#include <tools/compiler.h>

/* FPUTSTR.C: Print a string with control characters mapped to readable strings.
 */

void fputstr( str, maxlen, stream )
char  *str;
int   maxlen;
FILE  *stream;
{
    char *s;

    while( *str && maxlen >= 0 )
    {
        s = bin_to_ascii( *str++, 1 );
        while( *s && --maxlen >= 0 )    /* count every character printed */
            putc( *s++, stream );
    }
}
Listing A.31. pchar.c Print Character in Human-Readable Form
void printv( FILE *stream, char **argv )    Print vector array.
This function prints an argv-like array of pointers to strings to the indicated
stream, one string per line (the '\n' is inserted automatically at the end of
every string). The source code is in Listing A.32.
Listing A.32. printv.c Print argv-like Vector Array
#include <stdio.h>

void printv( fp, argv )
FILE  *fp;
char  **argv;
{
    /* Print an argv-like array of pointers to strings, one string per line.
     * The array must be NULL terminated.
     */
    while( *argv )
        fprintf( fp, "%s\n", *argv++ );
}

void comment( fp, argv )
FILE  *fp;
char  **argv;
{
    /* Works like printv except that the array is printed as a C comment. */

    fprintf( fp, "\n/*-----------------------------------------------------\n" );
    while( *argv )
        fprintf( fp, " * %s\n", *argv++ );
    fprintf( fp, " */\n\n" );
}
void comment( FILE *stream, char **argv )    Print multiple-line comment.
This function also prints an argv-like array of pointers to strings to the indicated
stream, one string per line. The output text is put into a C comment, however.
Output takes the following form:
/*-----------------------------------------------------
 * string in argv[0]
 * string in argv[1]
 * ...
 * string in argv[N]
 */
The source code is also in Listing A.32.
void prnt( int (*ofunct)(), void *ofunct_arg, char *format, va_list args )    A printf() workhorse function.
prnt() is a variant on the UNIX _doprnt() and ANSI vfprintf() functions
that lets you write printf()-like output routines in a portable way. It is passed
a pointer to a single-character output function (ofunct), a parameter that is
relayed to this output function (ofunct_arg), a pointer to a format string, and a
pointer to the location on the run-time stack where the other arguments are found.
This last parameter is usually derived using the va_start() macro in
<stdarg.h>, described earlier.
To see how prnt() is used, fprintf() can be implemented as follows:
fprintf( stream, format, ... )
FILE *stream;
char *format;
{
    extern int fputc();
    va_list    args;

    va_start( args, format );       /* Get address of arguments. */
    prnt    ( fputc, stream, format, args );
    va_end  ( args );
}
sprintf() can be implemented as follows:
putstr( c, p )
int  c;
char **p;
{
    *(*p)++ = c;
}

sprintf( str, format, ... )
char *str, *format;
{
    va_list args;

    va_start( args, format );
    prnt    ( putstr, &str, format, args );
    *str = '\0';
    va_end  ( args );
}
The prnt() subroutine is required by the curses implementation discussed
below, which must be able to change the output subroutine at will. Neither the
ANSI nor the UNIX function gives you this capability. The source code for
prnt() is described in depth below.
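The calling convention just described can be sketched in present-day standard C. This is an illustration only, not the code of Listing A.33: prnt_sketch(), put_to_string(), and sprintf_sketch() are hypothetical names, the callback is given a full prototype, and vsnprintf() (C99) is assumed to be available so that the 256-byte buffer cannot overflow.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

typedef int (*out_fn)(int c, void *arg);    /* single-character output function */

/* Format into a local buffer, then hand the result to ofunct one
 * character at a time. Output longer than the buffer is truncated. */
void prnt_sketch(out_fn ofunct, void *ofunct_arg, const char *format, va_list args)
{
    char buf[256];
    int  n, i;

    assert(ofunct != NULL && format != NULL);
    n = vsnprintf(buf, sizeof buf, format, args);
    if (n < 0)
        return;                             /* encoding error: print nothing */
    if (n >= (int)sizeof buf)
        n = (int)sizeof buf - 1;            /* output was truncated */
    for (i = 0; i < n; ++i)
        ofunct((unsigned char)buf[i], ofunct_arg);
}

/* Output function that appends to a string; arg points at the write pointer. */
static int put_to_string(int c, void *arg)
{
    char **p = arg;
    *(*p)++ = (char)c;
    return c;
}

/* sprintf() built on the driver. The caller must supply a large enough str. */
void sprintf_sketch(char *str, const char *format, ...)
{
    va_list args;

    va_start(args, format);
    prnt_sketch(put_to_string, &str, format, args);
    *str = '\0';
    va_end(args);
}
```

Swapping put_to_string() for a function that writes to a window or a log file changes the destination without touching the formatting code, which is the capability the text attributes to prnt().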
void stop_prnt( void )    Clean up after prnt().
This routine must be called by all programs that use prnt() just before
termination (after the last prnt() call).
ANSI version of prnt().
The prnt() subroutine is in Listing A.33. Two versions are presented, one for the
ANSI environment and another for UNIX. The first version (on lines 19 to 31) produces an
ANSI-compatible function. It uses vsprintf() to do the conversion into a buffer, and
then prints the buffer. Though this version is desirable in that it supports all the
printf() conversions, it has a drawback. The buffer requires a modicum of stack
space, and the Microsoft implementation of vsprintf() also uses a lot of run-time
stack.
UNIX version of prnt().
The second, UNIX-compatible version of prnt() is on lines 37 to 65 of Listing
A.33. There are two problems here: First, the UNIX variable-argument mechanism is
different from the ANSI one, so slightly different procedures must be used to get
arguments off the stack. Second, there is no UNIX equivalent to vsprintf(). The only
available printf driver is _doprnt(), which works like the ANSI vfprintf()
function except that the arguments are in different positions. In order to use an arbitrary
output function, you must format to a FILE, rewind the file, and then transfer characters one
at a time to the desired output function. "Ugh,"⁶ you may well say. This approach is
actually not as much of a kludge as it seems because the number of characters written
and read is so small. More often than not, there's no disk activity at all because the reads
and writes won't get past the I/O system's internal buffering.
6. That's a technical term.
Listing A.33. prnt.c General-Purpose printf() Driver
4
5   /*----------------------------------------------------------------------
6    * Glue formatting workhorse functions to various environments. One of three
7    * versions of the workhorse function is used, depending on various #defines:
8    *
9    *    if ANSI is defined        vsprintf()    Standard ANSI function
10   *    if ANSI is not defined    _doprnt()     Standard UNIX function
11   *
12   * The default with Microsoft C is MSDOS defined and ANSI not defined,
13   * so _doprnt() will be used unless you change things with explicit macro
14   * definitions.
15   */
16  #ifdef ANSI /*--------------------------------------------------------*/
17  #include <stdarg.h>
18
19  PUBLIC void prnt( ofunct, funct_arg, format, args )
20  int     (*ofunct)();
21  void    *funct_arg;
22  char    *format;
23  va_list args;
24  {
25      char  buf[256], *p ;
26      int   vsprintf( char* buf, char* fmt, va_list args );
27
28      vsprintf( buf, format, args );
29
30      for( p = buf; *p ; p++ )
31          (*ofunct)( *p, funct_arg );
32  }
33
34  PUBLIC void stop_prnt(){}
35
36  #else /* UNIX ---------------------------------------------------------*/
37  #include <varargs.h>
38
39  static FILE *Tmp_file = NULL ;
40  static char *Tmp_name ;
41
42  PUBLIC void prnt( ofunct, funct_arg, fmt, argp )
43  int   (*ofunct)();
44  void  *funct_arg;
45  char  *fmt;
46  int   *argp;
47  {
48      int   c;
49      char  *mktemp();
50
51      if( !Tmp_file )
52          if( !(Tmp_file = fopen( Tmp_name = mktemp("yyXXXXXX"), "w+") ))
53          {
54              fprintf( stderr, "Can't open temporary file %s\n", Tmp_name );
55              exit( 1 );
56          }
57
58      _doprnt( fmt, argp, Tmp_file );
59      putc   ( 0, Tmp_file );
60      rewind ( Tmp_file );
61
62      while( (c = getc(Tmp_file)) != EOF && c )
63          (*ofunct)( c, funct_arg );
64      rewind( Tmp_file );
65  }
66
67  PUBLIC void stop_prnt()
68  {
69      fclose( Tmp_file );         /* Remove prnt temporary file */
70      unlink( Tmp_name );
71      Tmp_file = NULL;
72  }
73
74  /*----------------------------------------------------------------------*/
75
76  PUBLIC void vfprintf( stream, fmt, argp )
77  FILE  *stream;
78  char  *fmt, *argp;
79  {
80      _doprnt( fmt, argp, stream );
81  }
82
83  PUBLIC void vprintf( fmt, argp )
84  char  *fmt, *argp;
85  {
86      _doprnt( fmt, argp, stdout );
87  }
88
89  PRIVATE void putstr( c, p )
90  int   c;
91  char  **p;
92  {
93      *(*p)++ = c ;
94  }
95
96  PUBLIC void vsprintf( str, fmt, argp )
97  char  *str, *fmt, *argp;
98  {
99      prnt( putstr, &str, fmt, argp );
100     *str = '\0' ;
101 }
102 #endif
The temporary file is created automatically on lines 51 to 56 of Listing A.33 the first
time prnt() is called. You must call stop_prnt() (on lines 67 to 72) to delete the
file. You can do so after every prnt() call, but it's much more efficient to do it only
once, just before the program terminates. You'll lose a FILE pointer in this case, but it
saves a lot of time.
The final part of Listing A.33 (lines 76 to 101) contains UNIX versions of the ANSI
vprintf(), vfprintf(), and vsprintf() functions.
ssort() is like qsort().
Sorting argv with ssort().
Sorting an array of structures with ssort().
Sort array.
A.7 Sorting
This section describes two functions for sorting in-memory arrays: ssort() and
assort(). Both are modeled after the UNIX qsort() function. I've provided alternate
sorting functions because most qsort() implementations use a quicksort, which
does not behave particularly well with small arrays or arrays that might already be
sorted. A Shell sort, which does not have these problems, is used here. Both sort
functions are general-purpose functions that can be used on arrays of any type of object:
the same sorting function can sort arrays of ints, arrays of character pointers, arrays of
structures, and so forth. Taking ssort() as characteristic, argv is sorted as follows:
cmp( p1, p2 )
char **p1, **p2;
{
    return strcmp( *p1, *p2 );
}

main( argc, argv )
int  argc;
char **argv;
{
    ssort( argv, argc, sizeof(*argv), cmp );
}
The sort function is passed the base address of the array to be sorted, the number of
elements in the array, the size of one element, and a pointer to a comparison function. This
comparison function is passed pointers to two array elements, and it should otherwise
work like strcmp(), returning a negative number if the key associated with the first
argument is less than the key associated with the second, zero if the two keys are equal,
and a positive number if the first key is larger. You can use ssort() to sort an array
of structures as follows:
typedef struct
{
    char *key;
    int  other_stuff;
} record;

sort_cmp( p1, p2 )
record *p1, *p2;
{
    return( strcmp(p1->key, p2->key) );
}

plato()
{
    record field[ 10 ];
    ssort( field, 10, sizeof(record), sort_cmp );
}
Of course, it's usually better to sort arrays of pointers to structures rather than arrays of
structures: there's a lot less to move when two array elements are swapped. The calling
conventions for the two sort routines are as follows:
void ssort( void *base, int nel, int elsize, int (*cmp)() )
Sort an array at base address base, having nel elements, each of elsize
bytes. The comparison function, cmp, is passed pointers to two array elements
and should return as follows:
cmp( p1, p2 );
    *p1 <  *p2    return a negative number
    *p1 == *p2    return zero
    *p1 >  *p2    return a positive number
The ssort() function is identical in calling syntax to the standard qsort()
function. A Shell sort is used by ssort(), however, rather than a quicksort.
The Shell sort is more appropriate for use on arrays with a small number of
elements and arrays that might already be sorted. Also, since Shell sort is
nonrecursive, it is a safer function to use than quicksort, which can cause stack overflows
at run time.
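A comparison function that follows this negative/zero/positive convention for ints might look like the sketch below. The names cmp_int() and sort_ints() are hypothetical. The example calls the standard qsort(), which the text says has the same calling syntax as ssort(); substituting ssort() is assumed to need at most a change of casts.

```c
#include <assert.h>
#include <stdlib.h>

/* Return negative, zero, or positive as *p1 is less than, equal to,
 * or greater than *p2. Comparing (rather than subtracting) avoids
 * integer overflow when the operands have opposite signs. */
int cmp_int(const void *p1, const void *p2)
{
    int a, b;

    assert(p1 != NULL && p2 != NULL);
    a = *(const int *)p1;
    b = *(const int *)p2;
    return (a > b) - (a < b);
}

/* Sort n ints into ascending order. */
void sort_ints(int *v, size_t n)
{
    qsort(v, n, sizeof *v, cmp_int);
}
```

Writing the function as `return a - b;` is a common shortcut, but it gives the wrong sign when the subtraction overflows.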
void assort( void **base, int nel, int elsize, int (*cmp)() )    Sort array of pointers.
This routine is a version of ssort() optimized to sort arrays of pointers. It takes
the same arguments as ssort() so that the routines can be used interchangeably
on arrays of pointers. The base argument to assort() must reference an array
of pointers, however, and the elsize argument is ignored.
A.7.1 Shell Sort: Theory
Bubble sort.
The sorting method used by ssort() is a Shell sort, named after its inventor,
Donald Shell. It is essentially an improved bubble sort, so I'll digress for a moment and
describe bubble sort. A bubble sort that arranges an array of ints into ascending order
looks like this:
int array[ ASIZE ];
int i, j, temp;

for( i = 1; i < ASIZE; ++i )
    for( j = i-1; j >= 0; --j )
        if( array[j] > array[j+1] )
            swap( array+j, array+j+1 );   /* swap array[j] and array[j+1] */
The outer loop controls the effective array size. It starts out sorting a two-element array,
then it sorts a three-element array, and so forth. The inner loop moves the new element
(the one added when the effective array size was increased) to its correct place in the
previously-sorted array. Consider a worst case sort (where the array is already sorted,
but in reverse order):
5 4 3 2 1
In the first pass, the outer loop starts out with a two-element array:
5 4 3 2 1
And the inner loop swaps these two elements because they're out of place:
4 5 3 2 1
The next pass increases the effective array size to three elements:
4 5 3 2 1
And the 3 is moved into its proper place like this:
4 5 3 2 1
4 3 5 2 1
3 4 5 2 1
Shell sort.
The rest of the sort looks like this:
3 4 5 2 1
3 4 2 5 1
3 2 4 5 1
2 3 4 5 1
2 3 4 5 1
2 3 4 1 5
2 3 1 4 5
2 1 3 4 5
1 2 3 4 5
As you can see, this worst-case sort is very inefficient. In an N-element array, on the
order of N^2 swaps (and as many comparisons) are required to get the array sorted; the
five-element example above needed 10 swaps, which is N(N-1)/2. Even the average
case requires on the order of N^2 comparisons, even though it will use fewer swaps. This
behavior is quite measurable in most computer programs, and slows down the program
unnecessarily.
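The swap count can be checked with a small instrumented version of the loop. This is a sketch only: bubble_sort_count() is a hypothetical name, and the swap is written out in line on adjacent elements rather than calling a swap() routine.

```c
#include <assert.h>
#include <stddef.h>

/* Sort array[0..n-1] into ascending order using the two nested loops
 * described above and return the number of swaps performed. */
int bubble_sort_count(int *array, int n)
{
    int i, j, tmp, swaps = 0;

    assert(array != NULL || n == 0);
    for (i = 1; i < n; ++i)                 /* grow the effective array size */
        for (j = i - 1; j >= 0; --j)        /* move the new element into place */
            if (array[j] > array[j + 1]) {
                tmp          = array[j];
                array[j]     = array[j + 1];
                array[j + 1] = tmp;
                ++swaps;
            }
    return swaps;
}
```

For the reversed five-element array traced above, the function reports 1 + 2 + 3 + 4 = 10 swaps.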
The Shell sort improves the bubble sort by trying to move the most out-of-place
elements into the proper place as quickly as possible, instead of letting the out-of-place
element percolate through the array one place at a time. Other sort strategies (such as the
quicksort used for most qsort() implementations) are theoretically more efficient, but
the overhead required by these other methods is often great enough that the theoretically
less-efficient Shell sort is faster (at least with small arrays, less than a hundred elements
or so).
The basic strategy of Shell sort is to partition the array into several smaller subarrays
whose elements are spread out over the original array. The subarrays are sorted using a
bubble sort, and then the array is repartitioned into a smaller number of subarrays, each
having more elements. The process is continued until there's only one subarray comprising
the entire array. For example, if the initial array looks like this:
6 5 4 3 2 1
You can partition it into three, two-element arrays like this:
6 5 4 3 2 1
The first array is [6,3], the second is [5,2], and the third holds [4,1]. The distance
between any two elements in the subarray is called the gap size. Here, the gap size is 3.
The three subarrays are now sorted. The 6 and 3 will be swapped, as will the 5 and 2,
and the 4 and 1, yielding the following:
3 2 1 6 5 4
The array is repartitioned, this time with a gap size of two:
3 2 1 6 5 4
There are now two, three-element arrays having the members [3,1,5] and [2,6,4].
These are sorted, swapping the 1 and 3, and the 4 and 6, yielding:
1 2 3 4 5 6
The gap size is now reduced to 1, yielding:
1 2 3 4 5 6
but, since this array is already sorted, nothing more needs to be done. Note that only 5
swaps were used here rather than the 15 swaps that would be required by the bubble sort.
In the average case, roughly N^1.2 swaps are required to sort an array of size N.⁷
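The walk-through can be reproduced with a short instrumented routine. This is a sketch under stated assumptions rather than the library code: gap_sort_count() is a hypothetical name, it works only on ints, and the caller supplies the gap sequence (3, 2, 1 in the example) instead of having it computed.

```c
#include <assert.h>
#include <stddef.h>

/* Sort a[0..n-1] with one pass per entry in gaps[] (the last gap must
 * be 1 for the result to be fully sorted). Returns the number of swaps. */
int gap_sort_count(int *a, int n, const int *gaps, int ngaps)
{
    int g, i, j, gap, tmp, swaps = 0;

    assert(a != NULL && gaps != NULL);
    for (g = 0; g < ngaps; ++g) {
        gap = gaps[g];
        assert(gap > 0);
        for (i = gap; i < n; ++i)
            /* Walk the new element back through its own subarray,
             * stopping as soon as it is in order. */
            for (j = i - gap; j >= 0 && a[j] > a[j + gap]; j -= gap) {
                tmp        = a[j];
                a[j]       = a[j + gap];
                a[j + gap] = tmp;
                ++swaps;
            }
    }
    return swaps;
}
```

With the array 6 5 4 3 2 1 and gaps 3, 2, 1, the first pass makes three swaps, the second makes two, and the last makes none, for the total of five quoted in the text.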
A.7.2 Shell Sort: Implementation
The ssort() function in Listing A.34 is a general-purpose Shell-sort function. The
initial gap size is selected on line 16 from a number in the series: 1, 4, 13, 40,
121, ... (each number is three times its predecessor plus one). The largest number in this
series that is less than or equal to the array size is used. (This is the series recommended by
Knuth in The Art of Computer Programming as an optimal choice. Note that an even power
of two, as is used in many Shell-sort implementations, is among the worst choices of gap
sizes.) The for statement on line 19 controls the size of the subarrays. The gap is divided
by three on each pass, yielding the previous element in the foregoing series (40/3=13,
13/3=4, 4/3=1; I'm using integer division). The two loops on lines 20 to 33 are doing a
bubble-sort pass on the partitioned subarray.
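The gap selection can be isolated in a few lines. The function name initial_gap() is hypothetical, and the overflow check is an addition that the library routine is not known to have.

```c
#include <assert.h>
#include <limits.h>

/* Return the largest member of the series 1, 4, 13, 40, 121, ...
 * that is less than or equal to nel (0 if nel is 0). */
int initial_gap(int nel)
{
    int gap;

    assert(nel >= 0 && nel <= (INT_MAX - 1) / 3);   /* 3*gap + 1 must not overflow */

    for (gap = 1; gap <= nel; gap = 3 * gap + 1)
        ;                           /* overshoot by one member of the series */

    return gap / 3;                 /* integer division steps back by one member */
}
```

For a 100-element array the loop stops at 121, and 121/3 yields 40; subsequent passes would use 13, 4, and 1.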
Since the compiler doesn't know the size of an array element at compile time, it can't
do the pointer arithmetic; we have to do the pointer arithmetic ourselves by declaring the
array to be a character array on line eight and then explicitly multiplying by the size of
an array element every time a pointer is advanced (on lines 23 and 24). The array
elements also have to be swapped one byte at a time. The swap time could be improved by
passing a pointer to a swap function as well as a comparison function to ssort(), but I
wanted to maintain compatibility with the standard qsort(), and so didn't change any
of the parameters.
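The two operations described here, locating an element of unknown type and swapping two elements a byte at a time, can be sketched as a pair of helpers. The names element_at() and swap_elements() are hypothetical; the listing does the same work in line.

```c
#include <assert.h>
#include <stddef.h>

/* Address of element number index in an array whose elements are
 * elsize bytes each. The scaling that the compiler normally does for
 * typed pointers is done by hand on a byte pointer. */
void *element_at(void *base, size_t index, size_t elsize)
{
    assert(base != NULL);
    return (unsigned char *)base + index * elsize;
}

/* Exchange two elsize-byte objects one byte at a time. */
void swap_elements(void *a, void *b, size_t elsize)
{
    unsigned char *p1 = a, *p2 = b, tmp;

    assert(a != NULL && b != NULL);
    while (elsize-- > 0) {
        tmp   = *p1;
        *p1++ = *p2;
        *p2++ = tmp;
    }
}
```

Because both helpers see only bytes, the same code serves ints, doubles, and structures alike, at the cost of a slower swap than a type-specific assignment.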
One change that I did make is shown in Listing A.35. assort() is a version of
ssort() that sorts only arrays of pointer-sized objects, the vast majority of cases.
The routine takes the same arguments as ssort(), but it ignores elsize.
A.8 Miscellaneous Functions
This section discusses six routines that don't fit nicely into any of the other
categories:
int copyfile( char *dst, char *src, char *mode )
int movefile( char *dst, char *src, char *mode )
copyfile() copies the contents of the file named in src to the file named in
Choosing the gap size.
Doing explicit pointer
arithmetic.
Implementing assort().
Copy entire file.
Move entire file.
7. See [Knuth], vol. 3, pp. 84f.
Listing A.34. ssort.c General-Purpose Shell Sort
1   /* SSORT.C   Works just like qsort() except that a shell sort, rather
2    *           than a quick sort, is used. This is more efficient than
3    * quicksort for small numbers of elements, and it's not recursive (so will use
4    * much less stack space).
5    */
6
7   void ssort( base, nel, elsize, cmp )
8   char  *base;
9   int   nel, elsize;
10  int   (*cmp)();
11  {
12      int   i, j;
13      int   gap, k, tmp ;
14      char  *p1, *p2;
15
16      for( gap=1; gap <= nel; gap = 3*gap + 1 )
17          ;
18
19      for( gap /= 3;  gap > 0  ; gap /= 3 )
20          for( i = gap; i < nel; i++ )
21              for( j = i-gap; j >= 0 ; j -= gap )
22              {
23                  p1 = base + ( j      * elsize);
24                  p2 = base + ((j+gap) * elsize);
25
26                  if( (*cmp)( p1, p2 ) <= 0 )    /* Compare two elements   */
27                      break;
28
29                  for( k = elsize; --k >= 0 ;)   /* Swap two elements, one */
30                  {                              /* byte at a time.        */
31                      tmp   = *p1;
32                      *p1++ = *p2;
33                      *p2++ = tmp;
34                  }
35              }
36  }
dst. If the mode argument is "w", the destination is overwritten when it already
exists; otherwise, the contents of the source file are appended to the end of the
destination. The return values are as follows:
 0    Copy was successful.
-1    Destination file couldn't be opened.
-2    Source file couldn't be opened.
-3    Read error while copying.
-4    Write error while copying.
movefile() works like copyfile(), except that the source file is deleted if
the copy is successful. movefile() differs from the standard rename()
function in that it allows a move across devices (from one disk to another), and it
supports an append mode. rename() is faster if you're moving a file somewhere
else on the same device, and rename() can be used on directory names, however.
The sources for copyfile() and movefile() are in Listings A.36 and
A.37.
Listing A.35. assort.c ssort() Optimized for Arrays of Pointers
/* ASSORT.C   A version of ssort optimized for arrays of pointers. */

void assort( base, nel, elsize, cmp )
void  **base;
int   nel;
int   elsize;               /* ignored */
int   (*cmp)();
{
    int   i, j, gap;
    void  *tmp, **p1, **p2;

    for( gap=1; gap <= nel; gap = 3*gap + 1 )
        ;

    for( gap /= 3;  gap > 0  ; gap /= 3 )
        for( i = gap; i < nel; i++ )
            for( j = i-gap; j >= 0 ; j -= gap )
            {
                p1 = base + ( j     );
                p2 = base + ((j+gap));

                if( (*cmp)( p1, p2 ) <= 0 )
                    break;

                tmp = *p1;
                *p1 = *p2;
                *p2 = tmp;
            }
}
Listing A.36. copyfile.c Copy Contents of File
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>              /* needed for the O_ flags used below */

#define ERR_NONE         0
#define ERR_DST_OPEN    -1
#define ERR_SRC_OPEN    -2
#define ERR_READ        -3
#define ERR_WRITE       -4

copyfile( dst, src, mode )
char  *dst, *src;
char  *mode;                    /* "w" or "a" */
{
    /* Copy the src to the destination file, opening the destination in the
     * indicated mode. Note that the buffer size used on the first call is
     * used on subsequent calls as well. Return values are defined, above.
     * errno will hold the appropriate error code if the return value is <0.
     */

    int             fd_dst, fd_src;
    char            *buf;
    int             got;
    int             werror;
    int             ret_val = ERR_NONE ;
    static unsigned size    = 31 * 1024 ;

    while( size > 0  &&  !(buf = malloc(size)) )
        size -= 1024;

    if( !size )     /* couldn't get a buffer, do it one byte at a time */
    {
        size = 1;
        buf  = "xx";    /* allocate a buffer implicitly */
    }

    fd_src = open( src, O_RDONLY | O_BINARY );
    fd_dst = open( dst, O_WRONLY | O_BINARY | O_CREAT |
                            (*mode=='w' ? O_TRUNC : O_APPEND),
                            S_IREAD | S_IWRITE );

    if     ( fd_src == -1 ) { ret_val = ERR_SRC_OPEN; }
    else if( fd_dst == -1 ) { ret_val = ERR_DST_OPEN; }
    else
    {
        while( (got = read(fd_src, buf, size)) > 0 )
            if( (werror = write(fd_dst, buf, got)) == -1 )
            {
                ret_val = ERR_WRITE;
                break;
            }

        if( got == -1 )
            ret_val = ERR_READ;
    }

    if( fd_dst != -1 ) close( fd_dst );
    if( fd_src != -1 ) close( fd_src );
    if( size > 1     ) free ( buf    );

    return ret_val;
}
Listing A.37. movefile.c Move File to Different Device
movefile( dst, src, mode )      /* Works like copyfile() (see copyfile.c)   */
char  *dst, *src;               /* but deletes src if the copy is successful */
char  *mode;
{
    int rval;
    if( (rval = copyfile(dst, src, mode)) == 0 )
        unlink( src );
    return rval;
}
int *memiset( int *dst, int with_what, int count )    Fill memory with integer values.
This function is a version of the standard memset() function, but it works on
integer-sized objects rather than bytes. It fills count ints, based at dst, with
the pattern in with_what and returns dst. The source code is in Listing A.38.
Listing A.38. memiset.c Initialize Array of int to Arbitrary Value
/* Works like memset but fills integer arrays with an integer value. */
/* The count is the number of ints (not the number of bytes).        */

int *memiset( dst, with_what, count )
int *dst, with_what, count;
{
    int *targ;
    for( targ = dst; --count >= 0 ; *targ++ = with_what )
        ;
    return dst;
}
int concat( int size, char *dst, ... )    Concatenate strings.
The concat() function concatenates an arbitrary number of strings into a
single destination array (dst) of the indicated size. At most size-1 characters
are copied. All arguments following the dst argument are the source strings to
be concatenated, and the list should end with a NULL. For example, the following
code loads the target array with the string "angles, saxons, jutes":
#include <stdio.h>      /* For NULL definition */
char target[ SIZE ];
concat( SIZE, target, "angles, ", "saxons, ", "jutes", NULL );
The second and third arguments can be the same, but the target-string pointer
cannot appear in any of the other arguments. The following concatenates
new_string to the end of target. This usage is easier than the standard
strncat() function in many situations because it doesn't require you to
keep a tally of the unused space in the target array.
concat( SIZE, target, target, new_string, NULL );
The source code is in Listing A.39. The amount of available space is returned, or
-1 if the string was truncated.
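The behavior described above can be sketched in standard C with <stdarg.h>. This is an illustration, not the code of Listing A.39: concat_sketch() is a hypothetical name, and the list terminator is written as (char *)NULL because a bare NULL is not guaranteed to have pointer type in a variable-argument call.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Concatenate the NULL-terminated list of strings into dst, copying at
 * most size-1 characters. Returns the space left in dst (including the
 * byte that holds the terminator), or -1 if the result was truncated. */
int concat_sketch(int size, char *dst, ...)
{
    va_list args;
    char    *src;
    int     truncated = 0;

    assert(dst != NULL && size > 0);
    va_start(args, dst);
    while ((src = va_arg(args, char *)) != NULL) {
        while (*src != '\0' && size > 1) {
            *dst++ = *src++;
            --size;                 /* one less byte available */
        }
        if (*src != '\0') {         /* ran out of room in mid-string */
            truncated = 1;
            break;
        }
    }
    va_end(args);
    *dst = '\0';
    return truncated ? -1 : size;
}
```

A caller can test the return value for -1 instead of tracking the remaining space by hand, which is the convenience the text claims over strncat().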
void searchenv( char *filename, char *env_name, char *pathname )
This function searches for a specific file (filename), first in the current
directory, and then along a path specified in the environment variable whose name is
in env_name. The variable should hold a semicolon- or space-delimited list of
directory names. If the file is found, the full path name (including the file-name
component) is put into pathname; otherwise, pathname will contain an empty,
null-terminated string. The source code is in Listing A.40.
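Splitting such a directory list is the core of the search. The sketch below shows only that step, under stated assumptions: split_dirs() is a hypothetical helper, it uses the standard strtok() with the two delimiters named above, and it modifies the list in place, so the caller should pass a private copy of the environment string.

```c
#include <assert.h>
#include <string.h>

/* Break a semicolon- or space-delimited directory list into its
 * components. Pointers to the components are stored in dirs[], up to
 * max entries. The list itself is overwritten with '\0' separators.
 * Returns the number of directories found. */
int split_dirs(char *list, char *dirs[], int max)
{
    int  n = 0;
    char *p;

    assert(list != NULL && dirs != NULL);
    for (p = strtok(list, "; "); p != NULL && n < max; p = strtok(NULL, "; "))
        dirs[n++] = p;
    return n;
}
```

Each component would then be joined with the file name and tested for existence until a match is found.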
FILE *driver_1( FILE *output, int lines, char *fname )
int  driver_2( FILE *output, int lines )
Search for file along directory path listed in an environment string.
Copy driver-template file.
These routines work together to transfer a template file to a LeX or occs output
file. driver_1() must be called first. It searches for the file named in fname
Listing A.39. concat.c Concatenate Strings
#include <stdio.h>
#include <stdarg.h>
#include <debug.h>              /* VA_LIST definition */

int concat( size, dst, VA_LIST )
int   size;
char  *dst;
{
    /* This subroutine concatenates an arbitrary number of strings into a single
     * destination array (dst) of size "size." At most size-1 characters are
     * copied. Use it like this:
     *      char target[SIZE];
     *      concat( SIZE, target, "first ", "second ", "last", NULL );
     */

    char    *src;
    va_list args;
    va_start( args, dst );

    while( (src = va_arg(args, char *)) && size > 1 )
        while( *src && size-- > 1 )     /* count every character copied */
            *dst++ = *src++ ;

    *dst++ = '\0';
    va_end( args );
    return (size <= 1 && src && *src) ? -1 : size ;
}
by looking first in the current directory and then in any directory on the path
specified by the LIB environment. This environment can list several, semicolon-delimited
directory names.
Ctrl-L delimits parts of driver-template file.
The file should contain one or more Ctrl-L-delimited parts. The first part is
copied by driver_1() to the stream indicated by output. If lines is true, then a
#line directive that references the template file's current line number is output
just before the block is output. NULL is returned if the template file can't be
opened; otherwise the FILE pointer for the template file is returned. You can use
this pointer to close the file after it has been copied.
All other Ctrl-L-delimited parts of the template file are printed by successive
calls to driver_2(). One part is printed every time it's called. 1 is returned
normally, 0 at end of file (if there are no more parts to print).
Use @ to mark comments in driver-template file.
Lines that begin with an @ sign are ignored by both subroutines. It's a fatal
error to call driver_2() without a previously successful driver_1() call.
The source code for both routines is in Listing A.41.
A.9 Low-Level Video I/O Functions for the IBM PC
This section presents a set of very low-level terminal I/O functions for the IBM PC.
They are used by the curses implementation presented in the next section. The routines
in this section are the only ones in the curses package that are system dependent. As a
consequence, I've not bothered to make them portable, because they'll always have to be
rewritten if you port the code to another compiler.
Listing A.40. searchen.c Search for File Along Path Specified in Environment
#include <stdio.h>

#define PBUF_SIZE 129           /* Maximum length of a path name - 1 */

void searchenv( filename, envname, pathname )
char  *filename;                /* file name to search for                */
char  *envname;                 /* environment name to use as PATH        */
char  *pathname;                /* Place to put full path name when found */
{
    /* Search for file by looking in the directories listed in the envname
     * environment. Put the full path name (if you find it) into pathname.
     * Otherwise set *pathname to 0. Unlike the DOS PATH command (and the
     * microsoft _searchenv), you can use either a space or semicolon
     * to separate directory names. The pathname array must be at least
     * 128 characters.
     */

    char  pbuf[PBUF_SIZE];
    char  *p ;
    char  *strpbrk(), *strtok(), *getenv();

    strcpy( pathname, filename );
    if( access( pathname, 0 ) != -1 )       /* check current directory */
        return;                             /* ... it's there.         */

    /* The file doesn't exist in the current directory. If a path was specified
     * (ie. file contains \ or /) or if the environment isn't set,
     * return a NULL, else search for the file on the path.
     */

    if( strpbrk(filename, "\\/")  ||  !(p = getenv(envname)) )
    {
        *pathname = '\0';
        return;
    }

    strncpy( pbuf, p, PBUF_SIZE );
    if( p = strtok( pbuf, "; " ) )
    {
        do
        {
            sprintf( pathname, "%0.90s\\%0.20s", p, filename );

            if( access( pathname, 0 ) >= 0 )
                return;                     /* found it */
        }
        while( p = strtok( NULL, "; ") );
    }
    *pathname = '\0';
}
Two sets of complementary routines are presented here: a set of direct-video functions
that write directly into the IBM's display memory (they are MGA and CGA compatible),
and a set of similar routines that use the video functions built into the IBM-PC
ROM-BIOS to do the I/O. The direct-video routines are faster; the BIOS routines are, at
least in theory, more portable. (The direct-video functions work better than the BIOS
ones in clones that have nonstandard BIOS implementations.)
Listing A.41. driver.c Copy Driver Template to Output File

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <tools/compiler.h>     /* for prototypes */

/*----------------------------------------------------------------------*/

PUBLIC FILE *driver_1 P(( FILE *output, int line, char *file_name ));
PUBLIC int   driver_2 P(( FILE *output, int line ));

PRIVATE FILE *Input_file = NULL;
PRIVATE int   Input_line;       /* line number of most-recently read line */
PRIVATE char  File_name[80];    /* template-file name                     */

/*----------------------------------------------------------------------*/

PUBLIC FILE *driver_1( output, lines, file_name )
FILE *output;
char *file_name;
{
    char path[80];

    if( !(Input_file = fopen( file_name, "r" )) )
    {
        searchenv( file_name, "LIB", path );
        if( !*path || !(Input_file = fopen( path, "r" )) )
            return NULL;
    }

    strncpy( File_name, file_name, sizeof(File_name) );
    Input_line = 0;
    driver_2( output, lines );
    return Input_file;
}

/*----------------------------------------------------------------------*/

PUBLIC int driver_2( output, lines )
FILE *output;
{
    static char buf[ 256 ];
    char        *p;
    int         processing_comment = 0;

    if( !Input_file )
        ferr( "INTERNAL ERROR [driver_2], Template file not open.\n" );

    if( lines )
        fprintf( output, "\n#line %d \"%s\"\n", Input_line + 1, File_name );

    while( fgets( buf, sizeof(buf), Input_file ) )
    {
        ++Input_line;
        if( *buf == '\f' )
            break;

        for( p = buf; isspace(*p); ++p )
            ;
        if( *p == '@' )
        {
            processing_comment = 1;
            continue;
        }
        else if( processing_comment )   /* Previous line was a comment, */
        {                               /* but current line is not.     */
            processing_comment = 0;
            if( lines )
                fprintf( output, "\n#line %d \"%s\"\n", Input_line, File_name );
        }
        fputs( buf, output );
    }
    return( feof( Input_file ) );
}
A.9.1 IBM Video I/O Overview
You need to know a little about low-level I/O to understand the function descriptions
that follow. The IBM PC uses several different hardware adapters for video output. The
most common are the Color Graphics Adapter (CGA) and Monochrome Graphics
Adapter (MGA). Most other interface cards can emulate one or the other of these. Both
the MGA and CGA use a block of memory to represent the screen: a two-dimensional
array in high memory holds the characters that are currently being displayed. The
position of the character in the array determines the position of the character on the screen.
Several such arrays, called video display pages, are available, but only one page can be
displayed at a time. Both systems use a 25x80 array, but the MGA bases the default
video page at absolute memory location 0xb0000 (B000:0000), and the CGA puts it at
0xb8000. The other pages are at proportionally higher addresses.
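The address arithmetic implied by this layout can be sketched in a few lines of C. This is only an illustration, not code from the listings: the names cell_offset() and cell_address() are hypothetical, only page 0 is considered, and each screen cell is assumed to occupy two bytes (the 16-bit character/attribute word that the text goes on to describe).

```c
#define NUMROWS   25
#define NUMCOLS   80
#define MGA_BASE  0xb0000UL     /* page 0, monochrome adapter */
#define CGA_BASE  0xb8000UL     /* page 0, color adapter      */

/* Byte offset of the cell at (row, col) from the start of the page.
 * Assumes two bytes per cell, stored row by row.
 */
unsigned long cell_offset( int row, int col )
{
    return ( (unsigned long)row * NUMCOLS + (unsigned long)col ) * 2;
}

/* Absolute address of the cell, given the adapter's page-0 base address. */
unsigned long cell_address( unsigned long base, int row, int col )
{
    return base + cell_offset( row, col );
}
```

With these assumptions the last cell on the screen, (24,79), lies 3,998 bytes past the base address, so a whole page fits in 4,000 bytes.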
Both cards use a 16-bit word to represent each character. The low byte is an
extended ASCII code for the character. (All 255 codes are used; the extra ones are for
funny characters like smiley faces and hearts.) The high byte is used for attributes. This
attribute byte is pictured in Figure A.2. The high bit of the attribute byte controls
character blinking (it's blinking if the bit is set). The next three bits determine the
background color. The next bit controls the intensity of the foreground color, that is, of the
character itself. If it's set, the character is displayed in high intensity. The bottom three bits
determine the foreground color (the color of the character itself). The only difference
between the color and monochrome adapters is the way that the colors are interpreted.
The monochrome adapter recognizes only black and white; the code for blue, when used
as a foreground color, causes the MGA to print the character underlined.
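The bit layout just described can be made concrete with a small decoder. This is a sketch under stated assumptions rather than part of the package: ATTR_FIELDS, split_attribute(), word_char(), and word_attr() are hypothetical names introduced here for illustration.

```c
typedef struct
{
    unsigned blink;         /* bit 7:    1 = blinking, 0 = steady   */
    unsigned background;    /* bits 6-4: background color code      */
    unsigned intensity;     /* bit 3:    1 = high intensity         */
    unsigned foreground;    /* bits 2-0: foreground color code      */
} ATTR_FIELDS;              /* hypothetical name */

/* Break an attribute byte into the four fields described in the text. */
ATTR_FIELDS split_attribute( unsigned attrib )
{
    ATTR_FIELDS f;

    f.blink      = (attrib >> 7) & 0x1;
    f.background = (attrib >> 4) & 0x7;
    f.intensity  = (attrib >> 3) & 0x1;
    f.foreground =  attrib       & 0x7;
    return f;
}

/* A screen word holds the character in its low byte, the attribute in its high byte. */
unsigned word_char( unsigned word ) { return  word       & 0xff; }
unsigned word_attr( unsigned word ) { return (word >> 8) & 0xff; }
```

For example, the word 0x0741 decodes to the character 'A' (0x41) with attribute 0x07, which is a white foreground on a black background, steady, at normal intensity.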
Note that the foregoing applies only to the CGA and MGA; other adapters do things
slightly differently, and the situation is actually more complex for the CGA. If you have
a CGA and want more details, video interfacing is covered in depth in Chapter 4 of Peter
Norton's book: The Peter Norton Programmer's Guide to the IBM PC (Bellevue, Wash.:
Microsoft Press, 1985).
The following functions use direct-video access to address the screen:
int dv_init(void)
Initialize the direct-video functions. At present this routine just checks to see if a
color card is installed and changes the internal base address of the video memory.
(The monochrome and color cards map video memory to different base
addresses.) It returns zero if an 80-column text mode is not available (the
direct-video functions will not work in this case), or 1 if everything's okay.

Figure A.2. Attribute Bits

  bit 7          bits 6-4            bit 3          bits 2-0
  blinking       background color    intensity      foreground color
  (1=blinking,                       (1=high,
   0=steady)                          0=low)

  Color code     CGA         MGA
  0 0 0          black       black (dark grey at high intensity)
  0 0 1          blue        underlined
  0 1 0          green       n/a
  0 1 1          cyan        n/a
  1 0 0          red         n/a
  1 0 1          magenta     n/a
  1 1 0          brown       n/a
  1 1 1          white       white (light grey at low intensity)

void dv_clr_region(int left, int right, int top, int bottom, int attrib)
Clear the region of screen defined by a box with the upper-left corner at
(left, top) and the lower-right corner at (right, bottom). The region is
cleared by filling the area with space characters having the indicated attribute.
Symbolic defines for the attributes are in <tools/termlib.h>, discussed
below. The cursor position is not modified.
void dv_clrs(attrib)
Clear the entire screen by filling it with space characters having the indicated
attribute. The cursor position is not modified.
void dv_printf(int attrib, char *fmt, ...)
A printf() that uses direct-video writes. Output characters all have the
indicated attribute. The prnt() function described earlier does the actual printing.
void dv_putc(int c, int attrib)
Write a single character to the screen at the current cursor position, and with the
indicated attribute. The following characters are special:
\0    ignored
\f    clear screen and home cursor
\n    send cursor to left edge of next line
\r    send cursor to left edge of current line
\b    back up cursor one character (non-destructive backspace)
The screen scrolls up if you go past the bottom line. Characters that go beyond
the end of the current line wrap around to the next line.
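The special-character rules for dv_putc() can be modeled on an ordinary array standing in for video memory. This is a sketch of the behavior described above, not the routine's actual source: sim_putc() and its helpers are hypothetical, and the treatment of a backspace in column 0 (the cursor stays put) is an assumption that the text does not address.

```c
#include <string.h>

#define NUMROWS 25
#define NUMCOLS 80

static unsigned short Screen[NUMROWS][NUMCOLS]; /* stand-in for video memory */
static int Row, Col;                            /* simulated cursor position */

static void fill_row( int row, int attrib )     /* blank one row */
{
    int c;
    for( c = 0; c < NUMCOLS; ++c )
        Screen[row][c] = (unsigned short)( ((attrib & 0xff) << 8) | ' ' );
}

static void next_line( int attrib )             /* move to start of next line, */
{                                               /* scrolling up if necessary   */
    Col = 0;
    if( ++Row >= NUMROWS )
    {
        Row = NUMROWS - 1;
        memmove( Screen[0], Screen[1], sizeof(Screen[0]) * (NUMROWS - 1) );
        fill_row( NUMROWS - 1, attrib );
    }
}

void sim_putc( int c, int attrib )
{
    int r;

    switch( c )
    {
    case '\0':  break;                                  /* ignored            */
    case '\f':  for( r = 0; r < NUMROWS; ++r )          /* clear and home     */
                    fill_row( r, attrib );
                Row = Col = 0;
                break;
    case '\n':  next_line( attrib );        break;
    case '\r':  Col = 0;                    break;
    case '\b':  if( Col > 0 ) --Col;        break;      /* assumption: stop at column 0 */
    default:
        Screen[Row][Col] = (unsigned short)( ((attrib & 0xff) << 8) | (c & 0xff) );
        if( ++Col >= NUMCOLS )                          /* wrap to next line  */
            next_line( attrib );
        break;
    }
}
```

Writing 80 ordinary characters starting at column 0 leaves the simulated cursor at the left edge of the following line, which is the wrap-around behavior the text describes.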
void dv_ctoyx(int y, int x)
Position the cursor at the indicated row (y) and column (x). The top-left corner of
the screen is (0,0), the bottom-right corner is (24,79).
void dv_getyx(int *rowp, int *colp)
Modify *rowp and *colp to hold the current cursor position. The top-left corner
of the screen is (0,0), the bottom-right corner is (24,79).
int dv_incha(void)
Return the character and attribute at the current cursor position. The character is in
the low byte of the returned value and the attribute is in the high byte.
void dv_outcha(int c)
Write character and attribute to screen without moving the cursor. The character
is in the low byte of c and the attribute is in the high byte. Note that the special
characters supported by dv_putc() are not supported here. They will print as
funny-looking IBM graphics characters.
void dv_replace(int c)
Like dv_outcha(), this function writes the character to the screen without
moving the cursor, but the attribute is not modified by dv_replace().
void dv_putchar(int c)
Write a character to the screen with normal attributes (black background, white
foreground, not blinking, normal-intensity, not underlined). This function uses
dv_putc() for its output, so it supports the same special characters, scrolling,
line wrap, and so forth.
int dv_puts(char *str, int move)
Write a string to screen. If move is true, the cursor is positioned at the end of the
string; otherwise the cursor is not moved. Normal attributes, as described earlier
for dv_putchar(), are used. This function uses dv_putc() for its output.
void dv_putsa(char *str, int attrib)
Write string to screen, giving characters the indicated attributes. The cursor is
positioned just after the rightmost character. This function uses dv_putc() for
its output.
#include <tools/termlib.h>
SBUF *dv_save(int left, int right, int top, int bottom)
SBUF *dv_restore(SBUF *sbuf)
void dv_freesbuf(SBUF *sbuf)
These three functions work together to save and restore the characters in a
specified region of the screen. dv_save() saves all characters in the box with
corners at (top, left) and (bottom, right). It returns a pointer that can be
passed to a subsequent dv_restore() call to restore that region to its original
condition. dv_save() terminates the program with an error message if it can't get
memory. The cursor is not modified by either function. The SBUF structure, used
for this purpose, is defined in <tools/termlib.h>. Note that the memory used for
this structure is not freed; you must do that yourself with a
dv_freesbuf(sbuf) call after the dv_restore() call. For convenience,
dv_restore() returns its own argument, so you can say:
    dv_freesbuf( dv_restore( sbuf ) );
if you like.
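The save/restore protocol can be illustrated with a portable model that uses an ordinary array in place of video memory. This is a sketch only: SAVEBUF, sim_save(), sim_restore(), and sim_freesbuf() are hypothetical stand-ins for SBUF and the dv_ routines, and abort() stands in for the error-message exit mentioned in the text.

```c
#include <stdlib.h>

#define NUMROWS 25
#define NUMCOLS 80

typedef unsigned short CELL;        /* 16-bit character/attribute pair */

typedef struct
{
    int   top, bottom, left, right;
    CELL *image;                    /* saved cells, stored row by row  */
} SAVEBUF;                          /* hypothetical stand-in for SBUF  */

static CELL Screen[NUMROWS][NUMCOLS];   /* stand-in for video memory   */

SAVEBUF *sim_save( int left, int right, int top, int bottom )
{
    int      rows = bottom - top + 1, cols = right - left + 1, r, c;
    SAVEBUF *sb   = malloc( sizeof *sb );
    CELL    *p;

    if( !sb || !(sb->image = malloc( (size_t)rows * cols * sizeof(CELL) )) )
        abort();                    /* model of "terminates the program" */

    sb->top  = top;   sb->bottom = bottom;
    sb->left = left;  sb->right  = right;

    p = sb->image;
    for( r = top; r <= bottom; ++r )
        for( c = left; c <= right; ++c )
            *p++ = Screen[r][c];
    return sb;
}

SAVEBUF *sim_restore( SAVEBUF *sb )
{
    CELL *p = sb->image;
    int   r, c;

    for( r = sb->top; r <= sb->bottom; ++r )
        for( c = sb->left; c <= sb->right; ++c )
            Screen[r][c] = *p++;
    return sb;                      /* returned so the caller can chain a free */
}

void sim_freesbuf( SAVEBUF *sb )
{
    free( sb->image );
    free( sb );
}
```

Because the restore routine hands back its argument, a call such as sim_freesbuf( sim_restore( sb ) ) restores the region and releases the buffer in one statement, mirroring the idiom shown above.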
void dv_scroll_line(int left, int right, int top, int bottom, int dir, int attrib)
Scroll the indicated region of the screen by one line or column. The dir
argument must be one of the characters: 'u', 'd', 'l', or 'r' for up, down, left, or
right. The cursor is not moved. The opened line or column is filled with space
characters having the indicated attribute.
void dv_scroll(int left, int right, int top, int bottom, int amt, int attrib)
Scroll the indicated region of the screen up or down by the indicated number of
lines (amt), filling the new lines with space characters having the indicated
attribute. Negative amounts scroll down, positive amounts scroll up.
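The sign convention for amt is easy to get backwards, so here is a small model of region scrolling on an in-memory array. It is a sketch of the documented behavior, not the library source; sim_scroll() and the Screen array are hypothetical names.

```c
#define NUMROWS 25
#define NUMCOLS 80

typedef unsigned short CELL;            /* 16-bit character/attribute pair */

static CELL Screen[NUMROWS][NUMCOLS];   /* stand-in for video memory       */

/* Scroll the region by amt lines: positive scrolls up, negative scrolls down.
 * Lines opened up by the scroll are filled with spaces having the attribute.
 */
void sim_scroll( int left, int right, int top, int bottom, int amt, int attrib )
{
    CELL blank = (CELL)( ((attrib & 0xff) << 8) | ' ' );
    int  r, c, src;

    if( amt >= 0 )
    {
        for( r = top; r <= bottom; ++r )        /* work from the top down  */
        {
            src = r + amt;
            for( c = left; c <= right; ++c )
                Screen[r][c] = (src <= bottom) ? Screen[src][c] : blank;
        }
    }
    else
    {
        for( r = bottom; r >= top; --r )        /* work from the bottom up */
        {
            src = r + amt;                      /* amt is negative here    */
            for( c = left; c <= right; ++c )
                Screen[r][c] = (src >= top) ? Screen[src][c] : blank;
        }
    }
}
```

The direction of each loop matters: copying top-down when scrolling up (and bottom-up when scrolling down) guarantees that a source row is read before it is overwritten.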
There are two sets of I/O routines that use the video BIOS rather than direct-video
reads and writes. The routines in Table A.2 behave exactly like the dv_ routines that
were discussed earlier. For example, vb_getyx() works exactly like dv_getyx(),
but it uses the video BIOS. Names that are in all caps are implemented as macros.
Table A.2. Video-BIOS Macros and Functions That Work Like Direct-Video Functions

Function or macro†                                               Has side effects?
void  VB_CLRS       ( attrib )                                   no
void  VB_CLR_REGION ( left, right, top, bottom, attrib )         yes
void  VB_CTOYX      ( y, x )                                     no
void  vb_freesbuf   ( sbuf )
void  vb_getyx      ( yp, xp )
int   VB_INCHA      ( )                                          no
void  VB_OUTCHA     ( c )                                        yes
void  VB_PUTCHAR    ( c )                                        no
void  vb_putc       ( c, attrib )
void  vb_puts       ( str, move_cur )
void  VB_REPLACE    ( c )                                        yes
SBUF *vb_restore    ( sbuf )
SBUF *vb_save       ( left, right, top, bottom )
void  VB_SCROLL     ( left, right, top, bottom, amt, attrib )    yes

† Names in all caps are macros.
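The "side effects" column is easiest to understand with a small experiment. The sketch below assumes that the column refers to the usual macro hazard of an argument being evaluated more than once; CLR_REGION_MACRO, clr_region_func(), scroll_lines(), and counted() are hypothetical names, loosely modeled on the way a region-clearing macro needs both (t) and (b)-(t) in its expansion.

```c
static int Calls;               /* number of times counted() has been called */

/* An argument expression with a visible side effect. */
int counted( int value )
{
    ++Calls;
    return value;
}

/* Stand-in for the underlying scroll call: just reports the line count. */
int scroll_lines( int top, int amt )
{
    (void)top;
    return amt;
}

/* Macro version: t appears twice in the expansion, so it is evaluated twice. */
#define CLR_REGION_MACRO(t, b)   scroll_lines( (t), ((b) - (t)) + 1 )

/* Function version: every argument is evaluated exactly once. */
int clr_region_func( int t, int b )
{
    return scroll_lines( t, (b - t) + 1 );
}
```

Calling CLR_REGION_MACRO( counted(3), 10 ) runs counted() twice, whereas clr_region_func( counted(3), 10 ) runs it once. An argument such as row++ would therefore misbehave with a macro marked "yes" in the table.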
The second set of BIOS routines gives you access to features that are not easy to
implement directly. Of these, the cursor-movement functions are used by the
direct-video routines to move the physical cursor. As with the earlier group, names in all caps
represent macros, but none of these macros have side effects.
void VB_BLOCKCUR(void)
Make the cursor a block cursor rather than the normal underline.
void VB_NORMALCUR(void)
Make the cursor a normal, underline, cursor.
void VB_CURSIZE(int top, int bottom)
Change the cursor size by making it extend from the indicated top scan line to
the bottom one. The line numbers refer to the box in which the character is
drawn. A character on the CGA is 8 scan lines (dots) high, and the line numbers
go from 0 to 7. They go from 0 to 12 on the MGA. A normal, underline cursor
can be created with VB_CURSIZE(6, 7) on the CGA, and
VB_CURSIZE(11, 12) on the MGA. VB_CURSIZE(0, 7) creates a block cursor
on the CGA, filling the entire area occupied by a character.
VB_CURSIZE(0, 1) puts a line over the character rather than under it.
VB_CURSIZE(13, 13) makes the cursor disappear entirely. If the top line is
larger than the bottom, you'll get a two-part cursor, so VB_CURSIZE(11, 1)
creates a cursor with lines both above and below the character on the MGA.
int vb_getchar(void)
Get a character directly from the keyboard. The typed character is returned in the
low byte of the returned integer; the high byte holds the auxiliary byte that marks
ALT keys and such. See the IBM Technical Reference for more information. You
can use this function to read key codes that can't be accessed by some compilers'
standard I/O system. Similarly, vb_getchar() gets characters from the
keyboard, even if input is redirected, though it's more portable to open the console
directly for this purpose.
int VB_GETCUR(void)
Return the current cursor position. The top byte of the return value holds the row,
the bottom byte the column.
int VB_GETPAGE(void)
Return a unique number identifying the currently active video-display page.
int vb_iscolor(void)
Returns one of the following values:
 1    CGA is installed and it's in an 80-column text mode
 0    MGA is installed
-1    CGA is installed and it's not in an 80-column text mode
The return value is controlled by the current video mode as follows:
Mode             Return value
2 or 3            1
7                 0
anything else    -1
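The mode-to-return-value mapping above amounts to a three-way switch. The following sketch shows only that mapping; iscolor_from_mode() is a hypothetical name, and it takes the video mode as an argument rather than asking the BIOS for it.

```c
/* Map a BIOS video-mode number onto the vb_iscolor() return values
 * listed in the table: 1 for modes 2 and 3, 0 for mode 7, -1 otherwise.
 */
int iscolor_from_mode( int video_mode )
{
    switch( video_mode )
    {
    case 2:
    case 3:   return  1;    /* CGA in an 80-column text mode     */
    case 7:   return  0;    /* MGA                               */
    default:  return -1;    /* CGA, not in an 80-column text mode */
    }
}
```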
void VB_SETCUR(posn)
Modify current cursor position. The top byte of posn holds the row (y), the
bottom byte, the column (x). The top-left corner of the screen is (0,0).
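Packing and unpacking a cursor position in the form used by VB_GETCUR and VB_SETCUR takes one shift and one mask in each direction. The helper names below (make_posn(), posn_row(), posn_col()) are hypothetical and are offered only as a sketch of the encoding.

```c
/* Combine a row and column into a position word: row in the top byte,
 * column in the bottom byte.
 */
unsigned make_posn( unsigned row, unsigned col )
{
    return ((row & 0xff) << 8) | (col & 0xff);
}

/* Extract the row (top byte) from a position word. */
unsigned posn_row( unsigned posn )
{
    return (posn >> 8) & 0xff;
}

/* Extract the column (bottom byte) from a position word. */
unsigned posn_col( unsigned posn )
{
    return posn & 0xff;
}
```

The bottom-right corner of the screen, row 24 and column 79, packs to 0x184f.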
A.9.2 Video I/O Implementation
This section presents the code that implements the foregoing functions. Most of the
actual code is pretty boring, and it's presented with no additional comment in the text.
The data structures and techniques used are discussed at a high level, however.
I'm assuming, in this section, that you're familiar to some extent with the IBM
ROM-BIOS interface. If you need this information, read Peter Norton's book: The Peter
Norton Programmer's Guide to the IBM PC (Bellevue, Wash.: Microsoft Press, 1985),
which covers all this material. I'm also assuming that your compiler has a mechanism
for generating a software interrupt; most do. I'm using the Microsoft int86()
because it's portable. Newer versions of the compiler have more efficient BIOS-access
functions in the library, and you may want to modify the code to use these functions.
The <tools/termlib.h> file is used both by application programs and by the I/O
functions themselves. It's shown in Listing A.42. The macros on lines five to 12 are the
codes for the basic colors described earlier. The FGND() and BGND() macros on lines
14 and 15 put the color code into the foreground or background position within the
attribute byte. (FGND is just for documentation; it doesn't do anything. BGND shifts the color
four bits to the left.) The NORMAL, UNDERLINED, and REVERSE definitions on lines 17
to 19 of Listing A.42 define all legal color combinations for the monochrome card: a
normal character, an underlined character, and a reverse-video character. Finally,
BLINKING and BOLD (on lines 21 and 22 of Listing A.42) can be ORed with the colors to make
the character blink or be displayed at high intensity.
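A short worked example shows what these macros evaluate to. The definitions below repeat the ones described for <tools/termlib.h> so that the fragment stands on its own; make_attribute() is a hypothetical helper added for illustration and is not part of the header.

```c
#define BLACK      0x00
#define BLUE       0x01
#define GREEN      0x02
#define CYAN       0x03
#define RED        0x04
#define MAGENTA    0x05
#define BROWN      0x06
#define WHITE      0x07

#define FGND(color)  (color)            /* foreground: low three bits    */
#define BGND(color)  ((color) << 4)     /* background: shifted four left */

#define NORMAL       (FGND(WHITE) | BGND(BLACK))
#define UNDERLINED   (FGND(BLUE)  | BGND(BLACK))
#define REVERSE      (FGND(BLACK) | BGND(WHITE))

#define BLINKING     0x80
#define BOLD         0x08

/* Hypothetical helper: build an attribute byte from a foreground color,
 * a background color, and any combination of BLINKING and BOLD.
 */
int make_attribute( int fg, int bg, int flags )
{
    return ( FGND(fg) | BGND(bg) | flags ) & 0xff;
}
```

With these definitions NORMAL is 0x07, UNDERLINED is 0x01, REVERSE is 0x70, and NORMAL | BOLD is 0x0f. A blinking red character on a blue background, make_attribute( RED, BLUE, BLINKING ), comes out as 0x94.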
Listing A.42. termlib.h Video-I/O Definitions

 1  /* Various definitions for the termlib. Note that if your program includes both
 2   * termlib.h and vbios.h, termlib.h must be included FIRST.
 3   */
 4
 5  #define BLACK       0x00            /* Color Card. */
 6  #define BLUE        0x01
 7  #define GREEN       0x02
 8  #define CYAN        0x03
 9  #define RED         0x04
10  #define MAGENTA     0x05
11  #define BROWN       0x06
12  #define WHITE       0x07
13
14  #define FGND(color)     (color)
15  #define BGND(color)     ((color)<<4)
16
17  #define NORMAL      (FGND(WHITE) | BGND(BLACK))     /* Monochrome card */
18  #define UNDERLINED  (FGND(BLUE)  | BGND(BLACK))
19  #define REVERSE     (FGND(BLACK) | BGND(WHITE))
20
21  #define BLINKING    0x80            /* May be ORed with the above */
22  #define BOLD        0x08            /* and with each other        */
23
24  /*---------------------------
25   * If USE_FAR_HEAP is true then use the far heap to save screen images in the
26   * small model. You must recompile the termlib if you change this #define.
27   */
28
29  typedef unsigned int WORD;
30
31  #if( USE_FAR_HEAP )
32      typedef WORD far *IMAGEP;
33      #define IMALLOC _fmalloc
34      #define IFREE   _ffree
35  #else
36      typedef WORD *IMAGEP;
37      #define IMALLOC malloc
38      #define IFREE   free
39  #endif
40
41  typedef struct SBUF     /* used by vb_save, vb_restore, dv_save, and dv_restore */
42  {
43      unsigned int top, bottom, left, right;
44      IMAGEP image;
45  } SBUF;
46
47  /*-----
48   * Prototypes for the video-BIOS access routines.
49   */
50
51  extern int   vb_iscolor     ( void );
52  extern void  vb_getyx       ( int *yp, int *xp );
53  extern void  vb_putc        ( int c, int attrib );
54  extern void  vb_puts        ( char *str, int move_cur );
55  extern int   vb_getchar     ( void );
56  extern SBUF *vb_save        ( int l, int r, int t, int b );
57  extern SBUF *vb_restore     ( SBUF *sbuf );
58  extern void  vb_freesbuf    ( SBUF *sbuf );
59
60  /*------------------------
61   * Prototypes for the equivalent direct video functions.
62   */
63
64  extern int   dv_init        ( void );
65  extern void  dv_scroll_line ( int x_left, int x_right, int y_top, \
66                                int y_bottom, int dir, int attrib );
67  extern void  dv_scroll      ( int x_left, int x_right, int y_top, \
68                                int y_bottom, int amt, int attrib );
69  extern void  dv_clrs        ( int attrib );
70  extern void  dv_clr_region  ( int l, int r, int t, int b, int attrib );
71  extern void  dv_ctoyx       ( int y, int x );
72  extern void  dv_getyx       ( int *rowp, int *colp );
73  extern void  dv_putc        ( int c, int attrib );
74  extern void  dv_putchar     ( int c );
75  extern int   dv_puts        ( char *str, int move_cur );
76  extern void  dv_putsa       ( char *str, int attrib );
77  extern int   dv_incha       ( void );
78  extern void  dv_outcha      ( int c );
79  extern void  dv_replace     ( int c );
80  extern void  dv_printf      ( int attribute, char *fmt, ... );
81  extern SBUF *dv_save        ( int l, int r, int t, int b );
82  extern SBUF *dv_restore     ( SBUF *sbuf );
83  extern void  dv_freesbuf    ( SBUF *sbuf );

The WORD typedef on line 29 of Listing A.42 is an attempt to get a portable
definition for a 16-bit unsigned quantity. It's used later to access a single 16-bit,
character/attribute pair from the video page.
The definitions on lines 31 to 39 of Listing A.42 are compiler dependent. They
control from where, in real memory, the buffers used for an SBUF will be allocated. In an
average window application, where you need to save the area under a new window
before creating it, a considerable amount of memory can be used to store these old
images (25x80x2=4,000 bytes for the whole screen), and that much memory may not be
available in an 8086 small-model program. You don't want to convert the entire
program to the medium or compact model because this conversion will slow down all the
pointer accesses. It's possible, however, for a small-model program to allocate and use a
region of memory outside of the normal 64K data area (called the far heap). The
Microsoft _fmalloc() function gets memory from the far heap, and returns a 32-bit far
pointer. The _ffree() function puts memory back into the far heap. The far heap is
used for window images if USE_FAR_HEAP is defined above line 31 of Listing A.42 (or
with a -DUSE_FAR_HEAP command-line switch). The far keyword on line 32 tells the
Microsoft compiler to use a 32-bit pointer that holds both the segment and offset
components of the address when _fmalloc() is used to allocate memory.
The next .h file of interest is vbios.h, which is used locally by the video-BIOS
functions. It's in Listing A.43. Two interrupts are of interest here: the video interrupt,
whose number is defined on line five, and the keyboard interrupt defined on the next line.
The definitions on lines seven to 15 let you select a specific function: each interrupt can
do several things, and the function number determines which of these functions is
performed. All of the foregoing is described in great depth in Peter Norton's book if you
need more background information.
Listing A.43. vbios.h Video-BIOS Service Definitions

 1  #ifndef NORMAL
 2  #include <tools/termlib.h>
 3  #endif
 4
 5  #define VIDEO_INT    0x10   /* Video interrupt                  */
 6  #define KB_INT       0x16   /* Keyboard interrupt               */
 7  #define CUR_SIZE     0x1    /* Set cursor size                  */
 8  #define SET_POSN     0x2    /* Modify cursor posn               */
 9  #define READ_POSN    0x3    /* Read current cursor posn         */
10  #define SCROLL_UP    0x6    /* Scroll region of screen up       */
11  #define SCROLL_DOWN  0x7    /*    "      "    "    "    down    */
12  #define READ_CHAR    0x8    /* Read character from screen       */
13  #define WRITE        0x9    /* Write character                  */
14  #define WRITE_TTY    0xe    /* Write char & move cursor         */
15  #define GET_VMODE    0xf    /* Get video mode & disp pg         */

The video BIOS is accessed by _Vbios() in Listing A.44. The Microsoft int86()
function, a prototype for which is in <dos.h>, communicates with the BIOS using the normal
interrupt mechanism. It is passed an interrupt number and pointers to two REGS
structures that represent the 8086 register set. The first of these holds the desired contents of
the registers before the interrupt is executed; the second holds the values of the registers
after the interrupt returns. Most of the video-BIOS functions are actually macros that
evaluate to _Vbios() calls. These are also defined in <tools/vbios.h> and are shown in
Listing A.45.
The third .h file, video.h in Listing A.46, is used locally by the direct-video functions.
MONBASE and COLBASE (defined on lines one and two of Listing A.46) are the base
addresses of page 0 of the monochrome and color adapters. The addresses are in the
canonical segment/offset form used by most 8086 compilers: the high 16 bits are the
segment and the low 16 bits are the offset. The dimensions of the screen (in characters) are
controlled by NUMROWS and NUMCOLS on lines three and four of Listing A.46.
The CHARACTER structure defined on lines six to 11 of Listing A.46 describes a
single character/attribute pair. Using a structure to access the high byte is a better strategy
than a shift because it's more efficient in most compilers. A DISPLAY, defined on the
next line, is a 25x80 array of these character/attribute pairs.
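The relationship between the segment/offset form and the absolute addresses given earlier (0xb0000 and 0xb8000) follows from the 8086 rule that a physical address is the segment times 16 plus the offset. The function below is a sketch of that conversion; far_to_linear() is a hypothetical name and the value passed to it is just the 32-bit number, not a real far pointer.

```c
/* Convert a 32-bit segment:offset value (segment in the high 16 bits,
 * offset in the low 16 bits) to a 20-bit absolute address.
 */
unsigned long far_to_linear( unsigned long segoff )
{
    unsigned long segment = (segoff >> 16) & 0xffffUL;
    unsigned long offset  =  segoff        & 0xffffUL;

    return (segment << 4) + offset;
}
```

Applied to the two base addresses, 0xb0000000 becomes 0xb0000 and 0xb8000000 becomes 0xb8000.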
Listing A.44. vbios.c Video-BIOS Interface Function

#include <dos.h>
#include <tools/vbios.h>
#include "video.h"

/* This file contains a workhorse function used by the other video I/O
 * functions to talk to the BIOS. It executes the video interrupt.
 */

int _Vbios( service, al, bx, cx, dx, return_this )
int  service;               /* Service code, put into ah.                  */
int  al, bx, cx, dx;        /* Other input registers.                      */
char *return_this;          /* Register to return, "ax", "bh", "dx" only.  */
{
    union REGS regs;

    regs.h.ah = service;
    regs.h.al = al;
    regs.x.bx = bx;
    regs.x.cx = cx;
    regs.x.dx = dx;
    int86( VIDEO_INT, &regs, &regs );
    return ( *return_this == 'a' ) ? regs.x.ax :
           ( *return_this == 'b' ) ? regs.h.bh : regs.x.dx ;
}
The dv_Screen pointer declared on line 16 of Listing A.46 points at the base
address of the video memory for the current adapter (it's initialized for the monochrome
card but it can be changed at run time). The SCREEN macro on line 21 of Listing A.46
uses dv_Screen to randomly access a specific character or attribute on the screen. For
example, the following puts a high-intensity X in the lower-right corner of the screen:
    #include <termlib.h>
    #include <video.h>
    SCREEN[ 24 ][ 79 ].letter    = 'X';
    SCREEN[ 24 ][ 79 ].attribute = NORMAL | BOLD ;
The elaborate casts are needed because you can't cast a pointer into an array, but you can
cast it to an array pointer. Since dv_Screen is a pointer to an entire two-dimensional
array as compared to a pointer to the first element, the star in SCREEN gets you the array
itself. Looked at another way, a pointer to an entire array is special in two ways. First, if
you increment it, you'll skip past the entire array, not just one element. Second, the
pointer is treated internally as if it were a pointer to a pointer. That is, the array pointer
is treated internally as if it points at an array of pointers to rows, similar to argv. An
extra star is needed to select the correct row. This star doesn't change the value of the
pointer, only its type: dv_Screen and *dv_Screen both evaluate to the same
number, but the types are different. dv_Screen (without the star) is of type pointer to
two-dimensional array, the first element of which is a one-dimensional array of
CHARACTERs. *dv_Screen (with the star) is of type pointer to one-dimensional array (pointer
to row), the first element of which is a single CHARACTER. All this is discussed in
greater depth in Chapter 6.
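The two properties of a pointer to an entire array can be checked on any machine by pointing it at ordinary memory rather than at the video adapter. This is a sketch for experimentation, not part of video.h: Fake_video, Screen_p, put_cell(), and page_stride() are hypothetical names, and the far keyword is omitted so that the fragment compiles with a standard C compiler.

```c
#include <stddef.h>

#define NUMROWS 25
#define NUMCOLS 80

typedef struct
{
    unsigned char letter;
    unsigned char attribute;
} CHARACTER;

typedef CHARACTER DISPLAY[ NUMROWS ][ NUMCOLS ];

static DISPLAY  Fake_video[2];              /* two "pages" of ordinary memory */
static DISPLAY *Screen_p = &Fake_video[0];  /* plays the role of dv_Screen    */

#define SCREEN (*Screen_p)                  /* the array itself               */

/* Store a character/attribute pair through the pointer-to-array. */
void put_cell( int row, int col, int c, int attrib )
{
    SCREEN[row][col].letter    = (unsigned char)c;
    SCREEN[row][col].attribute = (unsigned char)attrib;
}

/* Number of bytes skipped when the pointer-to-array is incremented once. */
size_t page_stride( void )
{
    return (size_t)( (char *)(Screen_p + 1) - (char *)Screen_p );
}
```

Here page_stride() equals sizeof(DISPLAY), confirming that incrementing the pointer skips a whole page, and Screen_p and *Screen_p compare equal when both are converted to void pointers, even though their types differ.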
The definitions on lines 23 to 25 of Listing A.46 work just like the foregoing, except
that the screen is treated as an array of 16-bit numbers rather than as an array of
CHARACTERs. The dv_ functions are in Listings A.47 to A.67.
Listing A.45. vbios.h Macros That Implement Video-BIOS Functions
16 /*
These vi deo- B
17
*
18
*
VB__I NCHA
19
*
20
*
VB__ GE TP A GE
21
*
VB GETCUR
22
*
23
*
d
24
*
25
*
 * BIOS functions are implemented as macros:
 *
 * VB_INCHA        Returns the character and attribute ORed together.
 *                 Character in the low byte and attribute in the high byte.
 * VB_GETPAGE      Return the currently active display page number.
 * VB_GETCUR       Get current cursor position. The top byte of the return
 *                 value holds the row, the bottom byte the column. Pagenum is
 *                 the video page number. Note that VB_GETPAGE() will mess up
 *                 the fields in the structure so it must be called first.
 * VB_CURSIZE      Change the cursor shape to go from the top to the bottom
 *                 scan line.
 * VB_OUTCHA       Write a character and attribute without moving the cursor.
 *                 The attribute is in c's high byte, the character is the
 *                 low byte.
 * VB_REPLACE      Same as VB_OUTCHA but uses the existing attribute byte.
 * VB_SETCUR       Modify current cursor position. The top byte of "posn" value
 *                 holds the row (y), the bottom byte, the column (x). The
 *                 top-left corner of the screen is (0,0). Pagenum is the
 *                 video-display-page number.
 * VB_CTOYX(y,x)   Like VB_SETCUR but y and x coordinates are used.
 * VB_SCROLL       Scroll the indicated region on the screen. If amt is <0,
 *                 scroll down; otherwise, scroll up.
 * VB_CLRS         Clear the entire screen.
 * VB_CLR_REGION   Clear a region of the screen.
 * VB_BLOCKCUR     Change to a block cursor.
 * VB_NORMALCUR    Change to an underline cursor.
 * VB_PUTCHAR      Like vb_putc, but uses white on black for the attribute.
 */

#define VB_GETPAGE()      _Vbios( GET_VMODE, 0, 0, 0, 0, "bh" )
#define VB_INCHA()        _Vbios( READ_CHAR, 0, VB_GETPAGE(), 0, 0, "ax" )
#define VB_GETCUR()       _Vbios( READ_POSN, 0, VB_GETPAGE(), 0, 0, "dx" )
#define VB_CURSIZE(t,b)   _Vbios( CUR_SIZE, 0, 0, ((t) << 8) | (b), 0, "ax" )
#define VB_OUTCHA(c)      _Vbios( WRITE, (c) & 0xff, ((c) >> 8) & 0xff, 1, 0, "ax" )
#define VB_REPLACE(c)     VB_OUTCHA( (c & 0xff) | (VB_INCHA() & ~0xff) )
#define VB_SETCUR(posn)   _Vbios( SET_POSN, 0, VB_GETPAGE() << 8, 0, (posn), "ax" )
#define VB_CTOYX(y,x)     VB_SETCUR( ((y) << 8) | ((x) & 0xff) )
#define VB_SCROLL(xl, xr, yt, yb, amt, attr) _Vbios( \
                    ((amt) < 0) ? SCROLL_DOWN : SCROLL_UP, \
                    abs(amt), (attr) << 8, ((yt) << 8) | (xl), \
                    ((yb) << 8) | (xr), "ax" \
                    )

#define VB_CLRS(at)                 VB_SCROLL( 0, 79, 0, 24, 25, (at) )
#define VB_CLR_REGION(l,r,t,b,at)   VB_SCROLL( (l), (r), (t), (b), ((b)-(t))+1, (at) )

#define VB_BLOCKCUR()     VB_CURSIZE( 0, vb_iscolor() ? 7 : 12 )
#define VB_NORMALCUR()    ( vb_iscolor() ? VB_CURSIZE(6,7) : VB_CURSIZE(11,12) )
#define VB_PUTCHAR(c)     vb_putc( (c), NORMAL )
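Several of the macros above pack two 8-bit fields into one 16-bit word: VB_CTOYX() puts the row in the high byte and the column in the low byte, and VB_OUTCHA() expects the attribute in the high byte and the character in the low byte. The following is a minimal, portable sketch of that packing only. It makes no BIOS calls, and the helper names (pack_posn, posn_row, posn_col, pack_cell) are hypothetical; they are not part of vbios.h.

```c
/* Hypothetical helpers illustrating the byte packing used by the
 * VB_ macros. None of these names appear in vbios.h.
 */

/* Row in the high byte, column in the low byte (as in VB_CTOYX). */
static unsigned pack_posn( unsigned row, unsigned col )
{
    return ((row & 0xff) << 8) | (col & 0xff);
}

/* Recover the two fields (the same arithmetic vb_getyx() applies). */
static unsigned posn_row( unsigned posn ) { return (posn >> 8) & 0xff; }
static unsigned posn_col( unsigned posn ) { return  posn       & 0xff; }

/* Attribute in the high byte, character in the low byte (as VB_OUTCHA
 * expects its argument).
 */
static unsigned pack_cell( unsigned ch, unsigned attrib )
{
    return ((attrib & 0xff) << 8) | (ch & 0xff);
}
```

Under these assumptions, pack_posn(3, 10) yields 0x030a, and pack_cell('A', 0x07) yields 0x0741 on an ASCII system.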
Listing A.46. video.h Definitions for Direct-Video Functions

#define MONBASE  (DISPLAY far *) 0xb0000000
#define COLBASE  (DISPLAY far *) 0xb8000000
#define NUMROWS  25
#define NUMCOLS  80

typedef struct
{
    unsigned char letter;
    unsigned char attribute;
} CHARACTER;

typedef CHARACTER DISPLAY[ NUMROWS ][ NUMCOLS ];

#ifdef ALLOC
    DISPLAY far *dv_Screen = (DISPLAY far *) MONBASE;
#else
    extern DISPLAY far *dv_Screen;
#endif

#define SCREEN   (* dv_Screen)

typedef short       CHAR_ATTRIB;
typedef CHAR_ATTRIB VDISPLAY[ NUMROWS ][ NUMCOLS ];

#define VSCREEN  (* (VDISPLAY far *)dv_Screen )
Listing A.47. dv_clr_r.c Clear Region of Screen (Direct Video)

#include "video.h"

void dv_clr_region( l, r, t, b, attrib )
{
    int ysize, xsize, x, y;

    xsize = (r - l) + 1;            /* horizontal size of region */
    ysize = (b - t) + 1;            /* vertical size of region   */

    for( y = ysize; --y >= 0 ; )
        for( x = xsize; --x >= 0 ; )
        {
            SCREEN[ y + t ][ x + l ].letter    = ' ';
            SCREEN[ y + t ][ x + l ].attribute = attrib;
        }
}
Listing A.48. dv_clrs.c Clear Entire Screen (Direct Video)

#include "video.h"

void dv_clrs( attrib )
{
    /* Clear the screen. The cursor is not moved. */

    CHARACTER far *p = (CHARACTER far *)( dv_Screen );
    register int  i;

    for( i = NUMROWS * NUMCOLS; --i >= 0 ; )
    {
        (p  )->letter    = ' ';
        (p++)->attribute = attrib;
    }
}
Listing A.49. dv_frees.c Free an SBUF (Direct Video)

#include <stdlib.h>
#include <tools/termlib.h>
#include "video.h"

/* Free an SBUF as is allocated by vb_save(). */

void dv_freesbuf( p )
SBUF *p;
{
    IFREE( p->image );
    free ( p );
}
Listing A.50. dv_init.c Initialize Direct-Video Functions

#include "video.h"

int dv_init()
{
    int i;

    if( (i = vb_iscolor()) >= 0 )
        dv_Screen = i ? COLBASE : MONBASE;

    return( i != -1 );
}
Listing A.51. dv_print.c A Direct-Video printf()

#include <tools/debug.h>    /* For VA_LIST definition */
#include "video.h"

void dv_printf( attrib, fmt, VA_LIST )
int  attrib;
char *fmt;
{
    /* Direct-video printf, characters will have the indicated attributes. */
    /* prnt() is in curses.lib, which must be linked to your program if you */
    /* use the current function.                                            */

    extern  dv_putc();
    va_list args;

    va_start( args, fmt );
    prnt    ( dv_putc, attrib, fmt, args );
    va_end  ( args );
}
Listing A.52. dv_putc.c Write Character to Screen with Attribute (Direct Video)

#include "video.h"
#include <tools/vbios.h>

static int Row = 0;         /* Cursor Row    */
static int Col = 0;         /* Cursor Column */

/*----------------------------------------------------------------------
 * Move cursor to (Row,Col)
 */

#define fix_cur() _Vbios( SET_POSN, 0, 0, 0, (Row << 8) | Col, "ax" )

/*----------------------------------------------------------------------*/

void dv_putc( c, attrib )
{
    /* Write a single character to the screen with the indicated attribute.
     * The following are special:
     *
     *  \0   ignored
     *  \f   clear screen and home cursor
     *  \n   to left edge of next line
     *  \r   to left edge of current line
     *  \b   one character to left (non-destructive)
     *
     * The screen will scroll up if you go past the bottom line. Characters
     * that go beyond the end of the current line wrap around to the next line.
     */

    switch( c )
    {
    case 0:     break;                      /* Ignore ASCII NULL's */

    case '\f':  dv_clrs( attrib );
                Row = Col = 0;
                break;

    case '\n':  if( ++Row >= NUMROWS )
                {
                    dv_scroll_line( 0, 79, 0, 24, 'u', NORMAL );
                    Row = NUMROWS-1;
                }
                                            /* Fall through to '\r' */
    case '\r':  Col = 0;
                break;

    case '\b':  if( --Col < 0 )
                    Col = 0;
                break;

    default:    SCREEN[ Row ][ Col ].letter    = c;
                SCREEN[ Row ][ Col ].attribute = attrib;
                if( ++Col >= NUMCOLS )
                {
                    Col = 0;
                    if( ++Row >= NUMROWS )
                    {
                        dv_scroll_line( 0, 79, 0, 24, 'u', NORMAL );
                        Row = NUMROWS-1;
                    }
                }
                break;
    }
    fix_cur();
}

/*----------------------------------------------------------------------*/

void dv_ctoyx( y, x )
{
    /* Position the cursor at the indicated row and column */
    Row = y;
    Col = x;
    fix_cur();
}

/*----------------------------------------------------------------------*/

void dv_getyx( rowp, colp )
int *rowp, *colp;
{
    /* Modify *rowp and *colp to hold the cursor position. */
    *rowp = Row;
    *colp = Col;
}

/*----------------------------------------------------------------------*/

int dv_incha()              /* Get character & attribute from screen. */
{
    return (int) VSCREEN[ Row ][ Col ];
}

void dv_outcha( c )         /* Write char. & attrib. w/o moving cursor. */
{
    VSCREEN[ Row ][ Col ] = c;
}

void dv_replace( c )        /* Write char. only w/o moving cursor */
{
    SCREEN[ Row ][ Col ].letter = c;
}
Listing A.53. dv_putch.c Write Character to Screen, Normal Attribute (Direct Video)

#include "video.h"
#include <tools/termlib.h>

void dv_putchar( c )
{
    dv_putc( c & 0xff, NORMAL );
}
Listing A.54. dv_puts.c Write String to Screen, Normal Attribute (Direct Video)

#include "video.h"
#include <tools/termlib.h>

dv_puts( str, move )
char *str;
{
    /* Write string to screen, moving cursor to end of string only if move is
     * true. Use normal attributes.
     */

    int orow, ocol;

    dv_getyx( &orow, &ocol );

    while( *str )
        dv_putc( *str++, NORMAL );

    if( !move )
        dv_ctoyx( orow, ocol );
}
Listing A.55. dv_putsa.c Write String to Screen, with Attribute (Direct Video)

#include "video.h"

void dv_putsa( str, attrib )
register char *str;
register int  attrib;
{
    /* Write string to screen, giving characters the indicated attributes. */

    while( *str )
        dv_putc( *str++, attrib );
}
Listing A.56. dv_resto.c Restore Saved Region (Direct Video)

#include "video.h"
#include <tools/termlib.h>

SBUF *dv_restore( sbuf )
SBUF *sbuf;
{
    /* Restore a region saved with a previous dv_save() call. The cursor is
     * not modified. Note that the memory used by sbuf is not freed, you must
     * do that yourself with a dv_freesbuf(sbuf) call.
     */

    int    ysize, xsize, x, y;
    IMAGEP p;

    xsize = ( sbuf->right  - sbuf->left ) + 1;
    ysize = ( sbuf->bottom - sbuf->top  ) + 1;
    p     = sbuf->image;

    for( y = 0; y < ysize ; ++y )
        for( x = 0; x < xsize ; ++x )
            VSCREEN[ y + sbuf->top ][ x + sbuf->left ] = *p++;

    return sbuf;
}
Listing A.57. dv_save.c Save Region (Direct Video)

#include <stdio.h>
#include <stdlib.h>
#include "video.h"
#include <tools/termlib.h>

SBUF *dv_save( l, r, t, b )
{
    /* Save all characters and attributes in indicated region. Return a
     * pointer to a save buffer. The cursor is not modified. Note that the
     * save buffer can be allocated from the far heap, but the SBUF itself is
     * not. See also, dv_restore() and dv_freesbuf();
     */

    int    ysize, xsize, x, y;
    IMAGEP p;
    SBUF   *sbuf;

    xsize = (r - l) + 1;
    ysize = (b - t) + 1;

    if( !(sbuf = (SBUF *) malloc(sizeof(SBUF)) ) )
    {
        fprintf( stderr, "Internal error (dv_save): No memory for SBUF." );
        exit( 1 );
    }
    if( !(p = (IMAGEP) IMALLOC(xsize * ysize * sizeof(WORD)) ) )
    {
        fprintf( stderr, "Internal error (dv_save): No memory for image" );
        exit( 2 );
    }

    sbuf->left   = l;
    sbuf->right  = r;
    sbuf->top    = t;
    sbuf->bottom = b;
    sbuf->image  = p;

    for( y = 0; y < ysize ; ++y )
        for( x = 0; x < xsize ; ++x )
            *p++ = VSCREEN[ y + t ][ x + l ];

    return sbuf;
}
Listing A.58. dv_scree.c Direct-Video Variable Allocation

#define ALLOC
#include "video.h"

/* Allocates space for the variables
 * defined in video.h
 */
Listing A.59. dv_scrol.c Scroll Region of Screen (Direct Video)

#include "video.h"

static void cpy_row( dest_row, src_row, left_col, right_col )
{
    /* Copy all characters between left_col and right_col (inclusive)
     * from src_row to the equivalent position in dest_row.
     */

    CHARACTER far *s;
    CHARACTER far *d;

    d = & SCREEN[ dest_row ][ left_col ];
    s = & SCREEN[ src_row  ][ left_col ];

    while( left_col++ <= right_col )
        *d++ = *s++;
}

/*----------------------------------------------------------------------*/

static void cpy_col( dest_col, src_col, top_row, bot_row )
{
    /* Copy all characters between top_row and bot_row (inclusive)
     * from src_col to the equivalent position in dest_col.
     */

    CHARACTER far *s = & SCREEN[ top_row ][ src_col  ];
    CHARACTER far *d = & SCREEN[ top_row ][ dest_col ];

    while( top_row++ <= bot_row )
    {
        *d = *s;
        d += NUMCOLS;
        s += NUMCOLS;
    }
}

/*----------------------------------------------------------------------*/

static void clr_row( row, attrib, left_col, right_col )
{
    /* Clear all characters in the indicated row that are between left_col and
     * right_col (inclusive).
     */

    CHARACTER far *p = & SCREEN[ row ][ left_col ];

    while( left_col++ <= right_col )
    {
        (p  )->letter    = ' ';
        (p++)->attribute = attrib;
    }
}

/*----------------------------------------------------------------------*/

static void clr_col( col, attrib, top_row, bot_row )
{
    /* Clear all characters in the indicated column that are between top_row
     * and bot_row (inclusive).
     */

    CHARACTER far *p = & SCREEN[ top_row ][ col ];

    while( top_row++ <= bot_row )
    {
        p->letter    = ' ';
        p->attribute = attrib;
        p += NUMCOLS;
    }
}

/*======================================================================
 * Externally accessible functions:
 *======================================================================
 */

void dv_scroll_line( x_left, x_right, y_top, y_bottom, dir, attrib )
{
    /* Scroll the window located at:
     *
     *  (y_top, x_left)
     *      +------------------+
     *      |                  |
     *      +------------------+
     *                     (y_bottom, x_right)
     *
     * Dir is one of: 'u', 'd', 'l', or 'r' for up, down, left, or right.
     * The cursor is not moved. The opened line is filled with space characters
     * having the indicated attribute.
     */

    int i;
    CHARACTER far *p;

    if( dir == 'u' )
    {
        for( i = y_top; i < y_bottom ; i++ )
            cpy_row( i, i+1, x_left, x_right );
        clr_row( y_bottom, attrib, x_left, x_right );
    }
    else if( dir == 'd' )
    {
        for( i = y_bottom; --i >= y_top ; )
            cpy_row( i+1, i, x_left, x_right );
        clr_row( y_top, attrib, x_left, x_right );
    }
    else if( dir == 'l' )
    {
        for( i = x_left; i < x_right ; i++ )
            cpy_col( i, i+1, y_top, y_bottom );
        clr_col( x_right, attrib, y_top, y_bottom );
    }
    else /* dir == 'r' */
    {
        for( i = x_right; --i >= x_left ; )
            cpy_col( i+1, i, y_top, y_bottom );
        clr_col( x_left, attrib, y_top, y_bottom );
    }
}

/*----------------------------------------------------------------------*/

void dv_scroll( x_left, x_right, y_top, y_bottom, amt, attrib )
{
    /* Scroll the screen up or down by the indicated amount. Negative
     * amounts scroll down.
     */

    int dir = 'u';

    if( amt < 0 )
    {
        amt = -amt;
        dir = 'd';
    }
    while( --amt >= 0 )
        dv_scroll_line( x_left, x_right, y_top, y_bottom, dir, attrib );
}
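The scroll-up case of dv_scroll_line() can be tried out away from video memory. The following is a minimal sketch that applies the same row-by-row copy to an ordinary array, assuming a tiny 4x6 "screen" so the result is easy to check by hand. The names CELL, ROWS, COLS, and scroll_up are hypothetical stand-ins for CHARACTER, NUMROWS, NUMCOLS, and the 'u' branch of the listing; no far pointers are involved.

```c
#define ROWS 4          /* tiny stand-in for NUMROWS */
#define COLS 6          /* tiny stand-in for NUMCOLS */

/* Hypothetical cell type with the same two fields as CHARACTER. */
typedef struct
{
    unsigned char letter;
    unsigned char attribute;
} CELL;

/* Scroll the region bounded by (top,left) and (bottom,right), inclusive,
 * up one line, then blank the bottom line of the region using the
 * indicated attribute. Cells outside the region are not touched.
 */
static void scroll_up( CELL screen[ROWS][COLS],
                       int left, int right, int top, int bottom,
                       unsigned char attrib )
{
    int x, y;

    for( y = top; y < bottom; ++y )         /* row y gets row y+1 */
        for( x = left; x <= right; ++x )
            screen[y][x] = screen[y+1][x];

    for( x = left; x <= right; ++x )        /* clear the opened line */
    {
        screen[bottom][x].letter    = ' ';
        screen[bottom][x].attribute = attrib;
    }
}
```

Copying from the top down is what makes the in-place move safe when scrolling up; scrolling down has to walk the rows in the opposite order, as the 'd' branch of the listing does.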
Listing A.60. vb_frees.c Free Save Buffer (Video BIOS)

#include <stdlib.h>
#include <tools/termlib.h>

void vb_freesbuf( p )
SBUF *p;
{
    /* Free an SBUF as is allocated by vb_save(). */

    IFREE( p->image );
    free ( p );
}
Listing A.61. vb_getch.c Get Character from Keyboard (Keyboard BIOS)

#include <dos.h>
#include <tools/vbios.h>

int vb_getchar()
{
    /* Get a character directly from the keyboard */

    union REGS regs;
    regs.h.ah = 0;
    int86( KB_INT, &regs, &regs );
    return( (int) regs.x.ax );
}
Listing A.62. vb_getyx.c Get Cursor Position (Video BIOS)

#include <tools/vbios.h>

void vb_getyx( yp, xp )
int *yp, *xp;
{
    register int posn;

    posn = VB_GETCUR();
    *xp  = posn & 0xff;
    *yp  = (posn >> 8) & 0xff;
}
Listing A.63. vb_iscol.c Test for Color Card (Video BIOS)

#include <tools/vbios.h>

int vb_iscolor()        /* Returns true if a color card is active */
{
    int mode = _Vbios( GET_VMODE, 0, 0, 0, 0, "ax" ) & 0xff;

    return( (mode == 7)              ?  0 :
            (mode == 2 || mode == 3) ?  1 : -1 );
}
Listing A.64. vb_putc.c Write Character in TTY Mode (Video BIOS)

#include <tools/termlib.h>
#include <tools/vbios.h>

void vb_putc( c, attrib )
{
    /* Write a character to the screen in TTY mode. Only normal printing
     * characters, BS, BEL, CR and LF are recognized. The cursor is automatic-
     * ally advanced and lines will wrap. The WRITE_TTY BIOS service doesn't
     * handle attributes correctly, so printing characters have to be output
     * twice: once by VB_OUTCHA to set the attribute bit, and again using
     * WRITE_TTY to move the cursor. WRITE_TTY picks up the existing attribute.
     */

    if( c != '\b' && c != '\007' && c != '\r' && c != '\n' )
        VB_OUTCHA( (c & 0xff) | (attrib << 8) );

    _Vbios( WRITE_TTY, c, attrib & 0xff, 0, 0, "ax" );
}
Listing A.65. vb_puts.c Write String in TTY Mode (Video BIOS)

#include <tools/vbios.h>

void vb_puts( str, move_cur )
register char *str;
{
    /* Write a string to the screen in TTY mode. If move_cur is true the cursor
     * is left at the end of string. If not the cursor will be restored to its
     * original position (before the write).
     */

    int posn;

    if( !move_cur )
        posn = VB_GETCUR();

    while( *str )
        VB_PUTCHAR( *str++ );

    if( !move_cur )
        VB_SETCUR( posn );
}
Listing A.66. vb_resto.c Restore Saved Region (Video BIOS)

#include <tools/termlib.h>
#include <tools/vbios.h>

SBUF *vb_restore( sbuf )
SBUF *sbuf;
{
    /* Restore a region saved with a previous vb_save() call. The cursor is
     * not modified. Note that the memory used by sbuf is not freed, you must
     * do that yourself with a vb_freesbuf(sbuf) call.
     */

    int    ysize, xsize, x, y;
    IMAGEP p;

    xsize = ( sbuf->right  - sbuf->left ) + 1;
    ysize = ( sbuf->bottom - sbuf->top  ) + 1;
    p     = sbuf->image;

    for( y = 0; y < ysize ; ++y )
        for( x = 0; x < xsize ; ++x )
        {
            VB_CTOYX ( y + sbuf->top, x + sbuf->left );
            VB_OUTCHA( *p );
            ++p;        /* VB_OUTCHA has side effects so can't use *p++ */
        }

    return sbuf;
}
Listing A.67. vb_save.c Save Region (Video BIOS)

#include <stdio.h>
#include <stdlib.h>
#include <tools/termlib.h>      /* must be included first */
#include <tools/vbios.h>

SBUF *vb_save( l, r, t, b )
{
    /* Save all characters and attributes in indicated region. Return a pointer
     * to a save buffer. The cursor is not modified. Note that the save buffer
     * is allocated from the far heap but the sbuf itself is not. See also,
     * vb_restore() and vb_freesbuf();
     */

    int    ysize, xsize, x, y;
    IMAGEP p;
    SBUF   *sbuf;

    xsize = (r - l) + 1;
    ysize = (b - t) + 1;

    if( !(sbuf = (SBUF *) malloc(sizeof(SBUF)) ) )
    {
        fprintf( stderr, "Internal error (vb_save): No memory for SBUF." );
        exit( 1 );
    }

    if( !(p = (IMAGEP) IMALLOC(xsize * ysize * sizeof(WORD)) ) )
    {
        fprintf( stderr, "Internal error (vb_save): No memory for image." );
        exit( 2 );
    }

    sbuf->left   = l;
    sbuf->right  = r;
    sbuf->top    = t;
    sbuf->bottom = b;
    sbuf->image  = p;

    for( y = 0; y < ysize ; ++y )
        for( x = 0; x < xsize ; ++x )
        {
            VB_CTOYX( y + t, x + l );
            *p++ = VB_INCHA();
        }

    return sbuf;
}
A.10 Low-Level I/O, Glue Functions
Glue functions are intermediate-level functions that allow your program to access
low-level functions in a portable way. In the current context, the glue functions let you
access either the direct-video or video-BIOS low-level-I/O functions without having to
change the function calls in the program itself. If your program uses these glue functions
instead of the actual I/O subroutines, you can choose which set of functions to use by
linking either direct-video or BIOS libraries to your final program, without having to
change the program itself. The glue-to-I/O-function mapping is summarized in Table
A.3.
Table A.3. Mapping Glue Routines to I/O Functions

                                           Mapped to This Function
Glue Function†                             Direct-video mode   BIOS mode
void clr_region(l, r, t, b, attrib)        dv_clr_region       VB_CLR_REGION
void cmove(y, x)                           dv_ctoyx            VB_CTOYX
void curpos(int *yp, int *xp);             dv_getyx            vb_getyx
void doscroll(l, r, t, b, a, at)           dv_scroll           VB_SCROLL
void freescr(SBUF *p);                     dv_freesbuf         vb_freesbuf
int  incha()                               dv_incha            VB_INCHA
int  inchar()                              dv_incha            VB_INCHA
void outc(c, attrib)                       dv_putc             vb_putc
void replace(c);                           dv_replace          VB_REPLACE
SBUF *restore(SBUF *b);                    dv_restore          vb_restore
SBUF *savescr(l, r, t, b)                  dv_save             vb_save

† Unspecified argument types are int.
The glue functions are defined in glue.c, Listing A.68. One of three sets of functions
is compiled, depending on which of the following macros is #defined:

If this macro
is #defined      Use this interface
R                Video-BIOS.
V                Direct-video.
A                Select automatically at run time (autoselect).
You can compile glue.c three times, once with R defined, once with V defined, and once
with neither defined. The version of glue.obj that you link then determines which set of
I/O functions is pulled into the final program.
In autoselect mode, the glue functions select one or the other of the sets of I/O
functions at run time. The init() function is used for this purpose. Normally, when R
or V is #defined at compile time, init() just initializes for the required mode,
returning 0 if the BIOS is used (if R is defined) or 1 if direct-video functions are used.
If A is #defined at compile time, the init() function selects either the BIOS or
direct-video routines at run time. An init(0) call forces use of the BIOS, init(1)
forces use of direct-video functions, and init(-1) causes init() to examine the VIDEO
environment variable and choose according to its contents: if VIDEO is set to "DIRECT"
or "direct", the direct-video functions are used; otherwise the video BIOS is used.
init() returns 0 if the BIOS is selected, 1 if the direct-video functions are selected.
The one disadvantage to autoselect mode is that both libraries must be in memory at run
time, even though you're using only one of them.
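The decision that init() makes in autoselect mode can be isolated from the video hardware. The following is a minimal sketch of that decision only, assuming the contents of the VIDEO environment variable are passed in as a string (or NULL if the variable is not set). The function name choose_interface is hypothetical; the real init() also calls dv_init() and exits if no suitable adapter is found.

```c
#include <string.h>

/* Hypothetical stand-in for the choice made by init() in autoselect mode.
 *
 *   how ==  0   force the video BIOS
 *   how ==  1   force direct video
 *   how == -1   decide from the VIDEO environment string
 *
 * Returns 0 if the BIOS functions are selected, 1 if the direct-video
 * functions are selected.
 */
static int choose_interface( int how, const char *video )
{
    if( how >= 0 )
        return how != 0;

    return video != NULL
        && ( strcmp(video, "DIRECT") == 0 || strcmp(video, "direct") == 0 );
}
```

A caller would typically pass getenv("VIDEO") as the second argument. Note that only the two spellings named in the text are recognized, so a value such as "Direct" falls through to the BIOS.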
Listing A.68. glue.c Glue Functions for Low-Level I/O

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tools/termlib.h>
#include <tools/vbios.h>

/* GLUE.C   This file glues the curses package to either the video
 *          BIOS functions or to the direct-video functions.
 */

#if ( !defined(R) && !defined(V) )
#    define AUTOSELECT
#endif

/*----------------------------------------------------------------------*/
#ifdef R

init      ( how )                  { return 0;                                     }
cmove     ( y, x )                 { return VB_CTOYX     ( y, x );                 }
curpos    ( yp, xp ) int *yp, *xp; { return vb_getyx     ( yp, xp );               }
replace   ( c )                    { return VB_REPLACE   ( c );                    }
doscroll  ( l, r, t, b, a, at )    { return VB_SCROLL    ( l, r, t, b, a, at );    }
inchar    ()                       { return VB_INCHA     ( ) & 0xff;               }
incha     ()                       { return VB_INCHA     ( );                      }
outc      ( c, attrib )            { return vb_putc      ( c, attrib );            }
SBUF *savescr( l, r, t, b )        { return vb_save      ( l, r, t, b );           }
SBUF *restore( b ) SBUF *b;        { return vb_restore   ( b );                    }
freescr   ( p ) SBUF *p;           { return vb_freesbuf  ( p );                    }
clr_region( l, r, t, b, attrib )   { return VB_CLR_REGION( l, r, t, b, attrib );   }

int is_direct() { return 0; };
#endif

/*----------------------------------------------------------------------*/
#ifdef V

cmove     ( y, x )                 { return dv_ctoyx     ( y, x );                 }
curpos    ( yp, xp ) int *yp, *xp; { return dv_getyx     ( yp, xp );               }
replace   ( c )                    { return dv_replace   ( c );                    }
doscroll  ( l, r, t, b, a, at )    { return dv_scroll    ( l, r, t, b, a, at );    }
inchar    ()                       { return dv_incha     ( ) & 0xff;               }
incha     ()                       { return dv_incha     ( );                      }
outc      ( c, attrib )            { return dv_putc      ( c, attrib );            }
SBUF *savescr( l, r, t, b )        { return dv_save      ( l, r, t, b );           }
SBUF *restore( b ) SBUF *b;        { return dv_restore   ( b );                    }
freescr   ( p ) SBUF *p;           { return dv_freesbuf  ( p );                    }
clr_region( l, r, t, b, attrib )   { return dv_clr_region( l, r, t, b, attrib );   }

init( how )                         /* Initialize */
{
    if( !dv_init() )
    {
        fprintf( stderr, "MGA or CGA in 80-column text mode required\n" );
        exit( 1 );
    }
    return 1;
}

int is_direct() { return 1; };
#endif

/*----------------------------------------------------------------------*/
#if defined(A) || defined(AUTOSELECT)   /* A, or neither R nor V: autoselect */

static int Dv = 0;

init( how )
int how;                    /* 0=BIOS, 1=direct video, -1=autoselect */
{
    char *p;

    if( Dv = how )
    {
        if( how < 0 )
        {
            p  = getenv( "VIDEO" );
            Dv = p && ( strcmp(p, "DIRECT") == 0 || strcmp(p, "direct") == 0 );
        }
        if( Dv && !dv_init() )
        {
            fprintf( stderr, "MGA or CGA in 80-column text mode required\n" );
            exit( 1 );
        }
    }
    return Dv;
}

/* The following statements all depend on the fact that a subroutine name
 * (without the trailing argument list and parentheses) evaluates to a temporary
 * variable of type pointer-to-function. Therefore, function names can be treated
 * as a pointer to a function in the conditionals, below. For example, curpos()
 * calls dv_getyx if Dv is true, otherwise vb_getyx is called. ANSI actually
 * says that the star isn't necessary. The following should be legal:
 *
 *      return (Dv ? dv_ctoyx : VB_CTOYX)( y, x );
 *
 * but many compilers (including Microsoft) don't accept it. Macros, clearly,
 * must be treated in a more standard fashion.
 */

curpos ( yp, xp ) int *yp, *xp; { return (Dv ? *dv_getyx    : *vb_getyx   )( yp, xp );     }
SBUF *savescr( l, r, t, b )     { return (Dv ? *dv_save     : *vb_save    )( l, r, t, b ); }
SBUF *restore( b ) SBUF *b;     { return (Dv ? *dv_restore  : *vb_restore )( b );          }
freescr( p ) SBUF *p;           { return (Dv ? *dv_freesbuf : *vb_freesbuf)( p );          }
outc   ( c, attrib )            { return (Dv ? *dv_putc     : *vb_putc    )( c, attrib );  }

replace( c )    { if( Dv ) dv_replace( c );  else VB_REPLACE( c );  }
cmove  ( y, x ) { if( Dv ) dv_ctoyx( y, x ); else VB_CTOYX( y, x ); }
inchar ()       { return ( Dv ? dv_incha() : VB_INCHA() ) & 0xff;   }
incha  ()       { return ( Dv ? dv_incha() : VB_INCHA() );          }

doscroll( l, r, t, b, a, at )
{
    if( Dv ) dv_scroll( l, r, t, b, a, at );
    else     VB_SCROLL( l, r, t, b, a, at );
}

clr_region( l, r, t, b, attrib )
{
    if( Dv ) dv_clr_region( l, r, t, b, attrib );
    else     VB_CLR_REGION( l, r, t, b, attrib );
}

int is_direct() { return Dv; };
#endif
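The comment in the autoselect section relies on a function name evaluating to a pointer to that function, so that a conditional expression can pick one of two functions and then call the result. The following is a minimal sketch of that idiom in ANSI C, without the explicit star. The two back-end functions are hypothetical stand-ins that return fixed values; they are not the real dv_incha() and VB_INCHA().

```c
static int Dv = 0;      /* nonzero selects the "direct" back end */

/* Hypothetical stand-ins for the two low-level implementations. */
static int direct_incha( void ) { return 0x0741; }
static int bios_incha  ( void ) { return 0x0742; }

/* Both operands of ?: are function names, which are converted to
 * pointers to functions, so the result of the conditional is itself a
 * pointer that can be called directly.
 */
int incha_demo( void )
{
    return ( Dv ? direct_incha : bios_incha )();
}
```

This works only when both alternatives are real functions with compatible types. A macro such as VB_INCHA() has no address, which is why the listing falls back on an ordinary if/else for the macro-based routines.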
A.11 Window Management: Curses
The low-level I/O functions are really too low level to be useful; higher-level functions
are needed. Since I write a lot of code that has to work in both the MS-DOS and
UNIX environments, it seemed sensible to mimic a set of UNIX functions under MS-DOS.
Among other things, this approach solves various incompatibility problems between the
two systems. For example, the Microsoft MS-DOS C compiler doesn't have UNIX-compatible
fcntl() or ioctl() functions, it doesn't support the /dev/tty device for
the console (you have to use /dev/con or CON), and it doesn't provide any sort of
termcap or curses-compatible function library. Consequently, it's difficult to port code
that does low-level I/O from the Microsoft compiler to UNIX.
The most serious omission in the earlier list is the lack of a curses library. For the
uninitiated, curses is a collection of terminal-independent, low-level I/O functions. The
package was written by Ken Arnold, then at U.C. Berkeley. These routines let you do
things like move the cursor around on the screen, create and delete windows, write text,
seek to specific window-relative cursor positions, and so forth. The windows can be
overlapping and they support individual wrap-around, scrolling, and so forth. The curses
functions can talk to virtually any terminal. They accomplish this feat by using the
termcap terminal database and interface library developed by Bill Joy. The database is
an ASCII file that contains definitions for the various escape sequences needed to get
around on specific terminals, and the interface library lets you access these machine-specific
escape sequences in a portable and transparent way.
The real curses routines talk to the terminals efficiently over a serial line. They
always send the minimum number of characters necessary to modify the current screen.
Curses keeps two internal images of the screen. One of these reflects what's actually on
the screen, and the other is a scratch space that you modify using the various curses functions.
When you tell curses to do a refresh, it compares the scratch buffer with the actual
screen image and then sends out the minimum number of characters necessary to get
these images to match. This behavior is especially important when you're running a program
via a modem and characters are coming at 1200 baud. Redrawing the entire screen
every time you scroll a four-line by 10-character window is just unacceptable behavior.
It takes too long. Curses solves the problem by redrawing only those parts of the screen
that have actually changed.
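The two-image scheme can be illustrated with a few lines of C. The following is a minimal sketch, assuming a tiny 3x8 character screen; the names ROWS, COLS, and refresh_diff are hypothetical, and the function only counts the cells that differ. A real curses refresh also has to choose economical cursor-motion sequences for the terminal, which this sketch does not attempt.

```c
#define ROWS 3
#define COLS 8

/* Compare the scratch image with the image of what is actually on the
 * screen. Every cell that differs is copied into the actual image, and
 * the number of such cells is returned. In a real implementation the
 * marked spot is where the cursor would be moved and the character sent
 * to the terminal.
 */
static int refresh_diff( char actual[ROWS][COLS], char scratch[ROWS][COLS] )
{
    int x, y, sent = 0;

    for( y = 0; y < ROWS; ++y )
        for( x = 0; x < COLS; ++x )
            if( actual[y][x] != scratch[y][x] )
            {
                /* move to (y,x) and transmit scratch[y][x] here */
                actual[y][x] = scratch[y][x];
                ++sent;
            }

    return sent;
}
```

If two characters are written into an otherwise unchanged scratch buffer, the first call reports two cells to send and an immediate second call reports none, which is the property that keeps traffic low on a slow line.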
The UNIX curses functions are described in depth in [Arnold] and also in [Haviland].
Since the implementation details can vary somewhat from installation to installation, I
strongly recommend that you read your own system documentation as well.
The current section describes the behavior of my own curses implementation, which
occs uses for window management in its interactive debugging environment. My system
models Berkeley curses; there are a few minor differences between it and System V versions.
I've written several quite complex programs using these functions, programs that
maintain several windows on the screen simultaneously, all of which are being updated
at different rates. Moreover, the finished programs have ported to UNIX (BSD 4.3) with
literally no modification. I have not implemented the entire curses library, however, and
I've added a few features to my own package that are not supported by the real curses.
You can write a UNIX-compatible program using my functions, but there are minor (but
documented) differences that can cause problems if you're not careful. For example,
support for overlapping windows is limited (you can't write reliably to a window that has
a second window on top of it). Subwindows aren't supported. This last problem means
that you can't move a window along with the associated subwindows. Similarly, you
have to delete or update the subwindows explicitly, one at a time. Pseudo-subwindows
can be used, but only if you don't move them. My package does implement a few handy
features not found in the UNIX version, however. You can hide a window without deleting
it, and there is a provision for making boxed windows.
This section just describes my own functions. Though I point out the important
differences from UNIX, you'll have to read the UNIX documentation too if you intend to
write portable code. The two systems do behave differently in many ways. Note that,
since these routines are not intended to port to UNIX, I haven't done things like use the
UX() and VA_LIST macros in <debug.h> to make them portable.
A.11.1 Configuration and Compiling
The curses functions talk to the screen using the low-level I/O functions described in
the previous section. Curses automatically selects between the CGA and MGA adapters
by using the autoselect version of the glue functions. (The CGA must be running in an
80-column mode, though.) Curses normally uses the ROM-BIOS routines, which are
slow but portable. If, however, a VIDEO environment variable exists, and that variable is
set to the string DIRECT, the direct-video functions are used instead. (Do this with a
set VIDEO=DIRECT from the DOS prompt.) The BIOS functions are noticeably
slower than the direct-video functions. The speed problem is most evident when you are
saving and restoring the area under a window. Moving a visible window in an
incremental fashion (one space or line at a time) is a particularly painful process when
you're using the BIOS. It's better, in this instance, to hide the window, move it where
you want it, and then redisplay it.
The curses functions are all in a library called curses.lib, and the I/O routines are in
termlib.lib. Both of these must be linked to the final program.
A.11.2 Using Curses
Curses itself is part macros and part subroutines. The file <curses.h> should be
#included at the top of every file that uses the curses functions. Supported functions
are described in this section, grouped functionally.
A.11.2.1 Initialization Functions.
void initscr(void);
void endwin(void);
initscr() initializes the curses package. It should be called at the head of
your main() subroutine, before any other curses functions are called.
endwin() cleans up. It should always be called before your program exits.
In UNIX programs, the terminal can be left in an unknown state if you abort
your program with a BREAK. If you exit abnormally from a program that uses
curses, only to find your terminal acting funny (not echoing, not handling tabs or
newlines properly, and so forth), you can usually correct the problem by typing
tset with no arguments. If that doesn't work, try <NL>reset<NL> where <NL> is
a newline or Ctrl-J. If that doesn't work, try stty cooked echo nl, and if that
doesn't work, hang up and log on again. To avoid this sort of flailing around, it's
much better for your program to trap SIGINT and call endwin() from within
the service routine. Use the following:
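    #include <signal.h>
    #include <stdlib.h>
    #include <curses.h>

    #define STATUS 1            /* Exit status to report; pick your own value. */

    void onintr( int sig )      /* SIGINT service routine.                     */
    {
        endwin();               /* Restore the terminal before leaving.        */
        exit( STATUS );
    }

    int main( void )
    {
        signal( SIGINT, onintr );
        initscr();
        /* The rest of the program goes here. */
        endwin();
        return 0;
    }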
A.11.2.2 Configuration Functions. Once the curses package is initialized, you
should determine how your terminal is going to respond to typed characters. Six
routines are provided for this purpose:
int crmode(void);
int nocrmode(void);
These routines control input buffering; crmode() disables buffering, so characters
are available as soon as they're typed. A nocrmode() call cancels a previous
crmode(). In nocrmode, an entire line is read before the first character is
returned. The only editing character available in nocrmode is a backspace,
which deletes the character to the left of the cursor. Many curses programs use
crmode(), but some implementations won't work properly unless nocrmode()
is active.
int echo(void);
int noecho(void);
If echo() is called, characters are echoed as they're typed; noecho()
suppresses the echoing, so you'll have to echo the character yourself every time
you read a character. The real curses gets very confused when echo() is
enabled. The problem here is that curses doesn't know about any character that it
has not itself written to the screen. Since characters are echoed by the operating
system rather than curses, the package doesn't know that they're there. As a
consequence, when curses does a screen refresh, it won't delete the characters
that it doesn't know about and the screen rapidly fills with unwanted and
unerasable characters. It's best always to call noecho() at the top of your program.
Another echo-related problem is found in the MS-DOS version of curses. Character
echo cannot be suppressed with the MS-DOS buffered-input function, so echo()
and noecho() have no effect on a program if nocrmode() is active.
int nl(void);
int nonl(void);
A nl() call causes a newline ('\n') to be converted to a carriage-return,
line-feed sequence on output; an input carriage return ('\r') is also mapped to a
newline. If nonl() is called, no mapping is done. It's usually convenient to call
nl() at the top of your program, but again, many UNIX curses packages fail to
work properly unless nonl() is specified. You have to do all the '\n' to "\r\n"
mapping yourself in nonl mode, of course.
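The do-it-yourself mapping needed in nonl mode can be sketched as follows. This is only an illustration: the name expand_newlines() and its buffer-based interface are hypothetical and are not part of any curses package.

```c
#include <stddef.h>

/* Hypothetical helper (not part of curses): copy src into dst, converting
 * every '\n' into the two-character sequence "\r\n". At most dstsize-1
 * characters are stored and dst is always '\0' terminated (if dstsize > 0).
 * Returns the number of characters stored, not counting the '\0'.
 */
size_t expand_newlines( const char *src, char *dst, size_t dstsize )
{
    size_t n = 0;

    if( dstsize == 0 )
        return 0;

    for( ; *src ; ++src )
    {
        size_t need = (*src == '\n') ? 2 : 1;

        if( n + need >= dstsize )       /* Leave room for the '\0'. */
            break;

        if( *src == '\n' )
            dst[n++] = '\r';

        dst[n++] = *src;
    }

    dst[n] = '\0';
    return n;
}
```

The converted buffer can then be handed to a string-output function such as waddstr(). Truncating rather than splitting a "\r\n" pair keeps the output well formed when the buffer is too small.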
A.11.2.3 Creating and Deleting Windows. This section describes the functions
that create and delete windows. There are two kinds of windows: normal windows and
subwindows. In UNIX curses, a subwindow is affected by all commands that also affect
the parent. When a parent window is refreshed by curses, all subwindows are refreshed
too. Similarly, when you delete or move a parent window, all the subwindows are
deleted or moved. This feature is not supported in the current implementation, though
you can pretend that subwindows exist if you don't move them. The mechanics of this
process are discussed below.
A default window, called stdscr, which occupies the entire screen, is created
automatically by initscr(), and several functions are provided to modify this
window. A stdscr variable is defined in <curses.h>, and it can be passed as a WINDOW
pointer to any of the curses functions in a manner analogous to stdout. You shouldn't
declare stdscr explicitly in your program; just use it. For convenience, most functions
have a version that uses stdscr, rather than an explicit window, in a manner analogous
to putchar(). If you are using curses only for cursor movement and are not creating
additional windows, you can use stdscr directly and don't need to use any of the
functions described in this section.
In UNIX applications, it's often convenient to declare all windows as subwindows to
stdscr. For example, the save-screen mechanism used in yydebug.c won't work unless
all the windows are subwindows of stdscr because it reads characters from stdscr
itself.
WINDOW *newwin(int lines, int cols, int begin_y, int begin_x);
WINDOW *subwin(WINDOW *win, int lines, int cols, int begin_y, int begin_x);
newwin() creates a new window, lines rows high and cols columns wide,
with the upper-left corner at (begin_y, begin_x). [All coordinates here are
(y,x), where y is the row number and x is the column number. The upper-left
corner of the screen is (0,0).] A pointer to a WINDOW structure, declared in
<curses.h>, is returned in a manner analogous to fopen() returning a FILE pointer.
Windows are created as visible, unboxed, with scrolling disabled, and with
line wrap enabled. Characters that go past the end of line show up at the left edge
of the next line, but the window won't scroll when you get to the bottom. The
text under the window is saved by default and is restored when the window is
moved or deleted. The window is automatically cleared as part of the creation
process and there's no way to disable this clearing. All of these defaults can be
changed with subroutine calls discussed below.
The subwin() call is provided for UNIX compatibility. It is mapped to a
newwin() call (the first argument is ignored).
scrollok(WINDOW *win, int flag);
This macro is passed a WINDOW pointer and a flag. If flag is true, the indicated
window is allowed to scroll. Otherwise the window does not scroll and
characters that go off the bottom of the window are discarded. Scrolling is always
enabled on the stdscr window.
wrapok(WINDOW *win, int flag);
This nonstandard macro disables line wrap if flag is false and enables it otherwise.
Line wrap is enabled by default. When wrapping is disabled, characters written
past the edge of the window are discarded rather than appearing on the next line.
void boxed(void);
void unboxed(void);
The lack of support for subwindows considerably complicates the management of
windows with boxes around them. A UNIX-compatible method for creating boxed
windows is described below in the discussion of the box() function. It's not
practical to use this function if windows will be moved, however. I've solved the
problem to some extent by adding a nonstandard mechanism for creating boxed windows in
which the box is an integral part of the window and cannot be overwritten by the
text. All windows created with newwin() after boxed() has been called will
have an integral box as part of the window. You can go back to normal, unboxed
windows by calling unboxed(). IBM box-drawing characters are used for the
border.
void ground(WINDOW *win, int fore, int back);
void def_ground(int fore, int back);
ground() lets you change foreground (fore) and background (back) colors on
the IBM/PC. Both calls are nonstandard. The color codes are defined symbolically
in <tools/termlib.h>.
def_ground() changes the default foreground and background colors. All
windows created after a def_ground() call have the indicated colors. Use
ground() to selectively change the attribute associated with specific characters
written to a window; only those characters written to the window after the
ground() call are affected. If you want to change an existing window's color
without modifying the window's contents, you must change the color with
ground() and then read each character individually and write it back using
winch() and addch().
BUGS: delwin() doesn't know about color changes; you must set the ground
back to NORMAL and call wclear() before deleting the window.
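The two colors end up sharing a single attribute byte. The sketch below packs them using the standard IBM-PC text-mode layout (foreground in the low four bits, background in bits 4 to 6, blink in bit 7). The function names are hypothetical, and the masks that ground() itself applies may differ.

```c
/* Pack a foreground and a background color into one IBM-PC text-mode
 * attribute byte: bits 0-3 hold the foreground (bit 3 is intensity),
 * bits 4-6 hold the background, and bit 7 (blink) is left clear.
 * make_attrib() is a hypothetical name used only for this sketch.
 */
unsigned char make_attrib( int fore, int back )
{
    return (unsigned char)( (fore & 0x0f) | ((back & 0x07) << 4) );
}

/* Recover the two fields from an attribute byte. */
int attrib_fore( unsigned char attrib ) { return attrib & 0x0f;        }
int attrib_back( unsigned char attrib ) { return (attrib >> 4) & 0x07; }
```

With this layout, white (7) on blue (1) packs to 0x17.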
void save(void);
void nosave(void);
Normally, when you create a window, the text under that window is saved so that
it can be replaced when the window is moved or deleted. This is a needless waste
of memory if you're not going to delete or move the window. Once the nonstandard
nosave() function has been called, the underlying text is not saved by subsequent
newwin() calls. Saving can be reenabled by calling save(). Text under stdscr is
never saved.
delwin(WINDOW *win);
This function removes a window from the screen, usually restoring the text that
was underneath the window. If nosave() was active when the window was
created, however, an empty box is left on the screen when the window is deleted.
Overlapping windows must be deleted in the opposite order from which they
were created. Similarly, if you move one window on top of a second one, you
must delete the top window first. I'd suggest keeping a stack of active windows
so that you know the order in which they are created.
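A stack of active windows can be as simple as an array of pointers. The following is a minimal sketch, not code from the curses package: the names are hypothetical, the limit of 16 windows is arbitrary, and void pointers stand in for WINDOW pointers so that the fragment compiles without <curses.h>.

```c
#include <stddef.h>

#define MAX_WINDOWS 16          /* Arbitrary limit chosen for this sketch. */

static void *Win_stack[ MAX_WINDOWS ];  /* Windows, oldest first.  */
static int   Win_depth = 0;             /* Number of live windows. */

/* Remember a newly created window. Returns 1 on success, 0 if full. */
int push_window( void *win )
{
    if( Win_depth >= MAX_WINDOWS )
        return 0;

    Win_stack[ Win_depth++ ] = win;
    return 1;
}

/* Return the most recently created window that hasn't been deleted yet,
 * removing it from the stack, or NULL if no windows remain. The caller
 * passes the result to delwin().
 */
void *pop_window( void )
{
    return (Win_depth > 0) ? Win_stack[ --Win_depth ] : NULL;
}
```

Calling push_window() after every newwin() and deleting only what pop_window() returns guarantees the reverse-creation order that delwin() requires.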
A.11.2.4 Subroutines That Affect Entire Windows.
WINDOW *hidewin(WINDOW *win);
WINDOW *showwin(WINDOW *win);
The only way to get rid of a UNIX-curses window is to delete it. This is
inconvenient when you really just want to make it invisible for a while and then
redisplay it later. The nonstandard hidewin() function hides the window
(makes it disappear but does not delete it). It returns NULL if the window can't
be hidden for some reason (usually not enough memory is available to save the
image), or the win pointer on success. The window can be resurrected sometime
later with a showwin() call. showwin() returns NULL (and does nothing) if
the window wasn't hidden; otherwise the win argument is returned. hidewin()
always returns its argument; it terminates the program with an error message if it
can't get memory to hide the window.
It's okay to move a hidden window; it will just reappear at the new location when
you call showwin(). You cannot, however, write to a hidden window or the
written characters will be lost. Similarly, the restriction on deleting windows in
the opposite order from which they were created also applies to hiding and
redisplaying them. When windows can overlap, you must always hide the
most-recently displayed window first.
int mvwin(WINDOW *win, int y, int x);
This function moves a window to an absolute location on the screen. The area
under the window is restored, the window is moved to the indicated location, and
it is then redisplayed, saving the text that is now under the window first. You
should move only the topmost of several overlapping windows. The normal UNIX
mvwin() returns ERR and does nothing if the new coordinates would have
moved any part of the window off the screen. I've changed this behavior somewhat
by always moving the window as far as I can. The window still won't move
off the screen, but it may move a little if it wasn't already at the screen's edge. If
you prefer UNIX compatibility, change the 0 in #define LIKE_UNIX 0 in
mvwin.c (below) to 1 and recompile.
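Moving a window "as far as possible" amounts to clamping each coordinate independently. The fragment below is a sketch of that idea only; clamp_origin() is a hypothetical helper and is not taken from mvwin.c.

```c
/* Hypothetical helper: given a requested origin, the window's size and the
 * screen's size along one axis, return the closest origin that keeps the
 * whole window on the screen.
 */
int clamp_origin( int requested, int win_size, int screen_size )
{
    int max_origin = screen_size - win_size;

    if( max_origin < 0 )        /* Window is larger than the screen. */
        max_origin = 0;

    if( requested < 0 )          return 0;
    if( requested > max_origin ) return max_origin;
    return requested;
}
```

The function would be applied once to the row (with the screen height) and once to the column (with the screen width). A UNIX-style mvwin() would instead compare the clamped value with the requested one and return ERR if they differ.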
mvwinr(WINDOW *win, int y, int x);
This nonstandard macro lets you move a window relative to the current window
position. Positive values of y move down, negative values move up. Similarly,
positive values of x move right and negative values go left. For example,
    mvwinr(win, -5, 10);
moves the window pointed to by win five spaces up from, and 10 spaces to the
right of, its current position. As with mvwin(), you can't move the window off the
screen.
void refresh(void);
void wrefresh(WINDOW *win);
These macros are used by the real curses to do a screen refresh. They force the
screen to coincide with the internal representation of the screen. No characters
are actually written out to the terminal until a refresh occurs. My own curses
writes to the screen immediately, so both of these macros expand to null
strings; they are ignored. You'll need them to be able to port code to UNIX,
however. refresh() refreshes the whole screen (the stdscr window and all
subwindows of stdscr); wrefresh(win) is passed a WINDOW pointer and
refreshes only the indicated window.
int box(WINDOW *win, int vert, int horiz);
This subroutine draws a box in the outermost characters of the window using
vert for the vertical characters and horiz for the horizontal ones. I've
extended this function to support the IBM box-drawing characters. If IBM
box-drawing characters are specified for vert and horiz, box() will use the correct
box-drawing characters for the corners. The box-drawing characters are defined
in <tools/box.h> as follows:

    symbolic value    numeric value    description
    HORIZ             0xc4             Single horizontal line.
    D_HORIZ           0xcd             Double horizontal line.
    VERT              0xb3             Single vertical line.
    D_VERT            0xba             Double vertical line.

Boxes may have double horizontal lines and single vertical ones, or vice versa.
The UNIX box() function uses the vertical character for the corners.
Note that box() doesn't draw a box around the window; rather, it draws the
box in the outermost characters of the window itself. This means that you can
overwrite the border if your output lines are too wide. When you scroll the
window, the box scrolls too. Normally, this problem is avoided by using the
subwindow mechanism. A large outer window is created and boxed; a smaller
subwindow is then created for the text region.
A function that creates a bordered window in this way is shown in Listing
A.69. The outer window is created on line 14, the inner, nested one on line 22.
The outer window just holds the box and the inner window is used for characters.
This way, you can overflow the inner window and not affect the outer one (that
holds the border). The function returns a pointer to the text window. If you plan
to move or delete the window, you'll need the pointer to the outer window too,
and will have to modify the subroutine accordingly.
Listing A.69. Creating a Boxed Window
  1  WINDOW *boxwin( lines, cols, y_start, x_start, title )
  2  char   *title;
  3  {
  4      /* This routine works much like the newwin() except that the window has a
  5       * box around it that won't be destroyed by writes to the window. It
  6       * accomplishes this feat by creating two windows, one inside the other,
  7       * with a box drawn in the outer one. It prints the title centered on the
  8       * top line of the box. Note that I'm making all windows subwindows of the
  9       * stdscr to facilitate the print-screen command.
 10       */
 11
 12      WINDOW *outer, *inner;
 13
 14      outer = subwin( stdscr, lines, cols, y_start, x_start );
 15      box( outer, '|', '-' );
 16
 17      wmove  ( outer, 0, (cols - strlen(title)) / 2 );
 18      wprintw( outer, "%s", title );
 19
 20      wrefresh( outer );
 21
 22      return subwin( stdscr, lines-2, cols-2, y_start+1, x_start+1 );
 23  }
A.11.2.5 Cursor Movement and Character I/O.
int move(y, x);
int wmove(WINDOW *win, y, x);
void getyx(WINDOW *win, y, x);
These functions support cursor movement. move() moves the cursor to the indicated
absolute position on the screen. The upper-left corner of the screen is (0,0).
wmove() moves the cursor to the relative position within a specific window
(pointed to by win). The upper-left corner of the window is (0,0). If you try to
move past the edge of the window, the cursor will be positioned at the closest
edge. The getyx() macro loads the current cursor position for a specific window
into y and x. Note that this is a macro, not a subroutine, so you should not
precede y or x with an address-of operator (&). A getyx(stdscr, y, x) call
loads the current absolute cursor position into y and x.
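The reason that no address-of operator is needed can be seen in a stripped-down model of the macro. The sketch below uses a stand-in structure and a hypothetical macro name so that it compiles without <curses.h>; it illustrates the technique rather than reproducing the actual getyx() definition.

```c
/* Stand-in for a window; only the cursor fields matter here. */
typedef struct
{
    int row;
    int col;
} DEMO_WIN;

/* Like getyx(), this is a macro: y and x are assigned to directly,
 * so the caller passes plain variables, not their addresses.
 */
#define DEMO_GETYX(win, y, x)  ( (y) = (win)->row, (x) = (win)->col )
```

A call such as DEMO_GETYX(&w, &y, &x) would not compile, because &y is not something that can be assigned to.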
int getch(void);
int wgetch(WINDOW *win);
These functions are used for direct keyboard input. getch() gets a character
from the keyboard and echoes it to stdscr; wgetch() echoes the character to
the indicated window (if echo() is enabled, that is). Note that crmode() has
to be enabled to get the character as soon as it's typed. Otherwise the entire line
will be buffered. It's unfortunate that many compiler manufacturers, Microsoft
included, have chosen to use getch() as the name of their standard direct
keyboard-input function. You'll need to specify a /NOE switch to Microsoft
link to prevent the linker from attempting to call in both versions of the
subroutine.
int inch(void);
int winch(WINDOW *win);
int mvinch(int y, int x);
int mvwinch(WINDOW *win, int y, int x);
These functions let you read back characters that are displayed on the screen.
inch() returns the character at the current cursor position (that is, from
stdscr). mvinch(y,x) moves the cursor to the indicated position and then
returns the character at that position; the winch() and mvwinch(win,y,x)
versions do the same, but the cursor position is relative to the origin of the
specified window, not to the entire screen: (0,0) is the top-left corner of the
window. Some older versions of curses don't support the mv versions of this
command. Note that all four functions work as if all windows were declared as
subwindows to stdscr. This is not the case in UNIX systems, in which a read
from stdscr can get a character from the standard screen that is currently
obscured by an overlapping window.
void printw(char *fmt, ...);
int wprintw(WINDOW *win, char *fmt, ...);
These functions are used for formatted output. printw() works just like
printf(); wprintw() is the same but it prints to the indicated window,
moving the cursor to the correct position in the new window if necessary. (The cursor
is moved to the position immediately following the character most recently
written to the indicated window.) printw() pays no attention to window
boundaries, but wprintw() wraps when you get to the right edge of the window and
the window scrolls when you go past the bottom line (provided that
scrollok() has been called for the current window).
void addch(int c);
int waddch(WINDOW *win, int c);
void addstr(char *str);
int waddstr(WINDOW *win, char *str);
These functions are the curses equivalents of putc(), putchar(), puts(),
and fputs(): addch() works like putchar(); waddch() writes a character
to the indicated window (and advances the cursor); addstr() and
waddstr() work like fputs(), writing a string to stdscr or the indicated
window. Neither addstr() nor waddstr() adds a newline at the end of the
string. addch() and waddch() treat several characters specially:
'\n'  Clear the line from the current cursor position to the right edge of the
      window. If nl() is active, go to the left edge of the next line; otherwise go to
      the current column on the next line. In addition, if scrolling is enabled, the
      window scrolls if you're on the bottom line.
'\t'  is expanded into an 8-space field. If the tab goes past the right edge of the
      window, the cursor wraps to the next line.
'\r'  gets you to the left edge of the window, on the current line.
'\b'  backs up one space but may not back up past the left edge of the window.
      A nondestructive backspace is used (the character on which the initial
      cursor sits is not erased). The curses documentation doesn't say that '\b' is
      handled specially but it does indeed work.
ESC   The ESC character is not handled specially by UNIX but my waddch()
      does do so. Don't use explicit escape sequences if portability is a
      consideration. All characters between an ASCII ESC and an alphabetic
      character (inclusive) are sent to the output but are otherwise ignored. This
      behavior lets you send escape sequences directly to the terminal to
      change character attributes and so forth. I'm assuming here that you won't
      change windows in the middle of an escape sequence.
waddch() returns ERR (defined in <curses.h>) if the character would have caused
the window to scroll illegally or if you attempt to write to a hidden window.
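The cursor motion caused by the control characters can be modeled in a few lines. The sketch below is an illustration under stated assumptions, not the waddch() source: the structure and function names are hypothetical, a tab is assumed to advance to the next multiple of eight columns, and line clearing, scrolling, and ESC handling are left out.

```c
#define TAB_WIDTH 8

typedef struct
{
    int row, col;       /* Cursor position, relative to the window. */
    int width;          /* Number of columns in the window.         */
} DEMO_CURSOR;

/* Update the cursor for one control character. nl_active is true if nl()
 * mapping is in effect. Clearing and scrolling are not modeled.
 * Assumption: a tab advances to the next multiple of TAB_WIDTH.
 */
void control_char( DEMO_CURSOR *c, int ch, int nl_active )
{
    switch( ch )
    {
    case '\n':  ++c->row;
                if( nl_active )
                    c->col = 0;
                break;

    case '\r':  c->col = 0;
                break;

    case '\b':  if( c->col > 0 )
                    --c->col;
                break;

    case '\t':  c->col += TAB_WIDTH - (c->col % TAB_WIDTH);
                if( c->col >= c->width )    /* Past the right edge: wrap. */
                {
                    c->col = 0;
                    ++c->row;
                }
                break;
    }
}
```

Note how '\n' leaves the column alone when nl() is not active, which is why nonl mode requires an explicit '\r' to return to the left edge.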
void werase(WINDOW *win);
void erase(void);
void wclear(WINDOW *win);
void clear(void);
void wclrtoeol(WINDOW *win);
void clrtoeol(void);
These functions all erase one or more characters. clear() and erase() both
clear the entire screen; wclear() and werase() both clear only the indicated
window; wclrtoeol() clears the line from the current cursor position in the
indicated window to the right edge of the indicated window; and clrtoeol()
clears from the current cursor position to the right edge of the screen.
int scroll(WINDOW *win);
int wscroll(WINDOW *win, int amt);
These two functions scroll the window: scroll() scrolls the indicated window
up one line; wscroll() scrolls by the indicated amount, up if amt is positive,
down if it's negative. wscroll() is not supported by the UNIX curses. Both
functions return 1 if the window scrolled, 0 if not.
There's one caveat about scrolling. The UNIX functions have a bug in them:
when a window scrolls, the bottom line is not cleared, leaving a mess on the
screen. This problem is not restricted to the scroll() subroutine but occurs
any time that the window scrolls (as when you write a newline on the bottom line
of the window or when a character wraps, causing a scroll). As a consequence, if
you're porting to UNIX, you should always do a clrtoeol() immediately after
either scrolling or printing a newline. Unfortunately, there's no easy way to tell if
a window has scrolled because of a character wrap. My curses package doesn't
have this problem; the bottom line of the window is always cleared on a scroll.
A.11.3 Curses Implementation
This section discusses my curses implementation. Note that I haven't bothered to
make this code UNIX compatible; I'm assuming that you're going to use the native UNIX
curses in that environment. For the most part, the code is straightforward and needs little
comment. The WINDOW structure is declared on lines one to 17 of <tools/curses.h>
(Listing A.70). The macros for bool, reg, TRUE, FALSE, ERR, and OK are defined in the
UNIX <curses.h> file so I've put them here too, though they're not particularly useful.
Be careful of:
    if( foo() == TRUE )
TRUE is #defined to 1 but, in fact, any nonzero value is true. As a consequence,
foo() could return a perfectly legitimate true value that didn't happen to be 1, and the
test would fail. The test:
    if( foo() != FALSE )
is safer. Most of the output functions return ERR if a write would normally have caused
a scroll but scrolling is disabled.
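The difference between the two tests shows up as soon as a function returns a true value other than 1. The sketch below demonstrates this with a bit test; the three function names are hypothetical and exist only for the demonstration.

```c
#define TRUE  (1)
#define FALSE (0)

/* Returns nonzero if the mask bit is set in flags. The value returned is
 * the bit itself (4, for instance), not necessarily 1.
 */
int bit_is_set( unsigned flags, unsigned mask )
{
    return flags & mask;
}

/* Unsafe: succeeds only when the true value happens to be exactly 1. */
int unsafe_test( unsigned flags, unsigned mask )
{
    return bit_is_set( flags, mask ) == TRUE;
}

/* Safe: any nonzero value counts as true. */
int safe_test( unsigned flags, unsigned mask )
{
    return bit_is_set( flags, mask ) != FALSE;
}
```

With flags of 6 and a mask of 4, bit_is_set() returns 4: the comparison against TRUE fails even though the bit is set, while the comparison against FALSE succeeds.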
Listing A.70. curses.h: WINDOW Definitions and Macros
  1  typedef struct window
  2  {
  3      int      x_org;         /* X coordinate of upper-left corner     */
  4      int      y_org;         /* Y coordinate of upper-left corner     */
  5      int      x_size;        /* Horizontal size of text area.         */
  6      int      y_size;        /* Vertical size of text area.           */
  7      int      row;           /* Current cursor row (0 to y_size-1)    */
  8      int      col;           /* Current cursor column (0 to x_size-1) */
  9      void     *image;        /* Image buffer. Holds what used to be   */
 10                              /* under the window.                     */
 11      unsigned wrap_ok   :1;  /* Line wrap is enabled in this win.     */
 12      unsigned scroll_ok :1;  /* Scrolling permitted in this window    */
 13      unsigned hidden    :1;  /* Window is hidden (nonstandard)        */
 14      unsigned boxed     :1;  /* Window is boxed (nonstandard)         */
 15      unsigned attrib    :8;  /* attribute used for character writes   */
 16
 17  } WINDOW;
 18
 19  #define bool   unsigned int
 20  #define reg    register
 21  #define TRUE   (1)
 22  #define FALSE  (0)
 23  #define ERR    (0)
 24  #define OK     (1)
 25
 26  /*----------------------------------------------------------------------
 27   * The following macros implement many of the curses functions
 28   */
 29
 30  #define getyx(win,y,x)      ((x) = ((WINDOW*)(win))->col, \
 31                               (y) = ((WINDOW*)(win))->row)
 32
 33  #define refresh()
 34  #define scrollok(win,flag)  ((win)->scroll_ok = (flag))
 35  #define wrapok(win,flag)    ((win)->wrap_ok   = (flag))
 36  #define wrefresh(win)       /* empty */
 37
 38  /*----------------------------------------------------------------------
 39   * Nonstandard Macros: mvwinr() moves the window relative to the current
 40   * position. Negative is left or up, positive is right or
 41   * down. ground() changes the fore and background colors for subsequent writes
 42   * to the window.
 43   */
 44
 45  #define mvwinr(w,dy,dx)  mvwin((w), ((w)->y_org - (w)->boxed) + (dy), \
 46                                      ((w)->x_org - (w)->boxed) + (dx))
 47
 48  #define ground(win,f,b)  (win->attrib = ((f) & 0x7f) | ((b) & 0x7f) << 4)
 49
 50  /*----------------------------------------------------------------------
 51   * Externs for the window functions and #defines to map the standard screen
 52   * functions to the stdscr versions. There are a few idiosyncrasies here.
 53   * In particular, mvcur() just ignores its first two arguments and maps to a
 54   * move() call. Similarly, subwin() just maps to a newwin() call, and clearok()
 55   * isn't supported. You must clear the window explicitly before writing to it.
 56   */
 57
 58  extern WINDOW *stdscr;
 59  extern void   endwin    (void);
 60  extern void   initscr   (void);
 61
 62  extern int    waddch    (WINDOW *, int);
 63  #define addch(c)        waddch(stdscr, c)
 64
 65  extern int    waddstr   (WINDOW *, char *);
 66  #define addstr(s)       waddstr(stdscr, s)
 67
 68  extern int    wclrtoeol (WINDOW *);
 69  #define clrtoeol()      wclrtoeol(stdscr)
 70
 71  extern int    werase    (WINDOW *);
 72  #define erase()         werase(stdscr)
 73
 74  #define wclear(w)       werase(w)
 75  #define clear()         werase(stdscr)
 76
 77  extern int    wgetch    (WINDOW *);
 78  #define getch()         wgetch(stdscr)
 79
 80  extern int    wmove     (WINDOW *, int, int);
 81  #define move(y,x)        wmove(stdscr, (y), (x))
 82  #define mvcur(oy,ox,y,x) move((y),(x))
 83
 84  extern int    wprintw   (WINDOW *, char *, ...);
 85  extern int    printw    (char *, ...);
 86
 87  extern int    wscroll   (WINDOW *, int);
 88  #define scroll(win)     wscroll(win, 1)
 89
 90  extern int    winch     (WINDOW *);
 91  #define inch()          winch(stdscr)
 92  #define mvinch(y,x)     ( wmove(stdscr,y,x), winch(stdscr) )
 93  #define mvwinch(w,y,x)  ( wmove(w,y,x), winch(w) )
 94
 95  extern WINDOW *newwin   (int, int, int, int);
 96  #define subwin(w,a,b,c,d) newwin(a,b,c,d)
 97
 98  /*----------------------------------------------------------------------
 99   * Externs for functions that don't have stdscr versions
100   */
101
102  extern int  box        (WINDOW *, int, int);       /* UNIX functions */
103  extern int  crmode     (void);
104  extern int  delwin     (WINDOW *);
105  extern int  echo       (void);
106  extern int  mvwin      (WINDOW *win, int y, int x);
107  extern int  nl         (void);
108  extern int  nocrmode   (void);
109  extern int  noecho     (void);
110  extern int  nonl       (void);
111  extern void boxed      (void);             /* Nonstandard functions */
112  extern void unboxed    (void);
113  extern void save       (void);
114  extern void nosave     (void);
115  extern void def_ground (int, int);
The mvinch() and mvwinch() macros of Listing A.70 (lines 92 and 93 in the
printed listing) use the comma, or sequence, operator. The comma operator
evaluates its operands from left to right, and the entire expression
evaluates to the rightmost operand in the list. For example, mvinch() looks
like this:

    #define mvinch(y,x)   ( move(y,x), inch() )

An equivalent subroutine is:

    mvinch( y, x )
    {
        move( y, x );
        return inch();
    }

The comma operator is used because two calls have to be executed: the move()
call and the inch() call. Were you to define the macro as:

    #define mvinch(y,x)   move(y,x); inch()

the following code wouldn't work:

    if( condition )
        mvinch( y, x );

because it would expand to:

    if( condition )
        move( y, x );
    inch();

so that only the move() call is controlled by the if. Putting curly braces
around the statements doesn't help. For example:

    #define mvinch(y,x)   { move(y,x); inch(); }

    if( condition )
        mvinch( y, x );
    else
        something();

expands to

    if( condition )
    {
        move( y, x );
        inch();
    }
    ;
    else
        something();

Here the else will try to bind with the semicolon, which is a perfectly
legitimate statement in C, causing a "No matching if for else" error message.
Though the comma operator solves both of these problems, it isn't very
readable. I don't recommend using it unless you must. Never use it if curly
braces will work in a particular application. The remainder of curses is in
Listings A.71 to A.87.
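
The behavior described above can be checked with a small sketch. The names fake_move(), fake_inch(), MVINCH, read_if(), and move_count() are hypothetical stand-ins invented for this illustration; they are not the curses routines. The point is only that the comma-operator form is a single expression, so it can sit in an unbraced if/else and its value can be assigned.

```c
/* Hypothetical stand-ins for move() and inch(). They only record the
 * cursor position so that the macro's behavior can be observed.
 */
static int Row, Col;
static int Moves;                      /* number of fake_move() calls */

static int fake_move(int y, int x) { Row = y; Col = x; ++Moves; return 0; }
static int fake_inch(void)         { return Row * 100 + Col; }

/* Comma-operator form: one expression whose value is that of fake_inch(). */
#define MVINCH(y, x)  ( fake_move((y), (x)), fake_inch() )

/* Returns the "character" at (y,x) if cond is true, -1 otherwise. */
int read_if(int cond, int y, int x)
{
    int c;

    if (cond)
        c = MVINCH(y, x);              /* both calls are controlled by the if */
    else
        c = -1;                        /* the else still binds to the if      */

    return c;
}

/* Number of times the stand-in move routine has been called so far. */
int move_count(void) { return Moves; }
```

Because the macro is an expression, no stray semicolon or brace is left behind for the else to trip over, and the move is skipped entirely when the condition is false.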
The comma operator in mvinch() and mvwinch(). Writing multiple-statement macros.
Listing A.71. box.h IBM/PC Box-Drawing Characters
/*----------------------------------------------------------------------
 * BOX.H: #defines for the box-drawing characters
 *----------------------------------------------------------------------
 * The names are:
 *
 *      UL      Upper left corner
 *      UR      Upper right corner
 *      LL      Lower left corner
 *      LR      Lower right corner
 *      CEN     Center (intersection of two lines)
 *      TOP     Tee with the flat piece on top
 *      BOT     Bottom tee
 *      LEFT    Left tee
 *      RIGHT   Right tee
 *      HORIZ   Horizontal line
 *      VERT    Vertical line.
 *
 *      UL      TOP     UR          HORIZ
 *
 *      LEFT    CEN     RIGHT       VERT
 *
 *      LL      BOT     LR
 *
 * The D_XXX defines have double horizontal and vertical lines.
 * The HD_XXX defines have double horizontal lines and single vertical lines.
 * The VD_XXX defines have double vertical lines and single horizontal lines.
 *
 * If your terminal is not IBM compatible, #define all of these as '+' (except
 * for the VERT #defines, which should be a '|', and the HORIZ #defines, which
 * should be a '-') by #defining NOT_IBM_PC before including this file.
 */

#ifdef NOT_IBM_PC
#    define IBM_BOX(x)
#    define OTHER_BOX(x) x
#else
#    define IBM_BOX(x)   x
#    define OTHER_BOX(x)
#endif

#define VERT      IBM_BOX( 179 )  OTHER_BOX( '|' )
#define RIGHT     IBM_BOX( 180 )  OTHER_BOX( '+' )
#define UR        IBM_BOX( 191 )  OTHER_BOX( '+' )
#define LL        IBM_BOX( 192 )  OTHER_BOX( '+' )
#define BOT       IBM_BOX( 193 )  OTHER_BOX( '+' )
#define TOP       IBM_BOX( 194 )  OTHER_BOX( '+' )
#define LEFT      IBM_BOX( 195 )  OTHER_BOX( '+' )
#define HORIZ     IBM_BOX( 196 )  OTHER_BOX( '-' )
#define CEN       IBM_BOX( 197 )  OTHER_BOX( '+' )
#define LR        IBM_BOX( 217 )  OTHER_BOX( '+' )
#define UL        IBM_BOX( 218 )  OTHER_BOX( '+' )

#define D_VERT    IBM_BOX( 186 )  OTHER_BOX( '|' )
#define D_RIGHT   IBM_BOX( 185 )  OTHER_BOX( '+' )
#define D_UR      IBM_BOX( 187 )  OTHER_BOX( '+' )
#define D_LL      IBM_BOX( 200 )  OTHER_BOX( '+' )
#define D_BOT     IBM_BOX( 202 )  OTHER_BOX( '+' )
#define D_TOP     IBM_BOX( 203 )  OTHER_BOX( '+' )
#define D_LEFT    IBM_BOX( 204 )  OTHER_BOX( '+' )
#define D_HORIZ   IBM_BOX( 205 )  OTHER_BOX( '-' )
#define D_CEN     IBM_BOX( 206 )  OTHER_BOX( '+' )
#define D_LR      IBM_BOX( 188 )  OTHER_BOX( '+' )
#define D_UL      IBM_BOX( 201 )  OTHER_BOX( '+' )

#define HD_VERT   IBM_BOX( 179 )  OTHER_BOX( '|' )
#define HD_RIGHT  IBM_BOX( 181 )  OTHER_BOX( '+' )
#define HD_UR     IBM_BOX( 184 )  OTHER_BOX( '+' )
#define HD_LL     IBM_BOX( 212 )  OTHER_BOX( '+' )
#define HD_BOT    IBM_BOX( 207 )  OTHER_BOX( '+' )
#define HD_TOP    IBM_BOX( 209 )  OTHER_BOX( '+' )
#define HD_LEFT   IBM_BOX( 198 )  OTHER_BOX( '+' )
#define HD_HORIZ  IBM_BOX( 205 )  OTHER_BOX( '-' )
#define HD_CEN    IBM_BOX( 216 )  OTHER_BOX( '+' )
#define HD_LR     IBM_BOX( 190 )  OTHER_BOX( '+' )
#define HD_UL     IBM_BOX( 213 )  OTHER_BOX( '+' )

#define VD_VERT   IBM_BOX( 186 )  OTHER_BOX( '|' )
#define VD_RIGHT  IBM_BOX( 182 )  OTHER_BOX( '+' )
#define VD_UR     IBM_BOX( 183 )  OTHER_BOX( '+' )
#define VD_LL     IBM_BOX( 211 )  OTHER_BOX( '+' )
#define VD_BOT    IBM_BOX( 208 )  OTHER_BOX( '+' )
#define VD_TOP    IBM_BOX( 210 )  OTHER_BOX( '+' )
#define VD_LEFT   IBM_BOX( 199 )  OTHER_BOX( '+' )
#define VD_HORIZ  IBM_BOX( 196 )  OTHER_BOX( '-' )
#define VD_CEN    IBM_BOX( 215 )  OTHER_BOX( '+' )
#define VD_LR     IBM_BOX( 189 )  OTHER_BOX( '+' )
#define VD_UL     IBM_BOX( 214 )  OTHER_BOX( '+' )

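
The IBM_BOX()/OTHER_BOX() pair works because exactly one of the two macros expands to its argument while the other expands to nothing. The following reduced sketch shows that idiom with three of the names from box.h. The helper box_chars() is hypothetical and exists only to make the selected values visible; it is not part of the listing.

```c
/* Define NOT_IBM_PC before this point to select the portable characters. */
#ifdef NOT_IBM_PC
#   define IBM_BOX(x)
#   define OTHER_BOX(x) x
#else
#   define IBM_BOX(x)   x
#   define OTHER_BOX(x)
#endif

/* Each name expands to exactly one of its two candidate values. */
#define VERT   IBM_BOX( 179 )  OTHER_BOX( '|' )
#define HORIZ  IBM_BOX( 196 )  OTHER_BOX( '-' )
#define UL     IBM_BOX( 218 )  OTHER_BOX( '+' )

/* Hypothetical helper: reports the characters chosen at compile time. */
void box_chars(int *vert, int *horiz, int *ul)
{
    *vert  = VERT;       /* 179 on an IBM PC, '|' otherwise */
    *horiz = HORIZ;      /* 196 on an IBM PC, '-' otherwise */
    *ul    = UL;         /* 218 on an IBM PC, '+' otherwise */
}
```

The advantage of this arrangement over a pair of parallel #define lists is that both values for a name sit on one line, so the two character sets cannot drift apart.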
Listing A.72. cur.h Curses Functions #includes
#include <stdlib.h>
#include <ctype.h>
#include <curses.h>             /* routines in the library.                   */
#include <stdarg.h>             /* va_list and va_start (ANSI)                */
#include <tools/termlib.h>
#include <tools/box.h>          /* of IBM box-drawing characters              */
                                /* function prototypes for curses functions   */
                                /* (distributed on disk but not printed here) */
Listing A.73. box.c Draw Box in Window
#include "cur.h"

box( win, vert, horiz )
WINDOW *win;
{
    /* Draws a box in the outermost characters of the window using vert for
     * the vertical characters and horiz for the horizontal ones. I've
     * extended this function to support the IBM box-drawing characters. That
     * is, if IBM box-drawing characters are specified for vert and horiz,
     * box() will use the correct box-drawing characters in the corners. These
     * are defined in box.h as:
     *
     *      HORIZ   (0xc4)  single horizontal line
     *      D_HORIZ (0xcd)  double horizontal line.
     *      VERT    (0xb3)  single vertical line
     *      D_VERT  (0xba)  double vertical line.
     */

    int i, nrows;
    int ul, ur, ll, lr;
    int oscroll, owrap, oy, ox;

    getyx( win, oy, ox );
    oscroll        = win->scroll_ok;   /* Disable scrolling and line wrap    */
    owrap          = win->wrap_ok;     /* in case window uses the whole screen */
    win->scroll_ok = 0;
    win->wrap_ok   = 0;

    if( !((horiz==HORIZ || horiz==D_HORIZ) && (vert==VERT || vert==D_VERT)) )
        ul = ur = ll = lr = vert;
    else
    {
        if( vert == VERT )
        {
            if( horiz == HORIZ )
                ul=UL,    ur=UR,    ll=LL,    lr=LR;
            else
                ul=HD_UL, ur=HD_UR, ll=HD_LL, lr=HD_LR;
        }
        else
        {
            if( horiz == HORIZ )
                ul=VD_UL, ur=VD_UR, ll=VD_LL, lr=VD_LR;
            else
                ul=D_UL,  ur=D_UR,  ll=D_LL,  lr=D_LR;
        }
    }

    wmove ( win, 0, 0 );
    waddch( win, ul );                          /* Draw the top line */

    for( i = win->x_size - 2; --i >= 0 ; )
        waddch( win, horiz );

    waddch( win, ur );
    nrows = win->y_size - 2;                    /* Draw the two sides */

    i = 1;
    while( --nrows >= 0 )
    {
        wmove ( win, i, 0 );
        waddch( win, vert );
        wmove ( win, i++, win->x_size - 1 );
        waddch( win, vert );
    }

    wmove ( win, i, 0 );                        /* Draw the bottom line */
    waddch( win, ll );

    for( i = win->x_size - 2; --i >= 0 ; )
        waddch( win, horiz );

    waddch( win, lr );
    wmove ( win, oy, ox );
    win->scroll_ok = oscroll;
    win->wrap_ok   = owrap;
}
Listing A.74. delwin.c Delete Window
#include "cur.h"

delwin( win )
WINDOW *win;
{
    /* Copy the saved image back onto the screen and free the memory used for
     * the buffer.
     */

    if( win->image )
    {
        restore( (SBUF *) win->image );
        freescr( (SBUF *) win->image );
    }
    free( win );
}
Listing A.75. hidewin.c Hide a Window
#include "cur.h"

WINDOW *hidewin( win )
WINDOW *win;
{
    /* Hide a window. Return NULL and do nothing if the image wasn't saved
     * originally or if it's already hidden, otherwise hide the window and
     * return the win argument. You may not write to a hidden window.
     */

    SBUF *image;

    if( !win->image || win->hidden )
        return NULL;

    image = savescr( ((SBUF*)(win->image))->left, ((SBUF*)(win->image))->right,
                     ((SBUF*)(win->image))->top,  ((SBUF*)(win->image))->bottom );

    restore( (SBUF *) win->image );
    freescr( (SBUF *) win->image );
    win->image  = image;
    win->hidden = 1;
    return( win );
}
Listing A.76. initscr.c Initialize Curses
#include "cur.h"

WINDOW *stdscr;

void endwin()                           /* Clean up as required */
{
    cmove( 24, 0 );
}

void initscr()
{
    /* Creates stdscr. If you want a boxed screen, call boxed() before this
     * routine. The underlying text is NOT saved. Note that the atexit() call
     * ensures a clean up on normal exit, but not with a Ctrl-Break; you'll
     * have to call signal() to do that.
     */

    nosave();
    init( -1 );
    stdscr = newwin( 25, 80, 0, 0 );
    save();
    atexit( endwin );
}
Listing A.77. mvwin.c Move a Window
#include "cur.h"

/* Move a window to new absolute position. This routine will behave in one of
 * two ways, depending on the value of LIKE_UNIX when the file was compiled.
 * If the #define has a true value, then mvwin() returns ERR and does nothing
 * if the new coordinates would move the window past the edge of the screen.
 * If LIKE_UNIX is false, ERR is still returned but the window is moved flush
 * with the right edge if it's not already there. ERR says that the window is
 * now flush with the edge of the screen. In both instances, negative
 * coordinates are silently set to 0.
 */

#define LIKE_UNIX 0

#if ( LIKE_UNIX )
#    define UNIX(x) x
#    define DOS(x)
#else
#    define UNIX(x)
#    define DOS(x) x
#endif

mvwin( win, y, x )
WINDOW *win;
{
    int  old_x, old_y, xsize, ysize, delta_x, delta_y, visible;
    SBUF *image;

    if( win == stdscr )         /* Can't move stdscr without it going */
        return ERR;             /* off the screen.                    */

    /* Get the actual dimensions of the window: compensate for a border if the
     * window is boxed.
     */

    old_x = win->x_org  - win->boxed;
    old_y = win->y_org  - win->boxed;
    xsize = win->x_size + (win->boxed * 2);
    ysize = win->y_size + (win->boxed * 2);

    /* Constrain x and y so that the window can't go off the screen */

    x = max( 0, x );
    y = max( 0, y );
    if( x + xsize > 80 )
    {
        UNIX( return ERR;     )
        DOS ( x = 80 - xsize; )
    }
    if( y + ysize > 25 )
    {
        UNIX( return ERR;     )
        DOS ( y = 25 - ysize; )
    }

    delta_x = x - old_x;                /* Adjust coordinates. */
    delta_y = y - old_y;

    if( delta_y == 0 && delta_x == 0 )
        return ERR;

    if( visible = !win->hidden )
        hidewin( win );

    win->y_org    += delta_y;
    win->x_org    += delta_x;
    image          = (SBUF *) win->image;
    image->top    += delta_y;
    image->bottom += delta_y;
    image->left   += delta_x;
    image->right  += delta_x;

    if( visible )
        showwin( win );
    return( OK );
}
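
The edge-constraint step in mvwin() can be looked at in isolation. The sketch below is not the listing's code: it replaces the UNIX()/DOS() compile-time macros with a run-time flag so that both behaviors can be exercised in one program, and it follows the header comment's description (an error is reported in both modes when the window would cross the edge). The names clamp_origin(), CLAMP_OK, CLAMP_ERR, SCREEN_COLS, and SCREEN_ROWS are hypothetical.

```c
#define SCREEN_COLS 80
#define SCREEN_ROWS 25
#define CLAMP_OK    1      /* hypothetical status codes, not the curses OK/ERR */
#define CLAMP_ERR   0

/* Constrain one coordinate so that a window of the given size, starting at
 * *pos, fits within limit. A negative position that otherwise fits is
 * silently set to 0. If the window would run past the edge:
 *   like_unix != 0   leave *pos alone and report an error
 *   like_unix == 0   move the window flush with the edge, report an error
 */
int clamp_origin(int *pos, int size, int limit, int like_unix)
{
    int p = *pos < 0 ? 0 : *pos;

    if (p + size > limit) {
        if (!like_unix)
            *pos = limit - size;        /* flush with the far edge */
        return CLAMP_ERR;
    }
    *pos = p;
    return CLAMP_OK;
}
```

A caller would apply the function once per axis, for example clamp_origin(&x, xsize, SCREEN_COLS, 0) followed by clamp_origin(&y, ysize, SCREEN_ROWS, 0).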
Listing A.78. showwin.c Display a Previously Hidden Window
#include "cur.h"

WINDOW *showwin( win )
WINDOW *win;
{
    /* Make a previously hidden window visible again. Return NULL and do
     * nothing if the window wasn't hidden, otherwise return the win argument.
     */

    SBUF *image;

    if( !win->hidden )
        return( NULL );

    image = savescr( ((SBUF*)(win->image))->left, ((SBUF*)(win->image))->right,
                     ((SBUF*)(win->image))->top,  ((SBUF*)(win->image))->bottom );

    restore( (SBUF *) win->image );
    freescr( (SBUF *) win->image );
    win->image  = image;
    win->hidden = 0;

    /* Move the cursor to compensate for windows that were moved while they
     * were hidden.
     */
    cmove( win->y_org + win->row, win->x_org + win->col );
    return( win );
}
Listing A.79. waddstr.c Write String to Window
#include "cur.h"

waddstr( win, str )
WINDOW *win;
char   *str;
{
    while( *str )
        waddch( win, *str++ );
}
Listing A.80. wclrtoeo.c Clear from Cursor to Edge of Window
#include "cur.h"

wclrtoeol( win )
WINDOW *win;
{
    /* Clear from cursor to end of line, the cursor isn't moved. The main
     * reason that this is included here is because you have to call it after
     * printing every newline in order to compensate for a bug in the real
     * curses. This bug has been corrected in my curses, so you don't have to
     * use this routine if you're not interested in portability.
     */

    clr_region( win->x_org + win->col, win->x_org + (win->x_size - 1),
                win->y_org + win->row, win->y_org + win->row,
                win->attrib );
}
Listing A.81. werase.c Erase Entire Window
#include "cur.h"

werase( win )
WINDOW *win;
{
    clr_region( win->x_org, win->x_org + (win->x_size - 1),
                win->y_org, win->y_org + (win->y_size - 1),
                win->attrib );

    cmove( win->y_org, win->x_org );
    win->row = 0;
    win->col = 0;
}
Listing A.82. winch.c Move Window-Relative Cursor and Read Character
#include "cur.h"

int winch( win )
WINDOW *win;
{
    int y, x, c;

    curpos( &y, &x );
    cmove( win->y_org + win->row, win->x_org + win->col );
    c = inchar();
    cmove( y, x );
    return c;
}
Listing A.83. wincreat.c Create a Window
#include "cur.h"
#include <tools/box.h>

/*----------------------------------------------------------------------
 * Window creation functions.
 * Standard Functions:
 *
 *      WINDOW *newwin( lines, cols, begin_y, begin_x )
 *                      creates a window
 *
 * Nonstandard Functions:
 *
 *      save()          Area under all new windows is saved (default)
 *      nosave()        Area under all new windows is not saved
 *
 *      boxed()         Window is boxed automatically.
 *      unboxed()       Window is not boxed (default)
 *      def_ground(f,b) Set default foreground color to f, and background color
 *                      to b.
 *----------------------------------------------------------------------
 */

PRIVATE int Save   = 1;         /* Save image when window created   */
PRIVATE int Box    = 0;         /* Windows are boxed                */
PRIVATE int Attrib = NORMAL;    /* Default character attribute byte */

void save    () { Save = 1; }
void nosave  () { Save = 0; }
void boxed   () { Box  = 1; }
void unboxed () { Box  = 0; }
void def_ground( f, b ) { Attrib = (f & 0x7f) | ((b & 0x7f) << 4); }

/*----------------------------------------------------------------------*/

WINDOW *newwin( lines, cols, begin_y, begin_x )
int cols;       /* Horizontal size (including border) */
int lines;      /* Vertical size (including border)   */
int begin_y;    /* Y coordinate of upper-left corner  */
int begin_x;    /* X coordinate of upper-left corner  */
{
    WINDOW *win;

    if( !(win = (WINDOW *) malloc( sizeof(WINDOW) )) )
    {
        fprintf( stderr, "Internal error (newwin): Out of memory\n" );
        exit( 1 );
    }

    if( cols > 80 )
    {
        cols    = 80;
        begin_x = 0;
    }
    else if( begin_x + cols > 80 )
        begin_x = 80 - cols;

    if( lines > 25 )
    {
        lines   = 25;
        begin_y = 0;
    }
    else if( begin_y + lines > 25 )
        begin_y = 25 - lines;

    win->x_org     = begin_x;
    win->y_org     = begin_y;
    win->x_size    = cols;
    win->y_size    = lines;
    win->row       = 0;
    win->col       = 0;
    win->scroll_ok = 0;
    win->wrap_ok   = 1;
    win->boxed     = 0;
    win->hidden    = 0;
    win->attrib    = Attrib;
    win->image     = !Save ? NULL : savescr( begin_x, begin_x + (cols  - 1),
                                             begin_y, begin_y + (lines - 1) );
    werase( win );

    if( Box )                           /* Must be done last        */
    {
        box( win, VERT, HORIZ );        /* Box it first             */
        win->boxed   = 1;
        win->x_size -= 2;               /* Then reduce window size  */
        win->y_size -= 2;               /* so that the box won't    */
        win->x_org  += 1;               /* be overwritten.          */
        win->y_org  += 1;

        cmove( win->y_org, win->x_org );
    }
    return win;
}
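
The packing done by def_ground() can be tried on its own. In this sketch make_attrib() uses the same expression as the listing; the two decoders, attrib_fg() and attrib_bg(), are hypothetical additions that assume the usual PC text-mode layout (foreground in bits 0 to 3, background in bits 4 to 6).

```c
/* Same packing expression as def_ground() in the listing. */
int make_attrib(int f, int b)
{
    return (f & 0x7f) | ((b & 0x7f) << 4);
}

/* Hypothetical decoders, assuming the PC text-mode attribute layout:
 * bits 0-3 hold the foreground color, bits 4-6 the background color.
 */
int attrib_fg(int attrib) { return attrib & 0x0f; }
int attrib_bg(int attrib) { return (attrib >> 4) & 0x07; }
```

For example, make_attrib(7, 1) yields 0x17. Note that the 0x7f mask is wider than the four-bit shift, so a foreground value above 15 would spill into the background field; the sketch keeps the listing's masks rather than narrowing them.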
Listing A.84. winio.c Miscellaneous Window I/O Functions
#include "cur.h"

/*----------------------------------------------------------------------
 * WINIO.C      Lowest level I/O routines:
 *                      waddch(win, c)  wgetch(win)
 *                      echo()          noecho()
 *                      nl()            nonl()
 *                      crmode()        nocrmode()
 *----------------------------------------------------------------------
 */

PRIVATE int Echo   = 1;         /* Echo enabled                              */
PRIVATE int Crmode = 0;         /* If 1, use unbuffered input                */
PRIVATE int Nl     = 1;         /* If 1, map \r to \n on input               */
                                /* and map both to \n\r on output            */
echo    () { Echo = 1; }
noecho  () { Echo = 0; }
nl      () { Nl   = 1; }
nonl    () { Nl   = 0; }

crmode()
{
    Crmode = 1;
}

nocrmode()
{
    Crmode = 0;
}

/*----------------------------------------------------------------------*/

static char *getbuf( win, buf )
WINDOW *win;
char   *buf;
{
    /* Get a buffer interactively. ^H is a destructive backspace. This routine
     * is mildly recursive (it's called from wgetch() when Crmode is false).
     * The newline is put into the buffer. Returns its second argument.
     */

    register int c;
    char         *sbuf = buf;

    Crmode = 1;
    while( (c = wgetch(win)) != '\n' && c != '\r' )
    {
        switch( c )
        {
        case '\b':  if( --buf >= sbuf )
                        wprintw( win, " \b" );
                    else
                    {
                        wprintw( win, " " );
                        putchar( '\007' );
                        buf = sbuf;
                    }
                    break;

        default:    *buf++ = c;
                    break;
        }
    }
    *buf++ = c;                 /* Add line terminator (\n or \r) */
    *buf   = '\0';
    Crmode = 0;
    return sbuf;
}

/*----------------------------------------------------------------------*/

int wgetch( win )
WINDOW *win;
{
    /* Get a character from DOS without echoing. We need to do this in order
     * to support (echo/noecho). We'll also do noncrmode input buffering here.
     * Maximum input line length is 132 columns.
     *
     * In nocrmode(), DOS functions are used to get a line and all the normal
     * command-line editing functions are available. Since there's no way to
     * turn off echo in this case, characters are echoed to the screen
     * regardless of the status of echo(). In order to retain control of the
     * window, input fetched for wgetch() is always done in crmode, even if
     * Crmode isn't set. If nl() mode is enabled, carriage return (Enter, ^M)
     * and linefeed (^J) are both mapped to '\n', otherwise they are not mapped.
     *
     * Characters are returned in an int. The high byte is 0 for normal
     * characters. Extended characters (like the function keys) are returned
     * with the high byte set to 0xff and the character code in the low byte.
     * Extended characters are not echoed.
     */

    static unsigned char buf[ 133 ] = { 0 };
    static unsigned char *p         = buf;
    static int           numchars   = -1;
    register int         c;

    if( !Crmode )
    {
        if( !*p )
            p = getbuf( win, buf );
        return( Nl && *p == '\r' ) ? '\n' : *p++ ;
    }
    else
    {
        if( (c = bdos(8,0,0) & 0xff) == 0x1a )          /* Ctrl-Z        */
            return EOF;
        else if( !c )                                   /* Extended char */
            c = ~0xff | bdos(8,0,0);
        else
        {
            if( c == '\r' && Nl )
                c = '\n';

            if( Echo )
                waddch( win, c );
        }
        return c;
    }
}

/*----------------------------------------------------------------------*/

int waddch( win, c )
WINDOW *win;
int    c;
{
    /* Print a character to an active (not hidden) window.
     * The following are handled specially:
     *
     * \n   Clear the line from the current cursor position to the right edge
     *      of the window. Then:
     *          if nl() is active:
     *              go to the left edge of the next line
     *          else
     *              go to the current column on the next line
     *      In addition, if scrolling is enabled, the window scrolls if you're
     *      on the bottom line.
     * \t   is expanded into an 8-space field. If the tab goes past the right
     *      edge of the window, the cursor wraps to the next line.
     * \r   gets you to the beginning of the current line.
     * \b   backs up one space but may not back up past the left edge of the
     *      window. Nondestructive. The curses documentation doesn't say that
     *      \b is handled explicitly but it does indeed work.
     * ESC  This is not standard curses but is useful. It is valid only if
     *      DOESC was true during the compile. All characters between an ASCII
     *      ESC and an alphabetic character are sent to the output but are
     *      otherwise ignored. This lets you send escape sequences directly to
     *      the terminal if you like. I'm assuming here that you won't change
     *      windows in the middle of an escape sequence.
     *
     * Return ERR if the character would have caused the window to scroll
     * illegally, or if you attempt to write to a hidden window.
     */

    static int saw_esc = 0;
    int        rval    = OK;

    if( win->hidden )
        return ERR;

    cmove( win->y_org + win->row, win->x_org + win->col );

#ifdef DOESC
    if( saw_esc )
    {
        if( isalpha( c ) )
            saw_esc = 0;
        outc( c, win->attrib );
    }
    else
#endif
    {
        switch( c )
        {
#ifdef DOESC
        case '\033': if( saw_esc )
                        saw_esc = 0;
                     else
                     {
                        saw_esc = 1;
                        outc( '\033', win->attrib );
                     }
                     break;
#endif
        case '\b':  if( win->col > 0 )
                    {
                        outc( '\b', win->attrib );
                        --( win->col );
                    }
                    break;

        case '\t':  do {
                        waddch( win, ' ' );
                    } while( win->col % 8 );
                    break;

        case '\r':  win->col = 0;
                    cmove( win->y_org + win->row, win->x_org );
                    break;

        default:    if( (win->col + 1) < win->x_size )
                    {
                        /* If you're not at the right edge of the window, */
                        /* print the character and advance.               */

                        ++win->col;
                        outc( c, win->attrib );
                        break;
                    }
                    replace( c );       /* At right edge, don't advance */
                    if( !win->wrap_ok )
                        break;

                    /* otherwise wrap around by falling through to newline */

        case '\n':  if( c == '\n' )         /* Don't erase character at far */
                        wclrtoeol( win );   /* right of the screen.         */

                    if( Nl )
                        win->col = 0;

                    if( ++(win->row) >= win->y_size )
                    {
                        rval = wscroll( win, 1 );
                        --( win->row );
                    }
                    cmove( win->y_org + win->row, win->x_org + win->col );
                    break;
        }
    }
    return rval;
}
Listing A.85. wmove.c Move Window-Relative Cursor
#include "cur.h"

wmove( win, y, x )
WINDOW *win;
{
    /* Go to window-relative position. You can't go outside the window. */

    cmove( win->y_org + (win->row = min(y, win->y_size-1)),
           win->x_org + (win->col = min(x, win->x_size-1)) );
}
Listing A.86. wprintw.c Formatted Print to Window
#include "cur.h"

PRIVATE int Errcode = OK;

PRIVATE wputc( c, win )
int    c;
WINDOW *win;
{
    Errcode = waddch( win, c );
}

wprintw( win, fmt, ... )
WINDOW *win;
char   *fmt;
{
    va_list args;
    va_start( args, fmt );

    Errcode = OK;
    prnt( wputc, win, fmt, args );
    va_end( args );
    return Errcode;
}

printw( fmt, ... )
char *fmt;
{
    va_list args;
    va_start( args, fmt );

    Errcode = OK;
    prnt( wputc, stdscr, fmt, args );
    va_end( args );
    return Errcode;
}
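
wprintw() hands a per-character output routine and a va_list to the book's prnt() function. The same pattern can be sketched with the standard library alone. This version is an assumption-laden substitute, not the author's method: it formats into a fixed buffer with vsnprintf() (C99) instead of calling prnt(), and the names print_via(), putc_fn, struct sink, and sink_putc() are hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>

/* Per-character output routine, analogous in role to wputc(). */
typedef int (*putc_fn)(int c, void *param);

/* Format into a fixed buffer with vsnprintf(), then hand each character to
 * out(). Returns the number of characters passed to out(); output longer
 * than the buffer is truncated.
 */
int print_via(putc_fn out, void *param, const char *fmt, ...)
{
    char    buf[256];
    va_list args;
    int     n, i;

    va_start(args, fmt);
    n = vsnprintf(buf, sizeof buf, fmt, args);
    va_end(args);

    if (n < 0)                          /* encoding error */
        return 0;
    if (n >= (int)sizeof buf)           /* output was truncated */
        n = (int)sizeof buf - 1;

    for (i = 0; i < n; ++i)
        out((unsigned char)buf[i], param);
    return n;
}

/* Hypothetical sink: appends characters to a small string buffer. */
struct sink { char text[64]; int len; };

int sink_putc(int c, void *param)
{
    struct sink *s = (struct sink *)param;

    if (s->len < (int)sizeof s->text - 1) {
        s->text[s->len++] = (char)c;
        s->text[s->len]   = '\0';
    }
    return c;
}
```

Passing the output routine as a parameter is what lets one formatter serve both a window (as wputc() does) and any other destination, such as the string sink used here.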
Listing A.87. wscroll.c Scroll Window
#include "cur.h"

/*----------------------------------------------------------------------
 * Scroll the window if scrolling is enabled. Return 1 if we scrolled. (I'm
 * not sure if the UNIX function returns 1 on a scroll but it's convenient to
 * do it here. Don't assume anything about the return value if you're porting
 * to UNIX.) Wscroll() is not a curses function. It lets you specify a scroll
 * amount and direction (scroll down by -amt if amt is negative); scroll()
 * is a macro that evaluates to a wscroll call with an amt of 1. Note that the
 * UNIX curses gets very confused when you scroll explicitly (using scroll()).
 * In particular, it doesn't clear the bottom line after a scroll but it thinks
 * that it has. Therefore, when you try to clear the bottom line, it thinks that
 * there's nothing there to clear and ignores your wclrtoeol() commands. Same
 * thing happens when you try to print spaces to the bottom line; it thinks
 * that spaces are already there and does nothing. You have to fill the bottom
 * line with non-space characters of some sort, and then erase it.
 */

wscroll( win, amt )
WINDOW *win;
{
    if( win->scroll_ok )
        doscroll( win->x_org, win->x_org + (win->x_size-1),
                  win->y_org, win->y_org + (win->y_size-1), amt, win->attrib );

    return win->scroll_ok;
}
Passing by value.
Passing by reference.
Though the material covered in Chapter Six applies to most compilers, languages
such as Pascal, which support nested subroutine declarations, arguments passed
by reference, and so forth, have their own set of problems.
B.1 Subroutine Arguments
The first issue is differentiating between arguments passed by value and arguments
passed by reference. With the exception of arrays, all arguments in C are passed by
value: the argument's value (the contents of a variable, for example) is passed to the
subroutine rather than the argument itself. Pascal doesn't differentiate between arrays
and other types, however. Unless you tell the compiler to do otherwise, everything is
passed by value, so the entire array must be copied onto the stack as part of the
subroutine call. Similarly, records must be copied to the stack to pass them to a
subroutine (as must structures in ANSI C).
In a call by reference, the called subroutine can modify the contents of a variable
local to the calling subroutine. A reference to the variable is passed, not its contents. In
C, you'd do this by passing a pointer to the variable, and this approach works in Pascal as
well, except that the compiler takes care of the details. If a subroutine argument is
declared using the var keyword, then the address of the object is passed to the
subroutine. The subroutine itself must access that object indirectly through the pointer.
An alternative approach is not practical in most languages, but should be mentioned
primarily so that you don't use it. You can pass all arguments to the subroutine by value,
just as if they were arguments to a C subroutine. Rather than discarding the arguments
after the subroutine returns, however, you can pop their modified values from the stack
back into the original variables. This approach has the advantage of simplifying the
called subroutine's life considerably because it doesn't have to keep track of the extra
level of indirection. The obvious disadvantage is that large objects, such as arrays,
require a lot of stack space if passed by value. A less obvious, but more serious,
disadvantage is illustrated by the following subroutine:
    function tweedle( var dum: integer; var dee: integer ): integer;
    begin
        dum := 1;
        return dum + dee;
    end
tweedle()'s return value is undefined when a copy-in/copy-out strategy is used and
the arguments are identical, as follows:
    var
        arg:    integer;
        result: integer;
    begin
        arg    := 100;
        result := tweedle( arg, arg );
    end
The assignment to dum should modify both dum and dee, and the function should return
2. If the arguments are copied onto the stack, however, dum alone is modified, so 101 is
returned. The other problem is the undefined order of evaluation in the arguments.
There's no way to tell which of the two arguments will be popped first when the
subroutine returns. If the left argument, which corresponds to dum, is popped first, then
arg is set to 100; otherwise, arg is set to 1.
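The two behaviors can be compared in a small C model. This is only an illustrative sketch, not code from the compiler: tweedle_ref() uses pointers to get true call by reference, and tweedle_copy() is a hypothetical stand-in for copy-in/copy-out in which the dum_first argument (an invention for this example) selects the otherwise undefined order in which the copies are written back.

```c
#include <assert.h>

/* True call by reference: dum and dee may alias the same variable. */
int tweedle_ref( int *dum, int *dee )
{
    assert( dum && dee );
    *dum = 1;
    return *dum + *dee;
}

/* Hypothetical model of copy-in/copy-out. The arguments are copied into
 * locals (standing in for stack slots), the body works on the copies, and
 * the copies are written back afterwards. dum_first selects which copy is
 * written back first; the last write wins when both refer to one variable.
 */
int tweedle_copy( int *dum_arg, int *dee_arg, int dum_first )
{
    int dum, dee, result;

    assert( dum_arg && dee_arg );
    dum = *dum_arg;                 /* copy in */
    dee = *dee_arg;

    dum    = 1;                     /* body of tweedle */
    result = dum + dee;

    if( dum_first )                 /* copy out */
    {
        *dum_arg = dum;
        *dee_arg = dee;
    }
    else
    {
        *dee_arg = dee;
        *dum_arg = dum;
    }
    return result;
}
```

Calling both versions with the same variable for both arguments (initialized to 100) reproduces the values discussed above: 2 for the by-reference version, and 101 for the copying version, with the variable left holding either 100 or 1 depending on the write-back order.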
B.2 Return Values
The next issue is a function's return value. Aggregate objects, such as arrays and
records, can't be returned in registers because they're too big. The usual approach is to
allocate an area of memory using malloc() or some similar memory allocator,
copy the aggregate object into this memory, and return a pointer to the region in a
register. You can also reserve a fixed region of static memory for this purpose: all functions
copy the aggregate object to the same fixed region of memory. A pointer doesn't have to
be returned in this case because everybody knows where the returned object can be
found. On the other hand, the size of the fixed region will limit the size of the returned
object.
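The fixed-region strategy can be sketched in C as follows. The names Return_area, RETURN_AREA_SIZE, point, make_point(), and fetch_point() are all assumptions made for this illustration; a real compiler would generate the equivalent copies itself rather than call library routines.

```c
#include <assert.h>
#include <string.h>

#define RETURN_AREA_SIZE 64         /* assumed limit on a returned object */

static char Return_area[ RETURN_AREA_SIZE ];    /* shared by all functions */

typedef struct point { int x, y; } point;       /* hypothetical record type */

/* Callee: copy the aggregate into the fixed region instead of returning
 * it in a register. No pointer has to be returned.
 */
void make_point( int x, int y )
{
    point p;

    assert( sizeof(p) <= RETURN_AREA_SIZE );    /* region limits the size */
    p.x = x;
    p.y = y;
    memcpy( Return_area, &p, sizeof(p) );
}

/* Caller: copy the result out before the next call overwrites it. */
void fetch_point( point *dst )
{
    assert( dst != NULL );
    memcpy( dst, Return_area, sizeof(*dst) );
}
```

Note that the caller has to copy the object out of the region immediately, because the next function that returns an aggregate reuses the same memory.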
B.3 Stack Frames
The next issue is the stack frame. Pascal supports nested subroutine definitions. A
child subroutine (declared inside the parent subroutine) must be able to access all
variables that are local to the parent in addition to its own local variables. In general, all
variables declared at more outer scoping levels must be accessible to the inner
subroutine. The subroutines in Figure B.1 illustrate the problems. clio can be accessed from
inside all three subroutines. Other local variables can be accessed only from within the
subroutines where they are declared, however. Similarly, calliope can't access
terpsichore or melpomene. The situation is complicated, somewhat, by erato
containing a recursive call to itself.
The stack, when the recursive call to erato is active, is shown in Figure B.2. There
is one major difference between these stack frames and C stack frames: the introduction
of a second pointer called the static link. The dynamic link is the old frame pointer, just
as in C. The static link points, not at the previously active subroutine, but at the parent
subroutine in the nesting sequence (that is, in the declaration). Since erato and thalia are
both nested inside calliope, their static links point at calliope's stack frame. You
can chase down the static links to access the local variables in the outer routine's
Returning aggregate objects.
Static and dynamic links.
804 Notes on Pascal CompilersAppendix B
Figure B.1. Pascal Nested Subroutines
    procedure calliope( polyhymnia: integer );
    var
        clio: integer;
    begin
        procedure erato( urania: integer );
        var
            terpsichore: integer;
        begin
            erato(3);
        end

        procedure thalia( euterpe: integer );
        var
            melpomene: integer;
        begin
            erato(2);
        end

        thalia(1);
    end
Figure B.2. Pascal Stack Frames
[Figure: the run-time stack while the recursive call to erato is active. Four frames are
shown, from the most recent call to the oldest: erato (argument urania = 3, local
terpsichore), erato (argument urania = 2, local terpsichore), thalia (argument 1, local
melpomene), and calliope (argument polyhymnia, local clio). Each frame also holds a
dynamic link, a return address, and a static link.]
stack frame. For example, clio can be accessed from erato with the following C-code:
    r0.pp = WP(fp + 4);         /* r0 = static link */
    x     = W(r0.pp - 8);       /* x  = clio        */
You can access polyhymnia from erato with:
    r0.pp = WP(fp + 4);         /* r0 = static link */
    x     = W(r0.pp + 8);       /* x  = polyhymnia  */
Though it's not shown this way in the current example, it's convenient for the frame
pointer to point at the static, rather than the dynamic, link to make the foregoing
indirection a little easier to do. The static links can be set up as follows: Assign to each
subroutine a declaration level, equivalent to the nesting level at which the subroutine is
declared. Here, calliope is a level 0 subroutine, erato is a level 1 subroutine, and so
forth. Then:
•  If a subroutine calls a subroutine at the same level, the static link of the called
   subroutine is identical to the static link of the calling subroutine.
•  If a subroutine at level N calls a subroutine at level N+1, the static link of the
   called subroutine points at the static link of the calling subroutine.
•  If a subroutine calls a subroutine at a lower (more outer) level, use the following
   algorithm:
       i = the difference in levels between the two subroutines;
       p = the static link in the calling subroutine's stack frame;
       while( --i >= 0 )
           p = *p;
       the static link of the called subroutine = p;
Note that the difference in levels (i) can be figured at compile time, but you must chase
down the static links at run time. Since the static link must be initialized by the calling
subroutine (the called subroutine doesn't know who called it), it is placed beneath the
return address in the stack frame.
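The three rules can be collected into one small C function. This is a simplified sketch under stated assumptions: the frame structure and static_link_for() are hypothetical names, and a static link here points at the parent's frame structure rather than at the static-link slot inside a real stack frame.

```c
#include <assert.h>
#include <stddef.h>

typedef struct frame
{
    struct frame *static_link;  /* frame of the enclosing subroutine */
    const char   *name;         /* for illustration only             */
} frame;

/* Return the static link for a subroutine declared at callee_level that
 * is called from the subroutine whose active frame is caller and whose
 * declaration level is caller_level.
 */
frame *static_link_for( frame *caller, int caller_level, int callee_level )
{
    frame *p;
    int    i;

    assert( caller != NULL );
    assert( callee_level <= caller_level + 1 );     /* scoping rules */

    if( callee_level == caller_level + 1 )  /* calling a child: link to  */
        return caller;                      /* the caller itself         */

    i = caller_level - callee_level;        /* known at compile time     */
    p = caller->static_link;                /* i == 0: same level        */
    while( --i >= 0 )                       /* chased at run time        */
        p = p->static_link;
    return p;
}
```

With the subroutines of Figure B.1, calliope calling thalia (level 0 to level 1) yields a link to calliope's frame, and both thalia calling erato and erato calling itself (level 1 to level 1) copy that same link.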
This organization can cause problems with goto statements. It's possible for a goto
statement in erato to reference a label in calliope, for example. You can't just jump
to the label in this case because the stack will be left with inactive stack frames on it, so
you must treat the goto something like a return, adjusting the frame and stack
pointers before making the jump: you must simulate return statements until the stack
frame for the subroutine that contains the target label is active.
In order to make this simulation easier, Pascal delegates the responsibility of
cleaning up the stack somewhat differently than C: it's the responsibility of the called
function to delete the entire stack frame (both local variables and arguments) before
returning. As an added advantage, the Pascal calling conventions tend to make somewhat
smaller programs because the compiler doesn't have to generate an add-constant-to-
stack-pointer directive after every subroutine call, and programs usually have more
subroutine calls than definitions. Finally, as a matter of convention rather than necessity,
arguments to Pascal subroutines are pushed by the calling function in forward rather than
reverse order. The leftmost argument is pushed first. This convention makes it difficult
to write a Pascal function that takes a variable number of arguments unless the compiler
provides some run-time mechanism for finding and identifying the type of the leftmost
argument.
The Pascal goto.
Called subroutine cleans up stack. Less code is generated.
Arguments pushed in forward order.
This appendix summarizes the grammar used in Chapter Six. The organization here
is different from that in Chapter Six in that it's more hierarchical. Line numbers in the
following listing don't match the ones in Chapter Six. End-of-production actions are not
shown, though imbedded actions are indicated with a {}. I've replaced the semicolon
required at the end of a yacc or occs production with a blank line. The following
abbreviations are used in the nonterminal names:
    abs       abstract
    arg(s)    argument(s)
    const     constant
    decl      declarator
    def       definition
    expr      expression
    ext       external
    opt       optional
    param     parameter
    struct    structure
Listing C.1. A Summary of the C Grammar in Chapter Six.
%union {                        /* The value stack. */
    char        *p_char;
    symbol      *p_sym;
    link        *p_link;
    structdef   *p_sdef;
    specifier   *p_spec;
    value       *p_val;
    int         num;            /* Make short if sizeof(int) > sizeof(int*). */
    int         ascii;
}

%term STRING            /* String constant.                                  */
%term ICON              /* Integer or long constant including '\t', etc.     */
%term FCON              /* Floating-point constant.                          */
A Grammar for CAppendix C 807

%term TYPE              /* int char long float double signed unsigned short  */
                        /* const volatile void                               */
%term STRUCT            /* struct union                                      */
%term ENUM              /* enum                                              */

%term TTYPE                     /* type created with typedef                 */
%nonassoc <ascii> CLASS         /* auto extern register static typedef       */
%nonassoc NAME
%nonassoc ELSE

%term RETURN GOTO
%term IF ELSE
%term SWITCH CASE DEFAULT
%term BREAK CONTINUE
%term WHILE DO FOR
%term LC RC                     /* { }                                       */
%term SEMI                      /* ;                                         */
%term ELLIPSIS                  /* ...                                       */

%left  COMMA                            /* ,                                 */
%right EQUAL <ascii> ASSIGNOP           /* =  *= /= %= += -= <<= >>= &= |= ^= */
%right QUEST COLON                      /* ? :                               */
%left  OROR                             /* ||                                */
%left  ANDAND                           /* &&                                */
%left  OR                               /* |                                 */
%left  XOR                              /* ^                                 */
%left  AND                              /* &                                 */
%left  <ascii> EQUOP                    /* == !=                             */
%left  <ascii> RELOP                    /* <= >= < >                         */
%left  <ascii> SHIFTOP                  /* >> <<                             */
%left  PLUS MINUS                       /* + -                               */
%left  STAR <ascii> DIVOP               /* * / %                             */
%right SIZEOF <ascii> UNOP INCOP        /* sizeof ! ~ ++ --                  */
%left  LB RB LP RP <ascii> STRUCTOP     /* [ ] ( ) . ->                      */

High-Level Definitions

program : ext_def_list { clean_up(); }

ext_def_list
        : ext_def_list ext_def
        | /* epsilon */

ext_def : opt_specifiers ext_decl_list {} SEMI
        | opt_specifiers {} SEMI
        | opt_specifiers funct_decl {} def_list {} compound_stmt

ext_decl_list
        : ext_decl
        | ext_decl_list COMMA ext_decl

ext_decl: var_decl
        | var_decl EQUAL initializer
        | funct_decl

opt_specifiers
        : CLASS TTYPE
        | TTYPE
        | specifiers
        | /* empty */

Specifiers

specifiers
        : type_or_class
        | specifiers type_or_class

type    : type_specifier
        | type type_specifier

type_or_class
        : type_specifier
        | CLASS

type_specifier
        : TYPE
        | enum_specifier
        | struct_specifier

Enumerated Constants

enum_specifier
        : enum name opt_enum_list
        | enum LC enumerator_list RC

opt_enum_list
        : LC enumerator_list RC
        | /* empty */

enum    : ENUM

enumerator_list
        : enumerator
        | enumerator_list COMMA enumerator

enumerator
        : name
        | name EQUAL const_expr

Variable Declarators

var_decl: new_name                              %prec COMMA
        | var_decl LP RP
        | var_decl LP var_list RP
        | var_decl LB RB
        | var_decl LB const_expr RB
        | STAR var_decl                         %prec UNOP
        | LP var_decl RP

new_name: NAME
name    : NAME

Function Declarators

funct_decl
        : STAR funct_decl
        | funct_decl LB RB
        | funct_decl LB const_expr RB
        | LP funct_decl RP
        | funct_decl LP RP
        | new_name LP RP
        | new_name LP {} name_list {} RP
        | new_name LP {} var_list {} RP

name_list
        : new_name
        | name_list COMMA new_name

var_list: param_declaration
        | var_list COMMA param_declaration

param_declaration
        : type var_decl
        | abstract_decl
        | ELLIPSIS

Abstract Declarators

abstract_decl
        : type abs_decl
        | TTYPE abs_decl

abs_decl: /* epsilon */
        | LP abs_decl RP LP RP
        | STAR abs_decl
        | abs_decl LB RB
        | abs_decl LB const_expr RB
        | LP abs_decl RP

Structures

struct_specifier
        : STRUCT opt_tag LC def_list RC
        | STRUCT tag

opt_tag : tag
        | /* empty */

tag     : NAME

Local Variables and Function Arguments

def_list: def_list def
        | /* epsilon */

def     : specifiers decl_list {} SEMI
        | specifiers SEMI

decl_list
        : decl
        | decl_list COMMA decl

decl    : funct_decl
        | var_decl
        | var_decl EQUAL initializer
        | var_decl COLON const_expr             %prec COMMA
        | COLON const_expr                      %prec COMMA

initializer
        : expr                                  %prec COMMA
        | LC initializer_list RC

initializer_list
        : initializer
        | initializer_list COMMA initializer

Statements

compound_stmt
        : LC {} local_defs stmt_list RC {}

local_defs
        : def_list

stmt_list
        : stmt_list statement
        | /* epsilon */

statement
        : SEMI
        | compound_stmt
        | expr SEMI
        | RETURN SEMI
        | RETURN expr SEMI
        | GOTO target SEMI
        | target COLON {} statement
        | SWITCH LP expr RP {} compound_stmt
        | CASE const_expr COLON
        | DEFAULT COLON
        | IF LP test RP statement
        | IF LP test RP statement ELSE {} statement
        | WHILE LP test RP {} statement
        | FOR LP opt_expr SEMI test SEMI {} opt_expr RP {} statement
        | DO {} statement WHILE {} LP test RP SEMI
        | BREAK SEMI
        | CONTINUE SEMI

test    : {} expr
        | /* empty */

target  : NAME

Expressions

opt_expr: expr
        | /* epsilon */

const_expr
        : expr                                  %prec COMMA

expr    : expr COMMA {} non_comma_expr
        | non_comma_expr

non_comma_expr
        : non_comma_expr QUEST {} non_comma_expr COLON {} non_comma_expr
        | non_comma_expr ASSIGNOP non_comma_expr
        | non_comma_expr EQUAL non_comma_expr
        | or_expr

or_expr : or_list
or_list : or_list OROR {} and_expr
        | and_expr

and_expr: and_list
and_list: and_list ANDAND {} binary
        | binary

binary  : binary RELOP binary
        | binary EQUOP binary
        | binary STAR binary
        | binary DIVOP binary
        | binary SHIFTOP binary
        | binary AND binary
        | binary XOR binary
        | binary OR binary
        | binary PLUS binary
        | binary MINUS binary
        | unary

unary   : LP expr RP
        | FCON
        | ICON
        | NAME
        | string_const                          %prec COMMA
        | SIZEOF LP string_const RP             %prec SIZEOF
        | SIZEOF LP expr RP                     %prec SIZEOF
        | SIZEOF LP abstract_decl RP            %prec SIZEOF
        | LP abstract_decl RP unary             %prec UNOP
        | MINUS unary                           %prec UNOP
        | UNOP unary
        | unary INCOP
        | INCOP unary
        | AND unary                             %prec UNOP
        | STAR unary                            %prec UNOP
        | unary LB expr RB                      %prec UNOP
        | unary STRUCTOP NAME                   %prec STRUCTOP
        | unary LP args RP
        | unary LP RP

string_const
        : STRING
        | string_const STRING

args    : non_comma_expr                        %prec COMMA
        | non_comma_expr COMMA args
This appendix is a user's manual for LEX,1 a program that translates an input file
made up of intermixed regular expressions and C source code into a lexical analyzer.
Appendix E contains a user's manual for occs, a compiler-compiler modeled after the
UNIX yacc utility that translates an augmented, attributed grammar into a bottom-up
parser.
This (and the next) appendix describe LEX and occs in depth. LEX and occs are both
very similar to the equivalent UNIX utilities, and you can use the UNIX lex and yacc rather
than LEX and occs to implement the compiler in Chapter Six, though you should read the
UNIX documentation for these programs rather than this (and the next) appendix if you
do. Footnotes are used to point out differences between LEX and occs and their UNIX
counterparts; you can ignore them if you're not interested in using the UNIX programs.
In general, you should have little difficulty going from one system to the other.
D.1 Using LEX and Occs Together
LEX and occs work together in an integrated fashion, though LEX can be used without
occs for many noncompiler applications. Both utilities can be viewed as C
preprocessors. LEX takes as input files made up of interspersed LEX directives and C source code
and outputs a table-driven lexical analyzer. Occs outputs a parser. The interaction
between LEX and occs is pictured in Figure D.1.
The LEX output file is called lexyy.c, and this file contains a tokenizer function called
yylex(). This function reads characters from the input stream, returning a token (and
the associated lexeme) on every call. It is called by the occs-generated parser every time
a token is required. Similarly, occs creates a file called yyout.c, which includes a parser
subroutine called yyparse(). Somewhere in your program, there must be a call to
1. I'm calling it LEX to differentiate it from the UNIX program of the same name, which will be called lex
throughout this manual. You can pronounce it as "leek" if you want to, but I wouldn't advise it.
2. UNIX lex creates a file called lex.yy.c; the lexical analyzer function is still called yylex(), however.
3. The yacc equivalent to yyout.c is y.tab.c; yyparse() is used by both programs.
Figure D.1. Creating a Compiler with LEX and occs
[Figure: data flow through the tools; the UNIX equivalents are given in parentheses.
language.lex (scanner description) is processed by LEX (lex) to create lexyy.c (lex.yy.c),
which contains the scanner, yylex(). language.y (parser description) is processed by occs
(yacc) to create yyout.c (y.tab.c), which contains the parser, yyparse(); yyout.h (y.tab.h),
the token definitions; yyout.doc (y.output), the state-machine description; and yyout.sym,
the symbol table. The C compiler (cc) translates the generated source code, and the linker
(ld) combines the result with l.lib (libl.lib, libl.a) and other libraries and objects to
create the compiler.]
yyparse() to start up the parser.
The l.lib library contains various subroutines that are used by the lexical analyzer
and should be linked into the final program. Note that l.lib includes a main()
subroutine which loops, calling yylex() repetitively until yylex() returns zero. You can,
of course, provide your own main(), and the one in the library will not be linked. [You
must do this in the final compiler or else the parser will never be called. It's a common
mistake to forget to provide a main().]
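The loop performed by the library's main() can be pictured with the following sketch. It is not the library source: drain_scanner() is a hypothetical name, and the scanner is passed in as a function pointer only so that the example compiles without a LEX-generated yylex(). fake_yylex() is a stand-in that returns three tokens and then zero.

```c
#include <assert.h>

/* Call the scanner until it returns zero (end of input), discarding the
 * tokens. Returns the number of tokens that were seen.
 */
int drain_scanner( int (*lex)(void) )
{
    int ntokens = 0;

    assert( lex != 0 );
    while( (*lex)() != 0 )
        ++ntokens;
    return ntokens;
}

/* Stand-in for yylex(): returns three nonzero tokens, then zero. */
static int Fake_tokens[] = { 3, 1, 2, 0 };
static int Fake_pos      = 0;

int fake_yylex( void )
{
    return Fake_tokens[ Fake_pos ] ? Fake_tokens[ Fake_pos++ ] : 0;
}
```

A main() for a complete compiler would call yyparse() instead of looping on the scanner directly, because the parser calls yylex() itself whenever it needs a token.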
There are several steps in creating a workable compiler using LEX and occs. The first
step is to write a lexical-analyzer specification as a LEX input file, and to define a
grammar in a way acceptable to occs (all this is discussed below and in Appendix E). You
then run the input files through the appropriate programs, generating several .c and .h
files.
LEX creates lexyy.c, as was discussed earlier. Occs actually creates two source files:
yyout.c contains source code for the parser and tables; yyout.h contains #defines for all
the tokens.5
4. The UNIX programs use /usr/lib/libl.lib (alternately named /usr/lib/libl.a in some systems). Incorporate this
library into your program with a cc files... -ll directive when you link.
5. yacc calls the file y.tab.h and generates the file only if you specify -d on the command line. Occs always
generates yyout.h, however.
814 LEXAppendix D
One of these token values (and only these values) must be returned by your lexical
analyzer for each input symbol.6 The special symbol _EOI_, always #defined as zero
in yyout.h, is returned by the LEX-generated analyzer at end of input. You should not
return this value yourself as a token value, or occs (and yacc) will get confused.7 A
typical development cycle with Microsoft C looks something like this:8
    vi compiler.lex                       Create LEX input specification.
    vi compiler.y                         Create occs input specification.
    lex compiler.lex                      Create lexyy.c from LEX input specification.
    occs -vD compiler.y                   Create yyout.c and yyout.h from occs input specification.
    cl -c lexyy.c                         Compile everything and link.
    cl -c yyout.c
    cl lexyy.obj yyout.obj -link l.lib
The various files generated by LEX and occs are summarized in Table D.1. The
yyout.h file is generally #included in the LEX input specification, so you shouldn't
compile lexyy.c until you've run occs.
D.2 The LEX Input File: Organization
The remainder of this appendix describes the lexical-analyzer generator, LEX. My
implementation of LEX is a proper subset of the similarly named UNIX utility.9 The input
file is divided into three parts, separated from one another by %% directives:
    definitions
    %%
    rules
    %%
    code
The %% must be the leftmost two characters on the line. The definitions section
contains two things: macro definitions and user-supplied code. All code must be delimited
6. yacc also lets you use literal characters as tokens. If you put a %term '+' into the yacc input file, you
can return an ASCII '+' from the lexical analyzer. Occs doesn't support this mechanism. You'd have to
put a %term PLUS in the occs input file and a return PLUS; statement in the lexical analyzer.
7. yacc doesn't define _EOI_ for you; it nonetheless treats 0 as an end-of-input marker.
8. The UNIX development cycle looks like this:
    vi compiler.lex
    vi compiler.y
    lex compiler.lex
    yacc -vd compiler.y
    cc -c lex.yy.c
    cc -c y.tab.c
    cc -o compiler y.tab.o lex.yy.o -ll
9. For those of you who are familiar with the UNIX program, the following features are not supported by LEX:
    •  The / lookahead operator is not supported.
    •  The %START directive and the <name> context mechanism are not supported.
    •  The {N,M} operator (repeat the previous regular expression N to M times) is not supported.
    •  The REJECT mechanism is not supported (but yymore() is).
    •  The %S, %T, %C, and %R directives are not supported. Similarly, the internal-array-size modification
       directives (%p, %n, %e, %a, %k, and %o) aren't supported.
    •  The internal workings of the output lexical analyzer (especially the input mechanism) bear virtually no
       relation to the UNIX program. You won't be able to hack the LEX output in the same ways that you're
       used to.
Table D.1. LEX and Occs Output Files

Generated by   Occs/LEX name   UNIX name       Command-line switches
--------------------------------------------------------------------------
LEX/lex        lexyy.c         lex.yy.c        none needed
    C source code for lexical analyzer function, yylex().

occs/yacc      yyout.c         y.tab.c         none needed
    C source code for parser subroutine, yyparse().

occs/yacc      yyout.h         y.tab.h         occs: none needed; yacc: -d
    #defines for the tokens.

occs/yacc      yyout.sym       not available   -s -S -D
    Symbol table showing various internal values used by the parser,
    including production numbers, TOKEN values, and so forth. This table
    is useful primarily for debugging.

occs/yacc      yyout.doc       y.output        occs: -v -V; yacc: -v
    A human-readable description of the LALR state machine used by the
    parser. The file also contains copies of all the error and warning
    messages that are sent to the screen as the input is processed. It's
    useful for debugging the grammar. The -V version of the table holds
    LALR(1) lookaheads for all the items, and FIRST sets for all the
    nonterminals. The -v version contains lookaheads only for those items
    that trigger reductions, and does not contain FIRST sets.
with %{ and %} directives like this:
    %{
    #include <stdio.h>
    extern int something;       /* This code is passed to the output */
    %}
    %%
The percent sign must be the leftmost character on the line.10 Code blocks in the
definitions section should contain #includes, external definitions, global-variable
definitions, and so forth. In general, everything that would go at the top of a C program
should also go at the top of the LEX input file.
Anything outside of a %{ %} block is taken to be a macro definition. These consist of
a name followed by white space (blanks or tabs), followed in turn by the macro's
contents. For example:
10. UNIX lex supports directives like this:
        %{ #include <stdio.h> %}
    LEX requires you to use the following, however:
        %{
        #include <stdio.h>
        %}
Comments.
Whitespace.
Metacharacters.
    digit       [0-9]           /* decimal digit        */
    hexdigit    [0-9a-fA-F]     /* hexadecimal digit    */
    alpha       [a-zA-Z]        /* alphabetic character */
Macro expansion is discussed further, below. C-style comments are permitted anywhere
in the definitions section; they are ignored along with any preceding white space. C
comments in a %{ ... %} block are passed to the output, however.
Skipping the middle, rules, section for a moment, the third, code, section is just
passed to the output file verbatim. Generally, the initial definitions section should
contain only code that is used in the rules section; moreover, it should contain only variable
definitions, extern statements, and the like. All other code, including the subroutines
that are called from the rules section, should be defined in the last, code, section.
D.3 The LEX Rules Section
The rules in the middle part of the LEX input specification have two components: a
regular expression like the ones described both in Chapter Two and below, and C-
source-code fragments that are executed when input that matches the regular expression
is found. The regular expressions must start in the leftmost column, and they are
separated from the code by white space (spaces or tabs). This means that you can't put a
literal space character or tab into a regular expression. Use the \s or \t escape
sequence for this purpose. You can also use double quotes (a " " matches a space), but
the escape sequences are preferable because you can see them. The code part of the rule
may take up several lines, provided that the lines begin with white space. That is, all
lines that start with white space are assumed to be part of the code associated with the
previous regular expression. A possible rule section looks like this:
    %%
    lama        printf("Priest\n");     /* A one-ell lama               */
    llama       printf("Beast\n");      /* A two-ell llama              */
    lll+ama     printf("Fire!\n");      /* A three-ell lllama, four     */
                                        /* alarmer, and so forth.       */
    %%
which, when given the input:
    lama
    llama
    lllama
recognizes the various kinds of lamas.
11. See [Nash].
D.3.1 LEX Regular Expressions
The LEX regular-expression syntax is essentially the one described in Chapter Two.
It is summarized below, along with discussions of LEX's peculiarities. Regular
expressions are formed of a combination of normal characters and metacharacters, which
have special meanings. The following characters are used as metacharacters; they are
described in depth below:
    * + ? { } [ ] ( ) . ^ $ " \
Newline, nonstandard definition.
Note that the term "newline" is used in a nonstandard way in the following
definitions. A newline here is either an ASCII carriage return or a linefeed character
('\r' or '\n'). This usage is consistent with most MS-DOS applications because the
input characters are read in binary, not translated, mode, so all input lines are terminated
with a \r\n pair. By recognizing either '\n' or '\r' as a newline, LEX-generated
code can work in the UNIX environment, in MS-DOS translated mode, and in MS-DOS binary
mode as well. If you want strict UNIX compatibility, you can use the -u command-line
switch discussed below to redefine newline to be a single linefeed character.
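The definition amounts to the following test, shown here as a sketch rather than as the code LEX actually uses. The function name is_newline() and its unix_mode argument (which models the effect of the -u switch) are assumptions made for the example.

```c
/* Return nonzero if c counts as a newline. When unix_mode is nonzero,
 * only a linefeed qualifies; otherwise either a carriage return or a
 * linefeed does.
 */
int is_newline( int c, int unix_mode )
{
    if( unix_mode )
        return c == '\n';

    return c == '\n' || c == '\r';
}
```

In the default mode, a line that ends in a \r\n pair therefore ends at the carriage return, and the linefeed that follows is itself treated as a newline.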
Matching normal characters, escape sequences.
c       A single character that is not used as a metacharacter forms a regular
        expression. The multiple-character sequences shown in Table D.2 also match single
        characters. These are known as escape sequences. Note that in MS-DOS,
        binary-mode input is used by LEX. This means that all lines are terminated
        with the two-character sequence \r\n. A \n matches the linefeed, but not the
        carriage return. It can be risky to test for an explicit \r or \n in the MS-DOS
        environment (but see ^ and $, below). Double quotes can also be used to take
        away the meaning of metacharacters: "*?" matches an asterisk followed by a
        question mark.
Table D.2. LEX Escape Sequences

Escape                                                                Recognized
sequence   Matches                                                    by UNIX
--------------------------------------------------------------------------------
\b         backspace                                                  yes
\f         formfeed                                                   yes
\n         linefeed (end-of-line character)                           yes
\r         carriage return                                            yes
\s         space                                                      yes
\t         tab                                                        yes
\e         ASCII ESC character ('\033')                               no
\ddd       number formed of the three octal digits ddd                yes
\xdd       number formed of the two hex digits dd. Two digits must    no
           be present; use \x02 for small numbers.
\^C        (backslash, followed by an up arrow, followed by a         no
           letter) matches a control character. This example
           matches Ctrl-C.
\c         \ followed by anything else matches that character; so,    yes
           \. matches a dot, \* matches an asterisk, and so forth.
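One way to decode the sequences in Table D.2 is sketched below. This is an illustration written for this summary, not the routine used inside LEX: decode_escape() and hexval() are hypothetical names, and the code assumes well-formed input (it does not check that the digits after \x or a leading octal digit are valid, or that the string is long enough).

```c
#include <assert.h>
#include <ctype.h>

static int hexval( int c )              /* value of one hex digit */
{
    return isdigit( (unsigned char)c ) ? c - '0'
                                       : toupper( (unsigned char)c ) - 'A' + 10;
}

/* Decode the escape sequence that starts at *s (which must point at a
 * backslash) and return the character that it represents. *s is advanced
 * past the sequence.
 */
int decode_escape( const char **s )
{
    const char *p = *s;
    int         c;

    assert( *p == '\\' );
    ++p;                                /* skip the backslash */

    switch( *p )
    {
    case 'b': c = '\b';   ++p; break;
    case 'f': c = '\f';   ++p; break;
    case 'n': c = '\n';   ++p; break;
    case 'r': c = '\r';   ++p; break;
    case 's': c = ' ';    ++p; break;
    case 't': c = '\t';   ++p; break;
    case 'e': c = '\033'; ++p; break;
    case '^':                           /* \^C: control character      */
        c  = toupper( (unsigned char)p[1] ) & 0x1f;
        p += 2;
        break;
    case 'x':                           /* \xdd: two hex digits        */
        c  = (hexval( p[1] ) << 4) | hexval( p[2] );
        p += 3;
        break;
    default:
        if( *p >= '0' && *p <= '7' )    /* \ddd: three octal digits    */
        {
            c  = ((p[0] - '0') << 6) | ((p[1] - '0') << 3) | (p[2] - '0');
            p += 3;
        }
        else                            /* \c: the character itself    */
            c = *p++;
        break;
    }

    *s = p;
    return c;
}
```

The caller passes the address of its scan pointer, so that scanning can resume at the first character after the escape sequence.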
Concatenation.
cc      Two regular expressions concatenated form a regular expression.
OR operator.
e|e     Two regular expressions separated by a vertical bar recognize either the
        expression to the left of the bar or the expression to the right of the bar.
Grouping.
(e)     Parentheses are used for grouping.
Beginning-of-line anchor.
^       An up arrow anchors the pattern to beginning of line. If the first character in
        the expression is a ^, then the expression is recognized only if it is at the far
        left of a line.
End-of-line anchor.
$       A dollar sign anchors the pattern to end of line. If the last character in the
        expression is a $, then the expression is recognized only if it is at the far right
        of a line.
Match any character.
.       A period (pronounced "dot") matches any character except the newline ('\r' or
        '\n' normally, '\n' if -u is specified).
Character classes.
[...]   Brackets match any of the characters enclosed in the brackets. If the first
        character following the bracket is an up arrow (^), any character except the ones
        specified is matched. Only seven characters have special meaning inside a
        character class:
            {   Start of macro name
            }   End of macro name
            ]   End of character class
            -   Range of characters
            ^   Indicates negative character class
            "   Takes away special meaning of characters up to next quote mark
            \   Takes away special meaning of next character
        Use \], \-, \\, and so forth, to put these into a class. Since other
        metacharacters such as *, ?, and + are not special here, the expression [*?+] matches a
        star, question mark, or plus sign. Also, a negative character class doesn't
        match a newline character: [^a-z] actually matches anything except a
        lower-case character or a newline. Use ([^a-z]|[\r\n]) to match the
        newline, too. Note that a negative character class must match a character:
        ^[^a-z]$ does not match an empty line. The line must have at least one
        character on it, but that character may not be a lower-case letter.
Empty character classes.  LeX supports two special character classes not supported by
the UNIX lex. An empty character class ([]) matches all white space, with white space
defined liberally as any character whose ASCII value is less than or equal to a
space (all control characters and a space). An empty negative character class
([^]) matches anything but white space.
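The liberal white-space definition amounts to a single comparison against the space character. A minimal sketch, assuming ASCII input; the function names lex_is_white and lex_is_nonwhite are hypothetical and are not part of LeX:

```c
/* Hypothetical models of the two empty character classes (ASCII assumed). */

/* Models []: every control character and the space itself. */
int lex_is_white(int c)
{
    return c >= 0 && c <= ' ';
}

/* Models [^]: anything whose ASCII value is above a space. */
int lex_is_nonwhite(int c)
{
    return c > ' ';
}
```

Under this definition a tab, a carriage return, and a NUL all count as white space, which is considerably broader than the isspace() test in the standard C library.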
Closure operators. * + ?  A regular expression followed by a * (pronounced star) matches
that expression repeated zero or more times; a + matches one or more repetitions; a ?
matches zero or one repetitions.
Greedy algorithm.  Note that LeX uses a greedy algorithm to match a pattern. It matches
the longest possible string that can satisfy the regular expression. Consequently,
something like .* is dangerous (it absorbs all the characters on the line). The
expression (.|\n)* tries to absorb an entire input file, but will probably
cause an input-buffer overflow.
Macro expansion. {name}  Braces are used to expand a macro name. (The {m,n} notation
described in Chapter 2 is not supported by LeX.) The name must be part of a macro
definition in the first part of the input file.
You can use the macros as an element of a regular expression by surrounding
the name with curly braces. Given the earlier macro definitions:
    digit      [0-9]            /* decimal digit        */
    hexdigit   [0-9a-fA-F]      /* hexadecimal digit    */
    alpha      [a-zA-Z]         /* alphabetic character */
    %%
a hex number in C can be recognized with 0x{hexdigit}+. An
alphanumeric character can be recognized with the following expression:
    ({alpha}|{digit})
Macros nest, so you could do the following:
    digit      [0-9]
    alpha      [a-zA-Z]
    alnum      ({alpha}|{digit})
You could do the same with:
    digit      0-9
    alpha      a-zA-Z
    alnum      [{digit}{alpha}]
The ^ and $ metacharacters work properly in all MS-DOS input modes, regardless of
whether lines end with \r\n or a single \n. Note that the newline is not part of the
lexeme, even though it must be present for the associated expression to be recognized. Use
\r\n to put the end-of-line characters into the lexeme. (The \r is not required in
UNIX applications; in fact it's an error under UNIX.) Note that, unlike the vi editor,
^$ does not match a blank line. You'll have to use an explicit search such as \r\n\r\n to
find empty lines.
The operator precedence is summarized in Table D.3. All operators associate left
to right.

Table D.3. Regular-Expression Operator Precedence

operator   description                              level
( )        parentheses for grouping                 1 (highest)
[ ]        character classes                        2
* + ?      closure: 0 or more, 1 or more, 0 or 1    3
ee         concatenation                            4
|          OR                                       5
^ $        anchors to beginning and end of line     6 (lowest)
By default, input that doesn't match a regular expression is silently absorbed. You
can cause an error message to be printed when bad input is encountered by putting the
following into the definitions section:
    %{
    #define YYBADINP
    %}
In general, it's a good idea to put one rule of the form
    .       ;
at the very end of the rules section. This default rule matches all characters not
recognized by a previous expression, and does nothing as its action.
When two conflicting regular expressions are found in the input file, the one that
comes first is recognized. The expression that matches the longest input string always
takes precedence, however, regardless of the ordering of expressions in the file.
Consider the following input file:
    a.      return DOO;
    aa+     return WHA;
Given the input aa, DOO will be returned because a. comes first in the file. Given the
input aaa, the second expression is active because the longer string is matched.
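The two-step rule (longest match first, file order as the tie breaker) can be sketched in C as follows. This is only an illustration of the selection rule, not the table-driven code that LeX generates; the hand-written matchers match_a_dot and match_aa_plus and the function choose_token are hypothetical, and the numeric token values are arbitrary.

```c
#include <stddef.h>

enum { NO_MATCH = 0, DOO = 1, WHA = 2 };   /* arbitrary token values */

/* Length of the prefix of s matched by "a." (0 if none). The dot is
 * taken to exclude '\r' and '\n', as in the default MS-DOS mode.
 */
static size_t match_a_dot(const char *s)
{
    if (s[0] == 'a' && s[1] != '\0' && s[1] != '\n' && s[1] != '\r')
        return 2;
    return 0;
}

/* Length of the longest prefix of s matched by "aa+" (0 if none). */
static size_t match_aa_plus(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a')
        ++n;
    return n >= 2 ? n : 0;
}

/* Longest match wins; when lengths tie, the rule listed first wins
 * because a later rule replaces the best one only if strictly longer.
 */
int choose_token(const char *input, size_t *len)
{
    size_t lens[2];
    int    tokens[2] = { DOO, WHA };       /* file order */
    int    best = -1, i;

    lens[0] = match_a_dot(input);
    lens[1] = match_aa_plus(input);
    for (i = 0; i < 2; ++i)
        if (lens[i] > 0 && (best < 0 || lens[i] > lens[best]))
            best = i;

    *len = best < 0 ? 0 : lens[best];
    return best < 0 ? NO_MATCH : tokens[best];
}
```

With the input aa both rules match two characters, so the earlier rule (DOO) is chosen; with aaa the second rule matches three characters and takes precedence.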
12. UNIX lex behaves differently. It prints the unrecognized string to standard output and doesn't support
YYBADINP.
D.3.2 The Code Part of the Rule

The code portion of the rule can be any legal C code. It ends up in the output as part
of a switch statement, so be careful with breaks. yylex() does not return unless an
explicit return statement is included in one of the actions associated with a regular
expression. On the other hand, a legal C statement must be attached to every regular
expression, even if that statement is only a single semicolon.
The action code does not have to be surrounded with curly braces, but if you do so,
you can use local variables in the code, as in:
    expression  {
                    int i;
                }
The scope of the local variable is limited by the curly braces, as in normal C. Global
variables must be defined in a %{...%} block in the definitions section.
If a vertical bar is used in place of an action, the action associated with the next rule
is used. For example:
    0x[0-9a-fA-F]+   |
    0[0-7]+          |
    [1-9][0-9]*      return( NUMERIC_CONSTANT );
causes the return statement to be executed if any of the three expressions are matched.
Several subroutines, variables, and macros are available for use inside any of your
actions (in either the rules or code section, but not in a %{...%} block in the definitions
section). The subroutines are all in l.lib, which must be linked to your program. You can
provide your own versions of all of them, however, in which case the ones in the library
are not linked. The macro definitions are all embedded in the LeX output file itself, and
their definitions are discussed in depth in Chapter Two. LeX supports the following
subroutines and macros:

Default main().
void main()
    The short main() subroutine shown in Listing D.1 is provided in l.lib as a
    convenience in testing LeX-generated recognizers. It just loops, calling yylex()
    until yylex() returns zero. It's linked automatically if you don't provide a
    main() of your own. This default main() can cause problems if you forget to
    put a main() in an occs-generated compiler. In this case, main() is fetched
    from the library, and the parser is never activated.

Listing D.1. yymain.c: The main() Subroutine in l.lib

    #include <stdlib.h>
    #include <tools/debug.h>
    #include <tools/l.h>

    void main( argc, argv )
    char **argv;
    int  argc;
    {
        /* A default main module to test the lexical analyzer. */

        if( argc == 2 )
            ii_newfile( argv[1] );
        while( yylex() )
            ;
        exit( 0 );
    }

Current lexeme.
char *yytext
    This variable points at the lexeme that matches the current regular expression.
    The string is '\0' terminated.

Lexeme length.
int yyleng
    The number of characters in yytext, excluding the final '\0'.

Input line number.
int yylineno
    The current input line number. If the lexeme spans multiple lines, it is the
    number of the last line of the lexeme.

Input a character to the current lexeme.
int input()
    Read (and return) the next input character, the one following the last character
    in the lexeme. The new character is added to the end of the lexeme (yytext) by
    LeX and yyleng is adjusted accordingly. Zero is returned at end of file, -1 if the
    lexeme is too long.[13]
Echo current lexeme to the screen.
ECHO
    Print the current lexeme to a stream named yyout, which is initialized to
    stdout. This is just an alias for fprintf(yyout, "%s", yytext). Do not use
    this macro if you intend to use the interactive debugging environment
    described in Appendix E; it messes up the windows.

Output a character.
void output(int c)
    Print c to the output stream. The FILE pointer used for this stream is called
    yyout and is initialized to stdout. You can change the file with fopen() if
    you like. Do not use this function if you intend to use the occs interactive
    debugging environment described in Appendix E; it messes up the windows.

Push back one character.
void unput(int c)
    Pushes the indicated character back into the input stream. Note that c effectively
    overwrites yytext[yyleng-1] and then makes the lexeme one character
    smaller. Both yyleng and yytext are modified; c will be read as the next input
    character.[14]
Print internal error message.
YYERROR(char *s)
    This macro is not supported by the UNIX lex. It is used to print internal error
    messages generated by the lexical analyzer itself. The default macro prints the string,
    s, to stderr. You will probably want to redefine this macro for occs; put the
    following into the LeX input file's definition section:
        %{
        #define YYERROR(str)  yyerror("%s\n", str);
        %}

13. UNIX lex doesn't return -1 and it doesn't modify yytext or yyleng; it just returns the next input
character.
14. UNIX lex does not modify yytext or yyleng. It just pushes back the character.
Push back multiple input characters.
void yyless(int n)
    Push n characters back into the input stream. Again, yytext and yyleng are
    adjusted to reflect the pushed back characters.[15] You cannot push back more than
    yyleng characters.

Initialize LeX parser.
void yy_init_lex()
    This subroutine is called by yylex() before it reads any input. You can use it to
    do any run-time initializations that are required before the lexical analyzer starts
    up. It's particularly handy if you want to initialize variables that are declared
    static in the definitions part of the LeX input file. You can define a
    yy_init_lex() that does the initializations at the bottom of the input file, and
    this function is called automatically. The default routine, in l.lib, does nothing.
    It's in Listing D.2.[16]

Listing D.2. yyinitlx.c: Initialize LeX
Keep processing after expression is recognized.
void yymore()
    Invoking this macro causes yylex() to continue as if the regular expression had
    not been recognized. Consider the following regular expression and action
    (which recognizes a subset of the legitimate string constants in C):
        \"[^\"]*\"    {
                          if( yytext[yyleng-2] == '\\' )
                              yymore();
                          else
                              return STRING;
                      }
    The regular expression recognizes a double quote (") followed by anything except
    a second double quote repeated zero or more times. The quotes have to be
    escaped with a backslash because they're special characters to LeX. The problem
    here is escaped double quotes within strings, as in
        "string with a \" in it"
    The regular expression is satisfied when the second double quote is encountered.
    The action code looks at the character preceding the ending quote mark
    (yytext[yyleng-2]). If that character is a backslash, yymore() is called,
    and more input will be processed as if the offending double quote had not been
    seen. The same code is executed a second time when the third double quote is
    found; but here, there's no preceding backslash so the return STRING is executed.
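The effect of the yymore() idiom on a string constant can be modeled outside of LeX with an ordinary loop. The sketch below is an assumption-laden model, not LeX code: the function name string_lexeme_length is hypothetical, and, like the rule it imitates, it handles only a subset of C strings (a string that ends in an escaped backslash, such as "a\\", is misread).

```c
#include <stddef.h>

/* Return the length of the string constant that starts at s[0],
 * including both quote marks, or 0 if s does not start with a quote
 * or the closing quote is missing.
 */
size_t string_lexeme_length(const char *s)
{
    size_t len;

    if (s[0] != '"')
        return 0;

    len = 1;
    for (;;) {
        /* Like [^"]* in the pattern: advance to the next quote. */
        while (s[len] != '\0' && s[len] != '"')
            ++len;
        if (s[len] == '\0')
            return 0;               /* no closing quote */
        ++len;                      /* absorb the quote */

        /* Like the action: if the character before the quote, that is
         * yytext[yyleng-2], is a backslash, keep going (yymore());
         * otherwise the lexeme is complete (return STRING).
         */
        if (s[len - 2] != '\\')
            return len;
    }
}
```

For the input "a\"b" followed by other text, the loop stops at the escaped quote, sees the backslash, continues, and finally returns 6, the length of the whole constant.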
15. UNIX lex does not modify yytext or yyleng; it just pushes back the characters.
16. UNIX lex does not support this mechanism. You're on your own when it comes to initializations.

End-of-input processing.
int yywrap()
    This subroutine is called by the lexical analyzer at end of file. The name stands
    for "go ahead and wrap up the processing," so yywrap() returns true if processing
    should terminate [in which case yylex() returns zero]. The default routine,
    in l.lib, just returns 1. It's in Listing D.3. You can provide your own yywrap(),
    however, and yours will be linked instead of the library routine. Typically, you'd
    use yywrap() to handle a series of input files. It would open the next input file
    in the series with each call, returning zero until there were no more files to open,
    whereupon it would return one. There's an example in Listing D.4.
Listing D.3. yywrap.c: Library Version

    int yywrap()            /* yylex() halts if 1 is returned */
    {
        return( 1 );
    }
Listing D.4. A User-Defined yywrap()

    #include <stdio.h>

    int   Argc;         /* Copy of argc as passed to main(). */
    char  **Argv;       /* Copy of argv as passed to main(). */

    yywrap()
    {
        if( --Argc >= 0 )           /* Another file name remains. */
        {
            if( ii_newfile( *Argv ) != -1 )
            {
                ++Argv;
                return 0;           /* New file opened successfully. */
            }
            fprintf( stderr, "Can't open %s\n", *Argv );
        }
        return 1;
    }

    main( argc, argv )
    int   argc;
    char  **argv;
    {
        Argc = argc - 1;            /* Number of file names.   */
        Argv = argv + 1;            /* Skip the program name.  */

        if( yywrap() == 0 )         /* Open the first file.    */
            while( yylex() )
                ;                   /* Discard all input tokens. */
    }
Control error message for unrecognized input character.
YYBADINP
    If this macro is #defined in a %{...%} block in the definitions section, an error
    message is printed when an input sequence that's not recognized by any regular
    expression is encountered at run time. Otherwise, the unrecognized input is
    silently ignored at run time.[17]
The longest permissible lexeme is 1024 bytes.[18] An error message is printed if you
try to read more characters. If a lexeme is likely to exceed this length, you'll have to
work with one of the low-level input functions that LeX itself uses, synopsized below,
and described in depth in Chapter Two. None of these functions are supported by UNIX
lex.
Low-level input routine.
int ii_input()
    This is a somewhat more flexible input function than input(). Normally, it
    returns the next input character (and stretches the lexeme one character). It
    returns zero on end of file and -1 if it can't read a character because the input
    buffer is full [see ii_flushbuf()]. Note that yyleng is not modified by
    ii_input().

Low-level lookahead.
int ii_lookahead(int n)
    You can use this routine to look ahead in the input without actually reading a
    character. ii_lookahead(0) returns the last character in the lexeme,
    ii_lookahead(1) returns the next character that will be read, and
    ii_lookahead(-1) returns the penultimate character in the lexeme. The
    maximum forward lookahead is 32 characters, and you can look back only as far as
    the beginning of the current lexeme.
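The indexing convention of ii_lookahead() (0 is the last character read, 1 is the next character, negative values reach back into the lexeme) can be pictured with a toy buffer. Everything in this sketch is hypothetical: the toy_input structure and toy_lookahead() are not the real LeX input system, and returning -1 for an out-of-range request is an assumption made for the model only.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LOOKAHEAD 32        /* forward limit quoted in the text */

/* Toy model of the input buffer; not the real LeX data structure. */
typedef struct {
    const char *buf;    /* all of the input, '\0' terminated       */
    size_t      start;  /* index of the first lexeme character     */
    size_t      next;   /* index of the next character to be read  */
} toy_input;

/* Model of the lookahead convention: n == 0 is the last character of
 * the lexeme, n == 1 the next unread character, n == -1 the
 * penultimate lexeme character. Returns -1 (an assumption) when the
 * request falls outside the lexeme, the input, or the 32-character
 * forward limit.
 */
int toy_lookahead(const toy_input *in, int n)
{
    long pos = (long)in->next + n - 1;

    if (n > MAX_LOOKAHEAD)
        return -1;
    if (pos < (long)in->start || pos >= (long)strlen(in->buf))
        return -1;
    return (unsigned char)in->buf[pos];
}
```

With the input abcdef and a current lexeme of bc, a request for 0 yields c, 1 yields d, and -1 yields b; a request for -2 is refused because it would reach back past the start of the lexeme.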
Open new input file.
int ii_newfile(char *name)
    Use this routine to open a new input file. The name is the file name. The routine
    returns a file descriptor for the open file or -1 if it can't open the file. As is the
    case with normal input functions, the global variable errno will hold an error
    code identifying the problem. This routine automatically closes the current input
    file before opening the new one, so it's actually an error if you close the current
    input file explicitly before calling ii_newfile().

Flush low-level input buffers.
int ii_flushbuf()
    This routine flushes the input buffer, discarding all characters currently in it and
    destroying the current lexeme. It can be used to continue reading when
    ii_input() returns -1. Note that yytext, yyleng, and yylineno are all
    invalid after this routine is called. It's usually used only when you're doing
    something like absorbing a long comment, when the contents of the lexeme are
    immaterial. It returns 1 if everything's okay, 0 at end of file.
Change system-level input functions.
int ii_io(int (*open)(), int (*close)(), int (*read)())
    You can use this routine to change the low-level, unbuffered I/O functions used
    by the input routines. It takes three function pointers as arguments: the first is a
    pointer to an open function, the second to a close function, and the third to a read
    function. These should work just like the standard UNIX I/O functions with the
    same name, at least in terms of the external interface. The open function is called
    as follows:
        #include <fcntl.h>
        #ifdef MSDOS
        #   define O_MODE (O_RDONLY|O_BINARY)
        #else
        #   define O_MODE (O_RDONLY)
        #endif
        int fd;

        if( (fd = (*open)(name, O_MODE)) != -1 )
        {
            /* file open was successful */
        }
        return fd;
    It should return an int-sized number that can't be confused with standard input
    (anything but zero), and that same number will be passed to the read and close
    functions as follows:
        char *load_characters_here;  /* base address of the buffer         */
        int  read_this_many_bytes;   /* number of characters requested     */
        int  got;                    /* number of characters actually read */
        int  fd;                     /* value returned from previous open  */

        if( (got = (*read)(fd, load_characters_here, read_this_many_bytes)) == -1 )
            process_end_of_file();

        (*close)( fd );

17. UNIX lex does not support this mechanism. It just writes the unrecognized characters to standard output.
18. 128 bytes is typical in UNIX lex.
19. UNIX lex does not use this input mechanism; rather, it uses the buffered I/O system for its input and gets
input from a stream called yyin. This stream is initialized to stdin, but you can modify it using
fopen(). You must use ii_input() to change the LeX input stream, however.
All variables and subroutines used both by LeX and its input functions start with one
of the following character sequences:[20]
    yy   Yy   YY   ii_
Don't start any of your own symbols with these characters.
D.4 LeX Command-Line Switches
LeX takes various command-line switches that modify its behavior. These are
summarized in Table D.4.[21]
Table D.4. LeX Command-Line Switches

Switch    Description
-cN       Use pair compression, N is the threshold. The default N is 4.
-f        For fast. Don't compress tables.
-h        Suppress header comment that describes state machine.
-H        Print the header only.
-l        Suppress #line directives in the output.
-mname    Use name as template-file name rather than lex.par.
-t        Send output to standard output instead of lexyy.c.
-u        UNIX mode (. is everything but \n).
-v        Verbose mode, print various statistics.
-V        More verbose, print internal diagnostics as LeX runs.

20. You have to worry only about the yy prefix with lex.
21. Of these, only -t and -f are supported by UNIX lex. The lex -v switch prints a one-line summary of internal
statistics only; the LeX -v is more verbose. The lex -n switch suppresses the verbose-mode output, and is
not supported by LeX.

LeX automatically compresses the tables that it generates so that they'll take up less
room in the output. A ten-to-one compression rate is typical. The default compression
method eliminates equivalent rows and columns from the tables and is best for most
practical applications. The -c and -f switches can be used to control the table
compression, however. The -f (for fast) switch eliminates compression entirely, yielding
much larger tables, but also giving a faster lexical analyzer. The -c switch changes the
compression algorithm to one in which each character/next-state pair is stored as a
two-byte object. This method typically (though not always) gives you smaller tables and a
slower lexical analyzer than the default method. The -c switch takes an optional
numeric argument (-c5) that specifies the threshold above which the pair compression
kicks in for a given row of the table. If the row has more than the indicated number of
nonerror transitions, it is not compressed. The default threshold, if no number is
specified, is four. In any event, the -v switch can be used to see the actual table size, so
you can decide which method is most appropriate in a particular situation.
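The space-for-time trade made by pair compression can be seen in a small sketch. The layout below is an assumption for illustration only (the real tables are described in Chapter Two): the types pair and pair_row and the functions pair_next_state and should_pair_compress are hypothetical names, and the threshold test simply restates the rule given above.

```c
#include <stddef.h>

#define NO_TRANSITION (-1)      /* hypothetical error marker */

/* One character/next-state pair, stored in two bytes. */
typedef struct {
    unsigned char ch;           /* input character          */
    unsigned char next;         /* state to go to on ch     */
} pair;

/* A pair-compressed row: only the nonerror transitions are stored. */
typedef struct {
    size_t      npairs;
    const pair *pairs;
} pair_row;

/* Look up the next state with a linear search through the pairs. The
 * search is what makes this method slower than indexing an
 * uncompressed row directly.
 */
int pair_next_state(const pair_row *row, int c)
{
    size_t i;

    for (i = 0; i < row->npairs; ++i)
        if (row->pairs[i].ch == (unsigned char)c)
            return row->pairs[i].next;
    return NO_TRANSITION;
}

/* A row is pair compressed only if it has no more than `threshold`
 * nonerror transitions (the default threshold is 4).
 */
int should_pair_compress(size_t nonerror_transitions, size_t threshold)
{
    return nonerror_transitions <= threshold;
}
```

A row with two transitions occupies four bytes in this form instead of one entry per input character, which is why sparse rows benefit and dense rows are left uncompressed.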
The -h and -H switches are used to control the presence of a larger header comment
that is output at the top of lexyy.c. This comment describes the state machine that LeX
uses to recognize the regular expressions. A lower-case h suppresses this comment. An
upper-case H suppresses all output except the comment.
LeX automatically generates #line directives in the output file. These cause the
compiler to print error messages that reference lines in the original LeX input file
rather than the output file. They are quite useful when you're trying to track down
syntax errors. The #line can confuse source-level debuggers, however, so -l (that's an ell)
is provided to eliminate them.
LeX itself creates only a small part of the output file. The majority is stored in a
template file called lex.par, discussed in depth in Chapter 2. LeX searches for this file, first
in the current directory and then along a path specified by semicolon-delimited directory
names in the LIB environment. The -mname switch can be used to specify an explicit
file name to use as the template. There's no space between the m and the name.
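The search order (current directory first, then each directory named in LIB) can be sketched as a function that produces the candidate path names one at a time. This is a sketch under stated assumptions, not LeX's own code: the name template_candidate is hypothetical, a forward slash is used as the directory separator, and an empty path component is treated as the end of the list.

```c
#include <stddef.h>
#include <string.h>

/* Build the n-th candidate path for `file` in `out`.
 *   n == 0   the file in the current directory
 *   n >= 1   the file in the n-th directory of the semicolon-delimited
 *            list `lib` (for example, the value of getenv("LIB"))
 * Returns 1 if a candidate was produced, 0 when the list is exhausted
 * or the result will not fit in outsize bytes.
 */
int template_candidate(const char *lib, const char *file, int n,
                       char *out, size_t outsize)
{
    const char *start, *end;
    size_t      dirlen;

    if (n == 0) {
        if (strlen(file) + 1 > outsize)
            return 0;
        strcpy(out, file);
        return 1;
    }
    if (lib == NULL)
        return 0;

    start = lib;
    while (--n > 0) {                   /* skip n-1 directories */
        start = strchr(start, ';');
        if (start == NULL)
            return 0;
        ++start;
    }
    end    = strchr(start, ';');
    dirlen = end ? (size_t)(end - start) : strlen(start);
    if (dirlen == 0 || dirlen + 1 + strlen(file) + 1 > outsize)
        return 0;

    memcpy(out, start, dirlen);
    out[dirlen] = '/';                  /* assumed separator */
    strcpy(out + dirlen + 1, file);
    return 1;
}
```

A caller would try n = 0, 1, 2, and so on, attempting to open each candidate until one succeeds or the function returns 0.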
The -t switch causes LeX to send output to standard output rather than lexyy.c.
The -u switch changes the definition of a newline to be consistent with UNIX
rather than MS-DOS. Normally a newline is either a carriage return or linefeed ('\r' or
'\n'). In UNIX mode, however, a newline is a single linefeed. This definition affects the
. operator, which matches anything except a newline, and a negative character class,
which matches anything except a newline or one of the specified characters.
The -v switch causes LeX to print various statistics, such as the output table sizes, to
standard output. -V forces more-verbose output, which describes the internal workings
of the program as it runs. This last switch is useful primarily for debugging LeX itself;
it's also handy if you want to see a running commentary of the subset construction used
to create a DFA from an NFA, described in Chapter Two.
D.5 Limits and Bugs
Several limits are imposed on the LeX input file. In addition, several internal limits
will affect the workings of LeX. These are summarized in Table D.5. There are two
limits on the action components to the rules. No single rule (including the action) can be
longer than 2,048 bytes, and the space used by all the actions combined cannot exceed
20,480 bytes. If either of these limits is exceeded, fix the problem by moving some of
the embedded code into subroutines declared in the third part of the LeX input
specification.
Table D.5. LeX Limits

Maximum space available for all actions combined               20,480 bytes
Maximum characters in a single rule (including the action)      2,048 bytes
Maximum number of NFA states                                      512 states
Maximum number of DFA states                                      254 states
The limits on NFA and DFA states refer to the maximum size of internal tables used
by LeX (see Chapter Two). There are always fewer DFA than NFA states. If you exceed
either of these limits, you'll either have to reduce the number of regular expressions in
the input file or simplify the expressions. Several simplification techniques can be used
with little difficulty:
(1) Character classes take up less room than the | operator so they should be used
    whenever possible; [012] is preferable to (0|1|2). The character class uses two
    states; the expression with the | operator uses ten.
(2) Use multiple rules with shared actions rather than the OR operator if possible. This
    rule:
        abc     |
        def     printf("abc or def");
    requires one fewer NFA state than this one:
        abc|def printf("abc or def");
(3) One of the least efficient ways to use LeX is to recognize long strings that do not
    contain metacharacters. A simple string uses roughly as many NFA states as it has
    characters ("0123456789" requires 11 states), but a closure operator or character
    class uses only a few states. These sorts of strings can cause problems when you try
    to do something like this:
        <Ctrl-A>    printf("<Ctrl-A>\n");
        <Ctrl-B>    printf("<Ctrl-B>\n");
        <Ctrl-C>    printf("<Ctrl-C>\n");
        <Ctrl-D>    printf("<Ctrl-D>\n");
        <Ctrl-E>    printf("<Ctrl-E>\n");
        <Ctrl-F>    printf("<Ctrl-F>\n");
        <Ctrl-G>    printf("<Ctrl-G>\n");
        <Ctrl-H>    printf("<Ctrl-H>\n");
        <Ctrl-I>    printf("<Ctrl-I>\n");
        <Ctrl-J>    printf("<Ctrl-J>\n");
        <Ctrl-K>    printf("<Ctrl-K>\n");
        <Ctrl-L>    printf("<Ctrl-L>\n");
    Though the final state machine uses only 31 DFA states, 121 NFA states are
    required to produce this machine. A better solution is:
        %%
        <Ctrl-[A-L]>    switch( yytext[6] )
                        {
                        case 'A': printf("<Ctrl-A>\n"); break;
                        case 'B': printf("<Ctrl-B>\n"); break;
                        case 'C': printf("<Ctrl-C>\n"); break;
                        case 'D': printf("<Ctrl-D>\n"); break;
                        case 'E': printf("<Ctrl-E>\n"); break;
                        case 'F': printf("<Ctrl-F>\n"); break;
                        case 'G': printf("<Ctrl-G>\n"); break;
                        case 'H': printf("<Ctrl-H>\n"); break;
                        case 'I': printf("<Ctrl-I>\n"); break;
                        case 'J': printf("<Ctrl-J>\n"); break;
                        case 'K': printf("<Ctrl-K>\n"); break;
                        case 'L': printf("<Ctrl-L>\n"); break;
                        }
        %%
    which requires only 11 NFA states.
    Keyword recognition presents a similar problem that can be solved in much the
    same way. A single expression can be used to recognize several similar lexemes,
    which can then be differentiated with a table lookup or calls to strcmp(). For
    example, all keywords in most languages can be recognized by the single
    expression [a-zA-Z_]+. Thereafter, the keywords can be differentiated from one
    another with a table lookup. There's an example of this process below.
A bug in LeX affects the way that the default . action is handled if the beginning-of-line
anchor (^) is also used in the same input specification. The problem is that an explicit
match of \n is added to the beginning of all patterns that are anchored to the start of
line (in order to differentiate ^x from x$ from ^x$ from x; it should be possible to have
all four patterns in the input file, each with a different accepting action). The difficulty
here is that an input file like this:
    %%
    ^ab     return AB;
    .       return NOT_AB;
    %%
generates a machine like this:

    [Figure: state machine with transitions labeled \n, anything but \n, and a, and
    with accepting actions return AB; and return NOT_AB;]

The problem lies in the way that the LeX-generated state machine works. When it fails
out of a nonaccepting state, the machine backs up to the most-recently-seen accepting
state. If there is no such state, the input lexeme is discarded. Looking at the previous
machine, if you are in State 4, the machine fails if a character other than b is in the input.
Not having seen an accepting state, the machine discards the newline and the a rather
than performing the default return NOT_AB action. You can correct the problem by
making State 3 an accepting state that also executes the NOT_AB action. Do
this by modifying the input file as follows:
    %%
    ^ab     return AB;
    \n      |
    .       return NOT_AB;
    %%
D.6 Example: A Lexical Analyzer for C
Listing D.5 shows a sample LeX input file for a C lexical analyzer. Listing D.6,
yyout.h, contains the token definitions. Various macros are defined on lines 11 to 17 of
Listing D.5. The suffix macro is for recognizing characters that can be appended to
the end of a numeric constant, such as 0x1234L. Note that all characters from an ASCII
NUL up to a space (that is, all control characters) are recognized as white space.
Listing D.5. c.lex: LeX Input File for a C Lexical Analyzer

 1  %{
 2  /* Lexical analyzer specification for C. Note that CONSTANTS are
 3   * all positive in order to avoid confusions (to prevent a-1 from
 4   * being interpreted as NAME CONSTANT rather than NAME MINUS CONSTANT).
 5   */
 6
 7  #include "yyout.h"
 8  #include <search.h>         /* Function prototype for bsearch() */
 9  %}
10
11  let      [_a-zA-Z]          /* Letter                                */
12  alnum    [_a-zA-Z0-9]       /* Alphanumeric character                */
13  h        [0-9a-fA-F]        /* Hexadecimal digit                     */
14  o        [0-7]              /* Octal digit                           */
15  d        [0-9]              /* Decimal digit                         */
16  suffix   [UuLl]             /* Suffix in integral numeric constant   */
17  white    [\x00-\s]          /* White space                           */
18
19  %%
20  "/*"    {
21              int i;
22
23              while( i = ii_input() )
24              {
25                  if( i < 0 )
26                      ii_flushbuf();          /* Discard Lexeme */
27
28                  else if( i == '*'  &&  ii_lookahead(1) == '/' )
29                  {
30                      ii_input();
31                      break;                  /* Recognized comment */
32                  }
33              }
34
35              if( i == 0 )
36                  yyerror( "End of file in comment\n" );
37          }
38
39  \"(\\.|[^\"])*\"        return STRING;
40
41  \"(\\.|[^\"])*[\r\n]    yyerror( "Adding missing \" to string constant\n" );
42                          return STRING;
Listing D.5. continued...

                            /* 'a', 'b', etc.               */
                            /* '\t', '\f', etc.             */
                            /* '\123', '\12', '\1'          */
                            /* '\x123', '\x12', '\x1'       */
                            /* 0, 01, 012, 012L, etc.       */
                            /* 0x1, 0x12, 0x12L, etc.       */
                            /* 123, 123L, etc.              */
                            yyerror( "Illegal character <%s>\n", yytext );
    %%
    /*------------------------------------------------------------------*/

    typedef struct              /* Routines to recognize keywords */
    {
        char  *name;
        int   val;
    }
    KWORD;

    KWORD  Ktab[] =             /* Alphabetic keywords */
    {
        { "auto",      CLASS    },
        { "break",     BREAK    },
        { "case",      CASE     },
        { "char",      TYPE     },
        { "continue",  CONTINUE },
        { "default",   DEFAULT  },
        { "do",        DO       },
        { "double",    TYPE     },
        { "else",      ELSE     },
        { "extern",    CLASS    },
        { "float",     TYPE     },
        { "for",       FOR      },
        { "goto",      GOTO     },
        { "if",        IF       },
        { "int",       TYPE     },
        { "long",      TYPE     },
        { "register",  CLASS    },
        { "return",    RETURN   },
        { "short",     TYPE     },
        { "sizeof",    SIZEOF   },
        { "static",    CLASS    },
        { "struct",    STRUCT   },
        { "switch",    SWITCH   },
        { "typedef",   CLASS    },
        { "union",     STRUCT   },
        { "unsigned",  TYPE     },
        { "void",      TYPE     },
        { "while",     WHILE    }
    };

    int  cmp( a, b )
    KWORD  *a, *b;
    {
        return strcmp( a->name, b->name );
    }

    int  id_or_keyword( lex )       /* Do a binary search for a  */
    char *lex;                      /* possible keyword in Ktab. */
    {                               /* Return the token if it's  */
        KWORD  *p;                  /* in the table, NAME        */
        KWORD  dummy;               /* otherwise.                */

        dummy.name = lex;
        p = bsearch( &dummy, Ktab, sizeof(Ktab)/sizeof(KWORD),
                                           sizeof(KWORD), cmp );
        return( p ? p->val : NAME );
    }
The rule on lines 20 to 37 of Listing D.5 handles a comment. The main problem here
is the potential comment length. A single regular expression can't be used to recognize a
comment because various internal limits on lexeme length will be exceeded by long
comments. So, the code looks for the beginning-of-comment symbol, and then sits in a
while loop sucking up characters until an end-of-comment is encountered. The
ii_flushbuf() call on line 26 forces an internal buffer flush (discarding the partially
collected lexeme) if the input buffer overflows. Note that this comment processing isn't
portable to UNIX lex. An alternate strategy is shown in Listing D.7. This second method
works fine under UNIX, but disallows comments longer than 1024 bytes if used with my
LeX.
Listing D.6. yyout.h: C Token Definitions

/*      token       value      lexeme                                     */

#define _EOI_          0    /* end-of-input symbol                        */
#define NAME           1    /* identifier                                 */
#define STRING         2    /* "string constant"                          */
#define ICON           3    /* integer constant or character constant     */
#define FCON           4    /* floating-point constant                    */
#define PLUS           5    /* +                                          */
#define MINUS          6    /* -                                          */
#define STAR           7    /* *                                          */
#define AND            8    /* &                                          */
#define QUEST          9    /* ?                                          */
#define COLON         10    /* :                                          */
#define ANDAND        11    /* &&                                         */
#define OROR          12    /* ||                                         */
#define RELOP         13    /* >  >=  <  <=                               */
#define EQUOP         14    /* ==  !=                                     */
#define DIVOP         15    /* /  %                                       */
#define OR            16    /* |                                          */
#define XOR           17    /* ^                                          */
#define SHIFTOP       18    /* >>  <<                                     */
#define INCOP         19    /* ++  --                                     */
#define UNOP          20    /* !  ~                                       */
#define STRUCTOP      21    /* .  ->                                      */
#define TYPE          22    /* int, long, etc.                            */
#define CLASS         23    /* extern, static, typedef, etc.              */
#define STRUCT        24    /* struct  union                              */
#define RETURN        25    /* return                                     */
#define GOTO          26    /* goto                                       */
#define IF            27    /* if                                         */
#define ELSE          28    /* else                                       */
#define SWITCH        29    /* switch                                     */
#define BREAK         30    /* break                                      */
#define CONTINUE      31    /* continue                                   */
#define WHILE         32    /* while                                      */
#define DO            33    /* do                                         */
#define FOR           34    /* for                                        */
#define DEFAULT       35    /* default                                    */
#define CASE          36    /* case                                       */
#define SIZEOF        37    /* sizeof                                     */
#define LP            38    /* (   (left parenthesis)                     */
#define RP            39    /* )                                          */
#define LC            40    /* {   (left curly)                           */
#define RC            41    /* }                                          */
#define LB            42    /* [   (left bracket)                         */
#define RB            43    /* ]                                          */
#define COMMA         44    /* ,                                          */
#define SEMI          45    /* ;                                          */
#define EQUAL         46    /* =                                          */
#define ASSIGNOP      47    /* +=  -=, etc.                               */

String constants.
The next two rules, on lines 39 to 41, handle string constants. The rules assume that a
string constant cannot span a line, so the first rule handles legal constants, and the second
rule handles illegal constants that contain a newline rather than a terminating close
quote. (ANSI is unclear about whether newlines are permitted in string constants. I'm
assuming that they are not permitted because string concatenation makes them unnecessary.
Consequently, a hard error is printed here rather than a warning.)
Listing D.7. A UNIX Comment-Processing Mechanism

%%
"/*"    {
            int i, lasti = 0;

            while( (i = input()) && i != EOF )
            {
                if( i == '/' && lasti == '*' )
                    break;
                lasti = i;          /* remember the previous character */
            }

            if( i == 0 || i == EOF )
                yyerror( "End of file in comment\n" );
        }
%%

Most of the complexity comes from handling backslashes in the string correctly. The
strategy that I used earlier:
    \"[^\"]*\"      if( yytext[yyleng-2] == '\\' )
                        yymore();
                    else
                        return STRING;
doesn't work in a real situation because this expression can't handle strings in which the
last character is a literal backslash, like "\\". It also can't handle newlines in string
constants, for which an error should be printed. The situation is rectified on line 39 of
Listing D.5 by replacing the character class in the middle of \"[^\"]*\" with the following
subexpression (I've added some spaces for clarity):
    ( \\. | [^\"] )*
This subexpression recognizes two things: two-character sequences where the first character
is a backslash, and single characters other than double quotes. The expression on
line 40 handles the situation where the closing quote is missing. It prints an error message
and returns a STRING, as if the close quote had been found.
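The two alternatives in the subexpression can also be read as a hand-written scanning loop. The following is only a minimal sketch of that idea in ANSI C, not code from Listing D.5: the function name scan_string() is hypothetical, and the sketch assumes that a newline or end of input before the closing quote counts as an error (reported here by returning zero rather than by printing a message).

```c
#include <stddef.h>

/* Hypothetical helper: return the length of the string constant that starts
 * at s (both quote marks included), or 0 if s does not start with a double
 * quote or if a newline or end of input is reached before the close quote.
 */
size_t scan_string(const char *s)
{
    const char *p = s;

    if (*p != '"')
        return 0;
    ++p;

    while (*p != '\0' && *p != '"' && *p != '\n') {
        if (*p == '\\' && p[1] != '\0' && p[1] != '\n')
            p += 2;     /* backslash plus any character: the  \\.  case   */
        else
            ++p;        /* any single non-quote character: the [^\"] case */
    }

    return (*p == '"') ? (size_t)(p - s) + 1 : 0;
}
```

Because an escaped character is consumed together with its backslash, a constant whose last character is a literal backslash is handled correctly: the two backslashes are eaten as a pair, and the following quote mark ends the constant.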
All the rules on lines 43 to 49 recognize integer constants of some sort. The first five
are for the various character constants and the last four are for numeric constants.
Floating-point constants are recognized on line 51.
The rule on line 85 recognizes both identifiers and keywords. I've done this to
minimize the table sizes used by the LeX-generated analyzer. The id_or_keyword()
subroutine is used to distinguish the two. It does a binary search in the table declared on
lines 99 to 129, returning the appropriate token if the current lexeme is in the table;
otherwise it returns NAME. The bsearch() function on line 144 does the actual search. It
is a standard ANSI function that's passed a key, a pointer to a table, the number of elements
in the table, the size of one element (in bytes), and a pointer to a comparison function.
This last function is called from bsearch(), being passed the original key and a
pointer to one array element. It must compare these, but otherwise work like
strcmp(). The comparison function used here is declared on lines 131 to 135. The
remainder of the rules are straightforward.
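For readers using a compiler that insists on ANSI prototypes, the lookup just described can be sketched as follows. This is an illustrative variant rather than the listing itself: the table is cut down to four entries, the token values are copied from Listing D.6 into a local enum, and the const qualifiers and casts are my additions. The one hard requirement is that the table be sorted by name, because bsearch() assumes a sorted array.

```c
#include <stdlib.h>
#include <string.h>

/* Token values as given in Listing D.6 (only the ones used here). */
enum { NAME = 1, TYPE = 22, CLASS = 23, IF = 27, WHILE = 32 };

typedef struct
{
    const char *name;
    int         val;
}
KWORD;

/* A cut-down keyword table. It must be in alphabetic order for bsearch(). */
static const KWORD Ktab[] =
{
    { "auto",  CLASS },
    { "if",    IF    },
    { "int",   TYPE  },
    { "while", WHILE }
};

/* Comparison function: called with the key and one array element. */
static int cmp(const void *a, const void *b)
{
    return strcmp(((const KWORD *)a)->name, ((const KWORD *)b)->name);
}

/* Return the keyword's token if lex is in the table, NAME otherwise. */
int id_or_keyword(const char *lex)
{
    KWORD        dummy;
    const KWORD *p;

    dummy.name = lex;
    dummy.val  = 0;
    p = bsearch(&dummy, Ktab, sizeof(Ktab) / sizeof(Ktab[0]),
                sizeof(Ktab[0]), cmp);
    return p ? p->val : NAME;
}
```

The key is wrapped in a dummy KWORD so that the comparison function sees two objects of the same type, which is the same trick used on lines 141 to 145 of the listing.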
Using lookup table to
differentiate identifiers
from keywords.
D.7 Exercises
D.1. Newlines in string constants are not handled particularly well by the LeX input file
presented in the previous section. The current file assumes that newlines are not
permitted in constants, so it inserts a close quote when a newline is encountered.
Rewrite the regular expressions for string constants so that they will do the following:
    Print a warning if a newline is found in a constant, and then take the next line
    as a continuation of the previous line. That is, the close quote should not be
    inserted by the lexical analyzer; rather, it should recognize multiple-line
    strings.
    If the newline in the string constant is preceded by a backslash, however, the
    warning message is suppressed.
D.2. The UNIX vi editor supports a tags feature. When the cursor is positioned on a subroutine
name, typing a Ctrl-] causes the current file to be saved, the file that contains
the indicated subroutine is then read into the editor, and the cursor is positioned
at the first line of the subroutine. Vi accomplishes this feat with the aid of
a supplemental file called tags, which is created by a utility called ctags. The tags
file consists of several lines, each having the following fields:
    subroutine_name    file_name    search_pattern
The three fields are separated by tabs, and the search pattern consists of the actual
first line of the subroutine, surrounded by /^ and $/. For example:
    main    main.c    /^main(argc, argv)$/
Using LeX, write a version of ctags that recognizes all subroutine declarations
and all subroutine-like macro definitions in a C program. It should not recognize
external declarations. That is, none of the following should generate an entry in
the output file:
    extern larry();
    char *curly( int x );
    PRIVATE double moe( void );
You can assume that subroutine declarations will follow a standard formatting convention.
Hand in both the program and documentation describing what the input
must look like for a declaration to be recognized.
D.3. Write a version of ctags for Pascal programs.
D.4. Using LeX, write a version of the UNIX calendar utility. Your program should
read in a file called calendar and print all lines from that file that contain today's
date somewhere on the line. All the following forms of dates should be recognized:
    7/4/1776
    7/4/76          Assume 20th century, so this entry is for 1976.
    7/4             Assume the current year.
    7-4-1776
    7-4-76
    7-4
    1776-7-4        Since 1776 can't be a month, assume European ordering.
    July 4, 1776
    July 4 1776
    July 4
    Jul 4           Three-character abbreviations for all month names are supported.
    Jul. 4          The abbreviation can be followed by an optional period.
    4 July
    4 July, 1776
    4 July, 76
    4 July, 76
    4 Jul. 76
LLama and occs are compiler-generation tools that translate an augmented, attributed
grammar into a parser. They are essentially C preprocessors, taking as input a file
containing intermixed C source code, grammatical rules, and compiler directives, and
outputting the C source code for a parser subroutine.
This appendix is a user's manual for both LLama and occs, the two compiler compilers
developed in Chapters Four and Five. (Occs stands for "the Other Compiler-Compiler
System." It's pronounced "ox.") An understanding of the theoretical parts of
these chapters (sections marked with asterisks in the table of contents) will be helpful in
reading the current Appendix. Similarly, a discussion of how to use occs and LLama in
conjunction with LeX is found at the beginning of Appendix D, which you should read
before continuing.
Both LLama and occs are modeled after the UNIX yacc utility, though LLama builds
a top-down, LL(1) parser while occs builds a bottom-up, LALR(1) parser. The compiler
in Chapter Six can be constructed either with occs or with yacc. If you intend to use the
UNIX utility, your time will be better spent reading the UNIX documentation than the
current appendix.
LLama and occs are both discussed in one appendix because they are very similar at
the user level. I will discuss the common parts of both systems first, and then describe the
individual characteristics of occs and LLama in separate sections. Both programs are
compiler compilers, because they are themselves compilers that translate a very-high-level-language
description of a compiler to high-level-language source code for a compiler.
I will use the term compiler compiler when I'm talking about characteristics
shared by both systems.
E.1 Using the Compiler Compiler
Like LeX, occs and LLama are C preprocessors. They take as input a file that contains
an augmented, attributed grammar that describes a programming language, and
they create the C source code for a parser for that language. C source-code fragments
that specify code-generation actions in the input are passed to the output with various
translations made in that source code to give you access to the parser's value stack.
Occs outputs the source for a bottom-up parser (both the tables and the parser itself);
LLama outputs a top-down parser. The generated parser subroutine is called
yyparse(); just call this subroutine to get the parsing started. The subroutine returns
zero if it parses the input successfully, -1 if an unrecoverable error in the input is
encountered. A mechanism is provided for you to return other values as well.
yyparse() expects to get input tokens from a scanner subroutine generated by LeX; the
interaction was discussed at the beginning of Appendix D.
Occs, despite superficial similarities, is not yacc; I've made no attempt to perpetuate
existing deficiencies in the UNIX program in order to get compatibility with UNIX.
Occs has a better error-recovery mechanism than yacc (though it's still not ideal), the
output code is more maintainable, and it provides you with a considerably improved
debugging environment. This last change, activated when you compile with YYDEBUG
defined, gives you a window-oriented debugging system that lets you actually watch the
parse process in action. You can watch the parse stack change, see attributes as they're
inherited, set breakpoints on various stack conditions and on reading specified symbols,
and so forth. I've documented differences between occs and yacc in footnotes; these are
of interest only if you're already familiar with yacc. LLama, of course, is radically
different from yacc in that it uses a completely different parsing technique. Don't be
misled by the superficial similarities in the input formats.
Unfortunately, it's difficult to present either compiler compiler in a hierarchical
fashion; you may have to read this Appendix twice. Similarly, the description of the
attribute-passing mechanism will make more sense after you've seen it used to build the
C compiler in Chapter Six.
E.2 The Input File
The LLama and occs input files, like LeX input files, are split up into three sections,
separated from one another with %% directives:
    definitions
    %%
    rules
    %%
    code
Both the definitions and code sections can be empty.
E.3 The Definitions Section
The first part of the input file contains both directives to the compiler compiler itself
and C source code that is used later on in the rules section. As with the LeX input format,
the code must be surrounded by %{ and %} directives.1 All the directives start with a percent
sign, which must be in the leftmost column. Those directives that are unique to one
or the other of the programs are discussed later, but those directives that are shared by
both LLama and occs are summarized in Table E.1.
The %token and %term directives are used to declare terminal symbols in the
grammar. The directive is followed by the name of the symbol, which can then be used
later in a production in the rules section. For example:
    %term  LP RP        /* ( and )        */
    %term  ID           /* an identifier  */
    %term  NUM          /* a number       */
    %token PLUS STAR    /* + *            */
1. Yacc passes lines that start with white space through to the output without modification, as if they were
part of a %{...%} delimited code block. Occs does not support this mechanism.
The parser subroutine, yyparse().
Input-file organization.
Code in the definitions section.
Defining tokens: %token, %term.

Table E.1. Occs and LLama % Directives

Directive   Description
%%          Separates the three sections of the input file.
%{          Starts a code block in the definitions section. All lines that follow, up to a
            line that starts with a %} directive, are written to the output file unmodified.
%}          Ends a code block.
%token      Defines a token.
%term       A synonym for %token.
/* */       C-like comments are recognized, and ignored, by occs when found outside of a
            %{...%}-delimited code block. They are passed to the output if found inside
            a code block.

yyout.h: token definitions.
Several names can be listed on a single line. Occs creates an output file called yyout.h
which holds symbolic values for all these tokens, and these values must be returned from
the lexical analyzer when the equivalent string is recognized.2 For example, the previous
declarations produce the following yyout.h file:
    #define _EOI_   0
    #define LP      1
    #define RP      2
    #define ID      3
    #define NUM     4
    #define PLUS    5
    #define STAR    6
The arguments to %term are used verbatim for the macro names in yyout.h. Consequently,
they must obey the normal C naming conventions: The token names must be
made up of letters, numbers, and underscores, and the first symbol in the name cannot be
a number.3 The definitions in yyout.h can then be used from within a LeX input file to
build a lexical analyzer. A sample LeX file is in Listing E.1.
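The naming rule is easy to check mechanically. The routine below is a small sketch of such a check, written for illustration only; is_valid_token_name() is a hypothetical name and is not part of occs or LLama.

```c
#include <ctype.h>

/* Hypothetical check: return 1 if s could be used as a %term or %token name
 * (letters, digits, and underscores only, not starting with a digit),
 * 0 otherwise.
 */
int is_valid_token_name(const char *s)
{
    if (*s == '\0' || isdigit((unsigned char)*s))
        return 0;

    for (; *s != '\0'; ++s)
        if (!isalnum((unsigned char)*s) && *s != '_')
            return 0;

    return 1;
}
```

With this definition, PLUS and _EOI_ are accepted, while 2ND (leading digit) and A-B (illegal character) are rejected.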
Code blocks, delimited by %{ and %} directives, are also found in the definitions
section. These blocks should contain the normal things that would be found at the top of
a C file: macro definitions, typedefs, global-variable declarations, function prototypes,
and so forth. Certain macro names are reserved for use by the compiler compiler, however,
and these are summarized in Table E.2. You can modify the parser's behavior by
defining these macros yourself in a code block in the definitions section (there's no need
to #undef them first).
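One common way to arrange for this behavior is to guard each default definition with #ifndef, so that a definition made earlier in the file wins. The fragment below is a sketch of that pattern only; I am assuming, not quoting, how the parser template does it. The two accessor functions are hypothetical and exist just to make the effect visible, and the default values are the ones given in Table E.2.

```c
/* What a user might put in a %{ ... %} block in the definitions section. */
#define YYMAXDEPTH 256

/* What a parser template could do further down in the same output file:
 * supply a default only when the macro has not been defined already.
 */
#ifndef YYMAXDEPTH
#define YYMAXDEPTH 128
#endif

#ifndef YYMAXERR
#define YYMAXERR 25
#endif

/* Hypothetical accessors, present only to show which values were selected. */
int stack_depth(void) { return YYMAXDEPTH; }   /* user's value:   256 */
int max_errors(void)  { return YYMAXERR;  }    /* default value:   25 */
```

Because the user's #define is seen first, the guarded default is skipped and no #undef is required.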
Token names.
Code Blocks %{ ...%}
2. Yacc creates a file called y.tab.h, and the file is created only if -d is specified on the command line.
3. Yacc doesn't impose these restrictions. Also, yacc (but not occs) accepts a definition of the form
%token NAME number, in which case it uses the indicated number for the token value rather than
assigning an arbitrary value.
4. Of these, YYABORT, YYACCEPT, YYDEBUG, YYMAXDEPTH, and YYSTYPE are supported by yacc. The
YYDEBUG flag just activates a running log of shift and reduce actions to the screen; there is no interactive
debugging environment.
Listing E.1. expr.lex: LeX Input Specification for Simple Expressions

%{
#include "yyout.h"
%}
%%
"+"         return PLUS;
"*"         return STAR;
"("         return LP;
")"         return RP;
[0-9]+      return NUM;
[a-z]+      return ID;
%%

In addition to the foregoing, the file <stdio.h> is automatically #included at the
top of the parser file before any code that you specify in the definitions section is output.
It's harmless, though unnecessary, for you to #include it explicitly. Similarly,
"yyout.h" and the ANSI variable-argument-list-definitions file, <stdarg.h>, are also
#included automatically.5 Finally, the yacc version of the stack macros is included
(<tools/yystack.h>). These are described in Appendix A.
Note that all the symbols that occs uses start with YY, yy, or Yy. You should not start
any of your own names with these characters to avoid possible name conflicts.
E.4 The Rules Section
The rules section of the input file comprises all lines between an initial %% directive
and a second %% directive or end of file. Rules are made up of augmented, attributed
productions: grammatical rules and code to be executed. A rules section for an occs input
file is shown in Listing E.2 along with the necessary definitions section. (I'll discuss this
file in greater depth in a moment; it's presented here only so that you can see the structure
of an input file.)
Productions take the form:
    left-hand_side : right-hand_side | rhs_2 | ... | rhs_N ;
A colon is used to separate the left- from the right-hand side instead of the → that we've
used hitherto. A vertical bar (|) is used to separate right-hand sides that share a single
left-hand side. A semicolon terminates the whole collection of right-hand sides. Terminal
symbols in the grammar must have been declared with a previous %term or %token,
and their names must obey the C naming conventions.
Nonterminal names can be made up of any collection of printing nonwhite characters
except for the following:
    % { } [ ] ( ) < > *
There can't be any imbedded spaces or tabs. Imbedded underscores are permitted.
Imbedded dots can be used as well, but I don't suggest doing so because dots make the
yyout.doc file somewhat confusing. I'd suggest using lower-case letters and underscores
5. Yacc actually puts the token definitions in the output file. Occs puts the definitions in yyout.h and
#includes it in the output file.
6. Exactly the opposite situation applies to yacc. An imbedded underscore should be avoided because an
underscore is used to mark the current input position in yyout.doc.
Internal names: yy, Yy,
YY
Occs modified BNF:
representing productions.
Nonterminal names.
Table E.2. Macros that Modify the Parser

YYABORT         This macro determines the action taken by the parser when an unrecoverable error is
                encountered or when too many errors are found in the input. The default macro is
                    #define YYABORT return(1)
                but you can redefine it to something else in a code block at the top of the occs input
                file if you prefer. Note that there are no parentheses attached to the macro-name
                component of the definition.
YYACCEPT        This macro is like YYABORT, except that it determines the accepting action of the
                parser. The standard action is return 0, but you can redefine it to something else
                if you prefer.
YYCASCADE       (Occs only.) This constant controls the number of parser operations that have to follow
                an error before another error message is printed. It prevents cascading error
                messages. It's initialized to five. #define it to a larger number if too many error
                messages are printed, a smaller number if too many legitimate error messages are
                suppressed.
YYD(x)          If the -D switch is specified, then all text in a YYD macro is expanded; otherwise
                it's ignored. For example, the following printf() statement is executed only if -D
                is present on the occs command line.
                    YYD( printf( "hi ...
YYDEBUG         Set automatically by the -D command-line switch to enable the interactive debugging
                environment, but you can define it explicitly if you like. The contents are
                unimportant.
YYMAXDEPTH      This macro determines the size of the state and value stacks. It's 128 by default, but
                you may have to enlarge it if you get a "Parse stack overflow" error message at run
                time. You can also make it smaller for some grammars.
YYMAXERR        (Occs only.) Initially defined to 25, the value of this macro determines when the
                parser will quit because of too many syntax errors in the input.
YYPRIVATE       Defining this macro as an empty macro makes various global-variable declarations
                public so that you can see them in a link map. Use the following:
                    #define YYPRIVATE /*nothing*/
YYSHIFTACT(tos) (Occs only.) This macro determines the default shift action taken by the parser.
                That is, it controls what gets put onto the value stack when an input symbol is
                shifted onto the stack. It is passed a pointer to the top of the value stack after the
                push (tos). The default definition looks like this:
                    #define YYSHIFTACT(tos) ( (tos)[0] = yylval )
                (yylval is used by the lexical analyzer to shift an attribute associated with the
                current token. It is discussed below.) You can redefine YYSHIFTACT as an empty
                macro if none of the tokens have attributes.
YYSTYPE         This macro determines the typing of the value stack. It's discussed at length in the
                text.
YYVERBOSE       If this macro exists, the parser uses more verbose diagnostic messages when YYDEBUG
                is also defined.

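The YYD(x) entry describes a conditional-expansion macro. A minimal sketch of how such a macro can be built is shown below; it is an assumption about the mechanism, not a copy of the occs template. Here YYDEBUG is defined by hand to stand in for the -D switch, and shift_token() and trace_count are hypothetical names used only for the demonstration.

```c
#include <stdio.h>

#define YYDEBUG                 /* what the -D switch would insert */

#ifdef YYDEBUG
#define YYD(x) x                /* debugging on: the text is expanded */
#else
#define YYD(x)                  /* debugging off: the text disappears */
#endif

int trace_count = 0;            /* hypothetical counter, bumped only when tracing */

/* Hypothetical routine with debugging code wrapped in YYD(). */
int shift_token(int tok)
{
    YYD( trace_count++; )
    YYD( printf("shift %d\n", tok); )
    return tok;
}
```

If the #define YYDEBUG line is removed, both YYD() lines expand to nothing and shift_token() reduces to a bare return statement, so the debugging code costs nothing in a production parser.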
Listing E.2. expr.y: Occs Rules Section for the Expression Compiler

%term  ID       /* an identifier */
%term  NUM      /* a number      */
%left  PLUS     /* +             */
%left  STAR     /* *             */
%left  LP RP    /* ( and )       */

%%
/* A small expression grammar that recognizes numbers, names, addition (+),
 * multiplication (*), and parentheses. Expressions associate left to right
 * unless parentheses force it to go otherwise. * is higher precedence than +.
 * Note that an underscore is appended to identifiers so that they won't be
 * confused with rvalues.
 */

s   : e
    ;

e   : e PLUS e      { yycode( "%s += %s\n", $1, $3 ); free_name( $3 ); }
    | e STAR e      { yycode( "%s *= %s\n", $1, $3 ); free_name( $3 ); }
    | LP e RP       { $$ = $2; }
    | NUM           { yycode( "%s = %s\n",  $$ = new_name(), yytext ); }
    | ID            { yycode( "%s = _%s\n", $$ = new_name(), yytext ); }
    ;
%%

for nonterminal names and upper-case letters for tokens, as in Listing E.2. This way you
can tell what something is by looking at its name. It's a hard error to use a nonterminal
that is not found somewhere to the left of a colon.
Text surrounded by curly braces is code that is executed as the parse progresses. In a
top-down LLama parser, the code is executed when it appears at the top of the parse
stack. In the bottom-up occs parser, the code is executed when a reduction by the associated
production is performed.
ε productions are represented with an empty right-hand side (and an optional action).
It's best to comment the fact that the production has been left empty on purpose. All the
following are legitimate ε productions:
Augmentations: putting code in a production.
ε productions.
    a   : /* empty */
        ;
    b   : /* epsilon */     { an_action(); }
        ;
    c   : normal_rhs        { an_action(); }
        | /* epsilon */     { another_action(); }
        ;
Formatting.
I'd suggest using the formatting style shown in Listing E.2. The colons, bars, and
semicolons should all line up. The nonterminal name should start at the left column and
the body of a production should be indented. The curly braces for the actions all line up
in a nice column. If a name is too long, use the following:
    very_long_left_hand_side_name
        : right_hand_side       { action(1); }
        | another               { action(2); }
        ;
Any valid C statement that's permitted inside a C subroutine is allowed in an action,
with the following caveats:
return and break statements in actions.
Local variables in actions.
The start production.
(1) A break statement at the outermost level terminates processing of the current
action but not the parser itself.
(2) All return statements must have an associated value. A simple return is not
permitted.
(3) A return statement causes the parser to return only if it takes a nonzero argument.
That is, return 0 causes processing of the current action to stop, but the parsing
continues; return 1 forces the parser to return to the calling routine, because the
argument is nonzero. The argument to return is returned from yyparse().
(4) The YYACCEPT and YYABORT macros determine what happens when the parser terminates
successfully (accepts the input sentence) or fails (rejects the input sentence).
The default actions return 0 and 1, respectively, so you should not return
either of these values from within an action.
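One arrangement that yields exactly these semantics is to place every action in a case of a switch inside a subroutine that returns zero by default, and to have the parser stop only when that subroutine returns nonzero. The sketch below illustrates the idea under that assumption; it is not the occs-generated code, and do_action(), run_parser(), and actions_run are hypothetical names.

```c
static int actions_run = 0;     /* hypothetical counter, for demonstration only */

/* All actions live in one subroutine, one case per production. */
static int do_action(int production)
{
    switch (production) {
    case 1: actions_run++; break;       /* break: leaves this action only       */
    case 2: actions_run++; return 0;    /* return 0: same effect, parse goes on */
    case 3: actions_run++; return 5;    /* nonzero: the parser itself returns 5 */
    }
    return 0;                           /* falling out of an action yields zero */
}

/* A stand-in for the parser loop: perform a sequence of reductions. */
int run_parser(const int *reductions, int n)
{
    int i, status;

    for (i = 0; i < n; ++i)
        if ((status = do_action(reductions[i])) != 0)
            return status;              /* value handed back to the caller */

    return 0;                           /* normal (accepting) termination  */
}
```

Calling run_parser() on the reduction sequence 1, 2, 3, 1 returns 5 after three actions have run; the fourth action is never reached because the third one returned a nonzero value.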
You can't declare subroutines in an action because the actions themselves are inside a
subroutine. You can, however, declare local variables after any open brace, and you can
use an action that's several lines long. The following is okay:
    left_hand_side : right_hand_side
                     {
                         int i, monster;
                         for( i = 10; --i >= 0 ;)
                             bug( i, monster );
                     }
                   ;
Of course, you can't declare a local variable in one action and access that variable from
another. True global variables that are declared in a %{...%} block in the definitions
section can be accessed everywhere.
The first production in the rules section is taken as the start production and is treated
specially: Its left-hand side is the goal symbol and it can have only one right-hand side.
E.5 The Code Section
The third part of the input file, the code section, comprises all text after the second
%% directive. This entire section is written to the output file verbatim. It is used the same
way as the LeX code section: it should contain subroutines that are called from the
actions. Note that the occs code section differs from both the LeX and LLama sections
in that dollar attributes (discussed below) can be used in the code found there.8
E.6 Output Files
LLama and occs create several output files; their interaction with LeX is discussed
at the beginning of Appendix D. The two programs use different names for these files,
however, and these are summarized in Table E.3.
7. Some versions of yacc support a %start directive that lets you define an arbitrary production as the start
production. This mechanism isn't supported by occs.
8. Yacc doesn't permit this: dollar attributes may be used only in the rules section.
Table E.3. Files Generated by Occs and LLama

Occs Name   LLama Name   Command-line   Contains
                         Switches
yyout.c     llout.c      none, -p       Normally contains parser and action code taken from original
                                        grammar. If -p is specified to occs, this file contains the parser
                                        only. (LLama doesn't support a -p switch.)
yyact.c                  -a             If -a is specified to occs, this file contains code for the action
                                        components of the productions and all of the code portion of
                                        the input file (but not the parser). LLama doesn't support -a.
yyout.h     llout.h      none           Contains #defines for the tokens.
yyout.sym   llout.sym    -s, -S, -D     The symbol table. The -S version contains more information than
                                        the -s version. Specifying -D implies -s; you can get more-verbose
                                        tables with -DS.
yyout.doc                -v, -V         Used only by occs, contains the LALR(1) state-machine description
                                        along with any warning messages that are sent to the screen.

E.7 Command-Line Switches
The basic command-line usage for both programs is the same:
    occs  [-switches] file
    llama [-switches] file
where the file is an input specification file, as just discussed. The programs take the following
command-line switches:
-a      (occs only.) Rather than creating a combined parser-and-action file, the action
        subroutine only is output to the file yyact.c. The parser component of the output
        file can be created using the -p switch, described below.
-c[N]   (LLama only.) This switch controls the parse-table compression method; it
        works just like the LeX switch with the same name. It changes the compression
        algorithm to one in which each character/next-state pair is stored as a two-byte
        object. This method typically (though not always) gives you smaller tables and
        a slower parser than the default method. The -c switch takes an optional
        numeric argument (-c5, for example) that specifies the threshold above which
        the pair compression kicks in for a given row of the table. If the row has more
        than the indicated number of nonerror transitions, it is not compressed. The
        default threshold, if no number is specified, is four.
-D      Enables the interactive debugging environment in the generated parser. All that
        this switch really does is insert a #define YYDEBUG into the output file. You
        can do the same thing yourself by putting an explicit #define in the input
        specification. Specifying -D also causes -s to be set.
-f      (LLama only.) This switch causes the output tables to be uncompressed,
        thereby speeding up the parser a little at the cost of program size.
Generate Action Subroutine.
Pair compress LLama parse tables.
Activating the interactive debugging environment.
Uncompressed tables.
9. Of these, only -v is supported by yacc, though yacc is considerably less verbose than occs. Occs' -D
switch is nothing like yacc's -d switch.
Making private variables
public.
g
Several global variables and subroutines are used in the output parser. These
are declared stati c by default, so their scope is limited to yyout.c (or llout.c).
The -g switch causes the compiler compiler to make these variables global to
the entire program. You can achieve the same effect by putting the following
the first section of the input specification:
o
"o{ #def i ne YYPRI VATE
}
Suppress #l i ne di r ec
ti ves.
(1 is an ell.) Occs and LLama usually generate #l i ne directives that cause the
compilers error messages to reflect line numbers in the original input file
rather than the output file. (A preprocessor directive of the form
#l i ne N
vv vv
tells the compiler to pretend that its on line N of the indi
cated f i l e.) These directives can cause problems with source-level debuggers,
however, and the - / switch gives you a way to turn them off. Generally you
should not use - / until youve eliminated all the syntax errors from the input file.
Resist the temptation to modify the output file in order to fix a syntax error. Its
too easy to forget to modify the input too, and the output file is destroyed the
next time you use the compiler compiler.
Alternative parser templates. The LIB environment variable.
-m  Occs and LLama both generate only part of the parser. They create the parse
    tables and a few macros and typedefs needed to compile those tables. The
    remainder of the parser is read from a template file, called occs.par (in the case
    of occs) and llama.par (in the case of LLama). The programs look for these
    files, first in the current directory, and then along all directories specified in the
    LIB environment variable. LIB should contain a semicolon-delimited list of
    directories. The -m command-line switch is used to specify an alternate
    template file. For example:
        -m/usr/allen/template/template.par
    tells occs to read a file called template.par from the /usr/allen/template directory
    rather than using the default template. The template files are discussed in depth
    in Chapters Four and Five. Note that there's no space between the m and the
    first character of the path name.
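The search order described above (current directory first, then each LIB directory in turn) can be sketched as follows. This is an illustration under stated assumptions, not the code occs uses: candidate_path() and open_template() are hypothetical names, and a forward slash is assumed as the path separator.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the n-th place to look for the template file `name`:
 * n == 0 is the current directory; n >= 1 is the n-th nonempty entry in
 * the semicolon-delimited list `lib` (the value of LIB; may be NULL).
 * Returns 1 and fills `out` on success, 0 when the list is exhausted or
 * the path would not fit in `out`.
 */
int candidate_path(const char *name, const char *lib, int n,
                   char *out, size_t outsize)
{
    const char *p = lib;
    size_t len;

    if (n == 0) {
        if (strlen(name) >= outsize)
            return 0;
        strcpy(out, name);
        return 1;
    }
    while (p && *p) {
        len = strcspn(p, ";");                 /* length of this entry */
        if (len > 0 && --n == 0) {
            if (len + 1 + strlen(name) >= outsize)
                return 0;
            sprintf(out, "%.*s/%s", (int)len, p, name);
            return 1;
        }
        p += len;
        if (*p == ';')
            ++p;                               /* skip the delimiter */
    }
    return 0;
}

/* Open the first candidate that exists, or return NULL. */
FILE *open_template(const char *name, const char *lib)
{
    char path[256];
    FILE *fp;
    int n;

    for (n = 0; candidate_path(name, lib, n, path, sizeof path); ++n)
        if ((fp = fopen(path, "r")) != NULL)
            return fp;
    return NULL;
}
```

A caller would typically write open_template("occs.par", getenv("LIB")), with getenv() coming from <stdlib.h>.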
Create parser only.
-p  (occs only.) Rather than creating a combined parser-and-action file, the parser
    component of the file is output to the file yyout.c. No actions are put into
    yyout.c; the action component of the output file can be created using the -a
    switch, described earlier.
Symbol Tables.
-s, -S  These switches cause the symbol-table file (yyout.sym for occs and llout.sym
    for LLama) to be generated. A capital S causes the symbol-table file to contain
    more information: the FIRST() sets for each nonterminal are printed in the
    occs table, and both FIRST and FOLLOW sets are printed in LLama's.
Use stdout.
-t  Causes occs to output the parser to standard output rather than to the yyout.c file.
Send large tables to yyoutab.c.
-T  The table-compression method used by occs can create several hundred small
    arrays, and some compilers can't handle that many declarations in a single file.
    The -T switch splits the parser file up into two portions, one containing the
    parser and most of the tables, the other containing the two biggest tables, the
    ones most likely to give the compiler problems. The parser is put into yyout.c
    as usual. The two parse tables are put into yyoutab.c. The two files can then be
    compiled independently and linked together. You can use this switch in
    conjunction with the -a and -p switches to split the parser into three parts. Use:
        occs -pT input.y     Create yyout.c and yyoutab.c
        occs -a input.y      Create actions file
-v, -V  The -v switch puts occs and LLama into verbose mode. Occs generates a file
    called yyout.doc which contains a human-readable description of the LALR(1)
    state machine. There is no equivalent file from LLama. Both programs send
    progress reports to the screen as they work and they print various statistics
    describing table sizes, and so forth, when done. -V works like -v except that
    more information about the internal workings of occs and LLama than you're
    interested in seeing is printed on the screen. It's used primarily for debugging
    occs itself or for seeing how the LALR(1) state machine is put together. -V
    also puts the LALR(1) lookaheads for every kernel item (rather than just for the
    items that cause reductions) into yyout.doc.
-w  Suppress all warning messages. These warnings announce the presence of
    shift/reduce and reduce/reduce conflicts found when the state machine is
    constructed. Once you've ascertained that these warnings are harmless, you can use
    -w to clean up the screen. Warnings are printed to yyout.doc, even if -w is
    specified, and the number of warnings is always printed at the end of an occs
    run, even if the warnings themselves aren't printed.
-W Normally, the number of hard errors is returned to the operating system as the
exit status. If -W is specified, the exit status is the sum of the number of hard
errors and warning messages. This switch is handy if you want the UNIX make
utility (or equivalent) to terminate on warnings as well as errors.
E.8 The Visible Parser [10]
An occs- and LLama-generated parser can function in one of two ways. In production
mode, it just parses the input and generates code, as you would expect. In debug
mode, however, a multiple-window interactive debugging environment (or IDE) is made
available to you. Debug mode is enabled in one of two ways: either specify -D on the
occs command line, or put the following in the rules section of your input file:
    %{ #define YYDEBUG %}
The resulting file should be compiled and linked to three libraries: l.lib, curses.lib, and
termlib.lib. [11] Only the first of these libraries is necessary if you're not compiling for
debugging. The occs source file for a small expression compiler is on the
software-distribution disk mentioned in the preface, along with the curses and termlib libraries.
You may need to recompile the libraries to get them to work properly if you're not using
Microsoft C (or QuickC). An executable version of the compiler (expr.exe) is also on
the disk, however, so you don't have to recompile right now. Make the expression
compiler with Microsoft C as follows:
10. This section describes how to use the interactive debugging environment. If at all possible, you should
    have an occs output file, compiled for interactive debugging, running in front of you as you read. An
    executable version of the expression compiler described below is provided on the software distribution
    disk in the file expr.exe.
11. Use cc files... -ll -lcurses -ltermlib in UNIX.
Verbose-mode. LALR(1) state-machine description.
Suppress warning messages.
Production and debug mode.
IDE (interactive debugging environment).
Compiling for Debugging.
Direct-video output.
The VIDEO environment variable.
Window names.
Changing stack-window size from the command line.
    occs -vD expr.y
    lex -v expr.lex
    cl -o expr.exe yyout.c lexyy.c -link l.lib curses.lib termlib.lib
Then run it with:
    expr test
where test is an input file containing a simple expression composed of numbers, variable
names, plus signs, asterisks, and parentheses. The expression must be the first and only
line in the file.
In the MS-DOS environment, the debugger uses the ROM-BIOS for its output. It can
also use direct-video writes, however, and on some clones this approach is more reliable
than using the BIOS. To activate the direct-video routines, set the VIDEO environment
variable to DIRECT with the following command to COMMAND.COM:
    set VIDEO=DIRECT
The direct-video routines assume that either an MGA, or a CGA running in
high-resolution, 25x80 text mode is present (it automatically selects between these two display
adapters). Most other video adapters can simulate one or the other of these modes. The
program will not work with a CGA running in one of the other text modes (25x20 or
25x40), however. Issue a MODE BW80 request at the DOS prompt to get the CGA into
high-resolution text mode.
The program comes up with several empty windows, as shown in Figure E.1. The
stack window is used to show the state and value-stack contents; the comments window
holds a running description of the parse process; generated code is displayed in the
output window; the lookahead window displays the most recently read token and lexeme
(the current lookahead symbol); and the prompts window is used to communicate with the
debugger. The size of the stack window can be changed by specifying an optional -sN
command-line switch when you invoke the program, as in expr -s15 test. In this case the
stack window occupies 15 lines. The other windows scale themselves to take up what's
left of the screen.
Several commands are supported by the debugger; you can see a list by typing a
question mark, which prints the list shown in Figure E.2. The easiest command to use is
the g command (for go) which just starts up the parsing. Figure E.3 shows the various
windows several stages into parsing the expression a*(b+c). The parsing can be
stopped at any time by pressing any key.
In Figure E.3, the parser has just read the c (but it has not yet been shifted onto the
parse stack). The ID c in the lookahead window is the most recently read token and
lexeme. The output window shows code that has been output so far, and the current
parse stack is shown in the stack window.
The output window shows output sent to all three segment streams. The leftmost
column identifies the segment: it holds a C for code, D for data, or B for bss. Because
the window is so narrow, sequences of white-space characters (tabs, spaces, and so forth)
are all replaced with a single space character in this window.
The stack window always displays the topmost stack elements. If the stack grows
larger than the window, the stack window scrolls so that the top few stack elements are
displayed. The three columns in the stack window are (from left to right): the parse
stack, the parse stack represented in symbolic form rather than numbers, and the value
stack. Several commands other than g are recognized at the "Enter command" prompt in
the prompts window:
Figure E.1. The Initial Debug Screen
[Screen image: five empty windows labeled stack, comments, output, prompts, and
lookahead. The prompts window shows "Enter command (space to continue, ? for list)".]
Figure E.2. Debug-Mode Commands
a  (A)bort parser by reading EOI        l    (L)og output to file
b  modify or examine (B)reakpoint       n/N  (N)oninteractive mode
d  set (D)elay time for go mode         q    (Q)uit (exit to dos)
f  read (F)ile                          r    (R)efresh stack window
g  (G)o (any key stops parse)           w    (W)rite screen to file or device
i  change (I)nput file                  x    Show current and prev. le(X)eme

Space or Enter to single step
Single step.
space, enter
    You can single step through the parse by repetitively pressing the space bar or
    the Enter key. In practice, this is usually more useful than g, which can go by
    faster than you can see.
Debugger hooks.
Ctrl-A, Ctrl-B
    These two commands are deliberately not printed on the help screen. They are
    debugger hooks. Ctrl-A causes the debugger to call the subroutine
    yyhook_a(), and Ctrl-B calls yyhook_b(). The default versions of these
    routines do nothing. You can use these commands for two things. First, it's
    often useful to run the IDE under a compiler debugger like sdb or CodeView,
    and you can use a hook as an entry point to the debugger: Set a breakpoint at
    yyhook_a() and then start the parser running; issuing a Ctrl-A then returns
    you to the debugger.
Figure E.3. The Windows Several Steps into a Parse
[stack]
    6 | PLUS | _t1
    5 | e    | _t1
    3 | LP   | _t0
    7 | STAR | _t0
    4 | e    | _t0
    0 | S    | <empty>
[comments]
    Shift <LP> (goto 3)
    Advance past ID
    Shift <ID> (goto 1)
    Advance past PLUS <+>
    Reduce by (5) e->ID
    (goto 5)
    Shift <PLUS> (goto 6)
    Advance past <PLUS>
    Read ID <c>
[output]
    B public word _t0, _t1, _t2, _t3;
    B public word _t4, _t5, _t6, _t7;
    C _t0 = a
    C _t1 = b
[prompts]
    Enter command (space to continue, ? for list)
[lookahead]
    ID c
    The two hook subroutines are in modules by themselves, so you can use
    them to add capabilities to the IDE by linking your own versions rather than
    the library versions. For example, it's occasionally useful to print the symbol
    table when you are debugging the parts of the compiler that are processing
    declarations. This capability can't be built into the IDE because there's no
    way of knowing what the data structures will look like. The problem is solved
    by supplying a subroutine called yyhook_a() which prints the symbol table.
    Your version is linked rather than the default library version, so a Ctrl-A
    command now prints the symbol table.
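A replacement hook along these lines might look like the sketch below. Several things are assumptions made for illustration: the hook is taken to have no arguments and no return value, the symbol table is a made-up array, and yycomment() is replaced by a stand-in so that the fragment compiles on its own (in a real parser you would link the library version and delete the stand-in).

```c
#include <stdio.h>
#include <stdarg.h>

int Comment_calls;      /* stand-in bookkeeping only: counts yycomment() calls */

/* Stand-in for the library's yycomment(), which writes to the comments
 * window. This one prints to standard output so the sketch runs alone. */
void yycomment(char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    vprintf(fmt, args);
    va_end(args);
    ++Comment_calls;
}

/* Hypothetical symbol table; the real one is whatever your compiler uses. */
struct symbol { char *name; char *type; };

static struct symbol Symtab[] = {
    { "count", "int"    },
    { "buf",   "char *" },
};
#define NSYMS (sizeof Symtab / sizeof Symtab[0])

/* Replaces the do-nothing library hook; runs when Ctrl-A is typed. */
void yyhook_a(void)
{
    size_t i;

    for (i = 0; i < NSYMS; ++i)
        yycomment("%s: %s\n", Symtab[i].name, Symtab[i].type);
}
```

Writing through yycomment() rather than printf() keeps the output inside the comments window, so the other windows are not disturbed.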
a This command aborts the parse process by forcing the lexical analyzer to
return the end-of-input marker on all subsequent calls. Compare it with the q
command, which exits immediately to the operating system without returning
to the parser itself.
Breakpoints.
b   The b command is used to set, clear, or examine breakpoints. Four types of
    breakpoints are supported:
    input (i)       breaks when a specified symbol is input.
    input line (l)  breaks when the input sweeps past the specified input line.
                    This breakpoint is activated only when the lexical analyzer
                    returns a token, and it's possible to specify an input line that
                    doesn't have any tokens on it. (In this case, the break occurs
                    when the first token on a line following the indicated one is read.)
                    Note that this breakpoint, unlike any of the others, is
                    automatically cleared as soon as it is triggered.
    stack (s)       breaks as soon as a specified symbol appears at the top of the
                    parse stack. In a LLama parser, parsing stops before any action
                    code is executed.
    production (p)  breaks just before the parser applies a specific production. In
                    occs, it breaks just before the reduction. In LLama, it breaks just
                    before replacing a nonterminal with its right-hand side.
    When you enter b, you are prompted for one of these types (enter an i, l, s, or p
    for "input," "line," "stack," or "production"). You are then prompted for a
    symbol. The different kinds of breakpoints can take different inputs in response
    to this prompt:
    input           (1) A decimal number is assumed to be a token value, as defined
                    in yyout.h or yyout.sym (llout.h or llout.sym if LLama). Parsing
                    stops when that token is read. (2) A string that matches a token
                    name (as it appears in a %term or %token directive) behaves
                    like a number: parsing stops when that token is read. (3) Any
                    other string causes a break when the current lexeme matches the
                    indicated string. For example, if you have a %term NAME in the
                    input file and the compiler compiler puts a #define NAME 1 in
                    yyout.h or llout.h, then entering either the string 1 or the string
                    NAME causes a break the next time a NAME token is read. You
                    can also enter the string Illiavitch, whereupon a break
                    occurs the next time Illiavitch shows up as a lexeme, regardless
                    of the token value.
    input line      Enter the input line number.
    stack           (1) If you enter a number, parsing stops the next time that number
                    appears at the top of the real parse stack (the leftmost column in
                    the stack window). (2) Any other string is treated as a symbolic
                    name, and parsing stops the next time that name appears at the
                    top of the symbol stack (the middle column in the window).
    production      You must enter a number for this kind of breakpoint. Parsing
                    stops just before the particular production is applied (when it's
                    replaced in a LLama parser and when a reduction by that
                    production occurs in an occs parser). Production numbers can be
                    found in the yyout.sym or llout.sym file, generated with a -s or
                    -S command-line switch.
    There's no error checking to see if the string entered for a stack breakpoint
    actually matches a real symbol. The breakpoint-processing routines just check
    the input string against the symbol displayed in the symbolic portion of the
    stack window. If the two strings match, the parsing stops. The same technique
    is used for input breakpoints. If the symbol isn't a number, then the breakpoint
    string is compared, first against the input lexeme, and then against the
    name of the current token. If either string matches, then parsing stops.
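The matching rule for input breakpoints can be sketched as a single function. This is an illustration of the rule as described, not the debugger's own code; input_break_hit() and its parameters are hypothetical names.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decide whether an input breakpoint fires.
 *   bp          the string typed at the breakpoint prompt
 *   token       numeric value of the token just read
 *   token_name  its symbolic name (as in a %term or %token directive)
 *   lexeme      the current lexeme
 * Returns 1 if parsing should stop, 0 otherwise.
 */
int input_break_hit(const char *bp, int token,
                    const char *token_name, const char *lexeme)
{
    if (isdigit((unsigned char)bp[0]))        /* a number: token value */
        return atoi(bp) == token;

    return strcmp(bp, lexeme) == 0            /* first try the lexeme,   */
        || strcmp(bp, token_name) == 0;       /* then the token's name   */
}
```

Because nothing checks the string against the list of declared symbols, a misspelled token name silently becomes a lexeme breakpoint that may never fire.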
Three other breakpoint commands are provided: c clears all existing
breakpoints, d displays all existing breakpoints, and ? prints the help screen
shown in Figure E.4.
Adding a delay in go mode.
d   The d command is used to slow down the parse. The parsing process, when
    started with a g command, nips along at a pretty good clip, often too fast for
    you to see what's going on. The d command is used to insert a delay between
Figure E.4. Breakpoint Help Screen
Select a breakpoint type (i, p, or s) or command (c or l)
Type:  Description:             Enter breakpoint as follows:
  i    input....................number for token value
                                or string for lexeme or token name
  l    input line...............input line number
  p    reduce by production.....number for production number
  s    top-of-stack symbol......number for state-stack item
                                or string for symbol-stack item
  c =  clear all breakpoints
  l =  list all breakpoints
every parse step. Setting the delay to 0 puts it back to its original blinding
speed. Delay times are entered in seconds, and you can use decimal fractions
(1, 2.5, .5, and so forth) if you like.
Examine file.
f   The f command lets you examine a file without leaving the debugger. You are
    prompted for a file name, and the file is then displayed in the stack window,
    one screenful at a time.
Go!
g   The g command starts the parse going.
Specify input file.
i   The i command lets you change the input file from the one specified on the
    command line. It prompts you for a file name.
Log all output to file. Specifying horizontal stacks.
l   This command causes the entire parse to be logged to a specified file, so that
    you can look at it later. If you're running under MS-DOS, you can log to the
    printer by specifying prn: instead of a file name. Some sample output is
    shown in Figures E.5 and E.6. Output to the CODE window is all preceded by
    CODE>. Most other text was COMMENT-window output. The parse stack is
    drawn after every modification in one of two ways (you'll be prompted for a
    method when you open the log file). Listing E.3 shows vertical stacks and
    Listing E.4 shows horizontal stacks. The latter is useful if you have relatively
    small stacks or relatively wide paper; it generates more-compact log files in
    these situations. The horizontal stacks are printed so that items at equivalent
    positions on the different stacks (parse/state, symbol, and value) are printed
    one atop the other, so the column width is controlled by the stack that requires
    the widest string to print its contents, usually the value stack. If you specify
    horizontal stacks at the prompt, you will be asked which of the three stacks to
    print. You can use this mechanism to leave out one or more of the three
    stacks.
Noninteractive mode: Create log without window updates.
Run parser without logging or window updates.
n, N  These commands put the parser into noninteractive mode. The n command
    generates a log file quickly, without having to watch the whole parse happen
    before your eyes. All screen updating is suppressed and the parse goes on at
    much higher speed than normal. A log file must be active when you use this
    command. If one isn't, you'll be prompted for a file name. You can't get back
    into normal mode once this process is started. The N command runs in
    noninteractive mode without logging anything. It's handy if you just want to run the
    parser to get an output file and aren't interested in looking at the parse process.
Listing E.3. Logged Output: Vertical Stacks
    CODE->public word _t0, _t1, _t2, _t3;
    CODE->public word _t4, _t5, _t6, _t7;

    Shift start state
       +---+------+------+
     0 | 0 | S    |      |
       +---+------+------+
    Advance past NUM <1>
    Shift <NUM> (goto 2)
       +---+------+------+
     1 | 2 | NUM  |      |
     0 | 0 | S    |      |
       +---+------+------+
    Advance past PLUS <+>
    CODE->_t0 = 1
    Reduce by (4) e->NUM
       +---+------+------+
     0 | 0 | S    |      |
       +---+------+------+
    (goto 4)
       +---+------+------+
     1 | 4 | e    | _t0  |
     0 | 0 | S    |      |
       +---+------+------+
    Shift <PLUS> (goto 6)
       +---+------+------+
     2 | 6 | PLUS | _t0  |
     1 | 4 | e    | _t0  |
     0 | 0 | S    |      |
       +---+------+------+
    Advance past NUM <2>
    Shift <NUM> (goto 2)
       +---+------+------+
     3 | 2 | NUM  | _t0  |
     2 | 6 | PLUS | _t0  |
     1 | 4 | e    | _t0  |
     0 | 0 | S    |      |
       +---+------+------+
    Advance past STAR <*>
    CODE->_t1 = 2
Quit.
q   returns you to the operating system.
Redraw stack window.
r   An r forces a STACK-window refresh. The screen-update logic normally
    changes only those parts of the STACK window that it thinks should be
    modified. For example, when you do a push, the parser writes a new line into
    the stack window, but it doesn't redraw the already existing lines that
    represent items already on the stack. Occasionally, your value stack can
    become corrupted by a bug in your own code, however, and the default update
    strategy won't show you this problem because it might not update the
    incorrectly modified value stack item. The r command forces a redraw so that
    you can see what's really on the stack.
Save screen to file.
w   (for write) dumps the screen to an indicated file or device, in a manner
    analogous to the Shift-PrtSc key on an IBM PC. This way you can save a snapshot
Listing E.4. Logged Output: Horizontal Stacks
    CODE->public word _t0, _t1, _t2, _t3;
    CODE->public word _t4, _t5, _t6, _t7;

    Shift start state
        PARSE   0
        SYMBOL  S
        ATTRIB  -
    Advance past NUM <1>
    Shift <NUM> (goto 2)
        PARSE   0  2
        SYMBOL  S  NUM
        ATTRIB  -
    Advance past PLUS <+>
    CODE->_t0 = 1

    Reduce by (4) e->NUM
        PARSE   0
        SYMBOL  S
        ATTRIB  -
    (goto 4)
        PARSE   0  4
        SYMBOL  S  e
        ATTRIB  -  _t0 (e)
    Shift <PLUS> (goto 6)
        PARSE   0  4        6
        SYMBOL  S  e        PLUS
        ATTRIB  -  _t0 (e)  _t0 (PLUS)
    Advance past NUM <2>
    Shift <NUM> (goto 2)
        PARSE   0  4    6     2
        SYMBOL  S  e    PLUS  NUM
        ATTRIB  -  _t0  _t0   _t0
    Advance past STAR <*>
    CODE->_t1 = 2
of the current screen without having to enable logging. Any IBM box-drawing
characters used for the window borders are mapped to dashes and vertical
bars. I used the w command to output the earlier figures. Note that the output
screen is truncated to 79 characters because some printers automatically print
a linefeed after the 80th character. This means that the right edge of the box
will be missing (I put it back in with my editor when I made the figures).
Display lexeme.
x   Display both the current and previous lexeme in the comments window. The
    token associated with the current lexeme is always displayed in the tokens
    window.
E.9 Useful Subroutines and Variables
There are several useful subroutines and variables available in an occs- or LLama-
generated parser. These are summarized in Table E.4 and are discussed in this section.
void yyparse()
    This subroutine is the parser generated by both occs and LLama. Just call it to
    get the parse started.
char *yypstk(YYSTYPE *val, char *sym)
    This subroutine is called from the debugging environment and is used to print the
    value stack. It's passed two pointers. The first is a pointer to a stack item. So, if
    your value stack is a stack of character pointers, the first argument will be a
    pointer to a character pointer. The second argument is always a pointer to a
    string holding the symbol name. That is, it contains the symbol stack item that
    corresponds to the value stack item. The returned string is truncated to 50
    characters, and it should not contain any newline characters. The default routine in l.lib
    assumes that the value stack is the default int type. It's shown in Listing E.5.
    This subroutine is used in slightly different ways by occs and LLama, so is
    discussed further, below.
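For a value stack of character pointers, a user-supplied version might look like the following sketch. The typedef name stype and the "<null>" marker are assumptions made for illustration, and the function does its own truncation and newline removal so that the result meets the constraints described above.

```c
#include <stdio.h>

typedef char *stype;            /* hypothetical value-stack type */
#define YYSTYPE stype

/*
 * Print one value-stack item. `val` points at the stack item, so for a
 * stack of character pointers it is a pointer to a character pointer.
 * `sym` is the matching symbol-stack name; it is not used here.
 */
char *yypstk(YYSTYPE *val, char *sym)
{
    static char buf[51];        /* room for 50 characters plus '\0' */
    char *p;

    (void)sym;
    sprintf(buf, "%.50s", *val ? *val : "<null>");
    for (p = buf; *p; ++p)      /* the result must not contain newlines */
        if (*p == '\n')
            *p = ' ';
    return buf;
}
```

The buffer is static because the caller uses the returned pointer after the function has returned.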
int yy_get_args(int argc, char **argv)
    This routine can be used to modify the size of the stack window and to open an
    input file for debugging. The other windows automatically scale as appropriate,
    and the stack window is not allowed to get so large that the other windows
    disappear. Typically, yy_get_args() is called at the top of your main() routine,
    before yyparse() is called. The subroutine is passed argv and argc and it
    scans through the former looking for an argument of the form -sN, where N is the
    desired stack-window size. The first argument that doesn't begin with a minus
    sign is taken to be the input file name. That name is not removed from argv. All
    other arguments are ignored and are not removed from argv, so you can process
    them in your own program. The routine prints an error message and terminates
    the program if it can't open the specified input file. Command-line processing
    stops immediately after the input file name is processed. So, given the line:
        program -x -s15 -y foo -s1 bar
    argv is modified to:
        program -x -y foo -s1 bar
    the file foo is opened for input, and the stack window will be 15 lines high. A
    new value of argc that reflects the removed argument is returned.
    This routine can also be used directly, rather than as a command-line
    processor. For example, the following sets up a 17-line stack window and opens testfile
    as the input file:
        char *vects[] = { "", "-s17", "testfile" };
        yy_get_args( 3, vects );
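The scanning rules can be sketched on their own, without the file and window handling. The function below is a hypothetical stand-in named scan_args(), not the library routine; in particular, removing every -sN that precedes the input file (rather than only the first) is an assumption.

```c
#include <stdlib.h>

/*
 * Sketch of the scanning rules only. Removes each -sN argument that
 * precedes the input file and stores N in *size, stores the first
 * argument that doesn't start with '-' in *input, and stops there.
 * Returns the new argument count.
 */
int scan_args(int argc, char **argv, int *size, char **input)
{
    int i, j;

    for (i = 1; i < argc; ++i) {
        if (argv[i][0] != '-') {            /* input file: stop scanning */
            *input = argv[i];
            break;
        }
        if (argv[i][1] == 's') {            /* -sN: record and remove it */
            *size = atoi(argv[i] + 2);
            for (j = i; j < argc - 1; ++j)
                argv[j] = argv[j + 1];
            --argc;
            --i;                            /* re-examine the new argv[i] */
        }
    }
    return argc;
}
```

Applied to the command line in the example above, the sketch returns 6, sets the size to 15, and leaves -x, -y, and the trailing -s1 bar in place for the program's own argument processing.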
void yy_init_occs(YYSTYPE *tos)
void yy_init_llama(YYSTYPE *tos)
    These routines are called by yyparse() after it has initialized the various
    stacks, but before it has read the first input token. That is, the initial start symbol
    has been pushed onto the state stack, and garbage has been pushed onto the
    corresponding entry on the value stack, but no tokens have been read. The
    subroutine is passed a pointer to the (only) item on the value stack. You can use
    yy_init_occs() to provide a valid attribute for this first value-stack element.
    A user-supplied version of both functions is also useful when main() isn't in the
    occs input file, because it can be used to initialize static global variables in the
The parser subroutine.
Print value-stack item.
Change stack-window size and specify input file to debugger.
Command-line processing.
Using yy_get_args() directly.
Initialization functions.
Table E.4. Useful Subroutines and Variables
    int   yyparse       ( void );
    int   yylex         ( void );
    char *yypstk        ( void *value_stack_item, char *symbol_stack_item );
    void  yycomment     ( char *fmt, ... );
    void  yycode        ( char *fmt, ... );
    void  yydata        ( char *fmt, ... );
    void  yybss         ( char *fmt, ... );
    int   yy_get_args   ( int argc, char **argv );
    void  yy_init_occs  ( void );
    void  yy_init_llama ( void );
    void  yy_init_lex   ( void );
    void  yyerror       ( char *fmt, ... );
    FILE *yyout     = stdout;    /* output stream for code */
    FILE *yybssout  = stdout;    /* output stream for bss  */
    FILE *yydataout = stdout;    /* output stream for data */
Listing E.5. yypstk.c: Print Default Value Stack
    /* Default routine to print user-supplied portion of the value stack. */

    #include <stdio.h>

    char *yypstk( val, sym )
    void *val;
    char *sym;
    {
        static char buf[32];
        sprintf( buf, "%d", *(int *)val );
        return buf;
    }
    parser file. It's easy to forget to call an initialization routine if it has to be called
    from a second file. The default subroutines, in l.lib, do nothing. (They are shown
    in Listings E.6 and E.7.)
Listing E.6. yyinitox.c: Occs User-Initialization Subroutine
    void yy_init_occs( tos ) void *tos; { }
Listing E.7. yyinitll.c: Default LLama Initialization Function
    void yy_init_llama( tos ) void *tos; { }

Print parser error messages.
void yyerror(char *fmt, ...)
    This routine is called from yyparse() when it encounters an error, and you should
    use it yourself for error messages. It works like printf(), but it sends output to
    stderr and it adds the current input line number and token name to the front of
    the message, like this:
        ERROR (line 00 near TOK): your message goes here.
    The 00 is replaced by the current line number and TOK is replaced by the
    symbolic name of the current token (as defined in the occs input file). The routine
    adds a newline at the end of the line for you.
void yycomment(char *fmt, ...)
void yycode(char *fmt, ...)
void yydata(char *fmt, ...)
void yybss(char *fmt, ...)
    These four subroutines should be used for all your output. They work like
    printf(), but write to appropriate windows when the debugging environment
    is enabled. When the IDE is not active, yycomment() writes to standard output
    and is used for comments to the user, yycode() writes to a stream called
    yycodeout and should be used for all code, yydata() writes to a stream called
    yydataout and should be used for all initialized data, and yybss() writes to a
    stream called yybssout and should be used for uninitialized data.
    All of these streams are initialized to stdout, but you may change them at any
    time with an fopen() call. Don't use freopen() for this purpose, or you'll
    close stdout. If any of these streams are changed to reference a file, the
    debugger sends output both to the file and to the appropriate window. If you
    forget and use one of the normal output routines like puts() or printf(), the
    windows will get messed up, but nothing serious will happen. printf() is
    automatically mapped to yycode() if debugging is enabled, so you can use
    printf() calls in the occs input file without difficulty. Using it elsewhere in
    your program causes problems, however.
void yyprompt(char *prompt, char *buf, int get_str)
    This subroutine is actually part of the debugger itself, but is occasionally useful
    when implementing a debugger hook, described earlier. It prints the prompt
    string in the IDE's prompts window, and then reads a string from the keyboard
    into buf (there's no boundary checking, so be careful about the array size). If
    get_str is true, an entire string is read in a manner similar to gets();
    otherwise only one character is read. If an ESC is encountered, the routine returns 0
    immediately; otherwise 1 is returned.
E.10 Using Your Own Lexical Analyzer
Occs and LLama are designed to work with a LeX-generated lexical analyzer. You
can build a lexical analyzer by hand, however, provided that it duplicates LeX's interface
to the parser.
You must provide the following subroutines and variables to use the interactive
debugging environment without a LeX-generated lexical analyzer:
Output functions.
Print a message to the prompt window.
Specifying precedence and associativity, %term, %left, %right, %nonassoc.
    char *yytext;                 current lexeme
    int   yylineno;               current input line number
    int   yyleng;                 number of characters in yytext[]
    int   yylex(void);            return next token and advance input
    char *ii_ptext(void);         return pointer to previous lexeme
    int   ii_plength(void);       return length of previous lexeme
    int   ii_mark_prev(void);     copy current lexeme to previous lexeme
    int   ii_newfile(char *name); open new input file
The scanner must be called yylex(). It must return either a token defined in yyout.h or
zero at end of file. You'll also need to provide a pointer to the current lexeme called
yytext and the current input line number in an int called yylineno. The
ii_ptext() subroutine must return a pointer to the lexeme that was read immediately
before the current one; the current one is in yytext, so no special routine is needed to
get at it. The string returned from ii_ptext() does not have to be '\0' terminated.
Like yyleng, ii_plength() should evaluate to the number of valid characters in
the string returned from ii_ptext(). ii_mark_prev() should copy the current
lexeme into the previous one. ii_newfile() is called when the program starts to
open the input file. It is passed the file name. The real ii_newfile() returns a file
handle that is used by yylex() in turn. Your version of ii_newfile() need only do
whatever is necessary to open a new input file. It should return a number other than -1
on success, -1 on error. Note that input must come from a file. The debugging routines
will get very confused if you try to use stdin.
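A hand-written scanner that supplies this interface could be organized like the sketch below. It is a minimal illustration, not the LeX input system: the token values NUM, ID, and PLUS are made up (real values come from yyout.h), the buffer size is arbitrary, and characters the sketch doesn't recognize are simply skipped.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define NUM  1                /* hypothetical token values; real ones */
#define ID   2                /* come from yyout.h                    */
#define PLUS 3
#define MAXLEX 128

static FILE *Input;                        /* current input file          */
static char  Cur[MAXLEX], Prev[MAXLEX];    /* current, previous lexemes   */
static int   Prev_len;

char *yytext   = Cur;                      /* current lexeme              */
int   yyleng   = 0;                        /* number of characters in it  */
int   yylineno = 1;                        /* current input line number   */

char *ii_ptext(void)   { return Prev;     }
int   ii_plength(void) { return Prev_len; }

int ii_mark_prev(void)                     /* current becomes previous */
{
    memcpy(Prev, Cur, (size_t)yyleng + 1);
    Prev_len = yyleng;
    return 0;
}

int ii_newfile(char *name)                 /* -1 on error, 0 on success */
{
    FILE *fp = fopen(name, "r");

    if (!fp)
        return -1;
    if (Input)
        fclose(Input);
    Input = fp;
    yylineno = 1;
    return 0;
}

int yylex(void)                            /* next token, 0 at end of file */
{
    int c, is_num;

    if (!Input)
        return 0;
    for (;;) {
        if ((c = getc(Input)) == EOF)
            return 0;
        if (c == '\n')
            ++yylineno;
        if (isspace(c))
            continue;

        yyleng = 0;
        Cur[yyleng++] = (char)c;
        if (isalnum(c)) {                  /* gather a number or a name */
            is_num = isdigit(c);
            while ((c = getc(Input)) != EOF && yyleng < MAXLEX - 1
                          && (is_num ? isdigit(c) : isalnum(c)))
                Cur[yyleng++] = (char)c;
            if (c != EOF)
                ungetc(c, Input);
            Cur[yyleng] = '\0';
            return is_num ? NUM : ID;
        }
        Cur[yyleng] = '\0';
        if (c == '+')
            return PLUS;
        /* anything else is silently skipped in this sketch */
    }
}
```

The sketch keeps the previous lexeme '\0' terminated for convenience, although the interface doesn't require it.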
E.11 Occs
This section describes occs-specific parts of the compiler compiler. A discussion of
the LLama-specific functions starts in Section E.12.
E.11.1 Using Ambiguous Grammars
The occs input file supports several %directives in addition to the ones discussed
earlier. (All occs directives are summarized in Table E.5. They will be described in this
and subsequent sections.)
A definitions section for a small expression compiler is shown in Listing E.8. The
analyzer will recognize expressions made up of numbers, identifiers, parentheses, and
addition and multiplication operators (+ and *). Addition is lower precedence than
multiplication and both operators associate left to right. The compiler outputs code that
evaluates the expression. The entire file, from which the definitions section was
extracted, appears at the end of the occs part of this appendix.
Terminal symbols are defined on lines one to six. Here, the %term is used for those
terminal symbols that aren't used in expressions. %left is used to define
left-associative operators, and %right is used for right-associative operators. (A %nonassoc
directive is also supplied for declaring nonassociative operators.) The higher a
%left, %right, or %nonassoc is in the input file, the lower its precedence. So, PLUS
is lower precedence than STAR, and both are lower precedence than parentheses. A
%term or %token is not needed if a symbol is used in a %left, %right, or %nonassoc.
The precedence and associativity information is used by occs to patch the parse
tables created by an ambiguous input grammar so that the input grammar will be parsed
correctly.
Table E.5. Occs % Directives and Comments

    Directive    Description
    %%           Delimits the three sections of the input file.
    %{           Starts a code block. All lines that follow, up to a %}, are written to
                 the output file unchanged.
    %}           Ends a code block.
    %token       Defines a token.
    %term        A synonym for %token.
    /* */        C-like comments are recognized, and ignored, by occs, even if they're
                 outside of a %{ %} delimited code block.

    %left        Specifies a left-associative operator.
    %right       Specifies a right-associative operator.
    %nonassoc    Specifies a nonassociative operator.
    %prec        Use in rules section only. Modifies the precedence of an entire
                 production to resolve a shift/reduce conflict caused by an ambiguous
                 grammar.

    %union       Used for typing the value stack.
    %type        Attaches %union fields to nonterminals.
Listing E.8. expr.y: occs Definitions Section for a Small Expression Compiler

     1  %term  ID                  /* a string of lower-case characters */
     2  %term  NUM                 /* a number                          */
     3
     4  %left  PLUS                /* +   */
     5  %left  STAR                /* *   */
     6  %left  LP RP               /* ( ) */
     7
     8  %{
     9  #include <stdlib.h>
    10
    11  extern char *yytext;       /* In yylex(), holds lexeme        */
    12  extern char *new_name();   /* declared at bottom of this file */
    13
    14  typedef char *stype;       /* Value stack */
    15  #define YYSTYPE stype
    16
    17  #define YYMAXDEPTH 64
    18  #define YYMAXERR   10
    19  #define YYVERBOSE
    20  %}
Most of the grammars in this book use recursion and the ordering of productions to
get proper associativity and precedence. Operators handled by productions that occur
earlier in the grammar are of lower precedence; left recursion gives left associativity;
right recursion gives right associativity. This approach has its drawbacks, the main one
being a proliferation of productions and a correspondingly larger (and slower) state
machine. Also, in a grammar like the following one, productions like 2 and 4 have no
purpose other than establishing precedence. These productions generate
single-reduction states, described in Chapter Five.
    0.  s  →  e
    1.  e  →  e + t
    2.     |  t
    3.  t  →  t * f
    4.     |  f
    5.  f  →  ( e )
    6.     |  NUM
    7.     |  ID
Occs lets you redefine the foregoing grammar as follows:
    %term  NUM ID
    %left  PLUS        /* +   */
    %left  STAR        /* *   */
    %left  LP RP       /* ( ) */
    %%
    s : e ;
    e : e PLUS e
      | e STAR e
      | LP e RP
      | NUM
      | ID
      ;
The ambiguous grammar is both easier to read and smaller. It also generates smaller
parse tables and a somewhat faster parser. Though the grammar is not LALR(1), parse
tables can be built for it by using the disambiguation rules discussed in Chapter Five and
below. If there are no ambiguities in a grammar, %left and %right need not be
used; you can use %term or %token to declare all the terminals.
E.11.2 Attributes and the Occs Value Stack
The occs parser automatically maintains a value stack for you as it parses.
Moreover, it keeps track of the various offsets from the top of stack and provides a simple
mechanism for accessing attributes. The mechanism is best illustrated with an example.
In S→A B C, the attributes can be accessed as follows:

    S   :  A   B   C  ;
    $$     $1  $2  $3

That is, $$ is the value that is inherited by the left-hand side after the reduce is
performed. Attributes on the right-hand side are numbered from left to right, starting with
the leftmost symbol on the right-hand side. The attributes can be used anywhere in a
curly-brace-delimited code block. The parser provides a default action of $$=$1, which
can be overridden by an explicit assignment in an action. (Put a $$ to the left of an
equals sign [just like a variable name] anywhere in the code part of a rule.)
The one exception to the $$=$1 rule is an ε production (one with an empty
right-hand side); there's no $1 in an ε production. A reduce by an ε production pops nothing
(because there's nothing on the right-hand side) and pushes the left-hand side. Rather
than push garbage in this situation, occs duplicates the previous top-of-stack item in the
push. Yacc pushes garbage.
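The default action and the ε-production exception can be pictured with a small value-stack model. This is a minimal sketch, not the occs parser itself: the array-based stack that grows upward, and the names Vstack, vpush(), default_value(), and reduce(), are all assumptions made for the illustration.

```c
#include <assert.h>

#define DEPTH 64
typedef int YYSTYPE;               /* default value-stack type */

static YYSTYPE Vstack[DEPTH];      /* hypothetical value stack */
static int     Vtop = -1;          /* index of the top item, -1 if empty */

void    vpush(YYSTYPE v) { assert(Vtop + 1 < DEPTH); Vstack[++Vtop] = v; }
int     depth(void)      { return Vtop + 1; }
YYSTYPE top(void)        { assert(Vtop >= 0); return Vstack[Vtop]; }

/* Value that $$ gets before any action runs: $1 for a normal production,
 * a copy of the current top-of-stack item for an epsilon production. */
YYSTYPE default_value(int rhs_len)
{
    assert(Vtop >= 0 && rhs_len >= 0 && rhs_len <= Vtop + 1);
    return rhs_len > 0 ? Vstack[Vtop - (rhs_len - 1)] : Vstack[Vtop];
}

/* Pop the right-hand side, then push the attribute for the left-hand side. */
void reduce(int rhs_len, YYSTYPE val)
{
    assert(rhs_len >= 0 && rhs_len <= Vtop + 1);
    Vtop -= rhs_len;
    vpush(val);
}
```

With three attributes on the stack, reduce(3, default_value(3)) leaves one item holding the old $1; a following reduce(0, default_value(0)) pushes a second copy of whatever was on top.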
By default, the value stack is of type int. Fortunately, it's easy to change this type.
Just redefine YYSTYPE in the definitions section of the input file. For example, the
expression compiler makes the value stack a stack of character pointers instead of ints
with the following definitions:
    %{
    typedef char *stype;
    #define YYSTYPE stype
    %}
The typedef isn't needed if the stack is redefined to a simple type (int, long, float,
double, and so forth). Given this definition, you could use $1 to modify the pointer
itself and *$1 to modify the object pointed to. (In this case, *$1 modifies the first
character of the string.) A stack of structures could be defined the same way:
    %{
    typedef struct
    {
        int     harpo;
        long    groucho;
        double  chico;
        char    zeppo[10];
    }
    stype;

    #define YYSTYPE stype
    %}
You can use $1.harpo, $2.zeppo[3], and so forth, to access a field. You can also
use the following syntax: $<harpo>$, $<chico>1, and so forth; $<chico>1 is
identical to $1.chico. The dollar attributes are expanded to the following statements:

    Input                        Expanded to
    $$                           Yy_val
    $N in the rules section      yysp[ constant ]
    $N in the code section       yysp[ (Yy_rhslen - constant) ]
yysp is the value-stack pointer. Yy_rhslen is the number of symbols on the
right-hand side of the production being reduced. The constant is derived from N, by adjusting
for the size of the right-hand side of the current production. For example, in a
production like

    S : A B C ;

$1, $2, and $3 evaluate to the following:

    Yy_vsp[ -2 ]    /* $1 */
    Yy_vsp[ -1 ]    /* $2 */
    Yy_vsp[ -0 ]    /* $3 */
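The arithmetic behind those three offsets is just N minus the length of the right-hand side. The following sketch shows it for positive N only; attr_offset() and fetch() are hypothetical names, and the little int array stands in for a real value stack whose top element is the one the pointer addresses.

```c
#include <assert.h>

/* Offset, relative to the value-stack pointer, of $N in a production whose
 * right-hand side has rhs_len symbols. Handles only 1 <= N <= rhs_len. */
int attr_offset(int n, int rhs_len)
{
    assert(n >= 1 && n <= rhs_len);
    return n - rhs_len;
}

/* Fetch $N, given a pointer to the attribute for the rightmost symbol. */
int fetch(const int *vsp, int n, int rhs_len)
{
    return vsp[attr_offset(n, rhs_len)];
}
```

For S : A B C, attr_offset(1, 3), attr_offset(2, 3), and attr_offset(3, 3) give the -2, -1, and 0 of the example above.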
Be careful of saying something like $$=$2 if you've defined YYSTYPE as a structure.
The whole structure is copied in this situation. (Of course, if you want the whole
structure to be copied...) Note that the default action ($$=$1) is always performed before
any of the actions are executed, and it affects the entire structure. Any specific action
modifies the default action. This means that you can let some fields of a structure be
inherited in the normal way and modify others explicitly. For example, using our earlier
structure, the following:

    { $$.harpo = 5; }

modifies the harpo field, but not the groucho, chico, or zeppo fields, which are
inherited from $1 in the normal way.
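In plain C terms, the combination of the default action and an explicit field assignment behaves like a structure assignment followed by one store. The sketch below uses the structure from the earlier example; reduce_with_action() is a hypothetical function that stands in for the code the parser executes around the action.

```c
#include <assert.h>
#include <string.h>

typedef struct
{
    int     harpo;
    long    groucho;
    double  chico;
    char    zeppo[10];
}
stype;

/* Model of a reduction whose action is { $$.harpo = 5; } */
stype reduce_with_action(const stype *dollar_1)
{
    stype val;

    assert(dollar_1 != NULL);
    val = *dollar_1;     /* default action, $$ = $1: copies the whole structure */
    val.harpo = 5;       /* explicit action: overrides one field only           */
    return val;
}
```

Every field other than harpo arrives in the result unchanged, which is the "inherited in the normal way" behavior described above.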
The entire 24-byte structure is copied at every shift or reduce. Consequently, it's
worth your effort to minimize the size of the value-stack elements. Note that the parser
is built assuming that your compiler supports structure assignment. If this isn't the case,
you'll have to modify the parser to use memcpy() to do assignments.
Occs' attribute support is extended from that of yacc in that dollar attributes can be
used in both the rules and code sections of the input specification (yacc permits them
only in the rules section). Occs treats $0 as a synonym for $$. Finally, occs permits
negative attributes. Consider the following:

    s : X A b C ;
    b : E { $$ = $-1 + $-2 } ;

The $-1 evaluates to the attributes for the symbol immediately below the E on the value
stack. To see what the negative attributes are doing, consider the condition of the stack
just before the reduce by b→E:

    X     A     E
    $-2   $-1   $1

$1 references the attributes associated with E in the normal way; $-1 references A's
attributes, and $-2 references X's attributes. Occs normally prints an error message if
you try to reference an attribute that's outside the production (if you tried to use $3 on a
right-hand side that had only two elements, for example). It's possible to reference off
the end of the stack if the numbers get too negative, however; no error message is
printed in this case. For example, a $-3 in the earlier example just silently evaluates to
garbage (the start symbol doesn't have any attributes).
Now, look closely at the attribute-passing mechanism for the expression compiler,
reproduced below:

    e : e PLUS e  { gen("%s += %s;", $1, $3); free_name( $3 ); }
      | e STAR e  { gen("%s *= %s;", $1, $3); free_name( $3 ); }
      | LP e RP   { $$ = $2; }
      | NUM       { gen("%s = %s;", $$ = new_name(), yytext ); }
      | ID        { gen("%s = %s;", $$ = new_name(), yytext ); }
      ;
The e→NUM and e→ID actions on lines four and five are identical. They output a
move instruction and then put the target of the move (the name of the rvalue) onto the
value stack as an attribute. The $$=$2 in the e→LP e RP action on the third line just
moves the attribute from the e that's buried in the parentheses to the one on the left-hand
side of the production. The top two productions are doing the real work. To see how
they function, consider the sample parse of A*2 in Figure E.5. The transition of interest
is from stack picture six to seven. t0 and t1 were put onto the stack when the rvalues
were created (and they're still there). So, we can find the temporaries used for the
rvalues by looking at $1 and $3. The inherited attribute is the name of the temporary
that holds the result of the multiplication.
Because the value stack is more often than not typed as a union, a mechanism is
provided to keep track of the various fields and the way that they attach to individual
symbols. This mechanism assumes a common practice, that specific fields within a union
are used only by specific symbols. For example, in a production like this:

    e : e divop e

where divop could be either a / or % (as in C), you would need to store one of two
attributes on the value stack. The attribute attached to the e would be a pointer to a string
holding the name of a temporary variable; the attribute attached to the divop would be
'/' for divide and '%' for a modulus operation. To do the foregoing, the value stack must
be typed to be a union of two fields, one a character pointer and the other an int. You
could do this using an explicit redefinition of YYSTYPE, as discussed earlier:
Figure E.5. A Parse of A*2

    Parse stack          Attributes
    $ ID
    $ e                  t0
    $ e STAR             t0
    $ e STAR NUM         t0
    $ e STAR e           t0 t1
    $ e                  t0
    $ s                  t0         (Accept)
    typedef union
    {
        char  *var_name;
        int   op_type;
    }
    yystype;

    #define YYSTYPE yystype
but the %union directive is a better choice. Do it as follows:

    %union {
        char  *var_name;
        int   op_type;
    }

A %union activates an automatic field-name generation feature that is not available if
you just redefine YYSTYPE. Given the earlier production (e : e DIVOP e), we'd like
to be able to say $$, $1, and so forth, without having to remember the names of the
fields. That is, you want to say

    $1 = new_name();

rather than

    $1.var_name = new_name();
Do this, first by using the %union, and then attaching the field names to the individual
terminal and nonterminal symbols with a <name> operator in a token-definition directive
(%term, %left, etc.) or a %type directive. The name can be any field name in the
%union. For example:

    %term <op_type>  DIVOP
    %type <var_name> e

attaches the op_type field to all DIVOPs and the var_name field to all e's. That is, if a
<name> is found in a %term, %token, %left, %right, or %nonassoc directive, the
indicated field name is automatically attached to the specified token. The angle brackets
are part of the directive. They must be present, and there may not be any white space
between them and the actual field name. The %type directive is used to attach a field
name to a nonterminal.
Listing E.9. union.y: Using %union: An Example
A real example of all the foregoing is in Listing E.9. The grammar recognizes a
single statement made up of two numbers separated by a DIVOP, which can be a / or %.
The $$ in the divop rule on line 25 references the op_type field of the union because
divop is attached to that field on line eight. Similarly, the $2 on line 16 references the
op_type field because $2 corresponds to a divop in this production, and divop was
attached to op_type on line eight. The $1 and $3 on lines 17, 19, 21, and 22 reference
the var_name field because they correspond to an e in the production and were attached
to e on line seven. Similarly, the $$ on line 22 references var_name because it
corresponds to a statement, which was also attached to var_name on line seven.
The field stays active from the point where it's declared to the end of the current
directive. For example, in

    %token ONE <field1> TWO THREE <field2> FOUR

<field1> is attached to TWO and THREE, and <field2> is attached to FOUR. ONE has
no field attached to it. If a $$ or $N corresponds to a symbol that has no explicit
%union field attached to it, a default int-sized field is used and a warning message is
printed.12 Also, note that the nonterminal name on the left-hand side of a production
must be found in a previous %type in order to use $$ in the corresponding right-hand
side. The %type directive can have more than one nonterminal symbol listed in it, but it
may have only one <name> and the name must be the first thing in the list. Similarly,
the <name> in a %left and so forth applies only to those terminals that follow it in the
list.13
The $<field>N syntax described earlier can also be used to access a field of a
%union. This syntax takes precedence over the default mechanism, so you can use it
when you want to override the default field for a particular symbol. It's also useful in
imbedded actions where there's no mechanism for supplying a field name to $$. For
example, in:

    %union { int integer; }
    %%
    x : a { $<integer>$ = 1; } b { foo( $<integer>2 ); }
      ;

the $<integer>$ is used to put 1 into the integer field of the synthesized attribute.
That number is later referenced using the $<integer>2 in the second action. Since the
field type is an integer, you could also use the default field as follows:

    x : a { $$ = 1; } b { foo( $2 ); }
      ;

but occs prints a warning message when no field is specified, as is the case here. (Yacc
doesn't print the warning.)

12. Yacc doesn't print the warning.
13. Yacc is very picky about type matching when you use this mechanism. Since this pickiness is now
supported by most compilers, I've not put any type checking into occs. It just stupidly makes the
substitutions discussed earlier.
Occs translates all %unions into a typedef and redefinition of YYSTYPE. The
foregoing example is output as follows:

    typedef union
    {
        int   yy_def;
        char  *var_name;
        int   op_type;
    }
    yystype;

    #define YYSTYPE yystype
The YYSTYPE definition is generated automatically. Both it and the yystype type name
will be useful later on. The yy_def field is used for the default situation, where the
symbol specified in a dollar attribute has not been attached to an explicit field. For
example, if the following production is found without a preceding %type <field> e t:

    t : e PLUS e { $$ = $1 + $3; } ;

the yy_def field will be used for both $$ (because there's no field attached to the t) and
for $1 and $3 (because there's no field attached to the e either).
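To make the default-field substitution concrete, here is roughly what the action in that production amounts to once every dollar attribute has fallen back on yy_def. It is a sketch only: reduce_t() and its parameter names are hypothetical, and the real generated code works through the value-stack pointer rather than through function arguments.

```c
#include <assert.h>

typedef union
{
    int   yy_def;        /* default field, used when no <name> is attached */
    char  *var_name;
    int   op_type;
}
yystype;

/* Model of  t : e PLUS e { $$ = $1 + $3; }  with no %type for t or e. */
yystype reduce_t(yystype dollar_1, yystype dollar_3)
{
    yystype val;                                       /* plays the role of $$ */

    val.yy_def = dollar_1.yy_def + dollar_3.yy_def;    /* $$ = $1 + $3 */
    return val;
}
```

Had e been attached to var_name with a %type, the same action would instead have tried to add two character pointers, which is one reason the warning is worth heeding.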
E.11.3 Printing the Value Stack
If you're using the default stack type of int, you don't have to do anything special to
print the value stack in the debugging environment because the default stack-printing
routine, yypstk(), will be called in from the library. The library version of this routine
assumes a stack of integers.
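A routine with that behavior might look like the following. This is my guess at the shape of such a function for an int stack, not a copy of the library source: the buffer size and the ANSI-style prototype are assumptions.

```c
#include <assert.h>
#include <stdio.h>

typedef int YYSTYPE;                   /* the default value-stack type */

/* Return a printable version of one value-stack element. The symbol
 * argument is ignored because every element is printed the same way. */
char *yypstk(YYSTYPE *value, char *symbol)
{
    static char buf[32];               /* static: must outlive the call */

    (void) symbol;
    assert(value != NULL);
    sprintf(buf, "%d", *value);
    return buf;
}
```

The buffer has to be static (or otherwise long-lived) because the debugger uses the returned string after the function has returned.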
If you've changed the stack type, either by redefining the YYSTYPE macro or by
using a %union, you'll need to provide a version of the yypstk() function for the
debugger to use. The function is passed a pointer to a value-stack element and a second
pointer to the equivalent symbol-stack element. It must return a pointer to a string that
represents the value-stack element.
If the value stack is a simple type, such as a stack of character pointers, then printing
the stack contents is easy:

    typedef char *stack_type;
    #define YYSTYPE stack_type
    %%
    /* productions go here */
    %%
    char *yypstk( value, symbol )
    YYSTYPE *value;              /* pointer to a character pointer */
    char    *symbol;
    {
        return *value;
    }
When the value stack is typed with a %union, things are a little more complicated.
Listing E.10 shows how the value stack for the earlier example would be printed. The
%union forces the value stack to be an array of yystypes, and a pointer to one element
of this array is passed to yypstk(). The second, symbol argument to yypstk() is
used to select the correct field of the union.
Listing E.10. yypstk2.c: Printing Attributes for a %union

     1  %union {
     2      int   op_type;
     3      char  *var_name;
     4  }
     5
     6  %term DIVOP                     /* / or % */
     7  %type <var_name> e statement
     8  %type <op_type>  divop
     9
    10  %%
    11  %%
    12  char *yypstk( value, symbol )
    13  yystype *value;
    14  char    *symbol;
    15  {
    16      static char buf[80];
    17
    18      if( !strcmp(symbol, "e") || !strcmp(symbol, "statement") )
    19          return value->var_name;
    20
    21      if( !strcmp(symbol, "divop") )
    22      {
    23          sprintf( buf, "%c", value->op_type );
    24          return buf;
    25      }
    26      else
    27          return "----";  /* other symbols don't have attributes in this */
    28                          /* application.                                */
    29  }
If you were using the default, int attribute field in the union, the return "----"
statement on line 27 would be replaced with the following:

    sprintf( buf, "%d", value->yy_def );
    return buf;
E.11.4 Grammatical Transformations
Imbedded actions (ones that are in the middle of a production) can be used in an occs
grammar, but you have to be careful with them. The problem is that actions can be
executed only during a reduction. Consequently, if you put one in the middle of a
production, occs has to shuffle the grammar by adding an ε production. For example:

    s : a { action(1); } b c { action(2); } ;

is modified by occs as follows:

    s    : a 0001 b c { action(2); } ;
    0001 : { action(1); } ;
Unfortunately, that extra ε production can introduce shift/reduce conflicts into the state
machine. It's best, if possible, to rewrite your grammar so that the imbedded production
isn't necessary:

    s      : prefix b c { action(2); } ;
    prefix : a          { action(1); } ;
Using the attribute mechanism is also a little tricky if you imbed an action in a
production. Even though imbedded actions are put in their own production, the $1, $2, and
so forth reference the parent production. That is, in:

    s : a b { $$ = $1 + $2 } c { $$ = $1 + $2 + $3 + $4 } ;

the $1 and $2 in both actions access the attributes associated with a and b. $$ in the left
action is accessed by $3 in the right action. (That is, this $$ is actually referencing the
0001 left-hand side inserted by occs.) The $$ in the right action is attached to s in the
normal way. $4 accesses c. Note that $3 and $4 are illegal in the left action because
they won't be on the parse stack when the action is performed. An error message is
generated in this case, but only if the reference is in the actual grammar section of the yacc
specification. Illegal stack references are silently accepted in the final code section of
the input file.
Occs supports the two yacc transformations. Brackets are used to designate
optional parts of a production. The following input:

    s : a [ b ] c ;

is translated internally to

    s   : a 001 c ;
    001 : b
        | /* epsilon */
        ;
Note that attributes in optional productions can be handled in unexpected ways (which
actually make sense if you consider the translation involved). For example, in:

    s : a [ b c { $$ = $1 + $2; } ] d { $$ = $1 + $2 + $3; } ;

The $$=$1+$2 in the optional production adds the attributes associated with b and c and
attaches the result to the entire optional production. The action on the right adds
together the attributes associated with a, the entire optional production b c, and d. That is,
$2 in the right production is picking up the $$ from the optional production. Note that
the $2 used in the right action is garbage if the ε production was taken in the optional
part of the production. As a consequence, optional productions are typically not used
when attributes need to be passed.
Optional productions nest. The following is legal:

    s : a [ b [c] [d [e]] ] f ;

though almost incomprehensible; I wouldn't recommend using it. The maximum
nesting level is 8. Note that optional subexpressions can introduce duplicate productions.
That is:

    s : b [c] d
      | e [c]
      ;

creates:

    s   : b 001 d
        | e 002
        ;
    001 : c | /* epsilon */ ;
    002 : c | /* epsilon */ ;
It's better to use the following in the original grammar:

    s     : b opt_c d
          | e opt_c
          ;
    opt_c : c | /* empty */ ;
Also note that

    s : [x] ;

is acceptable but needlessly introduces an extra production:

    s   : 001 ;
    001 : x | /* empty */ ;

It's better to use:

    s : x | /* empty */ ;
Adding a star after the right bracket causes the enclosed expression to repeat zero or
more times. A left-associative list is used by occs; a right-associative list is used by
LLama. The internal mappings for all kinds of brackets are shown in Table E.6.
You can't use this mechanism if you need to pass attributes from b back up to the parent,
primarily because you can't attach an action to the added ε production. The extra
ε production may also add shift/reduce or reduce/reduce conflicts to the grammar. Be careful.
Some examples: a comma-separated list that has at least one element is:

    s : a [ COMMA a ]* ;

A dot-separated list with either one or two elements is:

    s : a [ DOT a ] ;

One or more repetitions of b is:

    s : a b [b]* c ;
Table E.6. Occs Grammatical Transformations

    Input               Output
    s : a [b] c ;       s   : a 001 c ;
                        001 : b | /* epsilon */ ;

    s : a [b]* c ;      s   : a 001 c ;                    (occs version)
                        001 : 001 b | /* epsilon */ ;

    s : a [b]* c ;      s   : a 001 c ;                    (llama version)
                        001 : b 001 | /* epsilon */ ;
E.11.5 The yyout.sym File
You can see the transformations made to the grammar, both by adding imbedded
actions and by using the bracket notation for optional and repeating sub-productions, by
looking in the symbol-table file, yyout.sym, which will show the transformed grammar,
not the input grammar. The symbol-table file is generated if -D, -S, or -s is specified on
the command line. A yyout.sym for the grammar in Listing E.2 on page 841 is in Listing
E.11. (It was generated with occs -S expr.y.)
Listing E.11. yyout.sym: Occs Symbol-Table Output

    Symbol table

    NONTERMINAL SYMBOLS:

    e (257) <>
        FIRST : ID NUM LP
        5: e -> ID
        4: e -> NUM
        3: e -> LP e RP .................... PREC 3
        2: e -> e STAR e ................... PREC 2
        1: e -> e PLUS e ................... PREC 1

    s (256) (goal symbol) <>
        FIRST : ID NUM LP
        0: s -> e

    TERMINAL SYMBOLS:

    name    value  prec  assoc  field
    STAR    4      2     l      <>
    RP      6      3     l      <>
    PLUS    3      1     l      <>
    NUM     2      0     -      <>
    ID      1      0     -      <>
    LP      5      3     l      <>
Taking a few lines as a representative sample:

    e (257) <>
        FIRST : ID NUM LP
        5: e -> ID
        4: e -> NUM
        3: e -> LP e RP .................... PREC 3
the top line is the name and internal, tokenized value for the left-hand side. The <> on
the right contains the field name assigned with a %type directive. Since there was none
in this example, the field is empty. Had the following appeared in the input:

    %type <field> e

the <field> would be on the corresponding line in the symbol table. The next line is
the symbol's FIRST set (the list of terminal symbols that can appear at the far left of a
parse tree derived from the nonterminal). That is, it's the set of tokens which can
legitimately be next in the input when we're looking for a given symbol in the grammar.
Here, we're looking for an e, and an ID, NUM, or LP can all start an e. The FIRST sets
are output only if the symbol table is created with -S.
The next few lines are the right-hand sides for all productions that share the single
left-hand side. The number to the left of the colon is the production number. These are
assigned in the order that the productions were declared in the source file. The first
production is 0, the second is 1, and so forth. Note that the production that has the goal
symbol on its left-hand side is always Production 0. Production numbers are useful for
setting breakpoints in the interactive debugging environment. (You can set a breakpoint
on reduction by a specific production, provided that you know its number.) The PREC
field gives the relative precedence of the entire production. This level is used for
resolving shift/reduce conflicts, also discussed below. Note that the productions shown in this
table are the ones actually used by occs; any transformations caused by imbedded
actions or the [ ] operator are reflected in the table. Productions are sorted
alphabetically by left-hand side.
The second part of the table gives information about the terminal symbols. For
example:

    name    value  prec  assoc  field
    PLUS    3      1     l      <>
    NUM     2      0     -      <>

Here, 3 is the internal value used to represent the PLUS token. It is also the value
assigned to a PLUS in yyout.h. The prec field gives the relative precedence of this
symbol, as assigned with a %left, %right, or %nonassoc directive. The higher the
number, the higher the precedence. Similarly, the assoc is the associativity. It will
have one of the following values:

    l    left associative
    r    right associative
    n    nonassociative
    -    Associativity is not specified. The token was declared with a %term or
         %token rather than a %left, %right, or %nonassoc directive.

Finally, the field column lists any %union fields assigned with a <name> in the
%left, %right, and %nonassoc directives. A <> is printed if there is no such
assignment.
E.11.6 The yyout.doc File
The yyout.doc file holds a description of the LALR(1) state machine used by the
occs-generated parser. It is created if -v or -V is present on the command line. The
yyout.doc file for the grammar in Listing E.2 on page 841 is shown in Table E.7. Figure
E.6 shows the state machine in graphic form.
This machine has eleven states, each with a unique number (running from 0 to 10). The
top few lines in each state represent the LALR(1) kernel items. You can use them to see
the condition of the parse stack when the current state is reached. For example, the header
from State 9 looks like this:
    State 9:
        e -> e .PLUS e
        e -> e PLUS e .    [$ PLUS STAR RP]
        e -> e .STAR e
The dot is used to mark the current top of stack, so everything to the left of the dot will
be on the stack. The top-of-stack item in State 9 is an e because there's an e immediately
to the left of the dot. The middle line is telling us that there may also be a PLUS and
another e under the e at top of stack.
If the dot is at the far right (as it is on the middle line), then a handle is on the stack
and the parser will want to reduce. The symbols in brackets to the right of the
production are a list of those symbols that (at least potentially) cause a reduction if they're the
next input symbol and we are in State 9. (This list is the LALR(1) lookahead set for the
indicated LR item, as discussed in Chapter Five.) The lookahead set is printed for every
kernel item (as compared to only those items with the dot at the far right) if -V is used
rather than -v. Note that these are just potential reductions; we won't necessarily do the
reduction on every symbol in the list if there's a conflict between the reduction and a
potential shift. A $ is used in the lookahead list to represent the end-of-input marker.
The next lines show the possible transitions that can be made from the current state.
There are four possibilities, which will look something like the following:

    Reduce by 3 on PLUS

says that a reduction by Production 3 occurs if the next input symbol is a PLUS.

    Shift to 7 on STAR

says that a 7 is shifted onto the stack (and the input advanced) if the next input symbol is
a STAR.

    Goto 4 on e

takes care of the push part of a reduction. For example, starting in State 0, a NUM in the
input causes a shift to State 2, so a 2 is pushed onto the stack. In State 2, a PLUS in the
input causes a reduce by Production 4 (e→NUM), which does two things. First, the 2 is
popped, returning us to State 0. Next, the parser looks for a goto transition (in State 0)
that is associated with the left-hand side of the production by which we just reduced. In
this case, the left-hand side is an e, and the parser finds a Goto 4 on e in State 0, so a
4 is pushed onto the stack as the push part of the reduction. The final possibility,

    Accept on end of input

says that if the end-of-input marker is found in this state, the parse is successfully
terminated.
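The four kinds of transitions can be seen working together in a small driver that hard-codes the shift, reduce, goto, and accept entries listed in Table E.7. This is a sketch for following the state machine by hand, not the parser that occs generates: the token codes, the S() and R() encodings, and the function names are all assumptions, and no attributes are processed.

```c
#include <assert.h>

enum { ID, NUM, LP, RP, PLUS, STAR, EOI };  /* hypothetical token codes      */
enum { ERR = 0, ACC = 1000 };               /* error and accept action codes */
#define S(n) ((n) + 1)                      /* shift to state n (positive)   */
#define R(p) (-(p) - 1)                     /* reduce by production p (< 0)  */

/* Right-hand-side lengths for Productions 0 to 5. */
static const int Rhs_len[] = { 1, 3, 3, 3, 1, 1 };

/* True for the symbols in the lookahead set [$ PLUS STAR RP]. */
static int in_la(int t) { return t == EOI || t == PLUS || t == STAR || t == RP; }

static int action(int state, int t)
{
    switch (state) {
    case 0: case 3: case 6: case 7:
        return t == ID ? S(1) : t == NUM ? S(2) : t == LP ? S(3) : ERR;
    case 1:  return in_la(t) ? R(5) : ERR;
    case 2:  return in_la(t) ? R(4) : ERR;
    case 4:  return t == EOI ? ACC : t == PLUS ? S(6) : t == STAR ? S(7) : ERR;
    case 5:  return t == PLUS ? S(6) : t == STAR ? S(7) : t == RP ? S(8) : ERR;
    case 8:  return in_la(t) ? R(3) : ERR;
    case 9:  return t == STAR ? S(7) : in_la(t) ? R(1) : ERR;
    case 10: return in_la(t) ? R(2) : ERR;
    }
    return ERR;
}

/* Goto transitions on e (the only nonterminal that is ever pushed). */
static int go_to(int state)
{
    switch (state) {
    case 0: return 4;
    case 3: return 5;
    case 6: return 9;
    case 7: return 10;
    }
    return -1;
}

/* Parse an EOI-terminated token array. Production numbers are recorded in
 * reductions[] (up to max_red of them). Returns the number of reductions
 * performed, or -1 on a syntax error. */
int parse(const int *toks, int *reductions, int max_red)
{
    int stack[64], sp = 0, n = 0;

    assert(max_red >= 0);
    stack[0] = 0;
    for (;;) {
        int act = action(stack[sp], *toks);

        if (act == ACC) return n;
        if (act == ERR) return -1;
        if (act > 0) {                         /* shift: push state, advance */
            if (sp + 1 >= 64) return -1;
            stack[++sp] = act - 1;
            ++toks;
        } else {                               /* reduce: pop, then goto     */
            int prod = -act - 1, next;

            sp  -= Rhs_len[prod];
            next = go_to(stack[sp]);
            if (next < 0) return -1;
            stack[++sp] = next;
            if (n < max_red) reductions[n] = prod;
            ++n;
        }
    }
}
```

Feeding it NUM PLUS NUM STAR NUM shows the conflict resolution in State 9 at work: the STAR is shifted rather than reducing by Production 1, so the multiplication is reduced first.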
Table E.7. yyout.doc (Generated from Listing E.2)

    State 0:
        s -> . e

        Shift to 1 on ID
        Shift to 2 on NUM
        Shift to 3 on LP
        Goto 4 on e

    State 1:
        e -> ID .    [$ PLUS STAR RP ]

        Reduce by 5 on End of Input
        Reduce by 5 on PLUS
        Reduce by 5 on STAR
        Reduce by 5 on RP

    State 2:
        e -> NUM .    [$ PLUS STAR RP ]

        Reduce by 4 on End of Input
        Reduce by 4 on PLUS
        Reduce by 4 on STAR
        Reduce by 4 on RP

    State 3:
        e -> LP . e RP

        Shift to 1 on ID
        Shift to 2 on NUM
        Shift to 3 on LP
        Goto 5 on e

    State 4:
        s -> e .    [$ ]
        e -> e .PLUS e
        e -> e .STAR e

        Accept on end of input
        Shift to 6 on PLUS
        Shift to 7 on STAR

    State 5:
        e -> e .PLUS e
        e -> e .STAR e
        e -> LP e .RP

        Shift to 6 on PLUS
        Shift to 7 on STAR
        Shift to 8 on RP

    State 6:
        e -> e PLUS . e

        Shift to 1 on ID
        Shift to 2 on NUM
        Shift to 3 on LP
        Goto 9 on e

    State 7:
        e -> e STAR . e

        Shift to 1 on ID
        Shift to 2 on NUM
        Shift to 3 on LP
        Goto 10 on e

    State 8:
        e -> LP e RP .    [$ PLUS STAR RP ]

        Reduce by 3 on End of Input
        Reduce by 3 on PLUS
        Reduce by 3 on STAR
        Reduce by 3 on RP

    State 9:
        e -> e .PLUS e
        e -> e PLUS e .    [$ PLUS STAR RP]
        e -> e .STAR e

        Reduce by 1 on End of Input
        Reduce by 1 on PLUS
        Shift to 7 on STAR
        Reduce by 1 on RP

    State 10:
        e -> e .PLUS e
        e -> e .STAR e
        e -> e STAR e .    [$ PLUS STAR RP]

        Reduce by 2 on End of Input
        Reduce by 2 on PLUS
        Reduce by 2 on STAR
        Reduce by 2 on RP

    6/254 terminals
    2/256 nonterminals
    6/512 productions
    11 states
Figure E.6. State Machine Represented by the yyout.doc File in Listing E.11
E.11.7 Shift/Reduce and Reduce/Reduce Conflicts
One of the main uses of the yyout.doc file is to see how shift/reduce and
reduce/reduce conflicts are solved by occs. You should never let a WARNING about an
inadequate state go by without looking in yyout.doc to see what's really going on.
Occs uses the disambiguating rules discussed in Chapter Five to resolve conflicts.
Reduce/reduce conflicts are always resolved in favor of the production that occurred ear
lier in the grammar. Shift/reduce conflicts are resolved as follows:
(1) Precedence and associativity information is assigned to all terminal symbols using
    %left, %right, and %nonassoc directives in the definitions part of the input file.
    The directives might look like this:

        %left  PLUS MINUS
        %left  TIMES DIVIDE
        %right ASSIGN

    The higher a directive is in the list, the lower the precedence. If no precedence or
    associativity is assigned, a terminal symbol will have a precedence level of zero
    (very low) and be nonassociative.
    Productions are assigned the same precedence level as the rightmost terminal
    symbol in the production. You can override this default with a %prec TOKEN
    directive to the right of the production. (It must be between the rightmost symbol in
    the production and the semicolon or vertical bar that terminates the production.)
    TOKEN is a terminal symbol that was declared with a previous %left, %right, or
    %nonassoc, and the production is assigned the same precedence level as the
    indicated token. Occs, but not yacc, also allows statements of the form

        %prec number

    where the number is the desired precedence level (the higher the number, the
    higher the precedence). The number should be greater than zero.
(2) When a shift/reduce conflict is encountered, the precedence of the terminal symbol
    to be shifted is compared with the precedence of the production by which you want
    to reduce. If the terminal or production is of precedence zero, then resolve in favor
    of the shift.
(3) Otherwise, if the precedences are equal, resolve using the following table:

        associativity of
        lookahead symbol      resolve in favor of
        left                  reduce
        right                 shift
        nonassociative        shift

(4) Otherwise, if the precedences are not equal, use the following table:

        precedence                        resolve in favor of
        lookahead symbol < production     reduce
        lookahead symbol > production     shift
The %prec directive can be used both to assign a precedence level to productions
that don't contain any terminals, and to modify the precedences of productions in which
the rightmost terminal isn't what we want. A good example is the unary minus
operator, used in the following grammar:

    %term     NUM
    %left     MINUS PLUS
    %left     TIMES
    %nonassoc VERY_HIGH
    %%
    s : e ;
    e : e PLUS e
      | e MINUS e
      | e TIMES e
      | MINUS e     %prec VERY_HIGH
      | NUM
      ;
    %%

Here, the %prec is used to force unary minus to be higher precedence than both binary
minus and multiplication. VERY_HIGH is declared only to get another precedence
level for this purpose. Occs also lets you assign a precedence level directly. For
example,

    | MINUS e     %prec 4

could have been used in the previous example.14 C's sizeof operator provides another
example of how to use %prec. The precedence of sizeof must be defined as follows:

    expression : SIZEOF LP type_name RP     %prec SIZEOF
               ;

in order to avoid incorrectly assigning a sizeof the same precedence level as a right
parenthesis.
The precedence level and %prec operator can also be used to resolve shift/reduce
conflicts in a grammar. The first technique puts tokens other than operators into the
precedence table. Consider the following state (taken from yyout.doc).

    WARNING: State 5: shift/reduce conflict ELSE/40 (choose shift)
    State 5:
        stmt -> IF LP expr RP stmt .
              | IF LP expr RP stmt . ELSE stmt

The default resolution (in favor of the shift) is correct, but it's a good idea to eliminate
the warning message (because you don't want to clutter up the screen with harmless
warning messages that will obscure the presence of real ones). You can resolve the
shift/reduce conflict as follows. The conflict exists because ELSE, not being an
operator, is probably declared with a %term rather than a %left or %right. Consequently, it
has no precedence level. The precedence of the first production is taken from the RP
(the rightmost terminal in the production), so to resolve in favor of the shift, all you need
do is assign a precedence level to ELSE, making it higher than RP. Do it like this:

    %left     LP RP     /* existing precedence of LP                  */
    %nonassoc ELSE      /* higher precedence because it follows LP    */

Though you don't want to do it in the current situation, you could resolve in favor of the
reduce by reversing the two precedence levels. (Make ELSE lower precedence than
LP.)
14. Yacc doesn't permit this. You have to use a bogus token.

The second common situation is illustrated by the following simplification of the C
grammar used in Chapter Six. (I've left out some of the operators, but you get the idea.)

    function_argument
        : expr
        | function_argument COMMA expr    /* comma separates arguments */
        ;
    expr
        : expr STAR expr
        | expr COMMA expr                 /* comma operator */
        | expr DIVOP expr
        ...
        | term
        ;

A shift/reduce conflict is created here because of the COMMA operator (the parser
doesn't know if a comma is an operator or a list-element separator), and this conflict is
displayed in yyout.doc as follows:

    WARNING: State 170: shift/reduce conflict COMMA/102 (choose shift)
    State 170:
        function_argument -> expr .    (prod. 102, prec. 0)
            [COMMA RP ]
        expr -> expr . STAR expr
        expr -> expr . COMMA expr
        expr -> expr . DIVOP expr

The problem can be solved by assigning a precedence level to production 102:
function_argument -> expr. (It doesn't have one because there are no terminal symbols in
it.) You can resolve in favor of the reduce (the correct decision, here) by giving the
production a precedence level greater than or equal to that of the comma. Do this in the
input file as follows:

    expr_list
        : expr                     %prec COMMA
        | expr_list COMMA expr
        ;

Similarly, you could resolve in favor of the shift by making the production lower
precedence than the COMMA (by replacing the COMMA in the %prec with the name of a
lower-precedence operator). Since the comma is the lowest-precedence operator in C,
you'd have to do it here by creating a bogus token that has an even lower precedence,
like this:

    %nonassoc VERY_LOW     /* bogus token (not used in the grammar) */
    %left     COMMA

    expr_list
        : expr                     %prec VERY_LOW
        | expr_list COMMA expr
        ;

Shift/reduce and reduce/reduce conflicts are often caused by the implicit ε
productions that are created by actions imbedded in the middle of a production (rather than at
the far right), and the previous techniques cannot be used to resolve these conflicts
because there is no explicit production to which a precedence level can be assigned. For
this reason, it's best to use explicit ε productions rather than imbedded actions. Translate:

    x : a { action(); } b ;

to this:

    x      : a action b ;
    action : /* empty */ { action(); } ;

or to this:

    x  : a' b ;
    a' : a { action(); } ;

These translations probably won't eliminate the conflict, but you can now use %prec to
resolve the conflict explicitly.
It's not always possible to do a translation like the foregoing, because the action may
have to access the attributes of symbols to its left in the parent production. You can
sometimes eliminate the conflict just by changing the position of the action in the
production, however. For example, actions that follow tokens are less likely to introduce
conflicts than actions that precede them. Taking an example from the C grammar used in
Chapter Six, the following production generated 40-odd conflicts:

    and_list
        : and_list { and($1); } ANDAND binary { and($4); $$=NULL; }
        | binary
        ;

but this variant generated no conflicts and does the same thing:

    and_list
        : and_list ANDAND { and($1); } binary { and($4); $$=NULL; }
        | binary
        ;

E.11.8 Error Recovery
One of the ways that occs differs considerably from yacc is the error-recovery
mechanism.15 Occs parsers do error recovery automatically, without you having to do
anything special. The panic-mode recovery technique that was discussed in Chapter Five is
used. It works as follows:
(0) An error is triggered when an error transition is read from the parse table entry for
    the current input and top-of-stack symbols (that is, when there's no legal outgoing
    transition from a state on the current input symbol).
(1) Discard the state at the top of stack.
(2) Look in the parse table and see if there's a legal transition on the current input
    symbol and the uncovered stack item. If so, we've recovered and the parse is allowed
    to progress, using the modified stack. If there's no legal transition, and the stack is
    not empty, go to (1).
(3) If all items on the stack are discarded, restore the stack to its original condition,
    discard an input symbol, and go to (1).
The algorithm continues either until it can start parsing again or the entire input file
is absorbed. In order to avoid cascading error messages, messages are suppressed if a
second error happens right on the tail of the first one. To be more exact, no messages are
printed if an error happens within five parse cycles (five shift or reduce operations) of the
previous error. The number of parse cycles can be changed with a

    %{
    #define YYCASCADE desired_value
    %}

in the specifications section. Note that errors that are ignored because they happen too
soon aren't counted against the total defined by YYMAXERR.
15. Yacc uses a special error token, which the parser shifts onto the stack when an error occurs. You must
provide special error-recovery productions that have error tokens on the right-hand sides. The
mechanism is notoriously inadequate, but it's the only one available if you're using yacc. See [Schreiner]
pp. 65-82 for more information.

E.11.9 Putting the Parser and Actions in Different Files
Unfortunately, occs can take a long time to generate the parse tables required for a
largish grammar. (Though it usually takes less time for occs to generate yyout.c than it
does for Microsoft C to compile it.) To make program development a little easier, a
mechanism is provided to separate the table-making functions from the code-generation
functions. The -p command-line switch causes occs to output the tables and parser only
(in a file called yyout.c). Actions that are part of a rule are not output, and the third part
of the occs input file is ignored. When -a is specified, only the actions are processed, and
tables are not generated. A file called yyact.c is created in this case.
Once the grammar is stable you can run occs once with -p to create the tables.
Thereafter, you can run the same file through occs with -a to get the actions. You now
have two files, yyout.c and yyact.c. Compile these separately, and then link them
together. If you change the actions (but not the grammar), you can recreate yyact.c using
occs -a without having to remake the tables. Remember that actions that are imbedded
in the middle of a production will effectively modify the grammar. If you modify the
position of an action in the grammar, you'll have to remake the tables (but not if you
modify only the action code). On the other hand, actions added to or removed from the
far right of a production will not affect the tables at all, so they can be modified, removed,
or added without needing to remake the tables.
The first, definitions part of the occs input file is always output, regardless of the
presence of -a or -p. The special macro YYPARSER is generated if a parser is present in
the current file; YYACTION is generated if the actions are present. (Both are defined
when neither switch is specified.) You can use these in conjunction with #ifdefs in the
definitions section to control the declaration of variables, and so forth (to avoid duplicate
declarations). It's particularly important to define YYSTYPE, or put a %union, in both
files (if you're not using the default int type, that is); otherwise, attributes won't be
accessed correctly. Also note that three global variables whose scope is normally
limited to yyout.c (Yy_val, Yy_vsp, and Yy_rhslen) are made public if either switch
is present. They hold the value of $$, the value-stack pointer, and the right-hand side
length.
Listing E.12 shows the definitions section of an input file that's designed to be split
up in this way.
Listing E.12. Definitions Section for -a/-p

    %{
    #ifdef YYPARSER         /* If parser is present, declare variables.          */
    #   define CLASS
    #   define _(x) x
    #else                   /* If parser is not present, make them externs.      */
    #   define CLASS extern
    #   define _(x)
    #endif

    CLASS int x _( = 5 );   /* Evaluates to "int x = 5;" in yyout.c and to       */
                            /* "extern int x;" in yyact.c.                       */
    %}

    %union {                /* Either a %union or a redefinition of YYSTYPE      */
                            /* should go here.                                   */
    }

    %term ...               /* Token definitions go here.                        */
    %left ...
    %%

E.11.10 Shifting a Token's Attributes
It is sometimes necessary to attach an attribute to a token that has been shifted onto
the value stack. For example, when the lexical analyzer reads a NUMBER token, it
would be nice to shift the numeric value of that number onto the value stack. One way
to do this is demonstrated by the small desk-calculator program shown in the occs
input file in Listing E.13. The LEX lexical analyzer is in Listing E.14. This program
sums together a series of numbers separated by plus signs and prints the result. The
input numbers are converted from strings to binary by the atoi() call on line 12 and
the numeric value is also pushed onto the stack here. The numeric values are summed on
line eight, and the result is printed on line four. The difficulty, here, is that you need to
introduce an extra production on line 12 so that you can shift the value associated with
the input number, making the parser both larger and slower as a consequence.
Listing E.13. Putting Token Attributes onto the Value Stack Using A Reduction

     1  %term NUMBER      /* a collection of one or more digits */
     2  %term PLUS        /* a + sign                            */
     3  %%
     4  statement : expr statement      { printf("The sum is %d\n", $1); }
     5            | /* empty */
     6            ;
     7
     8  expr      : expr PLUS number    { $$ = $1 + $3; }
     9            | /* empty */
    10            ;
    11
    12  number    : NUMBER              { $$ = atoi(yytext); }
    13            ;
    14  %%

Listing E.14. Lexical Analyzer for Listing E.13
An alternate solution to the problem is possible by making the lexical analyzer and
parser work together in a more integrated fashion. The default action in a shift (as
controlled by the YYSHIFTACT macro described earlier) is to push the contents of a variable
called yylval onto the value stack. The lexical analyzer can use this variable to cause
a token's attribute to be pushed as part of the shift action (provided that you haven't
modified YYSHIFTACT anywhere).16 The procedure is demonstrated in Listings E.15 and
E.16, which show an alternate version of the desk calculator. Here, the lexical analyzer
assigns the numeric value (the attribute of the NUMBER token) to yylval before
returning the token. That value is shifted onto the value stack when the token is shifted,
and is available to the parser on line eight.

16. Note that yacc (but not occs) also uses yylval to hold $$, so it should never be modified in a yacc
application, except as described below (or you'll mess up the value stack).
Listing E.15. Passing Attributes from the Lexical Analyzer to the Parser

     1  %term NUMBER      /* a collection of one or more digits */
     2  %term PLUS        /* a + sign                            */
     3  %%
     4  statement : expr statement      { printf("The sum is %d\n", $1); }
     5            | /* empty */
     6            ;
     7
     8  expr      : expr PLUS NUMBER    { $$ = $1 + $3; }
     9            | /* empty */
    10            ;
    11  %%

Listing E.16. Lexical Analyzer for Listing E.15

    %{
    #include "yyout.h"      /* token definitions                        */
    extern int yylval;      /* declared by parser in occs output file   */
    %}
    %%
    [0-9]+      yylval = atoi(yytext);  /* numeric value is synthesized in shift */
                return NUMBER;
    \+          return PLUS;
    .           ;                       /* ignore everything else */
    %%

yylval is of type YYSTYPE (int by default), but you can change YYSTYPE
explicitly by redefining it or implicitly by using the %union mechanism described earlier. If
you do change the type, be careful to also change any matching extern statement in the
LEX input file as well.
Note that this mechanism is risky if the lexical analyzer is depending on the parser to
have taken some action before a symbol is read. The problem is that the lookahead
symbol can be read well in advance of the time that it's shifted; several reductions can
occur after reading the symbol but before shifting it. The problem is best illustrated with
an example. Say that a grammar for C has two tokens, a NAME token that's used for
identifiers, and a TYPE token that is used for types. A typedef statement can cause a
string that was formerly treated as a NAME to be treated as a TYPE. That is, the
typedef effectively adds a new keyword to the language, because subsequent
references to the associated identifier must be treated as if they were TYPE tokens.
It is tempting to try to use the symbol table to resolve the problem. When a
typedef is encountered, the parser creates a symbol-table entry for the new type. The
lexical analyzer could then use the symbol table to distinguish NAMEs from TYPEs.
The difficulty here lies in input such as the following:

    typedef int itype;
    itype x;

In a bottom-up parser such as the current one, the symbol-table entry for the typedef
will, typically, not be created until the entire declaration is encountered, and the parser
can't know that it's at the end of the declaration until it finds the trailing semicolon. So,
the semicolon is read and shifted on the stack, and the next lookahead (the second
itype) is read, before doing the reduction that puts the first itype into the symbol
table. The reduction that adds the table entry happens after the second itype is read
because the lookahead character must be read to decide to do the reduction. When the
scanner looks up the second itype, it won't be there yet, so it will assume incorrectly
that the second itype is a NAME token. The moral is that attributes are best put onto
the stack during a reduction, rather than trying to put them onto the stack from the
lexical analyzer.
E.11.11 Sample Occs Input File
For convenience, the occs input file for the expression compiler we've been
discussing is shown in Listing E.17.
Listing E.17. expr.y: An Expression Compiler (Occs Version)

    %term ID            /* a string of lower-case characters  */
    %term NUM           /* a number                           */

    %left PLUS          /* +   */
    %left STAR          /* *   */
    %left LP RP         /* ( ) */

    %{
    #include <stdio.h>
    #include <ctype.h>
    #include <malloc.h>

    extern char *yytext;        /* In yylex(), holds lexeme          */
    extern char *new_name();    /* declared at bottom of this file   */

    typedef char *stype;        /* Value stack */
    #define YYSTYPE stype

    #define YYMAXDEPTH 64
    #define YYMAXERR   10
    #define YYVERBOSE
    %}
    %%
    /* A small expression grammar that recognizes numbers, names, addition (+),
     * multiplication (*), and parentheses. Expressions associate left to right
     * unless parentheses force it to go otherwise. * is higher precedence than +.
     * Note that an underscore is appended to identifiers so that they won't be
     * confused with rvalues.
     */

    s : e ;

    e : e PLUS e    { yycode("%s += %s\n", $1, $3); free_name( $3 ); }
      | e STAR e    { yycode("%s *= %s\n", $1, $3); free_name( $3 ); }
      | LP e RP     { $$ = $2; }
      | NUM         { yycode("%s = %s\n",  $$ = new_name(), yytext ); }
      | ID          { yycode("%s = _%s\n", $$ = new_name(), yytext ); }
      ;
    %%
    /*----------------------------------------------------------------------*/
    char *yypstk( ptr )
    char **ptr;
    {
        /* Yypstk is used by the debugging routines. It is passed a pointer to a
         * value-stack item and should return a string representing that item. In
         * this case, all it has to do is dereference one level of indirection.
         */

        return *ptr ? *ptr : "<empty>" ;
    }
    /*----------------------------------------------------------------------*/
    char *Names[] = { "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7" };
    char **Namep  = Names;

    char *new_name()
    {
        /* Return a temporary-variable name by popping one off the name stack. */

        if( Namep >= &Names[ sizeof(Names)/sizeof(*Names) ] )
        {
            yyerror("Expression too complex\n");
            exit( 1 );
        }
        return( *Namep++ );
    }

    free_name( s )
    char *s;
    {
        /* Free up a previously allocated name */

        *--Namep = s;
    }
    /*----------------------------------------------------------------------*/
    yy_init_occs()
    {
        /* Generate declarations for the rvalues */

        yycode("public word t0, t1, t2, t3;\n");
        yycode("public word t4, t5, t6, t7;\n");
    }

    main( argc, argv )
    char **argv;
    {
        /* Open the input file, using yy_get_args() if we're debugging or
         * ii_newfile() if not.
         */

    #ifdef YYDEBUG
        yy_get_args( argc, argv );
    #else
        if( argc < 2 )
        {
            fprintf( stderr, "Need file name\n" );
            exit( 1 );
        }
        else if( ii_newfile(argv[1]) < 0 )
        {
            fprintf( stderr, "Can't open %s\n", argv[1] );
            exit( 2 );
        }
    #endif
        yyparse();
        exit( 0 );
    }

Listing E.18. expr.lex: LEX Input File for Expression Compiler

    %{
    #include "yyout.h"
    %}
    %%
    "+"         return PLUS;
    "*"         return STAR;
    "("         return LP;
    ")"         return RP;
    [0-9]+      return NUM;
    [a-z]+      return ID;
    %%

E.11.12 Hints and Warnings
• Though input defaults to standard input in production mode (as compared to debug
  mode), the input routines really expect to be working with a file. The default,
  standard-input mode is intended for use in pipes, not for interactive input with a
  human being. This expectation can produce unexpected consequences in an
  interactive situation. Since the input is always one token ahead of the parser, the parser's
  actions can appear to be delayed by one token. This delay is most noticeable at end
  of file, because the last token in the input stream isn't processed until an explicit
  end-of-file marker is read. If you're typing the input at the keyboard, you'll have to
  supply this marker yourself. (Use Ctrl-D with UNIX; the two-character sequence
  Ctrl-Z Enter under MS-DOS.)
• Though you can redefine YYPRIVATE to make various global static variables
  public for the purpose of debugging, you should never modify any of these global
  variables directly.
• Once the grammar is working properly, make changes very carefully. Very small
  changes in a grammar can introduce masses of shift/reduce and reduce/reduce
  conflicts. You should always change only one production at a time and then remake
  the tables. Always back up the current input file before modifying it so that you can
  return to a working grammar if you mess up (or use a version-control system like
  SCCS).
• Avoid ε productions and imbedded actions; they tend to introduce shift/reduce
  conflicts into a grammar. If you must introduce an imbedded production, try to put it
  immediately to the right of a terminal symbol.
• Avoid global variables in the code-generation actions. The attribute mechanism
  should be used to pass all information between productions if at all possible
  (sometimes it's not). Grammars are almost always recursive. Consequently, you'll
  find that global variables tend to be modified at unexpected times, often destroying
  information that you need for some subsequent action. Avoiding global variables
  can seem difficult. You have to work harder to figure out how to do things; it's like
  writing a program that uses subroutine return values only, without global variables or
  subroutine arguments. Nonetheless, it's worth the effort in terms of reduced
  development time. All of the global-variable-use issues that apply to recursive subroutines
  apply to the action code in a production.
• The assignment in the default action ($$=$1) is made before the action code is
  executed. Modifying $1 inside one of your own actions will have no effect on the value
  of $$. You should modify $$ itself.
• If you are using the -a and -p switches to split the parser into two files, remember that
  actions imbedded in a production actually modify the grammar. If you add or move
  such an action, you must remake the tables. You can add or remove actions at the far
  right of a production without affecting the tables, however.
Occs is not yacc, and as a consequence, many hacks found in various books that discuss
the UNIX utilities must be avoided when using occs:
• Because of the way that the input is processed, it's not safe to modify the lexeme
  from the parser or to do any direct input from the parser. All tokens should be
  returned from LEX in an orderly fashion. You must use yytext, yylineno, and
  yyleng to examine the lexeme. It's risky for code in the parser to modify these
  variables or to call any of the ii_ input routines used by LEX. The problem here is
  that occs and LLama both read one token ahead; a second, lookahead token will
  already have been read before any action code is processed. The token in yytext is
  the current token, not the lookahead token.
• By the same token (so to speak) you should never modify the occs or LLama value
  stack directly. Always use the dollar-attribute mechanism ($$, $1, and so on) to do
  so. The contents of yylval, which has the same type as a stack element, is
  shifted onto the value stack when a state representing a token is shifted onto the state
  stack, and you can use this variable to shift an attribute for a token. (Just assign a
  value to yylval before returning the token.) It is not possible for code in a
  LEX-generated lexical analyzer to modify the value stack directly, as is done in some
  published examples of how to use the UNIX utilities. Use yylval.
• The occs error-recovery mechanism is completely automatic. Neither the yacc
  error token, nor the yyerrok action is supported by occs. The error token can
  be removed from all yacc grammars. Similarly, all yyerrok actions can be deleted.
  If a yacc production contains nothing but an error token and optional action on its
  right-hand side, the entire production should be removed (don't just delete the
  right-hand side, because you'll introduce a hitherto nonexistent production into the
  grammar).
• The occs start production may have only one right-hand side. If a yacc grammar
  starts like this:

      baggins : frodo
              | bilbo
              ;

  add an extra production at the very top of the occs grammar (just after the %%):

      start : baggins ;

E.12 LLama
The remainder of this appendix describes the LLama-specific parts of the compiler
compiler. The main restriction in using LLama is that the input grammar must be LL(1).
LLama grammars are, as a consequence, harder to write than occs grammars. On the
other hand, a LLama-generated parser will be both smaller and faster than an occs
parser for an equivalent grammar.
E.12.1 Percent Directives and Error Recovery
The % directives supported by LLama are summarized in Table E.8. LLama
supports one % directive over and above the standard directives. The %synch is placed in
the definitions section of the input file; use it to specify a set of synchronization tokens
for error recovery. A syntax error is detected when the top-of-stack symbol is a terminal
symbol, and that symbol is not also the current lookahead symbol. The error-recovery
code does two things: it pops items off the stack until the top-of-stack symbol is a token
in the synchronization set, and it reads tokens from input until it finds the same token
that it just found on the stack, at which point it has recovered from the error. If the
parser can't find the desired token, or if no token in the synchronization set is also on the
stack, then the error is unrecoverable and the parse is terminated with an error flag.
yyparse() usually returns 1 in this case, but this action can be changed by redefining
YYABORT. (See Table E.2 on page 840.)
Table E.8. LLama % Directives and Comments

    Directive   Description
    %%          Delimits the three sections of the input file.
    %{          Starts a code block. All lines that follow, up to a %}, are written to
                the output file unchanged.
    %}          Ends a code block.
    %token      Defines a token.
    %term       A synonym for %token.
    /* */       C-like comments are recognized, and ignored, by occs, even if
                they're outside of a %{ %} delimited code block.
    %synch      Define set of synchronization tokens.

Several tokens can be listed in the %synch directive. Good choices for
synchronization symbols are tokens that end something. In C, for example, semicolons, close
parentheses, and close braces are reasonable selections. You'd do this with:

    %term  SEMI CLOSE_PAREN CLOSE_CURLY

    %synch SEMI CLOSE_PAREN CLOSE_CURLY

E.12.2 Top-Down Attributes
The LLama value stack and the $ attribute mechanism are considerably different
from the one used by occs. LLama uses the top-down attribute-processing described in
Chapter Four. Attributes are referenced from within an action using the notation $$, $1,
$2, and so forth. $$ is used in an action to reference the attribute that was attached to the
nonterminal on the left-hand side before it was replaced. The numbers can be used to
reference attributes attached to symbols to the right of the current action in the grammar.
The number indicates the distance from the action to the desired symbol. ($0 references
The %synch directive,
synchronization tokens.
Changing the action tak
en by the parser on an
error.
LLama attributes, $$,
$i , etc.
Typing LLamas value
stack.
Initializing the value stack
with y y i n i t l l a ma ( ) .
s t mt : { $l =$2 =ne w_ na me ( ) ; } e x pr { f r e e _ n a me ( $ 0 ) ; } SEMI s t mt
the $1 in the left action modifies the attribute attached to expr, the $2 references the
attribute attached to the second action, which uses $0 to get that attribute. $ $ references
the attribute attached to the left-hand side in the normal way. Attributes flow across the
grammar from left to right.
E.12.3 The LLama Value Stack
The LLama value stack is a stack of structures, as was described in Chapter Four.
The structure has two fields, defined in the LLama-generated parser as follows:

    typedef struct          /* Typedef for value-stack elements.            */
    {
        YYSTYPE left;       /* Holds value of left-hand side attribute.     */
        YYSTYPE right;      /* Holds value of current-symbol's attribute.   */
    }
    yyvstype;

The YYSTYPE type is defined as an int by default, but you can redefine it in the
definitions section of the input file as follows:

    %{
    typedef char *stack_type;   /* Use stack of character pointers. */
    #define YYSTYPE stack_type
    %}
    %%

In this case, the dollar attributes will be of the same type as YYSTYPE. That is, $$, $1,
and so forth, reference the character pointer. You can use *$1 or $1[2] to access
individual characters in the string. If the stack is a stack of structures or unions, as in:

    %{
    typedef union
    {
        int  integer;
        char *string;
    }
    stack_type;                 /* Use stack of unions. */
    %}
    %%

you can access individual fields like this: $$.integer or $1.string.
The initialization subroutine for the LLama-generated parser, yy_init_llama(),
is called after the stack is initialized but before the first lexeme is input. At that point the
start symbol will have been pushed onto the parse stack, and garbage onto the value stack. The
initialization routine is passed a pointer to the garbage entry for the pushed start symbol.
The default routine in l.lib does nothing, but you can use your own
yy_init_llama(p) to provide an attribute for the goal symbol so that subsequent
replacements won't inherit garbage. Your replacement should be put at the bottom of the
LLama input file and should look something like this:

    yy_init_llama( p )
    yyvstype *p;
    {
        p->left = p->right = initial attribute value for goal symbol ;
    }

The yyvstype type is the two-part structure used for the value-stack items described earlier.
The yypstk() routine used to print value-stack items in the debugging environment
is passed a pointer to a value-stack item (a pointer to a yyvstype structure) and
a string representing the symbol. You could print character-pointer attributes like this:

    %{
    typedef char *stack_type;       /* Use stack of character pointers. */
    %}
    %%
    ...
    %%
    char *yypstk( tovs, tods )      /* Print attribute stack contents.  */
    yyvstype *tovs;
    char     *tods;
    {
        static char buf[64];
        sprintf( buf, "[%0.30s,%0.30s]", tovs->left, tovs->right );
        return buf;
    }
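The two routines above can be tried outside a generated parser. The following is only a minimal sketch in ANSI C, assuming a character-pointer YYSTYPE: init_goal() and show_item() are hypothetical stand-ins for yy_init_llama() and yypstk(), and the "-" placeholder attribute is an assumption chosen so that the goal symbol never carries a garbage pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef char *stack_type;           /* attribute type, as in the text         */
#define YYSTYPE stack_type

typedef struct                      /* same two-field layout as yyvstype      */
{
    YYSTYPE left;                   /* attribute of the left-hand side        */
    YYSTYPE right;                  /* attribute of the current symbol        */
} yyvstype;

/* Hypothetical stand-in for yy_init_llama(): replace the garbage entry
 * for the goal symbol with a harmless, printable attribute.
 */
void init_goal(yyvstype *p)
{
    assert(p != NULL);
    p->left = p->right = "-";       /* assumed placeholder value              */
}

/* Hypothetical stand-in for yypstk(): format one value-stack item.
 * Each string is truncated to 30 characters, so the result always
 * fits in the 64-byte static buffer.
 */
char *show_item(const yyvstype *tovs)
{
    static char buf[64];

    assert(tovs != NULL);
    snprintf(buf, sizeof buf, "[%.30s,%.30s]", tovs->left, tovs->right);
    return buf;
}
```

Because show_item() returns a pointer to a static buffer, each call overwrites the previous result; that is acceptable for a debugging display but not for code that keeps the string.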
E.12.4 The llout.sym File
LLama produces a symbol table file if the -s or -S switch is specified on the command
line. Listing E.19 shows the symbol table produced by the input file at the end of
this appendix.
The first part of the file is the nonterminal symbols. Taking expr as characteristic, the
entry looks like this:

    expr (257)
        FIRST : NUM_OR_ID LP
        FOLLOW: RP SEMI
        2: expr -> term expr' ................ SELECT: NUM_OR_ID LP

The 257 is the number used to represent an expr on the parse stack; the next two lines are
the FIRST and FOLLOW sets for expr. (These are present only if -S was used to generate
the table.) The next lines are all productions that have an expr on their left-hand
side. The number (2, here) is the production number, and the list to the right is the LL(1)
selection set for this production. The production number is useful for setting breakpoints
on application of that production in the debugging environment.
If the production contains an action, a marker of the form {N} is put into the production
in place of the action. All these action symbols are defined at the bottom of
llout.sym (on lines 51 to 57 of the current file). Taking {0} as characteristic:

    {0} 512, line 42 : {$1=$2=newname();}

The 512 is the number used to represent the action on the parse stack, the action was
found on line 42 of the input file, and the remainder of the line is the first few characters
of the action itself.
The middle part of the symbol table just defines tokens. The numbers here are the
same numbers that are in llout.h, and these same values will be used to represent a token
on the parse stack.
E.12.5 Sample LLama Input File
Listing E.20 is a small expression compiler in LLama format.
Listing E.19. A LLama Symbol Table

 1  ---------------- Symbol table ------------------
 2
 3  NONTERMINAL SYMBOLS:
 4
 5  expr (257)
 6      FIRST : NUM_OR_ID LP
 7      FOLLOW: RP SEMI
 8      2: expr -> term expr' ..................... SELECT: NUM_OR_ID LP
 9
10  expr' (259)
11      FIRST : PLUS <epsilon>
12      FOLLOW: RP SEMI
13      3: expr' -> PLUS {2} term {3} expr' ........ SELECT: PLUS
14      4: expr' -> ............................... SELECT: RP SEMI
15
16  factor (260)
17      FIRST : NUM_OR_ID LP
18      FOLLOW: PLUS TIMES RP SEMI
19      8: factor -> NUM_OR_ID {6} ................ SELECT: NUM_OR_ID
20      9: factor -> LP expr RP ................... SELECT: LP
21
22  stmt (256) (goal symbol)
23      FIRST : NUM_OR_ID LP <epsilon>
24      FOLLOW: $
25      0: stmt -> ................................ SELECT: $
26      1: stmt -> {0} expr {1} SEMI stmt ......... SELECT: NUM_OR_ID LP
27
28  term (258)
29      FIRST : NUM_OR_ID LP
30      FOLLOW: PLUS RP SEMI
31      5: term -> factor term' ................... SELECT: NUM_OR_ID LP
32
33  term' (261)
34      FIRST : TIMES <epsilon>
35      FOLLOW: PLUS RP SEMI
36      6: term' -> TIMES {4} factor {5} term' .... SELECT: TIMES
37      7: term' -> ............................... SELECT: PLUS RP SEMI
38
39  TERMINAL SYMBOLS:
40
41  name         value
42  LP           4
43  NUM_OR_ID    3
44  PLUS         1
45  RP           5
46  SEMI         6
47  TIMES        2
48
49  ACTION SYMBOLS:
50
51  {0} 512, line 42 : {$1=$2=newname();}
52  {1} 513, line 42 : {freename($0);}
53  {2} 514, line 48 : {$1=$2=newname();}
54  {3} 515, line 49 : { yycode("%s+=%s\\n",$$,$0); freename($0); }
55  {4} 516, line 56 : {$1=$2=newname();}
56  {5} 517, line 57 : { yycode("%s*=%s\\n",$$,$0); freename($0);}
57  {6} 518, line 61 : { yycode("%s=%0.*s\\n",$$,yyleng,yytext); }
Listing E.20. expr.lma: An Expression Compiler (LLama Version)

%term PLUS          /* +                       */
      TIMES         /* *                       */
%term NUM_OR_ID     /* a number or identifier  */
%term LP            /* (                       */
%term RP            /* )                       */
%term SEMI          /* ;                       */

%{
/*------------------------------------------------------------------------
 * Rvalue names are stored on a stack. newname() pops a name off the stack and
 * freename(x) puts it back. A real compiler would do some checking for
 * stack overflow here but there's no point in cluttering the code for now.
 */

char *Namepool[] =
{
    "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"
};
char **Namep = Namepool ;

char *newname ()          { return( *Namep++ );     }
char *freename( char *x ) { return( *--Namep = x ); }

extern char *yytext;
extern int  yyleng;

#define YYSTYPE char*
%}

%synch SEMI RP

%%
/* A small expression grammar that recognizes numbers, names, addition (+),
 * multiplication (*), and parentheses. Expressions associate left to right
 * unless parentheses force it to go otherwise. * is higher precedence than +.
 */

stmt    : /* epsilon */
        | { $1=$2=newname(); } expr { freename($0); } SEMI stmt
        ;

expr    : term expr'
        ;

expr'   : PLUS { $1=$2=newname(); } term
               { yycode("%s+=%s\n", $$, $0); freename($0); } expr'
        | /* epsilon */
        ;

term    : factor term'
        ;

term'   : TIMES { $1=$2=newname(); } factor
                { yycode("%s*=%s\n", $$, $0); freename($0); } term'
        | /* epsilon */
        ;

factor  : NUM_OR_ID { yycode("%s=%0.*s\n", $$, yyleng, yytext); }
        | LP expr RP
        ;
%%
/*----------------------------------------------------------------------*/

yy_init_llama( p )
yyvstype *p;
{
    p->left = p->right = "-" ;
}

char *yypstk( tovs, tods )
yyvstype *tovs;
char     *tods;
{
    static char buf[128];
    sprintf( buf, "[%s,%s]", tovs->left, tovs->right );
    return buf;
}

main( argc, argv )
char **argv;
{
    yy_get_args( argc, argv );
    yyparse();
    exit( 0 );
}
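The comment in the listing notes that a real compiler would check the temporary-name stack for overflow. The following is a minimal sketch of what such checking might look like, not the book's code: returning NULL when the pool is exhausted and asserting on an unmatched freename() are assumptions made here for illustration.

```c
#include <assert.h>
#include <stddef.h>

static char *Namepool[] =
{
    "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"
};

#define NAME_COUNT (sizeof Namepool / sizeof *Namepool)

static char **Namep = Namepool;     /* next free slot in the pool            */

/* Hand out the next temporary name, or NULL if all ten are in use
 * (the NULL convention is an assumption of this sketch).
 */
char *newname(void)
{
    if (Namep >= Namepool + NAME_COUNT)
        return NULL;
    return *Namep++;
}

/* Put a name back. Freeing more names than were allocated would run
 * off the bottom of the pool, so that case is treated as a caller bug.
 */
char *freename(char *x)
{
    assert(x != NULL);
    assert(Namep > Namepool);
    return *--Namep = x;
}
```

Names are recycled in last-in, first-out order, which matches the way the grammar's actions allocate a name before a subexpression and release it immediately afterward.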
Appendix F  A C-code Summary

This appendix summarizes all the C-code directives described in depth in Chapter
Six. Many of the tables in that chapter are also found here.
Figure F.1. The C-code Virtual Machine

[Figure: sixteen 32-bit registers, r0 through rF, each divided into four bytes labeled b3 (high) through b0 (low); a stack[] of 2,048 32-bit lwords with the sp register; the text, data, and bss segments; and the ip register.]
Table F.1. Registers

rN           General-purpose register (r0, r1, ..., r9, rA, ..., rF).
sp           Stack pointer.
fp           Frame pointer.
ip           Instruction pointer.
rN.pp        Access 32-bit register as pointer to something. Must use an addressing
             mode to specify type of object pointed to.
rN.w.high    Access word in high 16 bits of register.
rN.w.low     Access word in low 16 bits of register.
rN.b.b3      Access most significant byte of register (MSB).
rN.b.b2      Access low byte of high word.
rN.b.b1      Access high byte of low word.
rN.b.b0      Access least significant byte of register (LSB).
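The selectors in Table F.1 simply name different slices of one 32-bit value. The sketch below illustrates that with shifts and masks; it is not how the virtual machine is implemented, and the function names w_high(), w_low(), and b_n() are hypothetical. Shifts are used instead of a union so that the result does not depend on the byte order of the host machine.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t reg32;             /* one 32-bit virtual register            */

/* rN.w.high: the word in the high 16 bits. */
uint16_t w_high(reg32 r)
{
    return (uint16_t)(r >> 16);
}

/* rN.w.low: the word in the low 16 bits. */
uint16_t w_low(reg32 r)
{
    return (uint16_t)(r & 0xFFFFu);
}

/* rN.b.bn: byte n of the register, where b3 is the most significant
 * byte and b0 the least significant.
 */
uint8_t b_n(reg32 r, int n)
{
    assert(n >= 0 && n <= 3);
    return (uint8_t)(r >> (8 * n));
}
```

For the value 0x12345678, w_high() yields 0x1234 and b_n(r, 2) yields 0x34, the low byte of the high word, as the table describes.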
Table F.2. Types, Storage Classes, and Declarations

Types
    byte        8-bit
    array       (alias for byte; used to declare pointers to arrays)
    record      (alias for byte; used to declare structures)
    word        16-bit
    lword       32-bit
    ptr         generic pointer

Storage classes
    private     Space is allocated for the variable, but the variable cannot be accessed
                from outside the current file.
    public      Space is allocated for the variable, and the variable can be accessed from
                any file in the current program. It is illegal for two public variables to
                have the same name, even if they are declared in separate files.
    external    Space for this variable is allocated elsewhere. There must be a public or
                common definition for the variable in some other file.
    common      Space for this variable is allocated by the linker. If a variable with a
                given name is declared common in one module and public in another, then
                the public definition takes precedence. If there are nothing but common
                definitions for a variable, then the linker allocates space for that
                variable in the bss segment.

Declarations
    class type name;            Variable of indicated type and storage class.
    class type name[ size ];    Array of indicated type and storage class; size is optional
                                if initialized with a C-style initializer.

All declarations in the data segment must be initialized using a C-style initializer. Nonetheless, pointers
may not be initialized with string constants, only with the address of a previously declared variable.
Declarations in the bss segment may not be initialized.
Table F.3. Converting C Storage Classes to C-code Storage Classes

                                           not static
                            static      definition    declaration (extern or prototype)
Subroutine
(in text segment)           private     public        external
Uninitialized variable
(in bss segment)            private     common
Initialized variable
(in data segment)           private     public
Table F.4. Addressing Modes

Mode                Example      Notes
immediate           10           Decimal number. Use leading 0x for hex, 0 for octal.
direct              x, r0.l      Contents of variable or register.
indirect            B(p)         byte whose address is in p.
                    W(p)         word whose address is in p.
                    L(p)         lword whose address is in p.
                    P(p)         ptr whose address is in p.
                    BP(p)        byte pointer whose address is in p.
                    WP(p)        word pointer whose address is in p.
                    LP(p)        lword pointer whose address is in p.
                    PP(p)        ptr pointer whose address is in p.
double indirect     *BP(p)       byte pointed to by pointer whose address is in p.
                    *WP(p)       word pointed to by pointer whose address is in p.
                    *LP(p)       lword pointed to by pointer whose address is in p.
                    *PP(p)       ptr pointed to by pointer whose address is in p.
based indirect      P(p+N)       ptr at byte offset N from address in p.
                    B(p+N)       byte at byte offset N from address in p.
                    W(p+N)       word at byte offset N from address in p.
                    L(p+N)       lword at byte offset N from address in p.
                    LP(p+N)      lword pointer at byte offset N from address in p.
                    *LP(p+N)     lword pointed to by pointer at byte offset N from address in p.
                    ...
effective address   &name        Address of variable or first element of array.
                    &W(p+N)      Address of word at offset +N from the pointer p.
                                 The effective-address modes can also be used with other
                                 indirect modes. (See Table F.5.)

A generic pointer, p, is a variable declared ptr or a pointer register (rN.pp, fp, sp).
N is any integer: a number, a numeric register (r0.w.low), or a reference to a byte, word, or lword variable.
Table F.5. Combined Indirect and Effective-Address Modes

Syntax:                       Evaluates to:
&p    &W(&p)    &WP(&p)       address of the pointer
p     &W(p)     WP(&p)        contents of the pointer itself
W(p)  *WP(&p)                 contents of the word whose address is in the pointer
Table F.6. Arithmetic Operators

d =   s       assignment
d +=  s       addition
d -=  s       subtraction
d *=  s       multiplication
d /=  s       division
d %=  s       modulus division
d ^=  s       bitwise XOR
d &=  s       bitwise AND
d |=  s       bitwise OR
d <<= s       Shift d to left by s bits
d >>= s       Shift d to right by s bits (arithmetic)
d =-  s       d = two's complement of s
d =~  s       d = one's complement of s
d =&  s       d = effective address of s
lrs(d,n)      logical right shift of d by n bits. Zero fill in left bits.
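The difference between the arithmetic shift (>>=) and the logical shift lrs() matters only for negative values. The function below is a minimal sketch of a zero-filling right shift on a 32-bit quantity; the name lrs32() and the use of fixed-width types are assumptions made for the example and are not taken from virtual.h.

```c
#include <assert.h>
#include <stdint.h>

/* Logical right shift: move d right by n bits and fill the vacated
 * high bits with zeros, regardless of the sign of d. The shift is done
 * on the unsigned representation, where C guarantees zero fill.
 * With n == 0 and a negative d, the conversion back to int32_t is
 * implementation-defined before C23.
 */
int32_t lrs32(int32_t d, int n)
{
    assert(n >= 0 && n < 32);
    return (int32_t)((uint32_t)d >> n);
}
```

Shifting -8 right by one bit this way gives 0x7FFFFFFC, whereas an arithmetic shift on a typical two's-complement machine would give -4.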
Table F.7. Test Directives

Directive:      Execute following line if:
EQ( a, b )      a = b
NE( a, b )      a ≠ b
LT( a, b )      a < b
GE( a, b )      a ≥ b
U_LT( a, b )    a < b (unsigned comparison)
U_GE( a, b )    a ≥ b (unsigned comparison)
BIT( b, s )     bit b of s is set to 1 (bit 0 is the low bit).
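Since each test directive controls only the line that follows it, one plausible way to model the directives is as macros that expand to an if with no body. The sketch below takes that approach; it is an assumption about how such directives could be written, not a copy of the definitions in virtual.h, and classify() and bit_is_set() are hypothetical helpers that exist only to exercise the macros.

```c
#include <assert.h>

#define EQ(a, b)    if ((long)(a) == (long)(b))
#define NE(a, b)    if ((long)(a) != (long)(b))
#define LT(a, b)    if ((long)(a) <  (long)(b))
#define GE(a, b)    if ((long)(a) >= (long)(b))
#define U_LT(a, b)  if ((unsigned long)(a) <  (unsigned long)(b))
#define U_GE(a, b)  if ((unsigned long)(a) >= (unsigned long)(b))
#define BIT(b, s)   if ((unsigned long)(s) & (1UL << (b)))

/* Return a bit mask recording which tests succeed for a and b:
 * 1 for EQ, 2 for signed LT, 4 for unsigned LT.
 */
int classify(long a, long b)
{
    int result = 0;

    EQ(a, b)
        result |= 1;
    LT(a, b)
        result |= 2;
    U_LT(a, b)
        result |= 4;
    return result;
}

/* Return 1 if bit b of s is set (bit 0 is the low bit), else 0. */
int bit_is_set(int b, unsigned long s)
{
    int result = 0;

    assert(b >= 0 && b < 32);
    BIT(b, s)
        result = 1;
    return result;
}
```

The signed and unsigned tests disagree when one operand is negative: classify(-1, 1) reports a signed less-than only, while classify(1, -1) reports an unsigned less-than only. Macros of this form should be followed by exactly one statement and never by an else, because the else would attach to the hidden if.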
Table F.8. Other Directives

SEG( text )         Change to code segment.
SEG( data )         Change to initialized-data segment.
SEG( bss )          Change to uninitialized-data segment.
PROC( name )        Begin code for subroutine name.
ENDP( name )        End code for subroutine name.
ALIGN( type )       Next declared variable is aligned as if it were the indicated type.

ext_low( reg );     Duplicate high bit of reg.b.b0 in all bits of reg.b.b1.
ext_high( reg );    Duplicate high bit of reg.b.b2 in all bits of reg.b.b3.
ext_word( reg );    Duplicate high bit of reg.w.low in all bits of reg.w.high.

push( x );          Push x onto the stack.
x = pop( type );    Pop an object of the indicated type into x.

call( ref );        Call subroutine. ref can be a subroutine name or a pointer.
ret();              Return from current subroutine.

link( N );          Set up stack frame: push(fp), fp=sp; sp-=N.
unlink();           Discard stack frame: sp=fp, fp=pop(ptr).

pm()                (subroutine) Print all registers and top few stack items.
_main               Is mapped to main.

Directives whose names are in all upper case must not be followed by a semicolon when used. All other
directives must be found inside a subroutine delimited with PROC() and ENDP() directives. SEG() may
not be used in a subroutine.
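The link and unlink entries in Table F.8 describe the stack-frame protocol compactly. The following sketch simulates that protocol on a small downward-growing stack. It is an illustration only: sp and fp are modeled here as array indexes rather than pointers, and the names link_frame() and unlink_frame() are hypothetical (chosen to avoid clashing with the link() and unlink() system calls).

```c
#include <assert.h>
#include <stdint.h>

#define SDEPTH 2048                 /* stack size in 32-bit lwords            */

int32_t stack[SDEPTH];
int32_t sp = SDEPTH;                /* index of top item; SDEPTH means empty  */
int32_t fp = SDEPTH;                /* frame pointer, also an index           */

void push(int32_t x)
{
    assert(sp > 0);                 /* overflow check                         */
    stack[--sp] = x;
}

int32_t pop(void)
{
    assert(sp < SDEPTH);            /* underflow check                        */
    return stack[sp++];
}

/* link(N): push(fp), fp=sp, sp-=N. Saves the caller's frame pointer
 * and reserves n lwords for local variables.
 */
void link_frame(int n)
{
    push(fp);
    fp = sp;
    assert(n >= 0 && n <= sp);
    sp -= n;
}

/* unlink(): sp=fp, fp=pop(). Discards the locals and restores the
 * caller's frame pointer.
 */
void unlink_frame(void)
{
    sp = fp;
    fp = pop();
}
```

After link_frame(3), the saved frame pointer sits at stack[fp] and the three local slots occupy stack[fp-1] through stack[fp-3]; unlink_frame() leaves sp and fp exactly as they were before the call.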
Table F.9. Supported Preprocessor Directives

#line line-number "file"    Generate error messages as if we were on line line-number of the
                            indicated file.
#define NAME text           Define macro.
#define NAME(x) text        Define macro with arguments.
#undef NAME                 Delete macro definition.
#ifdef NAME                 If macro is defined, compile everything up to next #else or #endif.
#if constant-expression     If constant-expression is true, compile everything up to next
                            #else or #endif.
#endif                      End conditional compilation.
#else                       If previous #ifdef or #if is false, compile everything up to the
                            next #endif.
#include <file>             Replace line with contents of file (search system directory first,
                            then current directory).
#include "file"             Replace line with contents of file (search current directory).
Table F.10. Return Values: Register Usage

Type        Returned in:
char        rF.w.low (A char is always promoted to int.)
int         rF.w.low
long        rF.l
pointer     rF.pp
Bibliography

Books Referenced in the Text

[Aho] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading, Massachusetts: Addison-Wesley, 1986.
[Aho2] Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. The AWK Programming Language. Reading, Massachusetts: Addison-Wesley, 1988.
[Angermeyer] John Angermeyer and Kevin Jaeger. MS-DOS Developer's Guide. Indianapolis, Indiana: Howard W. Sams & Co., 1986.
[Armbrust] Steven Armbrust and Ted Forgeron. ".OBJ Lessons." P.C. Tech Journal 3:10, October, 1985, pp. 63-81.
[Arnold] Kenneth Arnold. Screen Updating and Cursor Movement Optimization: A Library Package.
[Bickel] "Automatic Correction to Misspelled Names: A Fourth-Generation Language Approach." Communications of the ACM (CACM) 30, March 1987, pp. 224-228.
[DeRemer79] F. L. DeRemer and T. J. Pennello. "Efficient Computation of LALR(1) Look-Ahead Sets." Proceedings of the SIGPLAN Symposium on Compiler Construction. Denver, Colorado, August 6-10, 1979, pp. 176-187.
[DeRemer82] F. L. DeRemer and T. J. Pennello. "Efficient Computation of LALR(1) Look-Ahead Sets." ACM Transactions on Programming Languages and Systems. 4:4, October, 1982, pp. 615-649.
[Haviland] Keith Haviland and Ben Salama. UNIX System Programming. Reading, Massachusetts: Addison-Wesley, 1987, pp. 285-307.
[Holub1] Allen I. Holub. "C Chest: An AVL Tree Database Package." Dr. Dobb's Journal of Software Tools, 11:8 (August, 1986), pp. 20-29, 86-102, reprinted in Allen I. Holub. C Chest and Other C Treasures. Redwood City, California: M&T Books, 1987, pp. 193-215.
[Howell] Jim Howell. "An Alternative to Soundex." Dr. Dobb's Journal of Software Tools, 12:11 (November, 1987), pp. 62-65, 98-99.
[Intel] 8086 Intel Relocatable Object Module Formats. Santa Clara, California: Intel Corporation, 1981. (Order number 121748-001). See also: [Armbrust].
[Jaeschke] Rex Jaeschke. Portability and the C Language. Indianapolis, Indiana: Hayden Books, 1989.
[K&R] Brian Kernighan and Dennis Ritchie. The C Programming Language, 2nd ed. Englewood Cliffs (N.J.), Prentice Hall, 1988.
[Knuth] Donald E. Knuth. The Art of Computer Programming. Vol. 1-3. Reading, Massachusetts: Addison-Wesley, 1973.
[Kruse] Robert L. Kruse. Data Structures and Program Design. Englewood Cliffs, New Jersey: Prentice Hall, 1984.
[Lewis] P. M. Lewis II, D. J. Rosenkrantz, and R. E. Stearns. Compiler Design Theory. Reading, Massachusetts: Addison-Wesley, 1976.
[McNaughton] Robert McNaughton and H. Yamada. "Regular Expressions and State Graphs for Automata." IRE Transactions on Electronic Computers, vol. EC-9, no. 1 (March, 1960), pp. 39-47. (Reprinted in E. F. Moore, editor. Sequential Machines: Selected Papers. Reading: Addison-Wesley, 1964.)
[Nash] Ogden Nash. "The Lama." Verses from 1929 On. Boston: Little, Brown, and Company: 1931, 1985.
[Ratcliff] John W. Ratcliff and David E. Metzener. "Pattern Matching by Gestalt." Dr. Dobb's Journal of Software Tools 13:7 (July, 1988), pp. 46-51.
[Schreiner] Axel Schreiner and H. George Friedman, Jr. Introduction to Compiler Construction with UNIX. Englewood Cliffs, New Jersey: Prentice Hall, 1985.
[Tenenbaum] Aaron M. Tenenbaum and Moshe J. Augenstein. Data Structures Using Pascal, 2nd ed. Englewood Cliffs, New Jersey: Prentice Hall, 1986.
[Tremblay] Jean-Paul Tremblay and Paul G. Sorenson. The Theory and Practice of Compiler Writing. New York: McGraw-Hill, 1985.

Other Compiler Books

This, and following sections, list books and magazine articles of interest that are not
referenced in the text.

[Anklam] Patricia Anklam, et al. Engineering a Compiler: Vax-11 Code Generation and Optimization. Bedford, Massachusetts: Digital Press, 1977.
[Barret] William A. Barret, et al. Compiler Construction: Theory and Practice. 2nd ed. Chicago: Science Research Associates, 1986.
[Fischer] Charles N. Fischer and Richard J. LeBlanc, Jr. Crafting a Compiler. Menlo Park, California: Benjamin/Cummings, 1988.
[Gries] David Gries. Compiler Construction for Digital Computers. New York: John Wiley & Sons, 1971.
[Brinch Hansen] Per Brinch Hansen. Brinch Hansen on Pascal Compilers. Englewood Cliffs, New Jersey: Prentice Hall, 1985.
[Hunter1] Robin Hunter. The Design and Construction of Compilers. New York: John Wiley & Sons, 1971.
[Hunter2] Robin Hunter. Compilers: Their Design and Construction Using Pascal. New York: John Wiley & Sons, 1985.
[Sarff] Gary Sarff. "Optimization Strategies." Computer Language, 2:12 (December, 1985), pp. 27-32.
[Terry] P. D. Terry. Programming Language Translation. Reading, Massachusetts: Addison-Wesley, 1986.
[Tremblay2] Jean-Paul Tremblay and Paul G. Sorenson. An Implementation Guide to Compiler Writing. New York: McGraw-Hill, 1982.

Interpreters

[Budd] Timothy Budd. A Little Smalltalk. Reading, Massachusetts: Addison-Wesley, 1986.
[Goldberg] Adele Goldberg and David Robson. Smalltalk-80: The Language and Its Implementation. Reading, Massachusetts: Addison-Wesley, 1983.
[Payne] William Payne and Patricia Payne. Implementing BASICs: How BASICs Work. Reston, Virginia: Reston (a Prentice Hall Company), 1982.

[Harbison] Samuel P. Harbison and Guy L. Steele, Jr. C: A Reference Manual. 2nd ed. Englewood Cliffs, New Jersey: Prentice Hall, 1987.
[Holub2] Allen I. Holub. The C Companion. Englewood Cliffs, New Jersey: Prentice Hall, 1987.
[s>a. x p, C] 363
[x->.y, FIRST(!bC)] 363
in regular expression 817
C operator 611
<x 192
e 8, 12,52
in FIRST set 214
is terminal, not token 12
identity element in strings 53
recognized in recursive-descent parser. 19
e edges, effectively merge states 113
in Thompson NFA 85
e closure 117
e items 364
add to current state by closure 359
e productions 173
executed first in list grammar 178
in bottom-up parse tables 359
occs 858
occs/LLama 841
reductions by 359
e transitions 58
u (see union)
as first character in g e n () format string 564
mark comments in driver-template file 746
(), in a IX regular expression 55, 817
indicate indirection in C-code 463
h 11 (see also end-of-input marker)
166
::= 166
$ attribute processing, LLama version 332
in IfX regular expression 54, 817, 819
$$, $1, etc. in LLama 332,883
in occs 858
$$, attribute of left-hand side 205, 352
translation in occs-generated parser 388
$$=$1, default action in bottom-up parser
353,858
$N 388 392
$-N 393, 860
%directives 857
LLama 883
shared occs and LLama 838
%{ %}, occs/LLama 857, 838
%%, occs (in Table E.5) 857
I I operator, in C compiler 622
I, in a production 7
in Thompson s construction 81,82
LEX 817
~, in regular expression 54, 817-819
?, in regular expression 818
in Thompsons construction 82
*, in regular expression 818
C dereference operator 606f.
C multiplication operator (see C compiler,
binary operators)
+, in regular expression 818
C operator, in compiler 634
++ operator 603
- operator 603
: : = (see >)
operator 7
=> symbol 168
- > C operator 611
~ C operator, in compiler 602
! C operator, in compiler 602
- C operator, in compiler 602
&C operator in compiler 603f.
, C operator, in compiler 618
?: C operator, in compiler 619
= C operator, in compiler 620
&&C operator in compiler 624
- C operator in compiler 634
[ ], in regular expression 54, 818
C operator, in compiler 606f.
occs operator 865
[ ] *, occs operator 867
<, >, <=, etc. C operator 626
{n, m}, operator in regular expression 55
## 688
_8086 683
VC 817
\0-terminated-string, support for 47
\n in IfX input, 817
added to to negative character class by IfX 55
\ f 817
\b 817
\ s 817
\ t 817
\ e 817
\ r 817
16-bit data type, WORD 755
8086 assembly language, translated to C code 451
8086 macros 684
8086, representing an address 684
8086, segment not modified in 8086 far-pointer
decrement 686(fn)
A, % C, <D, etc. 192
-a, occs/LLama command-line switch 410,
843,876
A 771
a [ 7 ] ; , generate code for 611
abbreviations, C-compiler nonterminals 512
abstract_decl 535
abstract declarator 535
accept 338, 347
a c c e p t 85
a c c e p t , field in DFA_STATE 126
accept action, represented in yyout.doc 869
accepting actions, y y l e x () 81
accepting state 56, 57
accepting-state array, Y y a c c e p t [ ] 72
ACCEPT structure 126
access arbitrary stack element (macro, high-level
description) 687
access function 42
input system 43
structure-member 611
ACT 412
actions
default ($$=$1) 353
executed as parse tree traversed 203
executed in top-down parse 208
imbedded (see imbedded actions)
imbedded (translation by occs) 406
initialization (see initialization)
IfX multiple line 95
occs imbedded 864
put occs parser and actions in different files
875
897
898
Index
subroutine, generated by occs/LLama (-a)
843
subroutine, in occs-generated parser 388
subroutine, sample in occs-generated parser
389
A c t i o n (LR parse tables) 412
activation record (see stack frame)
A c t u a l _ l i n e n o 95
ADD() 698
addch () 782
a d d _ d e c l a r a t o r () 530
ADD (function description) 693
addition
and subtraction processing (listings) 634
as hash function 723
hash function 483
hash function, analysis 484
address
finding physical on an 8086 685
LSB at lowest in C-code virtual machine 461
representing in 8086 architecture 684
address-of operator (&), in C compiler 601, 603
_ a d d s e t () 691,701
a d d _ s p e c _ t o _ d e c 1 () 536
a d d s t r () 782
addsym () 713,718
add_synch () 289
a d d _ t o _ d s t a t e s () 129
a d d _ t o _ r h s () 289
a d d _ u n f i n i s h e d () 417
adjectives 490
a d v a n c e () 100
a d v a n c e ( ) . 17
advance 44
advance, input pointer: i i _ a d v a n c e () 42
aggregate types, represented by physical pointers at
run time 599
algorithm, greedy (see greedy algorithm)
algorithm, nongreedy (see nongreedy algorithm)
ALIGN ( ) , C-code directive 462
alignment
forcing in C-code 462
in C-code 457, 461
in structure field 547
problems caused by leading line number in IfX
string 89
worst-case restriction in C-code [c-code.h] 462
ALLOC 86
allocate
temporary variables, tmp a l l o c () 577
normal variables, in C-code 460
memory 559
allocation unit (disk) 37
alphabet 52, 166
alphabetic componants of output labels (label.h)
554
alternative parser templates, occs/LLama (-m)
844
ambiguity
resolving when making LALR parse tables
379
rules for resolving ambiguities in parse table
379
eliminating from grammar 223
in i f / e l s e , binding problems 219
ambiguous accepting actions, handled by IfX 117
ambiguous grammars 182
in occs 856
using in LALR parse tables 375
analysis, lifetime 673
analyzer, lexical (see lexical analyzer)
a n c h o r 86
field in DFA_STATE 126
anchor 54
end-of-line 54
IfX 817
processing by IfX 101
problems 828
AND operator (&&) 624
anonymous temporaries 6,184
ANSI 683
ANSI C
grammar 806
subset implemented in Chapter Six 445
ANSI
concatenation operator,## 688
requires downward-growing stacks 688
variable-argument mechanism 724
arbitrary stack element (macro, high-level
description) 687
architecture, 8086 segmented 684
args 613
field in symbol structure 488
argument
declaration 532
assembled in reverse order 535
K&R-style 555
functions with a variable number 724
function (see function arguments)
in recursive-descent parser 26
Pascal subroutine 802
to g e n () 565
to l i n k instruction 552
a r g v
printing 733
sorting with s s o r t () 738
arithmetic operators
handled by C compiler 630
in C-code 473
Arnold, Ken (curses author) 774
a r r a y 450
array
accepting-state (Yyaccept [ ]) 72
dereferencing 603
manipulation, macros 686
problems with decrement past start of in 8086
686(fn)
represents LR State machine 345
dynamic 456
printing two-dimensional 140
used to model state machine 57
using to access high byte of number without
shift 685
< a s c i i > attribute 543
ASSIGN (function description) 692
assignment
dead 661
operator, in C compiler 619-621
associativity
controlling when eliminating ambiguity from
grammar 224
specifying under occs 856
specifying gramatically 175
used to resolve conflicts in LALR tables 376
a s s o r t () 739,741
attribute 34, 186
< a s c i i > 543
bits 750
bottom-up parsing 349
definitions 754
differentiate tokens with 511
l i n k pointer used in C compiler 524
nonterminal 187
notation, for bottom-up parser ($A0 205
notation, for top-down parser ($N) 205
processing, bottom-up parser 348
and occs e productions 858
for token (shifting onto occs value stack) 877
in code section of occs input file 392
inherited 187
inherited (in top-down parser) 203
in recursive descent 348
LLama 883
negative ($ - 1 ) 860
notation in bottom-up parser 353
occs 858
passing bottom-up between symbols (an exam
ple) 350
stored on stack in recursive-descent parser 199
synthesized 187, 349
stack (see value stack)
synthesized by * and [ ] operators 608
attributed grammars 186-187, 191"
for bottom-up parsing 351
implementing with a PDA 203
ATYPE 141
augmentations 183
in occs/LLama grammar 841
augmented grammars 183
and attributed grammars 191
and table-driven parsers 202
for bottom-up parsing 351
automata
deterministic finite (see DFA)
finite (see state machine)
nondeterministic finite (see NFA)
push-down (see push-down automata)
automatic variables
figuring offset in stack frame 562
handling at run time 467
autoselect mode 772
A v a i l a b l e 416
back end 2,5,447,563
Backus, J.W. 7
Backus-Naur Form (BNF) 7, 166
modified for occs/LLama 839
Backus-Normal Form (See Backus-Naur Form)
base address, video memory 756
basic block 663
basic types, in C-code 450,452
beginning-of-line anchor, IfX 817
Bilbo 535
Binary-mode, untranslated input (see also escape
sequences, IfX) 40, 817
b i n a r y _ o p () 630
binary operator productions, summary 617,618
binary, convert to printing ASCII 727
b i n _ t o _ a s c i i () 727
BIOS, video (see video BIOS)
bit maps 690
implementing 695
bit
bit positions, computing in set (bit map) 697
number of bits required for a type 685
test bit in set 691
b i t s e t 85
_BITS_IN_WORD 697
blinking characters 749
block
basic 663
deletions from symbol table 485
disk 37
starting with symbol (see bss)
BNF (See Backus-Naur Form)
body 478
bootstrapping, a compiler using language subsets
271
bottom-up attributes, accessing under occs 858
bottom-up attributes, notation 353
bottom-up parsing/parser 169, 337f.
attribute notation ($A0 205
attribute processing 348
error recovery 348
passing attributes between symbols (an exam
ple) 350
uses synthesized attributes 349
value stack 348
stack, relation to parse tree 340
table driven (an example) 345
tables, creating 357
$$ 352
attributed grammars 351
augmented grammars 351
description of parse process 338
Index 899
implementing with state machine 343
of lists 340
recursion 340
with a PDA (basic algorithm) 338
boundaries, lexeme ( i i _ m a r k _ s t a r t ( ) ,
ii__mark_end ( ) ) 42
box () 780
box, drawing under curses 780, 788
box-drawing characters, box.h 787
boxed () 778
b r e a k 642
breakpoints, IDE 848
IDE production 264
IDE stack 255
b r e a k statements, in occs/LLama actions 841
b s s , output stream 552
bss segment 456
C compiler (temporary file) 512
output from occs/LLama 855
bubble sort 739
BUCKET 716, 718
buffer
flush 37
input system 15
input system 37,38
ring 50
BUFSIZE 37
b y t e 450
_BYTES_IN_ARRAY 697
c
C to C-code mappings 495
C-code, 6,449
addressing modes 462
ALIGN () directive 462
alignment 457
alignment 461
allocating variables 460
Arithmetic operators 473
automatic variables 467
basic types 450
basic types (implementation) 452
basic types (widths) 452
caveats 478
c-code.h 450
CLASS 455
combined effective-address and indirect modes
&WP(&p) 463,465
common 458
comparison directives [virtual.h] 475
constant expressions 475
control flow 474
converting C storage classes to C-code storage
classes 458
declarations 478
direct-stack-access directives [virtual.h] 465
ENDP, and SEG() 457
executable image 456
e x t e r n a l 458
file organization 476
g o t o 474
identifiers 450
indirect-mode prefixes [c-code.h] 465
instruction pointer register, i p 452
jump tables cant be implemented 478
labels 474
labels different from assembly language 478
leading underscores avoid name conflicts 467
LSB (at the lowest address) 461
macros 475
memory alignment, forcing 462
memory organization 455
miscellany 476
preprocessor directives 476
Print virtual-machine state, pm () 476
p r i v a t e 457
PROC, and SEG () 457
program-load process 455
program prefix 455
p u b l i c 458
Push and pop directives 465
register set 451
register set 454
run-time trace support [virtual.h] 477
segment-change directive [SEG () ] 457
segments 455
segments 456
semicolon 457
Sign extension 474
stack 454
stack manipulation 465
stack-size definitions 454
storage classes 457
storage classes [in virtual.h] 461
storage classes, &WP (&_p) used for lvalues
599
subroutines 466
subroutine arguments 467
subroutine linkage 467
subroutine return values 472
test directives 474
translating C to C-code to 8086 assembly
language 451
translating to 8086 assembly language 451
Type Conversions 473
types 450
variable allocation 460
variable declarations 457
variable names 450
virtual machine 451
virtual-machine register set 453
virtual-machine stack registers, fp, sp 452
virtual-register-set allocation, ALLOC 454
white space 450
worst-case alignment restriction [c-code.h]
462
c-code.h 450
C compiler
abbreviations in nonterminal names 512
abstract declarators 535
all other operators, binary_op() 630
binary operators 617
cast and address-of operators 601
cleanup 512
code generation [gen()] 564
comma, conditional, and assignment operator 619
compound statements (see compound statements)
constants 598
control flow 637
declarator processing 528
expressions 572
function-argument declarations 532
function calls 611
function declarations 552
function declarators 532
identifiers
initialization (listings) 515
initialization productions (listings) 524
initialize link (decl.c) 526
integer constants 593
interaction between scanner and code generator
(problems) 536
lexical analyzer 518
link, pointer to (used as an attribute) 524
local variables 559
local variables 562
logical AND and OR operators 622
lvalues and rvalues 578
merge declarator and specifier 536
merge links (decl.c) 526
parser configuration 509
pointer and array dereferencing 603
print value stack [yypstk()] 514
relational operators 626
SEG directives 524
sizeof 599
specifier processing 524
statements
string constants 599
structure and union declarations 543
structure-member access 611
temporary files for code, data, and bss segments
512
temporary-variable allocation 572
token definitions 511
type conversions 592
typedef processing 528
unary minus, NOT, one's complement 602
unary operators 593
%union to type value stack
variable declarations, simple 522
variable declarators (listings) 529
C, grammar 806
calling sequence, mirrors parse tree 21
call(name), in C-code 466, 477
call, of the wild 468
Camptown ladies 107
cascading error messages 348
avoiding with YYCASCADE 875
cast
implemented in C compiler 601
two needed to convert i n t to pointer 688
used to access high byte of number without
shift 685
cat_expr() 104
CCL 85
CGA 749
CHARACTER 756
character
print in human-readable form 732
and attributes 749
box-drawing box.h 787
attribute bits 750
display with direct video 751
character class 54
add \n to negative 55
dodash() 107
empty 818
LEX 818
character I/O
curses 781
video-BIOS 752
character pointer, used for variable-argument-list
pointer 725
Chomsky, Noam 7
choose an alternate driver file, -m (IfX) 826
CHUNK 412
CLASS 86
class, character (see character class)
CLASS, has precedence 511
in C-code 455
clean up, after prnt() 735
cleanup productions, C compiler (listings) 524
CLEAR 692, 698
clear() 783
clear
entire window, werase.c 794
from cursor to edge of window, wclrtoeo.c 793
region of screen, (high-level description) 750
set (function description) 692
closure
in creating LR parse tables 357
Kleene (*) 53
operators, IfX 818
positive (+) 53
processing, IfX 107
clrtoeol() 783
cluster (disk) 37
code block, %{...%} 838
code, dead 660, 661
code generation 29, 445
expression evaluation 184
gen() subroutine 564
using the value stack 351
with inherited attributes 189
with synthesized attributes 189
with top-down parser and augmented grammar
203
code generator, communicate with scanner through
symbol table 529
problems 536
code, output stream 552
code
in an occs/LLama production 841
in IfX rule 820
in occs/LLama definitions section 837
op 459
section, occs/LLama input file 842
segment 456
code, useless 662
COLBASE 756
collision, in symbol table 481
resolution (hashing) 483
color, changing on screen 749
curses changing foreground/background 778
combined effective-address and indirect modes
&WP(&p) 463,465
comma-delimited lists, problems 618
COMMA, ext_decl_list→ext_decl_list COMMA ext_decl 531
comma, in D() macro argument 683
comma operator
in mvinch() and mvwinch() 786
isolating 618
code to handle in C compiler 619
command-line
processing, IDE 853
switches (IfX) 825
occs/LLama 843
commands, IDE 847
comment () 734
comment
print multiple-line 734
add to C-compiler output, gen_comment() 564
in IfX input file 816
output from occs/LLama 855, 857
common 458
common-subexpression elimination 672
compare
two sets, (function description) 691
sets 701
directives, in C-code [virtual.h] 475
function, used by ssort() 738
two ITEMS 420
two STATES [state_cmp()] 420
compiler.h 689
compiler, passes 1
compiler, Pascal 802
COMPLEMENT() 695
COMPLEMENT (function description) 692
complemented set
logical COMPLEMENT() 695
physical INVERT() 695
by inverting bits 691
by marking set as negative true 695
by physically inverting bits 694
problems with 694
compound statements 559,637
compound_stmt 559
compound_stmt, ext_def→opt_specifiers funct_decl def_list compound_stmt 552
compress
parse tables, occs 393
transition matrix (IfX) 65
IfX state-machine tables 140
pair 68
ratio, pair compression 72
redundant row and column elimination 146
redundant row and column elimination 65
using default transition 71
concat() 745
concatenation
of strings (function description) 745
LEX 817
operator, ## 688
operator, IfX 106
processing, IfX 104
regular expression 54
string 53
conditional operator (a?b:c) 619
conditional-compilation macros: LL(), OX() 275
conflicts 11
local-static names 486
resolving reduce/reduce 380
resolving shift/reduce 379
rules for resolving conflicts in parse table 379
shift/reduce and reduce/reduce 360
CONSTANT, storage class 491
constant
expressions, in C-code 475
expressions, in tests 641
folding 633,659
propagation 660
symbolic 598
value 584
const_expr dummy used for grammar development 530
constrained types, numeric coding 489
construction, subset 122
context-free grammar 166, 167
continue 642
control characters, matching in a regular expression
( V ) 817
control flow 637
in C-code 474
in yylex() 74
conversion functions 726
conversion, implicit type 620
conversion, type (see type conversion)
convert_type() 592
coordinates, curses 777
COPY() 39
copy_amt
copy file, copyfile() 741
copying, structures 859
corner 221
corner substitution 221
translate indirect recursion to self recursion
228
configuration, C parser 509
counter, location 458
create_static_locals() 563
create
stack, stack macro (high-level description) 686
parser only, occs/LLama (-p) 844
window, wincreat.c 794
crmode(), Set unbuffered, (high-level description) 776
cross links 486
creating in C compiler ext_decl_list ext_decl
531
set up in C compiler 554
Ctrl-L, delimits parts of driver-template file 746
cur.h 788
current 15
Current_tok 100
<curses.h> 776
curses.lib
curses
boxed windows 778
character I/O 781
configuration 775
configuration functions 776
coordinates (y=row, x=column). Upper left
corner is (0,0) 777
creating and Deleting Windows 777
cursor movement and position functions 781
cursor position, change 751
delete/hide window 779
designed for efficient serial communication
775
disable on abort in IDE 246
curses, erase window 783
foreground/background color 778
glue functions to video I/O 771
implementation 783
initialization functions 776
line wrap 778
move window 779
portability issues 775
save region under screen 779
scrolling 778
scroll window
source code
box, draw in curses window, box.c 788
clear, entire window, werase.c 794
clear, from cursor to edge of window,
wclrtoeo.c 793
create window, wincreat.c 794
delete window, delwin.c 790
display window, showwin.c 792
erase, entire window, werase.c 794
erase, from cursor to edge of window,
wclrtoeo.c 793
formatted print to window, wprintw.c 800
hide window, hidewin.c 790
initialize, curses (initscr.c) 791
move window, mvwin.c 791
move window-relative cursor and read
character, winch.c 794
move window-relative cursor, wmove.c
800
scroll window, wscroll.c 801
string, write to window, waddstr.c 793
winio.c 796
write, string to window, waddstr.c 793
standard box-window function 780
stdscr window 777
subwindows 777
using 776
window management 774
modify and move (video-BIOS) 752
c.y 806
D() 683
DAG (see directed acyclic graph)
DANGER 37
dangling else 378
data abstraction 715
database, management with hash table (see hashing)
database, occs STATE database 417
database, termcap 774
database layer, in symbol table 479
data, output stream 552
data segment 456
\xDD 817
\DDD 817
dead assignment 661
nonconstant 663
dead code 660
dead states 140
dead variables 660
DEBUG 682
debugging
hooks, for Ctrl-A, Ctrl-B 263, 514, 518, 847
diagnostics, D() 683
environment, occs interactive (see IDE)
print value stack in C compiler debugging
mode 514
debug.h 681
debug mode, occs/LLama 845
commands, IDE 847
declaration
function (see function declaration)
identifier processing 528
C-code 478
enumerated-type 550
function argument (see argument declaration)
implicit (in C compiler) 598
structure and union 543
variable, simple 522
declarator 489
abstract 535, 536
adding to the type chain 531
function 532
implementation 490
manipulation functions [symtab.c] 501
merge with specifier [add_spec_to_decl()] 536
processing, C compiler 528
in symtab.h 491
variable (listings) 529
decompression, of LALR parse table 369
default transition, used in compression 71
_DEFBITS 697
def_ground() 778
defined as operator (see
definition, regular 56
definitions
high-level external (listings) 549
nested variable 559
section, occs/LLama 837
def_list 545
def_list, ext_def→opt_specifiers funct_decl def_list compound_stmt 552
def_list, struct_specifier→STRUCT opt_tag LC def_list RC 546
defnext() 140
_DEFWORDS 697
delay() 255
delay
adding to IDE go mode 849
of first read until after advance 40
label generation for link until after subroutine
processed 552
delete
window, curses 779
window, delwin.c 790
block (from symbol table) 485
delimiters, in list grammars 179
delset() 691
del_set() 701
delsym() 713
delsym() 719
delwin() 779
depth-first traversal 180
dereferencing, pointer and array 603
derivation, leftmost 168
rightmost 169
derived types 490
deterministic finite automata (see DFA)
dfa() 126
DFA structure 125
DFA 57
creating from NFA (see subset construction)
finding equivalent states 132
formal definition 59
minimization, theory 132
algorithm 134
implementation 135
removing redundant states 136
represented by right-linear grammar 174
DFA_STATE
diagram, syntax 4
transition 56
DIFFERENCE (function description) 692
direct access to video memory, dv_Screen, SCREEN 757
direct addressing mode 462
directed acyclic graph (DAG) 673
directives, comparison in C-code [virtual.h] 475
preprocessor, in C-code 476
directories, used for support functions 680
direct-video
definitions (video.h) 756
functions, initialization 749
I/O functions 749
output, in IDE 846
printf() 750
select mode at run time 772
disable scrolling, curses 778
disable_trace() 565
disambiguate lexemes, with lookup tables 828
discard() 88
discard_link_chain() 548
discard_value() 585
disk access, sectors 37
disk organization 37
dispatch table 646
DISPLAY 756
display, character (direct video) 751
lexeme, IDE 852
pages, video 749
string (direct video) 751
window, showwin.c 792
distinguished 132
_DIV_WSIZE(x) 697
do_dollar() 406
do_enum() 550
dollar attributes, subroutine that translates them to
stack references 406
translation in occs/LLama-generated parser
(see $)
do_name() 594
doo dah, doo dah 107
do_patch() 401
do_struct() 611
dot, in LR items 356
operator in regular expression 54, 817
doubly linked list 718
do/while 643
driver_1() 745
driver_2() 745
driver, subroutine to copy template from file to
standard output 745, 748
state machine driver 59
driver subroutine, output by occs/LLama 319
dst_opt() 634
Dtran 126
dummy const_expr production, used for grammar
development 530
duplicate (field in symbol structure) 486
duplicate set (function description) 691
duplicate symbols (hashing) 483
in tree-based symbol table 481
dupset() 691, 701
dv_clr_region() 750
dv_clrs() 750
dv_ctoyx() 751
dv_freesbuf() 751
dv_getyx() 751
dv_incha() 751
dv_init() 749
dv_outcha() 751
dv_printf() 750
dv_putc() 750
dv_putchar() 751
dv_putsa() 751
dv_puts() 751
dv_replace() 751
dv_restore() 751
dv_save() 751
dv_Screen 757
dv_scroll_line() 752
dv_scroll() 752
dynamic
array 456
link 472,803
temporary-variable creation 573
ebss 457
ECHO 73
echo(), Echo characters, (high-level description) 111
ECHO (IfX print lexeme) 821
e_closure() 117
edata 457
edge, in graph 57
edge, field in NFA 84
edge 89
edge, ε (see ε edge)
eeyore 553
effective-address and indirect modes combined
&WP(&p) 463,465
egrep, implementing with NFA interpreter 113,
119
elements, specifying number of in a list 178
elimination, common-subexpression 672
redundant row and column 65
ellipsis, removing in the UNIX environment 684
ELSE, has precedence 511
else, dangling 378
trailing (in multiple-statement macro) 64
eMark 37
emit, labels to l i n k 555
EMPTY 85
empty, character classes 818
sets 694
statements 637
stack, test for 687
string 52, (see also ε)
enable scrolling, curses 778
enable_trace() 565
encapsulation 492
END 37
end, back 2
End_buf 37
end of file, detecting in input system:
NO_MORE_CHARS () 43
processing by IfX, yywrap () 80
end-of-function processing 555
end of input 44
marker 173
end-of-line anchor 54, 817
problems with 104
end_opt() 289
ENDP, and SEG() 457
endwin(), terminate curses, (high-level description) 776
enlarge set (function description) 691
ENTER 87
enumerated-type declarations 550
enumerator-list element 493
Enum_val 550
environment, search for file along path in environment string 745
VIDEO 772, 846
Eof_read 43
EOF (see: end of file)
EOS 95
EPSILON 85, 89
epsilon, see
equivalent states, in DFA 132
erase() 783
erase, entire window, werase.c 794
from cursor to edge of window, wclrtoeo.c 793
window, curses 783
ERR 784
Errmsgs 88
error handler, stack macros, (macro, high-level
description) 687
error marker, in occs-generated parse table (YYF)
388
error message, print and exit 731
processing by IfX 88
cascading 348
error recognition, FIRST sets 23
error recovery, in a bottom-up parser 348
in bottom-up parser 348
algorithm used by occs 401
in lexical analysis 34
in LLama 883
in occs-generated parser [yy_recover()] 400
in occs parser 875
in recursive descent, legal_lookahead() 23
in top-down parser 201
detection not affected by merging LR(1) into LALR(1) 367
panic mode 348
error transitions, in LR parse table 356
esc() 726
escape character (\) 55
escape sequence, recognized by esc() 727
LEX 817
translate to binary 726
etext 457
etype (field in symbol structure) 488
evaluation, order of, as controlled by grammar 181
executable image 456
executed first, list-element production 178
expand_macro() 93
exponentiation, string 53
expr() 101, 104
expression, constant (in test) 641
grammars 180
processing, high level 636
with rvalues and logical lvalues 581
with rvalues and physical lvalues 579
by C compiler 572
in a C statement 637
regular (see regular expressions)
ext_decl, ext_decl_list→ext_decl 531
ext_decl, ext_decl_list→ext_decl_list COMMA ext_decl 531
ext_decl_list→ext_decl 531
ext_decl_list→ext_decl_list COMMA ext_decl 531
ext_def→opt_specifiers funct_decl def_list compound_stmt 552
external 458
external definitions, high-level (listings) 549
F 124
factor() 107
factoring, left 219
failure transitions, IfX F 124
in yylex(), yylastaccept 80
far heap 756
far keyword 756
pointer 684
fatal error message handling 731
ferr() 731
_ffree() 756
field definitions, structure 545
<field>, in yyout.sym 867
field-name generation, occs 860
$<field>N, occs 863
figure_local_offsets() 562
file, copy or move 741
examine from IDE 850
organization, in C-code 476
search for file along path in environment string
745
FILL 692, 698
find next set element, (function description) 692
findsym() 713, 720
trick to simplify calling interface 714
finished-states list 415
finite automata, (see state machine)
deterministic (see DFA)
nondeterministic (see NFA)
first_in_cat() 106
FIRST set 23, 213
code to compute 305
in yyout.sym 867
of a right-hand side 215
first_sym() 289
fix_dtran() 136
flies, fruit
floating-point, state machine that recognizes constants 61
floccinaucinihilipilification 247
flush, input buffer 37, 44
_fmalloc() 756
folding, constant 633, 659
FOLLOW set 215
code to compute 307
used to resolve conflicts in LR state machine
361
for 643
force 44
foreground/background color, curses 778
formatted print to window, wprintw.c 800
formatting, occs/LLama input-file 841
four-pass compiler, structure 1
four armed, Shiva 79
fp, in C-code virtual machine 452
fputstr() 732
fragmentation, heap 457
frame pointer 469
frame, stack (see stack frame)
free_item() 420
free_name() 352
freename() 26
free_nfa() 116
free_recycled_items() 420
free_sets() 130
freesym() 712
freesym() 716
frito 535
front end 447
front-end considerations 563
fruit flies 171
full, test for stack (macro, high-level description)
687
funct_decl 532
funct_decl, ext_def→opt_specifiers funct_decl
function, access 42
function-argument, declarations 532
definitions, K&R-style 555
processing, non_comma_expr 612
processing, listings 547
function body, compound_stmt 559
function-call processing 611, 616
function calls, and temporary-variable allocation
573
function declaration 552
function declarators 532
function definition, compiler output 554
function, end-of (processing) 555
function prototypes, create with Microsoft C, /Zg
switch 689
translating to extern with P() macro 683
functions, conversion (in library) 726
intrinsic 657
print (in library) 731
video I/O 746
Funct_name 552
G
- g , occs/LLama command-line switch 844
gap size, in Shell sort 741
Gaul 478
_GBIT 698
generate() 352
generate code, from the syntax tree 670
generic grammars 192
gen_false_true() 603
gen() 564, 565
getch() 781
get_expr() 95
get_prefix() 588
get_unmarked() 130
getyx() 781
globals.h 86
global variables, putting declaration and definition
into same file 86
used by IfX 86
glue functions, from curses to video I/O 771
goal symbol 168, 338
go command, IDE 850
goes to operator (see
Goto (LR parse tables) 412
goto, processing by compiler 638, 640
in C-code 474
C style note 23
vectored (see switch)
goto entries, in LR parse table 355
goto transitions, in LALR parse table 372
grammar 52
associativity in 175
attributed 187
augmented and attributed 191
C 806
context-free 166
controlling order of evaluation 181
development, dummy const_expr 530
for regular expressions 87
LALR(1) 365
LEX 88
list 175
LL(1) grammars cannot be left recursive 226
LR(0) 354
modified by occs for LALR(1) grammars 401
modifying 218
modifying imbedded actions for bottom-up
parser 354
Q 222
recursion in 175
recursive 180
right-linear 174
S 173
ambiguous 182
ambiguous (see ambiguous grammars)
attributed 186
augmented 183
expression 180
LALR(1) 337
LL(0) 211
LL(1) 211
LR(1) 361
occs ambiguous 856
precedence built into 180
SLR(1) 361
translating to syntax diagram 13
using to recognize sentence 8
transformations, occs 864, 867
graph, syntax (see syntax diagram)
greedy algorithm (matches longest string) 60
disadvantages 62
NFA interpreter 114
as used by IfX 818
grep, implementing with NFA interpreter 113
ground() 778
group, field in DFA_STATE 126
grouping, in regular expression 55, 817
Groups[] 135
handle 169
detecting with state machine 343, 344
hard-coded scanners 50
hardware problems, v o l a t i l e 664
hash_add() 715
hashing 710f., 723
addition and hashpjw 483
adding symbols 718
addition 723
delete 719
demonstration program 710
finding symbols 720
hash_pjw() 723, 724
hash-table elements 716
hash table 482
high-level descriptions 712
implementation 715
occs STATE [state_hash()] 420
printing
functions (see also hashpjw and addition)
hash.h 712
hashpjw, 16-bit implementation 483, 484
portable implementation hash_pjw() 715, 723
HASH_TAB 716, 717, 722
hash table, picture 718
hash value 483
find for set (function description) 691
header comment, -h, -H (IfX) 826
Heap 416
heap, combined with stack 456
fragmentation 457
segment 456
hidewin() 779
hide window, curses 779, 790
high byte, accessing without shift 685
High_water_mark 577
hobbit 535
horizontal stacks, specifying in IDE log file 850
I() 455
IBM/PC, box-drawing characters box.h 787
IBM/PC, video 746
character attribute bits 750
video I/O functions (see video I/O)
IDE 845
breakpoints 848
change stack-window size and specify input
file to debugger. 853
command-line processing 853
debugger hooks 847
debugger hooks for Ctrl-A, Ctrl-B. 263
debugger hooks, sources 514, 518
debug-mode commands 847
display lexeme 852
enable (occs/LLama -D) 843
error output function: yyerror() 250
examine-file command 850
go command 850
go mode, adding delay 849
initial debug screen 847
initialization 247
initialize debugging 246
(Interactive Debugging Environment) 242
(Interactive Debugging Environment) 845
log all output to file 850
log file, create without window updates 850
log file, specify horizontal or vertical stacks
850
main control loop [delay()] 255
noninteractive mode 850
parsing command-line arguments 246
printing occs value stack 864
print message to prompt window 855
print value stack in C compiler debugging
mode 514
print value-stack item under IDE 853
production breakpoints 264
quit command 851
redraw stack window 851
save screen to file 851
single step command 847
specify input file 850
stack breakpoint 255
stack-window size, changing from the command line 846
update stack window, yy_pstack() 255
window names 846
identifier processing, in declaration 528
creating values for 599
in C-code 450
undeclared 598
identity element (ε) 53
if/else 638
ambiguity in 219
processing 637
statements, nested 639
Ifile 95
IFREE 754
Ifunct 100
ii_advance() 42
ii_fillbuf() 47
ii_flush() 44
ii_flushbuf (needed by IfX) 824
ii_input (needed by IfX) 824
ii_io() 40
ii_io (needed by IfX) 824
ii_look() 47
ii_lookahead() 50
ii_lookahead (needed by IfX) 824
ii_mark_end() 42
ii_mark_start() 42
ii_move_start() 42
ii_newfile() 40
ii_newfile (needed by IfX) 824
ii_ prefix 825
ii_pushback(n) 47
ii_term() 48
ii_unput() 50
ii_unterm() 48
IMALLOC 754
imbedded actions 354
cause shift/reduce conflicts 354
in occs 864
translated by occs, example 536
translation by occs 406
immediate, addressing mode 462
implicit (field in symbol structure) 488
implicit declarations 598
type conversion 620
inadequate states 360
might be added in LALR(1) machine 367
INBOUNDS() 686
inch() 782
include files, used by support functions 681
incop() 603
indirect addressing modes 463
combined with effective address &WP(&p) 463, 465
prefixes, in C-code [c-code.h] 465
in_dstates() 129
Ingroup[] 135
inherited attributes 187
in top-down parser 203
init() 772
init_groups() 135
initialization
actions, done in nonrecursive list element 524
C compiler (listings) 515
input system 41
list-element production that is executed first
178
LLama user supplied 855
LLama value stack 884
occs/LLama user supplied 853
productions, C compiler (listings) 524
curses (initscr.c) 791
direct-video functions 749
link (decl.c) 526
initializers, symbol.args 532
initscr(), Initialize, (high-level description) 776
Input 100
input 32
input() 73
input_buffer 15
Input_buffer 95
input buffer (figure) 38
flush: ii_flush() 44
load: ii_fillbuf() 47
LEX 95
LEX 95
input.c 39
input() (IfX character-input function) 820
input file, IfX 814
open new: ii_newfile() 40
specify in IDE 850
input, that doesn't match a IfX regular expression
819
input, untranslated (binary) 40
input functions, IfX low-level 96
input pointer, advancing: ii_advance() 42
input routines, changing low-level ii_io() 40
input system 35
access routines 43
design criteria 36
example 36
initialization 41
macros and data structures 39
marker movement 43
organization: buffers and pointers 37
systems 35
simple buffered 15
\0-terminated strings 47
instruction, optimization 659
output and formatting 566
two-address, three-address 447
int86() 754
int, convert to pointer with double cast 688
integer, computing largest positive with macro
115
constants, C compiler 593
constants, used for enumerated types 550
largest (MAXINT) 685
intensity, changing on screen 749
interaction, between optimizations 663
between scanner and code generator via symbol table 529
between scanner and code generator via symbol table (problems) 536
interactive debugging environment (see IDE)
interface, lexical analyzer to parser 33
intermediate languages 5,446
advantages, disadvantages 6
converting postfix to a syntax tree 668
postfix used by optimizer 667
interpreters 6
interpreting an NFA, theory 113
implementation 115
INTERSECT (function description) 692
intrinsic functions 657
invert() 691
INVERT() 693, 695
ip, in C-code virtual machine 452
is defined as operator (see
IS_DISJOINT (function description) 693
IS_EMPTY (function description) 693
IS_EQUIVALENT (function description) 693
IS_INTERSECTING (function description) 693
ITEM 410, 420
comparison, item_cmp() 420
management, occs 420
item_cmp() 420
item 364
LR(0) 356
LR(1) 362
adding to LR states 357
I(x) 86
jump table 478,647
kernel 357
key 479
detect key pressed under UNIX 246
keyboard status, getting under UNIX (SIGIO) 246
Kleene closure (*) 53
K&R-style function-argument definitions 555
-l, switch to occs/LLama 844
L token (IfX) 95
label, argument to link 552
label}i 554
labels 640
for true/false targets 602
numeric component of goto 641
alphabetic components of 554
in C-code 474
to link, emitted 555
ladies, Camptown 107
LALR(1) grammar 337, 365
LALR(1) parse table, size 368
manufacturing 368
subroutines to generate 408
sample in occs-generated parser 389
LALR parser, simplified parser loop 373
LALR(1) state-machine description, output by
occs/LLama -v 845, 869
language 52, 166
intermediate 446
intermediate (see intermediate languages)
LARGEST_INT 115
largest integer, MAXINT 685
largest positive integer, LARGEST_INT 115
LASTELE() 686
layers, symbol table 479
LC, struct_specifier→STRUCT opt_tag LC def_list RC 546
LEAVE 87
%left 856
left associativity, fudging with a right-recursive
grammar 176
left factoring 219
left-hand side, 166
as subroutine 18
%left (in Table E.5) 857
leftmost derivation 168
left recursion 10, 175
eliminating from grammar 226
problems with 11
legal_lookahead() 23
lerror() 289
level (field in symbol structure) 486
level, nesting 530
IfX 812
algorithm used by output state-machine driver
62
anchor processing 101
anchors 817
change bad-input error message (YYBADINP)
823
character classes 818
choose an alternate driver file, -m 826
closure 107, 818
code part rule 820
command-line switches 825
compressed transition matrix 65
compressing tables 140
concatenation 104, 817
conflicting regular expressions 819
copy template file from file to standard output
745
debugging output routines 74
DFA transition matrix, Dtran, Nstates 126
end-of-file processing, yywrap() 80
error-message processing 88, 90
escape sequences 817
global variables 86
grammar 88
grouping 817
header comment, -h, -H 826
initialize (yy_init_lex) 822
input buffers 95
input character (input()) 820
input file organization 814
input line number (yylineno) 820
input specification used by LLama and occs
271
input that doesn't match a regular expression
819
keep processing (yymore) 822
lexeme (yytext) 820
lexeme length (yyleng) 820
lexical analyzer 95
LIB environment 826
limits and bugs 826
literal characters 95
local variables in actions 820
low-level input functions 96, 824
macro support 93, 101, 818
main() 820, 821
matching normal characters, escape sequences
817
memory management 88
metacharacters 816
multiple-line actions 95
open new input file (yywrap) 823
operator precedence 819
OR operator 104, 817
output character (output) 821
output, files 815
output, state-machine description,
parenthesized subexpressions 107
parser 101
printing tables 140
print lexeme (ECHO) 821
problems with implicit concatenation operator
106
push back character (unput) 821
push back several characters (yyless) 822
regular expressions 816
rules section 816
single-character matches 108
single characters 107
source code 81
stack used for state allocation 88
state-machine driver 73,75
string management 89
subroutines and variables for use in actions
820
suppressing #line directives, -l 826
table compression, -c, -f 825
template file, organization 63
unary operators (closure) 108
uncompressed transition matrix 65
UNIX-compatible newline, -u 826
use standard output, -t 826
using with occs 812
verbose-mode output, -v, -V 826
YYERROR 821
Lexeme 100
Lexeme 3, 33
boundaries: ii_mark_start(), ii_mark_end() 42
display under IDE 852
markers, pMark, sMark, eMark 37
termination: ii_term() and ii_unterm() 48
lexical analyzer 3, 32
C compiler 518
error recovery 34
lexical-analyzer generator (see IfX)
interaction with code generator via symbol
table 529
interaction with code generator via symbol
table (problems) 536
interface to parser 33
IfX's own 95
advantages of independent phase 33
state-machine driven 60
using your own under occs/LLama 855
lexio.c 74
lex.par 63
lexyy.c 63f.
l.h 689
LHS (see left-hand side)
LIB, environment variable (occs/LLama) 844,
IfX 826
lifetime analysis 673
limits and bugs, IfX 826
limit scope, PRIVATE 42
linear optimizations 658
#line directives, suppress under occs/LLama (-l) 844
Lineno 95
line wrap, curses 778
linkage, subroutine 467
link, dynamic and static 472
linked list, used to resolve collision, in symbol table 481
link instruction in C-code 471
instruction, argument (label) generation 552
link structure 492
in declaration chain 493
initialize (decl.c) 526
labels, emitted 555
merge (decl.c) 526
pointer to (used as an attribute) 524
links, cross (see cross links)
list, doubly linked 718
enumerator 493
finished-states list (occs) 415
grammars that describe lists 175
nonrecursive element of list reduced first 341
in a bottom-up parser 340
right recursive use a lot of stack 343
with delimiters 179
unfinished-state (occs) 416
literal characters, IfX 95
LL grammar 170
LL(0) grammars 211
LL(1) grammars 211, 275
cannot be left recursive 226
parser, implementing 229
SELECT set 217
LLama 883 (see also occs/LLama)
$$, $1, etc. 883
$ attribute processing 332
changing action taken by the parser on an error
883
command-line processing 322
compilation directives and exit stati 276
computing FIRST, FOLLOW, SELECT sets
304-311
745
creating parse tables 304
%directives and error recovery 883
error functions 328
implementation 270
input file 282
interactive debugging environment sources
yydebug.c (see IDE)
internal lexical analyzer 271
internal parser 270
internal symbol table 275
internal token set 271
LEX input specification 271
LLama input file describing its own parser
main () 322
numeric limits for token values 277
output driver subroutine 319
output file discussed 229
output functions, 278
output functions 328
output the token definitions 321
parser action: add_synch() 289
parser action: add_to_rhs() 289
parser action: end_opt() 289
parser action: first_sym() 289
parser action: lerror() 289
parser action: make_term() 289
parser action: new_field() 289
parser action: new_lev() 289
parser action: new_nonterm() 290
parser action: new_rhs() 290
parser action: output() 290
parser action: prec() 290
parser action: prec_list() 290
parser action: start_opt() 290
parser action: union_def() 290
parse-table creation 311
parse tables 210
parse tables, pair compress 843
printing the value-stack 885
recursive-descent grammar 282
recursive-descent parser 282
sample input file 885
sign-on message 333
Statistics functions 328
symbol table (data structures) 278
%synch directive 883
token definitions 276
top-down attributes 883
typing LLama's value stack 884
uncompressed parse tables 843
user supplied initialization 855
value stack 884
value stack, yyvstype 205
LLama-generated parser
debugging functions and macros 235
numerical limits of tokenized input symbols
229
output streams 235
parse- and value-stack declarations 229
parse-table declarations 229
symbol stack 235
symbol-to-string conversion arrays 235
YYDEBUG 235
yyparse() 242
l.lib, subroutines for LEX actions to use 820
llout.sym 885
load, input buffer: ii_fillbuf() 47
local static name conflicts 486
local variables 547, 559
automatic variables, handling at run time 467
in LEX actions 820
in occs/LLama actions 842
code that processes 547f
static 563
location counter 458
loc_var_space() 563
log file, create without window updates 850
logical AND operator (&&) 624, 626
logical complement of set, COMPLEMENT() 695
logical lvalue 580
logical OR, at the top level in Thompson's
construction 81
and auxiliary stack (listings) 622
in C compiler 622
logical right-shift directive, in C-code 473
log output to file, IDE 850
_long 491
long 493
longest, string (see greedy algorithm)
word in English (see floccinaucinihilipilification)
Lookahead 17
lookahead 11, 15-17, 35
function, ii_look() 47
set, LR(1) 361
used to compute next state in finite automata
57
lookup tables, disambiguate lexemes 828, 833
in hard-coded lexical analyzer 50
loops 642
low-level input functions, LEX 96
changing: ii_io() 40
LR(0), grammar 354
item 356
LR(1), grammars 361
item 362
lookahead set, (see lookahead set)
LR grammar 170
LR item, management by occs 420
internal occs representation 410
LR parsers, advantages and disadvantages 337
LR parse tables
decompression 369
goto entries 355
nonterminal edges 355
Action and Goto components 345
adding ε productions to 359
creating 357
creating (theory) 354
shift entries 356
representing 368
subroutines to create 423f.
use FOLLOW set to resolve conflicts 361
terminal edges 356
LR state, internal occs representation 411
LR state tables (and LALR), representing 368
LSB at lowest address in C-code virtual machine
461
lvalue 578
use &WP(&_p) to access 599
logical vs. physical 580
implementation 583
logical and physical [how handled by
rvalue()] 588
summary 582
lword 450
-m, switch to occs/LLama 844
machine () 101
machine, state (see state machine)
machine, virtual (see virtual machine)
macro
computes largest positive integer 115
expansion, LEX 101
expansion, LEX 818
8086 684
conditional-compilation for LLama and occs
275
in C-code 475
LEX support 93
made by -a and -p switch 876
multiple-statement 786
set 697
set-manipulation (see set macros)
stack maintenance 686
stack (customization for occs output file) 386
that modify the occs/LLama parser 840
video-BIOS 756
_main(), translate to main() 476
main(), C compiler 514
LEX 155, 820, 821
maintenance layer, in symbol table 479
make_dtran() 130
make_icon() 593
make_implicit_declaration() 598
maketab() 712, 717
make_term () 289
make_types_match () 627
map, bit 690
map, of temporary variables on run-time stack:
Region[] 577
marker movement 43
mark lexeme boundaries: ii_mark_start(),
ii_mark_end() 42
match() 17
match, any character 817
matches, multiple 55
MATCH ( t ) 100
matrix, transition 60
max () 686
maximum values 686
MAXINT 685
MAXLEX 37
MAXLOOK 37
MEMBER () 693, 69^
vs. TEST() 695
membership, test on complemented sets,
MEMBER() vs. TEST() 695
memiset() 745
memory alignment (see alignment)
memory
allocation 559
fill with integer value 745
management, LEX 88
occs 412
symbol-table functions [symtab.c] 497
organization, in C-code virtual machine 455
memory-mapped screens 749
merge, declarator and specifier
[add_spec_to_decl()] 536
link (decl.c) 526
multiple productions into single subroutine
21
metacharacter, in regular expression 54
escaping (quote marks) 55
in LEX 816
MGA 749
Microsoft C, /Zg switch creates function prototypes
689
Microsoft link, /NOE switch avoids name conflicts
781
min() 686
MINACT 276
min_dfa() 135
minimize() 135
minimization, DFA 132
minimum values 686
MINNONTERM 276
model state machine with arrays 57
modes, addressing (see C-code, addressing modes)
462
modify the behaviour of the occs/LLama parser
840
module, relocatable object 459
_MOD_WSIZE(x) 697
MONBASE 756
move() 117, 781
movedot() 420
movefile() 741
move, file 741
input marker 43
set 117
start marker: ii_move_start() 42
window, curses 779
window, mvwin.c 791
window-relative cursor and read character,
winch.c 794
window-relative cursor, wmove.c 800
MS() 683
MS-DOS, end-of-line problems in anchors 104
MS-DOS-specific code 683
multiple accepting actions, handled by LEX 117
multiple-line actions, LEX 95
multiple right-hand sides, a n 192
multiple-statement macros 786
trailing else in 64
multiple tokens, Tn 192
mvinch () 782
mvwin () 779
mvwinch () 782
mvwinr () 779
name, conflicts of local-static names 486
name (field in symbol structure) 486
NAME, has precedence 511
{name} in LEX regular expression 818
name→NAME 530
NAME, new_name→NAME 528
<name>, occs 861
names, LEX (start with yy, Yy, YY, ii_) 825
NAME, unary→NAME 594, 613
$N, attribute notation for top-down grammar 205
NBITS () 685
NCOLS 141
near pointer 684
need 47
negative, attribute ($-1) 860
character class, add \n to 55
nested, parentheses (recognize with state machine)
196
subroutines, Pascal 804
variable definitions 559
Nest_lev 530
___NEVER___ 63
new() 88, 412
new_field() 289
newitem() 420
new_lev() 289
newline, nonstandard definition in LEX 817
new_macro() 93
new_name() 352
new_name→NAME 528
newname() 26
new_name, var_dcl→new_name 530
new_nonterm() 290
new_rhs() 290
newset() 691, 699
newstate() 417
newsym() 712, 716
new_value() 585
newwin(), Create window, subwindow, (high-level
description) 777
next2, field in NFA 84
Next_allocate 416
next field, in NFA structure 84
in symbol structure 488
next_member() 692, 708
next(N,c)=M 57
next set element, (function description) 692
next state, finding 80
nextsym() 713, 720
Nfa 116
nfa() 116
NFA structure 84, 88
NFA 58
conversion to DFA (see subset construction)
converting regular expression to (see
Thompson's construction)
formal definition 59
interpretation, implementation 115
print to standard output 110
using directly to recognize strings 113
NFA_MAX 86
Nfa_states[] 88, 116
nl(), Map newlines, (high-level description) 777
nocrmode(), buffered input, (high-level
description) 776
node 669
node, terminal (see terminal nodes)
noecho(), Do not echo, (high-level description)
777
/NOE switch, avoids name conflicts 781
NO_MORE_CHARS() 43
%nonassoc 856, 857
nonconstant dead assignments 663
nondeterministic finite automaton (see NFA)
nongreedy algorithm (matches first string) 62, 114
noninteractive mode, IDE 850
nonl(), Do not map newlines, (high-level
description) 777
nonterminal 7, 167
edges, in LR parse table 355
names, occs/LLama 839
how they get onto bottom-up parse stack 344
nullable 214
attributes for 187
form interior nodes in parse tree 9
unreachable 219
nosave(), (high-level description) 779
NOT operator (!), in C compiler 602
nouns 490
Nstates 126
nullable, nonterminals 214
productions 218
null, set 694
string 52
number of elements, in set 691
on stack 687
number, production (in yyout.sym) 868
NUMCOLS 756
NUMELE() 686
num_ele() 691, 701
numeric, component of label 641
values for symbols 208
Numgroups 135
NUM_OR_ID 14
NUMROWS 756
object module, relocatable 459
occs/LLama 836
actions, local variables 842
alternative parser templates (-m) 844
code section 842
compiling for debugging 845
create parser only (-p) 844
debug mode (see IDE)
definitions section 837
ε productions 841
generate action subroutine only (-a) 843
input file 837
input-file formatting conventions 841
internal names 839
#line directives (-l) 844
making private variables public (-g) 844
modified BNF 839
nonterminal names 839
output files 842
output functions, bss-segment 855
output functions, code-segment 855
output functions, comments 855
output functions, data-segment 855
parser, modifying behaviour of 840
parser subroutine 853
parser subroutine (see yyparse())
parser template placed on path in LIB
environment 844
production vs. debug mode 845
representing productions 839
return and break statements in action 841
rules section 839
rules section for expression compiler (expr.y)
841
send large tables to yyouttab.c (-T) 844
send output to stdout (-t) 844
supply your own lexical analyzer 855
suppress warning messages (-w) 845
symbol-table dump (-s) 844
token names 838
useful subroutines and variables 852
useful subroutines and variables 854
user supplied initialization 853
verbose-mode (-v, -V) 845
occs, output file 381
occs.par 385
occs 856
$$=$1 default action 858
$$, $1, etc. 858
adding reductions to tables, yystate.c 431
algorithm used by occs to create LALR state
machine 415
ambiguous grammars 856
-a output (file header) 410
attributes 858
avoiding cascading error messages,
YYCASCADE 875
change type of the value stack 858
745
%directives and comments 857
error recovery 348
error recovery 875
$<field>N 863
finished-states list 415
grammatical transformations 864, 867
high-level table-generation subroutines,
yystate.c 423
hints and warnings 881
imbedded actions 864
imbedded actions translated (example) 536
interactive debugging environment (see IDE)
interactive debugging environment sources
yydebug.c (see IDE)
internals 401
ITEM management 420
lexical analyzer (see LLama, internal lexical
analyzer)
low-level table-generation subroutines,
yystate.c 425
memory management 412
<name> 861
negative attributes ($-1) 860
optional subexpressions [...] 865
output files 815
output file (see occs-generated parser)
parser action (see LLama, action)
%prec 872
print subroutines, yystate.c 434
putting parser and actions in different files 875
repeating subexpressions [...]* 867
representation of LR items 410
representation of LR parse tables 412
representation of LR states 411
resolving shift/reduce and reduce/reduce
conflicts 871
sample input file 879
specifying precedence and associativity 856
stack macros in yystack.h 688
STATE comparison [state_cmp()] 420
state data base 417
STATE hashing [state_hash()] 420
state management 415
statistics subroutines, yystate.c 433
subroutines that create auxiliary tables,
yystate.c 437
subroutines to create LR parse tables 423f.
subroutines to generate LALR(1) parse tables
408
symbol-table dump 867
symbol-table (see LLama, symbol table)
%%, %{ %}, %token, %term, comments (in
Table E.5) 857
translating dollar attributes 406
translating imbedded actions 406
%type 861
unclosed-state list 417
unfinished-states list 416
%union 860
%union fields 406
using with LEX 812
value stack 858
occs-generated parser
action subroutine 388
$ attributes (see $)
compressed parse tables 393
error marker in tables 388
error recovery 400
reduce subroutine [yy_reduce()] 400
shift subroutine [yy_shift()] 396
stack initialization 400
symbol stack: [Yy_dstack] 394
table decompression [yy_next()] 393
the occs output file 381
the parser itself 400
OFF() 684, 686
Offset 562
offset 684
isolating from 8086 address 685
to structure fields 546
to local variable in stack frame 562
one's complement operator (~), in C compiler 602
on_ferr() 732
opcode 459
open, new input file: ii_newfile() 40
operation, internal set function for binary
operations 691
operator
assignment 620
conditional 619
precedence, LEX 819
precedence, in regular expression 56
arithmetic in C-code 473
unary in grammar 181
unary (see unary operators)
optimization 657
common-subexpression elimination 672
constant folding and propagation 659
dead code and variables 660
example of peephole 665
instruction 659
interaction between 663
linear 658
parser 657
peephole 658
register allocation and lifetime analysis 673
structural 667
strength reduction 658
productions, (see production, optional)
subexpressions, occs [...] 865
opt_specifiers 524
opt_specifiers, extdef>optspedfiers functdecl
opt j a g > 543
opt_tag, struct_specifier→STRUCT opt_tag LC
def_list RC 546
order of evaluation, as controlled by grammar 181
OR (see also |)
operator (||), in C compiler 622
operator, LEX 817
processing, LEX 104
output() 73, 290
output files, LEX and occs 815, 842
output function, error 250
output
IDE direct video 846
LEX, output-character function 821
log to file from IDE 850
streams, LLama 235
streams, occs (code, data, bss) 552
OX() 275
-p, LEX command-line switch 876
-p, occs/LLama command-line switch 844
*p++;, generate code for 609
(*p) ++;, generate code for 610
*++p;, generate code for 610
++*p;, generate code for 611
p++[3];, generate code for 612
++p[3];, generate code for 613
_P() 566
pack variables, don't if back end is used 563
padding, in structure field 547
pages, video display 749
pair compression, compression ratio 72
compressed tables 141
in LALR(1) parse tables 368
setting threshold 72
transition matrix 68
LLama parse tables 843
pairs() 141
panic-mode error recovery 348 (see also error
recovery)
in occs parser 875
parentheses, in regular expression 55
in subexpression, handled by LEX 107
parse_err() 88
parse and parsing 3 (see also occs, LLama)
bottom-up 169, 337
top-down 169, 195
generate trace from IDE 850
parser
configure parser for C-compiler 509
create only (-p) 844
generation 408
generator, top down (see LLama)
generator, bottom up (see occs)
how a parser decides which productions to
apply 212
implementing LL(1) 229
interface to lexical analyzer 33
LEX 101
LLama's internal 270
occs-generated (see occs-generated parser)
occs/LLama-generated (see occs/LLama
parser) 840
optimizations 657
predictive 201
put occs parser and actions in different files
875
LR (see LR parsers and bottom-up parsing)
recursive descent 198
table driven with augmented grammars 202
subroutine, occs/LLama 853
*
template, placed on path in LIB environment
844
visible (see IDE)
parse table
code to make top-down 311
creating LR(1) 362
created by LLama 304
generated by LLama 210
how bottom-up parse table relates to state
diagram 355
internal occs representation 412
LLama uncompressed 843
LR (See LR parse tables)
making 213
occs compressed 393
sample in occs-generated parser 389
top down 208
parse tree 4
and ambiguous grammars 182
evolution during bottom-up parse 342
execute actions while traversing 203
implicit in recursive-descent subroutine calling
sequence 21
path to current node remembered on
recursive-descent stack 199
relation to bottom-up parse stack 340
semantic difficulties 170
Pascal compilers 802
passes, compiler 1
path, search for file along path in environment
string 745
pattern matching, greedy algorithm 60
nongreedy algorithm 62
pchar () 732
PDA (see push-down automata)
peephole optimizations 658
example 665
pgroups() 137
phase, parser 3, 4
phase problems, with yytext 242
phrase 166, 167
PHYS() 684, 685
physical address, finding on an 8086 685
physical complement of set, INVERT() 695
physical lvalue 580
piglet 553
pMark 37
pointer
and array dereferencing 603
create from int with double cast 688
frame pointer 469
near and far
problems with decrement in 8086 686(fn)
pooh, winnie der 553
pop 687
in C-code 465
pop() 688
pop_() 688
popn 687
popn() 688
popn_() 688
pop stack elements (macro, high-level description)
687
Positive closure (+) 53
postdecrement operator, in C compiler 603
postfix 448
converting intermediate code to a syntax tree
668
intermediate code (generating) 668
intermediate language, used by optimizer 667
postincrement operator, in C compiler 603
prec() 290
%prec 872
precedence
and associativity in yyout.sym 868
controlling when eliminating ambiguity from
grammar 225
LEX operator 819
regular-expression operator 56
specifying in grammar 180
specifying under occs 856
stored by occs 281
used to disambiguate non-arithmetic tokens
511
%prec (in Table E.5) 857
prec_list() 290
%prec, stored in precedence table 281
PREC_TAB 281
predecrement operator, in C compiler 603
predictive parser 201
prefix 52, 478
C-code, indirect-mode [c-code.h] 465
subroutine 552
viable 169
preincrement operator, in C compiler 603
preprocessor directives, in C-code 476
print, argv-like vector array 733
C-code virtual-machine pm () 476
DFA state groups 137
functions (in library) 731
human-readable string or character form 732
LLama value-stack 885
occs value stack from IDE 864
set 708
message to IDE prompt window 855
multiple-line comment 734
printf() workhorse function 734
set elements (function description) 692
print_array() 140
printf(), direct-video version 750
mapped to yycode() in occs debugging mode
385
workhorse function 734
print_instruction() 566
printmacs() 93
printv() 733
printw() 782
print value-stack item under IDE 853
PRIVATE 42, 682
private 457
private variables, making public under
occs/LLama (-g) 844
prnt() 734, 735
PROC, and SEG() 457
PRODUCTION 279
production 7, 166
as subroutine 18
breakpoint, IDE 264
mode, occs/LLama 845
nullable 218
numbers, in yyout.sym 868
occs/LLama (inserting code) 841
optional 8
ε 173
executed first or last 178
merging into single subroutine 21
represented in occs/LLama 839
start 357, 842
start 842
unreachable 219
program-load process 455
program prefix 455
prompt window, print message to IDE 855
proper prefix 52
suffix 52
substring 52
prototypes, function (see function prototype)
pset() 692
pseudo registers
ptab() 714, 720
ptr 450
public 458
PUBLIC 683
public variables created by occs -p or -a 876
purge_undecl() 598
push() 687, 688
push_() 688
pushback 15, 35, 48
push-down automata (PDA) 196
and recursive descent 198
used in bottom-up parsing 338
used for top-down parsing 201, 208
used to implement attributed grammars 203
push, in C-code
push stack element (macro, high-level description)
687
Q grammar 222
quads 447
without explicit assignments 447
compared to triples 448
quit command, IDE 851
quote marks, escape metacharacters in regular
expression 55
quote marks, in regular expressions 101
R 771
racetrack, Camptown 107
raise() 699
RANGE() 686
RC, struct_specifier→STRUCT opt_tag LC
def_list RC 546
read, delayed until after first advance 40
recognition, sentence 8
recognizer 25, 56, 167
record 450
record 479
activation (see stack frame)
recursion, and the stack frame 470
corner substitution 228
in bottom-up parsing 340
in grammar 175
in grammar 180
in grammar 9
left (see left recursion)
replace with control loop and explicit stack
200
right (see right recursion)
recursive descent 18
attributes compared to bottom-up parsing 348
parser
arguments 26
as push-down automata 198
for LLama 282
return values 26
use the stack 198
used by LLama 270
recycling temporary variables 575
redraw stack window, IDE 851
reduce 338, 343
reduce directives, in LR parse table 356
represented in yyout.doc 869
subroutine, yy_reduce() 400
by ε productions 359
reduce/reduce conflicts 360
resolving 380, 871
reduction, in yyout.doc state 869
strength (see strength reduction)
redundant row and column elimination 65, 146
redundant states, removing from DFA 136
references, to temporary-variable in emitted code,
T() 574
referencing, pointer and array via * or [ ] operator
603
refresh() 780
Region[] 577
register allocation, optimization 673
register set, in C-code virtual machine 451, 453,
454
registers, as temporaries 573
pseudo
regular definition 56
regular expressions 54
() operator 55
anchors (^ $)
[] (character classes) 54
closure (*+?) 55
concatenation 54
converting to NFA 81
grammar 87
grouping 55
limitations 55
{n,m} operator 55
| operator 55
operator precedence 56
conflicting (in LEX input file) 819
LEX 816
recognizing with NFA interpreter 113
simplifying to get smaller internal table size
827
reinitialize stack (macro, high-level description)
687
reject 347
relational operators, processing by C compiler 626,
628
release_value() 593
relocatable object module 459
relop() 627
REMOVE() 693, 698
repeating subexpressions, occs [...]* 867
representing productions, occs/LLama 839
restore region of screen, direct-video version 751
ret(), in C-code 466, 477
return 637, 638
in occs/LLama actions 841
return values, from yyparse() 837
Pascal 803
subroutine 472-473
reverse_links() 535
reverse-Polish notation (see postfix)
RHS (see right-hand side)
%right 856
right-hand side (RHS) 7, 167
length, Yy_rhslen 392
in yyout.sym 868
multiple a n 192
%right (in Table E.5) 857
right-linear grammars 174
rightmost derivation 169
right recursion 175
eliminating 21
gets arguments pushed in correct order 613
right-recursive lists use a lot of stack 343
right-shift directive, in C-code 473
ring buffer 50
rname (field in symbol structure) 486
_ROUND 697
ROW 124
row and column, eliminating redundant 65, 146
RPN, reverse-Polish notation (see postfix)
rule() 101
rules, for forming lvalues and rvalues when
processing * and [] 608
rules section, LEX 816
occs/LLama 839
run-time, trace 477
instructions 565
rvalue() 587
rvalue_name() 588
rvalue 578
implementation 583
integer and integer constant (value.c) 594
summary 582
-s, -S, LLama 885
save() 89, 779
save region of screen, direct-video version 751
save screen to file, IDE 851
SBUF 754, 755
scanner (see lexical analyzer)
scope, limiting: PRIVATE 42
levels 559
SCREEN 757
screen, characters and attributes 749
dimensions 756
representing, DISPLAY 756
save buffer, SBUF 755
memory-mapped 749
scroll() 783
scrollok(), Enable or disable scrolling, (high-
scroll, screen (direct-video version) 752
window, wscroll.c 801
searchenv() 745
search, for file along path in environment string
745
sector (disk) 37
seed items 357
SEG() 457, 684, 685
can't issue between PROC() and ENDP() 476
generated by C compiler 524
segment 684
bss 456
change [SEG()] 457
code 456
data 456
heap 456
isolating from 8086 address 685
not modified in 8086 far-pointer decrement
686(fn)
in C-code virtual machine 455
stack 456
segment-end markers: ebss, etext, edata 457
segmented architecture (8086) 684
select, between direct video and video BIOS 772
SELECT set, code to compute 310
LL(1) 217
translating into an LL(1) parse table 218
semantic difficulties 170
semantics vs. syntax 167
semicolon, in C-code 457
sentence 52, 166, 347
definition 4
recognizing with grammar 8
sentential form 168, 169
sequence, sub 52
SET 690, 697
setcmp() 691, 701
set, field in DFA_STATE 126
set, synchronization 201
set functions and macros 690
comparing 701
complement by marking set as negative true
695
complement by physically inverting bits 694
creation and destruction 699
deletion, del_set() 701
duplication 701
enlarging, _addset() 701
enlarging 702
example 693
FIRST (see FIRST set)
FOLLOW (see FOLLOW set)
internal representation 697
implementation 695
implementation difficulties 694
LL(1) SELECT (see SELECT set) 217
LR(1) lookahead (see lookahead set)
manipulation functions 706
null and empty sets 694
number of elements, num_ele() 701
operations 702
operations, on languages 53
print set 708
problems with complemented sets 694
test functions 703
traversal 708
set.h 690, 695
sethash() 691
_set_op() 691
_set_op() 702
_set_test() 691
_set_test() 701
_SETTYPE 695
S grammars 173
shadowing 479
Shell sort, implementation 741
theory 739
shift 338, 343
shift_amt 45
shift entries, LR parse tables 356
shift, attributes for token 877
represented in yyout.doc 869
getting high byte of number without a shift
685
shift/reduce conflicts 360
caused by imbedded actions 354
resolving 379, 871
Shiva 79
_short 491
short 493
shortest string (see nongreedy algorithm)
showwin () 779
SIGINT 512
raised by IDE q command 246
SIGIO 246
sign extension, in C-code 474
simple expressions in Thompson's construction 81
single-reduction states, eliminating 373
algorithm for parsing minimized tables 375
can't remove if there is an action 374
single step, IDE 847
singleton substitution 223
S i n p u t 100
sizeof 599
SLR(1) grammars 361
sMark 37
-s, occs/LLama command-line switch 844
sorting 738
source code, in occs/LLama definitions section
837
Spam 535
specifier 489
implementation 490
manipulation functions [symtab.c] 501
merge with declarator
[add_spec_to_decl()] 536
processing, C compiler 524
symtab.h 492
specifiers 524
specifiers→type_or_class 525
sp, in C-code virtual machine 452
Spot 4
ssort() 738
implementation 741
stack, attribute (in top-down parser) 203
auxiliary (used to process AND and OR
operators) 622
bottom-up parse lacks lookahead information
for error recovery 348
breakpoint, IDE 255
C-code, virtual machine 454
stack-manipulation directives 465
combined with heap 456
downward-growing stacks ANSI compatible
688
-element macro 687
LLama value 884
implement recursion with 200
initializing occs-generated parsers 400
in recursive-descent parsers 198
-manipulation macros 686
implementation 688
customization for occs output file 386
PDA, to implement 197
pointer, stack macros to access (high-level
description) 687
pushback 35
references, $N translated to 392
relationship between bottom-up parse stack
and parse tree 340
remember state transitions in bottom-up parse
344
segment 456
state allocation in LEX 88
symbol vs. state in bottom-up parse 347
synchronization-set 201
temporary variable 573
value (see value stack)
window, IDE 255
size, changing from the command line 846
stack-access directives, in C-code [virtual.h]
465
stack-based symbol tables 480
stack_clear 687
stack_dcl 686
stack_ele 687
stack_empty 687
stack empty, test for (macro, high-level description)
687
stack_err 687
stack frame 198, 468
advantages 470
figuring offset to local variable 562
Pascal 803
phase errors 471
used for temporary variables 574
stack_full 687
stack full, test for (macro, high-level description)
687
stack.h 686
stack_item 687
stack_p 687
standard input, reassigning 40
standard output, (LEX) use instead of lexyy.c, -t 826
Start_buf 37
start marker, moving: ii_move_start() 42
start-of-line anchor (^) 54
start_opt() 290
start production 357, 842
start state 56
start symbol 168
STATE 411
allocation 417
comparison, state_cmp() 420
state 56
accepting (see accepting state)
state,
data base 417
diagram, how bottom-up parse table relates to
state diagram 355
inadequate 360
management by occs 415
dead 140
inadequate (might be added in LALR(1)
machine) 367
single-reduction (see single-reduction states)
stack, in bottom-up parse 347
start 56
unreachable 140
tables, representing LR (LALR) 368
state_cmp() 420
state_hash() 420
state machine (finite automata) 56, 195
in bottom-up parsing 343
can't count 196
creating LR(1) 362
description, in lexyy.c 64
description, output by occs/LLama -v 845
description, yyout.doc 869
deterministic (see DFA)
driver 59
algorithm, examples 61
algorithm used by LEX 62
LEX 73, 75
and grammars 174
lexical analyzer that uses 60, 63
LR 344
modeled with array 57
nondeterministic (see NFA)
parser, used for 195
recognize handles with 343
recognize nested parentheses 196
representing 58
table compression 140
statement 637
break 642
continue 642
loops 642
compound (see compound statements)
control-flow 637
do/while 643
empty 637
expression 637
for 643
goto 640
labels 640
nested if/else 639
return 637
simple 637
switch 642, 646
tests 641
while 643
statements() 18
static link 472
static local variables 563
status, keyboard (getting under UNIX) 246
stdarg.h 724
stdout, send occs/LLama output to (-t) 844
stdscr, variable in <curses.h> 111
stol() 726
stop_prnt() 735
storage classes
in C-code 457
C-code indirect used for lvalues (&WP (&_p))
599
converting C to C-code 458
[in virtual.h] 461
stoul() 726
strength reduction 566, 658
string 52, 166
accessing line number 91
concatenation 53, 745
constants, in C compiler 599
LEX example 832
convert to long 726
created from token in occs output file 392
display with direct video 751
empty (see ε)
environment (see environment)
exponentiation 53
management, LEX [save()] 89
null 52
prefixing with int to hold line number 89
print in human-readable form 732
recognition, using NFA directly 113
write to window, waddstr.c 793
STR_MAX 86
structdef 495
STRUCT, recognized for both struct and union
543
struct_specifier 543
struct_specifier→STRUCT opt_tag LC def_list
RC 546
STRUCT, struct_specifier→STRUCT opt_tag LC
def_list RC 546
structure
representing in symbol table 496
access (op.c) 614
copying 859
declarations 543
field, alignment and padding 547
allocation and alignment 461
definitions 545
figuring offsets 546
-member access 611
representing in symbol table 494-497
sorting with ssort() 738
variable-length 716
tags 543
structural optimizations 667
subexpressions, occs optional 865
occs repeating 867
subroutine
arguments 469
attributes work like 189
handling at run time 467
Pascal 802
%
C-code 466
calling sequence mirrors parse tree 21
in LEX actions 820
models LHS, implements RHS 18
numeric part of end-of-subroutine (listing) 640
occs/LLama support subroutines 852
prefix and suffix generation 552
prefix, body, and suffix 478
return values 472
Pascal nested 804
sub-sequence 52
subset() 691, 701
subset construction 122
algorithm 124
convert NFA to DFA with make_dtran()
130
subset, find (function description) 691
substitution, corner 221
singleton 223
can create ambiguity 223
substring 52
subtraction processing 634
subwin(), (high-level description) 777
suffix 52, 478
subroutine 552
suppress line directives, -l (LEX) 826
warning messages, occs/LLama (-w) 845
switch 646
switch, coded with jump table 478
statement 642
used to execute actions 209
SYMBOL 278
symbol 486
.args 488
function arguments 554
initializers 532
.duplicate 486
.etype 488
.implicit 488
.level 486
.name 486
.next 488
.rname 486
.type 488
printing functions [symtab.c] 504
symbol
goal 338
nonterminal (see nonterminal symbol)
stack, in bottom-up parse 347
terminal (see terminal symbol)
symbolic constants 598
symbol table 478
block deletions 485
characteristics 479
code generator uses to communicate with
scanner 529
problems 536
cross links 486
database layer 479
declarator manipulation functions [symtab.c]
501
dump, LLama 885
occs/LLama (-s) 844
handling duplicate symbols 481,483
hashing 482
implementation 485
LLama 275
data structures 278
maintenance layer 479
implementation 497
memory-management functions [symtab.c]
497
modified by occs for LALR(1) grammars 401
dump occs 867
passing information to back end 563
picture of an element 487
shared between parser and lexical analyzer 33
specifier manipulation functions [symtab.c]
501
stack based 480
structures in 494
print functions [symtab.c] 504
tree-based 480
type-manipulation functions [symtab.c] 502
%synch directive, LLama 883
synchronization, set 201
tokens (LLama) 883
syntax 166
diagram 4, 13
-directed translation 168, 183
tree 4
constructed by optimizer 667
creating from postfix intermediate code
668
data structures to represent in optimizer
669
generating code from physical syntax tree
670
temporaries on 25
vs. semantics 167
synthesized attributes 187, 349
-T, send large tables to yyouttab.c, occs 844
-t, send output to stdout, occs/LLama 844
_T( ) 566
T() 574
T token 166
Tn 192
table
bottom-up parse, creating 357
compression 140
-c, -f (LeX) 825
decompression, occs 393
dispatch 646
-driven parsers, and augmented grammars 202
jump 647
lookup, used to find the number of elements in
a set 701
size, limits, LeX 826
minimizing LeX 827
pair compressed 141
parse (as generated by LLama) 210
split up into two occs/LLama output files (-T)
844
top-down parse 208
symbol 478
tag, generated for %union union 863
structure 543
target, for true/false test 602
template file, organization (LeX) 63
temporary, anonymous (see temporary variable)
temporary files for C-compiler code, data, and bss
segments 512
temporary variable 6, 184
at run time 25
on syntax tree 25
simple management strategy 26
allocation 577
in C compiler 572
create and initialize tmp_gen() 592
creation, dynamic 573
management, defer to back end 572
references: T() 574
represented by value 588
names are on value stack 350
recycling 575
stored in stack frame 574
type 574
term() 107
%term 837, 856
%term, definitions generated by 229
termcap database 774
terminal
edges, LR parse tables 356
nodes 62
symbols 7, 167
how they get onto bottom-up parse stack
344
are leaf nodes on parse tree 9
terminate current lexeme: ii_term() and
ii_unterm() 48
termlib.h 751, 754
termlib.lib 776
%term, LLama 883
TEST() 693, 698
vs. MEMBER() 695
test, bits in set 691
constant expressions in 641
directives, in C-code 474
text segment (see code segment)
tf_label() 602
thompson() 110
Thompson's construction 81
NFA characteristics 83
character classes in 85
data structures 83
implementation 83
three-address instructions (see quads)
threshold, argument to pairs() 142
threshold, setting pair-compression 72
time flies 171
tmp_alloc() 577
tmp_create() 588
tmp_gen() 592
TNODE 416
%token 837
LLama 883
occs 857
token 3,33
function to output token definitions 321
IDE, input a token 264
names, legal occs/LLama 838
C compiler 511
defining under occs/LLama 837
definitions generated by %term 229
set, choosing 3
multiple Tn 192
synchronization 201, 883
-to-string conversion, Yy_stok[] 392
tokenization 17
tokenizer (see lexical analyzer)
TOOHIGH() 686
TOOLOW() 686
top-down parsing 169, 195
actions in 208
attributes, LLama 883
attribute notation ($N) 205
attribute processing 203
algorithm 205, 211
code generation, using augmented grammar
203
error recovery 201
generating parser for 208
with PDA 201
trace, of parse (log to file from IDE) 850
trace, run-time (see run-time trace)
transformations, occs grammatical 864, 867
transition, diagram 56
58
transition matrix 60
representations 65
pair-compressed 68
transitions 56
in yyout.doc 869
translation, syntax-directed (see syntax-directed
translation)
translation, switch to if/else 645
traversal, depth-first 180
traversal, set 708
tree-based symbol tables 480
tree, parse 4
symbol-table deletions 480
syntax (see syntax tree)
triples 447
compared to quads 448
without explicit assignments 447
TRUE, problems with 783
truncate() 692
Tspace 552
TTYPE 124, 528
two-address instructions (see triples)
two-dimensional arrays, printing 140
%type 512
type 524
type chain, adding declarators 531
type conversions 592
implicit 620
operators, C compiler 601
in C-code 473
typedef, for %union 863
processing in C compiler 528
TYPE, has precedence 511
type (field in symbol structure) 488
%type 857, 861
type
manipulation, functions [symtab.c] 502
enumerated 550
of temporary variable 574
representations, example 489
aggregate (represented by physical pointers at
run time) 599
derived 490
in C-code 450
numeric coding of constrained 489
representing, theory 489
implementation 490
unconstrained 489
changing type of occs value stack 858
type specifier 524
type_or_class→type_specifier 525
unary operators 593
minus (-), in C compiler 602
pointer dereference operator, 606
in grammar 181
unary→NAME 594, 613
unboxed() 778
unclosed-state list 417
uncompressed tables, LeX 140
uncompressed transition matrix (LeX) 65
unconstrained types 489
undeclared identifiers 598
underscores, used to avoid name conflicts in C-code
467
unfinished-states list 416
ungetc(), inappropriate
uninitialized data segment (see data segment)
union (set operation), applied to languages 53
UNION (set function) 692
%union 860, 857
example 862
<field>, in yyout.sym 867
handled by occs 406
used for C compiler's value stack 509
union declarations 543
union_def() 290
UNIX-compatible newline, -u (LeX) 826
UNIX, getting keyboard status (SIGIO) 246
UNIX-specific code 683
unlink, in C-code 471
unput() 73, 821
unreachable productions 219
elimination 220
unreachable states 140
unterminate current lexeme: ii_term() and
ii_unterm() 48
untranslated input 40
USE_FAR_HEAP 756
useless code 662
UX() 683
-v, -V occs/LLama command-line switch 845
V 771
va_arg() 725, 726
va_end() 725
VA_LIST 684
va_list 725
value: lvalue and rvalue 578
implementation 583
value structure 583
.sym 584
.type 584
.etype 584
.is_tmp 583
.lvalue 583
.name 584, 599
.offset 583
for identifier 599
for temporary variable 588
value stack
top-down parser 203
sample parse with 207
sample top-down parse with 207
bottom-up parser 348
change type of (occs) 858
LLama 884
maintained by bottom-up parser 352
occs 858
print under IDE 853
print, C compiler [yypstk()] 514
top-down parser 203
%union used for C compiler's 509
var_decl→new_name 530
var_decl 530
variable allocation, in C-code 460
variable
declarations, in C-code 457
C (simple) 522
declarators (listings) 529
definitions, nested 559
names, in C-code 450
anonymous temporary (see temporary variable)
dead 660
handling automatic at run time 467
local (see local variable)
useful to occs/LLama 852
temporary (see temporary variable)
variable-argument mechanism, ANSI 724
variable-length structures 716
va_start() 725
VAX, assembly language forward references 552
VB_BLOCKCUR() 752
VB_CURSIZE(), Change cursor size, (high-level
description) 753
vb_getchar(), Get character from keyboard,
(high-level description) 753
VB_GETCUR(), Get cursor position, (high-level
description) 753
VB_GETPAGE(), Get video page number, (high-
_Vbios() 756
vbios.h 756, 758
vb_iscolor(), Determine display adapter,
(high-level description) 753
VB_NORMALCUR() 752
VB_SETCUR(), Move physical cursor, (high-level
description) 753
vector array, printing 733
verbose mode, occs/LLama (-v, -V) 845, 826
viable prefix 169, 343
video-BIOS functions 752
access function _Vbios() 756
definitions, vbios.h 756
interface function (_vbios.c) 757
macros 756
that mimic direct-video functions 752
select mode at run time 772
video display pages 749
video environment 772, 846
video.h 756,759
video I/O
characters and attributes 749
clear entire screen (direct video) dv_clrs.c 759
clear region of screen (direct video) dv_clr_r.c
759
clear region of screen, (high-level description)
750
direct-video functions 749
definitions (video.h) 759
printf() dv_print.c 760
variable allocation dv_scree.c 765
free an SBUF (direct video) dv_frees.c 760
free save buffer (video BIOS) vb_frees.c 768
IBM/PC 746
initializing [init()] 772
get character from keyboard (keyboard BIOS)
vb_getch.c 768
get Cursor Position (video BIOS) vb_getyx.c
768
implementation 753
initialize direct-video functions dv_init.c 760
overview 749
restore saved region (Direct Video) dv_resto.c
764, 769
save region 764, 770
scroll region of screen (Direct Video)
dv_scrol.c 765
test for color card (video BIOS) vb_iscol.c 768
video-BIOS functions, macros (vbios.h) 758
write character 769, 763
with arbitrary attribute (Direct Video)
dv_putc.c 761
write string in TTY mode (Video BIOS)
vb_puts.c 769, 763
video memory, base addresses, MONBASE, COLBASE
756
direct access to [dv_Screen, SCREEN] 757
virtual.h 449
virtual machine 446,449,451
characteristics 447
printing pm() 476
virtual register set 454
visible parser (see IDE)
volatile, in optimization 664
Vspace 552
v_struct 494
W
-w, command-line switch to occs/LLama 845
waddch() 782
waddstr() 782
wclear() 783
wclrtoeol() 783
werase() 783
wgetch() 781
while 643
whitespace, in C-code 450
in LeX input file 816
wild, call of the 468, 472
wildcard 54
winch() 782
WINDOW 783
window management, curses 774
erase 783
move 779
boxed 778
window names under IDE 846
winio.c 796
wmove() 781
word 450
WORD 754,755
word 52, 166
word size, computing 685
worst-case alignment restriction [c-code.h] 462
W(p) 463
W(p+offset) 463
*WP(p) 463
WP(p) 463
&WP(&_p), used for lvalues 599
wprintw() 782
wrapok() 778
wrefresh() 780
write, string to window, waddstr.c 793
wscroll() 783
X
x, curses column coordinate 777
[x→.γ, FIRST(β C)] 363
Y
y, curses row coordinate 777
YY 839
Yy 839
yy 839
YYABORT (in table E.2) 840
Yyaccept[] 72
YYACCEPT (in table E.2) 840
yy_act() 388
Yy_action 369
YYACTION 876
YyaN 369
YYBADINP 80, 823
yybss() 855
YYCASCADE 840, 875
Yy_cmap[] 68
yycode() 385, 855
yycomment() 855
Yy_d[] 210
YY_D 63
yydata() 855
YYDEBUG 63, 235, 840
yydebug 63
yydebug.c (see IDE)
Yy_dstack 235, 394
YYD(x) (in table E.2) 840
yyerror() 250
YYERROR() 73, 821
yyerror(), Print parser error messages, (high
YYF 65, 388
yy_get_args() 246, 853
use directly 853
yyhook_a() 263, 514
yyhook_b() 263, 514
yyin 824 (footnote)
yy_init_debug() 246
yy_init_lex() (LeX, initialize) 822
yy_init_llama() 855, 884
yy_init_occs() 853
YY_IS_SHIFT 388
yylastaccept 80
yyleng 15, 820
yyless() 73, 822
yylex() 74
actions 81
called from yy_nextoken() in IDE 264
control-flow 74
failure transitions 80
Yy_lhs 372
yylineno (LeX input line number) 820
yylval 877
YYMAXDEPTH (in table E.2) 840
YYMAXERR 840, 875
yymore() 73, 822
problems with 79
yymoreflg 78
yy_next() 140, 393
yy_nextoken(), calls yylex() 264
Yy_nxt[] 68
yyout 73
yyout.h, example 385, 838
yyout.sym 867
example 384
YYP 388
yyparse() 242, 400, 837, 853
YYPARSER 876
yyprev 79
YYPRIVATE 63, 840
yyprompt() 855
yypstk() 853
example in C compiler 514
LLama 885
Yy_pushtab 210
yy_recover() 400
Yy_reduce 372, 400
Yy_rhslen 392, 876
Yy_rmap[] 68
Yy_sact 235
yy_shift() 396
YYSHIFTACT 840, 877
Yy_slhs[] 392
Yy_snonterm 235
Yy_srhs[] 392
yystack.h 686, 690
yystate 74
Yy_stok 235
Yy_stok[] 392
YYSTYPE (in table E.2) 840
yystype 863
yytext 15, 820
phase problems 242
YY_TTYPE 278, 388
YY_TYPE 65
Yy_val 876
YYVERBOSE (in table E.2) 840
Yy_vsp 876
yyvstype 205, 885 (see also, yystype)
yywrap() 80, 823
All entries take the following form:
symbol name listing(line) file, page
The page number is the page on which the listing starts, not the one on which the specified line is found.
_8086 A.1(14) debug.h, 681
Abort 4.8(102) yydebug.c, 243
ACCEPT 2.39(31) dfa.h, 125
access_with() 6.85(460) op.c, 614
ACT 5.22(62) yystate.c, 413
ACT_FILE 4.20(100) parser.h, 277
ACTION 4.18(2) llout.h, 276
Actions 5.22(64) yystate.c, 413
ACT_TEMPL 4.20(106) parser.h, 277
Actual_lineno 2.22(16) globals.h, 87
ADD() A.4(67) set.h, 696
add_action() 5.23(113) yystate.c, 413
add_case() 6.109(43) switch.c, 650
addch() A.70(63) curses.h, 784
add_declarator() 6.29(182) symtab.c, 501
add_default_case() 6.109(64) switch.c, 650
add_goto() 5.23(157) yystate.c, 413
add_lookahead() 5.29(928) yystate.c, 425
addreductions() 5.30(958) yystate.c, 431
addr_of() 6.83(270) op.c, 606
addset() A.6(80) set.c, 702
add_spec_to_decl() 6.45(102) decl.c, 537
addstr() A.70(66) curses.h, 784
addsym() A.16(59) hash.c, 718
add_symbols_to_table() 6.48(154) decl.c, 539
add_synch() 4.26(628) acts.c, 290
add_synch() 4.26(690) acts.c, 290
add_to_dstates() 2.41(119) dfa.c, 129
add_to_rhs() 4.26(488) acts.c, 290
add_unfinished() 5.25(267) yystate.c, 418
ADJ_VAL() 4.20(74) parser.h, 277
advance() 1.3(87) lex.c, 17
advance() 4.25(38) llpar.c, 285
advance() 2.28(392) nfa.c, 97
ALIGN() 6.7(44) virtual.h, 462
ALIGN_WORST 6.8(11) c-code.h, 462
ALLOC 6.33(10) c.y, 509
ALLOC A.58(1) dv_scree.c, 765
ALLOCATE 2.52(10) lex.c, 155
ALLOCATE 2.38(147) terp.c, 120
ALLOC_CLS 6.22(7,9) symtab.h, 488
alnum D.5(12) c.lex, 829
alnum 6.37(40) c.lex, 519
alnum 5.6(6) expr.lex, 384
alpha 5.6(5) expr.lex, 384
AND D.6(11) yyout.h, 832
and() 6.94(654) op.c, 626
ANDAND D.6(14) yyout.h, 832
ANSI A.1(13) debug.h, 681
ARRAY 6.23(34) symtab.h, 491
_ASSIGN A.4(41) set.h, 696
ASSIGN() A.4(46) set.h, 696
assignment() 6.90(571) op.c, 621
ASSIGNOP D.6(50) yyout.h, 832
Associativity 4.26(49) acts.c, 290
assort() A.35(3) assort.c, 743
Attrib A.83(25) wincreat.c, 794
Attr_pix 4.8(138) yydebug.c, 243
attr_str() 6.32(393) symtab.c, 504
ATYPE 6.27(173) symtab.h, 497
ATYPE 2.48(21) pairs.c, 143
ATYPE 2.47(8) print_ar.c, 142
AUTO 6.24(50) symtab.h, 492
AUTOSELECT A.68(12) glue.c, 772
Available 5.24(197) yystate.c, 416
B 6.9(46) virtual.h, 465
BGND() A.42(15) termlib.h, 754
binary_op() 6.99(762) op.c, 630
bin_to_ascii() A.27(5) bintoasc.c, 731
BIT() 6.19(80) virtual.h, 475
_BITS_IN_WORD A.4(4) set.h, 696
BLACK A.42(5) termlib.h, 754
BLINKING A.42(21) termlib.h, 754
BLUE A.42(6) termlib.h, 754
body() 4.25(124) llpar.c, 285
BOLD A.42(22) termlib.h, 754
bool A.70(19) curses.h, 784
BOT A.71(49) box.h, 787
BOTH 2.20(27) nfa.h, 85
box() A.73(3) box.c, 788
Cross Reference by Symbol
Box A.83(24) wincreat.c, 794
boxwin() 4.9(287) yydebug.c, 247
BP 6.9(50) virtual.h, 465
BREAK D.6(33) yyout.h, 832
breakpoint() 4.15(1187) yydebug.c, 264
BRKLEN 4.8(86) yydebug.c, 243
BROWN A.42(11) termlib.h, 754
BSIZE 2.38(150) terp.c, 120
Bss 6.34(1246) c.y, 512
BUCKET A.12(6) hash.h, 716
Buf 2.38(152) terp.c, 120
BUFSIZE 2.2(28) input.c, 39
BYTE_HIGH_BIT 6.2(6) c-code.h, 452
BYTE_PREFIX 6.10(12) c-code.h, 465
BYTEPTR_PREFIX 6.10(16) c-code.h, 465
_bytes 6.4(19) virtual.h, 454
_BYTES_IN_ARRAY() A.4(5) set.h, 696
BYTE_WIDTH 6.2(1) c-code.h, 452
call() 6.87(483) op.c, 616
CASE D.6(39) yyout.h, 832
Case_label 6.108(176) c.y, 649
CASE_MAX 6.107(1) switch.h, 648
case_val 6.107(7) switch.h, 648
cat_expr() 2.30(643) nfa.c, 104
CCL 2.20(20) nfa.h, 85
CELL 6.64(18) temp.c, 575
CEN A.71(53) box.h, 787
CHAR 6.24(43) symtab.h, 492
CHARACTER A.46(11) video.h, 759
CHAR_ATTRIB A.46(23) video.h, 759
Char_avail 4.8(149) yydebug.c, 243
CHUNK 5.23(71) yystate.c, 413
c_identifier() 4.26(365) acts.c, 290
CLASS 2.22(3) globals.h, 87
CLASS 4.23(170) parser.h, 281
CLASS 4.23(173) parser.h, 281
CLASS 6.4(12,15) virtual.h, 454
CLASS D.6(26) yyout.h, 832
clean_up() 6.34(1323) c.y, 512
clear() A.70(75) curses.h, 784
CLEAR() A.4(48) set.h, 696
CLEAR_STACK() 2.25(94) nfa.c, 91
clone_type() 6.31(226) symtab.c, 502
CLOSED 5.24(208) yystate.c, 416
Closep 2.2(63) input.c, 39
closure() 5.29(788) yystate.c, 425
clr_col() A.59(57) dv_scrol.c, 765
clr_region() A.68(120) glue.c, 772
clr_row() A.59(40) dv_scrol.c, 765
clrtoeol() A.70(69) curses.h, 784
cmd_line_error() 2.52(43) lex.c, 155
cmd_list() 4.12(1084) yydebug.c, 260
cmp() D.5(131) c.lex, 829
cmp() 6.37(178) c.lex, 519
c_name 4.17(40) parser.lex, 271
cnext() 2.49(298) squash.c, 147
Code 6.34(1247) c.y, 512
CODE_BLOCK 4.18(3) llout.h, 276
code_header() 4.31(49) lldriver.c, 319
code_header() 5.18(57) yydriver.c, 409
Code_window 4.8(118) yydebug.c, 243
Col A.52(5) dv_putc.c, 761
COLBASE A.46(2) video.h, 759
col_cpy() 2.49(94) squash.c, 147
col_equiv() 2.49(78) squash.c, 147
Col_map 2.49(23) squash.c, 147
COLON 4.18(4) llout.h, 276
COLON D.6(13) yyout.h, 832
Column_compress 2.52(30) lex.c, 155
COM 6.24(59) symtab.h, 492
COMMA D.6(47) yyout.h, 832
comment() A.32(14) printv.c, 734
Comment_buf 6.63(77) gen.c, 567
Comment_window 4.8(119) yydebug.c, 243
common 6.6(41) virtual.h, 461
COMPLEMENT() A.4(50) set.h, 696
concat() A.39(5) concat.c, 746
CONSTANT 6.24(52) symtab.h, 492
CONST_STR() 6.67(22) value.h, 584
const_val 6.24(83) symtab.h, 492
CONTINUE D.6(34) yyout.h, 832
conv() 4.8(175) yydebug.c, 243
convert_type() 6.69(265) value.c, 588
conv_sym_to_int_const() 6.55(506) decl.c, 551
COPY() 2.2(16,18) input.c, 39
copyfile() A.36(15) copyfile.c, 743
cpy_col() A.59(21) dv_scrol.c, 765
cpy_row() A.59(3) dv_scrol.c, 765
create_static_locals() 6.62(84) local.c, 560
CREATING_LLAMA_PARSER 4.17(8) parser.lex, 271
CREATING_LLAMA_PARSER 4.24(7) parser.lma, 283
Crmode A.84(13) winio.c, 796
CSIZE 6.27(160) symtab.h, 497
CTYPE 6.27(161) symtab.h, 497
Cur_act 4.23(220) parser.h, 281
Cur_nonterm 4.23(219) parser.h, 281
Current_tok 2.28(374) nfa.c, 97
CUR_SIZE A.43(7) vbios.h, 756
CUR_SYM 4.26(77) acts.c, 290
Cur_term 4.23(218) parser.h, 281
CYAN A.42(8) termlib.h, 754
D() A.1(3,6) debug.h, 681
d D.5(15) c.lex, 829
d 6.37(43) c.lex, 519
DANGER 2.2(30) input.c, 39
Data 6.34(1248) c.y, 512
D_BOT A.71(60) box.h, 787
D_CEN A.71(64) box.h, 787
DCL_TYPE 6.25(115) symtab.h, 493
Debug 4.23(178) parser.h, 281
DECLARATOR 6.25(85) symtab.h, 493
declarator 6.23(41) symtab.h, 491
DEFAULT D.6(38) yyout.h, 832
_DEFBITS A.4(9) set.h, 696
DEF_EXT 4.20(104) parser.h, 277
DEF_EXT 4.20(94) parser.h, 277
DEF_FIELD 4.23(166) parser.h, 281
definitions() 4.25(85) llpar.c, 285
defnext() 2.46(5) defnext.c, 141
DEFSTACK 4.8(60) yydebug.c, 243
_DEFWORDS A.4(8) set.h, 696
delay() 4.12(910) yydebug.c, 260
Delay 4.8(126) yydebug.c, 243
delset() A.5(33) set.c, 699
delsym() A.17(81) hash.c, 719
delwin() A.74(3) delwin.c, 790
Depth 4.8(110) yydebug.c, 243
dfa() 2.40(53) dfa.c, 127
dfa 2.40(50) dfa.c, 127
DFA_MAX 2.39(7) dfa.h, 125
DFA_STATE 2.40(31) dfa.c, 127
D_HORIZ A.71(63) box.h, 787
Did_something 4.27(17) first.c, 305
Did_something 4.28(22) follow.c, 307
die_a_horrible_death() 4.9(263) yydebug.c, 247
_DIFFERENCE A.4(40) set.h, 696
DIFFERENCE() A.4(45) set.h, 696
digit 5.6(4) expr.lex, 384
discard() 2.25(131) nfa.c, 91
discard_link() 6.28(139) symtab.c, 497
discard_link_chain() 6.28(116) symtab.c, 497
discard_structdef() 6.28(167) symtab.c, 497
discard_symbol() 6.28(45) symtab.c, 497
discard_symbol_chain() 6.28(70) symtab.c, 497
discard_value() 6.68(46) value.c, 585
DISPLAY A.46(13) video.h, 759
display_file() 4.10(558) yydebug.c, 251
DIVOP D.6(18) yyout.h, 832
_DIV_WSIZE() A.4(6) set.h, 696
D_LEFT A.71(62) box.h, 787
D_LL A.71(59) box.h, 787
D_LR A.71(65) box.h, 787
DO D.6(36) yyout.h, 832
do_binary_const() 6.99(816) op.c, 630
DOC_FILE 4.20(103) parser.h, 277
DOC_FILE 4.20(93) parser.h, 277
Doc_file 4.33(33) main.c, 322
do_close() 5.29(825) yystate.c, 425
document() 4.34(393) main.c, 328
document_to() 4.34(409) main.c, 328
dodash() 2.32(811) nfa.c, 108
do_dollar() 4.35(7) lldollar.c, 332
do_dollar() 5.16(7) yydollar.c, 407
do_enum() 6.55(493) decl.c, 551
do_file() 2.52(177) lex.c, 155
do_file() 4.33(272) main.c, 322
DOLLAR_DOLLAR 4.20(83) parser.h, 277
do_name() 6.72(27) op.c, 595
dopatch() 5.15(90) yypatch.c, 402
DOS() A.77(17,20) mvwin.c, 791
do_struct() 6.85(377) op.c, 614
do_unop() 6.79(177) op.c, 604
D_RIGHT A.71(57) box.h, 787
driver() 4.31(71) lldriver.c, 319
driver() 5.18(68) yydriver.c, 409
driver_1() A.41(18) driver.c, 748
driver_2() A.41(39) driver.c, 748
Driver_file 4.31(19) lldriver.c, 319
Driver_file 5.18(19) yydriver.c, 409
D_SCLASS 2.48(26) pairs.c, 143
Dstack 4.8(106) yydebug.c, 243
Dstates 2.40(33) dfa.c, 127
dst_opt() 6.99(931) op.c, 630
D_TOP A.71(61) box.h, 787
DTRAN 4.30(42) llcode.c, 311
Dtran 2.40(36) dfa.c, 127
Dtran 4.30(61) llcode.c, 311
DTRAN_NAME 2.52(24) lex.c, 155
D_UL A.71(66) box.h, 787
dupset() A.5(45) set.c, 699
D_UR A.71(58) box.h, 787
Dv A.68(65) glue.c, 772
dv_clr_region() A.47(3) dv_clr_r.c, 759
dv_clrs() A.48(3) dv_clrs.c, 759
dv_ctoyx() A.52(68) dv_putc.c, 761
D_VERT A.71(56) box.h, 787
dv_freesbuf() A.49(5) dv_frees.c, 760
dv_getyx() A.52(77) dv_putc.c, 761
dv_incha() A.52(86) dv_putc.c, 761
dv_init() A.50(3) dv_init.c, 760
dv_outcha() A.52(91) dv_putc.c, 761
dv_printf() A.51(6) dv_print.c, 760
dv_putc() A.52(14) dv_putc.c, 761
dv_putchar() A.53(4) dv_putch.c, 763
dv_puts() A.54(4) dv_puts.c, 763
dv_putsa() A.55(3) dv_putsa.c, 763
dv_replace() A.52(96) dv_putc.c, 761
dv_restore() A.56(5) dv_resto.c, 764
dv_save() A.57(6) dv_save.c, 764
dv_scroll() A.59(125) dv_scrol.c, 765
dv_scroll_line() A.59(78) dv_scrol.c, 765
E() 2.52(28) lex.c, 155
e() 5.7(4) yyout.sym, 384
e() E.11(5) yyout.sym, 868
ECHO 2.19(148) lex.par, 75
echo() A.84(16) winio.c, 796
Echo A.84(12) winio.c, 796
e_closure() 2.36(47) terp.c, 118
ELSE D.6(31) yyout.h, 832
eMark 2.2(43) input.c, 39
EMPTY 2.20(21) nfa.h, 85
enable_trace() 6.63(100) gen.c, 567
END 2.2(33) input.c, 39
END 2.20(26) nfa.h, 85
End_buf 2.2(40) input.c, 39
END_OPT 4.18(5) llout.h, 276
end_opt() 4.26(596) acts.c, 290
ENDP() 6.12(56) virtual.h, 467
endwin() A.76(5) initscr.c, 791
enlarge() A.6(95) set.c, 702
ENTER() 2.23(22) nfa.c, 89
ENTER() 2.23(27) nfa.c, 89
ENTRY A.10(13) nul, 710
ENTRY A.11(5) nul, 714
Enumval 6.54(123) c.y, 550
Eof_read 2.2(55) input.c, 39
EOI 1.1(1) lex.h, 15
_EOI_ 4.18(1) llout.h, 276
_EOI_ 4.2(1) llout.h, 229
_EOI_ D.6(3) yyout.h, 832
EPSILON 2.20(19) nfa.h, 85
EPSILON 4.20(65) parser.h, 277
EQ() 6.19(68) virtual.h, 475
EQUAL D.6(49) yyout.h, 832
EQUOP D.6(17) yyout.h, 832
erase A.70(72) curses.h, 784
ERR A.70(23) curses.h, 784
Errcode A.86(3) wprintw.c, 800
ERR_DST_OPEN A.36(10) copyfile.c, 743
Errmsgs 2.24(53) nfa.c, 90
ERR_NONE A.36(9) copyfile.c, 743
ERR_NUM 2.24(51) nfa.c, 90
error() 4.34(470) main.c, 328
ERR_READ A.36(12) copyfile.c, 743
ERR_SRC_OPEN A.36(11) copyfile.c, 743
ERR_WRITE A.36(13) copyfile.c, 743
ESC 4.8(73) yydebug.c, 243
esc() A.26(34) esc.c, 728
EXIT_ILLEGAL_ARG 4.19(20) parser.h, 276
EXIT_NO_DRIVER 4.19(22) parser.h, 276
EXIT_OTHER 4.19(23) parser.h, 276
EXIT_TOO_MANY 4.19(21) parser.h, 276
EXIT_USR_ABRT 4.19(24) parser.h, 276
expand_macro() 2.26(264) nfa.c, 93
expr() 3.4(19) naive.c, 188
expr() 2.30(592) nfa.c, 104
Expr 2.38(154) terp.c, 120
expression() 1.10(29) args.c, 29
expression() 1.6(25) improved.c, 21
expression() 1.5(23) plain.c, 18
expression() 1.9(30) retval.c, 27
expr_prime() 3.4(26) naive.c, 188
expr_prime() 1.5(31) plain.c, 18
EXT 6.24(58) symtab.h, 492
EXTERN 6.25(111) symtab.h, 493
external 6.6(43) virtual.h, 461
ext_high() 6.18(66) virtual.h, 474
ext_low() 6.18(65) virtual.h, 474
ext_word() 6.18(67) virtual.h, 474
F 2.39(18) dfa.h, 125
factor() 1.10(67) args.c, 29
factor() 1.6(55) improved.c, 21
factor() 3.4(69) naive.c, 188
factor() 2.31(701) nfa.c, 108
factor() 1.5(67) plain.c, 18
factor() 1.9(66) retval.c, 27
FALSE A.70(22) curses.h, 784
FATAL 4.20(80) parser.h, 277
FCON 2.12(1) lexyy.c, 64
FCON 2.11(2) numbers.lex, 63
FCON D.6(7) yyout.h, 832
ferr() A.28(13) ferr.c, 732
FGND() A.42(14) termlib.h, 754
FIELD 4.18(6) llout.h, 276
Field_name 4.26(53) acts.c, 290
fields_active() 4.26(786) acts.c, 290
fields_active 4.26(43) acts.c, 290
Fields_active 4.26(54) acts.c, 290
figure_local_offsets() 6.62(45) local.c, 560
figure_osclass() 6.48(212) decl.c, 539
figure_param_offsets() 6.60(591) decl.c, 557
figure_struct_offsets() 6.52(427) decl.c, 548
file_header() 4.31(27) lldriver.c, 319
file_header() 5.18(30) yydriver.c, 409
File_name A.41(14) driver.c, 748
FILL() A.4(49) set.h, 696
fill_row() 4.30(101) llcode.c, 311
find_field() 6.85(440) op.c, 614
find_problems() 4.26(278) acts.c, 290
find_select_set() 4.29(23) llselect.c, 310
findsym() A.18(101) hash.c, 720
first() 4.27(21) first.c, 305
first_closure() 4.27(37) first.c, 305
first_in_cat() 2.30(681) nfa.c, 104
first_rhs() 4.27(97) first.c, 305
first_sym() 4.26(380) acts.c, 290
fix_cur() A.52(10) dv_putc.c, 761
fix_dtran() 2.44(177) minimize.c, 137
FIXED 6.24(48) symtab.h, 492
fix_types_and_discard_syms() 6.60(528) decl.c, 557
follow() 4.28(26) follow.c, 307
follow_closure() 4.28(127) follow.c, 307
FOR D.6(37) yyout.h, 832
fp 6.4(37) virtual.h, 454
fputstr() A.30(8) fputstr.c, 733
freeitem() 5.27(421) yystate.c, 422
freename() 5.5(17) expr.y, 382
free_name() E.17(72) expr.y, 879
freename() 1.8(15) name.c, 26
free_name() 5.1(27) support.c, 353
free_nfa() 2.35(43) terp.c, 116
free_recycled_items() 5.27(429) yystate.c, 422
free_sets() 2.41(188) dfa.c, 129
freesym() A.13(27) hash.c, 716
FUNCTION 6.23(35) symtab.h, 491
Funct_name 6.57(137) c.y, 555
GBIT() A.4(64) set.h, 696
GE() 6.19(73) virtual.h, 475
gen() 6.63(113) gen.c, 567
gen_comment() 6.63(81) gen.c, 567
generate() 5.1(7) support.c, 353
generate_defs_and_free_args() 6.48(249) decl.c, 539
gen_false_true() 6.79(206) op.c, 604
gen_rvalue() 6.92(638) op.c, 623
gen_stab_and_free_table() 6.109(78) switch.c, 650
get_alignment() 6.52(470) decl.c, 548
getbuf() A.84(33) winio.c, 796
getch() A.70(78) curses.h, 784
get_expr() 2.27(8) input.c, 96
getline() 2.27(59) input.c, 96
getline() 2.38(177) terp.c, 120
getname() 5.5(18) expr.y, 382
get_prefix() 6.69(181) value.c, 588
get_size() 6.69(304) value.c, 588
get_sizeof() 6.31(302) symtab.c, 502
get_suffix() 6.69(331) value.c, 588
get_unfinished() 5.25(307) yystate.c, 418
get_unmarked() 2.41(157) dfa.c, 129
GET_VMODE A.43(15) vbios.h, 756
getyx() A.70(30) curses.h, 784
Goal_symbol 4.23(214) parser.h, 281
Goal_symbol_is_next 4.26(58) acts.c, 290
GOTO D.6(29) yyout.h, 832
GOTO 5.22(63) yystate.c, 413
Gotos 5.22(68) yystate.c, 413
GREEN A.42(7) termlib.h, 754
ground() A.70(48) curses.h, 784
Groups 2.43(16) minimize.c, 136
GT() 6.19(72) virtual.h, 475
h D.5(13) c.lex, 829
h 6.37(41) c.lex, 519
hash_add() A.22(8) hashadd.c, 723
hash_funct() 4.26(303) acts.c, 290
hash_pjw() 6.21(1) hashpjw.c, 484
hash_pjw() A.23(9) hashpjw.c, 724
HASH_TAB A.14(15) hash.h, 717
HD_BOT A.71(71) box.h, 787
HD_CEN A.71(75) box.h, 787
HD_HORIZ A.71(74) box.h, 787
HD_LEFT A.71(73) box.h, 787
HD_LL A.71(70) box.h, 787
HD_LR A.71(76) box.h, 787
HD_RIGHT A.71(68) box.h, 787
HD_TOP A.71(72) box.h, 787
HD_UL A.71(77) box.h, 787
HD_UR A.71(69) box.h, 787
HD_VERT A.71(67) box.h, 787
head() 2.52(269) lex.c, 155
Header_only 2.52(34) lex.c, 155
Heap 5.24(194) yystate.c, 416
hex2bin() A.26(14) esc.c, 728
hidewin() A.75(3) hidewin.c, 790
HIGH_BITS A.23(7) hashpjw.c, 724
High_water_mark 6.64(21) temp.c, 575
HORIZ A.71(52) box.h, 787
Horiz_stack_pix 4.8(134) yydebug.c, 243
I() 2.22(4) globals.h, 87
I() 2.22(7) globals.h, 87
I() 4.23(171) parser.h, 281
I() 4.23(174) parser.h, 281
I() 6.4(11) virtual.h, 454
I() 6.4(14) virtual.h, 454
IBM_BOX() A.71(38) box.h, 787
IBM_BOX() A.71(41) box.h, 787
I_breakpoint 4.8(91) yydebug.c, 243
ICON 2.12(2) lexyy.c, 64
ICON 2.11(3) numbers.lex, 63
ICON D.6(6) yyout.h, 832
id_or_keyword() D.5(137) c.lex, 829
id_or_keyword() 6.37(184) c.lex, 519
IF D.6(30) yyout.h, 832
Ifile 2.22(21) globals.h, 87
IFREE A.42(34) termlib.h, 754
IFREE A.42(38) termlib.h, 754
Ifunct 2.28(371) nfa.c, 97
Ignore 4.17(27) parser.lex, 271
ii_advance() 2.5(168) input.c, 44
ii_fillbuf() 2.6(278) input.c, 45
ii_flush() 2.6(204) input.c, 45
ii_flushbuf() 2.9(425) input.c, 48
ii_input() 2.9(377) input.c, 48
ii_io() 2.3(65) input.c, 41
ii_look() 2.7(322) input.c, 47
ii_lookahead() 2.9(418) input.c, 48
ii_mark_end() 2.4(134) input.c, 43
ii_mark_prev() 2.4(154) input.c, 43
ii_move_start() 2.4(140) input.c, 43
ii_newfile() 2.3(81) input.c, 41
ii_pushback() 2.8(337) input.c, 48
ii_term() 2.9(358) input.c, 48
ii_text() 2.4(120) input.c, 43
ii_to_mark() 2.4(148) input.c, 43
ii_unput() 2.9(400) input.c, 48
ii_unterm() 2.9(366) input.c, 48
illegal_struct_def() 6.52(401) decl.c, 548
IMALLOC A.42(33) termlib.h, 754
IMALLOC A.42(37) termlib.h, 754
INBOUNDS() A.1(62) debug.h, 681
inch() A.70(91) curses.h, 784
in_closure_items() 5.29(911) yystate.c, 425
INCOP D.6(22) yyout.h, 832
incop() 6.81(233) op.c, 605
indirect() 6.84(294) op.c, 606
in_dstates() 2.41(139) dfa.c, 129
Ingroup 2.43(18) minimize.c, 136
init() 4.28(61) follow.c, 307
init() A.68(18) glue.c, 772
init() A.68(67) glue.c, 772
init_acts() 4.26(312) acts.c, 290
init_groups() 2.44(52) minimize.c, 137
init_output_streams() 6.34(1251) c.y, 512
initscr() A.76(10) initscr.c, 791
Inp_file 2.2(48) input.c, 39
Inp_fm_file 4.8(129) yydebug.c, 243
input() 2.19(162) lex.par, 75
Input 2.28(372) nfa.c, 97
Input_buf 2.22(19) globals.h, 87
input_char() 4.16(1368) yydebug.c, 267
Input_file A.41(12) driver.c, 748
Input_file_name 2.22(20) globals.h, 87
Input_file_name 4.23(179) parser.h, 281
Input_line A.41(13) driver.c, 748
INT 6.24(42) symtab.h, 492
Interactive 4.8(124) yydebug.c, 243
internal_cmp() A.19(218) hash.c, 721
_INTERSECT A.4(39) set.h, 696
INTERSECT() A.4(44) set.h, 696
INVERT() A.4(51) set.h, 696
invert() A.8(365) set.c, 706
IO_TOP 4.8(64) yydebug.c, 243
IO_WINSIZE 4.8(65) yydebug.c, 243
_IS() A.1(64) debug.h, 681
ISACT() 4.20(57) parser.h, 277
IS_AGGREGATE() 6.25(145) symtab.h, 493
IS_ARRAY() 6.25(131) symtab.h, 493
IS_CHAR() 6.25(137) symtab.h, 493
IS_CONSTANT() 6.25(148) symtab.h, 493
IS_DECLARATOR() 6.25(130) symtab.h, 493
IS_DISJOINT() A.4(57) set.h, 696
IS_EMPTY() A.4(60) set.h, 696
IS_EQUIVALENT() A.4(59) set.h, 696
IS_FUNCT() 6.25(133) symtab.h, 493
ISHEXDIGIT() A.26(11) esc.c, 728
IS_INT() 6.25(138) symtab.h, 493
IS_INT_CONSTANT() 6.25(150) symtab.h, 493
IS_INTERSECTING() A.4(58) set.h, 696
ISIZE 6.27(163) symtab.h, 497
I S_LABEL() 6. 25(135) symtab.h, 493
I S_L0NG() 6. 25(140) symtab.h, 493
I SNONTERM() 4. 20(56) parser.h, 277
I SOCTDI GI T() A. 26(12) esc.c, 728
I S_POI NTER() 6. 25(132) symtab.h, 493
I S_PTR_TYPE() 6. 25(146) symtab.h, 493
I S_SPECI FI ER() 6. 25(129) symtab.h, 493
I S_STRUCT() 6. 25(134) symtab.h, 493
I STERM() 4. 20(55) parser.h, 277
I S_TYPEDEF() 6. 25(149) symtab.h, 493
I S_ UI NT() 6. 25(139) symtab.h, 493
I S_ULONG() 6. 25(141) symtab.h, 493
I SJ J NSI GNED () 6. 25(142) symtab.h, 493
I TEM 5. 20(33) yystate.c, 411
i t em cmp( ) 5. 27(452) yystate.c, 422
I TYPE 6. 27(164) symtab.h, 497
kbhit() 4.8(150) yydebug.c, 243
KB_INT A.43(6) vbios.h, 756
kbready() 4.9(257) yydebug.c, 247
kclosure() 5.29(745) yystate.c, 425
KWORD 6.37(143) c.lex, 519
KWORD D.5(97) c.lex, 829
L 6.9(47) virtual.h, 465
LABEL 6.24(46) symtab.h, 492
LABEL_MAX 6.22(13) symtab.h, 488
LARGEST_INT 2.35(19) terp.c, 116
LASTELE() A.1(59) debug.h, 681
Last_marked 2.40(38) dfa.c, 127
Last_real_nonterm 5.15(22) yypatch.c, 402
LB D.6(45) yyout.h, 832
L_BODY 6.59(6) label.h, 557
L_breakpoint 4.8(89) yydebug.c, 243
LC D.6(43) yyout.h, 832
LCHUNK 6.28(19) symtab.c, 497
L_COND_END 6.59(7) label.h, 557
L_COND_FALSE 6.59(8) label.h, 557
L_DOEXIT 6.59(9) label.h, 557
L_DOTEST 6.59(10) label.h, 557
L_DOTOP 6.59(11) label.h, 557
LE() 6.19(71) virtual.h, 475
LEAVE() 2.23(24) nfa.c, 89
LEAVE() 2.23(28) nfa.c, 89
LEFT A.71(51) box.h, 787
LEFT 4.18(7) llout.h, 276
LEFT 6.67(19) value.h, 584
legal_lookahead() 1.7(80) improved.c, 23
L_ELSE 6.59(12) label.h, 557
L_END 6.59(13) label.h, 557
lerror() 2.52(89) lex.c, 155
lerror() 4.34(429) main.c, 328
lex() 1.2(9) lex.c, 15
Lexeme 2.28(375) nfa.c, 97
L_FALSE 6.59(14) label.h, 557
LIKE_UNIX A.77(13) mvwin.c, 791
L_INCREMENT 6.59(15) label.h, 557
Lineno 2.22(17) globals.h, 87
Lineno 2.2(49) input.c, 39
link() 6.16(61) virtual.h, 472
link 6.25(101) symtab.h, 493
Link_free 6.28(16) symtab.c, 497
LL A.71(48) box.h, 787
LL() 4.19(10) parser.h, 276
LL() 4.19(7) parser.h, 276
LL() 4.26(441) acts.c, 290
L_LINK 6.59(16) label.h, 557
L_NEXT 6.59(17) label.h, 557
loc_auto_create() 6.62(57) local.c, 560
loc_reset() 6.62(24) local.c, 560
loc_static_create() 6.62(96) local.c, 560
loc_var_space() 6.62(33) local.c, 560
Log 4.8(130) yydebug.c, 243
LONG 6.25(109) symtab.h, 493
Lookahead 1.3(74) lex.c, 17
Lookahead 4.25(30) llpar.c, 285
lookfor() 4.25(47) llpar.c, 285
LP 1.1(5) lex.h, 15
LP 4.2(5) llout.h, 229
LP 6.9(51) virtual.h, 465
LP D.6(41) yyout.h, 832
LR A.71(54) box.h, 787
lr() 5.29(547) yystate.c, 425
lr_conflicts() 5.31(1073) yystate.c, 433
L_RET 6.59(18) label.h, 557
lrs() 6.17(64) virtual.h, 473
lr_stats() 5.31(1053) yystate.c, 433
LSIZE 6.27(166) symtab.h, 497
L_STRING 6.59(19) label.h, 557
L_SWITCH 6.59(20) label.h, 557
LT() 6.19(70) virtual.h, 475
Ltab 6.63(26) gen.c, 567
L_TEST 6.59(21) label.h, 557
L_TRUE 6.59(22) label.h, 557
LTYPE 6.27(167) symtab.h, 497
L_VAR 6.59(23) label.h, 557
LWORD_HIGH_BIT 6.2(8) c-code.h, 452
LWORD_PREFIX 6.10(14) c-code.h, 465
LWORDPTR_PREFIX 6.10(18) c-code.h, 465
LWORD_WIDTH 6.2(3) c-code.h, 452
machine() 2.29(508) nfa.c, 102
MAC_NAME_MAX 2.26(188) nfa.c, 93
MACRO 2.26(196) nfa.c, 93
Macros 2.26(198) nfa.c, 93
MAC_TEXT_MAX 2.26(189) nfa.c, 93
MAGENTA A.42(10) termlib.h, 754
main() E.20(79) expr.lma, 887
main() E.17(88) expr.y, 879
main() 5.5(90) expr.y, 382
main() 2.52(109) lex.c, 155
main() 1.4(1) main.c, 18
main() 6.36(3) main.c, 518
main() 4.33(76) main.c, 322
main() 7.1(16) postfix.y, 668
main() 2.38(190) terp.c, 120
main() D.1(5) yymain.c, 821
Make_actions 4.23(180) parser.h, 281
make_acts() 4.30(366) llcode.c, 311
make_dtran() 2.42(197) dfa.c, 131
make_icon() 6.71(371) value.c, 594
make_implicit_declaration() 6.72(117) op.c, 595
make_int() 6.71(414) value.c, 594
make_nonterm() 4.26(682) acts.c, 290
Make_parser 4.23(181) parser.h, 281
make_parse_tables() 5.28(467) yystate.c, 423
make_pushtab() 4.30(132) llcode.c, 311
make_scon() 6.75(426) value.c, 601
makesig() 7.6(94) optimize.c, 674
maketab() A.15(32) hash.c, 717
make_term() 4.26(335) acts.c, 290
make_token_file() 4.32(56) stok.c, 321
make_types_match() 6.97(712) op.c, 628
make_yy_act() 4.30(444) llcode.c, 311
make_yy_dtran() 4.30(194) llcode.c, 311
make_yy_lhs() 5.33(1239) yystate.c, 437
Make_yyoutab 4.23(182) parser.h, 281
make_yy_pushtab() 4.30(163) llcode.c, 311
make_yy_reduce() 5.33(1271) yystate.c, 437
make_yy_sact() 4.30(335) llcode.c, 311
make_yy_slhs() 5.33(1302) yystate.c, 437
make_yy_snonterm() 4.30(303) llcode.c, 311
make_yy_srhs() 5.33(1328) yystate.c, 437
make_yy_stok() 4.32(17) stok.c, 321
make_yy_synch() 4.30(270) llcode.c, 311
Map 4.8(168) yydebug.c, 243
MARK 6.64(16) temp.c, 575
match() 4.25(36) llpar.c, 285
MATCH() 2.28(377) nfa.c, 97
match() 1.3(76) lex.c, 17
max() A.1(74) debug.h, 681
MAX_CHARS 2.39(19) dfa.h, 125
MAXCLOSE 5.21(37) yystate.c, 412
MAXEPSILON 5.21(39) yystate.c, 412
MAXFIRST 1.7(77) improved.c, 23
MAXINP 2.22(9) globals.h, 87
MAXINT A.1(71) debug.h, 681
MAXKERNEL 5.21(36) yystate.c, 412
MAXLEX 2.2(26) input.c, 39
MAXLOOK 2.2(25) input.c, 39
MAXNAME 4.20(25) parser.h, 277
MAXNONTERM 4.20(42) parser.h, 277
MAXOBUF 5.20(21) yystate.c, 411
MAXPROD 4.20(26) parser.h, 277
MAXRHS 4.22(141) parser.h, 279
MAXSTATE 5.20(20) yystate.c, 411
MAXTERM 4.20(41) parser.h, 277
MAX_TOK_PER_LINE 5.32(1084) yystate.c, 434
MAX_UNFINISHED 5.24(185) yystate.c, 416
MEMBER() A.4(68) set.h, 696
memiset() A.38(4) memiset.c, 745
merge_lookaheads() 5.29(667) yystate.c, 425
min() A.1(77) debug.h, 681
MINACT 4.20(30) parser.h, 277
min_dfa() 2.43(32) minimize.c, 136
minimize() 2.44(104) minimize.c, 137
MINNONTERM 4.20(29) parser.h, 277
MINTERM 4.20(28) parser.h, 277
MINUS D.6(9) yyout.h, 832
mkprod() 5.33(1393) yystate.c, 437
Mline 2.2(50) input.c, 39
_MOD_WSIZE() A.4(7) set.h, 696
MONBASE A.46(1) video.h, 759
MONTHS A.3(8) p, 690
move() A.70(81) curses.h, 784
move() 2.37(116) terp.c, 119
movedot() 5.27(442) yystate.c, 422
move_eps() 5.29(707) yystate.c, 425
movefile() A.37(1) movefile.c, 744
MS() A.1(11) debug.h, 681
MS() A.1(16) debug.h, 681
mvcur() A.70(82) curses.h, 784
mvinch() A.70(92) curses.h, 784
mvwin() A.77(23) mvwin.c, 791
mvwinch() A.70(93) curses.h, 784
mvwinr() A.70(45) curses.h, 784
NAME 4.18(8) llout.h, 276
NAME D.6(4) yyout.h, 832
NAME_MAX 4.22(121) parser.h, 279
NAME_MAX 6.22(12) symtab.h, 488
Namep E.17(57) expr.y, 879
Namep 3.3(7) expr.y, 185
Namep 1.8(2) name.c, 26
Namep 5.1(5) support.c, 353
Names E.17(56) expr.y, 879
Names 3.3(6) expr.y, 185
Names 1.8(1) name.c, 26
Names 5.1(4) support.c, 353
NBITS() A.1(68) debug.h, 681
NBITS_IN_UNSIGNED A.23(4) hashpjw.c, 724
NCOLS 2.48(23) pairs.c, 143
NCOLS 2.47(9) print_ar.c, 142
NCOLS 2.49(26) squash.c, 147
NE() 6.19(69) virtual.h, 475
Nest_lev 6.42(120) c.y, 530
NEW 5.24(206) yystate.c, 416
new() 2.25(105) nfa.c, 91
new() 5.23(73) yystate.c, 413
new_class_spec() 6.40(21) decl.c, 526
new_field() 4.26(675) acts.c, 290
new_field() 4.26(793) acts.c, 290
new_input_file() 4.16(1291) yydebug.c, 267
newitem() 5.27(394) yystate.c, 422
new_lev() 4.26(648) acts.c, 290
new_lev() 4.26(698) acts.c, 290
NEWLINE() 4.8(40) yydebug.c, 243
NEWLINE() 4.8(46) yydebug.c, 243
new_link() 6.28(85) symtab.c, 497
new_macro() 2.26(202) nfa.c, 93
new_name() E.17(59) expr.y, 879
newname() 3.3(9) expr.y, 185
newname() 1.8(4) name.c, 26
new_name() 5.1(16) support.c, 353
new_nonterm() 4.26(391) acts.c, 290
new_rhs() 4.26(460) acts.c, 290
newset() A.5(9) set.c, 699
new_stab() 6.109(17) switch.c, 650
newstate() 5.25(209) yystate.c, 418
new_structdef() 6.28(148) symtab.c, 497
newsym() A.13(9) hash.c, 716
new_symbol() 6.28(23) symtab.c, 497
new_type_spec() 6.40(70) decl.c, 526
new_value() 6.68(19) value.c, 585
Next 2.2(41) input.c, 39
Next_alloc 2.25(84) nfa.c, 91
Next_allocate 5.24(195) yystate.c, 416
nextchar() 2.38(156) terp.c, 120
next_member() A.9(397) set.c, 708
nextsym() A.18(124) hash.c, 720
nfa() 2.35(26) terp.c, 116
NFA 2.20(17) nfa.h, 85
Nfa 2.35(21) terp.c, 116
NFA_MAX 2.21(28) nfa.h, 86
Nfa_states 2.25(82) nfa.c, 91
Nfa_states 2.35(22) terp.c, 116
Nitems 5.20(14) yystate.c, 411
Nl A.84(14) winio.c, 796
No_comment_pix 4.8(131) yydebug.c, 243
No_compression 2.52(31) lex.c, 155
nocrmode() A.84(26) winio.c, 796
node 7.3(10) optimize.c, 669
No_header 2.52(33) lex.c, 155
No_lines 2.22(12) globals.h, 87
No_lines 4.23(183) parser.h, 281
NO_MORE_CHARS() 2.2(35) input.c, 39
NONASSOC 4.18(9) llout.h, 276
NONE 2.20(24) nfa.h, 85
NONFATAL 4.20(79) parser.h, 277
NO_OCLASS 6.24(55) symtab.h, 492
NORMAL A.42(17) termlib.h, 754
No_stack_pix 4.8(132) yydebug.c, 243
NOT_IBM_PC 4.8(42) yydebug.c, 243
NOUN 6.25(107) symtab.h, 493
noun_str() 6.32(381) symtab.c, 504
No_warnings 4.23(184) parser.h, 281
nows() 4.17(215) parser.lex, 271
Npairs 5.20(15) yystate.c, 411
NREQ 6.63(76) gen.c, 567
Nstates 2.40(37) dfa.c, 127
Nstates 2.25(83) nfa.c, 91
Nstates 5.24(183) yystate.c, 416
Ntab_entries 5.20(16) yystate.c, 411
NULLABLE() 4.22(137) parser.h, 279
NUMCOLS A.46(4) video.h, 759
NUMELE() A.1(58) debug.h, 681
NUM_ELE 6.25(116) symtab.h, 493
numele() A.7(126) set.c, 703
Numgroups 2.43(17) minimize.c, 136
NUMNONTERMS 4.20(45) parser.h, 277
NUM_OR_ID 1.1(7) lex.h, 15
NUM_OR_ID 4.2(4) llout.h, 229
Numproductions 4.23(221) parser.h, 281
NUMROWS A.46(3) video.h, 759
NUMTERMS 4.20(44) parser.h, 277
Numwarnings 4.33(31) main.c, 322
o D.5(14) c.lex, 829
o 6.37(42) c.lex, 519
O_BINARY A.1(18) debug.h, 681
OCLASS 6.25(113) symtab.h, 493
oclass_str() 6.32(370) symtab.c, 504
oct2bin() A.26(23) esc.c, 728
OFF() A.1(40) debug.h, 681
Offset 6.62(14) local.c, 560
Ofile 2.22(22) globals.h, 87
OFILE_NAME 6.34(1245) c.y, 512
OK A.70(24) curses.h, 784
on_ferr() A.29(8) onferr.c, 733
onintr() 4.33(146) main.c, 322
Onumele 4.8(123) yydebug.c, 243
op() 3.3(22) expr.y, 185
open_errmsg() 4.34(507) main.c, 328
Openp 2.2(62) input.c, 39
optimize() 7.6(74) optimize.c, 674
OR 4.18(10) llout.h, 276
OR D.6(19) yyout.h, 832
or() 6.92(627) op.c, 623
OROR D.6(15) yyout.h, 832
Osig 6.34(1281) c.y, 512
OTHER 4.18(11) llout.h, 276
OTHER_BOX() A.71(39) box.h, 787
OTHER_BOX() A.71(42) box.h, 787
out 4.33(241) main.c, 322
output() 2.19(147) lex.par, 75
output() 4.34(379) main.c, 328
Output 4.23(185) parser.h, 281
Output_fname 4.33(32) main.c, 322
OX() 4.19(11) parser.h, 276
OX() 4.19(8) parser.h, 276
OX 4.23(200) parser.h, 281
P() A.1(25,28) debug.h, 681
P 6.9(48) virtual.h, 465
pact() 4.26(128) acts.c, 290
p_action() 5.23(95) yystate.c, 413
pairs() 2.48(29) pairs.c, 143
parse_args() 4.33(159) main.c, 322
parse_err() 2.24(70) nfa.c, 90
PARSE_FILE 4.20(91) parser.h, 277
PARSE_FILE 4.20(99) parser.h, 277
Parse_pix 4.8(136) yydebug.c, 243
PAR_TEMPL 4.20(105) parser.h, 277
PAR_TEMPL 4.20(95) parser.h, 277
patch() 5.15(29) yypatch.c, 402
P_breakpoint 4.8(88) yydebug.c, 243
Pbuf 2.38(153) terp.c, 120
PBUF_SIZE A.40(3) searchen.c, 747
pchar() A.31(5) pchar.c, 733
pclosure() 5.32(1225) yystate.c, 434
pdriver() 2.51(83) print.c, 153
P_dsp 4.8(107) yydebug.c, 243
PERCENT_UNION 4.18(21) llout.h, 276
P_goto() 5.23(134) yystate.c, 413
pgroups() 2.45(223) minimize.c, 140
pheader() 2.51(18) print.c, 153
PHYS() A.1(41) debug.h, 681
PHYS() A.1(43) debug.h, 681
plab() 2.33(38) printnfa.c, 110
pLength 2.2(46) input.c, 39
pLineno 2.2(45) input.c, 39
PLUS 1.1(3) lex.h, 15
PLUS 4.2(2) llout.h, 229
PLUS D.6(8) yyout.h, 832
plus_minus() 6.101(963) op.c, 634
pm() 6.20(84) virtual.h, 477
pmap() 2.49(276) squash.c, 147
pMark 2.2(44) input.c, 39
pnext() 2.48(135) pairs.c, 143
pnonterm() 4.26(171) acts.c, 290
POINTER 6.23(33) symtab.h, 491
POP() 2.25(96) nfa.c, 91
pop_() A.2(24) stack.h, 689
pop() A.2(30) stack.h, 689
pop() 6.11(54) virtual.h, 466
popn_() A.2(34) stack.h, 689
popn() A.2(35) stack.h, 689
PP 6.9(52) virtual.h, 465
PREC 4.18(12) llout.h, 276
prec() 4.26(659) acts.c, 290
prec() 4.26(735) acts.c, 290
Prec_lev 4.26(50) acts.c, 290
prec_list() 4.26(710) acts.c, 290
PREC_TAB 4.23(164) parser.h, 281
presskey() 4.16(1444) yydebug.c, 267
PRI 6.24(57) symtab.h, 492
print_a_macro() 2.26(294) nfa.c, 93
print_array() 2.47(12) print_ar.c, 142
print_bss_dcl() 6.49(334) decl.c, 542
printbuf() 2.38(169) terp.c, 120
printed() 2.33(16) printnfa.c, 110
print_col_map() 2.49(240) squash.c, 147
printf 5.19(14) occs-act.par, 410
print_instruction() 6.63(247) gen.c, 567
printmacs() 2.26(300) nfa.c, 93
print_nfa() 2.33(57) printnfa.c, 110
print_offset_comment() 6.60(628) decl.c, 557
print_one_case() 5.15(179) yypatch.c, 402
print_reductions() 5.33(1368) yystate.c, 437
print_row_map() 2.49(268) squash.c, 147
print_symbols() 4.26(243) acts.c, 290
print_syms() 6.32(589) symtab.c, 504
print_tab() 5.33(1406) yystate.c, 437
print_tok() 4.26(87) acts.c, 290
printv() A.32(3) printv.c, 734
printw() A.86(25) wprintw.c, 800
PRINTWIDTH 4.8(68) yydebug.c, 243
PRIVATE A.1(2) debug.h, 681
PRIVATE A.1(5) debug.h, 681
private 6.6(42) virtual.h, 461
prnt() A.33(19) Driver, 736
prnt_putc() 4.10(386) yydebug.c, 251
problems() 4.26(290) acts.c, 290
PROC() 6.12(55) virtual.h, 467
PRODUCTION 4.22(156) parser.h, 279
production_str() 4.26(143) acts.c, 290
PROG_NAME 4.20(107) parser.h, 277
PROG_NAME 4.20(96) parser.h, 277
PROMPT_TOP 4.8(62) yydebug.c, 243
Prompt_window 4.8(117) yydebug.c, 243
PROMPT_WINSIZE 4.8(63) yydebug.c, 243
pset() A.9(434) set.c, 708
PSIZE 6.27(169) symtab.h, 497
P_sp 4.8(109) yydebug.c, 243
pstate() 5.32(1159) yystate.c, 434
pstate_stdout() 5.32(1215) yystate.c, 434
pstruct() 6.32(571) symtab.c, 504
psym() 6.32(557) symtab.c, 504
ptab() A.19(142) hash.c, 721
pterm() 4.26(108) acts.c, 290
PTR_PREFIX 6.10(15) c-code.h, 465
PTRPTR_PREFIX 6.10(19) c-code.h, 465
PTR_WIDTH 6.2(4) c-code.h, 452
PTYPE 6.27(170) symtab.h, 497
PUB 6.24(56) symtab.h, 492
PUBLIC A.1(8) debug.h, 681
public 6.6(40) virtual.h, 461
Public 2.22(14) globals.h, 87
Public 4.23(186) parser.h, 281
purge_undecl() 6.72(152) op.c, 595
PUSH() 2.25(95) nfa.c, 91
push_() A.2(23) stack.h, 689
push() A.2(26) stack.h, 689
push() 6.11(53) virtual.h, 466
putstr() A.33(89) Driver, 736
QUEST D.6(12) yyout.h, 832
r1 6.4(30) virtual.h, 454
r2 6.4(30) virtual.h, 454
r3 6.4(30) virtual.h, 454
r4 6.4(30) virtual.h, 454
r5 6.4(30) virtual.h, 454
r6 6.4(30) virtual.h, 454
r7 6.4(30) virtual.h, 454
r9 6.4(31) virtual.h, 454
rA 6.4(31) virtual.h, 454
RANGE() A.1(79) debug.h, 681
RB D.6(46) yyout.h, 832
rB 6.4(31) virtual.h, 454
RC D.6(44) yyout.h, 832
rC 6.4(31) virtual.h, 454
rD 6.4(31) virtual.h, 454
rE 6.4(31) virtual.h, 454
READ_CHAR A.43(12) vbios.h, 756
Readp 2.2(64) input.c, 39
READ_POSN A.43(9) vbios.h, 756
Recycled_items 5.27(392) yystate.c, 422
RED A.42(9) termlib.h, 754
reduce() 2.49(116) squash.c, 147
reduce_one_item() 5.30(989) yystate.c, 431
Reduce_reduce 5.20(18) yystate.c, 411
reductions() 5.30(944) yystate.c, 431
refresh() A.70(33) curses.h, 784
refresh_win() 4.10(417) yydebug.c, 251
reg A.70(20) curses.h, 784
reg 6.4(28) virtual.h, 454
Region 6.64(20) temp.c, 575
REGION_MAX 6.64(14) temp.c, 575
REGISTER 6.24(49) symtab.h, 492
release_value() 6.69(361) value.c, 588
RELOP D.6(16) yyout.h, 832
relop() 6.97(664) op.c, 628
REMOVE() A.4(66) set.h, 696
remove_duplicates() 6.48(301) decl.c, 539
remove_epsilon() 4.28(171) follow.c, 307
remove_symbols_from_table() 6.62(111) local.c, 560
request 6.63(19) gen.c, 567
ret() 6.12(60) virtual.h, 467
ret_reg() 6.87(553) op.c, 616
RETURN D.6(28) yyout.h, 832
return() A.68(103) glue.c, 772
REVERSE A.42(19) termlib.h, 754
reverse_links() 6.31(331) symtab.c, 502
rF 6.4(31) virtual.h, 454
rhs() 4.25(188) llpar.c, 285
RHSBITS 4.22(142) parser.h, 279
RIGHT A.71(46) box.h, 787
RIGHT 4.18(13) llout.h, 276
RIGHT 6.67(20) value.h, 584
RIGHT_OF_DOT() 5.20(35) yystate.c, 411
right_sides() 4.25(164) llpar.c, 285
rlabel() 6.104(1062) op.c, 640
_ROUND() A.4(10) set.h, 696
ROW 2.39(21) dfa.h, 125
Row A.52(4) dv_putc.c, 761
ROW_CPY() 2.49(46) squash.c, 147
ROW_EQUIV() 2.49(45) squash.c, 147
Row_map 2.49(24) squash.c, 147
RP 1.1(6) lex.h, 15
RP 4.2(6) llout.h, 229
RP D.6(42) yyout.h, 832
rule() 2.29(531) nfa.c, 102
rvalue() 6.68(84) value.c, 585
rvalue_name() 6.68(117) value.c, 585
s E.2(14) expr.y, 841
s 5.5(32) expr.y, 382
s2 6.34(1301) c.y, 512
s2 6.34(1302) c.y, 512
save() 2.25(146) nfa.c, 91
save() A.83(27) wincreat.c, 794
Save A.83(23) wincreat.c, 794
Savep 2.25(101) nfa.c, 91
S_breakpoint 4.8(90) yydebug.c, 243
SBUF A.42(45) termlib.h, 754
SCLASS 2.48(25) pairs.c, 143
SCLASS 2.49(28) squash.c, 147
SCLASS 6.25(108) symtab.h, 493
sclass_str() 6.32(358) symtab.c, 504
SCREEN A.46(21) video.h, 759
SCRNSIZE 4.8(39) yydebug.c, 243
SCRNSIZE 4.8(45) yydebug.c, 243
scroll() A.70(88) curses.h, 784
SCROLL_DOWN A.43(11) vbios.h, 756
scrollok() A.70(34) curses.h, 784
SCROLL_UP A.43(10) vbios.h, 756
SDEPTH 6.3(10) c-code.h, 454
searchenv() A.40(5) searchen.c, 747
SEG() A.1(39) debug.h, 681
SEG() 6.5(39) virtual.h, 457
select() 4.29(46) llselect.c, 310
SEMI 1.1(2) lex.h, 15
SEMI 4.18(14) llout.h, 276
SEMI 4.2(7) llout.h, 229
SEMI D.6(48) yyout.h, 832
SEPARATOR 4.18(15) llout.h, 276
SET A.4(20) set.h, 696
set_class_bit() 6.40(37) decl.c, 526
setcmp() A.7(215) set.c, 703
_SET_DISJ A.4(54) set.h, 696
_SET_EQUIV A.4(53) set.h, 696
sethash() A.7(254) set.c, 703
_SET_INTER A.4(55) set.h, 696
set_op() A.8(313) set.c, 706
SET_POSN A.43(8) vbios.h, 756
set_test() A.7(170) set.c, 703
_SETTYPE A.4(3) set.h, 696
SEVENTY_FIVE_PERCENT A.23(5) hashpjw.c, 724
shift_name() 6.68(61) value.c, 585
SHIFTOP D.6(21) yyout.h, 832
Shift_reduce 5.20(17) yystate.c, 411
showwin() A.78(3) showwin.c, 792
sigint_handler() 6.34(1283) c.y, 512
signon() 2.50(5) signon.c, 152
signon() 4.36(7) signon.c, 333
Singlestep 4.8(125) yydebug.c, 243
S_input 2.28(373) nfa.c, 97
SIZEOF D.6(40) yyout.h, 832
sMark 2.2(42) input.c, 39
snames 4.24(88) parser.lma, 283
Sort_by_number 5.24(204) yystate.c, 416
sp 6.4(38) virtual.h, 454
Sp 2.25(89) nfa.c, 91
spec_cpy() 6.30(211) symtab.c, 501
SPECIFIER 6.25(86) symtab.h, 493
specifier 6.24(84) symtab.h, 492
sprint_tok() 5.32(1087) yystate.c, 434
squash() 2.49(50) squash.c, 147
SSIZE 4.26(69) acts.c, 290
SSIZE 2.25(86) nfa.c, 91
ssort() A.34(7) ssort.c, 742
Sstack 2.25(88) nfa.c, 91
Sstack 4.8(108) yydebug.c, 243
stab 6.107(17) switch.h, 648
Stack 4.26(79) acts.c, 290
stack_clear() A.2(11) stack.h, 689
stack_cls 4.26(13) acts.c, 290
stack_cls A.2(5) stack.h, 689
stack_dcl() A.2(7) stack.h, 689
stack_ele() A.2(18) stack.h, 689
stack_empty() A.2(15) stack.h, 689
stack_err() 6.91(156) c.y, 622
stack_err() A.2(39) stack.h, 689
stack_full() A.2(14) stack.h, 689
stack_item() A.2(20) stack.h, 689
STACK_OK() 2.25(91) nfa.c, 91
stack_p() A.2(21) stack.h, 689
Stacksize 4.8(121) yydebug.c, 243
STACK_TOP 4.8(59) yydebug.c, 243
STACK_USED() 2.25(93) nfa.c, 91
Stack_window 4.8(116) yydebug.c, 243
STACK_WINSIZE 4.8(61) yydebug.c, 243
STAR D.6(10) yyout.h, 832
START 4.18(16) llout.h, 276
START 2.20(25) nfa.h, 85
Start_buf 2.2(39) input.c, 39
Start_line 4.17(28) parser.lex, 271
START_OPT 4.18(17) llout.h, 276
start_opt() 4.26(581) acts.c, 290
STATE 5.21(55) yystate.c, 412
state_cmp() 5.26(336) yystate.c, 421
state_hash() 5.26(372) yystate.c, 421
State_items 5.24(202) yystate.c, 416
statements() 1.10(11) args.c, 29
statements() 1.6(10) improved.c, 21
statements() 1.5(6) plain.c, 18
statements() 1.9(11) retval.c, 27
State_nitems 5.24(203) yystate.c, 416
STATENUM 5.21(42) yystate.c, 412
States 5.24(182) yystate.c, 416
STATIC 6.25(112) symtab.h, 493
statistics() 4.34(346) main.c, 328
STDIN 2.2(21) input.c, 39
stdscr A.76(3) initscr.c, 791
stk_err() 6.91(148) c.y, 622
stmt() 3.4(5) naive.c, 188
stol() A.25(47) stol.c, 727
stop_prnt() A.33(34) Driver, 736
stop_prnt() A.33(67) Driver, 736
stoul() A.25(3) stol.c, 727
Str_buf 6.74(141) c.y, 600
STRING D.6(5) yyout.h, 832
Strings 2.25(100) nfa.c, 91
strip_comments() 2.52(332) lex.c, 155
stripcr() 4.17(224) parser.lex, 271
stritem() 5.32(1109) yystate.c, 434
STR_MAX 6.74(140) c.y, 600
STR_MAX 2.21(31) nfa.h, 86
STRUCT D.6(27) yyout.h, 832
structdef 6.26(157) symtab.h, 495
Struct_free 6.28(17) symtab.c, 497
STRUCTOP D.6(24) yyout.h, 832
Struct_tab 6.26(159) symtab.h, 495
STRUCTURE 6.24(45) symtab.h, 492
STYPE 6.27(172) symtab.h, 497
stype 5.5(20) expr.y, 382
stype 5.10(37) yyout.c, 386
subset() A.7(275) set.c, 703
subwin() A.70(96) curses.h, 784
suffix D.5(16) c.lex, 829
suffix 6.37(44) c.lex, 519
SWIDTH 6.3(9) c-code.h, 454
SWITCH D.6(32) yyout.h, 832
SYMBOL 4.22(135) parser.h, 279
symbol 6.22(30) symtab.h, 488
Symbol_free 6.28(15) symtab.c, 497
symbols() 4.34(331) main.c, 328
Symbols 4.23(187) parser.h, 281
Symbol_tab 6.22(32) symtab.h, 488
sym_chain_str() 6.32(519) symtab.c, 504
SYM_FILE 4.20(102) parser.h, 277
SYM_FILE 4.20(92) parser.h, 277
Sympix 4.8(137) yydebug.c, 243
Symtab 4.23(212) parser.h, 281
SYNCH 1.7(78) improved.c, 23
SYNCH 4.18(18) llout.h, 276
Tab 6.35(1357) c.y, 515
TAB_FILE 4.20(101) parser.h, 277
tables() 4.30(66) llcode.c, 311
tables() 5.17(1) yycode.c, 408
tabtype 6.35(1355) c.y, 515
tail() 2.52(368) lex.c, 155
tail() 4.34(524) main.c, 328
tcmp() 6.35(1417) c.y, 515
tconst_str() 6.32(470) symtab.c, 504
Template 2.22(15) globals.h, 87
Template 4.23(210) parser.h, 281
term() 1.10(50) args.c, 29
term() 1.6(42) improved.c, 21
term() 3.4(44) naive.c, 188
term() 2.32(733) nfa.c, 108
term() 1.5(45) plain.c, 18
term() 1.9(50) retval.c, 27
Termchar 2.2(51) input.c, 39
term_prime() 3.4(51) naive.c, 188
term_prime() 1.5(53) plain.c, 18
Terms 4.23(194) parser.h, 281
TERM_SPEC 4.18(19) llout.h, 276
TEST() A.4(69) set.h, 696
tf_label() 6.79(199) op.c, 604
the_same_type() 6.31(258) symtab.c, 502
thompson() 2.34(831) nfa.c, 112
Threshold 2.52(32) lex.c, 155
Threshold 4.23(188) parser.h, 281
TIMES 1.1(4) lex.h, 15
TIMES 4.2(3) llout.h, 229
tmp_alloc() 6.64(24) temp.c, 575
tmp_create() 6.69(138) value.c, 588
tmp_free() 6.64(82) temp.c, 575
tmp_freeall() 6.64(109) temp.c, 575
tmp_gen() 6.69(225) value.c, 588
tmp_reset() 6.64(97) temp.c, 575
tmp_var_space() 6.64(121) temp.c, 575
TNODE 5.24(192) yystate.c, 416
TOKEN 2.28(330) nfa.c, 97
TOKEN_FILE 4.20(90) parser.h, 277
TOKEN_FILE 4.20(98) parser.h, 277
Tokens_printed 5.32(1085) yystate.c, 434
TOKEN_WIDTH 4.8(67) yydebug.c, 243
Token_window 4.8(120) yydebug.c, 243
Tokmap 2.28(332) nfa.c, 97
to_log() 4.16(1316) yydebug.c, 267
TOOHIGH() A.1(60) debug.h, 681
TOOLOW() A.1(61) debug.h, 681
TOP A.71(50) box.h, 787
Trace 6.63(11) gen.c, 567
trav() 7.5(48) optimize.c, 672
TRUE A.70(21) curses.h, 784
truncate() A.8(380) set.c, 706
Tspace 6.57(127) c.y, 555
TTYPE 2.39(12) dfa.h, 125
TWELVE_PERCENT A.23(6) hashpjw.c, 724
TYPE 4.18(20) llout.h, 276
TYPE 2.48(24) pairs.c, 143
TYPE 2.49(27) squash.c, 147
TYPE D.6(25) yyout.h, 832
TYPEDEF 6.24(51) symtab.h, 492
type_str() 6.32(409) symtab.c, 504
uchar 2.2(37) input.c, 39
U_GE() 6.19(78) virtual.h, 475
U_GT() 6.19(76) virtual.h, 475
UL A.71(55) box.h, 787
U_LE() 6.19(77) virtual.h, 475
U_LT() 6.19(75) virtual.h, 475
UNADJ_VAL() 4.20(75) parser.h, 277
UNCLOSED 5.24(207) yystate.c, 416
Uncompressed 4.23(189) parser.h, 281
Undecl 6.72(16) op.c, 595
UNDERLINED A.42(18) termlib.h, 754
Unfinished 5.24(200) yystate.c, 416
_UNION A.4(38) set.h, 696
UNION() A.4(43) set.h, 696
union_def() 4.26(665) acts.c, 290
union_def() 4.26(762) acts.c, 290
UNIX() A.77(16) mvwin.c, 791
UNIX() A.77(19) mvwin.c, 791
Unix 2.22(13) globals.h, 87
unlink() 6.16(63) virtual.h, 472
UNOP D.6(23) yyout.h, 832
unput() 2.19(156) lex.par, 75
UNSIGNED 6.25(110) symtab.h, 493
UR A.71(47) box.h, 787
USED_NONTERMS 4.20(48) parser.h, 277
USED_TERMS 4.20(47) parser.h, 277
User_cmp A.19(140) hash.c, 721
Use_stdout 4.23(190) parser.h, 281
UX() A.1(12) debug.h, 681
UX() A.1(17) debug.h, 681
va_arg() A.24(3) stdarg.h, 726
va_end() A.24(4) stdarg.h, 726
VA_LIST A.1(26) debug.h, 681
VA_LIST A.1(30) debug.h, 681
va_list A.24(1) stdarg.h, 726
VALNAME_MAX 6.67(5) value.h, 584
VALUE 6.25(118) symtab.h, 493
value 6.67(17) value.h, 584
Value_free 6.68(14) value.c, 585
var_dcl() 6.49(350) decl.c, 542
va_start() A.24(2) stdarg.h, 726
VB_BLOCKCUR() A.45(61) vbios.h, 758
VB_CLR_REGION() A.45(59) vbios.h, 758
VB_CLRS() A.45(58) vbios.h, 758
VB_CTOYX() A.45(51) vbios.h, 758
VB_CURSIZE() A.45(47) vbios.h, 758
vb_freesbuf() A.60(5) vb_frees.c, 768
vb_getchar() A.61(4) vb_getch.c, 768
VB_GETCUR() A.45(46) vbios.h, 758
VB_GETPAGE() A.45(44) vbios.h, 758
vb_getyx() A.62(3) vb_getyx.c, 768
VB_INCHA() A.45(45) vbios.h, 758
Vbios() A.44(8) vbios.c, 757
vb_iscolor() A.63(3) vb_iscol.c, 768
VB_NORMALCUR() A.45(62) vbios.h, 758
VB_OUTCHA() A.45(48) vbios.h, 758
vb_putc() A.64(4) vb_putc.c, 769
VB_PUTCHAR() A.45(63) vbios.h, 758
vb_puts() A.65(3) vb_puts.c, 769
VB_REPLACE() A.45(49) vbios.h, 758
vb_restore() A.66(5) vb_resto.c, 769
vb_save() A.67(6) vb_save.c, 770
VB_SCROLL() A.45(52) vbios.h, 758
VB_SETCUR() A.45(50) vbios.h, 758
VCHUNK 6.68(15) value.c, 585
VD_BOT A.71(82) box.h, 787
VD_CEN A.71(86) box.h, 787
VD_HORIZ A.71(85) box.h, 787
VDISPLAY A.46(24) video.h, 759
VD_LEFT A.71(84) box.h, 787
VD_LL A.71(81) box.h, 787
VD_LR A.71(87) box.h, 787
VD_RIGHT A.71(79) box.h, 787
VD_TOP A.71(83) box.h, 787
VD_UL A.71(88) box.h, 787
VD_UR A.71(80) box.h, 787
VD_VERT A.71(78) box.h, 787
VERBOSE() 4.33(35) main.c, 322
Verbose 2.22(11) globals.h, 87
Verbose 4.23(191) parser.h, 281
VERT A.71(45) box.h, 787
vfprintf() A.33(76) Driver, 736
VIDEO_INT A.43(5) vbios.h, 756
V_INT 6.25(119) symtab.h, 493
V_LONG 6.25(121) symtab.h, 493
void A.1(29) debug.h, 681
VOID 6.24(44) symtab.h, 492
vprintf() A.33(83) Driver, 736
VSCREEN A.46(25) video.h, 759
Vsize 4.8(105) yydebug.c, 243
Vspace 6.57(126) c.y, 555
vsprintf() A.33(96) Driver, 736
Vstack 4.8(103) yydebug.c, 243
V_STRUCT 6.25(123) symtab.h, 493
V_UINT 6.25(120) symtab.h, 493
V_ULONG 6.25(122) symtab.h, 493
W 6.9(45) virtual.h, 465
waddch() A.84(124) winio.c, 796
waddstr() A.79(3) waddstr.c, 793
Warn_exit 4.33(30) main.c, 322
WARNING 4.20(81) parser.h, 277
wclear() A.70(74) curses.h, 784
wclrtoeol() A.80(3) wclrtoeo.c, 793
werase() A.81(3) werase.c, 794
wgetch() A.84(72) winio.c, 796
WHILE D.6(35) yyout.h, 832
WHITE A.42(12) termlib.h, 754
white D.5(17) c.lex, 829
white 6.37(45) c.lex, 519
WHITESPACE 4.18(22) llout.h, 276
winch() A.82(3) winch.c, 794
WINDOW A.70(17) curses.h, 784
wmove() A.85(3) wmove.c, 800
WORD_HIGH_BIT 6.2(7) c-code.h, 452
WORD_PREFIX 6.10(13) c-code.h, 465
WORDPTR_PREFIX 6.10(17) c-code.h, 465
words 6.4(18) virtual.h, 454
WORD_WIDTH 6.2(2) c-code.h, 452
WP 6.9(49) virtual.h, 465
wprintw() A.86(12) wprintw.c, 800
wputc() A.86(5) wprintw.c, 800
wrapok() A.70(35) curses.h, 784
wrefresh() A.70(36) curses.h, 784
WRITE A.43(13) vbios.h, 756
write_screen() 4.10(607) yydebug.c, 251
WRITE_TTY A.43(14) vbios.h, 756
ws 4.17(37) parser.lex, 271
wscroll() A.87(19) wscroll.c, 801
XOR D.6(20) yyout.h, 832
xp A.68(103) glue.c, 772
Yya000 5.12(205) yyout.c, 389
Yya001 5.12(206) yyout.c, 389
Yya003 5.12(207) yyout.c, 389
Yya004 5.12(208) yyout.c, 389
Yya005 5.12(209) yyout.c, 389
Yya006 5.12(210) yyout.c, 389
Yya009 5.12(211) yyout.c, 389
Yya010 5.12(212) yyout.c, 389
Yya011 5.12(213) yyout.c, 389
YYABORT 4.4(54) llama.par, 230
YYACCEPT 4.4(50) llama.par, 230
Yyaccept 2.17(125) lexyy.c, 73
yy_act() 4.5(199) llout.c, 232
Yy_action 5.12(215) yyout.c, 389
yy_break() 4.15(1167) yydebug.c, 264
yybss() 4.6(414) llama.par, 236
yybss() 4.6(469) llama.par, 236
yybss() 5.13(42) occs.par, 394
yybss() 5.13(82) occs.par, 394
yybssout 4.6(301) llama.par, 236
Yy_cmap 2.15(71) lexyy.c, 69
yycode() 4.6(398) llama.par, 236
yycode() 4.6(453) llama.par, 236
yycode() 5.13(26) occs.par, 394
yycode() 5.13(66) occs.par, 394
yycodeout 4.6(300) llama.par, 236
yycomment() 2.18(11) lex_io.c, 74
yycomment() 4.6(477) llama.par, 236
yycomment() 5.13(90) occs.par, 394
yycomment() 4.10(488) yydebug.c, 251
YY_D() 2.13(45) lex.par, 65
YY_D() 2.13(47) lex.par, 65
YYD() 5.19(13) occs-act.par, 410
YYD() 5.19(16) occs-act.par, 410
YYD() 5.19(2) occs-act.par, 410
YYD() 5.19(4) occs-act.par, 410
Yyd 4.5(177) llout.c, 232
yydata() 4.6(406) llama.par, 236
yydata() 4.6(461) llama.par, 236
yydata() 5.13(34) occs.par, 394
yydata() 5.13(74) occs.par, 394
yydataout 4.6(302) llama.par, 236
YYDEBUG 4.3(2) llout.c, 230
YYERROR() 6.37(34) c.lex, 519
YYERROR() 2.19(151) lex.par, 75
yyerror() 2.18(22) lex_io.c, 74
yyerror() 4.6(485) llama.par, 236
yyerror() 5.13(98) occs.par, 394
yyerror() 4.10(510) yydebug.c, 251
YYF 2.13(51) lex.par, 65
YYF 4.4(35) llama.par, 230
Yyg000 5.12(235) yyout.c, 389
Yyg002 5.12(236) yyout.c, 389
Yyg007 5.12(237) yyout.c, 389
Yyg008 5.12(238) yyout.c, 389
yy_get_args() 4.9(320) yydebug.c, 247
Yy_goto 5.12(240) yyout.c, 389
yyhook_a() 6.36(17) main.c, 518
yyhook_a() 4.13(1) yyhook_a.c, 264
yyhook_b() 6.36(27) main.c, 518
yy_init_debug() 5.13(56) occs.par, 394
yy_init_debug() 4.9(180) yydebug.c, 247
yy_init_lex() D.2(1) yyinitlx.c, 822
yy i ni t l l ama() E. 20(64) expr.l ma, 887
yy i ni t l l ama() 4. 24(110) parser.I ma, 283
yy i ni t occs( ) 6. 34(1306) c.y, 512
yy i ni t occs( ) 5.5(73) expr.y, 382
yy i ni t _occs() E. 17(80) expr.y, 879
yy i ni t st ack() 5. 14(165) occs.par, 396
yy i nput () 4. 10(542) yydebug. c, 251
yy i n synch() 4. 7(502) l l ama.par, 239
YY_I SACT() 4. 4(47) l l ama.par, 230
YY_I SNONTERM() 4. 4(46) l l ama.par, 230
YY_I STERM( ) 4. 4(45) l l ama.par, 230
yyl eng 1.2(6) lex.c, 15
yyl eng 2. 19(139) lex.par, 75
yyl ess() 2. 19(157) l ex.par, 75
yyl ex () 2. 19(177) l ex.par, 75
Yy_l hs 5. 12(252) yyout.c, 389
yyl i neno 1.2(7) lex.c, 15
yyl i neno 2. 19(140) lex.par, 15
yyl val 6. 37(28) c.lex, 519
YYMAXDEPTH E. 8(19) expr.y, 857
YYMAXDEPTH 4. 4(66) l l ama.par, 230
YYMAXERR E. 8(20) expr.y, 857
YYMAXERR 4. 4(62) l l ama.par, 230
YY_MAXNONTERM 4. 3(31) llout.c, 230
YY_MAXTERM 4. 3(29) llout.c, 230
YY_MI NACT 4. 3(33) llout.c, 230
YY_MI NNONTERM 4. 3(30) llout.c, 230
YY_MI NTERM 4. 3(28) llout.c, 230
yymor e() 2. 19(154) lex.par, 75
yyner r s 4. 6(303) l l ama.par, 236
yyner r s 4. 25(29) llpar.c, 285
yy next () 2. 15(102) lexyy.c, 69
yy next () 2. 14(147) lexyy.c, 66
yy next () 4. 5(192) llout.c, 232
yy next () 2. 16(96) lexyy.c, 70
yy next () 5.13(1) occs.par, 394
yy_next () 5.3(1) yynext.c, 372
yy next oken() 5. 13(54) occs.par, 394
Yy_nxt 2.14(54) lexyy.c, 66
Yy_nxt 2.15(88) lexyy.c, 69
Yy_nxt 2.16(89) lexyy.c, 70
Yy_nxt0 2.16(54) lexyy.c, 70
Yy_nxt1 2.16(58) lexyy.c, 70
Yy_nxt2 2.16(72) lexyy.c, 70
Yy_nxt3 2.16(76) lexyy.c, 70
Yy_nxt4 2.16(80) lexyy.c, 70
Yy_nxt5 2.16(84) lexyy.c, 70
yyout 2.19(141) lex.par, 75
yy_output() 4.10(430) yydebug.c, 251
Yyp00 4.5(133) llout.c, 232
Yyp01 4.5(132) llout.c, 232
Yyp02 4.5(131) llout.c, 232
Yyp03 4.5(129) llout.c, 232
Yyp04 4.5(128) llout.c, 232
Yyp05 4.5(130) llout.c, 232
Yyp06 4.5(125) llout.c, 232
Yyp07 4.5(124) llout.c, 232
Yyp08 4.5(127) llout.c, 232
Yyp09 4.5(126) llout.c, 232
yyparse() 4.7(552) llama.par, 239
yyparse() 4.25(74) llpar.c, 285
yyparse() 5.14(245) occs.par, 396
yyparse 4.25(23) llpar.c, 285
yypop_() A.3(24) yystack.h, 690
yypop() A.3(30) yystack.h, 690
yy_pop() 4.6(345) llama.par, 236
yypopn_() A.3(34) yystack.h, 690
yypopn() A.3(35) yystack.h, 690
YYPRIVATE 2.13(36) lex.par, 65
YYPRIVATE 4.4(58) llama.par, 230
yyprompt() 4.16(1385) yydebug.c, 267
yy_pstack() 5.13(57) occs.par, 394
yy_pstack() 4.11(664) yydebug.c, 256
yypstk() 6.35(1423) c.y, 515
yypstk() E.20(70) expr.lma, 887
yypstk() 5.5(58) expr.y, 382
yypstk() 4.24(116) parser.lma, 283
yypstk() E.10(12) yypstk2.c, 865
yypstk() E.5(3) yypstk.c, 854
yypush_() A.3(23) yystack.h, 690
yypush() A.3(26) yystack.h, 690
yy_push() 4.6(331) llama.par, 236
Yy_pushtab 4.5(135) llout.c, 232
yy_quit_debug() 5.13(55) occs.par, 394
yy_quit_debug() 4.9(270) yydebug.c, 247
yy_recover() 5.14(182) occs.par, 396
yy_redraw_stack() 4.11(881) yydebug.c, 256
yy_reduce() 5.14(128) occs.par, 396
Yy_reduce 5.12(268) yyout.c, 389
Yy_rmap 2.15(83) lexyy.c, 69
Yy_sact 4.5(292) llout.c, 232
yy_say_whats_happening() 4.6(356) llama.par, 236
yy_shift() 5.14(111) occs.par, 396
YYSHIFTACT() 5.5(23) expr.y, 382
Yy_slhs 5.12(285) yyout.c, 389
Yy_snonterm 4.5(276) llout.c, 232
Yy_srhs 5.12(301) yyout.c, 389
YY_START_STATE 4.3(32) llout.c, 230
yystk_clear() A.3(10) yystack.h, 690
yystk_cls A.3(4) yystack.h, 690
yystk_dcl() A.3(6) yystack.h, 690
yystk_ele() A.3(17) yystack.h, 690
yystk_empty() A.3(14) yystack.h, 690
yystk_err() A.3(39) yystack.h, 690
yystk_full() A.3(13) yystack.h, 690
yystk_item() A.3(20) yystack.h, 690
yystk_p() A.3(21) yystack.h, 690
Yy_stok 4.5(256) llout.c, 232
Yy_stok 5.12(177) yyout.c, 389
YYSTYPE E.20(27) expr.lma, 887
YYSTYPE E.8(17) expr.y, 857
YYSTYPE 5.5(21) expr.y, 382
YYSTYPE 4.4(70) llama.par, 230
YYSTYPE 4.3(23) llout.c, 230
YYSTYPE 5.19(8) occs-act.par, 410
YYSTYPE 4.24(23) parser.lma, 283
yy_sym() 5.13(58) occs.par, 394
yy_sym() 4.6(310) llama.par, 236
yy_synch() 4.7(514) llama.par, 239
yytext 1.2(5) lex.c, 15
yytext 2.19(138) lex.par, 75
yytos() 4.4(96) llama.par, 230
YY_TTYPE 2.13(50) lex.par, 65
YY_TTYPE 4.4(34) llama.par, 230
YY_TTYPE 4.21(116) parser.h, 279
YY_TYPE 5.11(106) occs.par, 386
YYVERBOSE E.8(21) expr.y, 857
Yy_vsp 4.4(108) llama.par, 230
Yy_vstack 4.4(107) llama.par, 230
yyvstype 4.4(105) llama.par, 230
Errata: Compiler Design in C
1, 2 (1.02)
This document is a list of typos and corrections that need to be made to Compiler
Design in C, Allen Holub, 1990 (as of September 11, 1997). The corrections marked
"Disk only" represent changes made to the files on the distribution disk that either
don't affect the code in any significant way or are too big to insert into the book. All
these will eventually be incorporated into the second edition, if there is one, but they
won't be put into subsequent printings of the first edition. There are also many
trivial changes that have been made to the code on the disk to make it more robust:
function prototypes have been added here and there, as have includes for .h files
that contain prototypes; all .h files have been bracketed with statements like:
    #ifndef __FILE_EXT /* file name with _ for dot */
    #define __FILE_EXT
    ...
    #endif
to make multiple inclusions harmless; and so forth. These changes allow the code to
compile under Borland C++ as well as Microsoft C and BSD UNIX. None of these
trivial changes are documented here.
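The bracketing described above can be tried out in a single file. The following is a minimal sketch, not code from the distribution disk: the header name demo.h, the guard macro DEMO_H, and the DEMO_STACK and demo_depth() names are all hypothetical, and the header text is simply pasted twice to imitate two #include directives. The guard is spelled without leading underscores because ISO C reserves identifiers that begin with a double underscore.

```c
/* First inclusion of a hypothetical header, demo.h, pasted inline. */
#ifndef DEMO_H
#define DEMO_H
typedef struct { int top; } DEMO_STACK;
static int demo_depth( const DEMO_STACK *s ) { return s->top; }
#endif

/* Second inclusion of the same text: DEMO_H is already defined, so the
 * preprocessor discards everything up to the #endif and nothing is
 * defined a second time. */
#ifndef DEMO_H
#define DEMO_H
typedef struct { int top; } DEMO_STACK;
static int demo_depth( const DEMO_STACK *s ) { return s->top; }
#endif
```

Without the guard, the second copy of demo_depth() would be a redefinition error.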
The printings in which the error occurs and the software version in which the
change was made are identified in the margin at the head of each entry. For example,
the numbers next to the current paragraph indicate that the typo is found in
both the first and second printings, and that the bug was fixed in software version
1.02. Determine your printing by looking at the back of the flyleaf, where you'll
find a list of numbers that looks something like this: 10 9 8 7 6 5 4 3. The
smallest number in the list is the printing.
Page xvi - Seventh line from the bottom. Change "No credit cards" to "No purchase
orders or credit cards." The last two lines of the paragraph should read:
must add local sales tax. No purchase orders or credit cards (sorry). A Macintosh
version will be available eventually. Binary site licenses are available for educational
institutions.
Page xvi - Last line. Internet can now be accessed from CompuServe. Add the
CompuServe/internet address >INTERNET:holub@violet.berkeley.edu to the
parenthesized list at the end of the paragraph. A replacement paragraph follows:
The code in this book is bound to have a few bugs in it, though I've done my best to test
it as thoroughly as possible. The version distributed on disk will always be the most
recent. If you find a bug, please report it to me, either at the above address or
electronically. My internet address is holub@violet.berkeley.edu. CompuServe users can access
internet from the email system by prefixing this address with >INTERNET: (type help
internet for information). My UUCP address is .../ucbvax!violet!holub.
September 11, 1997 -10- Errata: Compiler Design in C
Page xvi - 15 lines from the bottom. Change the phone number to (510) 540-7954.
Page xviii - Line 7, change "that" to "than". The replacement line follows:
primitives. It is much more useful than pic in that you have both a WYSIWYG capability
Page 8 - The line that starts "JANE verb object" in the display at the bottom of the page
is repeated. Delete the first one. A replacement display follows:
sentence                         apply sentence→subject predicate to get:
subject predicate                apply subject→noun to get:
noun predicate                   apply noun→JANE to get:
JANE predicate                   apply predicate→verb object to get:
JANE verb object                 apply verb→SEES to get:
JANE SEES object                 apply object→noun opt_participle to get:
JANE SEES noun opt_participle    apply noun→SPOT to get:
JANE SEES SPOT opt_participle    apply opt_participle→participle to get:
JANE SEES SPOT participle        apply participle→RUN to get:
JANE SEES SPOT RUN               done; there are no more nonterminals to replace
Page 11 - Table 1.1, line 1 should read: statements → EOF. A replacement table follows:
1. statements → EOF
2.            | expression ; statements
3. expression → expression + term
4.            | term
5. term       → term * factor
6.            | factor
7. factor     → number
8.            | ( expression )
Page 12 - Figure 1.6. Replace the figure with the following one:
Figure 1.6. A Parse of 1+2
Page 16 - Listing 1.2, delete line 46. (Replace it with a blank line.)
Page 18 - Replace the untitled table just under Listing 1.4 with the following (only the
first line has been changed):
1.  statements  → expression ; eoi
2.              | expression ; statement
3.  expression  → term expression'
4.  expression' → + term expression'
5.              | ε
6.  term        → factor term'
7.  term'       → * factor term'
8.              | ε
9.  factor      → num_or_id
10.             | ( expression )
Page 24 - Line 110 of Listing 1.7 should read *p++ = tok;
110 *p++ = tok;
Page 26 - Change the caption to Figure 1.11 as follows:
Figure 1.11. A Subroutine Trace of i +2*3+4 (Improved Parser)
Page 27 - Change line 42 of Listing 27 to tempvar2 = term();
42 tempvar2 = term();
Page 27 - Replace the word temporary in the code part of lines 19, 26, and 42 of Listing
1.9 with the word tempvar. These three lines should now read:
19 tempvar =expression()
26 freename( tempvar );
Page 36 - Replace lines 16 and 17 of Listing 2.1 with the following (I've added a few
parentheses):
Page 44 - Replace line 186 of Listing 2.5 with the following line:
Page 55 - Eleven lines from bottom. The display that says ([^a-z]|\en) should read as
follows:
([^a-z]|\n)
Page 41 - Modify line 113 of Listing 2.3 to the following:
Page 46 - Replace lines 299 and 300 of Listing 2.6 with the following lines:
299 int need,   /* Number of bytes required from input. */
300     got;    /* Number of bytes actually read.       */
Page 57 - Line six should read as follows:
causes a transition to State 1; from State 1, an e gets the machine to State 2, and an i
Page 57 - Eleventh line from the bottom should read as follows:
next_state = Transition_table[ current_state ][ input_character ];
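As an illustration of the table lookup in the corrected line, here is a minimal sketch of a table-driven recognizer. The machine itself is an assumption made up for this example (it accepts one or more digits optionally followed by a decimal point and more digits); it is not the machine pictured on page 57. ERR plays the role of the implicit error state, and NSTATES, NCHARS, init_table(), and accepts() are hypothetical names.

```c
#include <string.h>

#define NSTATES 4
#define NCHARS  128
#define ERR     (-1)        /* the implicit error state */

static signed char Transition_table[NSTATES][NCHARS];
static int         Initialized = 0;

/* Build the table for the hypothetical machine: digits+ ( '.' digits+ )? */
static void init_table( void )
{
    int c;

    memset( Transition_table, ERR, sizeof Transition_table );
    for( c = '0'; c <= '9'; ++c )
    {
        Transition_table[0][c] = 1;     /* first digit               */
        Transition_table[1][c] = 1;     /* more integer digits       */
        Transition_table[2][c] = 3;     /* first digit after the '.' */
        Transition_table[3][c] = 3;     /* more fraction digits      */
    }
    Transition_table[1]['.'] = 2;
    Initialized = 1;
}

/* Return 1 if the entire string is accepted, 0 otherwise. */
int accepts( const char *s )
{
    int current_state = 0;

    if( !Initialized )
        init_table();

    for( ; *s; ++s )
    {
        unsigned char input_character = (unsigned char) *s;

        if( input_character >= NCHARS )
            return 0;

        current_state = Transition_table[ current_state ][ input_character ];
        if( current_state == ERR )
            return 0;
    }
    return current_state == 1 || current_state == 3;   /* accepting states */
}
```

Every cell that is not filled in explicitly stays at ERR, which is how the "implied" error transitions come for free.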
Page 57 - Last line of second paragraph should read as follows (delete the "s"):
r, or e from State 0) are all implied transitions to a special implicit error state.
Page 63 - Listing 2.11, lines 2 and 3: remove the semicolons.
Page 68 - The first line beneath the Figure should read as follows (the Figure and Listing
numbers are wrong):
table is shown in Figure 2.6 and in Listing 2.15. The Yy_cmap[] array is indexed by
Page 72 - The first display (second and third lines) should read as follows:
['0',2] ['1',2] ['2',2] ['3',2] ['4',2]
['5',2] ['6',2] ['7',2] ['8',2] ['9',2] ['e',5]
Page 73 - The first line of the third paragraph calls out the wrong line number. It should
read as follows:
The yyerror() macro on line 151 of Listing 2.19 prints internal error messages.
Page 73 - First paragraph, lines two and four. Both references to Listing 2.18 should be
to Listing 2.19. A replacement first paragraph follows:
The remainder of lexyy.c file is the actual state-machine driver, shown in Listing 2.19.
The first and last part of this listing are the second and third parts of the Ctrl-L-delimited
template file discussed earlier. The case statements in the middle (on lines 287 to 295 of
Listing 2.19) correspond to the original code attached to the regular expressions in the
input file and are generated by LeX itself.
Page 76 - Listing 2.19, line 166. Some code is missing. Replace line 166 with the following
line:
Page 78 - Listing 2.19, lines 292 and 294. Align the F in FCON under the I in ICON (on
line 288).
Page 84 - Figure 2.13(e). The arrows should point from states 5 and 11 to state 13. Here
is a replacement figure:
Figure 2.13. Constructing an NFA for (D*\.D|D\.D*)
Page 85 - Third line from the bottom (which starts "For example") should read:
For example, in a machine with a 16-bit int, the first two bytes of the string are the
Page 86 - Listing 2.21. Change the number 512 on line 28 to 768.
768
Page 91 - Change the definition of STACK_USED on line 93 of Listing 2.25 to
((int)(Sp-Sstack)+1). A replacement line follows:
93 #define STACK_USED() ( (int)(Sp-Sstack) + 1 ) /* slots used */
Page 92 - Change line 167 of Listing 2.25 to the following:
167 if( textp >= (char *)strings + (STR_MAX-1) )
Page 95 - Listing 2.26. Replace lines 280 to 283 with the following lines:
280 *p = '\0';                  /* Overwrite close brace. { */
281 if( !(mac = (MACRO *) findsym( Macros, *namep )) )
282     parse_err( E_NOMAC );
283 *p++ = '}';                 /* Put the brace back.      */
Page 104 - Second paragraph, first four lines (above the first picture) should read:
Subroutines expr() and cat_expr() are in Listing 2.30. These routines handle the
binary operations: | (OR) and concatenation. I'll show how expr() works by watching
it process the expression A|B. The cat_expr() call on line 621 creates a machine that
recognizes the A:
Page 121 - Listing 2.38. Replace lines 218 to 255 with the following:
    start_dfastate = newset();
    ADD( start_dfastate, sstate );

    if( !e_closure( start_dfastate, &accept, &anchor ) )
    {
        fprintf( stderr, "Internal error: State machine is empty\n" );
        exit(1);
    }

    current = newset();                                     /* 3 */
    ASSIGN( current, start_dfastate );

    /* Now interpret the NFA: The next state is the set of all NFA states that
     * can be reached after we've made a transition on the current input
     * character from any of the NFA states in the current state. The current
     * input line is printed every time an accept state is encountered.
     * The machine is reset to the initial state when a failure transition is
     * encountered.
     */

    while( c = nextchar() )
    {
        next = e_closure( move(current, c), &accept, &anchor );
        if( accept )
        {
            printbuf();                                     /* accept       */
            if( next )
                delset( next );
            ASSIGN( current, start_dfastate );              /* reset        */
        }
        else
        {
            delset( current );                              /* keep looking */
            current = next;
        }
    }

    delset( current );          /* Not required for main, but you'll   */
    delset( start_dfastate );   /* need it when adapting main() to a   */
}                               /* subroutine.                         */
#endif
Page 122 - First display, which starts with ε-closure({12}), should read as follows:
ε-closure({0}) = {0, 1, 3, 4, 5, 12} (new DFA State 0)
Page 123 - Display on lines 6-9. Second line of display, which now reads
ε-closure({7,11}) = {9, 11, 13, 14}, is wrong. The entire display should read as follows:
DFA State 7 = {11}
ε-closure({11}) = {9, 11, 13, 14}
move({9, 11, 13, 14}, .) = ∅
move({9, 11, 13, 14}, D) = {11} (existing DFA State 7)
Page 124 - First line of second paragraph should read:
The maximum number of DFA states is defined on line seven of Listing 39 to be
Page 129 - First line should say "Listing 2.41", not "Listing 2.40". Second line should
say "line 119", not "line 23". The fourth line should say "line 129", not "line 31". A
replacement paragraph follows:
Several support functions are needed to do the work, all in Listing 2.41. The
add_to_dstates() function on line 119 adds a new DFA state to the Dstates array
and increments the number-of-states counter, Nstates. It returns the state number (the
index in Dstates) of the newly added state. in_dstates() on line 139 of Listing 2.41
is passed a set of NFA states and returns the state number of an existing state that uses
the same set, or -1 if there is no such state.
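The contract of the two support functions can be sketched in a few lines. This is a simplified stand-in, not the code in Listing 2.41: the NFA_SET bitmask type and the MAX_DSTATES limit are assumptions made for the example (the book uses its own SET type), and returning -1 when the array is full is likewise an assumption.

```c
#define MAX_DSTATES 32                  /* hypothetical limit for this sketch */

typedef unsigned long NFA_SET;          /* bit i set => NFA state i is in the set */

static NFA_SET Dstates[ MAX_DSTATES ];  /* one set of NFA states per DFA state */
static int     Nstates = 0;             /* number of DFA states so far         */

/* Add a new DFA state and return its state number (its index in Dstates),
 * or -1 if the array is full. */
int add_to_dstates( NFA_SET nfa_set )
{
    if( Nstates >= MAX_DSTATES )
        return -1;

    Dstates[ Nstates ] = nfa_set;
    return Nstates++;
}

/* Return the number of an existing DFA state that uses the same set of
 * NFA states, or -1 if there is no such state. */
int in_dstates( NFA_SET nfa_set )
{
    int i;

    for( i = 0; i < Nstates; ++i )
        if( Dstates[i] == nfa_set )
            return i;

    return -1;
}
```

During subset construction, a caller would typically try in_dstates() first and fall back to add_to_dstates() only when it returns -1.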
Page 129 - Replace lines 147 and 148 of Listing 2.41 with the following:
Page 130 - Replace lines 193 and 194 of Listing 2.41 with the following
Page 139 - Replace line 198 of Listing 2.44 with the following:
198 SET **end = &Groups[Numgroups];
Replace Line 204 of Listing 139 with the following:
204 for( current = Groups; current < end; ++current )
Page 140 - Replace lines 229 and 230 of Listing 2.45 with the following:
229 SET **end = &Groups[Numgroups];
230 for( current = Groups; current < end; ++current )
Page 152 - Remove the #include "date.h" on line 3.
Page 157 - Replace line 117 with the following (I've added the --argc on the left):
Page 167 - 13th line from the bottom, returns should be return.
Page 167 - Third line from the bottom. Change the second comma to a semicolon.
Page 171 - Replace the first display (which now reads noun→time|banana) with the
following:
noun → fruit | banana
Page 173 - Second display, text to the right of the =L> is missing. It should read as
follows:
compound_stmt =L> LEFT_CURLY stmt RIGHT_CURLY
              =L> LEFT_CURLY RIGHT_CURLY
Page 173 - First line after second display, change expr→ε to stmt→ε. The line should
read:
The application of stmt→ε effectively removes the nonterminal from the derivation by
Page 173 - 11th and 16th line from the bottom. Change CLOSE_CURLY to
RIGHT_CURLY.
Page 175 - Figure 3.2. The line that reads 4DIGIT errorshould read V^DI GI T
5.
Page 177 - Listing 3.2, Line 12, should read:
Page 178 - Third line below the one that starts with "Since" (next to the marginal note),
replace "the the" with "the". All three lines should read:
Since a decl_list can go to ε, the list could be empty. Parse trees for left and right recursive
lists of this type are shown in Figure 3.5. Notice here, that the decl_list→ε
production is the first list-element that's processed in the left-recursive list, and it's the
last
(Marginal note: Productions executed first or last.)
Page 179 - Fourth line of second display, "declarator" should be "declarator_list". Line
should read:
declarator_list TYPE declarator_list
Page 180 - Table 3.2. Grammatical rules for the "Separator Between List Elements, Zero
elements okay" row are wrong. Replace the table with the following one:
Table 3.2. List Grammars

No Separator
                       Right associative                   Left associative
At least one element   list → MEMBER list | MEMBER         list → list MEMBER | MEMBER
Zero elements okay     list → MEMBER list | ε              list → list MEMBER | ε

Separator Between List Elements
                       Right associative                   Left associative
At least one element   list → MEMBER delim list | MEMBER   list → list delim MEMBER | MEMBER
Zero elements okay     opt_list → list | ε                 opt_list → list | ε
                       list → MEMBER delim list | MEMBER   list → list delim MEMBER | MEMBER

A MEMBER is a list element; it can be a terminal, a nonterminal, or a collection of terminals and
nonterminals. If you want the list to be a list of terminated objects such as semicolon-terminated
declarations, MEMBER should take the form: member → α TERMINATOR, where α is a collection of
one or more terminal or nonterminal symbols.
Page 181 - Figure 3.6. Change all "statements" to "stmt" for consistency. Also change
"expression" to "expr". A new figure follows:
Figure 3.6. A Parse Tree for 1+2*(3+4)+5;
Page 183 - Display at bottom of page. Remove the exclamation point. The expression
should read:
expr' → + term {op('+');} expr'
Page 186 - Figure 3.8. Change caption to "Augmented parse tree for 1+2+3;" and
change "statements" to "stmt". A new figure follows:
Figure 3.8. Augmented Parse Tree for 1+2+3;
[Figure not reproduced: a numbered parse tree rooted at stmt, whose expr, term, factor, and num nodes are augmented with {create_tmp(yytext);} and {op("+");} actions.]
Page 188 - The caption for Listing 3.4 should say "Inherited" (not "Synthesized")
"Attributes." A replacement caption follows:
Listing 3.4. naive.c - Code Generation with Inherited Attributes
Page 190 - First three lines beneath the figure (which start right-hand side) should
read:
right-hand side of the production, as if they were used in the subroutine. For example,
say that an attribute, t, represents a temporary variable name, and it is attached to an
expr, it is represented like this in the subroutine representing the expr:
Page 196 - Figure 4.1. Both instances of the word "number" should be in boldface:
number number
Page 208 - Third and eighth lines, change "automata" to "automaton". Replacement
lines:
solution is a push-down automaton in which a state machine controls the activity on the
The tables for the push-down automaton used by the parser are relatively straightfor-
Page 210 - Figure 4.6, Yy_d[factor][LP] should be "9," not "8."
9
Page 213 - First paragraph, add SEMICOLON to the list on the third line. A replacement
paragraph follows.
Production 1 is applied if a statement is on top of the stack and the input symbol is an
OPEN_CURLY. Similarly, Production 2 is applied when a statement is on top of the
stack and the input symbol is an OPEN_PAREN, NUMBER, SEMICOLON, or
IDENTIFIER (an OPEN_PAREN because an expression can start with an OPEN_PAREN
by Production 3, a NUMBER or IDENTIFIER because an expression can start with a
term, which can, in turn, start with a NUMBER or IDENTIFIER). The situation is
complicated when an expression is on top of the stack, however. You can use the same rules
as before to figure out whether to apply Productions 3 or 4, but what about the ε
production (Production 5)? The situation is resolved by looking at the symbols that can follow
an expression in the grammar. If expression goes to ε, it effectively disappears from the
current derivation (from the parse tree); it becomes transparent. So, if an expression is
on top of the stack, apply Production 5 if the current lookahead symbol can follow an
expression (if it is a CLOSE_CURLY, CLOSE_PAREN, or SEMICOLON). In this
last situation, there would be serious problems if CLOSE_CURLY could also start an
expression. The grammar would not be LL(1) were this the case.
Page 213 - Last six lines should be replaced with the following seven lines.
ones. Initially, add those terminals that are at the far left of a right-hand side:
FIRST(stmt)   = { }
FIRST(expr)   = {ε}
FIRST(expr')  = {PLUS, ε}
FIRST(term)   = { }
FIRST(term')  = {TIMES, ε}
FIRST(factor) = {LEFT_PAREN, NUMBER}
Page 214 - Remove the first line beneath Table 4.14 [which starts "FIRST(factor)"].
Page 214 - Table 4.13, item (3), third line. Replace "is are" with "are." A replacement
table follows:
Table 4.13. Finding FIRST Sets
(1) FIRST(A), where A is a terminal symbol, is {A}. If A is ε, then ε is put into the FIRST set.
(2) Given a production of the form
        s → A α
    where s is a nonterminal symbol, A is a terminal symbol, and α is a collection of zero or more
    terminals and nonterminals, A is a member of FIRST(s).
(3) Given a production of the form
        s → b α
    where s and b are single nonterminal symbols, and α is a collection of terminals and
    nonterminals, everything in FIRST(b) is also in FIRST(s).
    This rule can be generalized. Given a production of the form
        s → α B β
    where s is a nonterminal symbol, α is a collection of zero or more nullable nonterminals,† B is a
    single terminal or nonterminal symbol, and β is a collection of terminals and nonterminals, then
    FIRST(s) includes the union of FIRST(B) and FIRST(α). For example, if α consists of the three
    nullable nonterminals x, y, and z, then FIRST(s) includes all the members of FIRST(x), FIRST(y),
    and FIRST(z), along with everything in FIRST(B).
† A nonterminal is nullable if it can go to ε by some derivation. ε is always a member of a nullable
nonterminal's FIRST set.
Page 214 - This is a change that comes under the "should be explained better" category
and probably won't make it into the book until the second edition. It confuses the issue a
bit to put ε into the FIRST set, as per rule (1) in Table 4.13. (In fact, you could argue that
it shouldn't be there at all.) I've put ε into the FIRST sets because its presence makes it
easier to see if a production is nullable (it is if ε is in the FIRST set). On the other hand,
you don't have to transfer the ε to the FOLLOW set when you apply the rules in Table
4.15 because ε serves no useful purpose in the FOLLOW set. Consequently, ε doesn't
appear in any of the FOLLOW sets that are derived on pages 215 and 216.
Page 214 - Bottom line and top of next page. Add ε to FIRST(expr), FIRST(expr'), and
FIRST(term'). A replacement display, which replaces the bottom two lines of page 214
and the top four lines of page 215, follows:
FIRST(stmt)   = {LEFT_PAREN, NUMBER, SEMI}
FIRST(expr)   = {LEFT_PAREN, NUMBER, ε}
FIRST(expr')  = {PLUS, ε}
FIRST(term)   = {LEFT_PAREN, NUMBER}
FIRST(term')  = {TIMES, ε}
FIRST(factor) = {LEFT_PAREN, NUMBER}
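The FIRST-set rules can be checked mechanically by iterating them until nothing changes. The sketch below is an illustration only, not code from the book. The grammar is not shown in this errata, so the productions in the Grammar[] table are an assumption, chosen because they are consistent with the sets in the display above; the PRODUCTION type, the bitmask representation, and compute_first() are likewise hypothetical.

```c
#include <stddef.h>

/* Terminals come first, then EPS (the empty string), then nonterminals. */
enum { PLUS, TIMES, LEFT_PAREN, RIGHT_PAREN, NUMBER, SEMI, EPS,
       STMT, EXPR, EXPR_P, TERM, TERM_P, FACTOR, NSYMS };

#define IS_TERM(s)  ( (s) < EPS )
#define BIT(s)      ( 1u << (s) )
#define MAX_RHS     3

typedef struct { int lhs; int len; int rhs[ MAX_RHS ]; } PRODUCTION;

/* Assumed grammar; a length of 0 marks an epsilon production. */
static const PRODUCTION Grammar[] =
{
    { STMT,   2, { EXPR, SEMI } },
    { EXPR,   2, { TERM, EXPR_P } },
    { EXPR,   0, { 0 } },
    { EXPR_P, 3, { PLUS, TERM, EXPR_P } },
    { EXPR_P, 0, { 0 } },
    { TERM,   2, { FACTOR, TERM_P } },
    { TERM_P, 3, { TIMES, FACTOR, TERM_P } },
    { TERM_P, 0, { 0 } },
    { FACTOR, 3, { LEFT_PAREN, EXPR, RIGHT_PAREN } },
    { FACTOR, 1, { NUMBER } }
};
#define NPRODS ( sizeof Grammar / sizeof *Grammar )

unsigned First[ NSYMS ];        /* First[s] is a bit set of terminals and EPS */

void compute_first( void )
{
    int    s, i, changed;
    size_t p;

    for( s = 0; s < NSYMS; ++s )                /* Rule 1: FIRST(A) = {A} */
        First[s] = IS_TERM(s) ? BIT(s) : 0;

    do                                          /* Rules 2 and 3, repeated */
    {                                           /* until nothing is added. */
        changed = 0;
        for( p = 0; p < NPRODS; ++p )
        {
            const PRODUCTION *pr  = &Grammar[p];
            unsigned          old = First[ pr->lhs ];

            for( i = 0; i < pr->len; ++i )
            {
                unsigned f = First[ pr->rhs[i] ];

                First[ pr->lhs ] |= f & ~BIT(EPS);
                if( !(f & BIT(EPS)) )           /* not nullable: stop here */
                    break;
            }
            if( i == pr->len )                  /* whole right-hand side   */
                First[ pr->lhs ] |= BIT(EPS);   /* can go to epsilon       */

            if( First[ pr->lhs ] != old )
                changed = 1;
        }
    } while( changed );
}
```

Note that SEMI ends up in FIRST(stmt) only because expr is nullable, which is the generalized form of rule (3).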
Page 216 - Add the following sentence to the end of item (2):
Note that, since ε serves no useful purpose in a FOLLOW set, it does not have to be
transferred from the FIRST to the FOLLOW set when the current rule is applied. A
replacement table follows:
Table 4.15. Finding FOLLOW Sets
(1) If s is the goal symbol, eoi (the end-of-input marker) is in FOLLOW(s);
(2) Given a production of the form:
        s → ...a B...
    where a is a nonterminal and B is either a terminal or nonterminal, FIRST(B) is in FOLLOW(a);
    To generalize further, given a production of the form:
        s → ...a α B...
    where s and a are nonterminals, α is a collection of zero or more nullable nonterminals and B is
    either a terminal or nonterminal, FOLLOW(a) includes the union of FIRST(α) and FIRST(B).
    Note that, since ε serves no useful purpose in a FOLLOW set, it does not have to be transferred
    from the FIRST to the FOLLOW set when the current rule is applied.
(3) Given a production of the form:
        s → ...a
    where a is the rightmost nonterminal on the right-hand side of a production, everything in
    FOLLOW(s) is also in FOLLOW(a). (I'll describe how this works in a moment.) To generalize
    further, given a production of the form:
        s → ...a α
    where s and a are nonterminals, and α is a collection of zero or more nullable nonterminals,
    everything in FOLLOW(s) is also in FOLLOW(a).
Page 217 - Grammar in the middle of the page. Delete the pk and adm at the right edges
of Productions 1 and 2.
Page 218 - Move the last line of page 217 to the top of the current page to eliminate the
orphan.
Page 218 - Table 4.16. Replace the table with the following one (I've made several small
changes). You may also want to move the widow at the top of the page to beneath the
table while you're at it.
Table 4.16. Finding LL(1) Selection Sets
A production is nullable if the entire right-hand side can go to ε. This is the case, both when the
right-hand side consists only of ε, and when all symbols on the right-hand side can go to ε by
some derivation.
For nonnullable productions: Given a production of the form
    s → α B...
where s is a nonterminal, α is a collection of one or more nullable nonterminals, and B is either a
terminal or a nonnullable nonterminal (one that can't go to ε) followed by any number of
additional symbols: the LL(1) select set for that production is the union of FIRST(α) and FIRST(B).
That is, it's the union of the FIRST sets for every nonterminal in α plus FIRST(B). If α doesn't
exist (there are no nullable nonterminals to the left of B), then SELECT(s)=FIRST(B).
For nullable productions: Given a production of the form
    s → α
where s is a nonterminal and α is a collection of zero or more nullable nonterminals (it can be ε):
the LL(1) select set for that production is the union of FIRST(α) and FOLLOW(s). In plain
words: if a production is nullable, it can be transparent; it can disappear entirely in some
derivation (be replaced by an empty string). Consequently, if the production is transparent, you have to
look through it to the symbols that can follow it to determine whether it can be applied in a given
situation.
Page 223 - Replace the last two lines on the page as follows:
Ambiguous productions, such as those that have more than one occurrence of a given
nonterminal on their right-hand side, cause problems in a grammar because a unique
parse tree is not generated for a given input. As we've seen, left factoring can be used to
Page 224 - The sentence on lines 13 and 14 (which starts with "If an ambiguous")
should read as follows:
If an ambiguous right-hand side is one of several, then all of these right-hand sides
must move as part of the substitution. For example, given:
Page 228 - 8th line from the bottom, the "Y" in "You" should be in lower case.
you can use a corner substitution to make the grammar self-recursive, replacing the
Page 222 - First line of text should read "Figures 4.5 and 4.6" rather than "Figure 4.6".
A replacement paragraph follows:
4.5 are identical in content to the ones pictured in Figures 4.5 and 4.6 on page 210. Note
that the Yy_d table on lines 179 to 184 is not compressed because this output file was
generated with the -f switch active. Were -f not specified, the tables would be pair
compressed, as is described in Chapter Two. The yy_act() subroutine on lines 199 to
234 contains the switch that holds the action code. Note that $ references have been
translated to explicit value-stack references (Yy_vsp is the value-stack pointer). The
Yy_synch array on lines 243 to 248 is a -1-terminated array of the synchronization
tokens specified in the %synch directive.
Page 237 - The loop control on line 377 of Listing 4.6 won't work reliably in the 8086
medium or compact models. Replace lines 372-384 with the following:
372     int nterms;                         /* # of terms in the production  */
373     start = Yy_pushtab[ production ];
374     for( end = start; *end; ++end )     /* After loop, end is positioned */
375         ;                               /* to right of last valid symbol */
376     count = sizeof(buf);
377     *buf  = '\0';
378     for( nterms = end - start; --nterms >= 0 && count > 0; )   /* Assemble */
379     {                                                           /* string.  */
380         strncat( buf, yy_sym(*--end), count );
381         if( (count -= strlen( yy_sym(*end) + 1 )) < 1 )
382             break;
383         strncat( buf, " ", --count );
384     }
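The replacement loop counts symbols down with an integer (nterms) while it assembles the string from right to left. The same idea can be sketched in isolation as follows. This is an illustration under stated assumptions rather than the book's code: join_reversed() and its parameters are hypothetical names, and the buffer accounting is done with explicit lengths instead of strncat().

```c
#include <string.h>

/* Append names[n-1] ... names[0] to buf, each followed by one space,
 * stopping early rather than overflowing the size-byte buffer.
 * Returns buf. */
char *join_reversed( char *buf, size_t size, const char *const *names, int n )
{
    size_t used = 0;            /* characters stored so far, excluding '\0' */

    if( size == 0 )
        return buf;

    buf[0] = '\0';
    while( --n >= 0 )           /* integer countdown, no pointer comparison */
    {
        size_t len = strlen( names[n] );

        if( used + len + 1 >= size )    /* need room for name, ' ', and '\0' */
            break;

        memcpy( buf + used, names[n], len );
        used += len;
        buf[ used++ ] = ' ';
        buf[ used   ] = '\0';
    }
    return buf;
}
```

For example, joining the three names "expr", "PLUS", and "term" into a 32-byte buffer yields "term PLUS expr " (with a trailing space), while an 8-byte buffer holds only "term ".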
Page 242 - Pieces of the section heading for Section 4.9.2 are in the wrong font, and the
last two lines are messed up. Replace with the following:
4.9.2 Occs and LLama Debugging Support - yydebug.c
This section discusses the debug-mode support routines used by the llama-generated
parser in the previous section. The same routines are used by the occs-generated parser
discussed in the next chapter. You should be familiar with the interface to the curses
window-management functions described in Appendix A before continuing.
Page 255 - Fourth line beneath the listing (starting with "teractive mode"), replace
comma following the close parenthesis with a period. The line should read:
teractive mode (initiated with an n command). In this case, a speedometer readout that
Page 271-303 - Odd numbered pages. Remove all tildes from the running heads.
Page 274 - Add the statement looking_for_brace = 0; between lines 179 and 180.
Do it by replacing lines 180-190 with the following:
180                 looking_for_brace = 0;
181             }
182             else
183             {
184                 if( c == '%' ) looking_for_brace = 1;
185                 else           output( "%c", c );
186             }
187         }
188     }
189     return CODE_BLOCK;
190 }
Page 278 - Fifth line from bottom. Replace {create_tmp(yytext);} with the following
(to get the example to agree with Figure 4.9):
{rvalue(yytext);}
Page 282 - The last paragraph should read as follows (the Listing and Table numbers
are wrong):
The recursive-descent parser for LLama is in Listing 4.25. It is a straightforward
representation of the grammar in Table 4.19.
60  PRIVATE int *Dtran;     /* Internal representation of the parse table.
61                           * Initialization in make_yy_dtran() assumes
62                           * that it is an int [it calls memiset()].
63                           */
231 nterms    = USED_TERMS + 1;                 /* +1 for EOI */
232 nnonterms = USED_NONTERMS;
233
234 i = nterms * nnonterms;                     /* Number of cells in array */
235
236 if( !(Dtran = (int *) malloc( i * sizeof(*Dtran) )) )   /* number of bytes */
237     ferr( "Out of memory\n" );
238
239 memiset( Dtran, -1, i );                /* Initialize Dtran to all failures */
240 ptab( Symtab, fill_row, NULL, 0 );      /* and fill nonfailure transitions. */
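The allocate-then-fill pattern in the lines above can be sketched on its own. This is not the book's code: fill_ints() is a hypothetical stand-in for memiset(), whose implementation is not shown here, and new_table() is an invented wrapper that returns NULL instead of calling ferr().

```c
#include <stdlib.h>

/* Hypothetical stand-in for memiset(): store value in each of the first
 * count ints of dst. An element-by-element fill works for any value,
 * whereas memset() fills individual bytes. */
static int *fill_ints( int *dst, int value, size_t count )
{
    size_t i;

    for( i = 0; i < count; ++i )
        dst[i] = value;
    return dst;
}

/* Allocate an nterms x nnonterms table of ints, with every cell set to
 * -1 (failure). Returns NULL if memory is exhausted. */
int *new_table( int nterms, int nnonterms )
{
    size_t cells = (size_t) nterms * (size_t) nnonterms;
    int    *table = malloc( cells * sizeof *table );   /* size in bytes */

    if( table )
        fill_ints( table, -1, cells );
    return table;
}
```

The point the errata makes is the distinction between the number of cells (passed to the fill routine) and the number of bytes (passed to malloc()).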
Page 330 - Listing 4.34, line 464. Delete everything except the line number.
Page 330 - Last two lines of second paragraph should read as follows:
bottom-up parse tables are created, below. Most practical LL(1) grammars are also
LR(1) grammars, but not the other way around.
Page 330 - Add the right-hand side "| NUMBER" to the grammar in exercise 4.5. Also,
align the table in Exercise 4.5 under the text so that it no longer extends into the gutter.
expr expr
* expr
expr * expr
expr / expr
expr =expr
expr +expr
expr expr
( expr )
NUMBER
Page 349 - Replace Table 5.5 with the following table:
Table 5.5. Error Recovery for 1++2
Stack                         Input          Comments
state:                        1 + + 2 |-     Shift start state
parse:
state: 0                      1 + + 2 |-     Shift NUM (goto 1)
parse: $
state: 0 1                    + + 2 |-       Reduce by Production 3 (!T→NUM)
parse: $ NUM
state: 0 3                    + + 2 |-       Reduce by Production 2 (!E→!T)
parse: $ !T
state: 0 2                    + + 2 |-       Shift !+ (goto 4)
parse: $ !E
state: 0 2 4                  + 2 |-         ERROR (no transition in table)
parse: $ !E !+                               Pop one state from stack
state: 0 2                    + 2 |-         There is a transition from 2 on !+
parse: $ !E                                  Error recovery is successful
state: 0 2                    + 2 |-         Shift !+ (goto 4)
parse: $ !E
state: 0 2 4                  2 |-           Shift NUM (goto 1)
parse: $ !E !+
state: 0 2 4 1                               Reduce by Production 3 (!T→NUM)
parse: $ !E !+ NUM
state: 0 2 4 5                               Reduce by Production 1 (!E→!E!+!T)
parse: $ !E !+ !T
state: 0 2                                   Accept
parse: $ !E
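The recovery step in the middle of the table (pop states until one has a legal transition on the current token) can be sketched as follows. This is a simplified illustration, not the occs recovery code: the token names, shift_target(), and recover() are hypothetical, and the only transitions modeled are the three shifts that appear in the table itself. Discarding input when the stack runs out is not shown.

```c
enum { NUM, PLUS, EOI };        /* hypothetical token values */

#define NO_SHIFT (-1)

/* Shift transitions taken from the "Shift ... (goto n)" rows of the table;
 * everything else is treated as an error for the purposes of this sketch. */
static int shift_target( int state, int token )
{
    if( state == 0 && token == NUM  ) return 1;
    if( state == 2 && token == PLUS ) return 4;
    if( state == 4 && token == NUM  ) return 1;
    return NO_SHIFT;
}

/* Pop states until the state on top of the stack has a shift transition on
 * token. The stack is stack[0..depth-1] with the top at stack[depth-1].
 * Returns the new depth, or 0 if no state on the stack can shift the token. */
int recover( const int *stack, int depth, int token )
{
    while( depth > 0 && shift_target( stack[depth - 1], token ) == NO_SHIFT )
        --depth;

    return depth;
}
```

With the stack 0 2 4 and a + as the current token, one state is popped and the parse resumes from State 2, as in the table.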
Page 360 - Figure 5.6. The item immediately below the line in State 7 (i.e., the first
closure item) should be changed from !E→.!E!*!F to !T→.!T!*!F
!T→.!T!*!F
Page 361 - Third paragraph of Section 5.6.2 (which starts "The FOLLOW"). Replace
the paragraph with the following one:
Some symbols in FOLLOW set are not needed.
Lookahead set.
[s -> α . x β, C].
The FOLLOW sets for our current grammar are in Table 5.6. Looking at the
shift/reduce conflict in State 4, FOLLOW(e) doesn't contain a *, so the SLR(1) method
works in this case. Similarly, in State 3, FOLLOW(s) doesn't contain a +, so
everything's okay. And finally, in State 10, there is an outgoing edge labeled with a *,
but FOLLOW(e) doesn't contain a *. Since the FOLLOW sets alone are enough to
resolve the shift/reduce conflicts in all three states, this is indeed an SLR(1) grammar.
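The test described in this paragraph can be sketched in a few lines of C. This is only an illustration: the bit-set encoding of terminals, the names `TermSet` and `slr_resolvable()`, and the FOLLOW set used in the usage note are assumptions, not code from the book.

```c
#include <stdbool.h>

/* Terminals encoded as bit positions (a hypothetical encoding). */
enum {
    T_PLUS = 1 << 0,   /* +            */
    T_STAR = 1 << 1,   /* *            */
    T_LP   = 1 << 2,   /* (            */
    T_RP   = 1 << 3,   /* )            */
    T_NUM  = 1 << 4,   /* NUM          */
    T_EOI  = 1 << 5    /* end of input */
};

typedef unsigned TermSet;

/* A shift/reduce conflict can be resolved by the SLR(1) method when no
 * symbol in FOLLOW(lhs) also labels an outgoing (shift) edge of the state:
 * reduce on the FOLLOW symbols, shift on the edge labels.
 */
static bool slr_resolvable( TermSet follow_of_lhs, TermSet outgoing_edges )
{
    return (follow_of_lhs & outgoing_edges) == 0;
}
```

For example, if FOLLOW(e) were {+, ), end of input} and the state had a single outgoing edge labeled *, then `slr_resolvable(T_PLUS | T_RP | T_EOI, T_STAR)` would be true, which matches the reasoning given for State 4.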
Page 361 - First paragraph in Section 5.6.3 (which starts "Continuing our quest").
Replace the paragraph with the following one:
Many grammars are not as tractable as the current one; it's likely that a FOLLOW
set will contain symbols that also label an outgoing edge. A closer look at the machine
yields an interesting fact that can be used to solve this difficulty. A nonterminal's
FOLLOW set includes all symbols that can follow that nonterminal in every possible context.
The state machine, however, is more limited. You don't really care which symbols can
follow a nonterminal in every possible case; you care only about those symbols that can
be in the input when you reduce by a production that has that nonterminal on its
left-hand side. This set of relevant lookahead symbols is typically a subset of the complete
FOLLOW set, and is called the lookahead set.
Page 362 - Ninth line from bottom. Delete "only". A replacement for this and the
following four lines follows:
The process of creating an LR(1) state machine differs from that used to make an
LR(0) machine only in that LR(1) items are created in the closure operation rather than
LR(0) items. The initial item consists of the start production with the dot at the far left
and |- as the lookahead character. In the grammar we've been using, it is:
Page 362 - Last line. Delete the period. The line should read: x -> γ
Page 363 - The C is in the wrong font in both the first marginal note and the first display
(on the third line). It should be in Roman.
[x -> . γ, FIRST(β C)].
Page 363 - Ninth line from bottom. Replace "new machine" with "new states". A
replacement paragraph follows:
The process continues in this manner until no more new LR(1) items can be created.
The next states are created as before, adding edges for all symbols to the right of the dot
and moving the dots in the kernel items of the new states. The entire LR(1) state
machine for our grammar is shown in Figure 5.7. I've saved space in the figure by
merging together all items in a state that differ only in lookaheads. The lookaheads for
all such items are shown on a single line in the right column of each state. Figure 5.8
shows how the other closure items in State 0 are derived. Derivations for items in States
2 and 14 of the machine are also shown.
Page 364 - Figure 5.7. About 3 inches from the left of the figure and 13A inches from the
bottom, a line going from the box marked 2 to a circle with a B in it is currently labeled
((t e/NUM (. Delete the e.
Page 365 - The fourth line below Figure 5.7 should read:
best of both worlds. Examining the LR(1) machine in Figure 5.7, you are immediately
Page 365 - Figure 5.7 (continued). The upper-case F in the second item of State 16
should be lower case.
t -> . t * f
Page 366 - First and third line under the Figure. The figure numbers are called out
incorrectly in the text. The first three lines beneath the figure should read:
parenthesis first. The outer part of the machine (all of the left half of Figure 5.7 except
States 6 and 9) handles unparenthesized expressions, and the inner part (States 6 and 9,
and all of the right half of Figure 5.7) handles parenthesized subexpressions. The parser
Page 370 - Listing 5.2, line 14. Change the sentence "Reduce by production n"
to read as follows (leave the left part of the line intact):
Reduce by production n, n == -action.
Page 371 - Listing 5.2, line 16. Change the sentence "Shift to state n" to read as
follows (leave the left part of the line intact):
Shift to state n, n == action.
Page 373 - Listing 5.4, line 6. Remove the yy in yylookahead. The corrected line looks
like this:
6 do_this = yy_next( Yy_action, state_at_top_of_stack(), lookahead );
Page 373 - Listing 5.4, line 29. Change rhs_len to rhs_length. The corrected line
looks like this:
29 while( --rhs_length >= 0 ) /* pop rhs_length items */
Page 373 - Last line. Change to read as follows (the state number is wrong):
shifts to State 1, where the only legal action is a reduce by Production 6 if the next input
Page 374 - Paragraph beneath table, replace "youyou" on third line with "you". Entire
replacement paragraph follows:
There's one final caveat. You cannot eliminate a single-reduction state if there is a
code-generation action attached to the associated production because the stack will have
one fewer items on it than it should when the action is performed, so you won't be able to
access the attributes correctly. In practice, this limitation is enough of a problem that
occs doesn't use the technique. In any event, the disambiguating rules discussed in the
next section eliminate many of the single-reduction states because the productions that
cause them are no longer necessary.
Page 387 - Listing 5.11, line 107. Add the word short. The repaired line looks like this:
107 #define YYF ( (YY_TTYPE)( (unsigned short)~0 >> 1 ) )
Page 388 - Second paragraph, third and fourth lines. Change "the largest positive
integer" to "the largest positive short int" and remove the following "I'm". The
repaired lines read as follows:
subroutine, yy_act_next(), which I'll discuss in a moment.) It evaluates to the largest
positive short int (with two's complement numbers). Breaking the macro down:
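A small sketch shows why the cast matters. It assumes that YY_TTYPE is a `short` (the listing's actual definition is not reproduced in this errata), and the helper `yyf_value()` is hypothetical.

```c
#include <limits.h>

typedef short YY_TTYPE;   /* assumption: the table-element type is short */

/* ~0 has all bits set. Converting it to unsigned short keeps only the low
 * bits (USHRT_MAX), and shifting right by one clears the top bit, leaving
 * the largest positive value a two's-complement short can hold.
 */
#define YYF ( (YY_TTYPE)( (unsigned short)~0 >> 1 ) )

/* Hypothetical helper so the value can be inspected from a test. */
static int yyf_value( void )
{
    return YYF;
}
```

Without the `short`, `(unsigned)~0 >> 1` is the largest positive `int`, which does not fit in a `short` table element on machines where `int` is wider than `short`.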
Page 390 - Listing 5.12, line 199. Change the sentence "Reduce by production n"
to read as follows (leave the left part of the line intact):
Reduce by production n, n == -action.
Page 390 - Listing 5.2, line 201. Change the sentence "Shift to state n" to read
as follows (leave the left part of the line intact):
Shift to state n, n == action.
209 YYD( yycomment("Popping %s from state stack\n", tos); )
Page 398 - Listing 5.14, lines 219-222. Replace with the following code:
219 Yy_vsp = Yy_vstack + (YYMAXDEPTH - yystk_ele(Yy_stack));
220 # ifdef YYDEBUG
221 yystk_p(Yy_dstack) = Yy_dstack +
222                      (YYMAXDEPTH - yystk_ele(Yy_stack));
Page 403 - The loop control on Line 128 of Listing 5.15 doesn't work reliably in the
8086 compact or large model. To fix it, replace Line 97 of Listing 5.15 (p. 403) with the
following (and also see change for next page):
97 int
128 for( i = (pp - prod->rhs) + 1; --i >= 0; --pp )
Page 425 - Lines 585-594. Replace with the following code:
585 if( nclose )
586 {
587     assort( closure_items, nclose, sizeof(ITEM*), item_cmp );
588     nitems  = move_eps( cur_state, closure_items, nclose );
589     p       = closure_items + nitems;
590     nclose -= nitems;
591
592     if( Verbose > 1 )
593         pclosure( cur_state, p, nclose );
594 }
Page 440 - Listing 5.33, replace the code on lines 1435 to 1438 with the following (be
sure that the quote marks on the left remain aligned with previous lines):
1435 " action < 0    -- Reduce by production n, n == -action.",
1436 " action == 0   -- Accept, (ie. Reduce by production 0.)",
1437 " action > 0    -- Shift to state n, n == action.",
1438 " action == YYF -- error.",
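The encoding summarized in these four comment lines can be decoded as in the following sketch. The enum, the function name `classify_action()`, and the assumption that YY_TTYPE is a `short` are illustrative only and are not taken from the listing.

```c
typedef short YY_TTYPE;   /* assumption: table entries are shorts */
#define YYF ( (YY_TTYPE)( (unsigned short)~0 >> 1 ) )

typedef enum { ACT_ERROR, ACT_ACCEPT, ACT_REDUCE, ACT_SHIFT } ActionKind;

/* Decode one parse-table entry. On return, *n holds the production number
 * (for a reduce or accept) or the target state (for a shift). It is left
 * unchanged on error.
 */
static ActionKind classify_action( YY_TTYPE action, int *n )
{
    if( action == YYF )          /* test first: YYF is a positive number */
        return ACT_ERROR;

    if( action == 0 )            /* accept is a reduce by production 0   */
    {
        *n = 0;
        return ACT_ACCEPT;
    }
    if( action < 0 )             /* reduce by production n, n == -action */
    {
        *n = -action;
        return ACT_REDUCE;
    }
    *n = action;                 /* shift to state n, n == action        */
    return ACT_SHIFT;
}
```

The error test has to come before the shift test because YYF is itself a positive value.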
Page 447 - Line 14. Change "hardly every maps" to "hardly ever maps". The line should
read:
ideal machine hardly ever maps to a real machine in an efficient way, so the generated
Page 452 - First line below Listing 6.2. Change 2048 to 1024. Line should read:
In addition to the register set, there is a 1024-element, 32-bit wide stack, and two
Page 453 - Figure 6.1. Around the middle of the figure. Change 2048 to 1024.
Line should read:
1,024 32-bit lwords
Page 466 - Listing 6.11, lines 53 and 54, change sp to __sp (two underscores).
53 #define push(n) (--__sp)->l = (lword)(n)
54 #define pop(t)  (t)( (__sp++)->l )
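The two macros can be exercised with a small stand-alone stack. This is a sketch under several assumptions: the cell union, the name `vm_sp` (used here instead of `__sp`, because identifiers containing a double underscore are reserved in current C), and the extra outer parentheses are not from the listing.

```c
typedef long lword;               /* assumption: 32 bits wide on the target  */

typedef union                     /* hypothetical layout of one stack cell   */
{
    char   b;
    short  w;
    lword  l;
    void  *p;
} stack_cell;

#define SDEPTH 1024               /* the stack holds 1,024 lwords            */

static stack_cell  stack[ SDEPTH ];
static stack_cell *vm_sp = stack + SDEPTH;   /* empty: one cell past the end */

/* The stack grows toward low memory: push pre-decrements the pointer,
 * pop post-increments it and casts the lword it fetched to type t.
 */
#define push(n) ( (--vm_sp)->l = (lword)(n) )
#define pop(t)  ( (t)( (vm_sp++)->l ) )
```

A `push(7); push(-2);` sequence followed by two `pop(int)` calls yields -2 and then 7, and leaves `vm_sp` back at `stack + SDEPTH`.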
Page 471 - Ninth line. Replace fp+4 with fp+16. Replace the last four lines of the
paragraph with the following lines:
call() subroutine modifies wild, it just modifies the memory location at fp+16, and on
the incorrect stack, ends up modifying the return address of the calling function. This
means that call() could work correctly, as could the calling function, but the program
would blow up when the calling function returned.
Page 475 - Listing 6.19, Line 80. Replace the first (b) with a (s). The line should now
read:
80 #define BIT(b,s) if( (s) & (1 << (b)) )
Page 486 - Replace last line of first paragraph (which now reads "6.10") with the
following:
6.11.
Page 494 - Listing 6.25, replace lines 129-142 with the following:
129 #define IS_SPECIFIER(p)  ( (p) && (p)->class==SPECIFIER )
130 #define IS_DECLARATOR(p) ( (p) && (p)->class==DECLARATOR )
131 #define IS_ARRAY(p)      ( (p) && (p)->class==DECLARATOR && (p)->DCL_TYPE==ARRAY )
132 #define IS_POINTER(p)    ( (p) && (p)->class==DECLARATOR && (p)->DCL_TYPE==POINTER )
133 #define IS_FUNCT(p)      ( (p) && (p)->class==DECLARATOR && (p)->DCL_TYPE==FUNCTION )
134 #define IS_STRUCT(p)     ( (p) && (p)->class==SPECIFIER && (p)->NOUN == STRUCTURE )
135 #define IS_LABEL(p)      ( (p) && (p)->class==SPECIFIER && (p)->NOUN == LABEL )
136
137 #define IS_CHAR(p)       ( (p) && (p)->class == SPECIFIER && (p)->NOUN == CHAR )
138 #define IS_INT(p)        ( (p) && (p)->class == SPECIFIER && (p)->NOUN == INT )
139 #define IS_UINT(p)       ( IS_INT(p) && (p)->UNSIGNED )
140 #define IS_LONG(p)       ( IS_INT(p) && (p)->LONG )
141 #define IS_ULONG(p)      ( IS_INT(p) && (p)->LONG && (p)->UNSIGNED )
142 #define IS_UNSIGNED(p)   ( (p) && (p)->UNSIGNED )
Page 496 - Figure 6.13. All of the link structures that are labeled specifier should be
labeled declarator and vice versa. The corrected figure follows:
Figure 6.13. Representing a Structure in the Symbol Table
[Figure not reproduced. It shows the Symbol_tab symbol "gipsy" pointing through its type
field at a chain of link structures, and the Struct_tab structdef whose field symbols
"Clopin", "Mathias", and "Guillaume" (level values 0, 4, and 44) each point at their own
chains of DECLARATOR links (ARRAY, POINTER, FUNCTION) ending in SPECIFIER links
(STRUCT or INT).]
Page 500 - First sentence below figure should start "The subroutine":
The subroutine in Listing 6.29 manipulates declarators: add_declarator() adds
Page 503 - Listing 6.31, line 281, should read as follows:
Page 520 - Replace line 71 of Listing 6.37 with the following line:
Replace line 76 of Listing 6.37 with the following line:
Page 521 - Listing 6.37, Lines 138-143. Replace as follows:
138 typedef struct      /* Routines to recognize keywords. A table     */
139 {                   /* lookup is used for this purpose in order to */
140     char *name;     /* minimize the number of states in the FSM. A */
141     int  val;       /* KWORD is a single table entry.              */
142 }
143 KWORD;
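The table-lookup approach mentioned in the comment can be sketched with the standard `bsearch()` function. The token values, the three-entry table, and the helper names below are hypothetical; only the KWORD layout comes from the listing.

```c
#include <stdlib.h>
#include <string.h>

typedef struct
{
    char *name;                     /* keyword text                       */
    int   val;                      /* token value returned to the parser */
} KWORD;

enum { TOK_ID = 0, TOK_ELSE, TOK_IF, TOK_WHILE };   /* hypothetical tokens */

static KWORD Ktab[] =               /* must stay sorted by name for bsearch() */
{
    { "else",  TOK_ELSE  },
    { "if",    TOK_IF    },
    { "while", TOK_WHILE },
};

static int kw_cmp( const void *a, const void *b )
{
    return strcmp( ((const KWORD *)a)->name, ((const KWORD *)b)->name );
}

/* Return the keyword's token value, or TOK_ID if the lexeme isn't a keyword. */
static int id_or_keyword( const char *lexeme )
{
    KWORD key, *p;

    key.name = (char *)lexeme;      /* only read by kw_cmp(), never modified */
    key.val  = 0;
    p = bsearch( &key, Ktab, sizeof(Ktab) / sizeof(*Ktab), sizeof(*Ktab), kw_cmp );
    return p ? p->val : TOK_ID;
}
```

With this arrangement the scanner needs only one rule for identifiers; the action calls something like `id_or_keyword(yytext)` to decide which token to return.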
Page 524 - Second line of last paragraph, remove period after TYPE and change
Listing 6.38 to Listing 6.39. The repaired line should read as follows:
reduces type specifier (on line 229 of Listing 6.39). The associated action
Page 527 - First line below Listing 6.40. Change Listing 6.40 to Listing 6.39. The
repaired line should read as follows:
216 to 217 of Listing 6.39.) There are currently three attributes of interest: $1 and $$
Page 553 - Listing 6.56, line 10. Change L0 to (L0*4)
10 #define T(n) (fp-(L0*4)-(n*4))
Page 556 - Listing 6.58, line 539. Change %s to (%s*4)
539 yycode( "#define T(n) (fp-(%s*4)-(n*4))\n\n", Vspace );
Page 558 - Listing 6.60, lines 578-582. Replace with the following:
discard_link_chain( existing->type );   /* Replace existing type       */
existing->type  = sym->type;            /* chain with the current one. */
existing->etype = sym->etype;
sym->type = sym->etype = NULL;          /* Must be NULL for discard_-  */
                                        /* symbol() call, below.       */
Page 558 - Listing 6.60, lines 606 and 607. i is not used and the initial offset should be 8.
Replace lines 606 and 607 with the following:
606 int offset = 8; /* First parameter is always at BP(fp+8): */
607                 /* 4 for the old fp, 4 for the return address. */
Page 560 - Listing 6.61. Replace lines 578-580 with the following:
578 : LC { if( ++Nest_lev == 1 )
579            loc_reset();
580      }
Page 573 - Fifth line from the bottom. Insert a period after "needed". The line should
read:
needed. The stack is shrunk with matching additions when the variable is no longer
Page 574 - Figure 6.18, the L0 in the #define T(n) should be (L0*4).
#define T(n) (fp-(L0*4)-(n*4))
Page 578 - First paragraph. There's an incomplete sentence on the first line. Replace
with the following paragraph and add the marginal note:
Marking a stack cell as "in use".
The cell is marked as "in use" on line 73. The Region element corresponding to the
first cell of the allocated space is set to the number of stack elements that are being
allocated. If more than one stack element is required for the temporary, adjacent cells that
are part of the temporary are filled with a place marker. Other subroutines in Listing
6.64 de-allocate a temporary variable by resetting the equivalent Region elements to
zero, de-allocate all temporary variables currently in use, and provide access to the
high-water mark. You should take a moment and review them now.
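The bookkeeping described here might look roughly like the following sketch. It is not the code from Listing 6.64: the array size, the marker value, the first-fit search, and the function names are all assumptions made for illustration.

```c
#define REGION_MAX 128             /* assumed number of temporary cells      */
#define MARK       (-1)            /* assumed place marker for extra cells   */

static int Region[ REGION_MAX ];   /* 0 == free cell                         */

/* Find ncells adjacent free cells (first fit). The first cell records the
 * size of the allocation and the remaining cells get the place marker.
 * Returns the index of the first cell, or -1 if no space is available.
 */
static int tmp_alloc_cells( int ncells )
{
    int start, i;

    if( ncells <= 0 )
        return -1;

    for( start = 0; start + ncells <= REGION_MAX; ++start )
    {
        for( i = 0; i < ncells && Region[ start + i ] == 0; ++i )
            ;
        if( i == ncells )
        {
            Region[ start ] = ncells;
            for( i = 1; i < ncells; ++i )
                Region[ start + i ] = MARK;
            return start;
        }
    }
    return -1;
}

/* Release a temporary by resetting all of its Region elements to zero. */
static void tmp_free_cells( int start )
{
    int n = Region[ start ];       /* size stored in the first cell */

    while( --n >= 0 )
        Region[ start + n ] = 0;
}
```

Storing the size in the first cell means that the de-allocation routine needs only the starting index, not the size of the temporary.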
Page 590 - Listing 6.69, line 214. Replace with the following:
214 case CHAR: return BYTE_PREFIX;
Page 598 - Third paragraph, second line (which starts "type int for"), replace line with
the following one:
type int for the undeclared identifier (on line 62 of Listing 6.72). The
Page 601 - First line beneath Listing 6.76. Replace generate with generated:
So far, none of the operators have generated code. With Listing 6.77, we move
Page 607 - Listing 6.84, line 328: Change the to an &&.
Page 608 - Fonts are wrong in all three marginal notes. Replace them with the ones
given here.
Operand to * or [] must be array or pointer.
Attribute synthesized by * and [] operators.
Rules for forming lvalues and rvalues when processing * and [].
Page 613 - First line of last paragraph is garbled. Since the fix affects the entire
paragraph, an entire replacement paragraph follows. Everything that doesn't fit on page 613
should be put at the top of the next page.
call()
unary -> NAME
The call() subroutine at the top of Listing 6.87 generates both the call instruction
and the code that handles return values and stack clean up. It also takes care of implicit
subroutine declarations on lines 513 to 526. The action in unary -> NAME creates a
symbol of type int for an undeclared identifier, and this symbol eventually ends up here as
the incoming attribute. The call() subroutine changes the type to "function returning
int" by adding another link to the head of the type chain. It also clears the implicit
bit to indicate that the symbol is a legal implicit declaration rather than an undeclared
variable. Finally, a C-code extern statement is generated for the function.
Page 617 - Listing 6.87, line 543. Replace nargs with nargs * SWIDTH.
543 gen( "+=%s%d", "sp", nargs * SWIDTH ); /* sp is a byte pointer, */
Page 619 - Listing 6.88, line 690. Delete the ->name. The repaired line should look like
this:
690 gen( "EQ", rvalue( $1 ), "0" );
Disk only. Page 619, Listing 6.88. Added semantic-error checking to first (test) clause
in ?: operator. Tests to see if it's an integral type. Insert the following between lines 689
and 690:
if( !IS_INT($1->type) )
    yyerror( "Test in ?: must be integral\n" );
Page 619 - Listing 6.88. Replace line 709 with the following line:
709 gen( "=", $$->name, rvalue($7) );
Page 644 - Lines 895 and 896. There is a missing double-quote mark on line 895,
insertion of which also affects the formatting on line 896. Replace lines 895 and 896 with the
following:
895 gen( "goto%s%d", L_BODY, $5 );
896 gen( ":%s%d", L_INCREMENT, $5 );
Page 648 - Listing 6.107, line 1, change the 128 to 256.
1 #define CASE_MAX 256 /* Maximum number of cases in a switch */
Page 649 - Listing 6.108. Add the following two lines between lines 950 and 951. (These
lines will not have numbers on them; align the first p in pop with the g in gen_stab... on
the previous line.)
pop( S_brk );
pop( S_brk_label );
Page 658 - First paragraph of section 7.2.1 should be replaced with the following one:
A strength reduction replaces an operation with a more efficient operation or series
of operations that yield the same result in fewer machine clock cycles. For example,
multiplication by a power of two can be replaced by a left shift, which executes faster on
most machines. (x*8 can be done with x<<3.) You can divide a positive number by a
power of two with a right shift (x/8 is x>>3 if x is positive) and do a modulus division by
a power of two with a bitwise AND (x%8 is x&7).
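The three replacements can be checked directly in C. The function names below are hypothetical, and the operands are declared unsigned because the shift and mask forms are only guaranteed to match division and modulus for nonnegative values.

```c
/* Strength-reduced forms of x*8, x/8, and x%8 for unsigned operands. */
static unsigned mul8( unsigned x ) { return x << 3; }   /* x * 8 */
static unsigned div8( unsigned x ) { return x >> 3; }   /* x / 8 */
static unsigned mod8( unsigned x ) { return x &  7; }   /* x % 8 */
```

For a negative signed x the forms can differ: -1/8 is 0 and -1%8 is -1 in C, while on typical two's-complement machines -1>>3 is -1 and -1&7 is 7. An optimizer can therefore apply the shift and mask forms only when it knows the operand is unsigned or nonnegative.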
Page 671 - Figure 7.1, third subfigure from the bottom. In initial printings, the asterisk
that should be at the apex of the tree had dropped down about Vi inch. Move it up in
these printings. In later printings, there are two asterisks. Delete the bottom one.
Page 681 - Listing A.1, lines 19 and 20 are missing semicolons. Change them to the
following:
19 typedef long     time_t; /* for the VAX, may have to change this */
20 typedef unsigned size_t; /* for the VAX, may have to change this */
Page 682 - Listing A. 1, line 52. Delete all the text on the line, but leave the asterisk at
the far left.
Page 688 - 6th line from the bottom. Remove "though" at start of line.
it might introduce an unnecessary conversion if the stack type is an int, short, or char.
These multiple type conversions will also cause portability problems if the
stack_err() macro evaluates to something that won't fit into a long (like a double).
Page 690 - Last line, calling conventions should not be hyphenated.
Page 696 - Listing A.4, line 40. The comment is wrong. The line should read as follows:
40 #define DIFFERENCE 2 /* (x in s1) and (x not in s2) */
Page 702 - Listing A.6. Change lines 98-102 to the following:
98  /* Enlarge the set to "need" words, filling in the extra words with zeros.
99   * Print an error message and exit if there's not enough memory.
100  * Since this routine calls malloc, it's rather slow and should be
101  * avoided if possible.
102  */
Page 706 - Listing A.8, line 330. Change unsigned to int:
330 int ssize; /* Number of words in src set */
Page 713 - Third line from the bottom. nextsym() should be in the Courier font.
Replace the last three lines on the page with the following:
passing the pointer returned from findsym() to nextsym(), which returns either a
pointer to the next object or NULL if there are no such objects. Use it like this:
Page 719 - Second line above Listing A.17, change "tree" to "table":
cial cases. The delsym() function, which removes an arbitrary node from the table, is
shown in Listing A.17.
Page 722 - Listing A.19, line 221. Replace with:
221 return (*User_cmp)( (void*)(*p1 + 1), (void*)(*p2 + 1) );
Page 729 - Listing A.26, line 50. Delete the "(two required)" at the end of the line.
Page 736 - Listing A.33, line 34. Change to the following:
34 PUBLIC void stop_prnt() {}
Page 737 - Listing A.33, line 97. Change to the following:
97 char *str, *fmt, *argp;
Page 739 - The swap statement in the code in the middle of the page is incorrect.
Here is a replacement display:
int array[ ASIZE ];
int i, j, temp;
for( i = 1; i < ASIZE; ++i )
    for( j = i-1; j >= 0; --j )
        if( array[j] > array[j+1] )
            swap( array[j], array[j+1] );
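A compilable version of the display might look like this. The `swap()` macro shown here is an assumption (the errata does not give its definition); it relies on the local variable `temp` that the display declares, and `ASIZE` and `sort_ints()` are illustrative names.

```c
#define ASIZE 8

/* Assumed definition: exchanges two ints by way of the caller's `temp`. */
#define swap(a, b) ( temp = (a), (a) = (b), (b) = temp )

/* Sort array[] into ascending order. On each pass the element at index i
 * sinks toward the front of the already-sorted prefix one swap at a time.
 */
static void sort_ints( int array[ ASIZE ] )
{
    int i, j, temp;

    for( i = 1; i < ASIZE; ++i )
        for( j = i - 1; j >= 0; --j )
            if( array[j] > array[j+1] )
                swap( array[j], array[j+1] );
}
```

The inner loop keeps scanning after the element has reached its place, so this is a simple rather than a fast sort; it is meant only to show the comparison and the swap of adjacent elements.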
Page 743 - Listing A.36. Delete the text (but not the line number) on line 4 (which now
says #include <fcntl.h>).
Page 745 - Change caption of Listing A.38 to the following:
Listing A.38. memiset.c Initialize Array of int to Arbitrary Value
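A routine matching this caption can be sketched as follows. The signature is an assumption inferred from the call memiset( Dtran, -1, i ) elsewhere in this errata; the book's own listing may differ in its types and return value.

```c
#include <stddef.h>

/* Set the first `count` ints of dst[] to `value` and return dst.
 * Unlike memset(), which stores one byte value repeatedly, this works
 * for any int value, not just those whose bytes are all identical.
 */
static int *memiset( int *dst, int value, size_t count )
{
    int *p = dst;

    while( count-- > 0 )
        *p++ = value;

    return dst;
}
```

A call such as `memiset( table, -1, nelements )` marks every entry of a transition table as a failure transition before the real transitions are filled in.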
Page 755 - Third line from the bottom. Delete the exclamation point. The line should
read:
images (25x80x2=4,000 bytes for the whole screen), and that much memory may not be
Page 758 - Listing A. 45, line 49. Remove the semicolon at the far right of the line.
Page 768 - Listing A.61. Remove the exclamation point from the caption.
Page 776 - Line above heading for Section A.11.2.2. Delete the void.
Page 797 - Replace Lines 50-59 of Listing A.84 with the following:
50 case '\b': if( buf > sbuf )
51            {
52                --buf; wprintw( win, " \b" );
53            }
54            else
55            {
56                wprintw( win, " " );
57                putchar( '\007' );
58            }
59            break;
Page 803 - Just above heading for section B.2. The line should read:
set to 100; otherwise, arg is set to 1.
Page 803 - Replace the last paragraph on the page with the following one:
The stack, when the recursive call to erato is active, is shown in Figure B.2. There
is one major difference between these stack frames and C stack frames: the introduction
of a second pointer called the static link. The dynamic link is the old frame pointer, just
as in C. The static link points, not at the previously active subroutine, but at the parent
subroutine in the nesting sequence (in the declaration). Since erato and thalia are
both nested inside calliope, their static links point at calliope's stack frame. You
can chase down the static links to access the local variables in the outer routine's
Page 804 - Replace Figure B.2 with the following figure:
Figure B.2. Pascal Stack Frames
[Figure not reproduced. From the top of the stack (where fp points) it shows two stack
frames for erato, each with cells labeled terpsichore and urania, then a frame for thalia
with cells labeled melpomene and euterpe, then a frame for calliope with cells labeled
clio and polyhymnia.]
Page 805 - Top of the page. Replace the top seven lines (everything up to the paragraph
that begins "This organization") with the following (you'll have to move the rest of the
text on page 805 down to make room):
stack frame. For example, clio can be accessed from erato with the following C-code:
r0.pp = WP(fp + 4);    /* r0 = static link */
_x    = W(r0.pp - 8);  /* x = clio         */
You can access polyhymnia from erato with:
r0.pp = WP(fp + 4);    /* r0 = static link */
_x    = W(r0.pp + 8);  /* x = polyhymnia   */
Though it's not shown this way in the current example, it's convenient for the frame
pointer to point at the static, rather than the dynamic, link to make the foregoing
indirection a little easier to do. The static links can be set up as follows: Assign to each
subroutine a declaration level, equivalent to the nesting level at which the subroutine is
declared. Here, calliope is a level 0 subroutine, erato is a level 1 subroutine, and so
forth. Then:
• If a subroutine calls a subroutine at the same level, the static link of the called
  subroutine is identical to the static link of the calling subroutine.
• If a subroutine at level N calls a subroutine at level N+1, the static link of the
  called subroutine points at the static link of the calling subroutine.
• If a subroutine calls a subroutine at a lower (more outer) level, use the following
  algorithm:
  i = the difference in levels between the two subroutines;
  p = the static link in the calling subroutine's stack frame;
  while( --i >= 0 )
      p = *p;
  the static link of the called subroutine = p;
Note that the difference in levels (i) can be figured at compile time, but you must chase
down the static links at run time. Since the static link must be initialized by the calling
subroutine (the called subroutine doesn't know who called it), it is placed beneath the
return address in the stack frame.
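The three rules can be combined into one routine, sketched below. The `frame` structure and the function name are hypothetical: a real stack frame also holds the dynamic link, the return address, and the local variables, and a compiler would emit the chase loop as generated code rather than call a function.

```c
typedef struct frame
{
    struct frame *static_link;     /* frame of the enclosing subroutine */
} frame;

/* Compute the static link for a subroutine declared at callee_level when
 * it is called from the subroutine whose frame is `caller`, declared at
 * caller_level.
 */
static frame *static_link_for( frame *caller, int caller_level, int callee_level )
{
    frame *p;
    int    i;

    if( callee_level == caller_level + 1 )   /* callee is nested in caller */
        return caller;

    p = caller->static_link;                 /* same level: share the link */
    i = caller_level - callee_level;         /* known at compile time      */

    while( --i >= 0 )                        /* chased at run time         */
        p = p->static_link;

    return p;
}
```

For a same-level call the loop does not execute, so the called subroutine simply inherits the caller's static link; each additional level of difference follows one more link outward.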
Page 806 - Change caption and title of Listing C.1 as follows:
Listing C.1. A Summary of the C Grammar in Chapter Six.
Page 819 - First full paragraph. Replace the "the the" on the fourth line with a single
"the". A replacement paragraph follows:
The ^ and $ metacharacters
The ^ and $ metacharacters work properly in all MS-DOS input modes, regardless of
whether lines end with \r\n or a single \n. Note that the newline is not part of the
lexeme, even though it must be present for the associated expression to be recognized. Use
and \r\n to put the end of line characters into the lexeme. (The \r is not required in
UNIX applications; in fact, it's an error under UNIX.) Note that, unlike the vi editor, ^
does not match a blank line. You'll have to use an explicit search such as \r\n\r\n to
find empty lines.
Page 821 - Listing D.1, replace lines 14 and 15 with the following
Page 821 - First two lines beneath the listing. Delete both lines and replace with the
following text:
LeX and yyleng is adjusted accordingly. Zero is returned at end of file, -1 if the lexeme
is too long.13
Page 821 - Replace Footnote 13 at the bottom of the page with the one at the bottom of
the page you are now reading.
Page 828 - Listing D.5. In order to support the ul suffix, replace line 16 with the
following:
16 suffix ([UuLl]|[uU][lL]) /* Suffix in integral numeric constant */
Page 841 - Replace the code on the first five lines of Listing E.2 with the following five
lines:
1 %term ID     /* an identifier */
2 %term NUM    /* a number      */
3 %left PLUS   /* +             */
4 %left STAR   /* *             */
5 %left LP RP  /* ( and )       */
Page 843 - Paragraph starting -c[N], 2nd line. Delete the "e" in "switche."
Page 860 - 15 lines from the bottom. Remove the underscores. The line should read:
is from stack picture six to seven, t0 and t1 were put onto the stack when the rvalues
13 UNIX lex doesn't return -1 and it doesn't modify the yytext or yyleng; it just returns the next input
character.
Page 861 - Figure E.5. Remove all underscores. The figure should be replaced with the
following one:
Figure E.5. A Parse of A*2
Page 862 - Move caption for Listing E.9 to the left. (It should be flush with the left edge
of the box.)
Page 871 - Figure E.6. Label the line between States 5 and 7 with STAR.
STAR
Page 880 - Listing E.17, Lines 84 and 85, replace with the following:
84 yycode("public word t0, t1, t2, t3;\n");
85 yycode("public word t4, t5, t6, t7;\n");
Page 886 - Listing E.19. Replace lines 51 to 57 with the following:
51 {0} 512, line 42 {$1=$2=newname();}
52 {1} 513, line 42 {freename($0);}
53 {2} 514, line 48 {$1=$2=newname();}
54 {3} 515, line 49 { yycode("%s+=%s\\n", $$, $0); freename($0); }
55 {4} 516, line 56 {$1=$2=newname();}
56 {5} 517, line 57 { yycode("%s*=%s\\n", $$, $0); freename($0); }
57 {6} 518, line 61 { yycode("%s=%0.*s\\n", $$, yyleng, yytext); }
Page 887 - Listing E.19. Replace lines 46 and 54 with the following:
46 { yycode("%s+==%s\n", $$, $0); freename($0); } expr'
54 { yycode("%s*==%s\n", $$, $0); freename($0); } term'
Page 888 - Listing E. 19. Replace line 58 with the following:
Disk only. I made several changes to searchen.c (p. 747) to make the returned path
names more consistent (everything's now mapped to a UNIX-style name). Also added a
disk identifier when running under DOS.
Disk only. Insert the following into the brace-processing code, between lines 115 and
116 of parser.lex on page 273:
if( i == '\n' && in_string )
{
    lerror( WARNING,
            "Newline in string, inserting \"\n");
    in_string = 0;
}
Disk only. The do_unop subroutine on page 604 (Line 177 of Listing 6.79) wasn't
handling incoming constant values correctly and it wasn't doing any semantic-error
checking at all. It's been replaced by the following code. (Instructions are now generated
only if the incoming value isn't a constant, otherwise the constant value at the end of
the link chain is just modified by performing the indicated operation at compile time.)
177 value *do_unop( op, val )
178 int   op;
179 value *val;
180 {
181     char *op_buf = "=?" ;
182     int  i;
183
184     if( op != '!' )    /* ~ or unary - */
185     {
186         if( !IS_CHAR(val->type) && !IS_INT(val->type) )
187             yyerror( "Unary operator requires integral argument\n" );
188
189         else if( IS_UNSIGNED(val->type) && op == '-' )
190             yyerror( "Minus has no meaning on an unsigned operand\n" );
191
192         else if( IS_CONSTANT( val->type ) )
193             do_unary_const( op, val );
194         else
195         {
196             op_buf[1] = op;
197             gen( op_buf, val->name, val->name );
198         }
199     }
200     else               /* ! */
201     {
202         if( IS_AGGREGATE( val->type ) )
203             yyerror( "May not apply ! operator to aggregate type\n" );
204
205         else if( IS_INT_CONSTANT( val->type ) )
206             do_unary_const( '!', val );
207         else
208         {
209             gen( "EQ", rvalue(val), "0" );             /* EQ(x, 0)       */
210             gen( "goto%s%d", L_TRUE, i = tf_label() ); /* goto T000;     */
211             val = gen_false_true( i, val );            /* fall thru to F */
212         }
213     }
214     return val;
215 }
216 /*----------------------------------------------------------------------*/
217 do_unary_const( op, val )
218 int   op;
219 value *val;
220 {
221     /* Handle unary constants by modifying the constant's value. */
222
223     link *t = val->type;
224
225     if( IS_INT(t) )
226     {
227         switch( op )
228         {
229         case '~': t->V_INT = ~t->V_INT; break;
230         case '-': t->V_INT = -t->V_INT; break;
231         case '!': t->V_INT = !t->V_INT; break;
232         }
233     }
234     else if( IS_LONG(t) )
235     {
236         switch( op )
237         {
238         case '~': t->V_LONG = ~t->V_LONG; break;
239         case '-': t->V_LONG = -t->V_LONG; break;
240         case '!': t->V_LONG = !t->V_LONG; break;
241         }
242     }
243     else
244         yyerror( "INTERNAL do_unary_const: unexpected type\n" );
245 }
Disk only. Page 506, Listing 6.32, lines 453-457: Modified subroutine type_str() in
file symtab.c to print the value of an integer constant. Replaced the following code:
if( link_p->NOUN != STRUCTURE )
    continue;
else
    i = sprintf( buf, " %s", link_p->V_STRUCT->tag ?
                             link_p->V_STRUCT->tag : "untagged" );
with this:
i f ( l i nk p- >NOUN STRUCTURE )
1 spr i nt f ( buf ,
M
s" l i nk p- >V STRUCT- >t ag ?
l i nk p- >V STRUCT- >t ag "unt agged")
el se i f (I S I NT( l i nk p)
el se
) spr i nt f ( buf ,
el se i f (I S_UI NT( l i nk_p) ) spr i nt f ( buf ,
el se i f (I S_LONG( l i nk_p) ) spr i nt f ( buf ,
el se i f (I S ULONG( l i nk p) ) spr i nt f ( buf ,
cont i nue;
M
M
M
M
O.
o
o.
o
o.
o
o.
o
d", l i nk p- >V I NT
) ;
u", 1i nk_p- >V_UI NT );
I d", l i nk_p- >V_LONG );
l u", l i nk p- >V ULONG) ;
Disk only. Page 241, Listing 4.7, lines 586-589 and line 596, and page 399, Listing
5.14, lines 320-323 and line 331. The code in llama.par was not testing correctly for a
null return value from ii_ptext(). The problem has been fixed on the disk, but won't
be fixed in the book until the second edition. The fix looks like this:
    if( yytext = (char *) ii_ptext() )     /* replaces llama.par lines 586-589 */
    {                                      /* and occs.par, lines 320-323      */
        yylineno       = ii_plineno();
        tchar          = yytext[ yyleng = ii_plength() ];
        yytext[yyleng] = '\0';
    }
    else                                   /* no previous token */
    {
        yytext = "";
        yyleng = yylineno = 0;
    }

    if( yylineno )
        ii_ptext()[ ii_plength() ] = tchar;   /* replaces llama.par, line 596 */
                                              /* and occs.par, line 331       */
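
The pattern in this fix is to save the character that follows the previous token, overwrite it with '\0' so the token can be used as a string, and put the character back afterwards, while coping with a null token pointer. The sketch below shows that pattern on its own. It is an illustration under assumptions, not the llama.par code: struct tok_mark, mark_token(), and unmark_token() are hypothetical names, and an explicit active flag replaces the test on yylineno.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one temporarily terminated token. */
struct tok_mark
{
    char   *text;    /* start of the token, or "" if there is none */
    size_t  len;     /* token length                               */
    char    saved;   /* character overwritten by the '\0'          */
    int     active;  /* nonzero while the '\0' is in place         */
};

/* Terminate the token in place, remembering the overwritten character.
 * A null text pointer means "no previous token".
 */
void mark_token(struct tok_mark *m, char *text, size_t len)
{
    assert(m != NULL);

    if (text)
    {
        m->text   = text;
        m->len    = len;
        m->saved  = text[len];
        text[len] = '\0';
        m->active = 1;
    }
    else
    {
        m->text   = "";
        m->len    = 0;
        m->saved  = '\0';
        m->active = 0;
    }
}

/* Undo mark_token(); does nothing if no token was marked. */
void unmark_token(struct tok_mark *m)
{
    assert(m != NULL);

    if (m->active)
    {
        m->text[m->len] = m->saved;
        m->active = 0;
    }
}
```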
Disk only. The ii_look() routine (in Listing 2.7 on page 47) doesn't work in the 8086
large or compact models. The following is ugly, but it works everywhere:
    int ii_look( n )
    {
        /* Return the nth character of lookahead, EOF if you try to look past
         * end of file, or 0 if you try to look past either end of the buffer.
         * We have to jump through hoops here to satisfy the ANSI restriction
         * that a pointer can not go to the left of an array or more than one
         * cell to the right of an array. If we don't satisfy this restriction,
         * then the code won't work in the 8086 large or compact models. In
         * the small model (or in any machine without a segmented address
         * space), you could do a simple comparison to test for overflow:
         *
         *     uchar *p = Next + n;
         *     if( !(Start_buf <= p && p < End_buf) )
         *         overflow
         */

        if( n > (End_buf - Next) )        /* (End_buf - Next) is the # of unread */
            return Eof_read ? EOF : 0;    /* chars in the buffer (including      */
                                          /* the one pointed to by Next).        */

        /* The current lookahead character is at Next[0]. The last character
         * read is at Next[-1]. The --n in the following if statement adjusts
         * n so that Next[n] will reference the correct character.
         */

        if( --n < -(Next - Start_buf) )   /* (Next - Start) is the # of buffered */
            return 0;                     /* characters that have been read.     */

        return Next[n];
    }
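
The same bounds test can be written with indexes instead of pointers, so that no out-of-range pointer is ever formed. The following is a sketch under stated assumptions rather than the book's input system: struct in_buf and look() are hypothetical names, and the buffer is described by offsets from its first cell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* EOF */

/* Hypothetical input buffer described by indexes rather than pointers. */
struct in_buf
{
    const unsigned char *start;  /* first cell of the buffer            */
    size_t next;                 /* index of the current lookahead char */
    size_t end;                  /* index one past the last valid char  */
    int    eof_read;             /* nonzero once end of file was seen   */
};

/* n == 1 is the current lookahead character, n == 0 is the last
 * character read, and negative n looks further back. Returns EOF or 0
 * when n is past the right end, and 0 when it is past the left end.
 */
int look(const struct in_buf *b, long n)
{
    assert(b != NULL && b->next <= b->end);

    if (n > (long)(b->end - b->next))      /* more than the unread chars */
        return b->eof_read ? EOF : 0;

    if (--n < -(long)b->next)              /* before the buffered chars  */
        return 0;

    return b->start[(long)b->next + n];    /* index is known to be valid */
}
```

Because only integers are compared, the checks behave the same way regardless of how the target represents pointers.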

