PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Thu, 26 Sep 2013 13:32:25 UTC
Contents
Articles
    Wikibooks:Collections Preface
    Introduction
Introduction To Parrot
    Introduction
    Building Parrot
    Running Parrot
Parrot Hacking
    Parrot Internals
    IMCC and PIRC
    Run Core
    Memory and Garbage Collection
    PMC System
    String System
    Exception Subsystem
Appendices
    PIR Reference
    PASM Reference
    PAST Node Reference
    Languages on Parrot
    HLLCompiler Class
    Command Line Options
    Built-In PMCs
    Bytecode File Format
    VTABLE List
References
    Article Sources and Contributors
    Image Sources, Licenses and Contributors
Article Licenses
    License
Wikibooks:Collections Preface
This book was created by volunteers at Wikibooks (http://en.wikibooks.org).
What is Wikibooks?
Started in 2003 as an offshoot of the popular Wikipedia project, Wikibooks is a free, collaborative wiki website dedicated to creating high-quality textbooks and other educational books for students around the world. In addition to English, Wikibooks is available in over 130 languages, a complete listing of which can be found at http://www.wikibooks.org. Wikibooks is a "wiki", which means anybody can edit the content there at any time. If you find an error or omission in this book, you can log on to Wikibooks to make corrections and additions as necessary. All of your changes go live on the website immediately, so your effort can be enjoyed and utilized by other readers and editors without delay.
Books at Wikibooks are written by volunteers, and can be accessed and printed for free from the website. Wikibooks is operated entirely by donations, and a certain portion of proceeds from sales is returned to the Wikimedia Foundation to help keep Wikibooks running smoothly. Because of the low overhead, we are able to produce and sell books much more cheaply than proprietary textbook publishers can. This book can be edited by anybody at any time, including you. We don't make you wait two years to get a new edition, and we don't stop selling old versions when a new one comes out.
Note that Wikibooks is not a publisher of books, and is not responsible for the contributions of its volunteer editors. PediaPress.com is a print-on-demand publisher that is also not responsible for the content that it prints. Please see our disclaimer for more information: http://en.wikibooks.org/wiki/Wikibooks:General_disclaimer .
Wikibooks in Class
Books at Wikibooks are free, and with the proper editing and preparation they can be used as cost-effective textbooks in the classroom or for independent learners. In addition to serving as a traditional read-only learning aid, a Wikibook can also become an interactive class project. Several classes have come to Wikibooks to write new books and improve old books as part of their normal course work. In some cases, the books written by students one year are used to teach students in the same class the next year. These books can also be used in classes around the world by students who might not be able to afford traditional textbooks.
Happy Reading!
We at Wikibooks have put a lot of effort into these books, and we hope that you enjoy reading and learning from them. We want you to keep in mind that what you are holding is not a finished product but instead a work in progress. These books are never "finished" in the traditional sense, but they are ever-changing and evolving to meet the needs of readers and learners everywhere. Despite this constant change, we feel our books can be reliable and high-quality learning tools at a great price, and we hope you agree. Never hesitate to stop in at Wikibooks and make some edits of your own. We hope to see you there one day. Happy reading!
Introduction
What Is Parrot?
Parrot is a virtual machine (VM), similar to the Java VM and the .NET VM. However, unlike these two, which are designed for statically-typed languages like Java or C#, Parrot is designed for use with dynamically typed languages such as Perl, Python, Ruby, or PHP. The Parrot VM itself is written in the C programming language, which means that, in theory, it will be portable to a large number of different computer architectures and operating systems. It is written to be modular and extensible.
Programmers can write in any of the languages for which a Parrot-capable compiler exists. Modules written in one language, such as Perl, can transparently interoperate with modules originally written in any of the other languages supported by Parrot. This easy interoperability and native support for cutting-edge dynamic programming features make Parrot an important tool for next-generation language designers and implementers.
It is precisely because Parrot is intended to support so many diverse high-level languages that it has developed a very general and feature-rich architecture. Much of the Parrot architecture is still under active development, so those parts cannot be properly discussed in this book yet. Once Parrot reaches a stable release, and more details are set in stone, this book will be able to provide more comprehensive coverage.
History of Parrot
The Parrot project was born from the Perl 6 development project. As such, the history of Parrot, at least the early history of it, is closely tied to the history of Perl 6. In fact, once you understand just how large and ambitious Perl 6 is, you'll start to understand why Parrot must have all the features it has.
It was famously said of version 5 of the Perl programming language that "nothing can parse Perl but perl". The implication was that the perl executable was the only program that could reliably parse the Perl programming language. There were two reasons for this. First, the Perl language didn't follow any formal specification; the behavior of the perl interpreter was the definitive documentation for the actions of Perl. Second, the Perl programming language allowed the use of source filters, programs which could modify their own source code prior to execution. This means that to reliably parse and understand a Perl program, you needed to be able to execute the source filters reliably. The only program that could do both was perl.
The next planned version of Perl, Perl 6, was supposed to be a major rewrite of the language. In addition to standardizing and bringing sanity to all the features which had slowly entered the language grammar, it was decided that Perl 6 would be a formal specification first, with implementations of that specification to follow.
The name "Parrot" was first used as an April Fool's joke. The story claimed that the Perl and Python languages (which are competitors, and which were both undergoing major redesigns) were going to merge together into a single language named Parrot. This was, of course, a hoax, but the idea was a powerful one. When the project was started to create a virtual machine that would be capable of running not only Perl 6, but also Python and other dynamic languages, the name Parrot was a perfect fit. The first release of Parrot, 0.0.1, was made in September 2001.
The development team has prepared a stable point release on the third Tuesday of every month.
Parrot needs to be built and tested regularly. People are always needed who are willing to perform regular builds and tests of Parrot. If you are willing to set up an automated build bot to perform regular builds and tests, that's even better. If you can write, this book needs your help, and anybody can edit it. Also, there are a number of other book-writing projects concerning Parrot that are looking for active authors and editors. The more that is written about Parrot, the more new users will be able to learn about it. If you don't fall cleanly into any of these categories, there are other opportunities to help as well. This might be a good opportunity for you to learn a new skill, like programming Perl 6, PIR, or NQP. If you are interested in writing or editing, you can help with this wikibook too!
Parrot Developers
There are several different roles that people have taken up in Parrot development, even though there is no centralized management hierarchy. Volunteers tend to fill the roles that they enjoy and are skilled at.
Architect
    The Parrot Architect, currently Allison Randal, is in charge of laying out the overall design specifications for Parrot. The architect has the final say in important decisions and is responsible for ensuring that design documents are up to date. Because the architect lays out the overall requirements of the system, other volunteers are able to contribute to the areas where they are most interested.
Pumpking
    The Pumpking has oversight of the Parrot source repository, and is also the lead developer. The Pumpking defines coding standards that all contributors must follow, and helps to coordinate other contributors.
Release Managers
    Parrot has a schedule of making releases approximately once a month. The release manager oversees this process, and ensures that releases are high quality. Release managers control when new features can be added, and when code should be frozen for debugging. Pre-release debugging sessions are very productive and important periods for Parrot development, and ensure that many bugs get fixed between each release.
Committer
    A committer is a person with write access to the Parrot SVN repository. Committers typically have submitted several patches and participated in Parrot-related discussions.
Metacommitter
    A metacommitter is a person who has write access to the Parrot SVN repository and is also capable of promoting new committers. The architect and the Pumpking are automatically metacommitters, but there are several others too.
Among the above groups, there are other designations as well. This is because many committers tend to focus their efforts on a relatively small portion of the Parrot development effort.
Core Developer
    A person who works on Parrot internals, typically one or two subsystems. Core developers need to be skilled in C programming, and also need to work with many development utilities written in Perl.
Compiler Developer
    Like core developers, compiler developers work on the internals of Parrot, typically by writing a lot of C code. Unlike core developers, they focus their effort on the various compiler front-ends such as IMCC, PIRC, PGE, or TGE.
High-Level Language Developer
    A high-level language developer is a person who is working to implement a high-level language on Parrot. Even though they have commit access to the whole repository, many high-level language developers focus only on a single language implementation. High-level language developers need to be skilled in PCT and many of the Perl 6-based development tools for HLLs.
Build Managers
    Build managers help to create and maintain tools that other developers rely on.
Testers
    Testers create and maintain a suite of hundreds and thousands of tests to verify the operations of Parrot, its subsystems, its compilers and the high-level languages that run on it.
Platform Porters
    A platform porter ensures that Parrot can be built on multiple platforms. Porters must build and test Parrot on different platforms, and also create and distribute pre-compiled installation packages for different platforms.
This certainly isn't an exhaustive list of possible roles. If you have programming skills, but don't know whether you fit any of the designations above, your help is still needed.
Resources
http://www.parrotcode.org/docs/intro.html
http://www.parrotcode.org/docs/roadmap.html
http://www.parrotcode.org/docs/parrothist.html
Introduction To Parrot
Building Parrot
Obtaining Parrot
The most recent development release of Parrot can be downloaded from CPAN [1]. Development of Parrot is controlled through the SVN repository at http://svn.parrot.org/parrot/ . The most up-to-date version of Parrot can be obtained from https://svn.parrot.org/parrot/trunk/ via svn checkout.
Configure.pl
Notice that Configure.pl has a capitalized first letter. This is an important distinction on Unix and Linux systems, which are case sensitive. The first step in building Parrot is to run the Configure.pl script, which performs some basic tests on your system and produces a makefile. To automatically invoke Configure.pl with the most common options, run the program Makefile.pl instead.
The configuration process performs a number of tests on your system to determine some important parameters. These tests may take several minutes on some systems, so be patient. In addition, configuration creates a number of platform-specific code files for your system. Without these generated files in place, the build process cannot proceed.
After Configure.pl has finished executing, you should have a file named Makefile (with no suffix). From the shell, go to the Parrot directory and type the command "make" (or "nmake" on Windows). This will start the process to build Parrot. The Parrot build process could take several minutes because there are a number of steps. We will discuss some of these steps in a later section.
MANIFEST
The root directory of Parrot contains a file called MANIFEST. MANIFEST contains a list of all necessary files in the Parrot repository. If you add a new file to the Parrot source tree, make sure to add that file to MANIFEST. Configure.pl checks MANIFEST to ensure all files exist properly before attempting to build.
Configure.pl Options
Depending on what tasks you want to perform, or how you are using Parrot, there are a number of options that can be specified to Configure.pl. These options may change the makeup of several generated files, including the Makefile. Here, we will list some of these options:
Option              Description
--help              Shows a help message.
--version           Prints version information about Configure.pl.
--verbose           Prints extra information to the console.
--fatal             If any step fails, kill Configure immediately and do not run additional tests.
--silent            No output to the console.
--nomanicheck       Do not check the file MANIFEST to ensure all files exist.
--languages         Specify a comma-separated list of languages to build also, after Parrot has been built.
--ask               Ask the user for answers to common questions, instead of running probes.
--test              Test the configuration tools first, then Configure, then the build tools. Use --test=configure to test the configuration tools and then run Configure.pl. Use --test=build to run Configure.pl and then also test the build tools.
--debugging         Set --debugging=0 to turn off debugging. Debugging is on by default.
(name lost)         Specify whether your C compiler supports inline code using the C inline keyword.
--optimize          Compile Parrot using compiler optimizations, and a few other speed-up tricks. Creates a faster bird, but may expose more errors and failures. Use --optimize=(flags) to specify compiler optimization flags to use.
(name lost)         Link Parrot dynamically to libparrot, instead of linking statically.
(name lost)         On a 64-bit platform, compile a 32-bit Parrot.
(name lost)         Turn on profiling. Only used with the GCC compiler, for now.
(name lost)         Turn on additional warnings, for the Cage Cleaners.
--cc                Specify the compiler to use. For instance, --cc=gcc for the GCC compiler, and --cc=cl for Microsoft's C++ compiler. Use --ccflags to specify any additional compiler flags, and --ccwarn to turn on any additional warnings. Here are some more options:
                    1. To build Parrot with a C++ compiler, use --cxx to specify the compiler to use.
                    2. Use --libs to specify any additional libraries to link Parrot with.
                    3. Use --link to specify a linker.
                    4. Use --linkflags to send options to the linker.
                    5. Use --ld to select a loader.
                    6. Use --ldflags to send flags to the loader.
                    7. Use --make to specify what make utility to use.
--intval, --floatval, --opcode
                    Set the C data types to use for each value. Notice that --intval and --opcode must be the same, or strange errors may result.
--ops               Specify any optional OPS files to build.
--pmc               Specify any optional PMC files to build.
--without-gmp       Build Parrot without GMP.
--without-gdbm      Build Parrot without GDBM.
--without-opengl    Build Parrot without OpenGL support.
--without-crypto    Build Parrot without the cryptography library.
--icu-config        Specify a location for the Unicode ICU library on your system.
--without-icu       Build Parrot without ICU and Unicode support.
--maintainer        Compile IMCC's tokenizer and parser using Lex and Yacc (or equivalent). Use --lex to specify the name of the lexer, and --yacc to specify the name of the parser.
--miniparrot        Build miniparrot.
--prefix            Specify a path prefix.
--exec-prefix       Specify an execution path prefix.
--bindir            The directory for binary executable files on your system.
--sbindir           The system admin executables folder.
--libexecdir        Program executables folder.
--datadir           Read-only data directory for machine-independent data.
--sysconfdir        Read-only data that is machine dependent.
--sharedstatedir    Modifiable architecture-independent data directory.
--localstatedir     Modifiable architecture-dependent data directory.
--libdir            Object code directory.
--includedir        Folder for compiler include files.
--oldincludedir     C header file directory for old versions of GCC.
--infodir           Info documentation directory.
--mandir            Man pages documentation folder.
Parrot Executable
After the build process you should have, among other things, an executable file for Parrot. This will be, on Windows systems, named parrot.exe. On other systems, it may be named slightly differently, such as with no suffix. Two other programs of interest are created, miniparrot.exe and libparrot.dll. These files will be named something different if you are not on a Windows system.
Make Targets
For readers who are not familiar with the make program, it is a program which can be used to automatically determine how to build a software project from source code files. In a makefile, you specify a list of dependencies, and the method for producing one file from others. make then determines the order and method to build your project. make has targets, which means a single makefile can have multiple goals. For Parrot, a number of targets have been defined which can help with building, debugging, and testing. Here is a list of some of the make targets:
Command             Explanation
make                Builds Parrot from source. Only rebuilds components that have changed since the last build.
make clean          Removes all the intermediate files that are left over from the build process. Cleans the directory tree so that Parrot can be completely rebuilt.
make realclean      Completely removes all temporary files, all intermediate files, and all makefiles. After a make realclean command, you will need to run Configure.pl again.
(name lost)         Builds Parrot, if needed, and runs the test suite on it. If there are errors in the test results, you can try to fix them yourself, or you can submit a bug report to the Parrot developers. This is always appreciated.
(name lost)         Builds Parrot, if needed, and runs the test suite on every run core. This can be a very time-consuming operation, and is typically only performed prior to a new release.
(name lost)         Performs smoke testing. This runs the Parrot test suite and attempts to transmit the test results directly to the Parrot development servers. Smoke test results help the developers to keep track of the systems where Parrot is building correctly.
Resources
http://www.parrotcode.org/docs/gettingstarted.html
References
[1] http://www.parrotcode.org/release/devel
Running Parrot
Running Parrot
Parrot can be run from the command line in a number of modes with a number of different options. There are three forms of input that Parrot can work with directly: Parrot Assembly Language (PASM), which is a low-level human readable assembly language for the virtual machine, Parrot Intermediate Representation (PIR) which is a syntactic overlay on PASM with nicer syntax for some expressions, and Parrot Bytecode (PBC) which is a compiled binary input format. PIR and PASM are converted to PBC during normal execution. Only PBC can be executed by Parrot directly. The compilation stage to convert PIR or PASM to PBC takes some time, and can be done separately. We'll be talking about these processes a little later.
Parrot Information
To get information about the current Parrot version, type:
    parrot -V
To get a list of command-line options and their purposes, type:
    parrot -h
We'll discuss all the various command-line options later in this book, but it's always good to have multiple resources when a question pops up.
File Types
Files that end in .pbc are treated as Parrot bytecode files and are executed immediately. Files that end in .pir or .pasm are treated as PIR or PASM source code files, respectively, and interpreted. To compile PIR or PASM into bytecode, use the -o switch, like this:
    parrot -o output.pbc input.pir
or
    parrot -o output.pbc input.pasm
Notice that if we use a .pasm file extension for the output, we can output to PASM instead of PBC:
    parrot -o output.pasm input.pir
To force output of PBC even if the output file does not have a .pbc extension, use the --output-pbc switch. To run the generated PBC file after you generate it, use the -r switch. To force a file to be run as PASM regardless of the file extension, use the -a switch. To force a file to be run as a PBC file, regardless of the file extension, use the -c switch.
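The extension rule described above can be sketched as a small C function. This is only an illustration of the rule as stated in the text, not code from Parrot: the names ToyInputKind, has_suffix, and classify_input are invented here, and the text does not say what Parrot does with other extensions, so the sketch simply reports them as unknown.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification of an input file, invented for this sketch. */
typedef enum ToyInputKind {
    INPUT_PBC,      /* bytecode: can be executed immediately */
    INPUT_PIR,      /* PIR source: must be compiled to PBC first */
    INPUT_PASM,     /* PASM source: must be compiled to PBC first */
    INPUT_UNKNOWN   /* anything else; the text does not cover this case */
} ToyInputKind;

/* Return nonzero if name ends with suffix. */
static int has_suffix(const char *name, const char *suffix)
{
    size_t n = strlen(name);
    size_t s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Decide how a file would be treated, based on its extension alone. */
ToyInputKind classify_input(const char *filename)
{
    assert(filename != NULL);
    if (has_suffix(filename, ".pbc"))
        return INPUT_PBC;
    if (has_suffix(filename, ".pasm"))
        return INPUT_PASM;
    if (has_suffix(filename, ".pir"))
        return INPUT_PIR;
    return INPUT_UNKNOWN;
}
```

In this sketch, the -a and -c switches would correspond to skipping classify_input and forcing the PASM or PBC result.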
Runtime Options
Parrot can operate with a number of additional options too.
Optimizations
Optimizations can take time to perform, but increase the execution speed of the resulting program. For simple programs, or short and sloppy one-time programs, extensive optimizations might not make much sense: you would spend more time optimizing the software than you would ever spend executing it. However, for programs which are run frequently, for very large programs, or for programs which must run continuously with good performance, optimizations can be valuable. Compile a program once with optimizations, and the optimized bytecode can be saved to disk, never needing to be optimized again (unless Parrot integrates better optimizations). Parrot has multiple optimization options, depending on how extensive the optimizations should be. Each can be activated using a different command-line switch of the form -Ox, where the x is a character representing the type of optimization to perform:
Flag          Description
-O0           no optimizations, this is the default mode
-O1 or -O     optimizations without life info (e.g. branches)
-O2           optimizations with life info
-Op           rewrite I and N PASM registers most used first
-Ot           select fastest runcore (default with -O1 and -O2)
-Oc           turns on the optional/experimental tail call optimizations
Life info is an analysis step where code and data are traced to determine control flow patterns and the lifetimes of local variables. Knowing the areas where certain variables are used and not used enables registers to be reused instead of having to allocate new ones. Knowing when certain code is unreachable enables the optimizer to ignore it completely.
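The register-reuse idea can be sketched for straight-line code in a few lines of C. This is a simplified illustration under stated assumptions, not Parrot's or IMCC's actual algorithm: the ToyInsn structure, the MAX_REGS limit, and the functions take_free and assign_registers are all invented for the example, and each virtual register is treated as one lifetime running from its first appearance to its last.

```c
#include <assert.h>

#define MAX_REGS 32   /* hypothetical limit on virtual registers */

/* One toy instruction: dest = op(src1, src2); -1 means "no operand". */
typedef struct ToyInsn {
    int dest, src1, src2;
} ToyInsn;

/* Claim the lowest-numbered free physical register. */
static int take_free(int *in_use, int *count)
{
    int p;
    for (p = 0; p < MAX_REGS; p++) {
        if (!in_use[p]) {
            in_use[p] = 1;
            if (p + 1 > *count)
                *count = p + 1;
            return p;
        }
    }
    assert(!"out of physical registers");
    return -1;
}

/* Map virtual registers to physical ones, reusing a physical register
   once the value it holds is dead. Fills phys[v] for every virtual
   register v (-1 if unused) and returns how many physical registers
   were needed. */
int assign_registers(const ToyInsn *code, int n, int phys[MAX_REGS])
{
    int last_use[MAX_REGS];
    int in_use[MAX_REGS] = {0};
    int count = 0;
    int i, k, v;

    for (v = 0; v < MAX_REGS; v++) {
        last_use[v] = -1;
        phys[v] = -1;
    }

    /* Pass 1: lifetime info. Record the last instruction touching each
       virtual register. */
    for (i = 0; i < n; i++) {
        const int ops[3] = { code[i].dest, code[i].src1, code[i].src2 };
        for (k = 0; k < 3; k++) {
            if (ops[k] >= 0) {
                assert(ops[k] < MAX_REGS);
                last_use[ops[k]] = i;
            }
        }
    }

    /* Pass 2: walk the code in order, freeing registers as values die. */
    for (i = 0; i < n; i++) {
        const int srcs[2] = { code[i].src1, code[i].src2 };
        const int d = code[i].dest;

        /* A source read before any write is a value live on entry. */
        for (k = 0; k < 2; k++)
            if (srcs[k] >= 0 && phys[srcs[k]] < 0)
                phys[srcs[k]] = take_free(in_use, &count);

        /* Sources that die here release their register, so the
           destination of this same instruction may reuse it. */
        for (k = 0; k < 2; k++)
            if (srcs[k] >= 0 && last_use[srcs[k]] == i)
                in_use[phys[srcs[k]]] = 0;

        if (d >= 0) {
            if (phys[d] < 0)
                phys[d] = take_free(in_use, &count);
            else
                in_use[phys[d]] = 1;
            if (last_use[d] == i)   /* result is never read again */
                in_use[phys[d]] = 0;
        }
    }
    return count;
}
```

For the four-instruction sequence v0 = ..., v1 = ..., v2 = v0 + v1, v3 = v2 * v2, this sketch needs only two physical registers, because v0 and v1 are dead by the time v2 is written. A real analysis would also follow branches, which is where the control flow information mentioned above comes in.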
Run Cores
The run core is the central loop of the Parrot program, and there are several different runcores available that determine the performance and capabilities of Parrot. Runcores determine how Parrot executes the bytecode instructions that are passed into the interpreter. Some runcores perform extra tasks such as bounds-checking, testing, or debugging. Other runcores have been optimized to operate extremely quickly. Implementation details about the various cores can be found in src/runops_cores.c. Different cores can be activated by passing particular switches at the command line. The sections below will discuss the various runcores, what they do, how they work, and how to activate them.
Basic Cores
Slow Core
The default "slow" core treats all ops as individual C functions. Each function is called, and returns the address of the next instruction operation. Many cores, such as the tracing and debugging cores, are based on the slow core design.
Fast Core
The fast core is a bare-bones core that does not perform any special operations such as tracing, debugging, or bounds-checking.
Computed Goto Core
Computed goto is a feature of some compilers that allows a goto instruction to target a variable containing the address of a label, not necessarily directly to a label. By caching the addresses of all labels into an array, a jump can be made directly to the necessary instructions. This avoids the overhead of multiple subroutine calls, and can be very quick on platforms that support it. For more information about the workings of the computed-goto runcore, see the generated file src/ops/core_ops_cg.c.
Switch Core
The switch core uses the standard C switch and case structure to select the next operation to run. At each iteration, a switch is performed, and each case represents one of the ops. After the op has been performed, control flow jumps back to the top of the switch and the cycle repeats. Switch statements, especially those that use many consecutive values, are typically converted by the compiler into jump tables which perform very similarly to computed-goto jumps.
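The idea behind the slow core, where every op is a separate function that returns the address of the next instruction, can be sketched in a few lines of Python. This is only an illustrative model: the tiny instruction set, the op names, and the run_slow and run_traced functions are invented here and do not correspond to Parrot's actual ops or to the code in src/runops_cores.c.

```python
def op_set(regs, pc, code):
    # set Rdest, constant
    regs[code[pc + 1]] = code[pc + 2]
    return pc + 3


def op_add(regs, pc, code):
    # add Rdest, Ra, Rb
    regs[code[pc + 1]] = regs[code[pc + 2]] + regs[code[pc + 3]]
    return pc + 4


def op_end(regs, pc, code):
    # A next address of None stops the run loop.
    return None


OPS = {"set": op_set, "add": op_add, "end": op_end}


def run_slow(code, regs):
    """Call one function per op; each returns the next instruction address."""
    pc = 0
    while pc is not None:
        pc = OPS[code[pc]](regs, pc, code)
    return regs


def run_traced(code, regs, log):
    """A variant core: the same loop, plus a record of every op executed."""
    pc = 0
    while pc is not None:
        log.append((pc, code[pc]))
        pc = OPS[code[pc]](regs, pc, code)
    return regs


# Hypothetical bytecode: R0 = 2, R1 = 3, R2 = R0 + R1, then stop.
BYTECODE = ["set", 0, 2, "set", 1, 3, "add", 2, 0, 1, "end"]
```

The traced variant differs from the plain loop by a single line, which gives a feel for why specialized cores such as tracing or debugging cores can be layered on the slow core design.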
Variant Cores
The above cores are the basic designs upon which other specialized cores are based.
mod_parrot
Some members of the Parrot team have developed an extension for the Apache webserver that allows Parrot to be used to generate server-side content. The result of this work is mod_parrot, which can be used to produce web sites using PIR or PASM. This has limited usefulness by itself. However, mod_parrot allows the creation of additional modules for languages with compilers that target Parrot. One notable module of this kind, mod_perl6, is a bytecode module that runs on top of mod_parrot. More information about mod_parrot is available at its website: http://www.parrot.org/mod_parrot
Programming Steps
There are a number of different methods to program Parrot, as we see in the list above. However, different programming methods require different steps. Here, we will give a very brief overview of some of the ways you program Parrot.
PASM and PIR
A program written in PASM or PIR, such as Foo.pasm or Bar.pir, can be run in one of two ways. It can be interpreted directly by typing (on most Windows and Unix/Linux systems):
    ./parrot Foo.pasm
or
    ./parrot Bar.pir
This will run Parrot in interpreter mode. However, we can also compile these programs down to Parrot Bytecode (PBC) using the -o flag:
    ./parrot -o Foo.pbc Foo.pasm
    ./parrot -o Bar.pbc Bar.pir
Of course, you can name the output files anything you want. Once you have a PBC file, you can run it like this:
    ./parrot Foo.pbc
NQP
NQP must be compiled down to PIR using the NQP compiler. This is located in the compilers/nqp directory of the Parrot repository.
High Level Languages
To program Parrot in a higher-level language than NQP or PIR, such as Perl 6, Python, or Ruby, there must first be a compiler available for that language. To run the file Foo.pl, for example (".pl" is the file extension for Perl programs), you would type:
    ./parrot languages/perl6/perl6.pbc Foo.pl
This runs the Perl 6 compiler on Parrot, and passes the file name Foo.pl to the compiler. To output a file in PIR or PBC, you would use the --target= option to specify an output format.
Virtual Machines?
One term that we are going to be using frequently in this book is "Virtual Machine", or VM for short. It's worth discussing now what exactly a VM is. Before talking about virtual machines, let's consider actual computer hardware first. In an ordinary computer system, a native machine, there is a microprocessor which takes instructions and performs the necessary actions. Programs are written in a high-level language and compiled into the binary machine code that the processor uses. The problem with this is that different types of processors use different machine code, and to get a program to run on different platforms it needs to be recompiled for each. A virtual machine, unlike a regular computer processor, is built using software, not hardware. The virtual machine is written in a high-level language, and compiled to machine code as usual. However, programs that run on the virtual machine are compiled into bytecode instead of machine code. This bytecode runs on top of the virtual machine, and the virtual machine converts it into processor instructions. Here is a table that summarizes some of the important differences between a virtual machine and a native machine:
Implementation
    Native Machine: Hardware
    Virtual Machine: Software
Speed of Execution
    Native Machine: Fast
    Virtual Machine: Slow
Native Machine Code Programming
    Native Machine: Must compile every program into native machine code
    Virtual Machine: Must only compile the virtual machine into native machine code, everything else is converted to bytecode
Portability
    Native Machine: Every program must be recompiled on every new hardware platform
    Virtual Machine: Programs only need to be compiled into bytecode once, and can run anywhere a VM is installed
Extensibility
    Native Machine: Impossible
    Virtual Machine: Virtual machines can be improved, extended, patched, and added to over time
Parrot Assembly Language
Integer
Integer registers start with an "I". Integer registers can be named things like "$I0" or "$I56".
Number
Floating point number registers, registers which can hold a floating point number, start with a letter "N". These registers can be named things like "$N0" or "$N354".
PMC
PMCs are advanced object-oriented data types, and a PMC register can be used to hold many different kinds of data. PMC registers start with a "P" identifier, and can be named things like "$P0" or "$P35".
Basic Statements
A basic PASM statement contains an optional label, an instruction mnemonic, and a series of comma-separated arguments. Here is an example:
    my_label: add_n $P0, $P1, $I1
In this example the add_n instruction performs addition on two registers and stores the result in a third. The values from $P1 and $I1 are added together, and the result is stored in $P0. Notice that the operands are different types. One of the arguments and the result are both PMC registers, but the second operand is an integer and the add_n instruction is an integer instruction. Parrot will automatically handle data type conversions as necessary when performing instructions like this. The only requirement is that it is possible to convert between the two data types. If it is possible, Parrot will handle the details. In some cases, however, automatic type conversions are not possible, and in these cases Parrot will raise an exception.
Directives
PASM has few available directives.
.pcc_sub
This directive defines the start of a new subroutine.
Resources
http://www.parrotcode.org/docs/pdd/pdd06_pasm.html
PIR Syntax
PIR syntax is similar in many respects to older programming languages such as C or BASIC. In addition to PASM-like operations, there are control structures and arithmetic operations which simplify the syntax for human readers. All PASM is legal PIR code; PIR is little more than an overlay of fancier syntax over the raw PASM instructions. When it is available, you should use PIR's syntax instead of PASM's for ease of reading. Even though PIR has more features and better syntax than PASM, it is not itself a high-level language. PIR is still very low-level and is not really intended for building large systems. There are so many other tools available to language and application designers on Parrot that PIR only really needs to be used in a small subset of areas. Eventually, enough tools might be created that PIR never needs to be used directly.
Comments
Similarly to Perl, PIR uses the "#" symbol to start comments. Comments run from the # until the end of the current line. PIR also allows the use of POD documentation in files. We'll talk about POD in more detail later.
Subroutines
Subroutines start with the .sub directive, and end with the .end directive. We can return values from a subroutine using the .return directive. Here is a short example of a function that takes no parameters and returns an approximation of π:
    .sub 'GetPi'
        $N0 = 3.14159
        .return($N0)
    .end
Notice that the subroutine name is written in single quotes. This isn't a requirement, but it's very helpful and should be done whenever possible. We'll discuss the reasons for this below.
Subroutine Calls
There are two methods to call a subroutine: direct and indirect. In a direct call, we call a specific subroutine by name:
    $N1 = 'GetPi'()
In an indirect call, however, we call a subroutine using a string that contains the name of that subroutine:
    $S0 = 'GetPi'
    $N1 = $S0()
A problem arises when we start to use named variables (which we will discuss in more detail below). Consider the following snippet, where we have a local variable called "GetPi":
    GetPi = 'MyOtherFunction'
    $N0 = GetPi()
In this snippet, do we call the function "GetPi" (since we made the call GetPi()) or do we call the function "MyOtherFunction" (since the variable GetPi contains the value 'MyOtherFunction')? The short answer is that we would call the function "MyOtherFunction", because local variable names take precedence over function names in these situations. However, this is a little confusing, isn't it? To avoid this confusion, people follow a simple convention:
    $N0 = GetPi()       Used only for indirect calls
    $N0 = 'GetPi'()     Used for all direct calls
By sticking with this convention, we avoid this kind of confusion later on.
Subroutine Parameters
Parameters to a subroutine can be declared using the .param directive. Here are some examples:
    .sub 'MySub'
        .param int myint
        .param string mystring
        .param num mynum
        .param pmc mypmc
The .param directives must be at the top of the function. You may not put comments or other code between the .sub and .param directives. Here is the same example as above, written incorrectly:
    .sub 'MySub'
        # These are my params:
        .param int myint
        .param string mystring
        .param num mynum
        .param pmc mypmc
Wrong!
Named Parameters
Parameters that are passed in a strict order like we've seen above are called positional arguments. Positional arguments are differentiated from one another by their position in the function call. Putting positional arguments in a different order will produce different effects, or may cause errors. Parrot supports a second type of parameter, a named parameter. Instead of being passed by their position in the argument list, these parameters are passed by name and can be in any order. Here's an example:
    .sub 'MySub'
        .param int yrs :named("age")
        .param string call :named("name")
        $S0 = "Hello " . call
        $S2 = yrs                  # convert the integer to a string first
        $S1 = "You are " . $S2
        $S1 = $S1 . " years old"
        print $S0
        print $S1
    .end
    .sub main :main
        'MySub'("age" => 42, "name" => "Bob")
    .end
In the example above, we could just as easily have reversed the order:
    .sub main :main
        'MySub'("name" => "Bob", "age" => 42)    # Same!
    .end
Named arguments can be a big help because you don't have to worry about the exact order of variables, especially as argument lists get very long.
Optional Parameters
Functions may declare optional parameters, which the caller may or may not specify. To do this, we use the :optional and :opt_flag modifiers:
    .sub 'Foo'
        .param int bar :optional
        .param int has_bar :opt_flag
In this example, the parameter has_bar will be set to 1 if bar was supplied by the caller, and will be 0 otherwise. Here is some example code that takes two numbers and adds them together. If the second argument is not supplied, the first number is doubled:
    .sub 'AddTogether'
        .param num x
        .param num y :optional
        .param int has_y :opt_flag
        if has_y goto ive_got_y
        y = x
      ive_got_y:
        $N0 = x + y
        .return($N0)
    .end
And we will call this function with:
    'AddTogether'(1.0, 1.5)    #returns 2.5
    'AddTogether'(3.0)         #returns 6.0
Slurpy Parameters
A subroutine can take any number of arguments, which can be loaded into an array. Parameters which can accept a variable number of input arguments are called :slurpy parameters. Slurpy arguments are loaded into an array PMC, and you can loop over them inside your function if you wish. Here is a short example:
    .sub 'PrintList'
        .param pmc list :slurpy
        print list
    .end
    .sub 'PrintOne'
        .param pmc item
        print item
    .end
    .sub main :main
        'PrintList'(1, 2, 3)    # Prints "1 2 3"
        'PrintOne'(1, 2, 3)     # Prints "1"
    .end
Slurpy parameters absorb the remainder of all function arguments. Therefore, a slurpy parameter should only be the last parameter of a function. Any parameters after a slurpy parameter will never take any values, because all arguments passed for them will be absorbed by the slurpy parameter instead.
        .param pmc d :slurpy
We have an array called x that contains three Integer PMCs: [1, 2, 3]. Here are two examples:
Function Call                        Parameters
'ExampleFunction'(x, 4, 5)           a = [1, 2, 3], b = 4, c = 5, d = []
'ExampleFunction'(x :flat, 4, 5)     a = 1, b = 2, c = 3, d = [4, 5]
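Readers who know Python may find it helpful to compare the two calls in the table with Python's own argument packing and unpacking, which behave in a similar way. The sketch below is an analogy written in Python, not PIR: the trailing *d parameter stands in for a :slurpy parameter, and the * at the call site stands in for the :flat flag. The function name example_function simply mirrors the name used in the table.

```python
def example_function(a, b, c, *d):
    # d collects any leftover positional arguments, like a slurpy parameter.
    return a, b, c, list(d)


x = [1, 2, 3]

# x is passed as one argument, so it lands in a as a whole array.
whole = example_function(x, 4, 5)

# x is unpacked into separate arguments, so 4 and 5 spill over into d.
flat = example_function(*x, 4, 5)
```

The two results correspond to the two rows of the table: in the first call d is empty, and in the second call d holds the two trailing arguments.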
Variables
Local Variables
Local variables can be defined using the .local directive, using a similar syntax as is used with parameters:
    .local int myint
    .local string mystring
    .local num mynum
    .local pmc mypmc
In addition to local variables, in PIR you can use the registers for data storage as well.
Namespaces
Namespaces are constructs that allow the reuse of function and variable names without causing conflicts with previous incarnations. Namespaces are also used to keep the methods of a class together, without causing naming conflicts with functions of the same names in other namespaces. They are a valuable tool in promoting code reuse and decreasing naming pollution. In PIR, namespaces are specified with the .namespace directive. Namespaces may be nested using a key structure:
    .namespace ["Foo"]
    .namespace ["Foo";"Bar"]
    .namespace ["Foo";"Bar";"Baz"]
The root namespace can be specified with an empty pair of brackets:
    .namespace []    #Right! Enters the root namespace
    .namespace       #WRONG! Brackets are required!
Strings
Strings are a fundamental datatype in PIR, and are incredibly flexible. Strings can be specified as quoted literals, or as "Heredoc" literals in the code.
Heredocs
Heredoc string literals have become a common tool in modern programming languages to specify very long multi-line string literals. Perl programmers will be familiar with them, but so will most shell programmers and even modern .NET programmers too. Here is how a Heredoc works in PIR:
    $S0 = << "TAG"
    This is part of the Heredoc string. Everything between the
    '<< "TAG"' and the end marker is treated as a literal string constant.
    This string ends when the parser finds the end marker.
    TAG
Heredocs allow long multi-line strings to be entered without having to use lots of messy quotes and concatenation operations.
File Includes
You can include an external PIR file into your current file using the .include directive. For example, if we wanted to include the file "MyLibrary.pir" into our current file, we would write:
    .include "MyLibrary.pir"
Notice that the .include directive is a raw text-substitution function. A file of PIR code is not self-contained the way you might expect from some other languages. For instance, one problem that new users run into relatively often is namespace overflow. Consider two files, A.pir and B.pir:
A.pir:
    .namespace ["namespace 2"]
B.pir:
    .namespace ["namespace 1"]
    #here, we are in "namespace 1"
    .include "A.pir"
    #here we are in "namespace 2"
The .namespace directive from file A overflows into file B, which is counterintuitive for most programmers.
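Because .include is described as plain text substitution, the overflow can be reproduced with a very small model. The Python sketch below is an illustration only and is not how Parrot's parser is implemented: the helper names expand_includes and namespace_at_end are hypothetical, and the regular expressions handle only the simple one-level namespace form used in this example.

```python
import re


def expand_includes(name, files):
    """Return the lines of `name` with every .include replaced by file text."""
    lines = []
    for line in files[name].splitlines():
        match = re.match(r'\s*\.include\s+"([^"]+)"', line)
        if match:
            lines.extend(expand_includes(match.group(1), files))
        else:
            lines.append(line)
    return lines


def namespace_at_end(lines):
    """Track the most recent .namespace directive in the expanded text."""
    current = None
    for line in lines:
        match = re.match(r'\s*\.namespace\s+\["([^"]+)"\]', line)
        if match:
            current = match.group(1)
    return current


# The two files from the example, held in memory instead of on disk.
FILES = {
    "A.pir": '.namespace ["namespace 2"]\n',
    "B.pir": '.namespace ["namespace 1"]\n.include "A.pir"\n',
}
```

After expansion, B.pir is just two .namespace lines in a row, so whatever follows the include sits in the namespace set by A.pir.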
        $N0 = 3.14159
        .return($N0)
    .end
    .sub 'GetE' :method
        $N0 = 2.71828
        .return($N0)
    .end
With this class (which we probably store in "MathConstants.pir" and include into our main file), we can write the following things:
    .local pmc mathconst
    mathconst = new 'MathConstants'
    $N0 = mathconst.'GetPi'()    #$N0 contains the value 3.14159
    $N1 = mathconst.'GetE'()     #$N1 contains the value 2.71828
We'll explain more of the messy details later, but this should be enough to get you started.
Control Statements
PIR is a low-level language and so it doesn't support any of the high-level control structures that programmers may be used to. PIR supports two types of control structures: conditional and unconditional branches. Unconditional Branches are handled by the goto instruction. Conditional Branches use the goto command also, but accompany it with an if or unless statement. The jump is only taken if the if-condition is true or the unless-condition is false.
HLL Namespace
Each HLL compiler has a namespace that is the same as the name of that HLL. For instance, if we were programming a compiler for Perl, we would create the namespace .namespace ["Perl"]. If we are not writing a compiler, but instead writing a program in pure PIR, we would be in the default namespace .namespace ["Parrot"]. To create a new HLL compiler, we would use the .HLL directive to create the current default HLL namespace: .HLL "mylanguage", "mylanguage_group" Everything that is in the HLL namespace is visible to programs written in that HLL. For example, if we have a PIR function "Foo" that is in the "PHP" namespace, a program written in PHP can call the Foo function as if it were a regular PHP function. This may sound a little bit complicated. Here is a short example:
PIR code:
    .namespace ["perl6"]
    .sub 'AddTwo'
        .param int a
        .param int b
        $I0 = a + b
        .return($I0)
    .end
Perl 6 code:
    $x = AddTwo(4, 5);
To simplify, we can write simply .namespace (without the brackets) to return to the current HLL namespace.
Multimethods
Multimethods are groups of subroutines which share the same name. For instance, the subroutine "Add" might have different behavior depending on whether it is passed a Perl 5 Floating point value, a Parrot BigNum PMC, or a Lisp Ratio. Multiple dispatch subroutines are declared like any other subroutine in PIR, except they also have the :multi flag. When a Multi is invoked, Parrot loads the MultiSub PMC object with the same name, and starts to compare parameters. Whichever subroutine has the best match to the accepted parameter list gets invoked. The "best match" routine is relatively advanced. Parrot uses a Manhattan distance to order subroutines by their closeness to the given list, and then invokes the subroutine at the top of the list. When sorting, Parrot takes into account roles and multiple inheritance. This makes it incredibly powerful and versatile.
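One plausible reading of "Manhattan distance" here is: for each argument, count how many inheritance steps separate the argument's type from the declared parameter type, then add those counts together and prefer the smallest total. The Python sketch below models that reading. It is an assumption-laden illustration, not Parrot's actual dispatch algorithm: the classes Number, Integer and BigInt and all function names are hypothetical, Python's method resolution order stands in for Parrot's class hierarchy, and roles are ignored.

```python
class Number:
    pass


class Integer(Number):
    pass


class BigInt(Integer):
    pass


def type_distance(arg_type, param_type):
    """Steps up the inheritance chain from arg_type to param_type, or None."""
    mro = arg_type.__mro__
    return mro.index(param_type) if param_type in mro else None


def manhattan_distance(arg_types, signature):
    """Sum of per-argument distances; None if any argument does not match."""
    if len(arg_types) != len(signature):
        return None
    total = 0
    for arg_type, param_type in zip(arg_types, signature):
        step = type_distance(arg_type, param_type)
        if step is None:
            return None
        total += step
    return total


def dispatch(candidates, *args):
    """Invoke the candidate whose signature is closest to the argument types."""
    arg_types = [type(arg) for arg in args]
    scored = []
    for signature, function in candidates:
        distance = manhattan_distance(arg_types, signature)
        if distance is not None:
            scored.append((distance, function))
    if not scored:
        raise TypeError("no matching candidate")
    scored.sort(key=lambda pair: pair[0])
    return scored[0][1](*args)


# Two hypothetical variants of an "Add" multi, keyed by parameter types.
ADD = [
    ((Number, Number), lambda a, b: "generic add"),
    ((Integer, Integer), lambda a, b: "integer add"),
]
```

With two BigInt arguments, both variants match, but the (Integer, Integer) variant is one step away per argument while the (Number, Number) variant is two steps away, so the integer variant is chosen.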
Macro Constants
Constant values can be defined with the .macro_const keyword. Here is an example:
    .macro_const PI 3.14
    .sub main :main
        print .PI    #Prints "3.14"
    .end
A .macro_const can be an integer constant, a floating point constant, a string literal, or a register name. Here's another example:
    .macro_const MyReg S0
    .macro_const HelloMessage "hello world!"
    .sub main :main
        .MyReg = .HelloMessage
        print .MyReg
    .end
This allows you to give names to common constants, strings, or registers.
Macros
Basic text-substitution macros can be created using the .macro and .endm keywords to mark the start and end of the macro respectively. Here is a quick example:
    .macro SayHello
        print "Hello!"
    .endm
    .sub main :main
        .SayHello
        .SayHello
        .SayHello
    .end
This example prints out the word "Hello!" three times. We can also give our macros parameters, to be included in the text substitution:
    .macro CircleCircumference(r)
        $N0 = r * 3.14
        $N0 = $N0 * 2
        print $N0
    .endm
    .sub main :main
        .CircleCircumference(5)
        .CircleCircumference(10)
    .end
Consider a macro named PrintSomething that declares a local variable named something, used twice in the same subroutine:
    .sub main :main
        .PrintSomething
        .PrintSomething
    .end
After we do the text substitution, we get this:
    .sub main :main
        .local string something
        something = "This is a message"
        print something
        .local string something
        something = "This is a message"
        print something
    .end
After the substitution, we've declared the variable something twice! Instead of that, we can use the .macro_local declaration to create a variable with a unique name that is local to the macro:
    .macro PrintSomething
        .macro_local something
        something = "This is a message"
        print something
    .endm
Now, the same function translates to this after the text substitution:
    .sub main :main
        .local string main_PrintSomething_something_1
        main_PrintSomething_something_1 = "This is a message"
        print main_PrintSomething_something_1
        .local string main_PrintSomething_something_2
        main_PrintSomething_something_2 = "This is a message"
        print main_PrintSomething_something_2
    .end
Notice how the local variable declarations are now unique? They depend on the name of the parameter, the name of the macro, and other information from the file. This is a reusable approach that doesn't cause any problems.
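The renaming shown above can be imitated with a short Python model of text substitution. This is a sketch under stated assumptions rather than a description of Parrot's macro processor: the Macro class and its expand method are hypothetical, the unique name is built from the subroutine name, the macro name, the variable name and a per-macro use counter simply because that matches the names in the expansion above, and the word-by-word replacement is deliberately naive (it would also rename a matching word inside a string literal).

```python
class Macro:
    """A text-substitution macro whose local names are made unique per use."""

    def __init__(self, name, body_lines, local_names):
        self.name = name
        self.body_lines = body_lines
        self.local_names = set(local_names)
        self.uses = 0

    def expand(self, sub_name):
        # Each expansion gets a fresh serial number, so names never collide.
        self.uses += 1
        renames = {
            local: f"{sub_name}_{self.name}_{local}_{self.uses}"
            for local in self.local_names
        }
        expanded = []
        for line in self.body_lines:
            words = [renames.get(word, word) for word in line.split()]
            expanded.append(" ".join(words))
        return expanded


# Hypothetical macro body, written with the local already declared.
PRINT_SOMETHING_BODY = [
    ".local string something",
    'something = "This is a message"',
    "print something",
]
```

Expanding the macro twice inside a subroutine called main gives two blocks whose variables end in _1 and _2, so the duplicate declaration problem does not arise.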
Resources
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd19_pir.pod.html
Writing PMCs in C
PMC definitions are written in a C-like language that is translated to C code using a special PMC compiler program called pmc2c.pl. Once converted to C code, the PMCs are included in the Parrot build process.
PMC Script
The script language used to write a PMC is based on C. In fact, it's mostly C with a few additional keywords and constructs. The PMC compiler converts PMC files into C code for compilation. All standard ANSI C 89 code is acceptable for use in PMC files. Here we will list some of the additions.
Specifier             Meaning
is SUPERNAME          Specifies the parent class, if any
need_ext              Needs a PMC_EXT for special handling
abstract              The class is abstract and cannot be instantiated
no_init               The PMC does not have an init vtable method for Parrot to call. Normally, Parrot calls the init method when the PMC is first created. If you don't need that, use no_init.
provides INTERFACE    INTERFACE is one of the standard interfaces, and the PMC can be used as if it were an object of that type. The interfaces are "array", "hash"
Helper Functions
As in ordinary C, you can define additional functions to help with your calculations. These functions should be written in ordinary C (without any special keywords or values) and should be defined outside of the pmclass definition.
VTABLES
All PMCs have a standard API, an interface that they share in common with all other PMCs. This standard interface is called a VTABLE. A VTABLE is a list of about 150 standard functions, called "VTABLE Interfaces", that implement basic, common behavior for PMCs. All PMCs implement all these interfaces, although if one is not explicitly provided it can be inherited from a parent PMC class, or it can default to throwing an exception. VTABLE methods can be defined in one of two ways: in the .pmc file using the C-like PMC language, or in PIR using the :vtable function qualifier. VTABLEs correspond to some basic operations that can be performed on any object, such as arithmetic, class operations, casting operations (to INTVAL, FLOATVAL, STRING, or PMC), and other common operations. Regardless of how a VTABLE method is defined, it must have a very specific name.
Methods
In addition to VTABLEs, a PMC may supply a series of custom interface functions called methods to supply additional functionality. Notice that methods are not integrated into the PIR operators or PASM opcodes in the same way that VTABLE methods are. Methods can be written in the C-like PMC script for individual PMCs, or they can be written in PIR for user-defined PMC subclasses.
Invoking Methods
Once a method has been defined, it can be accessed in a PMC file using the PCCINVOKE command.
VTABLE List
A complete list of all vtable methods is located in the appendix.
Resources
Built-In PMC Appendix
http://www.parrotcode.org/docs/pdd/pdd04_datatypes.html
http://www.parrotcode.org/docs/pdd/pdd17_pmc.html
http://www.parrotcode.org/docs/pmc2c.html
http://www.parrotcode.org/docs/pmc.html
Coroutines
While not technically "multithreading", coroutines represent a simple form of concurrency that will find many interesting uses as they catch on with mainstream programmers. Coroutines are like subroutines which use a yield statement instead of a return statement. The .yield statement causes the coroutine to return a value to the caller. However, .yield does not cause the coroutine to exit, disappear, or lose its current state. The next time the caller calls the coroutine, the coroutine will pick up where it left off the last time. Here is an example:
    .sub 'MyCoroutine'
        .yield(1)
        .yield(2)
        .return(3)
    .end
    .sub 'MyCaller'
        $I0 = 'MyCoroutine'()
        say $I0
        $I0 = 'MyCoroutine'()
        say $I0
        $I0 = 'MyCoroutine'()
        say $I0
        .return()
    .end
The output of the function MyCaller will be:
    1
    2
    3
Coroutines are immediately useful in places where a persistent state needs to be maintained, but where it might be difficult to coordinate that state between multiple callers, or multiple threads. One such example is file or database access, where a coroutine can serialize accesses and maintain persistent state information about the size of the file and the locations of certain features in the file. Coroutines are also called "continuations" in other places, or even "continuation sandwich" in others.
Coroutine Parameters
Parameters to a coroutine are only passed the first time the coroutine is called, or any time the coroutine is called after having a .return statement. Here is an example:
    .sub 'MyCoroutine'
        .param int a
        .yield(a)
        .yield(a)
        .yield(a)
        .return(a)
    .end
    .sub 'main' :main
        $I0 = 'MyCoroutine'(1)
        say $I0
        $I0 = 'MyCoroutine'(2)
        say $I0
        $I0 = 'MyCoroutine'(3)
        say $I0
        $I0 = 'MyCoroutine'(4)
        say $I0
    .end
This code will print the following result, even though the parameter to the MyCoroutine function changes with each call:
    1
    1
    1
    1
This is because only the first call to MyCoroutine is actually a function call that creates the local parameters. The other calls to MyCoroutine are simply continuations, not calls. A continuation does not create new parameter variables.
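The same behavior can be modeled with a Python generator, which also keeps its local state between resumptions. The sketch below is an analogy and not a translation of Parrot's implementation: the Coroutine wrapper class is hypothetical and exists only to mimic the calling pattern described above, where arguments are bound on the first call (or the first call after a return) and ignored on every resumption in between.

```python
class Coroutine:
    """Wrap a generator function so that repeated calls resume it."""

    def __init__(self, body):
        self._body = body
        self._running = None

    def __call__(self, *args):
        if self._running is None:
            # A genuine call: the parameters are bound here, and only here.
            self._running = self._body(*args)
        try:
            # A resumption: args are ignored, the old parameters are kept.
            return next(self._running)
        except StopIteration as finished:
            # The body returned, so the next call starts over with new args.
            self._running = None
            return finished.value


def my_coroutine_body(a):
    yield a
    yield a
    yield a
    return a


my_coroutine = Coroutine(my_coroutine_body)
```

Calling my_coroutine with 1, 2, 3 and 4 in turn returns the first argument each time, because the later arguments never reach the already running body. A fifth call happens after the body has returned, so it starts a fresh run with its own argument.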
Threads
Internally, Parrot supports multiple different methods to perform threading. Luckily, all of these different methods are abstracted and the details are hidden from the HLL programmer. Parrot's concurrency system is modular, so new multithreading technologies can be added to Parrot later, and HLL programmers will be able to benefit from these changes and additions without having to make any changes to their code.
Managing Threads
The scheduler is an instance of the Scheduler PMC type. However, you don't use the scheduler PMC directly; Parrot uses it for you. There are four basic operations that can be used to create and manage a new thread. In Parrot, threads are all abstracted in the Task PMC. To create a new thread, you first create a new Task PMC, and then you add it to the scheduler.
Resources
http://www.parrotcode.org/docs/pdd/pdd25_concurrency.html
Synopsis 17
Exception Handling
Exceptions
Exceptions are errors that are caused in a program. However, unlike an ordinary error which causes the program to terminate unexpectedly, exceptions are controlled and can be recovered from without having to restart the program. In Parrot, exceptions are objects. This means you create an exception using the new keyword, and manipulate an exception using methods on that object. Before we discuss exceptions any further, we need to discuss some terminology. Readers who are familiar with exceptions in other programming languages can probably skip these definitions.
Throw
To throw an exception means to create an exception object. Once an exception has been created, the system enters a kind of "panic state" where it attempts to fix the exception. If the exception cannot be fixed, the program terminates.
Raise
To raise an exception is the same as to throw one.
Handler
A handler is a routine which can fix an exception. When an exception is raised and the system enters panic mode, it looks for a handler. If there is a handler available, the exception is sent to that handler. If no handlers are available to handle the exception, the system will terminate.
Catch
A handler that receives an exception object is said to "catch" it. Whenever an exception is thrown, a handler should be available to catch it. Again, if no handlers are available to catch an exception, Parrot will exit.
Rethrow
Not all handlers are equipped to handle all exceptions. If a handler catches an exception that it cannot fix, it can optionally rethrow that exception. Rethrowing causes the system to search for a different handler.
Creating an Exception
Creating a new exception object in PIR is deceptively simple:
    $P0 = new 'Exception'
Exceptions are hash-like, which means they have named fields. One such field is the '_message' field, which contains the name of the exception. Exception handlers will check the name to determine if they can handle a particular exception, or if it needs to be rethrown.
Creating a Handler
In Parrot, a handler is a label that the system jumps to in the event of an exception. These labels are stored in a stack structure. The exception handler on the top of the stack receives the exception first, but if it rethrows, the exception will be propagated down through the exception stack until it is finally handled.
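The search through the handler stack can be pictured with a small Python model. This is a sketch of the idea only and not Parrot's exception subsystem: handlers are ordinary functions here rather than labels, exceptions are plain dictionaries (echoing the description of exceptions as hash-like), and the names throw, Unhandled, io_handler and catch_all are hypothetical. A handler returns True when it deals with the exception and False when it rethrows.

```python
class Unhandled(Exception):
    """Raised when no handler on the stack accepts the exception."""


def throw(handler_stack, exception):
    """Offer the exception to handlers from the top of the stack downward."""
    for handler in reversed(handler_stack):
        if handler(exception):
            # The handler caught the exception; report which one it was.
            return handler.__name__
        # Otherwise the handler rethrows and the search continues below it.
    raise Unhandled(exception)


def io_handler(exception):
    # Only knows how to fix I/O problems; everything else is rethrown.
    return exception.get("type") == "io"


def catch_all(exception):
    return True


# The last element of the list is the top of the stack.
STACK = [catch_all, io_handler]
```

With this stack, an exception of type "io" is caught by io_handler at the top, while any other exception is rethrown by io_handler and caught by catch_all underneath it. If the stack holds only io_handler, a non-I/O exception runs out of handlers, which corresponds to the program terminating.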
Resources
http://www.parrotcode.org/docs/pdd/pdd23_exceptions.html
Initializers
An initializer is a function that is called at the beginning of the program to set up the class. PIR doesn't have a syntax for declaring information about the class directly; you have to use a series of opcodes and statements to tell Parrot what your class looks like. This means that you need to create the various data fields in your class (called "attributes" here), and set up relationships with other classes. Initializer functions tend to follow this format:
    .namespace
    .sub 'onload' :anon :init :load
    .end
The :anon flag means that the name of the function will not be stored in the namespace, so you don't end up with all sorts of name pollution. Of course, if the name of the function isn't stored, it can be difficult to make additional calls to this function, although that doesn't matter if we only want to call it once. The :init flag causes the function to run as soon as Parrot initializes the file, and the :load flag causes the function to run as soon as the file is loaded, if it is loaded as an external library. In short: we want this function to run as soon as possible, and we only want it to run once. Notice also that we want the initializer to be declared in the HLL namespace.
Subclassing
We can set up a subclass/superclass relationship using the subclass command. For instance, if we want to create a class that is a subclass of the builtin PMC type "ResizablePMCArray", and if we want to call this subclass "List", we would write:
    .sub 'onload' :anon :load :init
        subclass $P0, "ResizablePMCArray", "List"
    .end
This creates a class called "List" which is a subclass of the "ResizablePMCArray" class. Notice that like the newclass instruction above, we store a reference to the class in the PMC register $P0. We'll use this reference to modify the class in the sections below.
Adding Attributes
Attributes can be added to the class by using the add_attribute keyword with the class reference that we received from the newclass or subclass keywords. Here, we create a new class 'MyClass', and add two data fields to it: 'name' and 'value':
    .sub 'initmyclass' :init :load :anon
        newclass $P0, 'MyClass'
        add_attribute $P0, 'name'
        add_attribute $P0, 'value'
    .end
We'll talk about accessing these attributes below.
Methods
Methods, as we mentioned earlier, have three major differences from subroutines: The way they are flagged, the way they are called, and the fact that they have a special self variable. We know already that methods should use the :method flag. :method indicates to Parrot that the other two differences (dot-based calling convention and "self" variable) need to be implemented for the method. Some methods will also use the :vtable flag as well, and we will discuss that below. We want to create a class for a stack class. The stack has "push" and "pop" methods. Luckily, Parrot has push and pop instructions available that can operate on array-like PMCs (like the "ResizablePMCArray" PMC class). However, we need to wrap these PIR instructions into functions or methods so that they can be used from our high-level language (HLL). Here is how we can do that: .namespace .sub 'onload' :anon :load :init subclass $P0, "ResizeablePMCArray", "Stack" .end .namespace ["Stack"] .sub 'push' :method .param pmc arg push self, arg .end .sub 'pop' :method pop $P0, self .return($P0) .end Now, if we had a language compiler for Java on Parrot, we could write something similar to this: Stack mystack = new Stack(); mystack.push(5); System.out.println(mystack.pop());
The example above would print the value "5" at the end. If we look at the same example in a language like Perl 5, we would have:

    my $stack = Stack->new();
    $stack->push(5);
    print $stack->pop();

This, again, would print out the number "5".
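For readers who do not have Parrot at hand, the same idea can be sketched in Python. This is only an analogy, not Parrot code: the built-in list stands in for ResizablePMCArray, and the class name Stack and its two methods simply mirror the PIR example above.

```python
class Stack(list):
    """Hypothetical Python analogue of the PIR 'Stack' subclass."""

    def push(self, arg):
        # Mirrors the PIR body: push self, arg
        self.append(arg)

    def pop(self):
        # Mirrors the PIR body: pop $P0, self / .return($P0)
        return super().pop()


mystack = Stack()
mystack.push(5)
print(mystack.pop())
```

As in the PIR version, the subclass adds no storage of its own. It only wraps operations that the parent array type already provides.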
Accessing Attributes
If our class has attributes, we can use the setattribute and getattribute instructions to write and read those attributes, respectively. If we have a class 'MyClass' with data attributes 'name' and 'value', we can write getter and setter methods for these:

    .sub 'set_name' :method
        .param pmc newname
        $S0 = 'name'
        setattribute self, $S0, newname
    .end

    .sub 'set_value' :method
        .param pmc newvalue
        $S0 = 'value'
        setattribute self, $S0, newvalue
    .end

    .sub 'get_name' :method
        $S0 = 'name'
        $P0 = getattribute self, $S0
        .return($P0)
    .end

    .sub 'get_value' :method
        $S0 = 'value'
        $P0 = getattribute self, $S0
        .return($P0)
    .end
Constructors
The constructor is the function that we call when we use the new keyword. The constructor initializes the data object attributes, and maybe performs some other bookkeeping tasks as well. A constructor must be a method named 'new'. Besides the special name, the constructor is like any other method, and can get or set attributes on the self variable as needed.
Resources
http://www.parrotcode.org/docs/pdd/pdd15_objects.html http://www.parrotcode.org/docs/pdd/pdd21_namespaces.html
Resources
http://www.parrotcode.org/docs/debug.html http://www.parrotcode.org/docs/debugger.html
Once a program has been converted into bytecode, that bytecode is loaded into the interpreter where it is executed. This is just a very brief overview of these components; we will discuss them in more detail in later chapters. It is worth noting here, however, that many of these components are modular and can be swapped out if you would like to use a different one. For instance, if you already have a parser written for a particular language, instead of having to rewrite the parser using PCT, you can load your existing parser into Parrot. Of course, you will probably need to make modifications to ensure that your custom parser outputs a proper AST, but that's a small price to pay to avoid having to completely rewrite your language parser from the ground up.
5. Create the driver program

Once you have your compiler, you can use it to run programs written in your high-level language. Here are some steps involved in running your compiler:

1. Compile your grammar into a Parrot Abstract Syntax Tree (PAST)
2. Compile the PAST into a Parrot Opcode Syntax Tree (POST)
3. Compile the POST into Parrot Bytecode (PBC) or PIR
4. Run the PBC or PIR on Parrot
This should give you a rough idea of what needs to be done to create a compiler, and how a compiler operates. We'll elaborate on each of these steps in this chapter and the next few chapters in this section.
practice, open the makefile and see how things are being built. If you've never seen a makefile before, this is your opportunity to learn about what they are and how they work.
Once you create your language shell, all of these files will be produced for you. All you need to do is fill in your grammar and actions into the necessary files, write the rest of the necessary built-in code, and you should have a working compiler. Once you have modified these files to do what you need them to, there is an additional optional step that you should take: Write a series of test modules to verify that your language operates properly. We will discuss testing and test harnesses later. We will discuss writing parsers and action files in the next few chapters.
Getting Help
When you are writing your new language compiler, there are a number of places that you can go to get help. The Parrot repository contains all the current Parrot documentation, in POD format. Perl 5 programmers will be familiar with POD, but other users might not be. POD is a simple documentation format that is treated like multi-line comments in Perl code. Special programs like pod2html can be used to convert POD files into other file types for presentation, such as HTML.

There are many languages in the languages/ directory. If you are trying to implement a particular feature for your language, chances are good that you can find an existing example of how another language has implemented that feature.

One excellent tool to use, especially when you are constructing PAST node trees or writing functions in PIR, is the --target= directive to Parrot. This directive lets you specify an output dump format. For instance, if you go to the languages/perl6/ directory, you can type the following:

    ../../parrot perl6.pbc --target=pir

This command will output the PIR of any Perl 6 instructions that you type in. These options work for Parrot, so all the languages will use them, not just Perl 6. Here are some of the targets you may want to try:

    pir: prints out the resulting PIR code
    pasm: prints out the resulting PASM code
    past: prints out the PAST node tree that is generated from the code
    parse: prints out a parse tree of the code
Try all these, and see what kinds of results you get using different languages. If you have looked for help in the POD documentation and in the existing code examples, it might be time to find a real human to ask. Parrot developers and enthusiasts congregate in the #parrot [1] (irc.perl.org) chatroom. Perl 6 developers and enthusiasts congregate in the #perl6 [2] (freenode) chatroom. Other resources and methods of contact are available at http://www.parrotcode.org/resources.html
Resources
http://www.parrotcode.org/docs/compiler_faq.html
References
[1] irc://irc.perl.org/parrot
[2] irc://irc.freenode.net/perl6
case, the target language is PIR or PBC. All compilers are formed from these three key components: the lexical analyzer, the parser, and the back end. Parrot provides regular expressions for the lexical analyzer, and it handles all the target language generation. The only thing you, as a compiler designer, need to design is the parser. However, there is a lot of information about parsers that we need to know first.

There are two methods to parse a string of input tokens. First, we can do a top-down parse where we start with our highest-level token and try to find a match. Second, we can do a bottom-up parse where we start with the small tokens and try to combine them into bigger and bigger tokens until we have the highest-level token. The ultimate goal of both parsing methods is to reduce the string of input tokens into a single match token.
Bottom-Up Parse
Note: Parser-generators such as yacc or Bison produce bottom-up parsers.

As a simple bottom-up parse example, we will start at the left side of our string of tokens and try to combine small tokens into bigger ones. What we want, in the end, is a token that represents the entire input. We do this by utilizing two fundamental operations: shift and reduce. A shift operation means we read a new input token into the parser. A reduce operation means we turn a matched pattern of small tokens into a single larger token. Let's put this method into practice below.

    KEYWORD  IDENTIFIER  OPERATOR  IDENTIFIER  PAREN  INTEGER  OPERATOR  INTEGER  PAREN
    "int"    "x"         "="       "add_both"  "("    "5"      ","       "4"      ")"

The above string turns into the string below if we realize that the tokens "int" and "x" are a variable declaration. We reduce the first two tokens into a variable declaration token:

    DECLARATION  OPERATOR  IDENTIFIER  PAREN  INTEGER  OPERATOR  INTEGER  PAREN
                 "="       "add_both"  "("    "5"      ","       "4"      ")"

Now, we move through the line, shifting tokens into the parser from left to right. We can't see anything that we can reduce until we reach the closing parenthesis. We can reduce the open and close parentheses, and all the tokens in the middle of them, into an argument list as shown below:

    DECLARATION  OPERATOR  IDENTIFIER  ARGUMENT_LIST
                 "="       "add_both"

Now, we know that an identifier followed by an argument list is a function call. We reduce these two tokens to a single function call token:

    DECLARATION  OPERATOR  FUNCTION_CALL
                 "="

And finally, by skipping a few steps, we can convert this into an assignment statement:

    ASSIGNMENT_STATEMENT

Every time we reduce a set of small tokens into a bigger token, we add them to the tree. We add the small tokens as children of the bigger token. This type of parser that we are talking about here is called a "shift-reduce" parser because those are the actions the parser performs.
A subset of shift-reduce parsers that are useful for arithmetic expressions is called an operator precedence parser, and is one that we will talk about more below.

Note: Shift-reduce parsers are prone to a certain type of error called a shift-reduce conflict. A shift-reduce conflict is caused when a new input token can cause the parser to either shift or reduce. In other words, the parser has more than one option for the input token. A grammar that produces shift-reduce conflicts is often an ambiguous grammar. While there are ways to correct such a conflict, it is often better to redesign the language grammar to avoid them entirely. For more information on this, see the Resources section at the bottom of the page.
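The shift and reduce steps described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how yacc, Bison, or PGE work internally: it has no lookahead and no parse tables, it simply reduces whenever the top of the stack matches a rule. The RULES table and the parse function are assumptions made for this example only.

```python
# Each rule is (result token, pattern of token types to replace).
RULES = [
    ("DECLARATION", ["KEYWORD", "IDENTIFIER"]),
    ("ARGUMENT_LIST", ["PAREN", "INTEGER", "OPERATOR", "INTEGER", "PAREN"]),
    ("FUNCTION_CALL", ["IDENTIFIER", "ARGUMENT_LIST"]),
    ("ASSIGNMENT_STATEMENT", ["DECLARATION", "OPERATOR", "FUNCTION_CALL"]),
]

TOKENS = [
    ("KEYWORD", "int"), ("IDENTIFIER", "x"), ("OPERATOR", "="),
    ("IDENTIFIER", "add_both"), ("PAREN", "("), ("INTEGER", "5"),
    ("OPERATOR", ","), ("INTEGER", "4"), ("PAREN", ")"),
]


def parse(tokens):
    stack = []
    for token_type, text in tokens:
        stack.append((token_type, text))          # shift one input token
        reduced = True
        while reduced:                            # reduce as long as a rule fits
            reduced = False
            for name, pattern in RULES:
                n = len(pattern)
                top = [node[0] for node in stack[-n:]]
                if len(stack) >= n and top == pattern:
                    children = stack[-n:]
                    del stack[-n:]
                    stack.append((name, children))  # small tokens become children
                    reduced = True
                    break
    return stack


result = parse(TOKENS)
print(result[0][0])
```

Each reduced node keeps the smaller tokens as its children, so the single item left on the stack is the root of the parse tree.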
Top-Down Parse
A top-down parser is a little bit different from a bottom-up parser. We start with the highest level token, and we try to make a match by testing for smaller patterns. This process can be inherently inefficient, because we must often test many different patterns before we can find one that matches. However, we gain an ability to avoid shift-reduce conflicts and also gain a certain robustness because our parser is attempting multiple options before giving up. Let's say that we have a string of tokens, and a top-level definition for a STATEMENT token. Here is the definition in a format called a context-free grammar:

    STATEMENT := ASSIGNMENT | FUNCTION_CALL | DECLARATION

This is a simple example, of course, and certainly not enough to satisfy a language like C. The ":=" symbol is analogous to the words "is made of". The vertical bar "|" is the same as saying "or". So, the statement above says "A statement is made of an assignment or a function call or a declaration". We have this grammar rule that tells us what a statement is, and we have our string of input tokens:

    KEYWORD  IDENTIFIER  OPERATOR  IDENTIFIER  PAREN  INTEGER  OPERATOR  INTEGER  PAREN
    "int"    "x"         "="       "add_both"  "("    "5"      ","       "4"      ")"

The top-down parser will try each alternative in the STATEMENT definition to see if it matches. If one doesn't match, it moves to the next one and tries again. So, we try the ASSIGNMENT rule, and ASSIGNMENT is defined like this:

    ASSIGNMENT := (VARIABLE_NAME | DECLARATION) '=' (EXPRESSION | FUNCTION_CALL)

Parentheses are used to group things together. In English, this statement says "An assignment is a variable name or a declaration, followed by a '=' followed by an expression or a function call". So, to satisfy ASSIGNMENT, we try VARIABLE_NAME first, and then we try DECLARATION. The string "int x" is a declaration and not a simple variable name, so VARIABLE_NAME fails, and DECLARATION succeeds. Next, the '=' matches.
To satisfy the last group, we try EXPRESSION first, which fails, and then we try FUNCTION_CALL, which succeeds. We proceed this way, trying alternatives and slowly moving down to smaller and smaller tokens, until we've matched the entire input string. Once we have matched the last input token, if we have also satisfied the top-level match, the parser succeeds. This type of parser is called a "recursive descent" parser, because we recurse into each subrule, and slowly descend from the top of the parse tree to the bottom. Once the last subrule succeeds, the whole match succeeds and the parser returns. In this process, when a rule matches, we create a node in our PAST tree. Because we test all subrules first before a rule succeeds, all the children nodes are created before the higher-level nodes are created. This creates the parse tree from the bottom going up, even though we started at the top of the tree and moved down. Top-down parsers can be inefficient because the parser will attempt to match patterns that obviously cannot succeed. However, there are techniques that we can use to "prune" the tree of possibilities by directing the parser towards certain paths or stopping it from going down branches that will not match. We can also prevent the parser from backtracking from subrules back up to larger rules, which helps to reduce unnecessary repetition.
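The walk-through above can be sketched as a small combinator-style recursive-descent parser in Python. The STATEMENT and ASSIGNMENT rules come from the text. The definitions of VARIABLE_NAME, DECLARATION, EXPRESSION and FUNCTION_CALL are assumptions chosen so that the example input parses, and the helper names token, seq and alt are hypothetical.

```python
TOKENS = [
    ("KEYWORD", "int"), ("IDENTIFIER", "x"), ("OPERATOR", "="),
    ("IDENTIFIER", "add_both"), ("PAREN", "("), ("INTEGER", "5"),
    ("OPERATOR", ","), ("INTEGER", "4"), ("PAREN", ")"),
]


def token(kind, text=None):
    """Match one input token of the given type (and text, if given)."""
    def match(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == kind \
                and text in (None, tokens[pos][1]):
            return tokens[pos], pos + 1
        return None
    return match


def seq(name, *parts):
    """Match every part in order; create a node only once all succeed."""
    def match(tokens, pos):
        children = []
        for part in parts:
            result = part(tokens, pos)
            if result is None:
                return None
            node, pos = result
            children.append(node)
        return (name, children), pos
    return match


def alt(*options):
    """Try each alternative in turn and return the first that matches."""
    def match(tokens, pos):
        for option in options:
            result = option(tokens, pos)
            if result is not None:
                return result
        return None
    return match


# Assumed subrules (not given in the text):
variable_name = seq("VARIABLE_NAME", token("IDENTIFIER"))
declaration = seq("DECLARATION", token("KEYWORD"), token("IDENTIFIER"))
expression = seq("EXPRESSION", token("INTEGER"))
function_call = seq("FUNCTION_CALL", token("IDENTIFIER"), token("PAREN", "("),
                    token("INTEGER"), token("OPERATOR", ","),
                    token("INTEGER"), token("PAREN", ")"))
# Rules taken from the text:
assignment = seq("ASSIGNMENT", alt(variable_name, declaration),
                 token("OPERATOR", "="), alt(expression, function_call))
statement = alt(assignment, function_call, declaration)


def parse(tokens):
    result = statement(tokens, 0)
    if result is None or result[1] != len(tokens):
        return None          # top-level rule failed or input was left over
    return result[0]


print(parse(TOKENS)[0])
```

Note that seq only builds its node after all of its parts have matched, which is why the children exist before their parent even though matching started from the top rule.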
Basic Rules
Rules have a couple basic operators that we can use, some of which have already been discussed. People who are familiar with regular expressions will recognize most of them.
Op    What It Means        Example
*     "zero or more of"    <number>* '.' <number>
      Accepts a string with zero-or-more numbers, followed by a period, followed by 1 number. Example: 1234.5
+     "one or more of"     <letter>+ <number>
      One-or-more letters, followed by a number. Example: abcde5, or ghij6
?     "one or zero"        <number>? '.' <number>+
      An optional number followed by a period and a string of one-or-more numbers. Examples: 1.234 or .234
[]    Group                <letter> [ <letter> | <number> ]*
      A letter followed by any amount of letters or numbers. Example: a123, ident, wiki2txt
We have already discussed the or operator "|". Here are some examples.

Decimal Numbers

    rule decimal_numbers {
        <digit>* '.' <digit>+
        | <digit>+ '.' <digit>*
    }

Function Parameters

    rule function_parameters {
        '(' [ <identifier> [ ',' <identifier> ]* ]? ')'
    }
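Readers who know conventional regular expressions may find it helpful to see the two rules above approximated with Python's re module. This is a rough translation, not PGE syntax: the identifier pattern is an assumption, and the explicit \s* stands in for the whitespace that a PGE rule skips between elements.

```python
import re

# <digit>* '.' <digit>+  |  <digit>+ '.' <digit>*
DECIMAL = re.compile(r"\d*\.\d+|\d+\.\d*")

# '(' [ <identifier> [ ',' <identifier> ]* ]? ')'
IDENT = r"[A-Za-z_]\w*"   # assumed definition of <identifier>
PARAMS = re.compile(r"\(\s*(?:" + IDENT + r"(?:\s*,\s*" + IDENT + r")*)?\s*\)")


def is_decimal(text):
    return DECIMAL.fullmatch(text) is not None


def is_parameter_list(text):
    return PARAMS.fullmatch(text) is not None


for sample in ("1234.5", ".234", "12.", "."):
    print(sample, is_decimal(sample))
for sample in ("()", "(a)", "(a, b, c)", "(a,)"):
    print(sample, is_parameter_list(sample))
```

The two alternatives in the decimal rule are what keep a lone "." from matching: each branch requires at least one digit on one side of the period.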
Actions
As it successfully matches rules, PGE creates a special "match object", which contains information about the match. This match object can be sent to a function written in PIR or NQP for processing. The processing function that receives the match object is called the action. Each rule in the grammar has an associated action function with the same name. When we have completed a successful match, we need to send the match object to the action method. We do this with the special symbol {*}. {*} sends the current match object, not necessarily the complete one. You can call {*} multiple times in a rule to invoke the action method multiple times, if needed. Here is an example:

    rule k_r {
        {*}          # Calls the action method with an empty match object
        'hello' {*}  # Calls the action method after matching 'hello'
        'world' {*}  # Calls the action method after matching 'hello' 'world'
    }
There are two important points to remember about action methods:

1. The parser moves from left-to-right, top-to-bottom. In the k_r example above, if the input was "hello johnny", the action method would get called the first two times, but the rule would fail to match the word "world" and the action method would not be called the third time.
2. The parser returns after a successful match, even if there are more possibilities to try. Consider the example below, where only one of the action methods can be called depending on the result of the alternation. In this example, if the input is "hello franky", the action method only gets called for the {*} after 'franky'. After that matches, the parser returns and does not try to match 'johnny'.

    rule say_hello {
        'hello' [
            | 'tommy' {*}
            | 'franky' {*}
            | 'johnny' {*}
        ]
    }

It can be very helpful sometimes to specify which action method got called, when there is a list of possibilities. This is because we want to treat certain matches differently from others. To treat an action method differently, we use a special comment syntax:

    rule say_hello {
        'hello' [
            | 'tommy' {*}     #= tommy
            | 'franky' {*}    #= franky
            | 'johnny' {*}    #= johnny
        ]
    }

The special "#=" symbol is not a regular comment. Instead, it's a second parameter for the action method called a "key". If you use "#=", the action method will receive two arguments: the match object and the key. You can then test the value of the key to determine how to treat the match object. We will discuss this more in the next chapter about NQP.
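The idea of a keyed action can be mimicked in plain Python. This is only a sketch of the dispatch pattern, not of PGE itself: match_say_hello is a hand-written stand-in for the say_hello rule, the matched text stands in for the match object, and say_hello_action with its reply strings is invented for the example.

```python
def say_hello_action(match, key):
    """Hypothetical action method: chooses its behaviour from the key."""
    replies = {
        "tommy": "Hi, Tommy!",
        "franky": "Hey, Franky!",
        "johnny": "Hello, Johnny!",
    }
    return replies[key] + " (matched: " + match + ")"


def match_say_hello(text, action):
    """Stand-in for the say_hello rule: 'hello' followed by one known name."""
    words = text.split()
    if len(words) != 2 or words[0] != "hello":
        return None
    for name in ("tommy", "franky", "johnny"):
        if words[1] == name:
            # Like {*} #= name: call the action once, passing the key,
            # and return without trying the remaining alternatives.
            return action(text, name)
    return None


print(match_say_hello("hello franky", say_hello_action))
```

The action is called at most once per match, and the key tells it which branch of the alternation succeeded.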
Resources
Synopsis 5: Regexes and Rules [1] Synopsis 6: Subroutines [2]
References
[1] http://perlcabal.org/syn/S05.html
[2] http://perlcabal.org/syn/S06.html
Variables in NQP
Here we are going to discuss some of the basics of NQP programming. Experienced Perl programmers, even programmers who are familiar with Perl 5 but not necessarily Perl 6, will find most of this to be a simple review. NQP is not Perl 5 or Perl 6. This point cannot be stressed enough. There are a lot of features from Perl that are missing in NQP. Sometimes, this means you need to do some tasks the hard way.

In NQP, we use the := operator, which is called the bind operator. Unlike normal variable assignment, bind does not copy the value from one "container" to another. Instead, it creates a link between the two variables, and they are, from that point forward, aliases of the same container. This is similar to the way copying a pointer in C does not copy the data being pointed to.

Variables in NQP typically have one of three basic types: scalars, arrays, and hashes. Scalars are single values, like an integer, a floating point number, or a string. Arrays are lists of scalars that are accessed with an integer index. A hash is a list of scalars that use a string, called a key, for indexing. All variable names have a sigil in front of them. A sigil is a punctuation symbol like "$", "@", or "%" that tells the type of the variable.

Scalar variables have a "$" sigil. The following are examples of scalar values:

    $x := 5;
    $mystring := "string";
    $pi := 3.1415;

Arrays use the "@" sigil. We can use arrays like this:

    @myarray[1] := 5;
    @b[2] := @a[3];

Notice that NQP does not have a list context like Perl 6 has. This means you can't do a list-assignment, like:

    @b := (1, 2, 3);    # WRONG!
    $b := (1, 2, 3);    # CORRECT
NQP is designed to be bare-bones, as little as is needed to support development of Perl 6. The above line could also be written:

    @b[0] := 1;
    @b[1] := 2;
    @b[2] := 3;

We'll discuss this in more detail a little bit further down the page.

Hashes are prefixed with the "%" sigil:

    %myhash{'mykey'} := 7;
    %mathconstants{'pi'} := 3.1415;
    %mathconstants{'2pi'} := 2 * %mathconstants{'pi'};

Hashes, for people who aren't familiar with Perl, are also known as dictionaries (in Python) or associative arrays. Basically, they are like arrays but with string indices instead of integer indices.
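The difference between binding and copying, described at the start of this section, has a rough analogue in Python, where plain assignment binds a second name to the same object. The sketch below is an analogy only and says nothing about how NQP implements :=. The variable names are made up for the example.

```python
import copy

original = [1, 2, 3]

bound = original                 # like :=, both names refer to one container
copied = copy.copy(original)     # a separate container with the same contents

original.append(4)               # modify the container through one name

print(bound)                     # the change is visible through the alias
print(copied)                    # the copy is unaffected
print(bound is original, copied is original)
```

After the append, the bound name sees four elements while the copy still holds three, which is the behaviour the pointer comparison in the text describes.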
Branching Constructs
In terms of branches, we have:

If/Then/Else

    if ($key eq 'foo') {
        THEN DO SOME FOO STUFF
    }
    elsif ($key eq 'bar') {
        THEN DO THE BAR-RELATED STUFF
    }
    else {
        OTHERWISE DO THIS
    }

Unless/Then/Else
Looping Constructs
For

A "for" loop iterates over a list and sets $_ to the current element, as in Perl 5. There's no C-style loop with STARTING_POINT and STEP_ACTION in NQP, although there is a similar construct in both Perl 5 and Perl 6. Here is a basic for loop:

    for (1,2,3) {
        Do something with $_
    }

Translated exactly into this PIR code:

    .sub 'for_statement'
        .param pmc match
        .local pmc block, past
        $P0 = match['EXPR']
        $P0 = $P0.'item'()
        $P1 = match['block']
        block = $P1.'item'()
        block.'blocktype'('sub')
        .local pmc params, topic_var
        params = block[0]
        $P3 = get_hll_global ['PAST'], 'Var'
        topic_var = $P3.'new'('name'=>'$_', 'scope'=>'parameter')
        params.'push'(topic_var)
        block.'symbol'('$_', 'scope'=>'lexical')
        $P2 = get_hll_global ['PAST'], 'Op'
        $S1 = match['sym']
        past = $P2.'new'($P0, block, 'pasttype'=>$S1, 'node'=>match)
        match.'result_object'(past)
    .end

You can also iterate over the keys of a hash like so:

    for (keys %your_hash) {
        DO SOMETHING WITH %your_hash{$_}
    }

where keys %your_hash creates a list of all of the keys in %your_hash, and iterates through this list setting $_ to hold the current key.

While

"While" loops are similar to for loops. In NQP, a while loop looks like this:

    while(EXIT_CONDITION) {
        LOOP_CONTENTS
    }

Which roughly becomes in PIR:

    loop_top:
        if(!EXIT_CONDITION) goto loop_end
        LOOP_CONTENTS
        goto loop_top
    loop_end:

Do/While

A "do/while" loop is similar to a while loop except that the condition is tested at the end of the loop and not at the beginning. This means that the loop is always executed at least once, and possibly more times for as long as the condition remains true. In NQP:

    do {
        LOOP_CONTENTS
    } while(EXIT_CONDITION);

In PIR:

    loop_top:
        LOOP_CONTENTS
        if(!EXIT_CONDITION) goto loop_end
        goto loop_top
    loop_end:
Operators
NQP supports a small set of operators for manipulating variables.
Operator                  Purpose
+, -                      Scalar addition and subtraction
*, /                      Scalar multiplication and division
%                         Integer modulus
$( ... )                  Convert the argument into a scalar
@( ... )                  Treat the argument as an array
%( ... )                  Treat the argument as a hash
~                         String concatenation
eq                        String equality comparison
ne                        String inequality comparison
:=                        Binding
>, <, >=, <=, ==, !=      Equality and inequality operators
    $<first> $<second> $<third> $<andmore>

If we have multiples of any one field, such as:

    rule my_rule {
        <first> <second> <first> <second>
    }

Now, $<first> and $<second> are both two-item arrays. Also, we can extend this behavior to repetition operators in the grammar:

    rule my_rule {
        <first>+ <second>*
    }

Now, both $<first> and $<second> are arrays whose lengths indicate how many items were matched by each. You can use the + operator or the scalar() function to get the number of items matched.
Examples
Example: Word Detection
We want to make a simple parser that detects the words "Hello" or "Goodbye". If either of these words is entered, we want to print out a success message and the word. If neither word was entered, we print an error. To pick out words in our input, we will use the built-in subrule <ident>.

    rule TOP {
        <ident> $
        {*}
    }

In this grammar rule we are looking for a single identifier (which will be a word, for our purposes), followed by the end of the file. Once we have these, we create our match object and we call our action method:

    method TOP($/) {
        if($<ident> eq "Hello") {
            say("success! we found Hello");
        }
        elsif($<ident> eq "Goodbye") {
            say("success! we found Goodbye");
        }
        else {
            say("failure, we found: " ~ $<ident>);
        }
        make PAST::Stmts.new();
    }
Since the HLLCompiler class expects our action method to return a PAST node, we must create and return an empty Stmts node. When we run this parser on input it will have three possible outcomes:

1. We've received a "Hello" or a "Goodbye", and the system will print a success message.
2. We've received a different word, and we will receive an error message.
3. We've received too many words, not enough words, or something that isn't a word. This will cause a parse error.

Try it!
Example: Oct2Bin
Here is a simple example that shows how to make a program to convert octal numbers into binary. We start with a basic language shell from mk_language_shell.pl.

Grammar File:

    grammar Oct2Bin::Grammar is PCT::Grammar;

    rule TOP {
        <octdigit>+
        [ $ || <panic: Syntax error> ]
        {*}
    }

    token octdigit {'0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'}

Action File:

    class Oct2Bin::Grammar::Actions;

    method TOP($/) {
        my @table;
        @table[0] := '000';
        @table[1] := '001';
        @table[2] := '010';
        @table[3] := '011';
        @table[4] := '100';
        @table[5] := '101';
        @table[6] := '110';
        @table[7] := '111';
        my $string := "";
        for $<octdigit> {
            $string := $string ~ @table[$_];
        }
        say( $string );
        make PAST::Stmts.new( );
    }

Notice how in our actions file we had to instantiate the look-up table one element at a time? This is because NQP does not support list assignment. Notice also that we have our TOP method return an empty PAST::Stmts node, to suppress warnings from PCT that there are no PAST nodes.
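To check what output the Oct2Bin example should produce, the same table look-up can be written in a few lines of Python. This sketch reproduces only the conversion logic, not the PCT grammar or action machinery. The function name oct2bin is an assumption, and raising ValueError stands in for the grammar's panic on a syntax error.

```python
# Same look-up table as the NQP action method, indexed by octal digit.
TABLE = ["000", "001", "010", "011", "100", "101", "110", "111"]


def oct2bin(text):
    """Convert a string of octal digits to a string of binary digits."""
    if not text:
        raise ValueError("Syntax error")      # the rule requires <octdigit>+
    pieces = []
    for ch in text:
        if ch not in "01234567":
            raise ValueError("Syntax error")  # not an <octdigit>
        pieces.append(TABLE[int(ch)])
    return "".join(pieces)


print(oct2bin("17"))
```

Each octal digit maps to exactly three binary digits, so the output is always three times as long as the input.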
PIR Actions
NQP isn't the only way to write action methods to accompany a grammar. It's an attractive tool for a number of reasons, but it isn't the only option. Action methods can also be written in PIR or PASM. This is how the NQP compiler itself is implemented. Here is an example of how a PIR action might look:

    .sub 'block' :method
        .param pmc match
        .param string key
        .local pmc past
        $P0 = get_hll_global ['PAST'], 'Stmts'
        past = $P0.'new'('node' => match)
        ...
        match.'result_object'(past)    # make $past;
    .end
Creating Optables
An optable is created by declaring a rule with the is optable trait. Here is an example:

    rule expression is optable { ... }

The ... is not an omission, it's part of the rule. If you don't have the three dots, your rule will not work. This is called the yada yada yada operator, and is used to define a function prototype. PGE will instantiate the function for you, using the options you specify. The keyword is optable tells the parser to convert to bottom-up mode to parse expressions.
Operator Prototypes
In an optable, tokens are specified using the proto keyword instead of the rule or token keywords. A proto definition must define several things: the name and type of the operator, the precedence of the operator, and the PIR code that is used to perform the particular operation. Here are some examples:

    proto 'infix:+' is precedence('1') is pirop('add') { ... }
    proto 'prefix:-' is tighter('infix:+') is pirop('neg') { ... }
    proto 'postfix:++' is equiv('prefix:-') { ... }

These examples show how to parse three common operators: a "+" between two values, a unary minus for a negative number (such as "-123"), and a post-increment operator ("x++"). There are lots of important keywords here, and this isn't even all the keywords you will need to produce a full expression parser.

Operator Types

There are four types of operator: prefix, postfix, infix, and circumfix. Here are some examples:
Type         Example         Use
prefix       prefix:-        -123
postfix      postfix:*       x*
infix        infix:+         x+y
circumfix    circumfix:()    (x)
Operator Precedence
Bottom-up parsers, as we mentioned before, suffer from a particular problem called shift-reduce conflicts. A shift-reduce conflict occurs when a parser cannot determine whether to shift in a new input token, or reduce the tokens it already has. Here is an example from the C programming language:

    -x + 2 * y

We know, because we've been taught about operator precedence since we were children, that the expression is grouped like this:

    (-x) + (2 * y)

However, a computer doesn't know this unless we tell it what precedence the operators follow. For instance, without knowing the order in which operators act, a parser might group the expression on a first-come, first-served basis like this:

    ((-x) + 2) * y

Or maybe it would shift in all tokens and start reducing from the right:

    -(x + (2 * y))

The point is, we need to create rules to tell the operator precedence parser how to parse these conflicts. Here are the precedence rules that we can specify:
Code                      Effect
is precedence('1')        Sets a basic precedence level. The number in the quotes is arbitrary. The important part is that this is a constant, and that other operators can be defined in relation to this.
is equiv('OPERATOR')      Here, the word OPERATOR will be replaced by an existing operator, such as is equiv('infix:+'). This shows that one operator has the same precedence as another operator. All operators with the same precedence are evaluated from left-to-right.
is tighter('OPERATOR')    The operator binds tighter than OPERATOR. In the expression -x + y, we expect the prefix:- to bind tighter than the infix:+.
is looser('OPERATOR')     Similar to is tighter(), but the opposite. Shows that the operator binds less tightly than OPERATOR.
We didn't show this in the example, but it's also possible to determine whether operators act to the left or to the right. We define this with the is assoc keyword. The following table summarizes these rules:
Associativity         Explanation
is assoc('right')
is assoc('left')
is assoc('list')
is assoc('chain')
is assoc('none')
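To see how precedence levels resolve the grouping of -x + 2 * y, here is a small precedence-climbing parser in Python. It is a sketch of the general technique, not of PGE's optable implementation. The numeric levels in INFIX and PREFIX are assumptions that follow the relationships described above (prefix:- tighter than infix:*, which is tighter than infix:+), and all infix operators are treated as left-associative.

```python
import re

INFIX = {"+": 1, "*": 2}    # higher number binds tighter (assumed levels)
PREFIX = {"-": 3}           # prefix:- binds tighter than any infix here


def tokenize(text):
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*()]", text)


def parse(tokens, min_prec=0):
    """Return the expression as a fully parenthesized string."""
    tok = tokens.pop(0)
    if tok in PREFIX:
        left = "(" + tok + parse(tokens, PREFIX[tok]) + ")"
    elif tok == "(":
        left = parse(tokens, 0)
        tokens.pop(0)                       # discard the closing ")"
    else:
        left = tok                          # a term: number or identifier
    # Keep absorbing operators that bind tighter than the caller's level.
    while tokens and tokens[0] in INFIX and INFIX[tokens[0]] > min_prec:
        op = tokens.pop(0)
        right = parse(tokens, INFIX[op])
        left = "(" + left + " " + op + " " + right + ")"
    return left


def group(text):
    return parse(tokenize(text))


print(group("-x + 2 * y"))
```

Swapping the two numbers in INFIX makes the parser group the same input as ((-x) + 2) * y, which is the first-come, first-served reading discussed in the text.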
Operator Actions
There are three ways to specify how an operator behaves. The first is to associate the operator with a PIR opcode. The second is to associate the operator with a PAST node type. The third and final way to specify the actions of an operator is to create a function. The first two methods can be performed with the is pirop() and is pasttype() specifiers, respectively.
is pirop Operators
For simple operations, it doesn't make any sense to create and call a function. This is especially true for simple arithmetic operations which can be performed very quickly using PASM opcodes that Parrot already defines. Use the code is pirop('opname') to implement an operator as a single opcode.
is pasttype Operators
Some operators, such as logical operators, can be best implemented using types of nodes that PAST already provides. Use the is pasttype('TYPE') form to specify the type of PAST node to use.
Operators as Functions
Unless you specify an operator as is pirop or is pasttype, the operator is implemented as a function. Most operators are therefore functions by default, which is very good for languages that intend to allow operator overloading.
Proto Regexes
We will use the term "Proto", "Proto Regex" or "Protoregex" on occasion in this book, and it is used often in the documentation for Perl 6 as well. Protos are like rules except they offer something similar to multiple dispatch: you can have multiple rules of the same name that depend on the operands passed to them. By declaring an operator as a proto, you are giving it a name and you are also allowing future additions to be made to the language. This is an important feature for languages which utilize operator overloading.

Protos are defined with the proto keyword. They can be used in either the top-down or bottom-up parser, although because of their most common use as a support mechanism for operator overloading, they are most often used in the bottom-up operator precedence parser. To define a new proto, use the form:

    proto NAME OPTIONS { ... }

In this declaration, the { ... } is a literal piece of code: you actually must type the three dots (ellipsis) between the curly brackets. This alerts the parser that this is not an implementation of the rule itself, but just a signature or a forward-declaration for other functions. For instance, we can define a proto MyProto:

    proto MyProto { ... }

And in another file we can define functions to implement this:

    .sub MyProto
        # Insert the parameter list and function logic here
    .end

    .sub MyProto    # Another MyProto!
        # Insert a different parameter list and different function logic here
    .end
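The pattern of one declared name with several implementations chosen by operand type can be illustrated in Python with functools.singledispatch. This is an analogy, not Parrot's mechanism: singledispatch selects on the type of the first argument only, whereas the text describes something closer to multiple dispatch. The function name infix_plus and its behaviours are invented for the example.

```python
from functools import singledispatch


@singledispatch
def infix_plus(left, right):
    # Plays the role of the proto: a name and signature with no real body.
    raise NotImplementedError("no implementation for " + type(left).__name__)


@infix_plus.register
def _(left: int, right):
    # One implementation: numeric addition.
    return left + right


@infix_plus.register
def _(left: str, right):
    # Another implementation under the same name: string concatenation.
    return left + str(right)


print(infix_plus(1, 2))
print(infix_plus("a", 2))
```

New implementations can be registered later, in other modules, without touching the original declaration, which is the extensibility the proto mechanism is meant to provide for operator overloading.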
Parsing Terms
Terms are the values between operators. In the following expression:

    1 + 2 * 3

The numbers 1, 2, and 3 are all terms. Terms are parsed using the ordinary rules, and are loaded into the operator precedence table using the following syntax:

    proto 'term:' is precedence('1') is parsed(&term) { ... }

Of course, you can set the precedence however you see fit, but it's typically best to make terms parse as tightly as possible. The is parsed() attribute specifies that terms are to be parsed using a rule named "term". You can define this term using ordinary rule syntax. The & symbol is the sigil that NQP uses to refer to a subroutine. In this case, the subroutine in question is the grammar method that is intended to parse terms (but it can conceivably be anything else). We will learn more about NQP and sigils in the following chapter. Terms are the way the parser switches back from bottom-up to top-down parsing.
Advanced PGE
We've already looked at some of the basics of parser construction using PGE and NQP. In this chapter we are going to give a more in-depth look at some of the features of the grammar engine that we haven't seen yet. Some of these more advanced features, such as inline PIR code, assertions, function calls and built-in token types, will make the life of a compiler designer much easier, but are not needed for most basic tasks.
Calling Functions
Functions or subroutines are an integral part of modern programming practices. As such, support for them is part of the PAST system, and is relatively easy to implement. We're going to cover a little bit of necessary background information first, and then we will discuss how to put all the pieces together to create a system with usable subroutines.
return Described
In Parrot, control flow operations, especially return operations from subroutines, are implemented as special control exceptions. The reason why this is done as an exception and not as a basic .return() PIR statement is a little bit complicated. Many languages allow for nested lexical scopes, where variables defined in an "inner" scope cannot be seen, accessed, or modified by statements in the "outer" scope. In most compilers, this behavior is enforced by the compiler directly, and is invisible when the code is converted to assembly and machine languages. However, PIR is like an assembly language for the Parrot system, and it's not possible to hide things at that level. All local variables are local to the entire subroutine and cannot be localized to a single part of a subroutine. To implement nested scopes, Parrot instead uses nested subroutine
Assertions
MetaSyntactic Assertions
You can call a function from within a rule using the <FUNC( )> format.
Non-Capturing Assertions
Use the <. > form to create a match object that does not capture its contents.
Indirect Rules
A rule of the form <$ >, which can be a string or some other data, is converted into a regular expression and then run.
Character Classes
Rules of the form <[ ]> contain custom character classes. Rules with <-[ ]> are complemented character classes.
Built-in Assertions
<?before>, <!before>
<?after>, <!after>
<?same>, <!same>
<.ws>
<?at()>, <!at()>
Partial Matches
You can specify a partial match, a match which attempts to match as much as possible and never fails, with the <* > form.
Recursive Calls
You can recurse back into subrules of the current match rule using the <~~ > rule.
Resources
Synopsis 5 (Rules) [1]
Synopsis 6 (Subroutines) [2]
References
[1] http://dev.perl.org/perl6/doc/design/syn/S05.html
[2] http://dev.perl.org/perl6/doc/design/syn/S06.html
Building A Compiler
HLL Compilers
Now that we know NQP and PGE, we have the tools that we need to start writing a compiler for our favorite language. Let's review the steps to creating a compiler:
1. Write Grammar using PGE
2. Write Actions using NQP
3. Actions create PAST tree
4. PCT converts PAST into PIR
5. Parrot converts PIR into bytecode.
After step 5, if we have done everything correctly, we should have a working compiler. Of course, there is a lot of work to do before we reach step 5. Lucky for us, however, steps 4 and 5 are automated by the build process, so we only need to worry about 1, 2, and 3. In the Tutorials section we are going to have several language-building tutorials that you can follow.
Examples
We are going to show some very small examples here to demonstrate the basic compiler construction method. More advanced tutorials are provided in the Tutorials section.
We now need to create two actions, one for TOP and the other for expression. The TOP method should take the value of the parsed expression and pass that to the say function. The expression method should generate the necessary PIR operations, and insert the term values into these operations. Here is a basic actions file to do this:
    method TOP($/) {
        my $past := PAST::Op.new(:pasttype('inline'));
        my $expr := $( $<expression> );
        $past.inline('say(%0)');
        $past.unshift($expr);
        make $past;
    }

    method expression($/) {
        my $left  := $( $<term>[0] );
        my $right := $( $<term>[1] );
        my $op    := $( $<operation> );
        my $pirop := "sub_n";
        # Assign to the existing variable; a second "my" here would declare a
        # new $pirop that is only visible inside the block.
        if $op eq "+" { $pirop := "add_n"; }
        my $past := PAST::Op.new(:pasttype('pirop'), :pirop($pirop));
        # push keeps the operands in source order: left first, then right.
        $past.push($left);
        $past.push($right);
        make $past;
    }
HLL Interoperation
HLL Interoperability
Parrot was designed not just for Perl6, even though that was one of the bigger driving forces initially. Parrot is being designed and implemented to be a virtual machine for all dynamic programming languages (and even a few statically-typed ones too). The ultimate goal is to be able to combine tools and libraries written in various languages together, and allow developers to write different parts of a project in the language that makes the most sense to that project. Parrot makes interoperability easy, but it's ultimately the responsibility of the language designers to make sure that their languages play nicely with others.
Sharing Data
Vtables: Standard Interfaces
All PMC objects implement the standard vtable functions, and these are to be the primary interface for dealing with foreign or exotic data types. If you receive a Ratio object from Common Lisp, you might not know what its methods are, or what its internal storage structure is, when you use it from Perl 6. However, by calling the standard vtable methods, you can interact with it in an easy way.
Sharing Functions
Multi Methods
Like many modern programming languages, Parrot allows function name overloading. Called multiple method dispatch (MMD), Parrot's function call system is very powerful and flexible. In the MMD system, multiple functions in a single namespace can have the same name, so long as they have different call signatures. A call signature specifies the number and type of parameters and return values that the function expects. If we call a function: (a, b, c, d) = Foo(x, y, z) Parrot only calls a function Foo if it has three inputs and four outputs. Multiple dispatch isn't just used for functions, it's also used for opcodes. A program (or, more likely an HLL) can add new opcodes with the same name as an existing opcode, so long as it uses a different function signature. This allows a very powerful and flexible way to customize your system. We will talk more about MMD in a later chapter.
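The lookup described above can be pictured as a search over candidates that share a name but differ in signature. The following is a minimal sketch in C, not Parrot's actual MMD implementation: the Candidate structure, the one-letter type codes and the mmd_lookup function are all hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate: one implementation of a multi-sub, tagged with
 * the parameter-type signature it accepts ("I" = integer, "N" = number,
 * "S" = string). This is not Parrot's real MMD data structure. */
typedef struct Candidate {
    const char *name;       /* shared function name, e.g. "Foo"        */
    const char *signature;  /* parameter types, e.g. "IN"              */
    int         id;         /* stands in for the function to be called */
} Candidate;

static const Candidate candidates[] = {
    { "Foo", "I",  1 },
    { "Foo", "IN", 2 },
    { "Foo", "S",  3 },
};

/* Return the id of the candidate whose name and signature both match,
 * or -1 when no candidate accepts this call. */
int mmd_lookup(const char *name, const char *signature)
{
    size_t i;
    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (strcmp(candidates[i].name, name) == 0 &&
            strcmp(candidates[i].signature, signature) == 0)
            return candidates[i].id;
    }
    return -1;
}
```

Calling mmd_lookup("Foo", "IN") selects the second candidate, while a call with a signature that no candidate declares is rejected instead of being routed to the wrong function. A real dispatcher would also have to rank inexact matches, which this sketch leaves out.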
Translation Interfaces
The idea of including some sort of translation interface, functions that would automatically convert calls from one HLL to appropriate calls to another, has been attractive to some developers. However, there is a large associated development cost, because for n languages we would need a translator for every ordered pair of languages, roughly n × (n − 1) interfaces in total. Because of this additional complexity cost, it is not recommended to handle data or function sharing using interfaces.
Documentation
The most important thing a language developer can do is document their interfaces. It is important to document what data types a function expects and what data types it returns. This way, a person using these libraries from a different language knows how to interact with these functions and objects. Plus, documentation may expose some functions or objects as being overly difficult to reuse. This will help deter developers from using a complicated interface. It will also help to show the library designers ways to simplify their library.
Parrot Hacking
Parrot Internals
Parrot Development Process
The Parrot development project is a large and complex project with multiple facets. Here is an overview of some key points about the Parrot build process. Some of the points here have not been discussed before, but we will cover them in this or later chapters:
- The build environment is configured using the Configure.pl program. This program is written, like many of the build tools for Parrot, in Perl 5. Configure.pl determines options on your system including which compiler you are using, which Make program (if any) you are using, what platform-specific libraries are required, etc.
- PMCs are written in a C-like script which is compiled into C code using the PMC Compiler. The PMC Compiler will produce C code and associated header files for all PMCs, and will register the PMCs into the Parrot PMC table.
- Opcodes are written in a C-like script which is compiled into C, just like PMCs are. The syntax of Opcode files is similar in some respects to that used for PMCs, but is different in many ways too. Ops files are converted into C before being compiled into machine code.
- Native Call Interface (NCI) function signatures must be converted into C functions prior to compilation using the NCI compiler.
- Just-In-Time operations must be converted into C code for compilation into native code.
- The parsers for PASM and PIR are written in Lex/Bison. These need to be compiled into C files for compilation.
- The constant string converter converts CONST_STRING declarations into string constants at compile time. This saves a lot of time at execution.
- The Makefile automates the build process by compiling all the PMCs, compiling all the C files, building the executables and libraries, etc.
In this chapter we are going to give an overview of some of the components of the Parrot Virtual Machine. Later chapters will discuss the various Parrot subsystems, including many of the processes that we've described above.
The chapters in this section are all going to discuss Parrot hacking and development. If you aren't interested in helping with Parrot development, you can skip these chapters.
Parrot Repository
Here is the general structure of the Parrot Repository, as far as source code is concerned:
Interpreter
While the bytecode compiler takes input symbols from PIRC or IMCC and converts them into bytecode for storage and later execution, the interpreter uses these symbols to execute the program directly. This means that there is no intermediate step of compilation, and a script can be executed quickly without having to be compiled.
Subsystems
I/O Subsystem
The I/O subsystem controls reading and writing operations to the console, to files, and to the operating system. Much of this functionality is being performed in special PMCs.
APIs
Embedding API
Parrot is not just an executable program, it's also a linkable library called libparrot. libparrot can be linked to other programs, and a Parrot interpreter object can be called from inside that program. An entire embedding API has been created to allow libparrot to communicate with other programs.
Extensions API
Parrot can be extended by using dynamic libraries, such as Linux .so files or Windows .dll files. These extensions must interact with Parrot in a safe and controlled way. For this, the Extensions API was written to give extensions a communications channel into the heart of Parrot.
Parrot Hacking
The next several chapters are going to look at the individual components of Parrot. We will discuss the software architectures and operations of each component. As we have already seen, Parrot itself is written using the C programming language, although individual components (such as the opcodes, PMCs, and other features) are written in special domain-specific languages and later translated into C code. Some higher-level functionality, such as PCT, is written in PASM and PIR too. Parsers for PIR are written using a combination of Lex and Yacc. Programming for Parrot is typically going to require a good knowledge of the C programming language, but also a good understanding of Perl 5. This is because Perl 5 is used to write all the development tools which control the build process for Parrot.
Resources
http://www.parrotcode.org/docs/pdd/pdd01_overview.html
IMCC
IMCC, the Intermediate Code Compiler, is the current front-end for Parrot that reads PASM and PIR code. It's a relatively old subsystem, and has proven itself difficult to extend or maintain. IMCC will likely be the "official" parser for PIR and PASM until the Parrot 1.0 release. However, the plan for the future is to move to a more extensible and more manageable parser, such as PIRC. IMCC is written using a combination of a Lex tokenizer and a Yacc parser. It includes a number of components, such as register allocators and code optimizers.
PIRC
PIRC is, in theory anyway, a better alternative to parsing PIR and PASM than IMCC is. However, PIRC is not currently complete, and does not offer the feature set of IMCC yet. There are two different implementations of PIRC: one is a hand-written recursive descent parser, and the other is based on a multi-stage Lex/Yacc parser. The old version, the hand-written recursive descent version, is obsolete and is no longer maintained. We will only talk about the new implementation of PIRC here. PIRC is divided into three different parsers:
1. A parser for "Heredoc" constructs, strings which are embedded directly into the file.
2. A macro parser and text replacer. This is a preprocessor for handling macros and constants.
3. A parser for the rest of PIR and PASM.
Breaking things up like this keeps each parser simpler, and reduces the number of states and conditionals that a single large parser would require.
Run Core
We've discussed run cores earlier, but in this chapter we are going to get into a much deeper discussion of them. Here, we are going to talk about opcodes, and the special opcode compiler that converts them into standard C code. We will also look at how these opcodes are translated by the opcode compiler into different forms, and we will see the different runcores that perform these opcodes.
Opcodes
Opcodes are written using a very special syntax which is a mix of C and special keywords. Opcodes are converted by the opcode compiler, tools/dev/ops2c.pl into the formats necessary for the different run cores. The core opcodes for Parrot are all defined in src/ops/, in files with a *.ops extension. Opcodes are divided into different files, depending on their purpose:
Ops file           Purpose
bit.ops            bitwise logical operations
cmp.ops            comparison operations
core.ops           Basic Parrot operations, private internal operations, control flow, concurrency, events and exceptions.
debug.ops          ops for debugging Parrot and HLL programs.
experimental.ops   ops which are being tested, and which might not be stable. Do not rely on these ops.
io.ops             ops to handle input and output to files and the terminal.
math.ops           mathematical operations
object.ops         ops to deal with object-oriented details
obscure.ops        ops for obscure and specialized trigonometric functions
pic.ops            private opcodes for the polymorphic inline cache. Do not use these.
pmc.ops            Opcodes for dealing with PMCs, creating PMCs. Common operations for dealing with array-like PMCs (push, pop, shift, unshift) and hash-like PMCs
set.ops            ops to set and load registers
stm.ops            Ops for software transactional memory, the inter-thread communication system for Parrot. In practice, these ops are not used, use the STMRef and STMVar PMCs instead.
(name not given)   Ops for working with strings
(name not given)   Operations to interact with the underlying system
(name not given)   ops to deal with lexical and global variables
Writing Opcodes
Ops are defined with the op keyword, and work similarly to C source code. Here is an example:
    op my_op () {
    }
Alternatively, we can use the inline keyword as well:
    inline op my_op () {
    }
We define the input and output parameters using the keywords in and out, followed by the type of input. If an input parameter is used but not altered, you can define it as inconst. The types can be PMC, STR (strings), NUM (floating-point values) or INT (integers). Here is an example function prototype:
    op my_op(out NUM, in STR, in PMC, in INT) {
    }
That function takes a string, a PMC, and an int, and returns a num. Notice how the parameters do not have names. Instead, they correspond to numbers:
    op my_op(out NUM, in STR, in PMC, in INT)
             ^        ^       ^       ^
             |        |       |       |
             $1       $2      $3      $4
Here's another example, an operation that takes three integer inputs, adds them together, and returns an integer sum:
    op sum(out INT, in INT, in INT, in INT) {
        $1 = $2 + $3 + $4;
    }
Nums are converted into ordinary floating point values, so they can be passed directly to functions that require floats or doubles. Likewise, INTs are just basic integer values, and can be treated as such. PMCs and STRINGs, however, are complex values. You can't pass a Parrot STRING to a library function that requires a null-terminated C string. The following is bad:
    #include <string.h>
    op my_str_length(out INT, in STR) {
        $1 = strlen($2); // WRONG!
    }
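To see why the strlen call above is wrong, it helps to picture a string type that stores its length separately and does not end in a null byte. The sketch below is ordinary C rather than ops syntax, and it uses a hypothetical MyString structure and my_string_to_cstring helper as stand-ins; Parrot's real STRING structure and its conversion routines are different and are not shown here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a Parrot STRING: a length plus a buffer that
 * is NOT guaranteed to end in '\0'. The real STRING structure differs. */
typedef struct MyString {
    size_t      length;
    const char *bytes;
} MyString;

/* Copy the string into a freshly allocated, null-terminated buffer that C
 * library functions can use. The caller must free() the result.
 * Returns NULL if allocation fails. */
char *my_string_to_cstring(const MyString *s)
{
    char *copy = malloc(s->length + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s->bytes, s->length);
    copy[s->length] = '\0';
    return copy;
}
```

The general pattern is the point: convert the managed string to a C string first, hand the copy to the C library, and release the copy afterwards.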
Advanced Parameters
When we talked about the types of parameters above, we weren't entirely complete. Here is a list of direction qualifiers that you can have in your op:
direction   meaning                                               example
in          The parameter is an input                             op my_op(in INT)
out         The parameter is an output                            op pi(out NUM) { $1 = 3.14; }
inout                                                             op increment(inout INT) { $1 = $1 + 1; }
inconst     The input parameter is constant, it is not modified   op double_const(out INT, inconst INT) { $1 = $2 + $2; }
invar                                                             op my_op(invar PMC)
And, in PIR, the inconst example is called like this:
    $I0 = double_const 5 # numeric literal "5" is a constant
Notice the "_i_i" and "_n_i" suffixes at the end of the function names? This is how Parrot ensures that function names are unique in the system to prevent compiler problems. This is also an easy way to look at a function signature and see what kinds of operands it takes.
Control Flow
An opcode can determine where control flow moves to after it has completed executing. For most opcodes, the default behavior is to move to the next instruction in memory. However, there are many sorts of ways to alter control flow, some of which are very new and exotic. There are several keywords that can be used to obtain an address of an operation. We can then goto that instruction directly, or we can store that address and jump to it later.
Keyword      Meaning
NEXT()       Jump to the next opcode in memory
ADDRESS(a)   Jump to the opcode given by a. a is of type opcode_t*.
OFFSET(a)    Jump to the opcode given by offset a from the current offset. a is typically type in LABEL.
POP()        Get the address given at the top of the control stack. This feature is being deprecated and eventually Parrot will be stackless internally.
Run Cores
Runcores are the things that decode and execute the stream of opcodes in a PBC file. In the simplest case, a runcore is a loop that takes each bytecode value, gathers the parameter data from the PBC stream, and passes control to the opcode routine for execution. There are several different runcores. Some are very practical and simple, some use special tricks and compiler features to optimize for speed. Some runcores perform useful ancillary tasks such as debugging and profiling. Some runcores serve no useful purpose except to satisfy some basic academic interest.
Basic Cores
Slow Core
In the slow core, each opcode is compiled into a separate function. Each opcode function takes two arguments: a pointer to the current opcode, and the Parrot interpreter structure. All arguments to the opcodes are parsed and stored in the interpreter structure for retrieval. This core is, as its name implies, very slow. However, it's conceptually very simple and it's very stable. For this reason, the slow core is used as the base for some of the specialty cores we'll discuss later.
Fast Core
The fast core is exactly like the slow core, except it doesn't do the bounds checking and explicit context updating that the slow core does.
Switched Core
The switch core uses a gigantic C switch { } statement to handle opcode dispatching, instead of using individual functions. The benefit is that functions do not need to be called for each opcode, which saves on the number of machine code instructions necessary to call an opcode.
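The difference between the function-based cores and the switched core can be sketched with a toy bytecode. Everything in the following C example is hypothetical (the opcode numbers, the Interp structure, the function names), and it simplifies the description above in one respect: operands are read straight from the bytecode stream instead of being stored in the interpreter structure first. Like the fast core, it does no bounds checking.

```c
#include <stddef.h>

typedef long opcode_t;                 /* hypothetical bytecode word */

enum { OP_END, OP_LOAD, OP_ADD, OP_SKIP };

typedef struct Interp { long acc; } Interp;   /* toy interpreter state */

/* Function-based core: each op is a C function taking the current opcode
 * pointer and the interpreter, and returning the next opcode to run. */
typedef opcode_t *(*op_func)(opcode_t *pc, Interp *interp);

static opcode_t *op_end (opcode_t *pc, Interp *in) { (void)pc; (void)in; return NULL; }
static opcode_t *op_load(opcode_t *pc, Interp *in) { in->acc  = pc[1]; return pc + 2; }
static opcode_t *op_add (opcode_t *pc, Interp *in) { in->acc += pc[1]; return pc + 2; }
/* In the spirit of OFFSET(a): jump relative to the current opcode. */
static opcode_t *op_skip(opcode_t *pc, Interp *in) { (void)in; return pc + pc[1]; }

static const op_func op_table[] = { op_end, op_load, op_add, op_skip };

long run_function_core(opcode_t *pc)
{
    Interp in = { 0 };
    while (pc != NULL)
        pc = op_table[*pc](pc, &in);   /* one function call per opcode */
    return in.acc;
}

/* Switch-based core: same semantics, but dispatch is one switch statement
 * and no function call is made per opcode. */
long run_switch_core(opcode_t *pc)
{
    Interp in = { 0 };
    for (;;) {
        switch (*pc) {
        case OP_LOAD: in.acc  = pc[1]; pc += 2;     break;
        case OP_ADD:  in.acc += pc[1]; pc += 2;     break;
        case OP_SKIP:                  pc += pc[1]; break;
        case OP_END:
        default:      return in.acc;
        }
    }
}
```

Both functions give the same answer for the same program. The only thing that changes is how control gets from one opcode to the next, which is exactly the part that a runcore is responsible for.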
Advanced Cores
The two cores that we're going to discuss next rely on a specialty feature of some compilers called computed goto. In normal ANSI C, labels are control flow statements and are not treated like first-class data items. However, compilers that support computed goto allow labels to be treated like pointers, stored in variables, and jumped to indirectly.
    void * my_label = &&THE_LABEL;
    goto *my_label;
The computed goto cores compile all the opcodes into a single large function, and each opcode corresponds to a label in the function. These labels are all stored in a large array:
    void *opcode_labels[] = { &&opcode1, &&opcode2, &&opcode3, ... };
Each opcode value can then be taken as an offset to this array as follows:
    goto *opcode_labels[current_opcode];
Computed Goto Core
The computed goto core uses the mechanism described above to dispatch the various opcodes. After each opcode is executed, the next opcode in the incoming bytecode stream is looked up in the table and dispatched from there.
Predereferenced Computed Goto Core
In the predereferenced computed goto core, the bytecode stream is preprocessed to convert opcode numbers into the respective labels. This means they don't need to be looked up each time; the opcode can be jumped to directly as if it were a label. Keep in mind that the dispatch mechanism must be used after every opcode, and in large programs there could be millions of opcodes. Even small savings in the number of machine code instructions between opcodes can make big differences in speed.
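Here is a small sketch of the table-lookup variant (the plain computed goto core), using a hypothetical three-opcode bytecode. Labels-as-values is a GNU C extension that GCC and Clang accept, so the sketch falls back to an ordinary switch loop on other compilers. It does not show predereferencing, and the names run_cgoto_core, OP_LOAD and so on are made up for this example.

```c
#include <stddef.h>

enum { OP_END, OP_LOAD, OP_ADD };

/* Run a toy program: LOAD n sets the accumulator, ADD n adds to it,
 * END stops and returns the accumulator. */
long run_cgoto_core(const long *code)
{
    long acc = 0;
    const long *pc = code;
#if defined(__GNUC__)
    /* One label per opcode, indexed by opcode number. */
    static void *labels[] = { &&do_end, &&do_load, &&do_add };

    goto *labels[*pc];                 /* dispatch the first opcode */

do_load: acc  = pc[1]; pc += 2; goto *labels[*pc];
do_add:  acc += pc[1]; pc += 2; goto *labels[*pc];
do_end:  return acc;
#else
    /* Portable fallback: a switch loop with the same behaviour. */
    for (;;) {
        switch (*pc) {
        case OP_LOAD: acc  = pc[1]; pc += 2; break;
        case OP_ADD:  acc += pc[1]; pc += 2; break;
        default:      return acc;
        }
    }
#endif
}
```

Notice that each opcode body ends with its own indirect goto. There is no central loop to return to, which is where the savings between opcodes come from.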
Specialty Cores
GC Debug Core
Debugger Core
Profiling Core
Tracing Core
Garbage Collectors
Parrot has a number of garbage collectors available, which can be selected at compile time using compiler directives. The most mature and robust collector at this time is a simple mark & sweep collector, GC_MS. The different collectors can be activated or deactivated prior to compilation in the file include/Parrot/settings.h. That file contains a number of options that you can set to customize the behavior of Parrot.
Resources
http://www.parrotcode.org/docs/memory_internals.html
http://www.parrotcode.org/docs/pdd/pdd09_gc.html
PMC System
PMCs
We've discussed PMCs already, in the Parrot Virtual Machine/Polymorphic Containers (PMCs) chapter, including how to define new PMC types using the PMC compiler, and how to use them in PIR programs. This chapter is going to go into more detail about how PMCs are actually used in Parrot, including memory management of PMCs, morphing PMCs, and interfacing with PMCs.
PMC Data
Every PMC type also contains a pointer for a data structure that's specific to that PMC. These data structures are defined based upon the inheritance hierarchy of the PMC and the various attributes that it has been defined with. For instance, this PMC definition:
    pmclass MyPmc {
        ATTR INTVAL a;
        ATTR FLOATVAL b;
        ATTR STRING *c;
        ATTR PMC *d;
        ...
    }
will turn into this C data structure definition:
    typedef struct Parrot_MyPmc_attributes {
        INTVAL a;
        FLOATVAL b;
        STRING *c;
        PMC *d;
    } Parrot_MyPmc_attributes;
This structure is supposed to be contained in the ->data pointer, which should always be accessed using the PMC_data macro. This way, if the PMC structure definition changes eventually, all code that uses the macros properly will be automatically updated because the macro will be updated. Here is an example of an initialization VTABLE that uses these attributes:
    VTABLE void init () {
        Parrot_MyPmc_attributes *p = mem_allocate_typed(Parrot_MyPmc_attributes);
        p->a = 0;
        p->b = 0.0;
        p->c = NULL;
        p->d = PMCNULL;
        PMC_data(SELF) = p;
    }
There is another macro which uses the word PARROT and the name of the PMC in all capital letters to retrieve the data structure again and properly cast it (so your compiler doesn't give warnings about using pointers without a cast):
    Parrot_MyPmc_attributes *attr = PARROT_MYPMC(SELF);
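The macro pattern described above can be tried out in plain C without the PMC compiler. The following sketch uses simplified, hypothetical stand-ins: the PMC structure, the MY_PMC_DATA and MY_MYPMC macros and the mypmc_* functions are invented for this example and only mimic the roles of PMC_data and PARROT_MYPMC.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for Parrot's types. */
typedef long   INTVAL;
typedef double FLOATVAL;

typedef struct PMC {
    unsigned  flags;
    void     *data;      /* type-specific attribute structure */
} PMC;

typedef struct MyPmc_attributes {
    INTVAL   a;
    FLOATVAL b;
} MyPmc_attributes;

/* Accessor macros in the spirit of PMC_data and PARROT_MYPMC: all code
 * goes through them, so a layout change only touches the macros. */
#define MY_PMC_DATA(pmc)  ((pmc)->data)
#define MY_MYPMC(pmc)     ((MyPmc_attributes *)MY_PMC_DATA(pmc))

/* Allocate and initialize the attributes.
 * Returns 0 on success, -1 if allocation fails. */
int mypmc_init(PMC *self)
{
    MyPmc_attributes *p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->a = 0;
    p->b = 0.0;
    MY_PMC_DATA(self) = p;
    return 0;
}

/* Release the attribute structure again. */
void mypmc_destroy(PMC *self)
{
    free(MY_PMC_DATA(self));
    MY_PMC_DATA(self) = NULL;
}
```

Code that wants attribute a writes MY_MYPMC(pmc)->a and never touches the data member directly, so the cast and the member name live in one place.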
PObjects
C isn't a class-based (or "object oriented") language, but many lessons from OO programming methodologies have been adapted for use in Parrot's code base. PMCs, STRINGs, and a few other data types are based off the definition of a "PObj", also known as a "Buffer":
    typedef struct Buffer {
        Parrot_UInt flags;
    } Buffer;
Notice how the first two entries in the Buffer are the same as in the PMC? All objects that start off with these two data items are said to be "PObject isomorphic". For short, we say that all PObject-isomorphic structures are simply "PObjects", and there are many types of PObjects. The memory management system, for example, can test the flags of any PObject to determine what type of PObject a memory object is. A PMC may optionally contain a PMC_EXT structure, which adds additional functionality. PMC_EXT allows a PMC to be shared between multiple threads, or multiple Parrot interpreters, without introducing data contention. PMC_EXT also allows a PMC to contain a hash of metadata (attribute-value pairs), which are typically added as attributes in PIR.
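The idea of structures that share a common leading member can be shown with a short C sketch. The flag bits and the MyStr and MyPmc structures below are hypothetical; Parrot's real layouts and flag values are different. The sketch relies only on the C guarantee that a pointer to a structure, suitably converted, points to its first member.

```c
#include <stddef.h>

/* Hypothetical flag bits; not Parrot's real flag values. */
enum { KIND_STRING = 1, KIND_PMC = 2 };

/* Two different structures that both begin with the same flags member. */
typedef struct MyStr { unsigned flags; size_t length; } MyStr;
typedef struct MyPmc { unsigned flags; void  *data;   } MyPmc;

/* Because every such structure starts with the flags word, generic code
 * can inspect it without knowing the concrete type. */
unsigned pobj_flags(const void *obj)
{
    return *(const unsigned *)obj;
}

int pobj_is_pmc(const void *obj)
{
    return (pobj_flags(obj) & KIND_PMC) != 0;
}
```

A memory manager written this way can walk a pool of mixed objects and decide how to treat each one by looking at the flags alone.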
PMC Management
PMCs are allocated from two special pools, a PMC pool and a constant PMC pool. Constant PMCs are considered to be immutable and everlasting, and so are never modified nor collected by the garbage collector. STRINGs are allocated in either a string pool or a constant string pool. The same relationship applies: constant strings are never modified and never collected. PMC_EXT structures are not currently managed by the memory management subsystem. However, since PMC_EXTs are assigned to PMCs in a one-to-one relationship, we always know we can free a PMC_EXT when its PMC is freed. In terms of garbage collection, PMCs are one of the only aggregate data types in Parrot. STRINGs do not contain pointers to other data items of interest to the garbage collector. Stack chunks, which are used internally in some structures, are PObjs and are also aggregates, but are marked separately by the collector and are not treated as aggregates directly.
VTables
VTables represent a standard interface to PMCs of all types. For every single PMC, there are a series of standard operations that you can perform (or attempt). Not all PMCs support all VTable operations.
VTable Types
VTables are complicated data items that contain, in addition to a large series of function pointers, a number of data items to support PMCs. One of the data items is a class PMC, a PMC that represents a particular PMC class. Another item is an enum that differentiates between all PMC classes.
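As a rough picture of this layout, the following C sketch pairs a type identifier with a couple of function-pointer slots. It is a hypothetical miniature: the VTable, Obj, make_integer, obj_get_integer and obj_elements names are invented for this example, and the real Parrot VTABLE has far more entries and handles unsupported operations differently.

```c
#include <stddef.h>

struct Obj;

/* Hypothetical vtable: a type id plus function pointers for the standard
 * operations. A NULL slot means the type does not support the operation. */
typedef struct VTable {
    int    type_id;
    long (*get_integer)(const struct Obj *self);
    long (*elements)(const struct Obj *self);
} VTable;

typedef struct Obj {
    const VTable *vtable;
    long          value;
} Obj;

static long int_get_integer(const Obj *self) { return self->value; }

/* An integer-like type: supports get_integer but not elements. */
static const VTable integer_vtable = { 1, int_get_integer, NULL };

Obj make_integer(long v)
{
    Obj o;
    o.vtable = &integer_vtable;
    o.value  = v;
    return o;
}

/* Call a slot if the type provides it; return -1 if it does not. */
int obj_get_integer(const Obj *obj, long *out)
{
    if (obj->vtable->get_integer == NULL)
        return -1;
    *out = obj->vtable->get_integer(obj);
    return 0;
}

int obj_elements(const Obj *obj, long *out)
{
    if (obj->vtable->elements == NULL)
        return -1;
    *out = obj->vtable->elements(obj);
    return 0;
}
```

Callers never need to know the concrete type. They go through the table, and an operation the type does not support is reported as a failure instead of being called.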
Resources
http://www.parrotcode.org/docs/vtables.html
String System
Strings Overview
Exception Subsystem
Exception handling has become a staple in most modern programming languages. Parrot, since it intends to host many such languages, must support a robust exception system. Not only does Parrot use exceptions for error handling and recovery, but it also encourages the use of control flow exceptions to implement the high-level control flow features of those languages. This means that the exception subsystem is one of the most important for language implementers to become familiar with.
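To get a feel for what a control flow exception is, here is a loose analogy in plain C using setjmp and longjmp. This is not how Parrot implements exceptions; the Handler structure and both functions are hypothetical, and the sketch only shows the general idea of leaving an inner scope by jumping to a handler that was registered earlier.

```c
#include <setjmp.h>

/* Hypothetical handler record: where to resume, plus the payload carried
 * by the control exception. The payload is volatile because it is
 * changed between the setjmp call and the longjmp call. */
typedef struct Handler {
    jmp_buf      resume;
    volatile int payload;
} Handler;

/* An "inner scope" that can end the enclosing function early by throwing
 * a control exception instead of returning normally. */
static void inner_scope(Handler *h, int x)
{
    if (x < 0) {
        h->payload = -1;
        longjmp(h->resume, 1);     /* unwind straight to the handler */
    }
    h->payload = x * 2;
}

int outer_function(int x)
{
    Handler h;
    if (setjmp(h.resume) != 0)
        return h.payload;          /* reached via the control exception */
    inner_scope(&h, x);
    return h.payload + 1;          /* normal fall-through path */
}
```

With a negative argument the inner scope "throws" and outer_function returns the payload directly. Otherwise execution continues after the call as usual.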
IO Subsystem
I/O Subsystem
Resources
http://www.parrotcode.org/docs/pdd/pdd22_io.html
NCI
Parrot Embedding
Embedding Parrot
Because the Parrot Virtual Machine is modular, we can link the libparrot library into other executables, creating programs which contain a Parrot interpreter object. One simple use for such a technology would be to create an executable file which contains data in the form of precompiled bytecode and a simple instantiation of the Parrot interpreter to create a standalone executable for a particular programming language. This is called Native Execution, and we will discuss it in more detail below.
Resources
http://www.parrotcode.org/docs/pdd/pdd10_embedding.html
http://www.parrotcode.org/docs/native_exec.html
Extensions
Extending Parrot
Parrot is not just an executable VM, it's a dynamically-linkable library that exports a Parrot API. The API allows add-ons to be developed to extend the functionality of Parrot. There are two basic APIs or, more specifically, a single API that can be divided into two distinct categories: Those that have access to the internals of Parrot, and those that do not. In general, most extensions will not need deep internal access to Parrot's structures, and most should not rely on them. Parrot's internal structures are subject to change, and relying on a precise format of one for your extension could cause compatibility problems later on.
Packfiles
Parrot bytecode files are called "packfiles" internally. Access routines to get information from and feed information to a packfile are stored in /src/packfile.c. Some things that can affect the way a file is stored are:
1. Endianness. Some computers are called "little endian" and some computers are called "big endian". This has to do with the order in which the bytes of a multi-byte value are stored in memory. Instead of picking one or the other to be the default, Parrot writes packfiles in the byte order of the computer that creates them.
2. Value sizes. Things like pointers and INTVALs are going to be different sizes on different computers. Parrot must translate between 16 bit, 32 bit, and 64 bit values for these and other things. Also, FLOATVALs may be 32 bit, 64 bit, or 128 bit, and need to be translated.
Serialization
HLL code is most often used on the computer where it is first compiled. To that end, Parrot is optimized to write packfiles using local settings. If a packfile is read which has been created on some other computer, Parrot must translate it internally so that it can be run on your computer. This translation process adds extra execution overhead, but it only needs to be run once on your computer to get things into the proper local format.
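The endianness part of this translation boils down to reversing the bytes of each value when the file and the host disagree. Here is a minimal sketch in C; the function names are hypothetical, and the file_is_little_endian argument stands in for whatever the packfile header records, which is not shown here.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return  (v >> 24)
         | ((v >>  8) & 0x0000FF00u)
         | ((v <<  8) & 0x00FF0000u)
         |  (v << 24);
}

/* Detect the byte order of the running machine: 1 for little endian,
 * 0 for big endian. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Hypothetical reader: file_is_little_endian (0 or 1) would come from
 * the file header. A value is swapped only when file and host disagree. */
uint32_t read_word(uint32_t raw, int file_is_little_endian)
{
    if (file_is_little_endian == host_is_little_endian())
        return raw;
    return swap32(raw);
}
```

When the file was written on the same kind of machine, read_word returns the value untouched, which is the common case that the local-settings optimization is aimed at.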
Appendices
PIR Reference
Resources
http://www.parrotcode.org/docs/pdd/pdd19_pir.html
PASM Reference
Resources
http://docs.parrot.org/parrot/latest/html/docs/pdds/draft/pdd06_pasm.pod.html
http://docs.parrot.org/parrot/latest/html/docs/pdds/draft/pdd05_opfunc.pod.html
http://docs.parrot.org/parrot/latest/html/ops.html
Languages on Parrot
There are a number of programming languages being implemented on Parrot, some of which are nearing functional completion, some which are still in active development, and some which have been started but are now abandoned. Interested developers may want to help join in the development effort with some of these languages, adopt an abandoned language project, or start a new language project entirely. As of the 1.0.0 release, all language implementations except toy and example languages will be developed and maintained outside of the central Parrot repository. Where available, locations to external project pages will be provided.
Language Projects
Rakudo (Perl 6)
Rakudo is the name of the Perl 6 implementation on Parrot. This is not the only implementation of Perl 6, however. Rakudo development is test-driven. There is a gigantic suite of tests for the Perl 6 language that has been developed over the years. The progress of the Rakudo interpreter is measured by the number of specification tests, or "spectests", that pass. There is not a straightforward way to measure the percentage progress of the project, because the total number of tests is changing regularly as well. Rakudo is under active development by several volunteers. Some developers have even received funding to work on Rakudo more regularly.
abc
A basic calculator language.
C99
The implementation of the C programming language, following the C99 specification, has a number of purposes. C is a statically-typed language, so it isn't necessarily the best candidate for implementation on the dynamically-typed Parrot. However, there are a number of benefits to the Parrot project to having a C parser available. The C99 language parser is being used, at least in part, to help automate the process of generating NCI function signatures for new libraries and extensions. This is under active development by volunteers, some of whom have been funded.
Cardinal (Ruby)
An implementation of Ruby
Jako
A language derived from C and Perl
Pipp (PHP)
Pipp is a recursive acronym for "Pipp is Parrot's PHP". This language implementation was previously named "Plumhead", shorthand for the name "Plum-headed Parakeet". Pipp is being maintained on GitHub at pipp [1]. The project seems halted: the website is down or missing, and the last commit was on 2009-07-22.
Partcl (TCL)
ParTCL is the TCL compiler for Parrot. The ParTCL project lives at http://code.google.com/p/partcl/
Translator Projects
Projects to translate to or from Parrot Bytecode.
Resources
http://www.parrot.org/languages/
References
[1] http://wiki.github.com/bschmalhofer/pipp
HLLCompiler Class
The HLLCompiler class is used to help instantiate and operate a compiler for a high-level language written for Parrot. HLLCompiler coordinates the execution of the HLL grammar, and controls the conversion of the HLL from PAST to POST, from POST to PIR, and ultimately to Parrot bytecode. This page is going to serve as a brief reference to the HLLCompiler class, and a description of how the class is used to create a compiler for an HLL.
Built-In PMCs
Parrot ships with a number of PMC data types built in. This means that these standard types are always available. This page is going to serve as a reference to these PMC types. We will not attempt to cover all the PMC types that are added specifically for other HLLs, libraries, or programs. (For information on using these PMC types, and on defining new PMC types, see the Parrot Virtual Machine/Polymorphic Containers (PMCs) chapter.) The entries in this list should (A) contain a link to the relevant PMC documentation, and (B) provide a brief overview of the PMC and its methods.
AddrRegistry
http://docs.parrot.org/parrot/latest/html/src/pmc/addrregistry.pmc.html
Array
http://www.parrotcode.org/docs/pmc/pmc/array.html A simple array class, serves as the base class for other array PMCs. This type of PMC is rarely used directly. Instead, more versatile array PMC types, such as ResizablePMCArray are used. Array specifies an interface that all other Array classes must share. It also provides a number of defaults that other array-like PMCs may default to.
BigInt
http://docs.parrot.org/parrot/latest/html/src/pmc/bignum.pmc.html A PMC type for storing an arbitrarily large number, or a number with arbitrary precision. Not currently implemented.
Boolean
http://docs.parrot.org/parrot/latest/html/src/pmc/boolean.pmc.html A boolean True/False PMC.
Bound_NCI
http://www.parrotcode.org/docs/pmc/pmc/bound_nci.html
Capture
http://docs.parrot.org/parrot/latest/html/src/pmc/capture.pmc.html
Closure
http://www.parrotcode.org/docs/pmc/pmc/closure.html
Compiler
http://www.parrotcode.org/docs/pmc/pmc/compiler.html A Compiler PMC for a particular language. Can be used to convert an HLL into PIR and eventually into Parrot Bytecode.
Complex
http://www.parrotcode.org/docs/pmc/pmc/complex.html A PMC for Complex numbers.
Continuation
http://www.parrotcode.org/docs/pmc/pmc/continuation.html A Continuation PMC allows Parrot to take a snapshot of the current state of the system to return to later.
Coroutine
http://www.parrotcode.org/docs/pmc/pmc/coroutine.html A sub-like PMC that implements a coroutine.
Default
http://www.parrotcode.org/docs/pmc/pmc/default.html
Deleg_PMC
http://www.parrotcode.org/docs/pmc/pmc/deleg_pmc.html
Delegate
http://www.parrotcode.org/docs/pmc/pmc/delegate.html
Enumerate
http://www.parrotcode.org/docs/pmc/pmc/enumerate.html
Env
http://www.parrotcode.org/docs/pmc/pmc/env.html Allows access to the system's environment variables, as a hash.
Eval
http://www.parrotcode.org/docs/pmc/pmc/eval.html
Exception
http://www.parrotcode.org/docs/pmc/pmc/exception.html An Exception PMC holds information about system errors for recovery.
Exception_Handler
http://www.parrotcode.org/docs/pmc/pmc/exception_handler.html A sub-like routine that catches and resolves exceptions
Exporter
http://www.parrotcode.org/docs/pmc/pmc/exporter.html
File
http://docs.parrot.org/parrot/latest/html/src/dynpmc/file.pmc.html A read/write interface for files
FixedBooleanArray
http://www.parrotcode.org/docs/pmc/pmc/fixedbooleanarray.html An array of fixed size of Boolean values.
FixedFloatArray
http://www.parrotcode.org/docs/pmc/pmc/fixedfloatarray.html An array of fixed size for FLOATVAL floating point numbers
FixedPMCArray
http://www.parrotcode.org/docs/pmc/pmc/fixedpmcarray.html An array of fixed size for PMC values
FixedStringArray
http://www.parrotcode.org/docs/pmc/pmc/fixedstringarray.html An array of fixed size for STRING values
Float
http://www.parrotcode.org/docs/pmc/pmc/float.html A floating point number PMC. Used similarly to a FLOATVAL, except has methods and vtable methods. FLOATVALs become Float PMCs when they are promoted to become a PMC.
Hash
http://www.parrotcode.org/docs/pmc/pmc/hash.html A hash, also known as a "dictionary" or an "associative array". Like an array but indexed with strings instead of integers
Integer
http://www.parrotcode.org/docs/pmc/pmc/integer.html A basic integer number PMC. Used similarly to an INTVAL, but with methods and vtable methods. INTVALS become Integer PMCs when they are promoted to become a PMC.
IntList
http://www.parrotcode.org/docs/pmc/pmc/intlist.html A simple list, or array, of integers.
Iterator
http://www.parrotcode.org/docs/pmc/pmc/iterator.html An Iterator PMC provides a stateful counter that enables you to iterate over the items in one of the array classes, one at a time.
Key
http://www.parrotcode.org/docs/pmc/pmc/key.html A value, typically a string, which is used to look up values in a hash.
LexInfo
http://www.parrotcode.org/docs/pmc/pmc/lexinfo.html
LexPad
http://www.parrotcode.org/docs/pmc/pmc/lexpad.html
ManagedStruct
http://www.parrotcode.org/docs/pmc/pmc/managedstruct.html A low-level structure whose memory is allocated and automatically deallocated by Parrot. Extends UnManagedStruct, but adds automatic memory collection.
MultiArray
http://www.parrotcode.org/docs/pmc/pmc/multiarray.html
MultiSub
http://www.parrotcode.org/docs/pmc/pmc/multisub.html A collection of subroutines with the same name. In Multiple Method Dispatch (MMD) the parameters of the function called determine which subroutine from the collection to call.
Namespace
http://www.parrotcode.org/docs/pmc/pmc/namespace.html Implements a Parrot namespace. Contains information about variables, subroutines, coroutines, and MultiSubs that are stored in that namespace.
NCI
http://www.parrotcode.org/docs/pmc/pmc/nci.html A native call function PMC. Stores interface information to a function which has been written in C.
Null
http://www.parrotcode.org/docs/pmc/pmc/null.html A PMC with a NULL value
Object
http://www.parrotcode.org/docs/pmc/pmc/object.html
OrderedHash
http://www.parrotcode.org/docs/pmc/pmc/orderedhash.html
OS
http://www.parrotcode.org/docs/pmc/pmc/os.html
Pair
http://www.parrotcode.org/docs/pmc/pmc/pair.html An association of a Key PMC with a PMC value. Hashes are typically implemented as an array of Pair PMCs
ParrotClass
http://www.parrotcode.org/docs/pmc/pmc/parrotclass.html
ParrotInterpreter
http://www.parrotcode.org/docs/pmc/pmc/parrotinterpreter.html An interface to the interpreter structure.
ParrotIO
http://www.parrotcode.org/docs/pmc/pmc/parrotio.html A read/write interface to the console
ParrotLibrary
http://www.parrotcode.org/docs/pmc/pmc/parrotlibrary.html A dynamically-loaded library object.
ParrotObject
http://www.parrotcode.org/docs/pmc/pmc/parrotobject.html
ParrotRunningThread
http://www.parrotcode.org/docs/pmc/pmc/parrotrunningthread.html
ParrotThread
http://www.parrotcode.org/docs/pmc/pmc/parrotthread.html A PMC that stores information about a thread
Pmethod_test
http://www.parrotcode.org/docs/pmc/pmc/pmethod_test.html
Pointer
http://www.parrotcode.org/docs/pmc/pmc/pointer.html
Random
http://www.parrotcode.org/docs/pmc/pmc/random.html
Ref
http://www.parrotcode.org/docs/pmc/pmc/ref.html
ResizableBooleanArray
http://www.parrotcode.org/docs/pmc/pmc/resizablebooleanarray.html A resizable array to store Boolean values
ResizableFloatArray
http://www.parrotcode.org/docs/pmc/pmc/resizablefloatarray.html A resizable array to store floating point values.
ResizableIntegerArray
http://www.parrotcode.org/docs/pmc/pmc/resizableintegerarray.html A resizable array to store integer values
ResizablePMCArray
http://www.parrotcode.org/docs/pmc/pmc/resizablepmcarray.html A resizable array to store PMC values
ResizableStringArray
http://www.parrotcode.org/docs/pmc/pmc/resizablestringarray.html A resizable array to store Strings
RetContinuation
http://www.parrotcode.org/docs/pmc/pmc/retcontinuation.html A return continuation. Like a regular Continuation PMC, but can only be used once. Can be promoted to a Continuation using the Clone vtable method.
Role
http://www.parrotcode.org/docs/pmc/pmc/role.html An abstract role, or interface, for a class. Specifies actions and properties of a class, but cannot be instantiated
SArray
http://www.parrotcode.org/docs/pmc/pmc/sarray.html
SharedRef
http://www.parrotcode.org/docs/pmc/pmc/sharedref.html
Slice
http://www.parrotcode.org/docs/pmc/pmc/slice.html
SMOP_Attribute
http://www.parrotcode.org/docs/pmc/pmc/smop_attribute.html
SMOP_Class
http://www.parrotcode.org/docs/pmc/pmc/smop_class.html
STMLog
http://www.parrotcode.org/docs/pmc/pmc/stmlog.html
STMRef
http://www.parrotcode.org/docs/pmc/pmc/stmref.html
STMVar
http://www.parrotcode.org/docs/pmc/pmc/stmvar.html
String
http://www.parrotcode.org/docs/pmc/pmc/string.html A PMC to contain a string value. Like a STRING value, but has methods and vtable methods. STRINGS become String PMCs when they are promoted to PMCs.
Sub
http://www.parrotcode.org/docs/pmc/pmc/sub.html A Parrot subroutine. Implements a basic subroutine (using the sub command in PIR), but also serves as a base class for more intricate sub-like classes
Super
http://www.parrotcode.org/docs/pmc/pmc/super.html A parent PMC class, to support multiple inheritance.
Timer
http://www.parrotcode.org/docs/pmc/pmc/timer.html
TQueue
http://www.parrotcode.org/docs/pmc/pmc/tqueue.html
Undef
http://www.parrotcode.org/docs/pmc/pmc/undef.html An undefined PMC with no usable type.
UnManagedStruct
http://www.parrotcode.org/docs/pmc/pmc/unamangedstruct.html A low-level structure which the programmer must manage manually. Parrot does not automatically collect memory allocated for the struct.
Version
http://www.parrotcode.org/docs/pmc/pmc/version.html
Resources
http://www.parrotcode.org/docs/parrotbyte.html
VTABLE List
Vtable name               Description
absolute                  Returns the absolute value of the PMC, as a PMC
add_attribute             Adds an attribute to the PMC object. Attributes are typically stored in the pmc->pmc_ext->_metadata field.
add_method                Adds a new method to the PMC's class
add_parent
add_role
add_vtable_override
assign_pmc
assign_string_native
bitwise_not
bitwise_nots
can
clone
clone_pmc
decrement                 Decrements the integer value of the PMC by 1
defined                   Determines if the PMC is defined
defined_keyed
defined_keyed_int
defined_keyed_str
delprop                   Deletes a property from the PMC
destroy                   Destroys the PMC
does
does_pmc
elements
exists_keyed
exists_keyed_int
exists_keyed_str
find_method
freeze
get_attr
get_bignum                Gets the BigNum representation of the PMC
get_bool                  Gets the boolean representation of the PMC
get_class
get_integer               Gets the integer representation of the PMC
get_integer_keyed
get_integer_keyed_int
get_integer_keyed_str
get_iter
get_namespace
get_number                Gets the floating-point representation of the PMC
get_number_keyed
get_number_keyed_int
get_number_keyed_str
get_pmc                   Gets the PMC representation of the PMC
get_pmc_keyed
get_pmc_keyed_int
get_pmc_keyed_str
get_pointer
get_pointer_keyed
get_pointer_keyed_int
get_pointer_keyed_str
get_repr
get_string                Gets the string representation of the PMC
get_string_keyed
get_string_keyed_int
get_string_keyed_str
getprop                   Gets the value of a certain property from the PMC
getprops
i_absolute
i_bitwise_not
i_bitwise_nots
i_logical_not
i_neg
increment                 Increments the integer value of the PMC by 1
init                      Initializes the PMC. This method is called when the new keyword is used to create a new PMC.
init_pmc
inspect
inspect_str
instantiate
invoke                    The invoke vtable method is called when the PMC is called like a function. In the following code:

                              .local pmc mypmc = new 'MyPMCType'
                              mypmc()

                          the invoke vtable method is called on the second line, where the PMC is treated as a function call. String PMCs, for example, use the value of the string to look up a function with the same name, and then invoke that function. Subroutine PMCs invoke the given function when they are called.
is_same
isa
isa_pmc
logical_not
mark                      Marks the PMC and all its children as alive for the memory manager. This prevents children of the PMC from being collected prematurely by the garbage collector.
morph
name
neg
new_from_string
nextkey_keyed
nextkey_keyed_int
nextkey_keyed_str
pop_float                 If the PMC is an array, pops a floating point value off the top of it
pop_integer               If the PMC is an array, pops an integer value off the top of it
pop_pmc                   If the PMC is an array, pops a PMC value off the top of it
pop_string                If the PMC is an array, pops a string value off the top of it
push_float                If the PMC is an array, pushes a floating point value onto the top of it
push_integer              If the PMC is an array, pushes an integer onto the top of it
push_pmc                  If the PMC is an array, pushes a PMC onto the top of it
push_string               If the PMC is an array, pushes a string onto the top of it
remove_attribute          Removes an attribute from the PMC
remove_method
remove_parent
remove_role
remove_vtable_override
set_attr                  Sets an attribute value for the given PMC
set_attr_keyed
set_attr_keyed_str
set_bignum_int
set_bignum_num
set_bignum_str
set_bool                  Sets the value of the PMC as if it were a boolean
set_integer_keyed
set_integer_keyed_int
set_integer_keyed_str
set_integer_native        Sets the value of the PMC as if it were an integer
set_number_keyed
set_number_keyed_int
set_number_keyed_str
set_number_native         Sets the value of the PMC as if it were a floating point value
set_number_same
set_pmc                   Sets the value of a PMC to the value of another PMC
set_pmc_keyed
set_pmc_keyed_int
set_pmc_keyed_str
set_pointer
set_pointer_keyed
set_pointer_keyed_int
set_pointer_keyed_str
set_string_keyed
set_string_keyed_int
set_string_keyed_str
set_string_native         Sets the value of the PMC as if it were a string
set_string_same
setprop
share
share_ro
shift_float               If the PMC is an array, shifts a floating point value off the bottom of it
shift_int                 If the PMC is an array, shifts an integer off the bottom of it
shift_pmc                 If the PMC is an array, shifts a PMC off the bottom of it
shift_string              If the PMC is an array, shifts a string off the bottom of it
slice
splice
substr
substr_str
thaw
thawfinish
type
type_keyed
type_keyed_int
type_keyed_str
unshift_float             If the PMC is an array, unshifts a floating point value onto the bottom of it
unshift_integer           If the PMC is an array, unshifts an integer onto the bottom of it
unshift_pmc               If the PMC is an array, unshifts a PMC onto the bottom of it
unshift_str               If the PMC is an array, unshifts a string onto the bottom of it
visit
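As a rough analogy only, the array and invoke entries in this list can be pictured in Python. The sketch below is not Parrot code: ArrayLike and Greeter are hypothetical names, and it assumes the usual convention in which push and pop work on the top (end) of an array, shift removes a value from the bottom (start), and unshift adds one there.

```python
from collections import deque


class ArrayLike:
    """Toy stand-in for an array-like PMC (hypothetical, not a Parrot class)."""

    def __init__(self):
        self.items = deque()

    # push_* and pop_* work on the top (end) of the array.
    def push_integer(self, value):
        self.items.append(int(value))

    def pop_integer(self):
        return self.items.pop()

    # unshift_* and shift_* work on the bottom (start) of the array.
    def unshift_integer(self, value):
        self.items.appendleft(int(value))

    def shift_integer(self):
        return self.items.popleft()

    def elements(self):
        return len(self.items)


class Greeter:
    """Toy object that can be called like a function, in the spirit of invoke."""

    def __call__(self, name):
        return "hello " + name


def demo():
    arr = ArrayLike()
    arr.push_integer(2)
    arr.push_integer(3)
    arr.unshift_integer(1)       # the array now holds 1, 2, 3
    first = arr.shift_integer()  # removes 1 from the start
    last = arr.pop_integer()     # removes 3 from the end
    return first, last, arr.elements(), Greeter()("parrot")
```

Each method here plays the role of one vtable slot: the caller asks for an operation by name, and the object's type decides what that operation does.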
Introduction
Squaak Language Tutorial
Episode 1: Introduction
Episode 2: Poking in Compiler Guts
Episode 3: Squaak Details and First Steps
Episode 4: PAST Nodes and More Statements
Episode 5: Variable Declaration and Scope
Episode 6: Scope and Subroutines
Episode 7: Operators and Precedence
Episode 8: Hashtables and Arrays
Episode 9: Wrap-Up and Conclusion
Introduction
This is the first episode in a tutorial series on building a compiler with the Parrot Compiler Tools. If you're interested in virtual machines, you've probably heard of the Parrot virtual machine. Parrot is a generic virtual machine designed for dynamic languages. This is in contrast with the Java virtual machine (JVM) and Microsoft's Common Language Runtime (CLR), both of which were designed to run static languages. Both the JVM and Microsoft (through the Dynamic Language Runtime -- DLR) are adding support for dynamic languages, but their primary focus is still static languages.
As you can see, a number of common (more advanced) features are missing. Most notable are:
* classes and objects
* exceptional control statements such as break and return
* advanced control statements such as switch
* closures (nested subroutines and accessing local variables in an outer scope)
Getting Started
For this tutorial, it is assumed you have successfully compiled Parrot (and maybe even run the test suite). If you browse through the languages directory in the Parrot source tree, you'll find a number of language implementations. Most of them are not complete yet; some are actively maintained and others aren't. If, after reading this tutorial, you feel like contributing to one of these languages, you can check out the mailing list or join IRC (see the References section for details).

The languages subdirectory is the right spot to put our language implementation. Parrot comes with a special Perl 5 script to generate the necessary files for a language implementation. Before it can be run, Parrot must be installed for development. From Parrot's root directory, type:

    $ make install-dev

In order to generate a new directory containing files for our language, type (assuming you're in Parrot's root directory):

    $ perl tools/dev/mk_language_shell.pl Squaak

(Note: if you're on Windows, you should use backslashes.) This will generate the files in a directory "squaak", and use the name "Squaak" as the language's name. After this, go to this directory and type:

    $ parrot setup.pir test

This will compile the generated files and run the test suite. If you want more information on what files are being generated, please check out the references at the end of this episode.

Note that we didn't write a single line of code, and already we have the basic infrastructure in place to get us started. Of course, the generated compiler doesn't even look like the language we will be implementing, but that's OK for now. Later we'll adapt the grammar to accept our language.

Now you might want to actually run a simple script with this compiler. Launch your favorite editor, and put in this statement:

    say "Squaak!";

Save the file (for instance as test.sq) and type:

    $ parrot squaak.pbc test.sq

This will run Parrot, specifying squaak.pbc as the file to be run by Parrot, which takes a single argument: the file test.sq. If all went well, you should see the following output:

    $ parrot squaak.pbc test.sq
    Squaak!

Instead of running a script file, you can also run the Squaak compiler as an interactive interpreter. Run the Squaak compiler without specifying a script file, and type the same statement as you wrote in the file:

    $ parrot squaak.pbc
    say "Squaak!";

which will print:

    Squaak!
What's Next?
This first episode of the tutorial is mainly an overview of what will be coming. Hopefully you now have a general idea of what the Parrot Compiler Tools are, and how they can be used to build a compiler targeting Parrot. If you want to check out some serious usage of the PCT, check out Rakudo (Perl 6 on Parrot) or Pynie (Python on Parrot). The next episodes will focus on the step-by-step implementation of our language, including the following topics:
* structure of PCT-based compilers
* using PGE rules to define the language grammar
* implementing operator precedence using an operator precedence table
* using NQP to write embedded parse actions
* implementing language library routines

In the meantime, experiment for yourself. You are welcome to join us on IRC (see the References section for details). Any feedback on this tutorial is appreciated.
Exercises
The exercises are provided at the end of each episode of this tutorial. In order to keep the length of this tutorial somewhat acceptable, not everything can be discussed in full detail. The answers and/or solutions to these exercises will be posted several days after the episode.

Problem 1: Advanced Interactive Mode

Launch your favorite editor and look at the file src/Squaak/Compiler.pm, still in the directory squaak. This file is written in Not Quite Perl ("NQP") and is compiled to src/gen_compiler.pir when you run parrot setup.pir test. It contains the setup for the compiler. The class HLL::Compiler defines methods to set a command-line banner and prompt for your compiler when it is running in interactive mode. For instance, when you run Python in interactive mode, you'll see:

    Python 2.6.5 (r265:79063, Apr 1 2010, 05:28:39)
    [GCC 4.4.3 20100316 (prerelease)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.

or something similar (depending on your Python installation and version). This text is called the command-line banner. And while running in interactive mode, each line will start with:

    >>>

which is called a prompt. For Squaak, we'd like to see the following when running in interactive mode (of course you can change this according to your personal taste):

    $ ../../parrot squaak.pbc
    Squaak for Parrot VM.
    >

Add code to the file Compiler.pm to achieve this. Hint: note that only double-quoted strings in NQP can interpret escape characters such as '\n'.

Answer

Given the hints that were provided, it was probably not too hard to find the solution, which is shown below. This code can be found in the file Compiler.pm. The relevant lines are the last two, which set the banner and the prompt.

    INIT {
        Squaak::Compiler.language('Squaak');
        Squaak::Compiler.parsegrammar(Squaak::Grammar);
        Squaak::Compiler.parseactions(Squaak::Actions);
        Squaak::Compiler.commandline_banner("Squaak for Parrot VM.\n");
        Squaak::Compiler.commandline_prompt('> ');
    }

The HLL::Compiler class provides more options that you can set; it is worth exploring them all.
References
Parrot mailing list: parrot-dev@lists.parrot.org
IRC: join #parrot on irc.perl.org
Getting started with PCT: docs/pct/gettingstarted.pod
Parrot Abstract Syntax Tree (PAST): docs/pct/past_building_blocks.pod
Operator Precedence Parsing with PCT: docs/pct/pct_optable_guide.pod
Perl 6/PGE rules syntax: Synopsis 5 [1]
Poking in Compiler Guts

This is an example of using the target option set to "parse", which will print the parse tree of the input to stdout:

    $ parrot squaak.pbc --target=parse test.sq

In interactive mode, giving this input:

    say 42;

will print this parse tree (without the line numbers):

     1  "parse" => PMC 'Regex;Match' => "say 42;\n" @ 0 {
     2      <statementlist> => PMC 'Regex;Match' => "say 42;\n" @ 0 {
     3          <statement> => ResizablePMCArray (size:1) [
     4              PMC 'Regex;Match' => "say 42" @ 0 {
     5                  <statement_control> => PMC 'Regex;Match' => "say 42" @ 0 {
     6                      <EXPR> => ResizablePMCArray (size:1) [
     7                          PMC 'Regex;Match' => "42" @ 4 {
     8                              <integer> => PMC 'Regex;Match' => "42" @ 4 {
     9                                  <decint> => PMC 'Regex;Match' => "42" @ 4
    10                                  <VALUE> => \parse[0][0]
    11                              }
    12                          }
    13                      ]
    14                      <sym> => PMC 'Regex;Match' => "say" @ 0
    15                  }
    16              }
    17          ]
    18      }
    19  }

When changing the value of the target option, the output changes into a different representation of the input. Why don't you try that right now?

So, an HLL::Compiler object has four compilation phases:
* parsing,
* construction of a Parrot Abstract Syntax Tree (PAST),
* construction of a Parrot Opcode Syntax Tree (POST),
* generation of Parrot Intermediate Representation (PIR).

After compilation, the generated PIR is executed immediately. If your compiler needs additional stages, you can add them to your HLL::Compiler object. For Squaak, we will not need this, but for details, check out compilers/pct/src/PCT/HLLCompiler.pir.

We shall now discuss each compilation phase in more detail. The first two phases, parsing the input and construction of the PAST, are executed simultaneously. Therefore, these are discussed together.
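The idea of a fixed sequence of stages, with a target option that stops after a named stage, can be sketched in a few lines of Python. This is only an illustration of the staging idea, not how HLL::Compiler is implemented: the stage functions, the data they produce and the name compile_source are all hypothetical.

```python
def parse(source):
    # Toy "parse tree": just the whitespace-separated words.
    return source.split()


def to_past(tree):
    # Toy "PAST": one call node with the remaining words as arguments.
    return {"op": "call", "name": tree[0], "args": tree[1:]}


def to_post(past):
    # Toy "POST": a flat list of instruction tuples.
    ops = [("push", arg) for arg in past["args"]]
    ops.append(("call", past["name"]))
    return ops


def to_pir(post):
    # Toy "PIR": one line of text per instruction (not real PIR syntax).
    return "\n".join(" ".join(op) for op in post)


STAGES = [("parse", parse), ("past", to_past), ("post", to_post), ("pir", to_pir)]


def compile_source(source, target="pir"):
    """Run the stages in order and stop after the one named by target."""
    value = source
    for name, stage in STAGES:
        value = stage(value)
        if name == target:
            return value
    raise ValueError("unknown target: " + target)
```

Calling compile_source("say 42", target="post") returns the intermediate instruction list instead of the final text, which mirrors what changing the target option does to the compiler's output.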
PAST to POST
After the parse phase, during which the PAST is constructed, the HLL::Compiler transforms this PAST into something called a Parrot Opcode Syntax Tree (POST). The POST representation is also a tree structure, but its nodes are on a lower abstraction level. For instance, on the PAST level there is a node type to represent a while statement (constructed as PAST::Op.new( :pasttype('while') ) ). The template for a while statement typically consists of a number of labels and jump instructions. On the POST level, the same while statement is represented by a set of nodes, each representing a single instruction or a label. Therefore, it is much easier to transform a POST into something executable than to do so from the PAST level. Usually, as a user of the PCT, you don't need to know the details of POST nodes, which is why they will not be discussed further. Use the target option to see what a POST looks like.
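To make the "labels and jump instructions" template concrete, here is a small Python sketch that lowers a toy while node into a flat list of labels and pseudo-instructions. The node shapes, the label names and the instruction strings are invented for this illustration; they are not the PCT's actual PAST or POST node types, and the output is not real PIR.

```python
import itertools


def lower(node, fresh):
    """Flatten a toy tree node into a list of label/instruction strings."""
    kind = node[0]
    if kind == "op":
        # A leaf: one pseudo-instruction.
        return [node[1]]
    if kind == "while":
        _, cond, body = node
        n = next(fresh)
        top, end = f"loop_{n}", f"end_{n}"
        return (
            [f"{top}:"]                    # label at the top of the loop
            + lower(cond, fresh)           # evaluate the condition
            + [f"unless_true goto {end}"]  # leave the loop when it is false
            + lower(body, fresh)           # loop body
            + [f"goto {top}", f"{end}:"]   # jump back, then the exit label
        )
    raise ValueError(f"unknown node kind: {kind}")


def lower_program(node):
    # Each program gets its own label counter.
    return lower(node, itertools.count())
```

A single high-level while node becomes six low-level entries, each of which is one label or one instruction. That is the sense in which the lower tree is easier to turn into executable code.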
POST to PIR
In the fourth (and final) stage, the POST is transformed into Parrot Intermediate Representation (PIR). As mentioned, transforming a POST into something executable is rather straightforward, as POST nodes already represent individual instructions and labels. Again, normal usage of the PCT does not require you to know any details about this transformation.
where we noted that the first two are done during the parse stage. Now, as you're reading this tutorial, you're probably interested in using the PCT for implementing Your Favorite Language for Parrot. We already saw that a language grammar is expressed in Perl 6 Rules. What about the other transformations? Well, earlier in this episode we mentioned the term parse actions, and that these actions create PAST nodes. After you have written a parse action for each grammar rule, you're done!
Say what?
That's right. Once you have correctly constructed a PAST, your compiler can generate executable PIR, which means you just implemented your first language for Parrot. Of course, you still need to implement any language-specific libraries, but that's beside the point. PCT-based compilers already know how to transform a PAST into a POST, and how to transform a POST into PIR. These transformation stages are already provided by the PCT.
What's next?
In this episode we took a closer look at the internals of a PCT-based compiler. We discussed the four compilation stages, which transform an input string (a program, or script, depending on your definition) into a PAST, a POST and finally executable PIR. The next episodes are where the Fun Stuff is: we will be implementing Squaak for Parrot. Piece by piece, we will implement the parser and the parse actions. Finally, we shall demonstrate John Conway's "Game of Life" running on Parrot, implemented in Squaak.
Exercises
Starting in the next episode, the exercises will be more interesting. For now, it would be useful to browse through the source files of the compiler, and see if you understand the relation between the grammar rules in Grammar.pg and the methods in Actions.pm. It's also useful to experiment with the --target option described in this episode. If you don't know PIR, now is the time to do some preparation for that. There's sufficient information to be found on PIR; see the References section for details. In the meantime, if you have any suggestions or questions, don't hesitate to leave a comment.
References
1. PIR language specification: docs/pdds/pdd19_pir.pod
2. PIR book: docs/book/pir
Squaak Grammar
Without further ado, here is the full grammar specification for Squaak. This specification uses the following meta-syntax:

    statement      indicates a non-terminal, named "statement"
    {statement}    indicates zero or more statements
    [step]         indicates an optional step
    'do'           indicates the keyword 'do'

Below is Squaak's grammar. The start symbol is program.

    program              ::= {stat-or-def}

    stat-or-def          ::= statement
                           | sub-definition

    statement            ::= if-statement
                           | while-statement
                           | for-statement
                           | try-statement
                           | throw-statement
                           | variable-declaration
                           | assignment
                           | sub-call
                           | do-block

    block                ::= {statement}

    do-block             ::= 'do' block 'end'

    if-statement         ::= 'if' expression 'then' block
                             ['else' block]
                             'end'

    while-statement      ::= 'while' expression 'do'
                             block 'end'

    for-statement        ::= 'for' for-init ',' expression [step]
                             'do'
                             block
                             'end'

    step                 ::= ',' expression

    for-init             ::= 'var' identifier '=' expression

    try-statement        ::= 'try' block 'catch' identifier
                             block
                             'end'

    throw-statement      ::= 'throw' expression

    sub-definition       ::= 'sub' identifier parameters
                             block
                             'end'

    parameters           ::= '(' [identifier {',' identifier}] ')'

    variable-declaration ::= 'var' identifier ['=' expression]

    assignment           ::= primary '=' expression

    sub-call             ::= primary arguments

    primary              ::= identifier postfix-expression*

    postfix-expression   ::= key
                           | index
                           | member

    key                  ::= '{' expression '}'

    index                ::= '[' expression ']'

    member               ::= '.' identifier

    arguments            ::= '(' [expression {',' expression}] ')'

    expression           ::= expression {binary-op expression}
                           | unary-op expression
                           | '(' expression ')'
                           | term

    term                 ::= float-constant
                           | integer-constant
                           | string-constant
                           | array-constructor
                           | hash-constructor
                           | primary

    hash-constructor     ::= '{' [named-field {',' named-field}] '}'

    named-field          ::= string-constant '=>' expression

    array-constructor    ::= '[' [expression {',' expression} ] ']'

    binary-op            ::= '+'   | '-'  | '/'  | '*'  | '%'  | '..'
                           | 'and' | 'or' | '>'  | '>=' | '<'  | '<='
                           | '=='  | '!='

    unary-op             ::= 'not' | '-'
Gee, that's a lot, isn't it? Actually, this grammar is rather small compared to "real world" languages such as C, not to mention Perl 6. No worries though, we won't implement the whole thing at once, but in small steps. What's more, the exercises section contains enough exercises for you to learn to use the PCT yourself! The solutions to these exercises will be posted a few days later (but you really only need a couple of hours to figure them out).
Semantics
Most of the Squaak language is straightforward; the if-statement executes exactly as you would expect. When we discuss a grammar rule (for its implementation), a semantic specification will be included. This saves me from writing a complete language manual, which could take quite a few pages.
Interactive Squaak
Although the Squaak compiler can be used in interactive mode, there is one point to be aware of. When you define a local variable using the 'var' keyword, this variable will be lost in any subsequent commands. The variable will only be available to other statements within the same command (a command is the set of statements you type before pressing enter). This has to do with the code generation by the PCT, and will be fixed at a later point. For now, just remember it doesn't work.
token keyword { ['and'|'catch'|'do' |'not'|'or' |'sub' }
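As a rough analogy for how a keyword token with a trailing word boundary interacts with an identifier token that must not be a keyword, the same idea can be written with Python's re module. This is a sketch under assumptions, not PGE syntax: the keyword list is only a subset chosen for illustration, and the names identifier, identifier_no_boundary and is_identifier are hypothetical.

```python
import re

# A subset of keywords, for illustration only.
KEYWORDS = ["and", "catch", "do", "for", "not", "or", "sub"]
ALTERNATION = "|".join(KEYWORDS)
IDENT = r"[A-Za-z_]\w*"

# An identifier must not start with a keyword that ends at a word boundary.
identifier = re.compile(r"(?!(?:%s)\b)%s" % (ALTERNATION, IDENT))

# The same pattern without the boundary check, to show what goes wrong.
identifier_no_boundary = re.compile(r"(?!(?:%s))%s" % (ALTERNATION, IDENT))


def is_identifier(text, pattern=identifier):
    """Return True when the whole text matches the identifier pattern."""
    return pattern.fullmatch(text) is not None
```

With the boundary in place, is_identifier("forloop") is accepted and is_identifier("for") is rejected. Without it, is_identifier("forloop", identifier_no_boundary) is rejected too, because the prefix "for" alone is enough to trip the negative assertion.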
Now, change the rule "value" into this (renaming to "expression"):

    rule expression {
        | <string_constant> {*}      #= string_constant
        | <integer_constant> {*}     #= integer_constant
    }
Rename the rule "integer" as "integer_constant", and "quote" as "string_constant" (to better match our language specification).

Phew, that was a lot of information! Let's have a closer look at some things that may look unfamiliar. The first new thing is in the rule "identifier". Instead of the "rule" keyword, you see the keyword "token". In short, a token doesn't skip whitespace between the different parts specified in the token, while a rule does. For now, it's enough to remember to use a token if you want to match a string that doesn't contain any whitespace (such as literal constants and identifiers), and use a rule if your string does (and should) contain whitespace (such as an if-statement). We shall use the word "rule" in a general sense, which could refer to a token. For more information on rules and tokens (and there's a third type, called "regex"), take a look at Synopsis 5.

In token "identifier", the first subrule is called an assertion. It asserts that an "identifier" does not match the rule keyword. In other words, a keyword cannot be used as an identifier. The second subrule is called "ident", which is a built-in rule in the class PCT::Grammar, of which this grammar is a subclass.

In token "keyword", all keywords of Squaak are listed. At the end there's a ">>" marker, which indicates a word boundary. Without this marker, an identifier such as "forloop" would wrongly be disqualified, because the part "for" would match the rule keyword, and the part "loop" would match the rule "ident". However, as the assertion <!keyword> is false (as "for" could be matched), the string "forloop" cannot be matched as an identifier. The required presence of the word boundary prevents this.

The last rule is "expression". An expression is either a string-constant or an integer-constant. Either way, an action is executed. However, when the action is executed, it does not know what the parser matched; was it a string-constant, or an integer-constant?
Of course, the match object can be checked, but consider the case where you have 10 alternatives, then doing 9 checks only to find out the last alternative was matched is somewhat inefficient (and adding new alternatives requires you to update this check). That's why you see the special comments starting with a "#=" character. Using this notation, you can specify a key, which will be passed as a second argument to the action method. As we will see, this allows us to write very simple and efficient action methods for rules such as expression. (Note there's a space between the #= and the key's name).
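The benefit of passing a key to the action can be sketched in Python. The toy parser below tries each alternative in order and hands the action both the match and the key of the alternative that succeeded, so the action never has to test the alternatives itself. All names here (ALTERNATIVES, expression_action, parse_expression) are hypothetical, and a plain dictionary stands in for the Match object.

```python
import re

# Each alternative carries the key that would follow "#=" in the grammar.
ALTERNATIVES = [
    ("string_constant", re.compile(r'"[^"]*"')),
    ("integer_constant", re.compile(r"\d+")),
]


def expression_action(match, key):
    """Action method: the key says which alternative matched."""
    # One lookup by key, however many alternatives the rule has.
    return (key, match[key])


def parse_expression(text):
    for key, pattern in ALTERNATIVES:
        found = pattern.fullmatch(text)
        if found:
            # Pass the key as a second argument, as "#= key" does.
            return expression_action({key: found.group()}, key)
    raise ValueError("no alternative matched: " + text)
```

Adding a third alternative only means adding one entry to ALTERNATIVES; expression_action stays unchanged, which is the maintenance advantage the text describes.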
And... Action!
Now that we have implemented the initial version of the Squaak grammar, it's time to implement the parse actions we mentioned before. The actions are written in a file called src/Squaak/Actions.pm. If you look at the methods in this file, here and there you'll see that the Match object ($/), or rather, hash fields of it (like $<statement>), is evaluated in scalar context, by writing "$( ... )". As mentioned in Synopsis 5, evaluating a Match object in scalar context returns its result object. Normally the result object is the matched portion of the source text, but the special make function can be used to set the result object to some other value. This means that each node in the parse tree (a Match object) can also hold its PAST representation. Thus we use the make function to set the PAST representation of the current node in the parse tree, and later use the $( ... ) operator to retrieve the PAST representation from it.

To recap, the Match object ($/) and any subrules of it (for instance $<statement>) represent the parse tree; of course, $<statement> represents only the part of the parse tree that the <statement> rule matched. So, any action method has access to the parse tree that the equally named grammar rule matched, as the Match object is always passed as an argument. Evaluating a parse tree in scalar context yields the PAST representation (obviously, this PAST object should be set using the make function).

If you're following this tutorial, I highly advise you to get your feet wet, and do the exercises. Remember, learning and not doing is not learning (or something like that :-). This week's exercises are not that difficult, and after doing them, you'll have implemented the first part of our little Squaak language.
What's next?
In this episode we introduced the full grammar of Squaak. We took the first steps to implement this language. The first, and currently only, statement type is the assignment. We briefly touched on how to write the action methods that are invoked during the parsing phase. In the next episode, we shall take a closer look at the different PAST node types, and implement some more parts of the Squaak language. Once we have all basic parts in place, adding statement types will be rather straightforward. In the meantime, if you have any questions or are stuck, don't hesitate to leave a comment or contact me.
Exercises
This episode's exercises are simple enough to get started on implementing Squaak.

Problem 1
Rename the action methods according to the name changes we made to the grammar rules. So, "integer" becomes "integer_constant", "value" becomes "expression", and so on.

Problem 2
Look at the grammar rule for statement. A statement currently consists of an assignment. Implement the action method "statement" to retrieve the result object of this assignment and set it as statement's result object using the special make function. Do the same for rule primary.

Solution
    method statement($/) {
        make $( $<assignment> );
    }

Note that at this point, the rule statement doesn't define different #= keys for each type of statement, so we don't declare a parameter $key. This will be changed later.

    method primary($/) {
        make $( $<identifier> );
    }

Problem 3
Write the action method for the rule identifier. As a result object of this "match", a new PAST::Var node should be set, taking as name a string representation of the match object ($/). For now, you can set the scope to 'package'. See "pdd26: ast" for details on PAST::Var nodes.

Solution
    method identifier($/) {
        make PAST::Var.new( :name(~$/),
                            :scope('package'),
                            :node($/) );
    }

Problem 4
Write the action method for assignment. Retrieve the result objects for "primary" and for "expression", and create a PAST::Op node that binds the expression to the primary. (Check out pdd26 for PAST::Op node types, and find out how you do such a binding.)

Solution
    method assignment($/) {
        my $lhs := $( $<primary> );
        my $rhs := $( $<expression> );
        $lhs.lvalue(1);
        make PAST::Op.new( $lhs, $rhs,
                           :pasttype('bind'),
                           :node($/) );
    }

Note that we set the lvalue flag on $lhs. See PDD26 for details on this flag.

Problem 5
Run your compiler on a script or in interactive mode. Use the target option to see what PIR is being generated on the input "x = 42".

Solution
    .namespace
    .sub "_block10"
        new $P11, "Integer"
        assign $P11, 42
        set_global "x", $P11
        .return ($P11)
    .end

The first two lines of code in the sub create an object to store the number 42; the third line stores this object under the global name "x". The PAST compiler will always generate an instruction to return the result of the last statement, in this case $P11.
Some Notes
Help! I get the error message "no result object". This means that the result object was not set properly (duh!). Make sure each action method is invoked (check each rule for a "{*}" marker), that there is an action method for that rule, and that "make" is used to set the appropriate PAST node. Note that not all rules have action methods, for instance the "keyword" rule (there's no point in that).

While we're constructing parts of Squaak's grammar, we'll sometimes take a shortcut, by forgetting about certain rules for a while. For instance, you might have noticed we're ignoring float-constants right now. That's ok. When we need them, these rules will be added.
References
pdd26: ast
synopsis 5: Rules
docs/pct/*.pod
PAST Nodes and More Statements
on that in a later episode), and all other child nodes are passed to that subroutine as arguments. It generally doesn't matter which PAST node type the children have. For instance, consider a language in which a simple expression is a statement:

    42

You might wonder what kind of code is generated for this. Well, it's really very simple: a new PAST::Val node is created (of a certain type, for this example that would be 'Integer'), and the value is assigned to this node. It might seem a bit confusing to write something like this, as it doesn't really do anything (note that this is not valid Squaak input):

    if 42 then "hi" else "bye" end

But again, this works out correctly; the "then" and "else" blocks are compiled to instructions that load that particular literal into a PAST::Val node and leave it there. That's fine, if your language allows such statements. The point I'm trying to make is that all PAST nodes are equal. You don't need to think about the node types if you set a node as a child of some other parent node. Each PAST node is compiled into a number of PIR instructions.
If-then-else
The first statement we're going to implement now is the if-statement. An if-statement typically has three parts (but this of course depends on the programming language): a conditional expression, a "then" part and an "else" part. Implementing this in Perl 6 rules and PAST is almost trivial:

    rule if_statement {
        'if' <expression> 'then' <block>
        ['else' $<else>=<block> ]?
        'end'
        {*}
    }

    rule block {
        <statement>*
        {*}
    }

    rule statement {
        | <assignment> {*}      #= assignment
        | <if_statement> {*}    #= if_statement
    }
Note that the optional else block is stored in the match object's "else" field. If we hadn't written this $<else>= part, then <block> would have been an array, with block[0] the "then" part, and block[1] the optional else part. Assigning the optional else block to a different field makes the action method slightly easier to read.
Also note that the statement rule has been updated; a statement is now either an assignment or an if-statement. As a result, the action method statement now takes a key argument. The relevant action methods are shown below:

    method statement($/, $key) {
        # get the field stored in $key from the $/ object,
        # and retrieve the result object from that field.
        make $( $/{$key} );
    }

    method block($/) {
        # create a new block, set its type to 'immediate',
        # meaning it is potentially executed immediately
        # (as opposed to a declaration, such as a
        # subroutine definition).
        my $past := PAST::Block.new( :blocktype('immediate'),
                                     :node($/) );
        # for each statement, add the result
        # object to the block
        for $<statement> {
            $past.push( $( $_ ) );
        }
        make $past;
    }

    method if_statement($/) {
        my $cond := $( $<expression> );
        my $then := $( $<block> );
        my $past := PAST::Op.new( $cond, $then,
                                  :pasttype('if'),
                                  :node($/) );
        if $<else> {
            $past.push( $( $<else>[0] ) );
        }
        make $past;
    }

That's easy, huh? First, we get the result objects for the conditional expression and the then part. Then, a new PAST::Op node is created, and the :pasttype is set to 'if', meaning this node represents an if-statement. Then, if there is an "else" block, this block's result object is retrieved and added as the third child of the PAST node. Finally, the result object is set with the make function.
Result objects
At this point it's wise to spend a few words on the make function, the parse actions and how the whole PAST is created by the individual parse actions. Have another look at the action method if_statement. In the first two lines, we request the result objects for the conditional expression and the "then" block. When were these result objects created? How can we be sure they're there? The answer lies in the order in which the parse actions are executed. The special {*} symbol that triggers a parse action invocation is usually placed at the end of the rule. For this input string: "if 42 then x = 1 end" this implies the following order:

    1. parse TOP
    2. parse statement
    3. parse if_statement
    4. parse expression
    5. parse integer
    6. create PAST::Val( :value(42) )
    7. parse block
    8. parse statement
    9. parse assignment
    10. parse identifier
    11. create PAST::Var( :name('x'))
    12. parse integer
    13. create PAST::Val( :value(1) )
    14. create PAST::Op( :pasttype('bind') )
    15. create PAST::Block (in action method block)
    16. create PAST::Op( :pasttype('if') )
    17. create PAST::Block (in action method TOP)

As you can see, PAST nodes are created in the leaves of the parse tree first, so that later, action methods higher in the parse tree can retrieve them.
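The bottom-up order in which result objects appear can be mimicked with a small post-order traversal. The sketch below is a Python model, not PCT code: the tree is a hand-written, simplified parse tree for "if 42 then x = 1 end", and run_actions is a hypothetical helper that "fires" a rule's action only after all of its subrules have been processed, just as a {*} at the end of a rule would.

```python
# Toy parse tree as (rule name, children); a simplification of the real one.
tree = ("TOP", [
    ("statement", [
        ("if_statement", [
            ("expression", [("integer", [])]),
            ("block", [
                ("statement", [
                    ("assignment", [
                        ("identifier", []),
                        ("integer", []),
                    ]),
                ]),
            ]),
        ]),
    ]),
])


def run_actions(node, log):
    """Process children first, then record this rule's action (post-order)."""
    name, children = node
    results = [run_actions(child, log) for child in children]
    log.append(name)          # the action runs once the subrules are done
    return (name, results)    # stands in for the result object set by make


log = []
run_actions(tree, log)
print(log)
```

The recorded order starts at the leaves (integer, identifier) and ends with TOP, which is why an action such as if_statement can rely on the result objects of its expression and block already being there.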
Throwing Exceptions
The grammar rule for the "throw" statement is really quite easy, but it's useful to discuss the parse action, as it shows how to generate custom PIR instructions. First the grammar rule:

    rule throw_statement {
        'throw' <expression>
        {*}
    }

I assume you know how to update the "statement" rule by now. The throw statement will compile down to Parrot's "throw" instruction, which takes one argument. In order to generate a custom Parrot instruction, the instruction can be specified in the :pirop attribute when creating a PAST::Op node. Any child nodes are passed as arguments to this instruction, so we need to pass the result object of the expression being thrown as a child of the PAST::Op node representing the "throw" instruction.

    method throw_statement($/) {
        make PAST::Op.new( $( $<expression> ),
                           :pirop('throw'),
                           :node($/) );
    }
What's Next?
In this episode we implemented two more statement types of Squaak. You should now have a general idea of how and when PAST nodes are created, and how they can be retrieved from sub (parse) trees. In the next episode we'll take a closer look at variable scope and subroutines. In the meantime, I can imagine some things are not too clear. In case you're lost, don't hesitate to leave a comment, and I'll try to answer (as far as my knowledge goes).
Exercises
Problem 1
We showed how the if-statement was implemented. The while-statement and try-statement are very similar. Implement these. Check out pdd26 to see what PAST::Op nodes you should create.

Solution
The while-statement is straightforward:

    method while_statement($/) {
        my $cond := $( $<expression> );
        my $body := $( $<block> );
        make PAST::Op.new( $cond, $body,
                           :pasttype('while'),
                           :node($/) );
    }

The try-statement is a bit more complex. Here are the grammar rules and action methods.

    rule try_statement {
        'try' $<try>=<block>
        'catch' <exception>
        $<catch>=<block>
        'end'
        {*}
    }

    rule exception {
        <identifier>
        {*}
    }

    method try_statement($/) {
        ## get the try block
        my $try := $( $<try> );

        ## create a new PAST::Stmts node for
        ## the catch block; note that no
        ## PAST::Block is created, as this
        ## currently has problems with the
        ## exception object. For now this will
        ## do.
        my $catch := PAST::Stmts.new( :node($/) );
        $catch.push( $( $<catch> ) );

        ## get the exception identifier;
        ## set a declaration flag, the scope,
        ## and clear the viviself attribute.
        my $exc := $( $<exception> );
        $exc.isdecl(1);
        $exc.scope('lexical');
        $exc.viviself(0);

        ## generate instruction to retrieve the exception
        ## object (and the exception message, that is passed
        ## automatically in PIR; this is stored into $S0
        ## but not used).
        my $pir := "    .get_results (%r, $S0)\n"
                 ~ "    store_lex '" ~ $exc.name() ~ "', %r";

        $catch.unshift( PAST::Op.new( :inline($pir), :node($/) ) );

        ## do the declaration of the exception
        ## object as a lexical here:
        $catch.unshift( $exc );

        make PAST::Op.new( $try, $catch,
                           :pasttype('try'),
                           :node($/) );
    }

    method exception($/) {
        our $?BLOCK;
        my $past := $( $<identifier> );
        $?BLOCK.symbol( $past.name(), :scope('lexical') );
        make $past;
    }

Instead of putting "identifier" after the "catch" keyword, we made it a separate rule, with its own action method. This allows us to insert the identifier into the symbol table of the current block (the try-block), before the catch block is parsed.

First the PAST node for the try block is retrieved. Then, the catch block is retrieved, and stored into a PAST::Stmts node. This is needed, so that we can make sure that the instructions that retrieve the exception object come first in the exception handler.
Then, we retrieve the PAST node for the exception identifier. We're setting its scope, a flag telling the PAST compiler this is a declaration, and we clear the viviself attribute. The viviself attribute is discussed in a later episode; if you haven't read that yet, just keep in mind that the viviself attribute (if set) will make sure all declared variables are initialized. We must clear this attribute here, to make sure that this exception object is not initialized, because that will be done by the instruction that retrieves the thrown exception object, discussed next.

In PIR, we can use the .get_results directive to retrieve a thrown exception. You could also generate the get_results instruction (note the missing dot), but this is much easier. Currently, in PIR, when retrieving the exception object, you must always specify both a variable (or register) for the exception object, and a string variable (or register) to store the exception message. The exception message is actually stored within the exception object. We use $S0 to store the exception message, and we'll ignore it after that. Just remember for now that if you want to retrieve the exception object, you must also specify a place to store the exception message.

There is no special PAST node to generate these instructions, so we use a so-called inline PAST::Op node. We store the instructions to be generated into a string and store that in the inline attribute of a PAST::Op node. Once created, this node is unshifted onto the PAST::Stmts node representing the exception handler. After that, the declaration is unshifted onto that same PAST::Stmts node, so that this declaration comes first.

Finally, we have the block representing the try block, and a PAST::Stmts node representing the exception handler. Both are used to create a PAST::Op node whose pasttype is set to the built-in "try" type.
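The order of the two unshift calls matters, and it is easy to get it backwards. As a small illustration only, the following Python sketch models the PAST::Stmts node as a plain list and unshift as insertion at index 0; the strings are placeholders, not real PAST nodes.

```python
def unshift(items, item):
    """Put item at the front of the list, like the unshift method in the text."""
    items.insert(0, item)


# The handler starts out holding only the catch block's statements.
catch = ["catch block statements"]

# First unshift: the inline node that fetches the exception object.
unshift(catch, "inline: get_results + store_lex")

# Second unshift: the declaration, which therefore ends up in front.
unshift(catch, "declaration of the exception variable")

print(catch)
```

Because each unshift places its argument in front of everything already there, the node that is unshifted last (the declaration) is the one that comes first in the handler.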
Problem 2
Start Squaak in interactive mode, and specify the target option to show the generated PIR instructions. Check out what instructions and labels are generated, and see if you can recognize which instructions make up the conditional expression, which represent the "then" block, and which represent the "else" block (if any).

Solution
    > if 1 then else end

    .namespace
    .sub "_block16"
        new $P18, "Integer"
        assign $P18, 1
        ## this is the condition:
        if $P18, if_17
        ## this is invoking the else-block:
        get_global $P21, "_block19"
        newclosure $P21, $P21
        $P20 = $P21()
        set $P18, $P20
        goto if_17_end
        ## this is invoking the then-block:
      if_17:
        get_global $P24, "_block22"
        newclosure $P24, $P24
        $P23 = $P24()
        set $P18, $P23
      if_17_end:
        .return ($P18)
    .end

    .namespace
    .sub "_block22" :outer("_block16")
        .return ()
    .end

    .namespace
    .sub "_block19" :outer("_block16")
        .return ()
    .end
References
PDD26: AST
docs/art/*.pod for good introductions to PIR
Variable Declaration and Scope
    a + b   # neither a nor b was declared;
            # both default to the value "Undef"
# prints 1
So, each do/end pair defines a new scope, in which any declared variables hide variables with the same name in outer scopes. This behavior is common in many programming languages. The PCT has built-in support for symbol tables; a PAST::Block object has a method symbol that can be used to enter new symbols and query the table for existing ones. In PCT, a PAST::Block object represents a scope. There are two blocktypes: immediate and declaration. An immediate block can be used to represent the blocks of statements in a do-block statement, for instance:

    do
        block
    end

When executing this statement, block is executed immediately. A declaration block, on the other hand, represents a block of statements that can be invoked at a later point. Typically these are subroutines. So, in this example:

    sub foo(x)
        print(x)
    end

a PAST::Block object is created for the subroutine foo. The blocktype is set to declaration, as the subroutine is defined, not executed (immediately). For now you can forget about the blocktype, but now that I've told you, you'll recognize it when you see it. We'll come back to it in a later episode.
Implementing Scope
So, we know how to use global variables, how to declare local variables, and that PAST::Block objects represent scopes. How do we make our compiler generate the right PIR instructions? After all, Parrot must handle a global variable differently from a local variable. When creating PAST::Var nodes to represent the variables, we must know whether the variable is a local or a global variable. So, when handling variable declarations (of local variables; globals are not declared), we need to register the identifier as a local in the current block's symbol table. First, we'll take a look at the implementation of variable declarations.
Variable declaration
The following is the grammar rule for variable declarations. This is a type of statement, so I assume you know how to extend the statement rule to allow for variable declarations.

    rule variable_declaration {
        'var' <identifier> ['=' <expression>]?
        {*}
    }

A local variable is declared using the var keyword, and has an optional initialization expression. If the latter is missing, the variable's value defaults to the undefined value called Undef. Let's see what the parse action looks like:

    method variable_declaration($/) {
        # get the PAST for the identifier
        my $past := $( $<identifier> );
        # this is a local (it's being defined)
        $past.scope('lexical');
        # set a declaration flag
        $past.isdecl(1);
        # check for the initialization expression
        if $<expression> {
            # use the viviself clause to add
            # an initialization expression
            $past.viviself( $($<expression>[0]) );
        }
        else { # no initialization, default to "Undef"
            $past.viviself('Undef');
        }
        make $past;
    }

Well, that wasn't too hard, was it? Let's analyze what we just did. First we retrieved the PAST node for the identifier, which we then decorated by setting its scope to lexical (a local variable is said to be lexically scoped, hence lexical), and setting a flag indicating this node represents a declaration (isdecl). So, besides representing variables in other statements (for instance, assignments), a PAST::Var node is also used as a declaration statement.

Earlier in this episode we mentioned the need to register local variables in the current scope block when they are declared. So, when executing the parse action for variable-declaration, there should already be a PAST::Block node around that can be used to register the symbol being declared. As we learned in Episode 4, PAST nodes are created in a depth-first fashion; the leaves are created first, and then the nodes "higher" in the parse tree. This implies that a PAST::Block node is created after the statement nodes (which variable_declaration is) that will be the children of the block. In the next section we'll see how to solve this problem.
#= open
#= close
We now have two parse actions for TOP, which are differentiated by an additional key parameter. The first parse action is executed before any input is parsed, which is particularly suitable for any initialization actions you might need. The second action (which was already there) is executed after the whole input string is parsed. Now we can create a PAST::Block node before any statements are parsed, so that when we need the current block, it's there (somewhere, later we'll see where exactly). Let's take a look at the parse action for TOP.

    method TOP($/, $key) {
        our $?BLOCK;
        our @?BLOCK;
        if $key eq 'open' {
            $?BLOCK := PAST::Block.new( :blocktype('declaration'),
                                        :node($/) );
            @?BLOCK.unshift($?BLOCK);
        }
        else { # key is 'close'
            my $past := @?BLOCK.shift();
            for $<statement> {
                $past.push( $( $_ ) );
            }
            make $past;
        }
    }

Let's see what's happening here. When the parse action is invoked for the first time (when $key equals "open"), a new PAST::Block node is created and assigned to a strange-looking (if you don't know Perl, like me. Oh wait, this is Perl. Never mind..) variable called $?BLOCK. This variable is declared as "our", which means that it is a package variable. This means that the variable is shared by all methods in the same package (or class), and, equally important, the variable is still around after the parse action is done. Please refer to the Perl 6 specification [1] for more semantics on "our". The variable $?BLOCK holds the current block.

After that, this block is unshifted onto another funny-looking variable, called @?BLOCK. This variable has a "@" sigil, meaning this is an array. The unshift method puts its argument on the front of the list. In a sense, you could think of the front of this list as the top of a stack. Later we'll see why this stack is necessary. This @?BLOCK variable is also declared with "our", meaning it's also package-scoped. However, as we call a method on this variable, it should have been created already; otherwise you'd invoke the method on an undefined ("Undef") variable. So, this variable should have been created before the parsing starts. We can do this in the compiler's main program, squaak.pir.

Before doing so, let's take a quick look at the "else" part of the parse action for TOP, which is executed after the whole input string is parsed. The PAST::Block node is retrieved from @?BLOCK, which makes sense, as it was created in the first part of the method and unshifted onto @?BLOCK. Now this node can be used as the final result object of TOP. So, now that we've seen how to use the scope stack, let's have a look at its implementation.
A List Class
We'll implement the scope stack as a ResizablePMCArray object. This is a built-in PMC type. However, this built-in PMC does not have any methods; in PIR it can only be used as an operand of the built-in shift and unshift instructions. In order to allow us to write this as method calls, we create a new subclass of ResizablePMCArray. The code below creates the new class and defines the methods we need.

     1  .namespace
     2  .sub 'initlist' :anon :init :load
     3      subclass $P0, 'ResizablePMCArray', 'List'
     4      new $P1, 'List'
     5      set_hll_global ['Squaak';'Grammar';'Actions'], '@?BLOCK', $P1
     6  .end
     7  .namespace ['List']
     8  .sub 'unshift' :method
     9      .param pmc obj
    10      unshift self, obj
    11  .end
    12  .sub 'shift' :method
    13      shift $P0, self
    14      .return ($P0)
    15  .end
Well, here you have it: part of the small amount of PIR code you need to write for the Squaak compiler (there's some more for some built-in subroutines, more on that later). Let's discuss this code snippet in more detail (if you know PIR, you could skip this section).

Line 1 resets the namespace to the root namespace in Parrot, so that the sub 'initlist' is stored in that namespace. The sub 'initlist' defined in lines 2-6 has some flags: :anon means that the sub is not stored by name in the namespace, implying it cannot be looked up by name. The :init flag means that the sub is executed before the main program (the "main" sub) is executed. The :load flag makes sure that the sub is executed if this file was compiled and loaded by another file through the load_bytecode instruction. If you don't understand this, no worries. You can forget about it now. In any case, we know for sure there's a List class when we need it, because the class creation is done before running the actual compiler code.

Line 3 creates a new subclass of ResizablePMCArray, called "List". This results in a new class object, which is left in register $P0, but it's not used after that. Line 4 creates a new List object, and stores it in register $P1. Line 5 stores this List object under the name "@?BLOCK" (that name should ring a bell now...) in the namespace of the Actions class. The semicolons in between the several key strings indicate nested namespaces. So, lines 4 and 5 are important, because they create the @?BLOCK variable and store it in a place that can be accessed from the action methods in the Actions class.

Lines 7-11 define the unshift method, which is a method in the "List" namespace. This means that it can be invoked as a method on a List object. As the sub is marked with the :method flag, the sub has an implicit first parameter called "self", which refers to the invocant object. The unshift method invokes Parrot's unshift instruction on self, passing the obj argument as the second operand. So, obj is unshifted onto self, which is the List object itself. Finally, lines 12-15 define the "shift" method, which does the opposite of "unshift", removing the first element and returning it to its caller.
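For readers who don't know PIR, the behaviour of this List class can be modelled in a few lines of Python. This is only a sketch of the idea, not a translation of the Parrot code: the class name ScopeStack and the block names are invented for the example, and a plain Python list stands in for ResizablePMCArray.

```python
class ScopeStack:
    """A stack whose top is always at index 0, like the List class above."""

    def __init__(self):
        self._items = []

    def unshift(self, item):
        # Put the new item in front; everything else moves one place back.
        self._items.insert(0, item)

    def shift(self):
        # Remove and return the item at the front.
        return self._items.pop(0)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


stack = ScopeStack()
stack.unshift("outer block")   # entering the outer scope
stack.unshift("inner block")   # entering a nested scope
current = stack[0]             # the current block is always at index 0
closed = stack.shift()         # leaving the nested scope
restored = stack[0]            # the outer block is current again
print(current, closed, restored)
```

Keeping the top of the stack at a fixed index is what makes it trivial to find the current block without knowing how deeply the scopes are nested.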
Storing Symbols
We have now set up the necessary infrastructure to store the current scope block, and we created a data structure that acts as a scope stack, which we will need later. We'll now go back to the parse action for variable_declaration, because we didn't enter the declared variable into the current block's symbol table yet. We'll see how to do that now. First, we need to make the current block accessible from the method variable_declaration. We've already seen how to do that, using the "our" keyword. It doesn't really matter where in the action method we enter the symbol's name into the symbol table, but let's do it at the end, after the initialization stuff. Naturally, we're only going to enter the symbol if it's not there already; duplicate variable declarations (in the same scope) should result in an error message (using the panic method of the match object). The code to be added to the method variable_declaration then looks like this:

    method variable_declaration($/) {
        our $?BLOCK;
        # get the PAST node for identifier
        # set the scope and declaration flag
        # do the initialization stuff

        # cache the name into a local variable
        my $name := $past.name();
        if $?BLOCK.symbol( $name ) {
            # symbol is already present
            $/.panic("Error: symbol " ~ $name
                     ~ " was already defined.\n");
        }
        else {
            $?BLOCK.symbol( $name, :scope('lexical') );
        }
        make $past;
    }
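The check-then-register logic can be modelled outside the compiler as well. The Python sketch below is an assumption-laden stand-in: the Block class imitates only the one behaviour of PAST::Block's symbol method that the text relies on (query with a name, or register with a name plus attributes), and declare and the SyntaxError are hypothetical replacements for the action method and the panic call.

```python
class Block:
    """Minimal model of a scope with a symbol table (not the real PAST::Block)."""

    def __init__(self):
        self._symbols = {}

    def symbol(self, name, **attrs):
        # With attributes: register or update the entry.
        # In both cases: return the entry, or None if the name is unknown.
        if attrs:
            self._symbols.setdefault(name, {}).update(attrs)
        return self._symbols.get(name)


def declare(block, name):
    """Register name as a local in block; reject duplicates in the same scope."""
    if block.symbol(name):
        raise SyntaxError(f"Error: symbol {name} was already defined.")
    block.symbol(name, scope="lexical")


current_block = Block()
declare(current_block, "x")          # first declaration is fine

try:
    declare(current_block, "x")      # second one in the same scope is an error
    duplicate_rejected = False
except SyntaxError:
    duplicate_rejected = True

print(current_block.symbol("x"), duplicate_rejected)
```

A declaration of the same name in a different Block object would succeed, which is the behaviour wanted for nested scopes.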
What's Next?
With this code in place, variable declarations are handled correctly. However, we didn't update the parse action for identifier, which creates the PAST::Var node and sets its scope; currently all identifiers' scope is set to 'package' (which means it's a global variable). As we already covered a lot of material in this episode, we'll leave this for the next episode. In the next episode, we'll also cover subroutines, which is another important aspect of any programming language. Hope to catch you later!
Exercises
Problem 1
In this episode, we changed the action method for the TOP rule; it is now invoked twice, once at the beginning of the parse, once at the end of the parse. The block rule, which defines a block to be a series of statements, represents a new scope. This rule is used in, for instance, the if-statement (the then-part and else-part), the while-statement (the loop body) and others. Update the parse action for block so it is invoked twice; once before parsing the statements, during which a new PAST::Block is created and stored onto the scope stack, and once after parsing the statements, during which this PAST node is set as the result object. Make sure $?BLOCK is always pointing to the current block. In order to do this exercise correctly, you should understand well what the shift and unshift methods do, and why we didn't implement methods to push and pop, which are more appropriate words in the context of a (scope) stack.

Solution
Keeping the current block up to date: Sometimes we need to access the current block's symbol table. In order to be able to do so, we need a reference to the "current block". We do this by declaring a package variable called "$?BLOCK", declared with "our" (as opposed to "my"). This variable will always point to the "current" block. As blocks can nest, we use a "stack", on which newly created blocks are stored. Whenever a new block is created, we assign this to $?BLOCK, and store it onto the stack, so that the next time a new block is created, the "old" current block isn't lost. Whenever a scope is closed, we pop off the current block from the stack, and restore the previous "current" block.

Why unshift/shift and not push/pop?: When we're talking about stacks, it would seem logical to talk about stack operations such as "push" and "pop". Instead, we use the operations "unshift" and "shift". If you're not a Perl programmer (such as myself), these names might not make sense. However, it's pretty easy. Instead of pushing a new object onto the "top" of the stack, you unshift objects onto this stack. Just see it as an old school bus, with only one entrance (at the front of the bus). Pushing a new person means taking the first free seat when entering, while unshifting a new person means everybody moves (shifts) one place to the back, so the new person can sit in the front seat. You might think this is not as efficient (more stuff is moved around), but that's not really true (actually: I guess (and certainly hope) the shift and unshift operations are implemented more effectively than the bus metaphor; I don't know how it is implemented).

So why unshift/shift, and not push/pop? When restoring the previous "current block", we need to know exactly where it is (what position). It would be nice to be able to always refer to the "first passenger on the bus", instead of the last person. We know how to reference the first passenger (it's on seat no. 0 (it was designed by an IT guy)); we don't really know what the seat no. of the last person is: s/he might sit in the middle, or at the back. I hope it's clear what I mean here... otherwise, have a look at the code, and try to figure out what's happening:

    method block($/, $key) {
        our $?BLOCK;
        our @?BLOCK;
        if $key eq 'open' {
            $?BLOCK := PAST::Block.new( :blocktype('immediate'),
                                        :node($/) );
            @?BLOCK.unshift($?BLOCK);
        }
        else {
            my $past := @?BLOCK.shift();
            $?BLOCK := @?BLOCK[0];
            for $<statement> {
                $past.push( $( $_ ) );
            }
            make $past;
        }
    }
References
[1] http://perlcabal.org/syn/S02.html#Names
Scope and Subroutines
Variables
In the previous episode, we entered local variables into the current block's symbol table. As we've seen earlier, using the do-block statement, scopes may nest. Consider this example:

    do
        var x = 42
        do
            print(x)
        end
    end

In this example, the print statement should print 42, even though x was not declared in the scope where it is referenced. How does the compiler know it's still a local variable? That's simple: it should look in all scopes, starting at the innermost scope. Only when the variable is found in one of these scopes should its scope be set to "lexical", so that the right instructions are generated. The solution I came up with is shown below. Please note that I'm not 100% sure this is the "best" solution, as my personal understanding of the PAST compiler is limited. So, while this solution works, I may teach you the wrong "habit". Please be aware of this.

    method identifier($/) {
        our @?BLOCK;
        my $name := ~$<ident>;
        my $scope := 'package'; # default value
        # go through all scopes and check if the symbol
        # is registered as a local. If so, set scope to
        # lexical.
        for @?BLOCK {
            if $_.symbol($name) {
                $scope := 'lexical';
            }
        }
        make PAST::Var.new( :name($name), :scope($scope), :viviself('Undef'), :node($/) );
    }
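The lookup performed by the identifier method can be sketched in a few lines of Python. This is only a model under assumptions: each block is reduced to a dict acting as its symbol table, the list is ordered innermost block first (as @?BLOCK is), and the function name resolve_scope is hypothetical. Unlike the action method, the sketch returns as soon as the symbol is found, which gives the same answer.

```python
def resolve_scope(name, scope_stack):
    """Return 'lexical' if any enclosing block declares name, else 'package'.

    scope_stack is a list of symbol tables (dicts), innermost block first.
    """
    for symbols in scope_stack:
        if name in symbols:
            return "lexical"
    return "package"


# do var x = 42  do print(x) end  end
outer = {"x": {"scope": "lexical"}}
inner = {}
stack = [inner, outer]

print(resolve_scope("x", stack))      # lexical
print(resolve_scope("print", stack))  # package
```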
Viviself
You might have noticed the viviself attribute before. This attribute will result in extra instructions that will initialize the variable if it doesn't exist. As you know, global variables spring into life automatically when they're used. Earlier we mentioned that uninitialized variables have a default value of "Undef": the viviself attribute does this. For local variables, we use this mechanism to set the (optional) initialization value. When the identifier is a parameter, the parameter will be initialized automatically if it doesn't receive a value when the subroutine it belongs to is invoked. Effectively this means that all parameters in Squaak are optional!
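The effect of viviself can be pictured as "create on first use". The Python sketch below is an analogy only: the constant UNDEF and the function fetch are invented for illustration and do not correspond to any Parrot call.

```python
UNDEF = "Undef"   # stand-in for Parrot's Undef value (assumption for this sketch)


def fetch(variables, name, viviself=UNDEF):
    """Return variables[name], first creating it with the viviself value if absent."""
    if name not in variables:
        variables[name] = viviself
    return variables[name]


global_vars = {}
print(fetch(global_vars, "counter"))   # Undef
global_vars["counter"] = 3
print(fetch(global_vars, "counter"))   # 3
```

A parameter that receives no argument behaves like the first call: it simply starts out with the viviself value.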
Subroutines
We already mentioned subroutines before, and introduced the PAST::Block node type. We also briefly mentioned the blocktype attribute that can be set on a PAST::Block node, which indicates whether the block is to be executed immediately (for instance, a do-block or if-statement) or represents a declaration (for instance, a subroutine). Let us now look at the grammar rules for subroutine definitions:

    rule sub_definition {
        'sub' <identifier> <parameters>
        <statement>*
        'end'
        {*}
    }

    rule parameters {
        '(' [<identifier> [',' <identifier>]* ]? ')'
        {*}
    }

This is rather straightforward, and the action methods for these rules are quite simple, as you will see. First, however, let's have a look at the rule for sub definitions. Why is the sub body defined as <statement>* and not as a <block>? Surely, a subroutine defines a new scope, which was already covered by <block>? Well, you're right about that. However, as we will see, by the time a new PAST::Block node would be created, we are too late! The parameters would already have been parsed, and not entered into the block's symbol table. That's a problem, because parameters are most likely to be used in the subroutine's body, and as they would not be registered as local variables (which they are), any use of a parameter would not be compiled down to the right instructions to fetch it.
So, how do we solve this in an efficient way? The solution is simple. The only place where parameters live is in the subroutine's body, represented by a PAST::Block node. Why don't we create the PAST::Block node in the action method for the parameters rule? By doing so, the block is already in place and the parameters are registered as local symbols just in time. Let's look at the action methods.

    method parameters($/) {
        our $?BLOCK;
        our @?BLOCK;
        my $past := PAST::Block.new( :blocktype('declaration'), :node($/) );
        # now add all parameters to this block
        for $<identifier> {
            my $param := $( $_ );
            $param.scope('parameter');
            $past.push($param);
            # register the parameter as a local symbol
            $past.symbol($param.name(), :scope('lexical'));
        }
        # now put the block into place on the scope stack
        $?BLOCK := $past;
        @?BLOCK.unshift($past);
        make $past;
    }

    method sub_definition($/) {
        our $?BLOCK;
        our @?BLOCK;
        my $past := $( $<parameters> );
        my $name := $( $<identifier> );
        # set the sub's name
        $past.name( $name.name() );
        # add all statements to the sub's body
        for $<statement> {
            $past.push( $( $_ ) );
        }
        # and remove the block from the scope
        # stack and restore the current block
        @?BLOCK.shift();
        $?BLOCK := @?BLOCK[0];
        make $past;
    }

First, let's check out the parse action for parameters. A new PAST::Block node is created. Then, we iterate over the list of identifiers (which may be empty), each representing a parameter. After retrieving the result object for a parameter (which is just an identifier), we set its scope to "parameter", and we add it to the block object. After that, we register the parameter as a symbol in the block object, specifying the scope as "lexical". Parameters are just a special kind of local variable, and there's no difference between a parameter and a declared local variable in a subroutine, except that a parameter will usually be initialized with a value that is passed when the subroutine is invoked. After handling the parameters, we set the current block (referred to by our package variable $?BLOCK) to the PAST::Block node we just created, and unshift it onto the scope stack (referred to by our package variable @?BLOCK).
After the whole subroutine definition is parsed, the action method sub_definition is invoked. This retrieves the result object for parameters, which is the PAST::Block node that will represent the sub. After retrieving the result object for the sub's name, we set the name on the block node, and add all statements to the block. After this, we shift this block node off the scope stack (@?BLOCK), and restore the current block ($?BLOCK). Pretty easy, huh?
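Why the timing matters can be shown with a small Python model. It is a sketch under assumptions, not PCT code: compile_sub is a hypothetical function that resolves each identifier in a sub body against the block's symbol table, and the flag register_params_first chooses between the approach described above and the "too late" alternative.

```python
def compile_sub(params, body_identifiers, register_params_first=True):
    """Return the scope chosen for each identifier used in the sub body."""
    block_symbols = {}
    if register_params_first:
        # The block exists, and knows its parameters, before the body is parsed.
        for p in params:
            block_symbols[p] = "lexical"
    scopes = {}
    for ident in body_identifiers:
        scopes[ident] = "lexical" if ident in block_symbols else "package"
    if not register_params_first:
        # Too late: the identifiers in the body have already been resolved.
        for p in params:
            block_symbols[p] = "lexical"
    return scopes


print(compile_sub(["n"], ["n", "print"]))
# {'n': 'lexical', 'print': 'package'}
print(compile_sub(["n"], ["n", "print"], register_params_first=False))
# {'n': 'package', 'print': 'package'}
```

In the second call the parameter n is treated as a package variable, which is exactly the miscompilation the early PAST::Block creation avoids.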
Subroutine invocation
Once you have defined a subroutine, you'll want to invoke it. In the exercises of Episode 5, we already gave some tips on how to create the PAST nodes for a subroutine invocation. In this section, we'll give a complete description. First we'll introduce the grammar rule.

    rule sub_call {
        <primary> <arguments>
        {*}
    }

Not only does this allow you to invoke subroutines by their name, you can also store subroutines in an array or hash field, and invoke them from there. Let's take a look at the action methods, which are really quite straightforward.

    method sub_call($/) {
        my $invocant := $( $<primary> );
        my $past := $( $<arguments> );
        $past.unshift($invocant);
        make $past;
    }

    method arguments($/) {
        my $past := PAST::Op.new( :pasttype('call'), :node($/) );
        for $<expression> {
            $past.push( $( $_ ) );
        }
        make $past;
    }

The result object of the sub_call method should be a PAST::Op node (of type 'call'), which contains a number of child nodes: the first one is the invocant object, and all remaining children are the arguments to that sub call. In order to "move" the result objects of the arguments to the sub_call method, we create the PAST::Op node in the method arguments, which is then retrieved by sub_call. In sub_call, the invocant object is set as the first child (using unshift). This is all too easy, isn't it? :-)
What's Next?
In this episode we finished the implementation of scope in Squaak, and implemented subroutines. Our language is coming along nicely! In the next episode, we'll explore how to implement operators and an operator precedence table for efficient expression parsing. In the meantime, should you have any problems or questions, don't hesitate to leave a comment!
Exercises
Problem 1
By now you should have a good idea of the implementation of scope in Squaak. We haven't implemented the for-statement yet, as it needs proper scope handling. Implement this. Check out Episode 3 for the BNF rules that define the syntax of the for-statement. When implementing it, you will run into the same issue as we did when implementing subroutines and parameters. Use the same trick for the implementation of the for-statement.
Solution
First, let us look at the BNF of the for-statement:

    for-statement ::= 'for' for-init ',' expression [step]
                      'do' block 'end'
    step          ::= ',' expression
    for-init      ::= 'var' identifier '=' expression

It's pretty easy to convert this to Perl 6 rules:

    rule for_statement {
        'for' <for_init> ',' <expression> <step>?
        'do' <statement>* 'end'
        {*}
    }

    rule step {
        ',' <expression>
        {*}
    }

    rule for_init {
        'var' <identifier> '=' <expression>
        {*}
    }

Note that the loop body is written as <statement>* rather than <block>: as with sub_definition, the new scope is created earlier (here, in for_init), and the action method for for_statement iterates over $<statement>.
Pretty easy, huh? Let's take a look at the semantics. A for-loop is just another way to write a while loop, but much easier in certain cases. This:

    for var <ident> = <expr1>, <expr2>, <expr3> do
        <block>
    end

corresponds to:

    do
        var <ident> = <expr1>
        while <ident> <= <expr2> do
            <block>
            <ident> = <ident> + <expr3>
        end
    end

If <expr3> is absent, it defaults to the value "1". Note that the step expression (<expr3>) should be positive, because the loop condition contains a <= operator. When you specify a negative step expression, the loop variable will only decrease in value, which will never make the loop condition false (unless it overflows, but that's a different issue; this might even raise an exception in Parrot; this I do not know). Allowing negative step expressions introduces more complexity, which I felt was not worth the trouble for this tutorial language.
Note that the loop variable <ident> is local to the for loop; this is expressed in the equivalent while loop by the surrounding do/end pair: a new do/end pair defines a new (nested) scope; after the end keyword, the loop variable is no longer visible. Let's implement the action method for the for-statement. As was mentioned in the exercise description, we're dealing with the same situation as with subroutine parameters. In this case, we're dealing with the loop variable, which is local to the for-statement. Let's check out the action method for for_init:

    method for_init($/) {
        our $?BLOCK;
        our @?BLOCK;
        ## create a new scope here, so that we can
        ## add the loop variable
        ## to this block here, which is convenient.
        $?BLOCK := PAST::Block.new( :blocktype('immediate'), :node($/) );
        @?BLOCK.unshift($?BLOCK);
        my $iter := $( $<identifier> );
        ## set a flag that this identifier is being declared
        $iter.isdecl(1);
        $iter.scope('lexical');
        ## the identifier is initialized with this expression
        $iter.viviself( $( $<expression> ) );
        ## enter the loop variable into the symbol table.
        $?BLOCK.symbol($iter.name(), :scope('lexical'));
        make $iter;
    }

So, just as we created a new PAST::Block for the subroutine in the action method for parameters, we create a new PAST::Block for the for-statement in the action method that defines the loop variable. (Guess why we made for-init a subrule, and didn't put "var <ident> = <expression>" in the rule for for-statement.) This block is the place for the loop variable to live. The loop variable is declared, initialized using the viviself attribute, and entered into the new block's symbol table. Note that after creating the new PAST::Block object, we put it onto the scope stack. Now, the action method for the for-statement is quite long, so I'll just embed my comments, which makes reading it easier.
    method for_statement($/) {
        our $?BLOCK;
        our @?BLOCK;

First, get the result object of the for statement initialization rule; this is the PAST::Var object, representing the declaration and initialization of the loop variable.

        my $init := $( $<for_init> );

Then, create a new node for the loop variable. Yes, another one (besides the one that is currently contained in the PAST::Block). This one is used when the loop variable is updated at the end of the code block (each iteration). The difference with the other one is that it doesn't have the isdecl flag, and it doesn't have a viviself clause, which would result in extra instructions checking whether the variable is null (and we know it's not, because we initialize the loop variable).

        ## cache the name of the loop variable
        my $itername := $init.name();
        my $iter := PAST::Var.new( :name($itername), :scope('lexical'), :node($/) );

Now, retrieve the PAST::Block node from the scope stack, and push all statement PAST nodes onto it.

        ## the body of the loop consists of the statements written by the user and
        ## the increment instruction of the loop iterator.
        my $body := @?BLOCK.shift();
        $?BLOCK := @?BLOCK[0];
        for $<statement> {
            $body.push($($_));
        }

If there was a step, we use that value; otherwise, we assume a default step size of "1". Negative step sizes won't work, but if you Feel Lucky, you could go ahead and try. It's not that hard, it's just a lot of work, and I'm too lazy for that now.... ehm, I mean, I leave it as the proverbial exercise to the reader.

        my $step;
        if $<step> {
            my $stepsize := $( $<step>[0] );
            $step := PAST::Op.new( $iter, $stepsize, :pirop('add'), :node($/) );
        }
        else {
            ## default is increment by 1
            $step := PAST::Op.new( $iter, :pirop('inc'), :node($/) );
        }

The incrementing of the loop variable is part of the loop body, so add the incrementing statement to $body.

        $body.push($step);

The loop condition uses the <= operator, and compares the loop variable with the maximum value that was specified.

        ## while loop iterator <= end-expression
        my $cond := PAST::Op.new( $iter, $( $<expression> ), :name('infix:<=') );

Now we have the PAST for the loop condition and the loop body, so now create a PAST to represent the (while) loop.

        my $loop := PAST::Op.new( $cond, $body, :pasttype('while'), :node($/) );

Finally, the initialization of the loop variable should go before the loop itself, so create a PAST::Stmts node to do this:

        make PAST::Stmts.new( $init, $loop, :node($/) );
    }

Wow, we've done it! This was a good example of how to implement a non-trivial statement type using PAST.
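The for-to-while translation is easy to check by hand with a small model. The Python generator below is only a sketch of the semantics described above (inclusive upper bound, default step of 1); the name squaak_for is made up, and nothing here touches PAST.

```python
from itertools import islice


def squaak_for(start, limit, step=1):
    """Yield the values the loop variable takes, following the while-loop expansion."""
    i = start              # var <ident> = <expr1>
    while i <= limit:      # while <ident> <= <expr2> do
        yield i            #     <block>
        i = i + step       #     <ident> = <ident> + <expr3>


print(list(squaak_for(1, 5)))        # [1, 2, 3, 4, 5]
print(list(squaak_for(0, 10, 4)))    # [0, 4, 8]

# A negative step never makes the condition false, so only peek at a few values.
print(list(islice(squaak_for(3, 5, -1), 4)))   # [3, 2, 1, 0]
```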
Operators and Precedence

    rule factor {
        <value> [ <mulop> <value> ]*
    }

    token mulop {
        '*' | '/' | '%'
    }

    rule value {
        | <number>
        | '(' <expression> ')'
    }

This basic expression grammar implements operator precedence by taking advantage of the nature of a recursive-descent parser (if you haven't come across the term, look it up). However, the big disadvantage of parsing expressions this way is that the parse trees can become quite large. Perhaps more importantly, the parsing process is not very efficient. Let's take a look at some sample input. We won't show the parse trees as shown in Episode 2, but we'll just show an outline.
input: 42 results in this parse tree:

    TOP
      expression
        term
          factor
            value
              number
                42

As you can see, the input of this single number will invoke 6 grammar rules before parsing the actual digits. Not that bad, you might think.
input: "1 + 2" results in this parse tree (we ignore the operator for now):

    TOP
      expression
        term
          factor
          | value
          |   number
          |     1
          factor
            value
              number
                2

Only a few more grammar rules are invoked, not really a problem either.
input: "(1 + 2) * 3" results in this parse tree:

    TOP
      expression
        term
          factor
            value
            | expression
            |   term
            |   | factor
            |   |   value
            |   |     number
            |   |       1
            |   term
            |     factor
            |       value
            |         number
            |           2
            value
              number
                3

Right; 16 grammar rules just to parse this simple input. I'd call this slightly inefficient. The point is, implementing operator precedence using a recursive-descent parser is somewhat problematic, and given the fact that there are better methods to parse expressions like these, not the way to go. Check out this nice explanation [1] or look it up.
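The rule counts quoted above can be reproduced with a small recursive-descent parser that records every rule it enters. This Python sketch makes assumptions: only factor, mulop and value appear in the text, so the expression and term rules here (an expression is terms separated by + or -, a term is a single factor) are guesses, and the parser does no error handling.

```python
import re


def count_rule_calls(source):
    """Parse source naively and return the names of the rules invoked, in order."""
    tokens = re.findall(r"\d+|[-+*/%()]", source)
    pos = 0
    calls = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expression():            # assumed: <term> [ <addop> <term> ]*
        calls.append("expression")
        term()
        while peek() in ("+", "-"):
            advance()
            term()

    def term():                  # assumed: <factor>
        calls.append("term")
        factor()

    def factor():                # <value> [ <mulop> <value> ]*
        calls.append("factor")
        value()
        while peek() in ("*", "/", "%"):
            advance()
            value()

    def value():                 # <number> | '(' <expression> ')'
        calls.append("value")
        if peek() == "(":
            advance()
            expression()
            advance()            # skip the closing ')'
        else:
            number()

    def number():
        calls.append("number")
        advance()

    calls.append("TOP")
    expression()
    return calls


print(len(count_rule_calls("42")))            # 6
print(len(count_rule_calls("(1 + 2) * 3")))   # 16
```

Every extra precedence level adds one more rule to the chain that even a lone number has to pass through, which is the cost an operator precedence parser avoids.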
parsing) to top-down. To declare this "switch-back" point, write:

    proto 'term:' is tighter('prefix:-') is parsed(&term) { ... }

The name "term:" is a built-in name of the operator bottom-up parser; it is invoked every time a new operand is needed. The "is parsed" clause tells the parser that "term" (which coincidentally looks like "term:", but you could have named it anything else) parses the operands. Note: it is very important to add an "is tighter" clause to the declaration of the "term:" rule. Otherwise your expression parser will not work! My knowledge here is a bit limited, but I usually define it as "is tighter" relative to the tightest operator defined.
Squaak Operators
We have defined the entry and exit point of the expression (bottom-up) parser; now it's time to add the operators. Let's have a look at Squaak's operators and their precedence. The operators are listed with decreasing precedence (so that high-precedence operators are listed at the top). (I'm not sure if this precedence table is common compared to other languages; some operators may have a different precedence w.r.t. other operators than you're used to. At least the mathematical operators are organized according to standard math rules.)

    unary "-"
    unary "not"
    * / %
    + - ..
    < <= >= > != ==
    and
    or

(".." is the string concatenation operator). Besides defining an entry and exit point for the expression parser, you need to define some operator as a reference point, so that other operators' precedence can be defined relative to that reference point. My personal preference is to declare the operator with the lowest precedence as the reference point. This can be done like this:

    proto 'infix:or' is precedence('1') { ... }

Now, other operators can be defined:

    proto 'infix:and' is tighter('infix:or') { ... }
    proto 'infix:<' is tighter('infix:and') { ... }
    proto 'infix:+' is tighter('infix:<') { ... }
    proto 'infix:*' is tighter('infix:+') { ... }
    proto 'prefix:not' is tighter('infix:*') { ... }
    proto 'prefix:-' is tighter('prefix:not') { ... }

Note that some operators are missing. See the exercises section for this. For more details on the use of the optable, check out docs/pct/pct_optable_guide.pod in the Parrot repository.
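To see how a precedence table drives expression parsing, here is a small evaluator written in Python. It is a sketch of the general "precedence climbing" technique, not of how PGE's optable is implemented; the numeric levels are assumptions that merely mirror the order of the list above, and only integer operands are supported.

```python
import operator
import re

# Infix operators: name -> (level, function). A higher level binds tighter.
INFIX = {
    "or": (1, lambda a, b: a or b),
    "and": (2, lambda a, b: a and b),
    "<": (3, operator.lt), "<=": (3, operator.le),
    ">": (3, operator.gt), ">=": (3, operator.ge),
    "==": (3, operator.eq), "!=": (3, operator.ne),
    "+": (4, operator.add), "-": (4, operator.sub),
    "..": (4, lambda a, b: str(a) + str(b)),
    "*": (5, operator.mul), "/": (5, operator.truediv), "%": (5, operator.mod),
}
PREFIX = {"not": (6, operator.not_), "-": (7, operator.neg)}
TOKEN = re.compile(r"\s*(\d+|\.\.|<=|>=|==|!=|[-+*/%<>()]|and|or|not)")


def evaluate(source):
    tokens = TOKEN.findall(source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def operand():
        tok = take()
        if tok == "(":
            value = expr(1)
            take()                    # closing ')'
            return value
        if tok in PREFIX:
            level, fn = PREFIX[tok]
            return fn(expr(level))    # operand of a prefix operator
        return int(tok)

    def expr(min_level):
        left = operand()
        while peek() in INFIX and INFIX[peek()][0] >= min_level:
            level, fn = INFIX[take()]
            right = expr(level + 1)   # level + 1 makes the operator left-associative
            left = fn(left, right)
        return left

    return expr(1)


print(evaluate("1 + 2 * 3"))     # 7
print(evaluate("(1 + 2) * 3"))   # 9
print(evaluate("10 - 4 - 3"))    # 3
```

Note that a single number is parsed by one call to expr and one call to operand, however many precedence levels the table defines.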
            :pirop($<top><pirop>),
            :lvalue($<top><lvalue>),
            :node($/)
        );
        for @($/) {
            $past.push( $($_) );
        }
        make $past;
    }
}
What's Next?
This episode covered the implementation of operators, which allows us to write complex expressions. By now, most of our language is implemented, except for one thing: aggregate data structures. This will be the topic of Episode 8. We will introduce the two aggregate data types, arrays and hashtables, and see how we can implement these. We'll also discuss what happens when we pass such aggregates as subroutine arguments, and the difference with the basic data types.
Exercises
Problem 1
Currently, Squaak only has grammar rules for integer and string constants, not floating-point constants. Implement this grammar rule. A floating-point number consists of zero or more digits, followed by a dot and at least one digit, or, at least one digit followed by a dot and any number of digits. Examples are: 42.0, 1., .0001. There may be no whitespace between the individual digits and the dot. Make sure you understand the difference between a "rule" and a "token".
Hint: currently, the Parrot Grammar Engine (PGE), the component that "executes" the regular expressions (your grammar rules), matches alternative subrules in order. This means that this won't work:

    rule term {
        | <integer_constant>
        | <float_constant>
        ...
    }

because when given the input "42.0", "42" will be matched by <integer_constant>, and the dot and "0" will remain. Therefore, put the <float_constant> alternative in rule term before <integer_constant>. At some point, PGE will support longest-token matching, so that this issue will disappear.
Solution

    token float_constant {
        [
        | \d+ '.' \d*
        | \d* '.' \d+
        ]
        {*}
    }
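The ordering problem from the hint can be reproduced with Python's re module, which also tries alternatives from left to right. This is an analogy, not PGE itself; the pattern and function names are made up for the example.

```python
import re

INTEGER = r"\d+"
FLOAT = r"(?:\d+\.\d*|\d*\.\d+)"   # same two alternatives as float_constant

# Alternatives are tried left to right; the first one that matches wins.
int_first = re.compile(rf"(?P<int>{INTEGER})|(?P<float>{FLOAT})")
float_first = re.compile(rf"(?P<float>{FLOAT})|(?P<int>{INTEGER})")


def classify(pattern, text):
    """Return which alternative matched and the text it consumed."""
    m = pattern.match(text)
    return m.lastgroup, m.group()


print(classify(int_first, "42.0"))     # ('int', '42')
print(classify(float_first, "42.0"))   # ('float', '42.0')
print(classify(float_first, "42"))     # ('int', '42')
```

With the integer alternative first, ".0" is left unconsumed; with the float alternative first, plain integers still match because the float pattern fails on them and the next alternative is tried.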
Problem 2
Implement the missing operators: (binary) "-", "<=", ">=", "==", "!=", "/", "%", "or"
Solution
For the sake of completeness (and easy copy-paste for you), here's the list of operator declarations as I wrote them for Squaak:

    rule expression is optable { ... }

    proto 'infix:or'   is precedence('1')       is pasttype('unless') { ... }
    proto 'infix:and'  is tighter('infix:or')   is pasttype('if') { ... }

    proto 'infix:<'    is tighter('infix:and') { ... }
    proto 'infix:<='   is equiv('infix:<') { ... }
    proto 'infix:>'    is equiv('infix:<') { ... }
    proto 'infix:>='   is equiv('infix:<') { ... }
    proto 'infix:=='   is equiv('infix:<') { ... }
    proto 'infix:!='   is equiv('infix:<') { ... }

    proto 'infix:+'    is tighter('infix:<')    is pirop('n_add') { ... }
    proto 'infix:-'    is equiv('infix:+')      is pirop('n_sub') { ... }
    proto 'infix:..'   is equiv('infix:+')      is pirop('n_concat') { ... }

    proto 'infix:*'    is tighter('infix:+')    is pirop('n_mul') { ... }
    proto 'infix:%'    is equiv('infix:*')      is pirop('n_mod') { ... }
    proto 'infix:/'    is equiv('infix:*')      is pirop('n_div') { ... }

    proto 'prefix:not' is tighter('infix:*')    is pirop('n_not') { ... }
    proto 'prefix:-'   is tighter('prefix:not') is pirop('n_neg') { ... }
References
docs/pct/pct_optable_guide.pod
References
[1] http://epaperpress.com/oper/download/oper.pdf
Hash Tables and Arrays
A hashtable stores key-value pairs; the key is used as the index under which a value is stored. Keys must be string constants, but the value can be of any type. An example is shown below:

    lastnames{"larry"} = "wall"
    lastnames{"allison"} = "randal"
Array constructors
Just as there are integer literals (42) and string literals ("hello world") that can be assigned to variables, you can have array literals. Below is the grammar rule for this:

    rule array_constructor {
        '[' [ <expression> [',' <expression>]*]? ']'
        {*}
    }

Some examples are shown below:

    foo = []
    bar = [1, "hi", 3.14]
    baz = [1, [2, 3, 4] ]

The first example creates an empty array and assigns it to foo. The second example constructs an array of three elements and assigns it to bar. Note that the elements of one array can be of different types. The third example shows the construction of nested arrays. This means that element baz[1][0] evaluates to the value 2 (indexing starts at 0).
Hashtable constructors
Besides array literals, Squaak supports hashtable literals, which can be constructed through a hashtable constructor. The syntax for this is expressed below:

    rule hash_constructor {
        '{' [<named_field> [',' <named_field>]* ]? '}'
        {*}
    }

    rule named_field {
        <string_constant> '=>' <expression>
        {*}
    }

Some examples are shown below:

    foo = {}
    bar = { "larry" => "wall", "allison" => "randal" }
    baz = { "a" => { "b" => 42} }

The first line creates an empty hashtable and assigns it to foo. The second creates a hashtable with two fields: "larry" and "allison". Their respective values are "wall" and "randal". The third line shows that hashtables can be nested, too. There, a hashtable is constructed that has one field, called "a", and its value is another hashtable, containing a field "b" that has the value 42.
Implementation
You might think implementing support for arrays and hashtables looks rather difficult. Well, it's not. Actually, the implementation is rather straightforward. First, we're going to update the grammar rule for primary:

    rule primary {
        <identifier> <postfix_expression>*
        {*}
    }

    rule postfix_expression {
        | <index> {*}   #= index
        | <key> {*}     #= key
    }

    rule index {
        '[' <expression> ']'
        {*}
    }

    rule key {
        '{' <expression> '}'
        {*}
    }

A primary object is now an identifier followed by any number of postfix expressions. A postfix expression is either a hashtable key or an array index. Allowing any number of postfix expressions lets us nest arrays and hashtables in each other, so that we can write, for instance:

    foo{"key"}[42][0]{"hi"}

Of course, you as a Squaak programmer must make sure that foo is actually a hashtable, and that foo{"key"} yields an array, and so forth. Implementing this is actually quite simple. First, let us see how to implement the action method index.

    method index($/) {
        my $index := $( $<expression> );
        my $past := PAST::Var.new( $index, :scope('keyed'), :viviself('Undef'), :vivibase('ResizablePMCArray'), :node($/) );
        make $past;
    }

First, we retrieve the PAST node for expression. Then, we create a keyed variable access operation, by creating a PAST::Var node and setting its scope to 'keyed'. If a PAST::Var node has keyed scope, then the first child is evaluated as the aggregate object, and the second child is evaluated as the index on that aggregate. But wait! The PAST::Var node we just created has only one child! Here's where the updated action method for primary comes in. This is shown below.

    method primary($/) {
        my $past := $( $<identifier> );
        for $<postfix_expression> {
            my $expr := $( $_ );
            $expr.unshift( $past );
            $past := $expr;
        }
        make $past;
    }

First, the PAST node for identifier is retrieved. Then, for each postfix expression, we get the PAST node, and unshift the (current) $past onto it. Effectively, the (current) $past is set as the first child of $expr. And you know what $expr contains: that's the keyed variable access node that was created in the action method index. After that, $past is set to $expr; either there's another postfix expression, in which case this $past will be set as the first child of that next postfix expression, or the current $past is set as the result object.
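The way primary nests the keyed nodes is easiest to see on a concrete tree. The Python sketch below models each PAST::Var node of keyed scope as a list of the form ["keyed", aggregate, key]; the functions fold_postfix and evaluate, and the use of Python dicts and lists as Squaak aggregates, are assumptions made for illustration.

```python
def fold_postfix(identifier, keys):
    """Nest keyed-access nodes the way the primary action method does."""
    past = identifier
    for key in keys:
        node = ["keyed", key]   # what index or key produces: a node with one child
        node.insert(1, past)    # "unshift": the aggregate becomes the first child
        past = node
    return past


def evaluate(node, env):
    """Evaluate a folded tree against a dict of variables."""
    if isinstance(node, list):
        _, aggregate, key = node
        return evaluate(aggregate, env)[key]
    return env[node]


tree = fold_postfix("foo", ["key", 1, 0])
print(tree)
# ['keyed', ['keyed', ['keyed', 'foo', 'key'], 1], 0]

env = {"foo": {"key": [10, [20, 30]]}}
print(evaluate(tree, env))   # 20
```

The innermost node is the first access (foo{"key"}), so evaluation naturally proceeds from left to right through the postfix expressions.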
Implementing Constructors
To implement the array and hashtable constructors, we're going to take advantage of Parrot's Calling Conventions (PCC). The PCC supports, amongst others, optional parameters, named parameters and slurpy parameters. If you're Dutch, you might think that slurpy parameters make a lot of noise ("slurpen" is a Dutch verb meaning drinking carefully, which you usually do if your beverage is hot, making noise in the process), but you would be wrong. Slurpy parameters store all remaining arguments that have not yet been stored in other parameters (implying that there can only be one slurpy (positional) parameter, and it should come after all normal (positional) parameters). Parrot will automatically create an aggregate to store these remaining arguments. Besides positional slurpy parameters, you can also define a named slurpy parameter, which will store all remaining named arguments, after all normal (named) parameters have been filled. You might be confused by now. Let's look at an example, as this issue is worth a few brain cells to store.

    .sub foo
        .param pmc a
        .param pmc b
        .param pmc c :slurpy
        .param pmc k :named('x')
        .param pmc l :named('y')
        .param pmc m :named :slurpy
    .end

    foo(1, 2, 3, 4, 6 :named('y'), 5 :named('x'), 7 :named('p'), 8 :named('q') )

This will result in the following mapping:

    a: 1
    b: 2
    c: {3, 4}
    k: 5
    l: 6
    m: {"p"=>7, "q"=>8}

So, after the positional parameters (a, b), c is declared as a slurpy parameter, storing all remaining positional arguments. Parameters k and l are declared as named parameters, which have the respective names "x" and "y". Using these names, values can be passed. After the named parameters, there's the parameter m, which is flagged as both named and slurpy. This parameter will store all remaining named arguments that have not yet been stored by the normal named parameters. The interesting parameters for us are "c" and "m". For the positional slurpy parameter, Parrot creates an array, while for the named slurpy parameter a hashtable is created. This happens to be exactly what we need! Implementing the array and hash constructors becomes trivial:

    .sub '!array'
        .param pmc fields :slurpy
        .return (fields)
    .end

    .sub '!hash'
        .param pmc fields :named :slurpy
        .return (fields)
    .end

Array and hashtable constructors can then be compiled into subroutine calls to the respective Parrot subroutines, passing all fields as arguments. (Note that these names start with a "!", which is not a valid Squaak identifier. This prevents us from calling these subs in normal Squaak code.)
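Readers who know Python may recognise slurpy parameters as *args and **kwargs. The sketch below is an analogy rather than Parrot code: the keyword-only parameters x and y stand in for k :named('x') and l :named('y'), and make_array and make_hash are made-up counterparts of the '!array' and '!hash' subs.

```python
def foo(a, b, *c, x, y, **m):
    """Python counterpart of the PIR sub foo: *c is positional slurpy, **m is named slurpy."""
    return {"a": a, "b": b, "c": list(c), "k": x, "l": y, "m": m}


print(foo(1, 2, 3, 4, y=6, x=5, p=7, q=8))
# {'a': 1, 'b': 2, 'c': [3, 4], 'k': 5, 'l': 6, 'm': {'p': 7, 'q': 8}}


def make_array(*fields):
    """All positional arguments, collected into one array."""
    return list(fields)


def make_hash(**fields):
    """All named arguments, collected into one hashtable."""
    return fields


print(make_array(1, "hi", 3.14))                  # [1, 'hi', 3.14]
print(make_hash(larry="wall", allison="randal"))  # {'larry': 'wall', 'allison': 'randal'}
```

As in Parrot, the constructors contain no logic of their own; the calling convention does all the collecting.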
    var a = 0
    var b = []
    var c = {}
    foo(a,b,c)
    print(a, b[0], c{"hi"} ) # prints 0, 1, 2
What's Next?
This was the last episode to discuss implementation details to make Parrot (run) Squaak. After doing this episode's exercises, your implementation should be fairly complete. Next episode will be the last of this series, in which we'll recap what we did, and demonstrate our language with a nice demo program.
Exercises
Problem 1
We've shown how to implement keyed variable access for arrays, by implementing the action method for index. The same principle can be applied to keyed access for hashtables. Implement the action method for key.
Solution

    method key($/) {
        my $key := $( $<expression> );
        make PAST::Var.new( $key, :scope('keyed'), :vivibase('Hash'), :viviself('Undef'), :node($/) );
    }

Problem 2
Implement the action methods for array_constructor and hash_constructor. Use a PAST::Op node and set the pasttype to 'call'. Use the "name" attribute to specify the names of the subs to be invoked (e.g., :name("!array") ). Note that all hash fields must be passed as named arguments. Check out PDD26 for doing this, and look for a "named" method.
Solution

    method named_field($/) {
        my $past := $( $<expression> );
        my $name := $( $<string_constant> );
        ## the passed expression is in fact a named argument,
        ## use the named() accessor to set that name.
        $past.named($name);
        make $past;
    }

    method array_constructor($/) {
        ## use the parrot calling conventions to
        ## create an array,
        ## using the "anonymous" sub !array
        ## (which is not a valid Squaak name)
        my $past := PAST::Op.new( :name('!array'), :pasttype('call'), :node($/) );
        for $<expression> {
            $past.push($($_));
        }
        make $past;
    }

    method hash_constructor($/) {
        ## use the parrot calling conventions to create a hash,
        ## using the "anonymous" sub !hash
        ## (which is not a valid Squaak name)
        my $past := PAST::Op.new( :name('!hash'), :pasttype('call'), :node($/) );
        for $<named_field> {
            $past.push($($_));
        }
        make $past;
    }

Problem 3
We'd like to add a little bit of syntactic sugar for accessing hashtable keys. Instead of writing foo{"key"}, I'd like to write foo.key. Of course, this only works for keys that do not contain spaces and such. Add the appropriate grammar rule (call it "member") that enables this syntax, and write the associated action method. Make sure this member name is converted to a string. Hint: use a PAST::Val node for the string conversion.
Solution

    rule postfix_expression {
        | <key> {*}      #= key
        | <member> {*}   #= member
        | <index> {*}    #= index
    }

    rule member {
        '.' <identifier>
        {*}
    }

    method member($/) {
        my $member := $( $<identifier> );
        ## x.y is syntactic sugar for x{"y"},
        ## so stringify the identifier:
        my $key := PAST::Val.new( :returns('String'), :value($member.name()), :node($/) );
        ## the rest of this method is the same
        ## as method key() above.
        make PAST::Var.new( $key, :scope('keyed'), :vivibase('Hash'), :viviself('Undef'), :node($/) );
    }
Wrap-Up and Conclusion
Review
In Episode 1, we introduced the Parrot Compiler Tools (PCT), gave a high-level feature overview of Squaak, the case study language that we are implementing in this tutorial, and generated a language shell that we use as a foundation to implement Squaak. Episode 2 discussed the general structure of PCT-based compilers. After this, we described each of the four default compilation stages: parse phase, parse tree to PAST, PAST to POST, and POST to PIR. We also added a command line banner and command line prompt to the interactive language shell. In Episode 3, we introduced the full grammar of the Squaak language. We then started implementing the first bits, after which we were able to generate code for (simple) assignments. In Episode 4, we discussed the construction of Parrot Abstract Syntax Tree nodes in more detail, after which we implemented the if-statement and throw-statement. Episode 5 focused on variable declarations and variable scope. We implemented the necessary infrastructure to handle global and local variables correctly. In Episode 6, we continued the discussion of scope, but now in the context of subroutines. After this, we implemented subroutine invocation. Episode 7 extended our grammar to handle complex expressions that allow us to use arithmetic and other operators. We discussed how to use PCT's built-in support for handling operator precedence. In the previous episode, Episode 8, we discussed the grammar and action methods for handling the aggregate data types of Squaak: arrays and hashes. We also touched on the topic of argument passing by reference and by value.
If you followed the tutorial and did the exercises, your implementation should be complete. Although a lot of the implementation was discussed, some parts were left as the proverbial exercise for the reader. This is to stimulate you to get your hands dirty and figure things out for yourself, while the text contained enough hints (in my opinion) to solve the given problems. Sure enough, this approach requires you to spend more time and think for yourself, but I think you're reading all this stuff to learn something. The extra time spent is well worth it, in my opinion.
Now it's time to see what we can do with this language. Squaak is more than just the average calculator example that is often provided in beginners' discussions of parsers; it's a complete programming language.
What's Next?
This is the last episode of the Parrot Compiler Tools tutorial. We showed how we implemented a complete language for the Parrot virtual machine in only a few hundred lines of source code. Surely this is proof that the PCT really is an effective toolkit for implementing languages. At the time of writing, the PCT still lacks efficient support for certain language constructs. Therefore, we focused on the parts that are easy to build with the PCT. Once the PCT is feature-complete, there's bound to be another tutorial on advanced features. Think of object-oriented programming, closures, coroutines, and advanced control flow such as return statements. Most of these can be done already, but are too complex for this tutorial's level.
## calculate the next generation.
sub evolve(thisgen, nextgen)
    var ym1 = height - 1
    var y   = height
    var yp1 = 1
    var yi  = height
    while yi > 0 do
        var xm1 = width-1
        var x   = width
        var xp1 = 1
        var xi  = width
        while xi > 0 do
            var sum = thisgen[ym1][xm1] + thisgen[ym1][x] + thisgen[ym1][xp1]
                    + thisgen[y][xm1]                     + thisgen[y][xp1]
                    + thisgen[yp1][xm1] + thisgen[yp1][x] + thisgen[yp1][xp1]
            nextgen[y][x] = sum==2 and thisgen[y][x] or sum==3
            xm1 = x
            x   = xp1
            xp1 = xp1 + 1
            xi  = xi - 1
        end
        ym1 = y
        y   = yp1
        yp1 = yp1 + 1
        yi  = yi - 1
    end
end

## display thisgen to stdout.
sub display(thisgen)
    var line = ""
    for var y = 0, height do
        for var x = 0, width do
            if thisgen[y][x] == 0 then
                line = line .. "-"
            else
                line = line .. "O"
            end
        end
        line = line .. "\n"
    end
    print(line, "\nLife - generation: ", generation)
end

## main program
sub main()
    var heart   = [1,0,1,1,0,1,1,1,1]
    var glider  = [0,0,1,1,0,1,0,1,1]
    var explode = [0,1,0,1,1,1,1,0,1,0,1,0]
    var thisgen = []
    initboard(thisgen)
    var nextgen = []
    initboard(nextgen)
    spawn(thisgen,3,5,3,3,heart)
    spawn(thisgen,5,4,3,3,glider)
    spawn(thisgen,25,10,3,4,explode)
    while generation <= numgenerations do
        evolve(thisgen, nextgen)
        display(thisgen)
        generation = generation + 1
        ## prevent switching nextgen and thisgen around,
        ## just call evolve with arguments switched.
        evolve(nextgen, thisgen)
        display(nextgen)
        generation = generation + 1
    end
end

## start here.
main()

Note the use of a subroutine "print". Check out the file src/builtins/say.pir, and rename the sub "say" (which was generated by the language shell creation script) to "print".
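The update rule in evolve above (a cell is alive in the next generation if it has exactly three live neighbours, or exactly two and is already alive) may be easier to see in a compact form. The following is a minimal sketch in Python, not Squaak, and not a line-by-line port: it assumes the board is a list of rows of 0/1 values, and it uses modulo indexing for wraparound instead of the running ym1/y/yp1 counters. The names evolve and render are chosen here for illustration only.

```python
def evolve(thisgen):
    """Return the next generation of a wrap-around Game of Life board."""
    height = len(thisgen)
    width = len(thisgen[0])
    nextgen = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Sum the eight neighbours, wrapping around the board edges.
            total = sum(
                thisgen[(y + dy) % height][(x + dx) % width]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            # Same rule as: sum==2 and thisgen[y][x] or sum==3
            alive = total == 3 or (total == 2 and thisgen[y][x] == 1)
            nextgen[y][x] = 1 if alive else 0
    return nextgen


def render(board):
    """Draw live cells as 'O' and dead cells as '-', one row per line."""
    return "\n".join("".join("O" if cell else "-" for cell in row) for row in board)


# A horizontal "blinker" on a 5x5 board flips to vertical and back.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(render(evolve(blinker)))
```

Returning a fresh board rather than filling a second one in place is a simplification; the Squaak program instead alternates between thisgen and nextgen to avoid allocating a new board each generation.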
Solution
If you don't feel like doing the exercises, or just want to see the result without going to any trouble, here is what the output looks like (this is Life generation 9).
-----------------------------------------------------------------------------------------------------------------------------O--------------------------------------OO------------------------------------OO--O--------------------------------------OO-----------------------------------OOOOO------------------OOO------------------------------------O---O----------------------------------O-----O--------------------------------O---O---O-------------------------------O--O-O--O-------------------------------O---O---O--------------------------------O-----O----------------------------------O---O------------------------------------OOO--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Life - generation: 9
But really, it doesn't compare to seeing this program run on Parrot :-)
Exercises
Squaak was designed to be a simple language that offers enough features to get some work done while staying simple. Of course, after reading this tutorial, you are an expert too ;-) If you feel like adding more features, here are some suggestions.
Implement prefix and postfix increment/decrement operators, allowing you to write "generation++" instead of "generation = generation + 1".
Implement augmented assignment operators, such as "+=" and friends.
Extend the grammar to allow multiple variable declarations in one statement, allowing you to write "var x = 1, y, z = 3". Of course, the initialization part should still be optional. How do you make sure that the identifier and initialization expression are kept together?
Implement a mechanism (such as an "import" statement) to include or load another Squaak file, so Squaak programs can be split into multiple files. The PCT does not have any support for this, so you'll need to write a bit of PIR to do this.
Improve the for-statement to allow for a negative step. Note that the loop condition becomes more complex when doing so.
Note that these are suggestions, and I did not implement them myself, so I won't have a solution for you at the end.
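To see why a negative step complicates the loop condition in the for-statement suggestion above, here is a small sketch in Python (not Squaak or PIR). It is only an illustration, not a solution to the exercise: it assumes an inclusive limit, and the helper names loop_condition and run_for are hypothetical. The point is that the comparison has to flip direction depending on the sign of the step.

```python
def loop_condition(i, limit, step):
    """Decide whether a counting loop should run another iteration."""
    if step == 0:
        raise ValueError("step must be non-zero")
    # Counting up stops once i passes the limit from below;
    # counting down stops once i passes it from above.
    return i <= limit if step > 0 else i >= limit


def run_for(start, limit, step=1):
    """Collect the values a for-loop variable would take (limit inclusive)."""
    values = []
    i = start
    while loop_condition(i, limit, step):
        values.append(i)
        i += step
    return values


print(run_for(1, 5))      # counts upwards
print(run_for(5, 1, -2))  # counts downwards in steps of two
```

In a compiler, the sign of the step is often not known until run time, so the generated code would need to perform this check dynamically or restrict the step to a constant.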
http://www.parrotblog.org/2008/03/episode-8-hashtables-and-arrays.html
Other Resources
Parrot Documentation [1]
Parrot Design Docs [2]
Parrot Project Home [3]
References
[1] http://www.parrotcode.org/docs/
[2] http://www.parrotcode.org/docs/pdd/
[3] http://www.parrotcode.org
Licensing
The text of this book is released under the following license:
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License."
Parrot, its source code, and the tools required to build it are released under the terms of the Artistic License 2.0.
License
Creative Commons Attribution-Share Alike 3.0 //creativecommons.org/licenses/by-sa/3.0/