This process is very complex; hence, from both a logical and an implementation point of view, it is customary to partition the compilation process into several phases: logically cohesive operations that take one representation of the source program as input and produce another representation as output. A typical compilation, broken down into phases, is shown in Figure 1.
Figure 1: Compilation process phases.

The initial phases analyze the source program. The lexical analysis phase reads the characters of the source program and groups them into a stream of tokens; each token represents a logically cohesive sequence of characters, such as an identifier, an operator, or a keyword. The character sequence that forms a token is called a "lexeme". Certain tokens are augmented with a lexical value: when an identifier like xyz is found, the lexical analyzer not only returns id, but also enters the lexeme xyz into the symbol table if it is not already there, and returns a pointer to this symbol-table entry as the lexical value associated with this occurrence of the token id. Therefore, the internal representation of a statement like X := Y + Z after lexical analysis is id1 := id2 + id3. The subscripts 1, 2, and 3 are used for convenience; the actual token is id. The syntax analysis phase imposes a hierarchical structure on the token stream, as shown in Figure 2.
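To make the lexical analysis step concrete, the following Python sketch tokenizes a statement like X := Y + Z and maintains a simple symbol table; the token names and the table layout are illustrative assumptions, not part of any fixed standard.

    import re

    # Token classes and their patterns; the names (ID, ASSIGN, PLUS) are
    # illustrative assumptions.
    TOKEN_SPEC = [
        ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),
        ("ASSIGN", r":="),
        ("PLUS",   r"\+"),
        ("SKIP",   r"\s+"),
    ]
    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(source):
        symbol_table = {}              # lexeme -> symbol-table entry index
        tokens = []
        for m in MASTER_RE.finditer(source):
            kind, lexeme = m.lastgroup, m.group()
            if kind == "SKIP":
                continue
            if kind == "ID":
                # Enter the lexeme into the symbol table on first sight and
                # attach the entry index as the token's lexical value.
                index = symbol_table.setdefault(lexeme, len(symbol_table) + 1)
                tokens.append(("id", index))
            else:
                tokens.append((kind, lexeme))
        return tokens, symbol_table

    tokens, table = tokenize("X := Y + Z")
    print(tokens)  # [('id', 1), ('ASSIGN', ':='), ('id', 2), ('PLUS', '+'), ('id', 3)]
    print(table)   # {'X': 1, 'Y': 2, 'Z': 3}

Note how the three distinct identifiers receive the entry indices 1, 2, and 3, mirroring the id1 := id2 + id3 form described above.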
Code Optimization
In the optimization phase, the compiler performs various transformations on the intermediate code in order to improve it, with the goal of producing faster-running machine code.
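As an illustration, the sketch below applies one classic transformation, constant folding, to three-address intermediate code; the tuple form (op, arg1, arg2, result) is an assumed intermediate representation, chosen only for this example.

    def fold_constants(code):
        # code is a list of three-address instructions (op, arg1, arg2, result).
        folded = []
        for op, a, b, result in code:
            if op == "+" and isinstance(a, int) and isinstance(b, int):
                # Both operands are known at compile time: compute the sum
                # now instead of emitting an addition for the target machine.
                folded.append(("=", a + b, None, result))
            else:
                folded.append((op, a, b, result))
        return folded

    # t1 := 2 + 3 ; x := t1 + y   becomes   t1 := 5 ; x := t1 + y
    code = [("+", 2, 3, "t1"), ("+", "t1", "y", "x")]
    print(fold_constants(code))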
Code Generation
The final phase in the compilation process is the generation of target code. This involves selecting a memory location for each variable used by the program and then translating each intermediate instruction into a sequence of machine instructions that performs the same task.
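A minimal sketch of this translation step is shown below for a single three-address addition; the instruction names (LOAD, ADD, STORE) and the single register R0 describe an assumed target machine, not any particular architecture.

    def generate(op, arg1, arg2, result):
        # Translate one three-address instruction into assumed target
        # instructions for a one-register machine.
        if op == "+":
            return [
                f"LOAD  R0, {arg1}",    # bring the first operand into R0
                f"ADD   R0, {arg2}",    # add the second operand
                f"STORE R0, {result}",  # write the sum to the result's location
            ]
        raise NotImplementedError(op)

    # id1 := id2 + id3
    for line in generate("+", "id2", "id3", "id1"):
        print(line)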
Having relatively few passes is desirable from the point of view of reducing compilation time. To reduce the number of passes, several phases must be grouped into one pass. For some of the phases, this is not a major problem. For example, the lexical analyzer and syntax analyzer can easily be grouped into one pass, because the interface between them is a single token; that is, the processing required for a token is independent of other tokens. These phases can therefore be grouped together easily, with the lexical analyzer working as a subroutine of the syntax analyzer, which is in charge of the entire analysis activity. Conversely, grouping some of the phases into one pass is not that easy. Grouping the intermediate and object code-generation phases is difficult, because it is often very hard to perform object code generation until a sufficient number of intermediate code statements have been generated. Here, the interface between the two is not based on a single intermediate instruction: certain languages permit the use of a variable before it is declared, and many languages also permit forward jumps. Therefore, it is not possible to generate object code for a construct until sufficient intermediate code statements have been generated. To overcome this problem and enable the merging of intermediate and object code generation into one pass, a technique called "back-patching" is used: the object code is generated with holes left in the statements, which are filled in later when the missing information becomes available.
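The sketch below illustrates back-patching under assumed instruction names: a conditional forward jump is emitted with a hole in place of its target, and the hole is filled once the target address becomes known.

    code = []

    def emit(instr):
        code.append(instr)
        return len(code) - 1            # address of the emitted instruction

    def backpatch(addr, target):
        # Fill the hole left in the instruction at addr.
        code[addr] = code[addr].replace("<hole>", str(target))

    emit("LOAD  R0, x")
    hole = emit("JZ    R0, <hole>")     # forward jump; target not yet known
    emit("LOAD  R0, y")
    emit("STORE R0, z")
    backpatch(hole, len(code))          # first instruction past the jump
    emit("HALT")

    for addr, instr in enumerate(code):
        print(addr, instr)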
We cannot specify a language's tokens by enumerating each and every identifier, operator, keyword, delimiter, and punctuation symbol; such a specification would span several pages and perhaps never end, especially for languages that do not limit the number of characters an identifier can have. Instead, tokens should be specified by rules that govern how the symbols of the language's alphabet can be combined, so that the result of the combination is a token of the language: one of its identifiers, operators, or keywords. This requires a suitable language-specific notation.
Regular expression notation can be used to specify tokens because tokens constitute a regular set. The notation is compact and precise, and for every regular expression there exists a deterministic finite automaton (DFA) that accepts exactly the language the expression specifies. This DFA can be used to recognize the language specified by the regular expression, making the automatic construction of token recognizers possible. Therefore, the study of regular expression notation and finite automata becomes necessary. Some definitions of the various terms used are given below.
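As a small illustration of the connection between the two notations, the following sketch hand-codes the DFA for identifiers matching the regular expression letter (letter | digit)*; the two states and the transition conditions are spelled out directly in the control flow.

    def is_identifier(s):
        state = "start"
        for ch in s:
            if state == "start" and ch.isalpha():
                state = "in_id"          # first character must be a letter
            elif state == "in_id" and ch.isalnum():
                pass                     # stay in the accepting state
            else:
                return False             # no transition defined: reject
        return state == "in_id"          # accept only in the accepting state

    print(is_identifier("xyz"))  # True
    print(is_identifier("x1"))   # True
    print(is_identifier("1x"))   # False

Notice that is_identifier("") returns False: with no input, the automaton never leaves the start state, which is not accepting.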