
International Journal of Parallel Programming, Vol. 28, No. 5, 2000

Data Dependence Analysis of Assembly Code


Wolfram Amme,(1) Peter Braun,(1) Francois Thomasset,(2) and Eberhard Zehendner(3, 4)
Received July 1999; revised March 2000

Determination of data dependences is a task typically performed with high-level language source code in today's optimizing and parallelizing compilers. Very little work has been done in the field of data dependence analysis on assembly language code, but this area will be of growing importance, e.g., for increasing instruction-level parallelism. A central element of a data dependence analysis in this case is a method for memory reference disambiguation which decides whether two memory operations may access (or definitely access) the same memory location. In this paper we describe a new approach for the determination of data dependences in assembly code. Our method is based on a sophisticated algorithm for symbolic value propagation, and it can derive value-based dependences between memory operations instead of just address-based dependences. We have integrated our method into the Salto system for assembly language optimization. Experimental results show that our approach greatly improves the precision of the dependence analysis in many cases.

KEY WORDS: Data dependence analysis; value-based dependences; memory reference disambiguation; assembly code; monotone data flow frameworks.

1 Computer Science Department, Friedrich Schiller University, D-07740 Jena, Germany.
2 INRIA, Rocquencourt, 78153 Le Chesnay Cedex, France. E-mail: Francois.Thomasset@inria.fr.
3 Computer Science Department, Faculty of Mathematics and Computer Science, Friedrich Schiller University, D-07740 Jena (P.O. Box), Germany.
4 To whom correspondence should be addressed at e-mail: zehendner@acm.org.

1. INTRODUCTION

The determination of data dependences is nowadays most often done by parallelizing and optimizing compiler systems on the level of source code,


e.g., C or FORTRAN 90, or some intermediate code, e.g., RTL. (1) Data dependence analysis on the level of assembly code aims at increasing instruction-level parallelism. Using various scheduling techniques like list scheduling, (2) trace scheduling, (3) or percolation scheduling, (4) a new sequence of instructions is constructed with regard to data and control dependences, and properties of the target processor. Most of today's instruction schedulers only determine data dependences between register accesses and consider memory to be one cell, so that every pair of memory accesses must be assumed to be data dependent. Analyzing memory accesses becomes particularly important when doing global instruction scheduling. (5)

Performing optimizations at the level of assembly code has many benefits. The developer of machine-dependent optimization techniques needs a tool that is easily programmable and exposes (almost) all processor properties. Implementing new techniques in an existing compiler (even an easily retargetable compiler, such as the GNU C Compiler gcc (6)) fails because new processor features may not be expressible within the machine description. In addition, working on assembly code also gives the opportunity to optimize at link-time. Wall (7) gives an overview of systems which perform so-called late code modification; see Srivastava and Wall (8) for a full description of such techniques. Another aspect concerns optimization of delivered code. Fisher (9) suggests translating an executable program during loading, e.g., replacing multimedia extensions by ``normal'' instructions or vice versa. When processors of the same family differ in the number of functional units or number of registers, rescheduling (10) can be used to optimize for the new processor. A program can be made executable on a completely different processor by the use of binary translation. (11) In all these areas, a better analysis of assembly code and more precise data dependence information would offer significant benefits.

In this paper, we describe an intraprocedural value-based data dependence analysis. (12, 13) When analyzing data dependences in assembly code we must distinguish between accesses to registers and those to memory. In both cases we derive data dependences from reaching definitions and reaching uses (cf. Section 3), information that we obtain by a monotone data flow analysis. Register analysis does not involve any complication: the set of used and defined registers in one instruction can be established easily because registers do not have aliases. Therefore, determination of data dependences between register accesses is not in the scope of this paper. For memory references we have to solve the aliasing problem, cf. Landi and Ryder (14) and Wall (15): decide whether two memory references access the same location. We have to prove that two references always point to the same location (must-alias) or must show that they never refer to the


same location. If we cannot prove the latter, we would like to have a conservative approximation of all alias pairs (may-alias), i.e., memory references that might refer to the same location. To derive all possible addresses that might be accessed by one memory instruction, we developed a symbolic value propagation algorithm. To compare symbolic values for memory addresses we use a modification of the GCD test. (16)

We implemented our technique in the context of the Salto tool. (17) Salto is a framework to develop optimization and transformation techniques for various processors. The user describes the target processor using a mixture of RTL and C language. A program written in assembly code can then be analyzed and modified using an interface in C++. Salto already contains some kind of conflict analysis, (18) but it only determines address-based dependences between register accesses and assumes memory to be one cell. The technique we present in this paper goes far beyond that. Experimental results indicate that our analysis can be more precise in the determination of data dependences than other previous methods.

The rest of this paper is structured as follows: In Section 2 we introduce our programming model. Section 3 gives a brief introduction to the field of data dependence analysis and alias analysis in assembly code. Section 4 describes the concept of monotone data flow systems, which is the theoretical basis of our approach. We present our method in detail in Section 5. Section 6 shows experimental results, in Section 7 we discuss related work, and in Section 8 we conclude with an outlook to further developments.

2. PROGRAMMING MODEL AND ASSUMPTIONS

In the following we assume a RISC instruction set that is strongly influenced by the SPARC architecture. Note however that our analysis is not limited to the SPARC. Memory is only accessed through load (ld) and store (st) instructions. Memory references can have the following formats: (a) mem = %rx + %ry or (b) mem = %rx + offset. Use of a scaling factor is not provided in this model, but adding one would not be difficult. Our method supports memory instructions that read or write blocks of any reasonable size. [Note: We haven't looked at general memory-to-memory instructions yet, but that will be a topic for a further study. Since memory blocks loaded to registers will always be small compared to the whole address space, our method is sufficient at least in this respect.] For global memory access, the address (which is a label) first has to be moved to a register; then the corresponding memory block can be read or written using a memory instruction. Initialization of registers or copying the contents of one register to another can be done by the mov instruction. Each logical or


arithmetic instruction has the following format: op src1, src2, dest. The operation op is executed on operands src1 and src2; the result is written to register dest. An operand can be a register or an integer constant. Our method requires that any register directly or indirectly involved in address calculation must not be changed by exception handling, and that a value written to such a register can be retrieved afterwards.

2.1. Arithmetic on Addresses

We have to take care of the fact that computation of addresses may wrap around beyond 2^L, where L is the number of bits used to represent addresses. Arithmetic operations in the processor are based on modulo arithmetic; we assume that all integer registers have the same width L and calculation is performed modulo 2^L, with wrap-around where applicable. Therefore we prescribe the calculation of the coefficients from Section 5 to be performed modulo 2^L, either in signed or in unsigned binary arithmetic.

2.2. Control Flow

Control transfer in the program might be by either unconditional (b) or conditional (bcc) branch instructions, or by call instructions. Control flow is modeled using an intraprocedural control flow graph whose nodes stand for individual instructions, not basic blocks. Our method is able to handle any control flow graph that starts at a single entry node, including irreducible graphs. However, the validity of the results relies on the assumption that any actual control path in the program is compatible with a path in the control flow graph; i.e., when control leaves the intraprocedural control flow graph at a certain node n by a call instruction or by an exception, execution necessarily resumes at a successor node of n.

2.3. Memory Classes

Runtime memory is divided into three classes: static or global memory, stack, and heap memory. (19) When an address unequivocally references one of these classes, some simple memory reference disambiguation is feasible, cf. Section 3. However, to apply this analysis technique, we would have to rely heavily on compiler conventions. In the current state of our implementation we make no use of memory classes; in the future our tool could be parameterized by any compiler conventions as well as by the target machine.
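To make the wrap-around arithmetic of Section 2.1 concrete, here is a minimal sketch (our illustration, not part of the paper's tool; it assumes L = 32): storing coefficients in a fixed-width unsigned type performs the reduction modulo 2^L implicitly, so address differences remain meaningful even across a wrap.

    #include <cassert>
    #include <cstdint>

    int main() {
        // All arithmetic on uint32_t is performed modulo 2^32 by definition,
        // mirroring a processor that wraps around on address calculations.
        uint32_t base   = 0xFFFFFFF0u;  // an address near the top of the space
        uint32_t offset = 0x20u;
        uint32_t addr   = base + offset;  // wraps: (2^32 - 16 + 32) mod 2^32
        assert(addr == 0x10u);
        assert(addr - base == offset);    // differences remain meaningful
        return 0;
    }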


3. DATA DEPENDENCES IN ASSEMBLY CODE

In assembly programs we have two classes of locations in which a program can store data: registers and memory. For a definition of data dependences we can merge these classes and treat them both as memory locations: A statement S2 has a value-based data dependence (12, 13) on a statement S1 if S2 can be reached after the execution of S1, and the following conditions hold:

(i) Both statements access a common memory location l.
(ii) At least one of them writes to l.
(iii) Between the execution of S1 and S2, there is no statement S3 that also writes to l.

A data dependence is called a flow dependence if S1 writes to l whereas S2 reads from l. It is called an anti-dependence if S1 reads from l and S2 writes to l. It is an output dependence if both statements write to l. S1 and S2 are in conflict if conditions (i) and (ii) hold, whereas (iii) may or may not be fulfilled. Another name for a conflict is address-based data dependence. [Note: Feautrier (12) and Pugh and Wonnacott (13) call this a memory-based data dependence.] Conflict analysis is a frequently used approximation of data dependences.

The determination of data dependences can be achieved by different means. The most commonly used for flow dependences or output dependences is the calculation of reaching definitions. Finding reaching definitions (19) means detecting all statements where the value of a specific memory location could have been written last. Once the reaching definitions have been determined, we are able to infer def-use and def-def associations; a def-use pair of statements indicates a flow dependence between them, and a def-def pair shows an output dependence. For finding anti-dependences, we take a dual approach. We determine what we call reaching uses for all statements. A statement s is a reaching use for another statement t with respect to a specific memory location if the contents of this location is read by s and may still be there when the control flow reaches t. From reaching uses we can infer use-def associations, each indicating an anti-dependence.

Data dependence analysis of registers causes no problems. The set of used and defined registers can be established for each instruction by its semantics. For memory references we have to solve the aliasing problem, i.e., we have to determine whether two memory references access the same location. In the following we briefly review techniques for alias analysis of memory references.


Fig. 1. Sample SPARC code for different techniques of alias detection: (a) and (b) can be solved by instruction inspection, whereas (c) needs a sophisticated analysis.

Doing no alias analysis leads to the assumption that a load instruction is always dependent on a store instruction, and a store instruction is always dependent on any memory instruction. A common technique in compile-time instruction schedulers is alias analysis by instruction inspection, where the scheduler looks at two instructions to see if it is obvious that different memory addresses are referenced. With this technique, independence of the memory references in Fig. 1a can be proved because the same base register but different offsets are used; by using some compiler conventions, we might show independence of the memory references in Fig. 1b, since different memory classes are referenced. Figure 1c shows an example where these techniques fail. By looking only at the second and the third statement, we miss the independence of these statements, since we ignore the relation between the registers %o1 and %fp that could be inferred from the first statement. This example makes it clear that a two-fold improvement is needed. First, we need to save information about address arithmetic, and second, we need some kind of copy propagation. Provided that we have such an algorithm, it would be easy to show that in the second statement register %o1 has the value %fp-20 and thus there is no overlap between the 4 bytes wide memory blocks starting at %o1-4 and %fp-20.

4. MONOTONE DATA FLOW ANALYSIS

The most important passes of our analysis are performed by data flow analysis. (20) Therefore, several parts of the analysis have been described as a monotone data flow framework. Formally, a monotone data flow framework is a triple MDFS = (DF, ∧_DF, F). DF is called the data flow information set, which is a formal description of the information that will be propagated through the control flow graph. The meet operator ∧_DF of a data flow framework describes the effect of joining paths in the flow graph, i.e., how the incoming data flow information is computed for a node with more than one predecessor. F ⊆ {f : DF → DF} is the set of our semantic functions. To each node of the control flow graph we assign one of these semantic functions, which specifies how the outgoing data flow information is derived from the incoming data flow information.


If the semantic functions are monotone and (DF, ∧_DF) forms a bounded semi-lattice with a top element and a bottom element, (16) we can use a general iterative algorithm (20) that determines for each statement of the control flow graph an element d ∈ DF that is a safe approximation of the data flow information reaching the statement. The solution of a monotone data flow framework thus can be expressed as a function mdfs: STMT → DF. The initial data flow information at each statement is the top element of the corresponding semi-lattice. We define a relation ⊑ on the semi-lattice (DF, ∧_DF) of a monotone data flow framework by

    ∀a, b ∈ DF: a ⊑ b ⟺ a ∧_DF b = a
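As an illustration of how such a framework can be solved, the following generic worklist solver is a minimal sketch (our own, not the paper's implementation; the graph representation and all names are assumptions). It starts every node at the top element and repropagates until a fixed point is reached, which monotonicity and boundedness guarantee to exist.

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <vector>

    // Generic solver for a monotone data flow framework; in[s] approximates
    // the data flow information reaching statement s (the entry node simply
    // keeps the top element here, a simplification of a real solver).
    template <typename DF>
    std::vector<DF> solve_mdfs(
        const std::vector<std::vector<int>>& preds,  // predecessors per node
        const std::vector<std::vector<int>>& succs,  // successors per node
        DF top,
        std::function<DF(const DF&, const DF&)> meet,
        std::function<DF(int, const DF&)> transfer,  // semantic function f_s
        std::function<bool(const DF&, const DF&)> equal)
    {
        std::size_t n = preds.size();
        std::vector<DF> in(n, top);
        std::deque<int> worklist;
        for (std::size_t s = 0; s < n; ++s) worklist.push_back((int)s);
        while (!worklist.empty()) {
            int s = worklist.front(); worklist.pop_front();
            DF d = top;                              // join over predecessors
            for (int p : preds[s]) d = meet(d, transfer(p, in[p]));
            if (!equal(d, in[s])) {                  // changed: revisit successors
                in[s] = d;
                for (int t : succs[s]) worklist.push_back(t);
            }
        }
        return in;
    }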

Monotonicity of the semantic functions can be expressed using the relation ⊑. A semantic function f ∈ F is monotone iff

    ∀a, b ∈ DF: a ⊑ b ⟹ f(a) ⊑ f(b)

Note that for a statement s and for all e ∈ DF that may reach statement s, the solution obtained when applying the general algorithm satisfies mdfs(s) ⊑ e.

4.1. Description of Semantic Functions

A unifying description of semantic functions can simplify proving monotonicity. If there is a suitable definition of a difference operator ∖, the general form of a semantic function is given as (5)

    f_s(D) = (D ∖ K_s(D)) ∧_DF G_s(D)

where D stands for the data flow information reaching statement s, an application of K_s(D) returns the part of the incoming data flow information that is destroyed by the execution of statement s, and G_s(D) describes the data flow information that is introduced by the execution of s.

5. DETERMINATION OF DATA DEPENDENCES

Figure 2 shows an overview of our data dependence analysis for machine code. Our technique determines data dependences for memory accesses (and for register accesses) by the calculation of reaching definitions and uses (RDU). For registers the determination of reaching definitions
5. K_s(D) is a ``kill function,'' and G_s(D) is a ``generate function.''


Fig. 2. Overview of the determination of data dependences.


and uses can be performed by a well-known standard algorithm. (19) To use this algorithm for data dependence analysis of memory accesses we have to derive may-alias information; in principle, we have to check whether two storage accesses could refer to the same storage object. To improve the precision of the data dependence analysis, must-alias information is needed, i.e., we have to find out whether two storage accesses always refer to the same storage object. [Note: We calculate neither must-dependences nor a full may-alias or must-alias relation in this paper, although our data flow information would permit us to do so.]

5.1. Informal Description of the Method

The main task of our analysis concerns the determination of memory addresses. Therefore, we have to determine all possible values of registers in memory expressions. In some rare cases solving this problem is trivial, e.g., when a constant is moved to a register, or when adding two registers whose values are known. In contrast, there are situations in which it is in general impossible to derive the value of a register, e.g., when the register is defined by a load instruction. In Section 8, we describe an approach for alleviating this restriction. As we want to perform a precise and safe analysis, we have to look for a way to work with these as yet unknown values.

5.1.1. Symbolic Value Sets

In our approach we use the concept of symbolic values. A statement j is called an initialization point if j defines a register with a value we are not able to relate to a constant or to the values residing in other registers. We distinguish artificial initialization points from natural initialization points in our programming model. Natural initialization points are load instructions, call nodes, entry nodes of a procedure, or instructions operating on their operands in a nonlinear way. An artificial initialization point is a kind of dummy assignment that is inserted into the original program by our method, as explained later. If an initialization point j defines the contents of register r_i we refer to this unknown value through the symbol R_{i,j}.

In this method we calculate possible symbolic value sets for each register and each program statement. A symbolic value set is a set of linear polynomials, where each polynomial stands for a possible content of the associated register. Variables of such polynomials are represented by the symbols R_{i,j}, i.e., the definition values of initialization points. Figure 3 shows the calculation of symbolic value sets for a simple assembly code with control flow. For each statement we determined symbolic value sets


Fig. 3. Results of a symbolic value set propagation for a simple procedure code with control flow. Registers r_i that are not mentioned have the value {R_{i,0}}.

that describe the register contents immediately before the execution of the statement.

There could be a register defined inside a loop or, more generally, inside a cycle of the control flow graph, whose contents would change in each iteration in a predictable way, e.g., by an increment instruction. [A new iteration of a cycle is started when passing a reference point in the cycle, called the cycle head in the sequel.] Then our data flow propagation algorithm might not terminate. However, by limiting the cardinality of each symbolic value set we can enforce termination. One method to achieve this goal is k-bounding, where the data flow information becomes the bottom element ⊥ of the semi-lattice whenever no symbolic value set with at most k elements would constitute a safe approximation. In fact, inflation of unbounded symbolic value sets can also happen in other situations, for instance through the control flow induced by deeply nested conditionals. For this reason, we are motivated to employ k-bounding anyway, although we tend to use not too small values for k; usually the precision of the analysis rises when we take a larger k. Concerning cycles, this decision has two consequences: First, our analysis, although guaranteed to terminate, could still need a large number of iterations for stabilizing. Second, in most cases we would not retain useful information for registers changing inside a cycle, since nothing can be inferred from the value ⊥.

Compatible with k-bounding (both techniques can be used together or independently), we introduce the concept of artificial initialization points to achieve fast termination as well as high precision with respect to cycles. An artificial initialization point is a kind of dummy assignment to a register, inserted between an entry point to a cycle in the control flow graph and the first instruction after the entry that corresponds to a machine instruction. This dummy assignment causes the symbolic value set


of the register in question to be set to a symbol, as with a natural initialization point. Thus changes to the data flow information for this register, caused by running once through the whole cycle, are not propagated to the next iteration, forcing early termination of the analysis. [Note as a drawback that without the use of induction variables, some of the available information may be obscured. See also our discussion in Sections 6-8.] Artificial initialization points therefore only make sense for registers that change inside the cycle. As another advantage, to a certain amount we can compare symbolic value sets for memory addresses derived from registers that change inside a cycle.

Figure 4 shows the results of a symbolic value set propagation using artificial initialization points for a simple loop code. [See Fig. 6 for the complete source and assembly code.] Changing registers of the loop are r0, r1, r2, and r3. For each changing register an artificial initialization point could be inserted into the program. However, for reasons explained in Section 5.2, a single initialization point is sufficient here. As a consequence, the data flow algorithm terminates already during the second iteration. Moreover, without the artificial initialization point for register r0, the value of register r1 would eventually have been set to ⊥, disabling any dependence analysis. We continue to discuss this example in Section 5.6.

5.1.2. Symbolic Address Sets

Calculation of symbolic value sets for registers is necessary to determine the symbolic address set of a memory statement. In a subsequent step of our analysis we use symbolic address information and information of control flow for the determination of must-alias information as well as reaching definitions and uses. From these and from may-alias information, we derive data dependence information of memory accesses in the last step. In order to obtain all this information we need a mechanism which checks whether the index expressions of two storage accesses X and Y

Fig. 4. Symbolic value set propagation for a loop. Register contents are only mentioned after they changed.


could (or definitely) represent the same value. To solve this problem, we replace the appearances of registers in X and Y with elements of their corresponding symbolic value sets, and check for all possible combinations whether the equation X − Y = 0 has a solution. For an example, we refer to Fig. 3. Obviously, instruction 1 is a reaching use of memory in instruction 11. The derived memory addresses are R_{31,0} + 68 for instruction 1, and R_{31,0} + 72 or R_{31,0} + 76 for instruction 11. Given the assumption that both instructions access memory words of 4 bytes, we can prove that disjoint memory locations will be concerned, thus showing independence of these memory instructions. When accessing nonaligned memory blocks or blocks of different length, the independence test becomes more demanding. We then have to check for possible intersections of memory blocks, meaning that we show the infeasibility of a constraint problem.
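In the simplest case, where both addresses share the same symbolic base, the comparison reduces to an interval test on the constant offsets. A small sketch (ours; offsets and sizes taken from the example above):

    #include <cassert>
    #include <cstdint>

    // Two accesses with the same symbolic base (here R_{31,0}) touch the byte
    // ranges [off, off+size); they are independent iff the ranges are disjoint.
    bool blocks_disjoint(int64_t off1, int64_t size1,
                         int64_t off2, int64_t size2) {
        return off1 + size1 <= off2 || off2 + size2 <= off1;
    }

    int main() {
        // Instruction 1 reads 4 bytes at R_{31,0}+68; instruction 11 accesses
        // 4 bytes at R_{31,0}+72 or R_{31,0}+76: no overlap in either case.
        assert(blocks_disjoint(68, 4, 72, 4));
        assert(blocks_disjoint(68, 4, 76, 4));
        return 0;
    }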

5.2. Handling Cycles in the Control Flow Graph

The chosen front-end, Salto in our case, provides us with a classical control flow graph whose nodes are basic blocks. During a depth-first traversal of the given control flow graph we expand each basic block to its corresponding sequence of instructions, thus simplifying our implementation. Each elementary instruction constitutes a separate node in the resulting expanded control flow graph. These nodes are placed into an array in reverse postorder. The index of a node in the array is called its depth-first number (19) and serves to identify the node. For each node s we store some information concerning the node, in particular the set of control flow predecessors pred(s). The new structure supports efficient depth-first traversals of the control flow graph or some subgraph, needed during data flow forward propagation and some preparative phases.

Next we identify simple cycles in the control flow graph. For each cycle we take the node with the least depth-first number as a reference point, called the head of the cycle. Several cycles can share a single head. It is not necessary to identify individual cycles. Instead, we calculate the set of all nodes belonging to any of the simple cycles that share a head; we call this set of nodes a cycle pack. [In a reducible control flow graph, a cycle pack is simply a loop body.]

We can identify each cycle head as the destination of a control flow back edge with respect to the reverse postorder. (21) An arc (f, h) in the control flow graph is called a back edge iff h ≤ f in the reverse postorder. Thus h is a cycle head iff h ≤ f for some arc (f, h). To find out whether h is a cycle head, we check for h ≤ f over all predecessor nodes f of h in


Fig. 5. Sample loop codes motivating initialization points.

the control flow graph. Remember that we stored a description of these predecessors with h in our representation. The source of a back edge targetting h we call a back node of h. It is convenient to also store a description of the back nodes along with the node information of h, since we have to scan over these nodes in succeeding phases of our analysis. Formally, the set of back nodes is given by back(h) = {f ∈ pred(h) : h ≤ f}.

5.2.1. Placing Artificial Initialization Points

Placement of artificial initialization points at cycle heads, only, is sufficient to force termination. To determine for which registers we need initialization points, we analyze the definitions and uses of the registers inside a cycle pack to some depth. We find the nodes belonging to a cycle pack by backward depth-first search, starting from the sources of the back edges corresponding to the cycle head. As a side-effect, we detect all entries to the cycle pack. An artificial initialization point is required at the head of a cycle pack for a register that is defined within the cycle pack and that satisfies either of the following conditions:

1. The register is also used within the cycle pack, and a use at one iteration sees a definition from a previous iteration. This may, for instance, happen when the register is used before being defined (registers r1 and r2 in Fig. 5a) or when use and definition appear in different branches (register r1 in Fig. 5b). (6)

2. The register is used after exit from the cycle pack, and such a use sees a definition from an iteration not being the last one before leaving the cycle pack (register r2 in Fig. 5c).
6. We prefer an intermediate notation instead of assembly code here because it is easier to read.


To free the analysis from searching through the whole control flow graph following a cycle pack, (7) we relax the conditions concerning the second case, and prescribe an artificial initialization point for a register already when some code following a cycle pack could observe a definition from an iteration not being the last one before leaving the cycle pack. We call a node e contained in a cycle pack an exit node of the cycle pack when the control flow may leave the cycle pack there, i.e., when some node n outside the cycle pack has e as a predecessor. By exit(h) we denote the set of all exit nodes corresponding to cycle head h.

We now introduce basic definitions that we need for the formal description of our method. Let REGS be the set of all registers, and STMT the set of all statements. Furthermore, let defreg and usereg be defined as follows:

    defreg: STMT → P(REGS),   statement s may write to all registers in defreg(s),
    usereg: STMT → P(REGS),   statement s may read from any register in usereg(s).

As mentioned before, defreg and usereg can be derived directly from the semantics of the instruction set of the supposed target processor. The set of all symbols is denoted as SYM. Note that all symbols that we introduced for initialization points are elements of SYM. The set of all initialization points of a program is given by the image of the function ip,

    ip: SYM → STMT,   symbol x is defined by statement ip(x)

To determine the necessity of initialization points at the head h of a cycle pack, we investigate definitions and uses of registers on all control flow paths that completely belong to the cycle pack (excluding back edges with respect to h), starting at h and ending at a node n. This includes the effects induced by n itself. We calculate the following information:

    φ(n) = set of registers r defined on every path to n,
    μ(n) = set of registers r used on some path to n but not defined before this use,
    δ(n) = set of registers r defined on at least one path to n
7. This would imply a liveness analysis at the exit nodes.


We compute this information by a MDFS, where the domain of the semi-lattice is the Cartesian product

    P(REGS) × P(REGS) × P(REGS)

The meet operator is (8)

    (φ, μ, δ) ∧ (φ′, μ′, δ′) = (φ ∩ φ′, μ ∪ μ′, δ ∪ δ′)

inducing (REGS, ∅, ∅) as top element and (∅, REGS, REGS) as bottom element. The semantic function for the cycle head h is

    f_h = (defreg(h), usereg(h), defreg(h))

that for any other node n in the cycle pack is

    f_n(φ, μ, δ) = (φ ∪ defreg(n), μ ∪ (usereg(n) ∖ φ), δ ∪ defreg(n))

Initialization points are only needed for registers from the set

    ⋃_{f ∈ back(h)} δ(f) ∩ ( ⋃_{f ∈ back(h)} μ(f) ∪ ⋃_{e ∈ exit(h)} δ(e) )

Although this is only a heuristic at the moment, we take for granted that on average our decisions will not harm precision. In some cases it might be more advantageous not to insert an artificial initialization point for a certain register, or to place it at another node. We doubt, however, that improved strategies could be implemented without increasing analysis time by a large factor.
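A sketch of how the final register set could be computed from the solution of this MDFS (our illustration; phi, mu, and delta stand for the three components defined above):

    #include <set>
    #include <vector>

    struct RegInfo {
        std::set<int> phi;    // registers defined on every path to the node
        std::set<int> mu;     // registers used on some path before any definition
        std::set<int> delta;  // registers defined on at least one path
    };

    // Registers needing an artificial initialization point at the cycle head:
    // defined somewhere in the cycle pack (delta at a back node) and observed
    // again, either by a later iteration (mu at a back node) or past an exit.
    std::set<int> init_points_needed(const std::vector<RegInfo>& info,
                                     const std::vector<int>& back_nodes,
                                     const std::vector<int>& exit_nodes) {
        std::set<int> defined, observed;
        for (int f : back_nodes) {
            defined.insert(info[f].delta.begin(), info[f].delta.end());
            observed.insert(info[f].mu.begin(), info[f].mu.end());
        }
        for (int e : exit_nodes)
            observed.insert(info[e].delta.begin(), info[e].delta.end());
        std::set<int> result;
        for (int r : defined)
            if (observed.count(r)) result.insert(r);
        return result;
    }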

5.3. Symbolic Value Propagation

To describe functional relations between register contents and certain initial values represented by symbols from SYM, we use polynomials that are linear in the symbols. Each such polynomial can be described formally as a mapping from the symbols to the integers. For representing a constant additive term we introduce ε as an artificial symbol: SYM_ε = SYM ∪ {ε}. Then, the space SV (symbolic values) of all formal linear polynomials in the symbols with coefficients from the integers can be described as
8. Edges entering a cycle pack must not be considered.


follows: SV = [SYM_ε → Z]. A more familiar representation of such a polynomial is as a formal sum

    v = a + Σ_h a_h · x_h,   with a, a_h ∈ Z, x_h ∈ SYM

5.3.1. Free Symbols

We describe the set of symbols with nonzero coefficients in a polynomial (the so-called free symbols) by a function

    free: SV → P(SYM),   free(v) = {x ∈ SYM : v(x) ≠ 0}

We write v = a as a shortcut when free(v) = ∅ and v(ε) = a.

5.3.2. Operations

Since we want to stay within the space of linear polynomials, only some of the known operations on polynomials are feasible:

    ∀v, w ∈ SV, x ∈ SYM_ε: (v ⊕ w)(x) = v(x) + w(x),
    ∀v, w ∈ SV, x ∈ SYM_ε: (v ⊖ w)(x) = v(x) − w(x),
    ∀v ∈ SV, x ∈ SYM_ε: (⊖v)(x) = −v(x),
    ∀v ∈ SV, x ∈ SYM_ε, c ∈ Z: (c ⊗ v)(x) = (v ⊗ c)(x) = c · v(x)

We use SV_⊥ = SV ∪ {⊥} to represent unknown or approximated values. All coefficients ought to be residues because the machine calculates modulo 2^L. However, since addition, subtraction, and multiplication provide ring homomorphisms between Z and Z_{2^L}, we can pretend to calculate in Z and perform the reduction implicitly during the dependence test. In our implementation this reduction is implicit in an even more natural way because we store the coefficients to registers of width L bits.

5.3.3. A Bounded Semi-Lattice

Since we abstract from the predicates of conditionals, we must work with sets of values from SV. To do an effective analysis each such set should contain only a small number of elements. An appropriate ``bounding concept'' is introduced via the space SVS = {S ⊆ SV : |S| ≤ l} (symbolic value sets), where l is some predetermined natural number. The size of l influences the precision of the analysis and should be chosen carefully to fit the demands. Furthermore, since ⊥ stands for any value, we can restrict in this respect to a single set containing ⊥ alone, SVS_⊥ = SVS ∪ {{⊥}}. [We use {⊥} instead of ⊥ to have the opportunity to enumerate sets from SVS or to build Cartesian products from them.]


On SVS_⊥ we define a meet operator ∧_SVS: SVS_⊥ × SVS_⊥ → SVS_⊥ by

    S ∧_SVS T = { S ∪ T,  if S, T ∈ SVS and |S ∪ T| ≤ l;
                  {⊥},    otherwise.

(SVS_⊥, ∧_SVS) is a bounded semi-lattice with bottom element {⊥} and top element ∅. It seems that making the meet operator more precise than here would not be of interest for our analysis.

5.3.4. Approximation of the Values in Registers

What we are now looking for is an approximation of the values that the registers may contain when reaching a certain statement. This approximation must be safe, in the sense that not every value found during the analysis necessarily appears during runtime, but all values that do in fact show up have to be considered in the approximating set. We define

    regvals: STMT → [REGS → SVS_⊥], (9)   (r, v) ∈ regvals(s) ⟺ register r may contain value v when read by statement s,

and calculate regvals by the following MDFS:

    Data flow information set: RSV = [REGS → SVS_⊥].
    Top element: ∅.
    Meet operator ∧_RSV: RSV × RSV → RSV,   ∀r ∈ REGS: (f ∧_RSV g)(r) = f(r) ∧_SVS g(r).
    Semantic functions:
        K_s(D) = defreg(s) × SVS_⊥,
        G_s(D) = {(r_i, F(D, g_i, r_{i1},..., r_{ik})) : r_i ∈ defreg(s)}   (abbreviation: [g_i; r_{i1},..., r_{ik}]_i)

where the functions g_i depend on the instruction performed, and

    F(D, g, r_{i1},..., r_{ik}) = ∧_SVS { ĝ(v_1,..., v_k) : (r_{ij}, v_j) ∈ D }

in the sequel, with ĝ(v_1,..., v_k) = {g(v_1,..., v_k)}.

9. Actually, the information at each program point is a function from the registers, but we found it convenient to define it as a relation, so as to simplify the definition of the semantic functions.


Semantic functions have to be defined for any instruction. The most important cases here are:

    mov a, ri:       [a; ]_i with a() = a,
    mov rj, ri:      [id; r_j]_i with id(v) = v,
    init ri:         [unknown; ]_i with unknown() = R_{i,s} ∈ SYM,
    ld [mem], ri:    [unknown; ]_i,
    entry:           [unknown; ]_i for all registers r_i,
    call:            [unknown; ]_i for a suitable subset of the registers,

    add rj, rk, ri:  [add; r_j, r_k]_i with

                     add(v, w) = { ⊥,      if v = ⊥ or w = ⊥;
                                   v ⊕ w,  otherwise,

    sub rj, rk, ri:  [sub; r_j, r_k]_i with

                     sub(v, w) = { ⊥,      if v = ⊥ or w = ⊥;
                                   v ⊖ w,  otherwise,

    mul rj, rk, ri:  [mul; r_j, r_k]_i with

                     mul(v, w) = { 0,      if v = 0 or w = 0;
                                   c ⊗ w,  if v = c ∈ Z and w ≠ ⊥;
                                   v ⊗ c,  if v ≠ ⊥ and w = c ∈ Z;
                                   ⊥,      otherwise,

    div rj, rk, ri:  [div; r_j, r_k]_i with

                     div(v, w) = { v,      if w = 1;
                                   ⊖v,     if v ≠ ⊥ and w = −1;
                                   ⊥,      otherwise.

There are variants of these instructions with integer constants used instead of operand registers; these are treated analogously, one of the polynomials then degenerating to the constant additive term. We have D_s^out = D_s^in if defreg(s) = ∅. For all other instructions not treated so far we set [unknown; ]_i for all registers r_i with r_i ∈ defreg(s). Most of these functions


with some programming effort could be modeled a bit more precisely (e.g., shift left like mul, shift right like div, exact handling of Boolean operators, etc.). We already added this to our work list. For division, we only handle trivial cases since in most cases we cannot find a useful approximation for the remainder. But even if we could, we would have to multiply by reciprocals with respect to Z_{2^L}, causing much effort in calculation time or storage space. Note also that from y = 2 ⊗ x we could not infer y/2 = x in Z_{2^L}; to see that, take x = 2^{L−1}, for instance.

The reader might doubt that a symbol propagated in different registers to an add instruction that uses both registers as operands has the same meaning in the polynomials and thus can be added. The feasibility of this operation follows from the fact that initialization points introduce new incarnations of symbols in each iteration for all such registers.

Remark 1. The assumption that all calculations in the processor be modulo 2^L is essential for addition, subtraction, and other instructions. Should the processor trap on overflow, functions like add must be defined in a much more restrictive way than described to get a safe approximation. For instance, we could set

    add(v, w) = { v ⊕ w,  if v, w ≠ ⊥ and v ⊕ w ∈ Int;
                  w,      if v = 0;
                  v,      if w = 0;
                  ⊥,      otherwise,

and

    sub(v, w) = { v ⊖ w,  if v, w ≠ ⊥ and v ⊖ w ∈ Int;
                  ⊥,      otherwise,

where Int is the set of integers representable in a register. ⊥ here also models possible overflow situations that could terminate the execution of the instruction with an exception.

5.4. Evaluating Address Expressions

The results of our symbolic value set propagation have to be substituted into the address expressions of all memory instructions to get a safe approximation of the addresses accessed by these instructions. To perform this substitution we have to touch each instruction only once. We assume that the only instructions involving memory are loads, stores, and


calls. The set of addresses possibly accessed by an instruction is calculated as follows:

    addr: STMT → SVS_⊥,   v ∈ addr(s) ⟺ v may be the address of a memory cell accessed in statement s,

    addr(s) = { F(regvals(s), add(a, ·), r_j),  if [rj+a] is the address expression;
                F(regvals(s), add, r_j, r_k),   if [rj+rk] is the address expression;
                {⊥},                            if s is a call instruction;
                ∅,                              otherwise.

Now, we have to distinguish whether an instruction reads from memory or writes to it. Thus, we derive functions defmem and usemem from addr:

    defmem: STMT → SVS_⊥,   v ∈ defmem(s) ⟺ statement s possibly writes to address v,

    defmem(s) = { addr(s),  if s is a store or call instruction;
                  ∅,        otherwise,

    usemem: STMT → SVS_⊥,   v ∈ usemem(s) ⟺ statement s possibly reads from address v,

    usemem(s) = { addr(s),  if s is a load or call instruction;
                  ∅,        otherwise.

The handling of call instructions here is quite approximative and could possibly be improved by distinguishing certain storage areas with specific bottom symbols.
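A sketch of the register-plus-offset case of addr (ours; a symbolic value set is represented as an optional vector of the Poly values from Section 5.3, with the empty optional standing for {⊥}):

    #include <cstdint>
    #include <optional>
    #include <vector>

    using SVS = std::optional<std::vector<Poly>>;   // nullopt == {bottom}

    // For an address expression [rj + a]: add the constant offset a to every
    // possible symbolic value of the base register rj.
    SVS addr_reg_plus_offset(const SVS& base_vals, uint32_t offset) {
        if (!base_vals) return std::nullopt;        // unknown base address
        std::vector<Poly> result;
        for (const Poly& b : *base_vals) {
            Poly a = b;
            a.set("eps", a("eps") + offset);        // shift the constant term
            result.push_back(a);
        }
        return result;
    }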

5.5. Reaching Definitions and Uses of Memory

We are now ready for setting up reaching definitions and uses of memory. We formulate this pass as a MDFS that propagates a set of possible reaching definitions and a set of possible reaching uses to the


successors of a statement. In order to keep the reaching sets as small as possible (and thus the analysis as precise as could be) we check by a special kind of must-alias analysis whether a definition (or a use) from a previous statement t reaching a statement s will not be visible to the successors of s. To assure this, addr(t) must be completely covered by defmem(s); we formulate this by a predicate covers. For deciding the predicate covers here, or the predicate cut in Section 5.6, we could in principle use a tool like the Omega Library (22) that manipulates linear constraints. However, our method produces very particular constraints that are more efficiently solved by a direct approach. [Note for clarification that calculating must-dependences is out of the scope of this paper. Also, we partly determine and exploit may-aliases in Section 5.6, but not here. And we do not derive the full must-alias relation in this paper. However, our data flow information would permit us to provide all these relations to full extent.]

We define functions for reaching definitions and uses,

    rdmem: STMT → P(STMT),   t ∈ rdmem(s) ⟺ s could observe a storage contents written by t,
    rumem: STMT → P(STMT),   t ∈ rumem(s) ⟺ s could observe a storage contents read by t,

and calculate them by the following MDFS:

    Data flow information set: P(STMT).
    Top element: ∅.
    Meet operator ∧_STMT: P(STMT) × P(STMT) → P(STMT),   S ∧_STMT T = S ∪ T.

    Semantic functions for reaching definitions:

    G_s(D) = { {s},  if s is a store or call instruction;
               ∅,    otherwise,

    K_s(D) = {t ∈ D : covers(s, defmem(s), t, defmem(t))}

    Semantic functions for reaching uses:

    G_s(D) = { {s},  if s is a load or call instruction;
               ∅,    otherwise,

    K_s(D) = {t ∈ D : covers(s, defmem(s), t, usemem(t))}
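A sketch of one application of these semantic functions (ours; covers is the must-alias predicate developed next):

    #include <functional>
    #include <set>

    // Kill every reaching definition t whose accessed block is completely
    // covered by statement s, then generate s itself if s writes memory.
    std::set<int> transfer_rdmem(int s, const std::set<int>& reaching,
                                 bool s_is_store_or_call,
                                 const std::function<bool(int, int)>& covers) {
        std::set<int> out;
        for (int t : reaching)
            if (!covers(s, t)) out.insert(t);   // K_s(D)
        if (s_is_store_or_call) out.insert(s);  // G_s(D)
        return out;
    }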


Because the empty address set or the unknown address ⊥ can never cover any address, and since the unknown address ⊥ can never be covered by any address, we find (10)

    covers(s, A, t, B) = { false,                if A = ∅ or A = {⊥} or B = {⊥};
                           covers2(s, A, t, B),  otherwise

A nonempty address set A can only cover another nonempty (11) address set B if each element in B is covered by each element in A, thus

    covers2(s, A, t, B) = ∧_{(a, b) ∈ A×B} covers3(s, a, t, b)

The arguments a and b of covers3 may contain some free symbols. We introduce a predicate sep that describes the situation where the initialization points of two symbols to be compared do not share any control path. Cases where such symbols appear together in comparisons are spurious; they are due to the abstraction of control flow where the predicates of conditionals are not used in deciding on paths to be taken. Thus

    covers3(s, a, t, b) = { true,                     if sep(x, y) for some x ∈ free(a), y ∈ free(b);
                            covers4(s, a, t, b ⊖ a),  otherwise

where the predicate

    sep ⊆ SYM × SYM,   (x, y) ∈ sep ⟺ the symbols x and y are not defined on a common path,

can be derived as

    sep(x, y) ⟺ ip(x) ≠ ip(y) and ip(x) ∉ reaches(ip(y)) and ip(y) ∉ reaches(ip(x))

from a function reaches that describes which statements reach a statement in the control flow graph,

    reaches: STMT → P(STMT),   t ∈ reaches(s) ⟺ there is a path from statement t to statement s
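A sketch of sep in terms of reaches (ours; reaches[s] is assumed to be the precomputed set of statements from which s is reachable, and ip maps each symbol to its initialization point):

    #include <functional>
    #include <set>
    #include <vector>

    // Two symbols are separated iff their initialization points differ and
    // neither initialization point can reach the other on any control path.
    bool sep(int x, int y,
             const std::vector<std::set<int>>& reaches,
             const std::function<int(int)>& ip) {
        int sx = ip(x), sy = ip(y);
        return sx != sy && reaches[sy].count(sx) == 0
                        && reaches[sx].count(sy) == 0;
    }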


10. In an implementation of the formulas from this section as well as from other sections of the paper, we might make careful use of lazy evaluation rules for Boolean operators.
11. The case B = ∅ cannot appear.


We calculate reaches by the following MDFS:

    Data flow information set: P(STMT).
    Top element: ∅.
    Meet operator ∧: P(STMT) × P(STMT) → P(STMT),   S ∧ T = S ∪ T.
    Semantic functions: K_s(D) = ∅, G_s(D) = {s}.

When comparing two addresses for coverage we have to form the difference between them. This difference is itself a polynomial, and when it contains a free symbol then we might find a valuation of the symbols where this difference is large in magnitude and thus coverage is not assured. Thus, covers4 never has the value true when there are any free symbols in the difference polynomial,

    covers4(s, a, t, p) = { false,                if free(p) ≠ ∅;
                            covers5(s, a, t, p),  otherwise

Moreover, eliminating free symbols from both argument polynomials by assuming equality is only feasible when such a symbol has the same meaning in both contexts. To delete a statement t from a reaching set arriving at statement s, we have to be sure that (i) the memory contents addressed in statement t is overwritten every time the control flow passes statement s, and that (ii) there is no way to bypass statement s between the use of a symbol x in statement t and its redefinition. The latter is checked with a predicate shields, (12)

    shields: STMT × STMT × STMT → BOOL,   shields(s, u, t) ⟺ every path from s to t contains u

Notice that we use shields only for t of the form t = ip(x). For fixed t, we can observe the similarity with postdominance, (13) so that a straightforward extension of the computation of postdominance delivers, for any fixed t, the set {u : shields(s, u, t)}. Furthermore, if we compute a tree to represent this relation, similar to the postdominance tree, and adopt the coding of trees suggested by Brandis, (23) then the required amount of memory is of the order O(|STMT| × |ip(SYM)|). Finally, we check whether the memory block of size(s) bytes starting at address a completely covers the memory block of size(t) bytes starting at address b: (14)
12. We shall say that u shields s with respect to t.
13. If t is the final node, then this is exactly the postdominance relation.
14. If the number of bytes read or written by a memory instruction is always the same, the formula size(t) ≤ p + size(t) ≤ size(s) in covers5 simplifies to p = 0.


    covers5(s, a, t, p) = { (size(t) ≤ p + size(t) ≤ size(s)),  if shields(t, s, ip(x)) for all x ∈ free(a);
                            false,                              otherwise,

where the function size: STMT → N reflects the number of bytes accessed in memory.

5.6. Determination of Value-Based, Memory-Induced Dependences

To derive value-based, memory-induced dependences from reaching definitions (or reaching uses) for memory, we have to check whether the memory cells accessed in one statement may intersect with the memory cells accessed in another statement. The intersection is checked using the cut predicate, calculating a kind of may-alias information. Given this, the data dependences are as follows:

    t is flow dependent on s ⟺ s ∈ rdmem(t) and cut(s, defmem(s), t, usemem(t)),
    t is anti-dependent on s ⟺ s ∈ rumem(t) and cut(s, usemem(s), t, defmem(t)),
    t is output dependent on s ⟺ s ∈ rdmem(t) and cut(s, defmem(s), t, defmem(t))

An unknown address ⊥ may intersect with any address, thus

    cut(s, A, t, B) = { true,              if A = {⊥} or B = {⊥};
                        cut2(s, A, t, B),  otherwise

An intersection between sets of addresses can only be excluded when there is definitely no intersection for any pair of addresses, meaning

    cut2(s, A, t, B) = ∨_{(a, b) ∈ A×B} cut3(s, a, t, b)

If addresses contain symbols that are incompatible, they can never appear together, so

    cut3(s, a, t, b) = { false,             if sep(x, y) for some x ∈ free(a), y ∈ free(b);
                         cut4(s, a, t, b),  otherwise


Now, we have to form a difference polynomial for the final intersection test. A common free symbol from the polynomials a and b might take different values in both polynomials. For correctly setting up the difference polynomial we have to substitute a new symbol (for the common one) into one of the polynomials if its value is not fixed. The latter property is checked by the predicate variant. Since we are interested in disproving iteration-independent dependences, we define the variant predicate with respect to a cycle head. So variant(x, h) should hold whenever we cannot show that the symbol x has a unique meaning during one iteration of the cycle pack with head h. [For a complete run of the procedure, we formally take h to be the procedure entry node.] A safe approximation then is

    variant(x, h) = ∀f ∈ back(h): ¬shields(ip(x), f, ip(x))

We can improve this approximation by setting variant(x, h) = false in the following special cases:

1. ip(x) = h, or
2. the cycle pack is a loop that comprises only reducible code, and it is the innermost loop containing ip(x). (15)

The substitution process is described by

    cut4(s, a, t, b) = cut5(size(s), a′, size(t), b)

with

    a′ = a[dup(x)/x, for all x ∈ free(a) ∩ free(b) such that variant(x, h)]

where a[dup(x)/x ...] means that we substitute a new symbol for each conflicting old one, all at the same time. The function dup has to provide unique duplicates for the symbols appearing in the semantic functions without interfering with the latter. Assuming that statements are counted up from 0, an easy way to achieve this would be by

    dup: SYM → SYM,   R_{i,s} ↦ R_{i,−s−1}
15. A cycle head h is the entry node of a loop if h dominates all f ∈ back(h). During depth-first search we detect whether the body of the loop has a reducible control flow graph. When checking the loop entries following the reverse postorder, inner loops are scanned after outer loops; thus we can easily determine the nesting of loops, and in particular find for each node the innermost loop it belongs to.


The difference polynomial is implicitly formed in

    cut5(ss, a′, st, b) = ( ⌈(1 − ss − c)/δ⌉ ≤ ⌊(st − 1 − c)/δ⌋ )

where p = a′ ⊖ b, c = p(ε), and δ = gcd(2^L, gcd_{x ∈ free(p)} {p(x)}).

We refer to the appendix for the derivation of this formula. This implements what can be considered as an extension of the GCD test, the best we can do in the current situation. As shown in the appendix, the test amounts to asking that an integer multiple of δ be comprised between the numbers (1 − ss − c) and (st − 1 − c); so it corresponds to the classical GCD test, applied to all values in this interval.

We demonstrate our dependence test by analyzing the code from Fig. 4. Consider, for example, the problem of determining whether the store from instruction 7 is anti-dependent on one of the previous loads, say instruction 5. (16) We find 5 ∈ rumem(7). Therefore we compute cut(5, usemem(5), 7, defmem(7)). We have usemem(5) = {8·R_{0,1} + 4}, defmem(7) = {8·R_{0,1} + 8}, and sep(R_{0,1}, R_{0,1}) = false. So,

    cut(5, {8·R_{0,1} + 4}, 7, {8·R_{0,1} + 8})
      = cut2(5, {8·R_{0,1} + 4}, 7, {8·R_{0,1} + 8})
      = cut3(5, 8·R_{0,1} + 4, 7, 8·R_{0,1} + 8)
      = cut4(5, 8·R_{0,1} + 4, 7, 8·R_{0,1} + 8)
      = cut5(size(5), 8·R′_{0,1} + 4, size(7), 8·R_{0,1} + 8)

where R′_{0,1} is a new symbol. This gives us p = (8·R′_{0,1} + 4) ⊖ (8·R_{0,1} + 8), c = −4, and δ = 8. Since we assume that the memory instructions in question access words of 4 bytes each, we get ss = size(5) = 4 and st = size(7) = 4. Thus the result is:

    cut5(4, 8·R′_{0,1} + 4, 4, 8·R_{0,1} + 8)
      = ( ⌈(1 − 4 − (−4))/8⌉ ≤ ⌊(4 − 1 − (−4))/8⌋ )
      = ( ⌈1/8⌉ ≤ ⌊7/8⌋ )
      = (1 ≤ 0) = false,

so we can conclude absence of dependence.


16. The answer with the other load will be the same.
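The interval form of the test is easy to implement; the following sketch (ours; delta is assumed to be the positive gcd computed as above) reproduces the example:

    #include <cassert>
    #include <cstdint>

    // cut5: a dependence is possible iff some integer multiple of delta lies
    // in [1 - ss - c, st - 1 - c], i.e. ceil(lo/delta) <= floor(hi/delta).
    bool cut5(int64_t ss, int64_t st, int64_t c, int64_t delta) {
        auto floordiv = [](int64_t a, int64_t b) {   // floor division, b > 0
            return a >= 0 ? a / b : -((-a + b - 1) / b);
        };
        int64_t lo = 1 - ss - c, hi = st - 1 - c;
        return -floordiv(-lo, delta) <= floordiv(hi, delta);
    }

    int main() {
        // Example from the text: c = -4, delta = 8, ss = st = 4 gives
        // ceil(1/8) = 1 > floor(7/8) = 0, so the accesses are independent.
        assert(cut5(4, 4, -4, 8) == false);
        return 0;
    }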


6. IMPLEMENTATION AND RESULTS

The method for determining data dependences in assembly code presented in the last sections was implemented for reducible code as a user function in Salto on a Sun SPARCstation 10, with the following simplifications:

    covers3(s, a, t, b) = covers4(s, a, t, b ⊖ a),

    covers5(s, a, t, p) = { false,                              if variant(x, h) for some x ∈ free(a);
                            (size(t) ≤ p + size(t) ≤ size(s)),  otherwise,

    cut3(s, a, t, b) = cut4(s, a, t, b)

Presently, only the assembly code for the SPARC instruction set can be analyzed, but an extension to other processors will require minimal technical effort. Results of our analysis can be used by tools built upon Salto. For evaluation of our method we have taken a closer look at two aspects:

1. Comparison of the number of data dependences using our method against the method implemented in Salto; this shows the difference between address-based dependence analysis and value-based dependence analysis concerning register accesses.

2. Comparison of the number of data dependences for memory accesses using address-based dependence analysis versus value-based dependence analysis.

As a sample we chose 160 procedures out of the sixth public release of the Independent JPEG Group's free JPEG software, a package for compression and decompression of JPEG images. We distinguish between the following four levels of precision in the analysis:

Level 1. Address-based dependences between register accesses. Memory is modeled as one cell, i.e., every pair of memory accesses is assumed to introduce a data dependence.

Level 2. Value-based dependence analysis for register accesses; memory is modeled as one cell.

Level 3. Value-based dependence analysis for register accesses, and address-based dependence analysis for memory accesses.

Level 4. Value-based dependence analysis for register accesses, value-based dependence analysis for memory accesses.


Salto (17) in principle can be classed at level 1, although it is sometimes less precise than our level 1 analysis because it does not consider control flow; i.e., even instructions never appearing on a common control path may be considered as data dependent in Salto. Level 2 precision is common with today's instruction schedulers, e.g., the one in gcc (1) or the one used by Larus et al. (10) Current systems that do some kind of value propagation (we will have a closer look at other techniques for value propagation in Section 7) determine only address-based dependences and are thus classed at level 3. Apart from our own method, there seems to be no other method supporting level 4 precision.

We instrumented our implementation to report dependences on all four levels. Level 4 is attained by the approach as described in Section 5. For level 3, we also use the approach from Section 5, but with the restriction that the predicate covers always returns false. To demonstrate the improvement from level 3 to level 4, we summarize in Table I the results for those 37 procedures from the JPEG package where we disproved more false dependences on level 4. [Note: These figures depend on the chosen target machine and in particular might vary with the amount of spill code produced by a compiler.]

Complementary to this quantitative evaluation, we attempt a general assessment of our method in the sequel. Therefore we give a characterization of the types of dependences that we can break. [We formulate this with respect to loops, not cycle packs, for staying compatible with the standard literature.] As a principal limit of the current state of our method, where we don't use induction variables or similar techniques yet, loop-carried dependences of a store instruction on itself can never be broken. However, loop-carried dependences between different statements are broken by our cut predicate if the alignment of the memory blocks accessed in these statements inhibits intersection. We illustrate this claim by the following example. Assume a source program working on interleaved partitions of an array, like in red-black solving for partial differential equations. A prototype for such problems might be the C sample program reproduced in Fig. 6a. From this source code, a compiler could produce some assembly code principally determined by the scheme sketched in Fig. 4. For instance, the full assembly code shown in Fig. 6b was produced by the gcc compiler, version 2.8.1, using optimization level 1. Eventually applying the generalized GCD test from Section 5.6 to the results of our symbolic value propagation, we have proven that the store instruction is always independent of any of the load instructions. This means that we can apply software pipelining to this loop, provided we respect the self-dependence of the store instruction that we were not able to disprove with our method. In fact,

Table I. Sum of True, Anti-, and Output Dependences Found on Four Levels of Precision (a)

[The tabular data did not survive extraction. For each of the 37 selected JPEG procedures (keymatch, test3function, is_shifting_signed, jpeg_CreateCompress, ..., start_output_tga), the table lists the lines of code and the number of register-induced (reg) and memory-induced (mem) dependences at precision levels 1 through 4, followed by the percentage improvement of level 4 over level 3.]

The results are divided into register-induced and memory-induced dependences. The two rightmost columns show the improvement of level 4 analysis on level 3 analysis, i.e., of a value-based memory dependence analysis on an address-based memory dependence analysis.


Fig. 6. Disproving loop-carried dependences: (a) sample C code; (b) SPARC assembly program.
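For concreteness, a minimal loop on interleaved partitions in the spirit of Fig. 6a might look as follows; the array name, bound, and coefficients are illustrative assumptions, not the original figure:

    #define N 1000
    static double u[N];

    /* Red relaxation sweep: the stores touch even cells only, while the
       loads read the odd neighbors, so the alignment of the accessed
       memory blocks rules out any store/load intersection. */
    static void relax_red(void) {
        for (int i = 2; i < N - 1; i += 2)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    int main(void) {
        u[1] = 1.0; u[3] = 3.0;
        relax_red();
        return (int)u[2];   /* 2: just to consume the result */
    }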

In fact, there is no memory dependence at all in this example, but we would only be able to prove this supplementary fact with the help of induction variable detection.

Dependences within one loop iteration can be broken in many cases. Control flow inside the loop body, as well as in pre-loop code, might inhibit some of these breakings; in particular, we cannot always achieve the same precision as with full linear relations over register contents. For the same reasons, breaking dependences inside a basic block or between basic blocks might also be inhibited. We can sometimes break dependences between a loop nest and some past-loop code, in rare cases also between a loop nest and some pre-loop code, or even between different loop nests. Except for the situation of a statement depending on itself, which can only happen in the context of output dependences, we see no principal differences in the power of our dependence testing with respect to the three kinds of dependences (flow dependences, anti-dependences, and output dependences).

7. RELATED WORK

Work in the field of memory reference disambiguation for assembly language is beginning to emerge as a topic of research. Ellis (24) presented a method to derive symbolic expressions for memory addresses by chasing back all reaching definitions of a symbolic register; the expression is simplified using elementary arithmetic, and two expressions are compared


using, e.g., the GCD test. The method is implemented in the Bulldog compiler, but it works on an intermediate level close to a high-level language. In order to deal with loops, dummy assignments are inserted for loop counters, very similar to our initialization points. Note that Ellis also proposed (see Chap. 5) what looks like a precursor of φ-functions, before the term was coined. (25) He suggested stopping the chase of reaching definitions at a merge point, returning instead a symbol representing the set of reaching definitions. Then it would be possible in some cases to reduce the number of terms in a symbolic value set to one element, allowing a precise answer. This might be compatible with our approach, where we would insert φ-functions, and chase back beyond the merge point only when a nonsatisfactory answer is obtained.

Other authors inspired by Ellis were Lowney et al., (26) Böckle, (27) and Moon and Ebcioğlu. (28) The latter approach calculates alias information for assembly code and is implemented in the Chameleon compiler. (29) First, a procedure is transformed into DFG form (Dependence Flow Graph (30)), which provides integrated control and data flow information. For gathering possible register values, the same technique as in the Bulldog compiler is used. If a register has multiple definitions, the algorithm described by Moon and Ebcioğlu (28) can chase all reaching definitions, whereas the concrete implementation in the Chameleon compiler apparently does not support this. Compared to our method, Chameleon preserves more pre-loop relations by applying the Banerjee inequalities (31) to the ranges of induction variables, besides the GCD test, (16) when comparing memory addresses.

Debray et al. (32) present an approach close to ours. They use address descriptors to represent abstract addresses, i.e., addresses containing symbolic registers. An address descriptor is a pair (I, M) where I is an instruction and M is a set of mod-k residues; M denotes a set of offsets relative to the register defined in instruction I. Note that an address descriptor depends on only one symbolic register. The meaning of an address descriptor (I, M) is thus the set of addresses w + x + k·i, where w ranges over the values that may be computed into a register by instruction I, x ∈ M, and i is any integer. The set of address descriptors is made into a bounded semi-lattice, so that a data flow system can be used to propagate values through the control flow graph. [Note: In their tests, k = 64 was used.] However, this leads to an approximation of address representation that makes it impossible to derive must-alias information. The second drawback is that definitions of the same register on different control flow paths are not joined in a set, but mapped to ⊥. Comparing address descriptors can be reduced to a comparison of mod-k sets, which can be done very efficiently thanks to a bit-vector representation. They illustrate the use of their method by showing the elimination of spurious loads, and they perform interprocedural analysis.
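To illustrate, the residue-set comparison can be sketched in C as follows; this is a toy sketch under our own encoding assumptions, not Debray et al.'s implementation. With k = 64, a residue set fits in one 64-bit word, so the may-alias test between two descriptors anchored at the same defining instruction is a single AND:

    #include <stdint.h>
    #include <stdio.h>

    enum { K = 64 };   /* k = 64, as in Debray et al.'s tests */

    /* Residue set containing just (off mod K), one bit per residue. */
    static uint64_t residue_singleton(long off) {
        return (uint64_t)1 << (((off % K) + K) % K);
    }

    /* Descriptors with the same anchor may alias iff their residue
       sets intersect. */
    static int may_alias(uint64_t m1, uint64_t m2) {
        return (m1 & m2) != 0;
    }

    int main(void) {
        uint64_t a = residue_singleton(8);    /* load  at base + 8          */
        uint64_t b = residue_singleton(72);   /* store at base + 72; 72 mod 64 = 8 */
        uint64_t c = residue_singleton(16);   /* store at base + 16         */
        printf("a vs b: %d, a vs c: %d\n", may_alias(a, b), may_alias(a, c));
        return 0;   /* prints "a vs b: 1, a vs c: 0" */
    }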


Fig. 7. Example demonstrating Bodik's method.
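A minimal C sketch of the back-substitution step this example relies on follows; the instruction numbering and the definition r1 = r2 - 1 are assumptions taken from the discussion below, and the representation of values as linear forms is our own illustration:

    #include <stdio.h>

    /* A symbolic value k2*r2 + k1*r1 + c for the operand of interest. */
    typedef struct { long k2, k1, c; } Expr;

    /* Substitute the definition r1 := r2 - 1 backwards into e,
       folding r1's coefficient into r2's and into the constant. */
    static Expr subst_r1_eq_r2_minus_1(Expr e) {
        Expr out = { e.k2 + e.k1, 0, e.c - e.k1 };
        return out;
    }

    int main(void) {
        Expr e = { 1, -1, 0 };           /* the operand r2 - r1 at node 7 */
        e = subst_r1_eq_r2_minus_1(e);   /* visit defining node 6 backwards */
        if (e.k2 == 0 && e.k1 == 0)
            printf("r2 - r1 is the constant %ld\n", e.c);   /* prints 1 */
        return 0;
    }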

We observed that entry points to cycle packs are precisely the places where a widening operator would be placed; widening is an alternative means of obtaining convergence in finite time, provided one is able to define the widening operator. (33)

Finally, let us mention the work of Bodik and Anik, (34) which, although not directly targeting the analysis of assembly code, might be a useful complement to our analysis. The idea is to start from the expressions of interest (for us, for instance, this could be the address expressions), and to visit the control flow graph backwards. At each node, the variable appearing on the left-hand side is substituted back into the expression(s) of interest. Take, for instance, the example from Fig. 7. Starting from instruction 7 and its operand (r2 - r1), we visit the nodes in postorder. Node 6 defines a component of (r2 - r1), namely r1. Back-substituting, we find that the value of the expression is (r2 - (r2 - 1)) = 1. The technique is thus able to discover that r3 at instruction 7 can only take one value, namely 1.

There have been prior publications by us on the same topic, in particular Amme et al. (35) and Braun et al. (36) The paper by Amme et al. (35) described the project at an earlier stage. Compared to this preliminary version, we here use a generalized and more precisely described programming model; in particular, we now handle arbitrary cycles in the control flow graph, and memory accesses of any, possibly mixed, width. We also improved the placement rules for initialization points, and the precision of our reaching definitions/uses calculation and of our dependence test. In our more recent publication, (36) the emphasis is on the implementation strategy of our method within the Salto tool, giving rise to an object-oriented system for monotone data flow analysis, called jSalto.

8. CONCLUSIONS

In this paper we presented a new method to detect data dependences in assembly code. It works in two steps: first, we perform a symbolic value set propagation using a monotone data flow system; then, we compute reaching definitions and reaching uses for memory accesses, and derive value-based


data dependences. Note that other approaches only calculate address-based dependences. For comparing memory references we use a modification of the GCD test.

Software pipelining will be one major application of the present work in the near future. This family of techniques overlaps the execution of different loop iterations and therefore can profit from a precise dependence analysis. Although the dependence information our current implementation is able to deliver can already be used by software pipelining tools, we must strive to produce dependence distances. Unrolling several iterations of the loop before performing a dependence analysis could be a preliminary technique for calculating dependence distances. However, we plan to tackle the problem directly by implementing passes that calculate loop invariants and discover induction variables. [Note: Sharing of induction variables will also diminish the number of artificial initialization points and thus allow more pre-loop information to be exploited inside the loop body.] A further improvement will come from deriving bounds on the contents of registers, e.g., by including the predicates of conditionals. Then coupling with known dependence tests such as the Banerjee test (31) or the Omega test (22) can be considered.

Another project is to propagate symbolic values through memory cells. Assume a register is loaded from memory. We can determine a safe approximation S of all statements that might have contributed to the loaded value. Provided certain alignment conditions hold, instead of representing the loaded value by a natural initialization point we might take the meet over all values stored by statements from S. A complementary technique could find out whether different statements load the same value from a memory cell; the value in the target registers should then be represented by a common symbol. We hope to improve the precision of our dependence analysis by these propagation techniques. Solving such problems is also closely related to the elimination of redundant memory instructions, as for instance appearing in spill code.

Finally, extending our method to interprocedural analysis would lead to a more precise dependence analysis. Presently we have to assume that the contents of almost all registers and all memory cells may have changed after the evaluation of a procedure call. We plan to investigate in more depth the relationship of our work to abstract interpretation. (37) We also consider exploiting assertions passed from high-level analysis to low-level analysis. (38)

APPENDIX: DERIVATION OF FORMULA cut5

See Section 5.6 for the use of this predicate and its notation. Note that δ is always positive, as it is the GCD of several quantities including 2^L.


$$
\begin{aligned}
\mathrm{cut5}(s_s, a', s_t, b)
&\iff \exists\, \alpha, \beta \in \mathbb{Z},\; h, k \in \mathbb{N}_0,\; h < s_s,\; k < s_t :\;
  a' + h + \alpha \cdot 2^L = b + k + \beta \cdot 2^L \\
&\iff \exists\, \alpha, \beta \in \mathbb{Z},\; h, k \in \mathbb{N}_0,\; h < s_s,\; k < s_t :\;
  p + (\alpha - \beta) \cdot 2^L = k - h
  \quad\text{with } p = a' - b \\
&\iff \exists\, \alpha, \beta \in \mathbb{Z},\; h, k \in \mathbb{N}_0,\; h < s_s,\; k < s_t :\;
  (p - c) + (\alpha - \beta) \cdot 2^L = k - h - c
  \quad\text{with } c = p(\varepsilon) \\
&\iff \exists\, \gamma \in \mathbb{Z},\; h, k \in \mathbb{N}_0,\; h < s_s,\; k < s_t :\;
  \gamma \cdot \delta = k - h - c
  \quad\text{with } \delta = \gcd\Bigl(2^L,\; \gcd_{x \in \mathrm{free}(p)} p(x)\Bigr) \\
&\iff \exists\, \gamma \in \mathbb{Z} :\; -s_s + 1 \le \gamma \cdot \delta + c \le s_t - 1
\end{aligned}
$$

Hence the conclusion:

$$
\mathrm{cut5}(s_s, a', s_t, b) \;=\; \left\lceil \frac{1 - s_s - c}{\delta} \right\rceil \le \left\lfloor \frac{s_t - 1 - c}{\delta} \right\rfloor
$$
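Read operationally, the closed form can be evaluated directly. The following C sketch is our illustration, not the authors' implementation, and the input encoding is an assumption: it computes δ from 2^L and the coefficients of the free symbols of p, then tests the ceiling/floor inequality.

    #include <stdint.h>
    #include <stdio.h>

    /* Greatest common divisor of two unsigned 64-bit values. */
    static uint64_t gcd_u64(uint64_t a, uint64_t b) {
        while (b != 0) { uint64_t t = a % b; a = b; b = t; }
        return a;
    }

    /* cut5 sketch: ss, st are the access widths in bytes; c is the constant
       part p(eps) of p = a' - b; coeffs[0..n-1] are the coefficients p(x) of
       the free symbols of p; L is the address width in bits (assumed < 64).
       Returns nonzero iff the two accesses may overlap, i.e., iff
       ceil((1 - ss - c)/delta) <= floor((st - 1 - c)/delta). */
    static int cut5(int64_t ss, int64_t st, int64_t c,
                    const int64_t *coeffs, size_t n, unsigned L) {
        uint64_t delta = (uint64_t)1 << L;          /* 2^L, so delta > 0 */
        for (size_t i = 0; i < n; i++) {
            uint64_t m = (uint64_t)(coeffs[i] < 0 ? -coeffs[i] : coeffs[i]);
            delta = gcd_u64(delta, m);
        }
        int64_t d = (int64_t)delta;
        int64_t lo = 1 - ss - c, hi = st - 1 - c;
        /* floor and ceiling under C's truncating integer division (d > 0) */
        int64_t fl = hi / d - ((hi % d != 0 && hi < 0) ? 1 : 0);
        int64_t ce = lo / d + ((lo % d != 0 && lo > 0) ? 1 : 0);
        return ce <= fl;
    }

    int main(void) {
        int64_t coeffs[] = { 8 };   /* say p = 8x + 4: delta = gcd(2^32, 8) = 8 */
        printf("may overlap: %d\n", cut5(4, 4, 4, coeffs, 1, 32));  /* prints 0 */
        return 0;
    }

In the example in main, two four-byte accesses whose addresses differ by 4 modulo 8 can never touch a common byte, and the predicate correctly reports no overlap.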

Even in light of this proof it might be surprising that the calculation of δ has to involve 2^L. The reason is that address calculations might wrap around 2^L any number of times; strictly speaking, we are not calculating in ℤ but in the system of residues modulo 2^L. This phenomenon is not restricted to assembly language programs. You might observe it, for instance, with the sample C program from Fig. 8, provided addresses are 32 bits wide. You will notice that forming δ without considering 2^L as a constituent of the GCD would miss the anti-dependence of the first loop iteration on the pre-loop code.

Fig. 8. Sample program showing the necessity to consider wrap-around during address calculation.
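A minimal stand-in for the effect Fig. 8 illustrates (the constants are our own; 32-bit unsigned arithmetic models 32-bit address computation):

    #include <stdint.h>
    #include <stdio.h>

    /* Two address expressions that differ as integers but coincide modulo
       2^32: a dependence test working in Z, without 2^L in the GCD, would
       wrongly conclude that the locations are always distinct. */
    int main(void) {
        uint32_t pre_loop_addr = 0x00000010u;   /* accessed before the loop */
        uint32_t base          = 0xFFFFFFF0u;   /* loop base address        */
        uint32_t first_iter    = base + 0x20u;  /* wraps around: 0x00000010 */
        printf("pre-loop %#x, first iteration %#x: %s\n",
               pre_loop_addr, first_iter,
               pre_loop_addr == first_iter ? "same cell" : "distinct");
        return 0;
    }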


ACKNOWLEDGMENTS

This work has been supported in part through a travelling grant from the French Ministry of Foreign Affairs (MAE) and the German Academic Exchange Service (DAAD) under the PROCOPE program. We are also indebted to our referees for their thorough comments and their helpful advice.

REFERENCES
1. M. D. Tiemann, The GNU instruction scheduler, Technical Report CS 343, Free Software Foundation, Cambridge, Massachusetts (June 1989).
2. S. Davidson, D. Landskov, B. D. Shriver, and P. W. Mallet, Some experiments in local microcode compaction for horizontal machines, IEEE Trans. Computers 30(7):460–477 (July 1981).
3. J. A. Fisher, Trace scheduling: A technique for global microcode compaction, IEEE Trans. Computers 30(7):478–490 (July 1981).
4. A. Nicolau, Percolation scheduling: A parallel compilation technique, Technical Report TR 85-678, Cornell University, Department of Computer Science (May 1985).
5. B. R. Rau and J. A. Fisher, Instruction-level parallel processing: History, overview, and perspective, J. Supercomputing 7(1/2):9–50 (May 1993).
6. R. M. Stallman, Using and porting the GNU CC, Technical Report, Free Software Foundation, Cambridge, Massachusetts (January 1989).
7. D. W. Wall, Systems for late code modification, in R. Giegerich and S. L. Graham (eds.), Code Generation - Concepts, Tools, Techniques, Workshops in Computing, Springer-Verlag, pp. 275–293 (1992).
8. A. Srivastava and D. W. Wall, A practical system for intermodule code optimization at link-time, J. Progr. Lang. 1(1):1–18 (March 1993).
9. J. A. Fisher, Walk-time techniques: Catalyst for architectural change, IEEE Computer 30(9):40–42 (September 1997).
10. E. Schnarr and J. R. Larus, Instruction scheduling and executable editing, Proc. 29th Ann. IEEE/ACM Int'l. Symp. Microarchitecture (MICRO-29), Paris, pp. 288–297 (December 1996).
11. R. L. Sites, A. Chernoff, M. B. Kirk, M. P. Marks, and S. G. Robinson, Binary translation, Commun. ACM 36(2):69–81 (February 1993).
12. P. Feautrier, Array expansion, Proc. Second Int'l. Conf. Supercomputing, ACM Press (July 1988).
13. W. Pugh and D. Wonnacott, An exact method for analysis of value-based array data dependences, in U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua (eds.), Proc. Sixth Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Vol. 768, Springer-Verlag, Portland, Oregon, pp. 546–566 (1993).
14. W. Landi and B. G. Ryder, Pointer-induced aliasing: A problem classification, Conf. Record 18th Ann. ACM Symp. Principles Progr. Lang., Orlando, Florida, pp. 93–103 (January 1991).
15. D. W. Wall, Limits of instruction-level parallelism, ACM SIGPLAN Notices 26(4):176–188 (April 1991).
16. S. S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, San Francisco, California (1997).


17. E. Rohou, F. Bodin, and A. Seznec, Salto: System for assembly-language transformation and optimization, in M. Gerndt (ed.), Proc. Sixth Workshop Compilers for Parallel Computers, Konferenzen des Forschungszentrums Jülich, Vol. 21, Forschungszentrum Jülich, Aachen, pp. 261–272 (December 1996).
18. J. R. Larus and P. N. Hilfinger, Detecting conflicts between structure accesses, ACM SIGPLAN Notices 23(7):21–34 (July 1988).
19. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley Publishing Company, Reading, Massachusetts (1988).
20. J. B. Kam and J. D. Ullman, Monotone data flow analysis frameworks, Acta Informatica 7:305–317 (1977).
21. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley Publishing Company, Reading, Massachusetts (1974).
22. W. Pugh, The Omega test: A fast and practical integer programming algorithm for dependence analysis, Commun. ACM 35(8):102–114 (August 1992).
23. M. M. Brandis, Optimizing compilers for structured programming languages, Ph.D. dissertation, Institute for Computer Systems, ETH Zürich (1995).
24. J. R. Ellis, Bulldog: A Compiler for VLIW Architectures, ACM Doctoral Dissertation Awards, MIT Press, Cambridge, Massachusetts (1985).
25. R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, An efficient method of computing static single assignment form, Conf. Record 16th Ann. ACM Symp. Principles Progr. Lang., Austin, Texas, pp. 25–35 (January 1989).
26. P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O'Donnell, and J. C. Ruttenberg, The Multiflow trace scheduling compiler, J. Supercomputing 7:51–142 (1993).
27. G. Böckle, Exploitation of Fine-Grain Parallelism, Lecture Notes in Computer Science, Vol. 942, Springer-Verlag, Berlin (1995).
28. S.-M. Moon and K. Ebcioğlu, A study on the number of memory ports in multiple instruction issue machines, Proc. 26th Ann. Int'l. Symp. Microarchitecture (MICRO-26), Austin, Texas, pp. 49–58 (December 1993).
29. M. Moudgill, J. H. Moreno, K. Ebcioğlu, E. Altman, S. K. Chen, and A. Polyak, Compiler/architecture interaction in a tree-based VLIW processor, Workshop on Interaction between Compilers and Computer Architecture '97 (in conjunction with HPCA-3), San Antonio, Texas (February 1997).
30. K. Pingali, M. Beck, R. Johnson, M. Moudgill, and P. Stodghill, Dependence flow graphs: An algebraic approach to program dependencies, Conf. Record 18th Ann. ACM Symp. Principles Progr. Lang., ACM Press, Orlando, Florida (January 1991).
31. U. Banerjee, Dependence Analysis for Supercomputing, Kluwer Academic Publishers, Boston, Massachusetts (1988).
32. S. Debray, R. Muth, and M. Weippert, Alias analysis of executable code, Proc. 25th ACM SIGPLAN-SIGACT Symp. Principles Progr. Lang. (POPL '98), ACM Press, San Diego, California, pp. 12–24 (January 1998).
33. P. Cousot and R. Cousot, Abstract interpretation and applications to logic programs, J. Logic Progr. 13(2/3):103–180 (July 1992).
34. R. Bodik and S. Anik, Path-sensitive value-flow analysis, Proc. 25th ACM SIGPLAN-SIGACT Symp. Principles Progr. Lang. (POPL '98), ACM Press, San Diego, California, pp. 237–251 (January 1998).
35. W. Amme, P. Braun, F. Thomasset, and E. Zehendner, Data dependence analysis of assembly code, Proc. Int'l. Conf. Parallel Architectures and Compilation Techniques (PACT '98), Paris, pp. 340–347 (October 1998).


36. P. Braun, W. Amme, F. Thomasset, and E. Zehendner, A data flow framework for analyzing assembly code, Proc. 8th Int'l. Workshop Compilers for Parallel Computers (CPC 2000), Aussois, France, pp. 163–172 (January 2000).
37. P. Cousot, Abstract interpretation, ACM Computing Surveys 28(2):324–328 (June 1996).
38. S. Novack and A. Nicolau, A hierarchical approach to instruction-level parallelization, Int'l. J. Parallel Progr. 23(1):35–62 (February 1995).
