The basics of programming embedded processors: Part 7
Trace-Driven Performance Analysis
By Wayne Wolf, Embedded.com (09/05/07, 12:54:00 AM EDT)

A program trace (or more succinctly, a trace) is a record of the execution path of a program, measured as the program runs. If we can capture a trace from a running program, we avoid the need to analyze the program's CDFG to determine the execution path.

Traces are typically large. Since we usually need to look at the entire execution history of a program, traces can be gigabytes in size. However, because disks to hold the traces are inexpensive and workstations to analyze them are fast and cheap, traces are an appealing way to analyze programs. We have to consider two problems: how to generate a trace and how to use it to determine program performance.

A trace can be captured by hardware methods. For example, a logic analyzer can be attached to the microprocessor bus to capture memory bus traffic. There are two major limitations to this approach. First, the size of a hardware trace buffer is usually limited. The hardware must be able to store instructions at execution speed; if we use high-speed RAM for the buffer, we are limited to perhaps several million instructions. This means that we probably cannot capture a trace of the entire program, only a section of its execution. Second, if the CPU has an on-chip cache, we will not be able to see internal memory references to the program or the data.

Alternatively, a microprocessor emulator can be used to trace the PC over time. Although this technique lets us see what is going on inside the chip, not just behavior on the memory bus, it is too slow to generate long traces: the microprocessor must be stopped for probing, and probing is very slow compared to normal execution, so capturing long traces would take exorbitant amounts of time.

Some CPUs have hardware facilities for automatically generating trace information.
For example, the Pentium family microprocessors generate a special bus cycle, a branch trace message, that shows the source and/or destination address of a branch [Col97]. If we record only these branch messages, we can reconstruct the instructions executed within the basic blocks while greatly reducing the amount of memory required to hold the trace.

There are three major methods for generating a trace in software: PC sampling, instrumentation instructions, and simulation.

The PC sampling technique uses the operating system or some other facility to periodically interrupt the program and record the current PC value. This method has the same limitations as any sampling method: if the sampling period is too long, we may miss important behavior. In particular, a poorly chosen sampling interval can be a multiple of some periodic behavior in the program, causing that behavior to be missed entirely.

Alternatively, instructions can be added to the program that write information to a trace buffer. The instructions required depend on the information to be captured. The simplest case is to add code to the start of each basic block that records that program execution passed through that point. Traces are often used to analyze cache or superscalar behavior, in which case the instrumentation code may need to save additional information about the state of the CPU. Software-based trace generation does add execution time to the program, but this overhead usually does not affect the measurement.

Simulation methods interpret the instructions to simulate the CPU, and they can generate the required trace information at the same time.

Obtaining a representative trace requires some knowledge of what the program does. Someone needs to supply inputs that properly exercise the program. Program users should know the types of data typically presented to the program.
However, because they may not know which data inputs will cause worst-case behavior, some collaboration between the program designers and users may be necessary. The techniques described later in this series can be used to measure how thoroughly a set of inputs covers the program's control and data flow, giving you an idea of how representative your traces are.

Trace Analysis

Once we have a trace, several things can be done with it. At minimum, the trace tells us the execution path of the program. By assigning the proper execution times to the instructions in the trace (taking into account interactions between instructions, and so on), we can determine the total execution time from the trace. In the case of a trace that does not record the complete execution of a program, we have to infer the execution path through the missing pieces. However, if the sampling interval is properly chosen, the missing program fragments can be easily determined.

Traces are often used to analyze cache behavior, allowing us to calculate memory access times throughout the code. This information can be used to select the proper size of cache to use with a CPU (either by selecting a CPU model with a suitable internal cache or by adding an external cache). It can also be used to optimize the program to improve its cache behavior.

A trace is generated on a particular CPU. We may want to use the trace to analyze the performance of the program on a different CPU. For example, when selecting which microprocessor to use, we may not want to purchase a complete evaluation and compilation system for each potential CPU just to measure the performance of each on our programs. There are limits to what we can do in this case, but the trace can still be useful.
We can use the trace to determine the execution path through the program, although we have to be careful that the compilers used for the two CPUs perform similar optimizations. We can also approximate the execution time on the new CPU, particularly if the instruction sets of the CPUs are relatively simple. Since many RISC processors have similar basic instructions with similar execution times (and because many compilers use only a small fraction of the CPU's instruction set), we can often obtain reasonable estimates of execution time on the new CPU by examining the trace from another CPU.

Loop Optimizations

Loops are important targets for optimization because programs with loops tend to spend a lot of time executing those loops. Three important techniques for optimizing loops are code motion, induction variable elimination, and strength reduction.

Code motion lets us move unnecessary code out of a loop. If a computation's result does not depend on operations performed in the loop body, then we can safely move it out of the loop. Code motion opportunities can arise because programmers may find some computations clearer and more concise when put in the loop body, even though they are not strictly dependent on the loop iterations. A simple example of code motion is also common. Consider the following loop:

for (i = 0; i < N*M; i++) {
   z[i] = a[i] + b[i];
}

The code motion opportunity becomes more obvious when we draw the loop's CDFG as shown in Figure 5-23 below. The loop bound computation is performed on every iteration during the loop test, even though the result never changes. We can avoid N*M - 1 unnecessary executions of this statement by moving it before the loop, as shown in the figure.

An induction variable is a variable whose value is derived from the loop iteration variable's value.
The compiler often introduces induction variables to help it implement the loop. Properly transformed, we may be able to eliminate some variables and apply strength reduction to others.

A nested loop is a good example of the use of induction variables. A simple nested loop appears below.

for (i = 0; i < N; i++)
   for (j = 0; j < M; j++)
      z[i][j] = b[i][j];

The compiler uses induction variables to help it address the arrays. Let us rewrite the loop in C using induction variables and pointers. (Below, we use a common induction variable for the two arrays, even though the compiler would probably introduce separate induction variables and then merge them.)

for (i = 0; i < N; i++)
   for (j = 0; j < M; j++) {
      zbinduct = i*M + j;
      *(zptr + zbinduct) = *(bptr + zbinduct);
   }

In the above code, zptr and bptr are pointers to the heads of the z and b arrays and zbinduct is the shared induction variable. However, we do not need to compute zbinduct afresh each time. Since we are stepping through the arrays sequentially, we can simply add the update value to the induction variable as follows:

zbinduct = 0;
for (i = 0; i < N; i++) {
   for (j = 0; j < M; j++) {
      *(zptr + zbinduct) = *(bptr + zbinduct);
      zbinduct++;
   }
}

This is a form of strength reduction, since we have eliminated the multiplication from the induction variable computation.

Figure 5-23. Code motion in a loop

Strength reduction helps us reduce the cost of a loop iteration. Consider the following assignment:

y = x * 2;

In integer arithmetic, we can use a left shift rather than a multiplication by 2 (as long as we properly keep track of overflows). If the shift is faster than the multiply, we probably want to perform the substitution. This optimization can often be used with induction variables because loops are often indexed with simple expressions.
Strength reduction can often be performed with simple substitution rules, since there are relatively few interactions between the possible substitutions.

Cache optimizations

A loop nest is a set of loops, one inside the other, as illustrated in Figure 5-24 below. Loop nests occur when we process arrays. A large body of techniques has been developed for optimizing loop nests. Rewriting a loop nest changes the order in which array elements are accessed. This can expose new parallelism opportunities that can be exploited by later stages of the compiler, and it can also improve cache performance. In this section we concentrate on the analysis of loop nests for cache performance.

Let us first consider the examples of Figure 5-24 in more detail. We can represent the set of array locations indexed by the loop by an iteration space. Each point in this space is defined by a vector, each component of which is one of the loop's iteration variables. The points in the space represent the iterations executed by the program. The top example in the figure accesses every element of the arrays (the a and b matrices have the same access patterns in this case). The lower example accesses bands in the a and b matrices. We can represent any particular point in the iteration space as an iteration vector, such as the vector i = <3, 2> shown in the figure.

Figure 5-24. Vector representation of nested loops

Cache miss equations. Ghosh et al. [Gho97] developed an analytical technique for describing data cache misses in nested loops. The equations not only describe the set of cache misses but also help us to optimize the program's access patterns. Because numerous equations are necessary to describe a program, the equations are designed to be generated and solved by design tools.
Here we concentrate on the basic principles underlying the equations. After understanding these principles, we can apply them by hand in small cases as well as understand what our tools are doing in large cases.

We need to write equations that describe all types of misses: compulsory, capacity, and conflict. The equations are written in terms of reuse vectors. If two iterations i1 and i2 use the same array location, the reuse vector that defines that reuse is r = i2 - i1 (assuming that i2 comes after i1). Assume for the moment that our cache is direct mapped; we need to know the cache's line size so that we can compute which cache line each memory reference falls into.

Compulsory misses occur in two cases: on the first access along a reuse vector and when the reuse vector spans more than one cache line. Capacity and conflict misses can be captured in a single form of equation. In the following equation, two references i and j conflict if they access distinct memory lines that map to the same cache line:

L(i) = L(j). (EQ 5-1)

If the cache size is Cs and the line size is Ls, we can rewrite this equation as

mem(i) = mem(j) + n*Cs + b. (EQ 5-2)

In EQ 5-2, n is an integer that represents an arbitrary offset and b is an integer in the range -Loff <= b <= Ls - 1 - Loff, where Loff is the offset of the memory reference within the cache line.

We can use these equations for a variety of purposes. For instance, if we want to minimize the number of conflicts between array accesses, we can parameterize the equations in terms of the starting addresses of the arrays and the number of elements in each row (by padding the arrays), and then find the values for these parameters that minimize the number of cache conflicts. Example 5-10 below illustrates the concepts behind cache miss equations.
Example 5-10: Data Realignment and Array Padding

Assume we want to optimize the cache behavior of the following code:

for (j = 0; j < M; j++)
   for (i = 0; i < N; i++)
      a[j][i] = b[j][i] * c;

Let us also assume that the a and b arrays are sized with M at 265 and N at 4, and that we have a 256-line, direct-mapped cache with four words per line. Even though this code does not reuse any data elements, cache conflicts can cause serious performance problems because they interfere with spatial reuse at the cache line level.

Assume that the starting location for a[] is 1024 and the starting location for b[] is 4099. Although a[0][0] and b[0][0] do not map to the same word in the cache, they do map to the same cache line.
As a result, we see the following scenario in execution: The access to a[0][0] brings in the first four words of a[]. The access to b[0][0] replaces a[0][0] through a[0][3] with b[0][0] and the contents of the three locations before b[]. When a[0][1] is accessed, the same cache line is again replaced with the first four elements of a[]. Once the a[0][1] access brings that line into the cache, it remains there for the a[0][2] and a[0][3] accesses, since the b[] accesses are now on the next line. However, the scenario repeats itself at a[0][4] and every four iterations of the loop.

One way to eliminate the cache conflicts is to move one of the arrays. We do not have to move it far: if we move b's start to 4100, we eliminate the cache conflicts. However, that fix won't work in more complex situations, where moving one array may only introduce cache conflicts with another array. In such cases, we can use another technique called padding. If we extend each row of the arrays by one element, with the padding word placed at the beginning of the row, we eliminate the cache conflicts. In this case, b[0][0] is located at 4100 by the padding. Although padding wastes memory, it can substantially improve memory performance. In complex situations with multiple arrays and sophisticated access patterns, we have to use a combination of techniques, relocating arrays and padding them, to minimize cache conflicts.

Optimizing for execution speed

If you need to improve the speed of your program, you can attack the problem at several levels of abstraction. First, make sure that the code really needs to be accelerated. If you are dealing with a large program, the part of the program using the most time may not be obvious. Profiling the program will help you find hot spots. Many systems are built from multiple programs running as communicating processes; a slow program may not actually limit the overall system performance.
You may be able to redesign your algorithm to improve efficiency. Examining asymptotic performance is often a good guide to efficiency. Doing fewer operations is usually the key to performance. In a few cases, however, brute force may provide a better implementation. A seemingly simple high-level language statement may in fact hide a very long sequence of operations that slows down the algorithm. Using dynamically allocated memory is one example, since managing the heap takes time but is hidden from the programmer. For example, a sophisticated algorithm that uses dynamic storage may be slower in practice than an algorithm that performs more operations on statically allocated memory.

Finally, you can look at the implementation of the program itself. A few hints on program implementation:

- Try to use registers efficiently. Group accesses to a value together so that the value can be brought into a register and kept there.

- Make use of page mode accesses in the memory system whenever possible. Page mode reads and writes eliminate one step in the memory access. You can increase the use of page mode by rearranging your variables so that more of them can be referenced contiguously.

- Analyze cache behavior to find major cache conflicts, then restructure the code to eliminate as many of them as you can:
  - For instruction conflicts, if the offending code segment is small, try to rewrite the segment to make it as small as possible so that it better fits into the cache; writing in assembly language may be necessary. For conflicts across larger spans of code, try moving the instructions or padding with NOPs.
  - For scalar data conflicts, move the data values to different locations to reduce conflicts.
  - For array data conflicts, consider either moving the arrays or changing your array access patterns to reduce conflicts.
Next in Part 8: Analysis and optimization of energy, power and program design.

To read Part 1, go to "Program design and analysis."
To read Part 2, go to "Models of programs, assemblers and linkers."
To read Part 3, go to "Basic compilation techniques."
To read Part 4, go to "The creation of procedures."
To read Part 5, go to "Register allocation and scheduling."
To read Part 6, go to "Analysis and optimization of execution time."

Used with the permission of the publisher, Newnes/Elsevier, this series of articles is based on copyrighted material from "Computers as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book can be purchased online.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems.

References:
[Gho97] Somnath Ghosh, Margaret Martonosi, and Sharad Malik, "Cache miss equations: an analytical representation of cache misses," Proceedings of the 11th ACM International Conference on Supercomputing, ACM Press, 1997.
[Col97] Robert Collins, "In-circuit emulation," Dr. Dobb's Journal, September 1997, pp. 111-113.