
WBIA09

Proceedings of the Workshop on Binary Instrumentation and Applications


In cooperation with the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42)
December 12, 2009, New York, NY, U.S.A.

ACM International Conference Proceedings Series ACM Press

The Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, New York 10121-0701

ACM COPYRIGHT NOTICE. Copyright 2009 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept., ACM, Inc., fax +1 (212) 869-0481, or permissions@acm.org. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, +1-978-7508400, +1-978-750-4470 (fax). Notice to Past Authors of ACM-Published Articles ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you have written a work that was previously published by ACM in any journal or conference proceedings prior to 1978, or any SIG Newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform permissions@acm.org, stating the title of the work, the author(s), and where and when published.

ACM ISBN: 978-1-60558-793-6

FOREWORD
Welcome to the Workshop on Binary Instrumentation and Applications 2009, the third in the series. Instrumentation is an effective technique for observing and verifying program properties. It has been used for diverse purposes, from profile-guided compiler optimizations, to microarchitectural research via simulation, to enforcement of software security policies. While instrumentation can be performed at the source level as well as the binary level, the latter has the advantage of being able to instrument the whole program, including dynamically linked libraries. Binary instrumentation also obviates the need for source code. As a result, instrumentation at the binary level has become immensely useful and continues to grow in popularity. This workshop provides an opportunity for developers and users of binary instrumentation to exchange ideas for building better instrumentation systems and new use cases for binary instrumentation, static or dynamic.

The first session contains papers that use binary instrumentation to study and improve hardware features. In "Studying Microarchitectural Structures with Object Code Reordering," Shah Mohammad Faizur Rahman, Zhe Wang and Daniel A. Jiménez describe an approach to understanding microarchitectural structures by running different versions of the same program. Using branch predictors as an example, they vary the object code layout. In "Synthesizing Contention," Jason Mars and Mary Lou Soffa present a profiling approach for studying the performance impact on applications running on multi-core systems due to interference from other cores. The approach creates contention by running a synthetic application at the same time as the real application. In "Assessing Cache False Sharing Effects by Dynamic Binary Instrumentation," Stephan M. Günther and Josef Weidendorfer use binary instrumentation to estimate the effects of false sharing in caches for multi-threaded applications.

The second session contains papers that improve software systems. The ideas presented improve existing binary instrumentation techniques or use binary instrumentation to improve other software systems. In "Metaman: System-Wide Metadata Management," Daniel Williams and Jack W. Davidson describe an infrastructure to store and access meta-information for a program during its entire build process and runtime, thereby improving the build and runtime systems. In "A Binary Instrumentation Tool for the Blackfin Processor," Enqiang Sun and David Kaeli provide a static binary instrumentation system for embedded systems and use it to perform dynamic voltage and frequency scaling as a case study. In "Improving Instrumentation Speed via Buffering," Dan Upton, Kim Hazelwood, Robert Cohn and Greg Lueck present a technique for reducing instrumentation overhead by decoupling data collection (i.e. instrumentation) from data analysis. Finally, in "ThreadSanitizer -- Data Race Detection In Practice," Konstantin Serebryany and Timur Iskhodzhanov present a new tool based on Valgrind for detecting data races.

Thank you for participating in WBIA. We would like to especially thank the program committee for their careful reviews and quick turnaround, and the organizers of MICRO 2009 for providing a venue for this workshop.

Robert Cohn, Jeff Hollingsworth, Naveen Kumar
Workshop organizers, December 2009

Program Committee: Derek Bruening, VMware; Bruce Childers, University of Pittsburgh; Robert Cohn, Intel; Saumya Debray, University of Arizona; Koen De Bosschere, Ghent University; Sebastian Fischmeister, University of Waterloo; Jeff Hollingsworth, University of Maryland; Robert Hundt, Google; Naveen Kumar, VMware; Greg Lueck, Intel; Nicholas Nethercote, Mozilla; Stelios Sidiroglou, MIT; Alex Skaletsky, Intel; Mustafa Tikir, SDSC

Table of Contents
Session 1: Instrumentation for improving hardware

Studying Microarchitectural Structures with Object Code Reordering
Shah Mohammad Faizur Rahman, Zhe Wang and Daniel A. Jiménez ... 7

Synthesizing Contention
Jason Mars and Mary Lou Soffa ... 17

Assessing Cache False Sharing Effects by Dynamic Binary Instrumentation
Stephan M. Günther, Josef Weidendorfer ... 26

Session 2: Instrumentation for improving software

Metaman: System-Wide Metadata Management
Daniel Williams and Jack W. Davidson ... 34

A Binary Instrumentation Tool for the Blackfin Processor
Enqiang Sun and David Kaeli ... 43

Improving Instrumentation Speed via Buffering
Dan Upton, Kim Hazelwood, Robert Cohn and Greg Lueck ... 52

ThreadSanitizer - data race detection in practice
Konstantin Serebryany and Timur Iskhodzhanov ... 62

Studying Microarchitectural Structures with Object Code Reordering


Shah Mohammad Faizur Rahman, Zhe Wang, Daniel A. Jiménez

Department of Computer Science The University of Texas at San Antonio


{srahman,zhew,dj}@cs.utsa.edu

ABSTRACT
Modern microprocessors have many microarchitectural features. Quantifying the performance impact of one feature such as dynamic branch prediction can be difficult. On one hand, a timing simulator can predict the difference in performance given two different implementations of a technique, but simulators can be quite inaccurate. On the other hand, real systems are very accurate representations of themselves, but often cannot be modified to study the impact of a new technique. We demonstrate how to develop a performance model for branch prediction using real systems based on object code reordering. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture. We also use the reordered object code to validate a reverse-engineered model of the Intel Core 2 branch predictor. We simulate several branch predictors using Pin and measure which hypothetical branch predictor has the highest correlation with the real one. This study in object code reordering points the way to future work on estimating the impact of other structures such as the instruction cache, the second-level cache, instruction decoders, indirect branch prediction, etc.

1. INTRODUCTION

Last year at ASPLOS, Mytkowicz et al. showed us that, in essence, each of us studies a given program from our own very limited point of view [15]. We measure the properties of the program behavior and report results and conclusions. But if we step outside of that limited point of view, just a little, the conclusions could be very different. For each combination of program and compiler options, there is a whole world of points of view, each represented by a different perturbation of instruction and data addresses. Their paper was provocatively named "Producing wrong data without doing anything obviously wrong!" The conclusion is that, if I look at a program from one point of view and you look at it from another, then any comparison of our results is meaningless, and we might even come to different conclusions about whether a given optimization is a good idea. We might even engage in measurement bias, i.e., we might come to believe that an accidental improvement in program behavior is due to our own brilliant technique. However, by sampling and observing many of these points, we can get a much better understanding of program behavior. We can compare results with one another as well as answer interesting questions about programs and the processors that run them.

In this paper, we present two techniques based on the object code reordering work of Mytkowicz et al. A benchmark program is compiled into object files, and then many binary executable versions of that program are produced by linking the object files in different orders. Each binary is semantically equivalent, but because the instruction addresses are different, different conflicts will arise among microarchitectural structures such as the branch predictor and instruction cache. The situation is isomorphic to one in which we keep the binary executable constant, but change the hash functions for these microarchitectural structures. Based on this object code reordering work, we describe two techniques:

1. We demonstrate how to develop a performance model for SPEC CPU 2006 benchmarks running on the Intel Core 2 processor. The technique perturbs benchmark executables to yield a wide variety of performance points without changing program semantics or other important execution characteristics such as the number of retired instructions. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture.

2. We use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor. Modern microprocessors come with sophisticated branch predictors. If the organization of the branch predictor is known to an architecture-aware compiler, it can use the information to improve program performance. Unfortunately, the organization of Intel branch predictors is not widely disclosed. Uzelac et al. introduced reverse engineering techniques using microbenchmarks to expose the organization of the Pentium M branch predictor [19]. We use similar techniques to get an idea about the organization of the Intel Core 2 branch predictor. Those microbenchmarks provide a rough outline of the organization of the branch predictor, but there is no certain way to tell whether our assumption is correct. We use the reordered object code to validate our hypothetical branch predictor organization. We first reverse engineered the Intel Core 2 branch predictor to find its organization. Then we use the reordered object code to validate the predictor and reveal different attributes and characteristics of it. We simulate several branch predictors using Pin [10] and try to find the hypothetical branch predictor that has the highest correlation with the real one.

2. RELATED WORK

In this section we discuss related work.

2.1 Eliciting Performance Variance

Mytkowicz et al. introduce the technique of object file reordering, showing that different link orders of object files, as well as other seemingly random and harmless details of an experimental setup, can yield significantly different performance [15]. That work indicts the ASPLOS community for falling victim to measurement bias, i.e., allowing oneself to believe that some observed improvement in program behavior is due to one's own technique rather than a happy coincidence of experimental factors. Our work was partly inspired by Mytkowicz et al. We choose to see the phenomenon they exposed as an interesting opportunity to develop a tool to examine microarchitectural behavior. Rubin et al. propose a framework to explore the space of data layouts using profile feedback to find layouts that yield good performance [17]. They point out that the general problem of optimal data layout is NP-hard and poorly approximable. The space of data layouts is similar to the space of object file reorderings, and the impact of data layouts on the data cache is similar to the impact of code placement on the branch predictor and instruction cache.

2.2 Impact of Code Placement on Performance

The impact of code placement on performance has not gone unnoticed in the academic literature. Many code-improving transformations have been proposed based on code placement. Hatfield and Gerald [3], Ferrari [5], McFarling [11], Pettis and Hansen [16], and Gloy and Smith [6] present techniques to rearrange procedures to improve locality based on profile data. Most of these techniques use a profiled weighted call graph with edges weighted by call frequency. Procedures with high weight are placed close to one another to avoid conflict misses. Calder and Grunwald present branch alignment, an algorithm that seeks to minimize the number of taken branches by reordering code such that the hot path through a procedure is laid out in a straight line [1], thus minimizing the performance penalty of a discontinuous fetch. Their technique improves performance by an average of 5% on an Alpha AXP 21064. Young et al. present a near-optimal version of branch alignment [21]. Jiménez proposes a technique to use code placement to explicitly avoid branch mispredictions due to conflicts in the predictor tables [8]. Knights et al. propose exploiting fortuitous object code orderings to improve performance [9]. Our performance model technique is not an optimization, but a tool for peering inside the microarchitecture using code placement. If thoughtful code placement optimizations like those mentioned above were widely adopted, our results would show less variance in execution behavior and less confidence in the regression lines. Nevertheless, most production code is not optimized with code placement in mind; thus, our results are widely applicable to real systems.

2.3 Estimating Performance of Real Systems

Contreras and Martonosi use performance monitoring counters to develop a linear power model of the Intel XScale processor [2]. This approach can enable a technique capable of quickly estimating future power behavior and adapting to it at run-time. Our performance model technique is similar in that it uses performance monitoring counters to develop a model of program behavior. However, we focus on modeling the behavior of one program at a time to get very precise information about the change in performance in response to a small change in the behavior of microarchitectural structures, i.e., our work concentrates on a much finer level of granularity, and we focus on performance instead of power.

2.4 Reverse Engineering Branch Predictors

Milenkovic et al. [13] proposed a reverse engineering flow focusing on the P6 and NetBurst architectures and suggested the size and organization of the BTB and the presence and lengths of global and local histories. This flow does not include any experiments for determining the organization of predictor structures indexed by program path information nor their internal operation. Uzelac et al. [19] introduced a new set of experiment flows to determine the organization and different attributes of the Pentium M microprocessor branch predictor. However, all these experiment flows provide an idea about the organization of the branch predictors based on logical reasoning about their experimental results. These results could have been interpreted differently to come up with slightly different branch predictor organizations. Our object code reordering technique validates different simulated branch predictors to find the organization that is closest to the original branch predictor.

3. APPROACH

In this section we describe our approach to studying microarchitectural structures with object code reordering. The basic idea is to execute code under many different reorderings, causing a wide variance in performance due to different accidental collisions in microarchitectural structures. By measuring the resulting adverse microarchitectural events, we can build a performance model for the program and microarchitecture, and we can also validate a reverse-engineered model of the branch predictor.

3.1 Instruction Addresses in Microarchitectural Structures

Our approach exploits the fact that several microarchitectural structures use a hash of instruction addresses. For example:

1. A 128-set instruction cache with 64-byte blocks would likely use bits 6 through 12 of the instruction address as the set index.

2. A branch direction predictor might index a table of counters using a combination of branch history and branch address bits.

3. A branch target buffer (BTB) or indirect branch predictor would use lower-order bits of the branch address to index a table of branch targets.

Sometimes instruction addresses will accidentally collide in some microarchitectural structure. For example, conflict misses in the instruction cache occur when the number of blocks mapping to a particular set exceeds the associativity of the cache. Although this phenomenon has been studied in academic research, most compilers do not optimize to protect against these kinds of conflicts. Compiler writers are aware of uses of instruction addresses and write compilers to exploit these uses. For instance, a common heuristic is to align the target of a branch on a boundary divisible by the number of bytes in a fetch block to allow the fetch beginning at that target to read the maximum number of instruction bytes in one cycle.
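As a concrete illustration of such address hashes, the following C sketch computes the kinds of indices described above. The bit widths follow the examples in the text; they are illustrative assumptions, not a documented layout of any particular processor.

#include <stdint.h>

/* Illustrative index functions; widths follow the examples above,
   not any documented processor layout. */

/* 128-set instruction cache with 64-byte blocks: bits 6..12 select the set. */
unsigned icache_set(uint64_t pc) {
    return (unsigned)(pc >> 6) & 0x7F;            /* 7 bits -> 128 sets */
}

/* gshare-style direction predictor: branch address combined with history. */
unsigned predictor_index(uint64_t pc, unsigned history, unsigned table_bits) {
    unsigned mask = (1u << table_bits) - 1;
    return ((unsigned)pc ^ history) & mask;
}

/* BTB or indirect branch predictor: low-order bits of the branch address. */
unsigned btb_index(uint64_t pc, unsigned table_bits) {
    return (unsigned)pc & ((1u << table_bits) - 1);
}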


3.5 Validating the Reverse-Engineered Branch Predictor

We execute each reordered object code natively and also in our simulated branch predictor environment, and measure the number of mispredictions. Different layouts of the object code of the same program will have considerably different numbers of mispredictions because the branch addresses seen by the predictor differ. By executing all of these object code reorderings on different simulated branch predictors, we use the correlation coefficient to draw a comparison between the real branch predictor and our simulated ones.

4. EXPERIMENTAL METHODOLOGY

3.2 A Wide Range in Performance


These accidental conflicts result in adverse microarchitectural events such as branch mispredictions, instruction cache misses, BTB misses, etc. A particular layout of the code will result in a particular number of accidental collisions with a particular impact on performance. A different layout will result in a different impact on performance. By exploring a wide range of layouts, we can force a wide range of adverse performance events to take place and explore a wide range of performances. Figure 1 shows the percent difference from average performance, as measured by CPI, caused by 100 random but plausible code reorderings for the SPEC CPU 2006 benchmarks running on an Intel Core 2 processor. The graph is a violin plot, showing the probability density at each CPI value, i.e., the thickness at each CPI value is proportional to the number of CPIs observed in that neighborhood. Clearly, some benchmarks are greatly affected by differences in instruction addresses while others are less sensitive.

This section describes the experimental methodology used for this paper.

4.1 Compiler

We use the Camino compiler infrastructure [7]. This system is a post-processor for the GNU Compiler Collection (GCC) version 4.2.4. C, C++, and FORTRAN programs are compiled into assembly language, the assembly language is instrumented by Camino, and the result is assembled and linked into a binary executable. Camino features a number of profiling passes and optimizations, but for this study we implement and use only the profiling and instrumentation pass described below. All of the binary executables produced for this study target the x86-64 instruction set.

4.2 Benchmarks

3.3 Causing Collisions


To generate many random but plausible code layouts, we use the technique of Mytkowicz et al., i.e., object-file reordering. We compile each benchmark once, then link it 100 times, each with a different pseudo-randomly generated order of the object files. The linker lays code out in the order in which it is encountered on the command line, so each random ordering results in a different code layout. We then execute each resulting binary executable five times, collecting performance monitoring counter information such as the number of instructions committed, number of branch mispredictions, number of clock cycles, etc. We take the performance monitoring counter statistics from the run that gave the median performance. Details of our infrastructure are given in Section 4.

We use the SPEC CPU 2006 benchmarks for this study. Of the 29 benchmarks, 23 compile and run without errors with our compiler infrastructure. These benchmarks are listed in the x-axes of several graphs in later sections.

4.3 Generating Random Object Orderings

Each benchmark is compiled once with Camino. The resulting object files can be linked to make a binary executable. We use a program that accepts a seed to a pseudo-random number generator to generate a pseudo-random but reproducible ordering of the object files. This program takes as input a list of object files and produces as output a linker command that links the object files in the pseudo-random order.
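A minimal sketch of such a reordering driver is shown below, assuming the seed and object files are passed on the command line; the interface and the emitted link command are illustrative assumptions, since the paper does not give the tool's details.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical usage: ./reorder <seed> a.o b.o c.o ...
   Prints a link command with the object files in a seeded,
   reproducible pseudo-random order. */
int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s seed file.o ...\n", argv[0]);
        return 1;
    }
    srand((unsigned)atoi(argv[1]));
    int n = argc - 2;
    char **objs = &argv[2];
    for (int i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        char *tmp = objs[i]; objs[i] = objs[j]; objs[j] = tmp;
    }
    printf("gcc -o benchmark");                /* the linker keeps this order */
    for (int i = 0; i < n; i++)
        printf(" %s", objs[i]);
    printf("\n");
    return 0;
}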

4.4 System

3.4 Making Predictions


Once the performance monitoring counter information has been collected, we can begin using statistical tools to build a performance model. We use least-squares linear regression to estimate the relationship between various microarchitectural events and performance outcomes. Linear regression only works if we may confidently assume that there is a linear relationship. Under normal circumstances CPI and MPKI do indeed have a linear relationship: for each benchmark, there is an average misprediction penalty, and each extra misprediction increases the number of cycles by this penalty.

We perform our study using four seven-processor Dell systems with identical configurations running the 64-bit version of Ubuntu Linux 8.04 Server and a custom-compiled kernel with performance monitoring counter support. Each system contains seven quad-core Intel Xeon E5440 processors. The Intel 5400 series processors are based on the Intel Harpertown core, which is the same core used in quad-core Intel Core 2 processors. Each system has 16GB of SDRAM, and each processor has a 12MB second-level cache. Each core in the Intel Xeon E5440 processor has 32KB instruction and 32KB data caches. The branch predictor of the Intel Xeon E5440 is not documented, but through reverse-engineering experiments we have determined that it is likely to contain a hybrid of a GAs-style branch predictor and a bimodal branch predictor [20, 18, 4].

Figure 1: Violin plots of SPEC CPU 2006 percentage performance variation (percent difference in CPI) with object code reordering.

4.5 Running with Performance Monitoring Counters
We measure a number of performance monitoring counters using the perfex command found in the PAPI performance monitoring package [14]. The Intel Core processor allows up to two user-defined microarchitectural events to be counted simultaneously. We are interested in more than two events, so we make multiple runs of each benchmark to collect all of the desired counters. We group the counters into three sets of two. For each set we run each benchmark five times and take the measurements given by the run with the median number of cycles. Only the microarchitectural events that occur while user code is running are counted, thus the impact of system events is minimized. We collect the following statistics:

1. Retired branches mispredicted.
2. Retired x86 instructions excluding exceptions and interrupts.
3. L1 instruction cache misses.
4. L2 cache misses.
5. Elapsed clock cycles.

From these counters, we can derive other statistics such as cycles per instruction (CPI), branch mispredictions per 1000 instructions (MPKI), various cache miss rates, etc. Although each system is configured identically and each core has the same microarchitecture, we use the Linux taskset command to make sure that each benchmark always runs on the same core to eliminate the effect of possible slight differences among the cores. Each run is performed on an otherwise quiescent system with as many system services stopped as possible without compromising the ability to access remote files and log in remotely. Stack address randomization, a security feature that resists stack-smashing attacks, is disabled to minimize performance variance not due to code placement.
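The derived statistics are simple ratios of these counters. The following C sketch shows the arithmetic and the median-of-five selection described above; the counter values in main are made-up placeholders.

#include <stdio.h>
#include <stdlib.h>

/* Raw counter values for one run of one benchmark. */
struct run { double cycles, instructions, mispredicted_branches; };

double cpi(const struct run *r)  { return r->cycles / r->instructions; }
double mpki(const struct run *r) { return 1000.0 * r->mispredicted_branches / r->instructions; }

int cmp_by_cycles(const void *a, const void *b) {
    double d = ((const struct run *)a)->cycles - ((const struct run *)b)->cycles;
    return (d > 0) - (d < 0);
}

int main(void) {
    /* Five repetitions of one reordering (placeholder values). */
    struct run runs[5] = {
        { 2.10e11, 3.0e11, 1.90e9 }, { 2.07e11, 3.0e11, 1.80e9 },
        { 2.08e11, 3.0e11, 1.85e9 }, { 2.12e11, 3.0e11, 1.95e9 },
        { 2.09e11, 3.0e11, 1.87e9 },
    };
    qsort(runs, 5, sizeof runs[0], cmp_by_cycles);
    struct run median = runs[2];            /* run with the median cycle count */
    printf("CPI = %.3f, MPKI = %.2f\n", cpi(&median), mpki(&median));
    return 0;
}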


4.6 Simulation

We develop several branch predictor simulators. We implement these as a tool in Pin [10]. We then run Pin on the same binary executables that we run natively. Our Pin tool instruments each branch with a callback to code that simulates a set of branch predictors. The tool counts the number of branches executed and the number of branches mispredicted for each predictor simulated.

4.7 Timing Concerns

Many of the SPEC CPU 2006 benchmarks run for over 30 minutes on the first ref input. For this study, we have executed each of the 23 benchmarks at least 100 times on a set of 4 computers. To facilitate this study, we instrument the benchmarks such that under native execution they run for up to approximately two minutes each. To do this, we implement a two-pass profiling and instrumentation pass in the Camino compiler. The first pass inserts instrumentation that collects information about each procedure. The benchmark is allowed to run for two minutes. Then the collected information is analyzed to find a procedure with a low dynamic count that is also executed near the end of the two-minute run. The second pass of the compiler instruments only that procedure such that when it has executed the same number of times as before, the program is ended. The first instrumentation has low overhead, thus the resulting executable runs for approximately two minutes. The second instrumentation affects a low-frequency procedure and takes two x86 instructions, thus it has negligible overhead. All of the binary executables in this study are compiled from this second instrumentation, or are from benchmarks that naturally run for less than two minutes. Because we are counting procedures and not elapsed time, each run of a benchmark executes the same number of user instructions.
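A C-level sketch of the run-shortening idea follows. The real pass operates at the assembly level inside Camino; the procedure name and the recorded trigger count here are hypothetical.

#include <stdlib.h>

/* Pass 1 records how many times a chosen low-frequency procedure runs in
   roughly two minutes; pass 2 plants a counter in that procedure and ends
   the program once the same count is reached again, so every run retires
   the same number of user instructions. */

static unsigned long trigger_count = 1000;   /* value measured by pass 1 (made up) */
static unsigned long calls;

void chosen_low_frequency_procedure(void)    /* hypothetical instrumented procedure */
{
    if (++calls == trigger_count)
        exit(0);                             /* deterministic early termination */
    /* ... original body of the procedure ... */
}

int main(void)
{
    for (;;)                                 /* stand-in for the benchmark's work */
        chosen_low_frequency_procedure();
}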

5. BUILDING A PERFORMANCE MODEL FOR MICROARCHITECTURAL STRUCTURES

In this section we show how to build a performance model for branch prediction using object code reordering. We develop and evaluate regression models for a number of benchmarks. We explore using several characteristics of program behavior such as branch prediction and cache misses.

Figure 2: Correlation coefficients of various measurements (branch mispredictions, L1 instruction cache misses, L2 cache misses due to fetch, and the combined estimator) for each benchmark.

5.1 Statistics
We make use of some statistical techniques in the performance model study. We briefly review these techniques:

1. Correlation coefficients. Also known as Pearson's r, correlation coefficients range from -1.0 to 1.0 and measure the correlation between two random variables. A higher magnitude for r means a higher degree of correlation. A negative value of r means that the two variables are negatively correlated.

2. Coefficient of determination. The sample coefficient of determination, computed as r^2 where r is the correlation coefficient, gives the fraction of dependence of a given observation on an underlying factor.

3. Linear regression. We develop several estimators of performance using linear least-squares regression, which finds a best-fit equation of a line between two variables; e.g., we find a linear equation in terms of MPKI that estimates CPI. The best fit minimizes the sum of squared errors between the regression line and the observed data. We also use multi-linear least-squares regression to produce an estimator for performance in terms of several observed variables.

4. Hypothesis testing. We use Student's t-test for hypothesis testing. We formulate a null hypothesis, e.g. "there is no correlation between CPI and MPKI," then use hypothesis testing to see if the null hypothesis can be rejected. We consider a result significant if the null hypothesis can be rejected with p = 0.05, i.e., the probability that the null hypothesis is not true is 95%. Student's t-test gives a meaningful result in the presence of normally distributed data.

5. Confidence intervals and prediction intervals. For the linear regression lines, we plot 95% confidence intervals and 95% prediction intervals, which are closely related to the t-test mentioned above. A 95% confidence interval has a 95% chance of containing the true regression line, i.e., of all the data collected, the line that best illustrates the linear relationship between CPI and MPKI has a 95% chance of being in that confidence interval. The larger 95% prediction interval has a 95% chance of containing all of the observations (i.e. CPIs) that would be encountered in a given domain (i.e. set of MPKIs).
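For concreteness, a minimal C sketch of the first and third computations (Pearson's r and the least-squares line) over one benchmark's (MPKI, CPI) pairs is shown below; r^2 is simply the square of the returned correlation coefficient.

#include <math.h>

/* Pearson correlation coefficient between x[] (e.g. MPKI) and y[] (e.g. CPI). */
double pearson_r(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    return (n * sxy - sx * sy) /
           sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}

/* Least-squares fit y = m*x + b over the same observations. */
void fit_line(const double *x, const double *y, int n, double *m, double *b) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *b = (sy - *m * sx) / n;
}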

5.2 Establishing Correlation

Object file reordering can elicit a wide range of CPIs for our benchmarks, but before we can make use of this we must establish that there is correlation between the microarchitectural events measured and the performance observed. We focus on what we believe to be the microarchitectural events most likely to be affected by code placement:

1. Branch mispredictions. Conditional branch predictors use the address of an instruction to index one or more tables. If two or more branches conflict with one another in these tables, a phenomenon called aliasing [12], branch prediction accuracy can suffer.

2. L1 instruction cache misses. The Intel Xeon core has a 32KB, 8-way set associative instruction cache. If nine or more frequently used blocks map to the same set, there will be frequent cache misses.

3. L2 cache misses.

4. We also use multi-linear regression to develop a combined model that takes into account all three of these events, in the hope that a combined model will be more accurate than using one of the observations by itself.

Figure 2 shows the correlation coefficients between CPI and the counts of these events. Clearly, for most benchmarks, branch prediction is most significantly correlated with performance. Some benchmarks show negative correlations between CPI and events. The reason is that branch mispredictions and cache misses sometimes cause useful data to be prefetched, which benefits performance. It must be emphasized that the correlation we report between microarchitectural events and performance is with respect to object code ordering. Other changes to the execution environment would show other correlations. For instance, if we allowed the operating system to do stack address randomization, we would observe significant variance in L1 data cache misses, and a commensurate impact on performance.

5.3 Assigning Blame

Using r^2, the coefficient of determination, we can determine what portion of performance is due to a particular microarchitectural event. Figure 3 shows the cumulative r^2 for each of the three events, as well as r^2 for the combined regression model.


Figure 3: Coefficient of determination showing how much of each type of event accounts for overall performance.

Some benchmarks are more sensitive; for instance, 84.2% of the CPI variance of 462.libquantum is due to branch mispredictions. For the average model, 82.3% of the variance can be blamed on branch prediction. The average bar for the combined model does not reach exactly the same height as the sum of the three measurements. This is because the three measurements are not altogether independent of one another; for instance, in some cases a branch misprediction might cause an L1 cache event, sometimes causing cache pollution and other times causing prefetching.

Benchmark Name / Branch MPKI / L1 I-Cache Misses / L2 Cache Misses / Combined Estimator
400.perlbench    yes yes yes
401.bzip2
403.gcc          yes yes yes yes
410.bwaves
416.gamess       yes yes yes
429.mcf          yes yes
433.milc
434.zeusmp       yes yes
435.gromacs      yes yes
436.cactusADM
444.namd         yes
445.gobmk        yes yes yes
450.soplex       yes
454.calculix     yes yes
456.hmmer        yes yes
459.GemsFDTD
462.libquantum   yes yes yes
464.h264ref      yes yes
465.tonto        yes yes yes
471.omnetpp      yes yes
473.astar        yes yes yes
482.sphinx3
483.xalancbmk    yes yes

Table 1: "Yes" means that the null hypothesis of no correlation is rejected with p ≤ 0.05, i.e., with 95% probability the given measurement is correlated with CPI.

5.4 Establishing Statistical Significance


Clearly, many benchmarks' performance shows correlation with microarchitectural events. However, we must ask whether the correlation is statistically significant. It could be the case that enough accidental correlation exists to make any models derived from these events meaningless. We use Student's t-test to determine statistical significance. For each of the three measurements as well as the combined model, we attempt to reject the null hypothesis that there is no correlation. The value p ≤ 0.05 for the t-test is traditionally accepted as proof of statistical significance. Table 1 shows "yes" for each combination of measurement and benchmark where the null hypothesis can be rejected with at most p = 0.05, i.e., with 95% probability there is correlation between CPI and the measurement for that benchmark.

5.4.1 Blame the Branch Predictor

Of the 23 benchmarks, 13 show significant correlation between CPI and branch prediction. In other words, for over half of the benchmarks, we determined that there was at least a 95% chance that our performance model technique found significant correlation between CPI and MPKI. For the other benchmarks, there was not enough range of MPKI to predict CPI. No other measurement consistently shows statistically significant correlation with CPI. The combined estimator does not increase the number of benchmarks showing significant correlation. Thus, in this paper we focus our attention on branch prediction.

5.4.2 Other Measurements

The object file reordering methodology clearly elicits a large impact on branch prediction. In future work, we will explore other methods of manipulating binaries to emphasize other measurements such as instruction cache and data cache misses. We will also study the impact of other events dependent on code placement. For instance, the number and type of x86 instructions in a 32-byte fetch block on the Opteron has a large impact on the efficiency of decoding, but at this point it is not clear how to measure that impact.


5.5 A Linear Performance Model


We use least-squares linear regression to derive branch prediction performance models for each of the benchmarks that passed the hypothesis testing phase. For each benchmark, we find the best fit of the observed data to a regression line y = mx + b where y is CPI and x is MPKI. The slope (m) gives the performance cost of one additional MPKI, and the y-intercept (b) gives the predicted average CPI for perfect branch prediction, i.e. 0 MPKI. We also derive 95% confidence intervals and 95% prediction intervals for the regression lines. Figure 4 shows the regression line and intervals for 400.perlbench. The confidence interval has a 95% chance of containing the true regression line for the data observed. The much wider prediction interval has a 95% chance of containing future observations. Linear regression allows us to make the following predictions for this benchmark with 95% probability:

1. A perfect branch predictor would yield a CPI of 0.517 ± 0.029, an improvement of 26.0% ± 4.2%.

2. Halving the average MPKI from 6.50 to 3.25 would improve CPI by 13.0% ± 2.2%, from 0.70 to 0.61 ± 0.022.

3. A 10% improvement in CPI due to branch prediction improvement would require a 38% reduction in mispredictions.

Table 2 shows the slopes and y-intercepts found by linear regression for each benchmark. It also shows the high and low prediction intervals for perfect prediction.

Benchmark        Slope   y-intercept   Low     High
400.perlbench    0.028   0.517         0.488   0.546
403.gcc          0.028   1.839         1.796   1.882
416.gamess       0.041   0.548         0.519   0.577
429.mcf          0.019   4.675         4.531   4.819
435.gromacs      0.020   0.811         0.795   0.827
445.gobmk        0.019   0.643         0.515   0.771
456.hmmer        0.041   0.203         0.032   0.375
462.libquantum   0.022   1.432         1.433   1.431
464.h264ref      0.032   0.466         0.451   0.481
465.tonto        0.027   0.632         0.617   0.647
471.omnetpp      0.036   1.901         1.860   1.941
473.astar        0.022   2.373         2.289   2.456
483.xalancbmk    0.029   1.914         1.881   1.947

Table 2: Least-squares regression model relating branch prediction to performance, showing the high and low prediction intervals for perfect prediction, i.e. 0 MPKI.
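The predictions listed above are straightforward arithmetic on the fitted slope and intercept. As a small worked sketch, the following C program plugs in 400.perlbench's numbers from Table 2 and the mean MPKI quoted in the text, reproducing the roughly 26% and 13% improvements stated above.

#include <stdio.h>

/* Evaluate the fitted model CPI = m*MPKI + b for 400.perlbench
   (m and b from Table 2; mean MPKI from the text). */
int main(void) {
    const double m = 0.028, b = 0.517, mean_mpki = 6.50;
    double cpi_now     = m * mean_mpki + b;          /* ~0.70 */
    double cpi_halved  = m * (mean_mpki / 2) + b;    /* ~0.61 */
    double cpi_perfect = b;                          /* 0 MPKI */
    printf("perfect prediction: CPI %.3f (%.1f%% better)\n", cpi_perfect,
           100.0 * (cpi_now - cpi_perfect) / cpi_now);
    printf("halved MPKI:        CPI %.3f (%.1f%% better)\n", cpi_halved,
           100.0 * (cpi_now - cpi_halved) / cpi_now);
    return 0;
}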

6.1 Demystifying the Branch Predictor

6. REVERSE ENGINEERING BRANCH PREDICTOR AND VALIDATION


In this section we discuss the techniques and the results of our branch predictor reverse engineering experiment. First, we get an overview of the branch predictor and make some educated guesses about its organization. Then we use the object code reordering technique to verify which of these branch predictor organizations most closely resembles the original one.

Because the branch predictor has a substantial impact on the performance of a program, it is regarded as one of the most important components of a modern microprocessor. For this very reason, to achieve the best performance, branch predictors are becoming more and more complex. Not only are multiple predictors combined together to form hybrid predictors, but complex hash functions are also used to index the different tables. Thus, it is difficult to determine the exact configuration. Using different microbenchmarks as proposed in [19] and [13], we can make an educated guess. We use techniques similar to those proposed by Uzelac et al. to determine the attributes of the Intel Core 2 branch predictor. To determine the global history length we use the following experiment. We generate patterns of lengths ranging from 2 to 16. For each pattern length we generate 20 samples. For each sample we execute the microbenchmarks of Figure 5 and Figure 6 five times and measure the number of mispredictions using the Intel VTune Performance Analyzer.

int main(void)
{
    int pattern[MAX_LENGTH], i, j, a = 0;
    read_pattern(pattern, length);
    for (i = 0; i < ITERATIONS; i++) {
        for (j = 0; j < length; j++)
            if (pattern[j])
                a++;
    }
    return 0;
}

Figure 5: Experiment 1 for finding the global history length.

We plot the pattern length versus the number of mispredictions in Figure 7. From the graph we can see that for Experiment 1, where there is only one if condition inside the loop,

Figure 4: Least-squares regression line and 95% confidence and prediction intervals for 400.perlbench (cycles per instruction versus mispredictions per 1000 instructions).


int main(void)
{
    int pattern[MAX_LENGTH], i, j, a = 0, b = 0;
    read_pattern(pattern, length);
    for (i = 0; i < ITERATIONS; i++) {
        for (j = 0; j < length; j++) {
            if (pattern[j])
                a++;
            if (!pattern[j])
                b++;
        }
    }
    return 0;
}

Figure 6: Experiment 2 for finding the global history length.

there is a noticeable change in the number of mispredictions for pattern lengths greater than 6. Because the inner loop branch is also recorded in the history for each pattern element, each element contributes two branches, so we suppose that the Intel Core 2 branch predictor has a 12-bit global history. Experiment 2 (Figure 6) further confirms our claim: with two ifs inside the loop, the number of mispredictions increases for pattern lengths greater than 4, so again we conclude that at least 12 bits of global history are used.

benchmarks from SPEC CPU 2006. We use the same technique as mentioned earlier to make sure that under native execution they run for up to approximately two minutes each. We also use the same technique as described before to generate 100 random object code reorderings for each of the benchmarks.

6.3 Native Execution

We use the perfex command from the PAPI performance monitoring package [14] to measure two performance monitoring counters while executing our benchmarks. We measure the number of retired branches and the number of mispredicted branches. As before, we use the Linux taskset command to make sure that each benchmark runs on the same core, to eliminate possible inconsistencies among the cores. For each of the 100 reorderings of every benchmark, we run five times and take the median.

6.4 Branch Predictor Simulation

Based on the results of our reverse engineering, we use different combinations of program counter and global history bits XORed together for indexing. We simulate each of these configurations and build them as tools in Pin [10]. We run each of the 100 reorderings of those benchmarks with our simulated branch predictors and measure the number of mispredictions for every configuration.

6.5 Data Analysis

We have the MPKI data for native execution and for simulated branch predictor execution. Figure 9 shows a bar chart where the leftmost bar is the original MPKI, whereas the other four MPKIs are obtained from four different simulated branch predictors. The four configurations are:

Figure 7: Finding the global history length.

The other attributes of the branch predictor are found as follows. We use the same microbenchmarks as [19] to test for the existence of a loop predictor in the Intel Core 2 branch predictor. From the experiment we conclude that it might be a 128-entry, 2-way set associative loop predictor. We found the existence of a branch history table with 16K entries, where each entry is a two-bit saturating counter. The program counter is also used for indexing the table. We assume the existence of a hash function that combines some bits from the global history and some bits from the program counter with some bitwise function.

Figure 9: MPKI of real and simulated branch predictors.

1. 12 bits of global history and 12 bits from the program counter, with 10 bits XORed.

2. 14 bits of global history and 14 bits from the program counter, with 14 bits XORed.

3. 10 bits of global history and 12 bits from the program counter, with 8 bits XORed.

4. 14 bits of global history and 14 bits from the program counter, where the first 8 bits of global history are XORed with the last 8 bits of the program counter and the last 6 bits of global history are XORed with the first 6 bits of the program counter.
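As an illustration of how such a simulated predictor can be driven once per executed branch (in our setup this is done from the Pin tool), the following C sketch implements configuration 1 above. The alignment of the 10-bit overlap and the counter update policy are assumptions; the text only specifies the bit counts and the 16K-entry two-bit counter table.

#include <stdint.h>

#define TABLE_ENTRIES (1 << 14)          /* 16K two-bit saturating counters */

static uint8_t counters[TABLE_ENTRIES];  /* 0,1 predict not taken; 2,3 predict taken */
static unsigned ghr;                     /* 12-bit global history register */

/* Configuration 1: 12 PC bits and 12 history bits overlapped so that
   10 bits are XORed, giving a 14-bit index (the alignment is assumed). */
static unsigned index_cfg1(uint64_t pc) {
    return ((unsigned)pc & 0xFFF) ^ ((ghr & 0xFFF) << 2);
}

/* Called once per executed conditional branch; returns 1 on a correct
   prediction, 0 on a misprediction, and updates the predictor state. */
int predict_and_update(uint64_t pc, int taken) {
    unsigned i = index_cfg1(pc);
    int predicted_taken = counters[i] >= 2;
    if (taken && counters[i] < 3)
        counters[i]++;
    else if (!taken && counters[i] > 0)
        counters[i]--;
    ghr = ((ghr << 1) | (taken != 0)) & 0xFFF;
    return predicted_taken == (taken != 0);
}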

6.2 Benchmarks
We emphasize integer benchmarks because these programs are known to have more branches and more hard-to-predict branches. We select eleven integer and three floating point


Figure 8: Correlation between the actual and simulated branch predictors.

We calculate the correlation coefficient between the MPKIs from the original and simulated branch predictors. We want to find a configuration that provides a correlation coefficient close to 1.0. Figure 8 shows a graph plotted with data recorded from the original branch predictor and our simulated one for all 14 benchmarks. From the graph we can see that most of the benchmarks are very close to the diagonal line that represents the ideal correlation. The only benchmark that strays a little away from the line is 445.gobmk. The best correlation coefficient we found so far is 0.993, for configuration 1, which gives us a high level of confidence.

Figure 10: Global history table access hash.

The structure contains a 16K-entry table of bimodal two-bit saturating counters indexed by 12 bits from the program counter and 12 bits from the global history, with 10 bits XORed together (Figure 10). We also found that there is a 128-entry, 2-way set associative loop predictor. Using this object code reordering technique we successfully validate our assumption about the branch predictor.

7. FUTURE WORK

We see great potential in this technique to enhance our ability to accurately model microarchitectural behavior. We plan five main thrusts for our future work:

1. We will apply this technique to other microarchitectural structures such as the instruction cache, the second-level cache, instruction decoders, indirect branch prediction, etc. Preliminary evidence we have gathered indicates that acquiring more data points will lead to statistically significant estimates of performance from L1 instruction cache misses for many benchmarks.

2. Code placement is easily manipulated by object code reordering, but this technique is sometimes insufficient to elicit a wide range of program behaviors. We will investigate other ways to affect code placement. We will also look at other factors such as the placement of data in the program text, on the stack, and with the memory allocator, to allow using binary interferometry to estimate the performance impact of the data caches.

3. We will validate this approach with other microarchitectures such as the Intel Core i7 and other processors as we are able to acquire access to them.

4. We will investigate how to determine the best number of samples to choose for each benchmark to balance accurate estimation with computationally intensive collection of measurements and simulation.

5. We will use this technique to validate reverse-engineered branch predictor models and will try to incorporate more branch prediction features such as the update policy and hash function accuracy. Using this technique, we can be sure that the reverse-engineered model behaves exactly the same as the real predictor.

8. CONCLUSION

We have presented a technique for developing a performance model for programs running on a given microarchitecture based on the effect of code placement on certain microarchitectural structures. We have shown that a reverse-engineered branch predictor model can be successfully validated using this technique.

9. REFERENCES

[1] Brad Calder and Dirk Grunwald. Reducing branch costs via branch alignment. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1994.


[2] Gilberto Contreras and Margaret Martonosi. Power prediction for Intel XScale processors using performance monitoring unit events. In ISLPED '05: Proceedings of the 2005 International Symposium on Low Power Electronics and Design, pages 221-226, New York, NY, USA, 2005. ACM.
[3] D. J. Hatfield and J. Gerald. Program restructuring for virtual memory. IBM Systems Journal, 10(3):168-192, 1971.
[4] M. Evers, P.-Y. Chang, and Y. N. Patt. Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches. In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
[5] Domenico Ferrari. Improving locality by critical working sets. Communications of the ACM, 17(11):614-620, November 1974.
[6] Nikolas Gloy and Michael D. Smith. Procedure placement using temporal-ordering information. ACM Transactions on Programming Languages and Systems, 21(5):977-1027, September 1999.
[7] Chunling Hu, John McCabe, Daniel A. Jiménez, and Ulrich Kremer. The Camino compiler infrastructure. SIGARCH Computer Architecture News, 33(5):3-8, 2005.
[8] Daniel A. Jiménez. Code placement for improving dynamic branch prediction accuracy. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI), pages 107-116, June 2005.
[9] Dan Knights, Todd Mytkowicz, Peter F. Sweeney, Michael C. Mozer, and Amer Diwan. Blind optimization for exploiting hardware features. In CC '09: Proceedings of the 18th International Conference on Compiler Construction, pages 251-265, Berlin, Heidelberg, 2009. Springer-Verlag.
[10] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 190-200, New York, NY, USA, 2005. ACM.
[11] Scott McFarling. Program optimization for instruction caches. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 183-191. ACM, 1989.
[12] Pierre Michaud, André Seznec, and Richard Uhlig. Trading conflict and capacity aliasing in conditional branch predictors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 292-303, June 1997.
[13] M. Milenkovic, A. Milenkovic, and J. Kulick. Microbenchmarks for determining branch predictor organization. Software Practice and Experience, pages 465-488. John Wiley & Sons, 2004.
[14] Shirley Moore, David Cronk, Felix Wolf, Avi Purkayastha, Patricia Teller, Robert Araiza, Maria Gabriela Aguilera, and Jamie Nava. Performance profiling and analysis of DoD applications using PAPI and TAU. In DOD UGC '05: Proceedings of the 2005 Users Group Conference, page 394, Washington, DC, USA, 2005. IEEE Computer Society.
[15] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Producing wrong data without doing anything obviously wrong! In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 265-276, New York, NY, USA, 2009. ACM.
[16] Karl Pettis and Robert C. Hansen. Profile guided code positioning. In Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 16-27, June 1990.
[17] Shai Rubin, Rastislav Bodík, and Trishul Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In POPL '02: Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 140-153, New York, NY, USA, 2002. ACM.
[18] James E. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual International Symposium on Computer Architecture, pages 135-148, May 1981.
[19] Vladimir Uzelac and Aleksandar Milenkovic. Experiment flows and microbenchmarks for reverse engineering of branch predictor structures. In ISPASS, pages 207-217. IEEE, 2009.
[20] T.-Y. Yeh and Yale N. Patt. Two-level adaptive branch prediction. In Proceedings of the 24th ACM/IEEE International Symposium on Microarchitecture, pages 51-61, November 1991.
[21] Cliff Young, David S. Johnson, David R. Karger, and Michael D. Smith. Near-optimal intraprocedural branch alignment. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, June 1997.


Synthesizing Contention
Jason Mars
University of Virginia

Mary Lou Soffa


University of Virginia

jom5x@cs.virginia.edu

soffa@cs.virginia.edu

ABSTRACT
Multicore microarchitecture designs have become ubiquitous in today's computing environment, enabling multiple processes to execute simultaneously on a single chip. With these new parallel processing capabilities comes a need to better understand how co-running applications impact and interfere with each other. The ability to characterize and better understand cross-core performance interference can prove critical for a number of application domains, such as performance debugging, compiler optimization, and application co-scheduling, to name a few. We propose a novel methodology for the characterization and profiling of cross-core interference on current multicore systems, which we call contention synthesis. Our profiling approach characterizes an application's cross-core interference sensitivity by manufacturing contention with the application and observing the impact of this synthesized contention on the application. Understanding how to synthesize contention on current chip microarchitectures is unclear, as there are a number of potentially contentious data access behaviors. This is further complicated by the fact that current chip microprocessors are engineered and tuned to circumvent the contentious nature of certain data access behaviors. In this work we explore and evaluate five designs for a contention synthesis mechanism. We also investigate how these five contention synthesis engines impact the performance of 19 of the SPEC2006 benchmarks on two state-of-the-art chip multiprocessors, namely Intel's Core i7 and AMD's Phenom X4 architectures. Finally we demonstrate how contention synthesis can be used to accurately characterize an application's cross-core interference sensitivity.

General Terms
Performance, Contention, Multicore

Keywords
cross-core interference, profiling framework, program understanding

Categories and Subject Descriptors

D.1.3 [Programming Techniques]: Concurrent Programming - parallel programming; D.3.4 [Programming Languages]: Processors - run-time environments, compilers, optimization, debuggers; D.4.8 [Operating Systems]: Performance - measurements, monitors

1. INTRODUCTION


Multicore architectures are quickly becoming ubiquitous in todays computing environment. With each new generation of general purpose processors, much of the performance improvement in micro-architectural design is typically achieved by increasing the number of individual processing cores on a single chip. However, shared on-chip resources and the memory subsystem is typically shared among many cores, resulting in a potential performance bottleneck when scaling up multiprocessing capabilities. Current multicore architectures include early levels of small core-specic private caches, and larger caches shared among multiple cores [8]. When multiple processes or threads run in tandem, they can contend for the shared cache by evicting a neighboring process or threads data in order to cache its own data. This form of contention occurs when the working set of the neighboring processes or threads exceed the size of the private caches, relying on the shared cache resources. This contention can result in a signicant degradation in application performance. When an application suers a performance degradation due to contention for shared resources with an application on a separate processing core, we call this cross-core interference. The ability to characterize an applications sensitivity to cross core interference can prove indispensable for a number of applications. Compiler and optimization developers can use knowledge about the contention sensitivity of a particular code region to develop both static and dynamic contention conscious compiler techniques and optimization heuristics. For example, knowledge of contention sensitive code regions can be used to direct where software cache prefetching should be applied. Software developers can also use this cross-core interference sensitivity information for performance debugging and software tuning. The ability to characterize the most contention sensitive application phases allows the programmer to pin-point parts of the application on which to focus. In addition to performance debugging, software tuning, and compiler optimization, the characterization of application sensitivity to cross core interference


[Figure 1: Performance impact due to contention from co-location with lbm. Slowdown (1x to 1.4x) of each SPEC2006 benchmark on the Intel Core i7 Quad and the AMD Phenom X4.]
For example, the understanding of an application's contention sensitivity characteristics enables contention-conscious application co-scheduling. Applications that have higher demands on shared memory resources can be co-located with applications that have a lower demand, to gain better performance and throughput.

While ad-hoc and indirect approaches, such as measuring cache hits and misses via performance counters, can give a coarse indication of cross-core interference sensitivity, they are not sufficient to provide accurate and detailed profiling information about on-chip contention. Monitoring the shared cache misses directly is not sufficient in that not all cache misses reported by the hardware performance monitoring unit are misses that relate to code dependencies. In modern processors many misses reported by the hardware monitors are caused by hardware prefetches and hardware page table walks [8]. These effects do not relate to contention and are indistinguishable from cache misses that do.

We propose a cross-core interference profiling environment that uses contention synthesis. To accurately characterize and profile an application's sensitivity to cross-core interference, we synthesize contention, meaning we synthetically create contention with the host application. This contention synthesis is achieved by synthetically applying pressure on the shared cache using a contention synthesis engine (CSE). The profiling framework manipulates the execution of the CSE while observing the effect on the host application, measuring the impact over time, and assigning an impact score to the application. However, understanding how to synthesize contention on current chip microarchitectures is unclear, as there are a number of differing contentious data access behaviors, in addition to the fact that current chip microprocessors are engineered and tuned to circumvent the poor cache performance of certain data access behaviors.

In this work we explore the design space of our contention synthesis engine and investigate how contention synthesis can be used to characterize cross-core performance interference on modern multicore architectures. We have designed and evaluated five contention synthesis mechanisms that mimic five common data access behaviors. These include the random access of elements in a large array, the random traversal of large linked data structures, a real-world fluid dynamics application (the lbm SPEC2006 benchmark), data movement in 3D object space commonly found in simulations and scientific computing, and finally a contention synthesis engine that was constructed by reverse engineering lbm, finding its most contentious code, and further tweaking it to construct a highly contentious synthesis engine. In addition to presenting the design and implementation of these contention synthesis methods, we investigate how these five contention synthesis engines impact the performance of 19 of the SPEC2006 benchmarks on two state-of-the-art chip multiprocessors, Intel's Core i7 and AMD's Phenom X4 architectures. We also answer a number of questions as to whether the cross-core interference properties of applications tend to remain consistent regardless of the particular contention synthesis method chosen. Finally, we demonstrate how contention synthesis can be used to accurately characterize an application's cross-core interference sensitivity.
The contributions of this work include:

- The discussion of a novel methodology for the characterization of cross-core performance interference on current multicore architectures.
- The design and implementation of five contention synthesis mechanisms.
- The evaluation and study of the impact of these five contention synthesis mechanisms on 19 SPEC2006 benchmarks on both the Intel Core i7 and AMD Phenom X4 architectures.
- The cross-core interference sensitivity scoring of the SPEC2006 benchmarks.

Next, in Section 2 we motivate our work. We then provide an overview of our profiling approach in Section 3. Section 4 presents our contention synthesis methodology. We evaluate our contention synthesis approach in Section 5. Section 6 presents related work, and finally, we conclude in Section 7.

2. MOTIVATION

With the recent growth in popularity of multicore architectures comes an increase in the parallel processing capabilities of commodity systems. These commodity systems are now commonplace both in the general purpose desktop and laptop markets as well as in industry data-centers and computing clusters. Companies such as Google, Yahoo, and Microsoft use these off-the-shelf computing components to build their data-centers as they are cheap, abundant and easily replaceable [3]. The increase in parallel processing capabilities in these chip architectures is in fact leading to


server consolidation. However, the memory wall is preventing these parallel processing capabilities from being fully realized. The memory subsystem on current commodity multicore architectures is shared among the processing cores. Two representative examples of state-of-the-art multicore chip designs are the Intel Core i7 Quad Core chip and AMD's Phenom X4 Quad Core. Intel's Core i7 has four processing cores, each with a private 32 KB L1 cache and a 256 KB L2 cache. A large 8 MB L3 cache is shared among the four cores [8]. AMD's Phenom X4 also has four cores with a similar cache layout. Each core has a private 64 KB L1 and 512 KB L2, with a shared 6 MB L3 cache. These chips were designed to accommodate four simultaneous streams of execution. However, as we can see through experimentation, their shared caches and memory subsystem often cannot efficiently accommodate even two co-running processes.

Figure 1 illustrates the potential cross-core interference that can occur when multiple co-running applications are executing on current multicore architectures. We perform the following experiment using the Core i7 and Phenom X4 architectures. In this experiment we study the cross-core performance interference caused to each of the SPEC2006 benchmarks when co-running with lbm (one of the SPEC2006 benchmarks known to be especially heavy on the on-chip memory subsystem). Figure 1 shows the slowdown of each benchmark due to the cross-core interference. Each application was executed to completion on its ref inputs. On the y-axis we show the execution time of the application while co-running with lbm, normalized to the execution time of the application running alone on the system. The first bar in Figure 1 presents this data for the Core i7 architecture, and the second bar for the Phenom X4. As this graph shows, there are severe performance degradations due to cross-core interference on a large number of SPEC benchmarks. The large last-level on-chip caches of these two architectures do little to accommodate many of these co-running applications. On a number of benchmarks, including lbm, mcf, omnetpp, and sphinx, this degradation approaches 35%.

In addition to the general performance degradation, this sensitivity to cross-core interference is particularly undesirable for real-time and latency-sensitive application domains. In the latency-sensitive domain of web search, for instance, this kind of cross-core interference can cause unexpected slowdowns, negatively impacting the QoS of a search query. A commonly used solution is to simply disallow the co-location of latency-sensitive applications with others on a single machine, resulting in lowered utilization and higher energy cost [14]. Note that not all applications are affected by the contention properties of their co-runners. Applications such as hmmer, namd, and povray seem immune to lbm's cross-core interference. This observation shows that cross-core interference sensitivity varies substantially across applications.

3. PROFILING FRAMEWORK

[Figure 2: Our profiling framework. The host application and the contention synthesis engine run on separate cores that share the on-chip cache; the CiPE profiler runtime controls both.]

Contention and cross-core interference can only occur dynamically, and depend on a combination of the application's memory behavior, the design of the particular underlying architecture, and the applications co-running on this microarchitecture at any particular time. Because of these properties, characterizing this sensitivity using static analyses is intractable. Also, the traditional profiling of the application in isolation is not sufficient, as no contention is actually occurring. Although it is possible to observe the performance counters of current architecture designs to analyze cache misses, etc., these are indirect means to infer an application's memory behavior, as no actual contention or cross-core interference is occurring. Our methodology is to profile the application with real contention in a controlled environment, where we manufacture contention using a contention synthesis mechanism and dynamically monitor and profile the impact on the host application.

Figure 2 shows an overview of our cross-core interference profiling environment. This figure shows a multicore architecture with two separate cores sharing an on-chip cache and memory subsystem. The shaded boxes show our profiling framework, which is composed of the profiler runtime and a contention synthesis engine (CSE). As shown on the left side of Figure 2, the host application is controlled by the profiler runtime and is monitored throughout the execution of the application. Before the execution of the host application, the profiler spawns the contention synthesis engine on a neighboring core, as shown to the right of the figure. This CSE shares the cache and memory subsystem of the host application. As the application executes, the CSE aggressively accesses memory, trying to cause as much cross-core interference as possible. The profiler manipulates the execution of the contention synthesis engine, allowing bursts of execution to occur. Slowdowns in the application's instruction retirement rate that result from this bursty execution are monitored using the hardware performance monitoring (HPM) information [8]. This intermittent control of the CSE and monitoring of the HPM is achieved using the periodic probing approach [15]. A timer interrupt is used to periodically execute the monitoring and profiling directives. This has been shown to be a very low-overhead approach for the dynamic monitoring and analysis of applications.
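The periodic probing described above can be pictured as a timer-driven control loop that alternately pauses and resumes the CSE process while sampling the host application's retirement counter. The sketch below is our own illustration of that idea rather than the authors' implementation; in particular, read_instructions_retired() is a placeholder for whatever HPM interface is used, and controlling the CSE with SIGSTOP/SIGCONT is an assumption.

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder: a real implementation would read the host application's
     * instructions_retired counter through the hardware performance monitors. */
    static unsigned long long read_instructions_retired(void) { return 0; }

    static pid_t cse_pid;                     /* contention synthesis engine process */
    static volatile sig_atomic_t cse_on = 1;  /* is the CSE currently running? */
    static unsigned long long last, retired_with_cse, retired_without_cse;

    static void probe(int sig) {
        (void)sig;
        /* Record how many instructions the host retired during the last interval,
         * then toggle the CSE between a burst of execution and a pause. */
        unsigned long long now = read_instructions_retired();
        if (cse_on) retired_with_cse += now - last;
        else        retired_without_cse += now - last;
        last = now;
        kill(cse_pid, cse_on ? SIGSTOP : SIGCONT);
        cse_on = !cse_on;
    }

    void start_profiling(const char *cse_path) {
        cse_pid = fork();
        if (cse_pid == 0) { execl(cse_path, cse_path, (char *)NULL); _exit(1); }
        last = read_instructions_retired();
        signal(SIGALRM, probe);
        struct itimerval iv = { {0, 1000}, {0, 1000} };  /* fire every millisecond */
        setitimer(ITIMER_REAL, &iv, NULL);
    }

Because the probing work is confined to a short timer handler, the host application itself is left essentially untouched, which is consistent with the low-overhead claim made for the periodic probing approach.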

4. SYNTHESIZING CONTENTION

Many types of applications cause cache contention. With the continuing advances in microarchitectural design, simply accessing a large amount of data does not necessarily imply high pressure on cache and memory performance. The type of data access pattern and the way that data is mapped into the cache are very important to consider when constructing the contention synthesis engine. Structures such as hardware cache prefetchers and victim caches can avert poor and


contentious cache behavior even when the working set of the application is very large. The features and functionality of these hardware techniques are difficult to anticipate, as vendors keep these details closely guarded. Access patterns that exhibit a large amount of spatial or temporal locality can easily be prefetched into the earlier and later levels of cache.


4.1 Designing Contention Synthesis

To design our contention synthesis engine we explored and experimented with a number of common data access patterns. These designs consist of the random access of elements in a large array, the random traversal of large linked data structures, a real-world fluid dynamics application (the lbm SPEC2006 benchmark), data movement in 3D object space commonly found in simulations and scientific computing, and finally a highly contentious synthesis engine, which we call "The Sledgehammer", constructed by reverse engineering lbm, finding its most contentious code, and further tweaking it.


4.1.1 Naive

Figure 3 shows the C implementation of our naive contention synthesis mechanism.


The first inclination is to simply access a large array of memory (a little larger than the L3 cache), performing both loads and stores. Our earliest CSE design attempts consisted of an array of memory just larger than the last level of on-chip cache, which we traversed in a number of clever ways. However, the hardware prefetchers on both the Intel and AMD chips cleverly prefetched to early levels of cache. One example of an approach subverted by the hardware prefetchers was the caching of 10,000 random numbers to be used to access the elements of a large array. This naive design evolved to simply calculating the random index on the fly, as shown in Figure 3. The hardware prefetcher was unable to anticipate these memory accesses. The drawback of this approach, however, is the fact that each memory access is interleaved with the logic to calculate the random number, allowing for a high degree of instruction-level parallelism.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    char *data;

    int main(int argc, char *argv[]) {
        srand(time(0) + getpid());
        if (argc < 2) exit(1);
        int bytes = atoi(argv[1]) * 1024;
        data = (char *)malloc(bytes);
        for (int i = 0; i < bytes; i++)
            data[i] = rand() % 256;
        while (1) {
            for (int j = 0; j < bytes * 2; j++) {
                data[rand() % bytes] += data[rand() % bytes];
            }
        }
        printf("%d\n", (int)data[rand() % bytes]);
    }

Figure 3: Naive Contention Synthesis

    #include <iostream>
    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <unistd.h>
    using namespace std;

    const int payload_size = 128;

    void gen_name(char *ret) {
        for (int i = 0; i < payload_size; i++) {
            ret[i] = (char)(rand() % 256);
        }
    }

    struct tree_node {
        ~tree_node() {
            if (left) delete left;
            if (right) delete right;
        }
        tree_node *left;
        tree_node *right;
        int data;
        char text[payload_size];
    };

    class BST {
    private:
        tree_node *root;
    public:
        BST() { root = NULL; }
        bool isEmpty() const { return root == NULL; }
        void insert(int);
        void remove(int);
        void clear() { if (root) { delete root; } root = NULL; }
        unsigned long trample();
        unsigned long trample(tree_node *p);
    };

    // ... [standard implementation of insert and remove]

    unsigned long BST::trample() { return trample(root); }

    unsigned long BST::trample(tree_node *p) {
        unsigned long ret = 0;
        if (p != NULL) {
            if (p->data % 2) {                      // Using random traversal: +5%
                                                    // Using p->data instead of rand(): +2%
                if (p->left)  ret += trample(p->left);
                if (p->right) ret += trample(p->right);
                ret += (unsigned long)p->text[p->data % payload_size];
                p->data += ret;                     // Modifying data: +6%
                p->text[p->data % payload_size] = p->data % 256;
            } else {
                if (p->right) ret += trample(p->right);
                if (p->left)  ret += trample(p->left);
                ret -= (unsigned long)p->text[p->data % payload_size];
                p->data += ret;                     // Modifying data: +6%
                p->text[p->data % payload_size] = p->data % 256;
            }
        }
        return ret;
    }

    int main(int argc, char *argv[]) {
        int footprint = 8192;
        BST b;
        srand(time(0) + getpid());
        unsigned int node_size = sizeof(tree_node) + sizeof(BST *);
        for (int i = 0; i < footprint * 1024 / node_size; i++) {
            b.insert(payload_size + (rand() * payload_size));
        }
        unsigned long long sum = 0;
        while (1) sum += b.trample() + b.trample();
    }

Figure 4: Contention Synthesis Using a Linked Data Structure

4.1.2 Linked Data Structure

Figure 4 shows the C++ implementation of our linked data traversal contention synthesis mechanism based on a binary search tree. This design for the CSE consisted of the random construction and traversal of a binary search tree. There were also

a number of steps taken to reverse optimize (de-optimize for poor performance) this linked structure CSE approach. For example, the trample function is a specialized traversal that recursively picks whether the left or right subtree is to be trampled first. In the final design of this CSE, each tree node consisted of an id and a payload, and this custom traversal function, trample, is used, as shown in Figure 4. The payload consisted of a number of random bytes (128 in our design) to have the node map into its own cache line. The contentious kernel of this approach uses the trample function to perform a random depth-first search through


the tree, touching and changing the data along the way.


4.1.3 LBM from SPEC2006

The implementation of the LBM benchmark can be found in the official SPEC2006 benchmark suite [7]. LBM is an implementation of the Lattice Boltzmann Method (LBM). The Boltzmann Method is used to simulate incompressible fluids. We selected this benchmark as one of our synthesis mechanisms as it proved to be one of the most contentious of the SPEC2006 benchmark suite. For a complete description of LBM please refer to [7].


4.1.4 3D Data Movement

Figure 5 shows the C++ implementation of our 3D data movement contention synthesis mechanism. This 3D data movement micro-benchmark consists of a number of large 3D arrays of double precision values that represent solid virtual cubes. In our experimentation the dimensionality chosen for these cubes was 30x30x30. The contentious kernel of this CSE is the transposition of cells of each cube into the space of another cube. The cells of one cube are continuously copied to another.

    #include <iostream>
    #include <cstdlib>
    using namespace std;

    const int nug_size = 128;

    class nugget {
    public:
        char n[nug_size];
        nugget() { for (int i = 0; i < nug_size; i++) { n[i] = rand() % 256; } }
    };

    class block {
    public:
        nugget ***b;
        unsigned size;
        block(unsigned sz);
        ~block();
    };

    block::block(unsigned sz) {
        b = new nugget **[sz];
        for (int i = 0; i < sz; i++) {
            b[i] = new nugget *[sz];
            for (int j = 0; j < sz; j++)
                b[i][j] = new nugget[sz];
        }
        size = sz;
    }

    block::~block() {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++)
                delete[] b[i][j];
            delete[] b[i];
        }
        delete[] b;
    }

    int main() {
        const int size = 30;
        block b1(size);
        block b2(size);
        block b3(size);
        cout << "smash" << endl;
        while (1)
            for (int i = 0; i < size; i++)
                for (int j = 0; j < size; j++)
                    for (int k = 0; k < size; k++)
                        b1.b[i][j][k] = b2.b[j][k][i] = b3.b[i][j][k];
        return 0;
    }

Figure 5: Blockie

4.1.5 "The Sledgehammer"

Figure 6 shows the C implementation of our sledgehammer contention synthesis mechanism. This last design is the result of reverse engineering and investigating lbm to learn its contentious core nature. The name is motivated by the fact that the behavior of this design can be visualized as hitting an element in a 1D or 2D array, with a number of sparsely surrounding elements feeling the shock wave, i.e., being affected. As shown in Figure 6, the final version of this CSE first allocates two large arrays and then enters its contentious kernel, which copies data back and forth with this sledgehammer pattern.

    #include <stdlib.h>

    typedef double LBM_Grid[26000000];
    static double *srcGrid, *dstGrid;

    int main() {
        const unsigned long margin = 400000,
                            size = sizeof(LBM_Grid) + 2 * margin * sizeof(double);
        srcGrid = malloc(size);
        dstGrid = malloc(size);
        srcGrid += margin;
        dstGrid += margin;
        while (1) {
            int i;
            for (i = 0; i < 26000000; i += 20) {
                dstGrid[i]          = srcGrid[i];
                dstGrid[i - 1998]   = srcGrid[1 + i];
                dstGrid[i + 2001]   = srcGrid[2 + i];
                dstGrid[i - 16]     = srcGrid[3 + i];
                dstGrid[i + 23]     = srcGrid[4 + i];
                dstGrid[i - 199994] = srcGrid[5 + i];
                dstGrid[i + 200005] = srcGrid[6 + i];
                dstGrid[i - 2010]   = srcGrid[7 + i];
                dstGrid[i - 1971]   = srcGrid[8 + i];
                dstGrid[i + 1988]   = srcGrid[9 + i];
                dstGrid[i + 2027]   = srcGrid[10 + i];
                dstGrid[i - 201986] = srcGrid[11 + i];
                dstGrid[i + 198013] = srcGrid[12 + i];
                dstGrid[i - 197988] = srcGrid[13 + i];
                dstGrid[i + 202011] = srcGrid[14 + i];
                dstGrid[i - 200002] = srcGrid[15 + i];
                dstGrid[i + 199997] = srcGrid[16 + i];
                dstGrid[i - 199964] = srcGrid[17 + i];
                dstGrid[i + 200035] = srcGrid[18 + i];
            }
        }
        return 0;
    }

Figure 6: Sledge

5. EVALUATION

First we evaluate the effectiveness of these five contention synthesis engines at generating contention, and investigate how these varying contention generation mechanisms affect real applications on current commodity multicore microarchitectures. Our second goal is to use contention synthesis to characterize cross-core interference sensitivity and evaluate the ability of such a profiling framework to accurately identify applications that are indeed sensitive to cross-core interference.


5.1 Evaluating Contention Synthesis Designs


With the variety of contention synthesis mechanisms presented above, a number of questions arise. The first has to do with whether there is a drastic difference between the interactions of different applications with the different contention synthesis designs. We hypothesize that contention is agnostic to the nature of the memory access. We seek to evaluate this very question. The other goal of this evaluation is to learn whether there exists a synthesis engine that is better than all others, and if so, to identify it. Figures 7 and 8 show the performance impact of co-running each of the contention synthesis designs with each of the SPEC2006 benchmarks (C/C++ only, run to completion on ref inputs).


[Figure 7: Slowdown caused by contention synthesis on Intel Core i7. Slowdown (1x to 1.6x) of each SPEC2006 benchmark when co-running with the Naive, BST, LBM Core, Blockie, and Sledge engines.]


[Figure 8: Slowdown caused by contention synthesis on AMD Phenom X4. Same benchmarks and contention synthesis engines as Figure 7.]

These benchmarks were compiled with GCC 4.4 on the Linux 2.6.31 kernel. Figure 7 shows the results when performing this co-location on Intel's Core i7 Quad architecture, and Figure 8 shows these results on AMD's Phenom X4 Quad. The bars show the slowdown when co-located with naive random access (Naive), the binary search tree (BST), the lbm benchmark (LBM Core), the 3D block data movement (Blockie), and our sledgehammer technique (Sledge), in that order. The lbm benchmark can be viewed as a baseline against which to compare the other synthetic engines. We believe lbm to be a good point of reference as it is a naturally occurring example of contentious application behavior.

It is clear from the graphs that the Naive and BST approaches produce the smallest amount of contention. However, note that they do an adequate job of indicating the applications that are most sensitive to cross-core interference. The contention produced by these two approaches is low as there is a good bit of computation between single memory accesses. Blockie and Sledge touch large amounts of data in a single swoop and with less computation. Note that our Blockie and Sledge techniques are more effective than using the most contentious of the SPEC benchmarks.

Across the two architectures the general trend is similar, although we do see some differences. We see that applications that tend to be sensitive to contention tend to be uniformly so across these two representative architectures. We also see that the varying contention synthesis designs rank similarly on both architectures. This general trend supports our hypothesis that contention is agnostic across this class of commodity multicore architectures. Although the general trend is the same, there are some clear differences. For example, the benchmark most sensitive to cross-core interference differs between the two architectures. On Intel's architecture mcf shows the most significant degradation in performance, while on AMD's architecture lbm is the clear winner (or loser). These variations are due to the idiosyncrasies of the microarchitectural designs. The key observation is the fact that the effectiveness of the contention synthesis designs is mostly uniform across the different benchmark workloads. This trend supports our hypothesis that, in addition to being generally agnostic across this class of commodity multicore architectures, contention is also agnostic across the varying workloads and memory access patterns present in SPEC. For the following section we selected Sledge as our main CSE for our profiling framework, as it most vividly illustrates contention across the entire benchmark suite.

5.2 Characterizing Cross-Core Interference

To characterize an application's cross-core interference sensitivity, our profiling framework spawns the contention synthesis engine (CSE) on a neighboring core. As the application executes, the profiling runtime directs the CSE to produce short bursts of contentious execution. For every millisecond of execution, the profiler pauses the CSE for one millisecond. Slowdowns in the application's instruction retirement rate that result from this bursty execution are monitored using the instructions_retired hardware performance counter. A cross-core interference sensitivity score is then calculated as the average of these slowdowns.
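A minimal sketch of how such a score could be derived from the sampled retirement rates is shown below; the exact formula is our assumption, since the text only states that the score is the average of the observed slowdowns.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Per-interval instruction retirement rates, sampled while the CSE was
    // paused (rate_cse_off) and while it was running (rate_cse_on).
    double cis_score(const std::vector<double>& rate_cse_off,
                     const std::vector<double>& rate_cse_on) {
        double sum = 0.0;
        size_t n = std::min(rate_cse_off.size(), rate_cse_on.size());
        for (size_t i = 0; i < n; i++) {
            // Relative drop in instructions retired while the CSE runs.
            sum += (rate_cse_off[i] - rate_cse_on[i]) / rate_cse_off[i];
        }
        return n ? sum / n : 0.0;   // average slowdown over all probing intervals
    }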


[Figure 9: Comparing characterization score trend to actual cross-core interference on Intel Core i7. Per-benchmark sensitivity score and measured performance impact (0x to 0.35x).]

[Figure 10: Comparing characterization score trend to actual cross-core interference on AMD Phenom X4. Per-benchmark sensitivity score and measured performance impact (0x to 0.35x).]

Figures 9 and 10 show the cross-core interference sensitivity scores calculated using the described method for all C/C++ benchmarks in SPEC2006, compared against the performance degradation when each benchmark is co-running with lbm, on both the Intel Core i7 and AMD Phenom X4. Our results show that, generally, an application's cross-core interference sensitivity score correlates strongly with its performance degradation (e.g., lower CIS scores indicate smaller degradations and vice versa). Note that Figures 9 and 10 are intended to demonstrate the trend of how the cross-core interference sensitivity score relates to the application's actual degradation due to cross-core interference. These figures display a strong trend, indicating that our approach is indeed accurately characterizing cross-core interference sensitivity. However, there are three relative exceptions: sphinx, xalan and astar. These three benchmarks have very clear phases that seem to increase the inaccuracy of their averages. One possible way to address this challenge is increasing the periodic probing interval length. Studying their phase-level cross-core interference sensitivity scores would also give more insight into their dynamic sensitivity.

We have also experimented with using the change in last-level cache misses per cycle to detect contention and measure cross-core interference. Our results show that it is a worse indicator than directly measuring performance degradation using IPCs. Last-level cache misses alone are not always a good indicator either. For example, although an application with a large number of cache misses per cycle because of its heavy cache reliance may in fact be sensitive to cross-core interference, it could also be insensitive if the application already experiences heavy cache misses when running alone. In this latter case cross-core interference would not hurt its already poor cache performance. This often occurs when the working set of the application greatly exceeds the size of the last-level cache.

6. RELATED WORK

In this paper we present a profiling and characterization methodology for program sensitivity to cross-core interference on modern CMP architectures. Related to our work is a cache monitoring system for shared caches [25], which proposes novel hardware designs to facilitate a better understanding of how applications interact and contend when running together. Similar to our work, the system is then used for profiling and program behavior characterization. However, in contrast to our methodology, this work requires hardware extensions and thus is evaluated using simulations. Our methodology and framework are applicable to current commodity multicore architectures. In addition, our framework is not limited to cache contention but covers any contention in the memory system that impacts performance and can be produced by our CSE.

In recent years, cache contention has received much research attention. Most works focus on exploring the design space of cache and memory, proposing novel hardware solutions or management policies to alleviate the contention problem. Hardware techniques and related algorithms to enable cache management, such as cache partitioning and memory scheduling, are proposed in [22, 12, 19, 16, 4]. Other hardware solutions to guarantee fairness and QoS include [17, 10, 20].


Related to novel cache designs and architectural support, analytical models to predict the impact of cache sharing are also proposed [2]. In addition to new hardware cache management, approaches to managing shared caches through the OS include [21, 5]. Instead of novel hardware or software solutions to managing shared caches, our solution focuses on the other side of the problem, namely an application's inherent sensitivity to interference on existing modern microarchitectures.

The idea of profiling applications and learning about their memory-related characteristics from the profile to improve performance and develop more effective compiler optimizations is rather common. Much work has been done on constructing general frameworks for memory profiling of applications [18, 9], and on profiling techniques and methods that use such profiling to improve performance or to help develop better compilers and optimizations [9, 24]. Our work is different in that we focus on profiling a program's behavior in the presence of cross-core interference.

Contention-conscious scheduling schemes that guarantee fairness and increase QoS for co-running applications or multithreaded applications have been proposed [13, 6, 1]. Fedorova et al. used cache model prediction to enhance the OS scheduler to provide performance isolation. There are also theoretical studies that investigate approximation algorithms to optimally schedule co-running jobs on CMPs [11, 23].


7. CONCLUSION

In this paper, we present a methodology for profiling an application's sensitivity to cross-core performance interference on current multicore microarchitectures by synthesizing contention. Our profiling framework is composed of a lightweight runtime environment on which a host application runs, along with a carefully designed contention synthesis engine that executes on a neighboring core. We have explored and evaluated five contention synthesis mechanisms, which include the random access of elements in a large array, the random traversal of large linked data structures, a real-world fluid dynamics application, data movement in 3D object space commonly found in simulations and scientific computing, and finally a mechanism obtained by reverse engineering lbm, finding its most contentious code, and further tweaking it into a highly contentious synthesis engine. We have presented the design and implementation of these contention synthesis mechanisms and demonstrated their impact on the SPEC2006 benchmark suite on two real-world multicore architectures. Finally, we demonstrate how contention synthesis can be used dynamically, using a bursty execution method, to accurately characterize an application's cross-core interference sensitivity.

8. ACKNOWLEDGEMENTS

We would like to acknowledge the important input from Robert Hundt and Neil Vachharajani of Google. Their work and ideas have been instrumental for better understanding the problem of contention, especially as it relates to the data-center and cluster computing domains. We would also like to acknowledge the contribution of Lingjia Tang of the University of Virginia. Her comments and insights have been very helpful to the work and paper. Finally, we would like to thank Google for the funding support for this project.

9. REFERENCES

[1] M. Banikazemi, D. Poff, and B. Abali. PAM: a novel performance/power aware meta-scheduler for multi-core systems. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-12, Piscataway, NJ, USA, 2008. IEEE Press.
[2] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 340-351, Washington, DC, USA, 2005. IEEE Computer Society.
[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):1-26, 2008.
[4] J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. In ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, pages 242-252, New York, NY, USA, 2007. ACM.
[5] S. Cho and L. Jin. Managing distributed, shared L2 caches through OS-level page allocation. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 455-468, Washington, DC, USA, 2006. IEEE Computer Society.
[6] A. Fedorova, M. Seltzer, and M. D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 25-38, Washington, DC, USA, 2007. IEEE Computer Society.
[7] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1-17, 2006.
[8] Intel Corporation. IA-32 Application Developer's Architecture Guide. Intel Corporation, Santa Clara, CA, USA, 2009.
[9] M. Itzkowitz, B. J. N. Wylie, C. Aoki, and N. Kosche. Memory profiling using hardware counters. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 17, Washington, DC, USA, 2003. IEEE Computer Society.
[10] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 25-36, New York, NY, USA, 2007. ACM.
[11] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 220-229, New York, NY, USA, 2008. ACM.
[12] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 111-122, Washington, DC, USA, 2004. IEEE Computer Society.
[13] R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3):54-66, 2008.
[14] S. Lohr. Demand for data puts engineers in spotlight. The New York Times, June 17, 2008.
[15] J. Mars and R. Hundt. Scenario based optimization: A framework for statically enabling online optimizations. In CGO '09: Proceedings of the 2009 International Symposium on Code Generation and Optimization, pages 169-179, Washington, DC, USA, 2009. IEEE Computer Society.
[16] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair queuing memory systems. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 208-222, Washington, DC, USA, 2006. IEEE Computer Society.
[17] K. J. Nesbit, J. Laudon, and J. E. Smith. Virtual private caches. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 57-68, New York, NY, USA, 2007. ACM.
[18] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 89-100, New York, NY, USA, 2007. ACM.
[19] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 423-432, Washington, DC, USA, 2006. IEEE Computer Society.
[20] N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT '06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pages 2-12, New York, NY, USA, 2006. ACM.
[21] L. Soares, D. Tam, and M. Stumm. Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. In MICRO '08: Proceedings of the 2008 41st IEEE/ACM International Symposium on Microarchitecture, pages 258-269, Washington, DC, USA, 2008. IEEE Computer Society.
[22] G. E. Suh, S. Devadas, and L. Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In HPCA '02: Proceedings of the 8th International Symposium on High-Performance Computer Architecture, page 117, Washington, DC, USA, 2002. IEEE Computer Society.
[23] K. Tian, Y. Jiang, and X. Shen. A study on optimally co-scheduling jobs of different lengths on chip multiprocessors. In CF '09: Proceedings of the 6th ACM Conference on Computing Frontiers, pages 41-50, New York, NY, USA, 2009. ACM.
[24] Q. Wu, A. Pyatakov, A. Spiridonov, E. Raman, D. W. Clark, and D. I. August. Exposing memory access regularities using object-relative memory profiling. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization, page 315, Washington, DC, USA, 2004. IEEE Computer Society.
[25] L. Zhao, R. Iyer, R. Illikkal, J. Moses, S. Makineni, and D. Newell. CacheScouts: Fine-grain monitoring of shared caches in CMP platforms. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 339-352, Washington, DC, USA, 2007. IEEE Computer Society.


Assessing Cache False Sharing Effects by Dynamic Binary Instrumentation


Stephan M. Günther, Technische Universität München, Faculty of Informatics, guenther@cs.tum.edu
Josef Weidendorfer, Technische Universität München, Faculty of Informatics, weidendo@cs.tum.edu

ABSTRACT
Unnecessary sharing of cache lines among threads of a program, due to private data which is located in close proximity in memory, is a performance obstacle. Depending on access frequency to this data and the scheduling of threads to processor cores, this can lead to substantial overhead because of latency induced by cache line exchanges, known as false sharing. Since processor hardware cannot distinguish these effects from real data exchange (true sharing), all measurement tools have to rely on heuristics for detection. In this paper, we describe an approach using dynamic binary instrumentation to derive an estimate for the number of unnecessary exchanges of cache lines caused by false sharing, and a tool assisting the programmer in identifying the data structures involved as well as the code sections triggering false sharing. To evaluate the impact of false sharing, the estimated number of occurrences is translated into a temporal overhead. Results of our tool are presented for two small example codes and a real-world application.


Categories and Subject Descriptors
B.3.3 [Memory Structures]: Performance Analysis and Design Aids

General Terms
False Sharing

Keywords
False Sharing, Processor Caches, Dynamic Binary Instrumentation

1. INTRODUCTION

The development of multithreaded code for shared memory systems is expected to increase with the ongoing trend towards multicore architectures. One source of unnecessary parallelization overhead with multithreaded code is false sharing of cache lines. If private data of threads is located in the same memory block which is equal in size to a cache line and is accessed frequently in an interleaved manner by different processor cores, copies of the corresponding memory block are unnecessarily and repeatedly exchanged among caches, which potentially results in visible slowdowns.

If a programmer sees unexpected slowdown, profiling tools are used to analyze performance bottlenecks. To be useful, a tool should hint at the reason of a slowdown and hence also be able to relate arising memory latencies to false sharing effects. However, available tools provide at most statistics about data addresses involved with last-level cache misses, using sampling techniques for minimal overhead [5]. This is because support for direct detection of false sharing is unfeasible in hardware. However, it can be done by cycle-accurate architecture simulation, which tries to approximate reality by a detailed machine model. Aside from model dependency, a single result of such a simulation is still questionable for the problem at hand. Typically, when false sharing is the reason for visible slowdown, the runtime on actual hardware varies a lot if the code is executed multiple times in a row. The reason for this is that even small changes in OS scheduling can result in totally different interleavings of memory accesses by different processors. For detailed analysis of the expected average runtime behavior, multiple cycle-accurate simulation runs with slightly changed parameters would be needed. Taking the high time consumption of such simulations into account, it seems to be difficult to develop a tool with resource requirements acceptable for regular usage.

Instead of relying on measurement of real or accurately simulated hardware, we propose an approach to estimate worst-case numbers of false sharing events. The idea is to provide a list of data structures and code positions which is ordered by their worst-case influence on runtime. In addition, we propose a way to map our false sharing estimates to temporal overhead, which makes it possible to compare them, e.g., to latency due to bad cache behavior in general. Our approach identifies segments of execution separated by barrier synchronization. Only inside such segments can threads run concurrently and therefore cause false sharing. This allows us to handle each segment separately. As the key concept for the worst-case estimate, we do not assume or simulate any order of interleaved accesses by different threads within such segments, but only count the number of loads and stores executed. This has important consequences: First, as temporal order is not relevant for the estimate, we can use a simple machine model, because there is no need for exact simulation of advancing time. Second, the dynamic


binary instrumentation framework does not need to be able to run client threads concurrently. Also, there is no need to make the time slice of instrumentation frameworks which serialize thread execution especially small. Third, we keep the amount of data collected for false sharing detection at a manageable level. Detection can be postponed to a post-processing phase.

The prototype of our tool, called Pluto, consists of two parts. The data collection phase is implemented using Valgrind [8] as the DBI framework. This allows the tool to work with Linux and OS X binaries on x86, AMD64 and PPC architectures (with ARM expected to join). A patch for the Valgrind 3.5.0 sources is available on http://mmi.in.tum.de. The second phase post-processes the collected data and computes estimates for the number of occurrences and for time. It is planned to be integrated as an import plugin into an enhanced version of KCachegrind [11]. This allows for a side-by-side visualization of temporal estimates from Pluto and existing cache simulators.

The remainder of this paper is organized as follows: Section 2 presents related work. Afterwards we define our use of the term false sharing in Section 3 and describe the worst-case estimate. Furthermore, a promising modification to sharpen the estimate is presented. The tool's usage is described in Section 4 and typical output is shown. Finally, Section 5 shows some examples which exhibit false sharing. We provide real runtime measurements as well as our estimates for overhead due to false sharing.

2. RELATED WORK

False sharing is often mentioned in programmer manuals for parallel shared memory machines, e.g. SGI architectures [3]. However, it is difficult to provide an exact definition of the effect which matches intuition. Our definition is chosen to allow for occurrence estimates via execution-driven traces. In [2], different definitions and their practicability are compared on example code. They conclude that the effect on runtime is always difficult to predict. In [12], the intuitive effect of false sharing is captured by a simple benchmark, and latencies on various cache architectures are compared, including multicore systems. It is shown that shared caches can drastically reduce the effects of false sharing.

Intel's PTU tool [5] is able to give hints at such events by providing offsets within a cache line for sampled accesses to a given memory block, separated by thread ID. Last-level cache miss events are taken into account. For more precise identification, newer systems can even focus on coherency misses. The feature is based on PEBS (precise event based sampling) functionality of modern Intel processors, allowing the data address of memory accesses to be identified. However, as only a subset of events is sampled, exact ordering and interleaving of accesses cannot be detected, prohibiting direct false sharing detection. Also, aggregated information about accessed offsets can wrongly hint at true sharing where actually false sharing is happening. False sharing detection by architecture simulation is done in [10] and provides very detailed cache behavior, but it is based on a machine model which just takes latencies of specific instructions into account.

False sharing is also a significant problem for DSM (distributed shared memory) systems. There, the smallest memory unit used for the software caching algorithm (as well as the coherency protocol) has to be a memory page, in order to use the hardware memory management unit. As this is much larger than the size of a cache line in a processor cache, it drastically increases the risk of false sharing. Approaches to reduce this issue were researched in the 1990s [4, 1]. In [6], methods for compilers to reduce false sharing effects were proposed, but these do not seem to have been implemented.

3. CONCEPT

The following section gives a short overview of the notation used. Afterwards, false sharing events are defined for the scope of this paper. Using this formalism an estimate for the number of occurrences is derived. Further, a way to better approach real-world applications is proposed. The last section outlines how synchronization of threads by means of barriers can be taken into account.

3.1 Model and notation


We initially choose a simplified model for our approach where each core has a private cache of unlimited size. This is motivated by the assumption that false sharing in real-world applications involves only a small subset of cache lines which is frequently accessed and for this reason not subject to capacity misses. However, we also outline an approach to consider finite caches in Section 5. Furthermore, we assume that exactly one thread t ∈ T is executed per processor core and that threads are not relocated at runtime. Threads may be synchronized by barriers. In such a case, we refer to a parallel section as a code segment encompassed by barriers. Partial synchronization between threads is not considered yet.

Now let a coherency block b ∈ B be an aligned block of shared memory which equals the size of a cache line. The memory can be represented by an ordered sequence of coherency blocks. Let further X denote the set of possible offsets and Y the set of allowed access sizes to memory. Then we can denote a single load or store operation by thread t ∈ T to coherency block b ∈ B with offset x ∈ X and size y ∈ Y by l_xy^(t)[b] or s_xy^(t)[b], respectively. The total number of load or store operations with identical attributes is denoted by capital letters L_xy^(t)[b] and S_xy^(t)[b], respectively. For example, the total number of store operations S^(t)[b] to block b ∈ B by thread t ∈ T is given by

    S^(t)[b] = Σ_{x∈X} Σ_{y∈Y} S_xy^(t)[b].


Sometimes the offset x and access size y are not considered. Then we rely only on the summed values L^(t)[b] and S^(t)[b]. For ease of notation we skip the index b whenever possible and then assume only a single coherency block.

3.2 Definition of false sharing


An intuitive definition of false sharing events which is also applicable to our model is given as follows: Let two threads t1, t2 ∈ T be running on two distinct processor cores with separate caches. For a coherency block accessed by both threads, we refer to a false sharing event between these cores if the eviction of this block from the cache of one core, triggered by the other core, does not serve the purpose of data exchange between t1 and t2.


[Figure 1: Exemplary access patterns to the same coherency block by two different threads: (a) false sharing (the threads load and store disjoint parts of the block); (b) an ambiguous access pattern (both threads access a common location).]

Given an individual memory access, we cannot make any statement about whether it can trigger a false sharing event or not. It always depends on the sequence of operations. Therefore, in reality, the temporal order of memory references plays an important role. Consider the access pattern given by Figure 1(a). Both threads access disjoint sections, which is why no data can be exchanged via this cache line. Whether false sharing events happen depends on the sequence of memory references:

1. s_{0,1}^(0) s_{1,1}^(0) l_{2,1}^(1) l_{3,1}^(1): In this case, thread 0 writes to both locations before thread 1 loads a value. This results in one false sharing event, since the coherency block is exchanged only once.

2. s_{0,1}^(0) l_{2,1}^(1) s_{1,1}^(0) l_{3,1}^(1): Here, the accesses of both threads are interleaved, which results in three exchanges of the cache line and consequently three false sharing events.

Note that such an access pattern is unnecessary and avoidable. Since no data shall be exchanged, there is no point for different threads to access memory locations which are that near to each other.

Now consider the access pattern given by Figure 1(b). It is no longer obvious whether false sharing happens or not, since both threads access a common location. For example, thread 0 may write to offset 0 and afterwards thread 1 may read from offset 1. Such crosswise accesses would make up a worst-case sequence of accesses. Although such a sequence might be deliberately provoked, it is very unlikely that a real-world application would behave in this way. In contrast, accesses of different threads to the same coherency block at the same offset and of the same size are a very strong indication for data exchange.


3.3 Estimate for false sharing


Due to the lack of temporal order between memory references, we try to construct a worst-case sequence which maximizes the number of false sharing events within a parallel section. Assuming two threads T := {0, 1}, a first estimate can be made by only considering the absolute number of accesses per thread for each coherency block, L^(t) and S^(t) with t ∈ {0, 1}. We do not consider offset and size of references here. This makes it impossible to differentiate between false sharing and possible data exchange. However, if data exchange is treated as false sharing, we increase the estimate for the bound, which might result in a big error but in general does not underestimate the number of false sharing events.

Given a number of memory references, the worst case is an alternating sequence of store and load operations by thread 0 and 1, respectively. When no more loads are available, there might still be a number of store operations of both threads left which can also be arranged in a way such that false sharing occurs. This can be written as a sequence of non-distinguishable operations:

    s^(0) l^(1) s^(0) l^(1) ...   s^(1) l^(0) s^(1) l^(0) ...   s^(0) s^(1) s^(0) s^(1) ...

The first reference causes no false sharing since it is a compulsory miss, and any subsequent access causes one false sharing event. We do not consider compulsory misses since there is at most one per memory block accessed. It is important to consider false sharing events involving load operations first, because load operations alone cannot cause false sharing. In order to obtain a sequence of maximal length, there must not be two subsequent store operations until no more loads are available.

The problem of constructing such a worst-case sequence can also be seen from the point of view of set theory. Since we do not consider time, we can also abandon the view of sequences. Instead, we can assume sets of load and store operations for both threads. Now we combine each load of thread 0 with a store of thread 1 and vice versa. Afterwards, there might be store operations left for both threads. These can also be combined. By counting the number of combined operations, we end up with a value which equals the length of the sequence above. From both views, we can derive the following formula, which gives an estimate Δ of the false sharing events for a given coherency block b ∈ B:

    Δ = 2 [ min(S^(0), L^(1)) + min(S^(1), L^(0)) ] + 2 max( min(S^(0) − L^(1), S^(1) − L^(0)), 0 )        (1)

where α := min(S^(0), L^(1)), β := min(S^(1), L^(0)) and γ := max(min(S^(0) − L^(1), S^(1) − L^(0)), 0). The part abbreviated by α considers all possible combinations of store operations by thread 0 with load operations by thread 1. Accordingly, β considers the inverse case. Remaining store operations (if any) of both threads are combined by γ. If either L^(1) ≥ S^(0) or L^(0) ≥ S^(1) holds, there are no more remaining store operations and γ is forced to zero.

Equation (1) does a good job if there is no data exchange in reality. Unfortunately, it cannot differentiate between false sharing and data exchange at all, which is why the result may turn out to be a massive overestimation. Nevertheless, it may be useful if the programmer knows where data exchange between threads is performed.
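Equation (1) translates directly into straight-line code over the per-thread access counts of a coherency block. The following sketch is illustrative only; the function and variable names are ours and not taken from the Pluto sources.

    #include <algorithm>
    #include <cstdint>

    // Worst-case false sharing estimate of Equation (1) for one coherency block.
    uint64_t worst_case_estimate(uint64_t S0, uint64_t L0,   // stores/loads of thread 0
                                 uint64_t S1, uint64_t L1) { // stores/loads of thread 1
        uint64_t alpha = std::min(S0, L1);          // stores of 0 paired with loads of 1
        uint64_t beta  = std::min(S1, L0);          // stores of 1 paired with loads of 0
        // Remaining stores of both threads can still be interleaved pairwise;
        // clamping each remainder at zero reproduces the max(..., 0) of Equation (1).
        uint64_t rem0 = S0 > L1 ? S0 - L1 : 0;
        uint64_t rem1 = S1 > L0 ? S1 - L0 : 0;
        uint64_t gamma = std::min(rem0, rem1);
        return 2 * (alpha + beta) + 2 * gamma;      // Delta of Equation (1)
    }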

3.4 Considering offsets and sizes


So far, we did not consider the offset x ∈ X and size y ∈ Y of a memory reference. Considering them is, however, necessary to determine whether threads access disjoint parts of a coherency block. Now we try to derive an estimate for the maximum number of data elements exchanged between two threads.


Thread 0 Loads
(0, 2) (2, 2) (4, 2) (0, 2)

Thread 1
(2, 2) (4, 2)

3.5 Considering barriers


Barriers can be quite easily considered as long as they affect all threads. Synchronization between a subset of threads is much more difficult and part of future work. For now, we assume a number of parallel sections, defined as a code segment which can be executed in parallel and independently by an arbitrary number of threads and which is encompassed by barriers. Anything discussed so far holds within a single parallel section. Subsequent parallel sections can be taken into account by introducing an index s to the set of coherency blocks B, where s denotes the parallel section. Let S := {0, 1, . . . , |S| − 1} be the set of parallel sections and B_s the set of coherency blocks which have been accessed within section s ∈ S. We note that at most one false sharing event (or one data exchange) can occur between subsequent parallel sections. This happens when a core accesses a memory block for the first time within this section that has been accessed by another core in the previous section, provided that no capacity miss occurred in the meantime. The number of false sharing events (or data exchanges) between any section s and its successor s + 1 is bounded above by |B_s ∩ B_{s+1}| if s + 1 ∈ S and 0 otherwise. Estimates Φ_total and Ψ_total which take multiple parallel sections into account can be expressed by the per-section estimates Φ_s and Ψ_s for s ∈ S:

    Φ_total = Σ_{s ∈ S} Φ_s + Σ_{s=0}^{|S|−2} |B_s ∩ B_{s+1}|    (4)

    Ψ_total = Σ_{s ∈ S} Ψ_s + Σ_{s=0}^{|S|−2} |B_s ∩ B_{s+1}|    (5)
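A possible way to aggregate the per-section values according to Equations (4) and (5) is sketched below in C. The representation of B_s as a sorted array of block addresses and all identifiers are assumptions made for this sketch, not Pluto's actual data structures.

#include <stdint.h>
#include <stddef.h>

/* Counts |B_s ∩ B_{s+1}| for two sorted arrays of block addresses. */
static size_t count_common(const uint64_t *a, size_t na,
                           const uint64_t *b, size_t nb)
{
    size_t i = 0, j = 0, common = 0;
    while (i < na && j < nb) {          /* merge-style intersection count */
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else { common++; i++; j++; }
    }
    return common;
}

/* phi[s], psi[s]: per-section estimates; blocks[s]: sorted block addresses
 * touched in section s, of length nblocks[s]. */
void total_estimates(const uint64_t *phi, const uint64_t *psi,
                     const uint64_t *const *blocks, const size_t *nblocks,
                     size_t nsections, uint64_t *phi_total, uint64_t *psi_total)
{
    uint64_t boundary = 0, phi_sum = 0, psi_sum = 0;
    for (size_t s = 0; s < nsections; s++) {
        phi_sum += phi[s];
        psi_sum += psi[s];
        if (s + 1 < nsections)          /* the |B_s ∩ B_{s+1}| term */
            boundary += count_common(blocks[s], nblocks[s],
                                     blocks[s + 1], nblocks[s + 1]);
    }
    *phi_total = phi_sum + boundary;
    *psi_total = psi_sum + boundary;
}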

Figure 2: Finding pairs of matching load and store operations. The tuples (x, y) denote the offset x ∈ X and size y ∈ Y of each memory reference.

is useful to make a statement regarding the reliability of Φ as calculated according to (1). Again, we assume only two threads and no temporal order of the memory references. If we find a pair (s^(u)_{xy}, l^(v)_{xy}) of references within the same coherency block where u ≠ v and u, v ∈ T, this might have been a data exchange in reality. This process is illustrated by Figure 2. Two loads of thread 0 match with two stores of thread 1 in size and offset, which we count as a total of four true sharing events. In a similar way, one load of thread 1 matches a store of thread 0, which is counted as two true sharing events. The justification for doubling the number of events is that we designed the estimate in a way such that Φ − Ψ = 0 holds if the access patterns of two threads to a coherency block are very similar. For example, one thread may write a number of values which are afterwards read by another thread. This would result in an ambiguous case and a high value of Φ, although it is very likely that both threads exchanged data. At the same time, the similar access pattern would be considered by Ψ. If Ψ turns out to be in the order of magnitude of Φ, we know about the ambiguous case. Strictly speaking, we cannot make any statement whether or not the result of Φ is meaningful. If, on the other hand, Ψ is more than an order of magnitude smaller than Φ, it is very likely that false sharing occurs, unless there are some temporal constraints in reality which prevent false sharing at all. The estimate Ψ for a given coherency block b ∈ B can be calculated as follows:

    Ψ = 2 · Σ_{x ∈ X} Σ_{y ∈ Y} [ min(|S^(0)_{xy}|, |L^(1)_{xy}|) + min(|S^(1)_{xy}|, |L^(0)_{xy}|) ]    (2)


Note that considering parallel sections is not necessary if we are only interested in the difference Φ − Ψ, because the influence of barriers vanishes in this case: the boundary terms in (4) and (5) are identical and cancel in the difference.

4. IMPLEMENTATION AND USAGE


The measurement part of our approach collects statistics about memory references for each parallel section. These are written to file when reaching the end of such a section. Currently, we expect the programmer to notify Pluto on passing borders of sections. This is done via a so-called Valgrind client request, PLUTO_FLUSH_STATS, which is a C macro inserting a special sequence of assembly code resulting in a NOP operation. However, when run in Valgrind, such a sequence can be detected and forwarded, e.g. to Pluto. To be able to calculate the estimate for an execution unit, we need to count accesses to any given coherency block. Access counts have to be collected separately for each thread and depending on whether they are store or load accesses. Further, to be able to calculate estimates not only for false sharing, but also for potential true sharing events, and to relate the estimates to code positions for visualization, distinct access counts are also aggregated for different access offsets and sizes, as well as different code positions at which accesses are happening. The counters are put into a global hash table. All items listed above are part of the key for the hash table. An issue with this approach is that the full key has to be stored for every counter to check against at hash table lookup. To keep memory requirements low, the number of hash table entries should be minimized. E.g., code positions can just be function names, and only for selected functions also include line numbers. Also, similar access types could be mapped to one entry with multiple counters.
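For illustration, a counter entry keyed by the items listed above might look as follows in C. The field layout, the fixed-size linear-probing table, and all names are assumptions of this sketch and not Pluto's actual implementation.

#include <stdint.h>

typedef struct {
    uint64_t block_addr;   /* coherency block address (line aligned) */
    uint32_t thread_id;    /* accessing thread */
    uint32_t code_pos;     /* index into a table of "file:line" strings */
    uint8_t  offset;       /* offset x within the block */
    uint8_t  size;         /* access size y in bytes */
    uint8_t  is_store;     /* 0 = load, 1 = store */
} access_key;

typedef struct {
    access_key key;        /* full key, needed to verify hash table hits */
    uint64_t   count;      /* number of accesses seen for this class */
    int        used;
} access_counter;

#define POW2_ENTRIES (1u << 20)
static access_counter counters[POW2_ENTRIES];

static int key_equal(const access_key *a, const access_key *b)
{
    return a->block_addr == b->block_addr && a->thread_id == b->thread_id &&
           a->code_pos == b->code_pos && a->offset == b->offset &&
           a->size == b->size && a->is_store == b->is_store;
}

/* Instrumentation callback: bump the counter for this access class,
 * creating the entry the first time the class is observed. */
void record_access(const access_key *k)
{
    uint64_t h = k->block_addr * 0x9e3779b97f4a7c15ULL ^ k->thread_id ^
                 ((uint64_t)k->code_pos << 20) ^ ((uint64_t)k->offset << 8) ^
                 ((uint64_t)k->size << 4) ^ k->is_store;
    for (uint32_t i = 0; i < POW2_ENTRIES; i++) {
        access_counter *e = &counters[(h + i) & (POW2_ENTRIES - 1)];
        if (!e->used) { e->key = *k; e->count = 1; e->used = 1; return; }
        if (key_equal(&e->key, k)) { e->count++; return; }
    }
    /* table full: further access classes are dropped in this simplified sketch */
}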

The inner term of the sums is very similar to α + β in (1), except for the added offsets and sizes. Since alternating store operations of both threads cannot serve the purpose of data exchange, these are not considered as true sharing and are therefore omitted in (2). The first estimate for false sharing does not consider offset and size of memory references. Now that we have an estimate for the number of data exchanges, it appears tempting to obtain a better estimate

    Φ* := Φ − Ψ.    (3)

Although this approach delivers astonishingly exact results in first tests, as we will see in Section 5.2, these results must be viewed critically. The general problem here is that the validity of the result completely depends on the unknown sequence of memory references. It is possible to construct a sequence such that Φ* fails. However, such a sequence is unlikely to be produced, and dampening Φ if matching memory references exist appears reasonable for real-world applications.
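The refinement can be sketched in the same illustrative C style as before; the names are assumptions of the sketch, and clamping Φ* at zero is a choice of the sketch (Equation (3) itself is a plain difference).

#include <stdint.h>
#include <stddef.h>

/* Access counts for one (offset, size) class within a coherency block:
 * loads[t] / stores[t] are |L^(t)_{xy}| and |S^(t)_{xy}| for thread t. */
typedef struct {
    uint64_t loads[2];
    uint64_t stores[2];
} xy_counts;

/* True sharing (data exchange) estimate per Equation (2): sum over all
 * (offset, size) classes of crosswise store/load matches, doubled. */
uint64_t ts_estimate(const xy_counts *cls, size_t nclasses)
{
    uint64_t psi = 0;
    for (size_t i = 0; i < nclasses; i++) {
        uint64_t m1 = cls[i].stores[0] < cls[i].loads[1] ? cls[i].stores[0] : cls[i].loads[1];
        uint64_t m2 = cls[i].stores[1] < cls[i].loads[0] ? cls[i].stores[1] : cls[i].loads[0];
        psi += m1 + m2;
    }
    return 2 * psi;
}

/* Refined estimate per Equation (3): dampen the worst-case false sharing
 * estimate by the matches that look like genuine data exchange. */
uint64_t refined_estimate(uint64_t phi, uint64_t psi)
{
    return phi > psi ? phi - psi : 0;   /* clamped at zero in this sketch */
}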

> show summary 3
CBlock address   #loads     #stores    #threads   #FS        #TS        #FS-#TS    Overhead (cycles)
0x08054080       3.98e+06   9.07e+05   2          7.03e+06   7.03e+06   2.00e+00   1.76e+09
0x080502c0       5.08e+06   2.65e+06   2          1.28e+06   6.00e+00   1.28e+06   3.21e+08
0x070239c0       4.27e+03   4.29e+03   2          5.63e+03   5.63e+03   0.00e+00   1.41e+06

> show cblock 0x080502c0
CBlock 0x80502c0: 5076622 loads, 2646301 stores by 4 threads (1280644 #FS, 18.2% Total#FS)
Thread 0: [XXXXXXXXXXXX----SSSSSSSSXXXXXXXXLLLLLLLL------------------------]
  petlib.c:272  3083204 (60.8%) loads,      0 stores ( 0.0%)  =>  0.0% #FS   0.0% Total#FS
  petlib.c:324        0 ( 0.0%) loads, 529259 stores (20.0%)  => 82.7% #FS  14.7% Total#FS

Listing 1: Example output of plutoparser.

Another way to reduce entries comes from the fact that our estimate is zero if there were only load accesses to a coherency block, or the block is private to a thread. This information sometimes can be detected (memory areas with read-only permission), or could be provided by the programmer with client requests. Currently, the following heuristic is implemented: If a coherency block was never written to before, load accesses are ignored. In principle, this is problematic for our estimate, as early loads could be rearranged with late stores of another thread to produce false sharing events in a worst-case ordering. Therefore, we print out a statistic of ignored loads to blocks where there was a store afterwards. If this number is low in relation to the resulting estimate, the heuristic worked for a given run of an application. To sum up, Pluto's instrumentation adds a callback to a helper function for every data access happening in supervised code. This function looks up the counter in the global hash table of access types and either increments the counter of an already existing entry, or creates a new entry when the given access type is observed for the first time. At the end of an execution unit (or at program termination), all entries are sorted by coherency block, and for each block, it is first checked whether accesses are only loads, or accesses are only from one thread. If this is the case, no output has to be generated for this block. Otherwise, all access counts are written to an output file. To denote a code position, we write source name and line to this file, if debug information is available. We could do something similar for coherency block addresses, e.g. use the symbol name for static/global variables, or otherwise use the code position of the memory allocation which includes a given block. Currently, Pluto just outputs the address. Finally, all counters are set to zero (currently, we remove all hash table entries), and a marker is written to the output file indicating that a new execution unit has started. After the run, the terminal application plutoparser reads in the file, sums up access counts per coherency block and calculates estimates as explained in Section 3. With a specified average time overhead for false sharing events, temporal estimates can be calculated from the occurrence estimate. plutoparser supports two types of output: First, a summary view shows one coherency block per line with true/false sharing estimates, by default ordered by descending false sharing occurrence number. Second, a detailed view for a coherency block shows access masks for each thread which touched the block. Every byte of the block is denoted by a character, which shows whether there was a load or a store at this offset, respectively. Further, an ordered list of code positions is shown with access counts and the percentage influence of this line on the false sharing estimate. Listing 1 shows an excerpt from a typical interactive session of plutoparser after loading a file generated by Pluto. Here, #TS relates to Ψ, the estimated number of data exchange events. The first command at the prompt (show summary) shows the memory blocks with highest sharing estimates (the temporal overhead is by default calculated purely from the false sharing estimate). The second command shows details of one block. After the access mask for a thread, source lines with the highest number of load and store accesses are shown, together with their influence on the estimate.
The latter numbers add up to more than 100% over all threads, as there need to be two accesses for a false sharing event.
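The temporal estimate mentioned above amounts to scaling the occurrence estimate by the average per-event penalty. A minimal sketch follows; the function name and the cycle-based penalty unit are assumptions for illustration, not plutoparser's actual code.

/* Convert an event-count estimate into seconds, given the average penalty
 * per false sharing event in cycles and the core clock frequency in Hz. */
double temporal_estimate(double fs_events, double penalty_cycles, double clock_hz)
{
    return fs_events * penalty_cycles / clock_hz;
}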

5. RESULTS
To confirm our approach, we have tested with two self-written applications which provoke false sharing and one real-world application. The first test is a synthetic benchmark with a very regular access pattern designed to trigger false sharing. This can be used to estimate the penalty per false sharing event. The second application is a simple implementation of a prime sieve parallelized using OpenMP. The real-world application calculates sparse matrix-vector products and is part of an imaging process. All tests have been performed on a system equipped with two Intel Xeon X5355 processors.

5.1 Synthetic false sharing benchmark


Two threads modify elements of a data array at 64-byte strides. In order to trigger false sharing, both threads have an offset of 4 bytes to each other, which corresponds to a single data element. A second version of this benchmark avoids false sharing by accessing data elements located in disjoint parts of the array. The access pattern is illustrated by Figure 3. The test is repeated for a number of iterations. The benchmark is written in a way such that the number of operations performed by each thread remains constant and independent of the array size. For this purpose, the number of iterations is adapted dynamically. We perform the test with 6.25 × 10^7 operations, which results in twice as many memory references, since modification of an element implies a load and a subsequent store. This way, we prevent the processors from eventually buffering unconditional stores. To avoid inaccuracies due to relocation of a thread at runtime and to get an estimate for the worst-case penalty, we pin the threads to cores located on different sockets. Figure 4(a) shows the time per memory reference for both cases with and without false sharing.
1 Quadcore processor, 32KiB L1 cache, 8MiB L2 cache split into two times 4MiB for two cores per socket, 2.66GHz core frequency.
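A minimal OpenMP sketch of such a benchmark is shown below. The array layout matches the description above, but the function, its parameters, and the use of OpenMP rather than raw threads are assumptions of this sketch, not the code actually measured for Figure 4.

#include <omp.h>

#define STRIDE 16               /* 64 bytes / sizeof(int): one cache line */

/* Each of two threads repeatedly modifies one int per 64-byte line.  With
 * false sharing provoked, the threads touch neighbouring ints of the same
 * line (offset of one element); otherwise they work on disjoint halves.
 * The caller adapts 'iterations' so the total work is independent of n. */
void run(volatile int *data, long n, long iterations, int provoke_fs)
{
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        for (long it = 0; it < iterations; it++) {
            if (provoke_fs) {
                /* thread 0 uses element 0, thread 1 element 1 of every line */
                for (long i = tid; i < n; i += STRIDE)
                    data[i] += 1;            /* load plus subsequent store */
            } else {
                /* each thread strides through its own half of the array */
                long begin = tid * (n / 2), end = begin + n / 2;
                for (long i = begin; i < end; i += STRIDE)
                    data[i] += 1;
            }
        }
    }
}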

Figure 3: Access patterns to provoke or avoid FS: (a) access pattern provoking FS; (b) access pattern avoiding FS. The type of access depends on the selected mode.

The samples were obtained by time measurement and afterwards converted to clock cycles. Apart from a slump at small array sizes, which we cannot explain at the moment, one can clearly see that false sharing causes a massive penalty. By calculating the difference between both runs, we end up with the overhead caused by false sharing for a single memory access. This is shown by the continuous line in Figure 4(b). The average penalty lies around 125 cycles per reference, which is for now accepted as the penalty c per false sharing event. The dashed line shows the estimate as given by Pluto when using this penalty, which is reduced to a trivial result due to the constant number of memory references and the regular access pattern. These results are still based on time measurement. Furthermore, we do not know whether every memory access really triggers a false sharing event. Therefore, we checked the results by means of hardware performance counters using pfmon. It is certain that every false sharing event will invalidate the corresponding memory block in the cache of the affected processor. This causes an eviction from the L2 cache, which can be measured by using the L2_LINES_OUT:ANY counter. The total number of load and store events for one of the processors can be recorded by the L1D_CACHE_LD:MESI and L1D_CACHE_ST:MESI counters, respectively. The total number of memory references is therefore the sum of both counters.

Figure 4: Clock cycles and overhead due to false sharing per memory reference. (a) Time per memory reference over the array size, for the false sharing pattern and with false sharing avoided. (b) Overhead per memory reference over the array size: overhead derived from wall clock time, overhead derived from performance counters, and the average overhead.
Now we calculate the ratio

    ρ := ( L2_LINES_OUT:ANY(FS) − L2_LINES_OUT:ANY(no FS) ) / ( L1D_CACHE_LD:MESI + L1D_CACHE_ST:MESI )

between the difference of the evicted lines with and without false sharing and the total number of memory references. The number of evicted L2 lines without false sharing remains negligible until the L2 cache is filled. Afterwards, capacity misses occur, which prevent false sharing. This is considered by taking the difference of both values in the numerator. If every memory reference causes a false sharing event, ρ ≈ 1 should hold. In this case, the penalty per false sharing event is given by c/ρ, which is shown by the dotted line in Figure 4(b). However, it turned out that ρ ≈ 0.5 holds for array sizes up to the L2 cache size, aside from the discrepancies at small array sizes, and ρ ≈ 0 for sizes larger than the L2 cache. This means that in reality only every second memory reference causes a false sharing event. Such a result is plausible, since a store operation is done almost instantly after the preceding load operation is completed. It is very likely that only the load operations cause false sharing, in combination with a preceding store of the other core to the same cache line. For this reason, c as obtained by the wall clock approach is only half the amount of the real penalty for false sharing. Consequently, we can assume a penalty c ≈ 250 cycles per false sharing event. This is of course a hardware-dependent factor which must either be known or measured using a test application like this one. Otherwise, inaccuracies induced by a default value must be accepted. Although the estimate as given by Pluto is trivial for this case, one problem is apparent: Since Pluto does not consider capacity misses, it is not aware that for arrays larger than 4 MiB false sharing no longer causes overhead. This is of course not true in general but a consequence of the regular access pattern and the huge number of iterations. Assuming uniformly distributed accesses, the probability of false sharing would decrease with size but not drop to zero almost instantly after the cache size is exceeded. However, our approach would benefit from considering finite caches. One possibility to do this is described in Section 6.

5.2 Prime sieve


Now that we have an estimate for the penalty induced by false sharing, at least for our test system, we can turn to a more realistic application. We implemented a simple variant of the prime sieve of Eratosthenes. The idea behind the


#pragma omp parallel private(i, mask, x)
{
  #pragma omp for schedule(static, 1) nowait
  for (i = 2; i < N; i++) {
    x = 2*i;
    while (x < max_size) {
      mask = 128 >> (x % 8);
      ptr[x/8] |= mask;
      x += i;
    }
  }
}

#pragma omp parallel for schedule(dynamic)
for (r = 0; r < rows; r++)
  for (i = rowptr[r]; i < rowptr[r+1]; i++)
    y[r] += val[i] * x[col[i+1]];

Listing 3: Forward projection of an MLEM iteration.

lead to an unequal load balancing (there are fewer multiples smaller than N for large numbers than for small numbers). In this case, we pinned the threads to cores located on different sockets again. Figure 5(a) shows the average time per prime test, measured in clock cycles, for a given array size. Since the algorithm has no linear complexity, the time per prime test increases for larger numbers. Parallelization fails for array sizes up to 8 MiB, which is double the size of the L2 cache. This is explained by the irregular access pattern of the algorithm, since the stride sizes increase with the numbers tested. Coherency blocks which make up the latter part of the array are accessed more frequently than other ones; e.g., the first cache line is never accessed again after the first 512 numbers have been processed (64 bytes times eight numbers per byte). The continuous line in Figure 5(b) shows the difference between the dual- and single-thread variant and therefore the overhead induced by parallelization. The dashed and dotted lines indicate the false sharing estimates Φ and Φ* as given by Pluto. It turns out that Φ overestimates the penalty by more than an order of magnitude². This is obvious for two reasons: First, modifying values involves reading and writing and therefore a pattern of subsequent loads and stores for each thread. The estimate assumes the worst case, since it does not know the sequence of references. Secondly, the access pattern is not strictly sequential. The stride sizes vary, which is why not every modify results in a false sharing event even for small array sizes. The result as given by Φ* is much more accurate, since it is considerably reduced by matching accesses between threads. Nevertheless, there remains a sufficiently large estimate to reflect the overhead induced by false sharing. Although these results are promising, it must be kept in mind that Φ* is no longer an estimate for an upper bound, but an attempt to counter the pessimistic overestimation of Φ. Therefore, it might happen that it underestimates the overhead.

Listing 2: Parallelized loop of the prime sieve.


Figure 5: Time and overhead per prime test. (a) Time per prime test, measured in clock cycles, for 1 and 2 threads over the array size. (b) Overhead of the threaded version, measured in clock cycles: overhead derived from wall clock time, from Φ, and from Φ*.

algorithm is to calculate all multiples of n ∈ {2, 3, . . . , N} up to a given bound N and to remember which numbers are multiples of another one. This is implemented by using a large bit mask initialized to zero, where the n-th bit represents the number n. Afterwards the algorithm calculates the multiples starting at n = 2 and sets the corresponding bits to one. At the end, all bits which are not marked represent a prime number. Of course there is a huge overhead in this naive implementation, since most numbers are marked multiple times. However, it can be easily parallelized using OpenMP and completely avoids the need for synchronization between threads working on the same array. The relevant code section is shown in Listing 2. We used a static parallelization to avoid any overhead induced by task assignment. The chunk size is set to one to ensure a tight interleaving between threads. This makes sense since otherwise OpenMP would divide the loop anywhere around N/2, which would

5.3 Image Reconstruction Algorithm


We are cooperating with the department for nuclear medicine, working on medical imaging. Our side works on cache optimization and parallelization of various algorithms used in this field [7]. As a third example, we look at the OpenMP parallelization of the MLEM image reconstruction algorithm [9]. This algorithm runs so-called forward and backward projection phases in an iterative loop. Each projection is a sparse matrix-vector multiplication. We concentrate on the forward projection, as this is subject to false sharing (see Listing 3). The matrix is stored in standard compressed sparse row (CSR) format, consisting of the three arrays rowptr, val, and col. The parallel for loop is partitioned according to the rows of the matrix. As the number of entries per row varies a lot, dynamic scheduling needs to be used for load balancing. This scheduling uses a default chunk size of 1. Thus, different threads frequently modify elements of
² The slope of Φ is roughly linear for array sizes greater than 256 KiB, not exponential as one could assume by looking at the section shown in Figure 5(b).


       Wallclock time [s]   FS est. [MEvents]   time est. [s]   Pluto runtime [s]
d1          0.901                12.70               0.594            168
d16         0.616                 0.17               0.016            155

our estimate, we have to check its results with a larger number of applications. Finally, we plan to add support to KCachegrind to be able to visualize false sharing estimates alongside cache simulation data.

Table 1: Results for an MLEM forward projection.

the array y in an interleaved fashion. The sparse matrix has a size of 2.7 GB and the involved dense vectors have around 800 k entries. The first row of Table 1 shows the runtime of this code for two threads, pinned to cores on different sockets. A chunk size of 16 eliminates false sharing, if array y is properly aligned (second row of Table 1). A sampling-based profiling tool shows that less than 2% of the runtime is spent in scheduling-related compiler routines. Therefore, OpenMP scheduling overhead is not falsely interpreted as false sharing. As can be seen, the time estimate is quite accurate for this example if we use a false sharing overhead as determined in Section 5.1. The last column of Table 1 shows the wall clock time of the forward projection phase run in Pluto. This currently shows a slowdown of around a factor of 200. However, we expect that we can significantly reduce this slowdown.
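For illustration, the chunk-size change amounts to a one-line modification of the scheduling clause of Listing 3; the sketch below shows what the variant corresponding to the second row of Table 1 might look like, assuming y is suitably aligned.

/* Chunk size 16 hands out 16 consecutive rows per thread, so (with y
 * line-aligned and elements of at least 4 bytes) different threads no
 * longer write to the same 64-byte coherency block of y. */
#pragma omp parallel for schedule(dynamic, 16)
for (r = 0; r < rows; r++)
  for (i = rowptr[r]; i < rowptr[r+1]; i++)
    y[r] += val[i] * x[col[i+1]];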

7. REFERENCES
[1] D. Badouel, T. Priol, and L. Renambot. SVMview: A performance tuning tool for DSM-based parallel computers. In Proceedings of Euro-Par '96, volume 1123 of LNCS, pages 98–105. Springer, 1996.
[2] W. J. Bolosky and M. L. Scott. False sharing and its effect on shared memory performance. In SEDMS '93: USENIX Symposium on Experiences with Distributed and Multiprocessor Systems, pages 57–71, Berkeley, CA, USA, 1993. USENIX Association.
[3] D. Cortesi, J. Fier, J. Wilson, and J. Boney. Origin2000 and Onyx2 Performance Tuning and Optimization Guide. Silicon Graphics, Inc., 1998.
[4] V. Freeh. Dynamically controlling false sharing in distributed shared memory. In Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing, pages 403–411. IEEE Computer Society, 1996.
[5] Intel Corporation. Intel Performance Tuning Utility 3.2, User Guide, chapter 7.4.6.5, 2008.
[6] T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. ACM SIGPLAN Notices, 30(8):179–188, 1995.
[7] T. Küstner, J. Weidendorfer, J. Schirmer, T. Klug, C. Trinitis, and S. Ziegler. Parallel MLEM on multicore architectures. In 9th International Conference on Computational Science, ICCS '09, volume 5544. Springer, 2009.
[8] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not., 42(6):89–100, 2007.
[9] L. Shepp and Y. Vardi. Maximum likelihood reconstruction for emission tomography, 1982.
[10] J. Tao and W. Karl. CacheIn: A toolset for comprehensive cache inspection. In Proceedings of ICCS 2005, volume 3515 of LNCS, pages 174–181. Springer, 2005.
[11] J. Weidendorfer, M. Kowarschik, and C. Trinitis. A tool suite for simulation based analysis of memory access behavior. In ICCS 2004: 4th International Conference on Computational Science, volume 3038 of LNCS, pages 440–447. Springer, 2004.
[12] J. Weidendorfer, M. Ott, T. Klug, and C. Trinitis. Latencies of conflicting writes on contemporary multicore architectures. In 9th International Conference, PaCT 2007, volume 4671 of LNCS, pages 318–327. Springer, 2007.

6. CONCLUSION AND FUTURE WORK

This paper presents a new approach to detect and quantify cache false sharing effects. The goal is to come up with a practical tool pinpointing the code positions and data structures responsible for bad scalability of shared-memory applications because of false sharing. To this end, we use data collected in an execution-driven way via dynamic binary instrumentation to derive estimates for cache line sharing events. By ignoring the temporal order of accesses for these estimates, we avoid the need for time-consuming simulation of complex multi-processor systems. A worst-case false sharing estimate is provided by assuming access orders that maximize the number of false sharing events. This rough figure is refined by a best-case estimate for true sharing events. In practical cases, the difference of these estimates seems to work nicely, and we obtain promising results for real-world applications. We provide a micro-benchmark which allows measuring false-sharing latencies on real machines, and use this to get the temporal overhead of false sharing. However, our estimate sometimes can be far off from reality. Currently, we assume that every access can trigger a false sharing event, even if in the meantime a capacity miss might have occurred which prevents false sharing. Similarly, results show that the estimate has to improve for cases where multiple accesses to the same cache line are done in a very short time frame by one processor core. Further, automatic detection of synchronization needs to be added, and more than just global barriers should be supported. To validate


Metaman: System-Wide Metadata Management


Daniel Williams Jack W. Davidson

Department of Computer Science University of Virginia


{dan_williams,jwd}@cs.virginia.edu

ABSTRACT
Understanding how programs are created and how they behave at run time is vital to building secure and efficient programs. Typically, program information generated when building and linking a program is not available to run-time instrumentation tools that are used to better understand and improve program behavior. This paper presents the structure of Metaman, the metadata manager, a tool for building instrumented systems that spans the entire build and run-time system. Metaman collects available metadata about the program from build tools as well as run-time tools as XML, and then makes the metadata available to the rest of the tools in the toolchain. In order for Metaman to be practical, it must be easy for each tool to send and receive metadata from Metaman. This paper discusses: (1) the process of adapting build-time tools such as compilers, assemblers and static analyzers for use with Metaman, (2) the process of adapting the build system to assist in correctly coordinating novel metadata, and finally (3) the process of integrating run-time development tools such as debuggers, instrumentation tools, and software dynamic translators (SDT) for use with Metaman.

1. Introduction
As computer systems are becoming faster and more capable in terms of raw computing power, users and system-builders are putting ever-greater demands on the capabilities of these systems. As a result, computer software has become increasingly complex and difficult to manage. In order to manage this complexity, system-builders and application developers require a diverse and advanced set of tools to understand and manipulate their programs. Tools such as Integrated Development Environments (IDEs) [20], debuggers [6], instrumentation tools [17, 13], and Software Dynamic Translators (SDTs) [14, 19, 4] have joined the traditional software tool chain to aid in optimizing, securing, debugging and understanding the program. Because of the diversity and differing goals of these tools, the tools themselves are highly modular: they must conform to the file formats used in the rest of the build process. Therefore, any additional information about the program must be added as explicit annotations in the output source, which may be impractical, depending on the output format. Even if the annotations can be added, they then must be preserved by all the other tools that manipulate the files so they can reach the tool that can properly consume the information. To address the need for improved communication between tools within the software development toolchain, we have created Metaman, the metadata manager. Metaman integrates into the toolchain, and allows tools to submit new metadata about the program and query Metaman for metadata provided by other tools, removing the requirement for all tools to track and preserve annotations across the build and deployment process. An example use of Metaman is in the handling of indirect branches in SDTs. Indirect branches cause performance problems in SDT systems because the system must always check the target of an indirect branch each time the branch is executed. Using high-level compile-time information about the source of indirect branches, such as the associated virtual function table,

Category.
D.2.6 [Programming Environments]

Keywords.
Metadata, run-time systems, software toolchain, software dynamic translation.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WBIA 09, December 12, New York City, NY Copyright 2009 ACM 978-1-60558-793-6/12/09... $10.00


Metaman can provide an SDT with enough information to generate a completely translated indirect branch, and not rely on hashtable-based run-time lookup mechanisms, thus avoiding the performance penalty [21]. In this paper, we examine the structure and design trade-offs of building Metaman, in terms of integrating with the software tools and the associated data representation for those tools. Because of their differing requirements, build-time tools, run-time tools, and the build system are examined separately. Additionally, we discuss the advantages and drawbacks of using XML as a medium for storing large amounts of program data and the trade-offs necessary for effectively storing and manipulating large amounts of data associated with the program.

2. Metaman Overview
Metaman is designed to be agnostic with regard to the tools that provide metadata during the life cycle of an application. It is impossible to predict all possible metadata that can be useful to software tools in the future; therefore we have designed Metaman to be as flexible as possible so that new types of metadata can be stored and organized. Metaman uses XML to store and query metadata, but the ultimate goal is to provide a customizable interface to request data in any format. Modern software development tools perform sophisticated analysis, and a large amount of information about the structure and intent of the program can be gained by

collecting the information resulting from these analyses, and what metadata is useful will vary from system to system. Table 1 gives examples of program metadata that are collected in the course of building software, but that are often unavailable to tools which could make use of the data to improve the program. Note that this information is being generated at all points within the software development toolchain, both at compile time and run time. Some of the metadata, such as the control-flow and alias information, are consumed by run-time tools. Instrumentation information, however, is fed back into the next build of the program. Still other metadata, such as the program fault data and list of performance optimizations, can be consumed by the developers and users to better understand program behavior. To support such diverse metadata, a broad and flexible infrastructure is required. The structure of Metaman is shown in Figure 1. Tools within the toolchain are able to submit metadata. Other tools then are able to access this metadata as desired. The following sections discuss the data format and tool integration with Metaman: Section 3 discusses the integration with build-time tools that generate useful information about the application. Next, Section 4 examines the interaction between Metaman and tools that operate at runtime, and therefore have performance-sensitive requirements. Finally, Section 5 describes the integration between Metaman and the build system. As is shown

Table 1: List of example valuable program metadata.

Metadata                  Description                                                Gathered from                Useful for
VFT Table                 Virtual function table size and targets                    Compiler, Assembler          SDT code layout
Control Flow              Basic block boundaries and CFG                             Compiler                     Control Flow Integrity [2]
Instrumentation           Function execution frequency, etc.                         Runtime system               Basic block layout
Optimizations performed   List of actions taken by the optimizer                     Optimizer                    Deoptimization, debugging
Alias Info                Metadata on which pointers are aliased                     Compiler, Static analyzer    Formal verification, security tools
System/Library Calls      Which libraries are used and where system calls are made   Assembler, Build system      Security policy enforcement
Memory Management         Policy on how memory is collected                          Memory analysis tools, GC    Performance analysis, real-time tools
Program Fault Data        Core data and diagnostics from program failures            OS, runtime system           Debugging tools, automated repair [8]


in Figure 1, Metaman is integrated directly with the build system to take advantage of the build system's ability to track dependencies and keep files up to date.

3. Build-time Tools
Tools used at build time include the preprocessor, compiler, linker and static analyzers. These tools are often run in an environment where the system-builder has complete control (i.e., a development system), and where programmers are willing to tolerate long build times in exchange for more detailed analysis. Therefore, the focus for Metaman is not on size or run-time efficiency; rather, the primary goal is making it easy for tools to integrate with Metaman and increasing the amount of information available.

DIE is located. The Metaman XML format translates these references to unique identifiers. The resulting XML format allows for all the information contained in DWARF debugging, but also allows new tools to add arbitrary data by including their own metadata as additional tags within the encompassing compilation-unit tag.

3.2 Tool Integration


The tools used at build time often perform a highly detailed analysis of the code, and while the metadata is usually not included in the final application binary, many tools make their analysis available as human-readable output to assist power users in understanding what is happening to their code. An example of human-readable annotations is support for listing output by the assembler. Typically, the assembler takes a compiler-generated assembly file as input and produces a machine-code object file, usually in the Executable and Linkable Format (ELF) on a Linux system. For Metaman, the assembly process is very important because it represents the final mapping between source-level information and the specific bytes that will be executed on hardware. The top of Figure 2 shows a snippet of a typical assembly listing, from the GNU Assembler, invoked with the -al options. The snippet shows two instructions, a move of a constant onto the stack, and a call to the puts() function. It presents a compact representation of useful metadata: the assembly line number, the offset into the current object-file section (i.e., the .text section), the hexadecimal representation of the values being emitted, and the text of the assembly line. While this information is potentially useful to a variety of tools, the output is specifically designed for a human reading on a terminal screen. The listing of the hexadecimal output wraps the data on line 34 into two logical lines to avoid making the line too long and therefore unreadable in an 80-character terminal. One possible option for collecting this data is to process the listing file, either using regular expressions or a parser generated from a context-free grammar. Scanning for the required data is the quickest way to add data to Metaman; however, because the output is formatted for humans, generating a full and accurate parse tree is difficult, and regular expressions do not always collect the desired data in all cases. However, because XML allows additional tags and attributes to be added without requiring changes to the tools consuming the XML data, it is often possible to create a quick regular-expression-based prototype, which can later be replaced with a more robust data collection technique.

3.1 Data Representation


Current software tools used to create programs can and do produce binaries with metadata, typically debugging output. Metaman uses an XML format that is inspired by the DWARF debugging format [1]. We chose not to simply extend the DWARF specification for a number of reasons. First, the tools for parsing and querying DWARF information are highly limited compared to those available for XML. Secondly, the DWARF format has a highly focused purpose, specifically aiding in program debugging, so the information collected in the DWARF format is primarily useful for program understanding, not general-purpose use. While the DWARF format was not sufficiently general for our purposes, it does offer a good starting point for general metadata collection. We implemented a tool, dwarfxml, that converts DWARF information into our XML format. DWARF breaks up file information into separate compilation units, typically corresponding to individual object files. This partitioning helps the debugger correctly parse and identify shadowed variable names and other file-local data. Metaman's XML format mirrors this structure by enclosing information in a compilation-unit tag, which can comprise a standalone XML file, or can be bundled as a sequence of tags in the case of an executable built from multiple source files. Further, the DWARF standard translates to XML in a straightforward manner because the DWARF Debugging Information Entry (DIE) is organized in a hierarchical format, augmented with name/value attributes, which map directly to XML tags and attributes, respectively. DIEs cross-reference other DIEs using an offset, which identifies the location in memory (relative to the start of the debugging section) where the referenced


Figure 1: Metaman Structure

Another option for collecting data from build tools is to alter the tools themselves. In the case of the GNU assembler, this technique was surprisingly easy and effective. We implemented XML listing as a small (150 line) patch to the listings.c file of the GNU assembler. The second half of Figure 2 shows the resulting XML listing, after the patch is applied. The XML version of the listing maps the logical lines of the listing to an assembly-line tag. It contains sub-tags for the hexadecimal representation as well as the assembly text. This metadata allows Metaman to gather a complete mapping of the object files used to create a program, including symbolic information and static data. Further, invoking the assembler with the -alh flags also includes the high-level language source. That data also maps directly to an XML tag, which allows Metaman to associate source statements with particular bytes in the executable. Note that the XML produced by the patched as program does not exactly match the DWARF-inspired XML presented earlier. One of the primary benefits of using XML as the metadata storage format is that there are many XML-based tools for manipulating and organizing data. One such tool is Extensible Stylesheet Language Transformations (XSLT). An XSLT program changes the format of and adds information to existing XML files. For example, an XSLT program can make the listing XML compatible with the dwarfxml output by adding unique identifiers to the assembly listing, and adding a few tags to make it conform to Metaman's schema.

4. Run-time Tools
Run-time tools perform a wide range of important tasks, operating in conjunction with the application. Software Dynamic Translation (SDT) systems such as Strata [19], DynamoRIO [4] and Pin [14] are a class of run-time tools that provide flexible frameworks for building a variety of tools such as instrumentation systems, security systems, and optimization systems. Figure 3 shows the typical operation of an SDT system. SDT systems operate by gaining control of an application, examining the instructions to be executed, and then translating them into a code cache. Code that has already been translated is reused and linked together to amortize the cost of translation. While there are other run-time tools that are not SDT-based (notably managed runtimes), our integration focuses on SDT systems due to their flexibility and their requirement for a small process footprint. Run-time tools have significantly different requirements from build tools. This is partially because every cycle used by the runtime system is a cycle that could be used to execute the program. But the system builders also have much less control over the environment in which the application is running. Therefore, the representation of the metadata and integration with Metaman require special care.


Figure 3: SDT Architecture

such as glibc and the standard C++ library, so that the SDT's use of those libraries does not interfere with the application. Pin, which includes an advanced Pin Tool architecture, even includes a third set for the Pin Tool being run. Therefore, a full XML parser, including all the associated XML libraries, might not be practical for SDT systems. Metaman generates tool-specific binary output to simplify parsing and interpreting the metadata. For SDT systems using systems-focused languages like C and C++, the binary output can be memory-mapped directly into the data structures to be used. Currently this is a manual process, where the system builder must write the code to emit the metadata in the proper binary format. Ideally, Metaman could automate much of this with a simple markup language, perhaps similar to XSLT, that would take a request similar to an XPath query, but instead of producing new XML as output, it would produce binary data. An important consideration, even when using binary information, is the effect of the metadata on run-time memory usage. When collecting metadata to improve indirect branch handling [21], we examined the effect on memory usage for the Xalan SPEC2006 benchmark [9]. The size of the virtual memory over time, as reported by the ps utility, is shown in Figure 5. At the point of highest memory usage, Strata with binary metadata resulted in a 63% size increase, while Strata alone results in a 24% size increase. Such an increase in memory usage might be tolerable for large server-grade systems where it can be assumed the system has many gigabytes of memory available. However, as discussed previously, it is unclear what metadata will be useful and available in the future. Thus

Figure 2: Assembly Listing converted to XML

4.1 Data Representation


An important goal of Metaman is allowing a wide range of tools to use and submit data, and further, to allow these tools to submit their own novel metadata. For run-time tools, this means providing the ability to use formats other than XML. XML is an ASCII-based file format, and therefore XML files are often larger than corresponding binary-encoded data. To examine the exact file size increase, we converted debugging information into our XML format, and also stored the data in an XML database, DBXML, discussed in Section 5.1. Figure 4 shows the increase in XML and DBXML size (in MiB) per KiB of debugging information, and their respective sizes when compressed by the gzip utility. The base XML file uses 11.6 bytes per byte of debug data (R=.99), while the DBXML format uses 24.5 bytes per debug byte (R=.99). For DBXML, we used five indices to improve lookup operations, resulting in the higher memory usage. Gzip is a common approach to manage XML file sizes, and for Metaman it significantly reduces the file size. For the DBXML files, the gzipped files only require 2.7 bytes per byte of debugging data.

4.2 Tool Integration


SDT systems link into the program being executed, either at build-time, link-time or load-time. Often, they include their own copies of general-purpose libraries


Figure 4: Comparison of the size of representations of debug information

it is important to examine methods to reduce memory usage for potentially larger metadata. Another option to manage the memory usage of the metadata in a run-time tool is to allow the run-time tool itself to query Metaman. This technique allows the SDT to only allocate memory for metadata about parts of the program that are actually used, rather than all the metadata that could possibly be used. An alternative to storing the binary representation is to use Metaman as a metadata service, connecting through an IPC mechanism. Such a technique allows the run-time tool to only load the metadata used by that particular run of the program, rather than the entire set of metadata that could be used for a given program. We tested this technique by setting up a simple, prototype metadata server, using UNIX sockets to communicate metadata about call-return pairs, useful for improving execution of returns in an SDT [21]. Table 2 shows the run-time overhead of the metadata server. The runtime cost of the metadata server is very small: less than 1%. This result is expected because queries are only made when a code fragment is being translated, a cost that is amortized over the run of the program. The memory usage, however, drops dramatically using the metadata server, because embedding the metadata requires large amounts of space for information that may never occur at runtime.
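As an illustration of the memory-mapping approach mentioned above, the following C sketch maps a binary metadata file straight into an array of fixed-size records. The record layout, file format, and all names are hypothetical and only stand in for whatever binary format the system builder emits.

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical fixed-size record for one piece of metadata, e.g. an
 * indirect-branch site and one known target. */
typedef struct {
    uint64_t branch_addr;
    uint64_t target_addr;
} branch_record;

/* Map the binary metadata file and return it as an array of records;
 * the SDT can then use the entries in place, without any parsing. */
const branch_record *map_metadata(const char *path, size_t *nrecords)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;
    *nrecords = st.st_size / sizeof(branch_record);
    return (const branch_record *)p;
}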

Figure 5: Xalan's Memory Usage

5.1 Data Representation


As a program is built, the build system integrated with Metaman invokes the usual tools to compile, link and assemble the program. As the tools generate output, they also submit metadata to Metaman, and possibly query Metaman as well. A core part of Metaman's value is the ability to efficiently store and query this data. Since XML data can be large and often unwieldy to update and query directly, we examined the option of optimized XML storage in the form of XML databases [15]. We chose the Berkeley DBXML database for its simple setup and its large number of bindings to various programming languages. As with most XML databases, it is possible to use a single XML file, or multiple XML files (e.g., one per object file). DBXML allows Metaman to update multiple logical XML files, and perform XPath queries on the entire dataset. Finally, because it may not be practical to integrate an XML-based system with all tools, particularly run-time tools, Metaman uses the data conversion system to convert XML data into a more terse binary format, discussed in Section 4.2.

5.2 Tool Integration


To track dependencies and relationships between groups of files, Metaman integrates with the build system to gather metadata from tools at each step of the build. Currently Metaman supports SCons as the primary build tool [12]. SCons' modular Tools allow Metaman to transparently wrap their behavior by replacing default tools with augmented ones that also send information into the XML database. For example, typically a C or C++ program is built by multiple invocations of the C compiler to compile the source files into the object

5. Build Systems
Build systems manage the creation of applications, as well as the bundling and preparation for distribution. As a result, they integrate into nearly every part of program creation, and are therefore an important articulation point for collecting and organizing metadata.


Table 2: Performance of metadata delivery on the Xalan benchmark.

Xalan Configuration    Runtime Overhead   Mem Usage (MiB)
Strata w/o metadata    79.9%              379
Embedded metadata      27.8%              501
Metadata Server        28.3%              390
files, implicitly assembling them as part of the command. With SCons, Metaman is able to create its own Tool that replaces the cc Tool, so that the compiler is invoked with -S (to emit only assembly); then the Metaman Tool is able to invoke an analysis pass, generating XML, and finally assemble the file with a call to as. The Metaman Tool replaces the default behavior of the Object command in SCons, the command used to generate object files. Similarly, the XML files corresponding to the object files can be correlated to the final executable by replacing the Program Tool, which invokes the link command. Because of these abstract Tools, developers can use Metaman simply by adding the Metaman Tool at the beginning of the SCons file. SCons was chosen as the initial target because of its flexibility; however, most software today is built with a variety of dependency management tools. Therefore, it is important for Metaman to be able to be ported to additional build systems (e.g., Ant and Autotools). Most build tools have been developed with portability in mind; for example, Makefiles traditionally use variables for their commands, such as $CC for the C compiler, to allow for an easy way to choose the specific compiler on a given system. Metaman can use these variables to wrap the metadata collection commands; however, many make users have built complex systems on top of make with autoconf and automake, which can make correctly inserting Metaman difficult. IDEs represent another important set of tools that incorporate a build system. Two popular IDEs are the open source Eclipse IDE and Microsoft's Visual Studio. Both Eclipse and Visual Studio support flexible build systems, but for this discussion we will focus on Eclipse. For C and C++ programs, Eclipse uses the C/C++ Development Tool (CDT) plug-in, which has its own built-in build system, the Managed Build System (MBS). The MBS is highly extensible, offering a detailed extensibility guide [7]. The MBS offers users the ability to choose both an external build command (such as make or SCons) and a Managed build, which allows separately managed configuration settings that let the user specify the tools and options used to build the system. Because the MBS is so extensible, a Metaman plug-in appears to be a good

way to achieve the desired effects for metadata collection. Thus far we have only discussed compiled languages such as C and C++; however, we believe Metaman is applicable to managed languages such as Java and C#, which are typically compiled into byte-code and executed in a virtual machine performing Just-In-Time (JIT) compilation. Eclipse is primarily a Java IDE, and it integrates well with Ant, a common Java build tool. Ant offers many integrated actions, similar to SCons' Tools, which are called Tasks. For Ant, instead of replacing the Task, Ant provides Build Events which allow additional actions to be triggered before or after a Task is executed. By attaching metadata collection as an event to be triggered after the appropriate task, such as Javac, Metaman will still be able to collect and aggregate metadata.

6. Related Work and Discussion


There are a large number of projects that collect and store metadata for the specific needs of their project [2, 10, 18, 5]. A number of tools have also proposed more general metadata formats, with varying degrees of generality. The LENS project has a similar goal of better program understanding [16]. Similarly, Xu et al. embed metadata with the goal of improving dynamic binary translation [22]. These projects both gather information about the executable: in the case of LENS, information about what compile-time optimizations were performed and how they affected the application, and for Xu et al., information to improve translation from x86 to IA-64, specifically register allocation information to assist in re-allocation on the target platform. Metaman's goal is, as much as is possible, to allow any type of metadata about the program to be stored and retrieved, and to make that process seamless to the people making and using software tools. As a result, we have focused on both run-time and compile-time information gathering, with an opt-in XML interface. There are many systems that take a holistic approach to building systems where there is a high level of metadata available. Montana provides an end-to-end framework for adding custom plug-ins to collect and


use program metadata [20]. Similarly, Oberon and Jikes (formerly Jalapeño) provide rich APIs to allow programmers to extend the basic functionality [11, 3]. However, using these frameworks requires writing your plug-in in the system's language and conforming to its plug-in API. One of Metaman's goals is to be tool agnostic, allowing any tool to submit metadata, as long as it can specify the metadata to Metaman. This approach allows system builders to opt in to Metaman, allowing the tool builders to choose to what degree they wish to be integrated with Metaman. As a result, Metaman can gain program metadata in an evolutionary manner, and as more tools provide more data, the number of uses of the data increases. Initially, most of the applications of Metaman have been non-essential. That is, Metaman provides additional benefit to the program, but is not required for correct execution. Eventually, build systems can be constructed which rely on Metaman for correct execution; however, Metaman will continue to conform to the file interfaces of the build process, allowing individual tools to still be developed that do not need to interact with Metaman. Understanding and formally representing the underlying structure and use of software is vital for building secure, efficient systems. Metaman is an important first step in the next generation of software tools that will depend on metadata being available at all points of the software development process. Metaman provides the basic building blocks for tools to send, query and organize metadata, ensuring accuracy and timeliness.

7. Acknowledgments
This research was supported by the US Department of Defense under grant FA9550-07-1-0532. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the sponsoring agencies.

8. References
[1] Dwarf debugging information format version 3. http://dwarf.freestandards.org, 2005.
[2] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti. Control-flow integrity. In CCS '05: Proceedings of the 12th ACM Conference on Computer and Communications Security, pages 340–353, New York, NY, USA, 2005. ACM Press.
[3] M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalapeño JVM. In OOPSLA '00: Proceedings of the 15th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 47–65, New York, NY, USA, 2000. ACM Press.
[4] D. Bruening. Efficient, Transparent, and Comprehensive Runtime Code Manipulation. PhD thesis, Massachusetts Institute of Technology, August 2004.
[5] B. Buck and J. K. Hollingsworth. An API for runtime code patching. International Journal of High Performance Computing Applications, 14(4):317–329, November 2000.
[6] H. Cleve and A. Zeller. Locating causes of program failures. In ICSE '05: Proceedings of the 27th International Conference on Software Engineering, pages 342–351, New York, NY, USA, 2005. ACM.
[7] S. Evoy, L. Treggiari, M. Sennikovsky, and C. Recoskie. Managed build system extensibility document.
[8] S. Forrest, T. Nguyen, W. Weimer, and C. Le Goues. A genetic programming approach to automated software repair. In GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pages 947–954, New York, NY, USA, 2009. ACM.
[9] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, 2006.
[10] W. Hu, J. Hiser, D. Williams, A. Filipi, J. W. Davidson, D. Evans, J. C. Knight, A. N. Tuong, and J. Rowanhill. Secure and practical defense against code-injection attacks using software dynamic translation. In VEE '06: Second International Conference on Virtual Execution Environments, pages 2–12, New York, NY, USA, 2006. ACM Press.
[11] T. Kistler and M. Franz. Continuous program optimization: A case study. ACM Trans. Program. Lang. Syst., 25(4):500–548, 2003.
[12] S. Knight. SCons design and implementation. In Tenth International Python Conference, 2002.
[13] N. Kumar, J. Misurda, B. R. Childers, and M. L. Soffa. Instrumentation in software dynamic translators for self-managed systems. In WOSS '04: Proceedings of the 1st ACM SIGSOFT Workshop on Self-Managed Systems, pages 90–94, New York, NY, USA, 2004. ACM Press.
[14] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, volume 40, pages 190–200, New York, NY, USA, June 2005. ACM Press.
[15] N. Mabanza, J. Chadwick, and G. S. V. R. Krishna Rao. Performance evaluation of open source native XML databases - a case study. Volume 3, pages 1861–1865, May 2006.
[16] M. O. McCracken. The design and implementation of the LENS program information framework. Technical report, UCSD CSE, 2006.
[17] J. Misurda, J. A. Clause, J. L. Reed, B. R. Childers, and M. L. Soffa. Demand-driven structural testing with dynamic instrumentation. In ICSE '05: Proceedings of the 27th International Conference on Software Engineering, pages 156–165, New York, NY, USA, 2005. ACM Press.
[18] N. Nethercote and J. Seward. How to shadow every byte of memory used by a program. In VEE '07: Proceedings of the 3rd International Conference on Virtual Execution Environments, pages 65–74, New York, NY, USA, 2007. ACM.
[19] K. Scott, N. Kumar, S. Velusamy, B. Childers, J. W. Davidson, and M. L. Soffa. Retargetable and reconfigurable software dynamic translation. In CGO '03: Proceedings of the International Symposium on Code Generation and Optimization, pages 36–47, Washington, DC, USA, 2003. IEEE Computer Society.
[20] D. Soroker, M. Karasick, J. Barton, and D. Streeter. Extension mechanisms in Montana. In ICCSSE '97: Proceedings of the 8th Israeli Conference on Computer-Based Systems and Software Engineering, page 119, Washington, DC, USA, 1997. IEEE Computer Society.
[21] D. Williams, J. D. Hiser, and J. W. Davidson. Using program metadata to support SDT in object-oriented applications. In ICOOOLPS '09: Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, pages 55–62, New York, NY, USA, 2009. ACM.
[22] C. Xu, J. Li, T. Bao, Y. Wang, and B. Huang. Metadata driven memory optimizations in dynamic binary translator. In VEE '07: Proceedings of the 3rd International Conference on Virtual Execution Environments, pages 148–157, New York, NY, USA, 2007. ACM.
42

A Binary Instrumentation Tool for the Blackfin Processor


Enqiang Sun
Department of Electrical and Computer Engineering Northeastern University Boston, MA, U.S.A

David Kaeli
Department of Electrical and Computer Engineering Northeastern University Boston, MA, U.S.A

esun@ece.neu.edu

kaeli@ece.neu.edu

ABSTRACT
While a large number of program profiling and instrumentation tools have been developed to support hardware and software analysis on general purpose systems, there is a general lack of sophisticated tools available for embedded architectures. Embedded systems are sensitive to performance bottlenecks, memory leaks, and software inefficiencies. There is a growing need to develop more sophisticated profiling and instrumentation tools in this rapidly growing design space. In this paper we describe DSPInst, a binary instrumentation tool for the Analog Devices Blackfin family of Digital Signal Processors (DSPs). DSPInst provides fine-grained control over the execution of programs. Instrumentation tool users are able to gain transparent access to the processor and memory state at instruction boundaries, without perturbing the architected program state. DSPInst provides a platform for building a wide range of customized analysis tools at an instruction-level granularity. To demonstrate the utility of this toolset, we provide an example analysis and optimization tool that performs dynamic voltage and frequency scaling to balance performance and power.

tems. Software developers have used them to gather program information and identify critical sections of code. Hardware designers use them to facilitate their evaluation of future designs. Instrumentation tools can be divided into two categories, based on when instrumentation is applied: instrumentation applied at run time is called dynamic instrumentation; instrumentation applied at compile time or link time is called static instrumentation. Instrumentation techniques in the embedded domain are relatively immature, though due to the increased sophistication of recent embedded systems, the need for more powerful tools is growing. It is critical for embedded system designers to be able to properly debug hardware and software issues. The need for improved instrumentation frameworks for digital signal processing systems continues to grow [18, 11]. In this paper we describe DSPInst, a binary instrumentation tool targeting the Analog Devices Blackfin family of DSPs. DSPInst provides an infrastructure in which embedded system developers can design a wide range of customized analysis tools that operate at an instruction granularity. In DSPInst we have adopted a static instrumentation approach, modifying the binary executables of applications before run time. The remainder of the paper is organized as follows. Section 2 introduces the Blackfin DSP architecture used in our work. Section 3 presents a survey of related instrumentation tools. Section 4 presents the design details of DSPInst. In Section 5, we illustrate the usage of DSPInst by presenting an example utility. Section 6 concludes the paper and discusses future directions.

Categories and Subject Descriptors


D.3.4 [Programming languages]: Code generation, Optimization

General Terms
Languages, Management, Measurement, Performance

Keywords
DSPInst, Binary instrumentation, Embedded architectures, Dynamic voltage and frequency scaling

1. INTRODUCTION

Instrumentation tools have been shown to be extremely useful in analyzing program behavior on general purpose sys-

2. ANALOG DEVICES BLACKFIN DSP ARCHITECTURE


The infrastructure discussed in this paper targets the Analog Devices Blackfin family of DSPs. The specific DSP used in our example application is the ADSP-BF548. We start by providing a brief overview of the Blackfin family and its core architecture. More detailed information is available in the Analog Devices Programmer's Reference Manual [4, 1, 5, 2]. The Blackfin DSP is a Micro Signal Architecture (MSA) based architecture developed jointly by Analog Devices and Intel Corporation. The architecture combines a dual 16-



bit Multiply Accumulate (MAC) signal processing engine, flexible 32-bit Single Instruction Multiple Data (SIMD) capabilities, an orthogonal RISC-like instruction set, and multimedia features into a single instruction set architecture. Combining DSP and microcontrol in a single instruction set enables the Blackfin to perform equally well in either signal processing or control intensive applications. Some of the Blackfin Instruction Set Architecture (ISA) features include: two 16-bit multipliers, two 40-bit accumulators, two 40-bit arithmetic logic units (ALUs), four 8-bit video ALUs, a 40-bit shifter, and two Data Address Generator (DAG) units; 16-bit encodings for the most frequently used instructions, with complex DSP instructions encoded into 32-bit opcodes as multifunction instructions; limited multi-issue capability in a Very Long Instruction Word (VLIW) fashion, where a 32-bit instruction can be issued in parallel with two 16-bit instructions; and fixed-point arithmetic, offering 8-, 16-, and 32-bit signed or unsigned traditional data types, as well as 16- or 32-bit signed fractional data types. The Blackfin processor supports a modified Harvard architecture in combination with a hierarchical memory structure, which is organized into two levels. The L1 memory can be accessed at the core clock frequency in a single clock cycle. The L2 memory is slightly slower, but still faster than external memory. The L1 memory can be configured as cache and/or SRAM, giving the Blackfin the flexibility to satisfy a range of application requirements [19]. The Blackfin processor was designed to be a low-power processor, and is equipped with a Dynamic Power Management Controller (DPMC). The DPMC works together with the Phase Locked Loop (PLL) circuitry, allowing the user to scale both frequency and voltage to arrive at the best power/performance operating point for the target application. The Blackfin processor has a built-in performance monitor unit (PMU) that monitors internal resources unintrusively. The PMU covers a wide range of events, including pipeline and memory stalls, and includes the penalties associated with these events. Developers can use the PMU to count processor events during program execution. This kind of profiling can be used to better understand performance bottlenecks and opportunities for voltage/frequency scaling. The PMU provides a more efficient debugging utility than recreating those events in a simulation environment.

Tools that operate on a binary format can be further divided into two categories based on when instrumentation is applied. Probably the most commonly used static binary instrumentation toolset ever developed was ATOM, which targeted the Digital Alpha processor [21]. Other commonly used instrumentation toolsets include EEL [14], Etch [17] and Morph [24]. The second class of binary instrumentation tools we consider are dynamic instrumentation tools. A number of high quality dynamic instrumentation tools have been developed in recent years, including Dyninst [7], Kerninst [22], Detours [12] and Vulcan [10]. These tools dynamically modify the original code in memory during execution in order to insert instrumentation trampolines (i.e., a mechanism that jumps between the instrumented code and analysis code, and back again). Most of these dynamic instrumentation systems do not address instrumentation transparency, preserving the architected state of the processor. Other dynamic instrumentation systems use code caches and dynamic compilation of the binary, and include Valgrind [16], Strata [20], DynamoRIO [6], Diota [15] and Pin [13]. Most of these tools target general purpose architectures. Instrumentation tools targeting embedded systems need to satisfy a different set of requirements [11]. Some of these constraints include power efficiency and the lack of operating system support. Many of these issues force us to take a minimalist approach to instrumentation (i.e., less is more). DELI [9], developed by Hewlett-Packard and ST Microelectronics, provides utilities to manipulate VLIW instructions on the LX/ST210 embedded processor. DSPTune [18] is a toolset similar to DSPInst targeting the Analog Devices SHARC architecture. However, DSPTune can only instrument C or assembly code, which limits its usefulness when source is not available. DSPInst utilizes static instrumentation. The main contributions of DSPInst are: it is the first tool-building system targeting the Blackfin DSP processors; analysis tools to perform code profiling or code modification can be built easily; DSPInst allows for selective instrumentation, so the user can specify, at instruction boundaries, when to turn profiling or code modifications on and off; and DSPInst instruments object code, versus source or assembly, decoupling the user from having to provide source code for the application to be instrumented. The basic philosophy of DSPInst is to replace instructions in the original binary with trampoline code that jumps to analysis procedures. Despite the presence of variable-length instructions in the Blackfin ISA, our tool is able to achieve transparency for all instructions. In the next section we delve into the implementation details of DSPInst.

3. RELATED WORK

High quality profiling and instrumentation techniques have been developed to support general computing platforms.


4. DESIGN AND IMPLEMENTATION DETAILS

4.1 Design Overview

VisualDSP++ tools [3]. We utilize VisualDSP's build tools to generate the final executable for the Blackfin processor. The build tools are composed of an optimizing C compiler, Blackfin assembler, linker, loader and cycle-accurate simulator. Internally, DSPInst works in two steps, as illustrated in Figure 1. In the first step, the binary executable and libraries of the application program are parsed, and instructions at user-defined address points are replaced by trampoline code. DSPInst uses the FORCE INTERRUPT instruction available in the Blackfin ISA to initiate the trampoline. Our next task is to preserve the ISA state and then carry out the analysis prescribed in the analysis file. To modify the interrupt vector table so that control transfers to our desired analysis code, we execute a startup routine. This code performs the following: it initializes the proper entry point of the interrupt vector table according to the application binary executable, initializes a memory buffer to store analysis data, and creates a table in which the address of each modified instruction is stored with its corresponding index. There are two issues that the startup routine needs to consider. First, the startup routine has to run in supervisor mode in order to configure the interrupt vector. Second, we have to make sure that the instrumentation points are not located in program areas where system priority will become an issue. In the second step, we generate the instrumented binary application by assembling the user's analysis functions, wrapping them with the proper register protection, and linking them with the instrumented application program and startup routine. Figure 2 shows an instrumented application program example. In this program, 64-bit VLIW instructions are replaced by 16-bit interrupt instructions. The interrupt instructions effect a jump from the original program to the instrumentation procedure. To maintain transparency (in terms of addressing), we insert three NOP instructions to pad the interrupt instruction out to a full 64 bits. The instrumentation procedure runs in the same address space as the application. This helps to ensure that a precise hardware state and program information are preserved and presented to the analysis function at all times.
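The startup routine's three responsibilities can be summarized in a short sketch. All names, types, and sizes below are illustrative assumptions for exposition, not the DSPInst source; the real routine must execute in supervisor mode so that it may write the interrupt vector table entry.

#include <cstdint>

typedef void (*Handler)();
extern void install_interrupt_handler(Handler h);   // assumed platform hook (supervisor mode)
extern void analysis_dispatcher();                   // dispatches to the user's analysis functions

static uint8_t  g_analysisBuffer[64 * 1024];         // 1 analysis data storage (size assumed)
static uint32_t g_replacedAddress[1024];             // 2 table: index -> replaced instruction address

void dspinst_startup(const uint32_t* points, unsigned numPoints) {
    install_interrupt_handler(analysis_dispatcher);  // 3 route FORCE INTERRUPT to the dispatcher
    for (unsigned i = 0; i < numPoints; ++i)
        g_replacedAddress[i] = points[i];            // lets the dispatcher map each trampoline to its index
}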

Figure 1: The instrumentation process

The design of DSPInst allows us to insert calls to analysis routines before or after any instruction in a program. For example, if we wanted to perform some analysis on the execution of every loop in a program on a Blackfin DSP, we would choose to instrument every zero-overhead loop setup instruction sequence. By detecting loop setups, we can detect the beginning of computationally-intensive loop bodies. Identifying these structures can help guide hot/cold optimization [8]. A transparent instruction-level instrumentation tool allows customized analysis procedures to be added at any point within the program, as specified by the user. DSPInst views a program as a linear collection of procedures, a procedure as a collection of basic blocks, and a basic block as a collection of instructions. DSPInst provides a platform for users to build their own custom analysis procedures that can be inserted at any point specified by the user. We have implemented DSPInst on top of the Analog Devices

4.2 Transparency
DSPInst must avoid interfering with program execution, since transparency is critical for instrumentation. If the original program execution has been modified, the integrity of the profiled state will be compromised.

4.2.1 Trampoline Instruction Selection


In terms of instruction length, the majority of instructions in the Blackfin ISA are 16-bit instructions. There are also 32-bit


Figure 2: An instrumented application

instructions. The Blackfin processor is not superscalar; it does not execute multiple instructions at once. However, it does permit up to three instructions to be issued in parallel, with some limitations. A multi-issue instruction is 64 bits in length and consists of one 32-bit instruction and two 16-bit instructions. Only a limited set of 32-bit Arithmetic Logic Unit (ALU) or Multiply Accumulate Unit (MAC) operations can appear in a parallel instruction. Figure 3 shows the parallel issue combinations.

Figure 4: Padding the FORCE INTERRUPT instruction

Figure 3: Parallel issue combinations

To achieve transparency, we use the Blackfin FORCE INTERRUPT instruction to effect a jump from a user application to the instrumentation routine. The FORCE INTERRUPT instruction is a 16-bit instruction and is the smallest instruction in the Blackfin ISA. By using this 16-bit instruction, we can replace any instruction without impacting the address of any other instruction in the binary. We could have elected to utilize the SHORT JUMP instruction, though the displacement provided would place limits on the size of the binary that we could effectively instrument. Instrumenting binaries with the FORCE INTERRUPT instruction removes this limitation. However, when replacing 32-bit or 64-bit instructions, we have to pad the FORCE INTERRUPT instruction to 32 bits or 64 bits. Otherwise, the next instruction would not remain at its original address. Figure 4 shows examples of how to pad the FORCE INTERRUPT instruction.
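The address-preserving replacement itself amounts to overwriting the original instruction slot with the 16-bit trampoline and filling the remainder with 16-bit NOPs. The following is a hedged sketch only; the opcode values are placeholders, not the real Blackfin encodings.

#include <cstdint>
#include <cstddef>

const uint16_t FORCE_INTERRUPT_OPCODE = 0; // placeholder, not the real encoding
const uint16_t NOP_OPCODE             = 0; // placeholder, not the real encoding

// Overwrite a 16-, 32-, or 64-bit instruction slot with the 16-bit trampoline,
// padding with 16-bit NOPs so every later instruction keeps its original address.
void writeTrampoline(uint16_t* slot, size_t slotBytes) {
    slot[0] = FORCE_INTERRUPT_OPCODE;
    for (size_t i = 1; i < slotBytes / sizeof(uint16_t); ++i)
        slot[i] = NOP_OPCODE;
}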

4.2.2 Program Flow Control Instructions


Some instructions can be moved to another address and still execute without disturbing the original program behavior. However, not all instructions can be replaced so straightforwardly. All instructions that use relative addressing (e.g., program flow control instructions) are not easily moved to another address.

PC-relative instructions. The Blackfin architecture can only


reference the PC as a source register; the PC is only used in branch instructions (JUMPs and CALLs). When we instrument an instruction with the PC as a source register,


we replace the original instruction with a sequence of instructions. This involves a series of steps. First, we protect the data registers that the instrumentation routine will overwrite. Second, we need to save the interrupt return register RETI value to a temporary data register. Next, we calculate the original PC value from the temporary register and the length of the original instruction. Then, we calculate the destination address from the PC value and related registers. Finally, we issue the RETI and restore the register values. When the interrupt return instruction (RTI) is executed at the end of the instrumentation routine, control will return to the correct destination address.

sequence that initializes the three registers, accordingly.

4.2.3 Program Execution Status Protection


Before entering an analysis procedure, the registers are preserved by being pushed onto the stack. Before executing the original instruction, the protected register state is restored from the stack. This ensures that the analysis procedure does not perturb the original program state. Any deviation from native execution will introduce references into the instruction and data footprint in the caches, especially when we map the analysis function and instruction search table to external memory. We either avoid profiling the cache during these periods, or map the analysis functions and the instruction search table to the internal L1 SRAM. The total on-chip L1 memory for the ADSP-BF548 processor is 80 KB, including 48 KB of instruction SRAM and 32 KB of data SRAM. Beyond the configurable 16 KB of instruction cache/SRAM and 32 KB of data cache/SRAM, we have extra space that can be allocated for instrumentation. Instruction and data accesses to the internal L1 SRAM will not perturb the cache state. To obtain accurate profiling information, we can protect the hardware performance counters before entering any analysis procedure by disabling them. If necessary, the events occurring during the execution of an analysis procedure will be ignored and not counted.

JUMP, RTS and Conditional Branch instruction. To avoid


nested interrupts, we have to make sure that the interrupt return instruction (RTI) is the only exit from the instrumentation routine. When we instrument either a JUMP, a subroutine return instruction (RTS) or a conditional branch instruction, we replace the original instruction with a sequence of instructions. In this sequence, the destination address is calculated from the current program state so that the condition code status flag and register values are preserved. The RETI register is then updated with the result. We properly protect any registers used in computations for these instructions; thus the original program behavior is effectively unmodified.

5. BUILDING A CUSTOMIZED TOOL: PERFORMING DYNAMIC VOLTAGE AND FREQUENCY SCALING


In this section, we illustrate the utility of DSPInst for building customized profiling and optimization tools for embedded system research. In this example, we utilize DSPInst to identify code regions that incur significant stalls during execution. If these stalls are due to cache misses, we have the ability to slow down the processor to better match the latency of the off-chip SDRAM. This scaling can save power while not sacrificing performance (if performance is impacted, so is the energy budget). We begin by presenting our profile-guided dynamic voltage and frequency scaling (DVFS) algorithm. Then we provide an implementation of this algorithm on the Blackfin ADSP-BF548 EZ-KIT platform and discuss the benefits obtained.

Figure 5: Instrumenting the Zero-overhead Loop Setup instruction

CALL instruction. In the instrumentation routine, the original CALL instruction is also replaced by an instruction sequence. To instrument CALL instructions, we copy the address value in the interrupt return register (RETI) to the subroutine return register (RETS) right before updating the destination address in RETI. By doing this, we guarantee that not only the instrumentation routine returns correctly, but that the called subroutine also returns correctly.

5.1 DVFS decision algorithm


The advantage of static instrumentation driven DVFS is that the optimization decision is made offline, before the actual application run. The decision process avoids consuming valuable native execution cycles. Furthermore, additional offline analysis can be conducted; for example, we can merge neighboring code regions with the same DVFS decision to further constrain the overhead. The disadvantage of a static instrumentation driven approach is that the profiling data and decisions are based on static program behavior and so cannot react to changes in application behavior. In our static instrumentation driven DVFS, we profile the

Zero-overhead Loop Setup instruction. The zero-overhead


loop setup instruction provided in the Blackfin ISA is a counter-based, hardware-loop mechanism that does not incur a performance penalty when executed. This instruction also reduces code size when loops are needed. The zero-overhead loop setup instruction initializes three registers, the Loop Top Register, the Loop Bottom Register and the Loop Count Register, in a single instruction. The address values initialized in the Loop Top and Loop Bottom Registers are PC-relative offsets. Therefore, to instrument this instruction, we have to replace the original instruction with a


application and make DVFS decisions in the first run. As soon as the decisions are made, DVFS instructions are inserted as an instrumentation procedure in the application program to balance performance and power consumption. As with other instrumentation based optimization techniques [23], it is important to select the best candidate code regions. To be cost effective, we want to select code regions that are frequently executed. At the same time, the regions need to consume enough time to amortize the cost of scaling; voltage and frequency scaling is a relatively slow process, taking about 1000 clock cycles on the Blackfin for the Phase Locked Loop to stabilize. In our design, we construct a profiling tool on DSPInst that applies instrumentation at each zero-overhead loop setup instruction. By instrumenting these instructions, we divide program execution into phases that have a loop execution in the middle and zero-overhead loop setup instructions on the boundaries. The analysis function of the profiling tool configures the hardware performance counters and collects information about performance counter events, execution time, start address, end address, etc. This provides us with the necessary information to decide when to apply voltage/frequency scaling. To benefit from DVFS, a selected code region first needs to run long enough to amortize the scaling overhead. But it is not sufficient that the code region be long running. We need other metrics to complete the criterion, which we obtain by analyzing the DVFS decision model.

speed as fast as possible, so that T3 does not change much. From these observations, we can scale down the voltage and frequency of the processor and reduce the number of CPU stalls during long latency memory operations, such as T3, thus saving energy without incurring a high performance penalty. One class of events identifying phase T3 are data cache misses, which can be directly obtained from the hardware performance counters on the Blackfin. Assuming, on average, a data cache miss takes N CPU clock cycles, the percentage of CPU slack time during time period T is:

(data cache misses × N) / execution time

In this equation, execution time is represented in CPU (versus SDRAM or board clock) cycles. Therefore, we define this new metric:

misses per cycle = data cache misses / execution time

The higher the data cache misses per cycle, the greater the percentage of time the processor will be stalled waiting for data from external memory.

Figure 6: Our DVFS decision model

Figure 6 shows our DVFS decision model. T1 is a CPU bound execution phase, and T2 and T3 are memory bound execution phases. It is possible for the CPU and memory to execute concurrently due to multiple issued instructions, as in T2. Due to the long latency of external memory accesses, T3 is still significantly long, even though the CPU and memory can run concurrently. In memory bound code regions, the CPU is stalled waiting for data cache misses to be serviced from external memory. For the whole execution time T, T3 takes up a significant portion.

Figure 7: The DVFS decision algorithm.

In general, scaling down the CPU voltage and frequency will surely decrease processor power consumption, but it will also slow down the CPU execution speed in phases T1 and T2. However, the memory system clock is usually independent of the CPU clock, and so it is possible to maintain the memory

Figure 7 shows a flow diagram of the DVFS decision algorithm. We set two thresholds in the decision algorithm to filter ineligible execution phases. The first threshold is the number of data cache misses incurred during each program phase;



phases with a low number of misses will be pruned. The second threshold is the number of misses per cycle, which has to be larger than a set value. With these two thresholds, we filter out the ineligible phases and select only program execution phases that contribute significantly to the overall CPU slack time. Once we find the candidate code regions, DVFS instructions are inserted at every entry point of a code region to scale down the voltage and frequency, and at the exit points of the code region to restore the voltage and frequency level. We merge neighboring candidate code regions to further reduce the overhead.
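In code, the two-threshold filter reduces to a simple predicate over the per-phase profile. The field names and threshold values below are assumptions for illustration; the paper does not give concrete numbers.

#include <cstdint>

// Hedged sketch of the phase filter described above.
struct Phase {
    uint64_t dcacheMisses;   // data cache misses counted by the PMU for this phase
    uint64_t cycles;         // phase execution time in CPU clock cycles
};

bool isDVFSCandidate(const Phase& p,
                     uint64_t minMisses = 10000,            // threshold 1 (assumed value)
                     double   minMissesPerCycle = 1e-3) {   // threshold 2 (assumed value)
    if (p.dcacheMisses < minMisses)
        return false;                                        // prune phases with few misses
    double missesPerCycle = static_cast<double>(p.dcacheMisses) / p.cycles;
    return missesPerCycle > minMissesPerCycle;               // enough CPU slack to scale down
}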


5.2 Experimental Result


We implement the example on the ADSP-BF548 EZ-KIT Lite, with a Blackfin ADSP-BF548 processor. Figure 8 is a product image of the ADSP-BF548 EZ-KIT Lite evaluation system. We configure a 16 KB L1 instruction cache and a 32 KB L1 data cache, and all program code and data are mapped to external SDRAM memory. We use the consumer applications subset of the EEMBC Consumer benchmark suite as our benchmark applications.

Figure 9: Dynamic data cache misses in MP4DECODER (x-axis: program execution flow in clock cycles).

we then use an optimizing tool to apply DVFS with the configurations shown in Table 2.

              without VFS    with VFS
frequency     500 MHz        250 MHz
voltage       1.20 V         1.00 V

Table 2: Configurations with and without voltage/frequency scaling.

Figure 8: Product image of the ADSP-BF548 EZ-KIT Lite evaluation system.

DJPEG        JPEG decoding
MPEG4 DEC    MPEG4 decoding
MPEG4 ENC    MPEG4 encoding
MP3PLAYER    MP3 player

Table 1: Consumer applications subset of the EEMBC benchmark suite.

Figure 10: Execution time and energy estimation with and without VFS.

Figure 10 shows the normalized performance and energy usage in the instrumented code region and the impact on the overall MPEG4 DEC application. For the selected code region in MPEG4 DEC, we incur a 3.3% overhead in execution time while reducing energy consumption by 65.3% when compared to full-speed execution. Since the instrumented code region constitutes 9.5% of MPEG4 DEC's entire dynamic execution time, we save 6.2% in energy consumption with a 0.3% performance penalty. We applied the same profiling and optimization tools to the other consumer applications. On average, we saved 7.6% in energy consumption, with only a 0.8% performance impact.
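As a quick check, the whole-program figures follow from weighting the per-region results by the region's 9.5% share of dynamic execution time:

whole-program energy saving      ≈ 0.095 × 65.3% ≈ 6.2%
whole-program performance penalty ≈ 0.095 × 3.3% ≈ 0.3%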

We take MPEG4 DEC as an example. The analysis function selected 21 program phases from the potential candidates (1.5 million in this case), and identified the start and ending addresses of each program phase. We further confirm the output with an offline analysis of the profiling information in Figure 9. Figure 9 shows the number of dynamic data cache misses occurring in MPEG4 DEC. We can easily see that there are significant opportunities to apply voltage/frequency scaling in this application. Based on the selected program phases,


Figure 11: Execution time and energy estimation (relative to native execution) with and without VFS for the consumer applications.

6. CONCLUSIONS

As embedded systems are deployed in a rapidly increasing number of applications, and as the sophistication of these applications continues to grow, the need for high quality instrumentation and analysis tools also grows. Until recently, binary instrumentation tools targeting embedded systems have not been readily available. In this paper, we present the design, implementation, and performance analysis of DSPInst, a static binary instrumentation toolset for the Analog Devices Blackfin family of DSPs. We have provided an example utility to illustrate how a user can build customized tools with DSPInst both to collect run-time profiles and to dynamically adjust system operating conditions. DSPInst opens up new opportunities for software analysis and design space exploration on the Blackfin architecture. In future work we plan to consider how DSPInst can generate standard profiles that can be used during compilation. We will also consider how best to integrate the toolset into the ADI VisualDSP++ framework.

7. ACKNOWLEDGMENTS

The authors would like to thank Richard Gentile and Kaushal Sanghai at Analog Devices for their help and guidance with the Blackfin tools. The Northeastern University Computer Architecture Laboratory is supported by a generous research gift from Analog Devices.

8. REFERENCES

[1] Analog Devices Inc. Blackfin processors webpage. http://www.analog.com/embedded-processingdsp/processors/en/index.html.
[2] Analog Devices Inc. Blackfin Embedded Processor ADSP-BF542/ADSP-BF544/ADSP-BF547/ADSP-BF548/ADSP-BF549 Data Sheet, 2009.
[3] Analog Devices Inc. VisualDSP++ 5.0 Product Release Bulletin, August 2007.
[4] Analog Devices Inc. ADSP-BF54x Blackfin Processor Hardware Reference, August 2008.
[5] Analog Devices Inc. Blackfin Processor Programming Reference, September 2008.
[6] D. L. Bruening. Efficient, transparent, and comprehensive runtime code manipulation. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2004.
[7] B. Buck and J. K. Hollingsworth. An API for runtime code patching. Int. J. High Perform. Comput. Appl., 14(4):317-329, 2000.
[8] R. Cohn and G. Lowney. Hot cold optimization of large Windows/NT applications. In Proc. MICRO-29, pages 80-89, 1996.
[9] G. Desoli, N. Mateev, E. Duesterwald, P. Faraboschi, and J. A. Fisher. DELI: a new run-time control point. In MICRO 35: Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 257-268, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.
[10] A. Edwards, H. Vo, and A. Srivastava. Vulcan: Binary transformation in a distributed environment. Technical report, 2001.
[11] K. Hazelwood and A. Klauser. A dynamic binary instrumentation engine for the ARM architecture. In CASES 06: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 261-270, New York, NY, USA, 2006. ACM.
[12] G. Hunt and D. Brubacher. Detours: binary interception of Win32 functions. In WINSYM 99: Proceedings of the 3rd USENIX Windows NT Symposium, pages 14-14, Berkeley, CA, USA, 1999. USENIX Association.
[13] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Janapa Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Programming Language Design and Implementation, pages 190-200. ACM Press, 2005.
[14] J. R. Larus and E. Schnarr. EEL: machine-independent executable editing. In PLDI 95: Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation, pages 291-300, New York, NY, USA, 1995. ACM.
[15] J. Maebe, M. Ronsse, and K. D. Bosschere. DIOTA: Dynamic instrumentation, optimization and transformation of applications. In Proc. 4th Workshop on Binary Translation (WBT'02), 2002.
[16] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not., 42(6):89-100, 2007.
[17] T. Romer, G. Voelker, D. Lee, A. Wolman, W. Wong, H. Levy, B. Bershad, and B. Chen. Instrumentation and optimization of Win32/Intel executables using Etch. In NT 97: Proceedings of the USENIX Windows NT Workshop, pages 1-1, Berkeley, CA, USA, 1997. USENIX Association.
[18] S. Sair, G. Olivadoti, D. Kaeli, and J. Fridman. DSPTune: A performance evaluation toolset for the SHARC signal processor. In SS 00: Proceedings of the 33rd Annual Simulation Symposium, page 51, Washington, DC, USA, 2000. IEEE Computer Society.
[19] K. Sanghai, D. Kaeli, A. Raikman, and K. Butler. A code layout framework for embedded processors with configurable memory hierarchy. In Proceedings of the 5th Workshop on Optimizations for DSP and Embedded Systems, pages 29-38, 2007.
[20] K. Scott, N. Kumar, S. Velusamy, B. Childers, J. W. Davidson, and M. L. Soffa. Retargetable and reconfigurable software dynamic translation. In CGO 03: Proceedings of the International Symposium on Code Generation and Optimization, pages 36-47, Washington, DC, USA, 2003. IEEE Computer Society.
[21] A. Srivastava and A. Eustace. ATOM: a system for building customized program analysis tools. In PLDI 94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, pages 196-205, New York, NY, USA, 1994. ACM.
[22] A. Tamches and B. P. Miller. Fine-grained dynamic instrumentation of commodity operating system kernels. In OSDI 99: Proceedings of the Third Symposium on Operating Systems Design and Implementation, pages 117-130, Berkeley, CA, USA, 1999. USENIX Association.
[23] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks. A dynamic compilation framework for controlling microprocessor energy and performance. In MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 271-282, Washington, DC, USA, 2005. IEEE Computer Society.
[24] X. Zhang, Z. Wang, N. Gloy, J. B. Chen, and M. D. Smith. System support for automatic profiling and optimization. In SOSP 97: Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, pages 15-26, New York, NY, USA, 1997. ACM.


Improving Instrumentation Speed via Buffering


Dan Upton Kim Hazelwood
University of Virginia

Robert Cohn Greg Lueck


Intel Corporation

Abstract
Dynamic binary instrumentation systems are useful tools for a wide range of tasks including program analysis, security policy enforcement, and architectural simulation. The overhead of such systems is a primary concern, as some tasks introduce as much as several orders of magnitude slowdown. A large portion of this overhead stems from both data collection and analysis. In this paper, we present a method to reduce overhead by decoupling data collection from analysis. We accomplish this by buffering the data for analysis in bulk. We implement buffering as an extension to Pin, a popular dynamic instrumentation system. For collecting a memory trace, we see an average improvement of nearly 4x compared to the best-known implementation of buffering using the existing Pin API.

Categories and Subject Descriptors


D.2.5 [Software Engineering]: Testing and Debugging - Debugging aids, Tracing; D.3.4 [Programming Languages]: Processors - Run-time environments

General Terms
Performance, Experimentation

Keywords
Instrumentation, program analysis tools

1. Introduction

An important part of developing and maintaining efficient hardware and software is understanding the run-time behavior of the system. Collecting information about an executing application can be an involved process, requiring the user not only to determine the relevant characteristics to study but also to determine how to collect the results and verify their correctness. Additionally, collecting information about an application for which the user does not have source code can be difficult. Dynamic binary instrumentation systems provide a layer of abstraction for collecting information about a program's


run-time behavior. Furthermore, they require only the application binary for proling, allowing proling of an application without needing source code access or requiring the application to be recompiled. Such systems have been used for a variety of tasks ranging from collecting an instruction mix or memory trace to architectural simulation or enforcing security policies. The main drawback for dynamic binary instrumentation systems (DBIs) is the overhead involved. Without adding any instrumentation code, there is still some amount of extra code introduced simply by running the DBI. The system must generally perform some sort of recompilation of the hosted binary; beyond that, it must perform a number of management tasks, such as managing the recompiled code, maintaining control, and handling signals and system calls. Adding instrumentation further increases the overhead, possibly ranging as high as 1000x or greater slowdown for sophisticated, compute-intensive tasks. Reducing the overhead of the DBI itself has been studied extensively with various dynamic binary systems [2, 4, 6, 8, 10, 11] and is not addressed in this paper. Proling an application can be broken down into two main phases collecting the data and analyzing it. Both parts are responsible for overhead: collecting data requires adding and executing additional instructions, such as calculating the effective address of a memory access; analyzing the data generally involves running some user-dened function. Strategies for collecting and processing data can also be broken down into two general classes, online analysis and decoupled analysis. Several methods have been suggested for reducing the impact of data analysis. For instance, PiPA [15] sought to overlap analysis with application execution and data collection by collecting small chunks of data and processing them in one or more separate threads. Other systems [9, 13] have parallelized data collection in a separate thread while the unmodied application runs in its own thread. In this paper, we focus on reducing the overhead of the rst phase, data collection. As noted above, analysis can either be performed online or in a decoupled manner, collecting chunks of data and processing a full chunk at once. We implement a buering system for Pin [8] to eectively collect data for decoupled analysis. While such buering can be accomplished using the preexisting interface to Pin, an ecient implementation is complex and a simple implementation is inecient. Using our buering system, users can write a comparable amount of code to the simple implementation while achieving greater performance than the more complex implementation. There are three main contributions in this paper. First, we add a new API to Pin specic to tools that collect data in


buers. The API allows the user to control how the buers are dened, how they are lled, and how their memory is managed. The buers are thread-private, meaning the user does not need to use sophisticated methods to separate collected data. While it is possible to write a tool that collects data in a buer with the existing API, the additions make it simpler to write the tool and simpler for Pin to optimize the code. Second, we optimize the code generated to write to the buer using a combination of scalar optimizations and register allocation. Third, we reduce the cost of detecting when the buer is full by using a probabilistic algorithm that is backed up by a safe but slower method. We achieve on average nearly a 4x improvement over the best-known method for implementing a buer using the existing API. The rest of the paper is organized as follows. Section 2 presents background on dynamic binary instrumentation systems and on Pin specically, including a short overview of the Pin instrumentation API. Section 3 describes our implementation in detail, including the additions to the API, options for code generation, and how we handle buer overow. Section 4 provides a performance evaluation of our system, comparing execution time on both single- and multithreaded programs. Section 5 discusses related work. Finally, Section 6 presents future work and concludes.

2. Background

Before discussing implementation details of buering in Pin [8], we rst present an overview of dynamic binary instrumentation and related systems. We then provide a brief introduction to Pin and its instrumentation interface. We will also describe options for buering data for decoupled analysis which already exist in Pin. Dynamic binary instrumentation systems (DBIs) such as Pin [8] and Valgrind [10] take control of a hosted application and have the ability to inspect every instruction in the application before executing it. This allows DBIs to add, modify, or remove instructions from the dynamic execution stream. Specically, they are commonly used for inserting user-dened instrumentation instructions before points of interest, such as memory operations, branches, or function calls. Such instrumentation has many uses for proling an application, and has also been used for such tasks as enforcing security policies [7]. DBIs are similar to dynamic binary translators and dynamic binary optimizers. Dynamic binary translators such as IA-32EL [3], Boa [1], or Transmetas CMS [6] take a stream of instructions for one instruction set architecture (ISA) and translate them to the ISA of the current processor. Dynamic binary optimizers such as Dynamo [2] and DynamoRIO [4] seek to improve performance of hosted applications by targeting hot code for optimization and utilizing information available at run time. Most of these systems share implementation details as they have many of the same basic issues to solve, such as maintaining control (e.g., capturing branches to make sure the application does not return to its original code). Additionally, many of these systems maintain a software code cache from which code is executed after translating, optimizing, or instrumenting once. This avoids the overhead of repeating work. In this paper, we describe an implementation of buering in Pin. Pin is a DBI developed at Intel that consists of a virtual machine (VM), a software code cache as described above, and an instrumentation interface. The instrumenta-

tion interface provides a large API for writing user-dened programs, called PinTools, which describe where to add instrumentation and how to process the collected data. Pins API allows most tools to be written once and then compiled and run on any supported OS (Linux, Windows, MacOS, or FreeBSD) and architecture (IA-32, Intel 64, IA-64, or ARM). When instrumentation code is added, a just-in-time compiler (JIT) in the VM manages isolating program state from VM and tool state, including saving and restoring registers as necessary. The JIT also performs register allocation and inlines analysis code when possible. Pins instrumentation API provides several ways to add instrumentation code, such as before an instruction or before a basic block, using functions such as INS InsertCall or BBL InsertCall respectively. More sophisticated instrumentation is also possible, such as if/then instrumentation, where a lightweight boolean call is added rst and more involved instrumentation is only executed if the boolean function evaluates to true. The user may also specify where to insert instrumentation using IPOINT directives, such as IPOINT AFTER, used to place analysis code after the instruction under analysis. There are two basic methods for analyzing the data collected by adding instrumentation code, online or decoupled. Online analysis can be very exible. User-dened arguments, such as branch target or memory operation addresses, are passed to a user-dened routine that analyzes the data. However, the exibility comes at a cost. The userdened routine may overwrite registers, and these registers may be live in the application when the routine is called. Thus, Pin inserts register saves and restores to prevent the registers from being overwritten. To reduce the number of saves and restores, Pin does extensive optimizations. Pin may also inline analysis routines; however, this code duplication can negatively impact cache performance, and if the code is infrequently used, the additional time spent in the JIT may outweigh any benets. Decoupled analysis collecting a large amount of data to be processed at once can lessen or avoid both of these sources of overhead, because a specialized buer ll routine can have a smaller register and instruction footprint than generalized analysis routines. Buering data for decoupled analysis is possible in Pin using the existing API. However, as will be described in more detail later in Section 4, a straightforward implementation of buering is inecient. An ecient implementation is possible but is more involved and dicult to get right for instance, the user must be aware of how to use the if/then instrumentation, ensure the buer ll code is inlined, and avoid modifying the eflags register on IA-32 and Intel 64 architectures. The buering implementation described in this paper manages these details transparently to the user, allowing them to write a tool in a straightforward manner while achieving the performance of the more involved implementation.
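For contrast with the buffered approach developed in the rest of this paper, a minimal online-analysis tool sketched against the standard Pin API looks roughly as follows (file handling and error checks are omitted, and the output format is an assumption). Every load triggers a call into the tool, which is exactly the per-event cost that buffering amortizes.

#include "pin.H"
#include <stdio.h>

static FILE* trace;

// Analysis routine: invoked online, once per executed load.
VOID RecordLoad(VOID* ea) { fprintf(trace, "%p\n", ea); }

// Instrumentation routine: insert the analysis call before every memory read.
VOID Ins(INS ins, VOID* v) {
    if (INS_IsMemoryRead(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordLoad,
                       IARG_MEMORYREAD_EA, IARG_END);
}

int main(int argc, char* argv[]) {
    PIN_Init(argc, argv);
    trace = fopen("loads.out", "w");
    INS_AddInstrumentFunction(Ins, 0);
    PIN_StartProgram();   // never returns
    return 0;
}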

3. Implementation
In this section, we discuss the various facets of implementing buffering of data for decoupled analysis in Pin. First, we describe the user interface, implemented as additions to the Pin instrumentation API. Next, we describe the code generation for filling the buffers, including options for store instructions. Finally, we discuss handling the buffer overflow condition and allowing the user to process the data.


#define NUMPAGES 1000

struct MEMREF {
    ADDRINT addr;
};

VOID* BufferFull(BUFFER_ID id, THREADID tid, const CONTEXT* ctxt,
                 VOID* buf, unsigned numElts, VOID* v) {
    // process buffer as necessary
    return buf;
}

VOID Ins(INS ins, VOID* v) {
    if (INS_IsMemoryRead(ins)) {
        INS_InsertFillBuffer(ins, IPOINT_BEFORE,
                             IARG_MEMORYREAD_EA, offsetof(struct MEMREF, addr),
                             IARG_END);
    }
}

int main(int argc, char* argv[]) {
    PIN_Init(argc, argv);
    BUFFER_ID id = PIN_DefineBuffer(sizeof(MEMREF), NUMPAGES, BufferFull, 0);
    INS_AddInstrumentFunction(Ins, 0);
    PIN_StartProgram();
    return 0;
}

Figure 1: Sample PinTool using the buffering API described in this paper.

3.1 Buffering User Interface


The buering API can largely be broken into three classes: describing the buer, lling the buer, and managing the buer memory. Figure 1 shows a sample PinTool demonstrating the use of this API to collect a trace of memory load addresses. Managing buer memory is not shown in the sample tool. Describing the buer: We rst provide a new call, PIN DefineTraceBuffer, which takes as parameters the size of an individual record, the total size of the buer, a callback function, and an optional pointer (such as to a struct or object). In the general use case, an individual record will be a struct or an object, although it is possible to store a single value without wrapping it in a struct or class. The record size parameter is used during code generation to determine by how much to increment the pointer within the buer. The total size of the buer is given as a number of pages. Depending on the performance of a particular tool and the relative amounts of work collecting and analyzing the data, the user may choose a dierent size for the buer. The callback function is invoked any time the buer overows; details for handling overow will be discussed later. Additionally, the callback function is invoked on thread exit, which allows the user to drain the buer of any values inserted between the last overow and the exit. The callback function is called with a pointer to the base of the overowed buer, the number of elements, information about the application state including the thread ID and architectural context, and the optional pointer from PIN DefineTraceBuffer. The specied buer size is allocated for each thread, in a thread-private manner, so the thread ID allows the user to analyze each thread separately

or record information about thread interleaving. The user may dene multiple buer types; for example, the user may create one buer type to collect a memory trace, and a second buer type to collect a branch trace. We dierentiate these using a new ID type. The ID of a buer is also delivered to the user in the callback. This allows the user to consolidate their callback functions; it is also used in the memory management functions described later, if the user wishes to manage the buer in the callback. Buer lling: The new API elements for lling a buer are parallel to many of the existing instrumentation calls as described in Section 2. For instance, instead of INS InsertCall, present in the existing API, we provide INS InsertFillBuffer for inserting a buer ll around an instruction. As with InsertCall functions, there are dierent versions for instrumenting with respect to an instruction, basic block, routine, etc., as well as a version for if/then instrumentation (i.e., the user can opt to only insert data to the buer if a certain condition is true). The InsertFillBuffer functions also support the various IPOINT directives, which instruct Pin where to insert instrumentation code with respect to the instruction or basic block being instrumented. There are two primary dierences between the InsertCall and InsertFillBuffer functions. First, rather than providing a pointer to an analysis function, the user provides the ID of the buer they would like to ll. This is used internally during code generation to nd the size of the record to advance the buer pointer. Second, where the user would provide arguments to an analysis call, they must also provide the oset into the record. This is necessary for generating the ll code to make sure the correct values go to the correct places in a struct; otherwise, the user may get nonsensical values when they process the buer later. It has the additional benet of allowing the user to specify the elds for a particular InsertFillBuffer call in the order that makes the most sense at that time; furthermore, they may only ll the parts of the record they care about. For instance, rather than having two separate buer types for memory and branch traces as described previously, the user could dene their record to store information about both, along with a ag; by requiring them to specify the oset for the elds they care about, they need not write code in their tool to ll unimportant elds. Likewise, when Pin generates the actual code to ll the buer, it does not have to generate ll code for elds the user does not care about for that particular instruction, which reduces both the instruction count and memory trac. Managing buer memory: For each buer type, one physical buer is allocated automatically when a thread begins execution. In addition to dening multiple buer types, the user may request allocation of multiple buers for any given buer type. This allows the user to have a second buer, or an arbitrary number of buers, so one or more processing threads may work on analyzing the data while the main application thread(s) continue to run and collect data. This multi-buering can be used with a technique like PiPA [15] as well to break data up into smaller chunks to be processed further by other threads. We provide two API calls for explicitly managing buers, PIN AllocateBuffer and PIN DeallocateBuffer. To allocate a new buer, the user only needs to provide the ID for


the type. For deallocation, the user needs to give both the ID and the address of the base of the buffer. Pin maintains a list of the memory associated with each buffer type, so it will not free memory at an address that does not match the base of an appropriate buffer. If the user has ever allocated their own buffers, Pin cannot safely free the memory automatically, so the user is responsible for deallocating buffers to avoid memory leaks. The user explicitly manages which buffer to write after an overflow by returning a pointer from the callback. In the single-buffer case, the user will just return the pointer to the buffer they received in the callback. For multiple buffers, the user may either allocate a new buffer each time and free the processed buffer at the end of the processing thread, or they may allocate some number of buffers in advance and rotate through them. Which of these is ideal depends on the processing task: if processing takes considerably longer than filling, the application may stall while waiting for an empty buffer; however, creating too many processing threads and extra buffers can quickly exhaust memory and execution contexts, degrading overall performance.
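As a concrete illustration of the allocate-and-swap scheme, the hedged sketch below shows a callback that hands the full buffer to a processing thread and immediately returns a fresh one. Only the callback signature (from Figure 1) and PIN_AllocateBuffer/PIN_DeallocateBuffer come from the API described in this paper; EnqueueForAnalysis is a hypothetical hand-off to the tool's own worker thread.

// Callback used with the buffering API sketched in Figure 1.
VOID* BufferFull(BUFFER_ID id, THREADID tid, const CONTEXT* ctxt,
                 VOID* buf, unsigned numElts, VOID* v) {
    EnqueueForAnalysis(id, tid, buf, numElts);  // hypothetical hand-off to a processing thread
    return PIN_AllocateBuffer(id);              // the application thread keeps filling a fresh buffer
}

// In the processing thread, once a buffer has been fully analyzed:
//   PIN_DeallocateBuffer(id, analyzedBuf);     // user-allocated buffers must be freed explicitly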

mov r8, 0x0                      ; flag: load
mov qword ptr [r14+0x14], r8     ; fill argument 3
mov r8, 0x8                      ; size of reference
mov dword ptr [r14+0x10], r8d    ; fill argument 2
lea r8, ptr [rsp-0x8]            ; EA of reference
mov qword ptr [r14+0x8], r8      ; fill argument 1
mov r8, 0x3668c01147             ; original PC
mov qword ptr [r14], r8          ; fill argument 0
lea r14, ptr [r14+0x18]          ; advance pointer

Figure 2: Sample buffer fill code. Here, r14 stores the current buffer pointer, and r8 holds analysis arguments.

3.2 Code Generation


As the application is running, Pin recompiles the original application binary to include the specied instrumentation. When analyzing data online, this basically involves generating code to set up arguments for the analysis function in registers and on the stack, then either inlining the analysis code or generating a call to the analysis function. Inlining analysis functions is not always possible, and can have a negative impact on performance. Additional time is required to recompile and allocate registers for the analysis function, and a larger, frequently inlined function can reduce instruction cache locality. Alternately, calling the analysis function can be slow, as it requires saving and restoring at least part of the application state. When generating code to buer data, we can largely mitigate both of those issues. We guarantee the buer ll code is inlined, which avoids the slower call to an analysis function. Furthermore, we have a known small number of instructions required for lling the buer, reducing the impact on code size, compile time, and instruction cache eects. We store the pointer to the current location in the buer in a virtual register, where the set of virtual registers available in Pin is larger than the set of architected registers. Depending on the current mapping from virtual to physical registers, we may rst need to generate a spill and ll to get the buer pointer into a physical register; this is handled automatically by Pins register allocator. The basic approach is to move the analysis argument eective address of a load, branch target, etc. into a register, and then move the argument from that register to the specied oset from the buer pointer. After lling all relevant elds of a record, we advance the buer pointer by the size of one record. One simple optimization we make here is to use an lea, rather than an add, to advance the pointer. This avoids modifying the eflags register, which could aect program execution, or having to save and restore it, which has been demonstrated in the past to be slow. We show a sample of this code in Figure 2, taken from instrumentation code generating a trace consisting of the program counter, the eective address, the size of the reference, and a ag indicating load or store. In the code, r14

Table 1: Fill bandwidth for combinations of fill instruction and buffer pointer advancement scheme. Higher is better.

    Fill method        Bandwidth (GB/s)
    glibc memset       2.38
    mov, serial        1.58
    mov, frame         1.58
    movnti, serial     2.43
    movnti, frame      2.40

In the code, r14 has already been filled with the value of the current buffer pointer, and r8 (or r8d, the lower 32 bits of r8) has been used to store each of the analysis arguments to be added to the buffer. As shown, we only need two instructions per analysis argument plus one to advance the buffer pointer.

We considered two dimensions for code generation: which store instruction to generate, and how often to advance the buffer pointer. In terms of how to actually store into the buffer, we considered using mov as shown in Figure 2, or movnti, a non-temporal store instruction from the SSE2 instruction set. The non-temporal instructions essentially tell the processor the data being written will not be accessed soon, removing the need to load the cache line if it is not already present in the first-level data cache. Depending on the specific microarchitecture, the line may or may not be loaded into the last-level cache.

There are two options for advancing the buffer pointer: advance after each fill, or advance once per basic block or trace. Advancing once per fill is very straightforward and means the effective offsets are the same for each fill. However, it can have a larger overhead in terms of instructions, depending on the number of buffer fills per basic block or trace. Advancing once per basic block requires keeping track of the number of fills executed so far, so the offsets can be shifted appropriately. For instance, if the record is 16 bytes and the pointer is advanced once per basic block, the second fill in a basic block must have all its offsets adjusted by 16 bytes, the third fill by 32 bytes, and so on. Additionally, if the basic policy is to advance once per trace, all branches that exit early from the trace must be modified to advance the buffer pointer by an appropriate amount before branching.

Our main concern with respect to the code generated for filling the buffer and advancing the pointer was to ensure buffer filling is not bandwidth-limited: if the memory bandwidth is saturated, no other changes will improve performance. We used a microbenchmark to test the maximum fill bandwidth for several configurations; this comparison is shown in Table 1.
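As an illustration of the record layout implied by Figure 2 and of the two advancement schemes, the sketch below uses a C++ struct whose field offsets match the fill code; the struct and helper names are ours and only approximate what the generated code does.

    #include <cstdint>

    struct MemRecord {       // 0x18 bytes, matching the offsets in Figure 2
        uint64_t pc;         // +0x00  original program counter
        uint64_t ea;         // +0x08  effective address of the reference
        uint32_t size;       // +0x10  size of the reference
        uint32_t is_load;    // +0x14  flag: load or store
    };

    // Serial scheme: advance the buffer pointer after every record
    // (the lea in Figure 2 corresponds to the "return p + 1" here).
    inline MemRecord* FillSerial(MemRecord* p, const MemRecord& r) {
        *p = r;
        return p + 1;
    }

    // Frame scheme: fill n records at fixed, shifted offsets and advance
    // the pointer only once, at the end of the basic block or trace.
    inline MemRecord* FillFrame(MemRecord* p, const MemRecord* r, int n) {
        for (int i = 0; i < n; ++i)
            p[i] = r[i];     // the i-th fill uses offset i * sizeof(MemRecord)
        return p + n;
    }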


Figure 3: Demonstration of buffering. In (a), the buffer is partially full and the buffer pointer points to empty space. In (b), the write at the buffer pointer overflows the buffer, triggering an overflow.
Figure 4: Bandwidth with respect to buffer size on an ideal application. For small buffers, frequent overflows limit performance.

The bandwidth is approximated as the average over 15 executions of filling a 40MB block of memory, using the time stamp counter to measure time. We use memset from glibc, a highly tuned implementation of memory filling, as a baseline. We compare filling with either mov or movnti, and either advancing the pointer after each fill (serial) or once every four fills (frame). From the table, we see that movnti gives performance roughly equal to memset, whereas mov reaches only approximately 65% of the maximum. Furthermore, the two methods for advancing the pointer are equivalent. Based on this, we initially chose movnti with the serial advancing scheme as our baseline. However, as we will detail further in Section 4, performance degrades on real applications. This is because movnti performs best when moving full, aligned cache lines, an assumption broken by interleaving application stores or even non-aligned buffer fills.
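For reference, a microbenchmark of this kind can be put together along the following lines; this is our own reconstruction under the stated assumptions (a 40MB block, timed with the time stamp counter), not the authors' benchmark code.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <x86intrin.h>   // __rdtsc, _mm_stream_si64

    int main() {
        const size_t kBytes = 40u * 1024 * 1024;
        const size_t n = kBytes / sizeof(long long);
        long long* block = static_cast<long long*>(aligned_alloc(64, kBytes));
        if (!block) return 1;

        uint64_t t0 = __rdtsc();
        for (size_t i = 0; i < n; ++i)
            block[i] = static_cast<long long>(i);        // plain mov stores
        uint64_t t1 = __rdtsc();
        for (size_t i = 0; i < n; ++i)
            _mm_stream_si64(&block[i], (long long)i);    // movnti stores
        uint64_t t2 = __rdtsc();

        printf("mov: %llu cycles, movnti: %llu cycles\n",
               (unsigned long long)(t1 - t0), (unsigned long long)(t2 - t1));
        free(block);
        return 0;
    }

Converting cycle counts to GB/s requires the core clock frequency, and a fair comparison would repeat each fill several times and average, as described above.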

3.3 Handling Overflow


The final major component of the buffering implementation is handling overflow. For some instrumentation tasks, applications, and buffer sizes, the initial chunk of memory allocated for the buffer will hold the entire application's data, and the user's callback only needs to be called on termination. However, in most interesting cases, the user's instrumentation routines will generate many buffers of data.

Our basic strategy is to allocate an additional page at the end of the buffer, without read or write permissions, to act as a guard page. When the buffer is full, the next attempt to write into the buffer will land on this guard page and generate a segmentation fault signal. Figure 3 depicts this case. For transparency, Pin must catch all signals and either reconstruct the original application state or report the signal as an internal fault. We add an extra case to the signal handler for segmentation faults; if Pin is unable to find an original application instruction matching the faulting instruction in the code cache, we check the address of the write against the guard-page ranges of currently allocated buffers.

If the write falls in a guard page, we can easily obtain the information needed to call back to the user, including the ID of the overflowing buffer and the number of records stored in it. In this case, the number of records is simply the size of the buffer divided by the specified record size.

The primary issue with this scheme is the overhead of handling the signal, including both catching the signal and reconstructing the state. Reconstructing the state essentially requires recompiling the current trace, including all instrumentation, which becomes a significant source of overhead if done frequently. This can be mitigated by using a larger buffer, which reduces the frequency of overflows but scales poorly when multiple threads are involved. Figure 4 shows the bandwidth for filling a buffer with both mov and movnti at several sizes, on an idealized application where the only memory traffic is from filling the buffer. In this case, wall-clock performance closely follows bandwidth performance; the primary point to note is how quickly performance drops off as the buffer size decreases.

To mitigate this, we implemented multiple high-water mark checks for the buffer. This basically means we check at various points, preferably infrequently, to see how close to overflowing each buffer is. If the buffer is within a certain distance of the end, we force an early overflow without needing to recompile to reconstruct the state. We implement this in two ways. The first is to check whether the buffer is past the high-water mark every time the VM is entered, which happens for various reasons such as compiling new code or handling a system call. This method works well in some cases, avoiding as many as nearly 95% of signal-handling overflows, but it avoids less than 3% on average. This is due primarily to the fact that in the ideal case, which is realized on some applications, all of the application code is compiled early and most of the execution is spent executing from the software code cache. The second implementation is periodic if/then instrumentation. As described in Section 2, we can execute a small boolean function quickly and execute a more heavyweight function only if the boolean function evaluates to true. In this case, our if function checks whether the buffer is above the high-water mark, and the then function mimics an overflow. Because these if calls generally happen more frequently than VM entries, this method generally leads to more early overflows than the VM-entry check. Pathological cases exist, such as an instrumented instruction appearing in a loop nest without being monitored by an if/then; this would lead to all overflows being handled by catching a signal. One way to avoid this case is to recompile any trace which causes a full overflow and explicitly insert the if/then instrumentation. We will discuss this further in Section 4.
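As a concrete illustration of the guard-page scheme and the high-water check, the sketch below allocates a buffer followed by an inaccessible page using standard POSIX calls; the structure and function names are ours, not Pin's internals, and the buffer size is assumed to be a multiple of the page size.

    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstddef>

    struct TraceBuffer {
        uint8_t* base;        // start of the writable buffer
        size_t   size;        // usable bytes (guard page excluded)
        uint8_t* guard;       // first byte of the PROT_NONE guard page
        uint8_t* highWater;   // e.g. the 50% mark used for early overflows
    };

    bool AllocTraceBuffer(TraceBuffer* b, size_t size) {
        size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
        void* mem = mmap(NULL, size + page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) return false;
        b->base = static_cast<uint8_t*>(mem);
        b->size = size;
        b->guard = b->base + size;
        b->highWater = b->base + size / 2;
        // Writes into the trailing page raise SIGSEGV, which the signal
        // handler can map back to "this buffer just overflowed".
        return mprotect(b->guard, page, PROT_NONE) == 0;
    }

    // The cheap "if" part of the periodic if/then check: has the current
    // fill pointer crossed the high-water mark?
    inline bool AboveHighWater(const TraceBuffer* b, const uint8_t* fillPtr) {
        return fillPtr >= b->highWater;
    }

The corresponding "then" function would hand the partially filled buffer to the user's callback and reset the fill pointer, mimicking an overflow without ever touching the guard page.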

3.4 Summary
In this section, we described the implementation details of buffering in Pin in three major parts: the user interface, code generation, and handling buffer overflow. The user interface for buffering is implemented as additions to Pin's instrumentation API, allowing the user to define multiple buffer types, insert buffer fill instructions in a variety of ways, and manage memory for filling and processing buffers in parallel. These additions generally parallel the existing Pin API. Fill code is generated as pairs of instructions that compute an analysis argument into a register and then store it into the buffer. These fill instructions must then be followed by an instruction to advance the buffer pointer. We discussed choices for the buffer fill instruction and the buffer pointer advancement scheme. Based on bandwidth experiments, we use as a baseline filling the buffer with the SSE2 movnti instruction and advancing the pointer with an lea after each record. Finally, we described our method for dealing with overflow, catching a signal due to a write into a guard page at the end of the buffer, and two methods to avoid the overhead associated with signal handling.

4. Performance Evaluation

In this section we present performance results for our buffering API. We first present a comparison of run times for basic implementations based on both fill instructions against unoptimized and optimized analysis call-based implementations. We then compare the effects of our overflow-handling optimizations. The results were collected on a 1.6GHz dual-processor Intel Core2 Quad with 8GB of RAM running 64-bit CentOS 4.7. We report results using two benchmark suites, SPEC CPU and SPEC OMP. For SPEC CPU, we used the reference inputs to SPEC2006 v1.1, compiled with gcc 4.1.2 with optimization -O2 -fPIC. For SPEC OMP, we used the medium inputs for SPEC OMP 2001, compiled with icc 10.1.013 with optimization -O3. SPEC OMP benchmarks were run with eight threads.

Measuring buffering performance only makes sense in the context of using it in a PinTool. Our comparisons are based on a PinTool that collects a memory trace of effective addresses of loads and stores; that is, each record is a single 64-bit value. Since we only care about the performance of the buffer, no analysis is performed on the data.

4.1 Basic Implementation

We first demonstrate the performance of our buffering implementation without the modifications to improve overflow handling. Figure 5 compares the overhead of collecting a memory trace, as described at the beginning of the section, using four methods. The first two, MOVNTI Buffer and MOV Buffer, use the buffering API described in this paper with different fill instructions. The last two, Tuned Acall and Untuned Acall, are implementations using the preexisting Pin API. As a point of reference, the two versions using our API require approximately 75 lines of user code, the untuned analysis call requires approximately 50 lines, and the tuned analysis call implementation requires roughly 500 lines.

Figure 5: Runtime comparison of buffering implementations, comparing the implementation described in this paper with each fill instruction to two analysis call-based implementations. For either fill instruction, our implementation performs better than the analysis call-based implementations.

The tuned analysis call implementation is included with the Pin kit as the memtrace tool and uses if/then instrumentation before each fill to determine whether the buffer will overflow on the current fill. If the buffer would overflow, it first processes the buffer and resets the pointer before continuing. In this case, the if call can be inlined. The untuned analysis call implementation performs the overflow check in the same function as the buffer fill and cannot be inlined.

As mentioned previously in Section 3.2, although performance is better in the ideal case when filling the buffer with movnti, on SPEC applications buffer filling with mov is faster on average and provides a 2.6x speedup compared to memtrace. However, it is worth noting that using movnti provides the lowest overhead and highest speedup compared to memtrace on a single benchmark: a 43% overhead and nearly a 3x speedup on 462.libquantum. The untuned analysis call implementation also performs better than memtrace on this benchmark, so it is possible the memory access patterns of libquantum are a pathological case for memtrace.

In many of the cases where the mov buffer is faster than the movnti buffer, the difference is relatively small. However, on 470.lbm, we see both the mov buffer and memtrace have roughly a 2.3x overhead while the movnti buffer incurs nearly a 10x overhead. After investigation using hardware performance counters, we discovered this was due to how non-temporal stores update memory. Non-temporal stores rely on write-combining full cache lines to improve performance; interleaving non-aligned stores, either from Pin or the application, limits write-combining and thus performance.


Figure 6: Overhead of a movnti-based buffer with different buffer sizes. Performance degrades slightly when decreasing the buffer size from 400MB to 40MB, but degrades more noticeably when decreasing from 40MB to 4MB.

Figure 7: Buffering overhead after applying early overflow optimizations, triggering an overflow if the buffer is more than 50% full at check time. Small buffers now perform better than larger buffers.


We now compare the effect of buffer size on overhead. As mentioned in Section 3.3, memory bandwidth on an ideal application drops off rapidly as the buffer size is decreased. Figure 6 compares the run time of filling a buffer with movnti with buffer sizes of 4MB, 40MB, and 400MB. The results follow what one might expect from the bandwidth comparison of Figure 4: there is a large increase in performance from 4MB to 40MB, and less of an increase from 40MB to 400MB. This allows the user to make a tradeoff between a large buffer for increased performance on applications with few threads, or smaller buffers for many-threaded applications or applications where memory issues could arise.

4.2 Optimizing Overflow


Although the user has the option of increasing performance by increasing the buffer size, this is not possible in all cases. For instance, large buffers do not scale to applications with many threads; furthermore, they may lead to thrashing in memory. Since the increased frequency of overflows is the primary difference in behavior with varying buffer sizes, we now discuss the performance benefits of modifying overflow behavior as described in Section 3.3. Recall that we implemented multiple methods for causing an early overflow to avoid handling a signal and reconstructing the state, including checking when entering the VM, checking with periodically inserted if/then instrumentation, and potentially recompiling traces which still overflow due to a segmentation fault. The if/then instrumentation is effectively the same as that described for the analysis call-based memtrace, although we automatically add the checks rather than requiring the user to insert them explicitly.

There are a number of parameters which can be varied for these methods, including how full the buffer can grow before an early overflow is desirable (the high-water mark), how often to insert if/then calls, whether the time spent recompiling the trace is worthwhile, and whether the VM-entry check and the if/then check work well together. We omit a full exploration of these parameters but summarize here. For nearly all variations of the parameters, noticeably better performance was obtained using a 4MB buffer. Although the best high-water mark and if/then frequency varied across benchmarks, the best case on average was with the high-water mark set at 50% and if/then checks inserted every 50 fills. Using these parameters along with an additional high-water check every time the VM is entered, we vary the fill instruction and whether to recompile a faulting trace with the if/then instrumentation. Based on those initial studies varying the parameters, we selected four cases for a larger comparison.

Figure 7 compares the four combinations of the remaining parameters, fill instruction and recompilation. There are two interesting points to take away from the comparison. First, as in Figure 5, the mov-based buffer performs better on average than the movnti-based buffer, but the gap is much more pronounced. Still, there are cases where mov performs worse, which generally parallel the performance prior to the overflow optimizations. The second interesting point is that although recompilation of faulting traces occasionally improves performance, in general it has a negative or negligible impact. Since the intent behind recompiling faulting traces was to instrument loops which may otherwise miss being checked, a possible cause for slowdown here is that making the check every time through the loop may have a higher total overhead than the overhead of dealing with the overflow. A more sophisticated method that allows tuning how often to perform the check may improve performance.

Figure 8: Summary of overhead for various buffering implementations. Initially, large buffers performed well but small buffers performed poorly. Optimizing overflow significantly improves small-buffer performance.

Figure 8 summarizes the improvements of optimizing overflow handling. Bars 2-4 show the buffering API prior to optimizing overflow. While the large buffer leads to a 2.6x improvement over the analysis call implementation, the small buffer has much poorer performance. After optimizing overflow, the small mov buffer achieves nearly a 4x improvement over the base. The smaller movnti buffer performs better after optimization, but still has a higher overhead than the 400MB buffer.

4.3 Multithreaded Applications


We now consider the performance on multithreaded benchmarks. Figure 9 shows a performance comparison between the 400MB movnti- and mov-based buffers, along with the tuned analysis call implementation. The overflow optimizations were not used here; the optimized version will be discussed later. In contrast with the single-threaded case, the movnti-based buffer performs better on average with these benchmarks. As discussed previously, movnti has a better potential for performance, dependent upon the memory use characteristics of the analyzed program, so this difference simply demonstrates a difference in memory use from the single-threaded benchmark suite.

Figure 10 compares the performance using the movnti buffer at buffer sizes of 400MB, 40MB, and 4MB. As before, the decrease in size leads to a small decrease in performance at 40MB and a larger decrease in performance at 4MB. The difference here is much more pronounced due to the multithreaded nature of the applications being analyzed. Any time an overflow happens, the signal handler prepares to search for the relevant code section and buffer information. This requires taking a lock on the VM, to prevent issues such as another thread deleting internal data structures needed by the signal handler. The threads in OpenMP programs will generally be performing the same actions and thus will overflow at roughly the same time, making overflow handling act as a serialization point.

Figure 11 compares the performance on multithreaded applications after applying the best optimization from the single-threaded case: a high-water mark set at 50%, with checks every 50 fills and also on VM entry. As with the single-threaded case, performance is generally greatly improved, in many cases better than with the large buffer. On art_m, lock contention is still an issue.

Figure 9: Buffering overhead on multithreaded benchmarks. Write-combining effects cause movnti to perform better here.
Figure 11: Comparison of buffering overhead with optimizations. As with the single-threaded case, early overflow optimizations generally greatly improve the performance of small buffers.

Figure 10: Buffering overhead with respect to buffer size on multithreaded applications. Performance degradation of smaller buffers is more pronounced in the multithreaded case due to lock contention.

4.4 Summary

In this section we have presented performance results for our buffering implementation. Our basic implementation had a roughly 2.6x improvement over the previous best-known implementation. However, performance rapidly degraded as the buffer size decreased, due to the overhead associated with processing buffer overflows. After implementing several optimizations, we achieved nearly a 4x speedup over the best-known implementation even with a much smaller buffer. On multithreaded applications, we achieved roughly a 3.4x improvement after optimizations.

5. Related Work

The work presented in this paper shares with many other works the goal of reducing the overhead of instrumentation and analysis, orthogonally to the overhead of the dynamic instrumentation system itself. Wallace and Hazelwood presented SuperPin [13], which parallelized analysis in slices with the unaltered application. Moseley et al. presented shadow profiling [9], which parallelized sampling with an unaltered application to reduce the overhead of profile-guided optimization. Systems such as ADORE [5] and Trident [14] reduce the overhead of profiling for dynamic optimization using performance monitoring hardware.

The work most similar to ours is PiPA, presented by Zhao et al. [15]. PiPA is a technique for reducing the overhead of analysis by collecting data in a lightweight manner and passing it to one or more external threads for processing. They described as their lightweight collection method filling a small buffer with data, then passing that to a thread which could either process the data or further distribute it among other threads. The buffer they described is similar to memtrace as described in Section 4. Shen and Shaw [12] also used a PinTool-based buffer using similar techniques. Our work provides a more thorough exploration of the design space and automates management tasks which must be performed explicitly by the user in their implementations.

6. Conclusions and Future Work

We have presented an implementation of buffering data for decoupled analysis in Pin. Buffering information about a program of interest can be performed in a lightweight manner to limit instrumentation overhead, and analyzing in bulk reduces the overhead of switching from the context of the application to the context of the analysis routines. We described our implementation in detail, including additions to Pin's instrumentation API, code generation, and handling overflow. We then characterized the performance of the system, beginning with our unoptimized implementation, which achieved on average a 2.6x improvement over the previous best-known method. After optimizing overflow handling, we were able to decrease the buffer size while improving our performance to nearly a 4x improvement over the previous best-known method.

Two potential avenues of future work are improving the performance of invalidating and recompiling traces and avoiding lock contention. Performance of invalidation and recompilation may be improved by using a more sophisticated method to check for early overflow in loops, such as increasing the check frequency depending on whether the loop tends to lead to overflow. Avoiding lock contention requires decreasing the size of critical sections or avoiding the use of thread-shared data structures. These and other improvements can make buffering for decoupled analysis an even more useful tool for studying program behavior.

7. References

[1] E. R. Altman, M. Gschwind, S. Sathaye, S. Kosonocky, A. Bright, J. Fritz, P. Ledak, D. Appenzeller, C. Agricola, and Z. Filan. BOA: The architecture of a binary translation processor. IBM Research Report RC 21665, Dec 2000.
[2] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Programming Language Design and Implementation, pages 1-12, Vancouver, BC, 2000.
[3] L. Baraz, T. Devor, O. Etzion, S. Goldenberg, A. Skaletsky, Y. Wang, and Y. Zemach. IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium-based systems. In 36th Symposium on Microarchitecture, San Diego, CA, 2003.
[4] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In 1st Symposium on Code Generation and Optimization, pages 265-275, San Francisco, CA, Mar 2003.
[5] H. Chen, W.-C. Hsu, J. Lu, P.-C. Yew, and D.-Y. Chen. Dynamic trace selection using performance monitoring hardware sampling. In 1st Symposium on Code Generation and Optimization, pages 79-90, San Francisco, CA, 2003.
[6] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson. The Transmeta Code Morphing Software: Using speculation, recovery, and adaptive retranslation to address real-life challenges. In 1st Symposium on Code Generation and Optimization, pages 15-24, San Francisco, CA, Mar 2003.

[7] V. Kiriansky, D. Bruening, and S. Amarasinghe. Secure execution via program shepherding. In 11th USENIX Security Symposium, pages 191-206, San Francisco, CA, Aug 2002.
[8] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Janapa Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Programming Language Design and Implementation, pages 190-200, Chicago, IL, 2005.
[9] T. Moseley, A. Shye, V. J. Reddi, D. Grunwald, and R. Peri. Shadow profiling: Hiding instrumentation costs with parallelism. In 5th Symposium on Code Generation and Optimization, pages 198-208, San Jose, CA, 2007.
[10] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Programming Language Design and Implementation, 2007.
[11] K. Scott, N. Kumar, S. Velusamy, B. Childers, J. Davidson, and M. L. Soffa. Reconfigurable and retargetable software dynamic translation. In Code Generation and Optimization, pages 36-47, San Francisco, CA, Mar 2003.
[12] X. Shen and J. Shaw. Scalable implementation of efficient locality approximation. In 21st Workshop on Languages and Compilers for Parallel Computing, Edmonton, AB, July 2008.
[13] S. Wallace and K. Hazelwood. SuperPin: Parallelizing dynamic instrumentation for real-time performance. In 5th Symposium on Code Generation and Optimization, pages 209-220, San Jose, CA, 2007.
[14] W. Zhang, B. Calder, and D. M. Tullsen. An event-driven multithreaded dynamic optimization framework. In 14th Conference on Parallel Architectures and Compilation Techniques, pages 87-98, St. Louis, MO, 2005.
[15] Q. Zhao, I. Cutcutache, and W.-F. Wong. PiPA: pipelined profiling and analysis on multi-core systems. In Code Generation and Optimization, pages 185-194, Boston, MA, 2008.


ThreadSanitizer - data race detection in practice


Konstantin Serebryany
OOO Google 7 Balchug st. Moscow, 115035, Russia

Timur Iskhodzhanov
MIPT 9 Institutskii per. Dolgoprudny, 141700, Russia

kcc@google.com

timur.iskhodzhanov@phystech.edu

ABSTRACT
Data races are a particularly unpleasant kind of threading bug. They are hard to find and reproduce: you may not observe the bug during the entire testing cycle and will only see it in production as rare, unexplainable failures. This paper presents ThreadSanitizer, a dynamic detector of data races. We describe the hybrid algorithm (based on happens-before and locksets) used in the detector. We introduce what we call dynamic annotations, a sort of race detection API that allows a user to inform the detector about any tricky synchronization in the user program. Various practical aspects of using ThreadSanitizer for testing multithreaded C++ code at Google are also discussed.

The problem of precise race detection is known to be NP-hard (see [20]). However, it is possible to create tools for finding data races with acceptable precision (such tools will miss some races and/or report false warnings). Virtually every C++ application developed at Google is multithreaded. Most of the code is covered with tests, ranging from tiny unit tests to huge integration and regression tests. However, our codebase had never been studied using a data race detector. Our main task was to implement and deploy a continuous process for finding data races.

2. RELATED WORK
There are a number of approaches to data race detection. The three basic types of detection techniques are: static, on-the-fly and postmortem. On-the-fly and postmortem techniques are often referred to as dynamic. Static data race detectors analyze the source code of a program (e.g. [11]). It seems unlikely that static detectors will work effectively in our environment: Google's code is large and complex enough that it would be expensive to add the annotations required by a typical static detector. Dynamic data race detectors analyze the trace of a particular program execution. On-the-fly race detectors process the program's events in parallel with the execution [14, 22]. The postmortem technique consists in writing such events into a temporary file and then analyzing this file after the actual program execution [18]. Most dynamic data race detection tools are based on one of the following algorithms: happens-before, lockset, or both (the hybrid type). A detailed description of these algorithms is given in [21]. Each of these algorithms can be used in on-the-fly and postmortem analysis.

Categories and Subject Descriptors


D.2.5 [Software Engineering]: Testing and Debugging - Testing tools.

General Terms
Algorithms, Testing, Reliability.

Keywords
Concurrency Bugs, Dynamic Data Race Detection, Valgrind.

1. INTRODUCTION

A data race is a situation when two threads concurrently access a shared memory location and at least one of the accesses is a write. Such bugs are often difficult to find because they happen only under very specific circumstances which are hard to reproduce. In other words, a successful pass of all tests doesn't guarantee the absence of data races. Since races can result in data corruption or a segmentation fault, it is important to have tools for finding existing data races and for catching new ones as soon as they appear in the source code.

3. HISTORY OF THE PROJECT


Late in 2007 we tried several publicly available race detectors, but all of them failed to work properly out of the box. The best of these tools was Helgrind 3.3 [8], which had a hybrid algorithm. But even Helgrind had too many false positives and missed many real races. Early in 2008 we modified Helgrind's hybrid algorithm and also introduced an optional pure happens-before mode. The happens-before mode had fewer false positives but missed even more data races than the initial hybrid algorithm. Also, we introduced dynamic annotations (section 5) which helped eliminate false positive reports even in the hybrid mode. Still, Helgrind did not work for us as effectively as we would have liked: it was still too slow, missed too many races in the pure happens-before mode and was too noisy in the hybrid mode1.



So, later in 2008 we implemented our own race detector. We called this tool ThreadSanitizer. ThreadSanitizer uses a new simple hybrid algorithm which can easily be used in a pure happens-before mode. It supports the dynamic annotations we have suggested for Helgrind. Also, we have tried to make the race reports as informative as possible to make the tool easier to use.

    Thread T1            Thread T2            Thread T3
    S1                   S2                   S3
    Signal(H1)
                         Wait(H1)
    S4                   S5
                         Signal(H2)
                         S6                   Wait(H2)
                                              S7

Figure 1: Example of happens-before relation. S1 ≺ S4 (same thread); S1 ≺ S5 (happens-before arc SignalT1(H1) → WaitT2(H1)); S1 ≺ S7 (happens-before is transitive); S4 ⊀ S2 (no relation).

4. ALGORITHM

ThreadSanitizer is implemented as a Valgrind [19] tool2. It observes the program execution as a sequence of events. The most important events are memory access events and synchronization events. Memory access events are Read and Write. Synchronization events are either locking events or happens-before events. Locking events are WrLock, RdLock, WrUnlock and RdUnlock. Happens-before events are Signal and Wait3. These events, generated by the running program, are observed by ThreadSanitizer with the help of the underlying binary translation framework (Valgrind). The detector keeps a state based on the history of the observed events and updates it using a certain state machine. To formally describe the state and the state machine we will need some definitions.

4.1 Definitions

Tid (thread ID): a unique number identifying a thread of the running program.

ID: a unique ID of a memory location4.

EventType: one of Read, Write, WrLock, RdLock, WrUnlock, RdUnlock, Signal, Wait.

Event: a triple {EventType, Tid, ID}. We will write EventTypeTid(ID), or EventType(ID) if the Tid is obvious from the context.

Lock: an ID that appeared in a locking event. A lock L is write-held by a thread T at a given point of time if the number of events WrLockT(L) observed so far is greater than the number of events WrUnlockT(L). A lock L is read-held by a thread T if it is write-held by T or if the number of events RdLockT(L) is greater than the number of events RdUnlockT(L).

Lock Set (LS): a set of locks.

Writer Lock Set (LSWr): the set of all write-held locks of a given thread.

Reader Lock Set (LSRd): the set of all read-held locks of a given thread.

Event Lock Set: LSWr for a Write event and LSRd for a Read event.

Event context: the information that allows the user to understand where the given event has appeared. Usually, the event context is a stack trace.

Segment: a sequence of events of one thread that contains only memory access events (i.e. no synchronization events). The context of a segment is the context of the first event in the segment. Each segment has its writer and reader LockSets (LSWr and LSRd). Each memory access event belongs to exactly one segment.

Happens-before arc: a pair of events X = SignalTX(AX) and Y = WaitTY(AY) such that AX = AY, TX ≠ TY and X is observed first.

Happens-before: a partial order on the set of events. Given two events X = TypeXTX(AX) and Y = TypeYTY(AY), the event X happens-before or precedes the event Y (in short, X ≺ Y; ⪯ and ≻ are defined naturally) if X has been observed before Y and at least one of the following statements is true:
  - TX = TY.
  - {X, Y} is a happens-before arc.
  - ∃ E1, E2 : X ⪯ E1 ≺ E2 ⪯ Y (i.e. ≺ is transitive).

1 The current version of Helgrind (3.5) is different; it is faster but has only a pure happens-before mode.
2 At some point it was a PIN [15] tool, but the Valgrind-based variant has proved to be twice as fast.
3 In the original Lamport's paper [14] these are called Send and Receive.
4 In the current implementation an ID represents one byte of memory, so on a 64-bit system it is a 64-bit pointer.

The happens-before relation can be naturally defined for segments since segments don't contain synchronization events. Figure 1 shows three different threads divided into segments.

Segment Set: a set of N segments {S1, S2, ..., SN} such that ∀ i, j : Si ⊀ Sj.

Concurrent: two memory access events X and Y are concurrent if X ⊀ Y, Y ⊀ X, and the intersection of the lock sets of these events is empty.

Data Race: a data race is a situation when two threads concurrently access a shared memory location (i.e. there are two concurrent memory access events) and at least one of the accesses is a Write.

4.2 Hybrid state machine


The state of ThreadSanitizer consists of global and per-ID states. The global state is the information about the synchronization events that have been observed so far (lock sets, happens-before arcs). Per-ID state (also called shadow memory or metadata) is the information about each memory location of the running program.

ThreadSanitizer's per-ID state consists of two segment sets: the writer segment set SSWr and the reader segment set SSRd. SSWr of a given ID is the set of segments in which writes to this ID appeared. SSRd is the set of all segments in which reads from the given ID appeared, such that ∀ Sr ∈ SSRd, ∀ Sw ∈ SSWr : Sr ⊀ Sw (i.e. all segments in SSRd happen-after or are unrelated to the segments in SSWr).


Each memory access is processed with the following procedure. It adds and removes segments from SSWr and SSRd so that SSWr and SSRd still match their definitions. At the end, this procedure checks if the current state represents a race.

Handle-Read-Or-Write-Event(IsWrite, Tid, ID)
 1  ▷ Handle event ReadTid(ID) or WriteTid(ID)
 2  (SSWr, SSRd) ← Get-Per-ID-State(ID)
 3  Seg ← Get-Current-Segment(Tid)
 4  if IsWrite
 5     then ▷ Write event: update SSWr and SSRd
 6          SSRd ← {s : s ∈ SSRd ∧ s ⊀ Seg}
 7          SSWr ← {s : s ∈ SSWr ∧ s ⊀ Seg} ∪ {Seg}
 8     else ▷ Read event: update SSRd
 9          SSRd ← {s : s ∈ SSRd ∧ s ⊀ Seg} ∪ {Seg}
10  Set-Per-ID-State(ID, SSWr, SSRd)
11  if Is-Race(SSWr, SSRd)
12     then ▷ Report a data race on ID
13          Report-Race(IsWrite, Tid, Seg, ID)

Checking for a race follows the definition of a race (4.1). Note that the intersection of lock sets happens in this procedure, and not earlier (see also 4.5).

Is-Race(SSWr, SSRd)
 1  ▷ Check if we have a race.
 2  NW ← Segment-Set-Size(SSWr)
 3  for i ← 1 to NW
 4     do W1 ← SSWr[i]
 5        LS1 ← Get-Writer-Lock-Set(W1)
 6        ▷ Check all write-write pairs.
 7        for j ← i + 1 to NW
 8           do W2 ← SSWr[j]
 9              LS2 ← Get-Writer-Lock-Set(W2)
10              Assert(W1 ⊀ W2 and W2 ⊀ W1)
11              if LS1 ∩ LS2 = ∅
12                 then return true
13        ▷ Check all write-read pairs.
14        for R ∈ SSRd
15           do LSR ← Get-Reader-Lock-Set(R)
16              if W1 ⊀ R and LS1 ∩ LSR = ∅
17                 then return true
18  return false

Our ultimate goal is the race-reporting routine. It prints the contexts of all memory accesses involved in a race and all locks that were held during each of the accesses. See appendix B for an example of the output. Once a data race is reported on an ID, we ignore subsequent accesses to that ID.

Report-Race(IsWrite, Tid, Seg, ID)
 1  (SSWr, SSRd) ← Get-Per-ID-State(ID)
 2  Print("Possible data race: ")
 3  Print(IsWrite ? "Write" : "Read")
 4  Print(" at address ", ID)
 5  Print-Current-Context(Tid)
 6  Print-Current-Lock-Sets(Tid)
 7  for S ∈ SSWr \ Seg
 8     do Print("Concurrent writes: ")
 9        Print-Segment-Context(S)
10        Print-Segment-Lock-Sets(S)
11  if not IsWrite
12     then return
13  for S ∈ SSRd \ Seg
14     do if S ⊀ Seg
15           then Print("Concurrent reads: ")
16                Print-Segment-Context(S)
17                Print-Segment-Lock-Sets(S)

4.3 Segments and context


As defined in (4.1), a segment is a sequence of memory access events and the context of the segment is the context of its first event. Recording the segment contexts is critical because without them race reports will be less informative. ThreadSanitizer has three different modes5 with regard to the creation of segments:

1 (default): Segments are created each time the program enters a new super-block (single-entry, multiple-exit region) of code. So, the contexts of all events in a segment belong to a small range of code, always within the same function. In practice, this means that the stack trace of the previous access is nearly precise: the line number of the topmost stack frame may be wrong, but all other line numbers and all function names in the stack traces are exact.

0 (fast): Segments are created only after synchronization events. This means that events inside a segment may have very different contexts and the context of the segment may be much different from the contexts of other events. When reporting a race in this mode, the contexts of the previous accesses are not printed. This mode is useful only for regression testing. For performance data see 7.2.

2 (precise, slow): Segments are created on each memory access (i.e. each segment contains just one event). This mode gives precise stack traces for all previous accesses, but is very slow. In practice, this level of precision is almost never required.

4.4 Variations of the state machine


The state machine described above is quite simple but flexible. With small modifications it can be used as a pure happens-before detector, or else it can be enhanced with a special state similar to the initialization state described in [22]. ThreadSanitizer can use either of these modifications (adjustable by a command line flag).

4.4.1 Pure happens-before state machine


Like any hybrid state machine, the state machine described above has false positives (see 6.4). It is possible to avoid most (but not all) false positives by using the pure happens-before mode.

Extended happens-before arc: a pair of events (X, Y) such that X is observed before Y and one of the following is true:
  - X = WrUnlockT1(L), Y = WrLockT2(L).
  - X = WrUnlockT1(L), Y = RdLockT2(L).
  - X = RdUnlockT1(L), Y = WrLockT2(L).
  - (X, Y) is a happens-before arc.

5 Controlled by the --keep-history=[012] command line flag; Memcheck and Helgrind also have similar modes controlled by the flags --track-origins=yes|no and --history-level=none|approx|full respectively.


    Thread T1            Thread T2

    WrLock(L)
    WrUnlock(L)
                         RdLock(L)
                         RdUnlock(L)

Figure 2: Extended happens-before arc.

If we use the extended happens-before arc in the definition of the happens-before relation, we get the pure happens-before state machine, similar to the one described in [9]6. The following example explains the difference between the pure happens-before and hybrid modes.

    Thread1:
        obj->UpdateMe();
        mu.Lock();
        flag = true;
        mu.Unlock();

    Thread2:
        mu.Lock();
        bool f = flag;
        mu.Unlock();
        if (f)
            obj->UpdateMe();

The first thread accesses an object without any lock and then sets the flag under a lock. The second thread checks the flag under a lock and then, if the flag is true, accesses the object again. The correctness of this code depends on the initial value of the flag. If it is false, the two accesses to the object are synchronized correctly; otherwise we have a race. ThreadSanitizer cannot distinguish between these two cases. In the hybrid mode, the tool will always report a data race on such code. In the pure happens-before mode, ThreadSanitizer will behave differently: if the race is real, it may or may not be reported (this depends on timing, which is why the pure happens-before mode is less predictable); if there is no race, the tool will be silent.

4.4.2 Fast-mode state machine

In most real programs, the majority of memory locations are never shared between threads. It is natural to optimize the race detector for this case. Such an optimization is implemented in ThreadSanitizer and is called fast mode7. Memory IDs in ThreadSanitizer are grouped into cache lines. Each cache line contains 64 IDs and the Tid of the thread which made the first access to this cache line. In fast mode, we ignore all accesses to a cache line until we see an access from another thread. This indeed makes the detection faster: according to our measurements it may increase performance by up to 2x (see 7.2).

This optimization affects accuracy. Eraser [22] has the initialization state that reduces the number of false positives produced by the lock-set algorithm. Similarly, ThreadSanitizer's fast mode reduces the number of false positives in the hybrid state machine. Both of these techniques may also hide real races. The fast mode may be applied to the pure happens-before state machine, but we don't do this because the resulting detector would miss too many real races.

6 Controlled by the --pure-happens-before command line flag.
7 Controlled by the --fast-mode command line flag.

4.5 Comparison with other state machines

ThreadSanitizer and Eraser [22, 13] use locksets differently. In Eraser, the per-ID state stores the intersection of locksets. In ThreadSanitizer, the per-ID state contains the original locksets (locksets are stored in segments, which are stored in segment sets and, hence, in the per-ID state) and the lockset intersection is computed each time we check for a race. This way we are able to report all locks involved in a race. Surprisingly enough, this extra computation adds only a negligible overhead. This difference also allows our hybrid state machine to avoid a false report on the following code. The accesses in three different threads do not have any common lock, yet they are correctly synchronized8.

    Thread1:
        mu1.Lock();
        mu2.Lock();
        obj->Change();
        mu2.Unlock();
        mu1.Unlock();

    Thread2:
        mu2.Lock();
        mu3.Lock();
        obj->Change();
        mu3.Unlock();
        mu2.Unlock();

    Thread3:
        mu1.Lock();
        mu3.Lock();
        obj->Change();
        mu3.Unlock();
        mu1.Unlock();

ThreadSanitizer's pure happens-before mode finds the same races as the classical Lamport's detector [14] (we did not try to prove it formally though). On our set of unit tests [7], it behaves the same way as other pure happens-before detectors (see appendix A)9. The noticeable advantage of ThreadSanitizer in the pure happens-before mode is that it also reports all locks involved in a race; the classical pure happens-before detector knows nothing about locks and can't include them in the report.

8 These cases are rare. During our experiments with Helgrind 3.3, which reported false positives on such code, we saw this situation only twice.
9 It would be interesting to compare the accuracy of the detectors on real programs, but in our case it appeared to be too difficult. Other detectors either did not work with our OS and compiler or did not support our custom synchronization utilities and specific synchronization idioms (e.g. synchronization via I/O). Thus we have limited the comparison to the unit tests.

5. DYNAMIC ANNOTATIONS

Any dynamic race detector must understand the synchronization mechanisms used by the tested program, otherwise the detector will not work. For programs that use only POSIX mutexes, it is quite possible to hard-code the knowledge about the POSIX API into the detector (most popular detectors do this). However, if the tested program uses other means of synchronization, we have to explain them to the detector. For this purpose we have created a set of dynamic annotations, a kind of race detection API. Each dynamic annotation is a C macro definition. The macro definitions are expanded into code which is later intercepted and interpreted by the tool10. You can find our implementation of the dynamic annotations at [7].

10 Currently, the dynamic annotations are expanded into function calls, but this is subject to change.

The most important annotations are:

ANNOTATE_HAPPENS_BEFORE(ptr), ANNOTATE_HAPPENS_AFTER(ptr)
These annotations create, respectively, Signal(ptr) and

Wait(ptr) events for the current thread; they are used to annotate cases where a hybrid algorithm may produce false reports, as well as to annotate lock-free synchronization. Examples are provided in section 6.4.

Other annotations include:

ANNOTATE_PURE_HAPPENS_BEFORE_MUTEX(lock)
Tells the detector to treat lock as in pure happens-before mode (even if all other locks are handled as in hybrid mode). Using this annotation with the hybrid mode we can selectively apply the pure happens-before mode to some locks. In the pure happens-before mode this annotation is a no-op.

ANNOTATE_CONDVAR_LOCK_WAIT(cv, mu)
Creates a Wait(cv) event that matches the cv.Signal() event (cv is a conditional variable, see 6.4.1).

ANNOTATE_BENIGN_RACE(ptr)
Tells that races on the address ptr are benign.

ANNOTATE_IGNORE_WRITES_BEGIN, ANNOTATE_IGNORE_WRITES_END
Tell the tool to ignore all writes between these two annotations. Similar annotations for reads also exist.

ANNOTATE_RWLOCK_CREATE(lock), ANNOTATE_RWLOCK_DESTROY(lock), ANNOTATE_RWLOCK_ACQUIRED(lock, isRW), ANNOTATE_RWLOCK_RELEASED(lock, isRW)
Are used to annotate a custom implementation of a lock primitive.

ANNOTATE_PUBLISH_MEMORY_RANGE(ptr, size)
Reports that the bytes in the range [ptr, ptr + size) are about to be published safely. The race detector will create a happens-before arc from this call to subsequent accesses to this memory. Usually required only for hybrid detectors.

ANNOTATE_UNPUBLISH_MEMORY_RANGE(ptr, size)
Opposite to ANNOTATE_PUBLISH_MEMORY_RANGE. Reports that the bytes in the range [ptr, ptr + size) are not shared between threads any more and can be safely used by the current thread without synchronization. The race detector will create a happens-before arc from all previous accesses to this memory to this call. Usually required only for hybrid detectors.

ANNOTATE_NEW_MEMORY(ptr, size)
Tells that new memory has been allocated by a custom allocator.

ANNOTATE_THREAD_NAME(name)
Tells the name of the current thread to the detector.

ANNOTATE_EXPECT_RACE(ptr)
Is used to write unit tests for the race detector.

With the dynamic annotations we can eliminate all false reports of the hybrid detector and hide benign races. As a result, ThreadSanitizer will find more races (as compared to a pure happens-before detector) but will not report false races.
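As a hedged sketch of how the two most important annotations might be applied to hand-rolled message passing (this is our own illustration, not an example from the paper; only the ANNOTATE_* macros come from the annotation set above, and the header path and surrounding code are hypothetical):

    #include "dynamic_annotations.h"   // assumed location of the macro definitions

    struct Item { int payload; };

    void PublishItem(Item* item /*, some lock-free queue */) {
        // All writes to *item made so far must become visible to the consumer.
        ANNOTATE_HAPPENS_BEFORE(item);   // recorded as Signal(item)
        // ... push item into the lock-free queue ...
    }

    void ConsumeItem(Item* item) {
        // Pairs with the annotation in PublishItem, creating a happens-before
        // arc, so reads of item->payload are not reported as racing.
        ANNOTATE_HAPPENS_AFTER(item);    // recorded as Wait(item)
        // ... use item->payload ...
    }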

6. RACE DETECTION IN PRACTICE

6.1 Performance


Performance is critical for the successful use of race detectors, especially in large organizations like Google. First, if a detector is too slow, it will be inconvenient to use it for manual testing or debugging. Second, a slower detector will require more machine resources for regular testing, and machine resources cost money. Third, most of the C++ applications at Google are time-sensitive and will simply fail due to protocol timeouts if slowed down too much.

When we first tried Helgrind 3.3 on a large set of unit tests, almost 90% of them failed due to slowdown. With our improved variant of Helgrind and, later, with ThreadSanitizer we were able to achieve a more than 95% pass rate. In order to make the remaining tests pass, we had to change various timeout values in the tests11. On an average Google unit test or application the slowdown is 20-50 times, but in extreme cases the slowdown could be as high as 10000 times (for example, on an artificial stress test for a race detector) or as low as 2 times (when the test mostly sleeps or waits for I/O). See also section 7.2.

ThreadSanitizer spends almost all of its time intercepting and analyzing memory accesses. If a given memory location has been accessed by just one thread, the analysis is fast (especially in the fast mode, see section 4.4.2). If a memory location has been accessed by many threads and there have been a lot of synchronization events, the analysis is slow. So, there are two major ways to speed up the tool: make the analysis of one memory access faster, and analyze fewer memory accesses.

In order to make the analysis of one access faster, we used various well known techniques and algorithms such as vector time-stamps ([9]) and caching. We also limited the size of a segment set to a small constant (currently 4) to avoid huge slowdowns in corner cases. But whatever we do to speed up the analysis, the overhead will always remain significant: remember that we replace a memory access instruction with a call to a quite sophisticated function that usually runs for a few hundred CPU cycles.

A much more attractive approach is to reduce the number of analyzed memory accesses. For example, ThreadSanitizer does not instrument the internals of the threading library (there is no sense in analyzing races on the internal representation of a mutex). The tool also supports a mechanism to ignore parts of the program marked as safe by the user12. In some cases this allows us to speed up the run by 2-3 times by ignoring a single hot spot. Adaptive instrumentation [16] seems promising. We also plan to use static analysis performed by a compiler to skip instrumentation where we can prove thread-safety statically.

Another way to reduce the number of analyzed memory accesses is to run the tool on an optimized binary. Unfortunately, the current implementation does not work well with fully optimized code (e.g. gcc -O2)13, but the following gcc flags give a 50%-100% speedup compared to a non-optimized compilation while maintaining the same level of usability: gcc -O1 -g -fno-inline -fno-omit-frame-pointer -fno-builtin14.
11 This had a nice side effect. Many of the tests that were failing regularly under ThreadSanitizer were known to be flaky (they were sometimes failing when running natively), and ThreadSanitizer helped to find the reason for that flakiness just by making the tests slower.


14 This applies to gcc 4.4 on x86_64.

benign (the code counts some statistic that is allowed to be imprecise). But sometimes such races are extremely harmful (e.g. see 7.1).
Thread1 Thread2

6.2 Memory consumption


The memory consumption of ThreadSanitizer consists mostly of the following overhead: A constant size buer that stores segments, including stack traces. By default, there are 223 segments and each occupies 100 bytes (50 bytes in 32-bit mode). So, the buer is 800M. Decreasing this size may lead to loosing some data races. If we are not tracking the contexts of previous accesses (see 4.3), the segments occupy much less memory (250M). Vector time clocks attached to each segment. This memory is limited by the number of threads times the number of segments, but in most cases it is quite small. Per-ID state. In the fast mode, the memory required for per-ID state linearly depends on the amount of memory shared between more than one thread. In the full hybrid and in the pure happens-before modes, the footprint is a linear function of all memory in the program. However, these are the worst case assumptions and in practice a simple compression technique reduces the memory usage signicantly. Segment sets and locksets may potentially occupy arbitrary large amount of memory, but in reality they constitute only a small fraction of the overhead. All these objects are automatically recycled when applicable. On an average Google unit test the memory overhead is within 3x-4x (compared to a native run). Obviously, a test will fail under ThreadSanitizer if there is not enough RAM in the machine. Almost all unit tests we have tried require less than 4G when running under ThreadSanitizer. Real applications may require 8G and more. See also 7.2 for the actual numbers.

6.2.1 Flushing state

Even though the memory overhead of ThreadSanitizer is sane on average, there are cases when the tool would consume all the memory it could get. In order to stay robust, ThreadSanitizer flushes all its internal state when the memory overhead is above a certain limit (supplied by the user or derived from ulimit) or when the tool has used all available segments and none of them can be recycled. Obviously, if a flush happens between two memory accesses which race with each other, such a race will be missed, but the probability of such a situation is low.

6.3 Common real races

In this section we show examples of the most frequent races found in our C++ code. A detailed analysis of some of these races is given at [7].

6.3.1 Simple race

The simplest possible data race is the most frequent one: two threads access a variable of a built-in type without any synchronization. Quite frequently such races are benign (the code counts some statistic that is allowed to be imprecise), but sometimes such races are extremely harmful (e.g. see 7.1).

  Thread1                      Thread2
  int v;
  ...
  v++;                         v++;

6.3.2 Race on a complex type

Another popular race happens when two threads access a non-thread-safe complex object (e.g. an STL container) without synchronization. These are almost always dangerous.

  Thread1                      Thread2
  std::map<int, int> m;
  ...
  m[123] = 1;                  m[345] = 0;

6.3.3 Notification

A data race occurs when a boolean or an integer variable is used to send notifications between threads. This may work correctly with some combinations of compiler and hardware, but for portability we do not recommend that programmers rely on the implicit semantics of the target architecture.

  Thread1                      Thread2
  bool done = false;
  ...
  while (!done)                done = true;
    sleep(1);

6.3.4 Publishing objects without synchronization

One thread initializes an object pointer (which was initially null) with a new value; another thread spins until the object pointer becomes non-null. Without proper synchronization, the compiler may perform surprising transformations (code motion) on such code, which will lead to (occasional) failures. In addition to that, on some architectures this race may cause failures due to cache-related effects.

  Thread1                      Thread2
  MyObj *obj = NULL;
  ...
  obj = new MyObj();           while (obj == NULL)
                                 yield();
                               obj->DoSomething();

6.3.5 Initializing objects without synchronization

  static MyObj *obj = NULL;
  void InitObj() {
    if (!obj)
      obj = new MyObj();
  }

  Thread1                      Thread2
  InitObj();                   InitObj();

This may lead e.g. to memory leaks (the object may be constructed twice).

6.3.6 Write during a ReaderLock

Updates happening under a reader lock.

  Thread1                      Thread2
  mu.ReaderLock();             mu.ReaderLock();
  var++;                       var++;
  mu.ReaderUnlock();           mu.ReaderUnlock();

6.3.7 Adjacent bit fields

The code below looks correct at first glance. But if x is struct { int a:4, b:4; }, we have a bug.

  Thread1                      Thread2
  x.a++;                       x.b++;

6.3.8 Double-checked locking

The so-called double-checked locking is well known to be an anti-pattern ([17]), but we still find it occasionally (mostly in old code).

  bool inited = false;
  void Init() {
    // May be called by multiple threads.
    if (!inited) {
      mu.Lock();
      if (!inited) {
        // .. initialize something
      }
      inited = true;
      mu.Unlock();
    }
  }
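For comparison, here is a minimal sketch of the simplest correct alternative to the pattern above (our own illustration, reusing the hypothetical mu and inited from the example; it trades a little locking overhead for correctness):

  Mutex mu;             // assumed: the same Mutex type used in the examples above
  bool inited = false;
  void Init() {
    // May be called by multiple threads: the flag is only ever
    // examined and updated while the mutex is held.
    mu.Lock();
    if (!inited) {
      // .. initialize something
      inited = true;
    }
    mu.Unlock();
  }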

6.3.9 Race during destruction

Sometimes objects are created on the stack, passed to another thread and then destroyed without waiting for the second thread to finish its work.

  void Thread1() {
    SomeType object;
    ExecuteCallbackInThread2(SomeCallback, &object);
    ...
    // "object" is destroyed when
    // leaving its scope.
  }

6.3.10 Race on vptr

Class A has a function Done(), a virtual function F() and a virtual destructor. The destructor waits for the event generated by Done(). There is also a class B, which inherits A and overrides A::F().

  class A {
   public:
    A() { sem_init(&sem_, 0, 0); }
    virtual void F() { printf("A::F\n"); }
    void Done() { sem_post(&sem_); }
    virtual ~A() {
      sem_wait(&sem_);
      sem_destroy(&sem_);
    }
   private:
    sem_t sem_;
  };

  class B : public A {
   public:
    virtual void F() { printf("B::F\n"); }
    virtual ~B() { }
  };

  static A *obj = new B;

An object obj of static type A and dynamic type B is created. One thread executes obj->F() and then signals to the second thread. The second thread calls delete obj (i.e. B::~B), which then calls A::~A, which, in turn, waits for the signal from the first thread. The destructor A::~A overwrites the vptr (pointer to the virtual function table) with A::vptr. So, if the first thread executes obj->F() after the second thread has started executing A::~A, then A::F will be called instead of B::F.

  Thread1                      Thread2
  obj->F();                    delete obj;
  obj->Done();

6.4 Common false positives

Here we show the three most common types of false positives, i.e. situations where the code is correctly synchronized but ThreadSanitizer will report a race. The annotations given in the code examples explain the synchronization to the tool; with these annotations no reports will appear.

6.4.1 Condition variable

  Thread1                      Thread2
  obj->UpdateMe();             mu.Lock();
  mu.Lock();                   while (!c)
  c = true;                      cv.Wait(&mu);
  cv.Signal();                 ANNOTATE_CONDVAR_LOCK_WAIT(&cv, &mu);
  mu.Unlock();                 mu.Unlock();
                               obj->UpdateMe();

This is a typical usage of a condition variable [12]: the two accesses to obj are serialized. Unfortunately, it may be misunderstood by the hybrid detector. For example, Thread1 may set the condition to true and leave the critical section before Thread2 enters the critical section for the first time and blocks on the condition variable. In that case the condition of the while(!c) loop will never be true and the cv.Wait() method won't be called. As a result, the happens-before dependency will be missed.

6.4.2 Message queue

Some message queues may also be unfriendly to the hybrid detector.

  class Queue {
   public:
    void Put(int *ptr) {
      mu_.Lock();
      queue_.push_back(ptr);
      ANNOTATE_HAPPENS_BEFORE(ptr);
      mu_.Unlock();
    }
    int *Get() {
      int *res = NULL;
      mu_.Lock();
      if (!queue_.empty()) {
        res = queue_.front();
        ANNOTATE_HAPPENS_AFTER(res);
        queue_.pop_front();
      }
      mu_.Unlock();
      return res;
    }
   private:
    std::deque<int *> queue_;
    Mutex mu_;
  };

The queue implementation above does not use any happens-before synchronization mechanism, but it does actually create a happens-before dependency between Put() and Get().

  Thread1                      Thread2
  *ptr = ...;                  ptr = queue.Get();
  queue.Put(ptr);              if (ptr)
                                 *ptr = ...;

A message queue may be implemented via atomic operations (i.e. without any Mutex). In this case even a pure happens-before detector may report false positives.

6.4.3 Reference counting

Another frequent cause of false positives is reference counting. As with message queues, mutex-based reference counting will result in false positives in the hybrid mode, while reference counting implemented via atomics will confuse even the pure happens-before mode. And again, the annotations allow the tool to understand the synchronization.

  class SomeReferenceCountedClass {
   public:
    void Unref() {
      ANNOTATE_HAPPENS_BEFORE(&ref_);
      if (AtomicIncrement(&ref_, -1) == 0) {
        ANNOTATE_HAPPENS_AFTER(&ref_);
        delete this;
      }
    }
    ...
   private:
    int ref_;
  };

6.5 General advice

Applying a data race detector to an arbitrary C++ program may be arbitrarily hard. However, if developers follow several simple rules, race detectors can be used at full power. Here we summarize the recommendations we give to C++ developers at Google.
- Variables shared between threads are best protected by a mutex. Always use a mutex unless you know for sure that it causes a significant performance loss.
- When possible, try to reuse existing standard synchronization primitives (e.g. message queues, reference counting utilities, etc.) instead of re-inventing the wheel. If you really need your own synchronization mechanism, annotate it with dynamic annotations (section 5).
- Avoid using condition variables directly, as they are not friendly to hybrid detectors. Instead, wrap the condition loop while(!c) cv.Wait(&mu) into a separate function and annotate it (6.4.1); a sketch of such a wrapper is given below. In Google's internal C++ library such a function is part of the Mutex API.
- Try not to use atomic operations directly. Instead, wrap the atomic operations into functions or classes that implement certain synchronization patterns.
- Remember that dynamic data race detection (as well as most other kinds of dynamic analysis) is slow. Do not hardcode any timeout values into your program. Instead, allow the timeout values to be changed via command-line flags, environment variables or configuration files. Never use sleep() as synchronization between threads, even in unit tests.
- Don't over-synchronize. Excessive synchronization may be just as incorrect as no synchronization at all, but it may hide real races from data race detectors.
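The following is a minimal sketch of the kind of wrapper meant in the condition-variable advice above (our own illustration of the pattern from 6.4.1, not the actual Mutex API from Google's library; Mutex, CondVar and ANNOTATE_CONDVAR_LOCK_WAIT are the same hypothetical primitives used in the examples of this paper):

  // Waits until *flag becomes true under *mu, then tells the race
  // detector about the happens-before edge even when cv->Wait()
  // was never actually called (see 6.4.1).
  void AwaitFlag(Mutex *mu, CondVar *cv, bool *flag) {
    mu->Lock();
    while (!*flag)
      cv->Wait(mu);
    ANNOTATE_CONDVAR_LOCK_WAIT(cv, mu);
    mu->Unlock();
  }

Call sites then use AwaitFlag(&mu, &cv, &c); instead of open-coding the loop, so the annotation lives in exactly one place.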

6.5.1 Choosing the mode

Which of the three modes of ThreadSanitizer should one choose? If you are testing an existing software project, we suggest starting with the pure happens-before mode (4.4.1). Unless you have lock-free synchronization (which you will have to annotate), every reported race will be real. Once you have fixed all reports from the pure happens-before mode (or if you are starting a new project), switch to the fast mode (4.4.2). You may see a few false reports (6.4), which can be easily eliminated. If your aim is to find the maximal number of bugs and you are willing to spend some more time on annotations, use the full hybrid mode (4.2). For regression testing, prefer the hybrid mode (either full or fast) because it is more predictable. It is often the case that a race is detected only in one of 10-100 runs by the pure happens-before mode, while the hybrid mode finds it in each run.

7. RACE DETECTION FOR CHROMIUM

One of the applications we test with ThreadSanitizer is Chromium [1], an open-source browser project. The code of the Chromium browser is covered by a large number of tests, including unit tests, integration tests and interactive tests running the real application. All these tests are continuously run on a large number of test machines with different operating systems. Some of these machines run tests under Memcheck (the Valgrind tool which finds memory-related errors, see [8]) and ThreadSanitizer. When a new error (either a test failure or a race report from ThreadSanitizer) is found after a commit to the repository, the committer of the change is notified. These reports are available to other developers and maintainers as well. We have found and fixed a few dozen data races in Chromium itself, and in some third-party components used by this project. You may find all these bugs by searching for label:ThreadSanitizer at www.crbug.com.

7.1 Top crasher

One of the first data races we found in Chromium happened to be the cause of a serious bug, which had been observed for several months but had not been understood nor fixed (see the bug entries http://crbug.com/18488 and http://crbug.com/15577, describing the race and the crashes, respectively). The data race happened on a class called RefCounted. The reference counter was incremented and decremented from multiple threads without synchronization. When the race actually occurred (which happened very rarely), the value of the counter became incorrect. This resulted either in a memory leak or in two calls of delete on the same memory. In the latter case, the internals of the memory allocator were corrupted and one of the subsequent calls to malloc failed with a segmentation fault. The cause of these failures was not understood for a long time because the failure never happened during debugging, and the failure stack traces were in a different place. ThreadSanitizer found this data race in a single run. The fix for this data race was simple: instead of the RefCounted class we needed to use RefCountedThreadSafe, the class which implements reference counting using atomic instructions.
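To make the failure mode concrete, here is a small self-contained sketch (our own illustration, not Chromium code) of what goes wrong with an unsynchronized counter: run it a few times and the final value will usually differ from zero, which for a real reference counter translates into leaks or double deletes.

  #include <pthread.h>
  #include <stdio.h>

  int ref_ = 0;  // stands in for the unsynchronized counter in a RefCounted-like class

  void *Worker(void *unused) {
    (void)unused;
    for (int i = 0; i < 1000000; i++) {
      ref_++;  // racy increment (acquire a reference)
      ref_--;  // racy decrement (release a reference)
    }
    return NULL;
  }

  int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, Worker, NULL);
    pthread_create(&t2, NULL, Worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // A correct counter would end at 0; lost updates make it drift,
    // which is exactly why an atomic implementation such as
    // RefCountedThreadSafe is needed.
    printf("final counter value: %d\n", ref_);
    return 0;
  }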

Table 1: Time and space overhead compared to Helgrind and Memcheck on Chromium tests (slowdown and memory overhead relative to the native run, per test suite). The performance of ThreadSanitizer is close to Memcheck. On large tests (e.g. unit), ThreadSanitizer can be twice as fast as Helgrind. The memory consumption is also comparable to Memcheck and Helgrind.

                       app            base           ipc            net            unit
                       time   mem     time   mem     time   mem     time   mem     time   mem
  native               3s     172M    77s    1811M   5s     325M    50s    808M    43s    914M
  Memcheck-no-hist     6.7x   2.0x    1.7x   1.1x    5.2x   1.1x    3.0x   1.6x    14.8x  1.7x
  Memcheck             10.5x  2.6x    2.2x   1.1x    8.2x   1.2x    5.1x   2.3x    29.7x  1.9x
  Helgrind-no-hist     13.9x  2.7x    1.8x   1.8x    5.4x   1.5x    4.5x   2.2x    48.7x  3.4x
  Helgrind             14.9x  3.8x    1.7x   1.9x    6.7x   1.7x    11.9x  2.5x    62.3x  3.8x
  TS-fast-no-hist      6.2x   4.2x    2.2x   1.2x    11.1x  1.8x    3.9x   1.7x    19.2x  2.2x
  TS-fast              7.9x   7.6x    2.4x   1.5x    12.0x  3.6x    4.7x   2.4x    21.6x  2.8x
  TS-full-no-hist      8.4x   4.2x    2.4x   1.2x    11.3x  1.8x    4.7x   1.6x    22.3x  2.3x
  TS-full              13.8x  7.4x    2.8x   1.5x    11.9x  3.6x    6.3x   2.3x    28.6x  2.5x
  TS-phb-no-hist       8.3x   4.2x    2.8x   1.2x    11.2x  1.8x    4.7x   1.8x    23.0x  6.2x
  TS-phb               14.2x  7.4x    2.6x   1.5x    11.8x  3.6x    6.2x   2.3x    28.6x  2.5x

7.2 Performance evaluation on Chromium

We used Chromium unit tests for the performance evaluation of ThreadSanitizer. We compared our tool with Helgrind and Memcheck 3.5.0 [8]. Even though Memcheck is not a race detector, it performs similar instrumentation; this tool is well known for its high quality and practical usefulness. Table 1 gives a summary of the results. ThreadSanitizer was run in three modes: --pure-happens-before=yes (phb), --fast-mode=yes (fast) and --fast-mode=no (full). Similarly to ThreadSanitizer, Helgrind and Memcheck have modes where the history of previous accesses is not tracked (4.3); in the table, such modes are marked "no-hist". The tests were built using the gcc -O1 -g -fno-inline -fno-omit-frame-pointer -fno-builtin flags for the x86_64 platform and run on an Intel Core 2 Duo Q6600 with 8GB of RAM. As may be seen from Table 1, the performance of ThreadSanitizer is close to Memcheck. The average slowdown compared to the native run is less than 30x. On large tests like unit, ThreadSanitizer can be twice as fast as Helgrind. The memory consumption is also comparable to Memcheck and Helgrind. ThreadSanitizer allocates a large constant-size buffer of segments (see 6.2), hence on small tests it consumes more memory than the other tools. ThreadSanitizer flushes its state (see 6.2.1) 90 times on unit, 34 times on net and 4 times on base test sets when running in the full or pure happens-before modes with history tracking enabled. In the fast mode and with disabled history tracking ThreadSanitizer never flushes its state on these tests.

8. CONCLUSIONS

In this paper we have presented ThreadSanitizer, a dynamic detector of data races. ThreadSanitizer uses a new algorithm; it has several modes of operation, ranging from the most conservative mode (which has few false positives but also misses real races) to a very aggressive one (which has more false positives but detects the largest number of real races). To the best of our knowledge ThreadSanitizer has the most detailed output and it is the only dynamic race detector with both hybrid and pure happens-before modes. We have introduced the dynamic annotations, a sort of API for a race detector. Using the dynamic annotations together with the most aggressive mode of ThreadSanitizer enables us to find the largest number of real races while keeping the noise level at zero (no false positives or benign races are reported). ThreadSanitizer is heavily used at Google for testing various C++ applications, including Chromium. In this paper we discussed a number of practical issues which we have faced while deploying ThreadSanitizer. We believe that ThreadSanitizer has noticeable advantages over other dynamic race detectors in terms of practical use. The current implementation of ThreadSanitizer is built on top of the Valgrind binary translation framework and it can be used to test C/C++ programs on Linux and Mac. The source code of ThreadSanitizer is published under the GPL license and can be downloaded at [7].

9. ACKNOWLEDGMENTS

We would like to thank Mike Burrows, the co-author of Eraser [22], for his great support of our project at Google and for many algorithmic suggestions, and Julian Seward, the author of Valgrind and Helgrind [8, 19], for his amazing tools and fruitful discussions.

10. REFERENCES

[1] Chromium project. http://dev.chromium.org.
[2] Intel Parallel Studio. http://software.intel.com/en-us/intel-parallel-studio-home.
[3] Intel Thread Checker. http://software.intel.com/en-us/intel-thread-checker.
[4] Multi-Thread Run-time Analysis Tool for Java. http://www.alphaworks.ibm.com/tech/mtrat.
[5] Pin - a dynamic binary instrumentation tool. http://www.pintool.org.
[6] Sun Studio. http://developers.sun.com/sunstudio.
[7] ThreadSanitizer project: documentation, source code, dynamic annotations, unit tests. http://code.google.com/p/data-race-test.
[8] Valgrind project. Home of Memcheck, Helgrind and DRD. http://www.valgrind.org.
[9] U. Banerjee, B. Bliss, Z. Ma, and P. Petersen. A theory of data race detection. In PADTAD '06: Proceedings of the 2006 workshop on Parallel and distributed systems: testing and debugging, pages 69-78, New York, NY, USA, 2006. ACM.
[10] U. Banerjee, B. Bliss, Z. Ma, and P. Petersen. Unraveling data race detection in the Intel Thread Checker. In First Workshop on Software Tools for Multi-core Systems (STMCS), in conjunction with IEEE/ACM International Symposium on Code Generation and Optimization (CGO), March, volume 26, 2006.
[11] D. Engler and K. Ashcraft. RacerX: effective, static detection of race conditions and deadlocks. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 237-252, New York, NY, USA, 2003. ACM.
[12] F. Garcia and J. Fernandez. POSIX thread libraries. Linux J., 70es (Feb. 2000):36, 2000.
[13] J. J. Harrow. Runtime checking of multithreaded applications with Visual Threads. In Proceedings of the 7th International SPIN Workshop on SPIN Model Checking and Software Verification, pages 331-342, London, UK, 2000. Springer-Verlag.
[14] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558-565, 1978.
[15] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, pages 190-200, New York, NY, USA, 2005. ACM.
[16] D. Marino, M. Musuvathi, and S. Narayanasamy. LiteRace: effective sampling for lightweight data-race detection. In PLDI '09: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, pages 134-143, New York, NY, USA, 2009. ACM.
[17] S. Meyers and A. Alexandrescu. C++ and the perils of double-checked locking: Part I. Doctor Dobb's Journal, 29:46-49, 2004.
[18] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data races using replay analysis. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 22-31, New York, NY, USA, 2007. ACM.
[19] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 89-100, New York, NY, USA, 2007. ACM.
[20] R. H. B. Netzer and B. P. Miller. What are race conditions?: Some issues and formalizations. ACM Letters on Programming Languages and Systems (LOPLAS), 1(1):74-88, 1992.
[21] R. O'Callahan and J.-D. Choi. Hybrid dynamic data race detection. In PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 167-178, New York, NY, USA, 2003. ACM.
[22] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: a dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst., 15(4):391-411, 1997.

APPENDIX A. OTHER RACE DETECTORS

Here we briefly describe some of the race detectors available for download.
- Helgrind is a tool based on Valgrind [19, 8]. Helgrind 3.5 is a pure happens-before detector; it supports a subset of the dynamic annotations described in section 5. Part of ThreadSanitizer's instrumentation code is derived from Helgrind. DRD is one more Valgrind-based race detector with similar properties.
- Intel Thread Checker [3, 10, 9] is a pure happens-before race detector. It supports an analog of the dynamic annotations (a subset). It works on Linux and Windows. Thread Checker's latest reincarnation is called Intel Parallel Inspector [2] and is based on Pin [15, 5]. As of November 2009, the Parallel Inspector is available only for Windows.
- Sun Thread Analyzer, a part of Sun Studio [6], is a hybrid race detector. It supports an analog of the dynamic annotations (a small subset). It works only together with the Sun Studio compiler (so we did not try it for our real tasks).
- IBM MTRAT [4] is a race detector for Java. It uses some variant of a hybrid state machine and does not support any annotations. As of the version from March 2009, the noise level seems to be rather high.

B. EXAMPLE OF OUTPUT

Here we give a simple test case where a wrong mutex is used in one place. For more examples refer to [7].

  Mutex mu1;  // This Mutex guards var.
  Mutex mu2;  // This Mutex is not related to var.
  int var;

  // Runs in thread named test-thread-1
  void Thread1() {
    mu1.Lock();  // Correct Mutex.
    var = 1;
    mu1.Unlock();
  }

  // Runs in thread named test-thread-2
  void Thread2() {
    mu2.Lock();  // Wrong Mutex.
    var = 2;
    mu2.Unlock();
  }

The output of ThreadSanitizer contains stack traces for both memory accesses, the names of both threads, information about the locks held during each access and a description of the memory location.

  WARNING: Possible data race during write of size 4
     T2 (test-thread-2) (locks held: {L134}):
       #0 Thread2() racecheck_unittest.cc:7034
       #1 MyThread::ThreadBody(MyThread*) ...
   Concurrent write(s) happened at these points:
     T1 (test-thread-1) (locks held: {L133}):
       #0 Thread1() racecheck_unittest.cc:7029
       #1 MyThread::ThreadBody(MyThread*) ...
   Address 0x63F260 is 0 bytes inside data symbol "var"
   Locks involved in this report: {L133, L134}
     L133
       #0 Mutex::Lock() ...
       #1 Thread1() racecheck_unittest.cc:7028 ...
     L134
       #0 Mutex::Lock() ...
       #1 Thread2() racecheck_unittest.cc:7033 ...
