



Tilak Agerwala and Siddhartha Chatterjee, IBM Research

Computer architecture forms the bridge between application needs and the capabilities of the underlying technologies. As application demands change and technologies cross various thresholds, computer architects must continue innovating to produce systems that can deliver needed performance and cost effectiveness. Our challenge as computer architects is to deliver end-to-end performance growth at historical levels in the presence of technology discontinuities. We can address this challenge by focusing on power optimization at all levels. Key levers are the development of power-optimized building blocks, deployment of chip-level multiprocessors, increasing use of accelerators and offload engines, widespread use of scale-out systems, and system-level power optimization.

To design leadership computer systems, we must thoroughly understand the nature of the workloads that such systems are intended to support. It is, therefore, worthwhile to begin with some observations on the evolving nature of workloads. The computational and storage demands of technical, scientific, digital media, and business applications continue to grow rapidly, driven by finer degrees of spatial and temporal resolution, the growth of physical simulation, and the desire to perform real-time optimization of scientific and business problems. The following are some examples of such applications:

A computational fluid dynamics (CFD) calculation on an airplane wing, using a 512 x 64 x 256 grid, with 5,000 floating-point operations per grid point and 5,000 time steps, requires 2.1 x 10^14 floating-point operations. Such a computation would take 3.5 minutes on a machine sustaining 1 trillion floating-point operations per second (1 Tflops). A similar CFD simulation of a full aircraft, on the other hand, would involve 3.5 x 10^17 grid points, for a total of 8.7 x 10^24


Published by the IEEE Computer Society. 0272-1732/05/$20.00 © 2005 IEEE

floating-point operations. On the same 1-Tflops machine, this computation would require more than 275,000 years to complete.1

Materials scientists currently simulate magnetic materials at the level of 2,000-atom systems, which require 2.64 Tflops of computational power and 512 Gbytes of storage. In the future, simulation of a full hard-disk drive will require about 30 Tflops of computational power and 2 Tbytes of storage. Current investigation of electronic structures is limited to about 1,000 atoms, requiring 0.5 Tflops of computational power and 250 Gbytes of storage. Future investigations involving some 10,000 atoms will require 100 Tflops of computational power and 2.5 Tbytes of storage.

Digital movies and special effects are yet another source of growing demand for computation. At around 10^14 floating-point operations per frame and 50 frames per second, a 90-minute movie represents 2.7 x 10^19 floating-point operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete this computation.

Large amounts of computation are no longer the sole province of classical high-performance computing. There is an industry trend toward continual optimization: rapid and frequent modeling for timely business decision support in domains as diverse as inventory planning, risk analysis, workforce scheduling, and chip design. Such applications also contribute to the drive for improved performance and more cost-effective numerical computing.

Applications continue to drive the growth of absolute performance and cost-performance at the historical level of an 80 percent compound annual growth rate (CAGR). This rate shows no foreseeable slowdown. If anything, application demands will grow even faster, perhaps a 90 to 100 percent CAGR, over the next few years. New workloads, such as delivery and processing of streaming and rich

media, massively multiplayer online gaming, business intelligence, semantic search, and national security, are increasing the demand for numerical- and data-intensive computing.

Another growing workload characteristic is variability of demand for system resources, both across different workloads and within different temporal phases of a single workload. Figure 1 shows an example of variable and periodic behavior of instructions per cycle (IPC) in the SPEC2000 benchmarks bzip2 and art.2 Important business and scientific applications demonstrate similar variability. Designing computer architectures to adequately handle such variability is essential.

A third important characteristic of many workloads is that they are amenable to scaling out. A scale-out architecture is a collection of interconnected, modular, low-cost computers that work as a single entity to cooperatively provide applications, system resources, and data to users. Scale-out platforms include clusters; high-density, rack-mounted blade systems; and massively parallel systems. Conventional symmetric multiprocessor (SMP) systems, on the other hand, are scale-up platforms. Many important workloads are scaling out. Enterprise resource planning, customer relationship management, streaming media, Web serving, and science/engineering computations are prime examples of scale-out workloads. However, some commercially important workloads, such as online transaction processing, are difficult to scale out and continue to require the highest possible single-thread performance and symmetric multiprocessing. We will discuss later how different workload characteristics can drive computer systems to different design points.

As a community, computer architects must make a concerted effort to better characterize applications and environments to drive the design of future computing platforms.
This effort should include developing a detailed understanding of applications' scale-out characteristics, identifying opportunities for optimizing applications across all system stack levels, and developing tools to aid the migration of existing applications to future platforms.
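The back-of-the-envelope workload estimates above are easy to reproduce. The short script below, whose inputs are exactly the figures quoted in the text, checks the wing-CFD and digital-movie arithmetic:

```python
# Verify the flop-count estimates quoted in the text.
wing_ops = 512 * 64 * 256 * 5000 * 5000   # grid points x flops/point x time steps
print(f"wing: {wing_ops:.1e} flops")       # ~2.1e14
tflops = 1e12                              # a 1-Tflops machine
print(f"wing: {wing_ops / tflops / 60:.1f} minutes")   # ~3.5 minutes

movie_ops = 1e14 * 50 * 90 * 60            # flops/frame x fps x 90 minutes of frames
print(f"movie: {movie_ops:.1e} flops")     # ~2.7e19
cluster = 2000 * 1e9                       # 2,000 CPUs at 1 Gflops each
print(f"movie: {movie_ops / cluster / 86400:.0f} days")   # ~156 days, i.e. roughly 150
```

The full-aircraft case follows the same pattern: 8.7 x 10^24 flops at 10^12 flops/s is about 2.8 x 10^5 years, matching the "more than 275,000 years" claim.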

Figure 1. Variability of instructions per cycle (IPC) in SPEC2000: IPC over the entire execution for benchmark bzip2 (a) and a 1-second interval from 31 to 32 seconds for the art benchmark (b).2 (Copyright IEEE Press, 2003)

Even as application demands for computational power continue to grow, silicon technology is running into some major discontinuities as it scales to smaller feature sizes. When we study the operating frequencies of microprocessors introduced over the last 10 years and projected frequencies for the next two to three years, it is clear that frequency will grow in the future at half the rate of the past decade. Although technology scaling delivers devices with ever-finer feature sizes, power dissipation is limiting chip-level performance, making it more difficult to ramp up operating frequency at historical rates. In the near future, therefore, chip-level performance must result from on-chip functional integration rather than continued frequency scaling.

CMOS device scaling rules, as initially stated by Dennard et al., predict that scaling device geometry, process, and operating-environment parameters by a factor of k (linear dimensions and voltages reduced k-fold) will result in higher density (~k^2), higher speed (~k), lower switching power per circuit (~1/k^2), and constant active-power density.3 In the past several years, however, in our pursuit of higher operating frequency, we have not scaled operating voltage as required by this scaling theory. As a result, power densities have grown with every CMOS technology generation.

Dennard et al.'s scaling theory is based on considerations of active (or switching) power, the dominant source of power dissipation when CMOS device features were large relative to atomic dimensions. As CMOS device features shrink, additional sources of passive (or leakage) power dissipation are increasing in importance. There are two distinct forms of passive power:

- Gate leakage is a quantum-tunneling effect in which electrons tunnel through the thin gate dielectric. This effect is exponential in gate voltage and oxide thickness.
- Subthreshold leakage is a thermodynamic phenomenon in which charge leaks between a MOSFET's source and drain. This effect increases as device channel lengths decrease and is also exponential in turn-off voltage, the difference between the device's power-supply and threshold voltages.
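Dennard's constant-field rules can be tabulated directly. The sketch below encodes the classical first-order consequences of scaling all dimensions and voltages by 1/k (k > 1); the value k = 1.4 stands in for roughly one technology generation (a 0.7x linear shrink):

```python
# Constant-field (Dennard) scaling: shrink dimensions and voltages by 1/k.
def dennard(k):
    """First-order consequences of scaling by a factor k > 1."""
    return {
        "density": k**2,                       # ~k^2 more devices per unit area
        "speed": k,                            # gate delay improves ~k
        "power_per_circuit": 1 / k**2,         # C*V^2*f shrinks ~1/k^2
        "power_density": k**2 * (1 / k**2),    # constant: more circuits x less power each
    }

print(dennard(1.4))   # one generation: 2x density, 1.4x speed, constant power density
```

The text's point is that this bargain breaks when voltage stops scaling: holding V fixed while shrinking dimensions makes power density grow each generation, and the leakage terms above are not covered by the rule at all.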



The implication of the growth of passive power at the chip level is profound. Although scaling allows us to grow the number of devices on a chip, these devices are no longer free; that is, they leak significant amounts of passive power even when they are not performing useful computation or storing useful data.4 Chip-level power is already at the limits of air cooling. Liquid cooling is an option being increasingly explored, as are improvements in air cooling. But, in the end, all heat extraction and removal processes are inherently subexponential. They will thus limit the exponential growth of power density and total chip-level power that CMOS technology scaling is driving.

We faced a similar situation two decades ago, when the heat flux of bipolar technology was similarly exploding beyond the effective air-cooling limits of the day. However, there was a significant difference between that situation and the current one: We had CMOS available then as a mature, low-power, high-volume technology. We have no other technology with similar characteristics waiting in the wings today. Technologists are making many advances in materials and processes, but computer architects must find alternate designs within the confines of CMOS, the basic silicon technology.

CMOS scaling results in another dimension of complexity: it affects variability. The critical dimensions in our designs are scaling faster than our ability to control them, and manufacturing and environmental variations are becoming critical. Such variations affect both operating frequency and chip yield, and ultimately they adversely affect system cost and cost-performance. The implications of such variability are twofold: We can either use chip area to obtain performance, or we can design for variability. The industry is beginning to use both approaches to counteract the increasing variability of deep-submicron CMOS.

We face a gap. We need 80-plus percent compound growth in system-level performance, while frequency growth has dropped to 15 to 20 percent because of power limitations. The computer architecture community's challenge, therefore, is to devise innovative ways of delivering continuing growth in system performance and price-performance while simultaneously solving the power problem. Rather than riding on the steady frequency growth of the past decade, system performance improvements will increasingly be driven by integration at all levels, together with hardware-software optimization. The shift in focus implied by this challenge requires us to optimize performance at all system stack levels (both hardware and software), constrained by power dissipation and reliability issues. Opportunities for optimization exist at both the chip and system levels.

Microprocessors and chip-level integration

The chip-level design space includes two major options: how we trade power and performance within a single processor pipeline (core), and how we integrate multiple cores, accelerators, and offload engines on chip to boost total chip-level performance. Investigating these issues requires appropriate methodologies for evaluating design choices. The following discussion illustrates such a methodology; readers should focus less on the specific numerical values of the results and more on how the results are derived.

The term power is often used loosely in discussions like this one. Depending on context, the term can be a proxy for various quantities, including energy, instantaneous power, maximum power, average power, power density, and temperature. These quantities are not interrelated in a simple manner, and the associated physical processes often have vastly different time constants. The evaluation methodology must accommodate the subtleties of the context.

Power-performance optimization in a single core

Let us consider an instruction set architecture (ISA) and a family of pipelined implementations of that ISA parameterized by the number of pipeline stages or, equivalently, the depth in fan-out of four (FO4) of each pipeline stage. (FO4 delay is the delay of one inverter driving four copies of an equal-sized inverter. The amount of logic and latch overhead per pipeline stage is often measured in terms of FO4 delay; deeper pipelines thus have smaller FO4 delays per stage.) The following discussion also fixes the circuit family and assumes it to be one of the standard static CMOS circuit families.



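The kind of trade-off this parameterized family exposes can be sketched with a toy model: if each candidate pipeline depth has an estimated performance (bips) and power (watts), a bips-only objective and a bips^3/W objective pick different winners. The per-design numbers below are illustrative stand-ins chosen to mimic the shape of the published curves, not the article's simulation data:

```python
# Illustrative (not measured) per-design values: deeper pipelines (smaller FO4)
# raise frequency but lose IPC to hazards and burn superlinearly more power.
designs = {   # FO4 per stage: (relative bips, relative watts)
    10: (1.00, 1.00),   # performance-optimal, power-hungry
    12: (0.99, 0.80),
    18: (0.90, 0.45),
    23: (0.78, 0.30),
}

best_perf = max(designs, key=lambda d: designs[d][0])                    # bips only
best_eff = max(designs, key=lambda d: designs[d][0] ** 3 / designs[d][1])  # bips^3/W

print(best_perf, best_eff)   # 10 FO4 wins on bips alone; 18 FO4 wins on bips^3/W
```

With these stand-in numbers the power-performance optimum lands at a shallower pipeline (larger FO4 per stage) than the performance optimum, which is the qualitative conclusion the simulation-based curves support.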

Figure 2. Power-performance trade-off in a single-processor pipeline.5 (Copyright IEEE Press, 2002.)

Now consider the implementation family's behavior for some agreed-upon workload and metric of goodness. Figure 2 shows plots of such behavior. The number of pipeline stages increases from left to right along the x-axis, and the y-axis shows normalized behavior; the pipeline organization with the best value is defined as 1. The y-axis numbers came from detailed simulation.

The curve labeled bips (billions of instructions per second) plots performance for the SPEC2000 benchmark suite as a function of pipeline stages and shows an optimal design point of 10 FO4 per pipeline stage. Performance drops off for deeper pipelines as the effects of pipeline hazards, branch misprediction penalties, and cache and translation look-aside buffer misses play an increasing role.

The curve labeled bips^3/W measures power-performance as a function of pipeline stages, again for SPEC2000. The term bips^3 per watt is a proxy for (energy x delay^2)^-1, a metric commonly used to quantify the power-performance efficiency of high-performance processors. There are two key differences between this curve and the performance-only curve:

- The optimal design point for the power-performance metric is at 18 FO4 per pipeline stage, corresponding to a shallower pipeline.
- The falloff past this optimal point is much steeper than in the case of the performance-only curve, demonstrating the fundamental superlinear trade-off between performance and power.

The power model for these curves incorporates active power only. If we added passive power to the model, the optimal power-performance design point would shift somewhat to the right of the 18 FO4 bips^3/W design point (because combined active and passive power increases less rapidly with increasing pipeline depth).

Figure 3 plots the same information in a different manner, making the trade-off between power and performance visually obvious. Here, a family of pipeline designs shows up as a single curve, with performance decreasing from left to right on the x-axis and power increasing from bottom to top on the y-axis. FO4 numbers of individual design points appear on the curve.

We now focus on two example design points: the 12 FO4 design, which delivers high performance (at a high power cost), and the 18 FO4 design, which is optimal for the power-performance metric. Once these designs are committed to silicon and fabricated, it is possible to determine whether they meet the chip-level power budget, shown as the horizontal dashed line in the figure. Suppose that the 12 FO4 design exceeds the power budget, as the figure shows. Options exist, even at this stage of the process, to trade performance and power by reducing either the operating voltage (the varying-VDD curve) or the operating frequency (the reducing-f curve). Either choice could return this design to an acceptable power budget, but at a significantly reduced level of single-core performance, once again emphasizing the superlinear trade-off between performance and power. On the other hand, suppose that the less-aggressive 18 FO4 design comes in slightly below the power budget. Applying VDD scaling would



Figure 3. Effect of pipeline depth on a single-core design.6 (Copyright IEEE Press, 2004)

boost its performance, while staying within the power budget.

The preceding example illustrates the importance of incorporating power as an optimization target early in the design process, along with the traditional performance metric. Although voltage- and frequency-scaling techniques can certainly correct small mismatches, selecting a pipeline structure on the basis of both performance and power is critical, because a fundamental error here could lead to an irrecoverable postsilicon power-performance (hence, cost-performance) deficiency.

In addition to fixing and scaling pipeline depth appropriately to match technology trends, additional enhancements to increase power efficiency at the microarchitecture level are possible and desirable. The computer architecture research community has worked for several years on power-aware microarchitectures, developing various techniques for reducing active and passive power in cores.7-13 Table 1 shows some of these techniques. Microarchitects are using an increasing

Table 1. Power-aware microarchitectural techniques.

Active-power reduction:
- Clock gating
- Bandwidth gating
- Register port gating
- Asynchronously clocked pipelined units and globally asynchronous, locally synchronous architectures
- Power-efficient thread prioritization (simultaneous multithreading)
- Simpler cores

Active- and passive-power reduction:
- Voltage gating of unused functional units and cache lines
- Adaptive resizing of computing and storage resources
- Dynamic voltage and frequency scaling

number of these techniques in commercial microprocessors. However, many difficult problems remain open. For example:

- determining the proper logic-level granularity of applying clock-gating techniques to maximize power savings,






Figure 4. Power-performance trade-offs in integrating multiple cores on a chip. (Courtesy of V. Zyuban, "Power-Performance Optimizations across Microarchitectural and Circuit Domains," invited course at Swedish Intelect Summer School on Low-Power Systems on Chip, 23-25 Aug. 2004.)

- reconciling pervasive clock gating's effect on cycle time,
- building in predictive support for voltage gating at the microarchitectural and compiler levels to minimize switching-unit overhead, and
- addressing increased design verification complexity in the presence of these techniques.
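The payoff from clock gating follows directly from the switching-power relation P_active ~ alpha * C * V^2 * f: gating a unit's clock drives its activity factor toward zero, while leakage is untouched (that is what the voltage-gating techniques target instead). A toy model, with all parameter values hypothetical:

```python
# Toy per-unit power model: P = alpha * C_eff * V^2 * f + leakage.
def unit_power(alpha, c_eff, vdd, freq, leakage):
    """alpha is switching activity (0 when the unit's clock is gated);
    leakage is unaffected by clock gating."""
    return alpha * c_eff * vdd**2 * freq + leakage

busy = unit_power(alpha=0.5, c_eff=1e-9, vdd=1.2, freq=2e9, leakage=0.2)
gated = unit_power(alpha=0.0, c_eff=1e-9, vdd=1.2, freq=2e9, leakage=0.2)
print(busy, gated)   # gating removes the switching term but not the 0.2 W of leakage
```

The model also makes the open problems above concrete: the savings depend on how finely alpha can be forced to zero (gating granularity), and the residual leakage term is why clock gating alone cannot reduce passive power.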

Integrating multiple cores on a chip

With single-core performance improvements slowing, multiple cores per chip can help continue the exponential growth of chip-level performance. This solution exploits performance through higher chip, module, and system integration levels and optimizes for performance through technology, system, software, and application synergies. IBM is a trailblazer in this space. The Power4 microprocessor, introduced in 2001 in 180-nm technology, comprised two cores per chip.14 The Power4+ microprocessor, introduced in 2003, was a remapping of Power4 to 130-nm technology. The Power5, introduced in 2004 in 130-nm technology,

augments the two cores per chip with two-way simultaneous multithreading per core.15 The 389-mm^2 Power5 chip contains 276 million transistors, and the resulting systems lead in 34 industry-standard benchmarks. Increasingly, CPU manufacturers are moving to multiple-cores-per-chip designs.

Let's examine the trade-offs that arise in putting multiple cores on a chip. What types of cores should we integrate on a chip, and how many of them should we integrate? Of course, we'll leverage what we learned in our discussion of power-performance trade-offs for a single core. Figure 4 presents two extreme designs that illustrate the methodology: a complex, wide-issue, out-of-order core and a simple, narrow-issue, in-order core. Given the relative difference in size between these two organizations, we assume that we could integrate up to four of the complex cores or up to eight of the simple cores on a single chip. The curves show the power-performance trade-offs possible for each of these designs through variation of the pipeline depth, as discussed earlier. Several conclusions follow from the curves in Figure 4:



- For a given power budget (consider a horizontal line at 1.5), multiple simple cores produce higher throughput (aggregate chip-level performance). The simulations used to derive the curves show that this conclusion holds for both SMP workloads and independent threads.
- A complex core provides much higher single-thread performance than a simple core (compare the curves for one wide-issue out-of-order core and one narrow-issue in-order core). Scaling up a simple core by reducing FO4 and/or raising VDD does not achieve this level of performance.
- Integrating a heterogeneous mixture of simple and complex cores on a chip might provide acceptable performance over a wider variety of workloads. As discussed later, such a solution has significant implications on programming models and software support.

These conclusions show that no single design for chip-level integration is optimal for all workloads. We can choose the appropriate design only by weighing the relative importance of single-thread performance and chip throughput for the workloads that the systems are expected to run.
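The first conclusion can be sketched with the superlinear trade-off made explicit: if per-core power grows roughly as performance cubed (the bips^3/W proxy used earlier), then under a fixed budget many slow cores out-throughput one fast core, at the cost of single-thread speed. All numbers here are illustrative:

```python
# Assume per-core power ~ perf^3 (superlinear trade-off), fixed chip budget.
BUDGET = 2.0   # arbitrary power units

def best_throughput(n_cores):
    """Max aggregate performance of n identical cores sharing BUDGET equally."""
    per_core_perf = (BUDGET / n_cores) ** (1 / 3)   # invert power = perf^3
    return n_cores * per_core_perf

for n in (1, 2, 4, 8):
    print(n, round(best_throughput(n), 2))
# Throughput grows as n^(2/3): under this model, 8 simple cores deliver 4x the
# aggregate performance of 1 complex core, but each runs at half its speed.
```

This is only a caricature of the simulation-backed curves in Figure 4, but it captures why throughput-oriented workloads favor many simple cores while latency-sensitive, hard-to-parallelize workloads still need the complex core.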

Software issues

The systems just described depend on exploiting greater levels of locality and concurrency to gain performance within an acceptable power budget. Appropriate support from compilers, runtime systems, operating systems, and libraries is essential to delivering the hardware's potential at the application level. The fundamental technology discontinuities discussed earlier, which slow the rate of frequency growth, make such enablement, integration, and optimization even more important. Increasing software componentization, combined with vastly increased hardware system complexity, requires the development of higher-level abstractions,16,17 innovative compiler optimizations,17,18 and high-performance libraries19,20 to sustain the performance growth levels that applications demand.

Processor issues involved in exploiting instruction-level parallelism, such as code generation, instruction scheduling, and register allocation, are generally well understood.21 However, memory issues, such as latency hiding and locality enhancement, need further examination.22 A fundamental issue in exploiting thread-level parallelism is identifying the threads in a computation. Explicitly parallel languages such as Java make the programmer responsible for this determination. Sequential languages require either automatic parallelization techniques21,23 or OpenMP-like compiler directives. In addition, for more effective exploitation of shared resources, the operating system must provide richer functionality in terms of coscheduling and switching threads to cores.

Accelerators and offload engines

Special-purpose accelerators and offload engines offer an alternative means of increasing performance and reducing power. Systems will increasingly rely on accelerators for improved performance and cost-performance. Such engines help exploit concurrency and data formats in a specialized domain for which we have a thorough understanding of the bottlenecks and expected end-to-end gains. A lack of compilers, libraries, and software tools to enable acceleration is the primary bottleneck to more pervasive deployment of these engines. Accelerators are not new, but in recent years several conditions have changed, making wider deployment feasible:

- Functionality that merits acceleration has become clearer. Examples include Transmission Control Protocol/Internet Protocol (TCP/IP) offloading, security, streaming and rich media, and collective communications in high-performance computing.
- In the past, accelerators had to compete against the increasing frequency, performance, and flexibility of general-purpose processors. The slowing of frequency growth makes accelerators more attractive.
- Increasing density allows the integration of accelerators on chips along with the CPU. This results in tighter coupling and finer-grained integration of the CPU and the accelerator, and allows the accelerator to benefit from the same technology advances as the CPU.
- Domain-specific programmable and reconfigurable accelerators have emerged, replacing fixed-function, dedicated units. Examples include SIMD instruction set architecture extensions and FPGA-based accelerators.

Given the power issues discussed earlier, accelerators are not free. It is extremely important to achieve high utilization of an accelerator or to clock gate and power gate it effectively. Programming models, compilers, and tool chains for exploiting accelerators must continue to mature to make such specialized functions easier for application developers to use productively. The end-to-end benefit of deploying an accelerator critically depends on the workload and the ease of accessing the accelerator functionality from application code. Much work remains in this area; for example, deciding what functions to accelerate, understanding the system-level implications of integrating accelerators, developing the right tools (including libraries, profilers, and both link-time and dynamic compiler optimizations) for software enablement of accelerators, and developing industry-standard software interfaces and practices that support accelerator use. Given the potential for improvement, the judicious use of accelerators will remain an important part of system design methodology in the foreseeable future.

Scale-out

Scale-out provides the opportunity to meet performance demands beyond the levels that chip-level integration can provide. Moreover, given that the power-performance trade-off is superlinear, scale-out can provide the same computational performance for far less power. In other words, if an application is amenable to scale-out, we can execute it on a large enough collection of lower-power, lower-performance cores to satisfy the application's overall computational requirement with much less power dissipation at the system level.

An effective scale-out solution requires a balanced building block, which integrates high-bandwidth, low-latency memory and interconnects on chip to balance data transfer and computational capabilities. Figure 5 shows an example of such a building block, the chip used in the Blue Gene/L machine that IBM Research is building in collaboration with Lawrence Livermore National Laboratory.24 The relatively modest-sized chip (121 mm^2 in 130-nm technology) integrates two PowerPC 440 cores (PU0 and PU1) running at 700 MHz, two enhanced floating-point units (FPU0 and FPU1), L2 and L3 caches, communication interfaces (Torus, Tree, Eth, and JTAG) tightly coupled to the processors, and performance counters. This chip provides 5.6 Gflops of peak computational power for approximately 5 W of power dissipation. On top of this balanced hardware platform, an innovative, hierarchically structured system software environment, standard programming models (the Message-Passing Interface), and APIs for file systems, job scheduling, and system management result in a scalable, power-efficient system. Sixteen racks (32,768 processors) of the system sustained a Linpack performance of 70.72 Tflops on a problem size of 933,887, securing the top spot on the 24th Top500 list of supercomputers.

Figure 5. Integrated functionality on IBM's Blue Gene/L computer chip. The chip uses two enhanced floating-point units (FPUs); each FPU is two-way SIMD, and each SIMD FPU performs one fused multiply-add operation (equivalent to two floating-point operations) per cycle. This structure produces a peak computational rate of 8 floating-point operations per cycle, or 5.6 Gflops at a 700-MHz clock rate.


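The Blue Gene/L numbers quoted above are mutually consistent, as a quick check shows. The per-chip peak follows from the figure caption's breakdown; the system-level peak and the sustained-to-peak ratio are derived here from the quoted processor count and Linpack result:

```python
# Peak-rate arithmetic for the Blue Gene/L figures quoted in the text.
fpus_per_chip = 2       # FPU0 and FPU1
simd_width = 2          # each FPU is two-way SIMD
flops_per_fma = 2       # one fused multiply-add = 2 floating-point operations
clock_hz = 700e6        # 700-MHz clock

flops_per_cycle = fpus_per_chip * simd_width * flops_per_fma
chip_peak = flops_per_cycle * clock_hz
print(flops_per_cycle, chip_peak / 1e9)   # 8 ops/cycle, 5.6 Gflops per chip

chips = 32768 // 2                        # 32,768 processors, 2 cores per chip
system_peak = chips * chip_peak
print(system_peak / 1e12)                 # ~91.8 Tflops peak for 16 racks
print(70.72e12 / system_peak)             # Linpack sustained ~77% of peak
```

The roughly 77 percent sustained-to-peak fraction is our derivation from the quoted figures, not a number stated in the text.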

System-level power management

Power is clearly a limiting factor at the system level. It is now a principal design constraint across the computing spectrum. Although the preceding discussion has concentrated primarily on the CPU, the power densities of all computing components at all scales are increasing exponentially. Microprocessors, caches, dual in-line memory modules, and buses are each capable of trading power for performance. For example, today's DRAM designs have different power states, and both microprocessor and bus frequencies can be dynamically voltage- and frequency-scaled. The power distributions in Table 2 make it clear that we can ignore none of the power components.

To effectively manage the range of components that use power, we must have a holistic, system-level view. Each level in the hardware/software stack needs to be aware of power consumption and must cooperate in an overall strategy for intelligent power management. To do this in real time, power usage information must be available at all levels of the stack and managed via a global systems view. Dynamically rebalancing total power across system components is key to improving system-level performance. Achieving dynamic power balancing requires three enablers:

- System components must support multiple power-performance operating points. Sleep modes in disks are a mature example of this feature.
- The system's design must exploit the fact that it is extremely unlikely that all components will simultaneously operate at their maximum power dissipation points (while providing a safe fallback position for the rare occasion when this might actually happen).
- Researchers must develop algorithms, most likely at the operating system or workload manager level, to monitor and/or predict workloads' power-performance trade-offs over time. These algorithms must also dynamically rebalance maximum available power across components to achieve the required quality of service, while maintaining the health of the system and its components.
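A minimal sketch of the third enabler: split a global budget across components in proportion to predicted demand, respecting per-component caps, and hand spare watts to the busiest uncapped component. Everything here (component names, caps, demand values, and the allocation policy itself) is hypothetical, not a description of any shipping power manager:

```python
# Hypothetical global power rebalancer for a fixed system-wide budget (watts).
def rebalance(budget, demand, caps):
    """demand: predicted relative utilization per component;
    caps: per-component maximum watts. Returns a per-component allocation."""
    total = sum(demand.values())
    alloc = {c: min(caps[c], budget * d / total) for c, d in demand.items()}
    # Give spare watts (freed by capped components) to the busiest uncapped one;
    # if everything is capped, this is a no-op since min() re-applies the cap.
    spare = budget - sum(alloc.values())
    hungriest = max(demand, key=lambda c: demand[c] if alloc[c] < caps[c] else -1)
    alloc[hungriest] = min(caps[hungriest], alloc[hungriest] + spare)
    return alloc

caps = {"cpu": 90.0, "dram": 40.0, "io": 25.0}
print(rebalance(150.0, {"cpu": 0.9, "dram": 0.5, "io": 0.1}, caps))
```

A real policy would also close the loop against measured power and thermal headroom rather than predicted demand alone, but the structure (budget, caps, dynamic reassignment) is the one the enablers above call for.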

Table 2. Power distribution across system components.

Data center:
- Servers: 46%
- Tape drives: 28%
- Direct-access storage devices: 17%
- Network: 7%
- Other: 2%

Midrange server:
- DRAM system: 30%
- Processors: 28%
- Fans: 23%
- Level-three cache: 11%
- I/O fans: 5%
- I/O and miscellaneous: 3%

The inexorable growth in applications' requirements for performance and cost-performance improvements will continue at historical rates. At the same time, we face a technology discontinuity: the exponential growth in device and chip-level power dissipation and the consequent slowdown in frequency growth. As computer architects, our challenge over the next decade is to deliver end-to-end performance growth at historical levels in the presence of this discontinuity. We will need a maniacal focus on power at all architecture and design levels to bridge this gap, together with tight hardware-software integration across the system stack to optimize performance. The right building blocks (cores), chip-level integration (chip multiprocessors, systems on chip, and accelerators), scale-out and parallel computing, and system-level power management are key levers. The discontinuity is stimulating renewed interest in architecture and microarchitecture, and opportunities abound for innovative work to meet the challenge. MICRO

Acknowledgments
The work cited here came from multiple individuals and groups at IBM Research. We thank Pradip Bose, Evelyn Duesterwald, Philip Emma, Michael Gschwind, Hendrik Hamann, Lorraine Herger, Rajiv Joshi, Tom Keller, Bruce Knaack, Eric Kronstadt, Jaime Moreno, Pratap Pattnaik, William Pulleyblank, Michael Rosenfield, Leon Stok, Ellen Yoffa, Victor Zyuban, and the entire Blue Gene/L team for the technical results and for helping us to coherently formulate the views discussed in this article. The Blue Gene/L project was developed in part through a partnership with the Department of Energy, National Nuclear Security Administration Advanced Simulation and Computing Program to develop computing systems suited to scientific and programmatic missions.

References
1. A. Jameson, L. Martinelli, and J.C. Vassberg, "Using Computational Fluid Dynamics for Aerodynamics: A Critical Assessment," Proc. 23rd Int'l Congress Aeronautical Sciences (ICAS 02), Int'l Council of Aeronautical Sciences, 2002.
2. E. Duesterwald, C. Cascaval, and S. Dwarkadas, "Characterizing and Predicting Program Behavior and Its Variability," Proc. 12th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 03), IEEE Press, 2003, pp. 220-231.
3. R.H. Dennard et al., "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, Oct. 1974, pp. 256-268.
4. International Technology Roadmap for Semiconductors, 2003 ed., Files/2003ITRS/Home2003.htm.
5. V. Srinivasan et al., "Optimizing Pipelines for Power and Performance," Proc. 35th ACM/IEEE Int'l Symp. Microarchitecture (MICRO-35), IEEE CS Press, 2002, pp. 333-344.
6. V. Zyuban et al., "Integrated Analysis of Power and Performance for Pipelined Microprocessors," IEEE Trans. Computers, vol. 53, no. 8, Aug. 2004, pp. 1004-1016.
7. P. Bose, "Architectures for Low Power," Computer Engineering Handbook, V. Oklobdzija, ed., CRC Press, 2001.
8. D. Brooks and M. Martonosi, "Value-Based Clock Gating and Operation Packing: Dynamic Strategies for Improving Processor Power and Performance," ACM Trans. Computer Systems, vol. 18, no. 2, May 2000, pp. 89-126.
9. A. Buyuktosunoglu et al., "Power Efficient Issue Queue Design," Power-Aware Computing, R. Melhem and R. Graybill, eds., Kluwer Academic, 2001.
10. D.M. Brooks et al., "Power-Aware Microarchitectures: Design and Challenges for Next-Generation Microprocessors," IEEE Micro, vol. 20, no. 6, Nov.-Dec. 2000, pp. 26-44.
11. K. Skadron et al., "Temperature-Aware Computer Systems: Opportunities and Challenges," IEEE Micro, vol. 23, no. 6, Nov.-Dec. 2003, pp. 52-61.
12. Z. Hu et al., "Microarchitectural Techniques for Power Gating of Execution Units," Proc. Int'l Symp. Low Power Electronics and Design (ISLPED 04), IEEE Press, 2004, pp. 32-37.
13. Z. Hu, S. Kaxiras, and M. Martonosi, "Let Caches Decay: Reducing Leakage Energy via Exploitation of Cache Generational Behavior," ACM Trans. Computer Systems, vol. 20, no. 2, May 2002, pp. 161-190.
14. J. Tendler et al., "Power4 System Microarchitecture," IBM J. Research & Development, vol. 46, no. 1, Jan. 2002, pp. 5-26.
15. R. Kalla, B. Sinharoy, and J. Tendler, "IBM Power5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, vol. 24, no. 2, Mar.-Apr. 2004, pp. 40-47.
16. W.W. Carlson et al., "Introduction to UPC and Language Specification," tech. report CCS-TR-99-157, Lawrence Livermore Nat'l Lab., 1999.
17. Y. Dotsenko, C. Coarfa, and J. Mellor-Crummey, "A Multi-Platform Co-Array Fortran Compiler," Proc. 13th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 04), IEEE CS Press, 2004, pp. 29-40.
18. A.E. Eichenberger, P. Wu, and K. O'Brien, "Vectorization for SIMD Architectures with Alignment Constraints," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 04), ACM Press, 2004, pp. 82-93.
19. R.C. Whaley, A. Petitet, and J.J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, vol. 27, no. 1-2, Jan. 2001, pp. 3-35.
20. K. Yotov et al., "A Comparison of Empirical and Model-Driven Optimization," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 03), ACM Press, 2003, pp. 63-76.
21. R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002.
22. X. Fang, J. Lee, and S.P. Midkiff, "Automatic Fence Insertion for Shared Memory Multiprocessing," Proc. 17th Ann. Int'l Conf. Supercomputing (ICS 03), ACM Press, 2003, pp. 285-294.
23. W. Blume et al., "Parallel Programming with Polaris," Computer, vol. 29, no. 12, Dec. 1996, pp. 78-82.
24. G. Almasi et al., "Unlocking the Performance of the BlueGene/L Supercomputer," Proc. Supercomputing 2004, IEEE Press, 2004.

Siddhartha Chatterjee is a research staff member and manager at IBM Research. His research interests include all aspects of high-performance systems and software quality. Chatterjee has a PhD in computer science from Carnegie Mellon University. He is a senior member of the IEEE, and a member of the ACM and SIAM. Direct questions and comments about this article to Tilak Agerwala, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598; tilak@us.

Tilak Agerwala is vice president, systems, at IBM Research. His primary research area is high-performance computing systems. He is responsible for all of IBM's advanced systems research programs in servers and supercomputers. Agerwala has a PhD in electrical engineering from The Johns Hopkins University. He is a fellow of the IEEE, and a member of the ACM.
