
Multi/many-core: energy and power management methods for processor cores

October 28, 2012

Authors
W. van Teijlingen, A. J. Luna, A. B. Gebregiorgis, R. de Wit

Contact information
4170377  w.vanteijlingen@student.tudelft.nl
4245369  a.a.jimenezluna@student.tudelft.nl
4230523  antscholar@gmail.com
4179889  R.deWit-1@student.tudelft.nl

Table of Contents
Introduction
Technical notes
Power Balanced Pipelines by Alejandro Jimenez Luna
High-Performance Low-Vcc In-Order Core by Wouter van Teijlingen
The BubbleWrap Many-Core: Popping Cores for Sequential Acceleration by Anteneh Bogale Gebregiorgis
Idempotent processor architecture by Remco de Wit
Summarization
Conclusion
References


Introduction
The subject of our research is energy/power management methods for multi/many-core processors. Energy and power consumption are important design metrics in modern processors. The upcoming market for high-performance mobile devices, which have a very tight power budget, drives the evolution of power-efficient processors. The papers used for the technical notes and summarization are listed below:

Power Balanced Pipelines
High-Performance Low-Vcc In-Order Core
The BubbleWrap Many-Core: Popping Cores for Sequential Acceleration
Idempotent processor architecture

First, the technical notes are presented. Second, a summarization and categorization of the papers is given. Finally, we discuss some conclusions and present the list of references.

Technical notes
The technical notes are presented on the following pages. The author of each technical note is mentioned in the section title.


Power Balanced Pipelines by Alejandro Jimenez Luna
The authors of this paper concentrate on improving the power efficiency of pipelined processor architectures, which has been one of the key issues in the field because advances towards faster processor architectures have not been matched by the electrical characteristics of ever smaller circuit elements. This work proposes an alternative to the well-known concept of balancing the delay of the microarchitectural pipeline stages, which maximizes throughput but sacrifices power by ignoring the difference in size (and power consumption) between complex and simple stages, e.g. the fetch and execute stages. Instead, the concept of power balanced pipelines is presented: cycle time is assigned to the different pipeline stages with the goal of balancing power consumption while maintaining overall performance.

The power balanced pipelines technique builds on cycle time stealing: cycle time is reassigned from low-power to high-power stages, maximizing the power saving of the architecture. A mathematical analysis correlates the power and voltage of the donor and receiver stages and concludes that processor power is saved whenever the power of the time-stealing stage is sufficiently greater than the power of the time-donating stage, as expressed by (a small numeric illustration is given at the end of this section):

P_hi,total + α·P_hi,dyn > P_lo,total + α·P_lo,dyn,  with α = 1 + ΔV/V

where P = power, V = voltage, hi = time-receiving (stealing) stage, and lo = time-donating stage.

The implementation of power balanced pipelines is subdivided into a static and a dynamic approach. The static approach assumes that, although absolute power fluctuates, the power breakdown between stages remains nearly constant; it can be applied at design time, using an iterative algorithm to find and select the constraints that meet the minimum-power implementation, or after fabrication, during testing, by taking advantage of dynamic voltage scaling or multiple voltage domains. The dynamic approach recognizes that workload changes can affect the relative power breakdown between pipeline stages and is mainly intended for architectures that depend heavily on complex stages such as the FPU. It uses an algorithm to identify complex-stage instructions within a given period and thus find the appropriate configuration for power saving. Results for both static and dynamic implementations, evaluated with different numbers of voltage domains, clock periods and benchmarks, demonstrate a power saving of 46% at maximum frequency compared to delay-balanced pipelines.

Final remarks
The concepts and ideas presented in this work gain importance considering that pipelined processors are becoming standard in mass-produced devices, for example mobile units. The relationship between better performance and robust power management is therefore a key issue as technology advances in this field and, as demonstrated in this work, optimizations can be achieved by analyzing simple and complex stages and giving each the appropriate share of the cycle time.
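As a numeric illustration of the power-balancing condition above, the following Python sketch checks whether donating cycle time from a low-power stage to a high-power stage is expected to save power. The stage power figures, the 50 mV voltage step and the function name are all hypothetical and are not taken from the paper.

# Sketch of the power-balancing condition for cycle time stealing.
# All power and voltage numbers below are made-up examples, not figures from the paper.

def saves_power(p_hi_total, p_hi_dyn, p_lo_total, p_lo_dyn, delta_v, v):
    """Return True if stealing cycle time from the 'lo' stage for the 'hi' stage
    is expected to reduce pipeline power, per the reconstructed inequality
    P_hi,total + a*P_hi,dyn > P_lo,total + a*P_lo,dyn with a = 1 + dV/V."""
    a = 1.0 + delta_v / v
    return p_hi_total + a * p_hi_dyn > p_lo_total + a * p_lo_dyn

# Example: a power-hungry execute stage (hi) steals time from a cheap fetch stage (lo).
if saves_power(p_hi_total=120.0, p_hi_dyn=90.0,   # mW, hypothetical
               p_lo_total=40.0, p_lo_dyn=25.0,    # mW, hypothetical
               delta_v=0.05, v=1.0):              # 50 mV swing on a 1.0 V rail
    print("Cycle stealing from the fetch stage is expected to save power.")
else:
    print("No net power saving expected for this stage pairing.")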


High-Performance Low-Vcc In-Order Core by Wouter van Teijlingen
Energy is critical in new technology nodes. The authors of the paper [6] present a novel and innovative approach referred to as immediate read after write (IRAW) avoidance. It is a mechanism to decrease Vcc while keeping the operating frequency high. In the design proposed by the authors, a high operating frequency at low Vcc is possible because the write delay constraints found in SRAMs are overridden. IRAW avoidance requires only slight hardware modifications to the processor.

Reducing the latency of SRAM arrays has received considerable attention in the past few years. Its relation to low-Vcc operation is one of the key considerations, because SRAM write delay grows exponentially as Vcc decreases. The drowsy cache is mentioned as one option for retaining contents at very low Vcc; this technique saves power, but SRAM write delay is not reduced. The improvement the authors present, IRAW avoidance, can be applied to all SRAM blocks of any in-order core. The proposed mechanism increases the operating frequency by 57% at 500 mV and by 99% at 400 mV, with only a slight area and power overhead. The frequency boost results in a 48% speedup at 500 mV and a 90% speedup at 400 mV. The mechanism the authors propose is based on interrupting write operations before the bitcells reach a readable state, while making sure the written entry is not read immediately afterwards; this is what is referred to as IRAW avoidance (a conceptual sketch is given at the end of this section). One of the conclusions is that bitcell write delay is the most critical path at low Vcc and that it impacts cycle time dramatically.

Other techniques are discussed in the paper as well. Some of them can be used to operate at a higher frequency than dictated by the write delay. Two state-of-the-art techniques to override SRAM write delay are faulty bits and extra bypass. Both have advantages and disadvantages: the hardware overhead of extra bypass is high whereas it is low for faulty bits, but faulty bits are hard to test and extra bypass is not. This adds to the relevance of the paper, because the authors' IRAW avoidance strategy does not suffer from the issues that the other mechanisms have. The paper's importance is emphasized by the fact that the strategy works for any in-order core and that the hardware overhead is low.

Compiler optimizations for additional improvement are considered out of scope in the paper. I suggest including compiler optimizations and evaluating the results. I would also like to see whether IRAW avoidance can be applied to out-of-order architectures, and whether the energy-efficient cache design proposed in [12] could be incorporated into IRAW avoidance; with the designs discussed in that paper it may be possible to operate reliably at even lower Vcc.

Final remarks
The paper is written in a clear and concise way. The abstract is one of the most important parts of a paper; in their abstract the authors provide an introduction and mention some results and conclusions, but a short explanation of IRAW avoidance is missing. I rate this paper as a major contribution to the field of Vcc scaling applied to in-order cores. So far the paper is cited by only one other paper [A.I]; obviously we cannot determine the importance of a paper by counting citations alone, but it is an important metric. The authors have also filed a patent on their technology [A.II]. The proposed mechanism targets in-order cores, and only one core is used for testing; I would like to see results based on other in-order cores as well. The Intel Silverthorne is used a lot in mobile devices, but Vcc scaling is equally important in other areas. The authors write that a lower Vcc is required due to energy constraints in the mobile market segment, but energy constraints are everywhere, so they are relevant for every CPU designer.

[A.I] VARIUS-NTV: A Microarchitectural Model to Capture the Increased Sensitivity of Manycores to Process Variations at Near-Threshold Voltages. In International Conference on Dependable Systems and Networks, June 2012.
[A.II] J. Abella, P. Chaparro, X. Vera, J. Carretero and A. Gonzalez. Memory Apparatuses with Low Supply Voltages. U.S. Patent 0 115 224, May 6, 2010.
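To make the IRAW idea concrete, here is a minimal Python model of serving an immediate read of a just-written SRAM entry from a small, invented single-entry write buffer instead of the not-yet-readable bitcells. The class, its one-cycle timing assumption and the buffer are illustrations of the concept only, not the hardware scheme from the paper.

# Minimal, hypothetical model of IRAW (immediate read after write) avoidance
# for a single SRAM array. A write is assumed to leave the bitcell not yet
# readable in the very next cycle, so an immediate read of the same entry is
# served from a single-entry write buffer instead of the array.

class SramWithIrawAvoidance:
    def __init__(self, size):
        self.array = [0] * size
        self.pending = None          # (address, data) written in the previous cycle

    def write(self, addr, data):
        self.array[addr] = data      # value assumed stable one cycle later
        self.pending = (addr, data)

    def read(self, addr):
        if self.pending and self.pending[0] == addr:
            # Immediate read after write: the cell may not be readable yet,
            # so return the buffered value instead of accessing the array.
            value = self.pending[1]
        else:
            value = self.array[addr]
        self.pending = None          # previous write is readable from now on
        return value

sram = SramWithIrawAvoidance(size=16)
sram.write(3, 42)
print(sram.read(3))   # served from the write buffer -> 42
print(sram.read(3))   # one cycle later, served from the array -> 42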


The BubbleWrap Many-Core: Popping Cores for Sequential Acceleration by Anteneh Bogale Gebregiorgis
Nowadays, when scaling up the performance of many/multi-core processors, power is becoming a bottleneck in addition to memory. Hence, many of the cores are forced to stay dormant, which pushes performance far away from the ideal situation. The authors of the paper propose a hardware design named Dynamic Voltage Scaling for Aging Management (DVSAM) to manage processor aging and attain higher performance or lower power consumption. The design mainly addresses BubbleWrap's aging problem by consuming less power for the same performance and processor age, attaining the highest performance for the same age within power constraints, or attaining even higher performance for a shorter age within power constraints. According to the authors' proposal, the processor's supply voltage is continuously tuned in small steps over the whole service time of the processor in order to exploit whatever aging guard-band is currently left (a sketch of this tuning idea is given at the end of this section). Moreover, the authors assume an age sensor circuit is available.

The authors propose four DVSAM modes:
1. DVSAM-Pow attempts to consume minimum power for the same performance and service life.
2. DVSAM-Perf tries to attain maximum performance for the same service life, within power constraints.
3. DVSAM-Short tries to attain even higher performance for a shorter service life, within power constraints.
4. VSAM-Short is the same as DVSAM-Short but without changing the supply voltage over time.

The authors evaluate their proposal on different cores. For example, for DVSAM-Pow they consider three cores under three different voltage spectrums (Vnormal, V = +3 and V = -3). From the tests they observe a reduction of 13-31% in power consumption. Related work by Tiwari and Torrellas enhanced processor lifetime by tuning the supply voltage once or twice; the authors' proposal extends Tiwari's work by tuning the voltage continuously throughout the lifetime instead of only once or twice.

Advantages of DVSAM are that it is simple to design and that it does not require changing the number of cores per chip. However, it does not help to attain higher performance while consuming less power for the same service time of a processor, and it does not address the power consumption of the dormant cores in the BubbleWrap approach. As a future research topic I would propose a design that incorporates a power management unit that controls the power limit and turns cores on and off so that their lifetime can be extended.

I rate this paper as average. The paper is well organized and concise, and it enhances the BubbleWrap approach. However, it is not a novel approach and its main target is BubbleWrap. Questions: What is the limit the voltage level can reach during the tuning process? To what extent does this proposal improve on the work of Tiwari and Torrellas? Is it possible to tune voltage and frequency at the same time?
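The following Python sketch illustrates one possible interpretation of the continuous tuning idea for a DVSAM-Perf-like mode: the supply voltage is nudged upward in small steps only while the consumed aging guard-band, as reported by the assumed age sensor, lags behind the elapsed fraction of the target service life. All constants, the policy and the sensor interface are invented for this illustration and are not the authors' algorithm.

# Hypothetical sketch of continuous voltage tuning in the spirit of DVSAM-Perf.
# The constants, step size and policy below are invented for illustration only.

TARGET_LIFETIME_H = 7 * 365 * 24      # intended service life in hours (assumed)
V_NOMINAL = 1.00                      # volts
V_MAX = 1.15                          # assumed upper safety limit on the supply rail
STEP = 0.005                          # volts per adjustment

def dvsam_perf_step(vdd, guardband_used, elapsed_h):
    """One tuning step: raise Vdd (to allow a higher frequency) only while the
    consumed aging guard-band, as reported by the assumed age sensor circuit
    (guardband_used, 0.0-1.0), lags behind the elapsed fraction of the target
    service life; otherwise hold the voltage so that aging stays on schedule."""
    lifetime_used = elapsed_h / TARGET_LIFETIME_H
    if guardband_used < lifetime_used and vdd + STEP <= V_MAX:
        return vdd + STEP             # guard-band to spare: spend a little of it
    return vdd                        # aging ahead of schedule: hold the voltage

# Example: two years in, only 20% of the guard-band consumed -> nudge Vdd up.
print(dvsam_perf_step(vdd=V_NOMINAL, guardband_used=0.20, elapsed_h=2 * 365 * 24))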


Idempotent processor architecture by Remco de Wit
In the current era of microprocessors, power consumption has become a major design constraint, especially for mobile processors, which cannot consume much power yet must operate as fast as possible. However, current technology nodes no longer scale down power consumption as much as they used to, because of increased side effects (quantum effects). Idempotent processors could solve part of this problem.

Idempotent processors use a very simple hardware design to allow out-of-order (OoO) execution of specific pieces of code, namely code that can be re-executed without changing its outcome. These pieces of code are called idempotent regions. The processor is built such that the instructions inside a region may execute in any order; if an exception occurs, the processor jumps back to the beginning of the idempotent region, re-executes it sequentially up to the point of the exception, and then resumes operation (a small sketch of this recovery scheme is given at the end of this section). In addition, the idempotent regions themselves must be processed in order for the program to produce correct results.

The main benefit of this architecture is that the processor can execute out of order while the hardware stays very simple, because it relies on the compiler to mark the idempotent regions. The processor therefore does not need complex logic for OoO execution, whereas current OoO processors need many logic blocks, such as a complex scheduler with dependency checking, a register renamer and other structures. The complexity of OoO execution is moved from the hardware to the compiler, offloading the required logic. Even compared to an in-order processor, the idempotent processor adds only a minimal amount of extra logic, and some control logic of in-order processors could even be removed. The drawback is that the performance gain of idempotent processors is not as large as that of an OoO processor (which achieves an average gain of 28.6%): on average the gain is only 4.4% compared to an in-order processor. But 4.4% is still a performance boost at the cost of almost no increase in silicon area.

Remarks
The concept is very novel. Although the idea itself was proposed earlier, the researchers are the first to synthesize an actual implementation of the architecture. The performance gain of 4.4% is significant because it is easy to obtain in hardware, and embedded processors in particular can benefit from it. However, the main drawback of this architecture is that it is compiler-dependent: without an optimized compiler the hardware will suffer a performance degradation instead of a gain. This need for a special compiler will probably hinder the adoption of the architecture. The paper presents a good view of the architecture, but I am missing some points. The authors say that low power consumption can be maintained, but hard numbers are not given. Also, because they only simulated the design, a realistic estimate of the increase in total system power consumption could not be given (while the paper admits that some other components, such as cache management, have to be added to create a working processor). Because they missed out on this point I give the paper an average grade.
One thing I would like to ask the authors: the processor jumps back to the beginning of an idempotent region and re-executes that region, but their compilers try to create large idempotent regions, so many exceptions may occur late in a region. This could lead to performance degradation because long sections of code have to be re-executed. At what region length does the cost of re-executing these long regions start to outweigh the benefit?
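Below is a toy Python sketch of the recovery scheme described above. The two-instruction region, the register-file dictionary and the simulated page fault are all invented for the illustration; they are not the ISA or compiler analysis from the paper.

# Toy illustration of exception recovery in an idempotent region. Because the
# region never reads a location before writing it among the locations it
# writes, re-running the whole region from its start yields the same state.

def run_region(region, state, faulting_index=None):
    """Execute a region of 'instructions' (callables that update state).
    If an exception is simulated at faulting_index, handle it and simply
    re-execute the whole region from its beginning; idempotence makes the
    repeated writes harmless."""
    try:
        for i, instr in enumerate(region):
            if i == faulting_index:
                raise RuntimeError("simulated page fault")
            instr(state)
    except RuntimeError:
        # Recovery: jump back to the start of the region and replay it in order.
        for instr in region:
            instr(state)
    return state

# A region that writes r1 and r2 without first reading them -> idempotent.
region = [
    lambda s: s.__setitem__("r1", s["r0"] + 4),
    lambda s: s.__setitem__("r2", s["r1"] * 2),
]
print(run_region(region, {"r0": 1, "r1": 0, "r2": 0}, faulting_index=1))
# {'r0': 1, 'r1': 5, 'r2': 10} -- identical to an exception-free run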


Summarization
In Table I the papers are categorized using various metrics. The four papers have in common that they all apply techniques to improve the performance-per-watt ratio of modern processors. All the mechanisms proposed in the papers require hardware modifications, and only two of them require software adjustments. We found that the BubbleWrap and cycle time stealing techniques can be combined with all the other proposed mechanisms, while the idempotent core and the low-Vcc in-order core cannot be combined with each other because of a difference in architectural design. The metrics we have chosen for comparing the various techniques are based on their effect and their design requirements.

Paper                                   | Technique                       | Hardware complexity | Software dependency | Power saving | Performance gain | Comments
The BubbleWrap Many-Core                | Vcc scaling for processor aging | 1                   | 0                   | Enhances     | Increases        | Improves lifetime
Idempotent processor architecture       | Idempotent core                 |                     |                     | Stable       | Increases        | -
Power Balanced Pipelines                | Cycle time stealing             |                     |                     | Enhances     | No change        |
High-Performance Low-Vcc In-Order Core  | Vcc scaling for performance     |                     |                     | Enhances     | Increases        | Applies only to in-order cores
Table I: Categorization of the techniques proposed by the papers. Hardware complexity: 0-2 (0: simple design, 2: complex design). Software dependency: 0-2 (0: software independent, 2: software dependent).

Conclusion
Modern architectures are bound by power limits, either because of the low power budget of mobile devices or because high-performance cores hit the frequency scaling limit. Many techniques have been proposed either to improve performance with a minimal power increase or to decrease power consumption without losing performance.


References
[1] Cooperative Partitioning: Energy-Efficient Cache Partitioning for High-Performance CMPs. In International Symposium on High Performance Computer Architecture, February 2012.
[2] Power Balanced Pipelines. In International Symposium on High Performance Computer Architecture, February 2012.
[3] J. Sampson et al. Efficient Complex Operators for Irregular Codes. In International Symposium on High Performance Computer Architecture, February 2011.
[4] V. Govindaraju et al. Dynamically Specialized Datapaths for Energy Efficient Computing. In International Symposium on High Performance Computer Architecture, February 2011.
[5] C. Li, W. Zhang, C. Cho and T. Li. SolarCore: Solar Energy Driven Multi-core Architecture Power Management. In International Symposium on High Performance Computer Architecture, February 2011.
[6] J. Abella, P. Chaparro, X. Vera, J. Carretero and A. Gonzalez. High-Performance Low-Vcc In-Order Core. In International Symposium on High Performance Computer Architecture, January 2010.
[7] V. J. Reddi et al. Voltage Emergency Prediction: Using Signatures to Reduce Operating Margins. In International Symposium on High Performance Computer Architecture, February 2009.
[8] S. Herbert and D. Marculescu. Variation-Aware Dynamic Voltage/Frequency Scaling. In International Symposium on High Performance Computer Architecture, February 2009.
[9] T. Cao, S. M. Blackburn, T. Gao and K. S. McKinley. The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software. In International Symposium on Computer Architecture, June 2012.
[10] M. Gebhart et al. Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors. In International Symposium on Computer Architecture, June 2011.
[11] K. Ma, X. Li, M. Chen and X. Wang. Scalable Power Control for Many-Core Architectures Running Multi-threaded Applications. In International Symposium on Computer Architecture, June 2011.
[12] A. R. Alameldeen et al. Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes. In International Symposium on Computer Architecture, June 2011.
[13] D. Gibson and D. A. Wood. Forwardflow: A Scalable Core for Power-Constrained CMPs. In International Symposium on Computer Architecture, June 2010.
[14] O. Azizi et al. Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis. In International Symposium on Computer Architecture, June 2010.
[15] A. Bhattacharjee and M. Martonosi. Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors. In International Symposium on Computer Architecture, June 2009.
[16] K. K. Rangan, G. Wei and D. Brooks. Thread Motion: Fine-Grained Power Management for Multicore Systems. In International Symposium on Computer Architecture, June 2009.
[17] Y. Wang, K. Ma and X. Wang. Temperature-Constrained Power Control for Chip Multiprocessors with Online Model Estimation. In International Symposium on Computer Architecture, June 2009.
[18] M. De Kruijf and K. Sankaralingam. Idempotent Processor Architecture. In MICRO, December 2011.
[19] E. S. Chung, P. A. Milder, J. C. Hoe and K. Mai. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In MICRO, December 2010.
[20] E. Rotem, A. Mendelson, R. Ginosar and U. Weiser. Multiple Clock and Voltage Domains for Chip Multi Processors. In MICRO, December 2009.
[21] U. R. Karpuzcu, B. Greskamp and J. Torrellas. The BubbleWrap Many-Core: Popping Cores for Sequential Acceleration. In MICRO, December 2009.
