
VVS: A Cohesive Approach to Vdd/Vth Assignment and Sizing for Low-Power Design1

Ashish Srivastava, Student Member, IEEE, Dennis Sylvester, Member, IEEE and David Blaauw, Member, IEEE Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor

Abstract - We present a sensitivity-based algorithm for total power minimization, including dynamic and subthreshold leakage power, using simultaneous sizing, Vdd, and Vth assignment. The algorithm runs in two distinct phases. For a design synthesized using low threshold voltage gates, the first phase relies on upsizing to create timing slack and maximize low Vdd assignments in a backward topological manner. The second phase, which then proceeds in a forward topological fashion, both up-sizes and re-assigns gates to high Vdd to enable significant static power savings through high Vth assignment. The proposed algorithm is implemented and tested on a set of combinational benchmark circuits. A comparison with traditional CVS and dual-Vth/sizing algorithms demonstrates the advantages of the algorithm over a range of activity factors, including an average power reduction of 45%, 59% and 36% for nominal, low and high primary input activities respectively. We also investigate the impact of level conversion delay, tightness of timing constraints, and various low Vdd and Vth choices on total power savings. Finally, we adapt our approach to perform power optimization when the design is initially synthesized using high threshold voltage gates and investigate the various tradeoffs involved.

The work is an extension of a DAC 2004 paper. This work extends the approach to enable power optimization where the initial design is synthesized using high Vth (Section II.B, Section IV.B., Table 5) and explores the various tradeoffs involved as compared to an initial synthesis using low Vth gates. Additionally the impact of activity factors (Table 2), values of lower Vdd/Vth (Figure 4), runtime complexity (Section II.C, Figure 9) and comparisons with other approaches such as dual Vth and cutset based enumeration (Tables 3,4) are provided as compared to the DAC paper. Substantial additions have also been made to result analysis including the impact of the approach on path-delay distributions (Figure 3) and breakdown of resulting gate Vdd/Vth assignments (Figures 7,8). In addition, significant discussion has been added throughout, especially in Section I, III.C, pseudo-code in Sections II and III, and in a detailed justification of the upsizing stopping criterion (Figure 2).

I. Introduction
The well-known power management gap defined in the International Technology Roadmap for Semiconductors states that an 800X reduction in standby mode power and a 20X reduction in dynamic power are required compared to continued extrapolation of recent power consumption trends [1]. The best known method to attack this gap is the use of multiple supply (Vdd) and threshold voltages (Vth) on a chip. Early implementations of dual-Vdd designs showed very promising results with dynamic power savings on the order of 40-50% [3,4]. However, the authors of [5] claim that the dynamic power reduction achievable by dual-Vdd can be expected to decrease with reducing power supplies. Also with scaling process dimensions leakage power has grown to contribute a significant fraction to the total power budget. Although a number of approaches have been proposed to tackle either dynamic or leakage power, approaches that seek to minimize total power of a design are more desirable. Combining dual-Vdd and dual-Vth assignment [2] was shown to be extremely effective in total power minimization and allows for careful trade-off of dynamic and leakage power to minimize total power. Reference [2] also shows that using a second threshold voltage in conjunction with a second Vdd can also be used to maintain the achievable power reduction with scaling process generations. Additionally, it was demonstrated that using more than two power supplies or threshold voltages provides minimal reduction as compared to that provided by two Vdd or Vth [2,5]. Using multiple power supplies in a design imposes the topological constraint that gates operating at a lower supply voltage cannot fan-out to gates operating at a higher supply voltage without the use of dedicated level converters. Two approaches that obey this constraint have been proposed in the literature. 
Clustered Voltage Scaling (CVS) [6] allows only one transition from high Vdd to low Vdd gates along a path, and level converts low Vdd signals to high Vdd at the flip-flops. Extended CVS (ECVS) allows for level conversion on paths in between flip-flops and thus can improve the achievable power reduction since the problem is less constrained. When optimization is performed using a standard cell library, even the simpler problem of gate sizing has been shown to be NP-complete [37]. The problem of dual-Vdd and dual-Vth is inherently discrete since the number of Vdds and Vths available on-chip is limited. Thus approaches to handle the problem have been aimed at developing efficient heuristics using a subset of the design variables while making reasonable approximations to simplify the problem. In general the problem of supply and threshold voltage allocation (from a set of two values), along with gate sizing, can be expressed as a mixed integer non-linear program (MINLP) as follows:

$$
\begin{aligned}
\text{minimize} \quad & \mathrm{Power}(w, vdd, vth) \\
\text{subject to} \quad & \mathrm{Delay}(w, vdd, vth) \le T_{spec} \\
& vdd = vddh\,VDDH + vddl\,VDDL \\
& vth = vthh\,VTHH + vthl\,VTHL \\
& vddh + vddl = U \\
& vthh + vthl = U \\
& w > 0,\; vddh \ge 0,\; vddl \ge 0,\; vthh \ge 0,\; vthl \ge 0
\end{aligned}
$$

where Power and Delay are functions of the vector of gate sizes (w), supply voltages (vdd) and threshold voltages (vth). VDDH (VTHH) and VDDL (VTHL) are vectors of the values of the higher and the lower supply (threshold) voltage and U is a vector of 1s. Furthermore, the vectors vddh, vthh, vddl, and vthl are constrained to be integer vectors. Additional constraints are required to enforce topological constraints if asynchronous level conversion is not allowed. There are no known good methods to solve this class of problems. An approach to continuous gate sizing can rely on the fact that delay and power can be expressed as convex functions of the device widths, allowing the use of efficient approaches to solve convex optimization problems [35]. However, even if the variables in the above problem are assumed to be continuous, the final solution needs to be snapped to the discrete choices of Vdds and Vths available, which in itself is a hard problem, and heuristic approaches must be used to obtain a final solution. In [7] the authors address the problem of power optimization using simultaneous Vdd and Vth assignment. They propose two different approaches depending on whether a system is dynamic or leakage power dominated. The approach for dynamic power dominated systems fails to consider that assigning a gate to high Vth negatively impacts the extent to which other gates in the circuit can be assigned to low Vdd and thus fails to consider the optimization of total power. The approach for leakage dominated systems assigns gates to high Vth in the order of their level from the outputs. Since Vth assignment does not impose any topological constraints as in the case of Vdd assignment, this approach unnecessarily limits the achievable power savings. Recently, [24] proposed a new method for slack redistribution to solve the leakage power optimization problem with dual-Vth and sizing by iteratively formulating and solving a linear program.
However, the extension to dual-Vdd assignment is formulated as an integer linear program, resulting in unreasonable runtime complexity. Reference [23] uses a Lagrangian multiplier based optimization followed by heuristic clustering for dual-Vdd and dual-Vth assignment. This is a general technique for solving optimization problems involving discrete variables, where the problem is initially solved assuming the variables involved are continuous. This allows the problem to be solved in a computationally efficient manner, after which the obtained solution is heuristically clustered to the discrete domain [10][26][27]. The Lagrangian multiplier based approach is used to perform module level power optimization using path enumeration. The approach requires a power-aware partitioning of a circuit, which is a very difficult problem as acknowledged by the authors. The approach cannot be extended to perform gate-level power optimization due to its computational complexity and also does not consider other circuit issues such as level conversion. Reference [38] uses a genetic algorithm based approach to solve the problem of simultaneous Vdd and Vth assignment with gate sizing, which is both computationally inefficient and non-optimal. References [8,9] address the power minimization problem using dual Vdd assignment and sizing. In particular, [8] uses maximum weighted independent sets (MWIS) to identify gates for downsizing or assignment to low Vdd by identifying sets of gates which have independent timing slacks. This technique is severely limited by the amount of slack available in the original circuit as there are no means to create additional slack by sizing gates, only to consume it. In [9], the authors use a sensitivity-based technique to optimize power dissipation using dual Vdd assignment. Another work employs a delay balancing approach to solve the problem of Vdd assignment [10]. There has been a large amount of work recently in power optimization using dual Vth and sizing [11-14,24,31]. References [11-13] use sensitivity-based approaches to direct the optimization whereas [14] solves the problem using a Lagrangian relaxation based tool. Reference [31] employs a state-space enumeration based approach along with efficient pruning methods and demonstrates better power savings for tight delay constraints and better run-time compared to [11]. Although substantial work has been done in the area of power optimization, all previous works fail to integrate all three variables that are crucial to low-power design, namely sizing, threshold and supply voltages, in a computationally efficient manner. As a result, existing work fails to consider the optimization of total power dissipation and is restricted to either dynamic or leakage power optimization. Hence we observe that there is a pressing need to integrate all three of these low-power design variables concurrently in an efficient algorithm. This paper is the first to perform simultaneous gate-level sizing, Vdd, and Vth assignment in a dual-Vdd/Vth environment in a computationally efficient manner to minimize total power consumption (defined as the sum of static and dynamic power). Since our algorithm enables simultaneous optimization of total power using Vdd and Vth allocation and sizing, we refer to the complete algorithm as VVS. The rest of the paper is organized as follows. Section II describes the basic concepts of the two-pass algorithm. Section III provides implementation details of the algorithm and describes the setup used to evaluate the algorithm on ISCAS85 and several signal processing oriented benchmark circuits. In Section IV we present results and compare our algorithm to previously proposed techniques. We summarize and conclude in Section V.

II. Algorithm Description


We propose a two-stage sensitivity-based heuristic approach to minimize total power using dual Vdd, sizing, and dual Vth for a design mapped to a standard cell library. All the gates in the design are initially assumed to be operating at the higher supply and lower threshold voltage. Throughout the flow of the VVS algorithm, a front is maintained at the interface between the low and high Vdd gates. Similar to CVS, we do not allow level conversion within the logic itself and hence we must strictly observe the topological constraint imposed in a dual-Vdd design. The timing constraints on the design remain fixed throughout the flow of the algorithm. A discussion about timing constraints is included in Section III. We propose two different versions of VVS for the two possible cases where the design is initially synthesized using all low (Case 1) or all high (Case 2) threshold gates. We first describe in detail the approach for VVS with the initial design at the lower threshold voltage and then note the main differences needed to adapt VVS when starting with a design at high Vth.

A) Initial design synthesized at low threshold voltage (Case 1)


In the first stage of the VVS algorithm, called the backward pass, Vdd assignment and sizing are combined to minimize total power while we move the front from the primary outputs towards the primary inputs. The second stage, or the forward pass, uses the optimal location of the front found in the first stage as the starting point for the optimization and then relies on both Vdd and Vth assignment along with sizing to further reduce total power while the front is moved back towards the primary outputs. Thus we use all three design variables and perform concurrent Vdd, Vth assignment and gate sizing in the forward pass. The advantage of using a two-stage procedure will be explained later in this section.

i) Backward Pass

To adhere to the topological constraint imposed by dual-Vdd we define the backward front, which consists of all gates operating at high Vdd that do not fan out to any gate operating at high Vdd. Thus, assigning any gate on the backward front to low Vdd will not violate the topological constraint since all its fanouts also operate at low Vdd. This front is initialized to be the set of gates that drive the primary outputs of the design. A simple CVS procedure is first used to assign gates on the front to low Vdd as long as the circuit meets timing. The pseudo-code for the CVS subroutine is shown below, which is similar to the approach used in [6]:
CVS() {
    Initialization
    While (list is not empty) {
        Select candidate from list
        Set candidate to low Vdd
        Remove candidate from list
        If timing is violated
            Set candidate to high Vdd
        Else
            add possible gates to list
    }
}
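The CVS loop can be sketched in runnable form as follows. This is only an illustration under simplifying assumptions that are not taken from the paper: each gate has a single fixed delay at high Vdd and at low Vdd, timing is checked with a plain longest-path computation instead of a library-based timing analysis, and the names `cvs`, `critical_delay`, `d_high`, `d_low` and `metric` are hypothetical.

```python
def critical_delay(fanin, delay):
    """Longest path delay through a DAG, given one delay value per gate."""
    memo = {}

    def arrival(g):
        if g not in memo:
            memo[g] = delay[g] + max((arrival(f) for f in fanin[g]), default=0.0)
        return memo[g]

    return max(arrival(g) for g in fanin)


def cvs(fanin, d_high, d_low, metric, t_spec):
    """Greedy low-Vdd assignment from the outputs backwards (hypothetical model).

    fanin  : dict gate -> list of driving gates
    d_high : dict gate -> delay at high Vdd
    d_low  : dict gate -> delay at low Vdd
    metric : dict gate -> priority used to order the front (an assumption)
    t_spec : timing constraint on the longest path
    """
    fanout = {g: [] for g in fanin}
    for g, drivers in fanin.items():
        for f in drivers:
            fanout[f].append(g)

    low, tried = set(), set()
    # The backward front starts as the gates driving the primary outputs.
    front = [g for g in fanin if not fanout[g]]
    while front:
        front.sort(key=lambda g: metric[g])
        cand = front.pop()  # gate with the largest metric value
        tried.add(cand)
        low.add(cand)
        delay = {g: (d_low[g] if g in low else d_high[g]) for g in fanin}
        if critical_delay(fanin, delay) > t_spec:
            low.discard(cand)  # timing violated: restore high Vdd
        else:
            # A driver joins the front only if all of its fanouts are at low Vdd,
            # which preserves the topological constraint.
            for f in fanin[cand]:
                if f not in tried and f not in front and all(o in low for o in fanout[f]):
                    front.append(f)
    return low
```

For a three-gate chain with unit delays at high Vdd, delays of 1.5 at low Vdd and a constraint of 4.0, the sketch moves the two gates nearest the output to low Vdd and leaves the first gate at high Vdd.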

In the CVS procedure, initialization is used to create a list of primary outputs of the design that represents the backward front of the design. A predictive metric is then used to order, or prioritize, gates in this list. This metric could be based on simple parameters such as the fanout capacitance or the slack of the gate for example. We examined various metrics for this purpose and describe them in more detail in Section III.C. The gate with the maximum value for the predictive metric is selected as the candidate gate, which is then assigned to low Vdd if the timing constraints are not violated. Gates are identified that can be added to the backward front as a result of the assignment of the previous candidate gate to low Vdd. Only gates that act as inputs to the candidate gate just assigned to low Vdd can be considered as potential additions to the backward front. At the end of CVS, none of the gates on the backward front can be assigned to low Vdd without violating the timing constraints. Figure 1 shows the scenario at this stage of the algorithm. Gates 1-3 have been set to low Vdd by CVS and gates 4, 5, and 8 now form the backward front. Gate sizing is then employed to compensate for the delay increase arising from the assignment of a gate to low Vdd. The pseudo-code for this stage is shown below

Backward Pass {
    CVS()
    While (list is not empty) {
        Select candidate from list
        Set candidate to low Vdd
        Remove candidate from list
        Until timing is met or maximum upsizing performed
            Calculate sensitivity for upsizing for all gates
            Upsize gate with maximum sensitivity to the next higher size available in the library
        If timing violated
            Undo up-sizing moves
            Set candidate to high Vdd
        Else
            add possible fanin gates to list
    }
}
Specifically, after a candidate gate on the backward front is assigned to low Vdd, a sensitivity measure for upsizing each gate in the circuit to the next available size in the standard-cell library is calculated and used to identify gates to be up-sized. Let ΔD represent the change in delay of the gate and ΔP the change in power dissipation due to upsizing a gate to the next higher size in the library. The sensitivity of each gate in the circuit to up-sizing is computed as

$$
\text{Sensitivity} = \frac{1}{\Delta P} \sum_{arcs} \frac{\Delta D}{Slack_{arc} - S_{min} + K} \qquad (1)
$$

where S_min is the worst slack seen in the circuit and K is a small positive quantity for numerical stability purposes. Slack_arc represents the slack associated with a particular timing arc of the gate and is calculated as the required arrival time at the output node, minus the arrival time at the input node of the gate, minus the arc delay. The form of the sensitivity measure gives a higher value to gates lying on the critical paths of the circuit. The arcs represent the falling and rising arcs associated with each of the inputs of the gate. Thus, for a 3-input NAND gate the sensitivity measure will be obtained by summing over all six possible arcs. The ΔD computation for a gate (say G) is performed by upsizing the gate to the next higher size in the library. Since only gates which are the immediate inputs of G see a different load capacitance, only these gates need to be re-simulated during sensitivity computation to calculate the new arrival time at the inputs of G, which is used to calculate ΔD at the output of G. Any change in delay due to slew changes in the fanout cone of G is considered to be second order and neglected for sensitivity computation. ΔP can also be easily computed by considering only the immediate fanin of the gate. The gate with the maximum sensitivity is then selected and sized up to the next available size in the library. It is important to note that the sensitivity calculation for a gate does not require a full circuit timing analysis or an incremental timing analysis (which would propagate the impact of the change through the fanout cone), which would otherwise make the runtime prohibitively large. Complete timing and power analysis is performed and this process is repeated until all slacks in the circuit become positive. Also, while performing gate up-sizing, the delays of the immediate fanin gates and fanout cone gates are changed, which impacts the slacks of the gates that form the fanin and fanout cones of the gate. Hence we need to re-compute the timing information and sensitivities only for these gates.
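A minimal sketch of the upsizing sensitivity computation is given below. It assumes that Equation 1 has the form (1/ΔP) · Σ_arcs ΔD / (Slack_arc − S_min + K), that ΔD is supplied per timing arc as the magnitude of the delay improvement, and that ΔP is the power increase caused by the upsizing move. The function names and the default value of `k` are hypothetical.

```python
def upsizing_sensitivity(delta_d_per_arc, slack_per_arc, delta_p, s_min, k=1e-3):
    """Sensitivity of one gate to upsizing, following the assumed form of Eq. (1).

    delta_d_per_arc : delay improvement of each timing arc (assumed positive)
    slack_per_arc   : slack of each timing arc
    delta_p         : power increase caused by the upsizing move
    s_min           : worst slack seen in the circuit
    k               : small positive constant for numerical stability (assumed value)
    """
    total = sum(dd / (slack - s_min + k)
                for dd, slack in zip(delta_d_per_arc, slack_per_arc))
    return total / delta_p


def pick_gate_to_upsize(candidates, s_min, k=1e-3):
    """Return the gate name with the maximum sensitivity.

    candidates : dict gate -> (delta_d_per_arc, slack_per_arc, delta_p)
    """
    return max(
        candidates,
        key=lambda g: upsizing_sensitivity(candidates[g][0], candidates[g][1],
                                           candidates[g][2], s_min, k),
    )
```

Because the denominator shrinks as the arc slack approaches the worst slack, a gate on the critical path receives a larger sensitivity than an otherwise identical gate with ample slack, which is the behavior described in the text.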

Figure 1: Backward front for an example circuit at the end of CVS

The number of up-sizing moves allowed to meet timing is fixed to a constant large number (10% of the number of gates in the circuit) to avoid pursuing bad solutions that could also possibly result in overly large area increases. The choice of a 10% limit on the number of gates to be upsized is based on the observation that varying this percentage from 8% to 50% results in a very small change in the power dissipation achieved. The power dissipation then starts to gradually increase as we reduce the maximum number of gates that can be upsized below 8% as shown in Figure 2. This limit on the number of upsizing moves that provides the maximum reduction was not found to increase with growing design sizes and was also not monotonically related. However, we do allow moves that result in a net increase of total power in an attempt to allow the flow of the algorithm to escape local minima. Due to the topological constraints imposed on low Vdd assignment, if a gate is not assigned to low Vdd then none of the gates in its input cone can be assigned to low Vdd. Hence a steepest descent approach with no means to get out of local minima will likely become stuck in a local minimum that is far from the global minimum. Consider the case where the path that goes through Gates 7, 5 and 2 forms the critical path of the circuit in Figure 1. If Gate 5 is not assigned to low Vdd, Gate 6 and other gates in its fanin cone (if present) cannot be assigned to low Vdd. Thus if sufficient slack is available for Gate 6, a lower total power can be achieved by assigning Gate 5 to low Vdd and performing up-sizing, although this could result in an immediate increase in power. This implies that assigning a gate to low Vdd may actually lead to an immediate increase in total power dissipation because of the required up-sizing, even though this same move could eventually lead to a lower power solution that would never be discovered without allowing the original gate to be assigned to low Vdd.
The ordering of the gates on the backward front using the predictive metric associated with each gate is a heuristic used to steer the flow of the algorithm in the right direction.

Figure 2: Impact of the maximum allowed upsizing on switching power reduction (y-axis: average switching power (µW); x-axis: maximum upsizing allowed (%); annotation: 25% change in switching power)

At all points during the backward pass the best-seen solution is saved and this solution is restored at the end of the backward pass. The end of the backward pass is signaled when the list containing the gates on the backward front becomes empty or else none of the gates in the list can be assigned to low Vdd without violating timing (even with the maximum allowed amount of upsizing).

ii) Forward Pass

At the end of the backward pass the front between high and low Vdd gates is in the best position in terms of the total power dissipation for a dual Vdd, single Vth environment. The second stage, or forward pass, is then used to move the front forward towards the primary outputs in conjunction with high Vth allocation and possible gate upsizing to minimize the total power in a dual-Vth scenario. We now define the forward front, which consists of all gates that are operating at low Vdd and have all of their fanins operating at high Vdd. In Figure 1, assuming that upsizing in the backward pass allows us to further assign gates 4, 5 and 8 to low Vdd, these same three gates would now form the forward front. Importantly, assigning a gate on the forward front to operate at high Vdd will not lead to a violation of the topological constraint. We now calculate 1) a sensitivity measure for gates on the forward front with respect to high Vdd operation, and 2) a sensitivity measure for all gates in the circuit with respect to upsizing to the next higher size in the library. Both these sensitivities are calculated as the ratio of the change in delay to the change in power dissipation as a result of the corresponding operation. The gate with the maximum sensitivity is then either assigned to high Vdd or up-sized based on the operation to which the maximum sensitivity corresponds. Note that any gate in the circuit may be up-sized whereas only gates in the forward front may be re-assigned to high Vdd.
Once a gate is up-sized or reset to high Vdd operation, timing slack has been created in the circuit. To exploit this slack and reduce total power, the next step begins by computing the sensitivity of all gates in the circuit with respect to operation at high Vth (recall that in Case 1 all gates initially operate using low Vth). This sensitivity is calculated as the ratio of the change in power to the change in delay multiplied by the slack of the gate in order to identify gates that provide the maximum decrease in power for the minimum increase in delay and is expressed as

$$
\text{Sensitivity} = \Delta P \sum_{arcs} \frac{Slack_{arc}}{\Delta D} \qquad (2)
$$

where ΔP and ΔD denote the change in power and the change in delay, respectively, caused by assigning the gate to high Vth.

Based on this sensitivity measure gates are assigned to high Vth as long as the timing constraints of the design are met. This set of moves (assignment to high Vdd or upsizing a gate followed by the associated high Vth assignments) is then accepted if the total power is found to decrease. If the total power increases and the initial move was an upsizing move then all these moves are reversed; otherwise the moves are accepted, in keeping with our approach to avoid local minima. The best-seen solution is always maintained and restored at the end of the forward pass. The pseudo-code for this stage of the algorithm is shown below:
Forward Pass {
    While (list is not empty) {
        Calculate sensitivity of gates in list to high Vdd operation
        Calculate sensitivity of all gates to up-sizing
        Select candidate gate based on maximum sensitivity
        Upsize or assign candidate to high Vdd based on maximum sensitivity
        Calculate sensitivity of gates to high Vth operation
        Set gates to high Vth while timing is not violated
        If upsizing initiated move and total power increases
            reverse moves
        If high Vdd initiated move
            Remove candidate from list
            Add possible fanout gates of candidate to list
    }
}
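The high Vth assignment step of the forward pass can be illustrated with the following sketch. It assumes that Equation 2 has the form ΔP · Σ_arcs Slack_arc / ΔD, ranks the gates once instead of recomputing sensitivities after every assignment, and skips (rather than stops at) a gate whose assignment would violate timing. All names, including the `meets_timing` callback standing in for a full timing analysis, are hypothetical.

```python
def vth_sensitivity(delta_p, delta_d, arc_slacks):
    """Sensitivity of one gate to high-Vth assignment (assumed form of Eq. (2)).

    delta_p    : power saved by moving the gate to high Vth
    delta_d    : delay increase caused by the move
    arc_slacks : slack of each timing arc of the gate
    """
    return delta_p * sum(slack / delta_d for slack in arc_slacks)


def assign_high_vth(gates, meets_timing):
    """Greedily move gates to high Vth in order of decreasing sensitivity.

    gates        : dict gate -> (delta_p, delta_d, arc_slacks)
    meets_timing : callable taking the set of high-Vth gates and returning
                   True when the timing constraint still holds (a stand-in
                   for static timing analysis)
    """
    order = sorted(gates, key=lambda g: vth_sensitivity(*gates[g]), reverse=True)
    high = set()
    for g in order:
        if meets_timing(high | {g}):
            high.add(g)  # keep the move only if timing is still met
    return high
```

Gates that save a large amount of power, have ample slack and suffer only a small delay penalty are tried first, so the available slack is spent where it buys the most leakage reduction.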

This two-stage VVS algorithm allows us to make intelligent choices to trade-off dynamic power for leakage power in order to obtain a reduction in the total power dissipation. The algorithm is effectively directed to automatically provide either more leakage or dynamic power reduction based on the initial design point. The two-stage algorithm can easily quantify the impact of setting a gate to high Vth on the extent to which other gates in the circuit can be assigned to low Vdd. In other words, we can independently judge the impact of Vth and Vdd assignment on total power, something that is difficult to achieve in a flow that simultaneously assigns low Vdd and high Vth throughout the optimization as in [7] or performs Vdd and Vth assignments separately in two independent stages.

B) Initial design synthesized at high threshold voltage (Case 2)


We first point out that starting with a design mapped to high Vdd and high Vth cannot practically be used to simplify the two-stage process to a single stage process (which would involve upsizing or assigning gates to low Vth to generate slack and then using this slack to assign gates to low Vdd). This is because a large fraction of the gates in an initial design at high Vth are expected to be heavily upsized to meet a similar delay target as the low Vth circuit and hence an effective power optimization algorithm must also allow for gate down-sizing.

In Section IV we compare these two cases. In Case 2, the design is initially synthesized using high Vdd and high Vth gates and upsized to meet the delay target. Gates are then set to low Vdd using the CVS procedure as long as slack is available. The latter portion of the CVS procedure continues to set gates to low Vdd while creating slack by setting other gates to low Vth or up-sizing them to meet the delay target. This is achieved by calculating sensitivities of the form in Equation 1 for 1) up-sizing and 2) low Vth operation, for all gates in the circuit. The gate with the maximum sensitivity is then either set to low Vth or up-sized, based on the operation to which the maximum sensitivity corresponds. This is repeated as long as the circuit fails to meet timing. The best-seen solution is restored when none of the gates can be set to low Vdd even after a fixed large number of up-sizing and low Vth assignment moves. Note that at the end of the backward pass the design employs both high and low Vth, as opposed to the end of the backward pass for Case 1, where the design is initially synthesized at low Vth. The forward pass is then employed to down-size gates (consuming slack) while gates are set to low Vth or alternatively some gates on the forward front are re-assigned to high Vdd. The rationale behind this pass will be more clearly explained in Section IV, where we compare the results obtained at the end of each pass for both Case 1 and 2.

C) Computational Complexity

We first emphasize that VVS is applicable to one combinational sub-circuit at a time. Thus although a design may have millions of gates, the number of gates on which the algorithm works is comparatively much smaller [36]. The run-time complexity of the algorithm is O(n^3) (where n is the number of gates) in the worst case. Static timing analysis has a run-time complexity of O(n) and in the worst case we can make O(n^2) moves in both the backward and forward passes. Let us consider the case where we start with the design synthesized using low threshold voltage gates. In the backward pass we can potentially attempt to assign O(n) gates to low Vdd and for each of these possible assignments we can maximally upsize O(n) gates in the circuit and then revert all moves. Thus the possible number of Vdd assignment and upsizing moves is O(n^2), making the overall worst-case complexity of the backward pass O(n^3). Similarly we can find the worst-case complexity of the forward pass and hence of the overall approach to be O(n^3). For the forward pass the worst-case complexity occurs only when we revert back to the original circuit after assigning O(n) gates in the circuit to high Vth. However, since the amount of slack generated in the circuit due to a single high Vdd assignment or gate upsizing is small, the number of possible high Vth assignments due to upsizing or high Vdd assignment of a single gate can be expected to be O(1), hence the complexity of the forward pass can be expected to be O(n^2). For the backward pass the complexity is actually given by O(n^2 s), where s is the number of gates on the boundary of low and high Vdd gates that could not be assigned to low Vdd. This boundary forms the cutset of the acyclic graph which represents the circuit network. The cutset size is relevant since we only undo the up-sizing associated with the gates that form the cutset. The number of upsizing moves associated with gates other than the ones forming the cutset is O(n) since we only have a fixed number of drive strengths for a given logic gate. In the worst case, s can be O(n), which gives us the worst-case complexity of O(n^3).
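The counting argument for the backward pass can be written out as a product of move counts and the cost of one static timing analysis (STA). The decomposition below is a restatement of the reasoning in the text under the assumption that every move is followed by one full O(n) timing analysis:

```latex
\begin{aligned}
T_{\text{backward}}^{\text{worst}}
  &= \underbrace{O(n)}_{\text{low-}V_{dd}\text{ candidates}}
     \times \underbrace{O(n)}_{\text{upsizing moves per candidate}}
     \times \underbrace{O(n)}_{\text{STA per move}}
   = O(n^3) \\
T_{\text{backward}}
  &= \underbrace{O(s \cdot n)}_{\text{moves undone for the cutset}} \times O(n)
     \;+\; \underbrace{O(n)}_{\text{retained upsizing moves}} \times O(n)
   = O(n^2 s), \qquad s \le n
\end{aligned}
```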

III. Implementation and Circuit Issues


The algorithm described in Section II was implemented in C and tested on ISCAS85 benchmark circuits that vary in size from 169 to 2500 gates [22] and several industrial signal processing benchmarks ranging in size from 500 for the Huffman decoder design to 31,000 gates for SOVA2. To partition an industrial design and identify the combinational sub-circuits we use a simple breadth-first based approach. The pseudo-code for partitioning is shown below and assigns a partition number to each gate. Gates with the same partition number are in the same partition.
Partition {
    Partition_number=0
    Initialize Queue_comb to the set of primary inputs
    While(Queue_comb is not empty) {
        Pop element from Queue_comb
        Set partition number of element as Partition_number
        Explore fanin and fanout gates of element
        If gate is unexplored
            If gate is combinational
                push in Queue_comb
            else
                push in Queue_seq
        If(Queue_comb is empty)
            Set Queue_comb as Queue_seq
            Empty Queue_seq
            Partition_number = Partition_number + 1
    }
}
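The partitioning pseudo-code maps directly onto a breadth-first traversal with two queues. The sketch below mirrors the pseudo-code step by step. The netlist representation (a `neighbors` map listing the fanin and fanout of every element and an `is_comb` flag per element) is a hypothetical choice made for illustration, and primary inputs are treated as combinational elements.

```python
from collections import deque


def partition(neighbors, is_comb, primary_inputs):
    """Assign a partition number to every reachable element of a netlist.

    neighbors      : dict element -> list of its fanin and fanout elements
    is_comb        : dict element -> True for combinational, False for sequential
    primary_inputs : iterable of starting elements
    """
    part = {}
    explored = set(primary_inputs)
    q_comb = deque(primary_inputs)
    q_seq = deque()
    number = 0
    while q_comb:
        elem = q_comb.popleft()
        part[elem] = number
        for g in neighbors[elem]:
            if g not in explored:
                explored.add(g)
                # Combinational gates stay in the current partition;
                # sequential elements are deferred to the next one.
                (q_comb if is_comb[g] else q_seq).append(g)
        if not q_comb:
            # Current partition exhausted: continue from the deferred elements.
            q_comb, q_seq = q_seq, deque()
            number += 1
    return part
```

In a netlist where a flip-flop separates two blocks of logic, the gates before the flip-flop receive partition number 0, while the flip-flop and the logic it drives receive partition number 1.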

The optimization is performed using a standard cell library that is characterized for delay and dynamic power for seven different values of input slews and output load capacitances for each cell and for different combinations of supply and threshold voltages. An average value is used to represent the leakage power for each gate, where the leakage power is averaged over different input states. If the probability of different input states occurring is known then state-dependent leakage calculations can be seamlessly incorporated in our approach. The initial standard cell mapping of the circuits was performed using an industrial 0.13 µm library with a nominal Vdd of 1.2V and a nominal Vth of 0.23V (these are fixed throughout) that represent the high Vdd and high Vth respectively. The standard cells in the library are also characterized at various design points including low Vdd = {0.5, 0.6, 0.7, 0.8} V and low Vth = {0.14, 0.12, 0.1, 0.08} V. The libraries consist of inverters with twelve different drive strengths and two and three input NAND and NOR gates with seven different drive strengths. We also created duplicate low Vdd libraries in which gate delays are computed with inputs switching at high Vdd rather than low Vdd. This is called the overdrive library since the cells in this library are being overdriven at their inputs (as is the case at the boundary of high and low Vdd cells) and hence are faster in one transition direction and slower in the other. This phenomenon can be used to advantage by employing gate libraries with skewed drive strength, although we did not explore such an optimization in this work. All energy components (static, short-circuit, and dynamic) and capacitance variations due to varying thresholds [11] are inherently considered using these SPICE-derived library files. It is interesting to note that gates operating at high Vth not only have a smaller input capacitance as compared to low Vth gates but also have a much smaller internal power. Internal power accounts for the power dissipation other than that accounted for by leakage and the switching of the output load capacitance [25]. The high input capacitance of the low Vth gates is attributed to the fact that low Vth devices spend a longer time in saturation [11]. Table 1 compares the input capacitance and the internal power for a few gates in the library to illustrate this fact. The high Vth is 0.23V and the low Vth is 0.12V, with an operating voltage of 1.2V. The internal power here refers to the internal power at a specific output load capacitance and input slew. The two different rows for the NAND2X2 and NOR2X2 correspond to the two inputs of the gate. The low Vth gates have a 7% higher input capacitance and 23% higher internal power dissipation as compared to the high Vth gates, which is consistent with the short-circuit power predictions made in [28]. The wire capacitance is approximated by

$$C_{wire} = 5 \cdot \left(1 + 0.4 \cdot (\mathit{fanouts}_{wire} - 1)\right)\ \text{fF} \qquad (3)$$

where fanouts_wire is the number of gates to which the wire connects. Equation 3 is based on the model used in [29] and provides a wire capacitance of 5fF for a gate with one fanout, corresponding to a wire length of approximately 25µm in our technology.
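As a small illustration, the wire model of Equation 3, C_wire = 5 * (1 + 0.4 * (fanouts - 1)) fF, can be evaluated as below. The function name is hypothetical and the sketch only restates the formula; it is not part of the optimization flow described here.

```python
def wire_capacitance_fF(fanouts):
    """Wire capacitance in fF from Equation 3.

    5 fF for a single fanout, plus 40% of that (2 fF) per additional fanout.
    """
    if fanouts < 1:
        raise ValueError("a wire must connect to at least one gate")
    return 5.0 * (1.0 + 0.4 * (fanouts - 1))


for n in (1, 2, 3):
    print(n, wire_capacitance_fF(n))
```

With one fanout the model returns the 5fF quoted in the text, and each extra fanout adds 2fF.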
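Table 1: Comparison of gate capacitance and internal power of gates operating at different threshold voltages (NAND2X2 and NOR2X2 have one row per input)

| Gate | Gate capacitance (fF), high Vth | Internal power (nW/MHz), high Vth | Gate capacitance (fF), low Vth | Internal power (nW/MHz), low Vth |
|---|---|---|---|---|
| INVX2 | 3.52 | 7.06 | 3.80 | 9.88 |
| NAND2X2 | 2.92 | 12.57 | 2.99 | 15.56 |
| NAND2X2 | 3.78 | 8.41 | 3.99 | 11.22 |
| NOR2X2 | 3.94 | 12.14 | 4.30 | 14.90 |
| NOR2X2 | 4.12 | 8.89 | 4.45 | 11.33 |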

The design synthesized at the lower Vth is first sized using a TILOS-like [16] sensitivity-based sizing algorithm to obtain the power-delay curve for the design. The design is then resized from the initial synthesized point to a delay point that is backed off from the minimum achievable delay (using TILOS) by a fixed percentage. This is done since the power-delay curve is very steep at the minimum achievable delay and operating at that point is sub-optimal from a power/delay tradeoff perspective. We report results in the next section at different levels of backed off circuit delay. We emphasize that the initial design in Case 1 is operating at the fastest possible combination of Vdd and Vth (high Vdd and low Vth), meaning that the circuit speed at which we perform our optimizations is quite aggressive even at backoff points in the 20% range (footnote 2). Subsequent phases of the algorithm maintain this timing and no further relaxation in timing is used to obtain power improvement.

Footnote 2: Indeed, with typical speed differences of 15% between high and low Vth devices [20], a 20% backoff in our notation would correspond to nearly the fastest possible implementation of the circuit at the high Vdd, high Vth design point.

A. Level Conversion
Since we are employing a CVS-based approach in this work, level conversion is only required at sequential elements. Furthermore, we use benchmarks that are purely combinational (the signal processing benchmarks are sequential; we partition these into combinational components). We incorporate the level converter delay penalties by considering that they result in a fixed delay overhead for the circuit. We vary this delay overhead in our experiments to understand the impact of level conversion penalties on overall power reduction. Most of the results in this work are presented using a level conversion penalty of 80ps for a low Vdd of 0.6V. This delay value is chosen based on [17,21], where the authors show that the D-Q delay overhead for a level converting flip-flop can be under two fanout-of-four (FO4) inverter delays for the target technology. In our target technology at nominal high Vdd and high Vth, the FO4 delay is 40ps. We also investigate varying the delay penalty based on the value of the lower Vdd (ranging from 0.8V to 0.5V) in order to capture the dependence of the level conversion delay overhead on the low Vdd value. This dependence is based on the results found in [32], where a change in the lower Vdd from 0.8V to 0.6V while maintaining the same lower threshold results in an increase of the delay of the Differential Cascode Voltage Switched (DCVS) level converting structure by approximately 20ps.

The energy penalties of the level converting flip-flops are not considered in our results. When replacing a flip-flop operating at high Vdd with a level converting flip-flop, the energy is reduced since much of the flip-flop's internal capacitance is now toggling at a lower Vdd. While the energy reduction is not quite quadratic with the ratio Vdd,low/Vdd,high, [17] shows that the traditional master-slave level converting FF from [4] demonstrates 40% lower energy than a comparable all-high Vdd master-slave FF when (Vdd,low/Vdd,high)^2 is 0.5. Therefore, most of the energy savings are preserved and we can consider level conversion energy penalties to be negligible.
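The figures quoted in this subsection can be written out as a short worked check. This only restates numbers already given above; the symbols t_LC (level conversion delay), E_LC (energy of the level converting flip-flop), and E_HH (energy of the all-high-Vdd flip-flop) are introduced here purely for illustration.

```latex
t_{LC} \le 2\, t_{FO4} = 2 \times 40\,\text{ps} = 80\,\text{ps}

\left.\frac{E_{LC}}{E_{HH}}\right|_{\text{ideal quadratic}} = 0.5,
\qquad
\left.\frac{E_{LC}}{E_{HH}}\right|_{[17]} = 1 - 0.40 = 0.6,
\qquad
\frac{0.40}{0.50} = 80\%\ \text{of the ideal saving retained}
```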

B. Switching Activity
The switching activity at each of the circuit primary inputs can be adjusted to obtain a desired initial static vs. dynamic power ratio. We apply a switching activity and state probability at each input which are then propagated through the entire circuit using the approach outlined in [18]. Later we provide results for different circuit activities to demonstrate the efficacy of Vdd and Vth assignment in varying application spaces and how the VVS algorithm efficiently trades off between static and dynamic power.

C. Predictive Metric
The selection of gates on the backward front to be set to low Vdd is done based on a predictive metric (Section II.A). Different metrics were considered with the goal of selecting candidate gates that allow for the best final results; the metric also has implications on the hill-climbing capability of the optimization flow. The fanout capacitance of the gates is a straightforward metric that captures the immediate power reduction obtained by assigning a particular gate to low Vdd. However, a metric that depends on the properties of the gate under investigation alone fails to consider the global impact of the decision. The complete circuit topology should ideally be considered when making decisions about which gate to set to low Vdd. We developed a metric based on the total capacitance in the input (fanin) cone of each gate being evaluated. This metric seeks to capture the fact that if the gate under investigation is not set to low Vdd then none of the gates in its input cone can be set to low Vdd. The gate with the largest total capacitance in its input cone is then selected as the candidate gate in the algorithm.
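A minimal sketch of such a fanin-cone metric is given below. It is an illustration only: the data layout (`fanins` as a dictionary of direct fanin lists, `cap` as a per-gate capacitance value) is assumed, and the sketch takes the cone to exclude the gate under evaluation, which the text above does not specify.

```python
def fanin_cone_capacitance(gate, fanins, cap):
    """Sum the capacitance of every gate in the transitive fanin cone of `gate`.

    fanins: dict gate -> list of direct fanin gates (assumed format)
    cap: dict gate -> capacitance attributed to that gate (assumed format)
    The gate itself is not counted (an assumption of this sketch).
    """
    seen = set()
    stack = list(fanins.get(gate, ()))
    while stack:
        g = stack.pop()
        if g not in seen:
            seen.add(g)
            stack.extend(fanins.get(g, ()))
    return sum(cap[g] for g in seen)


def pick_candidate(front, fanins, cap):
    """Return the gate on the front with the largest fanin-cone capacitance."""
    return max(front, key=lambda g: fanin_cone_capacitance(g, fanins, cap))


# Toy example: g3 is fed by g1 and g2 (g2 is itself fed by g1); g4 is fed by g1.
fanins = {"g3": ["g1", "g2"], "g2": ["g1"], "g4": ["g1"]}
cap = {"g1": 2.0, "g2": 3.0, "g3": 1.0, "g4": 10.0}
print(pick_candidate(["g3", "g4"], fanins, cap))  # -> g3
```

In the toy example g4 has the larger capacitance of its own, but g3 is selected because its input cone (g1 and g2) is larger, which is the behavior the metric is meant to capture.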

IV. Results and Discussion


All results shown in this section are for a low Vdd and low Vth of 0.6V and 0.12V respectively unless otherwise stated. The high Vdd and Vth are fixed at 1.2V and 0.23V respectively. The timing constraint for most results is set to 20% slower than the fastest possible delay of the circuit (i.e., the 20% backoff point). The bulk of the results are shown at three different primary input switching activities, which are referred to as the nominal, low, and high activity cases. The nominal activity is chosen such that leakage power constitutes approximately 8% of the total power dissipation when the design is synthesized using high threshold voltage gates [30] (footnote 3). Activity factors that are 3X lower and 3X higher are referred to as low and high activities respectively.

Footnote 3: Reference [30] provides a data point at which leakage power contributes 18% of the total power dissipation in the 0.13µm technology node. Since industrial designs typically employ ~10% low Vth devices, leakage power can be calculated as contributing approximately 8% of the total power in a design synthesized using only high Vth devices.

A. Case 1 Analysis
Table 2 shows the results obtained for all benchmark circuits at high, nominal and low activities. The columns corresponding to the initial power list the actual power numbers. The remaining columns show the % reduction in leakage, switching, and total power at the end of three distinct phases of the algorithm: 1) CVS only, 2) end of backward pass, and 3) VVS. Note that CVS only is not actually a phase of the VVS algorithm but is shown for comparison purposes. The results clearly show the relative advantages offered by both steps of the VVS algorithm. At the end of the backward pass, CVS coupled with sizing increases the average savings in switching power from 18% to 26% for high activities compared to CVS alone. The leakage power also shows a significant reduction of ~15% over CVS across all activity values, which can be attributed to the roughly cubic dependence of leakage power on Vdd [19]. In terms of gates assigned to low Vdd, the backward pass yields similar results in all activity cases; the small difference in switching or leakage power savings can be attributed to different gates being upsized due to different sensitivities that result from varying activities at the gate inputs. If the input activities for all gates are assumed to be identical, the performance of CVS only and the backward pass are found to be identical across all activity cases, as expected.

Table 2: Power savings at various phases of the algorithm (Case 1). Initial power is in µW; all other columns are % savings compared to the initial design.

(a) High activity

| Circuit | Initial leakage | Initial switching | Initial total | CVS only leakage (%) | CVS only switching (%) | CVS only total (%) | Backward pass leakage (%) | Backward pass switching (%) | Backward pass total (%) | VVS leakage (%) | VVS switching (%) | VVS total (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 35.4 | 81.7 | 117.1 | 0.5 | 1.9 | 1.5 | 0.5 | 1.9 | 1.5 | 57.8 | 6.0 | 21.7 |
| c880 | 48.9 | 140.1 | 188.9 | 20.6 | 19.8 | 20.0 | 20.6 | 19.8 | 20.0 | 44.0 | 22.9 | 28.4 |
| c1908 | 75.3 | 202.7 | 278.0 | 5.4 | 5.6 | 5.5 | 5.4 | 5.6 | 5.5 | 44.1 | 7.4 | 17.4 |
| c2670 | 100.0 | 248.9 | 349.0 | 20.3 | 21.4 | 21.1 | 20.2 | 37.8 | 32.7 | 20.2 | 37.8 | 32.7 |
| c3540 | 131.6 | 302.6 | 434.2 | 3.4 | 6.5 | 5.6 | 2.8 | 26.4 | 19.2 | 49.4 | 26.1 | 33.2 |
| c5315 | 210.9 | 413.8 | 624.7 | 21.2 | 25.4 | 23.9 | 18.9 | 50.5 | 39.9 | 19.0 | 50.7 | 40.0 |
| c6288 | 544.3 | 1716.2 | 2260.5 | 1.1 | 15.7 | 12.2 | 1.0 | 15.8 | 12.2 | 20.3 | 19.4 | 19.6 |
| c7552 | 214.9 | 521.4 | 736.3 | 30.2 | 32.7 | 32.0 | 36.4 | 50.8 | 46.6 | 36.6 | 51.2 | 46.9 |
| Huffman | 60.2 | 144.8 | 205.0 | 9.1 | 9.3 | 9.2 | 20.9 | 27.2 | 25.4 | 35.6 | 27.0 | 29.5 |
| SOVA1 | 1483.2 | 3270.1 | 4753.3 | 42.7 | 45.3 | 44.5 | 50.7 | 57.0 | 55.0 | 83.3 | 58.6 | 66.3 |
| SOVA2 | 3481.5 | 8016.7 | 11498.2 | 4.9 | 5.1 | 5.1 | 41.5 | 69.0 | 60.7 | 49.0 | 69.8 | 63.5 |
| Average | 580.6 | 1369.0 | 1949.6 | 14.5 | 17.2 | 16.4 | 19.9 | 32.9 | 29.0 | 41.8 | 34.3 | 36.3 |

(b) Nominal activity

| Circuit | Initial leakage | Initial switching | Initial total | CVS only leakage (%) | CVS only switching (%) | CVS only total (%) | Backward pass leakage (%) | Backward pass switching (%) | Backward pass total (%) | VVS leakage (%) | VVS switching (%) | VVS total (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 32.5 | 27.1 | 59.6 | 0.5 | 1.9 | 1.1 | 0.5 | 1.9 | 1.1 | 57.6 | 5.0 | 33.7 |
| c880 | 48.0 | 47.8 | 95.8 | 20.8 | 19.4 | 20.1 | 20.8 | 19.4 | 20.1 | 59.0 | 20.9 | 40.0 |
| c1908 | 67.7 | 66.5 | 134.3 | 6.0 | 5.5 | 5.8 | 6.0 | 5.5 | 5.8 | 67.9 | 3.2 | 35.9 |
| c2670 | 98.7 | 81.6 | 180.2 | 20.8 | 21.6 | 21.2 | 24.8 | 33.5 | 28.7 | 89.4 | 6.4 | 51.9 |
| c3540 | 132.6 | 102.8 | 235.5 | 3.4 | 9.0 | 5.8 | 10.9 | 11.0 | 10.9 | 82.0 | 2.5 | 47.3 |
| c5315 | 201.9 | 138.8 | 340.8 | 21.3 | 24.5 | 22.6 | 24.2 | 40.6 | 30.9 | 86.6 | -10.9 | 46.9 |
| c6288 | 500.2 | 586.5 | 1086.7 | 1.2 | 15.6 | 9.0 | 1.2 | 15.6 | 9.0 | 20.7 | 19.0 | 19.8 |
| c7552 | 212.3 | 171.5 | 383.8 | 31.0 | 33.1 | 32.0 | 41.9 | 52.8 | 46.8 | 90.0 | -3.9 | 48.0 |
| Huffman | 60.0 | 48.2 | 108.2 | 9.1 | 9.3 | 9.2 | 20.5 | 27.0 | 23.4 | 72.3 | -0.6 | 39.9 |
| SOVA1 | 1482.8 | 1081.1 | 2563.9 | 43.7 | 46.3 | 44.8 | 50.8 | 56.7 | 53.3 | 83.8 | 59.5 | 73.6 |
| SOVA2 | 3473.2 | 2669.2 | 6142.4 | 4.9 | 5.1 | 5.0 | 41.8 | 68.8 | 53.6 | 48.7 | 69.7 | 57.8 |
| Average | 573.6 | 456.5 | 1030.1 | 14.8 | 17.4 | 16.1 | 22.1 | 30.2 | 25.8 | 68.9 | 15.5 | 45.0 |

(c) Low activity

| Circuit | Initial leakage | Initial switching | Initial total | CVS only leakage (%) | CVS only switching (%) | CVS only total (%) | Backward pass leakage (%) | Backward pass switching (%) | Backward pass total (%) | VVS leakage (%) | VVS switching (%) | VVS total (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 30.8 | 9.1 | 39.9 | 0.6 | 1.8 | 0.9 | 0.6 | 1.8 | 0.9 | 35.4 | 6.4 | 28.8 |
| c880 | 46.8 | 16.1 | 62.9 | 21.2 | 18.9 | 20.6 | 21.2 | 18.9 | 20.6 | 71.3 | 11.9 | 56.1 |
| c1908 | 69.4 | 25.2 | 94.6 | 5.8 | 4.9 | 5.5 | 5.8 | 4.9 | 5.5 | 71.9 | 5.9 | 54.3 |
| c2670 | 98.6 | 27.6 | 126.1 | 20.8 | 21.4 | 20.9 | 19.7 | 26.5 | 21.2 | 87.9 | 5.6 | 69.9 |
| c3540 | 132.1 | 36.5 | 168.6 | 3.4 | 3.0 | 3.3 | 18.5 | 11.2 | 17.0 | 70.3 | -3.3 | 54.4 |
| c5315 | 196.3 | 46.2 | 242.5 | 20.3 | 22.5 | 20.7 | 27.1 | 40.4 | 29.6 | 86.7 | -12.5 | 67.8 |
| c6288 | 493.0 | 206.5 | 699.6 | 1.2 | 14.6 | 5.2 | 1.3 | 14.8 | 5.3 | 42.7 | 15.3 | 34.6 |
| c7552 | 212.0 | 58.2 | 270.3 | 31.0 | 32.9 | 31.4 | 41.9 | 51.5 | 44.0 | 88.3 | -5.0 | 68.2 |
| Huffman | 60.4 | 16.2 | 76.6 | 9.0 | 9.3 | 9.1 | 19.8 | 26.3 | 21.2 | 78.3 | -1.9 | 61.3 |
| SOVA1 | 1472.6 | 349.8 | 1822.4 | 44.5 | 47.1 | 45.0 | 51.4 | 57.0 | 52.5 | 83.2 | 54.7 | 77.7 |
| SOVA2 | 3471.4 | 886.4 | 4357.9 | 4.9 | 5.1 | 5.0 | 42.9 | 68.8 | 48.2 | 90.7 | 0.7 | 72.4 |
| Average | 571.2 | 152.5 | 723.8 | 14.8 | 16.5 | 15.2 | 22.7 | 29.3 | 24.2 | 73.3 | 7.1 | 58.7 |
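The total-savings columns of Table 2 follow from the leakage and switching columns weighted by the initial power breakdown. The short sketch below makes that relation explicit; the function name is hypothetical, and the inputs are the c432 and c5315 entries for VVS at nominal activity as read from Table 2(b).

```python
def total_savings(p_leak, p_switch, s_leak, s_switch):
    """Fractional reduction in total power.

    p_leak, p_switch: initial leakage and switching power (same units)
    s_leak, s_switch: fractional savings of each component (may be negative)
    """
    saved = p_leak * s_leak + p_switch * s_switch
    return saved / (p_leak + p_switch)


# c432, nominal activity, after VVS: 57.6% leakage and 5.0% switching savings
print(round(100 * total_savings(32.5, 27.1, 0.576, 0.050), 1))    # -> 33.7

# c5315, nominal activity, after VVS: switching power rises by 10.9%,
# yet total power still falls because leakage drops by 86.6%
print(round(100 * total_savings(201.9, 138.8, 0.866, -0.109), 1))  # -> 46.9
```

The second case shows how a negative switching saving can coexist with a large total saving when leakage is a sizable share of the initial power.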

The last phase of VVS for high activities shows that a small amount of switching power can be traded off to obtain substantial savings in leakage power due to the exponential dependence of leakage current on Vth. Neglecting the internal (short-circuit) power and the change in gate capacitance with Vth, a small amount of switching power (~4% on average) is traded off to obtain the reduction in leakage power. Interestingly, considering these second-order effects we find that switching power is typically reduced by a small amount during the second pass of VVS as gates are set to high Vth. For the nominal and low activity cases, the second pass significantly alters the position of the front between high and low Vdd gates, since dynamic power is not as dominant a portion of total power. This leads to a marked rise in dynamic power as many gates are set back to high Vdd to enable high Vth assignment. This important capability leads to a reduction in the total power dissipation of the design and shows that the algorithm is correctly steering towards a proper low-power solution. This also shows that an optimization approach where a dual-Vth and sizing optimization is followed by a dual-Vdd and sizing optimization would result in highly sub-optimal results for these activity cases. For the nominal and low activity cases, on average roughly one half to three quarters of the reduction in switching power is given back (the average switching savings fall from 30.2% to 15.5% in Table 2b and from 29.3% to 7.1% in Table 2c) to obtain an additional ~50% savings in leakage power (savings increase from roughly 20% to 70%), which dominates the total power at such activity levels. The comparison of Tables 2a, 2b, and 2c also shows that a leakage power dominated design shows a much higher reduction in total power. This is expected due to the exponential dependence of leakage current on threshold voltage as compared to a quadratic dependence of switching power on Vdd.

VVS optimization also results in an area increase of 14%, 10% and 11% for high, nominal, and low activity respectively due to upsizing steps. Since in current and future microprocessor designs memory structures are expected to consume a large fraction of the total chip area [1], the impact of VVS on the total area of the chip can be expected to be small, and VVS will provide a large reduction in the total power dissipation, which is dominated by logic [1]. We note that for small benchmark circuits the backward pass provides no additional savings as compared to CVS. This is because the cycle time for these circuits is very small (<1ns), which leads to a much stronger impact of level conversion penalties. On the other hand, the large signal processing benchmarks show a power reduction which is much higher than average, as the designs were found to have unbalanced pipeline stages.

Figure 3 compares the slack histograms for two benchmark circuits (c432 and c5315) at the beginning and end of VVS. It is clear that the VVS algorithm results in a much tighter histogram, since the available slack has been used to provide power dissipation reductions. We also note again that c432 provides smaller savings than the other designs due to the small cycle time of this circuit and additionally because VVS is not able to slow the paths that have huge slacks (as seen in Figure 3a). This may result from paths being highly dependent on each other, given the small number of gates in this particular circuit.

Figure 4 shows the resulting power savings when using different values for the low Vdd and Vth. It is important to note that the same design at different low Vths is operating at different frequencies and hence the power savings are relative to slightly different initial design points. Thus an absolute comparison of the final achieved power at various low Vths is not justified. The figure does clearly show that a low Vdd of 0.6V provides an increase in power reduction of approximately 7% compared to a low Vdd of 0.8V. This is expected based on rules of thumb proposed in [2] which show that the optimal low Vdd is typically about half of the high Vdd in a dual-Vth environment. We also note that using a very aggressive low threshold voltage (0.08V in this case) shifts the optimal lower Vdd towards an even lower value. This trend in the change of the optimal low Vdd can also be seen for the other lower threshold voltage cases based on the change in slope in the 0.5V to 0.6V range. The actual value of the optimal lower Vdd cannot be discerned because the graph is obtained based on the results at four discrete lower Vdd values. The large spread in power savings for the lower Vdd of 0.5V can be attributed both to the larger delay increase when a gate is assigned to lower Vdd and to the increased level conversion penalties.

Figure 3: Impact of power optimization on the path-slack histogram for (a) c432 and (b) c5315. Each panel plots the number of paths against slack (normalized to cycle time), before and after VVS.

Figure 4: Dependence of average power savings on low Vdd for high activity. The plot shows the % reduction in total power against low Vdd (0.5V to 0.8V), with one curve for each low Vth (Vth2 = 0.08V, 0.10V, 0.12V, 0.14V).

Table 3 compares the average power reduction achieved using the VVS algorithm (with a level conversion penalty of 80ps) and that achieved using a dual-Vth and sizing algorithm for the high activities. The dual-Vth algorithm is implemented as a sensitivity-based algorithm similar to the second phase of VVS (Section II.B) without the availability of a second Vdd. The power reduction using all three design variables simultaneously provides large benefits as compared to a dual-Vth and sizing approach for most of the benchmark circuits. On average the reduction in total power for the proposed approach is nearly double that achieved using a conventional dual-Vth and sizing algorithm. For the low activity case, where the leakage power contributes more than 70% of the total power dissipation, most of the gains can be expected from high Vth insertion, and thus we expect both approaches to perform very similarly. Results obtained using the low activity value show an average improvement of 4% for VVS, demonstrating that VVS efficiently balances static and dynamic power to achieve a reduction in the total power dissipation. These results demonstrate that the new single cohesive algorithm effectively seeks out the best power reduction over a range of switching activities and initial switching/leakage power breakdowns, as might be found from one functional unit to another in a given design.
Table 3. Comparison of VVS with dual-Vth + sizing for high activity. Power values are in µW; % savings refers to total power.

| Circuit | Initial leakage | Initial switching | VVS leakage | VVS switching | VVS % savings | Sizing+Dual-Vth leakage | Sizing+Dual-Vth switching | Sizing+Dual-Vth % savings |
|---|---|---|---|---|---|---|---|---|
| c432 | 35.4 | 81.7 | 14.9 | 76.8 | 21.7% | 14.9 | 79.2 | 19.6% |
| c880 | 48.9 | 140.1 | 27.4 | 108.0 | 28.4% | 22.4 | 129.9 | 19.4% |
| c1908 | 75.3 | 202.7 | 42.1 | 187.6 | 17.4% | 38.6 | 189.2 | 18.1% |
| c2670 | 100.0 | 248.9 | 79.8 | 154.9 | 32.7% | 58.6 | 228.5 | 17.7% |
| c3540 | 131.6 | 302.6 | 66.6 | 223.6 | 33.2% | 62.9 | 307.8 | 14.6% |
| c5315 | 210.9 | 413.8 | 170.8 | 204.2 | 40.0% | 107.5 | 373.0 | 23.1% |
| c6288 | 544.3 | 1716.2 | 433.6 | 1383.8 | 19.6% | 423.8 | 1624.6 | 9.4% |
| c7552 | 214.9 | 521.4 | 136.3 | 254.7 | 46.9% | 104.9 | 473.7 | 21.4% |
| Huffman | 60.2 | 144.8 | 38.8 | 105.7 | 29.5% | 24.9 | 137.8 | 20.6% |
| SOVA1 | 1483.2 | 3270.1 | 247.4 | 1352.4 | 66.3% | 235.3 | 2858.8 | 34.9% |
| SOVA2 | 3481.5 | 8016.7 | 1774.3 | 2420.6 | 63.5% | 1623.4 | 7156.8 | 23.6% |
| Average | 580.6 | 1369.0 | 275.6 | 588.4 | 36.3% | 247.0 | 1232.7 | 20.2% |

To further explore the efficacy of our approach we use an exhaustive enumerative approach to compare the solution found by VVS to a known near-optimal solution. To perform an exhaustive enumeration of the solution space we note that when representing the circuit as a graph, the front between low Vdd and high Vdd gates forms a cutset of the graph. We break the optimization problem into the following steps:
1. Add an additional input node and an additional output node. The new input node is a fanin of all the original input nodes, and all the original output nodes are fanins of the new output node.
2. Enumerate all possible cutsets of the circuit.
3. Assign gates in the partition containing the output node to low Vdd.
4. Perform a sensitivity-based dual-Vth and sizing optimization.
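Steps 2 and 3 can be illustrated with a brute-force sketch for very small circuits. It is not the state-space method of [34] and makes simplifying assumptions: the circuit is given as a hypothetical dictionary `fanouts` of direct fanouts, the added input and output nodes of step 1 are left implicit, and a front is treated as valid when no low Vdd gate drives a high Vdd gate, so that level conversion is never needed inside the logic.

```python
from itertools import combinations


def enumerate_low_vdd_sets(gates, fanouts):
    """Yield every set of gates that can run at low Vdd.

    A set is valid when it is closed under fanout, i.e. every gate driven
    by a low Vdd gate is itself at low Vdd. Each valid set is the
    output-side partition of one cut of the circuit graph.
    """
    gates = list(gates)
    for size in range(len(gates) + 1):
        for subset in combinations(gates, size):
            low = frozenset(subset)
            if all(low.issuperset(fanouts.get(g, ())) for g in low):
                yield low


# Chain a -> b -> c: the front can sit at four positions.
chain = {"a": ["b"], "b": ["c"]}
print(len(list(enumerate_low_vdd_sets(["a", "b", "c"], chain))))  # -> 4

# Diamond a -> {b, c} -> d has six valid fronts.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(len(list(enumerate_low_vdd_sets(["a", "b", "c", "d"], diamond))))  # -> 6
```

Each set produced this way would then be handed to the dual-Vth and sizing step (step 4), which is not shown. The sketch visits all 2^n subsets, which is only workable for a handful of gates.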

Table 4. Comparison of power savings offered by VVS and exhaustive cutset enumeration

| Backoff | Initial power (µW) | Final power using VVS (µW) | Final power using cutset enumeration (µW) | % Difference |
|---|---|---|---|---|
| 1.2 | 117.10 | 91.70 | 91.70 | 0.00% |
| 1.3 | 95.70 | 74.27 | 73.90 | 0.38% |
| 1.4 | 78.60 | 57.84 | 56.90 | 1.20% |
| 1.5 | 72.60 | 56.12 | 51.50 | 6.37% |
| 1.6 | 66.70 | 48.40 | 46.80 | 2.40% |

The final step is not guaranteed to be optimal; however, the enumeration step checks the effectiveness of the front-based approach and shows that examining more fronts does not lead to highly improved solutions, which supports the effectiveness of our approach. Also, making use of this heuristic step allows us to make the enumeration problem manageable for small benchmarks (i.e., it is impossible to investigate all possible dual-Vth and sizing assignments). Several approaches have been proposed for cutset enumeration in the literature and we employ a state-space enumeration based approach [34]. We perform additional pruning by ignoring cutsets that result in circuits that fail to meet timing; because cutsets are generated in a levelized fashion, this pruning can use information from previous cutsets that also failed to meet timing. Since cutset enumeration is known to be an NP-hard problem, this approach cannot be extended to large circuits and we only compare our results for the smallest circuit in our benchmarks, c432. Table 4 shows the results for five different timing constraints that result in varying positions of the optimal front. The last column in Table 4 shows the percentage difference in power achieved using the two approaches relative to the initial design. The results show that VVS performs very close to optimal when the set of low Vdd gates is either very small or very large (resulting from a very tight or very loose timing constraint), which results in a smaller number of possible configurations of the optimal front. For the case with a 50% backoff, approximately half of the gates in the circuit are assigned to low Vdd and VVS results in a power penalty of approximately 6%.
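To make the last column of Table 4 concrete, the difference between the two final powers is normalized by the initial power rather than by either final power. The helper below is a hypothetical illustration using the values from Table 4; it reproduces the tabulated percentages to within rounding.

```python
def penalty_vs_enumeration(initial_uW, vvs_uW, enumeration_uW):
    """Power penalty of VVS versus cutset enumeration, as a percentage
    of the initial design power."""
    return 100.0 * (vvs_uW - enumeration_uW) / initial_uW


rows = [
    (1.2, 117.10, 91.70, 91.70),
    (1.3, 95.70, 74.27, 73.90),
    (1.4, 78.60, 57.84, 56.90),
    (1.5, 72.60, 56.12, 51.50),
    (1.6, 66.70, 48.40, 46.80),
]
for backoff, initial, vvs, enum in rows:
    print(backoff, round(penalty_vs_enumeration(initial, vvs, enum), 2))
```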

Figure 5: Impact of level conversion on power reduction. The plot shows the % reduction in total power against the level conversion penalty (0 to 120ps) for high, nominal, and low activity.

Figure 5 shows the impact of level conversion on the average achievable power reduction for the three different cases of input activities studied. The level conversion penalty is modeled as a fixed delay overhead which is swept from 0 to 120ps. Though power reduction is smaller for increasing level conversion penalties, all switching activities show smaller sensitivity to level conversion penalty as it becomes a larger fraction of the circuit delay (i.e., clock cycle). Also the impact of level conversion delay on the low activity circuits is significantly smaller than in high and nominal activity circuits. This is expected due to the fact that most of the power reduction at low activities is due to high Vth assignment, which is not affected by level conversion penalties.

Figure 6 shows the change in average power savings across benchmarks as we vary the backoff point from the best delay point on the power-delay curve. The backoff is expressed as a percentage of the minimum achievable delay. The figure clearly shows a marked decline in achievable power reduction for very small backoff values since we move to the steeper region of the energy-delay curve and a large amount of upsizing must be initially performed to meet the delay target. This reduces the available upsizing moves in the circuit and hinders the assignment of gates to low Vdd or high Vth. We again point out that the initial design is synthesized at the fastest combination of Vdd and Vth and thus the 20% backoff used to present most of the results is reasonable.

Figure 6: Dependence of power reduction on the amount of backoff. The plot shows the % reduction in total power against the % delay backoff (10 to 50) for high, nominal, and low activity.

Figures 7 and 8 show the percentage of gates assigned to the various combinations of Vdd and Vths for different backoff levels at high and nominal activity respectively. Figure 7 demonstrates the need for a lower Vth to fully leverage the low power supply available since a large percentage of gates are set to the low Vdd, low Vth combination. The small number of gates assigned to high Vth can result from the fact that, for tight delay constraints, the backward pass leaves a large fraction of the gates operating at high Vdd which may possibly include regions of the circuit with available slack. These high Vdd gates are then assigned to high threshold by the forward pass since there are no topological constraints. In the circuits operating at nominal activity a large fraction of gates are set to the higher threshold to obtain a large reduction in leakage power. This again shows that the VVS algorithm approximates a dual-Vth and sizing algorithm for moderate or low activity values. Another interesting observation is that in all cases (including small backoff points) the fraction of gates operating at the initial high Vdd and low Vth design point is less than half, as these are the most power hungry gates available.

Figure 7: Final assignment of gates to different Vdd/Vth combinations for high activity. The chart shows the % of total gates assigned to (Vddhigh, Vthlow), (Vddhigh, Vthhigh), (Vddlow, Vthlow), and (Vddlow, Vthhigh) at 10%, 20%, and 30% backoff.

Figure 8: Final assignment of gates to different Vdd/Vth combinations for nominal activity. The chart shows the % of total gates assigned to (Vddhigh, Vthlow), (Vddhigh, Vthhigh), (Vddlow, Vthlow), and (Vddlow, Vthhigh) at 10%, 20%, and 30% backoff.

B. Case 2 Analysis
We now examine the impact of the choice of the threshold voltage in initial synthesis on the final power dissipation achieved. The timing constraint that is developed for low Vth synthesis (using 20% backoff) is maintained and initially gate sizing is used to meet the delay target. The circuit c3540 fails to meet this timing constraint and is not reported in the results. For further discussions we do not consider the case of low activity, since the component of leakage power becomes very large such that a dual-Vth assignment is mostly sufficient, as observed in the previous section.

Table 5 shows the resulting power dissipation when the initial design is synthesized at high Vth for high and nominal activities. We first point out that the initial power dissipation of the circuits, before optimization, is quite different due to the use of either all low or all high Vth devices in Cases 1 and 2 respectively. The leakage power is clearly much higher in Table 2 than Table 5 for this reason and we also see a noticeable difference in switching power in the initial circuits due to the amount of upsizing needed in Case 2 to achieve a tight timing constraint. This leads to different starting points for the two cases; on average Case 1 (low-Vth) has much lower initial power than Case 2 for high switching activities with the reverse being true for nominal activities.
Table 5: Power savings at various phases of the algorithm using high threshold for initial synthesis (Case 2). Initial power is in µW; all other columns are % savings compared to the initial design.

(a) High activity

| Circuit | Initial leakage | Initial switching | Initial total | CVS only leakage (%) | CVS only switching (%) | CVS only total (%) | Backward pass leakage (%) | Backward pass switching (%) | Backward pass total (%) | VVS leakage (%) | VVS switching (%) | VVS total (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 3.0 | 107.4 | 110.4 | 0.3 | 1.4 | 1.3 | 0.3 | 1.4 | 1.3 | -52.9 | 10.8 | 9.1 |
| c880 | 4.5 | 161.1 | 165.7 | 8.9 | 7.8 | 7.8 | 8.9 | 7.8 | 7.8 | 14.6 | 10.6 | 10.7 |
| c1908 | 6.9 | 249.7 | 256.6 | 2.9 | 3.3 | 3.3 | 2.9 | 3.3 | 3.3 | -38.1 | 7.6 | 6.4 |
| c2670 | 6.9 | 247.5 | 254.4 | 15.5 | 17.8 | 17.7 | 15.5 | 17.8 | 17.8 | -3.9 | 20.1 | 19.4 |
| c3540 | Failed to meet timing (not considered) | | | | | | | | | | | |
| c5315 | 22.5 | 493.1 | 515.6 | 16.2 | 14.7 | 14.8 | -280.0 | 38.6 | 24.6 | -250.3 | 42.7 | 29.9 |
| c6288 | 67.5 | 2318.7 | 2386.2 | 0.7 | 10.5 | 10.2 | 0.7 | 10.5 | 10.2 | -19.8 | 15.6 | 14.6 |
| c7552 | 13.9 | 498.3 | 512.2 | 9.3 | 11.5 | 11.4 | -229.2 | 37.5 | 30.3 | -219.9 | 38.8 | 31.8 |
| Huffman | 4.3 | 153.1 | 157.4 | 6.7 | 7.3 | 7.3 | -354.9 | 20.4 | 10.0 | -349.2 | 22.0 | 11.7 |
| SOVA1 | 83.1 | 2772.4 | 2855.5 | 38.1 | 39.4 | 39.4 | 27.2 | 52.4 | 51.7 | 26.6 | 53.4 | 52.6 |
| SOVA2 | 207.8 | 6931.0 | 7138.8 | 0.4 | 0.4 | 0.4 | 21.9 | 47.2 | 46.5 | 15.8 | 48.8 | 47.8 |
| Average | 42.1 | 1393.2 | 1435.3 | 9.9 | 11.4 | 11.4 | -78.7 | 23.7 | 20.4 | -87.7 | 27.0 | 23.4 |

(b) Nominal activity

| Circuit | Initial leakage | Initial switching | Initial total | CVS only leakage (%) | CVS only switching (%) | CVS only total (%) | Backward pass leakage (%) | Backward pass switching (%) | Backward pass total (%) | VVS leakage (%) | VVS switching (%) | VVS total (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 2.9 | 35.3 | 38.2 | 0.3 | 1.4 | 1.3 | 0.3 | 1.4 | 1.3 | 5.0 | 5.1 | 5.1 |
| c880 | 4.5 | 54.6 | 59.1 | 8.4 | 7.4 | 7.5 | 8.4 | 7.4 | 7.5 | 14.8 | 9.9 | 10.3 |
| c1908 | 6.8 | 84.4 | 91.2 | 2.9 | 3.2 | 3.2 | 2.9 | 3.2 | 3.2 | 7.3 | 5.7 | 5.8 |
| c2670 | 6.7 | 82.0 | 88.7 | 16.2 | 18.0 | 17.8 | 16.2 | 18.0 | 17.8 | 19.0 | 18.8 | 18.8 |
| c3540 | Failed to meet timing (not considered) | | | | | | | | | | | |
| c5315 | 25.4 | 167.3 | 192.7 | 5.1 | 8.0 | 7.6 | 5.1 | 8.0 | 7.6 | 6.7 | 8.5 | 8.2 |
| c6288 | 64.0 | 774.7 | 838.7 | 0.7 | 10.3 | 9.6 | 0.4 | 10.4 | 9.6 | 7.3 | 12.9 | 12.5 |
| c7552 | 13.4 | 165.8 | 179.2 | 10.4 | 12.2 | 12.1 | -126.0 | 24.5 | 13.3 | -118.8 | 25.9 | 15.1 |
| Huffman | 4.3 | 51.1 | 55.5 | 6.7 | 7.3 | 7.3 | 6.7 | 7.3 | 7.3 | 9.1 | 8.7 | 8.8 |
| SOVA1 | 83.2 | 924.5 | 1007.7 | 38.1 | 39.4 | 39.3 | 33.4 | 52.3 | 50.8 | 35.3 | 53.4 | 51.9 |
| SOVA2 | 207.0 | 2315.0 | 2522.0 | 0.2 | 0.4 | 0.4 | 22.7 | 44.7 | 42.9 | 23.4 | 48.6 | 46.5 |
| Average | 41.8 | 465.5 | 507.3 | 8.9 | 10.8 | 10.6 | -3.0 | 17.7 | 16.1 | 0.9 | 19.8 | 18.3 |

The upsizing required in Case 2 leads to these designs being 40% larger on average than the analogous Case 1 circuits. To compensate for this large increase in area we employ the forward pass in Case 2 to downsize gates while slack is created by setting gates in the design to low Vth or those on the forward-front back to high Vdd. Table 5 shows that downsizing provides an additional reduction in power of 3% (2%) for high (nominal) activity while reducing the area penalty to 30%. The final power dissipation achieved in Case 2 is 66% higher (30% lower) on average compared to those synthesized using low Vth at high (nominal) activity. This indicates that a choice of whether to initially synthesize a circuit at high or low Vth depends strongly on the circuit

switching activity (or more fundamentally the relative contributions of leakage and switching power to total power). When considering the relative power improvements for each case separately, we see that the percentage reduction in switching power is comparable for Cases 1 and 2 at both activity levels, particularly at high activities. This is true since most of the switching power reduction is achieved by low Vdd assignment, and it points to the efficacy of the approach in finding the optimal front separating the high and low Vdd gates regardless of the initial synthesis conditions. In all cases an initial synthesis using high Vth results in much higher area and thus the two different options of initial synthesis provide a useful trade-off to the designer. Based on these results we conclude that a low Vth synthesis is advisable for high activity circuits whereas an initial synthesis using high Vth results in a better design for lower activities when the objective is to minimize the total power of the design with the minimum area penalty. Another observation is that final optimized designs for Case 2 employ approximately 3% (8%) low Vth devices whereas those using the Case 1 approach use 26% (85%) low Vth gates for nominal (high) activity. Such a large difference in the fraction of low Vth devices may impact the yield of the design since low Vth devices are highly sensitive to process variations, which can place limits on the amount of their use in a design [33]. Therefore, an initial synthesis using high Vth gates becomes a viable option over a larger range of activity factors as process variability becomes a larger concern in scaled CMOS processes. Figure 9 shows the dependence of the run-time on the size of the benchmark circuit. The graph clearly shows that the run-time of the algorithm increases non-linearly with the increase in design size. The figure also shows that the run-time increases linearly with respect to the square of the design size. 
Based on the complexity analysis in Section II.C, we conclude that the size of the cutset increases slowly with design size. Since the combinational sub-circuits of a design will not be overly large, we expect the run-time to be O(n^2).
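One way to check a quadratic run-time trend of this kind is to fit a power law t = c * n^k to (design size, run-time) samples on a log-log scale and inspect the exponent k. The following is a minimal sketch, not the procedure used to produce Figure 9: the function name `fit_power_law` is hypothetical, and the sample data are synthetic values generated from an assumed quadratic model, not the measured run-times.

```python
import math


def fit_power_law(sizes, runtimes):
    """Least-squares fit of log(t) = log(c) + k*log(n); returns (c, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line in log-log space is the exponent k.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k


# Synthetic, illustrative samples only (NOT the measurements behind Figure 9):
# generated from the assumed model t = 4e-5 * n^2 so the exponent is known.
sizes = [200, 500, 1000, 1500, 2000, 2500]
runtimes = [4e-5 * n ** 2 for n in sizes]

c, k = fit_power_law(sizes, runtimes)
print(f"fitted exponent k = {k:.2f}, coefficient c = {c:.2e}")
```

With measured data, an exponent close to 2 would be consistent with the O(n^2) expectation, while a noticeably larger value would suggest that the cutset size is growing faster than assumed.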

Figure 9: Dependence of run-time on design size (axes: run-time in sec; size of benchmark in # of gates, and its square)

V. Conclusions
We have presented the VVS algorithm, which combines gate sizing with Vdd and Vth assignment to minimize total power dissipation and provides the designer with a single approach to minimizing total power across a range of circuit parameters. The efficacy of the proposed algorithm was demonstrated on a set of ISCAS benchmark circuits and a few signal processing benchmarks. The new algorithm was compared with traditional CVS and dual-Vth with sizing algorithms to show the advantage of a single complete optimization approach. While other techniques work best in either high or low switching activity circuits, VVS outperforms previous algorithms across a wide range of switching activities, showing 34% and 59% average power reductions at high and low primary input activities respectively. The final assignment of the gates also shows the need for an additional threshold voltage to fully leverage the gains made available by the additional power supply. The impact of the initial synthesis on the optimization approach was studied to demonstrate the tradeoffs involved. The impact of various design, circuit and cell library parameters, such as switching activity, level conversion penalties, and delay targets, has been quantified.

References
[1] International Technology Roadmap for Semiconductors, 2001, http://public.itrs.net.
[2] A. Srivastava and D. Sylvester, Minimizing total power by simultaneous Vdd/Vth assignment, Proc. Asia-South Pacific Design Automation Conference, pp. 400-406, 2003.
[3] K. Usami, et al., Automated low-power technique exploiting multiple supply voltage applied to a media processor, IEEE J. Solid-State Circuits, pp. 463-472, March 1998.
[4] M. Takahashi, et al., A 60-mW MPEG4 video codec using clustered voltage scaling with variable supply-voltage scheme, IEEE J. Solid-State Circuits, pp. 1772-1780, Nov. 1998.
[5] M. Hamada, Y. Ootaguro, and T. Kuroda, Utilizing surplus timing for power reduction, Proc. Custom Integrated Circuits Conference, pp. 89-92, 2001.
[6] K. Usami and M. Horowitz, Clustered voltage scaling technique for low-power design, Proc. International Symposium on Low-Power Electronics Design, pp. 3-8, 1995.
[7] K. Roy, L. Wei, and Z. Chen, Multiple-Vdd & multiple Vth CMOS (MVCMOS) for low power applications, Proc. International Symposium on Circuits and Systems, pp. 366-370, 1999.
[8] C. Chen and M. Sarrafzadeh, Simultaneous voltage scaling and gate sizing for low-power design, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, pp. 400-408, June 2002.
[9] C. Yeh, et al., Gate-level design exploiting dual supply voltages for power-driven applications, Proc. Design Automation Conference, pp. 68-71, 1999.
[10] V. Sundararajan and K. Parhi, Synthesis of low power CMOS VLSI circuits using dual supply voltages, Proc. Design Automation Conference, pp. 72-75, 1999.
[11] S. Sirichotiyakul, et al., Stand-by Power Minimization through Simultaneous Threshold Voltage Selection and Circuit Sizing, Proc. Design Automation Conference, pp. 436-441, 1999.
[12] P. Pant, R. Roy, and A. Chatterjee, Dual-threshold voltage assignment with transistor sizing for low power CMOS circuits, IEEE Trans. on VLSI Systems, pp. 390-394, 2001.
[13] L. Wei, K. Roy, and C. Koh, Power minimization by simultaneous dual-Vth assignment and gate sizing, Proc. Custom Integrated Circuits Conference, pp. 413-416, 2000.
[14] T. Karnik, et al., Total power optimization by simultaneous dual-Vt allocation and device sizing in high performance microprocessors, Proc. Design Automation Conference, pp. 486-491, 2002.
[15] A. Dharchoudhury, et al., Transistor-level sizing and timing verification of domino circuits in the Power PC microprocessor, Proc. International Conference on Computer Design, pp. 143-148, 1997.
[16] J. Fishburn and A. Dunlop, TILOS: a posynomial programming approach to transistor sizing, Proc. International Conference on Computer-Aided Design, pp. 326-328, 1985.
[17] M. R. Bai and D. Sylvester, Analysis and design of level-converting flip-flops for dual-Vdd/Vth integrated circuits, IEEE Intl. Symp. on System-on-Chip, pp. 151-154, 2003.
[18] S. Ercolani, et al., Estimate of signal probability in combinational logic networks, Proc. European Test Conference, pp. 294-299, 1989.
[19] R. K. Krishnamurthy, et al., High-performance and low-power challenges in sub-70nm microprocessor circuits, Proc. Custom Integrated Circuits Conference, pp. 125-128, 2002.
[20] S. Thompson, et al., An enhanced 130nm generation logic technology featuring 60nm transistors optimized for high performance and low power at 0.7-1.4V, Proc. International Electron Devices Meeting, pp. 257-260, 2001.
[21] F. Ishihara, F. Sheikh, and B. Nikolic, Level conversion for dual-supply systems, Proc. International Symposium on Low-Power Electronics Design, 2003.
[22] F. Brglez and H. Fujiwara, A neutral netlist of 10 combinational benchmark circuits and a target translator in Fortran, Proc. International Symposium on Circuits and Systems, pp. 695-698, May 1985.
[23] Y. S. Dhillon, et al., Algorithm for achieving minimum energy consumption in CMOS circuits using multiple supply and threshold voltages at the module level, Proc. International Conference on Computer-Aided Design, pp. 693-700, 2003.
[24] D. Nguyen, et al., Minimization of Dynamic and Static Power Through Joint Assignment of Threshold Voltages and Sizing Optimization, Proc. International Symposium on Low-Power Electronics Design, pp. 158-163, 2003.
[25] User Guide. In Library Compiler User Manual, Synopsys, Inc., 2003.
[26] A. Srivastava, Simultaneous Vt selection and assignment for leakage optimization, Proc. International Symposium on Low-Power Electronics Design, pp. 146-151, 2003.
[27] V. Sundararajan and K. Parhi, Low power synthesis of dual threshold voltage CMOS VLSI circuits, Proc. International Symposium on Low-Power Electronics Design, pp. 139-144, 1999.
[28] K. Nose and T. Sakurai, Analysis and future trend of short-circuit power, IEEE Trans. on CAD of Integrated Circuits and Systems, pp. 1023-1030, Sep. 2000.
[29] D. Sylvester and K. Keutzer, System level performance modeling with BACPAC-Berkeley advanced chip performance calculator, Int. Workshop on System-Level Interconnect Prediction, pp. 109-114, 1999.
[30] J. Kao, S. Narendra and A. Chandrakasan, Subthreshold leakage modeling and reduction techniques, Proc. International Conference on Computer-Aided Design, pp. 141-148, 2002.
[31] M. Ketkar and S. S. Sapatnekar, Standby Power Optimization via Transistor Sizing and Dual Threshold Voltage Assignment, Proc. International Conference on Computer-Aided Design, pp. 375-378, 2002.
[32] S. H. Kulkarni and D. Sylvester, New level converters and level converting logic circuits for multi-VDD low power design, IEEE System-on-Chip (SOC) Conference, pp. 169-172, 2003.
[33] Ruchir Puri, personal communication.
[34] S. H. Ahmad, Simple enumeration of minimal cutsets of acyclic directed graphs, IEEE Trans. on Reliability, pp. 484-487, Dec. 1988.
[35] S. S. Sapatnekar, V. B. Rao, P. M. Vaidya, and S. M. Kang, An exact solution to the transistor sizing problem for CMOS circuits using convex optimization, IEEE TCAD, vol. 12, no. 11, pp. 1621-1634, Nov. 1993.
[36] S. S. Sapatnekar and W. Chuang, Power vs. delay in gate sizing: Conflicting objectives? Proc. International Conference on Computer-Aided Design, pp. 463-466, 1995.
[37] P. K. Chan, Algorithms for library-specific sizing of combinational logic, Proc. Design Automation Conference, pp. 353-356, 1990.
[38] W. Hung, et al., Total power optimization through simultaneous multiple-Vdd multiple-Vth assignment and device sizing and stack forcing, Proc. International Symposium on Low-Power Electronics Design, pp. 144-149, 2004.
