
Automatic Synthesis of Gated Clocks for Power Reduction in Sequential Circuits

L. Benini, P. Siegel, G. De Micheli

Center for Integrated Systems, Stanford University, Stanford, CA 94305

Abstract
With the proliferation of portable devices and increasing levels of chip integration, reducing power consumption is becoming of paramount importance. We describe a technique to automatically synthesize gated clocks for finite-state machines (FSMs) to reduce power in the final implementation. This technique recognizes self-loops in the FSM (either from the state diagram or from a synchronous network) and uses the function described by the self-loops to gate the clock. The clock activation function is then used as don't-care information to minimize the logic in the FSM for additional power savings. We applied these techniques to standard MCNC benchmarks and found an average reduction in power dissipation of 25%, at the cost of a 5% increase in area.

1 Introduction
As portable devices proliferate and device sizes continue to shrink, allowing more devices to fit on a chip, power consumption has taken on increased importance. Much recent work has focused on accurate estimation of power consumption and on its concomitant reduction at all levels of abstraction, from high-level synthesis down to physical layout [1, 2, 3, 4, 5, 6, 7]. Most power reduction techniques have emphasized reducing the level of activity in some portion of the circuit. We extend this research by concentrating on reducing the activity level of the clock by selectively stopping the clock. Because many sequential machines are implementations of reactive systems which wait for a certain event to occur before changing state, much power is wasted during this waiting period [8]. Latches are still clocked and the next-state function is computed, consuming unnecessary power since nothing can change until the requisite event arrives. By stopping the clock during this period, we can realize substantial savings in many finite-state machines (FSMs).

Final Submission to IEEE Design & Test Special Issue on Low-Power

The idea of selectively stopping the clock is not new; it is a technique commonly employed by designers of large systems as part of dynamic power management schemes [9, 10, 11]. However, in these schemes it is applied manually by the designer and is typically used at a coarse granularity to stop large functions during periods of inactivity. Our technique works automatically and at a much finer level of granularity by recognizing wait-states within FSMs and synthesizing circuitry to selectively stop the clock during these periods of no activity.

Some asynchronous techniques also base their operation on the idea of selective clocking [12]. In these architectures, the asynchronous FSM is clocked only when there is activity, so the power management aspects come for free. However, one cannot apply these techniques directly, since they require environmental constraints and constraints on the realization of the outputs that would not apply to synchronous systems.

We have developed algorithms to automatically synthesize the clock gating circuitry for a sequential circuit modeled either as an FSM state table or as a synchronous network. The techniques operate locally within an FSM; no information about the environment is required. However, designers can use environmental signals in conjunction with our technique to save more power. Our method uses the knowledge of the next-state function to generate a clock activation signal only when needed for the machine to perform a state transition. A local clock activation signal is automatically generated such that the machine is functionally equivalent to the original FSM implementation, with a reduction in power dissipation and a small increase in area and critical path delay.

Before describing the algorithms in detail, we describe our implementation style and the underlying assumptions on which we base our technique. Initial results from applying these algorithms to benchmark circuits show an average reduction in power dissipation of 25% with an accompanying small increase of 5% in area. Finally, we explore how this technique can be extended in new directions to realize additional power savings.

2 Background
2.1 Two-phase clocking
Many large VLSI circuits use a two-phase non-overlapping clocking scheme [13, 14, 15]. This method of clocking minimizes the clock-skew problems associated with a single-phase scheme at the expense of higher area. With a two-phase clocking scheme, timing problems can be eliminated just by increasing the clock period, because there is no bilateral constraint on the clock waveform as there is with a single-phase clock scheme. In this clocking style, a general sequential circuit is implemented as shown in Figure 1.

Figure 1: Two-phase state machine. (Diagram labels: inputs X, outputs Y, state S, stable-one ($s_1$) and stable-two ($s_2$) signals, latches clocked by $\phi_1$ and $\phi_2$, combinational logic, and clock waveforms with times $t_1$ and $t_2$.)

The FSM inputs and state variables are fed into the first set of latches, which are clocked on the first phase of the clock ($\phi_1$). These signals are called stable-one ($s_1$) signals, because they must be guaranteed to be set at their final value before the clock line $\phi_1$ becomes active and the corresponding latches store the values. The outputs are latched on the second phase of the clock, $\phi_2$. The inputs to the second set of latches are required to be set at their final stable value when the $\phi_2$ clock becomes active, and for this reason they are called stable-two ($s_2$) signals.

For a circuit to operate correctly under this clocking scheme, two conditions must hold. First, the two phases of the clock must be non-overlapping. In other words, they must never be active at the same time. As a result, clock skew must be carefully controlled to preserve this property throughout the chip. Second, the critical path length of the combinational logic between the two sets of latches, $t_{cl}$, must be such that the inputs to the final set of latches reach their final stable value before $\phi_2$ goes high. Referring to Figure 1, this can be expressed as the inequality $t_1 + t_{cl} < t_2$. These two requirements can be satisfied with adequate skew control by equalizing clock distribution delays, and by partitioning the logic between latch boundaries in such a way that the worst-case critical path is never too long.

Although for the bulk of this paper we assume that a two-phase clocking scheme is being used, simple extensions allow these ideas to be used with any clocking scheme, as we will show later.

2.2 Gated Clocks


Clocks are often gated with other signals to disable inactive parts of the system as a means to save power in designs [3, 11, 9]. The basic idea behind a gated clocking scheme is that the environment around a functional block can produce control signals that, when asserted, turn off the clock to the FSM, reducing the power dissipated on the clock lines and eliminating power dissipation in the internal FSM logic. Although gated clocks can increase clock skew, causing performance problems for high-performance designs, many synthesis tools provide effective clock equalization schemes that eliminate this problem [16, 17]. With the explosion in portable computing, gated clocks have been used in many recent designs to save a significant amount of power. For example, the Intel Pentium chip described in [9] uses gated clocking techniques to stop the clock while the processor is idle, resulting in a 2X average reduction in power consumption during actual system operation. A superscalar version of the IBM PowerPC chip uses gated clocks on selected portions of the chip, resulting in an average savings of 12-30% [11]. In the past, gated clocking schemes have been implemented manually, based on the designer's knowledge of the circuit and environment. In this paper, we propose an approach in which the gating signals are automatically generated, using only local information about the FSM.

2.3 Terminology
A Moore-type FSM can be described by a sextuple $(X, Y, S, s_0, \delta, \lambda)$, where $X$ is the set of inputs, $Y$ is the set of outputs, $S$ is the set of states, and $s_0$ is the initial (reset) state. The next-state function $\delta$ is given by:

$$ s_{t+1} = \delta(X, s_t) \qquad (1) $$

Finally, $\lambda$ is the output function:

$$ y_t = \lambda(s_t) \qquad (2) $$

We initially assume that the state diagram is implemented as a Moore machine. This assumption is not overly restrictive, because it is well known how to transform a Mealy machine into a Moore machine [18]. Although we initially assume that the state diagram of the circuit is known, we will relax that assumption later and show how we can extract the necessary information from a gate-level specification. The reactive behavior of a sequential machine is described by specifying in the state diagram for which inputs the machine transitions from one state to another. Typically, FSMs are specified with tables whose entries have the format

$$ E_{x_t, s_t} : (x_t, s_t) \rightarrow (s_{t+1}, y_{t+1}) \qquad (3) $$

where $x_t$ and $s_t$ are the inputs and present state, and $s_{t+1}$ and $y_{t+1}$ are the corresponding next state and outputs, respectively. For a completely-specified FSM, we define a self-loop function for each state, $\mathrm{Self}_s : X \times S \rightarrow \{0, 1\}$, which represents the set of self-loops from a given state $s$:

$$ \mathrm{Self}_s = 1 \quad \forall x \in X \ \text{where} \ \delta(x, s) = s \qquad (4) $$

In other words, self-loops in a table-based specification can be found from those entries in the table where $s_{t+1} = s_t$. Thus, we have a set of functions $\mathrm{Self}_s$, $s = 1, \ldots, |S|$, which define the set of self-loops for the entire FSM. For an FSM with many self-loops, often only those transitions for which $s_{t+1} \neq s_t$ are specified. In this case, the self-loops for a given state $s$ can be found by computing the complement of the union of all the entries specified for the present state $s$:

$$ \mathrm{Self}_s = \overline{\bigcup_{x} E_{x,s}} \qquad (5) $$
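As a minimal sketch of equations (4) and (5), the Python fragment below extracts the self-loop sets from a table-based specification. The dictionary representation, the function name `self_loops`, and the two-state example machine are assumptions made for illustration; they are not taken from the paper or from the benchmark circuits.

```python
from itertools import product

def self_loops(table, states, n_inputs):
    """Return {state: set of input vectors x with delta(x, state) == state}.

    `table` maps (x, present_state) -> next_state. It may list only the
    state-changing transitions; a missing entry is read as a self-loop,
    which corresponds to taking the complement of the specified entries.
    """
    all_inputs = set(product((0, 1), repeat=n_inputs))
    loops = {}
    for s in states:
        # Inputs that make the machine leave state s
        leaving = {x for x in all_inputs if table.get((x, s), s) != s}
        loops[s] = all_inputs - leaving
    return loops

# Hypothetical two-input machine that waits in each state for one event
table = {
    ((1, 1), "IDLE"): "RUN",   # start event
    ((0, 0), "RUN"): "IDLE",   # stop event
}
loops = self_loops(table, ["IDLE", "RUN"], n_inputs=2)
```

In this toy machine three of the four input vectors leave each state unchanged, which is the kind of reactive behavior the clock-gating technique targets.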

3 Removing Self-Loops from the State Diagram


Given the definitions in the previous section, we can easily identify and extract the self-loops from a given state diagram. We can then use the set of self-loops for a given FSM to define a Boolean function which is satisfied only when the machine is in a self-loop. We call this function the activation function, $f_a$, and we use it to selectively gate the clock for power savings, as is seen in Figure 2. Because the activation function can itself be large, we extract a reduced activation function, $F_a$, from $f_a$ which balances the savings from deactivating the clock during self-loops against the area penalty from creating the gating function. We also show how we can ignore glitches in this reduced activation function. We then use the reduced activation function, $F_a$, as don't-care information to reduce the combinational logic within the FSM itself to yield additional power savings. Finally, we describe how these techniques can be used with different clocking disciplines.

3.1 Activation Function Computation


We want to find an activation function $f_a$ to gate the clock during periods of inactivity in the state machine. The FSM can then be modified to use this activation function as shown in Figure 2. Because the activation function $f_a$ has as inputs signals that are stable-one, $f_a$ is also a stable-one signal that can be used to enable and disable the clock $\phi_1$. Function $f_a$ is latched and used to control the second phase of the clock, as shown in the figure. We define the activation function, $f_a : X \times S \rightarrow \{0, 1\}$, as the minimized cover of the function representing the self-loops in the FSM:

$$ f_a = \mathrm{min\_cover}\Big( \bigcup_{s \in S} \mathrm{Self}_s \Big) \qquad (6) $$

where we use the set of unreachable states in the FSM as a don't-care set to reduce the size of the cover.
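The role of the don't-care set in equation (6) can be illustrated with a small sketch that builds the ON-set and DC-set of the activation function as explicit sets of points. The state codes follow the example FSM (S0 = 10, S1 = 11, S2 = 01, with code 00 unused), but the self-loop sets and the helper names `activation_sets` and `is_valid_cover` are hypothetical and chosen only for illustration.

```python
from itertools import product

def activation_sets(loops, encoding, n_inputs, n_state_bits):
    """Build the ON-set and DC-set of f_a as sets of (x, state_code) points."""
    # ON-set: every (input, state code) pair that is a self-loop
    on_set = {(x, encoding[s]) for s, xs in loops.items() for x in xs}
    # DC-set: every point whose state code is not assigned to any state
    used = set(encoding.values())
    all_inputs = list(product((0, 1), repeat=n_inputs))
    dc_set = {(x, code)
              for code in product((0, 1), repeat=n_state_bits)
              if code not in used
              for x in all_inputs}
    return on_set, dc_set

def is_valid_cover(cover_points, on_set, dc_set):
    """A cover must contain the ON-set and nothing outside ON-set | DC-set."""
    return on_set <= cover_points <= on_set | dc_set

# Hypothetical self-loop sets; state codes as in the example FSM
loops = {"S0": {(0, 0)}, "S1": {(0, 0), (0, 1)}, "S2": set()}
encoding = {"S0": (1, 0), "S1": (1, 1), "S2": (0, 1)}
on_set, dc_set = activation_sets(loops, encoding, n_inputs=2, n_state_bits=2)
```

A minimizer is free to include any subset of `dc_set` in the cover, which is what allows larger cubes and fewer literals.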

Figure 2: Two-phase state machine with gated clock. (Diagram labels: latches, combinational logic, stable-one ($s_1$) and stable-two ($s_2$) signals, activation function $f_a$, clocks $\phi_1$ and $\phi_2$, and gated clocks $\phi_{1g}$ and $\phi_{2g}$.)


Example 1 Figure 3 shows the state diagram for a simple FSM. The entries that correspond to the set of self-loops for the FSM are shown along with the DC-set formed from the unreachable states. Thus, $f_a = x_1' x_2 + x_1 x_2' + x_1 s_1$ is the optimized cover for the set of self-loops, as shown.

Once the activation function has been found, we can implement it with the same methods used for synthesis of the combinational part of the FSM.
Example 2 The activation function in this example has a cost of 6 literals. The FSM, on the other hand, was implemented at a cost of 23 literals. (The multi-level implementation of the FSM logic was generated automatically by a logic synthesis program; we omit the details for brevity.) In this case, the cost of the activation function is a significant fraction of the cost of the combinational logic in the FSM (> 25%). The overhead may take away some of the power savings from reducing the clocking activity by gating the clock with the activation function.

As we saw from the example, there is no guarantee that the activation function will be efficiently implementable. Since the complexity of the activation function may be of the order of the complexity of the combinational part of the FSM, we must find a method to reduce the size of the activation function to realize the most power savings. We want to find a function with smaller overhead that covers the self-loops that consume the most power, in an attempt to balance the power savings gained from stopping the clock against the power consumed by the clock-stopping function. Our problem can be stated informally as follows:

Problem 1: Given an activation function $f_a$ for a specific FSM, find a function $F_a \subseteq f_a$ that has an implementation with small overhead, but still covers most of the self-loops.

This problem is at least as complex as standard two-level logic minimization, a well-known difficult problem for which a polynomial-time algorithm is not known to exist. As a result, we must devise a heuristic solution.

Figure 3: State diagram and corresponding activation function. (Diagram omitted: state diagram with state codes S0 = 10, S1 = 11, S2 = 01, the state table, the self-loop and unreachable-state (don't-care) entries, and the result of logic minimization, $f_a = x_1' x_2 + x_1 x_2' + x_1 s_1$.)

Our main difficulty in solving the problem is selecting a cost function that results in good power savings. This cost function must express the trade-off between the size of $F_a$ and its efficiency in stopping the clock. By using an $F_a$ with a large number of cubes in its cover, we are likely to cover many self-loops, but at a cost of additional power consumption due to the size of the function. If $F_a$ is too small, on the other hand, then it may not cover enough of the cases where the state machine is in a wait-state, and we again won't realize any power savings. Our conjecture is that by selecting $F_a$ from a subset of the cubes in $f_a$ we will maximize the reduction in power dissipation. A reasonable choice is to use the area overhead of the solution (in terms of literal count) as a rough approximation of the additional power consumption. This approximation has yielded good results for many example circuits. At this point, the overall algorithm used to find $F_a$ can be outlined:

    Compute_Fa(FSM) {
        extract fa;                                /* find activation function */
        ON-set(Fa) = fa;                           /* start with activation function */
        DC-set(Fa) = unreachable_states(FSM);      /* use same DC set as for fa */
        /*
         * Find a smaller function which covers most of
         * the self-loops, but costs less.
         */
        reduce_cover(Fa);
    }

Looking at the overall algorithm, we can see that the function reduce_cover is called to iteratively reduce the size of the activation function. The best balancing of trade-offs comes from using the actual transition probabilities to reduce $f_a$ so that the loops with the highest probability are covered by the reduced function. However, knowledge of the transition probabilities is not always available to the designer, since it requires knowledge of the environment in which the FSM is to operate. Thus, we would like to come up with a good approximation that is based only on knowledge of the functional specification of the FSM.

Our approximation involves a constraint on the number of literals in the implementation of $F_a$. We specify a literal threshold, $LT$, as the upper bound for the number of literals $F_a$ should have. We would like to restrict $F_a$ to a fraction of the literals in the combinational part of the FSM to ensure that the activation function has a reasonable size and will result in power savings. Using $LT$ as a bound, we can specify the constrained optimization problem. Our approximation to the original problem can be stated as follows:

Problem 2: Given an activation function $f_a$, we want to find a function $F_a$ such that $\mathrm{cover}(F_a) \subseteq \mathrm{cover}(f_a)$ contains the maximum number of minterms of $f_a$ subject to the constraint that $\mathrm{Nlits}(F_a) < LT$, where $\mathrm{Nlits}(F_a)$ is the number of literals in the Boolean expression for $\mathrm{cover}(F_a)$.

In other words, we want to approximate the original activation function by a subset of the original cover that contains the largest number of self-loops and fits within the constraint on the maximum number of literals. Note that $LT$ can either be specified by the user based on knowledge of the architecture and the constraints, or it can be calculated automatically based on the structure of the FSM. In either case, selecting an appropriate value for $LT$ is difficult because at the time the decision is to be made, there is no data on the final circuit implementation.

One simple approach for estimating $LT$ is to initially set $LT$ to a percentage of the total number of literals in the combinational part of the original FSM (i.e., without the activation function). A more computationally expensive approach is to start from the complete activation function (i.e., set $LT = \infty$), and iteratively resynthesize for decreasing values of $LT$. However, because the synthesis step is time-consuming, this approach becomes impractical for large FSMs. The selection of an optimal value for $LT$ is still an open problem.

Problem 2 can be solved using algorithms of varying degrees of complexity. We used a simple greedy heuristic to solve it, which we now describe. The heuristic performs iterative elimination of cubes in the original minimized cover of $f_a$ until the literal threshold has been reached.

Figure 4: Activation functions for example FSM. (Karnaugh maps over $x_1 x_2$ and $s_1 s_2$ omitted.) a) $f_a = x_1' x_2 + x_1 x_2' + x_1 s_1$; b) $F_a = x_1' x_2 + x_1 x_2'$.


    reduce_cover(Fa) {
        compute LT;                                /* determine literal threshold */
        while (Nlits(Fa) > LT) {
            E = select_small_cubes(ON-set(Fa));
            c = select_less_essential(E, Fa);
            Fa = Fa - c;
        }
    }

In the algorithm above, after determining the literal threshold, the following steps are iteratively performed until the number of literals in the cover of $F_a$ no longer exceeds the literal threshold. First, the function select_small_cubes() selects the subset, $E$, of the cubes in $F_a$ that have the highest number of literals (i.e., the smallest cubes). Next, the function select_less_essential() selects from $E$ the cube, $c$, that is the most covered by the other remaining cubes in $F_a$. In case of a tie, a tie-break rule based on the number of occurrences of the specified literals is used, the rationale being that we want to keep the number of occurrences of each literal as uniform as possible for uniform input loading. Finally, the cube $c$ is eliminated from $F_a$ and the iteration repeats until the condition is satisfied.
Example 3 In Figure 4a, the complete minimized cover of $f_a$ is given. We want to reduce its size. Because all cubes in the cover have the same size, $E$ contains the entire cover (the three cubes $x_1' x_2$, $x_1 x_2'$, $x_1 s_1$). select_less_essential determines that the second and third cubes are partially covered by other cubes in the cover, while the first cube is essential. To choose between the two partially redundant cubes, the algorithm selects the cube that keeps the most uniform input loading. Thus, $c$ is set to the third cube, $x_1 s_1$, reducing the number of input variables by 1, and this cube is eliminated from the cover. The size of the reduced function, $F_a$, is now 4 literals, as shown in Figure 4b.
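The greedy heuristic can be sketched in Python and run on the cover of Example 3. This is an illustrative reading, not the authors' implementation: cubes are dictionaries from variable name to 0 or 1, the threshold LT = 4 is picked so that the run matches the example, and the tie-break (smallest spread between the most and least frequent remaining variables) is one assumed way to make the "uniform input loading" rule concrete.

```python
from itertools import product

VARS = ("x1", "x2", "s1", "s2")

def minterms(cube):
    """Enumerate the minterms of a cube given as {variable: 0 or 1}."""
    free = [v for v in VARS if v not in cube]
    points = set()
    for bits in product((0, 1), repeat=len(free)):
        point = dict(cube)
        point.update(zip(free, bits))
        points.add(tuple(point[v] for v in VARS))
    return points

def nlits(cover):
    """Number of literals in a two-level cover."""
    return sum(len(cube) for cube in cover)

def select_small_cubes(cover):
    """Cubes with the most literals, i.e. the smallest cubes."""
    most = max(len(cube) for cube in cover)
    return [cube for cube in cover if len(cube) == most]

def loading_spread(cover):
    """Assumed tie-break metric: max minus min variable occurrence count."""
    counts = {}
    for cube in cover:
        for v in cube:
            counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) - min(counts.values()) if counts else 0

def select_less_essential(candidates, cover):
    """Pick the candidate most covered by the other cubes in the cover."""
    def key(cube):
        others = [d for d in cover if d is not cube]
        covered = set().union(*(minterms(d) for d in others))
        overlap = len(minterms(cube) & covered)
        # Larger overlap first, then the more uniform remaining cover
        return (-overlap, loading_spread(others))
    return min(candidates, key=key)

def reduce_cover(cover, lt):
    """Drop cubes until the literal count no longer exceeds lt."""
    cover = list(cover)
    while cover and nlits(cover) > lt:
        c = select_less_essential(select_small_cubes(cover), cover)
        cover = [d for d in cover if d is not c]
    return cover

# Cover of f_a from Example 3: x1'x2 + x1x2' + x1s1
fa = [{"x1": 0, "x2": 1}, {"x1": 1, "x2": 0}, {"x1": 1, "s1": 1}]
Fa = reduce_cover(fa, lt=4)
```

Under these assumptions the run removes the cube $x_1 s_1$ and leaves the 4-literal cover $x_1' x_2 + x_1 x_2'$, in agreement with Figure 4b.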

Many approximations have been made in the formulation of this optimization problem. In reality, the implementation of $f_a$ will make use of multilevel logic, but there is only a weak correlation between the size of a multilevel implementation and its corresponding minimal two-level cover. Additionally, the size of the implementation of the activation function is only weakly correlated to the total size of the modified FSM. Finally, because power dissipation is only weakly correlated to the total area, it is often the case that an activation function with a large number of literals will allow power savings that overcome a large increase in area.

Rather than using our simple greedy algorithm, we could use a more sophisticated algorithm to come up with a better solution. In fact, we can cast the problem as a 0-1 knapsack problem. In this case, the items to choose are the cubes, their usefulness is related to the number of don't-care entries, and the weight is related to the number of specified literals. The 0-1 knapsack problem can be solved exactly in pseudo-polynomial time by a dynamic programming algorithm [19]. However, a more sophisticated algorithm will rarely lead to an appreciably better solution, since we are forced to use an approximate formulation in the first place.

There is a trade-off between the number of self-loops covered and the implementation area. In particular, the efficiency of $F_a$ in reducing the power dissipation decreases as we select smaller subsets of the original cover. Different points of this trade-off curve can be explored through iteration using different choices of $LT$, as we will show in the results section.
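The knapsack formulation mentioned above can be sketched with the textbook dynamic program. The function below is a generic 0-1 knapsack solver, not code from the paper; treating each cube's value as additive is an assumption that ignores overlap between cubes, and the numeric instance is a standard illustration rather than data from the benchmarks.

```python
def knapsack_select(values, weights, capacity):
    """Return (best_value, chosen_indices) for the 0-1 knapsack problem.

    For cube selection, weights would be literal counts, capacity would be
    the literal threshold LT, and values a per-cube usefulness estimate.
    """
    n = len(values)
    # best[i][w] = best value using the first i items within weight w
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(capacity + 1):
            best[i][w] = best[i - 1][w]
            if weights[i - 1] <= w:
                take = best[i - 1][w - weights[i - 1]] + values[i - 1]
                if take > best[i][w]:
                    best[i][w] = take
    # Walk back through the table to recover which items were taken
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if best[i][w] != best[i - 1][w]:
            chosen.append(i - 1)
            w -= weights[i - 1]
    return best[n][capacity], sorted(chosen)

# Three cubes of 2 literals covering 4 minterms each, with LT = 4
value, cubes = knapsack_select([4, 4, 4], [2, 2, 2], capacity=4)
```

The table has (n + 1) x (LT + 1) entries, so the running time grows with the numeric value of LT rather than with the size of its encoding.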

3.2 Timing Considerations


There are two timing issues related to the insertion of the activation function that we must address. We must examine the effect of glitches within the clock generation circuitry, and we must consider how the presence of the activation function affects the critical-path timing.

A hazard is an unwanted glitch on the output of a gate in response to input changes [20]. Hazards in synchronous systems consume excess power but do not cause the circuit to malfunction, because the signals stabilize by the time they are sampled. However, hazards in the clock generation circuitry may cause the circuit to operate incorrectly. As a result, we must examine the hazard behavior of $F_a$. Because the clock signal is gated by $F_a$, a hazard on $F_a$ may have catastrophic consequences on the internal clock line. However, due to our assumption of two-phase clocking, the NAND gate feeding the clock will remain low independently of the value of $F_a$ as long as $\phi_1$ remains low. So as long as we can ensure that $F_a$ has settled before the leading edge of $\phi_1$, we do not need to worry about hazards in $F_a$. This is an important safety property, because it ensures that the design style we propose is not sensitive to the complex hazardous behavior of the combinational circuit implementation.

Figure 5: Timing requirements for activation function stabilization. (Waveform labels: $\phi_1$, $\phi_2$, inputs, $f_a$, $T_{in}$, $T_{F_a}$, $T_{max}$.)

However, it is important to note that the presence of the activation function modifies the critical path of the circuit. In fact, the longest delay through $F_a$ adds to the maximum delay in the stage of the sequential circuit that precedes the FSM under consideration, as illustrated in Figure 5. Recall that the activation function has as inputs the FSM inputs, which are sampled on $\phi_1$. These input changes must propagate through the activation function logic before $\phi_1$ goes high, thus reducing the maximum allowed delay in the logic feeding the inputs by $T_{F_a}$. In other words, the following timing constraint must be obeyed:

$$ T_{in} + T_{F_a} < T_{max} \qquad (7) $$

where $T_{in}$ is the delay in the logic feeding the inputs and $T_{F_a}$ is the delay in the activation logic. As a result, particular care must be taken when applying our ideas to circuits with cycle times tightly matched to the critical path delay. Nevertheless, when a two-phase clocking scheme is used, timing violations can be eliminated simply by increasing the cycle time by the amount needed, so no complex and difficult redesign effort is required.

3.3 Extensions to Reduce the Size of the Combinational Portion of the FSM
Generating $F_a$ increases the area used. Some of this overhead can be recovered by observing that the presence of $F_a$ allows us to use a larger DC-set for the simplification of the combinational part of the FSM. In particular, the DC-set of every next-state and output function in the FSM can be increased by the ON-set of $F_a$. This is because for each minterm covered by the activation function, the machine will be inactivated by $F_a$, and consequently the inputs and state values satisfying $F_a$ will never be observed at the inputs to the combinational part. We can then use the extended DC-set to recover some of the area in the combinational part of the FSM. Thus there are times when keeping an $F_a$ with many literals is advantageous. More formally,

$$ DC_{F_a}(\delta) = DC(\delta) \cup F_a \qquad (8) $$

$$ DC_{F_a}(\lambda) = DC(\lambda) \cup F_a \qquad (9) $$

Example 4 Continuing with our example circuit, we can use either $f_a = x_1' x_2 + x_1 x_2' + x_1 s_1$ or $F_a = x_1' x_2 + x_1 x_2'$ as a DC-set to reduce the logic in the combinational part of the FSM. Using $f_a$ as the DC-set, we can reduce the number of literals to 9. Using the smaller function, $F_a$, as the DC-set, we reduce the FSM logic to 10 literals.

3.4 Implicit Generation of the Activation Function


In real-life applications, sequential circuits are often generated from specification styles other than a state diagram, and extracting a complete state diagram from a given circuit involves a computation that is worst-case exponential in the number of storage elements [21]. So, as a last extension, we relax our assumption that the state diagram of the FSM is known, and we show that $f_a$ can be generated directly from a gate-level specification of the circuit. From the definition of the next-state function $\delta(X, s)$, we know that the machine is in a self-loop of the state diagram when:

$$ \delta(X, s) = s \qquad (10) $$

Equivalently, for each bit $i$ in the state vector:

$$ \delta_i(X, s) \,\overline{\oplus}\, s_i = 1 \qquad (11) $$

where the symbol $\overline{\oplus}$ represents the exclusive-nor operation. Because we want the above condition to be true for all the bits in the state vector, our final equation is:

$$ \prod_i \big( \delta_i(X, s) \,\overline{\oplus}\, s_i \big) = 1 \qquad (12) $$
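Equation (12) can be checked on a small circuit by brute-force enumeration. The sketch below is only illustrative: the paper relies on BDD-based symbolic manipulation, whereas this fragment enumerates every input and state vector, which is feasible only for tiny examples. The function name and the one-bit load-enable register are hypothetical.

```python
from itertools import product

def activation_from_next_state(next_state_bits, n_inputs, n_state_bits):
    """Return the set of (x, s) points where every next-state bit equals
    the corresponding present-state bit, i.e. where the product over i of
    (delta_i XNOR s_i) is 1."""
    on_set = set()
    for x in product((0, 1), repeat=n_inputs):
        for s in product((0, 1), repeat=n_state_bits):
            if all(delta_i(x, s) == s[i]
                   for i, delta_i in enumerate(next_state_bits)):
                on_set.add((x, s))
    return on_set

# Hypothetical 1-bit register: x = (data, load); it holds its value
# unless load is asserted.
delta_0 = lambda x, s: x[0] if x[1] else s[0]
fa_points = activation_from_next_state([delta_0], n_inputs=2, n_state_bits=1)
```

For this register the clock can be stopped whenever load is 0, and also when the loaded data equals the stored bit, giving 6 of the 8 possible points.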

Thus, the activation function can easily be generated using BDD-based symbolic manipulation of logic networks [22], even if the state diagram is too large to be explicitly represented. Recent research on FSM synthesis and verification [21] has shown that the set of unreachable states can be calculated with various degrees of safe approximation (meaning that only a subset of the unreachable states is generated, but a reachable state will never be marked as unreachable). We can use this information to add to the DC-set for the activation function (as we described earlier), thus allowing a more efficient implementation in the implicit case as well. In conclusion, we have shown that we do not need the complete state diagram to generate $f_a$. This property greatly widens the range of applicability of our method, and makes it suitable for resynthesis and low-power optimization of existing large sequential circuits.

Figure 6: Single-phase clocking scheme with gated clock. (Diagram labels: input and output latches, combinational logic, signals In, Lin, Out, Lout, activation function $f_a$, clock $\phi$, and gated clock $\phi_g$.)

3.5 Extensions for Other Clocking Schemes


Although we have based our initial formulation on a two-phase clocking discipline, it is easy to extend these techniques to apply to other clocking schemes. For single-phase clocking schemes employing transparent latches, we must add the delay through the activation function to the delay of the functions that feed the inputs to the FSM and use this figure to determine whether the cycle-time constraint is violated (see Figure 6). Additional care must be taken to ensure that the gated clocks match the clock skew of ungated clocks elsewhere in the circuit, because single-phase clocks are much more sensitive to clock skews than two-phase clocks.

4 Implementation and Results


The ideas and algorithms described in the preceding section have been implemented as part of Pie, a toolset for low-power synthesis that is under development at Stanford University. First, the description of the circuit at the state-diagram level is read, and information on the self-loops is extracted. The set of unreachable states is then easily extracted from the state diagram and is used as the DC-set for the activation function. The initial cover of $f_a$ is then optimized using SIS [23] and a standard optimization procedure (script.rugged). The size of the implementation obtained is then compared with the size of the optimized implementation of the combinational part of the machine. If $f_a$ is too expensive to be implemented in its entirety (i.e., its size exceeds the threshold $LT$), the reduction algorithm reduce_cover is applied, and the size of $f_a$ is iteratively reduced until the final optimized implementation $F_a$ is found. At this point $F_a$ is used as an additional DC-set for the optimization of the combinational part of the FSM, and the combinational portion of the FSM is passed to SIS for logic minimization. The Ceres technology mapper [24] is then used to map the combinational portions of the design to get the final multilevel implementation. Clock skew equalization (through buffer insertion) is done automatically in a post-processing step.

This implementation is then used as the basis for power estimation. The data on power consumption is obtained using an accurate switch-level simulation based on a version of IRSIM with power estimation capabilities. The average power consumption is calculated using a large number of random traces at the inputs of the circuit. Note that to obtain accurate power estimates, simulation must be done at the transistor level, since power dissipation in the clock lines is not accurately accounted for in a gate-level simulation.
Example 5 For our example circuit, we have synthesized the normal implementation along with two implementations using the activation functions $f_a$ and $F_a$. The original FSM dissipates an average power of 51 µW. The version using the complete $f_a$ dissipates 27 µW, while the version using $F_a$ dissipates 37 µW. In this case, using the complete activation function to simplify the FSM logic results in the lowest power consumption. For this example, we obtain a decrease in area for both low-power implementations, although this is not normally to be expected. The original FSM used 128 transistors, the version with $f_a$ uses 118 transistors, and the version with $F_a$ is the smallest with 110 transistors. Because the state diagram of the machine used in this example does not have many state transitions, the size of the DC-set used to optimize the logic is large, resulting in a slight reduction in area after logic optimization.

Table 1 gives the results of running our tool on some sequential circuits from the MCNC benchmark suite. We report the average power dissipation figures, the power advantage obtained (expressed as the ratio between the average power dissipation of the gated clock implementation and the power dissipation of the traditional implementation), and the area overhead as measured by the number of transistors in the implementations (including clock circuitry, the activation function and the FSM implementation). The quality of the results depends on the threshold LT; in the table we have reported the figures corresponding to the literal threshold that resulted in the most power savings.
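The metrics described here are simple ratios, and a short sketch may make them concrete. The block below uses three rows and the reduction column of Table 1; the variable and function names are hypothetical and only serve the illustration.

```python
from statistics import mean

# (circuit, original size, original power, gated size, gated power),
# copied from three rows of Table 1; sizes are transistor counts.
ROWS = [
    ("ex", 128, 51, 118, 27),
    ("bbara", 348, 328, 390, 127),
    ("sand1", 2220, 265, 2108, 180),
]

# Power Reduction (%) column of Table 1, in table order.
REPORTED_REDUCTION_PCT = [48, 11, 61, 19, 15, 21, 44, 1, 10, 17, 32]


def power_ratio(row):
    """Gated power divided by original power (lower is better)."""
    return row[4] / row[2]


def area_ratio(row):
    """Gated size divided by original size (above 1 means area overhead)."""
    return row[3] / row[1]


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row[0]:6s} power ratio {power_ratio(row):.2f}  "
              f"area ratio {area_ratio(row):.2f}")
    print(f"mean reported reduction: {mean(REPORTED_REDUCTION_PCT):.1f}%")
```

The mean of the reported reduction column comes out at about 25%, in line with the average quoted in the abstract.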

Final Submission to IEEE Design & Test Special Issue on Low-Power

            Original                Gated
Circuit     Size    Power (µW)      Size    Power (µW)      Power Reduction (%)
ex           128        51           118        27                 48
bbsse        966       212          1002       190                 11
bbara        348       328           390       127                 61
bbtas        178        62           188        50                 19
sse          912       175           946       150                 15
s386         856       179           882       140                 21
cse         1320       125          1406        70                 44
dk14         758       212           852       211                  1
s27          146        60           178        54                 10
mc           182        73           225        61                 17
sand1       2220       265          2108       180                 32

(Size is the number of transistors.)


Table 1: Power reduction of gated clocking applied to MCNC benchmarks

The results of applying our tool to these benchmarks depend upon the structure of the initial state machine. For example, there are a number of circuits with few or no self-loops, such as counters. Our method will obviously not affect the power of these circuits, since the clock can never be stopped. More generally, the power reduction seen from using our method depends on how closely the machine approximates the reactive behavior described in the introduction. If state transitions occur only for a small fraction of the possible input vectors, our method gives very impressive results. However, for counter-like machines, the advantage is very small or nonexistent.

These results can be considered as an unbiased sample of a larger set. For some examples, the advantage is small, but for the majority of cases we get good results. It is also important to notice how the use of Fa in the DC set of the FSM allows us to recover some of the area overhead imposed by the additional logic in the activation function. Some of the examples have very small area overhead but show a substantial power reduction.

All steps of the algorithm can be performed in a time-efficient fashion, with the bottleneck being the logic minimization of the combinational portion of the FSM using Fa as a don't-care set. Consequently, this technique can easily be added to existing FSM synthesis methods to realize lower power implementations.

We performed some more detailed electrical analyses to verify that the presence of a gate

Figure 7: Plot of the power ratio (P_low-power / P_normal) versus the area ratio (A_low-power / A_normal) for three MCNC benchmarks (panels s27, mc and sand1; power ratio on the vertical axis, area ratio on the horizontal axis). The points marked Min correspond to implementations with a greatly reduced Fa. The Max points correspond to implementations with the complete fa, while the Med points are obtained from a slightly reduced Fa.

on the clock signals was not creating any incorrect behavior due to hazards, skew or other unforeseen electrical phenomena. For every circuit we analyzed, the functional equivalence with the machine without gated clock was complete, and the minimum clock cycle was established by the combinational logic of the FSM. This result can be misleading, because in a real circuit the machine will be embedded in a bigger structure, and timing problems can arise if the input signals are delayed by combinational logic blocks belonging to the environment.

As was seen in previous sections, the choice of Fa has a strong impact on the power savings in the final implementation. Figure 7 shows the tradeoffs graphically for three MCNC benchmarks, which reflect different typical behaviors across a larger set of benchmarks. In the figure, the Max point on each curve corresponds to the use of the complete activation function, fa. The Min point on each curve corresponds to the use of a greatly reduced activation function Fa, and the Med point corresponds to a partially reduced Fa.

For MCNC benchmark s27, the power consumed by fa completely overwhelms any power saved in the FSM, resulting in no power savings. If we reduce the activation function too much, as shown by the leftmost point on the curve, few self-loops are incorporated into the gating function, and again we realize little power savings. However, by choosing a slightly reduced activation function we can realize substantial savings. In contrast, benchmarks mc and sand1 behave monotonically. For mc, the size of the activation function results in an area penalty, while reducing its size increases the power dissipation. If the main concern in the design were to reduce power consumption, the Max point would be the right choice, while an area-constrained low-power implementation could be obtained using Med. Finally, for sand1 the complete activation function is the optimal choice for both power and area. This is interesting because it shows that in some cases a large activation function may allow drastic simplification of the combinational logic of the FSM because of the large DC set that can be used.
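The choice among the Min, Med and Max implementations can be phrased as picking the lowest power ratio that fits an area budget. The sketch below illustrates this selection; the function name is hypothetical, and the three points are illustrative placeholders with the monotonic shape described for mc, not values read from Figure 7.

```python
def pick_implementation(points, max_area_ratio=None):
    """Return the point with the lowest power ratio whose area ratio does not
    exceed max_area_ratio (no area limit when max_area_ratio is None)."""
    feasible = [
        p for p in points
        if max_area_ratio is None or p["area_ratio"] <= max_area_ratio
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p["power_ratio"])


# Placeholder numbers, chosen only so that power falls as area grows.
MC_LIKE_POINTS = [
    {"label": "Min", "area_ratio": 1.10, "power_ratio": 0.93},
    {"label": "Med", "area_ratio": 1.16, "power_ratio": 0.88},
    {"label": "Max", "area_ratio": 1.24, "power_ratio": 0.83},
]

if __name__ == "__main__":
    print(pick_implementation(MC_LIKE_POINTS)["label"])        # power only
    print(pick_implementation(MC_LIKE_POINTS, 1.20)["label"])  # area budget
```

With no area limit the selection falls on the Max point, and with a tighter budget it falls back to Med, which mirrors the discussion of mc above. For a non-monotonic curve such as the one described for s27, the same selection would favor an intermediate point.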

5 Conclusions
We have presented an automatic technique to generate logic for gated clocks within sequential machines. This logic stops the clock at points of inactivity, or self-loops, in the state machine operation. As a result, power dissipation is reduced, both because of the cessation of activity on the local clock and because of the reduction of activity in the combinational portion of the state machine. We showed that as a byproduct of the two-phase clocking scheme, hazards on the gating function do not need to be considered. We presented a method to further reduce the area overhead due to the gating function by utilizing the gating function to reduce the logic in the FSM. We also showed how these methods can be effectively applied to the resynthesis of existing FSM implementations where the state diagram is not known. Finally, we showed how these techniques can easily be applied to other popular clocking disciplines.

These techniques can also be extended to pipelined circuits. With pipelined circuits, no state information is necessary. The nature of pipelined design allows observation of the inputs one cycle in advance of the corresponding outputs changing. We can thus generate a reduced activation function from a subset of the function defined by the XNOR of the combinational function of the inputs and the outputs of a given stage.

We have implemented the algorithms described in this paper, and we ran our tool on several benchmarks from the MCNC suite. We found that in many cases the power dissipation was substantially reduced with a small increase in area. Specifically, initial results from applying these algorithms to benchmark circuits showed an average reduction in power dissipation of 25% with an accompanying small increase of 5% in area. We also showed that state machines with few or no self-loops, such as counters, will not benefit from this technique.

Although we have given ideas on how to extend these techniques to clocking schemes other than two-phase non-overlapping, there is still work to be done to verify that hazards are not an issue for schemes employing edge-triggered flip-flops. We are currently addressing this problem, and we hope to take advantage of hazard-free logic minimization techniques, such as those presented in [25], to generate a glitch-free clocking function.
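The pipelined extension mentioned above can be sketched behaviorally. The block below assumes a single stage whose register is `width` bits wide, and stops the stage clock whenever the bitwise XNOR of the newly computed value and the stored value is all ones, i.e. when nothing would change. It uses the complete XNOR condition rather than a reduced subset, and all names are hypothetical.

```python
def hold_condition(next_value, current_value, width):
    """True when every bit of XNOR(next_value, current_value) is 1, so
    clocking the stage register would not change its contents."""
    mask = (1 << width) - 1
    xnor = ~(next_value ^ current_value) & mask
    return xnor == mask


def simulate_stage(stage_fn, inputs, width, reset=0):
    """Apply stage_fn to each input and clock the register only when the
    hold condition is false. Returns (final register, clock events)."""
    mask = (1 << width) - 1
    register = reset & mask
    clock_events = 0
    for x in inputs:
        next_value = stage_fn(x) & mask
        if not hold_condition(next_value, register, width):
            register = next_value  # clock enabled for this cycle
            clock_events += 1
    return register, clock_events


if __name__ == "__main__":
    # Illustrative stage function: drop the least significant input bit.
    final, events = simulate_stage(lambda x: x >> 1, [2, 3, 3, 6, 7, 0], 4)
    print(final, events)
```

In this toy run the register is clocked in only three of the six cycles. A practical implementation would use a subset of the XNOR condition, as the text notes, to keep the gating logic small.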

6 Acknowledgments
The authors would like to thank Prof. Steve Nowick of Columbia University for his comments on this paper. This work was supported by ARPA and NSF under contract MIP 9115432 and by SRC under Contract No. 92-DJ-205.

References
[1] L. Benini, M. Favalli, P. Olivo, and B. Ricco, "A novel approach to cost-effective estimate of power dissipation in CMOS ICs," in EDAC, Proceedings of the European Design Automation Conference, pp. 354–360, Feb. 1993.
[2] L. Benini and G. De Micheli, "State assignment for low power dissipation," in CICC, Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 136–139, May 1994.
[3] A. Chandrakasan, S. Sheng, and R. Brodersen, "Low-Power CMOS Digital Design," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473–484, Apr. 1992.
[4] D. Liu and C. Svensson, "Trading speed for low power by choice of supply and threshold voltages," IEEE Journal of Solid-State Circuits, vol. 28, no. 1, pp. 10–17, Jan. 1993.
[5] K. Roy and S. Prasad, "Circuit activity based logic synthesis for low power reliable operations," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 1, no. 4, pp. 503–513, Dec. 1993.
[6] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On average power dissipation and random pattern testability of CMOS combinational logic networks," in ICCAD, Proceedings of the International Conference on Computer-Aided Design, pp. 402–407, Nov. 1992.
[7] C. Tsui, M. Pedram, and A. Despain, "Technology decomposition and mapping targeting low power dissipation," in DAC, Proceedings of the Design Automation Conference, pp. 68–73, 1993.
[8] N. Yeung, et al., "The design of a 55SPECint92 RISC processor under 2W," in IEEE International Solid-State Circuits Conference, pp. 206–207, Feb. 1994.
[9] J. Schutz, "A 3.3V 0.6 µm BiCMOS superscalar microprocessor," in IEEE International Solid-State Circuits Conference, pp. 202–203, Feb. 1994.
[10] B. Suessmith and G. Paap III, "PowerPC 603 microprocessor power management," Communications of the ACM, no. 6, pp. 43–46, June 1994.
[11] D. Pham, et al., "A 3.0W 75SPECint92 85SPECfp92 superscalar RISC microprocessor," in IEEE International Solid-State Circuits Conference, pp. 212–213, Feb. 1994.
[12] S. M. Nowick and D. L. Dill, "Synthesis of asynchronous state machines using a local clock," in ICCD, Proceedings of the International Conference on Computer Design, pp. 192–197, IEEE Computer Society Press, 1991.
[13] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design (Second Edition). Addison-Wesley, 1992.
[14] B. Gordon and T. H.-Y. Meng, "A low power subband video decoder architecture," in International Conference on Acoustics, Speech, and Signal Processing, 1994.
[15] M. Horowitz, et al., "MIPS-X: a 20-MIPS peak, 32-bit microprocessor with on-chip cache," IEEE Journal of Solid-State Circuits, no. 5, pp. 790–799, Oct. 1987.
[16] M. Jackson, A. Srinivasan, and E. Kuh, "Clock routing for high performance ICs," in DAC, Proceedings of the Design Automation Conference, 1990.
[17] T.-H. Chao, Y.-C. Hsu, and J.-M. Ho, "Zero skew clock net routing," in DAC, Proceedings of the Design Automation Conference, 1992.
[18] J. Hartmanis and H. Stearns, Algebraic Structure Theory of Sequential Machines. Prentice-Hall, 1966.
[19] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms. McGraw-Hill, 1990.
[20] E. J. McCluskey, Logic Design Principles With Emphasis on Testable Semicustom Circuits. Prentice-Hall, 1986.
[21] H. Cho and F. Somenzi, "Sequential logic optimization based on state space decomposition," in EDAC, Proceedings of the European Design Automation Conference, pp. 200–204, Feb. 1993.
[22] R. E. Bryant, "Graph-based algorithms for boolean function manipulation," IEEE Transactions on Computers, vol. 35, no. 8, pp. 677–691, Aug. 1986.
[23] E. Sentovich, et al., "Sequential circuit design using synthesis and optimization," in ICCD, Proceedings of the International Conference on Computer Design, pp. 328–333, Oct. 1992.
[24] F. Mailhot and G. De Micheli, "Algorithms for technology mapping based on binary decision diagrams and on boolean operations," IEEE Transactions on CAD/ICAS, pp. 599–620, May 1993.
[25] S. M. Nowick and D. L. Dill, "Exact two-level minimization of hazard-free logic with multiple-input changes," in ICCAD, Proceedings of the International Conference on Computer-Aided Design, pp. 626–630, 1992.

Keywords: Low-power, CAD, synthesis, finite state machines

Biographies:

Luca Benini is a Ph.D. candidate in Electrical Engineering at Stanford University, where his dissertation is on synthesis for low power. Prior to arriving at Stanford, he worked as a research assistant on simulation techniques for power estimation at the Department of Electronics and Computer Science of the University of Bologna in 1991-1992. His current research interests are in the area of computer-aided design and simulation of digital ICs, specifically in the design of low-power systems, algorithms for the automatic synthesis of low-power circuits and tools for accurate estimation of the power dissipation in large digital systems. He is also interested in multiple-level logic synthesis, algorithms for optimal state assignment, technology mapping and probabilistic simulation. He received the M.S. degree in Electrical Engineering from Stanford University in 1994, and a Laurea degree in Electrical Engineering from the University of Bologna, Italy, in 1991. He is a member of the IEEE.

Polly Siegel is a Ph.D. candidate in Electrical Engineering at Stanford University, where her dissertation is on technology mapping for asynchronous designs. She received a Best Paper Award at the 1993 Design Automation Conference. She has also been with Hewlett-Packard since 1982, where she has been involved in research and development of a wide variety of CAD tools including schematic capture, behavioral simulation and IC design system frameworks, and is currently working at HP Labs on asynchronous synthesis. She received her M.S. in Engineering Management from Stanford in 1994, and her M.S. and B.S. in EECS from U.C. Berkeley in 1983 and 1981, respectively. She is a member of the IEEE and the ACM.

Giovanni De Micheli is Associate Professor of Electrical Engineering, and by courtesy, of Computer Science at Stanford University. Previously he held positions at the IBM T.J. Watson Research Center, Yorktown Heights, New York, at the Department of Electronics of the Politecnico di Milano, Italy, and at Harris Semiconductor, Melbourne, Florida. He holds a Nuclear Engineer degree (Politecnico di Milano, 1979) and a Ph.D. degree in Electrical Engineering and Computer Science (University of California at Berkeley, 1983). His research interests include several aspects of the computer-aided design of integrated circuits and systems, with particular emphasis on automated synthesis, optimization and validation. He is author of Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994, co-author of High-level Synthesis of ASICs under Timing and Synchronization Constraints, Kluwer, 1992, and co-editor of Design Systems for VLSI Circuits: Logic Synthesis and Silicon Compilation, Martinus Nijhoff Publishers. He was also co-director of the Advanced Study Institute on Logic Synthesis and Silicon Compilation, held in L'Aquila, Italy, under the sponsorship of NATO in 1986 and in 1987. Dr. De Micheli is a Fellow of IEEE. He was granted a Presidential Young Investigator award in 1988. He received the 1987 Best Paper Award for the best paper published in the IEEE Transactions on CAD/ICAS and two Best Paper Awards at the Design Automation Conference, in 1983 and in 1993.
