Power-Delay Product Minimization

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 3, MARCH 2004
Power-Delay Product Minimization in High-Performance 64-bit Carry-Select Adders

Amaury Nève, Member, IEEE, Helmut Schettler, Thomas Ludwig, Member, IEEE, and Denis Flandre, Senior Member, IEEE
Abstract: This paper analyzes methods to minimize the power-delay product of 64-bit carry-select adders intended for high-performance and low-power applications. A first realization in 0.18-μm partially depleted (PD) silicon-on-insulator (SOI), using complex branch-based logic (BBL) cells, results in a delay of 720 ps and a power dissipation of 96 mW at 1.5 V. The reduction of the stack height in the critical path, combined with the optimization of the global carry network with cell sharing and the selection of 8-bit pre-sums, reduces the power-delay product by 75%. The automatic tuning of the transistor widths in 0.13-μm PD SOI produces an energy-efficient 64-bit adder with a delay of 326 ps and a power dissipation of only 23 mW at 1.1 V.
Index Terms: Adder, digital CMOS, high performance, low power, power-delay product, silicon-on-insulator technology.
I. INTRODUCTION
TODAY, one of the major challenges for high-performance microelectronic systems is power dissipation, both static and dynamic [1]–[3]. The circuit designer must therefore find an optimum between power and speed instead of targeting them independently. This optimum is captured by the power-delay product, which represents the average energy dissipated for one switching event [4]. In this study, we investigate design methods to minimize the power-delay product of 64-bit adders in partially depleted (PD) silicon-on-insulator (SOI) technology. Addition is used as a benchmark here since it is one of the important tasks performed by the CPU: adders are needed in the arithmetic and logic units, for memory address generation, and for floating-point calculations [5], [6]. The power-delay product is improved at the different hierarchical levels of the design: circuit design style, cell decomposition, and global architecture.
Section II discusses design styles that can be used for low-power and high-performance VLSI systems in SOI. In Section III, we compare two possible implementations of branch-based cells used for the 64-bit adder, which has a classical carry-select architecture. The experimental results of this realization are discussed in Section IV and are compared with a complementary pass-gate adder. The optimization of the adder structure, the global carry network, and the cells is presented in Section V. The proposed adder is compared with the original carry-select adder. Section V also gives the results for the optimized implementation of the 64-bit adder in 0.13-μm PD SOI and compares them with state-of-the-art 64-bit adders from the literature.

II. LOGIC CIRCUIT DESIGN IN SOI
The use of SOI technology instead of bulk CMOS opens new possibilities in the choice of the design style and in the design space itself. In Section II-A, we discuss some possible design styles for PD SOI. Among them, branch-based logic, a restricted version of static CMOS logic, seems very promising and is presented in Section II-B.
A. Circuit Design Styles
In this study, we concentrate on static design styles, since the performance advantage of both dynamic logic styles and pass-gate design is expected to decrease in future deep-submicron technologies [3], [7]. Lower dynamic power consumption and a higher noise margin make static CMOS particularly attractive [8], [9]. Moreover, the activation of the parasitic bipolar transistor in PD SOI is reported to result in fatal erroneous states in dynamic logic and to make circuit design with pass-gates more difficult [10]. The renewed interest in static design styles like pseudo-NMOS [11] and ratioed CMOS [12] shows that alternative design styles are being investigated in SOI in order to reduce the power dissipation while still maintaining high-speed performance.
B. Branch-Based Circuit Design Style
In the branch-based logic (BBL) design style, a logic cell is made only of branches that contain a few transistors in series [13]. The branches are connected in parallel between the power supply lines and the common output node. Many usual static CMOS gates, such as the inverter and the NAND and NOR gates, already have a branch structure. By using the branch-based concept, it is possible to minimize the number of internal connections and thus the parasitic capacitances associated with the diffusions, interconnections, and contacts. As BBL belongs to the family of static CMOS, it presents high noise margins and robustness to device scaling and voltage scaling. The optimal design point will depend on whether the emphasis is placed on low power, high speed, or a compromise between the two.
Manuscript received February 28, 2003; revised June 30, 2003. The work of A. Nève was supported in part by the Walloon Region of Belgium. A. Nève was with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium. He is now with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: aneve@de.ibm.com). H. Schettler and T. Ludwig are with IBM Entwicklung, D-71032 Böblingen, Germany (e-mail: shettler@de.ibm.com; tludwig@de.ibm.com). D. Flandre is with the Microelectronics Laboratory, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium (e-mail: flandre@dice.ucl.ac.be). Digital Object Identifier 10.1109/TVLSI.2004.824305
1063-8210/04$20.00 © 2004 IEEE
Fig. 1. 64-bit carry-select adder.
TABLE I LOGIC EQUATIONS AND CIRCUIT BLOCKS USED FOR THE INTERMEDIATE CARRY SIGNALS
III. DESIGN OF THE CARRY-SELECT ADDER
A. Adder Structure
In our first design, the 64-bit adder was classically divided into four 16-bit sections, as shown in Fig. 1 [14]. In each section, two 16-bit adders generate the sum outputs with the carry-in at 0 and at 1, respectively. The true sum is selected by a multiplexer. The control signals for the multiplexers are the carry-in and the intermediate carry signals C15, C31, and C47. The intermediate carry signals and the carry-out are produced by the carry-select boxes (CS-boxes). The inputs of the CS-boxes are the carry-in and the conditional carry signals of the four sections (bit positions 0–15, 16–31, 32–47, and 48–63), which are all generated simultaneously. Each section produces two conditional carry signals: its block carry assuming that the carry-in is at 0, and its block carry assuming that the carry-in is at 1. For the sake of clarity, the indexes of the conditional carry signals have been simplified in Fig. 1. The carry-out and the intermediate carry signals are computed according to the equations presented in Table I. To compute the carry-out, the intermediate carry C47 is combined with the two conditional carry signals of the last section in one CS-C0 stage to avoid a complex CS-C3 stage.
In turn, the 16-bit adder blocks are implemented as carry-select adders, with 4-bit adder blocks having a carry-in either at 0 or at 1. At the 16-bit level, the same CS-boxes can be used as at the 64-bit level, with the sizing of the transistors adapted to the particular load conditions. The 4-bit adder blocks can finally be implemented as ripple-carry adders or as carry-select adders; the latter option was chosen here. The carry-select architecture is thus used at three different levels: in the 64-bit, the 16-bit, and the 4-bit adders.
B. Design of the Carry-Select Boxes
A CS-box can be implemented in different ways, depending on its number of inputs and complexity. The CS-C0 gate is designed starting from the logic equation presented in Table I. The P-part is common to the BBL version and the CMOS OR-AND-INVERT (OAI) gate, which is referred to below as the X-gate. It can be designed immediately by observing that the input variables must be complemented in the logic equation (Fig. 2). For the NMOS part, the complement of the logic equation is taken, which gives (1) and (2). Equation (2) defines the N-part of the X-gate represented in Fig. 2(b). Normally, the N-part of the BBL cell would require four transistors. However, to reduce the input capacitance and the internal parasitic capacitances, (2) can be simplified by noticing the identity (3). This identity holds because the input combination in which the conditional carry for a carry-in at 0 is at 1 while the conditional carry for a carry-in at 1 is at 0 never happens, as a result of the properties of the conditional carry signals. The resulting equation is (4).
CS-C0 can thus be implemented as a complex cell in one stage, obeying the BBL principle [Fig. 2(a)], or as an X-gate, allowing for internal connections between branches [Fig. 2(b)]. These X-gates will be used in the optimized version of the adder described in Section V. CS-C0 can also be implemented in two stages, using two classical NANDs and one inverter (Fig. 3). Notice that the latter is a two-stage BBL implementation of CS-C0, since NAND gates can be considered as elementary BBL cells. This will be referred to as decomposed-cell design in the remainder of this paper. Figs. 4 and 5 represent, respectively, the complex BBL one-stage implementation of CS-C1 and the complex BBL two-stage implementation of CS-C2. Using only one stage for the latter requires stacks of four transistors, which appeared to be too slow. The carry indexes have been simplified for the sake of clarity in the figures. In Section III-C, these complex cells are compared with equivalent decomposed-cell designs using NOR and NAND gates.
Fig. 2. One-stage complex CS-C0: (a) BBL and (b) X-gate.
Fig. 3. Two-stage decomposed CS-C0 cell.
Fig. 4. One-stage complex BBL CS-C1 cell.
Fig. 5. Two-stage complex BBL CS-C2 cell.
C. Results for the Single Cells
The complex and decomposed-cell BBL gates have been optimized for speed as follows. For the optimization process, each cell is loaded with one CMOS inverter (fan-out of 1). The nominal gate length is chosen at the minimum allowed drawing dimension. We carefully optimized the gate widths by hand, using an iterative process. In the first step, the critical branch is identified; this is the branch that has the longest delay when the cell switches. The speed of a gate is indeed limited by the slowest branch [15]. Often this corresponds to the branch with the largest number of transistors in series. The input pattern that activates the top transistor of the critical branch in one particular cell is applied at the input. With the AS/X circuit simulator [16], a sweep is made on the W/L ratio of this branch, all other ratios remaining constant. Thereafter, the W/L ratio of the other branches is further tuned to lower the capacitance of the output node, to which all the branches are connected. After this first step, we can return to the critical branch and refine the choice of its gate widths. Most of the time, the second step leads only to minor changes of the W/L ratios of the branches.
We evaluated the performance of the basic building blocks of the adder by using circuit simulations. The simulations are based on the schematic specification of the circuit. The device models of the 0.18-μm PD SOI process include the parasitic capacitances of source, drain, and gate. The models also account for the parasitic capacitances associated with the contacts.
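The hand-tuning loop described in Section III-C can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the authors used the AS/X circuit simulator, whereas the sketch uses a toy RC delay model in arbitrary units. The names `branch_delay`, `cell_delay`, `sweep`, and `tune_widths`, the constants `c_load` and `c_diff`, and the candidate width grid are all hypothetical.

```python
def branch_delay(widths, stacks, i, c_load=1.0, c_diff=0.5):
    """Toy RC delay of branch i (arbitrary units, not a device model)."""
    resistance = stacks[i] / widths[i]            # taller stack = slower, wider = faster
    capacitance = c_load + c_diff * sum(widths)   # every branch loads the output node
    return resistance * capacitance


def cell_delay(widths, stacks):
    """A cell is as slow as its slowest (critical) branch."""
    return max(branch_delay(widths, stacks, i) for i in range(len(widths)))


def sweep(widths, stacks, i, candidates):
    """Sweep the width of branch i only; return the width giving the lowest cell delay."""
    def delay_with(w):
        trial = widths[:i] + [w] + widths[i + 1:]
        return cell_delay(trial, stacks)
    return min(candidates, key=delay_with)


def tune_widths(widths, stacks, candidates, passes=2):
    """Iterate: size the critical branch first, then retune the other branches."""
    widths = list(widths)
    for _ in range(passes):
        critical = max(range(len(widths)),
                       key=lambda i: branch_delay(widths, stacks, i))
        widths[critical] = sweep(widths, stacks, critical, candidates)
        for j in range(len(widths)):
            if j != critical:
                widths[j] = sweep(widths, stacks, j, candidates)
    return widths


stacks = [3, 2, 1]                       # transistors in series per branch (assumed)
start = [1.0, 1.0, 1.0]                  # initial relative widths (assumed)
candidates = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
tuned = tune_widths(start, stacks, candidates)
print(cell_delay(start, stacks), cell_delay(tuned, stacks))
```

Because every sweep keeps the current width among its candidates, the cell delay in this toy model never increases from one step to the next, which mirrors the observation that the second pass usually brings only minor changes.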
TABLE II SIMULATED DELAY. V_DD = 1.5 V; T = 25 °C; FO = 1
Fig. 6. Image of the test chip. The locations of the BBL adder (BBL ADD) and the CPL adder (CPL ADD) are highlighted.
TABLE III SIMULATED STATIC POWER DISSIPATION. V_DD = 1.5 V; T = 25 °C; FO = 1
TABLE IV SIMULATED DYNAMIC POWER DISSIPATION. V_DD = 1.5 V; T = 25 °C; FO = 1; F = 1 GHz
The results for the delay, static power consumption, and dynamic power consumption are presented in Tables II–IV, respectively. We compare two possible BBL implementations of the cells: complex-cell design (COMPLEX) and decomposed-cell design (DEC), using only NAND, NOR, and INV gates. Two categories clearly appear. Circuits with few inputs and few branches, like the half-adder and CS-C0, perform much better in their complex form, with a speed increase of 8% and 20%. Circuits with more inputs, like CS-C1 and CS-C2, perform worse in their complex form than in their decomposed form. The poor performance of these two complex cells can be explained by the combination of two factors: the presence of branches with a stack of three transistors, and a high number of branches connected to the output node, which increases the parasitic capacitance at the output. Moreover, the critical path in the two decomposed circuits includes a stack of three NMOS devices, whereas in the two complex-cell circuits, a stack of three PMOS devices is activated in the worst case.
Floating-body effects are known to produce uncertainties in the circuit results. In particular, the hysteresis effect is associated with the dependence of the body potential on the switching history of the gates and appears as an additional delay variation [17]. With a tool based on the methodology presented in [18], we evaluated the impact of the hysteresis effect on these cells to be less than 5%.
The static power dissipation is very close for the half-adder and CS-C2, since these BBL cells count multiple stages, like the decomposed versions, and hence multiple leakage paths between the power supply and ground. The complex CS-C0 and CS-C1 cells benefit from the design in one stage compared to their decomposed counterparts.
The dynamic power dissipation is compared for the same logic cells in Table IV. The last column presents the dynamic power reduction of the complex implementation relative to the decomposed implementation. The complex cells dissipate between 30% and 43% less dynamic power than the decomposed cells. The main factor that explains why the complex cells have a lower power dissipation than the decomposed cells is the lower number of internal nodes. In the CS-C0 cell, for example, there is only one highly capacitive node at the output in the complex cell [see Fig. 2(a)]. In the decomposed cell (Fig. 3), there are two internal nodes and one output node, though with a lower parasitic capacitance than in the case of the complex cell. When the output of the gate switches, at least one of the internal nodes also switches, with a full-swing charge/discharge of the associated parasitic capacitance.

IV. FIRST REALIZATION OF THE 64-BIT ADDER
The design and layout of the 64-bit carry-select adder presented in Section III have been realized in the 0.18-μm PD SOI CMOS process [19]. The adder is composed of 18 k devices and occupies an area of 735 μm × 280 μm (Fig. 6, BBL ADD). The same layout can be used for each 16-bit section, as they are repeated four times. All the cells are implemented using the BBL design style, with the exception of the multiplexers, which are transmission-gate multiplexers. For this version of the adder, we fixed the maximum stack height at three MOSFETs, which enables us to use the complex BBL gates discussed in the previous section. In the remainder of this paper, the adder implemented here is referred to as Adder4×16b. This notation refers to the fact that we make a selection of 16-bit pre-sums at the 64-bit adder level.
A. Critical Path
The critical path to the carry-out and to the sum outputs (see Fig. 1) is described here. At the 4-bit adder level, the critical path involves one NAND/NOR gate and one CS-C2 cell, which generates the 4-bit block carry. A second CS-C2 cell at the 16-bit adder level generates the intermediate carry signals of the 16-bit adders. These signals are then fed into the CS-C1 and CS-C2 cells of the 64-bit adder for the generation of C31 and C47, respectively. However, in this way, the delay on the critical carry-out path could be too high for two reasons. First, as seen previously, the delay in a CS-C2 cell is about 30% higher than in a CS-C1 cell. Second, the cells generating the intermediate carry signals in the 16-bit adders see a high capacitive load at the inputs of the CS-C1 and CS-C2 cells at the 64-bit level. In our design, we favor the carry-out generation in two ways at the 64-bit level (Fig. 7). First, since the delay in CS-C2 is the largest, the intermediate carry signals are fed directly into CS-C2 for the generation of C47. To compute the carry-out, only one additional stage is necessary, i.e., CS-C0. Second, buffers are added on the signal path to the inputs of CS-C1 in order to reduce the capacitive load seen by the outputs of the cells generating these signals. In this way, these signals do not see the high capacitive load of the CS-C1 cell.
Fig. 7. Detail of the critical path. Buffers are added on the path to CS-C1 in order to reduce the capacitive load seen by the previous stage.
Fig. 8. Experimental delay of the 64-bit BBL and CPL adders for different V_DD values at 25 °C. The critical delay is obtained for SUM24..31 for the BBL adder and for the carry-out in the case of the CPL adder.
B. Simulation Results
A second 64-bit carry-select adder has been designed with the decomposed cells for comparison purposes. AS/X simulations of these adders based on the cell schematics are used to compare both versions. At 1.5 V, 25 °C, and with a capacitive output load of 150 fF, the BBL complex-cell adder features a dynamic power consumption that is 10% lower than that of the decomposed-cell adder, with random input patterns applied at a rate of 1 GHz. The delay increase associated with the complex-cell design is less than 2% for supply voltages up to 1.5 V. For higher supply voltages, the speed difference between the two increases slightly but remains lower than 5%. For worst-case input patterns, the peak dynamic power consumption is reduced by 16% in the complex-cell adder compared to the decomposed-cell implementation. The overall reduction of dynamic power dissipation is lower than for the individual cells, because the complete adder also involves inverters and multiplexers, which are similar in both realizations. An equivalent bulk realization of the adder consumes 15% more dynamic power and is 29% slower than the PD SOI version.
This is associated with the lower junction capacitances in SOI. A complete netlist has been extracted from the 64-bit complex-cell adder layout in order to precisely determine the static and dynamic power consumption, taking all the interconnections and parasitic elements into account. To compute the dynamic power, we apply random patterns at the inputs of the extracted netlist. Integration between 5 and 45 ns, with a bit rate of 1 GHz, a supply voltage of 1.5 V, and a temperature of 25 °C, results in a dynamic power consumption of 96 mW. The peak dynamic power consumption occurs when all the inputs switch at the same time, and rises to 177 mW. The static power consumption is about 500 μW at 1.5 V and 25 °C.
C. Experimental Results
The experimental realization of the 64-bit adder has been tested under different voltage and temperature conditions and operates successfully. Worst-case input patterns have been applied and the propagation delays to the sum and carry outputs measured. In the worst-case situation and at 1.5 V, the final carry is produced after 600 ps. Thanks to the independent carry network composed of the CS-boxes, it arrives earlier than the last sum outputs, which arrive after 720 ps in the worst case. On the same chip as the BBL adder, a complementary pass-gate logic (CPL) 64-bit carry-select adder has been realized (CPL ADD in Fig. 6). The CPL adder makes the selection of 16-bit pre-sums and uses CPL cells based on the work presented in [20]. Fig. 8 shows the critical delay for the CPL adder and for the BBL adder. For low operating voltages, the BBL adder is clearly better than the CPL adder. Indeed, BBL cells do not suffer from the voltage drop due to the threshold voltage in single-rail pass-gates. This voltage drop increases in relative terms when moving to lower supply voltages. For high supply voltages, CPL and BBL are able to achieve similar performance. These results confirm the statement of [7] that the performance of CPL-like logic styles degrades much faster than that of other design styles due to the decreasing ratio of supply voltage to threshold voltage in deep-submicron technologies. The dynamic power consumption of the CPL adder is about 50 mW. This is one half of the power consumption of Adder4×16b.
Two factors explain the power advantage of the CPL adder: a different structure of the 4-bit adders, implying the use of a lower number of cells, and a lower switched capacitance thanks to the low number of NMOS and especially PMOS devices in the design (2999 PMOS and 4992 NMOS devices in the CPL adder, versus 9077 PMOS and 9142 NMOS devices in the BBL adder).
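The device counts and power figures quoted above can be put side by side in a short worked calculation. The symbols N and P are introduced here for illustration only; 96 mW is the dynamic power reported for the BBL adder in Section IV-B.

```latex
\begin{align*}
N_{\mathrm{CPL}} &= 2999 + 4992 = 7991, \\
N_{\mathrm{BBL}} &= 9077 + 9142 = 18\,219 \approx 18\,\mathrm{k}, \\
\frac{N_{\mathrm{CPL}}}{N_{\mathrm{BBL}}} &\approx 0.44, \qquad
\frac{P_{\mathrm{CPL}}}{P_{\mathrm{BBL}}} \approx \frac{50\ \mathrm{mW}}{96\ \mathrm{mW}} \approx 0.52 .
\end{align*}
```

The BBL total agrees with the 18 k devices quoted for the layout. The power ratio tracks the device-count ratio only loosely, which is consistent with switched capacitance being one of two contributing factors.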
Fig. 9. Four-bit ripple-carry adder (RCA) with carry-in = 0.
TABLE V SIMULATION RESULTS FOR THE 4-BIT ADDER WITH THE ASSOCIATED CARRY-LOOKAHEAD CIRCUIT. IN FA+X-BBL, THE CARRY-LOOKAHEAD CIRCUIT IS IMPLEMENTED WITH BBL CS-C0 CELLS AND X-GATES; IN FA+CS-C2, THE CARRY-LOOKAHEAD CIRCUIT IS COMPOSED OF JUST ONE CS-C2 CELL; IN FA+CMOS, THE CARRY PATH IS IMPLEMENTED WITH CONVENTIONAL NAND AND NOR GATES; CSA+CS-C2 IS THE 4-BIT CARRY-SELECT ADDER WITH THE CARRY PATH CONSISTING OF CS-C2. V_DD = 1.5 V; T = 25 °C; F = 1 GHz; FAN-OUT = 4 INVERTERS
V. OPTIMIZATION OF THE 64-BIT ADDER
In the first part of this section, we revisit the choices made for the maximum stack height of the BBL cells. Second, concerning the architecture, two elements must be considered. The carry selection can occur with either 8-bit or 16-bit pre-sums. Moreover, the structure of the carry network itself can be further optimized with regard to the constraints of power and performance.
A. Stack Height
The power-delay product of the adder can be improved by a better balance between the cells with a stack height of two and the cells with a stack height of three. In the 4-bit adder cells, the stack height can be increased from two to three, since this part is not in the critical path in the carry-select structure. Instead of using half-adders (HAs) and multiplexers, efficient 28-transistor 1-bit CMOS full adders (FAs) are used in a 4-bit ripple-carry configuration [21]. To avoid the long delay of the block carry, which ripples through all the FA cells (Fig. 9), a carry-lookahead circuit is used. We implemented four versions and compared them with the original 4-bit carry-select adder (CSA) of Section III, which is referred to as CSA+CS-C2 in Table V. In each case, the carry network has NOR and NAND gates in the first stage to produce the conditional carry signals. The first way to generate the block carry is to use the complex BBL CS-C2 cell represented in Fig. 5; this version is referred to as FA+CS-C2 in Table V. The carry network with branch-based carry-select boxes can be redesigned to avoid high stacks, which favors speed. If we limit the height of the branches to two devices only, we propose another decomposition of the equation of the block carry, given in (5)–(7). By using De Morgan's theorem, this expression becomes (8). The resulting circuit is shown in Fig. 10. Two complex BBL CS-C0 cells [see Fig. 2(a)], one two-input NOR, and one complementary X-gate are combined. It has the advantage of having a maximum of two PMOSFETs in the stack. Notice that we cannot use a complementary CS-C0 cell in the last stage. Indeed, the "never happens" condition that enabled the simplification of the BBL cells is not fulfilled here: if one of the two signals concerned is at 1, this does not imply that the other is at 1. The circuit is combined with the 4-bit ripple-carry adder (RCA) and is referred to as FA+X-BBL in Table V. We can also replace the two remaining BBL CS-C0 gates by X-gates. This circuit is referred to as FA+X-gates. Finally, we can further decompose the cells so that this stage is designed using only inverters, NAND, and NOR gates. This case is FA+CMOS. Table V presents the simulation results for the 4-bit adders with the carry-lookahead circuits. FA+X-BBL and FA+X-gates are almost identical in all respects and have the best energy efficiency. FA+CMOS, FA+CS-C2, and CSA+CS-C2 have a power-delay product that is about 25%–30% higher. The reduction of the stack height from three to two devices reduces the delay by 18% between FA+CS-C2 and FA+X-BBL.
Fig. 10. Carry logic for the 4-bit adder with C_in = 0.
B. Structural Optimization
A 64-bit carry-select adder can make the selection either of 8-bit pre-sums [22], [23] or of 16-bit pre-sums [14]. The use of 8-bit adders allows the use of one carry-selection level instead of three as in the first design. There are multiple ways to combine two 4-bit adders to form the 8-bit adders. In the first possibility, we use C3, the 4-bit block carry from the first 4-bit adder, computed with the carry-lookahead circuit of the first stage. C3 is re-used in the cell CS-C7, producing the carry for the entire 8-bit adder block (Fig. 11). However, the speedup was not sufficient to produce the SUM signals on time, even in the carry-select architecture. Therefore, C5, the result of the carry from the six lowest-order bits, is fed directly into the FA for bit position 6, thus forming a second carry-lookahead path (Fig. 11). C5 is further used in CS-C7 to generate C7. By using intermediate results and sharing circuit blocks, the duplication of logic cells is avoided. This contributes to a lower area and lower power consumption.
Fig. 11. Block diagram of the 8-bit CLA adder with carry-in. The intermediate carry signal C5 (interrupted line) was added in order to enhance the speed of the 8-bit adder blocks. C3 and C5 are intermediate results produced by CS-C7.
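The role of the intermediate carries C3 and C5 in the 8-bit block can be illustrated with a small behavioral model. This is a sketch under assumptions, not the authors' circuit: `carry_out_of` is a hypothetical stand-in that computes each lookahead carry arithmetically, whereas the paper builds it from CS-C0 cells and X-gates inside CS-C7. The model only shows that injecting C3 at bit 4 and C5 at bit 6 breaks the ripple chain into three short segments without changing the result.

```python
def carry_out_of(a, b, cin, nbits):
    """Carry out of the nbits lowest-order positions (stand-in for lookahead logic)."""
    mask = (1 << nbits) - 1
    return ((a & mask) + (b & mask) + cin) >> nbits


def add8(a, b, cin):
    """8-bit addition split into bits 0..3, 4..5, and 6..7."""
    c3 = carry_out_of(a, b, cin, 4)   # 4-bit block carry, enters bit 4
    c5 = carry_out_of(a, b, cin, 6)   # carry of the six lowest bits, enters bit 6
    c7 = carry_out_of(a, b, cin, 8)   # carry of the whole 8-bit block
    low = ((a & 0xF) + (b & 0xF) + cin) & 0xF
    mid = (((a >> 4) & 0x3) + ((b >> 4) & 0x3) + c3) & 0x3
    high = (((a >> 6) & 0x3) + ((b >> 6) & 0x3) + c5) & 0x3
    return low | (mid << 4) | (high << 6), c7


# Usage: 0xB7 + 0x5C with carry-in 1 gives the sum bits and the block carry.
total = 0xB7 + 0x5C + 1
assert add8(0xB7, 0x5C, 1) == (total & 0xFF, total >> 8)
```

In this model the longest ripple segment is four bits long, compared with eight bits for a plain ripple-carry block.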
Adder8×8b is composed of eight 8-bit blocks, the global carry network, and seven multiplexers (Fig. 12). The first adder block contains the 8-bit adder generating SUM0..7 and CS-C7, the circuit generating the intermediate carry signal C7. The seven other blocks are identical and are composed of four circuits. Two of these are 8-bit adders, used to produce the pre-sums, one assuming that the carry-in is at 0, the other assuming that the carry-in is at 1. A 2 × 8-input multiplexer selects the final sum output. The control signals for the multiplexers are generated by the global carry network. The two other circuits in the seven identical adder blocks generate the conditional carry signals, which are fed into the global carry network. These two circuits are referred to as CS-cin0 and CS-cin1 in Fig. 12. By using 8-bit blocks that are repeated seven times, the design and layout time is made shorter than if sections of different lengths were used.
Fig. 12. 64-bit adder based on the selection of 8-bit pre-sums.
C. Global Carry Network
The global carry network generates the final carry-out and the intermediate carry signals C7, C15, C23, C31, C39, C47, and C55. These signals command the multiplexers that select the appropriate 8-bit pre-sums. In order to minimize the delay in the critical path, the following elements are taken into consideration: the number of successive stages, the input load presented to the CS-cin0 and CS-cin1 blocks, and the fan-out of each stage. In the decomposition that we propose below, the fan-out is limited to three, both for the conditional carry signals and for the intermediate carry signals that are re-used at different places in the global carry network. The hot carry is C55, since it commands the multiplexer selecting the highest-order sum signals. It is generated using all the intermediate carry signals except two, as expressed in (9). Since in the global carry network the stack height is limited to two devices to favor speed, the equation of C55 is further decomposed in order to be able to implement this function with complex BBL CS-C0 gates and, where needed, X-gates.
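As a behavioral cross-check of the 8 × 8-bit organization, the following Python sketch models the pre-sum selection and the carry-select relation. It rests on assumptions: the relation `c0 | (c1 & cin)` is the usual carry-select form and is assumed here, since the paper's own equations are not reproduced in this text; the carries are computed one block after another, whereas the paper's global carry network computes them in parallel with shared cells and a fan-out of at most three. All function names are hypothetical.

```python
import random

BLOCK_BITS = 8
NUM_BLOCKS = 8
MASK = (1 << BLOCK_BITS) - 1


def add_block(a, b, cin):
    """8-bit addition returning (sum, carry-out)."""
    total = a + b + cin
    return total & MASK, total >> BLOCK_BITS


def cs_c0(c0, c1, cin):
    """Assumed carry-select relation; c0 and c1 are the conditional carries."""
    return c0 | (c1 & cin)


def cs_c0_n_simplified(c0, c1, cin):
    """Complement of cs_c0, valid only because (c0, c1) = (1, 0) never happens."""
    return (1 - c1) | ((1 - c0) & (1 - cin))


def carry_select_add64(a, b, cin=0):
    result, carry = 0, cin
    for k in range(NUM_BLOCKS):
        a_k = (a >> (k * BLOCK_BITS)) & MASK
        b_k = (b >> (k * BLOCK_BITS)) & MASK
        if k == 0:
            s, carry = add_block(a_k, b_k, cin)   # first block sees the real carry-in
        else:
            s0, c0 = add_block(a_k, b_k, 0)       # pre-sum assuming carry-in = 0
            s1, c1 = add_block(a_k, b_k, 1)       # pre-sum assuming carry-in = 1
            s = s1 if carry else s0               # multiplexer
            carry = cs_c0(c0, c1, carry)          # sequential stand-in for the carry network
        result |= s << (k * BLOCK_BITS)
    return result, carry


rng = random.Random(1)
a, b = rng.getrandbits(64), rng.getrandbits(64)
assert carry_select_add64(a, b) == ((a + b) & (2**64 - 1), (a + b) >> 64)
```

The function `cs_c0_n_simplified` mirrors the kind of simplification discussed in Section III-B: it matches the complement of `cs_c0` for every input combination that can occur, and differs only for the combination that never happens.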
The second control signal to generate is given by (10), which can be decomposed as (11). C23 is already one of the intermediate carry signals needed to command the multiplexer selecting SUM24..31. The expressions of the complements of two further intermediate signals are given in (12) and (13); these terms are shared between several carry expressions. The next control signal is expressed in (14), where some terms are again shared with another carry expression and C15 is the intermediate carry signal commanding the multiplexer selecting SUM16..23. Finally, the last expression, (15), re-uses a previously computed signal. The low-order intermediate signals are re-used as inputs for cells generating higher-order carry signals. In this way, we avoid the duplication of parts of the carry logic, which favors low power, and we keep the fan-out to a maximum of three, which is beneficial for speed.
TABLE VI DELAY, POWER CONSUMPTION, POWER-DELAY PRODUCT, AND DEVICE COUNT FOR THE REALIZED AND SIMULATED IMPLEMENTATIONS OF ADDER4×16b AND THE SIMULATED IMPLEMENTATION OF ADDER8×8b IN THE 0.18-μm CMOS TECHNOLOGY. V_DD = 1.5 V; T = 25 °C; F = 1 GHz; FO = 4
Fig. 13. Layout of the 64-bit adder based on the selection of 8-bit pre-sums.
D. Simulation Results
This adder has been simulated using the parameters of the 0.18-μm PD SOI CMOS process and is compared with the experimental and simulation results of the first adder, Adder4×16b, which has a classical structure (Table VI). The layout of Adder4×16b is taken as a reference to estimate the lengths of the wires for the schematic simulations. For Adder8×8b, the length of the long wires has been reduced by 25% to account for the lower die area thanks to the lower device count. Considering the reduction in the number of devices (67%), this is a rather conservative value. The optimized version of the 64-bit adder shows a reduction of about 20% of the maximum delay, thanks to three factors. First, the lower number of buffers on the signal path accounts for 6% of the delay reduction. Second, the use of stacks with a maximum of two PMOSFETs further improves the speed by 6%. The critical path in Adder8×8b includes a two-input NOR, one CS-C0 stage, and five X-gates to produce C55, the hot carry. Third, thanks to a more efficient architecture in Adder8×8b, the capacitive load seen by the carry-select cells on the critical path is reduced compared to Adder4×16b. To estimate the dynamic power consumption, random input patterns are applied at the inputs at a rate of 1 GHz. In Adder8×8b, the dynamic power consumption is reduced by a factor of three compared to the classical adder. This improvement is associated with the higher stacks in the noncritical path and the reduction of the number of cells in the proposed architecture. Adder8×8b shows a 75% lower power-delay product than Adder4×16b. About 20% of this improvement is associated with the cell level, the remainder coming from the modifications of the architecture. Adder8×8b has a power-delay product that is 60% lower than that of the CPL adder. In this case, the architecture accounts for 10% of the improvement, the remainder coming from the different logic design styles.
E. Optimization and Results in 0.13-μm PD SOI
Adder8×8b has finally been optimized with Einstuner in a 0.13-μm PD SOI CMOS technology. Einstuner is a circuit optimization package that automatically resizes the transistors [24]. The final layout area is 151 μm × 461 μm (Fig. 13). A netlist has been extracted from the layout and is used to determine the main features of the adder. The critical delay is 326 ps at 1.1 V and 85 °C. Random input patterns at a rate of 2 GHz are used to calculate the dynamic power, which is found to be as low as 23 mW. The static power is evaluated to be 380 μW. Our realization is compared with state-of-the-art 64-bit adders published in previous work in Fig. 14. The adder proposed here is faster than the other realizations, even those realized in finer technologies. Our adder features the extremely low power-delay product of 7.5 pJ.
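The power-delay figures quoted in the text can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in this paper; the helper name `power_delay_product_pj` is hypothetical. Note that the two operating points are not directly comparable: the first uses 1.5 V and a 1-GHz input rate, the second 1.1 V and a 2-GHz input rate.

```python
def power_delay_product_pj(delay_ps, power_mw):
    """Power-delay product in pJ: 1 ps * 1 mW = 1e-15 J = 1e-3 pJ."""
    return delay_ps * power_mw * 1e-3


# First realization in 0.18-um PD SOI: 720 ps and 96 mW.
first = power_delay_product_pj(720, 96)

# Optimized adder in 0.13-um PD SOI: 326 ps and 23 mW.
final = power_delay_product_pj(326, 23)

# Rough consistency check for Section V-D: about 20% lower delay combined
# with a threefold lower dynamic power.
estimated_reduction = 1 - 0.8 / 3

print(round(first, 2), round(final, 2), round(estimated_reduction, 2))
```

The second value rounds to the 7.5 pJ quoted above. The rough estimate of about 73% is in line with the reported 75% reduction, given that both input figures are approximate.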
REFERENCES
[1] T. H. Ning, "CMOS in the new millennium," in Proc. IEEE Custom Integrated Circuit Conf. (CICC), 2000, pp. 49–56.
[2] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.
[3] European Semiconductor Industry Association, Japan Electronics and Information Industries Association, Korea Semiconductor Industry Association, Taiwan Semiconductor Industry Association, and Semiconductor Industry Association, International Technology Roadmap for Semiconductors, System Drivers, 2001.
[4] C. Nagendra, R. M. Owens, and M. J. Irwin, "Power-delay characteristics of CMOS adders," IEEE Trans. VLSI Syst., vol. 2, pp. 377–381, Sept. 1994.
[5] S. Naffziger, "A subnanosecond 0.5 μm 64b adder," in Proc. IEEE Int. Solid-State Circuits Conf., Slide Supplement, 1996.
[6] J. Hennessy, D. Patterson, and D. Goldberg, Computer Architecture, A Quantitative Approach, 2nd ed. San Mateo, CA: Morgan Kaufmann, 1996.
[7] M. Allam, M. Anis, and M. Elmasry, "Effect of technology scaling on digital CMOS logic styles," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 2000, pp. 19-1-1–19-1-8.
[8] R. Yung, S. Rusu, and K. Shoemaker, "Future trend of microprocessor design," in Proc. ESSCIRC 2002, pp. 43–46.
[9] V. De and S. Borkar, "Low power and high performance design challenges in future technologies," in Proc. Great Lakes Symp. VLSI, 2000, pp. 1–6.
[10] P.-F. Lu, C.-T. Chuang, J. Ji, L. F. Wagner, C.-M. Hsieh, J. B. Kuang, L. L.-C. Hsu, M. M. Pelella, S.-F. Sanford, and C. J. Anderson, "Floating-body effects in partially depleted SOI CMOS circuits," IEEE J. Solid-State Circuits, vol. 32, pp. 1241–1253, Aug. 1997.
[11] N. Subba, A. Salman, S. Mitra, D. E. Ioannou, and C. Tretz, "Pseudo-NMOS revisited: Impact of SOI on low power, high speed circuit design," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 26–27.
[12] C. R. Tretz, R. K. Montoye, and W. Reohr, "Ratioed CMOS: A low power high speed design choice in SOI technologies," in Proc. IEEE Int. SOI Conf., Oct. 2000, pp. 28–29.
[13] J. M. Masgonty, C. Arm, and C. Piguet, "Technology- and power-supply-independent cell library," in Proc. IEEE Custom Integrated Circuits Conf. (CICC), 1991, pp. 25.5/1–25.5/4.
[14] K. Hwang, Computer Arithmetic: Principles, Architecture and Design. New York: Wiley, 1979.
[15] S. Zaker and J. Zahnd, "OPTIMOS: a branch-level digital circuit optimizer," in Proc. EURO ASIC, 1993, pp. 563–572.
[16] G. A. Katopis, W. D. Becker, T. R. Mazzawy, H. H. Smith, C. K. Vakirtzis, S. A. Kuppinger, B. Singh, P. C. Lin, J. Bartells Jr., G. V. Kihlmire, P. N. Venkatachalam, H. I. Stoller, and J. L. Frankel, "MCM technology and design for the S/390 G5 system," IBM J. Res. Develop., vol. 43, no. 5/6, pp. 621–650, Sept.–Nov. 1999.
[17] G. G. Shahidi, "SOI technology for the GHz era," IBM J. Res. Develop., vol. 46, no. 2/3, pp. 121–131, Mar./May 2002.
[18] I. Aller and K. E. Kroell, "Detailed analysis of the gate delay variability in partially depleted SOI CMOS circuits," in Proc. IEEE Int. SOI Conf., Oct. 1999, pp. 40–41.
[19] A. Nève, D. Flandre, H. Schettler, T. Ludwig, and G. Hellner, "Design of a branch-based 64-bit carry-select adder in 0.18 μm partially-depleted SOI CMOS," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), Aug. 2002, pp. 108–111.
[20] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A 3.8 ns CMOS 16 × 16-b multiplier using complementary pass-transistor logic," IEEE J. Solid-State Circuits, vol. 25, pp. 388–395, Apr. 1990.
[21] K. Martin, Digital Integrated Circuit Design. Oxford, U.K.: Oxford Univ. Press, 2000.
[22] S. J. Lee, R. Woo, and H. J. Yoo, "480 ps 64-bit race logic adder," in Symp. VLSI Circuits Dig. Tech. Papers, 2001, pp. 27–28.
[23] J. J. Kim, R. Joshi, C.-T. Chuang, and K. Roy, "SOI-optimized 64-bit high-speed CMOS adder design," in Symp. VLSI Circuits Dig. Tech. Papers, 2002, pp. 122–125.
[24] X. Bai, C. Visweswariah, P. Strenski, and D. Hathaway, "Uncertainty-aware circuit optimization," in Proc.
39th Design Automation Conf., June 2002, pp. 5863. [25] D. Sastiak, J. Tran, F. Mounes-Toussi, and S. Storino, A 2nd generation 440 ps SOI 64 b adder, in IEEE Int. Solid-State Circuits Conf., Feb. 2000, pp. 288289. [26] M. Garg and A. Katoch, Evaluation of skew tolerance in delayed clocking scheme for dynamic circuits, in Proc. ESSCIRC, Sept. 2001, pp. 396399.
Fig. 14. Our 64-bit adder, based on the selection of 8-bit pre-sums and designed in 0.13-μm PD SOI, is compared with recent 64-bit adder realizations. Load = 40 fF; T = 85 °C. ISSCC00 is a domino carry-select/carry-lookahead adder in 0.18-μm PD SOI [25]. VLSI01 is a race logic carry-lookahead/carry-select adder in 0.18-μm bulk CMOS [22]. ESSCIRC02 is a Brent–Kung adder in 100-nm bulk CMOS [26]. VLSI02 is a carry-select/carry-lookahead adder in 0.1-μm PD SOI [23].
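The figure of merit used throughout is the power-delay product, that is, the delay multiplied by the power dissipation. As a minimal sketch (the function names below are illustrative and are not taken from the paper), the following Python snippet computes the products implied by the delay and power figures quoted in the abstract, and checks that a reduction by a factor of four is the same statement as a 75% reduction.

```python
def power_delay_product_pj(delay_ps: float, power_mw: float) -> float:
    """Return the power-delay product in picojoules.

    1 ps * 1 mW = 1e-12 s * 1e-3 W = 1e-15 J = 1e-3 pJ.
    """
    return delay_ps * power_mw * 1e-3


def reduction_from_factor(factor: float) -> float:
    """Fractional reduction that corresponds to dividing a quantity by `factor`."""
    return 1.0 - 1.0 / factor


# Figures quoted in the abstract (hypothetical variable names).
first_realization_pj = power_delay_product_pj(delay_ps=720, power_mw=96)  # 0.18-um PD SOI, 1.5 V
tuned_adder_pj = power_delay_product_pj(delay_ps=326, power_mw=23)        # 0.13-um PD SOI, 1.1 V

print(f"first realization: {first_realization_pj:.2f} pJ")
print(f"tuned adder:       {tuned_adder_pj:.2f} pJ")
print(f"factor of four  -> {reduction_from_factor(4):.0%} reduction")
```

The two designs use different technologies and supply voltages, so the snippet reports each product separately and does not derive a reduction ratio between them.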
VI. CONCLUSION
In this paper, we presented a methodology to minimize the power-delay product of 64-bit carry-select adders for high-performance and low-power applications, by working at three levels of abstraction: design style, cell arrangement, and adder structure. We demonstrated that a complex-cell branch-based design reduces the dynamic power consumption by 10% compared to the design with decomposed cells, with a minimal impact on the delay. Reducing the CMOS cell stack height from three to two devices in the critical path proved beneficial for speed. By avoiding the duplication of cells, without increasing the fan-out on the critical path, we were able to improve the speed while maintaining low power consumption. The structural optimization, namely the selection of 8-bit pre-sums, allowed the use of only one carry-select level, which further reduced the power dissipation thanks to a lower number of cells. Compared to the classical design, the power-delay product of the optimized adder has been reduced by a factor of four. Compared with an equivalent CPL 64-bit adder, our realization shows a 60% improvement of the power-delay product. Finally, an automatic tuning tool allowed the design of an energy-efficient adder in 0.13-μm PD SOI, with a power-delay product as low as 7.5 pJ.
The approach presented in this paper can be extended to n-bit carry-select adders. However, the number of carry-select levels and the adder architecture might have to differ in order to obtain an efficient realization.
ACKNOWLEDGMENT
The authors express their sincere thanks to G. Hellner, J. Keinert, V. Gernhoeffer, and U. Krauch for the help they provided with the simulation, layout, and extraction tools. The authors also wish to thank R. Sautter and W. Haller for useful discussions, and J. Appinger for his helpful assistance in obtaining the experimental data.
Amaury Nève (M'96) received the electrical engineering and the Ph.D. degrees from the Université catholique de Louvain, Louvain-la-Neuve, Belgium, in 1998 and 2004, respectively. From 1998 to 2003, he was a Research Assistant with the Microelectronics Laboratory of the Université catholique de Louvain. He was involved in the development of design techniques for high-speed and low-power digital circuits in advanced silicon-on-insulator (SOI) processes. In May 2003, he joined the IBM Research and Development Laboratory in Böblingen, Germany, where he is working on circuit design for advanced CMOS processes.
Helmut Schettler received the Dipl.-Ing. degree in electrical engineering from the University of Stuttgart, Stuttgart, Germany, in 1969. In 1969, he joined the IBM Laboratory, Böblingen, Germany. He worked for three years in the IBM Laboratories, East Fishkill, NY, and Burlington, VT, where he was involved in bipolar and CMOS circuit and chip design for memory and μ-processor applications. He holds 23 patents and was the leading circuit designer when IBM moved its server technology from bipolar to CMOS. Presently, he is also involved in lecturing at a university of applied science in Stuttgart.
Thomas Ludwig (M'90) was born in Sindelfingen, Germany, in 1957. He received the Master of Electrical Engineering from the Technische Universität, Berlin, Germany, in 1983. He joined IBM in 1984, at the German Research and Development Laboratory, Böblingen, working on high-speed digital driver/receiver circuits. In 1992, he was on an assignment with the joint IBM/Intel Noyce Development Center, Boca Raton, FL, working on technology conversion of a μ-processor. He is currently Senior Engineer and Leader of the Future Product Technology Team, IBM Systems Group, Böblingen, Germany, responsible for future silicon technologies in the field of high-speed server processors. His current research interests are in the areas of silicon-on-insulator technology, especially FinFET circuit design, modeling, and the influence on CAD tools.
Denis Flandre (M'86–SM'03) was born in Charleroi, Belgium, in 1964. He received the Electrical Engineer degree, the Ph.D. degree, and the Postdoctoral thesis degree from the Université catholique de Louvain (UCL), Louvain-la-Neuve, Belgium, in 1986, 1990, and 1999, respectively. His doctoral research was on the modeling of silicon-on-insulator (SOI) MOS devices for characterization and circuit simulation, and his Postdoctoral thesis was on a systematic and automated synthesis methodology for MOS analog circuits. In 1985, he was a summer Student Trainee at NTT Headquarters, Tokyo, Japan. From October 1990 to September 1991, he was with the Centro Nacional de Microelectrónica, Barcelona, Spain, working on the characterization and numerical simulation of SOI MOS process and devices. He was then at the Laboratoire de Microélectronique (DICE), Louvain-la-Neuve, Belgium, as a Senior Research Associate of the National Fund for Scientific Research (FNRS, Belgium). Since 2001, he has been a full-time Professor at UCL giving courses on integrated analog circuit design, device physics, etc. He is currently involved in the research and development of digital and analog SOI MOS circuits for special applications, more specifically high-speed, low-voltage low-power, microwave, rad-hard, and high-temperature electronics. Prof. Flandre has been the recipient of the 1992 Biennial Siemens-FNRS Award for an original contribution in the fields of electricity and electronics, the 1997 Wernaers Prize for innovation in pedagogical presentation of advanced research work, and the 1999 CEN-SCK Prize for innovation in nuclear science instrumentation. He has authored or coauthored more than 160 technical papers or conference contributions. He is a member of the Advisory Board of the EU Network of Excellence for High-Temperature Electronics (HITEN), of the Scientific Board of the Microserv large infrastructure EU program of the CNM-Barcelona, and of the Director Board of the Cyclotron Research Center (CRC, Louvain-la-Neuve, Belgium). He is a founding member of the CERMIN (Centre de Recherche en Dispositifs et Matériaux Electroniques Micro- et Nanoscopiques of UCL). He is a cofounder of CISSOID S.A., a startup company spun off from UCL in July 2000, focusing on SOI circuit design services.


1063-8210/04$20.00 © 2004 IEEE

64-bit carry-select adder.

NÈVE et al.: POWER-DELAY PRODUCT MINIMIZATION IN HIGH-PERFORMANCE 64-BIT CARRY-SELECT ADDERS

Fig. 2. One-stage complex CS-C0: (a) BBL and (b) X-gate.

Fig. 4. One-stage complex BBL CS-C1 cell.

Two-stage decomposed CS-C0 cell.

Fig. 5. Two-stage complex BBL CS-C2 cell.

TABLE III SIMULATED STATIC POWER DISSIPATION. V = 1.5 V; T = 25 °C; FO = 1

TABLE IV SIMULATED DYNAMIC POWER DISSIPATION. V = 1.5 V; T = 25 °C; FO = 1; F = 1 GHz

Fig. 9. Four-bit ripple-carry adder (RCA) with carry

64-bit adder based on the selection of 8-bit pre-sums.

The second control signal to generate is

and are shared with is expressed as follows:
