
Novel Architectures for High-Speed and Low-Power 3-2, 4-2 and 5-2 Compressors
Sreehari Veeramachaneni, Kirthi Krishna M, Lingamneni Avinash, Sreekanth Reddy Puppala, M.B. Srinivas
Centre for VLSI and Embedded System Technologies.
International Institute of Information Technology
Gachibowli, Hyderabad-500032, India.
srihari@research.iiit.ac.in, {kirthikrishna, avinashl, sreekanthp}@students.iiit.ac.in, srinivas@iiit.ac.in.

Abstract

The 3-2, 4-2 and 5-2 compressors are basic components in many applications, in particular partial product summation in multipliers. In this paper, novel architectures and designs of high-speed, low-power 3-2, 4-2 and 5-2 compressors capable of operating at ultra-low voltages are presented. The power consumption, delay and area of these new compressor architectures are compared with existing and recently proposed compressor architectures and are shown to be better. The proposed architecture emphasizes the use of multiplexers in arithmetic circuits, which results in a high-speed and efficient design. Also, in all existing implementations of XOR gates and multiplexers, both the output and its complement are available, but current compressor designs do not use these outputs efficiently. In the proposed architecture these outputs are efficiently utilized to improve the performance of the compressors. The combination of low power, low transistor count and lower delay makes the new compressors a viable option for efficient design.

1. Introduction.

Multiplication is a basic arithmetic operation important in applications like digital signal processing, which rely on efficient implementation of generic arithmetic logic units (ALU) and floating point units to execute dedicated operations like convolution and filtering. In the implementation of multipliers, the main phases are generation of partial products, reduction of partial products using CSA (carry-save architecture) [7-10] and a carry propagation adder for the computation of the final result. The second phase, that is, the reduction of the partial products, contributes most to the overall delay, area and power.
In most of these implementations, the compressor lies directly within the critical path that dictates the performance of the overall circuit, due to which the demand for high-speed and low-power compressors is continuously increasing [7-9]. This paper presents new compressor architectures that emphasize the use of multiplexers in place of XOR gates to efficiently use the outputs from the previous stages and improve the overall performance. This is because the use of multiplexers improves the speed when they are placed in the critical path [2].
The rest of the paper is organized as follows: In Section 2 the efficiency of MUX and XOR-XNOR blocks is compared and the possibility of replacing XOR-XNOR with MUX is discussed. In Sections 3 to 6 the proposed architectures of the 3-2, 4-2 and 5-2 compressors are presented and compared with the existing architectures. Implementations have been carried out in 0.18µm CMOS technology.

2. MUX Vs XOR-XNOR.

Existing CMOS designs of a 2x1 multiplexer and a 2-input XOR gate are shown in Fig.1 [2].

Fig.1. CMOS Implementations of (a) MUX (b) XOR-XNOR

In Fig.1(a), it can be seen that if both the select bit and its complement arrive before the inputs arrive then
20th International Conference on VLSI Design (VLSID'07)


0-7695-2762-0/07 $20.00 © 2007
the output is generated with very little delay because switching of the transistors is already completed. Also, if both the select bit and its complement are generated in the previous stage, then the additional inverter stage is eliminated, which reduces the overall delay in the critical path [2]. By using the output and its complement in every stage the total number of garbage outputs is reduced. By decreasing the number of transistors the overall power consumption and the area occupied are reduced considerably [1].
An alternative design of the multiplexer is shown in Fig.2.

Fig.2. Transmission Gate Implementation of a multiplexer

This design of the multiplexer is faster than the CMOS design when buffers are not used at the output [10]. However, it can only be used in the intermediate stages because of its limited driving capability. This design also consumes less power than the CMOS design [2]. In the proposed architectures the blocks where this design can be used are shown as MUX*.

3. 3-2 Compressor.

A 3-2 compressor takes 3 inputs X1, X2, X3 and generates 2 outputs, the sum bit S and the carry bit C, as shown in Fig.3a.
The compressor is governed by the basic equation

$$X_1 + X_2 + X_3 = \mathrm{Sum} + 2 \cdot \mathrm{Carry} \qquad (1)$$

Fig.3. (a) A 3-2 Compressor (b) Conventional Implementation of the 3-2 compressor

The 3-2 compressor can also be employed as a full adder cell when the third input is considered as the Carry input from the previous compressor block, that is, X3 = Cin.
Existing architectures shown in Fig.3 (b) employ two XOR gates in the critical path [3-6]. The equations governing the existing 3-2 compressor outputs are shown below, where an overbar denotes the logical complement.

$$\mathrm{Sum} = x_1 \oplus x_2 \oplus x_3 \qquad (2)$$

$$\mathrm{Carry} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (3)$$

In the proposed architecture shown in Fig. 4, the fact that both the XOR and XNOR values are computed is used to reduce the delay by replacing the second XOR with a MUX. This is possible because the select bit is available at the MUX block before the inputs arrive. Thus the time taken for the switching of the transistors in the critical path is reduced.

Fig.4. Proposed architecture of the 3-2 Compressor

The equations governing the 3-2 compressor outputs are shown below

$$\mathrm{Sum} = (x_1 \oplus x_2) \cdot \overline{x_3} + \overline{(x_1 \oplus x_2)} \cdot x_3 \qquad (4)$$

$$\mathrm{Carry} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (5)$$

It can be seen that in this implementation the overall delay is Δ-XOR + Δ-MUX (where Δ refers to delay).

4. 4-2 Compressor.

The 4-2 compressor has 4 inputs X1, X2, X3 and X4 and 2 outputs Sum and Carry, along with a Carry-in (Cin) and a Carry-out (Cout), as shown in Fig 5. The input Cin is the output from the previous, less significant compressor. The Cout is the output to the compressor in the next, more significant stage.

Fig.5. A 4-2 Compressor Block

Similar to the 3-2 compressor the 4-2 compressor is governed by the basic equation

$$x_1 + x_2 + x_3 + x_4 + \mathrm{Cin} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + \mathrm{Cout}) \qquad (6)$$

The standard implementation [3-6] of the 4-2 compressor uses 2 Full Adder cells as shown in Fig 6(a).

Fig.6. (a) A 4-2 compressor implemented with full adders (b) Existing implementation of 4-2 compressor

When the individual full adders are broken into their constituent XOR blocks, it can be observed that the overall delay is equal to 4*Δ-XOR. The block diagram in Fig. 6(b) shows the existing architecture for the implementation of the 4-2 compressor with a delay of 3*Δ-XOR [3-6]. The equations governing the outputs in the existing architecture are shown below

$$\mathrm{Sum} = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus \mathrm{Cin} \qquad (7)$$

$$\mathrm{Cout} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (8)$$

$$\mathrm{Carry} = (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot \mathrm{Cin} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4 \qquad (9)$$

However, as in the case of the 3-2 compressor, the fact that both the output and its complement are available at every stage is neglected [2]. Replacing some XOR blocks with multiplexers therefore results in a significant improvement in delay.

Fig 7. Proposed 4-2 Compressor Architecture

Also, the MUX block at the SUM output gets the select bit before the inputs arrive, and thus the transistors are already switched by the time they arrive. This minimizes the delay to a considerable extent. This is shown in Fig. 7.
The equations governing the outputs in the proposed architecture are shown below

$$\mathrm{Sum} = \left[ (x_1 \oplus x_2) \cdot \overline{(x_3 \oplus x_4)} + \overline{(x_1 \oplus x_2)} \cdot (x_3 \oplus x_4) \right] \cdot \overline{\mathrm{Cin}} + \left[ (x_1 \oplus x_2) \cdot (x_3 \oplus x_4) + \overline{(x_1 \oplus x_2)} \cdot \overline{(x_3 \oplus x_4)} \right] \cdot \mathrm{Cin} \qquad (10)$$

$$\mathrm{Cout} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (11)$$

$$\mathrm{Carry} = (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot \mathrm{Cin} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4 \qquad (12)$$

The critical path delay of the proposed implementation is Δ-XOR + 2*Δ-MUX.

5. 5-2 Compressor.

The 5-2 Compressor block has 5 inputs X1, X2, X3, X4, X5 and 2 outputs, Sum and Carry, along with 2 input carry bits (Cin1, Cin2) and 2 output carry bits (Cout1, Cout2) as shown in Fig.8a. The input carry bits are the outputs from the previous, less significant compressor block and the output carry bits are passed on to the next, more significant compressor block.

Fig.8. (a) A 5-2 compressor block (b) Conventional implementation of a 5-2 compressor block

The basic equation that governs the function of the 5-2 compressor block is given below

$$X_1 + X_2 + X_3 + X_4 + X_5 + \mathrm{Cin1} + \mathrm{Cin2} = \mathrm{Sum} + 2 \cdot (\mathrm{Carry} + \mathrm{Cout1} + \mathrm{Cout2}) \qquad (13)$$

The conventional implementation [3-6] of the compressor block is shown in Fig.8(b), where 3 cascaded full adder cells are used. When these full adders are replaced with their constituent XOR blocks, it can be observed that the overall delay is equal to 6*Δ-XOR for the sum or carry output. Many architectures have been proposed where the delay has been reduced to 5*Δ-XOR (Fig.9a) and then further reduced to 4*Δ-XOR (Fig.9 b&c) [3-6].
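The cascaded full-adder structure of the conventional 5-2 compressor can be checked against equation (13) with a short exhaustive simulation. The Python sketch below is illustrative only: the function names are hypothetical, and the assignment of X4, X5, Cin1 and Cin2 to the second and third adders is an assumption, since the exact wiring of Fig.8(b) is not spelled out in the text.

```python
from itertools import product


def full_adder(a, b, c):
    """1-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_5_2_conventional(x1, x2, x3, x4, x5, cin1, cin2):
    """Three cascaded full adders (wiring assumed, see lead-in).

    Returns (sum, carry, cout1, cout2).
    """
    s1, cout1 = full_adder(x1, x2, x3)      # first adder, independent of the carry inputs
    s2, cout2 = full_adder(s1, x4, cin1)    # second adder absorbs Cin1
    total, carry = full_adder(s2, x5, cin2) # third adder absorbs Cin2
    return total, carry, cout1, cout2


def satisfies_eq13():
    """Check equation (13) for all 2**7 input combinations."""
    for bits in product((0, 1), repeat=7):
        total, carry, cout1, cout2 = compressor_5_2_conventional(*bits)
        if sum(bits) != total + 2 * (carry + cout1 + cout2):
            return False
    return True
```

Each full adder contributes two XOR gates in series to the sum path, which is where the 6*Δ-XOR figure for the three-adder cascade comes from.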

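The MUX-based output equations (4)-(5) of the 3-2 compressor and (10)-(12) of the 4-2 compressor can be checked exhaustively against the basic equations (1) and (6). The Python sketch below is a behavioural model under stated assumptions, not the authors' circuit: the helper `mux` and the function names are hypothetical, and the placement of the complemented terms follows the usual full-adder identities where the printed equations are ambiguous.

```python
from itertools import product


def mux(sel, when0, when1):
    """2:1 multiplexer on single bits: when1 if sel is 1, else when0."""
    return when1 if sel else when0


def compressor_3_2(x1, x2, x3):
    """MUX-based 3-2 compressor, equations (4) and (5). Returns (sum, carry)."""
    x = x1 ^ x2                  # XOR output; x ^ 1 is the XNOR output
    total = mux(x3, x, x ^ 1)    # eq. (4): x3 selects between XOR and XNOR
    carry = mux(x, x1, x3)       # eq. (5): XOR output selects x3 or x1
    return total, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """MUX-based 4-2 compressor, equations (10)-(12). Returns (sum, carry, cout)."""
    a = x1 ^ x2
    b = x3 ^ x4
    ab = mux(b, a, a ^ 1)        # a XOR b, built from a and its complement
    total = mux(cin, ab, ab ^ 1) # eq. (10)
    cout = mux(a, x1, x3)        # eq. (11)
    carry = mux(ab, x4, cin)     # eq. (12)
    return total, carry, cout


def check_basic_equations():
    """Return True if equations (1) and (6) hold for every input pattern."""
    for bits in product((0, 1), repeat=3):
        total, carry = compressor_3_2(*bits)
        if sum(bits) != total + 2 * carry:
            return False
    for bits in product((0, 1), repeat=5):
        total, carry, cout = compressor_4_2(*bits)
        if sum(bits) != total + 2 * (carry + cout):
            return False
    return True
```

In this model every selector of a MUX that feeds an output is either a primary input or a value whose complement is already produced by an XOR-XNOR block, which mirrors the argument that no extra inverter stage is needed.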
Fig.9 Existing architectures of 5-2 compressors

Fig.10. Proposed architecture of the 5-2 compressor

In the proposed architecture, changes have been made to efficiently use the outputs generated at every stage by replacing a few XOR blocks with MUX blocks.
Also, the select bits to the multiplexers in the critical path are made available well ahead of the inputs so that the critical path delay is minimized. For example, the Cout2 output from the previous, less significant compressor block is used as the select bit one stage after it is produced, so that the MUX block is already switched and the output is produced as soon as the inputs arrive. Also, if the output of a multiplexer is used as the select bit for another multiplexer, it can be used efficiently in a similar manner: the negation of the select bit is also required in the design, as shown in Figure 1(a), and since it is already available, an extra stage to compute the negation is saved. Similarly, replacing the XOR block in the second stage with a MUX block reduces the delay because the select bit X3 is already available and the transistor switching takes place in parallel with the computation of the inputs of the block.
As mentioned before, in all the general implementations of the XOR or MUX block, in particular the CMOS implementation, both the output and its complement are generated. In the existing architectures this advantage is not utilized at all [3-6]. In the proposed architecture these outputs are utilized efficiently by using multiplexers at select stages in the circuit. Additional inverter stages are also eliminated. This in turn contributes to the reduction of delay, power consumption and transistor count (area).
The equations governing the outputs are shown below:

$$\mathrm{Sum} = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus \mathrm{Cin1} \oplus \mathrm{Cin2} \qquad (14)$$

$$\mathrm{Cout1} = (x_1 + x_2) \cdot x_3 + x_1 \cdot x_2 \qquad (15)$$

$$\mathrm{Cout2} = (x_4 \oplus x_5) \cdot \mathrm{Cin1} + \overline{(x_4 \oplus x_5)} \cdot x_4 \qquad (16)$$

$$\mathrm{Carry} = ((x_1 \oplus x_2 \oplus x_3) \oplus (x_4 \oplus x_5 \oplus \mathrm{Cin1})) \cdot \mathrm{Cin2} + \overline{((x_1 \oplus x_2 \oplus x_3) \oplus (x_4 \oplus x_5 \oplus \mathrm{Cin1}))} \cdot (x_1 \oplus x_2 \oplus x_3) \qquad (17)$$

The critical path delay of the proposed implementation is Δ-XOR + 3*Δ-MUX. In the carry generation module shown in Fig.10, equation (15) is used to design a CMOS implementation of Cout1 as shown in Fig.11.

Fig.11. Carry Generation Module (CGEN1)

6. Simulation and results

a. Simulation environment.
All the simulations have been done using Cadence Tools. The calculation of power (including glitch power) and delay is carried out using the Virtual Analog Simulation tool already integrated into Cadence Tools. All the schematics and layouts (Fig 13, 15 & 17) are done using the CMOS 0.18-µm

technology. Hence the circuits are optimized for this process technology.
The simulations are performed under various voltages ranging from 0.9V to 3.3V. All the inputs are fed at a frequency of 100MHz.

b. Simulation results.
The proposed and the existing architectures [3-6] have been compared by implementing both of them in 0.18-µm CMOS technology. Each plot in Figures 12, 14 and 16 compares the existing and the proposed design at supply voltages of 0.9V, 1.2V, 1.8V, 2.5V and 3.3V.

Figure 12 (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product (nW-ns) for 5-2 compressors

Fig.13 Layout of the proposed 5-2 compressor architecture

Figure 14 (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product for 4-2 compressors

Fig.15 Layout of the proposed 4-2 compressor architecture

Figure 16 (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product for 3-2 compressors

Fig.17 Layout of the proposed 3-2 compressor architecture

Figures 12, 14 & 16 show that the proposed architecture for the 5-2 compressor consumes 13.2% less power and is 26% faster than the existing architectures when operating at 1.8V. Because of the decrease in the number of transistors, the overall area decreases by about 11.15% in the proposed 5-2 compressor. The 4-2 compressor architecture is 33.3% faster and consumes 15% less power than the existing architectures. The proposed 3-2 compressor is 7% faster and consumes 10.2% less power than the existing architectures. The improvement in the power-delay product is 36.4%, 27.8% and 24% in the proposed 5-2 compressor, 4-2 compressor and 3-2 compressor respectively.
As mentioned in Section 2, the MUX* blocks in the proposed architecture can be implemented using transmission gate (CMOS+) logic. This new implementation is compared with the CMOS implementation and the results are shown below.

Figure 18 (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product for proposed 5-2 compressors with MUX* in CMOS and CMOS+ designs, at supply voltages of 0.9V, 1.2V, 1.8V, 2.5V and 3.3V.

Figure 18 shows that implementing the intermediate stages using the CMOS+ design in the proposed 5-2 compressor results in a delay improvement of 14.6%, a power improvement of 5.1% and an improvement of 18.2% in power-delay product when compared to the CMOS implementation of the same design. Similar results have been obtained with the 3-2 and 4-2 compressors.

7. Conclusions.

The architectures of the 3-2, 4-2 and 5-2 compressors are analyzed using CMOS and CMOS+ implementations of the XOR and MUX blocks. New 3-2, 4-2 and 5-2 compressor architectures have been proposed and compared with the existing architectures. Simulations have been performed over a range of voltages, from 0.9V to 3.3V. The proposed architectures perform better than the existing ones in every aspect, i.e., area, power, delay and power-delay product, over the complete voltage range simulated.

8. References.

[1] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design. Norwell, MA: Kluwer, 1995.
[2] R. Zimmermann and W. Fichtner, "Low-power logic styles: CMOS versus pass-transistor logic," IEEE J. Solid-State Circuits, vol. 32, pp. 1079–1090, July 1997.
[3] S. F. Hsiao, M. R. Jiang, and J. S. Yeh, "Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers," Electron. Lett., vol. 34, no. 4, pp. 341–343, 1998.
[4] K. Prasad and K. K. Parhi, "Low-power 4-2 and 5-2 compressors," in Proc. of the 35th Asilomar Conf. on Signals, Systems and Computers, vol. 1, 2001, pp. 129–133.
[5] C. H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 10, pp. 1985–1997, Oct. 2004.
[6] S. F. Hsiao, M. R. Jiang, and J. S. Yeh, "Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers," Electron. Lett., pp. 341–343, 1998.
[7] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers," IEEE Trans. Comput., vol. 44, pp. 962–970, Aug. 1995.
[8] M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann, 2004.
[9] I. Koren, Computer Arithmetic Algorithms. Englewood Cliffs, NJ: Prentice Hall, 1993.
[10] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective. Prentice Hall, 2003.

