Sei sulla pagina 1di 26

Eidgenossische Technische Hochschule Zurich

Ecole polytechnique federale de Zurich Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich

Institut f r Integrierte Systeme u

Integrated Systems Laboratory

Lecture notes on

Computer Arithmetic: Principles, Architectures, and VLSI Design


June 25, 1998

Reto Zimmermann
Integrated Systems Laboratory Swiss Federal Institute of Technology (ETH) CH-8092 Z rich, Switzerland u zimmermann@iis.ee.ethz.ch

Copyright c 1998 by Integrated Systems Laboratory, ETH Z rich u


http://www.iis.ee.ethz.ch/ zimmi/publications/comp arith notes.ps.gz

Contents

Contents

Contents
1 Introduction and Conventions 1.1 Outline

::::::::::::::::::::::: 4 :::::::::::::::::::::::::::::::::::::::::: 4 1.2 Motivation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4 1.3 Conventions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 5 1.4 Recursive Function Evaluation : : : : : : : : : : : : : : : : : : : : : 6 Arithmetic Operations : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8 2.1 Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8 2.2 Implementation Techniques : : : : : : : : : : : : : : : : : : : : : : : 9 Number Representations : : : : : : : : : : : : : : : : : : : : : : : : : : : 10 3.1 Binary Number Systems (BNS) : : : : : : : : : : : : : : : : : : : 10 3.2 Gray Numbers : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13 3.3 Redundant Number Systems : : : : : : : : : : : : : : : : : : : : : : 14 3.4 Residue Number Systems (RNS) : : : : : : : : : : : : : : : : : : 16 3.5 Floating-Point Numbers : : : : : : : : : : : : : : : : : : : : : : : : : : 18 3.6 Logarithmic Number System : : : : : : : : : : : : : : : : : : : : : 19 3.7 Antitetrational Number System : : : : : : : : : : : : : : : : : : : 19 3.8 Composite Arithmetic : : : : : : : : : : : : : : : : : : : : : : : : : : : 20 3.9 Round-Off Schemes : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 21 Addition : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 22 4.1 Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 22 4.2 1-Bit Adders, (m, k)-Counters : : : : : : : : : : : : : : : : : : : : 23
1

: : : : : : : : : : : : : : : : : : : 26 4.4 Carry-Save Adder (CSA) : : : : : : : : : : : : : : : : : : : : : : : : : 45 4.5 Multi-Operand Adders : : : : : : : : : : : : : : : : : : : : : : : : : : : 46 4.6 Sequential Adders : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 52 Simple / Addition-Based Operations : : : : : : : : : : : : : : : : 53 5.1 Complement and Subtraction : : : : : : : : : : : : : : : : : : : : : 53 5.2 Increment / Decrement : : : : : : : : : : : : : : : : : : : : : : : : : : : 54 5.3 Counting : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 58 5.4 Comparison, Coding, Detection : : : : : : : : : : : : : : : : : : : 60 5.5 Shift, Extension, Saturation : : : : : : : : : : : : : : : : : : : : : : 64 5.6 Addition Flags : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 66 5.7 Arithmetic Logic Unit (ALU) : : : : : : : : : : : : : : : : : : : : : 68 Multiplication : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 69 6.1 Multiplication Basics : : : : : : : : : : : : : : : : : : : : : : : : : : : : 69 6.2 Unsigned Array Multiplier : : : : : : : : : : : : : : : : : : : : : : : 71 6.3 Signed Array Multipliers : : : : : : : : : : : : : : : : : : : : : : : : : 72 6.4 Booth Recoding : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 73 6.5 Wallace Tree Addition : : : : : : : : : : : : : : : : : : : : : : : : : : : 75 6.6 Multiplier Implementations : : : : : : : : : : : : : : : : : : : : : : : 75 6.7 Composition from Smaller Multipliers : : : : : : : : : : : : : 76 6.8 Squaring : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 76 Division / Square Root Extraction : : : : : : : : : : : : : : : : : : 77 7.1 Division Basics : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 77
4.3 Carry-Propagate Adders (CPA)
2

Computer Arithmetic: Principles, Architectures, and VLSI Design Contents

Computer Arithmetic: Principles, Architectures, and VLSI Design

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 78 : : : : : : : : : : : : : : : : : : : : : : : : : : 78 7.4 Signed Division : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 79 7.5 SRT Division : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 80 7.6 High-Radix Division : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 81 7.7 Division by Multiplication : : : : : : : : : : : : : : : : : : : : : : : 81 7.8 Remainder / Modulus : : : : : : : : : : : : : : : : : : : : : : : : : : : : 82 7.9 Divider Implementations : : : : : : : : : : : : : : : : : : : : : : : : : 83 7.10 Square Root Extraction : : : : : : : : : : : : : : : : : : : : : : : : : 84 Elementary Functions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 85 8.1 Algorithms : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 85 8.2 Integer Exponentiation : : : : : : : : : : : : : : : : : : : : : : : : : : : 86 8.3 Integer Logarithm : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 87 VLSI Design Aspects : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 88 9.1 Design Levels : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 88 9.2 Synthesis : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 90 9.3 VHDL : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 91 9.4 Performance : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 93 9.5 Testability : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 95 Bibliography : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 96
7.2 Restoring Division 7.3 Non-Restoring Division

Computer Arithmetic: Principles, Architectures, and VLSI Design

1 Introduction and Conventions

1.2 Motivation

1 Introduction and Conventions

1.3 Conventions

1 Introduction and Conventions


1.1 Outline Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7] Circuit architectures and implementations of main arithmetic operations Aspects regarding VLSI design of arithmetic units 1.2 Motivation Arithmetic units are, among others, core of every data path and addressing unit Data path is core of : microprocessors (CPU) signal processors (DSP) data-processing application specic ICs (ASIC) and programmable ICs (e.g. FPGA) Standard arithmetic units available from libraries Design of arithmetic units necessary for : non-standard operations high-performance components library development
Computer Arithmetic: Principles, Architectures, and VLSI Design 1 Introduction and Conventions 4

1.3 Conventions Naming conventions

A (1-D), Ai (2-D), ai:k (subbus, 1-D) Signals : a, ai (1-D), ai k (2-D), Ai:k (group signal) Circuit complexity measures : A (area), T (cycle time, delay), AT (area-time product), L (latency, # cycles) Arithmetic operators : +, ;, , =, log (= log2 )
Signal buses : Logic operators :

+ (or),

(and),

(xor),

(xnor), (not)

Circuit complexity measures Unit-gate model ( Inverter, buffer : gate-equivalents (GE) model) :

A=0 T =0

(i.e. ignored)

Simple monotonic 2-input gates (AND, NAND, OR, NOR) : A = 1 T = 1 Simple non-monotonic 2-input gates (XOR, XNOR) : A=2 T =2 Complex gates : composed from simple gates
) Simple m-input gates :

A = m ; 1 T = dlog me

Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization) Only estimations given for complex circuits
Computer Arithmetic: Principles, Architectures, and VLSI Design 1 Introduction and Conventions 5

1.4 Recursive Function Evaluation

1.4 Recursive Function Evaluation a3 a2 a1 a0 1 funrsa.epsi 219 20 mm z

1.4 Recursive Function Evaluation Given : inputs ai , outputs zi , function f (graph sym. : ) Non-recursive functions (n.) Output zi is a function of input ai (or aj +m:j

2.

f is associative (r.s.a.) ) serial or single-tree structure : A = O(n) T = O(log n)

m const.)

b) with multiple outputs zi (r.m.) () prex problem) :

zi = f (ai x) ; i = 0 : : : n ; 1
) parallel structure :
a3 a2 a1 a0 funn.epsi 119 17 mm z3 z2 z1 z0

zi = f (ai zi;1) ; i = 0 : : : n ; 1 z;1 = 0=1


1.
) serial structure :

f is non-associative (r.m.n.) A = O(n) T = O(n) f is associative (r.m.a.)

a3 a2 a1 a0 1 funrmn.epsi 219 25 mm 3 z3 z2 z1 z0 a3 a2 a1 a0 1 2 z3 funrma1.epsi 19 43 mm z2 z1 z0

A = O(n) T = O(1)
Recursive functions (r.) Output zi is a function of all inputs ak a) with single output z

k i
2.

= zn (r.s.) :

) serial or multi-tree structure :

ti = f (ai ti;1) ; i = 0 : : : n ; 1 t;1 = 0=1 z = tn;1


1.
) serial structure :

A = O(n2) T = O(log n)
a3 a2 a1 a0

f is non-associative (r.s.n.) A = O(n) T = O(n)

1 funrsn.epsi 219 24 mm 3 z 6

) or shared-tree structure :

a3 a2 a1 a0 1funrma2.epsi 219 21 mm z3 z2 z1 z0

A = O(n log n) T = O(log n)


Computer Arithmetic: Principles, Architectures, and VLSI Design

Computer Arithmetic: Principles, Architectures, and VLSI Design

2 Arithmetic Operations

2.1 Overview

2 Arithmetic Operations

2.2 Implementation Techniques

2 Arithmetic Operations
2.1 Overview
based on operation related operation << , >>

2.2 Implementation Techniques Direct implementation of dedicated units :


fixed-point floating-point

always : 1 5 in most cases : 6 sometimes : 7, 8

=,<

+1 , 1

+/

+,

+,

Sequential implementation using simpler units and several clock cycles () decomposition) : sometimes : 6 in most cases : 7, 8, 9

arithops.epsi 98 83 mm

sqrt (x)

(same as on the left for floating-point numbers) complexity

Table look-up techniques using ROMs : universal : simple application to all operations efcient only for single-operand operations of high complexity (8 12) and small word length (note: ROM size = 2n n) Approximation techniques using simpler units : 712

exp (x)

log (x)

trig (x)

hyp (x)

1 2 3 4 5 6

shift/extension 7 division comparison 8 square root extraction increment/decrement 9 exponential function complement 10 logarithm function addition/subtraction 11 trigonometric functions multiplication 12 hyperbolic functions
8

taylor series expansion polynomial and rational approximations convergence of recursive equation systems CORDIC (COordinate Rotation DIgital Computer)
Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations 9

Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations

3.1 Binary Number Systems (BNS)

3.1 Binary Number Systems (BNS)

3 Number Representations
3.1 Binary Number Systems (BNS) Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]

Complement : ;A = 2n ; A = A + 1 , where A = (an;1 an;2 : : : a0 )

n-bit number is ordered sequence of bits (binary digits) : A = (an;1 an;2 : : : a0)2 ai 2 f0 1g
Simple and efcient implementation in digital circuits MSB/LSB (most-/least-signicant bit) : an;1 / a0 Represents an integer or xed-point number, exact Fixed-point numbers :

Sign : an;1 Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers) Ones (1s) complement : similar to 2s complement n;2 X Value : A = ;an;1 (2n;1 + 1) + ai 2i i=0 Range : ;(2n;1 ; 1) 2n;1 ; 1] Complement : ;A = 2n ; A ; 1 = A Sign : an;1

(| m;1 {z: : a0 : | ;1 : :{z am;n ) a : } a : }


m-bit integer
(

n ; m)-bit fraction
n;1 X i=0

Unsigned : positive or natural numbers Value :

A = an;1 2n;1 +

+ a12 + a0 =

ai2i

Properties : double representation of zero, symmetric range, modulo (2n ; 1) number system Sign-magnitude : alternative representation of signed numbers n;2 X ai 2i Value : A = (;1)an;1 i=0 n;1 ; 1) 2n;1 ; 1] Range : ;(2 Complement : ;A = (an;1 an;2 Sign : an;1
10

Range : 0 2n ; 1]

Twos (2s) complement : standard representation of signed or integer numbers n;2 X ai2i Value : A = ;an;1 2n;1 + i=0 Range : ;2n;1 2n;1 ; 1]
Computer Arithmetic: Principles, Architectures, and VLSI Design

: : : a0 )

Computer Arithmetic: Principles, Architectures, and VLSI Design

11

3 Number Representations

3.1 Binary Number Systems (BNS)

3 Number Representations

3.2 Gray Numbers

Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 () low power) Graphical representation
000...0 011...1 100...0 111...1

3.2 Gray Numbers Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic + Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers) Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times) Non-monotonic numbers : difcult arithmetic operations, e.g. addition, comparison :

binary number representation

n1

n1

numrep.epsi 95 73 mm

unsigned 2s complement 1s complement sign-magnitude

0 0 0 g1 g0 g1g0 g0 g0 0 0 < 0 1 and 0 < 1 1 1 < 1 0 but 1 > 0

binary ! Gray :

gi = bi+1 bi bn = 0 ; i = 0 : : : n ; 1 (n.)
Gray ! binary :

Conventions 2s complement used for signed numbers in these notes Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned
Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations 12

bi = bi+1 gi bn = 0 ; i = n ; 1 : : : 0 (r.m.a.)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

b3 b2 b1 b0 g3 g2 g1 g0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0

binary

Gray

Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations

13

3.3 Redundant Number Systems

3.3 Redundant Number Systems

3.3 Redundant Number Systems Non-binary, redundant, weighted number systems [1, 2] Digit set larger than radix (typically radix 2) ) multiple representations of same number ) redundancy + No carry-propagation in adders ) more efcient impl. of adder-based units (e.g. multipliers and dividers) Redundancy ) no direct implementation of relational operators ) conversion to irredundant numbers Several bits used to represent one digit ) higher storage requirements Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed) Delayed-carry of half-adder number representation : ri 2 f0 1 2g , ci si ai bi 2 f0 1g , ri = (ci+1 si) = 2ci+1 + si = ai + bi , ci+1si = 0 ; R = Pn=01 ri2i = (C S ) = C + S = A + B i 1 digit holds sum of 2 bits (no carry-out digit) example : (00 10) = 00 + 10 = 01 + 01 = (10 00) irredundant representation of ;1 [8], since ci+1si = 0 & C + S = ;1 ! S = ;1 C = 0 Carry-save number representation : ri 2 f0 1 2 3g , ci si ai bi di 2 f0 1g , ri = (ci+1 si) = 2ci+1 + si = ai + bi + di = ai + ri0 ; R = Pn=01 ri2i = (C S ) = C + S = A + R0 i
Computer Arithmetic: Principles, Architectures, and VLSI Design 14

1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved) standard redundant number system for fast addition Signed-digit (SD) or redundant digit (RD) number representation : ; ri si ti 2 f;1 0 1g f1 0 1g , R = Pn=01 ri2i i no carry-propagation in S = R + T :

ri + ti = (ci+1 ui) = 2ci+1 + ui , ci+1 ui 2 f1 (ci+1 ui) is redundant (e.g. 0 + 1 = 01 = 11) 8i 9(ci ui ) j ci + ui = si 2 f1 0 1g

0 1g

1 digit holds sum of 2 digits (no carry-out digit) minimal SD representation : minimal number of non-zero digits, 011f1g10 ! 100f0g10 applications : sequential multiplication (less cycles), lters with constant coefcients (less hardware) example : minimal 7 = (0111 j 1111 j 1011 j 1001 j 11111 j
z }| {

canonical SD repres.: minimal SD + not two non-zero digits in sequence, 01f1g10 ! 10f0g10 SD ! binary : carry-propagation necessary () adder) other applications : high-speed multipliers [9] similar to carry-save, simple use for signed numbers
Computer Arithmetic: Principles, Architectures, and VLSI Design 15

3 Number Representations

3.4 Residue Number Systems (RNS)

3 Number Representations

3.4 Residue Number Systems (RNS)

3.4 Residue Number Systems (RNS) Non-binary, irredundant, non-weighted number system [1] + Carry-free and fast additions and multiplications Complex and slow other arithmetic operations (e.g. comparison, sign and overow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required Codes for error detection and correction [1] Possible applications (but hardly used) : digital lters : fast additions and multiplications error detection and correction for arithmetic operations in conventional and residue number systems Base is n-tuple of integers (mn;1 mn;2 : : : m0 ), residues (or moduli) mi pairwise relatively prime

Arithmetic operations : (each digit computed separately)

Best moduli mi are 2k and (2k ; 1) :

zi = jZ jmi = jf (A)jmi = f (jAjmi ) mi = jf (ai)jmi jA + B jmi = jAjmi + jB jmi = jai + bijmi mi jA B jmi = jAjmi jB jmi = jai bijmi mi j ; ai jmi = jmi ; ai jmi a;1 mi = aimi ;2 mi (Fermats theorem) i
high storage efciency with k bits simple modular addition : 2k : k -bit adder without cout , 2k ; 1 : k -bit adder with end-around carry (cin = cout )

Example :

(m1 m0) = (3 2) , M = 6

A = (an;1 an;2 : : : a0 )mn; ai 2 f0 1 : : : mi ; 1g


Range:

mn;2 ::: m0 ,

A a1 a0

;4 ;3 ;2 ;1 0 1 2 3 4 5 6 7 8 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 0 1 0 1 0 1 0 1 0 1 0
| {z }

possible range

M=
n;1 X i=0

n; 1 Y i=0

mi, anywhere in Z Z
, Ci

ai = A mod mi = jAjmi , A = mi qi + ai
jAjM

j4 + 5j6 = (1 0) + (2 1) = = (j1 + 2j3 j0 + 1j2) = (0 1) = j3j6 j4 5j6 = (1 0) (2 1) = = (j1 2j3 j0 1j2) = (2 0) = j2j6

j5j6 = A = (a1

a0) = (j5j3

j5j2 ) = (2 1)

Ciai

= (: : :

0 1 0 |{z} i

: : :)
16

Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations

Computer Arithmetic: Principles, Architectures, and VLSI Design 3 Number Representations

17

3.5 Floating-Point Numbers

3.7 Antitetrational Number System

3.5 Floating-Point Numbers Larger range, smaller precision than xed-point representation, inexact, real numbers [1, 2] Double-number form ) discontinuous precision S biased exponent E unsigned norm. mantissa M F = (;1)S M E = (;1)S 1:M 2E;bias Basic arithmetic operations : A B = (;1)SA SB MA MB EA+EB A + B = (;1)SA MA + EA (;1)SB MB (EA ; EB ) base on xed-point add, multiply, and shift operations postnormalization required (1= M < 1) Applications : processors : real oating-point formats (e.g. IEEE standard), large range due to universal use ASICs : usually simplied oating-point formats with small exponents, smaller range, used for range extension of normal xed-point numbers IEEE oating-point format : precision single double

3.6 Logarithmic Number System Alternative representation to oating-point (i.e. mantissa + integer exponent ! only xed-point exponent) [1] Single-number form ) continuous precision ) higher accuracy, more reliable S biased xed-point exponent E L = (;1)S E = (;1)S 2E;bias (signed-logarithmic) Basic arithmetic operations :

number system and double conversion A B = (;1)SA SB EA+EB p Ay = (;1)SA y EA y A = (;1)SA

(A < B ) = (EA < EB ) (additionally consider sign) A + B : by approximation or addition in conventional


EA =y

+ Simpler multiplication/exponent., more complex addition Expensive conversion : (anti)logarithms (table look-up) Applications : real-time digital lters 3.7 Antitetrational Number System Tetration (t. x = |{z}) and antitetration (a.t. x) [10] 22 x Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. 2x ! t. x log x ! a.t. x)
2

n nM nE
32 64 23 52 8 11

127 3:8 1038 1023 9 10307

bias

range

precision 10;7 10;15

Computer Arithmetic: Principles, Architectures, and VLSI Design

18

Computer Arithmetic: Principles, Architectures, and VLSI Design

19

3 Number Representations

3.8 Composite Arithmetic

3 Number Representations

3.9 Round-Off Schemes

3.8 Composite Arithmetic Proposal for a new standard of number representations [10] Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers Secondary forms used for numbers not representable by primary ones () no over-/underow handling necessary) Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy Number representations :
integer : rational : logarithmic : antitetrational : tag 00 01 10 11 value 2s complement integer slash log integer a.t. integer

3.9 Round-Off Schemes Intermediate results with d additional lower bits () higher accuracy) : A = (an;1 : : : a0 a;1 : : : Rounding : keeping error small during nal word length reduction : R = (rn;1 : : : r0 ) = A ; Trade-off : numerical accuracy vs. implementation cost Truncation :
1 bias = ; 1 + 2d+ 2
1

a;d)

RTRUNC = (an;1 : : : a0 )
(= average error )

Round-to-nearest (i.e. normal rounding) :


denominator n numerator log fraction a.t. fraction
1 bias = 2d+ (nearly symmetric) + 0:1 can often be included in previous operation
1

0 RROUND = (a0n;1 : : : a0) A0 = A + 0:1

Round-to-nearest-even/-odd :

Rational numbers : slash position (i.e. size of numerator/ denominator) is variable and stored (oating slash) Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended) Implementation : mixed hardware/software solutions Hardware proposal : long accumulator (4096 bits) holds any oating-point number in xed-point format ) higher accurary ) large hardware/software overhead
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 20 4.1 Overview

0 : : a0 ) ;d RROUND ;EVEN = RROUND: : if 0(a0;)1 :otherwise6= 0 0 : a (an;1 1 bias = 0 (symmetric)


mandatory in IEEE oating-point standard

3 guard bits for rounding after oating-point operations : guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 21

4.2 1-Bit Adders, (m, k)-Counters

4 Addition
4.1 Overview
1-bit adders HA FA (m,k) (m,2)

4.2 1-Bit Adders, (m, k)-Counters Add up m bits of same magnitude (i.e. 1-bit numbers) Output sum as k -bit number (k

= blog mc + 1)

or : count 1s at inputs ) (m, k)-counter [3] (combinational counters)


RCA carry-propagate adders CLA CPA PPA COSA CSKA CSLA CIA

Half-adder (HA), (2, 2)-counter

(cout s) = 2cout + s = a + b
s=a b cout = ab

A = 3 T = 2 (1)

3-operand carry-save adders


adders.epsi 103 121 mm

CSA

(sum) (carry-out)
a b a b

multi-operand

adder array

adder tree

a b
multi-operand adders array adder tree adder
hasym.epsi HA 18 23 mm c
out

c out haschema1.epsi 19 28 mm

haschema2.epsi 21 43 mm c out

s
Legend: HA: FA: (m,k): (m,2): half-adder full-adder (m,k)-counter (m,2)-compressor CPA: carry-propagate adder RCA: ripple-carry adder CSKA:carry-skip adder CSLA: carry-select adder CIA: carry-increment adder related component CLA: carry-lookahead adder PPA: parallel-prefix adder COSA:conditional-sum adder CSA: carry-save adder

(reference)
s

based on component

Computer Arithmetic: Principles, Architectures, and VLSI Design

22

Computer Arithmetic: Principles, Architectures, and VLSI Design

23

4 Addition

4.2 1-Bit Adders, (m, k)-Counters

4 Addition

4.2 1-Bit Adders, (m, k)-Counters

Full-adder (FA), (3, 2)-counter

(m, k)-counters

(cout s) = 2cout + s = a + b + cin

A = 7 T = 4 (2 )

g = ab (generate) c0 = ab p = a b (propagate) c1 = a + b s = a b cin = p cin cout = ab + acin + bcin = ab + (a b)cin = g + pcin = pg + pcin = pa + pcin = cinc0 + cinc1
a b a b g HA faschematic3.epsi p 29 32 mm c c out in HA s a b a b a b

(sk;1 : : : s0) = k;1 m ;1 X X sj 2j = ai


j =0 i=0

a0

a m-1

...
cntsymbol.epsi 23 18 (m,k)mm

...
s k-1 s 0

Usually built from full-adders Associativity of addition allows convertion from linear to tree structure ) faster at same number of FAs

m A = 7 Plog 1 bm2;k c 7(m ; log m) k= TLIN = 4m + 2blog mc TTREE = 4dlog3 me + 2blog mc


Example : (7, 3)-counter

fasymbol.epsi FA c18 21 mm c
out

in

c out

faschematic2.epsi c in 32 35 mm

A = 28 T = 14
a0a1 a2a3a4a5a6 FA

A = 28 T = 10
a0a1 a2 FA a3a4 a5a6 FA

s s a b

FA
p faschematic4.epsi c out c in 29 1 41 mm
0

count73ser.epsi 42 59 mm
c out

count73par.epsi FA 36 48 mm

c out

faschematic1.epsi g p 29 43 mm

c in

faschematic5.epsi 0 c0 35 47 mm 1 c1
c in

FA s2

FA s1 s0

FA s2 s1 s0

(reference)

s s

linear structure

tree structure
25

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

24

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

4.3 Carry-Propagate Adders (CPA)

4.3 Carry-Propagate Adders (CPA)

4.3 Carry-Propagate Adders (CPA) Add two n-bit operands A and B and an optional carry-in cin by performing carry-propagation [1, 2, 11] Sum (cout

Carry-propagation speed-up techniques a) Concatenation of partial CPAs with fast cin ! cout
a n-1:j b n-1:j
...

S ) is irredundant (n + 1)-bit number


A B

a i-1:k b i-1:k
speedup1.epsi CPA c i84 26 mm
...

a k-1:0 b k-1:0

(cout S ) = cout2n + S = A + B + cin


2ci+1 + si

c out

CPA

cj

ck

CPA

c in

= ai + bi + ci ; i = 0 1 ::: n; 1 c0 = cin cout = cn (r.m.a.)

s n-1:j

s i-1:k

s k-1:0

cpasymbol.epsi CPA c out 29 26 mm c in

a) Fast carry look-ahead logic for entire range of bits


a n-1 b n-1 a1 b1 a0 b0

Ripple-carry adder (RCA) Serial arrangement of n full-adders Simplest, smallest, and slowest CPA structure
c out
... preprocessing

speedup2.epsi 104 50 mm

carry propagation

c in

A = 7n T = 2n AT = 14n2
a n-1 FA s n-1 b n-1
...

...

postprocessing

a1

b1

a0 FA s0

b0

s n-1

s1

s0

c out

c n-1

rca.epsi 57c 23FA mm


2

c1

c in

...

s1

Computer Arithmetic: Principles, Architectures, and VLSI Design

26

Computer Arithmetic: Principles, Architectures, and VLSI Design

27

4 Addition

4.3 Carry-Propagate Adders (CPA)

4 Addition

4.3 Carry-Propagate Adders (CPA)

Carry-skip adder (CSKA) Type a) : partial CPA with fast ck ! ci

Carry-select adder (CSLA) Type a) : partial CPA with fast ck ! ci and ck ! si;1:k

ci = P i;1:k c0i + Pi;1:k ck (bit group (ai;1 : : : ak )) Pi;1:k = pi;1pi;2 pk (group propagate)
) path ck ! c0i ! ci never sensitized ) fast ck ! ci ) false path ) inherent logic redundancy ) problems in circuit optimization, timing analysis, and testing

si;1:k = ck s0;1:k + ck s1;1:k i i ci = ck c0 + ck c1 i i

1) Pi;1:k 2) Pi;1:k

= 0 : ck 6! c0i and c0i selected (c0i ! ci) = 1 : ck ! c0i but c0i skipped (c0i 6! ci)

Two CPAs compute two possible results (cin = 0=1), group carry-in ck selects correct one afterwards Variable group sizes (faster) : larger groups at end (MSB) (balance delays a0 ! ck and ak ! c0 ) i Part. CPA typ. is RCA, CSLA () multil. CSLA), or CLA High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)

Variable group sizes (faster) : larger groups in the middle (minimize delays a0 ! ck ! si;1 and ak ! ci ! sn;1 ) Partial CPA typ. is RCA or CSKA () multilevel CSKA) Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)

A
...
0

14n

2:8n1=2

AT

39n3=2
a k-1:0 b k-1:0

a i-1:k b i-1:k

A
a n-1:j b n-1:j

8n

4n1=2

AT

32n3=2
a k-1:0 b k-1:0 c out

a i-1:k b i-1:k
...

c i0

CPA

0
csla.epsi 102 50CPA mm

ci
...

ck

CPA

c in

c out

CPA

cj
...

ci

cska.epsi 99 36 mm 1

ci

CPA ck CPA c in ck

c i1

0 s i-1:k 0 1

1 s i-1:k

P i-1:k s n-1:j s i-1:k s k-1:0


28

s i-1:k
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

s k-1:0
29

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

4.3 Carry-Propagate Adders (CPA)

4.3 Carry-Propagate Adders (CPA)

Carry-increment adder (CIA) Type a) : partial CPA with fast ck ! ci and ck ! si;1:k

Example : gate-level schematic of carry-incr. adder (CIA) only 2 different logic cells (bit-slices) : IHA and IFA
max ngroup

si;1:k = s0i;1:k + ck ci = c0i + Pi;1:k ck Pi;1:k = pi;1pi;2 pk (group propagate)


Result is incremented after addition, if ck

4 6 10 12 14 16 18 20 22 24 26 28 ... 38 2 3 4 5 6 7 8 9 10 11 ... 16 1 2 4 7 11 16 22 29 37 46 56 67 ... 137


b i-1 IFA
...

= 1 [12, 11]
IFA

a i-1

a i-2

b i-2 IFA

a k+1

b k+1 IHA

ak

bk

Variable group sizes (faster) : larger groups at end (MSB) (balance delays a0 ! ck and ak ! c0i ) Part. CPA typ. is RCA, CIA () multilevel CIA) or CLA High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)

...

...

Logic of CPA and incrementer can be merged [11]

A
...

10n

T
ci

2:8n1=2
a i-1:k b i-1:k

AT
0 ck

28n3=2
a k-1:0 b k-1:0

ci

s i-1 (i-k-1)IFA + IHA

ciagate.epsi 100 112 mm s i-2


2IFA + IHA

s k+1 IFA + IHA IHA

sk IHA

ck

CPA
86 cia.epsi si-1:k 43 mm

c out
...

ci P i-1:k

CPA

c in

...

bits i-1...k

...

bits 6...4

bits 3,2

bit 1

bit 0

+1

s i-1:k

s k-1:0
c out c in

Computer Arithmetic: Principles, Architectures, and VLSI Design

30

Computer Arithmetic: Principles, Architectures, and VLSI Design

31

4 Addition

4.3 Carry-Propagate Adders (CPA)

4 Addition

4.3 Carry-Propagate Adders (CPA)

Conditional-sum adder (COSA) Type a) : optimized multilevel CSLA with (log n) levels (i.e. double CPAs are merged at higher levels) Correct sum bits (si;1:k or si;1:k ) are (conditionally) selected through (log n) levels of multiplexers
0 1

Carry-lookahead adder (CLA), traditional Type b) : carries looked ahead before sum bits computed Typically 4-bit blocks used (e.g. standard IC SN74181)

Bit groups of size 2l at level l Higher parallelism, more balanced signal paths Highest speed-up at highest hardware overhead (2 RCA + more than (log n) MUX/bit)

c0 = c00 c1 = g0 + p0c00 c2 = g1 + p1g0 + p1p0c00 c3 = g2 + p2g1 + p2p1g0 + p2p1 p0c00 0 g3 = g3 + p3g2 + p3p2g1 + p3p2 p1g0 0 = p3 p2 p1 p0 p3

(g3,p3)

...

(g0,p0)

clbsymbol.epsi 26 27 CLB mm c 0

(g,p) c 3 . . . c 0 3 3

3n log n
a3 b3

T
a2

2 log n
b2

AT
a1

6n log
b1

n
a0 b0

Hierarchical arrangement using ( 1 log n) levels : 2 (g30 p03) passed up, c00 passed down between levels High speed-up at medium hardware overhead

A
c in

14n

4 log n

AT

56n log n

level 0

...

FA FA

0 1

FA FA

0 1

FA FA

0 1 FA

(g15,p15) . . . (g12,p12)(g11,p11) . . . (g8,p8) (g7,p7) . . . (g4,p4) (g3,p3) . . . (g0,p0)

level 1

...

cosa.epsi 100 57 mm
0 1

CLB c 15 . . . c 12

CLB

CLB

CLB c3 . . . c0

level 2

...

c 11 . . . c 8 cla.epsi c 7 . . . c 4 97 48 mm

c out

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

Parallel-prex adders (PPA) Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA) Preprocessing, carry-lookahead, and postprocessing step Carries calculated using parallel-prex algorithms + High regularity : suitable for synthesis and layout + High exibility : special adders, other arithmetic operations, exchangeable prex algorithms (i.e. speeds) + High performance : smallest and fastest adders

a n-1 b n-1 a n-2 b n-2

a1 b1 a0 b0

c out s n-1 s n-2

s1

Computer Arithmetic: Principles, Architectures, and VLSI Design

s0

... s3 s2 s1 s0

CLB

c in

+ preprocessing : gi = ai bi + postprocessing : si = pi

pi = ai bi ci
33

32

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

4.3 Carry-Propagate Adders (CPA)

4.3 Carry-Propagate Adders (CPA)

Prex problem Inputs (xn;1 : : : x0 ), outputs (yn;1 binary operator [11, 13]

: : : y0), associative
or

(yn;1 : : : y0) = (xn;1 x0 : : : x1 x0 x0) y0 = x0 yi = xi yi;1 ; i = 1 : : : n ; 1 (r.m.a.)


Associativity of
) tree structures for evaluation :
1 Y3:2 1 y1 = Y1:0 1 y1 = Y1:0

x3 (x2 (x1 {z x0)) = (x3 {z x2 ) (| 1 {z x0 ) , but y2 ? x } | } | }


| |
2 y2 = Y2:0 3 y3 = Y3:0

{z

} }

{z

2 y3 = Y3:0

{z

A
...
(gn-1 , p n-1 )

5n + 3A

T = 4 + 2T
preprocessing:
c in

...
(g0 , p0 )
add.epsi///gures 73 64 mm

gi = aibi pi = ai bi
carry-lookahead: prex algorithm

: : : xi) at level l Carry-propagation is prex problem : Yil:k = (Gl :k Pil:k ) i (G0:i Pi0:i) = (gi pi) i 1 1 (Gli:k Pil:k ) = (Gli;+1 Pil:;+1) (Glj;k1 Pjl:;1) ; k j i :j j : k 1 1 1 = (Gil;+1 + Pil:;+1Glj;k1 Pil:;+1Pjl:;1) :j j j : k ci+1 = Gm ; i = 0 : : : n ; 1 l = 1 : : : m i:0
Parallel-prex algorithms [14] : multi-tree structures (T = O(n) ! O(log n)) sharing subtrees (A = O(n2 ) ! O(n log n)) different algorithms trading area vs. delay (inuences also from wiring and maximum fan-out FOmax )

Group variables Yil:k : covers bits (xk

c n p n-1

c1

p0

c0

...

...

postprocessing:

si = pi ci
34

Computer Arithmetic: Principles, Architectures, and VLSI Design

35

4 Addition

4.3 Carry-Propagate Adders (CPA)

4 Addition

4.3 Carry-Propagate Adders (CPA)

Prex algorithms Algorithms visualized by directed acyclic graphs (DAG) with array structure (n bits m levels) Graph vertex symbols : l 1 1 (Gil;+1 Pil:;+1) (Gj;k1 Pjl:;1 ) :j j : k

Sklansky parallel-prex algorithm () PPA-SK) Tree-like collection, parallel redistribution of carries

A
0 1 2 3 4

1 2

n log n T = dlog ne FOmax

1 2

? ; y ;? l l ; (Gli:k Pil:k ) (Gi:k Pi:k )


(contains logic for )

? i ;?l l (Gli:k Pil:k ) (Gi:k Pi:k )


(contains no logic)

(Gil;1 Pil:;1 ) :k k

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

sk.epsi///gures 67 30 mm

Performance measures : A : graph size (number of black nodes)

Brent-Kung parallel-prex algorithm () PPA-BK) Traditional CLA is PPA-BK with 4-bit groups Tree-like redistribution of carries (fan-out tree)

: graph depth (number of black nodes on critical path)

Serial-prex algorithm () RCA)

A = n ; 1 T = n ; 1 FOmax = 2
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 ... 14 15

A = 2n ; dlog ne ; 2 T = 2dlog ne ; 2 FOmax log n


15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6

ser.epsi///gures 69 38 mm

bk.epsi///gures 67 38 mm

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

36

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

37

4.3 Carry-Propagate Adders (CPA)

4.3 Carry-Propagate Adders (CPA)

Kogge-Stone parallel-prex algorithm () PPA-KS) very high wiring requirements

Mixed serial/parallel-prex algorithm () RCA + PPA) linear size-depth trade-off using parameter k : 0

n log n ; n + 1 T = dlog ne FOmax = 2


15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 ks.epsi///gures 67 52 mm

k n ; 2dlog ne + 2

k = 0 : serial-prex graph k = n ; 2dlog ne + 1 : Brent-Kung parallel-prex


graph lls gap between RCA and PPA-BK (i.e. CLA) in steps of single -operations

A = n ; 1 + k T = n ; 1 ; k FOmax = var.
4 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10

Carry-increment parallel-prex algorithm () CIA)

2n ; 1:4n1=2

1:4n1=2

FOmax

1:4n1=2

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5

var.epsi///gures 68 54 mm

cia.epsi///gures 67 34 mm

Computer Arithmetic: Principles, Architectures, and VLSI Design

38

Computer Arithmetic: Principles, Architectures, and VLSI Design

39

4 Addition

4.3 Carry-Propagate Adders (CPA)

4 Addition

4.3 Carry-Propagate Adders (CPA)

Example : 4-bit parallel-prex adder (PPA-SK) efcient AND-OR-prex circuit for the generate and AND-prex circuit for the propagate signals optimization: alternatingly AOI-/OAI- resp. NAND-/ NOR-gates (inverting gates are smaller and faster) can also be realized using two MUX-prex circuits
a3 b3 a2 b2 a1 b1 a0 b0

Prex adder synthesis Local prex graph transformation :


3 2 1 0

A =3 T =3

0 unfact.epsi 1 20 26 mm 2 3

depth-decr. transform

3 2 1 0 0 fact.epsi 1 20 26 mm 2 3

;! ;

size-decr. transform

A =4 T =2

c in

Repeated (local) prex transformations result in overall minimization of graph depth or size ) which sequence ? Goal: minimal size (area) at given depth (delay) Simple algorithm for sequence of applied transforms : Step 1 : prex graph compression (depth minimization) : depth-decr. transforms in right-to-left bottom-up order Step 2 : prex graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded Prex adder synthesis : 1) generate serial-prex graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prex logic + Generates all previous prex graphs (except PPA-KS) + Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [14] (including non-uniform signal arrival times)

askgate.epsi///gures 100 103 mm

c out P n-1:0 s3 s2 s1 s0
40

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

41

4.3 Carry-Propagate Adders (CPA)

4.3 Carry-Propagate Adders (CPA)

Multilevel adders Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L) + Delay is O(n1=(m+1) ) for m levels Area increase small for CSKA and CIA, high for CSLA () COSA) Difcult computation of optimal group sizes Hybrid adders Arbitrary combinations of speed-up techniques possible ) hybrid/mixed adder architectures Often used combinations : CLA and CSLA [15] Pure architectures usually perform best (at gate-level)

Self-timed adders + RCA is fast in average case (T = O(log n)), slow in worst case ) suitable for self-timed asynchronous designs [16] Completion detection is not trivial Adder performance comparisons Standard-cell implementations, 0:8
area [lambda^2] RCA 1e+07 128-bit CSKA-2L CIA-1L 5 64-bit CIA-2L PPA-SK PPA-BK 32-bit 2 16-bit 1e+06 8-bit 5

Average carry-propagation length : log n

m process

Transistor-level adders Inuence of logic styles (e.g. dynamic logic, pass-transistor logic ) faster) + Efcient transistor-level implementation of ripple-carry chains (Manchester chain) [15] + Combinations of speed-up techniques make sense Much higher design effort Many efcient implementations exist and published
Computer Arithmetic: Principles, Architectures, and VLSI Design 42
2 5 10 20

addperf.ps 84 84 mm

CLA COSA const. AT

delay [ns]

Computer Arithmetic: Principles, Architectures, and VLSI Design

43

4 Addition

4.3 Carry-Propagate Adders (CPA)

4 Addition

4.4 Carry-Save Adder (CSA)

Complexity comparison under the unit-gate model adder RCA CSKA-1L CSKA-2L CSLA-1L CIA-1L CIA-2L CIA-3L PPA-SK PPA-BK PPA-KS CLA 5 COSA
1

4.4 Carry-Save Adder (CSA) a) Adds three n-bit operands A0 , A1 , A2 performing no carry-propagation (i.e. carries are saved) [1]

n log n 10n 3n log n 14n 3n log n


3 2

8n 8n 14n 10n 10n 10n

7n

2n 4n1=2 xn1=3 4 2:8n1=2 2:8n1=2 3:6n1=3 4:4n1=4 2 log n 4 log n 2 log n 4 log n 2 log n

3n log2 n 40n log n 6n log2 n 56n log n 6n log2 n

14n2 32n3=2 xn4=3 4 39n3=2 28n3=2 36n4=3 44n5=4

AT

opt.1 syn.2 p aaa aat 3 att att ttt att

(C S ) = C + S = A0 + A1 + A2
2ci+1 + si

A0 A1 A2
csasymbol.epsi 26 21 CSAmm

p p p p p p ( )

= a0 i + a1 i + a2 i ; i = 0 1 : : : n ; 1 (n.) (C S )out = A + (C S )in

b) Adds one n-bit operand to an n-digit carry-save operand Result is in redundant carry-save format (n digits), represented by two n-bit numbers S (sum bits) and C (carry bits) + Parallel arrangement of n full-adders, constant delay

optimality regarding area and delay aaa : smallest area, longest delay aat : small area, medium delay att : medium area, short delay ttt : large area, shortest delay : not optimal 2 obtained from prex adder synthesis 3 automatic logic optimization not possible (redundancy) 4 exact factors not calculated 5 corresponds to 4-bit PPA-BK
44

A = 7n T = 4
a 0,n-1 a 1,n-1 a 2,n-1 a 0,1 a 1,1 a 2,1 a 0,0 a 1,0 FA c1 s0
csa.epsi 27 mm FA

FA cn s n-1

. . . 67

c2

s1

Multi-operand carry-save adders (m > 3) ) adder array (linear arrangement), adder tree (tree arr.)
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 45

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

4.5 Multi-Operand Adders

4.5 Multi-Operand Adders

4.5 Multi-Operand Adders Add three or more (m > 2) n-bit operands, yield (n + dlog me)-bit result in irredundant number rep. [1, 2] Array adders Realization by array adders : (see gures on next page) a) linear arrangement of CPAs b) linear arr. of CSAs (adder array) and nal CPA a) and b) differ in bit arrival times at nal CPA : ) if CPA = RCA : a) and b) have same overall delay ) if fast nal CPA : uniform bit arrival times required ) CSA array (b) Fast implementation : CSA array + fast nal CPA (note: array of fast CPAs not efcient/necessary)

a) 4-operand CPA (RCA) array :


a 0,n-1 a 1,n-1 a 0,2 a 1,2 a 0,1 a 1,1 a 0,0 a 1,0

...

a 2,0 HA a 2,0 HA a 3,0 HA s0


CPA CPA CPA

FA a 2,n-1 FA a 3,n-1 FA sn FA
... ... ...

FA a 2,2

FA a 2,1

cparray.epsi 93 FA 57 mm FA

a 3,2 FA s2

a 3,1 FA s1

s n-1

b) 4-operand CSA array with nal CPA (RCA) :


a 0,n-1 a 1,n-1 a 2,n-1 a 0,2 a 1,2 a 2,2 a 0,1 a 1,1 a 2,1 a 0,0 a 1,0 a 2,0
CSA CSA CPA

A = (m ; 2)ACSA + ACPA T = (m ; 2)TCSA + TCPA


CPA = RCA :

A0 A1 A2

A3

A m-1

CSA

...

FA a 3,n-1

...

FA a 3,2

FA a 3,1 FA

FA a 3,0 HA

A = O(mn + n) T = O(m + n)

mopadd.epsi CSA 30 58 mm
...

FA

...

csarray.epsi 99 57 mm FA

Fast CPA :

A = O(mn + n log n) T = O(m + log n)

CPA

FA sn

FA
...

FA s2

HA s1 s0

s n-1

S
Computer Arithmetic: Principles, Architectures, and VLSI Design 46 Computer Arithmetic: Principles, Architectures, and VLSI Design 47

4 Addition

4.5 Multi-Operand Adders

4 Addition

4.5 Multi-Operand Adders

(m, 2)-compressors 2(c +

m;4 X

l=0 m;1 X i=0

clout) + s =
m;4 X l =0

a0

a m-1

...
0 c out m-4 c out

ai +

clin

cprsymbol.epsi 26 37 (m,2)mm

0 c in m-4 c in

A = 7(m ; 2) TLIN = 4(m ; 2) TTREE = 6(dlog me ; 1)


Optimized (4, 2)-compressor : 2 full-adders merged and optimized (i.e. XORs arranged in tree structure)

1-bit adders (similar to (m, k)-counters) [17] Compresses m bits down to 2 by forwarding (m ; 3) intermediate carries to next higher bit position Is bit-slice of multi-operand CSA array (see prev. page) + No horizontal carry-propagation (i.e. cl ! ck in out k > l) Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures Example : 4-operand adder using (4, 2)-compressors
a 0,n-1 a 1,n-1 a 2,n-1 a 3,n-1 a 0,2 a 1,2 a 2,2 a 3,2 a 0,1 a 1,1 a 2,1 a 3,1 a 0,0 a 1,0 a 2,0 a 3,0

...

...

A = 14 T = 8
a0 a1 a2 a3 FA
cpr42fa.epsi 32 38 mm

A = 14 T = 6
a0 a1 a2 a3

)
c in c out

c out

cpr42opt.epsi 1 41 53 mm

FA c s

c in
0 1

with full-adders
CSA

(4,2)

(4,2) cpradd.epsi 99 44 mm FA s2

(4,2)

(4,2)

optimized + same area, 25% shorter delay

FA s n+1 sn

FA s n-1

HA s1 s0

CPA

SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity
Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition 49

Computer Arithmetic: Principles, Architectures, and VLSI Design 4 Addition

48

4.5 Multi-Operand Adders

4.5 Multi-Operand Adders

Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors : higher compression rate (4:2 instead of 3:2) less deep and more regular trees tree depth # operands FA (4,2) 012 3 4 5 6 7 8 9 10

Tree adders (Wallace tree) Adder tree : n-bit m-operand carry-save adder composed of n tree-structured (m, 2)-compressors [1, 18] Tree adders : fastest multi-operand adders using an adder tree and a fast nal CPA

2 3 4 6 9 13 19 28 42 63 94 2 4 8 16 32 64 128

A = A(m 2) n + ACPA = O(mn + n log n) T = T(m 2) + TCPA = O(log m + log n)


Adder arrays and adder trees revisited Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs) Number of (irredundant) FA does not depend on adder structure, but number of HA does An m-operand adder accomodates (m ; 1) carry inputs Adder trees (T = O(log n)) are faster than adder arrays (T = O(n)) at same amount of gates (A = O(mn)) Adder trees are less regular and have more complex routing than adder arrays ) larger area, difcult layout (i.e. limited use in layout generators)
50 Computer Arithmetic: Principles, Architectures, and VLSI Design 51

Example : (8, 2)-compressor

A = 42 T = 16
a0a1 a2a3 FA a4a5 a6a7 FA
0 c out 0 c in 1 c in 1 c out 2 c out 3 c out

A = 42 T = 12
a0a1a2a3 a4a5a6a7
0 c in 1 c in

0 c out 1 c out 2 c out 3 c out 4 c out

(4,2)

(4,2)

FA

FA
cpr82fa.epsi 47 65 mm

cpr82cpr42.epsi 47 50 mm

2 c in 3 c in

2 c in 3 c in

FA

4 c out 4 c in

(4,2)

4 c in

FA c s

(4, 2)-compressor tree

full-adder tree
Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

4.6 Sequential Adders

5 Simple / Addition-Based Operations

5.1 Complement and Subtraction

4.6 Sequential Adders Bit-serial adder : Sequential n-bit adder


ai bi
bitseradd.epsi FA 25 27 mm

5 Simple / Addition-Based Operations


5.1 Complement and Subtraction 2s complementer (negation)
;A = A + 1
A

A = AFA + AFF T = TFA + TFF L=n

neg.epsi 21 32 mm1

+1

Accumulators : Sequential m-operand adders A With CPA

si

Z A B

2s complement subtractor

A = ACPA + AREG T = TCPA + TREG L=m


With CSA and nal CPA Allows higher clock rates Final CPA too slow : ) pipelining or multiple cycles for evaluation
A

accucpa.epsi CPA 28 mm 27

A ; B = A + (;B ) =A+B+1

sub.epsi 29 32 mm

c out

CPA

S A B

2s complement adder/subtractor
CSA
accucsa.epsi 33 52 mm

A B = A + (;1)sub B = A + (B sub) + sub


1s complement adder

addsub.epsi 36 35 mm

c out

CPA

sub

A = ACSA + ACPA + 4AREG T = TCSA + TREG L=m

S CPA A B

Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 52

A + B (mod 2n ; 1) = A + B + cout
(end-around carry)

c out

addmod.epsi 28 29 CPAmm

c in

S
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 53

5.2 Increment / Decrement

5.2 Increment / Decrement

5.2 Increment / Decrement Incrementer Adds a single bit cin to an n-bit operand A (cout Z ) = cout2n + Z = A + cin

Prex problem :

Ci:k = Ci:j+1Cj:k

) AND-prex struct.
1 2

A
A

1 2

n log n + 2n T = dlog ne + 2 AT

n log2 n

Decrementer
a n-1

(cout Z ) = A ; cin
a2 a1 a0
...

zi = ai ci ci+1 = aici ; i = 0 : : : n ; 1 c0 = cin cout = cn (r.m.a.)


Corresponds to addition with B

incsymbol.epsi +1 c out 29 26 mm c in

Z c out
...

= 0 () FA ! HA)
3n2

dec.epsi 93 41 mm

c in

Example : Ripple-carry incrementer using half-adders

A = 3n T = n + 1 AT
a n-1
...

z n-1

z2

z1

z0

a1
incfa.epsi HA 59 23 mm c c2 1

a0 HA z0

Incrementer-decrementer
c in

c out

HA z n-1

c n-1

(cout Z ) = A cin = A + (;1)dec cin


a n-1 a2 a1 a0 dec
...

...

z1

or using incrementer slices (= half-adder)


a n-1
...

a2

a1

a0

incdec.epsi 94 46 mm inc.epsi 83 33 mm

c out
...

c in

c out
...

c in

HA z n-1 z2 z1 z0
54

z n-1

z2

z1

z0
55

Computer Arithmetic: Principles, Architectures, and VLSI Design

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

5.2 Increment / Decrement

5 Simple / Addition-Based Operations

5.2 Increment / Decrement

Fast incrementers 4-bit incrementer using multi-input gates :


a3 a2 a1 a0

Gray incrementer Increments in Gray number system

inccg.epsi 62 39 mm

c in

c out z3 z2 z1 z0

c0 = an;1 an;2 a0 (parity) ci+1 = ai ci ; i = 0 : : : n ; 3 (r.m.a.) z0 = a0 c0 zi = ai ai;1 ci;1 ; i = 1 : : : n ; 2 zn;1 = an;1 cn;2
Prex problem ) AND-prex structure

8-bit parallel-prex incrementer (Sklansky AND-prex structure) :


a7 a6 a5 a4 a3 a2 a1 a0 c in

incpp.epsi 98 63 mm

c out

z7

z6

z5

z4

z3

z2

z1

z0
56 5.3 Counting Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 57 5.3 Counting

Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations

5.3 Counting Count clock cycles ) counter, divide clock frequency ) frequency divider (cout ) Binary counter Sequential in-/decrementer Incrementer speed-up techniques applicable Down- and up-down-counters using decrementers / incrementer-decrementers

Fast divider (T = O(1)) using delayed-carry numbers (irredundant carry-save represention of ;1 allows using fast carry-save incrementer) [8] Gray counter Counter using Gray incrementer

c out

+1 cntblock.epsi 32 33 mm

c in clk

Ring counters Shift register connected to ring :


cntring.epsi 51 16 mm

Q q n-1

Example : Ripple-carry up-counter using counter slices (= HA + FF), cin is count enable
c out
...

q2

q1

q0

State is not encoded ) n FF for counting n states Must be initialized correctly (e.g. 00 Applications: fast dividers (no logic between FF) state counter for one-hot coded FSMs 01)

c in
cntripple.epsi 87 36 mm

q n-1

q2

q1

q0

Johnson / twisted-ring counter (inverted feed-back) :


cntjohnson.epsi 59 16 mm

Asynchronous counter using toggle-ip-ops (lower toggle rate ) lower power)


T ... T T T

cntasync.epsi 64 18 mm

clk

q n-1

q2

q1

q0

q n-1

q2

q1

q0
58

n FF for counting 2n states


Computer Arithmetic: Principles, Architectures, and VLSI Design 59

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

5.4 Comparison, Coding, Detection

5 Simple / Addition-Based Operations

5.4 Comparison, Coding, Detection

5.4 Comparison, Coding, Detection Comparison operations

Comparators Subtractor (A ; B ) :

EQ = (A = B ) (equal) NE = (A 6= B ) = EQ (not equal) GE = (A B ) (greater or equal) LT = (A < B ) = GE (less than) GT = (A > B ) = GE EQ (greater than) LE = (A B ) = GT = GE + EQ (less or equal)
Equality comparison

GE = cout EQ = Pn;1:0
(for free in PPA)

cmpsub.epsi 37 31 mm

GE = c out

CPA

EQ = P n-1:0

ARCA = 7n TRCA = 2n or APPA;KS 3 n log n TPPA;KS 2


Optimized comparator :

2 log n

EQ = (A = B ) eqi+1 = (ai = bi) eqi = (ai bi) eqi ; i = 0 ::: n ; 1 eq0 = 1 EQ = eqn (r.s.a.)
Magnitude comparison

removing redundancies in subtractor (unused si ) single-tree structure ) speed-up at no cost :

a n-1 b n-1

a2 b2

a1 b1

...

a0 b0

cmpeq.epsi 40 36 mm

A = 6n TLIN = 2n TTREE
a n-1 b n-1 a2 b2 a1 b1

2 log n

example : ripple comparator using comparator slices


EQ
a0 b0
equality & magnitude magnitude equality

GE = (A B ) gei+1 = (ai > bi) + (ai = bi) gei = aibi + (ai bi) gei ; i = 0 : : : n ; 1 ge0 = 1 GE = gen (r.s.a.)
Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations 60 5.4 Comparison, Coding, Detection
GE

...

cmpripple.epsi 100 47 mm

EQ

Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations

61

5.4 Comparison, Coding, Detection

Decoder Decodes binary number An;1:0 to vector Zm;1:0 (m = 2n )

Detection operations All-zeroes detection : All-ones detection :

zi =

1 if A = i 0 else ;
a2

i = 0 ::: m ; 1
a1 a0
decoder.epsi 58 28 mm

= 2A

z = an;1 + an;2 + + a0 z = an;1 an;2 a0 (r.s.a.)

A
decodersym.epsi decoder 21 26 mm

A = n T = log n
Leading-zeroes detection (LZD) : for scaling, normalization, priority encoding
z2 z1 z0

z7

z6

z5

z4

z3

A = (n ; 1)2n T = dlog ne
Encoder Encodes vector Am;1:0 to binary number Zn;1:0 (m = 2n ) (condition: 9i 8k j if k = i then ak = 1 else ak = 0) Z = i if ai = 1 ; i = 0 : : : m ; 1 Z = log2 A
A
encodersym.epsi encoder 21 26 mm

a) non-encoded output :
f0g1f0j1g ! f0g1f0g
a n-1 a n-2
...

a1

a0

(e.g. 000101 ! 000100)

lzdnenc.epsi 50 28 mm
...

A = 2n T = n

z n-1

z n-2

z1

z0

a7a5a3a1 a6a4a2a0
encoder.epsi 30 34 mm

prex problem (r.m.a.) ) AND-prex structure


z0 z1

b) encoded output : + encoder signed numbers : + leading-ones detector (LOZ)

A = n(2n;1 ; 1) T =n;1

z2

(note: connections according to PPA-SK)


62 Computer Arithmetic: Principles, Architectures, and VLSI Design 63

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

5.5 Shift, Extension, Saturation

5 Simple / Addition-Based Operations

5.5 Shift, Extension, Saturation

5.5 Shift, Extension, Saturation Shift : a) shift n-bit vector by k bit positions b) select n out of more bits at position k also: logical (= unsigned), arithmetic (= signed)

Rotation by k bit positions, n constant (logic operation) Extension of word lengths by k bits (n ! n + k ) (i.e. sign-extension for signed numbers) Saturation to highest/lowest value after over-/underow shift a) unl. signed r. signed l. r. unsigned signed l. r. unl. signed r. signed l. r.

Applications : adaption of magnitude (shift a)) or word length (extension) of operands (e.g. for addition) multiplication/division by multiples of 2 (shift) logic bit/byte operations (shift, rotation) scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of oating-point numbers, shift a)) using LZD reducing error after over-/underow (saturation) Implementation of shift/extension/rotation by constant values : hard-wired variable values : multiplexers n possible values : nbyn barrel-shifter/rotator Example : 4by4 barrel-rotator

shift b) rotate extend

saturate unsigned signed

an;2 0 an;1 an;1 an;3 an;1 an;1 an;2 an+k;1 a2n;1 an+k;2 an;2 a0 an;1 0 an;1 an;1 an;1 an;1 an;2 an;1 an;2 an;1 an;1 an;1

::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: :::

a0 0 a1 a0 0 a1 ak ak a0 an;1 a1 a0 a0 0 a0 a0 0 an;1 an;1

sll srl sla sra

rol ror

A = O(n2) T = O(log n)
a3 a2 a1 a0

a3 s1 s0 s1 s0 s1 s0 s1 s0

a2

a1

a0

barshift.epsi 44 49 mm

s0 s1 z3

muxshift.epsi 41 28 mm

z2

z1

z0

z3

z2

z1

z0

multiplexers
64

tristate buffers
65 5.6 Addition Flags

Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations

Computer Arithmetic: Principles, Architectures, and VLSI Design 5 Simple / Addition-Based Operations

5.6 Addition Flags

5.6 Addition Flags ag

Basic and derived condition ags description carry ag signed overow ag zero ag negative ag, sign condition operation: S=0 S<0 S 0 ag formula signed

cn cn cn;1 an bnsn + anbn sn Z 8i : si = 0 N sn;1

C V

formula

Implementation of adder with ags

C , N : for free V : fast cn, cn;1 computed by e.g. PPA ) very cheap Z : a) cin = 1 (subtract.) : Z = (A = B ) = Pn;1:0 (of PPA) b) cin = 0=1 : Z = sn;1 + sn;2 + + s0 (r.s.a.) 1) A = ACPA + n TZ = TCPA + dlog ne
2) faster without nal sum (i.e. carry prop.) [19] example : 01001 1 00 0 + 10110 1 00 = 00000 0 00

S = A + B (+) or S = A ; B (;) zero Z Z negative N positive N S > max overow C (+) VC S < min underow C (;) VC operation: A ; B A=B EQ Z Z A 6= B NE Z Z A B GE C N V + NV A>B GT CZ (N V + NV )Z A<B LT C NV + NV A B LE C + Z NV + NV + Z
Unsigned and signed addition/subtraction only differ with respect to the condition ags

unsigned

z0 = ((a0 b0) cin) zi = ((ai bi) (ai;1 + bi;1)) Z = zn;1 zn;2 z0 ; i = 0 : : : n ; 1 (r.s.a.) A = ACPA + 3n TZ = 4 + dlog ne
66 Computer Arithmetic: Principles, Architectures, and VLSI Design 67

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

5.7 Arithmetic Logic Unit (ALU)

6 Multiplication

6.1 Multiplication Basics

5.7 Arithmetic Logic Unit (ALU)


A B

6 Multiplication
6.1 Multiplication Basics Multiplies two n-bit operands A and B [1, 2] Product P is (2n)-bit unsigned number or (2n ; 1)-bit signed number Example : unsigned multiplication n;1 n;1 n;1 n;1 X X XX P = A B = ai2i bj 2j = aibj 2i+j or i=0 j =0 i=0 j =0 n;1 X Pi = ai B P = Pi2i ; i = 0 : : : n ; 1 (r.s.a.) i=0 Algorithm 1) Generation of n partial products Pi 2) Adding up partial products : a) sequentially (sequential shift-and-add), b) serially (combinational shift-and-add), or c) in parallel Speed-up techniques Reduce number of partial products Accelerate addition of partial products
Computer Arithmetic: Principles, Architectures, and VLSI Design 6 Multiplication 69

c out alusymbol.epsi c in 29 mm 30 ALU flags op Z

ALU operations arithmetic add inc pass and or xor pass sll sla rol

logic

shift/ rotate

A + B + cin A+1 A aibi ai + bi ai bi ai A 1 A a1 A r1

sub dec neg nand nor xnor not srl sra ror

A;B A;1 ;A ai bi ai + bi ai bi ai A 1 A a1 A r1

s/ro : shift/rotate ; l/r : left/right ; l/a : logic (unsigned) / arithmetic (signed)

Logic of adder/subtractor can partly be re-used for logic operations


Computer Arithmetic: Principles, Architectures, and VLSI Design 6 Multiplication 68

6.1 Multiplication Basics

6.2 Unsigned Array Multiplier

Sequential multipliers : partial products generated and added sequentially (using accumulator)

6.2 Unsigned Array Multiplier

mulseq.epsi 34 28 mm

Braun multiplier : array multiplier for unsigned numbers

CPA

A = O(n) T = O(log n) L = n
Array multipliers : partial products generated and added simultaneously in linear array (using array adder)

CSA CSA mularr.epsi 34 47 mm CSA

P=

n;1 n;1 XX i=0 j =0

aibj 2i+j a0 b3 a1 b2 a2 b1 a3 b0 p3

A = 8n2 ; 11n T = 6n ; 9 a0 b2 a0 b1 a0 b0 a1 b1 a1 b0 a2 b0 p2 p1 p0
b1 b0

a1 b3 a2 b3 a2 b2 + a3b3 a3b2 a3b1 p7 p6 p5 p4


b3 a0 b2

A = O(n ) T = O(n)
2

CSA CPA

a1

p0 HA HA HA p1

Parallel multipliers : partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)

mulpar.epsi 34 43 mm

1
a2

CSA tree CPA


a3

FA

mulbraun.epsi FA 99 83 mm

FA p2

A = O(n ) T = O(log n)
2

2
CSA CPA

FA

FA

FA p3

Signed multipliers : a) complement operands before and result after multiplication ) unsigned multiplication b) direct implementation (dedicated multiplier structure)
Computer Arithmetic: Principles, Architectures, and VLSI Design 70

3
p7

FA p6

FA p5

HA p4

Computer Arithmetic: Principles, Architectures, and VLSI Design

71

6 Multiplication

6.3 Signed Array Multipliers

6 Multiplication

6.4 Booth Recoding

6.3 Signed Array Multipliers Modied Braun multiplier Subtract bits with negative weight ) special FAs [1] 1 neg. bit : ;a + b + cin = 2cout ; s 2 neg. bits : a ; b ; cin = ;2cout + s Replace FAs in regions 1 , 2 , and 3 by : (input a at mark )

6.4 Booth Recoding Speed-up technique : reduction of partial products Sequential multiplication + One cycle per non-zero partial product (i.e. 8ai j ai = 0) 6 Negative partial products Minimal (or canonical) signed-digit (SD) represent. of A

s = a b cin cout = ab + acin + bcin

Data-dependent reduction of partial products and latency Combinational multiplication Only xed reduction of partial product possible Radix-4 modied Booth recoding : 2 bits recoded to one multiplier digit ) n=2 partial products

Otherwise exactly same structure and complexity as Braun multiplier ) efcient and exible Baugh-Wooley multiplier Arithmetic transformations yield the following partial products (two additional ones) :

A=

n=2;1 X i=0

(a2i;1 + a2i ; 2a2i+1 ) 22i ; a;1 | {z }

a0 b3 a1b3 a1 b2 a2 b3 a2 b2 a2 b1 a3 b3 a3 b2 a3 b1 a3 b0 a3 a3 + 1 b3 b3 p7 p6 p5 p4 p3

Booth recoding

a0 b2 a0 b1 a0 b0 a1 b1 a1 b0 a2 b0 p2 p1 p0

f;2 ;1 0 +1 +2g

=0

Less efcient and regular than modied Braun multiplier


Computer Arithmetic: Principles, Architectures, and VLSI Design 6 Multiplication 72 6.4 Booth Recoding

a2i+1 a2i a2i;1 Pi 0 0 0 + 0 1 + B 0 0 0 1 0 + B 0 1 1 + 2B 1 0 0 ; 2B 1 0 1 ; B 1 1 0 ; B 1 1 1 ; 0


6 Multiplication

mulbooth.epsi 41 43 mm

CSA array/tree CPA


73

Computer Arithmetic: Principles, Architectures, and VLSI Design

6.6 Multiplier Implementations

Applicable to sequential, array, and parallel multipliers additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation) + adder array/tree cut in half ) considerably smaller (array and tree)
) much faster for adder arrays ) slightly or not faster for adder trees

6.5 Wallace Tree Addition Speed-up technique : fast partial product addition

A : +8n T : +7 A : =2 T : =2 T : ;0 p2 p1 p0 p2 p1 p0
1

A = O(n2) T = O(log n)
Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded) Irregular adder tree (Wallace tree) due to different number of bits per column ) irregular wiring and/or layout ) non-uniform bit arrival times at nal adder 6.6 Multiplier Implementations Sequential multipliers : low performance, small area, component re-use (adder) Braun or Baugh-Wooley multiplier (array multiplier) : medium performance, high area, high regularity layout generators ) data paths and macro-cells simple pipelining, faster CPA ) higher speed Booth-Wallace multiplier (parallel multiplier) [9] : high performance, high area, low regularity custom multipliers, netlist generators often pipelined (e.g. register between CSA-tree and CPA) Signed-unsigned multiplier : signed multiplier with operands extended by 1 bit (an = an;1 =0, bn = bn;1 =0)
74 Computer Arithmetic: Principles, Architectures, and VLSI Design 75

Negative partial products (avoid sign-extension) :

p3 p3 p3 p3 p2 p1 p0 = | {z }
ext. sign

0 0 0 ;p3 = 1 + 1 1 1 p3

p03 p02 p01 p00 p03 p03 p03 p03 p02 p01 p00 p13 p12 p11 p10 p13 p13 p13 p12 p11 p10 ! p23 p22 p21 p20 p23 p23 p22 p21 p20 + p33 p32 p31 p30 + p33 p32 p31 p30

p6 p5 p4 p3 p2 p1 p0

p6 p5 p4 p3 p2 p1 p0

Suited for signed multiplication (incl. Booth recod.) Extend A for unsigned multiplication : an

=0

Radix-8 (3-bit recoding) and higher radices : precomputing 3B , : : : ) inefcient


Computer Arithmetic: Principles, Architectures, and VLSI Design

6 Multiplication

6.8 Squaring

7 Division / Square Root Extraction

7.1 Division Basics

6.7 Composition from Smaller Multipliers

7 Division / Square Root Extraction


7.1 Division Basics

(2n 2n)-bit multiplier can be composed from 4 (n n)-bit multipliers (can be repeated recursively)
A B = (AH 2n + AL) (BH 2n + BL) = AH BH 22n + (AH BL + ALBH )2n + ALBL AH BL AH BH AL BL AL BH

4 (n n)-bit multipliers + (2n)-bit CSA + (3n)-bit CPA less efcient (area and speed) 6.8 Squaring

A=Q B+R; R <B R = A rem B (remainder) A 2 0 22n ; 1] B Q R 2 0 2n ; 1] B 6= 0 Q < 2n ! A < 2nB , otherwise overow ) normalize B before division (B 2 2n;1 2n ; 1]) A =Q+ R B B
Algorithms (radix-2) Subtract-and-shift : partial remainders Ri [1, 2] Sequential algorithm : recursive, f non-associative

P = A2 = AA

: multiplier optimizations possible

!
;

a0 a3 a0 a2 a0 a1 a0 a0 a1 a3 a1 a2 a1 a1 a1 a0 a2 a3 a2 a2 a2 a1 a2 a0 + a3a3 a3a2 a3a1 a3a0 a2 a3 a1 a3 a0 a3 a0 a2 a0 a1 a0 a0 a3 a3 a1 a2 a1 a1 + a2 a2 p7 p6 p5 p4 p3 p2 p1 p0

qi = Ri+1 2iB Ri = Ri+1 ; qi2iB Rn = A R = R0 ; i = n ; 1 : : : 0 (r.m.n.)


Basic algorithm : compare and conditionally subtract ) expensive comparison and CPA Restoring division : subtract and conditionally restore (adder or multiplexer) ) expensive CPA and restoring Non-restoring division : detect sign, subtract/add, and correct by next steps ) expensive CPA SRT division : estimate range, subtract/add (CSA), and correct by next steps ) inexpensive CSA
76 Computer Arithmetic: Principles, Architectures, and VLSI Design 7 Division / Square Root Extraction 77 7.4 Signed Division

+ bn=2c + 1 partial products (if no Booth recoding used) ) optimized squarer more efcient than multiplier Table look-up (ROM) less efcient for every n
Computer Arithmetic: Principles, Architectures, and VLSI Design 7 Division / Square Root Extraction

7.3 Non-Restoring Division

7.2 Restoring Division

qi =

1 if 0 if

Ri+1 ; B 2i 0 Ri+1 ; B 2i < 0

7.4 Signed Division

q0 = 1 if
i
1 if

Ri+1 B same sign Ri+1 B opposite sign

Ri+1 ; B 2i < 0 : qi = 0 Ri = Ri+1 (restored) i ; 1 Ri+1 ; B 2i;1 0 : qi;1 = 1 Ri;1 = Ri+1 ; B 2i;1 i
7.3 Non-Restoring Division

Example : signed non-restoring array divider (simplications: B > 0, nal correction of R omitted)

qi0 =

1 if ;1 = 1 if

Ri+1 0 Ri+1 < 0

A = 9n2 T = 2n2 + 4n
a6 b2 a5 b1 a4 b0

a6 b3

b3

a3

Ri+1 0 : qi0 = 1 Ri = Ri+1 ; B 2i i ; 1 Ri+1 ; B 2i < 0 : qi0;1 = 1 Ri;1 = Ri+1 ; B 2i +B 2i;1 = Ri+1 ; B 2i;1 i
One subtraction/addition (CPA) per step Final correction step for R (additional CPA) 0 Simple quotient digit conversion : (note: qi irredundant)

q3

FA

FA

FA

FA

a2 q2 FA FA FA divarray.epsi 81 101 mm FA

qi0 2 f1 1g ! qi 2 f0 1g : qi = 1 (qi0 + 1) 2 Q = (qn;1 qn;2 qn;3 : : : q0 1)


A B

a1 q1 FA FA FA FA

A = (n + 1)ACPA = O(n2) or O(n2 log n) T = (n + 1)TCPA = O(n2) or O(n log n)

+/ CPA +/ CPA divnr.epsi 38 mm 46 +/ CPA +/ CPA +/ CPA

a0 q0 FA r3 FA r2 FA r1 FA r0
79

R
Computer Arithmetic: Principles, Architectures, and VLSI Design 78

Computer Arithmetic: Principles, Architectures, and VLSI Design

7 Division / Square Root Extraction

7.5 SRT Division

7 Division / Square Root Extraction

7.7 Division by Multiplication

7.5 SRT Division

qi0 = >0 if ;B 2i
> :1

8 >1 > <

7.6 High-Radix Division Radix

if if

B 2i Ri+1 qi0 is SD number Ri+1 < B 2i i Ri+1 < ;B 2

= 2m , qi0 2 f

;1

:::

1 0 1

:::

; 1g

m quotient bits per step ) fewer, but more complex steps


+ Suitable for SRT algorithm ) faster Complex comparisons (more bits) and decisions ) table look-up () Pentium bug!) 7.7 Division by Multiplication Division by convergence

if 2n;1 ) ;B 2i
)

B < 2n , i.e. B is normalized : ;2n+i;1 Ri+1 < 2n+i;1 B 2i 8 >1 if 2n+i;1 Ri+1 > < 0 = 0 if ;2n+i;1 R < 2n+i;1 qi > i+1 > :1 if Ri+1 < ;2n+i;1

0 + Only 3 MSB are compared ) qi are estimated ) CSA instead of CPA can be used (precise enough) [20] Correction in following steps (+ nal correction step) 0 Redundant representation of qi (SD representation) ) nal conversion necessary (CPA) + Highly regular and fast (O(n)) SRT array dividers ) only slightly slower/larger than array multipliers
A B

A A Q = B = B R0R1 R0 R1

Rm;1 ! A Rm;1 B

B 1 B

Q = Q resp. 2n 1

Bi+1 = Bi Ri = 2n(1{z y) (1 + y) = 2n(1{z y2 ) ; ; | } | {z } | } Bi Ri > Bi ! 2n y = 1 ; Bi2;n Ri = 2 ; Bi2;n = B i + 1 (signed)


Algorithm :

A = nACSA + 2ACPA = O(n2) T = nTCSA + TCPA = O(n)

+/ CSA +/ divsrt.epsi CSA +/ 50 38 mm CSA +/ CSA +/ CPA

Bi+1 = Bi Ri Ai+1 = Ai Ri Ri = B i + 1 ; i = 0 : : : m ; 1 A0 = A B0 = B Q = Am (r.s.n.) L = dlog ne


81 7.9 Divider Implementations

CPA

R
Computer Arithmetic: Principles, Architectures, and VLSI Design 7 Division / Square Root Extraction 80

Quadratic convergence :

Computer Arithmetic: Principles, Architectures, and VLSI Design 7 Division / Square Root Extraction

7.8 Remainder / Modulus

Division by reciprocation 1 A Q= B =A B Newton-Raphson iteration method : nd

7.9 Divider Implementations Iterative dividers (through multiplication) : re-use of existing components (multiplier) medium performance, medium area high efciency if components are re-used Sequential dividers (restoring, non-restoring, SRT) : re-use of existing components (e.g. adder) low performance, low area Array dividers (restoring, non-restoring, SRT) : dedicated hardware component high performance, high area high regularity ) layout generators, pipelining square root extraction possible by minor changes combination with multiplication or/and square root No parallel dividers exist (sequential nature of division)

f (X ) = 0

by recursion

f Xi+1 = Xi ; f 0((Xo) Xi)

1 1 1 f (X ) = X ; B f 0 (X ) = ; X 2 f B = 0 Algorithm :

Xi+1 = Xi (2 ; B Xi) ; i = 0 : : : m ; 1 X0 = B Q = Xm (r.s.n.) L = O(log n) Speed-up : rst approximation X0 from table


Quadratic convergence : 7.8 Remainder / Modulus Remainder (rem) : signed remainder of a division

R = A rem B M = A mod B M
0

sign(R) = sign(A)

Modulus (mod) : positive remainder of a division

M = R+B R

if A else

Computer Arithmetic: Principles, Architectures, and VLSI Design

82

Computer Arithmetic: Principles, Architectures, and VLSI Design

83

7 Division / Square Root Extraction

7.10 Square Root Extraction

8 Elementary Functions

8.1 Algorithms

7.10 Square Root Extraction p A;R =Q

8 Elementary Functions

A = Q2 + R

A2

0 22n ; 1]

Q2

0 2n ; 1]

Exponential function : ex (exp x) Logarithm function : ln x, log x Trigonometric functions : sin x, cos x, tan x

Algorithm Subtract-and-shift : partial remainders Ri and quotients Qi = Qi+1 + qi2i = (qn;1 : : : qi 0 : : : 0) 2 Q2 = Qi+1 + qi2i = Q2+1 + qi2i 2Qi+1 + qi2i i i

Inverse trig. functions : arcsin x, arccos x, arctan x Hyperbolic functions : sinh x, cosh x, tanh x 8.1 Algorithms Table look-up : inefcient for large word lengths [5] Taylor series expansion : complex implementation Polynomial and rational approximations [1, 5] Shift-and-add algorithms [5] Convergence algorithms [1, 2] : similar to division-by-convergence two (or more) recursive formulas : one formula converges to a constant, the other to the result Coordinate rotation (CORDIC) [2, 5, 21] :

qi = Ri+1 2i 2Qi+1 + 2i Qi = Qi+1 + qi2i Ri = Ri+1 ; qi2i 2Qi+1 + qi2i ; i = n ; 1 : : : 0 Rn = A Qn = 0 R = R0 Q = Q0 (r.m.n.)


Implementation + Similar to division ) same algorithms applicable (restoring, non-restoring, SRT, high-radix) + Combination with division in same component possible Only triangular array required (step i : qk i = 0)
A
+/ CPA sqrtnr.epsi +/ CPA 42 36 mm +/ CPA +/ CPA +/ CPA

A ADIV =2 T TDIV

R
Computer Arithmetic: Principles, Architectures, and VLSI Design 8 Elementary Functions 84

3 equations for x-, y-coordinate, and angle computes all elementary functions by proper input settings and choice of modes and outputs simple, universal hardware, small look-up table
Computer Arithmetic: Principles, Architectures, and VLSI Design 8 Elementary Functions 85

8.2 Integer Exponentiation

8.3 Integer Logarithm


1 1 1 0

8.2 Integer Exponentiation Approximated exponentiation :

b)

xy = ey ln x = 2y log x

E = AB = Abn; 2n; + +b 2+b = ( ((Abn; )2 Abn; )2


1 2

Ab )2 Ab
1

Base-2 integer exponentiation : 2A Integer exponentiation (exact) :

= (: : :

0 1 0 |{z} A

: : :)

Ei = Ei2+1 Abi ; i = n ; 1 : : : En = 1 E = E0 (r.s.n.)

AB

=A A A | {z }
B

L=0

2n ; 1 (!)

A = AMUL T = TMUL L = 2(n ; 1)


8.3 Integer Logarithm

Applications : modular exponentiation AB (mod in cryptographic algorithms (e.g. IDEA, RSA) Algorithms : square-and-multiply a)

C)

Z = blog2 Ac
For detection/comparison of order of magnitude Corresponds to leading-zeroes detection (LZD) with encoded output

E = AB = Abn; 2n; + +b 2+b = A2n; bn; A2n; bn;


1 1 1 0 1 1 2

A4b A2b Ab
2 1

Ei = Pibi Ei;1 Pi+1 = Pi2 ; i = 0 : : : n ; 1 E;1 = 1 P0 = A E = En;1 (r.s.n.) A = 2AMUL T = TMUL L = n A = AMUL T = TMUL L = 2n
Computer Arithmetic: Principles, Architectures, and VLSI Design

or

86

Computer Arithmetic: Principles, Architectures, and VLSI Design

87

9 VLSI Design Aspects

9.1 Design Levels

9 VLSI Design Aspects

9.1 Design Levels

9 VLSI Design Aspects


9.1 Design Levels Transistor-level design Circuit and layout designed by hand (full custom) Low design efciency High circuit performance : high speed, low area High exibility : choice of architecture and logic style Transistor-level circuit optimizations : logic style : static vs. dynamic logic, complementary CMOS vs. pass-transistor logic special arithmetic circuits : better than with gates carry chain :
ci c out c i-1 carrychain.epsi 54 17 mm ki pi gi g i-1 c in

Gate-level design Cell-based design techniques : standard-cells, gate-array/ sea-of-gates, eld-programmable gate-array (FPGA) Circuit implemented by hand or by synthesis (library) Layout implemented by automated place-and-route Medium to high design efciency Medium to low circuit performance Medium to low exibility : full choice of architecture Block-level design Layout blocks and netlists from parameterized automatic generators or compilers (library) High design efciency Medium to high circuit performance Low exibility : limited choice of architectures Implementations : data-path : bit-sliced, bus-oriented layout (array of cells: n bits m operations), implementation of entire data paths, medium performance, medium diversity macro-cells : tiled layout, xed/single-operation components, high performance, small diversity portable netlists : ) gate-level design
Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects 89 9.3 VHDL

k i-1 p i-1

c in

a b

fulladder :

c in c in

b b

facmos.epsi 76 40 mm

c in s c in b c out

c in

Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects

88 9.2 Synthesis

9.2 Synthesis High-level synthesis Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL Involves architectural synthesis and arithmetic transformations High-level synthesis is still in the beginnings Low-level synthesis Layout and netlist generators Included in libraries and synthesis tools Low-level synthesis is state-of-the-art Basis for efcient ASIC design Limited diversity and exibility of library components Circuit optimization Efcient optimization of random logic (low factorization degree) is state-of-the-art Optimization of entire arithmetic circuits (high factorization degree) is not feasible ) only local optimizations possible Logic optimization cannot replace the synthesis of efcient arithmetic circuit structures using generators
Computer Arithmetic: Principles, Architectures, and VLSI Design 90

9.3 VHDL Arithmetic types : unsigned, signed (2s complement) Arithmetic packages numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys) contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types Arithmetic operators (VHDL87/93) [22] relational shift, rotate (93 only) adding sign (unary) multiplying exponent, absolute Synthesis Typical limitations of synthesis tools :
/, mod, rem : both operands must be constant or divisor

: : : : : :

=, /=, <, <=, >, >= rol, ror, sla, sll, sra, srl +, +, *, /, mod, rem **, abs

must be a power of two ** : for power-of-two bases only Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
Computer Arithmetic: Principles, Architectures, and VLSI Design 91

9 VLSI Design Aspects

9.3 VHDL

9 VLSI Design Aspects

9.4 Performance

Resource sharing Sharing one resource for multiple operations Done automatically by some synthesis tools Otherwise, appropriate coding is necessary : a) b)
S <= A + C when SELA = 1 else B + C;

9.4 Performance Pipelining Pipelining is basically possible with every combinational circuit ) higher throughput Arithmetic circuits are well suited for pipelining due to high regularity Pipelining of arithmetic circuits can be very costly : large amount of internal signals in arithmetic circuits array structures : many small pipeline registers tree structures : few large pipeline registers
) no advantage of tree structures anymore (except for smaller latency)

) 2 adders + 1 multiplexer
T <= A when SELA = 1 else B; S <= T + C; ) 1 multiplexer + 1 adder

Coding & synthesis hints Addition : single adder with carry-in/carry-out :


Aext Bext Sext S Cout <= <= <= <= <= resize(A, width+1) & Cin; resize(B, width+1) & 1; Aext + Bext; Sext(width downto 1); Sext(width+1);

Fine-grain pipelining ) systolic arrays (often applied to arithmetic circuits) High speed Fast circuit architectures, pipelining, replication (parallelization), and combinations of those Optimal solution depends on arithmetic operation, circuit architecture, user specications, and circuit environment
92 Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects 93 9.5 Testability

Synthesis : check synthesis result for allocated arithmetic units ) code sanity check, control of circuit size VHDL library of arithmetic units Structural, synthesizable VHDL code for most circuits described in this text is found in [23]
Computer Arithmetic: Principles, Architectures, and VLSI Design 9 VLSI Design Aspects

9.4 Performance

Low power Power-related properties of arithmetic circuits : High glitching activity due to high bit dependencies and large logic depth Power reduction in arithmetic circuits [24] : Reduce the switched capacitance by choosing an area efcient circuit architecture Allow for lower supply voltage by speeding up the circuitry Reduce the transition activity : apply stable inputs while circuit is not in use () disabling subcircuits) reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difcult to realize) reduce glitching transitions by reducing logic depth (pipelining) take advantage of correlated data streams choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability Testability goal : high fault coverage with few test vectors that are easy to generate/apply Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits Hard-detectable faults found in : circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder ags circuits using redundant number representations (= redundant hardware) : dividers (Pentium bug!) 6

Computer Arithmetic: Principles, Architectures, and VLSI Design

94

Computer Arithmetic: Principles, Architectures, and VLSI Design

95

Bibliography

Bibliography

Bibliography
[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993. [2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979. [3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981. [4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984. [5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhauser Boston, 1997. [6] Proceedings of the Xth Symposium on Computer Arithmetic. [7] IEEE Transactions on Computers. [8] D. R. Lutz and D. N. Jayasimha, Programmable modulo-k counters, IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939941, Nov. 1996. [9] H. Makino et al., An 8.8-ns 54 54-bit multiplier with high speed redundant binary architecture, IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773783, June 1996. [10] W. N. Holmes, Composite arithmetic: Proposal for a new standard, IEEE Computer, vol. 30, no. 3, pp. 6573, Mar. 1997.
Computer Arithmetic: Principles, Architectures, and VLSI Design Bibliography 96

[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998. [12] A. Tyagi, A reduced-area scheme for carry-select adders, IEEE Trans. Comput., vol. 42, no. 10, pp. 11621170, Oct. 1993. [13] T. Han and D. A. Carlson, Fast area-efcient VLSI adders, in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 4956. [14] R. Zimmermann, Non-heuristic optimization and synthesis of parallel-prex adders, in Proc. Int. Workshop on Logic and Architecture Synthesis, Grenoble, France, Dec. 1996, pp. 123132. [15] D. W. Dobberpuhl et al., A 200-MHz 64-b dual-issue CMOS microprocessor, IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 15551564, Nov. 1992. [16] A. De Gloria and M. Olivieri, Statistical carry lookahead adders, IEEE Trans. Comput., vol. 45, no. 3, pp. 340347, Mar. 1996. [17] V. G. Oklobdzija, D. Villeger, and S. S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Trans. Comput., vol. 45, no. 3, pp. 294305, Mar. 1996.
Computer Arithmetic: Principles, Architectures, and VLSI Design 97

[18] Z. Wang, G. A. Jullien, and W. C. Miller, A new design technique for column compression multipliers, IEEE Trans. Comput., vol. 44, no. 8, pp. 962970, Aug. 1995. [19] J. Cortadella and J. M. Llaberia, Evaluation of A + B = K conditions without carry propagation, IEEE Trans. Comput., vol. 41, no. 11, pp. 14841488, Nov. 1992. [20] S. E. McQuillan and J. V. McCanny, Fast VLSI algorithms for division and square root, J. VLSI Signal Processing, vol. 8, pp. 151168, Oct. 1994. [21] Y. H. Hu, CORDIC-based VLSI architectures for digital signal processing, IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 1635, July 1992. [22] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997. [23] R. Zimmermann, VHDL Library of Arithmetic Units,
http://www.iis.ee.ethz.ch/zimmi/arith_lib.html.

[24] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.

Computer Arithmetic: Principles, Architectures, and VLSI Design

98

Potrebbero piacerti anche