Lecture notes on
Computer Arithmetic:
Principles, Architectures,
and VLSI Design
Reto Zimmermann
Contents
1 Introduction and Conventions 1.4 Recursive Function Evaluation
Recursive functions (r.) over inputs a3 a2 a1 a0 :
  single output z (r.s.) or multiple outputs z3 z2 z1 z0 (r.m.)
1. Operator is non-associative (r.s.n., r.m.n.) :
   serial structure : [Figure funrsn.epsi]
Evaluation structures shown for the multiple-output case : serial structure, parallel structure, or shared-tree structure [Figure 1funrma2.epsi]
2 Arithmetic Operations

2.1 Overview
  1 shift/extension
  2 comparison
  3 increment/decrement
  4 complement
  5 addition/subtraction
  6 multiplication
  7 division
  8 square root extraction
  9 exponential function
  10 logarithm function
  11 trigonometric functions
  12 hyperbolic functions
[Figure: log (x), trig (x), hyp (x)]

2.2 Implementation Techniques
Table look-up (ROM) : high complexity (8 – 12) and small word length (note: ROM size $2^n \times n$)
Approximation techniques using simpler units : 7–12
  Taylor series expansion
  polynomial and rational approximations
  convergence of recursive equation systems
  CORDIC (COordinate Rotation DIgital Computer)
3 Number Representations

3.1 Binary Number Systems (BNS)

Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]
$n$-bit number is ordered sequence of bits (binary digits) :
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$, $a_i \in \{0, 1\}$
Simple and efficient implementation in digital circuits
MSB/LSB (most-/least-significant bit) : $a_{n-1}$ / $a_0$
Represents an integer or fixed-point number, exact
Fixed-point numbers : the bit sequence is split into an integer part and a fraction part

Unsigned : positive or natural numbers
  Value : $A = \sum_{i=0}^{n-1} a_i 2^i$
  Range : $[0, 2^n - 1]$

Two's (2's) complement : standard representation of signed or integer numbers
  Value : $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-2^{n-1}, 2^{n-1} - 1]$
  Complement : $-A = 2^n - A = \overline{A} + 1$, where $\overline{A} = (\overline{a}_{n-1}, \overline{a}_{n-2}, \ldots, \overline{a}_0)$
  Sign : $a_{n-1}$
  Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement : similar to 2's complement
  Value : $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement : $-A = 2^n - A - 1 = \overline{A}$
  Sign : $a_{n-1}$
  Properties : double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude : alternative representation of signed numbers
  Value : $A = (-1)^{a_{n-1}} \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement : $-A = (\overline{a}_{n-1}, a_{n-2}, \ldots, a_0)$
  Sign : $a_{n-1}$
[Figure numrep.epsi: the $n$-bit patterns 00...0, 011...1, 100...0, 111...1 and their values ($-2^{n-1}$, 0, $2^{n-1}$, $2^n$) as unsigned, 2's complement, 1's complement and sign-magnitude numbers]

Conventions
  2's complement used for signed numbers in these notes
  Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers

Applications : low-power signal buses, representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
– Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison
binary → Gray (n.) : $g_i = b_{i+1} \oplus b_i$ ; $b_n = 0$ ; $i = 0, \ldots, n-1$
Gray → binary (r.m.a.) : $b_i = b_{i+1} \oplus g_i$ ; $b_n = 0$ ; $i = n-1, \ldots, 0$

  number   binary    Gray
     0     0 0 0 0   0 0 0 0
     1     0 0 0 1   0 0 0 1
     2     0 0 1 0   0 0 1 1
     3     0 0 1 1   0 0 1 0
     4     0 1 0 0   0 1 1 0
     5     0 1 0 1   0 1 1 1
     6     0 1 1 0   0 1 0 1
     7     0 1 1 1   0 1 0 0
     8     1 0 0 0   1 1 0 0
     9     1 0 0 1   1 1 0 1
    10     1 0 1 0   1 1 1 1
    11     1 0 1 1   1 1 1 0
    12     1 1 0 0   1 0 1 0
    13     1 1 0 1   1 0 1 1
    14     1 1 1 0   1 0 0 1
    15     1 1 1 1   1 0 0 0
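The two Gray conversions can be sketched in a few lines of Python. This is only an illustration on non-negative integers; the function names `binary_to_gray` and `gray_to_binary` are hypothetical and not taken from the notes.

```python
def binary_to_gray(b: int) -> int:
    """Non-recursive conversion: each Gray bit is the XOR of two adjacent binary bits."""
    return b ^ (b >> 1)


def gray_to_binary(g: int) -> int:
    """Recursive conversion: each binary bit is the XOR of all Gray bits at or above it."""
    b = 0
    while g:
        b ^= g      # fold in the Gray bits shifted down so far
        g >>= 1
    return b


if __name__ == "__main__":
    for number in range(16):
        print(number, format(number, "04b"), format(binary_to_gray(number), "04b"))
```

Running the module prints the same three columns as the table above.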
3.3 Redundant Number Systems

Non-binary, redundant, weighted number systems [1, 2]
Digit set larger than radix (typically radix 2) ⇒ multiple representations of same number ⇒ redundancy
+ No carry-propagation in adders ⇒ more efficient impl. of adder-based units (e.g. multipliers and dividers)
– Redundancy ⇒ no direct implementation of relational operators ⇒ conversion to irredundant numbers
– Several bits used to represent one digit ⇒ higher storage requirements
– Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation :
  digits $r_i \in \{0, 1, 2\}$, held as carry and sum bits $c_i, s_i \in \{0, 1\}$
  1 digit holds sum of 2 bits (no carry-out digit)
  irredundant representation of $-1$ [8]

Carry-save number representation :
  digits $r_i \in \{0, 1, 2, 3\}$, held as carry and sum bits $c_i, s_i \in \{0, 1\}$
  1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
  standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation :
  digits $r_i \in \{-1, 0, 1\}$ (written $\bar{1}, 0, 1$)
  no carry-propagation in addition
  is redundant (e.g. $01 = 1\bar{1}$)
  1 digit holds sum of 2 digits (no carry-out digit)
  minimal SD representation : minimal number of non-zero digits, e.g. $0111 \rightarrow 100\bar{1}$
  applications : sequential multiplication (less cycles), filters with constant coefficients (less hardware)
  canonical SD repres.: minimal SD + not two non-zero digits in sequence
  SD → binary : carry-propagation necessary (⇒ adder)
  other applications : high-speed multipliers [9]
  similar to carry-save, simple use for signed numbers
3.4 Residue Number Systems (RNS)

Non-binary, irredundant, non-weighted number system [1]
+ Carry-free and fast additions and multiplications
– Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required
Codes for error detection and correction [1]
Possible applications (but hardly used) :
  digital filters : fast additions and multiplications
Base : $(m_{n-1}, m_{n-2}, \ldots, m_0)$, residues (or moduli) $m_i$ pairwise relatively prime
Number : $A = (a_{n-1}, a_{n-2}, \ldots, a_0)$ with $a_i = A \bmod m_i$
Range : $M = \prod_i m_i$ values, possible range anywhere in $\mathbb{Z}$
Arithmetic operations : (each digit computed separately)
  $z_i = (a_i \circ b_i) \bmod m_i$ for $\circ \in \{+, -, \times\}$
  $a_i^{-1} = a_i^{m_i - 2} \bmod m_i$ (Fermat's theorem)
Best moduli are $2^k$ and $2^k - 1$ :
  high storage efficiency with $k$ bits
  simple modular addition : $2^k$ : $k$-bit adder without carry-out
[Example table: residues of the integers $-4, \ldots, 8$]
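The digit-wise arithmetic of an RNS can be tried out with a short Python sketch. The base `(7, 8, 15)` and all function names are assumptions chosen for illustration; the conversion back to a weighted number uses the Chinese remainder theorem, which the notes do not spell out.

```python
from math import prod

# Hypothetical base: 2^3 - 1, 2^3 and 2^4 - 1 are pairwise relatively prime.
MODULI = (7, 8, 15)
M = prod(MODULI)  # number of representable values (840)


def to_rns(x: int) -> tuple:
    """One residue digit per modulus."""
    return tuple(x % m for m in MODULI)


def rns_add(a: tuple, b: tuple) -> tuple:
    """Carry-free addition: every digit is computed separately."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a: tuple, b: tuple) -> tuple:
    """Carry-free multiplication: every digit is computed separately."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(r: tuple) -> int:
    """Expensive conversion to a weighted number (Chinese remainder theorem)."""
    x = 0
    for r_i, m_i in zip(r, MODULI):
        m_rest = M // m_i
        x += r_i * m_rest * pow(m_rest, -1, m_i)  # modular inverse, Python 3.8+
    return x % M
```

For example, `from_rns(rns_mul(to_rns(23), to_rns(31)))` recovers the product as long as it stays below `M`.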
3.5 Floating-Point Numbers

accuracy, more reliable
Format : S | biased exponent E | unsigned norm. mantissa M
Value : $A = (-1)^S \cdot 1.M \cdot 2^{E - \mathit{bias}}$
Basic arithmetic operations : multiplication multiplies mantissas and adds exponents; addition aligns exponents, adds mantissas and normalizes
Applications :
  processors : "real" floating-point formats (e.g. IEEE standard), large range due to universal use
  ASICs : usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers
IEEE floating-point format :
  precision   bits   mantissa bits   exponent bits   bias   range                   precision
  single      32     23              8               127    $3.8 \cdot 10^{38}$     $10^{-7}$
  double      64     52              11              1023   $9 \cdot 10^{307}$      $10^{-15}$

Logarithmic number representation (signed-logarithmic)

Format : S | biased fixed-point exponent E
Value : $A = (-1)^S \cdot 2^{E - \mathit{bias}}$
Basic arithmetic operations : multiplication and division become addition and subtraction of exponents (additionally consider sign)
+ Simpler multiplication/exponent., more complex addition
– Expensive conversion : (anti)logarithms (table look-up)
Applications : real-time digital filters

3.7 Antitetrational Number System

Tetration (t.) and antitetration (a.t.) [10]
Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. $2^x$ → t., log → a.t.)
3.8 Composite Arithmetic

Secondary forms used for numbers not representable by primary ones (⇒ no over-/underflow handling necessary)
Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
Number representations :
                      tag   value
  integer :           00    2's complement integer
  rational :          01    slash | denominator | numerator
  logarithmic :       10    log integer | log fraction
  antitetrational :   11    a.t. integer | a.t. fraction
Rational numbers : slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
Implementation : mixed hardware/software solutions
Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format ⇒ higher accuracy
  large hardware/software overhead

3.9 Round-Off Schemes

Trade-off : numerical accuracy vs. implementation cost
(values below are in units of the LSB that is kept)
Truncation : $Z = \lfloor A \rfloor$, bias of about $-1/2$ (= average error)
Round-to-nearest (i.e. normal rounding) : $Z = \lfloor A + 1/2 \rfloor$ (nearly symmetric)
  "+ 1/2" can often be included in previous operation
Round-to-nearest-even/-odd : as round-to-nearest, except that an exact tie goes to the even (odd) neighbour (symmetric)
  mandatory in IEEE floating-point standard
3 guard bits for rounding after floating-point operations :
  guard bit (postnormalization), round bit (round-to-nearest), sticky bit (round-to-nearest-even)
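The three round-off schemes can be compared on fixed-point values with a small Python sketch. It assumes the value is held as an integer `x` whose lowest `f` bits (with `f >= 1`) are the fraction to be dropped; the function names are hypothetical.

```python
def truncate(x: int, f: int) -> int:
    """Drop f fraction bits (rounds toward minus infinity)."""
    return x >> f


def round_nearest(x: int, f: int) -> int:
    """Add one half of the kept LSB, then truncate."""
    return (x + (1 << (f - 1))) >> f


def round_nearest_even(x: int, f: int) -> int:
    """Round to nearest; an exact tie goes to the even neighbour."""
    q = x >> f                      # truncated result
    rem = x & ((1 << f) - 1)        # dropped fraction bits
    half = 1 << (f - 1)
    if rem > half or (rem == half and q & 1):
        q += 1
    return q
```

With `f = 2`, the value `x = 10` stands for 2.5: truncation gives 2, round-to-nearest gives 3, and round-to-nearest-even gives 2.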
[Figure adders.epsi: overview of adder components and how they build on each other: 1-bit adders (HA, FA, (m,k), (m,2)), carry-propagate adders (RCA, CSKA, CSLA, CIA, CLA, PPA, COSA), carry-save adders (3-operand CSA; multi-operand adder array and adder tree), multi-operand adders (adder array and adder tree)]
Legend:
  HA: half-adder
  FA: full-adder
  (m,k): (m,k)-counter
  (m,2): (m,2)-compressor
  CPA: carry-propagate adder
  RCA: ripple-carry adder
  CSKA: carry-skip adder
  CSLA: carry-select adder
  CIA: carry-increment adder
  CLA: carry-lookahead adder
  PPA: parallel-prefix adder
  COSA: conditional-sum adder
  CSA: carry-save adder
  arrows: "based on component", "related component"
[Figures hasym.epsi, chaschema1.epsi, haschema2.epsi: half-adder (HA) symbol and schematics with inputs a, b and outputs s (sum), c_out (carry-out); (reference)]
4 Addition 4.2 1-Bit Adders, (m, k)-Counters

Full-adder (FA) :
  $g = a b$ (generate)
  $p = a \oplus b$ (propagate)
  $s = p \oplus c_{in}$, $c_{out} = g + p\,c_{in}$
[Figures faschematic1.epsi, faschematic4.epsi, faschematic5.epsi: full-adder schematics with inputs a, b, c_in, internal signals g, p and outputs s, c_out; (reference)]

(m, k)-counter :
[Figure cntsymbol.epsi: (m,k)-counter symbol with outputs s_{k-1} ... s_0]
  Usually built from full-adders
  Associativity of addition allows conversion from linear to tree structure ⇒ faster at same number of FAs
[Figures count73ser.epsi, count73par.epsi: (7,3)-counter with outputs s2 s1 s0, built from FAs in linear structure and in tree structure]
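A (7,3)-counter built from four full-adders in a tree can be sketched as follows. The wiring is one possible arrangement chosen for this illustration, not necessarily the one drawn in the figures, and the function names are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple:
    """Return (sum, carry-out) of three input bits."""
    s = a ^ b ^ c
    c_out = (a & b) | (a & c) | (b & c)
    return s, c_out


def counter_7_3(bits) -> tuple:
    """Count the ones among 7 input bits; returns (s0, s1, s2), LSB first."""
    a0, a1, a2, a3, a4, a5, a6 = bits
    t1, c1 = full_adder(a0, a1, a2)   # first level: two FAs work in parallel
    t2, c2 = full_adder(a3, a4, a5)
    s0, c3 = full_adder(t1, t2, a6)   # weight-1 column done
    s1, s2 = full_adder(c1, c2, c3)   # weight-2 column: sum is s1, carry is s2
    return s0, s1, s2
```

For instance, `counter_7_3([1, 1, 1, 0, 1, 0, 1])` returns `(1, 0, 1)`, i.e. the count 5.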
4 Addition 4.3 Carry-Propagate Adders (CPA)

[Figure cpasymbol.epsi: CPA symbol with operands A, B, carry-in c_in, sum S and carry-out c_out]
Sum $(c_{out}, S)$ is irredundant $(n+1)$-bit number
Carry computation is recursive (r.m.a.)

Speed-up techniques :
[Figure speedup1.epsi: CPA partitioned into partial CPAs for bit ranges n-1:j, i-1:k, k-1:0 with carries c_j, c_i, c_k between them]
  a) Fast carry look-ahead logic for entire range of bits

Ripple-carry adder (RCA) :
  $A = 7n$, $T = 2n$, $AT = 14 n^2$
[Figure rca.epsi: chain of FAs with inputs a_{n-1} b_{n-1} ... a_0 b_0, carries c_in, c_1, c_2, ..., c_{n-1}, c_out and sum bits s_{n-1} ... s_0]
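The chain of full-adders in a ripple-carry adder can be modelled bit by bit. The sketch below is a behavioural illustration only; `ripple_carry_add` is a hypothetical name, and operands are passed as non-negative Python integers of width `n`.

```python
def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0) -> tuple:
    """Add two n-bit operands with a chain of n full-adders; returns (sum, carry-out)."""
    s = 0
    c = c_in
    for i in range(n):               # the carry ripples from LSB to MSB
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        g = a_i & b_i                # generate
        p = a_i ^ b_i                # propagate
        s |= (p ^ c) << i            # sum bit of this position
        c = g | (p & c)              # carry into the next position
    return s, c
```

Adding 11 and 6 as 4-bit numbers, `ripple_carry_add(11, 6, 4)`, gives the sum bits 0001 and a carry-out of 1, i.e. 17.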
Carry-skip adder (CSKA) :
  Partial CPA typ. is RCA or CSKA (⇒ multilevel CSKA)
  Variable group sizes (faster) : larger groups in the middle (minimize delays)
  Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
  $A = 8n$, $T = 4 n^{1/2}$, $AT = 32 n^{3/2}$
[Figure cska.epsi: partial CPAs for bit ranges n-1:j, i-1:k, k-1:0; a multiplexer controlled by the group propagate P_{i-1:k} lets c_k skip the group]

Carry-select adder (CSLA) :
  High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
  $A = 14n$, $T = 2.8 n^{1/2}$, $AT = 39 n^{3/2}$
[Figure csla.epsi: per group two partial CPAs with carry-in 0 and carry-in 1; multiplexers controlled by c_k select sum bits s_{i-1:k} and carry c_i]
Carry-increment adder (CIA) :
  Result is incremented after addition, if the carry-in is 1 [12, 11]
  Part. CPA typ. is RCA, CIA (⇒ multilevel CIA) or CLA
  Variable group sizes (faster) : larger groups at end (MSB) (balance delays)
  High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
  Logic of CPA and incrementer can be merged [11]
  $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
  $A = 10n$, $T = 2.8 n^{1/2}$, $AT = 28 n^{3/2}$
[Figure cia.epsi: partial CPAs with carry-in 0 for bit ranges i-1:k and k-1:0, followed by incrementers (+1) controlled by c_k and the group propagate P_{i-1:k}]
[Figure ciagate.epsi: gate-level CIA; bit 0 : IHA, bit 1 : IHA, bits 3,2 : IFA + IHA, bits 6...4 : 2IFA + IHA, bits i-1...k : (i-k-1)IFA + IHA]

Group sizes and delays :
  max. delay $T$        4  6  10  12  14  16  18  20  22  24  26  28  ...  38
  group size            1  1   2   3   4   5   6   7   8   9  10  11  ...  16
  max. number of bits   1  2   4   7  11  16  22  29  37  46  56  67  ...  137
2
1 1 0 1 0 0
(g3,p3) (g0,p0)
Bit groups of size 2 at level
3
2 2 1 2 1 0 2 "
clbsymbol.epsi
27 CLB
1 0 0 26 mm c′
Higher parallelism, more balanced signal paths 0
3
3 3 2 3 2 1 3
2 1 0
Highest speed-up at highest hardware overhead 3 3 2 1 0
(g′,p′)
3 3 c3
. . . c0
... 0 0 0
FA FA FA
1 1 1 FA (g15,p15) ... (g12,p12) (g11,p11) ... (g8,p8) (g7,p7) ... (g4,p4) (g3,p3) ... (g0,p0)
FA FA FA c in
0 1 0 1
cosa.epsi 0 1 0 1 CLB CLB CLB CLB
...
100 57 mm
(g′11,p′11)
(g′15 ,p′15 )
(g′,p′)
(g′,p′)
7 7
3 3
c 15 ... c 12
"
c 11 ... c 8 cla.epsi c 7 ... c 4 c 3 ... c 0
level 2
0 1 0 1 0 1
... 97 48 mm
CLB c in
c out + preprocessing :
s3 s2 s1 s0
+ postprocessing :
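Parallel-prefix adders (PPA)

Carries calculated using parallel-prefix algorithms
+ High regularity : suitable for synthesis and layout
+ High flexibility : special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
+ High performance : smallest and fastest adders
Carry-propagation is prefix problem (r.m.a.) :
  preprocessing : $g_i = a_i b_i$, $p_i = a_i \oplus b_i$
  prefix computation : $(G_{i:k}, P_{i:k}) = (G_{i:j+1}, P_{i:j+1}) \bullet (G_{j:k}, P_{j:k}) = (G_{i:j+1} + P_{i:j+1} G_{j:k},\; P_{i:j+1} P_{j:k})$
  postprocessing : $c_{i+1} = G_{i:0}$ (for $c_{in} = 0$), $s_i = p_i \oplus c_i$
Group variables $(G_{i:k}, P_{i:k})$ : cover bits $k, \ldots, i$ at a given level of the prefix graph
Associativity of $\bullet$ ⇒ tree structures for evaluation
[Figure: prefix adder with inputs a_{n-1} b_{n-1} ... a_0 b_0, preprocessing, prefix graph and outputs s_{n-2} ... s_0]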
Prefix graphs : black nodes contain logic for $\bullet$, white nodes contain no logic
Performance measures :
  $A_\bullet$ : graph size (number of black nodes)
  $T_\bullet$ : graph depth (number of black nodes on critical path)

Sklansky parallel-prefix algorithm (⇒ PPA-SK)
  $A_\bullet = \frac{1}{2} n \log n$, $T_\bullet = \log n$
[Figure sk.epsi: 16-bit prefix graph, levels 0 ... 4]

Serial-prefix algorithm (⇒ RCA)
  $A_\bullet = n - 1$, $T_\bullet = n - 1$
[Figure ser.epsi: 16-bit prefix graph over bit positions 15 ... 0, levels 0 ... 15]

Brent-Kung parallel-prefix algorithm (⇒ PPA-BK)
  Traditional CLA is PPA-BK with 4-bit groups
  Tree-like redistribution of carries (fan-out tree)
  $A_\bullet = 2n - \log n - 2$, $T_\bullet = 2 \log n - 2$
[Figure bk.epsi: 16-bit prefix graph over bit positions 15 ... 0, levels 0 ... 6]
Kogge-Stone parallel-prefix algorithm (⇒ PPA-KS)
  very high wiring requirements
  $A_\bullet = n \log n - n + 1$, $T_\bullet = \log n$
[Figure ks.epsi: 16-bit prefix graph over bit positions 15 ... 0, levels 0 ... 4]

Carry-increment parallel-prefix algorithm (⇒ CIA)
[Figure cia.epsi: 16-bit prefix graph over bit positions 15 ... 0, levels 0 ... 5]

Mixed serial/parallel-prefix algorithm (⇒ RCA + PPA)
  linear size-depth trade-off using parameter $k$ :
    $k = 0$ : serial-prefix graph
    $k = n - 2 \log n + 1$ : Brent-Kung parallel-prefix graph
  fills gap between RCA and PPA-BK (i.e. CLA) in steps of single $\bullet$-operations
[Figure var.epsi: 16-bit prefix graph over bit positions 15 ... 0, levels 0 ... 10]
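To see that different prefix algorithms compute the same carries, the following Python sketch evaluates the generate/propagate pairs once serially and once with a Kogge-Stone schedule. It is a behavioural model under the assumption of no carry-in; all function names are hypothetical.

```python
def prefix_op(hi: tuple, lo: tuple) -> tuple:
    """Prefix operator on (generate, propagate) pairs; hi covers the more significant bits."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def serial_prefix(gp: list) -> list:
    """Serial-prefix evaluation (ripple-carry behaviour): depth n - 1."""
    y = [gp[0]]
    for i in range(1, len(gp)):
        y.append(prefix_op(gp[i], y[i - 1]))
    return y


def kogge_stone(gp: list) -> list:
    """Kogge-Stone evaluation: log n levels, combining distance doubles per level."""
    y = list(gp)
    d = 1
    while d < len(y):
        y = [prefix_op(y[i], y[i - d]) if i >= d else y[i] for i in range(len(y))]
        d *= 2
    return y


def prefix_add(a: int, b: int, n: int, prefix=kogge_stone) -> tuple:
    """n-bit addition without carry-in; returns (sum, carry-out)."""
    bits = [((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    gp = [(x & y, x ^ y) for x, y in bits]            # preprocessing
    groups = prefix(gp)                               # groups[i] = (G_{i:0}, P_{i:0})
    carries = [0] + [g for g, _ in groups]            # c_0 = 0, c_{i+1} = G_{i:0}
    s = 0
    for i in range(n):                                # postprocessing
        s |= (gp[i][1] ^ carries[i]) << i
    return s, carries[n]
```

Swapping `prefix=serial_prefix` for `prefix=kogge_stone` changes only the evaluation order, not the result, which is what the associativity of the prefix operator allows.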
[Figure: size-decreasing and depth-decreasing prefix transforms on a 4-bit prefix graph with inputs a3 b3 a2 b2 a1 b1 a0 b0 and c_in]
Repeated (local) prefix transformations result in overall minimization of graph depth or size ⇒ which sequence ?
Goal: minimal size (area) at given depth (delay)
Simple algorithm for sequence of applied transforms :
  Step 1 : prefix graph compression (depth minimization) : depth-decr. transforms in right-to-left bottom-up order
Multilevel adders (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)
  + Delay is $O(n^{1/(m+1)})$ for $m$ levels
Often used combinations : CLA and CSLA [14]
– Pure architectures usually perform best (at gate-level)

Transistor-level adders
  Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)
  + Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]
  + Combinations of speed-up techniques make sense
  – Much higher design effort
  Many efficient implementations exist and have been published

+ RCA is fast in average case ($T = O(\log n)$), slow in worst case ⇒ suitable for self-timed asynchronous designs [15]

[Figure addperf.ps: log-log plot against delay [ns] (ticks 5, 10, 20) for 8-bit, 16-bit, 32-bit, 64-bit and 128-bit adders; curves RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, COSA; reference lines of const. AT]
Complexity comparison under the unit-gate model

  adder      A                        T                 AT                   opt. (1)
  RCA        $7n$                     $2n$              $14 n^2$             aaa
  CSKA-1L    $8n$                     $4 n^{1/2}$       $32 n^{3/2}$         aat
  CSKA-2L    $8n$                     $O(n^{1/3})$      $O(n^{4/3})$         -
  CSLA-1L    $14n$                    $2.8 n^{1/2}$     $39 n^{3/2}$         -
  CIA-1L     $10n$                    $2.8 n^{1/2}$     $28 n^{3/2}$         att
  CIA-2L     $10n$                    $3.6 n^{1/3}$     $36 n^{4/3}$         att
  CIA-3L     $10n$                    $4.4 n^{1/4}$     $44 n^{5/4}$         -
  PPA-SK     $\frac{3}{2} n \log n$   $2 \log n$        $3 n \log^2 n$       ttt
  PPA-BK     $10n$                    $4 \log n$        $40 n \log n$        att
  PPA-KS     $3 n \log n$             $2 \log n$        $6 n \log^2 n$       -
  CLA        $14n$                    $4 \log n$        $56 n \log n$        -
  COSA       $3 n \log n$             $2 \log n$        $6 n \log^2 n$       -

  (1) optimality regarding area and delay
      aaa : smallest area, longest delay
      aat : small area, medium delay
      att : medium area, short delay
      ttt : large area, shortest delay
      - : not optimal
  (2) syn. : obtained from prefix adder synthesis
  (3) automatic logic optimization not possible (redundancy)

4.4 Carry-Save Adder (CSA)

a) Adds three $n$-bit operands $A_0$, $A_1$, $A_2$ performing no carry-propagation (i.e. carries are saved) [1]
   $(C, S) = A_0 + A_1 + A_2$ (n.)
b) Adds one $n$-bit operand to an $n$-digit carry-save operand
– Result is in redundant carry-save format ($n$ digits), represented by two $n$-bit numbers $S$ (sum bits) and $C$ (carry bits)
+ Parallel arrangement of full-adders, constant delay
  $A = 7n$, $T = 4$
[Figure csasymbol.epsi: CSA symbol with inputs A0 A1 A2 and outputs C, S]
[Figure csa.epsi: row of FAs with inputs a_{0,i} a_{1,i} a_{2,i} and outputs c_n s_{n-1} ... c_2 s_1 c_1 s_0]

4.5 Multi-Operand Adders

Adds $m$ operands $A_0, \ldots, A_{m-1}$, yields $(n + \lceil \log m \rceil)$-bit result in irredundant number rep. [1, 2]
  a) linear arrangement of CPAs
  b) linear arr. of CSAs (adder array) and final CPA
a) and b) differ in bit arrival times at final CPA :
  if fast final CPA : uniform bit arrival times required ⇒ CSA array (b)
Fast implementation : CSA array + fast final CPA
  (note: array of fast CPAs not efficient/necessary)
  CPA = RCA : $T = O(m + n)$
  Fast CPA : $T = O(m + \log n)$
[Figure mopadd.epsi: operands A0 A1 A2 A3 ... A_{m-1} fed through a chain of CSAs and a final CPA producing S]
[Figure cparray.epsi: a) array of CPAs built from FA/HA rows, inputs a_{0,i} ... a_{3,i}, outputs s_n s_{n-1} ... s_0]
[Figure: b) 4-operand CSA array with final CPA (RCA), inputs a_{0,i} ... a_{3,i}, outputs s_n s_{n-1} ... s_0]
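Variant b), a chain of carry-save adders followed by one final carry-propagate addition, can be sketched on Python integers. The function names are hypothetical, and the plain `+` at the end stands in for the final CPA.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple:
    """3-operand CSA: bitwise full-adders, carries are saved instead of propagated."""
    s = x ^ y ^ z                              # sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1     # carry bits, weighted one position higher
    return s, c


def multi_operand_add(operands) -> int:
    """Linear CSA array: one CSA per additional operand, then a single final CPA."""
    s, c = 0, 0
    for a in operands:
        s, c = carry_save_add(s, c, a)
    return s + c                               # final carry-propagate addition
```

`carry_save_add(5, 6, 7)` returns the pair `(4, 14)`, a redundant representation of 18.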
(m, 2)-compressors

[Figure cprsymbol.epsi: (m,2)-compressor symbol with inputs a_0 ... a_{m-1}, carry inputs c_in^0 ... c_in^{m-4}, carry outputs c_out^0 ... c_out^{m-4} and outputs c, s]
1-bit adders (similar to (m, k)-counters) [16]
Compresses $m$ bits down to 2 by forwarding $(m - 3)$ intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see prev. page)
+ No horizontal carry-propagation (i.e. the carry inputs of a slice do not influence its carry outputs)

Optimized (4, 2)-compressor :
  2 full-adders merged and optimized (i.e. XORs arranged in tree structure)
  $A = 14$, $T = 6$ (with full-adders : $A = 14$, $T = 8$)
[Figures: (4,2)-compressor with inputs a0 a1 a2 a3 and outputs c, s, built with full-adders and in optimized form; bit-slice view of the CSA array with inputs a_{0,i} ... a_{3,i}]
Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors :
  higher compression rate (4:2 instead of 3:2)
  less deep and more regular trees

Tree adders (Wallace tree)
  Adder tree : $n$-bit $m$-operand carry-save adder composed of $n$ tree-structured (m, 2)-compressors [1, 17]

Accumulators : Sequential $m$-operand adders
  With CPA [Figure accucpa.epsi]
  With CSA and final CPA [Figure accucsa.epsi]
    Allows higher clock rates
    Final CPA too slow : pipelining or multiple cycles for evaluation
  Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size

2's complement adder/subtractor [Figure addsub.epsi: CPA with operands A, B, control input sub, outputs S and c_out]
Addition mod $(2^n - 1)$ [Figure addmod.epsi: CPA with c_out fed back to c_in (end-around carry)]
5 Simple / Addition-Based Operations 5.2 Increment / Decrement

Incrementer
  Adds a single bit $c_{in}$ to an $n$-bit operand $A$ : $(c_{out}, Z) = A + c_{in}$
  Corresponds to addition with $B = 0$ (⇒ FA → HA)
  $z_i = a_i \oplus c_i$, $c_{i+1} = a_i c_i$ ; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
  Example : Ripple-carry incrementer using half-adders
[Figure incsymbol.epsi: incrementer symbol (+1) with input A, c_in, output Z, c_out]
[Figures incfa.epsi, inc.epsi: chain of HAs with inputs a_{n-1} ... a_0, carries c_in, c_1, c_2, ..., c_{n-1}, c_out and outputs z_{n-1} ... z_0]

Decrementer
[Figure dec.epsi: inputs a_{n-1} ... a_0, c_in; outputs z_{n-1} ... z_0, c_out]

Incrementer-decrementer
[Figure incdec.epsi: inputs a_{n-1} ... a_0, c_in; outputs z_{n-1} ... z_0, c_out]
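Fast incrementers : carry computation (r.m.a.) evaluated with faster structures
[Figure inccg.epsi: incrementer with carry-in c_in]
[Figure incpp.epsi: 8-bit incrementer with inputs a7 ... a0, carry-in c_in, outputs z7 ... z0 and carry-out c_out]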
5.3 Counting

Count clock cycles ⇒ counter, divide clock frequency ⇒ frequency divider
[Figure cntripple.epsi: counter with outputs q_{n-1} ... q2 q1 q0]
Asynchronous counter using toggle-flip-flops (lower toggle rate ⇒ lower power)
[Figure cntasync.epsi: chain of T flip-flops driven by clk, outputs q_{n-1} ... q2 q1 q0]
Fast divider using delayed-carry numbers (irredundant carry-save representation of $-1$ allows using fast carry-save incrementer) [8]
Applications:
  fast dividers (no logic between FF)
  state counter for one-hot coded FSMs
Johnson / twisted-ring counter (inverted feed-back) : $n$ FF for counting $2n$ states
[Figure cntjohnson.epsi: shift register with inverted feed-back, outputs q_{n-1} ... q2 q1 q0]
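The state sequence of a Johnson counter is easy to enumerate in Python. The sketch below models the flip-flops as a tuple and assumes an all-zero start state; `johnson_states` is a hypothetical name.

```python
def johnson_states(n: int) -> list:
    """Enumerate the states of an n-flip-flop twisted-ring counter starting from all zeroes."""
    state = (0,) * n
    seen = []
    while state not in seen:
        seen.append(state)
        # shift by one position and feed the inverted last bit back to the front
        state = (1 - state[-1],) + state[:-1]
    return seen
```

For `n = 4` the sequence is 0000, 1000, 1100, 1110, 1111, 0111, 0011, 0001, after which it repeats; only one flip-flop toggles per clock cycle.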
5.4 Comparison, Coding, Detection

Comparison operations :
  EQ (equal), NE (not equal)
  GE (greater or equal), LT (less than)
  GT (greater than), LE (less or equal)

Equality comparison
  $EQ = \overline{(a_{n-1} \oplus b_{n-1}) + \cdots + (a_0 \oplus b_0)}$ (r.s.a.)
[Figure cmpeq.epsi: equality comparator with inputs a2 b2 a1 b1 a0 b0 and output EQ]

Magnitude comparison
  by subtraction $A - B$ : $GE = c_{out}$, $EQ = P_{n-1:0}$ (for free in PPA)
  example : ripple comparator using comparator slices (r.s.a.)
[Figure cmpripple.epsi: chain of comparator slices with inputs a_{n-1} b_{n-1} ... a_0 b_0, each slice with an equality and a magnitude part, outputs GE and EQ]
Decoder
  Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$) :
    $z_i = 1$ if $A = i$, 0 else
[Figures decodersym.epsi, decoder.epsi: decoder with inputs a2 a1 a0 and outputs z7 ... z0]

Encoder
  Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$) :
    $Z = i$ if $a_i = 1$ (condition : exactly one input bit is 1)
[Figures encodersym.epsi, encoder.epsi: encoder with outputs z2 z1 z0]

Detection operations
  All-zeroes detection : $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$ (r.s.a.)
  All-ones detection : $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)
  Leading-zeroes detection (LZD) : for scaling, normalization, priority encoding
    a) non-encoded output : (e.g. 000101 → 000100)
[Figure lzdnenc.epsi: inputs a_{n-1} a_{n-2} ... a_1 a_0, outputs z_{n-1} z_{n-2} ... z_1 z_0]
    b) encoded output : + encoder
    signed numbers : + leading-ones detector (LOD)
    (note: connections according to PPA-SK)
5.5 Shift, Extension, Saturation

Shift and rotate operations on $A = (a_{n-1}, \ldots, a_0)$ :
  operation            dir.   result                                         name
  shift a) unsigned    l.     $(a_{n-2}, \ldots, a_0, 0)$                    sll
                       r.     $(0, a_{n-1}, \ldots, a_1)$                    srl
  shift a) signed      l.     $(a_{n-1}, a_{n-3}, \ldots, a_0, 0)$           sla
                       r.     $(a_{n-1}, a_{n-1}, a_{n-2}, \ldots, a_1)$     sra
  rotate               l.     $(a_{n-2}, \ldots, a_0, a_{n-1})$              rol
                       r.     $(a_0, a_{n-1}, \ldots, a_1)$                  ror
  shift b) : unsigned and signed variants
Extension of word lengths by additional bits (i.e. sign-extension for signed numbers)
Saturation to highest/lowest value after over-/underflow
Applications :
  scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD
  reducing error after over-/underflow (saturation)
Implementation of shift/extension/rotation by
  constant values : hard-wired
  variable values : multiplexers
  $n$ possible values : $n$–by–$n$ barrel-shifter/rotator
Example : 4–by–4 barrel-rotator
[Figure: barrel-rotator with inputs a3 a2 a1 a0 and select inputs s1 s0]
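A variable rotation built from multiplexer stages can be modelled as below. This sketch assumes one multiplexer stage per select bit (a logarithmic arrangement, which may differ from the $n$–by–$n$ structure in the figure), a word length `n` that is a power of two, and rotation to the left; the function name is hypothetical.

```python
def barrel_rotate_left(a: int, shift: int, n: int) -> int:
    """Rotate the n-bit word a left by shift positions (0 <= shift < n, n a power of two)."""
    mask = (1 << n) - 1
    stage = 0
    amount = 1
    while amount < n:
        if (shift >> stage) & 1:      # select bit s_stage picks the rotated or the unrotated word
            a = ((a << amount) | (a >> (n - amount))) & mask
        stage += 1
        amount <<= 1                  # stage k rotates by 2^k positions
    return a
```

With `n = 4` the two select bits s1 s0 give the four possible rotations, e.g. `barrel_rotate_left(0b1001, 1, 4)` yields `0b0011`.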
5.6 Addition Flags

  flag   formula                  description
  C      $c_n$                    carry flag
  V      $c_n \oplus c_{n-1}$     signed overflow flag
  Z      $S = 0$                  zero flag
  N      $s_{n-1}$                negative flag, sign

Condition flags for unsigned and signed comparisons are derived from C, V, Z, N
Signed overflow/underflow is indicated by V

Implementation of adder with flags
  C, N : for free
  V : fast $c_n$, $c_{n-1}$ computed by e.g. PPA ⇒ very cheap
  Z : a) $c_{in} = 1$ (subtract.) : $Z = P_{n-1:0}$ (of PPA)
      b) general case : all-zeroes detection on the sum bits (r.s.a.)
5.7 Arithmetic Logic Unit (ALU)

ALU operations
  arithmetic :     add    sub
                   inc    dec
                   pass   neg
  logic :          and    nand
                   or     nor
                   xor    xnor
                   pass   not
  shift/rotate :   sll    srl
                   sla    sra
                   rol    ror
  s/ro : shift/rotate ; l/r : left/right ;
  l/a : logic (unsigned) / arithmetic (signed)
Logic of adder/subtractor can partly be shared with logic operations

6 Multiplication 6.1 Multiplication Basics

Example : unsigned multiplication
  $P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i B$ (r.s.a.)
Algorithm
  1) Generation of partial products
  2) Adding up partial products :
     a) sequentially (sequential shift-and-add),
     b) serially (combinational shift-and-add), or
     c) in parallel
Speed-up techniques
  Reduce number of partial products
  Accelerate addition of partial products
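The two algorithm steps, generating partial products and adding them up sequentially, can be sketched for unsigned operands as follows. The function name is hypothetical, and the Python `+` stands in for the adder.

```python
def shift_and_add_multiply(a: int, b: int, n: int) -> int:
    """Unsigned n-bit multiplication by sequential shift-and-add."""
    p = 0
    for i in range(n):
        if (a >> i) & 1:       # partial product a_i * B is either 0 or B
            p += b << i        # shifted by i positions and accumulated
    return p
```

Each loop iteration corresponds to one cycle of a sequential multiplier, so an $n$-bit multiplication needs $n$ cycles without further speed-up.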
[Figure mulbraun.epsi: 4-by-4 array multiplier with operand bits a0 ... a3 and b3 b2 b1 b0; rows of HAs and FAs (CSA rows 1, 2, 3) and a final CPA row (FA FA HA) producing the product bits p0, p1, p2, p3, ...]

Parallel multipliers :
  partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
[Figure mulpar.epsi: partial product generation, CSA tree, CPA]

Signed multipliers :
  a) complement operands before and result after multiplication ⇒ unsigned multiplication
6 Multiplication 6.3 Signed Array Multipliers

Modified Braun multiplier :
  Replace FAs in regions 1, 2, and 3 by modified cells (input at mark)
  Otherwise exactly same structure and complexity as Braun multiplier ⇒ efficient and flexible
Variant with modified partial products (two additional ones) :
  – Less efficient and regular than modified Braun multiplier
[Figure: 4-by-4 partial product array, product bits 7 6 5 4 3 2 1 0]

6.4 Booth Recoding

Sequential multiplication
  Minimal (or canonical) signed-digit (SD) represent. of the multiplier
  + One cycle per non-zero partial product
  – Negative partial products
  – Data-dependent reduction of partial products and latency
Combinational multiplication
  Only fixed reduction of partial product possible
  Radix-4 Booth recoding : digit $= -2 b_{i+1} + b_i + b_{i-1}$ for $i = 0, 2, 4, \ldots$ ; $b_{-1} = 0$

    $b_{i+1}$  $b_i$  $b_{i-1}$   recoded digit
       0        0       0            0
       0        0       1           +1
       0        1       0           +1
       0        1       1           +2
       1        0       0           -2
       1        0       1           -1
       1        1       0           -1
       1        1       1            0

[Figure mulbooth.epsi: Booth recoding, partial product generation, CSA array/tree, CPA]
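The radix-4 recoding table can be checked with a short Python sketch. It assumes an even word length `n` and a multiplier given as a Python integer in the 2's complement range of `n` bits; the function names are hypothetical.

```python
def booth_radix4_digits(b: int, n: int) -> list:
    """Recode an n-bit 2's complement multiplier (n even) into n/2 digits from {-2, ..., 2}."""
    u = b & ((1 << n) - 1)            # n-bit two's complement bit pattern
    digits = []
    prev = 0                          # b_{-1} = 0
    for i in range(0, n, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits                     # least significant digit first, weights 4^k


def booth_multiply(a: int, b: int, n: int) -> int:
    """Sum of the n/2 recoded partial products d_k * A * 4^k."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_radix4_digits(b, n)))
```

For the 4-bit multiplier 7 (0111) the digits are -1 and +2, i.e. $7 = 2 \cdot 4 - 1$, so only two partial products are needed instead of four.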
Applicable to sequential, array, and parallel multipliers
– additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation) [overhead figures lost in extraction]
+ adder array/tree cut in half ⇒
  considerably smaller (array and tree)
  much faster for adder arrays
  slightly or not faster for adder trees
Negative partial products (avoid sign-extension) : [partial-product diagram with sign-extension bits lost in extraction]
Suited for signed multiplication (incl. Booth recod.)
Extend operand by one bit for unsigned multiplication : extension bit = 0
Radix-8 (3-bit recoding) and higher radices : precomputing 3× the multiplicand, larger overhead

6.5 Wallace Tree Addition
Speed-up technique : fast partial product addition (delay $O(\log n)$)
Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded)
– Irregular adder tree (Wallace tree) due to different number of bits per column
– non-uniform bit arrival times at final adder

6.6 Multiplier Implementations
Sequential multipliers :
  low performance, small area, resource sharing (adder)
Braun or Baugh-Wooley multiplier (array multiplier) :
  medium performance, high area, high regularity
  layout generators ⇒ data paths and macro-cells
  simple pipelining, faster CPA ⇒ higher speed
Booth-Wallace multiplier (parallel multiplier) [9] :
  high performance, high area, low regularity
  custom multipliers, netlist generators
  often pipelined (e.g. register between CSA-tree and CPA)
Signed-unsigned multiplier : signed multiplier with operands extended by 1 bit (extension bit = sign bit for a signed operand, 0 for an unsigned operand)
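The fast partial product addition of Section 6.5 can be illustrated with word-level carry-save adders (CSAs). The sketch below is an assumption-laden model, not the bit-level Wallace tree itself: `carry_save_add`, `csa_tree_reduce`, and `csa_tree_multiply` are hypothetical names, operands are unsigned, and whole words are compressed 3:2 per level instead of individual columns. It still shows the key property that the number of CSA levels grows only logarithmically with the number of partial products.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compression of three non-negative words: x + y + z == s + c."""
    s = x ^ y ^ z                                # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1       # carries, shifted to their weight
    return s, c


def csa_tree_reduce(operands: list[int]) -> tuple[list[int], int]:
    """Reduce operands to at most two words; return (words, number of CSA levels)."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        full = len(ops) // 3 * 3                 # operands consumed by CSAs in this level
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[full:])                   # leftovers pass through unchanged
        ops = nxt
        levels += 1
    return ops, levels


def csa_tree_multiply(a: int, b: int, n: int) -> tuple[int, int]:
    """Unsigned n-bit multiply: AND-gate partial products, CSA tree, final CPA."""
    partial_products = [(b << i) if (a >> i) & 1 else 0 for i in range(n)]
    words, levels = csa_tree_reduce(partial_products)
    return sum(words), levels                    # sum() plays the role of the CPA


print(csa_tree_multiply(13, 11, 4))   # (143, 2)
print(csa_tree_multiply(200, 57, 8))  # (11400, 4)
```

Eight partial products need four CSA levels (8 → 6 → 4 → 3 → 2), whereas a linear adder array would need six.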
6.8 Squaring
[Partial-product array of the squarer lost in extraction]
+ optimized squarer more efficient than multiplier
Table look-up (ROM) less efficient for every word length

7 Division / Square Root Extraction
7.1 Division Basics
[Defining equations of division lost in extraction] (r.m.n.)
Basic algorithm : compare and conditionally subtract ⇒ expensive comparison and CPA
Restoring division : subtract and conditionally restore
Non-restoring division : detect sign, subtract/add, and correct by next steps ⇒ expensive CPA
SRT division : estimate range, subtract/add (CSA), and correct by next steps ⇒ inexpensive CSA
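As a point of reference for the variants listed above, here is a minimal sketch of restoring division ("subtract and conditionally restore") on unsigned integers. The name `restoring_divide` and the operand ranges are assumptions made for illustration: the divisor is positive and the dividend satisfies a < b · 2^n, so the quotient fits into n bits.

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Unsigned restoring division producing an n-bit quotient.

    Requires b > 0 and 0 <= a < b * 2**n. Returns (quotient, remainder).
    """
    if b <= 0 or a < 0 or a >= (b << n):
        raise ValueError("operands out of range")
    r = a                       # partial remainder
    q = 0
    for i in range(n - 1, -1, -1):
        trial = r - (b << i)    # subtract the shifted divisor
        if trial >= 0:
            r = trial           # keep the difference, quotient bit is 1
            q |= 1 << i
        # otherwise the old partial remainder is kept ("restored"), bit is 0
    return q, r


print(restoring_divide(100, 7, 4))  # (14, 2)
```

Each step needs a full subtraction and a decision on its sign, which is where the carry-propagate adder cost comes from.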
7.2 Restoring Division
Step $i$ : subtract the shifted divisor from the partial remainder, $R'_i = R_{i+1} - B \, 2^i$
$q_i = 1$ if $R'_i \ge 0$, $q_i = 0$ if $R'_i < 0$
$R_i = R'_i$ if $q_i = 1$, $R_i = R_{i+1}$ (restored) if $q_i = 0$

7.3 Non-Restoring Division
$q_i = 1$ if $R_{i+1} \ge 0$, $q_i = -1$ if $R_{i+1} < 0$
$R_i = R_{i+1} - q_i B \, 2^i$
One subtraction/addition (CPA) per step
Final correction step for $R$ (additional CPA)
Simple quotient digit conversion : $q_i \in \{-1, 1\} \to \{0, 1\}$ (note: $q_i$ irredundant)
[Complexity figures partly lost in extraction]
[Figure: non-restoring divider with inputs A and B, rows of +/− CPAs, output R]

7.4 Signed Division
$q_i = 1$ if $R_{i+1}$ and $B$ have the same sign, $q_i = -1$ if they have opposite sign
$R_i = R_{i+1} - q_i B \, 2^i$
Example : signed non-restoring array divider (simplifications : [condition lost in extraction], final correction of $R$ omitted)
[Figure divarray.epsi: array of FA rows with inputs a6 ... a2 and b3 ... b0, first-row control a6 ⊕ b3, quotient outputs q3 ... q0, remainder outputs r3 ... r0]
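Non-restoring division (Section 7.3) can be sketched as follows for unsigned operands. This is an illustrative model under stated assumptions, not the array divider of the figure: `nonrestoring_divide` is a hypothetical name, the divisor is positive, the dividend satisfies a < b · 2^n, and quotient digits are taken from {−1, +1} and converted to binary at the end.

```python
def nonrestoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Unsigned non-restoring division with n quotient digits in {-1, +1}.

    Requires b > 0 and 0 <= a < b * 2**n. Returns (quotient, remainder).
    """
    if b <= 0 or a < 0 or a >= (b << n):
        raise ValueError("operands out of range")
    r = a
    p = 0                              # digits recorded as bits: +1 -> 1, -1 -> 0
    for i in range(n - 1, -1, -1):
        if r >= 0:
            r -= b << i                # digit +1: subtract
            p |= 1 << i
        else:
            r += b << i                # digit -1: add, no restoring step
    # Digit conversion: sum of (2*p_i - 1) * 2**i = 2*P - (2**n - 1)
    q = 2 * p - ((1 << n) - 1)
    if r < 0:                          # final correction step
        r += b
        q -= 1
    return q, r


print(nonrestoring_divide(100, 7, 4))  # (14, 2)
```

There is exactly one addition or subtraction per step; the price is the final correction when the last partial remainder is negative.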
7.5 SRT Division (Sweeney, Robertson, Tocher)
$q_i \in \{-1, 0, 1\}$, i.e. $Q$ is SD number
Quotient digit selected by range of the partial remainder : 1 if it is at or above the upper bound, 0 if it lies between the bounds, −1 if it is below the lower bound [exact bounds lost in extraction]
If $2^{n-1} \le B < 2^n$, i.e. $B$ is normalized : the bounds become constants derived from $2^{n-1}$ [exact bounds lost in extraction]
Correction in following steps (+ final correction step)
– Redundant representation of $Q$ (SD representation) ⇒ final conversion necessary (CPA)
+ Highly regular and fast ($O(n)$) SRT array dividers ⇒ only slightly slower/larger than array multipliers
[Figure divsrt.epsi: inputs A and B, rows of +/− CSAs followed by a +/− CPA, CPA for the conversion of Q, outputs Q and R]

7.6 High-Radix Division
Radix > 2 : several quotient bits per step ⇒ fewer, but more complex steps
+ Suitable for SRT algorithm ⇒ faster
– Complex comparisons (more bits) and decisions ⇒ table look-up (⇒ Pentium bug!)

7.7 Division by Multiplication
[Defining equations lost in extraction]
Algorithm : [recursion lost in extraction] (r.s.n.)
Quadratic convergence ⇒ number of steps grows with $\log n$
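Because the recursion of Section 7.7 did not survive extraction, the following sketch assumes the commonly used convergence scheme in which numerator and denominator are multiplied by the same factor until the denominator approaches 1 (often attributed to Goldschmidt). The function name `divide_by_convergence`, the factor 2 − b, and the normalization 0.5 ≤ b < 1 are assumptions for illustration, and floating-point numbers stand in for fixed-point hardware words.

```python
def divide_by_convergence(a: float, b: float, steps: int = 6) -> float:
    """Approximate a / b for a normalized divisor 0.5 <= b < 1.

    Each step multiplies numerator and denominator by r = 2 - b.
    With b = 1 - e the new denominator is 1 - e**2, so the error squares per step.
    """
    if not 0.5 <= b < 1.0:
        raise ValueError("divisor must be normalized to [0.5, 1)")
    for _ in range(steps):
        r = 2.0 - b
        a *= r      # numerator converges to the quotient
        b *= r      # denominator converges to 1
    return a


q = divide_by_convergence(0.3, 0.75)
print(abs(q - 0.4) < 1e-12)  # True
```

Since the initial error is at most 0.5 and squares in every step, six steps already exceed double precision, which is the quadratic convergence mentioned above.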
[Equations lost in extraction : find the zero of a function by recursion]
Algorithm : [recursion lost in extraction] (r.s.n.)

7.8 Remainder / Modulus
Modulus (mod) : positive remainder of a division
$A \bmod B = R$ if $R \ge 0$, else $R + B$

7.9 Divider Implementations
[...] high efficiency if components are shared
Sequential dividers (restoring, non-restoring, SRT) :
  resource sharing of existing components (e.g. adder)
  low performance, low area
Array dividers (restoring, non-restoring, SRT) :
  dedicated hardware component
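The "find a zero by recursion" remark preceding Section 7.8 is read here as the Newton-Raphson iteration for the reciprocal; that reading is an assumption, since the equations are lost. Under it, solving 1/x − b = 0 gives the multiply-only recursion x ← x · (2 − b · x). The sketch below uses the hypothetical name `reciprocal`, a normalized divisor 0.5 ≤ b < 1, and an arbitrary constant start value of 1.5 (real designs typically take the start value from a small table).

```python
def reciprocal(b: float, steps: int = 6) -> float:
    """Approximate 1 / b for 0.5 <= b < 1 by Newton-Raphson iteration.

    With e = 1 - b*x the recursion x <- x * (2 - b*x) yields e <- e**2.
    """
    if not 0.5 <= b < 1.0:
        raise ValueError("divisor must be normalized to [0.5, 1)")
    x = 1.5                     # assumed start value, |1 - b*x| <= 0.5
    for _ in range(steps):
        x = x * (2.0 - b * x)   # two multiplications and one subtraction
    return x


def divide(a: float, b: float) -> float:
    """Division as multiplication by the reciprocal."""
    return a * reciprocal(b)


print(abs(divide(0.3, 0.75) - 0.4) < 1e-12)  # True
```

Only a multiplier and a subtractor are needed, which matches the remark that such dividers are efficient when components are shared.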
7.10 Square Root Extraction
[Defining equations lost in extraction]
Algorithm
Subtract-and-shift : partial remainders and quotients [1]
[Recursion equations lost in extraction]
[Figure sqrtnr.epsi: rows of +/− CPAs, outputs Q and R]

8 Elementary Functions
8.1 Algorithms
Exponential function : $e^x$ ($\exp x$)
Logarithm function : $\ln x$, $\log x$
Trigonometric functions : $\sin x$, $\cos x$, $\tan x$
Inverse trig. functions : $\arcsin x$, $\arccos x$, $\arctan x$
Hyperbolic functions : $\sinh x$, $\cosh x$, $\tanh x$
[Iteration equations lost in extraction]
CORDIC :
  computes all elementary functions by proper input settings and choice of modes and outputs
  simple, universal hardware, small look-up table
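To make the "input settings, modes and outputs" idea concrete, here is a minimal sketch of CORDIC (the COordinate Rotation DIgital Computer listed among the approximation techniques earlier in these notes) in circular rotation mode, which yields sine and cosine. It is one mode only and uses floating-point arithmetic for readability; the function name `cordic_sin_cos`, the iteration count, and the input range |θ| ≤ π/2 are assumptions of this example. In hardware, the multiplications by 2^−i are shifts and the arctangent values come from a small look-up table.

```python
import math


def cordic_sin_cos(theta: float, iterations: int = 32) -> tuple[float, float]:
    """Return (sin(theta), cos(theta)) for |theta| <= pi/2 via rotation-mode CORDIC."""
    if abs(theta) > math.pi / 2:
        raise ValueError("theta outside the supported range")
    # Small look-up table of elementary rotation angles atan(2**-i)
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Constant gain of the micro-rotations, compensated through the start value
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0            # rotate so that z is driven to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


s, c = cordic_sin_cos(0.5)
print(abs(s - math.sin(0.5)) < 1e-8, abs(c - math.cos(0.5)) < 1e-8)  # True True
```

Other functions follow from different start values and from driving y instead of z to zero, which this sketch does not cover.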
8.2 Integer Exponentiation
Base-2 integer exponentiation : $2^A$ [defining equation lost in extraction]
Applications : modular exponentiation $A^B \bmod C$
Algorithms : square-and-multiply
a) $A^B = A^{b_0} \cdot (A^2)^{b_1} \cdot (A^4)^{b_2} \cdots$ (repeated squaring of $A$, multiply where $b_i = 1$)
b) $A^B = (\cdots((A^{b_{n-1}})^2 \cdot A^{b_{n-2}})^2 \cdots)^2 \cdot A^{b_0}$
[Recursion equations lost in extraction] (r.s.n.)

8.3 Integer Logarithm
$\lfloor \log_2 A \rfloor$
For detection/comparison of order of magnitude
Corresponds to leading-zeroes detection (LZD) with encoded output
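Both square-and-multiply orderings and the integer logarithm can be sketched briefly. The function names below are hypothetical, and the assignment of "right-to-left" and "left-to-right" to the variants a) and b) of the notes is an assumption, since the original equations are garbled. Python's built-in `pow(a, b, c)` serves as the reference.

```python
def modexp_right_to_left(a: int, b: int, c: int) -> int:
    """a**b mod c, scanning exponent bits from LSB to MSB (squares a repeatedly)."""
    result = 1 % c
    square = a % c                     # holds a**(2**i) mod c
    while b > 0:
        if b & 1:
            result = result * square % c
        square = square * square % c
        b >>= 1
    return result


def modexp_left_to_right(a: int, b: int, c: int) -> int:
    """a**b mod c, scanning exponent bits from MSB to LSB (squares the result)."""
    result = 1 % c
    for i in range(b.bit_length() - 1, -1, -1):
        result = result * result % c
        if (b >> i) & 1:
            result = result * a % c
    return result


def ilog2(a: int) -> int:
    """floor(log2(a)) for a > 0: position of the leading one."""
    if a <= 0:
        raise ValueError("a must be positive")
    return a.bit_length() - 1


print(modexp_right_to_left(3, 13, 7), modexp_left_to_right(3, 13, 7))  # 3 3
print(ilog2(1), ilog2(8), ilog2(1000))                                 # 0 3 9
```

For an n-bit word, the number of leading zeroes is n − 1 − `ilog2(a)`, which is the correspondence to leading-zeroes detection mentioned above.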
9 VLSI Design Aspects
9.1 Design Levels

9.2 Synthesis
Layout and netlist generators
Included in libraries and synthesis tools
Low-level synthesis is state-of-the-art
Basis for efficient ASIC design
Limited diversity and flexibility of library components
Optimization of entire arithmetic circuits is not feasible ⇒ only local optimizations possible
Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators

9.3 VHDL
relational : =, /=, <, <=, >, >=
shift, rotate ('93 only) : rol, ror, sla, sll, sra, srl
adding : +, -
sign (unary) : +, -
multiplying : *, /, mod, rem
exponent, absolute : **, abs
/, mod, rem : both operands must be constant or divisor must be a power of two
** : for power-of-two bases only
Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
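The power-of-two restriction on `/`, `mod`, and `rem` is plausible because these cases need no divider at all. The sketch below models this in Python for non-negative integers only; the helper names are hypothetical, and negative operands are deliberately excluded because truncating division and a plain arithmetic shift differ for them.

```python
def is_power_of_two(b: int) -> bool:
    """True if b == 2**k for some k >= 0."""
    return b > 0 and b & (b - 1) == 0


def div_pow2(a: int, k: int) -> int:
    """a / 2**k for a >= 0: drop the k least significant bits (pure wiring)."""
    return a >> k


def mod_pow2(a: int, k: int) -> int:
    """a mod 2**k for a >= 0: keep the k least significant bits (pure wiring)."""
    return a & ((1 << k) - 1)


print(div_pow2(45, 3), mod_pow2(45, 3))  # 5 5
```

Any other variable divisor would require one of the divider structures of Section 7, which a synthesis tool of this kind does not infer from the operator alone.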
Synthesis : check synthesis result for allocated arithmetic units ⇒ code sanity check, control of circuit size
Structural, synthesizable VHDL code for most circuits described in this text is found in [22]

9.4 Performance
High speed [rest of item lost in extraction]
Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
Power-related properties of arithmetic circuits :
  High glitching activity due to high bit dependencies and large logic depth
Power reduction in arithmetic circuits [23] :
  Reduce the switched capacitance by choosing an area efficient circuit architecture
  Allow for lower supply voltage by speeding up the circuitry
  Reduce the transition activity :
    apply stable inputs while circuit is not in use (⇒ disabling subcircuits)
    reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
    reduce glitching transitions by reducing logic depth (pipelining)
    take advantage of correlated data streams
    choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability
Testability goal : high fault coverage with few test vectors that are easy to generate/apply
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits
Hard-detectable faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (⇒ redundant hardware) : dividers (Pentium bug!)
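Why the MSBs of incrementers are hard to test with random vectors can be shown with a small fault simulation. The model below is a simplified assumption, not a circuit from these notes: `incrementer` and `detecting_vectors` are hypothetical names, and the only fault considered is a stuck-at fault on the carry into the most significant bit of an n-bit incrementer.

```python
def incrementer(a: int, n: int, carry_into_msb_stuck_at=None) -> int:
    """n-bit incrementer (a + 1 mod 2**n), optionally with a stuck carry into the MSB."""
    low_mask = (1 << (n - 1)) - 1
    # The carry reaches the MSB only if all lower bits are 1
    carry = 1 if (a & low_mask) == low_mask else 0
    if carry_into_msb_stuck_at is not None:
        carry = carry_into_msb_stuck_at          # inject the fault
    msb = ((a >> (n - 1)) & 1) ^ carry
    low = (a + 1) & low_mask
    return (msb << (n - 1)) | low


def detecting_vectors(n: int, stuck_at: int) -> list[int]:
    """All input vectors whose output differs between the good and the faulty circuit."""
    return [a for a in range(1 << n)
            if incrementer(a, n) != incrementer(a, n, stuck_at)]


print(detecting_vectors(8, 0))        # [127, 255]
print(len(detecting_vectors(8, 1)))   # 254
```

Only 2 of the 256 input vectors expose the stuck-at-0 fault, so a random vector detects it with probability 2^−(n−1); this is the kind of case for which special test vectors are needed.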
Bibliography
[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser Boston, 1997.
[6] Proceedings of the Xth Symposium on Computer Arithmetic.
[7] IEEE Transactions on Computers.
[8] D. R. Lutz and D. N. Jayasimha, “Programmable modulo-k counters”, IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939–941, Nov. 1996.
[9] H. Makino et al., “An 8.8-ns 54×54-bit multiplier with high speed redundant binary architecture”, IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773–783, June 1996.
[10] W. N. Holmes, “Composite arithmetic: Proposal for a new standard”, IEEE Computer, vol. 30, no. 3, pp. 65–73, Mar. 1997.
[12] A. Tyagi, “A reduced-area scheme for carry-select adders”, IEEE Trans. Comput., vol. 42, no. 10, pp. 1162–1170, Oct. 1993.
[13] T. Han and D. A. Carlson, “Fast area-efficient VLSI adders”, in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49–56.
[14] D. W. Dobberpuhl et al., “A 200-MHz 64-b dual-issue CMOS microprocessor”, IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555–1564, Nov. 1992.
[15] A. De Gloria and M. Olivieri, “Statistical carry lookahead adders”, IEEE Trans. Comput., vol. 45, no. 3, pp. 340–347, Mar. 1996.
[16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach”, IEEE Trans. Comput., vol. 45, no. 3, pp. 294–305, Mar. 1996.
[17] Z. Wang, G. A. Jullien, and W. C. Miller, “A new design technique for column compression multipliers”, IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.