
FPGAs for DSP 3

Arithmetic for DSP


Version 3.8/27/07 For Academic Use Only in Accordance with Licence-to-Use, see readme.pdf
Introduction 3.1

• This section reviews arithmetic for DSP.

• The following key issues are presented here:

• Number representation techniques: signed/unsigned integers, 1’s & 2’s complement, fixed point and floating point;

• Arithmetic operation structures: Addition/Subtraction, Multiplication, Division and Square Root;

• Complex arithmetic operations;

• FPGA specific arithmetic.

• Examples of implementing addition and multiplication in a Xilinx Virtex-II Pro FPGA are given.

August 2007, Version 3.8/21/07 For Academic Use Only. All Rights Reserved
Integer Number Representations 3.2

• A fundamental consideration in DSP is the issue of:

Number Representation

• DSP, by its very nature, requires quantities to be represented digitally - using a number representation with finite precision.

• This representation must be sufficiently accurate to handle the “real-world” inputs and outputs of the DSP system.

• The representation must also be efficient in terms of its implementation in hardware.

Unsigned Integers - Positive Values Only 3.3

• Unsigned integers can be used to represent non-negative numbers. For example, using 8 bits we can represent 0 to 255:
Integer Value    Binary Representation
0                00000000
1                00000001
2                00000010
3                00000011
4                00000100
…                …
64               01000000
65               01000001
…                …
131              10000011
…                …
255              11111111

2’s Complement - the way forward 3.5

• A more sensible number system for +ve and -ve numbers is 2’s complement, which has only one representation of 0 (zero):
Positive Numbers               Negative Numbers (invert all bits and ADD 1)
Integer   Binary               Integer   Binary
0         00000000             0         100000000
1         00000001             -1        11111111
2         00000010             -2        11111110
3         00000011             -3        11111101
…                              …
125       01111101             -125      10000011
126       01111110             -126      10000010
127       01111111             -127      10000001
                               -128      10000000

• The 9th bit generated for 0 can be ignored. Note that -128 can be
represented but +128 cannot.
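The invert-and-add-1 rule can be checked with a short sketch (Python used purely for illustration; the helper name is our own):

```python
def negate(bits):
    """Two's-complement negation: invert all bits, then add 1.
    Any carry out of the MSB (the '9th bit' generated for 0) is discarded."""
    n = len(bits)
    inverted = int(bits, 2) ^ ((1 << n) - 1)   # invert all bits
    return format((inverted + 1) & ((1 << n) - 1), "0{}b".format(n))

print(negate("00000001"))   # 11111111, i.e. -1
print(negate("00000000"))   # 00000000, the 9th bit is discarded
print(negate("10000000"))   # 10000000: -128 has no +128 counterpart
```

Note how -128 negates to itself, matching the observation that -128 can be represented but +128 cannot.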

Fixed-point Binary Numbers 3.11

• We can now define what is known as a “fixed-point” number:

a number with a fixed position for the binary point.

• Bits on the left of the binary point are termed integer bits, and bits on
the right of the binary point are termed fractional bits, for example:
aaa.bbbbb 3 integer bits, 5 fractional bits

• This number behaves in a similar way to signed integers:


digit worth:   -(2^2)   2^1   2^0   2^-1   2^-2   2^-3    2^-4     2^-5      decimal value
               -4       2     1     0.5    0.25   0.125   0.0625   0.03125

               0        0     0     0      0      0       0        1         0.03125
               0        0     0     0      0      0       1        0         0.0625
               1        0     1     0      0      0       0        0        -3.0
               1        1     0     0      0      1       1        1        -1.78125
               1        1     1     1      1      1       1        1        -0.03125
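A small sketch can confirm these values (Python for illustration; the helper name is our own):

```python
def fx_value(bits, frac_bits=5):
    """Value of a two's-complement fixed-point bit string: the MSB carries
    a negative weight, and the binary point sits frac_bits from the LSB."""
    n = int(bits, 2)
    if bits[0] == "1":
        n -= 1 << len(bits)        # two's complement: MSB weight is negative
    return n / (1 << frac_bits)    # place the binary point

print(fx_value("00000001"))   # 0.03125
print(fx_value("10100000"))   # -3.0
print(fx_value("11000111"))   # -1.78125
```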

Fixed-point Quantisation 3.12

• Consider again the number format:

aaa.bbbbb (3 integer bits, 5 fractional bits)

• Numbers between -4 and 3.96875 can be represented, in steps of 0.03125. As there are 8 bits, there are 2^8 = 256 different values.

• Revisiting our sine wave example, using this fixed-point format:

[figure: sine wave quantised to the aaa.bbbbb format, amplitude axis from -2 to +2]

• Looks much better. We must always take into account the quantisation when using fixed point - the error will be within +/- 1/2 of the LSB (least significant bit).
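As a sketch of this quantisation (Python for illustration; the helper name is our own), rounding each sample to the nearest multiple of the LSB keeps the error within half an LSB:

```python
import math

STEP = 2 ** -5   # LSB weight of the aaa.bbbbb (3.5) format

def quantise(x):
    """Round x to the nearest representable value in steps of STEP."""
    return round(x / STEP) * STEP

# A +/-2 amplitude sine wave, as in the slide's example:
samples = [2.0 * math.sin(2 * math.pi * k / 32) for k in range(32)]
worst = max(abs(s - quantise(s)) for s in samples)
print(worst)   # never exceeds STEP/2 = 0.015625 (to within float epsilon)
```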

Notes:
Quantisation is simply the DSP term for the process of representing infinite precision numbers with finite
precision numbers. In the decimal world, it is familiar to most to work with a given number of decimal places.
The real number π can be represented as 3.14159265.... and so on. We can quantise or represent π to 4
decimal places as 3.1416. If we use “rounding” here, the error is:

3.1416 – 3.14159265… = 0.00000735

If we truncated (just chopped off the bits below the 4th decimal place) then the error is larger:

3.14159265… – 3.1415 = 0.00009265

Clearly rounding is the most desirable option to maintain the best possible accuracy. However, it comes at a cost - albeit a relatively small one, it is not “free”.

When multiplying fractional numbers we will choose to work to a given number of places. For example, if we
work to two decimal places then the calculation:

0.57 x 0.43 = 0.2451

can be rounded to 0.25, or truncated to 0.24. The results are different.

Once we start performing billions of multiplies and adds in a DSP system it is not difficult to see that these small
errors can begin to stack up.

Fractional Motivation - Normalisation 3.13

• Working with fractional binary values makes the arithmetic “easier” to work with and to account for wordlength growth.

• As an example, take the case of working with a “machine” using 4-digit decimals and a 4-digit arithmetic unit - range -9999 to +9999.

• Multiplying two 4-digit numbers will result in up to 8 significant digits:

6787 x 4198 = 28491826  → scale →  2849.1826  → truncate →  2849

If we want to pass this number to the next stage in the machine (where arithmetic is 4 digits accuracy) then we need to scale down by 10000, then truncate.

• Consider normalising to the range -0.9999 to +0.9999:

0.6787 x 0.4198 = 0.28491826  → truncate →  0.2849

now the procedure for truncating back to 4 digits is much “easier”.
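The two routes above can be sketched as follows (Python for illustration; the variable names are our own):

```python
# A 4-digit decimal "machine": compare rescaling the full product
# against working with normalised fractions.
a, b = 6787, 4198
product = a * b                # 28491826: up to 8 significant digits
rescaled = product // 10 ** 4  # scale down by 10000, then truncate -> 2849

fa, fb = 0.6787, 0.4198        # the same digits normalised below 1.0
fp = fa * fb                   # 0.28491826...
kept = int(fp * 10 ** 4) / 10 ** 4   # truncate straight back to 4 digits

print(rescaled, kept)   # 2849 0.2849
```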


Truncation 3.14

• In binary, truncation is the process of simply “removing” bits. This is usually done in a constrained way to convert from a larger to a smaller binary wordlength;

• Usually truncation is performed on least significant bits (LSBs):

[figure: a 16-bit word has its 7 LSBs truncated, leaving 9 bits]

• The net effect is that we lose precision.

Rounding 3.15

• Rounding is a more accurate, but more complicated, technique that requires an addition operation before the truncation:

[figure: a 1 is added at the bit position just below the new LSB, then the word is truncated to 9 bits]

• This process is equivalent to the technique for decimal rounding, i.e. going from 7.89 to one decimal place is accomplished by adding 0.05 then truncating to 7.9.

• Therefore the process of simple rounding requires an add operation.
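The add-then-truncate trick can be sketched directly on a binary integer (Python for illustration; the 16-bit value is our own arbitrary example):

```python
x = 0b1100000011000000        # a 16-bit value to reduce to 9 bits (drop 7 LSBs)

truncated = x >> 7            # just discard the 7 LSBs
rounded = (x + (1 << 6)) >> 7 # first add half an LSB (2^6), then discard

print(format(truncated, "09b"))   # 110000001
print(format(rounded,   "09b"))   # 110000010 (discarded bits were 1000000)
```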


Notes:
Some examples of truncation of LSBs of 16 bit numbers (truncating 7 LSBs to leave 9 bits):

-1.046875    → -1.046875   : no loss of precision (the discarded LSBs were all zero)
0.013671875  → 0.0078125   : loss of precision
0.005859375  → 0.0         : total loss of precision (underflow)

The following rounding example is fairly extreme (but perfectly valid) - 0.013671875 is very close to needing to be rounded up (to 0.015625), so truncation makes a significantly larger error than rounding:

truncate: 0.013671875 → 0.0078125,  error = 0.0078125 – 0.013671875 = –0.005859375
round:    0.013671875 → 0.015625,   error = 0.015625 – 0.013671875 = 0.001953125
A different approach: Trounding 3.16

• Trounding is a compromise between truncation and rounding;

• It preserves information from beyond the LSB like rounding;

• However, unlike rounding it cannot affect any bit beyond the new LSB:

[figure: trounding examples. The bit below the new LSB (worth 0.0078125 here) is ORed into the new LSB:
0.005859375 → 0.0078125 (here trounding acts like rounding);
0.013671875 → 0.0078125 (here trounding acts like truncation).]

Trounding Explained 3.17

• First compare the logical OR operation with addition. Only when both
inputs are 1 does trounding differ from rounding:

Input A   Input B   OR   addition
0         0         0    0
0         1         1    1
1         0         1    1
1         1         1    0 carry 1

• Trounding performs like rounding 75% of the time:


• 50% of the time, tround=round=truncate
• 25% of the time, tround=round
• 25% of the time, tround=truncate

• Trounding has a lower mean quantisation error than truncation, but a higher mean quantisation error than rounding;

• Trounding has a higher quantisation error variance than both rounding and truncation.
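The three schemes can be compared in a short sketch (Python for illustration; the function names are our own, and tround ORs the guard bit into the new LSB as described above):

```python
def truncate(x, n):
    return x >> n                        # discard n LSBs

def round_half_up(x, n):
    return (x + (1 << (n - 1))) >> n     # add half an LSB, then truncate

def tround(x, n):
    # OR the guard bit (MSB of the discarded bits) into the new LSB.
    # Unlike rounding, this can never carry into higher bits.
    return (x >> n) | ((x >> (n - 1)) & 1)

# guard=0: all agree; guard=1,LSB=0: tround=round; guard=1,LSB=1: tround=truncate
for x in (0b0100, 0b0010, 0b0110):
    print(truncate(x, 2), round_half_up(x, 2), tround(x, 2))
```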
Floating-point numbers 3.18

• IEEE Standard 754 describes a widely-used standard for floating-point numbers. They are structured as follows:

n × 2^(k + 1 – N)

with K+1 exponent bits giving an exponent k in the range 1 – 2^K ≤ k ≤ 2^K, and N significand bits giving a significand n in the range –2^N ≤ n ≤ 2^N.

• The standard specifies two formats as follows:

Format K+1 N
single-precision 8 24
double-precision 11 53

• Note that both the exponent and significand are signed.

Notes:
Lots of information can be found at http://grouper.ieee.org/groups/754/

Floating-point Numbers Encoding

Although the above definition is as specified by the standard, a more intuitive definition can be gained when
analysing the encoding of floating-point numbers as a bit sequence:

S E E ... E E E F F ... F F F
Single precision: S is bit 0, E is bits 1-8, F is bits 9-31
Double precision: S is bit 0, E is bits 1-11, F is bits 12-63

These numbers decode as f(S) × 2^(E – B) × (1.F)

where E is the unsigned value of the exponent field (which stores the true exponent plus the bias B - 127 for single precision or 1023 for double precision), and F is an unsigned fixed-point value with no integer bits.

f(S) = –1 when S = 1, and f(S) = 1 when S = 0, i.e. the S bit encodes the sign of the number.

Note: In accordance with the specification, the above description of floating-point numbers is only valid for
“normalised” values. There exists a class of numbers known as “subnormals”. These are not described here.

Some examples of floating point encoding (single precision floating point - 32 bits):

10 ⇒ +2^3 × 1.01 (exponent field stores 3 + 127)
10.34 ⇒ +2^3 × 1.0100101… (exponent field stores 3 + 127)
–0.078125 ⇒ –2^(–4) × 1.01 (exponent field stores –4 + 127)

And those examples as an actual floating-point value in bits:


01000001001000000000000000000000 01000001001001010111000010100000 10111101101000000000000000000000

Note that the 1 before the fraction is NOT encoded - it will always be a 1 so need not be conveyed.
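These bit patterns can be reproduced with a short sketch (Python for illustration; the helper name is our own):

```python
import struct

def float_bits(x):
    """IEEE-754 single-precision bit pattern of x, as a 32-character string."""
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")

print(float_bits(10.0))       # 01000001001000000000000000000000
print(float_bits(-0.078125))  # 10111101101000000000000000000000
```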
Short Exponents 3.19

• One way of simplifying floating-point hardware is to create a format which uses a short exponent:

S E E E E F F ... F F F
(S is bit 0, E is bits 1-4, F is bits 5-15)

• In this case we have a 4-bit exponent and an 11-bit mantissa. With such an exponent we have the ability to represent exponents in the range -7 to 8. This results in a huge increase in dynamic range at a relatively small cost to precision:

                     16-bit fixed point          16-bit floating point
                     (1 integer bit,             (4-bit exponent,
                      15 fractional bits)         11-bit mantissa)
Minimum +ve value    2^-15                       2^-18
Maximum +ve value    1 - 2^-15                   2^8
Precision            2^-15                       2^-18 to 2^-3
Dynamic range        ~2^15                       2^26

Floating-point Numbers for DSP 3.20
• Floating point is widely used in many DSP processors which have a
dedicated Floating Point Unit (FPU).

• Why not use floating-point in FPGAs?

• Slow: the FPU is a complex unit, and every arithmetic operation in the design would have to share it.

• Area-inefficient: an FPU would be large when implemented using an FPGA.

• However, in some cases an FPU might be necessary, for example in any application that requires an enormous dynamic range.

• It can also be simpler to design with floating point - fixed-point design requires care to best exploit the available dynamic range, whereas with floating point, keeping within the dynamic range is not such a concern.

Wraparound Overflow & 2’s Complement 3.23

• With 2’s complement overflow will occur when the result to be produced
lies outside the range of the number of bits.

• Therefore for an 8-bit example the range is -128 to +127 (or in binary this is 10000000₂ to 01111111₂):

   -65     10111111            100     01100100
+ -112   + 10010000         +   37   + 00100101
  -177    101001111            137     10001001

With an 8-bit result we lose the 9th bit and the result “wraps around” to a positive value: 01001111 = +79. With an 8-bit result the result “wraps around” to a negative value: 10001001 = –119.

• One solution to overflow is to ensure that the number of bits available is always sufficient for the worst case result. Therefore in the above example perhaps allow the wordlength to grow to 9 or even 10 bits.

• Using Xilinx System Generator we can specifically check for overflow in every addition calculation.
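Wraparound behaviour can be mimicked with a short sketch (Python for illustration; the helper name is our own):

```python
def add_wrap(a, b, nbits=8):
    """Two's-complement addition where the carry out of the MSB is lost."""
    mask = (1 << nbits) - 1
    s = (a + b) & mask                                   # keep only nbits bits
    return s - (1 << nbits) if s >> (nbits - 1) else s   # reinterpret as signed

print(add_wrap(-65, -112))   # wraps around to +79, not -177
print(add_wrap(100, 37))     # wraps around to -119, not +137
```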
Notes:
Recall from previously that overflow detect circuitry is relatively easy to design. We just need to keep an eye on the MSB bits (indicating whether a number is +ve or -ve). For example:

  (-73)    10110111                 100    01100100
+  127   + 01111111              +   64  + 01000000
    54   1 00110110                 164    10100100
(discard the final 9th carry      (the MSB indicates a -ve
 bit; no overflow)                 result: overflow!)

Adding +ve and -ve will never overflow.
Adding +ve and +ve: if the result is -ve, then overflow.
Adding -ve and -ve: if the result is +ve, then overflow.
Saturation 3.24

• One method to try to address overflow is to use a saturate technique.

• Taking the previous overflowing examples from Slide 3.23:

   -65     10111111            100     01100100
+ -112   + 10010000         +   37   + 00100101
  -177    101001111            137     10001001

Detect overflow and saturate:

  -128    10000000             127     01111111

• When overflow is detected, the result is set to the closest possible value (i.e. for the 8-bit case either -128 or +127).

• Therefore, for every addition explicitly done with an adder block in Xilinx System Generator, the user gets a checkbox choice to allow results to either (i) Wraparound or (ii) Saturate.

• Implementing saturate will require “detect overflow & select” circuitry.
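The “detect overflow & select” behaviour can be sketched as a clip to the representable range (Python for illustration; the helper name is our own):

```python
def add_sat(a, b, nbits=8):
    """Addition that saturates to the representable range instead of wrapping."""
    lo, hi = -(1 << (nbits - 1)), (1 << (nbits - 1)) - 1
    return max(lo, min(hi, a + b))

print(add_sat(-65, -112))   # -128 instead of wrapping to +79
print(add_sat(100, 37))     #  127 instead of wrapping to -119
```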


Fixed point addition 3.28

• First some examples of decimal non-integer addition:

10.375 10.375
+ 3.125 + 8.125
13.500 18.500

• Now in fixed point binary (4 bits integer, 3 bits fractional):

1010.011 1010.011
+ 0011.001 + 1000.001
1101.100 10010.100

• Note that for large operands, an extra bit may be required. Care must
be taken to interpret the binary point - it must stay in the same location
w.r.t. the LSB - this means a change of location w.r.t. the MSB.

• Subtraction follows the same binary arithmetic as for integers.
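The scaled-integer view of these additions can be sketched as follows (Python for illustration; the helper names are our own):

```python
FRAC = 3   # 4.3 format: 4 integer bits, 3 fractional bits

def to_fx(x):
    return int(round(x * (1 << FRAC)))   # store as a scaled integer

def from_fx(n):
    return n / (1 << FRAC)

# Plain integer addition; the binary point stays FRAC bits from the LSB.
print(from_fx(to_fx(10.375) + to_fx(3.125)))   # 13.5
print(from_fx(to_fx(10.375) + to_fx(8.125)))   # 18.5, needing the extra bit
```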

Constant ROM-based multipliers 3.35

• Consider a ROM multiplier with two 8-bit inputs: 65,536 16-bit locations are required.

[figure: inputs A and B (8 bits each) form a 16-bit ROM address; the data output is the 16-bit product P; the ROM holds 65,536 16-bit locations.]

• If input B is constant, B = k, only 256 locations are ever accessed:

[figure: input B is removed; the ROM stores 0×k, 1×k, 2×k, 3×k, …; the 8-bit input A addresses 256 16-bit locations, and the data output is the 16-bit product P.]

• This constitutes a Constant Coefficient Multiplier (KCM)
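A KCM reduces to a simple table lookup, as this sketch shows (Python for illustration; the coefficient K = 45 is an arbitrary choice of ours):

```python
K = 45   # the fixed coefficient k (arbitrary, for illustration)

# Precompute 0*k, 1*k, 2*k, ... 255*k: a 256-entry, 16-bit-wide "ROM".
ROM = [(a * K) & 0xFFFF for a in range(256)]

def kcm(a):
    """Constant-coefficient multiply: input A simply addresses the ROM."""
    return ROM[a]

print(kcm(7))     # 315
print(kcm(255))   # 11475
```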

2’s complement Multiplication 3.37

• For one negative and one positive operand just remember to sign
extend the negative operand.

11010110 -42
x00101101 x45
1111111111010110
0000000000000000
1111111101011000
sign 1111111010110000
extends 0000000000000000
1111101011000000
0000000000000000
0000000000000000
1111100010011110 -1890
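The shift-and-add scheme above, with each partial product sign-extended to 16 bits, can be sketched as follows (Python for illustration; the helper name is our own):

```python
def signed_mult(a_bits, b_bits):
    """Shift-and-add product of an 8-bit two's-complement a and a positive b.
    Each partial product is a sign-extended, shifted copy of a; working
    modulo 2^16 plays the role of the 16-bit sign extension."""
    a = int(a_bits, 2) - (1 << 8) if a_bits[0] == "1" else int(a_bits, 2)
    acc = 0
    for i, bit in enumerate(reversed(b_bits)):
        if bit == "1":
            acc = (acc + ((a << i) & 0xFFFF)) & 0xFFFF   # 16-bit partial sums
    return acc - (1 << 16) if acc >> 15 else acc

print(signed_mult("11010110", "00101101"))   # -42 * 45 = -1890
```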

On-chip multipliers 3.39

• The Xilinx Virtex-II Pro FPGA has a set of “on-chip” multipliers.

• These are implemented in dedicated silicon rather than in the general-purpose FPGA fabric, and are therefore permanently available and use no slices. They also consume less power than a slice-based equivalent.

A
18x18 bit
multiply P
B

• A and B are 18-bit input operands, and P is the 36-bit product P = A × B.

• Depending upon the particular device, between 12 and 512 of these dedicated multipliers are available.

Division (i) 3.40

• Divisions are sometimes required in DSP, although not very often.

• 6-bit non-restoring division array computing Q = B/A:

[figure: triangular array of controlled add/subtract cells. The divisor bits a5…a0 are broadcast across each row, the dividend bits b5…b0 enter down the array, and the quotient bits q5…q0 emerge one per row. Each cell contains a full adder (FA) with carry-in (cin), carry-out (cout), sum in/out (sin, sout) and B in/out (Bin, Bout).]

• Note that each cell can perform either addition or subtraction, as shown in an earlier slide ⇒ either Sin + Bin or Sin – Bin can be selected.
Notes:
A direct method of computing division exists. This “paper and pencil” method may look familiar as it is often
taught in school. A binary example is given below. Note that each stage computes an addition or subtraction of
the divisor A. The quotient is made up of the carry bits from each addition/subtraction. If the quotient bit is a 0,
the next computation is an addition, and if it is a 1, the divisor is subtracted. It is not difficult to map this example
into the structure shown on the slide.

Example: B = 01011 (11), A = 01101 (13) ⇒ -A = 10011. Compute Q = B / A.

         01011    R0 = B
q4 = 0   10011   -A        (carry = 0)
         11110    R1
         11100    2·R1
q3 = 1   01101   +A        (carry = 1)
         01001    R2
         10010    2·R2
q2 = 1   10011   -A        (carry = 1)
         00101    R3
         01010    2·R3
q1 = 0   10011   -A        (carry = 0)
         11101    R4
         11010    2·R4
q0 = 1   01101   +A        (carry = 1)
         00111    R5

Q = B / A = 01101 × 2^-4 = 0.8125
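The same shift-then-add-or-subtract recurrence can be sketched as follows (Python for illustration; the function name is our own):

```python
def nonrestoring_divide(b, a, nbits):
    """Fractional non-restoring division, b < a: each step shifts the partial
    remainder left and adds or subtracts the divisor depending on its sign;
    the quotient bit is 1 when the new remainder is non-negative."""
    r, q = b, 0
    for _ in range(nbits):
        r = (r << 1) - a if r >= 0 else (r << 1) + a
        q = (q << 1) | (1 if r >= 0 else 0)
    return q   # quotient value = q * 2**-nbits

q = nonrestoring_divide(11, 13, 4)
print(format(q, "04b"), q / 2 ** 4)   # 1101 0.8125
```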


The Problem With Division 3.42

• An important aspect of division is to note that the quotient is generated MSB first - unlike multiplication or addition/subtraction!

• This has implications for the rest of the system.

• It is unlikely that the quotient can be passed on to the next stage until
all the bits are computed - hence slowing down the system!

• Also, an N by N array has another problem - carries rippling through the adders.

• Note that we must wait for N full adder delays before the next row can
begin its calculations.

• Unlike multiplication there is no way around this, and as a result division is always slower than multiplication, even when performed on a parallel array - an N by N multiply will run faster than an N by N divide!

Square Root (i) 3.44

• 6-bit non-restoring square root array computing B = √A:

[figure: triangular array of controlled add/subtract cells - roughly half of the division array. The radicand bits enter two per row (a7 a6, a5 a4, a3 a2, a1 a0) and the result bits b5…b0 emerge one per row. Each cell contains a full adder (FA) with carry-in (cin), carry-out (cout), sum in/out (sin, sout) and B in/out (Bin, Bout).]

• The square root is found (together with divides) in DSP algorithms such as QR algorithms, vector magnitude calculations and communications constellation rotation.

Notes:
Looking carefully at the non-restoring square root array, we can note that this array is essentially “half” of the
division array! If the division array above is cut diagonally from the left we can see the cells that are needed for
the square root array. The 2 extra cells on the right hand side are standard cells which can be simplified. So
square root can be performed twice as fast as divide using half of the hardware!
A worked example, computing B = √A for A = 10110101₂ = 181 (radicand bits taken in pairs, a4 a3 a2 a1 = 10 11 01 01):

b3 = 1, b2 = 1, b1 = 0, b0 = 1 ⇒ B = 1101₂ = 13 (and indeed 13² = 169 ≤ 181 < 14² = 196).

At each step the partial remainder is shifted left and the next pair of radicand bits brought down; a trial value built from the result bits found so far (111, 1b3 11, 1b3 b2 11, 0b3 b2 b1 11, …) is then added or subtracted, and the carry out of that operation gives the next result bit.
Square Root - An Alternative Approach 3.45

• Unfortunately the square root algorithm suffers from the same problems as division, although not to the same extent.

• These are:

• The result is generated MSB first.

• Each row has to wait longer and longer for the data it needs
from the previous row.

• A solution is to use memory to store the pre-computed square root values. The input is then used as an address to look up the answer.

• This can be fast, but if the input wordlength is large this approach quickly becomes infeasible.

• Another approach is to use memory to look up a partial solution and then use an iterative approach like the Newton-Raphson algorithm to find the final solution.

Notes:
The Newton-Raphson equation can be used to find the square root of a number. It is an iterative technique
which can achieve accurate results with relatively few iterations. However, there are two characteristics that make
it less than ideal for DSP.

• An initial guess is required to start the algorithm, and the accuracy of this guess affects the accuracy of the
solution after n iterations.

• The number of iterations n needed to achieve a desired accuracy is unknown.

The iterative algorithm is:

x_{n+1} = (x_n + Input / x_n) / 2

where x_0 is the initial estimate of the square root.

One approach that uses this algorithm is to take the first b MSBs of the input and use them to address a memory containing values for the initial guess x_0. This value is then fed into the Newton-Raphson algorithm for n iterations.
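The iteration can be sketched directly (Python for illustration; the function name and the example inputs are our own):

```python
def nr_sqrt(value, x0, iterations):
    """Newton-Raphson square root: x_{n+1} = (x_n + value/x_n) / 2."""
    x = x0
    for _ in range(iterations):
        x = (x + value / x) / 2
    return x

# A closer initial guess x0 means fewer iterations for the same accuracy.
print(nr_sqrt(181.0, 8.0, 5))
```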

Complex Addition/Subtraction 3.47

• Complex Addition and Subtraction obey the following:

( a + jb ) + ( c + jd ) = ( a + c ) + j ( b + d )
( a + jb ) – ( c + jd ) = ( a – c ) + j ( b – d )

• Thus 2 additions/subtractions are required:

a
+
_ Real
c

b
+
_ Imaginary
d

Complex Multiplication 3.48

• Complex Multiplication requires more operations:

( a + jb ) × ( c + jd ) = ( ac – bd ) + j ( bc + ad )

• Thus, 4 multiplications and 2 additions are required:

a
x
+ Imaginary
b x

c
x
_
Real
d
x

Notes:
The total number of operations that must be performed for a complex multiplication is 6. But 4 of these
operations are multiplies. Generally multiplies are more costly in terms of speed and/or area than additions.
Thus, if we can reduce the number of multiplies at the expense of a few more additions, this can be beneficial.

Note the wordlength growth that can occur (using an 8-bit example): each 8 x 8 multiply produces a 16-bit product, and the final addition or subtraction grows the real and imaginary results to 17 bits.

Alternative Complex Multiplication 3.49

• The multiplication of two complex numbers can also be written as:

( a + jb ) × ( c + jd ) = ( ac – bd ) + j [ ( a + b ) × ( c + d ) – ac – bd ]

• This comprises 3 multiplications and 5 additions:

a
+
b
x
c _

+ _
Imaginary
d x
_
Real
x

Notes:
With some algebraic manipulation a complex multiplication can be expressed in terms of 8 operations as opposed to 6. However, even though this form has 2 more operations than the previous one, there is 1 less multiplier. We have effectively traded a multiplier for 3 extra additions. This offers an alternative architecture which may be faster in systems where multiplication takes considerably longer than addition.

Note, however, that the implementation cost of the 3-multiply version is not necessarily lower, given that one of the multipliers is a 9-bit multiplier and there are of course 5 adds.

With 8-bit inputs, the sums a + b and c + d are 9 bits wide, so one of the multipliers is a 9 x 9 multiplier producing an 18-bit product, while the other two are 8 x 8 multipliers producing 16-bit products; the final subtractions give 17-bit real and imaginary results.

So which would be cheaper in a hardware implementation?
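Both forms can be checked against each other in a short sketch (Python for illustration; the function names are our own):

```python
def cmult_4(a, b, c, d):
    """Direct form of (a+jb)(c+jd): 4 multiplies, 2 adds."""
    return a * c - b * d, b * c + a * d

def cmult_3(a, b, c, d):
    """Rearranged form: 3 multiplies, 5 adds/subtracts."""
    m1, m2 = a * c, b * d
    m3 = (a + b) * (c + d)
    return m1 - m2, m3 - m1 - m2   # (real, imaginary)

print(cmult_4(1, 2, 3, 4))   # (-5, 10)
print(cmult_3(1, 2, 3, 4))   # (-5, 10)
```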

Complex Division 3.50

• Division of complex numbers uses more hardware than multiplication:


(a + jb) / (c + jd) = [ (ac + bd) + j(bc – ad) ] / (c² + d²)

• Hence, 6 multiplications, 2 divisions and 3 additions are required:


[figure: six multipliers form ac, bd, bc, ad, c² and d²; three adders form ac + bd, bc – ad and c² + d²; two dividers then produce the real and imaginary outputs.]
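The operation count can be read straight off a short sketch (Python for illustration; the function name is our own):

```python
def cdiv(a, b, c, d):
    """(a+jb)/(c+jd): 6 multiplies, 3 adds and 2 divides."""
    denom = c * c + d * d                     # 2 multiplies, 1 add
    return (a * c + b * d) / denom, (b * c - a * d) / denom

# Since (1+2j)(3+4j) = -5+10j, dividing back should recover (1, 2):
print(cdiv(-5, 10, 3, 4))   # (1.0, 2.0)
```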
