
TOA 1221
COMPUTER ORGANIZATION AND ARCHITECTURE

Lec07
Computer Arithmetic – III
Floating Point Numbers (Representation and Arithmetic)
What are we going to discuss?

• Fixed-Point Representation – Drawbacks
• Floating Point Representation
  - Principles
  - Normalization, Examples, Range and Precision, Density
  - IEEE 754 standard format
  - Single Precision and Double Precision
  - Decimal to IEEE 754 standard format conversion
  - IEEE 754 standard format to decimal equivalent conversion
  - Interpretation of floating-point numbers
• Arithmetic with Floating Point Numbers
  - Floating-Point Addition/Subtraction
  - Floating-Point Multiplication/Division
Fixed-Point Representation – Drawbacks

• Using fixed-point representation (e.g. two's complement):
  - It is possible to represent a range of positive and negative integers centered on 0
  - Numbers with a fractional component can be represented by assuming a fixed radix point
• Limitations of fixed-point representation
  - Very large numbers and very small fractions cannot be represented
  - The fractional part of the quotient in a division of two large numbers could be lost
• How can very large and very small numbers be represented with only a few digits?
  - Dynamically slide the radix point to a convenient location and use an exponent to keep track of it – Floating-Point representation
Floating-point Representation – Principles

• For decimal numbers
  - Use scientific notation
  - Example 1: 976,000,000,000,000 can be represented as 9.76 × 10^14
  - Example 2: 0.0000 0000 0000 0976 can be represented as 9.76 × 10^-14
  - Dynamically slide the decimal point to a convenient location and use the exponent of 10 to keep track of the decimal point
• For binary numbers
  - Represent the number in the form ±S × B^±E
  - Store the number with three fields:
    • Sign: plus (0) or minus (1)
    • Significand S (also called mantissa or fraction)
    • Exponent E
  - Base B is implicit and need not be stored
  - Assumption: the radix point is to the right of the MSB of the significand
Typical 32-bit Floating-Point Format

• ± significand × 2^±exponent
• Leftmost bit (MSB) – sign of the number (0 = positive, 1 = negative)
• The exponent is stored in excess (biased) notation
  - A fixed value, called the bias, is subtracted from the field to get the true exponent value
  - Typically the bias equals 2^(k-1) – 1, where k is the number of bits in the binary exponent field
• Excess-127 (biased exponent) means:
  - 8-bit exponent field
  - 8 bits yield the stored values 0–255
  - Subtract 127 (= 2^(k-1) – 1) to get the true exponent value
  - True exponent range: –127 to +128
Normalization

• Floating-point numbers are usually normalized
  - A normalized number is one in which the most significant digit of the significand is nonzero
  - For binary representation, a normalized number is one in which the MSB of the significand is 1
    • The exponent is adjusted so that the leading bit (MSB) of the significand is 1
    • Since it is always 1, there is no need to store it
      – A 23-bit field is used to store a 24-bit significand with a value in the interval [1, 2)
• Form of a normalized nonzero floating-point number:
    ±1.bbb…b × 2^±E
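A small Python sketch (my own illustration, not part of the slides) of normalization into the ±1.bbb…b × 2^E form, using the standard-library function math.frexp:

import math

def normalize(x: float):
    """Split nonzero x into (significand, exponent) with |significand| in [1, 2),
    so that x = significand * 2**exponent."""
    m, e = math.frexp(abs(x))            # frexp gives m in [0.5, 1) with abs(x) = m * 2**e
    return math.copysign(m * 2, x), e - 1

print(normalize(10.5))    # (1.3125, 3)   because 10.5 = 1.0101_2 x 2^3
print(normalize(-0.375))  # (-1.5, -2)    because 0.375 = 1.1_2 x 2^-2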
Floating-Point Representation – Examples

[Figure: two normalized negative numbers shown in the typical 32-bit format]
• True exponent +20: biased exponent = 127 + 20 = 147
• True exponent –20: biased exponent = 127 – 20 = 107
• The bias equals 2^(k-1) – 1 = 2^(8-1) – 1 = 127
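The biased-exponent arithmetic in these examples can be checked with a short Python sketch (the function names are my own, not from the slides):

def bias(k: int) -> int:
    """Bias of a k-bit exponent field: 2**(k-1) - 1."""
    return (1 << (k - 1)) - 1

def to_biased(true_exp: int, k: int = 8) -> int:
    return true_exp + bias(k)            # value actually stored in the exponent field

def to_true(stored: int, k: int = 8) -> int:
    return stored - bias(k)              # subtract the bias to recover the true exponent

print(bias(8))                   # 127
print(to_biased(20))             # 147  (true exponent +20, first example)
print(to_biased(-20))            # 107  (true exponent -20, second example)
print(to_true(0), to_true(255))  # -127 128  (range of the "typical" 8-bit field)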


Range of Numbers in Typical 32-bit Formats

[Figure: expressible ranges of 32-bit two's complement integers and 32-bit floating-point numbers]

• Floating-point representation does not give us more individual values; the same number of bit patterns is spread over a much wider range – an approximate representation
Density of Floating-Point Numbers

• Numbers are not spaced evenly along the number line
  - Possible values get closer together near the origin and farther apart as we move away
  - Many calculations produce results that are not exact and must be rounded to the nearest value that the notation can represent
Range and Precision – Trade-off

• For a fixed n-bit format, if the number of bits in the exponent is increased:
  - The range of expressible numbers is increased
  - The density of numbers is reduced, and hence the precision
• If a base of 16 is used instead of 2:
  - Greater range can be achieved for the same number of exponent bits
  - Less precision (since we have not increased the number of different values that can be represented)
  - Example: in the IBM base-16 format,
      0.11010001 × 2^10100 = 0.11010001 × 16^101   (exponents written in binary)
IEEE Standard 754
Floating-Point Format (1)…

• Standard for floating-point storage defined in IEEE 754, adopted in 1985
• To facilitate portability of programs from one processor to another
• Widely adopted and used on virtually all processors and arithmetic coprocessors
• Defines 32-bit (single-precision) and 64-bit (double-precision) formats
  - 8-bit and 11-bit exponents respectively
• Extended formats for intermediate results
  - Additional bits in the significand (extended precision)
  - Additional bits in the exponent (extended range)
IEEE Standard 754
Floating-Point Format (2)…

• In this format, the numbers are normalized so that the significand lies in the range 1 ≤ 1.F < 2
• An IEEE-format floating-point number X is formally defined as:

    X = (–1)^S × (1.F) × 2^(E – B)

  where S = sign bit [0 → +ve, 1 → –ve]
        E = exponent, biased by B
        F = fractional significand
IEEE Standard 754
Floating-Point Format (3)…

(a) Single format:  sign bit (1) | biased exponent (8 bits) | significand (23 bits)
(b) Double format:  sign bit (1) | biased exponent (11 bits) | significand (52 bits)

• A sign-magnitude representation has been adopted for the significand
• The significand is negative if S = 1, and positive if S = 0
IEEE Standard 754
Floating-Point Format - Parameters
Parameter                  | Single          | Single Extended | Double            | Double Extended
---------------------------+-----------------+-----------------+-------------------+----------------
Word width (bits)          | 32              | ≥ 43            | 64                | ≥ 79
Exponent width (bits)      | 8               | ≥ 11            | 11                | ≥ 15
Exponent bias              | 127             | unspecified     | 1023              | unspecified
Maximum exponent           | 127             | ≥ 1023          | 1023              | ≥ 16383
Minimum exponent           | –126            | ≤ –1022         | –1022             | ≤ –16382
Number range (base 10)     | 10^–38, 10^+38  | unspecified     | 10^–308, 10^+308  | unspecified
Significand width (bits)*  | 23              | ≥ 31            | 52                | ≥ 63
Number of exponents        | 254             | unspecified     | 2046              | unspecified
Number of fractions        | 2^23            | unspecified     | 2^52              | unspecified
Number of values           | 1.98 × 2^31     | unspecified     | 1.99 × 2^63       | unspecified

* Does not include the implied bit
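The single-precision limits in the table can be reproduced directly from its parameters; a quick Python check (my own addition, not from the slides):

# Extremes of the single-precision format, computed from the table's parameters
largest  = (2 - 2**-23) * 2.0**127    # largest normalized magnitude (significand just below 2)
smallest = 2.0**-126                  # smallest normalized magnitude (significand 1.0)

print(f"{largest:.6e}")   # ~3.402823e+38, the table's 10^+38 order of magnitude
print(f"{smallest:.6e}")  # ~1.175494e-38, the table's 10^-38 order of magnitude
print(2**23, 2**52)       # number of distinct fractions for single / double precision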
IEEE Standard 754
Floating-Point Format – Example 1
Convert the given numbers to IEEE single-precision format:

(a) 199.953125₁₀ = 1100 0111.111101₂ = 1.100 0111 111101 × 2^7
    Biased exponent: 7 + 127 = 134₁₀ = 1000 0110₂
    Stored significand (23 bits): 100 0111 1111 0100 0000 0000

    0 | 1000 0110 | 100 0111 1111 0100 0000 0000
    sign | biased exponent | significand

(b) –77.7₁₀ = –100 1101.1011 0011 0011…₂ = –1.001101 1011 0011 0011… × 2^6
    (77₁₀ = 100 1101₂; for the fraction, 0.7 × 2 = 1.4, 0.4 × 2 = 0.8, 0.8 × 2 = 1.6,
     0.6 × 2 = 1.2, 0.2 × 2 = 0.4, … giving the repeating fraction 0.1011 0011 0011…₂)
    Biased exponent: 6 + 127 = 133₁₀ = 1000 0101₂
    Stored significand (23 bits, extra bits dropped): 001 1011 0110 0110 0110 0110

    1 | 1000 0101 | 001 1011 0110 0110 0110 0110
    sign | biased exponent | significand
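These hand conversions can be checked against Python's struct module, which packs a float into its IEEE 754 single-precision bit pattern (a verification sketch I added, not part of the slides):

import struct

def to_single_bits(x: float) -> str:
    """32-bit IEEE 754 single-precision pattern of x as 'sign exponent fraction'."""
    bits = f"{int.from_bytes(struct.pack('>f', x), 'big'):032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(to_single_bits(199.953125))  # 0 10000110 10001111111010000000000
print(to_single_bits(-77.7))       # 1 10000101 00110110110011001100110
# For -77.7 the dropped bits begin with 0, so round-to-nearest (what the
# hardware does) and the slide's truncation give the same stored fraction.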
IEEE Standard 754
Floating-Point Format – Example 2
Convert the given IEEE single-precision floating-point numbers to their decimal equivalent:

(a) 0100 0101 1001 1100 0100 0001 0000 0000₂

    0 | 1000 1011 | 001 1100 0100 0001 0000 0000
    sign | biased exponent | significand

    Sign: +; true exponent: 139 – 127 = 12₁₀; significand: 1.001110001000001₂
    1.001110001000001₂ × 2^12 = 1001110001000.001₂ = 5000.125₁₀

(b) 1100 0100 0111 1001 1111 1100 0000 0000₂

    1 | 1000 1000 | 111 1001 1111 1100 0000 0000
    sign | biased exponent | significand

    Sign: –; true exponent: 136 – 127 = 9₁₀; significand: 1.1111001111111₂
    –1.1111001111111₂ × 2^9 = –1111100111.1111₂ = –999.9375₁₀
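Going the other way, the bit patterns in this example can be decoded with struct.unpack (again a verification sketch, not from the slides):

import struct

def single_bits_to_float(bits: str) -> float:
    """Interpret a 32-character string of 0s and 1s as an IEEE 754 single-precision value."""
    return struct.unpack('>f', int(bits, 2).to_bytes(4, 'big'))[0]

print(single_bits_to_float('01000101100111000100000100000000'))  # 5000.125
print(single_bits_to_float('11000100011110011111110000000000'))  # -999.9375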
Floating-Point Format – Interpretation of
Numbers

Single precision (32 bits):

                          Sign    Biased exponent   Fraction   Value
positive zero             0       0                 0          0
negative zero             1       0                 0          –0
plus infinity             0       255 (all 1s)      0          ∞
minus infinity            1       255 (all 1s)      0          –∞
quiet NaN                 0 or 1  255 (all 1s)      ≠ 0        NaN
signaling NaN             0 or 1  255 (all 1s)      ≠ 0        NaN
positive normalized ≠ 0   0       0 < e < 255       f          2^(e–127) (1.f)
negative normalized ≠ 0   1       0 < e < 255       f          –2^(e–127) (1.f)
positive denormalized     0       0                 f ≠ 0      2^(e–126) (0.f)
negative denormalized     1       0                 f ≠ 0      –2^(e–126) (0.f)

Double precision (64 bits):

                          Sign    Biased exponent   Fraction   Value
positive zero             0       0                 0          0
negative zero             1       0                 0          –0
plus infinity             0       2047 (all 1s)     0          ∞
minus infinity            1       2047 (all 1s)     0          –∞
quiet NaN                 0 or 1  2047 (all 1s)     ≠ 0        NaN
signaling NaN             0 or 1  2047 (all 1s)     ≠ 0        NaN
positive normalized ≠ 0   0       0 < e < 2047      f          2^(e–1023) (1.f)
negative normalized ≠ 0   1       0 < e < 2047      f          –2^(e–1023) (1.f)
positive denormalized     0       0                 f ≠ 0      2^(e–1022) (0.f)
negative denormalized     1       0                 f ≠ 0      –2^(e–1022) (0.f)
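The classification rules in this table can be written down almost verbatim; here is a single-precision classifier in Python (a sketch of my own, following the table's cases and not distinguishing quiet from signaling NaNs):

def classify_single(bits: str) -> str:
    """Classify a 32-bit single-precision pattern according to the cases in the table."""
    sign = 'negative ' if bits[0] == '1' else 'positive '
    e, f = int(bits[1:9], 2), int(bits[9:], 2)
    if e == 255:                                  # exponent all 1s
        return sign + ('infinity' if f == 0 else 'NaN')
    if e == 0:
        return sign + ('zero' if f == 0 else 'denormalized')
    return sign + 'normalized nonzero'

print(classify_single('0' + '1' * 8 + '0' * 23))            # positive infinity
print(classify_single('1' + '0' * 8 + '0' * 22 + '1'))      # negative denormalized
print(classify_single('01000101100111000100000100000000'))  # positive normalized nonzero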
Floating-point Arithmetic Operations

Basic rules, with X = X_S × B^X_E and Y = Y_S × B^Y_E (assuming X_E ≤ Y_E for addition and subtraction):

  X + Y = (X_S × B^(X_E – Y_E) + Y_S) × B^Y_E
  X – Y = (X_S × B^(X_E – Y_E) – Y_S) × B^Y_E
  X × Y = (X_S × Y_S) × B^(X_E + Y_E)
  X ÷ Y = (X_S ÷ Y_S) × B^(X_E – Y_E)

Examples:
  X = 0.3 × 10^2 = 30
  Y = 0.2 × 10^3 = 200
  X + Y = (0.3 × 10^(2–3) + 0.2) × 10^3 = 0.23 × 10^3 = 230
  X – Y = (0.3 × 10^(2–3) – 0.2) × 10^3 = (–0.17) × 10^3 = –170
  X × Y = (0.3 × 0.2) × 10^(2+3) = 0.06 × 10^5 = 6000
  X ÷ Y = (0.3 ÷ 0.2) × 10^(2–3) = 1.5 × 10^–1 = 0.15
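These rules can be exercised directly on (significand, exponent) pairs; the following small base-10 Python sketch (function names are my own) reproduces the slide's four results, up to the usual binary floating-point rounding inside Python itself:

BASE = 10  # base used in the slide's decimal examples

def add_fp(xs, xe, ys, ye):
    """(X_S, X_E) + (Y_S, Y_E): scale the operand with the smaller exponent, then add."""
    if xe > ye:
        xs, xe, ys, ye = ys, ye, xs, xe          # ensure X_E <= Y_E
    return (xs * BASE**(xe - ye) + ys, ye)

def mul_fp(xs, xe, ys, ye):
    return (xs * ys, xe + ye)                    # multiply significands, add exponents

def div_fp(xs, xe, ys, ye):
    return (xs / ys, xe - ye)                    # divide significands, subtract exponents

X, Y = (0.3, 2), (0.2, 3)                        # X = 30, Y = 200
print(add_fp(*X, *Y))                            # ~(0.23, 3)  -> 230
print(add_fp(X[0], X[1], -Y[0], Y[1]))           # ~(-0.17, 3) -> -170  (subtract = add -Y_S)
print(mul_fp(*X, *Y))                            # ~(0.06, 5)  -> 6000
print(div_fp(*X, *Y))                            # ~(1.5, -1)  -> 0.15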
Floating-point Arithmetic Operations – Conditions

Some problems that may arise during arithmetic operations are:

• Exponent overflow
  A positive exponent exceeds the maximum possible exponent value; this may be designated as +∞ or –∞ in some systems
• Exponent underflow
  A negative exponent is less than the minimum possible exponent value (e.g. 2^–200); the number is too small to be represented and may be reported as 0
• Significand underflow
  In the process of aligning significands, digits may flow off the right end of the significand; some form of rounding is required
• Significand overflow
  The addition of two significands of the same sign may result in a carry out of the most significant bit; this can be fixed by realignment
Floating-Point Arithmetic +/– (1)…

• Unlike integer and fixed-point representations, floating-point numbers cannot be added in one simple operation
• Consider adding two decimal numbers:
    A = 12345
    B = 567.89
  If these numbers are normalized and added in floating-point format, we have

      0.12345 × 10^5
    + 0.56789 × 10^3
      ?.????? × 10^?

• Obviously, direct addition cannot take place because the exponents are different
Floating-Point Arithmetic +/– (2)…

• Four basic phases of the algorithm for floating-point addition and subtraction:
    1) Check for zeros
    2) Align significands (adjusting exponents)
    3) Add or subtract significands
    4) Normalize the result
• For addition and subtraction, it is necessary to ensure that both operand exponents have the same value
  - This may involve shifting the radix point of one of the operands to achieve alignment
Floating-Point Arithmetic +/– (3)…

Phase 1: Zero check
• Since addition and subtraction are identical except for a sign change, the process begins by changing the sign of the subtrahend if it is a subtract operation
• If either operand is zero, the other operand is reported as the result
Phase 2: Significand alignment
• The floating-point numbers can only be added if the two exponents are equal
• This is done by aligning the smaller number with the bigger number [increasing its exponent], or vice versa, so that both numbers have the same exponent
Floating-Point Arithmetic +/– (4)…

• As the aligning operation may result in the loss of digits, it is the smaller number that is shifted, so that any loss is relatively insignificant
• Shifting the larger number left instead would lose its most significant digits (with an 8-bit significand register):
    1.1001 × 2^9  → shift left →  110010000. × 2^1   (only 8 bits remain; the leading 1 × 2^9 is lost)
    1.0111 × 2^1  →               1.0111000  × 2^1
• Hence, the smaller number is shifted right, its exponent being increased until the two exponents are the same
• If the two exponents differ significantly, the smaller number may be lost entirely as a result of shifting:
    1.1001001 × 2^9  →                 1.1001001 × 2^9
    1.0110001 × 2^1  → shift right →   0.0000000 × 2^9
Floating-Point Arithmetic +/– (5)…

Phase 3: Addition of significands
• After the numbers have been aligned, the two significands are added together, taking their signs into account:
      1.1101  × 2^4
    + 0.0101  × 2^4
     10.0010  × 2^4  →  1.0001 × 2^5
• There is a possibility of significand overflow due to a carry out of the most significant bit:
    o If this occurs, the significand of the result is shifted right and the exponent is incremented
    o If the exponent overflows as a result, the operation is stopped and overflow is reported
Phase 4: Normalization
• Lastly, the result is normalized by shifting the significand left until the most significant digit is nonzero
• Each shift causes a decrement of the exponent and thus could cause an exponent underflow
• Finally, the result is rounded and reported
[Flowchart: floating-point addition and subtraction, Z ← X ± Y. SUBTRACT changes the sign of Y and continues as ADD. Steps: if X = 0, Z ← Y; if Y = 0, Z ← X; while the exponents are unequal, increment the smaller exponent and shift its significand right (if that significand becomes 0, put the other number in Z); add the signed significands; if the result is 0, Z ← 0; on significand overflow, shift the significand right and increment the exponent, reporting overflow if the exponent overflows; otherwise shift the significand left to normalize, decrementing the exponent and reporting underflow if the exponent underflows; finally round the result and return.]

Worked example – addition, Z = X + Y:
  X = 1.01101 × 2^7,  Y = 1.10101 × 2^6
  Align Y:              Y = 0.110101 × 2^7
  Add significands:     1.01101 + 0.110101 = 10.001111  →  10.001111 × 2^7
  Significand overflow: shift right, increment exponent  →  Z = 1.0001111 × 2^8
[Same flowchart as the previous slide, now traced for subtraction.]

Worked example – subtraction, Z = X – Y:
  X = 1.01101 × 2^7,  Y = 1.10101 × 2^6
  Change the sign of Y and align it:   Y = –0.110101 × 2^7
  Add signed significands:             1.01101 – 0.110101 = 0.100101  →  0.100101 × 2^7
  Normalize: shift left, decrement exponent  →  Z = 1.00101 × 2^6
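The flowchart traced above condenses into a few lines of Python operating on (sign, significand, exponent) triples. This is a simplified sketch of the four phases using a toy format with FRAC fraction bits (my own representation, not IEEE; rounding and the overflow/underflow reports are omitted), run on the slide's two examples:

FRAC = 7   # fraction bits kept in this toy format (enough for the slide's operands)

def fp_addsub(x, y, subtract=False):
    """Add/subtract (sign, significand, exponent) triples; value = sign * significand * 2**(exponent - FRAC)."""
    (xs, xm, xe), (ys, ym, ye) = x, y
    if subtract:                       # subtraction = addition with the sign of Y changed
        ys = -ys
    if xm == 0:                        # phase 1: zero checks
        return (ys, ym, ye)
    if ym == 0:
        return (xs, xm, xe)
    while xe < ye:                     # phase 2: align the operand with the smaller exponent
        xm >>= 1; xe += 1
    while ye < xe:
        ym >>= 1; ye += 1
    s = xs * xm + ys * ym              # phase 3: add the signed significands
    if s == 0:
        return (1, 0, 0)
    sign, mag, exp = (1 if s > 0 else -1), abs(s), xe
    while mag >= (2 << FRAC):          # significand overflow: shift right, increment exponent
        mag >>= 1; exp += 1
    while mag < (1 << FRAC):           # phase 4: normalize: shift left, decrement exponent
        mag <<= 1; exp -= 1
    return (sign, mag, exp)

# Slide operands: X = 1.01101 x 2^7, Y = 1.10101 x 2^6 (padded to 7 fraction bits)
X = (1, 0b10110100, 7)
Y = (1, 0b11010100, 6)
print(fp_addsub(X, Y))                 # (1, 143, 8): 143 = 0b10001111, i.e. 1.0001111 x 2^8
print(fp_addsub(X, Y, subtract=True))  # (1, 148, 6): 148 = 0b10010100, i.e. 1.0010100 x 2^6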
Rounding (1)…

• Some floating-point arithmetic operations increase the number of bits in the mantissa
• For example, consider adding these 5-significant-bit floating-point numbers:
    A = 0.11001 × 2^4
    B = 0.10001 × 2^3  →  align  →  0.010001 × 2^4

      0.11001  × 2^4
    + 0.010001 × 2^4
      1.000011 × 2^4  →  normalize  →  0.1000011 × 2^5

• The result has two extra bits of precision which cannot be fitted into the floating-point format
• For simplicity, the number can be truncated to give 0.10000 × 2^5
Rounding (2)…

• Truncation is the simplest method: it involves nothing more than discarding the extra bits
• A much better technique is rounding
• If the value of the extra bits is greater than half of the least significant bit of the retained bits, 1 is added to the LSB of the retained digits
• For example, consider rounding these numbers to 4 significant bits:
  i. 0.1101101
       retained: 0.1101, extra bits: 101 (more than half)  →  add 1 to the LSB
       0.1101 + 0.0001 = 0.1110
Rounding (3)…

  ii. 0.0000011
       retained: 0.0000, extra bits: 011, i.e. a value of 0.0000011, which is less than half
       of the LSB of the retained bits (0.0001)  →  the extra bits are truncated
       result: 0.0000
  iii. 0.1101011
       retained: 0.1101, extra bits: 011 (less than half)  →  the extra bits are truncated
       result: 0.1101

• Rounding is always preferred to truncation, partly because it is more accurate and partly because it gives rise to an unbiased error
• A major disadvantage of rounding is that it requires a further arithmetic operation on the result
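The round-versus-truncate rule from these examples, written as a small Python helper (my own sketch; it ignores the carry that would propagate if all retained bits were 1):

def round_to_n_bits(bits: str, n: int) -> str:
    """Round the fractional bit string `bits` (digits after '0.') to n bits,
    adding 1 to the LSB when the dropped bits exceed half an LSB, as on the slides."""
    kept, dropped = bits[:n], bits[n:].ljust(1, '0')
    value = int(kept, 2)
    half = '1' + '0' * (len(dropped) - 1)        # bit pattern worth exactly half an LSB
    if int(dropped, 2) > int(half, 2):           # strictly greater than half -> round up
        value += 1
    return format(value, f'0{n}b')

print(round_to_n_bits('1101101', 4))  # '1110'  (example i: more than half)
print(round_to_n_bits('0000011', 4))  # '0000'  (example ii: less than half)
print(round_to_n_bits('1101011', 4))  # '1101'  (example iii: less than half)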
Floating-Point Arithmetic +/–
– Example 1
Perform the following arithmetic operation using floating-point arithmetic. Show how the numbers would be stored using IEEE single-precision format.

1150.625₁₀ – 525.25₁₀

1150.625₁₀ = 100 0111 1110.101₂ = 1.0001 1111 10101 × 2^10
  Biased exponent: 10 + 127 = 137₁₀ = 1000 1001₂

  0 | 1000 1001 | 000 1111 1101 0100 0000 0000
  sign | biased exponent | significand

525.25₁₀ = 10 0000 1101.01₂ = 1.0000 0110 101 × 2^9
  Biased exponent: 9 + 127 = 136₁₀ = 1000 1000₂

  0 | 1000 1000 | 000 0011 0101 0000 0000 0000
  sign | biased exponent | significand
Floating-Point Arithmetic +/–
– Example 1 (continued)

As these numbers have different exponents, the smaller number is shifted right to align with the larger number:

  1000 1000 | 1.00000110101   →   1000 1001 | 0.100000110101
  exponent  | mantissa              exponent | mantissa

Subtract the mantissas:

    1.0001111110101
  – 0.1000001101010
    0.1001110001011

Normalize the result:

  1000 1001 | 0.1001110001011   →   1000 1000 | 1.001110001011
  exponent  | mantissa                exponent | mantissa

Stored result (biased exponent 9 + 127 = 136₁₀ = 1000 1000₂):

  0 | 1000 1000 | 001 1100 0101 1000 0000 0000
  sign | biased exponent | significand
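As a cross-check, the bit patterns derived in this example agree with what Python's struct module produces for the two operands and their difference (a verification I added, not part of the slides):

import struct

for v in (1150.625, 525.25, 1150.625 - 525.25):
    # '>f' packs the value as a big-endian IEEE 754 single-precision number
    print(f"{v:10} -> {struct.pack('>f', v).hex()}")
# 1150.625 -> 448fd400,  525.25 -> 44035000,  625.375 -> 441c5800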
Floating-Point Arithmetic ×/÷

• Basic phases of the algorithm for floating-point multiplication and division:
    1) Check for zero
    2) Add/subtract exponents
    3) Multiply/divide significands (watch the sign)
    4) Normalize
    5) Round
    6) All intermediate results should be held in double-length storage
Floating-point Multiplication
[Flowchart: floating-point multiplication, Z ← X × Y. Steps: if X = 0 or Y = 0, Z ← 0 and return; add the biased exponents and subtract the bias; report exponent overflow or underflow if it occurs; otherwise multiply the significands; normalize; round; return.]

Worked example, Z = X × Y:
  X = 6.25₁₀ = 110.01₂ = 1.1001 × 2^2
  Y = 12.5₁₀ = 1100.1₂ = 1.1001 × 2^3

  Add exponents:   E1 = 127 + 2 = 129,  E2 = 127 + 3 = 130,  E1 + E2 = 259
  Subtract bias:   ET = 259 – 127 = 132

  Multiply significands:   1.1001₂ × 1.1001₂ = 10.01110001₂

  Normalize:   10.01110001 × 2^5 = 1.001110001 × 2^6
  Round and return.
Floating-point Division
[Flowchart: floating-point division, Z ← X ÷ Y. Steps: if X = 0, Z ← 0; if Y = 0, Z ← ∞ (or an error is reported); subtract the biased exponents and add the bias back; report exponent overflow or underflow if it occurs; otherwise divide the significands; normalize; round; return.]

Worked example, Z = X ÷ Y:
  X = 95.625₁₀ = 101 1111.101₂ = 1.011111101 × 2^6
  Y = 3.75₁₀ = 11.11₂ = 1.111 × 2^1

  Subtract exponents:   E2 = 127 + 6 = 133,  E1 = 127 + 1 = 128,  E2 – E1 = 5
  Add bias:             ET = 127 + 5 = 132

  Divide significands:   1.011111101₂ ÷ 1.111₂ = 0.110011₂

  Normalize:   0.110011 × 2^5 = 1.10011 × 2^4
  Round and return.
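Both flowcharts reduce to the same skeleton: combine the biased exponents, re-bias, operate on the significands, then normalize. A compact Python sketch (my own toy representation with FRAC fraction bits; zero checks, overflow/underflow reports and rounding are omitted), traced on the two worked examples:

BIAS, FRAC = 127, 9    # bias and fraction bits of this toy format (enough for both examples)

def fp_mul(x, y):
    """Multiply (significand, biased exponent) pairs; significands carry FRAC fraction bits."""
    (xm, xe), (ym, ye) = x, y
    exp = xe + ye - BIAS               # add the biased exponents, subtract the bias
    mag = (xm * ym) >> FRAC            # product has 2*FRAC fraction bits; keep FRAC of them
    while mag >= (2 << FRAC):          # normalize: significand must stay below 2.0
        mag >>= 1; exp += 1
    return (mag, exp)

def fp_div(x, y):
    (xm, xe), (ym, ye) = x, y
    exp = xe - ye + BIAS               # subtract the biased exponents, add the bias back
    mag = (xm << FRAC) // ym           # quotient keeps FRAC fraction bits
    while mag < (1 << FRAC):           # normalize: significand must reach 1.0
        mag <<= 1; exp -= 1
    return (mag, exp)

# Multiplication example: 6.25 x 12.5 = (1.1001 x 2^2) x (1.1001 x 2^3)
X = (0b1100100000, BIAS + 2)           # significand 1.100100000, biased exponent 129
Y = (0b1100100000, BIAS + 3)
print(fp_mul(X, Y))                    # (625, 133): 625 = 0b1001110001 -> 1.001110001 x 2^6 = 78.125

# Division example: 95.625 / 3.75 = (1.011111101 x 2^6) / (1.111 x 2^1)
A = (0b1011111101, BIAS + 6)
B = (0b1111000000, BIAS + 1)
print(fp_div(A, B))                    # (816, 131): 816 = 0b1100110000 -> 1.10011 x 2^4 = 25.5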
PROBLEM (1)

• Express the number (–640.5)₁₀ in IEEE 32-bit and 64-bit floating-point formats
SOLUTION (1)…

• IEEE 32-BIT FLOATING-POINT FORMAT

    MSB (sign) | 8 bits (biased exponent) | 23 bits (mantissa/significand, normalized)

Step 1: Express the given number in binary form
    (640.5)₁₀ = 1010000000.1₂ × 2^0
Step 2: Normalize the number into the form 1.bbbbbbb
    1010000000.1 × 2^0 = 1.0100000001 × 2^9
  Once normalized, every nonzero number has a 1 as its leftmost bit, so the IEEE format does not store this bit. Therefore the significand to be stored is 0100 0000 0100 0000 0000 000 in the allotted 23 bits.
SOLUTION (1)…

• Step 3: For the 8-bit biased exponent field, the bias used is
      2^(k–1) – 1 = 2^(8–1) – 1 = 127
  Add the bias 127 to the exponent 9 and convert the result to binary to obtain the 8-bit biased exponent:
      127 + 9 = 136  (1000 1000)
• Step 4: Since the given number is negative, set the MSB (sign bit) to 1
• Step 5: Pack the result into the proper format (IEEE 32-bit):
      1 | 1000 1000 | 0100 0000 0100 0000 0000 000
SOLUTION (1)…

• IEEE 64-BIT FLOATING-POINT FORMAT

    MSB (sign) | 11 bits (biased exponent) | 52 bits (mantissa/significand, normalized)

Step 1: Express the given number in binary form
    (640.5)₁₀ = 1010000000.1₂ × 2^0
Step 2: Normalize the number into the form 1.bbbbbbb
    1010000000.1 × 2^0 = 1.0100000001 × 2^9
  Once normalized, every nonzero number has a 1 as its leftmost bit, so the IEEE format does not store this bit. Therefore the significand to be stored is 0100 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 in the allotted 52 bits.
SOLUTION (1)

• Step 3: For the 11-bit biased exponent field, the bias used is
      2^(k–1) – 1 = 2^(11–1) – 1 = 1023
  Add the bias 1023 to the exponent 9 and convert the result to binary to obtain the 11-bit biased exponent:
      1023 + 9 = 1032  (1000 0001 000)
• Step 4: Since the given number is negative, set the MSB (sign bit) to 1
• Step 5: Pack the result into the proper format (IEEE 64-bit):

      1 | 1000 0001 000 | 0100 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
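Both packed results can be confirmed with Python's struct module, which produces the same 32-bit and 64-bit patterns (a verification I added, not part of the solution):

import struct

single = struct.pack('>f', -640.5).hex()
double = struct.pack('>d', -640.5).hex()
print(single)   # c4202000          = 1 | 1000 1000 | 0100 0000 0100 0000 0000 000
print(double)   # c084040000000000  = 1 | 1000 0001 000 | 0100 0000 0100 followed by 40 zero bits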
PROBLEM (2)
Perform the following addition using floating-point arithmetic and show how the numbers would be stored using IEEE single-precision format:

68.3₁₀ + 12.2₁₀

68₁₀ = 100 0100₂
0.3₁₀ → 0.3 × 2 = 0.6, 0.6 × 2 = 1.2, 0.2 × 2 = 0.4, 0.4 × 2 = 0.8, 0.8 × 2 = 1.6, 0.6 × 2 = 1.2, …
giving the repeating fraction 0.0100 1100 1100…₂

68.3₁₀ = 100 0100.0100 1100 1100…₂ = 1.000100 0100 1100 1100… × 2^6

Only 24 bits of significand can be stored (32-bit register shown on the slide):
    1.0001 0001 0011 0011 0011 001 | 1001 1001 …
The extra bits are more than half of the LSB, so 1 is added to the LSB:
    1.0001 0001 0011 0011 0011 010

Stored (biased exponent 6 + 127 = 133₁₀ = 1000 0101₂):
    0 | 1000 0101 | 000 1000 1001 1001 1001 1010
    sign | biased exponent | significand
SOLUTION (2)…

12₁₀ = 1100₂
0.2₁₀ → 0.2 × 2 = 0.4, 0.4 × 2 = 0.8, 0.8 × 2 = 1.6, 0.6 × 2 = 1.2, 0.2 × 2 = 0.4, …
giving the repeating fraction 0.0011 0011…₂

12.2₁₀ = 1100.0011 0011…₂ = 1.100 0011 0011… × 2^3

Only 24 bits of significand can be stored:
    1.1000 0110 0110 0110 0110 011 | 0011 0011 …
The extra bits are less than half of the LSB, so they are simply truncated.

Stored (biased exponent 3 + 127 = 130₁₀ = 1000 0010₂):
    0 | 1000 0010 | 100 0011 0011 0011 0011 0011
    sign | biased exponent | significand
SOLUTION (2)

Align the smaller number with the larger number by shifting it to the right [increasing its exponent]:

    1000 0010 | 1.10000110011001100110011   →   1000 0101 | 0.00110000110011001100110011
    exponent  | mantissa                          exponent | mantissa

Add the mantissas:

      1.00010001001100110011010
    + 0.00110000110011001100110011
      1.01000010000000000000000011

The extra bits are less than half of the LSB, so they are truncated. Store the result in IEEE single-precision format:

    0 | 1000 0101 | 010 0001 0000 0000 0000 0000
    sign | biased exponent | significand
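The whole exercise can be cross-checked in Python: rounding each operand to single precision, adding, and storing the sum back in single precision reproduces the slide's stored pattern (a verification I added, not part of the solution):

import struct

def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

a, b = to_single(68.3), to_single(12.2)
print(a, b)                            # 68.30000305175781 12.199999809265137 (neither operand is exact)
s = to_single(a + b)                   # add, then store the sum back in single precision
print(s, struct.pack('>f', s).hex())   # 80.5 42a10000  = 0 | 1000 0101 | 010 0001 0000 0000 0000 0000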
