
TOA 1221
COMPUTER ORGANIZATION AND ARCHITECTURE

Lec07
Computer Arithmetic – III
Floating Point Numbers (Representation and Arithmetic)
What are we going to discuss?

• Fixed-Point Representation – Drawbacks
• Floating Point Representation
  - Principles
  - Normalization, Examples, Range and Precision, Density
  - IEEE 754 standard format
  - Single Precision and Double Precision
  - Decimal to IEEE 754 standard format conversion
  - IEEE 754 standard format to decimal equivalent conversion
  - Interpretation of floating-point numbers
• Arithmetic with Floating Point Numbers
  - Floating-Point Addition/Subtraction
  - Floating-Point Multiplication/Division
Fixed-Point Representation – Drawbacks

• Using fixed-point representation (e.g. two's complement):
  - It is possible to represent a range of positive and negative integers centered on 0
  - Numbers with a fractional component can be represented by assuming a fixed radix point
• Limitations of fixed-point representation
  - Very large numbers and very small fractions cannot be represented
  - The fractional part of the quotient in a division of two large numbers could be lost
• How can very large and very small numbers be represented with only a few digits?
  - Dynamically slide the radix point to a convenient location and use an exponent to keep track of it – Floating-Point representation
Floating-point Representation – Principles

• For decimal numbers
  - Use scientific notation
  - Example 1: 976,000,000,000,000 can be represented as 9.76 × 10^14
  - Example 2: 0.0000 0000 0000 0976 can be represented as 9.76 × 10^-14
  - Dynamically slide the decimal point to a convenient location and use the exponent of 10 to keep track of the decimal point
• For binary numbers
  - Represent the number in the form ±S × B^±E
  - Store the number with three fields:
    • Sign: plus (0) or minus (1)
    • Significand S (also called mantissa or fraction)
    • Exponent E
  - Base B is implicit and need not be stored
  - Assumption: the radix point is to the right of the MSB of the significand
Typical 32-bit Floating-Point Format

• ± significand × 2^±exponent
• Leftmost bit (MSB) – sign of the number (0 = positive, 1 = negative)
• The exponent is stored in excess (biased) notation
  - A fixed value, called the bias, is subtracted from the field to get the true exponent value
  - Typically the bias equals 2^(k-1) – 1, where k is the number of bits in the binary exponent field
• Excess-127 (biased exponent) means:
  - 8-bit exponent field
  - 8 bits yield the stored values 0–255
  - Subtract 127 (= 2^(k-1) – 1) to get the true exponent value
  - True exponent range: –127 to +128
Normalization

• Floating-point numbers are usually normalized
  - A normalized number is one in which the most significant digit of the significand is nonzero
  - For binary representation, a normalized number is one in which the MSB of the significand is 1
    • The exponent is adjusted so that the leading bit (MSB) of the significand is 1
    • Since it is always 1, there is no need to store it
      – A 23-bit field is used to store a 24-bit significand with a value in the interval [1, 2)
• Form of a normalized nonzero floating-point number:
    ±1.bbb…b × 2^±E
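A small Python sketch (my own illustration, not part of the slides) of normalization into the ±1.bbb…b × 2^E form, using the standard-library function math.frexp:

import math

def normalize(x: float):
    """Split nonzero x into (significand, exponent) with |significand| in [1, 2),
    so that x = significand * 2**exponent."""
    m, e = math.frexp(abs(x))            # frexp gives m in [0.5, 1) with abs(x) = m * 2**e
    return math.copysign(m * 2, x), e - 1

print(normalize(10.5))    # (1.3125, 3)   because 10.5 = 1.0101_2 x 2^3
print(normalize(-0.375))  # (-1.5, -2)    because 0.375 = 1.1_2 x 2^-2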
Floating-Point Representation – Examples

[Figure: two normalized negative numbers shown in the typical 32-bit format]
• True exponent +20: biased exponent = 127 + 20 = 147
• True exponent –20: biased exponent = 127 – 20 = 107
• The bias equals 2^(k-1) – 1 = 2^(8-1) – 1 = 127
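The biased-exponent arithmetic in these examples can be checked with a short Python sketch (the function names are my own, not from the slides):

def bias(k: int) -> int:
    """Bias of a k-bit exponent field: 2**(k-1) - 1."""
    return (1 << (k - 1)) - 1

def to_biased(true_exp: int, k: int = 8) -> int:
    return true_exp + bias(k)            # value actually stored in the exponent field

def to_true(stored: int, k: int = 8) -> int:
    return stored - bias(k)              # subtract the bias to recover the true exponent

print(bias(8))                   # 127
print(to_biased(20))             # 147  (true exponent +20, first example)
print(to_biased(-20))            # 107  (true exponent -20, second example)
print(to_true(0), to_true(255))  # -127 128  (range of the "typical" 8-bit field)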


Range of Numbers in Typical 32-bit Formats

[Figure: expressible ranges of 32-bit two's complement integers and 32-bit floating-point numbers]

• Floating-point representation does not give us more individual values; the same number of bit patterns is spread over a much wider range – an approximate representation
Density of Floating-Point Numbers

• Numbers are not spaced evenly along the number line
  - Possible values get closer together near the origin and farther apart as we move away
  - Many calculations produce results that are not exact and must be rounded to the nearest value that the notation can represent
Range and Precision – Trade-off

• For a fixed n-bit format, if the number of bits in the exponent is increased:
  - The range of expressible numbers is increased
  - The density of numbers is reduced, and hence the precision
• If a base of 16 is used instead of 2:
  - Greater range can be achieved for the same number of exponent bits
  - Less precision (since we have not increased the number of different values that can be represented)
  - Example: in the IBM base-16 format,
      0.11010001 × 2^10100 = 0.11010001 × 16^101   (exponents written in binary)
IEEE Standard 754
Floating-Point Format (1)…

• Standard for floating-point storage defined in IEEE 754, adopted in 1985
• To facilitate portability of programs from one processor to another
• Widely adopted and used on virtually all processors and arithmetic coprocessors
• Defines 32-bit (single-precision) and 64-bit (double-precision) formats
  - 8-bit and 11-bit exponents respectively
• Extended formats for intermediate results
  - Additional bits in the significand (extended precision)
  - Additional bits in the exponent (extended range)
IEEE Standard 754
Floating-Point Format (2)…

• In this format, the numbers are normalized so that the significand lies in the range 1 ≤ 1.F < 2
• An IEEE-format floating-point number X is formally defined as:

    X = (–1)^S × (1.F) × 2^(E – B)

  where S = sign bit [0 → +ve, 1 → –ve]
        E = exponent, biased by B
        F = fractional significand
IEEE Standard 754
Floating-Point Format (3)…

(a) Single format:  sign bit (1) | biased exponent (8 bits) | significand (23 bits)
(b) Double format:  sign bit (1) | biased exponent (11 bits) | significand (52 bits)

• A sign-magnitude representation has been adopted for the significand
• The significand is negative if S = 1, and positive if S = 0
IEEE Standard 754
Floating-Point Format - Parameters
Parameter                  | Single          | Single Extended | Double            | Double Extended
---------------------------+-----------------+-----------------+-------------------+----------------
Word width (bits)          | 32              | ≥ 43            | 64                | ≥ 79
Exponent width (bits)      | 8               | ≥ 11            | 11                | ≥ 15
Exponent bias              | 127             | unspecified     | 1023              | unspecified
Maximum exponent           | 127             | ≥ 1023          | 1023              | ≥ 16383
Minimum exponent           | –126            | ≤ –1022         | –1022             | ≤ –16382
Number range (base 10)     | 10^–38, 10^+38  | unspecified     | 10^–308, 10^+308  | unspecified
Significand width (bits)*  | 23              | ≥ 31            | 52                | ≥ 63
Number of exponents        | 254             | unspecified     | 2046              | unspecified
Number of fractions        | 2^23            | unspecified     | 2^52              | unspecified
Number of values           | 1.98 × 2^31     | unspecified     | 1.99 × 2^63       | unspecified

* Does not include the implied bit
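The single-precision limits in the table can be reproduced directly from its parameters; a quick Python check (my own addition, not from the slides):

# Extremes of the single-precision format, computed from the table's parameters
largest  = (2 - 2**-23) * 2.0**127    # largest normalized magnitude (significand just below 2)
smallest = 2.0**-126                  # smallest normalized magnitude (significand 1.0)

print(f"{largest:.6e}")   # ~3.402823e+38, the table's 10^+38 order of magnitude
print(f"{smallest:.6e}")  # ~1.175494e-38, the table's 10^-38 order of magnitude
print(2**23, 2**52)       # number of distinct fractions for single / double precision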
IEEE Standard 754
Floating-Point Format – Example 1
Convert the given numbers to IEEE single-precision format:

(a) 199.953125₁₀ = 1100 0111.111101₂ = 1.100 0111 111101 × 2^7
    Biased exponent: 7 + 127 = 134₁₀ = 1000 0110₂
    Stored significand (23 bits): 100 0111 1111 0100 0000 0000

    0 | 1000 0110 | 100 0111 1111 0100 0000 0000
    sign | biased exponent | significand

(b) –77.7₁₀ = –100 1101.1011 0011 0011…₂ = –1.001101 1011 0011 0011… × 2^6
    (77₁₀ = 100 1101₂; for the fraction, 0.7 × 2 = 1.4, 0.4 × 2 = 0.8, 0.8 × 2 = 1.6,
     0.6 × 2 = 1.2, 0.2 × 2 = 0.4, … giving the repeating fraction 0.1011 0011 0011…₂)
    Biased exponent: 6 + 127 = 133₁₀ = 1000 0101₂
    Stored significand (23 bits, extra bits dropped): 001 1011 0110 0110 0110 0110

    1 | 1000 0101 | 001 1011 0110 0110 0110 0110
    sign | biased exponent | significand
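These hand conversions can be checked against Python's struct module, which packs a float into its IEEE 754 single-precision bit pattern (a verification sketch I added, not part of the slides):

import struct

def to_single_bits(x: float) -> str:
    """32-bit IEEE 754 single-precision pattern of x as 'sign exponent fraction'."""
    bits = f"{int.from_bytes(struct.pack('>f', x), 'big'):032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(to_single_bits(199.953125))  # 0 10000110 10001111111010000000000
print(to_single_bits(-77.7))       # 1 10000101 00110110110011001100110
# For -77.7 the dropped bits begin with 0, so round-to-nearest (what the
# hardware does) and the slide's truncation give the same stored fraction.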
IEEE Standard 754
Floating-Point Format – Example 2
Convert the given IEEE single-precision floating-point numbers to their decimal equivalent:

(a) 0100 0101 1001 1100 0100 0001 0000 0000₂

    0 | 1000 1011 | 001 1100 0100 0001 0000 0000
    sign | biased exponent | significand

    Sign: +; true exponent: 139 – 127 = 12₁₀; significand: 1.001110001000001₂
    1.001110001000001₂ × 2^12 = 1001110001000.001₂ = 5000.125₁₀

(b) 1100 0100 0111 1001 1111 1100 0000 0000₂

    1 | 1000 1000 | 111 1001 1111 1100 0000 0000
    sign | biased exponent | significand

    Sign: –; true exponent: 136 – 127 = 9₁₀; significand: 1.1111001111111₂
    –1.1111001111111₂ × 2^9 = –1111100111.1111₂ = –999.9375₁₀
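Going the other way, the bit patterns in this example can be decoded with struct.unpack (again a verification sketch, not from the slides):

import struct

def single_bits_to_float(bits: str) -> float:
    """Interpret a 32-character string of 0s and 1s as an IEEE 754 single-precision value."""
    return struct.unpack('>f', int(bits, 2).to_bytes(4, 'big'))[0]

print(single_bits_to_float('01000101100111000100000100000000'))  # 5000.125
print(single_bits_to_float('11000100011110011111110000000000'))  # -999.9375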
Floating-Point Format – Interpretation of
Numbers

Single precision (32 bits):

                          Sign    Biased exponent   Fraction   Value
positive zero             0       0                 0          0
negative zero             1       0                 0          –0
plus infinity             0       255 (all 1s)      0          ∞
minus infinity            1       255 (all 1s)      0          –∞
quiet NaN                 0 or 1  255 (all 1s)      ≠ 0        NaN
signaling NaN             0 or 1  255 (all 1s)      ≠ 0        NaN
positive normalized ≠ 0   0       0 < e < 255       f          2^(e–127) (1.f)
negative normalized ≠ 0   1       0 < e < 255       f          –2^(e–127) (1.f)
positive denormalized     0       0                 f ≠ 0      2^(e–126) (0.f)
negative denormalized     1       0                 f ≠ 0      –2^(e–126) (0.f)

Double precision (64 bits):

                          Sign    Biased exponent   Fraction   Value
positive zero             0       0                 0          0
negative zero             1       0                 0          –0
plus infinity             0       2047 (all 1s)     0          ∞
minus infinity            1       2047 (all 1s)     0          –∞
quiet NaN                 0 or 1  2047 (all 1s)     ≠ 0        NaN
signaling NaN             0 or 1  2047 (all 1s)     ≠ 0        NaN
positive normalized ≠ 0   0       0 < e < 2047      f          2^(e–1023) (1.f)
negative normalized ≠ 0   1       0 < e < 2047      f          –2^(e–1023) (1.f)
positive denormalized     0       0                 f ≠ 0      2^(e–1022) (0.f)
negative denormalized     1       0                 f ≠ 0      –2^(e–1022) (0.f)
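The classification rules in this table can be written down almost verbatim; here is a single-precision classifier in Python (a sketch of my own, following the table's cases and not distinguishing quiet from signaling NaNs):

def classify_single(bits: str) -> str:
    """Classify a 32-bit single-precision pattern according to the cases in the table."""
    sign = 'negative ' if bits[0] == '1' else 'positive '
    e, f = int(bits[1:9], 2), int(bits[9:], 2)
    if e == 255:                                  # exponent all 1s
        return sign + ('infinity' if f == 0 else 'NaN')
    if e == 0:
        return sign + ('zero' if f == 0 else 'denormalized')
    return sign + 'normalized nonzero'

print(classify_single('0' + '1' * 8 + '0' * 23))            # positive infinity
print(classify_single('1' + '0' * 8 + '0' * 22 + '1'))      # negative denormalized
print(classify_single('01000101100111000100000100000000'))  # positive normalized nonzero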
Floating-point Arithmetic Operations

Basic rules, with X = X_S × B^X_E and Y = Y_S × B^Y_E (assuming X_E ≤ Y_E for addition and subtraction):

  X + Y = (X_S × B^(X_E – Y_E) + Y_S) × B^Y_E
  X – Y = (X_S × B^(X_E – Y_E) – Y_S) × B^Y_E
  X × Y = (X_S × Y_S) × B^(X_E + Y_E)
  X ÷ Y = (X_S ÷ Y_S) × B^(X_E – Y_E)

Examples:
  X = 0.3 × 10^2 = 30
  Y = 0.2 × 10^3 = 200
  X + Y = (0.3 × 10^(2–3) + 0.2) × 10^3 = 0.23 × 10^3 = 230
  X – Y = (0.3 × 10^(2–3) – 0.2) × 10^3 = (–0.17) × 10^3 = –170
  X × Y = (0.3 × 0.2) × 10^(2+3) = 0.06 × 10^5 = 6000
  X ÷ Y = (0.3 ÷ 0.2) × 10^(2–3) = 1.5 × 10^–1 = 0.15
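These rules can be exercised directly on (significand, exponent) pairs; the following small base-10 Python sketch (function names are my own) reproduces the slide's four results, up to the usual binary floating-point rounding inside Python itself:

BASE = 10  # base used in the slide's decimal examples

def add_fp(xs, xe, ys, ye):
    """(X_S, X_E) + (Y_S, Y_E): scale the operand with the smaller exponent, then add."""
    if xe > ye:
        xs, xe, ys, ye = ys, ye, xs, xe          # ensure X_E <= Y_E
    return (xs * BASE**(xe - ye) + ys, ye)

def mul_fp(xs, xe, ys, ye):
    return (xs * ys, xe + ye)                    # multiply significands, add exponents

def div_fp(xs, xe, ys, ye):
    return (xs / ys, xe - ye)                    # divide significands, subtract exponents

X, Y = (0.3, 2), (0.2, 3)                        # X = 30, Y = 200
print(add_fp(*X, *Y))                            # ~(0.23, 3)  -> 230
print(add_fp(X[0], X[1], -Y[0], Y[1]))           # ~(-0.17, 3) -> -170  (subtract = add -Y_S)
print(mul_fp(*X, *Y))                            # ~(0.06, 5)  -> 6000
print(div_fp(*X, *Y))                            # ~(1.5, -1)  -> 0.15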
Floating-point Arithmetic Operations – Conditions

Some problems that may arise during arithmetic operations are:

• Exponent overflow
  A positive exponent exceeds the maximum possible exponent value; this may be designated as +∞ or –∞ in some systems
• Exponent underflow
  A negative exponent is less than the minimum possible exponent value (e.g. 2^–200); the number is too small to be represented and may be reported as 0
• Significand underflow
  In the process of aligning significands, digits may flow off the right end of the significand; some form of rounding is required
• Significand overflow
  The addition of two significands of the same sign may result in a carry out of the most significant bit; this can be fixed by realignment
Floating-Point Arithmetic +/– (1)…

• Unlike integer and fixed-point representations, floating-point numbers cannot be added in one simple operation
• Consider adding two decimal numbers:
    A = 12345
    B = 567.89
  If these numbers are normalized and added in floating-point format, we have

      0.12345 × 10^5
    + 0.56789 × 10^3
      ?.????? × 10^?

• Obviously, direct addition cannot take place because the exponents are different
Floating-Point Arithmetic +/– (2)…

• Four basic phases of the algorithm for floating-point addition and subtraction:
    1) Check for zeros
    2) Align significands (adjusting exponents)
    3) Add or subtract significands
    4) Normalize the result
• For addition and subtraction, it is necessary to ensure that both operand exponents have the same value
  - This may involve shifting the radix point of one of the operands to achieve alignment
Floating-Point Arithmetic +/– (3)…

Phase 1: Zero check
• Since addition and subtraction are identical except for a sign change, the process begins by changing the sign of the subtrahend if it is a subtract operation
• If either operand is zero, the other operand is reported as the result
Phase 2: Significand alignment
• The floating-point numbers can only be added if the two exponents are equal
• This is done by aligning the smaller number with the bigger number [increasing its exponent], or vice versa, so that both numbers have the same exponent
Floating-Point Arithmetic +/– (4)…

• As the aligning operation may result in the loss of digits, it is the smaller number that is shifted, so that any loss is relatively insignificant
• Shifting the larger number left instead would lose its most significant digits (with an 8-bit significand register):
    1.1001 × 2^9  → shift left →  110010000. × 2^1   (only 8 bits remain; the leading 1 × 2^9 is lost)
    1.0111 × 2^1  →               1.0111000  × 2^1
• Hence, the smaller number is shifted right, its exponent being increased until the two exponents are the same
• If the two exponents differ significantly, the smaller number may be lost entirely as a result of shifting:
    1.1001001 × 2^9  →                 1.1001001 × 2^9
    1.0110001 × 2^1  → shift right →   0.0000000 × 2^9
Floating-Point Arithmetic +/– (5)…

Phase 3: Addition of significands
• After the numbers have been aligned, the two significands are added together, taking their signs into account:
      1.1101  × 2^4
    + 0.0101  × 2^4
     10.0010  × 2^4  →  1.0001 × 2^5
• There is a possibility of significand overflow due to a carry out of the most significant bit:
    o If this occurs, the significand of the result is shifted right and the exponent is incremented
    o If the exponent overflows as a result, the operation is stopped and overflow is reported
Phase 4: Normalization
• Lastly, the result is normalized by shifting the significand left until the most significant digit is nonzero
• Each shift causes a decrement of the exponent and thus could cause an exponent underflow
• Finally, the result is rounded and reported
[Flowchart: floating-point addition and subtraction, Z ← X ± Y. SUBTRACT changes the sign of Y and continues as ADD. Steps: if X = 0, Z ← Y; if Y = 0, Z ← X; while the exponents are unequal, increment the smaller exponent and shift its significand right (if that significand becomes 0, put the other number in Z); add the signed significands; if the result is 0, Z ← 0; on significand overflow, shift the significand right and increment the exponent, reporting overflow if the exponent overflows; otherwise shift the significand left to normalize, decrementing the exponent and reporting underflow if the exponent underflows; finally round the result and return.]

Worked example – addition, Z = X + Y:
  X = 1.01101 × 2^7,  Y = 1.10101 × 2^6
  Align Y:              Y = 0.110101 × 2^7
  Add significands:     1.01101 + 0.110101 = 10.001111  →  10.001111 × 2^7
  Significand overflow: shift right, increment exponent  →  Z = 1.0001111 × 2^8
[Same flowchart as the previous slide, now traced for subtraction.]

Worked example – subtraction, Z = X – Y:
  X = 1.01101 × 2^7,  Y = 1.10101 × 2^6
  Change the sign of Y and align it:   Y = –0.110101 × 2^7
  Add signed significands:             1.01101 – 0.110101 = 0.100101  →  0.100101 × 2^7
  Normalize: shift left, decrement exponent  →  Z = 1.00101 × 2^6
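The flowchart traced above condenses into a few lines of Python operating on (sign, significand, exponent) triples. This is a simplified sketch of the four phases using a toy format with FRAC fraction bits (my own representation, not IEEE; rounding and the overflow/underflow reports are omitted), run on the slide's two examples:

FRAC = 7   # fraction bits kept in this toy format (enough for the slide's operands)

def fp_addsub(x, y, subtract=False):
    """Add/subtract (sign, significand, exponent) triples; value = sign * significand * 2**(exponent - FRAC)."""
    (xs, xm, xe), (ys, ym, ye) = x, y
    if subtract:                       # subtraction = addition with the sign of Y changed
        ys = -ys
    if xm == 0:                        # phase 1: zero checks
        return (ys, ym, ye)
    if ym == 0:
        return (xs, xm, xe)
    while xe < ye:                     # phase 2: align the operand with the smaller exponent
        xm >>= 1; xe += 1
    while ye < xe:
        ym >>= 1; ye += 1
    s = xs * xm + ys * ym              # phase 3: add the signed significands
    if s == 0:
        return (1, 0, 0)
    sign, mag, exp = (1 if s > 0 else -1), abs(s), xe
    while mag >= (2 << FRAC):          # significand overflow: shift right, increment exponent
        mag >>= 1; exp += 1
    while mag < (1 << FRAC):           # phase 4: normalize: shift left, decrement exponent
        mag <<= 1; exp -= 1
    return (sign, mag, exp)

# Slide operands: X = 1.01101 x 2^7, Y = 1.10101 x 2^6 (padded to 7 fraction bits)
X = (1, 0b10110100, 7)
Y = (1, 0b11010100, 6)
print(fp_addsub(X, Y))                 # (1, 143, 8): 143 = 0b10001111, i.e. 1.0001111 x 2^8
print(fp_addsub(X, Y, subtract=True))  # (1, 148, 6): 148 = 0b10010100, i.e. 1.0010100 x 2^6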
Rounding (1)…

• Some floating-point arithmetic operations increase the number of bits in the mantissa
• For example, consider adding these 5-significant-bit floating-point numbers:
    A = 0.11001 × 2^4
    B = 0.10001 × 2^3  →  align  →  0.010001 × 2^4

      0.11001  × 2^4
    + 0.010001 × 2^4
      1.000011 × 2^4  →  normalize  →  0.1000011 × 2^5

• The result has two extra bits of precision which cannot be fitted into the floating-point format
• For simplicity, the number can be truncated to give 0.10000 × 2^5
Rounding (2)…

• Truncation is the simplest method: it involves nothing more than discarding the extra bits
• A much better technique is rounding
• If the value of the extra bits is greater than half of the least significant bit of the retained bits, 1 is added to the LSB of the retained digits
• For example, consider rounding these numbers to 4 significant bits:
  i. 0.1101101
       retained: 0.1101, extra bits: 101 (more than half)  →  add 1 to the LSB
       0.1101 + 0.0001 = 0.1110
Rounding (3)…

  ii. 0.0000011
       retained: 0.0000, extra bits: 011, i.e. a value of 0.0000011, which is less than half
       of the LSB of the retained bits (0.0001)  →  the extra bits are truncated
       result: 0.0000
  iii. 0.1101011
       retained: 0.1101, extra bits: 011 (less than half)  →  the extra bits are truncated
       result: 0.1101

• Rounding is always preferred to truncation, partly because it is more accurate and partly because it gives rise to an unbiased error
• A major disadvantage of rounding is that it requires a further arithmetic operation on the result
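The round-versus-truncate rule from these examples, written as a small Python helper (my own sketch; it ignores the carry that would propagate if all retained bits were 1):

def round_to_n_bits(bits: str, n: int) -> str:
    """Round the fractional bit string `bits` (digits after '0.') to n bits,
    adding 1 to the LSB when the dropped bits exceed half an LSB, as on the slides."""
    kept, dropped = bits[:n], bits[n:].ljust(1, '0')
    value = int(kept, 2)
    half = '1' + '0' * (len(dropped) - 1)        # bit pattern worth exactly half an LSB
    if int(dropped, 2) > int(half, 2):           # strictly greater than half -> round up
        value += 1
    return format(value, f'0{n}b')

print(round_to_n_bits('1101101', 4))  # '1110'  (example i: more than half)
print(round_to_n_bits('0000011', 4))  # '0000'  (example ii: less than half)
print(round_to_n_bits('1101011', 4))  # '1101'  (example iii: less than half)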
Floating-Point Arithmetic +/–
– Example 1
Perform the following arithmetic operation using floating-point arithmetic. Show how the numbers would be stored using IEEE single-precision format.

1150.625₁₀ – 525.25₁₀

1150.625₁₀ = 100 0111 1110.101₂ = 1.0001 1111 10101 × 2^10
  Biased exponent: 10 + 127 = 137₁₀ = 1000 1001₂

  0 | 1000 1001 | 000 1111 1101 0100 0000 0000
  sign | biased exponent | significand

525.25₁₀ = 10 0000 1101.01₂ = 1.0000 0110 101 × 2^9
  Biased exponent: 9 + 127 = 136₁₀ = 1000 1000₂

  0 | 1000 1000 | 000 0011 0101 0000 0000 0000
  sign | biased exponent | significand
Floating-Point Arithmetic +/–
– Example 1 (continued)

As these numbers have different exponents, the smaller number is shifted right to align with the larger number:

  1000 1000 | 1.00000110101   →   1000 1001 | 0.100000110101
  exponent  | mantissa              exponent | mantissa

Subtract the mantissas:

    1.0001111110101
  – 0.1000001101010
    0.1001110001011

Normalize the result:

  1000 1001 | 0.1001110001011   →   1000 1000 | 1.001110001011
  exponent  | mantissa                exponent | mantissa

Stored result (biased exponent 9 + 127 = 136₁₀ = 1000 1000₂):

  0 | 1000 1000 | 001 1100 0101 1000 0000 0000
  sign | biased exponent | significand
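As a cross-check, the bit patterns derived in this example agree with what Python's struct module produces for the two operands and their difference (a verification I added, not part of the slides):

import struct

for v in (1150.625, 525.25, 1150.625 - 525.25):
    # '>f' packs the value as a big-endian IEEE 754 single-precision number
    print(f"{v:10} -> {struct.pack('>f', v).hex()}")
# 1150.625 -> 448fd400,  525.25 -> 44035000,  625.375 -> 441c5800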
Floating-Point Arithmetic ×/÷

• Basic phases of the algorithm for floating-point multiplication and division:
    1) Check for zero
    2) Add/subtract exponents
    3) Multiply/divide significands (watch the sign)
    4) Normalize
    5) Round
    6) All intermediate results should be held in double-length storage
Floating-point Multiplication
[Flowchart: floating-point multiplication, Z ← X × Y. Steps: if X = 0 or Y = 0, Z ← 0 and return; add the biased exponents and subtract the bias; report exponent overflow or underflow if it occurs; otherwise multiply the significands; normalize; round; return.]

Worked example, Z = X × Y:
  X = 6.25₁₀ = 110.01₂ = 1.1001 × 2^2
  Y = 12.5₁₀ = 1100.1₂ = 1.1001 × 2^3

  Add exponents:   E1 = 127 + 2 = 129,  E2 = 127 + 3 = 130,  E1 + E2 = 259
  Subtract bias:   ET = 259 – 127 = 132

  Multiply significands:   1.1001₂ × 1.1001₂ = 10.01110001₂

  Normalize:   10.01110001 × 2^5 = 1.001110001 × 2^6
  Round and return.
Floating-point Division
[Flowchart: floating-point division, Z ← X ÷ Y. Steps: if X = 0, Z ← 0; if Y = 0, Z ← ∞ (or an error is reported); subtract the biased exponents and add the bias back; report exponent overflow or underflow if it occurs; otherwise divide the significands; normalize; round; return.]

Worked example, Z = X ÷ Y:
  X = 95.625₁₀ = 101 1111.101₂ = 1.011111101 × 2^6
  Y = 3.75₁₀ = 11.11₂ = 1.111 × 2^1

  Subtract exponents:   E2 = 127 + 6 = 133,  E1 = 127 + 1 = 128,  E2 – E1 = 5
  Add bias:             ET = 127 + 5 = 132

  Divide significands:   1.011111101₂ ÷ 1.111₂ = 0.110011₂

  Normalize:   0.110011 × 2^5 = 1.10011 × 2^4
  Round and return.
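Both flowcharts reduce to the same skeleton: combine the biased exponents, re-bias, operate on the significands, then normalize. A compact Python sketch (my own toy representation with FRAC fraction bits; zero checks, overflow/underflow reports and rounding are omitted), traced on the two worked examples:

BIAS, FRAC = 127, 9    # bias and fraction bits of this toy format (enough for both examples)

def fp_mul(x, y):
    """Multiply (significand, biased exponent) pairs; significands carry FRAC fraction bits."""
    (xm, xe), (ym, ye) = x, y
    exp = xe + ye - BIAS               # add the biased exponents, subtract the bias
    mag = (xm * ym) >> FRAC            # product has 2*FRAC fraction bits; keep FRAC of them
    while mag >= (2 << FRAC):          # normalize: significand must stay below 2.0
        mag >>= 1; exp += 1
    return (mag, exp)

def fp_div(x, y):
    (xm, xe), (ym, ye) = x, y
    exp = xe - ye + BIAS               # subtract the biased exponents, add the bias back
    mag = (xm << FRAC) // ym           # quotient keeps FRAC fraction bits
    while mag < (1 << FRAC):           # normalize: significand must reach 1.0
        mag <<= 1; exp -= 1
    return (mag, exp)

# Multiplication example: 6.25 x 12.5 = (1.1001 x 2^2) x (1.1001 x 2^3)
X = (0b1100100000, BIAS + 2)           # significand 1.100100000, biased exponent 129
Y = (0b1100100000, BIAS + 3)
print(fp_mul(X, Y))                    # (625, 133): 625 = 0b1001110001 -> 1.001110001 x 2^6 = 78.125

# Division example: 95.625 / 3.75 = (1.011111101 x 2^6) / (1.111 x 2^1)
A = (0b1011111101, BIAS + 6)
B = (0b1111000000, BIAS + 1)
print(fp_div(A, B))                    # (816, 131): 816 = 0b1100110000 -> 1.10011 x 2^4 = 25.5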
PROBLEM (1)

• Express the number (–640.5)₁₀ in IEEE 32-bit and 64-bit floating-point formats
SOLUTION (1)…

• IEEE 32-BIT FLOATING-POINT FORMAT

    MSB (sign) | 8 bits (biased exponent) | 23 bits (mantissa/significand, normalized)

Step 1: Express the given number in binary form
    (640.5)₁₀ = 1010000000.1₂ × 2^0
Step 2: Normalize the number into the form 1.bbbbbbb
    1010000000.1 × 2^0 = 1.0100000001 × 2^9
  Once normalized, every nonzero number has a 1 as its leftmost bit, so the IEEE format does not store this bit. Therefore the significand to be stored is 0100 0000 0100 0000 0000 000 in the allotted 23 bits.
SOLUTION (1)…

• Step 3: For the 8-bit biased exponent field, the bias used is
      2^(k–1) – 1 = 2^(8–1) – 1 = 127
  Add the bias 127 to the exponent 9 and convert the result to binary to obtain the 8-bit biased exponent:
      127 + 9 = 136  (1000 1000)
• Step 4: Since the given number is negative, set the MSB (sign bit) to 1
• Step 5: Pack the result into the proper format (IEEE 32-bit):
      1 | 1000 1000 | 0100 0000 0100 0000 0000 000
SOLUTION (1)…

• IEEE 64-BIT FLOATING-POINT FORMAT

    MSB (sign) | 11 bits (biased exponent) | 52 bits (mantissa/significand, normalized)

Step 1: Express the given number in binary form
    (640.5)₁₀ = 1010000000.1₂ × 2^0
Step 2: Normalize the number into the form 1.bbbbbbb
    1010000000.1 × 2^0 = 1.0100000001 × 2^9
  Once normalized, every nonzero number has a 1 as its leftmost bit, so the IEEE format does not store this bit. Therefore the significand to be stored is 0100 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 in the allotted 52 bits.
SOLUTION (1)

• Step 3: For the 11-bit biased exponent field, the bias used is
      2^(k–1) – 1 = 2^(11–1) – 1 = 1023
  Add the bias 1023 to the exponent 9 and convert the result to binary to obtain the 11-bit biased exponent:
      1023 + 9 = 1032  (1000 0001 000)
• Step 4: Since the given number is negative, set the MSB (sign bit) to 1
• Step 5: Pack the result into the proper format (IEEE 64-bit):

      1 | 1000 0001 000 | 0100 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
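Both packed results can be confirmed with Python's struct module, which produces the same 32-bit and 64-bit patterns (a verification I added, not part of the solution):

import struct

single = struct.pack('>f', -640.5).hex()
double = struct.pack('>d', -640.5).hex()
print(single)   # c4202000          = 1 | 1000 1000 | 0100 0000 0100 0000 0000 000
print(double)   # c084040000000000  = 1 | 1000 0001 000 | 0100 0000 0100 followed by 40 zero bits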
PROBLEM (2)
Perform the following addition using floating-point arithmetic and show how the numbers would be stored using IEEE single-precision format:

68.3₁₀ + 12.2₁₀

68₁₀ = 100 0100₂
0.3₁₀ → 0.3 × 2 = 0.6, 0.6 × 2 = 1.2, 0.2 × 2 = 0.4, 0.4 × 2 = 0.8, 0.8 × 2 = 1.6, 0.6 × 2 = 1.2, …
giving the repeating fraction 0.0100 1100 1100…₂

68.3₁₀ = 100 0100.0100 1100 1100…₂ = 1.000100 0100 1100 1100… × 2^6

Only 24 bits of significand can be stored (32-bit register shown on the slide):
    1.0001 0001 0011 0011 0011 001 | 1001 1001 …
The extra bits are more than half of the LSB, so 1 is added to the LSB:
    1.0001 0001 0011 0011 0011 010

Stored (biased exponent 6 + 127 = 133₁₀ = 1000 0101₂):
    0 | 1000 0101 | 000 1000 1001 1001 1001 1010
    sign | biased exponent | significand
SOLUTION (2)…

12₁₀ = 1100₂
0.2₁₀ → 0.2 × 2 = 0.4, 0.4 × 2 = 0.8, 0.8 × 2 = 1.6, 0.6 × 2 = 1.2, 0.2 × 2 = 0.4, …
giving the repeating fraction 0.0011 0011…₂

12.2₁₀ = 1100.0011 0011…₂ = 1.100 0011 0011… × 2^3

Only 24 bits of significand can be stored:
    1.1000 0110 0110 0110 0110 011 | 0011 0011 …
The extra bits are less than half of the LSB, so they are simply truncated.

Stored (biased exponent 3 + 127 = 130₁₀ = 1000 0010₂):
    0 | 1000 0010 | 100 0011 0011 0011 0011 0011
    sign | biased exponent | significand
SOLUTION (2)

Align the smaller number with the larger number by shifting it to the right [increasing its exponent]:

    1000 0010 | 1.10000110011001100110011   →   1000 0101 | 0.00110000110011001100110011
    exponent  | mantissa                          exponent | mantissa

Add the mantissas:

      1.00010001001100110011010
    + 0.00110000110011001100110011
      1.01000010000000000000000011

The extra bits are less than half of the LSB, so they are truncated. Store the result in IEEE single-precision format:

    0 | 1000 0101 | 010 0001 0000 0000 0000 0000
    sign | biased exponent | significand
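The whole exercise can be cross-checked in Python: rounding each operand to single precision, adding, and storing the sum back in single precision reproduces the slide's stored pattern (a verification I added, not part of the solution):

import struct

def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

a, b = to_single(68.3), to_single(12.2)
print(a, b)                            # 68.30000305175781 12.199999809265137 (neither operand is exact)
s = to_single(a + b)                   # add, then store the sum back in single precision
print(s, struct.pack('>f', s).hex())   # 80.5 42a10000  = 0 | 1000 0101 | 010 0001 0000 0000 0000 0000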
