# Numerical Methods Preliminaries

Colombia

March 9, 2017

## High Dimensional Signal Processing Group

www.hdspgroup.com
henarfu@uis.edu.co
LP 304

Outline

1 Introduction

2 Binary numbers

3 Error Analysis

## Professor PhD Henry Arguello Fuentes Numerical methods March 9, 2017 2 / 78

Introduction: numerical methods applications

(a) Model the probable evolution of a pathology.
(b) Model and simulate the growth of a tumor.

Contents

1 Introduction

2 Binary numbers
Base 2 numbers
Base 2 representation of the integer N
Sequences and Series
Binary Fractions
Binary shifting
Scientific Notation
Machine Numbers

3 Error Analysis

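Base 2 numbers

$$1563 = (1 \times 10^3) + (5 \times 10^2) + (6 \times 10^1) + (3 \times 10^0).$$

Let N denote a positive integer; then the digits $a_0, a_1, \ldots, a_k$ exist so that N has the base 10 expansion

Base 10 expansion
$$N = (a_k \times 10^k) + (a_{k-1} \times 10^{k-1}) + \cdots + (a_1 \times 10^1) + (a_0 \times 10^0), \tag{1}$$

where each digit $a_j$ is one of 0, 1, ..., 9.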

The same number can be written with powers of 2:

$$1563 = (1 \times 2^{10}) + (1 \times 2^9) + (0 \times 2^8) + (0 \times 2^7) + (0 \times 2^6) + (0 \times 2^5) + (1 \times 2^4) + (1 \times 2^3) + (0 \times 2^2) + (1 \times 2^1) + (1 \times 2^0).$$

So that:
$$1563_{10} = 11000011011_{\text{two}}.$$

Let N denote a positive integer; then the digits $b_0, b_1, \ldots, b_J$ exist so that N has the base 2 expansion

Base 2 expansion
$$N = (b_J \times 2^J) + (b_{J-1} \times 2^{J-1}) + \cdots + (b_1 \times 2^1) + (b_0 \times 2^0), \tag{2}$$

where each digit $b_j$ is either 0 or 1. Thus N is expressed in binary notation as
$$N = b_J b_{J-1} \cdots b_2 b_1 b_{0\,\text{two}}. \tag{3}$$

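Base 2 representation of the integer N
Process: Generate sequences $Q_k$ and $R_k$ of quotients and remainders, respectively, by repeatedly dividing by 2. End the process when $Q_k = 0$ for some integer $k = J$.
Example:

| $k$ | 1563 | $Q_k$ | $R_k$ |
|---|---|---|---|
| 0 | 1563/2 = | 781 | 1 |
| 1 | 781/2 = | 390 | 1 |
| 2 | 390/2 = | 195 | 0 |
| 3 | 195/2 = | 97 | 1 |
| 4 | 97/2 = | 48 | 1 |
| 5 | 48/2 = | 24 | 0 |
| 6 | 24/2 = | 12 | 0 |
| 7 | 12/2 = | 6 | 0 |
| 8 | 6/2 = | 3 | 0 |
| 9 | 3/2 = | 1 | 1 |
| 10 | 1/2 = | 0 | 1 |

Reading the remainders from last to first gives the binary digits:

| $b_{10}$ | $b_9$ | $b_8$ | $b_7$ | $b_6$ | $b_5$ | $b_4$ | $b_3$ | $b_2$ | $b_1$ | $b_0$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |

$b_{10}$ is the Most Significant Bit (MSB) and $b_0$ is the Least Significant Bit (LSB).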

Exercise 1: Find the base 2 representation of 697.
Start by dividing the integer N by 2 to calculate $Q_0$ and $R_0$:
697/2 = 348.5 → $Q_0 = 348$ and $R_0 = 1$.
Continue the process until $Q_k = 0$ for some integer $k = J$, where $Q_k$ is the integer part of $Q_{k-1}/2$ and $R_k$ is the corresponding remainder.

Solution

| $k$ | 697 | $Q_k$ | $R_k$ |
|---|---|---|---|
| 0 | 697/2 = | 348 | 1 |
| 1 | 348/2 = | 174 | 0 |
| 2 | 174/2 = | 87 | 0 |
| 3 | 87/2 = | 43 | 1 |
| 4 | 43/2 = | 21 | 1 |
| 5 | 21/2 = | 10 | 1 |
| 6 | 10/2 = | 5 | 0 |
| 7 | 5/2 = | 2 | 1 |
| 8 | 2/2 = | 1 | 0 |
| 9 | 1/2 = | 0 | 1 |

| $b_9$ | $b_8$ | $b_7$ | $b_6$ | $b_5$ | $b_4$ | $b_3$ | $b_2$ | $b_1$ | $b_0$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |

Then, $697_{10} = 1010111001_2$.
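The repeated-division process can be sketched in a few lines of Python. This is only an illustration; the function name `to_binary` is a hypothetical helper, not something defined in the slides.

```python
def to_binary(n):
    """Return the base 2 digits of a positive integer n as a string, MSB first."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    remainders = []
    q = n
    while q > 0:
        q, r = divmod(q, 2)    # q plays the role of Q_k, r the role of R_k
        remainders.append(r)   # the first remainder, R_0, is the LSB
    # Read the remainders from last to first: b_J ... b_1 b_0
    return "".join(str(bit) for bit in reversed(remainders))


print(to_binary(1563))
print(to_binary(697))
```

The two calls reproduce the worked example and Exercise 1.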

Sequences and Series

Commonly, when a rational number is expressed in decimal form, it requires infinitely many digits.

For example, in $\frac{1}{3} = 0.\overline{3}$, the symbol $\overline{3}$ means that the digit 3 is repeated forever to form an infinite repeating decimal.

The number $\frac{1}{3}$ is the shorthand notation for the infinite series S

$$S = \sum_{k=1}^{\infty} 3(10)^{-k} = \frac{1}{3}.$$

Definition 1.
The infinite series S

$$S = \sum_{n=0}^{\infty} c r^n = c + cr + cr^2 + \cdots + cr^n + \cdots, \tag{4}$$

where $c \neq 0$ and $r \neq 0$, is called a geometric series with ratio $r$.

Theorem 1. (Geometric Series)
The geometric series has the following properties:

$$\text{If } |r| < 1, \text{ then } \sum_{n=0}^{\infty} c r^n = \frac{c}{1-r}. \tag{5}$$

If $|r| \geq 1$, then the series diverges.

Example: The series S is given by

$$S = 7\left(\frac{1}{7}\right)^1 + 7\left(\frac{1}{7}\right)^2 + \cdots = \sum_{n=1}^{\infty} 7\left(\frac{1}{7}\right)^n,$$

which is equal to

$$-7 + \sum_{n=0}^{\infty} 7\left(\frac{1}{7}\right)^n,$$

and according to (5),

$$S = -7 + \frac{7}{1 - \frac{1}{7}} = \frac{7}{6} = 1.1\overline{6}.$$

Then $\frac{7}{6}$ is the shorthand notation for the infinite series S.
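As a quick numerical check of Theorem 1 on this example, the sketch below compares a partial sum with the closed form. It uses exact rational arithmetic from Python's `fractions` module; the helper name `geometric_partial_sum` and the choice of 30 terms are assumptions made for illustration.

```python
from fractions import Fraction


def geometric_partial_sum(c, r, n_terms, start=0):
    """Sum c * r**n for n = start, start + 1, ..., start + n_terms - 1."""
    return sum(c * r**n for n in range(start, start + n_terms))


c, r = Fraction(7), Fraction(1, 7)

# Closed form from (5), shifted so that the sum starts at n = 1.
closed_form = -c + c / (1 - r)

# A partial sum of the first 30 terms, starting at n = 1.
partial = geometric_partial_sum(c, r, 30, start=1)

print(closed_form)                    # exact value of the infinite series
print(float(closed_form - partial))   # remaining tail, which shrinks like (1/7)**30
```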

Binary Fractions

A binary fraction is a sum of negative powers of 2, which is used to express a real number R that lies in the range 0 < R < 1.

Binary fractions
$$R = (d_1 \times 2^{-1}) + (d_2 \times 2^{-2}) + \cdots + (d_n \times 2^{-n}) + \cdots, \tag{6}$$

where each digit $d_j$ is either 0 or 1.

Binary fraction representation of R
$$R = 0.d_1 d_2 \cdots d_n \cdots_{\text{two}}, \qquad R = \sum_{j=1}^{\infty} d_j (2)^{-j}.$$

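Binary Fractions-Decimal to binary
Process: Generate sequences $d_j$ and $F_j$ by repeatedly multiplying by two. At each step, the integer part of the product is the digit $d_j$ and the fractional part is carried to the next step.
Example:

| $j$ | $F_j$ (product) | $d_j$ | frac |
|---|---|---|---|
| 1 | (0.7)(2) = 1.4 | 1 | 0.4 |
| 2 | (0.4)(2) = 0.8 | 0 | 0.8 |
| 3 | (0.8)(2) = 1.6 | 1 | 0.6 |
| 4 | (0.6)(2) = 1.2 | 1 | 0.2 |
| 5 | (0.2)(2) = 0.4 | 0 | 0.4 |
| 6 | (0.4)(2) = 0.8 | 0 | 0.8 |
| 7 | (0.8)(2) = 1.6 | 1 | 0.6 |
| 8 | (0.6)(2) = 1.2 | 1 | 0.2 |
| 9 | (0.2)(2) = 0.4 | 0 | 0.4 |

| | $d_1$ | $d_2$ | $d_3$ | $d_4$ | $d_5$ | $d_6$ | $d_7$ | $d_8$ | $d_9$ | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0. | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | … |

$$0.7 = 0.1\overline{0110}_2$$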

Exercise 2: Calculate the binary fraction for 0.6.
Start by multiplying 0.6 by 2 to generate the sequences $d_j$ and $F_j$.

Solution

| $j$ | $F_j$ (product) | $d_j$ | frac |
|---|---|---|---|
| 1 | (0.6)(2) = 1.2 | 1 | 0.2 |
| 2 | (0.2)(2) = 0.4 | 0 | 0.4 |
| 3 | (0.4)(2) = 0.8 | 0 | 0.8 |
| 4 | (0.8)(2) = 1.6 | 1 | 0.6 |
| 5 | (0.6)(2) = 1.2 | 1 | 0.2 |
| 6 | (0.2)(2) = 0.4 | 0 | 0.4 |
| 7 | (0.4)(2) = 0.8 | 0 | 0.8 |
| 8 | (0.8)(2) = 1.6 | 1 | 0.6 |
| 9 | (0.6)(2) = 1.2 | 1 | 0.2 |

$$0.6 = 0.\overline{1001}_2$$
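The doubling process from the tables above can be sketched as follows. The function name `binary_fraction_digits` is a hypothetical helper, and exact rational arithmetic (`fractions.Fraction`) is assumed so that decimal inputs such as 0.7 are not distorted by the machine's own binary rounding.

```python
from fractions import Fraction


def binary_fraction_digits(r, n_digits):
    """Return the first n_digits binary digits d_1, d_2, ... of r, with 0 < r < 1."""
    frac = Fraction(r)             # accepts strings such as "0.7" exactly
    if not 0 < frac < 1:
        raise ValueError("r must lie strictly between 0 and 1")
    digits = []
    for _ in range(n_digits):
        doubled = 2 * frac         # the product column of the table
        d = int(doubled)           # integer part: the next digit, 0 or 1
        digits.append(d)
        frac = doubled - d         # fractional part carried to the next step
    return digits


print(binary_fraction_digits("0.7", 9))
print(binary_fraction_digits("0.6", 9))
```

Passing the number as a string keeps the input exactly equal to 7/10 or 6/10.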

Binary Fractions-Binary to decimal
The base 10 rational number $R_{10}$ associated with a base 2 binary fraction $R_2$ can be found using geometric series.
Example:

$$0.\overline{01}_2 = (0 \times 2^{-1}) + (1 \times 2^{-2}) + (0 \times 2^{-3}) + (1 \times 2^{-4}) + \cdots$$

The expression above is written as

$$\sum_{k=1}^{\infty} (2^{-2})^k = -1 + \sum_{k=0}^{\infty} (2^{-2})^k = -1 + \frac{1}{1 - \frac{1}{4}} = -1 + \frac{4}{3} = \frac{1}{3}.$$

Then $\frac{1}{3}$ is the base 10 rational number associated with $0.\overline{01}_2$.
Binary shifting

Let R be
$$R = 0.00000\overline{11000}_2. \tag{7}$$

Multiplying both sides of (7) by $2^5 = 32$ shifts the binary point 5 places to the right:
$$32R = 0.\overline{11000}_2.$$

Multiplying both sides of (7) by $2^{10} = 1024$ shifts the binary point 10 places to the right:
$$1024R = 11000.\overline{11000}_2.$$

Taking the difference,
$$1024R - 32R = 11000.\overline{11000}_2 - 0.\overline{11000}_2,$$
we obtain $992R = 11000_2$. Given that $11000_2 = 24_{10}$, we find that $992R = 24$. Therefore $R = \frac{3}{124}$ in base 10.
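The shifting argument generalizes to any repeating binary fraction. The sketch below is one possible way to code it; the function `repeating_binary_to_fraction` and its `prefix`/`block` arguments are hypothetical names introduced for illustration.

```python
from fractions import Fraction


def repeating_binary_to_fraction(prefix, block):
    """Exact value of 0.<prefix><block><block>..._two.

    prefix: non-repeating binary digits after the point (may be empty)
    block:  the repeating binary digits (must be non-empty)
    """
    m, n = len(prefix), len(block)
    # Shifting by m + n places and by m places, then subtracting,
    # cancels the repeating tail:
    #   (2**(m + n) - 2**m) * R = int(prefix + block) - int(prefix)
    high = int(prefix + block, 2)
    low = int(prefix, 2) if prefix else 0
    return Fraction(high - low, 2 ** (m + n) - 2**m)


print(repeating_binary_to_fraction("00000", "11000"))  # R from equation (7)
print(repeating_binary_to_fraction("", "01"))          # the geometric-series example
```

With `prefix="00000"` and `block="11000"` the two shifts are exactly the factors 32 and 1024 used above.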

Scientific Notation

Scientific notation is a standard way to present a real number. It is obtained by shifting the decimal point and supplying the appropriate power of 10.
Examples
$0.0000747 = 7.47 \times 10^{-5}$
$31.4159265 = 3.14159265 \times 10$
$9{,}700{,}000{,}000 = 9.7 \times 10^{9}$
Avogadro's constant, used in chemistry: $6.02252 \times 10^{23}$.
The quantity $1\text{K} = 1.024 \times 10^{3}$, used in computer science.

Machine Numbers

A mathematical quantity x is stored in a computer as a binary approximation given by
$$x \approx \pm q \times 2^n. \tag{8}$$

The finite binary number q is the mantissa, where $1/2 \leq q < 1$.
The integer n is the exponent.

Floating-point format

A floating-point format stores a number by expressing:
The sign
The exponent
The mantissa

Field layout: Sign | Exponent | Mantissa

The sign is always one bit, where S = 0 if x > 0 and S = 1 if x < 0.
The number of bits for the exponent and the mantissa depends on the precision of the machine.

Floating-point format-IEEE 754 standard

| Precision | Total | Sign | Exponent | Mantissa | Exponent bias |
|---|---|---|---|---|---|
| Single | 32 bits | 1 bit | 8 bits | 23 bits | 127 |
| Double | 64 bits | 1 bit | 11 bits | 52 bits | 1023 |

Note: Biasing is done because exponents have to be signed values to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder.
Instead, the exponent is biased: a fixed offset is added so that the stored value is always nonnegative.
The exponent bias is calculated as $\text{bias} = 2^{\text{exp}-1} - 1$, where exp indicates the number of bits for the exponent.
Example:
if exp = 15 bits, then $\text{bias} = 2^{15-1} - 1 = 16383$.
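The bias formula is easy to check against the table. In this small sketch, `exponent_bias` is a hypothetical helper name.

```python
def exponent_bias(exp_bits):
    """Bias for an exponent field that is exp_bits wide: 2**(exp_bits - 1) - 1."""
    if exp_bits < 1:
        raise ValueError("the exponent field needs at least one bit")
    return 2 ** (exp_bits - 1) - 1


for bits in (8, 11, 15):
    print(bits, exponent_bias(bits))
```

The widths 8 and 11 correspond to the single and double precision rows, and 15 is the width used in the example above.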

Possible cases:

| Sign (S) | Exponent (E) | Mantissa (M) | Value |
|---|---|---|---|
| 0 or 1 | all 0s < E < all 1s | M | $(-1)^S (1.M)(2^{E-\text{bias}})$ |
| 0 | E = all 1s | M = 0 | $+\infty$ |
| 1 | E = all 1s | M = 0 | $-\infty$ |
| 0 or 1 | E = all 1s | M ≠ 0 | NaN |
| 0 or 1 | E = all 0s | M = 0 | 0 |
| 0 or 1 | E = all 0s | M ≠ 0 | $(-1)^S (0.M)(2^{1-\text{bias}})$ |

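Floating-point format
Example: Determine the floating-point format used to store the number $59.1875_{10}$ in a computer with 32 bits of precision.

1. Convert the integer part and the decimal part to binary

Integer part, $59_{10}$ (repeated division by 2):

| Division | Quotient | Remainder |
|---|---|---|
| 59/2 | 29 | 1 (LSB) |
| 29/2 | 14 | 1 |
| 14/2 | 7 | 0 |
| 7/2 | 3 | 1 |
| 3/2 | 1 | 1 |
| 1/2 | 0 | 1 (MSB) |

Decimal part, $0.1875_{10}$ (repeated multiplication by 2):

| Product | Digit |
|---|---|
| 0.1875 × 2 = 0.375 | 0 (MSB) |
| 0.375 × 2 = 0.75 | 0 |
| 0.75 × 2 = 1.5 | 1 |
| 0.5 × 2 = 1.0 | 1 (LSB) |

$$59.1875_{10} = 111011.0011_2$$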

2. Do the proper binary shifting
$$111011.0011_2 = 1.110110011_2 \times 2^5$$

3. Calculate the bias
$$\text{bias} = 2^{8-1} - 1 = 127$$

4. Determine the mantissa
$$\text{Mantissa} = 110110011_2$$

5. Determine the exponent
$$\text{exp} = 5 + \text{bias} = 5 + 127 = 132_{10} = 10000100_2$$

| S | E | M |
|---|---|---|
| 0 | 10000100 | 11011001100000000000000 |
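Example: Determine the floating-point format used to store the number $132.28125_{10}$ in a computer with 32 bits of precision.

1. Convert the integer part and the decimal part to binary

Integer part, $132_{10}$ (repeated division by 2):

| Division | Quotient | Remainder |
|---|---|---|
| 132/2 | 66 | 0 (LSB) |
| 66/2 | 33 | 0 |
| 33/2 | 16 | 1 |
| 16/2 | 8 | 0 |
| 8/2 | 4 | 0 |
| 4/2 | 2 | 0 |
| 2/2 | 1 | 0 |
| 1/2 | 0 | 1 (MSB) |

Decimal part, $0.28125_{10}$ (repeated multiplication by 2):

| Product | Digit |
|---|---|
| 0.28125 × 2 = 0.5625 | 0 (MSB) |
| 0.5625 × 2 = 1.125 | 1 |
| 0.125 × 2 = 0.25 | 0 |
| 0.25 × 2 = 0.5 | 0 |
| 0.5 × 2 = 1.0 | 1 (LSB) |

$$132.28125_{10} = 10000100.01001_2$$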

2. Do the proper binary shifting
$$10000100.01001_2 = 1.000010001001_2 \times 2^7$$

3. Calculate the bias
$$\text{bias} = 2^{8-1} - 1 = 127$$

4. Determine the mantissa
$$\text{Mantissa} = 000010001001_2$$

5. Determine the exponent
$$\text{exp} = 7 + \text{bias} = 7 + 127 = 134_{10} = 10000110_2$$

| S | E | M |
|---|---|---|
| 0 | 10000110 | 00001000100100000000000 |
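Both hand computations can be cross-checked against the machine's own single-precision encoding. The sketch below assumes Python's standard `struct` module, whose `">f"` format packs a value as a big-endian IEEE 754 binary32; the helper name `float32_fields` is hypothetical.

```python
import struct


def float32_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x stored as binary32."""
    # Pack as a 4-byte big-endian float, then reinterpret the bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30..23 (biased exponent)
    mantissa = bits & 0x7FFFFF       # bits 22..0
    return format(sign, "b"), format(exponent, "08b"), format(mantissa, "023b")


print(float32_fields(59.1875))
print(float32_fields(132.28125))
```

Both numbers are exactly representable in 32 bits, so no rounding takes place when they are packed.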
Floating-point format

The real value associated with a given 32-bit binary word is calculated as

$$\text{value} = (-1)^S \left(1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i}\right) \times 2^{(E-127)}$$

Where,
S = The sign
E = Exponent
127 = Bias
$d_j$ = Bits of the mantissa

Exercise: Find the real value for the binary data:

| S | E | M |
|---|---|---|
| 0 | 01010010 | 01101000000100100000000 |

$$\text{value} = (-1)^S \left(1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i}\right) \times 2^{(E-127)}$$

In this example:
$S = 0$
$1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i} = 1 + 2^{-2} + 2^{-3} + 2^{-5} + 2^{-12} + 2^{-15} = 1.4065246582$
$2^{(E-127)} = 2^{((2^1 + 2^4 + 2^6) - 127)} = 2^{82-127} = 2^{-45}$
Thus
$$\text{value} = 1.4065246582 \times 2^{-45}$$
Example: Find the real value for the binary data:

| | S | E | M |
|---|---|---|---|
| Bits | 1 | 10000100 | 01000000000000000000000 |
| Bit positions | 31 | 30 to 23 | 22 to 0 |

In this example:
$S = 1$
$1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i} = 1 + 2^{-2} = 1.25$
$2^{(E-127)} = 2^{(132-127)} = 2^5$
Thus
$$\text{value} = (-1)^1 \times 1.25 \times 2^5 = -40.$$
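The decoding formula can be written out directly. This sketch handles only the normalized case (exponent field neither all 0s nor all 1s); `float32_value` is a hypothetical helper that takes the three fields as bit strings, with the leftmost mantissa character playing the role of $d_{22}$.

```python
def float32_value(sign, exponent, mantissa, bias=127):
    """Value of a normalized binary32 number given its fields as bit strings."""
    s = int(sign, 2)
    e = int(exponent, 2)
    if e == 0 or e == 2 ** len(exponent) - 1:
        raise ValueError("only the normalized case is handled in this sketch")
    # The i-th mantissa bit from the left carries the weight 2**(-i).
    fraction = sum(int(bit) * 2.0 ** -i for i, bit in enumerate(mantissa, start=1))
    return (-1) ** s * (1 + fraction) * 2.0 ** (e - bias)


print(float32_value("1", "10000100", "01000000000000000000000"))
print(float32_value("0", "01010010", "01101000000100100000000"))
```

The first call corresponds to the example above and the second to the preceding exercise.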

Contents

1 Introduction

2 Binary numbers

3 Error Analysis
Absolute and relative error
Truncation Error
Round-off Error
Loss of Significance
Order of Approximation
Propagation of Error

Absolute and relative error

Definition 2.
Suppose that $\hat{p}$ is an approximation to $p$. The absolute error is $E_p = |p - \hat{p}|$, and the relative error is $R_p = |p - \hat{p}|/|p|$, provided that $p \neq 0$.

The absolute error is the magnitude of the difference between the true value and the approximate value.
The relative error expresses the error as a fraction of the true value (a percentage when multiplied by 100).

Example: Find the absolute and relative error in the following three cases:

| | Case x | Case y | Case z |
|---|---|---|---|
| True value $p$ | $x = 3.141592$ | $y = 1{,}000{,}000$ | $z = 0.000012$ |
| Approximation $\hat{p}$ | $\hat{x} = 3.14$ | $\hat{y} = 999{,}996$ | $\hat{z} = 0.000009$ |
| Absolute error $E_p$ | $E_x = \lvert x - \hat{x} \rvert = 0.001592$ | $E_y = \lvert y - \hat{y} \rvert = 4$ | $E_z = \lvert z - \hat{z} \rvert = 0.000003$ |
| Relative error $R_p$ | $R_x = E_x / \lvert x \rvert = 5.067 \times 10^{-4}$ | $R_y = E_y / \lvert y \rvert = 0.000004$ | $R_z = E_z / \lvert z \rvert = 0.25$ |

Observe that as $|p|$ moves away from 1 (greater than or less than), the relative error $R_p$ is a better indicator than $E_p$ of the accuracy of the approximation.
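The two definitions translate directly into code. The function names below are hypothetical, and ordinary double-precision floats are assumed, so the printed values carry their own small round-off.

```python
def absolute_error(p, p_hat):
    """E_p = |p - p_hat|."""
    return abs(p - p_hat)


def relative_error(p, p_hat):
    """R_p = |p - p_hat| / |p|, defined only for p != 0."""
    if p == 0:
        raise ValueError("relative error is undefined for p = 0")
    return abs(p - p_hat) / abs(p)


cases = {
    "x": (3.141592, 3.14),
    "y": (1_000_000, 999_996),
    "z": (0.000012, 0.000009),
}
for name, (p, p_hat) in cases.items():
    print(name, absolute_error(p, p_hat), relative_error(p, p_hat))
```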

Definition 3.
The number $\hat{p}$ is said to approximate $p$ to $d$ significant digits if $d$ is the largest nonnegative integer for which

$$\frac{|p - \hat{p}|}{|p|} < \frac{10^{1-d}}{2}.$$

Example:
Let $\hat{w} = 2.16$ be the approximation for $w = 2.1645$; then

$$\frac{|2.1645 - 2.16|}{|2.1645|} = 2.07900 \times 10^{-3}$$

if $d = 0$: $2.07900 \times 10^{-3} < \frac{10^{1-0}}{2} = 5$ ✓ satisfies. However, as we need to find the largest integer $d$, we need to continue.
if $d = 1$: $2.07900 \times 10^{-3} < \frac{10^{1-1}}{2} = 0.5$ ✓ satisfies
if $d = 2$: $2.07900 \times 10^{-3} < \frac{10^{1-2}}{2} = 0.05$ ✓ satisfies
if $d = 3$: $2.07900 \times 10^{-3} < \frac{10^{1-3}}{2} = 0.005$ ✓ satisfies
if $d = 4$: $2.07900 \times 10^{-3} < \frac{10^{1-4}}{2} = 0.0005$ ✗ does not satisfy
Then, $\hat{w}$ approximates $w$ to 3 significant digits.

Other examples:

If $x = 3.141592$ and $\hat{x} = 3.14$, then $|x - \hat{x}|/|x| = 0.000507 < 10^{-2}/2$. Therefore, $\hat{x}$ approximates $x$ to three significant digits.

If $y = 1{,}000{,}000$ and $\hat{y} = 999{,}996$, then $|y - \hat{y}|/|y| = 0.000004 < 10^{-5}/2$. Therefore, $\hat{y}$ approximates $y$ to six significant digits.

If $z = 0.000012$ and $\hat{z} = 0.000009$, then $|z - \hat{z}|/|z| = 0.25 < 10^{-0}/2$. Therefore, $\hat{z}$ approximates $z$ to one significant digit.
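The search for the largest $d$ in Definition 3 can be automated. In this sketch, `significant_digits` is a hypothetical helper, and the cap `max_d=16` is an assumption chosen to roughly match double precision.

```python
def significant_digits(p, p_hat, max_d=16):
    """Largest d in 0..max_d with |p - p_hat| / |p| < 10**(1 - d) / 2.

    Returns None when even d = 0 fails (relative error of 5 or more).
    """
    if p == 0:
        raise ValueError("the relative error is undefined for p = 0")
    rel = abs(p - p_hat) / abs(p)
    best = None
    for d in range(max_d + 1):
        if rel < 10.0 ** (1 - d) / 2:
            best = d     # the bound still holds, keep going
        else:
            break        # the bound shrinks with d, so later values fail too
    return best


print(significant_digits(2.1645, 2.16))
print(significant_digits(3.141592, 3.14))
print(significant_digits(1_000_000, 999_996))
print(significant_digits(0.000012, 0.000009))
```

When `p_hat` equals `p` exactly, the loop never breaks and the function returns `max_d`.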

Truncation Error

Truncation error refers to errors introduced when a more complicated mathematical expression is "replaced" with a more elementary formula.

For example, the infinite Taylor series

$$e^{x^2} = 1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!} + \cdots + \frac{x^{2n}}{n!} + \cdots$$

might be replaced with just the first five terms $1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}$.
Then a truncation error appears.

Example: Given $p = \int_0^{1/2} e^{x^2}\,dx = 0.544987104184$, determine the accuracy of the approximation obtained by replacing the integrand $f(x) = e^{x^2}$ with the truncated Taylor series $P_8(x) = 1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}$.

Determine $\int_0^{1/2} P_8(x)\,dx$:

$$\int_0^{1/2} \left(1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}\right) dx = \left[x + \frac{x^3}{3} + \frac{x^5}{5(2!)} + \frac{x^7}{7(3!)} + \frac{x^9}{9(4!)}\right]_{x=0}^{x=1/2}$$

$$= \frac{1}{2} + \frac{1}{24} + \frac{1}{320} + \frac{1}{5376} + \frac{1}{110592} = \frac{2109491}{3870720} = 0.544986720817 = \hat{p}$$

Since

$$\frac{|p - \hat{p}|}{|p|} = 7.03442 \times 10^{-7} < \frac{10^{1-6}}{2} = 5 \times 10^{-6},$$

the approximation $\hat{p}$ agrees with the true value to 6 significant digits.
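The term-by-term integration can be reproduced exactly with rational arithmetic. This is a sketch under stated assumptions: `integral_of_truncated_series` is a hypothetical helper, and the reference value `p` is simply the 12-digit figure quoted in the example, not something computed here.

```python
from fractions import Fraction
from math import factorial


def integral_of_truncated_series(upper, n_terms):
    """Integrate sum_{n=0}^{n_terms-1} x**(2n) / n! from 0 to upper, exactly."""
    upper = Fraction(upper)
    # The antiderivative of x**(2n) / n! is x**(2n + 1) / ((2n + 1) * n!).
    return sum(upper ** (2 * n + 1) / ((2 * n + 1) * factorial(n))
               for n in range(n_terms))


p_hat = integral_of_truncated_series(Fraction(1, 2), 5)   # five terms, as in P_8
p = 0.544987104184                                        # value quoted in the example
rel = abs(p - float(p_hat)) / abs(p)

print(p_hat)          # exact rational value of the approximation
print(float(p_hat))
print(rel)            # relative truncation error
```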
Round-off Error

The accuracy of the representation of a real number stored in a computer is determined by the precision of the mantissa.

When a true value cannot be stored exactly, the difference is called round-off error.

The actual number that is stored in the computer may undergo chopping or rounding of the last digit.

Because the computer hardware works with a limited number of digits in machine numbers, errors are introduced and propagated in successive computations.

Chopping Off versus Rounding Off

Example:
Consider $p$ expressed in normalized decimal form:

$$p = \pm 0.d_1 d_2 d_3 \cdots d_k d_{k+1} \cdots \times 10^n.$$

If $k$ is the maximum number of decimal digits, then the real number $p$ is represented by $fl_{chop}(p)$, which is given by

$$fl_{chop}(p) = \pm 0.d_1 d_2 d_3 \cdots d_k \times 10^n,$$

where $1 \leq d_1 \leq 9$ and $0 \leq d_j \leq 9$ for $1 < j \leq k$. The number $fl_{chop}(p)$ is called the chopped floating-point representation of $p$.

On the other hand, the rounded floating-point representation $fl_{round}(p)$ is given by

$$fl_{round}(p) = \pm 0.d_1 d_2 d_3 \cdots d_{k-1} r_k \times 10^n,$$

where $1 \leq d_1 \leq 9$ and $0 \leq d_j \leq 9$ for $1 < j < k$, and the last digit, $r_k$, is obtained by rounding the number $d_k.d_{k+1} d_{k+2} \cdots$ to the nearest integer.

Example:
The real number $p = \frac{22}{7} = 3.142857142857142857\ldots$ has the following six-digit representations:

$$fl_{chop}(p) = 0.314285 \times 10^1,$$
$$fl_{round}(p) = 0.314286 \times 10^1.$$

For common purposes the chopping and rounding would be written as 3.14285 and 3.14286, respectively.
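Chopping and rounding to $k$ significant decimal digits can be imitated with Python's standard `decimal` module. In this sketch, the helper name `fl` is hypothetical, `ROUND_DOWN` stands in for chopping, and `ROUND_HALF_UP` stands in for rounding to nearest.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP


def fl(p, k, rounding):
    """Keep k significant decimal digits of p under the given rounding mode."""
    # Unlike the Decimal constructor, create_decimal applies the context's
    # precision and rounding mode to the value.
    return Context(prec=k, rounding=rounding).create_decimal(p)


p = Decimal(22) / Decimal(7)        # 22/7 to 28 digits (the default precision)
chopped = fl(p, 6, ROUND_DOWN)
rounded = fl(p, 6, ROUND_HALF_UP)

print(chopped)
print(rounded)
```

The results are printed in ordinary notation rather than in the normalized $0.d_1 d_2 \cdots \times 10^n$ form.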

Loss of Significance

Consider $p = 3.1415926536$ and $q = 3.1415957341$, which are nearly equal and both carry 11 decimal digits of precision.

Their difference is formed: $p - q = -0.0000030805$. Since the first six digits of $p$ and $q$ are the same, their difference $p - q$ contains only five decimal digits of precision.

Example:
Compare the results of calculating $f(500)$ and $g(500)$ using six digits and rounding, where

$$f(x) = x\left(\sqrt{x+1} - \sqrt{x}\right) \quad\text{and}\quad g(x) = \frac{x}{\sqrt{x+1} + \sqrt{x}}.$$

For the first function,

$$f(500) = 500\left(\sqrt{501} - \sqrt{500}\right) = 500(22.3830 - 22.3607) = 500(0.0223) = 11.1500.$$

For $g(x)$,

$$g(500) = \frac{500}{\sqrt{501} + \sqrt{500}} = \frac{500}{22.3830 + 22.3607} = \frac{500}{44.7437} = 11.1748.$$

The second function, $g(x)$, is algebraically equivalent to $f(x)$, but the answer $g(500) = 11.1748$ involves less error and is the same as that obtained by rounding the true answer $11.174755300747198\ldots$ to six digits.
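
The hand computation can be reproduced by rounding every intermediate result to six significant digits. The sketch below does this with Python's `decimal` module. The helper `evaluate` is illustrative, and the context's default round-half-even mode is assumed to match the rounding used on the slide.

```python
from decimal import Decimal, localcontext

def f(x):
    # Subtracts two nearly equal square roots: prone to cancellation.
    return x * ((x + 1).sqrt() - x.sqrt())

def g(x):
    # Algebraically equivalent form that avoids the subtraction.
    return x / ((x + 1).sqrt() + x.sqrt())

def evaluate(func, x, digits):
    """Evaluate func(x) with each operation rounded to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return func(Decimal(x))

f6 = evaluate(f, 500, 6)
g6 = evaluate(g, 500, 6)
print(f6)  # 11.1500
print(g6)  # 11.1748
```

Only three significant digits survive the subtraction inside `f`, which is why the final product is accurate to roughly three digits as well.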

Example: Compare the results of calculating $f(0.01)$ and $P(0.01)$ using six digits and rounding, where

$$f(x) = \frac{e^x - 1 - x}{x^2} \quad\text{and}\quad P(x) = \frac{1}{2} + \frac{x}{6} + \frac{x^2}{24}.$$

The function $P(x)$ is the Taylor polynomial of degree $n = 2$ for $f(x)$ expanded about $x = 0$.

For the first function,

$$f(0.01) = \frac{e^{0.01} - 1 - 0.01}{(0.01)^2} = \frac{1.010050 - 1 - 0.01}{0.0001} = 0.5.$$

For the second function,

$$P(0.01) = \frac{1}{2} + \frac{0.01}{6} + \frac{0.0001}{24} = 0.5 + 0.001667 + 0.000004 = 0.501671.$$

The answer $P(0.01) = 0.501671$ contains less error and is the same as that obtained by rounding the true answer $0.5016708416805\ldots$ to six digits.
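
A similar six-digit sketch for this example, again using the `decimal` module. The function names follow the slide, while `evaluate` is a hypothetical helper and round-half-even is assumed.

```python
from decimal import Decimal, localcontext

def f(x):
    # e^x - 1 - x cancels almost all significant digits for small x.
    return (x.exp() - 1 - x) / (x * x)

def P(x):
    # Degree-2 Taylor polynomial of f about x = 0.
    return Decimal(1) / 2 + x / 6 + x * x / 24

def evaluate(func, x, digits):
    """Evaluate func(x) with each operation rounded to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return func(Decimal(x))

f6 = evaluate(f, "0.01", 6)
P6 = evaluate(P, "0.01", 6)
print(f6)  # 0.5
print(P6)  # 0.501671
```

The argument is passed as the string `"0.01"` so that it is represented exactly in decimal.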

$O(h^n)$ Order of Approximation
For functions
Definition 4.
The function $f(h)$ is said to be big Oh of $g(h)$, denoted $f(h) = O(g(h))$, if there exist constants $C$ and $c$ such that

$$|f(h)| \le C|g(h)| \quad\text{whenever } h \le c.$$

The big Oh notation provides a useful way of describing the rate of growth of a function in terms of well-known elementary functions ($x^n$, $x^{1/n}$, $a^x$, $\log_a(x)$, etc.).

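For sequences
Definition 5.
Let $\{x_n\}_{n=1}^{\infty}$ and $\{y_n\}_{n=1}^{\infty}$ be two sequences. The sequence $\{x_n\}$ is said to be of order big Oh of $\{y_n\}$, denoted $x_n = O(y_n)$, if there exist constants $C$ and $N$ such that

$$|x_n| \le C|y_n| \quad\text{whenever } n \ge N. \tag{12}$$

Example:

$$\frac{n^2 - 1}{n^3} = O\!\left(\frac{1}{n}\right), \quad\text{since}\quad \frac{n^2 - 1}{n^3} \le \frac{n^2}{n^3} = \frac{1}{n} \quad\text{whenever } n \ge 1.$$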

Definition 6.
Assume that $f(h)$ is approximated by the function $p(h)$ and that there exist a real constant $M > 0$ and a positive integer $n$ so that

$$\frac{|f(h) - p(h)|}{|h^n|} \le M \quad\text{for sufficiently small } h. \tag{13}$$

We say that $p(h)$ approximates $f(h)$ with order of approximation $O(h^n)$ and write

$$f(h) = p(h) + O(h^n). \tag{14}$$

When relation (13) is rewritten in the form $|f(h) - p(h)| \le M|h^n|$, we see that the notation $O(h^n)$ stands in place of the error bound $M|h^n|$.

Theorem 2 (Order of approximation for basic operations).
Assume that $f(h) = p(h) + O(h^n)$, $g(h) = q(h) + O(h^m)$, and $r = \min(m, n)$. Then

$$f(h) + g(h) = p(h) + q(h) + O(h^r), \tag{15}$$

$$f(h)g(h) = p(h)q(h) + O(h^r), \tag{16}$$

and

$$\frac{f(h)}{g(h)} = \frac{p(h)}{q(h)} + O(h^r) \quad\text{provided that } g(h) \ne 0 \text{ and } q(h) \ne 0. \tag{17}$$

Theorem 3 (Taylor's Theorem).
Assume $f \in C^{n+1}[a, b]$. If both $x_0$ and $x = x_0 + h$ lie in $[a, b]$, then

$$f(x_0 + h) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} h^k + O(h^{n+1}). \tag{18}$$

Order terms combine according to the following properties:
(i) $O(h^p) + O(h^p) = O(h^p)$,
(ii) $O(h^p) + O(h^q) = O(h^r)$, where $r = \min(p, q)$, and
(iii) $O(h^p)O(h^q) = O(h^s)$, where $s = p + q$.

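Example:
Consider the Taylor polynomial expansions

$$e^h = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4) \quad\text{and}\quad \cos(h) = 1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6).$$

Determine the order of approximation for their sum and product.

For the sum we have

$$\begin{aligned}
e^h + \cos(h) &= 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4) + 1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6) \\
&= 2 + h + \frac{h^3}{3!} + O(h^4) + \frac{h^4}{4!} + O(h^6).
\end{aligned}$$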

Since $O(h^4) + \frac{h^4}{4!} = O(h^4)$ and $O(h^4) + O(h^6) = O(h^4)$, this reduces to

$$e^h + \cos(h) = 2 + h + \frac{h^3}{3!} + O(h^4),$$

and the order of approximation is $O(h^4)$.

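The product is treated similarly:

$$\begin{aligned}
e^h \cos(h) &= \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4)\right)\left(1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6)\right) \\
&= \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!}\right)\left(1 - \frac{h^2}{2!} + \frac{h^4}{4!}\right) + \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!}\right)O(h^6) \\
&\quad + \left(1 - \frac{h^2}{2!} + \frac{h^4}{4!}\right)O(h^4) + O(h^4)O(h^6) \\
&= 1 + h - \frac{h^3}{3} - \frac{5h^4}{24} - \frac{h^5}{24} + \frac{h^6}{48} + \frac{h^7}{144} + O(h^6) + O(h^4) + O(h^4)O(h^6).
\end{aligned}$$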

Since $O(h^4)O(h^6) = O(h^{10})$ and

$$-\frac{5h^4}{24} - \frac{h^5}{24} + \frac{h^6}{48} + \frac{h^7}{144} + O(h^6) + O(h^4) + O(h^{10}) = O(h^4),$$

the preceding equation is simplified to yield

$$e^h \cos(h) = 1 + h - \frac{h^3}{3} + O(h^4),$$

and the order of approximation is $O(h^4)$.
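
A quick numerical sanity check of both results, in the spirit of relation (13): if the order of approximation is $O(h^4)$, the ratio $|f(h) - p(h)| / h^4$ should stay bounded as $h$ shrinks. The step sizes below are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def order_ratio(f, p, h, n):
    """Ratio |f(h) - p(h)| / |h|^n from the definition of order of approximation."""
    return abs(f(h) - p(h)) / abs(h) ** n

def sum_exact(h):
    return math.exp(h) + math.cos(h)

def sum_approx(h):
    return 2 + h + h**3 / 6

def prod_exact(h):
    return math.exp(h) * math.cos(h)

def prod_approx(h):
    return 1 + h - h**3 / 3

steps = [0.2, 0.1, 0.05, 0.025]  # illustrative step sizes
sum_ratios = [order_ratio(sum_exact, sum_approx, h, 4) for h in steps]
prod_ratios = [order_ratio(prod_exact, prod_approx, h, 4) for h in steps]

for h, s, p in zip(steps, sum_ratios, prod_ratios):
    print(f"h = {h:<6} sum ratio = {s:.4f}   product ratio = {p:.4f}")
```

The step sizes are kept moderate on purpose: for very small $h$ the difference $f(h) - p(h)$ itself suffers from loss of significance in double precision.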

Order of Convergence of a Sequence

Convergence of a sequence
Definition 7.
Suppose that $\lim_{n \to \infty} x_n = x$ and $\{r_n\}_{n=1}^{\infty}$ is a sequence with $\lim_{n \to \infty} r_n = 0$. We say that $\{x_n\}_{n=1}^{\infty}$ converges to $x$ with the order of convergence $O(r_n)$ if there exists a constant $K \ge 0$ such that

$$\frac{|x_n - x|}{|r_n|} \le K \quad\text{for } n \text{ sufficiently large.} \tag{19}$$

This is indicated by writing $x_n = x + O(r_n)$, or $x_n \to x$ with order of convergence $O(r_n)$.

Example:
Let $x_n = \cos(n)/n^2$ and $r_n = 1/n^2$; then

$$\lim_{n \to \infty} x_n = 0$$

with a rate of convergence $O(1/n^2)$. This follows immediately from the relation

$$\frac{|\cos(n)/n^2|}{|1/n^2|} = |\cos(n)| \le 1 \quad\text{for all } n.$$
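
The bound in (19) can be spot-checked numerically for this example. The sketch below evaluates the ratio $|x_n - x| / |r_n|$ with $x = 0$ over an arbitrary range of $n$; the range and the function name are illustrative assumptions.

```python
import math

def convergence_ratio(n):
    """|x_n - x| / |r_n| for x_n = cos(n)/n^2, x = 0, r_n = 1/n^2."""
    x_n = math.cos(n) / n**2
    r_n = 1.0 / n**2
    return abs(x_n - 0.0) / abs(r_n)

ratios = [convergence_ratio(n) for n in range(1, 1001)]
print(max(ratios) <= 1.0)  # the constant K = 1 works on this range
```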

Propagation of Error

Addition. Consider two numbers $p$ and $q$ (the true values) with the approximate values $\hat{p}$ and $\hat{q}$, which contain errors $\epsilon_p$ and $\epsilon_q$, respectively. Starting with $p = \hat{p} + \epsilon_p$ and $q = \hat{q} + \epsilon_q$, the sum is

$$p + q = (\hat{p} + \epsilon_p) + (\hat{q} + \epsilon_q) = (\hat{p} + \hat{q}) + (\epsilon_p + \epsilon_q). \tag{20}$$

Hence, for addition, the error in the sum is the sum of the errors: $\epsilon_s = \epsilon_p + \epsilon_q$.

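The propagation of error in multiplication is more complicated. The product is

$$pq = (\hat{p} + \epsilon_p)(\hat{q} + \epsilon_q) = \hat{p}\hat{q} + \hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q. \tag{21}$$

Hence, if $\hat{p}$ and $\hat{q}$ are larger than 1 in absolute value, the terms $\hat{p}\epsilon_q$ and $\hat{q}\epsilon_p$ show that there is a possibility of magnification of the original errors $\epsilon_p$ and $\epsilon_q$. Insights are gained if we look at the relative error. Rearrange the terms in (21) to get

$$pq - \hat{p}\hat{q} = \hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q. \tag{22}$$

Suppose that $p \ne 0$ and $q \ne 0$; then we can divide (22) by $pq$ to obtain the relative error in the product $pq$:

$$R_{pq} = \frac{pq - \hat{p}\hat{q}}{pq} = \frac{\hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q}{pq} = \frac{\hat{p}\epsilon_q}{pq} + \frac{\hat{q}\epsilon_p}{pq} + \frac{\epsilon_p\epsilon_q}{pq}. \tag{23}$$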

Furthermore, suppose that $\hat{p}$ and $\hat{q}$ are good approximations for $p$ and $q$; then $\hat{p}/p \approx 1$, $\hat{q}/q \approx 1$, and $R_p R_q = (\epsilon_p/p)(\epsilon_q/q) \approx 0$ ($R_p$ and $R_q$ are the relative errors in the approximations $\hat{p}$ and $\hat{q}$). Then making these substitutions into (23) yields the simplified relationship

$$R_{pq} = \frac{pq - \hat{p}\hat{q}}{pq} \approx \frac{\epsilon_q}{q} + \frac{\epsilon_p}{p} + 0 = R_q + R_p. \tag{24}$$

This shows that the relative error in the product $pq$ is approximately the sum of the relative errors in the approximations $\hat{p}$ and $\hat{q}$.
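
Relation (24) can be checked with exact rational arithmetic, so that the only errors present are the ones deliberately introduced. The values of $p$, $q$ and their approximations below are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def relative_error(true_value, approx):
    """Relative error (true - approx) / true; the true value must be nonzero."""
    return (true_value - approx) / true_value

# Hypothetical true values and approximations.
p, p_hat = Fraction(1, 3), Fraction("0.33332")
q, q_hat = Fraction(2, 3), Fraction("0.66666")

R_p = relative_error(p, p_hat)
R_q = relative_error(q, q_hat)
R_pq = relative_error(p * q, p_hat * q_hat)

print(float(R_p), float(R_q), float(R_pq))
# The gap between R_pq and R_p + R_q is exactly the neglected product R_p * R_q.
print(float(R_pq - (R_p + R_q)))
```

With these definitions $R_{pq} = R_p + R_q - R_p R_q$ holds exactly, which is why dropping the second-order term is harmless when both relative errors are small.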

A quality that is desirable for any numerical process is that a small error
in the initial conditions will produce small changes in the final result.
An algorithm with this feature is called stable; otherwise, it is called
unstable.

Definition 8.
Suppose that $\epsilon$ represents an initial error and $\epsilon(n)$ represents the growth of the error after $n$ steps. If $|\epsilon(n)| \approx n\epsilon$, the growth of error is said to be linear. If $|\epsilon(n)| \approx K^n \epsilon$, the growth of error is called exponential. If $K > 1$, the exponential error grows without bound as $n \to \infty$, and if $0 < K < 1$, the exponential error diminishes to zero as $n \to \infty$.

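Example: Show that the following three schemes can be used with infinite-precision (exact) arithmetic to recursively generate the terms in the sequence $\{1/3^n\}_{n=0}^{\infty}$.

$$r_0 = 1 \quad\text{and}\quad r_n = \frac{1}{3} r_{n-1} \quad\text{for } n = 1, 2, \ldots, \tag{25}$$

$$p_0 = 1,\ p_1 = \frac{1}{3}, \quad\text{and}\quad p_n = \frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} \quad\text{for } n = 2, 3, \ldots, \tag{26}$$

$$q_0 = 1,\ q_1 = \frac{1}{3}, \quad\text{and}\quad q_n = \frac{10}{3} q_{n-1} - q_{n-2} \quad\text{for } n = 2, 3, \ldots. \tag{27}$$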

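Formula (25) is obvious. In (26) the difference equation has the general solution $p_n = A(1/3^n) + B$. This can be verified by direct substitution:

$$\begin{aligned}
\frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} &= \frac{4}{3}\left(\frac{A}{3^{n-1}} + B\right) - \frac{1}{3}\left(\frac{A}{3^{n-2}} + B\right) \\
&= \left(\frac{4}{3^n} - \frac{3}{3^n}\right)A + \left(\frac{4}{3} - \frac{1}{3}\right)B = A\frac{1}{3^n} + B = p_n.
\end{aligned}$$

Setting $A = 1$ and $B = 0$ will generate the desired sequence. In (27) the difference equation has the general solution $q_n = A(1/3^n) + B3^n$. This too is verified by substitution:

$$\begin{aligned}
\frac{10}{3} q_{n-1} - q_{n-2} &= \frac{10}{3}\left(\frac{A}{3^{n-1}} + B3^{n-1}\right) - \left(\frac{A}{3^{n-2}} + B3^{n-2}\right) \\
&= \left(\frac{10}{3^n} - \frac{9}{3^n}\right)A + (10 - 1)3^{n-2}B = A\frac{1}{3^n} + B3^n = q_n.
\end{aligned}$$

Again, setting $A = 1$ and $B = 0$ generates the desired sequence.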
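
The claim that all three schemes reproduce $1/3^n$ in exact arithmetic can be checked directly with Python's `fractions.Fraction`. The helper names `one_term` and `two_term` and the choice of 11 terms are illustrative.

```python
from fractions import Fraction

def one_term(x0, ratio, count):
    """Terms of x_n = ratio * x_{n-1}, starting from x0."""
    terms = [x0]
    while len(terms) < count:
        terms.append(ratio * terms[-1])
    return terms

def two_term(x0, x1, a, b, count):
    """Terms of x_n = a * x_{n-1} + b * x_{n-2}, starting from x0 and x1."""
    terms = [x0, x1]
    while len(terms) < count:
        terms.append(a * terms[-1] + b * terms[-2])
    return terms

third = Fraction(1, 3)
target = [third**n for n in range(11)]

r = one_term(Fraction(1), third, 11)                                   # scheme (25)
p = two_term(Fraction(1), third, Fraction(4, 3), Fraction(-1, 3), 11)  # scheme (26)
q = two_term(Fraction(1), third, Fraction(10, 3), Fraction(-1), 11)    # scheme (27)

print(r == target, p == target, q == target)  # True True True
```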

Example:
Generate approximations to the sequence $\{x_n\} = \{1/3^n\}$ using the schemes

$$r_0 = 0.99996 \quad\text{and}\quad r_n = \frac{1}{3} r_{n-1} \quad\text{for } n = 1, 2, \ldots, \tag{28}$$

$$p_0 = 1,\ p_1 = 0.33332, \quad\text{and}\quad p_n = \frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} \quad\text{for } n = 2, 3, \ldots, \tag{29}$$

$$q_0 = 1,\ q_1 = 0.33332, \quad\text{and}\quad q_n = \frac{10}{3} q_{n-1} - q_{n-2} \quad\text{for } n = 2, 3, \ldots. \tag{30}$$

In (28) the initial error in $r_0$ is 0.00004, and in (29) and (30) the initial errors in $p_1$ and $q_1$ are 0.000013. Investigate the propagation of error for each scheme.

Table: Sequence $x_n = 1/3^n$ and the approximations $r_n$, $p_n$, and $q_n$

| $n$ | $x_n$ | $r_n$ | $p_n$ | $q_n$ |
|---|---|---|---|---|
| 0 | 1.0000000000 | 0.9999600000 | 1.0000000000 | 1.0000000000 |
| 1 | 0.3333333333 | 0.3333200000 | 0.3333200000 | 0.3333200000 |
| 2 | 0.1111111111 | 0.1111066667 | 0.1110933333 | 0.1110666667 |
| 3 | 0.0370370370 | 0.0370355556 | 0.0370177778 | 0.0369022222 |
| 4 | 0.0123456790 | 0.0123451852 | 0.0123259259 | 0.0119407407 |
| 5 | 0.0041152263 | 0.0041150617 | 0.0040953086 | 0.0029002469 |
| 6 | 0.0013717421 | 0.0013716872 | 0.0013517695 | -0.0022732510 |
| 7 | 0.0004572474 | 0.0004572291 | 0.0004372565 | -0.0104777503 |
| 8 | 0.0001524158 | 0.0001524097 | 0.0001324188 | -0.0326525834 |
| 9 | 0.0000508053 | 0.0000508032 | 0.0000308063 | -0.0983641945 |
| 10 | 0.0000169351 | 0.0000169344 | -0.0000030646 | -0.2952280648 |

Table: Error sequences $x_n - r_n$, $x_n - p_n$, and $x_n - q_n$

| $n$ | $x_n - r_n$ | $x_n - p_n$ | $x_n - q_n$ |
|---|---|---|---|
| 0 | 0.0000400000 | 0.0000000000 | 0.0000000000 |
| 1 | 0.0000133333 | 0.0000133333 | 0.0000133333 |
| 2 | 0.0000044444 | 0.0000177778 | 0.0000444444 |
| 3 | 0.0000014815 | 0.0000192593 | 0.0001348148 |
| 4 | 0.0000004938 | 0.0000197531 | 0.0004049383 |
| 5 | 0.0000001646 | 0.0000199177 | 0.0012149794 |
| 6 | 0.0000000549 | 0.0000199726 | 0.0036449931 |
| 7 | 0.0000000183 | 0.0000199909 | 0.0109349977 |
| 8 | 0.0000000061 | 0.0000199970 | 0.0328049992 |
| 9 | 0.0000000020 | 0.0000199990 | 0.0984149997 |
| 10 | 0.0000000007 | 0.0000199997 | 0.2952449999 |
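
The error sequences in the table can be regenerated with a short script. The sketch below runs schemes (28), (29), and (30) from their perturbed starting values using exact rational arithmetic, so the only error is the initial one; the helper name `two_term` is illustrative.

```python
from fractions import Fraction

def two_term(x0, x1, a, b, count):
    """Terms of x_n = a * x_{n-1} + b * x_{n-2}, starting from x0 and x1."""
    terms = [x0, x1]
    while len(terms) < count:
        terms.append(a * terms[-1] + b * terms[-2])
    return terms

N = 11
third = Fraction(1, 3)
x = [third**n for n in range(N)]  # exact sequence 1/3^n

# Scheme (28): one-term recurrence started from the perturbed value r_0.
r = [Fraction("0.99996")]
while len(r) < N:
    r.append(third * r[-1])

# Schemes (29) and (30): two-term recurrences with perturbed p_1 and q_1.
p = two_term(Fraction(1), Fraction("0.33332"), Fraction(4, 3), Fraction(-1, 3), N)
q = two_term(Fraction(1), Fraction("0.33332"), Fraction(10, 3), Fraction(-1), N)

err_r = [float(x[n] - r[n]) for n in range(N)]
err_p = [float(x[n] - p[n]) for n in range(N)]
err_q = [float(x[n] - q[n]) for n in range(N)]

for n in range(N):
    print(f"{n:2d} {err_r[n]:.10f} {err_p[n]:.10f} {err_q[n]:.10f}")
```

In this run the error in $r_n$ shrinks by a factor of 3 per step, the error in $p_n$ levels off near $2 \times 10^{-5}$, and the error in $q_n$ grows by roughly a factor of 3 per step, consistent with the $B3^n$ term in the general solution of (27).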

[Figure: three panels plotting the error sequences $x_n - r_n$ and $x_n - p_n$ (vertical axes scaled by $10^{-5}$) and $x_n - q_n$ against $n$ for $0 \le n \le 10$.]

The error for $\{r_n\}$ is stable and decreases in an exponential manner. The error for $\{p_n\}$ is stable. The error for $\{q_n\}$ is unstable and grows at an exponential rate. Although the error for $\{p_n\}$ is stable, the terms $p_n \to 0$ as $n \to \infty$, so that the error eventually dominates and the terms past $p_8$ have no significant digits.