
ChE 4C3/6C3

Engineering Statistics
Winter 2006/2007

Introduction

• Lecturer: Dr. John MacGregor
• TAs:
  – Arv Jegatheesan, NRB-B105, Ext. 26876, jegatha@mcmaster.ca
  – Fu Kuan Tsiao, JHE-369, Ext. 24031, tsiaof@mcmaster.ca
• Office Hours: TBA
• Grading Scheme:
  – Assignments (5 or 6): 20% (Marking)
  – Midterm Exam: 30% (Feb. 8, 2007)
  – Final Exam: 50% (Final Exam Period, TBA)
• Course Notes:
  – Available on the course web site
  – PowerPoint lecture slides will be available several days before each lecture.

Notes and Reference Materials

• Course Notes: Some course notes (available on the ChE 4C03 directory) are designed to facilitate and enhance learning, but they are not a complete source of reference material for this course. The notes are intended to be a starting point and should be supplemented with lecture material and material from other texts.

• Reference Texts:
  – Montgomery, D.C. and Runger, G.C., “Applied Statistics and Probability for Engineers”, First Ed., Wiley, 1994.
  – Box, G.E.P., Hunter, W.G. and Hunter, J.S., “Statistics for Experimenters”, Wiley, 1978.
  – Eriksson, L., Johansson, E., Kettaneh-Wold, N. and Wold, S., “Multi- and Megavariate Data Analysis: Principles and Applications”, Umetrics AB, Sweden, 2001.

Part 1: Objectives

• Focus:
  – The presentation of the material will emphasize the use of statistical techniques to extract information from measured data, and the use of this information for decision making. The focus of the course will be on the practical use of various statistical techniques, and this will sometimes demand a close look at the mathematics underlying the techniques so as to understand their strengths and limitations.
  – We will discuss several examples in class, but many additional examples are covered in the reference texts.

Outline

• Outline:
  – Review
    - measures of position (mean, median, mode)
    - measures of spread (variance, standard deviation, …)
    - measures of uncertainty (confidence intervals, hypothesis tests)
  – Statistical Process Control
    - philosophy
    - SPC charts

Outline (Cont.)

• Empirical Model Building
  – Classes of models (linear, nonlinear, dynamic, steady-state, multi-response, empirical, first principles)
  – Steps in model building
  – Parameter estimation
    - linear regression from an optimization perspective
    - measuring uncertainty in parameter estimates
    - testing for lack of fit
    - multiple linear regression
    - use of dummy variables
    - nonlinear regression
    - ridge regression
    - multi-response estimation

Outline (Cont.)

• Part II
  1) Design of Experiments (DOE)
     - Concepts behind DOE
     - Randomization and blocking
     - Factorial Designs
     - Fractional Factorial Designs
     - Response Surface Designs
     - Optimal Designs
  2) Introduction to Multivariate Statistics
     - Introduction to Principal Component Analysis (PCA)
     - Troubleshooting processes using plant data
     - Multivariate SPC
     - Industrial applications
     - Introduction to Partial Least Squares (PLS)

A Common View of Statistics

• If it’s alive, it’s Biology

• If it’s in motion, it’s Physics

• If it changes colour, it’s Chemistry

• If it’s boring, it’s Statistics

Review - Visualizing Data

• When studying data, it is very important to visualize the data if possible.
• Examples of visualization techniques include:
  – time series plots (plots of observations as a function of time)
  – histograms (plots showing the frequency of occurrence of groups of values within the dataset)
  – stem and leaf plots (a type of histogram which has the advantage of retaining all of the original values)
  – scatter plots (a plot of one variable versus another; used when data are available on more than one relevant variable)

Time Series Plot

• Plot of mold level versus time

Histogram

• Histogram of the results of desulphurization for one of Dofasco’s products

[Figure: “End Sulphur Distributions” — frequency histograms of end sulphur for the Pre DRC and DRC products, with the Pre DRC target, DRC target, and constraint marked.]

Histogram

• To construct a histogram, first divide the range of the data into intervals (sometimes called classes or bins).
• Usually the bins are of equal width.
• The number of bins must be chosen wisely or else the plot will not be informative.
• A rule of thumb is: # bins ≈ √n
• Count the number of data points that fall within each bin. This is the frequency.
• Plot the frequency as a function of the bin center points. This is a histogram.
• Relative frequency:

  relative frequency = frequency / n
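The steps above can be sketched in Python. This is a minimal illustration, not the course's own code; the function name and the default √n choice of bins are ours:

```python
import math

def histogram(data, num_bins=None):
    """Build histogram counts following the steps above.

    num_bins defaults to the sqrt(n) rule of thumb.
    Returns (bin_edges, frequencies, relative_frequencies).
    """
    n = len(data)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))  # rule of thumb: # bins ≈ √n
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins                # equal-width bins
    edges = [lo + i * width for i in range(num_bins + 1)]
    freqs = [0] * num_bins
    for x in data:
        # count the point in its bin; the maximum value falls in the last bin
        i = min(int((x - lo) / width), num_bins - 1)
        freqs[i] += 1
    rel = [f / n for f in freqs]                # relative frequency = frequency / n
    return edges, freqs, rel

edges, freqs, rel = histogram(list(range(16)))  # 16 points -> 4 bins
print(freqs)  # -> [4, 4, 4, 4]
```

A plotting library would then draw `freqs` against the bin centers; the counting itself is all a histogram is.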

Histogram 2

  Bins     Frequency   Cumulative %
  84.5         4          10.26%
  89.5         0          10.26%
  94.5         4          20.51%
  99.5         9          43.59%
  104.5       10          69.23%
  109.5        6          84.62%
  114.5        3          92.31%
  More         3         100.00%

[Figure: histogram of the frequencies above with an overlaid cumulative % curve.]
Scatter Plot

• A plot of the values of one variable versus the values of a second variable.
• Used to look for correlation or relationships among variables.
[Figure: scatter plot of Y versus X, with fitted line y = 0.2239x + 37.349.]
Probability distributions

• Probability distribution: a description of the set of possible values of X, along with the probability associated with each of the possible values.
• Probability density function: a function that enables the calculation of probabilities involving a continuous random variable X. It is analogous to the probability mass function of a discrete random variable. It is usually denoted by f(x). The area under the curve f(x) between x_1 and x_2 defines the probability of obtaining a value of X in the interval [x_1, x_2].
  – A probability density function must satisfy the following properties:

    f(x) \ge 0

    \int_{-\infty}^{\infty} f(x)\,dx = 1

    P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(u)\,du

Cumulative Distribution Function

• Cumulative distribution function:
  – Usually denoted by F(x) and defined by:

    F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du

    f(x) = \frac{dF(x)}{dx}
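The relationship between the pdf and the cdf can be checked numerically. Below is a small sketch using an exponential distribution, chosen purely as an illustration (it is not from the notes):

```python
import math

lam = 2.0                                   # rate of an exponential distribution
f = lambda x: lam * math.exp(-lam * x)      # pdf
F = lambda x: 1.0 - math.exp(-lam * x)      # cdf: F(x) is the integral of f from 0 to x

# f(x) = dF(x)/dx, checked with a central difference
x, h = 0.8, 1e-6
deriv = (F(x + h) - F(x - h)) / (2.0 * h)
print(abs(deriv - f(x)))                    # tiny: the pdf is the derivative of the cdf
```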

The Normal Distribution

• In engineering applications we often assume that measured continuous random variables are normally distributed.
• The mean defines the center of the normal probability density function.
• The standard deviation determines the spread (dispersion).
• The normal probability density function is symmetric and bell-shaped.
• The expression for the normal probability density function for the random variable X is:

  f_X(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

The Standard Normal Distribution

• The standard normal distribution refers to the normal distribution with mean zero and variance one.
• The standard normal distribution is important in that we can use tabulated values of the cumulative standard normal distribution for any normally distributed random variable by first standardizing it. We standardize a random variable X that is N(\mu, \sigma^2) using:

  Z = \frac{X - \mu}{\sigma}

  P(a \le X \le b) = P(X \le b) - P(X \le a)
  P(X \ge a) = 1 - P(X \le a)
  P(X \le -a) = P(X \ge a) = 1 - P(X \le a)
  P(X \le a) = P\!\left(Z \le \frac{a - \mu}{\sigma}\right)

Example

e.g. (D.W.B. + D.D.M., Example 4.1)

The temperature of a heated flotation cell under standard operating conditions is believed to fluctuate as a normal p.d.f. with a mean value of 40 degrees Celsius and a standard deviation of 5 degrees Celsius. What is the probability that the next measured temperature will lie between 37 degrees Celsius and 43.5 degrees Celsius?

Let X denote the measured temperature. It is given that X is distributed as N(40, 25).
We are looking for P(37 ≤ X ≤ 43.5).
In order to use the standard normal table, we have to normalize the temperature. To normalize the variable, subtract the mean and divide by the standard deviation. We will call the normalized variable Z. For this example, the normalized temperature is given by

  Z = \frac{X - 40}{5}

Example (Cont.)

Now we use this expression to express the probability in terms of Z. Therefore, we are looking for

  P\!\left(\frac{37 - 40}{5} \le Z \le \frac{43.5 - 40}{5}\right) = P(-0.6 \le Z \le 0.7)
  = P(Z \le 0.7) - P(Z \le -0.6)
  = F(0.7) - F(-0.6)
  = F(0.7) - [1 - F(0.6)]

Using the standard normal tables we find that F(0.7) = 0.7580 and F(0.6) = 0.7257. Therefore

  P(-0.6 \le Z \le 0.7) = 0.7580 - [1 - 0.7257] = 0.4837

The probability that the next measured temperature will be between 37 and 43.5 degrees Celsius is 48.37%.
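The same probability can be computed without tables via the error function. A small sketch (the helper name `std_normal_cdf` is ours):

```python
import math

def std_normal_cdf(z):
    """Cumulative standard normal F(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 40.0, 5.0            # flotation-cell temperature: N(40, 25)
z_lo = (37.0 - mu) / sigma       # -0.6
z_hi = (43.5 - mu) / sigma       # 0.7
p = std_normal_cdf(z_hi) - std_normal_cdf(z_lo)
print(round(p, 4))               # close to the 4-figure table value 0.4837
```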

Mean

Population mean (parameter of the distribution):

  \mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx

The mean is the center of mass of the range of values of X.
The mean is the first moment of the probability density function f(x).

An estimator (statistic) for the population mean is the sample mean or average:

  \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

Another statistical estimator for the mean is the median (it has larger variance than the sample mean, but is more robust to outliers).
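A quick illustration of the sample mean and median, and of the median's robustness to outliers, using Python's standard library (the data values are made up):

```python
import statistics

data = [9.8, 10.1, 10.0, 9.9, 10.2]          # made-up sample
print(statistics.mean(data), statistics.median(data))  # both near 10.0

# The median is more robust to outliers than the sample mean:
with_outlier = data + [100.0]
print(statistics.mean(with_outlier))         # dragged far from 10 by one bad point
print(statistics.median(with_outlier))       # barely moves
```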

Variance


  V(X) = \sigma_x^2 = E\left[(X - \mu)^2\right] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx    (continuous X)

  V(X) = \sigma_x^2 = E\left[(X - \mu)^2\right] = \sum_x (x - \mu)^2 f(x)    (discrete X)

Standard deviation: \sigma = +\sqrt{\sigma^2}

Estimate of the variance based on a sample of n observations:

  s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
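The n − 1 divisor can be checked against Python's `statistics.variance`, which uses the same sample formula (the data are made up):

```python
import statistics

data = [4.0, 7.0, 6.0, 3.0, 5.0]   # made-up sample
n = len(data)
xbar = sum(data) / n

# sample variance: divide the sum of squared deviations by n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = s2 ** 0.5                       # sample standard deviation

print(s2, statistics.variance(data))  # both use the n - 1 divisor
```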

Expected Value


  E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\,dx

• The mean and variance are special cases of this general definition.

• Properties of Expectations and Variances:

  E[cX] = cE[X]
  E[X + Y] = E[X] + E[Y]
  V(cX) = c^2 V(X)
  V(X + Y) = V(X) + V(Y) + 2\,\mathrm{Cov}(X, Y)
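The variance of a sum can be verified exactly on a small discrete joint distribution; note the factor of 2 on the covariance term. The probabilities below are invented purely for illustration:

```python
# Joint pmf over (x, y) pairs -- invented numbers, purely illustrative
joint = {(0, 0): 0.4, (1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1}

# Expectation of any g(X, Y) over the joint pmf
E = lambda g: sum(p * g(x, y) for (x, y), p in joint.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
VX = E(lambda x, y: (x - EX) ** 2)
VY = E(lambda x, y: (y - EY) ** 2)
cov = E(lambda x, y: (x - EX) * (y - EY))

lhs = E(lambda x, y: (x + y - EX - EY) ** 2)   # V(X + Y) computed directly
rhs = VX + VY + 2.0 * cov                      # V(X) + V(Y) + 2 Cov(X, Y)
print(lhs, rhs)                                # equal: the factor of 2 matters
```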

Example - Mean and Variance

• A beverage company uses the following procedure for checking the net weight of beverage in a case of 24 bottles. The beverage in one bottle, selected at random from the case, is weighed and that weight is then multiplied by 24 to provide a “measurement” of the net weight of the beverage in the case. If the weight of beverage in an individual bottle is a random variable having a population mean of 340 grams and a population standard deviation of 1.2 grams, what are the expected value and standard deviation of the deduced weight of net beverage content in a case?

Solution

Let X = continuous random variable representing the mass of beverage in an individual bottle.

The problem provides the following information:

  n = 24 bottles
  μ = 340 grams
  σ = 1.2 grams

(a) Expected Value

The expected value of the mass of beverage in an individual bottle is 340 grams (μ). Since the case weight is deduced by scaling one bottle’s weight by n, the properties of expectations give:

  E(nX) = n · E(X)

For this problem:

  E(nX) = E(24X) = 24 · E(X) = 24 × 340 = 8160 grams

Solution (Cont.)

(b) Standard Deviation

The standard deviation is the square root of the variance, with variance defined as (p.8.4):

  \sigma^2(X) = E\left[(X - \mu)^2\right] = E\left\{[X - E(X)]^2\right\}

Similarly, for the scaled measurement nX, the variance is:

  \sigma^2(nX) = E\left\{[nX - E(nX)]^2\right\} = E\left\{[nX - nE(X)]^2\right\} = n^2\, E\left\{[X - E(X)]^2\right\} = n^2\, \sigma^2(X)

Therefore the standard deviation of the case measurement is \sigma(nX) = n\,\sigma(X) = 24 × 1.2 = 28.8 grams.
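The computations above, in a few lines of Python:

```python
n = 24            # bottles per case
mu = 340.0        # grams, population mean per bottle
sigma = 1.2       # grams, population standard deviation per bottle

# The case "measurement" is n times one randomly chosen bottle: nX
expected_case = n * mu    # E(nX) = n * E(X)
sd_case = n * sigma       # sigma(nX) = n * sigma(X), since Var(nX) = n**2 * Var(X)

print(expected_case, sd_case)  # 8160.0 grams and about 28.8 grams
```

Note that scaling one bottle by 24 multiplies the standard deviation by 24; summing 24 independent bottles would give a much smaller spread (√24 σ), which is why this single-bottle procedure is a noisy measurement of the case weight.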

Covariance

• Covariance is a measure of the linear association between random variables.

• Population covariance:

  \sigma_{XY}^2 = E\left\{(X - \mu_X)(Y - \mu_Y)\right\} = E(XY) - \mu_X \mu_Y

• Sample covariance:

  \hat{\sigma}_{XY}^2 = s_{XY}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

Correlation

• Correlation: a scaled version of covariance. The scaling is done so that the range of \rho is [-1, 1].

• Population correlation:

  \rho = \frac{\sigma_{XY}^2}{\sigma_X \sigma_Y}

• Sample correlation:

  r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)\left(\sum_{i=1}^{n} (y_i - \bar{y})^2\right)}}
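The sample covariance and correlation formulas in Python (the paired data are made up; r is checked to lie in [-1, 1]):

```python
import math

# made-up paired data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
s_xy = sxy / (n - 1)                       # sample covariance
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)             # sample correlation

print(s_xy, r)                             # r always lies in [-1, 1]
```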

