Statistics - start with data (observed values from an unknown "empirical" distribution)
Functions of the data that estimate parameters {mean, variance, skewness, and kurtosis}
Estimate probabilities

Average (\(\bar{X}\))
\(\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\)

Median (Md)
Md = the value that divides the ranked observations in half
\(Md = X_{((n+1)/2)}\) if n is odd
\(Md = \frac{X_{(n/2)} + X_{(n/2+1)}}{2}\) if n is even

Mode (Mo)
Mo = the most frequent data point
Properties of the Average
- \(\sum (X_i - \bar{X})^2\) is less than the sum of squared deviations from any other estimate
  Ex. \(\sum (X_i - \bar{X})^2 \le \sum (X_i - Md)^2\)
- The average is the minimum variance estimate
- Gets pulled in the direction of extreme points
Example
Data {1, 2, 3, 4, 9}
Including \(X_5 = 9\): \(\bar{X} = 3.8\), Md = 3
The average can be very sensitive to extreme points, while the median is fairly robust.
Sensitivity depends on the sample size and the deviation of the extreme point!
Assumption of \(\bar{X}\): the \(X_i\)'s are independently and identically distributed (i.i.d.)
This is often not a good assumption!
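The sensitivity of the average to an extreme point can be checked numerically. The sketch below uses the example data {1, 2, 3, 4, 9} and Python's standard `statistics` module; the variable names and the "without X5" comparison are illustrative assumptions, not part of the original example.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 9]   # example data; X5 = 9 is the extreme point
trimmed = data[:-1]      # same data with the extreme point removed (illustrative)

mean_with, median_with = mean(data), median(data)
mean_without, median_without = mean(trimmed), median(trimmed)

# The average shifts more than the median when the extreme point is included.
print(f"with X5:    mean = {mean_with:.2f}, median = {median_with:.2f}")
print(f"without X5: mean = {mean_without:.2f}, median = {median_without:.2f}")
```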
Properties of the Range
Bad: It only uses two pieces of information.
Good: It is easy to compute manually.

Uses of the Range
- The range itself is useful for characterizing a distribution (order statistics)
- The range can be used to estimate the standard deviation (\(\hat{\sigma} = R/d_2\))
- Many practical applications once the standard deviation is approximated:
  - Control Charts
  - Process Capability
  - Gage Repeatability & Reproducibility

Problems when using the Range to Approximate the Standard Deviation
The \(d_2\) coefficient describes the relationship between the range and the standard deviation for a normal distribution. Thus, the range method for estimating the standard deviation is only valid if the parent distribution is normal.
Sample Variance (\(S^2\)) and Standard Deviation (S)
\(S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}\)
\(S = \sqrt{S^2}\)

Example
Xi      (Xi - X̄)^2
3       2.04
2       5.90
9       20.88
1       11.76
6       2.46
8       12.74
2       5.90
Average = 4.43      Sum = 61.71

Importance of S:
- Same units as measurements
- Positive numbers that increase when variability increases
- Sample variance is the unbiased and minimum variance estimate for the population variance (irrespective of the distribution type)
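As a check on the table, the sum of squared deviations can be divided by n - 1 to obtain the sample variance. This is a minimal sketch using the seven tabulated values; the variable names are illustrative, and `statistics.variance` is used only as a cross-check.

```python
from math import sqrt
from statistics import mean, variance

x = [3, 2, 9, 1, 6, 8, 2]                   # tabulated observations
x_bar = mean(x)                             # about 4.43
ss = sum((xi - x_bar) ** 2 for xi in x)     # sum of squared deviations, about 61.71
s2 = ss / (len(x) - 1)                      # sample variance with n - 1 in the denominator
s = sqrt(s2)                                # sample standard deviation, same units as x

print(f"average = {x_bar:.2f}, sum of squares = {ss:.2f}")
print(f"S^2 = {s2:.2f}, S = {s:.2f}")
```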
Grand Average and Pooled Variance Estimates
Subgroup averages and variances are merged into historical estimates of average and variance used for control chart centerlines.
Grand average (\(\bar{\bar{x}}\)) = average of subgroup averages
Pooled variance (\(S_p^2\)) = average of subgroup variances

\(\bar{\bar{x}} = \frac{1}{m} \sum_{i=1}^{m} \bar{X}_i\)      if n is constant
\(\bar{\bar{x}} = \frac{\sum_{i=1}^{m} n_i \bar{X}_i}{\sum_{i=1}^{m} n_i}\)      if \(n_i\) is variable (always correct)

\(S_p^2 = \frac{1}{m} \sum_{i=1}^{m} S_i^2\)      if n is constant
\(S_p^2 = \frac{\sum_{i=1}^{m} \nu_i S_i^2}{\sum_{i=1}^{m} \nu_i}\)      if \(n_i\) is variable (always correct), where \(\nu_i = n_i - 1\)
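The weighted ("always correct") forms of the grand average and pooled variance can be sketched as below. The subgroup values are hypothetical, the function names are illustrative, and the pooled variance is assumed to weight each subgroup by its degrees of freedom, n_i - 1.

```python
def grand_average(means, sizes):
    """Subgroup averages weighted by subgroup size n_i."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

def pooled_variance(variances, sizes):
    """Subgroup variances weighted by degrees of freedom (n_i - 1), an assumed weighting."""
    dof = [n - 1 for n in sizes]
    return sum(v * s2 for v, s2 in zip(dof, variances)) / sum(dof)

# Hypothetical subgroup summaries (not from the source).
means = [10.0, 12.0, 11.0]
variances = [4.0, 2.0, 3.0]

# With constant n, the weighted forms reduce to simple averages.
equal = (grand_average(means, [5, 5, 5]), pooled_variance(variances, [5, 5, 5]))
# With variable n_i, larger subgroups carry more weight.
unequal = (grand_average(means, [3, 5, 9]), pooled_variance(variances, [3, 5, 9]))
print(equal, unequal)
```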
Probability Theory
Distribution Functions
Discrete Distributions
Discrete Probability Density Function (pdf): f(x) = Pr[X = x]
Properties of the discrete pdf
1) \(f(x) \ge 0\): each probability is greater than or equal to 0.
2) \(\sum_x f(x) = 1\): the sum of the probabilities equals 1.0.
Discrete Cumulative Distribution Function (cdf): \(F(x) = P(X \le x) = \sum_{t \le x} f(t)\)

Ex. Binomial (n = 5 trials, p = .2)
\(f(x) = \binom{n}{x} p^x (1-p)^{n-x}\), where \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\)

\(f(0) = \binom{5}{0} (.2)^0 (1-.2)^5 = 0.32768\)      Probability of 0 successes in 5 trials
\(f(1) = \binom{5}{1} (.2)^1 (1-.2)^4 = 0.4096\)       Probability of 1 success in 5 trials
\(f(2) = \binom{5}{2} (.2)^2 (1-.2)^3 = 0.2048\)       Probability of 2 successes in 5 trials
\(f(3) = \binom{5}{3} (.2)^3 (1-.2)^2 = 0.0512\)       Probability of 3 successes in 5 trials
\(f(4) = \binom{5}{4} (.2)^4 (1-.2)^1 = 0.0064\)       Probability of 4 successes in 5 trials
\(f(5) = \binom{5}{5} (.2)^5 (1-.2)^0 = 0.00032\)      Probability of 5 successes in 5 trials
\(f(0) + f(1) + f(2) + f(3) + f(4) + f(5) = 1.0\)      Property of a pdf
\(F(2) = P(X \le 2) = f(0) + f(1) + f(2) = 0.94208\)
Combinations:
Number of combinations = \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\)
\(\binom{5}{1} = \frac{5!}{1!(5-1)!} = 5\)
\(\binom{5}{2} = \frac{5!}{2!(5-2)!} = \frac{5 \times 4}{2!} = 10\)
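The binomial probabilities for n = 5, p = .2 can be reproduced with `math.comb` from the Python standard library. This is a minimal sketch; the helper name `binom_pmf` is illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 0.2
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]
cdf_2 = sum(pmf[:3])          # F(2) = f(0) + f(1) + f(2)

for x, f in enumerate(pmf):
    print(f"f({x}) = {f:.5f}")
print(f"sum = {sum(pmf):.5f}, F(2) = {cdf_2:.5f}")
```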
Continuous Distributions
Continuous Probability Density Function (pdf): f(x) does not equal a probability
Properties of the continuous pdf
1) \(f(x) \ge 0\): the function is non-negative over the whole region of X
2) \(\int_{-\infty}^{+\infty} f(x)\,dx = 1\): the total area under the curve equals 1.0 (probability)
Continuous Cumulative Distribution Function (cdf): \(F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt\)

[Figure: f(x) versus x; F(x) = area under f(x) to the left of x]

Ex. \(f(x) = 2x\)  \((0 \le x \le 1)\)
\(\int_0^1 2x\,dx = x^2 \big|_0^1 = 1^2 - 0^2 = 1\): the area under the curve equals 1, thus proving it is a pdf
\(F(x) = \int_0^x 2t\,dt = t^2 \big|_0^x = x^2\)
\(F(.5) = 0.5^2 = 0.25 = \Pr[X \le .5]\)
Probabilities and Percentage Points (Variates) from Common Distributions
Copyright 2001 Genemetrix
Tables or functions exist for common distributions such as Z, t, F, and chi-squared to:
- determine the lower tail probability for a given value of x
- determine the value of x based on the lower tail probability
Area between two limits: \(\Pr[a < X < b] = \int_a^b f(x)\,dx = F(b) - F(a)\)
Expectations
Discrete Distributions
Let the possible values (sample space) for X be denoted by \(x_1, x_2, \ldots, x_n\) and \(f(x_i) = \Pr[X = x_i]\)
\(E[X] = \sum_{i=1}^{n} x_i f(x_i)\)
\(E[X^2] = \sum_{i=1}^{n} x_i^2 f(x_i)\)

Ex. Binomial (n = 5, p = .2)
\(E[X] = 0(0.32768) + 1(0.4096) + 2(0.2048) + 3(0.0512) + 4(0.0064) + 5(0.00032) = 1.0\)
\(E[X^2] = 0^2(0.32768) + 1^2(0.4096) + 2^2(0.2048) + 3^2(0.0512) + 4^2(0.0064) + 5^2(0.00032) = 1.8\)

When each of the n observed values is equally likely, \(f(x_i) = 1/n\) and \(E[X] = \frac{1}{n} \sum_{i=1}^{n} X_i\) = sample average
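Continuous Distributions
\(E[X] = \int_{-\infty}^{+\infty} x f(x)\,dx\)
\(E[u(X)] = \int_{-\infty}^{+\infty} u(x) f(x)\,dx\)

Ex. \(f(x) = 2x\)  \((0 \le x \le 1)\)
\(E[X] = \int_0^1 x \cdot 2x\,dx = \int_0^1 2x^2\,dx = \tfrac{2}{3} x^3 \big|_0^1 = 2/3\)
\(E[X^2] = \int_0^1 x^2 \cdot 2x\,dx = \int_0^1 2x^3\,dx = \tfrac{2}{4} x^4 \big|_0^1 = 2/4\)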
Variance is an Expectation
\(VAR[X] = E[(X - E[X])^2]\)
\(VAR[X] = E[X^2] - \{E[X]\}^2\), where \(E[X] = \mu\) when factored out
VAR[X] = "Expected value of the product minus the product of the expected values"

Ex. Binomial (n = 5, p = .2)
\(VAR[X] = (1.8) - \{1\}^2 = 0.8 = npq = np(1-p)\)      *binomial property
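The relationship VAR[X] = E[X^2] - {E[X]}^2 can be verified directly from the binomial probabilities. A minimal sketch follows; the dictionary `pmf` and the variable names are illustrative.

```python
from math import comb

n, p = 5, 0.2
# Binomial probabilities f(x) for x = 0..n
pmf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

e_x = sum(x * f for x, f in pmf.items())        # E[X]
e_x2 = sum(x**2 * f for x, f in pmf.items())    # E[X^2]
var_x = e_x2 - e_x**2                           # VAR[X] = E[X^2] - {E[X]}^2

print(f"E[X] = {e_x:.4f}, E[X^2] = {e_x2:.4f}, VAR[X] = {var_x:.4f}")
print(f"np = {n * p:.4f}, np(1-p) = {n * p * (1 - p):.4f}")
```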
Example: Discrete Expected Value
Daily sales records for a computer manufacturing firm show that it will sell 0, 1, or 2 mainframe computer systems with probabilities as listed.

Number of sales (x)      0       1       2
Probability f(x)         0.7     0.2     0.1

A) Find the expected value and standard deviation of daily sales.
Expected value of daily sales: \(E(x) = \sum_{i=1}^{3} x_i f(x_i) = 0(0.7) + 1(0.2) + 2(0.1) = 0.4\)
Standard deviation of daily sales: \(E(x^2) = \sum_{i=1}^{3} x_i^2 f(x_i) = 0.6\), so \(\sigma = \sqrt{0.6 - (0.4)^2} = \sqrt{0.44} = 0.663\)

B) The firm's daily fixed cost is $30,000 and its marginal cost is $200,000 (cost per unit). If a mainframe system sells for $500,000, what is the expected daily profit?
Daily profit = revenues - costs
Fixed daily cost = $30,000; cost per unit = $200,000; revenue per unit = $500,000
Daily profit = (revenue per unit)(expected number sold) - fixed daily cost - (cost per unit)(expected number sold)
= (500000)(0.4) - (30000) - (200000)(0.4) = $90,000 per day
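The expected-value and profit arithmetic for the sales example can be sketched as follows. The probabilities and dollar figures are from the example; the variable names are illustrative.

```python
from math import sqrt

sales_pmf = {0: 0.7, 1: 0.2, 2: 0.1}   # number of sales -> probability

expected_sales = sum(x * f for x, f in sales_pmf.items())
expected_sq = sum(x**2 * f for x, f in sales_pmf.items())
std_sales = sqrt(expected_sq - expected_sales**2)

revenue_per_unit = 500_000
cost_per_unit = 200_000
fixed_daily_cost = 30_000

# Expected profit = expected revenue - fixed cost - expected variable cost
expected_profit = (revenue_per_unit * expected_sales
                   - fixed_daily_cost
                   - cost_per_unit * expected_sales)

print(f"E(x) = {expected_sales:.2f}, std dev = {std_sales:.3f}")
print(f"expected daily profit = ${expected_profit:,.0f}")
```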
Example: Continuous Expected Value
The outside diameter of washers is a continuous random variable, x, distributed uniformly from 300 to 320 mm. Calculate:
A) f(x)
Let x = outside diameter. This is a uniform distribution, so f(x) = c, a constant.
For a pdf, \(\int_{300}^{320} f(x)\,dx = 1\), so \(\int_{300}^{320} c\,dx = 20c = 1\)
Therefore: f(x) = 1/20 for 300 < x < 320; f(x) = 0 elsewhere

B) E[x]
\(E[x] = \int_{300}^{320} x \cdot \frac{1}{20}\,dx = \frac{x^2}{40} \Big|_{300}^{320} = 310\) mm

C) E[x^2]
\(E[x^2] = \int_{300}^{320} x^2 \cdot \frac{1}{20}\,dx = \frac{x^3}{60} \Big|_{300}^{320} = 96133.33\)
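Median and Mode
Median: value of the 50th percentile
[Figure: cdf F(x) rising from 0 to 1.0; Md is the value of X where F(x) = 0.5]
Mode: value with the largest f(x); the value of X where the derivative of f(x) equals 0
[Figure: pdf f(x) versus X, with Mo at the peak]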
Binomial
\(f(x) = \binom{n}{x} p^x (1-p)^{n-x}\),  x = {0, 1, ..., n}
\(F(x) = \sum_{t=0}^{x} \binom{n}{t} p^t (1-p)^{n-t}\)
\(E[X] = np = \mu\)
\(VAR[X] = npq = \sigma^2\)
Example
The probability that a piece of luggage will survive the stress test is 0.65. Six bags are randomly tested.
A) What is the probability that exactly four will survive?
Given: P(luggage survives) = 0.65, P(luggage fails) = 0.35
Let x = number that survive. This is binomial with p = 0.65, q = 0.35, n = 6.
\(P(x = 4) = \binom{6}{4} p^4 q^2 = \binom{6}{4} (0.65)^4 (0.35)^2 = 0.3280\)

B) Given that the 1st and 2nd bags survived, what is the probability that the 3rd and 4th bags will fail?
Note that the trials are independent, and that trials 1 and 2 have already occurred, so the probability of their occurrence = 1.
Let x = number that survive; p = 0.65, q = 0.35, n = 2.
\(P(x = 0) = \binom{2}{0} p^0 q^2 = \binom{2}{0} (0.65)^0 (0.35)^2 = 0.1225\)
Poisson
X = N(t) = number of arrivals occurring in a given time interval
\(f(x) = \frac{e^{-\lambda t} (\lambda t)^x}{x!}\),  x = {0, 1, ..., \(\infty\)}
\(F(x) = \sum_{i=0}^{x} \frac{e^{-\lambda t} (\lambda t)^i}{i!}\)
\(E[X] = \lambda t = \mu\)
\(VAR[X] = \lambda t = \sigma^2\)

Example
The manufacturing defect rate of a product is 0.005 defects per unit (DPU). What is the probability of zero defects occurring in 100 units?
\(\lambda t\) = 0.005 DPU * 100 units = 0.5
\(f(0) = \frac{e^{-0.5} (0.5)^0}{0!} = 0.60653\)
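The zero-defect probability can be reproduced with a few lines of standard-library Python. This is a minimal sketch; the helper name `poisson_pmf` and its single `rate` argument (the expected count over the interval) are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, rate):
    """Probability of exactly x events when the expected count is `rate`."""
    return exp(-rate) * rate**x / factorial(x)

defects_per_unit = 0.005
units = 100
rate = defects_per_unit * units        # expected defects in 100 units = 0.5

p_zero = poisson_pmf(0, rate)
p_at_most_one = sum(poisson_pmf(i, rate) for i in range(2))   # F(1)

print(f"f(0) = {p_zero:.5f}")
print(f"F(1) = {p_at_most_one:.5f}")
```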
Normal
\(f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]\),  \(-\infty \le x \le \infty\)
\(F(x) = \int_{-\infty}^{x} f(t)\,dt\)

Since an infinite number of mean-variance combinations exist, a standardized variable was developed.
Standard Normal Transformation: \(Z = \frac{X - \mu}{\sigma}\)
Transforms all the observations of any normal random variable X to a new set of observations of a standard normal variable Z.

\(E[X] = \mu\), \(E[Z] = 0\)
Proof: \(E[Z] = \frac{E[X] - \mu}{\sigma} = \frac{\mu - \mu}{\sigma} = 0\)
\(VAR[X] = \sigma^2\), \(VAR[Z] = 1\)
Proof: \(VAR[Z] = VAR\left[\frac{X - \mu}{\sigma}\right] = VAR\left[\frac{X}{\sigma}\right] = \frac{1}{\sigma^2} VAR[X] = \frac{\sigma^2}{\sigma^2} = 1\)
Corollary: \(VAR[cX] = c^2 VAR[X]\)
\(X \sim N(\mu, \sigma^2)\), \(Z \sim N(0, 1)\)
Importance: a single table of Z probabilities can be used for all combinations of \((\mu, \sigma^2)\).

FYI: \(S^2\) is an unbiased estimate of \(\sigma^2\), but \(\sqrt{S^2}\) is a biased estimate of \(\sigma\). Some authors espouse using a C4 index to compensate for the bias induced by taking the square root. The problem with the C4 index is that VAR[Z] no longer equals 1 as described above.
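\(\Pr[45 \le X \le 62] = \Pr\left[\frac{45 - 50}{10} \le Z \le \frac{62 - 50}{10}\right] = \Pr[-0.5 \le Z \le 1.2]\)
[Figure: normal curve with the X space (45, 50, 62) mapped to the Z space (-0.5, 0, 1.2)]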
2) Known probability: find the corresponding Z value
Solve for 1 unknown given 2 knowns
Example
On an examination, the average grade was 74 and the standard deviation was 7. If 12% of the class are given A's, and the grades follow a normal distribution, what is the lowest possible A?
An upper-tail area of 0.12 gives z = 1.175, so \(X = \mu + z\sigma = 74 + 1.175(7) = 82.225\)
[Figure: normal curve with the X space (74, 82.225) mapped to the Z space]
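Both directions of the table lookup (value to probability, and probability to value) can be sketched with `statistics.NormalDist` from the Python standard library (3.8+) in place of a printed Z table. The variable names are illustrative; the numbers come from the two examples above.

```python
from statistics import NormalDist

# 1) Known limits -> probability: X ~ N(50, 10^2), Pr[45 <= X <= 62]
x_dist = NormalDist(mu=50, sigma=10)
p_between = x_dist.cdf(62) - x_dist.cdf(45)      # same as Pr[-0.5 <= Z <= 1.2]

# 2) Known probability -> value: top 12% of grades, grades ~ N(74, 7^2)
z_cut = NormalDist().inv_cdf(1 - 0.12)           # Z with 12% in the upper tail
lowest_a = NormalDist(mu=74, sigma=7).inv_cdf(1 - 0.12)

print(f"Pr[45 <= X <= 62] = {p_between:.4f}")
print(f"z = {z_cut:.3f}, lowest A = {lowest_a:.3f}")
```

The software value of z carries more digits than the table value 1.175, so the cutoff may differ from 82.225 in the third decimal place.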
Example
Given a normal distribution with \(\mu = 50\), \(\sigma = 10\), and a sample size of n = 40, find the probability that \(\bar{X}\) falls within its control limits of 47 and 54.
\(\Pr[47 \le \bar{X} \le 54] = \Pr\left[\frac{47 - 50}{10/\sqrt{40}} \le Z \le \frac{54 - 50}{10/\sqrt{40}}\right] = \Pr[-1.90 \le Z \le 2.53]\)
\(\Pr[Z \le 2.53] = 0.9943\)
\(\Pr[Z < -1.90] = 0.0287\)
\(\Pr[-1.90 \le Z \le 2.53] = 0.9943 - 0.0287 = 0.9656\)

2) Known probability: find the corresponding Z value
Solve for 1 unknown given 3 knowns
Example
A drilling operation produces holes with diameters that are approximately normally distributed. If the process mean and variance are 2.1 and 0.0225, respectively, what should be the sample size to ensure that no more than 14% of the sample means will be greater than 2.15?
This is normally distributed, where \(\mu = 2.1\) and \(\sigma^2 = 0.0225\) (so \(\sigma = 0.15\)). We want to find n.
Given: \(P(\bar{x} > 2.15) \le 0.14\), or \(P(\bar{x} < 2.15) \ge 0.86\)
Transform to Z: \(P(Z < Z^*) \ge 0.86\)
Look up the Z value in the table: \(Z^* \ge 1.08\)
Now: \(Z^* = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{(0.05)\sqrt{n}}{0.15} \ge 1.08\)
Solve for n: \(\sqrt{n} \ge 3.24\), so \(n \ge 10.5\); round up to n = 11.
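The sample-size calculation for the drilling example can be sketched as follows, solving Z* = (limit - mu) / (sigma / sqrt(n)) for n and rounding up. `statistics.NormalDist` stands in for the Z table; the variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

mu = 2.1                 # process mean
sigma = sqrt(0.0225)     # process standard deviation = 0.15
limit = 2.15             # threshold for the sample mean
max_tail = 0.14          # at most 14% of sample means may exceed the limit

z_star = NormalDist().inv_cdf(1 - max_tail)           # about 1.08
n_min = ceil((z_star * sigma / (limit - mu)) ** 2)    # smallest integer n that satisfies the bound

# Check: tail probability of the sample mean at n_min and at n_min - 1
tail_at_n = 1 - NormalDist(mu, sigma / sqrt(n_min)).cdf(limit)
tail_below = 1 - NormalDist(mu, sigma / sqrt(n_min - 1)).cdf(limit)

print(f"z* = {z_star:.3f}, minimum n = {n_min}")
print(f"tail at n = {n_min}: {tail_at_n:.4f}; at n = {n_min - 1}: {tail_below:.4f}")
```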
When the population variance is unknown, there is uncertainty in the estimate of \(\sigma^2\). Therefore, a wider distribution was developed to account for this uncertainty.
t Distribution: \(t = \frac{\bar{X} - \mu}{\sqrt{S_p^2 / n}}\), where \(S_p^2\) = pooled variance
Probabilities and percentage points can be obtained from a t table.
\(E[t] = 0\); \(VAR[t] = \nu/(\nu - 2)\) for \(\nu > 2\) degrees of freedom, which exceeds 1 and approaches 1 as \(\nu\) grows.

Example
The outside diameter of washers follows a normal distribution with a mean of 1.20 inches. A sample of 9 washers results in a sample standard deviation of 0.03. Calculate the probability that a sample mean will lie between 1.18140 and 1.22306.
This is normally distributed, where \(\mu = 1.20\), s = 0.03, and n = 9 (8 degrees of freedom).
\(P(1.18140 < \bar{X} < 1.22306) = P(t_1 < t < t_2)\)
Use the transformation:
\(t_1 = \frac{1.18140 - 1.20}{0.03/\sqrt{9}} = -1.86\)
\(t_2 = \frac{1.22306 - 1.20}{0.03/\sqrt{9}} = 2.306\)
Therefore: P(-1.86 < t < 2.306) = P(t < 2.306) - P(t < -1.86) = 0.975 - 0.05 = 0.925 (Look up values.)
Considerations: 1) Are the samples representative of the population? (Sampling) 2) How do we make inferences about the population parameters? 3) How reliable are these inferences?
Sampling - In order to obtain valid inferences of the population, we must obtain samples that are representative of the population.
Random Sample
- observations are made independently (x1, x2, ..., xn) and randomly
- each value (xi) came from distributions having the same pdf {f(x)}
- i.i.d.: independently and identically distributed

Importance
- Joint probability equals the product of the marginal probabilities.
- COV[X1,X2] = 0; the variance of the sum equals the sum of the variances
- Rational sample
Hypothesis Testing
Make a hypothesis (assumption) about the population parameter of interest
ex. \(H_0: \mu = \mu_0\)
[Figure: reference distribution centered at \(\mu_0\), with a rejection region of area \(\alpha/2\) in each tail]

Two Conclusions:
1) Reject H0
2) Cannot reject H0 - We can never "accept" H0 because we don't know what the true parameter really is; however, we can conclude that it is not some value.

Compute the test statistic (Zcalc): where does the observed value fall with respect to the assumed reference distribution?
Rejection Criterion: Given a mean of \(\mu_0\), \((1-\alpha)\) of the values will fall between -Zcrit and Zcrit. If the calculated statistic (Zcalc) falls in the rejection regions, then we conclude with confidence \((1-\alpha)\) that this sample did not come from a population with mean \(\mu_0\).

Possible Situations:
                   Cannot Reject H0      Reject H0
H0 is True         Correct Decision      Type I Error
H0 is False        Type II Error         Correct Decision

Type I Error - "Wrongful rejection" - rejection of the null hypothesis when it is true; Pr[Type I Error] = \(\alpha\)
Type II Error - "Wrongful acceptance" - "acceptance" of the null hypothesis when it is false; Pr[Type II Error] = \(\beta\)
Pr[Rejection] = \(\alpha\) when the null hypothesis is true
Pr[Rejection] = \(1 - \beta\) = "power" when the null hypothesis is false
Example: Hypothesis Test Using a Z Distribution
For a random sample of 50 measurements on the breaking strength of cotton threads, the mean breaking strength was found to be 210 grams and the standard deviation 18 grams.
A) The manufacturer claims that the population mean is 215 grams. State the hypothesis and solve for \(\alpha = 0.10\).
The claim is that \(\mu = 215\) g. This is a two-tailed test.
Null hypothesis: \(H_0: \mu = 215\)
Alternative hypothesis: \(H_A: \mu \ne 215\)
\(Z_{critical} = Z_{0.10/2} = Z_{0.05} = 1.645\)
\(Z_{calc} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{210 - 215}{18/\sqrt{50}} = -1.96\)
Compare Zcalc to Zcritical: |-1.96| is greater than 1.645
Therefore: The null hypothesis, \(H_0: \mu = 215\), is rejected at \(\alpha = 0.10\).

C) Is there evidence that the population mean of breaking strength exceeds 218 grams? State the hypothesis and solve for \(\alpha = 0.05\).
Using \(\alpha = 0.05\), check whether the population mean is > 218 g. This is a one-tailed test.
Null hypothesis: \(H_0: \mu \le 218\)
Alternative hypothesis: \(H_A: \mu > 218\)
\(Z_{critical} = Z_{\alpha} = Z_{0.05} = 1.645\)
\(Z_{calc} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{210 - 218}{18/\sqrt{50}} = -3.14\)
Compare Zcalc to Zcritical: -3.14 is not greater than 1.645
Therefore: The null hypothesis, \(H_0: \mu \le 218\), cannot be rejected at \(\alpha = 0.05\).
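The two Z tests on the cotton-thread data can be sketched as below. `statistics.NormalDist` supplies the critical values in place of a table; the helper `z_statistic` and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(x_bar, mu0, s, n):
    """Standardized distance of the sample mean from the hypothesized mean."""
    return (x_bar - mu0) / (s / sqrt(n))

x_bar, s, n = 210, 18, 50

# Part A: two-tailed test of mu = 215 at alpha = 0.10
z_a = z_statistic(x_bar, 215, s, n)
z_crit_a = NormalDist().inv_cdf(1 - 0.10 / 2)
reject_a = abs(z_a) > z_crit_a

# Part C: one-tailed test of mu <= 218 against mu > 218 at alpha = 0.05
z_c = z_statistic(x_bar, 218, s, n)
z_crit_c = NormalDist().inv_cdf(1 - 0.05)
reject_c = z_c > z_crit_c

print(f"A: Zcalc = {z_a:.3f}, Zcrit = {z_crit_a:.3f}, reject H0: {reject_a}")
print(f"C: Zcalc = {z_c:.3f}, Zcrit = {z_crit_c:.3f}, reject H0: {reject_c}")
```

From the stated inputs (210, 218, 18, 50), the part C statistic evaluates to roughly -3.14; the decision is the same for any negative value, since the rejection region lies in the upper tail.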
Example: Hypothesis Test Using a t Distribution
An auto company states that its new compact car has an average fuel economy (miles per gallon) greater than or equal to 55 mpg on the highway. Eight cars were randomly selected and driven. The results of the study were: 57, 52, 50, 49, 53, 51, 47, and 55. State the hypothesis and solve for \(\alpha = 0.05\).
The claim is that the average mpg \(\ge 55\) mpg. This is a one-tailed test with \(\alpha = 0.05\) and \(\nu = (n - 1) = (8 - 1) = 7\) (degrees of freedom).
Null hypothesis: \(H_0: \mu \ge 55\)
Alternative hypothesis: \(H_A: \mu < 55\)
\(\bar{x} = 51.75\)
\(S_p^2 = \frac{\sum_{i=1}^{n} (x_i - 51.75)^2}{n - 1} = \frac{73.5}{7} = 10.5\)
\(s_{\bar{X}} = \sqrt{S_p^2 / n} = \frac{s}{\sqrt{n}} = \frac{3.24037}{\sqrt{8}} = 1.14564\)
\(t_{calc} = \frac{51.75 - 55}{1.14564} = -2.8368\)
\(t_{critical} = -t_{0.05, 7} = -1.895\) (from tables)
Compare tcalc to tcritical: -2.8368 is less than -1.895
Therefore: Reject the null hypothesis, \(H_0: \mu \ge 55\), at \(\alpha = 0.05\). The manufacturer's claim is invalid.
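The t statistic for the fuel-economy data can be recomputed from the raw observations. In this sketch the critical value 1.895 is hard-coded as an assumed table lookup (one-tailed, alpha = 0.05, 7 degrees of freedom), because the Python standard library has no t distribution; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

mpg = [57, 52, 50, 49, 53, 51, 47, 55]
mu0 = 55                      # hypothesized mean under H0: mu >= 55
T_CRIT = 1.895                # assumed table value: t(0.05, 7 dof), one-tailed

n = len(mpg)
x_bar = mean(mpg)             # 51.75
s = stdev(mpg)                # sample standard deviation (n - 1 denominator)
std_err = s / sqrt(n)         # standard error of the mean
t_calc = (x_bar - mu0) / std_err

# Lower-tail test: reject H0 when t_calc falls below -T_CRIT
reject_h0 = t_calc < -T_CRIT

print(f"x_bar = {x_bar}, s = {s:.5f}, std err = {std_err:.5f}")
print(f"t_calc = {t_calc:.4f}, reject H0: {reject_h0}")
```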
Example: Hypothesis Test Using a \(\chi^2\) Distribution
The same auto company as in the previous example claims that the true variance of fuel economy (mpg) is less than or equal to 5. Using the same data, state the hypothesis and solve for \(\alpha = 0.01\).
The company claims that \(\sigma^2\) of car fuel economy \(\le 5\). This is a one-tailed test with \(\alpha = 0.01\).
Null hypothesis: \(H_0: \sigma^2 \le 5\)
Alternative hypothesis: \(H_A: \sigma^2 > 5\)
From exercise 2-7, \(S_p^2 = 10.5\) and \(\nu = 7\)
\(\chi^2_{critical} = \chi^2_{\nu, \alpha} = \chi^2_{7, .01} = 18.475\) (from tables)
\(\chi^2_{calc} = \frac{(n - 1) S_p^2}{\sigma_0^2} = \frac{(7)(10.5)}{5} = 14.7\)
Compare: 14.7 is less than 18.475
Therefore: The null hypothesis, \(H_0: \sigma^2 \le 5\), cannot be rejected at \(\alpha = 0.01\). The data do not contradict the manufacturer's claim.
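The chi-squared statistic can be recomputed from the same eight observations. The critical value 18.475 is hard-coded as an assumed table lookup (upper tail, alpha = 0.01, 7 degrees of freedom), since the Python standard library has no chi-squared distribution; the variable names are illustrative.

```python
from statistics import variance

mpg = [57, 52, 50, 49, 53, 51, 47, 55]
sigma0_sq = 5                 # hypothesized variance under H0: sigma^2 <= 5
CHI2_CRIT = 18.475            # assumed table value: chi-squared(0.01, 7 dof), upper tail

n = len(mpg)
s2 = variance(mpg)            # sample variance (n - 1 denominator)
chi2_calc = (n - 1) * s2 / sigma0_sq

# Upper-tail test: reject H0 only when the statistic exceeds the critical value
reject_h0 = chi2_calc > CHI2_CRIT

print(f"S^2 = {s2}, chi2_calc = {chi2_calc:.1f}, reject H0: {reject_h0}")
```

With these inputs the statistic (14.7) falls below the tabled critical value, so the comparison does not land in the rejection region.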
                      Male (M)     Female (\(\bar{M}\))     Total
Employed (E)             50            25             75
Unemployed (\(\bar{E}\))       10            15             25
Total                    60            40            100

Marginal Probabilities
Consider only one distribution
Pr[Employed] = 75 / 100 = 0.75
Pr[Unemployed] = 25 / 100 = 0.25
Pr[Male] = 60 / 100 = 0.6
Pr[Female] = 40 / 100 = 0.4
Joint Probabilities
Consider more than one distribution
Pr[A, B] = Probability that event A occurred and event B occurred
Pr[Employed, Male] = 50 / 100 = 0.50
Pr[Employed, Female] = 25 / 100 = 0.25
Pr[Unemployed, Male] = 10 / 100 = 0.10
Pr[Unemployed, Female] = 15 / 100 = 0.15

Conditional Probabilities
Pr[A | B] = Probability that event A will occur given that event B has already occurred
Pr[Employed | Male] = 50 / 60 = 0.833
Pr[Unemployed | Male] = 10 / 60 = 0.167
Pr[Employed | Female] = 25 / 40 = 0.625
Pr[Unemployed | Female] = 15 / 40 = 0.375
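Bayes Theorem
[Figure: probability tree. First branch: Pr[M] = .6, Pr[\(\bar{M}\)] = .4. Second branch: Pr[E | M] = .833, Pr[\(\bar{E}\) | M] = .167, Pr[E | \(\bar{M}\)] = .625, Pr[\(\bar{E}\) | \(\bar{M}\)] = .375]

Equivalent Representation
[Figure: probability tree with the first branch on employment: Pr[E] = .75, Pr[\(\bar{E}\)] = .25]
\(\Pr[M \mid E] = \frac{\Pr[M]\,\Pr[E \mid M]}{\Pr[E]} = \frac{.6 \times .833}{.75} = 0.666\)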
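The Bayes calculation can be cross-checked against the raw counts in the employment table. This is a minimal sketch; the dictionary layout and the short labels ("E", "U", "M", "F") are illustrative.

```python
# Counts from the employment table: (status, sex) -> number of people
counts = {("E", "M"): 50, ("E", "F"): 25, ("U", "M"): 10, ("U", "F"): 15}
total = sum(counts.values())

# Marginal probabilities
p_male = (counts[("E", "M")] + counts[("U", "M")]) / total        # 0.6
p_employed = (counts[("E", "M")] + counts[("E", "F")]) / total    # 0.75

# Conditional probability Pr[E | M] = Pr[E, M] / Pr[M]
p_emp_given_male = (counts[("E", "M")] / total) / p_male          # 0.833

# Bayes theorem: Pr[M | E] = Pr[M] * Pr[E | M] / Pr[E]
p_male_given_emp = p_male * p_emp_given_male / p_employed

# Direct check from the table: employed males / all employed
direct = counts[("E", "M")] / (counts[("E", "M")] + counts[("E", "F")])

print(f"Pr[M | E] via Bayes = {p_male_given_emp:.4f}, direct = {direct:.4f}")
```

Carrying full precision gives 0.6667; the 0.666 above comes from rounding Pr[E | M] to .833 before multiplying.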