
Making Confident Decisions

Parameter Distributions from Multiple Samples:
The Central Limit Theorem
Two important aspects are:
• Repetitions of the measurement event produce different
outcome results.
• The resulting distribution of measurements, called the sampling
distribution, is normally distributed.
• Statisticians call repeated measurements of a characteristic or
a process samples. The variation that occurs across repeated
sampling events is called its sampling distribution.
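The central limit theorem can be demonstrated with a short simulation, a minimal sketch in Python (the uniform population and the sample sizes here are illustrative, not from the text):

```python
import random
import statistics

random.seed(42)

# Draw many samples from a decidedly non-normal (uniform) population
# and record each sample's average.
sample_means = [
    statistics.mean(random.uniform(0, 10) for _ in range(30))
    for _ in range(5000)
]

# The sampling distribution of the mean clusters around the population
# mean (5.0) and is approximately normal, with a standard deviation
# near sigma/sqrt(n) = (10/sqrt(12))/sqrt(30), about 0.53.
print(statistics.mean(sample_means))
print(statistics.stdev(sample_means))
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the underlying population is flat.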
Calculating Decision Risk
Confidence Intervals
• The key to objective decision-making lies in confidence intervals
• Use the central limit theorem to quantify how much confidence you
can place in any of your measurements or statistical conclusions
from samples
Confidence intervals for means
Making decisions with large samples
• For large samples (more than 30 data points), the confidence
interval around the true population average μ is

  x̄ − Z(s/√n) ≤ μ ≤ x̄ + Z(s/√n)

• Z is the sigma value corresponding to the desired level of confidence
you want to have.
• Anytime you calculate a confidence interval, you also have an
associated risk of being incorrect.
• This risk is simply the complement of the calculated confidence.
• The risk of incorrectly concluding that the population average is
within the calculated confidence interval when it really isn’t is called
alpha (α) risk.
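As a sketch of the large-sample calculation, assuming the usual formula x̄ ± Z·s/√n with Z = 1.96 for 95-percent confidence (the data values below are hypothetical):

```python
import math
import statistics

# Illustrative data: 40 measurements (hypothetical values).
data = [14.2, 15.1, 13.8, 14.9, 15.3, 14.4, 14.7, 15.0, 13.9, 14.6,
        14.8, 15.2, 14.1, 14.5, 15.4, 14.3, 14.9, 14.0, 15.1, 14.7,
        14.6, 14.2, 15.0, 14.8, 13.7, 14.9, 14.4, 15.2, 14.5, 14.1,
        14.8, 14.6, 15.3, 14.0, 14.7, 14.3, 15.0, 14.9, 14.4, 14.6]

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)      # sample standard deviation

Z = 1.96                        # normal-table value for 95% confidence
margin = Z * s / math.sqrt(n)

lower, upper = x_bar - margin, x_bar + margin
print(f"95% CI for the mean: [{lower:.2f}, {upper:.2f}]")
```

The alpha risk here is the 5 percent left over: roughly 1 time in 20, an interval built this way misses the true population average.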
Making decisions with small samples
• When the sample has anywhere from 2 to 30 data points, you have
to use a different factor in place of Z
• Statisticians call this new factor for small-sized samples t.
• t is more conservative because the smaller sample size lessens the
accuracy of the calculated value for the standard deviation
• Using t, the formula for the confidence interval around the true
population average becomes

  x̄ − t(s/√n) ≤ μ ≤ x̄ + t(s/√n)

• Where the value for t depends on the desired level of confidence and
the number of data points in the sample.
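A minimal sketch of the small-sample case. The t value of 2.776 is the standard t-table entry for 95-percent confidence with n − 1 = 4 degrees of freedom; the five data points are hypothetical:

```python
import math
import statistics

# Small sample: five measurements (hypothetical values).
data = [3.2, 2.9, 3.5, 3.1, 3.4]

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)

# t for 95% confidence and n - 1 = 4 degrees of freedom,
# read from a standard t table.
t = 2.776

margin = t * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"95% CI for the mean: [{lower:.3f}, {upper:.3f}]")
```

Because t = 2.776 is larger than the Z value of 1.96, the resulting interval is wider, reflecting the extra uncertainty of the small sample.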
Comparing averages
• Take samples from each of the different versions or conditions
you’re comparing.
• Calculate the appropriate confidence interval for each
different version or condition of the characteristic.
• Graphically or numerically determine whether the confidence
intervals of the different versions or conditions overlap at all.
Confidence intervals for standard
deviations
• To construct a confidence interval around your calculated
sample standard deviation, you have to use a new statistical
factor called χ² (chi-squared)
• The value of χ² depends on how many data points are in the
sample; the more data points in the sample, the more
confident the estimate
• There are different values of χ² for the lower and upper limits
of the confidence interval:

  √((n − 1)s²/χ²LOWER) ≤ σ ≤ √((n − 1)s²/χ²UPPER)

• Suppose your sample of five data points leads to a sample
standard deviation of 3.7. Create a 95-percent confidence
interval for the population standard deviation.
• Corresponding to a 95-percent confidence and n = 5, χ²LOWER
= 11.365 and χ²UPPER = 0.460:

  √(4(3.7)²/11.365) ≤ σ ≤ √(4(3.7)²/0.460)

• With 95-percent confidence, you know that the real
population standard deviation lies somewhere between 2.195
and 10.907.
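The worked example can be checked directly; the χ² factors below are the ones given in the text (the upper limit computes to about 10.911, agreeing with the text's 10.907 up to table rounding):

```python
import math

n = 5
s = 3.7

# chi-squared factors for 95% confidence and n = 5, as given in the text.
chi2_lower = 11.365
chi2_upper = 0.460

lower = math.sqrt((n - 1) * s**2 / chi2_lower)
upper = math.sqrt((n - 1) * s**2 / chi2_upper)
print(f"95% CI for sigma: [{lower:.3f}, {upper:.3f}]")
```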
Comparing Variances
Constructing a confidence interval around the ratio of two sample
variances requires another statistical factor, F, whose value depends
on three things:
• The desired level of confidence,
• The number of data points in the numerator distribution (n1),
• The number of data points in the denominator distribution (n2)

• The confidence interval for the ratio of the two population
variances is

  (s1²/s2²)(1/F1) ≤ σ1²/σ2² ≤ (s1²/s2²)F2

• Where F1 and F2 come from the F table for the chosen level of
confidence and the two sample sizes.
• If the ratio confidence interval includes the value 1 within its
limits, the two populations have equal variability.
• If the confidence interval doesn’t contain the value of 1 within its
limits, the two populations have different amounts of variation.
• Suppose you have a ten-point sample from population A, and its
variance = 4. Another population, called B, has a sample of five
points, and its variance = 7.5. Calculate the confidence interval for
the ratio of the variances.

• This confidence interval contains the value of 1 within its limits, so
all you can say with 95-percent confidence is that no evidence
indicates that the two populations have different variances.
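A sketch of the variance-ratio interval for this example. The two F-table values below are assumed from a standard 95-percent F table (α/2 = 0.025) and should be checked against your own table:

```python
# Compare variances of two populations via a ratio confidence interval.
var_a, n_a = 4.0, 10   # sample A: ten points, variance 4
var_b, n_b = 7.5, 5    # sample B: five points, variance 7.5

ratio = var_a / var_b

# Assumed F-table values: F(0.025; 9, 4) ~ 8.90 and F(0.025; 4, 9) ~ 4.72.
f_1 = 8.90
f_2 = 4.72

lower = ratio / f_1
upper = ratio * f_2
print(f"95% CI for variance ratio: [{lower:.3f}, {upper:.3f}]")

contains_one = lower <= 1 <= upper
print("No evidence of different variances:", contains_one)
```

Because the interval straddles 1, the two samples don't provide evidence that the population variances differ, matching the text's conclusion.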
Confidence intervals for
proportions
• To calculate the number of successes out of a certain number
of attempts, like “four out of five dentists recommend
sugarless gum”, this proportion can be written as

  p = y/n

• Where y is the number of successes and n is the total number
of attempts or trials.
• Proportions can never be less than zero or greater than one
• Calculating a proportion creates yet another sampling distribution.
The resulting confidence interval around a calculated proportion is

  p − Z√(p(1 − p)/n) ≤ P ≤ p + Z√(p(1 − p)/n)

• For example, if you wanted to be 90-percent sure of the calculated
proportion for the four out of five dentists, calculate the confidence
interval.
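For the dentist example, a sketch assuming Z = 1.645 for 90-percent confidence; because a proportion can never exceed 1, the upper limit is clipped:

```python
import math

# "Four out of five dentists": y = 4 successes in n = 5 trials.
y, n = 4, 5
p = y / n

Z = 1.645  # normal-table value for 90% confidence

margin = Z * math.sqrt(p * (1 - p) / n)
lower = max(0.0, p - margin)   # proportions can't go below 0
upper = min(1.0, p + margin)   # ...or above 1
print(f"90% CI for the proportion: [{lower:.3f}, {upper:.3f}]")
```

With only five trials the interval is very wide, which is itself a useful warning about drawing conclusions from tiny proportion samples.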
Difference of proportions
• you’re part of a company with two production lines. You suspect
that your Toledo (T) plant produces a higher proportion of good
items (yield) than your Buffalo (B) plant. You select samples of size
nT = nB = 300 from each plant and find that the number of good
items from the Toledo plant (yT) is 213, while the number from the
Buffalo plant (yB) is 189. That means that a 95-percent confidence
interval for the difference between the Toledo and the Buffalo
yields is

  (pT − pB) ± Z√(pT(1 − pT)/nT + pB(1 − pB)/nB)

• or equivalently, [0.004, 0.156]. Because this confidence interval
doesn’t include zero, you can conclude with 95-percent confidence
that the Toledo plant produces, on average, a higher proportion of
good items than the Buffalo plant.
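The Toledo/Buffalo calculation can be checked as follows; with Z = 1.96 this reproduces the interval in the text to within rounding:

```python
import math

# Toledo vs. Buffalo yields from the example.
y_t, n_t = 213, 300
y_b, n_b = 189, 300

p_t = y_t / n_t   # 0.71
p_b = y_b / n_b   # 0.63

Z = 1.96  # normal-table value for 95% confidence
margin = Z * math.sqrt(p_t * (1 - p_t) / n_t + p_b * (1 - p_b) / n_b)

lower = (p_t - p_b) - margin
upper = (p_t - p_b) + margin
print(f"95% CI for the yield difference: [{lower:.3f}, {upper:.3f}]")
print("Toledo significantly better:", lower > 0)
```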
DMAIC: Improving
and Controlling
Forecasting Future Performance
Improving and Controlling
• With the critical few factors known, the
company is ready to embark on a journey toward
improvement
• Now we need to quantify potential improvement
effects so that we can see how much change in the
output we will get for any given adjustment to the
critical inputs.
• We also need to understand the relationships
between the input variables and the critical
outputs
• Knowing the outcome of a potential improvement
without spending the resources to test it out is the
essence of Six Sigma improvement power
Correlation
• Scatter plots are a great way to visually discover and explore
relationships between variables — both between Ys and Xs
and between Xs and Xs.
• The calculated correlation coefficient is always between –1
and 1
• The sign of r tells the direction of the relationship between
the variables.
• The absolute value of r tells how strong the relationship is
• Correlation tells only how linear the relationship between the
variables is.
• The scatter plot shows a negative relationship between the
curb weight of the vehicle and its fuel economy
• The relationship between the two variables is approximately
linear, meaning that its shape approximately follows a straight
line.
• The relationship between the variables is fairly strong
• Correlation basically just confirms the existence of a
linear relationship between two variables and quantifies
how linear that relationship is.
• Correlation doesn’t equal causation.
• The fact that two variables correlate doesn’t mean that
one causes the other.
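A sketch of computing Pearson's r by hand; the weight/mpg pairs below are hypothetical stand-ins for the vehicle table mentioned in the text:

```python
import math

# Hypothetical curb-weight (lbs) and fuel-economy (mpg) pairs.
weight = [2500, 2800, 3000, 3140, 3400, 3600, 3900, 4200]
mpg = [33, 31, 30, 29, 26, 25, 22, 21]

n = len(weight)
mean_x = sum(weight) / n
mean_y = sum(mpg) / n

# Pearson correlation coefficient r:
# covariance term over the product of the spread terms.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(weight, mpg))
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in weight))
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in mpg))
r = cov / (sd_x * sd_y)

# Negative and close to -1: a strong negative linear relationship.
print(f"r = {r:.3f}")
```

The sign confirms the direction (heavier cars get fewer mpg) and the magnitude confirms the strength, but says nothing about causation.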
Curve Fitting
Curve fitting means determining the equation for
the curve that best fits the data
• It quantitatively shows what effect one variable has
on another, which variables are significant
influencers, and which ones are just in the noise
• Finally, it also shows how much of the system
behavior the equation does not explain.
• The goal of curve fitting is to develop an
approximate equation that describes the system’s or
process’s statistical behavior as much as possible.
• Regression is used to explain the statistical behavior
Simple linear regression
• In simple linear regression, assume that each observed
output point Yi can be described by a two-part equation:

  Yi = β0 + β1Xi + ε

• The β0 + β1X part of the equation for Y is just an equation for a
straight line
• β0 by itself tells at what value the fitted line crosses the Y axis
• β1 is the line’s slope
• The ε part is a normal, random distribution with a center value
equal to zero.
• Mathematically determine values for β0 and β1 so that the
resulting line fits the observed X–Y data as closely as possible
with the minimum amount of error

  β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
  β0 = ȳ − β1x̄

• Where xi and yi are the paired data points, and x̄ and ȳ are the
calculated averages for all the X points and all the Y points,
respectively.
• For the example given before, β0 = 47.3 and β1 = –0.00632, so
the line is

  Ŷ = 47.3 − 0.00632X

• Where Ŷ represents the estimate or prediction for Y, not an
actual observed value for Y.
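The least-squares recipe can be sketched in a few lines of Python. The data here is hypothetical, so the fitted coefficients differ from the text's β0 = 47.3 and β1 = –0.00632:

```python
# Least-squares fit of a straight line to hypothetical
# curb-weight (X, lbs) vs. fuel-economy (Y, mpg) pairs.
xs = [2500, 2800, 3000, 3140, 3400, 3600, 3900, 4200]
ys = [33, 31, 30, 29, 26, 25, 22, 21]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
beta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
beta0 = y_bar - beta1 * x_bar


def predict(x):
    return beta0 + beta1 * x


residuals = [y - predict(x) for x, y in zip(xs, ys)]
print(f"fitted line: Y-hat = {beta0:.2f} + ({beta1:.5f})X")
```

A least-squares line always passes through (x̄, ȳ), and its residuals sum to zero, which is a quick sanity check on any implementation.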
Points of Caution
• Do not extend the predictions very far beyond
the range of the data in the study.
• The derived equation for the line is missing the ε
component.
• The line predicts only the expected average
performance.
• In reality, the actual performance level varies from
the predicted value
Discovering residuals and the
fitted model
• It is important to understand and quantify
the ε component of the regression
equation
• For each of the n data points in the study,
the error term ei = Yi − Ŷi can be calculated
• An error term ei shows how far off the
predictive equation line is from the
observed data.
• For example, referring to Table earlier in the lecture, the data
for the Toyota Camry shows that its curb weight is 3,140
pounds and its fuel economy is 29 mpg.
• Plugging an X value of 3,140 pounds into the derived
regression equation shows

  Ŷ = 47.3 − 0.00632(3,140) = 27.5 mpg

• The difference between the observed and the predicted fuel
economy is

  e = Y − Ŷ = 29 − 27.5 = 1.5 mpg

• These ei terms are called residuals, or what’s left over after
using the predictive equation
• The beginning assumption of the predictive
linear equation is that it has a secondary ε part
that is a normal, random distribution with a
center value equal to zero.
• The most efficient way to check the predictive
linear equation’s validity is graphically reviewing
the residuals to make sure they’re behaving as
per the assumption
Create up to four different graphical checks of the
residuals
• A scatter plot of the residuals ei versus the predicted values
from the derived equation
• A scatter plot of the residuals ei versus the observed X data
• Additional scatter plots of the residuals ei versus any other X
variables that you didn’t include in your equation
• A run chart of the residuals ei versus the previous residuals
ei-1 if you collected your study data sequentially over time
In each of these graphical residual checks, look for the following:
• The variation has no obvious patterns and is truly random,
like a cloud of scattered dots.
• The residual variation is centered on the value of zero.
Examples of residual-checking plots for the earlier automobile
weight-fuel economy study are as follows:
• An added way to investigate how good the derived regression
model is involves looking at the variation of the output
variable Y.
• This assessment is based on a squared error basis.
• The total sum of the squared error (SSTO) in the output
variable Y is

  SSTO = Σ(Yi − Ȳ)²

• Where Yi are the n observed output values and Ȳ is their average.
• In a similar way, state the squared error from just the derived
regression equation (SSR) as

  SSR = Σ(Ŷi − Ȳ)²

• Where Ŷi are the predicted estimates for the n data points.
• Finally, express the squared error from the remaining ε
variation (SSE) as

  SSE = Σ(Yi − Ŷi)²

• Together, these three squared error terms can be related with
the simple sum

  SSTO = SSR + SSE
You can do three important tests with these squared error terms:
• Calculate the coefficient of determination, R², for your predictive
model:

  R² = SSR/SSTO

• R² tells you how much of the total observed variation is
determined, or explained, by your linear model
• For a business setting, you want this number to be 80 percent or
higher.
• With a high R² value, you can know that your predictions will be
close and not dominated by the unexplained variation.
• For the automobile study, R² = 0.94: ninety-four percent of the
observed variation is explained by your derived linear model.
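The squared-error decomposition can be verified numerically; this sketch reuses the same hypothetical weight/mpg data, and SSTO = SSR + SSE holds exactly for a least-squares fit:

```python
# Decompose the output variation into SSTO = SSR + SSE
# and compute R^2 = SSR / SSTO for a least-squares line.
xs = [2500, 2800, 3000, 3140, 3400, 3600, 3900, 4200]
ys = [33, 31, 30, 29, 26, 25, 22, 21]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
beta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
beta0 = y_bar - beta1 * x_bar
y_hat = [beta0 + beta1 * x for x in xs]

ssto = sum((y - y_bar) ** 2 for y in ys)              # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained by the line
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # leftover (epsilon)

r_squared = ssr / ssto
print("SSTO = SSR + SSE holds:", abs(ssto - (ssr + sse)) < 1e-9)
print(f"R^2 = {r_squared:.3f}")
```

From the same quantities you can also estimate the ε standard deviation as √(SSE/(n − 2)), which feeds the F test described next.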
• Quantify the unexplained ε variation in terms of its standard
deviation.
• The ε value is an inherent part of your predictive linear
equation, so you need a way to figure out how big its variation
is:

  σε ≈ √(SSE/(n − 2))

• This estimate comes in handy when you want to mimic what
may happen in reality.
• For the automobile curb weight versus fuel economy study,
you can estimate the standard deviation of the unexplained
variation using this formula.
• Perform an F test to quantify your confidence in the validity of
your regression model.
• Statistically compare the variation explained by your
regression model to the unexplained variation.
• Another way to mathematically represent the variation in the
regression model is by an estimate of its variance:

  MSR = SSR/1 and MSE = SSE/(n − 2)

• Creating a ratio of these two, F = MSR/MSE, is just like the
confidence intervals for comparing the size of two different
distributions
• You can say with 95-percent or 99-percent confidence,
whichever level of confidence you select from the F table,
that your derived predictive model is, in fact, valid
• For the automobile curb weight versus fuel economy study, if
you want to be 99-percent confident with your n = 10 data
points, the F test of the variances becomes

  F = MSR/MSE = 16.4

• Because the calculated ratio value of 16.4 is greater than the
critical 99-percent F value of 11.3, you can conclude with
99-percent confidence that your derived model is valid.
Multiple linear regression
• When you work to create an equation that includes
more than one variable, such as Y = f(X1, X2, . . ., Xn), you
use multiple linear regression.
• If you have a system where X1 and X2 both contribute to Y,
the multiple linear regression model becomes

  Y = β0 + β1X1 + β2X2 + ε