Welcome to STAB27

Instructor: Dr. Ken Butler
Lectures whenever (on Intranet: intranet.utsc.utoronto.ca, My Courses)

Contact information

• E-mail: butler@utsc.utoronto.ca
• Office: H 417
• Office hours: Mon whenever or by appointment
• Phone: 5654 (416-287-5654)

Review

Begin with review of testing and confidence interval procedures, to be sure we have them straight.

Use the BRAINPMI data set of Exercise 1.18. Contains times between death and autopsy (post-mortem interval, PMI) of 22 humans. Important because PMI affects conclusions from autopsy.

Assuming these 22 humans random sample from "all possible", can draw conclusions about mean PMI in population.

First get data from disk; copy, paste into Minitab (say c1).

Confidence interval for mean

Because we don't know the population SD, have to use t procedures, not z.

Get 95% confidence interval from Minitab: Stat, Basic Statistics, 1-sample t, select C1, click on Confidence Interval. Leave Level at 95%. Get this:

T Confidence Intervals

Variable   N   Mean  StDev  SE Mean        95.0 % CI
C1        22  7.300  3.185    0.679  ( 5.888, 8.712)

We believe, with 95% confidence, that the mean PMI in the population lies between about 5.9 and 8.7.

To be precise, if we were to follow this procedure many times, each time drawing a random sample from the population and then calculating the confidence interval, 95% of the intervals produced would contain the population mean PMI. We have to hope that our interval is one of those that does contain the mean.
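As a cross-check, here is a minimal Python sketch (an assumption – the course itself uses Minitab) that reproduces the interval from the summary statistics in the output:

from scipy import stats

n, mean, sd = 22, 7.300, 3.185          # from the Minitab output
se = sd / n ** 0.5                      # standard error of the mean, 0.679
t_star = stats.t.ppf(0.975, df=n - 1)   # critical value, about 2.080 for df = 21
print(mean - t_star * se, mean + t_star * se)   # about (5.888, 8.712)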

Test for mean

Normally, having calculated a confidence interval, you wouldn't do a test for the same thing. But for illustration, let's test whether the population mean PMI is 7.

In Minitab, select Stat, Basic Statistics, 1-sample t, then select C1 again. This time, click on Test Mean, and put 7.0 into the box. I got this:

T-Test of the Mean

Test of mu = 7.000 vs mu not = 7.000

Variable   N   Mean  StDev  SE Mean     T     P
C1        22  7.300  3.185    0.679  0.44  0.66

Minitab gives us the null hypothesis (pop. mean is 7) and the alternative (pop. mean not 7).

The P-value 0.66 at the end is not small (eg. smaller than 0.05), so there is no reason to reject the null hypothesis (note wording). The population mean could be 7, even though the sample mean is 7.3, just by chance. The mean is not significantly different from 7.
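The same test from the summary statistics, as a hedged Python sketch (scipy assumed; with the raw data in hand, scipy.stats.ttest_1samp(data, 7.0) does the same job):

from scipy import stats

n, mean, sd, mu0 = 22, 7.300, 3.185, 7.0
t = (mean - mu0) / (sd / n ** 0.5)      # test statistic, about 0.44
p = 2 * stats.t.sf(abs(t), df=n - 1)    # two-sided P-value, about 0.66
print(t, p)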

Why do confidence intervals and tests?

We do them because they are the most important part of statistical inference: using Statistics to draw conclusions about data.

We'll be seeing many more of them in this course.

Confidence intervals and tests are always for a population parameter like population mean, because always being used to draw conclusion about population based on sample.

Summary

• Confidence intervals and tests are ways to draw conclusions from data.
• Have a population quantity eg. population mean that we don't know and want to estimate from data (eg. using sample mean).
• Confidence interval says what population quantity might be.
• Test says whether population quantity might be particular value.

Introduction to Regression Analysis

What is regression analysis?

Example: measure high-school GPA for a number of students, and also measure their SAT scores. These data can be used to see whether SAT score really does depend on high-school GPA, and if so, how.

In general, regression is a way of predicting one variable from others, to discover what the relationship is (if any), and to use the data to make predictions about the population (like SAT scores for other students for whom we only know the GPA).

Some symbols

We need some symbols to describe what we're doing.

Let y denote the variable we're trying to predict (like SAT score). Called dependent variable or response variable.

Let x denote the variable we're trying to predict it from (eg. high-school GPA). Called independent variable or predictor variable.

Observe data on x and y; if anything is going on, y depends on x.

There could be chance involved, so let's split up y into two parts:

y = E(y) + ε.

E(y) is the population mean, while ε is a random (chance) error that prevents y from being exactly E(y).

There can be more than one predictor variable: in that case, call them x1, x2, ..., xp.

Sometimes we get to choose the x's. Example: measuring effect of temperature and pressure on impurity of batches of chemical. Can set temperature and pressure. Called (statistical) experiment. Advantage: may be able to demonstrate cause and effect: eg. changing temperature changes impurity; nothing else has effect.

Otherwise, can only observe x's and y's; observational study, much harder to show cause and effect (might be other causes). Eg. SAT scores may depend on high-school GPA, but can't control GPA; may be other causes like tutoring, socio-economic status, etc.

Straight-line models

Straight lines, mathematically

A straight line, like y = 2 + 3x, looks like this: [plot of the line y = 2 + 3x omitted]

Number on its own, 2 here, called intercept: value of y when x = 0.

Number next to x, here 3, called slope: increase in y when you increase x by 1.

Specify a straight line by giving slope and intercept.

Scatterplot

Nice way to see x, y data is to plot it. If you plot each x-value against its corresponding y-value you get something like this: [scatterplot of sales vs. advertising omitted]

This shows sales (on the vertical scale) against amount of advertising (on the horizontal scale). Appears that more advertising goes with more sales!

Also appears that the relationship could be fairly well described by a straight line. But which straight line?

That is, which slope and intercept best describe this relationship?

Have to define what we mean by "best".

Least squares

We want the straight line to go "close" to points on scatterplot.

Pick a line, say y = 0.2 + 0.5x. In the data, when x = 3, y = 2; on this line, when x = 3, ŷ = 0.2 + (0.5)(3) = 1.7. (Use ŷ to denote the point on the line.) This is off by 2 − 1.7 = 0.3.

Can do this for all 5 points for this line. Get an "error" at each point – some + (above line), some − (below). To combine errors, square them first to make them all +, then add up to get sum of squared errors (SSE).

Can try again with a different line; get different sum of squared errors. Idea: choose line that makes SSE smallest. This is the principle of least squares.

Formulas via calculus for slope and intercept of this best line (p. 96), but we'll get answers from Minitab.

Select Stat, Regression, Regression again. Response is sales; double-click to select. Predictor is advert; again select. Click OK. I got this:

Regression Analysis

The regression equation is
sales = - 0.100 + 0.700 advert

Predictor       Coef   StDev      T      P
Constant     -0.1000  0.6351  -0.16  0.885
advert        0.7000  0.1915   3.66  0.035

S = 0.6055   R-Sq = 81.7%   R-Sq(adj) = 75.6%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  4.9000  4.9000  13.36  0.035
Residual Error   3  1.1000  0.3667
Total            4  6.0000

From "regression equation", see that least-squares intercept is −0.1 and slope is 0.7. This is "best" straight line. In the bottom table, look for SS ("sum of squares") for residual error; 1.1 is smallest possible value.

(Errors of data points from this line are 0.4, -0.3, 0, -0.7, 0.6; sum of squares of these is 1.1.)
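A minimal least-squares sketch in Python (numpy assumed; the notes use Minitab). The data here are a reconstruction consistent with all the numbers above (slope 0.7, intercept −0.1, the five errors, SSE 1.1), not necessarily the real ADSALES file:

import numpy as np

x = np.array([1, 2, 3, 4, 5])           # advert (assumed values)
y = np.array([1, 1, 2, 2, 4])           # sales (assumed values)

slope, intercept = np.polyfit(x, y, 1)  # least-squares line
resid = y - (intercept + slope * x)     # errors of points from line
print(intercept, slope, (resid ** 2).sum())   # -0.1, 0.7, SSE = 1.1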

Assumptions

In order to go further, we need to make some assumptions (and later check them).

We earlier wrote y = E(y) + ε, to allow for the points not being exactly on the line (the ε are "random errors").

Make two assumptions about the random ε:

• ε have mean 0 and constant SD σ (same for all x).
• ε have a normal distribution.

These assumptions make it possible for us to get answers, but (later) need to worry about checking assumptions.

Estimating σ

σ is a population parameter, so we don't know it. However, we can estimate it from the data.

The further the points are off the line, the larger σ should be. Now, SSE measures how far the points are off the line, so can estimate σ² from SSE: precisely, SSE divided by n − 2, where n is the number of data points you have.

In the ADSALES data, we had n = 5 points, SSE was 1.1, so estimate σ² as 1.1/(5 − 2) = 0.3667, and therefore estimate σ as √0.3667 = 0.6055. Compare output:

S = 0.6055   R-Sq = 81.7%   R-Sq(adj) = 75.6%

Source          DF      SS      MS
Residual Error   3  1.1000  0.3667

S is the estimate of σ; the estimate of σ² is SSE divided by its "degrees of freedom".

Inference about slope

We only ever have a sample of data; we want to know something about the population. So there has to be some uncertainty about the figures we get from the sample, like the estimated slope.

If the points are all close to a line, then we'll be estimating the slope of the line well; if not, not.

In the ADSALES data we have:

Predictor       Coef   StDev      T      P
Constant     -0.1000  0.6351  -0.16  0.885
advert        0.7000  0.1915   3.66  0.035

The estimated slope is 0.7. The figure in the "StDev" column next to it expresses uncertainty about the slope.

Confidence interval for population slope

The reason we don't know the "real" slope is that we don't know about the population; we only have a sample and a sample slope (from the regression line). If we knew the population slope exactly, the "StDev" figure would be 0.

We can make a confidence interval for the population slope. We use the sample slope, its StDev, and the t-distribution (because we don't know σ).

For, say, a 95% interval, look in t table (p. 762). 95% means 5% "cut off", or 0.025 each end. Look in the 0.025 column and the row for the error df, 3 here. This gives 3.182.

Then the 95% confidence interval for the slope is

0.7 ± 3.182(0.1915) = (0.09, 1.31).

This doesn't pin down the population slope very well, because we don't have much data.

For a different interval (say 90%), decide how much to cut off each end – here half of 10%, or 0.05. Use the error df again.
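The same interval in a couple of lines of Python (scipy assumed; the notes read 3.182 from a printed t table):

from scipy import stats

slope, se, df = 0.7, 0.1915, 3     # from the Minitab output
t_star = stats.t.ppf(0.975, df)    # 3.182
print(slope - t_star * se, slope + t_star * se)   # about (0.09, 1.31)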

Hypothesis test for slope

To decide whether the population slope could be a certain value, we need a hypothesis test or test of significance.

If the slope is 0 in the population, it means the line is flat: there's no relationship between x and y. So this is a good value to test: null hypothesis that the population slope is 0, alternative that it is not.

Minitab makes this test easy:

Predictor       Coef   StDev      T      P
Constant     -0.1000  0.6351  -0.16  0.885
advert        0.7000  0.1915   3.66  0.035

Look along the advert line. The P-value for this test is 0.035, which is small (smaller than 0.05).

So you would reject the null hypothesis and conclude that the population slope is not 0, ie. that there is a relationship between advertising and sales here.

With a smaller cutoff for the P-value, like 0.01, you wouldn't reject the null, so the evidence for a relationship isn't that strong.

Notice also that the 95% confidence interval contains all positive numbers: the population slope, at the 5% level, is positive.

Correlation and R-squared

So far, we've seen whether there is a relationship, not worried much about how strong it is. To measure that, we need to look at the correlation.

The correlation is a number between −1 and 1. 1 means a perfect upward trend, −1 means a perfect downward trend, and 0 means no trend at all. See the pictures on p. 118.

To calculate the correlation in Minitab, go to the Stat menu, select Basic Statistics and Correlation. Select the x and y variables (for the ADSALES data, advert and sales), then click OK. I got this:

Correlations (Pearson)

Correlation of advert and sales = 0.904, P-Value = 0.035

The correlation for these data is a high 0.904; the P-value is same as for test of slope (not a coincidence).

A high correlation means that x and y tend to go together for your data. It doesn't mean that x causes y; there could be many other variables that cause the effect.

Magic phrase: correlation does not imply causation.

But in a statistical experiment (as opposed to observational study) have tried to level out effects of other variables. Then, can be more confident that correlation is cause and effect.

R-squared

The correlation squared, called R-squared, has another nice interpretation in regression. Recall this (ADSALES data again):

Regression Analysis

The regression equation is
sales = - 0.100 + 0.700 advert

Predictor       Coef   StDev      T      P
Constant     -0.1000  0.6351  -0.16  0.885
advert        0.7000  0.1915   3.66  0.035

S = 0.6055   R-Sq = 81.7%   R-Sq(adj) = 75.6%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  4.9000  4.9000  13.36  0.035
Residual Error   3  1.1000  0.3667
Total            4  6.0000

Correlation squared is 0.904² = 0.817, same as R-Sq above.

But also regression SS divided by total SS is 4.9/6 = 0.817. Thus R-squared also says this here: "out of variation in y, 81.7% of it explained by fact that y depends on x". Higher the better.

In multiple regression (later), correlation not so useful, but R-squared still helpful.

Confidence interval for mean y and prediction interval for new y at particular x

Having decided that the regression line is useful, next step is to use it for prediction.

To simply predict y at a particular x, just use the line.

For example (ADSALES), the line was y = −0.1 + 0.7x. To predict sales when advertising is 4, put in x = 4 to get −0.1 + 0.7(4) = 2.7.

In the population, though, don't know "true" sales when advertising is 4; only have sample estimates, with uncertainty. Do 2 kinds of prediction:

1. mean value of y for given x.
2. Value of y for new observation with given x.

First says: "imagine all sales in population where advertising was 4, make CI for mean of those".

Second says: have one new observation with advertising 4, guess what sales is for that.

Second interval has more uncertainty: even if line known very well, still uncertainty about y.

To do in Minitab: select Regression twice, fill in y (sales) and x (advertising), then click Options. In box "Prediction intervals for new observations", enter 4 (value of x). Click OK twice. Get output from before plus:

Predicted Values

   Fit  StDev Fit         95.0% CI         95.0% PI
 2.700      0.332  ( 1.645, 3.755)  ( 0.503, 4.897)

Fit 2.7 is value from putting x = 4 into line. 95% CI says that mean sales when advertising is 4 lies between 1.645 and 3.755 – not very useful! But PI even less useful: sales for new month when advertising 4 could be between 0.5 and 4.9 – not useful at all! (Quite typical of PIs.)
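A hedged sketch of the same fit/CI/PI with statsmodels (an assumption; the notes use Minitab's Options dialog), using the reconstructed data from earlier:

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5])             # advert (assumed)
y = np.array([1, 1, 2, 2, 4])             # sales (assumed)
fit = sm.OLS(y, sm.add_constant(x)).fit()

pred = fit.get_prediction([[1.0, 4.0]])   # row = (constant, x = 4)
print(pred.summary_frame(alpha=0.05))     # mean (Fit), mean_ci (CI), obs_ci (PI)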

Multiple regression
Multiple regression model

Sales of a product might depend on advertising, but also on season, inventory, sales force, productivity.

In other words, our y-variable (sales) depends on not one but several x-variables.

This kind of dependence works with the multiple regression model:

y = β0 + β1 x1 + β2 x2 + · · · + βk xk + ε.

In other words: there is one intercept, a slope for each x-variable, and an error term.

Kinds of multiple regression model

The x's can represent several different things:

• genuine different numerical x-variables
• functions of numerical x-variables like x1² (allows modelling of curved not straight-line relationships)
• categorical x-variables like "sex", "age group" (see how later).

Estimation in multiple regression

We have to estimate:

• the intercept β0
• the slopes β1, β2, ..., βk
• the error SD σ

from the data.

Do so as before (least squares): for any proposed values for the βs, can get prediction ŷ from line, find line that makes sum of squared prediction errors smallest. (In practice, do by Minitab.)

Estimate σ² by s² = SSE/(n − (k + 1)).

Multiple regression in action

Let's look at an example:

Florida real estate agents kept track of some properties: specifically selling price, land value, value of improvements to the house, and the area (square footage). Can selling price be predicted from land value, improvements and area, and if so, how?

In Minitab, select Stat, Regression and Regression. Select selling price as the Response, and select the other three variables as the predictors. (Minitab knows to do a multiple regression because of more than one predictor.)
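In Python the same kind of fit might look like this (statsmodels assumed; the file name and column names here are hypothetical stand-ins for the real-estate data):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("realestate.csv")   # assumed columns: price, land, improve, area
fit = smf.ols("price ~ land + improve + area", data=df).fit()
print(fit.summary())                 # coefficients, t tests, R-Sq, ANOVA pieces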

Regression Analysis

The regression equation is
price = 1470 + 0.814 land + 0.820 improve + 13.5 area

Predictor      Coef   StDev     T      P
Constant       1470    5746  0.26  0.801
land         0.8145  0.5122  1.59  0.131
improve      0.8204  0.2112  3.88  0.001
area         13.529   6.586  2.05  0.057

S = 7919   R-Sq = 89.7%   R-Sq(adj) = 87.8%
<...>
Unusual Observations
Obs  land  price    Fit  StDev Fit  Residual  St Resid
 11  7300  40500  54981       3358    -14481    -2.02R
 17  4000  56000  40024       3401     15976     2.23R

R denotes an observation with a large standardized residual

The R-squared is high, suggesting that the selling prices are well predicted by the other variables. The best prediction of selling price for these data is given by the regression equation.

Minitab marks two houses as "unusual": because they're marked with an "R", house #11 and house #17 are both badly predicted by this regression. House #11 was predicted to sell for $55,000 ("Fit" column), but actually sold for just over $40,000; house #17 was predicted to sell for $40,000, but actually sold for $56,000.
Interpreting intercept and slopes

The regression equation is
price = 1470 + 0.814 land + 0.820 improve + 13.5 area

1470 is intercept: value of y (price) when other variables all 0 (land, improve, area). Doesn't mean much here.

0.814 is slope for land value. Says that if land value increases by 1 unit, with other variables not changing, price will increase by 0.814 units.

Same idea for slopes of improve and area.

Note that interpretation of slopes requires other variables in model to be held fixed; gives idea of effect of that variable over and above others.

Confidence intervals and tests for slopes

As in 1-variable regression. Bear in mind interpretation.

Don't know "real" slope for a variable because don't know about population; only have sample and sample slope (from the regression equation).

Can make a confidence interval for population slope for a variable. As before, use the sample slope, its StDev, and t-distribution (because we don't know σ).

Assumptions same as before:

• ε have mean 0 and constant SD σ (same for all x).
• ε have a normal distribution.

Previous example: slope for land value 0.814, StDev 0.5122. More of output:

S = 7919   R-Sq = 89.7%   R-Sq(adj) = 87.8%

Analysis of Variance

Source          DF          SS          MS      F      P
Regression       3  8779676741  2926558914  46.66  0.000
Residual Error  16  1003491259    62718204
Total           19  9783168000

so 16 error df. Estimator of σ is s = 7919.

95% interval for slope of land value. Use t table (p. 762). Look in the 0.025 column and the row for the error df, 16 here. This gives 2.120.

Then the 95% confidence interval for the slope is

0.814 ± 2.120(0.5122) = (−0.27, 1.90).

Hypothesis test for a slope

If an x-variable has no effect on y, its slope will be 0, or at least not significantly different from 0. (Strictly, if the x-variable has no effect on y once the effect of the other x-variables has been allowed for.)

That is, the null hypothesis, that βi = 0, says that the variable has no effect; the alternative, βi ≠ 0, says that it has an effect.

As before, Minitab makes the test easy:

Predictor      Coef   StDev     T      P
Constant       1470    5746  0.26  0.801
land         0.8145  0.5122  1.59  0.131
improve      0.8204  0.2112  3.88  0.001
area         13.529   6.586  2.05  0.057

Look along the land line: P-value is 0.131, which is not smaller than 0.05, so we wouldn't reject the null hypothesis. Land value has no effect on selling price if value of improvements and area of property are in the regression.

P-value can change if variables are added to or taken away from the regression. Area of property is not quite significant (P-value 0.057).

But you could imagine taking land out of the regression, and predicting selling price from only improve and area; in the dialog box, put only improve and area in Predictors:

The regression equation is
price = 98 + 0.960 improve + 16.4 area

Predictor      Coef   StDev     T      P
Constant         98    5931  0.02  0.987
improve      0.9604  0.2004  4.79  0.000
area         16.373   6.617  2.47  0.024

area now clearly has an effect on price.

Explanation: land and area predict price in similar way, so that having both is unnecessary, but having one is useful.

(In regression with land and improve only, land is nearly significant.)

R-squared, correlation, and test for whole model

Minitab quotes an R-squared for multiple regressions as well. Interpret as before (regression containing all 3 x's):

S = 7919   R-Sq = 89.7%   R-Sq(adj) = 87.8%

This is high, 89.7%, so regression is doing a good job of predicting selling price.

Doesn't mean that regression is best in any way, just good.

Correlation is only defined between two variables, not meaningful in multiple regression. So R-squared here only defined as

R² = regression SS / total SS.

To check:

Source          DF          SS          MS      F      P
Regression       3  8779676741  2926558914  46.66  0.000
Residual Error  16  1003491259    62718204
Total           19  9783168000

8779676741/9783168000 = 0.897.

Above table also contains P-value, 0 to 3 decimals. So some null hypothesis is being rejected, but what?

Called global F-test: null hypothesis is that no x-variables help to predict y, alternative is that one or more of them do.

So here, one or more of land, improve, area does help to predict selling price. Doesn't mean they all do.

Test is of "something vs. nothing"; failing to reject here means the x-variables have no effect. Here, fortunately, those variables between them do predict selling price.

Useful as a first step; not rejecting: regression is a waste of time.

Comparing two models: the partial F-test

So far, we know how to test for one x-variable (t-test) and how to test for all of them (global F-test). How to test some of the x's?

Answer: partial F-test.

Fit regression containing all the x's under consideration; then remove those you want to test, see if fit is "significantly" worse.

Example: in real estate data, see if regression with all 3 x's better than regression with only land.

First, do regression with all 3 x's, note SS:

Analysis of Variance

Source          DF          SS          MS      F      P
Regression       3  8779676741  2926558914  46.66  0.000
Residual Error  16  1003491259    62718204
Total           19  9783168000

Then do regression with land only:

Analysis of Variance

Source          DF          SS          MS      F      P
Regression       1  6102224089  6102224089  29.84  0.000
Residual Error  18  3680943911   204496884
Total           19  9783168000

Then take difference in error SS divided by difference in error df, and divide that by (smaller error SS divided by its df):

F = [(3680943911 − 1003491259)/2] / [1003491259/16] = 21.35.

P-value for this is very small (Minitab or tables).

So we conclude that the smaller regression fits worse: that is, we should include both value of improvements and area of property in the regression, rather than taking both out.

Another way to do the same test is via only the regression with all three variables, provided you include the ones to be tested last. That is, do a regression with land, improve and area in that order. You get sequential SS:

Source   DF      Seq SS
land      1  6102224089
improve   1  2412784751
area      1   264667901

Add these up to get the top of the test statistic:

F = [(2412784751 + 264667901)/2] / [1003491259/16] = 21.35,

same as before. Conclusion is same: those two variables should be in the regression, since the fit is much worse without them.
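The partial F arithmetic, with its P-value, as a short sketch (scipy assumed; the notes get the P-value from Minitab or tables):

from scipy import stats

sse_small, df_small = 3680943911, 18   # land only
sse_full, df_full = 1003491259, 16     # all three x's

F = ((sse_small - sse_full) / (df_small - df_full)) / (sse_full / df_full)
print(F, stats.f.sf(F, df_small - df_full, df_full))   # 21.35, very small P-value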

Variations on regression

In this chapter, we will see how to use non-linear functions of x (like x²), how to include categorical x's, and what interactions are and how to model them. We'll also see how to fit some non-linear models (that can be made to look like multiple regression).

Quadratic models

Sometimes a straight line doesn't tell the whole story.

A physiologist investigated the effect of physical fitness on the immune system. Immunoglobulin (igg, milligrams) measures how well the immune system is working; maximal oxygen uptake (oxygen) is a measure of physical fitness.

Only appears to be one x-variable, but scatterplot of igg vs. oxygen is curved: [scatterplot omitted]

One way to model this kind of curve is to:

• calculate variable x²
• regress y on x and x²

To calculate x² in Minitab: select Calc then Calculator. Type a name like oxygensq in the Store Result box. In the Expression box, select oxygen by double-clicking on it. Select * (for multiplication), then select oxygen again. This multiplies the oxygen values by themselves, and saves the results in a new column called oxygensq.

Then run a regression predicting igg from oxygen and oxygensq. I got this:

The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor       Coef   StDev      T      P
Constant     -1464.4   411.4  -3.56  0.001
oxygen         88.31   16.47   5.36  0.000
oxygensq     -0.5362  0.1582  -3.39  0.002

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance

Source          DF       SS       MS       F      P
Regression       2  4602211  2301105  203.16  0.000
Residual Error  27   305818    11327
Total           29  4908029
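A hedged sketch of the same quadratic fit (statsmodels assumed; the data file name is hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("aerobic.csv")       # assumed columns: igg, oxygen
df["oxygensq"] = df["oxygen"] ** 2    # the squared term, as in Minitab's Calculator
fit = smf.ols("igg ~ oxygen + oxygensq", data=df).fit()
print(fit.params)                     # intercept, linear and quadratic slopes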

The R-squared is high, and the global F-test is clearly significant.

So the regression is good, but is it best? Do we need the oxygen-squared term, or is a straight line good enough?

The significant t-test for the slope of oxygensq says that this adds something to the prediction over and above the other variables. That is, the squared term is needed to capture the curve – the curve is real, not just chance.

The slope for oxygensq is −0.54. This is not far from zero, but far enough to be significant. The negative sign means that the curve "opens downward", as you see from the scatter plot.

Interactions

We've seen that adding an x² term is a way of dealing with one kind of curvedness in a relationship. Here's another kind:

Consider antique selling; in particular, selling grandfather clocks at auction. You might expect that an older clock will sell for a higher price, but also the price will be higher if the auction is more competitive: that is, if more people bid. This suggests predicting selling price from both age of clock and number of bidders. Using the GFCLOCKS data set (p. 173):

The regression equation is
price = - 1339 + 12.7 age + 86.0 bidders

Predictor       Coef   StDev      T      P
Constant     -1339.0   173.8  -7.70  0.000
age          12.7406  0.9047  14.08  0.000
bidders       85.953   8.729   9.85  0.000

S = 133.5   R-Sq = 89.2%   R-Sq(adj) = 88.5%

So far so good: R-squared is high, and both variables are strongly significant.

Let's see how well this model predicts in a different way: first separate the auctions into high and low numbers of bidders, then plot selling price against age showing whether bidders is high or low.

First in Minitab, select Manip then Code. We will code the number of bidders into "high" and "low", so select bidders into the top box. Below that, enter a new column, like c4. Then define "low" (0–8) and "high" (8–20). Column 4 will contain an "l" for each auction with a low number of bidders, and "h" for "high".

Now to plot: select Graph and then Plot. Select price and age into the y and x boxes, then click the arrow next to Annotation and then Data Labels. Click next to "use labels from column" and enter c4 in the box. Click OK twice. Plot: [labelled scatterplot omitted]

Notice how the h's are at the top of the picture and the l's at the bottom. Also, the h's seem to go up more quickly.

That is, auction prices go up faster with age when there are many bidders than when there are few.

Let's investigate mathematically: suppose

y = 1 + 2x1 + 3x2.

If x2 = 0, what happens as x1 increases by 1?

x1 = 0, x2 = 0: y = 1 + 0 + 0 = 1;
x1 = 1, x2 = 0: y = 1 + 2 + 0 = 3.

That is, y goes up by 2, the slope of x1. This is true for any x2 (try it!); it's what the slope means.

Now suppose

y = 1 + 2x1 + 3x2 + 4x1x2.

Put x2 = 0; when x1 = 0, y = 1, and when x1 = 1, y = 3 as before; y goes up by 2.

Now put x2 = 1. When x1 = 0, y = 1 + 0 + 3 + 0 = 4; when x1 = 1, y = 1 + 2 + 3 + 4 = 10. Now increasing x1 by 1 increases y by 6.

That is, in this model, the effect of x1 on y depends on the value of x2.

The term 4x1x2 is called an interaction term: it describes how x1 and x2 combine to influence y.

This is exactly what we want for the grandfather clocks: when the number of bidders is high, the price should increase faster with age than when #bidders is low.

So let's put an interaction term into our regression. First create a column containing the age values times the bidders values.

Select Calc and Calculator. Name the new variable something like agebid (type into the top box) and define it as age * bidders. (Similarity with defining x² before.) Then do a regression predicting price from age, bidders and agebid:

The regression equation is
price = 320 + 0.88 age - 93.3 bidders + 1.30 agebid

Predictor      Coef   StDev      T      P
Constant      320.5   295.1   1.09  0.287
age           0.878   2.032   0.43  0.669
bidders      -93.26   29.89  -3.12  0.004
agebid       1.2978  0.2123   6.11  0.000

S = 88.91   R-Sq = 95.4%   R-Sq(adj) = 94.9%

R-squared has gone up (was 89% before); more important, t-test for interaction term significant.

That is, interaction helps to predict over and above age and bidders. The selling price depends on age differently for each number of bidders.

Interpreting higher-order models

Regressions with x² terms or x1x2 interactions are called higher-order models. These are both order 2 because the overall power is 2.

Interpret a higher-order model by first testing the highest-order terms. Don't remove any lower-order terms containing the same variables, even if they're not significant.

Thus for the grandfather clocks, the right model includes age and bidders as well as the interaction, even though age isn't significant.
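For comparison, a sketch of the interaction fit in Python (statsmodels assumed; in a formula, age:bidders builds the product column that the notes create by hand as agebid; the file name is hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("gfclocks.csv")   # assumed columns: price, age, bidders
fit = smf.ols("price ~ age + bidders + age:bidders", data=df).fit()
print(fit.summary())               # the interaction t-test is the age:bidders row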

Other higher-order models

The amount charged for sending a package using a regional express delivery service depends on its weight and the distance sent. But cost to the delivery company depends on other things too (like size of package and how full delivery truck is).

Supposing we only had weight of package and distance shipped. Could we predict cost of delivery (data set EXPRESS)?

First fit model predicting cost from weight and distance. Fit is good: R-squared 91.5%, error SS 37.9 with 17 df.

Relationship doesn't have to be linear. Try second-order model including weight squared, distance squared, weight by distance interaction.

Define all these variables and give names, then run regression including them:

The regression equation is
cost = 0.827 - 0.609 weight + 0.00402 distance + 0.0898 wtsq + 0.000015 distsq + 0.00733 wtdist

Predictor          Coef       StDev      T      P
Constant         0.8270      0.7023   1.18  0.259
weight          -0.6091      0.1799  -3.39  0.004
distance       0.004021    0.007998   0.50  0.623
wtsq            0.08975     0.02021   4.44  0.001
distsq       0.00001507  0.00002243   0.67  0.513
wtdist        0.0073271   0.0006374  11.49  0.000

S = 0.4428   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Analysis of Variance

Source          DF       SS      MS       F      P
Regression       5  449.341  89.868  458.39  0.000
Residual Error  14    2.745   0.196
Total           19  452.086

Do second-order terms really help? To answer this, need partial F-test. The model without the second-order terms had error SS 37.19 with 17 df, so

F = [(37.19 − 2.745)/3] / [2.745/14] = 58.6;

P-value from F-dist with 3 and 14 df is very small; reject null hyp. that second-order terms are useless – that is, keep them.

Distance-squared term could be removed from regression (t-test not significant), but wtsq and wtdist, though small, should stay.

Using categorical x-variables

Variables come in two kinds: numerical (counted or measured), and categorical (classified). Examples: surveys on people might measure sex (male/female), age group (18–25, 25–34, 35–44 etc.).

These variables have a number of levels, the number of categories: 2 for sex, 3 or more for age group.

To put a 2-level categorical variable into a regression, code 1st category as 1, 2nd as 0. Eg. males=1, females=0. Then add to regression like any other variable. Effectively, dividing data into males and females.

Example: a certain drug may increase anxiety level in patients; rate of increase suspected to be different for males and females. Data in ANXIETY. Results:

The regression equation is
score = 13.6 + 0.341 dose + 2.80 sex

Predictor       Coef    StDev      T      P
Constant     13.5500   0.5471  24.77  0.000
dose         0.34143  0.03741   9.13  0.000
sex           2.8000   0.4666   6.00  0.000

Variable sex coded as 1/0. Coef significantly different from zero, so results different for males and females. Value 2.8 means males (1) score average of 2.8 units higher than females (0), allowing for effects of dose.

Can also fit interaction of sex and dose:

The regression equation is
score = 15.3 + 0.191 dose - 0.700 sex + 0.300 sexdose

Predictor       Coef    StDev      T      P
Constant     15.3000   0.5983  25.57  0.000
dose         0.19143  0.04523   4.23  0.000
sex          -0.7000   0.8461  -0.83  0.412
sexdose      0.30000  0.06396   4.69  0.000

Interaction significant. What does this mean?

Means: way score depends on dose different for each sex. That is, fitting straight line for each sex separately, and lines have different slopes.

Because interaction significant, should keep both dose and sex in the model even though sex by itself not significant.

If interaction had not been significant, would have had straight line for each sex with same slope, and first regression would have been appropriate.

Categorical variables with more than 2 levels

With more than 2 levels, define a string of dummy variables like this:

Age group  x1  x2  x3
18–25       0   0   0
25–34       1   0   0
35–44       0   1   0
45–54       0   0   1

4 levels, so defined 3 dummy variables (1 less). In words: 1st dummy variable is 1 if it's the 2nd level, 0 otherwise; 2nd is 1 if 3rd level and 0 otherwise, and so on. No need to define a dummy variable specific to first level, because if not others, must be first.
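A hedged sketch of building such dummy variables with pandas (an assumption; the notes set them up by hand in Minitab):

import pandas as pd

df = pd.DataFrame({"agegroup": ["18-25", "25-34", "35-44", "45-54"]})
dummies = pd.get_dummies(df["agegroup"], drop_first=True, dtype=int)
print(dummies)   # three 0/1 columns; the first level is the all-zero baseline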

A consulting firm sells a computerized system for monitoring road construction bids. It wants to compare the mean annual maintenance costs for the system users in 3 different states: Kansas, Kentucky, Texas. Data in BIDMAINT.

Note layout: all costs, regardless of state, in 1 column. Then x1 = 1 if Kentucky (2nd), 0 otherwise; x2 = 1 if Texas (3rd), 0 otherwise. Then run regression predicting cost from x1 and x2:

The regression equation is
cost = 280 + 80.3 kentucky + 198 texas

Predictor     Coef  StDev     T      P
Constant    279.60  53.43  5.23  0.000
kentucky     80.30  75.56  1.06  0.297
texas       198.20  75.56  2.62  0.014

S = 168.9   R-Sq = 20.5%   R-Sq(adj) = 14.6%

Intercept of 280 is mean cost for Kansas (omitted state). Mean for Kentucky is $80.30 larger than for Kansas, mean for Texas $198.20 bigger than for Kansas.

Slope for a dummy variable compares its category (where it is 1) with "baseline" category (Kansas).

How to tell if maintenance costs really do differ between states? Null hypothesis of no difference: all states have same mean, so dummy variables for all states 0 (difference from Kansas 0). These are all the slopes in this regression, so global F-test tells story:

Analysis of Variance

Source          DF      SS     MS     F      P
Regression       2  198772  99386  3.48  0.045
Residual Error  27  770671  28543
Total           29  969443

Just significant at 0.05 level. Mean costs do differ among states.

Mixing categorical and numerical variables

In the SYNFUELS data set, diesel engines with different brake power were run with three different fuels: a diesel called DF-2, a synthetic blended fuel, and a blended fuel with advanced timing. The mass burning rate was measured.

One numerical x, brake power, and one categorical, fuel type (3 levels). Define 3 − 1 = 2 dummy variables for the fuels. In data set: df2 1 for fuel DF-2, 0 otherwise, bln 1 for blended fuel, 0 otherwise.

Predict burn rate from brake power and dummy variables:

The regression equation is
burnrate = 13.3 + 4.36 brake - 22.6 df2 - 7.36 bln

Predictor      Coef   StDev      T      P
Constant     13.320   6.931   1.92  0.084
brake        4.3650  0.8057   5.42  0.000
df2         -22.600   5.464  -4.14  0.002
bln          -7.360   5.464  -1.35  0.208

S = 8.057   R-Sq = 81.2%   R-Sq(adj) = 75.6%

R-squared high, so model useful. Brake power definitely helps to predict over and above fuel type. For fixed brake power, Advanced Timing fuel has highest burn rate, followed by blended fuel, followed by DF-2. (Dummy variable slopes negative.)

To see whether fuel type helps to predict (over and above brake power), need partial F-test to compare fit of regression containing just brake power. Can use the sequential SS for this:

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       3  2807.90  935.97  14.42  0.001
Residual Error  10   649.08   64.91
Total           13  3456.99

Source  DF   Seq SS
brake    1  1603.93
df2      1  1086.22
bln      1   117.76

F = [(1086.22 + 117.76)/2] / [649.08/10] = 9.27.

Has 2 and 10 df. P-value 0.0053. The fuel type definitely affects the burn rate.

Choosing the x-variables

Introduction

We are often faced with data sets containing a y-variable and many x-variables.

Problem: some of the x's may have nothing to do with y; including all of them in a regression is wasteful.

Principle of parsimony (Occam's Razor) says that simplest model describing data is best.

Best: use subject-matter knowledge to say which x's should be in the model.

Otherwise: try to use statistical arguments to choose the x's.

Best subsets regression

One natural way to go is to think about fitting all the possible regressions and choosing the best in some way. This seems like a lot of work, but with fast computers (and some computational shortcuts) it is feasible for most datasets.

If you're comparing regressions with the same number of x's, it is easy to compare them by comparing R-squared.

Example: problem 6.4, page 335, uses CLERICAL data set. Measured total number of hours worked per day by clerical staff in a department store, and how this depends on other variables:

1. pieces of mail opened, sorted
2. money orders, gift certificates sold
3. customer account payments processed
4. change order transactions processed
5. cheques cashed
6. pieces of miscellaneous mail processed
7. bus tickets sold

To run all possible regressions in Minitab, select Regression, then Best Subsets. Enter the y-variable in Response, then enter all the x's into Free Predictors. I got this output:

Response is y

(the X columns, printed vertically in the original output, are, left to right: mailopen, giftcert, acctpay, change, cheques, miscmail, bus)

                  Adj.
Vars   R-Sq   R-Sq   C-p       s
   1   34.5   33.2  18.8  12.701  X
   1   24.9   23.4  28.6  13.599  X
   2   43.6   41.3  11.5  11.902  X X
   2   43.3   41.0  11.8  11.934  X X
   3   48.1   44.8   8.9  11.543  X X X
   3   47.6   44.3   9.4  11.597  X X X
   4   51.8   47.7   7.1  11.235  X X X X
   4   51.6   47.4   7.4  11.265  X X X X
   5   54.5   49.6   6.4  11.036  X X X X X
   5   54.3   49.4   6.5  11.055  X X X X X
   6   56.1   50.2   6.8  10.961  X X X X X X
   6   56.0   50.1   6.9  10.978  X X X X X X
   7   56.8   50.0   8.0  10.990  X X X X X X X

This gives the best regression with each possible number of x-variables (up to 7 here). For instance, the first line says that the best regression with 1 x-variable has an R-squared of 34.5%, and it predicts total hours worked from the number of cheques cashed (read cheques downwards on the right). Also shown is the second-best 1-variable regression, which has an R-squared of 24.9%.

Looking further down, the best 6-variable regression has an R-squared of 56.1%, and contains all the variables except the number of bus tickets sold.

But how do you compare regressions with different numbers of x-variables?

Looking at R-squared: every time you add a new x-variable, even if it's useless, R-squared will go up. Remembering Occam's Razor, we only want to add a new variable if it's useful. So R-squared is no good.

Adjusted R-squared

One way: adjust definition of R-squared so that it goes down when a worthless x-variable is added.

R-squared is regression SS / total SS, or 1 − error SS / total SS. The total SS is same for any regression with the same y, so really depends on error SS.

Idea: base adjusted R-sq on error MS:

Ra² = 1 − (n − 1) (error MS / total SS),

n being the number of observations.

Since error MS can go up or down depending on usefulness of x, so can this adjusted R-squared.
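Checking the formula against the best 6-variable model (the error MS and total SS are taken from the ANOVA table shown later in these notes):

n = 52                   # clerical workers
error_ms = 120.2         # error MS, 6-variable model
total_ss = 12312.5       # total SS
print(1 - (n - 1) * error_ms / total_ss)   # about 0.502, the 50.2% in the output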

Minitab puts "R-Sq (adj)" on regression output and also has the "Adj. R-Sq" column on best subsets output. Look for highest. In example (same X columns as before):

                  Adj.
Vars   R-Sq   R-Sq  C-p       s
...
   5   54.3   49.4  6.5  11.055  X X X X X
   6   56.1   50.2  6.8  10.961  X X X X X X
   6   56.0   50.1  6.9  10.978  X X X X X X
   7   56.8   50.0  8.0  10.990  X X X X X X X

50.2% for model with all x's except number of bus tickets sold. (Also regression with lowest MS Error; look in s column.)

Mallows' Cp

Mallows' Cp gives another way of choosing the "best" regression. For a particular regression model, it is

Cp = error SS(model) / error MS(all) + 2(p + 1) − n.

In the 1st term, error SS(model) comes from the regression you're fitting, and error MS(all) comes from the regression with all the x's in it. n is the number of observations, and p the number of x's in the fitted regression.

A better regression has a smaller error SS or a smaller number of variables p, so a smaller Cp is better. Any regressions with Cp smaller than p + 1 are "satisfactory".

For example, consider CLERICAL data set again, fit model containing all x's except bus:

The regression equation is
y = 61.2 + 0.00112 mailopen + 0.0887 giftcert + 0.0115 acctpay - 0.0438 change + 0.0499 cheques + 0.215 miscmail

<...>
S = 10.96   R-Sq = 56.1%   R-Sq(adj) = 50.2%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       6   6905.6  1150.9  9.58  0.000
Residual Error  45   5406.9   120.2
Total           51  12312.5

The SS Error for this model is 5406.9, with n = 52 clerical workers and p = 6 variables in this regression. The regression containing all 7 x's has an MS Error of 120.8 (not shown here). Thus Cp is

5406.9/120.8 + 2(6 + 1) − 52 = 6.76.

This is smaller than 6 + 1 = 7, so this regression is satisfactory. (The value 6.76 checks with the best-subsets calculation.)
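The same Cp arithmetic as a one-off check:

sse_model, ms_all, p, n = 5406.9, 120.8, 6, 52
print(sse_model / ms_all + 2 * (p + 1) - n)   # 6.76, below p + 1 = 7, so satisfactory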

Cautions

We now know that the "best" regression has the largest adjusted R-squared or the smallest Cp.

These will sometimes disagree about the "best" regression (as in our example): usually Cp favours fewer x's.

Tempting to go with the "best" regression for further analysis, but:

• picking the "best" regression gives an optimistic R-squared (that future studies won't repeat)
• because of chance, "best" regression may not actually be the best one
• by doing many regressions and taking the best, we may "capitalize on chance" ("if you look at enough things, you're bound to find something good")
• an automatic procedure is no substitute for subject-matter knowledge: which variables should be important.

The best use of these methods is as a suggestion for future study. Collect a new data set with the suggested variables, then analyze that.

Another technique, "stepwise regression", should be ignored completely! (It has all these problems and more.)

Detecting and correcting problems

Introduction

Not everything will go smoothly in a regression analysis. We need to see whether anything is not as it appears, decide how (if possible) to fix it, and to recognize the limitations of our analysis.

This chapter shows some ways in which our analysis or conclusions could be wrong/misleading, and how we might fix them.

Observational data vs. designed experiments

Recall distinction between these two:

observational data occur when the data are just observed: no effort is made to control anything.

designed experiments occur when the x-variables are controlled and can be changed by the experimenter.

Example: study where researchers gave IQ tests to 2-year-old infants (score y); also noted whether the mother admitted using cocaine during pregnancy: x = 1, 0.

This is observational study because value of x not controlled (impossible!).

Mothers not randomly assigned to cocaine use/not, so groups could differ on other variables not recorded. Eg. IQ might differ by socioeconomic status of mothers and this might be related to cocaine use.

General principle: be cautious about drawing conclusions from observational study.

Multicollinearity

In a regression, want y to be correlated with (at least some) of the x's.

But what if the x's are correlated among themselves?

Example: the American Federal Trade Commission measures cigarette brands according to tar, nicotine and carbon monoxide. Can we predict carbon monoxide from other variables plus weight? Data in FTCCIGAR.

Do regression predicting carbon monoxide from other variables:

The regression equation is
co = 3.20 + 0.963 tar - 2.63 nicotine - 0.13 weight

Predictor      Coef   StDev      T      P
Constant      3.202   3.462   0.93  0.365
tar          0.9626  0.2422   3.97  0.001
nicotine     -2.632   3.901  -0.67  0.507
weight       -0.130   3.885  -0.03  0.974

S = 1.446   R-Sq = 91.9%   R-Sq(adj) = 90.7%

R-squared is high (good). CO depends on amount of tar in positive way, as expected. Would also expect CO to depend on nicotine content in positive way. But slope negative and non-significant.

Reason: tar and nicotine highly correlated with each other:

Correlation of tar and nicotine = 0.977

so that two variables predict CO in the same way: once you have one, you don't need the other.

How do you tell this has happened? Usual clue: an expected significant variable is non-significant.

Remedy: calculate variance inflation factors. These based on correlation from fictitious regression predicting each x-variable from other x's. Here, expect VIFs for tar and nicotine to be high because can predict one from other.

To get VIFs in Minitab: select Stat, Regression, Regression. Click Options, select Variance Inflation Factors. Click OK twice. Get this:

Predictor      Coef   StDev      T      P   VIF
Constant      3.202   3.462   0.93  0.365
tar          0.9626  0.2422   3.97  0.001  21.6
nicotine     -2.632   3.901  -0.67  0.507  21.9
weight       -0.130   3.885  -0.03  0.974   1.3

VIF greater than 10 "large". Here, as expected, VIFs for tar and nicotine high: variables correlated with each other.

Suppose now one x-variable correlated with sum of two others. Then no high correlations between x's, but VIF for that one variable high. Thus VIFs better than correlations in general.
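A hedged sketch of computing VIFs in Python (statsmodels assumed; the notes use Minitab's Options dialog, and the file name here is hypothetical):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("ftccigar.csv")   # assumed columns: co, tar, nicotine, weight
X = sm.add_constant(df[["tar", "nicotine", "weight"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))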

Finally, to illustrate that we could use either tar or nicotine in example, do regression without tar:

The regression equation is
co = 1.61 + 12.4 nicotine + 0.06 weight

Predictor      Coef  StDev     T      P
Constant      1.614  4.447  0.36  0.720
nicotine     12.388  1.245  9.95  0.000
weight        0.059  5.024  0.01  0.991

S = 1.870   R-Sq = 85.7%   R-Sq(adj) = 84.4%

R-squared hasn't changed much, but now slope for nicotine positive and significant.

What effect do correlated x's have on predictions?

Let's find confidence and prediction intervals for a cigarette brand with 12.2 mg of tar, 0.88 mg of nicotine, and which weighs 0.97 g. First, use regression with all x's. In Regression, click Options, enter 13 1.1 1.05 on the "new observations" line, then click OK.

Predicted Values

    Fit  StDev Fit           95.0% CI          95.0% PI
 12.503       0.290  (11.901, 13.106)  ( 9.437, 15.569)

Now compare with regression omitting nicotine (so no correlation problems). Take out 1.1 from the "new observations" line:

Predicted Values

    Fit  StDev Fit           95.0% CI          95.0% PI
 12.515       0.286  (11.923, 13.107)  ( 9.496, 15.535)

Intervals almost identical.


Thus for prediction, can include all variables even correlated ones, whereas for building a model that can be understood, better to take out x's highly correlated with other x's.

Extrapolation

Extrapolation is the often tempting process of predicting outside the range of your data.

Example: 8 supermarkets were chosen with similar past demands for a brand of coffee. To investigate the effect of price on demand, 8 different prices were randomly assigned to the stores and then the demand for the coffee measured. (Experiment.) Data in COFFEE.

The regression equation is
demand = 2898 - 608 price

Predictor      Coef  StDev       T      P
Constant     2898.1  151.0   19.20  0.000
price       -607.98  44.96  -13.52  0.000

S = 29.14   R-Sq = 96.8%   R-Sq(adj) = 96.3%

R-squared is high, so prediction should be good.

Suppose the price is $5. What is the predicted demand then?

demand = 2898 − 608(5) = −142.

Oops! Can't be a negative demand.

What happened? Prices in data set only go up to $3.70, so we are extrapolating.

Here, got nonsense, but often get apparently reasonable prediction.

For extrapolation to work, the linear relationship would have to continue beyond the data, and we have no way of knowing that this will happen.

If you try to extrapolate getting confidence or prediction interval from Minitab, you get warned:

Predicted Values

    Fit  StDev Fit         95.0% CI          95.0% PI
 -141.8       74.9  (-325.1, 41.5)  (-338.5, 54.9) XX

X denotes a row with X values away from the center
XX denotes a row with very extreme X values

A price of $5 is a very extreme x-value.

Note also the size of the intervals, compared with those for price $3.30 (an average price in the data):

Predicted Values

    Fit  StDev Fit        95.0% CI        95.0% PI
  891.8       10.5  (866.0, 917.6)  (815.9, 967.6)

The intervals are narrower – we have more "nearby" data to work with.

Transformations of y and x

For coffee data, predicted demand for a price of $5 was nonsense. A plot shows why: [scatterplot omitted]

Only endmost points are above the line: relationship curved not linear.

Only know how to fit straight lines, not curves. But to make things more linear, can use functions of x and y instead of x and y themselves.

Sometimes function, transformation, suggested by data (see how later). Sometimes suggested by subject-matter theory.

For coffee data, economic theory suggests linear relationship between 1/price and demand, giving transformation of price to try.

Two steps:

• create new column(s) of transformed data
• run analysis on transformed data.

For coffee data, create new column of 1/price. To do this:

• give new column a name (type it into the header row)
• Select Calc, Calculator. Double-click name of new column, then go to Expression box. There type or select 1/, then double-click price.
• Click OK; new column appears in worksheet.

Now predict demand from 1/price, including new column as x-variable in regression:

The regression equation is
demand = - 1180 + 6808 1/price

Predictor      Coef  StDev       T      P
Constant    -1180.5  107.7  -10.96  0.000
1/price      6808.1  358.4   19.00  0.000

S = 20.90   R-Sq = 98.4%   R-Sq(adj) = 98.1%

Plot shows that relationship is (a little) more linear now.

Predict demand for price $3.30, first by hand:

demand = −1180.5 + 6808.1(1/3.3) = 882.4.

Or can do by Minitab. Now predicting for 1/price, so in Regression, Options under "prediction intervals for new observations", enter 0.303, which is 1/3.3. Click OK:

Predicted Values

    Fit  StDev Fit          95.0% CI          95.0% PI
 882.37       7.47  (864.08, 900.66)  (828.04, 936.71)

The fit is better, so the intervals are shorter.

Other transformations

Can transform y instead of/as well as x. Consider these data:

Row  x     y
  1  1   1.0
  2  2   1.5
  3  3   2.0
  4  4  2.75
  5  5   3.5
  6  6   4.5

y seems to increase faster as x increases, so linear relationship no good. Predict instead √y from x. (Plot indicates this relationship straight.)

Define new column sqrty containing square root of y (in Calculator, select "square root" or type sqrt('y')).

Predict square root of y from x:

The regression equation is
sqrty = 0.769 + 0.223 x

Predictor        Coef     StDev      T      P
Constant      0.76934   0.01542  49.90  0.000
x            0.222541  0.003959  56.21  0.000

S = 0.01656   R-Sq = 99.9%   R-Sq(adj) = 99.8%

Fit is very good.

How to predict y for x = 3?

First, use the line, put in x = 3, get

0.769 + 0.223(3) = 1.438.

But this is predicted value of square root of y, so have to undo square root: square this to get predicted value of y:

(1.438)² = 2.07,

which fits well with the data.
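The transform-and-back-transform steps as a quick sketch (numpy assumed), using the six data rows above:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1.0, 1.5, 2.0, 2.75, 3.5, 4.5])

slope, intercept = np.polyfit(x, np.sqrt(y), 1)   # regress sqrt(y) on x
pred_sqrt = intercept + slope * 3                 # predicted sqrt(y) at x = 3
print(pred_sqrt ** 2)                             # square to undo: about 2.07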

Relationships made linear by transformation

Exponential growth

How does money grow if you put it in a savings account with 2% annual interest?

Start with $100; after 1 year, have $102; after 2 years, $104.04, after 5 years $110.41. Formula: m = 100(1.02^t) for money after t years.

Not linear: 2% of larger amount is larger.

Take logarithms of both sides:

ln m = ln(100(1.02^t)) = ln 100 + ln(1.02^t) = ln 100 + t ln 1.02.

That is, ln m is linear function of t.

Can model exponential growth with regression: take logarithm transform of y, and predict ln y from x.

Power relationships

If y is a power of x, can fit this using regression. If y = kx^c, then

ln y = ln(kx^c) = ln k + ln(x^c) = ln k + c ln x.

Since k is a constant, ln y is linear function of ln x.

Consider following data, where y is approx 0.1x³:

x    1  2  3  4   5   6
y  0.1  1  3  6  12  22

Relationship definitely curved. Calculate logs of x and y. I used "natural log" in Calculator. Plot of logy against logx straight, so predict logy from logx:

The regression equation is
logy = - 2.20 + 2.95 logx

Predictor       Coef    StDev       T      P
Constant    -2.20406  0.09557  -23.06  0.000
logx         2.94686  0.07631   38.62  0.000

S = 0.1131   R-Sq = 99.7%   R-Sq(adj) = 99.7%

Fit again very good. But since ln(kx^c) = ln k + c ln x, note that −2.20 ≃ ln k so k ≃ 0.11 and c = 2.95 ≃ 3; relationship approx y = 0.1x³.
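The log-log fit for the power-law data above, sketched in Python (numpy assumed):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0.1, 1, 3, 6, 12, 22])

c, ln_k = np.polyfit(np.log(x), np.log(y), 1)   # slope c, intercept ln k
print(c, np.exp(ln_k))   # about 2.95 and 0.11, so y is roughly 0.11 x^2.95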

Residuals and plots

Need to check whether a regression is working all right. To do this, define residual as difference between observed y and predicted y:

ε̂i = yi − ŷi

Regression should summarize all relationship between y and x's, so should be no pattern in plots of residuals vs. anything.

Any pattern indicates problem; kind of pattern indicates kind of problem.

Example: fat/cholesterol data

A sample was taken of 10 Olympic athletes. Two variables measured: avg daily intake of saturated fat (x), cholesterol level (y) (data in OLYMPIC).

Predict cholesterol level from fat intake using linear regression. Save residuals: to do this, select cholesterol as Response and fat as Predictor as usual, then click Storage. Select Residuals and Fits.

Two kinds of useful plots (see the sketch after this list):

• residuals vs. fitted values: any patterns: problem with y-variable
• residuals vs. x-variables: patterns: problems with that x-variable
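A hedged sketch of making both plots in Python (pandas/statsmodels/matplotlib assumed; the notes do this through Minitab's Storage and Graph menus, and the file name is hypothetical):

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_csv("olympic.csv")            # assumed columns: cholesterol, fat
fit = smf.ols("cholesterol ~ fat", data=df).fit()

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(fit.fittedvalues, fit.resid)   # residuals vs. fitted values
ax2.scatter(df["fat"], fit.resid)          # residuals vs. the x-variable
plt.show()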

Residuals vs. fitted values: select Graph, then Plot. Notice residuals (RESI1) and fitted values (FITS1) available for plot. Select them with residuals as y. Get this: [residual plot omitted]

Curved pattern, or point different from others (bottom right).

Interpret: curved relation needed, or one obs. unusual.

Plot residuals against 1 x-variable fat: [plot omitted]

Very similar. Interpret as curve: relation with that x-variable curved.
One fix: add fat-squared to regression, look at residual plot then:

The regression equation is
cholesterol = - 1216 + 2.40 fat - 0.000450 fatsq

Predictor           Coef       StDev      T      P
Constant         -1216.1       242.8  -5.01  0.002
fat               2.3989      0.2458   9.76  0.000
fatsq        -0.00045004  0.00005908  -7.62  0.000

S = 46.80   R-Sq = 98.2%   R-Sq(adj) = 97.7%

Even though coeff of fatsq very small, definitely significant, so was worth including.

[residual plot omitted] No apparent pattern: good.

Example: social worker data

The SOCWORK data set contains years of experience and salary data for 50 social workers. How does salary depend on experience?

Plotting salary against years of experience suggests a curved relationship, so calculate experience-squared and add that as well.

(Or: do straight-line regression, look at residuals, note that relationship not straight line, add experience-squared.)

Use Calc-Calculator to create column of experience-squared values. Add to regression. I get this:

The regression equation is
salary = 20242 + 522 experience + 53.0 exp^2

Predictor     Coef  StDev     T      P
Constant     20242   4423  4.58  0.000
experien     522.3  616.7  0.85  0.401
exp^2        53.01  19.57  2.71  0.009

S = 8123   R-Sq = 81.6%   R-Sq(adj) = 80.8%

Definitely need exp^2 term.

But store residuals, fitted values; plot: [residual plot omitted]

Residuals average out to 0, but spread increases towards right ("cone shape"). No good, because variability of residuals should stay same all the way along.

Problem on plot with fitted values, so fix by fixing y (salary).

With increasing spread, try transformation like square root or logarithm. (Decreasing spread rarer; try reciprocal 1/y.)

Might guess that percent change in salary is what depends on experience. Suggests logarithm transformation of salary.

Calculate column of ln(salary) using Calculator (look for "natural log" in function box). Use this instead of original salary:

The regression equation is
logsal = 9.84 + 0.0497 experience + 0.000009 exp^2

Predictor        Coef      StDev       T      P
Constant      9.84289    0.08479  116.08  0.000
experien      0.04969    0.01182    4.20  0.000
exp^2       0.0000094  0.0003753    0.03  0.980

S = 0.1557   R-Sq = 86.4%   R-Sq(adj) = 85.8%

Stored residuals and fits again (in RESI2 and FITS2). Has plot improved? [residual plot omitted]

Residuals no longer increase in spread: good.

Also, look at regression: experience-squared term no longer significant. That is, by using log of salary, get simpler model without exp^2. Redo regression:

The regression equation is
logsal = 9.84 + 0.0500 experience

Predictor        Coef     StDev       T      P
Constant      9.84132   0.05636  174.63  0.000
experien     0.049979  0.002868   17.43  0.000

S = 0.1541   R-Sq = 86.4%   R-Sq(adj) = 86.1%

Residual plot looks OK. By looking at residual plots, have arrived at good model for predicting salary.

145 146

Example: predict salary of a social worker with 15 years of experience.

Log of predicted salary is 9.84 + 0.05(15) = 10.59. To get the actual predicted salary, take exp(10.59) = 39735.

(Salaries go up by a fixed percentage for each year of experience: they go up faster as experience increases.)
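A quick check of the back-transformation (a worked sketch; the coefficients are the ones fitted above):

import math

# Fitted line: logsal = 9.84 + 0.05 * experience.
log_salary = 9.84 + 0.05 * 15   # = 10.59 at 15 years
print(math.exp(log_salary))     # about 39733, matching the notes' 39735 up to rounding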

Design of Experiments
Introduction

Regression: y depends on x's. The x's may not be controllable; if not, have observational data; then difficult to show that x causes y.

If we can select the x's, have a statistical experiment. This gives us a chance of saying "x's cause y" (rather than "x's and y happen to vary together"). Thus can hope to prove cause and effect.

Terminology

• The y or outcome variable is called the response.
• Objects on which y is measured (people, animals, plants, samples) are called experimental units.
• x-variables that can be controlled are called factors.
• A chosen value of a factor is called a level.
• A combination of factor levels is called a treatment.
• The process of collecting data is called an experiment.
• The plan for collecting data is called the design of the experiment.


Example

Experiment to investigate effect of brand, shelf location on coffee sales. 2 brands, A, B; 3 locations (bottom, middle, top). Therefore 2 × 3 = 6 treatment combinations of brand/location.

Brand, location are factors; weekly coffee sales the response.

Have 18 weeks to collect data: week is the experimental unit.

Allocate each week to a treatment combination; each one appears 3 times because 6 × 3 = 18.

Allocation done using randomness to ensure that certain brand/location combinations don't get favourable weeks.

If sales higher for certain treatment combinations, hope to say that higher sales were caused by brand or location.

Randomization

Experimental units are not all the same. Eg. some weeks are better for selling coffee than others, and this has nothing to do with brand or shelf location. (We therefore don't care about it.) Likewise, people or animals are different physically.

If all the best units go with a particular treatment combination, that treatment combination will look best even when it is not. Want to "share out" experimental units.

Basic idea: randomly allocate experimental units to treatment combinations. Result: no factor has an advantage, because of experimental units, over any other. Thus any difference must be because of the factor.

Various approaches to randomization. Simplest is the completely randomized design.


Completely randomized design

Suppose we want to investigate the effect of 3 different advertising displays on sales of a beverage. Have 30 different supermarkets (units), some of which will sell more of the beverage than others. Use randomization to get a completely randomized design.

Minitab: put numbers 1–30 in a column (say C1) to represent supermarkets. Want to arrange them in random order, then take 1st 10 to get display 1, next 10 to get display 2, last 10 to get display 3.

Select Calc, Random Data, Sample from Columns. In the dialog, sample all 30 rows from c1, put in c2. c2 then contains a random rearrangement of c1.

Thus (yours will be different):

• supermarkets 28, 9, 8, 14, 13, 29, 25, 18, 5, 22 get display 1;
• 7, 6, 15, 17, 2, 23, 11, 24, 21, 1 get display 2;
• 27, 4, 26, 20, 19, 10, 3, 16, 12, 30 get display 3.

All 30 supermarkets ended up with one of the displays, but impossible to predict ahead of time which one would get which.

Sometimes cannot assign experimental units randomly to treatments. Example: selecting university professors by department (eg. to measure salaries). Then take instead a random sample from each department. Still a completely randomized design.
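The same shuffling is a couple of lines in Python (a sketch; the grouping of 10 per display follows the recipe above):

import random

supermarkets = list(range(1, 31))  # units numbered 1-30, like c1
random.shuffle(supermarkets)       # random rearrangement, like c2

# First 10 get display 1, next 10 display 2, last 10 display 3.
for display in (1, 2, 3):
    print(display, supermarkets[10 * (display - 1):10 * display])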

ANOVA: analysis of completely randomized design

In regression, the issue was not only whether y appeared to depend on the x's, but whether it did so more than chance.

Likewise, here want to know whether, if you looked at the whole population of experimental units, the factors would affect the response. That is, is the effect observed in the data stronger than chance?

Example: effect of displays on beverage sales; is one display so much more effective than the others in the data (sample) that we would be confident in claiming this display best for all supermarkets in all weeks (population), not just the ones in the sample?

Example: GPA and socioeconomic class

Seven students were randomly selected from each socioeconomic class (lower, middle, upper), their GPAs taken from university files at the end of the academic year.

Does GPA depend on socioeconomic class (for all students at that university)?

Data set GPA3. Classes numbered 1–3.

Know how to do this with regression: define dummy variables picking out students in classes 2 and 3. That is, dummy1 is 1 for students from class 2, 0 otherwise; dummy2 is 1 for students from class 3, 0 otherwise.


Then do regression predicting GPA from these 2 dummy variables:

The regression equation is
gpa = 2.52 + 0.727 dummy1 + 0.021 dummy2

Predictor     Coef   StDev      T      P
Constant    2.5214  0.1934  13.04  0.000
dummy1      0.7271  0.2735   2.66  0.016
dummy2      0.0214  0.2735   0.08  0.938

S = 0.5116   R-Sq = 33.7%   R-Sq(adj) = 26.4%

Analysis of Variance

Source          DF      SS      MS     F      P
Regression       2  2.3969  1.1984  4.58  0.025
Residual Error  18  4.7111  0.2617
Total           20  7.1080

Intercept 2.52 is the mean of class 1 (not interesting: we are comparing classes).

Slope for dummy1 is 0.73, so students from class 2 have higher GPA on average than class 1 (by 0.73). Likewise, students from class 3 have slightly higher average GPA than class 1 (by 0.02).

Issue: are these differences just chance (would be different with different data) or real (different data would show the same pattern)?

Recall what the analysis of variance table says: is there any effect of any x's? That is, is there any real difference in GPA between socioeconomic classes?

Here, the P-value is 0.025, smaller than 0.05, so we are justified in saying there is a real difference in GPA between socioeconomic classes.

Null hypothesis: all the groups have the same mean; the alternative is that the null is not true, ie. that one or more groups has a different mean from the others.
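The same dummy-variable regression in Python (a sketch; the file name gpa3.csv and the column name sclass are assumptions for the GPA3 data):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical file holding the GPA3 data: columns gpa and sclass (1-3).
df = pd.read_csv("gpa3.csv")

# Dummy variables picking out classes 2 and 3, as in the notes.
df["dummy1"] = (df["sclass"] == 2).astype(int)
df["dummy2"] = (df["sclass"] == 3).astype(int)

fit = smf.ols("gpa ~ dummy1 + dummy2", data=df).fit()
print(fit.summary())            # intercept = class 1 mean; slopes = differences from class 1
print(sm.stats.anova_lm(fit))   # overall F-test: any real difference among classes?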

Analysis of variance in a completely randomized design

For a completely randomized design, the analysis of variance table tells us whether all the group means could be the same, or there is evidence that one or more of the groups differ from the others.

Also an easier way to do this in Minitab. Illustrate on GPA data again: select Stat, ANOVA, One-way. Response is GPA, factor is class. (See how GPA depends on class.) Output:

Analysis of Variance for gpa
Source   DF     SS     MS     F      P
class     2  2.397  1.198  4.58  0.025
Error    18  4.711  0.262
Total    20  7.108

                          Individual 95% CIs For Mean
                          Based on Pooled StDev
Level  N    Mean   StDev  --------+---------+---------+--------
1      7  2.5214  0.5041  (-------*--------)
2      7  3.2486  0.3526                  (-------*-------)
3      7  2.5429  0.6377  (-------*-------)
                          --------+---------+---------+--------
Pooled StDev = 0.5116          2.50      3.00      3.50

Analysis of Variance table same as before; the 2nd table gives an idea of which classes may differ from which (maybe class 2 is highest).


Recall the regression approach for the GPA data set:

Predictor     Coef   StDev      T      P
Constant    2.5214  0.1934  13.04  0.000
dummy1      0.7271  0.2735   2.66  0.016
dummy2      0.0214  0.2735   0.08  0.938

In the t-tests, dummy1 is significantly different from 0 (P-value less than 0.05). dummy1 compares class 1 with class 2. Likewise, dummy2 not significant, so no difference between class 1 and class 3.

Tukey's method

Suppose we have done an analysis of variance, and the table says that there are differences among the groups. How do we find out which groups differ from which? Two problems:

• what if we want to compare classes 2 and 3?
• more important, one of these t tests only works if we decided before collecting the data that we wanted to compare these and only these 2 groups.

If we compare all possible pairs of groups (1–2, 1–3, 2–3), we are doing several tests at once.

Why does this matter?

Think about how tests work: by rejecting when the P-value is less than 0.05, we have a 5% chance of incorrectly rejecting the null when it is actually true.

Here, the null hypothesis for each test is that the groups being compared have the same mean.

Suppose all groups have the same mean. By doing more than one test, we increase the chance of declaring some pair of groups to be different when actually not. (Have 3 chances to make a mistake, not just 1.)

Better idea (Tukey): if all groups have the same mean, figure out how big the difference between the largest and smallest sample mean could be. Any sample means further apart than this are significantly different.

Doesn't matter how many groups; the idea still works.


Doing Tukey's method in practice

To do Tukey's method, have to do a little hand calculation first. The advantage is that variations on Tukey's method are small variations on this calculation.

We'll assume that all the groups are the same size.

First, look back at the output (eg. GPA data):

Analysis of Variance for gpa
Source   DF     SS     MS     F      P
class     2  2.397  1.198  4.58  0.025
Error    18  4.711  0.262
Total    20  7.108

                          Individual 95% CIs For Mean
                          Based on Pooled StDev
Level  N    Mean   StDev
1      7  2.5214  0.5041
2      7  3.2486  0.3526
3      7  2.5429  0.6377
Pooled StDev = 0.5116

First, note that the overall F is significant, so there are some differences to find.

Track down "Pooled StDev", here 0.5116. Call this s. In the DF column, look for the "error" row. Result here is 18. Call this v. Note we have 3 groups to compare. Call this p. All groups have n = 7 observations.

Turn to Table 11 in Appendix C of the text. Find the row for v and the column for p. Here v = 18, p = 3. Number in table is 3.61. Call this q.

Calculate w = qs/√n. Here w = (3.61)(0.5116)/√7 = 0.698.

Finally, any groups whose means differ by more than w are significantly different (at the 5% level). Here the differences are:

2 vs 3   3.2486 − 2.5429 = 0.7057   Significant
2 vs 1   3.2486 − 2.5214 = 0.7272   Significant
3 vs 1   2.5429 − 2.5214 = 0.0215   Not significant

so class 2's mean GPA is significantly higher than that of the other classes, but the class 1–3 difference is probably just chance.

A nice way to illustrate this: list the means in order, match up with groups, then put a line on top of means that are not significantly different.

        -------------
Mean    2.5214  2.5429  3.2486
Class        1       3       2

This shows that group 2's mean is significantly bigger than the others.
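The whole hand calculation is a few lines of Python (a sketch; q is still looked up by hand, here hard-coded from Table 11 for the GPA example):

import itertools
import math

q, s, n = 3.61, 0.5116, 7            # q from Table 11 at p = 3, v = 18
w = q * s / math.sqrt(n)             # = 0.698

means = {"class 1": 2.5214, "class 2": 3.2486, "class 3": 2.5429}
# Any pair of means further apart than w is significantly different.
for (g1, m1), (g2, m2) in itertools.combinations(means.items(), 2):
    verdict = "significant" if abs(m1 - m2) > w else "not significant"
    print(f"{g1} vs {g2}: {abs(m1 - m2):.4f} {verdict}")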


Groups of different sizes

If groups are not all the same size, smaller groups have to be further apart to be significantly different (more room for chance when groups smaller).

Calculate a w for each pair of groups. For comparing groups i and j with ni and nj observations,

w_ij = (qs/√2) √(1/ni + 1/nj).

Everything else as before. Sometimes known as "Tukey-Kramer".

If can choose the size of each group, best to have groups of equal size: gives best chance to detect any significant differences.

Example: sorption rate data

Study made of chemical properties of 3 types of hazardous solvents. Measured "sorption rate" for samples of each solvent type: aromatics, chloroalkanes and esters. Data in SORPRATE and on p. 564 of text.

The ANOVA itself has no difficulties:

Analysis of Variance for sorprate
Source    DF      SS      MS      F      P
solvent    2  3.3054  1.6527  24.51  0.000
Error     29  1.9553  0.0674
Total     31  5.2607

Pooled StDev = 0.2597
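The pairwise cutoff wraps naturally as a small helper (a sketch; note that with ni = nj = n it reduces to the earlier w = qs/√n):

import math

def tukey_kramer_w(q, s, ni, nj):
    """Tukey-Kramer cutoff for comparing groups of sizes ni and nj."""
    return q * s / math.sqrt(2) * math.sqrt(1 / ni + 1 / nj)

# Sorption example: q = 3.49, s = 0.2597; aromatics (9) vs. chloroalkanes (8).
print(round(tukey_kramer_w(3.49, 0.2597, 9, 8), 2))   # 0.31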

There are significant differences among the solvents. In the data, esters seem to be lower than the other two.

Tukey: need mean sorption rates for each solvent: 0.94, 1.01 and 0.33 (order as above).

From table C-11 (text): p = 3 groups, v = 29 error df, q = 3.49 (used 30 df).

From the ANOVA, pooled SD is s = 0.2597. 9 aromatics, 8 chloroalkanes, 12 esters.

w12 = (3.49)(0.2597/√2) √(1/9 + 1/8) = 0.31.
w13 = (3.49)(0.2597/√2) √(1/9 + 1/12) = 0.28.
w23 = (3.49)(0.2597/√2) √(1/8 + 1/12) = 0.29.

w's similar because group sizes similar (smallest groups have the largest w).

Summary:

Group  Esters  Aromatics  Chloroalkanes
Mean     0.33       0.94           1.01
                 ----------------------


Remember: Tukey is only used if there are significant differences to find.

Consider Exercise 12.3 on page 561. The ANOVA gives this:

One-way Analysis of Variance

Analysis of Variance for result
Source   DF     SS    MS     F      P
trtmnt    1   7.50  7.50  3.15  0.104
Error    11  26.19  2.38
Total    12  33.69

There are no significant differences between treatments, so stop here: no point going further.

Randomized block designs

Sometimes the outcome of an experiment depends not just on one variable (above) but two.

Example: In comparing crop yields for different varieties of carrots, can just plant carrots in randomly chosen plots and do one-way ANOVA. But: some plots may be more fertile than others: get a good yield of carrots whichever variety is planted there. Better: categorize plots by quality (eg. good, medium, bad) and include quality of plot as a second variable.

Plot quality   Variety
Good           1  3  2  4
Medium         4  1  3  2
Bad            3  2  4  1

Randomize so that each variety appears once in a plot of each quality.

Example: Prompting in a walking program

Study (problem 12.24): 135 people randomly divided into 5 groups. Agreed to participate by walking 20 mins once or more a week. Participants reminded (prompted):

• No prompting (control)
• Weekly prompting / low structure
• Weekly prompting / high structure
• Prompt every 3 weeks / low structure
• Prompt every 3 weeks / high structure

Count kept of how many people (out of 27) actually did their walking each week. Results kept for weeks 1, 4, 8, 12, 16, 24.


3 variables: the week, kind of prompting, number of walkers.

Compare data layout on p. 581 with what Minitab expects: 1 column per variable (WALKERS data set). Variables called week, call, walkers.

This is a randomized block design, with weeks as blocks – expect week to affect number of walkers, but really interested in effect of prompting. Week is "nuisance".

In Minitab: select Stat, ANOVA, Two-way. walkers is response; call and week are "factors" (enter in either order). Will be interested in comparing calling strategies, so click Display Means for call.

Analysis of Variance for walkers
Source   DF       SS      MS      F      P
call      4  1185.00  296.25  39.87  0.000
week      5   386.40   77.28  10.40  0.000
Error    20   148.60    7.43
Total    29  1720.00

             Individual 95% CI
call   Mean  ----------+---------+---------+---------+-
1       2.7  (--*---)
2      17.0                      (---*---)
3      20.7                            (--*---)
4      10.5            (---*--)
5       9.2           (---*---)
             ----------+---------+---------+---------+-
                    6.0      12.0      18.0      24.0
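Outside Minitab, a randomized block analysis is just a two-way ANOVA with no interaction term (a sketch; the file name walkers.csv is an assumption for the WALKERS data):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical file holding the WALKERS data: columns week, call, walkers.
df = pd.read_csv("walkers.csv")

# Both factors categorical; blocks (week) included but no interaction.
fit = smf.ols("walkers ~ C(call) + C(week)", data=df).fit()
print(sm.stats.anova_lm(fit))   # F-tests for call and for week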

The ANOVA table has two tests now: one for any difference in number of walkers due to call strategy, one for any difference due to weeks.

Difference due to weeks no surprise: expected this (reason for including weeks as blocks in the first place).

Difference due to call strategy. Strictly, "difference due to call strategy allowing for differences due to weeks" (like regression). This is what we want.

Tukey for randomized blocks

Having decided that call strategy does make a difference, now want to decide which call strategies are better.

First get the pooled SD as the square root of the error MS: s = √7.43 = 2.7258.

Then same calculation as for completely randomized ANOVA. Comparing p = 5 call strategies, v = 20 error df. In table C-11: q = 4.23.

Calculate w = qs/√n = (4.23)(2.7258)/√6 = 4.71. Each call strategy mean is based on the 6 weeks' worth of data, so divide by √6. Any means differing by more than 4.71 are significantly different.
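As a quick check (the same arithmetic; tukey_kramer_w is the helper sketched earlier, which reduces to qs/√n for equal sizes):

import math

print(round(4.23 * 2.7258 / math.sqrt(6), 2))   # w = 4.71
# Equivalently: tukey_kramer_w(4.23, 2.7258, 6, 6)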


Summary table:

Call strategy    1    5     4     2     3
Mean           2.7  9.2  10.5  17.0  20.7
                    ----------  ----------

Call strategies 2 and 3 were most effective at getting people to walk. These were the strategies involving weekly calls. The kind of call (conversation vs. goal-setting) didn't make any significant difference.

General two-factor factorial designs

In a randomized block design, have one observation for each combination of factor levels (in the example, one count for each prompting strategy each week). What if we have resources for more observations, and what if interested in both factors?

Example: a manufacturer has a variable supply of raw material, and needs to decide the best ratio of raw material to the two product lines. Response is profit (per unit of raw material). Data in RAWMATERIAL.

Decides to run the experiment with ratios 0.5, 1 and 2, supply 15, 18, 21 tons (typical values). (Ratio 0.5 means "make half as much of product 1 as product 2".)

Might expect that the most profitable ratio would depend on level of supply (strategy "if supply is S, use ratio R", where R depends on S).

When the response depends on a combination of the 2 factors like this, we have interaction between them. When data have more than 1 observation for each factor combination (here 3), can test interaction. Said to have 3 replications here.

In Minitab, select STAT, ANOVA, two-way. Select response for top box, two factors (order doesn't matter) in other two. Don't worry about "display means" for now.

Results:

Two-way Analysis of Variance

Analysis of Variance for profit
Source        DF     SS     MS     F      P
ratio          2   8.22   4.11  1.71  0.209
supply         2  20.22  10.11  4.20  0.032
Interaction    4  46.22  11.56  4.80  0.008
Error         18  43.33   2.41

Now have 3 F-tests: one for each factor, and one for the interaction.

Look first at the test for interaction, here significant. No surprise – the effect of ratio on profit depends on supply level.

To see how, need to look at all combinations of ratio and supply. 3 different ratios, 3 different supply levels, so 3 × 3 = 9 combinations in total.


How to get the means? Select Stat, Tables, Cross-tabulation. Select the two factors, Ratio and Supply. Then tell Minitab you want the means of Profit for each combo: click Summaries, select Profit and click Means. (You get means for each ratio and each supply level too.) Result:

Rows: ratio   Columns: supply

          15      18      21     All
0.5   21.333  20.333  19.333  20.333
1.0   20.333  23.667  20.333  21.444
2.0   17.333  21.333  22.000  20.222
All   19.667  21.778  20.556  20.667

Cell Contents --
profit:Mean

Which of these are significantly different? Same calculation as before. 9 "groups" (means being compared), 18 error df. Table C-11 gives q = 4.96 (table continues overleaf if you have more groups). Then w = qs/√n as before. s is the square root of the error mean square, s = √2.41 = 1.552. Each mean is based on 3 observations (replications), so n = 3. Hence

w = (4.96)(1.552)/√3 = 4.446

and any means further apart than this are significantly different.
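The same table of cell means in Python (a sketch; the file name rawmaterial.csv is an assumption for the RAWMATERIAL data):

import pandas as pd

# Hypothetical file holding the RAWMATERIAL data: columns ratio, supply, profit.
df = pd.read_csv("rawmaterial.csv")

# Mean profit for each ratio/supply combination; margins adds the "All" row and column.
print(pd.pivot_table(df, values="profit", index="ratio",
                     columns="supply", aggfunc="mean", margins=True))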

Story so far: look first at the interaction test. If significant, test means for all combinations (by Tukey).

For a diagram, arrange the means in order:

Ratio/Supply  2/15  0.5/21  0.5/18  1/15  1/21  0.5/15  2/18  2/21  1/18
Mean         17.33   19.33   20.33 20.33 20.33   21.33 21.33 22.00 23.67
             ----------------------------------------------
                     ------------------------------------------------

Ratio 2, supply 15 is significantly worse than the top two, 2/21 and 1/18, but no other significant differences.

What if interaction not significant?

Study done to investigate effect of vitamin B supplements on kidney weight of rats. Rats classified before the study as "lean" (1) or "obese" (2). 14 rats of each size given randomly regular diet (1) or vitamin-B-supplemented diet (2). Data in VITAMINB. ANOVA gives:

Analysis of Variance for kidneywt
Source        DF      SS      MS       F      P
ratsize        1  8.0679  8.0679  141.18  0.000
diet           1  0.0124  0.0124    0.22  0.645
Interaction    1  0.0364  0.0364    0.64  0.432
Error         24  1.3715  0.0571
Total         27  9.4883


Interaction not significant. So now look at size and diet (main effects).

Diet not significant, but rat size is. That is, kind of diet has no effect on kidney weight, but size before the experiment does.

Usually here do Tukey on any significant factors (ratsize only). But in this experiment, no point: only two different sizes of rats, so the sizes are significantly different in kidney weight from each other. Looking at the data, obese rats have larger kidney weights than lean rats.

General procedure for 2-way ANOVA

In general, when the interaction is not significant, test each of the main effects (from the ANOVA). For any significant ones, do Tukey as necessary.

When the interaction is significant, that is the finding. Do Tukey on means for all combinations of groups.

Diagram on p. 588 of text:

Checking ANOVA assumptions

As with regression, need to check that the assumptions are OK so that we can trust the results.

2 major assumptions with ANOVA:

• data follow normal distribution within each group.
• data have same spread within each group.

Compare regression: there, assumed normal distribution around the line, equal spread around the line.


Testing variances

Usually, the normal distribution assumption is not crucial. There are tests for normality, but they tend to be too sensitive: reject normality when data are "normal enough".

So concentrate on testing spread within groups. Tests based on variance (SD squared). Best test is Levene's test. Null hypothesis: all groups have same variance; alternative "not".

Recall the rat vitamin B data (rats given different diets, weight of kidney measured). Check assumptions.

Select Stat, ANOVA, homogeneity of variance. Fill in response and factors (kidneywt is response, others factors). Output includes a graph as well as this text:

Levene's Test (any continuous distribution)

Test Statistic: 1.036
P-Value       : 0.394

P-value not small, so don't reject the null. Conclude all groups have same spread, therefore that the previous ANOVA was OK.
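scipy has the same test (a sketch; the file name vitaminb.csv and its column names are assumptions for the VITAMINB data):

import pandas as pd
from scipy import stats

# Hypothetical file holding the VITAMINB data: columns ratsize, diet, kidneywt.
df = pd.read_csv("vitaminb.csv")

# One array of kidney weights per ratsize/diet combination.
groups = [g["kidneywt"].to_numpy() for _, g in df.groupby(["ratsize", "diet"])]

stat, p = stats.levene(*groups)   # null: all groups have the same variance
print(stat, p)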

Second example: Vanadium is an important trace element found in living organisms. Experiment done to compare V concentrations in different materials (oyster tissue, citrus leaves, bovine liver, human serum). Data p. 640 and in VANADIUM.

Ultimately: test whether the mean concentration is the same in the materials. But first test the assumption of equal spread with Levene's test:

Levene's Test (any continuous distribution)

Test Statistic: 3.214
P-Value       : 0.070

P-value not smaller than 0.05, but is small, so have doubts about the equal-spread assumption.

Need a transformation of the response variable, as in regression.

Get means and SDs for the groups. Can be done by running the ANOVA and ignoring half the results!

                          Individual 95% CIs For Mean
                          Based on Pooled StDev
Level  N    Mean   StDev  ----+---------+---------+---------+--
1      3  1.3300  1.0053       (-----*------)
2      3  3.1600  0.8884                      (-----*------
3      3  0.4100  0.1212  (-----*------)
4      5  0.1460  0.0279  (----*----)
                          ----+---------+---------+---------+--
Pooled StDev = 0.6027       0.0       1.2       2.4       3.6

Groups with a larger mean also have a larger SD.

When this happens (quite common), a transformation like logarithm or square root often helps. May be theory to guide the choice, eg. if percent change in response meaningful, log better.


Try log here. Create a new column of logged concentrations using Calculator. Select Calc and Calculator, call the new variable logconc, and select Natural log of conc. See the new variable appear in the worksheet.

Do Levene's test again with logconc as response:

Levene's Test (any continuous distribution)

Test Statistic: 1.568
P-Value       : 0.258

This is a lot better. Running the ANOVA shows that we have (somewhat) evened out the SDs:

Analysis of Variance for logconc
Source     DF      SS     MS      F      P
material    3  19.313  6.438  25.96  0.000
Error      10   2.480  0.248
Total      13  21.793

                           Individual 95% CIs For Mean
                           Based on Pooled StDev
Level  N     Mean   StDev  -+---------+---------+---------+-----
1      3   0.0127  0.9905            (----*----)
2      3   1.1239  0.2835                    (----*-----)
3      3  -0.9206  0.2945       (----*-----)
4      5  -1.9412  0.2142  (---*---)
                           -+---------+---------+---------+-----
Pooled StDev = 0.4980     -2.4      -1.2       0.0       1.2

Now can do the F-test: there is a significant difference in mean (log-)concentrations among the 4 materials, since the P-value is 0 to the accuracy shown. Confident that the assumptions of the ANOVA are OK.
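The transform-and-recheck loop in Python (a sketch; the file name vanadium.csv and the column names material and conc are assumptions):

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical file holding the VANADIUM data: columns material (1-4) and conc.
df = pd.read_csv("vanadium.csv")

df["logconc"] = np.log(df["conc"])   # the log transformation
logged = [g["logconc"].to_numpy() for _, g in df.groupby("material")]

print(stats.levene(*logged))     # recheck the equal-spread assumption
print(stats.f_oneway(*logged))   # one-way ANOVA F-test on the logged data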

Follow up here with Tukey-Kramer (groups of different sizes).

From Table C-11, 4 groups, 10 error df, q = 4.33. Pooled SD is s = 0.498.

Save work: only ever compare 2 groups of size 3, or groups of sizes 3 and 5. So only need two w's:

w(3, 3) = (4.33)(0.498/√2) √(1/3 + 1/3) = 1.245.
w(3, 5) = (4.33)(0.498/√2) √(1/3 + 1/5) = 1.114.

Comparing the group means: material 2 is significantly different from materials 3 and 4, and material 4 from materials 1 and 2; the remaining pairs (1 vs 2, 1 vs 3, 3 vs 4) are closer together than the corresponding w, so not significantly different.
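Checking the two cutoffs directly (the same arithmetic as the helper sketched earlier):

import math

# w(3, 3) and w(3, 5), using w = (q s / sqrt(2)) * sqrt(1/ni + 1/nj).
q, s = 4.33, 0.498
print(round(q * s / math.sqrt(2) * math.sqrt(1/3 + 1/3), 3))   # 1.245
print(round(q * s / math.sqrt(2) * math.sqrt(1/3 + 1/5), 3))   # 1.114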

