• Suppose we observe bivariate data (X, Y ), but we do not know the regression function
E(Y |X = x). In many cases it is reasonable to assume that the function is linear:
E(Y |X = x) = α + βx.
A bivariate data set with E(Y|X = x) = 3 + 2X, where the line Y = 2.5 + 1.5X is shown in blue. The residuals are the green vertical line segments.
• One approach to estimating the unknowns α and β is to consider the sum of squared residuals
function, or SSR.
The SSR is the function

SSR(α, β) = Σᵢ rᵢ² = Σᵢ (Yᵢ − α − βXᵢ)².

When α and β are chosen so the fit to the data is good, SSR will be small. If α and β are chosen so the fit to the data is poor, SSR will be large.
Left: a poor choice of α and β that gives high SSR. Right: α and β that give nearly the smallest possible SSR.
• It is a fact that among all possible α and β, the following values minimize the SSR:

β̂ = cov(X, Y)/var(X)
α̂ = Ȳ − β̂X̄.

The fitted regression function is then Ê(Y|X = x) = α̂ + β̂x, and the fitted value at an observed point Xᵢ is Ŷᵢ = α̂ + β̂Xᵢ.
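As a sketch of the computation (the small data set here is invented for illustration, not taken from the notes), β̂ and α̂ follow directly from the sample covariance and variance:

```python
import numpy as np

# Invented illustrative data
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# beta_hat = cov(X, Y)/var(X); alpha_hat = Ybar - beta_hat * Xbar
beta_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
alpha_hat = Y.mean() - beta_hat * X.mean()

# Fitted values and residuals at the observed X values
Y_hat = alpha_hat + beta_hat * X
resid = Y - Y_hat
```

The least squares residuals always sum to zero, which is a quick sanity check on the fit.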
• Some properties of the least squares estimates:
1. β̂ = cor(X, Y )σ̂Y /σ̂X , so β̂ and cor(X, Y ) always have the same sign – if the
data are positively correlated, the estimated slope is positive, and if the data are
negatively correlated, the estimated slope is negative.
2. The fitted line α̂ + β̂x always passes through the overall mean (X̄, Ȳ ).
3. Since cov(cX, Y ) = c · cov(X, Y ) and var(cX) = c2 · var(X), if we scale the X
values by c then the slope is scaled by 1/c. If we scale the Y values by c then the
slope is scaled by c.
• Once we have α̂ and β̂, we can compute the residuals rᵢ based on these estimates, i.e. rᵢ = Yᵢ − α̂ − β̂Xᵢ.

• The simple linear regression model states that

Yᵢ = α + βXᵢ + εᵢ,

where α and β are the population regression coefficients, and the εᵢ are iid random variables with mean 0 and standard deviation σ. The εᵢ are called errors.
• Model assumptions:

1. The regression function is linear: E(Y|X = x) = α + βx.
2. The errors εᵢ are iid with mean 0 and constant standard deviation σ.
3. The errors εᵢ are normally distributed.

Assumption 3 is not always necessary. The least squares estimates α̂ and β̂ are still valid when the εᵢ are not normal (as long as 1 and 2 are met). However, hypothesis tests, CI's, and PI's (derived below) depend on normality of the εᵢ.
• Since α̂ and β̂ are functions of the data, which is random, they are random variables, and
hence they have a distribution.
This distribution reflects the sampling variation that causes α̂ and β̂ to differ somewhat from
the population values α and β.
The sampling variation is less if the sample size n is large, and if the error standard deviation
σ is small.
The sampling variation of β̂ is less if the Xi values are more variable.
We will derive formulas later. For now, we can look at histograms.
Sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 2, the sample size is n = 200, and σX ≈ 1.2.
Sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 1/2, the sample size is n = 200, and σX ≈ 1.2.
Sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 2, the sample size is n = 50, and σX ≈ 1.2.
Sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 2, the sample size is n = 50, and σX ≈ 2.2.
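These histograms can be reproduced with a short simulation; the following sketch (not code from the notes) estimates the sampling SD of β̂ under the four settings shown above:

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_estimates(n, sigma, sigma_x, reps=1000):
    """Return `reps` least squares slope estimates for Y = 1 - 2X + eps."""
    est = np.empty(reps)
    for k in range(reps):
        X = rng.normal(0, sigma_x, n)
        Y = 1 - 2 * X + rng.normal(0, sigma, n)
        est[k] = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
    return est

# Empirical SDs of beta_hat under the four settings in the histograms
base        = slope_estimates(n=200, sigma=2.0, sigma_x=1.2).std()  # first figure
small_sigma = slope_estimates(n=200, sigma=0.5, sigma_x=1.2).std()  # smaller error SD
small_n     = slope_estimates(n=50,  sigma=2.0, sigma_x=1.2).std()  # smaller sample
spread_x    = slope_estimates(n=50,  sigma=2.0, sigma_x=2.2).std()  # more variable X
```

The three factors appear directly: shrinking σ or growing n reduces the sampling SD, and so does spreading out the Xᵢ values.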
Sampling properties of the least squares estimates
• The following is an identity for the sample covariance:

cov(X, Y) = (1/(n − 1)) Σᵢ (Yᵢ − Ȳ)(Xᵢ − X̄)
          = (1/(n − 1)) Σᵢ YᵢXᵢ − (n/(n − 1)) ȲX̄.

The average of the products minus the product of the averages (almost).

A similar identity for the sample variance is

var(Y) = (1/(n − 1)) Σᵢ (Yᵢ − Ȳ)²
       = (1/(n − 1)) Σᵢ Yᵢ² − (n/(n − 1)) Ȳ².

The average of the squares minus the square of the average (almost).
• An identity for the regression model Yᵢ = α + βXᵢ + εᵢ: averaging both sides,

(1/n) Σᵢ Yᵢ = (1/n) Σᵢ (α + βXᵢ + εᵢ)
Ȳ = α + βX̄ + ε̄.

Combining the covariance and variance identities with β̂ = cov(X, Y)/var(X) gives

β̂ = (Σᵢ YᵢXᵢ − nȲX̄) / (Σᵢ Xᵢ² − nX̄²).
Since

Σᵢ YᵢXᵢ = Σᵢ (α + βXᵢ + εᵢ)Xᵢ = α Σᵢ Xᵢ + β Σᵢ Xᵢ² + Σᵢ εᵢXᵢ
        = nαX̄ + β Σᵢ Xᵢ² + Σᵢ εᵢXᵢ,

substituting this and Ȳ = α + βX̄ + ε̄ into the formula for β̂ gives

β̂ = (β Σᵢ Xᵢ² − nβX̄² + Σᵢ εᵢXᵢ − nε̄X̄) / (Σᵢ Xᵢ² − nX̄²),

and further

β̂ = β + (Σᵢ εᵢXᵢ − nε̄X̄) / (Σᵢ Xᵢ² − nX̄²).
To apply this result, note that by the model assumptions Eεᵢ = Eε̄ = 0, so E(Σᵢ εᵢXᵢ − nε̄X̄) = 0 (treating the Xᵢ as fixed), and we can conclude that Eβ̂ = β.

This means that β̂ is an unbiased estimate of β – it is correct on average.

If we observe an independent SRS every day for 1000 days from the same linear model, and we calculate β̂ᵢ each day for i = 1, . . . , 1000, the daily β̂ᵢ may differ from the population β due to sampling variation, but the average Σᵢ β̂ᵢ/1000 will be extremely close to β.
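This thought experiment is easy to simulate; the sketch below (invented settings, matching the earlier histograms) averages 1000 independent daily slope estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = -2.0
estimates = []
for day in range(1000):  # one independent SRS per "day"
    X = rng.normal(0, 1.2, 200)
    Y = 1 + beta * X + rng.normal(0, 2, 200)
    estimates.append(np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1))
mean_est = np.mean(estimates)  # close to beta, though each daily estimate varies
```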
• Now that we know Eβ̂ = β, the corresponding analysis for α̂ is straightforward. Since

α̂ = Ȳ − β̂X̄,

then

Eα̂ = EȲ − Eβ̂ · X̄ = α + βX̄ − βX̄ = α,

so α̂ is also unbiased.
• Next we would like to calculate the standard deviation of β̂, which will allow us to produce
a CI for β.
Beginning with

β̂ = β + (Σᵢ εᵢXᵢ − nε̄X̄) / (Σᵢ Xᵢ² − nX̄²),

and treating the Xᵢ as fixed,

var(β̂) = [var(Σᵢ εᵢXᵢ) + var(nε̄X̄) − 2cov(Σᵢ εᵢXᵢ, nε̄X̄)] / (Σᵢ Xᵢ² − nX̄²)².
P
Simplifying, note that

cov(εᵢ, ε̄) = Σⱼ cov(εᵢ, εⱼ)/n
           = σ²/n.
So we get

var(β̂) = (σ² Σᵢ Xᵢ² + nσ²X̄² − 2nX̄ Σᵢ Xᵢ σ²/n) / (Σᵢ Xᵢ² − nX̄²)²
        = (σ² Σᵢ Xᵢ² + nσ²X̄² − 2nX̄²σ²) / (Σᵢ Xᵢ² − nX̄²)².
Almost done:

var(β̂) = (σ² Σᵢ Xᵢ² − nX̄²σ²) / (Σᵢ Xᵢ² − nX̄²)²
        = σ² / (Σᵢ Xᵢ² − nX̄²)
        = σ² / ((n − 1)var(X)),

and

sd(β̂) = σ / (√(n − 1) σ̂X).
• The slope SD formula is consistent with the three factors that influenced the precision of β̂ in the histograms: sd(β̂) decreases as the sample size n grows, as the error SD σ shrinks, and as the spread σ̂X of the Xᵢ values grows.

• A similar calculation for the intercept gives

var(α̂) = σ² (Σᵢ Xᵢ²/n) / ((n − 1)var(X)).

Thus var(α̂) = var(β̂) · Σᵢ Xᵢ²/n. Due to the Σᵢ Xᵢ²/n term, the estimate will be more precise when the Xᵢ values are close to zero.

Since α̂ is the intercept, it's easier to estimate when the data are close to the origin.
• Summary of sampling properties of α̂, β̂:

Both are unbiased: Eα̂ = α, Eβ̂ = β.

var(α̂) = σ² (Σᵢ Xᵢ²/n) / ((n − 1)var(X))

var(β̂) = σ² / ((n − 1)var(X))
• Start with the basic inequality for standardized β̂:

P(−1.96 ≤ √(n − 1) σ̂X (β̂ − β)/σ ≤ 1.96) = 0.95,

then get β alone in the middle:

P(β̂ − 1.96 σ/(√(n − 1) σ̂X) ≤ β ≤ β̂ + 1.96 σ/(√(n − 1) σ̂X)) = .95.

Replace 1.96 with 1.64, etc. to get CI's with different coverage probabilities.
• Note that in general we will not know σ, so we will need to plug in the estimate σ̂, where σ̂² = Σᵢ rᵢ²/(n − 2), for σ.
This plug-in changes the sampling distribution to tn−2 , so to be exact, we would replace
the 1.96 in the above formula with QT (.975), where QT is the quantile function of the tn−2
distribution.
If n is reasonably large, the normal quantile will be an excellent approximation.
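A sketch of the resulting interval on simulated data, using the plug-in σ̂ and the tn−2 quantile from scipy (assumed available):

```python
import numpy as np
from scipy.stats import t  # for the t_{n-2} quantile

rng = np.random.default_rng(2)
n = 50
X = rng.normal(0, 1.2, n)
Y = 1 - 2 * X + rng.normal(0, 2, n)

beta_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
alpha_hat = Y.mean() - beta_hat * X.mean()
resid = Y - alpha_hat - beta_hat * X
sigma_hat = np.sqrt((resid ** 2).sum() / (n - 2))  # plug-in estimate of sigma

se = sigma_hat / (np.sqrt(n - 1) * X.std(ddof=1))  # estimated sd(beta_hat)
q = t.ppf(0.975, n - 2)                            # replaces 1.96 for exact coverage
ci = (beta_hat - q * se, beta_hat + q * se)
```

With n = 50 the t quantile (≈ 2.01) is already close to the normal value 1.96, illustrating why the normal approximation is excellent for large n.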
• The fitted value at X, denoted Ŷ , is the Y coordinate of the estimated regression line at X:
Ŷ = α̂ + β̂X
The fitted value is an estimate of the regression function E(Y |X) evaluated at the point X,
so we may also write Ê(Y |X).
Fitted values may be calculated at any X value. If X is one of the observed X values, say
X = Xi , write Ŷi = α̂ + β̂Xi .
• Since Ŷᵢ is a random variable, we can calculate its mean and variance.

To get the mean, recall that Eα̂ = α and Eβ̂ = β. Therefore

EŶᵢ = Eα̂ + Eβ̂ · Xᵢ = α + βXᵢ = E(Y|X = Xᵢ),

so the fitted value is an unbiased estimate of the regression function at Xᵢ.

To get the variance, expand

var Ŷᵢ = var(α̂) + Xᵢ² var(β̂) + 2Xᵢ cov(α̂, β̂).

To derive cov(α̂, β̂), similar techniques as were used to calculate var α̂ and var β̂ can be applied. The result is

cov(α̂, β̂) = −σ²X̄/(nσX²).
Simplifying yields

var Ŷᵢ = (σ²/(nσX²)) (σX² + X̄² + Xᵢ² − 2XᵢX̄),

which reduces further to

var Ŷᵢ = (σ²/(nσX²)) (σX² + (Xᵢ − X̄)²).

An equivalent expression is

var Ŷᵢ = (σ²/n) (1 + ((Xᵢ − X̄)/σX)²).

To simplify notation define

σᵢ² = (1/n) (1 + ((Xᵢ − X̄)/σX)²),

so that var Ŷᵢ = σ²σᵢ².
• We now know the mean and variance of Ŷᵢ. Standardizing yields

P(−1.96 ≤ (Ŷᵢ − (α + βXᵢ))/(σσᵢ) ≤ 1.96) = .95,

equivalently

P(Ŷᵢ − 1.96σσᵢ ≤ α + βXᵢ ≤ Ŷᵢ + 1.96σσᵢ) = .95.

We can make the coverage probability exactly 0.95 by using the tn−2 distribution to calculate quantiles, replacing 1.96 with QT(.975) as before.
• The following show CI’s for the population regression function E(Y |X). In each data figure,
a CI is formed for each Xi value. Note that the goal of each CI is to cover the green line,
and this should happen 95% of the time.
Note also that the CI’s are narrower for Xi close to X̄ compared to Xi that are far from X̄.
Also note that the CI’s are longer when σ is greater.
The red points are a bivariate data set generated according to the model Y = −4 + 1.4X + ε, where SD(ε) = .4. The green line is the population regression function, the blue line is the fitted regression function, and the vertical blue bars show 95% CI's for E(Y|X = Xᵢ) at each Xᵢ value.
The red points are a bivariate data set generated according to the model Y = −4 + 1.4X + ε, where SD(ε) = 1. The green line is the population regression function, the blue line is the fitted regression function, and the vertical blue bars show 95% CI's for E(Y|X = Xᵢ) at each Xᵢ value.
This is an independent realization from the model shown in the previous figure.
Prediction intervals
• Suppose we observe a new X point X ∗ after having calculated α̂ and β̂ based on an inde-
pendent data set. How can we predict the Y value Y ∗ corresponding to X ∗ ?
It makes sense to use α̂ + β̂X ∗ as the prediction. We would also like to quantify the uncer-
tainty in this prediction.
• First note that E(α̂ + β̂X∗) = α + βX∗ = EY∗, so the prediction is unbiased.

Calculate the variance of the prediction error:

var(Y∗ − α̂ − β̂X∗) = var(Y∗) + var(α̂ + β̂X∗) − 2cov(Y∗, α̂ + β̂X∗)
                   = σ² + σ²σ∗²
                   = σ²(1 + σ∗²),

where σ∗² is defined like σᵢ², with Xᵢ replaced by X∗. Note that the covariance term is 0 since Y∗ is independent from the data used to fit the model.

When n is large, α and β are very precisely estimated, so σ∗ is very small, and the variance of the prediction error is ≈ σ² – nearly all of the uncertainty comes from the error term ε.
The prediction interval follows from

P(−1.96 ≤ (Y∗ − α̂ − β̂X∗)/(σ√(1 + σ∗²)) ≤ 1.96) = .95,

which can be rewritten

P(α̂ + β̂X∗ − 1.96σ√(1 + σ∗²) ≤ Y∗ ≤ α̂ + β̂X∗ + 1.96σ√(1 + σ∗²)) = .95.
• As with the CI, we will plug in σ̂ for σ, making the coverage approximate:

P(α̂ + β̂X∗ − 1.96σ̂√(1 + σ∗²) ≤ Y∗ ≤ α̂ + β̂X∗ + 1.96σ̂√(1 + σ∗²)) ≈ .95.

For the coverage probability to be exactly 95%, 1.96 should be replaced with Q(0.975), where Q is the tn−2 quantile function.
• The following two figures show fitted regression lines for a data set of size n = 20 (the fitted
regression line is shown but the data are not shown). Then 95% PI’s are calculated at each
Xi , and an independent data set of size n = 20 is generated at the same set of Xi values.
The PI’s should cover the new data values 95% of the time.
The PI’s are slightly narrower in the center, but this is hard to see unless n is quite small.
Residuals
• The residual ri is the difference between the fitted and observed values at Xi : ri = Yi − Ŷi .
The residual is a random variable since it depends on the data.
Be sure you understand the difference between the residual (ri ) and the error (i ):
Yi = α + βXi + i
Yi = α̂ + β̂Xi + ri
Summing the residuals,

Σᵢ rᵢ = Σᵢ Yᵢ − Σᵢ Ŷᵢ
     = Σᵢ Yᵢ − nα̂ − β̂ Σᵢ Xᵢ
     = nȲ − n(Ȳ − β̂X̄) − nβ̂X̄
     = 0,

so the residuals always sum to zero.
• Each residual rᵢ estimates the corresponding error εᵢ. The εᵢ are iid; however, the rᵢ are not iid.

We already saw that Erᵢ = 0. To calculate var rᵢ, a derivation similar to that of var Ŷᵢ yields

var rᵢ = (1 − σᵢ²)σ².

Since a variance must be positive, it must be true that σᵢ² ≤ 1. This is easier to see by rewriting σᵢ² as follows:

σᵢ² = 1/n + (Xᵢ − X̄)² / Σⱼ(Xⱼ − X̄)².
It is true that Σᵢ σᵢ² = 2, so the residual variances sum to Σᵢ var rᵢ = (n − 2)σ².
Sums of squares
• We would like to understand how the quantities Yᵢ − Ȳ, rᵢ = Yᵢ − Ŷᵢ, and Ŷᵢ − Ȳ are related. Note first that each has mean zero:

(1/n) Σᵢ (Yᵢ − Ȳ) = (1/n) Σᵢ rᵢ = (1/n) Σᵢ (Ŷᵢ − Ȳ) = 0.
• The following figure shows n = 20 points generated from the model Y = −4 + 1.4X + ,
where SD() = 2. The green line is the population regression line, the blue line is the fitted
regression line, and the black line is the constant line Y = EY . Note that another way to
write EY18 is E(Y |X = X18 ).
The figure marks a typical point (X₁₈, Y₁₈) together with (X₁₈, Ŷ₁₈) on the fitted line Y = α̂ + β̂X, (X₁₈, EY₁₈) on the population line Y = α + βX, and (X₁₈, EY) on the constant line Y = EY.
• We will begin with two identities. First,

Ŷᵢ = α̂ + β̂Xᵢ
   = Ȳ − β̂X̄ + β̂Xᵢ
   = Ȳ + β̂(Xᵢ − X̄).

Second,

Σᵢ (Yᵢ − Ŷᵢ)(Ŷᵢ − Ȳ) = β̂ Σᵢ (Yᵢ − Ȳ − β̂(Xᵢ − X̄))(Xᵢ − X̄)
                      = β̂ Σᵢ (Yᵢ − Ȳ)(Xᵢ − X̄) − β̂² Σᵢ (Xᵢ − X̄)²
                      = β̂(n − 1)cov(Y, X) − (n − 1)β̂² var(X)
                      = β̂(n − 1)cov(Y, X) − (n − 1)β̂ cov(Y, X)
                      = 0.
Since the mean of Yi − Ŷi and the mean of Ŷi − Ȳ are both zero,
X
(Yi − Ŷi )(Ŷi − Ȳ ) = (n − 1)cov(Yi − Ŷi , Ŷi − Ȳ ).
Therefore we have shown that the residual ri = Yi − Ŷi and the fitted values Ŷi are uncorre-
lated.
We now have the following “sum of squares law”:

Σᵢ (Yᵢ − Ȳ)² = Σᵢ (Yᵢ − Ŷᵢ)² + Σᵢ (Ŷᵢ − Ȳ)²,

abbreviated SSTO = SSE + SSR.
• The following terminology is used:

Abbrev.   DF      Formula
MSTO      n − 1   Σᵢ (Yᵢ − Ȳ)²/(n − 1)
MSE       n − 2   Σᵢ (Yᵢ − Ŷᵢ)²/(n − 2)
MSR       1       Σᵢ (Ŷᵢ − Ȳ)²

Note that the MSTO is the sample variance of Y, and the MSE is the estimate σ̂² of σ² in the regression model.

The “SS” values add: SSTO = SSE + SSR, and the degrees of freedom add: n − 1 = (n − 2) + 1. The “MS” values do not add: MSTO ≠ MSE + MSR.
• If the model fits the data well, MSE will be small and MSR will be large. Conversely, if the model fits the data poorly then MSE will be large and MSR will be small. Thus the statistic

F = MSR/MSE

can be used to evaluate the fit of the linear model (bigger F = better fit).

Under the null hypothesis, the distribution of F is an “F distribution with 1, n − 2 DF”, or F1,n−2.
We can test the null hypothesis that the data follow a model Yi = µ+i against the alternative
that the data follow a model Yi = α+βXi +i using the F statistic (an “F test”). A computer
package or a table of the F distribution can be used to determine a p-value.
• In the case of simple linear regression, the F test is equivalent to the hypothesis test of β = 0 versus β ≠ 0. Later when we come to multiple linear regression, this will not be the case.
A useful way to think about what the F-test is evaluating is that the null hypothesis is “all
Y values have the same expected value” and the alternative is that “the expected value of
Yi depends on the value of Xi ”.
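A sketch of the F test on simulated data; the sum of squares decomposition can be verified numerically, and scipy's F distribution (assumed available) gives the p-value:

```python
import numpy as np
from scipy.stats import f  # F distribution with (1, n-2) DF

rng = np.random.default_rng(5)
n = 30
X = rng.normal(0, 1, n)
Y = -4 + 1.4 * X + rng.normal(0, 2, n)

beta_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
alpha_hat = Y.mean() - beta_hat * X.mean()
fitted = alpha_hat + beta_hat * X

ssto = ((Y - Y.mean()) ** 2).sum()
sse = ((Y - fitted) ** 2).sum()
ssr = ((fitted - Y.mean()) ** 2).sum()

F = (ssr / 1) / (sse / (n - 2))  # MSR / MSE
p_value = f.sf(F, 1, n - 2)      # upper tail probability of F_{1, n-2}
```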
Diagnostics
• In practice, we may not be certain that the assumptions underlying the linear model are satisfied by a particular data set. To review, the key assumptions are:

(1) the regression function is linear: E(Y|X = x) = α + βx;
(2) the errors εᵢ are iid with mean 0 and constant standard deviation σ;
(3) the errors εᵢ are normally distributed.

Note that (3) is not essential for the estimates to be valid, but should be approximately satisfied for confidence intervals and hypothesis tests to be valid. If the sample size is large, then it is less crucial that (3) be met.
• To assess whether (1) and (2) are satisfied, make a scatterplot of the residuals rᵢ against the fitted values Ŷᵢ.
This is called a “residuals on fitted values plot”.
Recall that we showed above that ri and Ŷi are uncorrelated.
Thus if the model assumptions are met this plot should look like iid noise – there should be
no visually apparent trends or patterns.
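This zero correlation is exact for any least squares fit, as the sketch below (simulated data) confirms numerically:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = rng.uniform(-2, 2, n)
Y = 1 + 3 * X + rng.normal(0, 1, n)

beta_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
alpha_hat = Y.mean() - beta_hat * X.mean()
fitted = alpha_hat + beta_hat * X
resid = Y - fitted

# Residuals and fitted values are uncorrelated by construction, so a
# residuals-on-fitted plot of a well-specified model shows no linear trend.
corr = np.corrcoef(resid, fitted)[0, 1]
```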
For example, the following shows how a residual on fitted values plot can be used to detect
nonlinearity in the regression function.
Left: A bivariate data set (red points) with fitted regression line (blue). Right: A diagnostic plot of residuals on fitted values.
The following shows how a residual on fitted values plot can be used to detect heteroscedas-
ticity.
Left: A bivariate data set (red points) with fitted regression line (blue). Right: A diagnostic plot of residuals on fitted values.
• Suppose that the observations were collected in sequence, say two per day for a period of
one month, yielding n = 60 points. There may be some concern that the distribution has
shifted over time.
These are called “sequence effects” or “time of measurement effects”.
To detect these effects, plot the residual ri against time. There should be no pattern in the
plot.
• To assess the normality of the errors use a normal probability plot of the residuals.
For example, the following shows a bivariate data set in which the errors are uniform on
[−1, 1] (i.e. any value in that interval is equally likely to occur as the error). This is evident
in the quantile plot of the ri .
Left: A bivariate data set (red points) with fitted regression line (blue). Right: A normal probability plot of the residuals.
A scatterplot of January temperature on latitude for the weather stations, with non-outliers, 2 SD outliers, and 3 SD outliers marked separately; the station at Ann Arbor, MI is labeled.
It turns out that of the 19 outliers, 18 are warmer than expected, and these stations are all
in northern California and Oregon.
The one outlier station that is substantially colder than expected is in Gunnison County, Colorado, which is very high in elevation (at 2,339 m, it is the fourth highest of 1072 stations in the data set).
In January 2001, Ann Arbor, Michigan was slightly colder than the fitted value (i.e. it was
a bit colder here than in other places of similar latitude).
A plot of residuals on fitted values for the regression of January maximum temperature on latitude.
Transformations
• If the assumptions of the linear model are not met, it may be possible to transform the data
so that a linear fit to the transformed data meets the assumptions more closely.
Your options are to transform Y only, transform X only, or transform both Y and X.
The most useful transforms are the log transform X → log(X + c) and the power transform X → (X + c)^q.
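As a sketch of why these transforms help (simulated data, not an example from the notes): a model with multiplicative errors is heteroscedastic on the raw scale, but taking log(Y) makes it linear with constant error SD:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.uniform(0.5, 3, n)
# Multiplicative noise: the error spread grows with the mean on the raw scale.
Y = np.exp(1 + 0.8 * X) * rng.lognormal(0, 0.3, n)

# A log transform of Y turns the multiplicative model into a linear one:
# log(Y) = 1 + 0.8 X + eps, with eps ~ Normal(0, 0.3).
Z = np.log(Y)
beta_hat = np.cov(X, Z, ddof=1)[0, 1] / np.var(X, ddof=1)
```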
The following example shows a situation where the errors do not seem to be homoscedastic.
Left: Scatterplot of the raw data, with the regression line drawn in green. Right: Scatterplot of residuals on fitted values.
Here is the same example where the Y variable was transformed to log(Y ):
Left: Scatterplot of the transformed data, with the regression line drawn in green. Right: Scatterplot of residuals on fitted values.
• Another common situation occurs when the X values are skewed:
Left: Scatterplot of the raw data, with the regression line drawn in green. Right: Scatterplot of residuals on fitted values.
In this case transforming X to X^{1/4} removed the skew:
Left: Scatterplot of the transformed data, with the regression line drawn in green. Right: Scatterplot of residuals on fitted values.
• Logarithmically transforming both variables (a “log/log” plot) can reduce both heteroscedas-
ticity and skew:
Left: Scatterplot of the raw data, with the regression line drawn in green. Right: Scatterplot of residuals on fitted values.
After the transform:
Left: Scatterplot of the transformed data, with the regression line drawn in green. Right: Scatterplot of residuals on fitted values.