
Author(s): Kerby Shedden, Ph.D., 2010

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use. Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.


Decomposing Variance
Kerby Shedden
Department of Statistics, University of Michigan

October 10, 2011


Law of total variation


For any regression model involving a response $Y$ and a covariate vector $X$, we have

$$\operatorname{var}(Y) = \operatorname{var}_X E(Y|X) + E_X \operatorname{var}(Y|X).$$

Note that this only makes sense if we treat $X$ as being random.

We often wish to distinguish these two situations:

The population is homoscedastic: $\operatorname{var}(Y|X)$ does not depend on $X$, so we can simply write $\operatorname{var}(Y|X) = \sigma^2$, and we get $\operatorname{var}(Y) = \operatorname{var}_X E(Y|X) + \sigma^2$.

The population is heteroscedastic: $\operatorname{var}(Y|X)$ is a function $\sigma^2(X)$ with expected value $\bar\sigma^2 = E_X \sigma^2(X)$, and again we get $\operatorname{var}(Y) = \operatorname{var}_X E(Y|X) + \bar\sigma^2$.

If we write $Y = f(X) + \epsilon$ with $E(\epsilon|X) = 0$, then $E(Y|X) = f(X)$, and $\operatorname{var}_X E(Y|X)$ summarizes the variation of $f(X)$ over the marginal distribution of $X$.
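As a quick numerical check (a simulated heteroscedastic model chosen for illustration, not from the slides), the following sketch verifies the decomposition by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of var(Y) = var_X E(Y|X) + E_X var(Y|X) for a
# hypothetical heteroscedastic model: Y | X ~ N(2 + 3X, 1 + X^2), X ~ N(0, 1).
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
cond_mean = 2 + 3 * x          # E(Y|X)
cond_var = 1 + x**2            # var(Y|X), depends on X
y = cond_mean + np.sqrt(cond_var) * rng.normal(size=n)

print(y.var())                             # total variance
print(cond_mean.var() + cond_var.mean())   # var_X E(Y|X) + E_X var(Y|X)
```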

Law of total variation


[Figure: conditional and marginal distributions of $Y$. Orange curves: conditional distributions of $Y$ given $X$. Purple curve: marginal distribution of $Y$. Black dots: conditional means $E(Y|X)$.]

Pearson correlation
The population Pearson correlation coefficient of two jointly distributed scalar-valued random variables $X$ and $Y$ is

$$\rho_{XY} \equiv \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}.$$

Given data $Y = (Y_1, \ldots, Y_n)'$ and $X = (X_1, \ldots, X_n)'$, the Pearson correlation coefficient is estimated by

$$\hat\rho_{XY} = \frac{\widehat{\operatorname{cov}}(X, Y)}{\hat\sigma_X \hat\sigma_Y} = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_i (X_i - \bar X)^2}\,\sqrt{\sum_i (Y_i - \bar Y)^2}} = \frac{(X - \bar X)'(Y - \bar Y)}{\|X - \bar X\|\,\|Y - \bar Y\|}.$$

When we write $Y - \bar Y$ here, this means $Y - \bar Y \mathbf{1}$, where $\mathbf{1}$ is a vector of 1s and $\bar Y$ is a scalar.
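A small sketch (hypothetical data) confirming that the sum form, the vector form, and numpy's built-in estimate of the correlation agree:

```python
import numpy as np

# Hypothetical data; the three expressions for the sample correlation agree.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

xc, yc = x - x.mean(), y - y.mean()
r_sums = (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())
r_vec = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(r_sums, r_vec, np.corrcoef(x, y)[0, 1])
```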


Pearson correlation

By the Cauchy-Schwarz inequality, $-1 \le \rho_{XY} \le 1$ and $-1 \le \hat\rho_{XY} \le 1$.

The sample correlation coefficient is slightly biased, but the bias is so small that it is usually ignored.


Pearson correlation and simple linear regression slopes


For the simple linear regression model $Y = \alpha + \beta X + \epsilon$, if we view $X$ as a random variable that is uncorrelated with $\epsilon$, then

$$\operatorname{cov}(X, Y) = \beta\sigma_X^2$$

and the correlation is

$$\rho_{XY} = \operatorname{cor}(X, Y) = \frac{\beta}{\sqrt{\beta^2 + \sigma^2/\sigma_X^2}}.$$

The sample correlation coefficient is related to the least squares slope estimate:

$$\hat\beta = \frac{\widehat{\operatorname{cov}}(X, Y)}{\hat\sigma_X^2} = \hat\rho_{XY}\,\frac{\hat\sigma_Y}{\hat\sigma_X}.$$
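A brief check (simulated data with assumed coefficients) of the slope-correlation relationship $\hat\beta = \hat\rho_{XY}\hat\sigma_Y/\hat\sigma_X$:

```python
import numpy as np

# Hypothetical data; the least squares slope equals rho_hat * sd(Y) / sd(X).
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

rho = np.corrcoef(x, y)[0, 1]
slope_from_cor = rho * y.std() / x.std()
slope_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(slope_from_cor, slope_ols)   # identical up to rounding
```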


Orthogonality between fitted values and residuals

Recall that the fitted values are $\hat Y = X\hat\beta = PY$ and the residuals are $R = Y - \hat Y = (I - P)Y$. Since $P(I - P) = 0$, it follows that $\hat Y'R = 0$. Since $\bar R = 0$, it is equivalent to state that the sample correlation between $R$ and $\hat Y$ is zero, i.e. $\operatorname{cor}(R, \hat Y) = 0$.
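A minimal numerical illustration (simulated design and response, not from the slides):

```python
import numpy as np

# Simulated design and response.
rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta_hat               # fitted values P Y
resid = y - yhat                  # residuals (I - P) Y
print(yhat @ resid)                     # ~0: fitted values and residuals are orthogonal
print(np.corrcoef(yhat, resid)[0, 1])   # ~0: their sample correlation vanishes
```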


Coefficient of determination

A descriptive summary of the explanatory power of $X$ for $Y$ is given by the coefficient of determination, also known as the proportion of explained variance, or multiple $R^2$. This is the quantity

$$R^2 \equiv 1 - \frac{\|Y - \hat Y\|^2}{\|Y - \bar Y\|^2} = \frac{\|\hat Y - \bar Y\|^2}{\|Y - \bar Y\|^2} = \frac{\widehat{\operatorname{var}}(\hat Y)}{\widehat{\operatorname{var}}(Y)}.$$

The equivalence between the two expressions follows from the identity

$$\|Y - \bar Y\|^2 = \|Y - \hat Y + \hat Y - \bar Y\|^2 = \|Y - \hat Y\|^2 + \|\hat Y - \bar Y\|^2 + 2(Y - \hat Y)'(\hat Y - \bar Y) = \|Y - \hat Y\|^2 + \|\hat Y - \bar Y\|^2.$$

It should be clear that $R^2 = 0$ iff $\hat Y = \bar Y$ and $R^2 = 1$ iff $\hat Y = Y$.



Coefficient of determination

The coefficient of determination is equal to $\operatorname{cor}(Y, \hat Y)^2$. To see this, note that

$$\operatorname{cor}(Y, \hat Y) = \frac{(Y - \bar Y)'(\hat Y - \bar Y)}{\|Y - \bar Y\|\,\|\hat Y - \bar Y\|} = \frac{(\hat Y - \bar Y + Y - \hat Y)'(\hat Y - \bar Y)}{\|Y - \bar Y\|\,\|\hat Y - \bar Y\|} = \frac{(\hat Y - \bar Y)'(\hat Y - \bar Y) + (Y - \hat Y)'(\hat Y - \bar Y)}{\|Y - \bar Y\|\,\|\hat Y - \bar Y\|} = \frac{\|\hat Y - \bar Y\|}{\|Y - \bar Y\|}.$$
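A small consistency check (simulated data) that the two variance-ratio forms and the squared-correlation form of $R^2$ agree:

```python
import numpy as np

# Simulated data; the three expressions for R^2 agree.
rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
sst = np.sum((y - y.mean())**2)
r2_resid = 1 - np.sum((y - yhat)**2) / sst      # 1 - ||Y - Yhat||^2 / ||Y - Ybar||^2
r2_fit = np.sum((yhat - y.mean())**2) / sst     # ||Yhat - Ybar||^2 / ||Y - Ybar||^2
r2_cor = np.corrcoef(y, yhat)[0, 1]**2          # cor(Y, Yhat)^2
print(r2_resid, r2_fit, r2_cor)
```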

Coefficient of determination in simple linear regression

In general,

$$R^2 = \operatorname{cor}(Y, \hat Y)^2 = \frac{\operatorname{cov}(Y, \hat Y)^2}{\operatorname{var}(Y)\operatorname{var}(\hat Y)}.$$

In the case of simple linear regression,

$$\operatorname{cov}(Y, \hat Y) = \operatorname{cov}(Y, \hat\alpha + \hat\beta X) = \hat\beta\operatorname{cov}(Y, X),$$

and

$$\operatorname{var}(\hat Y) = \operatorname{var}(\hat\alpha + \hat\beta X) = \hat\beta^2\operatorname{var}(X).$$

Thus for simple linear regression, $R^2 = \operatorname{cor}(Y, X)^2 = \operatorname{cor}(Y, \hat Y)^2$.



Relationship to the F statistic

The F-statistic for the null hypothesis $\beta_1 = \cdots = \beta_p = 0$ is

$$F = \frac{\|\hat Y - \bar Y\|^2}{\|Y - \hat Y\|^2}\cdot\frac{n - p - 1}{p} = \frac{R^2}{1 - R^2}\cdot\frac{n - p - 1}{p},$$

which is an increasing function of $R^2$.
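A quick sketch (simulated data) showing that the sums-of-squares form and the $R^2$ form of the F statistic coincide:

```python
import numpy as np

# Simulated data; the F statistic from sums of squares matches the R^2 form.
rng = np.random.default_rng(5)
n, p = 80, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3, 0.2]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
ssr = np.sum((yhat - y.mean())**2)     # ||Yhat - Ybar||^2
sse = np.sum((y - yhat)**2)            # ||Y - Yhat||^2
r2 = ssr / np.sum((y - y.mean())**2)

f_from_ss = (ssr / sse) * (n - p - 1) / p
f_from_r2 = (r2 / (1 - r2)) * (n - p - 1) / p
print(f_from_ss, f_from_r2)
```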


Adjusted $R^2$

The sample $R^2$ is an estimate of the population $R^2$:

$$1 - \frac{\operatorname{var}(Y|X)}{\operatorname{var}(Y)}.$$

Since it is a ratio, the plug-in estimate $R^2$ is biased, although the bias is not large unless the sample size is small or the number of covariates is large. The adjusted $R^2$ is an approximately unbiased estimate of the population $R^2$:

$$1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.$$

The adjusted $R^2$ is always less than the unadjusted $R^2$. The adjusted $R^2$ is always less than or equal to one, but can be negative.
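A short sketch (pure-noise response chosen so the population $R^2$ is 0) illustrating the upward bias of the plug-in $R^2$ and the effect of the adjustment:

```python
import numpy as np

# Pure-noise response: the population R^2 is 0, the plug-in R^2 is biased
# upward, and the adjusted R^2 is much closer to 0 (it can even be negative).
rng = np.random.default_rng(6)
n, p = 30, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - np.sum((y - X @ beta)**2) / np.sum((y - y.mean())**2)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, r2_adj)
```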


The unique variation in one covariate


How much information about $Y$ is present in a covariate $X_k$? This question is not straightforward when the covariates are non-orthogonal, since several covariates may contain overlapping information about $Y$.

Let $X_k^\perp$ be the residual of $X_k$ after regressing it against all other covariates (including the intercept). If $P_{-k}$ is the projection onto $\operatorname{span}(\{X_j,\ j \ne k\})$, then

$$X_k^\perp = (I - P_{-k})X_k.$$

We could use $\widehat{\operatorname{var}}(X_k^\perp)/\widehat{\operatorname{var}}(X_k)$ to assess how much of the variation in $X_k$ is unique, in that it is not also captured by other predictors.

But this measure doesn't involve $Y$, so it can't tell us whether the unique variation in $X_k$ is useful in the regression analysis.


The unique regression information in one covariate

To learn how $X_k$ contributes uniquely to the regression, we can consider how introducing $X_k$ to a working regression model affects the $R^2$.

Let $\hat Y_{-k} = P_{-k}Y$ be the fitted values in the model omitting covariate $k$. Let $R^2$ denote the multiple $R^2$ for the full model, and let $R_{-k}^2$ be the multiple $R^2$ for the regression omitting covariate $X_k$. The value of

$$R^2 - R_{-k}^2$$

is a way to quantify how much unique information about $Y$ in $X_k$ is not captured by the other covariates. This is called the semi-partial $R^2$.


Identity involving norms of fitted values and residuals

Before we continue, we will need a simple identity that is often useful.

In general, if $A$ and $B$ are orthogonal, then

$$\|A + B\|^2 = \|A\|^2 + \|B\|^2.$$

If $A$ and $B - A$ are orthogonal, then

$$\|B\|^2 = \|B - A + A\|^2 = \|B - A\|^2 + \|A\|^2.$$

Thus we have $\|B\|^2 - \|A\|^2 = \|B - A\|^2$.

Applying this fact to regression, we know that the fitted values and residuals are orthogonal. Thus for the regression omitting variable $k$, $\hat Y_{-k}$ and $Y - \hat Y_{-k}$ are orthogonal, so

$$\|Y - \hat Y_{-k}\|^2 = \|Y\|^2 - \|\hat Y_{-k}\|^2.$$

By the same argument, $\|Y - \hat Y\|^2 = \|Y\|^2 - \|\hat Y\|^2$.


Improvement in $R^2$ due to one covariate

Now we can obtain a simple, direct expression for the semi-partial $R^2$.

Since $X_k^\perp$ is orthogonal to the other covariates,

$$\hat Y = \hat Y_{-k} + \frac{\langle Y, X_k^\perp\rangle}{\langle X_k^\perp, X_k^\perp\rangle}\,X_k^\perp$$

and

$$\|\hat Y\|^2 = \|\hat Y_{-k}\|^2 + \langle Y, X_k^\perp\rangle^2/\|X_k^\perp\|^2.$$


Improvement in $R^2$ due to one covariate

Thus we have

$$\begin{aligned}
R^2 &= 1 - \frac{\|Y - \hat Y\|^2}{\|Y - \bar Y\|^2}\\
&= 1 - \frac{\|Y\|^2 - \|\hat Y\|^2}{\|Y - \bar Y\|^2}\\
&= 1 - \frac{\|Y\|^2 - \|\hat Y_{-k}\|^2 - \langle Y, X_k^\perp\rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2}\\
&= 1 - \frac{\|Y - \hat Y_{-k}\|^2}{\|Y - \bar Y\|^2} + \frac{\langle Y, X_k^\perp\rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2}\\
&= R_{-k}^2 + \frac{\langle Y, X_k^\perp\rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2}.
\end{aligned}$$


Semi-partial $R^2$

Thus the semi-partial $R^2$ is

$$R^2 - R_{-k}^2 = \frac{\langle Y, X_k^\perp\rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2} = \frac{\langle Y, X_k^\perp/\|X_k^\perp\|\rangle^2}{\|Y - \bar Y\|^2},$$

where $\tilde Y_k$ is the fitted value for regressing $Y$ on $X_k^\perp$. Since $X_k^\perp/\|X_k^\perp\|$ is centered and has length 1, it follows that

$$R^2 - R_{-k}^2 = \operatorname{cor}(Y, X_k^\perp)^2 = \operatorname{cor}(Y, \tilde Y_k)^2.$$

Thus the semi-partial $R^2$ for covariate $k$ has two equivalent interpretations: It is the improvement in $R^2$ resulting from including covariate $k$ in a working regression model that already contains the other covariates. It is the $R^2$ for a simple linear regression of $Y$ on $X_k^\perp = (I - P_{-k})X_k$.
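A numerical check (simulated correlated covariates; the helper functions are assumptions for this sketch, not part of the slides) of the equivalence between the two interpretations:

```python
import numpy as np

def fit(X, y):
    """Fitted values from least squares of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def r2(X, y):
    """Multiple R^2 for the regression of y on the columns of X."""
    return 1 - np.sum((y - fit(X, y))**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)        # correlated with x1
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_minus = np.column_stack([np.ones(n), x2])      # model omitting x1 (k = 1)

semi_partial = r2(X_full, y) - r2(X_minus, y)    # R^2 - R^2_{-1}
x1_perp = x1 - fit(X_minus, x1)                  # (I - P_{-1}) x1
print(semi_partial, np.corrcoef(y, x1_perp)[0, 1]**2)   # identical
```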

Partial $R^2$

The partial $R^2$ is

$$\frac{R^2 - R_{-k}^2}{1 - R_{-k}^2} = \frac{\langle Y, X_k^\perp\rangle^2/\|X_k^\perp\|^2}{\|Y - \hat Y_{-k}\|^2}.$$

The partial $R^2$ for covariate $k$ is the fraction of the maximum possible improvement in $R^2$ that is contributed by covariate $k$. Let $\hat Y_{-k}$ be the fitted values for regressing $Y$ on all covariates except $X_k$.

Since $\hat Y_{-k}'X_k^\perp = 0$,

$$\frac{\langle Y, X_k^\perp\rangle^2}{\|Y - \hat Y_{-k}\|^2\,\|X_k^\perp\|^2} = \frac{\langle Y - \hat Y_{-k}, X_k^\perp\rangle^2}{\|Y - \hat Y_{-k}\|^2\,\|X_k^\perp\|^2}.$$

The expression on the right is the usual $R^2$ that would be obtained when regressing $Y - \hat Y_{-k}$ on $X_k^\perp$. Thus the partial $R^2$ is the same as the usual $R^2$ for $(I - P_{-k})Y$ regressed on $(I - P_{-k})X_k$.
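A companion sketch (same kind of simulated setup, with assumed helper functions) checking the two expressions for the partial $R^2$:

```python
import numpy as np

def fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def r2(X, y):
    return 1 - np.sum((y - fit(X, y))**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(8)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_minus = np.column_stack([np.ones(n), x2])

partial = (r2(X_full, y) - r2(X_minus, y)) / (1 - r2(X_minus, y))
y_perp = y - fit(X_minus, y)         # (I - P_{-1}) y
x1_perp = x1 - fit(X_minus, x1)      # (I - P_{-1}) x1
partial_alt = r2(np.column_stack([np.ones(n), x1_perp]), y_perp)
print(partial, partial_alt)          # identical
```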

Decomposition of projection matrices


Suppose $P \in \mathbb{R}^{n\times n}$ is a rank-$d$ projection matrix, and $U$ is an $n \times d$ orthogonal matrix whose columns span $\operatorname{col}(P)$. If we partition $U$ by columns,

$$U = \big[\,U_1 \;\big|\; U_2 \;\big|\; \cdots \;\big|\; U_d\,\big],$$

then $P = UU'$, so we can write

$$P = \sum_{j=1}^d U_j U_j'.$$

Note that this representation is not unique, since there are different orthogonal bases for $\operatorname{col}(P)$. Each summand $U_j U_j' \in \mathbb{R}^{n\times n}$ is a rank-1 projection matrix onto $U_j$.
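A small sketch (an arbitrary simulated column space, with an orthonormal basis obtained via QR) of this decomposition:

```python
import numpy as np

# An arbitrary 3-dimensional column space; U comes from a QR factorization.
rng = np.random.default_rng(9)
n, d = 6, 3
A = rng.normal(size=(n, d))
U, _ = np.linalg.qr(A)                       # n x d with orthonormal columns
P = U @ U.T                                  # projection onto col(A)

P_sum = sum(np.outer(U[:, j], U[:, j]) for j in range(d))   # sum of rank-1 projections
print(np.allclose(P, P_sum))                 # True
```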

Decomposition of $R^2$

Question: In a multiple regression model, how much of the variance in $Y$ is explained by a particular covariate?

Orthogonal case: If the design matrix $X$ is orthogonal ($X'X = I$), the projection $P$ onto $\operatorname{col}(X)$ can be decomposed as

$$P = \sum_{j=0}^p P_j = \frac{\mathbf{1}\mathbf{1}'}{n} + \sum_{j=1}^p X_j X_j',$$

where $X_j$ is the $j$th column of the design matrix (assuming here that the first column of $X$ is an intercept).


Decomposition of $R^2$ (orthogonal case)

The $n \times n$ rank-1 matrix $P_j = X_j X_j'$ is the projection onto $\operatorname{span}(X_j)$ (and $P_0$ is the projection onto the span of the vector of 1s). Furthermore, by orthogonality, $P_j P_k = 0$ unless $j = k$.

Since

$$\hat Y - \bar Y = \sum_{j=1}^p P_j Y,$$

by orthogonality

$$\|\hat Y - \bar Y\|^2 = \sum_{j=1}^p \|P_j Y\|^2.$$

Here we are using the fact that if $U_1, \ldots, U_m$ are orthogonal, then

$$\|U_1 + \cdots + U_m\|^2 = \|U_1\|^2 + \cdots + \|U_m\|^2.$$

Decomposition of $R^2$ (orthogonal case)

The $R^2$ for simple linear regression of $Y$ on $X_j$ is

$$R_j^2 \equiv \|\hat Y_j - \bar Y\|^2 / \|Y - \bar Y\|^2 = \|P_j Y\|^2 / \|Y - \bar Y\|^2$$

(where $\hat Y_j$ denotes the fitted values from that simple regression), so we see that for orthogonal design matrices,

$$R^2 = \sum_{j=1}^p R_j^2.$$

That is, the overall coefficient of determination is the sum of univariate coefficients of determination for all the explanatory variables.
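A simulation sketch (an orthonormal design built via QR, with assumed coefficients) verifying the additivity:

```python
import numpy as np

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.sum((y - X @ beta)**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(10)
n, p = 200, 3
# QR gives orthonormal columns; the first spans the intercept direction.
X, _ = np.linalg.qr(np.column_stack([np.ones(n), rng.normal(size=(n, p))]))
y = X @ np.array([5.0, 1.0, -2.0, 0.5]) + rng.normal(size=n)

r2_full = r2(X, y)
r2_each = [r2(X[:, [0, j]], y) for j in range(1, p + 1)]   # single-covariate R^2's
print(r2_full, sum(r2_each))                               # equal
```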


Decomposition of $R^2$

Non-orthogonal case: If $X$ is not orthogonal, the overall $R^2$ will not be the sum of single covariate $R^2$'s. If we let $R_j^2$ be as above (the $R^2$ values for regressing $Y$ on each $X_j$), then there are two different situations: $\sum_j R_j^2 > R^2$, and $\sum_j R_j^2 < R^2$.


Decomposition of $R^2$

Case 1: $\sum_j R_j^2 > R^2$

It's not surprising that $\sum_j R_j^2$ can be bigger than $R^2$. For example, suppose that

$$Y = X_1 + \epsilon$$

is the data generating model, and $X_2$ is highly correlated with $X_1$ (but is not part of the data generating model).

For the regression of $Y$ on both $X_1$ and $X_2$, the multiple $R^2$ will be approximately $1 - \sigma^2/\operatorname{var}(Y)$ (since $E(Y|X_1, X_2) = E(Y|X_1) = X_1$). The $R^2$ values for $Y$ regressed on either $X_1$ or $X_2$ separately will also be approximately $1 - \sigma^2/\operatorname{var}(Y)$.

Thus $R_1^2 + R_2^2 \approx 2R^2$.
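A quick simulation of this case (assumed correlation structure, not from the slides):

```python
import numpy as np

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.sum((y - X @ beta)**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(11)
n = 100_000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)      # highly correlated with x1, irrelevant given x1
y = x1 + rng.normal(size=n)

ones = np.ones(n)
r2_full = r2(np.column_stack([ones, x1, x2]), y)
r2_1 = r2(np.column_stack([ones, x1]), y)
r2_2 = r2(np.column_stack([ones, x2]), y)
print(r2_full, r2_1 + r2_2)             # the sum is roughly twice the multiple R^2
```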


Decomposition of $R^2$

Case 2: $\sum_j R_j^2 < R^2$

This is more surprising, and is sometimes called enhancement. As an example, suppose the data generating model is

$$Y = Z + \epsilon,$$

but we don't observe $Z$ (for simplicity assume $EZ = 0$). Instead, we observe a value $X_2$ with mean zero that is independent of $Z$ and $\epsilon$, and a value $X_1$ that satisfies $X_1 = Z + X_2$.

Since $X_2$ is independent of $Z$ and $\epsilon$, it is also independent of $Y$, thus $R_2^2 \approx 0$ for large $n$.


Decomposition of $R^2$ (enhancement example)

The multiple $R^2$ of $Y$ on $X_1$ and $X_2$ is approximately $\sigma_Z^2/(\sigma_Z^2 + \sigma^2)$ for large $n$, since the fitted values will converge to $\hat Y = X_1 - X_2 = Z$.

To calculate $R_1^2$, first note that for the regression of $Y$ on $X_1$,

$$\hat\beta = \frac{\widehat{\operatorname{cov}}(Y, X_1)}{\widehat{\operatorname{var}}(X_1)} \approx \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_{X_2}^2} \quad\text{and}\quad \hat\alpha \approx 0.$$


Decomposition of $R^2$ (enhancement example)

Therefore for large $n$,

$$\begin{aligned}
n^{-1}\|Y - \hat Y\|^2 &\approx n^{-1}\big\|Z + \epsilon - \sigma_Z^2 X_1/(\sigma_Z^2 + \sigma_{X_2}^2)\big\|^2\\
&= n^{-1}\big\|\sigma_{X_2}^2 Z/(\sigma_Z^2 + \sigma_{X_2}^2) + \epsilon - \sigma_Z^2 X_2/(\sigma_Z^2 + \sigma_{X_2}^2)\big\|^2\\
&\approx \sigma_{X_2}^4\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2)^2 + \sigma^2 + \sigma_Z^4\sigma_{X_2}^2/(\sigma_Z^2 + \sigma_{X_2}^2)^2\\
&= \sigma_{X_2}^2\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2) + \sigma^2.
\end{aligned}$$

Therefore

$$\begin{aligned}
R_1^2 &= 1 - \frac{n^{-1}\|Y - \hat Y\|^2}{n^{-1}\|Y - \bar Y\|^2}\\
&\approx 1 - \frac{\sigma_{X_2}^2\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2) + \sigma^2}{\sigma_Z^2 + \sigma^2}\\
&= \frac{\sigma_Z^2}{(\sigma_Z^2 + \sigma^2)(1 + \sigma_{X_2}^2/\sigma_Z^2)}.
\end{aligned}$$

Decomposition of $R^2$ (enhancement example)

Thus

$$R_1^2/R^2 \approx 1/(1 + \sigma_{X_2}^2/\sigma_Z^2),$$

which is strictly less than one if $\sigma_{X_2}^2 > 0$. Since $R_2^2 \approx 0$, it follows that $R^2 > R_1^2 + R_2^2$.

The reason for this is that while $X_2$ contains no directly useful information about $Y$ (hence $R_2^2 \approx 0$), it can remove the measurement error in $X_1$, making $X_1$ a better predictor of $Z$.
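A simulation sketch of the enhancement effect (assumed variances $\sigma_Z^2 = \sigma_{X_2}^2 = 1$, $\sigma^2 = 0.25$):

```python
import numpy as np

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.sum((y - X @ beta)**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(12)
n = 200_000
sigma_z, sigma_x2, sigma = 1.0, 1.0, 0.5
z = sigma_z * rng.normal(size=n)
x2 = sigma_x2 * rng.normal(size=n)      # independent of Y
y = z + sigma * rng.normal(size=n)
x1 = z + x2                             # noisy proxy for z

ones = np.ones(n)
r2_full = r2(np.column_stack([ones, x1, x2]), y)
r2_1 = r2(np.column_stack([ones, x1]), y)
r2_2 = r2(np.column_stack([ones, x2]), y)
# Expected limits: r2_full ~ 0.8, r2_1 ~ 0.4, r2_2 ~ 0, so r2_full > r2_1 + r2_2.
print(r2_full, r2_1, r2_2)
```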


Partial $R^2$ example I

Suppose the design matrix satisfies

$$X'X/n = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & r\\ 0 & r & 1 \end{pmatrix}$$

and the data generating model is $Y = X_1 + X_2 + \epsilon$ with $\operatorname{var}(\epsilon) = \sigma^2$.


Partial $R^2$ example I

We will calculate the partial $R^2$ for $X_1$, using the fact that the partial $R^2$ is the regular $R^2$ for regressing $(I - P_{-1})Y$ on $(I - P_{-1})X_1$, where $P_{-1}$ is the projection onto $\operatorname{span}(\{\mathbf{1}, X_2\})$. Since this is a simple linear regression, the partial $R^2$ can be expressed as $\operatorname{cor}((I - P_{-1})Y, (I - P_{-1})X_1)^2$.


Partial $R^2$ example I

The numerator of the partial $R^2$ is the square of

$$\begin{aligned}
\widehat{\operatorname{cov}}((I - P_{-1})Y, (I - P_{-1})X_1) &= Y'(I - P_{-1})X_1/n\\
&= (X_1 + X_2 + \epsilon)'(X_1 - rX_2)/n\\
&\approx 1 - r^2.
\end{aligned}$$

The denominator contains two factors. The first is

$$\begin{aligned}
\|(I - P_{-1})X_1\|^2/n &= X_1'(I - P_{-1})X_1/n\\
&= X_1'(X_1 - rX_2)/n\\
&\approx 1 - r^2.
\end{aligned}$$


Partial $R^2$ example I

The other factor in the denominator is $Y'(I - P_{-1})Y/n$:

$$\begin{aligned}
Y'(I - P_{-1})Y/n &= (X_1 + X_2)'(I - P_{-1})(X_1 + X_2)/n + \epsilon'(I - P_{-1})\epsilon/n + 2\epsilon'(I - P_{-1})(X_1 + X_2)/n\\
&\approx (X_1 + X_2)'(X_1 - rX_2)/n + \sigma^2\\
&\approx 1 - r^2 + \sigma^2.
\end{aligned}$$

Thus we get that the partial $R^2$ is approximately equal to

$$\frac{1 - r^2}{1 - r^2 + \sigma^2}.$$

If $r = \pm 1$ then the result is zero ($X_1$ has no unique explanatory power), and if $r = 0$, the result is $1/(1 + \sigma^2)$, indicating that after controlling for $X_2$, around a $1/(1 + \sigma^2)$ fraction of the remaining variance is explained by $X_1$ (the rest is due to $\epsilon$).
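A simulation check of this example (assumed values $r = 0.6$ and $\sigma^2 = 1$):

```python
import numpy as np

def fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def r2(X, y):
    return 1 - np.sum((y - fit(X, y))**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(13)
n, r, sigma = 200_000, 0.6, 1.0
x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)   # cor(x1, x2) ~ r
y = x1 + x2 + sigma * rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])
X_minus = np.column_stack([ones, x2])
partial = (r2(X_full, y) - r2(X_minus, y)) / (1 - r2(X_minus, y))
print(partial, (1 - r**2) / (1 - r**2 + sigma**2))     # close for large n
```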

Partial $R^2$ example II

Suppose

$$Y = bX_1 + \epsilon_1 \qquad\qquad X_2 = X_1 + \epsilon_2,$$

where $E(\epsilon_1|X) = E(\epsilon_2|X) = 0$, $\operatorname{var}(\epsilon_k|X) = \sigma_k^2$, $EX_1 = 0$, $\operatorname{var}(X_1) = 1$, and $X_1$ is independent of $\epsilon_1$ and $\epsilon_2$.

The interpretation of this example is that X1 is causal and X2 is a surrogate.


Partial $R^2$ example II

The four $R^2$'s for this model are related as follows, where $R_{\{\}}^2$ is the $R^2$ based only on the intercept.

[Diagram: a diamond of nested models linking $R_{\{\}}^2$ (intercept only) to $R_1^2$ and $R_2^2$ (one covariate each), and these to $R_{1,2}^2$ (both covariates), with the increment along each edge to be determined.]


Partial $R^2$ example II

We can calculate the limiting values for each $R^2$:

$$R_{\{\}}^2 = 0$$

$$R_1^2 = R_{1,2}^2 = \frac{b^2}{b^2 + \sigma_1^2}$$


Partial $R^2$ example II

For the regression on $X_2$, the limiting value of the slope is

$$\frac{\operatorname{cov}(Y, X_2)}{\operatorname{var}(X_2)} = \frac{b\operatorname{cov}(X_1, X_2) + \operatorname{cov}(\epsilon_1, X_2)}{1 + \sigma_2^2} = \frac{b}{1 + \sigma_2^2}.$$

Therefore the residual mean square is approximately

$$\begin{aligned}
n^{-1}\|Y - \hat Y\|^2 &\approx n^{-1}\big\|bX_1 + \epsilon_1 - b(X_1 + \epsilon_2)/(1 + \sigma_2^2)\big\|^2\\
&= n^{-1}\Big\|\frac{b\sigma_2^2}{1 + \sigma_2^2}X_1 + \epsilon_1 - \frac{b}{1 + \sigma_2^2}\epsilon_2\Big\|^2\\
&\approx \frac{b^2\sigma_2^2}{1 + \sigma_2^2} + \sigma_1^2.
\end{aligned}$$


Partial $R^2$ example II

So,

$$\begin{aligned}
R_2^2 &= 1 - \frac{b^2\sigma_2^2/(1 + \sigma_2^2) + \sigma_1^2}{b^2 + \sigma_1^2}\\
&= \frac{b^2 - b^2\sigma_2^2/(1 + \sigma_2^2)}{b^2 + \sigma_1^2}\\
&= \frac{b^2}{(1 + \sigma_2^2)(b^2 + \sigma_1^2)}\\
&= \frac{1}{(1 + \sigma_1^2/b^2)(1 + \sigma_2^2)}.
\end{aligned}$$

If $\sigma_2^2 = 0$ then $X_1 = X_2$, and we recover the usual $R^2$ for simple linear regression of $Y$ on $X_1$.


Partial $R^2$ example II

With some algebra, we get an expression for the partial $R^2$ for adding $X_1$ to a model already containing $X_2$:

$$\frac{R_{1,2}^2 - R_2^2}{1 - R_2^2} = \frac{b^2\sigma_2^2}{b^2\sigma_2^2 + \sigma_1^2 + \sigma_1^2\sigma_2^2}.$$

If $\sigma_2^2 = 0$, the partial $R^2$ is 0. If $b \ne 0$, $\sigma_2^2 > 0$, and $\sigma_1^2 = 0$, the partial $R^2$ is 1.
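A simulation check of this example (assumed values $b = 1.5$, $\sigma_1^2 = 1$, $\sigma_2^2 = 0.49$):

```python
import numpy as np

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.sum((y - X @ beta)**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(14)
n, b, s1, s2 = 500_000, 1.5, 1.0, 0.7   # s1 = sigma_1, s2 = sigma_2
x1 = rng.normal(size=n)
y = b * x1 + s1 * rng.normal(size=n)
x2 = x1 + s2 * rng.normal(size=n)       # surrogate for x1

ones = np.ones(n)
r2_2 = r2(np.column_stack([ones, x2]), y)
r2_12 = r2(np.column_stack([ones, x1, x2]), y)

print(r2_2, 1 / ((1 + s1**2 / b**2) * (1 + s2**2)))    # close for large n
partial = (r2_12 - r2_2) / (1 - r2_2)
print(partial, b**2 * s2**2 / (b**2 * s2**2 + s1**2 + s1**2 * s2**2))
```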


Summary

Each of the three $R^2$ values can be expressed either in terms of variance ratios (VR), or as a squared correlation coefficient:

|             | Multiple $R^2$ | Semi-partial $R^2$ | Partial $R^2$ |
|-------------|----------------|--------------------|---------------|
| VR          | $\lVert \hat Y - \bar Y\rVert^2 / \lVert Y - \bar Y\rVert^2$ | $R^2 - R_{-k}^2$ | $(R^2 - R_{-k}^2)/(1 - R_{-k}^2)$ |
| Correlation | $\operatorname{cor}(Y, \hat Y)^2$ | $\operatorname{cor}(Y, X_k^\perp)^2$ | $\operatorname{cor}((I - P_{-k})Y, X_k^\perp)^2$ |

