
Statistics and Quantitative Analysis U4320
Lecture 13: Explaining Variation
Prof. Sharyn O'Halloran

I. Explaining Variation: R2

A. Breaking Down the Distances

Let's go back to the basics of regression analysis.
[Figure: scatter plot of Income (X axis, 10 to 70) against Money Spent on Health Care (Y axis), with the fitted regression line $\hat{Y} = a + bX$ drawn through the points.]

How well does the predicted line explain the variation in the dependent variable, money spent?

I. Explaining Variation: R2

Total Variation

[Figure: the same scatter plot of Income against $ spent on health care, decomposing the deviation of a sample point (x, y) from the mean: $Y - \hat{Y}$ is the deviation unexplained by the regression, $\hat{Y} - \bar{Y}$ is the deviation explained by the regression, and $Y - \bar{Y}$ is the total deviation around $\bar{Y}$. The regression line is $\hat{Y} = a + bX$.]

I. Explaining Variation: R2

Total Deviation

$$(Y - \bar{Y}) = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$$

Total Deviation = Explained Deviation + Unexplained Deviation.

The total distance from any point $Y$ to $\bar{Y}$ is the sum of the distance from the regression line to $\bar{Y}$ (explained) plus the distance from $Y$ to the regression line (unexplained).

I. Explaining Variation: R2

B. Sums of Squares

Starting from $(Y - \bar{Y}) = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$, we can sum this equation across all the Y's and square both sides to get:

$$\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + 2\sum (\hat{Y} - \bar{Y})(Y - \hat{Y}) + \sum (Y - \hat{Y})^2$$

For the least squares line, the cross-product term sums to zero, so this reduces to:

$$\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2.$$
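To see why the cross-product term drops out, recall that the least squares residuals $e_i = Y_i - \hat{Y}_i$ satisfy the normal equations. A short sketch of the argument:

```latex
% Sketch: the least squares normal equations give
% \sum_i e_i = 0 and \sum_i e_i X_i = 0, where e_i = Y_i - \hat{Y}_i.
% Since \hat{Y}_i = a + bX_i, the cross term is
\begin{align*}
\sum_i (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)
  &= \sum_i (a + bX_i - \bar{Y})\,e_i \\
  &= (a - \bar{Y})\sum_i e_i + b\sum_i X_i e_i \\
  &= 0 + 0 = 0.
\end{align*}
```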

I. Explaining Variation: R2

1. Total Sum of Squares (SST).

The term on the left-hand side of this equation is the sum of the squared distances from all points $Y$ to $\bar{Y}$. We call this the total variation in the Y's, or the Total Sum of Squares (SST).

2. Regression Sum of Squares

The first term on the right-hand side is the sum of the squared distances from the regression line $\hat{Y}$ to $\bar{Y}$. We call it the Regression Sum of Squares, or SSR.

I. Explaining Variation: R2

3. Error Sum of Squares

Finally, the last term is the sum of the squared distances from the points to the regression line. Remember, this is the quantity that least squares minimizes. We call it the Error Sum of Squares, or SSE.
We can rewrite the previous equation as:
SST = SSR + SSE.
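To make the decomposition concrete, here is a minimal Python sketch; the income and health care figures are made up for illustration, not the lecture's data:

```python
import numpy as np

# Hypothetical data: income (X) and money spent on health care (Y)
x = np.array([10.0, 20, 30, 40, 50, 60, 70])
y = np.array([4.1, 4.4, 5.0, 5.2, 5.8, 6.1, 6.3])

b, a = np.polyfit(x, y, 1)          # least squares slope and intercept
y_hat = a + b * x                   # predicted values on the line
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)      # error sum of squares

print(np.isclose(sst, ssr + sse))   # True: SST = SSR + SSE
```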

I. Explaining Variation: R2

C. Definition of R2

We can use these new terms to determine how much variation is explained by the regression line.
If the points are perfectly linear, then the Error Sum of Squares is 0:

I. Explaining Variation: R2
[Figure: scatter plot of Income against Money Spent on Health Care in which every point lies exactly on the regression line.]

Here, SSR = SST. The variance in the Y's is completely explained by the regression line.

I. Explaining Variation: R2

On the other hand, if there is no relation between X and Y:
[Figure: scatter plot of Income against Money Spent on Health Care showing no relation between X and Y.]

Now SSR is 0 and SSE = SST. The regression line explains none of the variance in Y.

I. Explaining Variation: R2

3. Formula

So we can construct a useful statistic. Take the ratio of the Regression Sum of Squares to the Total Sum of Squares:

$$R^2 = \frac{SSR}{SST}$$

We call this statistic R2. It represents the percent of the variation in Y explained by the regression.

I. Explaining Variation: R2

R2 is always between 0 and 1.
For a perfectly straight line it's 1, which is perfect correlation.
For data with little relation, it's near 0.
R2 measures the explanatory power of your model: the more of the variance in Y you can explain, the more powerful your model.

I. Explaining Variation: R2

D. Example

I wanted to investigate why people have confidence in what they see on TV.

1. Dependent variable
TRUSTTV = 1 if the individual has a lot of confidence
= 2 if some confidence
= 3 if no confidence.

2. Independent variables

TUBETIME = number of hours of TV watched a week.
SKOOL = years of education.
LIKEJPAN = feelings towards Japan.
YELOWSTN = attitudes on whether the US should spend more on national parks.
MYSIGN = the respondent's astrological sign.

I. Explaining Variation: R2

3. Calculating R2

a) Correlation matrix

            TRUSTTV  TUBETIME    SKOOL  LIKEJPAN  YELOWSTN   MYSIGN
TRUSTTV       1.000     -.177     .112      .043      .003    -.038
TUBETIME      -.177     1.000    -.272      .080     -.137     .053
SKOOL          .112     -.272    1.000     -.072     -.016     .012
LIKEJPAN       .043      .080    -.072     1.000      .040    -.001
YELOWSTN       .003     -.137    -.016      .040     1.000    -.020
MYSIGN        -.038      .053     .012     -.001     -.020    1.000

I. Explaining Variation: R2

b) The first model is:

TRUSTTV = 2.34 - 0.0539 TUBETIME

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1          5.90547       5.90547
Residual     468        183.61793        .39235

I. Explaining Variation: R2

How do we calculate the Total Sum of Squares?
SST = SSR + SSE
SST = 5.91 + 183.62 = 189.52
Now we can calculate R2:

$$R^2 = \frac{SSR}{SST} = \frac{5.91}{189.52} = .031$$
I. Explaining Variation: R2

c) Each of the 4 different models has an associated R2.
Eq #2: R2 = 0.035
Eq #3: R2 = 0.039
Eq #4: R2 = 0.040

I. Explaining Variation: R2

F. Using R2 in Practice

1. Useful Tool
2. Measure of Unexplained Variance
3. Not a Statistical Test
4. Don't Obsess about R2
5. You can always improve R2 by adding
variables

I. Explaining Variation: R2

G. Example

You'll notice that R2 increases every time. No matter what variables you add, you can always increase your R2.

II. Adjusted R2

II. Adjusted R2

A. Definition of Adjusted R2

So we'd like a measure like R2, but one that takes into account the fact that adding extra variables always increases your explanatory power.
The statistic we use for this is called the Adjusted R2, and its formula is:

$$\bar{R}^2 = 1 - \frac{n-1}{n-k}\,(1 - R^2);$$

n = number of observations,
k = number of independent variables (counting the constant).

So the Adjusted R2 can actually fall if the variable you add doesn't explain much of the variance.
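A small Python helper makes the penalty for extra variables concrete (a sketch; the function name and numbers are mine, with k counting the constant as in the lecture):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 with n observations and k independent
    variables, counting the constant."""
    return 1 - (n - 1) / (n - k) * (1 - r2)

# A new variable that barely moves R^2 lowers the adjusted R^2:
print(adjusted_r2(0.0350, 470, 3))  # ~0.0309
print(adjusted_r2(0.0352, 470, 4))  # ~0.0290, despite the higher R^2
```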

II. Adjusted R2

B. Back to the Example

1. Adjusted R2

You can see that the adjusted R2 rises from equation 1 to equation 2, and from equation 2 to equation 3.
But then it falls from equation 3 to 4, when we add in the variables for national parks and the zodiac.

II. Adjusted R2

2. Calculating Adjusted R2

Example: Equation 2
Multiple R           .18856
R Square             .03555
Adjusted R Square    .03142
Standard Error       .62562

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     2          6.73848       3.36924
Residual     467        182.78492        .39140

F = 8.60813    Signif F = .0002

II. Adjusted R2
Variable            B       SE B       Beta        T    Sig T
SKOOL         .015524    .010641    .068897    1.459    .1453
TUBETIME     -.048228    .014436   -.157772   -3.341    .0009
(Constant)   2.127991    .158260              13.446    .0000

We calculate:

$$\bar{R}^2 = 1 - \frac{n-1}{n-k}(1 - R^2) = 1 - \frac{470-1}{470-3}(1 - .03555) = .0314$$
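Checking this with the numbers from the printout:

```python
n, k, r2 = 470, 3, 0.03555  # equation 2: two regressors plus the constant
adj_r2 = 1 - (n - 1) / (n - k) * (1 - r2)
print(round(adj_r2, 4))     # 0.0314, matching the Adjusted R Square line
```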

II. Adjusted R2

C. Stepwise Regression

One strategy for model building is to add variables only if they increase your adjusted R2.
This technique is called stepwise regression.
However, I don't want to emphasize this approach too strongly. Just as people can fixate on R2, they can fixate on adjusted R2.
****IMPORTANT****
If you have a theory that suggests that certain
variables are important for your analysis then
include them whether or not they increase the
adjusted R2.
Negative findings can be important!

III. F Tests

III. F Tests

A. When to use an F-Test?

Say you add a number of variables into a regression model and you want to see if, as a group, they are significant in explaining variation in your dependent variable Y.
The F-test tells you whether a group of variables, or even an entire model, is jointly significant.

This is in contrast to a t-test, which tells you whether an individual coefficient is significantly different from zero.

III. F Tests

B. Equations

To be precise, say our original equation is:
EQ 1: Y = b0 + b1X1 + b2X2,
and we add two more variables, so the new equation is:
EQ 2: Y = b0 + b1X1 + b2X2 + b3X3 + b4X4.
We want to test the hypothesis that b3 = b4 = 0.

That is, we want to test the joint hypothesis that X3 and X4 together are not significant factors in determining Y.

III. F Tests

C. Using Adjusted R2 First

There's an easy way to tell if these two variables are not significant.
First, run the regression without X3 and X4 in it, then run the regression with X3 and X4.
Now look at the adjusted R2's for the two regressions. If the adjusted R2 went down, then X3 and X4 are not jointly significant.
So the adjusted R2 can serve as a quick test for insignificance.

III. F Tests

D. Calculating an F-Test
If the adjusted R2 goes up, then you need to do a more complicated test, the F-test.

1. Ratio

Let regression 1 be the model without X3 and X4, and let regression 2 include X3 and X4.
The basic idea of the F statistic, then, is to compute the ratio:

$$\frac{SSE_1 - SSE_2}{SSE_2}$$

III. F Tests

2. Correction

We have to correct for the number of independent variables we add.
So the complete statistic is:

$$F = \frac{(SSE_1 - SSE_2)/m}{SSE_2/(n-k)};$$

m = number of restrictions;
k = number of independent variables.
Remember that k is the total number of
independent variables, including the ones that
you are testing and the constant.
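As a sketch in Python (the function and variable names are mine):

```python
def f_stat(sse1: float, sse2: float, m: int, n: int, k: int) -> float:
    """F statistic for m restrictions: sse1 from the restricted model,
    sse2 from the full model with k independent variables (counting
    the constant) and n observations."""
    return ((sse1 - sse2) / m) / (sse2 / (n - k))
```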

III. F Tests

2. Correction (cont.)

This equation defines an F statistic with m and n-k degrees of freedom.
We write it like this:

$$F^{m}_{n-k}$$

To get critical values for the F statistic, we use a set of tables, just like for the normal and t-statistics.
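If you have software handy rather than a table, the critical value can be computed directly; for instance with scipy, using the 2 and 464 degrees of freedom from the example below:

```python
from scipy.stats import f

# 5% critical value for F with dfn = m = 2 and dfd = n - k = 464
print(f.ppf(0.95, dfn=2, dfd=464))  # ~3.02, close to the table value of 3.00
```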

III. F Tests

E. Example

1. Adding Extra Variables: Are a group of variables jointly significant?

Are the variables YELOWSTN and MYSIGN jointly significant?

EQ 1: TRUSTTV = β0 + β1LIKEJPAN + β2SKOOL + β3TUBETIME.
EQ 2: TRUSTTV = β0 + β1LIKEJPAN + β2SKOOL + β3TUBETIME + β4MYSIGN + β5YELOWSTN.

III. F Tests

1. Adding Extra Variables (cont.)

a) State the null hypothesis
H0: β4 = β5 = 0.

b) Calculate the F-statistic
Our formula for the F statistic is:

$$F = \frac{(SSE_1 - SSE_2)/m}{SSE_2/(n-k)},$$

III. F Tests

What is SSE1, the sum of squared errors in the first regression? SSE1 = 182.07.
What is SSE2, the sum of squared errors in the second regression? SSE2 = 181.82.
m = 2, n = 470, k = 6.
The formula gives:

$$F = \frac{(182.07 - 181.82)/2}{181.82/(470 - 6)} = 0.319$$
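The same computation in Python:

```python
sse1, sse2 = 182.07, 181.82  # sums of squared errors, EQ 1 and EQ 2
m, n, k = 2, 470, 6
F = ((sse1 - sse2) / m) / (sse2 / (n - k))
print(round(F, 3))           # 0.319
```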

III. F Tests

c) Reject or fail to reject the null hypothesis?
The critical value at the 5% level, $F^{2}_{470-6}$ from the table, is 3.00.
Is the F-statistic > $F^{2}_{470-6}$?
If yes, then we reject the null hypothesis that the variables are not significantly different from zero; otherwise we fail to reject.
Since 0.319 < 3.00, we fail to reject the null hypothesis: YELOWSTN and MYSIGN are not jointly significant.


III. F Tests

2. Testing All Variables: Is the Model Significant?

Equation 2
Multiple R           .18856
R Square             .03555
Adjusted R Square    .03142
Standard Error       .62562

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     2          6.73848       3.36924
Residual     467        182.78492        .39140

F = 8.60813    Signif F = .0002

III. F Tests
Variable            B       SE B       Beta        T    Sig T
SKOOL         .015524    .010641    .068897    1.459    .1453
TUBETIME     -.048228    .014436   -.157772   -3.341    .0009
(Constant)   2.127991    .158260              13.446    .0000

a) Hypothesis:
H0: β1 = β2 = 0.
Again, we start with our formula:

$$F = \frac{(SSE_1 - SSE_2)/m}{SSE_2/(n-k)},$$

III. F Tests

b) Calculate F-statistic
SSE2 = 182.78
SSE1 is the sum of squared errors when there are no explanatory variables at all.
If there are no explanatory variables, then SSR must be 0. In this case, SSE = SST.
So we can substitute SST for SSE1 in our formula.
SST = SSR + SSE = 6.738 + 182.78 = 189.52

$$F = \frac{(189.52 - 182.78)/2}{182.78/(470 - 3)} = 8.61$$

This is the number reported in your printout under the F statistic.
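Reproducing the printout's F in Python:

```python
sst, sse = 189.52, 182.78  # SSE1 = SST when there are no regressors
m, n, k = 2, 470, 3
F = ((sst - sse) / m) / (sse / (n - k))
print(round(F, 2))         # 8.61, the F reported in the printout
```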

III. F Tests

c) Reject or fail to reject the null hypothesis?
The critical value at the 5% level, $F^{2}_{470-3}$ from your table, is 3.00.
Since 8.61 > 3.00, this time we can reject the null hypothesis that β1 = β2 = 0.
