
Statistics and Quantitative Analysis U4320
Lecture 13: Explaining Variation
Prof. Sharyn O'Halloran

I. Explaining Variation: R2

A. Breaking Down the Distances

Let's go back to the basics of regression analysis.
[Figure: scatter plot of Income (X axis, 10 to 70) against Money Spent on Health Care (Y axis), with the fitted regression line $\hat{Y} = a + bX$ drawn through the points.]

How well does the predicted line explain the variation in the dependent variable, money spent?

I. Explaining Variation: R2

Total Variation

[Figure: the same scatter plot of Income against $ spent on health care, decomposing the deviation of a sample point (x, y) from the mean: $Y - \hat{Y}$ is the deviation unexplained by the regression, $\hat{Y} - \bar{Y}$ is the deviation explained by the regression, and $Y - \bar{Y}$ is the total deviation around $\bar{Y}$. The regression line is $\hat{Y} = a + bX$.]

I. Explaining Variation: R2

Total Deviation

$$(Y - \bar{Y}) = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$$

Total Deviation = Explained Deviation + Unexplained Deviation.

The total distance from any point $Y$ to $\bar{Y}$ is the sum of the distance from the regression line to $\bar{Y}$ (explained) plus the distance from $Y$ to the regression line (unexplained).

I. Explaining Variation: R2

B. Sums of Squares

Starting from $(Y - \bar{Y}) = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$, we can sum this equation across all the Y's and square both sides to get:

$$\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + 2\sum (\hat{Y} - \bar{Y})(Y - \hat{Y}) + \sum (Y - \hat{Y})^2$$

For the least squares line, the cross-product term sums to zero, so this reduces to:

$$\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2.$$
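To see why the cross-product term drops out, recall that the least squares residuals $e_i = Y_i - \hat{Y}_i$ satisfy the normal equations. A short sketch of the argument:

```latex
% Sketch: the least squares normal equations give
% \sum_i e_i = 0 and \sum_i e_i X_i = 0, where e_i = Y_i - \hat{Y}_i.
% Since \hat{Y}_i = a + bX_i, the cross term is
\begin{align*}
\sum_i (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)
  &= \sum_i (a + bX_i - \bar{Y})\,e_i \\
  &= (a - \bar{Y})\sum_i e_i + b\sum_i X_i e_i \\
  &= 0 + 0 = 0.
\end{align*}
```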

I. Explaining Variation: R2

1. Total Sum of Squares (SST).

The term on the left-hand side of this equation is the sum of the squared distances from all points $Y$ to $\bar{Y}$. We call this the total variation in the Y's, or the Total Sum of Squares (SST).

2. Regression Sum of Squares

The first term on the right-hand side is the sum of the squared distances from the regression line $\hat{Y}$ to $\bar{Y}$. We call it the Regression Sum of Squares, or SSR.

I. Explaining Variation: R2

3. Error Sum of Squares

Finally, the last term is the sum of the squared distances from the points to the regression line. Remember, this is the quantity that least squares minimizes. We call it the Error Sum of Squares, or SSE.
We can rewrite the previous equation as:
SST = SSR + SSE.
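To make the decomposition concrete, here is a minimal Python sketch; the income and health care figures are made up for illustration, not the lecture's data:

```python
import numpy as np

# Hypothetical data: income (X) and money spent on health care (Y)
x = np.array([10.0, 20, 30, 40, 50, 60, 70])
y = np.array([4.1, 4.4, 5.0, 5.2, 5.8, 6.1, 6.3])

b, a = np.polyfit(x, y, 1)          # least squares slope and intercept
y_hat = a + b * x                   # predicted values on the line
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)      # error sum of squares

print(np.isclose(sst, ssr + sse))   # True: SST = SSR + SSE
```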

I. Explaining Variation: R2

C. Definition of R2

We can use these new terms to determine how much variation is explained by the regression line.
If the points are perfectly linear, then the Error Sum of Squares is 0:

I. Explaining Variation: R2
[Figure: scatter plot of Income against Money Spent on Health Care in which every point lies exactly on the regression line.]

Here, SSR = SST. The variance in the Y's is completely explained by the regression line.

I. Explaining Variation: R2

On the other hand, if there is no relation between X and Y:
[Figure: scatter plot of Income against Money Spent on Health Care showing no relation between X and Y.]

Now SSR is 0 and SSE = SST. The regression line explains none of the variance in Y.

I. Explaining Variation: R2

3. Formula

So we can construct a useful statistic. Take the ratio of the Regression Sum of Squares to the Total Sum of Squares:

$$R^2 = \frac{SSR}{SST}$$

We call this statistic R2. It represents the percent of the variation in Y explained by the regression.

I. Explaining Variation: R2

R2 is always between 0 and 1.
For a perfectly straight line it's 1, which is perfect correlation.
For data with little relation, it's near 0.
R2 measures the explanatory power of your model: the more of the variance in Y you can explain, the more powerful your model.

I. Explaining Variation: R2

D. Example

I wanted to investigate why people have confidence in what they see on TV.

1. Dependent variable
TRUSTTV = 1 if the individual has a lot of confidence
= 2 if some confidence
= 3 if no confidence.

2. Independent variables

TUBETIME = number of hours of TV watched a week.
SKOOL = years of education.
LIKEJPAN = feelings towards Japan.
YELOWSTN = attitudes on whether the US should spend more on national parks.
MYSIGN = the respondent's astrological sign.

I. Explaining Variation: R2

3. Calculating R2

a) Correlation matrix

            TRUSTTV  TUBETIME    SKOOL  LIKEJPAN  YELOWSTN   MYSIGN
TRUSTTV       1.000     -.177     .112      .043      .003    -.038
TUBETIME      -.177     1.000    -.272      .080     -.137     .053
SKOOL          .112     -.272    1.000     -.072     -.016     .012
LIKEJPAN       .043      .080    -.072     1.000      .040    -.001
YELOWSTN       .003     -.137    -.016      .040     1.000    -.020
MYSIGN        -.038      .053     .012     -.001     -.020    1.000

I. Explaining Variation: R2

b) The first model is:

TRUSTTV = 2.34 - 0.0539 TUBETIME

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1          5.90547       5.90547
Residual     468        183.61793        .39235

I. Explaining Variation: R2

How do we calculate the Total Sum of Squares?
SST = SSR + SSE
SST = 5.91 + 183.62 = 189.52
Now we can calculate R2:

$$R^2 = \frac{SSR}{SST} = \frac{5.91}{189.52} = .031$$
I. Explaining Variation: R2

c) Each of the 4 different models has an associated R2.
Eq #2: R2 = 0.035
Eq #3: R2 = 0.039
Eq #4: R2 = 0.040

I. Explaining Variation: R2

F. Using R2 in Practice

1. Useful Tool
2. Measure of Unexplained Variance
3. Not a Statistical Test
4. Don't Obsess about R2
5. You can always improve R2 by adding
variables

I. Explaining Variation: R2

G. Example

You'll notice that R2 increases every time. No matter what variables you add, you can always increase your R2.

II. Adjusted R2

II. Adjusted R2

A. Definition of Adjusted R2

So we'd like a measure like R2, but one that takes into account the fact that adding extra variables always increases your explanatory power.
The statistic we use for this is called the Adjusted R2, and its formula is:

$$\bar{R}^2 = 1 - \frac{n-1}{n-k}\,(1 - R^2);$$

n = number of observations,
k = number of independent variables (counting the constant).

So the Adjusted R2 can actually fall if the variable you add doesn't explain much of the variance.
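A small Python helper makes the penalty for extra variables concrete (a sketch; the function name and numbers are mine, with k counting the constant as in the lecture):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 with n observations and k independent
    variables, counting the constant."""
    return 1 - (n - 1) / (n - k) * (1 - r2)

# A new variable that barely moves R^2 lowers the adjusted R^2:
print(adjusted_r2(0.0350, 470, 3))  # ~0.0309
print(adjusted_r2(0.0352, 470, 4))  # ~0.0290, despite the higher R^2
```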

II. Adjusted R2

B. Back to the Example

1. Adjusted R2

You can see that the adjusted R2 rises from equation 1 to equation 2, and from equation 2 to equation 3.
But then it falls from equation 3 to 4, when we add in the variables for national parks and the zodiac.

II. Adjusted R2

2. Calculating Adjusted R2

Example: Equation 2
Multiple R           .18856
R Square             .03555
Adjusted R Square    .03142
Standard Error       .62562

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     2          6.73848       3.36924
Residual     467        182.78492        .39140

F = 8.60813    Signif F = .0002

II. Adjusted R2
Variable            B       SE B       Beta        T    Sig T
SKOOL         .015524    .010641    .068897    1.459    .1453
TUBETIME     -.048228    .014436   -.157772   -3.341    .0009
(Constant)   2.127991    .158260              13.446    .0000

We calculate:

$$\bar{R}^2 = 1 - \frac{n-1}{n-k}(1 - R^2) = 1 - \frac{470-1}{470-3}(1 - .03555) = .0314$$
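Checking this with the numbers from the printout:

```python
n, k, r2 = 470, 3, 0.03555  # equation 2: two regressors plus the constant
adj_r2 = 1 - (n - 1) / (n - k) * (1 - r2)
print(round(adj_r2, 4))     # 0.0314, matching the Adjusted R Square line
```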

II. Adjusted R2

C. Stepwise Regression

One strategy for model building is to add variables only if they increase your adjusted R2.
This technique is called stepwise regression.
However, I don't want to emphasize this approach too strongly. Just as people can fixate on R2, they can fixate on adjusted R2.
****IMPORTANT****
If you have a theory that suggests that certain
variables are important for your analysis then
include them whether or not they increase the
adjusted R2.
Negative findings can be important!

III. F Tests

III. F Tests

A. When to use an F-Test?

Say you add a number of variables into a regression model and you want to see if, as a group, they are significant in explaining variation in your dependent variable Y.
The F-test tells you whether a group of variables, or even an entire model, is jointly significant.

This is in contrast to a t-test, which tells you whether an individual coefficient is significantly different from zero.

III. F Tests

B. Equations

To be precise, say our original equation is:
EQ 1: Y = b0 + b1X1 + b2X2,
and we add two more variables, so the new equation is:
EQ 2: Y = b0 + b1X1 + b2X2 + b3X3 + b4X4.
We want to test the hypothesis that b3 = b4 = 0.

That is, we want to test the joint hypothesis that X3 and X4 together are not significant factors in determining Y.

III. F Tests

C. Using Adjusted R2 First

There's an easy way to tell if these two variables are not significant.
First, run the regression without X3 and X4 in it, then run the regression with X3 and X4.
Now look at the adjusted R2's for the two regressions. If the adjusted R2 went down, then X3 and X4 are not jointly significant.
So the adjusted R2 can serve as a quick test for insignificance.

III. F Tests

D. Calculating an F-Test
If the adjusted R2 goes up, then you need to do a more complicated test, the F-test.

1. Ratio

Let regression 1 be the model without X3 and X4, and let regression 2 include X3 and X4.
The basic idea of the F statistic, then, is to compute the ratio:

$$\frac{SSE_1 - SSE_2}{SSE_2}$$

III. F Tests

2. Correction

We have to correct for the number of independent variables we add.
So the complete statistic is:

$$F = \frac{(SSE_1 - SSE_2)/m}{SSE_2/(n-k)};$$

m = number of restrictions;
k = number of independent variables.
Remember that k is the total number of
independent variables, including the ones that
you are testing and the constant.
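As a sketch in Python (the function and variable names are mine):

```python
def f_stat(sse1: float, sse2: float, m: int, n: int, k: int) -> float:
    """F statistic for m restrictions: sse1 from the restricted model,
    sse2 from the full model with k independent variables (counting
    the constant) and n observations."""
    return ((sse1 - sse2) / m) / (sse2 / (n - k))
```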

III. F Tests

2. Correction (cont.)

This equation defines an F statistic with m and n-k degrees of freedom.
We write it like this:

$$F^{m}_{n-k}$$

To get critical values for the F statistic, we use a set of tables, just like for the normal and t-statistics.
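If you have software handy rather than a table, the critical value can be computed directly; for instance with scipy, using the 2 and 464 degrees of freedom from the example below:

```python
from scipy.stats import f

# 5% critical value for F with dfn = m = 2 and dfd = n - k = 464
print(f.ppf(0.95, dfn=2, dfd=464))  # ~3.02, close to the table value of 3.00
```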

III. F Tests

E. Example

1. Adding Extra Variables: Are a group of variables jointly significant?

Are the variables YELOWSTN and MYSIGN jointly significant?

EQ 1: TRUSTTV = β0 + β1LIKEJPAN + β2SKOOL + β3TUBETIME.
EQ 2: TRUSTTV = β0 + β1LIKEJPAN + β2SKOOL + β3TUBETIME + β4MYSIGN + β5YELOWSTN.

III. F Tests

1. Adding Extra Variables (cont.)

a) State the null hypothesis
H0: β4 = β5 = 0.

b) Calculate the F-statistic
Our formula for the F statistic is:

$$F = \frac{(SSE_1 - SSE_2)/m}{SSE_2/(n-k)},$$

III. F Tests

What is SSE1, the sum of squared errors in the first regression? SSE1 = 182.07.
What is SSE2, the sum of squared errors in the second regression? SSE2 = 181.82.
m = 2, n = 470, k = 6.
The formula gives:

$$F = \frac{(182.07 - 181.82)/2}{181.82/(470 - 6)} = 0.319$$
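The same computation in Python:

```python
sse1, sse2 = 182.07, 181.82  # sums of squared errors, EQ 1 and EQ 2
m, n, k = 2, 470, 6
F = ((sse1 - sse2) / m) / (sse2 / (n - k))
print(round(F, 3))           # 0.319
```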

III. F Tests

c) Reject or fail to reject the null hypothesis?
The critical value at the 5% level, $F^{2}_{470-6}$ from the table, is 3.00.
Is the F-statistic > $F^{2}_{470-6}$?
If yes, then we reject the null hypothesis that the variables are not significantly different from zero; otherwise we fail to reject.
Since 0.319 < 3.00, we fail to reject the null hypothesis: YELOWSTN and MYSIGN are not jointly significant.


III. F Tests

2. Testing All Variables: Is the Model Significant?

Equation 2
Multiple R           .18856
R Square             .03555
Adjusted R Square    .03142
Standard Error       .62562

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     2          6.73848       3.36924
Residual     467        182.78492        .39140

F = 8.60813    Signif F = .0002

III. F Tests
Variable            B       SE B       Beta        T    Sig T
SKOOL         .015524    .010641    .068897    1.459    .1453
TUBETIME     -.048228    .014436   -.157772   -3.341    .0009
(Constant)   2.127991    .158260              13.446    .0000

a) Hypothesis:
H0: β1 = β2 = 0.
Again, we start with our formula:

$$F = \frac{(SSE_1 - SSE_2)/m}{SSE_2/(n-k)},$$

III. F Tests

b) Calculate F-statistic
SSE2 = 182.78
SSE1 is the sum of squared errors when there are no explanatory variables at all.
If there are no explanatory variables, then SSR must be 0. In this case, SSE = SST.
So we can substitute SST for SSE1 in our formula.
SST = SSR + SSE = 6.738 + 182.78 = 189.52

$$F = \frac{(189.52 - 182.78)/2}{182.78/(470 - 3)} = 8.61$$

This is the number reported in your printout under the F statistic.
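Reproducing the printout's F in Python:

```python
sst, sse = 189.52, 182.78  # SSE1 = SST when there are no regressors
m, n, k = 2, 470, 3
F = ((sst - sse) / m) / (sse / (n - k))
print(round(F, 2))         # 8.61, the F reported in the printout
```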

III. F Tests

c) Reject or fail to reject the null hypothesis?
The critical value at the 5% level, $F^{2}_{470-3}$ from your table, is 3.00.
Since 8.61 > 3.00, this time we can reject the null hypothesis that β1 = β2 = 0.
