Multiple Regression
Introduction:
Last week, we used bivariate regression to assess the effect of one independent variable (X) on a
dependent variable (Y). However, in the real world Y is typically influenced by many different
independent variables, which are likely to be correlated. Multiple regression allows us to assess the
effects of many different independent variables (X1, X2, X3 etc.) on the same dependent variable (Y)
at the same time.
Important note: Remember from last week that the dependent variable (Y) of a regression must be
interval level. The independent variables (Xs) can be interval or dummy.
Why do we need to control for other variables?
Why do we need to include all independent variables into the same multiple regression model? Why
don't we just run a series of bivariate regressions instead?
The reason is that including additional independent variables into our regression allows us to control
for the effects of those variables. This is important, since our independent variables are typically
correlated with each other. If we are interested in the effect of X1 on Y, then we also need to include
X2 and X3 if they also influence the dependent variable and are correlated with X1. Leaving them
out would cause X1 to also capture part of the effects of X2 and X3, so the coefficient
estimate of X1 would be biased. The bias that results from omitting a relevant variable from our
model is called omitted variable bias.
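The direction and size of this bias can be made explicit with a standard textbook result, sketched here for the simplest case of two independent variables. The symbols $\alpha$, $\beta_1$, $\beta_2$, $\gamma$, $\delta$, $\varepsilon$ and $u$ are notation assumed for this illustration.

```latex
\begin{align*}
\text{True model:}\quad & Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \\
\text{Regression of } X_2 \text{ on } X_1\text{:}\quad & X_2 = \gamma + \delta X_1 + u \\
\text{Bivariate regression of } Y \text{ on } X_1 \text{ only:}\quad & E[\hat{b}_1] = \beta_1 + \beta_2\,\delta
\end{align*}
```

The bias term $\beta_2\,\delta$ is zero only if X2 has no effect on Y ($\beta_2 = 0$) or X2 is unrelated to X1 ($\delta = 0$), which are exactly the two conditions named in the paragraph above.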
Let's see what happens if we don't include all relevant independent variables in our regression model.
In the following, we run three regressions with the share of women in parliament as the dependent
variable. The first uses GDP per capita to explain the share of women in parliament (you may
recognise this model from last week). The second uses the democracy dummy variable to explain
the share of women in parliament. The third is a multiple regression, which uses GDP per
capita, the democracy dummy variable, and the share of students enrolled in education to explain the
share of women in parliament.
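. regress women2000 gdppc2000

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  1,   139) =   42.63
       Model |  2627.15635     1  2627.15635           Prob > F      =  0.0000
    Residual |  8565.94689   139  61.6255172           R-squared     =  0.2347
-------------+------------------------------           Adj R-squared =  0.2292
       Total |  11193.1032   140  79.9507374           Root MSE      =  7.8502

------------------------------------------------------------------------------
   women2000 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   gdppc2000 |   .0004655   .0000713     6.53   0.000     .0003246    .0006065
       _cons |   8.393485   .9063686     9.26   0.000     6.601433    10.18554
------------------------------------------------------------------------------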
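
        Source |       SS       df       MS              Number of obs =     189
---------------+------------------------------           F(  1,   187) =    8.89
         Model |  693.492558     1  693.492558           Prob > F      =  0.0033
      Residual |  14594.2217   187  78.0439662           R-squared     =  0.0454
---------------+------------------------------           Adj R-squared =  0.0403
         Total |  15287.7142   188  81.3176289           Root MSE      =  8.8342

--------------------------------------------------------------------------------
     women2000 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
aclp_democ2000 |   3.915333   1.313462     2.98   0.003     1.324226    6.506441
         _cons |      8.768   1.020091     8.60   0.000     6.755634    10.78037
--------------------------------------------------------------------------------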
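
        Source |       SS       df       MS              Number of obs =     141
---------------+------------------------------           F(  3,   137) =   17.45
         Model |  3095.09093     3  1031.69698           Prob > F      =  0.0000
      Residual |  8098.01231   137  59.1095789           R-squared     =  0.2765
---------------+------------------------------           Adj R-squared =  0.2607
         Total |  11193.1032   140  79.9507374           Root MSE      =  7.6883

--------------------------------------------------------------------------------
     women2000 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
     gdppc2000 |   .0002838   .0000953     2.98   0.003                 .0004722
aclp_democ2000 |   .5907561   1.483168     0.40   0.691                 3.523618
      educ2001 |   .1158674   .0433338     2.67   0.008                  .201557
         _cons |   1.612113   2.568476     0.63   0.531                 6.691097
--------------------------------------------------------------------------------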
Question: Compare the effects of GDP per capita and the democracy dummy variable in the bivariate
regression models and in the multiple regression model. What has happened?
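We see that by just looking at the bivariate regressions, we would get biased coefficient estimates for
GDP per capita and the democracy variable, which would strongly overestimate the effects of these
two variables: once the other variables are controlled for, the GDP per capita coefficient falls from
.0004655 to .0002838, and the democracy coefficient falls from 3.92 to 0.59 and is no longer
significant (p = 0.691). Thus, it is crucial that we control for other relevant independent variables that
influence Y and are correlated with GDP per capita or the level of democracy.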
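The comparison between the three models can also be produced in a single table. The following is a minimal sketch, assuming the data set Democracy small.dta is in the working directory; the names m_gdp, m_democ and m_full are hypothetical labels chosen for this illustration.

```stata
* Assumes the handout's data set is in the current working directory
use "Democracy small.dta", clear

* Bivariate model 1: GDP per capita only
regress women2000 gdppc2000
estimates store m_gdp

* Bivariate model 2: democracy dummy only
regress women2000 aclp_democ2000
estimates store m_democ

* Multiple regression: GDP per capita, democracy and education together
regress women2000 gdppc2000 aclp_democ2000 educ2001
estimates store m_full

* Side-by-side table of coefficients, standard errors, N and R-squared
estimates table m_gdp m_democ m_full, b(%9.4f) se(%9.4f) stats(N r2)

* Relative drop in each coefficient, using the values printed in the output above
display (.0004655 - .0002838) / .0004655
display (3.915333 - .5907561) / 3.915333
```

The two display lines come out at roughly 0.39 and 0.85: the GDP per capita coefficient shrinks by about 39% and the democracy coefficient by about 85% once the other variables are included. Note that the second model is estimated on 189 rather than 141 countries, so part of the difference may also reflect the different sample.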
Note that by including additional variables in our regression model the interpretation of the
coefficients of the independent variables also slightly changes. They now indicate the change in the
dependent variable for a one unit increase in the independent variable, while holding all other
independent variables constant. Thus the coefficients now report the effects of the independent
variables while controlling for all the other variables we have included in our model. The p-values tell
us whether the respective variables have a significant effect on the dependent variable, while
controlling for all other independent variables.
Also note that the value of R-squared now reports the share of the variance in the dependent variable
that is explained by all the independent variables together.
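The R-squared and adjusted R-squared of the three-variable model can be reproduced by hand from the sums of squares in its output (Model SS 3095.09093, Total SS 11193.1032, 141 observations, 3 independent variables). This is only an arithmetic check using Stata's display command as a calculator; it does not need the data set.

```stata
* R-squared = Model SS / Total SS
display 3095.09093 / 11193.1032

* Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - k - 1), with n = 141 and k = 3
display 1 - (1 - 3095.09093/11193.1032) * (141 - 1) / (141 - 3 - 1)
```

The results match the reported values of 0.2765 and 0.2607 after rounding. The adjusted version is lower because it penalises each additional independent variable.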
The multiple regression model:
Although the last example made use of just three independent variables, we can easily include any
number of additional independent variables in our multiple regression model. As in the bivariate
regression case, we can express the dependent variable of our multiple regression as a linear function
of the independent variables:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k + \varepsilon$$
Using this equation we can again use the coefficient estimates combined with a set of hypothetical
values of our independent variables to make predictions about the expected value of Y for a given
scenario of X values.
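As a sketch of how such a prediction works in practice, the lines below plug a scenario into the coefficient estimates of the three-variable model shown earlier. The scenario (a democracy with a GDP per capita of $5000 and 50% of students enrolled) is a hypothetical example chosen for illustration, and the second part assumes the data set is already loaded.

```stata
* By hand: constant plus each coefficient times its hypothetical X value
display 1.612113 + .0002838*5000 + .5907561*1 + .1158674*50

* The same prediction from the stored estimates, with standard error and confidence interval
regress women2000 gdppc2000 aclp_democ2000 educ2001
lincom _cons + 5000*gdppc2000 + 1*aclp_democ2000 + 50*educ2001
```

Both give an expected share of women in parliament of roughly 9.4%. The lincom version has the advantage of also reporting how uncertain that prediction is.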
Question: Write down the regression equation for the multiple regression model that used GDP per
capita, democracy, and the share of students enrolled in education to explain the share of women in
parliament.
- By how much will the % of women in parliament change if GDPPC increases by $10000?
- By how much will the % of women in parliament change if a country is democratic instead of
  autocratic?
- What % of women in parliament would we expect in a country that is autocratic and has a
  GDPPC of $0 and a share of students enrolled in education of 0%?
- What % of women in parliament would we expect in a country that is democratic with a GDPPC
  of $10000 and a share of students enrolled in education of 70%?
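
            |      Freq.     Percent        Cum.
------------+-----------------------------------
       free |         86       45.03       45.03
partly free |         52       27.23       72.25
   not free |         53       27.75      100.00
------------+-----------------------------------
      Total |        191      100.00
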
Notice that Stata has created three new dummy variables: fhdum1, fhdum2 and fhdum3. fhdum1
takes on a value of 1 if fhcat2000 is equal to "free" and a value of 0 otherwise; fhdum2 takes on a
value of 1 if fhcat2000 is equal to "partly free" and a value of 0 otherwise; and fhdum3 takes on a
value of 1 if fhcat2000 is equal to "not free" and a value of 0 otherwise.
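One common way to obtain dummies like these is the generate() option of tabulate. The following is a sketch under the assumption that this is how the variables were created; the stub fhdum is chosen to match the variable names in the text.

```stata
* Run once only: tabulate will not overwrite variables that already exist.
* Creates one dummy per category of fhcat2000: fhdum1, fhdum2, fhdum3
tabulate fhcat2000, generate(fhdum)

* Cross-check one of the dummies against the original variable
tabulate fhcat2000 fhdum1
```

The dummies are numbered in the order in which tabulate lists the categories, so it is worth checking which number belongs to which category before using them.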
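. list country fhcat2000 fhdum1 fhdum2 fhdum3

     +-------------------------------------------------------+
     |     country    fhc~2000    fhdum1    fhdum2    fhdum3 |
     |-------------------------------------------------------|
  1. | Afghanistan    not free         0         0         1 |
  2. |     Albania    partly f         0         1         0 |
  3. |     Algeria    not free         0         0         1 |
  4. |     Andorra        free         1         0         0 |
  5. |      Angola    not free         0         0         1 |
     +-------------------------------------------------------+
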
We can now add these dummy variables to our multiple regression in order to see whether the level of
freedom has an effect on the share of women in parliament when controlling for GDP per capita and the
share of students enrolled in education. (We exclude the aclp_democ2000 variable from the model,
since it measures roughly the same thing as the fhcat2000 variable.)
To assess the effect of fhcat2000 on the share of women in parliament, we can either include just one
of the newly created dummy variables or we can include all of them except for one. Note that we
cannot include all dummy variables, since one of them has to serve as the reference category against
which we interpret the coefficients of the other dummy variables.
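The reason one dummy has to be left out is that the full set always sums to 1 and is therefore perfectly collinear with the constant. A minimal sketch of this, assuming the dummies fhdum1 to fhdum3 exist in the loaded data set:

```stata
* Every country falls into exactly one category, so the dummies sum to 1
assert fhdum1 + fhdum2 + fhdum3 == 1 if !missing(fhdum1)

* Leaving out fhdum2 makes "partly free" the reference category
regress women2000 gdppc2000 educ2001 fhdum1 fhdum3

* With all three included, Stata drops one of them and reports it as omitted
regress women2000 gdppc2000 educ2001 fhdum1 fhdum2 fhdum3
```

Which dummy is left out does not change the fit of the model, only the category against which the remaining coefficients are compared.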
In the following regression, we only include the fhdum1 dummy variable.
Question: How do you interpret the coefficient of the fhdum1 variable?
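
      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   19.30
       Model |  3325.33302     3  1108.44434           Prob > F      =  0.0000
    Residual |  7867.77022   137  57.4289797           R-squared     =  0.2971
-------------+------------------------------           Adj R-squared =  0.2817
       Total |  11193.1032   140  79.9507374           Root MSE      =  7.5782

------------------------------------------------------------------------------
   women2000 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   gdppc2000 |   .0002246   .0000981     2.29   0.024     .0000306    .0004186
    educ2001 |   .0933386   .0438816     2.13   0.035     .0065657    .1801115
      fhdum1 |   3.325281   1.627918     2.04   0.043     .1061845    6.544378
       _cons |       2.49   2.534783     0.98   0.328     -2.52236     7.50236
------------------------------------------------------------------------------

In the next regression, we include both the fhdum1 and the fhdum3 variables.
Question: How do you interpret the coefficients of the two dummy variables?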
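
      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  4,   136) =   14.66
       Model |  3372.69196     4  843.172989           Prob > F      =  0.0000
    Residual |  7820.41129   136  57.5030242           R-squared     =  0.3013
-------------+------------------------------           Adj R-squared =  0.2808
       Total |  11193.1032   140  79.9507374           Root MSE      =  7.5831

------------------------------------------------------------------------------
   women2000 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   gdppc2000 |   .0002249   .0000982     2.29   0.023                 .0004191
    educ2001 |   .0941024    .043918     2.14   0.034                 .1809528
      fhdum1 |   3.983734   1.783245     2.23   0.027                 7.510209
      fhdum3 |   1.621946   1.787232     0.91   0.366                 5.156306
       _cons |   1.765097    2.65922     0.66   0.508                 7.023866
------------------------------------------------------------------------------
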
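With fhdum1 and fhdum3 both in the model, each coefficient is a comparison with the omitted category ("partly free"). Comparisons that do not involve the reference category can be tested after estimation. The following is a sketch assuming the data set with the three dummies is loaded:

```stata
regress women2000 gdppc2000 educ2001 fhdum1 fhdum3

* Difference between "free" and "not free" countries, holding the other variables constant
lincom fhdum1 - fhdum3

* Joint test that both dummy coefficients are zero,
* i.e. that the freedom category as a whole has no effect
test fhdum1 fhdum3
```

The joint test is useful because a categorical variable can matter as a whole even when some of its individual dummies are not significant against the chosen reference category.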
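
      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   19.30
       Model |  3325.33302     3  1108.44434           Prob > F      =  0.0000
    Residual |  7867.77022   137  57.4289797           R-squared     =  0.2971
-------------+------------------------------           Adj R-squared =  0.2817
       Total |  11193.1032   140  79.9507374           Root MSE      =  7.5782

------------------------------------------------------------------------------
   women2000 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
   gdppc2000 |   .0002246   .0000981     2.29   0.024                 .2337447
    educ2001 |   .0933386   .0438816     2.13   0.035                 .2143155
      fhdum1 |   3.325281   1.627918     2.04   0.043                 .1863789
       _cons |       2.49   2.534783     0.98   0.328                        .
------------------------------------------------------------------------------
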
Question: Interpret the standardised coefficients of the three variables. Which of them has the
strongest effect on the share of women in parliament?
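A standardised coefficient rescales the ordinary coefficient by the ratio of the standard deviations, Beta = b × sd(X) / sd(Y), so that it reports the change in Y in standard deviations for a one standard deviation increase in X. The sketch below assumes the Beta column above was produced with the beta option of regress and recomputes one value by hand; sd_y and sd_x are hypothetical scalar names.

```stata
* Report standardised coefficients instead of confidence intervals
regress women2000 gdppc2000 educ2001 fhdum1, beta

* Manual check for gdppc2000, restricted to the observations used in the regression
quietly summarize women2000 if e(sample)
scalar sd_y = r(sd)
quietly summarize gdppc2000 if e(sample)
scalar sd_x = r(sd)
display _b[gdppc2000] * sd_x / sd_y
```

Because all standardised coefficients are expressed in the same units, they can be compared directly across variables, which the unstandardised coefficients cannot.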
Stata exercise:
As in the last few weeks we will be using the data set Democracy small.dta.
1. Run a bivariate regression with the share of students enrolled in education (educ2001) as the
dependent variable and government spending (cengov2000) as the independent variable. Note
that this is the same regression you ran last week.
2. Interpret the coefficient of the cengov2000 variable. Is it significant?
3. Now run a multiple regression with the share of students enrolled in education (educ2001) as the
dependent variable and government spending (cengov2000), GDP per capita (gdppc2000) and
democracy (aclp_democ2000) as the independent variables.
4. Again interpret the coefficient of the cengov2000 variable. Is it still significant? What has
happened and why?
5. By how much does the share of students enrolled in education change if GDP per capita increases
by $1000? By how much does the share of students enrolled in education change if a country is
democratic instead of autocratic?
6. Compare the R-squared of the two regression models. Which model explains a larger share of the
variance of the share of students enrolled in education?
7. Let's say we are interested in whether the share of students enrolled in education is significantly
different in Africa, after controlling for government spending, GDP per capita and democracy.
Create a set of region dummy variables using the region variable. Include the dummy variable for
Africa in your model. By how much is the share of students enrolled in education lower or higher
in Africa compared to other regions?
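The commands needed for the exercise follow the same pattern as the examples in this handout. The outline below is a sketch, assuming the data set is in the working directory; regdum is a hypothetical stub name, and which of the generated dummies stands for Africa depends on how the region variable is coded.

```stata
use "Democracy small.dta", clear

* Steps 1-2: bivariate regression
regress educ2001 cengov2000

* Steps 3-6: multiple regression; compare coefficients and R-squared with the model above
regress educ2001 cengov2000 gdppc2000 aclp_democ2000

* Step 7: one dummy per region; the variable labels show which region each dummy stands for
tabulate region, generate(regdum)
describe regdum*

* Replace regdum1 with whichever dummy describe labels as Africa
regress educ2001 cengov2000 gdppc2000 aclp_democ2000 regdum1
```

With only the Africa dummy included, all other regions together form the reference category, so its coefficient is the difference between Africa and the rest of the world after controlling for the other variables.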